# Creating a logistic regression to predict absenteeism
The logistic regression will indicate which variables matter most for predicting absenteeism. 

#### Import Libraries

In [1]:
import pandas as pd
import numpy as np

#### Load the data

In [2]:
data_prep = pd.read_csv('1.1 Absenteeism_preprocessed.csv')

### Logistic regression is a type of classification. 
So we will be classifying people. We need to decide on the classes before doing any more pre-processing work.

We will classify them into two types: Moderately Absent and Excessively Absent. 

 

### Step 1 - Find the Median
Take the median of the absenteeism time and store it in a variable. Everything at or below the median will be considered moderate absence; everything above it will be considered excessive.

In [3]:
hours_median = data_prep['Absenteeism Time in Hours'].median()

#### Classes 

Median is 3 hours 

Moderately absent (<= 3 hours) 

Excessively absent (>= 4 hours)

### Step 2 - Assign Values
Moderately Absent = 0
Excessively Absent = 1

In supervised machine learning, these zeros and ones are called targets. Our task will be to predict whether we obtain a 0 or a 1.

np.where(condition, value if True, value if False)

In [4]:
targets = np.where(data_prep['Absenteeism Time in Hours'] > hours_median, 1, 0)

In [5]:
data_prep['Excessive Absenteeism'] = targets

In [6]:
targets.sum() / targets.shape[0]

0.45571428571428574
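
As a minimal sketch of the two steps above, the same median-and-`np.where` logic can be tried on a tiny made-up column. The six hour values and the names `toy`, `toy_median`, `toy_targets` and `share_of_ones` are hypothetical and are not taken from the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical absence hours for six employees (not from the real dataset)
toy = pd.DataFrame({'Absenteeism Time in Hours': [0, 2, 3, 4, 8, 40]})

# Middle value of the column
toy_median = toy['Absenteeism Time in Hours'].median()

# 1 where the hours are strictly above the median, 0 otherwise
toy_targets = np.where(toy['Absenteeism Time in Hours'] > toy_median, 1, 0)

# The mean of a 0/1 array is the share of 1s, same as sum() / shape[0]
share_of_ones = toy_targets.mean()

print(toy_median, toy_targets, share_of_ones)
```

Because the comparison is strict (`>`), observations exactly at the median fall into class 0.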

### Around 46% of the targets are 1s. Around 54% of the targets are 0s. 

### For a logistic regression a 50/50 split is ideal, but a 60-40 or 45-55 split usually works about as well.

## CHECKPOINT! 

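We no longer need Absenteeism Time in Hours, since the targets were derived from it. The same cell also drops Day of the Week, Daily Work Load Average and Distance to Work. 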

In [7]:
data_with_targets = data_prep.drop(['Absenteeism Time in Hours', 'Day of the Week', 'Daily Work Load Average', 'Distance to Work'], axis=1)

### Using the Reserved Word IS 

Output is True if the two variables refer to the same object

Output is False if the two variables refer to different objects

In [8]:
data_with_targets is data_prep

False
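
A small sketch of why the check returns False: `drop` returns a new DataFrame by default, so the original stays untouched. The frame `df` and the names `same_object` and `dropped` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

same_object = df                   # a second name for the same DataFrame
dropped = df.drop(['b'], axis=1)   # drop returns a new DataFrame by default

print(same_object is df)   # True: both names point to one object
print(dropped is df)       # False: dropped is a separate object
print(list(df.columns))    # ['a', 'b']: the original still has both columns
```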

In [9]:
data_with_targets.head()

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month Value,Transportation Expense,Age,Body Mass Index,Education,Children,Pet,Excessive Absenteeism
0,0,0,0,1,7,289,33,30,0,2,1,1
1,0,0,0,0,7,118,50,31,0,1,0,0
2,0,0,0,1,7,179,38,31,0,0,0,0
3,1,0,0,0,7,279,39,24,0,2,0,1
4,0,0,0,1,7,289,33,30,0,2,1,0


### Select the INPUTS for the Regression

In [10]:
data_with_targets.shape

(700, 12)

### METHOD - DataFrame.iloc

This method is commonly used. 

DataFrame.iloc[row indices, column indices] - selects data by position, given the rows and columns wanted 

Excludes the ending index

Using a colon : selects ALL rows. With our 12 columns, the lines below all give the same result: every column except the last one (the targets). 

data_with_targets.iloc[:,0:11]

data_with_targets.iloc[:,:11]

data_with_targets.iloc[:,:-1]
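
The equivalence of the three slices can be checked on a small stand-in table. This is only a sketch: `toy` is a made-up three-column frame, so the end index is 2 instead of 11.

```python
import pandas as pd

# Two input columns followed by a target column
toy = pd.DataFrame({'x1': [1, 2], 'x2': [3, 4], 'target': [0, 1]})

a = toy.iloc[:, 0:2]   # explicit start and end (end is excluded)
b = toy.iloc[:, :2]    # start omitted, defaults to 0
c = toy.iloc[:, :-1]   # everything except the last column

print(a.equals(b) and b.equals(c))
print(list(c.columns))
```

The `:-1` form is the most robust of the three, since it keeps working if the number of input columns changes.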

In [11]:
unscaled_inputs = data_with_targets.iloc[:,:-1]

### Standardize the DATA

absenteeism_scaler will be used to subtract the mean and divide by the standard deviation variablewise (featurewise)

In [12]:
#from sklearn.preprocessing import StandardScaler

#absenteeism_scaler = StandardScaler()
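
To make "subtract the mean and divide by the standard deviation" concrete, here is a minimal sketch that compares a plain `StandardScaler` with the same computation done by hand. The `ages` values are invented for the example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# One hypothetical feature with four observations
ages = np.array([[20.0], [30.0], [40.0], [50.0]])

scaler = StandardScaler()
scaled = scaler.fit_transform(ages)

# Same computation by hand; StandardScaler uses the population
# standard deviation (ddof=0), which is also numpy's default
by_hand = (ages - ages.mean(axis=0)) / ages.std(axis=0)

print(np.allclose(scaled, by_hand))
```

After scaling, the column has mean 0 and standard deviation 1.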

In [13]:
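from sklearn.preprocessing import StandardScaler
from sklearn.base import BaseEstimator, TransformerMixin

# Custom scaler: standardizes only the listed columns and leaves the rest (e.g. the dummies) untouched
class CustomScaler(BaseEstimator, TransformerMixin):

    def __init__(self, columns, copy=True, with_mean=True, with_std=True):
        # Store the parameters under their own names so get_params() and repr() work
        self.columns = columns
        self.copy = copy
        self.with_mean = with_mean
        self.with_std = with_std
        # Recent scikit-learn versions accept these arguments by keyword only
        self.scaler = StandardScaler(copy=copy, with_mean=with_mean, with_std=with_std)
        self.mean_ = None
        self.var_ = None

    def fit(self, X, y=None):
        self.scaler.fit(X[self.columns], y)
        # Reuse the per-column statistics computed by the underlying StandardScaler
        self.mean_ = self.scaler.mean_
        self.var_ = self.scaler.var_
        return self

    def transform(self, X, y=None, copy=None):
        init_col_order = X.columns
        # Keep the original index so the concat below lines the rows up correctly
        X_scaled = pd.DataFrame(self.scaler.transform(X[self.columns]),
                                columns=self.columns, index=X.index)
        X_not_scaled = X.loc[:, ~X.columns.isin(self.columns)]
        # Put the columns back in their original order
        return pd.concat([X_not_scaled, X_scaled], axis=1)[init_col_order]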


In [14]:
unscaled_inputs.columns.values

array(['Reason_1', 'Reason_2', 'Reason_3', 'Reason_4', 'Month Value',
       'Transportation Expense', 'Age', 'Body Mass Index', 'Education',
       'Children', 'Pet'], dtype=object)

In [15]:
#columns_to_scale = ['Month Value',
       #'Day of the Week', 'Transportation Expense', 'Distance to Work',
       #'Age', 'Daily Work Load Average', 'Body Mass Index', 'Children', 'Pet']
        
columns_to_omit = ['Reason_1', 'Reason_2', 'Reason_3', 'Reason_4', 'Education']

In [16]:
columns_to_scale = [x for x in unscaled_inputs.columns.values if x not in columns_to_omit]

In [17]:
absenteeism_scaler = CustomScaler(columns_to_scale)

#### 'Scaling Mechanism' - Calculate and store the mean and the standard deviation. 
Once fitted, absenteeism_scaler holds the mean and standard deviation of each scaled column. Whenever new data arrives, it can be standardized with those same stored values.

To do that - which is a very common approach - use the following steps: 

1 - new_data_raw = pd.read_csv('new_data.csv') 

2 - new_data_scaled = absenteeism_scaler.transform(new_data_raw)
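
The idea behind these two steps, sketched with a plain `StandardScaler` on made-up numbers: the statistics come from the data the scaler was fitted on, and new observations are transformed with those stored values rather than with their own mean and standard deviation. The arrays `train` and `new` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[10.0], [20.0], [30.0]])   # data the scaler is fitted on
new = np.array([[20.0], [40.0]])             # data that arrives later

s = StandardScaler().fit(train)              # stores mean_ and var_
new_scaled = s.transform(new)                # reuses the stored statistics

# Same result by hand, using the fitted mean and variance
expected = (new - s.mean_) / np.sqrt(s.var_)

print(s.mean_, s.var_)
print(np.allclose(new_scaled, expected))
```

Here the new value 20 maps to 0 because it equals the mean of the fitted data, not because of anything in `new` itself.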

In [18]:
absenteeism_scaler.fit(unscaled_inputs)



CustomScaler(columns=['Month Value', 'Transportation Expense', 'Age', 'Body Mass Index', 'Children', 'Pet'],
       copy=None, with_mean=None, with_std=None)

#### Method - .transform( ) does the actual scaling. We subtract the mean and divide by the standard deviation.

In [19]:
scaled_inputs = absenteeism_scaler.transform(unscaled_inputs)



In [20]:
scaled_inputs.shape

(700, 11)

### Split the Data in to TRAIN & TEST & SHUFFLE

#### Import the relevant module

Import the train_test_split function from sklearn.model_selection 

train_test_split(inputs, targets)

This splits arrays or matrices into random train and test subsets. 

In [21]:
from sklearn.model_selection import train_test_split

#### Split

This will result in 4 arrays, which will be contained in 4 new variables

Array 1 - a training dataset with inputs 

Array 2 - a training dataset with targets

Array 3 - a test dataset with inputs

Array 4 - a test dataset with targets

In [22]:
x_train, x_test, y_train, y_test = train_test_split(scaled_inputs, targets)

#### Let's see what they give us: 

#### With the default split, the training inputs contain 525 observations along 11 features. The training targets are a vector of length 525. 

525 : 175 

75% of the observations will be used for training, and 25% will serve for testing. 


In [23]:
print (x_train.shape, y_train.shape)

(525, 11) (525,)


In [24]:
print (x_test.shape, y_test.shape)

(175, 11) (175,)


### Tweaks

Usually we opt for 90-10 or 80-20. We don't want too much data used for testing. 

The default with train_test_split is 75/25. So let's change it to 80/20.

Also note that train_test_split shuffles the observations by default (shuffle=True). 

Setting random_state to a fixed number makes the shuffle reproducible: the observations are shuffled in the same "random" way on every run. 

In [25]:
x_train, x_test, y_train, y_test = train_test_split(scaled_inputs, targets, train_size = 0.8, random_state = 20)



In [26]:
print (x_train.shape, y_train.shape)

(560, 11) (560,)


In [27]:
print (x_test.shape, y_test.shape)

(140, 11) (140,)
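
A short sketch of what `train_size` and `random_state` do, on a made-up array of ten observations. The arrays `X` and `y` and the numbered variable names are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten observations with two features; row i is [2*i, 2*i + 1] and its target is i
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Two calls with the same random_state
X_tr1, X_te1, y_tr1, y_te1 = train_test_split(X, y, train_size=0.8, random_state=20)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X, y, train_size=0.8, random_state=20)

print(X_tr1.shape, X_te1.shape)           # 80% / 20% of the rows
print(np.array_equal(y_tr1, y_tr2))       # same shuffle on both calls
print(np.array_equal(X_tr1[:, 0], 2 * y_tr1))  # inputs and targets stay aligned
```

Without a fixed `random_state`, each run would produce a different split, so accuracy figures would change slightly from run to run.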


### Logistic Regression with SKLEARN

In [28]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

#### Training the Model

In [29]:
reg = LogisticRegression()

#### sklearn.linear_model.LogisticRegression.fit(x, y) 

This fits the model according to the given training data. 



####  This is a MACHINE LEARNING model.

In [30]:
reg.fit(x_train, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

#### Let's evaluate the model's accuracy

sklearn.linear_model.LogisticRegression.score(inputs, targets) 

This returns the mean accuracy of the model on the given inputs and targets.

Our result is 0.775, or about 78%. 

On the training data, our model learned to classify about 78% of the observations correctly. 

In [31]:
reg.score(x_train, y_train)

0.775

### Find the results manually. To confirm its accuracy. 

A logistic regression is trained on the TRAIN inputs.

Accuracy means that x% of the model outputs match the targets. 

To do it manually we should find the outputs and compare to the targets.

#### reg.predict(inputs) 
This predicts class labels (logistic regression outputs) for given input samples.

In [32]:
model_outputs = reg.predict(x_train)

In [33]:
model_outputs == y_train

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True, False,  True, False, False,  True,  True,  True,  True,
       False,  True, False,  True, False, False,  True,  True,  True,
       False,  True,  True,  True,  True,  True,  True,  True,  True,
       False, False, False, False,  True,  True,  True,  True, False,
        True,  True,  True,  True,  True, False,  True,  True,  True,
        True,  True,  True,  True,  True, False,  True,  True,  True,
        True,  True,  True,  True,  True,  True, False,  True,  True,
        True,  True,  True, False,  True,  True,  True,  True,  True,
       False,  True, False,  True,  True, False, False, False,  True,
        True,  True,  True,  True,  True,  True,  True, False,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True, False,  True,  True,  True,  True,
       False,  True,  True,  True,  True, False,  True,  True,  True,
        True,  True,

#### How many TRUE entries are there?

434 is the TOTAL number of correct predictions (true entries).

In [34]:
np.sum((model_outputs == y_train))

434

#### ACCURACY = Correct Predictions / # Observations

In [35]:
#model_outputs.shape[0]

In [36]:
np.sum((model_outputs == y_train)) / 560

0.775
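
The same manual check, sketched on synthetic data so that it runs on its own. The data, the seed and the names `model`, `preds` and `manual_accuracy` are assumptions for illustration; dividing by `preds.shape[0]` avoids hard-coding the number of observations.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data for illustration only (not the absenteeism dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

model = LogisticRegression().fit(X, y)
preds = model.predict(X)

# Accuracy = correct predictions / number of observations
manual_accuracy = np.sum(preds == y) / preds.shape[0]

print(manual_accuracy, model.score(X, y))
```

Both numbers agree, since `score` computes the same ratio internally for classifiers.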

### Find the Intercept and Coefficients

In [37]:
reg.intercept_

array([-1.43138127])

In [38]:
reg.coef_

array([[ 2.60237227,  0.84350002,  2.94078723,  0.63723433,  0.00565051,
         0.61953401, -0.17635497,  0.28410321, -0.26372527,  0.35195032,
        -0.27369766]])

### Which variables do those coefficients refer to? Get them from the input columns. 

In [39]:
unscaled_inputs.columns.values

array(['Reason_1', 'Reason_2', 'Reason_3', 'Reason_4', 'Month Value',
       'Transportation Expense', 'Age', 'Body Mass Index', 'Education',
       'Children', 'Pet'], dtype=object)

In [40]:
feature_name = unscaled_inputs.columns.values

### Create a DataFrame to contain the new variables

In [41]:
summary_table = pd.DataFrame (columns = ['Feature name'], data = feature_name)

In [42]:
summary_table['Coefficient'] = np.transpose(reg.coef_)

In [43]:
summary_table

Unnamed: 0,Feature name,Coefficient
0,Reason_1,2.602372
1,Reason_2,0.8435
2,Reason_3,2.940787
3,Reason_4,0.637234
4,Month Value,0.005651
5,Transportation Expense,0.619534
6,Age,-0.176355
7,Body Mass Index,0.284103
8,Education,-0.263725
9,Children,0.35195


### Add the Intercept. 

Appending the intercept with concat would place it at the end of the table, and we want it at the top. 

Instead, shift all indices up by 1 through the .index attribute. Now index 0 is free for the intercept. 

In [44]:
summary_table.index = summary_table.index + 1

In [45]:
summary_table.loc[0] = ['Intercept', reg.intercept_[0]]

In [46]:
summary_table = summary_table.sort_index()

In [47]:
summary_table

Unnamed: 0,Feature name,Coefficient
0,Intercept,-1.431381
1,Reason_1,2.602372
2,Reason_2,0.8435
3,Reason_3,2.940787
4,Reason_4,0.637234
5,Month Value,0.005651
6,Transportation Expense,0.619534
7,Age,-0.176355
8,Body Mass Index,0.284103
9,Education,-0.263725


### Time to interpret the Coefficients (also called weights) and the Intercept (bias) 

For Weights, the closer they are to 0, the smaller the weight. And the further away from 0, the larger the weight.

This is true for our model because we have built a model where the variables are of the same scale. That is, they've been standardized.

Standardized Coefficients are basically the coefficients of a regression where all variables have been standardized. 

We use this because it's easier to interpret. Whichever weight is bigger, its corresponding feature is more important. 

For machine learning purposes we usually standardize the variables.

For logistic regressions, the coefficients are expressed in log(odds). The model turns the log(odds) into a probability, which is then mapped to a 0 or a 1.
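
How the log(odds) become zeros and ones can be sketched by rebuilding the predictions from the intercept and coefficients. The data below is synthetic, and the 0.5 cut-off is the usual default threshold for a binary model, stated here as an assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data for illustration only (not the absenteeism dataset)
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 2))
y = (X[:, 0] - X[:, 1] + rng.normal(size=300) > 0).astype(int)

model = LogisticRegression().fit(X, y)

# log(odds) = intercept + sum of (coefficient * feature value)
log_odds = model.intercept_[0] + X @ model.coef_[0]

# The sigmoid turns log(odds) into the probability of class 1
probability = 1 / (1 + np.exp(-log_odds))

# A probability above 0.5 becomes a 1, otherwise a 0
labels = (probability > 0.5).astype(int)

print(np.allclose(probability, model.predict_proba(X)[:, 1]))
print(np.array_equal(labels, model.predict(X)))
```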

### Find the Exponentials of the Coefficients

In [48]:
summary_table['Odds_ratio'] = np.exp(summary_table.Coefficient)

In [49]:
summary_table

Unnamed: 0,Feature name,Coefficient,Odds_ratio
0,Intercept,-1.431381,0.238979
1,Reason_1,2.602372,13.495716
2,Reason_2,0.8435,2.324489
3,Reason_3,2.940787,18.930743
4,Reason_4,0.637234,1.891243
5,Month Value,0.005651,1.005667
6,Transportation Expense,0.619534,1.858062
7,Age,-0.176355,0.83832
8,Body Mass Index,0.284103,1.32857
9,Education,-0.263725,0.768185


#### From Most Important to Least Important

In [50]:
summary_table.sort_values('Odds_ratio', ascending=False)

Unnamed: 0,Feature name,Coefficient,Odds_ratio
3,Reason_3,2.940787,18.930743
1,Reason_1,2.602372,13.495716
2,Reason_2,0.8435,2.324489
4,Reason_4,0.637234,1.891243
6,Transportation Expense,0.619534,1.858062
10,Children,0.35195,1.421838
8,Body Mass Index,0.284103,1.32857
5,Month Value,0.005651,1.005667
7,Age,-0.176355,0.83832
9,Education,-0.263725,0.768185


### Interpreting the Coefficient and Odds Ratio

A feature is not important if:

- its coefficient is around 0
- its odds ratio is around 1 

A weight (coefficient) of 0 implies that no matter the feature value, we will multiply it by 0 (in the model)

For a unit change in the standardized feature, the odds increase by a multiple equal to the odds ratio (1 = no change)

For example, if the odds are 5:1 and the odds ratio is 2, then the NEW ODDS are 10:1. You multiply the ODDS x ODDS Ratio. 
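
The odds arithmetic can be checked in a few lines: multiplying odds of 5:1 by an odds ratio of 2 gives 10:1. The helper `odds_to_probability` is a hypothetical name introduced only for this sketch.

```python
def odds_to_probability(odds):
    """Convert odds (e.g. 5 for 5:1) into a probability."""
    return odds / (1 + odds)

odds = 5 / 1          # 5:1
odds_ratio = 2

# New odds = old odds x odds ratio
new_odds = odds * odds_ratio

p_before = odds_to_probability(odds)
p_after = odds_to_probability(new_odds)

print(new_odds, p_before, p_after)
```

Doubling the odds does not double the probability: it moves from 5/6 to 10/11.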

If we look at the Odds_ratio column, we see that Month Value has a coefficient of about 0 and an odds ratio of almost 1. This means that this feature has almost no impact in our model, and that if we removed it very little would change. 

For Pet - for each standardized unit of Pet, the odds of excessive absenteeism are multiplied by about 0.76, so they are 1 - 0.76 = 24% lower. 
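
As a quick check of these readings, the odds ratios can be recomputed from two of the coefficients printed by `reg.coef_` above. The variable names are made up for this sketch.

```python
import numpy as np

# Values copied from the reg.coef_ output above
pet_coefficient = -0.27369766
month_coefficient = 0.00565051

pet_odds_ratio = np.exp(pet_coefficient)
month_odds_ratio = np.exp(month_coefficient)

# Percentage change in the odds for one standardized unit of the feature
pet_change_pct = (pet_odds_ratio - 1) * 100
month_change_pct = (month_odds_ratio - 1) * 100

print(round(pet_odds_ratio, 4), round(pet_change_pct, 1))
print(round(month_odds_ratio, 4), round(month_change_pct, 1))
```

An odds ratio below 1 lowers the odds, and one close to 1 leaves them almost unchanged.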

The INTERCEPT (BIAS) - no specific meaning attached. It calibrates the model. 

## 3 Different Needs

Machine Learning - prefer models with high accuracy, so Standardization becomes common

Econometricians - accept less accuracy and aim for an understanding of the underlying reasons

Data Scientist - could be either. Play with models. See what gives the best results. 

## Backward Elimination

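Looking at the summary table of the model that includes every feature, we see that a few features have almost no effect on work absenteeism (Daily Work Load Average, Distance to Work, Day of the Week). These are the columns dropped at the checkpoint above, together with Absenteeism Time in Hours. 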

We can remove these features which play no role on our dependent variable to simplify our model. 

This is like removing coefficients with p-values > 0.05

Removing these features should not change the rest of the coefficients by much. 
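
A minimal sketch of this effect on synthetic data, assuming one feature that is pure noise: fit the model with and without it and compare the remaining coefficients. Nothing here comes from the absenteeism dataset; the data, the seed and all names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data for illustration only
rng = np.random.default_rng(1)
n = 2000
informative = rng.normal(size=(n, 2))   # two features that drive the target
noise = rng.normal(size=(n, 1))         # one feature unrelated to the target
y = (1.5 * informative[:, 0] - informative[:, 1] + rng.normal(size=n) > 0).astype(int)

# Model with all three features
X_full = np.hstack([informative, noise])
full_model = LogisticRegression().fit(X_full, y)

# Model after removing the noise feature
reduced_model = LogisticRegression().fit(informative, y)

# Largest change among the coefficients that both models share
coef_shift = np.abs(full_model.coef_[0][:2] - reduced_model.coef_[0]).max()

print(full_model.coef_[0])
print(reduced_model.coef_[0])
print(coef_shift)
```

If the shift were large, that would suggest the removed feature was correlated with the others and was not as irrelevant as its coefficient implied.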