In [109]:
import pandas as pd
import numpy as np

## Load data 

In [229]:
df_preprocessed = pd.read_csv('absenteeism_data_preprocessed.csv')

In [230]:
df = df_preprocessed.copy()
df.head()

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month,Day of Week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,0,0,0,1,7,1,289,36,33,239.554,30,0,2,1,4
1,0,0,0,0,7,1,118,13,50,239.554,31,0,1,0,0
2,0,0,0,1,7,2,179,51,38,239.554,31,0,0,0,2
3,1,0,0,0,7,3,279,5,39,239.554,24,0,2,0,4
4,0,0,0,1,7,3,289,36,33,239.554,30,0,2,1,2


## Create targets for logistic regression

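We take the median absenteeism time as a cut-off, and then use logistic regression to classify each observation as above the median (excessive absence) or not.

Using the median is robust to outliers and usually gives a roughly balanced dataset (a split in the region of 45:55). The split is not guaranteed to be exactly even, because every value tied with the median falls in class 0.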

In [231]:
targets = np.where(df['Absenteeism Time in Hours'] > df['Absenteeism Time in Hours'].median(),
                  1,
                  0)

df['Excessive Absenteeism'] = targets

# Drop the original target column along with variables that add little to the model
# If you put these variables back into the model and re-run, you'll see their coefficients are ~0
# e^0 = 1, so the odds are multiplied by 1, i.e. no effect
df = df.drop(['Absenteeism Time in Hours','Distance to Work', 'Daily Work Load Average','Day of Week','Education'], axis=1)
df.head()

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month,Transportation Expense,Age,Body Mass Index,Children,Pets,Excessive Absenteeism
0,0,0,0,1,7,289,33,30,2,1,1
1,0,0,0,0,7,118,50,31,1,0,0
2,0,0,0,1,7,179,38,31,0,0,0
3,1,0,0,0,7,279,39,24,2,0,1
4,0,0,0,1,7,289,33,30,2,1,0
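
As a quick sanity check on the median cut-off, the sketch below applies the same np.where rule to a small made-up array and reports the share of observations labelled 1. The hours values are hypothetical and are not taken from the dataset. Values tied with the median land in class 0, which is why the split is not exactly 50:50.

```python
import numpy as np

# Hypothetical absence durations in hours (not taken from the dataset)
hours = np.array([0, 1, 2, 2, 3, 3, 3, 4, 8, 8, 16, 40])

cutoff = np.median(hours)                      # middle value of the sorted data
targets_demo = np.where(hours > cutoff, 1, 0)  # 1 only when strictly above the median

# Share of observations in class 1; ties with the median count as 0
share_of_ones = targets_demo.mean()
print(cutoff, share_of_ones)
```

Running the same check on df['Excessive Absenteeism'] would show how balanced the real targets are.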


In [232]:
df_with_targets = df.copy()

We can use the ***is*** operator to check whether two names refer to the same object in memory. Because copy() creates a new DataFrame, the check below returns False.

In [233]:
df is df_with_targets

False
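
The same distinction holds for any Python object. A minimal sketch with plain lists (the variable names are illustrative only) shows the difference between a second name for one object and a genuine copy:

```python
original = [1, 2, 3]

alias = original              # a second name for the same object
duplicate = original.copy()   # a new object with equal contents

print(alias is original)      # same object in memory
print(duplicate is original)  # different objects
print(duplicate == original)  # equal values, compared element by element

alias.append(4)               # the change is visible through both names
print(original)
print(duplicate)              # the copy is unaffected
```

This is why the notebook works on copies: later changes to df cannot silently alter df_with_targets.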

## Creating inputs

In [234]:
unscaled_inputs = df_with_targets.iloc[:, :-1]
unscaled_inputs

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month,Transportation Expense,Age,Body Mass Index,Children,Pets
0,0,0,0,1,7,289,33,30,2,1
1,0,0,0,0,7,118,50,31,1,0
2,0,0,0,1,7,179,38,31,0,0
3,1,0,0,0,7,279,39,24,2,0
4,0,0,0,1,7,289,33,30,2,1
...,...,...,...,...,...,...,...,...,...,...
695,1,0,0,0,5,179,40,22,2,0
696,1,0,0,0,5,225,28,24,1,2
697,1,0,0,0,5,330,28,25,0,0
698,0,0,0,1,5,235,32,25,0,0


##### We now need to standardise the inputs!!!

Importantly, we only want to scale **some** of the inputs. The Reason_x dummy columns are left unchanged so that their coefficients stay easy to interpret in our final output.

In [235]:
from sklearn.base import BaseEstimator, TransformerMixin
# BaseEstimator is the base class for all scikit-learn estimators https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/base.py#L150
# TransformerMixin provides the fit_transform method
from sklearn.preprocessing import StandardScaler

class CustomScaler(BaseEstimator, TransformerMixin):
    
    # Build our scaler which copies the data, then standardises using mean and std
    def __init__(self, columns, copy=True, with_mean=True, with_std=True):
        # Store every constructor argument so BaseEstimator can report it in get_params() and repr
        self.columns = columns # Specify columns to be fitted
        self.copy = copy
        self.with_mean = with_mean
        self.with_std = with_std
        # Pass the options by keyword: recent scikit-learn versions do not accept them positionally
        self.scaler = StandardScaler(copy=copy, with_mean=with_mean, with_std=with_std)
        self.mean_ = None
        self.var_ = None
    
    def fit(self, X, y=None):
        self.scaler.fit(X[self.columns], y) # Only fit a subset of columns
        self.mean_ = np.mean(X[self.columns], axis=0) # Per-column mean
        self.var_ = np.var(X[self.columns], axis=0) # Per-column variance
        return self # Returns scaler fitted to specific columns from X
    
    def transform(self, X, y=None, copy=None):
        init_col_order = X.columns
        # Keep the original index so the concat below aligns row by row
        X_scaled = pd.DataFrame(data=self.scaler.transform(X[self.columns]), columns=self.columns, index=X.index)
        # The tilde (~) operator returns the complement
        # In this case, every column that is NOT in self.columns
        X_not_scaled = X.loc[:,~X.columns.isin(self.columns)]
        return pd.concat([X_not_scaled, X_scaled], axis=1)[init_col_order]

In [237]:
columns_to_scale = ['Month', 'Transportation Expense', 'Age',
       'Body Mass Index','Children', 'Pets']

In [238]:
absenteeism_scaler = CustomScaler(columns_to_scale)

In [239]:
absenteeism_scaler.fit(unscaled_inputs)



CustomScaler(columns=['Month', 'Transportation Expense', 'Age',
                      'Body Mass Index', 'Children', 'Pets'],
             copy=True, with_mean=True, with_std=True)

In [240]:
scaled_inputs = absenteeism_scaler.transform(unscaled_inputs)

Because the scaler object stores the mean and standard deviation it learned in fit, we can reuse it to apply the SAME standardisation to new data!

fit_transform calls fit to learn those statistics and then transform to produce the scaled dataset, all in one step.
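
To make the fit/transform split concrete, here is a minimal NumPy sketch of what standardisation stores and reuses. The ages are made-up values and the variable names are illustrative; it mirrors the idea behind StandardScaler rather than its actual implementation.

```python
import numpy as np

# Hypothetical ages: the first array plays the role of the data used for fitting
train_age = np.array([30.0, 40.0, 50.0])
new_age = np.array([40.0, 60.0])

# "fit": learn the statistics once
mean_ = train_age.mean()
std_ = train_age.std()   # population standard deviation (ddof=0)

# "transform": apply the stored statistics to any data, old or new
train_scaled = (train_age - mean_) / std_
new_scaled = (new_age - mean_) / std_

print(train_scaled)
print(new_scaled)
```

The new values are scaled with the training mean and standard deviation, not their own, which is what keeps the inputs comparable with the data the model was trained on.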

In [241]:
scaled_inputs.shape

(700, 10)

In [242]:
scaled_inputs

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month,Transportation Expense,Age,Body Mass Index,Children,Pets
0,0,0,0,1,0.182726,1.005844,-0.536062,0.767431,0.880469,0.268487
1,0,0,0,0,0.182726,-1.574681,2.130803,1.002633,-0.019280,-0.589690
2,0,0,0,1,0.182726,-0.654143,0.248310,1.002633,-0.919030,-0.589690
3,1,0,0,0,0.182726,0.854936,0.405184,-0.643782,0.880469,-0.589690
4,0,0,0,1,0.182726,1.005844,-0.536062,0.767431,0.880469,0.268487
...,...,...,...,...,...,...,...,...,...,...
695,1,0,0,0,-0.388293,-0.654143,0.562059,-1.114186,0.880469,-0.589690
696,1,0,0,0,-0.388293,0.040034,-1.320435,-0.643782,-0.019280,1.126663
697,1,0,0,0,-0.388293,1.624567,-1.320435,-0.408580,-0.919030,-0.589690
698,0,0,0,1,-0.388293,0.190942,-0.692937,-0.408580,-0.919030,-0.589690


## Train test split

In [243]:
from sklearn.model_selection import train_test_split

In [244]:
x_train, x_test, y_train, y_test = train_test_split(scaled_inputs, 
                                                    targets, 
                                                    train_size = 0.8,
                                                    random_state=1)


In [245]:
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)

(560, 10) (140, 10) (560,) (140,)
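
Conceptually, the split shuffles the row positions and cuts them at 80%. The sketch below does this with NumPy for 700 rows; it is an illustration only and will not select the same rows as train_test_split with random_state=1.

```python
import numpy as np

n_samples = 700                        # number of rows in scaled_inputs
rng = np.random.default_rng(1)         # fixed seed so the shuffle is repeatable
shuffled = rng.permutation(n_samples)  # row positions in random order

n_train = round(n_samples * 0.8)
train_idx, test_idx = shuffled[:n_train], shuffled[n_train:]

print(len(train_idx), len(test_idx))
```

Every row ends up in exactly one of the two sets, so the test set stays unseen during training.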


# Modelling 🎉

In [246]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

In [247]:
reg = LogisticRegression()

In [248]:
reg.fit(x_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [249]:
reg.score(x_train, y_train)

0.7732142857142857

### Manually check accuracy

In [250]:
model_outputs = reg.predict(x_train)

In [251]:
np.sum(model_outputs == y_train) / model_outputs.shape[0]

0.7732142857142857

### Create summary table

In [252]:
reg.intercept_

array([-1.70368942])

In [253]:
reg.coef_

array([[ 2.80431533,  0.98899259,  3.099096  ,  0.84885058,  0.10708124,
         0.57063203, -0.24989372,  0.29076326,  0.46916808, -0.29024301]])

In [254]:
feature_names = unscaled_inputs.columns.values
feature_names

array(['Reason_1', 'Reason_2', 'Reason_3', 'Reason_4', 'Month',
       'Transportation Expense', 'Age', 'Body Mass Index', 'Children',
       'Pets'], dtype=object)

In [255]:
summary_table = pd.DataFrame(columns=['Feature name'], data = feature_names)

# We transpose to turn array into a column of values rather than a row
summary_table['Coefficient'] = np.transpose(reg.coef_)

summary_table

Unnamed: 0,Feature name,Coefficient
0,Reason_1,2.804315
1,Reason_2,0.988993
2,Reason_3,3.099096
3,Reason_4,0.848851
4,Month,0.107081
5,Transportation Expense,0.570632
6,Age,-0.249894
7,Body Mass Index,0.290763
8,Children,0.469168
9,Pets,-0.290243


We shift every row down by 1 index position so that the intercept can be inserted at index 0.

In [256]:
summary_table.index = summary_table.index + 1
summary_table.loc[0] = ['Intercept', reg.intercept_[0]]
summary_table = summary_table.sort_index()
summary_table

Unnamed: 0,Feature name,Coefficient
0,Intercept,-1.703689
1,Reason_1,2.804315
2,Reason_2,0.988993
3,Reason_3,3.099096
4,Reason_4,0.848851
5,Month,0.107081
6,Transportation Expense,0.570632
7,Age,-0.249894
8,Body Mass Index,0.290763
9,Children,0.469168
10,Pets,-0.290243


Remember that the logistic regression model is linear with respect to the log(odds). To see how each weight affects the odds, we raise e (2.71828) to the power of the coefficient, which gives the odds ratio.

In [257]:
summary_table['Odds ratio'] = np.exp(summary_table.Coefficient)
summary_table.sort_values('Odds ratio', ascending=False)

Unnamed: 0,Feature name,Coefficient,Odds ratio
3,Reason_3,3.099096,22.177893
1,Reason_1,2.804315,16.515764
2,Reason_2,0.988993,2.688525
4,Reason_4,0.848851,2.336959
6,Transportation Expense,0.570632,1.769385
9,Children,0.469168,1.598664
8,Body Mass Index,0.290763,1.337448
5,Month,0.107081,1.113025
7,Age,-0.249894,0.778884
10,Pets,-0.290243,0.748082


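To interpret this, **for a one-unit increase in a feature, the odds are multiplied by its odds ratio**: an odds ratio of 1 implies no change in the odds, a value above 1 increases them and a value below 1 decreases them. For the standardised features one unit is one standard deviation; for the Reason_x dummies it is the switch from 0 to 1.

e.g. with previous odds of 5:1, an increase of 1 standardised unit in a feature with an odds ratio of 2 gives new odds of 10:1.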
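
The arithmetic can be checked directly. This sketch exponentiates the Transportation Expense coefficient from the summary table and reproduces the 5:1 to 10:1 example; odds_to_probability is a helper introduced here for illustration.

```python
import math

# Coefficient for Transportation Expense, copied from the summary table
coefficient = 0.570632
odds_ratio = math.exp(coefficient)   # roughly 1.77

# The worked example from the text: odds of 5:1 and an odds ratio of 2
old_odds = 5.0
new_odds = old_odds * 2.0

def odds_to_probability(odds):
    """Convert odds, defined as p / (1 - p), back to a probability p."""
    return odds / (1.0 + odds)

print(round(odds_ratio, 4))
print(odds_to_probability(old_odds), odds_to_probability(new_odds))
```

Doubling the odds does not double the probability: odds of 5:1 correspond to a probability of about 0.83, and 10:1 to about 0.91.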

(Note: the variables we dropped earlier were identified at this step. Their coefficients were ~0, so we removed them from the model and re-ran it.)

**Note that Reason_0 is the base case in this situation!**

Remember that reason 3 is poisoning, 1 is various diseases, 4 is light diseases, and 2 is pregnancy, so the ordering of the coefficients makes sense when you think about how much absence each reason is likely to cause.

# Testing

In [258]:
reg.score(x_test, y_test)

0.75

We can also predict the **probability** that an observation belongs to class 1 (i.e. not just a label of 0 or 1, but a continuous value between 0 and 1).

In [263]:
predicted_proba = reg.predict_proba(x_test)
# This gives an array of n observations by 2 columns
# Column index 0 is the probability of a zero
# Column index 1 is the probability of a one
predicted_proba[:5,1]

array([0.74548611, 0.24453215, 0.24660453, 0.14883376, 0.43786041])
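
These probabilities come from passing the linear log(odds) through the logistic function, and for a binary model the predicted label corresponds to a cut-off of 0.5. A short sketch, using the five probabilities shown above and a sigmoid helper defined here for illustration:

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps a log(odds) value to a probability."""
    return 1.0 / (1.0 + np.exp(-z))

# Probabilities of class 1 for the first five test rows, copied from the output above
proba_of_one = np.array([0.74548611, 0.24453215, 0.24660453, 0.14883376, 0.43786041])

# A 0.5 cut-off turns probabilities into class labels
labels = (proba_of_one > 0.5).astype(int)
print(labels)

# A log(odds) of 0 corresponds to a probability of exactly 0.5
print(sigmoid(0.0))
```

Keeping the probabilities rather than only the labels lets you choose a different cut-off later if false positives and false negatives carry different costs.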

# Saving

Pickling a model saves the fitted object, including its weights and parameters, for future use.

The pickle module converts an object into a byte stream.

We need to save both the logistic regression model ***reg*** as well as the scaler ***absenteeism_scaler***.

In [264]:
import pickle

In [265]:
# 'model' is the file name, 'wb' means write bytes
with open('model', 'wb') as file:
    pickle.dump(reg, file) # .dump means save
    
with open('scaler', 'wb') as file:
    pickle.dump(absenteeism_scaler, file)
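
Loading works the same way in reverse, with 'rb' (read bytes) and pickle.load. The sketch below round-trips a stand-in dictionary through a temporary file; the dictionary and the temporary path are assumptions for illustration, not the notebook's actual model.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted model; any picklable Python object works the same way
stand_in_model = {"intercept": -1.703689, "n_features": 10}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model")

    with open(path, "wb") as file:   # write bytes
        pickle.dump(stand_in_model, file)

    with open(path, "rb") as file:   # read bytes
        restored = pickle.load(file)

print(restored == stand_in_model)
```

When loading the real scaler, the CustomScaler class must be defined or importable in the loading session, because pickle stores a reference to the class rather than its code. Only unpickle files from sources you trust.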