In [1]:
import pandas as pd
import numpy as np

## Load the data

In [2]:
data_preprocessed = pd.read_parquet('../../Absenteeism_preprocessed.parquet.gzip')

In [3]:
data_preprocessed.head()

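| | Reason_1 | Reason_2 | Reason_3 | Reason_4 | Month Value | Day of the Week | Transportation Expense | Distance to Work | Age | Daily Work Load Average | Body Mass Index | Education | Children | Pets | Absenteeism Time in Hours |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 7 | 1 | 289 | 36 | 33 | 239.554 | 30 | 0 | 2 | 1 | 4 |
| 1 | 0 | 0 | 0 | 0 | 7 | 1 | 118 | 13 | 50 | 239.554 | 31 | 0 | 1 | 0 | 0 |
| 2 | 0 | 0 | 0 | 1 | 7 | 2 | 179 | 51 | 38 | 239.554 | 31 | 0 | 0 | 0 | 2 |
| 3 | 1 | 0 | 0 | 0 | 7 | 3 | 279 | 5 | 39 | 239.554 | 24 | 0 | 2 | 0 | 4 |
| 4 | 0 | 0 | 0 | 1 | 7 | 3 | 289 | 36 | 33 | 239.554 | 30 | 0 | 2 | 1 | 2 |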


## Create the targets & inputs

We use the `median` of `Absenteeism Time in Hours` as the cutoff for creating the targets: roughly half of the observations fall at or below the median and roughly half lie above it, so the cutoff splits the raw data into two classes of similar size. The two classes are **moderate absenteeism** (0) and **excessive absenteeism** (1). Since the target is binary, we can use **logistic regression**. The classes do not have to be perfectly balanced; a 60-40 split will usually work equally well for a logistic regression.

In [4]:
targets = np.where(data_preprocessed['Absenteeism Time in Hours'] > 
                   data_preprocessed['Absenteeism Time in Hours'].median(), 1, 0)

In [5]:
data_preprocessed['Excessive Absenteeism'] = targets
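
As a quick illustration of the median cutoff, here is a minimal sketch on two made-up samples (the `hours` and `hours_with_ties` arrays are hypothetical, not taken from the dataset). It shows that the split is exactly balanced when no values sit on the median, and drifts away from 50-50 when many observations tie with it.

```python
import numpy as np

# Hypothetical absence durations in hours (not the real dataset).
hours = np.array([0, 1, 2, 2, 3, 4, 8, 8, 16, 40])
hours_with_ties = np.array([1, 2, 3, 3, 3, 3, 4, 8, 8, 8])

def share_above_median(values):
    """Return the median and the fraction of values strictly above it."""
    median = np.median(values)
    labels = np.where(values > median, 1, 0)  # same rule as the cell above
    return median, labels.mean()

median_a, share_a = share_above_median(hours)
median_b, share_b = share_above_median(hours_with_ties)

print(median_a, share_a)  # no ties at the median: balanced split
print(median_b, share_b)  # ties at the median: fewer observations in class 1
```

Because the comparison is a strict `>`, observations equal to the median fall into class 0, which is why real data with repeated values rarely gives an exact 50-50 split.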

Now that we have the targets, we no longer need the `Absenteeism Time in Hours` column, so we remove it from the data. We also remove the `Daily Work Load Average`, `Day of the Week` and `Distance to Work` columns, for a reason that becomes clearer when we interpret the coefficients: they have little effect on the prediction (their coefficients are close to zero). You can repeat the process with these features included and compare the resulting coefficients.

In [6]:
data_with_targets = data_preprocessed.drop(['Absenteeism Time in Hours', 'Daily Work Load Average',
                                           'Day of the Week', 'Distance to Work'], axis = 1)

In [7]:
unscaled_inputs = data_with_targets.iloc[:, :-1]

## Standardize the data

Scaling the data is an important preprocessing step. We use a custom scaler so that only the numerical features are standardized while the dummy variables are left untouched; it also makes it easy to change later which columns are scaled.

In [8]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler

class CustomScaler(BaseEstimator, TransformerMixin):
    
    def __init__(self, columns, copy=True, with_mean=True, with_std=True):
        self.copy = copy
        self.with_mean = with_mean
        self.with_std = with_std
        self.scaler = StandardScaler(copy=copy, with_mean=with_mean, with_std=with_std)
        self.columns = columns
        self.mean_ = None
        self.var_ = None
        
    def fit(self, X, y = None):
        self.scaler.fit(X[self.columns], y)
        # Per-column statistics (population variance, as StandardScaler uses)
        self.mean_ = X[self.columns].mean()
        self.var_ = X[self.columns].var(ddof=0)
        return self
    
    def transform(self, X, y=None, copy=None):
        init_col_order = X.columns
        # Keep the original index so the concat below aligns rows correctly
        X_scaled = pd.DataFrame(self.scaler.transform(X[self.columns]),
                                columns=self.columns, index=X.index)
        X_not_scaled = X.loc[:, ~X.columns.isin(self.columns)]
        return pd.concat([X_not_scaled, X_scaled], axis = 1)[init_col_order]

In [9]:
columns_to_scale = ['Month Value', 'Transportation Expense', 'Age', 'Body Mass Index', 'Children', 'Pets']

In [10]:
absenteeism_scaler = CustomScaler(columns_to_scale)

In [11]:
absenteeism_scaler.fit(unscaled_inputs)

CustomScaler(columns=['Month Value', 'Transportation Expense', 'Age',
                      'Body Mass Index', 'Children', 'Pets'])

In [12]:
scaled_inputs = absenteeism_scaler.transform(unscaled_inputs)
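
To make "standardize" concrete, the following sketch applies the same formula by hand, $(x - \text{mean}) / \text{std}$, to the five `Age` values shown in the `head()` preview. It is only an illustration on those five rows, not the scaler fitted on the full dataset; the variable names are made up for this example.

```python
import numpy as np

# 'Age' values from the five rows of the head() preview.
ages = np.array([33.0, 50.0, 38.0, 39.0, 33.0])

# Standardization: subtract the mean, divide by the population
# standard deviation (ddof=0), the convention StandardScaler follows.
standardized_ages = (ages - ages.mean()) / ages.std()

# After standardization the values have mean ~0 and standard deviation ~1.
print(standardized_ages.mean(), standardized_ages.std())
```

The same check (`mean` close to 0, `std` close to 1) can be run on the scaled columns of `scaled_inputs` to confirm the scaler behaved as expected.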

## Split the data

In [13]:
from sklearn.model_selection import train_test_split

`train_test_split` shuffles the data by default, so each run of the cell below would produce a different split of the data. By luck we might get a high accuracy on one run and a poor one on the next. To make the results reproducible, we fix the seed of the pseudo-random shuffle with the `random_state` parameter. With a fixed `random_state`, the method will always **shuffle the observations in the same 'random' way**.

In [14]:
x_train, x_test, y_train, y_test = train_test_split(scaled_inputs, targets, test_size = 0.2, random_state = 20)
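
The effect of `random_state` can be checked with a small sketch on toy arrays (the data and the variable names `X_toy`, `y_toy` are invented for the illustration): two calls with the same seed return identical splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 observations with 2 features each (hypothetical).
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Two independent calls with the same seed.
a_train, a_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=20)
b_train, b_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=20)

same_split = np.array_equal(a_train, b_train) and np.array_equal(a_test, b_test)
print(same_split)
print(a_train.shape, a_test.shape)
```

Omitting `random_state` (or changing its value) would generally produce a different permutation on each run.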

In [15]:
print(x_train.shape, y_train.shape)

(560, 11) (560,)


In [16]:
print(x_test.shape, y_test.shape)

(140, 11) (140,)


## Training the model

### Mathematical background
An explanation of logistic regression can begin with an explanation of the standard logistic function. The logistic function is a sigmoid function, which takes any real input $t$, and outputs a value between zero and one. 

The standard logistic function $\sigma : \mathbb{R} \rightarrow (0, 1)$ is defined as follows:

$$\sigma (t) = \frac {e^t}{1 + e^t} = \frac {1}{1 + e^{-t}}$$

Let us assume that $t$ is a linear function of a single explanatory variable $x$. We can then express $t$ as follows:

$$t = \beta_0 + \beta_1x$$

And the general logistic function $p: \mathbb{R} \rightarrow (0, 1)$ can now be written as:

$$p(x) = \sigma (t) = \frac {1}{1 + e^{-(\beta_0 + \beta_1x)}}$$

In the logistic model, $p(x)$ is interpreted as the probability of the dependent variable $Y$ equaling a success/case rather than a failure/non-case.

We can now define the logit (log odds) function as the inverse $g = \sigma ^ {-1}$ of the standard logistic function. It is easy to see that it satisfies:

$$g(p(x)) = \sigma^{-1}(p(x)) = \operatorname{logit} p(x) = \ln\left(\frac{p(x)}{1 - p(x)}\right) = \beta_0 + \beta_1x$$

and equivalently, after exponentiating both sides we have the odds:

$$\frac{p(x)}{1 - p(x)} = e^{\beta_0 + \beta_1x}$$
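
The relations above can be verified numerically. The sketch below uses hypothetical values for $\beta_0$, $\beta_1$ and $x$ (they are not fitted coefficients) and checks that the logit inverts the logistic function and that the odds equal $e^{\beta_0 + \beta_1x}$.

```python
import math

def sigmoid(t):
    """Standard logistic function: maps any real t into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))

def logit(p):
    """Log odds: the inverse of the standard logistic function."""
    return math.log(p / (1.0 - p))

# Hypothetical coefficients and input, chosen only for illustration.
beta_0, beta_1 = -1.0, 0.5
x = 4.0

t = beta_0 + beta_1 * x   # linear predictor
p = sigmoid(t)            # probability of the positive class
odds = p / (1.0 - p)      # odds of the positive class

print(p)                  # a probability between 0 and 1
print(logit(p), t)        # logit(p) recovers t
print(odds, math.exp(t))  # the odds equal exp(t)
```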

In our problem, $p(x)$ is the probability of excessive absenteeism, whereas $1-p(x)$ is the probability of moderate absenteeism. We can compute their ratio (called the `odds`) by exponentiating the linear combination of the inputs weighted by the coefficients ($\beta_i$).

The coefficients ($\beta_i$ for $i > 0$) determine how strongly the corresponding predictor (feature) influences the final decision.

When all the predictors are equal to zero (for standardized features, when they take their mean values), the log odds equal $\beta_0$, which is called the `intercept` or `bias`. If $\beta_0$ is zero as well, the two classes are equally likely (probability 0.5 in our case). In other words, the bias calibrates the model.

In [17]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

In [18]:
reg = LogisticRegression()

In [19]:
reg.fit(x_train, y_train)

LogisticRegression()

In [20]:
reg.score(x_train, y_train)

0.7732142857142857

## Manually check the accuracy

In [21]:
model_outputs = reg.predict(x_train)

In [22]:
np.sum((model_outputs == y_train)) / model_outputs.shape[0]

0.7732142857142857

## Interpret coefficients

In [23]:
feature_name = unscaled_inputs.columns.values

In [24]:
summary_table = pd.DataFrame(columns = ['Feature name'], data = feature_name)
summary_table['Coefficient'] = np.transpose(reg.coef_)

In [25]:
summary_table.index = summary_table.index + 1
summary_table.loc[0] = ['Intercept', reg.intercept_[0]]
summary_table = summary_table.sort_index()

In [26]:
summary_table['Odds'] = np.exp(summary_table.Coefficient)

In [27]:
summary_table.sort_values('Odds', ascending=False)

| | Feature name | Coefficient | Odds |
|---|---|---|---|
| 3 | Reason_3 | 3.115553 | 22.545903 |
| 1 | Reason_1 | 2.800197 | 16.447892 |
| 2 | Reason_2 | 0.951884 | 2.590585 |
| 4 | Reason_4 | 0.839001 | 2.314054 |
| 6 | Transportation Expense | 0.605284 | 1.831773 |
| 10 | Children | 0.348262 | 1.416604 |
| 8 | Body Mass Index | 0.279811 | 1.32288 |
| 5 | Month Value | 0.15893 | 1.172256 |
| 7 | Age | -0.169891 | 0.843757 |
| 9 | Education | -0.210533 | 0.810152 |


As we can see, Reason_3 (poisoning), Reason_1 (various diseases), Reason_2 (pregnancy and giving birth), and Reason_4 (light diseases) have the largest coefficients, so they are the most important factors when the model classifies an observation as excessive or moderate absenteeism.

**A weight (coefficient) of 0 implies that, whatever the feature value, it is multiplied by 0 in the model and therefore has no effect on the prediction. For a one-unit change in a standardized feature, the odds are multiplied by the odds ratio $e^{\beta_i}$ (the `Odds` column); an odds ratio of 1 means no change.**
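
To make the odds-ratio reading concrete, here is a small worked calculation using the `Transportation Expense` coefficient from the table above. The starting odds of 1 (a probability of 0.5) are an assumption chosen for the example, not a value produced by the model.

```python
import math

# Coefficient of 'Transportation Expense' from the summary table.
coefficient = 0.605284
odds_ratio = math.exp(coefficient)   # matches the 'Odds' column (about 1.83)

# Assumed starting point for the illustration: even odds, i.e. p = 0.5.
baseline_odds = 1.0

# One standard deviation increase in the feature multiplies the odds
# by the odds ratio, holding all other features fixed.
new_odds = baseline_odds * odds_ratio
new_probability = new_odds / (1.0 + new_odds)

print(round(odds_ratio, 4), round(new_probability, 4))
```

Under that assumption, an observation that started at even odds ends up with a probability of excessive absenteeism of roughly 0.65 after a one standard deviation increase in transportation expense.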

## Test the data

In [28]:
reg.score(x_test, y_test)

0.75

In [29]:
predicted_proba = reg.predict_proba(x_test)
predicted_proba

array([[0.71340413, 0.28659587],
       [0.58724228, 0.41275772],
       [0.44020821, 0.55979179],
       [0.78159464, 0.21840536],
       [0.08410854, 0.91589146],
       [0.33487603, 0.66512397],
       [0.29984576, 0.70015424],
       [0.13103971, 0.86896029],
       [0.78625404, 0.21374596],
       [0.74903632, 0.25096368],
       [0.49397598, 0.50602402],
       [0.22484913, 0.77515087],
       [0.07129151, 0.92870849],
       [0.73178133, 0.26821867],
       [0.30934135, 0.69065865],
       [0.5471671 , 0.4528329 ],
       [0.55052275, 0.44947725],
       [0.5392707 , 0.4607293 ],
       [0.40201117, 0.59798883],
       [0.05361575, 0.94638425],
       [0.7003009 , 0.2996991 ],
       [0.78159464, 0.21840536],
       [0.42037128, 0.57962872],
       [0.42037128, 0.57962872],
       [0.24783565, 0.75216435],
       [0.74566259, 0.25433741],
       [0.51017274, 0.48982726],
       [0.85690195, 0.14309805],
       [0.20349733, 0.79650267],
       [0.78159464, 0.21840536],
       [0.

## Save the model

In [37]:
import pickle

In [38]:
with open('model', 'wb') as file:
    pickle.dump(reg, file)

In [39]:
with open('scaler', 'wb') as file:
    pickle.dump(absenteeism_scaler, file)
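
The saved objects can later be restored with `pickle.load`. The sketch below demonstrates the round trip on a toy model written to a temporary directory, so it does not touch the `model` and `scaler` files created above; the toy data and variable names are hypothetical. Note that unpickling executes code stored in the file, so only load pickles from a trusted source.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the trained model (hypothetical data).
X_toy = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y_toy = np.array([0, 0, 1, 1])
toy_model = LogisticRegression().fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'model')

    # Save, exactly as in the cells above.
    with open(path, 'wb') as file:
        pickle.dump(toy_model, file)

    # Load: open in binary read mode and call pickle.load.
    with open(path, 'rb') as file:
        loaded_model = pickle.load(file)

# The restored model gives the same predictions as the original.
same_predictions = np.array_equal(toy_model.predict(X_toy),
                                  loaded_model.predict(X_toy))
print(same_predictions)
```

For the real files, the same pattern applies with `open('model', 'rb')` and `open('scaler', 'rb')`; the `CustomScaler` class must be defined or importable in the session that loads the scaler.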