# Introduction
Today's business environment is more competitive than in the past, which naturally increases pressure in the workplace. Additionally, unachievable business goals and an elevated risk of unemployment can raise stress levels for individuals. The continuous presence of these factors can become detrimental to one's health. This can result in illness, both minor (colds, spasms, etc.) and long-term (depression). Both types of illness lead to absenteeism from work.

Absenteeism is defined as the absence from work during normal working hours, resulting in temporary incapacity to execute regular working activity.

# Problem
From the perspective of the manager(s) in charge of productivity, it can be important to predict absenteeism from work. Additionally, it would be good to know how long we can expect someone to be missing from work.

# Purpose
Explore whether a person presenting certain characteristics is expected to be away from work at some point in time or not.

Having this information in advance, the manager(s) can improve decision making by reorganising the workflow so as to avoid a loss of productivity and increase the quality of the work generated.

# Resources
Preprocessed data (secondary) is obtained from preprocessing_absenteeism.ipynb in the form of a .csv file.

# Method
* Import Packages
* Load Preprocessed Data
* Create the Targets for Logistic Regression
* Select the Inputs
* Scale the Inputs
* Shuffle and Split the training and test data
* Train the Model
  * Check the accuracy on the training data manually
* Summary Table
* Interpretation
* Backward Elimination
* Test the model

## Import Packages

In [1]:
# import table and array packages
import numpy as np
import pandas as pd

# import graphing packages
import matplotlib.pyplot as plt
import seaborn as sns

# import ML packages
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

## Load Preprocessed Data

In [2]:
# Load the preprocessed data from the csv file
preprocessed_data = pd.read_csv('Absenteeism_preprocessed.csv')
preprocessed_data

Unnamed: 0,ID,Date,Reason 1,Reason 2,Reason 3,Reason 4,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours,Month,weekday
0,11,2015-07-07,0,0,0,1,289,36,33,239.554,30,0,2,1,4,7,1
1,36,2015-07-14,0,0,0,0,118,13,50,239.554,31,0,1,0,0,7,1
2,3,2015-07-15,0,0,0,1,179,51,38,239.554,31,0,0,0,2,7,2
3,7,2015-07-16,1,0,0,0,279,5,39,239.554,24,0,2,0,4,7,3
4,11,2015-07-23,0,0,0,1,289,36,33,239.554,30,0,2,1,2,7,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
695,17,2018-05-23,1,0,0,0,179,22,40,237.656,22,1,2,0,8,5,2
696,28,2018-05-23,1,0,0,0,225,26,28,237.656,24,0,1,2,3,5,2
697,18,2018-05-24,1,0,0,0,330,16,28,237.656,25,1,0,0,8,5,3
698,25,2018-05-24,0,0,0,1,235,16,32,237.656,25,1,0,0,2,5,3


## Create the Targets for Logistic Regression
Two classes will serve as the targets: a 'moderately absent' class and an 'excessively absent' class. Logistic regression lends itself nicely to this type of binary classification.

We will determine the level of absenteeism as follows. The median of 'Absenteeism Time in Hours' will be the separation between the two classes. This is numerically stable, but can be naive. Values at or below the median will be considered moderate and everything above it will be considered excessive.

The median is used instead of the mean because it implicitly balances the targets. This should prevent the model from only outputting 1s or 0s.

We can remove the original column once the targets have been derived from it.
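As a rough illustration of why a median threshold tends to give a more balanced split than a mean threshold, the sketch below uses made-up hours (not taken from the dataset) that include a few long absences. The array and variable names are assumptions for this example only.

```python
import numpy as np

# Hypothetical absence durations in hours, skewed by a few long absences
hours = np.array([0, 1, 2, 2, 3, 3, 4, 8, 8, 40, 120])

median_hours = np.median(hours)
mean_hours = hours.mean()

# Share of observations labelled 1 ("excessive") under each threshold
share_above_median = np.mean(hours > median_hours)
share_above_mean = np.mean(hours > mean_hours)

print(median_hours, round(mean_hours, 2))
print(round(share_above_median, 2), round(share_above_mean, 2))
```

Because several values are tied at the median and ties are counted as moderate, the median split is close to, but not exactly, 50/50.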

In [3]:
# Calculate the median
median = preprocessed_data['Absenteeism Time in Hours'].median()
print(median)

3.0


In [4]:
# Set the targets to 1 if higher than the median (3) and 0 if lower than or equal to it
targets = np.where(preprocessed_data['Absenteeism Time in Hours']<=median,0,1)
targets

array([1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0,
       1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1,
       0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1,
       0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1,
       0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0,
       1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1,
       0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0,
       0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0,
       0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0,

In [5]:
# Check for balance
targets.sum() / targets.shape[0]

0.45571428571428574

About 46% of the targets are 1s. A balance of between 45 and 55 percent is sufficient for logistic regression.

### Checkpoint 1

In [6]:
# Add targets to the dataframe
preprocessed_data['Excessive Absenteeism'] = targets
# Checkpoint 1
targeted_data = preprocessed_data.drop(['Absenteeism Time in Hours'], axis = 1)

In [7]:
targeted_data

Unnamed: 0,ID,Date,Reason 1,Reason 2,Reason 3,Reason 4,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Month,weekday,Excessive Absenteeism
0,11,2015-07-07,0,0,0,1,289,36,33,239.554,30,0,2,1,7,1,1
1,36,2015-07-14,0,0,0,0,118,13,50,239.554,31,0,1,0,7,1,0
2,3,2015-07-15,0,0,0,1,179,51,38,239.554,31,0,0,0,7,2,0
3,7,2015-07-16,1,0,0,0,279,5,39,239.554,24,0,2,0,7,3,1
4,11,2015-07-23,0,0,0,1,289,36,33,239.554,30,0,2,1,7,3,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
695,17,2018-05-23,1,0,0,0,179,22,40,237.656,22,1,2,0,5,2,1
696,28,2018-05-23,1,0,0,0,225,26,28,237.656,24,0,1,2,5,2,0
697,18,2018-05-24,1,0,0,0,330,16,28,237.656,25,1,0,0,5,3,1
698,25,2018-05-24,0,0,0,1,235,16,32,237.656,25,1,0,0,5,3,0


## Select the Inputs

In [8]:
col_list = targeted_data.columns.values
col_list

array(['ID', 'Date', 'Reason 1', 'Reason 2', 'Reason 3', 'Reason 4',
       'Transportation Expense', 'Distance to Work', 'Age',
       'Daily Work Load Average', 'Body Mass Index', 'Education',
       'Children', 'Pets', 'Month', 'weekday', 'Excessive Absenteeism'],
      dtype=object)

In [9]:
col_list = targeted_data.columns

exclude_list = [
    'Excessive Absenteeism','Reason 1', 'Reason 2',
    'Reason 3', 'Reason 4', 'Education',
    'Daily Work Load Average','Distance to Work','Date', 'weekday', 'ID']

scale_list = [x for x in col_list if x not in exclude_list]

to_scale_data = targeted_data[scale_list]
to_scale_data

Unnamed: 0,Transportation Expense,Age,Body Mass Index,Children,Pets,Month
0,289,33,30,2,1,7
1,118,50,31,1,0,7
2,179,38,31,0,0,7
3,279,39,24,2,0,7
4,289,33,30,2,1,7
...,...,...,...,...,...,...
695,179,40,22,2,0,5
696,225,28,24,1,2,5
697,330,28,25,0,0,5
698,235,32,25,0,0,5


## Scale the Inputs

In [10]:
# instantiate the scaler
# this scaler subtracts the mean and divides by the STDEV
scaler = StandardScaler()

In [11]:
# prepare the mechanism
scaler.fit(to_scale_data)

StandardScaler()

In [12]:
# transform the inputs
scaled_input = scaler.transform(to_scale_data)

# if we receive new data we can apply the transform method directly
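Reusing a fitted scaler on new data, as mentioned in the comment above, might look like the following sketch. The two-column arrays are illustrative stand-ins (loosely modelled on Transportation Expense and Age), not the notebook's actual `to_scale_data`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative training rows: [transportation expense, age]
train_rows = np.array([
    [289.0, 33.0],
    [118.0, 50.0],
    [179.0, 38.0],
    [279.0, 39.0],
])

# Fit once on the training rows; the mean and standard deviation are stored
demo_scaler = StandardScaler().fit(train_rows)

# A new, unseen row is transformed with the stored statistics (no refitting)
new_rows = np.array([[225.0, 28.0]])
new_scaled = demo_scaler.transform(new_rows)

# Equivalent manual calculation using the fitted attributes
manual_scaled = (new_rows - demo_scaler.mean_) / demo_scaler.scale_
print(new_scaled)
```

The key point is that `fit` is not called again: new data must be scaled with the statistics learned from the original data so that the model sees inputs on the same scale.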

In [13]:
scaled_input

array([[ 1.00584437, -0.53606239,  0.76743118,  0.88046927,  0.26848661,
         0.18272635],
       [-1.57468098,  2.13080317,  1.00263338, -0.01928035, -0.58968976,
         0.18272635],
       [-0.6541427 ,  0.24830984,  1.00263338, -0.91902997, -0.58968976,
         0.18272635],
       ...,
       [ 1.62456682, -1.32043461, -0.40857982, -0.91902997, -0.58968976,
        -0.3882935 ],
       [ 0.19094163, -0.69293683, -0.40857982, -0.91902997, -0.58968976,
        -0.3882935 ],
       [ 1.03602595,  0.56205873, -0.40857982, -0.01928035,  0.26848661,
        -0.3882935 ]])

In [14]:
scaled_input.shape

(700, 6)

In [15]:
scaled_input = pd.DataFrame(scaled_input, columns=scale_list)

In [16]:
# Notice that the dummy variables are still needed. 
# Include the dummy variables from a previous frame
scaled_input['Reason 1'] = targeted_data['Reason 1']
scaled_input['Reason 2'] = targeted_data['Reason 2']
scaled_input['Reason 3'] = targeted_data['Reason 3']
scaled_input['Reason 4'] = targeted_data['Reason 4']
scaled_input['Education'] = targeted_data['Education']
#Include the ID for identification purposes
scaled_input['ID'] = targeted_data['ID']
#Include the Date
scaled_input['Date'] = targeted_data['Date']
#Include the Targets
scaled_input['Excessive Absenteeism'] = targeted_data['Excessive Absenteeism']



In [17]:
scaled_input.iloc[:,:-1]

Unnamed: 0,Transportation Expense,Age,Body Mass Index,Children,Pets,Month,Reason 1,Reason 2,Reason 3,Reason 4,Education,ID,Date
0,1.005844,-0.536062,0.767431,0.880469,0.268487,0.182726,0,0,0,1,0,11,2015-07-07
1,-1.574681,2.130803,1.002633,-0.019280,-0.589690,0.182726,0,0,0,0,0,36,2015-07-14
2,-0.654143,0.248310,1.002633,-0.919030,-0.589690,0.182726,0,0,0,1,0,3,2015-07-15
3,0.854936,0.405184,-0.643782,0.880469,-0.589690,0.182726,1,0,0,0,0,7,2015-07-16
4,1.005844,-0.536062,0.767431,0.880469,0.268487,0.182726,0,0,0,1,0,11,2015-07-23
...,...,...,...,...,...,...,...,...,...,...,...,...,...
695,-0.654143,0.562059,-1.114186,0.880469,-0.589690,-0.388293,1,0,0,0,1,17,2018-05-23
696,0.040034,-1.320435,-0.643782,-0.019280,1.126663,-0.388293,1,0,0,0,0,28,2018-05-23
697,1.624567,-1.320435,-0.408580,-0.919030,-0.589690,-0.388293,1,0,0,0,1,18,2018-05-24
698,0.190942,-0.692937,-0.408580,-0.919030,-0.589690,-0.388293,0,0,0,1,1,25,2018-05-24


## Shuffle and Split the training and test data

In [18]:
# split the data whilst doing a pseudo-random shuffle.
# x keeps every column except the target; y keeps ID, Date and the target
# unpack the result into training x, test x, training targets, test targets
x_train, x_test, y_train, y_test = train_test_split(scaled_input.iloc[:,:-1], scaled_input.iloc[:,-3:], train_size=0.8, shuffle=True, random_state=20)

In [19]:
x_train.shape

(560, 13)

In [20]:
y_train.shape

(560, 3)

In [21]:
x_test.shape

(140, 13)

In [22]:
y_test.shape

(140, 3)
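The effect of `random_state` can be checked with a small sketch. It uses placeholder arrays with the same number of rows as the data (700) rather than the real inputs, and shows that the same seed reproduces the same 560/140 split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder inputs and targets with 700 rows, standing in for the real data
inputs = np.arange(700 * 2).reshape(700, 2)
labels = np.arange(700) % 2

first = train_test_split(inputs, labels, train_size=0.8, shuffle=True, random_state=20)
second = train_test_split(inputs, labels, train_size=0.8, shuffle=True, random_state=20)

# Same seed, same shuffle: the two training sets are identical
same_split = np.array_equal(first[0], second[0])
print(first[0].shape, first[1].shape, same_split)
```

Fixing the seed makes the reported accuracies reproducible between runs; a different seed would give a different, equally valid split.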

In [23]:
x_test

Unnamed: 0,Transportation Expense,Age,Body Mass Index,Children,Pets,Month,Reason 1,Reason 2,Reason 3,Reason 4,Education,ID,Date
535,-0.654143,0.248310,1.002633,-0.919030,-0.589690,1.324766,0,0,0,1,0,3,2017-11-10
281,1.036026,0.562059,-0.408580,-0.019280,0.268487,0.753746,0,0,0,1,0,15,2016-09-16
324,0.190942,1.032682,2.649049,-0.019280,-0.589690,1.324766,0,0,0,1,0,5,2016-11-14
645,-0.654143,0.248310,1.002633,-0.919030,-0.589690,-0.959313,0,0,0,1,0,3,2018-03-23
10,0.568211,-0.065439,-0.878984,2.679969,-0.589690,0.182726,1,0,0,0,0,20,2015-07-20
...,...,...,...,...,...,...,...,...,...,...,...,...,...
136,1.005844,-0.536062,0.767431,0.880469,0.268487,-1.530333,0,0,0,1,0,11,2016-01-28
430,2.092381,-1.320435,0.061825,-0.019280,2.843016,-0.388293,0,0,0,1,0,10,2017-05-11
32,0.190942,0.091435,0.532229,-0.019280,0.268487,0.468236,0,0,0,1,1,1,2015-08-27
449,0.356940,0.718933,-0.878984,-0.919030,-0.589690,-0.102784,0,0,0,1,0,24,2017-06-13


## Train the Model

In [24]:
# instantiate the regression object
logistic_regression = LogisticRegression()

In [25]:
# fit the regression to the training data and targets excluding the ID and the Date
logistic_regression.fit(x_train.iloc[:,:-2], y_train.iloc[:,-1])

LogisticRegression()

In [26]:
# Evaluate the training accuracy
logistic_regression.score(x_train.iloc[:,:-2], y_train.iloc[:,-1])

0.7732142857142857

### Check the accuracy on the training data manually

In [27]:
# extract the model predictions and count how many match the targets
model_outputs = logistic_regression.predict(x_train.iloc[:,:-2])
np.sum(model_outputs == y_train.iloc[:,-1])

433

In [28]:
# Divide the number of correct predictions by the total to get the accuracy
np.sum(model_outputs == y_train.iloc[:,-1])/model_outputs.shape[0]

0.7732142857142857

## Summary Table

In [29]:
# Extract the intercept or "bias"
intercept = logistic_regression.intercept_[0]

In [30]:
# Extract the coefficients or "weights"
coeffiecients = np.transpose(logistic_regression.coef_)

In [31]:
# corresponding features
features = scaled_input.iloc[:,:-3].columns.values

In [32]:
summary = pd.DataFrame(columns=['Feature'], data = features)
summary['Coefficient'] = coeffiecients
summary

Unnamed: 0,Feature,Coefficient
0,Transportation Expense,0.605284
1,Age,-0.169891
2,Body Mass Index,0.279811
3,Children,0.348262
4,Pets,-0.277396
5,Month,0.15893
6,Reason 1,2.800197
7,Reason 2,0.951884
8,Reason 3,3.115553
9,Reason 4,0.839001
10,Education,-0.210533


In [33]:
summary.index = summary.index+1
summary.loc[0] = ['Intercept',intercept]
summary = summary.sort_index()
summary

Unnamed: 0,Feature,Coefficient
0,Intercept,-1.647455
1,Transportation Expense,0.605284
2,Age,-0.169891
3,Body Mass Index,0.279811
4,Children,0.348262
5,Pets,-0.277396
6,Month,0.15893
7,Reason 1,2.800197
8,Reason 2,0.951884
9,Reason 3,3.115553
10,Reason 4,0.839001
11,Education,-0.210533


## Interpretation
Logistic regression coefficients are expressed in log(odds). Taking the exponential of a coefficient therefore yields the odds ratio, which is a more intuitive value.

In [34]:
# transform the coefficients
summary['Odds Ratio'] = np.exp(summary.Coefficient)
# order the features by odds ratio. Values further from 1 indicate more important features.
summary.sort_values('Odds Ratio', ascending = False)

Unnamed: 0,Feature,Coefficient,Odds Ratio
9,Reason 3,3.115553,22.545903
7,Reason 1,2.800197,16.447892
8,Reason 2,0.951884,2.590585
10,Reason 4,0.839001,2.314054
1,Transportation Expense,0.605284,1.831773
4,Children,0.348262,1.416604
3,Body Mass Index,0.279811,1.32288
6,Month,0.15893,1.172256
2,Age,-0.169891,0.843757
11,Education,-0.210533,0.810152


If a coefficient is close to zero (that is, if the odds ratio is close to 1), the feature is not important. A weight of zero implies that the input has no effect on the model output. Negative coefficients are not a problem: they correspond to odds ratios between 0 and 1, meaning the feature lowers the odds of excessive absence. What matters for importance is how far the coefficient is from 0.

For a one-unit change in a standardized feature, the odds are multiplied by the odds ratio (1 = no change).
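The relationship between a coefficient and its odds ratio can be verified by hand. The sketch below copies the intercept and the Transportation Expense coefficient from the summary table and, as a simplifying assumption, holds every other input at 0.

```python
import math

# Values copied from the summary table above
intercept = -1.647455
coef_transport = 0.605284  # Transportation Expense (standardized)

def odds(z):
    """Odds of excessive absence when the standardized feature equals z
    and every other input is held at 0 (an assumption for this example)."""
    log_odds = intercept + coef_transport * z
    return math.exp(log_odds)

odds_at_mean = odds(0.0)
odds_one_std_higher = odds(1.0)

# The ratio of the two odds equals exp(coefficient)
odds_ratio = odds_one_std_higher / odds_at_mean
print(round(odds_ratio, 4))
```

The result agrees with the Odds Ratio column for Transportation Expense (about 1.83): a transportation expense one standard deviation above the mean multiplies the odds of excessive absence by roughly that factor.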

The following features are considered to be less important according to the model (when given all features):
* Daily Work Load Average
* Distance to Work
* weekday
* Month

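This means that, in comparison to the other features, these carry less impact, but they might still be important to the overall accuracy of the model. Note that the first three were already left out of the inputs in the Select the Inputs step, so only Month appears in the summary table above.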

Notice that all the "Reason" features are important in the following order:
* Reason 3 - Misc (poison and other non-normal reasons)
* Reason 1 - Various Diseases
* Reason 2 - Pregnancy and Birth
* Reason 4 - Light Diseases

Following that, Transportation Expense, Children, Body Mass Index, Age, Education and Pets are of intermediate importance as predictors.

Drawbacks: standardized coefficients are difficult to interpret in terms of the original, non-standardized units. There is a tradeoff between increased accuracy (standardizing) and increased interpretability (when the underlying reasons are of more interest than the accuracy of the model). Since we are focusing on accurate predictions, standardizing is preferable in this case.
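One way to recover some interpretability is to convert a standardized coefficient back to the original units by dividing it by the feature's standard deviation. The sketch below demonstrates this on synthetic data (a single made-up feature, not the absenteeism dataset) and checks that the converted model gives the same probabilities.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic raw feature (for example a cost in currency units) and labels
raw = rng.normal(loc=220.0, scale=65.0, size=(500, 1))
true_prob = 1.0 / (1.0 + np.exp(-(raw[:, 0] - 220.0) / 65.0))
labels = (rng.random(500) < true_prob).astype(int)

demo_scaler = StandardScaler().fit(raw)
demo_model = LogisticRegression().fit(demo_scaler.transform(raw), labels)

# Coefficient per standard deviation (what the fitted model reports)
coef_per_std = demo_model.coef_[0]
# Coefficient and intercept expressed per raw unit
coef_per_unit = coef_per_std / demo_scaler.scale_
intercept_raw = demo_model.intercept_[0] - np.sum(
    coef_per_std * demo_scaler.mean_ / demo_scaler.scale_
)

# Both parameterisations give the same predicted probabilities
p_scaled = demo_model.predict_proba(demo_scaler.transform(raw))[:, 1]
p_raw = 1.0 / (1.0 + np.exp(-(raw @ coef_per_unit + intercept_raw)))

print(np.exp(coef_per_std), np.exp(coef_per_unit))
```

The first printed value is the odds ratio per standard deviation and the second is the odds ratio per raw unit, which is usually the easier figure to communicate.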

## Backward Elimination
We can simplify the model by eliminating features that contribute very little to its accuracy. Conventionally, features whose coefficients have p-values greater than 0.05 may be eliminated. However, sklearn does not report p-values. Instead, we can infer that a coefficient close to zero renders an input nearly obsolete.
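A minimal sketch of this coefficient-based elimination is shown below. It uses synthetic standardized inputs (two informative columns and one noise column) rather than the absenteeism data, and the 0.2 threshold is an arbitrary assumption chosen for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n_samples = 5000

# Synthetic standardized inputs: two informative columns and one pure-noise column
features = rng.normal(size=(n_samples, 3))
log_odds = 2.0 * features[:, 0] - 1.5 * features[:, 1]
labels = (rng.random(n_samples) < 1.0 / (1.0 + np.exp(-log_odds))).astype(int)

full_model = LogisticRegression().fit(features, labels)
full_accuracy = full_model.score(features, labels)

# Keep only inputs whose coefficient is not close to zero (arbitrary threshold)
threshold = 0.2
keep_mask = np.abs(full_model.coef_[0]) > threshold

# Refit on the remaining inputs and compare the accuracy
reduced_model = LogisticRegression().fit(features[:, keep_mask], labels)
reduced_accuracy = reduced_model.score(features[:, keep_mask], labels)

print(keep_mask, round(full_accuracy, 3), round(reduced_accuracy, 3))
```

If the accuracy barely changes after refitting, the simpler model is preferable. Comparing coefficient magnitudes in this way is only meaningful when the inputs are on a common scale, which is one reason the inputs were standardized.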

## Test the model

In [35]:
logistic_regression.score(x_test.iloc[:,:-2], y_test.iloc[:,-1])

0.75

In [36]:
absence_probaility = x_test[['ID','Date']].copy()
absence_probaility['Excessive Absence Probability'] = logistic_regression.predict_proba(x_test.iloc[:,:-2])[:,1]

In [37]:
absence_probaility

Unnamed: 0,ID,Date,Excessive Absence Probability
535,3,2017-11-10,0.286596
281,15,2016-09-16,0.412758
324,5,2016-11-14,0.559792
645,3,2018-03-23,0.218405
10,20,2015-07-20,0.915891
...,...,...,...
136,11,2016-01-28,0.523761
430,10,2017-05-11,0.460729
32,1,2015-08-27,0.315010
449,24,2017-06-13,0.243550


The model was 75% accurate on the test data. The table above shows the predicted probability of each individual in the test data being excessively absent on a particular date in the past.

We can use this by feeding data into the model "on the day" an individual needs to be absent, in order to "guess" how long that individual will be absent for. In other words, will the individual be absent for longer than the median (3 hours) on that day?
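Scoring a single new record "on the day" could follow the pattern sketched below. The model here is fitted on synthetic data with two numeric inputs and one dummy variable, so the column meanings and values are hypothetical; only the order of operations (scale the numeric inputs with the fitted scaler, append the dummies, call `predict_proba`) mirrors the steps in this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n_rows = 400

# Synthetic history: two numeric inputs and one 0/1 dummy (hypothetical columns)
numeric = rng.normal(loc=[220.0, 36.0], scale=[65.0, 6.0], size=(n_rows, 2))
dummy = rng.integers(0, 2, size=(n_rows, 1))
log_odds = 0.01 * (numeric[:, 0] - 220.0) + 2.0 * dummy[:, 0] - 1.0
labels = (rng.random(n_rows) < 1.0 / (1.0 + np.exp(-log_odds))).astype(int)

# Fit the scaler on the numeric inputs only, then train on scaled inputs plus dummy
demo_scaler = StandardScaler().fit(numeric)
train_inputs = np.hstack([demo_scaler.transform(numeric), dummy])
demo_model = LogisticRegression().fit(train_inputs, labels)

# A new record arriving "on the day": same column order as in training
new_numeric = np.array([[289.0, 33.0]])
new_dummy = np.array([[1]])
new_inputs = np.hstack([demo_scaler.transform(new_numeric), new_dummy])

probability = demo_model.predict_proba(new_inputs)[0, 1]
is_excessive = int(probability > 0.5)
print(round(probability, 3), is_excessive)
```

The 0.5 cut-off matches the default used by `predict`; a manager who prefers to be warned earlier could choose a lower cut-off at the cost of more false alarms.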