# Analysis of Employee Absenteeism

This project analyzes employee absenteeism at a company during working hours.

Absenteeism is absence from work during normal working hours, resulting in temporary incapacity to execute regular working activity.

*_The motivation behind this project is to apply what I have learned in the area of Data Science._*

## _Install dependencies_

In [None]:
!pip install -r requirements.txt

## Importing Libraries


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import tensorflow as tf
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from keras.callbacks import ModelCheckpoint
from keras.models import Sequential
from keras.layers import Dense, Activation, Flatten
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, recall_score

%matplotlib inline
sns.set()

### Reading data and Exploring data 


In [None]:
raw_data = pd.read_csv("data/Absenteeism-data.csv")

In [None]:
raw_data.drop("ID",axis=1, inplace=True)

In [None]:
raw_data.describe(include="all")

In [None]:
data_with_dummies = raw_data.copy()
reasons = pd.get_dummies(data_with_dummies["Reason for Absence"],drop_first=True)

In [None]:
data_with_dummies.describe(include="all")

The histogram of the age distribution shows that employees aged 27-30 and 35-40 make up the largest shares.

In [None]:
data_with_dummies.hist(column=["Age"]);

In [None]:
data_with_dummies["Daily Work Load Average"].plot(kind='hist', color='blue')
plt.title("Daily Work Load Average")
plt.xlabel('Hour', fontsize=13);

## Reason for absence:
There are 28 different reasons for absence from work. We inspect the dummy columns to check which labels are available for classification.


In [None]:
reasons.head()


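The plot shows the mean transportation expense and the mean distance to work for each value of absenteeism time in hours. The 7-hour group stands out with the highest values, including the highest distance to work. This suggests that distance is an important factor for absenteeism.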

In [None]:

data_with_dummies[['Transportation Expense', 'Distance to Work','Absenteeism Time in Hours']].groupby(['Absenteeism Time in Hours']).mean().plot(kind='bar', figsize=(14, 8), title='Absenteeism Time in Hours');

In [None]:
data_with_dummies[data_with_dummies.Age < 40].select_dtypes(include = ['float64', 'int64']).groupby('Age').agg(['count', 'mean']).transpose()

### We can drop the original column, since its values are kept as dummy columns

In [None]:
data_with_dummies = data_with_dummies.drop("Reason for Absence", axis =1)

### Group the Reasons for Absence:
    - Manually group similar reasons in order to decrease the dimensionality of the data
    - The grouping is based on the actual reasons mentioned above

In [None]:
# The dummy columns are labelled with the integer reason codes, so slice with integers
reason_type1 = reasons.loc[:, 1:14].max(axis=1)
reason_type2 = reasons.loc[:, 15:17].max(axis=1)
reason_type3 = reasons.loc[:, 18:21].max(axis=1)
reason_type4 = reasons.loc[:, 22:].max(axis=1)
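
The grouping idea can be sketched on a toy column. The reason codes below are hypothetical sample values, and the sketch assumes the codes are stored as integers, so the dummy columns carry integer labels. A row-wise `max` over a block of dummy columns is 1 whenever any column in that block is 1.

```python
import pandas as pd

# Hypothetical toy data standing in for the real "Reason for Absence" column.
toy = pd.DataFrame({"Reason for Absence": [0, 1, 15, 18, 22, 1]})

# drop_first=True removes the column for code 0, so a row with code 0 is all zeros.
toy_reasons = pd.get_dummies(toy["Reason for Absence"], drop_first=True)

# Label-based slices on the sorted integer columns select each block of codes.
group_1 = toy_reasons.loc[:, 1:14].max(axis=1).astype(int)
group_4 = toy_reasons.loc[:, 22:].max(axis=1).astype(int)

print(group_1.tolist())  # [0, 1, 0, 0, 0, 1]
print(group_4.tolist())  # [0, 0, 0, 0, 1, 0]
```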

## Concatenate the Column Values
After grouping similar reasons, we can rebuild the dataframe from the grouped columns. This also lets us compare the grouped reasons with the other features before we train the model.

In [None]:
data_with_dummies = pd.concat([data_with_dummies,reason_type1, reason_type2, reason_type3,reason_type4], axis=1)

In [None]:
data_with_dummies

In [None]:
data_with_dummies.columns.values

#### We can rename our columns for reasons type

In [None]:
column_names = ['Date', 'Transportation Expense', 'Distance to Work', 'Age',
       'Daily Work Load Average', 'Body Mass Index', 'Education',
       'Children', 'Pets', 'Absenteeism Time in Hours', "reason_type1", "reason_type2", "reason_type3", "reason_type4"]


In [None]:
data_with_dummies.columns= column_names

In [None]:
data_with_dummies.columns.values

## Finalizing the columns
 - Reordering the columns for legibility
 - The data type of each column should be mapped correctly. For example, the values of the date column should be pandas `Timestamp` objects (dtype `datetime64`) rather than strings
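
As a minimal sketch of the date conversion, the snippet below parses a few hypothetical day/month/year strings and then reads the month and weekday through the `.dt` accessor. The sample dates are made up for illustration.

```python
import pandas as pd

# Hypothetical sample dates in day/month/year order, stored as strings.
dates = pd.Series(["07/07/2015", "14/07/2015", "01/12/2016"])

# An explicit format avoids day/month ambiguity.
parsed = pd.to_datetime(dates, format="%d/%m/%Y")

print(type(parsed[0]).__name__)    # Timestamp
print(parsed.dt.month.tolist())    # [7, 7, 12]
print(parsed.dt.weekday.tolist())  # Monday is 0, Sunday is 6
```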
 

In [None]:
columns = ['reason_type1',
       'reason_type2', 'reason_type3', 'reason_type4','Date', 'Transportation Expense', 'Distance to Work', 'Age',
       'Daily Work Load Average', 'Body Mass Index', 'Education',
       'Children', 'Pets', 'Absenteeism Time in Hours' ]

In [None]:
data_with_dummies = data_with_dummies[columns]

In [None]:
data_with_dummies.columns.values

In [None]:
type(data_with_dummies["Date"][0])

In [None]:
data_mod = data_with_dummies.copy()

In [None]:
data_mod["Date"] = pd.to_datetime(data_mod["Date"], format="%d/%m/%Y")

In [None]:
data_mod

In [None]:
type(data_mod["Date"][0])

## Check for class balance
The model used in this analysis works best when the classes of the label are balanced. After grouping, there are four reason types, so the feature dataset has four reason columns. The summary below confirms the column types and the non-null counts.

In [None]:
data_mod.info()

### Checking and correcting the data types of columns

In [None]:
# Extract the month number (1-12) from each date
months = data_mod["Date"].dt.month.tolist()

In [None]:
months

In [None]:
len(months)

In [None]:
data_mod["months"] = months

In [None]:
data_mod.head()

In [None]:
data_mod["Date"][699].weekday()

In [None]:
def date_to_weekday(date_value):
    return date_value.weekday()

In [None]:
data_mod["Day of the Week"] = data_mod["Date"].apply(date_to_weekday)

In [None]:
data_mod

In [None]:
data_mod = data_mod.drop("Date", axis=1)

In [None]:
data_mod

In [None]:
data_mod.columns.values

In [None]:
new_cols = ['reason_type1', 'reason_type2', 'reason_type3', 'reason_type4', 'months',
       'Day of the Week',
       'Transportation Expense', 'Distance to Work', 'Age',
       'Daily Work Load Average', 'Body Mass Index', 'Education',
       'Children', 'Pets', 'Absenteeism Time in Hours']
# Take an explicit copy so later column assignments do not raise SettingWithCopyWarning
data_final = data_mod[new_cols].copy()

In [None]:
data_final.head(40)

In [None]:
data_final["Daily Work Load Average"].unique()

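For the Education column: 1: High school, 2: Graduate, 3: Post Graduate, 4: Master or PhD. The mapping below collapses these into a binary flag: 0 for high school and 1 for any higher education.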
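
A small sketch of the same mapping on hypothetical sample codes shows the effect of `Series.map` with a dictionary:

```python
import pandas as pd

# Hypothetical sample of education codes (1 = high school, 2-4 = higher education).
education = pd.Series([1, 3, 1, 2, 4])

# Every code is replaced by its dictionary value.
is_higher_education = education.map({1: 0, 2: 1, 3: 1, 4: 1})

print(is_higher_education.tolist())  # [0, 1, 0, 1, 1]
```

Codes that are missing from the dictionary would become `NaN`, so it is worth checking `value_counts()` after the mapping.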

In [None]:
data_final["Education"] = data_final["Education"].map({1:0,2:1,3:1,4:1})

In [None]:
data_final["Education"].value_counts()

## Exploratory Data Analysis

In [None]:
# Visualization of Education and Pets
pd.crosstab(data_final.Education, data_final.Pets).plot(kind = 'bar', color = ['red', 'green', 'blue', 'black'], title = 'Education and Pets Exploration')
plt.xlabel('Education', fontsize = 13)
plt.ylabel('Number of Employees', fontsize = 13)
# Education was mapped to two values above: 0 (high school) and 1 (higher education)
plt.xticks([0, 1], ['High School', 'Higher Education'], rotation = 0)
# There is one bar per pet count, so the legend shows the number of pets
plt.legend(title = 'Number of Pets');

### Highly correlated variables
Pets ~ Distance to work

Pets ~ Children

Age ~ Children

In [None]:
train_numerical = data_final[["Education", "Age", "Children","Pets",
                         "Daily Work Load Average", "Distance to Work"]]
corr = train_numerical.corr()
sns.heatmap(corr, 
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values, cmap='Blues')
plt.title('Correlation Heatmap of Numeric Features');

A pair grid plots the pairwise relationships in our dataset. `PairGrid` creates a grid of Axes such that each variable in our data is shared on the y-axis across a single row and on the x-axis across a single column.

In [None]:
# Pair grid of key variables.
g = sns.PairGrid(data_final, vars=["Education", "Distance to Work", "Age","Pets", 
                           "Day of the Week", "Transportation Expense"], 
                 palette='OrRd', hue='Absenteeism Time in Hours')
g.map_diag(plt.hist)
g.map_offdiag(plt.scatter)
plt.subplots_adjust(top=0.95)
g.fig.suptitle('Pairwise Grid of Numeric Features');

## Final Checkpoint

In [None]:
data_final.head()

## Targets
The pandas `median` method helps us in this section. The median absenteeism time is 3.0 hours. In the cells below, every value at or below the median is considered normal, and every value above the median is considered excessive.
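
The cutoff rule can be sketched on a few hypothetical values. Note that the comparison is strict, so a value equal to the median falls into class 0.

```python
import numpy as np
import pandas as pd

# Hypothetical absenteeism hours, not taken from the real dataset.
hours = pd.Series([0, 2, 3, 3, 4, 8, 40])

cutoff = hours.median()

# 1 marks excessive absenteeism (strictly above the median), 0 marks normal.
targets = np.where(hours > cutoff, 1, 0)

print(cutoff)            # 3.0
print(targets.tolist())  # [0, 0, 0, 0, 1, 1, 1]
```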

In [None]:
targets = np.where(data_final["Absenteeism Time in Hours"] > data_final["Absenteeism Time in Hours"].median(),1,0)

In [None]:
data_final["Excessive Absenteeism"] = targets

## Comment on Targets
Using the median as a cutoff is a simple and robust choice, because it implicitly balances the dataset: roughly half of the targets are 0s and the other half are 1s. This prevents our model from learning to output one of the two classes exclusively. The total number of targets is the shape along axis zero, so dividing the sum of the targets by it gives the share of 1s. The result is around 0.46, so around 46 percent of the targets are 1s and around 54 percent are 0s. A split of up to about 60/40 usually works equally well for logistic regression.
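
The split is not exactly 50/50 because several observations can be equal to the median, and all of them land in class 0. A small sketch with hypothetical values illustrates this:

```python
import numpy as np

# Hypothetical hours with several values tied at the median (3).
hours = np.array([0, 2, 3, 3, 3, 3, 4, 8, 16, 40])

targets = np.where(hours > np.median(hours), 1, 0)

# Share of observations labelled as excessive.
share_of_ones = targets.sum() / targets.shape[0]

print(np.median(hours))  # 3.0
print(share_of_ones)     # 0.4
```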

In [None]:
targets.sum() / targets.shape[0]

In [None]:
data_with_targets = data_final.drop(["Absenteeism Time in Hours","Day of the Week", "Daily Work Load Average", "Distance to Work"], axis=1)

In [None]:
data_with_targets.head()

## Inputs for the Regression

In [None]:
data_with_targets.shape

In [None]:
data_with_targets.iloc[:,:-1]

## Standardize the Data
There are several ways to standardize data. Here the relevant module is imported from sklearn. For each variable, it subtracts the mean and divides by the standard deviation.
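
As a sanity check, the formula can be compared with `StandardScaler` on a hypothetical single-column array. The manual version uses the population standard deviation (NumPy's default `ddof=0`), which is what `StandardScaler` uses as well.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature column.
x = np.array([[1.0], [2.0], [3.0], [4.0]])

# Manual standardization: subtract the column mean, divide by the column std.
manual = (x - x.mean(axis=0)) / x.std(axis=0)

scaled = StandardScaler().fit_transform(x)

print(np.allclose(manual, scaled))  # True
```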

In [None]:
unscaled_inputs = data_with_targets.iloc[:,:-1]

In [None]:
targets_data = data_with_targets.iloc[:,-1]

In [None]:
unscaled_inputs

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
#absenteeism_scaler = StandardScaler()

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler

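class CustomScaler(BaseEstimator, TransformerMixin):
    """Standardize only the selected columns and leave the others untouched."""

    def __init__(self, columns, copy=True, with_mean=True, with_std=True):
        # Store the parameters so that get_params() and the estimator repr work
        self.columns = columns
        self.copy = copy
        self.with_mean = with_mean
        self.with_std = with_std
        # Recent scikit-learn versions accept these options as keyword arguments only
        self.scaler = StandardScaler(copy=copy, with_mean=with_mean, with_std=with_std)
        self.mean_ = None
        self.var_ = None

    def fit(self, X, y=None):
        self.scaler.fit(X[self.columns], y)
        self.mean_ = X[self.columns].mean()
        # Population variance, matching what StandardScaler uses
        self.var_ = X[self.columns].var(ddof=0)
        return self

    def transform(self, X, y=None, copy=None):
        init_col_order = X.columns
        # Keep the original index so that the concat below aligns row by row
        X_scaled = pd.DataFrame(self.scaler.transform(X[self.columns]),
                                columns=self.columns, index=X.index)
        X_not_scaled = X.loc[:, ~X.columns.isin(self.columns)]
        return pd.concat([X_not_scaled, X_scaled], axis=1)[init_col_order]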

In [None]:
unscaled_inputs.columns.values


In [None]:
#columns_to_scale = ['months', 'Day of the Week', 'Transportation Expense',
 #      'Distance to Work', 'Age', 'Daily Work Load Average',
  #     'Body Mass Index', 'Education', 'Children', 'Pets'] 

In [None]:
columns_to_omit = ['reason_type1', 'reason_type2', 'reason_type3', 'reason_type4','Education']

In [None]:
columns_to_scale = [x for x in unscaled_inputs.columns.values if x not in columns_to_omit]

In [None]:
absenteeism_scaler =CustomScaler(columns_to_scale)

In [None]:
absenteeism_scaler.fit(unscaled_inputs)

In [None]:
scaled_inputs = absenteeism_scaler.transform(unscaled_inputs)

In [None]:
scaled_inputs

In [None]:
scaled_inputs.shape

## Split the Data into Train & Test and Shuffle
Sklearn has a convenient function for splitting the data into train and test sets; to use it, we must import it. The train size is set to 0.8. This means that 80% of the data will be used for training and 20% for testing.
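
A minimal sketch on hypothetical data shows how the sizes work out. With 10 rows and `train_size=0.8`, 8 rows go to training and 2 to testing. Rows are shuffled by default, and `random_state` makes the shuffle reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical inputs (10 rows, 2 features) and binary targets.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=42
)

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```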

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
x_train, x_test, y_train, y_test = train_test_split(scaled_inputs,targets_data,train_size=0.8, random_state=42)

## Logistic Regression with sklearn
Many numerical issues arise in the background when a machine learning model is fitted. Libraries such as statsmodels may not always be numerically stable for more complicated models. That's why sklearn is used for this model.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics


In [None]:
y_train.shape

## Training the Model

In [None]:
reg = LogisticRegression()

In [None]:
reg.fit(x_train,y_train)

###### We conclude that our model has a training accuracy of about 80%. In other words, the model learned to classify about 80 percent of the training observations correctly.

In [None]:
reg.score(x_train,y_train)

## Manually Check the Accuracy
This is needed for two reasons. First, it is always good to fully understand what we are doing, and second, we will use this idea later on.
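
The following sketch, on randomly generated hypothetical data, shows that the manual calculation matches what `score` returns for a classifier, namely the share of correct predictions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: 100 rows, 3 features, a noisy binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(int)

model = LogisticRegression().fit(X, y)

# Accuracy by hand: number of correct predictions divided by number of rows.
manual_accuracy = np.sum(model.predict(X) == y) / y.shape[0]

print(np.isclose(manual_accuracy, model.score(X, y)))  # True
```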

In [None]:
model_outputs = reg.predict(x_train)

In [None]:
model_outputs == y_train

###### Train accuracy:

In [None]:
np.sum(model_outputs == y_train) / model_outputs.shape[0]

## Intercept and Coefficients

In [None]:
reg.intercept_


In [None]:
reg.coef_

In [None]:
feature_name = unscaled_inputs.columns.values

In [None]:
unscaled_inputs.columns.values

## Summary Table

In [None]:
summary_table = pd.DataFrame(data = feature_name, columns=["Feature Names"])

In [None]:
summary_table["Coefficient"] = np.transpose(reg.coef_)

In [None]:
summary_table

In [None]:
summary_table.index = summary_table.index + 1

In [None]:
summary_table.drop(1, axis=0)

In [None]:
summary_table

In [None]:
summary_table.loc[0] = ["intercepts", reg.intercept_[0]]
summary_table = summary_table.sort_index()

In [None]:
summary_table

There are coefficient values and standardized coefficient values. Standardized coefficients are the coefficients of a regression in which all variables have been standardized. Other software packages report them because they allow a simple, easy-to-understand comparison between the variables. Since our scaled features are standardized, their coefficients can be compared in the same way.

In [None]:
summary_table["Odds_ratio"] = np.exp(summary_table.Coefficient)

If a coefficient is around zero or its odds ratio is close to 1, this means that the corresponding feature is not particularly important.
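
In logistic regression the log-odds are linear in the inputs, so a one-unit increase in a feature multiplies the odds by the exponential of its coefficient. A short sketch with hypothetical coefficients shows why an odds ratio of 1 means no effect:

```python
import numpy as np

# Hypothetical coefficients: negative, zero and positive.
coefficients = np.array([-0.7, 0.0, 0.7])

odds_ratios = np.exp(coefficients)

# A zero coefficient leaves the odds unchanged (ratio of exactly 1).
print(odds_ratios[1])    # 1.0
# Coefficients of equal size and opposite sign give reciprocal odds ratios.
print(np.isclose(odds_ratios[0] * odds_ratios[2], 1.0))  # True
```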

In [None]:
summary_table.sort_values(by="Odds_ratio", ascending=False)

## Testing the Model

In [None]:
reg.score(x_train,y_train)

In [None]:
reg.score(x_test,y_test)

The test accuracy is close to the train accuracy. This suggests that the model did not overfit: it generalizes to unseen data about as well as it fits the training data.

###### Shapes of the train and test sets

In [None]:
x_test.shape


In [None]:
x_train.shape

In [None]:
y_test.shape

In [None]:
y_train.shape

The first column shows the probability the model assigned to the observation being 0, and the second column shows the probability the model assigned to the observation being 1.
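
The layout of the probability array can be sketched on hypothetical data. The columns follow the order of `classes_`, each row sums to 1, and thresholding the second column at 0.5 reproduces `predict` for a binary model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: 60 rows, 2 features, a noisy binary target.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 2))
y = (X[:, 0] + 0.5 * rng.normal(size=60) > 0).astype(int)

model = LogisticRegression().fit(X, y)
proba = model.predict_proba(X[:5])

print(model.classes_)                       # column order of proba
print(np.allclose(proba.sum(axis=1), 1.0))  # each row sums to 1

# Class 1 is predicted when its probability exceeds 0.5.
predicted = (proba[:, 1] > 0.5).astype(int)
print(np.array_equal(predicted, model.predict(X[:5])))
```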

In [None]:
predict_proba =reg.predict_proba(x_test)
predict_proba

###### This gives us the probabilities of excessive absenteeism.

In [None]:
predict_proba[:,1]

### Exporting the results to pickle for archive 

In [None]:
import pickle

In [None]:
with open("model", "wb") as file:
    pickle.dump(reg,file)

In [None]:
with open("absenteeism", "wb") as file:
    pickle.dump(absenteeism_scaler,file)
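
To check that a pickled model can be restored, a round trip can be sketched as follows. This example uses a hypothetical toy model and pickles it in memory with `pickle.dumps` instead of writing a file. Pickled models should be loaded with the same scikit-learn version that created them, and only from trusted sources.

```python
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy model standing in for the trained regression.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X, y)

# Serialize to bytes and restore, in place of the open(...)/dump/load file round trip.
restored = pickle.loads(pickle.dumps(model))

print(np.array_equal(restored.predict(X), model.predict(X)))  # True
```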

## Conclusion
This study has presented an analysis of employee absenteeism, which depends on several factors. From the model and the exploratory analysis, we noticed that children, pets, distance to work and transportation expenses appear to have a notable effect on absenteeism.

 ® Hasan Kaya 2020
 

*The GitHub repository can be found [here](https://github.com/mrhasankaya/Data-Analysis-Productivity)*