## **Business Case Study: Predictive Absenteeism Analysis for Improved Workforce Productivity**

**Introduction:**
In today's highly competitive business environment, organizations face increased pressure to meet ever more demanding goals, which raises stress levels among employees. Sustained stress may adversely affect an individual's health, potentially resulting in minor illnesses or even long-term conditions such as depression. Recognizing the importance of maintaining a healthy and productive workforce, our focus is on predicting absenteeism from work, specifically whether an employee is likely to be absent for a certain number of hours during a workday.

**Objective:**
The primary objective is to leverage predictive analytics to anticipate employee absenteeism. By doing so, we aim to empower decision-makers with valuable insights into workforce availability, enabling proactive adjustments to the work process. The goal is to mitigate productivity gaps and enhance the overall quality of work generated within the organization.

**Defining Absenteeism:**
*"The absence from work during normal working hours, resulting in temporary incapacity to execute regular working activity."*

**Key Questions to Address:**
1. **Data Source and Information:** What information should be considered for predicting absenteeism? Should the focus be on predicting excessive absenteeism?
   
2. **Measurement of Absenteeism:** How will absenteeism be measured, and what parameters will be used to determine excessive absenteeism?

**Approach:**
Our analysis will focus on exploring the relationship between specific employee characteristics and the likelihood of being absent from work. Characteristics such as the distance an employee lives from their workplace, the number of dependents (children and pets), educational background, and other relevant factors will be considered. By understanding these associations, we aim to predict the number of working hours an employee could potentially be away from work.

**Utilizing Machine Learning:**
To achieve our predictive goals, we will employ machine learning techniques, including logistic regression and other classification models. These models will be trained on historical data to identify patterns and make predictions about future absenteeism. The model with the highest performance, determined through rigorous evaluation metrics, will be selected for implementation.

**Expected Outcomes:**
1. **Proactive Workforce Management:** Anticipating absenteeism allows for strategic workforce planning, minimizing disruptions to productivity.
   
2. **Quality Improvement:** By reorganizing work processes based on predicted absenteeism, the organization can enhance the quality of work generated.



In [1]:
# Import python libraries
import numpy as np
import pandas as pd


# To display all rows and columns in our results
pd.options.display.max_rows = None
pd.options.display.max_columns = None

In [2]:
# load the dataset
raw_data = pd.read_csv('Absenteeism_data.csv')

df = raw_data.copy()

In [3]:
# Displaying the first five rows in the dataset
df.head()

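|   | ID | Reason for Absence | Date | Transportation Expense | Distance to Work | Age | Daily Work Load Average | Body Mass Index | Education | Children | Pets | Absenteeism Time in Hours |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11 | 26 | 07/07/2015 | 289 | 36 | 33 | 239.554 | 30 | 1 | 2 | 1 | 4 |
| 1 | 36 | 0 | 14/07/2015 | 118 | 13 | 50 | 239.554 | 31 | 1 | 1 | 0 | 0 |
| 2 | 3 | 23 | 15/07/2015 | 179 | 51 | 38 | 239.554 | 31 | 1 | 0 | 0 | 2 |
| 3 | 7 | 7 | 16/07/2015 | 279 | 5 | 39 | 239.554 | 24 | 1 | 2 | 0 | 4 |
| 4 | 11 | 23 | 23/07/2015 | 289 | 36 | 33 | 239.554 | 30 | 1 | 2 | 1 | 2 |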


In [4]:
# Understanding the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 700 entries, 0 to 699
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   ID                         700 non-null    int64  
 1   Reason for Absence         700 non-null    int64  
 2   Date                       700 non-null    object 
 3   Transportation Expense     700 non-null    int64  
 4   Distance to Work           700 non-null    int64  
 5   Age                        700 non-null    int64  
 6   Daily Work Load Average    700 non-null    float64
 7   Body Mass Index            700 non-null    int64  
 8   Education                  700 non-null    int64  
 9   Children                   700 non-null    int64  
 10  Pets                       700 non-null    int64  
 11  Absenteeism Time in Hours  700 non-null    int64  
dtypes: float64(1), int64(10), object(1)
memory usage: 65.8+ KB


In [5]:
# Drop column not needed in the analysis such as 'ID' column
df = df.drop(['ID'], axis = 1)

In [6]:
# Dealing with categorical variables (Dummy variables) such as Reason for Absence and Education

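# Create dummy variables for the 'Reason for Absence' column in a new dataframe
# drop_first=True removes reason 0, which becomes the baseline category
# dtype=int keeps the dummies as 0/1 (recent pandas versions default to True/False)
reason_columns = pd.get_dummies(df['Reason for Absence'], drop_first = True, dtype = int)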

In [7]:
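# Grouping the reasons into 4 groups
# Note: .loc slices by label and includes both endpoints, so reason 21 is
# counted in both reason_type_3 and reason_type_4 here; starting the last
# slice at 22 would make the four groups disjoint.
reason_type_1 = reason_columns.loc[:,:14].max(axis = 1)
reason_type_2 = reason_columns.loc[:,15:17].max(axis = 1)
reason_type_3 = reason_columns.loc[:,18:21].max(axis = 1)
reason_type_4 = reason_columns.loc[:,21:].max(axis = 1)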
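
Because `.loc` slices by column label and includes both endpoints, the chosen boundaries decide which reason codes land in which group. The following is a minimal sketch on made-up reason codes; the `reasons` series and the `22:` start of the last group are illustrative assumptions, not the notebook's data or its exact slicing.

```python
import pandas as pd

# Hypothetical reason codes, chosen to hit each group boundary
reasons = pd.Series([0, 1, 14, 15, 17, 18, 21, 22, 28], name="Reason for Absence")

# One indicator column per code; drop_first removes code 0 (the baseline)
dummies = pd.get_dummies(reasons, drop_first=True, dtype=int)

# .loc slices by column label and is inclusive on both ends
groups = pd.DataFrame({
    "reason_type_1": dummies.loc[:, 1:14].max(axis=1),
    "reason_type_2": dummies.loc[:, 15:17].max(axis=1),
    "reason_type_3": dummies.loc[:, 18:21].max(axis=1),
    "reason_type_4": dummies.loc[:, 22:].max(axis=1),
})

# With disjoint boundaries every row belongs to at most one group
print(groups.sum(axis=1).tolist())  # [0, 1, 1, 1, 1, 1, 1, 1, 1]
```

Summing the four group columns per row is a quick sanity check: any value above 1 would mean a reason code was assigned to two groups.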

In [8]:
# Concatenate df with the reason types created
df = pd.concat([df, reason_type_1, reason_type_2, reason_type_3, reason_type_4], axis = 1)

# To prevent multicollinearity, we drop the 'Reason for Absence' column
df =  df.drop(['Reason for Absence'], axis = 1)

In [9]:
df.columns.values

array(['Date', 'Transportation Expense', 'Distance to Work', 'Age',
       'Daily Work Load Average', 'Body Mass Index', 'Education',
       'Children', 'Pets', 'Absenteeism Time in Hours', 0, 1, 2, 3],
      dtype=object)

In [10]:
# Renaming reason columns from 0,1,2,3 to reason_type_1, reason_type_2, reason_type_3,reason_type_4 respectively.
column_names = ['Date', 'Transportation Expense', 'Distance to Work', 'Age',
       'Daily Work Load Average', 'Body Mass Index', 'Education',
       'Children', 'Pets', 'Absenteeism Time in Hours','reason_type_1', 'reason_type_2', 'reason_type_3','reason_type_4']

df.columns = column_names

In [11]:
# Mapping the Education column to 0 or 1: 0 for the lowest education level (code 1)
# and 1 for all higher levels (codes 2, 3 and 4, such as postgraduate, master's or PhD)
df['Education'] = df['Education'].map({1:0, 2:1,3:1,4:1})
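
One detail worth checking with a dictionary-based `map` is that any code missing from the dictionary becomes `NaN`. The sketch below uses made-up education codes (the value `5` is a deliberate, hypothetical outlier) to show the effect.

```python
import pandas as pd

# Hypothetical education codes; 5 is deliberately outside the mapping
education = pd.Series([1, 1, 2, 3, 4, 5])

# Any code missing from the dictionary becomes NaN after .map()
binary = education.map({1: 0, 2: 1, 3: 1, 4: 1})

print(binary.tolist())
print(int(binary.isna().sum()))  # number of unmapped codes
```

Counting the `NaN` values right after the mapping makes unexpected codes visible before a later `fillna` silently turns them into 0.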

In [12]:
# Convert 'Date' column to datetime format
df['Date'] = pd.to_datetime(df['Date'], format = '%d/%m/%Y')

# Extract month from the date column (1 = January ... 12 = December)
df['Month'] = df['Date'].dt.month

# Extract weekday from date column (0 = Monday ... 6 = Sunday)
df['Weekday'] = df['Date'].dt.weekday

# Drop the Date column
df = df.drop(['Date'], axis = 1)
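
As a quick check of the date handling, the sketch below parses the first four dates shown in the raw data and extracts the month and weekday with the vectorized `.dt` accessor. The variable names are illustrative.

```python
import pandas as pd

# Dates copied from the first rows of the raw data (day/month/year)
dates = pd.Series(["07/07/2015", "14/07/2015", "15/07/2015", "16/07/2015"])
parsed = pd.to_datetime(dates, format="%d/%m/%Y")

months = parsed.dt.month      # 1 = January ... 12 = December
weekdays = parsed.dt.weekday  # 0 = Monday ... 6 = Sunday

print(months.tolist())    # [7, 7, 7, 7]
print(weekdays.tolist())  # [1, 1, 2, 3]
```

Passing an explicit `format` matters here: without it, a date such as 07/07/2015 is unambiguous, but other day/month combinations could be read month-first.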

In [13]:
# Reorder the columns to follow the initial arrangement, where the reason for absence came first and Absenteeism Time in Hours came last

columns_order = ['reason_type_1', 'reason_type_2', 'reason_type_3','reason_type_4','Month', 'Weekday', 'Transportation Expense', 'Distance to Work', 'Age',
       'Daily Work Load Average', 'Body Mass Index', 'Education',
       'Children', 'Pets', 'Absenteeism Time in Hours']
df = df[columns_order]

In [14]:
# Check if the ordering is accurate by checking the first five rows of dataset
df.head()

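|   | reason_type_1 | reason_type_2 | reason_type_3 | reason_type_4 | Month | Weekday | Transportation Expense | Distance to Work | Age | Daily Work Load Average | Body Mass Index | Education | Children | Pets | Absenteeism Time in Hours |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 7 | 1 | 289 | 36 | 33 | 239.554 | 30 | 0 | 2 | 1 | 4 |
| 1 | 0 | 0 | 0 | 0 | 7 | 1 | 118 | 13 | 50 | 239.554 | 31 | 0 | 1 | 0 | 0 |
| 2 | 0 | 0 | 0 | 1 | 7 | 2 | 179 | 51 | 38 | 239.554 | 31 | 0 | 0 | 0 | 2 |
| 3 | 1 | 0 | 0 | 0 | 7 | 3 | 279 | 5 | 39 | 239.554 | 24 | 0 | 2 | 0 | 4 |
| 4 | 0 | 0 | 0 | 1 | 7 | 3 | 289 | 36 | 33 | 239.554 | 30 | 0 | 2 | 1 | 2 |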


In [15]:
# Addressing the first and second key questions: 'What information should we use to predict absenteeism?' and 'How is absenteeism measured?'
# In this analysis, we focus on predicting excessive absenteeism, using the median of 'Absenteeism Time in Hours' as the cutoff.
# Any absenteeism time strictly greater than the median is considered excessive absenteeism.

# Find the median of the 'Absenteeism Time in Hours' column
median = df['Absenteeism Time in Hours'].median()

# Map the column to '0' for not excessive absenteeism and '1' for excessive
df['Target'] = (df['Absenteeism Time in Hours'] > median).astype(int)
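
The median cutoff can be illustrated on a small, made-up series of absence durations (the `hours` values below are hypothetical, not taken from the dataset).

```python
import pandas as pd

# Hypothetical absence durations in hours
hours = pd.Series([0, 1, 2, 3, 4, 8])

cutoff = hours.median()                # 2.5 for this toy series
target = (hours > cutoff).astype(int)  # 1 = excessive, 0 = not excessive

print(cutoff)
print(target.tolist())  # [0, 0, 0, 1, 1, 1]
print(target.mean())    # share of rows labelled excessive
```

Because the comparison is strict, rows exactly equal to the median are labelled 0, so the share of 1s can fall below 50% when many rows share the median value.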

In [16]:
# Drop the original 'Absenteeism Time in Hours' column in our dataset since we won't be using it
df = df.drop(['Absenteeism Time in Hours'], axis = 1)

In [17]:
# In case there are missing values or NAN values
df = df.fillna(value=0)

In [18]:
# creating a checkpoint
df_processed = df.copy()

In [19]:
df_processed

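|   | reason_type_1 | reason_type_2 | reason_type_3 | reason_type_4 | Month | Weekday | Transportation Expense | Distance to Work | Age | Daily Work Load Average | Body Mass Index | Education | Children | Pets | Target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 7 | 1 | 289 | 36 | 33 | 239.554 | 30 | 0 | 2 | 1 | 1 |
| 1 | 0 | 0 | 0 | 0 | 7 | 1 | 118 | 13 | 50 | 239.554 | 31 | 0 | 1 | 0 | 0 |
| 2 | 0 | 0 | 0 | 1 | 7 | 2 | 179 | 51 | 38 | 239.554 | 31 | 0 | 0 | 0 | 0 |
| 3 | 1 | 0 | 0 | 0 | 7 | 3 | 279 | 5 | 39 | 239.554 | 24 | 0 | 2 | 0 | 1 |
| 4 | 0 | 0 | 0 | 1 | 7 | 3 | 289 | 36 | 33 | 239.554 | 30 | 0 | 2 | 1 | 0 |
| 5 | 0 | 0 | 0 | 1 | 7 | 4 | 179 | 51 | 38 | 239.554 | 31 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 0 | 1 | 7 | 4 | 361 | 52 | 28 | 239.554 | 27 | 0 | 1 | 4 | 1 |
| 7 | 0 | 0 | 0 | 1 | 7 | 4 | 260 | 50 | 36 | 239.554 | 23 | 0 | 4 | 0 | 1 |
| 8 | 0 | 0 | 1 | 0 | 7 | 0 | 155 | 12 | 34 | 239.554 | 25 | 0 | 2 | 0 | 1 |
| 9 | 0 | 0 | 0 | 1 | 7 | 0 | 235 | 11 | 37 | 239.554 | 29 | 1 | 1 | 1 | 1 |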


##  Utilizing Machine Learning

In [20]:
# Defining the target and input variables

X = df_processed.iloc[:, :-1 ]
y = df_processed.iloc[:, -1]

In [21]:
# Check if the dataset is balanced (what fraction of targets are 1s)
# y.sum() gives the number of 1s
# y.shape[0] gives the length of the target array
y.sum() / y.shape[0]

0.45571428571428574

In [22]:
# Splitting data into train and test data
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
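
The split above is not stratified, so the class proportions in the train and test sets can drift from the overall share of 1s. As a hedged alternative, `train_test_split` accepts a `stratify` argument; the sketch below demonstrates it on made-up arrays (`X_demo` and `y_demo` are illustrative, not the notebook's data).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 45% positives
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 55 + [1] * 45)

# stratify=y_demo asks for the same class proportions in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

print(y_tr.mean(), y_te.mean())
```

Whether stratification changes the conclusions here is an open question; it mainly removes one source of split-to-split variation when comparing models.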

In [23]:
# Checking the shape of X train and y train
print(X_train.shape,y_train.shape)

(560, 14) (560,)


In [24]:
# Checking the shape of X test and y test
print(X_test.shape,y_test.shape)

(140, 14) (140,)


In [25]:
# Feature Scaling of the data but we have to exclude the dummy variables to avoid losing the interpretation of the data
from sklearn.preprocessing import StandardScaler

# Extract columns to exclude from feature scaling
exclude_columns = ['reason_type_1','reason_type_2','reason_type_3','reason_type_4', 'Education']

# Extract columns to scale
columns_to_scale = [col for col in X_train.columns if col not in exclude_columns]

# Create an object of StandardScaler
Sc = StandardScaler()

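# Work on copies so the column assignments below do not raise SettingWithCopyWarning
X_train, X_test = X_train.copy(), X_test.copy()

# Fit the scaler on X_train only, then apply the same transformation to X_test
X_train[columns_to_scale] = Sc.fit_transform(X_train[columns_to_scale])
X_test[columns_to_scale] = Sc.transform(X_test[columns_to_scale])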
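
The same idea, scaling only the numeric columns and leaving the dummy columns untouched, can also be expressed with scikit-learn's `ColumnTransformer`. This is a sketch on a tiny made-up frame, not the approach used in this notebook; the `train` and `test` frames are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical frame with one dummy column and two numeric columns
train = pd.DataFrame({"reason_type_1": [0, 1, 0, 1],
                      "Age": [30, 40, 50, 60],
                      "Pets": [0, 1, 2, 1]})
test = pd.DataFrame({"reason_type_1": [1], "Age": [45], "Pets": [1]})

numeric = ["Age", "Pets"]

# Scale only the numeric columns; pass the dummy column through unchanged
preprocess = ColumnTransformer(
    [("scale", StandardScaler(), numeric)],
    remainder="passthrough",
)

train_scaled = preprocess.fit_transform(train)  # statistics come from train only
test_scaled = preprocess.transform(test)        # test reuses the train statistics

print(train_scaled.round(2))
print(test_scaled.round(2))
```

Note that the output is a NumPy array whose columns are reordered: the scaled columns come first, followed by the passed-through ones.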



## First Model: Logistic Regression

In [26]:
# Training a Logistic Regression classifier on the training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)

In [27]:
# Predicting the test result
y_pred = classifier.predict(X_test)

#print result
result_df = pd.concat([y_test.reset_index(drop=True), pd.Series(y_pred, name='Predicted')], axis=1)
print(result_df[:20])

    Target  Predicted
0        0          0
1        0          0
2        0          0
3        0          0
4        1          0
5        0          0
6        1          1
7        1          1
8        1          1
9        1          1
10       0          1
11       1          1
12       1          0
13       0          0
14       1          1
15       1          1
16       1          1
17       1          1
18       1          1
19       1          0


In [28]:
# Create a confusion matrix and get the accuracy score
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)

#accuracy score
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy Score {:.2f}%'.format(accuracy *100))

[[57 15]
 [16 52]]
Accuracy Score 77.86%
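
Accuracy is only one view of the confusion matrix. As a small worked example, the sketch below recomputes accuracy from the matrix reported above and also derives precision and recall for the "excessive" class; it assumes scikit-learn's layout, where rows are actual classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix reported above: rows = actual class, columns = predicted class
cm_logreg = np.array([[57, 15],
                      [16, 52]])
tn, fp, fn, tp = cm_logreg.ravel()

accuracy = (tp + tn) / cm_logreg.sum()
precision = tp / (tp + fp)  # of the predicted "excessive" cases, how many were right
recall = tp / (tp + fn)     # of the actual "excessive" cases, how many were found

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
# accuracy=0.7786 precision=0.7761 recall=0.7647
```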


## Second Model: Support Vector Machine

In [29]:
# Training a Support Vector Machine classifier on the training set
from sklearn.svm import SVC
classifier_svm = SVC(kernel = 'linear', random_state = 0)
classifier_svm.fit(X_train, y_train)

In [30]:
# Predicting the test result
y_pred = classifier_svm.predict(X_test)

#print result
result_df = pd.concat([y_test.reset_index(drop=True), pd.Series(y_pred, name='Predicted')], axis=1)
print(result_df[:20])

    Target  Predicted
0        0          0
1        0          0
2        0          0
3        0          0
4        1          0
5        0          0
6        1          1
7        1          1
8        1          1
9        1          1
10       0          1
11       1          1
12       1          1
13       0          0
14       1          1
15       1          1
16       1          1
17       1          1
18       1          1
19       1          0


In [31]:
# Create a confusion matrix and get the accuracy score
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)

#accuracy score
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy Score {:.2f}%'.format(accuracy *100))

[[57 15]
 [18 50]]
Accuracy Score 76.43%


## Third Model: Naive Bayes

In [32]:
# Training a Naive Bayes classifier on the training set
from sklearn.naive_bayes import GaussianNB
classifier_nb = GaussianNB()
classifier_nb.fit(X_train, y_train)

In [33]:
# Predicting the test result
y_pred = classifier_nb.predict(X_test)

#print result
result_df = pd.concat([y_test.reset_index(drop=True), pd.Series(y_pred, name='Predicted')], axis=1)
print(result_df[:20])

    Target  Predicted
0        0          0
1        0          0
2        0          0
3        0          0
4        1          0
5        0          0
6        1          1
7        1          1
8        1          1
9        1          1
10       0          1
11       1          1
12       1          1
13       0          0
14       1          1
15       1          1
16       1          1
17       1          1
18       1          1
19       1          0


In [34]:
# Create a confusion matrix and get the accuracy score
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)

#accuracy score
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy Score {:.2f}%'.format(accuracy *100))

[[56 16]
 [19 49]]
Accuracy Score 75.00%


### Analysis:
- **Naive Bayes:** It has the lowest accuracy (75.00%) of the three models. Its confusion matrix shows 16 false positives and 19 false negatives, the most errors of either kind.

- **Support Vector Machine (SVM):** Slightly better accuracy (76.43%) than Naive Bayes, with 15 false positives and 18 false negatives.

- **Logistic Regression:** Achieves the highest accuracy (77.86%) among the three models, with 15 false positives and 16 false negatives, so its errors are split almost evenly between the two classes.

### Conclusion:
- **Logistic Regression** is the best-performing model on this test split, with 109 of 140 predictions correct.

- **Support Vector Machine** is close behind, with 107 of 140 predictions correct.

- **Naive Bayes** has the lowest accuracy, with 105 of 140 predictions correct, and the most false negatives (19).

In summary, based on test accuracy, Logistic Regression is the most suitable of the three models, although the margin is small: only four predictions out of 140 separate the best model from the worst.
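
Since the three accuracies differ by only a few test predictions, a single train/test split may not separate the models reliably. One common way to get a steadier comparison is k-fold cross-validation. The sketch below is an illustration on synthetic data from `make_classification`, not a rerun of this analysis; the dataset, the `max_iter` setting and the choice of 5 folds are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Synthetic stand-in for the scaled features and binary target
X_demo, y_demo = make_classification(n_samples=300, n_features=14, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM (linear)": SVC(kernel="linear"),
    "Naive Bayes": GaussianNB(),
}

# 5-fold cross-validated accuracy gives a mean and a spread per model
cv_results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

If the spread across folds is larger than the gap between two models, the ranking between them should be treated as tentative.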

## Using the Logistic Regression Model for Further Analysis
### Finding the Intercept and the Coefficients

In [35]:
# Get the intercept (bias) of our model

classifier.intercept_

array([-1.72585646])

In [36]:
# Get the coefficients (weights) of our model

classifier.coef_

array([[ 2.69636426,  0.47786128,  3.08731283,  0.89724914,  0.06697918,
        -0.17417653,  0.62793322,  0.05378656, -0.1570589 ,  0.03781686,
         0.20392623,  0.17270378,  0.48947904, -0.33153927]])

In [37]:
# Creating a summary table for our results
# check the names of the columns and save in variable
features = X_train.columns.values

# Create a new data frame (summary_table) and add the feature variable
summary_table = pd.DataFrame(data = {'Features' : features})

# Add the coefficients to the table (coef_ has shape (1, n_features), so flatten it)
summary_table['Coefficients'] = classifier.coef_.ravel()

# Display the summary table
summary_table

|   | Features | Coefficients |
|---|---|---|
| 0 | reason_type_1 | 2.696364 |
| 1 | reason_type_2 | 0.477861 |
| 2 | reason_type_3 | 3.087313 |
| 3 | reason_type_4 | 0.897249 |
| 4 | Month | 0.066979 |
| 5 | Weekday | -0.174177 |
| 6 | Transportation Expense | 0.627933 |
| 7 | Distance to Work | 0.053787 |
| 8 | Age | -0.157059 |
| 9 | Daily Work Load Average | 0.037817 |


In [38]:
# Adding the intercept to the summary table 
# for the intercept to be at the top we move the summary indices by 1
summary_table.index = summary_table.index + 1

# Add the intercept at index 0
summary_table.loc[0] = ['Intercept', classifier.intercept_[0]]

# Sort the table by index
summary_table = summary_table.sort_index()

# Display the summary table
summary_table

|   | Features | Coefficients |
|---|---|---|
| 0 | Intercept | -1.725856 |
| 1 | reason_type_1 | 2.696364 |
| 2 | reason_type_2 | 0.477861 |
| 3 | reason_type_3 | 3.087313 |
| 4 | reason_type_4 | 0.897249 |
| 5 | Month | 0.066979 |
| 6 | Weekday | -0.174177 |
| 7 | Transportation Expense | 0.627933 |
| 8 | Distance to Work | 0.053787 |
| 9 | Age | -0.157059 |


## Interpreting the coefficients

In [43]:
# Create a new column called 'Odds Ratio' which holds the odds ratio of each feature
summary_table['Odds Ratio'] = np.exp(summary_table['Coefficients'])

# Sort the table by odds ratio in descending order
summary_table = summary_table.sort_values('Odds Ratio', ascending = False)

# Display result
summary_table

|   | Features | Coefficients | Odds Ratio |
|---|---|---|---|
| 3 | reason_type_3 | 3.087313 | 21.918101 |
| 1 | reason_type_1 | 2.696364 | 14.825731 |
| 4 | reason_type_4 | 0.897249 | 2.452846 |
| 7 | Transportation Expense | 0.627933 | 1.873734 |
| 13 | Children | 0.489479 | 1.631466 |
| 2 | reason_type_2 | 0.477861 | 1.612622 |
| 11 | Body Mass Index | 0.203926 | 1.226208 |
| 12 | Education | 0.172704 | 1.188514 |
| 5 | Month | 0.066979 | 1.069273 |
| 8 | Distance to Work | 0.053787 | 1.055259 |


## **Summary Table Interpretation**
1. **Features:** These are the different features or variables used in the logistic regression model. Each row corresponds to a specific feature.

2. **Coefficients:** These values represent the coefficients assigned to each feature by the logistic regression model. They indicate the strength and direction (positive or negative) of the relationship between each feature and the log-odds of excessive absenteeism.

3. **Odds Ratio:** The odds ratio is calculated by taking the exponential of the corresponding coefficient. It measures how a one-unit change in the predictor variable multiplies the odds of the dependent variable (excessive absenteeism in this case), with the other features held fixed.

Now, interpreting the results:

- **Reasons for Absence (reason_type_3, reason_type_1, reason_type_4):** Each reason dummy is compared against the dropped baseline, reason 0. Reason types 3 and 1 have by far the largest coefficients and odds ratios, well above reason type 4, so they have the strongest influence on the prediction.
    - reason_type_3: This feature has the highest positive coefficient (3.087313). Its odds ratio of 21.918101 means that, with the other features held fixed, an absence in this group has about 21.92 times the odds of being excessive compared with the baseline.
    - reason_type_1: This feature also has a positive coefficient (2.696364) and a high odds ratio (14.825731), indicating a strong positive association with excessive absenteeism.
- **Transportation Expense, Children, Reason_type_2, Body Mass Index, Education, Month, Distance to Work, Daily Work Load Average:** These variables all have positive coefficients, but their impact is smaller than that of reason types 3 and 1. Because the numeric features were standardized, a one-unit change in them corresponds to one standard deviation. Month, Distance to Work and Daily Work Load Average have odds ratios close to 1 (about 1.07, 1.06 and 1.04), so their effect on the odds is small.
- **Age, Weekday, Pets:** These features have negative coefficients, indicating a negative association with excessive absenteeism. The odds ratio for Age is 0.854654, implying that for a one-unit (one standard deviation) increase in 'Age', the odds of excessive absenteeism decrease by approximately 14.5%.
- **Intercept:** The intercept is the log-odds of excessive absenteeism for the baseline case, in which the reason dummies and Education are 0 and the standardized features are at their training mean. It is negative (-1.725856), indicating a baseline probability below 0.5. Exponentiating it gives the baseline odds, about 0.18.
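
To make the link between coefficients, odds ratios and probabilities concrete, the sketch below converts two values from the summary table. It assumes a baseline case where every other feature contributes 0 to the log-odds (dummies at 0, standardized features at their mean); the helper `sigmoid` is defined here for illustration.

```python
import numpy as np

# Values copied from the summary table above
intercept = -1.725856
coef_reason_type_3 = 3.087313

def sigmoid(z):
    """Convert log-odds to a probability."""
    return 1.0 / (1.0 + np.exp(-z))

odds_ratio = np.exp(coef_reason_type_3)               # about 21.92
p_baseline = sigmoid(intercept)                       # about 0.151
p_reason_3 = sigmoid(intercept + coef_reason_type_3)  # about 0.796

print(round(odds_ratio, 2), round(p_baseline, 3), round(p_reason_3, 3))
```

Under these assumptions, switching reason_type_3 from 0 to 1 moves the predicted probability of excessive absenteeism from roughly 15% to roughly 80%, which is what an odds ratio of about 22 amounts to at this baseline.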


