# Logistic Regression for Predicting Absenteeism

In this notebook, we use logistic regression to predict excessive absenteeism from work. The dataset consists of 700 samples, each with 14 input features and a binary target variable indicating whether an employee is "being absent too much" (1) or not (0).

## Objective:
Our objective is to build a logistic regression model that can effectively classify employees based on their absenteeism behavior. Specifically, we want to predict whether an employee will be absent from work for more than the median time, thus indicating excessive absenteeism.
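
The median rule described above can be sketched as follows. This is only an illustration: the column name `Absenteeism Time in Hours` and the values are assumptions, since the file loaded later in this notebook already contains the binary target.

```python
import numpy as np
import pandas as pd

# Hypothetical raw absence durations in hours (made-up values for illustration)
raw = pd.DataFrame({"Absenteeism Time in Hours": [4, 0, 2, 4, 2, 8, 40, 1]})

# Employees strictly above the median are labelled 1 (excessive), the rest 0
median_hours = raw["Absenteeism Time in Hours"].median()
raw["Excessive Absenteeism"] = np.where(
    raw["Absenteeism Time in Hours"] > median_hours, 1, 0
)

print(median_hours)
print(raw["Excessive Absenteeism"].tolist())
```

Using the median as the cut-off tends to give two classes of roughly equal size, which keeps accuracy a meaningful metric.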

## Approach:
1. **Data Preparation:** We will start by loading the preprocessed dataset, in which the reasons for absence are already encoded as dummy variables and the target is already binary. We will then standardize the numeric features and split the data into training and testing sets.
2. **Model Training:** We will train a logistic regression model using the training data and evaluate its performance on the testing data.
3. **Model Evaluation:** We will assess the performance of the logistic regression model using appropriate metrics such as accuracy, precision, recall, and F1-score.
4. **Interpretation:** Finally, we will interpret the coefficients of the logistic regression model to understand the impact of each feature on the likelihood of excessive absenteeism.

By the end of this notebook, we aim to have a well-performing logistic regression model that can provide insights into factors contributing to absenteeism and help in making informed decisions to improve workplace productivity and employee well-being.


### Relevant Imports:

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

### Load the data:

In [3]:
data_file = r".csv\Absenteeism_preprocessed_data.csv"
data = pd.read_csv(data_file)

data.head()

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month Value,Day of the Week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Excessive Absenteeism
0,0,0,0,1,7,1,289,36,33,239.554,30,0,2,1,1
1,0,0,0,0,7,1,118,13,50,239.554,31,0,1,0,0
2,0,0,0,1,7,2,179,51,38,239.554,31,0,0,0,0
3,1,0,0,0,7,3,279,5,39,239.554,24,0,2,0,1
4,0,0,0,1,7,3,289,36,33,239.554,30,0,2,1,0


### Select the inputs and targets for the regression

In [4]:
data.shape

(700, 15)

In [9]:
# The inputs: every column except the last one (the target)
unscaled_inputs = data.iloc[:, :-1]
unscaled_inputs.head()

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month Value,Day of the Week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets
0,0,0,0,1,7,1,289,36,33,239.554,30,0,2,1
1,0,0,0,0,7,1,118,13,50,239.554,31,0,1,0
2,0,0,0,1,7,2,179,51,38,239.554,31,0,0,0
3,1,0,0,0,7,3,279,5,39,239.554,24,0,2,0
4,0,0,0,1,7,3,289,36,33,239.554,30,0,2,1


In [28]:
targets = data['Excessive Absenteeism']
targets

0      1
1      0
2      0
3      1
4      0
      ..
695    1
696    0
697    1
698    0
699    0
Name: Excessive Absenteeism, Length: 700, dtype: int64

# Standardize the Inputs

Standardization is one of the most common preprocessing tools in machine learning. It involves scaling the features to have a mean of 0 and a standard deviation of 1. This ensures that all input features are on a similar scale, which can prevent biases towards features with higher magnitudes.

A useful tool for standardization is the `StandardScaler` class from scikit-learn's `preprocessing` module. Because the dummy columns should stay as 0/1 values, the cell below wraps it in a `CustomScaler` that standardizes only a chosen subset of columns.

Here's a brief overview of the `StandardScaler`:
- It scales each feature independently by removing the mean and scaling to unit variance.
- It can handle sparse matrices, provided centering is disabled (`with_mean=False`), since subtracting the mean would destroy sparsity.
- It does not rescale data to a fixed range; `MinMaxScaler` is the scikit-learn class for that.
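
As a quick check of what standardization does, the sketch below applies `StandardScaler` to a tiny array and compares the result with the formula z = (x - mean) / std computed by hand. The numbers are the Transportation Expense and Age values from the first rows shown earlier, used purely for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales (illustrative values)
X = np.array([[289.0, 33.0], [118.0, 50.0], [179.0, 38.0], [279.0, 39.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# The same result computed by hand: z = (x - mean) / std, column by column
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(X_scaled, X_manual))
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```

Note that `StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default.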

In [11]:
class CustomScaler(BaseEstimator, TransformerMixin):
    def __init__(self, columns, copy=True, with_mean=True, with_std=True):
        """
        Custom scaler for scaling specified columns of a DataFrame using StandardScaler.

        Parameters:
        - columns (list): List of column names to scale.
        - copy (bool): Whether to copy the input DataFrame before scaling (default=True).
        - with_mean (bool): Whether to center the data before scaling (default=True).
        - with_std (bool): Whether to scale the data to unit variance (default=True).
        """
        # Initialize StandardScaler with specified parameters
        self.scaler = StandardScaler(copy=copy, with_mean=with_mean, with_std=with_std)
        # Store the column names to be scaled
        self.columns = columns
        # Initialize mean and variance attributes
        self.mean_ = None
        self.var_ = None

    def fit(self, X, y=None):
        """
        Fit the scaler to the specified columns of the input DataFrame.

        Parameters:
        - X (DataFrame): Input DataFrame to fit the scaler.
        - y (array-like): Target values (ignored).

        Returns:
        - self (object): Returns the instance itself.
        """
        # Fit the scaler to the specified columns
        self.scaler.fit(X[self.columns], y)
        # Store the per-column mean and (population) variance of the selected columns
        self.mean_ = X[self.columns].mean()
        self.var_ = X[self.columns].var(ddof=0)
        return self

    def transform(self, X, y=None, copy=None):
        """
        Transform the input DataFrame by scaling the specified columns.

        Parameters:
        - X (DataFrame): Input DataFrame to transform.
        - y (array-like): Target values (ignored).
        - copy (bool): Whether to copy the input DataFrame (ignored).

        Returns:
        - X_scaled (DataFrame): Transformed DataFrame with scaled columns.
        """
        # Preserve the initial column order
        init_col_order = X.columns
        # Scale the specified columns, keeping the original row index so the concat below aligns
        X_scaled = pd.DataFrame(
            self.scaler.transform(X[self.columns]), columns=self.columns, index=X.index
        )
        # Keep the columns that were not scaled
        X_not_scaled = X.loc[:, ~X.columns.isin(self.columns)]
        # Concatenate the scaled and unscaled columns
        return pd.concat([X_not_scaled, X_scaled], axis=1)[init_col_order]

In [16]:
all_columns = unscaled_inputs.columns.values
all_columns

array(['Reason_1', 'Reason_2', 'Reason_3', 'Reason_4', 'Month Value',
       'Day of the Week', 'Transportation Expense', 'Distance to Work',
       'Age', 'Daily Work Load Average', 'Body Mass Index', 'Education',
       'Children', 'Pets'], dtype=object)

In [17]:
# Define the columns to omit from scaling
# Reason: These columns contain categorical data and should not be scaled.
columns_to_omit = ['Reason_1', 'Reason_2', 'Reason_3', 'Reason_4', 'Education']

columns_to_scale = [column for column in all_columns if column not in columns_to_omit]
columns_to_scale


['Month Value',
 'Day of the Week',
 'Transportation Expense',
 'Distance to Work',
 'Age',
 'Daily Work Load Average',
 'Body Mass Index',
 'Children',
 'Pets']

In [18]:
scaler = CustomScaler(columns_to_scale)
scaler.fit(unscaled_inputs)
scaled_inputs = scaler.transform(unscaled_inputs)
scaled_inputs



Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month Value,Day of the Week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets
0,0,0,0,1,0.182726,-0.683704,1.005844,0.412816,-0.536062,-0.806331,0.767431,0,0.880469,0.268487
1,0,0,0,0,0.182726,-0.683704,-1.574681,-1.141882,2.130803,-0.806331,1.002633,0,-0.019280,-0.589690
2,0,0,0,1,0.182726,-0.007725,-0.654143,1.426749,0.248310,-0.806331,1.002633,0,-0.919030,-0.589690
3,1,0,0,0,0.182726,0.668253,0.854936,-1.682647,0.405184,-0.806331,-0.643782,0,0.880469,-0.589690
4,0,0,0,1,0.182726,0.668253,1.005844,0.412816,-0.536062,-0.806331,0.767431,0,0.880469,0.268487
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
695,1,0,0,0,-0.388293,-0.007725,-0.654143,-0.533522,0.562059,-0.853789,-1.114186,1,0.880469,-0.589690
696,1,0,0,0,-0.388293,-0.007725,0.040034,-0.263140,-1.320435,-0.853789,-0.643782,0,-0.019280,1.126663
697,1,0,0,0,-0.388293,0.668253,1.624567,-0.939096,-1.320435,-0.853789,-0.408580,1,-0.919030,-0.589690
698,0,0,0,1,-0.388293,0.668253,0.190942,-0.939096,-0.692937,-0.853789,-0.408580,1,-0.919030,-0.589690


In [19]:
scaled_inputs.shape

(700, 14)
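
The same selective scaling can also be done with scikit-learn's built-in `ColumnTransformer` instead of a custom class. The sketch below is an alternative under our own assumptions (a tiny made-up frame and our own variable names), not the approach used in this notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Small made-up frame: one dummy column and two numeric columns
df = pd.DataFrame({
    "Reason_1": [0, 1, 0, 1],
    "Age": [33, 50, 38, 39],
    "Pets": [1, 0, 0, 0],
})
numeric_cols = ["Age", "Pets"]

# Scale only the numeric columns; pass the dummy column through untouched
ct = ColumnTransformer(
    [("scale", StandardScaler(), numeric_cols)],
    remainder="passthrough",
)
scaled = ct.fit_transform(df)

# ColumnTransformer puts the transformed columns first, so rebuild the frame
# with the matching column order and then restore the original order
passthrough_cols = [c for c in df.columns if c not in numeric_cols]
scaled_df = pd.DataFrame(
    scaled, columns=numeric_cols + passthrough_cols, index=df.index
)[df.columns]
print(scaled_df)
```

The trade-off is that `ColumnTransformer` reorders columns, which is why the original order is restored explicitly at the end.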

### Split the data into Train and Test:

In [32]:
X_train, X_test, y_train, y_test = train_test_split(scaled_inputs, targets, test_size=0.2)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(560, 14) (560,)
(140, 14) (140,)
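
The split above is random, so the figures in the following cells will differ from run to run, and the scaler was fitted on all 700 rows before splitting. One common alternative is sketched below on synthetic data: fix `random_state`, stratify on the target, and fit the scaler on the training rows only. The seed value and the synthetic arrays are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the inputs and targets (700 rows, 14 features)
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(700, 14))
y = rng.integers(0, 2, size=700)

# random_state fixes the shuffle; stratify keeps the class ratio in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the scaler on the training rows only, then apply it to both parts
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

Fitting the scaler on the training rows alone keeps test-set statistics from influencing the preprocessing.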


### Create the Logistic Regression

In [34]:
reg = LogisticRegression()


# Train the model:
reg.fit(X_train, y_train)

In [38]:
# Assess the training accuracy:
accuracy = reg.score(X_train, y_train)
accuracy = np.round(accuracy, 2)
print(f"Accuracy: {accuracy*100}%.")

Accuracy: 76.0%.


### Create a summary table: 

In [50]:
intercept = reg.intercept_
coeff = np.round((reg.coef_), 3)
feature_names = unscaled_inputs.columns.values

### Exporting Model Coefficients for Tableau Integration

To facilitate integration with Tableau, we will collect the coefficients of the logistic regression model in a table. The coefficients are available as `reg.coef_`, an array of shape `(1, n_features)`, so we transpose it and store it as a single DataFrame column, giving one row per feature next to its name.

The intercept (`reg.intercept_`) is added as an extra row with index 0. The resulting DataFrame can then be exported and used in Tableau for further analysis and visualization.


In [51]:
summary_table = pd.DataFrame(columns= ['Feature Names'], data= feature_names)
summary_table['Coefficients'] = np.round(np.transpose(coeff), 3)
summary_table.index = summary_table.index + 1
summary_table.loc[0]  = ['Intercept', np.round((intercept[0]), 3)]
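# sort_index() returns a new DataFrame; without assignment the intercept row (index 0) stays at the bottom
summary_table.sort_index()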
summary_table

Unnamed: 0,Feature Names,Coefficients
1,Reason_1,2.64
2,Reason_2,0.792
3,Reason_3,2.836
4,Reason_4,0.844
5,Month Value,-0.009
6,Day of the Week,-0.257
7,Transportation Expense,0.598
8,Distance to Work,-0.045
9,Age,-0.159
10,Daily Work Load Average,-0.047
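
The export step itself is not shown in this notebook. A minimal sketch, assuming a CSV file is an acceptable hand-off format and using a hypothetical file name, could look like this:

```python
import pandas as pd

# A few rows copied from the table above, for illustration only
summary_table = pd.DataFrame({
    "Feature Names": ["Reason_1", "Reason_2", "Month Value"],
    "Coefficients": [2.64, 0.792, -0.009],
})

# Hypothetical output file name; index=False keeps the row labels out of the CSV
output_file = "absenteeism_coefficients.csv"
summary_table.to_csv(output_file, index=False)

# Read the file back to confirm the round trip
print(pd.read_csv(output_file))
```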


Next, we add a new column called 'Odds ratio', which holds the exponential of each coefficient. An odds ratio above 1 means the feature raises the odds of excessive absenteeism, a value below 1 means it lowers them, and a value close to 1 means the feature has little effect.

### Sorting the Table by Odds Ratio

To sort the table by odds ratio, we use the `sort_values` method. By default it sorts in ascending order, so we pass `ascending=False` to list the features with the highest odds ratios first. Note that `sort_values` returns a sorted copy and leaves `summary_table` itself unchanged, which is why the table displayed below is still in its original order.

In [53]:
summary_table['Odds ratio'] = np.exp(summary_table.Coefficients)
summary_table['Odds ratio'] = np.round((summary_table['Odds ratio']), 3)
summary_table.sort_values('Odds ratio', ascending= False)
summary_table

Unnamed: 0,Feature Names,Coefficients,Odds ratio
1,Reason_1,2.64,14.013
2,Reason_2,0.792,2.208
3,Reason_3,2.836,17.047
4,Reason_4,0.844,2.326
5,Month Value,-0.009,0.991
6,Day of the Week,-0.257,0.773
7,Transportation Expense,0.598,1.818
8,Distance to Work,-0.045,0.956
9,Age,-0.159,0.853
10,Daily Work Load Average,-0.047,0.954
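
To make the odds ratio and the sorting step concrete, here is a small sketch using three coefficients copied from the table above. The variable name `table` is ours; the only change from the cell above is that the sorted result is assigned back.

```python
import numpy as np
import pandas as pd

# Three coefficients copied from the table above
table = pd.DataFrame({
    "Feature Names": ["Day of the Week", "Reason_3", "Transportation Expense"],
    "Coefficients": [-0.257, 2.836, 0.598],
})

# Odds ratio = exp(coefficient): the factor by which the odds of excessive
# absenteeism are multiplied when the feature increases by one unit
table["Odds ratio"] = np.exp(table["Coefficients"]).round(3)

# sort_values returns a new DataFrame, so the result has to be assigned
table = table.sort_values("Odds ratio", ascending=False)
print(table)
```

For the standardized columns, "one unit" means one standard deviation of the original feature; for the dummy columns it means switching from 0 to 1.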


### Testing the model:

In [57]:
accu_test = reg.score(X_test, y_test)
predicted_proba = reg.predict_proba(X_test)

predicted_proba

array([[0.48914368, 0.51085632],
       [0.81000477, 0.18999523],
       [0.2855308 , 0.7144692 ],
       [0.8750981 , 0.1249019 ],
       [0.73123405, 0.26876595],
       [0.61211549, 0.38788451],
       [0.89941261, 0.10058739],
       [0.48375902, 0.51624098],
       [0.66375662, 0.33624338],
       [0.7254275 , 0.2745725 ],
       [0.86035832, 0.13964168],
       [0.8532399 , 0.1467601 ],
       [0.20639933, 0.79360067],
       [0.87819038, 0.12180962],
       [0.23483691, 0.76516309],
       [0.83192563, 0.16807437],
       [0.55477747, 0.44522253],
       [0.30965378, 0.69034622],
       [0.81694325, 0.18305675],
       [0.86702124, 0.13297876],
       [0.66217733, 0.33782267],
       [0.7402418 , 0.2597582 ],
       [0.18670261, 0.81329739],
       [0.71478426, 0.28521574],
       [0.33876229, 0.66123771],
       [0.8014165 , 0.1985835 ],
       [0.43554707, 0.56445293],
       [0.13758802, 0.86241198],
       [0.75684154, 0.24315846],
       [0.78694464, 0.21305536],
       [0.

In [55]:
predicted_proba.shape

(140, 2)

In [56]:
predicted_proba[:, 1]

array([0.51085632, 0.18999523, 0.7144692 , 0.1249019 , 0.26876595,
       0.38788451, 0.10058739, 0.51624098, 0.33624338, 0.2745725 ,
       0.13964168, 0.1467601 , 0.79360067, 0.12180962, 0.76516309,
       0.16807437, 0.44522253, 0.69034622, 0.18305675, 0.13297876,
       0.33782267, 0.2597582 , 0.81329739, 0.28521574, 0.66123771,
       0.1985835 , 0.56445293, 0.86241198, 0.24315846, 0.21305536,
       0.59422358, 0.57990163, 0.57812738, 0.80466427, 0.86874517,
       0.49848115, 0.55554043, 0.16055873, 0.34422767, 0.28044851,
       0.16055873, 0.10407453, 0.18060987, 0.2770514 , 0.25846261,
       0.92800687, 0.64249117, 0.3203655 , 0.66892949, 0.70045295,
       0.1050464 , 0.50558981, 0.56445293, 0.67584287, 0.27387723,
       0.49633417, 0.80537798, 0.29821695, 0.44001207, 0.11893584,
       0.40666415, 0.15984183, 0.58549492, 0.18349103, 0.15143267,
       0.1488404 , 0.57946277, 0.17513351, 0.50846281, 0.25983127,
       0.29244416, 0.59118683, 0.13643107, 0.07761458, 0.20054

In [58]:
np.round((accu_test*100), 2)

78.57
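
The second column of `predict_proba` is the estimated probability of class 1, and for a binary model `predict` corresponds to thresholding that probability at 0.5. The sketch below illustrates this on synthetic data and shows how a lower threshold flags more cases; the data and the 0.3 threshold are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem standing in for the absenteeism data
X, y = make_classification(n_samples=700, n_features=14, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

proba_class_1 = model.predict_proba(X)[:, 1]

# For a binary model, predict() corresponds to thresholding at 0.5
default_pred = (proba_class_1 > 0.5).astype(int)
print(np.array_equal(default_pred, model.predict(X)))

# A lower threshold (0.3 is an arbitrary illustrative choice) flags more cases
lenient_pred = (proba_class_1 > 0.3).astype(int)
print(default_pred.sum(), lenient_pred.sum())
```

Lowering the threshold generally trades precision for recall, which may matter if missing an excessively absent employee is more costly than a false alarm.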

### Final model evaluation

In [59]:
# Calculate predicted probabilities for the test set
predicted_proba = reg.predict_proba(X_test)[:, 1]

# Calculate predicted classes for the test set
predicted_classes = reg.predict(X_test)

# Calculate accuracy, precision, recall, and F1-score
accuracy = metrics.accuracy_score(y_test, predicted_classes)
precision = metrics.precision_score(y_test, predicted_classes)
recall = metrics.recall_score(y_test, predicted_classes)
f1_score = metrics.f1_score(y_test, predicted_classes)

# Create a summary table
summary_stats = pd.DataFrame({
    'Metric': ['Accuracy', 'Precision', 'Recall', 'F1-Score'],
    'Value': [accuracy, precision, recall, f1_score]
})

summary_stats


Unnamed: 0,Metric,Value
0,Accuracy,0.785714
1,Precision,0.823529
2,Recall,0.666667
3,F1-Score,0.736842
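
For reference, the four metrics are simple ratios of confusion-matrix counts. The counts below are not an output of this notebook; they are the values that reproduce the reported metrics on a 140-row test set, inferred from the table above.

```python
# Confusion-matrix counts inferred from the reported metrics (an assumption,
# not an output of the notebook): 140 test rows in total
tp, fp, fn, tn = 42, 9, 21, 68

accuracy = (tp + tn) / (tp + fp + fn + tn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # share of predicted positives that are real
recall = tp / (tp + fn)                     # share of real positives that are found
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 6), round(precision, 6), round(recall, 6), round(f1, 6))
```

Read this way, the model raises 9 false alarms but misses 21 of the 63 excessively absent cases. The actual matrix can be obtained with `metrics.confusion_matrix(y_test, predicted_classes)`.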


# Conclusion

In this notebook, we used logistic regression to predict excessive absenteeism from work. The dataset comprised 700 samples, each with 14 input features and a binary target variable indicating whether an employee is "being absent too much" (1) or not (0).

## Objective:
Our objective was to build a logistic regression model that could effectively classify employees based on their absenteeism behavior. Specifically, we aimed to predict whether an employee would be absent from work for more than the median time, indicating excessive absenteeism.

## Approach:
1. **Data Preparation:** We started by loading the preprocessed dataset, in which the reasons for absence were already encoded as dummy variables and the target was already binary. We then standardized the numeric features and split the data into training and testing sets.
2. **Model Training:** We trained a logistic regression model using the training data and evaluated its performance on the testing data.
3. **Model Evaluation:** We assessed the performance of the logistic regression model using appropriate metrics such as accuracy, precision, recall, and F1-score.
4. **Interpretation:** Finally, we interpreted the coefficients of the logistic regression model to understand the impact of each feature on the likelihood of excessive absenteeism.

## Model Performance:
The logistic regression model achieved the following performance metrics on the test set:

- **Accuracy**: 78.57%
- **Precision**: 82.35%
- **Recall**: 66.67%
- **F1-Score**: 73.68%

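These metrics indicate that the model performs reasonably well, but precision (82.35%) is noticeably higher than recall (66.67%): most employees flagged by the model are indeed excessively absent, while about one third of the excessively absent cases are missed. Because the train/test split is random, these figures will vary somewhat between runs. Further optimization and exploration of other machine learning models may help improve performance in the future.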


# Exploring Different Models

While logistic regression provides a solid baseline for predicting excessive absenteeism, there are several other machine learning models worth exploring. Each model has its own strengths and weaknesses, and exploring a variety of models can help identify the most suitable approach for the dataset at hand. Some alternative models to consider include:

1. **Decision Trees and Random Forests:** Decision trees are intuitive models that partition the feature space into regions, while random forests aggregate multiple decision trees for improved performance and robustness.

2. **Support Vector Machines (SVM):** SVMs aim to find the hyperplane that separates the classes with the largest margin. They remain effective in high-dimensional spaces, including cases where the number of features exceeds the number of samples, although that is not the situation here (14 features, 700 samples).

3. **Gradient Boosting Machines (GBM):** GBM is an ensemble technique that builds multiple weak learners sequentially, with each learner focusing on the mistakes made by the previous ones. GBM often yields high predictive accuracy but may be more computationally intensive.

4. **Neural Networks:** Neural networks, especially deep learning architectures, can capture complex nonlinear relationships in the data. They have shown impressive performance in various domains but may require extensive computational resources and careful hyperparameter tuning.

5. **Ensemble Methods:** Ensemble methods, such as bagging, boosting, and stacking, combine predictions from multiple models to produce a final prediction. These methods can often outperform individual models and are particularly useful when dealing with noisy or uncertain data.

Exploring different models allows us to compare their performance, interpretability, and computational requirements. It's essential to strike a balance between model complexity and performance, considering factors such as interpretability, scalability, and computational resources available.
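
A comparison along these lines could be set up as in the sketch below. It runs on synthetic data with the same shape as our inputs, and the choice of models, hyperparameters, and 5-fold cross-validation are our assumptions, so the printed scores say nothing about the absenteeism data itself.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data with the same shape as the absenteeism inputs (700 x 14)
X, y = make_classification(n_samples=700, n_features=14, random_state=0)

candidates = {
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Gradient boosting": GradientBoostingClassifier(random_state=0),
}

# 5-fold cross-validated accuracy gives a steadier comparison than one split
results = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}
for name, score in results.items():
    print(f"{name}: {score:.3f}")
```

Replacing `X` and `y` with `scaled_inputs` and `targets` would apply the same comparison to the data used in this notebook.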


### Saving the entire pipeline

In [60]:
from joblib import dump

pipeline_filename = 'absenteeism_model_pipeline.joblib'

dump((scaler, reg), pipeline_filename)

['absenteeism_model_pipeline.joblib']
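
Loading the saved objects back mirrors the `dump` call: the tuple is unpacked in the same order, new rows are scaled, and the model predicts on the scaled values. The sketch below uses a stand-in scaler and model trained on random data, plus a hypothetical file name, so that it runs on its own.

```python
import numpy as np
from joblib import dump, load
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Stand-in scaler and model trained on random data (for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)
scaler = StandardScaler().fit(X)
reg = LogisticRegression().fit(scaler.transform(X), y)

# Save both objects as a tuple, mirroring the cell above (hypothetical file name)
dump((scaler, reg), "example_pipeline.joblib")

# Later, or in another script: unpack in the same order, scale, then predict
loaded_scaler, loaded_reg = load("example_pipeline.joblib")
new_rows = rng.normal(size=(5, 3))
predictions = loaded_reg.predict(loaded_scaler.transform(new_rows))
print(predictions)
```

Because joblib stores objects by reference to their class, the `CustomScaler` class definition must be available in the session that loads the real pipeline file.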
