# Analyzing Absenteeism Data: Preprocessing for Predictive Modeling
This Jupyter notebook focuses on analyzing absenteeism data to predict whether an employee will be absent from work for more than the median time. The dataset includes various features such as transportation expense, distance to work, age, daily workload average, body mass index, education level, number of children, number of pets, reason for absence, date, and absenteeism time in hours.

---

## Data Preprocessing Steps:
### One-Hot Encoding for Reason for Absence:
- The categorical feature 'Reason for Absence' will be encoded using one-hot encoding.
- One category (Reason 0) will be dropped to avoid multicollinearity, resulting in 27 encoded categories.
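
To see why one category can be dropped without losing information, here is a minimal sketch. It uses a small made-up series rather than the absenteeism file, and it relies on pandas' `drop_first=True` option as a shortcut, which is an assumption for illustration and not the approach taken later in this notebook.

```python
import pandas as pd

# Hypothetical toy data; the values are illustrative only.
reasons = pd.Series([0, 1, 2, 1, 0], name="Reason for Absence")

# Full one-hot encoding: one column per category.
full = pd.get_dummies(reasons).astype(int)

# Reduced encoding: the first category (0) is dropped and becomes the baseline.
reduced = pd.get_dummies(reasons, drop_first=True).astype(int)

# Each row of the full encoding sums to exactly 1, so the columns are
# linearly dependent once a model also includes an intercept.
row_sums = full.sum(axis=1)

# The dropped column can be reconstructed from the remaining ones.
recovered_baseline = 1 - reduced.sum(axis=1)

print(row_sums.tolist())
print(recovered_baseline.tolist())
```

A row whose remaining dummies are all 0 therefore belongs to the dropped baseline category.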
### Grouping Reasons for Absence:
- The remaining reasons for absence will be grouped into four categories for simplification and dimensionality reduction.
### Target Variable Transformation:
- The target variable 'Absenteeism Time in Hours' will be transformed into two categories:
    - 0 if the hours absent are less than or equal to the median time.
    - 1 otherwise.
### Feature Engineering:
- The employee ID column will be discarded as it holds no apparent relevance.
- The date column will be processed to extract month (numerical value) and day of the week (0 for Monday, 1 for Tuesday, and so on).
- The date column will then be removed as it serves no further purpose.

---

## Proposed Model:
A logistic regression model is proposed for this analysis due to its effectiveness in predicting binary outcomes. The model will be trained on the preprocessed dataset to predict the probability of an employee being absent for more than the median time.
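
As a rough illustration of how logistic regression turns features into a probability, the sketch below applies the sigmoid function to a weighted sum. The function name `predict_probability`, the weights, and the inputs are all hypothetical and are not fitted to the absenteeism data.

```python
import math

def predict_probability(features, weights, bias):
    """Logistic regression prediction: sigmoid of a weighted sum."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights and inputs, chosen only to illustrate the mechanics.
weights = [0.8, -0.5]
bias = 0.0

# With all features at zero and no bias, the model is undecided.
p_neutral = predict_probability([0.0, 0.0], weights, bias)

# A positive weighted sum pushes the probability above 0.5.
p_example = predict_probability([1.0, 0.0], weights, bias)
predicted_class = 1 if p_example > 0.5 else 0

print(p_neutral, round(p_example, 3), predicted_class)
```

A threshold of 0.5 is used here to turn the probability into a class label. That threshold is a common default and can be changed.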



### Relevant imports and loading the data file:

In [56]:
import pandas as pd
import numpy as np

In [4]:
data_file = r".csv/Absenteeism_raw_data.csv" # Location of the csv file with raw data (forward slash works on all platforms)

raw_csv_data = pd.read_csv(data_file) # Loading the file in a variable.
raw_csv_data.head()

Unnamed: 0,ID,Reason for Absence,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,11,26,07/07/2015,289,36,33,239.554,30,1,2,1,4
1,36,0,14/07/2015,118,13,50,239.554,31,1,1,0,0
2,3,23,15/07/2015,179,51,38,239.554,31,1,0,0,2
3,7,7,16/07/2015,279,5,39,239.554,24,1,2,0,4
4,11,23,23/07/2015,289,36,33,239.554,30,1,2,1,2


In [5]:
df = raw_csv_data.copy() # checkpoint 1

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 700 entries, 0 to 699
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   ID                         700 non-null    int64  
 1   Reason for Absence         700 non-null    int64  
 2   Date                       700 non-null    object 
 3   Transportation Expense     700 non-null    int64  
 4   Distance to Work           700 non-null    int64  
 5   Age                        700 non-null    int64  
 6   Daily Work Load Average    700 non-null    float64
 7   Body Mass Index            700 non-null    int64  
 8   Education                  700 non-null    int64  
 9   Children                   700 non-null    int64  
 10  Pets                       700 non-null    int64  
 11  Absenteeism Time in Hours  700 non-null    int64  
dtypes: float64(1), int64(10), object(1)
memory usage: 65.8+ KB


### Drop the 'ID' column since it serves no purpose in the analysis:

In [9]:
df = df.drop(['ID'], axis=1)
df.head()

Unnamed: 0,Reason for Absence,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,26,07/07/2015,289,36,33,239.554,30,1,2,1,4
1,0,14/07/2015,118,13,50,239.554,31,1,1,0,0
2,23,15/07/2015,179,51,38,239.554,31,1,0,0,2
3,7,16/07/2015,279,5,39,239.554,24,1,2,0,4
4,23,23/07/2015,289,36,33,239.554,30,1,2,1,2


### Processing 'Reason for Absence' into four distinct categories and discarding reason_0 to avoid multicollinearity: 

- Create dummies

In [24]:
df['Reason for Absence'].min(), df['Reason for Absence'].max()

(0, 28)

In [25]:
sorted(df['Reason for Absence'].unique())

[0,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 14,
 15,
 16,
 17,
 18,
 19,
 21,
 22,
 23,
 24,
 25,
 26,
 27,
 28]

In [26]:
reason_column = pd.get_dummies(df['Reason for Absence']).astype(int)
reason_column

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,18,19,21,22,23,24,25,26,27,28
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
3,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
695,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
696,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
697,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
698,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0


- Check for MECE

In [27]:
reason_column['check'] = reason_column.sum(axis=1)
reason_column

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,19,21,22,23,24,25,26,27,28,check
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,1
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,1
3,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,1
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
695,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
696,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,1
697,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
698,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,1


In [28]:
# Confirm that the data is Mutually Exclusive and Collectively Exhaustive (MECE):

reason_column['check'].sum(axis=0), reason_column['check'].unique()

(700, array([1], dtype=int64))

- Now drop the 'check' column (since it serves no further purpose) and drop reason_0 to avoid multicollinearity:

In [29]:
reason_column = reason_column.drop(['check'], axis=1)
reason_column = reason_column.drop([0], axis=1)
reason_column 

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,18,19,21,22,23,24,25,26,27,28
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
3,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
695,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
696,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
697,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
698,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0


- Grouping the reason for absence:
- Four groups are created (reason_1 to reason_4), each covering related reasons for absence.
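
The grouping relies on two pandas behaviours: label-based `.loc` slices include both endpoints, and the row-wise maximum over 0/1 columns acts as a logical OR. A minimal sketch with made-up dummy columns follows. The group boundaries here are illustrative and are not the ones used in this notebook.

```python
import pandas as pd

# Hypothetical dummy columns labelled 1 to 4; each row has at most one 1.
dummies = pd.DataFrame(
    {1: [1, 0, 0, 0], 2: [0, 1, 0, 0], 3: [0, 0, 1, 0], 4: [0, 0, 0, 0]}
)

# .loc slices by label and includes the end label, so 1:2 selects columns 1 and 2.
group_a = dummies.loc[:, 1:2].max(axis=1)

# An open-ended slice takes every column from label 3 onward.
group_b = dummies.loc[:, 3:].max(axis=1)

print(group_a.tolist())
print(group_b.tolist())
```

The last row is 0 in both groups, which corresponds to an observation that falls into neither group.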


In [32]:
df.columns.values

array(['Reason for Absence', 'Date', 'Transportation Expense',
       'Distance to Work', 'Age', 'Daily Work Load Average',
       'Body Mass Index', 'Education', 'Children', 'Pets',
       'Absenteeism Time in Hours'], dtype=object)

In [33]:
reason_column.columns.values

array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
       21, 22, 23, 24, 25, 26, 27, 28], dtype=object)

In [34]:
df = df.drop(['Reason for Absence'], axis=1) # Drop the reasons column from data

reason_1 = reason_column.loc[:, 1:14].max(axis=1) # Reason 1: covers serious diseases
reason_2 = reason_column.loc[:, 15:17].max(axis=1) # Reason 2: covers pregnancy and child birth related reasons
reason_3 = reason_column.loc[:, 18:21].max(axis=1) # Reason 3: covers poisoning reasons
reason_4 = reason_column.loc[:, 22:].max(axis=1) # Reason 4: covers minor diseases and clinic visits
# Reason 0: No reason given/ reason unknown


- Concatenate the dataframe and the four grouped reason columns column-wise (`axis=1`):

In [35]:
df = pd.concat([df, reason_1, reason_2, reason_3, reason_4], axis=1)
df.head()

Unnamed: 0,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours,0,1,2,3
0,07/07/2015,289,36,33,239.554,30,1,2,1,4,0,0,0,1
1,14/07/2015,118,13,50,239.554,31,1,1,0,0,0,0,0,0
2,15/07/2015,179,51,38,239.554,31,1,0,0,2,0,0,0,1
3,16/07/2015,279,5,39,239.554,24,1,2,0,4,1,0,0,0
4,23/07/2015,289,36,33,239.554,30,1,2,1,2,0,0,0,1


- Rename the columns:

In [36]:
column_names = ['Date', 'Transportation Expense', 'Distance to Work', 'Age',
       'Daily Work Load Average', 'Body Mass Index', 'Education',
       'Children', 'Pets', 'Absenteeism Time in Hours', 'Reason_1', 'Reason_2', 'Reason_3', 'Reason_4']
df.columns = column_names
df.head()

Unnamed: 0,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours,Reason_1,Reason_2,Reason_3,Reason_4
0,07/07/2015,289,36,33,239.554,30,1,2,1,4,0,0,0,1
1,14/07/2015,118,13,50,239.554,31,1,1,0,0,0,0,0,0
2,15/07/2015,179,51,38,239.554,31,1,0,0,2,0,0,0,1
3,16/07/2015,279,5,39,239.554,24,1,2,0,4,1,0,0,0
4,23/07/2015,289,36,33,239.554,30,1,2,1,2,0,0,0,1
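
As an alternative to renaming every column after concatenation, the new series can be named before they are joined. This is a sketch on made-up data. The frame `base` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical stand-ins for the main dataframe and two grouped reason series.
base = pd.DataFrame({"Age": [33, 50, 38]})
reason_1 = pd.Series([0, 1, 0])
reason_2 = pd.Series([1, 0, 0])

# Series.rename with a scalar sets the series name, which becomes the
# column label when the series is concatenated along axis=1.
combined = pd.concat(
    [base, reason_1.rename("Reason_1"), reason_2.rename("Reason_2")],
    axis=1,
)

print(list(combined.columns))
```

This avoids retyping the full list of existing column names, which is where mistakes tend to creep in.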


- Rearrange the columns:

In [37]:
column_names_reordered = ['Reason_1', 'Reason_2', 'Reason_3', 'Reason_4', 
                          'Date', 'Transportation Expense', 'Distance to Work', 'Age',
       'Daily Work Load Average', 'Body Mass Index', 'Education',
       'Children', 'Pets', 'Absenteeism Time in Hours']
df = df[column_names_reordered]
df.head()

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,0,0,0,1,07/07/2015,289,36,33,239.554,30,1,2,1,4
1,0,0,0,0,14/07/2015,118,13,50,239.554,31,1,1,0,0
2,0,0,0,1,15/07/2015,179,51,38,239.554,31,1,0,0,2
3,1,0,0,0,16/07/2015,279,5,39,239.554,24,1,2,0,4
4,0,0,0,1,23/07/2015,289,36,33,239.554,30,1,2,1,2


### Checkpoint 2

In [38]:
df_mod = df.copy()

### Modify the date column to extract Month and Day of the Week (both numerical values)

In [41]:
date_format = "%d/%m/%Y"
df_mod['Date'] = pd.to_datetime(df_mod['Date'], format= date_format)
df_mod.head()

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,0,0,0,1,2015-07-07,289,36,33,239.554,30,1,2,1,4
1,0,0,0,0,2015-07-14,118,13,50,239.554,31,1,1,0,0
2,0,0,0,1,2015-07-15,179,51,38,239.554,31,1,0,0,2
3,1,0,0,0,2015-07-16,279,5,39,239.554,24,1,2,0,4
4,0,0,0,1,2015-07-23,289,36,33,239.554,30,1,2,1,2
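
The explicit day-first format matters because some date strings are ambiguous. The sketch below uses only the standard library to show the difference. The string "03/07/2015" is a hypothetical example and does not come from the dataset.

```python
from datetime import datetime

DATE_FORMAT = "%d/%m/%Y"  # day first, matching the notebook's date_format

parsed = datetime.strptime("14/07/2015", DATE_FORMAT)

# A month-first format rejects the same string, because 14 is not a valid month.
try:
    datetime.strptime("14/07/2015", "%m/%d/%Y")
    month_first_ok = True
except ValueError:
    month_first_ok = False

# An ambiguous string parses under both formats but yields different dates.
day_first = datetime.strptime("03/07/2015", DATE_FORMAT)
month_first = datetime.strptime("03/07/2015", "%m/%d/%Y")

print(parsed.date(), month_first_ok)
print(day_first.date(), month_first.date())
```

Stating the format explicitly turns a silent misreading into either a correct parse or a visible error.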


In [42]:
df_mod.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 700 entries, 0 to 699
Data columns (total 14 columns):
 #   Column                     Non-Null Count  Dtype         
---  ------                     --------------  -----         
 0   Reason_1                   700 non-null    int32         
 1   Reason_2                   700 non-null    int32         
 2   Reason_3                   700 non-null    int32         
 3   Reason_4                   700 non-null    int32         
 4   Date                       700 non-null    datetime64[ns]
 5   Transportation Expense     700 non-null    int64         
 6   Distance to Work           700 non-null    int64         
 7   Age                        700 non-null    int64         
 8   Daily Work Load Average    700 non-null    float64       
 9   Body Mass Index            700 non-null    int64         
 10  Education                  700 non-null    int64         
 11  Children                   700 non-null    int64         
 12  Pets    

- extract the month value
- Jan : 1, Feb : 2, Mar : 3 and so on

In [46]:
def get_month_value(date_value):
    return date_value.month

df_mod['Month Value'] = df_mod['Date'].apply(get_month_value)
df_mod.head()

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours,Month Value
0,0,0,0,1,2015-07-07,289,36,33,239.554,30,1,2,1,4,7
1,0,0,0,0,2015-07-14,118,13,50,239.554,31,1,1,0,0,7
2,0,0,0,1,2015-07-15,179,51,38,239.554,31,1,0,0,2,7
3,1,0,0,0,2015-07-16,279,5,39,239.554,24,1,2,0,4,7
4,0,0,0,1,2015-07-23,289,36,33,239.554,30,1,2,1,2,7


- extract day of the week value
- Monday : 0, Tuesday : 1, Wednesday : 2 and so on

In [47]:
def get_weekday_value(date_value):
    return date_value.weekday()

df_mod['Day of the Week'] = df_mod['Date'].apply(get_weekday_value)
df_mod.head()

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours,Month Value,Day of the Week
0,0,0,0,1,2015-07-07,289,36,33,239.554,30,1,2,1,4,7,1
1,0,0,0,0,2015-07-14,118,13,50,239.554,31,1,1,0,0,7,1
2,0,0,0,1,2015-07-15,179,51,38,239.554,31,1,0,0,2,7,2
3,1,0,0,0,2015-07-16,279,5,39,239.554,24,1,2,0,4,7,3
4,0,0,0,1,2015-07-23,289,36,33,239.554,30,1,2,1,2,7,3
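
The same two features can also be extracted without helper functions by using the pandas `.dt` accessor, which works on the whole column at once. This is an alternative sketch, not the notebook's method. It reuses three dates shown in the output above.

```python
import pandas as pd

# Three dates from the sample rows, parsed with the day-first format.
dates = pd.to_datetime(
    pd.Series(["07/07/2015", "14/07/2015", "15/07/2015"]),
    format="%d/%m/%Y",
)

# .dt.month gives 1 to 12; .dt.weekday gives 0 for Monday through 6 for Sunday.
months = dates.dt.month
weekdays = dates.dt.weekday

print(months.tolist())
print(weekdays.tolist())
```

Both approaches should give the same values. The accessor is usually faster on large columns because it avoids a Python-level function call per row.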


- rearrange and drop 'Date' column:

In [48]:
column_names_upd = ['Reason_1', 'Reason_2', 'Reason_3', 'Reason_4', 'Month Value', 'Day of the Week',
       'Transportation Expense', 'Distance to Work', 'Age',
       'Daily Work Load Average', 'Body Mass Index', 'Education', 'Children',
       'Pets', 'Absenteeism Time in Hours']
df_mod = df_mod.drop(['Date'], axis= 1)
df_mod = df_mod[column_names_upd]
df_mod.head()

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month Value,Day of the Week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,0,0,0,1,7,1,289,36,33,239.554,30,1,2,1,4
1,0,0,0,0,7,1,118,13,50,239.554,31,1,1,0,0
2,0,0,0,1,7,2,179,51,38,239.554,31,1,0,0,2
3,1,0,0,0,7,3,279,5,39,239.554,24,1,2,0,4
4,0,0,0,1,7,3,289,36,33,239.554,30,1,2,1,2


### Checkpoint 3

In [51]:
df_reason_date_mod = df_mod.copy()

### Education:

In [52]:
df_reason_date_mod['Education'].unique()

array([1, 3, 2, 4], dtype=int64)

Education is coded as follows:
- 1: Basic Education, 2: Graduate Education, 3: Diploma Holder, 4: Master's or Ph.D. Holder

For the purposes of our analysis we will convert this to:
- 0 : for basic education, 
- 1 : otherwise

In [54]:
edu_map = {1: 0, 2: 1, 3: 1, 4: 1}
df_reason_date_mod['Education'] = df_reason_date_mod['Education'].map(edu_map)
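
One thing to watch with `Series.map` and a dictionary: any value missing from the dictionary becomes `NaN` rather than raising an error. The sketch below shows a simple check, using a made-up column that contains an unexpected code (5).

```python
import pandas as pd

edu_map = {1: 0, 2: 1, 3: 1, 4: 1}

# Hypothetical column containing a code (5) that the mapping does not cover.
education = pd.Series([1, 3, 2, 4, 5])

mapped = education.map(edu_map)

# Values without a dictionary entry come back as NaN, so count them.
unmapped_count = int(mapped.isna().sum())

print(mapped.tolist())
print(unmapped_count)
```

In this notebook the unique values are 1 to 4, so the mapping covers every case. A check like this becomes useful if new data with other codes arrives.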

### Checkpoint 4

In [55]:
df_preprocessed = df_reason_date_mod.copy()
df_preprocessed.head()

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month Value,Day of the Week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,0,0,0,1,7,1,289,36,33,239.554,30,0,2,1,4
1,0,0,0,0,7,1,118,13,50,239.554,31,0,1,0,0
2,0,0,0,1,7,2,179,51,38,239.554,31,0,0,0,2
3,1,0,0,0,7,3,279,5,39,239.554,24,0,2,0,4
4,0,0,0,1,7,3,289,36,33,239.554,30,0,2,1,2


### Create the targets:

In [57]:
df_preprocessed['Absenteeism Time in Hours'].median()

3.0

# Creating Targets for Logistic Regression
- To determine whether an employee is "being absent too much", we split the target into two classes. The goal is a roughly balanced dataset for logistic regression, with similar numbers of 0s and 1s.
---
### Methodology:
- We will use the median absenteeism time as the cutoff. Because about half of the observations lie on each side of the median, this keeps the two classes roughly balanced, which matters because a heavily imbalanced target can bias a classifier toward the majority class.
- With more data, other cutoffs could be explored, for instance a fixed, arbitrarily chosen value instead of the median.

### Target Definition:
The target variable will be assigned as follows:
- Employees who have been absent for more than the median of 3 hours, that is 4 hours or more (equivalent to taking half a day off), will be assigned a target value of 1.
- Employees who have been absent for 3 hours or less will be assigned a target value of 0.
---
This categorization will enable the logistic regression model to predict whether an employee is "being absent too much" based on their absenteeism time.
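
A median cutoff does not guarantee an exact 50/50 split: observations equal to the median all fall into class 0 under a strict "greater than" rule. The sketch below illustrates this with a short made-up list of hours, using only the standard library.

```python
from statistics import median

# Hypothetical absence durations in hours; several values tie with the median.
hours = [4, 0, 2, 4, 2, 3, 8, 3]

cutoff = median(hours)

# Strictly greater than the median maps to 1; ties and smaller values map to 0.
targets = [1 if h > cutoff else 0 for h in hours]
share_of_ones = sum(targets) / len(targets)

print(cutoff)
print(targets)
print(share_of_ones)
```

Here the share of 1s is below one half because two observations sit exactly on the median.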

In [58]:
targets = np.where(df_preprocessed['Absenteeism Time in Hours'] > df_preprocessed['Absenteeism Time in Hours'].median(), 1, 0)

targets

array([1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0,
       1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1,
       0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1,
       0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1,
       0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0,
       1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1,
       0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0,
       0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0,
       0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0,

In [60]:
df_preprocessed['Excessive Absenteeism'] = targets
df_preprocessed = df_preprocessed.drop(['Absenteeism Time in Hours'], axis= 1) #Drop now since not needed anymore
df_preprocessed.head()

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month Value,Day of the Week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Excessive Absenteeism
0,0,0,0,1,7,1,289,36,33,239.554,30,0,2,1,1
1,0,0,0,0,7,1,118,13,50,239.554,31,0,1,0,0
2,0,0,0,1,7,2,179,51,38,239.554,31,0,0,0,0
3,1,0,0,0,7,3,279,5,39,239.554,24,0,2,0,1
4,0,0,0,1,7,3,289,36,33,239.554,30,0,2,1,0


### Commentary on the Targets

---

Balancing the Dataset:
- Before proceeding, it's essential to check whether the dataset is balanced. This means finding the share of targets equal to 1, which is the number of 1s divided by the total number of targets. `targets.sum()` gives the count of 1s, while `targets.shape[0]` gives the length of the targets array.

Creating a Checkpoint:
- Once we've confirmed the balance of the dataset, it's prudent to create a checkpoint, that is, a copy of the dataframe taken after dropping columns that are no longer needed. Features that later prove uninformative when exploring the model weights can be removed at that stage as well. We'll also verify that the copy really is an independent checkpoint.

Checking the Relationship between Variables:
- If `df_preprocessed is df_reason_date_mod` evaluates to True, both names point to the same object. If it evaluates to False, the two are distinct objects, which confirms that a checkpoint was created. It's worth inspecting the contents at this stage to ensure the integrity of the data.
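
The difference between a plain assignment and a checkpoint made with `.copy()` can be checked with Python's `is` operator. This is a minimal sketch on a made-up frame. The names `original`, `alias`, and `checkpoint` are hypothetical.

```python
import pandas as pd

original = pd.DataFrame({"x": [1, 2, 3]})

# Plain assignment creates a second name for the same object.
alias = original

# .copy() creates an independent dataframe (a deep copy by default).
checkpoint = original.copy()

# Changing the checkpoint leaves the original untouched.
checkpoint.loc[0, "x"] = 99

print(alias is original)
print(checkpoint is original)
print(original.loc[0, "x"])
```

A result of False from the `is` test, together with an unchanged original, indicates that the checkpoint is safe to modify.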

In [62]:
ratio = targets.sum()/ targets.shape[0]
ratio = np.round(ratio, 2)
print(f"Targets\n1's: {ratio*100}%\n0's: {(1 - ratio)*100}%")

Targets
1's: 46.0%
0's: 54.0%


This is a 46/54 split, which is close enough to balanced for a logistic regression model.

### Final Checkpoint

In [65]:
# save the preprocessed file as 'Absenteeism_preprocessed_data.csv'
df_preprocessed.to_csv(r'.csv/Absenteeism_preprocessed_data.csv', index= False)
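
Passing `index=False` keeps the row index out of the file, so reading it back does not produce an extra unnamed column. The sketch below checks this with an in-memory buffer instead of a file on disk. The small frame is made up for illustration.

```python
import io

import pandas as pd

# Hypothetical two-row frame standing in for the preprocessed data.
frame = pd.DataFrame({"Reason_1": [0, 1], "Excessive Absenteeism": [1, 0]})

# Write to an in-memory text buffer rather than a file on disk.
buffer = io.StringIO()
frame.to_csv(buffer, index=False)

# Rewind the buffer and read the CSV back.
buffer.seek(0)
restored = pd.read_csv(buffer)

print(list(restored.columns))
print(restored.equals(frame))
```

A round-trip check like this is a quick way to confirm that the saved file matches what the next notebook expects to load.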

# Conclusion

In this notebook, we've successfully preprocessed the absenteeism data to prepare it for predictive modeling. We began by analyzing the dataset and identifying various features such as transportation expense, distance to work, age, and reason for absence. We then performed several preprocessing steps, including one-hot encoding for the reason for absence, grouping reasons, transforming the target variable, and feature engineering.

The dataset was carefully curated to ensure that it is well-structured and suitable for analysis. We created a balanced target variable to enable logistic regression modeling, where employees are categorized based on whether they are "being absent too much" or not.

Throughout the preprocessing pipeline, we utilized checkpointing to ensure data integrity and facilitate debugging. Additionally, detailed comments and documentation were provided to explain each step of the process thoroughly.

With the preprocessed dataset saved as a CSV file, we are now ready to proceed with building and training a logistic regression model for predicting excessive absenteeism. This model will help us gain insights into factors influencing absenteeism and assist in making informed decisions to improve workplace productivity and employee well-being.

