# Environment Management

**Environment Name:** Absenteeism   
**Project Directory Name:** absenteeism_prj
<div style="text-align: justify">
<strong>Original Imported Libraries and Python:</strong>  
    
-python=3.12.4 (from conda-forge)  
-numpy=2.0.0 (from conda-forge)  
-matplotlib=3.9.1 (from conda-forge)  
-seaborn=0.13.2 (from conda-forge)  
-pandas=2.2.2 (from conda-forge)  
    
<strong>Project Date:</strong> August 2024
</div>

In [None]:
import sys 
sys.executable  # Display the path to the Python executable to confirm the correct environment is active.

# Import Libraries and Read Data

In [2]:
import numpy as np  # For numerical operations and arrays.
import pandas as pd  # For data manipulation and analysis.
import matplotlib.pyplot as plt  # For basic plotting.
import seaborn as sns  # For enhanced plotting.
from absenteeism_scripts import range_monthdays 

In [3]:
# Read CSV Datafile to a DataFrame:
raw_data = pd.read_csv('Absenteeism_data.csv')

In [4]:
pd.options.display.max_columns = None

# Quick Data Explanation

<div style="text-align: justify">
The target variable is the 'Absenteeism Time in Hours' column, and the 'ID' column is the unique identifier. There is also a 'Date' column; the remaining columns are features that will be used to predict the target. The 'Date' column could be a trap, because breaking it down into weekdays might reveal a pattern tied to a specific weekday. Converting the dates to weekdays could therefore make the column well suited for analysis. If we do not convert it, the 'Date' column should be dropped as well.
</div>

# Data Preprocessing

## Date Column

<div style="text-align: justify">
We'll create more features from the 'Date' column and, in later sections of this project, let the model decide whether these features are important.
</div>

In [5]:
# Create a checkpoint:
df_date = raw_data.copy()

In [6]:
# Convert Date column to datetime type:
df_date['Date'] = pd.to_datetime(df_date['Date'], format="%d/%m/%Y")

# Extract the weekday and adjust to (1=Monday,..., 6=Weekend):
df_date['Weekday'] = df_date['Date'].dt.weekday.apply(lambda x: x + 1 if x < 5 else 6) 

# Extract the month and monthday from 'Date' column:
df_date['Month'] = df_date['Date'].dt.month 
df_date['Monthday'] = df_date['Date'].dt.day

# Apply function to range the monthdays into 6 categories:
df_date['Monthday'] = df_date['Monthday'].apply(range_monthdays)

<div style="text-align: justify">
We could drop the 13 rows that represent weekend absences, because it's likely that few people were working on weekends; these rows might introduce noise and degrade our analysis. For now, however, we will keep the initial number of observations intact. If the model weights indicate that the 'Weekday' column plays a crucial role in our analysis, we can revisit this decision and delete these 13 rows to see the impact. If the column turns out to be unimportant, dropping these 13 observations will be unnecessary and have no meaningful effect on the analysis.
</div>
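
As a quick illustration of the weekday rule used above, the sketch below applies it to a few made-up dates and counts how many fall into the weekend bucket. The dates and the `sample` and `weekend_rows` names are hypothetical and do not come from the dataset.

```python
import pandas as pd

# Hypothetical sample dates in the same day/month/year format as the dataset.
sample = pd.DataFrame({"Date": ["06/07/2015", "11/07/2015", "12/07/2015", "17/07/2015"]})
sample["Date"] = pd.to_datetime(sample["Date"], format="%d/%m/%Y")

# Same rule as above: Monday..Friday -> 1..5, Saturday/Sunday -> 6.
sample["Weekday"] = sample["Date"].dt.weekday.apply(lambda x: x + 1 if x < 5 else 6)

# Count the rows that landed in the weekend bucket.
weekend_rows = int((sample["Weekday"] == 6).sum())
print(sample["Weekday"].tolist(), weekend_rows)
```

The same comparison, `df_date['Weekday'] == 6`, could be used to inspect or filter the weekend rows if we later decide to drop them.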

In [7]:
# Drop the original 'Date' column:
df_date = df_date.drop(columns='Date')

## ID Column

In [8]:
# Create a checkpoint:
df_id = df_date.copy()

In [9]:
# Drop the 'ID' column:
df_id = df_id.drop(columns='ID')

## Reason for Absence Column

In [10]:
# Create a checkpoint:
df_reason_for_absence = df_id.copy()

In [11]:
# One-hot encode 'Reason for Absence' and ensure single reason per row:
reason_for_absence = pd.get_dummies(df_reason_for_absence['Reason for Absence']).astype(int)
reason_for_absence.sum(axis=1).unique()  # This should return array([1]) if each absence has only one reason.

array([1])

<div style="text-align: justify">
We executed the cell above only to verify that each absence has exactly one reason. With that confirmed, we can encode the 'Reason for Absence' column properly, dropping one of the resulting columns to prevent multicollinearity.
</div>
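
To see what this check does, here is a minimal sketch on made-up reason codes (the values are hypothetical). It also shows, as an aside, that a missing reason would produce a row sum of 0, so the check doubles as a guard against missing values.

```python
import pandas as pd

# Hypothetical reason codes, one per absence record.
reasons = pd.Series([0, 1, 23, 1, 28], name="Reason for Absence")
dummies = pd.get_dummies(reasons).astype(int)

# Each row should contain exactly one 1, so every row sum equals 1.
row_sums = dummies.sum(axis=1)
print(row_sums.unique())

# A missing reason yields an all-zero row, so its row sum is 0.
with_missing = pd.get_dummies(pd.Series([1, None, 23])).astype(int).sum(axis=1)
print(with_missing.tolist())
```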

In [12]:
# Drop the original 'Reason for Absence' column:
df_reason_for_absence = df_reason_for_absence.drop(columns='Reason for Absence')

<div style="text-align: justify">
We can consolidate the reason_for_absence DataFrame into broader groups according to why someone was absent from work. There are 14 reasons (Number 1 to 14) related to diseases, 3 reasons (Number 15 to 17) related to pregnancy, 4 reasons (Number 18 to 21) related to other important factors influencing health, and a final 7 reasons (Number 22 to 28) that are not related to major health issues but to preventive medicine, recovery from a disease, unjustified absence, etc. This grouping is based on inside information that I can't share because I don't own it. Although we defined 4 major categories inside the 'Reason for Absence' column, our final number of groups will be 3: Disease, Pregnancy or Other Factors, and Not-Major Health reasons.
</div>
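
The grouping described above relies on label-based slicing of columns whose labels are integer reason codes. The sketch below shows the idea on a small made-up dummy table; the `toy` table and the group names are illustrative assumptions, not the real data.

```python
import pandas as pd

# Hypothetical dummy table whose column labels are integer reason codes.
toy = pd.DataFrame(
    [[1, 0, 0, 0, 0],
     [0, 1, 0, 0, 0],
     [0, 0, 1, 0, 0],
     [0, 0, 0, 1, 0],
     [0, 0, 0, 0, 1]],
    columns=[1, 14, 16, 19, 23],
)

# .loc slices by label and includes both endpoints, so 1:14 covers codes 1..14.
disease = toy.loc[:, 1:14].sum(axis=1)
pregnancy = toy.loc[:, 15:17].sum(axis=1)
other = toy.loc[:, 18:21].sum(axis=1)
not_major = toy.loc[:, 22:28].sum(axis=1)

grouped = pd.concat([disease, pregnancy, other, not_major], axis=1)
grouped.columns = ["Disease", "Pregnancy", "Other Factors", "Not-Major"]
print(grouped)
```

Because every row belongs to exactly one group, the four group columns always sum to 1, which is why one of them has to be dropped before modeling.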

In [13]:
# Aggregate absence reasons into broader categories.
# Note: .loc slices by column label and includes both endpoints, so 0:14 also picks up code 0.
disease = reason_for_absence.loc[:, 0:14].sum(axis=1)  # Disease reasons
pregnancy = reason_for_absence.loc[:, 15:17].sum(axis=1)  # Pregnancy reasons
other_factors = reason_for_absence.loc[:, 18:21].sum(axis=1)  # Other important reasons
not_major_health_issues = reason_for_absence.loc[:, 22:28].sum(axis=1)  # Not major reasons

In [14]:
# Concatenate the newly created Series with the main DataFrame:
df_reason_for_absence = pd.concat(
    [df_reason_for_absence, disease, pregnancy, other_factors, not_major_health_issues], axis=1)

In [15]:
# Correct column names:
corrected_column_names = ['Transportation Expense', 'Distance to Work', 'Age', 'Daily Work Load Average', 
                          'Body Mass Index', 'Education', 'Children', 'Pets', 'Absenteeism Time in Hours', 
                          'Weekday Absence Occurred', 'Month Absence Occurred', 'Monthday Range Absence Occurred', 
                          'Disease Absence', 'Pregnancy Absence', 'Other Factor Absence', 'Not-Major Issue Absence']

# Assign the corrected column names to the main DataFrame:
df_reason_for_absence.columns = corrected_column_names

In [16]:
# Drop the least impactful column to prevent multicollinearity:
df_reason_for_absence = df_reason_for_absence.drop(columns='Pregnancy Absence')  

## Education Column

In [17]:
# Create a checkpoint:
df_education = df_reason_for_absence.copy()

<div style="text-align: justify">
The Education column takes values from 1 to 4, where 1 represents high school education, 2 represents graduate education, 3 represents postgraduate education, and 4 represents a master's degree or PhD. The issue is that individuals with a high school education (1) constitute 83.3% of the dataset, while the remaining three categories together account for the rest. To enhance the impact of the education column, we will combine categories 2, 3, and 4 into a single category representing education levels higher than high school.
</div>

Therefore, we'll create these groups:  
1) 0: High School Education
2) 1: Higher than High School Education

In [18]:
# Simplify 'Education' column into two categories:
df_education['Education'] = df_education['Education'].apply(lambda x: 0 if x==1 else 1) 

## Children Column

In [19]:
# Create a checkpoint:
df_children = df_education.copy()

Here we'll create these groups:  
1) 0: Zero Children
2) 1: One Child
3) 2: Two Children
4) 3: More than Two Children
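
The grouping above can be sketched on made-up counts as follows. Here `clip(upper=3)` is swapped in as a compact way to cap the values, which is an assumption that works for non-negative integer counts; the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical number of children per record.
children = pd.Series([0, 1, 2, 4, 3, 0], name="Children")

# Cap values above 2 at 3, i.e. "more than two children".
capped = children.clip(upper=3)

# drop_first=True removes the 0 category, which becomes the baseline.
dummies = pd.get_dummies(capped, drop_first=True).astype(int)
dummies = dummies.rename(
    columns={1: "Has 1 Child", 2: "Has 2 Children", 3: "Has More than 2 Children"})
print(dummies)
```

Rows with zero children end up with all three dummy columns equal to 0, so that category is still represented without a column of its own.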

In [20]:
# Group 'Children' into categories and one-hot encode:
df_children['Children'] = df_children['Children'].apply(lambda x: x if x <= 2 else 3)
children_dummies = pd.get_dummies(df_children['Children'], drop_first=True).astype(int) 

In [21]:
# Name the dummies properly:
children_dummies = children_dummies.rename(
    columns={1: 'Has 1 Child', 2: 'Has 2 Children', 3: 'Has More than 2 Children'})

In [22]:
# Drop 'Children' column from main DataFrame:
df_children = df_children.drop(columns='Children')

## Pet Column

In [23]:
# Create a checkpoint:
df_pet = df_children.copy()

Here we'll create these groups:  
1) 0: Zero Pets
2) 1: One Pet
3) 2: Two Pets
4) 3: More than Two Pets

In [24]:
# Group 'Pets' into categories and one-hot encode:
df_pet['Pets'] = df_pet['Pets'].apply(lambda x: x if x <= 2 else 3)
pet_dummies = pd.get_dummies(df_pet['Pets'], drop_first=True).astype(int) 

In [25]:
# Name the dummies properly:
pet_dummies = pet_dummies.rename(
    columns={1: 'Has 1 Pet', 2: 'Has 2 Pets', 3: 'Has More than 2 Pets'})

In [26]:
# Drop 'Pets' column from main DataFrame:
df_pet = df_pet.drop(columns='Pets')

***This is the End of Feature Preprocessing***

# Target Classification for Logistic Regression Model

We'll convert the target variable into binary classes for a Logistic Regression model.

In [27]:
# Create a checkpoint:
df_target = df_pet.copy()

<div style="text-align: justify">
We have only 700 observations, so we want to separate the target values into binary classes without upsetting the balance or losing observations. Splitting at the median is a natural way to do this, since the median ideally separates the data into two equal sets.
</div>

In [28]:
# Calculate target's median:
median = df_target['Absenteeism Time in Hours'].median()

# Apply median to create binary target values:
df_target['Absenteeism Time in Hours'] = df_target['Absenteeism Time in Hours'].apply(
    lambda x: 0 if x <= median else 1)

<div style="text-align: justify">
The target column is not perfectly balanced after the median split: the value_counts method shows that the two new classes hold roughly 46% and 54% of the observations. This is within the limit for a set to be characterized as balanced, even for a neural network model, which is stricter than a logistic regression model in terms of balanced sets.
</div>
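
The sketch below, on made-up absence durations, shows why a median split is not always exactly even: every value equal to the median falls into class 0, so ties at the median shift the balance. The `hours` values are hypothetical.

```python
import pandas as pd

# Hypothetical absence durations in hours, with several ties at the median.
hours = pd.Series([0, 1, 2, 3, 3, 3, 3, 8, 8, 16])

median = hours.median()
target = (hours > median).astype(int)  # 0 if <= median, 1 otherwise

# Share of observations in each class, ordered by class label.
shares = target.value_counts(normalize=True).sort_index()
print(median, shares.to_dict())
```

Running `value_counts(normalize=True)` on the real target column is the same check that yields the class shares quoted above.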

<div style="text-align: justify">
The new target column contains only 0s and 1s. Note that '0' doesn't mean a person wasn't absent from work, only that the absence wasn't extensive, whereas '1' means there was an extensive absence. To make this easier to interpret, we will rename the target column to 'Extensive Absenteeism Time in Hours'.
</div>

In [29]:
# Rename the target column to 'Extensive Absenteeism Time in Hours' for better interpretability:
df_target['Extensive Absenteeism Time in Hours'] = df_target['Absenteeism Time in Hours']

# Drop 'Absenteeism Time in Hours' column from main DataFrame:
df_target = df_target.drop(columns='Absenteeism Time in Hours')

***This is the End of Target Preprocessing***  

# Final DataFrame

In [30]:
# Create a checkpoint
df_final = df_target.copy()

In [31]:
# Concatenate the children and pet dummy DataFrames with the main DataFrame:
df_final = pd.concat(
    [df_final, children_dummies, pet_dummies], axis=1)

In [32]:
df_final.columns.values

array(['Transportation Expense', 'Distance to Work', 'Age',
       'Daily Work Load Average', 'Body Mass Index', 'Education',
       'Weekday Absence Occurred', 'Month Absence Occurred',
       'Monthday Range Absence Occurred', 'Disease Absence',
       'Other Factor Absence', 'Not-Major Issue Absence',
       'Extensive Absenteeism Time in Hours', 'Has 1 Child',
       'Has 2 Children', 'Has More than 2 Children', 'Has 1 Pet',
       'Has 2 Pets', 'Has More than 2 Pets'], dtype=object)

In [33]:
# Reorder the columns:
reordered_cols_list = ['Month Absence Occurred', 'Monthday Range Absence Occurred', 'Weekday Absence Occurred', 
                       'Disease Absence', 'Other Factor Absence', 'Not-Major Issue Absence', 'Education', 
                       'Has 1 Child', 'Has 2 Children', 'Has More than 2 Children', 'Has 1 Pet', 'Has 2 Pets', 
                       'Has More than 2 Pets', 'Transportation Expense', 'Distance to Work', 'Age', 
                       'Daily Work Load Average', 'Body Mass Index', 'Extensive Absenteeism Time in Hours']

# Apply the reordered column list to the main DataFrame:
df_final = df_final[reordered_cols_list]

## Export Cleaned DataFrame

In [34]:
# Export the cleaned DataFrame to a CSV file:
df_final.to_csv('cleaned_data.csv', index=False)