<a href="https://colab.research.google.com/github/krishnavarathan/python-data-analysis/blob/main/Absenteesium_Data_Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Step-by-step overview of the preprocessing workflow in this notebook.**

**`Step 1: Environment Setup and Data Loading`**

*Import Libraries:* Start by importing Pandas and NumPy to handle data structures and mathematical operations.

*Load Dataset:* Use `pd.read_csv()` to load the raw file (`Absenteeism-data.csv`).

*Initial Audit:* Inspect the first few rows to identify columns such as 'ID', 'Reason for Absence', and 'Date', then plan the cleaning strategy.

**`Step 2: Advanced Feature Engineering (Categorical Data)`**

*Create Dummies:* Convert the 'Reason for Absence' column into individual binary columns with `pd.get_dummies()`, so that each reason code becomes a 0/1 indicator rather than an arbitrary number.

*Group Reasons:* To simplify the data, group the reason codes into four main categories (Reason_1 to Reason_4) by similarity, using the code ranges 1 to 14, 15 to 17, 18 to 21, and 22 to 28. Code 0 is dropped as the baseline.

*Reconstruct Dataframe:* Drop the original 'Reason for Absence' column, then attach the four grouped columns to the main table. 'ID' is left out later, when the columns are reordered.

**`Step 3: Time-Series Transformation`**

*Object Conversion:* Convert the 'Date' column from text strings to datetime values so that pandas can perform date operations on it.

*Extract Month:* Create a new month column to see whether people miss more work at specific times of the year.

*Identify Weekdays:* Use a custom function (`date_to_weekday`) to extract the day of the week (0 = Monday, 6 = Sunday), which helps show whether absenteeism is more common on Mondays or Fridays.

**`Step 4: Data Refining and Mapping`**

*Binary Mapping:* Simplify the 'Education' column into a binary feature (0 for high school, 1 for higher education/graduate).

*Logical Reordering:* Move the reason and time features to the front of the dataframe so the table is easier to read before modeling.

**`Step 5: Final Validation and Statistics`**

*Feature Analysis:* Review numeric columns such as 'Transportation Expense', 'Body Mass Index', and 'Daily Work Load Average' to confirm they hold the expected numeric types.

*Target Preparation:* Isolate the target variable ('Absenteeism Time in Hours') so that it is correctly formatted for the final prediction phase.
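The cells below stop at the cleaned table, so target preparation is not shown as code. A minimal sketch of one way to do it, assuming a table shaped like the final checkpoint; the three-row `demo` frame reuses values from the first rows of the data, and the names `features` and `targets` are illustrative rather than taken from the notebook.

```python
import pandas as pd

# Small stand-in for the cleaned table (values from the first three rows).
demo = pd.DataFrame({
    "Reason_4": [1, 0, 1],
    "Education": [0, 0, 0],
    "Absenteeism Time in Hours": [4, 0, 2],
})

# Separate the target column from the input features.
targets = demo["Absenteeism Time in Hours"]
features = demo.drop(columns=["Absenteeism Time in Hours"])

print(features.columns.tolist())
print(targets.tolist())
```

Keeping the target in its own variable makes it harder to leak it into the inputs by accident when a model is fitted later.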

In [694]:
import numpy as np
import pandas as pd

In [695]:
data= pd.read_csv('/content/drive/MyDrive/Absenteesim/Absenteeism-data.csv')
data.head(10)

Unnamed: 0,ID,Reason for Absence,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,11,26,07/07/2015,289,36,33,239.554,30,1,2,1,4
1,36,0,14/07/2015,118,13,50,239.554,31,1,1,0,0
2,3,23,15/07/2015,179,51,38,239.554,31,1,0,0,2
3,7,7,16/07/2015,279,5,39,239.554,24,1,2,0,4
4,11,23,23/07/2015,289,36,33,239.554,30,1,2,1,2
5,3,23,10/07/2015,179,51,38,239.554,31,1,0,0,2
6,10,22,17/07/2015,361,52,28,239.554,27,1,1,4,8
7,20,23,24/07/2015,260,50,36,239.554,23,1,4,0,4
8,14,19,06/07/2015,155,12,34,239.554,25,1,2,0,40
9,1,22,13/07/2015,235,11,37,239.554,29,3,1,1,8


In [696]:
type(data)

In [697]:
df= data.copy()

In [698]:
# Display all the columns and rows
# pd.options.display.max_columns=None
# pd.options.display.max_rows=None

In [699]:
df.head(10)

Unnamed: 0,ID,Reason for Absence,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,11,26,07/07/2015,289,36,33,239.554,30,1,2,1,4
1,36,0,14/07/2015,118,13,50,239.554,31,1,1,0,0
2,3,23,15/07/2015,179,51,38,239.554,31,1,0,0,2
3,7,7,16/07/2015,279,5,39,239.554,24,1,2,0,4
4,11,23,23/07/2015,289,36,33,239.554,30,1,2,1,2
5,3,23,10/07/2015,179,51,38,239.554,31,1,0,0,2
6,10,22,17/07/2015,361,52,28,239.554,27,1,1,4,8
7,20,23,24/07/2015,260,50,36,239.554,23,1,4,0,4
8,14,19,06/07/2015,155,12,34,239.554,25,1,2,0,40
9,1,22,13/07/2015,235,11,37,239.554,29,3,1,1,8


In [700]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 700 entries, 0 to 699
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   ID                         700 non-null    int64  
 1   Reason for Absence         700 non-null    int64  
 2   Date                       700 non-null    object 
 3   Transportation Expense     700 non-null    int64  
 4   Distance to Work           700 non-null    int64  
 5   Age                        700 non-null    int64  
 6   Daily Work Load Average    700 non-null    float64
 7   Body Mass Index            700 non-null    int64  
 8   Education                  700 non-null    int64  
 9   Children                   700 non-null    int64  
 10  Pets                       700 non-null    int64  
 11  Absenteeism Time in Hours  700 non-null    int64  
dtypes: float64(1), int64(10), object(1)
memory usage: 65.8+ KB


In [701]:
df.head()

Unnamed: 0,ID,Reason for Absence,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,11,26,07/07/2015,289,36,33,239.554,30,1,2,1,4
1,36,0,14/07/2015,118,13,50,239.554,31,1,1,0,0
2,3,23,15/07/2015,179,51,38,239.554,31,1,0,0,2
3,7,7,16/07/2015,279,5,39,239.554,24,1,2,0,4
4,11,23,23/07/2015,289,36,33,239.554,30,1,2,1,2


# **Drop 'ID'**

In [702]:
x=df.drop(['ID'], axis=1)
x.head()

Unnamed: 0,Reason for Absence,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,26,07/07/2015,289,36,33,239.554,30,1,2,1,4
1,0,14/07/2015,118,13,50,239.554,31,1,1,0,0
2,23,15/07/2015,179,51,38,239.554,31,1,0,0,2
3,7,16/07/2015,279,5,39,239.554,24,1,2,0,4
4,23,23/07/2015,289,36,33,239.554,30,1,2,1,2
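Note that `drop()` returns a new DataFrame and leaves `df` unchanged, which is why 'ID' still appears in `df` in later cells. A small sketch of that behaviour, using a two-row frame built from the first rows of the data:

```python
import pandas as pd

df = pd.DataFrame({"ID": [11, 36], "Age": [33, 50]})

# drop() is not in-place by default: it returns a modified copy.
x = df.drop(["ID"], axis=1)

print("ID" in df.columns)  # True: the original frame is untouched
print("ID" in x.columns)   # False: only the returned copy lacks 'ID'
```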


# **Reason for absence**

In [703]:
r=df['Reason for Absence']

In [704]:
r.min()

0

In [705]:
r.max()

28

In [706]:
pd.unique(df['Reason for Absence'])

array([26,  0, 23,  7, 22, 19,  1, 11, 14, 21, 10, 13, 28, 18, 25, 24,  6,
       27, 17,  8, 12,  5,  9, 15,  4,  3,  2, 16])

In [707]:
len(pd.unique(df['Reason for Absence']))

28

In [708]:
# Sorting the unique values shows a gap: reason code 20 never occurs
sorted(pd.unique(df['Reason for Absence']))

[np.int64(0),
 np.int64(1),
 np.int64(2),
 np.int64(3),
 np.int64(4),
 np.int64(5),
 np.int64(6),
 np.int64(7),
 np.int64(8),
 np.int64(9),
 np.int64(10),
 np.int64(11),
 np.int64(12),
 np.int64(13),
 np.int64(14),
 np.int64(15),
 np.int64(16),
 np.int64(17),
 np.int64(18),
 np.int64(19),
 np.int64(21),
 np.int64(22),
 np.int64(23),
 np.int64(24),
 np.int64(25),
 np.int64(26),
 np.int64(27),
 np.int64(28)]
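Instead of scanning the sorted list by eye, the gap can be found with a set difference. A minimal sketch, assuming the codes are meant to cover every integer between the minimum and the maximum; the names `observed`, `expected`, and `missing_codes` are illustrative.

```python
import pandas as pd

# Codes copied from the pd.unique() output above.
observed = pd.Series([26, 0, 23, 7, 22, 19, 1, 11, 14, 21, 10, 13, 28, 18,
                      25, 24, 6, 27, 17, 8, 12, 5, 9, 15, 4, 3, 2, 16])

# Every integer between the smallest and largest code, inclusive.
expected = set(range(int(observed.min()), int(observed.max()) + 1))

# Codes in that range that never occur in the data.
missing_codes = sorted(expected - set(observed.tolist()))
print(missing_codes)  # [20]
```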

# **.get_dummies()**

In [709]:
reason_columns = pd.get_dummies(df['Reason for Absence'], dtype=int)

In [710]:
reason_columns.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,21,22,23,24,25,26,27,28
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
3,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0


In [711]:
reason_columns['check']=reason_columns.sum(axis=1)
reason_columns.head(10)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,21,22,23,24,25,26,27,28,check
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1
1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1
3,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1
5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1
6,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1
7,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1
8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1
9,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1


In [712]:
reason_columns['check'].sum(axis=0)

np.int64(700)

In [713]:
reason_columns['check'].unique()

array([1])

In [714]:
reason_columns=reason_columns.drop(['check'], axis=1)
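The same sanity check can be written without adding and then dropping a temporary 'check' column. A sketch on a handful of codes from the first rows of the data; `codes`, `dummies`, and `row_sums` are illustrative names.

```python
import pandas as pd

# Reason codes from the first five rows of the dataset.
codes = pd.Series([26, 0, 23, 7, 23])
dummies = pd.get_dummies(codes, dtype=int)

# In a full one-hot encoding every row contains exactly one 1.
row_sums = dummies.sum(axis=1)
assert (row_sums == 1).all(), "some rows have zero or several reasons set"
```

Because the sums are held in a separate Series, the dummy table itself never needs cleaning up afterwards.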

In [715]:
reason_columns = pd.get_dummies(df['Reason for Absence'], drop_first = True, dtype=int)

In [716]:
reason_columns.head(10)

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,21,22,23,24,25,26,27,28
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
3,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
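With `drop_first=True` the column for the lowest category (code 0 here) is removed, so a row whose reason was 0 becomes all zeros, as in row 1 above. A small sketch comparing the two encodings on the first five codes; `full` and `reduced` are illustrative names.

```python
import pandas as pd

codes = pd.Series([26, 0, 23, 7, 23])

full = pd.get_dummies(codes, dtype=int)
reduced = pd.get_dummies(codes, drop_first=True, dtype=int)

print(full.columns.tolist())     # [0, 7, 23, 26]
print(reduced.columns.tolist())  # [7, 23, 26]

# The row whose code was 0 now sums to 0 instead of 1.
print(reduced.sum(axis=1).tolist())  # [1, 0, 1, 1, 1]
```

Dropping one category avoids a redundant column, since its value is implied by the others all being 0.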


## Group the Reasons for Absence:

In [717]:
df.columns.values

array(['ID', 'Reason for Absence', 'Date', 'Transportation Expense',
       'Distance to Work', 'Age', 'Daily Work Load Average',
       'Body Mass Index', 'Education', 'Children', 'Pets',
       'Absenteeism Time in Hours'], dtype=object)

In [718]:
reason_columns.columns.values

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 21, 22, 23, 24, 25, 26, 27, 28])

In [719]:
# Dropping the "Reason for Absence" column from df
df = df.drop(['Reason for Absence'], axis=1)

In [720]:
df.head()

Unnamed: 0,ID,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,11,07/07/2015,289,36,33,239.554,30,1,2,1,4
1,36,14/07/2015,118,13,50,239.554,31,1,1,0,0
2,3,15/07/2015,179,51,38,239.554,31,1,0,0,2
3,7,16/07/2015,279,5,39,239.554,24,1,2,0,4
4,11,23/07/2015,289,36,33,239.554,30,1,2,1,2


In [721]:
# Preview the first group: 1 if any of reason columns 1 to 14 is set for the row, else 0
reason_columns.loc[:, 1:14].max(axis=1).head(10)

Unnamed: 0,0
0,0
1,0
2,0
3,1
4,0
5,0
6,0
7,0
8,0
9,0


In [722]:
# Grouping the Reason for Absence column
reason_type_1 = reason_columns.loc[:, 1:14].max(axis=1)
reason_type_2 = reason_columns.loc[:, 15:17].max(axis=1)
reason_type_3 = reason_columns.loc[:, 18:21].max(axis=1)
reason_type_4 = reason_columns.loc[:, 22:].max(axis=1)

In [723]:
reason_type_4.head()

Unnamed: 0,0
0,1
1,0
2,1
3,0
4,1
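The grouping relies on two details: `.loc` slices by column label with both endpoints included, and `max(axis=1)` returns 1 when any column in the slice is set. A minimal sketch on a toy set of codes chosen to sit on the group boundaries; the `group_*` names are illustrative.

```python
import pandas as pd

# One row per code; the codes are the boundary values of the four groups.
codes = pd.Series([1, 14, 15, 17, 18, 21, 22, 28])
dummies = pd.get_dummies(codes, dtype=int)

# Label-based slices: 1:14 includes column 14, 18:21 includes column 21.
group_1 = dummies.loc[:, 1:14].max(axis=1)
group_2 = dummies.loc[:, 15:17].max(axis=1)
group_3 = dummies.loc[:, 18:21].max(axis=1)
group_4 = dummies.loc[:, 22:].max(axis=1)

print(group_1.tolist())  # [1, 1, 0, 0, 0, 0, 0, 0]
print(group_4.tolist())  # [0, 0, 0, 0, 0, 0, 1, 1]
```

In the notebook's data there is no column 20, so the 18 to 21 slice covers columns 18, 19, and 21 only.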


## Concatenate Column Values

In [724]:
df=pd.concat([df, reason_type_1, reason_type_2, reason_type_3, reason_type_4], axis=1 )
df.head()

Unnamed: 0,ID,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours,0,1,2,3
0,11,07/07/2015,289,36,33,239.554,30,1,2,1,4,0,0,0,1
1,36,14/07/2015,118,13,50,239.554,31,1,1,0,0,0,0,0,0
2,3,15/07/2015,179,51,38,239.554,31,1,0,0,2,0,0,0,1
3,7,16/07/2015,279,5,39,239.554,24,1,2,0,4,1,0,0,0
4,11,23/07/2015,289,36,33,239.554,30,1,2,1,2,0,0,0,1


In [725]:
column_names=df.columns.values
column_names

array(['ID', 'Date', 'Transportation Expense', 'Distance to Work', 'Age',
       'Daily Work Load Average', 'Body Mass Index', 'Education',
       'Children', 'Pets', 'Absenteeism Time in Hours', 0, 1, 2, 3],
      dtype=object)

In [726]:
column_names=['ID', 'Date', 'Transportation Expense', 'Distance to Work', 'Age',
       'Daily Work Load Average', 'Body Mass Index', 'Education',
       'Children', 'Pets', 'Absenteeism Time in Hours', 'Reason_1', 'Reason_2', 'Reason_3', 'Reason_4']

In [727]:
df.columns= column_names
df.head()

Unnamed: 0,ID,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours,Reason_1,Reason_2,Reason_3,Reason_4
0,11,07/07/2015,289,36,33,239.554,30,1,2,1,4,0,0,0,1
1,36,14/07/2015,118,13,50,239.554,31,1,1,0,0,0,0,0,0
2,3,15/07/2015,179,51,38,239.554,31,1,0,0,2,0,0,0,1
3,7,16/07/2015,279,5,39,239.554,24,1,2,0,4,1,0,0,0
4,11,23/07/2015,289,36,33,239.554,30,1,2,1,2,0,0,0,1
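An alternative to retyping the full list of column names is to name each Series before concatenating. This is only a sketch of that option, not what the notebook does; the three-row `base` frame reuses the first IDs and group values from above.

```python
import pandas as pd

base = pd.DataFrame({"ID": [11, 36, 3]})
reason_type_1 = pd.Series([0, 0, 0])
reason_type_4 = pd.Series([1, 0, 1])

# A named Series becomes a column with that name when concatenated.
combined = pd.concat(
    [base, reason_type_1.rename("Reason_1"), reason_type_4.rename("Reason_4")],
    axis=1,
)
print(combined.columns.tolist())  # ['ID', 'Reason_1', 'Reason_4']
```

This removes the risk of a typo or a misplaced name when the column list is rewritten by hand.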


## Reorder Columns

In [728]:
reordered_colum_names=['Reason_1', 'Reason_2', 'Reason_3', 'Reason_4','ID', 'Date', 'Transportation Expense', 'Distance to Work', 'Age',
       'Daily Work Load Average', 'Body Mass Index', 'Education',
       'Children', 'Pets', 'Absenteeism Time in Hours']

In [729]:
df = df[reordered_colum_names]
df.head()

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,ID,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,0,0,0,1,11,07/07/2015,289,36,33,239.554,30,1,2,1,4
1,0,0,0,0,36,14/07/2015,118,13,50,239.554,31,1,1,0,0
2,0,0,0,1,3,15/07/2015,179,51,38,239.554,31,1,0,0,2
3,1,0,0,0,7,16/07/2015,279,5,39,239.554,24,1,2,0,4
4,0,0,0,1,11,23/07/2015,289,36,33,239.554,30,1,2,1,2


## Create a Checkpoint

In [730]:
df_reason_mod = df.copy()

In [731]:
df_reason_mod.head()

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,ID,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,0,0,0,1,11,07/07/2015,289,36,33,239.554,30,1,2,1,4
1,0,0,0,0,36,14/07/2015,118,13,50,239.554,31,1,1,0,0
2,0,0,0,1,3,15/07/2015,179,51,38,239.554,31,1,0,0,2
3,1,0,0,0,7,16/07/2015,279,5,39,239.554,24,1,2,0,4
4,0,0,0,1,11,23/07/2015,289,36,33,239.554,30,1,2,1,2


# 'Date':

In [732]:
type(df_reason_mod['Date'])

In [733]:
df_reason_mod['Date'].head()

Unnamed: 0,Date
0,07/07/2015
1,14/07/2015
2,15/07/2015
3,16/07/2015
4,23/07/2015


In [734]:
df_reason_mod['Date'] = pd.to_datetime(df_reason_mod['Date'], format = '%d/%m/%Y')

In [735]:
df_reason_mod['Date'].head()

Unnamed: 0,Date
0,2015-07-07
1,2015-07-14
2,2015-07-15
3,2015-07-16
4,2015-07-23


In [736]:
type(df_reason_mod['Date'][0])

pandas._libs.tslibs.timestamps.Timestamp
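Passing an explicit `format` matters for this dataset because the dates are day-first. A short sketch with three dates from the first rows; `dates` and `parsed` are illustrative names.

```python
import pandas as pd

dates = pd.Series(["07/07/2015", "14/07/2015", "10/07/2015"])

# '%d/%m/%Y' states that the day comes first, then the month.
parsed = pd.to_datetime(dates, format="%d/%m/%Y")

# "10/07/2015" is read as 10 July, not 7 October.
print(parsed.dt.day.tolist())    # [7, 14, 10]
print(parsed.dt.month.tolist())  # [7, 7, 7]
```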

In [737]:
# Note that the Date dtype has changed to datetime64
df_reason_mod.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 700 entries, 0 to 699
Data columns (total 15 columns):
 #   Column                     Non-Null Count  Dtype         
---  ------                     --------------  -----         
 0   Reason_1                   700 non-null    int64         
 1   Reason_2                   700 non-null    int64         
 2   Reason_3                   700 non-null    int64         
 3   Reason_4                   700 non-null    int64         
 4   ID                         700 non-null    int64         
 5   Date                       700 non-null    datetime64[ns]
 6   Transportation Expense     700 non-null    int64         
 7   Distance to Work           700 non-null    int64         
 8   Age                        700 non-null    int64         
 9   Daily Work Load Average    700 non-null    float64       
 10  Body Mass Index            700 non-null    int64         
 11  Education                  700 non-null    int64         
 12  Children

## Extract the Month Value:

In [738]:
df_reason_mod['Date'][0]

Timestamp('2015-07-07 00:00:00')

In [739]:
# A Timestamp exposes attributes such as .month and .day
df_reason_mod['Date'][0].month

7

In [740]:
months_column=[]

In [741]:
df_reason_mod.shape

(700, 15)

In [742]:
# df_reason_mod.shape[0] is the number of rows
for i in range(df_reason_mod.shape[0]):
  months_column.append(df_reason_mod['Date'][i].month)

In [743]:
months_column[1:10]

[7, 7, 7, 7, 7, 7, 7, 7, 7]

In [744]:
df_reason_mod['Month Column']=months_column

In [745]:
df_reason_mod.tail(10)

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,ID,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours,Month Column
690,0,0,0,0,23,2018-05-16,378,49,36,237.656,21,1,2,4,0,5
691,0,1,0,0,17,2018-05-18,179,22,40,237.656,22,2,2,0,1,5
692,1,0,0,0,14,2018-05-21,155,12,34,237.656,25,1,2,0,48,5
693,1,0,0,0,25,2018-05-21,235,16,32,237.656,25,3,0,0,8,5
694,0,0,0,1,15,2018-05-23,291,31,40,237.656,25,1,1,1,8,5
695,1,0,0,0,17,2018-05-23,179,22,40,237.656,22,2,2,0,8,5
696,1,0,0,0,28,2018-05-23,225,26,28,237.656,24,1,1,2,3,5
697,1,0,0,0,18,2018-05-24,330,16,28,237.656,25,2,0,0,8,5
698,0,0,0,1,25,2018-05-24,235,16,32,237.656,25,3,0,0,2,5
699,0,0,0,1,15,2018-05-31,291,31,40,237.656,25,1,1,1,2,5
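The loop above works, but pandas also offers the `.dt` accessor, which extracts the month for a whole datetime column in one step. A minimal sketch on three dates that appear in the data; `dates`, `months_loop`, and `months_vectorized` are illustrative names.

```python
import pandas as pd

dates = pd.to_datetime(
    pd.Series(["07/07/2015", "16/10/2015", "31/05/2018"]),
    format="%d/%m/%Y",
)

# Loop version, as in the cells above.
months_loop = []
for i in range(dates.shape[0]):
    months_loop.append(dates[i].month)

# Vectorized version using the .dt accessor.
months_vectorized = dates.dt.month

print(months_loop)                 # [7, 10, 5]
print(months_vectorized.tolist())  # [7, 10, 5]
```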


## Extract the Day of the Week:

In [746]:
# weekday() returns the day of the week as an integer: 0-4 are Monday to Friday, 5-6 are Saturday and Sunday
df_reason_mod['Date'][67].weekday()

4

In [747]:
df_reason_mod['Date'][67]

Timestamp('2015-10-16 00:00:00')

In [748]:
def date_to_weekday(date_value):
  return date_value.weekday()

In [749]:
df_reason_mod['Day of the week']=df_reason_mod['Date'].apply(date_to_weekday)
df_reason_mod.head()

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,ID,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours,Month Column,Day of the week
0,0,0,0,1,11,2015-07-07,289,36,33,239.554,30,1,2,1,4,7,1
1,0,0,0,0,36,2015-07-14,118,13,50,239.554,31,1,1,0,0,7,1
2,0,0,0,1,3,2015-07-15,179,51,38,239.554,31,1,0,0,2,7,2
3,1,0,0,0,7,2015-07-16,279,5,39,239.554,24,1,2,0,4,7,3
4,0,0,0,1,11,2015-07-23,289,36,33,239.554,30,1,2,1,2,7,3


In [750]:
# Another way to extract the weekday: loop over every row
new=[]
for i in range(df_reason_mod.shape[0]):
  new.append(df_reason_mod['Date'][i].weekday())

# Check that the loop gives the same values as the apply() version
np.array_equal(new, df_reason_mod['Day of the week'])

True
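A third option is the `.dt` accessor, which needs neither a helper function nor a loop. A sketch on three dates whose weekdays are shown in the outputs above; the variable names are illustrative.

```python
import pandas as pd

dates = pd.to_datetime(
    pd.Series(["07/07/2015", "15/07/2015", "16/10/2015"]),
    format="%d/%m/%Y",
)

# Integer weekday for the whole column: 0 = Monday, 6 = Sunday.
weekdays = dates.dt.weekday
print(weekdays.tolist())  # [1, 2, 4]

# Readable names, useful when checking the numbers by eye.
print(dates.dt.day_name().tolist())  # ['Tuesday', 'Wednesday', 'Friday']
```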

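**Exercise:** Drop the 'Date' column, since the month and weekday have already been extracted, then move the two new columns next to the reason columns.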

In [751]:
df_reason_mod = df_reason_mod.drop(['Date'], axis = 1)
df_reason_mod.head()

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,ID,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours,Month Column,Day of the week
0,0,0,0,1,11,289,36,33,239.554,30,1,2,1,4,7,1
1,0,0,0,0,36,118,13,50,239.554,31,1,1,0,0,7,1
2,0,0,0,1,3,179,51,38,239.554,31,1,0,0,2,7,2
3,1,0,0,0,7,279,5,39,239.554,24,1,2,0,4,7,3
4,0,0,0,1,11,289,36,33,239.554,30,1,2,1,2,7,3


In [752]:
df_reason_mod.columns.values

array(['Reason_1', 'Reason_2', 'Reason_3', 'Reason_4', 'ID',
       'Transportation Expense', 'Distance to Work', 'Age',
       'Daily Work Load Average', 'Body Mass Index', 'Education',
       'Children', 'Pets', 'Absenteeism Time in Hours', 'Month Column',
       'Day of the week'], dtype=object)

In [753]:
column_names_upd = ['Reason_1', 'Reason_2', 'Reason_3', 'Reason_4', 'Month Column', 'Day of the week',
       'Transportation Expense', 'Distance to Work', 'Age',
       'Daily Work Load Average', 'Body Mass Index', 'Education', 'Children',
       'Pets', 'Absenteeism Time in Hours']

In [754]:
# Reordering the columns (this selection also leaves out 'ID')
df_reason_mod=df_reason_mod[column_names_upd]
df_reason_mod.head()

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month Column,Day of the week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,0,0,0,1,7,1,289,36,33,239.554,30,1,2,1,4
1,0,0,0,0,7,1,118,13,50,239.554,31,1,1,0,0
2,0,0,0,1,7,2,179,51,38,239.554,31,1,0,0,2
3,1,0,0,0,7,3,279,5,39,239.554,24,1,2,0,4
4,0,0,0,1,7,3,289,36,33,239.554,30,1,2,1,2


In [755]:
# Checkpoint: continue working on a copy
df_reason_date_mod = df_reason_mod.copy()
df_reason_date_mod.head()

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month Column,Day of the week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,0,0,0,1,7,1,289,36,33,239.554,30,1,2,1,4
1,0,0,0,0,7,1,118,13,50,239.554,31,1,1,0,0
2,0,0,0,1,7,2,179,51,38,239.554,31,1,0,0,2
3,1,0,0,0,7,3,279,5,39,239.554,24,1,2,0,4
4,0,0,0,1,7,3,289,36,33,239.554,30,1,2,1,2


In [756]:
type(df_reason_date_mod['Transportation Expense'][0])

numpy.int64

In [757]:
type(df_reason_date_mod['Distance to Work'][0])

numpy.int64

In [758]:
type(df_reason_date_mod['Age'][0])

numpy.int64

In [759]:
type(df_reason_date_mod['Daily Work Load Average'][0])

numpy.float64

In [760]:
type(df_reason_date_mod['Body Mass Index'][0])

numpy.int64

## 'Education', 'Children', 'Pets'

In [761]:
display(df_reason_date_mod.head())

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month Column,Day of the week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,0,0,0,1,7,1,289,36,33,239.554,30,1,2,1,4
1,0,0,0,0,7,1,118,13,50,239.554,31,1,1,0,0
2,0,0,0,1,7,2,179,51,38,239.554,31,1,0,0,2
3,1,0,0,0,7,3,279,5,39,239.554,24,1,2,0,4
4,0,0,0,1,7,3,289,36,33,239.554,30,1,2,1,2


In [762]:
df_reason_date_mod['Education'].unique()

array([1, 3, 2, 4])

In [763]:
df_reason_date_mod['Education'].value_counts()

Unnamed: 0_level_0,count
Education,Unnamed: 1_level_1
1,583
3,73
2,40
4,4


In [764]:
# Mapping the Education values into 1, 0
# 1 => 0
# 2,3,4 => 1
df_reason_date_mod['Education']=df_reason_date_mod['Education'].map({1:0, 2:1, 3:1, 4:1})

In [765]:
df_reason_date_mod['Education'].unique()

array([0, 1])

In [766]:
df_reason_date_mod['Education'].value_counts()

Unnamed: 0_level_0,count
Education,Unnamed: 1_level_1
0,583
1,117


## Final Checkpoint

In [767]:
df_cleaned = df_reason_date_mod.copy()
df_cleaned.head(10)

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month Column,Day of the week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,0,0,0,1,7,1,289,36,33,239.554,30,0,2,1,4
1,0,0,0,0,7,1,118,13,50,239.554,31,0,1,0,0
2,0,0,0,1,7,2,179,51,38,239.554,31,0,0,0,2
3,1,0,0,0,7,3,279,5,39,239.554,24,0,2,0,4
4,0,0,0,1,7,3,289,36,33,239.554,30,0,2,1,2
5,0,0,0,1,7,4,179,51,38,239.554,31,0,0,0,2
6,0,0,0,1,7,4,361,52,28,239.554,27,0,1,4,8
7,0,0,0,1,7,4,260,50,36,239.554,23,0,4,0,4
8,0,0,1,0,7,0,155,12,34,239.554,25,0,2,0,40
9,0,0,0,1,7,0,235,11,37,239.554,29,1,1,1,8
