# Dataset Introduction

The dataset used in this book is [IBM HR Analytics Employee Attrition & Performance](https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset), hosted on Kaggle. It was uploaded about four years ago and has not been revised since. The data file is about 222 kB and contains 35 columns. The dataset was published primarily for predicting employee attrition.

In [9]:
# Import the necessary packages
import pandas as pd 
import json

In [10]:
# Load the data
df = pd.read_csv('./../../../data/data.csv')

In [11]:
df.head()

| |Age|Attrition|BusinessTravel|DailyRate|Department|DistanceFromHome|Education|EducationField|EmployeeCount|EmployeeNumber|...|RelationshipSatisfaction|StandardHours|StockOptionLevel|TotalWorkingYears|TrainingTimesLastYear|WorkLifeBalance|YearsAtCompany|YearsInCurrentRole|YearsSinceLastPromotion|YearsWithCurrManager|
|:--:|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--:|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|
|0|41|Yes|Travel_Rarely|1102|Sales|1|College|Life Sciences|1|1|...|Low|80|0|8|0|Bad|6|4|0|5|
|1|49|No|Travel_Frequently|279|Research & Development|8|Below College|Life Sciences|1|2|...|Very High|80|1|10|3|Better|10|7|1|7|
|2|37|Yes|Travel_Rarely|1373|Research & Development|2|College|Other|1|4|...|Medium|80|0|7|3|Better|0|0|0|0|
|3|33|No|Travel_Frequently|1392|Research & Development|3|Master|Life Sciences|1|5|...|High|80|0|8|3|Better|8|7|3|0|
|4|27|No|Travel_Rarely|591|Research & Development|2|Below College|Medical|1|7|...|Very High|80|1|6|3|Better|2|2|2|2|


In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Age                       1470 non-null   int64 
 1   Attrition                 1470 non-null   object
 2   BusinessTravel            1470 non-null   object
 3   DailyRate                 1470 non-null   int64 
 4   Department                1470 non-null   object
 5   DistanceFromHome          1470 non-null   int64 
 6   Education                 1470 non-null   object
 7   EducationField            1470 non-null   object
 8   EmployeeCount             1470 non-null   int64 
 9   EmployeeNumber            1470 non-null   int64 
 10  EnvironmentSatisfaction   1470 non-null   object
 11  Gender                    1470 non-null   object
 12  HourlyRate                1470 non-null   int64 
 13  JobInvolvement            1470 non-null   object
 14  JobLevel                

The dataset has 1470 entries and 35 columns, and it contains no null values.  
Of these columns, 19 are numerical and 16 are categorical.  
Almost all the columns are self-explanatory, but we will still look at each one briefly.
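
The counts above can be checked directly with pandas. The following is a minimal sketch on a small stand-in frame: the name `toy` and its values are illustrative assumptions, not rows from the actual file. The same calls should apply unchanged to `df`.

```python
import pandas as pd

# Toy frame standing in for the real data (values are illustrative only)
toy = pd.DataFrame({
    "Age": [41, 49, 37],
    "Attrition": ["Yes", "No", "Yes"],
    "DailyRate": [1102, 279, 1373],
    "Department": ["Sales", "Research & Development", "Research & Development"],
})

# Total number of missing values across all columns
n_missing = int(toy.isnull().sum().sum())

# Split the columns by dtype: numeric versus everything else
numerical = toy.select_dtypes(include="number").columns.tolist()
categorical = toy.select_dtypes(exclude="number").columns.tolist()

print(n_missing, len(numerical), len(categorical))
```

Splitting on dtype treats every non-numeric column as categorical, which matches how `df.info()` reports such columns as `object`.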

|Column Name|Description|
|:--:|:--|
|Age|Age of the employee.|
|Attrition|Whether the employee left the firm or not. This is the target variable for attrition prediction.|
|BusinessTravel|Whether the employee needs to travel for business purposes or not.|
|DailyRate|Daily rate of the employee for the work.|
|DistanceFromHome|Distance of the office from the employee's home.|
|Education|Highest level of education the employee completed.|
|EducationField|Field of study during the education.|
|EmployeeCount|*This column provides no information.*|
|EmployeeNumber|Unique identifier of employee.|
|EnvironmentSatisfaction|Satisfaction level of employee regarding the environment in the office.|
|Gender|Gender of the employee.|
|HourlyRate|Hourly Rate of the employee.|
|JobInvolvement|Satisfaction level of employee regarding their involvement during the employment.|
|JobLevel|Level of employee in the hierarchy of promotion.|
|JobSatisfaction|Overall job satisfaction level of employee.|
|MaritalStatus|Whether the employee is married or not.|
|MonthlyIncome|Monthly income of the employee.|
|MonthlyRate|Monthly rate of the employee.|
|NumCompaniesWorked|Number of companies that the employee worked in.|
|Over18|*This column provides no information.*|
|OverTime|Whether the employee needs to do overtime or not.|
|PercentSalaryHike|Recent percentage hike in the salary.|
|PerformanceRating|Recent performance rating that was awarded.|
|RelationshipSatisfaction|Satisfaction level regarding the employee's professional relationships in the company.|
|StandardHours|Average number of hours of work that the employee puts in every day.|
|StockOptionLevel|Level of stock options.|
|TotalWorkingYears|Total experience of employee in years.|
|TrainingTimesLastYear|Number of times employee was trained in the previous year.|
|WorkLifeBalance|Satisfaction level with regards to work life balance.|
|YearsAtCompany|Number of years the employee has worked at the current company.|
|YearsInCurrentRole|Number of years the employee has worked in the current role.|
|YearsSinceLastPromotion|Number of years since the employee was last promoted.|
|YearsWithCurrManager|Number of years the employee worked for the current manager.|

It is better to remove columns that are not needed and contribute no information to the dataset.
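
One common way to spot such columns is to look for those that hold a single value in every row. The sketch below assumes that "provides no information" means "constant"; the frame `toy` and its values are made up for illustration.

```python
import pandas as pd

# Toy frame: two columns vary, two hold the same value in every row
toy = pd.DataFrame({
    "Age": [41, 49, 37],
    "EmployeeCount": [1, 1, 1],
    "Over18": ["Y", "Y", "Y"],
    "Attrition": ["Yes", "No", "Yes"],
})

# A column with at most one distinct value cannot help distinguish rows
constant_columns = [
    col for col in toy.columns if toy[col].nunique(dropna=False) <= 1
]

print(constant_columns)
```

Passing `dropna=False` counts a missing value as a distinct value, so a column that mixes one constant with missing entries is not flagged by mistake.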

In [13]:
# Drop unnecessary columns
df.drop(['EmployeeCount', 'Over18'], axis=1, inplace=True)

Remove the dropped columns from the lists of numerical and categorical columns as well, so that the lists stay in sync with the DataFrame.

In [14]:
# Load the static lists
with open('./../../../data/statics.json') as f:
    statics = json.load(f)
categorical_columns = statics['categorical_columns']
numerical_columns = statics['numerical_columns']

# Remove the dropped columns from the lists
categorical_columns.remove('Over18')
numerical_columns.remove('EmployeeCount')
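
After editing the lists by hand, it can be worth confirming that they still agree with the DataFrame. This is a minimal sketch of such a check. The frame `toy`, the list contents, and the names `dropped`, `missing_from_lists`, and `stale_in_lists` are hypothetical and do not reflect the actual contents of `statics.json`.

```python
import pandas as pd

# Toy frame after the uninformative columns have been dropped
toy = pd.DataFrame({
    "Age": [41, 49],
    "Attrition": ["Yes", "No"],
    "Department": ["Sales", "Sales"],
})

# Hypothetical column lists that still mention the dropped columns
categorical_columns = ["Attrition", "Department", "Over18"]
numerical_columns = ["Age", "EmployeeCount"]
dropped = ["Over18", "EmployeeCount"]

# Filtering does not raise if a name is absent, unlike list.remove
categorical_columns = [c for c in categorical_columns if c not in dropped]
numerical_columns = [c for c in numerical_columns if c not in dropped]

# Both differences should be empty when lists and frame agree
listed = set(categorical_columns) | set(numerical_columns)
missing_from_lists = set(toy.columns) - listed
stale_in_lists = listed - set(toy.columns)

print(missing_from_lists, stale_in_lists)
```

Filtering with a comprehension also makes the cell safe to re-run, whereas a second call to `list.remove` raises `ValueError` once the name is gone.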

In [15]:
# Save the processed data
df.to_csv('./../../../data/cleaned_data.csv', index=False)
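
Passing `index=False` matters when the file is read back later. The sketch below uses an in-memory buffer instead of a file on disk, with a made-up frame `toy`, to compare the columns recovered with and without the saved index.

```python
import io

import pandas as pd

toy = pd.DataFrame({"Age": [41, 49], "Attrition": ["Yes", "No"]})

# Default behaviour: the row index is written as an extra, unnamed column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False writes only the data columns
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```

With the default setting, the reloaded frame gains an extra column that pandas names `Unnamed: 0`, which would then have to be dropped or passed as `index_col` on every load.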