# Encoding Categorical Data


## Dataset Description

The data provided is based on the MNC Company HR data from Kaggle.

The columns in the dataset are the following:

* `Emp_Id`: Id of Employee.
* `satisfaction_level`: Satisfaction level of the employee as a percentage. 100% (or 1) is very satisfied; 0% (or 0) is not satisfied.
* `last_evaluation`: Score from the employee's last evaluation, stored as a percentage.
* `number_project`: Number of projects an employee is working on.
* `average_montly_hours`: Average monthly hours worked by the employee over the last 3 months.
* `time_spend_company`: Time the employee has spent at the company, in years.
* `Work_accident`: If the employee was involved in a work accident.
* `left`: If the employee has left the company.
* `promotion_last_5years`: If the employee has a promotion in the past 5 years.
* `Department`: Department employee is working in.
* `salary`: Salary range (`low`, `medium`, or `high`).



In [1]:
# Initial imports
import pandas as pd
from pathlib import Path
from sklearn.preprocessing import LabelEncoder


## Loading the Data

Load the `HR_Data.csv` data into a Pandas DataFrame.

In [2]:
file_path = Path("HR_Data.csv")
hr_df = pd.read_csv(file_path)
hr_df.head()


Unnamed: 0,Emp_Id,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,Department,salary
0,IND02438,38%,53%,2,157,3,0,1,0,sales,low
1,IND28133,80%,86%,5,262,6,0,1,0,sales,medium
2,IND07164,11%,88%,7,272,4,0,1,0,sales,medium
3,IND30478,72%,87%,5,223,5,0,1,0,sales,low
4,IND24003,37%,52%,2,159,3,0,1,0,sales,low


In [3]:
# Convert the percentage strings (e.g. "38%") to floats between 0 and 1
hr_df['last_evaluation'] = hr_df['last_evaluation'].str.rstrip('%').astype('float') / 100.0
hr_df['satisfaction_level'] = hr_df['satisfaction_level'].str.rstrip('%').astype('float') / 100.0
hr_df

Unnamed: 0,Emp_Id,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,Department,salary
0,IND02438,0.38,0.53,2,157,3,0,1,0,sales,low
1,IND28133,0.80,0.86,5,262,6,0,1,0,sales,medium
2,IND07164,0.11,0.88,7,272,4,0,1,0,sales,medium
3,IND30478,0.72,0.87,5,223,5,0,1,0,sales,low
4,IND24003,0.37,0.52,2,159,3,0,1,0,sales,low
...,...,...,...,...,...,...,...,...,...,...,...
14994,IND40221,0.40,0.57,2,151,3,0,1,0,support,low
14995,IND24196,0.37,0.48,2,160,3,0,1,0,support,low
14996,IND33544,0.37,0.53,2,143,3,0,1,0,support,low
14997,IND40533,0.11,0.96,6,280,4,0,1,0,support,low
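The conversion above can be tried in isolation on a small frame. The following is a minimal sketch, assuming both columns hold strings that end in `%`; the `toy` frame and its values are made up for illustration and are not taken from `HR_Data.csv`.

```python
import pandas as pd

# Hypothetical toy data that mimics the percentage-string columns
toy = pd.DataFrame({
    "satisfaction_level": ["38%", "80%", "11%"],
    "last_evaluation": ["53%", "86%", "88%"],
})

pct_cols = ["satisfaction_level", "last_evaluation"]
for col in pct_cols:
    # Strip the trailing "%" and rescale from 0-100 to 0-1
    toy[col] = toy[col].str.rstrip("%").astype(float) / 100.0

print(toy.dtypes)
```

Looping over a list of column names keeps the rule in one place if more percentage columns turn up later.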


## One-Hot Encoding

### Encoding Data using `get_dummies()`

Perform a one-hot encoding (one binary indicator column per category) on the `Department` and `salary` columns using the Pandas `get_dummies()` function.

In [4]:
# One-hot encode the Department and salary columns
hr_df = pd.get_dummies(hr_df, columns=["Department", "salary"])
hr_df.head()


Unnamed: 0,Emp_Id,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,Department_IT,...,Department_hr,Department_management,Department_marketing,Department_product_mng,Department_sales,Department_support,Department_technical,salary_high,salary_low,salary_medium
0,IND02438,0.38,0.53,2,157,3,0,1,0,0,...,0,0,0,0,1,0,0,0,1,0
1,IND28133,0.8,0.86,5,262,6,0,1,0,0,...,0,0,0,0,1,0,0,0,0,1
2,IND07164,0.11,0.88,7,272,4,0,1,0,0,...,0,0,0,0,1,0,0,0,0,1
3,IND30478,0.72,0.87,5,223,5,0,1,0,0,...,0,0,0,0,1,0,0,0,1,0
4,IND24003,0.37,0.52,2,159,3,0,1,0,0,...,0,0,0,0,1,0,0,0,1,0
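To see what `get_dummies()` does on a smaller scale, here is a minimal sketch on a made-up three-row frame (the `toy` data is an assumption, not part of the dataset). Recent pandas releases (2.0 and later) return boolean indicator columns by default, so `dtype=int` is passed here to get the 0/1 integers shown in the output above.

```python
import pandas as pd

# Hypothetical toy data with the same two categorical columns
toy = pd.DataFrame({
    "Department": ["sales", "support", "sales"],
    "salary": ["low", "medium", "high"],
})

# Each category becomes its own indicator column named <column>_<category>
encoded = pd.get_dummies(toy, columns=["Department", "salary"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Each row ends up with exactly one `1` among the `Department_*` columns and one among the `salary_*` columns.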


In [5]:
# Drop the Emp_Id column, which the models do not need
hr_df = hr_df.drop(columns=["Emp_Id"])


In [6]:
hr_df.head()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,Department_IT,Department_RandD,...,Department_hr,Department_management,Department_marketing,Department_product_mng,Department_sales,Department_support,Department_technical,salary_high,salary_low,salary_medium
0,0.38,0.53,2,157,3,0,1,0,0,0,...,0,0,0,0,1,0,0,0,1,0
1,0.8,0.86,5,262,6,0,1,0,0,0,...,0,0,0,0,1,0,0,0,0,1
2,0.11,0.88,7,272,4,0,1,0,0,0,...,0,0,0,0,1,0,0,0,0,1
3,0.72,0.87,5,223,5,0,1,0,0,0,...,0,0,0,0,1,0,0,0,1,0
4,0.37,0.52,2,159,3,0,1,0,0,0,...,0,0,0,0,1,0,0,0,1,0


## Column Headers After Encoding the `Department` and `salary` Columns

In [7]:
hr_df.columns

Index(['satisfaction_level', 'last_evaluation', 'number_project',
       'average_montly_hours', 'time_spend_company', 'Work_accident', 'left',
       'promotion_last_5years', 'Department_IT', 'Department_RandD',
       'Department_accounting', 'Department_hr', 'Department_management',
       'Department_marketing', 'Department_product_mng', 'Department_sales',
       'Department_support', 'Department_technical', 'salary_high',
       'salary_low', 'salary_medium'],
      dtype='object')
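One-hot encoding turns `salary` into three unrelated columns. Since the salary range has a natural order, an integer (ordinal) encoding is a possible alternative. The sketch below is not what this notebook does; it would have to run before `get_dummies()`, and the mapping `low < medium < high` and the column name `salary_encoded` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy data; the real column would come from HR_Data.csv
toy = pd.DataFrame({"salary": ["low", "medium", "high", "low"]})

# Assumed ordering of the salary categories
salary_order = {"low": 0, "medium": 1, "high": 2}

# Map each category to its integer rank
toy["salary_encoded"] = toy["salary"].map(salary_order)

print(toy)
```

An explicit mapping is used instead of scikit-learn's `LabelEncoder` because `LabelEncoder` assigns codes in sorted (alphabetical) order, which would place `high` before `low`.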

## Save the Preprocessed File

Finally, save the preprocessed file as `emp_data_encoded.csv`.

In [8]:
# Save the file for use in other models
file_path = Path("emp_data_encoded.csv")
hr_df.to_csv(file_path, index=False)
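A quick way to check the saved file is to read it back and compare it with the frame in memory. The following is a minimal sketch on a made-up frame written to a temporary directory; the file name `toy_encoded.csv` and the data are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical toy data standing in for the encoded DataFrame
toy = pd.DataFrame({"left": [1, 0], "salary_low": [1, 0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "toy_encoded.csv"
    # index=False keeps the row index out of the file, so no extra
    # "Unnamed: 0" column appears when the file is read back
    toy.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

print(reloaded.equals(toy))
```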

