# HR Analytics: Employee Promotion

**Content**

A large MNC has 9 broad verticals across the organisation. One of its problems is identifying the right people for promotion (only for manager positions and below) and preparing them in time.

Final promotions are announced only after the evaluation, which delays the transition to new roles. Hence, the company needs help identifying eligible candidates at a particular checkpoint so that it can expedite the entire promotion cycle.

Multiple attributes have been provided around each employee's past and current performance, along with demographics.

**Features**

1. employee_id: Unique ID for employee
1. department: Department of employee
1. region: Region of employment (unordered)
1. education: Education Level
1. gender: Gender of Employee
1. recruitment_channel: Channel of recruitment for employee
1. no_of_trainings: Number of other trainings completed in the previous year on soft skills, technical skills, etc.
1. age: Age of Employee
1. previous_year_rating: Employee Rating for the previous year
1. length_of_service: Length of service in years
1. awards_won?: 1 if awards were won during the previous year, else 0
1. avg_training_score: Average score in current training evaluations
1. is_promoted: (Target) Recommended for promotion

**Inspiration**

Predict whether a potential promotee at the checkpoint in the test set will be promoted after the evaluation process.

## Import Library and Dataset

In [213]:
import pandas as pd
import numpy as np

In [214]:
df_train = pd.read_csv('dataset/hr_train.csv')
df_test = pd.read_csv('dataset/hr_test.csv')
df = df_train.copy()
print('Number of data:',len(df))
df.head()

Number of data: 54808


| | employee_id | department | region | education | gender | recruitment_channel | no_of_trainings | age | previous_year_rating | length_of_service | awards_won? | avg_training_score | is_promoted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 65438 | Sales & Marketing | region_7 | Master's & above | f | sourcing | 1 | 35 | 5.0 | 8 | 0 | 49 | 0 |
| 1 | 65141 | Operations | region_22 | Bachelor's | m | other | 1 | 30 | 5.0 | 4 | 0 | 60 | 0 |
| 2 | 7513 | Sales & Marketing | region_19 | Bachelor's | m | sourcing | 1 | 34 | 3.0 | 7 | 0 | 50 | 0 |
| 3 | 2542 | Sales & Marketing | region_23 | Bachelor's | m | other | 2 | 39 | 1.0 | 10 | 0 | 50 | 0 |
| 4 | 48945 | Technology | region_26 | Bachelor's | m | other | 1 | 45 | 3.0 | 2 | 0 | 73 | 0 |


## Data Cleansing and Feature Engineering

In [215]:
def describe_dataset(df):
    """Summarise each column: dtype, missing count and share, cardinality, unique values."""
    df_describe = []
    for column in df.columns:
        n_missing = df[column].isna().sum()  # computed once, reused for the percentage
        df_describe.append([
            df[column].dtypes, column, n_missing,
            round(n_missing / len(df) * 100, 2),
            df[column].nunique(), df[column].unique(),
        ])
    columns = ['type', 'column', 'nan', 'nan (%)', 'nunique', 'unique']
    return pd.DataFrame(df_describe, columns=columns).sort_values(by='type').reset_index(drop=True)
describe_dataset(df)

| | type | column | nan | nan (%) | nunique | unique |
|---|---|---|---|---|---|---|
| 0 | int64 | employee_id | 0 | 0.0 | 54808 | [65438, 65141, 7513, 2542, 48945, 58896, 20379... |
| 1 | int64 | no_of_trainings | 0 | 0.0 | 10 | [1, 2, 3, 4, 7, 5, 6, 8, 10, 9] |
| 2 | int64 | age | 0 | 0.0 | 41 | [35, 30, 34, 39, 45, 31, 33, 28, 32, 49, 37, 3... |
| 3 | int64 | length_of_service | 0 | 0.0 | 35 | [8, 4, 7, 10, 2, 5, 6, 1, 3, 16, 9, 11, 26, 12... |
| 4 | int64 | awards_won? | 0 | 0.0 | 2 | [0, 1] |
| 5 | int64 | avg_training_score | 0 | 0.0 | 61 | [49, 60, 50, 73, 85, 59, 63, 83, 54, 77, 80, 8... |
| 6 | int64 | is_promoted | 0 | 0.0 | 2 | [0, 1] |
| 7 | float64 | previous_year_rating | 4124 | 7.52 | 5 | [5.0, 3.0, 1.0, 4.0, nan, 2.0] |
| 8 | object | department | 0 | 0.0 | 9 | [Sales & Marketing, Operations, Technology, An... |
| 9 | object | region | 0 | 0.0 | 34 | [region_7, region_22, region_19, region_23, re... |


- The "previous_year_rating" and "education" columns contain missing values.
- There are 5 categorical features: "department", "region", "education", "gender" and "recruitment_channel". These need to be encoded later.
- The remaining 8 columns are numerical.

In [216]:
df[df.previous_year_rating.isna()].length_of_service.value_counts()

1    4124
Name: length_of_service, dtype: int64

All rows with a missing "previous_year_rating" have "length_of_service" = 1. This makes sense, because a previous-year rating can only exist once the employee has served the company for more than one year. Therefore, we fill these missing values with 0.

In [217]:
df['previous_year_rating'] = df.previous_year_rating.fillna(0.0)
df.previous_year_rating.value_counts()

3.0    18618
5.0    11741
4.0     9877
1.0     6223
2.0     4225
0.0     4124
Name: previous_year_rating, dtype: int64
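
The check-then-fill step above can be sketched on a small toy frame. The `toy` data below is made up for illustration and is not taken from the dataset; the sketch only assumes the two column names used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training data (values are invented for illustration).
toy = pd.DataFrame({
    "length_of_service": [1, 1, 4, 7, 1],
    "previous_year_rating": [np.nan, np.nan, 3.0, 5.0, np.nan],
})

# Confirm that every missing rating belongs to an employee in their first year.
missing = toy["previous_year_rating"].isna()
all_first_year = bool((toy.loc[missing, "length_of_service"] == 1).all())

# Fill by assignment; this avoids relying on inplace=True through attribute access.
if all_first_year:
    toy["previous_year_rating"] = toy["previous_year_rating"].fillna(0.0)

print(toy["previous_year_rating"].value_counts().sort_index())
```

Guarding the fill with the check keeps the "0 means no rating yet" interpretation honest: if a missing rating ever showed up for a longer-serving employee, the fill would be skipped rather than silently applied.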

In [218]:
df[df.education.isna()].department.value_counts()

Sales & Marketing    1575
Analytics             337
Operations            226
Technology             99
Procurement            72
Finance                36
HR                     32
R&D                    28
Legal                   4
Name: department, dtype: int64

In [219]:
print('Most of the missing values are coming from department Sales & Marketing by {}%'
.format(round(df[df.education.isna()].department.value_counts().iloc[0]/len(df[df.education.isna()])*100,2)))

Most of the missing values are coming from department Sales & Marketing by 65.38%
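
A compact way to get the same share for every department at once is `value_counts(normalize=True)`. This is a minimal sketch on invented toy data; only the column names "department" and "education" come from the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training data (values are invented for illustration).
toy = pd.DataFrame({
    "department": ["Sales & Marketing", "Sales & Marketing", "Analytics",
                   "Operations", "Sales & Marketing"],
    "education": [np.nan, np.nan, np.nan, "Bachelor's", "Bachelor's"],
})

# Share of each department among rows with missing education, in percent.
missing_share = (
    toy.loc[toy["education"].isna(), "department"]
    .value_counts(normalize=True)
    .mul(100)
    .round(2)
)

top_department = missing_share.index[0]
top_share = missing_share.iloc[0]
print(f"{top_department}: {top_share}% of missing education values")
```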


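This suggests that the missing values are tied to the Sales & Marketing department in particular, possibly because its roles do not require more than a below-secondary education. Take a look at the education levels recorded for that department below.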

In [220]:
df[df.department == "Sales & Marketing"].education.value_counts()

Bachelor's          11099
Master's & above     4166
Name: education, dtype: int64

Interesting: no Sales & Marketing employee is recorded with "Below Secondary" education. We can therefore fill that department's missing education values (about 65% of all missing education values) with "Below Secondary".
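
One way to double-check this kind of observation is a cross-tabulation of department against education, with missing values given their own label. The sketch below uses invented toy data, and the "Missing" label is an assumption chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training data (values are invented for illustration).
toy = pd.DataFrame({
    "department": ["Sales & Marketing", "Sales & Marketing", "Sales & Marketing",
                   "Technology", "Technology"],
    "education": ["Bachelor's", np.nan, "Master's & above",
                  "Below Secondary", "Bachelor's"],
})

# Label missing education explicitly so it gets its own column in the table.
education_filled = toy["education"].fillna("Missing")
table = pd.crosstab(toy["department"], education_filled)
print(table)
```

A zero in the "Below Secondary" column next to a non-zero "Missing" count for the same department is the pattern described above. It is consistent with the hypothesis but does not prove it.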

In [225]:
mask = df.education.isna() & (df.department == "Sales & Marketing")  # one combined mask avoids the reindexing warning
df[mask].fillna("Below Secondary")



| | employee_id | department | region | education | gender | recruitment_channel | no_of_trainings | age | previous_year_rating | length_of_service | awards_won? | avg_training_score | is_promoted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 32 | 35465 | Sales & Marketing | region_7 | Below Secondary | f | sourcing | 1 | 24 | 1.0 | 2 | 0 | 48 | 0 |
| 43 | 17423 | Sales & Marketing | region_2 | Below Secondary | m | other | 3 | 24 | 2.0 | 2 | 0 | 48 | 0 |
| 82 | 66013 | Sales & Marketing | region_2 | Below Secondary | m | sourcing | 2 | 25 | 3.0 | 2 | 0 | 53 | 0 |
| 87 | 69094 | Sales & Marketing | region_2 | Below Secondary | m | sourcing | 1 | 39 | 1.0 | 9 | 0 | 49 | 0 |
| 90 | 62658 | Sales & Marketing | region_2 | Below Secondary | f | sourcing | 1 | 20 | 0.0 | 1 | 0 | 55 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 54596 | 43241 | Sales & Marketing | region_2 | Below Secondary | m | sourcing | 1 | 36 | 1.0 | 5 | 0 | 49 | 0 |
| 54599 | 7570 | Sales & Marketing | region_23 | Below Secondary | m | other | 1 | 24 | 2.0 | 3 | 0 | 48 | 0 |
| 54692 | 14821 | Sales & Marketing | region_2 | Below Secondary | f | sourcing | 1 | 35 | 3.0 | 7 | 0 | 53 | 0 |
| 54742 | 38935 | Sales & Marketing | region_31 | Below Secondary | m | other | 1 | 28 | 4.0 | 3 | 0 | 47 | 0 |
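
Note that `fillna` returns a new frame, so the cell above only displays the filled rows and leaves `df` unchanged. To persist the fill, one option is a masked assignment with `.loc`. This is a minimal sketch on invented toy data, not the notebook's own code.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training data (values are invented for illustration).
toy = pd.DataFrame({
    "department": ["Sales & Marketing", "Sales & Marketing", "Analytics", "Operations"],
    "education": [np.nan, "Bachelor's", np.nan, "Master's & above"],
})

# Select only the Sales & Marketing rows whose education is missing.
mask = toy["education"].isna() & (toy["department"] == "Sales & Marketing")

# Assign in place through .loc so the change is stored in the frame itself.
toy.loc[mask, "education"] = "Below Secondary"
print(toy)
```

Missing education values in other departments (the Analytics row here) are left untouched and still need their own treatment.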


In [224]:
cat_col = list(df.dtypes[df.dtypes == 'object'].index)
num_col = list(df.dtypes[df.dtypes != 'object'].index)
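
The encoding step mentioned earlier for the categorical features could, for example, use one-hot encoding via `pd.get_dummies`. This is only a sketch under assumptions: the notebook does not say which encoder it uses, the toy data is invented, and `toy_cat_col` is a hypothetical stand-in for `cat_col`.

```python
import pandas as pd

# Toy stand-in for the training data (values are invented for illustration).
toy = pd.DataFrame({
    "department": ["Technology", "Operations", "Technology"],
    "gender": ["m", "f", "m"],
    "age": [45, 30, 34],
    "is_promoted": [0, 0, 1],
})

# Hypothetical stand-in for cat_col, restricted to the toy frame's columns.
toy_cat_col = ["department", "gender"]

# One-hot encode the categorical columns; numerical columns pass through unchanged.
encoded = pd.get_dummies(toy, columns=toy_cat_col)
print(encoded.columns.tolist())
```

One-hot encoding suits unordered features such as "region" or "department". For an ordered feature such as "education", an ordinal mapping may be a better fit.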

## Exploratory Data Analysis (EDA)

## References
1. [HR Analytics: Employee Promotion Data - Kaggle](https://www.kaggle.com/arashnic/hr-ana)