# AfterWork Data Science: 


## 1. Business Understanding




Your client is a large Multinational Corporation, and they have nine broad verticals across the organization. One of the problems your client faces is identifying the right people for promotion (only for the manager position and below) and preparing them in time.

The process they currently follow is:


* They first identify a set of employees based on recommendations or past performance.
* Selected employees go through a separate training and evaluation program for each vertical.
* These programs are based on the skills required by each vertical. At the end of the program, the employee gets a promotion based on various factors such as training performance and KPI completion (only employees with KPIs completed greater than 60% are considered).






In this process, the final promotions are only announced after the evaluation, which delays the promoted employees' transition into their new roles.

The task is to predict whether a potential promotee at a checkpoint will be promoted or not after the evaluation process.


## 2. Data Exploration

In [None]:
# Loading libraries
# ---
# 
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import MinMaxScaler
 

In [None]:
# Loading the dataset
# --- 
# Dataset url = https://bit.ly/2ODZvLCHRDataset
# --- 
# 
df = pd.read_csv('https://bit.ly/2ODZvLCHRDataset')
df.head()

Unnamed: 0,employee_id,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted
0,65438,Sales & Marketing,region_7,Master's & above,f,sourcing,1,35,5.0,8,1,0,49,0
1,65141,Operations,region_22,Bachelor's,m,other,1,30,5.0,4,0,0,60,0
2,7513,Sales & Marketing,region_19,Bachelor's,m,sourcing,1,34,3.0,7,0,0,50,0
3,2542,Sales & Marketing,region_23,Bachelor's,m,other,2,39,1.0,10,0,0,50,0
4,48945,Technology,region_26,Bachelor's,m,other,1,45,3.0,2,0,0,73,0


In [None]:
# Determining the size 
# ---
#
df.shape

(54808, 14)

In [None]:
# Checking the datatypes
# ---
# 
df.dtypes

employee_id               int64
department               object
region                   object
education                object
gender                   object
recruitment_channel      object
no_of_trainings           int64
age                       int64
previous_year_rating    float64
length_of_service         int64
KPIs_met >80%             int64
awards_won?               int64
avg_training_score        int64
is_promoted               int64
dtype: object

In [None]:
# Statistical summary
# ---
#
df.describe()

Unnamed: 0,employee_id,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted
count,54808.0,54808.0,54808.0,50684.0,54808.0,54808.0,54808.0,54808.0,54808.0
mean,39195.830627,1.253011,34.803915,3.329256,5.865512,0.351974,0.023172,63.38675,0.08517
std,22586.581449,0.609264,7.660169,1.259993,4.265094,0.47759,0.15045,13.371559,0.279137
min,1.0,1.0,20.0,1.0,1.0,0.0,0.0,39.0,0.0
25%,19669.75,1.0,29.0,3.0,3.0,0.0,0.0,51.0,0.0
50%,39225.5,1.0,33.0,3.0,5.0,0.0,0.0,60.0,0.0
75%,58730.5,1.0,39.0,4.0,7.0,1.0,0.0,76.0,0.0
max,78298.0,10.0,60.0,5.0,37.0,1.0,1.0,99.0,1.0


## 3. Data Cleaning and Preparation

In [None]:
# Checking for duplicates 
# ---
#
df.duplicated().sum()

0

In [None]:
# Checking for missing values 
# ---
# 
df.isnull().sum()

employee_id                0
department                 0
region                     0
education               2409
gender                     0
recruitment_channel        0
no_of_trainings            0
age                        0
previous_year_rating    4124
length_of_service          0
KPIs_met >80%              0
awards_won?                0
avg_training_score         0
is_promoted                0
dtype: int64

In [None]:
# Dropping observations with missing values
# ---
#
df.dropna(inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 48660 entries, 0 to 54807
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   employee_id           48660 non-null  int64  
 1   department            48660 non-null  object 
 2   region                48660 non-null  object 
 3   education             48660 non-null  object 
 4   gender                48660 non-null  object 
 5   recruitment_channel   48660 non-null  object 
 6   no_of_trainings       48660 non-null  int64  
 7   age                   48660 non-null  int64  
 8   previous_year_rating  48660 non-null  float64
 9   length_of_service     48660 non-null  int64  
 10  KPIs_met >80%         48660 non-null  int64  
 11  awards_won?           48660 non-null  int64  
 12  avg_training_score    48660 non-null  int64  
 13  is_promoted           48660 non-null  int64  
dtypes: float64(1), int64(8), object(5)
memory usage: 5.6+ MB
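
Dropping every row with a missing value removes 6,148 of the 54,808 observations (about 11%). As a possible alternative, the sketch below imputes the two affected columns instead. The toy frame and the choice of fill values (most frequent category, median rating) are assumptions for illustration only and are not part of the analysis in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the HR dataset (values are made up for illustration)
toy = pd.DataFrame({
    "education": ["Bachelor's", np.nan, "Master's & above", "Bachelor's"],
    "previous_year_rating": [5.0, 3.0, np.nan, 1.0],
})

# Fill the categorical column with its most frequent value
toy["education"] = toy["education"].fillna(toy["education"].mode()[0])

# Fill the numeric column with its median
toy["previous_year_rating"] = toy["previous_year_rating"].fillna(
    toy["previous_year_rating"].median()
)

print(toy.isnull().sum())
```

Whether imputation is preferable to dropping depends on why the values are missing; a missing rating may, for instance, simply indicate a new employee.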


In [None]:
# Dropping irrelevant columns
# ---
#
df.drop(['employee_id'], axis=1, inplace=True)

In [None]:
# Encoding the gender feature as a binary variable (1 = male, 0 = female)
# ---
#
df["gender"] = np.where(df["gender"] == "m", 1, 0)
df.head()

Unnamed: 0,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted
0,Sales & Marketing,region_7,Master's & above,0,sourcing,1,35,5.0,8,1,0,49,0
1,Operations,region_22,Bachelor's,1,other,1,30,5.0,4,0,0,60,0
2,Sales & Marketing,region_19,Bachelor's,1,sourcing,1,34,3.0,7,0,0,50,0
3,Sales & Marketing,region_23,Bachelor's,1,other,2,39,1.0,10,0,0,50,0
4,Technology,region_26,Bachelor's,1,other,1,45,3.0,2,0,0,73,0


In [None]:
# Encoding the other categorical features
# ---
# Note: 'region' is not encoded here and is dropped before modelling.
dummies_dept = pd.get_dummies(df['department'], prefix='Dept')
dummies_edu = pd.get_dummies(df['education'])
dummies_rc = pd.get_dummies(df['recruitment_channel'])
df = pd.concat([df, dummies_dept, dummies_edu, dummies_rc], axis=1)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 48660 entries, 0 to 54807
Data columns (total 28 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   department              48660 non-null  object 
 1   region                  48660 non-null  object 
 2   education               48660 non-null  object 
 3   gender                  48660 non-null  int64  
 4   recruitment_channel     48660 non-null  object 
 5   no_of_trainings         48660 non-null  int64  
 6   age                     48660 non-null  int64  
 7   previous_year_rating    48660 non-null  float64
 8   length_of_service       48660 non-null  int64  
 9   KPIs_met >80%           48660 non-null  int64  
 10  awards_won?             48660 non-null  int64  
 11  avg_training_score      48660 non-null  int64  
 12  is_promoted             48660 non-null  int64  
 13  Dept_Analytics          48660 non-null  uint8  
 14  Dept_Finance            48660 non-null
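
The encoding step above can also be written as a single call, since `pd.get_dummies` accepts `columns` and `prefix` arguments and drops the original columns itself. The following is a minimal sketch on a toy frame; the `Edu` and `Channel` prefixes are hypothetical names chosen here and would change the column names relative to the output shown above.

```python
import pandas as pd

# Toy frame with the same categorical columns as the HR dataset (values made up)
toy = pd.DataFrame({
    "department": ["Sales & Marketing", "Operations", "Technology"],
    "education": ["Master's & above", "Bachelor's", "Bachelor's"],
    "recruitment_channel": ["sourcing", "other", "other"],
    "age": [35, 30, 45],
})

# One-hot encode all three columns at once; the source columns are removed
encoded = pd.get_dummies(
    toy,
    columns=["department", "education", "recruitment_channel"],
    prefix=["Dept", "Edu", "Channel"],  # hypothetical prefixes
)

print(encoded.columns.tolist())
```

With this form, the original categorical columns no longer need to be dropped by hand before modelling.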

## 4. Data Modeling

In [None]:
# Preparing our dataset for training
# ---
# We first divide our data into features and target.
X = df.drop(['is_promoted','department','region','education','recruitment_channel'],axis=1)
y = df['is_promoted']

print(X.shape)
print(y.shape)

(48660, 23)
(48660,)


In [None]:
# Splitting the dataset into a training set and test set
# ---
#
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
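
Because only a small share of employees is promoted, a plain random split can leave the training and test sets with slightly different promotion rates. The sketch below shows the `stratify` argument of `train_test_split`, which keeps the class proportions equal in both sets. The synthetic arrays are assumptions for illustration; the positive rate of roughly 8.5% mirrors the mean of `is_promoted` in the statistical summary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, roughly 8.5% positives
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 85 + [0] * 915)

# stratify=y_toy preserves the class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0, stratify=y_toy
)

# The two positive rates should be nearly identical
print(y_tr.mean(), y_te.mean())
```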

In [None]:
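# Normalisation
# ---
# The scaler is fitted on the training set only and then applied to both sets,
# so no information from the test set leaks into training.
norm = MinMaxScaler().fit(X_train)
X_train = norm.transform(X_train)
X_test = norm.transform(X_test)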

In [None]:
# Training the model
# ---
# 
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

DecisionTreeClassifier()
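
A tree with default settings grows until its leaves are pure and treats both classes equally, which can work against the rare promoted class. As a hedged sketch of one possible variation, the block below sets `class_weight="balanced"`, limits the depth, and fixes `random_state` for reproducibility. The synthetic data and the hyperparameter values are illustrative assumptions, not tuned settings, and reweighting typically trades some precision for recall.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (about 10% positives) standing in for the HR dataset
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=0, stratify=y_syn
)

# class_weight="balanced" weights classes inversely to their frequency;
# max_depth=5 is an arbitrary cap to limit overfitting
weighted_tree = DecisionTreeClassifier(
    class_weight="balanced", max_depth=5, random_state=0
)
weighted_tree.fit(X_tr, y_tr)

# Recall on the minority class
y_hat = weighted_tree.predict(X_te)
print(recall_score(y_te, y_hat))
```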

In [None]:
# Predicting the test set results. 
# ---
#
decision_y_prediction = model.predict(X_test) 

In [None]:
# Comparing actual output values for X_test with the predicted values
# ---
#
df2 = pd.DataFrame({
    'Actual': y_test,
    'decision_tree_prediction': decision_y_prediction,
})

df2.sample(5)

Unnamed: 0,Actual,decision_tree_prediction
19434,1,1
12231,0,0
11543,0,0
1989,0,0
41320,0,0


In [None]:
# Printing evaluation metrics to assess the classifier's performance
# ---
# 
from sklearn.metrics import classification_report
print(classification_report(y_test, decision_y_prediction))

              precision    recall  f1-score   support

           0       0.95      0.94      0.94     11137
           1       0.40      0.43      0.41      1028

    accuracy                           0.90     12165
   macro avg       0.67      0.69      0.68     12165
weighted avg       0.90      0.90      0.90     12165
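
To put the accuracy figure in context, it helps to compare it with a trivial model that always predicts "not promoted". The sketch below computes that baseline from the support values in the report above, then shows how `confusion_matrix` separates the four outcome types. The toy labels in the second part are made up for illustration.

```python
from sklearn.metrics import confusion_matrix

# Support values copied from the classification report above
support_not_promoted = 11137
support_promoted = 1028

# Accuracy of a trivial model that always predicts "not promoted"
baseline_accuracy = support_not_promoted / (support_not_promoted + support_promoted)
print(round(baseline_accuracy, 3))  # 0.915

# Toy labels (made up) to show how the confusion matrix is unpacked
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 3 1 1 1
```

Since the baseline already reaches about 0.915, precision and recall on the promoted class are the more informative metrics here.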



## 5. Summary of findings and recommendations

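The model reaches an overall accuracy of 90% on the test set. That figure is dominated by the majority class: 11,137 of the 12,165 test employees (about 92%) were not promoted, so a model that always predicts "not promoted" would score similarly. For the employees who were promoted, precision is 0.40 and recall is 0.43, meaning the model finds fewer than half of the actual promotees and most of its positive predictions are wrong. In its current form it can at best serve the HR department as a rough first-pass screen. Before it can save time in assessing promotion eligibility for individual candidates, its performance on the promoted class needs to improve, for example by addressing the class imbalance or tuning the tree.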