# AfterWork Data Science: Introduction to Machine Learning

## 1. Defining the Question

### a) Understanding the context 

HR analytics is revolutionising the way human resources departments operate, leading
to higher efficiency and better results overall. Human resources departments have been
using analytics for years. However, the collection, processing, and analysis of data have
been largely manual, and given the nature of human resources dynamics and HR KPIs, this
approach has been constraining HR.

### b) Problem statement


The client is a large multinational corporation with nine broad verticals
across the organization. One of the problems your client faces is identifying the right
people for promotion (only for the manager position and below) and preparing them in
time.


### c) Data Analysis Question

Can we predict whether an employee at a
checkpoint will be promoted after the evaluation process?

### d) Defining the Metric for Success

The model should correctly predict whether an employee is eligible for promotion.
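
"Correctly predict" can be quantified in more than one way. The sketch below uses invented toy labels (not the HR data) and computes accuracy, precision, and recall by hand so the definitions are visible. It assumes promotions are the rarer class, which the sampled rows later in this notebook hint at but which is not checked here.

```python
# Toy labels invented for illustration (1 = promoted, 0 = not promoted).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# A naive "model" that always predicts "not promoted".
y_pred = [0] * len(y_true)


def confusion_counts(actual, predicted):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp, fp, fn, tn


tp, fp, fn, tn = confusion_counts(y_true, y_pred)

# Accuracy: share of all predictions that are right.
accuracy = (tp + tn) / len(y_true)
# Recall: share of truly promoted employees that the model found.
recall = tp / (tp + fn) if (tp + fn) else 0.0
# Precision: share of predicted promotions that were real.
precision = tp / (tp + fp) if (tp + fp) else 0.0

print(f"accuracy={accuracy:.2f} recall={recall:.2f} precision={precision:.2f}")
```

On these toy labels the always-"no" model looks accurate while finding none of the promoted employees, which is why accuracy alone may be a weak success metric when one class is rare.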

## 2. Reading the Data

In [1]:
# Importing our libraries
# ---
# import Pandas for data manipulation
import pandas as pd
# import NumPy for arithmetic operations
import numpy as np
from sklearn.tree import DecisionTreeClassifier

In [2]:
# Load the data below
# --- 
# Dataset url =  https://bit.ly/2ODZvLCHRDataset

# reading the data and storing it in a dataframe named employee_df

employee_df = pd.read_csv("https://bit.ly/2ODZvLCHRDataset")

# 

In [3]:
# Checking the first 5 rows of data
# ---
employee_df.head()
#

Unnamed: 0,employee_id,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted
0,65438,Sales & Marketing,region_7,Master's & above,f,sourcing,1,35,5.0,8,1,0,49,0
1,65141,Operations,region_22,Bachelor's,m,other,1,30,5.0,4,0,0,60,0
2,7513,Sales & Marketing,region_19,Bachelor's,m,sourcing,1,34,3.0,7,0,0,50,0
3,2542,Sales & Marketing,region_23,Bachelor's,m,other,2,39,1.0,10,0,0,50,0
4,48945,Technology,region_26,Bachelor's,m,other,1,45,3.0,2,0,0,73,0


In [4]:
# Checking the last 5 rows of data
# ---
employee_df.tail()
#

Unnamed: 0,employee_id,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted
54803,3030,Technology,region_14,Bachelor's,m,sourcing,1,48,3.0,17,0,0,78,0
54804,74592,Operations,region_27,Master's & above,f,other,1,37,2.0,6,0,0,56,0
54805,13918,Analytics,region_1,Bachelor's,m,other,1,27,5.0,3,1,0,79,0
54806,13614,Sales & Marketing,region_9,,m,sourcing,1,29,1.0,2,0,0,45,0
54807,51526,HR,region_22,Bachelor's,m,other,1,27,1.0,5,0,0,49,0


In [5]:
# Sample 10 rows of data
# ---
employee_df.sample(10)
#

Unnamed: 0,employee_id,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted
46078,60632,Sales & Marketing,region_13,Bachelor's,m,other,1,34,3.0,7,1,0,50,0
9825,73343,Sales & Marketing,region_28,Bachelor's,m,sourcing,1,37,3.0,3,1,0,47,0
29099,70348,Analytics,region_16,Master's & above,m,other,2,34,3.0,9,1,0,85,0
478,62530,Procurement,region_32,Bachelor's,m,other,2,32,3.0,3,0,0,70,0
50809,5549,Operations,region_24,Bachelor's,f,sourcing,1,31,3.0,5,0,0,62,0
44765,52451,Procurement,region_27,Bachelor's,f,sourcing,1,32,5.0,5,1,0,69,1
10001,61388,Operations,region_13,Bachelor's,f,other,1,40,5.0,5,0,0,58,0
49944,71615,Procurement,region_2,Bachelor's,f,other,2,55,3.0,8,1,0,71,0
30783,23295,Operations,region_22,Bachelor's,m,other,1,48,3.0,19,0,0,60,0
43057,20734,Sales & Marketing,region_26,Master's & above,m,sourcing,1,43,2.0,8,0,0,49,0


## Observation:
We have noted missing values in the education column.

In [6]:
# Checking number of rows and columns
# ---

employee_df.shape
#  

(54808, 14)

In [7]:
# Checking datatypes
# ---
employee_df.dtypes
# 

employee_id               int64
department               object
region                   object
education                object
gender                   object
recruitment_channel      object
no_of_trainings           int64
age                       int64
previous_year_rating    float64
length_of_service         int64
KPIs_met >80%             int64
awards_won?               int64
avg_training_score        int64
is_promoted               int64
dtype: object

## Data Preparation and Cleaning

In [8]:
# Checking missing entries of all the variables
# ---
#

employee_df.isnull().sum()

employee_id                0
department                 0
region                     0
education               2409
gender                     0
recruitment_channel        0
no_of_trainings            0
age                        0
previous_year_rating    4124
length_of_service          0
KPIs_met >80%              0
awards_won?                0
avg_training_score         0
is_promoted                0
dtype: int64

We observe the following from our dataset:
1. Education has missing values: since we cannot reliably infer an employee's level of education, we shall drop the records where it is missing.
2. Previous year rating has missing values: this is critical data for our decision, so we shall replace the missing values with the mean rating.


In [9]:
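# replacing the missing previous year rating data with the mean value
# (assigning the result back avoids calling fillna(inplace=True) on a column
# selection, which may not update employee_df in recent pandas versions)

employee_df['previous_year_rating'] = employee_df['previous_year_rating'].fillna(
    employee_df['previous_year_rating'].mean()
)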

employee_df.isnull().sum()

employee_id                0
department                 0
region                     0
education               2409
gender                     0
recruitment_channel        0
no_of_trainings            0
age                        0
previous_year_rating       0
length_of_service          0
KPIs_met >80%              0
awards_won?                0
avg_training_score         0
is_promoted                0
dtype: int64

All missing previous year ratings have been filled.

In [10]:
#deleting all records missing education
promotion_df = employee_df.dropna(subset=['education'])
promotion_df.isnull().sum()

employee_id             0
department              0
region                  0
education               0
gender                  0
recruitment_channel     0
no_of_trainings         0
age                     0
previous_year_rating    0
length_of_service       0
KPIs_met >80%           0
awards_won?             0
avg_training_score      0
is_promoted             0
dtype: int64

In [11]:
# Standardizing your dataset i.e. variable renaming
# 
# convert all column headers to lower case

promotion_df.columns =  promotion_df.columns.str.lower()
promotion_df.columns

Index(['employee_id', 'department', 'region', 'education', 'gender',
       'recruitment_channel', 'no_of_trainings', 'age', 'previous_year_rating',
       'length_of_service', 'kpis_met >80%', 'awards_won?',
       'avg_training_score', 'is_promoted'],
      dtype='object')

In [12]:
# Checking how many duplicate rows are there in the data
# ---
# 
promotion_duplicated_df = promotion_df[promotion_df.duplicated()]
promotion_duplicated_df

Unnamed: 0,employee_id,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,kpis_met >80%,awards_won?,avg_training_score,is_promoted


## Observation:

No duplicates



In [13]:
#checking the shape of the clean data
promotion_df.shape

(52399, 14)


## Solution Implementation

The HR data set already contains the target (`is_promoted`), so we can train a model on it directly.


### Defining our features and target

In [14]:
promotion_df.columns

Index(['employee_id', 'department', 'region', 'education', 'gender',
       'recruitment_channel', 'no_of_trainings', 'age', 'previous_year_rating',
       'length_of_service', 'kpis_met >80%', 'awards_won?',
       'avg_training_score', 'is_promoted'],
      dtype='object')

In [15]:
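# our features: keep only the numeric columns
# (the employee ID, the target, and the categorical columns are dropped)
features = promotion_df.drop(['employee_id','is_promoted','department', 'region', 'education', 'gender',
       'recruitment_channel'], axis=1)

# our target - promotion
target = promotion_df['is_promoted']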

In [16]:
# defining our model and fitting it to the features and target

model = DecisionTreeClassifier()
model.fit(features, target)
#
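
The model above is fitted on every available row, so nothing is left over to check how well it predicts unseen employees. One common way to measure that is a hold-out split. The sketch below runs on synthetic stand-in data rather than the HR data set; the names `demo_model`, `X`, and `y`, the labelling rule, and the 25% test share are all assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT the HR dataset): 200 rows, 3 numeric features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# Hypothetical rule, used only to generate labels for this demo.
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

# Hold back 25% of the rows; stratify keeps the class ratio similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Fit on the training part only.
demo_model = DecisionTreeClassifier(random_state=0)
demo_model.fit(X_train, y_train)

# Compare accuracy on rows the tree has seen with rows it has not.
train_accuracy = accuracy_score(y_train, demo_model.predict(X_train))
test_accuracy = accuracy_score(y_test, demo_model.predict(X_test))
print(f"train accuracy: {train_accuracy:.3f}")
print(f"test accuracy:  {test_accuracy:.3f}")
```

With the notebook's own data, `features` and `target` would take the place of `X` and `y`. A fully grown tree usually scores close to perfectly on its training rows, so the test figure is the more informative of the two.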

In [None]:
# Defining new features to predict whether an employee is a promotion candidate
# Column order follows features.columns: no_of_trainings, age, previous_year_rating,
# length_of_service, kpis_met >80%, awards_won?, avg_training_score
new_features = pd.DataFrame(
    [
        [9, 35, 4, 15, 1, 1, 90],  # employee 0
        [1, 50, 2, 3, 1, 0, 90],   # employee 1
    ],
    columns=features.columns,
)

promotion_status = model.predict(new_features)

promotion_status

## Observation:
The result shows that the employee at index 0 is predicted to be eligible for promotion, while the employee at index 1 is not.
By varying the feature values, you can make predictions for other employees.
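
Varying the values by hand can be wrapped in a small helper. The sketch below is one possible shape for it: `predict_for_employee` is a hypothetical function, and the toy model is trained on a tiny made-up table (not the HR data) so that the example runs on its own.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Tiny made-up training set (NOT the HR data); column names mirror the notebook.
toy_features = pd.DataFrame(
    {
        "previous_year_rating": [1, 2, 3, 4, 5, 5],
        "avg_training_score": [45, 50, 60, 85, 90, 95],
    }
)
toy_target = pd.Series([0, 0, 0, 1, 1, 1], name="is_promoted")

toy_model = DecisionTreeClassifier(random_state=0).fit(toy_features, toy_target)


def predict_for_employee(model, feature_columns, values):
    """Hypothetical helper: predict 0/1 for one employee given a dict of feature values."""
    missing = set(feature_columns) - set(values)
    if missing:
        raise ValueError(f"missing feature values: {sorted(missing)}")
    # Build a one-row frame whose columns are in the same order as at training time.
    row = pd.DataFrame([values], columns=list(feature_columns))
    return int(model.predict(row)[0])


strong = predict_for_employee(
    toy_model,
    toy_features.columns,
    {"previous_year_rating": 5, "avg_training_score": 92},
)
weak = predict_for_employee(
    toy_model,
    toy_features.columns,
    {"previous_year_rating": 1, "avg_training_score": 40},
)
print(strong, weak)
```

Passing the values as a dict keyed by column name avoids relying on positional order, and it also works for column names such as `kpis_met >80%` that are not valid Python identifiers.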

## Conclusion:
From the above model results, HR can use the decision tree to predict whether an employee is eligible for promotion by supplying the relevant employee data as new features.