# EMPLOYEE PROMOTION PREDICTION
<div class="alert alert-block alert-success">
    <b>By: OFINNI OLUWASEUN ABEL</b>
</div>

<div class="alert alert-block alert-info">
    <b>PROBLEM DEFINITION AND FEATURES' DESCRIPTION</b>
</div>

<div class="alert alert-block alert-success">
<b>CASE STUDY</b>
</div>

**YAKUB TRADING GROUP - ALGORITHMIC STAFF PROMOTION**

Abdullah’s Baba Yakub, 38, is the heir apparent to the highly revered Yakub business dynasty. The enterprise has spanned decades with vast investment interest in all the various sectors of the economy.

Abdullah has worked for 16 years in Europe and America after his first and second degrees at Harvard University where he studied Engineering and Business Management. He is a very experienced technocrat and a global business leader who rose through the ranks to become a Senior Vice President at a leading US business conglomerate.
His dad is now 70 and has invited him to take over the company with a mandate to take it to the next level of growth as a sustainable legacy. Abdullah is trusted by his father and his siblings to lead this mandate.

On resumption, he had an open house with the staff to share his vision and to listen to them on how to take the business to the next level. Beyond the general operational issues and increasing need for regulatory compliance, one of the issues raised by the staff was a general concern on the process of staff promotion. Many of the staff allege that it is skewed and biased. Abdullah understood the concern and promised to address it in a most scientific way.


<div class="alert alert-block alert-info">
<b>MAIN TASK</b>
</div>
    
_**You have been called in by Abdullah to use your machine learning skills to study the pattern of promotion. With this insight, he can understand the important features among available features that can be used to predict promotion eligibility.**_


<div class="alert alert-block alert-danger">
<b>FEATURES DESCRIPTIONS</b>
</div>
    
The dataset contains these variables as explained below:

• **Employee No** : System-generated unique staff ID

• **Division**: Operational department where each employee works

• **Qualification**: Highest qualification received by the staff

• **Gender**: Male or Female

• **Channel of Recruitment**: How the staff was recruited – this is via internal process, use of an agent or special referral

• **Trainings Attended** : Unique paid and unpaid trainings attended by each staff in the previous business cycle

• **Year of birth**: Year that the employee was born

• **Last Performance Score**: Previous year overall performance HR score and rated on a scale of 0-14

• **Year of recruitment** : The year that each staff was recruited into the company

• **Targets Met or KPI Met**: A measure of employees who meet the annual set target. If met, the staff scores 1 but if not, it is a 0.

• **Previous Award** : An indicator of previous award won. If yes, it is a 1 and if No it is a 0.

• **Training score average**: Feedback score on training attended based on evaluation

• **State Of Origin**: The state that the employee claims

• **Foreign Schooled**: An indicator of staff who had any of their post-secondary education outside the country. Responses are in Yes or No

• **Marital Status**: Marriage status of employees and recorded as Yes or No

• **Past Disciplinary Action** : An indicator if a staff has been summoned to a disciplinary panel in the past. This is indicated as Yes or No

• **Previous Intra-Departmental Movement** : This is an indicator to identify staff who have moved between departments in the past. Yes and No are the responses.

• **No of Previous Employers** : A list of the number of companies that an employee worked with before joining the organisation. This is recorded as counts

In [None]:
def message():
    print("""Hey there! 
Your Codes Ran successfully! You may proceed.""")
    
message()

<div class="alert alert-block alert-info">
     <center>
<b>LOADING REQUIRED MODULES FOR THE PROJECT</b>
     </center>   
</div>


In [None]:
# import the required package and libraries
import pandas as pd #Dataframe handling
import numpy as np # For handling the Maths and Stats operations
import pandas_profiling as pp #To generate a broad overview of my data

#For Missing Values treatment. I'll prefer MinMaxScaler for scaling in this project due to the nature of distribution 
import missingno #For visualising the distribution of missing values based on counts strategy
from sklearn.preprocessing import MinMaxScaler #For scaling our feature to range between 0 and 1

#Visualisation tools imported from their libraries for this project
# Standard plotly imports
from matplotlib import pyplot as plt
import seaborn as sns 
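# Make plots appear directly in this notebook (a trailing comment on a magic line is read as part of its argument, so it lives here)
%matplotlib inline
# Note: plotly.plotly moved to the separate chart_studio package in plotly 4; `py` is not used in this notebook, so this import can be dropped there
import plotly.plotly as py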
import plotly.graph_objs as go
from plotly.offline import iplot, init_notebook_mode #The benefit of visual interactivity is huge
# Using plotly + cufflinks in offline mode
import cufflinks
cufflinks.go_offline(connected=True)
init_notebook_mode(connected=False)
#I'll use IPython Core Interactive Shell to enable Multiple outputs per cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity="all" #I'll set this to all to make my experiment seamless

# The Classifier Algorithm for our experiment and other relevant modules
from catboost import Pool, CatBoostClassifier #The raw guy is here waiting to be trained. It's a superstar, you know.
from sklearn.ensemble import RandomForestClassifier #I'll prefer this to DecisionForest anyday.
from sklearn.ensemble import GradientBoostingClassifier #A good member of the boosting family, I just hope it can handle this project
from sklearn import model_selection #I'll import this, just in case I resort to selecting the best model
from sklearn.metrics import (confusion_matrix,f1_score, recall_score, 
                             precision_score,accuracy_score) #The problem of imbalance class distribution is best measured using these metrics. It will be costly to use Accuracy
from sklearn.model_selection import (StratifiedKFold, train_test_split, 
                                     KFold, cross_val_score, RandomizedSearchCV,GridSearchCV) #These helpers have work to do here too

#We obviously don't want to keep getting that Red Box showing up in the experiment
import warnings
warnings.filterwarnings(action="ignore") #We have ignored it now. So, we are good to go.

message() #You know the feeling when your code completes with 0 errors... smiles!
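
The comment on the metrics import says accuracy is a costly choice when classes are imbalanced. A minimal sketch of why, using made-up labels (the 90/10 split and the "always predict 0" model are assumptions for illustration, not figures from this dataset):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical labels: 90 employees not promoted (0), 10 promoted (1).
y_true = np.array([0] * 90 + [1] * 10)

# A trivial "model" that never predicts a promotion.
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred, zero_division=0)
prec = precision_score(y_true, y_pred, zero_division=0)
f1 = f1_score(y_true, y_pred, zero_division=0)

# Accuracy looks strong, yet the model finds no promoted employee at all.
print(f"accuracy={acc:.2f} recall={rec:.2f} precision={prec:.2f} f1={f1:.2f}")
```

This is why recall, precision and F1 are imported alongside accuracy.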

<div class="alert alert-block alert-info">
     <center>
<b>LOADING THE DATASETS FOR THIS PROJECT</b>
      </center>
</div>

<div class='alert alert-block alert-success'>

<b>We have two Datasets for this project:
    
* Train Dataset: This contains the attributes that will be used, plus the target variable for our prediction problem.
* Test Dataset: This has the same attributes as the Train set, but without the 
  target variable.
  </b>
</div>

In [None]:
#Load the train data
train=pd.read_csv('train.csv')

#Load the test data
test=pd.read_csv('test.csv')

message()#winks

<div class="alert alert-block alert-info">
     <center>
<b>DESCRIPTIVE STATISTICS OF OUR DATA</b>
      </center>
</div>

In [None]:
#Data information: Get to know the data and their types plus other info
train.info()
test.info()

#Accessing the data from the head and tail
train.head(3)
train.tail(3)

test.head(3)
test.tail(3)

#Check the dimension of the train and test data
'The train shape is- '+str(train.shape), 'The test shape is- '+str(test.shape)

#Check the descriptive statistics of quantitative variables in our dataset
train.describe()
test.describe()

message()

<div class="alert alert-block alert-info">
     <center>
<b>CHECKING FOR NULL OR MISSING VALUES IN OUR DATASET</b>
      </center>
</div>

In [None]:
#Check for Null Values and then confirm their spreads in our dataset
train.isnull().sum()
test.isnull().sum()

In [None]:
#Let's verify the location of the missing data in our dataset, then we can visualise it.  
missing_train = train[train.isnull().any(axis=1)]
#Let's visualise the spread of the missing data in our train dataset to give us a clue of how best to deal with it.
sns.heatmap(train.isnull(),yticklabels= False,cbar= False, cmap='viridis')

**Note:** The yellow lines indicate the pattern of the missing values in the dataset

In [None]:
#Let's verify the location of the missing data in our test set, then we can visualise it.
missing_test = test[test.isnull().any(axis=1)]
#Let's visualise the spread of the missing data in our test dataset to give us a clue of how best to deal with it.
sns.heatmap(test.isnull(),yticklabels= False,cbar= False, cmap='viridis')

message()

In [None]:
#What went missing might be costly. I'll check the number of promotions among the rows with missing values
missing_train['Promoted_or_Not'].value_counts()

The classes are imbalanced, so every record counts, and some count more than others, although we cannot yet tell which. I'll keep the rows with missing values and find a way to fill them as I progress, rather than dropping them.
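
Before deciding whether to keep or drop incomplete rows, it can help to compare the promotion rate of rows with and without a missing value. A rough sketch on a made-up frame (the values below are invented; only the column names mirror the dataset):

```python
import numpy as np
import pandas as pd

# Toy data standing in for the train set.
toy = pd.DataFrame({
    "Qualification": ["First Degree or HND", np.nan, "MSc, MBA and PhD",
                      np.nan, "First Degree or HND", np.nan],
    "Promoted_or_Not": [0, 0, 1, 0, 1, 0],
})

# Share of missing values per column.
missing_share = toy.isnull().mean()

# Promotion rate for rows with and without a missing Qualification.
status = np.where(toy["Qualification"].isnull(), "missing", "present")
rate_by_status = toy.groupby(status)["Promoted_or_Not"].mean()

print(missing_share)
print(rate_by_status)
```

If the two rates differ noticeably on the real data, dropping the incomplete rows would shift the class balance, which supports keeping them.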

In [None]:
#Let's take a deeper look into our dataset by checking the profile of each attribute and a quick overview with relevant tips to guide us.
pp.ProfileReport(train)

pp.ProfileReport(test)

The interactive pandas_profiling reports above give a clear picture and description of each attribute, and of what to expect during the analysis and the building of predictive models for this project.

<div class="alert alert-block alert-info">
    <center>
<b> UNIVARIATE VISUALISATIONS</b>
    </center>
</div>

In [None]:
#Let's Visualise the target's (Promoted_or_Not) distribution
train['Promoted_or_Not'].iplot(kind='hist', xTitle='Promotion (0=No, 1=Yes)',
yTitle ='count', title='Distribution of Promotion')

Based on the profile of 'State_Of_Origin', this categorical variable has 37 levels. It is good practice to verify its distribution with an interactive plot.

<div class="alert alert-block alert-info">
    <center>
<b>MISSING DATA TREATMENTS</b>
    </center>
</div> 

* NaN marks the missing values in Qualification. The three levels I'm familiar with are already present here, so I'll fill each NaN with the most frequent Qualification within the employee's Division. Division should give us enough information to fill the missing Qualification for the purpose of analysis and interpretation.

* Note: I tried other approaches to imputing missing values before resorting to this method; they seemed to hurt the results of my model.
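
The group-wise fill described above can be sketched on a small made-up frame. The helper name `fill_with_group_mode` and the toy values are assumptions for illustration; the idea is simply "most frequent Qualification within each Division":

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Division": ["Sales", "Sales", "Sales", "IT", "IT", "IT"],
    "Qualification": ["First Degree or HND", "First Degree or HND", np.nan,
                      "MSc, MBA and PhD", np.nan, "MSc, MBA and PhD"],
})

def fill_with_group_mode(values: pd.Series) -> pd.Series:
    """Fill NaN with the group's most frequent value; leave all-NaN groups untouched."""
    mode = values.mode()
    return values.fillna(mode.iloc[0]) if not mode.empty else values

# transform() returns a result aligned to the original row index.
toy["Qualification"] = (
    toy.groupby("Division")["Qualification"].transform(fill_with_group_mode)
)
print(toy)
```

The guard for an empty mode matters when a whole group is missing, in which case there is nothing to borrow from and the NaN values stay.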

In [None]:
# How many missing values are there in our dataset and how are they distributed?
missingno.matrix(train, figsize = (15,9))
missingno.bar(train, sort='descending', figsize = (15,9))

In [None]:
#Treating missing Data
#imp = SimpleImputer(missing_values = np.nan, strategy = 'most_frequent')
#train_df = pd.DataFrame(imp.fit_transform(train), columns=train.columns, index=train.index)
#test_df = pd.DataFrame(imp.fit_transform(test), columns=test.columns, index=test.index)
#test_df = test.fillna((test).mode().iloc[0]).astype('object') #interpolate(method='pad', limit_direction='both')

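# Fill each missing Qualification with the most frequent value in the employee's Division.
# transform() keeps the original row index, so the result aligns with the DataFrame on any pandas version.
train["Qualification"] = train["Qualification"].astype('object')
train['Qualification'] = train.groupby(["Division"])["Qualification"].transform(lambda x: x.fillna(x.value_counts().index[0]))

test["Qualification"] = test["Qualification"].astype('object')
test['Qualification'] = test.groupby(["Division"])["Qualification"].transform(lambda x: x.fillna(x.value_counts().index[0]))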

#train_df = train.dropna()
#test_df = test.fillna(method = "ffill")

<div class="alert alert-block alert-info">
    <center>
<b>COPYING THE DATA TO NEW DATAFRAME</b>
    </center>
</div> 

In [None]:
train_df = train.copy() #The train data has been copied into train_df
test_df = test.copy()   #The test data has been copied into test_df

In [None]:
#Let's check the outcome of missing values filled
train_df.Qualification.value_counts()
test_df.Qualification.value_counts()

print ("There are no missing values anymore, we can go ahead with the project now")
sns.heatmap(train_df.isnull(),yticklabels= False,cbar= False, cmap='viridis')
#sns.heatmap(test_df.isnull(),yticklabels= False,cbar= False, cmap='viridis')

<div class="alert alert-block alert-info">
    <center>
<b>UNIVARIATE VISUALISATION OF VARIABLES</b>
    </center>
</div> 

In [None]:
print("This is the distribution of Employee by State of Origin in the Train Data")
#Check State's distribution
train['State_Of_Origin'].iplot(kind='hist', xTitle='State of Origin',
yTitle ='count', title='Employees States of Origin')

print("This is the distribution of Employee by State of Origin in the Test Data")
#Check State's distribution
test['State_Of_Origin'].iplot(kind='hist', xTitle='State of Origin',
yTitle ='count', title='Employees States of Origin')

In [None]:
print("This is the distribution of Employee by Gender in the Train Data")
#Check Gender distribution
train['Gender'].iplot(kind='hist', xTitle="Employees' Gender",
yTitle ='count', title="Employees' Gender Distribution")

print("This is the distribution of Employee by Gender in the Test Data")
#Check Gender distribution
test['Gender'].iplot(kind='hist', xTitle="Employees' Gender",
yTitle ='count', title="Employees' Gender Distribution")

In [None]:
print("This is the distribution of Employee by Performance Score in the Train Data")
#Check Performance distribution
train['Last_performance_score'].iplot(kind='hist', xTitle='Performance Score',
yTitle ='count', title='Employees Last Performance Score')

print("This is the distribution of Employee by Performance Score in the Test Data")
#Check Performance distribution
test['Last_performance_score'].iplot(kind='hist', xTitle='Performance Score',
yTitle ='count', title='Employees Last Performance Score')

In [None]:
print("This is the distribution of Employee by Average of Training Score in the Train Data")
#Check Averaged Training Scores' distribution
train['Training_score_average'].iplot(kind='hist', xTitle= 'Average of Training Score',
yTitle ='count', title='Employees Averaged Training Scores')

print("This is the distribution of Employee by Average of Training Score in the Test Data")
#Check Averaged Training Scores' distribution
test['Training_score_average'].iplot(kind='hist', xTitle= 'Average of Training Score',
yTitle ='count', title='Employees Averaged Training Scores')

In [None]:
print("This is the distribution of Employee by Number of Trainings Attended in the Train Data")
#Check the distribution of the number of trainings attended
train['Trainings_Attended'].iplot(kind='hist', xTitle= 'Number of Training',
yTitle ='count', title='Number of Past Trainings')

print("This is the distribution of Employee by Number of Trainings Attended in the Test Data")
#Check the distribution of the number of trainings attended
test['Trainings_Attended'].iplot(kind='hist', xTitle= 'Number of Training',
yTitle ='count', title='Number of Past Trainings')

In [None]:
print("This is the distribution of Employee by Educational Qualification in the Train Data")
#Check Qualification distribution. Note: NaN would not feature in the measurement
train['Qualification'].iplot(kind='hist', xTitle= 'Educational Qualification',
yTitle ='count', title='Level of Education')

print("This is the distribution of Employee by Educational Qualification in the Test Data")
#Check Qualification distribution. Note: NaN would not feature in the measurement
test['Qualification'].iplot(kind='hist', xTitle= 'Educational Qualification',
yTitle ='count', title='Level of Education')

<div class="alert alert-block alert-success">
    <center>
<b>SECURING MORE LEVEL OF DETAILS FROM EMPLOYEES RECENT PROMOTION USING OTHER ATTRIBUTES</b>
    </center>
</div>

<div class="alert alert-block alert-info">
    <center>
<b>Performance Score And Previous Award Assessment </b>
    </center>
</div>

In [None]:
train_df.groupby(["Previous_Award"])['Last_performance_score'].mean()

With this level of variation, the performance score appears to be associated with an employee having received a Previous Award. It is worth checking whether it is also associated with Promotion.

In [None]:
pd.crosstab([train_df.Previous_Award, train_df.Last_performance_score], train_df.Promoted_or_Not,  margins=True)

A large proportion of the employees that got an Award eventually got Promoted. We can therefore expect these to be predictors of Employee Promotion.

<div class="alert alert-block alert-info">
    <center>
<b>Targets Met, Training Score Average and Promotion of Employee Assessment and Visualisation </b>
    </center>
</div>

In [None]:
train_df.groupby(["Targets_met"])['Training_score_average'].mean()

There is a slight difference in the mean Training Score Average between employees that met targets and those that did not. I'll check for the variation in promotion below.

In [None]:
pd.crosstab([train_df.Targets_met, train_df.Training_score_average], train_df.Promoted_or_Not, margins=True)

Here we go: Targets Met, Performance Score and Training Score Average are strong predictors of Promotion in this regard, so the model must account for them.
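
A crosstab against a continuous score produces one row per distinct score, which is hard to read. One possible alternative, sketched on made-up data, is to band the score first and then compare promotion rates. The cut-off of 60 and the band labels are arbitrary assumptions for illustration:

```python
import pandas as pd

toy = pd.DataFrame({
    "Targets_met":            [1, 1, 1, 1, 0, 0, 0, 0],
    "Training_score_average": [45, 55, 72, 88, 41, 58, 75, 90],
    "Promoted_or_Not":        [0, 0, 1, 1, 0, 0, 0, 1],
})

# Band the continuous score; 60 is an arbitrary illustrative threshold.
toy["score_band"] = pd.cut(
    toy["Training_score_average"], bins=[0, 60, 100], labels=["<=60", ">60"]
).astype(str)

# Mean of a 0/1 target is the promotion rate within each cell.
rates = toy.groupby(["Targets_met", "score_band"])["Promoted_or_Not"].mean()
print(rates.unstack())
```

Each cell is then directly comparable, regardless of how many employees fall in it.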

In [None]:
plt.rcParams['figure.figsize'] = [5, 5]  # set the figure size before plotting so it applies to this chart
KPI = pd.crosstab(train_df.Targets_met,train_df.Promoted_or_Not,normalize='index')
KPI.plot.bar(stacked=True)
plt.legend(title='is_promoted',loc='upper left',bbox_to_anchor=(1, 0.5))

The stacked plot above reveals the importance of meeting targets before an Employee can really qualify for Promotion.

<div class="alert alert-block alert-info">
    <center>
<b>Gender Based Promotion Assessment </b>
    </center>
</div>

In [None]:
plt.rcParams['figure.figsize'] = [3, 5]
Emp_Gender_pro = pd.crosstab(train_df.Gender,train_df.Promoted_or_Not,normalize='index')
Emp_Gender_pro.plot.bar(stacked=True)
plt.legend(title='is_promoted',loc='upper left',bbox_to_anchor=(1, 0.5))

The organisation seems to take workplace equity very seriously where staff promotion is concerned. I might end up not using the Gender attribute.

In [None]:
pd.crosstab([train_df.Gender, train_df.Division], train_df.Promoted_or_Not, margins=True)

Just as shown in the plot above, there is no variation in the percentage of Promotion across the various Departments in the organisation as far as Gender is concerned.

<div class="alert alert-block alert-info">
    <center>
<b>DATA PREPROCESSING AND FEATURE ENGINEERING</b>
    </center>
</div> 

* I'll be dealing with all the categorical variables in the two Data Sets in this section.

* Numerical variables that are not normally distributed will be addressed here, and I'll generate other variables from the results if necessary. 

In [None]:
#How many States are there in our data?

train_df['State_Of_Origin'].value_counts().count()
test_df['State_Of_Origin'].value_counts().count()

In [None]:
#Here I will create a Dictionary of State to get their geo-political zone, then we'll replace States with their Region 
Region = {"BENUE":"North-Central", "KOGI":"North-Central", "KWARA":"North-Central", "NASSARAWA":"North-Central",
            "NIGER":"North-Central", "PLATEAU":"North-Central", "FCT":"North-Central", "ADAMAWA":"North-East", "BAUCHI":"North-East", 
            "BORNO":"North-East", "GOMBE":"North-East", "TARABA":"North-East", "YOBE":"North-East", "JIGAWA":"North-West","KADUNA":"North-West", 
            "KANO":"North-West", "KATSINA":"North-West", "KEBBI":"North-West", "SOKOTO":"North-West", "ZAMFARA":"North-West", "ABIA":"South-East", 
            "ANAMBRA":"South-East", "EBONYI":"South-East", "ENUGU":"South-East", "IMO":"South-East", "AKWA IBOM":"South-South", "BAYELSA":"South-South",
            "CROSS RIVER":"South-South", "RIVERS":"South-South", "DELTA":"South-South", "EDO":"South-South", "EKITI":"South-West", "LAGOS":"South-West",
            "OGUN":"South-West", "ONDO":"South-West", "OSUN":"South-West", "OYO":"South-West"}

data = [train_df, test_df]
for dataset in data:
    dataset['State_Of_Origin'] = dataset['State_Of_Origin'].map(Region).astype('category')

test_df.head()

print ("I have done this to cut down computational cost, increase the model's efficiency, and make analysis and interpretation more effective")
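
Because `Series.map` returns NaN for any key missing from the dictionary, a spelling or casing mismatch in a state name would silently become a missing region. A small sketch of a check, using a three-entry excerpt of the mapping and invented state values (`"Lagos "` and `"ABUJA"` are made up to trigger the two failure modes):

```python
import pandas as pd

# Short excerpt of the state-to-region dictionary, for illustration only.
region_of = {"LAGOS": "South-West", "KANO": "North-West", "ENUGU": "South-East"}

states = pd.Series(["LAGOS", "KANO", "Lagos ", "ENUGU", "ABUJA"],
                   name="State_Of_Origin")

# Normalise whitespace and case before mapping.
normalised = states.str.strip().str.upper()
mapped = normalised.map(region_of)

# Anything still unmapped needs a dictionary entry or a data fix.
unmapped = sorted(normalised[mapped.isnull()].unique())
print("Unmapped states:", unmapped)
```

Running the same check on the real column before the `.astype('category')` step would show whether every state found its region.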

In [None]:
train_df['State_Of_Origin'].iplot(kind='hist', xTitle="Region of Origin" ,
yTitle ='count', title="Distribution of Employees' Region of Origin")

test_df['State_Of_Origin'].iplot(kind='hist', xTitle="Region of Origin" ,
yTitle ='count', title="Distribution of Employees' Region of Origin")
message()
print("""This is the distribution of employees by their region of origin. 
You can get a real-time count for each region by hovering your pointer over the plot.

You can drill down using the tools at the top right-hand corner of the plot.

The plots show that there are more employees from the South-West region than from any other region. 
We will look at the channel through which most of these employees were recruited later.""")

In [None]:
ps = (18, 14)
sns.set_style('whitegrid')
fig, ax = plt.subplots(figsize = ps)
sns.countplot(ax=ax, x='State_Of_Origin',hue='Promoted_or_Not',data=train_df,palette='rainbow')

In [None]:
train_df["State_Of_Origin"].value_counts()

<div class="alert alert-block alert-danger">
    <center>
<b>CONVERTING OBJECT TYPE TO CATEGORY</b>
    </center>
</div>

In [None]:
sns.heatmap(test_df.isnull(),yticklabels= False,cbar= False, cmap='viridis')

In [None]:
data = [train_df, test_df]
for dataset in data:
    dataset["Gender"] = dataset["Gender"].astype("category")
test_df["Gender"].dtype

In [None]:
data = [train_df, test_df]
for dataset in data:
    dataset["Channel_of_Recruitment"] = dataset["Channel_of_Recruitment"].astype("category")
test_df["Channel_of_Recruitment"].dtype

In [None]:
data = [train_df, test_df]
for dataset in data:
    dataset["No_of_previous_employers"] = dataset["No_of_previous_employers"].astype("category")
test_df["No_of_previous_employers"].dtype

In [None]:
Employment_History = {'0' : 0, '1' : 1, '2': 2, '3': 3, '4': 4, '5' : 5, 'More than 5': 6}

data = [train_df, test_df]
for dataset in data:
    dataset['No_of_previous_employers'] = dataset['No_of_previous_employers'].map(Employment_History)

In [None]:
ps = (16, 10)
sns.set_style('whitegrid')
fig, ax = plt.subplots(figsize = ps)
sns.countplot(ax=ax, x='Past_Disciplinary_Action',hue='Promoted_or_Not',data=train_df,palette='rainbow')

Employees with no previous disciplinary action were promoted more often. This should be a predictor as well.

<div class="alert alert-block alert-danger">
    <center>
<b>GENERATING NEW VARIABLES FROM OUR QUANTITATIVE VARIABLE</b>
    </center>
</div>

In [None]:
#Based on the recruitment year and the birth year, I'll create a new variable that tells me the age of each employee at the time of employment
train_df['Age_at_Employment'] = train_df['Year_of_recruitment'] - train_df['Year_of_birth'].astype('int')
test_df['Age_at_Employment'] = test_df['Year_of_recruitment'] - test_df['Year_of_birth'].astype('int')

test_df.head()
message()

In [None]:
#Based on the recruitment year, I'll create a new variable that tells me the length of service with the company, measured up to 2019
train_df['length_of_service'] = 2019 - train_df['Year_of_recruitment']
test_df['length_of_service'] = 2019 - test_df['Year_of_recruitment']

#I will release it from this current cell now
#train_df.drop(['Year_of_recruitment'], axis=1, inplace=True)
#test_df.drop(['Year_of_recruitment'], axis=1, inplace=True)
test_df.tail()

message()
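
The two derived features can be sanity-checked once they exist. A sketch on made-up years, where `REFERENCE_YEAR` mirrors the 2019 used above and the minimum working age of 16 is an assumption chosen only for illustration:

```python
import pandas as pd

REFERENCE_YEAR = 2019   # same cut-off year the notebook uses
MIN_WORKING_AGE = 16    # assumed threshold, not taken from the dataset

toy = pd.DataFrame({
    "Year_of_birth":       [1986, 1991, 1978],
    "Year_of_recruitment": [2011, 2015, 2001],
})

toy["Age_at_Employment"] = toy["Year_of_recruitment"] - toy["Year_of_birth"]
toy["length_of_service"] = REFERENCE_YEAR - toy["Year_of_recruitment"]

# Rows that look like data-entry errors: hired too young or "hired in the future".
implausible = toy[(toy["Age_at_Employment"] < MIN_WORKING_AGE)
                  | (toy["length_of_service"] < 0)]
print(toy)
print("Implausible rows:", len(implausible))
```

Any rows flagged on the real data would be worth inspecting before the features are scaled.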

<div class="alert alert-block alert-success">
    <center>
<b>VISUALISATION OF VARIABLES RELATIONSHIPS</b>
    </center>
</div>

In [None]:
ps = (18, 14)
sns.set_style('whitegrid')
fig, ax = plt.subplots(figsize = ps)
sns.countplot(ax=ax, x='Age_at_Employment',hue='Promoted_or_Not',data=train_df,palette='RdBu_r')

In [None]:
ps = (18, 14)
sns.set_style('whitegrid')
fig, ax = plt.subplots(figsize = ps)
sns.countplot(ax=ax, x='Gender',hue='Previous_Award',data=train_df,palette='RdBu_r')

In [None]:
#Binning the Last Performance Score into categories
bins_1 = np.linspace(min(train_df["Last_performance_score"]), max(train_df["Last_performance_score"]), 3)
bins_2 = np.linspace(min(test_df["Last_performance_score"]), max(test_df["Last_performance_score"]), 3)

#We group the result with these categories
group_names = ['Low', 'High']
train_df['Binned_Score'] = pd.cut(train_df['Last_performance_score'], bins_1, labels=group_names, include_lowest=True )
test_df['Binned_Score'] = pd.cut(test_df['Last_performance_score'], bins_2, labels=group_names, include_lowest=True )

#train_df.drop(['Last_performance_score'], axis=1, inplace=True)
#test_df.drop(['Last_performance_score'], axis=1, inplace=True)

#Let's Visualise the Bins
test_df['Binned_Score'].iplot(kind='hist', xTitle='Binned Last Performance Score',
yTitle ='count', title='Last Performance Score of Employee')

ex_1 = pd.get_dummies(train_df["Binned_Score"])
ex_2 = pd.get_dummies(test_df["Binned_Score"])

ex_1.rename(columns={'Low':'Low_PS', 'High':'High_PS'}, inplace=True)
ex_2.rename(columns={'Low':'Low_PS', 'High':'High_PS'}, inplace=True)

# merge data  and "ex_1" and "ex_2" accordingly
train_df = pd.concat([train_df, ex_1], axis=1)
test_df = pd.concat([test_df, ex_2], axis=1)

# drop original column "Binned_Score"
train_df.drop("Binned_Score", axis = 1, inplace=True)
test_df.drop("Binned_Score", axis = 1, inplace=True)

test_df.tail()
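
The cell above computes separate bin edges for the train and test sets, so the same score could land in different bins depending on which set it belongs to. A sketch of the difference on made-up scores, with one possible alternative (reusing the train edges), offered as an option rather than as the notebook's method:

```python
import numpy as np
import pandas as pd

train_scores = pd.Series([0.0, 2.5, 5.0, 7.5, 10.0, 12.5])
test_scores = pd.Series([5.0, 7.5, 12.5])
labels = ["Low", "High"]

# Edges derived from the train set only, then reused for the test set.
shared_edges = np.linspace(train_scores.min(), train_scores.max(), 3)
test_shared = pd.cut(test_scores, shared_edges, labels=labels, include_lowest=True)

# Edges derived from the test set itself, as in the cell above.
own_edges = np.linspace(test_scores.min(), test_scores.max(), 3)
test_own = pd.cut(test_scores, own_edges, labels=labels, include_lowest=True)

print(list(test_shared))  # a score of 7.5 is "High" with the train edges
print(list(test_own))     # the same 7.5 is "Low" with the test's own edges
```

One caveat with shared edges: a test score outside the train range becomes NaN, so the edges may need widening.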

<div class="alert alert-block alert-info">
    <center>
<b>FURTHER DRILL INTO THE DATA</b>
    </center>
</div>

In [None]:
pd.crosstab([train_df.Previous_Award, train_df.No_of_previous_employers], train_df.Promoted_or_Not, margins=True)

In [None]:
pd.crosstab([train_df.Previous_Award, train_df.Training_score_average], train_df.Promoted_or_Not, margins=True)

In [None]:
pd.crosstab([train_df.Qualification, train_df.Targets_met], train_df.Promoted_or_Not, margins=True)

In [None]:
pd.crosstab([train_df.Trainings_Attended, train_df.Training_score_average], train_df.Promoted_or_Not, margins=True)

In [None]:
pd.crosstab([train_df.Marital_Status, train_df.Targets_met], train_df.Promoted_or_Not, margins=True)

In [None]:
pd.crosstab([train_df.State_Of_Origin, train_df.Age_at_Employment], train_df.Promoted_or_Not, margins=True)

In [None]:
pd.crosstab([train_df.Previous_IntraDepartmental_Movement, train_df.Targets_met], train_df.Promoted_or_Not, margins=True)

In [None]:
pd.crosstab([train_df.Division, train_df.Qualification], train_df.Promoted_or_Not, margins=True) 

 <div class="alert alert-block alert-success">
    <center>
<b>MORE FEATURE GENERATION</b>
    </center>
</div>

In [None]:
#train_df["Training_score_rec"] = np.where(((train_df["Training_score_average"]>=39) & (train_df["Targets_met"]==1)),1,0)
train_df["Targets_Met_Reward"] = np.where(((train_df["Targets_met"]==1) & (train_df["Previous_Award"]==1)),1,0)
#train_df["Gold_Employee"] = np.where(((train_df["Trainings_Attended"]<=6) & (train_df['Training_score_average']<=65)), 1,0)
#train_df["Targets_Met_Reward"] = np.where(((train["Last_performance_score"]>=5.0) & (train["Targets_met"]==1)),1,0)
#train_df["High_Potential"] = np.where(((train_df["Previous_Award"]>=0) & (train_df["length_of_service"]<=23)),1,0)


#test_df["Training_score_rec"] = np.where(((test_df["Training_score_average"]>=39) & (test_df["Targets_met"]==1)),1,0)
test_df["Targets_Met_Reward"] = np.where(((test_df["Targets_met"]==1) & (test_df["Previous_Award"]==1)),1,0)
#test_df['Gold_employee'] = np.where(((test_df["Trainings_Attended"]<=6) & (test_df['Training_score_average']<=65)), 1,0)                                                                          
#test_df["Targets_Met_Reward"] = np.where(((test_df["Last_performance_score"]>=5.0) & (test_df["Targets_met"]==1)),1,0)
#test_df["High_Potential"] = np.where(((test_df["Previous_Award"]>=0) & (test_df["length_of_service"]<=23)),1,0)

In [None]:
train_df.head(4)

In [None]:
exp_1 = pd.get_dummies(train_df["Qualification"])
exp_2 = pd.get_dummies(test_df["Qualification"])

exp_1.rename(columns={'Qualification-First Degree or HND':'First Degree or HND',
                        'Qualification-MSc, MBA and PhD':'MSc, MBA and PhD',
                        'Qualification-Non-University Education':'Non-University Education', 
                        'Qualification-Unknown': 'Unknown'}, inplace=True)
exp_2.rename(columns={'Qualification-First Degree or HND':'First Degree or HND', 
                        'Qualification-MSc, MBA and PhD':'MSc, MBA and PhD',
                        'Qualification-Non-University Education':'Non-University Education', 
                        'Qualification-Unknown': 'Unknown'}, inplace=True)

# merge data  and "exp_1" and "exp_2" accordingly
train_df = pd.concat([train_df, exp_1], axis=1)
test_df = pd.concat([test_df, exp_2], axis=1)

# drop original column "Qualification" from the the two sets
train_df.drop("Qualification", axis = 1, inplace=True)
test_df.drop("Qualification", axis = 1, inplace=True)

test_df.tail()
train_df.head()

In [None]:
data = [train_df, test_df]
for dataset in data:
    dataset['Division'] = dataset['Division']

exp_1 = pd.get_dummies(train_df["Division"])
exp_2 = pd.get_dummies(test_df["Division"])

exp_1.rename(columns={'Division-Commercial Sales and Marketing':'Commercial Sales and Marketing',
                        'Division-Sourcing and Purchasing':'Sourcing and Purchasing',
                        'Division-Information Technology and Solution Support':'Information Technology and Solution Support',
                        'Division-Information and Strategy':"Information and Strategy",
                        'Division-Business Finance Operations': "Business Finance Operations",
                        'Division-People/HR Management':"People/HR Management",
                        'Division-Regulatory and Legal services':"Regulatory and Legal services",
                        'Division-Research and Innovation':"Research and Innovation",
                        'Division-Customer Support and Field Operations':"Customer Support and Field Operations" 
                       }, inplace=True)
exp_2.rename(columns={'Division-Commercial Sales and Marketing':'Commercial Sales and Marketing',
                        'Division-Sourcing and Purchasing':'Sourcing and Purchasing',
                        'Division-Information Technology and Solution Support':'Information Technology and Solution Support',
                        'Division-Information and Strategy':"Information and Strategy",
                        'Division-Business Finance Operations': "Business Finance Operations",
                        'Division-People/HR Management':"People/HR Management",
                        'Division-Regulatory and Legal services':"Regulatory and Legal services",
                        'Division-Research and Innovation':"Research and Innovation",
                        'Division-Customer Support and Field Operations':"Customer Support and Field Operations" 
                       }, inplace=True)

# merge data  and "exp_1" and "exp_2" accordingly
train_df = pd.concat([train_df, exp_1], axis=1)
test_df = pd.concat([test_df, exp_2], axis=1)

# drop original column "Division" from the two sets
train_df.drop("Division", axis = 1, inplace=True)
test_df.drop("Division", axis = 1, inplace=True)

test_df.tail()
train_df.info()
message()
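
Encoding the train and test sets separately with `pd.get_dummies` only yields matching columns if both sets contain the same categories. A sketch of one way to guard against a mismatch, on made-up divisions (the toy values and the choice of `reindex` are assumptions for illustration):

```python
import pandas as pd

train_toy = pd.DataFrame({"Division": ["Sales", "IT", "Legal"]})
test_toy = pd.DataFrame({"Division": ["IT", "Sales", "IT"]})  # no "Legal" rows

train_dummies = pd.get_dummies(train_toy["Division"], dtype=int)

# Force the test dummies onto the train columns; absent categories become 0.
test_dummies = (
    pd.get_dummies(test_toy["Division"], dtype=int)
    .reindex(columns=train_dummies.columns, fill_value=0)
)

print(list(train_dummies.columns))
print(test_dummies)
```

With the columns aligned, a model trained on the train set sees the same feature layout at prediction time.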

 <div class="alert alert-block alert-info">
    <center>
<b>DROPPING SOME FEATURES AT THIS LEVEL</b>
    </center>
</div>

In [None]:
train_df.drop(["EmployeeNo", "Gender", "Channel_of_Recruitment", "Trainings_Attended",
              "Year_of_birth", "Last_performance_score","Year_of_recruitment", "State_Of_Origin",
              "Marital_Status","No_of_previous_employers", "Promoted_or_Not", "length_of_service"], axis=1, inplace=True)

test_df.drop(["EmployeeNo", "Gender", "Channel_of_Recruitment", "Trainings_Attended",
               "Year_of_birth", "Last_performance_score","Year_of_recruitment", "State_Of_Origin",
               "Marital_Status", "No_of_previous_employers","length_of_service"], axis=1, inplace=True)

In [None]:
train_df['Foreign_schooled'] = train_df['Foreign_schooled'].astype('category')
test_df['Foreign_schooled'] = test_df['Foreign_schooled'].astype('category')
train_df.Previous_IntraDepartmental_Movement = train_df.Previous_IntraDepartmental_Movement.astype('category')
test_df.Previous_IntraDepartmental_Movement = test_df.Previous_IntraDepartmental_Movement.astype('category')
train_df.Past_Disciplinary_Action = train_df.Past_Disciplinary_Action.astype('category')
test_df.Past_Disciplinary_Action = test_df.Past_Disciplinary_Action.astype('category')

In [None]:
train_df.head(3)

<div class="alert alert-block alert-success">
    <center>
<b>FEATURE SCALING USING MinMaxScaler</b>
    </center>
</div>

In [None]:
scaling_tool = MinMaxScaler(feature_range=(0, 1), copy=True)
columns = ['Training_score_average', 'Age_at_Employment']
# Fit the scaler on the training set only, then reuse its statistics on the test set.
# Assigning the NumPy arrays directly avoids index misalignment (and NaNs)
# if either frame does not have a default RangeIndex.
train_df[columns] = scaling_tool.fit_transform(train_df[columns].values)
test_df[columns] = scaling_tool.transform(test_df[columns].values)

test_df.head()
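
To make the scaling step concrete, here is a minimal from-scratch sketch of the min-max formula, (x - min) / (max - min), with the statistics learned on the training values only. The helper names and the ages are hypothetical and only for illustration; the notebook itself relies on `MinMaxScaler`.

```python
def fit_min_max(values):
    """Learn the minimum and maximum of a training column."""
    return min(values), max(values)


def apply_min_max(values, lo, hi):
    """Rescale values with the training statistics: (x - lo) / (hi - lo)."""
    span = hi - lo
    if span == 0:
        return [0.0 for _ in values]  # constant column: map everything to 0
    return [(v - lo) / span for v in values]


# Hypothetical ages at employment, for illustration only.
train_ages = [20, 25, 30, 40]
test_ages = [18, 30, 45]

lo, hi = fit_min_max(train_ages)          # statistics come from train only
scaled_train = apply_min_max(train_ages, lo, hi)
scaled_test = apply_min_max(test_ages, lo, hi)

print(scaled_train)  # [0.0, 0.25, 0.5, 1.0]
print(scaled_test)   # [-0.1, 0.5, 1.25]
```

Note that test values outside the training range fall outside [0, 1]. That is expected when the scaler is fitted on the training set alone.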

In [None]:
#clean_train = pd.DataFrame(train_df.copy())
#clean_test = pd.DataFrame(train_df.copy())
label_names=['Foreign_schooled', "Past_Disciplinary_Action", "Previous_IntraDepartmental_Movement"]
train_df = pd.get_dummies(train_df, columns=label_names, prefix=label_names, drop_first=True)
test_df = pd.get_dummies(test_df, columns=label_names, prefix=label_names, drop_first=True)
train_df.head(3)
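
As a rough illustration of what `drop_first=True` does, the sketch below encodes a single column by hand and drops the first level, so a two-level feature becomes one 0/1 column. The function name and the "Yes"/"No" values are assumptions for the example; the real encoding is done by `pd.get_dummies` above.

```python
def dummies_drop_first(values, prefix):
    """One-hot encode a column by hand, dropping the first (sorted) level."""
    levels = sorted(set(values))
    kept = levels[1:]  # the first level becomes the implicit baseline
    return [{f"{prefix}_{lvl}": int(v == lvl) for lvl in kept} for v in values]


# Hypothetical column values, for illustration only.
rows = dummies_drop_first(["No", "Yes", "No"], prefix="Foreign_schooled")
print(rows)
# [{'Foreign_schooled_Yes': 0}, {'Foreign_schooled_Yes': 1}, {'Foreign_schooled_Yes': 0}]
```

Dropping one level loses no information, because a row with zeros in every kept column must belong to the baseline level.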

In [None]:
target = train.Promoted_or_Not

In [None]:
X_train, X_test, y_train, y_test = train_test_split(train_df, target,
                                                    test_size=0.2,
                                                    random_state=121, stratify=target) # I'm setting the seed here

X_val, X_val_test, y_val, y_val_test = train_test_split(X_test, y_test,
                                                    test_size=0.3,
                                                    random_state=121, stratify=y_test)
X_train.shape, X_test.shape, X_val.shape, X_val_test.shape
message()
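
The two splits above leave 80% of the rows for training, 14% for validation (`X_val`) and 6% for a final check (`X_val_test`). The sketch below reproduces that arithmetic with a simplified hand-written stratified split. It is not scikit-learn's algorithm, and the 900/100 class counts are made up for the example.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size, seed):
    """Return (train_idx, test_idx), keeping each class's share roughly equal."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical imbalanced target: 900 not promoted (0), 100 promoted (1).
labels = [0] * 900 + [1] * 100

# First split: 80% train, 20% held out.
train_idx, holdout_idx = stratified_split(labels, test_size=0.2, seed=121)
holdout_labels = [labels[i] for i in holdout_idx]

# Second split of the held-out part: 70% validation, 30% final check.
val_pos, final_pos = stratified_split(holdout_labels, test_size=0.3, seed=121)

print(len(train_idx), len(val_pos), len(final_pos))  # 800 140 60
print(sum(holdout_labels) / len(holdout_labels))     # 0.1
```

Stratifying matters here because promoted employees are the minority class; without it, a small split could end up with too few positives to score reliably.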

<div class="alert alert-block alert-info">
    <center>
<b>USING RANDOMSEARCH CROSS-VALIDATION TO TUNE THE PARAMETERS OF CATBOOST</b>
    </center>
</div>

In [None]:
# randomized_search trains the candidate models itself, so no separate fit call is needed beforehand
model = CatBoostClassifier(loss_function='Logloss')

grid = {'learning_rate': [0.03, 0.1],
        'depth': [4, 6, 10],
        'l2_leaf_reg': [1, 3, 5, 7, 9]}

randomized_search_result = model.randomized_search(grid,
                                                   X=X_train,
                                                   y=y_train,
                                                   plot=True)

THE SEARCH TOOK A LONG TIME TO COMPLETE BECAUSE OF SYSTEM LIMITATIONS, SO I ADJUSTED THE PARAMETERS IT RETURNED BY HAND FOR TRAINING AND TESTING. SEE THE OUTCOMES BELOW.
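
To give a feel for the size of the search, the sketch below enumerates the grid used above and draws a random subset of it, which is the basic idea behind a randomized search. The budget of 10 trials and the seed are assumptions for the example, not values taken from the run above.

```python
import itertools
import random

# Same grid as in the cell above.
grid = {
    "learning_rate": [0.03, 0.1],
    "depth": [4, 6, 10],
    "l2_leaf_reg": [1, 3, 5, 7, 9],
}

# Every possible combination of the listed values.
names = list(grid)
combinations = [dict(zip(names, values)) for values in itertools.product(*grid.values())]
print(len(combinations))  # 30

# A randomized search evaluates only a sample of them.
rng = random.Random(0)   # assumed seed, for reproducibility of the example
n_trials = 10            # assumed budget
sampled = rng.sample(combinations, n_trials)
for params in sampled:
    print(params)
```

Each sampled combination still means training a full model (several times if cross-validation is used), which is why the search is slow on deep trees and wide one-hot encoded data.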

<div class="alert alert-block alert-info">
    <center>
<b>MODEL TRAINING, VALIDATION AND PREDICTION</b>
    </center>
</div>

In [None]:
model.fit(X_train, y_train, eval_set=(X_val, y_val), use_best_model=True, verbose=False)

In [None]:
pred_val_t = model.predict(X_val_test)

In [None]:
Accuracy = accuracy_score(y_val_test,pred_val_t)
F1_score = f1_score(y_val_test,pred_val_t)
precision = precision_score(y_val_test,pred_val_t)
recall = recall_score(y_val_test,pred_val_t)
cm = confusion_matrix(y_val_test,pred_val_t)

print(Accuracy, F1_score, precision, recall)
print (cm)
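
Since the competition is scored on F1, it helps to see how the four numbers printed above relate to the confusion matrix. The sketch below derives them by hand using scikit-learn's layout `[[TN, FP], [FN, TP]]`; the counts are hypothetical and not results from this notebook.

```python
def binary_metrics(cm):
    """Compute accuracy, precision, recall and F1 from [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, precision, recall, f1


# Hypothetical confusion matrix for an imbalanced target, for illustration only.
example_cm = [[900, 10], [60, 30]]
acc, prec, rec, f1_value = binary_metrics(example_cm)
print(round(acc, 3), round(prec, 3), round(rec, 3), round(f1_value, 3))
# 0.93 0.75 0.333 0.462
```

In this made-up case accuracy looks high at 0.93 while F1 is only about 0.46, which is why accuracy alone can be misleading when few employees are promoted.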

In [None]:
# cb_model is only defined further down; use the model trained in the cells above
pred_test_ca = model.predict(test_df)

In [None]:
submission = pd.DataFrame({
        "EmployeeNo": test["EmployeeNo"],
        "Promoted_or_Not":pred_test_ca.astype('int')
    })

submission.to_csv('New_Pred_Cat.csv', index=False)

<div class="alert alert-block alert-info">
    <center>
<b>USING GridSearch CROSS-VALIDATION TO TUNE THE PARAMETERS OF RandomForestClassifier</b>
    </center>
</div>

In [None]:
forest = RandomForestClassifier(n_jobs=-1, random_state=0,class_weight='balanced',n_estimators=100,bootstrap=True, max_depth=80)
#forest = GradientBoostingClassifier(loss='exponential',max_features='auto')
param_grid = {
    'n_estimators': [200,500,800]
}
grid_search = GridSearchCV(estimator = forest, param_grid = param_grid,cv = 3, n_jobs = -1, verbose = 2)
# Fit the grid search to the data
grid_search.fit(X_train, y_train)
grid_search.best_params_

In [None]:
# Pair each feature name with its importance and sort from most to least important
importance_df = pd.DataFrame({'feature': X_train.columns,
                              'importance': grid_search.best_estimator_.feature_importances_})
importance_df = importance_df.sort_values('importance', ascending=False)
# Create the figure at the intended size before plotting; changing rcParams afterwards does not resize it
plt.figure(figsize=(18, 11))
sns.barplot(x="feature", y="importance", data=importance_df)
plt.xticks(rotation='vertical')

From the feature importance plot above, training score average, age at employment, targets met, previous award, Commercial Sales and Marketing, and a few others provided the most information for our model.

In [None]:
pred = grid_search.predict(X_val)

In [None]:
f1 = f1_score(y_val, pred)
'f1 score - '+str(f1)

<div class="alert alert-block alert-info">
    <center>
<b>USING GridSearch CROSS-VALIDATION TO TUNE THE PARAMETERS OF GradientBoostingClassifier</b>
    </center>
</div>

In [None]:
#forest = RandomForestClassifier(n_jobs=-1, random_state=0,class_weight='balanced',n_estimators=100,bootstrap=True, max_depth=80)
forest = GradientBoostingClassifier(loss='exponential',max_features='auto')
param_grid = {
    'n_estimators': [200,500,800]
}
grid_search = GridSearchCV(estimator = forest, param_grid = param_grid,cv = 3, n_jobs = -1, verbose = 2)
# Fit the grid search to the data
grid_search.fit(X_train, y_train)
grid_search.best_params_

In [None]:
# Pair each feature name with its importance and sort from most to least important
importance_df = pd.DataFrame({'feature': X_train.columns,
                              'importance': grid_search.best_estimator_.feature_importances_})
importance_df = importance_df.sort_values('importance', ascending=False)
# Create the figure at the intended size before plotting; changing rcParams afterwards does not resize it
plt.figure(figsize=(18, 10))
sns.barplot(x="feature", y="importance", data=importance_df)
plt.xticks(rotation='vertical')

From the feature importance plot above, training score average, targets met, previous award, Commercial Sales and Marketing, and a few others provided the most information for our model. Age at employment seems to provide less information here than it did for RandomForestClassifier.

In [None]:
pred = grid_search.predict(X_val)

In [None]:
f1 = f1_score(y_val, pred)
'f1 score - '+str(f1)

In [None]:
pred = grid_search.predict(X_val_test)

In [None]:
f1 = f1_score(y_val_test, pred)
'f1 score - '+str(f1)

In [None]:
confusion_matrix(y_val_test, pred)

<div class="alert alert-block alert-info">
    <center>
<b>Second Round of Hyperparameter Tuning for Gradient Boosting Classifier</b>
    </center>
</div>

In [None]:
forest = GradientBoostingClassifier(loss='exponential',max_features='auto', criterion = "friedman_mse")
param_grid = {'learning_rate':[0.3, 0.2],
    'n_estimators': [800,1000]
}
grid_search = GridSearchCV(estimator = forest, param_grid = param_grid,cv = 3, n_jobs = -1, verbose = 2)
# Fit the grid search to the data
grid_search.fit(X_train, y_train)
grid_search.best_params_

In [None]:
pred_ = grid_search.predict(X_val)

In [None]:
f1 = f1_score(y_val, pred_)
'f1 score - '+str(f1)

In [None]:
# Pair each feature name with its importance and sort from most to least important
importance_df = pd.DataFrame({'feature': X_train.columns,
                              'importance': grid_search.best_estimator_.feature_importances_})
importance_df = importance_df.sort_values('importance', ascending=False)
# Create the figure at the intended size before plotting; changing rcParams afterwards does not resize it
plt.figure(figsize=(18, 11))
sns.barplot(x="feature", y="importance", data=importance_df)
plt.xticks(rotation='vertical')

A slight improvement was observed after this round of tuning.

In [None]:
test_pred_gbm = grid_search.predict(test_df).astype('int')

In [None]:
submission = pd.DataFrame({
        "EmployeeNo": test["EmployeeNo"],
        "Promoted_or_Not": test_pred_gbm
    })

submission.to_csv('GBM_1.csv', index=False)

In [None]:
# Only the non-default settings are kept; the remaining arguments of the original call repeated
# scikit-learn defaults, and min_impurity_split and presort are no longer accepted by recent releases.
gb_tuned = GradientBoostingClassifier(learning_rate=0.2, loss='exponential', max_features='auto',
                                      min_samples_leaf=2, min_samples_split=3, n_estimators=2000)

<div class="alert alert-block alert-info">
    <center>
<b>AT THIS POINT, I STARTED MODIFYING CATBOOST PARAMETERS</b>
    </center>
</div>
**I had to do this because the parameter search took too long: one-hot encoding the features increased the number of columns and, with it, the time needed to train each model.**

## THE MODEL BELOW ACHIEVED 0.94603 ON THE LEADERBOARD

In [None]:
cb_model = CatBoostClassifier(iterations=800,
                             learning_rate=0.1,
                             depth=10,
                             eval_metric='F1',
                             random_seed = 42,
                             bagging_temperature = 0.23,
                             od_type='Iter',l2_leaf_reg = 3,random_strength = 0.2,
                             metric_period = 70,
                             od_wait=21)

In [None]:
cb_model.fit(X_train, y_train)
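
The `od_type='Iter'` and `od_wait` settings above ask CatBoost to stop once the evaluation metric has not improved for a given number of iterations. The sketch below shows that patience logic in plain Python. It is a simplified illustration with made-up scores, not CatBoost's implementation, and my understanding is that the detector needs an `eval_set` passed to `fit` to have anything to monitor.

```python
def stop_with_patience(scores, wait):
    """Return (stop_iteration, best_iteration) for a higher-is-better metric."""
    best_score = float("-inf")
    best_iter = -1
    for i, score in enumerate(scores):
        if score > best_score:
            best_score, best_iter = score, i   # new best: reset the patience window
        elif i - best_iter >= wait:
            return i, best_iter                # no improvement for `wait` rounds
    return len(scores) - 1, best_iter          # ran to the end without stopping


# Hypothetical validation F1 values per evaluation round, for illustration only.
val_scores = [0.50, 0.60, 0.65, 0.64, 0.66, 0.65, 0.65, 0.64, 0.63]
stopped_at, best_at = stop_with_patience(val_scores, wait=3)
print(stopped_at, best_at)  # 7 4
```

With a patience of 3, training stops at round 7 and the model from round 4 is the one worth keeping, which is what `use_best_model=True` is for when an evaluation set is supplied.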

In [None]:
pred_new_cat = cb_model.predict(X_val)

In [None]:
f1 = f1_score(y_val, pred_new_cat)
'f1 score - '+str(f1)

In [None]:
test_pred_cat = cb_model.predict(test_df).astype('int')

In [None]:
submission = pd.DataFrame({
        "EmployeeNo": test["EmployeeNo"],
        "Promoted_or_Not": test_pred_cat
    })

submission.to_csv('CAT_1.csv', index=False)

In [None]:
submission.Promoted_or_Not.value_counts()

<div class="alert alert-block alert-info">
    <center>
<b>The Model below Achieved 0.9502 on the Public Leaderboard and 0.9411 on the Private Leaderboard</b>
    </center>
</div>

THIS IS THE PREFERRED MODEL, AND I RECOMMEND IT OVER THE LAST MODEL FURTHER BELOW, WHICH IS THE ONE I ENDED UP WITH ON THE PRIVATE LEADERBOARD.

In [None]:
cb_model_2 = CatBoostClassifier(iterations=1000,
                             learning_rate=0.11,
                             depth=10,
                             eval_metric='F1',
                             random_seed = 42,
                             bagging_temperature = 0.25,
                             od_type='Iter',l2_leaf_reg = 4,random_strength = 0.3,
                             metric_period = 50,
                             od_wait=20)

In [None]:
cb_model_2.fit(X_train, y_train)

In [None]:
pred_new_ca = cb_model_2.predict(X_val)

In [None]:
f1 = f1_score(y_val, pred_new_ca)
'f1 score - '+str(f1)

In [None]:
confusion_matrix(y_val, pred_new_ca)

In [None]:
test_pred_cat_2 = cb_model_2.predict(test_df).astype('int')

In [None]:
submission = pd.DataFrame({
        "EmployeeNo": test["EmployeeNo"],
        "Promoted_or_Not": test_pred_cat_2
    })

submission.to_csv('CAT_3.csv', index=False)

<div class="alert alert-block alert-info">
    <center>
<b>THE LAST CATBOOST MODEL</b>
    </center>
</div>

**This Model achieved 0.94664 on the Public Leaderboard and 0.940680 on the Private Leaderboard**

In [None]:
cb_model_3 = CatBoostClassifier(iterations=831,
                             learning_rate=0.19,
                             depth=8,
                             eval_metric='F1',
                             random_seed = 99,
                             bagging_temperature = 0.2,
                             od_type='Iter',l2_leaf_reg = 4,random_strength = 0.25,
                             metric_period = 60,
                             od_wait=23)

In [None]:
cb_model_3.fit(X_train, y_train)

In [None]:
pred_new_c = cb_model_3.predict(X_val)

In [None]:
f1 = f1_score(y_val, pred_new_c)
'f1 score - '+str(f1)

In [None]:
confusion_matrix(y_val, pred_new_c)

In [None]:
pred_new_c1 = cb_model_3.predict(X_val_test)

In [None]:
f1 = f1_score(y_val_test, pred_new_c1)
'f1 score - '+str(f1)

In [None]:
confusion_matrix(y_val_test, pred_new_c1)

In [None]:
test_pred_cat_3 = cb_model_3.predict(test_df).astype('int')

In [None]:
submission = pd.DataFrame({
        "EmployeeNo": test["EmployeeNo"],
        "Promoted_or_Not": test_pred_cat_3
    })

submission.to_csv('CAT_4.csv', index=False)
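
As a quick sanity check on the recommendation, the sketch below compares the two CatBoost models whose public and private leaderboard scores are both quoted in this notebook. The dictionary layout is just an assumption for the example; the scores are the ones stated above.

```python
# Leaderboard scores as quoted in this notebook.
leaderboard = {
    "cb_model_2": {"public": 0.9502, "private": 0.9411},
    "cb_model_3": {"public": 0.94664, "private": 0.940680},
}

for name, s in leaderboard.items():
    gap = s["public"] - s["private"]  # drop from public to private leaderboard
    print(f"{name}: public={s['public']:.5f} private={s['private']:.5f} gap={gap:.5f}")

# The private leaderboard is the closest thing to unseen data, so rank on it.
best_model = max(leaderboard, key=lambda name: leaderboard[name]["private"])
print(best_model)  # cb_model_2
```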

<div class="alert alert-block alert-info">
    <center>
<b>I STRONGLY BELIEVE THAT WITH MORE DATA, THE PREFERRED MODEL WILL PERFORM BETTER</b>
    </center>
</div>

<div class="alert alert-block alert-success">
    <center>
<b> THANKS FOR THE OPPORTUNITY TO LEARN AND BECOME BETTER WITH MORE PRACTICE</b>
    </center>
</div>

I must confess that this competition tortured me more than I tortured the data. I look forward to meeting you all at the Boot Camp.
Thanks.