Installation steps for Windows users

1) In Anaconda Prompt: conda install rpy2

2) In Command Prompt: pip install tzlocal

3) Copy tzlocal folder to Anaconda3\Lib\site-packages
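Before running the notebook, it can help to confirm that both packages are visible to the Python interpreter that Jupyter uses. The sketch below relies only on the standard library. The helper name `check_packages` is made up for illustration, and it only tests whether a module can be located, not whether R itself is set up correctly.

```python
import importlib.util

def check_packages(names):
    """Return {module_name: True/False} depending on whether it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

# rpy2 and tzlocal are the two packages installed in the steps above
status = check_packages(["rpy2", "tzlocal"])
for name, found in status.items():
    print(name, "found" if found else "MISSING")
```

If either package is reported as missing, the install steps were probably run against a different Python environment than the one the notebook is using.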

In [24]:
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np
import math
from sklearn.metrics import f1_score,confusion_matrix
import rpy2.robjects as robjects
from rpy2.robjects import pandas2ri
import rpy2.robjects.packages as rpackages
from rpy2.robjects.vectors import StrVector
r=robjects.r

In [2]:
df = pd.read_csv('binary_classification.csv',index_col='employee_id')

In [3]:
df.head(3)

employee_id,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted
65438,Sales & Marketing,region_7,Master's & above,f,sourcing,1,35,5.0,8,1,0,49,0
65141,Operations,region_22,Bachelor's,m,other,1,30,5.0,4,0,0,60,0
7513,Sales & Marketing,region_19,Bachelor's,m,sourcing,1,34,3.0,7,0,0,50,0


There are 12 features and 1 binary target variable.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 54808 entries, 65438 to 51526
Data columns (total 13 columns):
department              54808 non-null object
region                  54808 non-null object
education               52399 non-null object
gender                  54808 non-null object
recruitment_channel     54808 non-null object
no_of_trainings         54808 non-null int64
age                     54808 non-null int64
previous_year_rating    50684 non-null float64
length_of_service       54808 non-null int64
KPIs_met >80%           54808 non-null int64
awards_won?             54808 non-null int64
avg_training_score      54808 non-null int64
is_promoted             54808 non-null int64
dtypes: float64(1), int64(7), object(5)
memory usage: 5.9+ MB


The output shows that [education] and [previous_year_rating] have missing values. I decided to label the missing values in [education] as 'No Education Info'. For [previous_year_rating], my hypothesis is that the value is missing because that particular employee's length of service is 1.

In [5]:
#Check the unique values of [length_of_service] where previous_year_rating is nan
print(df[df.previous_year_rating.isna()].length_of_service.drop_duplicates().values.item())

1


The only value is 1, so the data are consistent with the hypothesis. This means we cannot simply apply mean/median imputation to the [previous_year_rating] feature.

In [6]:
#Check the unique values of previous_year_rating
print(df.previous_year_rating.drop_duplicates().values)

[ 5.  3.  1.  4. nan  2.]


The [previous_year_rating] values are all whole numbers. So, rather than mean/median imputation, I decided to convert this feature into a categorical feature and label the missing values as 'no_prev_year_rating'.

In [3]:
#Rename the columns
df=df.rename(columns={"KPIs_met >80%": "KPIs_met_80","no_of_trainings": "previous_year_no_of_trainings",
                      "awards_won?":"previous_year_awards_won"})
#Imputing missing values and transform
df.education=df.education.fillna('No Education Info')
df.previous_year_rating=df.previous_year_rating.fillna('no_prev_year_rating')
df.previous_year_rating=df.previous_year_rating.replace(1,'1_prev_year_rating')
df.previous_year_rating=df.previous_year_rating.replace(2,'2_prev_year_rating')
df.previous_year_rating=df.previous_year_rating.replace(3,'3_prev_year_rating')
df.previous_year_rating=df.previous_year_rating.replace(4,'4_prev_year_rating')
df.previous_year_rating=df.previous_year_rating.replace(5,'5_prev_year_rating')
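The five `replace` calls can also be written as a single mapping. The sketch below uses a small made-up frame rather than the real data, and the `labels` dictionary is my own shorthand. It is meant to produce the same labels as the cell above, not to replace it.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real column (values are made up)
toy = pd.DataFrame({"previous_year_rating": [5.0, 3.0, np.nan, 1.0]})

# Build {1.0: '1_prev_year_rating', ..., 5.0: '5_prev_year_rating'}
labels = {float(k): f"{k}_prev_year_rating" for k in range(1, 6)}

# map() turns ratings into labels; NaN stays NaN and is then filled
toy["previous_year_rating"] = (
    toy["previous_year_rating"].map(labels).fillna("no_prev_year_rating")
)
print(toy["previous_year_rating"].tolist())
```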

In [4]:
df.head(3)

employee_id,department,region,education,gender,recruitment_channel,previous_year_no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met_80,previous_year_awards_won,avg_training_score,is_promoted
65438,Sales & Marketing,region_7,Master's & above,f,sourcing,1,35,5_prev_year_rating,8,1,0,49,0
65141,Operations,region_22,Bachelor's,m,other,1,30,5_prev_year_rating,4,0,0,60,0
7513,Sales & Marketing,region_19,Bachelor's,m,sourcing,1,34,3_prev_year_rating,7,0,0,50,0


In [5]:
#Check the proportion of the target variable
print('is_promoted=1 : ',len(df[df.is_promoted==1]))
print('is_promoted=0 : ',len(df[df.is_promoted==0]))

is_promoted=1 :  4668
is_promoted=0 :  50140


The target variable is imbalanced, with a ratio of roughly 1:10. So, we have to be careful when splitting the dataset into training and test sets.
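To make the imbalance concrete, the two counts printed above can be turned into a share and a ratio. This is plain arithmetic on those counts; the variable names are just for illustration.

```python
# Counts taken from the output above
n_promoted = 4668
n_not_promoted = 50140

total = n_promoted + n_not_promoted
positive_share = n_promoted / total   # fraction of promoted employees
ratio = n_not_promoted / n_promoted   # negatives per positive

print(f"total rows      : {total}")
print(f"promoted share  : {positive_share:.3f}")
print(f"negatives per 1 : {ratio:.1f}")
```

Only about 8.5% of the rows are promotions, so a random split without stratification could easily shift that share between the training and test sets.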

In [4]:
#Train/test split that keeps the same proportion of the target variable in both sets
X=df.drop('is_promoted',axis=1)
y=df[['is_promoted']]
split_0=list(np.random.choice(y[y.is_promoted==0].index.tolist(),math.floor(0.7*len(y[y.is_promoted==0])),replace=False))
split_1=list(np.random.choice(y[y.is_promoted==1].index.tolist(),math.floor(0.7*len(y[y.is_promoted==1])),replace=False))
train_split=split_0+split_1 
test_split=list(set(X.index)-(set(train_split)))
X_Train, X_Test, y_Train, y_Test = X.loc[train_split],X.loc[test_split],y.loc[train_split],y.loc[test_split]
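As an alternative to the manual split above, scikit-learn's `train_test_split` can keep the class proportions through its `stratify` argument. The sketch below runs on a made-up toy frame, and the `random_state` value is my own assumption to make the result repeatable. It is not the split used for the results in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 10% positives (made up for illustration)
toy = pd.DataFrame({
    "feature": np.arange(100),
    "is_promoted": [1] * 10 + [0] * 90,
})

X_toy = toy.drop(columns="is_promoted")
y_toy = toy["is_promoted"]

# stratify=y_toy keeps the class proportions the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=0
)

print("train positives:", int(y_tr.sum()), "of", len(y_tr))
print("test positives :", int(y_te.sum()), "of", len(y_te))
```

Both parts keep the 10% positive share of the toy data, which is the same property the manual 70/30 split is aiming for.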

This tutorial focuses on using RPy2 as the bridge between Python and R to do classification with gbm. So, we jump straight to the modeling process without any feature engineering.

In [6]:
#Input your desired R packages
R_packages=['gbm','randomForest','varImp']

#Install the desired R packages
utils=rpackages.importr('utils')
utils.chooseCRANmirror(ind=1)
for package in R_packages:
    if not(rpackages.isinstalled(package)):
        utils.install_packages(package)

In [8]:
gbm=rpackages.importr('gbm')

In [9]:
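#Prepare the data to be passed into the R global environment
#(copy first so that X_Train and X_Test are not modified in place)
df_Train=X_Train.copy()
df_Train['is_promoted']=y_Train
df_Test=X_Test.copy()
df_Test['is_promoted']=y_Test

#Pass the data into the R global environment
#Note: pandas2ri.py2ri is the rpy2 2.x name; rpy2 3.x renamed it to py2rpy
robjects.globalenv['r_df_Train']=pandas2ri.py2ri(df_Train)
robjects.globalenv['r_df_Test']=pandas2ri.py2ri(df_Test)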
#Running in R environment
r('''
factor_cols<-c("department","region","education","gender","recruitment_channel","previous_year_rating","KPIs_met_80","previous_year_awards_won")
numeric_cols<-c("previous_year_no_of_trainings","length_of_service","age","avg_training_score","is_promoted")
r_df_Train[factor_cols] <- lapply(r_df_Train[factor_cols], factor)
r_df_Train[numeric_cols] <- lapply(r_df_Train[numeric_cols], as.integer)
r_df_Test[factor_cols] <- lapply(r_df_Test[factor_cols], factor)
r_df_Test[numeric_cols] <- lapply(r_df_Test[numeric_cols], as.integer)
''')

#Save the data into python environment
r_df_Train=r['r_df_Train']
r_df_Test=r['r_df_Test']

In [11]:
#Check the result on the command prompt
r('print(str(r_df_Train))')

rpy2.rinterface.NULL

![image.png](attachment:image.png)

In [13]:
#The Jupyter notebook shows no progress indicator while R is running. Progress is printed in the command prompt instead, so it is wise to set the verbose argument to TRUE
r('model_gbm<-gbm(formula=is_promoted~.,distribution = "bernoulli",data=r_df_Train, n.trees = 5000, shrinkage = 0.01, interaction.depth = 6,verbose=TRUE,cv.folds=2)')

initF: FloatVector with 1 elements.  -2.374272
fit: FloatVector with 38365 elements.  -4.403145  -5.128789  -6.219459  -5.998774  ...  7.038423  -0.976102  -1.069882  4.296664
train.error: FloatVector with 5000 elements.  0.578792  0.575662  0.572454  0.569535  ...  0.290137  0.290131  0.290126  0.290118
...
call: Vector with 9 elements.  RNULLType  Vector  Vector  Vector  ...  RNULLType  Vector  Vector  Vector
m: Vector with 5 elements.  RObject  Vector  RObject  BoolVec...  Signatu...

0,1,2,3,4,5,6,7,8
-4.228785,-5.103907,-5.802237,-5.578418,...,6.057434,-1.710251,-0.996575,3.702395


In [15]:
#Find the optimum number of trees. A deviance vs. iteration plot opens in a separate window.
r('''
ntree_opt_cv<-gbm.perf(model_gbm,method='cv')
print(ntree_opt_cv)
''')

0
3157
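Conceptually, with `method='cv'` the optimum is the iteration where the cross-validated error is lowest. The sketch below illustrates that idea in plain Python on a made-up error curve. The numbers are invented and this is not gbm's own code.

```python
# Made-up cross-validated deviance per boosting iteration
cv_error = [0.58, 0.45, 0.36, 0.31, 0.30, 0.32, 0.35]

# Iterations are counted from 1, list indices from 0
best_iter = min(range(len(cv_error)), key=cv_error.__getitem__) + 1
print(best_iter)
```

Past that point the training error keeps falling while the cross-validated error rises again, which is why the model is refit with fewer trees in the next cell.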


In [18]:
#Using the optimum number of trees
r('model_gbm<-gbm(formula=is_promoted~.,distribution = "bernoulli",data=r_df_Train, n.trees = ntree_opt_cv, shrinkage = 0.01, interaction.depth = 6,verbose=TRUE,cv.folds=2)')

initF: FloatVector with 1 elements.  -2.374272
fit: FloatVector with 38365 elements.  -4.315364  -4.780214  -5.952584  -5.648406  ...  6.352226  -1.193966  -1.266967  3.941266
train.error: FloatVector with 3157 elements.  0.578794  0.575785  0.572640  0.569897  ...  0.304887  0.304881  0.304871  0.304863
...
call: Vector with 9 elements.  RNULLType  Vector  Vector  Vector  ...  RNULLType  Vector  Vector  Vector
m: Vector with 5 elements.  RObject  Vector  RObject  BoolVec...  Signatu...

0,1,2,3,4,5,6,7,8
-4.452693,-4.575842,-5.895422,-5.443086,...,5.637873,-1.227329,-1.38024,3.149791


In [19]:
r('print(model_gbm)')

rpy2.rinterface.NULL

![image.png](attachment:image.png)

In [22]:
#Predict with the trained model
r(' y_pred <- predict(object=model_gbm,newdata=r_df_Test, n.trees=model_gbm$n.trees,type="response")')

0,1,2,3,4,5,6,7,8
0.009257,0.070456,0.00933,0.159489,...,0.002634,0.07693,0.416144,0.002936


In [33]:
#Labeling the prediction output
y_pred=r['y_pred']
y_pred=pd.DataFrame(np.asarray(y_pred).tolist())
y_pred=list(map(lambda x: 1 if x>0.5 else 0,y_pred[0]))
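The same labeling step can be written as a vectorized comparison in NumPy. The sketch below uses the eight probabilities visible in the prediction output above rather than the full vector, and the second cut-off of 0.1 is purely illustrative, not a recommended value.

```python
import numpy as np

# Probabilities copied from the prediction output shown above
probs = np.array([0.009257, 0.070456, 0.00933, 0.159489,
                  0.002634, 0.07693, 0.416144, 0.002936])

threshold = 0.5  # same cut-off as in the cell above
labels = (probs > threshold).astype(int)
print(labels.tolist())

# A lower, purely illustrative cut-off flags more rows as promoted
lower_labels = (probs > 0.1).astype(int)
print(lower_labels.tolist())
```

None of these eight rows crosses 0.5, while two of them cross 0.1, so the choice of cut-off directly controls how many promotions the model predicts.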

In [34]:
#Check the F1 Score and confusion matrix
print('F1 Score: ',f1_score(y_Test, y_pred, average='binary'))
print('')
tn, fp, fn, tp = confusion_matrix(y_Test,y_pred).ravel()
print('True Negative: ',tn,' False Positive: ',fp,' False Negative: ',fn,' True Positive: ',tp)
print('')
print(confusion_matrix(y_Test,y_pred))

F1 Score:  0.4997386304234187

True Negative:  15008  False Positive:  34  False Negative:  923  True Positive:  478

[[15008    34]
 [  923   478]]


Using all of the features as predictors, without generating any new features, we got an F1 score of about 0.50 with GBM from the R package!
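To see where that score comes from, the F1 score can be recomputed by hand from the confusion matrix counts above. This is plain arithmetic with the standard precision and recall definitions.

```python
# Counts from the confusion matrix above
tn, fp, fn, tp = 15008, 34, 923, 478

precision = tp / (tp + fp)   # how many predicted promotions were right
recall = tp / (tp + fn)      # how many real promotions were found
f1 = 2 * precision * recall / (precision + recall)

print(f"precision: {precision:.3f}")
print(f"recall   : {recall:.3f}")
print(f"F1       : {f1:.4f}")
```

Precision is high (about 0.93) but recall is low (about 0.34), so at the 0.5 cut-off the model is rarely wrong when it predicts a promotion but misses most of the actual promotions.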