# Problem

Abdullah Baba Yakub, 38, is the heir apparent to the highly revered Yakub business dynasty. The enterprise has spanned decades, with vast investment interests across the various sectors of the economy.

Abdullah has worked for 16 years in Europe and America after his first and second degrees at Harvard University, where he studied Engineering and Business Management. He is a very experienced technocrat and a global business leader who rose through the ranks to become a Senior Vice President at a leading US business conglomerate.
His dad is now 70 and has invited him to take over the company with a mandate to take it to the next level of growth as a sustainable legacy. Abdullah is trusted by his father and his siblings to lead this mandate.

On resumption, he held an open house with the staff to share his vision and to listen to their ideas on how to take the business to the next level. Beyond the general operational issues and the increasing need for regulatory compliance, one of the issues raised by the staff was a general concern about the staff promotion process. Many of the staff allege that it is skewed and biased. Abdullah understood the concern and promised to address it in a most scientific way.

You have been called in by Abdullah to use your machine learning skills to study the pattern of promotion. With this insight, he can understand the important features among available features that can be used to predict promotion eligibility.



## Business Understanding and Analytical Approach

This is a binary classification problem: the features provided in the dataset are used to predict whether an employee should be promoted or not.

## Feature Engineering

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn import preprocessing


In [None]:
full_data = pd.read_csv('data/train.csv')

In [None]:
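def find_missing_values():
    """Print, per column, how many values are missing (True) or present (False)."""
    missing_values = full_data.isnull()
    for feature in full_data.columns:
        print(feature, '\n', missing_values[feature].value_counts(), '\n')


def encode_feature(ft):
    """Label-encode each column named in `ft` in place and return the encoded columns."""
    encoded_ft = []
    for instance in ft:
        # A fresh LabelEncoder per column: codes are not comparable across columns.
        full_data[instance] = preprocessing.LabelEncoder().fit_transform(full_data[instance])
        encoded_ft.append(full_data[instance])
    return encoded_ft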

Scrolling through the data reveals some missing values, represented by NaN.

In [None]:
find_missing_values()

The missing values exist only in the Qualification feature and make up about 30.2% of that column. To decide how to handle them, let's look at how significant this feature is to the target we are trying to predict. First we need to encode it as a numerical value so that we can check its correlation.
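As a rough illustration of both steps, the sketch below measures the share of missing values per column and then label-encodes a categorical column to check its correlation with the target. The toy table and its category names (`degree`, `masters`, `unknown`) are made up for the example and are not taken from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the real data; all values are invented for illustration.
toy = pd.DataFrame({
    'Qualification': ['degree', np.nan, 'masters', 'degree', np.nan],
    'Promoted_or_Not': [0, 0, 1, 0, 1],
})

# Fraction of missing values in each column (mean of a boolean mask).
missing_share = toy.isnull().mean()
print(missing_share)

# Fill the gaps with a placeholder class, then map each class to an integer code.
filled = toy['Qualification'].fillna('unknown')
codes = LabelEncoder().fit_transform(filled)
toy['Qualification_code'] = codes

# Pearson correlation between the encoded feature and the target.
corr = toy['Qualification_code'].corr(toy['Promoted_or_Not'])
print(corr)
```

`LabelEncoder` assigns codes in sorted class order, so the numbers carry no real ranking; a correlation computed on them is only a coarse signal.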

In [None]:
qualification_class = full_data['Qualification'].unique() 
qualification_class  

In [None]:
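# Fill missing qualifications with an existing class (assumes qualification_class[1] is not NaN).
# Assigning the result back avoids relying on an in-place change to a column selection.
full_data['Qualification'] = full_data['Qualification'].fillna(qualification_class[1])
full_data.replace('More than 5', 5, inplace=True)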

In [None]:
MIN_YR_OF_BIRTH = full_data['Year_of_birth'].min()
MAX_YR_OF_BIRTH = full_data['Year_of_birth'].max()

MIN_YR_OF_REC = full_data['Year_of_recruitment'].min()
MAX_YR_OF_REC = full_data['Year_of_recruitment'].max()

In [None]:
bin_yr_of_birth = np.linspace(MIN_YR_OF_BIRTH, MAX_YR_OF_BIRTH, 4)

bin_yr_of_rec = np.linspace(MIN_YR_OF_REC, MAX_YR_OF_REC, 4)


In [None]:
yr_birth_labels = ['60s', '50s', 'less than 40']
yr_rec_labels = ['35 yrs', '25 yrs', 'less than 15']


full_data['Year_of_birth_binned'] = pd.cut(
                                    full_data['Year_of_birth'], bin_yr_of_birth, 
                                    labels=yr_birth_labels, include_lowest=True
                                )


full_data['Year_of_recruitment_binned'] = pd.cut(
                                            full_data['Year_of_recruitment'], 
                                            bin_yr_of_rec, labels=yr_rec_labels, 
                                            include_lowest=True
                                        )
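The binning above can be sketched on a handful of invented years: `np.linspace` with 4 points yields 4 edges, which `pd.cut` turns into 3 equal-width bins. The values and the label names below are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd

years = pd.Series([1950, 1960, 1975, 1990, 2001])  # invented values

# 4 evenly spaced edges between min and max -> 3 equal-width bins.
edges = np.linspace(years.min(), years.max(), 4)

# include_lowest=True keeps the minimum value inside the first bin.
binned = pd.cut(years, edges, labels=['oldest', 'middle', 'newest'],
                include_lowest=True)

print(edges)            # [1950. 1967. 1984. 2001.]
print(binned.tolist())  # ['oldest', 'oldest', 'middle', 'newest', 'newest']
```

Equal-width bins do not guarantee equal counts per bin; `pd.qcut` is the usual alternative when balanced bin sizes matter.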


Let's group some of the features and observe them.

In [None]:
fd_group = full_data[['Qualification','Targets_met','Last_performance_score']]
fd_group = fd_group.groupby(['Qualification'],as_index=False).mean()
fd_group

From the above table we can see that, on average, employees with a graduate degree get a higher performance score even though they meet slightly fewer targets.

In [None]:
fd_group2 = full_data[['Qualification','Last_performance_score','Promoted_or_Not']]
fd_group2 = fd_group2.groupby(['Qualification'],as_index=False).mean()
fd_group2

From this table we can see that the last performance score has very little to do with whether an employee is promoted. Employees with Non-University Education had lower performance scores than their counterparts, but they still have a good chance of being promoted based on the number of targets met.

## Data Viz

In [None]:
sns.boxplot(x="Year_of_birth_binned", y="Training_score_average", data=full_data)

In [None]:
sns.regplot(x="Targets_met", y="Promoted_or_Not", data=full_data)

In [None]:
encode_feature([ 'Year_of_birth_binned', 'Year_of_recruitment_binned', 'Gender', 
                'Channel_of_Recruitment', 'Foreign_schooled', 'Marital_Status',
                'Past_Disciplinary_Action','Previous_IntraDepartmental_Movement',
                'Division', 'Qualification','State_Of_Origin'
               ])

## Model Training

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score


In [None]:
x_data = full_data.drop(['EmployeeNo','Year_of_birth', 'Year_of_recruitment', 'Promoted_or_Not'], axis=1)

y_data = full_data['Promoted_or_Not']

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.2, random_state=4)

#### Logistic Regression

In [None]:
logistic_model = LogisticRegression().fit(x_train, y_train)
lg_pred = logistic_model.predict(x_test)


In [None]:
f1_score(y_test, lg_pred, average='weighted')

Let's also check the model's accuracy. For a classifier, `score` returns the mean accuracy on the data it is given, that is, the fraction of correct predictions; it is not an R-squared value and cannot be negative. A low accuracy on the held-out test set means the model does not generalise well to data it has not seen. If one class is much more common than the other, accuracy can look high even when the minority class is predicted poorly, which is why the F1 score above is reported as well.

In [None]:
logistic_model.score(x_test, y_test)
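A small check of what `score` measures, run on synthetic imbalanced data rather than the promotion dataset (the data, the sample size and the 90/10 class split are assumptions made for the illustration): it matches `accuracy_score` on the model's predictions, while the F1 score for the minority class is computed separately.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

# Synthetic, imbalanced data standing in for the real dataset.
X, y = make_classification(n_samples=300, n_features=5,
                           weights=[0.9, 0.1], random_state=0)

model = LogisticRegression(max_iter=1000).fit(X, y)
pred = model.predict(X)

# For classifiers, score() is the mean accuracy of predict() on the given data.
accuracy_from_score = model.score(X, y)
accuracy_explicit = accuracy_score(y, pred)

# F1 for the positive (minority) class; note the (y_true, y_pred) argument order.
minority_f1 = f1_score(y, pred, zero_division=0)

print(accuracy_from_score, accuracy_explicit, minority_f1)
```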

#### Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
# Fit on the training split only so that x_test stays unseen during training.
rf_model = RandomForestClassifier(n_estimators=245).fit(x_train, y_train)

In [None]:
rf_pred = rf_model.predict(x_test)

In [None]:
f1_score(y_test, rf_pred, average='weighted')

In [None]:
for feature, importance in zip(x_data.columns, rf_model.feature_importances_):
    print(feature, ' ', importance)
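The importances are easier to read when sorted. A minimal sketch, using synthetic data and made-up column names (`f0` to `f3`) in place of the real features, wraps them in a pandas Series indexed by feature name:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with hypothetical feature names, for illustration only.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_df = pd.DataFrame(X, columns=['f0', 'f1', 'f2', 'f3'])

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_df, y)

# Pair each importance with its column name and sort from most to least important.
importances = pd.Series(forest.feature_importances_, index=X_df.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

These impurity-based importances sum to 1 and tend to favour features with many distinct values, so they are best read as a ranking hint rather than a definitive measure.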