![Logo](img/datasciencelab.png)

# DBM16 - HR Competition
#### Alexander Kopp & Leonhard Kühne-Hellmessen

1. Business Understanding
2. Data Understanding
3. Data Preparation  
**4. Modeling** 
5. Evaluation
6. Deployment

![modelling](img/modeling.png)

# Modeling

## 1. Initialize

### 1.1 Import

In [25]:
# Load standard libraries
import numpy as np
import scipy.stats as stats
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [26]:
# Import cleaned data from EDA
df = pd.read_pickle('data/hr_train_clean.pkl')

### 1.2 Feature selection

In [3]:
# Transforming salary into integer (salary_int)
def varianten_salary(value):
    if value == "low" :
        return 1
    elif value == "medium":
        return 2
    else:
        return 3

df['salary_int'] = df.apply(lambda row: varianten_salary(row['salary']), axis=1)
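
The same ordinal encoding can also be written as a lookup table. The sketch below is a minimal alternative, not the notebook's original code: the names `SALARY_LEVELS` and `encode_salary` are hypothetical, and the level `"high"` is an assumption, since the function above only names `"low"` and `"medium"` and sends everything else to 3. Unlike that `else` branch, the lookup fails on an unexpected value instead of silently mapping it to 3.

```python
# Hypothetical lookup table; "high" is assumed to be the third salary level.
SALARY_LEVELS = {"low": 1, "medium": 2, "high": 3}

def encode_salary(value):
    """Return the ordinal code for a salary level, failing loudly on unknown input."""
    try:
        return SALARY_LEVELS[value]
    except KeyError:
        raise ValueError(f"unexpected salary level: {value!r}")

encoded = [encode_salary(v) for v in ["low", "medium", "high"]]
print(encoded)
```

With pandas, the same dictionary can be passed to `Series.map`, for example `df['salary'].map(SALARY_LEVELS)`, which yields `NaN` for values that are not in the table.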

In [4]:
# Remove unnecessary columns
df = df.drop(['id','department', 'work_accident', 'promotion_last_5years', 'salary'], axis=1)

### 1.3 Data preparation - feature engineering

In [5]:
# One-hot encode any remaining categorical columns and re-attach the target column 'left'
df_num = pd.get_dummies(df.drop('left', axis=1)).join(df[['left']])

In [6]:
df.head()

| |satisfaction_level|last_evaluation|number_project|average_monthly_hours|time_spend_company|left|salary_int|
|:--|:--|:--|:--|:--|:--|:--|:--|
|0|0.65|0.96|5|226|2|0|2|
|1|0.88|0.8|3|166|2|0|1|
|2|0.69|0.98|3|214|2|0|1|
|3|0.41|0.47|2|154|3|1|1|
|4|0.87|0.76|5|254|2|0|1|


## 2. Train Models with sklearn

### 2.1 Train/Test Split
We use the `train_test_split` function to split the data. The argument `test_size=0.25` sets the fraction of the data that is held out for testing:

In [7]:
from sklearn.model_selection import train_test_split, cross_val_score

In [8]:
# Set target vector y:
y = df_num['left'].values

# Set class label:
class_names = np.unique(y)

# Set characteristic names:
feature_names = np.array(['satisfaction_level'
    , 'last_evaluation'
    , 'number_project'
    , 'average_monthly_hours'
    , 'time_spend_company'
    , 'salary_int'
                         ])
# Create feature matrix:
X = df_num[feature_names].values

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(7500, 6) (7500,)
(2500, 6) (2500,)


### 2.2 Decision Trees
A decision tree is a schematic, tree-shaped diagram used to determine a course of action or show a statistical probability. Each branch of the decision tree represents a possible decision, occurrence or reaction. The tree is structured to show how and why one choice may lead to the next, and the branches indicate that the options are mutually exclusive.`[1]`
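
To make the splitting idea concrete: with `criterion='gini'`, a candidate split is judged by how much it lowers the size-weighted Gini impurity of the child nodes. The following is a minimal sketch of that calculation on made-up labels; the helper names `gini` and `weighted_gini` are hypothetical and not part of scikit-learn.

```python
from collections import Counter

def gini(labels):
    """Gini impurity 1 - sum(p_k^2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Made-up labels: 1 = employee left, 0 = employee stayed
parent = [0, 0, 0, 1, 1, 1]
good_split = ([0, 0, 0], [1, 1, 1])   # separates the classes completely
poor_split = ([0, 1, 0], [1, 0, 1])   # both children stay mixed

print(gini(parent), weighted_gini(*good_split), weighted_gini(*poor_split))
```

A tree builder would prefer `good_split`, because its weighted impurity drops to zero while `poor_split` barely improves on the parent node.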

**model1** -  Using `DecisionTreeClassifier`, which is a class capable of performing multi-class classification on a dataset:

In [10]:
from sklearn.tree import DecisionTreeClassifier

In [11]:
model1 = DecisionTreeClassifier()

In [12]:
# Estimating the accuracy with cross-validation: cross_val_score
results = cross_val_score(model1, X_train, y_train, cv=5)
np.mean(results)

0.97173350103711154
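
For reference, `cv=5` means the training data is cut into five folds, and each fold serves once as the validation set while the other four are used for fitting. The sketch below shows only that index bookkeeping in plain Python. `k_fold_indices` is a hypothetical helper, and it is a simplification: it ignores the class stratification that scikit-learn applies to classifiers when `cv` is an integer.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train_idx, val_idx) lists for contiguous, unshuffled folds."""
    # Spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

# 7500 training rows as in the split above, five folds
folds = list(k_fold_indices(7500, 5))
for train_idx, val_idx in folds:
    print(len(train_idx), len(val_idx))
```

The reported score is then the mean of the five validation accuracies, which is what `np.mean(results)` computes.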

In [13]:
# Training model1
model1.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [14]:
# Score on the test dataset
model1.score(X_test, y_test)

0.97399999999999998

### 2.3 Ensemble Methods

The goal of ensemble methods is to combine the predictions of several base estimators built with a given learning algorithm in order to improve generalizability / robustness over a single estimator.`[2]`

### 2.3.1 Bagging Decision Trees

Bagging performs best with algorithms that have high variance. A popular example is the decision tree, often constructed without pruning.
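
Each base estimator in bagging is trained on a bootstrap sample, i.e. rows drawn with replacement from the training set. A minimal sketch of that resampling step in plain Python follows; the function name `bootstrap_indices` and the seed are illustrative assumptions, not scikit-learn API.

```python
import random

def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

rng = random.Random(0)      # fixed seed only for reproducibility
n_samples = 7500            # size of the training split used in this notebook
sample = bootstrap_indices(n_samples, rng)

# Because of the replacement, some rows repeat and others are left out
unique_fraction = len(set(sample)) / n_samples
print(f"{unique_fraction:.3f} of the rows appear at least once")
```

On average about 63% of the rows (1 - 1/e) end up in a given bootstrap sample, so every tree sees a slightly different dataset. Averaging over these differing trees is what reduces the variance.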

**model2** - Using `BaggingClassifier` with **model1** as base estimator and 100 base estimators (`n_estimators=100`):

In [15]:
from sklearn.ensemble import BaggingClassifier

In [16]:
model2 = BaggingClassifier(base_estimator=model1, n_estimators=100)

In [17]:
results = cross_val_score(model2, X_train, y_train, cv=5)
np.mean(results)

0.98253279371827862

In [18]:
model2.fit(X_train, y_train)

BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best'),
         bootstrap=True, bootstrap_features=False, max_features=1.0,
         max_samples=1.0, n_estimators=100, n_jobs=1, oob_score=False,
         random_state=None, verbose=0, warm_start=False)

In [19]:
model2.score(X_test, y_test)

0.98480000000000001

### 2.3.2 Random Forests
In random forests, each tree in the ensemble is built from a sample drawn with replacement (i.e., a bootstrap sample) from the training set. In addition, when splitting a node during the construction of the tree, the split that is chosen is no longer the best split among all features. Instead, the split that is picked is the best split among a random subset of the features. As a result of this randomness, the bias of the forest usually slightly increases (with respect to the bias of a single non-random tree) but, due to averaging, its variance also decreases, usually more than compensating for the increase in bias, hence yielding an overall better model. `[3]`
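
The per-split feature sampling can be illustrated without scikit-learn. The sketch below assumes the classifier default of considering about sqrt(n_features) candidates per split (`max_features='auto'` in the scikit-learn version used here), which gives 2 of the 6 features in this notebook. `candidate_features` is a hypothetical helper, not library code.

```python
import math
import random

FEATURES = ['satisfaction_level', 'last_evaluation', 'number_project',
            'average_monthly_hours', 'time_spend_company', 'salary_int']

def candidate_features(features, rng):
    """Pick the random subset of features examined at one node split."""
    k = max(1, int(math.sqrt(len(features))))   # sqrt rule, at least one feature
    return rng.sample(features, k)

rng = random.Random(42)     # fixed seed only for reproducibility
for node in range(3):
    # Every node draws its own subset, so different trees split on different features
    print(node, candidate_features(FEATURES, rng))
```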

**model3** - Using `RandomForestClassifier` with no specific parameters:

In [20]:
from sklearn.ensemble import RandomForestClassifier

In [21]:
model3 = RandomForestClassifier()

In [22]:
results = cross_val_score(model3, X_train, y_train, cv=5)
np.mean(results)

0.98293279395531585

In [23]:
model3.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [24]:
model3.score(X_test, y_test)

0.98280000000000001

**model4** - Using `RandomForestClassifier` with 100 trees (`n_estimators=100`):

In [45]:
model4 = RandomForestClassifier(n_estimators=100)

In [46]:
results = cross_val_score(model4, X_train, y_train, cv=5)
np.mean(results)

0.98439990571847658

**model5** - Using `RandomForestClassifier` with 250 trees (`n_estimators=250`), evaluated with 10-fold cross-validation:

In [41]:
model5 = RandomForestClassifier(n_estimators=250)

In [42]:
results = cross_val_score(model5, X_train, y_train, cv=10)
np.mean(results)

0.98666523448634269

In [43]:
model5.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=250, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [44]:
model5.score(X_test, y_test)

0.98599999999999999

### 2.3.3 Extra Trees
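Extra Trees (extremely randomized trees) are a related tree ensemble: like random forests they consider a random subset of features at each split, but the split thresholds are drawn at random instead of being optimized. By default, scikit-learn builds each tree from the whole training set rather than a bootstrap sample (`bootstrap=False`).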

**model6** - Using `ExtraTreesClassifier` with 100 trees and all 6 features considered when looking for the best split (`max_features=6`):

In [47]:
from sklearn.ensemble import ExtraTreesClassifier

In [48]:
model6 = ExtraTreesClassifier(n_estimators=100, max_features=6)

In [49]:
results = cross_val_score(model6, X_train, y_train, cv=5)
np.mean(results)

0.98293306068136044

In [50]:
model6.fit(X_train, y_train)

ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
           max_depth=None, max_features=6, max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

In [51]:
model6.score(X_test, y_test)

0.98440000000000005

### 2.3.4 AdaBoost
The core principle of AdaBoost is to fit a sequence of weak learners (i.e., models that are only slightly better than random guessing, such as small decision trees) on repeatedly modified versions of the data. The predictions from all of them are then combined through a weighted majority vote (or sum) to produce the final prediction.`[4]`
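
The "repeatedly modified versions of the data" are sample weights that change after every round. The sketch below shows one round of the textbook discrete AdaBoost update for two classes on made-up predictions. It is an illustration under that assumption, not the `SAMME.R` variant that scikit-learn uses by default, and `adaboost_round` is a hypothetical helper.

```python
import math

def adaboost_round(y_true, y_pred, weights):
    """One round of discrete AdaBoost: return (alpha, renormalized weights)."""
    # Weighted error of the weak learner under the current sample weights
    err = sum(w for yt, yp, w in zip(y_true, y_pred, weights) if yt != yp)
    err = min(max(err, 1e-10), 1 - 1e-10)        # avoid log(0) and division by zero
    alpha = 0.5 * math.log((1 - err) / err)      # vote weight of this learner
    # Up-weight misclassified samples, down-weight correctly classified ones
    updated = [w * math.exp(alpha if yt != yp else -alpha)
               for yt, yp, w in zip(y_true, y_pred, weights)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1]          # the weak learner misses the second sample
weights = [1 / len(y_true)] * len(y_true)
alpha, new_weights = adaboost_round(y_true, y_pred, weights)
print(round(alpha, 3), [round(w, 3) for w in new_weights])
```

In this toy round the single misclassified sample ends up with half of the total weight, so the next weak learner is pushed to classify it correctly.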

**model7** - Using `AdaBoostClassifier` with **model4** as base estimator:

In [52]:
from sklearn.ensemble import AdaBoostClassifier

In [53]:
model7 = AdaBoostClassifier(base_estimator=model4)

In [54]:
results = cross_val_score(model7, X_train, y_train)
np.mean(results)

0.98226671810134158

In [55]:
model7.fit(X_train, y_train)

AdaBoostClassifier(algorithm='SAMME.R',
          base_estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
          learning_rate=1.0, n_estimators=50, random_state=None)

In [56]:
model7.score(X_test, y_test)

0.98560000000000003

### 2.3.5 GradientBoost
Gradient Tree Boosting or Gradient Boosted Regression Trees (GBRT) is a generalization of boosting to arbitrary differentiable loss functions. GBRT is an accurate and effective off-the-shelf procedure that can be used for both regression and classification problems. Gradient Tree Boosting models are used in a variety of areas including Web search ranking and ecology.`[5]`
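
The stage-wise idea behind gradient boosting can be sketched in a few lines. The example below is a simplified illustration, not scikit-learn's implementation: it assumes squared loss on a made-up one-dimensional regression problem (the classifier in this notebook uses the deviance loss instead), and `fit_stump`, `gradient_boost` and `mse` are hypothetical helpers.

```python
def fit_stump(x, residuals):
    """Best single-threshold regression stump for squared error on 1-D data."""
    best = None
    for threshold in sorted(set(x))[:-1]:        # keep both sides non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean

def gradient_boost(x, y, n_stages=20, learning_rate=0.5):
    """Stage-wise additive model; returns the predictions on the training data."""
    pred = [sum(y) / len(y)] * len(y)            # stage 0: constant model
    for _ in range(n_stages):
        # For squared loss the negative gradient is simply the residual
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        pred = [pi + learning_rate * stump(xi) for xi, pi in zip(x, pred)]
    return pred

def mse(y, pred):
    return sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y)

x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.2, 0.9, 3.1, 3.0, 2.8]
baseline = [sum(y) / len(y)] * len(y)
boosted = gradient_boost(x, y)
print(mse(y, baseline), mse(y, boosted))
```

Each stage fits a small tree to what the current model still gets wrong and adds a damped version of it, which is the role `learning_rate` plays in `GradientBoostingClassifier`.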

**model8** - Using `GradientBoostingClassifier` with 100 boosting stages (`n_estimators=100`), a learning rate of 0.5 and a `max_depth` of 10:

In [57]:
from sklearn.ensemble import GradientBoostingClassifier

In [58]:
model8 = GradientBoostingClassifier(n_estimators=100, learning_rate=0.5,
                                    max_depth=10)

In [59]:
results = cross_val_score(model8, X_train, y_train, cv=5)
np.mean(results)

0.98146692663715263

In [60]:
model8.fit(X_train, y_train)

GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.5, loss='deviance', max_depth=10,
              max_features=None, max_leaf_nodes=None,
              min_impurity_decrease=0.0, min_impurity_split=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=100,
              presort='auto', random_state=None, subsample=1.0, verbose=0,
              warm_start=False)

In [61]:
model8.score(X_test, y_test)

0.98519999999999996

### 2.3.6 Voting Ensemble
Voting is a way of combining the predictions from multiple machine learning algorithms. We combine the predictions of the `Bagging Decision Tree`, `Random Forests`, `Extra Trees`, `AdaBoost` and `Gradient Boosting` models for our classification problem.`[6]`
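
With hard voting (`voting='hard'`), every model casts one vote per sample and the most frequent label wins. The sketch below reproduces that rule in plain Python on made-up predictions. `hard_vote` is a hypothetical helper, and resolving ties in favour of the smallest label is an assumption made for this sketch.

```python
from collections import Counter

def hard_vote(predictions_per_model):
    """Majority vote per sample; ties go to the smallest label."""
    result = []
    # zip(*...) regroups the predictions from per-model to per-sample
    for votes in zip(*predictions_per_model):
        counts = Counter(votes)
        # sorted() orders the labels, max() keeps the first label with the top count
        result.append(max(sorted(counts), key=counts.get))
    return result

model_predictions = [
    [0, 1, 1, 0],   # e.g. a bagging model
    [0, 1, 0, 0],   # e.g. a random forest
    [1, 1, 0, 0],   # e.g. a boosting model
]
print(hard_vote(model_predictions))
```

A single model that errs on one sample is outvoted as long as the majority of the other models predict the correct label.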

In [62]:
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier

**estimators** - Using `VotingClassifier` with models 1, 2, 4, 6, 7 and 8 from above:

In [63]:
estimators = []
estimators.append(('Decision Trees', model1))
estimators.append(('Bagging Decision Trees', model2))
estimators.append(('Random Forests', model4))
estimators.append(('Extra Trees', model6))
estimators.append(('AdaBoost', model7))
estimators.append(('GradiantBoost', model8))

In [70]:
ensemble = VotingClassifier(estimators)

In [69]:
results = cross_val_score(ensemble, X_train, y_train, cv=5)
np.mean(results)

0.98453315022214094

In [65]:
ensemble.fit(X_train, y_train)

VotingClassifier(estimators=[('Decision Trees', DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_f...      presort='auto', random_state=None, subsample=1.0, verbose=0,
              warm_start=False))],
         flatten_transform=None, n_jobs=1, voting='hard', weights=None)

In [66]:
ensemble.score(X_test, y_test)

0.98599999999999999

**estimators2** - Using `VotingClassifier` with a selection of trained models from above:

In [71]:
estimators2 = []
estimators2.append(('Decision Trees', model1))
estimators2.append(('Random Forests', model5))
estimators2.append(('AdaBoost', model7))
estimators2.append(('GradiantBoost', model8))

In [72]:
ensemble2 = VotingClassifier(estimators2)

In [73]:
results = cross_val_score(ensemble2, X_train, y_train, cv=5)
np.mean(results)

0.98466675028151873

In [74]:
ensemble2.fit(X_train, y_train)

VotingClassifier(estimators=[('Decision Trees', DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_f...      presort='auto', random_state=None, subsample=1.0, verbose=0,
              warm_start=False))],
         flatten_transform=None, n_jobs=1, voting='hard', weights=None)

In [75]:
ensemble2.score(X_test, y_test)

0.98519999999999996

**Select model for export to kaggle.com**

In [None]:
kaggle = ensemble2

**Before exporting we fit the model with the complete dataset.**

In [None]:
# Refit the selected ensemble on all labelled rows; no hold-out split is needed here
# (recent scikit-learn versions reject test_size=0 in train_test_split)
kaggle.fit(X, y)

## 3. Import test data
Import real test data

In [None]:
input = pd.read_csv('data/hr_test.csv')

In [None]:
# Adapt column heads
input.rename(columns={'average_montly_hours':'average_monthly_hours','Work_accident':'work_accident'},inplace=True)

In [None]:
# Separate column ID
input_ID = input['id']
input_ID_list = input_ID.tolist()

In [None]:
# Transforming salary into integer (salary_int)
input['salary_int'] = input.apply(lambda row: varianten_salary(row['salary']), axis=1)

In [None]:
# Remove unnecessary columns for solution approach
input = input.drop(['id', 'department', 'work_accident', 'promotion_last_5years', 'salary'], axis=1)

## 4. Export data
Export to .csv for Kaggle-Upload

In [None]:
# Pass the features in the same column order that was used for training
prediction = kaggle.predict(input[feature_names].values)
result = pd.DataFrame({'id': input_ID, 'left': prediction})

In [None]:
result.to_csv('data/20180131_result_ensemble2.csv', index=False)

## Appendix

### Kaggle uploads

|Date     |Public Leaderboard |Private Leaderboard |Model           |Submitted by |Comment             |
|:--------|:------------------|:-------------------|:---------------|:------------|:-------------------|
|14.12.18 |0.98732            |0.99057             |Ensemble        |LKH          |v03[****]           |
|15.12.18 |0.99199            |**0.99314**         |VotingClassifier|LKH          |v05[****]           |
|27.01.18 |0.99199            |0.99285             |VotingClassifier|LKH          |v06[****]           |
|27.01.18 |0.98732            |0.99000             |VotingClassifier|LKH          |v07[****]           |
|28.01.18 |0.98799            |0.99000             |RandomForest    |LKH          |model2[****]        |
|28.01.18 |0.98732            |0.99000             |VotingClassifier|LKH          |Ensemble[****]      |
|31.01.18 |0.99132            |0.99285             |VotingClassifier|LKH          |Ensemble            |
|31.01.18 |0.99132            |0.99000             |RandomForest    |LKH          |model5              |
|31.01.18 |0.99132            |0.99285             |AdaBoost        |AK           |model7              |
|31.01.18 |0.99132            |0.99285             |VotingClassifier|AK           |Ensemble2           |

[****] *old file*

### Neural Network

#### Data Preprocessing

In [None]:
#from sklearn.preprocessing import StandardScaler

In [None]:
#scaler = StandardScaler()

In [None]:
# Fit only to the training data
#scaler.fit(X_train)

In [None]:
#X_train_scaled = scaler.transform(X_train)
#X_test_scaled = scaler.transform(X_test)

#### Training the model

In [None]:
#from sklearn.neural_network import MLPClassifier

In [None]:
#kfold = model_selection.KFold(n_splits=10, random_state=42)
#model6 = MLPClassifier(hidden_layer_sizes=(60, 60, 60, 60), max_iter=2000, random_state=42)

In [None]:
#results = model_selection.cross_val_score(model6, X_train_scaled, y_train, cv=kfold)
#np.mean(results)

In [None]:
#model6.fit(X_train_scaled,y_train)

In [None]:
#model6.score(X_test_scaled, y_test)

## Links

[1] https://www.investopedia.com/terms/d/decision-tree.asp  
[2] http://scikit-learn.org/stable/modules/ensemble.html  
[3] http://scikit-learn.org/stable/modules/ensemble.html#random-forests  
[4] http://scikit-learn.org/stable/modules/ensemble.html#adaboost  
[5] http://scikit-learn.org/stable/modules/ensemble.html#gradient-tree-boosting  
[6] http://scikit-learn.org/stable/modules/ensemble.html#voting-classifier  