![Logo](img/datasciencelab.png)

# DBM16 - HR Competition
#### Alexander Kopp & Leonhard Kühne-Hellmessen

1. Business Understanding
2. Data Understanding
3. Data Preparation
**4. Modeling** 
5. Evaluation
6. Deployment

![modelling](img/modeling.png)

## 4. Modeling

### 4.1 Initialize

In [1]:
# Load standard libraries
import numpy as np
import scipy.stats as stats
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

#### Import cleaned data from EDA

In [2]:
df = pd.read_pickle('data/hr_train_clean.pkl')

#### Feature selection

In [3]:
# Transform salary into an ordinal integer (salary_int)
def varianten_salary(value):
    """Map a salary label to an integer: low -> 1, medium -> 2, anything else -> 3."""
    if value == "low":
        return 1
    elif value == "medium":
        return 2
    else:
        return 3

df['salary_int'] = df['salary'].apply(varianten_salary)
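
The same ordinal encoding can be sketched with a lookup table and `Series.map`. The frame `demo`, the dictionary `salary_levels` and the label `'high'` for the third level are assumptions for illustration; the function above simply maps every label other than `low` and `medium` to 3.

```python
import pandas as pd

# Hypothetical miniature of the salary column (not the HR data)
demo = pd.DataFrame({'salary': ['low', 'medium', 'high', 'low']})

# Ordinal encoding via a lookup table.
# Note: labels missing from the table become NaN instead of 3.
salary_levels = {'low': 1, 'medium': 2, 'high': 3}
demo['salary_int'] = demo['salary'].map(salary_levels)
print(demo['salary_int'].tolist())
```

Unlike the `else` branch above, this variant makes unexpected labels visible as missing values, which may or may not be desirable.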

In [4]:
# Remove unnecessary columns
df = df.drop(['id','department', 'work_accident', 'promotion_last_5years', 'salary'], axis=1)

#### Data preparation - feature engineering

In [5]:
# One-hot encode any remaining categorical features, then re-attach the target column 'left'
df_num = pd.get_dummies(df.drop('left', axis=1)).join(df[['left']])

In [6]:
df.head()

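| | satisfaction_level | last_evaluation | number_project | average_monthly_hours | time_spend_company | left | salary_int |
|---|---|---|---|---|---|---|---|
| 0 | 0.65 | 0.96 | 5 | 226 | 2 | 0 | 2 |
| 1 | 0.88 | 0.8 | 3 | 166 | 2 | 0 | 1 |
| 2 | 0.69 | 0.98 | 3 | 214 | 2 | 0 | 1 |
| 3 | 0.41 | 0.47 | 2 | 154 | 3 | 1 | 1 |
| 4 | 0.87 | 0.76 | 5 | 254 | 2 | 0 | 1 |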


### 4.2 Train/Test Split
We use scikit-learn's `train_test_split` function to make the split. The argument `test_size=0.25` sets the fraction of the data that is held out for testing, and `random_state=42` makes the split reproducible:

In [7]:
from sklearn.model_selection import train_test_split

In [8]:
# Set target vector y:
y = df_num['left'].values

# Set class label:
class_names = np.unique(y)

# Set feature names:
feature_names = np.array([
    'satisfaction_level',
    'last_evaluation',
    'number_project',
    'average_monthly_hours',
    'time_spend_company',
    'salary_int',
])
# Create feature matrix:
X = df_num[feature_names].values

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(7500, 6) (7500,)
(2500, 6) (2500,)


### 4.3 Classification

### 4.3.1 Bagging Decision Trees

Bagging works best with high-variance learners. Decision trees are a popular choice, usually grown without pruning.
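
As a rough illustration of the idea, the sketch below builds a bagged ensemble by hand: each unpruned tree is fit on a bootstrap sample and the predictions are combined by majority vote. The synthetic data from `make_classification` and the names `n_trees`, `votes` and `bagged_pred` are assumptions for the example and not part of the HR data set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the HR data set)
X_demo, y_demo = make_classification(n_samples=600, n_features=6, random_state=0)
Xa, Xb, ya, yb = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

rng = np.random.RandomState(0)
n_trees = 25  # odd number, so a 0/1 majority vote cannot tie
votes = np.zeros((n_trees, len(Xb)), dtype=int)

for i in range(n_trees):
    # Bootstrap sample: draw len(Xa) row indices with replacement
    idx = rng.randint(0, len(Xa), size=len(Xa))
    tree = DecisionTreeClassifier(random_state=i).fit(Xa[idx], ya[idx])
    votes[i] = tree.predict(Xb)

# Majority vote: labels are 0/1, so the column mean is the share of 1-votes
bagged_pred = (votes.mean(axis=0) > 0.5).astype(int)

single_pred = DecisionTreeClassifier(random_state=0).fit(Xa, ya).predict(Xb)
single_acc = (single_pred == yb).mean()
bagged_acc = (bagged_pred == yb).mean()
print("single tree:", single_acc, "bagged:", bagged_acc)
```

`BaggingClassifier` automates this procedure; the hand-written loop is only meant to make the bootstrap and voting steps visible.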

In [10]:
from sklearn import model_selection
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

In [11]:
# random_state only matters with shuffle=True, so it is omitted here
kfold = model_selection.KFold(n_splits=10)
clf = DecisionTreeClassifier()
model1 = BaggingClassifier(base_estimator=clf, n_estimators=100, random_state=42)

In [12]:
results = model_selection.cross_val_score(model1, X_train, y_train, cv=kfold)
results.mean()

0.98480000000000012

In [13]:
model1.fit(X_train, y_train)

BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best'),
         bootstrap=True, bootstrap_features=False, max_features=1.0,
         max_samples=1.0, n_estimators=100, n_jobs=1, oob_score=False,
         random_state=42, verbose=0, warm_start=False)

In [14]:
model1.score(X_test, y_test)

0.98599999999999999

### 4.3.2 Random Forests
In random forests, each tree in the ensemble is built from a sample drawn with replacement (i.e., a bootstrap sample) from the training set. In addition, when splitting a node during the construction of the tree, the split that is chosen is no longer the best split among all features. Instead, the split that is picked is the best split among a random subset of the features. As a result of this randomness, the bias of the forest usually slightly increases (with respect to the bias of a single non-random tree) but, due to averaging, its variance also decreases, usually more than compensating for the increase in bias, hence yielding an overall better model.
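
A small comparison can make the bias/variance trade-off more tangible. The sketch below cross-validates a single fully grown tree and a forest on synthetic data; the data set and the names `tree_scores` and `forest_scores` are assumptions for illustration, and the resulting numbers say nothing about the HR data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the HR data set)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=4, random_state=0
)

# One fully grown tree: low bias, high variance
tree = DecisionTreeClassifier(random_state=0)
# A forest: bootstrap samples plus a random subset of 3 features per split
forest = RandomForestClassifier(n_estimators=50, max_features=3, random_state=0)

tree_scores = cross_val_score(tree, X_demo, y_demo, cv=5)
forest_scores = cross_val_score(forest, X_demo, y_demo, cv=5)
print("tree:   mean accuracy = %.3f" % tree_scores.mean())
print("forest: mean accuracy = %.3f" % forest_scores.mean())
```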

In [83]:
from sklearn.ensemble import RandomForestClassifier

In [84]:
kfold = model_selection.KFold(n_splits=10)
model2 = RandomForestClassifier(n_estimators=100, max_features=3)

In [85]:
results = model_selection.cross_val_score(model2, X_train, y_train, cv=kfold)
results.mean()

0.98640000000000005

In [86]:
model2.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features=3, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [87]:
model2.score(X_test, y_test)

0.98640000000000005

In [88]:
# Kaggle: 0.98799

### 4.3.3 Extra Trees
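Extra Trees (extremely randomized trees) take the randomization one step further: for each candidate feature a split threshold is drawn at random, and the best of these random splits is used. By default, scikit-learn fits each tree on the whole training set rather than on a bootstrap sample (`bootstrap=False`).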
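
To illustrate how a split is chosen in an Extra-Trees-style node, the sketch below draws one random threshold per feature and keeps the one with the lowest weighted Gini impurity. The helper names `gini` and `random_split` and the toy data are hypothetical; this is a simplified picture, not scikit-learn's implementation.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a 1-D array of class labels."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def random_split(X, y, rng):
    """Pick the best of one random threshold per feature."""
    best = None
    for j in range(X.shape[1]):
        lo, hi = X[:, j].min(), X[:, j].max()
        threshold = rng.uniform(lo, hi)  # random cut point, not optimized
        left = X[:, j] <= threshold
        # Weighted impurity of the two child nodes
        score = (left.sum() * gini(y[left]) + (~left).sum() * gini(y[~left])) / len(y)
        if best is None or score < best[2]:
            best = (j, threshold, score)
    return best

rng = np.random.RandomState(0)
X_demo = rng.rand(200, 3)
y_demo = (X_demo[:, 1] > 0.5).astype(int)  # only feature 1 carries signal
feature, threshold, impurity = random_split(X_demo, y_demo, rng)
print(feature, round(threshold, 3), round(impurity, 3))
```

A classical decision tree would search all thresholds of each feature instead of drawing a single random one.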

In [89]:
from sklearn.ensemble import ExtraTreesClassifier

In [90]:
kfold = model_selection.KFold(n_splits=10)
model3 = ExtraTreesClassifier(n_estimators=100, max_features=6)

In [91]:
results = model_selection.cross_val_score(model3, X_train, y_train, cv=kfold)
results.mean()

0.98520000000000008

In [92]:
model3.fit(X_train, y_train)

ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
           max_depth=None, max_features=6, max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

In [93]:
model3.score(X_test, y_test)

0.98440000000000005

### 4.3.4 AdaBoost
The core principle of AdaBoost is to fit a sequence of weak learners (i.e., models that are only slightly better than random guessing, such as small decision trees) on repeatedly modified versions of the data. The predictions from all of them are then combined through a weighted majority vote (or sum) to produce the final prediction.
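
The reweighting loop can be sketched in a few lines. The example below implements the classic discrete AdaBoost update with decision stumps on synthetic data; the data and the names `w`, `alphas` and `stumps` are assumptions for illustration, and scikit-learn's `AdaBoostClassifier` may use a different variant internally.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the HR data set)
X_demo, y01 = make_classification(n_samples=400, n_features=6, random_state=0)
y_pm = 2 * y01 - 1  # recode labels to -1 / +1

n_rounds = 20
w = np.full(len(y_pm), 1.0 / len(y_pm))  # start with uniform sample weights
stumps, alphas = [], []

for _ in range(n_rounds):
    # Weak learner: a depth-1 tree fit on the currently weighted data
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X_demo, y_pm, sample_weight=w)
    pred = stump.predict(X_demo)
    # Weighted error, clipped to keep the logarithm finite
    err = np.clip(np.sum(w[pred != y_pm]), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)  # weight of this learner in the vote
    w = w * np.exp(-alpha * y_pm * pred)   # up-weight misclassified samples
    w = w / w.sum()
    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: sign of the alpha-weighted sum of the stump votes
agg = sum(a * s.predict(X_demo) for a, s in zip(alphas, stumps))
final_pred = np.sign(agg)
train_acc = (final_pred == y_pm).mean()
print("training accuracy:", train_acc)
```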

In [94]:
from sklearn.ensemble import AdaBoostClassifier

In [95]:
kfold = model_selection.KFold(n_splits=10)
model4 = AdaBoostClassifier(n_estimators=100, random_state=42)

In [96]:
results = model_selection.cross_val_score(model4, X_train, y_train, cv=kfold)
results.mean()

0.95533333333333326

In [97]:
model4.fit(X_train, y_train)

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=100, random_state=42)

In [98]:
model4.score(X_test, y_test)

0.95520000000000005

In [99]:
model4_2 = AdaBoostClassifier(base_estimator=model2)

In [101]:
results = model_selection.cross_val_score(model4_2, X_train, y_train, cv=kfold)
results.mean()

0.98640000000000005

### 4.3.5 Gradient Boosting
Gradient Tree Boosting or Gradient Boosted Regression Trees (GBRT) is a generalization of boosting to arbitrary differentiable loss functions. GBRT is an accurate and effective off-the-shelf procedure that can be used for both regression and classification problems. Gradient Tree Boosting models are used in a variety of areas including Web search ranking and ecology.
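
The core loop is easiest to see for regression with squared loss, where the negative gradient is simply the residual. The sketch below uses that simplified setting on synthetic data; it is an assumption-laden illustration (names `F`, `learning_rate`, `n_stages` are chosen here) and not the loss that `GradientBoostingClassifier` optimizes for classification.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem (assumption: for illustration only)
rng = np.random.RandomState(0)
X_demo = rng.uniform(-3, 3, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=300)

learning_rate = 0.1
n_stages = 100
F = np.full(len(y_demo), y_demo.mean())  # stage 0: constant prediction
mse_start = np.mean((y_demo - F) ** 2)

for _ in range(n_stages):
    # For squared loss L = (y - F)^2 / 2, the negative gradient is the residual y - F
    residual = y_demo - F
    tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_demo, residual)
    # Take a small step in the direction the tree has learned
    F = F + learning_rate * tree.predict(X_demo)

mse_end = np.mean((y_demo - F) ** 2)
print("training MSE before/after:", mse_start, mse_end)
```

Other differentiable losses fit the same scheme: only the line that computes the negative gradient changes.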

In [30]:
from sklearn.ensemble import GradientBoostingClassifier

In [31]:
kfold = model_selection.KFold(n_splits=10)
model5 = GradientBoostingClassifier(n_estimators=100, random_state=42)

In [32]:
results = model_selection.cross_val_score(model5, X_train, y_train, cv=kfold)
results.mean()

0.97360000000000002

In [33]:
model5.fit(X_train, y_train)

GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=3,
              max_features=None, max_leaf_nodes=None,
              min_impurity_decrease=0.0, min_impurity_split=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=100,
              presort='auto', random_state=42, subsample=1.0, verbose=0,
              warm_start=False)

In [34]:
model5.score(X_test, y_test)

0.97760000000000002

### 4.3.6 Voting Ensemble
Voting is a way of combining the predictions of multiple machine learning models. Here we combine `Bagging Decision Trees`, `Random Forests`, `Extra Trees`, both `AdaBoost` variants and `Gradient Boosting` in a single classifier that decides by majority vote (`voting='hard'`).
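
What a hard (majority) vote does can be shown with plain NumPy. The prediction matrix `preds` below is made up for illustration and does not come from the models in this notebook.

```python
import numpy as np

# Hypothetical class predictions of three models for five samples
preds = np.array([
    [0, 1, 1, 0, 1],  # model A
    [0, 1, 0, 0, 1],  # model B
    [1, 1, 1, 0, 0],  # model C
])

# Hard voting: per sample (column), take the most frequent class label
majority = np.array([np.bincount(col).argmax() for col in preds.T])
print(majority)  # [0 1 1 0 1]
```

With an even number of voters a tie is possible; in this sketch `argmax` then returns the smallest class label.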

In [102]:
from sklearn.ensemble import VotingClassifier

In [103]:
kfold = model_selection.KFold(n_splits=10)

In [116]:
estimators = []
estimators.append(('Bagging Decision Trees', model1))
estimators.append(('Random Forests', model2))
estimators.append(('Extra Trees', model3))
estimators.append(('AdaBoost', model4))
estimators.append(('AdaBoost2', model4_2))
estimators.append(('GradientBoost', model5))

In [117]:
ensemble = VotingClassifier(estimators)
results = model_selection.cross_val_score(ensemble, X_train, y_train, cv=kfold)
results.mean()

0.98640000000000005

In [118]:
ensemble.fit(X_train, y_train)

VotingClassifier(estimators=[('Bagging Decision Trees', BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_...        presort='auto', random_state=42, subsample=1.0, verbose=0,
              warm_start=False))],
         flatten_transform=None, n_jobs=1, voting='hard', weights=None)

In [119]:
ensemble.score(X_test, y_test)

0.98640000000000005

In [111]:
# Kaggle (28.1.18): 0.98732

### 4.3.7 Neural Network

#### Data Preprocessing

In [41]:
from sklearn.preprocessing import StandardScaler

In [42]:
scaler = StandardScaler()

In [43]:
# Fit only to the training data
scaler.fit(X_train)

StandardScaler(copy=True, with_mean=True, with_std=True)

In [44]:
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
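
The reason for fitting on the training data only is that the test set must be transformed with the training statistics. The sketch below checks this by hand on random data; the arrays `train` and `test` and the name `demo_scaler` are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Random stand-in data (assumption: not the HR features)
rng = np.random.RandomState(0)
train = rng.normal(loc=200, scale=50, size=(100, 2))
test = rng.normal(loc=200, scale=50, size=(20, 2))

# Mean and standard deviation are estimated from the training rows only
demo_scaler = StandardScaler().fit(train)
test_scaled = demo_scaler.transform(test)

# Same transformation by hand, using the training mean and population std
manual = (test - train.mean(axis=0)) / train.std(axis=0)
print(np.allclose(test_scaled, manual))
```

Fitting the scaler on the full data set instead would leak information about the test rows into the preprocessing step.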

#### Training the model

In [45]:
from sklearn.neural_network import MLPClassifier

In [71]:
kfold = model_selection.KFold(n_splits=10)
model6 = MLPClassifier(hidden_layer_sizes=(60, 60, 60, 60), max_iter=2000, random_state=42)

In [72]:
results = model_selection.cross_val_score(model6, X_train_scaled, y_train, cv=kfold)
results.mean()

0.97106666666666663

In [73]:
model6.fit(X_train_scaled,y_train)

MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(60, 60, 60, 60), learning_rate='constant',
       learning_rate_init=0.001, max_iter=2000, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=42, shuffle=True,
       solver='adam', tol=0.0001, validation_fraction=0.1, verbose=False,
       warm_start=False)

In [74]:
model6.score(X_test_scaled, y_test)

0.9728

### 4.4 Import test data

In [50]:
# Read the test set and apply the same preparation steps as for the training data
# (named test_df to avoid shadowing the built-in input())
test_df = pd.read_csv('data/hr_test.csv')
test_df = test_df.rename(columns={'average_montly_hours': 'average_monthly_hours', 'Work_accident': 'work_accident'})
test_ids = test_df['id']
test_df['salary_int'] = test_df['salary'].apply(varianten_salary)
test_df = test_df.drop(['id', 'department', 'work_accident', 'promotion_last_5years', 'salary'], axis=1)

### 4.5 Export data

In [51]:
# Select the columns in the same order as during training
#prediction = ensemble.predict(test_df[feature_names].values)

In [52]:
#result = pd.DataFrame({'id': test_ids, 'left': prediction})
#result.to_csv('data/result_ensemble.csv', index=False)