![Logo](img/datasciencelab.png)

# DBM16 - HR Competition
#### Alexander Kopp & Leonhard Kühne-Hellmessen

1. Business Understanding
2. Data Understanding
3. Data Preparation  
**4. Modeling** 
5. Evaluation
6. Deployment

![modelling](img/modeling.png)

## 4. Modeling

### 4.1 Initialize

In [1]:
# Load standard libraries
import numpy as np
import scipy.stats as stats
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

#### Import cleaned data from EDA

In [2]:
df = pd.read_pickle('data/hr_train_clean.pkl')

#### Feature selection

In [3]:
# Transform the ordinal salary level into an integer (salary_int)
def varianten_salary(value):
    # "low" -> 1, "medium" -> 2, every other value -> 3
    if value == "low":
        return 1
    elif value == "medium":
        return 2
    else:
        return 3

# Apply the mapping to the salary column directly instead of row by row
df['salary_int'] = df['salary'].apply(varianten_salary)
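
As a hedged alternative, the same encoding can be written as a dictionary lookup with `Series.map`. The sketch below uses a small made-up frame (`toy`) and a hypothetical `SALARY_LEVELS` dictionary, neither of which is part of the notebook; `"high"` as the name of the third level is an assumption. Unlike the `else` branch above, unknown labels become `NaN` instead of being silently encoded as 3.

```python
import pandas as pd

# Hypothetical mapping; "high" as the third level is an assumption
SALARY_LEVELS = {"low": 1, "medium": 2, "high": 3}

# Made-up toy data, including one unexpected label
toy = pd.DataFrame({"salary": ["low", "high", "medium", "unknown"]})

# map() looks each value up in the dictionary; missing keys yield NaN
toy["salary_int"] = toy["salary"].map(SALARY_LEVELS)

# Count the labels that could not be encoded
n_unmapped = int(toy["salary_int"].isna().sum())
print(toy)
print("unmapped labels:", n_unmapped)
```

Checking `n_unmapped` before modeling makes typos or new salary levels visible early.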

In [4]:
# Remove unnecessary columns
df = df.drop(['id','department', 'work_accident', 'promotion_last_5years', 'salary'], axis=1)

#### Data preparation - feature engineering

In [5]:
# One-hot encode any remaining categorical features and re-attach the target column 'left'
df_num = pd.get_dummies(df.drop('left', axis=1)).join(df[['left']])

In [6]:
df.head()

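| | satisfaction_level | last_evaluation | number_project | average_monthly_hours | time_spend_company | left | salary_int |
|---|---|---|---|---|---|---|---|
| 0 | 0.65 | 0.96 | 5 | 226 | 2 | 0 | 2 |
| 1 | 0.88 | 0.8 | 3 | 166 | 2 | 0 | 1 |
| 2 | 0.69 | 0.98 | 3 | 214 | 2 | 0 | 1 |
| 3 | 0.41 | 0.47 | 2 | 154 | 3 | 1 | 1 |
| 4 | 0.87 | 0.76 | 5 | 254 | 2 | 0 | 1 |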


### 4.2 Train/Test Split
We use scikit-learn's `train_test_split` function to split the data. The argument `test_size=0.25` sets the fraction of the data that is held out for testing:

In [7]:
from sklearn.model_selection import train_test_split

In [13]:
# Set target vector y:
y = df_num['left'].values

# Set class label:
class_names = np.unique(y)

# Set feature names:
feature_names = np.array([
    'satisfaction_level',
    'last_evaluation',
    'number_project',
    'average_monthly_hours',
    'time_spend_company',
    'salary_int',
])
# Create feature matrix:
X = df_num[feature_names].values

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(7500, 6) (7500,)
(2500, 6) (2500,)
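
To illustrate what `test_size` does, the sketch below splits a made-up array of 40 samples. The `stratify=y_toy` argument is an addition that the notebook does not use; it keeps the class ratio the same in both parts, which may matter for an imbalanced target such as `left`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 40 samples, 2 features, 20 % positive class
X_toy = np.arange(80).reshape(40, 2)
y_toy = np.array([1] * 8 + [0] * 32)

# Hold out 25 % of the samples; stratify keeps the class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)

print(X_tr.shape, X_te.shape)
print("positive share train/test:", y_tr.mean(), y_te.mean())
```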


### 4.3 Classification

In [15]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB, BernoulliNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, KFold

### 4.3.1 Decision Trees

Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.
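
What such "simple decision rules" look like can be seen on a tiny made-up data set. This is only an illustrative sketch: `X_toy` and `y_toy` are invented, and `export_text` is assumed to be available (it was added in scikit-learn 0.21, which may be newer than the version used for this notebook).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up data: one feature, class 1 when the value is large
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])

tree = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

# Print the learned if/else rules in text form
print(export_text(tree, feature_names=["x"]))

# Classify two new points, one on each side of the learned threshold
toy_pred = tree.predict(np.array([[0.5], [2.5]]))
print(toy_pred)
```

A single threshold between 1.0 and 2.0 is enough to separate the two classes here, so the tree consists of one split.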

In [16]:
# DecisionTreeClassifier is a class capable of performing multi-class classification on a dataset
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
model = clf.fit(X_train, y_train)
predictions = clf.predict(X_test)

In [17]:
model.score(X_test, y_test)

0.9708

In [18]:
cv_scores = cross_val_score(clf, X_train, y_train, cv=10)
np.mean(cv_scores)

0.97159944556938471

### 4.3.2 Ensemble methods
The goal of ensemble methods is to combine the predictions of several base estimators built with a given learning algorithm in order to improve generalizability / robustness over a single estimator.
We will have a look at the two families of ensemble methods that are usually distinguished:
  1. In **averaging methods**, the driving principle is to build several estimators independently and then to average their predictions. On average, the combined estimator is usually better than any single base estimator because its variance is reduced.  
  *We will use **Random Forests** and **Extremely Randomized Trees**.*
  2. In **boosting methods**, base estimators are built sequentially and one tries to reduce the bias of the combined estimator. The motivation is to combine several weak models to produce a powerful ensemble.  
  *We will use **Gradient Boosting** and **AdaBoost**.*
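
The variance argument behind averaging methods can be checked numerically. The sketch below is a simplified illustration with made-up numbers: it assumes 25 fully independent noisy estimators, whereas the trees of a real forest are correlated, so the reduction in practice is smaller.

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 1.0

# 10,000 repetitions of 25 independent noisy estimates of the same quantity
estimates = true_value + rng.normal(scale=1.0, size=(10_000, 25))

var_single = estimates[:, 0].var()           # variance of one estimator
var_averaged = estimates.mean(axis=1).var()  # variance of the average of 25

print(f"single: {var_single:.3f}, averaged: {var_averaged:.3f}")
```

For independent estimators with variance 1, the average of 25 has variance close to 1/25.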

#### 4.3.2.1 Random Forests
In random forests, each tree in the ensemble is built from a sample drawn with replacement (i.e., a bootstrap sample) from the training set. In addition, when splitting a node during the construction of the tree, the split that is chosen is no longer the best split among all features. Instead, the split that is picked is the best split among a random subset of the features. As a result of this randomness, the bias of the forest usually slightly increases (with respect to the bias of a single non-random tree) but, due to averaging, its variance also decreases, usually more than compensating for the increase in bias, hence yielding an overall better model.
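
The two sources of randomness described above can be sketched with NumPy alone. The sizes are made up, and using the square root of the number of features for the candidate subset is an assumption for illustration, not a setting taken from this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 1000, 6

# Bootstrap sample: draw n_samples row indices with replacement
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)
unique_share = np.unique(bootstrap_idx).size / n_samples

# Random feature subset considered at one split (sqrt rule, an assumption here)
n_candidates = max(1, int(np.sqrt(n_features)))
candidate_features = rng.choice(n_features, size=n_candidates, replace=False)

print(f"share of distinct rows in the bootstrap sample: {unique_share:.2f}")
print("features considered at this split:", candidate_features)
```

Because rows are drawn with replacement, each tree sees only roughly two thirds of the distinct training rows.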

In [19]:
rfc = RandomForestClassifier()
cv_scores = cross_val_score(rfc, X_train, y_train, cv=10)
np.mean(cv_scores)

0.98453314441447903

In [20]:
rfc2 = RandomForestClassifier(n_estimators=100)
cv_scores = cross_val_score(rfc2, X_train, y_train, cv=10)
np.mean(cv_scores)

0.98599892195363892

In [21]:
rfc3 = RandomForestClassifier(n_estimators=250, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0)
cv_scores = cross_val_score(rfc3, X_train, y_train, cv=10)
np.mean(cv_scores)

0.98639838933046986

#### 4.3.2.2 Extremely Randomized Trees
In extremely randomized trees, randomness goes one step further in the way splits are computed. As in random forests, a random subset of candidate features is used, but instead of looking for the most discriminative thresholds, thresholds are drawn at random for each candidate feature and the best of these randomly generated thresholds is picked as the splitting rule. This usually reduces the variance of the model a bit more, at the expense of a slightly greater increase in bias.
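
The split rule described above can be sketched in a few lines. This is a simplified illustration on made-up data, not scikit-learn's implementation; the helper names `gini`, `split_impurity` and `random_split` are hypothetical, and Gini impurity is assumed as the split criterion.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a 1-D array of class labels."""
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / labels.size
    return 1.0 - np.sum(p ** 2)

def split_impurity(x, y, threshold):
    """Weighted Gini impurity after splitting on x <= threshold."""
    left, right = y[x <= threshold], y[x > threshold]
    return (left.size * gini(left) + right.size * gini(right)) / y.size

def random_split(X, y, n_candidate_features, rng):
    """Draw one random threshold per candidate feature and keep the best."""
    features = rng.choice(X.shape[1], size=n_candidate_features, replace=False)
    best = None
    for j in features:
        # Threshold drawn uniformly between the feature's min and max
        threshold = rng.uniform(X[:, j].min(), X[:, j].max())
        score = split_impurity(X[:, j], y, threshold)
        if best is None or score < best[2]:
            best = (int(j), float(threshold), float(score))
    return best

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 4))          # made-up features
y_toy = (X_toy[:, 0] > 0).astype(int)      # class depends on feature 0 only
feature, threshold, impurity = random_split(X_toy, y_toy, 2, rng)
print(feature, round(threshold, 3), round(impurity, 3))
```

A random forest would instead search all thresholds of each candidate feature for the lowest impurity.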

In [22]:
from sklearn.ensemble import ExtraTreesClassifier
etc = ExtraTreesClassifier(n_estimators=100, max_depth=None, min_samples_split=2, random_state=42)
scores = cross_val_score(etc, X_train, y_train, cv=10)
scores.mean()

0.98453261036908513

#### 4.3.2.3 AdaBoost
The core principle of AdaBoost is to fit a sequence of weak learners (i.e., models that are only slightly better than random guessing, such as small decision trees) on repeatedly modified versions of the data. The predictions from all of them are then combined through a weighted majority vote (or sum) to produce the final prediction.
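
To make the weak-learner idea concrete, the sketch below compares a single decision stump with AdaBoost on made-up data from `make_moons`. It relies on the default weak learner of `AdaBoostClassifier`, which is a decision tree of depth 1; the data set and the settings are assumptions for illustration, and the scores are training accuracies only.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Made-up two-class data that a single axis-aligned split cannot separate
X_toy, y_toy = make_moons(n_samples=300, noise=0.2, random_state=0)

# A decision stump (tree of depth 1) as the weak learner
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_toy, y_toy)

# AdaBoostClassifier boosts depth-1 trees by default
boosted = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_toy, y_toy)

stump_acc = stump.score(X_toy, y_toy)
boosted_acc = boosted.score(X_toy, y_toy)
print(f"stump: {stump_acc:.3f}, boosted stumps: {boosted_acc:.3f}")
```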

In [23]:
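# Note: this boosts the 100-tree random forest rfc2 rather than a weak learner such as a decision stump.
# In scikit-learn >= 1.2 the `base_estimator` argument is named `estimator`.
abc = AdaBoostClassifier(base_estimator=rfc2)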
cv_scores = cross_val_score(abc, X_train, y_train, cv=10)
np.mean(cv_scores)

0.98599874441258262

#### 4.3.2.4 Gradient Boosting
Gradient Tree Boosting or Gradient Boosted Regression Trees (GBRT) is a generalization of boosting to arbitrary differentiable loss functions. GBRT is an accurate and effective off-the-shelf procedure that can be used for both regression and classification problems. Gradient Tree Boosting models are used in a variety of areas including Web search ranking and ecology.
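
The idea is easiest to see for squared loss, where the negative gradient is simply the residual. The sketch below is a hand-rolled regression example on made-up data, not the procedure used by `GradientBoostingClassifier` (which optimizes a classification loss); the learning rate, tree depth and number of stages are arbitrary assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_toy = np.linspace(0, 6, 200).reshape(-1, 1)                    # made-up inputs
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.1, size=200)    # noisy target

learning_rate = 0.1
prediction = np.full_like(y_toy, y_toy.mean())  # stage 0: constant model
mse_start = np.mean((y_toy - prediction) ** 2)

for _ in range(50):
    # For squared loss the negative gradient is the residual
    residual = y_toy - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_toy, residual)
    # Add a shrunken correction from the newly fitted tree
    prediction += learning_rate * tree.predict(X_toy)

mse_end = np.mean((y_toy - prediction) ** 2)
print(f"training MSE: {mse_start:.3f} -> {mse_end:.3f}")
```

Each stage fits a small tree to what the current ensemble still gets wrong, so the training error shrinks step by step.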

In [24]:
gbc = GradientBoostingClassifier()
cv_scores = cross_val_score(gbc, X_train, y_train, cv=5)
np.mean(cv_scores)

0.97400043514093415

In [25]:
gbc2 = GradientBoostingClassifier(n_estimators=100, learning_rate=0.5,
    max_depth=10, random_state=4)
cv_scores = cross_val_score(gbc2, X_train, y_train, cv=5)
np.mean(cv_scores)

0.981199815348066

#### 4.3.2.5 Model Selection
KFold divides all the samples into `k` groups, called folds, of equal size (if possible). The prediction function is learned using `k - 1` folds, and the fold left out is used for testing.
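
The fold structure can be inspected directly. The sketch below uses ten made-up samples and `k = 5`; the names `X_toy`, `kfold_toy` and `fold_sizes` are illustrative only.

```python
import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(10).reshape(10, 1)  # made-up data: 10 samples

kfold_toy = KFold(n_splits=5)
fold_sizes = []
tested = []
for train_idx, test_idx in kfold_toy.split(X_toy):
    # Each iteration trains on k - 1 folds and tests on the remaining one
    fold_sizes.append((train_idx.size, test_idx.size))
    tested.extend(test_idx.tolist())
    print("train:", train_idx, "test:", test_idx)
```

Every sample ends up in the test fold exactly once.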

In [26]:
from sklearn import model_selection

In [27]:
# Note: kfold is defined here but not passed on; the score below uses cv=5.
# random_state only takes effect together with shuffle=True.
kfold = model_selection.KFold(n_splits=40, shuffle=True, random_state=42)
# Combine the previously defined models in a majority-vote ensemble
estimators = [('clf', clf), ('rfc2', rfc2), ('gbc2', gbc2), ('abc', abc)]
ensemble = VotingClassifier(estimators)
cv_scores = cross_val_score(ensemble, X_train, y_train, cv=5)
np.mean(cv_scores)

0.98453332799999771

In [28]:
ensemble.fit(X_train, y_train)

VotingClassifier(estimators=[('clf', DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_lea...0,
            warm_start=False),
          learning_rate=1.0, n_estimators=50, random_state=None))],
         flatten_transform=None, n_jobs=1, voting='hard', weights=None)

In [29]:
ensemble.score(X_test, y_test)

0.98560000000000003

#### 4.3.2.6 Voting Classifier

In [30]:
eclf = VotingClassifier(estimators=[('clf', clf), ('rfc2', rfc2), ('gbc2', gbc2), ('abc', abc)], voting='hard')

for vclf, label in zip([clf, rfc3, gbc2, abc, eclf], ['DecisionTree', 'Random Forest', 'GradientBoost', 'AdaBoost', 'Ensemble']):
    scores = cross_val_score(vclf, X_train, y_train, cv=5, scoring='accuracy')
    print("Accuracy: %0.3f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))

Accuracy: 0.971 (+/- 0.00) [DecisionTree]
Accuracy: 0.985 (+/- 0.00) [Random Forest]
Accuracy: 0.981 (+/- 0.00) [GradientBoost]
Accuracy: 0.985 (+/- 0.00) [AdaBoost]
Accuracy: 0.984 (+/- 0.00) [Ensemble]


In [31]:
ensemble.score(X_test, y_test)

0.98560000000000003

#### Import real test data

In [None]:
input = pd.read_csv('data/hr_test.csv')

In [None]:
input.rename(columns={'average_montly_hours':'average_monthly_hours','Work_accident':'work_accident'},inplace=True)

In [None]:
# Split off ID
input_ID = input['id']
input_ID_list = input_ID.tolist()
input = input.drop(['id'], axis=1)

In [None]:
# Remove the columns that were also dropped from the training data
input = input.drop(['department', 'work_accident', 'promotion_last_5years'], axis=1)
input.head()

In [None]:
# Encode salary with the same mapping as for the training data
input['salary_int'] = input['salary'].apply(varianten_salary)

In [None]:
input = input.drop(['salary'], axis=1)
input.head()

In [None]:
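# abc itself was only cross-validated above (cross_val_score fits clones), so fit it first.
# Columns are selected in the training order given by feature_names.
# abc.fit(X_train, y_train)
# prediction = abc.predict(input[feature_names].values)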

In [None]:
# result = pd.DataFrame({'id': input_ID, 'left': prediction})
# result.head()

In [None]:
# result.to_csv('data/result_v08.csv', index=False)
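
The preparation steps for the test file could also be bundled into one function, so that the columns are guaranteed to arrive in the training order. The sketch below is only a suggestion: `prepare_features` and `FEATURES` are hypothetical names, the toy rows are made up, the raw column layout is inferred from the rename and drop cells above, and `"high"` as the third salary level is an assumption.

```python
import pandas as pd

# Feature order used for training in section 4.2
FEATURES = ['satisfaction_level', 'last_evaluation', 'number_project',
            'average_monthly_hours', 'time_spend_company', 'salary_int']

def prepare_features(raw):
    """Hypothetical helper: rename, encode salary, select training columns."""
    # rename() returns a new frame, so the raw input is left untouched
    out = raw.rename(columns={'average_montly_hours': 'average_monthly_hours'})
    # 'high' as the third level is an assumption; unknown labels become NaN
    out['salary_int'] = out['salary'].map({'low': 1, 'medium': 2, 'high': 3})
    return out[FEATURES]

# Made-up rows in the assumed layout of the raw test file
raw_toy = pd.DataFrame({
    'id': [1, 2],
    'satisfaction_level': [0.4, 0.9],
    'last_evaluation': [0.5, 0.8],
    'number_project': [2, 4],
    'average_montly_hours': [150, 220],
    'time_spend_company': [3, 2],
    'Work_accident': [0, 0],
    'promotion_last_5years': [0, 0],
    'department': ['sales', 'IT'],
    'salary': ['low', 'high'],
})

features_toy = prepare_features(raw_toy)
print(features_toy)
```

The resulting frame can be converted with `.values` and passed to a fitted model's `predict` method.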