![Logo](img/datasciencelab.png)

# DBM16 - HR Competition
#### Alexander Kopp & Leonhard Kühne-Hellmessen

1. Business Understanding
2. Data Understanding
3. Data Preparation  
**4. Modeling** 
5. Evaluation
6. Deployment

![modelling](img/modeling.png)

## 4. Modeling

### 4.1 Initialize

In [1]:
# Load standard libraries
import numpy as np
import scipy.stats as stats
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

#### Import cleaned data from EDA

In [2]:
df = pd.read_pickle('data/hr_train_clean.pkl')

#### Feature selection

In [3]:
# Transforming salary into integer (salary_int)
def varianten_salary(value):
    if value == "low" :
        return 1
    elif value == "medium":
        return 2
    else:
        return 3

df['salary_int'] = df['salary'].map(varianten_salary)

In [4]:
# Remove unnecessary columns
df = df.drop(['id','department', 'work_accident', 'promotion_last_5years', 'salary'], axis=1)

#### Data preparation - feature engineering

In [5]:
# One-hot encode any remaining categorical columns and re-attach the target column 'left'
df_num = pd.get_dummies(df.drop('left', axis=1)).join(df[['left']])

In [6]:
df.head()

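| | satisfaction_level | last_evaluation | number_project | average_monthly_hours | time_spend_company | left | salary_int |
|---|---|---|---|---|---|---|---|
| 0 | 0.65 | 0.96 | 5 | 226 | 2 | 0 | 2 |
| 1 | 0.88 | 0.8 | 3 | 166 | 2 | 0 | 1 |
| 2 | 0.69 | 0.98 | 3 | 214 | 2 | 0 | 1 |
| 3 | 0.41 | 0.47 | 2 | 154 | 3 | 1 | 1 |
| 4 | 0.87 | 0.76 | 5 | 254 | 2 | 0 | 1 |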


### 4.2 Train/Test Split
We use the `train_test_split` function to make the split. The argument `test_size=0.25` sets the fraction of the data that is held out for testing, here 2,500 of the 10,000 rows:

In [7]:
from sklearn.model_selection import train_test_split, cross_val_score

In [8]:
# Set target vector y:
y = df_num['left'].values

# Set class label:
class_names = np.unique(y)

# Set characteristic names:
feature_names = np.array(['satisfaction_level'
    , 'last_evaluation'
    , 'number_project'
    , 'average_monthly_hours'
    , 'time_spend_company'
    , 'salary_int'
                         ])
# Create feature matrix:
X = df_num[feature_names].values

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(7500, 6) (7500,)
(2500, 6) (2500,)
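
The effect of `test_size=0.25` can be reproduced in miniature. The sketch below uses synthetic stand-in data (the names `X_demo` and `y_demo` are illustrative assumptions, not part of the HR dataset). It also passes the optional `stratify` argument, which this notebook does not use, to keep the class ratio identical in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 2 features, 20 % positive class.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([1] * 20 + [0] * 80)

# test_size=0.25 holds out a quarter of the rows; stratify=y_demo keeps
# the 20/80 class ratio in both parts (optional, not used in the notebook).
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)    # row counts of the two parts
print(y_tr.mean(), y_te.mean())  # share of the positive class in each part
```

Stratifying can be worth considering when the target is imbalanced, because a purely random split may leave the test part with a different share of leavers than the training part.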


### 4.3 sklearn

### 4.3.1 Decision Tree
...

**model1** -  Using `DecisionTreeClassifier`, which is a class capable of performing multi-class classification on a dataset.

In [10]:
from sklearn.tree import DecisionTreeClassifier

In [11]:
model1 = DecisionTreeClassifier()

In [12]:
results = cross_val_score(model1, X_train, y_train, cv=5)
np.mean(results)

0.97160043419278552

In [13]:
#model1.fit(X_train, y_train)

In [14]:
#model1.score(X_test, y_test)

### 4.3.1 Bagging Decision Trees

Bagging performs best with algorithms that have high variance. A popular example is decision trees, which are often grown without pruning.
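
To make the resampling step behind bagging concrete, here is a minimal standard-library sketch of drawing one bootstrap sample. The helper `bootstrap_indices` is a hypothetical name introduced for illustration; `BaggingClassifier` does this internally for every estimator.

```python
import random

def bootstrap_indices(n_rows, rng):
    """Draw n_rows row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

rng = random.Random(42)
n_rows = 1000
sample = bootstrap_indices(n_rows, rng)

# Rows that were never drawn are "out-of-bag" for this estimator; on
# average about 1/e (roughly 37 %) of the rows end up out-of-bag.
oob_fraction = 1 - len(set(sample)) / n_rows
print(len(sample), round(oob_fraction, 3))
```

Because every tree sees a slightly different sample, the individual trees disagree on noisy cases, and averaging their votes is what reduces the variance.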

**model2** - Using `BaggingClassifier` with **model1**

In [None]:
from sklearn.ensemble import BaggingClassifier

In [None]:
# The base estimator is passed positionally; its keyword name differs between scikit-learn versions
model2 = BaggingClassifier(model1, n_estimators=100, random_state=42)

In [None]:
results = cross_val_score(model2, X_train, y_train, cv=5)
np.mean(results)

In [None]:
model2.fit(X_train, y_train)

In [None]:
model2.score(X_test, y_test)

### 4.3.2 Random Forests
In random forests, each tree in the ensemble is built from a sample drawn with replacement (i.e., a bootstrap sample) from the training set. In addition, when splitting a node during the construction of the tree, the split that is chosen is no longer the best split among all features. Instead, the split that is picked is the best split among a random subset of the features. As a result of this randomness, the bias of the forest usually slightly increases (with respect to the bias of a single non-random tree) but, due to averaging, its variance also decreases, usually more than compensating for the increase in bias, hence yielding an overall better model.
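
The per-split feature subsampling described above can be sketched in a few lines. This is an illustration only: it assumes the square-root rule for the subset size, a common default for classification forests, and the names `demo_features` and `candidate_features` are hypothetical.

```python
import math
import random

demo_features = [
    "satisfaction_level", "last_evaluation", "number_project",
    "average_monthly_hours", "time_spend_company", "salary_int",
]

def candidate_features(names, rng):
    """Pick the random feature subset that one split is allowed to consider."""
    k = max(1, int(math.sqrt(len(names))))  # sqrt rule: 6 features -> 2
    return rng.sample(names, k)

rng = random.Random(0)
# Each node draws a fresh subset, so different splits see different features.
for node in range(3):
    print(node, candidate_features(demo_features, rng))
```

With only six features, each split chooses between two candidates under this rule, which is what decorrelates the trees from one another.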

**model3** - Using `RandomForestClassifier` 

In [15]:
from sklearn.ensemble import RandomForestClassifier

In [16]:
model3 = RandomForestClassifier()

In [17]:
results = cross_val_score(model3, X_train, y_train, cv=5)
np.mean(results)

0.98319963875539496

In [18]:
#model3.fit(X_train, y_train)

In [19]:
#model3.score(X_test, y_test)

**model4** - `RandomForestClassifier` 

In [20]:
model4 = RandomForestClassifier(n_estimators= 100)

In [21]:
results = cross_val_score(model4, X_train, y_train, cv=5)
np.mean(results)

0.98440008349633357

**model5** - `RandomForestClassifier` 

In [22]:
model5 = RandomForestClassifier(n_estimators=250, min_samples_leaf=1, min_samples_split=2, 
                                min_weight_fraction_leaf=0.0)

In [23]:
results = cross_val_score(model5, X_train, y_train, cv=10)
np.mean(results)

0.98639910044284529

In [24]:
#model5.fit(X_train, y_train)

In [25]:
#model5.score(X_test, y_test)

### 4.3.3 Extra Trees
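Extra Trees (extremely randomized trees) are another tree ensemble related to bagging. Besides considering only a random subset of features at each split, the split thresholds themselves are drawn at random, which usually reduces variance a little further at the cost of slightly higher bias. Note that scikit-learn's `ExtraTreesClassifier` trains each tree on the whole training set by default (`bootstrap=False`) rather than on a bootstrap sample.

**model6** - Using `ExtraTreesClassifier`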
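
The randomness that Extra Trees add at split time can be illustrated with a random cut point. The sketch below is a simplification under stated assumptions: it shows only the random draw for a single feature, whereas the real algorithm draws one random threshold per candidate feature and keeps the best of them. The helper names are hypothetical; the values are the `average_monthly_hours` entries shown in `df.head()`.

```python
import random

def random_split(values, rng):
    """Extra-Trees style: draw the cut point uniformly between min and max."""
    lo, hi = min(values), max(values)
    return rng.uniform(lo, hi)

def partition(values, threshold):
    """Send each value to the left (<= threshold) or right child."""
    left = [v for v in values if v <= threshold]
    right = [v for v in values if v > threshold]
    return left, right

rng = random.Random(7)
hours = [154, 166, 214, 226, 254]  # average_monthly_hours from df.head()
threshold = random_split(hours, rng)
left, right = partition(hours, threshold)
print(round(threshold, 1), left, right)
```

A decision tree or random forest would instead search for the threshold that optimises the split criterion, which is more expensive and more sensitive to the particular sample.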

In [None]:
from sklearn.ensemble import ExtraTreesClassifier

In [None]:
model6 = ExtraTreesClassifier(n_estimators=100, max_features=6)

In [None]:
results = cross_val_score(model6, X_train, y_train, cv=5)
np.mean(results)

In [None]:
model6.fit(X_train, y_train)

In [None]:
model6.score(X_test, y_test)

### 4.3.4 AdaBoost
The core principle of AdaBoost is to fit a sequence of weak learners (i.e., models that are only slightly better than random guessing, such as small decision trees) on repeatedly modified versions of the data. The predictions from all of them are then combined through a weighted majority vote (or sum) to produce the final prediction.
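
The "repeatedly modified versions of the data" are obtained by re-weighting the samples after each round. The following sketch shows one round of the textbook binary AdaBoost update; it assumes a weighted error strictly between 0 and 1, and scikit-learn's own implementation differs in details. The function name `adaboost_round` is hypothetical.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One discrete AdaBoost step: learner weight plus re-weighted samples."""
    # Weighted error of the current weak learner.
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    # Learner weight: larger when the weighted error is smaller.
    alpha = 0.5 * math.log((1 - err) / err)
    # Raise the weight of misclassified samples, lower the rest.
    new_w = [
        w * math.exp(alpha if t != p else -alpha)
        for w, t, p in zip(weights, y_true, y_pred)
    ]
    total = sum(new_w)
    return alpha, [w / total for w in new_w]

y_true = [1, 1, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0]   # the weak learner misses the last sample
weights = [0.2] * 5        # start from uniform weights

alpha, weights = adaboost_round(weights, y_true, y_pred)
print(round(alpha, 3), [round(w, 3) for w in weights])
```

After the update the misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right.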

**model7** - `AdaBoostClassifier`

In [26]:
from sklearn.ensemble import AdaBoostClassifier

In [27]:
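# model4 (a 100-tree random forest) is passed positionally as the base estimator; its keyword name differs between scikit-learn versions
model7 = AdaBoostClassifier(model4)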

In [29]:
results = cross_val_score(model7, X_train, y_train)
np.mean(results)

0.98279989149864944

In [30]:
#model7.fit(X_train, y_train)

In [31]:
#model7.score(X_test, y_test)

### 4.3.5 GradientBoost
Gradient Tree Boosting or Gradient Boosted Regression Trees (GBRT) is a generalization of boosting to arbitrary differentiable loss functions. GBRT is an accurate and effective off-the-shelf procedure that can be used for both regression and classification problems. Gradient Tree Boosting models are used in a variety of areas including Web search ranking and ecology.
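
The idea of boosting on a differentiable loss can be sketched for the simplest case, squared error, where the negative gradient is just the residual. This is an illustrative NumPy sketch with one-split regression stumps on synthetic data, not the implementation behind `GradientBoostingClassifier`, which applies the same principle to a classification loss. The names `fit_stump` and `boost` are hypothetical.

```python
import numpy as np

def fit_stump(x, residual):
    """Best single-threshold regressor for the residuals (squared error)."""
    best = None
    for t in np.unique(x)[:-1]:  # candidate cut points, right side never empty
        left, right = residual[x <= t], residual[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, left_value, right_value = best
    return lambda q: np.where(q <= t, left_value, right_value)

def boost(x, y, n_rounds=50, learning_rate=0.5):
    """Gradient boosting for squared loss: each stump fits the residuals."""
    prediction = np.full(len(y), y.mean())  # start from a constant model
    for _ in range(n_rounds):
        residual = y - prediction           # negative gradient of 1/2 (y - f)^2
        stump = fit_stump(x, residual)
        prediction = prediction + learning_rate * stump(x)
    return prediction

x = np.linspace(0, 6, 60)
y = np.sin(x)
start_mse = np.mean((y - y.mean()) ** 2)
final_mse = np.mean((y - boost(x, y)) ** 2)
print(round(start_mse, 4), round(final_mse, 4))
```

The `learning_rate` shrinks each stump's contribution; smaller values usually need more rounds but tend to generalise better.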

**model8** - `GradientBoostingClassifier`

In [32]:
from sklearn.ensemble import GradientBoostingClassifier

In [33]:
model8 = GradientBoostingClassifier(n_estimators=100, learning_rate=0.5,
                                    max_depth=10, random_state=4)

In [34]:
results = cross_val_score(model8, X_train, y_train, cv=5)
np.mean(results)

0.981199815348066

In [None]:
#model8.fit(X_train, y_train)

In [None]:
#model8.score(X_test, y_test)

### 4.3.6  Voting Ensemble
Voting is a way of combining the predictions from multiple machine learning algorithms. We try combining the predictions of `Bagging Decision Tree`, `Random Forests`, `Extra Trees`, `AdaBoost` and `GradientBoost` for a classification problem.
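
Hard voting, the default mode of `VotingClassifier`, boils down to a per-sample majority vote over the predicted labels. A minimal standard-library sketch follows; the function `hard_vote` and the three prediction lists are hypothetical, and tie-breaking here may differ from scikit-learn's.

```python
from collections import Counter

def hard_vote(predictions_per_model):
    """Majority vote per sample over the label predictions of several models."""
    voted = []
    for sample_votes in zip(*predictions_per_model):
        # most_common(1) returns the label with the highest count.
        voted.append(Counter(sample_votes).most_common(1)[0][0])
    return voted

# Hypothetical predictions of three models for four employees (1 = left).
model_a = [1, 0, 0, 1]
model_b = [1, 1, 0, 0]
model_c = [0, 1, 0, 1]

print(hard_vote([model_a, model_b, model_c]))
```

Using an odd number of models avoids ties in a binary problem such as this one.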

In [35]:
from sklearn.ensemble import VotingClassifier

In [None]:
estimators = []
estimators.append(('Decision Trees', model1))
estimators.append(('Bagging Decision Trees', model2))
estimators.append(('Random Forests', model4))
estimators.append(('Extra Trees', model6))
estimators.append(('AdaBoost', model7))
estimators.append(('GradiantBoost', model8))

In [None]:
ensemble = VotingClassifier(estimators)
results = cross_val_score(ensemble, X_train, y_train, cv=5)
np.mean(results)

In [None]:
ensemble.fit(X_train, y_train)

In [None]:
ensemble.score(X_test, y_test)

Fit the final models on the complete labelled dataset (no rows held out) before predicting on the Kaggle test data.

In [None]:
# Use every labelled row for the final fit; recent scikit-learn versions reject test_size=0 in train_test_split
X_complete_train, y_complete_train = X, y
model7.fit(X_complete_train, y_complete_train)

In [36]:
estimators2 = []
estimators2.append(('Decision Trees', model1))
estimators2.append(('Random Forests', model5))
estimators2.append(('AdaBoost', model7))
estimators2.append(('GradiantBoost', model8))

In [37]:
ensemble2 = VotingClassifier(estimators2)

In [39]:
ensemble2.fit(X_train, y_train)

VotingClassifier(estimators=[('Decision Trees', DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_f...         presort='auto', random_state=4, subsample=1.0, verbose=0,
              warm_start=False))],
         flatten_transform=None, n_jobs=1, voting='hard', weights=None)

In [40]:
# Use every labelled row for the final fit; recent scikit-learn versions reject test_size=0 in train_test_split
X_complete_train, y_complete_train = X, y
ensemble2.fit(X_complete_train, y_complete_train)

VotingClassifier(estimators=[('Decision Trees', DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_f...         presort='auto', random_state=4, subsample=1.0, verbose=0,
              warm_start=False))],
         flatten_transform=None, n_jobs=1, voting='hard', weights=None)

In [None]:
# Kaggle (28.01.18): 0.98732 (Ensemble)
# Kaggle (31.01.18): 0.99132 (Ensemble)
# Kaggle (31.01.18): 0.99132 (RandomForest)
# Kaggle (31.01.18): tbd. (AdaBoost)
# Kaggle (31.01.18): tbd. (Ensemble2)

### 4.4 Import test data

In [41]:
input = pd.read_csv('data/hr_test.csv')
input.rename(columns={'average_montly_hours':'average_monthly_hours','Work_accident':'work_accident'},inplace=True)

In [42]:
input_ID = input['id']
input_ID_list = input_ID.tolist()
# Drop the same columns that were removed from the training data
input = input.drop(['id', 'department', 'work_accident', 'promotion_last_5years'], axis=1)

In [43]:
input['salary_int'] = input['salary'].map(varianten_salary)
input = input.drop(['salary'], axis=1)

### 4.5 Export data
Export the predictions to a .csv file for upload to Kaggle.

In [44]:
# Select the columns in the same order as the training features
prediction = ensemble2.predict(input[feature_names].values)
result = pd.DataFrame({'id':input_ID, 'left':prediction})
result.to_csv('data/20180131_result_ensemble2.csv', index=False)

## Appendix

### Neural Network

#### Data Preprocessing

In [None]:
#from sklearn.preprocessing import StandardScaler

In [None]:
#scaler = StandardScaler()

In [None]:
# Fit only to the training data
#scaler.fit(X_train)

In [None]:
#X_train_scaled = scaler.transform(X_train)
#X_test_scaled = scaler.transform(X_test)

#### Training the model

In [None]:
#from sklearn.neural_network import MLPClassifier

In [None]:
#from sklearn.model_selection import KFold
#kfold = KFold(n_splits=10, shuffle=True, random_state=42)  # random_state only takes effect with shuffle=True
#model6 = MLPClassifier(hidden_layer_sizes=(60, 60, 60, 60), max_iter=2000, random_state=42)

In [None]:
#results = cross_val_score(model6, X_train_scaled, y_train, cv=kfold)
#np.mean(results)

In [None]:
#model6.fit(X_train_scaled,y_train)

In [None]:
#model6.score(X_test_scaled, y_test)