<img src='logo/dsl-logo.png' width="500" align="center" />

# HR Competition

## Random Forest Model for Kaggle

### Initializations

In [13]:
# Import libraries
import numpy as np
import scipy.stats as stats
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import cross_val_score, learning_curve
from sklearn.preprocessing import MinMaxScaler
%matplotlib inline

### Load Data

First, both the training data and, for the first time, the provided test data are loaded. In the training data, the `hasLeftCompany` column plays a special role: it supplies the y labels for fitting the model. In the test data, the `id` column has to be set aside first, because it is irrelevant for the predictions but is needed again at the end to export the result. Unlike in the previous approach, no test data is split off from the labelled training data this time, so all labelled data is available for training the models. Parameters are selected solely on the basis of cross-validation, and the final result can be assessed by the score on Kaggle.
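
Selecting parameters purely by cross-validation could look roughly like the sketch below. It is only an illustration under stated assumptions, not the search that produced the hyperparameters used later: the data is synthetic (`make_classification`), and the grid values as well as the names `X_demo`, `y_demo`, and `search` are made up for brevity.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the labelled training data (assumption: not the HR data)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Hypothetical, deliberately small parameter grid
param_grid = {'max_features': [0.3, 0.6], 'n_estimators': [20, 50]}

# Every parameter combination is scored by 5-fold cross-validation;
# no separate hold-out set is split off
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring='accuracy',
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

With `refit=True` (the default), `search.best_estimator_` is retrained on all of the data passed to `fit`, which matches the idea of using every labelled row for training.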

In [29]:
# Load the training data
df = pd.read_pickle('exchange/hr_01_cleaned_train.pkl')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 10 columns):
satisfactionLevel       10000 non-null float64
yearsSinceEvaluation    10000 non-null float64
numberOfProjects        10000 non-null int64
averageMonthlyHours     10000 non-null int64
yearsAtCompany          10000 non-null int64
workAccident            10000 non-null category
hasLeftCompany          10000 non-null category
gotPromotion            10000 non-null category
department              10000 non-null category
salary                  10000 non-null category
dtypes: category(5), float64(2), int64(3)
memory usage: 439.7 KB


In [55]:
# Convert dtype from category to object (str)
for col in df.select_dtypes(['category']):
    print('transforming', col)
    df[col] = df[col].astype('str')

transforming workAccident
transforming hasLeftCompany
transforming gotPromotion
transforming department
transforming salary


In [56]:
# One-hot encode the categorical features ('department' is dropped) and re-attach the label
df = pd.get_dummies(df.drop(['hasLeftCompany', 'department'], axis=1)).join(df[['hasLeftCompany']])
df.head()

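| | satisfactionLevel | yearsSinceEvaluation | numberOfProjects | averageMonthlyHours | yearsAtCompany | workAccident_0 | workAccident_1 | gotPromotion_0 | gotPromotion_1 | salary_high | salary_low | salary_medium | hasLeftCompany |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.65 | 0.96 | 5 | 226 | 2 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 |
| 1 | 0.88 | 0.8 | 3 | 166 | 2 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 2 | 0.69 | 0.98 | 3 | 214 | 2 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 3 | 0.41 | 0.47 | 2 | 154 | 3 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
| 4 | 0.87 | 0.76 | 5 | 254 | 2 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 |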


In [57]:
y_train = df['hasLeftCompany'].values
y_train

array(['0', '0', '0', ..., '0', '0', '1'], dtype=object)

In [58]:
X_train = df.drop(['hasLeftCompany'], axis=1).values
X_train

array([[ 0.65,  0.96,  5.  , ...,  0.  ,  0.  ,  1.  ],
       [ 0.88,  0.8 ,  3.  , ...,  0.  ,  1.  ,  0.  ],
       [ 0.69,  0.98,  3.  , ...,  0.  ,  1.  ,  0.  ],
       ..., 
       [ 0.83,  0.86,  4.  , ...,  0.  ,  1.  ,  0.  ],
       [ 0.74,  0.56,  4.  , ...,  0.  ,  1.  ,  0.  ],
       [ 0.11,  0.88,  7.  , ...,  0.  ,  0.  ,  1.  ]])

In [59]:
# Scale every feature to the range [0, 1]; the scaler is fitted on the training data only
scaler = MinMaxScaler()

In [60]:
X_train_scaled = scaler.fit_transform(X_train)

In [61]:
# Load the test data
df = pd.read_pickle('exchange/hr_01_cleaned_test.pkl')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4999 entries, 0 to 4998
Data columns (total 10 columns):
id                      4999 non-null int64
satisfactionLevel       4999 non-null float64
yearsSinceEvaluation    4999 non-null float64
numberOfProjects        4999 non-null int64
averageMonthlyHours     4999 non-null int64
yearsAtCompany          4999 non-null int64
workAccident            4999 non-null category
gotPromotion            4999 non-null category
department              4999 non-null category
salary                  4999 non-null category
dtypes: category(4), float64(2), int64(4)
memory usage: 254.1 KB


In [62]:
# Convert dtype from category to object (str)
for col in df.select_dtypes(['category']):
    print('transforming', col)
    df[col] = df[col].astype('str')

transforming workAccident
transforming gotPromotion
transforming department
transforming salary


In [63]:
# One-hot encode the categorical features ('department' is dropped) and re-attach the id
df = pd.get_dummies(df.drop(['id', 'department'], axis=1)).join(df[['id']])
df.head()

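| | satisfactionLevel | yearsSinceEvaluation | numberOfProjects | averageMonthlyHours | yearsAtCompany | workAccident_0 | workAccident_1 | gotPromotion_0 | gotPromotion_1 | salary_high | salary_low | salary_medium | id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.81 | 0.96 | 4 | 219 | 2 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 10000 |
| 1 | 0.86 | 0.84 | 4 | 246 | 6 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 10001 |
| 2 | 0.9 | 0.66 | 4 | 242 | 3 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 10002 |
| 3 | 0.37 | 0.54 | 2 | 131 | 3 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 10003 |
| 4 | 0.52 | 0.96 | 3 | 271 | 3 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 10004 |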
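
The training and test frames are encoded by two separate `pd.get_dummies` calls. That yields identical feature columns here, but only because every category level occurs in both files. A defensive variant, sketched on toy data, reindexes the test frame to the training columns. The frames `train_demo` and `test_demo` and their values are assumptions made for illustration.

```python
import pandas as pd

# Toy frames (assumption: illustrative values only); the test frame lacks the level 'high'
train_demo = pd.DataFrame({'salary': ['low', 'medium', 'high'], 'hours': [150, 200, 250]})
test_demo = pd.DataFrame({'salary': ['low', 'medium'], 'hours': [160, 210]})

train_enc = pd.get_dummies(train_demo)
test_enc = pd.get_dummies(test_demo)

# Reindex the test frame to the training columns:
# dummies missing from the test data are added and filled with 0,
# and the column order is forced to match the training frame
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(train_enc.columns))
print(list(test_enc.columns))
```

Because the model later receives plain arrays via `.values`, column order matters as much as column presence, so aligning before that step avoids silently shifted features.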


In [64]:
# Set the ids aside; they are needed again for the submission file
ids = df['id']
ids.head()

0    10000
1    10001
2    10002
3    10003
4    10004
Name: id, dtype: int64

In [65]:
X_test = df.drop(['id'], axis=1).values
X_test

array([[ 0.81,  0.96,  4.  , ...,  0.  ,  1.  ,  0.  ],
       [ 0.86,  0.84,  4.  , ...,  0.  ,  1.  ,  0.  ],
       [ 0.9 ,  0.66,  4.  , ...,  1.  ,  0.  ,  0.  ],
       ..., 
       [ 0.66,  0.73,  5.  , ...,  0.  ,  0.  ,  1.  ],
       [ 0.79,  1.  ,  4.  , ...,  0.  ,  1.  ,  0.  ],
       [ 0.98,  0.86,  2.  , ...,  0.  ,  1.  ,  0.  ]])

In [66]:
X_test_scaled = scaler.transform(X_test)
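
The scaler is fitted on the training data and only applied to the test data. A small sketch of what that implies, using made-up one-column arrays (`train_demo`, `test_demo`, and `scaler_demo` are assumptions for illustration): test values outside the training range map outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data (assumption): training range is 100..300, the test data contains 350
train_demo = np.array([[100.0], [200.0], [300.0]])
test_demo = np.array([[150.0], [350.0]])

scaler_demo = MinMaxScaler()
# fit_transform learns min and max from the training data only
train_scaled = scaler_demo.fit_transform(train_demo)
# transform reuses the training min and max; nothing is re-learned from the test data
test_scaled = scaler_demo.transform(test_demo)

print(train_scaled.ravel())
print(test_scaled.ravel())  # the second value lies above 1 because 350 > training max
```

For a random forest this is harmless, since tree splits depend only on the ordering of feature values; the scaling mainly keeps the preprocessing consistent with models that are sensitive to feature scale.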

### Predict Kaggle Data

First, the random forest classifier is used with the previously determined hyperparameters. The predictions are exported together with the id to a CSV file so that they can be uploaded to Kaggle.

In [26]:
clf = RandomForestClassifier(criterion='gini', max_features=0.6, n_estimators=300)

In [27]:
clf.fit(X_train_scaled, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features=0.6, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=300, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [28]:
scores = cross_val_score(clf, X_train_scaled, y_train, cv=10, n_jobs=-1)
scores.mean()

0.9886001886001885
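
The mean alone hides how much the score varies between folds, and without a fixed `random_state` the value changes slightly from run to run. A possible way to report both mean and spread reproducibly is sketched below on synthetic data; `X_demo`, `y_demo`, `clf_demo`, and the reduced `n_estimators=50` are assumptions chosen to keep the example fast, not the settings used above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the scaled training data (assumption: not the HR data)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Fixed seeds make both the forest and the fold assignment reproducible
clf_demo = RandomForestClassifier(n_estimators=50, max_features=0.6, random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

scores_demo = cross_val_score(clf_demo, X_demo, y_demo, cv=cv)

# Report the mean accuracy together with its standard deviation across folds
print(f'{scores_demo.mean():.3f} +/- {scores_demo.std():.3f}')
```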

In [30]:
predictions = clf.predict(X_test_scaled)
list(predictions);

In [31]:
list(ids);

In [84]:
df = pd.DataFrame(
    {'id': ids,
     'left': predictions
    })
df.head()

| | id | left |
|---|---|---|
| 0 | 10000 | 0 |
| 1 | 10001 | 1 |
| 2 | 10002 | 0 |
| 3 | 10003 | 1 |
| 4 | 10004 | 0 |


In [87]:
df.to_csv('kaggle/random_forest.csv', index=False)

**Result on Kaggle:** 99.066%