# Random Forest

In this notebook we tune the most important hyperparameters of the Random Forest classifier. We also look at which features the Random Forest considers important.

In [6]:
from imblearn.over_sampling import SMOTE
from tqdm import tqdm
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn import metrics
from sklearn.metrics import (recall_score, confusion_matrix, precision_score,
                             f1_score, accuracy_score, classification_report)

In [2]:
data=pd.read_csv("data.csv")
enc_data=pd.read_csv("encoded_data.csv")

<a id="feature-scaling-and-oversampling"></a>
## Feature scaling and oversampling

In [5]:
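# Separate the features from the target; customerID is dropped along with it
X = enc_data.drop(columns=["Churn", "customerID"])
y = enc_data["Churn"].values

# Stratified split keeps the churn ratio the same in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=5, stratify=y)
# Work on copies so the column assignments below do not write into slices of X
X_train, X_test = X_train.copy(), X_test.copy()

num_cols = ["tenure", "MonthlyCharges", "TotalCharges"]

# Fit the scaler on the training set only, then apply the same transform to the test set
scaler = StandardScaler()
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])

# Oversample the minority class in the training set only
oversample = SMOTE(sampling_strategy="minority")
X_resampled, y_resampled = oversample.fit_resample(X_train, y_train)

pd.Series(y_resampled).value_counts()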

0.0    3880
1.0    3880
Name: count, dtype: int64

## Random Forest

We will try to optimize the following hyperparameters:

* criterion (the split quality measure)
* n_estimators
* max_depth
* min_samples_leaf

In [11]:
model_rf = RandomForestClassifier()

parameters = {"criterion": ("gini", "entropy"),
              "n_estimators": [i * 10 for i in range(1, 31)],
              "max_depth": list(range(1, 16)),
              "min_samples_leaf": list(range(5, 31, 5))}

# Exhaustive search over the grid, scored by F1 on the positive class
clf = GridSearchCV(model_rf, parameters, verbose=4, scoring="f1", n_jobs=-1)
clf.fit(X_resampled, y_resampled)

Fitting 5 folds for each of 5400 candidates, totalling 27000 fits
[CV 3/5] END criterion=gini, max_depth=1, min_samples_leaf=5, n_estimators=10;, score=0.783 total time=   0.0s
[CV 5/5] END criterion=gini, max_depth=1, min_samples_leaf=5, n_estimators=20;, score=0.784 total time=   0.0s
[CV 5/5] END criterion=gini, max_depth=1, min_samples_leaf=5, n_estimators=30;, score=0.794 total time=   0.1s
[CV 3/5] END criterion=gini, max_depth=1, min_samples_leaf=5, n_estimators=50;, score=0.800 total time=   0.1s
[CV 5/5] END criterion=gini, max_depth=1, min_samples_leaf=5, n_estimators=60;, score=0.795 total time=   0.1s
[CV 3/5] END criterion=gini, max_depth=1, min_samples_leaf=5, n_estimators=80;, score=0.804 total time=   0.2s
[CV 1/5] END criterion=gini, max_depth=1, min_samples_leaf=5, n_estimators=100;, score=0.752 total time=   0.2s
[CV 4/5] END criterion=gini, max_depth=1, min_samples_leaf=5, n_estimators=110;, score=0.796 total time=   0.2s
[...]

[CV 1/5] END criterion=entropy, max_depth=15, min_samples_leaf=20, n_estimators=110;, score=0.760 total time=   0.8s
[CV 4/5] END criterion=entropy, max_depth=15, min_samples_leaf=20, n_estimators=120;, score=0.834 total time=   0.8s
[CV 2/5] END criterion=entropy, max_depth=15, min_samples_leaf=20, n_estimators=140;, score=0.787 total time=   1.0s
[CV 5/5] END criterion=entropy, max_depth=15, min_samples_leaf=20, n_estimators=150;, score=0.830 total time=   1.0s
[CV 3/5] END criterion=entropy, max_depth=15, min_samples_leaf=20, n_estimators=170;, score=0.835 total time=   1.1s
[CV 1/5] END criterion=entropy, max_depth=15, min_samples_leaf=20, n_estimators=190;, score=0.761 total time=   1.3s
[CV 4/5] END criterion=entropy, max_depth=15, min_samples_leaf=20, n_estimators=200;, score=0.835 total time=   1.3s
[CV 2/5] END criterion=entropy, max_depth=15, min_samples_leaf=20, n_estimators=220;, score=0.788 total time=   1.5s
[...]

In [12]:
clf.best_params_

{'criterion': 'entropy',
 'max_depth': 13,
 'min_samples_leaf': 5,
 'n_estimators': 120}

The grid search ran 27000 fits (5400 hyperparameter combinations, 5 folds each). Judged by F1-score, the best cross-validation result comes from the entropy criterion with max_depth 13, min_samples_leaf 5 and n_estimators 120.
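
The size of the search follows directly from the grid: the candidates are the Cartesian product of the four value lists, and each candidate is fitted once per fold. A minimal sketch of that arithmetic in plain Python, reusing the same grid as the cell above; `n_folds` is a name introduced here for illustration, set to the fold count reported in the log:

```python
from math import prod

# Same grid as passed to GridSearchCV above
parameters = {
    "criterion": ("gini", "entropy"),
    "n_estimators": [i * 10 for i in range(1, 31)],
    "max_depth": list(range(1, 16)),
    "min_samples_leaf": list(range(5, 31, 5)),
}
n_folds = 5  # fold count reported in the GridSearchCV log

# One candidate per combination of values, one fit per candidate and fold
n_candidates = prod(len(values) for values in parameters.values())
n_fits = n_candidates * n_folds
print(n_candidates, n_fits)  # 5400 27000
```

Running this check before launching a search is a cheap way to see how quickly the cost grows when another value or hyperparameter is added.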

Now let's train a Random Forest with these hyperparameters and see how it performs on our test set.

In [14]:
model_rf = RandomForestClassifier(n_estimators=120, n_jobs = -1, max_depth=13,
                                  min_samples_leaf=5, criterion="entropy")

model_rf.fit(X_resampled, y_resampled)
predicted_y = model_rf.predict(X_test)
print(classification_report(y_test, predicted_y))

              precision    recall  f1-score   support

         0.0       0.88      0.83      0.85      1294
         1.0       0.59      0.67      0.63       467

    accuracy                           0.79      1761
   macro avg       0.73      0.75      0.74      1761
weighted avg       0.80      0.79      0.79      1761



For the churn class (1.0) we got 0.59 precision and 0.67 recall. This is almost the same as what we got in the previous notebook.
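
The f1-score column of the report is the harmonic mean of precision and recall, so it can be checked by hand. A small sketch using the rounded values printed in the report (so the results agree with it only up to rounding); the helper `f1_from` is introduced here for illustration and is not part of scikit-learn:

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

f1_class0 = f1_from(0.88, 0.83)  # class 0.0
f1_class1 = f1_from(0.59, 0.67)  # class 1.0

# Macro average treats both classes equally; weighted average uses the support
macro_f1 = (f1_class0 + f1_class1) / 2
weighted_f1 = (f1_class0 * 1294 + f1_class1 * 467) / (1294 + 467)

print(round(f1_class0, 2), round(f1_class1, 2))  # 0.85 0.63
print(round(macro_f1, 2), round(weighted_f1, 2))  # 0.74 0.79
```

The weighted average sits close to the class 0.0 score because that class makes up about three quarters of the test set, which is why the 1.0 row is the one to watch.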

It seems the weak results may come not from the models we chose but from the data itself: we have tried several ways to improve churn detection, and none of them improved the results significantly.

Let's check which features our Random Forest considers important.

In [18]:
importances = model_rf.feature_importances_
pd.Series(importances)

0     0.020135
1     0.014830
2     0.022080
3     0.022293
4     0.136983
5     0.005589
6     0.016833
7     0.040780
8     0.018152
9     0.015795
10    0.026832
11    0.015092
12    0.020352
13    0.041178
14    0.100317
15    0.097931
16    0.014124
17    0.055226
18    0.022265
19    0.131674
20    0.021687
21    0.061850
22    0.009860
23    0.010567
24    0.050574
25    0.006997
dtype: float64
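
The raw series is hard to read because it is unsorted and indexed by position. A minimal sketch that ranks the values printed above and keeps the five largest; the positions refer to the column order of `X`, and the names `ranked` and `top5` are introduced here for illustration:

```python
# Importance values copied from the output above, in column order
importances = [
    0.020135, 0.014830, 0.022080, 0.022293, 0.136983, 0.005589, 0.016833,
    0.040780, 0.018152, 0.015795, 0.026832, 0.015092, 0.020352, 0.041178,
    0.100317, 0.097931, 0.014124, 0.055226, 0.022265, 0.131674, 0.021687,
    0.061850, 0.009860, 0.010567, 0.050574, 0.006997,
]

# Pair each value with its column position and sort from most to least important
ranked = sorted(enumerate(importances), key=lambda pair: pair[1], reverse=True)
top5 = ranked[:5]

for position, value in top5:
    print(f"column {position}: {value:.3f}")
```

The columns at positions 4, 19, 14 and 15 together account for almost half of the total importance. Inside the notebook, `pd.Series(model_rf.feature_importances_, index=X.columns).sort_values(ascending=False)` should give the same ranking with feature names instead of positions, assuming `X` still has the columns the model was fitted on.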