## Training a Random Forest Classifier

In [1]:
import pandas as pd
file_url = 'https://raw.githubusercontent.com/PacktWorkshops/The-Data-Science-Workshop/master/Chapter04/Dataset/activity.csv'
df = pd.read_csv(file_url)
df.head()

Unnamed: 0,avg_rss12,var_rss12,avg_rss13,var_rss13,avg_rss23,var_rss23,Activity
0,42.0,0.0,18.5,0.5,12.0,0.0,bending1
1,42.0,0.0,18.0,0.0,11.33,0.94,bending1
2,42.75,0.43,16.75,1.79,18.25,0.43,bending1
3,42.5,0.5,16.75,0.83,19.0,1.22,bending1
4,43.0,0.82,16.25,0.83,18.0,0.0,bending1


In [2]:
# Separate target variable and features
target = df.pop('Activity')

The model uses the training set to learn the parameters it needs to predict the response variable. The test set is used to check whether the model can accurately predict unseen data. We say the model is overfitting when it has learned patterns relevant only to the training set and therefore makes incorrect predictions on the testing set.

The sklearn package provides a function called train_test_split() to randomly split the dataset into two different sets. We need to specify the following parameters for this function: the feature and target variables, the ratio of the testing set (test_size), and random_state in order to get reproducible results if we have to run the code again:

In [4]:
# Split the dataset into training and test sets.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df, target, test_size=0.33, random_state=43)
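
As a quick check of what test_size and random_state do, here is a minimal sketch on a hypothetical toy array (the data and the variable names are assumptions made for illustration, not part of the activity dataset). It shows that the two subsets together cover every row and that repeating the call with the same random_state gives the same partition.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 3 features, binary target
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)

# Two calls with the same random_state should produce the same partition
X_tr_a, X_te_a, y_tr_a, y_te_a = train_test_split(X, y, test_size=0.33, random_state=43)
X_tr_b, X_te_b, y_tr_b, y_te_b = train_test_split(X, y, test_size=0.33, random_state=43)

print(X_tr_a.shape, X_te_a.shape)       # sizes of the training and testing features
print(np.array_equal(X_te_a, X_te_b))   # whether both calls selected the same test rows
```

Changing random_state to another value would shuffle the rows differently and give a different, but equally valid, split.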

In [11]:
# Uncomment to inspect the four subsets
# print(X_train) # features for training
# print(X_test)  # features for testing
# print(y_train) # response (target) for training
# print(y_test)  # response (target) for testing

In [5]:
# import RandomForestClassifier class
from sklearn.ensemble import RandomForestClassifier
# instantiate the model with its hyperparameters (10 trees, fixed seed for reproducibility)
rf_model = RandomForestClassifier(random_state=1, n_estimators=10)

The next step is to train (also called fit) the model with the training data.

In [8]:
rf_model.fit(X_train, y_train)

RandomForestClassifier(n_estimators=10, random_state=1)

Now that the model has completed its training, we can use the parameters it learned to make predictions on any input data we provide. We start with the training set itself:

In [9]:
preds = rf_model.predict(X_train)
preds

array(['standing', 'lying', 'standing', ..., 'cycling', 'standing',
       'standing'], dtype=object)

## Evaluating the Model's Performance

In [12]:
# import the accuracy_score() function
from sklearn.metrics import accuracy_score
accuracy_score(y_train, preds)

0.9868110658374437

In [14]:
test_preds = rf_model.predict(X_test)
accuracy_score(y_test, test_preds)

0.7768667005297148

The gap between the accuracy on the training set (0.99) and on the testing set (0.78) is quite big. This tells us our model is overfitting: it has learned patterns that are relevant only to the training set. In an ideal case, the performance of your model should be equal or very close to equal on those two sets.
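
The comparison above can be wrapped in a small helper. This is only a sketch: accuracy_gap is a hypothetical function name, and the data comes from make_classification as a stand-in so the block runs without downloading the activity dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def accuracy_gap(model, X_train, y_train, X_test, y_test):
    """Return (train accuracy, test accuracy, train minus test) for a fitted model."""
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    return train_acc, test_acc, train_acc - test_acc

# Stand-in data (assumption): 600 synthetic rows, 6 features, 3 classes
X, y = make_classification(n_samples=600, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=43)

model = RandomForestClassifier(random_state=1, n_estimators=10).fit(X_tr, y_tr)
train_acc, test_acc, gap = accuracy_gap(model, X_tr, y_tr, X_te, y_te)
print(f"train={train_acc:.3f} test={test_acc:.3f} gap={gap:.3f}")
```

A large positive gap is the signal of overfitting discussed above. How large is too large depends on the problem.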

In the next sections, we will look at tuning some Random Forest hyperparameters in order to reduce overfitting.

## Number of Trees Estimator

The first hyperparameter you will look at in this section is called n_estimators. This hyperparameter is responsible for defining the number of trees that will be trained by the RandomForest algorithm.

A tree is a logical graph that maps a decision and its outcomes at each of its nodes. Simply speaking, it is a series of yes/no (or true/false) questions that lead to different outcomes.

A leaf is a special type of node where the model will make a prediction. There will be no split after a leaf.

As you may have guessed by now, the n_estimators hyperparameter is used to specify the number of trees the RandomForest algorithm will build. For example (as in the model trained earlier), say we ask it to build 10 trees. For a given observation, it will ask each tree to make a prediction. Then, it will combine those predictions and use the result as the final prediction for this input. In scikit-learn, RandomForestClassifier does this by averaging the class probabilities predicted by the trees and picking the class with the highest average, which usually agrees with a majority vote. For instance, if, out of 10 trees, 8 of them predict the outcome sitting, then the RandomForest algorithm will typically use this outcome as the final prediction.
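
To make the combination step concrete, the sketch below rebuilds a forest's prediction by hand from its individual trees, which scikit-learn exposes through the estimators_ attribute. The synthetic data is an assumption used only to keep the example self-contained.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Stand-in data (assumption): 300 synthetic rows, 6 features, 3 classes
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)
forest = RandomForestClassifier(n_estimators=10, random_state=1).fit(X, y)

# Class probabilities from each tree, shape (n_trees, n_samples, n_classes)
per_tree = np.stack([tree.predict_proba(X) for tree in forest.estimators_])

# Average over the trees, then pick the class with the highest mean probability
mean_proba = per_tree.mean(axis=0)
manual_preds = forest.classes_[mean_proba.argmax(axis=1)]

print(np.allclose(mean_proba, forest.predict_proba(X)))   # averaged probabilities match
print(np.array_equal(manual_preds, forest.predict(X)))    # final predictions match
```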

## Maximum Depth

There are different hyperparameters that can help to lower the risk of overfitting for Random Forest and one of them is called max_depth.

This hyperparameter defines the maximum depth of the trees built by Random Forest. Basically, it tells the Random Forest model how many nodes (questions) it can chain together before making a prediction. But how will that help to reduce overfitting, you may ask. Well, let's say you built a single tree and set the max_depth hyperparameter to 50. This would mean that there could be some cases where the tree asks up to 50 questions in a row before making a prediction (in scikit-learn, the depth counts the splits on the path from the root to a leaf). So, the logic would be: IF X1 > value1 AND X2 > value2 AND X1 <= value3 AND … AND X3 > value50 THEN predict class A

As you can imagine, this is a very specific rule. In the end, it may apply to only a few observations in the training set, with this case appearing very infrequently. Therefore, your model would be overfitting. By default, the value of this max_depth parameter is None, which means there is no limit set for the depth of the trees.

What you really want is to find some rules that are generic enough to be applied to bigger groups of observations. This is why it is recommended to not create deep trees with Random Forest.
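
The effect of max_depth on the individual trees can be inspected directly. The following sketch, which uses synthetic stand-in data (an assumption), compares the depth of each tree in an unrestricted forest with the depth of each tree in a forest capped at max_depth=3.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Stand-in data (assumption): 500 synthetic rows, 6 features, 3 classes
X, y = make_classification(n_samples=500, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)

# max_depth=None (the default) lets every tree grow until its leaves are pure
unlimited = RandomForestClassifier(n_estimators=10, random_state=1).fit(X, y)
shallow = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=1).fit(X, y)

# get_depth() reports the depth actually reached by each fitted tree
unlimited_depths = [tree.get_depth() for tree in unlimited.estimators_]
shallow_depths = [tree.get_depth() for tree in shallow.estimators_]

print("no limit   :", unlimited_depths)
print("max_depth=3:", shallow_depths)
```

Trees in the capped forest can never exceed a depth of 3, so each prediction rests on at most three questions.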

In [15]:
rf_model4 = RandomForestClassifier(random_state=1, n_estimators=50, max_depth=3) #instantiate RandomForestClassifier
rf_model4.fit(X_train, y_train) #fit training dataset
train_preds4 = rf_model4.predict(X_train) #predict training set
test_preds4 = rf_model4.predict(X_test) #predict testing set
train_acc4 = accuracy_score(y_train, train_preds4) #score training set
test_acc4 = accuracy_score(y_test, test_preds4) #score testing set
print(train_acc4)
print(test_acc4)

0.6076202730716992
0.6077933386546694


For a max_depth of 3, we got extremely similar results for the training and testing sets but the overall performance decreased drastically to 0.61. Our model is not overfitting anymore, but it is now underfitting; that is, it is not predicting the target variable very well (only in 61% of cases).

In [17]:
rf_model5 = RandomForestClassifier(random_state=1, n_estimators=50, max_depth=10) #instantiate RandomForestClassifier
rf_model5.fit(X_train, y_train) #fit training dataset
train_preds5 = rf_model5.predict(X_train) #predict training set
test_preds5 = rf_model5.predict(X_test) #predict testing set
train_acc5 = accuracy_score(y_train, train_preds5) #score training set
test_acc5 = accuracy_score(y_test, test_preds5) #score testing set
print(train_acc5)
print(test_acc5)

0.8084923868754021
0.7637326754226834


Compared with max_depth = 3, the accuracy increased on both sets (0.81 for training and 0.76 for testing), and the two scores are still relatively close. We are starting to get some good results, but the model is still slightly overfitting.

In [18]:
rf_model6 = RandomForestClassifier(random_state=1, n_estimators=50, max_depth=50) #instantiate RandomForestClassifier
rf_model6.fit(X_train, y_train) #fit training dataset
train_preds6 = rf_model6.predict(X_train) #predict training set
test_preds6 = rf_model6.predict(X_test) #predict testing set
train_acc6 = accuracy_score(y_train, train_preds6) #score training set
test_acc6 = accuracy_score(y_test, test_preds6) #score testing set
print(train_acc6)
print(test_acc6)

0.9969618986346415
0.7931935273202235


The accuracy jumped to 0.99 for the training set but it didn't improve much for the testing set. So, the model is overfitting with max_depth = 50. It seems the sweet spot to get good predictions and not much overfitting is around 10 for the max_depth hyperparameter in this dataset.
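
The three experiments above repeat the same steps with a different max_depth each time, so they can be folded into a loop. This is a sketch only: evaluate_depths is a hypothetical helper, and the synthetic data stands in for the activity dataset, so the scores it prints will differ from the ones reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def evaluate_depths(X_train, X_test, y_train, y_test, depths):
    """Fit one forest per max_depth value and return {depth: (train_acc, test_acc)}."""
    results = {}
    for depth in depths:
        model = RandomForestClassifier(random_state=1, n_estimators=50, max_depth=depth)
        model.fit(X_train, y_train)
        train_acc = accuracy_score(y_train, model.predict(X_train))
        test_acc = accuracy_score(y_test, model.predict(X_test))
        results[depth] = (train_acc, test_acc)
    return results

# Stand-in data (assumption): 600 synthetic rows, 6 features, 3 classes
X, y = make_classification(n_samples=600, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=43)

results = evaluate_depths(X_tr, X_te, y_tr, y_te, depths=[3, 10, 50])
for depth, (train_acc, test_acc) in results.items():
    print(f"max_depth={depth}: train={train_acc:.3f}, test={test_acc:.3f}")
```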

## Minimum Sample in Leaf

The next hyperparameter is min_samples_leaf. It specifies the minimum number of observations (or samples) that must fall into a leaf node for a split to be kept in the tree.

For instance, if we set min_samples_leaf to 3, then RandomForest will only consider a split that leaves at least three observations in each of the left and right child nodes.
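
This constraint can be verified on a fitted tree. The sketch below uses a single DecisionTreeClassifier on synthetic stand-in data (both are assumptions made to keep the example small) and reads the leaf sizes from the tree_ attribute, where leaves are the nodes whose children_left entry is -1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): 300 synthetic rows, 6 features, 3 classes
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)

tree = DecisionTreeClassifier(min_samples_leaf=3, random_state=1).fit(X, y)

# Leaf nodes have no children; scikit-learn marks them with -1
is_leaf = tree.tree_.children_left == -1
leaf_sizes = tree.tree_.n_node_samples[is_leaf]

print("number of leaves  :", int(is_leaf.sum()))
print("smallest leaf size:", int(leaf_sizes.min()))
```

No leaf ends up with fewer than three training observations, because any split that would create a smaller leaf is rejected.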

In [20]:
rf_model7 = RandomForestClassifier(random_state=1, n_estimators=50, max_depth=10, min_samples_leaf=3)
rf_model7.fit(X_train, y_train)

# predictions
train_preds7 = rf_model7.predict(X_train)
test_preds7 = rf_model7.predict(X_test)

# accuracy scores
train_acc7 = accuracy_score(y_train, train_preds7)
test_acc7 = accuracy_score(y_test, test_preds7)

print(train_acc7)
print(test_acc7)

0.8032382586317821
0.7589434728974676


With min_samples_leaf=3, the accuracy for both the training and testing sets didn't change much compared to the best model we found in the previous section. Let's try increasing it to 10:

In [21]:
rf_model8 = RandomForestClassifier(random_state=1, n_estimators=50, max_depth=10, min_samples_leaf=10)
rf_model8.fit(X_train, y_train)

# predictions
train_preds8 = rf_model8.predict(X_train)
test_preds8 = rf_model8.predict(X_test)

# accuracy scores
train_acc8 = accuracy_score(y_train, train_preds8)
test_acc8 = accuracy_score(y_test, test_preds8)

print(train_acc8)
print(test_acc8)

0.7911930802773608
0.761120383136202


Now the accuracy of the training set dropped a bit but increased for the testing set and their difference is smaller now. So, our model is overfitting less. Let's try another value for this hyperparameter – 25:

In [22]:
rf_model9 = RandomForestClassifier(random_state=1, n_estimators=50, max_depth=10, min_samples_leaf=25)
rf_model9.fit(X_train, y_train)

# predictions
train_preds9 = rf_model9.predict(X_train)
test_preds9 = rf_model9.predict(X_test)

# accuracy scores
train_acc9 = accuracy_score(y_train, train_preds9)
test_acc9 = accuracy_score(y_test, test_preds9)

print(train_acc9)
print(test_acc9)

0.7774680105797412
0.7544445250707495


Both accuracies for the training and testing sets decreased but they are quite close to each other now. So, we will keep this value (25) as the optimal one for this dataset as the performance is still OK and we are not overfitting too much.
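
To compare the runs side by side, the sketch below tabulates the accuracy scores printed above (rounded to four decimals) together with the gap between training and testing. The first row is the max_depth=10 model from the previous section, which uses scikit-learn's default min_samples_leaf of 1.

```python
# (train accuracy, test accuracy) copied from the outputs above, rounded to 4 decimals
reported = {
    1: (0.8085, 0.7637),   # default value, model from the previous section
    3: (0.8032, 0.7589),
    10: (0.7912, 0.7611),
    25: (0.7775, 0.7544),
}

# Gap between training and testing accuracy for each setting
gaps = {leaf: train - test for leaf, (train, test) in reported.items()}

for leaf, (train, test) in reported.items():
    print(f"min_samples_leaf={leaf:>2}: train={train:.4f} test={test:.4f} gap={gaps[leaf]:.4f}")
```

The gap shrinks steadily as min_samples_leaf grows, while the testing accuracy stays within about one percentage point.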

When choosing the optimal value for this hyperparameter, you need to be careful: a value that's too low will increase the chance of the model overfitting, but on the other hand, setting a very high value will lead to underfitting (the model will not accurately predict the right outcome).

For instance, if you have a dataset of 1000 rows and set min_samples_leaf to 400, the model will not be able to find good splits to predict 5 different classes. In this case, each tree can create at most one split (the larger side of that split holds at most 600 rows, which cannot be divided into two leaves of 400 rows each), so it can predict at most two different classes instead of 5. It is good practice to start with low values first and then progressively increase them until you reach satisfactory performance.
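
The 1000-row scenario can be reproduced with a single decision tree. This is a sketch under stated assumptions: the five-class data is synthetic, and a DecisionTreeClassifier is used instead of a full forest so that the effect on one tree is easy to see.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): 1000 synthetic rows, 6 features, 5 classes
X, y = make_classification(n_samples=1000, n_features=6, n_informative=4,
                           n_classes=5, n_clusters_per_class=1, random_state=0)

# Every leaf must hold at least 400 of the 1000 rows
tree = DecisionTreeClassifier(min_samples_leaf=400, random_state=1).fit(X, y)

n_leaves = tree.get_n_leaves()
n_predicted = len(np.unique(tree.predict(X)))

print("leaves           :", n_leaves)
print("classes predicted:", n_predicted, "out of", len(np.unique(y)))
```

With at most two leaves, the tree can output at most two of the five classes, which is the underfitting described above.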