## Random Forest model with large data set

We are going to use the same Iris dataset from sklearn to make the comparison easier.


In [1]:
#import everything
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn

Load the dataset from sklearn

In [2]:
from sklearn import datasets

iris = datasets.load_iris()
iris

{'data': array([[5.1, 3.5, 1.4, 0.2],
        [4.9, 3. , 1.4, 0.2],
        [4.7, 3.2, 1.3, 0.2],
        [4.6, 3.1, 1.5, 0.2],
        [5. , 3.6, 1.4, 0.2],
        [5.4, 3.9, 1.7, 0.4],
        [4.6, 3.4, 1.4, 0.3],
        [5. , 3.4, 1.5, 0.2],
        [4.4, 2.9, 1.4, 0.2],
        [4.9, 3.1, 1.5, 0.1],
        [5.4, 3.7, 1.5, 0.2],
        [4.8, 3.4, 1.6, 0.2],
        [4.8, 3. , 1.4, 0.1],
        [4.3, 3. , 1.1, 0.1],
        [5.8, 4. , 1.2, 0.2],
        [5.7, 4.4, 1.5, 0.4],
        [5.4, 3.9, 1.3, 0.4],
        [5.1, 3.5, 1.4, 0.3],
        [5.7, 3.8, 1.7, 0.3],
        [5.1, 3.8, 1.5, 0.3],
        [5.4, 3.4, 1.7, 0.2],
        [5.1, 3.7, 1.5, 0.4],
        [4.6, 3.6, 1. , 0.2],
        [5.1, 3.3, 1.7, 0.5],
        [4.8, 3.4, 1.9, 0.2],
        [5. , 3. , 1.6, 0.2],
        [5. , 3.4, 1.6, 0.4],
        [5.2, 3.5, 1.5, 0.2],
        [5.2, 3.4, 1.4, 0.2],
        [4.7, 3.2, 1.6, 0.2],
        [4.8, 3.1, 1.6, 0.2],
        [5.4, 3.4, 1.5, 0.4],
        [5.2, 4.1, 1.5, 0.1],
  

Now we are going to build a dataframe for the data.

In [3]:
# construct dataframe from iris data set
df = pd.DataFrame(data = iris.data, columns = ['sepal length (cm)',
  'sepal width (cm)',
  'petal length (cm)',
  'petal width (cm)'])
df['class'] = iris.target
df.head()

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | class |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |


In [4]:
data = df.iloc[:, 0:2] # select the first two features (sepal length and sepal width) as predictors;
                       # you can choose other columns instead
target = df['class']
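
Since the choice of features is left open, the following is a minimal sketch of selecting columns by name instead of by position. The variable names `chosen` and `data_alt` and the choice of the two petal columns are assumptions made for illustration; the rest of this example keeps using the first two columns.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()

# Rebuild the dataframe as above, taking the column names from the dataset itself.
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['class'] = iris.target

# Select features by name instead of by position.
# The two petal columns are only an illustrative choice.
chosen = ['petal length (cm)', 'petal width (cm)']
data_alt = df[chosen]
target = df['class']

print(data_alt.shape)
```

Selecting by name makes it harder to pick the wrong column by accident if the column order ever changes.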

### Train the model

In [5]:
from sklearn.model_selection import train_test_split

# split our data into training and testing set
X_train, X_test, y_train, y_test = train_test_split(data, target, random_state = 42)
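
The call above does not pass `test_size`, so `train_test_split` falls back to its default and holds out 25% of the rows. The sketch below checks the resulting sizes and, as an optional variation that is not part of the original example, shows a stratified split (the `Xs_*` and `ys_*` names are hypothetical).

```python
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['class'] = iris.target
data = df.iloc[:, 0:2]
target = df['class']

# Default split: test_size is not given, so 25% of the rows are held out.
X_train, X_test, y_train, y_test = train_test_split(data, target, random_state=42)
print(X_train.shape, X_test.shape)

# Illustrative variation: stratify=target keeps the class proportions
# similar in the training and testing sets.
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    data, target, random_state=42, stratify=target)
print(ys_test.value_counts().sort_index().tolist())
```

With 150 rows the test set has 38 samples, so each test sample accounts for about 0.026 of the accuracy score.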

In [6]:
from sklearn.ensemble import RandomForestClassifier

# build 2 random forests with different numbers of trees:
# the first forest has 2 trees; the second (large) forest has 150 trees
# random_state seeds the random number generator that controls the bootstrap sampling
# and the feature selection, so the results are reproducible
# we set random_state = 42 to be consistent across the whole example
rdForest = RandomForestClassifier(n_estimators = 2, random_state = 42)
rdForest_lg = RandomForestClassifier(n_estimators = 150, random_state = 42)

# train both forests
rdForest_lg.fit(X_train, y_train)
rdForest.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=2, n_jobs=None,
            oob_score=False, random_state=42, verbose=0, warm_start=False)

### Test our models

We check their accuracy scores as a measure of performance.

In [7]:
from sklearn.metrics import accuracy_score

print(accuracy_score(y_test, rdForest.predict(X_test)))
print(accuracy_score(y_test, rdForest_lg.predict(X_test)))

0.6842105263157895
0.7631578947368421


The random forest with only 2 trees is less accurate than the one with 150 trees (0.684 versus 0.763).
Each tree in a random forest is built from a random sample of the training data and considers a random subset of features, so individual trees make different errors. Combining a large number of trees averages out much of this error.
A larger forest therefore tends to be more accurate, although the gain levels off and the time cost rises.
For this example, what is an optimal number of trees for our forest?
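
Before searching for that number, the idea that many trees cancel out each other's errors can be illustrated with a simplified calculation. This is a sketch under strong assumptions, not a model of what sklearn does: it assumes a two-class problem, trees that err independently, and a hard majority vote, whereas real trees are correlated, Iris has three classes, and sklearn combines the trees' class probabilities. The function name `majority_vote_accuracy` and the per-tree accuracy of 0.7 are hypothetical.

```python
from math import comb

def majority_vote_accuracy(n_trees, p_correct):
    """Probability that a strict majority of n_trees independent voters is right.

    Assumes a binary decision and an odd n_trees, so ties cannot occur.
    """
    needed = n_trees // 2 + 1
    return sum(
        comb(n_trees, k) * p_correct**k * (1 - p_correct)**(n_trees - k)
        for k in range(needed, n_trees + 1)
    )

# Each hypothetical tree is right 70% of the time on its own.
for n in (1, 3, 11, 51):
    print(n, round(majority_vote_accuracy(n, 0.7), 3))
```

Under these assumptions the vote becomes more reliable as trees are added, but each additional tree contributes less than the previous one, which matches the plateau seen in practice.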

We can construct an array of candidate numbers of trees. For each number, we build a random forest, predict on the test set, and
check its accuracy.

In [8]:
n_trees = np.arange(1, 150, 4) # build an array containing values from 1 to 149 with a step of 4

# for each n, we want to check the accuracy of random forest with n trees
for n in n_trees:
    rdf = RandomForestClassifier(n_estimators = n, random_state = 42)
    rdf.fit(X_train, y_train)
    # we will print both the number of trees, and the accuracy score
    print(n)
    print(accuracy_score(y_test, rdf.predict(X_test)))

1
0.7368421052631579
5
0.7631578947368421
9
0.7105263157894737
13
0.7368421052631579
17
0.7368421052631579
21
0.7368421052631579
25
0.7368421052631579
29
0.7368421052631579
33
0.7368421052631579
37
0.7368421052631579
41
0.7368421052631579
45
0.7368421052631579
49
0.7368421052631579
53
0.7368421052631579
57
0.7631578947368421
61
0.7631578947368421
65
0.7631578947368421
69
0.7631578947368421
73
0.7631578947368421
77
0.7631578947368421
81
0.7631578947368421
85
0.7631578947368421
89
0.7631578947368421
93
0.7894736842105263
97
0.7631578947368421
101
0.7631578947368421
105
0.7894736842105263
109
0.7631578947368421
113
0.7631578947368421
117
0.7631578947368421
121
0.7631578947368421
125
0.7631578947368421
129
0.7631578947368421
133
0.7631578947368421
137
0.7894736842105263
141
0.7631578947368421
145
0.7631578947368421
149
0.7631578947368421
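
Rather than reading the best value off the printed list, the scores can be collected and searched programmatically. The sketch below repeats the loop with the same split and settings; the names `scores`, `best_idx`, and `best_n` are hypothetical, and the exact scores may vary with the sklearn version.

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X = iris.data[:, 0:2]   # the same two features as above
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

n_trees = np.arange(1, 150, 4)
scores = []
for n in n_trees:
    rdf = RandomForestClassifier(n_estimators=int(n), random_state=42)
    rdf.fit(X_train, y_train)
    scores.append(accuracy_score(y_test, rdf.predict(X_test)))

scores = np.array(scores)
best_idx = int(np.argmax(scores))   # np.argmax returns the first index of the maximum
best_n = int(n_trees[best_idx])
print(best_n, scores[best_idx])
```

Because `np.argmax` returns the first maximum, ties are resolved in favor of the smallest forest, which is also the cheapest one to train.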


### Conclusion

A random forest tends to overfit less than a single decision tree. Here the number of trees does not affect the accuracy much, likely because our data has only 2 features. The accuracy first reaches its highest value at n = 93 (0.789, reached again at n = 105 and n = 137), so using 93 trees would be a reasonable choice, although the gap to 0.763 corresponds to a single test sample out of 38. Generally, we want a larger number of trees when there are many features, so that the noise is reduced as much as possible.
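
One way to probe the overfitting claim is to compare training and testing accuracy for a single decision tree and for a forest on the same split; a large gap between the two scores suggests overfitting. This is a minimal sketch, assuming the same two features and split as above. The `models` and `results` names are hypothetical, and on such a small test set the comparison may not show a clear difference.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
X = iris.data[:, 0:2]   # the same two features as above
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

models = {
    'single tree': DecisionTreeClassifier(random_state=42),
    'forest (93 trees)': RandomForestClassifier(n_estimators=93, random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # score() returns the mean accuracy on the given data
    results[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(name, results[name])
```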