# Random Forest

A random forest is a collection of decision trees that are used together to estimate which label a sample should be assigned.
Random forests are built on the idea that, while a single decision tree tends to overfit its training data, several decision trees trained separately will each overfit in a different way, so their individual errors partly cancel out when their predictions are combined. This requires that each tree is trained independently, and each on a slightly different training set.
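
To make the idea of independently trained trees concrete, here is a minimal sketch of the bootstrap-and-vote procedure. It assumes synthetic data from `make_classification` rather than the dataset used in this notebook, and the variable names and the choice of 15 trees are purely illustrative. Note that scikit-learn's `RandomForestClassifier` combines its trees by averaging predicted probabilities rather than by the hard majority vote shown here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the crime dataset used below)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, y_train = X[:400], y[:400]
X_test, y_test = X[400:], y[400:]

rng = np.random.default_rng(0)
n_trees = 15  # odd, so a two-class majority vote cannot tie

trees = []
for _ in range(n_trees):
    # Bootstrap sample: draw row indices with replacement, so every
    # tree sees a slightly different version of the training set
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(max_features="sqrt",
                                  random_state=int(rng.integers(1_000_000)))
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Each tree predicts a label (0 or 1); the ensemble takes the majority label
votes = np.stack([t.predict(X_test) for t in trees])  # shape (n_trees, n_test)
forest_pred = (votes.mean(axis=0) > 0.5).astype(int)

single_tree_acc = np.mean([(v == y_test).mean() for v in votes])
ensemble_acc = (forest_pred == y_test).mean()
print("Mean single-tree accuracy:", single_tree_acc)
print("Majority-vote accuracy:", ensemble_acc)
```

Comparing the two printed numbers gives a rough feel for how much the combined vote helps over a typical individual tree on this toy data.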

Random forests often perform well enough that the most natural comparison is with neural networks, another popular and high-performance model type. Unlike neural networks, random forest models are easy to train: modern frameworks provide helpful methods that let you do so in only a few lines of code. Random forests are also fast to train and don't need large datasets to perform well.

Like many models, random forests have several architectural options. The easiest to consider is the size of the forest: how many trees it contains, and how large those trees are. A random forest's ability to make good predictions isn't unlimited, though. At some point, increasing the size and number of trees gives no further improvement, because the training data we have contains only limited variety.


In [1]:
import pandas
!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/graphing.py
!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/Data/san_fran_crime.csv
import numpy as np
from sklearn.model_selection import train_test_split



In [2]:
# Import the data from the .csv file
dataset = pandas.read_csv('san_fran_crime.csv', delimiter="\t")


In [3]:
# Remember to one-hot encode our crime and PdDistrict variables 
categorical_features = ["Category", "PdDistrict"]
dataset = pandas.get_dummies(dataset, columns=categorical_features, drop_first=False)


In [5]:
# Split the dataset in a 90/10 train/test ratio.
# Recall that our dataset is very large so we can afford to do this
train, test = train_test_split(dataset, test_size=0.1, random_state=2, shuffle=True)


In [6]:
print(dataset.head())
print("train shape:", train.shape)
print("test shape:", test.shape)


   DayOfWeek  Resolution  ...  PdDistrict_TARAVAL  PdDistrict_TENDERLOIN
0          5        True  ...                   0                      0
1          5        True  ...                   0                      0
2          1        True  ...                   0                      0
3          2       False  ...                   0                      1
4          5       False  ...                   0                      0

[5 rows x 54 columns]
train shape: (135387, 54)
test shape: (15044, 54)


In [7]:
from sklearn.metrics import balanced_accuracy_score


In [9]:
features = [c for c in dataset.columns if c != "Resolution"]

def fit_and_test_model(model):
    '''
    Trains a model and tests it against both train and test sets
    '''  
    global features

    # Train the model
    model.fit(train[features], train.Resolution)

    # Assess its performance
    # -- Train
    predictions = model.predict(train[features])
    train_accuracy = balanced_accuracy_score(train.Resolution, predictions)

    # -- Test
    predictions = model.predict(test[features])
    test_accuracy = balanced_accuracy_score(test.Resolution, predictions)

    return train_accuracy, test_accuracy


print("OK")

OK
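
The helper above scores models with `balanced_accuracy_score` rather than plain accuracy. As a small worked example of the difference, the sketch below uses made-up toy labels (not taken from the crime dataset) and a hypothetical model that always predicts the majority class. Balanced accuracy is the average of the per-class recalls, so a constant prediction scores 0.5 on a two-class problem no matter how imbalanced the labels are.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Toy labels, invented for illustration: 8 negatives and 2 positives
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

# A lazy "model" that always predicts the majority class
y_pred = np.zeros(10, dtype=int)

plain = accuracy_score(y_true, y_pred)              # 8 of 10 correct
balanced = balanced_accuracy_score(y_true, y_pred)  # mean of per-class recall

# The same number computed by hand from the two recalls
recall_neg = recall_score(y_true, y_pred, pos_label=0)  # 8 of 8 negatives found
recall_pos = recall_score(y_true, y_pred, pos_label=1)  # 0 of 2 positives found
by_hand = (recall_neg + recall_pos) / 2

print("Plain accuracy:   ", plain)
print("Balanced accuracy:", balanced)
print("Mean of recalls:  ", by_hand)
```

Plain accuracy rewards the lazy model with 0.8, while balanced accuracy exposes that it never finds the minority class.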


### Decision Tree

In [10]:
import sklearn.tree
# re-fit our last decision tree to print out its performance
model = sklearn.tree.DecisionTreeClassifier(random_state=1, max_depth=10) 

dt_train_accuracy, dt_test_accuracy = fit_and_test_model(model)

print("Decision Tree Performance:")
print("Train accuracy", dt_train_accuracy)
print("Test accuracy", dt_test_accuracy)


Decision Tree Performance:
Train accuracy 0.7742407145595661
Test accuracy 0.7597105242913844


### Random Forest

In [11]:
from sklearn.ensemble import RandomForestClassifier

# Create a random forest model with two trees
random_forest = RandomForestClassifier( n_estimators=2,
                                        random_state=2,
                                        verbose=False)

# Train and test the model
train_accuracy, test_accuracy = fit_and_test_model(random_forest)
print("Random Forest Performance:")
print("Train accuracy", train_accuracy)
print("Test accuracy", test_accuracy)


Random Forest Performance:
Train accuracy 0.8842998107846062
Test accuracy 0.734378540999183


In [14]:
# n_estimators states how many trees to put in the model
# We will make one model for every entry in this list and see how well each model performs 
n_estimators = [2, 5, 10, 20, 50]

# Train our models and report their performance
train_accuracies = []
test_accuracies = []

for n_estimator in n_estimators:
    print("Preparing a model with", n_estimator, "trees...")

    # Prepare the model 
    rf = RandomForestClassifier(n_estimators=n_estimator, 
                                random_state=2, 
                                verbose=False)
    
    # Train and test the result
    train_accuracy, test_accuracy = fit_and_test_model(rf)

    # Save the results
    test_accuracies.append(test_accuracy)
    train_accuracies.append(train_accuracy)

print("test accuracies: ", test_accuracies)
print("train accuracies: ", train_accuracies)

Preparing a model with 2 trees...
Preparing a model with 5 trees...
Preparing a model with 10 trees...
Preparing a model with 20 trees...
Preparing a model with 50 trees...
test accuracies:  [0.734378540999183, 0.7998956455629179, 0.8000019838258419, 0.8107573519372882, 0.8150166433458843]
train accuracies:  [0.8842998107846062, 0.9716768193935338, 0.9797076760383495, 0.9929211742941426, 0.9990978388259448]


If we let our model split too often and create too many nodes, it becomes increasingly complex and can start to overfit.

One way to limit that complexity is to tell the model that each node needs to contain at least a certain number of samples; otherwise it can't split into subnodes.

In other words, we can set the model's `min_samples_split` parameter to the minimum number of samples a node must contain before it can be split.
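
As a rough illustration of how this setting limits complexity, the sketch below fits single decision trees with different `min_samples_split` values and reports how many nodes each ends up with. It assumes noisy synthetic data from `make_classification`, not the crime dataset, and the three values tried are arbitrary choices for the example.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data (an assumption, standing in for real data)
X, y = make_classification(n_samples=1000, n_features=10,
                           flip_y=0.1, random_state=0)

node_counts = {}
for min_samples in [2, 20, 200]:
    tree = DecisionTreeClassifier(min_samples_split=min_samples, random_state=0)
    tree.fit(X, y)

    # tree_.node_count is the total number of nodes (internal nodes plus leaves)
    node_counts[min_samples] = tree.tree_.node_count
    print("min_samples_split =", min_samples,
          "| nodes:", tree.tree_.node_count,
          "| depth:", tree.get_depth())
```

Larger values stop splitting earlier, so the trees stay smaller. The same parameter is passed straight through to every tree when it is set on a `RandomForestClassifier`.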

In [15]:

# Shrink the training set temporarily to explore this
# setting with a more normal sample size
full_trainset = train
train = full_trainset[:1000] # limit to 1000 samples

min_samples_split = [2, 10, 20, 50, 100, 500]

# Train our models and report their performance
train_accuracies = []
test_accuracies = []

for min_samples in min_samples_split:
    print("Preparing a model with min_samples_split = ", min_samples)

    # Prepare the model 
    rf = RandomForestClassifier(n_estimators=20,
                                min_samples_split=min_samples,
                                random_state=2, 
                                verbose=False)
    
    # Train and test the result
    train_accuracy, test_accuracy = fit_and_test_model(rf)

    # Save the results
    test_accuracies.append(test_accuracy)
    train_accuracies.append(train_accuracy)

print("test accuracies: ", test_accuracies)
print("train accuracies: ", train_accuracies)

Preparing a model with min_samples_split =  2
Preparing a model with min_samples_split =  10
Preparing a model with min_samples_split =  20
Preparing a model with min_samples_split =  50
Preparing a model with min_samples_split =  100
Preparing a model with min_samples_split =  500
test accuracies:  [0.6982512531888485, 0.6979693593334069, 0.6992717539931208, 0.6971769551859806, 0.6892389229748863, 0.5]
train accuracies:  [0.9963503649635037, 0.8541855180873097, 0.8000945084554906, 0.7592246285013372, 0.7284088395568157, 0.5]


In [16]:
# Roll back the trainset to the full set
train = full_trainset


# Prepare the model 
rf = RandomForestClassifier(n_estimators=200,
                            max_depth=128,
                            max_features=25,
                            min_samples_split=2,
                            random_state=2, 
                            verbose=False)

# Train and test the result
print("Training model. This may take 1 - 2 minutes")
train_accuracy, test_accuracy = fit_and_test_model(rf)

# Print out results, compared to the decision tree
# (fit_and_test_model reports balanced accuracy, so the columns are labelled accordingly)
data = {"Model": ["Decision tree", "Final random forest"],
        "Train balanced accuracy": [dt_train_accuracy, train_accuracy],
        "Test balanced accuracy": [dt_test_accuracy, test_accuracy]
        }

pandas.DataFrame(data, columns=["Model", "Train balanced accuracy", "Test balanced accuracy"])

Training model. This may take 1 - 2 minutes


                 Model  Train balanced accuracy  Test balanced accuracy
0        Decision tree                 0.774241                0.759711
1  Final random forest                 0.999657                0.816087
