# Random Forest

Ensemble methods combine many simple, individually weaker models to obtain a single, potentially much more powerful model.<br>
Instead of training different model architectures, bagging (bootstrap aggregating) uses bootstrapping to repeatedly draw random samples from the dataset and trains one model of the same architecture on each sample. Aggregating their predictions (majority vote or averaging) gives the same ensemble effect.<br>
A random forest is bagging applied to decision trees.<br><br>
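In bootstrapping, data points are sampled at random from the original dataset with replacement, so duplicates are allowed. The resulting bootstrap dataset has the same size as the original, but on average it contains only about 63% of the distinct data points; the remaining rows are duplicates. Each tree is trained on its own bootstrap dataset.<br>
Because each bootstrap dataset covers only part of the original data, every tree has samples it never saw during training (about 37% of the data). These are called out-of-bag (OOB) samples. Each sample can be predicted using only the trees for which it was out-of-bag, and the accuracy of those predictions is the OOB score, a validation estimate that needs no separate hold-out set.<br><br>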
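The 63% / 37% split can be checked numerically. The following is a minimal sketch, assuming a hypothetical dataset of `n_samples` rows identified only by their index; it draws one bootstrap sample with NumPy and derives the out-of-bag rows from it. The expected fraction of distinct rows is roughly `1 - 1/e`.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 100_000  # hypothetical dataset size, not the notebook's data

# Bootstrap: draw n_samples row indices with replacement
boot_idx = rng.integers(0, n_samples, size=n_samples)

# Mark every row that was drawn at least once
in_bag = np.zeros(n_samples, dtype=bool)
in_bag[boot_idx] = True

# Rows that were never drawn are the out-of-bag (OOB) rows for this tree
oob_idx = np.flatnonzero(~in_bag)

unique_fraction = in_bag.mean()
oob_fraction = 1.0 - unique_fraction

print(f"bootstrap size:         {len(boot_idx)}")
print(f"distinct rows in bag:   {unique_fraction:.3f}")
print(f"out-of-bag rows:        {oob_fraction:.3f}")
print(f"theory (1 - 1/e):       {1 - np.exp(-1):.3f}")
```

A tree trained on `boot_idx` could then be evaluated on `oob_idx` without any leakage from its own training rows.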
Fully grown decision trees have low bias but very high variance. Averaging many trees trained on different bootstrap samples reduces the variance while leaving the bias roughly unchanged.<br>
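Why averaging reduces variance can be illustrated with a small simulation. This sketch is not part of the notebook's pipeline: it assumes each tree's prediction error is a zero-mean Gaussian with variance `sigma2` and a common pairwise correlation `rho` (all names here are hypothetical). Under that assumption, the variance of the average of `n_trees` errors is `rho * sigma2 + (1 - rho) * sigma2 / n_trees`, so less correlated trees give a larger reduction.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 1.0      # assumed variance of a single tree's error
n_trees = 50      # assumed ensemble size
n_trials = 20_000 # number of simulated ensembles

def variance_of_average(rho):
    """Empirical variance of the mean of n_trees errors with pairwise correlation rho."""
    # A shared component induces the correlation; the rest is independent per tree
    shared = rng.normal(size=(n_trials, 1)) * np.sqrt(rho * sigma2)
    indep = rng.normal(size=(n_trials, n_trees)) * np.sqrt((1 - rho) * sigma2)
    errors = shared + indep  # each column: variance sigma2, pairwise correlation rho
    return errors.mean(axis=1).var()

def theoretical_variance(rho):
    return rho * sigma2 + (1 - rho) * sigma2 / n_trees

for rho in (0.0, 0.3, 0.9):
    print(f"rho={rho}: simulated={variance_of_average(rho):.4f}, "
          f"theory={theoretical_variance(rho):.4f}")
```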
At each split, a random forest considers only a random subset of the features when growing a tree. As a result, different trees behave differently depending on the features they see at each split, which promotes diversity among the trees. It also prevents one strong feature from dominating every tree.
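In scikit-learn, the per-split feature subset is controlled by the `max_features` parameter of `RandomForestClassifier`, and the OOB score is available through `oob_score=True`. The sketch below uses synthetic stand-in data from `make_classification` (an assumption for illustration, not the dataset used in this notebook) to compare a forest that samples `sqrt(n_features)` features per split with one that considers all features, which amounts to plain bagging of trees.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (hypothetical; not the notebook's dataset)
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=5, random_state=0
)

oob_scores = {}
for label, max_features in (("sqrt", "sqrt"), ("all", None)):
    clf = RandomForestClassifier(
        n_estimators=100,
        max_features=max_features,  # features considered at each split
        bootstrap=True,             # required for OOB evaluation
        oob_score=True,             # score each sample with trees that did not see it
        random_state=0,
    )
    clf.fit(X, y)
    oob_scores[label] = clf.oob_score_

for label, score in oob_scores.items():
    print(f"max_features={label}: OOB accuracy = {score:.4f}")
```

Which setting scores higher depends on the data; the point is only that the OOB score lets the two be compared without touching a test set.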

In [None]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import OrdinalEncoder
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.ensemble import RandomForestClassifier

In [None]:
# data processing

df_train = pd.read_csv("data/train.csv")
df_test = pd.read_csv("data/test.csv")

# drop unnecessary columns
df_train = df_train.drop(columns=['Unnamed: 0', 'id'])
df_test = df_test.drop(columns=['Unnamed: 0', 'id'])

# drop missing values
df_train = df_train.dropna()
df_test = df_test.dropna()

# convert Arrival Delay in Minutes to integer
df_train['Arrival Delay in Minutes'] = df_train['Arrival Delay in Minutes'].astype(np.int64)
df_test['Arrival Delay in Minutes'] = df_test['Arrival Delay in Minutes'].astype(np.int64)

# divide categorical and numerical features
cat_feats = []
num_feats = []

for col, dtype in df_train.dtypes.items():
    if dtype == object:        # string columns -> categorical
        cat_feats.append(col)
    elif dtype == np.int64:    # integer columns -> numerical
        num_feats.append(col)

target = 'satisfaction'
cat_feats.remove(target)

# prepare train and test sets
train_features = cat_feats + num_feats

for cf in cat_feats + [target]:
    # fit the encoder on the training data only, then reuse it on the test set
    # so that both sets share the same category-to-integer mapping
    cats = sorted(df_train[cf].unique())
    encoder = OrdinalEncoder(categories=[cats])
    df_train[cf] = encoder.fit_transform(df_train[[cf]]).astype(int)
    df_test[cf] = encoder.transform(df_test[[cf]]).astype(int)
       
X_train = df_train[train_features]
y_train = df_train[target]

X_test = df_test[train_features]
y_test = df_test[target]

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(103594, 22) (103594,)
(25893, 22) (25893,)


In [None]:
clf = RandomForestClassifier(random_state=42)

clf.fit(X_train, y_train)

y_pred = clf.predict(X_train)
accuracy = accuracy_score(y_train, y_pred)

print(f"train accuracy: {accuracy}\n")

y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)

print(f"test accuracy: {accuracy}\n")

print("confusion matrix:")
print(cm)

train accuracy: 0.9999903469312894

test accuracy: 0.9631174448692696

confusion matrix:
[[14236   292]
 [  663 10702]]
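The confusion matrix carries more information than the single accuracy figure. As a small worked example, the sketch below copies the matrix printed above and derives accuracy, precision, and recall from it. It relies on scikit-learn's convention that rows are true labels and columns are predicted labels, and it treats the label encoded as 1 as the positive class (an assumption made here for illustration).

```python
import numpy as np

# Confusion matrix copied from the output above
# rows: true class (0, 1), columns: predicted class (0, 1)
cm = np.array([[14236,   292],
               [  663, 10702]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()   # fraction of all test rows predicted correctly
precision = tp / (tp + fp)        # of rows predicted as class 1, how many are class 1
recall = tp / (tp + fn)           # of true class-1 rows, how many were found

print(f"accuracy:  {accuracy:.4f}")
print(f"precision: {precision:.4f}")
print(f"recall:    {recall:.4f}")
```

The accuracy computed this way agrees with the `test accuracy` value printed above.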


In [40]:
def fit_tree(n_estimators, bootstrap):
    """Train a random forest with the given settings; return (train accuracy, test accuracy)."""
    clf = RandomForestClassifier(n_estimators=n_estimators, bootstrap=bootstrap, random_state=42)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_train)
    train_acc = accuracy_score(y_train, y_pred)

    y_pred = clf.predict(X_test)
    test_acc = accuracy_score(y_test, y_pred)
    return train_acc, test_acc

In [42]:
n_estimators = [80, 90, 100, 120, 150, 200, 250, 300]
bootstrap = [True, False]
test_accs = []

for ne in n_estimators:
    for bo in bootstrap:
        train_acc, test_acc = fit_tree(ne, bo)
        print(f"n_estimators: {ne}, bootstrap: {bo} -> train: {train_acc}, test: {test_acc}")
        test_accs.append((ne, bo, test_acc))

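# pick the setting with the highest test accuracy
# (note: selecting on the test set makes this accuracy an optimistic estimate)
best = max(test_accs, key=lambda x: x[-1])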
print(f"\nbest -> n_estimators: {best[0]}, bootstrap: {best[1]} -> test acc: {best[2]}")

n_estimators: 80, bootstrap: True -> train: 0.9999903469312894, test: 0.9622677943845827
n_estimators: 80, bootstrap: False -> train: 1.0, test: 0.9626153786737728
n_estimators: 90, bootstrap: True -> train: 0.9999806938625789, test: 0.9626926196269262
n_estimators: 90, bootstrap: False -> train: 1.0, test: 0.9630402039161163
n_estimators: 100, bootstrap: True -> train: 0.9999903469312894, test: 0.9631174448692696
n_estimators: 100, bootstrap: False -> train: 1.0, test: 0.9631174448692696
n_estimators: 120, bootstrap: True -> train: 1.0, test: 0.9629629629629629
n_estimators: 120, bootstrap: False -> train: 1.0, test: 0.9630015834395397
n_estimators: 150, bootstrap: True -> train: 1.0, test: 0.9631174448692696
n_estimators: 150, bootstrap: False -> train: 1.0, test: 0.9630015834395397
n_estimators: 200, bootstrap: True -> train: 1.0, test: 0.9629243424863863
n_estimators: 200, bootstrap: False -> train: 1.0, test: 0.9633877882053065
n_estimators: 250, bootstrap: True -> train: 1.0, tes