# Random Forest

Decision trees are good at picking out informative features from a dataset, in the sense that they search for the threshold on each feature that best splits the data. A problem is that they tend to overfit. One way to counter this is to grow a forest of trees, each trained on a random subset of the training set, in both rows and features, so that the trees explore alternative ways to classify items. The prediction of a Random Forest is then the majority vote of all these trees.
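To make the idea concrete, here is a minimal hand-rolled sketch of such a forest. It is not how `RandomForestClassifier` is implemented internally, only an illustration under a few assumptions: the helper names `fit_forest` and `predict_forest` are made up for this example, the data is synthetic (`make_classification`), labels are binary 0/1, rows are subsampled by bootstrapping, and features are subsampled at every split via `max_features="sqrt"`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def fit_forest(X, y, n_trees=25, seed=0):
    """Fit n_trees decision trees, each on a bootstrap sample of the rows.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        # Row subset: draw len(X) row indices with replacement.
        rows = rng.integers(0, len(X), size=len(X))
        # Feature subset: each split only considers sqrt(n_features) random features.
        tree = DecisionTreeClassifier(
            max_features="sqrt",
            random_state=int(rng.integers(1_000_000)),
        )
        tree.fit(X[rows], y[rows])
        trees.append(tree)
    return trees


def predict_forest(trees, X):
    """Majority vote over the trees' predictions (assumes labels 0 and 1)."""
    votes = np.stack([tree.predict(X) for tree in trees])  # (n_trees, n_samples)
    # An odd number of trees avoids ties.
    return (votes.mean(axis=0) > 0.5).astype(int)


# Synthetic data, used here only so the sketch runs on its own.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
forest = fit_forest(X_demo[:200], y_demo[:200])
pred_demo = predict_forest(forest, X_demo[200:])
print("accuracy:", (pred_demo == y_demo[200:]).mean())
```

The hard vote is a simplification that matches the description above; scikit-learn's forest combines the trees by averaging their predicted class probabilities, which usually gives very similar predictions.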

In [32]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np
from sklearn.metrics import recall_score, precision_score, f1_score

In [33]:
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
df = pd.read_csv("https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv", names=names)

In [34]:
X = df.drop(columns='class')
y = df['class']

In [45]:
train_X, valid_X, train_y, valid_y = train_test_split(X, y, test_size=0.1)

In [46]:
scaler = StandardScaler()
train_X = scaler.fit_transform(train_X)
valid_X = scaler.transform(valid_X)

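Let's train three models on all of the training data and compare their F1 scores on the validation set: a logistic regression as a baseline, a single decision tree, and a random forest. The split and the tree-based models are randomized and no `random_state` is set, so the exact scores vary from run to run.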

In [56]:
model = LogisticRegression()
model.fit(train_X, train_y)
pred_y = model.predict(valid_X)
f1_score(valid_y, pred_y)

0.5925925925925927

In [57]:
model = DecisionTreeClassifier()
model.fit(train_X, train_y)
pred_y = model.predict(valid_X)
f1_score(valid_y, pred_y)

0.5

In [59]:
model = RandomForestClassifier(n_estimators=1000)
model.fit(train_X, train_y)
pred_y = model.predict(valid_X)
f1_score(valid_y, pred_y)

0.6785714285714286