# Binary Classification: Random Forest - Sklearn

The random forest is an ensemble classification algorithm consisting of many decision trees. It uses bagging and feature randomness when building each individual tree, aiming to create an uncorrelated forest of trees whose prediction by committee is more accurate than that of any individual tree. Each tree in the random forest outputs a class prediction, and the class with the most votes becomes the model's prediction.
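
The two ideas in the paragraph above, bagging and prediction by committee, can be sketched in a few lines of plain Python. This is only an illustration: the helper names `bootstrap_sample` and `majority_vote`, the toy rows, and the five tree votes are all made up for the example and are not part of scikit-learn.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (the 'bagging' step)."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(votes):
    """Return the class label that received the most votes."""
    return Counter(votes).most_common(1)[0][0]

rng = random.Random(0)
rows = ["msg_a", "msg_b", "msg_c", "msg_d", "msg_e"]  # hypothetical training rows
sample = bootstrap_sample(rows, rng)  # same size as rows, duplicates allowed

# Hypothetical predictions from five trees for one message (1 = spam, 0 = ham)
tree_votes = [1, 0, 1, 1, 0]
forest_prediction = majority_vote(tree_votes)
print(forest_prediction)  # 1, because three of the five trees voted spam
```

Note that scikit-learn's `RandomForestClassifier` combines its trees by averaging their predicted class probabilities rather than counting hard votes; the sketch shows the simpler voting scheme described above.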

In this example, we will build a random forest classifier to predict whether a text message is spam. The predictions are based on the counts of each word in the message. Before using a random forest, we first check how well a single decision tree performs.

## Import libraries

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix

## Load data

In [2]:
# Read in our dataset
url = "https://raw.githubusercontent.com/lucaskienast/Classification-Models/main/1)%20Binary%20Classification/SMSSpamCollection.dms"
df = pd.read_table(url,
                   sep='\t', 
                   header=None, 
                   names=['label', 'sms_message'])
df.head()

| | label | sms_message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [3]:
# map spam as 1 and ham as 0
df["label"] = np.where(df["label"]=="spam", 1, 0)
df.head()

| | label | sms_message |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |


## Explore data

In [4]:
# show descriptive statistics
df.describe(include="all")

| | label | sms_message |
|---|---|---|
| count | 5572.0 | 5572 |
| unique | | 5169 |
| top | | Sorry, I'll call later |
| freq | | 30 |
| mean | 0.134063 | |
| std | 0.340751 | |
| min | 0.0 | |
| 25% | 0.0 | |
| 50% | 0.0 | |
| 75% | 0.0 | |


In [5]:
df["label"].unique()

array([0, 1])
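
The mean of the `label` column (about 0.134) says that roughly 13% of the messages are spam, so the classes are imbalanced. A quick piece of arithmetic, using only the figure reported by `df.describe()` above, shows why accuracy alone can flatter a model here: a hypothetical classifier that always predicts ham already scores well on accuracy while catching no spam at all.

```python
# Share of spam reported by df.describe() above (mean of the 0/1 label column)
spam_share = 0.134063

# A classifier that always predicts ham (0) is correct whenever the message is ham
baseline_accuracy = 1 - spam_share
print(f"Always-ham baseline accuracy: {baseline_accuracy:.3f}")
```

This is the reason precision, recall, and F1 are reported alongside accuracy later on.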

## Declare features and targets

In [6]:
# create feature (X) and target (y) variables
y = df["label"]
X = df["sms_message"]

## Train-Test Split

In [7]:
# 80:20 split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Instantiate CountVectorizer

In [8]:
# Instantiate the CountVectorizer (a transformer class that turns text into word counts)
count_vector = CountVectorizer()

In [9]:
# Learn the vocabulary from the training data and return the document-term count matrix
training_data = count_vector.fit_transform(X_train)

In [10]:
# Transform the test data using the vocabulary learned from the training data (the vectorizer is not refit on test data)
testing_data = count_vector.transform(X_test)
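
To make the fit/transform distinction concrete, here is a rough pure-Python approximation of what a count vectorizer does. It is a sketch, not scikit-learn's implementation: the helpers `tokenize`, `fit_vocabulary`, and `transform` and the toy messages are invented for illustration. It assumes lowercasing and tokens of two or more word characters, which matches the `CountVectorizer` defaults.

```python
import re
from collections import Counter

TOKEN = re.compile(r"\b\w\w+\b")  # tokens of 2+ word characters

def tokenize(text):
    """Lowercase the text and split it into tokens."""
    return TOKEN.findall(text.lower())

def fit_vocabulary(docs):
    """Learn a term -> column index mapping from the training documents only."""
    terms = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {term: i for i, term in enumerate(terms)}

def transform(docs, vocab):
    """Count vocabulary terms per document; terms outside the vocabulary are dropped."""
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts.get(term, 0) for term in vocab])
    return rows

train_docs = ["Free entry to win", "Ok see you later"]  # toy training messages
test_docs = ["Win a free prize later"]                  # toy test message

vocab = fit_vocabulary(train_docs)         # built from training data only
train_matrix = transform(train_docs, vocab)
test_matrix = transform(test_docs, vocab)  # "prize" was never seen, so it is ignored

print(list(vocab))   # ['entry', 'free', 'later', 'ok', 'see', 'to', 'win', 'you']
print(test_matrix)   # [[0, 1, 1, 0, 0, 0, 1, 0]]
```

Because the vocabulary is fixed at fit time, the training and test matrices share the same columns, which is what the classifiers below require.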

## Train & Test Decision Tree model (for comparison)

In [11]:
# build and fit model
dt = DecisionTreeClassifier()
dt.fit(training_data, y_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')

In [12]:
# Predict on the test data
dt_predictions = dt.predict(testing_data)

In [13]:
# Score our model
print('Accuracy score: ', format(accuracy_score(y_test, dt_predictions)))
print('Precision score: ', format(precision_score(y_test, dt_predictions)))
print('Recall score: ', format(recall_score(y_test, dt_predictions)))
print('F1 score: ', format(f1_score(y_test, dt_predictions)))

Accuracy score:  0.9713004484304932
Precision score:  0.8979591836734694
Recall score:  0.8859060402684564
F1 score:  0.8918918918918919


In [14]:
# confusion matrix
cm = confusion_matrix(y_test, dt_predictions)
cm

array([[951,  15],
       [ 17, 132]])
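
The scores printed earlier can be recomputed by hand from this confusion matrix. The sketch below copies the four counts from the output above and relies on scikit-learn's layout, where rows are the true class and columns are the predicted class, so the matrix reads `[[TN, FP], [FN, TP]]`.

```python
# Confusion matrix reported above: rows = true class, columns = predicted class
cm = [[951, 15],
      [17, 132]]
(tn, fp), (fn, tp) = cm

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all messages classified correctly
precision = tp / (tp + fp)                   # share of predicted spam that really is spam
recall = tp / (tp + fn)                      # share of actual spam that was caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1:        {f1:.4f}")
```

The decision tree flagged 15 ham messages as spam and let 17 spam messages through.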

## Train & Test Random Forest model

In [15]:
# build and fit model
rf = RandomForestClassifier(n_estimators=200) # 200 trees (all other parameters left at their defaults)
rf.fit(training_data, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=200,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [16]:
# Predict on the test data
rf_predictions = rf.predict(testing_data)

In [17]:
# Score our model
print('Accuracy score: ', format(accuracy_score(y_test, rf_predictions)))
print('Precision score: ', format(precision_score(y_test, rf_predictions)))
print('Recall score: ', format(recall_score(y_test, rf_predictions)))
print('F1 score: ', format(f1_score(y_test, rf_predictions)))

Accuracy score:  0.97847533632287
Precision score:  1.0
Recall score:  0.8389261744966443
F1 score:  0.9124087591240876


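The random forest outperformed the single decision tree on accuracy (0.978 vs. 0.971), precision (1.0 vs. 0.898), and F1 (0.912 vs. 0.892), but not on recall (0.839 vs. 0.886). In other words, it never flagged a ham message as spam, but it let more spam through than the decision tree did.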
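
To see where the recall gap comes from, the random forest's confusion counts can be reconstructed from its reported scores. This is a back-of-the-envelope check, not notebook output: it assumes the same test set as the decision tree (both models were scored on `y_test`), so the class totals are taken from the decision tree's confusion matrix.

```python
# Test-set class counts, taken from the decision tree's confusion matrix
n_ham = 951 + 15     # 966 ham messages
n_spam = 17 + 132    # 149 spam messages

# Random forest scores as reported above
rf_precision = 1.0
rf_recall = 0.8389261744966443

tp = round(rf_recall * n_spam)        # spam correctly flagged
fn = n_spam - tp                      # spam that slipped through
fp = round(tp / rf_precision) - tp    # ham wrongly flagged as spam
tn = n_ham - fp                       # ham correctly passed

rf_accuracy = (tp + tn) / (n_ham + n_spam)
print(tn, fp, fn, tp)   # 966 0 24 125
print(rf_accuracy)
```

Under these assumptions the forest missed 24 spam messages against the tree's 17, while producing no false positives against the tree's 15. The reconstructed accuracy agrees with the reported 0.9785.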