# Random Forest Example

This example uses a slightly modified version of the SMS Spam data set from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection). The original data set is a collection of 5574 SMS messages labeled as ham or spam, distributed as a tab-delimited file with the label in the first column and the message content in the second. I edited the data set to remove some unwanted columns and add headings.



In [20]:
import pandas as pd
df = pd.read_csv('data/sms-spam.csv', header=0, usecols=[1,2], encoding='latin-1')
print('rows and columns:', df.shape)
print(df.head())

rows and columns: (4837, 2)
   spam                                               text
0     0  Go until jurong point, crazy.. Available only ...
1     0                      Ok lar... Joking wif u oni...
2     1  Free entry in 2 a wkly comp to win FA Cup fina...
3     0  U dun say so early hor... U c already then say...
4     0  Nah I don't think he goes to usf, he lives aro...


In [21]:
# text preprocessing
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

# English stop words as a list; a separate name keeps the nltk module from being shadowed
stop_words = list(stopwords.words('english'))

# replace runs of 2+ digits, 2+ of !@#*, and 2+ capital letters with placeholder tokens
df['text'] = df['text'].replace(r'[\d][\d]+', ' num ', regex=True)
df['text'] = df['text'].replace(r'[!@#*][!@#*]+', ' punct ', regex=True)
df['text'] = df['text'].replace(r'[A-Z][A-Z]+', ' caps ', regex=True)

vectorizer = TfidfVectorizer(stop_words=stop_words)
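
To see what the three substitutions above do to a single message, here is a minimal sketch using the standard `re` module, which should behave the same way as the pandas regex replacements. The `normalize` helper and the sample message are made up for illustration and are not part of the data set.

```python
import re

def normalize(text):
    """Apply the same three substitutions as the preprocessing cell, in the same order."""
    text = re.sub(r'[\d][\d]+', ' num ', text)        # runs of 2+ digits
    text = re.sub(r'[!@#*][!@#*]+', ' punct ', text)  # runs of 2+ of ! @ # *
    text = re.sub(r'[A-Z][A-Z]+', ' caps ', text)     # runs of 2+ capital letters
    return text

sample = "WINNER!! Call 09061701461 now"  # made-up message, not from the data set
print(normalize(sample).split())
# ['caps', 'punct', 'Call', 'num', 'now']
```

Note that `Call` is left alone because it contains only a single capital letter.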

In [22]:
# set up X and y
X = df.text
y = df.spam

In [23]:
# divide into train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, train_size=0.8, random_state=1234)

In [24]:
# apply tfidf vectorizer
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)

## Decision Trees and Random Forest

A decision tree classifier is a recursive algorithm that repeatedly divides observations into smaller and more similar groups. At each step, the algorithm finds the feature and split point that most cleanly divide the data, as measured by a criterion such as Gini impurity. The process continues on each subset until the observations in it are sufficiently similar or another stopping condition is met.
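
As a rough illustration of what "most cleanly divides the data" means, the sketch below scores every candidate threshold on a single numeric feature by the weighted Gini impurity of the two resulting groups. The helper names and the toy data are assumptions made for this example; this is not scikit-learn's implementation.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 for a pure group, 0.5 for a 50/50 binary mix."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_threshold(values, labels):
    """Try each midpoint between sorted distinct values.

    Returns (threshold, weighted impurity) for the cleanest split found,
    or (None, impurity of the unsplit data) if no split improves on it.
    """
    best = (None, gini(labels))
    distinct = sorted(set(values))
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

# Made-up toy data: one numeric feature and a 0/1 label
feature = [1, 2, 3, 7, 8, 9]
labels = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_threshold(feature, labels)
print(threshold, impurity)  # 5.0 0.0
```

A full tree would apply the same search to every feature, pick the best one, and then repeat on the left and right subsets.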

Decision trees are highly interpretable but often less accurate than other algorithms. They suffer from high variance: a small change in the training data can produce a very different tree. Another problem is that the algorithm is greedy. It picks the best split available at each step and never goes back to revisit earlier decisions, so an early split may not be the best one for the tree as a whole.

To overcome these problems, random forests were developed. A random forest builds many trees, often hundreds, each on a bootstrap sample of the training data, and combines their predictions (for classification, typically by majority vote). Further, the trees are decorrelated as follows: at each split, only a random subset of the features is considered. Random forests often get results competitive with far more sophisticated algorithms. The disadvantage of a random forest compared with a single decision tree is that the forest is much harder to interpret.
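
The idea can be sketched in plain Python. This is a toy version under several assumptions: each "tree" is a one-split stump rather than a full tree, the feature subset is drawn once per stump, and all function names and data are hypothetical. It is meant only to show bootstrap sampling, random feature selection, and majority voting together, not how scikit-learn implements `RandomForestClassifier`.

```python
import random
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values()) if n else 0.0

def fit_stump(rows, labels, n_features, rng):
    """Fit a one-split 'tree' that may only look at a random subset of the features."""
    candidates = rng.sample(range(len(rows[0])), n_features)
    best = None  # (impurity, feature index, threshold, left label, right label)
    for feat in candidates:
        for t in sorted({r[feat] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[feat] <= t]
            right = [y for r, y in zip(rows, labels) if r[feat] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, feat, t,
                        Counter(left).most_common(1)[0][0],
                        Counter(right).most_common(1)[0][0])
    if best is None:  # no usable split: always predict the majority label
        majority = Counter(labels).most_common(1)[0][0]
        return lambda row: majority
    _, best_feat, best_t, left_label, right_label = best
    return lambda row: left_label if row[best_feat] <= best_t else right_label

def fit_forest(rows, labels, n_trees=25, n_features=1, seed=0):
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        # Bootstrap: draw as many row indices as there are rows, with replacement
        idx = [rng.randrange(len(rows)) for _ in rows]
        trees.append(fit_stump([rows[i] for i in idx],
                               [labels[i] for i in idx], n_features, rng))
    return trees

def predict(trees, row):
    """Majority vote over the individual trees."""
    return Counter(tree(row) for tree in trees).most_common(1)[0][0]

# Made-up toy data: two numeric features, label 0 for small values and 1 for large
rows = [(1, 1), (2, 1), (1, 2), (2, 2), (8, 9), (9, 8), (8, 8), (9, 9)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
forest = fit_forest(rows, labels)
print(predict(forest, (1, 1)), predict(forest, (9, 9)))
```

Because every stump sees a different bootstrap sample and a different feature, the individual stumps disagree more than identical trees would, and the vote smooths out their individual errors.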

### Train and test

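Train on the training data and then evaluate on the test data. The classifier is created with default parameters, which in the scikit-learn version used here means 10 trees (`n_estimators=10` in the printed parameters).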

In [38]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier()
classifier.fit(X_train, y_train)



RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [39]:
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

# predict labels for the held-out test messages
pred = classifier.predict(X_test)
print('accuracy score: ', accuracy_score(y_test, pred))
print('f1 score: ', f1_score(y_test, pred))

# rows are true labels (0, 1), columns are predicted labels
print(confusion_matrix(y_test, pred))

accuracy score:  0.9803719008264463
f1 score:  0.9147982062780268
[[847   1]
 [ 18 102]]


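Both the accuracy and the F1 score are higher than those obtained with Naive Bayes. The confusion matrix shows that only 1 of the 848 ham messages was flagged as spam, while 18 of the 120 spam messages were missed.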
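
The reported scores can be recomputed by hand from the confusion matrix, which also gives the precision and recall that the cell above does not print. The sketch below copies the four counts from the output; the variable names are chosen for this example, with spam (label 1) as the positive class.

```python
# Confusion matrix copied from the output above:
# rows are true labels (0 = ham, 1 = spam), columns are predicted labels
tn, fp = 847, 1
fn, tp = 18, 102

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of the messages flagged as spam, how many were spam
recall = tp / (tp + fn)     # of the actual spam messages, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f'accuracy:  {accuracy:.4f}')   # 0.9804
print(f'precision: {precision:.4f}')  # 0.9903
print(f'recall:    {recall:.4f}')     # 0.8500
print(f'f1:        {f1:.4f}')         # 0.9148
```

The high precision and lower recall show that the classifier rarely flags ham as spam but lets some spam through.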