# tweet-sentiment with Random Forest

In this notebook, we analyse the tweet-sentiment dataset using a random forest classifier. Random forest models are very popular as they tend to be very robust in predicting and are resistant to overfitting to the training set. 

## 0. Set-up

We begin by loading the necessary libraries as well as our own `eda` module, which consists of the functions used in exploratory analysis of the tweet-sentiment dataset. 

In [1]:
%load_ext autoreload
%autoreload 2

In [11]:
import eda

import numpy as np
from scipy import stats

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

import word2vec

For reproducibility purposes, we also set a seed value of 42 throughout the entire project: 

In [25]:
seed = 42

## 1. Load the data

Using the functions defined in our eda module, we load and do simple cleaning of the dataset. This includes: 
- removing NA values; 
- converting the categorical variable `sentiment` to numerical values.

In [8]:
df = eda.load_dataset(url="https://raw.github.com/ant1code/tweet-sentiment/master/data/train.csv")

df = eda.clean_dataset(df)

In [9]:
print(f"This dataset consists of {len(df)} tweets.")
df.head()

This dataset consists of 27485 tweets.


Unnamed: 0,text,sentiment
0,Spent the entire morning in a meeting w/ a ven...,1
1,Oh! Good idea about putting them on ice cream,2
2,says good (or should i say bad?) afternoon! h...,1
3,i dont think you can vote anymore! i tried,0
4,haha better drunken tweeting you mean?,2


## 2. Data Processing

Similar to in our EDA notebook, we normalise the data by removing stopwords and lemmatizing the remaining words. This reduces the number of words to be processed, thereby reducing the complexity of our eventual random forest model and making the training process more efficient. 

In [10]:
stop_words = eda.get_stop_words()
df = eda.normalize(df, stop_words)

df.head()

Unnamed: 0,text,sentiment,lemmatized
0,Spent the entire morning in a meeting w/ a ven...,1,spent entire morning meeting w vendor boss not...
1,Oh! Good idea about putting them on ice cream,2,oh good idea put ice cream
2,says good (or should i say bad?) afternoon! h...,1,say good say bad afternoon
3,i dont think you can vote anymore! i tried,0,dont think vote anymore try
4,haha better drunken tweeting you mean?,2,haha well drunken tweet mean


Next, we have to *vectorize* the lemmatized tweets. This refers to the process of converting text (string features) into vectors (of numerical features) that can be understood by the random forest classifier. 

We chose to use sklearn's TfidfVectorizer for this purpose. TFIDF stands for *term frequency-inverse document frequency* and in this project, it is a statistic that reflects how important a word is to a tweet. 

In [18]:
vectorizer = TfidfVectorizer()
bow = vectorizer.fit_transform(df['lemmatized'])

print(f"There are {len(vectorizer.get_feature_names())} features.")

There are 21359 features.


The above results in a very large and very sparse term frequency matrix, since there are certainly many words that appear only very few times in the entire dataset. A lot of space is wasted in representing such matrices. To get around this problem, we specify a *cut-off* of 5: it means that when the vectorizer builds the vocabulatry, it should ignore terms that appear in fewer than five tweets. 

In [20]:
vectorizer = TfidfVectorizer(min_df=5)
bow = vectorizer.fit_transform(df['lemmatized'])

print(f"There are {len(vectorizer.get_feature_names())} features.")
print()

There are 3749 features.


## 3. Training the Random Forest

We are ready to proceed with training our classifier! The first step is to split the dataset into train and test sets to avoid contamination of the training set:

In [22]:
x_train, x_test, y_train, y_test = train_test_split(bow, sentiment, test_size=0.33)

Next, we initialise the random forest classifier and fit it to the train set: 

In [26]:
classifier = RandomForestClassifier(random_state=seed, bootstrap=False)
classifier.fit(x_train,y_train)

RandomForestClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)

## 4. Testing the Random Forest 

It remains to evaluate the performance of our random forest classifier on the test set: 

In [27]:
classifier.score(x_test,y_test)

0.6978282438540403

## 5. Bonus: Hyperparameter Optimization 

The score above is decent (and in fact, better than our human predictions), but it is not nearly as high as the scores obtained by the DistilBERT and BERT classifiers. To improve the performance of the random forest classifier, we can search for optimal hyperparameters like so: 

In [29]:
# BONUS: hyperparameter optimization

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 2000, stop = 4000, num = 200)]

# Number of features to consider at every split
max_features = ['sqrt']

# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(100, 2000, num = 100)]
max_depth.append(None)

# Minimum number of samples required to split a node
min_samples_split = [320, 640, 1280, 2560]

# Minimum number of samples required at each leaf node
min_samples_leaf = [16, 32]

# Method of selecting samples for training each tree
bootstrap = [False]

# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

classifier = RandomForestClassifier()

random_search = RandomizedSearchCV(estimator = classifier, param_distributions = random_grid,
                                   cv = 3, verbose = 2, random_state = seed, 
                                   n_iter=3, n_jobs= -1, return_train_score = True)

random_search.fit(x_train,y_train)

Fitting 3 folds for each of 3 candidates, totalling 9 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   4 out of   9 | elapsed:  3.5min remaining:  4.4min
[Parallel(n_jobs=-1)]: Done   9 out of   9 | elapsed:  4.4min remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   9 out of   9 | elapsed:  4.4min finished


RandomizedSearchCV(cv=3, error_score=nan,
                   estimator=RandomForestClassifier(bootstrap=True,
                                                    ccp_alpha=0.0,
                                                    class_weight=None,
                                                    criterion='gini',
                                                    max_depth=None,
                                                    max_features='auto',
                                                    max_leaf_nodes=None,
                                                    max_samples=None,
                                                    min_impurity_decrease=0.0,
                                                    min_impurity_split=None,
                                                    min_samples_leaf=1,
                                                    min_samples_split=2,
                                                    min_weight_fraction_leaf=0.0,
               

In [30]:
# Take the best of the 6 trained models
random_search.best_params_

{'n_estimators': 3587,
 'min_samples_split': 640,
 'min_samples_leaf': 16,
 'max_features': 'sqrt',
 'max_depth': 1558,
 'bootstrap': False}

As expected, the random forest with these hyperparameters performs better than our initial random forest classifier: 

In [32]:
new_classifier = RandomForestClassifier(**random_search.best_params_)
new_classifier.fit(x_train, y_train)

RandomForestClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=1558, max_features='sqrt',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=16, min_samples_split=640,
                       min_weight_fraction_leaf=0.0, n_estimators=3587,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [34]:
new_classifier.score(x_test, y_test)

0.6832763752618234