# Random Forest vs XGBoost vs SVM

This notebook compares the performance of the **Random forest**, **XGBoost** and **SVM** algorithms. Training relies only on the assigned Sentiment class (*Positive*/*Negative*/*Neutral or author is just sharing information*), and all tweets with Confidence <= 0.65 are filtered out first. The exception is *Tweets not related to weather condition*, which we keep at any Confidence: our dataset did not give us enough vocabulary for this class, and including even low-confidence tweets should help with this.

## Imports


In [20]:
from src.WeatherSentimentData import WeatherSentimentData
from src.TweetTextPreprocessor import TweetTextPreprocessor
from src.Assessor import Assessor
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, classification_report
from xgboost import XGBClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
import pickle
import pandas as pd
import numpy as np
from os import path
import os
from config import threshold, test_size, saved_models_path, saved_vectorizers_path

## Data

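Weather-related tweets with confidence at or below the threshold are filtered out. Tweets labelled *Tweet not related to weather condition* are kept regardless of confidence.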
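
The filtering rule can be tried out on a toy frame before it is applied to the real data. This is only a sketch: the rows are made up, `THRESHOLD = 0.65` is the value quoted in the introduction (the notebook itself reads `threshold` from `config`), and storing confidence as strings is an assumption that mirrors the `astype(float)` call in the cell below.

```python
import pandas as pd

THRESHOLD = 0.65  # assumed value, taken from the introduction text

# Made-up rows; confidence is stored as text to mirror the astype(float) cast
toy = pd.DataFrame({
    "confidence": ["0.90", "0.60", "0.40", "0.65"],
    "sentiment": [
        "Positive",
        "Negative",
        "Tweet not related to weather condition",
        "Positive",
    ],
})

# Keep a row if it is confident enough OR if it belongs to the off-topic class
keep = (toy["confidence"].astype(float) > THRESHOLD) | (
    toy["sentiment"] == "Tweet not related to weather condition"
)
filtered = toy[keep]
print(filtered)
```

Only the first row (confident) and the third row (off-topic, low confidence) survive. The last row is dropped because the comparison is strict, matching the "Confidence <= 0.65 is filtered out" wording.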

In [21]:
weather_data = WeatherSentimentData('data', use_generated_data=True)
df = weather_data.full_data

# Keep confident tweets, plus every off-topic tweet regardless of confidence
df = df[(df.confidence.astype(float) > threshold) | (df.sentiment == 'Tweet not related to weather condition')]

print(df.sentiment.value_counts())
df[['confidence', 'sentiment', 'tweet_text']]

Tweet not related to weather condition          348
Positive                                        302
Negative                                        275
Neutral / author is just sharing information    230
Name: sentiment, dtype: int64


| | confidence | sentiment | tweet_text |
|---|---|---|---|
| 0 | 0.8439 | Positive | Grilling kabobs on the grill last night was am... |
| 1 | 0.6963 | Negative | The slowest day ever !! And the weather makes ... |
| 2 | 0.8802 | Neutral / author is just sharing information | Fire Weather Watch issued May 17 at 4:21PM CDT... |
| 3 | 0.6897 | Positive | Im going to lunch early today. The weather i... |
| 7 | 0.7987 | Negative | I hate this weather. Good day for a movie mara... |
| ... | ... | ... | ... |
| 1565 | 0.7486 | Negative | I'm so sick of this rain. It's ruining my mood... |
| 1567 | 0.7821 | Positive | Going skiing with my buddies. It's going to be... |
| 1568 | 0.7194 | Negative | This humidity is unbearable. I feel like I'm i... |
| 1569 | 0.6389 | Tweet not related to weather condition | Look at this cute puppy I found on the street.... |


### Text preprocessing

Text preprocessing consists of 5 stages:
- converting all letters to lowercase,
- removing all unnecessary elements (URLs, @mentions, non-words, digits, etc.),
- tokenizing the text,
- excluding stopwords,
- stemming.

The code can be found in [text-preprocessing.ipynb](text-preprocessing.ipynb) or in [TweetTextPreprocessor.py](src/TweetTextPreprocessor.py).
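
The five stages listed above can be sketched with the standard library alone. This is not the code of `TweetTextPreprocessor`: the stopword list, the `naive_stem` helper and the regular expressions are hypothetical stand-ins chosen to keep the example self-contained, and the real class presumably relies on NLTK resources instead.

```python
import re

# Tiny illustrative stopword list (an assumption, not the list the notebook uses)
STOPWORDS = {"the", "is", "a", "and", "this", "of", "to", "so"}


def naive_stem(token):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    text = text.lower()                                 # 1. lowercase
    text = re.sub(r"https?://\S+|@\w+", " ", text)      # 2. drop URLs and @mentions
    text = re.sub(r"[^a-z\s]", " ", text)               #    drop digits and punctuation
    tokens = text.split()                               # 3. tokenize on whitespace
    tokens = [t for t in tokens if t not in STOPWORDS]  # 4. exclude stopwords
    return [naive_stem(t) for t in tokens]              # 5. stem


tokens = preprocess("@bob The weather is AMAZING today!! http://t.co/xyz 24 degrees")
print(tokens)
```

The function returns a list of tokens, which is also why the notebook later joins each list back into a single string before vectorizing.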

In [22]:
preprocessor = TweetTextPreprocessor()
df['tweet_text'] = preprocessor.preprocess(df['tweet_text'])

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\mmakaranka\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\mmakaranka\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\mmakaranka\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\mmakaranka\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['tweet_text'] = preprocessor.preprocess(df['tweet_text'])
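
The `SettingWithCopyWarning` above is raised because `df` was produced by filtering another DataFrame and a column is then assigned on it. One common way to make the intent explicit is to take a copy right after filtering. The sketch below uses made-up data and the hypothetical names `raw` and `subset`; it is one possible approach, not a change the notebook itself makes.

```python
import pandas as pd

raw = pd.DataFrame({
    "confidence": [0.9, 0.4],
    "tweet_text": ["Sunny!", "Rain again"],
})

# Boolean indexing yields a new object that pandas may treat as a slice of `raw`;
# .copy() turns it into an independent frame, so later assignments are unambiguous.
subset = raw[raw["confidence"] > 0.65].copy()
subset["tweet_text"] = subset["tweet_text"].str.lower()

print(subset)
print(raw)  # the original frame is left untouched
```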


## Models setup

### Splitting data

The data is split into training and test sets. `X` consists of the preprocessed tweets and `y` holds the assigned sentiment classes.
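
To see what the split produces, the sketch below runs `train_test_split` on placeholder data with the same number of rows as the filtered dataset (1155). The value `test_size=0.1` is an assumption, since the notebook reads it from `config`; it is, however, consistent with the 116 test tweets that appear in the classification reports further down.

```python
from sklearn.model_selection import train_test_split

# Placeholder data with the same row count as the filtered dataset
toy_X = [f"tweet {i}" for i in range(1155)]
toy_y = [i % 4 for i in range(1155)]

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.1, random_state=42  # test_size is an assumed value
)
print(len(toy_X_train), len(toy_X_test))

# Fixing random_state makes the split reproducible across runs
_, toy_X_test_again, _, _ = train_test_split(
    toy_X, toy_y, test_size=0.1, random_state=42
)
print(toy_X_test == toy_X_test_again)
```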

In [23]:
df['tweet_text'] = df['tweet_text'].apply(lambda x: ' '.join(x)) # list -> string
X = df['tweet_text']
y = df['sentiment']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=42)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['tweet_text'] = df['tweet_text'].apply(lambda x: ' '.join(x)) # list -> string


### Vectorizing

In [24]:
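# TF-IDF over word n-grams (unigrams, bigrams, and trigrams)
tf_idf_ngram_vectorizer = TfidfVectorizer(ngram_range=(1, 3))
# Fit the vocabulary on the training set only, then reuse it for the test set
X_tf_idf_word_train = tf_idf_ngram_vectorizer.fit_transform(X_train)
X_tf_idf_word_test = tf_idf_ngram_vectorizer.transform(X_test)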

with open(path.join(saved_vectorizers_path, 'basic_vectorizer.pkl'), 'wb') as file:
    pickle.dump(tf_idf_ngram_vectorizer, file)

### Label encoding for XGBoost

Because the **XGBoost** classifier works with integer class labels rather than strings, we use label encoding to transform the Sentiment values into plain integers. We also save the label encoder to a file, as it will be needed later to map predictions back to class names.

In [25]:
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)

with open(path.join(saved_vectorizers_path, 'basic_label_encoder.pkl'), 'wb') as file:
    pickle.dump(label_encoder, file)
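
The round trip that the encoder provides can be checked on a few toy labels. The names `toy_labels`, `enc` and `codes` are made up for this sketch. `LabelEncoder` sorts the classes it sees, so the integer codes follow alphabetical order.

```python
from sklearn.preprocessing import LabelEncoder

toy_labels = [
    "Positive",
    "Negative",
    "Tweet not related to weather condition",
    "Negative",
]

enc = LabelEncoder()
codes = enc.fit_transform(toy_labels)       # strings -> integers
decoded = enc.inverse_transform(codes)      # integers -> original strings

print(list(enc.classes_))
print(codes.tolist())
print(decoded.tolist())
```

The same `inverse_transform` call is what turns XGBoost's integer predictions back into sentiment names later in the notebook.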

The vectorizer used throughout is **Term Frequency-Inverse Document Frequency** over word n-grams in the range of 1 to 3 (unigrams, bigrams, and trigrams).
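
A two-document toy corpus shows what an n-gram range of 1 to 3 produces. The documents and variable names below are made up for illustration. Each three-word document contributes three unigrams, two bigrams and one trigram, and the two documents share only the unigram `is`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["rain is bad", "sun is good"]

vec = TfidfVectorizer(ngram_range=(1, 3))
matrix = vec.fit_transform(docs)

print(sorted(vec.vocabulary_))  # every unigram, bigram and trigram that was seen
print(matrix.shape)             # (number of documents, number of distinct n-grams)

# n-grams that were never seen during fitting are silently ignored at transform time
unseen = vec.transform(["snow is bad"])
print(unseen.nnz)               # only "is", "bad" and "is bad" are known
```

This is also why the vectorizer is fitted on the training set only: the test set is encoded with the vocabulary learned from training data, exactly as new tweets would be.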

## Random forest

Initializing and fitting a random forest classifier:

In [26]:
rf = RandomForestClassifier(random_state=42)
rf.fit(X_tf_idf_word_train, y_train)

with open(os.path.join(saved_models_path, "rf_basic.pkl"), 'wb') as file:
    pickle.dump(rf, file)
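
A pickled model is only usable together with the vectorizer it was trained with, which is why both are saved. The sketch below checks that a restored pair reproduces the original predictions. It uses made-up training texts and round-trips through memory with `pickle.dumps`/`pickle.loads` instead of the files the notebook writes, so no paths are assumed.

```python
import pickle

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up training data, only to have something to fit
texts = ["love sunny day", "hate cold rain", "sunny and warm", "rain and wind"]
labels = ["Positive", "Negative", "Positive", "Negative"]

toy_vec = TfidfVectorizer(ngram_range=(1, 3))
toy_model = RandomForestClassifier(random_state=42)
toy_model.fit(toy_vec.fit_transform(texts), labels)

# In-memory round trip; the notebook uses pickle.dump to files instead
vec_restored = pickle.loads(pickle.dumps(toy_vec))
model_restored = pickle.loads(pickle.dumps(toy_model))

new_tweets = ["cold rain today"]
before = toy_model.predict(toy_vec.transform(new_tweets))
after = model_restored.predict(vec_restored.transform(new_tweets))
print(before, after)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.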

### Predictions

We predict classes for the test set and count the predictions per class, then report the test set accuracy and the classification report.

In [27]:
y_pred_rf = rf.predict(X_tf_idf_word_test)
pd.Series(y_pred_rf).value_counts()

Tweet not related to weather condition          41
Positive                                        29
Negative                                        27
Neutral / author is just sharing information    19
dtype: int64

In [28]:
"Accuracy", accuracy_score(y_test, y_pred_rf)

('Accuracy', 0.8448275862068966)

In [29]:
cr_rf = classification_report(y_test, y_pred_rf)
print(cr_rf)

                                              precision    recall  f1-score   support

                                    Negative       0.81      0.88      0.85        25
Neutral / author is just sharing information       1.00      0.83      0.90        23
                                    Positive       0.76      0.88      0.81        25
      Tweet not related to weather condition       0.85      0.81      0.83        43

                                    accuracy                           0.84       116
                                   macro avg       0.86      0.85      0.85       116
                                weighted avg       0.85      0.84      0.85       116
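
The `macro avg` and `weighted avg` rows of the report differ in how the per-class scores are combined: the macro average treats every class equally, while the weighted average weights each class by its support. The sketch below illustrates this for recall on made-up labels; it also shows that support-weighted recall is the same number as accuracy, which matches the 0.84 in both rows of the report above.

```python
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels: class "a" has support 3, class "b" has support 1
toy_true = ["a", "a", "a", "b"]
toy_pred = ["a", "a", "b", "b"]

acc = accuracy_score(toy_true, toy_pred)
macro = recall_score(toy_true, toy_pred, average="macro")        # mean of 2/3 and 1
weighted = recall_score(toy_true, toy_pred, average="weighted")  # (3 * 2/3 + 1 * 1) / 4

print(acc, macro, weighted)
```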



The same fitting, predicting and metrics gathering process will be repeated for both **XGBoost** and **SVM**.

## XGBoost

In [30]:
xgb = XGBClassifier(random_state=42)
xgb.fit(X_tf_idf_word_train, y_train_encoded)

with open(os.path.join(saved_models_path, "xgb_basic.pkl"), 'wb') as file:
    pickle.dump(xgb, file)

### Predictions

In [31]:
y_pred_xgb = xgb.predict(X_tf_idf_word_test)
y_pred_xgb = label_encoder.inverse_transform(y_pred_xgb)
pd.Series(y_pred_xgb).value_counts()

Tweet not related to weather condition          41
Positive                                        30
Negative                                        25
Neutral / author is just sharing information    20
dtype: int64

In [32]:
"Accuracy", accuracy_score(y_test, y_pred_xgb)

('Accuracy', 0.7844827586206896)

In [33]:
cr_xgb = classification_report(y_test, y_pred_xgb)
print(cr_xgb)

                                              precision    recall  f1-score   support

                                    Negative       0.84      0.84      0.84        25
Neutral / author is just sharing information       0.85      0.74      0.79        23
                                    Positive       0.67      0.80      0.73        25
      Tweet not related to weather condition       0.80      0.77      0.79        43

                                    accuracy                           0.78       116
                                   macro avg       0.79      0.79      0.79       116
                                weighted avg       0.79      0.78      0.79       116



## SVM

In [34]:
svm = SVC(random_state=42)
svm.fit(X_tf_idf_word_train, y_train)

with open(os.path.join(saved_models_path, "svm_basic.pkl"), 'wb') as file:
    pickle.dump(svm, file)

### Predictions

In [35]:
y_pred_svm = svm.predict(X_tf_idf_word_test)

In [36]:
pd.Series(y_pred_svm).value_counts()

Tweet not related to weather condition          52
Positive                                        28
Negative                                        23
Neutral / author is just sharing information    13
dtype: int64

In [37]:
"Accuracy", accuracy_score(y_test, y_pred_svm)

('Accuracy', 0.8103448275862069)

In [38]:
cr_svm = classification_report(y_test, y_pred_svm)
print(cr_svm)

                                              precision    recall  f1-score   support

                                    Negative       0.91      0.84      0.87        25
Neutral / author is just sharing information       1.00      0.57      0.72        23
                                    Positive       0.75      0.84      0.79        25
      Tweet not related to weather condition       0.75      0.91      0.82        43

                                    accuracy                           0.81       116
                                   macro avg       0.85      0.79      0.80       116
                                weighted avg       0.83      0.81      0.81       116
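
The three test accuracies reported above can be gathered in one place for a side-by-side view. The values below are copied from the cell outputs and rounded to four decimals; the variable names are made up for this sketch. Because they come from a single split with 116 test tweets, the ranking should be read as indicative rather than conclusive.

```python
import pandas as pd

# Accuracies copied from the outputs of cells 28, 32 and 37
accuracy = pd.Series(
    {"Random forest": 0.8448, "XGBoost": 0.7845, "SVM": 0.8103},
    name="test accuracy",
)

ranking = accuracy.sort_values(ascending=False)
print(ranking)
```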

