# Random Forest vs XGBoost vs SVM with complex classes based on confidence

This notebook evaluates the performance of three classifiers: **Random Forest**, **XGBoost** and **SVM**. The models use both **Sentiment** and sentiment **Confidence**: the confidence score splits the positive and negative tweets into high-confidence and low-confidence buckets. The target classes are:

- *Highly positive*: positive sentiment and confidence **> threshold**
- *Slightly positive*: positive sentiment and confidence **≤ threshold**
- *Highly negative*: negative sentiment and confidence **> threshold**
- *Slightly negative*: negative sentiment and confidence **≤ threshold**
- *Neutral / author is just sharing information*
- *Tweet not related to weather condition*

## Imports


In [187]:
from src.WeatherSentimentData import WeatherSentimentData
from src.TweetTextPreprocessor import TweetTextPreprocessor
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from xgboost import XGBClassifier
from sklearn.svm import SVC
from sklearn.preprocessing import LabelEncoder
import pickle
import pandas as pd
import numpy as np
from os import path
from config import threshold, test_size, saved_models_path, saved_vectorizers_path

## Data

In [188]:
weather_data = WeatherSentimentData('data', use_generated_data=True)
df = weather_data.full_data
print(df.sentiment.value_counts())
df[['confidence', 'sentiment', 'tweet_text']]

Negative                                        429
Positive                                        403
Neutral / author is just sharing information    391
Tweet not related to weather condition          348
Name: sentiment, dtype: int64


| | confidence | sentiment | tweet_text |
|---|---|---|---|
| 0 | 0.8439 | Positive | Grilling kabobs on the grill last night was am... |
| 1 | 0.6963 | Negative | The slowest day ever !! And the weather makes ... |
| 2 | 0.8802 | Neutral / author is just sharing information | Fire Weather Watch issued May 17 at 4:21PM CDT... |
| 3 | 0.6897 | Positive | Im going to lunch early today. The weather i... |
| 4 | 0.6153 | Neutral / author is just sharing information | Weekend Weather Causes Delays In I-270 Bridge ... |
| ... | ... | ... | ... |
| 1566 | 0.6012 | Neutral / author is just sharing information | @mention You're welcome. I'm glad you enjoyed ... |
| 1567 | 0.7821 | Positive | Going skiing with my buddies. It's going to be... |
| 1568 | 0.7194 | Negative | This humidity is unbearable. I feel like I'm i... |
| 1569 | 0.6389 | Tweet not related to weather condition | Look at this cute puppy I found on the street.... |


### Text preprocessing

Text preprocessing consists of 5 stages:
- converting all letters to lowercase,
- removing all unnecessary elements (URLs, @mentions, non-words, digits etc.),
- tokenizing the text,
- excluding stopwords,
- stemming.

The code can be found in [text-preprocessing.ipynb](text-preprocessing.ipynb) or in [TweetTextPreprocessor.py](src/TweetTextPreprocessor.py).
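The five stages can be sketched with the standard library alone. This is only an illustration, not the code in `TweetTextPreprocessor`: the stopword list, the `naive_stem` helper and the regular expressions are all assumptions made for the example, and a real pipeline would use a proper tokenizer, stopword corpus and stemmer.

```python
import re

# Hypothetical, deliberately tiny stopword list (assumption for this sketch).
STOPWORDS = {"the", "a", "an", "is", "are", "and", "to", "of", "in", "on", "it", "this", "so"}


def naive_stem(token):
    """Crude suffix stripping, a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    text = text.lower()                                   # 1. lowercase
    text = re.sub(r"https?://\S+|@\w+", " ", text)        # 2. drop URLs and @mentions
    text = re.sub(r"[^a-z\s]", " ", text)                 # 2. drop digits and other non-letters
    tokens = text.split()                                 # 3. tokenize on whitespace
    tokens = [t for t in tokens if t not in STOPWORDS]    # 4. exclude stopwords
    return [naive_stem(t) for t in tokens]                # 5. stem


example = preprocess("@mention The weather is AMAZING today!! 75 degrees http://t.co/abc")
print(example)  # ['weather', 'amaz', 'today', 'degree']
```

The function returns a list of tokens per tweet, which matches the list-to-string join performed later before vectorizing.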

In [189]:
preprocessor = TweetTextPreprocessor()
df['tweet_text'] = preprocessor.preprocess(df['tweet_text'])

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\mmakaranka\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\mmakaranka\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\mmakaranka\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\mmakaranka\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Models setup

### Data preparation

The Sentiment column is transformed so that tweets with *Positive* sentiment are split into *Highly positive* and *Slightly positive*, and tweets with *Negative* sentiment are split into *Highly negative* and *Slightly negative*.

In [190]:
# new classes
df['sentiment'] = df.apply(lambda row: "Highly positive" if (row['sentiment'] == 'Positive' and row["confidence"] > threshold) else row['sentiment'], axis=1)
df['sentiment'] = df.apply(lambda row: "Slightly positive" if (row['sentiment'] == 'Positive' and row["confidence"] <= threshold) else row['sentiment'], axis=1)
df['sentiment'] = df.apply(lambda row: "Highly negative" if (row['sentiment'] == 'Negative' and row["confidence"] > threshold) else row['sentiment'], axis=1)
df['sentiment'] = df.apply(lambda row: "Slightly negative" if (row['sentiment'] == 'Negative' and row["confidence"] <= threshold) else row['sentiment'], axis=1)
df['sentiment'].value_counts()

Neutral / author is just sharing information    391
Tweet not related to weather condition          348
Highly positive                                 302
Highly negative                                 275
Slightly negative                               154
Slightly positive                               101
Name: sentiment, dtype: int64
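The same bucketing can also be written as one vectorized step instead of four row-wise `apply` calls. The sketch below is an alternative formulation, not the notebook's code; the helper name `bucket_sentiment`, the toy data and the threshold value `0.7` are assumptions (the real threshold comes from `config`).

```python
import numpy as np
import pandas as pd


def bucket_sentiment(df, threshold):
    """Split Positive/Negative rows by confidence; leave other labels unchanged."""
    is_pos = (df["sentiment"] == "Positive").to_numpy()
    is_neg = (df["sentiment"] == "Negative").to_numpy()
    high = (df["confidence"] > threshold).to_numpy()
    conditions = [is_pos & high, is_pos & ~high, is_neg & high, is_neg & ~high]
    labels = ["Highly positive", "Slightly positive", "Highly negative", "Slightly negative"]
    # Rows matching none of the conditions keep their original label.
    original = df["sentiment"].to_numpy(dtype=object)
    return pd.Series(np.select(conditions, labels, default=original), index=df.index)


toy = pd.DataFrame({
    "sentiment": ["Positive", "Positive", "Negative", "Negative",
                  "Neutral / author is just sharing information"],
    "confidence": [0.9, 0.6, 0.8, 0.5, 0.7],
})
toy["sentiment"] = bucket_sentiment(toy, threshold=0.7)  # 0.7 is an assumed threshold
print(toy["sentiment"].tolist())
```

Because all conditions are evaluated against the original column in one pass, the result does not depend on the order in which the four classes are assigned.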

### Splitting data

The data is split into training and test sets. `X` consists of preprocessed tweets, `y` holds the assigned sentiment classes, and `w` keeps the sample weights (confidence).

In [191]:
df['tweet_text'] = df['tweet_text'].apply(lambda x: ' '.join(x)) # list -> string
X = df['tweet_text']
y = df['sentiment']
w = df['confidence']
X_train, X_test, y_train, y_test, w_train, w_test = train_test_split(X, y, w, test_size=test_size, random_state=42)
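Passing three arrays to `train_test_split` shuffles them together, so every tweet keeps its own label and weight. A small sketch on made-up data illustrates this; the toy frame and the `test_size=0.2` value are assumptions, since the notebook reads `test_size` from `config`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data standing in for the tweet frame (assumption for this sketch).
toy = pd.DataFrame({
    "tweet_text": [f"tweet {i}" for i in range(10)],
    "sentiment": ["Highly positive", "Highly negative"] * 5,
    "confidence": [0.5 + 0.05 * i for i in range(10)],
})

X_tr, X_te, y_tr, y_te, w_tr, w_te = train_test_split(
    toy["tweet_text"], toy["sentiment"], toy["confidence"],
    test_size=0.2,      # assumed value for the example
    random_state=42,
)

# The three test pieces share one row index, so text, label and weight stay aligned.
aligned = X_te.index.equals(y_te.index) and y_te.index.equals(w_te.index)
print(len(X_tr), len(X_te), aligned)
```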

### Vectorizing

The vectorizer of choice is **TF-IDF** (Term Frequency-Inverse Document Frequency) over word n-grams of length 1 to 3 (unigrams, bigrams, and trigrams). It is fitted on the training set only and then applied to the test set.

In [192]:
# ngram
tf_idf_ngram_vectorizer = TfidfVectorizer(ngram_range=(1, 3))
X_tf_idf_word_train = tf_idf_ngram_vectorizer.fit_transform(X_train)
X_tf_idf_word_test = tf_idf_ngram_vectorizer.transform(X_test)

with open(path.join(saved_vectorizers_path, 'complex_classes_vectorizer.pkl'), 'wb') as file:
    pickle.dump(tf_idf_ngram_vectorizer, file)
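To make the `ngram_range=(1, 3)` setting concrete, the sketch below enumerates the word n-grams that one preprocessed tweet would contribute as candidate terms. It only illustrates the n-gram expansion, not the TF-IDF weighting, and the helper `word_ngrams` and the sample tokens are hypothetical.

```python
def word_ngrams(tokens, min_n=1, max_n=3):
    """Return all contiguous word n-grams with min_n <= n <= max_n."""
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


features = word_ngrams("sunny warm day today".split())
print(features)
# ['sunny', 'warm', 'day', 'today',
#  'sunny warm', 'warm day', 'day today',
#  'sunny warm day', 'warm day today']
```

A four-token tweet already yields nine terms, which is why the n-gram vocabulary grows much faster than a unigram one on short texts.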

### Label encoding for XGBoost

The **XGBoost** classifier expects integer class labels, so the Sentiment values (strings) are label-encoded into plain integers. The label encoder is also saved to a file, because it is needed later to map predicted integers back to class names.

In [193]:
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)

with open(path.join(saved_vectorizers_path, 'complex_classes_label_encoder.pkl'), 'wb') as file:
    pickle.dump(label_encoder, file)
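A short round trip shows what the encoder does, using a few of the class names from this notebook as sample input. The label list is made up for the example.

```python
from sklearn.preprocessing import LabelEncoder

labels = ["Highly positive", "Slightly negative", "Highly negative", "Highly positive"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

print(list(encoder.classes_))                     # classes are stored in sorted order
print(encoded.tolist())                           # [1, 2, 0, 1]
print(list(encoder.inverse_transform(encoded)))   # back to the original strings
```

The integer assigned to each class is its position in the sorted `classes_` array, so the fitted encoder must be reused when decoding predictions.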

## Random Forest

Initializing and fitting the random forest classifier, with the tweet confidence passed as sample weights:

In [194]:
rf = RandomForestClassifier(random_state=42)
rf.fit(X_tf_idf_word_train, y_train, sample_weight=w_train)

with open(path.join(saved_models_path, "rf_complex_classes.pkl"), 'wb') as file:
    pickle.dump(rf, file)

### Predictions

Predictions are made for the test set and the predicted classes are counted, followed by the test set accuracy and the classification report.

In [195]:
y_pred_rf = rf.predict(X_tf_idf_word_test)
pd.Series(y_pred_rf).value_counts()

Tweet not related to weather condition          44
Highly positive                                 40
Neutral / author is just sharing information    30
Highly negative                                 25
Slightly negative                               18
Slightly positive                                1
dtype: int64

In [196]:
"Accuracy", accuracy_score(y_test, y_pred_rf)

('Accuracy', 0.5443037974683544)

In [197]:
cr_rf = classification_report(y_test, y_pred_rf)
print(cr_rf)

                                              precision    recall  f1-score   support

                             Highly negative       0.60      0.56      0.58        27
                             Highly positive       0.65      0.81      0.72        32
Neutral / author is just sharing information       0.53      0.52      0.52        31
                           Slightly negative       0.28      0.22      0.24        23
                           Slightly positive       1.00      0.07      0.13        14
      Tweet not related to weather condition       0.52      0.74      0.61        31

                                    accuracy                           0.54       158
                                   macro avg       0.60      0.49      0.47       158
                                weighted avg       0.57      0.54      0.52       158
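The *Slightly positive* row is worth a closer look. Only 1 tweet was predicted as *Slightly positive*, and a precision of 1.00 implies that this single prediction was correct, while the test set contains 14 such tweets. The arithmetic can be checked with a few lines; the helper `precision_recall_f1` is written for this example only.

```python
def precision_recall_f1(true_positives, predicted, actual):
    """Per-class precision, recall and F1 from raw counts."""
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Slightly positive: 1 correct prediction, 1 predicted in total, 14 in the test set.
p, r, f1 = precision_recall_f1(true_positives=1, predicted=1, actual=14)
print(round(p, 2), round(r, 2), round(f1, 2))  # 1.0 0.07 0.13
```

A perfect precision therefore says little here: the model finds only 1 of the 14 *Slightly positive* tweets.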



The same fitting, predicting and metrics gathering process will be repeated for both **XGBoost** and **SVM**.

## XGBoost

In [198]:
xgb = XGBClassifier(random_state=42)
xgb.fit(X_tf_idf_word_train, y_train_encoded, sample_weight=w_train)

with open(path.join(saved_models_path, "xgb_complex_classes.pkl"), 'wb') as file:
    pickle.dump(xgb, file)



### Predictions

In [199]:
y_pred_xgb = xgb.predict(X_tf_idf_word_test)
y_pred_xgb = label_encoder.inverse_transform(y_pred_xgb)
pd.Series(y_pred_xgb).value_counts()

Highly positive                                 47
Tweet not related to weather condition          37
Highly negative                                 29
Neutral / author is just sharing information    24
Slightly negative                               18
Slightly positive                                3
dtype: int64

In [200]:
"Accuracy", accuracy_score(y_test, y_pred_xgb)

('Accuracy', 0.5)

In [201]:
cr_xgb = classification_report(y_test, y_pred_xgb)
print(cr_xgb)

                                              precision    recall  f1-score   support

                             Highly negative       0.45      0.48      0.46        27
                             Highly positive       0.57      0.84      0.68        32
Neutral / author is just sharing information       0.62      0.48      0.55        31
                           Slightly negative       0.28      0.22      0.24        23
                           Slightly positive       0.00      0.00      0.00        14
      Tweet not related to weather condition       0.51      0.61      0.56        31

                                    accuracy                           0.50       158
                                   macro avg       0.41      0.44      0.42       158
                                weighted avg       0.46      0.50      0.47       158



## SVM

In [202]:
svm = SVC(random_state=42)
svm.fit(X_tf_idf_word_train, y_train, sample_weight=w_train)

with open(path.join(saved_models_path, "svm_complex_classes.pkl"), 'wb') as file:
    pickle.dump(svm, file)

### Predictions

In [203]:
y_pred_svm = svm.predict(X_tf_idf_word_test)
pd.Series(y_pred_svm).value_counts()

Tweet not related to weather condition          59
Highly positive                                 52
Neutral / author is just sharing information    28
Highly negative                                 19
dtype: int64

In [204]:
"Accuracy", accuracy_score(y_test, y_pred_svm)

('Accuracy', 0.5379746835443038)

In [206]:
cr_svm = classification_report(y_test, y_pred_svm)
print(cr_svm)

                                              precision    recall  f1-score   support

                             Highly negative       0.74      0.52      0.61        27
                             Highly positive       0.56      0.91      0.69        32
Neutral / author is just sharing information       0.64      0.58      0.61        31
                           Slightly negative       0.00      0.00      0.00        23
                           Slightly positive       0.00      0.00      0.00        14
      Tweet not related to weather condition       0.41      0.77      0.53        31

                                    accuracy                           0.54       158
                                   macro avg       0.39      0.46      0.41       158
                                weighted avg       0.44      0.54      0.47       158
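The SVM never predicts either *Slightly* class, so precision for those classes is undefined (no predicted samples) and is reported as 0.00. scikit-learn's `classification_report` accepts a `zero_division` argument that makes this choice explicit. The sketch below uses made-up labels to show the behaviour and is not a rerun of the notebook's evaluation.

```python
from sklearn.metrics import classification_report

# Toy labels (assumption): the model never predicts "Slightly positive".
y_true = ["Highly positive", "Slightly positive", "Highly positive", "Slightly positive"]
y_pred = ["Highly positive", "Highly positive", "Highly positive", "Highly positive"]

# zero_division=0 reports undefined precision as 0.0 instead of emitting a warning.
report = classification_report(y_true, y_pred, zero_division=0, output_dict=True)

print(report["Slightly positive"]["precision"])  # 0.0
print(report["Highly positive"]["precision"])    # 0.5
print(report["Highly positive"]["recall"])       # 1.0
```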



