<a href="https://colab.research.google.com/github/jagrutimohanty/CrossDomain-Realtime-FineGrained-Twitter-Senitment-Analysis/blob/main/Take_the_Pulse_Multiclass_SVM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# **TAKE THE PULSE - MULTICLASS SVM**

When dealing with high-dimensional spaces, it's hard to beat SVMs. They remain effective when the number of features is large, and kernel functions make them versatile in the shape of the decision boundary they can learn. They also need only a subset of the training data (the support vectors) to define the decision boundary.
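
As a rough illustration of that last point, the sketch below fits an RBF SVC on a small synthetic dataset (generated with `make_classification`, not our tweet data) and counts how many training points end up as support vectors. The dataset size and the names `X_toy`, `y_toy`, and `toy_svc` are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic, fairly well-separated two-class data (illustrative only)
X_toy, y_toy = make_classification(n_samples=500, n_features=20,
                                   n_informative=10, class_sep=2.0,
                                   random_state=0)
toy_svc = SVC(kernel='rbf', gamma='scale').fit(X_toy, y_toy)

# support_ holds the indices of the training points that define the boundary
n_support = len(toy_svc.support_)
print(n_support, "of", len(X_toy), "training points are support vectors")
```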

For our task, we have vectors of shape (768,) from BERT. We then apply dimensionality reduction to get them down to (512,) while retaining ~99% of the original variance. We have ~180000 rows of data and are looking for boundaries that split it into five sentiment classes; that's where multiclass SVMs come in.

While SVMs were designed for binary classification tasks and not for multiclass classification tasks, there are workarounds to get multiclass predictions. We experimented with two of these in our project.





## One-vs-Rest Classifiers

This involves splitting the multi-class dataset into multiple binary classification problems. A binary classifier is then trained on each of these problems separately, and predictions are made using the model that is most confident.

For instance we have five sentiments that we're trying to predict in our data: Very Positive, Positive, Neutral, Negative and Very Negative.

This can be split into five binary classification problems as follows
* Very Positive Classification
> Very Positive vs [*Positive, Neutral, Negative, Very Negative*]

* Positive Classification
> Positive vs [*Very Positive, Neutral, Negative, Very Negative*]

* Neutral Classification
> Neutral vs [*Very Positive, Positive, Negative, Very Negative*]

* Negative Classification
> Negative vs [*Very Positive, Positive, Neutral, Very Negative*]

* Very Negative Classification
> Very Negative vs [*Very Positive, Positive, Neutral, Negative*]

This means that our model creates five classifiers. For each phrase in our training data, or each tweet at test time, every classifier produces a class membership score, and the argmax of these scores is the predicted class.

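We used two such variants of OvR on our training corpus: sklearn's `SVC` with its default `decision_function_shape='ovr'`, and the `OneVsRestClassifier` wrapper around an SVC model. Note that `SVC` always trains one-vs-one classifiers internally; the `'ovr'` setting only converts their outputs into a one-vs-rest style decision function.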
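
The one-vs-rest procedure described above can be sketched with sklearn's `OneVsRestClassifier` on synthetic five-class data. The toy dataset and the names `X_toy`, `y_toy`, and `ovr_toy` are assumptions for illustration, not the models we trained.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Synthetic five-class data standing in for the five sentiment classes
X_toy, y_toy = make_classification(n_samples=600, n_features=20,
                                   n_informative=10, n_classes=5,
                                   n_clusters_per_class=1, random_state=0)

ovr_toy = OneVsRestClassifier(SVC(kernel='rbf', gamma='scale'))
ovr_toy.fit(X_toy, y_toy)

# One binary classifier is fitted per class
print(len(ovr_toy.estimators_))

# Each binary classifier scores every sample; the highest score wins
scores = ovr_toy.decision_function(X_toy)   # shape (n_samples, n_classes)
manual_pred = ovr_toy.classes_[np.argmax(scores, axis=1)]
print(np.mean(manual_pred == ovr_toy.predict(X_toy)))
```

Taking the argmax over the per-class decision scores by hand should agree with `predict`, barring exact ties between scores.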


Make sure we're in the right directory


In [None]:
import os 

DIR = '/content/drive/Shareddrives/255/Project'
if os.getcwd() != DIR:
  os.chdir(DIR)

Read in our data. We join the two dataframes, dictionary and sentiment labels, on the phrase id and use the combined dataframe to test our models. We also load the BERT encodings of our phrases, which we'll pass as input to our models.

In [None]:
import pandas as pd
import numpy as np

data = pd.read_csv('./cleaned_data/dictionary.csv')
sentiment = pd.read_csv('./cleaned_data/sentiment_labels.txt', sep="|")
combined = pd.merge(data, 
                    sentiment, 
                    how='inner', 
                    left_on='id', 
                    right_on='phrase ids')
df = combined[['phrase', 'sentiment values']]
encodings = np.loadtxt('./cleaned_data/rt_bert_encodings.csv', delimiter=',')

This is a helper function that maps the sentiment probabilities back to the five classes that we're predicting

In [None]:
def assign_sentiment(val):
  if val <= 0.2:
    return 0
  elif val <= 0.4:
    return 1
  elif val <= 0.6:
    return 2
  elif val <= 0.8:
    return 3
  return 4

We'll add our new sentiment column to our dataframe

In [None]:
labels = list(map(assign_sentiment, df['sentiment values'].to_numpy()))
df = df.assign(labels=labels)
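
For larger arrays, the same binning could be done in one vectorised call. The sketch below uses `np.digitize` with `right=True`, which reproduces the `<=` thresholds of `assign_sentiment`. The sample values and the name `assign_sentiment_vectorised` are hypothetical.

```python
import numpy as np

def assign_sentiment_vectorised(values):
    # Bin edges match the thresholds in assign_sentiment;
    # right=True makes each interval closed on the right, e.g. (0.2, 0.4]
    return np.digitize(values, bins=[0.2, 0.4, 0.6, 0.8], right=True)

sample = np.array([0.0, 0.2, 0.25, 0.5, 0.8, 0.95])
binned = assign_sentiment_vectorised(sample)
print(binned)  # → [0 0 1 2 3 4]
```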

Dimensionality reduction on the BERT encodings to get vectors of shape (512,)

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=512)
labels = np.array(labels)
encodings_512 = pca.fit_transform(encodings)  # PCA is unsupervised, so labels are not needed
print(np.sum(pca.explained_variance_ratio_))

0.991135501545363
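
An alternative to fixing the number of components is to let PCA pick the smallest number that reaches a target variance. This is a minimal sketch on random low-rank data (not our BERT encodings), assuming a 99% target; all names and sizes here are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 1000 samples in 64 dims whose variance lies mostly in 10 directions
latent = rng.normal(size=(1000, 10))
mixing = rng.normal(size=(10, 64))
X_toy = latent @ mixing + 0.01 * rng.normal(size=(1000, 64))

# A float in (0, 1) asks PCA to keep just enough components
# to explain at least that fraction of the variance
pca_toy = PCA(n_components=0.99, svd_solver='full')
X_reduced = pca_toy.fit_transform(X_toy)

explained = np.sum(pca_toy.explained_variance_ratio_)
print(X_reduced.shape[1], explained)
```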


Split our data to get a test sample that we can evaluate our models on. We'll also load our three trained classifiers in this cell.



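> All of these models were trained with the same split parameters as below (`test_size=0.2`, `random_state=0`). However, due to the quadratic runtime complexity of SVMs, coupled with the 24-hour limit on Google Colab runtimes, training used only the first 80000 rows (about half of the data), so the models were trained on 64000 rows. Note that the cell below splits the full dataset, so its test sample is not the held-out 16000 rows from training and may include rows the models were trained on.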

> Hyperparameter tuning was done through sklearn's `GridSearchCV` to obtain an optimal set of parameters for the models.

> Due to the imbalance in our training data, i.e. we have a lot of neutral, positive and negative tweets but very few very positive and very negative tweets, the models are *balanced*: they use the y values to automatically adjust class weights inversely proportional to class frequencies in the input data.
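
The *balanced* heuristic weights each class by `n_samples / (n_classes * count(class))`. The sketch below checks this on a small made-up label array using sklearn's `compute_class_weight`; the label counts are hypothetical and only meant to mimic an imbalance like ours.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: few samples in classes 0 and 4
y_toy = np.array([0] * 5 + [1] * 20 + [2] * 50 + [3] * 20 + [4] * 5)
classes = np.unique(y_toy)

weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_toy)

# Same weights computed by hand from the class counts
manual = len(y_toy) / (len(classes) * np.bincount(y_toy))
print(dict(zip(classes.tolist(), weights.tolist())))
```

Rare classes receive larger weights, so misclassifying them costs more during training.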



In [None]:
from sklearn.model_selection import train_test_split
from joblib import load

_, X_test, _, y_test = train_test_split(encodings_512, 
                                        labels, 
                                        test_size=0.2, 
                                        random_state=0)
clf = load('./svm_models/rbf_80000_5_balanced.joblib')
ovr_clf = load('./svm_models/ovr_rbf_80000_5_balanced.joblib')
gcv_clf = load('./svm_models/rbf_80000_5_balanced_gcv.joblib')

Evaluate the models. The F1 score and the confusion matrix are key here since our dataset is imbalanced. We also wrote a per-class accuracy function to see how well the models do on each of the classes.

Sklearn's SVC implementation 

In [None]:
labels_map = {
    '0': 'Very Negative =====>',
    '1': 'Negative ==========>',
    '2': 'Neutral ===========>',
    '3': 'Positive ==========>',
    '4': 'Very Positive =====>'
}

def accuracy_per_class():
  # Per-class accuracy (recall), using the globals y_test and y_pred
  y_true, y_hat = np.asarray(y_test), np.asarray(y_pred)
  for label in range(5):
    mask = y_true == label  # samples whose true class is `label`
    print(labels_map[str(label)], np.mean(y_hat[mask] == label))

In [None]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score, accuracy_score

y_pred = clf.predict(X_test)
accuracy_per_class()
print("f1 score: ", f1_score(y_test, y_pred, average='weighted'))
print("accuracy: ", accuracy_score(y_test, y_pred))
confusion_matrix(y_test, y_pred)

Very Negative =====> 0.6061053668143771
Very Positive =====> 0.6671652954375468
f1 score:  0.6440708727297854
accuracy:  0.6404157291331833


array([[ 1231,   633,   129,    32,     6],
       [  962,  4574,  1409,   315,    27],
       [  286,  2738, 11620,  2140,   165],
       [   67,   529,  1853,  4699,  1243],
       [    8,    62,   117,   703,  1784]])
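
The per-class accuracy printed above is the recall of each class, so it can also be read off the confusion matrix as the diagonal divided by the row sums. A short sketch using the matrix shown above (the variable names `cm`, `per_class_acc`, and `overall_acc` are ours):

```python
import numpy as np

# Confusion matrix from the SVC evaluation above (rows: true class, cols: predicted)
cm = np.array([[ 1231,   633,   129,    32,     6],
               [  962,  4574,  1409,   315,    27],
               [  286,  2738, 11620,  2140,   165],
               [   67,   529,  1853,  4699,  1243],
               [    8,    62,   117,   703,  1784]])

per_class_acc = cm.diagonal() / cm.sum(axis=1)  # recall per true class
overall_acc = cm.trace() / cm.sum()             # matches accuracy_score
print(per_class_acc, overall_acc)
```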

Sklearn's SVC implementation with One vs Rest Classifier wrapped around it

In [None]:
y_pred = ovr_clf.predict(X_test)
accuracy_per_class()
print("f1 score: ", f1_score(y_test, y_pred, average='weighted'))
print("accuracy: ", accuracy_score(y_test, y_pred))
confusion_matrix(y_test, y_pred)

Very Negative =====> 0.3495814869522403
Very Positive =====> 0.4760658189977562
f1 score:  0.6711504005013004
accuracy:  0.6724793742633666


array([[  710,  1162,   128,    29,     2],
       [  313,  5140,  1551,   268,    15],
       [   88,  2399, 12375,  2017,    70],
       [   14,   418,  1827,  5607,   525],
       [    1,    46,   103,  1251,  1273]])

Sklearn's SVC implementation with hyperparameters tuned through GridSearchCV

In [None]:
y_pred = gcv_clf.predict(X_test)
accuracy_per_class()
print("f1 score: ", f1_score(y_test, y_pred, average='weighted'))
print("accuracy: ", accuracy_score(y_test, y_pred))
confusion_matrix(y_test, y_pred)

Very Negative =====> 0.5790251107828656
Very Positive =====> 0.6406133133881825
f1 score:  0.6827167160695405
accuracy:  0.6811850423229401


array([[ 1176,   690,   138,    25,     2],
       [  712,  4848,  1522,   189,    16],
       [  220,  2227, 12471,  1922,   109],
       [   33,   326,  1846,  5222,   964],
       [    5,    34,   129,   793,  1713]])

## One vs One Classifiers

Like OvR, OvO splits a multi-class classification problem into binary classification problems. Here, each class is pitted against every other class individually, so each classifier is trained only on the data from its two classes.

For example in our case we split it as follows:
* Classifier 1 - *Very Positive vs Positive*
* Classifier 2 - *Very Positive vs Neutral*
* Classifier 3 - *Very Positive vs Negative*
* Classifier 4 - *Very Positive vs Very Negative*
* Classifier 5 - *Positive vs Negative*
* Classifier 6 - *Positive vs Neutral*
* Classifier 7 - *Positive vs Very Negative*
* Classifier 8 - *Neutral vs Negative*
* Classifier 9 - *Neutral vs Very Negative*
* Classifier 10 - *Negative vs Very Negative*

So we have ten binary classifiers in total. Each classifier votes for one class label, and the label with the most votes is predicted.

Total Number of Classifiers = `(NumClasses * (NumClasses - 1)) / 2`
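
The pairing and the classifier count can be checked with a few lines of Python. This sketch only enumerates the class pairs; it does not train anything, and the variable names are illustrative.

```python
from itertools import combinations

sentiments = ['Very Positive', 'Positive', 'Neutral', 'Negative', 'Very Negative']

# Every unordered pair of classes gets its own binary classifier
pairs = list(combinations(sentiments, 2))
for i, (a, b) in enumerate(pairs, start=1):
    print(f"Classifier {i} - {a} vs {b}")

n = len(sentiments)
n_classifiers = n * (n - 1) // 2
print(len(pairs) == n_classifiers)  # → True
```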





The cells below follow the same procedure as above for ovr to load and evaluate our ovo classifier

In [None]:
ovo_clf = load('./svm_models/rbf_80000_5_balanced_gcv_ovo.joblib')
y_pred = ovo_clf.predict(X_test)
accuracy_per_class()
print("f1 score: ", f1_score(y_test, y_pred, average='weighted'))
print("accuracy: ", accuracy_score(y_test, y_pred))
confusion_matrix(y_test, y_pred)

Very Negative =====> 0.5839487936976858
Very Positive =====> 0.6637995512341062
f1 score:  0.6846970183227786
accuracy:  0.6828726025929498


array([[ 1186,   689,   133,    20,     3],
       [  755,  4824,  1505,   193,    10],
       [  204,  2248, 12417,  1974,   106],
       [   25,   322,  1774,  5291,   979],
       [    2,    36,   122,   739,  1775]])

## Appendix

### Training

All of the above models followed the process below during training.

In [None]:
# WARNING: This cell takes a long time to run 
# even when using GPUs and High RAM Runtimes
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from joblib import dump

X_train, X_test, y_train, y_test = train_test_split(encodings_512[:80000], 
                                                    labels[:80000], 
                                                    test_size=0.2, 
                                                    random_state=0)
# Optimal SVC hyperparameters obtained through GridSearchCV
clf = make_pipeline(
    StandardScaler(), 
    SVC(C=10, 
        gamma=0.001, 
        random_state=0, 
        class_weight='balanced'))
clf.fit(X_train, y_train)
dump(clf, './svm_models/rbf_80000_5_balanced_gcv.joblib')

### Hyperparameter Tuning

Sklearn's `GridSearchCV` came in handy for automating the search for optimal hyperparameters.

In [None]:
from sklearn.model_selection import GridSearchCV

# defining parameter range 
clf = make_pipeline(StandardScaler(),
                    SVC(random_state=0, class_weight='balanced'))
# Pipeline parameters need the step name as a prefix (svc__)
param_grid = {'svc__C': [0.1, 1, 10, 100, 1000],
              'svc__gamma': [1, 0.1, 0.01, 0.001, 0.0001],
              'svc__kernel': ['rbf']}

grid = GridSearchCV(clf, param_grid, refit=True, verbose=3)
  
# fitting the model for grid search 
grid.fit(X_train, y_train)
print(grid.best_params_) 
print(grid.best_estimator_) 