# SVM Model

According to a [Stanford paper](https://nlp.stanford.edu/courses/cs224n/2009/fp/14.pdf), SVM classifiers perform well when classifying the subjectivity of sentences within the same domain. They are also effective, though less so, at classifying the polarity of subjective sentences in the same domain. We now apply an SVM to our reviews data.

We previously transformed the reviews data into a bag-of-words model. Here we use a `TfidfVectorizer` instead, which is equivalent to applying `CountVectorizer` followed by `TfidfTransformer`.
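
The equivalence between the one-step and two-step approaches can be checked on a small example. The sketch below uses a made-up three-sentence corpus (not the reviews data) and default vectorizer settings, so it only illustrates the idea rather than the exact configuration used later.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Toy corpus (hypothetical, not taken from the reviews data).
docs = [
    "works great for grocery shopping",
    "great idea, works well",
    "not great, keeps crashing",
]

# One step: tokenize, count and weight with TF-IDF.
one_step = TfidfVectorizer().fit_transform(docs)

# Two steps: raw term counts first, then TF-IDF weighting.
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# Both routes should give the same document-term matrix.
same = np.allclose(one_step.toarray(), two_step.toarray())
print(same)
```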

In [1]:
import numpy as np
import pandas as pd

df = pd.read_csv('../data/googleplaystore_user_reviews.csv')
df.head()

Unnamed: 0,App,Translated_Review,Sentiment,Sentiment_Polarity,Sentiment_Subjectivity
0,10 Best Foods for You,I like eat delicious food. That's I'm cooking ...,Positive,1.0,0.533333
1,10 Best Foods for You,This help eating healthy exercise regular basis,Positive,0.25,0.288462
2,10 Best Foods for You,,,,
3,10 Best Foods for You,Works great especially going grocery store,Positive,0.4,0.875
4,10 Best Foods for You,Best idea us,Positive,1.0,0.3


Minor preprocessing on the dataset

In [39]:
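# Treat literal "NaN" strings as missing values, then drop incomplete and duplicate rows.
df["Translated_Review"] = df["Translated_Review"].replace("NaN", np.nan)
df = df.dropna().drop_duplicates().reset_index(drop=True)
# `reviews` is not defined in this section; it is assumed to come from an earlier cell.
len(reviews)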

27994

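Split the data into training and test sets. Note that `train_size=0.33` keeps only a third of the reviews for training; the remaining two thirds (19,894 reviews, the support total in the reports below) form the test set.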

In [40]:
from sklearn.model_selection import train_test_split

X = df["Translated_Review"]
y = df["Sentiment"]

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.33, random_state=42)
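
To make the effect of `train_size=0.33` concrete, here is a minimal sketch on 100 dummy samples (hypothetical data, not the reviews). It shows that the fraction passed as `train_size` goes to the training set and the rest goes to the test set.

```python
from sklearn.model_selection import train_test_split

# 100 dummy samples with dummy labels (hypothetical data).
samples = list(range(100))
labels = [i % 3 for i in samples]

train, test, train_labels, test_labels = train_test_split(
    samples, labels, train_size=0.33, random_state=42
)

# A third of the samples land in the training set, the rest in the test set.
print(len(train), len(test))
```

If a larger training set is wanted, `test_size=0.33` would reverse the proportions; the results reported below were produced with `train_size=0.33`.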

In [41]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(sublinear_tf = True, smooth_idf=True, ngram_range=(1,2))
# Fit the vocabulary and IDF weights on the training split only.
vectorizer.fit(X_train)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.float64'>, encoding='utf-8',
                input='content', lowercase=True, max_df=1.0, max_features=None,
                min_df=1, ngram_range=(1, 2), norm='l2', preprocessor=None,
                smooth_idf=True, stop_words=None, strip_accents=None,
                sublinear_tf=True, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, use_idf=True, vocabulary=None)

In [43]:
train_vectors = vectorizer.fit_transform(X_train)
test_vectors = vectorizer.transform(X_test)

Classifying the data using an SVM with a linear kernel

In [60]:
import time
from sklearn import svm
from sklearn.metrics import classification_report

classifier_linear = svm.SVC(kernel='linear')
t0 = time.time()
classifier_linear.fit(train_vectors, y_train)
t1 = time.time()
prediction_linear = classifier_linear.predict(test_vectors)
t2 = time.time()
time_linear_train = t1-t0
time_linear_predict = t2-t1

print("Training time: %fs; Prediction time: %fs" % (time_linear_train, time_linear_predict))

Training time: 33.102018s; Prediction time: 42.944373s


In [61]:
linear_report = classification_report(y_test, prediction_linear)
print("Classification Report for SVM Linear Kernel")
print(linear_report)

Classification Report for SVM Linear Kernel
              precision    recall  f1-score   support

    Negative       0.80      0.69      0.74      4294
     Neutral       0.86      0.50      0.64      2925
    Positive       0.84      0.96      0.89     12675

    accuracy                           0.83     19894
   macro avg       0.83      0.72      0.76     19894
weighted avg       0.83      0.83      0.82     19894



Now we try an SVM with an RBF kernel

In [62]:
classifier_rbf = svm.SVC(kernel='rbf', gamma='scale')
t0 = time.time()
classifier_rbf.fit(train_vectors, y_train)
t1 = time.time()
prediction = classifier_rbf.predict(test_vectors)
t2 = time.time()
time_train = t1-t0
time_predict = t2-t1

print("Training time: %fs; Prediction time: %fs" % (time_train, time_predict))

Training time: 52.562992s; Prediction time: 56.747367s


In [63]:
rbf_report = classification_report(y_test, prediction)
print("Classification Report for SVM RBF Kernel")
print(rbf_report)

Classification Report for SVM RBF Kernel
              precision    recall  f1-score   support

    Negative       0.90      0.45      0.60      4294
     Neutral       0.96      0.18      0.31      2925
    Positive       0.73      0.99      0.84     12675

    accuracy                           0.75     19894
   macro avg       0.86      0.54      0.58     19894
weighted avg       0.80      0.75      0.71     19894



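The classification reports give four readings per class, where tp, fp and fn are the counts of true positives, false positives and false negatives for that class:
1. Precision = tp / (tp + fp)
2. Recall = tp / (tp + fn)
3. f1-score = 2 * (precision * recall) / (precision + recall)
4. Support: the number of occurrences of each class in `y_true`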
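
The four readings can be reproduced by hand for a single class. The sketch below uses a tiny made-up set of labels and a hypothetical helper, `precision_recall_f1`, that treats one class as "positive" and everything else as "negative".

```python
def precision_recall_f1(y_true, y_pred, label):
    """Compute precision, recall, f1-score and support for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    support = tp + fn  # number of true occurrences of the class
    return precision, recall, f1, support


# Toy labels (hypothetical, not model output).
y_true = ["Positive", "Positive", "Positive", "Negative", "Negative", "Neutral"]
y_pred = ["Positive", "Positive", "Negative", "Negative", "Positive", "Positive"]

precision, recall, f1, support = precision_recall_f1(y_true, y_pred, "Positive")
print(precision, recall, f1, support)
```

For the "Positive" class here, tp = 2, fp = 2 and fn = 1, so precision is 2/4, recall is 2/3 and the f1-score is 4/7.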

From the classification reports, we can see that:
- Precision: the RBF kernel has higher precision for the Negative (0.90 vs 0.80) and Neutral (0.96 vs 0.86) classes, but the linear kernel has higher precision for the Positive class (0.84 vs 0.73)
- Recall: the RBF kernel has lower recall for the Negative (0.45 vs 0.69) and Neutral (0.18 vs 0.50) classes, but slightly higher recall for the Positive class (0.99 vs 0.96)
- f1-score: the linear kernel achieves a better f1-score than the RBF kernel for all 3 classes (0.74, 0.64 and 0.89 vs 0.60, 0.31 and 0.84)

Following the Cross Validated discussion [here](https://stats.stackexchange.com/questions/49226/how-to-interpret-f-measure-values) on interpreting F-measure values: our use case does not emphasise correctly labelling any particular type of sentiment, so the f1-score is the crucial factor to take into account. We should therefore proceed with the linear kernel SVM in this case.
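
As a quick check on this choice, the unweighted (macro) average of the per-class f1-scores can be recomputed from the numbers printed in the two reports. The values below are copied by hand from those reports; `macro_average` is a hypothetical helper name.

```python
# Per-class f1-scores copied from the two classification reports above.
linear_f1 = {"Negative": 0.74, "Neutral": 0.64, "Positive": 0.89}
rbf_f1 = {"Negative": 0.60, "Neutral": 0.31, "Positive": 0.84}


def macro_average(scores):
    """Unweighted mean over classes, so every class counts equally."""
    return sum(scores.values()) / len(scores)


macro_linear = macro_average(linear_f1)
macro_rbf = macro_average(rbf_f1)
print(f"linear: {macro_linear:.2f}, rbf: {macro_rbf:.2f}")
# prints "linear: 0.76, rbf: 0.58", matching the "macro avg" rows of the reports
```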

In [65]:
# Incomplete: plot the difference in scores, if necessary, using these dicts

dict_linear = classification_report(y_test, prediction_linear,  output_dict=True)
dict_rbf = classification_report(y_test, prediction, output_dict=True)
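
One possible first step towards that comparison is a per-class table of score differences. The sketch below is an assumption about how this could be done, not part of the original analysis: it uses hand-copied values from the printed reports, arranged in the same nested layout that `output_dict=True` produces, and a hypothetical helper `score_differences`.

```python
# Hand-copied subset of the printed reports, in the nested
# {class: {metric: value}} layout of classification_report(..., output_dict=True).
linear_scores = {
    "Negative": {"precision": 0.80, "recall": 0.69, "f1-score": 0.74},
    "Neutral": {"precision": 0.86, "recall": 0.50, "f1-score": 0.64},
    "Positive": {"precision": 0.84, "recall": 0.96, "f1-score": 0.89},
}
rbf_scores = {
    "Negative": {"precision": 0.90, "recall": 0.45, "f1-score": 0.60},
    "Neutral": {"precision": 0.96, "recall": 0.18, "f1-score": 0.31},
    "Positive": {"precision": 0.73, "recall": 0.99, "f1-score": 0.84},
}


def score_differences(report_a, report_b, classes, metrics):
    """Return report_a minus report_b for every class and metric."""
    return {
        cls: {m: round(report_a[cls][m] - report_b[cls][m], 2) for m in metrics}
        for cls in classes
    }


classes = ["Negative", "Neutral", "Positive"]
metrics = ["precision", "recall", "f1-score"]
diff = score_differences(linear_scores, rbf_scores, classes, metrics)

# Positive values mean the linear kernel scored higher than the RBF kernel.
for cls in classes:
    print(cls, diff[cls])
```

The same function should work on `dict_linear` and `dict_rbf` directly, since they share this layout for the three class keys; the resulting differences could then be passed to a bar chart.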