# Sentiment Prediction


### Building the model


- Now that we have the preprocessed data, we'll first convert the **reviews** to **TfIdf** vectors to use as features, since a model can't be trained on plain text.


- **TfIdf** scores are word-frequency weights that highlight the more informative words, i.e. words that are frequent within a document but rare across documents. For the same count within a document, the rarer a term is across the corpus, the higher its TfIdf score.


- We'll start with some basic models like Naive Bayes and Logistic Regression, and then move on to more complex ones like Random Forests, SVM, Decision Trees and Extreme Gradient Boosting.


- To measure the performance of the models, we'll mainly focus on the F1 score, but we'll also look at the other metrics: accuracy, precision, and recall. F1 is just the harmonic mean of precision and recall, so on its own it can hide a large gap between the two and give a false picture of the model.


- We need one more step, as we haven't yet converted the text-based labels, **Positive/Negative**, to 1 and 0, so we'll do that before splitting the dataset.


- Also, as we have 50,000 datapoints, we can use 80 to 85% of the data for training.
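
To make the caveat about F1 concrete, here is a minimal sketch of the harmonic mean with two made-up precision/recall pairs (the numbers are illustrative assumptions, not results from this notebook, and `f1_from` is a hypothetical helper). Both pairs land at roughly the same F1 even though one model is far more lopsided, which is why we also print precision and recall separately.

```python
def f1_from(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative values only (not taken from the models trained below).
balanced = f1_from(0.80, 0.80)   # precision and recall agree
lopsided = f1_from(0.99, 0.67)   # large gap between precision and recall

print(round(balanced, 3), round(lopsided, 3))
```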

In [1]:
from warnings import filterwarnings
filterwarnings('ignore')

import time
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score

In [2]:
df = pd.read_csv('../data/final_data.csv')
df.head()

Unnamed: 0,sentiment,cleaned_review
0,positive,one of the other reviewer have mention that af...
1,positive,a wonderful little production the filming t...
2,positive,think this be a wonderful way to spend time o...
3,negative,basically there s a family where a little boy ...
4,positive,Petter Matteis Love in the Time of Money be a ...


In [3]:
X = df['cleaned_review']
# Encode the text labels as integers: positive -> 1, negative -> 0
y = df['sentiment'].map({'positive': 1, 'negative': 0})

Now that the data is ready to be split into training and testing sets, we'll first convert the reviews to Tf-Idf vectors.
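
Before running the vectorizer on the real reviews, a tiny hand-rolled sketch may help show what a Tf-Idf weight looks like. It assumes raw term counts, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization, which is how scikit-learn documents the default behaviour of `TfidfVectorizer`; the three toy documents and the helper names (`doc_freq`, `idf`, `tfidf`) are made up for illustration.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, whitespace-tokenized for simplicity).
docs = [
    "good movie good plot",
    "bad movie",
    "good acting bad plot movie",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def idf(term: str) -> float:
    """Smoothed inverse document frequency."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1


def tfidf(tokens: list) -> dict:
    """Term count times IDF, scaled so the vector has unit L2 norm."""
    counts = Counter(tokens)
    weights = {t: c * idf(t) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}


vectors = [tfidf(tokens) for tokens in tokenized]

# "movie" occurs in every document, "acting" in only one,
# so "acting" gets the larger weight within the third document.
print(vectors[2])
```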

In [4]:
tfidf = TfidfVectorizer()

In [5]:
%%time
X = tfidf.fit_transform(X)

CPU times: user 8.16 s, sys: 188 ms, total: 8.35 s
Wall time: 8.43 s


In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=0)

In [7]:
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(42500, 150143) (7500, 150143) (42500,) (7500,)


- We'll load the model objects in a list in order to test all of the models at once.


- We'll also write a function to display all the metrics we're going to use to evaluate the performance of the models.


- Last but not least, we'll also look at the time required to train each model, as it's an important factor when deciding between two models that give the same accuracy/F1 scores.

In [8]:
models = [
    MultinomialNB(),
    LogisticRegression(n_jobs=-1),
    RandomForestClassifier(n_jobs=-1),
    LinearSVC(),
    XGBClassifier(n_jobs=-1),
    DecisionTreeClassifier()
]


def display_metrics(true, pred):
    """Print F1, precision and recall as rounded percentages."""
    f1 = round(f1_score(y_true=true, y_pred=pred) * 100)
    precision = round(precision_score(y_true=true, y_pred=pred) * 100)
    recall = round(recall_score(y_true=true, y_pred=pred) * 100)

    print(f'F1: {f1}')
    print(f'Precision: {precision}')
    print(f'Recall: {recall}')

In [9]:
%%time
trained_models = dict()

for model in models:
    print(f'Training -> {model.__class__.__name__}')
    s = time.time()
    trained_models[model.__class__.__name__] = model.fit(X_train, y_train)
    e = time.time()
    preds = trained_models[model.__class__.__name__].predict(X_test)
    acc = round(accuracy_score(y_true=y_test, y_pred=preds) * 100)
    print(f'Acc: {acc}')
    display_metrics(true=y_test, pred=preds)
    print(f'Training time: {round(e - s)} seconds')
    print('-' * 10)

Training -> MultinomialNB
Acc: 86
F1: 86
Precision: 87
Recall: 84
Training time: 0 seconds
----------
Training -> LogisticRegression
Acc: 89
F1: 89
Precision: 88
Recall: 90
Training time: 6 seconds
----------
Training -> RandomForestClassifier
Acc: 83
F1: 83
Precision: 83
Recall: 83
Training time: 41 seconds
----------
Training -> LinearSVC
Acc: 89
F1: 89
Precision: 89
Recall: 90
Training time: 1 seconds
----------
Training -> XGBClassifier
Acc: 85
F1: 85
Precision: 83
Recall: 86
Training time: 155 seconds
----------
Training -> DecisionTreeClassifier
Acc: 71
F1: 71
Precision: 70
Recall: 71
Training time: 103 seconds
----------
CPU times: user 9min 39s, sys: 1.07 s, total: 9min 40s
Wall time: 5min 6s


- As we can see above, Logistic Regression and LinearSVC give the best, and nearly identical, scores. We use LinearSVC (an SVM without a kernel) because training a kernel SVM scales at least quadratically with the number of datapoints, roughly O(features x datapoints^2), which is not feasible for 42,500 training samples with more than 150k features.


- Although these two models are practically the same in terms of their metric scores, the SVM is about 6 times faster to train (1 second vs. 6 seconds).


- Also, the SVM's precision is slightly better, by 1 percentage point. Its recall of 90% means it finds 90% of the actual positive reviews, and its precision of 89% means that 89% of the reviews it labels as positive are actually positive.


- We could tune LinearSVC's regularization parameter C for better scores, but we already have a good enough model for this task, so we won't.


- Now we'll store the model and the Tf-Idf vectorizer object using joblib, so that we can use them in a webapp for sentiment prediction.
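
The precision and recall readings discussed above can be reproduced from raw counts. The sketch below uses hypothetical counts of true positives, false positives and false negatives, chosen only so the percentages resemble the LinearSVC output; they are not the actual confusion matrix of our model, and `precision_recall` is an illustrative helper.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple:
    """Compute precision and recall from confusion-matrix counts."""
    precision = tp / (tp + fp)  # share of predicted positives that are correct
    recall = tp / (tp + fn)     # share of actual positives that were found
    return precision, recall


# Hypothetical counts: 100 actual positives, 101 predicted positives.
tp, fp, fn = 90, 11, 10
precision, recall = precision_recall(tp, fp, fn)

print(f'Precision: {round(precision * 100)}')
print(f'Recall: {round(recall * 100)}')
```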

In [10]:
import joblib

In [11]:
joblib.dump(value=trained_models['LinearSVC'], filename='../app/models/linear_svm.joblib')

['../models/linear_svm.joblib']

In [12]:
joblib.dump(value=tfidf, filename='../app/models/tfidf_vectorizer.joblib')

['../models/tfidf_vectorizer.joblib']
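
For reference, here is a minimal sketch of how the two saved objects could be loaded back and used together, as the webapp will eventually need to do. It trains a toy model on four made-up reviews and writes to a temporary directory instead of `../app/models/`, so the data and the `predict_sentiment` helper are assumptions for illustration only. The point is that the same fitted vectorizer must transform new text before the model predicts on it.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy stand-ins for the real reviews and labels (hypothetical data).
texts = [
    "a wonderful little film",
    "great acting and story",
    "boring and far too long",
    "a terrible waste of time",
]
labels = [1, 1, 0, 0]

vectorizer = TfidfVectorizer()
model = LinearSVC().fit(vectorizer.fit_transform(texts), labels)

with tempfile.TemporaryDirectory() as tmp:
    model_path = str(Path(tmp) / "linear_svm.joblib")
    vec_path = str(Path(tmp) / "tfidf_vectorizer.joblib")

    # Persist both objects, then read them back as a fresh process would.
    joblib.dump(model, model_path)
    joblib.dump(vectorizer, vec_path)
    loaded_model = joblib.load(model_path)
    loaded_vectorizer = joblib.load(vec_path)


def predict_sentiment(review: str) -> int:
    """Vectorize one review with the loaded vectorizer and return 1 or 0."""
    features = loaded_vectorizer.transform([review])
    return int(loaded_model.predict(features)[0])


new_preds = [predict_sentiment(t) for t in ["wonderful story", "terrible and boring"]]
print(new_preds)
```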

The modeling part is now done; next, we'll move on to building the webapp using Flask.