# Sentiment Analysis with n-grams and GridSearch

Two common ways to improve the performance of a sentiment analysis model are:
- optimising the text representation,
- tuning hyperparameters.

The first part compares the initial unigram word vectors against representations that also include n-grams of size 2 and 3, with and without stop word removal.

The second part uses GridSearchCV to search for the best combination of model parameters.

In [2]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

import functions as f

In [3]:
df = pd.read_csv('data/reviews_toys_games_100k.csv')

reviews = df['review'].astype('U').values
y = df['sentiment'].to_list()

## N-Grams

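This part compares six vectorizer configurations. For each of BOW and TF-IDF there are three variants: uni- and bigrams with English stop words removed, uni- and bigrams with stop words kept, and uni- to trigrams with stop words kept. Every vocabulary is capped at 6,000 features.

The results are very similar across variants (test accuracy between 0.951 and 0.967), but the best accuracy and F1 were achieved by the BOW model with uni- and bigrams and no stop word removal (`M_bow_2`).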
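
To make the effect of `ngram_range` and `stop_words` concrete, here is a minimal sketch on a two-sentence toy corpus (the sentences and the helper `vocabulary` are made up for illustration and are not part of the reviews dataset). It shows how adding bigrams keeps a phrase such as "not good" as a single feature, and how stop word removal discards it, which may be one reason the variants without stop word removal score slightly better here.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (illustrative only, not taken from the reviews dataset).
docs = ["this toy is not good", "this game is very good"]


def vocabulary(**kwargs):
    """Fit a CountVectorizer on the toy corpus and return its sorted feature names."""
    vec = CountVectorizer(**kwargs).fit(docs)
    return sorted(vec.vocabulary_)


# Unigrams only: every word is an independent feature.
unigrams = vocabulary(ngram_range=(1, 1))

# Unigrams and bigrams: neighbouring word pairs become features as well.
uni_bigrams = vocabulary(ngram_range=(1, 2))

# Same range, but English stop words are dropped before n-grams are built.
uni_bigrams_no_stop = vocabulary(ngram_range=(1, 2), stop_words="english")

print(unigrams)
print(uni_bigrams)
print(uni_bigrams_no_stop)
```

Because "not" is on scikit-learn's built-in English stop word list, the bigram "not good" disappears in the last variant and the negation is lost.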

In [4]:
# Bag-of-words counts: (1) uni+bigrams without stop words, (2) uni+bigrams, (3) uni- to trigrams
M_bow_1 = CountVectorizer(stop_words='english', ngram_range=(1,2), max_features=6000).fit_transform(reviews)
M_bow_2 = CountVectorizer(ngram_range=(1,2), max_features=6000).fit_transform(reviews)
M_bow_3 = CountVectorizer(ngram_range=(1,3), max_features=6000).fit_transform(reviews)

# TF-IDF weights with the same three configurations
M_tfidf_1 = TfidfVectorizer(stop_words='english', ngram_range=(1,2), max_features=6000).fit_transform(reviews)
M_tfidf_2 = TfidfVectorizer(ngram_range=(1,2), max_features=6000).fit_transform(reviews)
M_tfidf_3 = TfidfVectorizer(ngram_range=(1,3), max_features=6000).fit_transform(reviews)

In [5]:
embeddings_names = ['M_bow_1', 'M_bow_2', 'M_bow_3', 'M_tfidf_1', 'M_tfidf_2', 'M_tfidf_3']
embeddings = [M_bow_1, M_bow_2, M_bow_3, M_tfidf_1, M_tfidf_2, M_tfidf_3]
results_names = ['test_acc', 'f1', 'precision', 'recall']

In [6]:
sgd = SGDClassifier(random_state=9, n_jobs=-1)

sgd_cv = f.model_cv(sgd, embeddings, y)
f.df_model_cv(sgd_cv, embeddings_names, results_names)

| embedding | test_acc | f1 | precision | recall |
|---|---|---|---|---|
| M_bow_1 | 0.95922 | 0.977965 | 0.966973 | 0.989211 |
| M_bow_2 | 0.96663 | 0.98186 | 0.976483 | 0.987309 |
| M_bow_3 | 0.96599 | 0.981502 | 0.976743 | 0.986314 |
| M_tfidf_1 | 0.95067 | 0.973677 | 0.951147 | 0.9973 |
| M_tfidf_2 | 0.95853 | 0.977777 | 0.959045 | 0.997256 |
| M_tfidf_3 | 0.95816 | 0.97758 | 0.958817 | 0.997092 |
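
The `functions` module imported as `f` is not shown in this notebook, so what `f.model_cv` and `f.df_model_cv` do internally is unknown. The sketch below is only a guess at comparable helpers built on scikit-learn's `cross_validate`. The names `model_cv_sketch`, `df_model_cv_sketch` and `SCORING`, the 5-fold setting, and the synthetic data are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_validate

# Assumed mapping from the notebook's result names to scikit-learn scorer strings.
SCORING = {"test_acc": "accuracy", "f1": "f1", "precision": "precision", "recall": "recall"}


def model_cv_sketch(model, matrices, y, cv=5):
    """Hypothetical stand-in for f.model_cv: mean cross-validated scores per feature matrix."""
    rows = []
    for X in matrices:
        scores = cross_validate(model, X, y, cv=cv, scoring=SCORING)
        # cross_validate prefixes every scorer name with "test_".
        rows.append({name: scores[f"test_{name}"].mean() for name in SCORING})
    return rows


def df_model_cv_sketch(rows, index_names):
    """Hypothetical stand-in for f.df_model_cv: one table row per feature matrix."""
    return pd.DataFrame(rows, index=index_names)


# Synthetic binary data standing in for the vectorised reviews.
X_a, y_toy = make_classification(n_samples=200, n_features=20, random_state=0)
X_b = X_a[:, :10]  # a second "representation" using only half of the features

rows = model_cv_sketch(SGDClassifier(random_state=9), [X_a, X_b], y_toy)
table = df_model_cv_sketch(rows, ["X_a", "X_b"])
print(table)
```

Averaging over folds gives one row per representation, which matches the shape of the table above, though the real helpers may split or score the data differently.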


## Grid Search Cross-Validation

GridSearchCV evaluates every combination of the supplied hyperparameter values with cross-validation and reports the configuration with the highest mean score. This section searches over four SGDClassifier parameters: `loss`, `penalty`, `alpha` and `max_iter`.
The search selects the default values for `loss`, `penalty` and `alpha`. The only difference from the default model is `max_iter`, where 800 iterations are enough instead of the default 1000.

In [7]:
# Note: scikit-learn 1.1 renamed the 'log' loss to 'log_loss'; 'log' was removed in 1.3
parameters = {
    'loss': ('hinge', 'log'),
    'penalty': ('l1', 'l2'),
    'alpha': (0.0001, 0.0005),
    'max_iter': [800, 1000, 1200]}

sgd = SGDClassifier(random_state=9, n_jobs=-1)
clf = GridSearchCV(sgd, parameters)
clf.fit(M_bow_2, y)

GridSearchCV(estimator=SGDClassifier(n_jobs=-1, random_state=9),
             param_grid={'alpha': (0.0001, 0.0005), 'loss': ('hinge', 'log'),
                         'max_iter': [800, 1000, 1200],
                         'penalty': ('l1', 'l2')})
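
As a rough cost check, the number of candidates in this grid can be counted with `ParameterGrid` without fitting anything. The sketch assumes the default 5-fold cross-validation, since no `cv` argument is passed to `GridSearchCV` above. The variable names `n_candidates` and `n_fits` are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the notebook cell; ParameterGrid only enumerates the values,
# it does not validate them against SGDClassifier.
parameters = {
    "loss": ("hinge", "log"),
    "penalty": ("l1", "l2"),
    "alpha": (0.0001, 0.0005),
    "max_iter": [800, 1000, 1200],
}

grid = ParameterGrid(parameters)
n_candidates = len(grid)   # 2 * 2 * 2 * 3 combinations
n_fits = n_candidates * 5  # assuming 5 folds per candidate

print(n_candidates, n_fits)
print(grid[0])  # first candidate in enumeration order
```

With the default `refit=True`, one more fit on the full data follows the search, so the grid size directly drives the running time.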

In [8]:
clf.best_params_

{'alpha': 0.0001, 'loss': 'hinge', 'max_iter': 800, 'penalty': 'l2'}
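
`best_params_` names a single winner, but several candidates can share the top score, in which case the first one in enumeration order is reported. A hedged way to check whether, for example, the three `max_iter` values really differ is to look at `cv_results_`. The sketch below does this on synthetic data with a reduced grid. The data, the 3-fold setting and the variable names are assumptions for illustration, not the notebook's actual search.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary data standing in for the vectorised reviews.
X_toy, y_toy = make_classification(n_samples=300, n_features=20, random_state=0)

search = GridSearchCV(
    SGDClassifier(random_state=9),
    {"alpha": [0.0001, 0.0005], "max_iter": [800, 1000, 1200]},
    cv=3,
)
search.fit(X_toy, y_toy)

# One row per candidate; tied candidates share the same rank.
results = pd.DataFrame(search.cv_results_)
summary = results[["param_alpha", "param_max_iter", "mean_test_score", "rank_test_score"]]
print(summary.sort_values("rank_test_score"))
print(search.best_params_)
```

If several rows share rank 1, the selected `max_iter` reflects the order of the grid rather than a real improvement.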

In [10]:
# Re-evaluate with the selected parameters; only max_iter differs from the defaults
sgd_best_params = SGDClassifier(max_iter=800, random_state=9, n_jobs=-1)
sgd_best_cv = f.model_cv(sgd_best_params, [M_bow_2], y)
f.df_model_cv(sgd_best_cv, ['M_bow_2'], results_names)

| embedding | test_acc | f1 | precision | recall |
|---|---|---|---|---|
| M_bow_2 | 0.96663 | 0.98186 | 0.976483 | 0.987309 |
