## Sentiment classification - Using Bag of words

### Train

In this training methodology we'll compare classification performance using bag-of-words and word-embedding representations.


In [2]:
# Add the 'utilities' folder to the module search path so its modules can be imported
import sys
sys.path.append('../utilities')

In [3]:
import re

import pandas as pd

from nlp import BagOfWords, WordEmbedding

pd.set_option('display.max_colwidth', 500)

### Read data

In the following cell we read the data from a CSV file and keep only the texts labeled Good or Bad, dropping the Neutral ones to simplify classification.

In [27]:
dataset = pd.read_csv("./sample_output/sentiment_train_processed1.csv")

text_field = "Text"
class_field = "Sentiment"

dataset = dataset.query("Sentiment != 'Neutral'")

dataset.head()

Unnamed: 0,Id,Sentiment,Text
0,0,Bad,company company lot recalls barrons blog
2,2,Bad,company company risky autonomous driving plan barrons blog
3,3,Good,company company plans ridehailing service fleet driverless cars
4,4,Bad,company company files k events f
5,5,Bad,company company goldman sachs threw towel barrons blog
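
The filtering step can be illustrated on a tiny hand-made frame. This is only a sketch: the `demo` frame and its rows are made up for illustration and are not taken from the dataset. It shows that `query` keeps the original row labels, which is why the index in the output above jumps from 0 to 2.

```python
import pandas as pd

# Hypothetical stand-in for the real CSV, using the same column names
demo = pd.DataFrame({
    "Id": [0, 1, 2, 3],
    "Sentiment": ["Bad", "Neutral", "Good", "Bad"],
    "Text": ["lot recalls", "quarterly report", "plans fleet", "threw towel"],
})

# Same filter as in the notebook cell: drop the Neutral rows
binary = demo.query("Sentiment != 'Neutral'")

# Equivalent boolean-mask formulation
same = demo[demo["Sentiment"].isin(["Good", "Bad"])]

print(binary)
```

If a contiguous index is preferred, `binary.reset_index(drop=True)` would renumber the rows.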


### Bag of words

The bag of words representation is then calculated. Here we use three different variants: the regular bag of words, the TF-IDF weighted one and the L2-normalized one.

In [7]:
bow_model, word_counts             = BagOfWords.fit_regular_bow(dataset[text_field])
tfidf_bow_model, tfidf_word_counts = BagOfWords.fit_tfidf_bow(dataset[text_field])
norm_word_counts                   = BagOfWords.fit_normalized_bow(dataset[text_field])
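
The `BagOfWords` helper is a project-local module whose internals are not shown here. As a rough sketch of what the three representations could look like, assuming they resemble scikit-learn's `CountVectorizer`, `TfidfVectorizer` and row-wise `normalize` (an assumption, not the helper's confirmed implementation), with made-up example documents:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import normalize

# Hypothetical documents, used only for illustration
docs = [
    "company recalls cars",
    "company plans driverless cars",
    "company files events",
]

# Regular bag of words: raw term counts per document
count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs)

# TF-IDF bag of words: counts reweighted by inverse document frequency
tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(docs)

# L2-normalized bag of words: each row of counts scaled to unit Euclidean length
l2_counts = normalize(counts, norm="l2", axis=1)

# Row norms of the L2-normalized matrix, for inspection
row_norms = np.sqrt(l2_counts.multiply(l2_counts).sum(axis=1))
print(counts.shape, tfidf.shape, l2_counts.shape)
```

All three matrices share the same shape (documents by vocabulary terms); only the weighting of the entries differs.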


### Classification
All three representations will be tested as input to a logistic regression classifier.

In [22]:
import warnings

from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.preprocessing import StandardScaler
import sklearn.model_selection as modsel
from sklearn.exceptions import ConvergenceWarning
from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score, auc, roc_auc_score

warnings.filterwarnings('ignore', category=ConvergenceWarning)

In [14]:
model = LogisticRegression(random_state=100000, penalty = "l2", fit_intercept=True, intercept_scaling=1000, class_weight='balanced')
param_grid_ = {'C': [1e-5, 1e-4, 1e-3, 1e-2, 0.05, 0.1, 0.11, 0.12, 0.125, 0.15, 0.175, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1e0, 1e1, 1e2]}

y = dataset[class_field]


bow_search = modsel.GridSearchCV(
    model,
    cv=5,
    param_grid=param_grid_,
    return_train_score=True
)

bow_search.fit(word_counts, y)


l2_search = modsel.GridSearchCV(
    model,
    cv=5,
    param_grid=param_grid_,
    return_train_score=True
)

l2_search.fit(norm_word_counts, y)


tfidf_search = modsel.GridSearchCV(
    model,
    cv=5,
    param_grid=param_grid_,
    return_train_score=True
)

tfidf_search.fit(tfidf_word_counts, y)


Text_search_results = pd.DataFrame.from_dict({
    'bow': bow_search.cv_results_['mean_test_score'],
    'tfidf': tfidf_search.cv_results_['mean_test_score'],
    'l2': l2_search.cv_results_['mean_test_score'],
})

Text_search_results

Unnamed: 0,bow,tfidf,l2
0,0.489575,0.657915,0.692664
1,0.637066,0.661004,0.691892
2,0.636293,0.671815,0.687259
3,0.647876,0.68417,0.694208
4,0.658687,0.680309,0.695753
5,0.664093,0.678764,0.690347
6,0.664865,0.678764,0.69112
7,0.665637,0.678764,0.69112
8,0.667954,0.678764,0.69112
9,0.671042,0.678764,0.69112
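
The three searches above share the same configuration, so they could also be run in a loop, and the best `C` for each representation can be read from `best_params_`. The sketch below assumes synthetic data from `make_classification` as a stand-in for the word-count matrices, and a shortened grid; the names `feature_sets`, `searches` and `results` are illustrative only.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import normalize

# Synthetic stand-in for the bag-of-words matrices and labels
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)

# Shortened, hypothetical grid of inverse regularization strengths
grid = {"C": [1e-3, 1e-2, 0.1, 1.0, 10.0]}

# Two feature representations of the same samples
feature_sets = {"raw": X_demo, "l2": normalize(X_demo, norm="l2")}

searches = {}
for name, X in feature_sets.items():
    search = GridSearchCV(
        LogisticRegression(max_iter=1000, class_weight="balanced"),
        param_grid=grid,
        cv=5,
        return_train_score=True,
    )
    searches[name] = search.fit(X, y_demo)

# One column per representation, one row per value of C
results = pd.DataFrame(
    {name: s.cv_results_["mean_test_score"] for name, s in searches.items()},
    index=pd.Index(grid["C"], name="C"),
)

# Best C per representation, as selected by mean cross-validated score
best_C = {name: s.best_params_["C"] for name, s in searches.items()}
print(results)
print(best_C)
```

Indexing the table by `C` makes each row directly attributable to a grid value instead of a positional index.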


In [18]:
C = param_grid_['C'][13]  # index 13 of the grid, i.e. C = 0.4

Logistic_Model = LogisticRegression(
    C=C,
    fit_intercept=True,
    penalty="l1",
    class_weight='balanced',
    solver="liblinear",
    intercept_scaling=1000,
    random_state=100000
)

log_CV = Logistic_Model.fit(tfidf_word_counts, y)


In [26]:
preds_LASSO = modsel.cross_val_predict(log_CV, tfidf_word_counts, y, cv=5, method="predict")
preds_proba_LASSO = modsel.cross_val_predict(log_CV, tfidf_word_counts, y, cv=5, method="predict_proba")

accuracy_score(y, preds_LASSO)  # scikit-learn expects (y_true, y_pred)

0.6501930501930502
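
Accuracy is only one of the metrics imported earlier. A sketch of how the others could be computed from cross-validated predictions, assuming synthetic data from `make_classification` in place of the TF-IDF matrix and treating `"Good"` as the positive class (both are assumptions made for this example):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, f1_score, precision_score, recall_score, roc_auc_score,
)
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for the features, with string labels like the dataset's
X_demo, y_int = make_classification(n_samples=200, n_features=20, random_state=0)
y_demo = np.where(y_int == 1, "Good", "Bad")

# L1-penalized (LASSO-style) logistic regression, mirroring the cell above
clf = LogisticRegression(
    C=0.4, penalty="l1", solver="liblinear",
    class_weight="balanced", random_state=0,
)

# Out-of-fold class predictions and class probabilities
preds = cross_val_predict(clf, X_demo, y_demo, cv=5, method="predict")
proba = cross_val_predict(clf, X_demo, y_demo, cv=5, method="predict_proba")

# Probability columns follow the sorted class labels, so column 1 is "Good"
metrics = {
    "accuracy": accuracy_score(y_demo, preds),
    "precision": precision_score(y_demo, preds, pos_label="Good"),
    "recall": recall_score(y_demo, preds, pos_label="Good"),
    "f1": f1_score(y_demo, preds, pos_label="Good"),
    "roc_auc": roc_auc_score(y_demo == "Good", proba[:, 1]),
}
print(metrics)
```

Because every prediction comes from a fold in which that sample was held out, these scores are less optimistic than scores computed on data the model was fitted on.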

## Conclusion
Our toy model reached about 65% accuracy on 5-fold cross-validated predictions over the training data. This is not the best way to evaluate and select machine learning models, but it gives us a glimpse of how our data could be used for modeling.

You can refer to the **Gryphon classification template** for more details on the process of fitting a model for this kind of problem.
