## Sentiment classification - Using Word embedding

### Train

In this training methodology we compare classification performance using bag-of-words and word-embedding features. This notebook covers the word-embedding features.


In [2]:
# Add the '../utilities' folder to the module search path so its modules can be imported
import sys
sys.path.append('../utilities')

In [3]:
import re

import pandas as pd

from nlp import BagOfWords, WordEmbedding

pd.set_option('display.max_colwidth', 500)

### Read data

In the following cell we read the data from a CSV file and keep only the texts labelled Good or Bad, dropping the Neutral ones so that classification becomes a binary problem.

In [27]:
dataset = pd.read_csv("./sample_output/sentiment_train_processed1.csv")

text_field = "Text"
class_field = "Sentiment"

dataset = dataset.query("Sentiment != 'Neutral'")

dataset.head()

| Unnamed: 0 | Id | Sentiment | Text |
|---|---|---|---|
| 0 | 0 | Bad | company company lot recalls barrons blog |
| 2 | 2 | Bad | company company risky autonomous driving plan barrons blog |
| 3 | 3 | Good | company company plans ridehailing service fleet driverless cars |
| 4 | 4 | Bad | company company files k events f |
| 5 | 5 | Bad | company company goldman sachs threw towel barrons blog |
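
The same filter can be written as a boolean mask, and it can be useful to check how many examples remain per class afterwards. The following is a minimal sketch on a made-up four-row frame; the rows are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy frame with the same column names as the dataset; the rows are invented.
toy = pd.DataFrame({
    "Sentiment": ["Bad", "Neutral", "Good", "Bad"],
    "Text": ["stock falls", "company files report", "record profit", "recall announced"],
})

# Boolean-mask equivalent of toy.query("Sentiment != 'Neutral'")
filtered = toy[toy["Sentiment"] != "Neutral"]

# Count how many examples remain per class
class_counts = filtered["Sentiment"].value_counts()
print(class_counts)
```

Note that the filtered frame keeps the original index labels, which is why the output above skips row 1.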


### Word embedding

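Next, each text is converted into a fixed-length numeric vector with a word embedding model. Instantiating `WordEmbedding` loads the model into memory, and `transform` returns one row of features per text.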

In [28]:
embedding = WordEmbedding()

Loading embedding model to memory...
Done!


In [29]:
result = embedding.transform(dataset[text_field])
result

array([[ 0.07675781,  0.02259522, -0.05097656, ..., -0.0350586 ,
        -0.04492188, -0.14404297],
       [ 0.04305594, -0.06951904, -0.01410784, ..., -0.03456334,
        -0.02202497, -0.07232666],
       [ 0.11544364,  0.0324707 ,  0.0905413 , ...,  0.03480748,
        -0.07974679, -0.14257812],
       ...,
       [ 0.02143999, -0.04086026, -0.04854792, ...,  0.01879883,
        -0.01993075, -0.11048473],
       [ 0.02790715, -0.03783241, -0.03212092, ..., -0.00324895,
        -0.01854412, -0.1219858 ],
       [-0.03604126, -0.07578278,  0.05036926, ..., -0.03924942,
        -0.05232239, -0.04679108]], dtype=float32)

### Classification
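The embedding vectors are used as input to a logistic regression classifier. A grid search with 5-fold cross-validation compares different values of the regularization parameter `C`.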

In [30]:
import warnings

from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.preprocessing import StandardScaler
import sklearn.model_selection as modsel
from sklearn.exceptions import ConvergenceWarning
from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score, auc, roc_auc_score

# Silence solver convergence warnings during the grid search
warnings.filterwarnings('ignore', category=ConvergenceWarning)

In [32]:
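# L2-regularized logistic regression with balanced class weights.
# Note: intercept_scaling only takes effect with solver="liblinear".
model = LogisticRegression(
    random_state=100000,
    penalty="l2",
    fit_intercept=True,
    intercept_scaling=1000,
    class_weight='balanced'
)

# Candidate values for the inverse regularization strength C
param_grid_ = {'C': [1e-5, 1e-4, 1e-3, 1e-2, 0.05, 0.1, 0.11, 0.12, 0.125, 0.15, 0.175, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1e0, 1e1, 1e2]}

X = result
y = dataset[class_field]

# 5-fold cross-validated grid search over C on the embedding features
embedding_search = modsel.GridSearchCV(
    model,
    cv=5,
    param_grid=param_grid_,
    return_train_score=True
)

embedding_search.fit(X, y)

# Mean cross-validated score for each value of C, in grid order
Text_search_results = pd.DataFrame.from_dict({
    'embedding': embedding_search.cv_results_['mean_test_score']
})

Text_search_results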

| Unnamed: 0 | embedding |
|---|---|
| 0 | 0.737452 |
| 1 | 0.738224 |
| 2 | 0.760618 |
| 3 | 0.755985 |
| 4 | 0.763707 |
| 5 | 0.763707 |
| 6 | 0.765251 |
| 7 | 0.764479 |
| 8 | 0.766023 |
| 9 | 0.771429 |


In [35]:
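# Hand-picked value: index 13 of the grid is C = 0.4
C = param_grid_['C'][13]

# Note: this model uses an L1 penalty (LASSO) with the liblinear solver,
# whereas the grid search above used an L2 penalty.
Logistic_Model = LogisticRegression(
    C=C,
    fit_intercept=True,
    penalty="l1",
    class_weight='balanced',
    solver="liblinear",
    intercept_scaling=1000,
    random_state=100000
)

log_CV = Logistic_Model.fit(X, y)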


In [36]:
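# cross_val_predict clones the estimator and refits it on each training fold,
# so every prediction comes from a model that did not see that sample.
preds_LASSO = modsel.cross_val_predict(log_CV, X, y, cv=5, method="predict")
preds_proba_LASSO = modsel.cross_val_predict(log_CV, X, y, cv=5, method="predict_proba")

# accuracy_score expects (y_true, y_pred)
accuracy_score(y, preds_LASSO)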

0.7382239382239382
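
Accuracy alone can hide differences between the two classes, and the imports earlier already include `f1_score` and `roc_auc_score`. The sketch below shows how these could be computed from out-of-fold predictions. It runs on synthetic data with labels renamed to "Bad" and "Good", and it treats "Good" as the positive class; both choices are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for the embedding features; labels mapped to the notebook's class names
X_toy, y_int = make_classification(n_samples=200, n_features=20, random_state=0)
y_toy = np.where(y_int == 1, "Good", "Bad")

clf = LogisticRegression(
    penalty="l1", solver="liblinear", C=0.4,
    class_weight="balanced", random_state=0,
)

# Out-of-fold class predictions and class probabilities
preds = cross_val_predict(clf, X_toy, y_toy, cv=5, method="predict")
proba = cross_val_predict(clf, X_toy, y_toy, cv=5, method="predict_proba")

# predict_proba columns follow the sorted class labels: ["Bad", "Good"]
good_scores = proba[:, 1]

metrics = {
    "accuracy": accuracy_score(y_toy, preds),
    "f1_good": f1_score(y_toy, preds, pos_label="Good"),
    "roc_auc": roc_auc_score(y_toy, good_scores),
}
print(metrics)
```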

## Conclusion
Our toy model reached about 74% cross-validated accuracy (0.738) on the training data. This is not the most rigorous way to evaluate and select machine learning models, but it gives us a glimpse of how our data could be used for modeling.
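
A stricter check keeps aside a test split that the hyperparameter search never sees. The sketch below shows the idea on synthetic data; the 20% split size and the short grid are arbitrary choices for illustration, not settings taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the embedding features and labels
X_toy, y_toy = make_classification(n_samples=300, n_features=20, random_state=0)

# Hold out 20% of the data; stratify to keep the class ratio in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

# Tune C with cross-validation on the training part only
search = GridSearchCV(
    LogisticRegression(class_weight="balanced", max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
)
search.fit(X_train, y_train)

# Score the refitted best model once on the untouched test part
test_accuracy = accuracy_score(y_test, search.predict(X_test))
print(test_accuracy)
```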

You can refer to the **Gryphon classification template** for more details on fitting a model for this kind of problem.