# Logistic regression using word embeddings

In this notebook, we use word embeddings as the input features for a logistic regression model.

We start by installing gensim, then train a word2vec model and use its embeddings for multinomial logistic regression, in an attempt to improve on the previously seen models that use a bag-of-words vector representation.

In [1]:
!pip install gensim



In [2]:
import os
import gensim
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression

import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

In [3]:
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from sklearn.metrics import ConfusionMatrixDisplay

We load both the full and the undersampled datasets to check whether undersampling improves performance.

In [4]:
# Load the preprocessed train/test splits (full and undersampled)
df_train = pd.read_csv('../data/processed/normalized_train.csv')
df_test = pd.read_csv('../data/processed/normalized_test.csv')
df_under_train = pd.read_csv('../data/processed/undersampled_train.csv')
df_under_test = pd.read_csv('../data/processed/undersampled_test.csv')
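
The notebook does not show how the undersampled files were produced. One common approach, assumed here purely for illustration, is random undersampling: every class is cut down to the size of the smallest class. The helper `undersample_indices` and the toy labels below are hypothetical and not part of the original pipeline.

```python
import numpy as np

def undersample_indices(labels, seed=0):
    """Return sorted row indices that keep the same number of samples per class."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(labels, return_counts=True)
    n_keep = counts.min()  # size of the smallest class
    kept = [
        rng.choice(np.flatnonzero(labels == c), size=n_keep, replace=False)
        for c in classes
    ]
    return np.sort(np.concatenate(kept))

# Toy, imbalanced label column (made up for illustration)
toy_labels = [0, 0, 0, 0, 1, 1, 2, 2, 2]
idx = undersample_indices(toy_labels)
balanced = np.asarray(toy_labels)[idx]
print(np.bincount(balanced))
```

With a DataFrame, the returned indices could be passed to `df.iloc[idx]` to obtain the balanced subset.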

In [5]:
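# Feed a word2vec model with the data. Word2Vec expects each document as a list of
# tokens; a column read back with pd.read_csv holds plain strings, which gensim would
# iterate character by character, so 'Data' may need splitting into tokens first.
w2v_model = gensim.models.Word2Vec(list(df_train['Data']), vector_size=100, window=5, min_count=2)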
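
Word2Vec treats each training item as a sequence of tokens, so the shape of the input matters. The sketch below uses plain Python to show the difference between iterating over a raw string and over a tokenized one. It assumes the normalized text is whitespace-separated; `raw_docs` and `tokenize` are hypothetical names, and the real `Data` column may need a different tokenizer.

```python
def tokenize(text):
    """Split normalized text on whitespace (an assumed tokenization scheme)."""
    return text.split()

# Toy documents standing in for the 'Data' column
raw_docs = ["deep learning model", "graph neural network"]

# Iterating over a raw string yields single characters, not words
as_characters = list(raw_docs[0])[:4]

# Splitting first gives the list-of-tokens shape that Word2Vec expects
tokenized_docs = [tokenize(doc) for doc in raw_docs]

print(as_characters)
print(tokenized_docs[0])
```

Under this assumption, `tokenized_docs` is what would be passed to `gensim.models.Word2Vec` in place of the raw column.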

In [6]:
# Generate aggregated sentence vectors based on the word vectors for each word in the sentence.
# Replace the words in each abstract with the learned word vectors, skipping words that are
# not in the model vocabulary. Abstracts differ in length, so we keep a plain list of
# 2-D arrays instead of building a ragged NumPy array.
words = set(w2v_model.wv.index_to_key)

def to_word_vectors(texts):
    return [np.array([w2v_model.wv[i] for i in ls if i in words]) for ls in texts]

X_train_vect = to_word_vectors(df_train['Data'])
X_test_vect = to_word_vectors(df_test['Data'])
X_train_under = to_word_vectors(df_under_train['Data'])
X_test_under = to_word_vectors(df_under_test['Data'])

In [7]:
# Average the word vectors for each sentence and assign a vector of zeros if the model
# did not learn any of the words in the text during training
def average_vectors(vectors, dim=w2v_model.wv.vector_size):
    return [v.mean(axis=0) if v.size else np.zeros(dim, dtype=float) for v in vectors]

X_train_vect_avg = average_vectors(X_train_vect)
X_test_vect_avg = average_vectors(X_test_vect)
X_train_under_avg = average_vectors(X_train_under)
X_test_under_avg = average_vectors(X_test_under)
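
The averaging step can be checked by hand on a toy example. The sketch below uses made-up 3-dimensional vectors rather than the trained model; `mean_pool` and `toy_embeddings` are hypothetical names. It also shows one known limitation of mean pooling: word order is discarded, so two texts with the same words in a different order get the same vector.

```python
import numpy as np

def mean_pool(tokens, embeddings, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return np.zeros(dim, dtype=float)
    return np.mean(known, axis=0)

# Toy 3-dimensional embeddings (values made up for illustration)
toy_embeddings = {
    "neural": np.array([1.0, 0.0, 2.0]),
    "network": np.array([3.0, 2.0, 0.0]),
}

doc_vector = mean_pool(["neural", "network", "unseen"], toy_embeddings, dim=3)
reordered = mean_pool(["network", "neural"], toy_embeddings, dim=3)
empty_vector = mean_pool(["unseen"], toy_embeddings, dim=3)

# Stack the per-document vectors into a feature matrix (one row per document)
X = np.vstack([doc_vector, reordered, empty_vector])
print(X)
```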

We fit the logistic regression classifier on the averaged word2vec vectors, once on the full dataset and once on the undersampled dataset.
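
One detail to keep in mind when fitting a single estimator on two datasets: in scikit-learn, `fit` returns the estimator itself, so a second call to `fit` overwrites the first model. The sketch below illustrates this on random toy data (not the notebook's data) and shows `sklearn.base.clone` as one way to get independent models; all variable names are hypothetical.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

# Two small synthetic binary problems with different decision rules
rng = np.random.default_rng(0)
X_a = rng.normal(size=(40, 5))
y_a = (X_a[:, 0] > 0).astype(int)
X_b = rng.normal(size=(40, 5))
y_b = (X_b[:, 1] > 0).astype(int)

base = LogisticRegression(C=1)

# fit() returns the estimator itself, so both names refer to one object
model_a = base.fit(X_a, y_a)
model_b = base.fit(X_b, y_b)
same_object = model_a is model_b

# clone() copies the hyperparameters into a fresh, unfitted estimator
indep_a = clone(base).fit(X_a, y_a)
indep_b = clone(base).fit(X_b, y_b)
independent = indep_a is not indep_b

print(same_object, independent)
```

Predicting before refitting, as the cells below do, gives correct reports; cloning only matters if the first model is reused later.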

In [8]:
clf = LogisticRegression(C=1, multi_class='multinomial')

In [9]:
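# Full dataset logistic regression model
# clone() returns a fresh estimator, so later fits of clf do not overwrite this model
from sklearn.base import clone
lr_model_full = clone(clf).fit(X_train_vect_avg, df_train['Label'].to_numpy())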

In [10]:
# predict
predict_test = lr_model_full.predict(X_test_vect_avg)

# report
print(classification_report(df_test['Label'], predict_test))

              precision    recall  f1-score   support

           0       0.51      0.86      0.64       840
           1       0.73      0.29      0.42       353
           2       0.66      0.66      0.66       785
           3       0.54      0.04      0.07       376

    accuracy                           0.58      2354
   macro avg       0.61      0.46      0.45      2354
weighted avg       0.60      0.58      0.52      2354



In [11]:
# Undersampled dataset logistic regression model
lr_model_under = clf.fit(X_train_under_avg, df_under_train['Label'].to_numpy())

In [12]:
# predict
predict_test_under = lr_model_under.predict(X_test_under_avg)

# report
print(classification_report(df_under_test['Label'], predict_test_under))

              precision    recall  f1-score   support

           0       0.46      0.54      0.50       353
           1       0.71      0.53      0.61       353
           2       0.64      0.53      0.58       353
           3       0.45      0.56      0.50       353

    accuracy                           0.54      1412
   macro avg       0.57      0.54      0.55      1412
weighted avg       0.57      0.54      0.55      1412



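The results suggest that averaged word2vec embeddings are a poor representation for this problem; the architecture or the model might be ill-suited to this particular case. Accuracy reaches only 0.58 on the full dataset and 0.54 on the undersampled one. Both are above the 0.25 expected from uniform random guessing over four classes, but on the full dataset recall falls to 0.29 for class 1 and 0.04 for class 3. Undersampling evens out recall across classes (0.53 to 0.56) without improving overall accuracy.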
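
To put the reported accuracies in context, chance-level baselines can be computed from the support counts printed in the two classification reports. The sketch below is a plain arithmetic check; the function names are hypothetical, and the accuracy values are copied from the reports above.

```python
# Per-class test support, copied from the two classification reports
support_full = {0: 840, 1: 353, 2: 785, 3: 376}
support_under = {0: 353, 1: 353, 2: 353, 3: 353}

def majority_baseline(support):
    """Accuracy of always predicting the most frequent class."""
    return max(support.values()) / sum(support.values())

def uniform_baseline(support):
    """Expected accuracy of guessing every class with equal probability."""
    return 1 / len(support)

for name, support, accuracy in [
    ("full", support_full, 0.58),
    ("undersampled", support_under, 0.54),
]:
    print(
        f"{name}: model={accuracy:.2f} "
        f"majority={majority_baseline(support):.2f} "
        f"uniform={uniform_baseline(support):.2f}"
    )
```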

Given this poor performance, we choose not to save or pursue this model.