# Sentiment Analysis with Logistic Regression and word embedding
In this file, we experienment sentiment analysis with Logistic Regression model taking pretrained NLPL word embedding as feature inputs. We didn't continue to finetune the model because the initial performance is worse than using TF-IDF weights. 

## Setup

In [1]:
import numpy as np
import pandas as pd

from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from gensim.models import KeyedVectors  

## Load preprocessed data and generate datasets
We utilize the same preprocessed data file generated by sentiment_analysis_w_lstm.ipynb. After loading the data, we split the data to training, validation and test datasets for training.

In [2]:
df = pd.read_csv('data/pre.csv')

tweets = df['clean_text'].apply(lambda x: ' '.join(eval(x)))
# 0 - negative
# 1 - positive
labels = df['target']

data_train_val, data_test, y_train_val, y_test = train_test_split(tweets, labels, test_size=0.2, random_state=42)
data_train, data_val, y_train, y_val = train_test_split(data_train_val, y_train_val, test_size=0.2, random_state=42)

# print the distribution and size of each dataset
print("Training Dataset:")
print(y_train.value_counts())
print("Size: {}".format(len(data_train)))

print("\nValidation Dataset:")
print(y_val.value_counts())
print("Size: {}".format(len(data_val)))

print("\nTest Dataset:")
print(y_test.value_counts())
print("Size: {}".format(len(data_test)))

# print samples from data_train and y_train
print(data_train[:5])
print(y_train[:5])


Training Dataset:
target
0    510871
1    510581
Name: count, dtype: int64
Size: 1021452

Validation Dataset:
target
0    128114
1    127249
Name: count, dtype: int64
Size: 255363

Test Dataset:
target
1    159914
0    159290
Name: count, dtype: int64
Size: 319204
180755     i hear tapping on my window like human finger ...
1483820    back to planning my road trip home i think the...
1228707    it a beeautiful first day of june my bday is o...
1301205                                      is in charlotte
546314     yummy sound like a plan but i m probably not n...
Name: clean_text, dtype: object
180755     0
1483820    1
1228707    1
1301205    1
546314     0
Name: target, dtype: int64


## Load pretrained word embedding and evluate the performance of Lositic Regression on the validation dataset.

In [3]:
word_vectors = KeyedVectors.load_word2vec_format('data/model.bin', binary=True)  

def average_word_vectors(tokens, model, vocabulary):
    feature_vector = np.zeros(model.vector_size)  
    num_tokens = 0  
    for token in tokens:  
        if token in vocabulary:  
            feature_vector += model[token]  
            num_tokens += 1  
    if num_tokens > 0:  
        feature_vector /= num_tokens  
    return feature_vector

X_train = []
vocabulary = set(word_vectors.key_to_index)  
for row in data_train:
    X_train.append(average_word_vectors(row, word_vectors, vocabulary))
X_train = np.array(X_train)

X_val = []
for row in data_val:
    X_val.append(average_word_vectors(row, word_vectors, vocabulary))
X_val = np.array(X_val)

model = LogisticRegression().fit(X_train, y_train)
y_pred = model.predict(X_val)

print("Accuracy:", accuracy_score(y_val, y_pred))
print("Classification Report:\n", classification_report(y_val, y_pred))

Accuracy: 0.5059229410682049
Classification Report:
               precision    recall  f1-score   support

           0       0.50      0.93      0.65    128114
           1       0.53      0.08      0.14    127249

    accuracy                           0.51    255363
   macro avg       0.52      0.50      0.40    255363
weighted avg       0.52      0.51      0.40    255363

