# Sentiment Analysis - Machine Learning Methods

In [1]:
import pandas as pd
import numpy as np
import os
import re
import nltk
import string
import warnings
import seaborn as sns
import matplotlib.pyplot as plt

## Data Loading
### Load cleaned data from data_preprocessing pipeline

In [2]:
# train data
train_data_path = "./clean_train.csv"
df_train = pd.read_csv(train_data_path)
df_train.head()

Unnamed: 0,0,words
0,0,awww bummer shoulda got david carr third day
1,0,upset updat facebook text might cri result sch...
2,0,dive mani time ball manag save rest bound
3,0,whole bodi feel itchi like fire
4,0,behav mad see


In [3]:
# test data
test_data_path = "./clean_test.csv"
df_test = pd.read_csv(test_data_path)
df_test.head()

Unnamed: 0,0,words
0,0,awww bummer shoulda got david carr third day
1,0,upset updat facebook text might cri result sch...
2,0,dive mani time ball manag save rest bound
3,0,whole bodi feel itchi like fire
4,0,behav mad see


In [4]:
# remove empty entries
df_train = df_train[df_train['words'].notna()]
df_test = df_test[df_test['words'].notna()]

## Data Features
Transforms text to feature vectors using the TfidfVectorizer so that they can be used as input to ML models.

In [5]:
# Create embedding matrix
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(max_features=2500, stop_words='english') 
tfidf_train = tfidf_vectorizer.fit_transform(df_train['words']) 

tfidf_vectorizer = TfidfVectorizer(max_features=2500, stop_words='english') 
tfidf_test = tfidf_vectorizer.fit_transform(df_test['words']) 

In [6]:
# data standardization
from sklearn.preprocessing import StandardScaler

sc = StandardScaler(with_mean=False)
sc.fit(tfidf_train)
x_train = sc.transform(tfidf_train)
x_test = sc.transform(tfidf_test)
y_train = df_train['0']
y_test = df_test['0']

## Machine Learning Models

Two classical machine learning methods were experimented with: Logistic Regression and Support Vector Machine

### Logistic Regression

In [7]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

model = LogisticRegression()
model.fit(x_train, y_train)

y_pred = model.predict(x_test)
print("Test Accuracy: ", accuracy_score(y_test, y_pred))

Test Accuracy:  0.7580695072002899


### SVM

In [8]:
from sklearn import svm

model = svm.LinearSVC(C=1)
model.fit(x_train, y_train) 

y_pred = model.predict(x_test)
print("Test Accuracy: ", accuracy_score(y_test, y_pred))

Test Accuracy:  0.6719836521721629




Logistic Regression performed better than SVM in this Sentiment Analysis task. More Deep Learning methods are explored in sentimentpredict-DL.