
# Text Classification for Sentiment Analysis


This project uses a dataset of Amazon reviews to perform sentiment analysis.

## Data Preparation, Cleaning, Preprocessing

We begin by preparing the dataset: we load it into Pandas and keep only the reviews and ratings, since only these two columns are relevant for our scope. We then turn the problem into a 3-class classification problem, select 20,000 reviews from each class, and do an 80-20 train-test split on this sample.
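
The mapping from star ratings to the three classes can be sketched as follows. This is a minimal illustration, assuming ratings 1 and 2 form class 1, rating 3 forms class 2, and ratings 4 and 5 form class 3; the function name `rating_to_class` and the negative/neutral/positive reading of the classes are assumptions made for this example.

```python
def rating_to_class(star_rating):
    """Map a 1-5 star rating to one of three classes (hypothetical helper)."""
    rating = int(star_rating)  # accepts 5 as well as "5"
    if rating <= 2:
        return 1  # assumed to mean negative
    if rating == 3:
        return 2  # assumed to mean neutral
    return 3      # assumed to mean positive


print([rating_to_class(r) for r in ["1", "2", "3", "4", "5"]])
# → [1, 1, 2, 3, 3]
```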

In accordance with common practice in NLP, we perform some data cleaning, which consists of:
* Removing HTML
* Removing multiple spaces
* Removing URLs
* Removing non-alphabetical characters
* Using the `contractions` library to expand contractions.

After this, we use `nltk` to do some preprocessing and remove stop words as well as lemmatize the data.
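
The cleaning steps listed above can be sketched as a single function. This is only an illustration: the small `CONTRACTIONS` dictionary is a hand-written stand-in for the `contractions` library, the function name `clean_review` is hypothetical, and the sketch expands contractions before stripping non-alphabetical characters so that apostrophes are still present when the expansion runs.

```python
import re

# Tiny illustrative map; a stand-in for the `contractions` library.
CONTRACTIONS = {"don't": "do not", "it's": "it is", "i'm": "i am"}


def clean_review(text):
    """Apply the cleaning steps to one review string (hypothetical helper)."""
    text = text.lower()
    text = re.sub(r"<[^<]+?>", "", text)      # remove HTML tags
    text = re.sub(r"https?://\S+", "", text)  # remove URLs
    # Expand contractions while apostrophes are still in the text
    text = " ".join(CONTRACTIONS.get(word, word) for word in text.split())
    text = re.sub(r"[^a-z]", " ", text)       # keep letters only
    text = re.sub(r" +", " ", text).strip()   # collapse repeated spaces
    return text


print(clean_review("<b>Don't</b> buy it!! See https://example.com  now"))
# → do not buy it see now
```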

## Feature Extraction
After this, we use scikit-learn to extract TF-IDF features from the dataset and build the vocabulary. We then use these features, together with our ground-truth 3-class labels, to evaluate a variety of approaches.

## Perceptron
A perceptron is a simple model with a single layer of neurons. The input features are multiplied by a set of weights and passed through a threshold function, and the perceptron algorithm is used to calibrate the weights. It can only learn linear decision boundaries.
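
The weight update can be sketched for the binary case as below. This is a from-scratch illustration on toy data, not the scikit-learn implementation used in the notebook; the function names, the learning rate, and the `+1`/`-1` label encoding are assumptions of the sketch. Multi-class problems are typically handled by training one such classifier per class.

```python
def train_perceptron(samples, labels, epochs=10, lr=1.0):
    """Train a binary perceptron; labels must be +1 or -1."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            predicted = 1 if activation >= 0 else -1  # threshold function
            if predicted != y:
                # Move the boundary toward the misclassified sample
                weights = [w + lr * y * xi for w, xi in zip(weights, x)]
                bias += lr * y
    return weights, bias


def predict(weights, bias, x):
    activation = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if activation >= 0 else -1


# Toy, linearly separable data (logical AND)
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [-1, -1, -1, 1]
w, b = train_perceptron(X, y)
print([predict(w, b, x) for x in X])
# → [-1, -1, -1, 1]
```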

We obtain an F1 score of 65.40% on this dataset using the Perceptron approach.

## SVM
An SVM finds the best boundary (hyperplane) that separates the data into separate classes. This boundary maximizes the margin, which is the distance between the boundary and the closest data points from each class; these closest points are called support vectors.
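
To make the idea of a margin concrete, the sketch below computes the geometric margin of two candidate hyperplanes on toy data. It only evaluates given hyperplanes and does not solve the SVM optimization problem; the function name `geometric_margin` and the data are made up for the example.

```python
import math


def geometric_margin(w, b, points, labels):
    """Smallest signed distance from any point to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in zip(points, labels)
    )


points = [[0, 0], [0, 2], [4, 0], [4, 2]]
labels = [-1, -1, 1, 1]

# Both hyperplanes separate the classes, but the first leaves a wider margin
print(geometric_margin([1, 0], -2, points, labels))  # boundary at x = 2 → 2.0
print(geometric_margin([1, 0], -1, points, labels))  # boundary at x = 1 → 1.0
```

An SVM would prefer the first boundary, since its closest points on either side are farther away.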

RBF and polynomial (poly) kernels both implicitly map the data into a higher-dimensional space. We try both of them and report the best of them.

We obtain an F1 score of 70.73% on the dataset. In the code below, this score comes from `LinearSVC`, that is, a linear kernel.

## Logistic Regression
Multinomial logistic regression applies a linear function to a set of features and applies a softmax to the result to obtain a probability distribution over K classes.

We obtain an F1 score of 72.94% using multinomial logistic regression with the `lbfgs` solver.
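
The prediction step can be sketched as a linear score per class followed by a softmax. The weights below are made-up values rather than fitted ones, and `predict_proba` is a hypothetical helper named for this example only.

```python
import math


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(weights, biases, features):
    """One linear score per class, then softmax over the K scores."""
    scores = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(scores)


# Made-up weights for K = 3 classes and 2 features
weights = [[1.0, -1.0], [0.0, 0.0], [-1.0, 1.0]]
biases = [0.0, 0.0, 0.0]
probs = predict_proba(weights, biases, [2.0, 0.5])
print(probs)
print(sum(probs))
```

The class with the largest probability is the predicted class; training adjusts the weights so that the true class receives high probability.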

## Naive Bayes
The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). We feed it our TF-IDF features, which are fractional rather than integer counts but work in practice, to obtain a probability distribution over our 3 classes.

We obtain an F1 score of 71.47% using multinomial Naive Bayes.

# Findings
The F1 scores of the SVM, Logistic Regression, and Naive Bayes approaches differ only slightly from one another, while all three clearly improve on the Perceptron.

Logistic regression gives us our best performance with an F1 score of 72.94%. That a linear model performs best suggests that the classes are reasonably well separated by a linear function of the TF-IDF features.






# Code

## Environment setup

In [1]:
env = 'colab'

In [2]:
import pandas as pd
import numpy as np
import nltk
nltk.download('wordnet')
import re
from bs4 import BeautifulSoup
 

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [3]:
nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [4]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [5]:
! pip install contractions

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [6]:
! pip install bs4 # in case you don't have it installed

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [7]:
import warnings
warnings.filterwarnings("ignore")

In [8]:
# Dataset: https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Beauty_v1_00.tsv.gz
!wget https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Beauty_v1_00.tsv.gz
!gunzip /content/amazon_reviews_us_Beauty_v1_00.tsv.gz

--2023-01-20 22:37:33--  https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Beauty_v1_00.tsv.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.153.118, 52.217.44.54, 52.216.35.64, ...
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.153.118|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 914070021 (872M) [application/x-gzip]
Saving to: ‘amazon_reviews_us_Beauty_v1_00.tsv.gz’


2023-01-20 22:37:58 (36.0 MB/s) - ‘amazon_reviews_us_Beauty_v1_00.tsv.gz’ saved [914070021/914070021]

gzip: /content/amazon_reviews_us_Beauty_v1_00.tsv already exists; do you wish to overwrite (y or n)? n
	not overwritten


In [9]:
dataset_path = None
if env == "colab":
    dataset_path = './amazon_reviews_us_Beauty_v1_00.tsv'
else:
    dataset_path = './data.tsv'

## Read Data

In [10]:
df = pd.read_csv(dataset_path, sep='\t', on_bad_lines='skip')

## Keep Reviews and Ratings

In [11]:
# calling head() method  
# storing in new variable 
data_top = df.head() 


#### Dropping all columns except review_body and star_rating

In [12]:
# Pass axis by keyword; the positional form is not accepted by recent pandas versions
df.drop(df.columns.difference(['review_body', 'star_rating']), axis=1, inplace=True)

In [13]:
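# Ratings 1-2 -> class 1, rating 3 -> class 2, ratings 4-5 -> class 3
# (star_rating is compared as a string here; if the column was parsed as
# integers, compare against ints instead)
df.loc[df['star_rating'] == '1', 'class'] = 1
df.loc[df['star_rating'] == '2', 'class'] = 1
df.loc[df['star_rating'] == '3', 'class'] = 2
df.loc[df['star_rating'] == '4', 'class'] = 3
df.loc[df['star_rating'] == '5', 'class'] = 3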

In [14]:
if env == 'colab':
  df.head()

## We form three classes and select 20000 reviews randomly from each class.



In [15]:
class1_sample = df.loc[df["class"] == 1].sample(n = 20000)
class2_sample = df.loc[df["class"] == 2].sample(n = 20000)
class3_sample = df.loc[df["class"] == 3].sample(n = 20000)

In [16]:
merged_df = pd.concat([class1_sample, class2_sample, class3_sample], axis=0)

# Data Cleaning



In [17]:
merged_df["review_body"] = merged_df["review_body"].apply(lambda x: str(x) if type(x)==float else x)

In [18]:
rev_body_split = merged_df["review_body"].str.split()

# Total number of non-whitespace characters per review, averaged over all reviews below
word_lengths = rev_body_split.apply(lambda x: sum(len(i) for i in x))
avg_pp = word_lengths.mean()
if (env == 'colab'):
  print(f'Average length of the words before cleaning: {avg_pp}')


Average length of the words before cleaning: 153.2628


# Pre-processing

In [19]:
merged_df["review_body"] = merged_df["review_body"].str.lower()

In [20]:
# Remove HTML
merged_df["review_body"] = merged_df["review_body"].apply(lambda x: re.sub(r'<[^<]+?>', '', x))



In [21]:
# Remove URLs
merged_df["review_body"] = merged_df["review_body"].apply(lambda x: re.sub(r'https?://\S+', '', x))

In [22]:
# Remove non-alphabetical characters
merged_df["review_body"] = merged_df["review_body"].apply(lambda x: re.sub("[^a-zA-Z]", " ", x))

In [23]:
# Remove extra spaces
merged_df["review_body"] = merged_df["review_body"].apply(lambda x: re.sub(" +", " ", x))


In [24]:
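import contractions
# Expand contractions. Note: apostrophes were already stripped by the
# non-alphabetical filter above, so this step has little effect at this point;
# running it before that filter would expand contractions properly.
merged_df["review_body"] = merged_df["review_body"].apply(lambda x: contractions.fix(x))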

## Average length after cleaning

In [25]:
rev_body_split_ap = merged_df["review_body"].str.split()

word_lengths_ap = rev_body_split_ap.apply(lambda x: sum(len(i) for i in x))
avg_post_cleaned = word_lengths_ap.mean()
if env == 'colab':
  print(f'Average length of the words after cleaning: {avg_post_cleaned}')

Average length of the words after cleaning: 145.61068333333333


In [26]:
print(f'{avg_pp},{avg_post_cleaned}')

153.2628,145.61068333333333


## Preprocessing

## Remove the stop words

In [27]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words("english"))
merged_df["review_body"] = merged_df["review_body"].apply(lambda x: " ".join([word for word in x.split() if word not in stop_words]))

 

## Perform lemmatization

In [28]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
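# Lemmatize word by word; lemmatize() expects a single token, not a whole review
merged_df["review_body"] = merged_df["review_body"].apply(lambda x: " ".join(lemmatizer.lemmatize(word) for word in x.split()))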

### Average length of words

In [29]:
rev_body_split_ap = merged_df["review_body"].str.split()

word_lengths_ap = rev_body_split_ap.apply(lambda x: sum(len(i) for i in x))
avg_processed = word_lengths_ap.mean()
if env == 'colab':
  print(f'Average length of the words after preprocessing: {avg_processed}')

Average length of the words after preprocessing: 95.11606666666667


In [30]:
print(f'{avg_post_cleaned},{avg_processed}')

145.61068333333333,95.11606666666667


# TF-IDF Feature Extraction

**TF-IDF** is used to evaluate how important a word is to a document in a collection of documents. The importance increases proportionally to the number of times the word appears in the document but is offset by how many documents in the collection contain the word.
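
As a rough illustration of that trade-off, the sketch below computes the textbook weight `tf * log(N / df)` on a toy corpus. It is not what `TfidfVectorizer` computes exactly: by default scikit-learn uses a smoothed IDF and L2-normalizes each row. The function name `tf_idf` and the documents are made up for the example.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = ["great product great price", "bad product", "great product service"]
tfidf_weights = tf_idf(docs)
for row in tfidf_weights:
    print(row)
```

Here `product` appears in every document, so its weight is zero everywhere, while `price`, which appears in only one document, receives a positive weight.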

In [31]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Create the vectorizer object
vectorizer = TfidfVectorizer()

# Fit the vectorizer to the data (build the vocabulary)
vectorizer.fit(merged_df["review_body"] )

# Transform the data into TF-IDF features
tfidf_features = vectorizer.transform(merged_df["review_body"] )

In [32]:
tfidf_features.reshape(-1,1)

<1664340000x1 sparse matrix of type '<class 'numpy.float64'>'
	with 897499 stored elements in COOrdinate format>

In [33]:
tfidf_features.shape

(60000, 27739)

In [34]:
labels = merged_df["class"].values
labels = labels[:, np.newaxis]
labels.shape

(60000, 1)

## Split data



In [35]:
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(tfidf_features, labels, test_size=0.2, random_state=0)

In [36]:
# Check the shapes of the resulting matrices
if env == 'colab':
  print("Shape of X_train:", X_train.shape)
  print("Shape of X_test:", X_test.shape)
  print("Shape of y_train:", y_train.shape)
  print("Shape of y_test:", y_test.shape)

Shape of X_train: (48000, 27739)
Shape of X_test: (12000, 27739)
Shape of y_train: (48000, 1)
Shape of y_test: (12000, 1)


# Perceptron

In [37]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In [38]:
from sklearn.linear_model import Perceptron

perceptron = Perceptron(random_state=0)

perceptron.fit(X_train, y_train)

Perceptron()

In [39]:
# Make predictions on the testing data
y_pred = perceptron.predict(X_test)

### Precision:
Precision evaluates the performance of a classification model, specifically in cases where the goal is to reduce the number of false positives. It is a measure of the model's ability to make correct positive predictions.

We can calculate it by:
Precision = TP / (TP + FP)

In [40]:
precision_perceptron = precision_score(y_test, y_pred, average=None)
if env == 'colab':
  print(f'Precision for class 0: {precision_perceptron[0]} Class 1: {precision_perceptron[1]} Class 2: {precision_perceptron[2]}')
precision_perceptron_avg = precision_score(y_test, y_pred, average='micro')
if env == 'colab':
  print(f'Average Precision: {precision_perceptron_avg}')

Precision for class 0: 0.674635429117858 Class 1: 0.5787605787605787 Class 2: 0.7000481463649495
Average Precision: 0.6541666666666667


### Recall
When the goal is to minimize the number of false negatives, we calculate recall by:
Recall = TP / (TP + FN)

A high recall value indicates that the model is correctly identifying most of the positive instances in the dataset.

In [41]:
recall_perceptron = recall_score(y_test, y_pred, average=None)
if env == 'colab':
  print(f'Recall for class 0: {recall_perceptron[0]} Class 1: {recall_perceptron[1]} Class 2: {recall_perceptron[2]}')
recall_perceptron_avg = recall_score(y_test, y_pred, average='micro')
if env == 'colab':
  print(f'Average Recall: {recall_perceptron_avg}')

Recall for class 0: 0.7060295221416062 Class 1: 0.5252725470763132 Class 2: 0.7330476430552054
Average Recall: 0.6541666666666667


### F1 Score
F1 combines precision and recall into a single metric that balances the two. It is the harmonic mean of precision and recall.

An F1-score will be high when both the precision and recall are high, and will be low when either precision or recall is low.
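
The sketch below computes these metrics from scratch on toy labels. It also illustrates why the micro-averaged precision, recall, and F1 reported in this notebook are identical: when every sample has exactly one label, each misclassification counts once as a false positive and once as a false negative, so all three micro averages equal accuracy. The function names and the toy labels are made up for the example.

```python
def per_class_scores(y_true, y_pred, positive):
    """Precision, recall and F1 when `positive` is treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def micro_f1(y_true, y_pred):
    """Pool TP, FP and FN over all classes before computing F1."""
    tp = fp = fn = 0
    for c in set(y_true) | set(y_pred):
        for t, p in zip(y_true, y_pred):
            tp += t == c and p == c
            fp += t != c and p == c
            fn += t == c and p != c
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


y_true = [1, 1, 2, 2, 3, 3]
y_pred = [1, 2, 2, 2, 3, 1]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(per_class_scores(y_true, y_pred, positive=1))  # → (0.5, 0.5, 0.5)
print(micro_f1(y_true, y_pred), accuracy)
```

If a score that weights every class equally is wanted instead, the per-class values can be averaged directly (macro averaging).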

In [42]:
f1_perceptron = f1_score(y_test, y_pred, average=None)
if env == 'colab':
  print(f'F1 for class 0: {f1_perceptron[0]} Class 1: {f1_perceptron[1]} Class 2: {f1_perceptron[2]}')
f1_perceptron_avg = f1_score(y_test, y_pred, average='micro')
if env == 'colab':
  print(f'Average F1 Score: {f1_perceptron_avg}')

F1 for class 0: 0.6899755501222494 Class 1: 0.5507208728406288 Class 2: 0.7161679596108853
Average F1 Score: 0.6541666666666667


In [43]:
print(f'{precision_perceptron[0]}, {precision_perceptron[1]}, {precision_perceptron[2]}, {precision_perceptron_avg}')
print(f'{recall_perceptron[0]}, {recall_perceptron[1]}, {recall_perceptron[2]}, {recall_perceptron_avg}')
print(f'{f1_perceptron[0]}, {f1_perceptron[1]}, {f1_perceptron[2]}, {f1_perceptron_avg}')

0.674635429117858, 0.5787605787605787, 0.7000481463649495, 0.6541666666666667
0.7060295221416062, 0.5252725470763132, 0.7330476430552054, 0.6541666666666667
0.6899755501222494, 0.5507208728406288, 0.7161679596108853, 0.6541666666666667


# SVM
An SVM finds the best boundary (hyperplane) that separates the data into separate classes. This boundary maximizes the margin, which is the distance between the boundary and the closest data points from each class; these closest points are called support vectors.

In [44]:
from sklearn.svm import LinearSVC

svm = LinearSVC(random_state=0, max_iter=1000).fit(X_train, y_train)


In [45]:
svm_pred = svm.predict(X_test)

In [46]:
f1_svm = f1_score(y_test, svm_pred, average=None)
if env == 'colab':
  print(f'F1 for class 0: {f1_svm[0]} Class 1: {f1_svm[1]} Class 2: {f1_svm[2]}')

F1 for class 0: 0.7454723445912873 Class 1: 0.6180378129790496 Class 2: 0.7597499999999999


In [47]:
svm_f1_avg = f1_score(y_test, svm_pred, average='weighted')
if env == 'colab':
  print('F1 (Linear SVM): ', "%.2f" % (svm_f1_avg*100))

F1 (Linear SVM):  70.73


In [48]:
recall_svm = recall_score(y_test, svm_pred, average=None)
if env == 'colab':
  print(f'Recall for class 0: {recall_svm[0]} Class 1: {recall_svm[1]} Class 2: {recall_svm[2]}')
recall_svm_avg = recall_score(y_test, svm_pred, average='micro')
if env == 'colab':
  print(f'Average Recall: {recall_svm_avg}')

Recall for class 0: 0.7620715536652489 Class 1: 0.5993557978196233 Class 2: 0.7660700781446937
Average Recall: 0.7086666666666667


In [49]:
precision_svm = precision_score(y_test, svm_pred, average=None)
if env == 'colab':  
  print(f'Precision for class 0: {precision_svm[0]} Class 1: {precision_svm[1]} Class 2: {precision_svm[2]}')
precision_svm_avg = precision_score(y_test, svm_pred, average='micro')
if env == 'colab':  
  print(f'Average Precision: {precision_svm_avg}')

Precision for class 0: 0.7295808383233533 Class 1: 0.63792194092827 Class 2: 0.7535333498636251
Average Precision: 0.7086666666666667


In [50]:
print(f'{precision_svm[0]}, {precision_svm[1]}, {precision_svm[2]}, {precision_svm_avg}')
print(f'{recall_svm[0]}, {recall_svm[1]}, {recall_svm[2]}, {recall_svm_avg}')
print(f'{f1_svm[0]}, {f1_svm[1]}, {f1_svm[2]}, {svm_f1_avg}')

0.7295808383233533, 0.63792194092827, 0.7535333498636251, 0.7086666666666667
0.7620715536652489, 0.5993557978196233, 0.7660700781446937, 0.7086666666666667
0.7454723445912873, 0.6180378129790496, 0.7597499999999999, 0.7073318187095682


# Logistic Regression
Multinomial logistic regression applies a linear function to a set of features and applies a softmax to the result to obtain a probability distribution over K classes.

In [51]:
from sklearn.linear_model import LogisticRegression
# Instantiate a logistic regression model
logreg_clf = LogisticRegression(multi_class='auto', solver='lbfgs')

# Fit the model to the data
logreg_clf.fit(X_train, y_train)


LogisticRegression()

In [52]:
logreg_pred = logreg_clf.predict(X_test)

In [53]:
precision_lr = precision_score(y_test, logreg_pred, average=None)
if env == 'colab': 
  print(f'Precision for class 0: {precision_lr[0]} Class 1: {precision_lr[1]} Class 2: {precision_lr[2]}')
precision_lr_avg = precision_score(y_test, logreg_pred, average='micro')
if env == 'colab': 
  print(f'Average Precision: {precision_lr_avg}')

Precision for class 0: 0.7470280551592963 Class 1: 0.6536312849162011 Class 2: 0.787603734439834
Average Precision: 0.7294166666666667


In [54]:
recall_lr = recall_score(y_test, logreg_pred, average=None)
if env == 'colab': 
  print(f'Recall for class 0: {recall_lr[0]} Class 1: {recall_lr[1]} Class 2: {recall_lr[2]}')
recall_lr_avg = recall_score(y_test, logreg_pred, average='micro')
if env == 'colab':  
  print(f'Average Recall: {recall_lr_avg}')

Recall for class 0: 0.7860895671753816 Class 1: 0.6377601585728444 Class 2: 0.7655659188303504
Average Recall: 0.7294166666666667


In [55]:
f1_lr = f1_score(y_test, logreg_pred, average=None)
if env == 'colab': 
  print(f'F1 for class 0: {f1_lr[0]} Class 1: {f1_lr[1]} Class 2: {f1_lr[2]}')
f1_avg_lr = f1_score(y_test, logreg_pred, average='micro')
if env == 'colab': 
  print(f'Average F1 Score: {f1_avg_lr}')

F1 for class 0: 0.7660611971230038 Class 1: 0.6455981941309256 Class 2: 0.776428480122715
Average F1 Score: 0.7294166666666667


In [56]:
print(f'{precision_lr[0]}, {precision_lr[1]}, {precision_lr[2]}, {precision_lr_avg}')
print(f'{recall_lr[0]}, {recall_lr[1]}, {recall_lr[2]}, {recall_lr_avg}')
print(f'{f1_lr[0]}, {f1_lr[1]}, {f1_lr[2]}, {f1_avg_lr}')

0.7470280551592963, 0.6536312849162011, 0.787603734439834, 0.7294166666666667
0.7860895671753816, 0.6377601585728444, 0.7655659188303504, 0.7294166666666667
0.7660611971230038, 0.6455981941309256, 0.776428480122715, 0.7294166666666667


# Naive Bayes

Naive Bayes uses the frequency of the words in a document to determine the probability that the document belongs to a certain class.

The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.
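
A from-scratch sketch of the idea, using fractional feature values in place of integer counts, is shown below. The feature columns, their values, and the function names are made up for illustration, and `alpha` is the usual additive (Laplace) smoothing term; this is not the scikit-learn implementation.

```python
import math


def train_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed per-class log feature probabilities."""
    log_prior, log_likelihood = {}, {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        # Sum each feature over the class, then add the smoothing term
        totals = [sum(column) + alpha for column in zip(*rows)]
        denominator = sum(totals)
        log_likelihood[c] = [math.log(t / denominator) for t in totals]
    return log_prior, log_likelihood


def predict_nb(model, x):
    """Pick the class with the highest log posterior (up to a constant)."""
    log_prior, log_likelihood = model
    scores = {
        c: log_prior[c] + sum(xi * ll for xi, ll in zip(x, log_likelihood[c]))
        for c in log_prior
    }
    return max(scores, key=scores.get)


# Columns: hypothetical TF-IDF-like weights for the words "great" and "bad"
X = [[0.9, 0.0], [0.7, 0.1], [0.0, 0.8], [0.2, 0.9]]
y = [3, 3, 1, 1]
model = train_nb(X, y)
print(predict_nb(model, [0.8, 0.1]), predict_nb(model, [0.1, 0.6]))
# → 3 1
```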

In [57]:
from sklearn.naive_bayes import MultinomialNB
# Instantiate a NB model
clf_nb = MultinomialNB()

# Fit the model to the data
clf_nb.fit(X_train, y_train)

MultinomialNB()

In [58]:
clf_nb_pred = clf_nb.predict(X_test)

In [59]:
precision_nb = precision_score(y_test, clf_nb_pred, average=None)
if env == 'colab': 
  print(f'Precision for class 0: {precision_nb[0]} Class 1: {precision_nb[1]} Class 2: {precision_nb[2]}')
precision_nb_avg = precision_score(y_test, clf_nb_pred, average='micro')
if env == 'colab':  
  print(f'Average Precision: {precision_nb_avg}')

Precision for class 0: 0.7645669291338583 Class 1: 0.6214899048503133 Class 2: 0.7693893326462252
Average Precision: 0.71475


In [60]:
recall_nb = recall_score(y_test, clf_nb_pred, average=None)
if env == 'colab': 
  print(f'Recall for class 0: {recall_nb[0]} Class 1: {recall_nb[1]} Class 2: {recall_nb[2]}')
recall_nb_avg = recall_score(y_test, clf_nb_pred, average='micro')
if env == 'colab': 
  print(f'Average Recall: {recall_nb_avg}')

Recall for class 0: 0.7287965974480861 Class 1: 0.6635282457879088 Class 2: 0.7527098563145954
Average Recall: 0.71475


In [61]:
f1_nb = f1_score(y_test, clf_nb_pred, average=None)
if env == 'colab': 
  print(f'F1 for class 0: {f1_nb[0]} Class 1: {f1_nb[1]} Class 2: {f1_nb[2]}')
f1_avg_nb = f1_score(y_test, clf_nb_pred, average='micro')
if env == 'colab': 
  print(f'Average F1 Score: {f1_avg_nb}')

F1 for class 0: 0.7462533623671065 Class 1: 0.641821449970042 Class 2: 0.7609582059123344
Average F1 Score: 0.71475


In [62]:
print(f'{precision_nb[0]}, {precision_nb[1]}, {precision_nb[2]}, {precision_nb_avg}')
print(f'{recall_nb[0]}, {recall_nb[1]}, {recall_nb[2]}, {recall_nb_avg}')
print(f'{f1_nb[0]}, {f1_nb[1]}, {f1_nb[2]}, {f1_avg_nb}')

0.7645669291338583, 0.6214899048503133, 0.7693893326462252, 0.71475
0.7287965974480861, 0.6635282457879088, 0.7527098563145954, 0.71475
0.7462533623671065, 0.641821449970042, 0.7609582059123344, 0.71475
