# Assignment 1: Sentiment Analysis

This assignment is a classification prediction of reviews collected from the three domains: imdb.com, amazon.com, yelp.com. The goal of the assignment is to develop an effective Bag of Words (BoW) feature engineering, and use a binary classifer to predict the sentiment (Positive or Negative) of held-out data.

The assignment is divided into 4 sections:
- Task 1: Data loading and Data preparation
- Task 2: Feature representation
- Task 3: Classification and evaluation
- Task 4: Report

Task 1-3 is displayed in this Jupyter notebook file, with Task 4 submitted in a .pdf file included with the submission. This Jupyter notebook however contains a Task 4 section which includes some code to supplement the report

## Task 1: Data loading and Data preparation

In [1]:
# import libraries used in all tasks
import numpy as np
import pandas as pd
import string
import nltk

nltk.download('punkt')

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix



[nltk_data] Downloading package punkt to
[nltk_data]     /Users/marcustran/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### Subtask 1.1: Load the datasets

In [2]:
# load the datasets. header=None so pd does not treat the first row as the header
x_columns = ['source', 'comment'] # x_test.csv and x_train.csv have two unnamed columns 
x_test = pd.read_csv("Dataset/x_test.csv", names=x_columns,header=None)
y_test = pd.read_csv("Dataset/y_test.csv",header=None)
x_train = pd.read_csv("Dataset/x_train.csv", names=x_columns,header=None)
y_train = pd.read_csv("Dataset/y_train.csv",header=None)


# change the data type to str for preprocessing
x_train['comment'] = x_train['comment'].astype(str)
x_test['comment'] = x_test['comment'].astype(str)

# drop the 'source' column as the source is not the subject of analysis
x_train.drop('source', axis=1, inplace=True)
x_test.drop('source', axis=1, inplace=True)

# check the y_train 0 (negative) and 1 (positive) distribution. Same with y_test
y_train.value_counts() # equal, 1: 1200, 0: 1200
y_test.value_counts() # equal, 1: 300, 0: 300

# make y_train 1D array for downstream ML
y_train_series = y_train.iloc[:,0]
y_test_series = y_test.iloc[:,0]


### Subtask 1.2: Pre-processing the data

The cleaning steps are:
- Remove punctuations
- Conversion to lower case
- Remove stop words


In [3]:
# create a function to preprocess x_train and x_test: remove all punctuations, convert all to lower case, remove stop words

# create a function that removes punctuations
def remove_punctuation(text):
    """This function removes punctuations"""
    result = text
    for char in string.punctuation:
        result = result.replace(char, '')
    return result

# create a function to remove stop words
def remove_stopwords(text):
    """This function removes stop words"""
    tokens = word_tokenize(text)
    
    # specify the language
    english_stopwords = stopwords.words('english')
    tokens_no_stopwords = [t for t in tokens if t not in english_stopwords]
    text = " ".join(tokens_no_stopwords)
    
    return text
    
        
# a general preprocess function
def preprocess(text):
    """This function preprocesses the data:
    - Conversion to lower case
    - Remove punctuations
    - Remove stop words"""
    # lower case conversion
    text = text.lower()
    
    # remove punctuations
    text = remove_punctuation(text)
    
    # remove stop words
    text = remove_stopwords(text)
    
    return text


# apply preprocess function, and add result to a new column 'processed' - do it for both x_train and x_test
x_train['new_text'] = x_train['comment'].apply(preprocess)    

x_test['new_text'] = x_test['comment'].apply(preprocess)    
x_test


Unnamed: 0,comment,new_text
0,It only recognizes the Phone as its storage de...,recognizes phone storage device
1,Disappointing accessory from a good manufacturer.,disappointing accessory good manufacturer
2,The one big drawback of the MP3 player is that...,one big drawback mp3 player buttons phones fro...
3,This particular model would not work with my M...,particular model would work motorola q smartphone
4,If the two were seperated by a mere 5+ ft I st...,two seperated mere 5 ft started notice excessi...
...,...,...
595,Everything was fresh and delicious!,everything fresh delicious
596,#NAME?,name
597,Pretty awesome place.,pretty awesome place
598,"The staff are great, the ambiance is great.",staff great ambiance great


## Task 2: Feature representation
- Since stop words have already been removed, no need to exclude common words
- Keep all vocabularies, since it is a moderate size training set with 2400 documents
- Unigrams only, since simplicity could be the most effective

In [4]:
vectorizer = CountVectorizer() # use CountVectorizer to count the word frequency
X_feature_train = vectorizer.fit_transform(x_train['new_text']) # transform x_train to the feature matrix

vectorizer.get_feature_names_out() # some also contains numbers


# ==== Optional: EDA on the 20 most frequent words ==== 
word_counts = np.array(X_feature_train.sum(axis=0)).flatten() # get total word frequencies across all docs

# map word to count
vocab = vectorizer.get_feature_names_out()
word_freq = pd.DataFrame({'word': vocab, 'count': word_counts})

# sort and show top 20 most frequent words:
top_words = word_freq.sort_values(by='count', ascending=False).head(20)
print(top_words)

# transform preprocessed x_test using vectorizer too - using the same vectorizer
X_feature_test = vectorizer.transform(x_test['new_text'])

X_feature_train
# printing this shows: 2400 - number of rows in x_train['new_text'] (documents)
# 4564: number of unique words in all docs (vocabulary size)
# 14608: number of non-zero word count


         word  count
1771     good    183
1793    great    162
2605    movie    141
2925    phone    137
1546     film    122
2753      one    112
2950    place    100
1622     food     99
2319     like     95
3521  service     85
4048     time     84
3199   really     79
338       bad     79
4517    would     75
4427     well     71
1203     dont     65
1396     even     64
1401     ever     63
424      best     63
332      back     62


<2400x4564 sparse matrix of type '<class 'numpy.int64'>'
	with 14608 stored elements in Compressed Sparse Row format>

## Task 3: Classification and Evaluation
- Multinomial Naive Bayes Classifer is chosen for document classification problems - use the frequency of words as the predictor

Justifications:
- works well with high-dimensional text data
- fits with BoW - as it uses the word frequency, which is kept in CountVectorizer()
- simple, fast yet effective

### Subtask 3.1: Splitting the training data into test set and training set
- Since the comment sentiment distribution are closely equal, I don't have to use stratified sampling

- Split the data in several ways to ensure that the result is robust: 80/20, then 70/30, lastly 90/10 for train/test ratio - have random_state = 42 for reproducibility

In [5]:
# split the training set into 3 different train/validate ratios. Denote X and Y to differentiate from x_train and y_train
# 80/20
X_train_1, X_val_1, Y_train_1, Y_val_1 = train_test_split(X_feature_train, y_train_series, test_size=0.2, random_state=42)
# 70/30
X_train_2, X_val_2, Y_train_2, Y_val_2 = train_test_split(X_feature_train, y_train_series, test_size=0.3, random_state=42)
# 90/10
X_train_3, X_val_3, Y_train_3, Y_val_3 = train_test_split(X_feature_train, y_train_series, test_size=0.1, random_state=42)



### Subtask 3.2: 
- Model Setup
- Cross-Validation with Hyperparameter Tuning: Use GridSearchCV with 5-fold. F1-score is the success metric 
- Apply the best parameter, and predict the validation sets for all three split sets
- Fit the classifer to the entire training data, and evaluate its performance on the test set

In [6]:
nb_classifier = MultinomialNB() # set up the classifer

# setting up param_grid
param_grid = {'alpha': [0.1, 0.5, 1, 2, 5, 10]}

# setting up GridSearchCV
grid_search = GridSearchCV(
    estimator=nb_classifier,
    param_grid=param_grid,
    scoring='f1',
    cv=5, # set 5 folds
    verbose=1
)

# 80/20 split
grid_search.fit(X_train_1, Y_train_1)
best_80_20 = grid_search.best_params_, grid_search.best_score_
print("80/20 split", best_80_20)

# 70/30 split
grid_search.fit(X_train_2, Y_train_2)
best_70_30 = grid_search.best_params_, grid_search.best_score_
print("70/30 split", best_70_30)

# 90/10 split
grid_search.fit(X_train_3, Y_train_3)
best_90_10 = grid_search.best_params_, grid_search.best_score_
print("90/10 split", best_90_10)

# alpha 0.5 is the most suitable, as it was selected 2 out of 3, and gives the best f1 score


# apply alpha=0.5 onto MultinomialNB, retrain X_train, and apply to predict X_val sets
nb_classifier = MultinomialNB(alpha=0.5) # set up the classifer
nb_classifier.fit(X_train_1, Y_train_1)

# predict on validation data
Y_pred_1 = nb_classifier.predict(X_val_1)

# Evaluate - comparing the predicted vs the actual values
print("\n80/20 split")
print(classification_report(Y_val_1, Y_pred_1))
print(confusion_matrix(Y_val_1, Y_pred_1))

# Repeat the steps for 70/30 and 90/10 splits
# 70/30
nb_classifier.fit(X_train_2, Y_train_2)

# predict on validation data
Y_pred_2 = nb_classifier.predict(X_val_2)

# Evaluate - comparing the predicted vs the actual values
print("\n70/30 split")
print(classification_report(Y_val_2, Y_pred_2))
print(confusion_matrix(Y_val_2, Y_pred_2))

# 90/10
nb_classifier.fit(X_train_3, Y_train_3)

# predict on validation data
Y_pred_3 = nb_classifier.predict(X_val_3)

# Evaluate - comparing the predicted vs the actual values
print("\n90/10 split")
print(classification_report(Y_val_3, Y_pred_3))
print(confusion_matrix(Y_val_3, Y_pred_3))


# results are consistent, with 80/20 split giving the best results
# fit the classifier to the entire training data

nb_classifier.fit(X_feature_train, y_train_series)
y_pred = nb_classifier.predict(X_feature_test)

# Evaluate - comparing the predicted vs the actual values
print("\nTest set prediction")
print(classification_report(y_test_series, y_pred))
print(confusion_matrix(y_test_series, y_pred))

Fitting 5 folds for each of 6 candidates, totalling 30 fits
80/20 split ({'alpha': 0.5}, 0.7925993213506387)
Fitting 5 folds for each of 6 candidates, totalling 30 fits
70/30 split ({'alpha': 1}, 0.7829276439391594)
Fitting 5 folds for each of 6 candidates, totalling 30 fits
90/10 split ({'alpha': 0.5}, 0.8023144808719864)

80/20 split
              precision    recall  f1-score   support

           0       0.83      0.81      0.82       237
           1       0.82      0.84      0.83       243

    accuracy                           0.83       480
   macro avg       0.83      0.83      0.83       480
weighted avg       0.83      0.83      0.83       480

[[192  45]
 [ 38 205]]

70/30 split
              precision    recall  f1-score   support

           0       0.82      0.78      0.80       348
           1       0.80      0.84      0.82       372

    accuracy                           0.81       720
   macro avg       0.81      0.81      0.81       720
weighted avg       0.81    

## Task 4: Report
This section contains some code that is used to illustrate topics written in the report 

In [7]:
# check for the false positives and negative in 80/20 set evaluation
text_train = x_train['new_text'].tolist() 
comment_train = x_train['comment'].tolist() 

for i, (true, pred) in enumerate(zip(Y_val_1, Y_pred_1)):
    if true != pred:
        print(f"Original Text: {comment_train[i]}")
        print(f"Preprocessed Text: {text_train[i]}")
        print(f"True: {true}, Pred: {pred}\n")




Original Text: The only thing that disappoint me is the infra red port (irda).
Preprocessed Text: thing disappoint infra red port irda
True: 0, Pred: 1

Original Text: Doesn't hold charge.
Preprocessed Text: doesnt hold charge
True: 1, Pred: 0

Original Text: Can't store anything but phone numbers to SIM.
Preprocessed Text: cant store anything phone numbers sim
True: 0, Pred: 1

Original Text: Very disappointing.
Preprocessed Text: disappointing
True: 1, Pred: 0

Original Text: I would not recommend this item to anyone.
Preprocessed Text: would recommend item anyone
True: 1, Pred: 0

Original Text: Just does not work.
Preprocessed Text: work
True: 1, Pred: 0

Original Text: In my house I was getting dropped coverage upstairs and no coverage in my basement.
Preprocessed Text: house getting dropped coverage upstairs coverage basement
True: 1, Pred: 0

Original Text: [...] down the drain because of a weak snap!
Preprocessed Text: drain weak snap
True: 1, Pred: 0

Original Text: Defective 

In [8]:
# check accurate predictions
count = 0
for i, (true, pred) in enumerate(zip(Y_val_1, Y_pred_1)):
    if true == pred:
        print(f"Original Text: {comment_train[i]}")
        print(f"Preprocessed Text: {text_train[i]}")
        print(f"True: {true}, Pred: {pred}\n")
        count += 1
        if count == 30:
            break


Original Text: Oh and I forgot to also mention the weird color effect it has on your phone.
Preprocessed Text: oh forgot also mention weird color effect phone
True: 1, Pred: 1

Original Text: THAT one didn't work either.
Preprocessed Text: one didnt work either
True: 0, Pred: 0

Original Text: Waste of 13 bucks.
Preprocessed Text: waste 13 bucks
True: 0, Pred: 0

Original Text: Product is useless, since it does not have enough charging current to charge the 2 cellphones I was planning to use it with.
Preprocessed Text: product useless since enough charging current charge 2 cellphones planning use
True: 0, Pred: 0

Original Text: None of the three sizes they sent with the headset would stay in my ears.
Preprocessed Text: none three sizes sent headset would stay ears
True: 1, Pred: 1

Original Text: Worst customer service.
Preprocessed Text: worst customer service
True: 0, Pred: 0

Original Text: The Ngage is still lacking in earbuds.
Preprocessed Text: ngage still lacking earbuds
True: 

In [9]:
# check for the false positives and negative in test set
text_test = x_test['new_text'].tolist() 
comment_test = x_test['comment'].tolist() 

for i, (true, pred) in enumerate(zip(y_test_series, y_pred)):
    if true != pred:
        print(f"Original Text: {comment_test[i]}")
        print(f"Preprocessed Text: {text_test[i]}")
        print(f"True: {true}, Pred: {pred}\n")


Original Text: It only recognizes the Phone as its storage device.
Preprocessed Text: recognizes phone storage device
True: 0, Pred: 1

Original Text: I purchased this and within 2 days it was no longer working!!!!!!!!!
Preprocessed Text: purchased within 2 days longer working
True: 0, Pred: 1

Original Text: AFTER ARGUING WITH VERIZON REGARDING THE DROPPED CALLS WE RETURNED THE PHONES AFTER TWO DAYS.
Preprocessed Text: arguing verizon regarding dropped calls returned phones two days
True: 0, Pred: 1

Original Text: My experience was terrible..... This was my fourth bluetooth headset, and while it was much more comfortable than my last Jabra (which I HATED!!!
Preprocessed Text: experience terrible fourth bluetooth headset much comfortable last jabra hated
True: 0, Pred: 1

Original Text: When it opens, the battery connection is broken and the device is turned off.
Preprocessed Text: opens battery connection broken device turned
True: 0, Pred: 1

Original Text: Not as good as I had hope