## The goal of this project is to utilizing the most simple form of Naive Bayes model and also neural network model (LSTM) to perform Sentiment Analysis of Amazon Alexa reviews dataset.

Reference: 

https://www.geeksforgeeks.org/what-is-sentiment-analysis/

https://www.kaggle.com/datasets/sid321axn/amazon-alexa-reviews

In [1]:
import pandas as pd

In [2]:
reviews = pd.read_csv("/content/amazon_alexa.tsv", sep = '\t', index_col = False)

In [3]:
reviews.head()

Unnamed: 0,rating,date,variation,verified_reviews,feedback
0,5,31-Jul-18,Charcoal Fabric,Love my Echo!,1
1,5,31-Jul-18,Charcoal Fabric,Loved it!,1
2,4,31-Jul-18,Walnut Finish,"Sometimes while playing a game, you can answer...",1
3,5,31-Jul-18,Charcoal Fabric,I have had a lot of fun with this thing. My 4 ...,1
4,5,31-Jul-18,Charcoal Fabric,Music,1


# Exploratory Data Analysis

In [4]:
reviews.columns

Index(['rating', 'date', 'variation', 'verified_reviews', 'feedback'], dtype='object')

In [5]:
# Average rating for each variation of Alexa. Perform average of rating based on "variation" column
reviews.groupby('variation')['rating'].mean().round(2).sort_values(ascending = False)

variation
Walnut Finish                   4.89
Oak Finish                      4.86
Charcoal Fabric                 4.73
Heather Gray Fabric             4.69
Configuration: Fire TV Stick    4.59
Black  Show                     4.49
Black  Dot                      4.45
White  Dot                      4.42
Black  Plus                     4.37
Sandstone Fabric                4.36
White  Plus                     4.36
Black  Spot                     4.31
White  Spot                     4.31
White  Show                     4.28
Black                           4.23
White                           4.14
Name: rating, dtype: float64

We can see that Walnut Finish has the highest average rating and white variation has the lowest rating.

In [6]:
reviews.groupby('variation')['rating'].count().sort_values(ascending = False)

variation
Black  Dot                      516
Charcoal Fabric                 430
Configuration: Fire TV Stick    350
Black  Plus                     270
Black  Show                     265
Black                           261
Black  Spot                     241
White  Dot                      184
Heather Gray Fabric             157
White  Spot                     109
White                            91
Sandstone Fabric                 90
White  Show                      85
White  Plus                      78
Oak Finish                       14
Walnut Finish                     9
Name: rating, dtype: int64

It makes sense that since Walnut Finish only have 9 reviews, the product has a higher average score. 

However, for a good product that has a lot of rating and a decent score, Black Dot and Charcoal Fabric takes the win.

## We still need to create a label column. I am planning to do so by creating a new sentiment labels column based on the rating. We can then use this column in the train and test set. If rating ==3, 'neutral', if <3, 'negative', and 'positive' if >3

In [7]:
# Create a new column 'label' based on a condition of 'rating' column
reviews['label'] = 'neutral'  # Set a default value for all rows
reviews.loc[reviews['rating'] > 3, 'label'] = 'positive'  # Assign 'positive' for rows where the condition is True
reviews.loc[reviews['rating'] < 3, 'label'] = 'negative'

In [8]:
# Test to see if our label is correct
reviews.loc[reviews['rating'] == 3]

Unnamed: 0,rating,date,variation,verified_reviews,feedback,label
6,3,31-Jul-18,Sandstone Fabric,"Without having a cellphone, I cannot use many ...",1,neutral
24,3,30-Jul-18,Sandstone Fabric,"I got a second unit for the bedroom, I was exp...",1,neutral
33,3,30-Jul-18,Heather Gray Fabric,The speakers sound pretty good for being so sm...,1,neutral
49,3,30-Jul-18,Charcoal Fabric,No different than Apple. To play a specific li...,1,neutral
54,3,30-Jul-18,Sandstone Fabric,like google better,1,neutral
...,...,...,...,...,...,...
3059,3,30-Jul-18,White Dot,Works well. Just disappointed with the speaker...,1,neutral
3068,3,30-Jul-18,White Dot,I was hoping the cord was white also. Otherwis...,1,neutral
3114,3,30-Jul-18,Black Dot,,1,neutral
3122,3,30-Jul-18,Black Dot,I dislike that it confuses my requests all the...,1,neutral


Now we need to put the text column into X and corresponding sentiment labels into y column

In [9]:
X = reviews['verified_reviews']
y = reviews['label']

In [10]:
X.head()

0                                        Love my Echo!
1                                            Loved it!
2    Sometimes while playing a game, you can answer...
3    I have had a lot of fun with this thing. My 4 ...
4                                                Music
Name: verified_reviews, dtype: object

In [11]:
y.head()

0    positive
1    positive
2    positive
3    positive
4    positive
Name: label, dtype: object

# Creating a Naive Bayes model for Sentiment Analysis

In [12]:
!pip install scikit-learn

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [13]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Feature extraction

In [14]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(X)  # X is the input text data

Convert the input texts into numerical features using the CountVectorizer class from scikit-learn. This class transforms the text data into a matrix of word frequencies.

In [15]:
# Split the dataset into training and test sets:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify = y, random_state=42)

We keep stratify y column since I want to preserve the distribution of the labels from y

In [16]:
X_train.shape

(2520, 4044)

In [17]:
y_train.shape

(2520,)

In [18]:
y_test.shape

(630,)

In [19]:
X_test.shape

(630, 4044)

In [20]:
# Create an instance of the MultinomialNB class and train it on the training data.
clf = MultinomialNB()
clf.fit(X_train, y_train)

In [21]:
# Make prediction
y_pred = clf.predict(X_test)

In [22]:
# Evaluate the model:
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.8904761904761904


The model is doing good with 89% accuracy. We can try to tune more parameter such as alpha or use grid_search to find the most optimal parameter.

# Quick simple analysis. Counting values for each label category of prediction and test set.

In [23]:
y_pred_df = pd.DataFrame(y_pred)
y_pred_df[0].value_counts()

positive    596
negative     30
neutral       4
Name: 0, dtype: int64

In [24]:
y_test_df = pd.DataFrame(y_test)
y_test_df['label'].value_counts()

positive    548
negative     52
neutral      30
Name: label, dtype: int64

Overall, seems like the model does well with predicting positive and negative values since it seems like the numbers are very close. However, seems like prediciting neutral is a little bit harder to do for this model.

# Now we can start with Neural Network for Sentiment Analysis

I want to use Recurrent Neural network(RNN), specially the Long Short Term Memory since they can handly sequential data like text.

In [25]:
# Import necessary libraries
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Embedding
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical
from sklearn.preprocessing import LabelEncoder
from scipy.sparse import csr_matrix
import tensorflow as tf

Need to preprocess the text data to convert it into a suitable format for the LSTM model. This may involve steps like tokenization, vectorization, and padding. I use the Tokenizer class from Keras for tokenization and the pad_sequences function to pad the sequences to a fixed length.

In [26]:
# First we set the seed
# Set a fixed seed for TensorFlow operations
seed_value = 42
tf.random.set_seed(seed_value)
np.random.seed(seed_value)

In [27]:
# Convert labels to numerical values
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)

In [28]:
# Convert labels to one-hot encoding
y = to_categorical(y, num_classes=3)

In [29]:
X_train2, X_test2, y_train2, y_test2 = train_test_split(X, y, test_size=0.2, random_state=42)

In [30]:
# Preprocess the data
# Convert X_train to a list of strings
X_train_texts = [str(row) for row in X_train2]
# Convert X_test to a list of strings
X_test_texts = [str(row) for row in X_test2]
# Tokenize the text data
max_words = 10000  # Maximum number of words to keep in the vocabulary. 10,000 seems like an adequate number for this.
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(X_train_texts)

X_train_seq = tokenizer.texts_to_sequences(X_train_texts)
X_test_seq = tokenizer.texts_to_sequences(X_test_texts)

# Pad the sequences to ensure consistent length
max_length = 300  # Maximum length of a review
X_train_pad = pad_sequences(X_train_seq, maxlen=max_length)
X_test_pad = pad_sequences(X_test_seq, maxlen=max_length)

In [31]:
# Define and train the LSTM model
embedding_dim = 100  # Dimensionality of the word embeddings

model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=max_length))
model.add(LSTM(100))
model.add(Dense(3, activation='softmax'))  # Three classes for sentiment analysis

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

batch_size = 64
epochs = 10 # Iteration

model.fit(X_train_pad, y_train2, validation_data=(X_test_pad, y_test2), batch_size=batch_size, epochs=epochs)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7ff93878d9f0>

In [32]:
# Evaluate the model
loss, accuracy = model.evaluate(X_test_pad, y_test2)
print('Test loss:', round(loss,3))
print('Test accuracy:', round(accuracy,3))

Test loss: 0.474
Test accuracy: 0.883


In summary, the loss metric provides a measure of how well the model is fitting the training data, and minimizing the loss during training helps the model make better predictions on unseen data.

In the case of multi-class classification, the range for categorical cross-entropy loss is from 0 to infinity. The lower value the better with 0 represents a perfect match between predictions and the true labels.

# We can see that with a 89.5% accuracy, LSTM performs a little bit better than Naive Bayes (89%) for this problem.

The limitation for this is that I only run the code once everytime, so even though I set seed, it is not always guarantee complete non-determinism. If possible, should focus on cross validation to find the best possible accuracy score. Or we could do a grid search to find the best hyperparameter.