# Lab Assignment 2 - Text Analytics CISB5123

In this lab assignment, we perform sentiment analysis, exploring sentiment classification on the Amazon
Fine Food Reviews dataset (https://www.kaggle.com/datasets/snap/amazon-fine-food-reviews?select=Reviews.csv).

#### Team members:
1. Nur Husnina Binti Norishak (IS01081121)
2. Nur Khairina Sofea Binti Khaidzir (IS01081122)

### Step 1: Import the libraries

In [2]:
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re

### Step 2: Load the data

In [3]:
# Load the dataset
# read data - first 1000 rows
data = pd.read_csv('Reviews.csv', nrows=1000)
data

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...
...,...,...,...,...,...,...,...,...,...,...
995,996,B006F2NYI2,A1D3F6UI1RTXO0,Swopes,1,1,5,1331856000,Hot & Flavorful,BLACK MARKET HOT SAUCE IS WONDERFUL.... My hus...
996,997,B006F2NYI2,AF50D40Y85TV3,Mike A.,1,1,5,1328140800,Great Hot Sauce and people who run it!,"Man what can i say, this salsa is the bomb!! i..."
997,998,B006F2NYI2,A3G313KLWDG3PW,kefka82,1,1,5,1324252800,this sauce is the shiznit,this sauce is so good with just about anything...
998,999,B006F2NYI2,A3NIDDT7E7JIFW,V. B. Brookshaw,1,2,1,1336089600,Not Hot,Not hot at all. Like the other low star review...


### Step 3: Data Preprocessing

In [4]:
# check no. of rows & columns
data.shape

(1000, 10)

In [5]:
# removing HTML tags and Unwanted characters
def clean_text(text):
    # Remove HTML tags
    text = re.sub(r'<.*?>', '', text)
    
    # Remove URLs
    text = re.sub(r'http\S+', '', text)
    
    # Remove special characters and digits
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    
    # Convert to lowercase
    text = text.lower()
    
    # Remove extra whitespaces
    text = re.sub(r'\s+', ' ', text)
    
    return text.strip()

# Apply the clean_text function to the 'Text' column
data['Text'] = data['Text'].apply(clean_text)

# View the cleaned text data
data

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,i have bought several of the vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,product arrived labeled as jumbo salted peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",this is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,if you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,great taffy at a great price there was a wide ...
...,...,...,...,...,...,...,...,...,...,...
995,996,B006F2NYI2,A1D3F6UI1RTXO0,Swopes,1,1,5,1331856000,Hot & Flavorful,black market hot sauce is wonderful my husband...
996,997,B006F2NYI2,AF50D40Y85TV3,Mike A.,1,1,5,1328140800,Great Hot Sauce and people who run it!,man what can i say this salsa is the bomb i ha...
997,998,B006F2NYI2,A3G313KLWDG3PW,kefka82,1,1,5,1324252800,this sauce is the shiznit,this sauce is so good with just about anything...
998,999,B006F2NYI2,A3NIDDT7E7JIFW,V. B. Brookshaw,1,2,1,1336089600,Not Hot,not hot at all like the other low star reviewe...
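
One side effect of `clean_text` is worth checking. Tags and punctuation are replaced with an empty string, so words separated only by a tag such as `<br />` get fused together. The sketch below is a variant, not the function used in this notebook; the name `clean_text_spaced` and the sample review are assumptions for illustration. It substitutes a space instead and then collapses whitespace as before.

```python
import re


def clean_text_spaced(text):
    """Variant of clean_text that replaces removed spans with a space."""
    text = re.sub(r'<.*?>', ' ', text)        # HTML tags -> space
    text = re.sub(r'http\S+', ' ', text)      # URLs -> space
    text = re.sub(r'[^a-zA-Z\s]', ' ', text)  # punctuation and digits -> space
    text = text.lower()
    # Collapse the runs of whitespace introduced above
    return re.sub(r'\s+', ' ', text).strip()


sample = "Great taste.<br />Arrived in 2 days!"  # made-up review
print(clean_text_spaced(sample))  # great taste arrived in days
```

On the same sample, the original `clean_text` returns `great tastearrived in days`. The trade-off is that contractions are split as well: "don't" becomes "don t" rather than "dont".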


In [6]:
# Check for missing values
data.isnull().sum()

Id                        0
ProductId                 0
UserId                    0
ProfileName               0
HelpfulnessNumerator      0
HelpfulnessDenominator    0
Score                     0
Time                      0
Summary                   0
Text                      0
dtype: int64

In [7]:
# Select relevant columns for sentiment analysis
# .copy() returns an independent DataFrame rather than a slice of the original
data = data[['Score', 'Text']].copy()

In [8]:
# Tokenization: download the Punkt tokenizer models
nltk.download('punkt')

# Tokenize the text into individual words
data['Tokens'] = data['Text'].apply(lambda x: nltk.word_tokenize(x))

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\nurhu\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [9]:
# Lemmatization (note: stop words are not removed in this notebook)
# Download the WordNet data used by the lemmatizer
nltk.download('wordnet')

# Initialize the WordNet lemmatizer
lemmatizer = WordNetLemmatizer()

# Lemmatize the tokens
data['Tokens'] = data['Tokens'].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\nurhu\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [10]:
# Join the tokens back into sentences
data['Preprocessed_Text'] = data['Tokens'].apply(lambda x: ' '.join(x))

# Save the preprocessed data to a new CSV file
data.to_csv('extracted_data_1000reviews.csv', index=False)

# Preview the preprocessed data
data


Unnamed: 0,Score,Text,Tokens,Preprocessed_Text
0,5,i have bought several of the vitality canned d...,"[i, have, bought, several, of, the, vitality, ...",i have bought several of the vitality canned d...
1,1,product arrived labeled as jumbo salted peanut...,"[product, arrived, labeled, a, jumbo, salted, ...",product arrived labeled a jumbo salted peanuts...
2,4,this is a confection that has been around a fe...,"[this, is, a, confection, that, ha, been, arou...",this is a confection that ha been around a few...
3,2,if you are looking for the secret ingredient i...,"[if, you, are, looking, for, the, secret, ingr...",if you are looking for the secret ingredient i...
4,5,great taffy at a great price there was a wide ...,"[great, taffy, at, a, great, price, there, wa,...",great taffy at a great price there wa a wide a...
...,...,...,...,...
995,5,black market hot sauce is wonderful my husband...,"[black, market, hot, sauce, is, wonderful, my,...",black market hot sauce is wonderful my husband...
996,5,man what can i say this salsa is the bomb i ha...,"[man, what, can, i, say, this, salsa, is, the,...",man what can i say this salsa is the bomb i ha...
997,5,this sauce is so good with just about anything...,"[this, sauce, is, so, good, with, just, about,...",this sauce is so good with just about anything...
998,1,not hot at all like the other low star reviewe...,"[not, hot, at, all, like, the, other, low, sta...",not hot at all like the other low star reviewe...
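
The `Tokens` column above shows a side effect of calling `lemmatizer.lemmatize(word)` without a part-of-speech tag: each token is treated as a noun by default, so "as", "has" and "was" come out as "a", "ha" and "wa". One possible workaround, sketched below, is to leave a protected set of function words untouched. The helper `lemmatize_tokens`, the `protected` set and the toy stand-in lemmatizer are all assumptions for illustration. In the notebook one would pass `lemmatizer.lemmatize` and, for example, `set(stopwords.words('english'))`, which requires the NLTK `stopwords` corpus.

```python
def lemmatize_tokens(tokens, lemmatize, protected):
    """Apply `lemmatize` to each token, skipping tokens in `protected`."""
    return [tok if tok in protected else lemmatize(tok) for tok in tokens]


def toy_lemmatize(word):
    """Toy stand-in for a noun-only lemmatizer: strips one trailing 's'."""
    return word[:-1] if len(word) > 1 and word.endswith('s') else word


# Hypothetical protected set; a stop-word list would play this role in practice.
protected = {'as', 'has', 'was', 'is', 'this'}
tokens = ['product', 'was', 'labeled', 'as', 'peanuts']

print([toy_lemmatize(t) for t in tokens])                  # function words damaged
print(lemmatize_tokens(tokens, toy_lemmatize, protected))  # function words kept
```

Passing a part-of-speech tag to the lemmatizer is another option, at the cost of running a tagger over every review first.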


### Step 4: Feature Extraction

In [11]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

#### 1. Bag-of-Words

In [12]:
# Initialize the CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform the preprocessed text data
bow_features = vectorizer.fit_transform(data['Preprocessed_Text'])

# Get the vocabulary (unique words)
vocabulary = vectorizer.get_feature_names_out()

# Print the shape of the BoW features and the vocabulary size
print("BoW feature shape:", bow_features.shape)
print("Vocabulary size:", len(vocabulary))

BoW feature shape: (1000, 6047)
Vocabulary size: 6047


#### 2. Term Frequency-Inverse Document Frequency (TF-IDF)

In [13]:
# Initialize the TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()

# Fit and transform the preprocessed text data
tfidf_features = tfidf_vectorizer.fit_transform(data['Preprocessed_Text'])

# Get the vocabulary (unique words)
tfidf_vocabulary = tfidf_vectorizer.get_feature_names_out()

# Print the shape of the TF-IDF features and the vocabulary size
print("TF-IDF feature shape:", tfidf_features.shape)
print("Vocabulary size:", len(tfidf_vocabulary))

TF-IDF feature shape: (1000, 6047)
Vocabulary size: 6047
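
To make the TF-IDF weights less of a black box, the following sketch recomputes them in plain Python. It assumes `TfidfVectorizer` is used with its default settings: raw term counts, smoothed IDF `idf = ln((1 + n) / (1 + df)) + 1`, then L2 normalization of each row. The function name `tfidf_rows` and the two sample documents are made up for illustration. Tokenization here is a simple whitespace split, whereas scikit-learn's default token pattern also drops single-character tokens.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {term: tf[term] * idf[term] for term in tf}
        # Guard against empty documents, whose norm would be zero
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows


rows = tfidf_rows(["great taffy great price", "great sauce not hot"])
for row in rows:
    print({term: round(weight, 3) for term, weight in row.items()})
```

In this toy example "great" appears in both documents, so it gets the minimum IDF of 1. In the second document it is therefore weighted lower than the rarer terms, even though each term occurs once.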


### Step 5: Model Selection

In [14]:
from nltk.sentiment import SentimentIntensityAnalyzer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, classification_report

#### 1. Lexicon-based approach

In [15]:
# Download the VADER lexicon
nltk.download('vader_lexicon')

# Assign sentiment labels based on the 'Score' column
def assign_sentiment(score):
    if score >= 4:
        return 'Positive'
    elif score <= 2:
        return 'Negative'
    else:
        return 'Neutral'

data['Sentiment'] = data['Score'].apply(assign_sentiment)

# Initialize the sentiment analyzer
sid = SentimentIntensityAnalyzer()

# Calculate sentiment scores for each review
data['Lexicon_Sentiment'] = data['Preprocessed_Text'].apply(lambda x: sid.polarity_scores(x)['compound'])

# Map sentiment scores to labels
data['Lexicon_Sentiment_Label'] = data['Lexicon_Sentiment'].apply(lambda x: 'Positive' if x > 0 else ('Negative' if x < 0 else 'Neutral'))

# Evaluate the lexicon-based approach
lexicon_accuracy = accuracy_score(data['Sentiment'], data['Lexicon_Sentiment_Label'])
print("Lexicon-based Approach Accuracy:", lexicon_accuracy)

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\nurhu\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!

Lexicon-based Approach Accuracy: 0.807
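
The mapping above treats any non-zero compound score as Positive or Negative. The VADER authors suggest a small dead zone instead: a compound score of at least 0.05 counts as positive, at most -0.05 as negative, and anything in between as neutral. A minimal sketch of that rule follows; the function name `label_from_compound` and the `threshold` parameter are assumptions, not part of NLTK.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a sentiment label."""
    if compound >= threshold:
        return 'Positive'
    if compound <= -threshold:
        return 'Negative'
    return 'Neutral'


# Illustrative scores, not taken from the dataset
for score in (0.6249, 0.02, -0.3):
    print(score, label_from_compound(score))
```

In the notebook this could be applied as `data['Lexicon_Sentiment'].apply(label_from_compound)`. VADER also uses capitalization and punctuation such as exclamation marks as intensity cues, so scoring the review text before it is lowercased and stripped of punctuation might give different results; that comparison was not run here.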




#### 2. Machine learning-based approaches (Naive Bayes & SVM)

In [16]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data['Preprocessed_Text'], data['Sentiment'], test_size= 0.2, random_state=42)

# Create a TF-IDF vectorizer
tfidf = TfidfVectorizer()

# Fit and transform the training data
X_train_tfidf = tfidf.fit_transform(X_train)

# Transform the testing data
X_test_tfidf = tfidf.transform(X_test)

# Train and evaluate Naive Bayes classifier
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train_tfidf, y_train)
nb_predictions = nb_classifier.predict(X_test_tfidf)
nb_accuracy = accuracy_score(y_test, nb_predictions)
print("Naive Bayes Accuracy:", nb_accuracy)
print(classification_report(y_test, nb_predictions))

# Train and evaluate SVM classifier
svm_classifier = LinearSVC()
svm_classifier.fit(X_train_tfidf, y_train)
svm_predictions = svm_classifier.predict(X_test_tfidf)
svm_accuracy = accuracy_score(y_test, svm_predictions)
print("SVM Accuracy:", svm_accuracy)
print(classification_report(y_test, svm_predictions))

Naive Bayes Accuracy: 0.82
              precision    recall  f1-score   support

    Negative       0.00      0.00      0.00        25
     Neutral       0.00      0.00      0.00        11
    Positive       0.82      1.00      0.90       164

    accuracy                           0.82       200
   macro avg       0.27      0.33      0.30       200
weighted avg       0.67      0.82      0.74       200

SVM Accuracy: 0.845
              precision    recall  f1-score   support

    Negative       0.73      0.32      0.44        25
     Neutral       0.00      0.00      0.00        11
    Positive       0.85      0.98      0.91       164

    accuracy                           0.84       200
   macro avg       0.53      0.43      0.45       200
weighted avg       0.79      0.84      0.80       200





### Discussion - strengths and weaknesses of the selected models for sentiment classification

#### Lexicon-Based Approach
Results Overview:
- Accuracy: 0.807

Strengths:
- Simplicity and speed: straightforward to implement and fast to apply, as it doesn't require model training.
- High accuracy and effectiveness in identifying positive sentiments.

Weaknesses:
- Fixed Lexicon: Might not capture context or nuances well due to the fixed nature of sentiment scores assigned to words.
- Potential Bias: If the lexicon is not comprehensive or up-to-date, it may not accurately reflect the nuances of language used in reviews.

#### Naive Bayes
Results Overview:
- Accuracy: 0.82

Strengths:
- Simple and fast, good for a baseline model.
- High accuracy and effectiveness in identifying positive sentiments.

Weaknesses:
- Predicted Positive for every test review (precision and recall are 0.00 for Negative and Neutral), which suggests it struggles with class imbalance and more nuanced sentiment expressions.
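
A quick way to put these accuracies in context is to compare them with a majority-class baseline, that is, always predicting the most frequent label. The sketch below uses only the `support` and `f1-score` values from the Naive Bayes classification report above; the variable names are illustrative.

```python
# Test-set class counts from the classification report (support column)
support = {'Negative': 25, 'Neutral': 11, 'Positive': 164}

total = sum(support.values())
majority_label = max(support, key=support.get)
baseline_accuracy = support[majority_label] / total
print(majority_label, baseline_accuracy)  # Positive 0.82

# Macro-averaged F1 weights every class equally, so it exposes the imbalance.
nb_f1 = {'Negative': 0.00, 'Neutral': 0.00, 'Positive': 0.90}
macro_f1 = sum(nb_f1.values()) / len(nb_f1)
print(round(macro_f1, 2))  # 0.3
```

The baseline of 0.82 equals the Naive Bayes accuracy, and the SVM's 0.845 is 2.5 percentage points above it. The lexicon accuracy of 0.807 was computed on all 1000 rows rather than on the 200-row test split, so it is not directly comparable.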


#### SVM
Results Overview:
- Accuracy: 0.845

Strengths:
- Highest accuracy among the models.
- Shows capability in distinguishing between positive and negative sentiments to some extent.

Weaknesses:
- Requires more computational resources than Naive Bayes or the lexicon-based approach.
- Still lacks effectiveness in classifying neutral reviews.
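
One common response to class imbalance is to stratify the train/test split and to weight classes inversely to their frequency. The sketch below shows both, using scikit-learn's `stratify` argument and `class_weight='balanced'`. The toy reviews are invented for illustration only, and no claim is made about how much this would change the results above.

```python
from collections import Counter

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy, made-up data with the same kind of imbalance (mostly Positive)
texts = (["great taste love it"] * 12
         + ["terrible stale awful"] * 4
         + ["okay nothing special"] * 4)
labels = ["Positive"] * 12 + ["Negative"] * 4 + ["Neutral"] * 4

# stratify keeps the class proportions the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.5, random_state=42, stratify=labels)

# class_weight='balanced' penalizes mistakes on rare classes more heavily
model = make_pipeline(TfidfVectorizer(), LinearSVC(class_weight="balanced"))
model.fit(X_train, y_train)
predictions = model.predict(X_test)

print(Counter(y_test))
print(Counter(predictions))
```

Wrapping the vectorizer and classifier in a pipeline also keeps the TF-IDF vocabulary fitted on the training split only, as in the cell above.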


### Conclusion:
- Because hardware limitations restricted our analysis to the first 1000 rows of this dataset, we consider SVM the most suitable choice, given its higher overall accuracy and its ability to classify negative reviews better than the other models.