# CISB5123 Text Analytics – Lab Assignment 2
### Sentiment Analysis on Amazon Fine Food Reviews
**Name:** AIZAT FARHAN BIN ABAS ADNI  
**Student ID:** [IS01083271]  

Perform sentiment classification on the Amazon Fine Food Reviews dataset using lexicon-based and machine learning-based approaches.
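
The lexicon-based side of this task can be sketched with TextBlob, which assigns each text a polarity score between -1 and 1. The snippet below is only a minimal illustration under stated assumptions: the helper names `polarity_to_label` and `textblob_label` and the 0.1 cut-off are hypothetical choices, not tuned values or part of the original notebook.

```python
def polarity_to_label(polarity, threshold=0.1):
    """Map a polarity score in [-1, 1] to a sentiment label.

    The 0.1 threshold is an illustrative assumption, not a tuned value.
    """
    if polarity > threshold:
        return 'positive'
    if polarity < -threshold:
        return 'negative'
    return 'neutral'


def textblob_label(text, threshold=0.1):
    """Label one review using TextBlob's lexicon-based polarity."""
    # Imported lazily so the threshold helper works without TextBlob installed
    from textblob import TextBlob
    return polarity_to_label(TextBlob(text).sentiment.polarity, threshold)
```

Applied to a sample of the data, for example `df['Text'].head(1000).apply(textblob_label)`, the resulting labels could be compared against the score-derived `Sentiment` column with the same `classification_report` used for the machine learning models.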

In [1]:
import pandas as pd

# Load the dataset
df = pd.read_csv('Reviews.csv')

# Keep only the review text and star rating; drop rows where either is missing
df = df[['Text', 'Score']].dropna()

In [3]:
# Map the sentiment categories based on Score
df['Sentiment'] = df['Score'].apply(lambda x: 'positive' if x > 3 else ('negative' if x < 3 else 'neutral'))

# Count how many positive, neutral, and negative
sentiment_counts = df['Sentiment'].value_counts()
print(sentiment_counts)

Sentiment
positive    443777
negative     82037
neutral      42640
Name: count, dtype: int64
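
The counts above are heavily skewed toward the positive class, so it helps to know what accuracy a trivial "always predict the majority class" rule would reach before judging any model. A small sketch of that calculation, using only the counts printed above (the variable names are illustrative):

```python
# Class counts copied from the value_counts() output above
counts = {'positive': 443777, 'negative': 82037, 'neutral': 42640}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)

# Accuracy of a rule that labels every review with the majority class
baseline_accuracy = counts[majority_class] / total

for label, n in counts.items():
    print(f"{label:<9}{n / total:.4f}")
print(f"Majority baseline ({majority_class}): {baseline_accuracy:.4f}")
```

The baseline comes out at roughly 78%, so an accuracy in the low 80s is a smaller improvement than it first appears.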


In [5]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from textblob import TextBlob
import warnings
warnings.filterwarnings('ignore')

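# Reload the dataset so this cell can run on its own
df = pd.read_csv('Reviews.csv')

# Keep only the text and score columns; drop rows where either is missing
df = df[['Text', 'Score']].dropna()

# Save the reduced two-column dataset for later reuse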
df.to_csv('amazonfinefoods_reviews.csv', index=False)
print("Reviews saved successfully to amazonfinefoods_reviews.csv")

Reviews saved successfully to amazonfinefoods_reviews.csv


In [12]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score
from sklearn.model_selection import train_test_split

# 2. Prepare data (if not already prepared)
df = df[['Text', 'Score']].dropna()
df['Sentiment'] = df['Score'].apply(lambda x: 'positive' if x > 3 else ('negative' if x < 3 else 'neutral'))

# Features (review text) and target (sentiment label)
X = df['Text']
y = df['Sentiment']

# 3. Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 4. Use CountVectorizer for BoW
bow_vectorizer = CountVectorizer(stop_words='english')
X_train_bow = bow_vectorizer.fit_transform(X_train)
X_test_bow = bow_vectorizer.transform(X_test)

# 5. Train a Naive Bayes model
model_bow = MultinomialNB()
model_bow.fit(X_train_bow, y_train)

# 6. Predict and evaluate
y_pred_bow = model_bow.predict(X_test_bow)
print("Accuracy:", accuracy_score(y_test, y_pred_bow))
print("Classification Report:\n", classification_report(y_test, y_pred_bow))

Accuracy: 0.8378939405933628
Classification Report:
               precision    recall  f1-score   support

    negative       0.66      0.65      0.65     16181
     neutral       0.36      0.31      0.33      8485
    positive       0.91      0.92      0.92     89025

    accuracy                           0.84    113691
   macro avg       0.64      0.63      0.63    113691
weighted avg       0.83      0.84      0.83    113691
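
The gap between the macro and weighted averages in the report comes from how each one treats class size. The sketch below recomputes both F1 averages from the per-class rows printed above. Because those inputs are already rounded to two decimals, the results agree with the report only approximately.

```python
# Per-class F1 and support copied from the classification report above
report = {
    'negative': {'f1': 0.65, 'support': 16181},
    'neutral':  {'f1': 0.33, 'support': 8485},
    'positive': {'f1': 0.92, 'support': 89025},
}

total_support = sum(row['support'] for row in report.values())

# Macro average: every class counts equally, regardless of size
macro_f1 = sum(row['f1'] for row in report.values()) / len(report)

# Weighted average: each class counts in proportion to its support
weighted_f1 = sum(row['f1'] * row['support'] for row in report.values()) / total_support

print(f"macro F1:    {macro_f1:.4f}")
print(f"weighted F1: {weighted_f1:.4f}")
```

The weighted figure is dominated by the positive class, which makes up most of the test set, while the macro figure exposes the weak neutral class.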



In [9]:
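# Compare model performance using a DataFrame
# Note: these values are entered by hand; the BoW + Naive Bayes row does not match
# the report printed above (accuracy 0.8379, macro F1 0.63)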
import pandas as pd

comparison_df = pd.DataFrame({
    "Model": ["BoW + Naive Bayes", "TF-IDF + Naive Bayes", "TF-IDF + Logistic Regression"],
    "Accuracy": [0.8150, 0.7750, 0.8195],
    "Precision (macro avg)": [0.6488, 0.2583, 0.6191],
    "Recall (macro avg)": [0.4499, 0.3333, 0.4477],
    "F1-score (macro avg)": [0.4732, 0.2911, 0.4661]
})

# Display table
print(comparison_df)

                          Model  Accuracy  Precision (macro avg)  \
0             BoW + Naive Bayes    0.8150                 0.6488   
1          TF-IDF + Naive Bayes    0.7750                 0.2583   
2  TF-IDF + Logistic Regression    0.8195                 0.6191   

   Recall (macro avg)  F1-score (macro avg)  
0              0.4499                0.4732  
1              0.3333                0.2911  
2              0.4477                0.4661  


## Discussion: Strengths and Weaknesses of Models

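1. **BoW + Naive Bayes** performed well overall, likely because raw word counts match the multinomial model that Naive Bayes assumes. It had the highest macro-averaged precision and F1 of the three models in the table, but its F1 is not balanced across classes: the classification report above shows 0.92 for positive, 0.65 for negative, and only 0.33 for neutral.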

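2. **TF-IDF + Naive Bayes** performed worst on every metric in the table. Its macro recall of 0.3333 is exactly one third, which suggests the model assigned almost every review to a single class, most likely the majority positive class. One possible cause is that TF-IDF weighting weakens the strong signal words that Naive Bayes depends on, leaving the class prior to dominate.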

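3. **TF-IDF + Logistic Regression** had the highest accuracy (0.8195), narrowly ahead of BoW + Naive Bayes (0.8150). Logistic regression learns a weight for each feature directly, so it copes well with sparse, real-valued inputs such as TF-IDF scores. Its macro F1 (0.4661) is nevertheless slightly below that of BoW + Naive Bayes (0.4732).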
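
One way to read the TF-IDF + Naive Bayes row of the comparison table is to ask what macro-averaged scores a classifier would get if it predicted the majority class for every review. The sketch below works through that arithmetic for three classes, taking the table's accuracy of 0.7750 as the majority-class share; the function name is hypothetical.

```python
def single_class_macro_metrics(majority_share, n_classes=3):
    """Macro precision, recall and F1 for a classifier that always predicts
    the majority class. `majority_share` is that class's share of the test
    set, which also equals the classifier's accuracy."""
    precision_major = majority_share  # every prediction carries the majority label
    recall_major = 1.0                # every majority example is recovered
    f1_major = 2 * precision_major * recall_major / (precision_major + recall_major)
    # The other classes are never predicted, so their metrics count as 0
    return {
        'precision': precision_major / n_classes,
        'recall': recall_major / n_classes,
        'f1': f1_major / n_classes,
    }


metrics = single_class_macro_metrics(0.7750)
print({name: round(value, 4) for name, value in metrics.items()})
# → {'precision': 0.2583, 'recall': 0.3333, 'f1': 0.2911}
```

These values coincide with the table's TF-IDF + Naive Bayes row, which is consistent with that model collapsing to a single predicted class.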

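**Conclusion:** On the figures in the table, logistic regression with TF-IDF gives the highest accuracy, while BoW + Naive Bayes gives the best macro-averaged F1 and remains a strong, simple baseline. No model handles the minority classes well, so macro-averaged metrics are a better guide than accuracy on this imbalanced dataset.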
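
The comparison table could also be produced directly from model predictions instead of typed-in numbers, which keeps it in step with the data split. The following is a rough sketch of that idea using scikit-learn pipelines. The function name `compare_models` and the `max_iter=1000` setting are assumptions for illustration, not the settings behind the figures reported above.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def compare_models(X_train, X_test, y_train, y_test):
    """Fit the three pipelines and return one row of metrics per model."""
    pipelines = {
        "BoW + Naive Bayes": make_pipeline(
            CountVectorizer(stop_words='english'), MultinomialNB()),
        "TF-IDF + Naive Bayes": make_pipeline(
            TfidfVectorizer(stop_words='english'), MultinomialNB()),
        "TF-IDF + Logistic Regression": make_pipeline(
            TfidfVectorizer(stop_words='english'),
            LogisticRegression(max_iter=1000)),  # max_iter is an assumed setting
    }
    rows = []
    for name, pipeline in pipelines.items():
        y_pred = pipeline.fit(X_train, y_train).predict(X_test)
        # Macro averages treat all three sentiment classes equally
        precision, recall, f1, _ = precision_recall_fscore_support(
            y_test, y_pred, average='macro', zero_division=0)
        rows.append({
            "Model": name,
            "Accuracy": accuracy_score(y_test, y_pred),
            "Precision (macro avg)": precision,
            "Recall (macro avg)": recall,
            "F1-score (macro avg)": f1,
        })
    return pd.DataFrame(rows)
```

With the split from the earlier cell, `comparison_df = compare_models(X_train, X_test, y_train, y_test)` would fill the same columns as the hand-written table.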
