Project Setup

I downloaded the Amazon Reviews dataset (568,454 reviews) from Kaggle and saved it on my local storage.

Since training a model on this much data is slow on a local machine, I am using Google Colab for faster computation.

The dataset CSV file was uploaded to Google Drive, and then imported into the Colab notebook for processing and training.

In [15]:
from google.colab import drive
import pandas as pd

# Mount Google Drive
drive.mount('/content/drive')

# Path to CSV in your Drive
file_path = '/content/drive/MyDrive/Colab Notebooks/Reviews.csv'

# Read CSV
df = pd.read_csv(file_path)

df.head()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...


In [16]:
df.shape

(568454, 10)

In [17]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 568454 entries, 0 to 568453
Data columns (total 10 columns):
 #   Column                  Non-Null Count   Dtype 
---  ------                  --------------   ----- 
 0   Id                      568454 non-null  int64 
 1   ProductId               568454 non-null  object
 2   UserId                  568454 non-null  object
 3   ProfileName             568428 non-null  object
 4   HelpfulnessNumerator    568454 non-null  int64 
 5   HelpfulnessDenominator  568454 non-null  int64 
 6   Score                   568454 non-null  int64 
 7   Time                    568454 non-null  int64 
 8   Summary                 568427 non-null  object
 9   Text                    568454 non-null  object
dtypes: int64(5), object(5)
memory usage: 43.4+ MB
None


Selecting Relevant Columns

For building the model, I only need the Text (review content) and Score (label/target) columns.
Using the code df = df[['Text', 'Score']], I keep only these two columns and drop the rest to keep the dataset clean and reduce memory usage.

In [18]:
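# Keep only the review text and its score; .copy() avoids a SettingWithCopyWarning when columns are added later
df = df[['Text', 'Score']].copy()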
df.head()

Unnamed: 0,Text,Score
0,I have bought several of the Vitality canned d...,5
1,Product arrived labeled as Jumbo Salted Peanut...,1
2,This is a confection that has been around a fe...,4
3,If you are looking for the secret ingredient i...,2
4,Great taffy at a great price. There was a wid...,5
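
As an aside, the same selection could be done at load time, so the unused columns are never parsed into memory. The sketch below is only an illustration: it assumes a tiny in-memory CSV with made-up rows as a stand-in for Reviews.csv, and uses the usecols parameter of pandas.read_csv.

```python
import io
import pandas as pd

# Hypothetical stand-in for Reviews.csv: similar columns, made-up rows
csv_text = io.StringIO(
    "Id,ProductId,Score,Summary,Text\n"
    "1,P1,5,Great,Loved it\n"
    "2,P2,1,Bad,Would not buy again\n"
)

# usecols limits parsing to the listed columns; the trailing selection
# fixes the column order, since read_csv keeps the order found in the file
small_df = pd.read_csv(csv_text, usecols=['Text', 'Score'])[['Text', 'Score']]
print(small_df)
```

On the real file, the StringIO object would be replaced by the path to the CSV.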


In [19]:
df.Score.value_counts()

Unnamed: 0_level_0,count
Score,Unnamed: 1_level_1
5,363122
4,80655
1,52268
3,42640
2,29769


Creating Sentiment Labels

To prepare the dataset for sentiment analysis, I map the review Score to three categories, each stored as an integer code:

1 and 2 → Negative (code 2)

3 → Neutral (code 1)

4 and 5 → Positive (code 0)

A new column, Sentiment, is added to the dataframe to store these labels.

In [20]:
df['Sentiment'] = df['Score'].apply(lambda x: 2 if x in [1, 2] else (1 if x == 3 else 0))
df.head()

Unnamed: 0,Text,Score,Sentiment
0,I have bought several of the Vitality canned d...,5,0
1,Product arrived labeled as Jumbo Salted Peanut...,1,2
2,This is a confection that has been around a fe...,4,0
3,If you are looking for the secret ingredient i...,2,2
4,Great taffy at a great price. There was a wid...,5,0
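
The same mapping can also be written as a lookup table, which some readers may find easier to audit than a nested lambda. This is a sketch on toy scores; SCORE_TO_SENTIMENT and SENTIMENT_NAMES are hypothetical names introduced here, and the integer codes are the ones used in the cell above.

```python
import pandas as pd

# Integer codes follow the notebook: 0 = positive, 1 = neutral, 2 = negative
SCORE_TO_SENTIMENT = {1: 2, 2: 2, 3: 1, 4: 0, 5: 0}
# Hypothetical helper for reading results; not part of the original notebook
SENTIMENT_NAMES = {0: 'positive', 1: 'neutral', 2: 'negative'}

toy = pd.DataFrame({'Score': [5, 1, 4, 2, 3]})
toy['Sentiment'] = toy['Score'].map(SCORE_TO_SENTIMENT)
toy['SentimentName'] = toy['Sentiment'].map(SENTIMENT_NAMES)
print(toy)
```

One difference worth knowing: Series.map yields NaN for any score missing from the dictionary, whereas the lambda would silently label it as positive.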


In [23]:
df=df[['Text', 'Sentiment']]
df.head()

Unnamed: 0,Text,Sentiment
0,I have bought several of the Vitality canned d...,0
1,Product arrived labeled as Jumbo Salted Peanut...,2
2,This is a confection that has been around a fe...,0
3,If you are looking for the secret ingredient i...,2
4,Great taffy at a great price. There was a wid...,0


In [24]:
df.Sentiment.value_counts()

Unnamed: 0_level_0,count
Sentiment,Unnamed: 1_level_1
0,443777
2,82037
1,42640


Handling Class Imbalance

The dataset is heavily imbalanced: 443,777 positive reviews versus 82,037 negative and 42,640 neutral. To balance the classes and keep training manageable on the free tier of Google Colab, I will undersample every class to the size of the smallest one, which gives a subset of roughly 125k reviews.

Undersampling is generally not the best approach because it discards data, but it lets us train the model quickly and demonstrate the workflow in a limited-resource environment.

In [29]:
min_count=min(df.Sentiment.value_counts())
min_count

42640

In [30]:
125/3

41.666666666666664

In [31]:
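# Undersample every class to the size of the smallest one (min_count)
# Note: no random_state is set here, so the sampled rows differ between runs
df_positive = df[df['Sentiment'] == 0].sample(min_count)
df_neutral = df[df['Sentiment'] == 1].sample(min_count)
df_negative = df[df['Sentiment'] == 2].sample(min_count)
df_balanced = pd.concat([df_positive, df_neutral, df_negative])
df_balanced.Sentiment.value_counts()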

Unnamed: 0_level_0,count
Sentiment,Unnamed: 1_level_1
0,42640
1,42640
2,42640
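
A more compact way to undersample is to sample within each group. The sketch below runs on made-up toy data and assumes pandas 1.1 or newer (for groupby(...).sample); the random_state value is my own addition for repeatability and is not used in the cell above.

```python
import pandas as pd

# Toy imbalanced data: 6 positive (0), 2 neutral (1), 3 negative (2)
toy = pd.DataFrame({
    'Text': [f'review {i}' for i in range(11)],
    'Sentiment': [0] * 6 + [1] * 2 + [2] * 3,
})

# Size of the smallest class
min_count = toy['Sentiment'].value_counts().min()

# Draw min_count rows from each class; random_state makes the draw repeatable
balanced = toy.groupby('Sentiment').sample(n=min_count, random_state=9)
print(balanced['Sentiment'].value_counts())
```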


In [32]:
df_balanced.shape

(127920, 2)

In [33]:
df_balanced.head()

Unnamed: 0,Text,Sentiment
219348,I don't even know how I came to try this delic...,0
250573,If you haven't tried these then you're missing...,0
431090,Excellent product and a generous amount for th...,0
410122,My husband has a serious thing for dates since...,0
289785,One box has 3 packages inside it. Using about ...,0


In [34]:
df_balanced.isnull().sum()

Unnamed: 0,0
Text,0
Sentiment,0


I will preprocess the Text column and create a new column called preprocessed_Text.
For preprocessing, I am using the spaCy library to:

1. Tokenize the text

2. Remove stop words

3. Remove punctuation

4. Perform lemmatization

In [39]:
import spacy
# Load model once
nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser'])

# Convert column to list
texts = df_balanced['Text'].tolist()

# Preprocess using nlp.pipe for faster batch processing
preprocessed_texts = []
for doc in nlp.pipe(texts, batch_size=1000):
    tokens = [token.lemma_ for token in doc if not token.is_stop and not token.is_punct]
    preprocessed_texts.append(' '.join(tokens))

# Assign back to dataframe
df_balanced['preprocessed_Text'] = preprocessed_texts

In [40]:
df_balanced.head()

Unnamed: 0,Text,Sentiment,preprocessed_Text
219348,I don't even know how I came to try this delic...,0,know come try delicious beverage good course r...
250573,If you haven't tried these then you're missing...,0,try miss chocolate cover cookie truly melt m...
431090,Excellent product and a generous amount for th...,0,excellent product generous price help winter d...
410122,My husband has a serious thing for dates since...,0,husband thing date buy package Caramel Natural...
289785,One box has 3 packages inside it. Using about ...,0,box 3 package inside teaspoon 20 oz miso soup ...


Experiment Strategy

Start with classical ML models using TF-IDF or CountVectorizer on a subset of the dataset (around 20k–50k reviews) to quickly test different approaches.

Evaluate and compare models using metrics such as accuracy, F1-score, precision, and recall.

Once the best combination of vectorizer and model is identified, scale the training to the full dataset.

Optionally, experiment with word embeddings and neural network models to see if performance improves.

Finally, if resources allow, fine-tune a BERT model for state-of-the-art performance.
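
The first three steps of this strategy could be organised as a single comparison loop. The sketch below is illustrative only: the twelve toy reviews are made up, and 2-fold cross-validation with macro F1 is my assumption for keeping the example small (the notebook itself uses a single train/test split and classification_report).

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Made-up, already "preprocessed" toy reviews: 4 per class
texts = [
    'great taste love', 'excellent product happy', 'delicious good buy', 'love great price',
    'okay average taste', 'fine nothing special', 'average product okay', 'decent fine price',
    'terrible taste waste', 'bad product return', 'awful stale waste', 'bad terrible price',
]
labels = [0] * 4 + [1] * 4 + [2] * 4

vectorizers = {'count': CountVectorizer, 'tfidf': TfidfVectorizer}
classifiers = {'nb': MultinomialNB, 'svc': LinearSVC}

results = {}
for vec_name, vec_cls in vectorizers.items():
    for clf_name, clf_cls in classifiers.items():
        pipe = Pipeline([('vectorizer', vec_cls()), ('classifier', clf_cls())])
        # 2-fold cross-validation, scored with macro-averaged F1
        scores = cross_val_score(pipe, texts, labels, cv=2, scoring='f1_macro')
        results[f'{vec_name}+{clf_name}'] = scores.mean()

# Print the combinations from best to worst
for name, score in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f'{name}: {score:.2f}')
```

The scores on such a tiny corpus mean nothing by themselves; the point is only the shape of the loop.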

In [43]:
# Build a balanced 30k subset (10k reviews per class) for quick experiments
df_pos=df_balanced[df_balanced['Sentiment']==0].sample(10000, random_state=9)
df_neu=df_balanced[df_balanced['Sentiment']==1].sample(10000, random_state=9)
df_neg=df_balanced[df_balanced['Sentiment']==2].sample(10000, random_state=9)

df_subset=pd.concat([df_pos,df_neu,df_neg])
df_subset.head()

Unnamed: 0,Text,Sentiment,preprocessed_Text
284433,Excellent coffee. We use the coffee each morn...,0,excellent coffee use coffee morning guest sm...
305982,"If you like canned spinach, Del Monte Whole Le...",0,like canned spinach Del Monte Leaf try brand c...
329343,"I was at Gilt (in Portland, OR) enjoying some ...",0,Gilt Portland enjoy delicious cocktail ask bar...
476020,I am 55 years old and i find this product a de...,0,55 year old find product desert favorite snack...
69894,"I found these chocolates at Cosco, and they we...",0,find chocolate Cosco sell $ 10 buy jar hand cl...


In [44]:
df_subset.shape

(30000, 3)

In [45]:
df_subset.Sentiment.value_counts()

Unnamed: 0_level_0,count
Sentiment,Unnamed: 1_level_1
0,10000
1,10000
2,10000


In [46]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df_subset['preprocessed_Text'],
    df_subset['Sentiment'],
    test_size=0.2,
    stratify=df_subset['Sentiment'],
    random_state=9
    )

In [47]:
len(X_train)

24000

In [48]:
len(X_test)

6000
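
To see what stratify does in the split above, here is a small check on toy data (30 made-up samples, 10 per class). The variable names are hypothetical and chosen so they do not overwrite the notebook's X_train and X_test.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy data: 30 samples, 10 per class
X = [f'text {i}' for i in range(30)]
y = [0] * 10 + [1] * 10 + [2] * 10

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=9
)

# With stratify=y, each class keeps the same share in both splits
print(Counter(y_tr))
print(Counter(y_te))
```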

In [50]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

#with one-gram multinomialNB
clf=Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', MultinomialNB())
])
clf.fit(X_train, y_train)
y_pred=clf.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.75      0.72      0.73      2000
           1       0.56      0.58      0.57      2000
           2       0.66      0.65      0.65      2000

    accuracy                           0.65      6000
   macro avg       0.65      0.65      0.65      6000
weighted avg       0.65      0.65      0.65      6000
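
For readers unfamiliar with the columns in this report, the sketch below shows how precision, recall and F1 follow from a confusion matrix. The labels are made up for illustration and are not taken from the model above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration (0 = positive, 1 = neutral, 2 = negative)
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
y_hat = [0, 0, 1, 1, 1, 2, 2, 2, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_hat, labels=[0, 1, 2])

precision = cm.diagonal() / cm.sum(axis=0)  # correct / all predicted as that class
recall = cm.diagonal() / cm.sum(axis=1)     # correct / all truly in that class
f1 = 2 * precision * recall / (precision + recall)

print(cm)
print(precision, recall, f1)
```

The off-diagonal entries of the matrix also show which pairs of classes get confused, which the report alone does not reveal.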



In [51]:
#with bi-gram multinomialNB
clf=Pipeline([
    ('vectorizer', CountVectorizer(ngram_range=(1,2))),
    ('classifier', MultinomialNB())
])
clf.fit(X_train, y_train)
y_pred=clf.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.81      0.67      0.73      2000
           1       0.56      0.70      0.62      2000
           2       0.71      0.66      0.68      2000

    accuracy                           0.68      6000
   macro avg       0.69      0.68      0.68      6000
weighted avg       0.69      0.68      0.68      6000



In [52]:
#with 3-gram multinomialNB
clf=Pipeline([
    ('vectorizer', CountVectorizer(ngram_range=(1,3))),
    ('classifier', MultinomialNB())
])
clf.fit(X_train, y_train)
y_pred=clf.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.82      0.64      0.72      2000
           1       0.56      0.72      0.63      2000
           2       0.71      0.65      0.68      2000

    accuracy                           0.67      6000
   macro avg       0.70      0.67      0.68      6000
weighted avg       0.70      0.67      0.68      6000



So far, MultinomialNB does best with bigrams (0.68 accuracy, versus 0.65 with unigrams and 0.67 with trigrams).

In [55]:
from sklearn.tree import DecisionTreeClassifier
#with 1-gram DecisionTreeClassifier
clf=Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', DecisionTreeClassifier())
])
clf.fit(X_train, y_train)
y_pred=clf.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.62      0.63      0.62      2000
           1       0.54      0.54      0.54      2000
           2       0.61      0.58      0.59      2000

    accuracy                           0.59      6000
   macro avg       0.59      0.59      0.59      6000
weighted avg       0.59      0.59      0.59      6000



In [56]:
from sklearn.tree import DecisionTreeClassifier
#with bi-gram DecisionTreeClassifier
clf=Pipeline([
    ('vectorizer', CountVectorizer(ngram_range=(1,2))),
    ('classifier', DecisionTreeClassifier())
])
clf.fit(X_train, y_train)
y_pred=clf.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.61      0.64      0.62      2000
           1       0.54      0.54      0.54      2000
           2       0.60      0.59      0.60      2000

    accuracy                           0.59      6000
   macro avg       0.59      0.59      0.59      6000
weighted avg       0.59      0.59      0.59      6000



In [60]:
from sklearn.svm import LinearSVC

clf = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range=(1,2))),
    ('classifier', LinearSVC())
])
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)



In [61]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.75      0.79      0.77      2000
           1       0.63      0.61      0.62      2000
           2       0.72      0.70      0.71      2000

    accuracy                           0.70      6000
   macro avg       0.70      0.70      0.70      6000
weighted avg       0.70      0.70      0.70      6000



In [62]:
from sklearn.neighbors import KNeighborsClassifier

# Bigram CountVectorizer + KNeighborsClassifier
clf=Pipeline([
    ('vectorizer', CountVectorizer(ngram_range=(1,2))),
    ('classifier', KNeighborsClassifier())
    ])
clf.fit(X_train, y_train)
y_pred=clf.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.48      0.57      0.52      2000
           1       0.43      0.39      0.41      2000
           2       0.50      0.46      0.48      2000

    accuracy                           0.47      6000
   macro avg       0.47      0.47      0.47      6000
weighted avg       0.47      0.47      0.47      6000



In [64]:
from sklearn.ensemble import RandomForestClassifier

# Unigram CountVectorizer (vocabulary capped at 5,000 features) + RandomForestClassifier
clf = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range=(1, 1), max_features=5000)),
    ('classifier', RandomForestClassifier())
])
clf.fit(X_train, y_train)
y_pred=clf.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.72      0.76      0.74      2000
           1       0.63      0.60      0.61      2000
           2       0.69      0.69      0.69      2000

    accuracy                           0.68      6000
   macro avg       0.68      0.68      0.68      6000
weighted avg       0.68      0.68      0.68      6000



Next, I repeat the experiments with TfidfVectorizer in place of CountVectorizer.
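
As a reminder of how the two vectorizers differ, the sketch below runs both on three made-up documents with default settings. CountVectorizer stores raw term counts, while TfidfVectorizer down-weights terms that occur in many documents and scales each row to unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up documents; the vocabulary is sorted: bitter, coffee, good, tea
docs = ['coffee good coffee', 'tea good', 'coffee bitter']

counts = CountVectorizer().fit_transform(docs).toarray()
tfidf = TfidfVectorizer().fit_transform(docs).toarray()

# Raw term counts: "coffee" appears twice in the first document
print(counts)
# TF-IDF weights: in the second document, the rarer "tea" outweighs "good"
print(np.round(tfidf, 2))
# Each row has unit L2 length (the default norm='l2')
row_norms = np.linalg.norm(tfidf, axis=1)
print(row_norms)
```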

In [65]:
from sklearn.feature_extraction.text import TfidfVectorizer
clf=Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('classifier', MultinomialNB())
])
clf.fit(X_train, y_train)
y_pred=clf.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.77      0.70      0.73      2000
           1       0.55      0.63      0.59      2000
           2       0.68      0.64      0.66      2000

    accuracy                           0.66      6000
   macro avg       0.67      0.66      0.66      6000
weighted avg       0.67      0.66      0.66      6000



In [66]:
clf=Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('classifier', DecisionTreeClassifier())
])
clf.fit(X_train, y_train)
y_pred=clf.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.60      0.60      0.60      2000
           1       0.52      0.54      0.53      2000
           2       0.58      0.56      0.57      2000

    accuracy                           0.57      6000
   macro avg       0.57      0.57      0.57      6000
weighted avg       0.57      0.57      0.57      6000



In [67]:
clf=Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('classifier', LinearSVC())
])
clf.fit(X_train, y_train)
y_pred=clf.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.74      0.76      0.75      2000
           1       0.60      0.58      0.59      2000
           2       0.69      0.69      0.69      2000

    accuracy                           0.68      6000
   macro avg       0.68      0.68      0.68      6000
weighted avg       0.68      0.68      0.68      6000



In [69]:
clf=Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('classifier', KNeighborsClassifier())
    ])
clf.fit(X_train, y_train)
y_pred=clf.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.50      0.60      0.55      2000
           1       0.44      0.49      0.47      2000
           2       0.55      0.37      0.45      2000

    accuracy                           0.49      6000
   macro avg       0.50      0.49      0.49      6000
weighted avg       0.50      0.49      0.49      6000



Of all the models tried, bigram MultinomialNB is among the strongest (0.68 accuracy, just behind the 0.70 of LinearSVC with bigram counts), so let's train it on the full balanced data and export the model.

In [70]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df_balanced['preprocessed_Text'],
    df_balanced['Sentiment'],
    test_size=0.2,
    stratify=df_balanced['Sentiment'],
    random_state=9
    )

In [71]:
#with bi-gram multinomialNB
clf=Pipeline([
    ('vectorizer', CountVectorizer(ngram_range=(1,2))),
    ('classifier', MultinomialNB())
])
clf.fit(X_train, y_train)
y_pred=clf.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.83      0.76      0.79      8528
           1       0.65      0.75      0.70      8528
           2       0.79      0.73      0.76      8528

    accuracy                           0.75     25584
   macro avg       0.76      0.75      0.75     25584
weighted avg       0.76      0.75      0.75     25584



In [72]:
# Trigram (ngram_range=(1,3)) CountVectorizer + MultinomialNB
clf=Pipeline([
    ('vectorizer', CountVectorizer(ngram_range=(1,3))),
    ('classifier', MultinomialNB())
])
clf.fit(X_train, y_train)
y_pred=clf.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.85      0.74      0.79      8528
           1       0.65      0.78      0.71      8528
           2       0.80      0.74      0.77      8528

    accuracy                           0.75     25584
   macro avg       0.77      0.75      0.76     25584
weighted avg       0.77      0.75      0.76     25584



In [74]:
import joblib

# Save the most recently fitted pipeline (the ngram_range=(1,3) model from the previous cell)
joblib.dump(clf, '/content/drive/MyDrive/Colab Notebooks/CountVectorizer_MultinomialNB_based.pkl')

['/content/drive/MyDrive/Colab Notebooks/CountVectorizer_MultinomialNB_based.pkl']
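
To use the saved model later, it can be loaded back with joblib.load. The sketch below is a self-contained illustration: it trains a toy pipeline on made-up data and writes it to a temporary file instead of touching the Drive path above. The label_names dictionary is a hypothetical helper based on the codes used in this notebook.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy pipeline standing in for the trained model (made-up training data)
toy_clf = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range=(1, 2))),
    ('classifier', MultinomialNB()),
])
toy_clf.fit(
    ['great taste love', 'okay average fine', 'terrible stale waste'],
    [0, 1, 2],
)

# Save to a temporary file, then load it back as a new object
model_path = os.path.join(tempfile.mkdtemp(), 'toy_model.pkl')
joblib.dump(toy_clf, model_path)
loaded_clf = joblib.load(model_path)

# Label codes from the notebook: 0 = positive, 1 = neutral, 2 = negative
label_names = {0: 'positive', 1: 'neutral', 2: 'negative'}

# New text should go through the same spaCy preprocessing before predict
new_texts = ['love great taste', 'stale waste']
predicted = [label_names[int(code)] for code in loaded_clf.predict(new_texts)]
print(predicted)
```

Because the pipeline was trained on lemmatized text with stop words removed, raw reviews passed to the real model would need the same preprocessing first. A pickled model should also be loaded with a scikit-learn version compatible with the one that saved it.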