# About the dataset

Context

This is a small subset of a dataset of book reviews from the Amazon Kindle Store category.

Content

5-core dataset of product reviews from the Amazon Kindle Store category, covering May 1996 to July 2014. It contains a total of 982,619 entries. Each reviewer has at least 5 reviews and each product has at least 5 reviews in this dataset.

Columns

- asin - ID of the product, like B000FA64PK.
- helpful - helpfulness rating of the review, e.g. 2/3 (stored as `[2, 3]`).
- overall - rating of the product (this column appears as `rating` in the CSV loaded below).
- reviewText - text of the review (body).
- reviewTime - time of the review (raw).
- reviewerID - ID of the reviewer, like A3SPTOKDG7WBLN.
- reviewerName - name of the reviewer.
- summary - summary of the review (headline).
- unixReviewTime - unix timestamp.

Which file to use?

There are two files: one is preprocessed and ready for sentiment analysis, and the other is unprocessed, so you have to process the dataset yourself before performing sentiment analysis.

Inspiration
- Sentiment analysis on reviews.
- Understanding how people rate the usefulness of a review, i.e. which factors influence a review's helpfulness.
- Fake reviews/outliers.
- Best rated product IDs, or similarity between products based on reviews alone (not the best idea ikr).
- Any other interesting analysis.

## Best practices
1. Preprocessing and cleaning (feature engineering)
2. Train/test split
3. BoW, TF-IDF, Word2Vec
4. Train an ML algorithm

In [46]:
import pandas as pd
df=pd.read_csv("all_kindle_review .csv")
df.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,asin,helpful,rating,reviewText,reviewTime,reviewerID,reviewerName,summary,unixReviewTime
0,0,11539,B0033UV8HI,"[8, 10]",3,"Jace Rankin may be short, but he's nothing to ...","09 2, 2010",A3HHXRELK8BHQG,Ridley,Entertaining But Average,1283385600
1,1,5957,B002HJV4DE,"[1, 1]",5,Great short read. I didn't want to put it dow...,"10 8, 2013",A2RGNZ0TRF578I,Holly Butler,Terrific menage scenes!,1381190400
2,2,9146,B002ZG96I4,"[0, 0]",3,I'll start by saying this is the first of four...,"04 11, 2014",A3S0H2HV6U1I7F,Merissa,Snapdragon Alley,1397174400
3,3,7038,B002QHWOEU,"[1, 3]",3,Aggie is Angela Lansbury who carries pocketboo...,"07 5, 2014",AC4OQW3GZ919J,Cleargrace,very light murder cozy,1404518400
4,4,1776,B001A06VJ8,"[0, 1]",4,I did not expect this type of book to be in li...,"12 31, 2012",A3C9V987IQHOQD,Rjostler,Book,1356912000


In [47]:
# .copy() avoids pandas' SettingWithCopyWarning when these columns are modified below
df=df[['reviewText','rating']].copy()

In [48]:
df.head()

Unnamed: 0,reviewText,rating
0,"Jace Rankin may be short, but he's nothing to ...",3
1,Great short read. I didn't want to put it dow...,5
2,I'll start by saying this is the first of four...,3
3,Aggie is Angela Lansbury who carries pocketboo...,3
4,I did not expect this type of book to be in li...,4


In [49]:
df.shape

(12000, 2)

In [50]:
##Missing values
df.isnull().sum()

reviewText    0
rating        0
dtype: int64

In [51]:
df['rating'].describe()

count    12000.000000
mean         3.250000
std          1.421619
min          1.000000
25%          2.000000
50%          3.500000
75%          4.250000
max          5.000000
Name: rating, dtype: float64

In [52]:
df['rating'].unique()

array([3, 5, 4, 2, 1], dtype=int64)

In [53]:
df['rating'].value_counts()

rating
5    3000
4    3000
3    2000
2    2000
1    2000
Name: count, dtype: int64

### Preprocessing and Cleaning

In [54]:
## Positive = 1 (rating >= 3), negative = 0 (rating < 3)
df['rating']=df['rating'].apply(lambda x:0 if x<3 else 1 )

In [55]:
df['rating'].value_counts()

rating
1    8000
0    4000
Name: count, dtype: int64
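
Because the mapping above sends 3-star reviews to the positive class, the two classes are not balanced. A quick sketch, using only the per-rating counts printed earlier, shows the resulting split and the accuracy a trivial majority-class predictor would reach. The helper name `to_label` is illustrative and not part of the notebook.

```python
from collections import Counter

# Per-rating counts as printed by df['rating'].value_counts() earlier.
rating_counts = {5: 3000, 4: 3000, 3: 2000, 2: 2000, 1: 2000}

def to_label(rating):
    """Same rule as the notebook: ratings below 3 are negative (0), the rest positive (1)."""
    return 0 if rating < 3 else 1

label_counts = Counter()
for rating, count in rating_counts.items():
    label_counts[to_label(rating)] += count

total = sum(label_counts.values())
majority_baseline = max(label_counts.values()) / total

print(dict(label_counts))           # class sizes after binarization
print(round(majority_baseline, 3))  # accuracy of always predicting the larger class
```

This reproduces the 8,000 positive and 4,000 negative reviews shown above, so always predicting the positive class would already score about two thirds accuracy on the full dataset. Any classifier trained later is worth reading against that baseline.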

In [56]:
## 1 - Lowercase all text
df['reviewText']=df['reviewText'].str.lower()

In [57]:
df.head()

Unnamed: 0,reviewText,rating
0,"jace rankin may be short, but he's nothing to ...",1
1,great short read. i didn't want to put it dow...,1
2,i'll start by saying this is the first of four...,1
3,aggie is angela lansbury who carries pocketboo...,1
4,i did not expect this type of book to be in li...,1


#### Removing special Characters

In [58]:
from bs4 import BeautifulSoup
import re
from nltk.corpus import stopwords
def clean_text(raw_text):
    """
    Cleans raw text by performing the following steps in order:
    1. Removes HTML tags.
    2. Removes URLs.
    3. Removes special characters and numbers, keeping only letters.
    4. Converts text to lowercase.
    5. Removes stopwords.
    6. Removes additional whitespace.

    Args:
        raw_text (str): The original text string.

    Returns:
        str: The cleaned text as a single string.
    """
    # 1. Remove HTML tags using BeautifulSoup
    text = BeautifulSoup(raw_text, "html.parser").get_text()

    # 2. Remove URLs
    # This regex looks for http/https/www patterns.
    text = re.sub(r'https?://\S+|www\.\S+', '', text)

    # 3. Remove special characters and numbers
    # This keeps only alphabetic characters and spaces.
    text = re.sub(r'[^a-zA-Z\s]', '', text)

    # 4. Convert to lowercase and split into words (tokenization)
    words = text.lower().split()

    # 5. Remove stopwords
    # A set gives fast membership checks. It is rebuilt on every call here;
    # hoist it to module level if cleaning speed matters.
    stop_words = set(stopwords.words('english'))
    meaningful_words = [w for w in words if w not in stop_words]

    # 6. Join the words back into a single string with single spaces
    # and remove any leading/trailing whitespace.
    cleaned_text = " ".join(meaningful_words).strip()

    return cleaned_text
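
One side effect of the step order above: punctuation is stripped before stopwords are filtered, so a contraction such as `didn't` becomes `didnt`, which no longer matches the stopword entry `didn't` (tokens like `didnt` and `wasnt` are visible in the cleaned output further down). The sketch below illustrates the effect with a tiny hand-written stopword set standing in for NLTK's list. `TOY_STOPWORDS` and both function names are assumptions made for this example.

```python
import re

# Tiny stand-in for NLTK's English stopword list (assumption for this sketch).
TOY_STOPWORDS = {"i", "to", "it", "didn't", "wasn't"}

def strip_then_filter(text):
    """Order used in clean_text: drop non-letters first, then filter stopwords."""
    letters_only = re.sub(r"[^a-zA-Z\s]", "", text)
    return [w for w in letters_only.lower().split() if w not in TOY_STOPWORDS]

def filter_then_strip(text):
    """Alternative order: filter stopwords while apostrophes are still present."""
    kept = [w for w in text.lower().split() if w.strip(".,!?") not in TOY_STOPWORDS]
    cleaned = (re.sub(r"[^a-z]", "", w) for w in kept)
    return [w for w in cleaned if w]

sample = "I didn't want to put it down."
print(strip_then_filter(sample))  # the contraction survives as 'didnt'
print(filter_then_strip(sample))  # the contraction is removed as a stopword
```

Whether negations like `didn't` should be dropped at all is a separate question for sentiment analysis, since they can flip the meaning of a review; the point here is only that the step order decides which tokens the stopword filter can see.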


In [59]:
df['reviewText'] = df['reviewText'].apply(clean_text)


In [61]:
df.head()

Unnamed: 0,reviewText,rating
0,jace rankin may short hes nothing mess man hau...,1
1,great short read didnt want put read one sitti...,1
2,ill start saying first four books wasnt expect...,1
3,aggie angela lansbury carries pocketbooks inst...,1
4,expect type book library pleased find price right,1


In [62]:
### Lemmatizer (WordNetLemmatizer treats every token as a noun unless a POS tag is passed)
from nltk.stem import WordNetLemmatizer
wl=WordNetLemmatizer()

In [63]:
def lemmatize_words(text):
    return " ".join([wl.lemmatize(word) for word in text.split()])

In [64]:
df['reviewText']=df['reviewText'].apply(lemmatize_words)

In [65]:
df.head()

Unnamed: 0,reviewText,rating
0,jace rankin may short he nothing mess man haul...,1
1,great short read didnt want put read one sitti...,1
2,ill start saying first four book wasnt expecti...,1
3,aggie angela lansbury carry pocketbook instead...,1
4,expect type book library pleased find price right,1


### Train test Split

In [66]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test= train_test_split(df['reviewText'],df['rating'],test_size=.20)
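
The split above is unseeded, so the scores reported later will shift slightly from run to run. A hedged variant, assuming reproducibility and equal class proportions in both parts are wanted, passes `random_state` and `stratify`. The toy `texts` and `labels` below stand in for `df['reviewText']` and `df['rating']`, and the seed value 42 is arbitrary.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for df['reviewText'] and df['rating'] (2:1 positive/negative, like the data).
texts = [f"review {i}" for i in range(30)]
labels = [1] * 20 + [0] * 10

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels,
    test_size=0.20,
    random_state=42,   # fixed seed: same split on every run
    stratify=labels,   # keep the class ratio identical in train and test
)

print(len(X_tr), len(X_te))
print(sum(y_te), len(y_te) - sum(y_te))  # positives, negatives in the test part
```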

### Model training

In [70]:
from sklearn.feature_extraction.text import CountVectorizer
bow=CountVectorizer()

In [82]:
X_train_bow= bow.fit_transform(X_train).toarray()
X_test_bow=bow.transform(X_test).toarray()

In [76]:
from sklearn.feature_extraction.text import TfidfVectorizer
tf=TfidfVectorizer()

In [83]:
X_train_tf= tf.fit_transform(X_train).toarray()
X_test_tf=tf.transform(X_test).toarray()

In [84]:
X_test_bow

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [100]:
from sklearn.naive_bayes import GaussianNB
nb_model_bow=GaussianNB().fit(X_train_bow,y_train)
nb_model_tf = GaussianNB().fit(X_train_tf,y_train)
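
`GaussianNB` needs dense input, which is why the feature matrices above are expanded with `.toarray()`. For word counts, a commonly used alternative is `MultinomialNB`, which accepts the sparse matrices that `CountVectorizer` returns directly. The following is a minimal sketch on a made-up four-review corpus, not a re-run of the notebook's experiment.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training data, for illustration only.
train_texts = [
    "great read loved it",
    "loved every page",
    "boring waste of time",
    "terrible boring plot",
]
train_labels = [1, 1, 0, 0]

# The pipeline fits the vocabulary on the training text only and keeps features sparse.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["loved it", "boring plot"]))
```

Wrapping the vectorizer and classifier in one pipeline also means new text can be passed to `predict` as raw strings, without a separate `transform` call.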

In [108]:
from sklearn.metrics import confusion_matrix,accuracy_score,classification_report
y_pred_bow=nb_model_bow.predict(X_test_bow)
# scikit-learn metrics take (y_true, y_pred), in that order
print(confusion_matrix(y_test,y_pred_bow))
print(classification_report(y_test,y_pred_bow))
print("Bow accuracy: ",accuracy_score(y_test,y_pred_bow))

[[527 292]
 [724 857]]
              precision    recall  f1-score   support

           0       0.42      0.64      0.51       819
           1       0.75      0.54      0.63      1581

    accuracy                           0.58      2400
   macro avg       0.58      0.59      0.57      2400
weighted avg       0.64      0.58      0.59      2400

Bow accuracy:  0.5766666666666667
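
As a check on how such a report is derived, the sketch below recomputes the positive-class metrics from the four BoW cell counts: 857 reviews predicted positive and actually positive, 292 predicted positive but actually negative, 724 predicted negative but actually positive, and 527 predicted and actually negative. It assumes the standard definitions with the true labels as the reference; note that scikit-learn's metric functions take the true labels as their first argument, and passing the predictions first makes precision and recall trade places. Variable names are illustrative.

```python
# Cell counts from the BoW evaluation, named by (predicted, actual) class.
pred_pos_actual_pos = 857   # true positives
pred_pos_actual_neg = 292   # false positives
pred_neg_actual_pos = 724   # false negatives
pred_neg_actual_neg = 527   # true negatives

total = (pred_pos_actual_pos + pred_pos_actual_neg
         + pred_neg_actual_pos + pred_neg_actual_neg)
actual_pos = pred_pos_actual_pos + pred_neg_actual_pos

accuracy = (pred_pos_actual_pos + pred_neg_actual_neg) / total
precision_pos = pred_pos_actual_pos / (pred_pos_actual_pos + pred_pos_actual_neg)
recall_pos = pred_pos_actual_pos / actual_pos
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

# Accuracy of always predicting "positive" on this test split.
always_positive = actual_pos / total

print(f"accuracy={accuracy:.2f} precision={precision_pos:.2f} "
      f"recall={recall_pos:.2f} f1={f1_pos:.2f}")
print(f"always-positive baseline={always_positive:.2f}")
```

The last line is worth a look: 1,581 of the 2,400 test reviews are positive, so this model's accuracy sits below what a constant prediction would achieve.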


In [124]:
from sklearn.metrics import confusion_matrix,accuracy_score,classification_report
y_pred_tf=nb_model_tf.predict(X_test_tf)
# scikit-learn metrics take (y_true, y_pred), in that order
print(confusion_matrix(y_test,y_pred_tf))
print(classification_report(y_test,y_pred_tf))
print("TF-IDF accuracy: ",accuracy_score(y_test,y_pred_tf))

[[511 308]
 [702 879]]
              precision    recall  f1-score   support

           0       0.42      0.62      0.50       819
           1       0.74      0.56      0.64      1581

    accuracy                           0.58      2400
   macro avg       0.58      0.59      0.57      2400
weighted avg       0.63      0.58      0.59      2400

TF-IDF accuracy:  0.5791666666666667


#### Word2Vec

In [111]:
import gensim
from gensim.models import Word2Vec

In [112]:
X_train_tokenized = X_train.apply(lambda text: text.split())
X_test_tokenized = X_test.apply(lambda text: text.split())

In [115]:
X_test_tokenized.iloc[0]

['story',
 'great',
 'concept',
 'magic',
 'steamy',
 'moment',
 'ill',
 'find',
 'book',
 'two',
 'see',
 'happens',
 'next']

In [116]:
# Trained on the training reviews only, with gensim's defaults (100-dimensional vectors)
nb_w2v=Word2Vec(sentences=X_train_tokenized)

In [117]:
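import numpy as np  # used inside the function; imported here so the cell does not rely on a later one

def get_average_vector(token_list, model, vector_size):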
    # Create an empty vector of the right size, filled with zeros
    avg_vector = np.zeros((vector_size,), dtype="float32")
    num_words_in_model = 0
    
    # For each word in the review, if it exists in our trained model, add its vector
    for word in token_list:
        if word in model.wv:
            avg_vector = np.add(avg_vector, model.wv[word])
            num_words_in_model += 1
            
    # If the review had words that were in our model, divide by the number of words to get the average
    if num_words_in_model > 0:
        avg_vector = np.divide(avg_vector, num_words_in_model)
        
    return avg_vector
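
To make the averaging step concrete, here is a small sketch with a hypothetical two-dimensional embedding table in place of the trained `Word2Vec` model. Words missing from the table are skipped, and a review with no known words falls back to the zero vector, mirroring the function above. The names `toy_vectors` and `average_vector` are assumptions for this example.

```python
import numpy as np

# Hypothetical 2-d embeddings standing in for model.wv (assumption for this sketch).
toy_vectors = {
    "great": np.array([1.0, 0.0], dtype="float32"),
    "book": np.array([0.0, 1.0], dtype="float32"),
}

def average_vector(tokens, vectors, vector_size):
    """Mean of the known tokens' vectors; zeros if none of the tokens are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(vector_size, dtype="float32")
    return np.mean(known, axis=0)

print(average_vector(["great", "book", "unseenword"], toy_vectors, 2))
print(average_vector(["unseenword"], toy_vectors, 2))
```

Averaging discards word order and gives every known word equal weight, which is a simplification; it does keep each review at a fixed length regardless of how many words it contains.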

In [120]:
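import numpy as np
# Read the dimensionality from the trained model instead of hard-coding 100
X_train_vectors = np.array([get_average_vector(tokens, nb_w2v, nb_w2v.wv.vector_size) for tokens in X_train_tokenized])
X_test_vectors = np.array([get_average_vector(tokens, nb_w2v, nb_w2v.wv.vector_size) for tokens in X_test_tokenized])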

In [121]:
nb_model_w2v = GaussianNB()
nb_model_w2v.fit(X_train_vectors, y_train)

In [123]:
from sklearn.metrics import confusion_matrix,accuracy_score,classification_report
y_pred_wv=nb_model_w2v.predict(X_test_vectors)
# scikit-learn metrics take (y_true, y_pred), in that order
print(confusion_matrix(y_test,y_pred_wv))
print(classification_report(y_test,y_pred_wv))
print("Word2Vec accuracy: ",accuracy_score(y_test,y_pred_wv))

[[ 599  220]
 [ 525 1056]]
              precision    recall  f1-score   support

           0       0.53      0.73      0.62       819
           1       0.83      0.67      0.74      1581

    accuracy                           0.69      2400
   macro avg       0.68      0.70      0.68      2400
weighted avg       0.73      0.69      0.70      2400

Word2Vec accuracy:  0.6895833333333333


In [135]:
from sklearn.linear_model import LogisticRegression
lg_model_w2v=LogisticRegression().fit(X_train_vectors,y_train)

In [132]:
from sklearn.metrics import confusion_matrix,accuracy_score,classification_report
y_pred_lb=lg_model_w2v.predict(X_test_vectors)
# scikit-learn metrics take (y_true, y_pred), in that order
print(confusion_matrix(y_test,y_pred_lb))
print(classification_report(y_test,y_pred_lb))
print("Word2Vec + LogisticRegression accuracy: ",accuracy_score(y_test,y_pred_lb))

[[ 417  402]
 [ 182 1399]]
              precision    recall  f1-score   support

           0       0.70      0.51      0.59       819
           1       0.78      0.88      0.83      1581

    accuracy                           0.76      2400
   macro avg       0.74      0.70      0.71      2400
weighted avg       0.75      0.76      0.75      2400

Word2Vec + LogisticRegression accuracy:  0.7566666666666667


In [133]:
from sklearn.svm import SVC
svc_model_w2c=SVC().fit(X_train_vectors,y_train)

In [134]:
from sklearn.metrics import confusion_matrix,accuracy_score,classification_report
y_pred_svc=svc_model_w2c.predict(X_test_vectors)
# scikit-learn metrics take (y_true, y_pred), in that order
print(confusion_matrix(y_test,y_pred_svc))
print(classification_report(y_test,y_pred_svc))
print("Word2Vec + SVC accuracy: ",accuracy_score(y_test,y_pred_svc))

[[ 389  430]
 [ 173 1408]]
              precision    recall  f1-score   support

           0       0.69      0.47      0.56       819
           1       0.77      0.89      0.82      1581

    accuracy                           0.75      2400
   macro avg       0.73      0.68      0.69      2400
weighted avg       0.74      0.75      0.73      2400

Word2Vec + SVC accuracy:  0.74875


In [136]:

import pickle

# --- Save the Word2Vec Model ---
with open('w2v_model.pkl', 'wb') as f:
    pickle.dump(nb_w2v, f)

# --- Save the Logistic Regression Classifier ---
with open('lr_classifier.pkl', 'wb') as f:
    pickle.dump(lg_model_w2v, f)

print("Models saved successfully!")

Models saved successfully!
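
For completeness, a saved classifier can be read back with `pickle.load` and should then give the same predictions as before saving. The sketch below shows that round trip on a toy logistic regression. The two-dimensional `X_toy` vectors are made up and stand in for the averaged Word2Vec features, and the file is written to a temporary directory rather than the notebook's working directory.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy 2-d "review vectors" standing in for the averaged Word2Vec features.
X_toy = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
y_toy = np.array([1, 1, 0, 0])
clf = LogisticRegression().fit(X_toy, y_toy)

# Save and reload the classifier the same way the notebook saves it.
path = os.path.join(tempfile.mkdtemp(), "lr_classifier.pkl")
with open(path, "wb") as f:
    pickle.dump(clf, f)
with open(path, "rb") as f:
    restored = pickle.load(f)

# The restored model should reproduce the original predictions exactly.
print(restored.predict(X_toy))
```

Pickle files should only be loaded from trusted sources, and they generally expect compatible library versions at load time. For the Word2Vec model, gensim also offers its own `save` and `load` methods as an alternative to pickle.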
