## Dataset Overview: Kindle Book Reviews

The dataset contains **12,000** Kindle book reviews with the following columns:

| Column Name        | Description |
|--------------------|-------------|
| `Unnamed: 0.1`, `Unnamed: 0` | Index columns (likely redundant; can be dropped). |
| `asin`             | Amazon Standard Identification Number – unique identifier for each Kindle book. |
| `helpful`          | List in the format `[helpful_votes, total_votes]`, indicating how many people found the review helpful. |
| `rating`           | Star rating (1 to 5) given by the reviewer. |
| `reviewText`       | Full text of the review – **main text data for NLP**. |
| `reviewTime`       | Review date in string format (e.g., `09 2, 2010`). |
| `reviewerID`       | Unique ID for the reviewer. |
| `reviewerName`     | Name of the reviewer (may be missing in some cases). |
| `summary`          | Short summary or title of the review. |
| `unixReviewTime`   | Review date as a UNIX timestamp. |




# Best Practices
1. Preprocessing and cleaning
2. Train-test split
3. BOW, TF-IDF, Word2Vec
4. Train ML algorithm

In [31]:
import pandas as pd
df=pd.read_csv('all_kindle_review.csv')
df.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,asin,helpful,rating,reviewText,reviewTime,reviewerID,reviewerName,summary,unixReviewTime
0,0,11539,B0033UV8HI,"[8, 10]",3,"Jace Rankin may be short, but he's nothing to ...","09 2, 2010",A3HHXRELK8BHQG,Ridley,Entertaining But Average,1283385600
1,1,5957,B002HJV4DE,"[1, 1]",5,Great short read. I didn't want to put it dow...,"10 8, 2013",A2RGNZ0TRF578I,Holly Butler,Terrific menage scenes!,1381190400
2,2,9146,B002ZG96I4,"[0, 0]",3,I'll start by saying this is the first of four...,"04 11, 2014",A3S0H2HV6U1I7F,Merissa,Snapdragon Alley,1397174400
3,3,7038,B002QHWOEU,"[1, 3]",3,Aggie is Angela Lansbury who carries pocketboo...,"07 5, 2014",AC4OQW3GZ919J,Cleargrace,very light murder cozy,1404518400
4,4,1776,B001A06VJ8,"[0, 1]",4,I did not expect this type of book to be in li...,"12 31, 2012",A3C9V987IQHOQD,Rjostler,Book,1356912000
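The `head()` output shows that `helpful` arrives as a string such as `"[8, 10]"` and that `unixReviewTime` is an integer count of seconds. The notebook does not use these columns, so the following is only a sketch of how they could be parsed if needed. The names `sample`, `helpful_votes`, `total_votes`, `helpful_ratio` and `review_date` are assumptions introduced for illustration.

```python
import ast
import pandas as pd

# Two rows copied from the df.head() output; the real file has 12,000 rows.
sample = pd.DataFrame({
    "helpful": ["[8, 10]", "[0, 0]"],
    "unixReviewTime": [1283385600, 1397174400],
})

# `helpful` is read from the CSV as a string, so parse it into a real list.
votes = sample["helpful"].apply(ast.literal_eval)
sample["helpful_votes"] = votes.apply(lambda v: v[0])
sample["total_votes"] = votes.apply(lambda v: v[1])

# Reviews with no votes get NaN instead of a division-by-zero result.
sample["helpful_ratio"] = (
    sample["helpful_votes"] / sample["total_votes"].where(sample["total_votes"] > 0)
)

# The UNIX timestamp counts seconds since 1970-01-01 UTC.
sample["review_date"] = pd.to_datetime(sample["unixReviewTime"], unit="s")

print(sample[["helpful_ratio", "review_date"]])
```

The first row's timestamp resolves to 2 September 2010, which agrees with its `reviewTime` string `09 2, 2010`.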


In [32]:
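# Keep only the columns needed for sentiment classification.
# .copy() avoids SettingWithCopyWarning when these columns are modified later.
df = df[['reviewText','rating']].copy()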
df.head()

Unnamed: 0,reviewText,rating
0,"Jace Rankin may be short, but he's nothing to ...",3
1,Great short read. I didn't want to put it dow...,5
2,I'll start by saying this is the first of four...,3
3,Aggie is Angela Lansbury who carries pocketboo...,3
4,I did not expect this type of book to be in li...,4


In [33]:
df.shape

(12000, 2)

In [34]:
## Checking Missing Values
df.isnull().sum()


reviewText    0
rating        0
dtype: int64

In [35]:
df['rating'].unique()

array([3, 5, 4, 2, 1], dtype=int64)

In [36]:
df['rating'].value_counts()

rating
5    3000
4    3000
3    2000
2    2000
1    2000
Name: count, dtype: int64

- Preprocessing and cleaning

In [37]:
## Binarize: ratings 1-2 become negative (0); ratings 3-5 become positive (1)
df['rating'] = df['rating'].apply(lambda x : 0 if x<3 else 1)
df['rating']


0        1
1        1
2        1
3        1
4        1
        ..
11995    1
11996    1
11997    1
11998    0
11999    1
Name: rating, Length: 12000, dtype: int64

In [38]:
df.head()

Unnamed: 0,reviewText,rating
0,"Jace Rankin may be short, but he's nothing to ...",1
1,Great short read. I didn't want to put it dow...,1
2,I'll start by saying this is the first of four...,1
3,Aggie is Angela Lansbury who carries pocketboo...,1
4,I did not expect this type of book to be in li...,1


In [39]:
df['rating'].unique()

array([1, 0], dtype=int64)

In [40]:
df['rating'].value_counts()

rating
1    8000
0    4000
Name: count, dtype: int64

1. Lowercase all text

In [41]:
# Lowercase all text
df['reviewText'] = df['reviewText'].str.lower()

In [42]:
df.head()

Unnamed: 0,reviewText,rating
0,"jace rankin may be short, but he's nothing to ...",1
1,great short read. i didn't want to put it dow...,1
2,i'll start by saying this is the first of four...,1
3,aggie is angela lansbury who carries pocketboo...,1
4,i did not expect this type of book to be in li...,1


2. Remove URLs, HTML tags, special characters, and stopwords

In [43]:
# clean_text() below uses the "lxml" parser, so lxml must be installed as well
%pip install beautifulsoup4 lxml




In [44]:
import re
import nltk
from nltk.corpus import stopwords
from bs4 import BeautifulSoup

# Make sure stopwords are downloaded
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Define a cleaning function
def clean_text(text):
    # Remove URLs (the dot after "www" is escaped so it matches a literal ".")
    text = re.sub(r'http\S+|www\.\S+', '', str(text))

    # Remove HTML tags
    text = BeautifulSoup(text, "lxml").get_text()

    # Remove special characters and digits (keep only words)
    text = re.sub(r'[^a-zA-Z\s]', '', text)

    # Convert to lowercase
    text = text.lower()

    # Remove stopwords
    text = " ".join([word for word in text.split() if word not in stop_words])

    # Remove extra spaces
    text = " ".join(text.split())
    
    return text

# Apply to the reviewText column
df['reviewText']=df['reviewText'].apply(clean_text)


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\asus\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [45]:
df.head()

Unnamed: 0,reviewText,rating
0,jace rankin may short hes nothing mess man hau...,1
1,great short read didnt want put read one sitti...,1
2,ill start saying first four books wasnt expect...,1
3,aggie angela lansbury carries pocketbooks inst...,1
4,expect type book library pleased find price right,1
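Row 4 shows a side effect of the stopword step: "i did not expect ..." became "expect ...", because NLTK's English stopword list includes negations such as "not". For sentiment work that can flip the apparent meaning of a phrase. The sketch below illustrates one possible adjustment, keeping negation words, using a deliberately tiny hand-written stopword set so that it runs without any downloads. `STOP_WORDS`, `NEGATIONS`, `remove_stopwords` and the example sentence are all assumptions for illustration, not part of the notebook.

```python
import re

# Hypothetical miniature stopword list; the notebook uses NLTK's full English list.
STOP_WORDS = {"i", "did", "not", "this", "of", "to", "be", "in", "the", "it", "is"}
# Negation words that one might choose to keep for sentiment analysis.
NEGATIONS = {"no", "nor", "not"}

def remove_stopwords(text, stop_words):
    """Lowercase, strip non-letters, and drop tokens found in stop_words."""
    tokens = re.sub(r"[^a-z\s]", "", text.lower()).split()
    return " ".join(t for t in tokens if t not in stop_words)

# Made-up sentence modeled on row 4 of the DataFrame.
review = "I did not expect this type of book to be in the library"

print(remove_stopwords(review, STOP_WORDS))              # expect type book library
print(remove_stopwords(review, STOP_WORDS - NEGATIONS))  # not expect type book library
```

With the notebook's own setup, the equivalent change would be to subtract a negation set from `stop_words` before calling `clean_text`; whether that helps accuracy on this dataset is not something the notebook measures.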


In [46]:
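# Lemmatizer
from nltk.stem import WordNetLemmatizer

# WordNetLemmatizer needs the WordNet corpus; download it once if it is missing.
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()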

In [47]:
def lemmatize_words(text):
    return " ".join([lemmatizer.lemmatize(word) for word in text.split()])
    

In [48]:
df['reviewText']=df['reviewText'].apply(lemmatize_words)

In [49]:
df.head()

Unnamed: 0,reviewText,rating
0,jace rankin may short he nothing mess man haul...,1
1,great short read didnt want put read one sitti...,1
2,ill start saying first four book wasnt expecti...,1
3,aggie angela lansbury carry pocketbook instead...,1
4,expect type book library pleased find price right,1


In [50]:
# train test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df['reviewText'], df['rating'], test_size=0.20, random_state=42)
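Since the labels are imbalanced 2:1, one option is to pass `stratify=` so that the train and test sets keep the same class proportions. The notebook does not do this; the sketch below only demonstrates the parameter on a toy stand-in (`texts` and `labels` are made up).

```python
from sklearn.model_selection import train_test_split

# Toy stand-in with the notebook's 2:1 positive-to-negative ratio.
texts = [f"review {i}" for i in range(12)]
labels = [1] * 8 + [0] * 4

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# Both splits keep two positives for every negative.
print("train positives:", sum(y_tr), "of", len(y_tr))
print("test positives: ", sum(y_te), "of", len(y_te))
```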

In [51]:
# naive bayes model
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()

In [52]:
# BOW
from sklearn.feature_extraction.text import CountVectorizer
bow=CountVectorizer()
X_train_bow=bow.fit_transform(X_train).toarray()
X_test_bow=bow.transform(X_test).toarray()

In [53]:
X_train_bow

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [54]:
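# fit() returns the estimator itself, so use a fresh GaussianNB here;
# otherwise nb_model_bow would be overwritten when gnb is refit on other features.
nb_model_bow=GaussianNB().fit(X_train_bow,y_train)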

In [55]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
y_pred_bow=nb_model_bow.predict(X_test_bow)
print("BOW ACCURACY: ", accuracy_score(y_test, y_pred_bow))
print("BOW CONFUSION MATRIX: ", confusion_matrix(y_test,y_pred_bow))

BOW ACCURACY:  0.5770833333333333
BOW CONFUSION MATRIX:  [[505 298]
 [717 880]]
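scikit-learn's `confusion_matrix` puts true classes in rows and predicted classes in columns, so with labels 0 and 1 the layout is `[[TN, FP], [FN, TP]]`. Reading the BOW matrix that way gives per-class figures that the single accuracy number hides. The sketch below is plain arithmetic on the printed values.

```python
import numpy as np

# BOW confusion matrix copied from the output above: rows = true, columns = predicted.
cm = np.array([[505, 298],
               [717, 880]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()
precision_pos = tp / (tp + fp)   # of the reviews predicted positive, how many were positive
recall_pos = tp / (tp + fn)      # of the truly positive reviews, how many were found
recall_neg = tn / (tn + fp)      # of the truly negative reviews, how many were found

print(f"accuracy:           {accuracy:.4f}")
print(f"precision (class 1): {precision_pos:.4f}")
print(f"recall (class 1):    {recall_pos:.4f}")
print(f"recall (class 0):    {recall_neg:.4f}")
```

The computed accuracy matches the 0.5771 reported by `accuracy_score`, and the positive-class recall of roughly 0.55 shows that 717 of the 1,597 positive test reviews were labeled negative.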


In [56]:
# Tfidf
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf=TfidfVectorizer()
X_train_tfidf=tfidf.fit_transform(X_train).toarray()
X_test_tfidf=tfidf.transform(X_test).toarray()

In [57]:
X_train_tfidf

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
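With its default settings (`smooth_idf=True`, `norm='l2'`), `TfidfVectorizer` weights each raw count by `ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` the number of documents containing the term, and then scales each row to unit length. The sketch below recomputes one row by hand on a made-up two-document corpus and compares it with the library's result.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["great short read", "great book"]

vec = TfidfVectorizer()              # defaults: smooth_idf=True, norm='l2'
X = vec.fit_transform(docs).toarray()

# Manual computation for the first document.
# Vocabulary order is alphabetical: book, great, read, short.
n_docs = len(docs)
tf = np.array([0.0, 1.0, 1.0, 1.0])         # raw counts in "great short read"
df_counts = np.array([1.0, 2.0, 1.0, 1.0])  # documents containing each word

idf = np.log((1 + n_docs) / (1 + df_counts)) + 1
raw = tf * idf
manual = raw / np.linalg.norm(raw)          # L2 normalization

print(np.round(manual, 4))
print(np.allclose(manual, X[0]))            # True
```

"great" appears in both documents, so its IDF is exactly 1 and it ends up with a smaller weight than "short" or "read", which each appear in only one.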

In [58]:
# Use a fresh GaussianNB so this model stays independent of the other fitted models
nb_model_tfidf=GaussianNB().fit(X_train_tfidf,y_train)

In [59]:
y_pred_tfidf=nb_model_tfidf.predict(X_test_tfidf)
print("Tfidf ACCURACY: ", accuracy_score(y_test, y_pred_tfidf))
print("Tfidf confusion matrix:" , confusion_matrix(y_test, y_pred_tfidf))

Tfidf ACCURACY:  0.57875
Tfidf confusion matrix: [[486 317]
 [694 903]]
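`GaussianNB` models every feature as a normal distribution and needs dense arrays, which is an awkward fit for sparse word counts. A commonly used alternative for count features is `MultinomialNB`, which accepts the sparse matrix directly. This is a swapped-in technique, not what the notebook runs, and the toy data below says nothing about how it would score on the Kindle reviews.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up miniature corpus: 1 = positive, 0 = negative.
train_docs = [
    "great book loved",
    "terrible boring book",
    "loved great story",
    "boring terrible plot",
]
train_labels = [1, 0, 1, 0]
test_docs = ["great story", "boring plot"]

vectorizer = CountVectorizer()
X_tr = vectorizer.fit_transform(train_docs)   # stays sparse, no .toarray() needed
X_te = vectorizer.transform(test_docs)

model = MultinomialNB()                       # default Laplace smoothing, alpha=1.0
model.fit(X_tr, train_labels)
predictions = model.predict(X_te)

print(predictions)   # [1 0]
```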


In [60]:
# word2vec
import gensim
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize

In [61]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\asus\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [62]:
# Tokenize each review for training Word2Vec
X_train_tokens = [word_tokenize(text.lower()) for text in X_train]
X_test_tokens = [word_tokenize(text.lower()) for text in X_test]

# Train Word2Vec model on training tokens
w2v_model = Word2Vec(sentences=X_train_tokens, vector_size=100, window=5, min_count=1, workers=4)

# Function to convert token list to mean Word2Vec vector
import numpy as np

def get_avg_word2vec(tokens, model, vector_size):
    vec = np.zeros(vector_size)
    count = 0
    for word in tokens:
        if word in model.wv:
            vec += model.wv[word]
            count += 1
    return vec / count if count != 0 else vec

# Convert train and test text data to average Word2Vec vectors
# (vector_size is read from the model instead of repeating the literal 100)
X_train_w2v = np.array([get_avg_word2vec(tokens, w2v_model, w2v_model.vector_size) for tokens in X_train_tokens])
X_test_w2v = np.array([get_avg_word2vec(tokens, w2v_model, w2v_model.vector_size) for tokens in X_test_tokens])


In [63]:
X_train_w2v

array([[-0.10708347,  0.40507751,  0.14574778, ..., -0.29692362,
         0.1395642 ,  0.10788212],
       [-0.11840786,  0.34882312,  0.2384743 , ..., -0.1370949 ,
        -0.02385138,  0.09884442],
       [-0.13246261,  0.24105866,  0.11759935, ..., -0.2866644 ,
         0.25968741,  0.12236031],
       ...,
       [-0.19195054,  0.28250059,  0.1500783 , ..., -0.30561785,
         0.21298316,  0.09552217],
       [-0.23909688,  0.2360215 ,  0.04887672, ..., -0.48307374,
         0.45610437,  0.19681165],
       [-0.19426772,  0.2045512 ,  0.16009831, ..., -0.30686476,
         0.27512078,  0.17603584]])
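Each row above is the mean of the word vectors in one review. Words the model has never seen are skipped, and a review with no known words falls back to the zero vector. The sketch below exercises the same `get_avg_word2vec` function with a hypothetical stand-in for the trained model (`fake_model`, a plain dict of 2-dimensional vectors), so it runs without gensim.

```python
from types import SimpleNamespace
import numpy as np

def get_avg_word2vec(tokens, model, vector_size):
    """Mean of the vectors of all tokens known to the model (zeros if none are known)."""
    vec = np.zeros(vector_size)
    count = 0
    for word in tokens:
        if word in model.wv:
            vec += model.wv[word]
            count += 1
    return vec / count if count != 0 else vec

# Hypothetical stand-in for a trained gensim model: a dict supports the
# `word in model.wv` and `model.wv[word]` lookups that the function relies on.
fake_model = SimpleNamespace(wv={
    "great": np.array([1.0, 0.0]),
    "book": np.array([0.0, 1.0]),
})

print(get_avg_word2vec(["great", "book", "unseen"], fake_model, 2))  # [0.5 0.5]
print(get_avg_word2vec(["unseen"], fake_model, 2))                   # [0. 0.]
```

This matters for the test set: the Word2Vec model is trained on `X_train_tokens` only, so test words outside the training vocabulary contribute nothing to the average.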

In [64]:
# Use a fresh GaussianNB so each feature set gets its own independent model
nb_model_w2v=GaussianNB().fit(X_train_w2v, y_train)

In [65]:
y_pred_w2v=nb_model_w2v.predict(X_test_w2v)
print("w2v ACCURACY: ", accuracy_score(y_test, y_pred_w2v))
print("w2v confusion matrix:" , confusion_matrix(y_test, y_pred_w2v))

w2v ACCURACY:  0.68875
w2v confusion matrix: [[ 610  193]
 [ 554 1043]]
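The three confusion matrices can be put side by side with a simple reference point: the accuracy of always predicting the majority class on this test set (1,597 positives out of 2,400). The sketch below is arithmetic on the printed matrices only; balanced accuracy, the mean of the per-class recalls, is added here as one common way to account for class imbalance and is not a metric the notebook reports.

```python
import numpy as np

# Confusion matrices copied from the outputs above: rows = true, columns = predicted.
results = {
    "BOW": np.array([[505, 298], [717, 880]]),
    "TF-IDF": np.array([[486, 317], [694, 903]]),
    "Word2Vec": np.array([[610, 193], [554, 1043]]),
}

# Row sums give the true class sizes in the test set: [803, 1597].
support = results["BOW"].sum(axis=1)
baseline = support.max() / support.sum()   # accuracy of always predicting class 1

summary = {}
for name, cm in results.items():
    accuracy = np.trace(cm) / cm.sum()
    balanced = np.mean(np.diag(cm) / cm.sum(axis=1))
    summary[name] = (accuracy, balanced)
    print(f"{name:<9} accuracy={accuracy:.4f}  balanced accuracy={balanced:.4f}")

print(f"majority-class baseline accuracy={baseline:.4f}")
```

On these numbers, the BOW and TF-IDF models score below the majority-class baseline on plain accuracy, while the averaged Word2Vec features score slightly above it.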
