# NLP - Product Review Sentiment Analysis

Dataset: Kindle Store 5-core - https://cseweb.ucsd.edu/~jmcauley/datasets/amazon_v2/#subsets

In [1]:
import time

start_time = time.time()

In [2]:
import numpy as np
import pandas as pd

### Sequence

1. Preprocessing & Data Cleaning
2. Train-Test-Split
3. BOW, TFIDF, Word2Vec
4. Train ML Algorithms

## The Dataset

- reviewerID - ID of the reviewer, e.g. A2SUAM1J3GNN3B
- asin - ID of the product, e.g. 0000013714
- reviewerName - name of the reviewer
- vote - helpful votes of the review
- style - a dictionary of the product metadata, e.g., "Format" is "Hardcover"
- reviewText - text of the review
- overall - rating of the product
- summary - summary of the review
- unixReviewTime - time of the review (unix time)
- reviewTime - time of the review (raw)
- image - images that users post after they have received the product

In [3]:
# from google.colab import drive
# drive.mount('/content/drive/')

In [4]:
# dataset_path = '/content/drive/MyDrive/Colab Notebooks/Kindle_Store_5.json.gz'

In [5]:
dataset_path = './Kindle_Store_5.json.gz'

In [6]:
df0 = pd.read_json(dataset_path, compression='gzip', lines=True)
df = df0.copy()

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2222983 entries, 0 to 2222982
Data columns (total 12 columns):
 #   Column          Dtype 
---  ------          ----- 
 0   overall         int64 
 1   verified        bool  
 2   reviewTime      object
 3   reviewerID      object
 4   asin            object
 5   style           object
 6   reviewerName    object
 7   reviewText      object
 8   summary         object
 9   unixReviewTime  int64 
 10  vote            object
 11  image           object
dtypes: bool(1), int64(2), object(9)
memory usage: 188.7+ MB


In [8]:
df

Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,unixReviewTime,vote,image
0,4,True,"07 3, 2014",A2LSKD2H9U8N0J,B000FA5KK0,{'Format:': ' Kindle Edition'},sandra sue marsolek,"pretty good story, a little exaggerated, but I...",pretty good story,1404345600,,
1,5,True,"05 26, 2014",A2QP13XTJND1QS,B000FA5KK0,{'Format:': ' Kindle Edition'},Tpl,"If you've read other max brand westerns, you k...",A very good book,1401062400,,
2,5,True,"09 16, 2016",A8WQ7MAG3HFOZ,B000FA5KK0,{'Format:': ' Kindle Edition'},Alverne F. Anderson,"Love Max, always a fun twist",Five Stars,1473984000,,
3,5,True,"03 3, 2016",A1E0MODSRYP7O,B000FA5KK0,{'Format:': ' Kindle Edition'},Jeff,"As usual for him, a good book",a good,1456963200,,
4,5,True,"09 10, 2015",AYUTCGVSM1H7T,B000FA5KK0,{'Format:': ' Kindle Edition'},DEHS - EddyRapcon,MB is one of the original western writers and ...,A Western,1441843200,2,
...,...,...,...,...,...,...,...,...,...,...,...,...
2222978,3,False,"07 16, 2016",A3Q6HJYRJX87Z9,B01HJENY3Y,{'Format:': ' Kindle Edition'},Tokea,Ok book but some parts just didn't add up I fe...,Cool book,1468627200,,
2222979,5,False,"07 12, 2016",A2O7HQNKCMOMUP,B01HJENY3Y,{'Format:': ' Kindle Edition'},Angela Burnett,Kia I loved this book. I am so glad that Sky ...,Crazy Read,1468281600,,
2222980,5,False,"07 1, 2016",A38NOWP7LQI8CM,B01HJENY3Y,{'Format:': ' Kindle Edition'},Treka22,This picks up where part one left off. Secret ...,Loved it,1467331200,,
2222981,5,False,"07 1, 2016",A1H9WGEEKVK0FM,B01HJENY3Y,{'Format:': ' Kindle Edition'},Adrienne Jeremiah,What a beautiful ending to such a twisted begi...,Beautiful ending,1467331200,,


In [9]:
df[['reviewText','overall']]

Unnamed: 0,reviewText,overall
0,"pretty good story, a little exaggerated, but I...",4
1,"If you've read other max brand westerns, you k...",5
2,"Love Max, always a fun twist",5
3,"As usual for him, a good book",5
4,MB is one of the original western writers and ...,5
...,...,...
2222978,Ok book but some parts just didn't add up I fe...,3
2222979,Kia I loved this book. I am so glad that Sky ...,5
2222980,This picks up where part one left off. Secret ...,5
2222981,What a beautiful ending to such a twisted begi...,5


In [10]:
df = df[['reviewText','overall']]  # note: appending .copy() here would avoid the SettingWithCopyWarning emitted by later cells

In [11]:
df.columns = df.columns.str.lower()

In [12]:
df

Unnamed: 0,reviewtext,overall
0,"pretty good story, a little exaggerated, but I...",4
1,"If you've read other max brand westerns, you k...",5
2,"Love Max, always a fun twist",5
3,"As usual for him, a good book",5
4,MB is one of the original western writers and ...,5
...,...,...
2222978,Ok book but some parts just didn't add up I fe...,3
2222979,Kia I loved this book. I am so glad that Sky ...,5
2222980,This picks up where part one left off. Secret ...,5
2222981,What a beautiful ending to such a twisted begi...,5


In [13]:
df.shape

(2222983, 2)

In [14]:
df.isna().sum()

reviewtext    403
overall         0
dtype: int64

In [15]:
df.dropna(inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.dropna(inplace=True)


In [16]:
df.shape

(2222580, 2)

In [17]:
df.rename(columns={'overall':'rating'}, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.rename(columns={'overall':'rating'}, inplace=True)


In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2222580 entries, 0 to 2222982
Data columns (total 2 columns):
 #   Column      Dtype 
---  ------      ----- 
 0   reviewtext  object
 1   rating      int64 
dtypes: int64(1), object(1)
memory usage: 50.9+ MB


In [19]:
df['rating'].value_counts()

rating
5    1353349
4     556258
3     197919
2      66888
1      48166
Name: count, dtype: int64

In [20]:
df['rating'] = df['rating'].apply(lambda x: 1 if x > 3 else 0)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['rating'] = df['rating'].apply(lambda x: 1 if x > 3 else 0)


In [21]:
df

Unnamed: 0,reviewtext,rating
0,"pretty good story, a little exaggerated, but I...",1
1,"If you've read other max brand westerns, you k...",1
2,"Love Max, always a fun twist",1
3,"As usual for him, a good book",1
4,MB is one of the original western writers and ...,1
...,...,...
2222978,Ok book but some parts just didn't add up I fe...,0
2222979,Kia I loved this book. I am so glad that Sky ...,1
2222980,This picks up where part one left off. Secret ...,1
2222981,What a beautiful ending to such a twisted begi...,1


In [22]:
df['rating'].value_counts()

rating
1    1909607
0     312973
Name: count, dtype: int64
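
The counts above show a heavy skew towards positive reviews. As a rough sketch, using only the two counts printed in that output, the share of the majority class gives the accuracy that a trivial "always positive" classifier would reach, which is a useful floor when reading the model scores later on.

```python
# Class counts copied from the value_counts() output above.
counts = {1: 1909607, 0: 312973}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)
majority_share = counts[majority_label] / total

print(f"total reviews: {total}")
print(f"majority class: {majority_label} ({majority_share:.2%})")
```

Any model whose accuracy is close to this share has learned little beyond the class prior, so per-class precision and recall are the more informative numbers for this dataset.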

In [23]:
df['reviewtext'].str.lower()

0          pretty good story, a little exaggerated, but i...
1          if you've read other max brand westerns, you k...
2                               love max, always a fun twist
3                              as usual for him, a good book
4          mb is one of the original western writers and ...
                                 ...                        
2222978    ok book but some parts just didn't add up i fe...
2222979    kia i loved this book.  i am so glad that sky ...
2222980    this picks up where part one left off. secret ...
2222981    what a beautiful ending to such a twisted begi...
2222982    honey let me tell you ms. kia must have been r...
Name: reviewtext, Length: 2222580, dtype: object

In [24]:
df['reviewtext'] = df['reviewtext'].str.lower()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['reviewtext'] = df['reviewtext'].str.lower()


In [25]:
df

Unnamed: 0,reviewtext,rating
0,"pretty good story, a little exaggerated, but i...",1
1,"if you've read other max brand westerns, you k...",1
2,"love max, always a fun twist",1
3,"as usual for him, a good book",1
4,mb is one of the original western writers and ...,1
...,...,...
2222978,ok book but some parts just didn't add up i fe...,0
2222979,kia i loved this book. i am so glad that sky ...,1
2222980,this picks up where part one left off. secret ...,1
2222981,what a beautiful ending to such a twisted begi...,1


In [26]:
import re

In [27]:
df['reviewtext'] = df['reviewtext'].apply(lambda x: re.sub('[^a-z A-Z 0-9]+','',x))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['reviewtext'] = df['reviewtext'].apply(lambda x: re.sub('[^a-z A-Z 0-9]+','',x))


In [28]:
import nltk
from nltk.corpus import stopwords

In [29]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/iceyisaak/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [30]:
stopwords_en = set(stopwords.words('english'))  # a set gives O(1) membership tests in the filter below

In [31]:
df['reviewtext'] = df['reviewtext'].apply(lambda x:" ".join([word for word in x.split() if word not in stopwords_en]) )

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['reviewtext'] = df['reviewtext'].apply(lambda x:" ".join([word for word in x.split() if word not in stopwords_en]) )


In [32]:
# Remove URLs (this pattern does not match email addresses)
df['reviewtext'] = df['reviewtext'].apply(lambda x: re.sub(r'(http|https|ftp|ssh)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?','',str(x)))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['reviewtext'] = df['reviewtext'].apply(lambda x: re.sub(r'(http|https|ftp|ssh)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?','',str(x)))


In [33]:
from bs4 import BeautifulSoup

In [34]:
import lxml
print(lxml.__version__)

5.3.0


In [35]:
# Remove HTML tags
df['reviewtext'] = df['reviewtext'].apply(lambda x: BeautifulSoup(x, 'html.parser').get_text())

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['reviewtext'] = df['reviewtext'].apply(lambda x: BeautifulSoup(x, 'html.parser').get_text())


In [36]:
# Remove additional spaces
df['reviewtext'] = df['reviewtext'].apply(lambda x: ' '.join(x.split()))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['reviewtext'] = df['reviewtext'].apply(lambda x: ' '.join(x.split()))


In [37]:
df.head()

Unnamed: 0,reviewtext,rating
0,pretty good story little exaggerated liked pre...,1
1,youve read max brand westerns know expect your...,1
2,love max always fun twist,1
3,usual good book,1
4,mb one original western writers many years man...,1


In [38]:
df.tail()

Unnamed: 0,reviewtext,rating
2222978,ok book parts didnt add felt like purp died br...,0
2222979,kia loved book glad sky got coming hiring hit ...,1
2222980,picks part one left secret still conniving sta...,1
2222981,beautiful ending twisted beginning everyone st...,1
2222982,honey let tell ms kia must really mad writing ...,1
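
One detail of the cleaning order is worth illustrating: the punctuation filter in In [27] runs before the URL and HTML steps, so characters such as `://`, `<` and `>` are already gone by the time those patterns are applied. The sketch below compares the notebook's order with a reordered variant. The helper names, the simplified URL pattern and the regex-based tag stripper (a crude stand-in for BeautifulSoup) are assumptions made for illustration, not the notebook's code.

```python
import re

URL_RE = re.compile(r'(?:https?|ftp|ssh)://\S+')   # simplified URL pattern (assumption)
TAG_RE = re.compile(r'<[^>]+>')                    # crude stand-in for BeautifulSoup
NON_ALNUM_RE = re.compile(r'[^a-z0-9 ]+')          # same effect as In [27] on lowercased text


def clean_review(text: str) -> str:
    """Hypothetical helper: strip URLs and tags first, punctuation last."""
    text = text.lower()
    text = URL_RE.sub(' ', text)
    text = TAG_RE.sub(' ', text)
    text = NON_ALNUM_RE.sub('', text)
    return ' '.join(text.split())


def clean_review_original_order(text: str) -> str:
    """Same steps in the notebook's order, for comparison."""
    text = text.lower()
    text = NON_ALNUM_RE.sub('', text)
    text = URL_RE.sub(' ', text)
    text = TAG_RE.sub(' ', text)
    return ' '.join(text.split())


sample = 'Great read!<br/>See https://example.com/book for more.'
print(clean_review(sample))
print(clean_review_original_order(sample))
```

In the original order the tag and the URL are fused into neighbouring words instead of being removed, which may also contribute to the very large vocabulary reported later.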


## Lemmatisation

In [39]:
from nltk.stem import WordNetLemmatizer

In [40]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /home/iceyisaak/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [41]:
lemmatizer = WordNetLemmatizer()

In [42]:
def lemmatize_text(text):
  return " ".join([lemmatizer.lemmatize(word) for word in text.split()])

In [43]:
df['reviewtext'] = df['reviewtext'].apply(lambda x: lemmatize_text(x))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['reviewtext'] = df['reviewtext'].apply(lambda x: lemmatize_text(x))


In [44]:
df

Unnamed: 0,reviewtext,rating
0,pretty good story little exaggerated liked pre...,1
1,youve read max brand western know expect youre...,1
2,love max always fun twist,1
3,usual good book,1
4,mb one original western writer many year many ...,1
...,...,...
2222978,ok book part didnt add felt like purp died bro...,0
2222979,kia loved book glad sky got coming hiring hit ...,1
2222980,pick part one left secret still conniving star...,1
2222981,beautiful ending twisted beginning everyone st...,1


## Declare X & y variables

In [45]:
X = df['reviewtext']

In [46]:
y = df['rating']

## Train Test Split

In [47]:
from sklearn.model_selection import train_test_split

In [48]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2, random_state=42)

In [49]:
X_train.shape, X_test.shape

((1778064,), (444516,))

In [50]:
y_train.shape, y_test.shape

((1778064,), (444516,))

---

## BOW

In [51]:
from sklearn.feature_extraction.text import CountVectorizer

In [52]:
bow = CountVectorizer()

In [54]:
X_train_bow = bow.fit_transform(X_train)

In [55]:
X_test_bow = bow.transform(X_test)

In [56]:
X_train_bow.shape, X_test_bow.shape

((1778064, 1188276), (444516, 1188276))

In [57]:
len(bow.vocabulary_)

1188276
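
To make the bag-of-words representation concrete, here is a minimal pure-Python sketch of what a count matrix contains: one column per vocabulary term, one row per document, each cell holding a term count. The two documents are made up, and whitespace tokenisation is a simplification (`CountVectorizer`'s default token pattern, for instance, ignores single-character tokens).

```python
from collections import Counter

docs = ["good book good story", "boring story"]   # hypothetical cleaned reviews

# Vocabulary: every distinct token, mapped to a column index.
tokens = sorted({tok for doc in docs for tok in doc.split()})
vocabulary = {tok: i for i, tok in enumerate(tokens)}


def bow_row(doc: str) -> list:
    """Return the term counts of one document, ordered by vocabulary index."""
    counts = Counter(doc.split())
    return [counts.get(tok, 0) for tok in vocabulary]


matrix = [bow_row(doc) for doc in docs]
print(vocabulary)
print(matrix)
```

With more than a million columns, as in the shapes above, almost every cell is zero, which is why the vectoriser returns a sparse matrix rather than a dense array.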

#### Random Oversampling: BOW

In [58]:
from imblearn.over_sampling import RandomOverSampler
from collections import Counter

ros_bow = RandomOverSampler(random_state=42, sampling_strategy='minority')
X_train_bow, y_train_bow = ros_bow.fit_resample(X_train_bow, y_train)

print("-" * 30)
print("Training Data Balance AFTER MCO:")
print(Counter(y_train_bow)) # Counter shows the new distribution

print("-" * 30)
print("Shape of Resampled Features:", X_train_bow.shape)
print("Shape of Resampled Target:", y_train_bow.shape)

------------------------------
Training Data Balance AFTER MCO:
Counter({1: 1527789, 0: 1527789})
------------------------------
Shape of Resampled Features: (3055578, 1188276)
Shape of Resampled Target: (3055578,)
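
The resampled shape is exactly twice the majority count because random oversampling draws minority rows with replacement until both classes are the same size. The sketch below shows that idea in plain Python. `random_oversample` is a hypothetical helper written for illustration and is not imblearn's implementation.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=42):
    """Hypothetical helper: duplicate minority rows (with replacement) until classes balance."""
    rng = random.Random(seed)
    counts = Counter(y)
    minority = min(counts, key=counts.get)
    deficit = max(counts.values()) - counts[minority]
    minority_idx = [i for i, label in enumerate(y) if label == minority]
    extra = [rng.choice(minority_idx) for _ in range(deficit)]
    X_res = list(X) + [X[i] for i in extra]
    y_res = list(y) + [y[i] for i in extra]
    return X_res, y_res


X = ["a", "b", "c", "d", "e"]
y = [1, 1, 1, 1, 0]
X_res, y_res = random_oversample(X, y)
print(Counter(y_res))
```

Because the added rows are exact copies, resampling only the training split (as done in this notebook) keeps duplicated reviews out of the test set.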


---

## TFIDF

In [59]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [60]:
tfidf = TfidfVectorizer()

In [62]:
X_train_tfidf = tfidf.fit_transform(X_train)

In [63]:
X_test_tfidf = tfidf.transform(X_test)

In [64]:
X_train_tfidf.shape, X_test_tfidf.shape

((1778064, 1188276), (444516, 1188276))

In [65]:
len(tfidf.vocabulary_)

1188276
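
The TF-IDF matrix has the same shape as the count matrix but reweights each count. As a sketch of the weighting that scikit-learn applies with its default settings (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row), the pure-Python version below uses two made-up documents and whitespace tokenisation as simplifying assumptions.

```python
import math
from collections import Counter

docs = ["good book good story", "boring story"]   # hypothetical cleaned reviews
tokenized = [doc.split() for doc in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
doc_freq = {t: sum(t in doc for doc in tokenized) for t in vocab}
# Smoothed inverse document frequency.
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}


def tfidf_row(doc):
    """Weight raw term counts by IDF, then scale the row to unit L2 norm."""
    counts = Counter(doc)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]


rows = [tfidf_row(doc) for doc in tokenized]
for row in rows:
    print([round(v, 3) for v in row])
```

Terms that occur in every document (here "story") receive the minimum IDF of 1, while rarer terms are weighted up.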

#### Random Oversampling: TFIDF

In [66]:
from imblearn.over_sampling import RandomOverSampler
from collections import Counter

ros_tfidf = RandomOverSampler(random_state=42, sampling_strategy='minority')
X_train_tfidf, y_train_tfidf = ros_tfidf.fit_resample(X_train_tfidf, y_train)

print("-" * 30)
print("Training Data Balance AFTER MCO:")
print(Counter(y_train_tfidf)) # Counter shows the new distribution

print("-" * 30)
print("Shape of Resampled Features:", X_train_tfidf.shape)
print("Shape of Resampled Target:", y_train_tfidf.shape)

------------------------------
Training Data Balance AFTER MCO:
Counter({1: 1527789, 0: 1527789})
------------------------------
Shape of Resampled Features: (3055578, 1188276)
Shape of Resampled Target: (3055578,)


---

## Word2Vec

In [67]:
from gensim.models import Word2Vec

In [68]:
from nltk import sent_tokenize
from gensim.utils import simple_preprocess

In [69]:
type(X_train)

pandas.core.series.Series

In [70]:
word2vec = Word2Vec(X_train, vector_size=100)

In [71]:
len(word2vec.wv.index_to_key)

37
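
A vocabulary of 37 entries is a strong hint that the model was trained on characters rather than words. `Word2Vec` expects an iterable of token lists; when it receives raw strings, iterating each string yields single characters. After the cleaning steps only lowercase letters, digits and spaces remain, and 26 + 10 + 1 is 37. The sketch below illustrates the difference with two reviews taken from the cleaned output earlier in the notebook. It does not call gensim.

```python
import string

docs = ["love max always fun twist", "usual good book"]

# Iterating a string yields characters, which is what a token-based trainer
# sees when it is handed raw strings instead of token lists.
char_tokens = [list(doc) for doc in docs]
word_tokens = [doc.split() for doc in docs]

# Upper bound on a character-level vocabulary after cleaning.
max_char_vocab = len(string.ascii_lowercase + string.digits + " ")

print(char_tokens[0][:4])
print(word_tokens[0])
print(max_char_vocab)
```

Under that reading, a word-level model would presumably be trained on something like `X_train.str.split()`, and the averaging step would need to receive the same token lists. That is a suggestion, not something the notebook does.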

### AvgWord2Vec

In [72]:
import numpy as np

# Average the vectors of all in-vocabulary tokens of one document.
# Returns NaN (with a NumPy RuntimeWarning) when no token is in the vocabulary.
def avg_w2v(doc):
    return np.mean([word2vec.wv[word] for word in doc if word in word2vec.wv.key_to_index], axis=0)
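
`np.mean` of an empty list returns `nan`, which is the likely source of the missing values that have to be handled further down. One possible alternative, sketched here under stated assumptions, is to fall back to a zero vector for documents without any known token, so every row has the same shape and can be stacked directly. `avg_vector` and the toy lookup table are hypothetical and stand in for `avg_w2v` and `word2vec.wv`. Whether a zero vector is an appropriate fallback is a modelling choice.

```python
import numpy as np


def avg_vector(tokens, vectors, dim):
    """Hypothetical variant of avg_w2v: mean of known token vectors, zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim, dtype=np.float32)
    return np.mean(known, axis=0)


# Toy lookup table standing in for word2vec.wv (assumption for illustration).
toy_vectors = {
    "good": np.array([1.0, 0.0], dtype=np.float32),
    "book": np.array([0.0, 1.0], dtype=np.float32),
}

features = np.vstack([
    avg_vector(["good", "book"], toy_vectors, dim=2),
    avg_vector([], toy_vectors, dim=2),   # empty review after cleaning
])
print(features)
```

With rows of uniform shape, a single `np.vstack` call produces the 2-D feature matrix without building one DataFrame per row.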

#### AvgWord2Vec: X_train

In [73]:
type(X_train)

pandas.core.series.Series

In [74]:
%pip install tqdm
from tqdm import tqdm

# Compute the average vector for every review in the training set
X_train_aw2v = []

for i in tqdm(range(len(X_train))):
    X_train_aw2v.append(avg_w2v(X_train.iloc[i]))

Note: you may need to restart the kernel to use updated packages.


  return _methods._mean(a, axis=axis, dtype=dtype,
  ret = ret.dtype.type(ret / rcount)
100%|██████████| 1778064/1778064 [04:18<00:00, 6884.16it/s] 


In [75]:
type(X_train_aw2v)

list

In [76]:
X_train_aw2v = pd.Series(X_train_aw2v)

In [77]:
type(X_train_aw2v)

pandas.core.series.Series

In [78]:
X_train_aw2v.shape

(1778064,)

In [79]:
X_train_aw2v[0].shape

(100,)

#### Reshape Data: X_train_aw2v

In [80]:
# 1. Collect one single-row DataFrame per averaged vector
X_list = []

# 2. Reshape each training vector to (1, n_features) and wrap it in a DataFrame
for i in range(len(X_train_aw2v)):
    new_row_X = pd.DataFrame(X_train_aw2v[i].reshape(1, -1))
    X_list.append(new_row_X)

# 3. Concatenate all rows once; ignore_index=True gives the result a fresh 0..n-1 index
X_train_aw2v = pd.concat(X_list, ignore_index=True)

  X_train_aw2v = pd.concat(X_list, ignore_index=True)


In [81]:
X_train_aw2v.shape

(1778064, 100)

In [82]:
type(X_train_aw2v)

pandas.core.frame.DataFrame

In [83]:
X_train_aw2v.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,0.119244,-0.072363,0.168405,-0.038415,0.212748,0.075023,0.068017,-0.188081,-0.153527,-0.105357,...,-0.016889,0.108033,-0.143034,-0.248071,0.092699,0.274771,0.081986,-0.16938,-0.134308,-0.127511
1,0.126066,-0.101383,0.154407,-0.024823,0.149063,0.010644,-0.004535,-0.181987,-0.093213,-0.04464,...,0.014702,0.074983,-0.101374,-0.117245,0.121603,0.216002,0.032895,-0.204323,-0.055498,-0.074425
2,0.175703,-0.03728,0.204199,-0.067216,0.236159,0.086466,0.072782,-0.201782,-0.175925,-0.111014,...,-0.020073,0.12986,-0.143863,-0.28096,0.107747,0.303367,0.070864,-0.235883,-0.135164,-0.165065
3,0.099069,-0.101606,0.200201,-0.071384,0.249454,0.09213,0.066612,-0.222832,-0.163769,-0.094835,...,-0.017657,0.120884,-0.153834,-0.245457,0.164866,0.282704,0.080592,-0.222269,-0.123327,-0.172466
4,0.11879,-0.072233,0.217656,-0.037744,0.228689,0.101356,0.101338,-0.2292,-0.19264,-0.064244,...,-0.030932,0.142956,-0.139646,-0.264926,0.178017,0.295808,0.069908,-0.22387,-0.12401,-0.174203


In [84]:
y_train.head()

620403    1
329746    1
277337    1
396191    1
22918     1
Name: rating, dtype: int64

In [85]:
y_train.shape

(1778064,)

#### AvgWord2Vec: X_test

In [86]:
%pip install tqdm
from tqdm import tqdm

# Compute the average vector for every review in the test set
X_test_aw2v = []

for i in tqdm(range(len(X_test))):
    X_test_aw2v.append(avg_w2v(X_test.iloc[i]))

Note: you may need to restart the kernel to use updated packages.


  return _methods._mean(a, axis=axis, dtype=dtype,
  ret = ret.dtype.type(ret / rcount)
100%|██████████| 444516/444516 [01:03<00:00, 7018.47it/s] 


In [87]:
type(X_test_aw2v)

list

In [88]:
X_test_aw2v = pd.Series(X_test_aw2v)

In [89]:
type(X_test_aw2v)

pandas.core.series.Series

In [90]:
X_test_aw2v.shape

(444516,)

#### Reshape Data: X_test_aw2v

In [91]:
# 1. Collect one single-row DataFrame per averaged vector
X_list = []

# 2. Reshape each test vector to (1, n_features) and wrap it in a DataFrame
for i in range(len(X_test_aw2v)):
    new_row_X = pd.DataFrame(X_test_aw2v[i].reshape(1, -1))
    X_list.append(new_row_X)

# 3. Concatenate all rows once; ignore_index=True gives the result a fresh 0..n-1 index
X_test_aw2v = pd.concat(X_list, ignore_index=True)

  X_test_aw2v = pd.concat(X_list, ignore_index=True)


In [92]:
X_test_aw2v.shape

(444516, 100)

In [93]:
y_test.shape

(444516,)

---

### Create DataFrame for the Train Dataset

In [94]:
df = pd.concat([X_train_aw2v,y_train], axis=1)

In [95]:
df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,91,92,93,94,95,96,97,98,99,rating
0,0.119244,-0.072363,0.168405,-0.038415,0.212748,0.075023,0.068017,-0.188081,-0.153527,-0.105357,...,0.108033,-0.143034,-0.248071,0.092699,0.274771,0.081986,-0.169380,-0.134308,-0.127511,1.0
1,0.126066,-0.101383,0.154407,-0.024823,0.149063,0.010644,-0.004535,-0.181987,-0.093213,-0.044640,...,0.074983,-0.101374,-0.117245,0.121603,0.216002,0.032895,-0.204323,-0.055498,-0.074425,
2,0.175703,-0.037280,0.204199,-0.067216,0.236159,0.086466,0.072782,-0.201782,-0.175925,-0.111014,...,0.129860,-0.143863,-0.280960,0.107747,0.303367,0.070864,-0.235883,-0.135164,-0.165065,1.0
3,0.099069,-0.101606,0.200201,-0.071384,0.249454,0.092130,0.066612,-0.222832,-0.163769,-0.094835,...,0.120884,-0.153834,-0.245457,0.164866,0.282704,0.080592,-0.222269,-0.123327,-0.172466,1.0
4,0.118790,-0.072233,0.217656,-0.037744,0.228689,0.101356,0.101338,-0.229200,-0.192640,-0.064244,...,0.142956,-0.139646,-0.264926,0.178017,0.295808,0.069908,-0.223870,-0.124010,-0.174203,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1919803,,,,,,,,,,,...,,,,,,,,,,1.0
1825860,,,,,,,,,,,...,,,,,,,,,,1.0
2138622,,,,,,,,,,,...,,,,,,,,,,1.0
2003606,,,,,,,,,,,...,,,,,,,,,,1.0


In [96]:
df.isna().sum()

0         356985
1         356985
2         356985
3         356985
4         356985
           ...  
96        356985
97        356985
98        356985
99        356985
rating    356136
Length: 101, dtype: int64
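
`pd.concat(..., axis=1)` aligns rows by index label, not by position. In this notebook `X_train_aw2v` was rebuilt with `ignore_index=True` (labels 0..n-1) while `y_train` still carries the shuffled labels of the original frame, which would explain both the NaN counts above and the rows that exist on only one side. The toy example below (the values and the column name `f0` are made up) reproduces the effect and shows one possible fix, `reset_index(drop=True)`.

```python
import pandas as pd

features = pd.DataFrame({"f0": [0.1, 0.2, 0.3]})                # fresh 0..2 index
labels = pd.Series([1, 0, 1], index=[7, 0, 5], name="rating")   # shuffled original index

# Label-based alignment: rows are matched on index values, unmatched cells become NaN.
misaligned = pd.concat([features, labels], axis=1)

# Positional alignment: give the labels the same 0..n-1 index first.
aligned = pd.concat([features, labels.reset_index(drop=True)], axis=1)

print(misaligned)
print(aligned)
```

If the same mechanism applies here, features and labels that happen to share an index label are paired by coincidence rather than by review, so the labels used for AvgWord2Vec training should be treated with caution.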

In [97]:
df.dropna(inplace=True)

In [98]:
df.isna().sum()

0         0
1         0
2         0
3         0
4         0
         ..
96        0
97        0
98        0
99        0
rating    0
Length: 101, dtype: int64

In [99]:
X_train_aw2v.shape, y_train.shape

((1778064, 100), (1778064,))

In [100]:
df.shape

(1421256, 101)

In [101]:
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,91,92,93,94,95,96,97,98,99,rating
0,0.119244,-0.072363,0.168405,-0.038415,0.212748,0.075023,0.068017,-0.188081,-0.153527,-0.105357,...,0.108033,-0.143034,-0.248071,0.092699,0.274771,0.081986,-0.16938,-0.134308,-0.127511,1.0
2,0.175703,-0.03728,0.204199,-0.067216,0.236159,0.086466,0.072782,-0.201782,-0.175925,-0.111014,...,0.12986,-0.143863,-0.28096,0.107747,0.303367,0.070864,-0.235883,-0.135164,-0.165065,1.0
3,0.099069,-0.101606,0.200201,-0.071384,0.249454,0.09213,0.066612,-0.222832,-0.163769,-0.094835,...,0.120884,-0.153834,-0.245457,0.164866,0.282704,0.080592,-0.222269,-0.123327,-0.172466,1.0
4,0.11879,-0.072233,0.217656,-0.037744,0.228689,0.101356,0.101338,-0.2292,-0.19264,-0.064244,...,0.142956,-0.139646,-0.264926,0.178017,0.295808,0.069908,-0.22387,-0.12401,-0.174203,1.0
5,0.19228,-0.058642,0.091718,-0.190576,0.113844,0.15631,0.097109,-0.156393,-0.200063,-0.159526,...,0.06184,-0.19998,-0.155446,0.085585,0.269822,0.035706,-0.182361,-0.184886,-0.126201,1.0


In [102]:
X_train_aw2v = df.iloc[:,:-1]

In [103]:
y_train_aw2v = df.iloc[:,-1]

In [104]:
X_train_aw2v.isna().sum()

0     0
1     0
2     0
3     0
4     0
     ..
95    0
96    0
97    0
98    0
99    0
Length: 100, dtype: int64

In [105]:
y_train_aw2v.isna().sum()

np.int64(0)

In [106]:
y_train_aw2v.shape

(1421256,)

---

### Handle Missing Values in Test Set

In [107]:
from sklearn.impute import SimpleImputer
import numpy as np

# 1. Fit a mean imputer on the training features
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X_train_aw2v)

# 2. Fill missing values in the test features with the training means
X_test_aw2v = imputer.transform(X_test_aw2v)

# 3. Prediction then uses the imputed array, e.g.:
# y_pred_aw2v = mnb_aw2v.predict(X_test_aw2v)

---

### Resample Data: AvgWord2Vec

In [108]:
from imblearn.over_sampling import RandomOverSampler
from collections import Counter

ros_aw2v = RandomOverSampler(random_state=42, sampling_strategy='minority')
X_train_aw2v, y_train_aw2v = ros_aw2v.fit_resample(X_train_aw2v, y_train_aw2v)

print("-" * 30)
print("Training Data Balance AFTER MCO:")
print(Counter(y_train_aw2v)) # Counter shows the new distribution

print("-" * 30)
print("Shape of Resampled Features:", X_train_aw2v.shape)
print("Shape of Resampled Target:", y_train_aw2v.shape)

------------------------------
Training Data Balance AFTER MCO:
Counter({1.0: 1222926, 0.0: 1222926})
------------------------------
Shape of Resampled Features: (2445852, 100)
Shape of Resampled Target: (2445852,)


---

## MultinomialNB

In [109]:
from sklearn.naive_bayes import MultinomialNB

In [110]:
mnb = MultinomialNB()

---

In [111]:
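# Use a fresh estimator: fit() returns the estimator itself, so reusing `mnb` would make every fitted name point to one model
mnb_bow = MultinomialNB().fit(X_train_bow, y_train_bow)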

In [112]:
y_pred_bow = mnb_bow.predict(X_test_bow)

In [113]:
y_pred_bow

array([1, 1, 1, ..., 0, 1, 1], shape=(444516,))

---

In [114]:
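# Use a fresh estimator: fit() returns the estimator itself, so reusing `mnb` would make every fitted name point to one model
mnb_tfidf = MultinomialNB().fit(X_train_tfidf, y_train_tfidf)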

In [115]:
y_pred_tfidf = mnb_tfidf.predict(X_test_tfidf)

In [116]:
y_pred_tfidf

array([1, 1, 1, ..., 0, 1, 1], shape=(444516,))

---

In [117]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

# Create a pipeline to first scale the data, then fit MNB
mnb_aw2v = Pipeline([
    ('scaler', MinMaxScaler()),
    ('clf', MultinomialNB())
])

# Fit the pipeline. MinMaxScaler maps each feature to [0, 1],
# which satisfies MultinomialNB's requirement of non-negative inputs.
mnb_aw2v.fit(X_train_aw2v, y_train_aw2v)

Pipeline(steps=[('scaler', MinMaxScaler()), ('clf', MultinomialNB())])
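
The scaling step exists because averaged embeddings contain negative values, which `MultinomialNB` rejects when fitting. Min-max scaling maps each feature linearly onto [0, 1] using the minimum and maximum seen during fitting. A minimal sketch of that formula for a single feature column, with made-up values and a hypothetical helper name:

```python
def min_max_scale(column):
    """Map values linearly onto [0, 1]; constant columns map to 0."""
    lo, hi = min(column), max(column)
    span = hi - lo
    if span == 0:
        return [0.0 for _ in column]
    return [(v - lo) / span for v in column]


embedding_dim = [-0.25, 0.0, 0.75]   # hypothetical values of one embedding dimension
scaled = min_max_scale(embedding_dim)
print(scaled)
```

Scaling removes the sign problem but does not turn embeddings into counts, which is what a multinomial model assumes. A classifier designed for continuous features may be a better fit for this representation.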


In [118]:
# mnb_aw2v = mnb.fit(X_train_aw2v, y_train_aw2v)

In [119]:
y_pred_aw2v = mnb_aw2v.predict(X_test_aw2v)

In [120]:
y_pred_aw2v

array([1., 0., 1., ..., 0., 0., 1.], shape=(444516,))

---

## Model Evaluation

In [121]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

In [122]:
conf_matrix_bow = confusion_matrix(y_test, y_pred_bow)
cls_report_bow = classification_report(y_test, y_pred_bow)
acc_score_bow = accuracy_score(y_test, y_pred_bow)

print('Performance Metrics: MNB BOW')
print(f'Confusion Matrix:\n{conf_matrix_bow}')
print(f'\nClassification Report:\n{cls_report_bow}')
print(f'\nAccuracy Score: {acc_score_bow}')

Performance Metrics: MNB BOW
Confusion Matrix:
[[ 51116  11582]
 [ 64960 316858]]

Classification Report:
              precision    recall  f1-score   support

           0       0.44      0.82      0.57     62698
           1       0.96      0.83      0.89    381818

    accuracy                           0.83    444516
   macro avg       0.70      0.82      0.73    444516
weighted avg       0.89      0.83      0.85    444516


Accuracy Score: 0.827808222876117
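
The figures in the classification report follow directly from the confusion matrix, whose rows are true classes and whose columns are predicted classes. As a worked check, the sketch below recomputes the class 0 precision and recall and the overall accuracy from the BOW matrix printed above.

```python
# Confusion matrix copied from the BOW output above:
# rows = true class (0, 1), columns = predicted class (0, 1).
conf_matrix = [[51116, 11582],
               [64960, 316858]]

true_0_pred_0, true_0_pred_1 = conf_matrix[0]
true_1_pred_0, true_1_pred_1 = conf_matrix[1]
total = sum(sum(row) for row in conf_matrix)

# Precision: of everything predicted as class 0, how much really is class 0.
precision_0 = true_0_pred_0 / (true_0_pred_0 + true_1_pred_0)
# Recall: of all true class 0 reviews, how many were found.
recall_0 = true_0_pred_0 / (true_0_pred_0 + true_0_pred_1)
accuracy = (true_0_pred_0 + true_1_pred_1) / total

print(f"precision (class 0): {precision_0:.2f}")
print(f"recall    (class 0): {recall_0:.2f}")
print(f"accuracy           : {accuracy:.6f}")
```

The low precision for class 0 comes from the 64,960 positive reviews predicted as negative, which outnumber the correctly identified negative reviews.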


In [123]:
conf_matrix_tfidf = confusion_matrix(y_test, y_pred_tfidf)
cls_report_tfidf = classification_report(y_test, y_pred_tfidf)
acc_score_tfidf = accuracy_score(y_test, y_pred_tfidf)

print('Performance Metrics: MNB TFIDF')
print(f'Confusion Matrix:\n{conf_matrix_tfidf}')
print(f'\nClassification Report:\n{cls_report_tfidf}')
print(f'\nAccuracy Score: {acc_score_tfidf}')

Performance Metrics: MNB TFIDF
Confusion Matrix:
[[ 50148  12550]
 [ 60509 321309]]

Classification Report:
              precision    recall  f1-score   support

           0       0.45      0.80      0.58     62698
           1       0.96      0.84      0.90    381818

    accuracy                           0.84    444516
   macro avg       0.71      0.82      0.74    444516
weighted avg       0.89      0.84      0.85    444516


Accuracy Score: 0.8356437113624706


In [124]:
conf_matrix_aw2v = confusion_matrix(y_test, y_pred_aw2v)
cls_report_aw2v = classification_report(y_test, y_pred_aw2v)
acc_score_aw2v = accuracy_score(y_test, y_pred_aw2v)

print('Performance Metrics: MNB AW2V')
print(f'Confusion Matrix:\n{conf_matrix_aw2v}')
print(f'\nClassification Report:\n{cls_report_aw2v}')
print(f'\nAccuracy Score: {acc_score_aw2v}')

Performance Metrics: MNB AW2V
Confusion Matrix:
[[ 37997  24701]
 [201520 180298]]

Classification Report:
              precision    recall  f1-score   support

           0       0.16      0.61      0.25     62698
           1       0.88      0.47      0.61    381818

    accuracy                           0.49    444516
   macro avg       0.52      0.54      0.43    444516
weighted avg       0.78      0.49      0.56    444516


Accuracy Score: 0.49108468536565614


---

In [125]:
end_time = time.time()
elapsed_seconds = end_time - start_time
elapsed_minutes = elapsed_seconds / 60

print(f"Execution time: {elapsed_minutes:.2f} minutes")

Execution time: 18.47 minutes
