# Business Understanding
* The days of customers walking into a shop to buy what they need or want are largely behind us, all the more so for items that are not basic needs. More and more clients prefer to make purchases from the comfort of their homes.
* The range of goods a retailer can market online is practically limitless; however, customers easily tire of scrolling through an endless catalogue of items for sale.
* Hence the need for a recommendation system that gives clients a seamless buying experience. Reading culture is changing, which is why we chose the Amazon books dataset.
* A recommendation system will help buyers find the most suitable and trending books to buy.
* The target audience is both retailers and purchasers.





# Data Understanding & Source
* The data was obtained from https://amazon-reviews-2023.github.io/ in JSONL (JSON Lines) format, an efficient format for storing data that is unstructured or produced over time.
* It contains a list of books sold on Amazon. The original dataset contains 4 million rows, spanning 1996 to 2023. We trim it to the most recent 500k rows to make it easier to work with.
* The dataset contains the following features/columns.

| Column Name | Description |
|---|---|
| rating | Rating of the product (from 1.0 to 5.0). |
| title_x | Title of the user review. |
| text | Text body of the user review. |
| images | Links to images (comma-separated if multiple). |
| asin (product key) | Unique identifier for the product. |
| parent_asin | Identifier for the parent product (applicable for variations). |
| user_id | Unique identifier for the reviewer. |
| timestamp | Time of the review (Unix epoch, in milliseconds). |
| helpful_vote | Number of helpful votes received by the review. |
| verified_purchase | Indicates whether the reviewer purchased the product (True/False). |
| main_category | Main category (domain) to which the product belongs (e.g., Electronics, Clothing). |
| title_y | Name of the product as mentioned in the review. |
| price | Price of the product in US dollars. |
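
Each line of a JSONL file is a standalone JSON object, so the file can be parsed one record at a time and trimmed to the most recent reviews without building a full DataFrame first. The following is a minimal sketch, assuming `timestamp` holds Unix time in milliseconds (as the sample values further down suggest). The three records and the helper name `most_recent` are made up for illustration and are not part of the original pipeline.

```python
import heapq
import io
import json

# Three made-up records in JSONL form: one JSON object per line.
sample_jsonl = io.StringIO(
    '{"asin": "A1", "rating": 5, "timestamp": 1694657549017}\n'
    '{"asin": "A2", "rating": 4, "timestamp": 1683048302761}\n'
    '{"asin": "A3", "rating": 3, "timestamp": 1693063638325}\n'
)

def most_recent(lines, n):
    """Return the n records with the largest timestamps (hypothetical helper)."""
    # Parse lazily, skipping blank lines, so only n records are kept in memory.
    records = (json.loads(line) for line in lines if line.strip())
    return heapq.nlargest(n, records, key=lambda r: r["timestamp"])

recent = most_recent(sample_jsonl, 2)
print([r["asin"] for r in recent])
```

On the real file, the same helper could be given an open file handle in place of the `StringIO` object.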



## Data Importation

In [1]:
# Mount the google drive
# from google.colab import drive
# drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
# Import the necessary libraries
import pandas as pd
import json
import numpy as np
import matplotlib.pyplot as plt

In [3]:
# Load the merged dataset
# file_path = '/content/drive/My Drive/Capstone_Group_14_Project/merged_Books.jsonl'

# # Initialize an empty list to store the parsed JSON objects
# data = []

# # Read each line of the JSON Lines file and parse it
# with open(file_path, 'r') as f:
#     for line in f:
#         data.append(json.loads(line))

# # Convert the list of JSON objects into a DataFrame
# df = pd.DataFrame(data)
# df.head()

Unnamed: 0,rating,title_x,text,images,asin,parent_asin,user_id,timestamp,helpful_vote,verified_purchase,main_category,title_y,price
0,5,Wonderful and Inspiring,This book is wonderful and inspiring for kids ...,[],B0C6Z8N9N8,B0C6Z8N9N8,AG2FEEHWHCQELOHBIDQDROZ3LSNA,1694657549017,0,False,Books,Of Life: The Rollercoaster,from 11.99
1,5,Awesome book,This is a wonderful children’s book! My daught...,[],B0C6Z8N9N8,B0C6Z8N9N8,AERUMG7KTKZAIOQ3PO5LJUF33UKQ,1693063638325,0,False,Books,Of Life: The Rollercoaster,from 11.99
2,5,Amazing,Product arrived quickly in great condition. Be...,[],1401241883,1401241883,AEK3AFSE3D2BSOC6XI65XNO23MKQ,1694654386695,0,True,Books,The Sandman Omnibus Vol. 1,80.23
3,5,Got this at a great price.,I payed $89.00 dollars. When it first came out...,[],1401241883,1401241883,AFPYBFVIJI3GFPPFANRYIBJZKPLA,1683048302761,0,True,Books,The Sandman Omnibus Vol. 1,80.23
4,5,The Best of the Best,Neil Gaimans stories are spellbinding. Moreove...,[],1401241883,1401241883,AECBBBUARXJEZYZS2PXN2K66A4DA,1679249708115,0,False,Books,The Sandman Omnibus Vol. 1,80.23
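
As an alternative to the line-by-line loop, pandas can read JSON Lines directly. The sketch below is an illustration on made-up records, not the method used in this notebook. Note that `read_json` by default tries to parse columns with date-like names such as `timestamp` into datetimes, so `convert_dates=False` is passed here to keep the raw integers.

```python
import io

import pandas as pd

# Made-up records standing in for the first lines of the real file.
sample_jsonl = (
    '{"asin": "B0C6Z8N9N8", "rating": 5, "timestamp": 1694657549017}\n'
    '{"asin": "B0C6Z8N9N8", "rating": 5, "timestamp": 1693063638325}\n'
    '{"asin": "B000000000", "rating": 4, "timestamp": 1683048302761}\n'
)

demo_df = pd.read_json(
    io.StringIO(sample_jsonl),
    lines=True,            # one JSON object per line
    nrows=2,               # read only the first two records
    convert_dates=False,   # keep `timestamp` as the raw integer
    dtype={"asin": str},   # ask pandas to keep identifiers as text
)
print(demo_df.shape)
```

With a real file, the `StringIO` object would be replaced by the file path. `nrows` only works together with `lines=True`.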


In [4]:
# Preview the attributes of the data

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300000 entries, 0 to 299999
Data columns (total 13 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   rating             300000 non-null  int64 
 1   title_x            300000 non-null  object
 2   text               300000 non-null  object
 3   images             300000 non-null  object
 4   asin               300000 non-null  object
 5   parent_asin        300000 non-null  object
 6   user_id            300000 non-null  object
 7   timestamp          300000 non-null  int64 
 8   helpful_vote       300000 non-null  int64 
 9   verified_purchase  300000 non-null  bool  
 10  main_category      299989 non-null  object
 11  title_y            300000 non-null  object
 12  price              267594 non-null  object
dtypes: bool(1), int64(3), object(9)
memory usage: 27.8+ MB


In [5]:
# Check the number of rows and columns in the data
df.shape

(300000, 13)

## Data Cleaning

In [6]:
# Identify the unique values in the main_category column

print(df['main_category'].unique())

['Books' 'Buy a Kindle' 'Musical Instruments' 'Audible Audiobooks' ''
 'Toys & Games' 'Office Products' 'AMAZON FASHION' 'Amazon Home' None
 'Tools & Home Improvement' 'Arts, Crafts & Sewing'
 'Industrial & Scientific']


In [7]:
# Count the null values in the main_category column of the dataset

df['main_category'].isnull().sum()

11
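
The unique values printed earlier include an empty string `''` as well as `None`. `isnull()` counts only true missing values, so rows with an empty category are not part of the 11. If those rows should also be treated as missing, one option is to convert blank strings to `NaN` first. This is an optional extra step sketched on a made-up frame; the notebook itself keeps these rows.

```python
import numpy as np
import pandas as pd

# Made-up frame mirroring the kinds of values seen in main_category.
demo = pd.DataFrame({"main_category": ["Books", "", None, "Toys & Games"]})

# isnull() only counts real missing values, so the empty string is not included.
nulls_before = demo["main_category"].isnull().sum()

# Treat empty or whitespace-only strings as missing before counting or dropping.
demo["main_category"] = demo["main_category"].replace(r"^\s*$", np.nan, regex=True)
nulls_after = demo["main_category"].isnull().sum()

print(nulls_before, nulls_after)
```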

In [8]:
# Drop the rows with null values in the main_category column
df.dropna(subset=['main_category'], inplace=True)

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 299989 entries, 0 to 299999
Data columns (total 13 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   rating             299989 non-null  int64 
 1   title_x            299989 non-null  object
 2   text               299989 non-null  object
 3   images             299989 non-null  object
 4   asin               299989 non-null  object
 5   parent_asin        299989 non-null  object
 6   user_id            299989 non-null  object
 7   timestamp          299989 non-null  int64 
 8   helpful_vote       299989 non-null  int64 
 9   verified_purchase  299989 non-null  bool  
 10  main_category      299989 non-null  object
 11  title_y            299989 non-null  object
 12  price              267592 non-null  object
dtypes: bool(1), int64(3), object(9)
memory usage: 30.0+ MB


In [9]:
# Strip the 'from' prefix and the literal string 'None' from the price column
# Treat empty strings and the '—' placeholder as missing values
# Convert the price column to data type float

df['price'] = df['price'].astype(str).str.replace(r'(from|None)\s*', '', regex=True)
df['price'] = df['price'].replace(['', '—'], np.nan)
df['price'] = df['price'].astype(float)
df.info()


<class 'pandas.core.frame.DataFrame'>
Index: 299989 entries, 0 to 299999
Data columns (total 13 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   rating             299989 non-null  int64  
 1   title_x            299989 non-null  object 
 2   text               299989 non-null  object 
 3   images             299989 non-null  object 
 4   asin               299989 non-null  object 
 5   parent_asin        299989 non-null  object 
 6   user_id            299989 non-null  object 
 7   timestamp          299989 non-null  int64  
 8   helpful_vote       299989 non-null  int64  
 9   verified_purchase  299989 non-null  bool   
 10  main_category      299989 non-null  object 
 11  title_y            299989 non-null  object 
 12  price              265145 non-null  float64
dtypes: bool(1), float64(1), int64(3), object(8)
memory usage: 30.0+ MB
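
The conversion above lists every non-numeric placeholder explicitly. A more defensive variant, sketched here on made-up values, strips the `from` prefix and then lets `pd.to_numeric` with `errors='coerce'` turn anything that still cannot be parsed into `NaN`. This is an alternative to the cell above, not a description of it.

```python
import pandas as pd

# Made-up price strings covering the patterns handled in the cell above.
raw = pd.Series(["from 11.99", "80.23", None, "—", ""], dtype=object)

# Strip a leading "from"; missing values pass through unchanged.
stripped = raw.str.replace(r"^from\s*", "", regex=True)

# Anything that is not a valid number becomes NaN instead of raising an error.
prices = pd.to_numeric(stripped, errors="coerce")
print(prices)
```

The trade-off is that unexpected formats are silently coerced to `NaN`, so it is worth comparing the null count before and after.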


In [10]:
# Sum up the null values in the price column

df['price'].isnull().sum()

34844

In [11]:
# Drop the rows with null values in the price column

df.dropna(subset=['price'], inplace=True)

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 265145 entries, 0 to 299999
Data columns (total 13 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   rating             265145 non-null  int64  
 1   title_x            265145 non-null  object 
 2   text               265145 non-null  object 
 3   images             265145 non-null  object 
 4   asin               265145 non-null  object 
 5   parent_asin        265145 non-null  object 
 6   user_id            265145 non-null  object 
 7   timestamp          265145 non-null  int64  
 8   helpful_vote       265145 non-null  int64  
 9   verified_purchase  265145 non-null  bool   
 10  main_category      265145 non-null  object 
 11  title_y            265145 non-null  object 
 12  price              265145 non-null  float64
dtypes: bool(1), float64(1), int64(3), object(8)
memory usage: 26.6+ MB


In [12]:
# Rename the title_x and title_y columns to title_rating and title_book respectively

df = df.rename(columns={'title_x': 'title_rating', 'title_y': 'title_book'})
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 265145 entries, 0 to 299999
Data columns (total 13 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   rating             265145 non-null  int64  
 1   title_rating       265145 non-null  object 
 2   text               265145 non-null  object 
 3   images             265145 non-null  object 
 4   asin               265145 non-null  object 
 5   parent_asin        265145 non-null  object 
 6   user_id            265145 non-null  object 
 7   timestamp          265145 non-null  int64  
 8   helpful_vote       265145 non-null  int64  
 9   verified_purchase  265145 non-null  bool   
 10  main_category      265145 non-null  object 
 11  title_book         265145 non-null  object 
 12  price              265145 non-null  float64
dtypes: bool(1), float64(1), int64(3), object(8)
memory usage: 26.6+ MB
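
The `_x` and `_y` suffixes are what `DataFrame.merge` appends by default when both inputs share a column name. Assuming the file was produced by merging a reviews table with a book-metadata table on `parent_asin` (an assumption, since the merge step is not shown here), descriptive names could be assigned at merge time through the `suffixes` parameter. The two small frames below are made up.

```python
import pandas as pd

# Made-up review and metadata tables that both contain a `title` column.
reviews = pd.DataFrame({
    "parent_asin": ["1401241883"],
    "title": ["Amazing"],                      # review headline
    "rating": [5],
})
books = pd.DataFrame({
    "parent_asin": ["1401241883"],
    "title": ["The Sandman Omnibus Vol. 1"],   # product name
    "price": [80.23],
})

# Custom suffixes replace the default ('_x', '_y') on overlapping column names.
merged = reviews.merge(books, on="parent_asin", suffixes=("_rating", "_book"))
print(merged.columns.tolist())
```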


In [13]:
# Preview the data with new columns

df.head()

Unnamed: 0,rating,title_rating,text,images,asin,parent_asin,user_id,timestamp,helpful_vote,verified_purchase,main_category,title_book,price
0,5,Wonderful and Inspiring,This book is wonderful and inspiring for kids ...,[],B0C6Z8N9N8,B0C6Z8N9N8,AG2FEEHWHCQELOHBIDQDROZ3LSNA,1694657549017,0,False,Books,Of Life: The Rollercoaster,11.99
1,5,Awesome book,This is a wonderful children’s book! My daught...,[],B0C6Z8N9N8,B0C6Z8N9N8,AERUMG7KTKZAIOQ3PO5LJUF33UKQ,1693063638325,0,False,Books,Of Life: The Rollercoaster,11.99
2,5,Amazing,Product arrived quickly in great condition. Be...,[],1401241883,1401241883,AEK3AFSE3D2BSOC6XI65XNO23MKQ,1694654386695,0,True,Books,The Sandman Omnibus Vol. 1,80.23
3,5,Got this at a great price.,I payed $89.00 dollars. When it first came out...,[],1401241883,1401241883,AFPYBFVIJI3GFPPFANRYIBJZKPLA,1683048302761,0,True,Books,The Sandman Omnibus Vol. 1,80.23
4,5,The Best of the Best,Neil Gaimans stories are spellbinding. Moreove...,[],1401241883,1401241883,AECBBBUARXJEZYZS2PXN2K66A4DA,1679249708115,0,False,Books,The Sandman Omnibus Vol. 1,80.23


In [14]:
# Convert the text and title_rating columns to lower case and remove punctuation marks

import string

# Build the translation table once; string.punctuation covers ASCII punctuation only
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def clean_text(text):
    """Lower-case a string and strip ASCII punctuation; non-strings are cast to str."""
    if isinstance(text, str):
        return text.lower().translate(PUNCT_TABLE)
    return str(text)

df['text'] = df['text'].apply(clean_text)
df['title_rating'] = df['title_rating'].apply(clean_text)
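
`string.punctuation` contains only ASCII punctuation, so typographic characters such as the curly apostrophe `’` survive this cleaning step. A minimal sketch of a Unicode-aware alternative follows; the helper name `strip_all_punctuation` and the sample sentence are illustrative only.

```python
import string
import unicodedata

def strip_all_punctuation(text):
    """Drop every character whose Unicode category is punctuation (P*)."""
    return "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))

sample = "This is a wonderful children’s book!"

# ASCII-only removal: the exclamation mark goes, the curly apostrophe stays.
ascii_only = sample.translate(str.maketrans("", "", string.punctuation))
print(ascii_only)

# Unicode-aware removal: both characters are dropped.
print(strip_all_punctuation(sample))
```

Symbols such as `$` belong to Unicode category S rather than P, so the two approaches would need to be combined to cover everything in `string.punctuation` as well.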



In [15]:
# Tokenize and remove stop words from the text and title_rating columns

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download the tokenizer models and the stop-word list
nltk.download('punkt')
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    """Tokenize text and return the tokens that are not in stop_words."""
    tokens = word_tokenize(text)
    return [word for word in tokens if word not in stop_words]

df['tokenized_text'] = df['text'].apply(remove_stopwords)
df['tokenized_title_rating'] = df['title_rating'].apply(remove_stopwords)
df.head()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Unnamed: 0,rating,title_rating,text,images,asin,parent_asin,user_id,timestamp,helpful_vote,verified_purchase,main_category,title_book,price,tokenized_text,tokenized_title_rating
0,5,wonderful and inspiring,this book is wonderful and inspiring for kids ...,[],B0C6Z8N9N8,B0C6Z8N9N8,AG2FEEHWHCQELOHBIDQDROZ3LSNA,1694657549017,0,False,Books,Of Life: The Rollercoaster,11.99,"[book, wonderful, inspiring, kids, adults, nee...","[wonderful, inspiring]"
1,5,awesome book,this is a wonderful children’s book my daughte...,[],B0C6Z8N9N8,B0C6Z8N9N8,AERUMG7KTKZAIOQ3PO5LJUF33UKQ,1693063638325,0,False,Books,Of Life: The Rollercoaster,11.99,"[wonderful, children, ’, book, daughter, 5, yo...","[awesome, book]"
2,5,amazing,product arrived quickly in great condition bea...,[],1401241883,1401241883,AEK3AFSE3D2BSOC6XI65XNO23MKQ,1694654386695,0,True,Books,The Sandman Omnibus Vol. 1,80.23,"[product, arrived, quickly, great, condition, ...",[amazing]
3,5,got this at a great price,i payed 8900 dollars when it first came out it...,[],1401241883,1401241883,AFPYBFVIJI3GFPPFANRYIBJZKPLA,1683048302761,0,True,Books,The Sandman Omnibus Vol. 1,80.23,"[payed, 8900, dollars, first, came, sold, 1500...","[got, great, price]"
4,5,the best of the best,neil gaimans stories are spellbinding moreover...,[],1401241883,1401241883,AECBBBUARXJEZYZS2PXN2K66A4DA,1679249708115,0,False,Books,The Sandman Omnibus Vol. 1,80.23,"[neil, gaimans, stories, spellbinding, moreove...","[best, best]"


In [16]:
# Display a frequency distribution of the most common words

from nltk.probability import FreqDist
from itertools import chain

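def common_words(df, column, n=15):
    """Return the n most frequent tokens in a column of token lists."""
    # Use the column argument rather than a hard-coded column name
    all_tokens = chain.from_iterable(df[column])
    fdist = FreqDist(all_tokens)
    return fdist.most_common(n)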

common_words(df, 'tokenized_title_rating', 15)

[('book', 45812),
 ('great', 33024),
 ('read', 20586),
 ('good', 14983),
 ('love', 11023),
 ('story', 9757),
 ('fun', 6776),
 ('’', 6483),
 ('excellent', 5908),
 ('amazing', 5735),
 ('beautiful', 5225),
 ('best', 4307),
 ('series', 4232),
 ('cute', 4186),
 ('loved', 4086)]

In [17]:
# Add 'book', '’' and 'story' to the stop words list and remove them from tokenized_title_rating in the dataset

additional_stop_words = {'book', '’', 'story'}

stop_words.update(additional_stop_words)

df['tokenized_title_rating'] = df['title_rating'].apply(lambda x: remove_stopwords(x))

print(common_words(df, 'tokenized_title_rating', 5))


[('great', 33024), ('read', 20586), ('good', 14983), ('love', 11023), ('fun', 6776)]
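
The counting step does not depend on NLTK specifically. As a rough sketch, the same result can be obtained with `collections.Counter`, whose `most_common` method returns (token, count) pairs in descending order of count. The token lists and the function name `top_tokens` are invented for illustration.

```python
from collections import Counter
from itertools import chain

# Made-up token lists standing in for the tokenized_title_rating column.
token_lists = [["great", "read"], ["great", "book"], ["fun", "read", "great"]]
extra_stop_words = {"book"}

def top_tokens(rows, n, exclude=frozenset()):
    """Return the n most frequent tokens across rows, skipping excluded ones."""
    tokens = (t for t in chain.from_iterable(rows) if t not in exclude)
    return Counter(tokens).most_common(n)

print(top_tokens(token_lists, 2, exclude=extra_stop_words))
```

Filtering at counting time, as done here, avoids re-tokenizing the column each time the stop-word list changes.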
