# Step 1: NLP Book Recommendation System - Data Wrangling
Amazon Books Reviews Data
Data source: https://www.kaggle.com/datasets/mohamedbakhet/amazon-books-reviews?select=books_data.csv
This is a rich dataset for Natural Language Processing: it contains 3,000,000 text reviews from users, plus text descriptions and categories for 212,403 books, which makes it well suited to text analysis.

# Importing libraries and reading the data

In [22]:
import pandas as pd
import numpy as np

In [23]:
# Reading the ratings data

ratings = pd.read_csv('Books_rating.csv')

In [24]:
ratings.head(3)

Unnamed: 0,Id,Title,Price,User_id,profileName,review/helpfulness,review/score,review/time,review/summary,review/text
0,1882931173,Its Only Art If Its Well Hung!,,AVCGYZL8FQQTD,"Jim of Oz ""jim-of-oz""",7/7,4.0,940636800,Nice collection of Julie Strain images,This is only for Julie Strain fans. It's a col...
1,826414346,Dr. Seuss: American Icon,,A30TK6U7DNS82R,Kevin Killian,10/10,5.0,1095724800,Really Enjoyed It,I don't care much for Dr. Seuss but after read...
2,826414346,Dr. Seuss: American Icon,,A3UH4UZ4RSVO82,John Granger,10/11,5.0,1078790400,Essential for every personal and Public Library,"If people become the books they read and if ""t..."


In [25]:
# Null Values in the rating data

ratings.isna().sum()

Id                          0
Title                     208
Price                 2518829
User_id                561787
profileName            561886
review/helpfulness          0
review/score                0
review/time                 0
review/summary             38
review/text                 8
dtype: int64
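
Raw null counts are hard to judge without the row count next to them. A minimal sketch of the same check expressed as a percentage per column follows. The `toy` frame and its values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the ratings data (hypothetical values, not from the dataset).
toy = pd.DataFrame({
    "Title": ["A", "B", None, "D"],
    "Price": [None, None, None, 9.99],
    "review/score": [4.0, 5.0, 3.0, 1.0],
})

# isna() marks missing cells; the mean of a boolean column is the fraction
# of True values, so this gives the share of missing values per column.
null_share = toy.isna().mean().mul(100).round(1).sort_values(ascending=False)
print(null_share)
```

Applied to `ratings`, the same expression would rank `Price` first, since most of its values are missing.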

In [26]:
# Reading the books data

books = pd.read_csv('books_data.csv')

In [27]:
books.head(2)

Unnamed: 0,Title,description,authors,image,previewLink,publisher,publishedDate,infoLink,categories,ratingsCount
0,Its Only Art If Its Well Hung!,,['Julie Strain'],http://books.google.com/books/content?id=DykPA...,http://books.google.nl/books?id=DykPAAAACAAJ&d...,,1996,http://books.google.nl/books?id=DykPAAAACAAJ&d...,['Comics & Graphic Novels'],
1,Dr. Seuss: American Icon,Philip Nel takes a fascinating look into the k...,['Philip Nel'],http://books.google.com/books/content?id=IjvHQ...,http://books.google.nl/books?id=IjvHQsCn_pgC&p...,A&C Black,2005-01-01,http://books.google.nl/books?id=IjvHQsCn_pgC&d...,['Biography & Autobiography'],


In [28]:
# Null values in the books data

books.isna().sum()

Title                 1
description       68442
authors           31413
image             52075
previewLink       23836
publisher         75886
publishedDate     25305
infoLink          23836
categories        41199
ratingsCount     162652
dtype: int64

In [29]:
books.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 212404 entries, 0 to 212403
Data columns (total 10 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   Title          212403 non-null  object 
 1   description    143962 non-null  object 
 2   authors        180991 non-null  object 
 3   image          160329 non-null  object 
 4   previewLink    188568 non-null  object 
 5   publisher      136518 non-null  object 
 6   publishedDate  187099 non-null  object 
 7   infoLink       188568 non-null  object 
 8   categories     171205 non-null  object 
 9   ratingsCount   49752 non-null   float64
dtypes: float64(1), object(9)
memory usage: 16.2+ MB


# Dropping Unneeded Columns

In [30]:
books.columns

Index(['Title', 'description', 'authors', 'image', 'previewLink', 'publisher',
       'publishedDate', 'infoLink', 'categories', 'ratingsCount'],
      dtype='object')

In [31]:
# Dropping unneeded columns in the books data

books = books.drop(columns=['image', 'previewLink', 'infoLink'])

In [32]:
books.head(3)

Unnamed: 0,Title,description,authors,publisher,publishedDate,categories,ratingsCount
0,Its Only Art If Its Well Hung!,,['Julie Strain'],,1996,['Comics & Graphic Novels'],
1,Dr. Seuss: American Icon,Philip Nel takes a fascinating look into the k...,['Philip Nel'],A&C Black,2005-01-01,['Biography & Autobiography'],
2,Wonderful Worship in Smaller Churches,This resource includes twelve principles in un...,['David R. Ray'],,2000,['Religion'],


In [33]:
# Dropping unneeded columns in the ratings data

ratings = ratings.drop(columns=['Price', 'profileName', 'review/time'])
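
An alternative to dropping columns after loading is to skip them at read time, which may reduce memory use on a file this large. The sketch below assumes the same three unneeded columns and uses a tiny in-memory CSV with made-up rows in place of `Books_rating.csv`. The name `ratings_small` is hypothetical.

```python
import io
import pandas as pd

# Tiny in-memory CSV standing in for Books_rating.csv (hypothetical rows).
csv_text = (
    "Id,Title,Price,User_id,profileName,review/helpfulness,"
    "review/score,review/time,review/summary,review/text\n"
    "1,Book A,,U1,Ann,1/1,4.0,940636800,Good,Nice read\n"
    "2,Book B,9.99,U2,Bob,0/2,2.0,1095724800,Meh,Not for me\n"
)

unneeded = {"Price", "profileName", "review/time"}

# usecols accepts a callable that is evaluated on each column name;
# only columns for which it returns True are parsed.
ratings_small = pd.read_csv(
    io.StringIO(csv_text),
    usecols=lambda name: name not in unneeded,
)
print(ratings_small.columns.tolist())
```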

In [34]:
ratings.head(3)

Unnamed: 0,Id,Title,User_id,review/helpfulness,review/score,review/summary,review/text
0,1882931173,Its Only Art If Its Well Hung!,AVCGYZL8FQQTD,7/7,4.0,Nice collection of Julie Strain images,This is only for Julie Strain fans. It's a col...
1,826414346,Dr. Seuss: American Icon,A30TK6U7DNS82R,10/10,5.0,Really Enjoyed It,I don't care much for Dr. Seuss but after read...
2,826414346,Dr. Seuss: American Icon,A3UH4UZ4RSVO82,10/11,5.0,Essential for every personal and Public Library,"If people become the books they read and if ""t..."


# Calculating the average rating and total number of ratings for each book and adding them to the books data

In [35]:
# Finding the average rating for each book
avg_rating = ratings.groupby('Title')['review/score'].mean()

In [36]:
avg_rating = pd.DataFrame(avg_rating)
avg_rating.head(3)

Unnamed: 0_level_0,review/score
Title,Unnamed: 1_level_1
""" Film technique, "" and, "" Film acting """,4.5
""" We'll Always Have Paris"": The Definitive Guide to Great Lines from the Movies",5.0
"""... And Poetry is Born ..."" Russian Classical Poetry",4.0


In [37]:
avg_rating.info()

<class 'pandas.core.frame.DataFrame'>
Index: 212403 entries, " Film technique, " and, " Film acting " to you can do anything with crepes
Data columns (total 1 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   review/score  212403 non-null  float64
dtypes: float64(1)
memory usage: 3.2+ MB


In [38]:
# Finding the total number of ratings for each book

num_rating = ratings.groupby('Title')['review/score'].count()

In [39]:
num_rating = pd.DataFrame(num_rating)
num_rating.head(3)

Unnamed: 0_level_0,review/score
Title,Unnamed: 1_level_1
""" Film technique, "" and, "" Film acting """,2
""" We'll Always Have Paris"": The Definitive Guide to Great Lines from the Movies",2
"""... And Poetry is Born ..."" Russian Classical Poetry",1


In [40]:
# Average rating and number of ratings for each book

avg_num_rating = avg_rating.join(num_rating, how='left', on='Title', lsuffix="_Avg", rsuffix="_Count")
avg_num_rating.head(3)

Unnamed: 0_level_0,review/score_Avg,review/score_Count
Title,Unnamed: 1_level_1,Unnamed: 2_level_1
""" Film technique, "" and, "" Film acting """,4.5,2
""" We'll Always Have Paris"": The Definitive Guide to Great Lines from the Movies",5.0,2
"""... And Poetry is Born ..."" Russian Classical Poetry",4.0,1

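The mean and the count can also be computed in a single `groupby` pass with named aggregation, which avoids the intermediate frames and the suffix handling in the join. This is a sketch on toy data, not the approach used above. The column names `avg_score` and `num_ratings` are hypothetical.

```python
import pandas as pd

# Toy stand-in for the ratings data (hypothetical values).
toy_ratings = pd.DataFrame({
    "Title": ["Book A", "Book A", "Book B"],
    "review/score": [4.0, 5.0, 3.0],
})

# Named aggregation: each keyword becomes an output column,
# and its value names the aggregation applied to review/score.
rating_stats = toy_ratings.groupby("Title")["review/score"].agg(
    avg_score="mean",
    num_ratings="count",
)
print(rating_stats)
```

By default `groupby` leaves out rows whose key is null, so reviews with a missing `Title` do not contribute to either statistic.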

In [41]:
# Adding the average rating and total number of ratings for each book to the books data

books = avg_num_rating.join(books.set_index('Title'), how='right', on='Title')

In [42]:
books.head(3)

Unnamed: 0,Title,review/score_Avg,review/score_Count,description,authors,publisher,publishedDate,categories,ratingsCount
Its Only Art If Its Well Hung!,Its Only Art If Its Well Hung!,4.0,1.0,,['Julie Strain'],,1996,['Comics & Graphic Novels'],
Dr. Seuss: American Icon,Dr. Seuss: American Icon,4.555556,9.0,Philip Nel takes a fascinating look into the k...,['Philip Nel'],A&C Black,2005-01-01,['Biography & Autobiography'],
Wonderful Worship in Smaller Churches,Wonderful Worship in Smaller Churches,5.0,4.0,This resource includes twelve principles in un...,['David R. Ray'],,2000,['Religion'],
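
In the output above, `Title` appears both as the index and as a regular column. If that duplication is unwanted, one option is a left `merge` on the `Title` column, which keeps a plain integer index. The sketch below uses toy frames with hypothetical names (`toy_books`, `toy_stats`, `avg_score`, `num_ratings`).

```python
import pandas as pd

# Toy stand-ins for the books data and the per-title rating statistics.
toy_books = pd.DataFrame({
    "Title": ["Book A", "Book B", "Book C"],
    "authors": ["['X']", "['Y']", "['Z']"],
})
toy_stats = pd.DataFrame(
    {"avg_score": [4.5, 3.0], "num_ratings": [2, 1]},
    index=pd.Index(["Book A", "Book B"], name="Title"),
)

# reset_index() turns the Title index into a column so both sides share a key.
# how="left" keeps every book; books without ratings get NaN statistics.
merged = toy_books.merge(toy_stats.reset_index(), on="Title", how="left")
print(merged)
```

Because unmatched rows receive NaN, the count column is stored as float after the merge, which matches the `1.0` and `9.0` values seen in the output above.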


In [43]:
# What does the review/text column look like?
ratings.loc[1, 'review/text']

"I don't care much for Dr. Seuss but after reading Philip Nel's book I changed my mind--that's a good testimonial to the power of Rel's writing and thinking. Rel plays Dr. Seuss the ultimate compliment of treating him as a serious poet as well as one of the 20th century's most interesting visual artists, and after reading his book I decided that a trip to the Mandeville Collections of the library at University of California in San Diego was in order, so I could visit some of the incredible Seuss/Geisel holdings they have there.There's almost too much to take in, for, like William Butler Yeats, Seuss led a career that constantly shifted and metamoprhized itself to meet new historical and political cirsumstances, so he seems to have been both a leftist and a conservative at different junctures of his career, both in politics and in art. As Nel shows us, he was once a cartoonist for the fabled PM magazine and, like Andy Warhol, he served his time slaving in the ad business too. All was in

In [44]:
# What does the review/summary column look like?
ratings.loc[1, 'review/summary']

'Really Enjoyed It'

In [45]:
books.isna().sum()

Title                      1
review/score_Avg           1
review/score_Count         1
description            68442
authors                31413
publisher              75886
publishedDate          25305
categories             41199
ratingsCount          162652
dtype: int64

# Dropping the ratings count column from the books data

In [46]:
# The ratingsCount column has the most null values (162,652). I am dropping it because it is mostly empty
# and because the books data now includes the total number of ratings computed from the reviews.

books.drop('ratingsCount', inplace=True, axis=1)

In [49]:
# Writing the books data to csv.

books.to_csv('books_wrangled.csv', index=True)

In [None]:
# Writing the ratings data to csv.

ratings.to_csv('ratings_wrangled.csv', index=True)
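
One thing to keep in mind when reading these files back: with `index=True`, a default integer index is written as an extra unnamed column. The sketch below shows the round trip on a toy frame, using in-memory buffers instead of real files. All names in it are hypothetical.

```python
import io
import pandas as pd

toy = pd.DataFrame({"Title": ["Book A", "Book B"], "review/score": [4.0, 5.0]})

# index=True writes the row index as a first column with an empty header.
with_index = io.StringIO()
toy.to_csv(with_index, index=True)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()
print(cols_with_index)

# index=False writes only the data columns.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()
print(cols_without_index)
```

When the index was written, passing `index_col=0` to `pd.read_csv` restores it as the index rather than as an `Unnamed: 0` column.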

### This stage covered some initial data cleaning. I expect to do more cleaning during the EDA phase.