# Step 3: NLP Book Recommendation System - Text Preprocessing

Amazon Books Reviews Data data source: https://www.kaggle.com/datasets/mohamedbakhet/amazon-books-reviews?select=books_data.csv This is a rich dataset for Natural Language Processing containing 3,000,000 text reviews from users as well as text descriptions and categories for 212,403 books. Therefore it is ideal for text analysis.

# Importing libraries and reading the books data

In [46]:
import pandas as pd
import numpy as np
import re
import nltk
#nltk.download('stopwords')  # already downloaded
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

In [47]:
books = pd.read_csv('books_after_eda.csv')

In [48]:
books.head(3)

Unnamed: 0.1,Unnamed: 0,Title,review/score_Avg,review/score_Count,description,authors,publishedDate,categories
0,1,Dr. Seuss: American Icon,4.555556,9.0,Philip Nel takes a fascinating look into the k...,['Philip Nel'],2005.0,['Biography & Autobiography']
1,2,Wonderful Worship in Smaller Churches,5.0,4.0,This resource includes twelve principles in un...,['David R. Ray'],2000.0,['Religion']
2,3,Whispers of the Wicked Saints,3.71875,32.0,Julia Thomas finds her life spinning out of co...,['Veronica Haddon'],2005.0,['Fiction']


# Checking and removing duplicate titles

In [49]:
books['Title'] = books['Title'].str.lower()
duplicates = books[books.duplicated(subset='Title')]
duplicates.head(3)

Unnamed: 0.1,Unnamed: 0,Title,review/score_Avg,review/score_Count,description,authors,publishedDate,categories
3388,4850,the complete book of home inspection: for the ...,3.53125,32.0,Guidelines from a professional home inspector.,['Norman Becker'],1993.0,['Dwellings']
4117,5890,in the wet,4.4,45.0,"It is the rainy season. Drunk and delirious, a...",['Nevil Shute'],2010.0,['Fiction']
4725,6715,silence will speak: a study of the life of den...,4.0,2.0,"A study of the well-born Englishman who, after...",['Errol Trzebinski'],1985.0,['History']


In [50]:
# Removing empty spaces and special characters in the title column

books['Title'] = books['Title'].str.strip()
books['Title'] = books['Title'].replace(r'\s+',' ', regex=True)
books['Title'] = books['Title'].replace(r'[^\w\s]+', '', regex=True)

In [51]:
# Examples of duplicates

books[books['Title']=='in the wet']

Unnamed: 0.1,Unnamed: 0,Title,review/score_Avg,review/score_Count,description,authors,publishedDate,categories
2056,2956,in the wet,4.4,15.0,"It is the rainy season. Drunk and delirious, a...",['Nevil Shute'],2010.0,['Fiction']
4117,5890,in the wet,4.4,45.0,"It is the rainy season. Drunk and delirious, a...",['Nevil Shute'],2010.0,['Fiction']


In [52]:
books[books['Title']=='the high window']

Unnamed: 0.1,Unnamed: 0,Title,review/score_Avg,review/score_Count,description,authors,publishedDate,categories
3343,4786,the high window,4.192308,26.0,"Philip Marlowe, a private detective, searches ...",['Raymond Chandler'],1993.0,['Fiction']
5403,7667,the high window,4.1875,80.0,"Philip Marlowe, a private detective, searches ...",['Raymond Chandler'],1993.0,['Fiction']


In [53]:
# sorting the records by the title so the same titles will be together, and then sorted by the review/score count. 
# I want to keep one of the duplicate values which has higher number of review/score_count.

books = books.sort_values(by=['Title', 'review/score_Count'])

# Dropping the duplicate value with less number of review/score_count.

books = books.drop_duplicates(subset='Title', keep='last')

In [54]:
# Making sure the duplicate titles are removed.

books[books.duplicated(subset='Title')]

Unnamed: 0.1,Unnamed: 0,Title,review/score_Avg,review/score_Count,description,authors,publishedDate,categories


In [55]:
# checking to make sure the one with higher review/score_count was kept.

books[books['Title']=='the high window']

Unnamed: 0.1,Unnamed: 0,Title,review/score_Avg,review/score_Count,description,authors,publishedDate,categories
5403,7667,the high window,4.1875,80.0,"Philip Marlowe, a private detective, searches ...",['Raymond Chandler'],1993.0,['Fiction']


In [56]:
books[books['Title']=='in the wet']

Unnamed: 0.1,Unnamed: 0,Title,review/score_Avg,review/score_Count,description,authors,publishedDate,categories
4117,5890,in the wet,4.4,45.0,"It is the rainy season. Drunk and delirious, a...",['Nevil Shute'],2010.0,['Fiction']


# Combining the categories and description columns

In [57]:
books['description_categories'] = books['categories'] + " " + books['description']

In [58]:
books['description_categories'][4]

"['Biography & Autobiography'] The story for children 10 and up of St. Hyacinth, the Dominican who planted the Faith in Poland, Lithuania and Russia and worked many miracles. He went to Rome, where he met St. Dominic, and was one of the first to receive at his hands the habit of the newly established Order of Friars Preachers. After his novitiate he made his religious profession, and was made superior of the little band of missionaries sent to Poland to preach. Impr. 189 pgs 16 Illus, PB"

In [59]:
# Removing special characters and converting all of the text to lower case

books['description_categories'] = books['description_categories'].replace(r'[^\w\s]+', '', regex=True)
books['description_categories'] = books['description_categories'].replace(r'\d+', '', regex=True)
books['description_categories'] = books['description_categories'].str.lower()

In [60]:
books['description_categories'][4]

'biography  autobiography the story for children  and up of st hyacinth the dominican who planted the faith in poland lithuania and russia and worked many miracles he went to rome where he met st dominic and was one of the first to receive at his hands the habit of the newly established order of friars preachers after his novitiate he made his religious profession and was made superior of the little band of missionaries sent to poland to preach impr  pgs  illus pb'

In [61]:
# Removing unneeded columns and resetting the index

books = books.drop(columns=['Unnamed: 0', 'description', 'categories'], axis=1)
books = books.reset_index()
books.head(2)

Unnamed: 0,index,Title,review/score_Avg,review/score_Count,authors,publishedDate,description_categories
0,74190,and poetry is born russian classical poetry,4.0,1.0,['Aleksandr Sergeevich Pushkin'],1984.0,russian poetry a selection of russian poems in...
1,80644,and still king,4.0,1.0,['Keith Checkley'],2012.0,business economics nothing provides a clearer...


# Tokenize, Lemmatize, Remove Stop Words, 

In [62]:
import nltk.data
nltk.download('punkt')      # already downloaded
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from pprint import pprint
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\meske\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\meske\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [63]:
stopwords = list(stopwords.words('english'))
stopwords[0:5]

['i', 'me', 'my', 'myself', 'we']

# Lemmatization or Stemming

I chose lemmatization instead of stemming. Stemming truncates the word 
while lemmatization reduces the word to the dictionary root word form.
Stemming sometimes causes overstemming, which gives words that do not have any meaning.

Source: https://towardsdatascience.com/7-nlp-techniques-you-can-easily-implement-with-python-dc0ade1a53c2

In [64]:
# define the tokenize_lemmatize function

lemmatizer = WordNetLemmatizer()

def tokenize_lemmatize(text):
    new_words = []
    tokenized_text = word_tokenize(text)
    for word in tokenized_text:
        if word in stopwords:
            continue
        else:
            lemmetized_word = lemmatizer.lemmatize(word)
            new_words.append(lemmetized_word)  
    return new_words

In [65]:
# Calling the tokenize_lemmatize function

for i in range(len(books['description_categories'])):
    new_para = tokenize_lemmatize(books['description_categories'][i])
    books['description_categories'][i] = new_para


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  books['description_categories'][i] = new_para


In [66]:
# the output looks good! It doesn't have stop words and it is tokenized and lemmatized.

books['description_categories'][4][0:5]

['drama', 'film', 'technique', 'film', 'acting']

# Join the list of strings in the description_categories column into one string

In [67]:
books['description_categories'] = books['description_categories'].str.join(" ")

In [68]:
books.head(3)

Unnamed: 0,index,Title,review/score_Avg,review/score_Count,authors,publishedDate,description_categories
0,74190,and poetry is born russian classical poetry,4.0,1.0,['Aleksandr Sergeevich Pushkin'],1984.0,russian poetry selection russian poem russian ...
1,80644,and still king,4.0,1.0,['Keith Checkley'],2012.0,business economics nothing provides clearer pi...
2,31352,dancers in mourning,4.5,8.0,['Margery Allingham'],2015.0,fiction murder take center stage songanddance ...


In [69]:
books.to_csv('books_after_preprocessing.csv', index=False)

# Next steps

The next step will be vectorizing, creating cosine similarity matrix and making recommendations. I will be doing this next phase in Google Colab since creating the cosine similarity matrix requires 78 gb of memory and cannot be done on a local machine. I may also need to do this process on a subset of the data. I wgill be using Colab Pro in order to get better access to higher RAM and GPU. Please see the next step here. https://colab.research.google.com/drive/1ZJ0IckSniFoalPV0rFNIqwSEngl1xbZM?usp=sharing