# NLP Book Recommendation System - Modeling with SBERT Sentence Transformer

Data source: Amazon Books Reviews, https://www.kaggle.com/datasets/mohamedbakhet/amazon-books-reviews?select=books_data.csv This is a rich dataset for natural language processing: it contains 3,000,000 text reviews from users as well as text descriptions and categories for 212,403 books, which makes it well suited to text analysis.

# Importing libraries and reading the data

In [2]:
import pandas as pd
import numpy as np
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [3]:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

In [4]:
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [5]:
fileDownloaded = drive.CreateFile({'id':'1dnURk-tdodpKuv-3Ic92ELyNoQLs9tLA'})
fileDownloaded.GetContentFile('books_after_preprocessing.csv')

In [6]:
books = pd.read_csv('books_after_preprocessing.csv')

In [7]:
books.head(5)

Unnamed: 0,index,Title,review/score_Avg,review/score_Count,authors,publishedDate,description_categories
0,74190,and poetry is born russian classical poetry,4.0,1.0,['Aleksandr Sergeevich Pushkin'],1984.0,russian poetry selection russian poem russian ...
1,80644,and still king,4.0,1.0,['Keith Checkley'],2012.0,business economics nothing provides clearer pi...
2,31352,dancers in mourning,4.5,8.0,['Margery Allingham'],2015.0,fiction murder take center stage songanddance ...
3,14856,eothen,3.888889,9.0,['Alexander William Kinglake'],2020.0,middle east eothen earliest work alexander wil...
4,77367,film technique and film acting,4.5,2.0,['V. I. Pudovkin'],2008.0,drama film technique film acting cinema writin...


# Taking a subset of the data by selecting the books which received more than 10 reviews

I am taking a subset of the book data to compute sentence embeddings and cosine similarity, because the full dataset would be too large to process.

In [8]:
books_sm_10 = books[books['review/score_Count'] > 10]
books_sm_10.head(3)

Unnamed: 0,index,Title,review/score_Avg,review/score_Count,authors,publishedDate,description_categories
30,95768,1 is one,4.866667,30.0,['Tasha Tudor'],2015.0,juvenile nonfiction rhyming verse present numb...
34,76202,1 ragged ridge road,4.277778,18.0,"['Leonard Foglia', 'David Richards']",1998.0,fiction estranged husband carol robbins young ...
36,110134,10 button book,3.142857,28.0,['William Accorsi'],1999.0,juvenile nonfiction verse introduce number one...


In [9]:
books_sm_10 = books_sm_10.reset_index(drop=True)


In [10]:
books_sm_10 = books_sm_10.drop(columns='index')
books_sm_10.head(3)

Unnamed: 0,Title,review/score_Avg,review/score_Count,authors,publishedDate,description_categories
0,1 is one,4.866667,30.0,['Tasha Tudor'],2015.0,juvenile nonfiction rhyming verse present numb...
1,1 ragged ridge road,4.277778,18.0,"['Leonard Foglia', 'David Richards']",1998.0,fiction estranged husband carol robbins young ...
2,10 button book,3.142857,28.0,['William Accorsi'],1999.0,juvenile nonfiction verse introduce number one...


In [11]:
books_sm_10.shape

(29560, 6)

# Cosine Similarity using Sentence Embeddings

In [12]:
!pip install sentence_transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [13]:
# source: https://www.sbert.net/docs/usage/semantic_textual_similarity.html

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

# Single list of sentences
sentences = books_sm_10['description_categories'].values

#Compute embeddings
embeddings = model.encode(sentences, convert_to_numpy=True)

#Compute cosine-similarities for each sentence with each other sentence
cosine_scores = util.cos_sim(embeddings, embeddings)

In [14]:
cosine_scores

tensor([[1.0000, 0.1183, 0.5271,  ..., 0.1800, 0.2950, 0.0608],
        [0.1183, 1.0000, 0.1760,  ..., 0.0393, 0.3236, 0.1506],
        [0.5271, 0.1760, 1.0000,  ..., 0.1901, 0.2620, 0.0790],
        ...,
        [0.1800, 0.0393, 0.1901,  ..., 1.0000, 0.3124, 0.0623],
        [0.2950, 0.3236, 0.2620,  ..., 0.3124, 1.0000, 0.4192],
        [0.0608, 0.1506, 0.0790,  ..., 0.0623, 0.4192, 1.0000]])

In [15]:
cosine_scores[0]

tensor([1.0000, 0.1183, 0.5271,  ..., 0.1800, 0.2950, 0.0608])
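
For reference, each entry of `cosine_scores` is the dot product of two embedding vectors divided by the product of their lengths. Below is a minimal NumPy sketch of that calculation on three made-up vectors; the helper name `cosine_matrix` and the toy values are illustrative and not part of the notebook.

```python
import numpy as np

def cosine_matrix(vectors):
    """Pairwise cosine similarity between the rows of a 2-D array."""
    vectors = np.asarray(vectors, dtype=float)
    # Scale every row to unit length, so a dot product equals the cosine
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / norms
    return unit @ unit.T

# Toy stand-ins for sentence embeddings (made-up values)
toy = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
scores = cosine_matrix(toy)
print(np.round(scores, 4))
```

As in the output above, the result is symmetric with ones on the diagonal, because every vector is perfectly similar to itself.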

# Finding the 5 most similar books to the book at index 0

In [16]:
sim_0 = pd.DataFrame(cosine_scores[0], columns=['sim']).sort_values(by='sim', ascending=False)
sim_0.reset_index(inplace = True)
sim_0.head()

Unnamed: 0,index,sim
0,0,1.000001
1,756,0.772923
2,10572,0.74399
3,10571,0.739862
4,17176,0.721053


In [17]:
print(books_sm_10['Title'][0])
print('Similar books')
# Row 0 of sim_0 is the book itself, so start from row 1
for i in range(1, 6):
    indexes = int(sim_0.loc[i]['index'])
    print(indexes, books_sm_10['Title'][indexes])

1 is one
Similar books
756 a pinky is a baby mouse and other baby animal names pinky baby
10572 i spy little book
10571 i spy 4 picture riddle books school reader collection lvl 1 scholastic reader collection
17176 read to your bunny max  ruby
21379 the christian mother goose book of nursery rhymes
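
The same lookup can be done without building a DataFrame. The sketch below assumes the similarity row is available as a NumPy array (for a tensor, something like `cosine_scores[0].numpy()`); the helper `top_k_similar` and the toy scores are hypothetical.

```python
import numpy as np

def top_k_similar(sim_row, query_idx, k=5):
    """Return the positions of the k highest scores, leaving out the query itself."""
    sim_row = np.asarray(sim_row, dtype=float)
    order = np.argsort(-sim_row)          # positions sorted from highest to lowest score
    order = order[order != query_idx]     # drop the query book explicitly
    return order[:k].tolist()

# Made-up similarity row for the book at position 0
row = np.array([1.0, 0.12, 0.53, 0.18, 0.77, 0.06])
print(top_k_similar(row, query_idx=0, k=3))  # prints [4, 2, 3]
```

Excluding the query by position, rather than skipping the first sorted entry, still works when another row ties with the query at a score of 1.0.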


# Making a Function which finds similar books to a given title

In [18]:
books_sm_10.sample(5)

Unnamed: 0,Title,review/score_Avg,review/score_Count,authors,publishedDate,description_categories
28803,white bread competition,4.454545,11.0,['Jo Ann Yolanda Hernández'],1997.0,mexican american luz ninthgrade latina student...
20029,tarot of the new vision english and spanish ed...,4.6,20.0,['Lo Scarabeo'],2005.0,body mind spirit discover hidden secret popula...
22848,the grand inquisitor,4.090909,22.0,['Fyodor Dostoyevsky'],2012.0,fiction considered one crucial passage subplot...
7998,flash for freedom,4.5625,64.0,['George MacDonald Fraser'],2013.0,fiction game card lead flashman jungle deathho...
28933,wicked fix a home repair is homicide mystery,4.352941,17.0,['Sarah Graves'],2014.0,fiction doityourself killer fix smalltown thug...


In [19]:
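def find_similar(title, df, df_col, sims):
    # Row positions of the books whose title matches exactly
    index_val = df[df_col == title].index
    if len(index_val) == 0:
        print('Title not found:', title)
        return
    # Use the first match; its row of the matrix holds the similarity to every book
    sim = sims[int(index_val[0])]
    sim = pd.DataFrame(sim, columns=['sim'])
    # Sort from most to least similar; reset_index keeps the row positions in 'index'
    sim = sim.sort_values(by='sim', ascending=False)
    sim = sim.reset_index()

    print(title)
    print('Similar books')

    # Position 0 is the book itself, so report positions 1 to 5
    for i in range(1, 6):
        indexes = int(sim.loc[i]['index'])
        print(indexes, df_col[indexes])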


In [20]:
title = 'to kill a mockingbird'
df = books_sm_10
df_col = books_sm_10['Title']
sims = cosine_scores


In [21]:
find_similar(title, df, df_col, sims)

to kill a mockingbird
Similar books
27234 to kill a mocking bird
21685 the confessions of nat turner
8944 gone with the wind
8945 gone with the wind the margaret mitchell anniversary edition
20628 the autobiography of miss jane pittman


In [24]:
# This cell contains code to save the cosine_scores matrix to a file.
# The purpose was to avoid running the sentence similarity model every time we want to make recommendations.
# However, the file was too large, so re-running the model seems faster than reading the matrix from a file.

#cosine_scores_np = cosine_scores.numpy()
#cosine_scores_np.tofile('cosine_scores.csv', sep = ',')
#from google.colab import files
#files.download('cosine_scores.csv')
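
The size problem follows from the shape of the matrix: 29,560 × 29,560 is roughly 874 million values, about 3.5 GB as 32-bit floats and considerably more as comma-separated text. The embeddings themselves are far smaller. Assuming the 384-dimensional output of `all-MiniLM-L6-v2`, 29,560 × 384 float32 values come to about 45 MB. One possible workaround, sketched here with random stand-in data and a temporary file path (both assumptions, not what the notebook does), is to save the normalized embeddings and compute a single row of similarities on demand.

```python
import os
import tempfile
import numpy as np

# Random stand-in for the `embeddings` array returned by model.encode (made-up data)
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 384)).astype(np.float32)

# Normalize once so that a dot product between rows equals cosine similarity
unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

# Save the small (n_books x 384) array rather than the (n_books x n_books) matrix
path = os.path.join(tempfile.mkdtemp(), 'embeddings_unit.npy')
np.save(path, unit)

# Later session: reload and compute one row of similarities when it is needed
loaded = np.load(path)
query_idx = 0
sim_row = loaded @ loaded[query_idx]
print(sim_row.shape)
```

Each lookup then costs one matrix-vector product instead of a read of the full matrix.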

# Duplicate Titles

At this point in the project, I noticed that the title column contained duplicate values that differed only in capitalization and/or spelling. I therefore returned to the Text Preprocessing step, converted all book titles to lowercase and removed the duplicates. However, I have not yet found a good solution for the spelling differences.
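
One possible starting point for the spelling differences, not something this notebook does, is to compare titles with the standard library's `difflib.SequenceMatcher` and treat very similar strings as the same book. The helper names and the 0.9 threshold below are arbitrary choices made for illustration.

```python
from difflib import SequenceMatcher

def is_near_duplicate(title_a, title_b, threshold=0.9):
    """True when two titles are almost the same string (threshold is an arbitrary choice)."""
    return SequenceMatcher(None, title_a, title_b).ratio() >= threshold

def drop_near_duplicates(titles, threshold=0.9):
    """Keep the first title of each group of near-identical titles, preserving order."""
    kept = []
    for title in titles:
        if not any(is_near_duplicate(title, seen, threshold) for seen in kept):
            kept.append(title)
    return kept

# The query title followed by the recommendations printed earlier
titles = [
    'to kill a mockingbird',
    'to kill a mocking bird',
    'the confessions of nat turner',
    'gone with the wind',
    'gone with the wind the margaret mitchell anniversary edition',
    'the autobiography of miss jane pittman',
]
print(is_near_duplicate('to kill a mockingbird', 'to kill a mocking bird'))  # prints True
print(drop_near_duplicates(titles))
```

This catches spacing and small spelling variants, but not edition subtitles such as the two "gone with the wind" entries. Because every title is compared with every title kept so far, it is better suited to a short recommendation list than to the full title column.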

# What about the book reviews data?

Initially, my plan was to also work with the review texts in the reviews data. However, with more than 2 million rows, the reviews data requires a very large amount of memory to process. Therefore, I am going to base my recommendations on just the categories and description columns of the books data.

# What about an accuracy metric?

In the absence of labeled data, I was not able to quantify the accuracy of the model. However, I have assessed the recommendations for several books, and the model does seem to recommend similar books.

For example, the recommendations for To Kill a Mockingbird were the following:

27234 to kill a mocking bird
21685 the confessions of nat turner
8944 gone with the wind
8945 gone with the wind the margaret mitchell anniversary edition
20628 the autobiography of miss jane pittman

Most of the books in this list are historical fiction dealing with slavery and race relations in 19th- and 20th-century America. This looks like a really good recommendation list.

Unfortunately, the same book sometimes appears more than once in a recommendation list under slightly different titles.

