<a href="https://colab.research.google.com/github/meskeremg/FinalCapstone/blob/main/Step_4_3_NLP_Book_Recommendation_Modeling_with_SBERT_Sentence_Embeddings_Meskerem_Goshime.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Step 4-3 NLP Book Recommendation System
# Model 3 - SBERT Sentence Embeddings and Cosine Similarity

Data source: Amazon Books Reviews, https://www.kaggle.com/datasets/mohamedbakhet/amazon-books-reviews?select=books_data.csv. This is a rich dataset for natural language processing: it contains 3,000,000 text reviews from users, as well as text descriptions and categories for 212,403 books, which makes it well suited to text analysis.

# Importing libraries and reading the data

In [None]:
import pandas as pd
import numpy as np

In [None]:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

In [None]:
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [None]:
fileDownloaded = drive.CreateFile({'id':'1dnURk-tdodpKuv-3Ic92ELyNoQLs9tLA'})
fileDownloaded.GetContentFile('books_after_preprocessing.csv')

In [None]:
books = pd.read_csv('books_after_preprocessing.csv')

In [None]:
books.head(3)

Unnamed: 0,index,Title,review/score_Avg,review/score_Count,authors,publishedDate,description_categories
0,74190,and poetry is born russian classical poetry,4.0,1.0,['Aleksandr Sergeevich Pushkin'],1984.0,russian poetry selection russian poem russian ...
1,80644,and still king,4.0,1.0,['Keith Checkley'],2012.0,business economics nothing provides clearer pi...
2,31352,dancers in mourning,4.5,8.0,['Margery Allingham'],2015.0,fiction murder take center stage songanddance ...


# Taking a subset of the data by selecting the books which received more than 10 reviews

I am taking a subset of the book data to perform sentence embedding and cosine similarity. The full dataset would be too large to process, because the pairwise similarity matrix grows with the square of the number of books.

In [None]:
books_sm_10 = books[books['review/score_Count'] > 10]
books_sm_10.head(3)

Unnamed: 0,index,Title,review/score_Avg,review/score_Count,authors,publishedDate,description_categories
30,95768,1 is one,4.866667,30.0,['Tasha Tudor'],2015.0,juvenile nonfiction rhyming verse present numb...
34,76202,1 ragged ridge road,4.277778,18.0,"['Leonard Foglia', 'David Richards']",1998.0,fiction estranged husband carol robbins young ...
36,110134,10 button book,3.142857,28.0,['William Accorsi'],1999.0,juvenile nonfiction verse introduce number one...


In [None]:
books_sm_10 = books_sm_10.reset_index(drop=True)
books_sm_10 = books_sm_10.drop(columns='index')
books_sm_10.head(3)


Unnamed: 0,Title,review/score_Avg,review/score_Count,authors,publishedDate,description_categories
0,1 is one,4.866667,30.0,['Tasha Tudor'],2015.0,juvenile nonfiction rhyming verse present numb...
1,1 ragged ridge road,4.277778,18.0,"['Leonard Foglia', 'David Richards']",1998.0,fiction estranged husband carol robbins young ...
2,10 button book,3.142857,28.0,['William Accorsi'],1999.0,juvenile nonfiction verse introduce number one...


In [None]:
books_sm_10.shape

(29560, 6)
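
A rough back-of-the-envelope check shows why the subset matters. The sketch below assumes the pairwise similarity matrix is stored densely as 32-bit floats (4 bytes per entry) and uses the 212,403 book count quoted for the raw dataset as a stand-in for the full data; `similarity_matrix_gib` is a hypothetical helper name, not part of the notebook.

```python
def similarity_matrix_gib(n_books, bytes_per_value=4):
    """Approximate size in GiB of a dense n_books x n_books similarity matrix."""
    return n_books * n_books * bytes_per_value / 1024**3

subset_gib = similarity_matrix_gib(29_560)    # books with more than 10 reviews
full_gib = similarity_matrix_gib(212_403)     # book count quoted for the raw dataset

print(f"subset: {subset_gib:.1f} GiB, full: {full_gib:.1f} GiB")
```

Under these assumptions the subset needs a few GiB, while the full book list would need well over 100 GiB for the similarity matrix alone.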

# Cosine Similarity using Sentence Embeddings

In [None]:
# this took 13 seconds to run using GPU and Hi RAM.
!pip install sentence_transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sentence_transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
Collecting transformers<5.0.0,>=4.6.0
  Downloading transformers-4.24.0-py3-none-any.whl (5.5 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.97-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting huggingface-hub>=0.4.0
  Downloading huggingface_hub-0.11.1-py3-none-any.whl (182 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Building wheels for collected 

In [None]:
# source: https://www.sbert.net/docs/usage/semantic_textual_similarity.html
# using pretrained SentenceTransformer model
# This took 57 seconds using GPU and Hi RAM in Google Colab Pro.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

# Single list of sentences
sentences = books_sm_10['description_categories'].values

#Compute embeddings
embeddings = model.encode(sentences, convert_to_numpy=True)

#Compute cosine-similarities for each sentence with each other sentence
cosine_scores = util.cos_sim(embeddings, embeddings)


In [None]:
cosine_scores

tensor([[1.0000, 0.1183, 0.5271,  ..., 0.1800, 0.2950, 0.0608],
        [0.1183, 1.0000, 0.1760,  ..., 0.0393, 0.3236, 0.1506],
        [0.5271, 0.1760, 1.0000,  ..., 0.1901, 0.2620, 0.0790],
        ...,
        [0.1800, 0.0393, 0.1901,  ..., 1.0000, 0.3124, 0.0623],
        [0.2950, 0.3236, 0.2620,  ..., 0.3124, 1.0000, 0.4192],
        [0.0608, 0.1506, 0.0790,  ..., 0.0623, 0.4192, 1.0000]])

In [None]:
cosine_scores[0]

tensor([1.0000, 0.1183, 0.5271,  ..., 0.1800, 0.2950, 0.0608])
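
For reference, cosine similarity is the dot product of two vectors after each has been scaled to unit length, which is why the matrix above is symmetric with 1.0 on the diagonal. The following is a minimal NumPy sketch on made-up toy vectors; `cosine_matrix` is a hypothetical helper written for illustration and is not the `sentence_transformers` implementation.

```python
import numpy as np

def cosine_matrix(vectors):
    """Pairwise cosine similarity between the rows of a 2-D array."""
    vectors = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / norms          # scale every row to length 1
    return unit @ unit.T            # dot products of unit vectors

# Toy 3-dimensional "embeddings"; real SBERT vectors have hundreds of dimensions
toy = np.array([[1.0, 0.0, 0.0],
                [1.0, 1.0, 0.0],
                [0.0, 0.0, 2.0]])
scores = cosine_matrix(toy)
print(np.round(scores, 3))
```

Vectors pointing in the same direction score 1, and orthogonal vectors score 0, regardless of their lengths.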

# Finding the 5 most similar books to the book at index 0

In [None]:
sim_0 = pd.DataFrame(cosine_scores[0], columns=['sim']).sort_values(by='sim', ascending=False)
sim_0.reset_index(inplace = True)
sim_0.head()

Unnamed: 0,index,sim
0,0,1.000001
1,756,0.772923
2,10572,0.74399
3,10571,0.739862
4,17176,0.721053


In [None]:
print('Chosen book: ', books_sm_10['Title'][0])
print('Similar books: ')
for i in range(1,6):
  indexes = int(sim_0.loc[i]['index'])
  print(indexes, books_sm_10['Title'][indexes])
  

Chosen book:  1 is one
Similar books: 
756 a pinky is a baby mouse and other baby animal names pinky baby
10572 i spy little book
10571 i spy 4 picture riddle books school reader collection lvl 1 scholastic reader collection
17176 read to your bunny max  ruby
21379 the christian mother goose book of nursery rhymes


# Making a function that finds books similar to a given title

In [None]:
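def find_similar(title, df, df_col, sims, top_n=5):
    # Row labels of the books whose title matches exactly
    matches = df[df_col == title].index
    if len(matches) == 0:
        print('Title not found: ', title)
        return
    index_val = int(matches[0])  # use the first match if the title is repeated

    # Similarity of the chosen book to every book, sorted from highest to lowest
    sim = pd.DataFrame({'sim': np.asarray(sims[index_val])})
    sim = sim.sort_values(by='sim', ascending=False)
    sim = sim.reset_index()

    print('Chosen book: ', title)
    print('Similar books: ')

    # Row 0 is normally the chosen book itself, so start from row 1
    for i in range(1, top_n + 1):
        indexes = int(sim.loc[i]['index'])
        print(i, '. ', df_col[indexes])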
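
Printing is convenient in a notebook, but returning the results makes them easier to reuse. The following is a minimal sketch of such a variant, not the function used in this notebook: `recommend` is a hypothetical name, the toy titles and similarity values are made up for illustration, and it assumes the titles are in the same order as the rows of the similarity matrix.

```python
import numpy as np
import pandas as pd

def recommend(title, titles, sims, top_n=5):
    """Return the top_n most similar titles as a DataFrame (hypothetical helper)."""
    matches = np.flatnonzero(titles.to_numpy() == title)
    if len(matches) == 0:
        raise KeyError(f"title not found: {title!r}")
    row = int(matches[0])                      # use the first exact match
    scores = np.asarray(sims[row], dtype=float)
    order = np.argsort(-scores)                # highest similarity first
    order = order[order != row][:top_n]        # drop the chosen book itself
    return pd.DataFrame({'Title': titles.to_numpy()[order], 'sim': scores[order]})

# Made-up toy data, only to show the call
toy_titles = pd.Series(['book a', 'book b', 'book c', 'book d'])
toy_sims = np.array([[1.0, 0.2, 0.9, 0.5],
                     [0.2, 1.0, 0.1, 0.3],
                     [0.9, 0.1, 1.0, 0.4],
                     [0.5, 0.3, 0.4, 1.0]])
result = recommend('book a', toy_titles, toy_sims, top_n=2)
print(result)
```

Excluding the chosen book by its row position, rather than skipping the first sorted row, avoids dropping the wrong entry when another book has an identical description.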


# Example Recommendations with the same books I used to test Model 1 and Model 2

Note: see the recommendations for **1 is one** above.

In [None]:
title = 'spanish stepbystep'
df = books_sm_10
df_col = books_sm_10['Title']
sims = cosine_scores
find_similar(title, df, df_col, sims)

Chosen book:  spanish stepbystep
Similar books: 
1 .  fundamental spanish
2 .  the big red book of spanish vocabulary
3 .  teach yourself korean
4 .  teach yourself brazilian portuguese teach yourself tape
5 .  basic english grammar second edition full student textbook


In [None]:
title = 'to kill a mockingbird'
df = books_sm_10
df_col = books_sm_10['Title']
sims = cosine_scores
find_similar(title, df, df_col, sims)

Chosen book:  to kill a mockingbird
Similar books: 
1 .  to kill a mocking bird
2 .  the confessions of nat turner
3 .  gone with the wind
4 .  gone with the wind the margaret mitchell anniversary edition
5 .  the autobiography of miss jane pittman


# Duplicate Titles

At this point in the project, I noticed that there were duplicate values in the title column with differing capitalization and/or spelling. Therefore, I returned to the Text Preprocessing step, converted all book titles to lowercase, and removed the exact duplicates. However, I have not yet found a good solution for duplicate titles that differ in spelling or wording.
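
One possible way to flag titles that differ only slightly in spelling is fuzzy string matching. The sketch below uses Python's standard `difflib` module and was not part of this project: the 0.9 threshold and the helper name `near_duplicate_pairs` are assumptions for illustration. Comparing every pair of titles is quadratic, so it would need some form of grouping or sampling before being applied to the roughly 30,000 titles used here.

```python
from difflib import SequenceMatcher
from itertools import combinations

def near_duplicate_pairs(titles, threshold=0.9):
    """Return (title_a, title_b, ratio) for pairs whose similarity ratio >= threshold."""
    pairs = []
    for a, b in combinations(titles, 2):
        ratio = SequenceMatcher(None, a, b).ratio()
        if ratio >= threshold:
            pairs.append((a, b, ratio))
    return pairs

sample = ['to kill a mockingbird', 'to kill a mocking bird', 'gone with the wind']
for a, b, ratio in near_duplicate_pairs(sample):
    print(f"{a!r} ~ {b!r} (ratio {ratio:.2f})")
```

Flagged pairs would still need a manual check, since different editions or volumes of a series can also have very similar titles.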

# What about an accuracy metric?

In the absence of labeled data, I was not able to quantify the accuracy of the model. However, I have assessed the recommendations for several books, and the model does seem to recommend similar books.

Here are my assessments on 3 example recommendations.

**1 is One** - **Very good** recommendations! All are rhyming early childhood picture books like the chosen book.

**Spanish Step by Step** - the recommendations **could be better**. All are language-learning books, but only 2 are for Spanish; the other 3 are for other languages.

**To Kill a Mockingbird** - **Very good** recommendations! All are historical fiction dealing with slavery, race relations, and everyday life in 19th and 20th century America.

Please see the recommendation lists for these books in the code cells above.
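
Without labels, one rough proxy is to check how often the top recommendations share a category with the chosen book. The sketch below is only an illustration under assumptions: it supposes a separate array with one category label per book (this notebook only has the merged `description_categories` column), it uses made-up toy values, and `category_agreement_at_k` is a hypothetical name. Because the category text is itself part of the embedded input here, such a score would be optimistic; it is a sanity check rather than a true accuracy.

```python
import numpy as np

def category_agreement_at_k(sims, categories, k=5):
    """Mean share of each book's top-k neighbours that have the same category."""
    sims = np.asarray(sims, dtype=float).copy()
    categories = np.asarray(categories)
    np.fill_diagonal(sims, -np.inf)                 # never recommend a book to itself
    top_k = np.argsort(-sims, axis=1)[:, :k]        # indices of the k nearest books
    same = categories[top_k] == categories[:, None]
    return same.mean()

# Made-up toy similarity matrix and labels
toy_sims = np.array([[1.0, 0.8, 0.1, 0.2],
                     [0.8, 1.0, 0.3, 0.1],
                     [0.1, 0.3, 1.0, 0.7],
                     [0.2, 0.1, 0.7, 1.0]])
toy_categories = ['fiction', 'fiction', 'cooking', 'cooking']
score = category_agreement_at_k(toy_sims, toy_categories, k=1)
print(score)
```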

# What about the book reviews data?

Initially, my plan was also to work with the review texts in the reviews data. However, with the reviews data containing more than 2 million rows, processing it requires a very large amount of memory. Therefore, I am going to base my recommendations on just the categories and description columns of the books data.
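
If the review text is revisited later, one option is to stream the file in chunks instead of loading it all at once. The sketch below is a hedged illustration, not something run in this project: the column names `Title` and `review/text`, the chunk size, and the in-memory sample CSV are all assumptions, and it only counts reviews and words per title to keep the example small.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for a large reviews CSV (assumed column names)
sample_csv = io.StringIO(
    "Title,review/text\n"
    "book a,loved it\n"
    "book b,not for me\n"
    "book a,great rhymes for toddlers\n"
)

totals = {}
# chunksize makes read_csv yield DataFrames of at most 2 rows at a time
for chunk in pd.read_csv(sample_csv, chunksize=2):
    words = chunk['review/text'].str.split().str.len()
    for title, n_words in zip(chunk['Title'], words):
        count, total_words = totals.get(title, (0, 0))
        totals[title] = (count + 1, total_words + int(n_words))

summary = pd.DataFrame.from_dict(totals, orient='index', columns=['n_reviews', 'n_words'])
print(summary)
```

Only one chunk is held in memory at a time, so peak memory depends on the chunk size and the size of the running totals rather than on the full file.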