<a href="https://colab.research.google.com/github/meskeremg/FinalCapstone/blob/main/Step_4_NLP_Book_Recommendation_Modeling_with_Count_Vectorizer_and_Cosine_Similarity_Meskerem_Goshime.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Step 4: NLP Book Recommendation System - Modeling with Count Vectorizer and Cosine Similarity

Amazon Books Reviews data source: https://www.kaggle.com/datasets/mohamedbakhet/amazon-books-reviews?select=books_data.csv This is a rich dataset for natural language processing, containing 3,000,000 text reviews from users as well as text descriptions and categories for 212,403 books, which makes it well suited to text analysis.

# Importing libraries and reading the data

In [None]:
import pandas as pd
import numpy as np
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

In [None]:
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [None]:
fileDownloaded = drive.CreateFile({'id':'1_zl8d0FEqfoteFgV2uC4v1tpcfszxEKu'})
fileDownloaded.GetContentFile('books_description_categories_joined.csv')

In [None]:
books = pd.read_csv('books_description_categories_joined.csv')

In [None]:
books.sample(5)

Unnamed: 0,Title,review/score_Avg,review/score_Count,authors,publishedDate,description_categories_joined
122004,Hannah's Garden,4.2,5.0,['Lisa M. Prysock'],2014.0,fiction step victorian era turn century time p...
6563,New York Times Crossword Puzzle Dictionary (NY...,3.818182,44.0,"['Tom Pulliam', 'Clare Grundman']",1997.0,game activity america foremost crossword puzzl...
100714,"Snopes. The Hamlet, The Town, The Mansion",4.473684,19.0,['William Faulkner'],2011.0,fiction published single volume always hoped w...
135151,The Road to Ubar: Finding the Atlantis of the ...,4.37037,27.0,['Nicholas Clapp'],1999.0,social science author recount discovery lost a...
89023,The Political Testament of Cardinal Richelieu:...,4.0,3.0,['Armand Jean du Plessis duc de Richelieu'],1961.0,history hill prepared excellent translation im...


# Taking a subset of the data by selecting the books which received more than 10 reviews

I am taking a subset of the book data to perform Count Vectorizer and cosine similarity. The full dataset proved too large to process, even on Google Colab Pro with GPU and High-RAM enabled.
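To make the memory problem concrete, here is a rough back-of-the-envelope sketch. It assumes the similarity matrix is stored as a dense N x N array of 8-byte floats; `dense_matrix_gib` is a hypothetical helper introduced only for this illustration. The 212,403 figure is the book count quoted for the Kaggle dataset (the joined CSV may contain fewer rows), and 30,514 is the size of the subset whose similarity matrix shape is printed later in this notebook.

```python
def dense_matrix_gib(n_rows, bytes_per_value=8):
    """Memory needed for a dense n_rows x n_rows matrix, in GiB.

    Assumes every entry is stored explicitly with bytes_per_value bytes
    (8 bytes corresponds to a 64-bit float).
    """
    return n_rows * n_rows * bytes_per_value / 1024**3


# Compare the full book count with the filtered subset
for n in (212_403, 30_514):
    print(f'{n:>7,} books -> {dense_matrix_gib(n):6.1f} GiB')
```

Under these assumptions the full set would need roughly 336 GiB for the similarity matrix alone, while the subset needs about 7 GiB, which is why filtering by review count makes the computation feasible.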

In [None]:
books_sm = books[books['review/score_Count'] > 10]
books_sm.head(3)

Unnamed: 0,Title,review/score_Avg,review/score_Count,authors,publishedDate,description_categories_joined
2,Whispers of the Wicked Saints,3.71875,32.0,['Veronica Haddon'],2005.0,fiction julia thomas find life spinning contro...
15,Alaska Sourdough,4.333333,27.0,['Ruth Allman'],1976.0,cooking sourdough magical food author ruth all...
17,Eyewitness Travel Guide to Europe,4.259259,27.0,"['Dorling Kindersley Publishing Staff', 'Jonat...",2015.0,europe dk eyewitness travel guide eastern cent...


In [None]:
books_sm = books_sm.reset_index(drop=True)
books_sm.head(3)

Unnamed: 0,Title,review/score_Avg,review/score_Count,authors,publishedDate,description_categories_joined
0,Whispers of the Wicked Saints,3.71875,32.0,['Veronica Haddon'],2005.0,fiction julia thomas find life spinning contro...
1,Alaska Sourdough,4.333333,27.0,['Ruth Allman'],1976.0,cooking sourdough magical food author ruth all...
2,Eyewitness Travel Guide to Europe,4.259259,27.0,"['Dorling Kindersley Publishing Staff', 'Jonat...",2015.0,europe dk eyewitness travel guide eastern cent...


# Vectorizing and creating cosine similarity matrix

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
# Build a bag-of-words count matrix: one row per book, one column per vocabulary word
cv = CountVectorizer()
count_matrix_sm = cv.fit_transform(books_sm['description_categories_joined'])

In [None]:
# cosine_similarity was imported above; the result is a dense N x N array
cosine_sim_sm = cosine_similarity(count_matrix_sm)
print(cosine_sim_sm.shape)
cosine_sim_sm

(30514, 30514)


array([[1.        , 0.03091593, 0.0106389 , ..., 0.03220041, 0.03821966,
        0.01796053],
       [0.03091593, 1.        , 0.01911798, ..., 0.05143445, 0.053418  ,
        0.04303315],
       [0.0106389 , 0.01911798, 1.        , ..., 0.05752438, 0.09453802,
        0.05923489],
       ...,
       [0.03220041, 0.05143445, 0.05752438, ..., 1.        , 0.05298799,
        0.        ],
       [0.03821966, 0.053418  , 0.09453802, ..., 0.05298799, 1.        ,
        0.        ],
       [0.01796053, 0.04303315, 0.05923489, ..., 0.        , 0.        ,
        1.        ]])

In [None]:
cosine_sim_sm[0]

array([1.        , 0.03091593, 0.0106389 , ..., 0.03220041, 0.03821966,
       0.01796053])

# Finding the 5 most similar books to the book at index 0

In [None]:
sim_0 = pd.DataFrame(cosine_sim_sm[0], columns=['sim']).sort_values(by='sim', ascending=False)
sim_0.reset_index(inplace = True)
sim_0.head()

Unnamed: 0,index,sim
0,0,1.0
1,7821,0.255056
2,25699,0.234787
3,25715,0.226362
4,21375,0.224901


In [None]:
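# sim_0's 'index' values are row positions in books_sm (after reset_index),
# so the titles must be looked up in books_sm rather than in books
for i in range(1, 6):
    idx = int(sim_0.loc[i, 'index'])
    print(idx, books_sm['Title'][idx])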
  

7821 240 Vocabulary Words 4th Grade Kids Need To Know: 24 Ready-to-Reproduce Packets That Make Vocabulary Building Fun & Effective
25699 On the aesthetic education of man, in a series of letters
25715 The year around: Poems for children
21375 Unless You Repent
26958 Handbook of Hydraulic Resistance
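
The sort-and-take-the-top-rows step above can also be done without building a DataFrame. The following is a minimal sketch using only the standard library; `top_k_similar` is a hypothetical helper, not part of the notebook, and the scores in `row` are made up for illustration. It should work on any sequence of scores, including a single row such as `cosine_sim_sm[0]`.

```python
import heapq


def top_k_similar(scores, self_index, k=5):
    """Return (index, score) pairs for the k highest scores, skipping self_index."""
    # Drop the item's similarity with itself before ranking
    candidates = ((i, s) for i, s in enumerate(scores) if i != self_index)
    # nlargest avoids sorting the whole row when only k results are needed
    return heapq.nlargest(k, candidates, key=lambda pair: pair[1])


# Toy similarity row for the item at position 0 (values are made up)
row = [1.0, 0.03, 0.26, 0.01, 0.23, 0.22]
print(top_k_similar(row, self_index=0, k=3))
```

Excluding `self_index` explicitly, rather than skipping the first sorted row, stays correct even when another book has a similarity of exactly 1.0.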


# Making a Function which finds similar books to a given title

In [None]:
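def find_similar(title, df, df_col, sims):
    """Print the 5 books most similar to `title`, using the rows of `sims`."""
    # Row labels of the matching title; df needs a default 0..N-1 index
    # so that these labels line up with the rows of sims
    matches = df[df_col == title].index
    if len(matches) == 0:
        print(f'Title not found: {title}')
        return
    # Use the first match so that duplicate titles do not break the lookup
    sim = pd.DataFrame({'sim': sims[matches[0]]})
    sim = sim.sort_values(by='sim', ascending=False)
    sim = sim.reset_index()

    # Position 0 is normally the book itself (similarity 1.0), so skip it
    for i in range(1, 6):
        indexes = int(sim.loc[i, 'index'])
        print(indexes, df_col[indexes])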


In [None]:
title = 'Eyewitness Travel Guide to Europe'
df = books_sm
df_col = books_sm['Title']
sims = cosine_sim_sm


In [None]:
find_similar(title, df, df_col, sims)

18384 Hawaii (Eyewitness Travel Guides)
11835 Denmark (Eyewitness Travel Guides)
25668 South Africa (Eyewitness Travel Guides)
20405 Insight Guides Puerto Rico (Insight Guide Puerto Rico)
18146 Amsterdam (Eyewitness Top 10 Travel Guides)


# What about an accuracy metric?

In the absence of labeled data, I was not able to quantify the accuracy of the model. However, I assessed the recommendations for several books by hand, and they look reasonable: the books in each recommendation list are similar to the query title. As an example, see the recommendations below.

title = Eyewitness Travel Guide to Europe

Recommendations:

Hawaii (Eyewitness Travel Guides)

Denmark (Eyewitness Travel Guides)

South Africa (Eyewitness Travel Guides)

Insight Guides Puerto Rico (Insight Guide Puerto Rico)

Amsterdam (Eyewitness Top 10 Travel Guides)

# Count Vectorizer versus SBERT Sentence Embeddings

In this step, I used Count Vectorizer and cosine similarity to make book recommendations, and the recommendations seem decent. Count Vectorizer counts how many times each word appears in a given text, so the cosine similarity between two books is based only on the word frequencies in their texts. It does not take into account the meaning of words, or the fact that some words are closer in meaning than others.
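
The limitation described above can be seen on a toy example. This is a plain-Python stand-in for the Count Vectorizer plus cosine similarity pipeline, not the scikit-learn implementation: `count_cosine` is a hypothetical helper, and splitting on whitespace is a simplifying assumption that differs from `CountVectorizer`'s default tokenizer. The example sentences are invented.

```python
import math
from collections import Counter


def count_cosine(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Only words present in both texts contribute to the dot product
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Two of three words shared: similarity is 2/3
print(count_cosine('scary ghost story', 'scary ghost tale'))
# Same idea expressed with different words: similarity is 0.0
print(count_cosine('scary ghost story', 'frightening spirit tale'))
```

The second pair describes nearly the same thing, yet it scores zero because no word is shared. That is the gap that meaning-aware embeddings are intended to close.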

SBERT sentence embeddings, on the other hand, capture the meaning of the text, so two descriptions that use different but related words can still be close to each other. Therefore, in the next step, I will use SBERT sentence embeddings to make book recommendations.

# What about the book reviews data?

Initially, my plan was to also work with the review texts in the reviews data. However, with the reviews data containing more than 2 million rows, processing it requires a very large amount of memory. Therefore, I am going to base my recommendations on just the categories and description columns of the books data.

Please see the [Final Step, Step 5](https://colab.research.google.com/drive/1bAgZqXUfHo6ij38yZf7CFyZ6ypBPbnwr?usp=sharing), which is Modeling with SBERT Sentence Embeddings and Cosine Similarity. 