<a href="https://colab.research.google.com/github/meskeremg/FinalCapstone/blob/main/Step_4_1_NLP_Book_Recommendation_Modeling_with_Count_Vectorizer_and_Cosine_Similarity_Meskerem_Goshime.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Step 4-1: NLP Book Recommendation System
# Model 1 - Count Vectorizer and Cosine Similarity

Amazon Books Reviews data source: https://www.kaggle.com/datasets/mohamedbakhet/amazon-books-reviews?select=books_data.csv This is a rich dataset for natural language processing, containing 3,000,000 text reviews from users as well as text descriptions and categories for 212,403 books, which makes it well suited to text analysis.

# Importing libraries and reading the data

In [None]:
import pandas as pd
import numpy as np
import re

In [None]:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

In [None]:
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [None]:
fileDownloaded = drive.CreateFile({'id':'1dnURk-tdodpKuv-3Ic92ELyNoQLs9tLA'})
fileDownloaded.GetContentFile('books_after_preprocessing.csv')

In [None]:
books = pd.read_csv('books_after_preprocessing.csv')

In [None]:
books.sample(3)

Unnamed: 0,index,Title,review/score_Avg,review/score_Count,authors,publishedDate,description_categories
44524,43142,hammer and blaze a gathering of contemporary a...,5.0,1.0,"['Ellen Bryant Voigt', 'Heather McHugh']",2002.0,poetry hammer blaze provides true crosssection...
76912,5594,pheromone on the street corner,4.0,5.0,['Yukio Yukimino'],2001.0,comic graphic novel latest yukio yukimino lie ...
11991,15785,barksdale air force base la images of america,3.75,4.0,['Kevin Bryant Jones'],2015.0,history bossier city sprung around cotton fiel...


# Taking a subset of the data by selecting the books which received more than 10 reviews

I am taking a subset of the book data to perform Count Vectorizer and Cosine Similarity. The full dataset proved to be too large, even with Google Colab Pro with GPU and High-RAM enabled.
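To see why the full dataset is a problem, it helps to estimate the size of a dense pairwise similarity matrix. The sketch below is a back-of-the-envelope calculation, assuming 8-byte floats (NumPy's default `float64`) and using the 212,403-book figure from the dataset description as an upper bound (the preprocessed file may contain fewer rows). The helper name `dense_matrix_gib` is made up for illustration.

```python
def dense_matrix_gib(n_rows, bytes_per_value=8):
    """Approximate size in GiB of a dense n_rows x n_rows matrix."""
    return n_rows * n_rows * bytes_per_value / 2**30

# Upper bound: every book in the original dataset (assumption: no rows dropped).
full = dense_matrix_gib(212_403)
# The subset used in this notebook (29,560 books with more than 10 reviews).
subset = dense_matrix_gib(29_560)

print(f"full dataset: {full:,.0f} GiB")
print(f"subset:       {subset:,.1f} GiB")
```

By this estimate the subset needs a few GiB, while the full matrix would need hundreds of GiB, which no GPU or High-RAM setting in Colab would cover.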

In [None]:
books_sm = books[books['review/score_Count'] > 10]
books_sm.head(3)

Unnamed: 0,index,Title,review/score_Avg,review/score_Count,authors,publishedDate,description_categories
30,95768,1 is one,4.866667,30.0,['Tasha Tudor'],2015.0,juvenile nonfiction rhyming verse present numb...
34,76202,1 ragged ridge road,4.277778,18.0,"['Leonard Foglia', 'David Richards']",1998.0,fiction estranged husband carol robbins young ...
36,110134,10 button book,3.142857,28.0,['William Accorsi'],1999.0,juvenile nonfiction verse introduce number one...


In [None]:
books_sm = books_sm.reset_index(drop=True)
books_sm = books_sm.drop(columns=['index'])
books_sm.head(3)


Unnamed: 0,Title,review/score_Avg,review/score_Count,authors,publishedDate,description_categories
0,1 is one,4.866667,30.0,['Tasha Tudor'],2015.0,juvenile nonfiction rhyming verse present numb...
1,1 ragged ridge road,4.277778,18.0,"['Leonard Foglia', 'David Richards']",1998.0,fiction estranged husband carol robbins young ...
2,10 button book,3.142857,28.0,['William Accorsi'],1999.0,juvenile nonfiction verse introduce number one...


# Vectorizing and creating cosine similarity matrix

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
cv = CountVectorizer() 
count_matrix_sm = cv.fit_transform(books_sm['description_categories'])

In [None]:
cosine_sim_sm = cosine_similarity(count_matrix_sm)
print(cosine_sim_sm.shape)
cosine_sim_sm

(29560, 29560)


array([[1.        , 0.        , 0.38575837, ..., 0.03928371, 0.04303315,
        0.        ],
       [0.        , 1.        , 0.        , ..., 0.        , 0.09325048,
        0.04445542],
       [0.38575837, 0.        , 1.        , ..., 0.03636965, 0.        ,
        0.        ],
       ...,
       [0.03928371, 0.        , 0.03636965, ..., 1.        , 0.04057204,
        0.        ],
       [0.04303315, 0.09325048, 0.        , ..., 0.04057204, 1.        ,
        0.06356417],
       [0.        , 0.04445542, 0.        , ..., 0.        , 0.06356417,
        1.        ]])

In [None]:
cosine_sim_sm[0]

array([1.        , 0.        , 0.38575837, ..., 0.03928371, 0.04303315,
       0.        ])

# Finding the 5 most similar books to the book at index 0

In [None]:
sim_0 = pd.DataFrame(cosine_sim_sm[0], columns=['sim']).sort_values(by='sim', ascending=False)
sim_0.reset_index(inplace = True)
sim_0.head()

Unnamed: 0,index,sim
0,0,1.0
1,2,0.385758
2,17280,0.3849
3,25807,0.372678
4,19458,0.372678


In [None]:
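for i in range(1, 6):
  # Row 0 of sim_0 is the book itself, so start from row 1
  indexes = int(sim_0.loc[i]['index'])
  # Look titles up in books_sm, the DataFrame the similarity matrix was built from
  print(indexes, books_sm['Title'][indexes])


2 10 button book
17280 red lace yellow lace
25807 the skin you live in
19458 sticky situations 2 365 devotions for elementary kids
10571 i spy 4 picture riddle books school reader collection lvl 1 scholastic reader collection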
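
Sorting all 29,560 similarities just to read off five of them works, but it is more than is needed. As an alternative sketch (not what this notebook does), Python's built-in `heapq.nlargest` can pick the top matches directly from a similarity row. The function name `top_n_similar` and the toy row are hypothetical.

```python
import heapq

def top_n_similar(sim_row, self_index, n=5):
    """Return (index, similarity) pairs for the n highest scores, skipping the book itself."""
    candidates = ((i, s) for i, s in enumerate(sim_row) if i != self_index)
    return heapq.nlargest(n, candidates, key=lambda pair: pair[1])

# Toy similarity row for a book at position 0 (values are made up).
toy_row = [1.0, 0.0, 0.39, 0.12, 0.37, 0.05]
top = top_n_similar(toy_row, self_index=0, n=3)
print(top)
```

With the real matrix the call would look like `top_n_similar(cosine_sim_sm[0], self_index=0)`, and the returned indices should be looked up in `books_sm`, the same DataFrame the matrix was built from.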


In [None]:
books_sm.sample(10)

Unnamed: 0,Title,review/score_Avg,review/score_Count,authors,publishedDate,description_categories
25423,the rough riders,4.428571,84.0,['Theodore Roosevelt'],1899.0,spanishamerican war based pocket diary spanish...
13306,management by vice a humorous satire on rd li...,4.714286,21.0,['C. B. Don'],2005.0,education science fiction science fact nonfict...
118,2008 riviera maya guide map by cando,5.0,13.0,['Joshua Eden Hinsdale'],2011.0,travel completely updated insider guide veers ...
22929,the greek treasure,4.833333,12.0,['Irving Stone'],1975.0,fiction fictionalized narrative derided determ...
11643,journey through genius the great theorems of m...,4.830189,106.0,['William Dunham'],1991.0,biography autobiography like masterpiece art m...
19132,spanish stepbystep,4.538462,13.0,['Barbara Bregstein'],2012.0,foreign language study proven grammarbased app...
19546,story structure architect a writers guide to b...,3.757576,33.0,['Victoria Lynn Schmidt'],2005.0,language art discipline build timeless origina...
12958,love according to lily american heiress,3.818182,22.0,['Julianne MacLean'],2009.0,fiction lily langdon finally grown brother duk...
17879,sailing the winedark sea why the greeks matter,3.642857,84.0,['Thomas Cahill'],2010.0,history sailing winedark sea fourth volume exp...
27286,tolkien,2.641026,39.0,"['Pam Pollack', 'Meg Belviso', 'Who HQ']",2015.0,juvenile nonfiction introduction life career f...


# Making a function that finds similar books to a given title

In [None]:
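def find_similar(title, df, df_col, sims):
    # Positions of rows whose title matches; df needs a 0..n-1 index (as books_sm has after reset_index)
    matches = df[df_col == title].index
    if len(matches) == 0:
        print('Title not found: ', title)
        return
    # Similarities between the first matching book and every book
    sim = pd.DataFrame(sims[matches[0]], columns=['sim'])
    sim = sim.sort_values(by='sim', ascending=False)
    sim = sim.reset_index()

    print('Chosen book: ', title)
    print('Recommended books: ')

    # Row 0 is normally the chosen book itself (similarity 1.0), so list rows 1 to 5
    for i in range(1, 6):
        indexes = int(sim.loc[i]['index'])
        print(i, '. ', df_col[indexes])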


In [None]:
title = '1 is one'
df = books_sm
df_col = books_sm['Title']
sims = cosine_sim_sm


In [None]:
find_similar(title, df, df_col, sims)

Chosen book:  1 is one
Recommended books: 
1 .  10 button book
2 .  red lace yellow lace
3 .  the skin you live in
4 .  sticky situations 2 365 devotions for elementary kids
5 .  i spy 4 picture riddle books school reader collection lvl 1 scholastic reader collection


In [None]:
title = 'spanish stepbystep'
df = books_sm
df_col = books_sm['Title']
sims = cosine_sim_sm
find_similar(title, df, df_col, sims)

Chosen book:  spanish stepbystep
Recommended books: 
1 .  the big red book of spanish vocabulary
2 .  teach yourself korean complete course korean edition
3 .  practice makes perfect spanish verb tenses
4 .  spanish made simple
5 .  teach yourself finnish


In [None]:
title = 'to kill a mockingbird'
df = books_sm
df_col = books_sm['Title']
sims = cosine_sim_sm
find_similar(title, df, df_col, sims)

Chosen book:  to kill a mockingbird
Recommended books: 
1 .  to kill a mocking bird
2 .  moll flanders norton critical editions series
3 .  the short novels of john steinbeck
4 .  the man who loved children
5 .  friday
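
The first recommendation above, "to kill a mocking bird", is the same book under a slightly different title. One possible way to filter out such near-duplicates is to compare titles after removing whitespace. This is a sketch with hypothetical helper names (`normalize_title`, `drop_same_title`), not part of this notebook's pipeline, and it would only catch spacing differences, not other spelling variants.

```python
def normalize_title(title):
    """Lowercase a title and strip all whitespace so spacing variants compare equal."""
    return "".join(title.lower().split())

def drop_same_title(chosen, recommendations):
    """Remove recommendations whose normalized title equals the chosen title's."""
    key = normalize_title(chosen)
    return [t for t in recommendations if normalize_title(t) != key]

# Recommendation list copied from the output above.
recs = [
    "to kill a mocking bird",
    "moll flanders norton critical editions series",
    "the short novels of john steinbeck",
    "the man who loved children",
    "friday",
]
filtered = drop_same_title("to kill a mockingbird", recs)
print(filtered)
```

To still return five titles after filtering, the function would need to draw more than five candidates from the sorted similarities first.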


# What about accuracy metrics?

In the absence of labeled data, I was not able to quantify the accuracy of the model. However, I have assessed the recommendations for several books, and the model seems to make decent recommendations most of the time. The books in each recommendation list seem fairly similar to the chosen title. Let us assess the three recommendation lists above.

1.   **1 Is One** - **Very good** recommendations! The recommended titles are all early childhood picture books, several in a rhyme or riddle format that matches the chosen book.
2.   **Spanish Step by Step** - **Good** recommendations. All are language-learning books, and 3 out of 5 are for Spanish specifically. It would have been nice to get only Spanish-learning books.
3.   **To Kill a Mockingbird** - **Good** recommendations. The recommended books are mostly older (18th and 19th centuries) popular fiction (plus one biography). However, I do not see much overlap in their topics.

Please see the recommendation lists for these books in the code cells above.
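
The manual "3 out of 5" judgment for the Spanish title can be turned into a small number. The sketch below computes the share of recommended titles that contain a keyword. The function `keyword_hit_rate` is a hypothetical helper, and a keyword match is only a rough proxy for relevance, not a true accuracy metric.

```python
def keyword_hit_rate(recommendations, keyword):
    """Fraction of recommended titles that contain the keyword as a whole word."""
    if not recommendations:
        return 0.0
    hits = sum(1 for title in recommendations if keyword in title.lower().split())
    return hits / len(recommendations)

# Recommendation list copied from the 'spanish stepbystep' output above.
spanish_recs = [
    "the big red book of spanish vocabulary",
    "teach yourself korean complete course korean edition",
    "practice makes perfect spanish verb tenses",
    "spanish made simple",
    "teach yourself finnish",
]
rate = keyword_hit_rate(spanish_recs, "spanish")
print(rate)
```

Averaging such a rate over many titles would need a sensible keyword or category for each book, which is an assumption this dataset may or may not support.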



# Comparing Count Vectorizer, Gensim Library and SBERT Word Embeddings 

In this step, I used Count Vectorizer and cosine similarity to make book recommendations, and the recommendations seem decent. Count Vectorizer counts how many times each word appears in a given text, so the cosine similarity between two books is based purely on the word counts in their texts. It does not take into account the meaning of words or the fact that some words are closer in meaning than others. In spite of this, the recommendations were surprisingly good.
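
The point about word counts versus meaning can be illustrated with a few lines of plain Python. This is a simplified sketch of the idea, not scikit-learn's implementation: it assumes a whitespace split for tokenization (`CountVectorizer` uses a regular expression instead), and the helper names `count_vector` and `cosine` are made up for illustration.

```python
import math
from collections import Counter

def count_vector(text):
    """Bag-of-words counts from a simple whitespace split (a simplifying assumption)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two Counter word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Three of four words shared: high similarity.
same_words = cosine(count_vector("juvenile nonfiction counting rhyme"),
                    count_vector("juvenile nonfiction counting verse"))
# Similar meaning but no shared words: similarity is zero.
synonyms = cosine(count_vector("big dog story"),
                  count_vector("large hound tale"))
print(same_words, synonyms)
```

The second pair scores zero even though the phrases mean nearly the same thing, which is the limitation described above.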

In the next step, I will use the Gensim library, which is similar to the Count Vectorizer in that it depends on the frequency of words. Gensim is expected to be simple to implement and fast to run. 

In the last step, I will use SBERT Sentence Transformers. This approach, on the other hand, considers the context and meaning of words and sentences, so it might provide better results.