<a href="https://colab.research.google.com/github/meskeremg/FinalCapstone/blob/main/Modeling_and_Recommending.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Importing libraries and reading the data

In [1]:
import pandas as pd
import numpy as np
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [2]:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

In [3]:
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [4]:
fileDownloaded = drive.CreateFile({'id':'1_zl8d0FEqfoteFgV2uC4v1tpcfszxEKu'})
fileDownloaded.GetContentFile('books_description_categories_joined.csv')

In [5]:
books = pd.read_csv('books_description_categories_joined.csv')

In [7]:
books.sample(5)

Unnamed: 0,Title,review/score_Avg,review/score_Count,authors,publishedDate,description_categories_joined
79989,The Christ of the Indian road,4.333333,6.0,['E. Stanley Jones'],1925.0,religion searching truth map help lead path way
57642,Futurehype the Tyranny of Prophecy,4.0,2.0,['Max Dublin'],1991.0,social science critique way society attempt so...
11153,Something Queer on Vacation,5.0,1.0,['Elizabeth Levy'],1980.0,seashore gwen jill determine win weekly sandca...
79261,Birds of Massachusetts Field Guide,4.875,16.0,"['Donald Stokes', 'Lillian Stokes']",2010.0,nature culmination many year research observat...
46878,The Criminal Personality: The Change Process (...,4.75,4.0,"['Samuel Yochelson', 'Stanton E. Samenow']",1976.0,criminal psychology dr chessick us metaphor te...


# Taking a subset of the data by selecting the books which received more than 10 reviews

I am taking a subset of the book data to run CountVectorizer and cosine similarity on. The full dataset proved too large to process, even with Google Colab Pro with GPU and High-RAM enabled.
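
As a rough illustration of why the size matters, the sketch below estimates the memory needed for a dense, square similarity matrix. It assumes 8-byte (float64) values; the helper name `dense_matrix_gib` is hypothetical, and apart from 30,514 (the row count of the subset's similarity matrix later in this notebook) the row counts are illustrative only.

```python
def dense_matrix_gib(n_rows: int, bytes_per_value: int = 8) -> float:
    """Approximate size in GiB of a dense n_rows x n_rows matrix."""
    return n_rows * n_rows * bytes_per_value / 2**30


# 30,514 is the subset's row count; the other values are illustrative.
for n in (10_000, 30_514, 100_000):
    print(f"{n:>7} rows -> {dense_matrix_gib(n):6.1f} GiB")
```

Because the matrix grows with the square of the number of rows, tripling the row count needs roughly nine times the memory.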

In [8]:
books_sm = books[books['review/score_Count'] > 10]
books_sm.head(3)

Unnamed: 0,Title,review/score_Avg,review/score_Count,authors,publishedDate,description_categories_joined
2,Whispers of the Wicked Saints,3.71875,32.0,['Veronica Haddon'],2005.0,fiction julia thomas find life spinning contro...
15,Alaska Sourdough,4.333333,27.0,['Ruth Allman'],1976.0,cooking sourdough magical food author ruth all...
17,Eyewitness Travel Guide to Europe,4.259259,27.0,"['Dorling Kindersley Publishing Staff', 'Jonat...",2015.0,europe dk eyewitness travel guide eastern cent...


In [9]:
books_sm = books_sm.reset_index(drop=True)
books_sm.head(3)

Unnamed: 0,Title,review/score_Avg,review/score_Count,authors,publishedDate,description_categories_joined
0,Whispers of the Wicked Saints,3.71875,32.0,['Veronica Haddon'],2005.0,fiction julia thomas find life spinning contro...
1,Alaska Sourdough,4.333333,27.0,['Ruth Allman'],1976.0,cooking sourdough magical food author ruth all...
2,Eyewitness Travel Guide to Europe,4.259259,27.0,"['Dorling Kindersley Publishing Staff', 'Jonat...",2015.0,europe dk eyewitness travel guide eastern cent...


# Vectorizing and creating cosine similarity matrix

In [8]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [10]:
# Build a bag-of-words count matrix from the joined description and category text
cv = CountVectorizer()
count_matrix_sm = cv.fit_transform(books_sm['description_categories_joined'])

In [11]:
# cosine_similarity is already imported above; the result is a dense n x n array
cosine_sim_sm = cosine_similarity(count_matrix_sm)
print(cosine_sim_sm.shape)
cosine_sim_sm

(30514, 30514)


array([[1.        , 0.03091593, 0.0106389 , ..., 0.03220041, 0.03821966,
        0.01796053],
       [0.03091593, 1.        , 0.01911798, ..., 0.05143445, 0.053418  ,
        0.04303315],
       [0.0106389 , 0.01911798, 1.        , ..., 0.05752438, 0.09453802,
        0.05923489],
       ...,
       [0.03220041, 0.05143445, 0.05752438, ..., 1.        , 0.05298799,
        0.        ],
       [0.03821966, 0.053418  , 0.09453802, ..., 0.05298799, 1.        ,
        0.        ],
       [0.01796053, 0.04303315, 0.05923489, ..., 0.        , 0.        ,
        1.        ]])

In [12]:
cosine_sim_sm[0]

array([1.        , 0.03091593, 0.0106389 , ..., 0.03220041, 0.03821966,
       0.01796053])
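
To make these numbers more concrete, here is a minimal sketch of the same two steps on three made-up descriptions. The `docs` list is hypothetical and not taken from the dataset; it only shows that texts sharing words get a score between 0 and 1, while texts with no words in common score 0.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy descriptions, not taken from the dataset
docs = [
    "travel guide europe",
    "travel guide hawaii",
    "cooking sourdough bread",
]

counts = CountVectorizer().fit_transform(docs)  # sparse matrix of word counts
sims = cosine_similarity(counts)                # dense 3 x 3 similarity matrix

print(sims.round(2))
```

The first two descriptions share two of their three words, so their similarity is 2/3; the third shares none with either, so its off-diagonal entries are 0.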

# Finding the 5 most similar books to the book at index 0

In [13]:
sim_0 = pd.DataFrame(cosine_sim_sm[0], columns=['sim']).sort_values(by='sim', ascending=False)
sim_0.reset_index(inplace = True)
sim_0.head()

Unnamed: 0,index,sim
0,0,1.0
1,7821,0.255056
2,25699,0.234787
3,25715,0.226362
4,21375,0.224901


In [14]:
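for i in range(1, 6):
    # sim_0's 'index' column holds row positions in books_sm, so titles must be
    # looked up there (the titles printed below were looked up in `books`)
    idx = int(sim_0.loc[i, 'index'])
    print(idx, books_sm['Title'][idx])
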

7821 240 Vocabulary Words 4th Grade Kids Need To Know: 24 Ready-to-Reproduce Packets That Make Vocabulary Building Fun & Effective
25699 On the aesthetic education of man, in a series of letters
25715 The year around: Poems for children
21375 Unless You Repent
26958 Handbook of Hydraulic Resistance


# Making a function that finds books similar to a given title

In [19]:
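def find_similar(title, df, df_col, sims, n=5):
    """Print the n books most similar to `title` according to `sims`."""
    # Row labels of the books whose title matches exactly
    matches = df[df_col == title].index
    if len(matches) == 0:
        print(f'No book titled {title!r} found')
        return
    # Use the first match; rows of sims must line up with df's (reset) index
    sim = pd.DataFrame(sims[matches[0]], columns=['sim'])
    sim = sim.sort_values(by='sim', ascending=False).reset_index()

    # Skip position 0, which is normally the book itself (similarity 1.0)
    for i in range(1, n + 1):
        idx = int(sim.loc[i, 'index'])
        print(idx, df_col[idx])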


In [22]:
title = 'Eyewitness Travel Guide to Europe'
df = books_sm
df_col = books_sm['Title']
sims = cosine_sim_sm


In [23]:
find_similar(title, df, df_col, sims)

18384 Hawaii (Eyewitness Travel Guides)
11835 Denmark (Eyewitness Travel Guides)
25668 South Africa (Eyewitness Travel Guides)
20405 Insight Guides Puerto Rico (Insight Guide Puerto Rico)
18146 Amsterdam (Eyewitness Top 10 Travel Guides)


# What about the book reviews data?

Initially, I also planned to use the review texts from the reviews data. However, that dataset contains more than 2 million rows, and processing it requires more memory than I had available. Therefore, I am basing my recommendations on just the categories and description columns of the books data.
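
One possible way to reduce the memory requirement, not tried in this notebook, is to avoid storing the full similarity matrix and keep only the top few neighbours of each row, computing similarities one chunk of rows at a time. The sketch below is a minimal illustration under that assumption: the function name `top_k_similar`, its parameters, and the toy `docs` are all hypothetical, and `k` is assumed to be smaller than the number of rows.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def top_k_similar(count_matrix, k=5, chunk_size=1000):
    """Return an (n_rows, k) array with the k most similar row indices per row.

    Similarities are computed chunk_size rows at a time, so only a
    chunk_size x n_rows block is held in memory at once.
    """
    n_rows = count_matrix.shape[0]
    top = np.empty((n_rows, k), dtype=np.int64)
    for start in range(0, n_rows, chunk_size):
        stop = min(start + chunk_size, n_rows)
        block = cosine_similarity(count_matrix[start:stop], count_matrix)
        # Exclude each row's similarity with itself
        block[np.arange(stop - start), np.arange(start, stop)] = -1.0
        # Sort each row by descending similarity and keep the first k columns
        top[start:stop] = np.argsort(-block, axis=1)[:, :k]
    return top


# Hypothetical toy descriptions, not taken from the dataset
docs = [
    "travel guide europe",
    "travel guide hawaii",
    "cooking sourdough bread",
    "sourdough bread baking",
]
counts = CountVectorizer().fit_transform(docs)
neighbours = top_k_similar(counts, k=1, chunk_size=2)
print(neighbours)
```

The result stores only `k` integers per row instead of one float per pair of rows, although the total computation time still grows with the square of the row count.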