In [1]:
import pandas as pd

df = pd.read_csv('BX-Books.csv', on_bad_lines="skip",encoding='latin-1',sep=';', low_memory=False)
df.head()

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271360 entries, 0 to 271359
Data columns (total 8 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   ISBN                 271360 non-null  object
 1   Book-Title           271360 non-null  object
 2   Book-Author          271359 non-null  object
 3   Year-Of-Publication  271360 non-null  object
 4   Publisher            271358 non-null  object
 5   Image-URL-S          271360 non-null  object
 6   Image-URL-M          271360 non-null  object
 7   Image-URL-L          271357 non-null  object
dtypes: object(8)
memory usage: 16.6+ MB


## Removing Duplicates
#### First, let us check whether there are any duplicate book titles. Duplicates add nothing to the recommender and should be removed; the cell below only counts them.

In [3]:
df.duplicated(subset='Book-Title').sum() # 29,225

29225
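
The cell above reports the number of duplicate titles but does not drop them. A minimal sketch of one way to remove them is shown here on a toy frame with made-up rows (`books`, `n_dupes`, and `deduped` are illustrative names). It assumes that keeping the first row seen for each title is acceptable. The notebook does not apply this step, so the later outputs come from the data with duplicates still in it.

```python
import pandas as pd

# Toy frame standing in for the books data; the rows are made up.
books = pd.DataFrame({
    "Book-Title": ["Dune", "Dune", "Emma"],
    "Publisher": ["Ace", "Chilton", "Penguin"],
})

# Count rows whose title has already appeared earlier in the frame.
n_dupes = books.duplicated(subset="Book-Title").sum()

# keep="first" retains the first row seen for each title.
deduped = (
    books.drop_duplicates(subset="Book-Title", keep="first")
         .reset_index(drop=True)
)
print(n_dupes, len(deduped))
```

Unique titles matter later on, because the similarity dataframe uses the titles as column labels and a repeated label makes column lookups ambiguous.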

## Random Sampling
#### As mentioned in the previous section, we need to randomly sample 15,000 rows from the dataframe to avoid running into memory errors:



In [4]:
sample_size = 15000
df = df.sample(n=sample_size, replace=False, random_state=490)

# Renumber the sampled rows 0..n-1 and discard the old index
df = df.reset_index(drop=True)
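
The memory pressure presumably comes from the pairwise similarity matrix built later, which has one row and one column per book. The rough estimate below assumes a dense matrix of 8-byte floats; `dense_similarity_bytes` is a hypothetical helper written for this illustration.

```python
def dense_similarity_bytes(n_rows: int, bytes_per_value: int = 8) -> int:
    """Bytes needed for a dense n_rows x n_rows matrix of float64 values."""
    return n_rows * n_rows * bytes_per_value

# Compare the 15,000-row sample with the full 271,360-row dataset.
for n in (15_000, 271_360):
    print(f"{n:>7} rows -> {dense_similarity_bytes(n) / 1e9:,.1f} GB")
```

Under these assumptions the sample needs about 1.8 GB, while the full dataset would need roughly 589 GB.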

## Processing Text Data
##### Now, let us print the head of the dataframe again:



In [6]:
df.head()

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
0,515098388,Only Make Believe,Marilyn Beck,1988,Jove Books,http://images.amazon.com/images/P/0515098388.0...,http://images.amazon.com/images/P/0515098388.0...,http://images.amazon.com/images/P/0515098388.0...
1,440405912,Too Many Rabbits,Peggy Parish,1992,Yearling Books,http://images.amazon.com/images/P/0440405912.0...,http://images.amazon.com/images/P/0440405912.0...,http://images.amazon.com/images/P/0440405912.0...
2,521046297,The General Prologue to the Canterbury Tales,Geoffrey Chaucer,1965,Cambridge University Press,http://images.amazon.com/images/P/0521046297.0...,http://images.amazon.com/images/P/0521046297.0...,http://images.amazon.com/images/P/0521046297.0...
3,821722506,Fortune's Flames,Janelle Taylor,1988,Zebra Books,http://images.amazon.com/images/P/0821722506.0...,http://images.amazon.com/images/P/0821722506.0...,http://images.amazon.com/images/P/0821722506.0...
4,226848787,The Chinese Maze Murders : A Judge Dee Mystery...,Robert van Gulik,1997,Press,http://images.amazon.com/images/P/0226848787.0...,http://images.amazon.com/images/P/0226848787.0...,http://images.amazon.com/images/P/0226848787.0...


#### Data Cleaning

Before converting the data into word vectors, we need to clean it. First, let’s remove whitespace from the “Book-Author” column. If we do not do this, CountVectorizer will count each author’s first and last names as separate words.
For instance, if one author is named James Clear and another James Patterson, the vectorizer will count the word James in both cases, and the recommender system might consider the books highly similar even though they are not related at all: James Clear writes self-help books, while James Patterson is known for his mystery novels.
Run the following lines of code to lowercase the authors’ names and join the first and last names:



In [7]:
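def clean_text(author):
    # Lowercase the name and strip spaces so it becomes a single token.
    # Note: str() turns a missing author (NaN) into the string 'nan'.
    result = str(author).lower()
    return result.replace(' ', '')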

df['Book-Author'] = df['Book-Author'].apply(clean_text)

df.head()

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
0,515098388,Only Make Believe,marilynbeck,1988,Jove Books,http://images.amazon.com/images/P/0515098388.0...,http://images.amazon.com/images/P/0515098388.0...,http://images.amazon.com/images/P/0515098388.0...
1,440405912,Too Many Rabbits,peggyparish,1992,Yearling Books,http://images.amazon.com/images/P/0440405912.0...,http://images.amazon.com/images/P/0440405912.0...,http://images.amazon.com/images/P/0440405912.0...
2,521046297,The General Prologue to the Canterbury Tales,geoffreychaucer,1965,Cambridge University Press,http://images.amazon.com/images/P/0521046297.0...,http://images.amazon.com/images/P/0521046297.0...,http://images.amazon.com/images/P/0521046297.0...
3,821722506,Fortune's Flames,janelletaylor,1988,Zebra Books,http://images.amazon.com/images/P/0821722506.0...,http://images.amazon.com/images/P/0821722506.0...,http://images.amazon.com/images/P/0821722506.0...
4,226848787,The Chinese Maze Murders : A Judge Dee Mystery...,robertvangulik,1997,Press,http://images.amazon.com/images/P/0226848787.0...,http://images.amazon.com/images/P/0226848787.0...,http://images.amazon.com/images/P/0226848787.0...
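
To see why joining the names matters, the sketch below compares the tokens that CountVectorizer's default analyzer extracts before and after cleaning. `shared_tokens` is a hypothetical helper written for this illustration, and the two author names are the ones from the example in the text.

```python
from sklearn.feature_extraction.text import CountVectorizer

def shared_tokens(a: str, b: str) -> set:
    """Return the tokens CountVectorizer's default analyzer finds in both strings."""
    analyze = CountVectorizer().build_analyzer()
    return set(analyze(a)) & set(analyze(b))

# Before cleaning, the two authors share the token 'james'.
print(shared_tokens("James Clear", "James Patterson"))

# After cleaning, each name is a single token and nothing is shared.
print(shared_tokens("jamesclear", "jamespatterson"))
```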


In [8]:
df['Book-Title'] = df['Book-Title'].str.lower()
df['Publisher'] = df['Publisher'].str.lower()

In [9]:
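# Keep only the text columns: Book-Title, Book-Author, Publisher
df2 = df.drop(['ISBN','Image-URL-S','Image-URL-M','Image-URL-L','Year-Of-Publication'],axis=1)

# Join the text columns into one string per book.
# Note: df2.columns[1:] skips the first column (Book-Title), so only
# Book-Author and Publisher end up in 'data', as the output below shows.
df2['data'] = df2[df2.columns[1:]].apply(
    lambda x: ' '.join(x.dropna().astype(str)),
    axis=1
)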

print(df2['data'].head())

0                        marilynbeck jove books
1                    peggyparish yearling books
2    geoffreychaucer cambridge university press
3                     janelletaylor zebra books
4                          robertvangulik press
Name: data, dtype: object


#### Vectorize the Dataframe
###### Finally, we can apply Scikit-Learn’s CountVectorizer() on the combined text data:



In [14]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorized = vectorizer.fit_transform(df2['data'])
#print(vectorized)


### Building the Recommendation System
###### Now, we will use a similarity measure called cosine similarity to compare the bag-of-words vectors of each pair of books. It is the cosine of the angle between two vectors, so it tells us how closely they point in the same direction.
###### For count vectors, whose entries are never negative, cosine similarity ranges between 0 and 1. A value of 0 indicates that the two vectors share no words, while 1 tells us that they point in exactly the same direction (identical word proportions).
##### Run the following lines of code to apply cosine similarity on the vector we created:



In [15]:
from sklearn.metrics.pairwise import cosine_similarity

similarities = cosine_similarity(vectorized)
# Print the similarity matrix:

print(similarities)

[[1.         0.33333333 0.         ... 0.         0.         0.33333333]
 [0.33333333 1.         0.         ... 0.         0.         0.33333333]
 [0.         0.         1.         ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 1.         0.         0.        ]
 [0.         0.         0.         ... 0.         1.         0.        ]
 [0.33333333 0.33333333 0.         ... 0.         0.         1.        ]]
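
The 0.33333333 between the first two books can be checked by hand. Their combined strings, "marilynbeck jove books" and "peggyparish yearling books", each contain three words and share only "books", so the dot product is 1 and each vector has length √3, giving 1 / (√3 · √3) = 1/3. The sketch below recomputes this with NumPy and compares it with scikit-learn; the variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Combined strings for the first two sampled books, taken from the output above.
rows = ["marilynbeck jove books", "peggyparish yearling books"]
vectors = CountVectorizer().fit_transform(rows).toarray()
a, b = vectors[0], vectors[1]

# Cosine similarity = dot product divided by the product of the vector lengths.
manual = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# The same pair computed by scikit-learn.
library = cosine_similarity(vectors)[0, 1]
print(manual, library)
```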


In [16]:
df = pd.DataFrame(similarities, columns=df['Book-Title'], index=df['Book-Title']).reset_index()

df.head()


Book-Title,Book-Title.1,only make believe,too many rabbits,the general prologue to the canterbury tales,fortune's flames,"the chinese maze murders : a judge dee mystery (gulik, robert hans, judge dee mystery.)","what a man's got to do (home on the ranch) (harlequin superromance, 824)","when my parents were my age, they were old : or, who are you calling middle-aged?",pass the loot : a fox trotcollection,vegetarian classics: 300 essential recipes for every course and every meal,...,lisa in new york (the misadventures of gaspard and lisa),contact the first four minutes,stand up speak out,dark canyon,einstein's universe,"ghost squadron (gunsmith, 245)",lentejuelas,aerobic dance exercise,celtic borders,bringing out betsy
0,only make believe,1.0,0.333333,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,...,0.0,0.333333,0.0,0.288675,0.0,0.666667,0.0,0.0,0.0,0.333333
1,too many rabbits,0.333333,1.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,...,0.0,0.333333,0.0,0.288675,0.0,0.333333,0.0,0.0,0.0,0.333333
2,the general prologue to the canterbury tales,0.0,0.0,1.0,0.0,0.353553,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,fortune's flames,0.333333,0.333333,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.333333,0.0,0.288675,0.0,0.333333,0.0,0.0,0.0,0.666667
4,the chinese maze murders : a judge dee mystery...,0.0,0.0,0.353553,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Displaying User Recommendations
###### Finally, let’s use the dataframe above to display book recommendations. If a book is entered as input, the most similar books must be returned. The cell below takes the top 100 matches and then drops the input book itself.

###### Let us do this using “the general prologue to the canterbury tales” as input:



In [18]:
input_book = "the general prologue to the canterbury tales"
recommendations = pd.DataFrame(df.nlargest(100,input_book)['Book-Title'])
recommendations = recommendations[recommendations['Book-Title']!=input_book]
print(recommendations)

                                              Book-Title
2027   the wife of bath's prologue and tale (selected...
1609   taking care of men : sexual politics in the pu...
2120                  money and the morality of exchange
2938   plato: the republic (cambridge texts in the hi...
3066                              medieval women (canto)
...                                                  ...
11265  welsh legends and folk-tales (oxford myths and...
11313                                    birds of europe
11362  the strange case of dr. jekyll and mr. hyde an...
11521                                 the power of ideas
11559                  six memos for the next millennium

[99 rows x 1 columns]
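
The lookup above could be wrapped in a small function. The sketch below is one possible shape, not the notebook's own code: `recommend` is a hypothetical helper, the toy similarity values are made up, and it assumes a square frame with unique titles as both index and columns (unlike `df` above, where `reset_index()` moved the titles into a column).

```python
import pandas as pd

def recommend(sim: pd.DataFrame, title: str, n: int = 10) -> pd.Series:
    """Return the n titles most similar to `title`, excluding the title itself.

    `sim` must be a square similarity frame whose index and columns are
    the same unique titles.
    """
    scores = sim[title].drop(labels=title)  # remove the book's match with itself
    return scores.nlargest(n)               # highest similarity first

# Made-up similarity values for four placeholder titles.
titles = ["a", "b", "c", "d"]
toy = pd.DataFrame(
    [[1.0, 0.5, 0.0, 0.2],
     [0.5, 1.0, 0.3, 0.0],
     [0.0, 0.3, 1.0, 0.7],
     [0.2, 0.0, 0.7, 1.0]],
    index=titles, columns=titles,
)

top = recommend(toy, "a", n=2)
print(top)
```

Returning the scores along with the titles makes it easier to see where ties occur, which is common here because many books share only a single word such as "books".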
