# Book Recommender System

## Elementary Data Analysis

In [1]:
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
import pickle

The data can be downloaded from Kaggle (<a href="https://www.kaggle.com/datasets/arashnic/book-recommendation-dataset" target="_blank">Book Recommendation Dataset</a>).

In [2]:
books = pd.read_csv('Books.csv')
users = pd.read_csv('Users.csv')
ratings = pd.read_csv('Ratings.csv')


In [3]:
books.head()

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,0393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


In [4]:
ratings.head()

Unnamed: 0,User-ID,ISBN,Book-Rating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


In [5]:
users.head()

Unnamed: 0,User-ID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


In [6]:
print("Shape of books dataset: ", books.shape)
print("Shape of ratings dataset: ",ratings.shape)
print("Shape of users dataset: ",users.shape)

Shape of books dataset:  (271360, 8)
Shape of ratings dataset:  (1149780, 3)
Shape of users dataset:  (278858, 3)


In [7]:
books.isnull().sum()

ISBN                   0
Book-Title             0
Book-Author            2
Year-Of-Publication    0
Publisher              2
Image-URL-S            0
Image-URL-M            0
Image-URL-L            3
dtype: int64

Here, we see that a few rows are missing Book-Author, Publisher, or Image-URL-L. Since only a handful of the 271,360 rows are affected, we handle the missing values by dropping those rows.

In [8]:
books = books.dropna()
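
As a small illustration of what `dropna()` does in the cell above, the sketch below uses a made-up three-row frame (the titles, authors, and publishers are hypothetical, not taken from Books.csv). With default arguments, any row containing at least one missing value is removed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `books`; all values are made up.
toy_books = pd.DataFrame({
    "Book-Title": ["A", "B", "C"],
    "Book-Author": ["Author A", np.nan, "Author C"],
    "Publisher": ["Pub A", "Pub B", None],
})

# Count missing values per column, as books.isnull().sum() does above.
missing_per_column = toy_books.isnull().sum()

# dropna() with default arguments drops every row that has any missing value.
cleaned = toy_books.dropna()

print(missing_per_column)
print(cleaned)
```

Only row "A" survives here, because rows "B" and "C" each have one missing field.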

In [9]:
users.isnull().sum()

User-ID          0
Location         0
Age         110762
dtype: int64

Here, we can see that a lot of "Age" values are missing (110,762 of 278,858 users). However, we keep the column as it is because the "Age" field plays no part in recommending books.

In [10]:
ratings.isnull().sum()

User-ID        0
ISBN           0
Book-Rating    0
dtype: int64

In [11]:
print("Number of Duplicate data points in Books Dataset: ", books.duplicated().sum())
print("Number of Duplicate data points in Ratings Dataset: ", ratings.duplicated().sum())
print("Number of Duplicate data points in Users Dataset: ", users.duplicated().sum())

Number of Duplicate data points in Books Dataset:  0
Number of Duplicate data points in Ratings Dataset:  0
Number of Duplicate data points in Users Dataset:  0


## Popularity Based Recommender System

Here, we merge the ratings with the book metadata on ISBN, so that each rating also carries the book's title, author, publication year, publisher, and image URLs.

In [12]:
ratings_with_name = ratings.merge(books,on='ISBN')
ratings_with_name.sample(5)

Unnamed: 0,User-ID,ISBN,Book-Rating,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
401052,107453,0671670689,10,Dawn (Cutler),V.C. Andrews,1990,Pocket,http://images.amazon.com/images/P/0671670689.0...,http://images.amazon.com/images/P/0671670689.0...,http://images.amazon.com/images/P/0671670689.0...
349363,94347,068486441x,5,Eating The Cheshire Cat: A Novel,Helen Ellis,2001,Scribner,http://images.amazon.com/images/P/068486441X.0...,http://images.amazon.com/images/P/068486441X.0...,http://images.amazon.com/images/P/068486441X.0...
25764,7072,059043411X,8,Vampires Don't Wear Polka Dots (Adventures of ...,Debbie Dadey,1997,Scholastic,http://images.amazon.com/images/P/059043411X.0...,http://images.amazon.com/images/P/059043411X.0...,http://images.amazon.com/images/P/059043411X.0...
19554,4972,8432055913,6,No Digas Que Fue UN Sueno: Marco Antonio Y Cle...,Terenci Moix,1986,Lectorum Pubns (J),http://images.amazon.com/images/P/8432055913.0...,http://images.amazon.com/images/P/8432055913.0...,http://images.amazon.com/images/P/8432055913.0...
288080,76576,080410526X,10,All I Really Need to Know,ROBERT FULGHUM,1989,Ivy Books,http://images.amazon.com/images/P/080410526X.0...,http://images.amazon.com/images/P/080410526X.0...,http://images.amazon.com/images/P/080410526X.0...
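
One detail worth keeping in mind: `merge` performs an inner join by default, so ratings whose ISBN has no match in `books` are silently dropped. The sketch below shows this on made-up data (the ISBNs, titles, and ratings are hypothetical) and uses a left join with `indicator=True` as one possible way to see which ratings found no match.

```python
import pandas as pd

# Hypothetical toy data; ISBNs and titles are made up.
toy_ratings = pd.DataFrame({
    "User-ID": [1, 2, 3],
    "ISBN": ["111", "222", "999"],   # "999" has no matching book
    "Book-Rating": [5, 0, 8],
})
toy_books = pd.DataFrame({
    "ISBN": ["111", "222"],
    "Book-Title": ["Book One", "Book Two"],
})

# merge() defaults to how="inner": only ISBNs present in both frames survive.
merged = toy_ratings.merge(toy_books, on="ISBN")

# A left join with indicator=True marks ratings that found no matching book.
checked = toy_ratings.merge(toy_books, on="ISBN", how="left", indicator=True)
unmatched = checked[checked["_merge"] == "left_only"]

print(len(merged))
print(unmatched["ISBN"].tolist())
```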


Here, we count the number of ratings each book title has received.

In [13]:
number_of_ratings = ratings_with_name.groupby('Book-Title').count()['Book-Rating'].reset_index()
number_of_ratings.rename(columns={'Book-Rating':'num_ratings'},inplace=True)
number_of_ratings.sample(5)

Unnamed: 0,Book-Title,num_ratings
193816,The Killing of Monday Brown (A Phoebe Siegel M...,9
94479,It Had to Be You : A Grace & Favor Mystery...,4
219075,Time Ghost,4
137202,Pale Gray for Guilt,3
108957,Little Boy Blue: And Other Rhymes (My Very Fir...,1


Next, we compute the average rating for each book title. Note that ratings of 0 are included in the mean.

In [14]:
average_ratings = ratings_with_name.groupby('Book-Title')['Book-Rating'].agg(lambda x:x.astype(float).mean()).reset_index()
average_ratings.rename(columns={'Book-Rating':'avg_rating'},inplace=True)
average_ratings

Unnamed: 0,Book-Title,avg_rating
0,A Light in the Storm: The Civil War Diary of ...,2.250000
1,Always Have Popsicles,0.000000
2,Apple Magic (The Collector's series),0.000000
3,"Ask Lily (Young Women of Faith: Lily Series, ...",8.000000
4,Beyond IBM: Leadership Marketing and Finance ...,0.000000
...,...,...
241060,Ã?Â?lpiraten.,0.000000
241061,Ã?Â?rger mit Produkt X. Roman.,5.250000
241062,Ã?Â?sterlich leben.,7.000000
241063,Ã?Â?stlich der Berge.,2.666667


Here, we merge the rating counts with the average ratings, so that each title has both its number of ratings and its average rating. This lets us filter out books with fewer than 250 ratings in the next step.

In [15]:
popularity = number_of_ratings.merge(average_ratings,on='Book-Title')
popularity

Unnamed: 0,Book-Title,num_ratings,avg_rating
0,A Light in the Storm: The Civil War Diary of ...,4,2.250000
1,Always Have Popsicles,1,0.000000
2,Apple Magic (The Collector's series),1,0.000000
3,"Ask Lily (Young Women of Faith: Lily Series, ...",1,8.000000
4,Beyond IBM: Leadership Marketing and Finance ...,1,0.000000
...,...,...,...
241060,Ã?Â?lpiraten.,2,0.000000
241061,Ã?Â?rger mit Produkt X. Roman.,4,5.250000
241062,Ã?Â?sterlich leben.,1,7.000000
241063,Ã?Â?stlich der Berge.,3,2.666667
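
The count, the average, and the merge can also be obtained in a single pass with pandas named aggregation. The following is a minimal sketch on made-up ratings (the titles and values are hypothetical); it is an alternative formulation, not the approach used in the cells above.

```python
import pandas as pd

# Hypothetical toy ratings; titles and values are made up.
toy = pd.DataFrame({
    "Book-Title": ["X", "X", "Y", "Y", "Y"],
    "Book-Rating": [10, 0, 8, 6, 7],
})

# Named aggregation: one output column per (source column, function) pair.
toy_popularity = (
    toy.groupby("Book-Title")
       .agg(num_ratings=("Book-Rating", "count"),
            avg_rating=("Book-Rating", "mean"))
       .reset_index()
)

print(toy_popularity)
```

Here "X" ends up with 2 ratings averaging 5.0 (the 0 rating counts toward the mean), and "Y" with 3 ratings averaging 7.0.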


Here, we keep only the books with at least 250 ratings, sort them by average rating, and take the top 100.

In [16]:
popularity = popularity[popularity['num_ratings']>=250].sort_values('avg_rating',ascending=False).head(100)
popularity.sample(5)

Unnamed: 0,Book-Title,num_ratings,avg_rating
223130,"Tuesdays with Morrie: An Old Man, a Young Man,...",493,4.35497
8751,About a Boy,262,3.900763
80419,Harry Potter and the Goblet of Fire (Book 4),387,5.824289
89777,Icy Sparks,309,3.346278
190741,The Handmaid's Tale,311,3.398714


In [17]:
popularity = popularity.merge(books,on='Book-Title').drop_duplicates('Book-Title')[['Book-Title','Book-Author','Image-URL-M','num_ratings','avg_rating']]

popularity.sample(3)

Unnamed: 0,Book-Title,Book-Author,Image-URL-M,num_ratings,avg_rating
197,Me Talk Pretty One Day,David Sedaris,http://images.amazon.com/images/P/0316776963.0...,457,3.752735
273,White Oleander : A Novel,Janet Fitch,http://images.amazon.com/images/P/0316182540.0...,387,3.50646
276,The Poisonwood Bible,Barbara Kingsolver,http://images.amazon.com/images/P/0060175400.0...,267,3.501873


In [18]:
popularity.sample(1)['Image-URL-M']

229    http://images.amazon.com/images/P/0385722206.0...
Name: Image-URL-M, dtype: object

## Collaborative Filtering

Here, we select the users who have given at least 200 ratings.

In [19]:
x = ratings_with_name.groupby('User-ID').count()['Book-Rating'] >= 200
frequent_users = x[x].index

frequent_users.shape

(816,)

In [20]:
filtered_rating = ratings_with_name[ratings_with_name['User-ID'].isin(frequent_users)]
filtered_rating.shape

(475002, 10)
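
The pattern used above (build a boolean Series, index it with itself to keep the True entries, then filter with `isin`) can be seen on a small scale below. The data is made up, and the threshold of 3 is a hypothetical stand-in for the 200 used in the notebook.

```python
import pandas as pd

# Hypothetical toy ratings; user IDs and values are made up.
toy = pd.DataFrame({
    "User-ID": [1, 1, 1, 2, 3, 3],
    "Book-Rating": [5, 0, 7, 9, 3, 4],
})
min_ratings = 3  # stand-in for the threshold of 200 used above

# Number of ratings per user.
counts = toy.groupby("User-ID")["Book-Rating"].count()

# Boolean Series indexed by user; True where the user meets the threshold.
is_frequent = counts >= min_ratings

# Indexing the Series with itself keeps only the True entries.
frequent = is_frequent[is_frequent].index

# Keep only the ratings made by those users.
filtered = toy[toy["User-ID"].isin(frequent)]

print(list(frequent))
print(filtered.shape)
```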

Here, we keep only the books that have received at least 50 ratings from these frequent users.

In [21]:
y = filtered_rating.groupby('Book-Title').count()['Book-Rating'] >= 50
famous_books = y[y].index

famous_books.shape

(707,)

In [22]:
final_ratings = filtered_rating[filtered_rating['Book-Title'].isin(famous_books)]
final_ratings.shape

(58823, 10)

In [23]:
pt = final_ratings.pivot_table(index='Book-Title',columns='User-ID',values='Book-Rating')
pt.fillna(0,inplace=True)
pt.sample(3)

Book-Title / User-ID,254,2276,2766,2977,3363,4017,4385,6251,6323,6543,...,271705,273979,274004,274061,274301,274308,275970,277427,277639,278418
The Copper Beech,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,10.0,0.0,0.0,0.0,0.0,0.0
Family Album,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Body of Lies,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
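
To make the shape of this table concrete, here is a minimal sketch on made-up ratings (titles, users, and values are hypothetical). It also shows two behaviours of the cell above: `pivot_table` averages multiple ratings that fall into the same (title, user) cell, and `fillna(0)` turns "not rated" into a rating of 0.

```python
import pandas as pd

# Hypothetical toy ratings; user 2 rated "Y" twice (e.g. two editions of one title).
toy = pd.DataFrame({
    "Book-Title": ["X", "X", "Y", "Y"],
    "User-ID": [1, 2, 2, 2],
    "Book-Rating": [10, 6, 4, 8],
})

# Rows are titles, columns are users; duplicate cells are averaged (default aggfunc is the mean).
toy_pt = toy.pivot_table(index="Book-Title", columns="User-ID", values="Book-Rating")

# Missing (title, user) pairs become 0, as in the notebook.
toy_pt = toy_pt.fillna(0)

print(toy_pt)
```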


In [24]:
similarity_scores = cosine_similarity(pt)
similarity_scores.shape

(707, 707)

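In the above cell, we calculated the cosine similarity between every pair of the 707 selected books, using each book's row of user ratings as its vector. Row i of the result holds the similarity of book i to every other book.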
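
For two rating vectors a and b, cosine similarity is the dot product of a and b divided by the product of their lengths. The sketch below computes this with NumPy on a made-up 3x3 matrix (the values are hypothetical). It assumes no row is all zeros, since such a row would cause a division by zero in this simplified version.

```python
import numpy as np

# Hypothetical ratings: rows are books, columns are users.
ratings_matrix = np.array([
    [10.0, 0.0, 8.0],   # book A
    [ 5.0, 0.0, 4.0],   # book B: same direction as A, half the magnitude
    [ 0.0, 9.0, 0.0],   # book C: rated only by a different user
])

# Scale every row to unit length.
norms = np.linalg.norm(ratings_matrix, axis=1, keepdims=True)
unit_rows = ratings_matrix / norms

# Dot products between unit-length rows are the cosine similarities.
manual_scores = unit_rows @ unit_rows.T

print(np.round(manual_scores, 3))
```

Books A and B score 1 because their ratings point in the same direction, even though B's ratings are lower. A and C score 0 because they share no raters.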

In [25]:
def recommend(book_name):
    # Row position of the requested title in the pivot table
    index = np.where(pt.index==book_name)[0][0]
    # Sort all books by similarity, highest first; skip position 0 (the book itself) and keep the next 5
    similar_items = sorted(list(enumerate(similarity_scores[index])),key=lambda x:x[1],reverse=True)[1:6]
    
    data = []
    for i in similar_items:
        # Keep one metadata row per title, since a title can appear under several ISBNs
        temp_df = books[books['Book-Title'] == pt.index[i[0]]].drop_duplicates('Book-Title')
        item = []
        item.extend(list(temp_df['Book-Title'].values))
        item.extend(list(temp_df['Book-Author'].values))
        item.extend(list(temp_df['Image-URL-M'].values))
        
        data.append(item)
    
    return data

In [26]:
recommend('Animal Farm')

[['1984',
  'George Orwell',
  'http://images.amazon.com/images/P/0451524934.01.MZZZZZZZ.jpg'],
 ['Angus, Thongs and Full-Frontal Snogging: Confessions of Georgia Nicolson',
  'Louise Rennison',
  'http://images.amazon.com/images/P/0064472272.01.MZZZZZZZ.jpg'],
 ['Midnight',
  'Dean R. Koontz',
  'http://images.amazon.com/images/P/0425118703.01.MZZZZZZZ.jpg'],
 ['Second Nature',
  'Alice Hoffman',
  'http://images.amazon.com/images/P/0399139087.01.MZZZZZZZ.jpg'],
 ['Call of the Wild',
  'Jack London',
  'http://images.amazon.com/images/P/1559029838.01.MZZZZZZZ.jpg']]

In [27]:
recommend('The Da Vinci Code')

[['Angels & Demons',
  'Dan Brown',
  'http://images.amazon.com/images/P/0671027360.01.MZZZZZZZ.jpg'],
 ['Touching Evil',
  'Kay Hooper',
  'http://images.amazon.com/images/P/0553583441.01.MZZZZZZZ.jpg'],
 ['Saving Faith',
  'David Baldacci',
  'http://images.amazon.com/images/P/0446608890.01.MZZZZZZZ.jpg'],
 ["The Sweet Potato Queens' Book of Love",
  'JILL CONNER BROWNE',
  'http://images.amazon.com/images/P/0609804138.01.MZZZZZZZ.jpg'],
 ['Middlesex: A Novel',
  'Jeffrey Eugenides',
  'http://images.amazon.com/images/P/0312422156.01.MZZZZZZZ.jpg']]
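
The `recommend` function raises an `IndexError` when the title is not in the pivot table. The following is a minimal, self-contained sketch of a more defensive lookup. The helper name `similar_titles`, its parameters, and the toy data are all hypothetical and only illustrate the idea; it returns titles only, not authors or image URLs.

```python
import numpy as np
import pandas as pd

# Hypothetical toy pivot table: rows are books, columns are users.
toy_pt = pd.DataFrame(
    [[10.0, 0.0, 8.0], [5.0, 0.0, 4.0], [0.0, 9.0, 0.0]],
    index=pd.Index(["Book A", "Book B", "Book C"], name="Book-Title"),
)
values = toy_pt.to_numpy()
unit = values / np.linalg.norm(values, axis=1, keepdims=True)
toy_scores = unit @ unit.T  # cosine similarity between rows

def similar_titles(book_name, table, scores, top_n=5):
    """Return up to top_n most similar titles, or [] if the title is unknown."""
    if book_name not in table.index:
        return []
    index = table.index.get_loc(book_name)
    # Positions sorted by similarity, highest first.
    order = np.argsort(scores[index])[::-1]
    # Drop the query book by position instead of assuming it sorts first.
    return [table.index[i] for i in order if i != index][:top_n]

print(similar_titles("Book A", toy_pt, toy_scores))
print(similar_titles("Not in the table", toy_pt, toy_scores))
```

Returning an empty list for unknown titles lets the calling code decide how to report the miss, for example with a message on the webpage.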

As we can see, the recommend function returns the five most similar books for a given title, along with each book's author and image URL.

<br>
Now, we dump the popularity table, the pivot table, the books data, and the similarity scores to pickle files, which will be used by our Streamlit code to generate a webpage.

In [28]:
with open('popular.pkl','wb') as f:
    pickle.dump(popularity,f)
books.drop_duplicates('Book-Title')

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,0393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...
...,...,...,...,...,...,...,...,...
271354,0449906736,Flashpoints: Promise and Peril in a New World,Robin Wright,1993,Ballantine Books,http://images.amazon.com/images/P/0449906736.0...,http://images.amazon.com/images/P/0449906736.0...,http://images.amazon.com/images/P/0449906736.0...
271356,0525447644,From One to One Hundred,Teri Sloat,1991,Dutton Books,http://images.amazon.com/images/P/0525447644.0...,http://images.amazon.com/images/P/0525447644.0...,http://images.amazon.com/images/P/0525447644.0...
271357,006008667X,Lily Dale : The True Story of the Town that Ta...,Christine Wicker,2004,HarperSanFrancisco,http://images.amazon.com/images/P/006008667X.0...,http://images.amazon.com/images/P/006008667X.0...,http://images.amazon.com/images/P/006008667X.0...
271358,0192126040,Republic (World's Classics),Plato,1996,Oxford University Press,http://images.amazon.com/images/P/0192126040.0...,http://images.amazon.com/images/P/0192126040.0...,http://images.amazon.com/images/P/0192126040.0...


In [29]:
# Write each object to its own file; the with-block closes the file after each dump
for filename, obj in [('pt.pkl',pt), ('books.pkl',books), ('similarity_scores.pkl',similarity_scores)]:
    with open(filename,'wb') as f:
        pickle.dump(obj,f)
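
On the consuming side, the pickled objects can be read back with `pickle.load`. The sketch below round-trips a made-up array through a temporary directory. The file name `toy_scores.pkl` and the array are hypothetical, and how the Streamlit code actually loads the files is not shown in this notebook.

```python
import os
import pickle
import tempfile

import numpy as np

# Hypothetical stand-in for similarity_scores.
toy_scores = np.eye(3)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_scores.pkl")

    # Write: the with-block closes the file as soon as the dump finishes.
    with open(path, "wb") as f:
        pickle.dump(toy_scores, f)

    # Read back, as an application consuming the file might do.
    with open(path, "rb") as f:
        loaded = pickle.load(f)

print(np.array_equal(loaded, toy_scores))
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.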

In [30]:
famous_books

Index(['1984', '1st to Die: A Novel', '2nd Chance', '4 Blondes',
       'A Bend in the Road', 'A Case of Need',
       'A Child Called \It\": One Child's Courage to Survive"',
       'A Civil Action', 'A Day Late and a Dollar Short', 'A Fine Balance',
       ...
       'Winter Solstice', 'Wish You Well', 'Without Remorse',
       'Wizard and Glass (The Dark Tower, Book 4)', 'Wuthering Heights',
       'Year of Wonders', 'You Belong To Me',
       'Zen and the Art of Motorcycle Maintenance: An Inquiry into Values',
       'Zoya', '\O\" Is for Outlaw"'],
      dtype='object', name='Book-Title', length=707)

***

## Authors
<a href="https://www.linkedin.com/in/kinjal-mitra-992147325/" target="_blank">Kinjal Mitra</a>

## Change Log

| Date (YYYY-MM-DD) | Version | Changed By | 
| ----------------- | ------- | ---------- | 
| 2025-03-18        | 1.0     | Kinjal Mitra |