# Books Recommendation

## Using K-Nearest Neighbors

This Jupyter notebook develops a book recommendation system using the Nearest Neighbors algorithm.

The dataset used is [Books Dataset](https://www.kaggle.com/datasets/saurabhbagchi/books-dataset) from Kaggle. This dataset contains more than 200,000 books and 1,000,000 ratings.

In [1]:
import pandas as pd
import numpy as np

## Importing Datasets 

In [2]:
books = pd.read_csv('data/books_data/books.csv', sep=';', encoding_errors='ignore', on_bad_lines='skip')
ratings = pd.read_csv('data/books_data/ratings.csv', sep=';', encoding_errors='ignore')


In [3]:
books.head(5)

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


In [4]:
ratings.head(5)

Unnamed: 0,User-ID,ISBN,Book-Rating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


In [5]:
books.dtypes

ISBN                   object
Book-Title             object
Book-Author            object
Year-Of-Publication    object
Publisher              object
Image-URL-S            object
Image-URL-M            object
Image-URL-L            object
dtype: object

In [6]:
ratings.dtypes

User-ID         int64
ISBN           object
Book-Rating     int64
dtype: object

### Handling Ratings Dataset

In [7]:
ratings.head(10)

Unnamed: 0,User-ID,ISBN,Book-Rating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6
5,276733,2080674722,0
6,276736,3257224281,8
7,276737,0600570967,6
8,276744,038550120X,7
9,276745,342310538,10


Finding the number of ratings given to each book

In [8]:
rating_counts = ratings.groupby('ISBN')['Book-Rating'].count().sort_values(ascending=False)
rating_counts

ISBN
0971880107    2502
0316666343    1295
0385504209     883
0060928336     732
0312195516     723
              ... 
0671656198       1
0671656279       1
0671656317       1
0671656325       1
Խcrosoft         1
Name: Book-Rating, Length: 340553, dtype: int64

Checking the number of books with more than 100, 50 and 20 ratings

In [9]:
print('More than 100: ', len(rating_counts[rating_counts>100]))
print('More Than 50: ', len(rating_counts[rating_counts>50]))
print('More Than 20:', len(rating_counts[rating_counts>20]))

More than 100:  721
More Than 50:  2125
More Than 20: 7064


We take 20 as the threshold: roughly 7,000 books should be sufficient for a recommendation system, and requiring more than 20 user ratings per book is a reasonable cut-off.
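
When comparing thresholds, it can also help to look at the share of ratings that survive each cut-off, not only the number of books. The following is a minimal sketch on invented data; `toy_ratings` and the `retained_share` helper are hypothetical names and not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the ratings table; the values are invented for illustration.
toy_ratings = pd.DataFrame({
    "ISBN": ["A", "A", "A", "B", "B", "C"],
    "Book-Rating": [5, 0, 8, 7, 0, 10],
})

# Number of ratings per book, built the same way as rating_counts in the notebook.
toy_counts = toy_ratings.groupby("ISBN")["Book-Rating"].count()


def retained_share(counts: pd.Series, threshold: int) -> float:
    """Fraction of all ratings that belong to books with more than `threshold` ratings."""
    return counts[counts > threshold].sum() / counts.sum()


for threshold in (1, 2):
    kept_books = int((toy_counts > threshold).sum())
    print(threshold, kept_books, round(retained_share(toy_counts, threshold), 3))
```

A stricter threshold keeps fewer books but also discards more of the rating rows, so both numbers are worth checking before fixing the value.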

### Appending Ratings Count to Ratings Table

In [10]:
ratings = ratings.merge(rating_counts, on='ISBN', how='left')
ratings.head()

Unnamed: 0,User-ID,ISBN,Book-Rating_x,Book-Rating_y
0,276725,034545104X,0,60
1,276726,0155061224,5,2
2,276727,0446520802,0,116
3,276729,052165615X,3,1
4,276729,0521795028,6,1


Renaming Columns

In [11]:
new_cols = {
    'Book-Rating_x' : 'Book_Rating',
    'Book-Rating_y' : 'Ratings_Count'
}
ratings.rename(columns=new_cols, inplace=True)
ratings.columns

Index(['User-ID', 'ISBN', 'Book_Rating', 'Ratings_Count'], dtype='object')

In [12]:
ratings.isnull().sum()

User-ID          0
ISBN             0
Book_Rating      0
Ratings_Count    0
dtype: int64

No null values as of now
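
As an alternative to merging and then renaming the `_x`/`_y` columns, the per-book count can be attached in one step with `groupby(...).transform('count')`. This is only a sketch on invented data (`toy_ratings` is a hypothetical frame), not the notebook's method.

```python
import pandas as pd

# Toy stand-in for the ratings table; the values are invented for illustration.
toy_ratings = pd.DataFrame({
    "User-ID": [1, 2, 3, 1, 2, 3],
    "ISBN": ["A", "A", "A", "B", "B", "C"],
    "Book-Rating": [5, 0, 8, 7, 0, 10],
})

# transform('count') returns one value per row, aligned with the original index,
# so the count can be assigned as a new column without a merge.
toy_ratings["Ratings_Count"] = (
    toy_ratings.groupby("ISBN")["Book-Rating"].transform("count")
)
toy_ratings = toy_ratings.rename(columns={"Book-Rating": "Book_Rating"})
print(toy_ratings)
```

Because no second table is joined, no suffixed duplicate columns appear and only the single rename of the rating column remains.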

### Now filtering books based on Rating Count

Threshold: Rating Count > 20

In [13]:
print('Shape Before: ', ratings.shape)

ratings = ratings[ratings['Ratings_Count']>20]

print('Shape After: ',ratings.shape)

Shape Before:  (1149780, 4)
Shape After:  (385434, 4)


In [14]:
ratings.head(10)

Unnamed: 0,User-ID,ISBN,Book_Rating,Ratings_Count
0,276725,034545104X,0,60
2,276727,0446520802,0,116
8,276744,038550120X,7,184
10,276746,0425115801,0,134
11,276746,0449006522,0,111
12,276746,0553561618,0,137
13,276746,055356451X,0,170
16,276747,0060517794,9,66
17,276747,0451192001,0,86
18,276747,0609801279,0,22


## Merging Ratings & Books Dataset for final dataset

In [15]:
ratings.reset_index(inplace=True)
ratings.head()

Unnamed: 0,index,User-ID,ISBN,Book_Rating,Ratings_Count
0,0,276725,034545104X,0,60
1,2,276727,0446520802,0,116
2,8,276744,038550120X,7,184
3,10,276746,0425115801,0,134
4,11,276746,0449006522,0,111


In [16]:
books.head()

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


#### Since we are designing a collaborative filtering system, we select only the 'Book-Title' feature, plus 'ISBN' as the key to join the two tables

In [17]:
books_title = books[['ISBN', 'Book-Title']]

In [18]:
data = ratings.merge(books_title, on='ISBN', how='left')
data.head()


Unnamed: 0,index,User-ID,ISBN,Book_Rating,Ratings_Count,Book-Title
0,0,276725,034545104X,0,60,Flesh Tones: A Novel
1,2,276727,0446520802,0,116,The Notebook
2,8,276744,038550120X,7,184,A Painted House
3,10,276746,0425115801,0,134,Lightning
4,11,276746,0449006522,0,111,Manhattan Hunt Club


In [19]:
data.shape

(385434, 6)

#### Checking for null values

In [20]:
data.isnull().sum()

index               0
User-ID             0
ISBN                0
Book_Rating         0
Ratings_Count       0
Book-Title       7799
dtype: int64
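
Null titles after a left merge mean that some ISBNs in the ratings table found no match in the books table. One plausible cause (an assumption, not verified against the files) is that malformed book rows were dropped by `on_bad_lines='skip'`. A sketch of how such rows could be inspected with the `indicator` option of `merge`, using invented frames `toy_ratings` and `toy_books`:

```python
import pandas as pd

# Invented data: ISBN "Z" has ratings but no entry in the books table.
toy_ratings = pd.DataFrame({"ISBN": ["A", "B", "Z"], "Book_Rating": [5, 7, 9]})
toy_books = pd.DataFrame({"ISBN": ["A", "B"], "Book-Title": ["Title A", "Title B"]})

# indicator=True adds a `_merge` column saying where each row was found.
merged = toy_ratings.merge(toy_books, on="ISBN", how="left", indicator=True)

# Rows marked 'left_only' exist in the ratings but not in the books table.
unmatched = merged.loc[merged["_merge"] == "left_only", "ISBN"]
print(unmatched.tolist())
```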

#### Removing the null values

In [21]:
data = data[pd.notnull(data['Book-Title'])]
data.head()

Unnamed: 0,index,User-ID,ISBN,Book_Rating,Ratings_Count,Book-Title
0,0,276725,034545104X,0,60,Flesh Tones: A Novel
1,2,276727,0446520802,0,116,The Notebook
2,8,276744,038550120X,7,184,A Painted House
3,10,276746,0425115801,0,134,Lightning
4,11,276746,0449006522,0,111,Manhattan Hunt Club


In [22]:
data.isnull().sum()

index            0
User-ID          0
ISBN             0
Book_Rating      0
Ratings_Count    0
Book-Title       0
dtype: int64

#### Checking the number of unique users

In [23]:
len(data['User-ID'].unique())

55167

#### With 55,167 users, the pivot table would have 55,167 feature columns, so we drop users with very few ratings

In [24]:
user_rating_count = data.groupby('User-ID')['Ratings_Count'].count().sort_values(ascending=False)
user_rating_count

User-ID
11676     3260
35859     1515
76352     1288
153662    1250
16795     1050
          ... 
120094       1
120147       1
120149       1
120163       1
139242       1
Name: Ratings_Count, Length: 55167, dtype: int64

In [25]:
print('Mean: ', user_rating_count.mean())
print('Median', user_rating_count.median())

Mean:  6.845306070658183
Median 1.0


#### Counting users above candidate thresholds of 200, 100 and 50 ratings

In [26]:
print('More than 200: ', (user_rating_count>200).sum())
print('More than 100: ', (user_rating_count>100).sum())
print('More than  50:', (user_rating_count>50).sum())

More than 200:  234
More than 100:  599
More than  50: 1261


In [27]:
data = data.merge(user_rating_count, on='User-ID', how='left')
data.head()

Unnamed: 0,index,User-ID,ISBN,Book_Rating,Ratings_Count_x,Book-Title,Ratings_Count_y
0,0,276725,034545104X,0,60,Flesh Tones: A Novel,1
1,2,276727,0446520802,0,116,The Notebook,1
2,8,276744,038550120X,7,184,A Painted House,1
3,10,276746,0425115801,0,134,Lightning,4
4,11,276746,0449006522,0,111,Manhattan Hunt Club,4


In [28]:
new_cols = {
    'Ratings_Count_x' : 'Ratings_Count',
    'Ratings_Count_y' : 'User_Ratings_Count'
}
data.rename(columns=new_cols, inplace=True)
data.columns

Index(['index', 'User-ID', 'ISBN', 'Book_Rating', 'Ratings_Count',
       'Book-Title', 'User_Ratings_Count'],
      dtype='object')

#### Extracting rows with User Ratings Count > 200

In [29]:
data = data[data['User_Ratings_Count']>200]
data.head()

Unnamed: 0,index,User-ID,ISBN,Book_Rating,Ratings_Count,Book-Title,User_Ratings_Count
355,1456,277427,002542730X,10,171,Politically Correct Bedtime Stories: Modern Ta...,228
356,1460,277427,0060002050,0,28,On a Wicked Dawn (Cynster Novels),228
357,1464,277427,0060192704,0,25,"Beauty Fades, Dumb Is Forever: The Making of a...",228
358,1465,277427,0060542128,7,41,When the Storm Breaks,228
359,1466,277427,0060913509,0,32,In Country RI,228


In [30]:
data.shape

(90997, 7)

In [31]:
data['Ratings_Count'].describe()

count    90997.000000
mean       103.352154
std        145.024316
min         21.000000
25%         33.000000
50%         59.000000
75%        118.000000
max       2502.000000
Name: Ratings_Count, dtype: float64

The data is now ready to be used in our collaborative filtering recommendation system.

## Model Creation

The model is scikit-learn's `NearestNeighbors` with *metric = 'cosine'* and *algorithm = 'brute'*. For a given book, it finds the *n* books whose rating vectors are closest in cosine distance (1 minus cosine similarity).

To create this model, we need a sparse matrix built from a pivot table with book titles as the index and user IDs as the columns.
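
The whole path from ratings rows to neighbours can be sketched on a tiny invented table. The frame `toy` and its values are assumptions made for illustration; only the calls mirror the ones used in this notebook.

```python
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Invented ratings: Book A and Book B share a reader, Book C does not.
toy = pd.DataFrame({
    "User-ID": [1, 2, 1, 3],
    "Book-Title": ["Book A", "Book A", "Book B", "Book C"],
    "Book_Rating": [8, 6, 8, 9],
})

# Rows are books, columns are users, missing ratings become 0.
pivot = pd.pivot_table(
    toy, values="Book_Rating", index="Book-Title", columns="User-ID"
).fillna(0)

# Store the mostly-zero matrix in compressed sparse row form.
sparse = csr_matrix(pivot.to_numpy())

model = NearestNeighbors(metric="cosine", algorithm="brute")
model.fit(sparse)

# Query with the row of "Book A"; the closest hit is the book itself.
distances, indices = model.kneighbors(sparse[0], n_neighbors=3)
titles = list(pivot.index[indices.flatten()])
print(titles)
print(distances.flatten().round(3))
```

In this toy case Book B is the nearest other book because it shares a rating from user 1 with Book A, while Book C has no reader in common and ends up at distance 1.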

In [32]:
train_data = pd.pivot_table(data, values='Book_Rating', index='Book-Title', columns='User-ID').fillna(0)
train_data.head()

User-ID,3363,6251,6575,7158,7346,11601,11676,12538,13273,13552,...,268330,269566,269719,271284,273979,274061,274308,275970,277427,278418
Book-Title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
'Salem's Lot,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
01-01-00: The Novel of the Millennium,0.0,0.0,0.0,0.0,0.0,0.0,6.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10 Lb. Penalty,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"14,000 Things to Be Happy About",0.0,0.0,0.0,0.0,0.0,0.0,7.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
16 Lighthouse Road,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [33]:
train_data.to_csv('data/source/books_encoded_data.csv')

In [34]:
train_data.shape

(6067, 234)

### Creating Sparse Matrix

In [35]:
from scipy.sparse import csr_matrix
train_data_sparse = csr_matrix(train_data)

### Training Model

In [36]:
from sklearn.neighbors import NearestNeighbors

In [37]:
nn_model = NearestNeighbors(metric='cosine', algorithm='brute')

In [38]:
nn_model.fit(train_data_sparse)

NearestNeighbors(algorithm='brute', metric='cosine')

## Testing A Sample

In [39]:
test_index = np.random.choice(train_data.shape[0])
test_index

3622

In [40]:
train_data.iloc[test_index]

User-ID
3363       0.0
6251      10.0
6575       0.0
7158       0.0
7346       0.0
          ... 
274061     0.0
274308     0.0
275970     0.0
277427     0.0
278418     0.0
Name: Sarah, Plain and Tall (Sarah, Plain and Tall), Length: 234, dtype: float64

In [41]:
sample = np.array(train_data.iloc[test_index]).reshape(1, -1)
sample

array([[ 0., 10.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  0., 10.,  0.,  0.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.

In [42]:
distances, indices = nn_model.kneighbors(sample, n_neighbors = 5)

In [43]:
indices.flatten()

array([3622, 2386, 3660, 5123, 5601])

In [60]:
recs = []
print('Recommendations: ')
for i in range(1, len(indices.flatten())):
    recs.append([train_data.iloc[indices.flatten()[i]].name, distances.flatten()[i]])
recs

Recommendations: 


[['Knocked Out by My Nunga-Nungas : Further, Further Confessions of Georgia Nicolson (Confessions of Georgia Nicolson)',
  0.29289321881345254],
 ['Second Chance (Left Behind: The Kids #2)', 0.29289321881345254],
 ['The Pleasure of My Company: A Novel', 0.29289321881345254],
 ['Titanic Crossing', 0.29289321881345254]]
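
All four recommendations share the same distance, 0.2929, which equals 1 - 1/√2. One scenario consistent with that value (an assumption, not checked against the data) is a query book with two equal ratings and a neighbour rated by exactly one of those two users. The vectors below are hypothetical and only reproduce the arithmetic.

```python
import numpy as np

# Hypothetical vectors: the query has two ratings of 10, the neighbour has a
# single rating that comes from one of the same two users.
query = np.array([10.0, 10.0, 0.0])
neighbour = np.array([0.0, 7.0, 0.0])

# Cosine similarity is the dot product divided by the product of the norms.
cosine_similarity = query @ neighbour / (
    np.linalg.norm(query) * np.linalg.norm(neighbour)
)
cosine_distance = 1.0 - cosine_similarity
print(cosine_distance)
```

The size of the neighbour's single rating does not matter here, since cosine distance depends only on the direction of the vectors. With so few users per book, many neighbours can tie at exactly the same distance.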

## Exporting Model

In [45]:
import pickle
with open('nn_model_book.pkl', 'wb') as f:
    pickle.dump(nn_model, f)
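
The pickled model stores no book titles, so the title index of the matrix has to be kept alongside it when the model is reused. Below is a minimal sketch of a lookup by title after a pickle round trip; `ratings_matrix`, `recommend` and the toy values are hypothetical, and the round trip happens in memory instead of through a file.

```python
import pickle

import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Invented book-by-user matrix standing in for the saved encoded data.
ratings_matrix = pd.DataFrame(
    [[8.0, 6.0, 0.0], [8.0, 0.0, 0.0], [0.0, 0.0, 9.0]],
    index=["Book A", "Book B", "Book C"],
)
model = NearestNeighbors(metric="cosine", algorithm="brute")
model.fit(ratings_matrix.to_numpy())

# Round trip through pickle in memory (the notebook writes to a file instead).
restored = pickle.loads(pickle.dumps(model))


def recommend(title, matrix, fitted_model, k=2):
    """Return up to k (title, distance) pairs closest to `title`, excluding itself."""
    query = matrix.loc[[title]].to_numpy()
    n_neighbors = min(k + 1, len(matrix))
    distances, indices = fitted_model.kneighbors(query, n_neighbors=n_neighbors)
    pairs = zip(matrix.index[indices.flatten()], distances.flatten())
    return [(name, float(dist)) for name, dist in pairs if name != title][:k]


recs = recommend("Book A", ratings_matrix, restored)
print(recs)
```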

In [49]:
temp_df = pd.read_csv('data/source/books_encoded_data.csv', index_col=0)
temp_df.head()

Unnamed: 0_level_0,3363,6251,6575,7158,7346,11601,11676,12538,13273,13552,...,268330,269566,269719,271284,273979,274061,274308,275970,277427,278418
Book-Title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
'Salem's Lot,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
01-01-00: The Novel of the Millennium,0.0,0.0,0.0,0.0,0.0,0.0,6.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10 Lb. Penalty,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"14,000 Things to Be Happy About",0.0,0.0,0.0,0.0,0.0,0.0,7.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
16 Lighthouse Road,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [58]:
np.array(temp_df.loc[temp_df.index == ])

array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 5., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 9.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 7., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 