# _Book Recommendation System_
<b>DESCRIPTION</b>

BookRent is the largest online and offline book rental chain in India. The company charges a fixed rental fee for a book per month. Lately, the company has been losing its user base.
The main reason for this is that users are not able to choose the right books for themselves. The company wants to solve this problem and increase its revenue and profit.

<b>Objective:</b> You, as an ML expert, have to model a recommendation engine so that users get recommendations for books based on the behavior of similar users. This will ensure that users are renting books based on their individual tastes.

<b>Actions to Perform:</b>

- Read the books dataset and explore it.
- Clean up NaN values.
- Read the data where ratings are given by users.
- Take a quick look at the number of unique users and books.
- Convert ISBN to numeric indices in the correct order.
- Do the same for user_id. Convert it into numeric order.
- Convert both user_id and ISBN to an ordered list, i.e. from 0 to n-1.
- Re-index columns to build matrix later on.
- Split your data into two sets (training and testing).
- Calculate the cosine similarity.
- Use the evaluation metrics to make predictions.
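
The matrix-building and cosine-similarity steps listed above can be sketched on toy data. This is a minimal illustration under stated assumptions, not the notebook's implementation: the helper names `build_matrix` and `cosine_sim` are hypothetical, the toy ratings are made up, and the train/test split is omitted for brevity.

```python
import numpy as np
import pandas as pd

# Toy ratings with ids already re-indexed to 0..n-1 (illustrative values only).
toy = pd.DataFrame({
    'num_user_id': [0, 0, 1, 1, 2],
    'num_isbn':    [0, 1, 0, 1, 2],
    'rating':      [5, 3, 5, 3, 4],
})

def build_matrix(frame, n_users, n_items):
    """Fill a dense user x item matrix; unrated cells stay 0."""
    matrix = np.zeros((n_users, n_items))
    cols = ['num_user_id', 'num_isbn', 'rating']
    for user, item, rating in frame[cols].itertuples(index=False):
        matrix[user, item] = rating
    return matrix

def cosine_sim(matrix):
    """Row-wise cosine similarity; all-zero rows get similarity 0."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for empty rows
    unit = matrix / norms
    return unit @ unit.T

user_item = build_matrix(toy, n_users=3, n_items=3)
user_similarity = cosine_sim(user_item)
print(np.round(user_similarity, 2))
```

Users 0 and 1 rate the same books identically, so their similarity is 1, while user 2 shares no books with them and gets 0. A dense matrix is fine for a toy example; at the scale of the real data a sparse representation would likely be preferable.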

In [1]:
## Import the libraries
import pandas as pd
import numpy as np
#visualizations
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
#handle warnings
import warnings
warnings.filterwarnings(action='ignore',category=DeprecationWarning)
warnings.filterwarnings(action='ignore',category=FutureWarning)
#consistent sized plots
from pylab import rcParams
rcParams['figure.figsize']=12,5
rcParams['axes.labelsize']=10
rcParams['xtick.labelsize']=10
rcParams['ytick.labelsize']=10

In [2]:
#load the datasets
books = pd.read_csv('BX-Books.csv',delimiter=',',engine='python')
books.head()

Unnamed: 0,isbn,book_title,book_author,year_of_publication,publisher
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company


In [3]:
#check the info .. 
books.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271379 entries, 0 to 271378
Data columns (total 5 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   isbn                 271379 non-null  object
 1   book_title           271379 non-null  object
 2   book_author          271378 non-null  object
 3   year_of_publication  271379 non-null  object
 4   publisher            271377 non-null  object
dtypes: object(5)
memory usage: 10.4+ MB


In [4]:
#load the book users dataset 
users = pd.read_csv('BX-Users.csv',delimiter=',',engine='python')
users.head()

Unnamed: 0,user_id,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


In [5]:
#check the info .. 
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278859 entries, 0 to 278858
Data columns (total 3 columns):
 #   Column    Non-Null Count   Dtype  
---  ------    --------------   -----  
 0   user_id   278859 non-null  object 
 1   Location  278858 non-null  object 
 2   Age       168096 non-null  float64
dtypes: float64(1), object(2)
memory usage: 6.4+ MB


In [6]:
#load the books rating datafile 
ratings = pd.read_csv('BX-Book-Ratings.csv',delimiter=',',engine='python',nrows=50000)
ratings.head()

Unnamed: 0,user_id,isbn,rating
0,276725,034545104X,0
1,276726,155061224,5
2,276727,446520802,0
3,276729,052165615X,3
4,276729,521795028,6


In [7]:
#check the info .. 
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   user_id  50000 non-null  int64 
 1   isbn     50000 non-null  object
 2   rating   50000 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 1.1+ MB


_The full ratings file has over 1 million rows; only the first 50,000 are loaded here (`nrows=50000`)._
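
If the whole file needs to be scanned without loading it at once, `read_csv` can return it in chunks. The sketch below is an assumption about how one might do this, not part of the original notebook; it uses a small in-memory CSV (`csv_text`, hypothetical data) in place of the real ratings file.

```python
import io
from collections import Counter

import pandas as pd

# Stand-in for a large CSV; in practice this would be the path to the ratings file.
csv_text = "user_id,isbn,rating\n" + "\n".join(
    f"{u},{1000 + u},{u % 11}" for u in range(10)
)

total_rows = 0
rating_counts = Counter()
# Passing chunksize makes read_csv yield DataFrames of at most that many rows.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total_rows += len(chunk)
    rating_counts.update(chunk['rating'].value_counts().to_dict())

print(total_rows)
print(sorted(rating_counts.items()))
```

Only one chunk is held in memory at a time, so simple aggregates such as row counts or rating frequencies can be computed over the complete file.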

## _Explore the datasets_

In [8]:
#check the basic stats 
books.describe()

Unnamed: 0,isbn,book_title,book_author,year_of_publication,publisher
count,271379,271379,271378,271379,271377
unique,271379,242150,102042,137,16823
top,048642491X,Selected Poems,Agatha Christie,2002,Harlequin
freq,1,27,632,17627,7535


In [9]:
#check the top 2 rows 
books.head(2)

Unnamed: 0,isbn,book_title,book_author,year_of_publication,publisher
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada


In [10]:
#unique book titles in the dataset
books['book_title'].nunique()

242150

_Some titles are repeated: there are 242,150 unique titles across 271,379 rows._

In [11]:
books['isbn'].nunique()

271379

_The number of unique ISBNs equals the number of rows in `books`, so every ISBN is unique._

In [12]:
#null values
books.isna().sum()

isbn                   0
book_title             0
book_author            1
year_of_publication    0
publisher              2
dtype: int64

_Only three null values in total. Given the size of the dataset, this is negligible, so these rows can be dropped without meaningfully reducing coverage._

In [13]:
#drop the null values 
books.dropna(inplace=True)
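
Before dropping rows, it can help to quantify how many are affected. The snippet below is a small sketch on made-up data (`toy_books` is hypothetical). It also resets the index afterwards, which is optional: `dropna` keeps the original index labels, so gaps remain where rows were removed.

```python
import numpy as np
import pandas as pd

# Made-up frame with two incomplete rows out of four.
toy_books = pd.DataFrame({
    'isbn': ['a', 'b', 'c', 'd'],
    'book_author': ['X', np.nan, 'Y', 'Z'],
    'publisher': ['P', 'Q', np.nan, 'R'],
})

# Count rows that contain at least one null, and their share of the frame.
rows_with_nulls = int(toy_books.isna().any(axis=1).sum())
share = rows_with_nulls / len(toy_books)
print(rows_with_nulls, share)

# Drop incomplete rows and renumber the index from 0.
cleaned = toy_books.dropna().reset_index(drop=True)
print(cleaned)
```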

In [14]:
#check info again -- > now there should not be any null values 
books.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 271376 entries, 0 to 271378
Data columns (total 5 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   isbn                 271376 non-null  object
 1   book_title           271376 non-null  object
 2   book_author          271376 non-null  object
 3   year_of_publication  271376 non-null  object
 4   publisher            271376 non-null  object
dtypes: object(5)
memory usage: 12.4+ MB


In [15]:
#check the top 10 publishers (value_counts already sorts in descending order)
books['publisher'].value_counts().head(10)

Harlequin                   7535
Silhouette                  4220
Pocket                      3905
Ballantine Books            3783
Bantam Books                3646
Scholastic                  3160
Simon & Schuster            2971
Penguin Books               2844
Berkley Publishing Group    2771
Warner Books                2727
Name: publisher, dtype: int64

_Harlequin has the most titles, followed by Silhouette and Pocket._

In [16]:
#explore the user ratings dataset
ratings.head(3)

Unnamed: 0,user_id,isbn,rating
0,276725,034545104X,0
1,276726,155061224,5
2,276727,446520802,0

In [17]:
#check the number of unique users in ratings
ratings['user_id'].nunique()

5064

In [18]:
#check basic stats
ratings['rating'].value_counts().sort_values(ascending=False)

0     29038
8      4999
7      4033
10     3752
9      3127
5      2066
6      1957
4       443
3       333
2       141
1       111
Name: rating, dtype: int64

_Ratings range from 0 to 10, and 0 is by far the most frequent value (29,038 of the 50,000 rows)._

In [19]:
#check the number of unique books / isbn
ratings['isbn'].nunique()

36247

In [20]:
#merge the book and the ratings dataset 
df = pd.merge(ratings,books,on='isbn')
df.head()

Unnamed: 0,user_id,isbn,rating,book_title,book_author,year_of_publication,publisher
0,276725,034545104X,0,Flesh Tones: A Novel,M. J. Rose,2002,Ballantine Books
1,2313,034545104X,5,Flesh Tones: A Novel,M. J. Rose,2002,Ballantine Books
2,6543,034545104X,0,Flesh Tones: A Novel,M. J. Rose,2002,Ballantine Books
3,8680,034545104X,5,Flesh Tones: A Novel,M. J. Rose,2002,Ballantine Books
4,10314,034545104X,9,Flesh Tones: A Novel,M. J. Rose,2002,Ballantine Books
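
`pd.merge` performs an inner join by default, so ratings whose ISBN does not appear in `books` are silently dropped (the loaded ratings contain 36,247 unique ISBNs, while the merged frame has 30,720). The sketch below, on made-up data with hypothetical names, shows one way to audit which keys fail to match by using a left join with `indicator=True`.

```python
import pandas as pd

# Made-up frames: ISBN 'Z' is rated but missing from the books table.
toy_ratings = pd.DataFrame({
    'user_id': [1, 2, 3],
    'isbn': ['A', 'B', 'Z'],
    'rating': [5, 0, 8],
})
toy_books = pd.DataFrame({
    'isbn': ['A', 'B', 'C'],
    'book_title': ['t1', 't2', 't3'],
})

# Default inner join: only ISBNs present in both frames survive.
inner = pd.merge(toy_ratings, toy_books, on='isbn')

# Left join with an indicator column to find ratings without book metadata.
audit = pd.merge(toy_ratings, toy_books, on='isbn', how='left', indicator=True)
unmatched = audit.loc[audit['_merge'] == 'left_only', 'isbn'].tolist()

print(len(inner), unmatched)
```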


## _Explore the merged data_

In [21]:
#unique book titles
df['book_title'].nunique()

28734

In [22]:
#unique publishers
df['publisher'].nunique()

3394

In [23]:
#unique users
df['user_id'].nunique()

4376

In [24]:
#unique book authors
df['book_author'].nunique()

14966

In [25]:
#unique isbn
df['isbn'].nunique()

30720

In [26]:
#ratings of the books
df['rating'].value_counts()

0     26007
8      4330
10     3417
7      3182
9      2815
5      1769
6      1582
4       372
3       294
2       119
1       101
Name: rating, dtype: int64

## _Convert ISBN and user_id to ordered numeric indices_

In [27]:
isbn_list = df['isbn'].unique()
print('Length of the isbn list {}'.format(len(isbn_list)))

def isbn_numeric(isbn):
    '''This function returns the corresponding index from the isbn unique list
       depending on the isbn code'''
    isbn_index = np.where(isbn_list==isbn)
    return isbn_index[0][0]


Length of the isbn list 30720


In [28]:
user_list = df['user_id'].unique()
print('Length of the user list {}'.format(len(user_list)))

def user_numeric(user_id):
    '''This function returns the corresponding index from the user unique list
       depending on the user id'''
    user_index = np.where(user_list==user_id)
    return user_index[0][0]

Length of the user list 4376


In [29]:
#create a new column with numeric user id
df['num_user_id'] = df['user_id'].apply(user_numeric)

In [30]:
#create a new column with numeric isbn
df['num_isbn'] = df['isbn'].apply(isbn_numeric)
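
The two helper functions above search the full unique list with `np.where` once per row, which can become slow on larger frames. As an alternative sketch (not the notebook's method), `pd.factorize` assigns integer codes in order of first appearance, which matches the position of each value in `unique()`. The toy frame below is hypothetical.

```python
import pandas as pd

# Small made-up frame with repeated users and ISBNs.
toy = pd.DataFrame({
    'user_id': [276725, 2313, 276725, 8680],
    'isbn': ['034545104X', '034545104X', '155061224', '034545104X'],
})

# factorize returns (codes, uniques); codes run from 0 to n-1.
toy['num_user_id'], user_index = pd.factorize(toy['user_id'])
toy['num_isbn'], isbn_index = pd.factorize(toy['isbn'])

print(toy)
print(list(user_index))
```

The returned `user_index` and `isbn_index` keep the original identifiers, so a numeric code can be mapped back with `user_index[code]`.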

In [31]:
df.to_csv('recommender.csv',index=False)