# **Project Name**    - Book Recommendation System



# **Name - Hirapara Paras**

# **GitHub Link -**

https://github.com/parashirapara

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML model used and its performance using an evaluation metric score chart.


*   Cross-validation & hyperparameter tuning

*   Have you seen any improvement? Note down the improvement with an updated evaluation metric score chart.

*   Explain what each evaluation metric indicates for the business and the business impact of the ML model used.




















# ***Let's Begin !***

# **Context**

Over the last few decades, with the rise of YouTube, Amazon, Netflix and many other web services, recommender systems have taken an ever larger place in our lives. From e-commerce (suggesting articles that could interest buyers) to online advertising (suggesting content that matches users' preferences), recommender systems are now hard to avoid in our daily online journeys. In general terms, recommender systems are algorithms that suggest relevant items to users (movies to watch, text to read, products to buy, or anything else, depending on the industry).

Recommender systems are critical in some industries: when they work well they can generate a large amount of income, and they can be a way to stand out from competitors. As evidence of their importance, a few years ago Netflix organised a challenge (the “Netflix Prize”) whose goal was to produce a recommender system that performed better than its own algorithm, with a prize of 1 million dollars.

Using this simple dataset, we will work step by step through different paradigms of recommender algorithms. For each of them, we will present how it works, describe its theoretical basis, and discuss its strengths and weaknesses.

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import seaborn as sns
import matplotlib.pyplot as plt
from PIL import Image
import requests

from scipy.sparse import csr_matrix

from sklearn.neighbors import NearestNeighbors

### Dataset Loading

In [None]:
# Load Dataset

from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset path

books = pd.read_csv("/content/drive/MyDrive/Data Science/Capston Project/Book Recommendation System/Books.csv")
ratings = pd.read_csv("/content/drive/MyDrive/Data Science/Capston Project/Book Recommendation System/Ratings.csv")
users = pd.read_csv("/content/drive/MyDrive/Data Science/Capston Project/Book Recommendation System/Users.csv")

### Dataset First View

In [None]:
# Dataset First Look

books.head()

In [None]:
ratings.head()

In [None]:
users.head()

# **Content**

The Book-Crossing dataset comprises 3 files.

**Users:** Contains the users. Note that user IDs (User-ID) have been anonymized and map to integers. Demographic data is provided (Location, Age) if available. Otherwise, these fields contain NULL values.

**Books:** Books are identified by their respective ISBN. Invalid ISBNs have already been removed from the dataset. Moreover, some content-based information is given (Book-Title, Book-Author, Year-Of-Publication, Publisher), obtained from Amazon Web Services. Note that in case of several authors, only the first is provided. URLs linking to cover images are also given, appearing in three different flavours (Image-URL-S, Image-URL-M, Image-URL-L), i.e., small, medium, large. These URLs point to the Amazon web site.

**Ratings:** Contains the book rating information. Ratings (Book-Rating) are either explicit, expressed on a scale from 1 to 10 (higher values denoting higher appreciation), or implicit, expressed by 0.
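
The explicit/implicit distinction described above can be made concrete with a small sketch. The values below are hypothetical; only the column names (`User-ID`, `ISBN`, `Book-Rating`) come from the dataset, and `toy_ratings`, `explicit` and `implicit` are illustrative names.

```python
import pandas as pd

# Toy stand-in for Ratings.csv (hypothetical values, dataset column names)
toy_ratings = pd.DataFrame({
    "User-ID": [1, 1, 2, 3, 3],
    "ISBN": ["A", "B", "A", "C", "B"],
    "Book-Rating": [0, 8, 10, 0, 5],
})

# Explicit ratings are on the 1-10 scale; a 0 marks an implicit interaction
explicit = toy_ratings[toy_ratings["Book-Rating"] > 0]
implicit = toy_ratings[toy_ratings["Book-Rating"] == 0]

print(len(explicit), "explicit ratings,", len(implicit), "implicit ratings")
print("Mean explicit rating:", explicit["Book-Rating"].mean())
```

This notebook keeps the 0 ratings in the data; whether to drop them before modelling is a design choice, since a 0 is not a low score but the absence of an explicit one.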

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count

print("Shape of books file", books.shape)
print("Shape of ratings file", ratings.shape)
print("Shape of users file", users.shape)

### Dataset Information

In [None]:
# Dataset Info

books.info()

In [None]:
ratings.info()

In [None]:
users.info()

#### Duplicate Values

In [None]:
# Duplicate row count for the books, ratings and users datasets

a=len(books[books.duplicated()])
b=len(ratings[ratings.duplicated()])
c=len(users[users.duplicated()])

print("Duplicate rows in books =",a)
print("Duplicate rows in ratings =",b)
print("Duplicate rows in users =",c)

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count

books.isna().sum()

In [None]:
ratings.isna().sum()

In [None]:
users.isna().sum()

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns

books.columns

In [None]:
# keep only the columns used later; .copy() avoids a SettingWithCopyWarning on later in-place edits

books=books[['ISBN', 'Book-Title', 'Book-Author', 'Year-Of-Publication', 'Publisher','Image-URL-S']].copy()

In [None]:
# rename

books.rename(columns={'Book-Title':'title', 'Book-Author':'author', 'Year-Of-Publication':'year', 'Publisher':'publisher'},inplace=True)

In [None]:
ratings.columns

In [None]:
# rename

ratings.rename(columns={'User-ID':'user-id', 'Book-Rating':'ratings'},inplace=True)

In [None]:
users.columns

In [None]:
# rename

users.rename(columns={'User-ID':'user-id', 'Location':'location', 'Age':'age'},inplace=True)

# **Data Processing**

In [None]:
# Boolean Series indexed by user-id: True for users with more than 200 ratings

x=ratings['user-id'].value_counts()>200

# keep the user-ids (the index labels) where the flag is True

y=x[x].index

In [None]:
# filter the DataFrame

ratings = ratings[ratings['user-id'].isin(y)]

In [None]:
# merge ratings and books data

ratings_with_books = ratings.merge(books,on='ISBN')

In [None]:
# count the number of ratings per title

number_rating = ratings_with_books.groupby('title')['ratings'].count().reset_index()

In [None]:
number_rating.head()

In [None]:
# rename

number_rating.rename(columns={'ratings':'number of ratings'},inplace=True)

In [None]:
# merge

final_rating = ratings_with_books.merge(number_rating,on='title')

In [None]:
final_rating.head()

In [None]:
# keep only the rows for books with 10 or more ratings

final_rating = final_rating[final_rating['number of ratings']>=10].copy()

In [None]:
final_rating.head()

In [None]:
final_rating.shape

In [None]:
# keep a single rating per (user, title) pair

final_rating = final_rating.drop_duplicates(['user-id','title'])

In [None]:
final_rating.shape

# **Weighted Rating-Based Recommendation System**

In [None]:
avg_ratings = final_rating.groupby('title')['ratings'].mean().reset_index().rename(columns={'ratings': 'avg_rating'})

avg = pd.DataFrame(avg_ratings).sort_values('avg_rating',ascending=False)

In [None]:
cnt_ratings = final_rating.groupby('title')['ratings'].count().reset_index().rename(columns={'ratings': 'count_rating'})

cnt = pd.DataFrame(cnt_ratings).sort_values('count_rating',ascending=False)

In [None]:
popularite = avg.merge(cnt,on='title')

In [None]:
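# v = number of ratings per book, R = average rating per book
v = popularite["count_rating"]
R = popularite["avg_rating"]
# m = minimum-votes threshold (90th percentile of the rating counts)
m = v.quantile(0.90)
# c = mean of the per-book average ratings
c = R.mean()
# weighted score: pulls a book's average towards c when it has few ratings
popularite['w_score']=((v*R) + (m*c)) / (v+m)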
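
The score computed above has the form w = (v·R + m·C) / (v + m), a weighted average of a book's own mean rating R and the global mean C, where the number of ratings v decides how much the book's own mean counts. The sketch below uses hypothetical numbers and a fixed `m = 20` (an assumption for illustration; the notebook uses the 90th percentile of the counts) to show how a book with very few ratings is pulled towards the global mean.

```python
import pandas as pd

# Hypothetical per-book statistics: count_rating = v, avg_rating = R
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "count_rating": [100, 2, 50],
    "avg_rating": [9.0, 10.0, 5.0],
})

v = toy["count_rating"]
R = toy["avg_rating"]
m = 20          # fixed threshold, assumed here for illustration only
C = R.mean()    # mean of the per-book averages (8.0 for these numbers)

# Weighted score: the fewer ratings a book has, the closer its score stays to C
toy["w_score"] = (v * R + m * C) / (v + m)

print(toy.sort_values("w_score", ascending=False))
```

By raw average, book B (two ratings of 10) would rank first; with the weighted score, book A ranks first because its high average is backed by many more ratings.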

In [None]:
pop_sort = popularite.sort_values('w_score',ascending=False)

In [None]:
top_10_books = pop_sort.head(10)

plt.figure(figsize=(20, 10))
sns.barplot(x='w_score', y='title', data=top_10_books)
plt.xlabel('w_score')
plt.ylabel('Book-Title')
plt.title('Top 10 Books with Highest w_score')
plt.show()

# **Collaborative filtering recommendation system**

In [None]:
# creating pivot table

rating_pivot = final_rating.pivot_table(columns='user-id',index='title',values='ratings')
rating_pivot.fillna(0,inplace=True)

In [None]:
rating_pivot.head()

In [None]:
# CSR is a commonly used sparse matrix format, which is efficient for storing and performing operations on matrices with a large number of zero elements.

book_sparse = csr_matrix(rating_pivot)
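
To make the CSR format mentioned above more tangible, here is a minimal sketch on a hypothetical 3 x 4 book-by-user matrix (the names `dense` and `sparse` are illustrative). It shows the three arrays a CSR matrix stores in place of the full grid.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 3 x 4 book-by-user rating matrix, mostly zeros
dense = np.array([
    [0, 0, 8, 0],
    [5, 0, 0, 0],
    [0, 0, 0, 10],
])

sparse = csr_matrix(dense)

print("stored values :", sparse.data)      # only the non-zero ratings
print("column indices:", sparse.indices)   # column of each stored value
print("row pointers  :", sparse.indptr)    # where each row starts in data
print("non-zeros     :", sparse.nnz, "of", dense.size, "cells")
```

Only the non-zero entries are stored, so the memory saving grows with the share of unrated (zero) cells in the pivot table.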

In [None]:
# rows of the pivot table are book titles, columns are user-ids
book_user_matrix = rating_pivot.values
book_user_matrix[:10]

In [None]:
# the pivot index holds book titles
book_titles = list(rating_pivot.index)
book_titles[:10]

# **Collaborative filtering : NearestNeighbors model for recommendation system**

Collaborative filtering is a popular technique for building recommendation systems from user preferences and actions.

Collaborative filtering techniques use patterns in how users interact with items (here, books) to generate recommendations. The two primary forms are user-based and item-based collaborative filtering. The model fitted in this notebook is item-based: each row of `rating_pivot` is a book and each column is a user, so the nearest neighbours of a row are similar books. The user-based form is described next for comparison.

# User-Based Collaborative Filtering

This method finds people who share the target user's preferences and then suggests items that those users liked. It assumes that if users A and B have agreed in the past, then what one of them likes, the other may like as well.
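
As a rough sketch of the user-based idea described above, the example below finds the user closest to a target user and suggests books that neighbour rated but the target has not. The matrix, the names `toy_books` and `toy_ratings`, and the choice of cosine distance are assumptions for illustration; the notebook's own model works on books as rows and uses the default metric.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical user-by-book rating matrix (rows = users, 0 = unrated)
toy_books = ["Book A", "Book B", "Book C", "Book D"]
toy_ratings = np.array([
    [9, 8, 0, 0],   # user 0 (target): has not rated C or D
    [9, 7, 8, 0],   # user 1: ratings close to user 0
    [1, 2, 0, 9],   # user 2: ratings far from user 0
])

# Ask for 2 neighbours because the closest row is the target user itself
knn = NearestNeighbors(n_neighbors=2, metric="cosine", algorithm="brute")
knn.fit(toy_ratings)
_, idx = knn.kneighbors(toy_ratings[[0]])
neighbour = [i for i in idx[0] if i != 0][0]

# Suggest books the neighbour rated that the target has not rated yet
suggested = [book for book, mine, theirs
             in zip(toy_books, toy_ratings[0], toy_ratings[neighbour])
             if mine == 0 and theirs > 0]

print("Nearest user:", neighbour, "-> suggestions:", suggested)
```

Swapping rows and columns of the matrix turns the same nearest-neighbour search into the item-based variant, where the neighbours are books instead of users.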


In [None]:
# model creation

model=NearestNeighbors(algorithm='brute')

In [None]:
# fit the model

model.fit(book_sparse)

In [None]:
distances,suggestions=model.kneighbors(rating_pivot.iloc[10,:].values.reshape(1,-1),n_neighbors=6)

In [None]:
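# distances to the six nearest books, then their titles (the nearest is normally the query book itself)
print(distances)
for i in range(len(suggestions)):
    print(rating_pivot.index[suggestions[i]])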

In [None]:
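def recommended_book(book_name):
    # locate the row of the requested title in the pivot table
    matches = np.where(rating_pivot.index == book_name)[0]
    if len(matches) == 0:
        print('Book not found:', book_name)
        return
    book_id = matches[0]
    # six nearest rows; the nearest one is normally the query book itself
    distances, suggestions = model.kneighbors(rating_pivot.iloc[book_id, :].values.reshape(1, -1), n_neighbors=6)
    print('The suggestions for', book_name, 'are:')
    for title in rating_pivot.index[suggestions[0]]:
        print(title)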

In [None]:
recommended_book('Animal Farm')

In [None]:
recommended_book('84 Charing Cross Road')

In this notebook, we built two book recommenders from the Book-Crossing data. The first ranks books by a weighted rating score that combines each book's average rating with its number of ratings. The second is an item-based collaborative filtering model: for a given title, a nearest-neighbors search over the book-by-user rating matrix returns the six books whose rating patterns are closest to it, the nearest of which is normally the queried title itself.