# Book recommendation system
Give me your favorite books, and I will recommend other books that you are likely to enjoy.

How this app works

* Get the ratings of each book
* Find users who rated your favorite book highly
* Find other books that those users also rated highly
* Those are the books you would probably like


I will use `cosine distance` to measure how similar one book is to another.
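
As a small illustration of the idea, the sketch below computes the cosine distance between made-up rating vectors. The helper `cosine_distance` and the three toy books are assumptions for this example only and are not part of the notebook.

```python
import numpy as np

def cosine_distance(a, b):
    """Return 1 - cos(angle between a and b)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical ratings given by the same four users to three books (0 = no rating)
book_a = [8, 0, 6, 0]
book_b = [4, 0, 3, 0]  # rated by the same users, in the same proportions as book_a
book_c = [0, 7, 0, 9]  # rated by entirely different users

d_ab = cosine_distance(book_a, book_b)  # close to 0: the vectors point the same way
d_ac = cosine_distance(book_a, book_c)  # 1: the books share no raters
```

A distance near 0 means two books were rated in a similar pattern by the same users, and a distance of 1 means they have no raters in common.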

---

## Import dataset

* `book` - book title, book id
* `user` - user id
* `rate` - user id, book id and rating

In [1]:
import pandas as pd

In [2]:
book = pd.read_csv('../data/BX_Books.csv', sep=';', encoding='latin-1')
book.head()

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


In [3]:
user = pd.read_csv('../data/BX-Users.csv', sep=';', encoding='latin-1')
user.head()

Unnamed: 0,User-ID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


In [4]:
rate = pd.read_csv('../data/BX-Book-Ratings.csv', sep=';', encoding='latin-1')
rate.head()

Unnamed: 0,User-ID,ISBN,Book-Rating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


---

## Cleaning and EDA

### Merge all datasets

In [5]:
df = pd.merge(user, rate, how='inner', left_on='User-ID', right_on='User-ID')
df = pd.merge(df, book, how='inner', left_on='ISBN', right_on='ISBN')
df.head()

Unnamed: 0,User-ID,Location,Age,ISBN,Book-Rating,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
0,2,"stockton, california, usa",18.0,195153448,0,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,8,"timmins, ontario, canada",,2005018,5,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,11400,"ottawa, ontario, canada",49.0,2005018,0,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
3,11676,"n/a, n/a, n/a",,2005018,8,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
4,41385,"sudbury, ontario, canada",,2005018,0,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
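
To see what the two inner joins keep and drop, here is a minimal sketch on toy data. The frames `user_toy`, `rate_toy` and `book_toy` are invented for illustration. The `validate` argument is optional and was not used in the cell above; it only makes pandas raise an error if the keys are not unique on the side where they are expected to be.

```python
import pandas as pd

# Toy stand-ins for the three datasets (hypothetical values)
user_toy = pd.DataFrame({'User-ID': [1, 2, 3], 'Location': ['a', 'b', 'c']})
rate_toy = pd.DataFrame({'User-ID': [1, 1, 2, 9],
                         'ISBN': ['x1', 'x2', 'x1', 'x3'],
                         'Book-Rating': [5, 0, 8, 7]})
book_toy = pd.DataFrame({'ISBN': ['x1', 'x2'],
                         'Book-Title': ['Book One', 'Book Two']})

# Inner join: user 3 (no ratings) and user 9 (not in user_toy) are dropped
merged = pd.merge(user_toy, rate_toy, how='inner', on='User-ID',
                  validate='one_to_many')
# Inner join: only ratings whose ISBN exists in book_toy survive
merged = pd.merge(merged, book_toy, how='inner', on='ISBN',
                  validate='many_to_one')
```

Because both joins are inner joins, a rating is kept only when both its user and its book are present.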


--- 

## Assemble dataframe

In [6]:
# Find missing values
df.isnull().sum()

User-ID                     0
Location                    0
Age                    277845
ISBN                        0
Book-Rating                 0
Book-Title                  0
Book-Author                 1
Year-Of-Publication         0
Publisher                   2
Image-URL-S                 0
Image-URL-M                 0
Image-URL-L                 0
dtype: int64

In [7]:
df.shape

(1031175, 12)

The merged dataset has more than a million rows, which is too large for this app. I will keep only the influential rows, that is, the rows that meet both conditions:

* The row has no missing values
* The book has more than 50 reviews

This app relies on the ratings given by users, so it is easier to measure similarity between books that have many reviews.

In [8]:
# Drop all missing values
df.dropna(inplace=True)

In [9]:
# Filter the books which have more than 50 reviews
# Set `Book-Title` as the index to filter the book.
# Reset the index after filtering

df.set_index('Book-Title', inplace=True, drop=False)
df = df.loc[df.index.value_counts()[df.index.value_counts()>50].index]
df.reset_index(inplace=True, drop=True)
df.shape

(167559, 12)
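
The same filter can also be written without changing the index. The sketch below is one possible alternative on toy data, not the code used in this notebook; `toy` and `min_reviews` are hypothetical names, and the threshold is lowered to 2 so that the small example has something to filter.

```python
import pandas as pd

toy = pd.DataFrame({'Book-Title': ['A', 'A', 'A', 'B', 'C', 'C'],
                    'Book-Rating': [5, 7, 0, 9, 3, 8]})
min_reviews = 2  # hypothetical threshold; the notebook uses 50

# Map every row to the number of reviews its title has
review_counts = toy['Book-Title'].map(toy['Book-Title'].value_counts())

# Keep only rows whose title has more than `min_reviews` reviews
filtered = toy[review_counts > min_reviews].reset_index(drop=True)
```

Here only book `A` has more than 2 reviews, so `filtered` contains its three rows.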

### Create a pivot table to see the relation between books, users and ratings

In [10]:
pivot = df.pivot_table(index='Book-Title', columns='User-ID', values='Book-Rating')
pivot.head()

User-ID,42,44,51,67,75,99,114,125,132,144,...,278807,278819,278820,278824,278832,278836,278843,278844,278846,278851
16 Lighthouse Road,,,,,,,,,,,...,,,,,,,,,,
1984,,,,,,,,,,,...,,,,,,,,,,
1st to Die: A Novel,,,,,,,,,,,...,,,,,,,,,,
2010: Odyssey Two,,,,,,,,,,,...,,,,,,,,,,
204 Rosewood Lane,,,,,,,,,,,...,,,,,,,,,,
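
A toy example may help show what `pivot_table` does here. By default it aggregates with the mean, so if one user has several ratings under the same title (for example two editions sharing one title), they are averaged into a single cell. The frame `toy` below is invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'Book-Title': ['A', 'A', 'A', 'B'],
    'User-ID': [1, 1, 2, 2],          # user 1 rated title A twice
    'Book-Rating': [4, 8, 10, 6],
})

# Rows are titles, columns are users, cells are the mean rating
toy_pivot = toy.pivot_table(index='Book-Title', columns='User-ID',
                            values='Book-Rating')
```

User 1's two ratings of `A` become a single value of 6.0, and the cell for `B` and user 1 is `NaN` because that user never rated `B`.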


### Generate `cosine distance`
I will use `pairwise_distances` to compute it. First, I fill the missing ratings in `pivot` with 0 and convert it to a sparse matrix using `scipy.sparse`.

In [11]:
from scipy import sparse
pivot_sparse = sparse.csr_matrix(pivot.fillna(0))

In [12]:
from sklearn.metrics.pairwise import pairwise_distances
cos_distance = pairwise_distances(pivot_sparse, metric='cosine')
recommender = pd.DataFrame(cos_distance, columns=pivot.index, index=pivot.index)
recommender.head()

Book-Title,16 Lighthouse Road,1984,1st to Die: A Novel,2010: Odyssey Two,204 Rosewood Lane,24 Hours,2nd Chance,3rd Degree,4 Blondes,84 Charing Cross Road,...,"Word Freak: Heartbreak, Triumph, Genius, and Obsession in the World of Competitive Scrabble Players",Wouldn't Take Nothing for My Journey Now,Writ of Execution,Wuthering Heights,Wuthering Heights (Penguin Classics),Year of Wonders,You Belong To Me,Zen and the Art of Motorcycle Maintenance: An Inquiry into Values,Zoya,"\O\"" Is for Outlaw"""
16 Lighthouse Road,0.0,1.0,0.982887,0.928254,0.719717,1.0,0.900733,0.944899,1.0,1.0,...,1.0,1.0,0.929309,1.0,1.0,1.0,0.947537,0.9531,1.0,0.966311
1984,1.0,0.0,0.97844,0.94308,1.0,1.0,1.0,1.0,1.0,0.971825,...,0.922686,1.0,1.0,0.995629,1.0,0.969629,1.0,0.959808,0.982339,0.989242
1st to Die: A Novel,0.982887,0.97844,0.0,0.989116,1.0,0.983159,0.841914,0.912649,1.0,1.0,...,0.982392,1.0,0.953232,0.978736,1.0,0.98704,0.963921,0.989861,0.979865,0.958263
2010: Odyssey Two,0.928254,0.94308,0.989116,0.0,1.0,1.0,1.0,1.0,1.0,1.0,...,0.972318,1.0,1.0,0.964341,1.0,0.979625,1.0,0.955257,1.0,1.0
204 Rosewood Lane,0.719717,1.0,1.0,1.0,0.0,1.0,0.934972,0.889507,1.0,1.0,...,1.0,1.0,0.925392,1.0,1.0,1.0,1.0,1.0,1.0,1.0
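
The same steps can be checked on a pivot table small enough to verify by hand. This is a sketch on made-up data; `toy_pivot` and the other `toy_*` names are assumptions for this example. It also confirms that the cosine distance equals 1 minus the cosine similarity.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.metrics.pairwise import pairwise_distances, cosine_similarity

# Three books rated by three users (NaN = no rating)
toy_pivot = pd.DataFrame(
    [[8, np.nan, 6],
     [4, np.nan, 3],
     [np.nan, 7, np.nan]],
    index=['A', 'B', 'C'], columns=[1, 2, 3])

# Missing ratings become 0, then the table is stored as a sparse matrix
toy_sparse = sparse.csr_matrix(toy_pivot.fillna(0))

# Cosine distance between every pair of rows (books)
toy_distance = pairwise_distances(toy_sparse, metric='cosine')
toy_recommender = pd.DataFrame(toy_distance, index=toy_pivot.index,
                               columns=toy_pivot.index)

# Cosine distance is 1 - cosine similarity
matches = np.allclose(toy_distance, 1 - cosine_similarity(toy_sparse))
```

Books `A` and `B` were rated by the same users in the same proportions, so their distance is about 0, while `A` and `C` share no raters and have a distance of 1.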


### `recommender` is a DataFrame which contains the cosine distance between every pair of books.
Look up a book title in the columns. The closer a score is to 0, the more similar the book in the index is.

In [13]:
# I like the book `16 Lighthouse Road`, so I may like the books below (the first entry is the book itself).
# `204 Rosewood Lane` is the closest to 0, so it is relatively the most similar.
recommender['16 Lighthouse Road'].sort_values()[:10]

Book-Title
16 Lighthouse Road                                       0.000000
204 Rosewood Lane                                        0.719717
Macgregor Brides (Macgregors)                            0.778454
Purity in Death                                          0.784723
Rising Tides                                             0.796446
The Morning After                                        0.809761
Seduction in Death                                       0.814512
Dangerous                                                0.820282
Imitation in Death (Eve Dallas Mysteries (Paperback))    0.822327
Girls Night                                              0.826423
Name: 16 Lighthouse Road, dtype: float64

---

## Function to find recommended books

In [14]:
def get_recommendations(title):
    # Take the 20 closest titles (the book itself is one of them) and print the others
    for i in recommender[title].sort_values()[:20].index:
        if i != title:
            print(i)

In [15]:
# Provide a favorite book, and I will find similar books for you
get_recommendations('16 Lighthouse Road')

204 Rosewood Lane
Macgregor Brides (Macgregors)
Purity in Death
Rising Tides
The Morning After
Seduction in Death
Dangerous
Imitation in Death (Eve Dallas Mysteries (Paperback))
Girls Night
Sanctuary
Dance upon the Air (Three Sisters Island Trilogy)
Tears of the Moon (Irish Trilogy)
The Heir
Night Whispers
Key of Knowledge (Key Trilogy (Paperback))
Private Eyes (Alex Delaware Novels (Paperback))
Betrayal in Death
Macgregor Grooms (Macgregors)
Bygones
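
If the recommendations need to be reused rather than printed, a variant that returns a list might look like the sketch below. The function `recommend`, its parameter `n`, and the toy distance table are hypothetical and are not part of the notebook. The sketch assumes a square distance DataFrame with unique titles, like `recommender`.

```python
import pandas as pd

def recommend(distances, title, n=10):
    """Return the n titles closest to `title`, excluding the title itself."""
    if title not in distances.columns:
        raise KeyError(f"Unknown title: {title!r}")
    # Drop the book itself, then take the n smallest distances
    return distances[title].drop(labels=title).nsmallest(n).index.tolist()

# Toy distance table standing in for `recommender`
toy = pd.DataFrame(
    [[0.0, 0.2, 0.9],
     [0.2, 0.0, 0.5],
     [0.9, 0.5, 0.0]],
    index=['A', 'B', 'C'], columns=['A', 'B', 'C'])

closest_to_a = recommend(toy, 'A', n=2)
closest_to_c = recommend(toy, 'C', n=1)
```

Returning a list keeps the function easy to test, and raising `KeyError` for an unknown title gives a clearer message than a failed column lookup.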
