### The datasets
We have two datasets:
* *books.csv* contains information for 10,000 books, such as ISBN, authors, title, year, etc.
* *ratings.csv* is a collection of user ratings for these books, ranging from 1 to 5 stars

### The Approach
The recommendation approach we will use is **collaborative filtering**, which makes recommendations for a user based on how other users consume or rate items. The idea is that if two users consume or rate items in a similar way, they probably like the same items.

### Feature Engineering
To employ the approach above, we need to construct a matrix of users and their book ratings. This users-ratings matrix will be sparse, as there are many more books in the dataset than an average user reads or rates, but we can compress the matrix before training the model.

### The Algorithm
We will use an unsupervised version of k-Nearest Neighbors, with cosine distance (one minus cosine similarity) as the distance metric. Since this is an unsupervised learning task, there is no accuracy score to compute, so we cannot use cross-validation; instead, we rely on human judgement to evaluate how well the model recommends books.
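
To make the distance metric concrete, here is a minimal sketch of cosine distance computed by hand with NumPy. The three rating vectors are hypothetical (one entry per user, 0 meaning "not rated") and are not taken from the datasets.

```python
import numpy as np

def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    similarity = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1.0 - similarity

# Hypothetical rating vectors: one entry per user, 0 = not rated
book_a = [5, 4, 0, 0]
book_b = [4, 5, 0, 0]   # rated by the same users as book_a
book_c = [0, 0, 5, 3]   # rated by entirely different users

d_ab = cosine_distance(book_a, book_b)
d_ac = cosine_distance(book_a, book_c)
print(f"{d_ab:.3f} {d_ac:.3f}")
```

Books rated by the same users in a similar way end up with a distance near 0, while books with no raters in common end up with a distance of 1.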

*Enough talking, let's now train our recommendation system!*

## Import modules

In [1]:
import pandas as pd
import numpy as np
import pickle
from scipy.sparse import csr_matrix
import re

from sklearn.neighbors import NearestNeighbors

## 1. Load the datasets

### Ratings dataset

This dataset is straightforward. There is nothing to clean further.

In [2]:
ratings = pd.read_csv("data/ratings.csv")
print(ratings.shape)
ratings.head()

(5976479, 3)


Unnamed: 0,user_id,book_id,rating
0,1,258,5
1,2,4081,4
2,2,260,5
3,2,9296,5
4,2,2318,3


### Books dataset

In [3]:
books = pd.read_csv("data/books.csv")
print(books.shape)
books.head()

(10000, 23)


Unnamed: 0,book_id,goodreads_book_id,best_book_id,work_id,books_count,isbn,isbn13,authors,original_publication_year,original_title,...,ratings_count,work_ratings_count,work_text_reviews_count,ratings_1,ratings_2,ratings_3,ratings_4,ratings_5,image_url,small_image_url
0,1,2767052,2767052,2792775,272,439023483,9780439000000.0,Suzanne Collins,2008.0,The Hunger Games,...,4780653,4942365,155254,66715,127936,560092,1481305,2706317,https://images.gr-assets.com/books/1447303603m...,https://images.gr-assets.com/books/1447303603s...
1,2,3,3,4640799,491,439554934,9780440000000.0,"J.K. Rowling, Mary GrandPré",1997.0,Harry Potter and the Philosopher's Stone,...,4602479,4800065,75867,75504,101676,455024,1156318,3011543,https://images.gr-assets.com/books/1474154022m...,https://images.gr-assets.com/books/1474154022s...
2,3,41865,41865,3212258,226,316015849,9780316000000.0,Stephenie Meyer,2005.0,Twilight,...,3866839,3916824,95009,456191,436802,793319,875073,1355439,https://images.gr-assets.com/books/1361039443m...,https://images.gr-assets.com/books/1361039443s...
3,4,2657,2657,3275794,487,61120081,9780061000000.0,Harper Lee,1960.0,To Kill a Mockingbird,...,3198671,3340896,72586,60427,117415,446835,1001952,1714267,https://images.gr-assets.com/books/1361975680m...,https://images.gr-assets.com/books/1361975680s...
4,5,4671,4671,245494,1356,743273567,9780743000000.0,F. Scott Fitzgerald,1925.0,The Great Gatsby,...,2683664,2773745,51992,86236,197621,606158,936012,947718,https://images.gr-assets.com/books/1490528560m...,https://images.gr-assets.com/books/1490528560s...


To construct our users-ratings matrix, we only need the book ID and title, so we drop all other columns.

In [4]:
cols = ['book_id', 'title']
books = books[cols].copy()  # copy so the later assignment to 'title' does not act on a slice
books.head()

Unnamed: 0,book_id,title
0,1,"The Hunger Games (The Hunger Games, #1)"
1,2,Harry Potter and the Sorcerer's Stone (Harry P...
2,3,"Twilight (Twilight, #1)"
3,4,To Kill a Mockingbird
4,5,The Great Gatsby


## 2. Data cleaning
Many book titles contain extra text in parentheses (such as the series name and number) and stray spaces, so we will remove them.

In [5]:
def clean_book_title(title):
    title = re.sub(r'\([^)]*\)', '', title) # remove any text enclosed in parentheses
    title = re.sub(' +', ' ', title) # convert multiple consecutive spaces into one space
    title = title.strip() # remove whitespace at the beginning and end
    return title

Our books dataset now looks like this after cleaning:

In [6]:
books['title'] = books['title'].apply(clean_book_title)
books.head()

Unnamed: 0,book_id,title
0,1,The Hunger Games
1,2,Harry Potter and the Sorcerer's Stone
2,3,Twilight
3,4,To Kill a Mockingbird
4,5,The Great Gatsby
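
As a quick check of the cleaning step, the sketch below applies the same logic to a few sample titles. The function is repeated so the snippet runs on its own, and the last sample title is hypothetical.

```python
import re

def clean_book_title(title):
    # Same logic as the function defined above, repeated for a standalone run
    title = re.sub(r'\([^)]*\)', '', title)   # drop text in parentheses
    title = re.sub(' +', ' ', title)          # collapse repeated spaces
    return title.strip()                      # trim surrounding whitespace

samples = [
    "The Hunger Games (The Hunger Games, #1)",
    "Twilight (Twilight, #1)",
    "To Kill a Mockingbird",
    "Some Title (a note) Continued",   # hypothetical title
]
cleaned = [clean_book_title(t) for t in samples]
for before, after in zip(samples, cleaned):
    print(f"{before!r} -> {after!r}")
```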


## 3. Construct feature matrix

First, we combine the ratings dataset and the books dataset to get a list of users and their ratings for each book.

In [9]:
combine_book_rating = pd.merge(ratings, books, on='book_id')
print(combine_book_rating.shape)
combine_book_rating.head()

(5976479, 4)


Unnamed: 0,user_id,book_id,rating,title
0,1,258,5,The Shadow of the Wind
1,11,258,3,The Shadow of the Wind
2,143,258,4,The Shadow of the Wind
3,242,258,5,The Shadow of the Wind
4,325,258,4,The Shadow of the Wind
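
`pd.merge` performs an inner join by default, so a rating whose `book_id` has no match in the books table would be dropped. In the output above the row count is unchanged (5976479 before and after), which suggests every rating found a matching book. The toy frames below are hypothetical and only illustrate the join behavior.

```python
import pandas as pd

# Hypothetical toy data: book_id 99 has no entry in books_toy
ratings_toy = pd.DataFrame({
    "user_id": [1, 2, 2],
    "book_id": [10, 10, 99],
    "rating":  [5, 3, 4],
})
books_toy = pd.DataFrame({
    "book_id": [10, 11],
    "title":   ["Book Ten", "Book Eleven"],
})

merged = pd.merge(ratings_toy, books_toy, on="book_id")   # default how="inner"
dropped = len(ratings_toy) - len(merged)                  # ratings without a matching book
print(merged)
print(f"Dropped {dropped} unmatched rating(s).")
```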


Next, we remove rows that share the same user ID and book title. Our matrix is indexed by these two fields, so each (user, title) pair may appear only once. Such duplicates can arise when different book IDs end up with the same cleaned title.

In [10]:
user_ratings = combine_book_rating.drop_duplicates(['user_id', 'title'])
print(f"Removed {combine_book_rating.shape[0] - user_ratings.shape[0]} duplicates.")
print(user_ratings.shape)
user_ratings.head()

Removed 3766 duplicates.
(5972713, 4)


Unnamed: 0,user_id,book_id,rating,title
0,1,258,5,The Shadow of the Wind
1,11,258,3,The Shadow of the Wind
2,143,258,4,The Shadow of the Wind
3,242,258,5,The Shadow of the Wind
4,325,258,4,The Shadow of the Wind
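
The sketch below shows why the de-duplication matters for the pivot in the next step: `DataFrame.pivot` raises a `ValueError` when an (index, column) pair occurs more than once. The data is hypothetical, with two book IDs sharing one title.

```python
import pandas as pd

# Hypothetical data: book_id 1 and 2 share the same cleaned title
toy = pd.DataFrame({
    "user_id": [1, 1, 2],
    "book_id": [1, 2, 1],
    "rating":  [5, 4, 3],
    "title":   ["Example Book", "Example Book", "Example Book"],
})

try:
    toy.pivot(index="title", columns="user_id", values="rating")
    pivot_failed = False
except ValueError:
    pivot_failed = True   # user 1 has two ratings for the same title

# drop_duplicates keeps the first row of each (user_id, title) pair
deduped = toy.drop_duplicates(["user_id", "title"])
matrix = deduped.pivot(index="title", columns="user_id", values="rating").fillna(0)
print(pivot_failed)
print(matrix)
```

Note that `drop_duplicates` keeps whichever rating appears first; averaging the duplicate ratings would be an alternative design, at the cost of an extra group-by step.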


We now pivot to get the users-ratings matrix. Each column is a user and each row is a book. Each entry in the matrix is the rating that user gave that book, with 0 where the user has not rated it.

In [12]:
users_ratings_matrix = user_ratings.pivot(index='title', columns='user_id', values='rating').fillna(0)
users_ratings_matrix.head()

title \ user_id,1,2,3,4,5,6,7,8,9,10,...,53415,53416,53417,53418,53419,53420,53421,53422,53423,53424
"""حكايات فرغلي المستكاوي ""حكايتى مع كفر السحلاوية",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
#GIRLBOSS,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
'Salem's Lot,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0
'Tis,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"1,000 Places to See Before You Die",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


As our matrix is very large and sparse (it contains mostly zeros), we convert it to a compressed sparse row (CSR) matrix before feeding it to the model.

In [13]:
compressed_matrix = csr_matrix(users_ratings_matrix.values)
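
To illustrate what the compression does, here is a small sketch on a hypothetical 4-book by 5-user matrix. A CSR matrix stores only the non-zero entries, plus two index arrays that record where they sit.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 4-book x 5-user ratings matrix; 0 means "not rated"
dense = np.array([
    [5, 0, 0, 0, 3],
    [0, 0, 4, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 2, 0, 0, 0],
], dtype=float)

sparse = csr_matrix(dense)

stored = sparse.nnz                                    # explicitly stored entries
density = stored / (dense.shape[0] * dense.shape[1])   # fraction of non-zero cells
print(stored, density)
print(sparse.data)      # the non-zero ratings, row by row
print(sparse.indices)   # column (user) position of each stored rating
print(sparse.indptr)    # offsets marking where each row starts in data/indices
```

The sparser the matrix, the larger the saving, which is why this step pays off for a ratings matrix where most users have rated only a small fraction of the books.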

## 4. Train kNN model

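As discussed, this is an unsupervised learning task.
We use the *brute* (brute-force) search algorithm because scikit-learn's tree-based neighbor searches do not accept sparse input.
*Cosine distance* (one minus cosine similarity) is used to measure how "close" the rating vectors of any two books are.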

In [14]:
knn = NearestNeighbors(algorithm='brute', metric='cosine')
knn.fit(compressed_matrix)

NearestNeighbors(algorithm='brute', leaf_size=30, metric='cosine',
         metric_params=None, n_jobs=None, n_neighbors=5, p=2, radius=1.0)
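
As a small-scale illustration of what the fitted model returns, the sketch below fits the same kind of model on three hypothetical books and queries it with the first one. The titles and ratings are made up for the example.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

titles = ["Book A", "Book B", "Book C"]   # hypothetical titles
toy_ratings = csr_matrix(np.array([
    [5, 4, 0, 0],   # Book A
    [4, 5, 0, 1],   # Book B: rated much like Book A
    [0, 0, 5, 3],   # Book C: no raters in common with Book A
], dtype=float))

toy_knn = NearestNeighbors(algorithm="brute", metric="cosine")
toy_knn.fit(toy_ratings)

# Query with Book A's own row; results are sorted by increasing distance
distances, indices = toy_knn.kneighbors(toy_ratings[0], n_neighbors=3)
ranked = [titles[i] for i in indices.flatten()]
for title, dist in zip(ranked, distances.flatten()):
    print(f"{title}: {dist:.3f}")
```

The query book comes back as its own nearest neighbor with a distance of about 0, which is why a recommendation function has to request one extra neighbor and skip the first result.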

In [15]:
with open('knn_model.pkl', 'wb') as f:  # the context manager closes the file after writing
    pickle.dump(knn, f)
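
A saved model can later be restored with `pickle.load`. The sketch below is a self-contained round trip on a hypothetical toy model written to a temporary directory, so it does not touch `knn_model.pkl`.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical toy data standing in for the real ratings matrix
data = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
model = NearestNeighbors(algorithm="brute", metric="cosine").fit(data)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_knn.pkl")   # hypothetical file name
    with open(path, "wb") as f:
        pickle.dump(model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should return the same neighbors as the original
_, before = model.kneighbors(data[:1], n_neighbors=2)
_, after = restored.kneighbors(data[:1], n_neighbors=2)
same = np.array_equal(before, after)
print(same)
```

The model returns row positions rather than titles, so anything that reloads it would presumably also need the list of titles (the index of the users-ratings matrix) saved alongside it.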

### (Finally) Get book recommendations!

In [22]:
def get_recommendations(book_title, matrix=users_ratings_matrix, model=knn, topn=5):
    # Row position of the query book in the title-indexed matrix
    book_index = matrix.index.get_loc(book_title)
    # Request topn + 1 neighbors: the closest match is normally the query book itself
    query = matrix.iloc[book_index, :].values.reshape(1, -1)
    distances, indices = model.kneighbors(query, n_neighbors=topn + 1)
    distances, indices = distances.flatten(), indices.flatten()
    print(f'Recommendations for {matrix.index[book_index]}:')
    for i in range(1, len(distances)):  # skip position 0 (the query book)
        print(f'{i}. {matrix.index[indices[i]]}, distance = {distances[i]:.3f}')
    print()
    
get_recommendations("Harry Potter and the Sorcerer's Stone")
get_recommendations("Moby-Dick or, The Whale")
get_recommendations("Little Women")
get_recommendations("Charlie and the Chocolate Factory")

Recommendations for Harry Potter and the Sorcerer's Stone:
1. Harry Potter and the Prisoner of Azkaban, distance = 0.320
2. Harry Potter and the Chamber of Secrets, distance = 0.327
3. Harry Potter and the Goblet of Fire, distance = 0.331
4. Harry Potter and the Order of the Phoenix, distance = 0.342
5. Harry Potter and the Half-Blood Prince, distance = 0.348

Recommendations for Moby-Dick or, The Whale:
1. The Odyssey, distance = 0.701
2. A Tale of Two Cities, distance = 0.703
3. The Adventures of Huckleberry Finn, distance = 0.703
4. Frankenstein, distance = 0.710
5. The Old Man and the Sea, distance = 0.729

Recommendations for Little Women:
1. Pride and Prejudice, distance = 0.519
2. Jane Eyre, distance = 0.539
3. To Kill a Mockingbird, distance = 0.561
4. The Diary of a Young Girl, distance = 0.561
5. Sense and Sensibility, distance = 0.561

Recommendations for Charlie and the Chocolate Factory:
1. James and the Giant Peach, distance = 0.541
2. Matilda, distance = 0.543
3. The Wit