## Book Recommendation System

**Author:** Gozde Turan

Submitted to Fellowship.ai on 12 August 2018

In this work, I will build a book recommendation system using the data provided at http://www2.informatik.uni-freiburg.de/~cziegler/BX/ and apply five basic steps to accomplish the task:
1. Exploring the data provided
2. Filtering datasets
3. Splitting training and test data
4. Evaluating the algorithms and techniques
5. Creating a function to provide recommendations for a given user

In [57]:
# to ignore warnings on runtime
import warnings
warnings.filterwarnings('ignore')

# pandas library helps to read and parse csv files easily
import pandas as pd

### Exploring the data

First, I will read all three csv files (BX-Books.csv, BX-Users.csv and BX-Book-Ratings.csv) from their locations on disk and print the first 5 rows of each.

In [58]:
# Read books
books = pd.read_csv('/Users/gozdeturan/Downloads/BX-CSV-Dump/BX-Books.csv', sep=';', error_bad_lines=False, encoding="latin-1")
# Set columns for books
books.columns = ['ISBN', 'BookTitle', 'BookAuthor', 'YearOfPublication', 'Publisher', 'ImageUrlS', 'ImageUrlM', 'ImageUrlL']
# Print first 5 books to see a sample content
books.head()

Unnamed: 0,ISBN,BookTitle,BookAuthor,YearOfPublication,Publisher,ImageUrlS,ImageUrlM,ImageUrlL
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


In [59]:
# Read users
users = pd.read_csv('/Users/gozdeturan/Downloads/BX-CSV-Dump/BX-Users.csv', sep=';', error_bad_lines=False, encoding="latin-1")
# Set columns for users
users.columns = ['UserID', 'Location', 'Age']
# Print first 5 users to see a sample content
users.head()

Unnamed: 0,UserID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


In [60]:
# Read book ratings
book_ratings =  pd.read_csv('/Users/gozdeturan/Downloads/BX-CSV-Dump/BX-Book-Ratings.csv', sep=';', error_bad_lines=False, encoding="latin-1")
# Set columns for book ratings
book_ratings.columns = ['UserID', 'ISBN', 'BookRating']
# Print first 5 book ratings to see a sample content
book_ratings.head()

Unnamed: 0,UserID,ISBN,BookRating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


### Filtering datasets

I will filter the datasets so that they fit my computer's processing capacity and so that the results rely on more informative users and books. For example, I will discard users who have rated fewer than 100 books, and I also intend to discard books that have received fewer than 200 ratings.
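
A minimal sketch of this kind of threshold filtering on made-up data, assuming a DataFrame with the same `UserID`, `ISBN` and `BookRating` columns. The toy ratings and the names `min_user_ratings` and `min_book_ratings` are illustrative assumptions, not values taken from this notebook. Users are counted per `UserID` and books per `ISBN`.

```python
import pandas as pd

# Toy ratings (hypothetical data, for illustration only)
toy = pd.DataFrame({
    'UserID':     [1, 1, 1, 2, 2, 3],
    'ISBN':       ['A', 'B', 'C', 'A', 'B', 'A'],
    'BookRating': [8, 7, 9, 6, 10, 5],
})

min_user_ratings = 2   # assumed small threshold for the toy data
min_book_ratings = 2   # assumed small threshold for the toy data

# Keep users who gave at least min_user_ratings ratings
user_counts = toy['UserID'].value_counts()
active_users = user_counts[user_counts >= min_user_ratings].index
filtered = toy[toy['UserID'].isin(active_users)]

# Keep books that received at least min_book_ratings ratings (counted per ISBN)
book_counts = filtered['ISBN'].value_counts()
popular_books = book_counts[book_counts >= min_book_ratings].index
filtered = filtered[filtered['ISBN'].isin(popular_books)]

print(filtered)
```

In this toy example, user 3 and book C are dropped because each appears only once.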

In [61]:
# Before filtering
print('Size of the book ratings before filtering: %d' % book_ratings.shape[0])

Size of the book ratings before filtering: 1149780


In [62]:
# Filter the book_ratings by matching ISBN in book_ratings dataset
# with the ISBN in books dataset. 
book_ratings = book_ratings[book_ratings.ISBN.isin(books.ISBN)]

In [63]:
# Filter the book_ratings by matching UserID in book_ratings dataset 
# with the UserID in users dataset. 
book_ratings = book_ratings[book_ratings.UserID.isin(users.UserID)]

In [64]:
# To clean up the data, I eliminate zero ratings.
# A zero rating means the book is unrated, which could distort the results
book_ratings = book_ratings[book_ratings.BookRating != 0]

In [65]:
# Filter the users by matching UserID in book_ratings dataset 
# with the UserID in users dataset. 
users = users[users.UserID.isin(book_ratings.UserID)]

# To make dataset small, I filtered based on location too.
users = users[users['Location'].str.contains('usa')]

In [66]:
# Keep only the users who have rated at least 100 books
user_ratings_counts = book_ratings['UserID'].value_counts()
book_ratings        = book_ratings[book_ratings['UserID'].isin(user_ratings_counts[user_ratings_counts >= 100].index)]

In [67]:
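# Intended: keep only the books that have at least 200 ratings.
# Note: the counts below are taken per BookRating value (1-10), not per ISBN,
# so this keeps the rating values that occur at least 200 times.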
book_ratings_counts = book_ratings['BookRating'].value_counts()
book_ratings        = book_ratings[book_ratings['BookRating'].isin(book_ratings_counts[book_ratings_counts >= 200].index)]

In [68]:
# After filtering
print('Size of the book ratings after filtering: %d' % book_ratings.shape[0])

Size of the book ratings after filtering: 103271


In [69]:
# Eliminate unnecessary columns from books dataset
books = books[['ISBN', 'BookTitle']]
# Print first 5 items from current books dataset
books.head()

Unnamed: 0,ISBN,BookTitle
0,195153448,Classical Mythology
1,2005018,Clara Callan
2,60973129,Decision in Normandy
3,374157065,Flu: The Story of the Great Influenza Pandemic...
4,393045218,The Mummies of Urumchi


### Splitting training and test data

The book_ratings data, with its 'UserID', 'ISBN' and 'BookRating' columns, contains users' ratings of books. This makes supervised learning possible: we can predict the rating a given user would give to a book. Although ratings take discrete values from 1 to 10, I model this as a regression problem rather than a classification problem, because a classifier treats every wrong rating as equally wrong. For example, if a user rates a book 7, a classifier counts a prediction of 6 as just as wrong as a prediction of 1. A regression model instead penalizes a prediction according to its distance from the true rating.
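
To make the difference concrete, the sketch below compares a 0/1 classification loss with a squared-error loss for a single true rating of 7. The helper names `zero_one_loss` and `squared_error` are defined here purely for illustration and are not part of this notebook.

```python
def zero_one_loss(true_rating, predicted):
    """Classification view: any wrong label costs the same."""
    return 0 if predicted == true_rating else 1


def squared_error(true_rating, predicted):
    """Regression view: the cost grows with the distance from the truth."""
    return (predicted - true_rating) ** 2


true_rating = 7
for predicted in (7, 6, 1):
    print(predicted,
          zero_one_loss(true_rating, predicted),
          squared_error(true_rating, predicted))
```

Under the 0/1 loss, predicting 6 and predicting 1 cost the same; under squared error, the prediction of 1 is penalized far more heavily.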

In the next step, I will split the data into a training set and a test set with a 75% to 25% ratio.

In [116]:
# Import train_test_split function
from sklearn.model_selection import train_test_split

# Assign X as the original book_ratings dataframe and y as the UserID column of ratings. 
X = book_ratings.copy()
y = book_ratings['UserID']

# Split into training and test datasets, along with UserID
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 42)

In [117]:
# A pivot table shows which UserID rated which ISBN

# Build the rating matrix using pivot table 
r_matrix = X_train.pivot_table(values = 'BookRating', index = 'UserID', columns = 'ISBN')

# In the result we will see lots of 'NaN' which means users didn't rate all the books. 
# In next steps I will replace them with zeros to make it understandable for
# machine learning algorithms.
r_matrix.head()

UserID,0000913154,0001046438,000104687X,000104799X,0001048082,0001053744,0001055607,0001056107,0001845039,0001935968,...,B000092Q0A,B00009EF82,B00009NDAN,B0000DYXID,B0000T6KHI,B0000VZEJQ,B0000X8HIE,B00013AX9E,B0001I1KOG,B000234N3A
2033,,,,,,,,,,,...,,,,,,,,,,
2110,,,,,,,,,,,...,,,,,,,,,,
2276,,,,,,,,,,,...,,,,,,,,,,
4017,,,,,,,,,,,...,,,,,,,,,,
4385,,,,,,,,,,,...,,,,,,,,,,


In [118]:
# Since most of the machine learning algorithms cannot handle NaN values, 
# I replaced them with 0. 
r_matrix_temp = r_matrix.copy().fillna(0)
r_matrix_temp.head()

UserID,0000913154,0001046438,000104687X,000104799X,0001048082,0001053744,0001055607,0001056107,0001845039,0001935968,...,B000092Q0A,B00009EF82,B00009NDAN,B0000DYXID,B0000T6KHI,B0000VZEJQ,B0000X8HIE,B00013AX9E,B0001I1KOG,B000234N3A
2033,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2110,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2276,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4017,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4385,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Evaluation

I use Root Mean Squared Error (RMSE) to evaluate the performance of the algorithms.
The RMSE values for the three approaches I tried are computed below.
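
As a small worked example of the metric, the sketch below computes RMSE by hand on made-up numbers. The function name `rmse_plain` and the sample ratings are assumptions for illustration; the notebook itself relies on `mean_squared_error` from scikit-learn.

```python
import math


def rmse_plain(y_true, y_pred):
    """Root mean squared error without external libraries."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical true ratings and predictions
y_true = [8, 6, 10]
y_pred = [7, 6, 7]

# Squared errors are 1, 0 and 9, so the result is sqrt(10 / 3)
print(rmse_plain(y_true, y_pred))
```

A lower RMSE means the predictions are, on average, closer to the true ratings.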

In [119]:
# Import the mean_squared error function
from sklearn.metrics import mean_squared_error

import numpy as np

# Compute the RMSE
def rmse(y_true, y_pred):
    mse = mean_squared_error(y_true, y_pred)
    return np.sqrt(mse)

In [120]:
# Function to compute the RMSE score obtained on the testing set
def score(cf_model, rating_matrix):
    
    # Get a list of user-isbn tuples from the testing dataset
    id_pairs = zip(X_test['UserID'], X_test['ISBN'])
    
    # Predict the rating for every user-isbn tuple
    y_pred = np.array([cf_model(user_id, isbn, rating_matrix) for (user_id, isbn) in id_pairs])
    
    # Extract the actual ratings given by the users in the test data
    y_true = np.array(X_test['BookRating'])
    
    # Return the final RMSE score
    return rmse(y_true, y_pred)

In [121]:
# Define a baseline model that always returns 4.0
def baseline(_user_id, _isbn, _rating_matrix):
    return 4.0

In [122]:
# Baseline
print('score for baseline')
score(baseline, [])

score for baseline


4.2431564655791156

In [123]:
# User Based Collaborative Filter using Mean Ratings
def cf_user_mean(user_id, isbn, rating_matrix):
    # Check if isbn exists in rating_matrix
    if isbn in rating_matrix:
        # Compute the mean of all ratings for this book
        mean_rating = rating_matrix[isbn].mean()
        
    else:
        # Default to rating 5.0 in the absence of any information
        mean_rating = 5.0
        
    return mean_rating

In [124]:
# If we compare the score with score(baseline),
# we will see that we get a higher number, which means this model does worse than the baseline.
print('score for cf_user_mean with 0 filled rating_matrix:')
score(cf_user_mean, r_matrix_temp)

score for cf_user_mean with 0 filled rating_matrix:


5.929371869275756
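
One plausible contributor to this score is that the means are taken over the zero-filled matrix, so every user who did not rate a book counts as a rating of 0. The sketch below shows the effect on made-up numbers; the series `ratings_for_book` is hypothetical, and whether the NaN-preserving matrix would score better here was not measured.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings for one book from five users; NaN means "not rated"
ratings_for_book = pd.Series([8.0, np.nan, 10.0, np.nan, np.nan])

# Series.mean() skips NaN by default, so only the two real ratings count
mean_ignoring_missing = ratings_for_book.mean()

# After fillna(0), the three missing entries count as ratings of 0
mean_with_zero_fill = ratings_for_book.fillna(0).mean()

print(mean_ignoring_missing, mean_with_zero_fill)
```

The first mean reflects only the users who actually rated the book, while the second is pulled toward 0 by the missing entries.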

In [125]:
# Import required classes and methods from the surprise library
from surprise import Reader, Dataset, KNNBasic, evaluate

# Define a Reader object to parse the file or dataframe containing ratings
# Note: Reader() defaults to rating_scale=(1, 5), while the ratings here range from 1 to 10
reader = Reader()

# Create the dataset to be used for building the filter
data = Dataset.load_from_df(book_ratings, reader)

# Define the algorithm object kNN
knn = KNNBasic()

# Evaluate the performance in terms of RMSE
evaluate(knn, data, measures=['RMSE'])

Evaluating RMSE of algorithm KNNBasic.

------------
Fold 1
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 3.3774
------------
Fold 2
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 3.3722
------------
Fold 3
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 3.3749
------------
Fold 4
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 3.3695
------------
Fold 5
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 3.3663
------------
------------
Mean RMSE: 3.3721
------------
------------


CaseInsensitiveDefaultDict(list,
                           {'rmse': [3.3774343393237234,
                             3.3722296699652654,
                             3.3748922832386459,
                             3.3694679682551461,
                             3.3663030337097428]})

According to the results, KNN has the lowest RMSE of the three.

|Algorithm   |RMSE       |
|------------|-----------|
|baseline    |4.24       |
|cf_user_mean|5.93       |
|KNN         |3.37       |

### Recommendation 

The aim is to find the users whose ratings are most similar to those of a given user, and then to recommend books that the user has not rated yet but that those similar users have rated highly.
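
The idea can be sketched in plain Python on a tiny made-up dataset. Everything here (the `ratings` dictionary, the user names, and the functions `cosine_similarity` and `recommend`) is hypothetical and only illustrates the approach; unlike the notebook's code, it is not built on scikit-learn.

```python
import math

# Hypothetical user -> {book: rating} data
ratings = {
    'ann':  {'A': 9, 'B': 8},
    'bob':  {'A': 8, 'B': 9, 'C': 10},
    'cara': {'A': 1, 'D': 9},
}


def cosine_similarity(u, v):
    """Cosine similarity between two sparse rating dicts (missing = 0)."""
    dot = sum(u[b] * v[b] for b in u if b in v)
    norm_u = math.sqrt(sum(r * r for r in u.values()))
    norm_v = math.sqrt(sum(r * r for r in v.values()))
    return dot / (norm_u * norm_v)


def recommend(user, k=1, n=5):
    """Return up to n unrated books, ranked by the k nearest users' mean rating."""
    others = [o for o in ratings if o != user]
    neighbours = sorted(
        others,
        key=lambda o: cosine_similarity(ratings[user], ratings[o]),
        reverse=True,
    )[:k]
    # Collect the neighbours' ratings for books the user has not rated yet
    candidates = {}
    for o in neighbours:
        for book, r in ratings[o].items():
            if book not in ratings[user]:
                candidates.setdefault(book, []).append(r)
    means = {book: sum(rs) / len(rs) for book, rs in candidates.items()}
    return sorted(means, key=means.get, reverse=True)[:n]


print(recommend('ann'))
```

Here `bob` is the nearest neighbour of `ann`, so the only candidate is book C, which `ann` has not rated. This sketch excludes books the user has already rated by checking the ISBN-like key rather than the user ID.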

In [128]:
from sklearn.neighbors import NearestNeighbors

# Create a KNN model
def create_model(ratings):
    model_knn = NearestNeighbors(metric = 'cosine', algorithm = 'brute') 
    model_knn.fit(ratings)
    
    return model_knn

# This function finds the k most similar users for the given UserID and ratings matrix
# The resulting distances are the same as those obtained with pairwise_distances
def list_k_similar_users(user_id, ratings, model_knn, k):
    # Find the index of the given user_id
    user_loc = ratings.index.get_loc(user_id)    
    distances, indices = model_knn.kneighbors(ratings.iloc[user_loc, :].values.reshape(1, -1), n_neighbors = k + 1)

    return distances, indices

def recommend_books(user_id, ratings, model_knn, max_recommendations = 10):
    # similar users based on cosine similarity
    distances, indices = list_k_similar_users(user_id, ratings, model_knn, max_recommendations)
    
    similar_user_ids = []
    # The first user_id is our user, thus I skipped the first
    for i in range(1, len(distances.flatten())):
        similar_user_ids.append(ratings.index[indices.flatten()[i]])
    
    # Collect the ratings given by the user's neighbours (similar users)
    temp_book_ratings = book_ratings[book_ratings['UserID'].isin(similar_user_ids)]
    
    # Drop any ratings given by the user themself
    temp_book_ratings = temp_book_ratings[~temp_book_ratings['UserID'].isin([user_id])]
    
    # Average the BookRating per ISBN, then keep the top max_recommendations
    temp_book_ratings = temp_book_ratings.groupby(['ISBN'])['BookRating'].mean().nlargest(max_recommendations)

    print('Following books are recommended for the given UserID(%s)' % user_id)
    print('--------------------------------')
    for isbn, _average_rating in temp_book_ratings.items():
        book_index = (books.index[books['ISBN'] == isbn])[0]
        print('%s %s' % (isbn, books.BookTitle[book_index]))
    print('\n')

In [129]:
# Create a reusable model
model_knn = create_model(r_matrix_temp)

# Recommend books to a random user
recommend_books(275970, r_matrix_temp, model_knn, 10)

# Recommend books to another random user
recommend_books(4017, r_matrix_temp, model_knn, 10)

Following books are recommended for the given UserID(275970)
--------------------------------
0060012366 The Wee Free Men (Bccb Blue Ribbon Fiction Books (Awards))
0060013117 Night Watch
006001315X Monstrous Regiment (Pratchett, Terry)
0060175966 The Professor and the Madman
0060392452 Stupid White Men ...and Other Sorry Excuses for the State of the Nation!
0060803444 Is Sex Necessary? : Or Why You Feel the Way You Do
0060920084 The Lost Continent: Travels in Small-Town America
0060923245 Sweet Hereafter Movie Tie-In : A Novel
0060925493 Feather Crowns
0060929847 Our Town: A Play in Three Acts (Perennial Classics)


Following books are recommended for the given UserID(4017)
--------------------------------
0060558199 Save Karyn : One Shopaholic's Journey to Debt and Back
0060974060 Friday Night Lights: A Town, a Team, and a Dream
0060981180 Mariette in Ecstasy
0062508164 Truth or Dare : Encounters with Power, Authority, and Mystery
0064400557 Charlotte's Web (Trophy Newbery)
0064400972

As a result, I built a book recommendation system that recommends **the desired number of books** for a given user according to **the average ratings of similar users**. Better results may be possible with further data cleaning and pruning of the datasets, or by trying other algorithms and choosing the best one. I tried to analyze the data before processing it. As future work, the user's age and the book's publisher could be used to improve the results.