## I like Dune and the Lord of the Rings, tell me what I should read next

This notebook builds a simple recommender system based on the books you like. It expects you to name two books you like. Please fill in the title and the author's surname for each:

In [78]:
import ipywidgets as widgets
import pandas as pd
import warnings
import numpy as np

warnings.filterwarnings('ignore')

# CREATE A WIDGET TO GATHER BOOK INFORMATION

accordion = widgets.Accordion(children=[widgets.Textarea(), widgets.Textarea()])
accordion.set_title(0, 'Book name:')
accordion.set_title(1, 'Authors surname:')

accordion1 = widgets.Accordion(children=[widgets.Textarea(), widgets.Textarea()])
accordion1.set_title(0, 'Book name:')
accordion1.set_title(1, 'Authors surname:')

tab_nest = widgets.Tab()
tab_nest.children = [accordion, accordion1]
tab_nest.set_title(0, '1st book')
tab_nest.set_title(1, '2nd book')
tab_nest

Tab(children=(Accordion(children=(Textarea(value=''), Textarea(value='')), _titles={'1': 'Authors surname:', '…

In [79]:
# ASSIGN THE WIDGET VALUES TO VARIABLES

book1_name = tab_nest.children[0].children[0].value
book1_author = tab_nest.children[0].children[1].value
book2_name = tab_nest.children[1].children[0].value
book2_author = tab_nest.children[1].children[1].value

Let us look at our database.

In [80]:
# LOAD DATAFRAME OF BOOKS

books_url = 'BX-CSV-Dump/BX-Books.csv'

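# reads only the first 5 columns (the remaining ones are just image URLs); latin1 encoding avoids decoding errors
# the three-character separator '";"' is used because quotechar-based parsing did not work; a separator longer
# than one character is treated as a regular expression and needs the Python engine
# on_bad_lines='skip' replaces error_bad_lines=False, which was removed in pandas 2.0
books = pd.read_csv(books_url, sep='";"', engine='python', skipinitialspace=True, on_bad_lines='skip',
                    encoding='latin1', usecols=[0, 1, 2, 3, 4])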

# getting rid of irrelevant quotes produced by chosen separator
books.rename(columns={'"ISBN': 'ISBN'}, inplace=True)
books['ISBN'] = books['ISBN'].str[1:]

# keeping a cleaned copy to work with
clean_books = books.copy()
clean_books.head()

| | ISBN | Book-Title | Book-Author | Year-Of-Publication | Publisher |
|---|---|---|---|---|---|
| 0 | 195153448 | Classical Mythology | Mark P. O. Morford | 2002 | Oxford University Press |
| 1 | 2005018 | Clara Callan | Richard Bruce Wright | 2001 | HarperFlamingo Canada |
| 2 | 60973129 | Decision in Normandy | Carlo D'Este | 1991 | HarperPerennial |
| 3 | 374157065 | Flu: The Story of the Great Influenza Pandemic... | Gina Bari Kolata | 1999 | Farrar Straus Giroux |
| 4 | 393045218 | The Mummies of Urumchi | E. J. W. Barber | 1999 | W. W. Norton &amp; Company |
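
To see why the `ISBN` column needs cleaning, the parsing step can be reproduced on a tiny in-memory sample. This is only a sketch: the two-row sample is made up to mimic the quoted, semicolon-separated layout of `BX-Books.csv`, and `engine='python'` is passed explicitly because a multi-character separator is interpreted as a regular expression.

```python
import io

import pandas as pd

# Hypothetical two-row sample mimicking the quoted, semicolon-separated layout.
raw = (
    '"ISBN";"Book-Title";"Book-Author"\n'
    '"0195153448";"Classical Mythology";"Mark P. O. Morford"\n'
    '"0002005018";"Clara Callan";"Richard Bruce Wright"\n'
)

# Splitting on the three-character sequence ";" leaves one stray quote
# at the start of the first column and at the end of the last one.
sample = pd.read_csv(io.StringIO(raw), sep='";"', engine='python')

# Strip the leftover quotes from the header names and from the two edge columns.
sample.columns = sample.columns.str.strip('"')
sample['ISBN'] = sample['ISBN'].str.strip('"')
sample['Book-Author'] = sample['Book-Author'].str.strip('"')
```

In the notebook the trailing quote never shows up because `usecols` drops the last columns of the file.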


To recommend the best books for you, we compare you with similar users. Let us load the user table.

In [81]:
# LOAD DATAFRAME OF USERS

# on_bad_lines='skip' replaces error_bad_lines=False, which was removed in pandas 2.0
users = pd.read_csv('BX-CSV-Dump/BX-Users.csv', sep=';', on_bad_lines='skip', encoding='latin1')
users.head()

| | User-ID | Location | Age |
|---|---|---|---|
| 0 | 1 | nyc, new york, usa | |
| 1 | 2 | stockton, california, usa | 18.0 |
| 2 | 3 | moscow, yukon territory, russia | |
| 3 | 4 | porto, v.n.gaia, portugal | 17.0 |
| 4 | 5 | farnborough, hants, united kingdom | |


User ratings are stored in a separate table, which links each User-ID to an ISBN and a rating:

In [82]:
# Load the book ratings data set
ratings_url = 'BX-CSV-Dump/BX-Book-Ratings.csv'
# on_bad_lines='skip' replaces error_bad_lines=False, which was removed in pandas 2.0
ratings = pd.read_csv(ratings_url, sep=';', on_bad_lines='skip', encoding='latin1')
ratings.head()

| | User-ID | ISBN | Book-Rating |
|---|---|---|---|
| 0 | 276725 | 034545104X | 0 |
| 1 | 276726 | 0155061224 | 5 |
| 2 | 276727 | 0446520802 | 0 |
| 3 | 276729 | 052165615X | 3 |
| 4 | 276729 | 0521795028 | 6 |


Now we get our hands dirty. We look up all books that match the title and author you provided:

In [83]:
# FIND DUNE BOOKS

dune_filter = np.logical_and(clean_books['Book-Title'].str.contains(book2_name),
                             clean_books['Book-Author'].str.contains(book2_author))
dune_books = clean_books[dune_filter]

# FIND LOTR BOOKS

lotr_filter = np.logical_and(clean_books['Book-Title'].str.contains(book1_name),
                             clean_books['Book-Author'].str.contains(book1_author))
lotr_books = clean_books[lotr_filter]
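
The title-and-author filter can be tried on a small hand-made table. The sketch below uses made-up rows and a hypothetical helper `find_books`. Unlike the cell above, it also passes `case=False`, `regex=False` and `na=False`, so the input is matched literally, case is ignored, and rows with missing values count as non-matches.

```python
import pandas as pd

# Made-up catalogue for illustration only.
catalogue = pd.DataFrame({
    'Book-Title': ['Dune', 'Dune Messiah', 'The Hobbit', 'Dune: House Atreides', None],
    'Book-Author': ['Frank Herbert', 'Frank Herbert', 'J. R. R. Tolkien', 'Brian Herbert', 'Unknown'],
})

def find_books(df, title, author):
    """Return rows whose title contains `title` and whose author contains `author`."""
    title_match = df['Book-Title'].str.contains(title, case=False, regex=False, na=False)
    author_match = df['Book-Author'].str.contains(author, case=False, regex=False, na=False)
    return df[title_match & author_match]

# Matching on the surname alone also picks up other authors who share it.
matches = find_books(catalogue, 'dune', 'herbert')
```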

The idea is to find people similar to you. But who are you?
We define you as a person who may have read any of the books matching your criteria and would rate each of them 10.
So we take every book matching the criteria, assign it your id (I hope id=12345678 works for you) and give each one a rating of 10.

In [84]:
# GENERATE OUR READER: a person, who gave all dune and all lotr books 10 out of 10 rating

# builds the reader: assigns a User-ID and a rating to every book they like
def generate_reader(books, uid=12345678, rating=10):
    """Generates the reader's ratings DataFrame from the books they should like.

    :argument books: DataFrame of books.
    :type books: pd.DataFrame
    :argument uid: User-ID assigned to the reader.
    :type uid: int
    :argument rating: Rating given to every book.
    :type rating: int
    :return: DataFrame with the reader's User-ID and rating for each book.
    :rtype: pd.DataFrame
    """
    reader = pd.DataFrame(columns=['ISBN', 'User-ID', 'Book-Rating'])
    reader['ISBN'] = books['ISBN'].copy()
    reader['Book-Rating'] = rating
    reader['User-ID'] = uid
    return reader

lotr_reader = generate_reader(lotr_books)
dune_reader = generate_reader(dune_books)
del lotr_books
del dune_books

the_user = pd.concat([lotr_reader, dune_reader])

A person similar to you has read at least one of the Lord of the Rings books and at least one of the Dune books. In general, they must have rated at least one book from each of the two sets matching your criteria. This is still a broad group of people, so we rank them by how close their ratings are to yours. We measure this with a similarity derived from the Euclidean distance d between the ratings, 1 / (1 + d). For every book you rated 10, we take each other user who rated that book and compute the difference between the two ratings. If a user shares more than one book with you, the differences form a vector and d is its length.
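
As a small worked example, the code in this notebook turns a Euclidean distance d into a similarity 1 / (1 + d). The sketch below applies that formula to made-up rating vectors. The standalone function and its argument names are illustrative and differ from the grouped version used in the next cell.

```python
import numpy as np

def euclidean_similarity(mine, theirs):
    """Map the Euclidean distance between two rating vectors to a value in (0, 1]."""
    diff = np.asarray(mine, dtype=float) - np.asarray(theirs, dtype=float)
    return 1.0 / (1.0 + np.linalg.norm(diff))

# You rate every matching book 10; the other users' ratings are made up.
identical = euclidean_similarity([10, 10], [10, 10])  # distance 0
close = euclidean_similarity([10, 10], [10, 7])       # distance 3
far = euclidean_similarity([10, 10], [0, 0])          # distance sqrt(200)
```

Identical ratings give the maximum similarity of 1, and the value shrinks toward 0 as the ratings drift apart.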

In [88]:
# picks users that read at least one of the dune books AND at least one of the lotr books
def calculate_similar(user, others: pd.DataFrame):
    """Calculates similarity between user and other users ratings. Picks users that read at least one of Dune and one
    of Lotr books.

    :argument user: User ratings data frame which similarity to others will be calculated.
    :type user: pd.DataFrame
    :argument others: DataFrame of ratings of other users.
    :type others: pd.DataFrame
    :return: User similarity data frame.
    :rtype: pd.DataFrame"""

    dune_raters = others[others['ISBN'].isin(dune_reader['ISBN'])]
    lotr_raters = others[others['ISBN'].isin(lotr_reader['ISBN'])]
    # After some experiments I decided to keep users who did not rate the books (Book-Rating == 0) in the user set.
    # Their similarity is low, but keeping them yields a much larger collection of possibly interesting books
    # that users much more similar to the new (unknown) user might like.
    # Most importantly, it boosts the popularity (reader count) that is used to compute the final score.
    # dune_raters = dune_raters.drop(dune_raters[dune_raters['Book-Rating'] == 0].index)
    # lotr_raters = lotr_raters.drop(lotr_raters[lotr_raters['Book-Rating'] == 0].index)
    lotr_raters = lotr_raters.drop(lotr_raters[~lotr_raters['User-ID'].isin(dune_raters['User-ID'])].index)
    dune_raters = dune_raters.drop(dune_raters[~dune_raters['User-ID'].isin(lotr_raters['User-ID'])].index)
    lotr_dune_raters = pd.concat([lotr_raters, dune_raters])

    # generates the set of ratings per book
    siml_rated = lotr_dune_raters.merge(user, on='ISBN')

    # computes the similarity between users from the Euclidean distance between their ratings
    def euclidean_similarity(grouped):
        r"""Calculates similarity between two users (data frame aggregated by User-ID_x) using Euclidean distance as
        siml = 1 / (1 + d(u, v)) for u \in U; v \in U\{u}.
        :argument grouped: Data frame of ratings of both users.
        :type grouped: pd.DataFrame
        :return: Similarity calculated from Euclidean distance, in (0, 1].
        :rtype: float"""
        # DataFrame.as_matrix was removed in pandas 1.0; to_numpy is its replacement
        new_user_ratings = grouped['Book-Rating_y'].to_numpy(dtype=float)
        user_rating = grouped['Book-Rating_x'].to_numpy(dtype=float)
        dst_vec = new_user_ratings - user_rating
        dst = np.sqrt(dst_vec.dot(dst_vec))
        return 1./(1. + dst)

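    siml = (siml_rated.groupby(['User-ID_x']).apply(euclidean_similarity).reset_index(name='Similarity'))
    # rename the key so the result can later be merged with the ratings on 'User-ID'
    return siml.rename(columns={'User-ID_x': 'User-ID'})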

# here we calculate the similarity using the above functions
# we get columns: user-ID (user who read at least 1 dune AND 1 lotr book), Similarity (how much is the rating
# similar to ours)
sim = calculate_similar(the_user, ratings)

# merging ratings and users
ratings_with_age = pd.merge(ratings, users, on='User-ID')
cols = ['ISBN', 'User-ID', 'Book-Rating', 'Location','Age']
ratings_with_age = ratings_with_age[cols]
ratings_with_book_and_age = pd.merge(ratings_with_age, books, on='ISBN')
rated_book_colls = [cols[0], 'Book-Title', 'Book-Author', 'Year-Of-Publication', 'Publisher'] + cols[1:]
ratings_with_book_and_age = ratings_with_book_and_age[rated_book_colls]
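
The user filter inside `calculate_similar` keeps only people who rated at least one book from each of the two sets. The same idea can be sketched with plain set intersection on made-up data. All names and values below (`toy_ratings`, `set_a`, `set_b`) are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Made-up ratings: users 1 and 3 rated books from both sets, user 2 only from set A.
toy_ratings = pd.DataFrame({
    'User-ID': [1, 1, 2, 3, 3, 3],
    'ISBN': ['A1', 'B1', 'A2', 'A1', 'B2', 'C9'],
    'Book-Rating': [9, 8, 10, 0, 7, 5],
})
set_a = {'A1', 'A2'}  # stands in for the ISBNs matching the first book
set_b = {'B1', 'B2'}  # stands in for the ISBNs matching the second book

# Users who rated something from each set, then the users present in both.
users_a = set(toy_ratings.loc[toy_ratings['ISBN'].isin(set_a), 'User-ID'])
users_b = set(toy_ratings.loc[toy_ratings['ISBN'].isin(set_b), 'User-ID'])
both = users_a & users_b

# Keep only the ratings those users gave to books from either set.
relevant = toy_ratings[toy_ratings['User-ID'].isin(both) & toy_ratings['ISBN'].isin(set_a | set_b)]
```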

Now we have a dataframe of people similar to you. We find the books to recommend by taking the other books those people read. Books rated highly by the people most similar to you rank higher. Books that many of them read also rank higher, so a book's popularity is taken into account as well.

In [41]:

# possible books to recommend - which books did users similar to me rate?

rec_bookset = ratings_with_book_and_age.merge(sim, on='User-ID')
rec_bookset = rec_bookset[rec_bookset['Book-Rating'] != 0]

# Define a lambda function to compute the weighted mean:
weighted_mean = lambda x: np.average(x, weights=rec_bookset.loc[x.index, "Similarity"])

# Define a dictionary with the functions to apply for a given column:
f = {'Book-Rating': weighted_mean, 'Popularity': 'count'}

rec_bookset['Popularity'] = 0

books_wm = rec_bookset.groupby(["ISBN"], as_index=False).agg(f)

books_wm['Score'] = books_wm['Popularity'].multiply(books_wm['Book-Rating'])

recommended_books = books_wm.sort_values(['Score'], ascending=False).merge(books, on='ISBN')
recommended_books = recommended_books.drop_duplicates(subset='ISBN', keep='first')
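
The scoring in the cell above multiplies a similarity-weighted mean rating by the number of similar readers. A minimal sketch on made-up numbers shows how the two factors combine. The helper `score_books` and the toy values are hypothetical, and the loop is written for readability, not speed.

```python
import numpy as np
import pandas as pd

# Made-up ratings from similar users; Similarity would come from the previous step.
toy = pd.DataFrame({
    'ISBN': ['X', 'X', 'X', 'Y'],
    'Book-Rating': [8, 10, 6, 10],
    'Similarity': [0.5, 0.25, 0.25, 0.5],
})

def score_books(df):
    """Return one row per ISBN with weighted rating, reader count and score."""
    rows = []
    for isbn, group in df.groupby('ISBN'):
        weighted = np.average(group['Book-Rating'], weights=group['Similarity'])
        rows.append({'ISBN': isbn, 'Book-Rating': weighted, 'Popularity': len(group)})
    out = pd.DataFrame(rows)
    out['Score'] = out['Popularity'] * out['Book-Rating']
    return out.sort_values('Score', ascending=False).reset_index(drop=True)

scored = score_books(toy)
```

Book `X` has a lower weighted rating (8) than book `Y` (10), yet it ranks first because three similar users read it, giving a score of 24 against 10.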

So in the end we recommend the top 10 books, ranked by how highly people similar to you rated them and by how popular they were among those people.

In [92]:
dune_filter = np.logical_and(recommended_books['Book-Title'].str.contains(book2_name),
                             recommended_books['Book-Author'].str.contains(book2_author))
lotr_filter = np.logical_and(recommended_books['Book-Title'].str.contains(book1_name),
                             recommended_books['Book-Author'].str.contains(book1_author))
                             
# drop both sets of input books in one step so the mask index matches the frame
recommended_books = recommended_books[~(dune_filter | lotr_filter)]

recommended_books[['Book-Title', 'Book-Author']][:10]

| | Book-Title | Book-Author |
|---|---|---|
| 1 | Harry Potter and the Chamber of Secrets (Book 2) | J. K. Rowling |
| 2 | Harry Potter and the Sorcerer's Stone (Harry P... | J. K. Rowling |
| 3 | Harry Potter and the Order of the Phoenix (Boo... | J. K. Rowling |
| 4 | Stranger in a Strange Land (Remembering Tomorrow) | Robert A. Heinlein |
| 5 | Harry Potter and the Goblet of Fire (Book 4) | J. K. Rowling |
| 6 | The Vampire Lestat (Vampire Chronicles, Book II) | ANNE RICE |
| 7 | Harry Potter and the Prisoner of Azkaban (Book 3) | J. K. Rowling |
| 9 | Watership Down | Richard Adams |
| 10 | The Stand (The Complete and Uncut Edition) | Stephen King |
| 11 | The Joy Luck Club | Amy Tan |
