# NOTES

 - collaborative recommendation - based on past interactions
 - when deploying the app
     - I need to log the interactions to optimize the recommendations
     - text scraping from somewhere
     - build a database to map normalized book name to row in table for faster search

# Problem Introduction

Our task is to build a model suitable for recommending books, based on user reviews from the open BX (Book-Crossing) dataset.

TL;DR If two users have similar taste in books, we want to recommend to each of them the books the other one liked and they have not rated yet, i.e. their respective part of the symmetric difference of the sets of books they liked.
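
As a toy illustration of this idea (the users and titles below are made up, and the helper name `recommend_from_peer` is hypothetical), each user gets the half of the symmetric difference that comes from the other user:

```python
# Hypothetical "liked" sets for two users with similar taste
alice_liked = {"1984", "Brave New World", "Fahrenheit 451"}
bob_liked = {"1984", "Brave New World", "Animal Farm"}


def recommend_from_peer(own: set, peer: set) -> set:
    """Books the peer liked that the user has not liked (yet)."""
    return peer - own


alice_recs = recommend_from_peer(alice_liked, bob_liked)
bob_recs = recommend_from_peer(bob_liked, alice_liked)

# together, the two recommendation sets form the symmetric difference
assert alice_recs | bob_recs == alice_liked ^ bob_liked

print(alice_recs)  # {'Animal Farm'}
print(bob_recs)    # {'Fahrenheit 451'}
```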

So let's tackle this problem using `numpy`, `pandas` and `sklearn`.

In [29]:
import numpy as np
import pandas as pd
import seaborn as sns

# Exploratory Data Analysis

In [30]:
# input is encoded in latin-1 and there are some corrupted rows
# as there are not many of them, I decided to just skip them

books_df = pd.read_csv( "data/BX-Books.csv", encoding="latin-1", sep=";", on_bad_lines="skip", low_memory=False )
ratings_df = pd.read_csv( "data/BX-Book-Ratings.csv", encoding="latin-1", sep=";", on_bad_lines="skip" )
users_df = pd.read_csv( "data/BX-Users.csv", encoding="latin-1", sep=";", on_bad_lines="skip" )

In [45]:
# delete '#' to see relevant part
# books_df.head()
# books_df.describe()
books_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271360 entries, 0 to 271359
Data columns (total 5 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   ISBN                 271360 non-null  object
 1   Book-Title           271360 non-null  object
 2   Book-Author          271359 non-null  object
 3   Year-Of-Publication  271360 non-null  object
 4   Publisher            271358 non-null  object
dtypes: object(5)
memory usage: 10.4+ MB


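The raw table also contains three image-URL columns (`Image-URL-S`, `Image-URL-M` and `Image-URL-L`), which we do not want to consider when deciding what book to recommend, so let's get rid of them. (Judging by the cell numbers, the `info()` output above was captured on a later run, after the drop, which is why it lists only 5 columns.)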

In [32]:
books_df.drop( columns=[ "Image-URL-S", "Image-URL-M", "Image-URL-L" ], inplace=True )

In [33]:
users_df.head()
# users_df.describe()
# users_df.info()

| | User-ID | Location | Age |
|---|---|---|---|
| 0 | 1 | nyc, new york, usa | |
| 1 | 2 | stockton, california, usa | 18.0 |
| 2 | 3 | moscow, yukon territory, russia | |
| 3 | 4 | porto, v.n.gaia, portugal | 17.0 |
| 4 | 5 | farnborough, hants, united kingdom | |


In [34]:
ratings_df.head()
# ratings_df.describe()
# ratings_df.info()

| | User-ID | ISBN | Book-Rating |
|---|---|---|---|
| 0 | 276725 | 034545104X | 0 |
| 1 | 276726 | 0155061224 | 5 |
| 2 | 276727 | 0446520802 | 0 |
| 3 | 276729 | 052165615X | 3 |
| 4 | 276729 | 0521795028 | 6 |


Let's see what we are working with. Speaking in the language of databases, we have three tables which we have to merge together somehow. Luckily for us, `pandas` offers methods for exactly that.

Let's merge `books_df` and `ratings_df` on `ISBN`.

In [36]:
books_with_ratings_df = books_df.merge( ratings_df, on="ISBN" )
books_with_ratings_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1031136 entries, 0 to 1031135
Data columns (total 7 columns):
 #   Column               Non-Null Count    Dtype 
---  ------               --------------    ----- 
 0   ISBN                 1031136 non-null  object
 1   Book-Title           1031136 non-null  object
 2   Book-Author          1031135 non-null  object
 3   Year-Of-Publication  1031136 non-null  object
 4   Publisher            1031134 non-null  object
 5   User-ID              1031136 non-null  int64 
 6   Book-Rating          1031136 non-null  int64 
dtypes: int64(2), object(5)
memory usage: 62.9+ MB
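
One thing worth keeping in mind: `merge` defaults to an inner join, so any rating whose ISBN has no counterpart in `books_df` silently disappears. The following is a minimal sketch on made-up toy frames (`toy_books` and `toy_ratings` are hypothetical stand-ins, not the real data) showing that behaviour and one way to inspect the unmatched rows:

```python
import pandas as pd

# toy stand-ins for books_df and ratings_df (values are made up)
toy_books = pd.DataFrame({"ISBN": ["A", "B"], "Book-Title": ["Book A", "Book B"]})
toy_ratings = pd.DataFrame({
    "User-ID": [1, 2, 3],
    "ISBN": ["A", "A", "X"],  # "X" has no entry in toy_books
    "Book-Rating": [5, 0, 8],
})

# default how="inner": only ISBNs present in both frames survive
inner = toy_books.merge(toy_ratings, on="ISBN")

# a left join from the ratings side keeps every rating; indicator=True adds
# a "_merge" column that tells where each row came from
left = toy_ratings.merge(toy_books, on="ISBN", how="left", indicator=True)
unmatched = left[left["_merge"] == "left_only"]

print(len(inner))      # 2
print(len(unmatched))  # 1
```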


In [37]:
books_with_ratings_df.head()

| | ISBN | Book-Title | Book-Author | Year-Of-Publication | Publisher | User-ID | Book-Rating |
|---|---|---|---|---|---|---|---|
| 0 | 195153448 | Classical Mythology | Mark P. O. Morford | 2002 | Oxford University Press | 2 | 0 |
| 1 | 2005018 | Clara Callan | Richard Bruce Wright | 2001 | HarperFlamingo Canada | 8 | 5 |
| 2 | 2005018 | Clara Callan | Richard Bruce Wright | 2001 | HarperFlamingo Canada | 11400 | 0 |
| 3 | 2005018 | Clara Callan | Richard Bruce Wright | 2001 | HarperFlamingo Canada | 11676 | 8 |
| 4 | 2005018 | Clara Callan | Richard Bruce Wright | 2001 | HarperFlamingo Canada | 41385 | 0 |


We are left with a table of books that has an entry for every review. This is not ideal yet, because we have not inspected the reviews and chosen the relevant ones.

For example, we _do not_ want to include reviews from users who have too few or too many reviews. Similarly, we do not want to recommend books for which we do not have a solid foundation of reviews from different users.

In [38]:
user_rating_counts = ratings_df[ "User-ID" ].value_counts()
user_rating_counts.head()

11676     13602
198711     7550
153662     6109
98391      5891
35859      5850
Name: User-ID, dtype: int64

We have 105283 unique users who have left a review (this is the length of `user_rating_counts`). Now we want to find a threshold at which a user becomes relevant, based on how many reviews they have left.

In [11]:
print( f"0.25 quantile: {user_rating_counts.quantile( q=0.25 )}" )
print( f"0.5 quantile: {user_rating_counts.quantile( q=0.5 )}" )
print( f"0.75 quantile: {user_rating_counts.quantile( q=0.75 )}")

0.25 quantile: 1.0
0.5 quantile: 1.0
0.75 quantile: 4.0


The observation is that 75% of our users have left at most 4 reviews, and half of them only a single one. This is a bit sad, but not that surprising: the ratings were crawled from the Book-Crossing community, and apparently most of its members rated only a book or two :)

Let's say that we consider a person a relevant reviewer if they have at least 20 reviews. The very large counts also seem suspicious; I would guess they are caused by bots, so we drop users with more than 500 reviews as well. We are left with around 7000 reviewers.

In [12]:
relevant_user_rating_counts = user_rating_counts[ ( 20 <= user_rating_counts ) & ( user_rating_counts <= 500 ) ]
relevant_user_rating_counts.head()

29855     498
2276      498
179978    497
277427    497
221445    493
Name: User-ID, dtype: int64

In [39]:
user_ids = relevant_user_rating_counts.index
relevant_ratings_df = ratings_df[ ratings_df[ "User-ID" ].isin( user_ids ) ]

In [40]:
relevant_rating_with_books_df = relevant_ratings_df.merge( books_df, on="ISBN" )
relevant_rating_with_books_df.head()

| | User-ID | ISBN | Book-Rating | Book-Title | Book-Author | Year-Of-Publication | Publisher |
|---|---|---|---|---|---|---|---|
| 0 | 276762 | 034544003X | 0 | Southampton Row (Charlotte &amp; Thomas Pitt N... | Anne Perry | 2002 | Ballantine Books |
| 1 | 134797 | 034544003X | 0 | Southampton Row (Charlotte &amp; Thomas Pitt N... | Anne Perry | 2002 | Ballantine Books |
| 2 | 146345 | 034544003X | 8 | Southampton Row (Charlotte &amp; Thomas Pitt N... | Anne Perry | 2002 | Ballantine Books |
| 3 | 155173 | 034544003X | 0 | Southampton Row (Charlotte &amp; Thomas Pitt N... | Anne Perry | 2002 | Ballantine Books |
| 4 | 177113 | 034544003X | 0 | Southampton Row (Charlotte &amp; Thomas Pitt N... | Anne Perry | 2002 | Ballantine Books |


This dataset contains all the so-called _relevant_ reviews together with their books. However, even though we have filtered the users, we do not yet know how many reviews each book has. Some books have too few reviews for us to generalize well, so we drop every book with fewer than 20 reviews.

In [48]:
book_absolute_ratings = relevant_rating_with_books_df.groupby( "ISBN" )[ "Book-Rating" ].count()
# book_absolute_ratings = relevant_rating_with_books_df.groupby( "ISBN" )[ "ISBN" ].count() #.reset_index()
# book_absolute_ratings.head()

# series which contains the relevant ISBNs
relevant_book_absolute_ratings_s = book_absolute_ratings[ book_absolute_ratings >= 20 ]
relevant_book_absolute_ratings_s.info()

<class 'pandas.core.series.Series'>
Index: 2962 entries, 000649840X to 8433925180
Series name: Book-Rating
Non-Null Count  Dtype
--------------  -----
2962 non-null   int64
dtypes: int64(1)
memory usage: 46.3+ KB


In [49]:
# keep only the ratings of books that passed the threshold above
final_df = relevant_rating_with_books_df[ relevant_rating_with_books_df[ "ISBN" ].isin( relevant_book_absolute_ratings_s.index ) ]

# assign the result instead of dropping in place on a slice, which avoids
# the chained-assignment warning without silencing it globally
final_df = final_df.drop_duplicates( [ "ISBN", "User-ID" ] )

final_df.head()

Unnamed: 0,User-ID,ISBN,Book-Rating,Book-Title,Book-Author,Year-Of-Publication,Publisher
7,276762,380711524,5,See Jane Run,Joy Fielding,1992,Avon
8,7158,380711524,0,See Jane Run,Joy Fielding,1992,Avon
9,44728,380711524,0,See Jane Run,Joy Fielding,1992,Avon
10,59150,380711524,0,See Jane Run,Joy Fielding,1992,Avon
11,82831,380711524,0,See Jane Run,Joy Fielding,1992,Avon


In [50]:
model_df = final_df.pivot_table( columns="User-ID", index="Book-Title", values="Book-Rating" )
model_df.fillna( 0, inplace=True )
model_df.head()
model_df.index

Index([''Salem's Lot', '10 Lb. Penalty', '16 Lighthouse Road', '1984',
       '1st to Die: A Novel', '2010: Odyssey Two', '204 Rosewood Lane',
       '2061: Odyssey Three', '24 Hours', '2nd Chance',
       ...
       'You Belong to Me',
       'You Belong to Me and Other True Cases (Ann Rule's Crime Files: Vol. 2)',
       'You Just Don't Understand', 'You Shall Know Our Velocity', 'Yukon Ho!',
       'Zen and the Art of Motorcycle Maintenance: An Inquiry into Values',
       'Zoya', '\O\" Is for Outlaw"', 'e', 'stardust'],
      dtype='object', name='Book-Title', length=2676)
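
To make the shape of `model_df` concrete, here is a toy sketch of the same pivot on made-up ratings (`toy` and `toy_model_df` are hypothetical names). Rows are titles, columns are users, and `pivot_table` averages the ratings if a (title, user) pair occurs more than once, for example when several ISBNs share one title. That is presumably why 2962 ISBNs collapse into 2676 titles above.

```python
import pandas as pd

# made-up ratings in the same long format as final_df
toy = pd.DataFrame({
    "User-ID": [1, 1, 2, 3],
    "Book-Title": ["1984", "Zoya", "1984", "Zoya"],
    "Book-Rating": [9, 0, 7, 10],
})

# rows = titles, columns = users, cells = rating (mean if a pair repeats)
toy_model_df = toy.pivot_table(columns="User-ID", index="Book-Title", values="Book-Rating")

# users who did not rate a title get 0
toy_model_df = toy_model_df.fillna(0)
print(toy_model_df)
```

Note that after `fillna( 0 )` a missing rating and an explicit rating of 0 are indistinguishable, which is a simplification this approach accepts.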

Now we have a dataframe which we can feed to our model. The first idea I came up with was to use KNN, and I did not look for anything better, as I am trying to build an MVP rather than a full-blown recommendation engine.

In [65]:
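def print_suggestions( title: str, model ):
    """Print the nearest neighbours of `title` (the closest one is typically the title itself)."""
    # select the row of ratings for the requested title via a boolean mask on the index
    row = model_df.loc[ model_df.index == title ]
    if row.shape[ 0 ] == 0:
        print( f"Sorry, we do not have title '{ title }' in our database" )
        return

    # kneighbors returns ( distances, indices ); we only need the indices
    _, suggestions = model.kneighbors( row.values.reshape( 1, -1 ) )
    for suggestion in suggestions[ 0 ]:
        print( model_df.index[ suggestion ] )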

In [66]:
from sklearn.neighbors import NearestNeighbors

model = NearestNeighbors( algorithm="brute" )
model.fit( model_df )

NearestNeighbors(algorithm='brute')

In [67]:
print_suggestions( "1984", model )

print()

print_suggestions( "How to build a book recommendation engine", model )

1984
Almost Paradise
Women in His Life
Passion's Promise
The Great Train Robbery

Sorry, we do not have title 'How to build a book recommendation engine' in our database


Everything seems to work fine, but let's examine the `model_df` dataframe further.

In [68]:
%time model_df.describe()

CPU times: user 13 s, sys: 200 ms, total: 13.2 s
Wall time: 13.4 s


User-ID,242,243,254,383,388,408,446,487,503,507,...,278194,278202,278221,278356,278535,278563,278582,278633,278843,278851
count,2676.0,2676.0,2676.0,2676.0,2676.0,2676.0,2676.0,2676.0,2676.0,2676.0,...,2676.0,2676.0,2676.0,2676.0,2676.0,2676.0,2676.0,2676.0,2676.0,2676.0
mean,0.003737,0.040732,0.053812,0.005605,0.01719,0.0,0.004111,0.007474,0.014948,0.028401,...,0.034753,0.013453,0.017937,0.023169,0.032138,0.0,0.037369,0.062033,0.033632,0.00299
std,0.193311,0.572439,0.676509,0.216096,0.351876,0.0,0.165145,0.223691,0.386405,0.467981,...,0.493611,0.316237,0.380427,0.456137,0.530645,0.0,0.553751,0.701854,0.525596,0.154649
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,10.0,10.0,10.0,10.0,10.0,0.0,8.0,7.0,10.0,10.0,...,9.0,10.0,9.0,10.0,10.0,0.0,10.0,10.0,9.0,8.0


Even a simple `describe` takes a considerable amount of time to compute. This is no surprise, as there is a column for every user whose reviews we considered, and almost all cells are zero. Fortunately, `scipy` provides the `csr_matrix` class, which stores only the non-zero entries.
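
As a rough illustration of what this buys us, the sketch below builds a made-up, mostly empty ratings matrix (the sizes and the share of non-zero cells are assumptions, not measurements of `model_df`) and compares the memory used by the dense array and by its CSR form:

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)

# toy ratings matrix: 1000 "books" x 2000 "users", roughly 1% non-zero cells
dense = np.zeros((1000, 2000))
rows = rng.integers(0, 1000, size=20_000)
cols = rng.integers(0, 2000, size=20_000)
dense[rows, cols] = rng.integers(1, 11, size=20_000)

sparse = csr_matrix(dense)

dense_bytes = dense.nbytes
# CSR stores only the non-zero values plus two index arrays
sparse_bytes = sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes

print(f"non-zero share: {sparse.nnz / dense.size:.2%}")
print(f"dense:  {dense_bytes / 1e6:.1f} MB")
print(f"sparse: {sparse_bytes / 1e6:.2f} MB")
```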

In [69]:
from scipy.sparse import csr_matrix

sparse_model_df = csr_matrix( model_df )

In [70]:
sparse_model = NearestNeighbors( algorithm="brute" )
sparse_model.fit( sparse_model_df )

NearestNeighbors(algorithm='brute')

In [71]:
print_suggestions( "1984", sparse_model )

print()

print_suggestions( "How to build a book recommendation engine", sparse_model )

1984
Almost Paradise
Women in His Life
Passion's Promise
The Great Train Robbery

Sorry, we do not have title 'How to build a book recommendation engine' in our database


# Future paths

I found [this](https://towardsdatascience.com/building-a-content-based-book-recommendation-engine-9fd4d57a4da) article on a similar problem. I like the approach the author used: scraping the content and doing some NLP is a great way to make better-informed recommendations, as we do not need user reviews and do not have to worry about outliers.

Also at this stage, there is no evaluation of the model performance
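
One possible starting point, sketched here under the assumption that we hold out some of a user's highly rated titles and check how many of them show up among the suggestions, is precision at k. The function name and the example titles below are hypothetical; nothing like this exists in the notebook yet.

```python
def precision_at_k(recommended: list, relevant: set, k: int) -> float:
    """Share of the first k recommendations that are in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = recommended[:k]
    hits = sum(1 for title in top_k if title in relevant)
    return hits / k


# hypothetical example: titles a held-out user actually rated highly
held_out_likes = {"Animal Farm", "Brave New World"}
recs = ["Almost Paradise", "Animal Farm", "Women in His Life", "Brave New World"]

print(precision_at_k(recs, held_out_likes, k=4))  # 0.5
```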

Even though I dropped the images at the beginning, it would be nice to display book covers alongside the book titles, as we as humans are very visual

Currently, the model suggests something only if two conditions are met: the title is in the database, and it is written in exactly the same format as it is stored in the dataframe. One solution is to match on ISBN (really inconvenient for the user); another is to add an abstraction layer over the dataframe which computes a string distance between the query and the candidate titles, for example the Hamming distance, or an edit distance for titles of different lengths.
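
A minimal sketch of such a layer, using only the standard library: titles are normalized (lower-cased, whitespace collapsed) and the closest stored title is looked up with `difflib.get_close_matches`. Note that this swaps in `difflib`'s similarity ratio rather than Hamming distance, since Hamming distance is only defined for strings of equal length. The helper names `normalize` and `find_title` and the cutoff of 0.6 are assumptions for illustration.

```python
import difflib


def normalize(title: str) -> str:
    """Lower-case and collapse whitespace so trivial differences do not matter."""
    return " ".join(title.lower().split())


def find_title(query: str, titles, cutoff: float = 0.6):
    """Return the stored title closest to `query`, or None if nothing is close enough."""
    lookup = {normalize(t): t for t in titles}  # normalized form -> stored form
    matches = difflib.get_close_matches(normalize(query), list(lookup), n=1, cutoff=cutoff)
    return lookup[matches[0]] if matches else None


titles = ["1984", "Zoya", "You Shall Know Our Velocity"]

print(find_title("you shall know our velocity ", titles))  # You Shall Know Our Velocity
print(find_title("How to build a book recommendation engine", titles))
```

In the notebook, the result of `find_title( query, model_df.index )` could then be passed to `print_suggestions` instead of the raw user input.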



# Bonus exercise

I spun up a **very simple** web page to demonstrate the model. To run an instance locally, please follow these steps: