## Recommendation in Python: LighFM

In [1]:
# import dependent libraries
import pandas as pd
import os
from scipy.sparse import csr_matrix
import numpy as np
from IPython.display import display_html
import warnings

import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec
import seaborn as sns
%matplotlib inline

from lightfm.cross_validation import random_train_test_split
from lightfm.evaluation import auc_score, precision_at_k, recall_at_k
from lightfm import LightFM
from skopt import forest_minimize

def display_side_by_side(*args):
    html_str = ''
    for df in args:
        html_str += df.to_html()
    display_html(html_str.replace(
        'table', 'table style="display:inline"'), raw=True)


# update the working directory to the root of the project
os.chdir('..')
warnings.filterwarnings("ignore")

  from numpy.core.umath_tests import inner1d


** **
### Goodreads Data

The datasets were collected in late 2017 from goodreads.com, where we only scraped users' public shelves, i.e. everyone can see it on web without login. User IDs and review IDs are anonymized. 

We collected these datasets for academic use only. Please do not redistribute them or use for commercial purposes. 


There are three groups of datasets: (1) meta-data of the books, (2) user-book interactions (users' public shelves) and (3) users' detailed book reviews. These datasets can be merged together by matching book/user/review ids. For the purposes of this tutorial, we'll be using only the former two.

You can download the dataset using in this article from here:
1. Books Metadata: https://drive.google.com/uc?id=1H6xUV48D5sa2uSF_BusW-IBJ7PCQZTS1
2. User-Book Interactions: https://drive.google.com/uc?id=17G5_MeSWuhYnD4fGJMvKRSOlBqCCimxJ

#### Load Raw Data

In [2]:
%%time
books_metadata = pd.read_json('./data/goodreads_books_poetry.json', lines=True)
interactions = pd.read_json('./data/goodreads_interactions_poetry.json', lines=True)

CPU times: user 2min 10s, sys: 12.8 s, total: 2min 23s
Wall time: 2min 31s


** **
#### Data Inspection & Preparation: Books Metadata

Let's start by inspecting the books' metadata information. To develop a reliable and robust ML model, it is essential to get a thorough understanding of the available data.

As the first step, let's take a look at all the available fields, and sample data

In [3]:
books_metadata.columns.values

array([u'asin', u'authors', u'average_rating', u'book_id',
       u'country_code', u'description', u'edition_information', u'format',
       u'image_url', u'is_ebook', u'isbn', u'isbn13', u'kindle_asin',
       u'language_code', u'link', u'num_pages', u'popular_shelves',
       u'publication_day', u'publication_month', u'publication_year',
       u'publisher', u'ratings_count', u'series', u'similar_books',
       u'text_reviews_count', u'title', u'title_without_series', u'url',
       u'work_id'], dtype=object)

In [4]:
books_metadata.sample(2)

Unnamed: 0,asin,authors,average_rating,book_id,country_code,description,edition_information,format,image_url,is_ebook,...,publication_year,publisher,ratings_count,series,similar_books,text_reviews_count,title,title_without_series,url,work_id
3086,,"[{u'author_id': u'5031312', u'role': u''}, {u'...",4.06,3656020,US,"A landmark of world literature, The Divine Com...",,Hardcover,https://images.gr-assets.com/books/1397524730m...,False,...,2008,Barnes and Noble,462,[444230],"[38154, 51799, 3311228, 13767037, 138144, 8756...",56,The Divine Comedy,The Divine Comedy,https://www.goodreads.com/book/show/3656020-th...,809248
18621,,"[{u'author_id': u'152483', u'role': u'Editor'}]",3.3,260876,US,,,Hardcover,https://s.gr-assets.com/assets/nophoto/book/11...,False,...,1973,OUP Oxford,33,[],[],4,The Homeric Hymn to Demeter,The Homeric Hymn to Demeter,https://www.goodreads.com/book/show/260876.The...,252847


In [5]:
books_metadata.shape

(36514, 29)

** **
While all the available information is vital to extract contextual information to be able to train a better recommendation system, for this example, we'll only focus on the selected fields that require minimal manipulation.

In [6]:
# Limit the books metadata to selected fields
books_metadata_selected = books_metadata[['book_id', 'average_rating', 'is_ebook', 'num_pages', 
                                          'publication_year', 'ratings_count', 'language_code']]
books_metadata_selected.sample(5)

Unnamed: 0,book_id,average_rating,is_ebook,num_pages,publication_year,ratings_count,language_code
7326,333171,4.02,False,176.0,1999,14,
33952,32940040,4.0,False,110.0,2017,38,
11012,702390,4.19,False,336.0,2002,53,
31968,1271413,4.19,False,,1992,19,eng
31512,955328,4.54,False,106.0,1994,28,


** **
Now that we have the data with selected fields, next, we'll run it through pandas profiler to perform preliminary exploratory data analysis to help us better understand the available data

In [7]:
import pandas_profiling
import numpy as np

# replace blank cells with NaN
books_metadata_selected.replace('', np.nan, inplace=True)

# not taking book_id into the profiler report
profile = pandas_profiling.ProfileReport(books_metadata_selected[['average_rating', 'is_ebook', 'num_pages', 
                                                                  'publication_year', 'ratings_count']])
profile.to_file('./results/profiler_books_metadata_1.html')

** **

Considering the results from the profiler, we'll perform following transformations to the dataset:
- Replace the missing value of categorical values with another value to create a new category
- Convert bin values for numeric variables into discrete intervals

In [8]:
# using pandas cut method to convert fields into discrete intervals
books_metadata_selected['num_pages'].replace(np.nan, -1, inplace=True)
books_metadata_selected['num_pages'] = pd.to_numeric(books_metadata_selected['num_pages'])
books_metadata_selected['num_pages'] = pd.cut(books_metadata_selected['num_pages'], bins=25)

# rounding ratings to neares .5 score
books_metadata_selected['average_rating'] = books_metadata_selected['average_rating'].apply(lambda x: round(x*2)/2)

# using pandas qcut method to convert fields into quantile-based discrete intervals
books_metadata_selected['ratings_count'] = pd.qcut(books_metadata_selected['ratings_count'], 25)

# replacing missing values to year 2100
books_metadata_selected['publication_year'].replace(np.nan, 2100, inplace=True)

# replacing missing values to 'unknown'
books_metadata_selected['language_code'].replace(np.nan, 'unknown', inplace=True)


# convert is_ebook column into 1/0 where true=1 and false=0
books_metadata_selected['is_ebook'] = books_metadata_selected.is_ebook.map(
    lambda x: 1.0*(x == 'true'))

In [9]:
profile = pandas_profiling.ProfileReport(books_metadata_selected[['average_rating', 'is_ebook', 'num_pages', 
                                                        'publication_year', 'ratings_count']])
profile.to_file('./results/profiler_books_metadata_2.html')

In [10]:
books_metadata_selected.sample(5)

Unnamed: 0,book_id,average_rating,is_ebook,num_pages,publication_year,ratings_count,language_code
36017,25517162,3.5,0.0,"(-11.961, 437.44]",2100,"(12.0, 14.0]",ara
18331,7135851,3.5,0.0,"(-11.961, 437.44]",2009,"(21.0, 25.0]",unknown
12530,25606981,3.0,0.0,"(-11.961, 437.44]",2014,"(14.0, 16.0]",unknown
30351,28604835,4.0,0.0,"(-11.961, 437.44]",2100,"(10.0, 12.0]",eng
14159,11247623,2.5,0.0,"(-11.961, 437.44]",2008,"(7.0, 8.0]",ara


** **
#### Data Inspection & Preparation: Interactions data

As the first step, let's take a look at all the available fields, and sample data

In [11]:
interactions.columns.values

array([u'book_id', u'date_added', u'date_updated', u'is_read', u'rating',
       u'read_at', u'review_id', u'review_text_incomplete', u'started_at',
       u'user_id'], dtype=object)

In [12]:
interactions.sample(5)

Unnamed: 0,book_id,date_added,date_updated,is_read,rating,read_at,review_id,review_text_incomplete,started_at,user_id
1347680,439414,Mon Feb 21 15:54:21 -0800 2011,Mon Mar 07 21:57:46 -0800 2011,True,4,,d5822adad9c6a2489c92d0b225c1798f,,,170f95b0339fb6056300f78e5d92d288
2444877,175626,Mon Feb 03 17:21:39 -0800 2014,Mon Feb 03 17:21:39 -0800 2014,False,0,,b2d7fc05f7ec173e92899436f26c42f7,,,b4043087709ce387d8293b1fd5a68af1
864186,158008,Tue Jun 23 00:14:03 -0700 2009,Tue Jun 23 00:14:03 -0700 2009,True,4,,46696d8f43e97004be17e95e8ea4e010,,,b656a86781f91808753cd64f0d26aae5
2693971,238389,Sat Sep 02 03:44:52 -0700 2017,Sat Sep 02 03:44:53 -0700 2017,False,0,,9e5f913fb46aebd974017937616fe33f,,,1dd9334eb5fd263a14da7bebcee9163b
1296876,820905,Fri Oct 09 05:41:44 -0700 2015,Mon Nov 14 15:58:09 -0800 2016,True,5,Sun Nov 01 00:00:00 -0700 2015,4e226da1b805157de0b8bb7248ce6e17,Fabulous!,Fri Oct 09 00:00:00 -0700 2015,f17efdd7949a3d13324a7d12c1c762ce


In [13]:
interactions.shape

(2734350, 10)

** **
While all the available information is vital to extract contextual information to be able to train a better recommendation system, for this example, we'll only focus on the selected fields that require minimal manipulation.

In [14]:
# Limit the books metadata to selected fields
interactions_selected = interactions[['user_id', 'book_id', 'is_read', 'rating']]

# mapping boolean to string
booleanDictionary = {True: 'true', False: 'false'}
interactions_selected['is_read'] = interactions_selected['is_read'].replace(booleanDictionary)

interactions_selected.sample(5)

Unnamed: 0,user_id,book_id,is_read,rating
1953585,00fa285b98be274a5795117c7eacbefb,159304,False,0
1470153,c8b35bdfe636ecfad71ad94073ac8d09,6017893,False,0
1600133,c525c7b37653b3323858a08b277104e3,31602,False,0
850988,fc7899ccbdff5743c974ba04b2aa0f41,461938,True,5
479841,d6fdb107be00c81e76c42f5a678283e7,27418,True,3


In [15]:
profile = pandas_profiling.ProfileReport(interactions_selected[['is_read', 'rating']])
profile.to_file('./results/profiler_interactions.html')

** **

Considering the results from the profiler, we'll perform following transformations to the dataset:
- Convert is_read column to 1/0

In [16]:
# convert is_read column into 1/0 where true=1 and false=0
interactions_selected['is_read'] = interactions_selected.is_read.map(
    lambda x: 1.0*(x == 'true'))

In [17]:
interactions_selected.sample(10)

Unnamed: 0,user_id,book_id,is_read,rating
1565125,1f9f847ce20c58c12ac7f1e815df5d7f,22267492,0.0,0
38635,9cd8fb7c611544b2e09ef5226ce8dbcb,24874353,1.0,3
847749,f2bac05b3932fe7c68960041744e5058,310336,0.0,0
1046835,700a4402bc8f09166fc98bc48bc0c525,11347806,0.0,0
486934,adfc7584a1fef507364c722ad6e3c106,12966360,0.0,0
860513,5a1c12910a59122c8653307c06cf130f,27494,1.0,5
656911,787f452029e97df574375ace96c5a782,119234,1.0,3
1800541,ec728fff5c2c096888e400118c0bc1b0,76542,0.0,0
498215,c8ff09eaf35e3d6b519cd5594139224c,69547,0.0,0
12693,38c2feb6b72d473f1b710516e018244e,493428,0.0,0


** **
Since we have two fields denoting interaction between a user and a book, `is_read` and `rating` - let's see how many data points we have where the user hasn't read the book but have given the ratings.

In [18]:
interactions_selected.groupby(['rating', 'is_read']).size().reset_index().pivot(columns='rating', index='is_read', values=0)

rating,0,1,2,3,4,5
is_read,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0.0,1420740.0,,,,,
1.0,84551.0,20497.0,64084.0,237942.0,405565.0,500971.0


From the above results, we can conclusively infer that users with ratings >= 1 have all read the book. Therefore, we'll use the `ratings` as the final score, drop interactions where `is_read` is false, and limit interactions from random 500 users to limit the data size for further analysis 

In [19]:
import random

interactions_selected = interactions_selected.loc[interactions_selected['is_read']==1, ['user_id', 'book_id', 'rating']]

interactions_selected = interactions_selected[interactions_selected['user_id'].isin(random.sample(list(interactions_selected['user_id'].unique()), 
                                                                                                  k=5000))]

interactions_selected.sample(10)

Unnamed: 0,user_id,book_id,rating
368671,08184d08ae08d26bacd5c00230141ce5,6130588,5
1755032,cae938e69bbc27a393153aed6861b9fe,8744427,5
2675075,26ef759d338ff493bdcd501fb05127db,1420,5
2613547,44af3a5167e24e9fe4761ef20487e675,662635,5
794869,e471985fd3f272455e048aa7f7f05b73,1420,5
341820,d5708c6364fa01c5c0c2a31427c10664,6555075,5
1160855,b8ee26f9150c1e6d46a900c423b44d7b,133619,5
1015659,4e7d97a0afddd934f01e7d7c22eb3ef1,240258,4
917119,c6a203dbca8acc76a49bc68808ccce33,11030407,0
1024204,cc8abbdd380a5dc5c0d068f99b4eab1c,160959,0


In [20]:
interactions_selected.shape

(23989, 3)

** **
#### Data Preprocessing

Now, let's transform the available data into CSR sparse matrix that can be used for matrix operations. We will start by the process by creating books_metadata matrix which is np.float64 csr_matrix of shape ([n_books, n_books_features]) – Each row contains that book's weights over features. However, before we create a sparse matrix, we'll first create a item dictionar for future references

In [21]:
item_dict ={}
df = books_metadata[['book_id', 'title']].sort_values('book_id').reset_index()

for i in range(df.shape[0]):
    item_dict[(df.loc[i,'book_id'])] = df.loc[i,'title']

In [22]:
# dummify categorical features
books_metadata_selected_transformed = pd.get_dummies(books_metadata_selected, columns = ['average_rating', 'is_ebook', 'num_pages', 
                                                                                         'publication_year', 'ratings_count', 
                                                                                         'language_code'])

books_metadata_selected_transformed = books_metadata_selected_transformed.sort_values('book_id').reset_index().drop('index', axis=1)
books_metadata_selected_transformed.head(5)

Unnamed: 0,book_id,average_rating_0.0,average_rating_1.0,average_rating_1.5,average_rating_2.0,average_rating_2.5,average_rating_3.0,average_rating_3.5,average_rating_4.0,average_rating_4.5,...,language_code_tel,language_code_tgl,language_code_tha,language_code_tlh,language_code_tur,language_code_ukr,language_code_unknown,language_code_urd,language_code_vie,language_code_zho
0,234,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,1,0,0,0
1,236,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,1,0,0,0
2,241,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,1,0,0,0
3,244,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
4,254,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,1,0,0,0


In [23]:
# convert to csr matrix
books_metadata_csr = csr_matrix(books_metadata_selected_transformed.drop('book_id', axis=1).values)
books_metadata_csr

<36514x357 sparse matrix of type '<type 'numpy.uint8'>'
	with 219084 stored elements in Compressed Sparse Row format>

** **
Next we'll create a iteractions matrix which is np.float64 csr_matrix of shape ([n_users, n_books]). We'll also create a user dictionary for future use cases

In [24]:
user_book_interaction = pd.pivot_table(interactions_selected, index='user_id', columns='book_id', values='rating')

# fill missing values with 0
user_book_interaction = user_book_interaction.fillna(0)

user_book_interaction.head(10)

book_id,234,236,254,284,289,290,291,292,459,462,...,35663570,35668923,35670989,35704999,35878020,35887236,36070215,36096745,36122873,36295400
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
001404f6349ae5aa020fbd9e30196067,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
002071c96681a45ca8f8dac10d080275,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
003299b767208cf9f83950e311e6856d,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0035eb991e74d5411f6f3ee88c6baff1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
003c3cbe1f0bf247fc1bae43984e333b,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0059f2af7ba41747be006788caa26f78,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
005d83a471aed1691c8447b52ce4baaa,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
006142c2fdbc566078193da9d3c11a4a,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
006679ea5ba690fe5238d11238643a5c,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
006cace4ca0fb1c344e7148f5e63f22a,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [25]:
user_id = list(user_book_interaction.index)
user_dict = {}
counter = 0 
for i in user_id:
    user_dict[i] = counter
    counter += 1

In [26]:
# convert to csr matrix
user_book_interaction_csr = csr_matrix(user_book_interaction.values)
user_book_interaction_csr

<5000x7396 sparse matrix of type '<type 'numpy.float64'>'
	with 22395 stored elements in Compressed Sparse Row format>

** **
### Model Training

Ideally, we would build, train, and evaluate several models for our recommender system to determine which model holds the most promise for further optimization (hyper-parameter tuning).

However, for this tutorial, we'll train the base model, with randomly selected input parameters for demonstrations.

In [27]:
model = LightFM(loss='warp',
                random_state=2016,
                learning_rate=0.90,
                no_components=150,
                user_alpha=0.000005)

model = model.fit(user_book_interaction_csr,
                  epochs=100,
                  num_threads=16, verbose=False)

** **
#### Top n Recommendations

In [28]:
def sample_recommendation_user(model, interactions, user_id, user_dict, 
                               item_dict,threshold = 0,nrec_items = 5, show = True):
    
    n_users, n_items = interactions.shape
    user_x = user_dict[user_id]
    scores = pd.Series(model.predict(user_x,np.arange(n_items), item_features=books_metadata_csr))
    scores.index = interactions.columns
    scores = list(pd.Series(scores.sort_values(ascending=False).index))
    
    known_items = list(pd.Series(interactions.loc[user_id,:] \
                                 [interactions.loc[user_id,:] > threshold].index).sort_values(ascending=False))
    
    scores = [x for x in scores if x not in known_items]
    return_score_list = scores[0:nrec_items]
    known_items = list(pd.Series(known_items).apply(lambda x: item_dict[x]))
    scores = list(pd.Series(return_score_list).apply(lambda x: item_dict[x]))
    if show == True:
        print ("User: " + str(user_id))
        print("Known Likes:")
        counter = 1
        for i in known_items:
            print(str(counter) + '- ' + i)
            counter+=1

        print("\n Recommended Items:")
        counter = 1
        for i in scores:
            print(str(counter) + '- ' + i)
            counter+=1

In [171]:
sample_recommendation_user(model, user_book_interaction, 'ff52b7331f2ccab0582678644fed9d85', user_dict, item_dict)

User: ff52b7331f2ccab0582678644fed9d85
Known Likes:
1- Brown Girl Dreaming
2- The Crossover
3- Love, Dishonor, Marry, Die, Cherish, Perish
4- Odysséen
5- Iliaden
6- The Weight of Water
7- Fänrik Ståls sägner
8- Eddan: De nordiska guda- och hjältesångerna
9- V.
10- Aniara: An Epic Science Fiction Poem
11- The Melancholy Death of Oyster Boy and Other Stories
12- Paradise Regained by John Milton
13- The Tent
14- Paradise Lost
15- Hamlet

 Recommended Items:
1- Bronx Masquerade
2- La Navidad para un niño en Gales
3- How We Fare
4- Maya Angelou: The Complete Poetry
5- Shakespeare's Love Sonnets
