### Week 13: Recommender Systems

Instructor: Cornelia Ilin <br>
Email: cilin@ischool.berkeley.edu <br>


Citations: <br>
 - https://towardsdatascience.com/recommendation-system-in-python-lightfm-61c85010ce17 [Our example is adapted from here]
 - https://machinelearningmastery.com/sparse-matrices-for-machine-learning/
 - https://making.lyst.com/lightfm/docs/home.html#how-to-cite
 - https://realpython.com/build-recommendation-engine-collaborative-filtering/
 - https://developers.google.com/machine-learning/recommendation [Google's course on Recommender Systems, highly recommended!!]

### Objectives
 - Review of main recommender systems concepts.
 - Goodreads example: focus on the LightFM package.
 - Final thoughts on W207 - Applied Machine Learning.

I usually present the notebook to the class and I ask questions along the way. Today, we will do a flipped-classroom exercise, as follows:

 - form teams of 3-4 people.
 - as a team, fill in the blanks for section [Review of main concepts].
 - go over the notebook; make sure you understand what's going on at each step; pay attention to all the summary stats.
 - explain the book_recommendation_user() function; again, make sure you understand what's implemented at each step.
 - finally, I will randomly choose a team to present the notebook to the class.

---
### Review of main concepts
---

In [None]:
from IPython.display import Image
print('Source: https://towardsdatascience.com/recommendation-system-in-python-lightfm-61c85010ce17')
Image(filename='recom_systems.png', width=600)

In ``_________ filtering``, if user A is similar to user B, and user B likes video 1, then the system can recommend video 1 to user A (even if user A hasn’t seen any videos similar to video 1).

``_________ filtering`` is well suited to complex domains where items are not purchased very often, such as apartments and cars. It is based not on a user’s rating history, but on specific queries made by the user.

In ``_________ filtering``, if user A watches two cute cat videos, then the system can recommend cute animal videos to that user.

``_________ filtering`` is also referred to as neighborhood-based collaborative filtering, in which ratings of user-item combinations are predicted based on their neighborhoods. These neighborhoods can be further defined as (1) user-based and (2) item-based.

In ``_________ filtering``, ML techniques are used to learn model parameters within the context of a given optimization framework.

``_________ filtering`` is a special kind of recommender that uses both collaborative and content based filtering for making recommendations.

Today we will be focusing on an example that uses ``Hybrid Filtering``.
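
The first scenario above (user A is similar to user B, and user B likes video 1) can be sketched with a toy like/no-like matrix. The data, the choice of cosine similarity, and the single-nearest-neighbor rule are all assumptions made for illustration; this is not the method used later in this notebook.

```python
import numpy as np

# Toy data (made up for illustration): rows are users A, B, C; columns are videos.
# 1 = liked, 0 = no interaction.
likes = np.array([
    [0, 1, 1, 0],   # user A
    [1, 1, 1, 0],   # user B
    [0, 0, 0, 1],   # user C
], dtype=float)

def cosine_similarity(u, v):
    """Cosine similarity between two vectors (0 if either is all zeros)."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0

target = 0  # user A
sims = np.array([cosine_similarity(likes[target], likes[other])
                 for other in range(likes.shape[0])])
sims[target] = -1.0              # exclude the user themselves
neighbor = int(np.argmax(sims))  # most similar other user

# Recommend what the neighbor liked and the target has not interacted with yet.
recommended = np.where((likes[neighbor] == 1) & (likes[target] == 0))[0]
print("Most similar user index:", neighbor)
print("Recommended video indices:", recommended)
```

User A never interacted with the recommended video; it is suggested only because the most similar user liked it.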

---
### Goodreads example (LightFM package)
---

### Step 1: Import packages

In [1]:
# standard
import pandas as pd
import numpy as np
import pandas_profiling
import random
import os


# display, plots
from IPython.display import display_html
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec
import seaborn as sns

# recommender systems
from lightfm.cross_validation import random_train_test_split
from lightfm.evaluation import auc_score, precision_at_k, recall_at_k
from lightfm import LightFM
from scipy.sparse import csr_matrix

# warning
import warnings
warnings.filterwarnings("ignore")


### Step 2: Define functions

In [None]:
def book_recommendation_user(model, user_book_interaction, user_id, user_dict,
                             item_dict, threshold=0, nrec_items=5, show=True,
                             item_features=None):
    """ Define function here
    # param
    # return
    """
    # model prediction for user_id
    # (item_features must match what the model was fit with; leave it as None
    #  for a model trained on the interaction matrix only)
    n_users, n_items = user_book_interaction.shape
    user_x = user_dict[user_id]
    scores = pd.Series(model.predict(user_x, np.arange(n_items), item_features=item_features))
    scores.index = user_book_interaction.columns
    scores = list(scores.sort_values(ascending=False).index)

    # known items for user_id
    user_row = user_book_interaction.loc[user_id, :]
    known_items = list(pd.Series(user_row[user_row > threshold].index).sort_values(ascending=False))

    # recommended items for user_id
    scores = [x for x in scores if x not in known_items]
    return_score_list = scores[0:nrec_items]
    known_items = [item_dict[x] for x in known_items]
    scores = [item_dict[x] for x in return_score_list]

    if show:
        print("User: " + str(user_id))
        print("Known Likes:")
        for counter, title in enumerate(known_items, start=1):
            print(str(counter) + '- ' + title)

        print("\n Recommended Items:")
        for counter, title in enumerate(scores, start=1):
            print(str(counter) + '- ' + title)

---
### Step 3: Read Goodreads data
---

The datasets were collected from goodreads.com in late 2017 (see the first citation). Only users' public shelves were scraped, i.e. shelves that anyone can see on the web without logging in. User IDs and review IDs are anonymized.

Two datasets: 
 - meta-data of the books, 
 - user-book interactions (users' public shelves). 

These datasets can be merged together by matching book/user/review ids. 

You can download the datasets from here (make sure to put them in the same folder as this notebook, or set the path accordingly):
 - meta-data: https://drive.google.com/uc?id=1H6xUV48D5sa2uSF_BusW-IBJ7PCQZTS1
 - user-book interactions: https://drive.google.com/uc?id=17G5_MeSWuhYnD4fGJMvKRSOlBqCCimxJ

In [None]:
books_metadata = pd.read_json('goodreads_books_poetry.json', lines=True)
interactions = pd.read_json('goodreads_interactions_poetry.json', lines=True)

print('Size of books metadata', books_metadata.shape)
print('Size of user-book interactions data', interactions.shape)

---
### Step 4: Data visualization and preprocessing
---

#### ``Books Metadata``

#### (1) Inspect the data

In [None]:
books_metadata.head(3)

#### (2) Focus on a subsample of features
but feel free to experiment with the others on your own!

In [None]:
features = ['title','book_id', 'average_rating', 'is_ebook', 'num_pages', 
            'publication_year', 'ratings_count', 'language_code']

books_metadata = books_metadata[features]
books_metadata.head(3)

#### (3) Visualize the data using pandas_profiling

pandas_profiling generates profile reports from a pandas DataFrame. 

Quoting from their website, "the pandas df.describe() function is great but a little basic for serious exploratory data analysis. pandas_profiling extends the pandas DataFrame with df.profile_report() for quick data analysis."

In [None]:
# for meaningful pandas_profiling statistics, replace blank cells with NaN (check to see how the stats differ if you don't do this)
books_metadata.replace('', np.nan, inplace=True)

# select features for summary stats
profile = pandas_profiling.ProfileReport(books_metadata[['average_rating', 'is_ebook', 'num_pages', 
                                                                  'publication_year', 'ratings_count']]) 
profile.to_notebook_iframe()

#### (4) Preprocess this data
- Feature transformation:
    - Replace missing values of categorical variables with a placeholder value, which creates a new category
    - Bin numeric variables into discrete intervals
- Perform one-hot encoding on the data
- Convert dense matrix to sparse matrix
- Create a book dictionary

Let's start by transforming the feature values.

In [None]:
# replace missing page counts with -1, then use the pandas cut method to convert the field into equal-width intervals
books_metadata['num_pages'] = books_metadata['num_pages'].replace(np.nan, -1)
books_metadata['num_pages'] = pd.to_numeric(books_metadata['num_pages'])
books_metadata['num_pages'] = pd.cut(books_metadata['num_pages'], bins=25)

# rounding ratings to nearest .5 score
books_metadata['average_rating'] = books_metadata['average_rating'].apply(lambda x: round(x*2)/2)

# using pandas qcut method to convert fields into quantile-based discrete intervals
books_metadata['ratings_count'] = pd.qcut(books_metadata['ratings_count'], 25)

# replacing missing values with year 2100
books_metadata['publication_year'] = books_metadata['publication_year'].replace(np.nan, 2100)

# replacing missing values with 'unknown'
books_metadata['language_code'] = books_metadata['language_code'].replace(np.nan, 'unknown')

# convert is_ebook column into 1/0 where true=1 and false=0
books_metadata['is_ebook'] = books_metadata.is_ebook.map(lambda x: 1.0*(x == 'true'))

In [None]:
# visualize the data after feature transformation
books_metadata.head(3)
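
The difference between the two binning methods used above can be seen on a small, made-up series. The values below are assumptions chosen to be heavily skewed (a stand-in for something like `ratings_count`); this is only a sketch, not part of the Goodreads pipeline.

```python
import pandas as pd

# Made-up, heavily skewed values: most are small, one is huge.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 1000])

equal_width = pd.cut(values, bins=4)   # 4 intervals of equal width
equal_count = pd.qcut(values, q=4)     # 4 intervals with (roughly) equal counts

print(equal_width.value_counts().sort_index())
print(equal_count.value_counts().sort_index())
```

With equal-width bins almost every value lands in the first interval, while quantile-based bins spread the values evenly. This is one reason to prefer `qcut` for skewed count-like features.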

Now, let's perform a one-hot encoding on this data

In [None]:
books_metadata = pd.get_dummies(books_metadata, columns = ['average_rating', 'is_ebook', 'num_pages', 
                                                           'publication_year', 'ratings_count', 
                                                           'language_code'])

books_metadata = books_metadata.sort_values('book_id').reset_index().drop('index', axis=1)
books_metadata.head(3)

In [None]:
books_metadata.shape

We can see that now we have a lot more features in the books_metadata. At first sight it seems that there may be lots of zeros, so we need to investigate this.
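
The growth in the number of columns can be illustrated on a toy frame. The column values below are made up; only `pd.get_dummies` itself is taken from the cell above.

```python
import pandas as pd

# Made-up frame with two categorical-like columns.
toy = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'language_code': ['eng', 'spa', 'unknown'],
    'is_ebook': [1.0, 0.0, 1.0],
})

# Each distinct value of an encoded column becomes its own indicator column.
encoded = pd.get_dummies(toy, columns=['language_code', 'is_ebook'])
print(encoded.columns.tolist())
print(encoded.shape)
```

Three distinct language codes and two distinct `is_ebook` values turn two columns into five, and each row has exactly one non-zero entry per original column.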

``Question:`` What is the difference between a dense vs. sparse matrix?

In [None]:
print('The sparsity of the books_metadata is:',
      round((np.size(books_metadata)-np.count_nonzero(books_metadata))/np.size(books_metadata) *100,2),
      'percent')

---
``Sparse vs. Dense matrices``

Storing a matrix that is mostly zeros in dense form is a waste of memory, since the zero values do not contain any information.

It also wastes computation: if the matrix contains mostly zero values, i.e. no data, then operations across the matrix may take a long time, with the bulk of the computation spent adding or multiplying zeros.

Sparse matrices turn up a lot in applied ML!

Sparse matrices come up in encoding schemes used in the preparation of data. Three common examples include:

 - One-hot encoding, used to represent categorical data as sparse binary vectors [this is what we did above].
 - Count encoding, used to represent the frequency of words in a vocabulary for a document
 - TF-IDF encoding, used to represent normalized word frequency scores in a vocabulary.
 
The solution to representing and working with sparse matrices is to use an alternate data structure to represent the sparse data.


Two common data structures are:
- Compressed Sparse Row (CSR). The sparse matrix is represented using three one-dimensional arrays for the non-zero values, the extents of the rows, and the column indexes.
- Compressed Sparse Column (CSC). The same as the Compressed Sparse Row method except the column indices are compressed and read first before the row indices.

CSR is very popular in ML.

The idea is to ignore the zero values and focus only on the non-zero values.

SciPy provides tools for creating sparse matrices using multiple data structures, as well as tools for converting a dense matrix to a sparse matrix. So this is what we will be using today.
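
As a minimal sketch of the CSR layout described above, the snippet below converts a small, made-up dense matrix and prints the three one-dimensional arrays SciPy stores for it.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Made-up 4x4 matrix with only four non-zero entries.
dense = np.array([
    [1, 0, 0, 0],
    [0, 0, 2, 0],
    [0, 0, 0, 0],
    [3, 0, 0, 4],
])

sparse = csr_matrix(dense)

print("non-zero values:", sparse.data)     # the stored values, row by row
print("column indices: ", sparse.indices)  # column of each stored value
print("row pointers:   ", sparse.indptr)   # row i spans data[indptr[i]:indptr[i+1]]

# Sparsity: share of entries that are zero.
sparsity = 1 - sparse.nnz / (dense.shape[0] * dense.shape[1])
print("sparsity:", sparsity)

# Converting back recovers the original dense matrix.
roundtrip_ok = bool((sparse.toarray() == dense).all())
print("round trip ok:", roundtrip_ok)
```

Only the four non-zero values are stored. The empty third row shows up as two equal consecutive entries in the row pointers.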

---

Now, let's transform the books_metadata into a CSR sparse matrix that can be used for matrix operations

In [None]:
books_metadata_csr = csr_matrix(books_metadata.drop(['book_id','title'], axis=1).values)
books_metadata_csr

In [None]:
pd.DataFrame(books_metadata_csr).head()

Finally, let's create a dictionary of book titles:

In [None]:
item_dict ={}
df = books_metadata[['book_id', 'title']].sort_values('book_id').reset_index()

for i in range(df.shape[0]):
    item_dict[(df.loc[i,'book_id'])] = df.loc[i,'title']

# print first 5 items:
for item in list(item_dict)[0:5]:
    print (item, item_dict[item])

#### ``User-book interactions data``

#### (1) Inspect data

In [None]:
interactions.head(3)

In [None]:
interactions.shape

#### (2) Focus on a subsample of features
but feel free to experiment with the others on your own!

In [None]:
# limit the interactions data to selected features
features = ['user_id', 'book_id', 'is_read', 'rating']
interactions = interactions[features]

interactions.head()

#### (3) Visualize data using pandas_profiling

In [None]:
# for meaningful pandas_profiling statistics, replace blank cells with NaN
interactions.replace('', np.nan, inplace=True)

profile = pandas_profiling.ProfileReport(interactions[['is_read', 'rating']])
profile.to_notebook_iframe()

#### (4) Preprocess this data

- Feature transformation:
    - map boolean values (is_read) to string
    - convert is_read column to 1/0
    - check is_read and rating consistency
    - select a subsample of this data (RAM concerns)
- Perform one-hot encoding on the data
- Convert dense matrix to sparse matrix
- Create a user dictionary

Let's start by transforming the value of features.

In [None]:
# mapping boolean to string
booleanDictionary = {True: 'true', False: 'false'}
interactions['is_read'] = interactions['is_read'].replace(booleanDictionary)

# convert is_read column into 1/0 where true=1 and false=0
interactions['is_read'] = interactions.is_read.map(lambda x: 1.0*(x == 'true'))

In [None]:
interactions.head()

Since we have two fields denoting an interaction between a user and a book, `is_read` and `rating`, let's see how many data points we have where the user hasn't read the book but has given a rating.

In [None]:
interactions.groupby(['rating', 'is_read']).size().reset_index().pivot(columns='rating', index='is_read', values=0)

So we can conclude that `is_read` and `rating` are consistent: every interaction with a rating >= 1 is also marked as read.

Finally, for the feature transformation step we will:
- drop interactions where ``is_read`` is false
- keep only the interactions of a random sample of 5000 users

In [None]:
# drop if is_read == false
interactions = interactions.loc[interactions['is_read']==1, ['user_id', 'book_id', 'rating']]

# randomly select 5000 users
interactions = interactions[interactions['user_id'].isin(
               random.sample(list(interactions['user_id'].unique()), k=5000))]

# print first 5
interactions.head()

In [None]:
interactions.shape

``Question:`` How many users did we drop?

Now, let's build the user-book interaction matrix (a form of one-hot encoding) from this data. Note that if a user read the book but did not provide a rating, then `rating`=0.

In [None]:
# user-book interaction matrix: one row per user, one column per book, cell value = rating
user_book_interaction = pd.pivot_table(interactions, index='user_id', columns='book_id', values='rating')

# fill missing values with 0
user_book_interaction = user_book_interaction.fillna(0)

user_book_interaction.head(10)

In [None]:
user_book_interaction.shape
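
The pivot step above can be traced on a tiny, made-up interactions table. The user and book ids below are hypothetical; the calls mirror the ones used on the real data.

```python
import pandas as pd

# Made-up interactions: user "u2" read book 20 but left it unrated (rating 0).
toy = pd.DataFrame({
    'user_id': ['u1', 'u1', 'u2', 'u3'],
    'book_id': [10, 20, 20, 30],
    'rating':  [5, 3, 0, 4],
})

# One row per user, one column per book, cell = rating.
matrix = pd.pivot_table(toy, index='user_id', columns='book_id', values='rating')

# User-book pairs with no interaction come out as NaN; fill them with 0.
matrix = matrix.fillna(0)
print(matrix)
```

In this toy matrix the 0 for `u2` and book 20 (read but unrated) is indistinguishable from the 0s for books a user never interacted with, which is worth keeping in mind when interpreting the matrix.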

Again, the data seems to be filled with lots of zeros. Let's check the sparsity of the matrix.

In [None]:
print('The sparsity of the user_book_interaction matrix is:',
      round((np.size(user_book_interaction)-np.count_nonzero(user_book_interaction))/np.size(user_book_interaction) *100,2),
      'percent')

Create a CSR sparse matrix

In [None]:
# convert to csr matrix
user_book_interaction_csr = csr_matrix(user_book_interaction.values)
user_book_interaction_csr

In [None]:
# visualize as data frame
pd.DataFrame(user_book_interaction_csr).head()

We can see that a list of tuples is stored with each tuple containing the row index, column index, and the non-zero value.

Finally, let's create a user dictionary

In [None]:
user_id = list(user_book_interaction.index)
# map each user_id to its row position in the interaction matrix
user_dict = {uid: position for position, uid in enumerate(user_id)}

# print first 5 items:
for item in list(user_dict)[0:5]:
    print (item, user_dict[item])

---
### Step 5: Model training
---

Ideally, we would build a train and test set, and evaluate several models for our recommender system to determine which model holds the most promise for further optimization (hyperparameter tuning).

Today we will train a base model, with randomly selected input parameters (play with these on your own!)

In [None]:
model = LightFM(loss='warp',
                random_state=0,
                learning_rate=0.90,
                no_components=150,
                user_alpha=0.000005)

model = model.fit(user_book_interaction_csr,
                  epochs=100,
                  num_threads=16, verbose=False)
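
Step 1 imports `precision_at_k` from LightFM, one of the ranking metrics that could be used for the evaluation mentioned above. As a rough sketch of what such a metric measures, the snippet below computes precision at k by hand on made-up scores. The helper `precision_at_k_manual` and the toy data are assumptions for illustration; this is not LightFM's implementation.

```python
import numpy as np

def precision_at_k_manual(scores, relevant, k=5):
    """Fraction of the k highest-scored items that are in `relevant`.

    Hypothetical helper for illustration; not LightFM's implementation.
    """
    top_k = np.argsort(-np.asarray(scores))[:k]   # indices of the k largest scores
    hits = sum(1 for item in top_k if int(item) in relevant)
    return hits / k

# Made-up scores for 6 items and a made-up set of held-out items the user liked.
scores = [0.9, 0.1, 0.8, 0.3, 0.7, 0.2]
held_out_likes = {0, 3, 4}

p_at_3 = precision_at_k_manual(scores, held_out_likes, k=3)
print("precision@3:", p_at_3)
```

Here the three highest-scored items are 0, 2 and 4, of which two were actually liked, so two thirds of the top-3 recommendations are hits.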

---
### Step 6: Find top 5 book recommendations for a user
---

In [None]:
# define user_id: pick the user at position 1234 of user_dict
user_id = list(user_dict.keys())[1234]

# find book recommendations
book_recommendation_user(model, user_book_interaction, user_id, user_dict, item_dict)