<span style="font-size:36px"><b>Recommender System</b></span>

Copyright 2019 Gunawan Lumban Gaol, Mike Bratanata

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

# Import Packages

In [1]:
import numpy as np
import pandas as pd

# Import Data

The dataset comes from https://github.com/zygmuntz/goodbooks-10k, a recent version of the Goodreads books dataset.

In [2]:
# ratings = pd.read_csv("https://raw.githubusercontent.com/zygmuntz/goodbooks-10k/master/ratings.csv", encoding="latin1")
# books = pd.read_csv("https://raw.githubusercontent.com/zygmuntz/goodbooks-10k/master/books.csv", encoding="latin1")
# book_tags = pd.read_csv("https://raw.githubusercontent.com/zygmuntz/goodbooks-10k/master/book_tags.csv", encoding="latin1")
# tags = pd.read_csv("https://raw.githubusercontent.com/zygmuntz/goodbooks-10k/master/tags.csv", encoding="latin1")

Save the data files locally.

In [3]:
# ratings.to_csv('dataset/ratings.csv', header=True, index=False, line_terminator='\n', sep=',')
# books.to_csv('dataset/books.csv', header=True, index=False, line_terminator='\n', sep=',')
# book_tags.to_csv('dataset/book_tags.csv', header=True, index=False, line_terminator='\n', sep=',')
# tags.to_csv('dataset/tags.csv', header=True, index=False, line_terminator='\n', sep=',')

Load the saved dataset.

In [4]:
ratings = pd.read_csv('dataset/ratings.csv')
books = pd.read_csv('dataset/books.csv')
book_tags = pd.read_csv('dataset/book_tags.csv')
tags = pd.read_csv('dataset/tags.csv')

In [5]:
print(ratings.shape)
print(books.shape)
print(book_tags.shape)
print(tags.shape)

(5976479, 3)
(10000, 23)
(999912, 3)
(34252, 2)


# Collaborative Filtering

* Each user $j$ has a rating vector $y^{(j)}$, where each element is that user's rating for a product (book).
* We want to predict the rating a user would give to a product that the user has not yet rated.

**Objective**:

For a user $j$ with parameters $\theta^{(j)}$ and a book $i$ with (learned) features $x^{(i)}$, predict a rating of $(\theta^{(j)})^Tx^{(i)}$ stars.
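
As a rough illustration of the objective above, the sketch below learns book features and user parameters jointly by gradient descent on a tiny synthetic rating matrix. The data, the function name `fit_collaborative_filtering`, and all hyperparameters are assumptions made for illustration; none of them come from the Goodreads data or from the code later in this notebook. For brevity the sketch omits a bias term.

```python
import numpy as np

def fit_collaborative_filtering(Y, R, n_features=2, lam=0.1, alpha=0.01,
                                num_iters=2000, seed=0):
    """Jointly learn book features X and user parameters Theta (hypothetical helper).

    Y: (n_books, n_users) ratings; values where R == 0 are ignored
    R: (n_books, n_users) indicator, 1 where a rating exists
    """
    rng = np.random.default_rng(seed)
    n_books, n_users = Y.shape
    # Small random initialization breaks the symmetry between features
    X = rng.normal(scale=0.1, size=(n_books, n_features))
    Theta = rng.normal(scale=0.1, size=(n_users, n_features))
    history = []
    for _ in range(num_iters):
        err = (X @ Theta.T - Y) * R  # errors on observed ratings only
        cost = 0.5 * np.sum(err**2) + 0.5 * lam * (np.sum(X**2) + np.sum(Theta**2))
        history.append(cost)
        # Compute both gradients before updating either factor
        X_grad = err @ Theta + lam * X
        Theta_grad = err.T @ X + lam * Theta
        X -= alpha * X_grad
        Theta -= alpha * Theta_grad
    return X, Theta, history

# Synthetic ratings: rows are books, columns are users, 0 marks "not rated"
Y = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [0, 1, 5, 4],
              [1, 0, 4, 5]], dtype=float)
R = (Y > 0).astype(float)

X, Theta, history = fit_collaborative_filtering(Y, R)
predicted = X @ Theta.T  # predicted rating for every (book, user) pair
```

The entries of `predicted` where `R == 0` are the model's guesses for ratings the users have not given yet.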

# Content Based

* Each product (in this case a book) $i$ has its own feature vector $x^{(i)}$.
* Each user $j$ has a rating vector $y^{(j)}$, where each element is that user's rating for a product (book).
* We want to predict the rating a user would give to a product that the user has not yet rated.

**Objective**:

For each user $j$, learn a parameter vector $\theta^{(j)} \in \mathbb{R}^{n+1}$, where $n$ is the number of book features. Predict that user $j$ rates book $i$ with $(\theta^{(j)})^Tx^{(i)}$ stars.
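
Written out, a standard form of this objective is regularized least squares over the books that user $j$ has rated. The indicator $r(i,j)$ (1 if user $j$ rated book $i$, 0 otherwise), the scalar rating $y^{(i,j)}$, the regularization weight $C$, and the learning rate $\alpha$ are notation assumed here for illustration. By convention the bias parameter $\theta_0^{(j)}$ is left unregularized:

$$\min_{\theta^{(j)}} \; \frac{1}{2} \sum_{i:\, r(i,j)=1} \left( (\theta^{(j)})^T x^{(i)} - y^{(i,j)} \right)^2 + \frac{C}{2} \sum_{k=1}^{n} \left( \theta_k^{(j)} \right)^2$$

Under the same assumptions, one gradient descent step for $k \geq 1$ is

$$\theta_k^{(j)} := \theta_k^{(j)} - \alpha \left( \sum_{i:\, r(i,j)=1} \left( (\theta^{(j)})^T x^{(i)} - y^{(i,j)} \right) x_k^{(i)} + C\,\theta_k^{(j)} \right)$$

and the update for $k = 0$ is the same without the $C\,\theta_k^{(j)}$ term.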

In [495]:
# Import necessary packages
from sklearn.preprocessing import MinMaxScaler

# Function to compute cost function
def compute_cost_cb(X, y, r, theta, C=1000):
    """
    Compute the regularized squared-error cost of a linear regression model
    with parameters theta and regularization parameter C.

    X: feature matrix with a leading bias column
    y: ratings (0 where unrated)
    r: indicator, 1 if the user already rated the book
    """
    y_r = y * r  # unrated entries become 0, so they act as zero-valued targets
    predictions = X.dot(theta)
    regularizations = C * np.sum(theta[1:]**2)  # bias term theta[0] is not regularized
    square_err = (predictions - y_r)**2
    
    return 0.5 * (np.sum(square_err) + regularizations)

# Gradient descent function
def gradient_descent(X, y, r, theta, C=1000, alpha=0.01, num_iters=1000):
    """
    Update theta by taking num_iters gradient steps with learning rate alpha
    and regularization parameter C. Note that theta is modified in place.
    
    Return theta and the list of costs recorded after each iteration.
    """
    y_r = y * r  # unrated entries become 0 and are used as zero-valued targets
    J_history = []
    
    for i in range(num_iters):
        predictions = X.dot(theta)
        # Gradient of the squared-error term
        error = np.dot(X.transpose(), (predictions - y_r))
        # Add the regularization gradient, skipping the bias term theta[0]
        error[1:] = error[1:] + C * theta[1:]
        descent = alpha * error
        theta -= descent
        # Pass C through so the recorded cost uses the same regularization as the update
        J_history.append(compute_cost_cb(X, y, r, theta, C))
    
    return theta, J_history
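
The functions above multiply `y` by `r`, so unrated books enter the loss as targets of 0 rather than being left out. One possible alternative, sketched below, multiplies the residual by `r` instead so that only rated books contribute. The names `compute_cost_masked` and `gradient_descent_masked` and the toy data are hypothetical; this is a sketch of a common formulation, not the notebook's own method.

```python
import numpy as np

def compute_cost_masked(X, y, r, theta, C=1000):
    """Regularized squared error summed over rated books only (r == 1)."""
    residual = (X.dot(theta) - y) * r  # unrated books contribute zero error
    return 0.5 * (np.sum(residual**2) + C * np.sum(theta[1:]**2))

def gradient_descent_masked(X, y, r, theta, C=1000, alpha=0.01, num_iters=1000):
    """Gradient descent on the masked cost; returns a new theta and the cost history."""
    theta = np.asarray(theta, dtype=float).copy()  # leave the caller's array untouched
    J_history = []
    for _ in range(num_iters):
        residual = (X.dot(theta) - y) * r
        grad = X.T.dot(residual)
        grad[1:] += C * theta[1:]  # bias term theta[0] is not regularized
        theta -= alpha * grad
        J_history.append(compute_cost_masked(X, y, r, theta, C))
    return theta, J_history

# Toy data (hypothetical): bias column plus one feature; the last book is unrated
X_toy = np.array([[1.0, 0.0], [1.0, 0.5], [1.0, 1.0], [1.0, 0.25]])
y_toy = np.array([3.0, 4.0, 5.0, 0.0])
r_toy = np.array([1, 1, 1, 0])

theta_fit, J = gradient_descent_masked(X_toy, y_toy, r_toy, np.zeros(2),
                                       C=0.0, alpha=0.1, num_iters=5000)
pred_unrated = X_toy[3].dot(theta_fit)  # prediction for the unrated book
```

With regularization switched off, the three rated toy books lie exactly on the line `3 + 2 * x`, so the fit recovers those parameters and the unrated book's placeholder 0 has no influence on the result.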

In [496]:
# Sample only the first 100 books (by book_id) to keep the computation manageable
sample_book_id = np.arange(100) + 1
Y = ratings[ratings['book_id'].isin(sample_book_id)].pivot(index='book_id', columns='user_id', values='rating')
Y.head()

book_id \ user_id,1,2,3,4,5,6,7,8,9,10,...,53414,53415,53417,53418,53419,53420,53421,53422,53423,53424
1,,,,,,,,,4.0,,...,3.0,,4.0,5.0,4.0,4.0,4.0,4.0,4.0,4.0
2,,5.0,,5.0,,,,,4.0,,...,4.0,,,,5.0,5.0,5.0,5.0,5.0,5.0
3,,,,,,,,,4.0,,...,2.0,,,,3.0,3.0,,,,4.0
4,5.0,,3.0,4.0,,,,3.0,,5.0,...,,,,,3.0,,5.0,,5.0,5.0
5,,5.0,,4.0,,,3.0,3.0,5.0,5.0,...,,,,,3.0,2.0,4.0,,,
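
To make the reshaping step concrete, here is a minimal sketch of the same `pivot` call on a hand-made ratings table. The toy values and the names `toy_ratings`, `Y_toy`, and `R_toy` are assumptions for illustration only.

```python
import pandas as pd

# Long-format ratings: one row per (user, book) pair, as in ratings.csv
toy_ratings = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3],
    "book_id": [1, 2, 1, 2, 3],
    "rating":  [5, 3, 4, 2, 5],
})

# Wide format: one row per book, one column per user, NaN where no rating exists
Y_toy = toy_ratings.pivot(index="book_id", columns="user_id", values="rating")

# Boolean mask of observed ratings
R_toy = Y_toy.notnull()
```

Each missing (book, user) combination becomes `NaN`, which is why most cells in the output above are empty.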


In [497]:
# Sample books and select relevant features, then perform MinMaxScaler on the features
X = books.head(len(sample_book_id))[['books_count', 'ratings_count', 'average_rating', 'work_ratings_count']]
scaler = MinMaxScaler()
scaler.fit(X)
X = scaler.transform(X)
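
For reference, with its default `feature_range=(0, 1)`, `MinMaxScaler` rescales each column to `(x - min) / (max - min)`. The NumPy sketch below reproduces that formula on made-up numbers; `min_max_scale` and the sample values are hypothetical and are not part of scikit-learn or this notebook.

```python
import numpy as np

def min_max_scale(X):
    """Rescale each column of X to the range [0, 1]."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # avoid division by zero for constant columns
    return (X - col_min) / col_range

# Two made-up features on very different scales
features = np.array([[10.0, 3.5],
                     [20.0, 4.0],
                     [40.0, 4.5]])
scaled = min_max_scale(features)
```

Putting the features on a common scale keeps a large-valued column such as `ratings_count` from dominating the gradient steps.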

Consider only one user, with ID 53412.

In [498]:
# Initialization
y = Y[53412]
r = (~y.isnull()).values # boolean mask: True if the user already rated the book
y = y.fillna(0).values
theta = np.random.rand(X.shape[1]+1)

# Prepend a bias column of ones to the feature matrix
X = np.hstack([np.ones((X.shape[0], 1)), X])

Learn the parameter vector for user with ID 53412.

In [499]:
theta, J_history = gradient_descent(X, y, r, theta, 10, 0.01, 1000)

Generate predictions for unrated books.

In [500]:
preds = np.dot(theta, X.T) * ~r  # set predictions for already-rated books to 0

Sort the recommendations by predicted rating, highest first. Show only the top 20 recommendations.

In [501]:
show = books.head(len(preds))[['book_id', 'title']].copy()
show['predicted_rating'] = preds
show.sort_values(by='predicted_rating', ascending=False).head(20)

index,book_id,title,predicted_rating
0,1,"The Hunger Games (The Hunger Games, #1)",2.517208
5,6,The Fault in Our Stars,1.642309
42,43,Jane Eyre,1.565359
11,12,"Divergent (Divergent, #1)",1.512678
14,15,The Diary of a Young Girl,1.506652
16,17,"Catching Fire (The Hunger Games, #2)",1.48735
10,11,The Kite Runner,1.459469
30,31,The Help,1.439627
28,29,Romeo and Juliet,1.41912
13,14,Animal Farm,1.406018
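
The ranking step can also be sketched in plain NumPy. The variant below drops already-rated books from the candidate set instead of setting their predictions to 0, which stays correct even if some predictions are negative. The helper `top_k_unrated` and the toy arrays are hypothetical.

```python
import numpy as np

def top_k_unrated(predictions, rated_mask, k=3):
    """Return indices of the k highest-predicted items the user has not rated."""
    predictions = np.asarray(predictions, dtype=float)
    rated_mask = np.asarray(rated_mask, dtype=bool)
    candidates = np.flatnonzero(~rated_mask)  # positions of unrated items
    order = np.argsort(-predictions[candidates], kind="stable")  # descending by prediction
    return candidates[order[:k]]

toy_preds = np.array([4.2, 3.1, 4.8, 2.0, 4.5])
toy_rated = np.array([True, False, False, False, True])
top = top_k_unrated(toy_preds, toy_rated, k=2)
```

Here the items at positions 0 and 4 are skipped because the user has already rated them, even though their predicted ratings are high.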


# Hybrid