<span style="font-size:36px"><b>Recommender System</b></span>

Copyright 2019 Gunawan Lumban Gaol, Mike Bratanata

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

# Import Packages

In [1]:
import numpy as np
import pandas as pd

# Import Data

The data comes from goodbooks-10k (https://github.com/zygmuntz/goodbooks-10k), a recent version of the Goodreads book dataset.

In [2]:
# ratings = pd.read_csv("https://raw.githubusercontent.com/zygmuntz/goodbooks-10k/master/ratings.csv", encoding="latin1")
# books = pd.read_csv("https://raw.githubusercontent.com/zygmuntz/goodbooks-10k/master/books.csv", encoding="latin1")
# book_tags = pd.read_csv("https://raw.githubusercontent.com/zygmuntz/goodbooks-10k/master/book_tags.csv", encoding="latin1")
# tags = pd.read_csv("https://raw.githubusercontent.com/zygmuntz/goodbooks-10k/master/tags.csv", encoding="latin1")

Save the data files locally.

In [3]:
# ratings.to_csv('dataset/ratings.csv', header=True, index=False, line_terminator='\n', sep=',')
# books.to_csv('dataset/books.csv', header=True, index=False, line_terminator='\n', sep=',')
# book_tags.to_csv('dataset/book_tags.csv', header=True, index=False, line_terminator='\n', sep=',')
# tags.to_csv('dataset/tags.csv', header=True, index=False, line_terminator='\n', sep=',')

Load saved dataset.

In [4]:
ratings = pd.read_csv('dataset/ratings.csv')
books = pd.read_csv('dataset/books.csv')
book_tags = pd.read_csv('dataset/book_tags.csv')
tags = pd.read_csv('dataset/tags.csv')

In [5]:
print(ratings.shape)
print(books.shape)
print(book_tags.shape)
print(tags.shape)

(5976479, 3)
(10000, 23)
(999912, 3)
(34252, 2)


# Collaborative Filtering

* Each user $j$ has a rating vector $y^{(j)}$, where each element is that user's rating for one product (book).
* We want to predict the rating a user would give to a product that the user has not rated yet.

**Objective**:

For a user $j$ with parameters $\theta^{(j)}$ and a book with (learned) features $x^{(i)}$, predict a star rating of $(\theta^{(j)})^Tx^{(i)}$ stars.
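The objective above corresponds to minimizing a regularized squared error over the observed ratings only. The following is a sketch of that cost, written to match what the `cost` function in this notebook computes. The symbols $r(i,j)$ (equal to 1 when user $j$ has rated book $i$) and $\lambda$ (the regularization strength, passed to `cost` as `learning_rate`) are notation introduced here, and $y^{(j)}_i$ denotes element $i$ of the rating vector $y^{(j)}$.

$$
J = \frac{1}{2} \sum_{(i,j)\,:\,r(i,j)=1} \left( (\theta^{(j)})^T x^{(i)} - y^{(j)}_i \right)^2
  + \frac{\lambda}{2} \sum_{j} \lVert \theta^{(j)} \rVert^2
  + \frac{\lambda}{2} \sum_{i} \lVert x^{(i)} \rVert^2
$$

$$
\frac{\partial J}{\partial x^{(i)}} = \sum_{j\,:\,r(i,j)=1} \left( (\theta^{(j)})^T x^{(i)} - y^{(j)}_i \right) \theta^{(j)} + \lambda x^{(i)},
\qquad
\frac{\partial J}{\partial \theta^{(j)}} = \sum_{i\,:\,r(i,j)=1} \left( (\theta^{(j)})^T x^{(i)} - y^{(j)}_i \right) x^{(i)} + \lambda \theta^{(j)}
$$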

In [663]:
def cost(params, Y, R, num_features, learning_rate):
    """
    Regularized collaborative-filtering cost and its gradient.

    params holds the unrolled X (book features) followed by Theta (user
    parameters). Y is the books x users rating matrix and R the matching
    rated/unrated mask. Note that `learning_rate` is used here as the
    regularization strength, not as an optimizer step size.
    """
    Y = np.matrix(Y)
    R = np.matrix(R)
    num_movies = Y.shape[0]
    num_users = Y.shape[1]

    # reshape the parameter array into parameter matrices
    X = np.matrix(np.reshape(params[:num_movies * num_features], (num_movies, num_features)))
    Theta = np.matrix(np.reshape(params[num_movies * num_features:], (num_users, num_features)))

    # compute the cost over rated entries only (R zeroes out the rest)
    error = np.multiply((X * Theta.T) - Y, R)
    squared_error = np.power(error, 2)
    J = (1. / 2) * np.sum(squared_error)

    # add the cost regularization
    J = J + ((learning_rate / 2) * np.sum(np.power(Theta, 2)))
    J = J + ((learning_rate / 2) * np.sum(np.power(X, 2)))

    # calculate the gradients with regularization
    X_grad = (error * Theta) + (learning_rate * X)
    Theta_grad = (error.T * X) + (learning_rate * Theta)

    # unravel the gradient matrices into a single array
    grad = np.concatenate((np.ravel(X_grad), np.ravel(Theta_grad)))

    return J, grad
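One way to gain confidence in the gradients returned by `cost` is a finite-difference check on a tiny random problem. The sketch below re-implements the same cost with plain arrays so that it is self-contained; the names `cf_cost`, `numerical_gradient`, `lam` and the toy arrays are hypothetical and not part of the notebook.

```python
import numpy as np

def cf_cost(params, Y, R, num_features, lam):
    """Regularized collaborative-filtering cost and gradient (ndarray sketch)."""
    num_items, num_users = Y.shape
    X = params[:num_items * num_features].reshape(num_items, num_features)
    Theta = params[num_items * num_features:].reshape(num_users, num_features)
    error = (X @ Theta.T - Y) * R  # zero out unrated entries
    J = 0.5 * np.sum(error ** 2) + 0.5 * lam * (np.sum(Theta ** 2) + np.sum(X ** 2))
    X_grad = error @ Theta + lam * X
    Theta_grad = error.T @ X + lam * Theta
    return J, np.concatenate((X_grad.ravel(), Theta_grad.ravel()))

def numerical_gradient(params, Y, R, num_features, lam, eps=1e-5):
    """Central-difference approximation of the gradient of cf_cost."""
    grad = np.zeros_like(params)
    for k in range(params.size):
        step = np.zeros_like(params)
        step[k] = eps
        J_plus, _ = cf_cost(params + step, Y, R, num_features, lam)
        J_minus, _ = cf_cost(params - step, Y, R, num_features, lam)
        grad[k] = (J_plus - J_minus) / (2 * eps)
    return grad

# Toy problem: 4 items, 5 users, 3 latent features (all values are made up)
rng = np.random.default_rng(0)
Y_toy = rng.integers(1, 6, size=(4, 5)).astype(float)
R_toy = rng.random((4, 5)) > 0.4
params_toy = rng.standard_normal(4 * 3 + 5 * 3)

_, analytic = cf_cost(params_toy, Y_toy, R_toy, 3, 1.5)
numeric = numerical_gradient(params_toy, Y_toy, R_toy, 3, 1.5)
max_abs_diff = np.max(np.abs(analytic - numeric))
print(max_abs_diff)
```

A very small `max_abs_diff` suggests the analytic gradient agrees with the cost; a large one usually points to a sign or transpose mistake.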

In [661]:
# Sample only 100 books
sample_book_id = np.arange(100) + 1
Y = ratings[ratings['book_id'].isin(sample_book_id)].pivot(index='book_id', columns='user_id', values='rating')
R = (~Y.isnull()).values
Y = Y.fillna(0).values
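The pivot step above turns the long ratings table into a books x users matrix. A minimal sketch on a made-up table (`toy_ratings` and the other `*_toy` names are hypothetical) shows what `Y` and `R` look like, and that the column position of a user is not necessarily `user_id - 1` when IDs have gaps.

```python
import numpy as np
import pandas as pd

# Made-up ratings in the same long format as ratings.csv
toy_ratings = pd.DataFrame({
    'user_id': [1, 1, 2, 4, 4],
    'book_id': [1, 2, 2, 1, 3],
    'rating':  [5, 3, 4, 2, 1],
})

# Rows are books, columns are users, missing ratings become NaN
Y_toy = toy_ratings.pivot(index='book_id', columns='user_id', values='rating')
R_toy = Y_toy.notna().values          # True where a rating exists
Y_filled = Y_toy.fillna(0).values     # unrated entries stored as 0
user_ids = Y_toy.columns.to_numpy()   # column order of the matrix

print(Y_toy)
print(user_ids)
```

Here user 4 ends up in column position 2 because no user 3 exists in the table.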

In [665]:
from scipy.optimize import minimize

movies = Y.shape[0]  
users = Y.shape[1]

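features = 10  # number of latent features per book and per user
learning_rate = 10  # passed to cost() as the regularization strength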

X = np.random.random(size=(movies, features))  
Theta = np.random.random(size=(users, features))  
params = np.concatenate((np.ravel(X), np.ravel(Theta)))

Ymean = np.zeros((movies, 1))  
Ynorm = np.zeros((movies, users))

for i in range(movies):  
    idx = np.where(R[i,:] == 1)[0]
    Ymean[i] = Y[i,idx].mean()
    Ynorm[i,idx] = Y[i,idx] - Ymean[i]
    
fmin = minimize(fun=cost, x0=params, args=(Ynorm, R, features, learning_rate),  
                method='CG', jac=True, options={'maxiter': 100})
fmin

     fun: 206894.66174496856
     jac: array([-1.86667342,  0.63601453, -1.37675635, ...,  0.08356306,
       -0.1379003 ,  0.00265521])
 message: 'Maximum number of iterations has been exceeded.'
    nfev: 157
     nit: 100
    njev: 157
  status: 1
 success: False
       x: array([0.2305839 , 0.2874662 , 2.94729197, ..., 0.05866797, 0.02279597,
       0.15203869])
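The loop before the optimizer call subtracts each book's mean rating, computed over rated entries only, so that the model is fit on deviations from the mean. A loop-free sketch of the same idea on a toy matrix follows; the `*_toy` names are hypothetical, and unlike the loop above this version assigns a mean of 0 to a book with no ratings (an assumption made here to avoid dividing by zero).

```python
import numpy as np

# Toy books x users matrix, 0 marks "not rated"
Y_toy = np.array([[5.0, 0.0, 2.0],
                  [3.0, 4.0, 0.0],
                  [0.0, 0.0, 1.0]])
R_toy = Y_toy > 0

# Mean over rated entries of each row
counts = R_toy.sum(axis=1, keepdims=True)
Ymean_toy = (Y_toy * R_toy).sum(axis=1, keepdims=True) / np.maximum(counts, 1)

# Subtract the mean on rated entries, keep unrated entries at 0
Ynorm_toy = (Y_toy - Ymean_toy) * R_toy

print(Ymean_toy.ravel())
print(Ynorm_toy)
```

Adding `Ymean` back to the predictions, as done in the cells that follow, returns them to the original star scale.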

Reshape the unrolled solution vector back into the `X` and `Theta` matrices.

In [666]:
X = np.matrix(np.reshape(fmin.x[:movies * features], (movies, features)))  
Theta = np.matrix(np.reshape(fmin.x[movies * features:], (users, features)))

Generate predictions for a single user, the one with ID 1 (column 0 of the rating matrix).

In [667]:
predictions = X * Theta.T  
preds_cf = predictions[:, 0] + Ymean

In [668]:
preds_cf = np.array(preds_cf.ravel())[0] * ~R[:, 0]

Sort the recommendations by predicted rating, highest first. Show only the top 20 recommendations.

In [669]:
show = books.head(len(preds_cf))[['book_id', 'title']].copy()
show['predicted_rating'] = preds_cf
show.sort_values(by='predicted_rating', ascending=False).head(20)

| | book_id | title | predicted_rating |
|---|---|---|---|
| 82 | 83 | A Tale of Two Cities | 4.796 |
| 27 | 28 | Lord of the Flies | 4.200562 |
| 38 | 39 | A Game of Thrones (A Song of Ice and Fire, #1) | 4.189732 |
| 13 | 14 | Animal Farm | 4.182049 |
| 57 | 58 | The Adventures of Huckleberry Finn | 4.177167 |
| 24 | 25 | Harry Potter and the Deathly Hallows (Harry Po... | 4.146779 |
| 86 | 87 | Night (The Night Trilogy #1) | 4.114651 |
| 58 | 59 | Charlotte's Web | 4.070335 |
| 17 | 18 | Harry Potter and the Prisoner of Azkaban (Harr... | 4.059403 |
| 14 | 15 | The Diary of a Young Girl | 4.05855 |


# Content Based

* Each product (in this case a book) has its own feature vector $x^{(i)}$.
* Each user $j$ has a rating vector $y^{(j)}$, where each element is that user's rating for one product (book).
* We want to predict the rating a user would give to a product that the user has not rated yet.

**Objective**:

For each user $j$, learn a parameter vector $\theta^{(j)} \in \mathbb{R}^{n+1}$, where $n$ is the number of book features and the extra entry is the bias term. Predict that user $j$ rates book $i$ with $(\theta^{(j)})^Tx^{(i)}$ stars.

In [592]:
# Import necessary packages
from sklearn.preprocessing import MinMaxScaler

# Function to compute cost function
def compute_cost_cb(X, y, r, theta, C=1000):
    """
    Return the regularized squared-error cost of the linear model X.dot(theta).

    X is the feature matrix (with bias column), y the rating vector, r the
    rated/unrated mask, theta the parameter vector and C the regularization
    strength. Only y is masked by r, so unrated books act as targets of 0.
    The bias term theta[0] is not regularized.
    """
    y_r = y * r
    predictions = X.dot(theta)
    regularizations = C * np.sum(theta[1:]**2)
    square_err = (predictions - y_r)**2

    return 0.5 * (np.sum(square_err) + regularizations)

# Gradient descent function
def gradient_descent(X, y, r, theta, C=1000, alpha=0.01, num_iters=1000):
    """
    Update theta by taking num_iters gradient steps with learning rate alpha
    and regularization strength C.

    Return theta and the list of cost values after each iteration.
    """
    y_r = y * r
    J_history = []

    for i in range(num_iters):
        predictions = X.dot(theta)
        error = np.dot(X.transpose(), (predictions - y_r))
        error[1:] = error[1:] + C * theta[1:]
        descent = alpha * error
        theta -= descent
        # pass C through so the recorded cost matches the optimized objective
        J_history.append(compute_cost_cb(X, y, r, theta, C))

    return theta, J_history
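In the functions above the mask `r` is applied to `y` only, so every unrated book contributes a target rating of 0 to the error. A possible variant, shown here only as a sketch and not as the notebook's method, applies the mask to the residual so that just the rated books drive the fit. The names `cost_masked`, `gradient_descent_masked` and the toy data are hypothetical.

```python
import numpy as np

def cost_masked(X, y, r, theta, C):
    """Squared error over rated books only, plus L2 penalty on non-bias terms."""
    residual = (X.dot(theta) - y) * r
    return 0.5 * (np.sum(residual ** 2) + C * np.sum(theta[1:] ** 2))

def gradient_descent_masked(X, y, r, theta, C=10.0, alpha=0.01, num_iters=1000):
    """Gradient descent where unrated books are excluded from the error."""
    theta = np.asarray(theta, dtype=float).copy()  # do not modify the caller's array
    J_history = []
    for _ in range(num_iters):
        residual = (X.dot(theta) - y) * r   # zero for unrated books
        grad = X.T.dot(residual)
        grad[1:] += C * theta[1:]           # bias term is not regularized
        theta -= alpha * grad
        J_history.append(cost_masked(X, y, r, theta, C))
    return theta, J_history

# Toy data: bias column plus one feature; only the first two books are rated
X_toy = np.array([[1.0, 0.0],
                  [1.0, 1.0],
                  [1.0, 0.5],
                  [1.0, 0.2]])
y_toy = np.array([2.0, 4.0, 0.0, 0.0])
r_toy = np.array([True, True, False, False])

theta_toy, J_toy = gradient_descent_masked(
    X_toy, y_toy, r_toy, np.zeros(2), C=0.0, alpha=0.1, num_iters=5000)
preds_toy = X_toy.dot(theta_toy)
print(theta_toy, preds_toy)
```

With no regularization the two rated books are fit exactly, and the unrated books receive predictions on the same star scale instead of being pulled toward 0.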

In [593]:
# Sample only 100 books
sample_book_id = np.arange(100) + 1
Y = ratings[ratings['book_id'].isin(sample_book_id)].pivot(index='book_id', columns='user_id', values='rating')
Y.head()

| book_id \ user_id | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 53414 | 53415 | 53417 | 53418 | 53419 | 53420 | 53421 | 53422 | 53423 | 53424 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 4.0 | NaN | ... | 3.0 | NaN | 4.0 | 5.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 |
| 2 | NaN | 5.0 | NaN | 5.0 | NaN | NaN | NaN | NaN | 4.0 | NaN | ... | 4.0 | NaN | NaN | NaN | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 |
| 3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 4.0 | NaN | ... | 2.0 | NaN | NaN | NaN | 3.0 | 3.0 | NaN | NaN | NaN | 4.0 |
| 4 | 5.0 | NaN | 3.0 | 4.0 | NaN | NaN | NaN | 3.0 | NaN | 5.0 | ... | NaN | NaN | NaN | NaN | 3.0 | NaN | 5.0 | NaN | 5.0 | 5.0 |
| 5 | NaN | 5.0 | NaN | 4.0 | NaN | NaN | 3.0 | 3.0 | 5.0 | 5.0 | ... | NaN | NaN | NaN | NaN | 3.0 | 2.0 | 4.0 | NaN | NaN | NaN |


For now, use the numerical features readily available in the `books` dataframe. More relevant features could be added from the `book_tags` dataframe to describe the content of each book.

In [594]:
# Sample books and select relevant features, then perform MinMaxScaler on the features
X = books.head(len(sample_book_id))[['books_count', 'ratings_count', 'average_rating', 'work_ratings_count']]
scaler = MinMaxScaler()
scaler.fit(X)
X = scaler.transform(X)

Consider only one user, with ID 1.

In [595]:
# Initialization
y = Y[1]
r = (~y.isnull()).values # list of 1 and 0, 1 if user already rated the book
y = y.fillna(0).values
theta = np.random.rand(X.shape[1]+1)

# Adding bias column term
bias = np.ones(shape=(X.shape[0], X.shape[1]+1))
bias[:, 1:] = X
X = bias
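The scaling and bias-column steps above can be illustrated with plain NumPy. With its default settings, `MinMaxScaler` maps each column to the range 0 to 1 using `(x - min) / (max - min)`; the sketch below applies that formula by hand to made-up values (`features_toy` and the other names are hypothetical) and then prepends the column of ones that pairs with the bias parameter.

```python
import numpy as np

# Made-up feature matrix: 3 books, 2 numerical features
features_toy = np.array([[10.0, 3.5],
                         [20.0, 4.5],
                         [40.0, 4.0]])

# Column-wise min-max scaling to [0, 1]
col_min = features_toy.min(axis=0)
col_max = features_toy.max(axis=0)
scaled_toy = (features_toy - col_min) / (col_max - col_min)

# Prepend a column of ones so that theta[0] acts as the bias term
X_with_bias = np.hstack([np.ones((scaled_toy.shape[0], 1)), scaled_toy])

print(X_with_bias)
```

Scaling keeps features with large raw values, such as rating counts, from dominating the gradient steps.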

Learn the parameter vector for the user with ID 1.

In [596]:
theta, J_history = gradient_descent(X, y, r, theta, 10, 0.01, 1000)

Generate predictions for unrated books.

In [597]:
preds_cb = np.dot(theta, X.T) * ~r

Sort the recommendations by predicted rating, highest first. Show only the top 20 recommendations.

In [598]:
show = books.head(len(preds_cb))[['book_id', 'title']].copy()
show['predicted_rating'] = preds_cb
show.sort_values(by='predicted_rating', ascending=False).head(20)

| | book_id | title | predicted_rating |
|---|---|---|---|
| 79 | 80 | The Little Prince | 1.336455 |
| 24 | 25 | Harry Potter and the Deathly Hallows (Harry Po... | 1.321568 |
| 26 | 27 | Harry Potter and the Half-Blood Prince (Harry ... | 1.298624 |
| 23 | 24 | Harry Potter and the Goblet of Fire (Harry Pot... | 1.294921 |
| 17 | 18 | Harry Potter and the Prisoner of Azkaban (Harr... | 1.293565 |
| 38 | 39 | A Game of Thrones (A Song of Ice and Fire, #1) | 1.268881 |
| 20 | 21 | Harry Potter and the Order of the Phoenix (Har... | 1.267728 |
| 96 | 97 | Dracula | 1.261945 |
| 92 | 93 | The Secret Garden | 1.258469 |
| 86 | 87 | Night (The Night Trilogy #1) | 1.244056 |


# Hybrid

Combine the predicted ratings from the CF model and the CB model, then sort the recommendations by the average of the two scores.

In [696]:
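def recommend_cf(User_ID, topn=20, show=True):
    # Locate the user's column in the pivoted rating matrix. The pivot orders
    # columns by user_id and the IDs have gaps (e.g. 53415 is followed by 53417),
    # so User_ID - 1 is only correct while no earlier ID is missing.
    user_ids = np.sort(ratings.loc[ratings['book_id'].isin(np.arange(100) + 1), 'user_id'].unique())
    col = int(np.searchsorted(user_ids, User_ID))
    if col == len(user_ids) or user_ids[col] != User_ID:
        raise KeyError(f"user {User_ID} has no ratings for the sampled books")

    X = np.matrix(np.reshape(fmin.x[:movies * features], (movies, features)))
    Theta = np.matrix(np.reshape(fmin.x[movies * features:], (users, features)))

    predictions = X * Theta.T
    preds_cf = predictions[:, col] + Ymean
    # zero out books the user has already rated
    preds_cf = np.array(preds_cf.ravel())[0] * ~R[:, col]

    if show:
        table = books.head(len(preds_cf))[['book_id', 'title']].copy()
        table['predicted_rating'] = preds_cf
        display(table.sort_values(by='predicted_rating', ascending=False).head(topn))

    return preds_cf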

def recommend_cb(User_ID, topn=20, show=True):
    sample_book_id = np.arange(100) + 1
    Y = ratings[ratings['book_id'].isin(sample_book_id)].pivot(index='book_id', columns='user_id', values='rating')
    
    X = books.head(len(sample_book_id))[['books_count', 'ratings_count', 'average_rating', 'work_ratings_count']]
    scaler = MinMaxScaler()
    scaler.fit(X)
    X = scaler.transform(X)
    
    # Initialization
    y = Y[User_ID]
    r = (~y.isnull()).values # list of 1 and 0, 1 if user already rated the book
    y = y.fillna(0).values
    theta = np.random.rand(X.shape[1]+1)

    # Adding bias column term
    bias = np.ones(shape=(X.shape[0], X.shape[1]+1))
    bias[:, 1:] = X
    X = bias
    
    theta, J_history = gradient_descent(X, y, r, theta, 10, 0.01, 1000)
    
    preds_cb = np.dot(theta, X.T) * ~r
    
    if show:
        table = books.head(len(preds_cb))[['book_id', 'title']].copy()
        table['predicted_rating'] = preds_cb
        display(table.sort_values(by='predicted_rating', ascending=False).head(topn))
    
    return preds_cb

def recommend_hybrid(User_ID, topn=20):
    preds_cf = recommend_cf(User_ID, topn, show=False)
    preds_cb = recommend_cb(User_ID, topn, show=False)
    
    preds_hybrid = (preds_cf + preds_cb) / 2
    
    show = books.head(len(preds_hybrid))[['book_id', 'title']].copy()
    show['predicted_rating'] = preds_hybrid
    display(show.sort_values(by='predicted_rating', ascending=False).head(topn))
    
    return preds_hybrid
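The averaging in `recommend_hybrid` can be illustrated on small made-up score arrays (the `*_toy` names are hypothetical). Books the user has already rated carry a score of 0 in both inputs, so they stay at 0 after averaging and fall to the bottom of the ranking.

```python
import numpy as np

# Made-up predictions for 4 books; index 1 is a book the user already rated
preds_cf_toy = np.array([4.8, 0.0, 4.2, 3.9])
preds_cb_toy = np.array([1.1, 0.0, 1.3, 1.4])

# Simple unweighted average of the two models
preds_hybrid_toy = (preds_cf_toy + preds_cb_toy) / 2

# Book positions ordered from highest to lowest hybrid score
ranking = np.argsort(-preds_hybrid_toy)

print(preds_hybrid_toy)
print(ranking)
```

Because the two models can produce scores on different scales, an unweighted average lets the model with the larger values dominate the ordering; a weighted average would be one way to adjust this.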

In [697]:
_ = recommend_cf(1)

| | book_id | title | predicted_rating |
|---|---|---|---|
| 82 | 83 | A Tale of Two Cities | 4.796 |
| 27 | 28 | Lord of the Flies | 4.200562 |
| 38 | 39 | A Game of Thrones (A Song of Ice and Fire, #1) | 4.189732 |
| 13 | 14 | Animal Farm | 4.182049 |
| 57 | 58 | The Adventures of Huckleberry Finn | 4.177167 |
| 24 | 25 | Harry Potter and the Deathly Hallows (Harry Po... | 4.146779 |
| 86 | 87 | Night (The Night Trilogy #1) | 4.114651 |
| 58 | 59 | Charlotte's Web | 4.070335 |
| 17 | 18 | Harry Potter and the Prisoner of Azkaban (Harr... | 4.059403 |
| 14 | 15 | The Diary of a Young Girl | 4.05855 |


In [698]:
_ = recommend_cb(1)

| | book_id | title | predicted_rating |
|---|---|---|---|
| 79 | 80 | The Little Prince | 1.336455 |
| 24 | 25 | Harry Potter and the Deathly Hallows (Harry Po... | 1.321568 |
| 26 | 27 | Harry Potter and the Half-Blood Prince (Harry ... | 1.298624 |
| 23 | 24 | Harry Potter and the Goblet of Fire (Harry Pot... | 1.294921 |
| 17 | 18 | Harry Potter and the Prisoner of Azkaban (Harr... | 1.293565 |
| 38 | 39 | A Game of Thrones (A Song of Ice and Fire, #1) | 1.268881 |
| 20 | 21 | Harry Potter and the Order of the Phoenix (Har... | 1.267728 |
| 96 | 97 | Dracula | 1.261945 |
| 92 | 93 | The Secret Garden | 1.258469 |
| 86 | 87 | Night (The Night Trilogy #1) | 1.244056 |


In [699]:
_ = recommend_hybrid(1)

| | book_id | title | predicted_rating |
|---|---|---|---|
| 82 | 83 | A Tale of Two Cities | 2.941302 |
| 24 | 25 | Harry Potter and the Deathly Hallows (Harry Po... | 2.734174 |
| 38 | 39 | A Game of Thrones (A Song of Ice and Fire, #1) | 2.729306 |
| 57 | 58 | The Adventures of Huckleberry Finn | 2.680265 |
| 86 | 87 | Night (The Night Trilogy #1) | 2.679353 |
| 17 | 18 | Harry Potter and the Prisoner of Azkaban (Harr... | 2.676484 |
| 26 | 27 | Harry Potter and the Half-Blood Prince (Harry ... | 2.664998 |
| 23 | 24 | Harry Potter and the Goblet of Fire (Harry Pot... | 2.6576 |
| 13 | 14 | Animal Farm | 2.628634 |
| 58 | 59 | Charlotte's Web | 2.622385 |
