## Simple Collaborative Filtering model

For this particular model I will be using a dataset which consists of a series of books and their corresponding ratings from a set of users. My model is aimed at essentially learning how different books relate to one another and also allow for the prediction of how users might rate books they havn't already rated. 

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import datasets
from scipy.stats import norm
import latex
import random
import math
import seaborn as sns
from sklearn.model_selection import train_test_split

%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

### Cleaning the data and creating a matrix of user ratings

Rows in the matrix will represent all of the different books while columns in the matrix will represent different users. The item at matrix[book i,user j] = rating(i,j). If there was no rating then this spot will be zero because rating are between 1 - 10 inclusive.

In [2]:
# Number of total users = 2946
# Number of total books = 17384
matrix = np.zeros((17384,2946))
matrix.shape

(17384, 2946)

In [3]:
# File oriented as: user item rating
with open("../data/book_data/book_ratings.dat") as file:
    lines = file.readlines()
    i = 0
    for line in lines:
        if i != 0:
            line = line.strip("\n")
            aline = line.split()
            matrix[int(aline[1])-1,int(aline[0])-1] = int(aline[2].split(".")[0])
        i += 1

#### Preprocessing the data with mean normalization

Before using the matrix for learning, I am going to apply mean normalization. Specifically, I am getting the mean ratings for all books in the matrix and using that value to update the ratings for each book in the matrix.

$$ Matrix = Matrix - \mu $$

I noticed that there are books with no rankings at all so I will be removing them from the original matrix and the mu vector.

In [4]:
num_books = matrix.shape[0]
mu = []
nan_indexes = [] # saving indexes that 

for i in range(num_books):
    ratings = matrix[i,:]
    avg = np.asscalar(ratings[np.nonzero(ratings)].mean())
    if ratings[np.nonzero(ratings)].shape == (0,): # no ratings
        nan_indexes.append(i)
    else:
        mu.append(avg)

In [5]:
mu = np.array(mu)
mu.shape = (14684,1)
print("shape of mu vector:",mu.shape)

shape of mu vector: (14684, 1)


In [6]:
matrix = np.delete(matrix,nan_indexes,axis=0)
print("shape of updated matrix:",matrix.shape)

shape of updated matrix: (14684, 2946)


In [7]:
# Mean normalization:
matrix_norm = matrix
for i in range(matrix_norm.shape[0]): # go through each book
    for j in range(matrix_norm.shape[1]): # go through each user rating for a book
        if matrix_norm[i,j] != 0:
            matrix_norm[i,j] = matrix_norm[i,j] - mu[i]
print(matrix_norm.shape)

(14684, 2946)


### Building collaborative filtering model

In this model I am attempting to learn both Theta, parameter vector for all users, and x, feature vector for each book. Note that I will be using regularization and I am choosing for there to be 4 features to learn.

Cost function: Minimizing x<sup>(1)</sup>,...,x<sup>(n<sub>b</sub>)</sup> and Theta<sup>(1)</sup>,...,Theta<sup>(n<sub>u</sub>)</sup> simultaneously:

$$ \frac{1}{2} \sum_{(i,j): r(i,j)=1} \big( (\theta^{(j)})^T x^{(i)} - y^{(i,j)} \big)^2 
   + \frac{\lambda}{2} \sum^{n_b}_{i=1} \sum^{n}_{k=1} (x^{(i)}_{k})^2 
   + \frac{\lambda}{2} \sum^{n_u}_{i=1} \sum^{n}_{k=1} (\theta^{(j)}_{k})^2 $$ 

Back prop to update x, theta:

$$ x^{(i)}_k := x^{(i)}_k - \alpha \bigg( \sum_{(i,j): r(i,j)=1} \big( (\theta^{(j)})^T x^{(i)} - y^{(i,j)} \big) \theta^{(j)}_k + \lambda x^{(i)}_k \bigg) $$

$$ \theta^{(j)}_k := \theta^{(j)}_k - \alpha \bigg( \sum_{(i,j): r(i,j)=1} \big( (\theta^{(j)})^T x^{(i)} - y^{(i,j)} \big) x^{(i)}_k + \lambda \theta^{(j)}_k \bigg) $$

- n<sub>b</sub>: Number of total books
- n<sub>u</sub>: Number of total users
- r(i,j) = 1: Means there is a rating by user j for book i
- y(i,j): rating by user j for book i
- Theta<sup>(j)</sup>: Parameter vector for user j
- x<sup>(i)</sup>: Parameter vector for book i

In [8]:
# Initializes theta parameters and x features with small randomized values from a uniform distribution
# theta: shape(#users,4), x: shape(#books,4)
def initialize_params(num_users, num_books, num_param):
    x = np.random.normal(0, 0.05, (num_books,num_param))
    theta = np.random.normal(0, 0.05, (num_users,num_param))
    return x,theta

In [9]:
"""
Returns a list of tuples representing the indexes of all nonzero values as (book index, user index)
We need this, as we are only computing the cost given indexes that actually have reviews
"""
def get_nonzero_indexes(matrix):
    indexes = []
    non_zero_i = np.nonzero(matrix)
    book_i = non_zero_i[0]
    user_i = non_zero_i[1]
    
    for i in range(len(book_i)):
        indexes.append((book_i[i],user_i[i]))
    return indexes

In [10]:
"""
To compute the non-regularization term, this cost function only loops through entries that have reviews, this
will ultimately cut down the number of items from 46.6 million to 74k, which is much more computationally reasonable
"""
def cost_function(matrix,x,theta,indexes,lam):
    reg_x = (lam / 2) * np.sum(np.sum(np.square(x),axis=1))
    reg_theta = (lam / 2) * np.sum(np.sum(np.square(theta),axis=1))
    
    cost = 0
    for i in range(len(indexes)):
        x_i = indexes[i][0]
        theta_i = indexes[i][1]
        ax = x[x_i,:] # shape(1,#features)
        atheta = theta[theta_i,:] # shape(1,#features)
        cost += np.square(np.dot(atheta.T,ax)-matrix[x_i,theta_i])
    
    final_cost = (1/2 * cost) + reg_x + reg_theta
    return final_cost

In [11]:
"""
Updates all of the x values, features for books
"""
def update_x(x,new_x,theta,matrix,indexes,alpha,lam,num_params):
    
    for i in range(x.shape[0]): # updating x values one at a time
        ratings = [j for j in indexes if j[0] == i] # ratings for specific x, list of tuples
        asum = np.zeros((num_params,))
        for tup in ratings:
            asum += (np.dot(theta[tup[1],:].T,x[tup[0],:]) - matrix[tup[0],tup[1]]) * theta[tup[1],:] + lam * x[tup[0],:]
        
        new_x[i,:] = new_x[i,:] - (alpha * asum)
    
    return new_x

In [12]:
"""
Updates all theta values, parameters for users
"""
def update_theta(theta,new_theta,x,matrix,indexes,alpha,lam,num_params):
    
    for i in range(theta.shape[0]): # updating theta values one at a time
        ratings = [j for j in indexes if j[1] == i] # ratings for specific theta, list of tuples
        asum = np.zeros((num_params,))
        for tup in ratings:
            asum += (np.dot(theta[tup[1],:].T,x[tup[0],:]) - matrix[tup[0],tup[1]]) * x[tup[0],:] + lam * theta[tup[1],:] 
        
        new_theta[i,:] = new_theta[i,:] - (alpha * asum)
    
    return new_theta

In [13]:
"""
Backprop step - returns updated matrices for x and theta
"""
def update_params(x,theta,matrix,indexes,alpha,lam,num_params):
    new_x = x
    new_theta = theta
    
    new_x = update_x(x,new_x,theta,matrix,indexes,alpha,lam,num_params)
    new_theta = update_theta(theta,new_theta,x,matrix,indexes,alpha,lam,num_params)
    
    return new_x,new_theta

In [14]:
"""
Returns a list of cost values along with the learned parameter matrices x,theta
"""
def model(matrix, alpha=0.1, lam=0.1, num_iter=100, print_cost=True,num_params=4):
    costs = []
    num_users = matrix.shape[1]
    num_books = matrix.shape[0]
    x,theta = initialize_params(num_users,num_books,num_params)
    indexes = get_nonzero_indexes(matrix)
    
    for i in range(num_iter):
        x,theta = update_params(x,theta,matrix,indexes,alpha,lam,num_params) # update params
        
        if i % 1 == 0 and print_cost:
            cost = cost_function(matrix,x,theta,indexes,lam)
            costs.append(cost)
            print("cost at epoch "+str(i+1)+": "+str(cost))
            
    return costs,x,theta 

In [16]:
a_costs,a_x,a_theta = model(matrix_norm, 0.01, 0.01, 100, True, 2)

cost at epoch 1: 64247.53308885036
cost at epoch 2: 64232.74246028733
cost at epoch 3: 64214.04128826246
cost at epoch 4: 64186.826878456144
cost at epoch 5: 64141.13447430443
cost at epoch 6: 64054.349679927065
cost at epoch 7: 63877.58055123876
cost at epoch 8: 63526.68685234158
cost at epoch 9: 62939.2550958964
cost at epoch 10: 62195.67037028232
cost at epoch 11: 61345.214753421555
cost at epoch 12: 60365.83022477897
cost at epoch 13: 59412.53608262942
cost at epoch 14: 58459.947462754106
cost at epoch 15: 57423.91717389084
cost at epoch 16: 56312.12629374273
cost at epoch 17: 55156.11329355991
cost at epoch 18: 53981.238066539125
cost at epoch 19: 52807.87841235516
cost at epoch 20: 51654.13679893898
cost at epoch 21: 50533.239701759994
cost at epoch 22: 49451.26559861989
cost at epoch 23: 48409.00902712231
cost at epoch 24: 47405.60208926375
cost at epoch 25: 46440.78952765072
cost at epoch 26: 45515.35880402278
cost at epoch 27: 44630.645283333346
cost at epoch 28: 43787.9183011

In [17]:
np.save("x.npy",a_x)
np.save("theta.npy",a_theta)

In [18]:
# Shape of learned parameters
print(a_x.shape)
print(a_theta.shape)

(14684, 2)
(2946, 2)


### Using learned parameters to predict user ratings and find books that are similar 
Note that the model hasn't reached a minimum, but that isn't strict necessary to show how user ratings would be predicted or how to find books that are similar to each other.

In [20]:
# Getting predicted user ratings
pred_ratings = np.dot(a_x,a_theta.T)
pred_ratings = pred_ratings + mu # Adding back mean
print(pred_ratings.shape)

(14684, 2946)


In [21]:
pred_ratings

array([[5.92111225, 6.07137104, 6.00398393, ..., 6.00311499, 5.99266823,
        6.00500183],
       [8.36575081, 9.21881347, 8.36455664, ..., 8.34683881, 7.55040565,
        8.3696176 ],
       [9.99430902, 7.64533903, 8.68791821, ..., 8.70120764, 8.84752976,
        8.67198906],
       ...,
       [5.96597223, 5.86992134, 5.99594012, ..., 5.99874368, 6.13504085,
        5.99541569],
       [8.22380668, 8.20254143, 8.00324613, ..., 7.99770885, 7.67283262,
        8.00278421],
       [6.84816081, 6.1314815 , 6.48049454, ..., 6.48538722, 6.57834857,
        6.47568156]])

#### Finding the book most similar to a given book
This is done by comparing the distance of the feature vectors of both books. A small distance indicated that two books are similar.

$$ || x^{(i)} - x^{(j)} || $$

In [27]:
"""
Computes the distance between two vectors of shape (2,)
"""
def distance(vec_a, vec_b):
    return np.sum(np.square(vec_a - vec_b))

In [28]:
# Finding five books most similar to the first book
book_1 = a_x[0,:]
print(book_1.shape)

(2,)


In [29]:
# Finds the most similar book to the first book
smallest_dist = 1000
closest_book = () # (book id, dist value, features)

for i in range(1,a_x.shape[0]):
    temp_features = a_x[i,:]
    dist = distance(book_1, temp_features)
    if dist < smallest_dist:
        smallest_dist = dist
        closest_book = (i,smallest_dist,temp_features)

In [30]:
# Found that book at index 10515 is the closest to the book at index 0
closest_book

(10515, 1.933215741400255e-06, array([-0.0435839, -0.026784 ]))