<h1 align="center">Matrix Factorization - Alternating Least Squares in Python</h1>

ALS is Matrix Factorization Algorithm. Matrix Factorization decomposes a large matrix into products of matrices.<br>
<br>
R = U * V<br>
<br>
For example in recommendation systems, let us consider R as a matrix of User (Rows) and Ratings (Columns). Matrix factorization will allow us to discover the latent features that define the interactions between User and Ratings. In other words, ALS uncovers the latent features.<br>

<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Imports" data-toc-modified-id="Imports-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Imports</a></span></li><li><span><a href="#Load-the-data" data-toc-modified-id="Load-the-data-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Load the data</a></span></li><li><span><a href="#Data-Exploration" data-toc-modified-id="Data-Exploration-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Data Exploration</a></span></li><li><span><a href="#Alternating-Least-Squares" data-toc-modified-id="Alternating-Least-Squares-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Alternating Least Squares</a></span></li><li><span><a href="#Results" data-toc-modified-id="Results-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Results</a></span></li><li><span><a href="#References" data-toc-modified-id="References-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>References</a></span></li></ul></div>

## Imports

In [1]:
import pickle
import numpy as np
import pandas as pd
from collections import defaultdict

## Load the data
Original Dataset - https://www.kaggle.com/grouplens/movielens-20m-dataset<br>
The original dataset has been proprocessed to filter out and keep only the top users movies.<br>
Please refer to the preprocessing notebook in the repo for more details.

In [2]:
## user_to_movie_map={}  ## Key:= User_id, Value:= [list of movies] 
## movie_to_user_map={}  ## Key:= Movie_id, Value:=[list of users] 
## train_ratings={}      ## Key:= (User_id, Movie_id) Value:=Rating 
## test_ratings={}       ## Key:= (User_id, Movie_id) Value:=Rating 

with open('./data/user_to_movie_map.pkl', 'rb') as fp:
    user_to_movie_map=pickle.load(fp)

with open('./data/movie_to_user_map.pkl', 'rb') as fp:
    movie_to_user_map=pickle.load(fp)

with open('./data/train_ratings.pkl', 'rb') as fp:
    train_ratings=pickle.load(fp)

with open('./data/test_ratings.pkl', 'rb') as fp:
    test_ratings=pickle.load(fp)

with open('./data/user_statistics.pkl', 'rb') as fp:
    user_statistics=pickle.load(fp)
    
with open('./data/movie_statistics.pkl', 'rb') as fp:
    movie_statistics=pickle.load(fp)

## Data Exploration

In [3]:
N_USERS=len(user_to_movie_map)
N_MOVIES=len(movie_to_user_map)
user_ids=range(N_USERS)
movie_ids=range(N_MOVIES)
matrix_size=N_USERS*N_MOVIES
N_RATINGS=len(train_ratings)+len(test_ratings)

print("Number of unique users:",N_USERS)
print("Number of unique movies:",N_MOVIES)
print("Total ratings in the dataset:",N_RATINGS)

print("User-Item matrix size:",matrix_size)
print("User-Item matrix empty percentage:",(matrix_size-N_RATINGS)*100/(matrix_size))

Number of unique users: 300
Number of unique movies: 800
Total ratings in the dataset: 179576
User-Item matrix size: 240000
User-Item matrix empty percentage: 25.176666666666666


## Alternating Least Squares

In [4]:
LF=10  ## No. of latent factors
parameter=0.01
iterations=5

In [5]:
n=LF
user_weights={}
movie_weights={}

for user in user_ids:
    user_weights[user]=np.random.rand(n)
    
for movie in movie_ids:
    movie_weights[movie]=np.random.rand(n)

user_bias=defaultdict(int)
movie_bias=defaultdict(int)

mean=np.mean(list(train_ratings.values()))

print("Length of user and movie weight vectors: ",LF)
print("Mean rating of the train data: ",mean)

Length of user and movie weight vectors:  10
Mean rating of the train data:  3.4642578478772763


In [6]:
for i in range(iterations):
    print("Iteration: {} .........".format(i))
    for user in user_ids:
        A=np.zeros([n,n])
        b=np.zeros([1,n])
        #if user not in user_to_movie_map: continue
        for movie in user_to_movie_map[user]:
            A+=np.outer(movie_weights[movie],movie_weights[movie].T)+parameter*(np.eye(n))
            b+=(train_ratings[(user,movie)]-movie_bias[movie]-user_bias[user]-mean)*movie_weights[movie]

        user_weights[user]=np.linalg.solve(A,b.T).T

    for movie in movie_ids:
        A=np.zeros([n,n])
        b=np.zeros([1,n])
        #if movie not in movie_to_user_map: continue
        for user in movie_to_user_map[movie]:
            A+=np.outer(user_weights[user],user_weights[user].T)+parameter*(np.eye(n))
            b+=(train_ratings[(user,movie)]-movie_bias[movie]-user_bias[user]-mean)*user_weights[user]
        movie_weights[movie]=np.linalg.solve(A,b.T).T
        
    for user in user_ids:
        mlen=len(user_to_movie_map[user])
        movie_bias[movie]=0
        for movie in user_to_movie_map[user]:
            movie_bias[movie]=(1.0/(mlen+parameter))*(train_ratings[(user,movie)]-np.dot(user_weights[user],movie_weights[movie].T)-user_bias[user]-mean)[0]
    
    for movie in movie_ids:
        ulen=len(movie_to_user_map[movie])
        user_bias[user]=0
        for user in movie_to_user_map[movie]:
            user_bias[user]=(1.0/(ulen+parameter))*(train_ratings[(user,movie)]-np.dot(user_weights[user],movie_weights[movie].T)-movie_bias[movie]-mean)[0]
            

Iteration: 0 .........
Iteration: 1 .........
Iteration: 2 .........
Iteration: 3 .........
Iteration: 4 .........


In [7]:
def calculate_MSE(dataset):
    errors=[]
    for (user,movie),rating in dataset.items():
        #if user in user_weights and movie in movie_weights:
        pred=np.dot(user_weights[user],movie_weights[movie].T)[0][0]+movie_bias[movie]+user_bias[user]+mean
        errors.append((pred-rating)**2)

    return np.mean(errors)

## Results

In [8]:
print("Train error: ",calculate_MSE(train_ratings))
print("Test error: ",calculate_MSE(test_ratings))

Train error:  0.46769026515703016
Test error:  0.5784109762930091


## References

1. https://en.wikipedia.org/wiki/Recommender_system <br>
2. https://www.kaggle.com/grouplens/movielens-20m-dataset <br>
3. https://www.udemy.com/recommender-systems/ <br>
4. https://www.quora.com/What-is-the-Alternating-Least-Squares-method-in-recommendation-systems-And-why-does-this-algorithm-work-intuition-behind-this