# WALS on Movielens

This notebook reads in the CSV files produced by the previous notebook and builds a WALS (weighted alternating least squares) model.

A quick note on the WALS model. At its core, the model takes a sparse matrix A and produces two dense matrices U and V such that the [Frobenius norm](https://en.wikipedia.org/wiki/Matrix_norm#Frobenius_norm) of (A - UV) is small. In fact the model does a little more than this: it minimizes

`||\sqrt{W} \odot (A - U V) ||_F^2 + \lambda (||U||_F^2 + ||V||_F^2)`

where `\lambda` is the regularization coefficient and the square root of W is taken element-wise. Let `W_0` be the 'unobserved weight', and let `R_i` and `C_j` be the row and column weights for the input matrix A. Then the weight matrix W has the form `W_{ij} := W_0 + R_i * C_j` if `A_{ij}` is nonzero, and `W_{ij} := W_0` otherwise. The `\odot` operator denotes element-wise multiplication of two matrices of the same dimensions. See the [documentation](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/factorization/python/ops/factorization_ops.py).
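
To make the objective concrete, here is a minimal NumPy sketch that evaluates it on small dense arrays. It assumes, as in the linked TensorFlow docstring, that each weight `W_{ij}` multiplies the squared residual of entry (i, j). The function name `wals_loss` and its argument names are hypothetical and are not part of the TensorFlow API; `V` has shape (dim, ncols), matching the `U V` product above.

```python
import numpy as np

def wals_loss(A, U, V, w0, row_w, col_w, lam):
    """Weighted loss for dense toy inputs (illustration only).

    A: (nrows, ncols) matrix in which 0 means "unobserved".
    U: (nrows, dim) row factors; V: (dim, ncols) column factors.
    w0: unobserved weight W_0; row_w, col_w: weights R_i and C_j.
    lam: regularization coefficient lambda.
    """
    observed = A != 0
    # W_ij = W_0 + R_i * C_j where A_ij is observed, W_0 elsewhere.
    W = np.where(observed, w0 + np.outer(row_w, col_w), w0)
    residual = A - U.dot(V)
    fit = np.sum(W * residual ** 2)
    reg = lam * (np.sum(U ** 2) + np.sum(V ** 2))
    return fit + reg

A = np.array([[5.0, 0.0], [0.0, 3.0]])
U = np.ones((2, 1))
V = np.ones((1, 2))
loss = wals_loss(A, U, V, w0=0.5, row_w=np.ones(2), col_w=np.ones(2), lam=0.1)
```

With a small `W_0`, the zero entries of A contribute little to the loss, so the fit is dominated by the observed ratings.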

The WALS model therefore has several parameters: `W_0`, `\lambda`, the row and column weights, and the dimension of the factors. Half the difficulty in using this model is selecting the right values for them.
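
One simple way to explore these parameters is an exhaustive grid search. The sketch below is an assumption about how that could be organized, not something this notebook does: `grid_search` and `toy_evaluate` are hypothetical names, and `toy_evaluate` is a stand-in for a function that would train a model with the given parameters and return its test RMSE. The candidate values are purely illustrative.

```python
import itertools

def grid_search(evaluate, grid):
    """Returns (best_params, best_score) for the lowest score on the grid.

    evaluate: callable taking the parameters as keyword arguments and
      returning a score to minimize (for example, test RMSE).
    grid: dict mapping each parameter name to a list of candidate values.
    """
    names = sorted(grid)
    best_params, best_score = None, float('inf')
    for combo in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = evaluate(**params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_evaluate(dim, regularization):
    # Stand-in score; a real version would train WALS and compute the RMSE.
    return abs(dim - 35) + regularization

grid = {'dim': [10, 35, 50], 'regularization': [0.001, 0.01, 0.1]}
best_params, best_score = grid_search(toy_evaluate, grid)
```

Because every combination requires a full training run, the grid should stay small, or be replaced by random sampling, when each run is expensive.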


In [2]:
import os
import sys
import time
import yaml

import numpy as np
import pandas as pd
import tensorflow as tf

from tensorflow.contrib.factorization.python.ops import factorization_ops

INPUT_DIR = 'embeddingModel'
TRAIN_FILE_NAME = os.path.join(INPUT_DIR, 'walsMovielensTrain.csv')
TEST_FILE_NAME = os.path.join(INPUT_DIR, 'walsMovielensTest.csv')
MATRIX_STATS_FILE = os.path.join(INPUT_DIR, 'matrixInfo.yaml')

OUTPUT_MODEL_DIR = 'embeddingModel/WALS'
MATRIX_FACTORS = os.path.join(OUTPUT_MODEL_DIR, 'matrixFactors.npz')

# WALS parameters
UNOBSERVED_WEIGHT = 0.001 # W_0
REGULARIZATION = 0.001 # \lambda
DIM = 35 # dimension of the matrix factors (embedding vectors)

# Training parameters
NUM_ITR = 10  # number of alternating row/column update sweeps

In [3]:
# Read in the preprocessed data from the previous notebook.
train_data = pd.read_csv(TRAIN_FILE_NAME, sep=',', header=None, names=['userid', 'movieid', 'rating'])
test_data = pd.read_csv(TEST_FILE_NAME, sep=',', header=None, names=['userid', 'movieid', 'rating'])

# Split the data into values_* and indices_* for the WALS matrix.
# (.values replaces DataFrame.as_matrix(), which newer pandas versions removed.)
values_train = train_data['rating'].values.astype(np.float32)
indices_train = train_data[['userid', 'movieid']].values.astype(np.int32)

values_test = test_data['rating'].values.astype(np.float32)
indices_test = test_data[['userid', 'movieid']].values.astype(np.int32)

# safe_load parses plain YAML without constructing arbitrary Python objects.
with open(MATRIX_STATS_FILE, 'r') as f:
  matrix_size = yaml.safe_load(f)
  
nrows = matrix_size['num_rows']
ncols = matrix_size['num_columns']

assert indices_test.shape[0] == values_test.shape[0]
assert indices_train.shape[0] == values_train.shape[0]


In [19]:
# Tensorflow block for running the model.
with tf.Graph().as_default():
  inp = tf.SparseTensor(indices_train, values_train, [nrows, ncols])
  model = factorization_ops.WALSModel(
      nrows,
      ncols,
      DIM,
      unobserved_weight=UNOBSERVED_WEIGHT,
      regularization=REGULARIZATION,
      row_weights=None,
      col_weights=None)

  with tf.Session():
    row_update_op = model.update_row_factors(sp_input=inp)[1]
    col_update_op = model.update_col_factors(sp_input=inp)[1]

    model.initialize_op.run()
    model.worker_init.run()
    # Each iteration is one full sweep: update all rows, then all columns.
    for i in range(NUM_ITR):
      print('Itr %d/%d' % (i+1, NUM_ITR))
      sys.stdout.flush()
      model.initialize_row_update_op.run()
      row_update_op.run()
      model.initialize_col_update_op.run()
      col_update_op.run()
    row_factors = model.row_factors[0].eval()
    col_factors = model.col_factors[0].eval()

# Note that WALS returns the transpose of the column factor:
# col_factors has shape (ncols, DIM), i.e. it is V transposed.
print('row/col factor shapes')
print(row_factors.shape)
print(col_factors.shape)


Itr 1/10
Itr 2/10
Itr 3/10
Itr 4/10
Itr 5/10
Itr 6/10
Itr 7/10
Itr 8/10
Itr 9/10
Itr 10/10
row/col factor shapes
(66219, 35)
(9670, 35)
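
To illustrate what one row sweep in the loop above computes, here is a minimal dense NumPy sketch. It assumes the standard closed-form weighted least-squares solution with V held fixed, where each row `u_i` solves `(V D_i V^T + \lambda I) u_i = V D_i a_i` and `D_i` is the diagonal matrix of the weights in row i of W. This is not how the TensorFlow op is implemented internally, and `update_row_factors_dense` is a hypothetical name. `V` has shape (dim, ncols), as in the objective.

```python
import numpy as np

def update_row_factors_dense(A, W, V, lam):
    """Solves for every row of U with V held fixed (dense illustration).

    A, W: (nrows, ncols) input and weight matrices.
    V: (dim, ncols) column factors; lam: regularization coefficient.
    """
    dim = V.shape[0]
    U = np.empty((A.shape[0], dim))
    for i in range(A.shape[0]):
        Vw = V * W[i]  # scales column j of V by W_ij
        gram = Vw.dot(V.T) + lam * np.eye(dim)
        U[i] = np.linalg.solve(gram, Vw.dot(A[i]))
    return U

# Sanity check: with unit weights and no regularization, an exactly
# factorizable A gives back the row factors that generated it.
V = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
U_true = np.array([[1.0, 2.0], [3.0, 4.0]])
A = U_true.dot(V)
U_est = update_row_factors_dense(A, np.ones_like(A), V, lam=0.0)
```

The column sweep is the same computation with the roles of U and V exchanged, which is why the loop alternates between the two update ops.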


In [17]:
def get_prediction(row_factors, col_factors, indices_test, values_test):
  """Predicts each (row, col) entry as the dot product of its two factors."""
  assert indices_test.shape[0] == values_test.shape[0]

  predicted_values = np.empty(values_test.shape)

  for i in range(values_test.shape[0]):
    r = indices_test[i, 0]  # user (row) index
    c = indices_test[i, 1]  # movie (column) index

    rowf = row_factors[r, :]
    colf = col_factors[c, :]

    predicted_values[i] = np.dot(rowf, colf)
  return predicted_values

def print_stats(predicted_values, values_test):
  """Prints absolute-error statistics and the RMSE of the predictions."""
  diff = np.abs(predicted_values - values_test)
  max_abs_err = np.max(diff)
  min_abs_err = np.min(diff)
  avg_abs_err = np.mean(diff)
  sd_abs_err = np.std(diff)
  rmse = np.sqrt(np.mean(np.square(diff)))

  print('\tmax absolute error %f' % (max_abs_err,))
  print('\tmin absolute error %f' % (min_abs_err,))
  print('\tavg absolute error %f, standard deviation %f' % (avg_abs_err, sd_abs_err))
  print('\trmse               %f' % (rmse,))
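
The per-entry loop in `get_prediction` can be slow for millions of ratings. A possible vectorized alternative, sketched here under the assumption that the gathered factor rows fit in memory, looks up all factors with fancy indexing and takes the row-wise dot products in a single call. The name `get_prediction_vectorized` is hypothetical.

```python
import numpy as np

def get_prediction_vectorized(row_factors, col_factors, indices):
    """Returns the dot product of the row and column factors per index pair.

    indices: (n, 2) integer array of (row, col) pairs.
    """
    rows = row_factors[indices[:, 0]]  # (n, dim)
    cols = col_factors[indices[:, 1]]  # (n, dim)
    # Row-wise dot product without forming the full (n, n) product.
    return np.einsum('ij,ij->i', rows, cols)

toy_rows = np.array([[1.0, 2.0], [3.0, 4.0]])
toy_cols = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
toy_indices = np.array([[0, 2], [1, 0], [1, 1]])
toy_predictions = get_prediction_vectorized(toy_rows, toy_cols, toy_indices)
```

It should return the same values as the loop version, up to floating-point rounding.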




In [20]:
predicted_values = get_prediction(row_factors, col_factors, indices_train, values_train)
print('Training Set Stats')
print_stats(predicted_values, values_train)

predicted_values = get_prediction(row_factors, col_factors, indices_test, values_test)
print('Testing Set Stats')
print_stats(predicted_values, values_test)

Training Set Stats
	max absolute error 3.743940
	min absolute error 0.000000
	avg absolute error 0.722751, standard deviation 0.576634
	rmse               0.924595
Testing Set Stats
	max absolute error 3.833365
	min absolute error 0.000000
	avg absolute error 0.793343, standard deviation 0.587389
	rmse               0.987127


In [12]:
# Save the two factor matrices to a single NumPy .npz file.
if not os.path.exists(OUTPUT_MODEL_DIR):
  os.makedirs(OUTPUT_MODEL_DIR)
np.savez(MATRIX_FACTORS, row_factors=row_factors, col_factors=col_factors)

# Matrix factors can be read in with:
# npzfile = np.load(MATRIX_FACTORS)
# npzfile['row_factors']
# npzfile['col_factors']
