What is this:
============
This is an example showing how to run parallel matrix factorization
using GraphLab's command line tools.

Objective:
=========
We want to build a recommendation engine that recommends items to users,
but we have a very large number of items, so traditional approaches
do not scale.

Solution:
=========
Parallel Matrix Factorization (Alternating Least Squares).
A technique used to condense a large sparse matrix into much smaller
ones while still approximating the original.
Sparse in this case means that most users have rated only a few items.

We are essentially saying:
  "Can we take this big/huge matrix and make some smaller ones that
   approximate the original?"

Think File Compression. We want a smaller file...it uses fewer resources
and is easier to work with.

Don't Think File Compression, because we can't get back to the original
matrix; we only have an approximation.
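
To make this concrete, here is a tiny numpy sketch of alternating least
squares on a made-up ratings matrix. This is just the idea, not the pmf
implementation: we factor R into two skinny matrices U and V, and the
product U * V^T approximates R on the entries we observed.

import numpy as np

np.random.seed(0)

# Toy ratings matrix: 5 users x 4 items (0 = unrated).
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [1, 0, 0, 4],
              [0, 1, 5, 4]], dtype=float)
mask = R > 0          # which entries were actually rated
D = 2                 # small latent dimension (the --D parameter)
lam = 0.1             # regularization weight (the --lambda parameter)

U = np.random.rand(R.shape[0], D)
V = np.random.rand(R.shape[1], D)

for _ in range(20):   # a handful of ALS sweeps
    # Fix V and solve a small regularized least-squares problem per user...
    for i in range(R.shape[0]):
        Vi = V[mask[i]]
        U[i] = np.linalg.solve(Vi.T @ Vi + lam * np.eye(D),
                               Vi.T @ R[i, mask[i]])
    # ...then fix U and solve per item.
    for j in range(R.shape[1]):
        Uj = U[mask[:, j]]
        V[j] = np.linalg.solve(Uj.T @ Uj + lam * np.eye(D),
                               Uj.T @ R[mask[:, j], j])

print(np.round(U @ V.T, 1))   # close to R wherever R was observed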

We will be using a custom-built command line tool for matrix factorization,
called pmf, that is packaged with the GraphLab library.

Inputs: A large sparse (many user/item pairs but few ratings) matrix
in matrix market format, one rating per line:
USER<TAB>ITEM<TAB>RANK
Additional information with examples:
http://bickson.blogspot.com/search/label/Matrix%20Market
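
For illustration only, here is a small python snippet that writes a
made-up ratings file in that layout. The header shown is the standard
matrix market coordinate header; see the link above for exactly what
pmf expects.

# Hypothetical file name and made-up data.
ratings = [(1, 1, 5.0), (1, 2, 3.0), (2, 1, 4.0), (3, 4, 5.0)]  # (user, item, rank)
n_users, n_items = 3, 4

with open("ratings.mtx", "w") as f:
    f.write("%%MatrixMarket matrix coordinate real general\n")  # standard MM header
    f.write("%d %d %d\n" % (n_users, n_items, len(ratings)))    # rows, cols, non-zeros
    for user, item, rank in ratings:
        f.write("%d\t%d\t%g\n" % (user, item, rank))             # USER<TAB>ITEM<TAB>RANK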

Parameters:
    --scheduler - this is how tasks will be scheduled; it gives us
        parallelism and control over how things are computed
          round_robin -- cycle over users and items
          max_iterations -- number of passes to make
          block_size -- should be 1
    --matrix-market - tells GraphLab that we are giving it a matrix in
        'matrix market format' so it knows how to read it
    --lambda - a weight applied to the factor vectors; the optimal value
        is found through trial and error. Prevents overfitting.
    --ncpus - the number of cpus/cores to use in the computation
    --D - the dimension of the factorized matrices. Too small results in
        a fast run time but less accuracy; too large results in a slow
        run time but more accuracy. Typical values are 20 to 50.
    --maxval - maximum value of an allowed rating
    --minval - minimum value of an allowed rating
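
Putting the flags together, a run might look roughly like the python
wrapper below. This is only a sketch: the flag names follow the list
above, while the scheduler string, the input file name and any extra
positional arguments are assumptions -- check pmf.sh and the GraphLab
pmf page under "Additional resources" for the exact invocation.

import subprocess

cmd = [
    "./pmf", "ratings.mtx",     # hypothetical input file; pmf may take more positional args
    "--matrix-market=true",     # input is in matrix market format
    "--scheduler=round_robin(max_iterations=20,block_size=1)",  # assumed scheduler syntax
    "--lambda=0.065",           # regularization weight, tuned by trial and error
    "--ncpus=4",                # cores to use
    "--D=20",                   # latent dimension, typically 20 to 50
    "--minval=1",               # minimum allowed rating
    "--maxval=5",               # maximum allowed rating
]
subprocess.check_call(cmd)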

Outputs:
The output is two matrices, U and V. In matrix U the rows are feature vectors
for users, where the first row is user 1, the second user 2, and so on. In
matrix V the rows are feature vectors for movies, where the first row is movie 1,
the second movie 2, and so on. To get a prediction for user 1 and movie 1, simply
take the dot product of the first row of U and the first row of V.
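
For example, once U and V have been loaded into numpy arrays of shape
(users x D) and (movies x D) -- parsing the output files is not shown
here -- predictions and top-N recommendations are just dot products:

import numpy as np

def predict(U, V, user, movie):
    # Users and movies are 1-indexed in the description above.
    return U[user - 1].dot(V[movie - 1])

def top_n(U, V, user, n=10):
    # Score every movie for this user with one matrix-vector product,
    # then return the n highest-scoring movie ids (1-indexed).
    scores = V.dot(U[user - 1])
    best = np.argsort(scores)[::-1][:n]
    return [(int(m) + 1, float(scores[m])) for m in best]

# e.g. predict(U, V, 1, 1) is the predicted rating of movie 1 by user 1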

Value:
======
We now have much smaller matrices to work with and build recommendations from.
Large Scale Problem on a Single Computer -- efficient use of resources (multicore)
Fast Computation -- results faster and at scale
cost:latency ratio is low






Additional resources:
GraphLab pmf website: http://graphlab.org/pmf.html

Matrix market format description including examples:
http://bickson.blogspot.com/search/label/Matrix%20Market

Danny Bickson has a great blog discussing large scale machine learning with
some great examples; he is a principal scientist in CMU's Machine Learning
Department and a big contributor to the GraphLab project.
http://bickson.blogspot.com