What this is
Runs a distributed computation of the cosine similarity of the item-vectors in a user-item matrix. This value can be used as the basis for an item-based recommendation system.
- It's fully distributed
- It handles string ids (unlike Mahout)
- It's short: only about 300 lines of Scala code.
Requirements:
- Hadoop 0.20.x
- Apache Buildr
    bu clean package
    rake clean default
A user-item matrix:
            iPad  iPhone  VW
    Nate    -     5       5
    Jay     4     3       -
Each rating must be between 1 (dislike) and 5 (like). Omit the row for any user-item pair that is unrated (-).
    # (userId, itemId, prefVal)
    Nate,iPhone,5
    Nate,VW,5
    Jay,iPad,4
    Jay,iPhone,3
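The input format above can be read with a few lines of Scala. This is a minimal in-memory sketch, not the project's Hadoop input handling; `InputSketch`, `Pref`, `parseLine` and `parse` are hypothetical names, and silently dropping malformed or out-of-range rows is an assumption.

```scala
// Hypothetical sketch: parse "userId,itemId,prefVal" rows on plain Scala collections.
object InputSketch {
  final case class Pref(userId: String, itemId: String, prefVal: Double)

  // Returns None for blank lines, comment lines, malformed rows,
  // and ratings outside the 1..5 range (assumed behavior).
  def parseLine(line: String): Option[Pref] = {
    val trimmed = line.trim
    if (trimmed.isEmpty || trimmed.startsWith("#")) None
    else trimmed.split(",").map(_.trim) match {
      case Array(user, item, value) =>
        scala.util.Try(value.toDouble).toOption
          .filter(v => v >= 1.0 && v <= 5.0)
          .map(v => Pref(user, item, v))
      case _ => None
    }
  }

  // Keep only the rows that parse successfully.
  def parse(lines: Seq[String]): Seq[Pref] =
    lines.flatMap(line => parseLine(line).toList)
}
```

For example, `InputSketch.parse(Seq("Nate,iPhone,5", "Jay,VW,9"))` keeps the first row and drops the second, because 9 is outside the allowed range.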
Phase 1: Create the item-vectors
    # (itemId, userId, prefVal)
    VW      Nate,5
    iPad    Jay,4
    iPhone  Nate,5|Jay,3
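Phase 1 is essentially a group-by on the item id. A minimal in-memory sketch in Scala, assuming plain collections in place of a MapReduce job (`Phase1Sketch` and `itemVectors` are hypothetical names, not taken from the project):

```scala
// Hypothetical sketch of Phase 1 on plain Scala collections.
object Phase1Sketch {
  // Input:  (userId, itemId, prefVal) triples.
  // Output: itemId -> all (userId, prefVal) ratings for that item.
  def itemVectors(prefs: Seq[(String, String, Double)]): Map[String, Seq[(String, Double)]] =
    prefs.groupBy(_._2).map { case (item, rows) =>
      item -> rows.map { case (user, _, value) => (user, value) }
    }
}
```

In a real job the grouping would be done by the shuffle between map and reduce; here `groupBy` stands in for it.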
Phase 2: Compute the length of the item vectors, store it with the item, create the user-vectors
    # (userId, (itemId, ivLength, prefVal) ...)
    Jay   iPad,4.0,4.0|iPhone,5.830951,3.0
    Nate  VW,5.0,5.0|iPhone,5.830951,5.0
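The length stored with each item is the Euclidean norm of its item vector, e.g. sqrt(5^2 + 3^2) = 5.830951 for iPhone. A minimal in-memory sketch of this phase, assuming the Phase 1 output is held in a `Map` (`Phase2Sketch`, `Entry` and `userVectors` are hypothetical names):

```scala
// Hypothetical sketch of Phase 2 on plain Scala collections.
object Phase2Sketch {
  // One element of a user vector: the item, that item's vector length, and the user's rating.
  final case class Entry(itemId: String, ivLength: Double, prefVal: Double)

  def userVectors(itemVectors: Map[String, Seq[(String, Double)]]): Map[String, Seq[Entry]] = {
    val entries = for {
      (item, ratings) <- itemVectors.toSeq
      // Euclidean length of the item vector.
      ivLength = math.sqrt(ratings.map { case (_, v) => v * v }.sum)
      (user, value) <- ratings
    } yield (user, Entry(item, ivLength, value))
    // Regroup by user to obtain the user vectors.
    entries.groupBy(_._1).map { case (user, pairs) => user -> pairs.map(_._2) }
  }
}
```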
Phase 3: Compute the pairwise cosine similarity for all item pairs that have been co-rated by at least n users
    VW    iPhone  0.857492  # => 5.0 * 5.0 / (5.0 * 5.83)
    iPad  iPhone  0.514495  # => 4.0 * 3.0 / (4.0 * 5.83)

Each similarity is a quotient: the numerator is the sum of the products of the co-rated pref values (the dot product), and the denominator is the product of the two vector lengths.
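The pairwise step can be sketched in memory as follows. This is an illustration on plain Scala collections, not the project's MapReduce code; `Phase3Sketch`, `similarities` and `minCoRated` (standing in for n) are hypothetical names, and ordering each pair by item id is an assumption made so that every pair has a single key.

```scala
// Hypothetical sketch of Phase 3 on plain Scala collections.
object Phase3Sketch {
  // (itemId, ivLength, prefVal), as in the Phase 2 output.
  type Entry = (String, Double, Double)

  def similarities(userVectors: Map[String, Seq[Entry]],
                   minCoRated: Int): Map[(String, String), Double] = {
    // Each user contributes one partial product per pair of items they rated.
    val partials = for {
      entries <- userVectors.values.toSeq
      sorted = entries.sortBy(_._1)
      i <- sorted.indices
      j <- (i + 1) until sorted.size
    } yield {
      val (itemA, lenA, prefA) = sorted(i)
      val (itemB, lenB, prefB) = sorted(j)
      ((itemA, itemB), (prefA * prefB, lenA * lenB))
    }
    // Sum the partial products per pair and divide by the product of the lengths,
    // keeping only pairs co-rated by at least minCoRated users.
    partials.groupBy(_._1).collect {
      case (pair, rows) if rows.size >= minCoRated =>
        val dot = rows.map(_._2._1).sum
        val lengthProduct = rows.head._2._2
        pair -> dot / lengthProduct
    }
  }
}
```

Because the item lengths were attached in Phase 2, each pair can be scored from the user vectors alone, without another lookup of the item vectors.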
Phase 4 (optional): Generate stripes of each item with its list of similar items
This output makes it easy to give recommendations based on the key item. Value items are sorted in descending order of similarity.
    VW      iPhone,0.8574929396603828
    iPad    iPhone,0.5144957461491708
    iPhone  VW,0.8574929396603828|iPad,0.5144957461491708
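Building the stripes amounts to emitting every pair in both directions, grouping by the key item, and sorting each group. A minimal in-memory sketch, assuming the Phase 3 result is a `Map` keyed by item pair (`Phase4Sketch`, `stripes` and `format` are hypothetical names):

```scala
// Hypothetical sketch of Phase 4 on plain Scala collections.
object Phase4Sketch {
  def stripes(sims: Map[(String, String), Double]): Map[String, Seq[(String, Double)]] = {
    // Emit each pair in both directions so that every item becomes a key.
    val directed = sims.toSeq.flatMap { case ((a, b), s) =>
      Seq((a, (b, s)), (b, (a, s)))
    }
    // Group by key item and sort the value items by descending similarity.
    directed.groupBy(_._1).map { case (item, rows) =>
      item -> rows.map(_._2).sortBy { case (_, s) => -s }
    }
  }

  // Render one stripe in the "item,similarity|item,similarity" layout shown above.
  def format(stripe: Seq[(String, Double)]): String =
    stripe.map { case (item, s) => s"$item,$s" }.mkString("|")
}
```

A recommendation for a user who liked the key item is then just the first few entries of that item's stripe.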
- Outline originally based on ItemSimilarityJob in the Apache Mahout project.
- The algorithm is a slight modification of the one described in "Pairwise Document Similarity in Large Collections with MapReduce".
- HadoopInterop.scala forked from jmhodges "componentize"
- and a bit from: http://blog.jonhnnyweslley.net/2008/05/shadoop.html