# Recommender Systems - Final Project Requirements

## Goal

The main goal of the final project in the Recommender Systems course is to analyze a massive data set with at least 1 million ratings, preferably using a distributed processing system such as Hadoop. In this project I will analyze 1.7 million continuous ratings (-10.00 to +10.00) of 150 jokes from 59,132 users. Please visit http://eigentaste.berkeley.edu/dataset/ for more information about the data set. A collaborative filtering algorithm will be developed from scratch using PySpark in a Spark environment. This document walks through the algorithm on a toy data set; the same algorithm will then be applied to the 1.7 million ratings in a Hadoop environment.

## Approach

We have two data sets: jester_ratings.dat and jester_items.dat. The jester_ratings.dat data set has the following format (tab separated):

_user-id_,  _joke-id_,  _rating_

The jester_items.dat data set contains the _joke-id_ and the actual joke text.

We will perform the following steps to build a recommender system using PySpark in a Spark environment:
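As a minimal sketch of the ratings format described above, the snippet below parses one tab-separated line into the `(user-id, (joke-id, rating))` shape used later for the RDD. The function name `parse_rating_line` and the sample line are hypothetical and are not taken from the real data set.

```python
def parse_rating_line(line):
    """Split one tab-separated line into (user_id, (joke_id, rating))."""
    user_id, joke_id, rating = line.strip().split("\t")
    return int(user_id), (int(joke_id), float(rating))


# Hypothetical sample line, used only to illustrate the format.
sample = "1\t5\t0.219\n"
record = parse_rating_line(sample)
print(record)
```

Keying each record by user ID is what later makes the self-join on user ID possible.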

1. Normalize the ratings of the items, based on the following logic:

   a. Get the mean rating of each joke. 

   b. Get the mean rating of each user.
   
   c. Subtract the mean rating of the joke (say _j_) and the mean rating of the user (say _u_) from the actual rating that user _u_ gave to joke _j_.

2. Using the normalized ratings, compute the cosine similarity between all pairs of jokes that were rated by the _same users_. The cosine similarity helps us identify jokes that a user could like, based on the jokes the user already liked. To exploit Spark's distributed processing, we will use the following logic to compute the cosine similarity between pairs of jokes:

   a. Read the contents of jester_ratings.dat into an RDD (Resilient Distributed Dataset), with _user-id_ as the key and _(joke-id, rating)_ as the value. Let this RDD be _ratings-rdd_.

   b. Self-join this RDD (_ratings-rdd_) and filter the rows to eliminate duplicate joke pairs (the filtering process is explained later using an example). Let us call the result _joke-pairs-rdd_. The key of this RDD will be _(joke-id-1, joke-id-2)_ and the value will be _(joke-id-1-rating, joke-id-2-rating)_.

   c. Group _joke-pairs-rdd_ by its keys and compute the cosine similarity of the values. Let the resulting RDD be called _jokes-similarity-rdd_. This RDD holds the cosine similarity between each pair of jokes, along with the number of users who rated both jokes in the pair.

   d. Write the _jokes-similarity-rdd_ to HDFS.

3. To make recommendations to a user:

   a. Get the 5 joke-ids rated highest by the user.

   b. For each of these 5 joke-ids, get all the jokes paired with it in _jokes-similarity-rdd_. Rank these jokes in descending order of _(similarity measure, number of users who rated both jokes)_.

   c. Pick the top 5 jokes from the ranked list and present them to the user. These jokes must not have been rated by the user already.
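The normalization in step 1 can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function name `normalize` and the three toy ratings are hypothetical, and the real project would compute the same means over RDDs rather than in-memory dictionaries.

```python
from collections import defaultdict


def normalize(ratings):
    """ratings: list of (user_id, joke_id, rating) tuples.

    Returns the same tuples with the joke mean and the user mean
    subtracted from each rating, as described in step 1.
    """
    user_sum, user_cnt = defaultdict(float), defaultdict(int)
    joke_sum, joke_cnt = defaultdict(float), defaultdict(int)
    for user, joke, rating in ratings:
        user_sum[user] += rating
        user_cnt[user] += 1
        joke_sum[joke] += rating
        joke_cnt[joke] += 1

    user_mean = {u: user_sum[u] / user_cnt[u] for u in user_sum}
    joke_mean = {j: joke_sum[j] / joke_cnt[j] for j in joke_sum}

    return [(user, joke, rating - joke_mean[joke] - user_mean[user])
            for user, joke, rating in ratings]


# Hypothetical toy ratings (not from the Jester data set).
toy = [(1, 1, 4.0), (1, 2, 2.0), (2, 1, 6.0)]
normalized = normalize(toy)
print(normalized)
```

Here user 1 has a mean of 3.0 and joke 1 has a mean of 5.0, so the rating 4.0 becomes 4.0 - 5.0 - 3.0 = -4.0.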


## Example:

Let us assume that we have the following ratings for some of the jokes:

<img src="toy-example.png">

The above figure shows the normalized ratings. These normalized ratings will be converted to an RDD with user ID as the key (the RDD is displayed below):

<img src="RDD-1.png">

Consider the first element, highlighted in red. It specifies that user ID 1 gave a normalized rating of -2.2 to joke 1. The second element, highlighted in blue, specifies that user ID 1 gave a normalized rating of -5.7 to joke 4.

This RDD is joined with itself (on the join condition USER-ID = USER-ID). The resulting RDD is then filtered to avoid duplicate combinations. For example, for user ID 4, we get the following elements (not all permutations are shown):

<img src="RDD-2a.png">

Consider the second element of the above list, (4, ((1, -6.5), (1, -6.5))), which is highlighted in red. This element pairs joke ID 1 with itself, so it can be eliminated. Now consider the third and fourth elements, (4, ((2, -5.0), (1, -6.5))) and (4, ((1, -6.5), (2, -5.0))), which are highlighted in green and blue respectively. Both represent the same combination, and we need only one of them. So we keep an element if and only if the joke ID of the first value is less than the joke ID of the second value. This filter condition eliminates the second and third rows. When the same condition is applied to the whole data set, we get the following elements:

<img src="RDD-2.png">
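The self-join and the duplicate filter described above can be imitated without Spark. The sketch below is an assumption-laden illustration: it uses only the four ratings quoted in the text for users 1 and 4 (each user may have more ratings in the figures), and the helper names `self_join` and `keep_ordered_pair` are hypothetical.

```python
from itertools import product

# (user_id, (joke_id, normalized_rating)); only the ratings quoted in the text.
ratings = [
    (1, (1, -2.2)), (1, (4, -5.7)),
    (4, (1, -6.5)), (4, (2, -5.0)),
]


def self_join(records):
    """Pair every rating of a user with every rating of the same user."""
    by_user = {}
    for user, value in records:
        by_user.setdefault(user, []).append(value)
    return [(user, (left, right))
            for user, values in by_user.items()
            for left, right in product(values, repeat=2)]


def keep_ordered_pair(record):
    """Keep a record only if the first joke ID is less than the second."""
    _, ((joke1, _), (joke2, _)) = record
    return joke1 < joke2


joined = self_join(ratings)
unique = sorted(rec for rec in joined if keep_ordered_pair(rec))
for rec in unique:
    print(rec)
```

Each user with two ratings produces four joined records, and the filter keeps exactly one of them: the self-pairs and the reversed pair are dropped.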

The above RDD is then re-keyed: the joke IDs are extracted from the values and become the key, and the ratings become the values. Note that the user ID is dropped. We get the following RDD:

<img src="RDD-3.png">


The above RDD shows the pairs of jokes that were rated by at least one user. For example, one user (highlighted in red) rated the joke pair (1, 4) as (-2.2, -5.7) and another user (highlighted in blue) rated the same pair as (-5.25, -2.75). This RDD is further reduced by grouping on the keys (joke pairs) and computing the cosine similarity between the values. For example, for the joke pair (1, 4), we have the ratings provided by two users: (-2.2, -5.7) and (-5.25, -2.75). The cosine similarity between the vectors $[-2.2, -5.25]$ (ratings of joke 1) and $[-5.7, -2.75]$ (ratings of joke 4) is approximately 0.7489. This measure is recorded along with the number of users who rated both jokes (in this example, 2 users rated both joke 1 and joke 4).

This step gives us the following RDD:

<img src="RDD-4.png">
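To make the grouping and the cosine computation concrete, here is a small plain-Python sketch for the joke pair (1, 4), using the two rating tuples quoted in the text. The names `joke_pairs`, `grouped` and `cosine_similarity` are illustrative assumptions; on the cluster the same logic is expressed with RDD operations.

```python
from collections import defaultdict
from math import sqrt

# ((joke_id_1, joke_id_2), (rating_1, rating_2)), one entry per co-rating user.
joke_pairs = [
    ((1, 4), (-2.2, -5.7)),
    ((1, 4), (-5.25, -2.75)),
]


def cosine_similarity(rating_pairs):
    """Return (cosine similarity, number of co-rating users)."""
    sum_xx = sum_yy = sum_xy = 0.0
    count = 0
    for x, y in rating_pairs:
        sum_xx += x * x
        sum_yy += y * y
        sum_xy += x * y
        count += 1
    denominator = sqrt(sum_xx) * sqrt(sum_yy)
    score = sum_xy / denominator if denominator else 0.0
    return score, count


# Group the rating tuples by joke pair, then score each group.
grouped = defaultdict(list)
for key, value in joke_pairs:
    grouped[key].append(value)

similarities = {key: cosine_similarity(values) for key, values in grouped.items()}
score, n_users = similarities[(1, 4)]
print(round(score, 4), n_users)
```

Written out, the computation is (-2.2)(-5.7) + (-5.25)(-2.75) = 26.9775 divided by sqrt(32.4025) * sqrt(40.0525), which is roughly 0.7489 with 2 co-rating users.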

The above RDD is written to an HDFS file, and this data set will be used to provide recommendations. For instance, if a user liked joke 6 and has not seen joke 1, we can recommend joke 1 to that user, since the cosine similarity between the two jokes is 0.95.

The following code block shows the actual Python code, written using PySpark, that computes the cosine similarity for the example data set shown above. This code was written on a Windows machine with 6 GB RAM, running Spark. The same code (with minor modifications) will be run on the bigger data set (1.7 million ratings) on a Hadoop cluster.
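One possible way to turn the saved similarities into recommendations, following step 3 of the approach, is sketched below. Everything here is an assumption for illustration: the function `recommend`, the similarity table, and the user's ratings are hypothetical, and only the 0.95 similarity between jokes 1 and 6 comes from the example.

```python
def recommend(user_ratings, similarities, n_liked=5, n_recs=5):
    """Suggest unrated jokes similar to the user's highest-rated jokes.

    user_ratings: {joke_id: rating}
    similarities: {(joke_id_1, joke_id_2): (similarity, n_co_raters)}
    """
    # Step 3a: the jokes the user rated highest.
    liked = sorted(user_ratings, key=user_ratings.get, reverse=True)[:n_liked]

    # Step 3b: collect unrated neighbours of the liked jokes.
    best = {}
    for (joke_a, joke_b), measure in similarities.items():
        for liked_joke, other in ((joke_a, joke_b), (joke_b, joke_a)):
            if liked_joke in liked and other not in user_ratings:
                best[other] = max(best.get(other, measure), measure)

    # Step 3c: rank by (similarity, number of co-raters), descending.
    ranked = sorted(best, key=best.get, reverse=True)
    return ranked[:n_recs]


# Hypothetical similarity table; the co-rater counts are made up.
similarities = {
    (1, 6): (0.95, 3),
    (2, 6): (0.40, 2),
    (4, 6): (0.10, 2),
    (1, 4): (0.75, 2),
}
user_ratings = {6: 4.5, 3: 1.0}  # hypothetical user who liked joke 6
print(recommend(user_ratings, similarities, n_recs=2))
```

Because the table is keyed by ordered pairs, the loop checks both positions of each key so that a liked joke is matched whether it appears first or second.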

### Source Code (PySpark)

Do not run this code in IPython, as PySpark is not available in the IPython notebook.

```python
from math import sqrt

from pyspark import SparkConf, SparkContext


def filterDuplicates(record):
    """Keep a joined record only if the first joke ID is less than the second.

    This drops self-pairs such as (1, 1) and keeps one of (1, 2) / (2, 1).
    """
    userID, ratings = record
    (joke1, rating1), (joke2, rating2) = ratings
    return joke1 < joke2


def computeCosineSimilarity(ratingPairs):
    """Return (cosine similarity, number of users) for one joke pair."""
    numPairs = 0
    sum_xx = sum_yy = sum_xy = 0.0
    for ratingX, ratingY in ratingPairs:
        sum_xx += ratingX * ratingX
        sum_yy += ratingY * ratingY
        sum_xy += ratingX * ratingY
        numPairs += 1

    denominator = sqrt(sum_xx) * sqrt(sum_yy)
    # Guard against division by zero when all ratings of a joke are 0
    score = sum_xy / denominator if denominator else 0.0
    return (score, numPairs)


def makePairs(record):
    """Make the joke pair the key and the two ratings the value (drops the user)."""
    user, ratings = record
    (joke1, rating1), (joke2, rating2) = ratings
    return ((joke1, joke2), (rating1, rating2))


# Define conf object to run on the local machine
conf = SparkConf().setMaster("local").setAppName("Ratings..")
sc = SparkContext(conf=conf)

data = sc.textFile("toy_data.csv")
print("Read the file...")

# Key each rating by user ID: (user, (joke, rating))
ratings = data.map(lambda x: x.split(",")) \
              .map(lambda x: (int(x[0]), (int(x[1]), float(x[2]))))
for row in ratings.collect():
    print(row)
print("Prepared the ratings RDD...")

# Have to partition the data set when running on a cluster
# ratingsPartitioned = ratings.partitionBy(100)
# print("Partitioned the ratings into 100 parts...")
ratingsPartitioned = ratings

# Self join on user ID
joinedRatings = ratingsPartitioned.join(ratingsPartitioned)
print("Self join completed ...")

# Filter duplicate ratings in the joined RDD
uniqueJoinedRatings = joinedRatings.filter(filterDuplicates)
print("Filtered the duplicates...")
for row in uniqueJoinedRatings.collect():
    print(row)

# jokePairs = uniqueJoinedRatings.map(makePairs).partitionBy(100)
jokePairs = uniqueJoinedRatings.map(makePairs)
for row in jokePairs.collect():
    print(row)

jokePairRatings = jokePairs.groupByKey()

print("Computing the cosine similarity ...")
jokePairSimilarities = jokePairRatings.mapValues(computeCosineSimilarity).persist()

print("Sorting the results...")
# sortByKey returns a new RDD, so the result must be assigned
sortedSimilarities = jokePairSimilarities.sortByKey()
for row in sortedSimilarities.collect():
    print(row)

print("Saving the results...")
sortedSimilarities.saveAsTextFile("joke-sims")

print("script ends...")
```


Once the above script runs successfully in a Hadoop environment, I would like to work on the following requirements:


## Other requirements

1. Determine whether the SGD algorithm implemented in project-4 (https://github.com/msekhar12/Recommendation_Systems/tree/master/Project_4) can be used on the 1.7 million ratings data set. This will be run in a non-Hadoop environment, since this SGD algorithm is not available in Spark MLlib yet.

2. If SGD can be run on the 1.7 million ratings data, determine the area under the ROC curve (AUC) to quantify the predictive performance of the SGD algorithm, and identify the optimal rating threshold for classifying whether a user really liked a joke or not.

3. _If I have time_, I would also like to work on building a recommendation system based on the text of the jokes.

### References:
1. Eigentaste: A Constant Time Collaborative Filtering Algorithm. Ken Goldberg, Theresa Roeder, Dhruv Gupta, and Chris Perkins. Information Retrieval, 4(2), 133-151. July 2001

2. Jure Leskovec, Anand Rajaraman and Jeffrey D. Ullman 2014. Mining of Massive Datasets (Chapter 9)

3. Deepak K. Agarwal and Bee-Chung Chen. Statistical Methods for Recommender Systems