# Recommender Systems

Recommender systems are widely used in online commerce; the system processes information about the user, their networks, and the products/services, and subtly suggests personalized products or services that the user might purchase or click on.  For example, my Amazon account is always suggesting items that I might be interested in, Netflix has a list of shows that I might want to watch, and Facebook and Google Ads customize the ads shown to each user.  Under the hood is a recommendation system that is analyzing users' behaviors and preferences.

I believe there is an even bigger role for recommender systems to play in the coming years due to the vast volume of data being collected.  In December 2014, YouTube claimed that 300 hours of content were uploaded every minute; by July 2015, the figure was over 400 hours per minute.  Many people now have access to multiple cameras capable of recording images or streaming video (every smartphone or tablet has multiple cameras, so there might even be more cameras than there are people on earth!).  The question is: how do we gain insight from this mass of data?  Ideally, our supervised and unsupervised learning algorithms will be able to help us classify objects, actions, and moods, but this is a long way off!  Perhaps a better target is a hybrid of a recommender and supervised learning, where a human starts labelling, a recommender learns from the labels and aids the labelling, and the system eventually transitions to an unsupervised learning solution.

## Approaches

There are four approaches to designing recommender systems:
- Simple Recommender
- Content-based filtering
- Collaborative filtering
- Hybrid filtering

### Simple Recommenders

Simple recommenders, as the name suggests, offer generalized recommendations to every user based on some global ranking system.  For example, one could parse the ranking of items on amazon.com and make recommendations to the shopper based on the ranking within a particular class of items.  There are some subtleties in making a recommendation, however: are you more likely to go with an item that has an average rating of 4.5 out of five stars based on 100,000 reviews, or with an item that has an average rating of 4.8 based on six reviews?  Intuitively, as the number of voters increases, the rating of an item approaches a value that reflects the item's quality.  It is more difficult to discern the quality of an item with only a few voters.

One solution is to use a weighted rating, i.e., weight the rating based on how many reviews were given.
\begin{align}
\hat{r} = \frac{n\,r}{n+m} + \frac{m\,\bar{r}}{n+m}
\end{align}
where
- $n$ is the number of votes for the item
- $r$ is the average rating for the item
- $m$ is the minimum number of votes required for a recommendation
- $\bar{r}$ is the mean rating over all items being considered.
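
To see how the weighting behaves, here is a minimal sketch that applies the formula to the two items from the question above.  The values of $m$ and $\bar{r}$ used here (50 and 3.5) are assumptions chosen only for illustration, and `weighted_rating` is a hypothetical helper name, not part of any library.

```python
def weighted_rating(n, r, m, rbar):
    """Blend an item's own average rating r with the global mean rbar.

    n    -- number of votes for the item
    r    -- average rating for the item
    m    -- minimum number of votes required for a recommendation
    rbar -- mean rating over all items being considered
    """
    return (n * r) / (n + m) + (m * rbar) / (n + m)

# Assumed values, for illustration only
m, rbar = 50, 3.5

popular = weighted_rating(n=100000, r=4.5, m=m, rbar=rbar)
obscure = weighted_rating(n=6, r=4.8, m=m, rbar=rbar)

print("4.5 stars, 100,000 reviews ->", round(popular, 3))
print("4.8 stars, 6 reviews       ->", round(obscure, 3))
```

With only a few votes the weighted rating stays close to $\bar{r}$, while with many votes it approaches the item's own average $r$, so the heavily reviewed item ranks higher here despite its lower raw average.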

### Example

Let's look at an example: the MovieLens dataset, http://dx.doi.org/10.1145/2827872.  This data was collected by the GroupLens Research Project at the University of Minnesota.
The data consists of:
- 100,000 ratings (1-5) from 943 users on 1682 movies. 
- Each user has rated at least 20 movies. 
- Simple demographic info for the users (age, gender, occupation, zip)

A larger data set (45,000 movies, 26 million ratings, 270,000 users) is available here:
https://www.kaggle.com/rounakbanik/the-movies-dataset/data#movies_metadata.csv

In [None]:
import pandas as pd
import numpy as np

from math import sqrt

In [None]:
r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings = pd.read_csv('data/ml-100k/ratings.dat', sep='\t', names=r_cols)

ratings.head()

Let's drop the timestamp, since we are not going to make use of it for now.

In [None]:
ratings.drop(columns=['unix_timestamp'],inplace=True)

In [None]:
ratings.head()

Now, group the ratings by movie:

In [None]:
grouped = ratings.groupby('movie_id')['rating'].agg(['sum','count','mean'])
grouped.head()

In [None]:
grouped.describe()

Let's also decide $m$, the minimum number of ratings needed for a recommendation.  Six votes is the 25th percentile of the vote counts, so let's drop all movies with fewer than 6 votes.

In [None]:
m = 6
grouped.drop(grouped[grouped['count']<m].index,inplace=True)
grouped.describe()

Now, let's find the mean rating of this group of movies.

In [None]:
rbar = float(grouped['sum'].sum())/grouped['count'].sum()
print("rbar = ", rbar)

Now, let's add a new column that computes the weighted mean for each movie.

In [None]:
grouped['weighted mean'] = grouped['count']*grouped['mean']/(grouped['count'] + m) \
+ m*rbar / (grouped['count'] + m) 
grouped.head()

Let's see which are the most highly rated movies that we can recommend.  First, let's load the movie information into another dataframe

In [None]:
m_cols = ['movie_id', 'title', 'release_date']
movie_info = pd.read_csv('data/ml-100k/movie-info.dat', sep='|', names=m_cols, usecols=range(3))
movie_info.head()

Now merge this dataframe with the grouped dataframe:

In [None]:
merged = pd.merge(grouped,movie_info,on="movie_id",how="inner")
merged.head()

Finally, sort by the weighted mean:

In [None]:
merged.sort_values(by=['weighted mean'],ascending=False)

### Content-based filtering

Content-based filtering methods are based on a profile of a user, e.g. a user rates an object or clicks on a link.  Based on that data, the user profile is generated, which is then used to make suggestions to the user.  As the user provides more inputs or takes actions on the recommendations, the profile (and consequently the recommender system) becomes more accurate.  Content-based filtering assumes that information about the content is available and can be easily retrieved.  This is an increasingly difficult problem as the amount of content (e.g. the number of items available for purchase on Amazon) keeps growing.  Entire courses are devoted to information retrieval.  Here are some pros and cons of content-based filtering.

Pros:
- ability to make recommendations to users with unique profiles
- ability to recommend new and unpopular items
- independent of other users
- can specify which content-features caused an item to be recommended

Cons:
- requires building a user profile.  What do we do for new users?
- "over-fitting": never recommends items outside of user's profile
- unable to exploit judgements of other users
- people might have multiple interests
- content filters need to be available and accurate

### Example

This section is based on the [DataCamp tutorial](https://www.datacamp.com/community/tutorials/recommender-systems-python) on content-based filtering.  In this example, we wish to make a movie recommendation based on a plot description.  We will switch to the Kaggle data set, which has a column for the plot description.

In [None]:
metadata = pd.read_csv('data/movies_metadata.csv', low_memory=False)
metadata.describe()

As before, let's drop all movies with fewer than 6 votes.

In [None]:
m = 6
metadata.drop(metadata[metadata['vote_count']<m].index,inplace=True)
metadata.describe()

In [None]:
#Print plot overviews of the first 5 movies.
metadata['overview'].head()

We want to generate a document-term matrix, that is, a matrix that describes how frequently terms appear in each document (in this case, each plot overview).  In a document-term matrix, each row corresponds to a document and each column corresponds to a term.  There is a built-in vectorizer in Scikit-Learn to generate this document-term matrix.  Rather than storing raw counts, it weights each entry by Term Frequency-Inverse Document Frequency (TF-IDF), which down-weights terms that appear in many documents.
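
As a rough illustration of the layout, the sketch below builds a raw-count document-term matrix for three made-up sentences using only the standard library.  The toy documents and the whitespace tokenizer are assumptions for illustration; the Scikit-Learn vectorizer used in this notebook additionally removes stop words and applies TF-IDF weighting.

```python
from collections import Counter

# Toy "plot overviews" (made up for illustration)
docs = [
    "a toy cowboy meets a toy astronaut",
    "a panda learns kung fu",
    "a cowboy rides into town",
]

# Naive tokenizer: lowercase and split on whitespace
tokenized = [doc.lower().split() for doc in docs]

# Vocabulary: one column per distinct term, in sorted order
vocab = sorted(set(term for tokens in tokenized for term in tokens))

# Document-term matrix: one row per document, one column per term
dtm = []
for tokens in tokenized:
    counts = Counter(tokens)
    dtm.append([counts[term] for term in vocab])

print(vocab)
for row in dtm:
    print(row)
```

Most entries are zero because each document uses only a small part of the vocabulary, which is why the real matrix is stored in a sparse format.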

In [None]:
#Import TfIdfVectorizer from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

#Define a TF-IDF Vectorizer Object. Remove all english stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english')

#Replace NaN with an empty string
metadata['overview'] = metadata['overview'].fillna('')

#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(metadata['overview'])

#Output the shape of tfidf_matrix
tfidf_matrix.shape

Here, 57k words are used to describe the 28k movies.  With this document-term matrix, we can now compute a similarity score.  We have seen similarity measures when we were exploring clustering.  Some common ones are:
- the Euclidean distance, $\|\vec{x}-\vec{y}\|_2$
- the cosine similarity (the cosine of an angle between two vectors)
\begin{align}
\cos{\theta} = \frac{\vec{x}\cdot\vec{y}}{\|\vec{x}\|_2 \|\vec{y}\|_2}
\end{align}
- Pearson correlation coefficient
\begin{align}
\rho = \frac{(\vec{x}-\vec{\bar{x}})\cdot(\vec{y}-\vec{\bar{y}})}{\|\vec{x}-\vec{\bar{x}}\|_2\|\vec{y}-\vec{\bar{y}}\|_2}
\end{align}
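
The three measures can be sketched in plain Python as follows.  The function names are hypothetical and the input vectors are made up; this is only meant to mirror the formulas above, not to replace the optimized library routines.

```python
from math import sqrt

def euclidean_distance(x, y):
    """Euclidean (L2) distance between two equal-length vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_similarity(x, y):
    """Cosine of the angle between x and y (undefined for zero vectors)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = sqrt(sum(a * a for a in x))
    norm_y = sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def pearson_correlation(x, y):
    """Cosine similarity of the mean-centered vectors."""
    xbar = sum(x) / len(x)
    ybar = sum(y) / len(y)
    xc = [a - xbar for a in x]
    yc = [b - ybar for b in y]
    return cosine_similarity(xc, yc)

# Made-up vectors for illustration
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.5]

print("Euclidean distance :", euclidean_distance(x, y))
print("Cosine similarity  :", cosine_similarity(x, y))
print("Pearson correlation:", pearson_correlation(x, y))
```

Note that the Euclidean distance is small for similar vectors, whereas the cosine similarity and Pearson correlation are large for similar vectors.  This sketch does not guard against zero-length or constant vectors, for which the last two measures are undefined.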

In [None]:
from sklearn.metrics.pairwise import linear_kernel

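# Compute the cosine similarity matrix. TfidfVectorizer L2-normalizes each row
# by default, so the plain dot product (linear kernel) equals the cosine similarity.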
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

And now, we can create our recommender:

In [None]:
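#Construct a reverse map from movie title to row position in metadata (and in cosine_sim).
#Positions are used because rows were dropped above, so index labels no longer match positions.
indices = pd.Series(np.arange(len(metadata)), index=metadata['title'])
indices = indices[~indices.index.duplicated(keep='first')]  # keep the first movie for repeated titles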

# Function that takes in movie title as input and outputs most similar movies
def get_recommendations(title, cosine_sim=cosine_sim):
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar movies
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return metadata['title'].iloc[movie_indices]

In [None]:
get_recommendations('Toy Story')

In [None]:
get_recommendations('Kung Fu Panda')

Hmm, not that great.  I guess the plots are somewhat the same, but ...  One can create more sophisticated content-based recommendation systems by invoking genres, keywords, actors and actresses, directors, etc.

### Collaborative filtering

The third type of recommender system is collaborative filtering.  Instead of relying on a user profile and the content of the items to be recommended, a user's network is utilized to identify users that have similar profiles, and recommendations are made based on those users' profiles.

Pros:
- don't need to identify features of the items

Cons:
- tends to recommend popular items, making it hard to recommend items to someone with unique tastes (popularity bias)
- the “cold start” problem: the system cannot give recommendations to users who have little or no usage activity (the new-user problem), nor can it recommend new items for which there is little or no usage activity.

This recommender system is covered in your textbook, chapter 9.  Here are the key steps to create a collaborative recommender system.  We have to create: 
- a predictor function, 
- a user-similarity function

### Prediction function

The prediction function behind collaborative filtering is based on the movie ratings from similar users.  In order to recommend a movie, $p$, from a set of movies, $P$, to a given user, $a$, we first need to identify the set of users, $B$, who have already seen $p$.  Then, we need to measure the taste similarity between the users in $B$ and user $a$.  The simplest prediction function for a user $a$ and movie $p$ can be defined as follows:

$$pred(a,p) = \frac{\sum_{b \in B}{sim(a,b)*(r_{b,p})}}{\sum_{b \in B}{sim(a,b)}}$$

where $sim(a,b)$ is the similarity between user $a$ and user $b$,  $B$ is the set of users in the dataset that have already seen $p$ and $r_{b,p}$ is the rating of $p$ by $b$.
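
As a small worked example of this formula, the sketch below computes $pred(a,p)$ for one movie from made-up similarities and ratings.  The function name `predict`, the user labels, and all numbers are assumptions for illustration only.

```python
def predict(similarities, ratings):
    """Similarity-weighted average of the ratings of one movie p.

    similarities -- dict mapping each user b in B to sim(a, b)
    ratings      -- dict mapping each user b in B to r_{b,p}
    """
    num = sum(similarities[b] * ratings[b] for b in ratings)
    den = sum(similarities[b] for b in ratings)
    if den == 0:
        return None  # no usable neighbours, so the prediction is undefined
    return num / den

# Hypothetical users in B, with their similarity to user a and rating of p
sims = {"b1": 0.9, "b2": 0.5, "b3": 0.1}
r_p = {"b1": 4.0, "b2": 3.0, "b3": 1.0}

print("pred(a, p) =", predict(sims, r_p))
```

The prediction is pulled towards the ratings of the most similar users: here it lands closer to the rating given by b1 than to the plain average of the three ratings.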

### Similarity function


We first need to identify the set of ratings for all movies common to two users before we can compute the user similarity.  We have already mentioned three popular similarity functions: Euclidean distance, Pearson correlation, and cosine similarity.
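
A minimal sketch of this two-step idea, using the Euclidean distance, is shown below.  The helper names, the toy ratings, and the mapping $1/(1+d)$ from distance to similarity are all assumptions for illustration; other mappings are possible.

```python
from math import sqrt

def common_items(prefs, u1, u2):
    """Return the items rated by both u1 and u2."""
    return [item for item in prefs[u1] if item in prefs[u2]]

def sim_euclidean(prefs, u1, u2):
    """Map the Euclidean distance over co-rated items to a similarity in (0, 1]."""
    shared = common_items(prefs, u1, u2)
    if not shared:
        return 0.0  # no movies in common, so treat the users as dissimilar
    dist = sqrt(sum((prefs[u1][i] - prefs[u2][i]) ** 2 for i in shared))
    return 1.0 / (1.0 + dist)

# Made-up ratings: user -> {movie: rating}
toy_prefs = {
    "u1": {"A": 4.0, "B": 3.0, "C": 5.0},
    "u2": {"A": 4.0, "B": 3.0},
    "u3": {"A": 1.0, "C": 1.0},
    "u4": {"D": 2.0},
}

for other in ("u2", "u3", "u4"):
    print("sim(u1, %s) = %.3f" % (other, sim_euclidean(toy_prefs, "u1", other)))
```

Only the co-rated movies enter the calculation, so two users who agree on a handful of shared movies can look very similar even if one of them has rated many more movies overall.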

### Example

Here is a simple example, taken from https://github.com/mukesh-srivastav/MovieRecommenderSystem, that illustrates how we can find similar users and then give a simple recommendation.  Your text has a more complicated example using the MovieLens dataset.

In [None]:
#A Dictionary of movie critics and their ratings of a small set of movies
critics={'Lisa Rose': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.5,
 'Just My Luck': 3.0, 'Superman Returns': 3.5, 'You, Me and Dupree': 2.5,
 'The Night Listener': 3.0},
'Gene Seymour': {'Lady in the Water': 3.0, 'Snakes on a Plane': 3.5,
 'Just My Luck': 1.5, 'Superman Returns': 5.0, 'The Night Listener': 3.0,
 'You, Me and Dupree': 3.5},
'Michael Phillips': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.0,
 'Superman Returns': 3.5, 'The Night Listener': 4.0},
'Claudia Puig': {'Snakes on a Plane': 3.5, 'Just My Luck': 3.0,
 'The Night Listener': 4.5, 'Superman Returns': 4.0,
 'You, Me and Dupree': 2.5},
'Mick LaSalle': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0,
 'Just My Luck': 2.0, 'Superman Returns': 3.0, 'The Night Listener': 3.0,
 'You, Me and Dupree': 2.0},
'Jack Matthews': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0,
 'The Night Listener': 3.0, 'Superman Returns': 5.0, 'You, Me and Dupree': 3.5},
'Toby': {'Snakes on a Plane':4.5,'You, Me and Dupree':1.0,'Superman Returns':4.0}}

In [None]:
print("Lisa Rose's rating of 'Lady in the Water':", critics['Lisa Rose']['Lady in the Water'])

Next, we build a custom function to compute the Pearson correlation coefficient for this simple example.

In [None]:
# Returns the Pearson correlation coefficient for p1 and p2
def sim_pearson(prefs,p1,p2):
    
    # Get the list of mutually rated items
    mutual={}
    for item in prefs[p1]:
        if item in prefs[p2]: 
            mutual[item]=1
 
    # Find the number of elements
    n=len(mutual)
     
    # If there are no ratings in common, return 0
    if n==0: return 0
 
    # Add up all the preferences
    sum1=sum([prefs[p1][movie] for movie in mutual])
    sum2=sum([prefs[p2][movie] for movie in mutual])
 
    # Sum up the squares
    sum1Sq=sum([pow(prefs[p1][movie],2) for movie in mutual])
    sum2Sq=sum([pow(prefs[p2][movie],2) for movie in mutual])
 
    # Sum up the products
    pSum=sum([prefs[p1][movie]*prefs[p2][movie] for movie in mutual])

    # Calculate Pearson score
    num=pSum-(sum1*sum2/n)
    den=sqrt((sum1Sq-pow(sum1,2)/n)*(sum2Sq-pow(sum2,2)/n))
 
    if den==0: return 0

    r=num/den
 
    return r


In [None]:
print(sim_pearson(critics,'Lisa Rose','Gene Seymour'))

In [None]:
# Returns the best matches for person from the prefs dictionary.
# Number of results and similarity function are optional params.
def topMatches(prefs,person,n=5,similarity=sim_pearson):
    # Score every other person against the given person
    scores=[(similarity(prefs,person,other),other)
            for other in prefs if other!=person]

    # Sort the list so the highest scores appear at the top
    scores.sort(reverse=True)
    return scores[0:n]

In [None]:
topMatches(critics,'Toby',n=3)

Create recommendation:

In [None]:
# Gets recommendations for a person by using a weighted average
# of every other user's rankings
def getRecommendations(prefs,person,similarity=sim_pearson):
    totals={}
    simSums={}
    for other in prefs:
        # don't compare me to myself
        if other==person: continue
        
        sim=similarity(prefs,person,other)
     
        # ignore scores of zero or lower
        if sim<=0: continue
        for item in prefs[other]:
            # only score movies I haven't seen yet
            if item not in prefs[person] or prefs[person][item]==0:
            
                # Similarity * Score
                totals.setdefault(item,0)
                totals[item]+=prefs[other][item]*sim
                
                # Sum of similarities
                simSums.setdefault(item,0)
                simSums[item]+=sim
        
    # Create the normalized list
    rankings=[(total/simSums[item],item) for item,total in totals.items( )]
 
    # Return the sorted list
    rankings.sort( )
    rankings.reverse( )

    return rankings

In [None]:
getRecommendations(critics,'Toby')

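If we want recommendations based on a movie rather than a person, we can swap the roles of people and movies in the dictionary and reuse the same functions: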

In [None]:
def transformPrefs(prefs):
    result={}
    for person in prefs:
        for item in prefs[person]:
            result.setdefault(item,{})
            # Flip item and person
            result[item][person]=prefs[person][item]
    return result

In [None]:
movies=transformPrefs(critics)
topMatches(movies,'Superman Returns')