In [1]:
# A dictionary of movie critics and their ratings of a small
# set of movies
critics={'Lisa Rose': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.5,
 'Just My Luck': 3.0, 'Superman Returns': 3.5, 'You, Me and Dupree': 2.5, 
 'The Night Listener': 3.0},
'Gene Seymour': {'Lady in the Water': 3.0, 'Snakes on a Plane': 3.5, 
 'Just My Luck': 1.5, 'Superman Returns': 5.0, 'The Night Listener': 3.0, 
 'You, Me and Dupree': 3.5}, 
'Michael Phillips': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.0,
 'Superman Returns': 3.5, 'The Night Listener': 4.0},
'Claudia Puig': {'Snakes on a Plane': 3.5, 'Just My Luck': 3.0,
 'The Night Listener': 4.5, 'Superman Returns': 4.0, 
 'You, Me and Dupree': 2.5},
'Mick LaSalle': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0, 
 'Just My Luck': 2.0, 'Superman Returns': 3.0, 'The Night Listener': 3.0,
 'You, Me and Dupree': 2.0}, 
'Jack Matthews': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0,
 'The Night Listener': 3.0, 'Superman Returns': 5.0, 'You, Me and Dupree': 3.5},
'Toby': {'Snakes on a Plane':4.5,'You, Me and Dupree':1.0,'Superman Returns':4.0}}




This dictionary uses a rating from 1 to 5 as a way to express how much each of
these movie critics (and I) liked a given movie. No matter how preferences are
expressed, you need a way to map them onto numerical values. If you were building
a shopping site, you might use a value of 1 to indicate that someone had bought an
item in the past and a value of 0 to indicate that they had not.
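As a rough sketch of the shopping-site idea, purchase records can be mapped onto the same nested-dictionary shape used for `critics`. The customer names, item names, and the `purchases_to_prefs` helper below are all hypothetical and chosen only for illustration.

```python
# Hypothetical purchase log: (customer, item) pairs. All names are made up.
purchases = [('alice', 'kettle'), ('alice', 'toaster'), ('bob', 'kettle')]
catalog = ['kettle', 'toaster', 'blender']

def purchases_to_prefs(purchases, catalog):
    """Map purchase records to 1 (bought) / 0 (not bought) per catalog item."""
    bought = {}
    for customer, item in purchases:
        bought.setdefault(customer, set()).add(item)
    return {customer: {item: (1 if item in items else 0) for item in catalog}
            for customer, items in bought.items()}

shop_prefs = purchases_to_prefs(purchases, catalog)
print(shop_prefs['alice'])
```

The result has the same person-to-item-to-number structure as `critics`, so the similarity functions defined later could in principle be applied to it unchanged.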

In [3]:
critics['Lisa Rose']['Lady in the Water']

2.5

## Finding Similar Users

After collecting data about the things people like, you need a way to determine how
similar people are in their tastes. You do this by comparing each person with every
other person and calculating a similarity score.

## Euclidean Distance Score

In [7]:
from math import sqrt

# Returns a distance-based similarity score for person1 and person2
def sim_distance(prefs,person1,person2):
  # Get the list of shared_items
  si={}
  for item in prefs[person1]: 
    if item in prefs[person2]: si[item]=1

  # if they have no ratings in common, return 0
  if len(si)==0: return 0

  # Add up the squares of all the differences
  sum_of_squares=sum([pow(prefs[person1][item]-prefs[person2][item],2) 
                      for item in prefs[person1] if item in prefs[person2]])

  return 1/(1+sum_of_squares)

This function can be called with two names to get a similarity score.

In [8]:
sim_distance(critics,'Lisa Rose','Gene Seymour')

0.14814814814814814
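The score above can be reproduced by hand. The sketch below copies the two critics' ratings from the dictionary and walks through the same arithmetic as `sim_distance`; the variable names are illustrative only. Note that the function as written divides by one plus the sum of squared differences, without taking a square root.

```python
# Ratings copied from the critics dictionary
lisa = {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.5, 'Just My Luck': 3.0,
        'Superman Returns': 3.5, 'You, Me and Dupree': 2.5, 'The Night Listener': 3.0}
gene = {'Lady in the Water': 3.0, 'Snakes on a Plane': 3.5, 'Just My Luck': 1.5,
        'Superman Returns': 5.0, 'The Night Listener': 3.0, 'You, Me and Dupree': 3.5}

# Squared rating difference for every movie both critics rated
squared_diffs = {movie: (lisa[movie] - gene[movie]) ** 2
                 for movie in lisa if movie in gene}

sum_of_squares = sum(squared_diffs.values())  # 0.25 + 0 + 2.25 + 2.25 + 1.0 + 0 = 5.75
score = 1 / (1 + sum_of_squares)              # 1 / 6.75
print(sum_of_squares, score)
```

Identical ratings give a sum of zero and therefore a score of 1; larger disagreements push the score toward 0.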

## Pearson Correlation Score

In [9]:
# Returns the Pearson correlation coefficient for p1 and p2
def sim_pearson(prefs,p1,p2):
  # Get the list of mutually rated items
  si={}
  for item in prefs[p1]: 
    if item in prefs[p2]: si[item]=1

  # if there are no ratings in common, return 0
  if len(si)==0: return 0

  # Sum calculations
  n=len(si)
  
  # Sums of all the preferences
  sum1=sum([prefs[p1][it] for it in si])
  sum2=sum([prefs[p2][it] for it in si])
  
  # Sums of the squares
  sum1Sq=sum([pow(prefs[p1][it],2) for it in si])
  sum2Sq=sum([pow(prefs[p2][it],2) for it in si])	
  
  # Sum of the products
  pSum=sum([prefs[p1][it]*prefs[p2][it] for it in si])
  
  # Calculate r (Pearson score)
  num=pSum-(sum1*sum2/n)
  den=sqrt((sum1Sq-pow(sum1,2)/n)*(sum2Sq-pow(sum2,2)/n))
  if den==0: return 0

  r=num/den

  return r

In [10]:
sim_pearson(critics,'Lisa Rose','Gene Seymour')

0.39605901719066977
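For reference, the quantity computed by `sim_pearson` can be written as a single formula. The notation is an assumption made here for illustration: $x_i$ and $y_i$ stand for the two people's ratings of the $i$-th shared item, and $n$ is the number of shared items.

$$
r = \frac{\sum_i x_i y_i - \frac{\left(\sum_i x_i\right)\left(\sum_i y_i\right)}{n}}
         {\sqrt{\left(\sum_i x_i^2 - \frac{\left(\sum_i x_i\right)^2}{n}\right)
                \left(\sum_i y_i^2 - \frac{\left(\sum_i y_i\right)^2}{n}\right)}}
$$

In the code, `pSum` is $\sum_i x_i y_i$, `sum1` and `sum2` are $\sum_i x_i$ and $\sum_i y_i$, and `sum1Sq` and `sum2Sq` are $\sum_i x_i^2$ and $\sum_i y_i^2$. The value ranges from -1 to 1, and the function returns 0 when the denominator is 0.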

### Ranking the Critics

You can create a function that scores everyone against a given person and finds the
closest matches. In this case, I’m interested in learning which movie critics have
tastes similar to mine so that I know whose advice I should take when deciding on a
movie.

In [11]:
# Returns the best matches for person from the prefs dictionary. 
# Number of results and similarity function are optional params.
def topMatches(prefs,person,n=5,similarity=sim_pearson):
  scores=[(similarity(prefs,person,other),other) 
                  for other in prefs if other!=person]
  scores.sort()
  scores.reverse()
  return scores[0:n]

This function uses a Python list comprehension to compare me to every other user in
the dictionary using one of the previously defined similarity functions. It sorts the
results so that the highest scores come first and returns the first n items.
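The same "score everyone, keep the best n" pattern can also be written with `heapq.nlargest` from the standard library, which avoids sorting the full list. This is only a sketch: `top_matches`, `shared_count`, and the `toy` data are hypothetical names introduced here, and `shared_count` is a deliberately trivial stand-in for a real similarity function.

```python
import heapq

def top_matches(prefs, person, similarity, n=5):
    """Return the n highest (score, other) pairs, best first."""
    scores = ((similarity(prefs, person, other), other)
              for other in prefs if other != person)
    return heapq.nlargest(n, scores)

def shared_count(prefs, a, b):
    """Toy similarity: the number of items both people have rated."""
    return len(set(prefs[a]) & set(prefs[b]))

# Made-up data, used only to exercise the function
toy = {'ann': {'x': 1, 'y': 2, 'z': 3},
       'ben': {'x': 2, 'y': 1},
       'cat': {'x': 5},
       'dan': {'q': 1}}

print(top_matches(toy, 'ann', shared_count, n=2))
```

Because the tuples are compared by score first and name second, ties are broken alphabetically by name, just as they are when a list of tuples is sorted.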

In [12]:
topMatches(critics,'Toby',n=3)

[(0.9912407071619299, 'Lisa Rose'),
 (0.9244734516419049, 'Mick LaSalle'),
 (0.8934051474415647, 'Claudia Puig')]

From this I know that I should be reading reviews by Lisa Rose, as her tastes tend to
be similar to mine. If you’ve seen any of these movies, you can try adding yourself to
the dictionary with your preferences and see who your favorite critic should be.

### Recommending Items

In [14]:
# Gets recommendations for a person by using a weighted average
# of every other user's rankings
def getRecommendations(prefs,person,similarity=sim_pearson):
  totals={}
  simSums={}
  for other in prefs:
    # don't compare me to myself
    if other==person: continue
    sim=similarity(prefs,person,other)

    # ignore scores of zero or lower
    if sim<=0: continue
    for item in prefs[other]:
	    
      # only score movies I haven't seen yet
      if item not in prefs[person] or prefs[person][item]==0:
        # Similarity * Score
        totals.setdefault(item,0)
        totals[item]+=prefs[other][item]*sim
        # Sum of similarities
        simSums.setdefault(item,0)
        simSums[item]+=sim

  # Create the normalized list
  rankings=[(total/simSums[item],item) for item,total in totals.items()]

  # Return the sorted list
  rankings.sort()
  rankings.reverse()
  return rankings

This code loops through every other person in the prefs dictionary. In each case, it
calculates how similar they are to the person specified. It then loops through every
item for which they’ve given a score. The line `totals[item]+=prefs[other][item]*sim`
shows how the final score for an item is calculated: the score for each item is
multiplied by the similarity and these products are all added together. At the end,
the scores are normalized by dividing each of them by the similarity sum, and the
sorted results are returned.
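To see why the normalization matters, here is the weighted average for a single item worked through with made-up numbers. The similarities and ratings below are hypothetical and are not taken from the `critics` data.

```python
# (similarity to the target person, that critic's rating of the movie)
# Hypothetical values chosen only to illustrate the arithmetic.
observations = [(0.9, 3.0), (0.5, 4.0), (0.1, 1.0)]

total = sum(sim * rating for sim, rating in observations)  # weighted sum of ratings
sim_sum = sum(sim for sim, _ in observations)              # sum of the weights
predicted = total / sim_sum                                # about 3.2

plain_mean = sum(rating for _, rating in observations) / len(observations)
print(predicted, plain_mean)
```

The prediction lands closer to the ratings given by the most similar critics than the plain mean does, and dividing by the similarity sum keeps a movie from scoring higher simply because more people reviewed it.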

In [15]:
getRecommendations(critics,'Toby')

[(3.3477895267131013, 'The Night Listener'),
 (2.832549918264162, 'Lady in the Water'),
 (2.5309807037655645, 'Just My Luck')]

### Matching Products

In [16]:
def transformPrefs(prefs):
  result={}
  for person in prefs:
    for item in prefs[person]:
      result.setdefault(item,{})
      
      # Flip item and person
      result[item][person]=prefs[person][item]
  return result

In [18]:
movies = transformPrefs(critics)
topMatches(movies,'Superman Returns')

[(0.6579516949597695, 'You, Me and Dupree'),
 (0.4879500364742689, 'Lady in the Water'),
 (0.11180339887498941, 'Snakes on a Plane'),
 (-0.1798471947990544, 'The Night Listener'),
 (-0.42289003161103106, 'Just My Luck')]

Notice that in this example there are actually some negative correlation scores, which
indicate that those who like Superman Returns tend to dislike Just My Luck.

In [19]:
getRecommendations(movies,'Just My Luck')

[(4.0, 'Michael Phillips'), (3.0, 'Jack Matthews')]

It’s not always clear that flipping people and items will lead to useful results, but in
many cases it will allow you to make interesting comparisons. An online retailer
might collect purchase histories for the purpose of recommending products to individuals.
Reversing the products with the people, as you’ve done here, would allow
them to search for people who might buy certain products. This might be very useful
in planning a marketing effort for a big clearance of certain items. Another potential
use is making sure that new links on a link-recommendation site are seen by the
people who are most likely to enjoy them.

### Item-Based Filtering

The general technique is to precompute the most similar items for each item.
Then, when you wish to make recommendations to a user, you look at their top-rated
items and create a weighted list of the items most similar to those. Although the
first step requires you to examine all the data, comparisons between items will not
change as often as comparisons between users. This means you do not have to
continuously calculate each item’s most similar items; you can do it at low-traffic
times or on a computer separate from your main application.

#### Building the Item Comparison Dataset

To compare items, the first thing you’ll need to do is write a function to build the
complete dataset of similar items. Again, this does not have to be done every time a
recommendation is needed—instead, you build the dataset once and reuse it each
time you need it.

To generate the dataset, use the following function:

In [22]:
def calculateSimilarItems(prefs,n=10):
  # Create a dictionary of items showing which other items they
  # are most similar to.
  result={}
  # Invert the preference matrix to be item-centric
  itemPrefs=transformPrefs(prefs)
  c=0
  for item in itemPrefs:
    # Status updates for large datasets
    c+=1
    if c%100==0: print ("%d / %d" % (c,len(itemPrefs)))
    # Find the most similar items to this one
    scores=topMatches(itemPrefs,item,n=n,similarity=sim_distance)
    result[item]=scores
  return result

This function first inverts the score dictionary using the transformPrefs function,
giving a list of items along with how they were rated by each user. It
then loops over every item and passes the transformed dictionary to the topMatches
function to get the most similar items along with their similarity scores. Finally, it
creates and returns a dictionary of items along with a list of their most similar items.



In [23]:
itemsim = calculateSimilarItems(critics)

In [24]:
itemsim

{'Just My Luck': [(0.2222222222222222, 'Lady in the Water'),
  (0.18181818181818182, 'You, Me and Dupree'),
  (0.15384615384615385, 'The Night Listener'),
  (0.10526315789473684, 'Snakes on a Plane'),
  (0.06451612903225806, 'Superman Returns')],
 'Lady in the Water': [(0.4, 'You, Me and Dupree'),
  (0.2857142857142857, 'The Night Listener'),
  (0.2222222222222222, 'Snakes on a Plane'),
  (0.2222222222222222, 'Just My Luck'),
  (0.09090909090909091, 'Superman Returns')],
 'Snakes on a Plane': [(0.2222222222222222, 'Lady in the Water'),
  (0.18181818181818182, 'The Night Listener'),
  (0.16666666666666666, 'Superman Returns'),
  (0.10526315789473684, 'Just My Luck'),
  (0.05128205128205128, 'You, Me and Dupree')],
 'Superman Returns': [(0.16666666666666666, 'Snakes on a Plane'),
  (0.10256410256410256, 'The Night Listener'),
  (0.09090909090909091, 'Lady in the Water'),
  (0.06451612903225806, 'Just My Luck'),
  (0.05333333333333334, 'You, Me and Dupree')],
 'The Night Listener': [(0.28

Remember, this function only has to be run frequently enough to keep the item similarities
up to date. You will need to do this more often early on when the user base
and number of ratings are small, but as the user base grows, the similarity scores
between items will usually become more stable.
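One way to reuse a precomputed similarity dictionary between runs is to write it to disk. The sketch below uses the standard `json` module; the helper names and the `itemsim.json` file name are assumptions made for illustration, and the `example` dictionary contains only two entries copied from the output above.

```python
import json
import os
import tempfile

def save_item_similarities(itemsim, path):
    """Write the item-similarity dictionary to a JSON file."""
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(itemsim, f)

def load_item_similarities(path):
    """Read the dictionary back, restoring (score, item) tuples."""
    with open(path, encoding='utf-8') as f:
        raw = json.load(f)
    # JSON has no tuple type, so each pair comes back as a two-element list
    return {item: [(score, other) for score, other in matches]
            for item, matches in raw.items()}

example = {'Lady in the Water': [(0.4, 'You, Me and Dupree'),
                                 (0.2857142857142857, 'The Night Listener')]}

path = os.path.join(tempfile.mkdtemp(), 'itemsim.json')  # hypothetical location
save_item_similarities(example, path)
restored = load_item_similarities(path)
print(restored == example)
```

A scheduled job could rebuild and save the file during low-traffic periods while the main application only loads it.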

#### Getting Recommendations

Now you’re ready to give recommendations using the item similarity dictionary without
going through the whole dataset. You’re going to get all the items that the user
has rated, find the similar items, and weight them according to how similar they
are.

In [25]:
def getRecommendedItems(prefs,itemMatch,user):
  userRatings=prefs[user]
  scores={}
  totalSim={}
  # Loop over items rated by this user
  for (item,rating) in userRatings.items():

    # Loop over items similar to this one
    for (similarity,item2) in itemMatch[item]:

      # Ignore if this user has already rated this item
      if item2 in userRatings: continue
      # Weighted sum of rating times similarity
      scores.setdefault(item2,0)
      scores[item2]+=similarity*rating
      # Sum of all the similarities
      totalSim.setdefault(item2,0)
      totalSim[item2]+=similarity

  # Divide each total score by total weighting to get an average
  rankings=[(score/totalSim[item],item) for item,score in scores.items()]

  # Return the rankings from highest to lowest
  rankings.sort()
  rankings.reverse()
  return rankings

In [26]:
getRecommendedItems(critics,itemsim,'Toby')

[(3.182634730538922, 'The Night Listener'),
 (2.5983318700614575, 'Just My Luck'),
 (2.4730878186968837, 'Lady in the Water')]
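The first entry above can be checked by hand. The sketch below copies the ratings for the three movies Toby rated, plus The Night Listener, from the `critics` data, recomputes the distance-based similarities, and forms the weighted average. The `distance_score` helper and the variable names are hypothetical; the helper mirrors the arithmetic of `sim_distance` on item-centric data.

```python
# Item-centric ratings for the four movies involved, copied from critics
ratings = {
    'Snakes on a Plane': {'Lisa Rose': 3.5, 'Gene Seymour': 3.5, 'Michael Phillips': 3.0,
                          'Claudia Puig': 3.5, 'Mick LaSalle': 4.0, 'Jack Matthews': 4.0,
                          'Toby': 4.5},
    'You, Me and Dupree': {'Lisa Rose': 2.5, 'Gene Seymour': 3.5, 'Claudia Puig': 2.5,
                           'Mick LaSalle': 2.0, 'Jack Matthews': 3.5, 'Toby': 1.0},
    'Superman Returns': {'Lisa Rose': 3.5, 'Gene Seymour': 5.0, 'Michael Phillips': 3.5,
                         'Claudia Puig': 4.0, 'Mick LaSalle': 3.0, 'Jack Matthews': 5.0,
                         'Toby': 4.0},
    'The Night Listener': {'Lisa Rose': 3.0, 'Gene Seymour': 3.0, 'Michael Phillips': 4.0,
                           'Claudia Puig': 4.5, 'Mick LaSalle': 3.0, 'Jack Matthews': 3.0},
}

def distance_score(a, b):
    """1 / (1 + sum of squared rating differences over critics who rated both)."""
    shared = [critic for critic in ratings[a] if critic in ratings[b]]
    return 1 / (1 + sum((ratings[a][c] - ratings[b][c]) ** 2 for c in shared))

target = 'The Night Listener'
toby = {movie: r['Toby'] for movie, r in ratings.items() if 'Toby' in r}

# Weight each of Toby's ratings by how similar that movie is to the target
weighted = sum(distance_score(movie, target) * rating for movie, rating in toby.items())
total_sim = sum(distance_score(movie, target) for movie in toby)
predicted = weighted / total_sim
print(predicted)
```

The result agrees with the 3.18 reported for The Night Listener, which shows that only the movies Toby has already rated contribute to the prediction.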