# Collecting Preferences Exercise

The first thing you need is a way to represent different people and their preferences. In Python, a very simple way to do this is to use a nested dictionary.

In [2]:
# A dictionary of movie critics and their ratings of a small
# set of movies
critics={'Lisa Rose': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.5,
'Just My Luck': 3.0, 'Superman Returns': 3.5, 'You, Me and Dupree': 2.5,
'The Night Listener': 3.0},
'Gene Seymour': {'Lady in the Water': 3.0, 'Snakes on a Plane': 3.5,
'Just My Luck': 1.5, 'Superman Returns': 5.0, 'The Night Listener': 3.0,
'You, Me and Dupree': 3.5},
'Michael Phillips': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.0,
'Superman Returns': 3.5, 'The Night Listener': 4.0},
'Claudia Puig': {'Snakes on a Plane': 3.5, 'Just My Luck': 3.0,
'The Night Listener': 4.5, 'Superman Returns': 4.0,
'You, Me and Dupree': 2.5},
'Mick LaSalle': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0,
'Just My Luck': 2.0, 'Superman Returns': 3.0, 'The Night Listener': 3.0,
'You, Me and Dupree': 2.0},
'Jack Matthews': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0,
'The Night Listener': 3.0, 'Superman Returns': 5.0, 'You, Me and Dupree': 3.5},
'Toby': {'Snakes on a Plane':4.5,'You, Me and Dupree':1.0,'Superman Returns':4.0}}

This dictionary uses a rating from 1 to 5 to express how much each of
these movie critics (and I) liked a given movie.

In [3]:
critics['Lisa Rose']['Lady in the Water']

2.5

In [4]:
critics['Toby']['Snakes on a Plane']=4.5

In [5]:
critics['Toby']

{'Snakes on a Plane': 4.5, 'Superman Returns': 4.0, 'You, Me and Dupree': 1.0}
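
Because the ratings live in a nested dictionary, finding the movies that two people have both rated comes down to intersecting their keys. The following is a minimal sketch using a two-person subset of the data above; the helper name `shared_items` is hypothetical and not part of the original code.

```python
# Two-person subset of the critics dictionary above.
critics_subset = {
    'Michael Phillips': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.0,
                         'Superman Returns': 3.5, 'The Night Listener': 4.0},
    'Toby': {'Snakes on a Plane': 4.5, 'You, Me and Dupree': 1.0,
             'Superman Returns': 4.0},
}

def shared_items(prefs, person1, person2):
    """Return the sorted list of items rated by both people (hypothetical helper)."""
    # Dictionary key views support set intersection with &.
    return sorted(prefs[person1].keys() & prefs[person2].keys())

print(shared_items(critics_subset, 'Michael Phillips', 'Toby'))
# ['Snakes on a Plane', 'Superman Returns']
```

Only these shared items can be compared directly, which is why the similarity functions in the next section start by collecting them.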

## Finding Similar Users

After collecting data about things people like, you need a way to determine how similar people are in their tastes. You do this by comparing each person with every other person and calculating a similarity score.

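One simple score is based on Euclidean distance: treat each person's ratings of the shared items as coordinates. To calculate the distance, take the difference along each axis, square each difference, add the squares together, and then take the square root of the sum.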
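
Written as a formula, this might look as follows. The symbols are introduced here for illustration only: $x_i$ and $y_i$ stand for the two people's ratings of the $i$-th shared item, and $n$ is the number of shared items.

$$
d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
$$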

In [6]:
from math import sqrt

In [7]:
sqrt(pow(5-4, 2) + pow(4-1, 2))

3.1622776601683795

This formula calculates the distance, which will be smaller for people who are more similar. However, you need a function that gives higher values for people who are similar. This can be done by adding 1 to the distance (so that a distance of 0 does not cause a division by zero) and taking the reciprocal.

In [8]:
1/(1+sqrt(pow(5-4, 2)+pow(4-1, 2)))

0.2402530733520421

This new function always returns a value between 0 and 1, where a value of 1 means that two people have identical preferences.
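
The effect of this transformation is easy to check numerically. The following is a minimal sketch; `distance_to_similarity` is a hypothetical helper name, not part of the original code, and the sample distances are chosen only for illustration.

```python
from math import sqrt

def distance_to_similarity(distance):
    """Map a non-negative distance to a score in the range (0, 1]."""
    return 1 / (1 + distance)

# A distance of 0 gives exactly 1; larger distances approach 0.
for d in (0.0, 1.0, sqrt(10), 100.0):
    print(f"distance={d:.3f} -> similarity={distance_to_similarity(d):.3f}")
# distance=0.000 -> similarity=1.000
# distance=1.000 -> similarity=0.500
# distance=3.162 -> similarity=0.240
# distance=100.000 -> similarity=0.010
```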

In [10]:
from math import sqrt

# Returns a distance-based similarity score for person1 and person2
def sim_distance(prefs, person1, person2):
    # Collect the items that both people have rated
    shared_items = [item for item in prefs[person1] if item in prefs[person2]]

    # If they have no ratings in common, return 0
    if not shared_items:
        return 0

    # Add up the squares of all the differences
    sum_of_squares = sum((prefs[person1][item] - prefs[person2][item]) ** 2
                         for item in shared_items)

    # Note: this uses the sum of squares directly, without the square root
    # shown in the formula above, so its values differ from 1/(1 + distance)
    return 1 / (1 + sum_of_squares)


In [11]:
sim_distance(critics, 'Lisa Rose', 'Gene Seymour')

0.14814814814814814

In [12]:
sim_distance(critics, 'Lisa Rose', 'Michael Phillips')

0.4444444444444444

## Pearson Correlation Score

A slightly more sophisticated way to determine the similarity between people's interests is to use a Pearson correlation coefficient. The correlation coefficient is a measure of how well two sets of data fit on a straight line.
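
For reference, the coefficient can be written in the computational form that `sim_pearson` follows. The symbols are introduced here for illustration: $x_i$ and $y_i$ are the two people's ratings of the $i$-th shared item, and $n$ is the number of shared items.

$$
r = \frac{\sum_{i} x_i y_i - \dfrac{\sum_{i} x_i \sum_{i} y_i}{n}}
         {\sqrt{\left(\sum_{i} x_i^2 - \dfrac{\left(\sum_{i} x_i\right)^2}{n}\right)
                \left(\sum_{i} y_i^2 - \dfrac{\left(\sum_{i} y_i\right)^2}{n}\right)}}
$$

The result lies between -1 and 1: a value of 1 means the ratings fall exactly on a rising straight line, and a value of -1 means they fall exactly on a falling one.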

In [13]:
# Returns the Pearson correlation coefficient for p1 and p2
def sim_pearson(prefs, p1, p2):
    # Get the list of mutually rated items
    si = {}
    for item in prefs[p1]:
        if item in prefs[p2]: si[item]=1
            
    # Find the number of elements
    n = len(si)
    
    # if there are no ratings in common, return 0
    if n == 0: return 0
    
    # Add up all the preferences
    sum1 = sum([prefs[p1][it] for it in si])
    sum2 = sum([prefs[p2][it] for it in si])
    
    # Sum up the squares
    sum1Sq = sum([pow(prefs[p1][it], 2) for it in si])
    sum2Sq = sum([pow(prefs[p2][it], 2) for it in si])
    
    # Sum up the products
    pSum = sum([prefs[p1][it] * prefs[p2][it] for it in si])
    
    # Calculate Pearson score
    num = pSum - (sum1 * sum2/n)
    den = sqrt((sum1Sq-pow(sum1,2)/n) * (sum2Sq-pow(sum2, 2)/n))
    if den == 0: return 0
    
    r = num/den
    
    return r

In [14]:
sim_pearson(critics, 'Lisa Rose', 'Gene Seymour')

0.39605901719066977

## Ranking the Critics

Now that you have functions for comparing two people, you can create a function that scores everyone against a given person and finds the closest matches.

In [15]:
# Returns the best matches for person from the prefs dictionary.
# Number of results and similarity function are optional params.
def topMatches(prefs, person, n=5, similarity=sim_pearson):
    scores = [(similarity(prefs, person, other), other)
                    for other in prefs if other != person]
    
    # Sort the list so the highest scores appear at the top
    scores.sort()
    scores.reverse()
    return scores[0:n]

This function uses a Python list comprehension to compare the given person to every other user in the dictionary using one of the previously defined similarity metrics (Pearson correlation by default). Then it returns the first n items of the sorted results.
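
When only the top few matches are needed, the same ranking can be sketched with `heapq.nlargest` from the standard library instead of sorting the whole list. This is an alternative sketch, not the original implementation: `top_matches_sketch`, `shared_count`, and `toy_prefs` are hypothetical names and data introduced only for illustration.

```python
import heapq

def top_matches_sketch(prefs, person, similarity, n=5):
    """Return the n highest (score, other) pairs for person."""
    scores = ((similarity(prefs, person, other), other)
              for other in prefs if other != person)
    # nlargest returns its results ordered from highest to lowest.
    return heapq.nlargest(n, scores)

# Hypothetical similarity for illustration: the number of shared items.
def shared_count(prefs, p1, p2):
    return len(prefs[p1].keys() & prefs[p2].keys())

# Made-up toy data, not taken from the critics dictionary.
toy_prefs = {
    'A': {'x': 1.0, 'y': 2.0, 'z': 3.0},
    'B': {'x': 2.0, 'y': 2.5, 'z': 1.0},
    'C': {'x': 4.0},
    'D': {'y': 3.0, 'z': 5.0},
}

print(top_matches_sketch(toy_prefs, 'A', shared_count, n=2))
# [(3, 'B'), (2, 'D')]
```

Any function with the signature `similarity(prefs, person1, person2)` can be passed in, which is the same design that lets `topMatches` switch between scoring functions.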

In [16]:
topMatches(critics, 'Toby', n=3)

[(0.9912407071619299, 'Lisa Rose'),
 (0.9244734516419049, 'Mick LaSalle'),
 (0.8934051474415647, 'Claudia Puig')]

## Recommending Items

In [17]:
# Get recommendations for a person by using a weighted average
# of every other user's ratings
def getRecommendations(prefs, person, similarity=sim_pearson):
    totals = {}
    simSums = {}
    for other in prefs:
        # don't compare to myself
        if other == person: continue
        sim = similarity(prefs, person, other)
        
        # ignore scores of zero or lower
        if sim <= 0: continue
        for item in prefs[other]:
            
            # only score movies I haven't seen yet
            if item not in prefs[person] or prefs[person][item] == 0:
                # Similarity * Score
                totals.setdefault(item, 0)
                totals[item] += prefs[other][item] * sim
                
                # Sum of similarities
                simSums.setdefault(item, 0)
                simSums[item] += sim
                
    # Create the normalized list
    rankings = [(total/simSums[item], item) for item, total in totals.items()]
    
    # Return the sorted list
    rankings.sort()
    rankings.reverse()
    return rankings

Now you can find out what movies I should watch next:

In [18]:
getRecommendations(critics, 'Toby')

[(3.3477895267131013, 'The Night Listener'),
 (2.8325499182641614, 'Lady in the Water'),
 (2.5309807037655645, 'Just My Luck')]
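
The weighted average behind each of these scores can be traced by hand for a single movie. The sketch below uses made-up similarity and rating values (not taken from the critics data) to show how critics who are more similar to me pull the prediction toward their own rating.

```python
# Hypothetical (similarity, rating) pairs from three critics for one movie.
votes = [(0.9, 3.0), (0.5, 4.0), (0.1, 1.0)]

# Each rating is weighted by how similar that critic is to me.
weighted_total = sum(sim * rating for sim, rating in votes)
similarity_sum = sum(sim for sim, _ in votes)

# Dividing by the sum of similarities keeps the result on the rating scale.
predicted = weighted_total / similarity_sum

# Unweighted mean for comparison.
plain_mean = sum(rating for _, rating in votes) / len(votes)

print(round(predicted, 3), round(plain_mean, 3))
# 3.2 2.667
```

Dividing by the sum of similarities, rather than by the number of critics, also means a movie does not score higher simply because more people reviewed it.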

## Matching Products

You can determine similarity by looking at who liked a particular item and seeing the other things they liked. This is actually the same method we used earlier to determine similarity between people; you just need to swap the people and the items.

In [19]:
def transformPrefs(prefs):
    result = {}
    for person in prefs:
        for item in prefs[person]:
            result.setdefault(item, {})
            
            # Flip item and person
            result[item][person] = prefs[person][item]
    return result
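
To see what the flip does, it helps to run it on a dictionary small enough to read at a glance. The sketch below is an equivalent variant written with `collections.defaultdict`; the name `transform_prefs_sketch` is hypothetical, and `toy_prefs` is a two-person subset of the critics data.

```python
from collections import defaultdict

def transform_prefs_sketch(prefs):
    """Swap the outer and inner keys of a nested dictionary."""
    result = defaultdict(dict)
    for person, ratings in prefs.items():
        for item, rating in ratings.items():
            # Flip item and person
            result[item][person] = rating
    return dict(result)

# Two-person subset of the critics dictionary.
toy_prefs = {
    'Lisa Rose': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.5},
    'Toby': {'Snakes on a Plane': 4.5},
}

print(transform_prefs_sketch(toy_prefs))
# {'Lady in the Water': {'Lisa Rose': 2.5}, 'Snakes on a Plane': {'Lisa Rose': 3.5, 'Toby': 4.5}}
```

Applying the transformation twice returns the original dictionary, since each rating is simply filed under the opposite key order.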

In [20]:
movies = transformPrefs(critics)

In [21]:
topMatches(movies, 'Superman Returns')

[(0.6579516949597695, 'You, Me and Dupree'),
 (0.4879500364742689, 'Lady in the Water'),
 (0.11180339887498941, 'Snakes on a Plane'),
 (-0.1798471947990544, 'The Night Listener'),
 (-0.42289003161103106, 'Just My Luck')]