# Recommendation Engines

## Monday 20 November 2016

1. Load and summarise data
2. Calculate the Jaccard Similarity
3. Recommender (User Similarity)
4. Recommender (Item Similarity)

In [1]:
# 1. Load and Summarise the data
import pandas as pd

# read in the brands data
user_brands = pd.read_csv('user_brand.csv')

#look at count of stores
user_brands.Store.value_counts()

# Series of user IDs, note the duplicates
user_ids = user_brands.ID

user_ids

0        80002
1        80002
2        80010
3        80010
4        80010
5        80010
6        80010
7        80010
8        80010
9        80010
10       80010
11       80010
12       80011
13       80011
14       80011
15       80011
16       80011
17       80011
18       80011
19       80011
20       80011
21       80011
22       80011
23       80011
24       80011
25       80011
26       80015
27       80015
28       80015
29       80015
         ...  
23774    91924
23775    91927
23776    91927
23777    91931
23778    91931
23779    91931
23780    91931
23781    91943
23782    91943
23783    91943
23784    91944
23785    91944
23786    91944
23787    91944
23788    91944
23789    91944
23790    91944
23791    91944
23792    91944
23793    91944
23794    91946
23795    91946
23796    91946
23797    91946
23798    91955
23799    91957
23800    91957
23801    91957
23802    91957
23803    91957
Name: ID, dtype: int64

In [3]:
# groupby ID to see what each user likes
user_brands.groupby('ID').Store.value_counts()

ID     Store               
80002  Home Depot              1
       Target                  1
80010  Container Store         1
       Converse                1
       Cuisinart               1
       DKNY                    1
       Express                 1
       Kohl's                  1
       Levi's                  1
       Nordstrom               1
       Old Navy                1
       Puma                    1
80011  BCBGMAXAZRIA            1
       Banana Republic         1
       Calvin Klein            1
       Crate & Barrel          1
       Diesel                  1
       French Connection       1
       Gap                     1
       Guess                   1
       Kenneth Cole            1
       Nine West               1
       Nordstrom               1
       Restoration Hardware    1
       Steve Madden            1
       Target                  1
80015  Banana Republic         1
       Gap                     1
       Home Depot              1
       Target  

In [2]:
user_brands.head()

Unnamed: 0,ID,Store
0,80002,Target
1,80002,Home Depot
2,80010,Levi's
3,80010,Puma
4,80010,Cuisinart


In [4]:
# turns my data frame into a dictionary
# where the key is a user ID, and the value is a 
# list of stores that the user "likes"
brandsfor = {str(k): list(v) for k,v in user_brands.groupby("ID")["Store"]}
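
For intuition, the same grouping can be sketched without pandas. The rows are toy data copied from the `user_brands.head()` output above; `rows` and `brandsfor_toy` are hypothetical names used only for this illustration.

```python
from collections import defaultdict

# Toy (ID, Store) rows copied from the head of the data frame
rows = [
    (80002, 'Target'),
    (80002, 'Home Depot'),
    (80010, "Levi's"),
    (80010, 'Puma'),
    (80010, 'Cuisinart'),
]

# Group stores by user ID, using string keys like the real dictionary
brandsfor_toy = defaultdict(list)
for user_id, store in rows:
    brandsfor_toy[str(user_id)].append(store)

print(dict(brandsfor_toy))
```

Each key ends up mapped to the list of stores seen for that user, which is the shape the dictionary comprehension produces from the grouped data frame.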


In [5]:
# try it out. User 83065 likes Kohl's and Target
brandsfor['83065']

["Kohl's", 'Target']

In [6]:
# User 82983 likes more
brandsfor['82983']

['Hanky Panky',
 'Betsey Johnson',
 'Converse',
 'Steve Madden',
 'Old Navy',
 'Target',
 'Nordstrom']


## 2. Jaccard Similarity ##

The Jaccard Similarity allows us to compare two sets.
If we regard people as merely being a set of brands they prefer,
the Jaccard Similarity allows us to compare people.

Example: the Jaccard similarity between user 82983 and 83065 is .125
            because
            
            brandsfor['83065'] == ["Kohl's", 'Target']
            
            brandsfor['82983'] == ['Hanky Panky', 'Betsey Johnson', 'Converse', 'Steve Madden', 'Old Navy', 'Target', 'Nordstrom']

the intersection of these two sets is just set(['Target'])
the union of the two sets is set(["Kohl's", 'Target', 'Hanky Panky', 'Betsey Johnson', 'Converse', 'Steve Madden', 'Old Navy', 'Nordstrom'])
so len(intersection) / len(union) = 1 / 8 == .125
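
The arithmetic in this example can be checked directly with Python sets. This is a minimal sketch using the two brand lists quoted in the example; the variable names are illustrative only.

```python
user_83065 = ["Kohl's", 'Target']
user_82983 = ['Hanky Panky', 'Betsey Johnson', 'Converse',
              'Steve Madden', 'Old Navy', 'Target', 'Nordstrom']

intersection = set(user_83065) & set(user_82983)  # brands both users like
union = set(user_83065) | set(user_82983)         # brands either user likes

# float() keeps the division exact under Python 2 as well as Python 3
similarity = float(len(intersection)) / len(union)
print(similarity)  # 0.125
```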


EXERCISE: what is the Jaccard Similarity 
          between user 82956 and user 82963?

In [7]:
brandsfor['82956']
brandsfor['82963']

['Puma', 'New Balance', 'Old Navy', 'Target']

In [8]:
'''
EXERCISE: Complete the jaccard function below.
          It should take in two lists of brands, and output the 
          Jaccard similarity between them

This should work with anything in the set, for example
jaccard([1,2,3], [2,3,4,5,6])  == .3333333

HINT: set1 & set2 is the intersection
      set1 | set2 is the union

'''

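def jaccard(first, second):
    # convert both inputs to sets so duplicates are ignored
    first = set(first)
    second = set(second)
    # size of the intersection divided by size of the union
    return float(len(first & second)) / len(first | second)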

In [9]:
# try it out!
print(brandsfor['83065']) # brands for user 83065
print(brandsfor['82983']) # brands for user 82983
jaccard(brandsfor['83065'], brandsfor['82983'])


["Kohl's", 'Target']
['Hanky Panky', 'Betsey Johnson', 'Converse', 'Steve Madden', 'Old Navy', 'Target', 'Nordstrom']


0.125

In [11]:
print(brandsfor['82956']) # brands for user 82956
print(brandsfor['82963']) # brands for user 82963
jaccard(brandsfor['82956'], brandsfor['82963'])

['Diesel', 'Old Navy', 'Crate & Barrel', 'Target']
['Puma', 'New Balance', 'Old Navy', 'Target']


0.3333333333333333

## 3. Recommender (User Similarity)

Our recommender will be a modified KNN collaborative algorithm.
Input: A given user's brands that they like
Output: A set (no repeats) of brand recommendations based on
        similar users' preferences

1. When a user's brands are given to us, we will calculate the input user's
jaccard similarity with every person in our brandsfor dictionary

2. We will pick the K most similar users and recommend
the brands that they like that the given user doesn't know about

EXAMPLE:
Given User likes ['Target', 'Old Navy', 'Banana Republic', 'H&M']
Outputs: ['Forever 21', 'Gap', 'Steve Madden']
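
The two steps described here can be sketched as a single function. This is a minimal illustration on made-up data, not the notebook's dataset: `recommend`, `toy_brandsfor`, and the user IDs 'a' through 'c' are hypothetical, and `jaccard` is repeated so the block runs on its own.

```python
def jaccard(first, second):
    first, second = set(first), set(second)
    return float(len(first & second)) / len(first | second)

def recommend(given_brands, brandsfor, k):
    """Return brands liked by the k most similar users, minus known ones."""
    # Step 1: similarity between the given user and every stored user
    similarities = {user: jaccard(given_brands, brands)
                    for user, brands in brandsfor.items()}
    # Step 2: keep the k most similar users and pool their brands
    nearest = sorted(similarities, key=similarities.get, reverse=True)[:k]
    pooled = set()
    for user in nearest:
        pooled.update(brandsfor[user])
    # Drop the brands the given user already likes
    return pooled - set(given_brands)

# Hypothetical toy data, for illustration only
toy_brandsfor = {
    'a': ['Target', 'Old Navy', 'Gap'],
    'b': ['Target', 'Banana Republic', 'Forever 21'],
    'c': ['Home Depot'],
}
result = recommend(['Target', 'Old Navy', 'Banana Republic', 'H&M'],
                   toy_brandsfor, 2)
print(result)
```

With this toy data, users 'a' and 'b' are the two nearest neighbours, so the recommendation is the pair of brands they like that the given user does not.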

In [13]:
given_user = ['Target', 'Old Navy', 'Banana Republic', 'H&M']

# similarity between user 83065 and given user
print(brandsfor['83065'])
jaccard(brandsfor['83065'], given_user)

["Kohl's", 'Target']


0.2

#### EXERCISE:

Find the similarity between given_user and ALL of our users.
The output should be a dictionary where
the key is a user ID and the value is the Jaccard similarity.



{...
 '83055': 0.25,
 '83056': 0.0,
 '83058': 0.1111111111111111,
 '83060': 0.07894736842105263,
 '83061': 0.4,
 '83064': 0.25,
 '83065': 0.2,
 ...}

In [28]:
K = 10  # number of similar users to look at

# .items() works in both Python 2 and Python 3
similarities = {k: jaccard(given_user, v) for k, v in brandsfor.items()}
similarities  # dictionary mapping user ID -> Jaccard similarity

{'80050': 0.10526315789473684,
 '81402': 0.3333333333333333,
 '84916': 0.0,
 '86360': 0.2222222222222222,
 '84914': 0.125,
 '84913': 0.0,
 '86367': 0.16666666666666666,
 '89376': 0.2,
 '86365': 0.18181818181818182,
 '89378': 0.0,
 '88710': 0.1111111111111111,
 '86369': 0.0,
 '84919': 0.2857142857142857,
 '89176': 0.2,
 '89173': 0.06818181818181818,
 '89170': 0.0,
 '89171': 0.0,
 '82198': 0.06666666666666667,
 '87969': 0.0,
 '82448': 0.18181818181818182,
 '82443': 0.17647058823529413,
 '80151': 0.25,
 '82440': 0.07692307692307693,
 '82446': 0.0,
 '82197': 0.23076923076923078,
 '82196': 0.16666666666666666,
 '85450': 0.0,
 '88625': 0.15789473684210525,
 '88155': 0.23076923076923078,
 '88154': 0.2,
 '88153': 0.14285714285714285,
 '88152': 0.2,
 '90246': 0.1111111111111111,
 '88628': 0.125,
 '85047': 0.0,
 '88395': 0.07142857142857142,
 '84862': 0.0,
 '88397': 0.25,
 '82992': 0.1111111111111111,
 '82995': 0.3333333333333333,
 '88390': 0.0,
 '82997': 0.2222222222222222,
 '82996': 0.25,
 '81

In [33]:
# Now for the top K most similar users, let's aggregate the brands they like.
# I sort by the Jaccard similarity so the most similar users come first.
# I use the sorted function, but because I'm sorting a dictionary
# I pass similarities.get as the "key" argument:
# each user ID is then ordered by its similarity value,
# and reverse=True puts the most similar users on top
most_similar_users = sorted(similarities, key=similarities.get, reverse=True)[:K]


In [34]:
# list of K similar users' IDs
most_similar_users

['81012',
 '84807',
 '88549',
 '82970',
 '91362',
 '81438',
 '82838',
 '90225',
 '90547',
 '85664']
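
When there are many users, sorting all of them just to keep K does more work than necessary. One alternative, offered as a sketch rather than as what this notebook does, is `heapq.nlargest` from the standard library. The similarity values are a handful copied from the dictionary output above; `sample` and `top_two` are illustrative names.

```python
import heapq

# A few entries copied from the similarities output
sample = {
    '81402': 0.3333333333333333,
    '84916': 0.0,
    '86360': 0.2222222222222222,
    '84914': 0.125,
    '84919': 0.2857142857142857,
}

# Keys of the two largest values, most similar first
top_two = heapq.nlargest(2, sample, key=sample.get)
print(top_two)  # ['81402', '84919']
```

The result matches `sorted(sample, key=sample.get, reverse=True)[:2]`.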

In [21]:
# let's see what some of the most similar users like
print(brandsfor[most_similar_users[0]])
print(brandsfor[most_similar_users[1]])
print(brandsfor[most_similar_users[2]])
print(brandsfor[most_similar_users[3]])
print(brandsfor[most_similar_users[4]])

['Banana Republic', 'Old Navy', 'Target']
['Steve Madden', 'Banana Republic', 'Old Navy', 'Target']
['Banana Republic', 'Old Navy', 'Forever 21', 'Target']
['Banana Republic', 'Gap', 'Old Navy', 'Target']
['Banana Republic', 'Gap', 'Old Navy', 'Target']


In [24]:
# Aggregate all brands liked by the K most similar users into a single set
brands_to_recommend = set()
for user in most_similar_users:
    # for each user
    brands_to_recommend.update(set(brandsfor[user]))
    # add to the set of brands_to_recommend
    
    
brands_to_recommend


{'Banana Republic',
 'Forever 21',
 'Gap',
 'Home Depot',
 "Kohl's",
 'Old Navy',
 'Steve Madden',
 'Target'}

In [35]:
# EXERCISE: use a set difference so brands_to_recommend only has
# brands that given_user hasn't seen yet
brands_to_recommend = brands_to_recommend - set(given_user)
brands_to_recommend

{'Forever 21', 'Gap', 'Home Depot', "Kohl's", 'Steve Madden'}

In [36]:
import collections

# We can take this one step further and calculate a recommendation "score".
# We will define the score as the number of times
# a brand appears among the K most similar users
brands_to_recommend = []
for user in most_similar_users:
    brands_to_recommend += list(set(brandsfor[user]) - set(given_user))

# Use a counter to count the number of times a brand appears
recommend_with_scores = collections.Counter(brands_to_recommend)

# Now we see Gap has the highest score!
recommend_with_scores


Counter({'Forever 21': 1,
         'Gap': 2,
         'Home Depot': 1,
         "Kohl's": 1,
         'Steve Madden': 1})
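
To turn these scores into a ranked list, `Counter.most_common` returns (brand, count) pairs in descending order of count. The sketch below rebuilds the counter from the counts printed above so that it runs on its own; `scores`, `best_brand`, and `best_score` are illustrative names.

```python
import collections

# Counts copied from the Counter output
scores = collections.Counter({'Forever 21': 1, 'Gap': 2, 'Home Depot': 1,
                              "Kohl's": 1, 'Steve Madden': 1})

# most_common(1) gives a one-element list holding the top (brand, count) pair
best_brand, best_score = scores.most_common(1)[0]
print('%s: %d' % (best_brand, best_score))  # Gap: 2
```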

## 4. Recommender (Item Similarity)

In [None]:
'''
We can also define a similarity between items using Jaccard similarity.
We can say that the similarity between two items is the Jaccard similarity
between the sets of people who like the two brands.

Example: similarity of Gap to Old Navy is:
'''

In [40]:
# sets of user IDs who like Gap and who like Old Navy
gap_lovers = set(user_brands[user_brands.Store == 'Gap'].ID)
old_navy_lovers = set(user_brands[user_brands.Store == 'Old Navy'].ID)
old_navy_lovers

{81921,
 86020,
 86022,
 81927,
 91655,
 90121,
 88076,
 86029,
 86031,
 89473,
 90155,
 89064,
 90136,
 91451,
 90986,
 83995,
 81950,
 84000,
 90148,
 84657,
 84973,
 88451,
 90157,
 90158,
 86365,
 84022,
 81976,
 90169,
 90170,
 81982,
 88127,
 89440,
 88130,
 90179,
 84036,
 81989,
 90225,
 90185,
 81996,
 90125,
 86096,
 90193,
 86098,
 82003,
 88148,
 82006,
 80569,
 82009,
 82010,
 88155,
 90207,
 90976,
 86116,
 84071,
 85692,
 87021,
 82029,
 82031,
 82032,
 86129,
 89309,
 91454,
 86132,
 84088,
 82044,
 88189,
 86145,
 89793,
 90249,
 80010,
 82967,
 86156,
 87746,
 82064,
 84113,
 91055,
 90259,
 86165,
 90263,
 91737,
 82073,
 88218,
 84123,
 82077,
 80032,
 88091,
 88982,
 82086,
 86184,
 82089,
 86186,
 86868,
 80046,
 86200,
 86203,
 82109,
 86206,
 90303,
 80065,
 82114,
 86211,
 82116,
 88262,
 90315,
 84172,
 90317,
 90829,
 88273,
 80082,
 80084,
 88277,
 86230,
 86231,
 85028,
 88525,
 84187,
 88164,
 88286,
 91173,
 88288,
 86395,
 80100,
 84200,
 86251,
 90348,


In [38]:
# similarity between Gap and Old Navy
jaccard(gap_lovers, old_navy_lovers)

0.35437212360289283

In [41]:
guess_lovers = set(user_brands[user_brands.Store == 'Guess'].ID)
# similarity between Gap and Guess
jaccard(guess_lovers, gap_lovers)


0.21257750221434898

In [42]:
calvin_lovers = set(user_brands[user_brands.Store == 'Calvin Klein'].ID)
# similarity between Gap and Calvin Klein
jaccard(calvin_lovers, gap_lovers)


0.2068654019873532
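
The same comparison can be made for any pair of brands by inverting a user-to-brands dictionary into a brand-to-users mapping. This is a sketch on invented data: `toy_brandsfor`, `fans_of`, and `item_similarity` are hypothetical names, the users 'u1' through 'u4' are made up, and `jaccard` is repeated so the block is self-contained.

```python
from collections import defaultdict

def jaccard(first, second):
    first, second = set(first), set(second)
    return float(len(first & second)) / len(first | second)

# Hypothetical toy data, for illustration only
toy_brandsfor = {
    'u1': ['Gap', 'Old Navy'],
    'u2': ['Gap', 'Old Navy', 'Guess'],
    'u3': ['Gap'],
    'u4': ['Guess'],
}

# Invert the mapping: brand -> set of users who like it
fans_of = defaultdict(set)
for user, brands in toy_brandsfor.items():
    for brand in brands:
        fans_of[brand].add(user)

def item_similarity(brand_a, brand_b):
    """Jaccard similarity between the sets of users who like each brand."""
    return jaccard(fans_of[brand_a], fans_of[brand_b])

gap_old_navy = item_similarity('Gap', 'Old Navy')
gap_guess = item_similarity('Gap', 'Guess')
print(gap_old_navy)
print(gap_guess)  # 0.25
```

In the toy data, Gap and Old Navy share two of three fans while Gap and Guess share one of four, so Old Navy comes out as the closer match to Gap, mirroring the ordering seen in the real results.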