<h2>2.1 Problem 1</h2>

Load the MovieLens 100k dataset (ml-100k.zip) into Python using Pandas dataframes. Build a user profile on unscaled data for each of users 200 and 15, and calculate the cosine similarity and distance between each user's preferences and item/movie 95. Which user would a recommender system suggest this movie to?


In [1]:
import pickle
import pandas as pd
import numpy as np
from scipy.spatial.distance import cosine

ratings = pd.read_table('data/ml-100k/u.data', names=['user_id', 'item_id', 'rating', 'timestamp'])
users = pd.read_table('data/ml-100k/u.user', delimiter='|', names=["user_id", "age", "gender", "occupation", "zip_code"])
item_names = """movie id | movie title | release date | video release date |
              IMDb URL | unknown | Action | Adventure | Animation |
              Children's | Comedy | Crime | Documentary | Drama | Fantasy |
              Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi |
              Thriller | War | Western""".split('|')

item_names = [item.strip().replace(' ', '_') for item in item_names]
items = pd.read_table('data/ml-100k/u.item', delimiter='|', names=item_names, encoding='latin')

In [2]:
ratings.head(5)

,user_id,item_id,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


In [3]:
users.head(5)

,user_id,age,gender,occupation,zip_code
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213


In [4]:
items.head(5)

,movie_id,movie_title,release_date,video_release_date,IMDb_URL,unknown,Action,Adventure,Animation,Children's,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
1,2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,1,1,0,0,...,0,0,0,0,0,0,0,1,0,0
2,3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


In [5]:
all_profiles = users.join(ratings.set_index('user_id'), on='user_id', how='inner') \
                    .join(items.set_index('movie_id'), on='item_id', how='inner')
all_profiles.columns = ['movie_id' if col == 'item_id' else col for col in all_profiles.columns]
all_profiles.head(5)

,user_id,age,gender,occupation,zip_code,movie_id,rating,timestamp,movie_title,release_date,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,24,M,technician,85711,61,4,878542420,Three Colors: White (1994),01-Jan-1994,...,0,0,0,0,0,0,0,0,0,0
12,13,47,M,educator,29206,61,4,882140552,Three Colors: White (1994),01-Jan-1994,...,0,0,0,0,0,0,0,0,0,0
17,18,35,F,other,37212,61,4,880130803,Three Colors: White (1994),01-Jan-1994,...,0,0,0,0,0,0,0,0,0,0
57,58,27,M,programmer,52246,61,5,884305271,Three Colors: White (1994),01-Jan-1994,...,0,0,0,0,0,0,0,0,0,0
58,59,49,M,educator,8403,61,4,888204597,Three Colors: White (1994),01-Jan-1994,...,0,0,0,0,0,0,0,0,0,0


In [6]:
import os
os.makedirs('pickles', exist_ok=True)  # to_pickle does not create missing directories
for df, fname in zip([ratings, users, items, all_profiles], ['ratings','users','items','profiles']):
    df.to_pickle('pickles/' + fname + '.pickle')

In [7]:
profile_ids = [15, 200]
noted_profiles = all_profiles[all_profiles['user_id'].isin(profile_ids)].sort_index()

In [8]:
# .ix was removed from pandas; .loc does the same label-based slice of the genre columns
profile_dict = {user: data.loc[:, 'unknown':].values
                for user, data in noted_profiles.groupby('user_id')}
profile_dict

{15: array([[0, 0, 0, ..., 1, 0, 0],
        [0, 0, 0, ..., 1, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ..., 
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 1, 1, ..., 0, 0, 0]]), 200: array([[0, 0, 1, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 1, 0],
        [0, 1, 0, ..., 0, 1, 0],
        ..., 
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 1, 0, ..., 0, 0, 0]])}

In [9]:
movie_id = 95
movie_vector = items[items['movie_id'] == movie_id].loc[:, 'unknown':].values[0]

In [10]:
cosine_sims = {user: [1 - cosine(movie_vector, doc) for doc in docs] for user, docs in profile_dict.items()}
for user, sims in cosine_sims.items():
    print('\ncosine similarities for user: ', user,
          '\n\n', sims)


cosine similarities for user:  200 

 [0.28867513459481287, 0.0, 0.0, 0.35355339059327373, 0.0, 0.86602540378443871, 0.0, 0.40824829046386313, 0.0, 0.0, 0.0, 0.35355339059327373, 0.0, 0.86602540378443871, 0.75, 0.0, 0.35355339059327373, 0.0, 0.0, 0.86602540378443871, 0.28867513459481287, 0.75, 0.35355339059327373, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 0.35355339059327373, 0.0, 0.35355339059327373, 0.0, 0.35355339059327373, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.28867513459481287, 0.0, 0.0, 0.35355339059327373, 0.0, 0.0, 0.35355339059327373, 0.0, 0.0, 0.35355339059327373, 0.0, 0.57735026918962584, 0.0, 0.0, 0.0, 0.22360679774997894, 0.35355339059327373, 0.0, 0.5, 0.0, 0.75, 0.0, 0.35355339059327373, 0.35355339059327373, 0.0, 0.67082039324993692, 0.86602540378443871, 0.0, 0.86602540378443871, 0.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.35355339059327373, 0.0, 0.0, 0.70710678118654746, 0.0, 0.0, 0.0, 0.0, 0.86602540378443871, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.70710678118654746, 0.0, 0.0,

In [11]:
means = {user: np.mean(scores) for user, scores in cosine_sims.items()}
means

{15: 0.1386861741099753, 200: 0.18414358539047271}
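
SciPy's `cosine` returns a distance, which is why each score above is computed as `1 - cosine(...)`. The minimal sketch below makes that relationship explicit on two made-up binary genre vectors. The vectors and the helper name `cosine_similarity` are illustrative assumptions, not values from the dataset.

```python
import numpy as np
from scipy.spatial.distance import cosine

def cosine_similarity(a, b):
    """Cosine similarity computed directly from the dot product and norms."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical genre flags: the movie has two genres, the rated film shares one.
movie = [1, 1, 0, 0]
rated = [1, 0, 0, 0]

similarity = cosine_similarity(movie, rated)
distance = cosine(movie, rated)  # SciPy returns 1 - similarity

print(similarity, distance)
```

Both quantities carry the same information, so reporting either one answers the "similarity and distance" part of the question. Neither is defined when one of the vectors is all zeros.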

<h3>Conclusion</h3>

User 200 would most likely be recommended movie 95. I computed the cosine similarity between the movie's genre vector and the genre vector of every movie each user rated, then took the mean of those scores per user. User 200's average cosine similarity was 0.18414358539047271, while user 15's was 0.1386861741099753.
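
Another reading of "build a user profile" is to collapse each user's rated movies into a single vector before comparing it with the movie. The sketch below sums unscaled genre vectors (an assumption about what "unscaled" means here) for two hypothetical users and reports both cosine similarity and distance. The data are toy values, so it does not reproduce the numbers above.

```python
import numpy as np
from scipy.spatial.distance import cosine

# Hypothetical genre matrices: one row per rated movie, one column per genre.
rated_genres = {
    'user_a': np.array([[1, 0, 0], [1, 1, 0], [0, 1, 0]]),
    'user_b': np.array([[0, 0, 1], [0, 1, 1]]),
}
movie_vector = np.array([1, 1, 0])

results = {}
for user, genres in rated_genres.items():
    profile = genres.sum(axis=0)          # unscaled genre counts
    dist = cosine(profile, movie_vector)  # cosine distance
    results[user] = {'similarity': 1 - dist, 'distance': dist}

# The user whose profile points most nearly in the movie's direction.
best_user = max(results, key=lambda u: results[u]['similarity'])
print(results, best_user)
```

With a single profile vector per user, the comparison yields one similarity per user rather than a list that has to be averaged.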


<h2>2.2 Problem 2</h2>

Load the MovieLens 100k dataset (ml-100k.zip) into Python using Pandas dataframes. Convert the ratings data into a utility matrix representation, and find the 10 most similar users for user 1 based on cosine similarity of the user ratings data. Based on the average of the ratings for item 508 from the similar users, what is the expected rating for this item for user 1?

In [12]:
# pd.read_pickle is the counterpart of DataFrame.to_pickle
ratings = pd.read_pickle('pickles/ratings.pickle')
users = pd.read_pickle('pickles/users.pickle')
items = pd.read_pickle('pickles/items.pickle')
profiles = pd.read_pickle('pickles/profiles.pickle')

In [13]:
ratings.head(5)

,user_id,item_id,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


In [14]:
pivoted_ratings = ratings.pivot(index='user_id', columns='item_id', values='rating')
pivoted_ratings.to_pickle('pickles/pivoted_ratings.pickle')
pivoted_ratings.head(5)

user_id \ item_id,1,2,3,4,5,6,7,8,9,10,...,1673,1674,1675,1676,1677,1678,1679,1680,1681,1682
1,5.0,3.0,4.0,3.0,3.0,5.0,4.0,1.0,5.0,3.0,...,,,,,,,,,,
2,4.0,,,,,,,,,2.0,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,4.0,3.0,,,,,,,,,...,,,,,,,,,,


In [15]:
util_matrix = pivoted_ratings.to_numpy()  # as_matrix() was removed from pandas
util_matrix

array([[  5.,   3.,   4., ...,  nan,  nan,  nan],
       [  4.,  nan,  nan, ...,  nan,  nan,  nan],
       [ nan,  nan,  nan, ...,  nan,  nan,  nan],
       ..., 
       [  5.,  nan,  nan, ...,  nan,  nan,  nan],
       [ nan,  nan,  nan, ...,  nan,  nan,  nan],
       [ nan,   5.,  nan, ...,  nan,  nan,  nan]])

The NaN values need to be handled before computing cosine similarity. I fill each user's missing ratings with zeros and subtract each user's average rating (computed over the items they actually rated) from their row. Because the fill happens before the subtraction, unrated items end up at the negative of the user's mean rather than at zero. With mean-centered ratings, users with opposite tastes have vectors pointing in opposite directions, which the cosine distance reflects.

In [16]:
mean_normed = pivoted_ratings.fillna(0).sub(pivoted_ratings.mean(axis=1), axis=0)
mean_normed.head(5)

user_id \ item_id,1,2,3,4,5,6,7,8,9,10,...,1673,1674,1675,1676,1677,1678,1679,1680,1681,1682
1,1.389706,-0.610294,0.389706,-0.610294,-0.610294,1.389706,0.389706,-2.610294,1.389706,-0.610294,...,-3.610294,-3.610294,-3.610294,-3.610294,-3.610294,-3.610294,-3.610294,-3.610294,-3.610294,-3.610294
2,0.290323,-3.709677,-3.709677,-3.709677,-3.709677,-3.709677,-3.709677,-3.709677,-3.709677,-1.709677,...,-3.709677,-3.709677,-3.709677,-3.709677,-3.709677,-3.709677,-3.709677,-3.709677,-3.709677,-3.709677
3,-2.796296,-2.796296,-2.796296,-2.796296,-2.796296,-2.796296,-2.796296,-2.796296,-2.796296,-2.796296,...,-2.796296,-2.796296,-2.796296,-2.796296,-2.796296,-2.796296,-2.796296,-2.796296,-2.796296,-2.796296
4,-4.333333,-4.333333,-4.333333,-4.333333,-4.333333,-4.333333,-4.333333,-4.333333,-4.333333,-4.333333,...,-4.333333,-4.333333,-4.333333,-4.333333,-4.333333,-4.333333,-4.333333,-4.333333,-4.333333,-4.333333
5,1.125714,0.125714,-2.874286,-2.874286,-2.874286,-2.874286,-2.874286,-2.874286,-2.874286,-2.874286,...,-2.874286,-2.874286,-2.874286,-2.874286,-2.874286,-2.874286,-2.874286,-2.874286,-2.874286,-2.874286
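
A variant that centers only the observed ratings subtracts each user's mean first and fills the missing values afterwards, so unrated items stay at exactly zero instead of the negative user mean. The sketch below shows this ordering on a toy utility matrix. The data and variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy utility matrix (hypothetical): rows are users, columns are items.
toy = pd.DataFrame(
    [[5.0, 3.0, np.nan],
     [np.nan, 2.0, 4.0]],
    index=[1, 2], columns=[10, 20, 30],
)

user_means = toy.mean(axis=1)                     # NaN entries are skipped
centered = toy.sub(user_means, axis=0).fillna(0)  # unrated items become 0
print(centered)
```

With this ordering, an unrated item contributes nothing to the dot product, so similarity is driven by co-rated items only. Applying it to the full matrix would change the similarity values and possibly the neighbor ranking.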


In [17]:
normed_matrix = mean_normed.to_numpy()  # as_matrix() was removed from pandas
normed_matrix

array([[ 1.38970588, -0.61029412,  0.38970588, ..., -3.61029412,
        -3.61029412, -3.61029412],
       [ 0.29032258, -3.70967742, -3.70967742, ..., -3.70967742,
        -3.70967742, -3.70967742],
       [-2.7962963 , -2.7962963 , -2.7962963 , ..., -2.7962963 ,
        -2.7962963 , -2.7962963 ],
       ..., 
       [ 0.95454545, -4.04545455, -4.04545455, ..., -4.04545455,
        -4.04545455, -4.04545455],
       [-4.26582278, -4.26582278, -4.26582278, ..., -4.26582278,
        -4.26582278, -4.26582278],
       [-3.41071429,  1.58928571, -3.41071429, ..., -3.41071429,
        -3.41071429, -3.41071429]])

In [18]:
cosine_vals = {i + 1 : 1 - cosine(normed_matrix[0, :], normed_matrix[i, :]) for i in range(1, len(util_matrix))}
top10_users = sorted(cosine_vals, key=cosine_vals.get, reverse=True)[:10]
top10_users

[738, 521, 215, 77, 508, 823, 44, 538, 352, 177]

In [19]:
top10_similarities = [(user_id, cosine_vals[user_id]) for user_id in top10_users]
top10_similarities

[(738, 0.92307815253486536),
 (521, 0.91866882442682762),
 (215, 0.9180387810372892),
 (77, 0.91753622612739083),
 (508, 0.91688949356614935),
 (823, 0.91681387026002781),
 (44, 0.91622674665440951),
 (538, 0.91613113949792968),
 (352, 0.91540113499389575),
 (177, 0.91477202635692589)]
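
The dictionary comprehension in cell 18 calls `cosine` once per user. The same ranking can be sketched in vectorized form by normalizing every row and taking dot products. The function name `top_k_similar` and the toy matrix are assumptions for illustration.

```python
import numpy as np

def top_k_similar(matrix, row, k):
    """Return (row_index, similarity) pairs for the k rows most similar to `row`."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    unit = matrix / norms                 # assumes no all-zero rows
    sims = unit @ unit[row]
    sims[row] = -np.inf                   # exclude the target row itself
    order = np.argsort(sims)[::-1][:k]
    return [(int(i), float(sims[i])) for i in order]

# Hypothetical mean-centered ratings for four users.
toy = np.array([[1.0, -1.0, 0.0],
                [2.0, -2.0, 0.0],
                [-1.0, 1.0, 0.0],
                [1.0, 0.0, 0.0]])
neighbors = top_k_similar(toy, row=0, k=2)
print(neighbors)
```

The returned indices are positions in the array. For the MovieLens matrix, whose `user_id` values start at 1, each position would need 1 added to recover the user id.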

In [20]:
#get movie 508 ratings for the 10 similar users 
similar_ratings = pivoted_ratings.iloc[top10_users][508].values
similar_ratings

array([ nan,  nan,   4.,  nan,  nan,  nan,  nan,  nan,  nan,   3.])
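
`top10_users` holds `user_id` labels, while `.iloc` selects by integer position, and the index of `pivoted_ratings` starts at 1. The toy frame below (hypothetical data) shows how the two accessors can return different rows for the same list, which suggests a label-based lookup with `.loc` may be the safer choice here.

```python
import numpy as np
import pandas as pd

# Toy ratings with a 1-based user_id index, mirroring the pivoted layout.
toy = pd.DataFrame(
    {508: [4.0, np.nan, 3.0, 5.0]},
    index=pd.Index([1, 2, 3, 4], name='user_id'),
)
user_ids = [1, 3]

by_label = toy.loc[user_ids, 508].values      # rows for user_id 1 and 3
by_position = toy.iloc[user_ids][508].values  # rows at positions 1 and 3 (user_id 2 and 4)
print(by_label, by_position)
```

In the toy frame the positional lookup is shifted by one user, so it reads ratings from different people than the ones ranked as most similar.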

In [21]:
#weight the rating based on how similar each similar user is to user 1
weighted = [rating * weight[1] for rating, weight in zip(similar_ratings, top10_similarities)]
weighted

[nan,
 nan,
 3.6721551241491568,
 nan,
 nan,
 nan,
 nan,
 nan,
 nan,
 2.7443160790707779]

In [22]:
expected_rating = np.nanmean(weighted)
expected_rating

3.2082356016099673

In [23]:
print('Expected rating user 1, movie 508: ', expected_rating)

Expected rating user 1, movie 508:  3.20823560161
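
Averaging `rating * similarity` pulls the estimate downward, because every similarity is below 1. Two alternatives are sketched here using the two non-missing ratings and the similarities they were paired with in the outputs above: the plain mean that the problem statement asks for, and a similarity-weighted mean that divides by the sum of the weights. The variable names are illustrative.

```python
import numpy as np

# The two non-missing entries of similar_ratings and the similarities
# they were multiplied by, copied from the outputs above.
ratings_508 = np.array([4.0, 3.0])
similarities = np.array([0.9180387810372892, 0.91477202635692589])

plain_mean = ratings_508.mean()
# np.average divides the weighted sum by the sum of the weights.
weighted_mean = np.average(ratings_508, weights=similarities)
print(plain_mean, weighted_mean)
```

Because the two similarities are nearly equal, the normalized weighted mean lands very close to the plain mean of 3.5.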
