# MovieLens dataset recommendation system

A small demo of a movie recommendation system that finds a user's neighborhood in a Pearson correlation matrix of user rating profiles.

The system first produces a user-user similarity matrix, where a tolerance parameter sets the minimum number of commonly rated items required before a correlation is computed for a pair of users. It then finds a specified number of neighbor users with similar taste in films. The neighbors' ratings are used to pick a specified number of items, on the assumption that these will match the user's taste.
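
As a minimal sketch of the tolerance idea, the toy example below (made-up users and films, not MovieLens data) shows how the `min_periods` argument of pandas' `DataFrame.corr` leaves a correlation undefined when two users share too few rated items.

```python
import numpy as np
import pandas as pd

# Hypothetical user x item ratings; NaN means "not rated".
ratings = pd.DataFrame(
    {
        "film_a": [5, 4, 1],
        "film_b": [4, 5, 2],
        "film_c": [1, np.nan, 5],
        "film_d": [np.nan, np.nan, 4],
    },
    index=pd.Index(["u1", "u2", "u3"], name="user_id"),
)

# Transpose so that users become columns, then correlate the columns.
# A pair of users needs at least 3 commonly rated films to get a value.
sim = ratings.T.corr(min_periods=3)

print(sim.loc["u1", "u2"])  # u1 and u2 share only 2 films, so this is NaN
print(sim.loc["u1", "u3"])  # u1 and u3 share 3 films, so a correlation is computed
```

Without such a threshold, two users who happen to agree on just two films would get a perfect correlation and crowd out more reliable neighbors.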

Finally, the system is tested against a baseline of the most highly rated items (by Bayesian average) in the whole dataset.
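
To illustrate why a Bayesian average is used instead of a plain mean, here is a small worked example on made-up ratings. It assumes the same formula as the `bayesian_average_ratings` function defined later in this notebook; the item names and numbers are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical user x item matrix; NaN means "not rated".
ratings = pd.DataFrame(
    {
        "popular": [4.0, 5.0, 4.0, 5.0],
        "niche": [5.0, np.nan, np.nan, np.nan],
        "weak": [2.0, 2.0, 1.0, 2.0],
    }
)

n = ratings.notna().sum()   # number of ratings per item
item_mean = ratings.mean()  # plain mean rating per item
C = n.mean()                # average number of ratings per item (3.0 here)
m = item_mean.mean()        # average of the item means (3.75 here)

# Bayesian average: each item mean is pulled toward m,
# and the pull is stronger for items with few ratings.
bayes = (C * m + item_mean * n) / (C + n)

print(item_mean.sort_values(ascending=False))
print(bayes.sort_values(ascending=False))
```

By plain mean the single-rating item ranks first, while the Bayesian average ranks the item with four consistently high ratings above it.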

### Import libraries

In [1]:
import numpy as np
import pandas as pd
import random
import os


### Read data from files

In [2]:
cwd = os.getcwd()

In [3]:
data_path = cwd + "/ml-100k/u.data"
item_path = cwd + "/ml-100k/u.item"

In [4]:
# Ratings by users

data = pd.read_csv(data_path, sep="\t", names=["user_id","item_id","rating","timestamp"])

data

,user_id,item_id,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596
...,...,...,...,...
99995,880,476,3,880175444
99996,716,204,5,879795543
99997,276,1090,1,874795795
99998,13,225,2,882399156


In [5]:
# Item information (name, year, genre etc.)

items = pd.read_csv(item_path, sep="|", encoding="ISO-8859-1", on_bad_lines="warn", names=["movie_id", "movie_title",
                        "release_date", "video_release_date", "IMDb_url", "unknown", "Action", "Adventure", "Animation",
                        "Children's", "Comedy", "Crime", "Documentary", "Drama", "Fantasy", "Film-Noir", "Horror",
                        "Musical", "Mystery", "Romance", "Sci-fi", "Thriller", "War", "Western"])

items = items.set_index("movie_id")

items

movie_id,movie_title,release_date,video_release_date,IMDb_url,unknown,Action,Adventure,Animation,Children's,Comedy,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-fi,Thriller,War,Western
1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,1,...,0,0,0,0,0,0,0,0,0,0
2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,1,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
5,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1678,Mat' i syn (1997),06-Feb-1998,,http://us.imdb.com/M/title-exact?Mat%27+i+syn+...,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1679,B. Monkey (1998),06-Feb-1998,,http://us.imdb.com/M/title-exact?B%2E+Monkey+(...,0,0,0,0,0,0,...,0,0,0,0,0,1,0,1,0,0
1680,Sliding Doors (1998),01-Jan-1998,,http://us.imdb.com/Title?Sliding+Doors+(1998),0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
1681,You So Crazy (1994),01-Jan-1994,,http://us.imdb.com/M/title-exact?You%20So%20Cr...,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


In [6]:
# Timestamp not used for this analysis

data = data.drop(columns=["timestamp"])

In [7]:
# Pivot the data for ease of use

data = data.pivot(index="user_id", columns="item_id", values="rating")

data

user_id / item_id,1,2,3,4,5,6,7,8,9,10,...,1673,1674,1675,1676,1677,1678,1679,1680,1681,1682
1,5.0,3.0,4.0,3.0,3.0,5.0,4.0,1.0,5.0,3.0,...,,,,,,,,,,
2,4.0,,,,,,,,,2.0,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,4.0,3.0,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
939,,,,,,,,,5.0,,...,,,,,,,,,,
940,,,,2.0,,,4.0,5.0,3.0,,...,,,,,,,,,,
941,5.0,,,,,,4.0,,,,...,,,,,,,,,,
942,,,,,,,,,,,...,,,,,,,,,,


### Define functions to use for producing recommendations and testing

In [8]:
def produce_similarity_matrix(data, tolerance=10):
    # transpose to item_id * user_id, take pearson correlation
    # set a threshold on minimum of commonly seen films to avoid perfect correlations
    similarity = data.T.corr(min_periods=tolerance)
    return similarity

def find_user_neighborhood(similarity_matrix, user_id, n_neighbors=11):
    # use user_id to select row in similarity matrix and drop the user's own entry explicitly,
    # since ties at correlation 1.0 mean the user is not guaranteed to sort first
    # sort to descending, move nans to end and select only n_neighbors. should be an odd number?
    neighbors = similarity_matrix.loc[user_id].drop(user_id).sort_values(ascending=False, na_position="last")[:n_neighbors].index
    return neighbors

def get_neighbors_favorites(data, user_id, neighbors, n_recommendations=5, method="sum", seen=False):
    # use neighbor ids to get column-wise (i.e. for each film) sums of ratings. sort rating aggregates
    # todo: might be better to normalize ratings around zero so popular bad films get decrease instead of small increase
    
    if method=="sum":
        recommendations = data.loc[neighbors].sum(axis=0).sort_values(ascending=False)
    elif method=="avg":
        recommendations = bayesian_average_ratings(data.loc[neighbors]).sort_values(ascending=False)
    else:
        # fail loudly instead of returning None, which would break callers later
        raise ValueError("define method as either sum or avg")
    
    if seen:
        # pick n highest rated films from ones user has seen and rated (only for testing purposes)
        recommendations = recommendations[data.loc[user_id].notna()][:n_recommendations].index
    else:
        # pick n highest rated films from the ones user hasn't yet seen (or at least rated)
        recommendations = recommendations[data.loc[user_id].isna()][:n_recommendations].index
    return recommendations

def find_recommendation_titles(items, recommendations):
    recommended_titles = items.loc[recommendations]["movie_title"]
    return recommended_titles

def bayesian_average_ratings(data):
    # https://medium.com/@ertuodaba/web-scraping-with-python-and-calculating-bayesian-averages-with-sql-for-better-product-ranking-218dcbb75e4b
    n_ratings = data.notna().sum()
    C = n_ratings.mean() # average number of ratings
    mean_ratings = data.mean(axis=0)
    m = mean_ratings.mean() # average of average ratings
    # vectorized over items: pull each item mean toward m, weighted by its number of ratings
    bA = (C*m + mean_ratings*n_ratings)/(C + n_ratings)
    return bA

def test_user_recommendations(data, user_id, similarity_tol=10, n_neighbors=10, n_seen=5, n_unseen=5, verbose=False):
    
    """
    Test targeted recommendations for user u using following procedure:
    1) get n recommendations allowing items already seen by the user
    2) for i in 0..n, hide rating of i by u and get new recommendations as if u has not rated i
    3) if i in new "unseen" recommendations, save rating of i by u to a list
    
    The purpose of this procedure is to find items rated by u and simulate peer recommendation on those items
    
    """
    
    # a list for tested recommendations to compute accuracy
    tested = []
    # deep copy of the data so as to not make changes
    test_data = data.copy(deep=True)
    # produce similarity matrix without hiding any user ratings
    non_masked_similarity = produce_similarity_matrix(test_data, similarity_tol)
    # produce neighborhood without hiding any user ratings
    non_masked_neighbors = find_user_neighborhood(non_masked_similarity, user_id, n_neighbors)
    # (try to) get recommendations from the neighborhood from films user has seen and rated
    seen_recommendations = get_neighbors_favorites(test_data, user_id, non_masked_neighbors, n_seen, "avg", True).values
    if verbose:
        print(seen_recommendations)
    # next: hide seen recommendations (one by one) and make new recommendations
    for seen_item_id in seen_recommendations:
        # save true rating before masking
        true_rating = test_data.loc[user_id, seen_item_id]
        # "mask" i.e. just delete the rating from test data
        test_data.loc[user_id, seen_item_id] = np.nan
        # new similarity matrix, slightly different due to masking current item
        masked_similarity = produce_similarity_matrix(test_data, similarity_tol)
        # new neighbors, again slightly different due to masking current item
        masked_neighbors = find_user_neighborhood(masked_similarity, user_id, n_neighbors)
        # new recommendations, again slightly different due to masking current item. Also this time only unseen items
        unseen_recommendations = get_neighbors_favorites(test_data, user_id, masked_neighbors, n_unseen, "avg", False).values
        if seen_item_id in unseen_recommendations:
            tested.append(true_rating)
        else:
            tested.append(np.nan)
        # return current item's true rating to test_data
        test_data.loc[user_id, seen_item_id] = true_rating
        
    return tested

def test_baseline_recommendations(data, user_id, n, verbose=False):
    
    """
    Test generic/baseline recommendations for user u using following procedure
    1) get list of most highly rated (bayesian avg.) items in the dataset in a descending order
    2) choose n first items of the list that are rated by u
    3) save rating of items to a list
    
    This is to simulate recommendation based solely on global average ratings
    
    """
    
    # deep copy of the data so as to not make changes
    test_data = data.copy(deep=True)
    # to not have user ratings affect global bayesian avg scores
    test_data = test_data.drop(user_id)
    # get average scores for each item
    bayes_averages = bayesian_average_ratings(test_data)
    # list of baseline recommendations by sorting to descending
    recommendations = bayes_averages.sort_values(ascending=False)
    # pick n highest rated that are in fact seen by the user (see original data)
    recommendations = recommendations[data.loc[user_id].notna()][:n].index.values
    # get user ratings for recommendations from original data
    tested = data.loc[user_id, recommendations].values
    
    return tested

### Demonstration of using the recommendations system

In [27]:
random.seed(55) # set random seed for reproducibility

# randint includes both endpoints; user ids are assumed to run from 1 to the number of rows
user = random.randint(1, data.shape[0])

user

93

In [28]:
tol = 10 # set "similarity tolerance" value, i.e. how many common titles are required for computing correlation

similarity = produce_similarity_matrix(data, tolerance=tol)

similarity

user_id,1,2,3,4,5,6,7,8,9,10,...,934,935,936,937,938,939,940,941,942,943
1,1.000000,1.608412e-01,,,0.420809,0.287159,0.258137,0.692086,,-0.092344,...,0.061695,-0.260242,0.383733,0.029000,0.326744,5.343904e-01,0.263289,0.205616,-0.180784,0.067549
2,0.160841,1.000000e+00,,,,0.446269,0.643675,,,0.668145,...,0.021007,-0.271163,0.214017,0.561645,0.331587,-7.671236e-18,-0.011682,,0.085960,
3,,,1.000000,-0.262600,,-0.109109,0.064803,,,,...,,,-0.045162,,-0.137523,,-0.104678,,,
4,,,-0.262600,1.000000,,,-0.266632,,,,...,,,,,,,0.850992,,,
5,0.420809,,,,1.000000,0.241817,0.175630,0.537400,,0.087343,...,0.229532,,0.439286,,0.484211,,0.027038,,0.318163,0.346234
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
939,0.534390,-7.671236e-18,,,,0.206315,0.142404,,,,...,,-0.033059,0.471172,-0.275839,-0.073374,1.000000e+00,,,,-0.187317
940,0.263289,-1.168173e-02,-0.104678,0.850992,0.027038,-0.024419,0.000931,0.320487,,0.158976,...,-0.125059,,-0.338327,-0.148608,0.110022,,1.000000,,-0.022813,0.332497
941,0.205616,,,,,0.399186,,,,,...,,,0.273060,,-0.214147,,,1.000000,,
942,-0.180784,8.596024e-02,,,0.318163,0.092349,0.452075,0.201328,,0.408994,...,0.438252,,-0.216119,,0.244989,,-0.022813,,1.000000,0.277433


In [29]:
# set number of neighbors to 10
n_neighbors = 10

neighbors = find_user_neighborhood(similarity, user, n_neighbors)

neighbors

Index([399, 57, 330, 907, 298, 378, 804, 821, 630, 901], dtype='int64', name='user_id')

In [31]:
# get recommendations as bayesian average in neighborhood
recommendations_avg = get_neighbors_favorites(data, user, neighbors, 10, "avg", False)

print(find_recommendation_titles(items, recommendations_avg))

item_id
318              Schindler's List (1993)
427         To Kill a Mockingbird (1962)
603                   Rear Window (1954)
208            Young Frankenstein (1974)
98      Silence of the Lambs, The (1991)
88           Sleepless in Seattle (1993)
286          English Patient, The (1996)
483                    Casablanca (1942)
313                       Titanic (1997)
1084        Anne Frank Remembered (1995)
Name: movie_title, dtype: object


### Testing the model by comparing recommendations to a baseline of highest-rated items

Testing a recommendation system is somewhat difficult: for any film the user has rated, there is no guarantee that the system will recommend that film to the user. The converse also holds for any recommendation: the user may not have seen the film, in which case no rating by the user exists.

The test procedure here is relatively crude. First, recommendations are produced from the films the user has already seen and rated. Next, each of these films is masked in turn: the user's rating is temporarily replaced with NaN, the similarity matrix and neighborhood are recomputed, and new recommendations are produced from the films that now count as unseen. If the masked film appears in this new list, it is used as a test item, since the user's true rating is known and the film shows up as a recommendation for the user.

The user's ratings of these rated-and-recommended items are collected in a list, and the list's mean is computed. As a last step, this mean is compared to the mean of a similar list built for the baseline, which takes the items with the highest Bayesian average rating among those the user has rated.
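
The masking step can be sketched on a toy matrix as follows. This is a simplified illustration, not the notebook's test function: the helper `top_unseen` and the data are hypothetical, neighbors are fixed by hand instead of being recomputed from a similarity matrix, and plain rating sums are used for scoring.

```python
import numpy as np
import pandas as pd

def top_unseen(ratings, user, neighbors, n):
    """Return the n items unseen by `user` with the highest neighbor rating sums."""
    scores = ratings.loc[neighbors].sum(axis=0)
    unseen = ratings.loc[user].isna()
    return scores[unseen].sort_values(ascending=False).index[:n]

# Hypothetical ratings: rows are users, columns are items, NaN means "not rated".
ratings = pd.DataFrame(
    {"a": [5, 5, 4], "b": [1, 2, 1], "c": [np.nan, 3, 4]},
    index=pd.Index([1, 2, 3], name="user_id"),
    dtype=float,
)
user, neighbors, item = 1, [2, 3], "a"

masked = ratings.copy()
true_rating = masked.loc[user, item]  # remember the real rating
masked.loc[user, item] = np.nan       # pretend the user never rated the item

recs = top_unseen(masked, user, neighbors, n=1)

# Keep the true rating only if the masked item is recommended again.
tested = true_rating if item in recs else np.nan
print(list(recs), tested)
```

Here the masked item comes back as the top recommendation, so the user's true rating of it counts toward the test mean.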

This is clearly not the most rigorous way to test the system, and it should be improved in the future.

In [None]:
# get a random list of user ids for testing
test_users = data.sample(100, random_state=42).index.values

avg_scores = [] # save average scores for recommended items here.
avg_baseline = [] # save average scores for baseline recs here.

for u in test_users:
    print(u)
    # get a list of ratings corresponding with peer recommended items
    tested = test_user_recommendations(data, u, similarity_tol=10, n_neighbors=31, n_seen=10, n_unseen=10, verbose=False)
    # get rid of nan tested
    # reason for nans: appear as rec in non-masked but doesn't appear in masked peer recommendations
    tested = np.array(tested)[~np.isnan(tested)]
    # save mean rating of peer recommended items
    avg_scores.append(np.mean(tested))
    n_tested = tested.shape[0]
    # get a list of baseline ratings
    baseline = test_baseline_recommendations(data, u, n_tested, False)
    avg_baseline.append(np.mean(baseline))

In [35]:
# compare recommendations to baseline

print("Mean rating for films identified via the recommendation system: {}".format(np.nanmean(np.array(avg_scores))))
print("Mean rating for films identified via the baseline system: {}".format(np.nanmean(np.array(avg_baseline))))

Mean rating for films identified via the recommendation system: 4.498384687208217
Mean rating for films identified via the baseline system: 4.331157796451914
