<h1>Basic collaborative recommender based on Pearson's r correlation, built with Python and pandas</h1>

In [124]:
import pandas as pd
import numpy as np

<p>The dataset I've used is the popular MovieLens dataset, available for public download <a href="https://grouplens.org/datasets/movielens/">here</a>.</p>
<p><b>Dataset description:</b> 100,000 ratings and 3,600 tag applications applied to 9,000 movies by 600 users. Last updated 9/2018.</p>

In [125]:
movies = pd.read_csv('ml-latest-small/movies.csv')
print(f'movies dataframe has {movies.shape[0]} rows and {movies.shape[1]} columns')
movies.tail(5)

movies dataframe has 9742 rows and 3 columns


Unnamed: 0,movieId,title,genres
9737,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy
9738,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy
9739,193585,Flint (2017),Drama
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation
9741,193609,Andrew Dice Clay: Dice Rules (1991),Comedy


In [126]:
ratings = pd.read_csv('ml-latest-small/ratings.csv')
print(f'ratings dataframe has {ratings.shape[0]} rows and {ratings.shape[1]} columns')
ratings.head()

ratings dataframe has 100836 rows and 4 columns


Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [127]:
ratings.drop(['timestamp'], axis=1, inplace=True)
ratings.head()

Unnamed: 0,userId,movieId,rating
0,1,1,4.0
1,1,3,4.0
2,1,6,4.0
3,1,47,5.0
4,1,50,5.0


<p>It will be convenient to replace each movieId with the movie title in the ratings dataframe. To do that we can use the .map method, which is <i>used for substituting each value in a Series with another value, that may be derived from a function, a dict or a Series.</i></p>


In [128]:
%%time
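def replace_id_with_name(movie_id):
    # Filters the whole movies dataframe once per rating, which is why this cell is slow
    return movies[movies['movieId'] == movie_id].title.values[0]

# The column keeps the name movieId but now holds movie titles
ratings.movieId = ratings.movieId.map(replace_id_with_name)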

CPU times: user 53.5 s, sys: 31.5 ms, total: 53.5 s
Wall time: 53.5 s
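
<p>Most of the time above is likely spent filtering the movies dataframe once for every rating. As a sketch of a faster alternative, .map also accepts a Series used as a lookup table. The frames movies_demo and ratings_demo below are small stand-ins invented for illustration, not the MovieLens files.</p>

```python
import pandas as pd

# Tiny stand-in frames (assumed data, for illustration only)
movies_demo = pd.DataFrame({
    'movieId': [1, 3, 6],
    'title': ['Toy Story (1995)', 'Grumpier Old Men (1995)', 'Heat (1995)'],
})
ratings_demo = pd.DataFrame({
    'userId': [1, 1, 2],
    'movieId': [1, 6, 3],
    'rating': [4.0, 4.0, 3.5],
})

# Build the id -> title lookup once; map then looks each id up in its index
# instead of scanning the whole movies frame for every rating.
id_to_title = movies_demo.set_index('movieId')['title']
ratings_demo['movieId'] = ratings_demo['movieId'].map(id_to_title)
print(ratings_demo)
```

<p>Ids that are missing from the lookup Series would become NaN rather than raising an error, so it may be worth checking for NaN titles after the mapping.</p>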


In [129]:
ratings.head()

Unnamed: 0,userId,movieId,rating
0,1,Toy Story (1995),4.0
1,1,Grumpier Old Men (1995),4.0
2,1,Heat (1995),4.0
3,1,Seven (a.k.a. Se7en) (1995),5.0
4,1,"Usual Suspects, The (1995)",5.0


<p>To create recommendations we first have to build a pivot table of users by movies using the <i>pivot_table</i> method.</p>


In [130]:
matrix = ratings.pivot_table(index=['userId'],
                             columns=['movieId'],
                             values='rating')
print(f"""matrix_of_users_by_movies has {matrix.shape[0]} rows and {matrix.shape[1]} columns""")

matrix.head(20)

matrix_of_users_by_movies has 610 rows and 9719 columns


movieId,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,,,,,,,,,,,...,,,,,,,,,4.0,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,
6,,,,,,,,,,,...,,,,,,,,,,
7,,,,,,,,,,,...,,,,,,,,,,
8,,,,,,,,,,,...,,,,,,,,,,
9,,,,,,,,,,,...,,,,,,,1.0,,,
10,,,,,,,,,,,...,,,,,,,,,,


<p>This table is very sparse: it contains mostly NaN values, which is natural because no user has rated every movie.</p>
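
<p>The sparsity can be quantified by counting the filled cells. The snippet below is a minimal sketch on a made-up three-user table (ratings_demo and matrix_demo are illustrative names); the same two lines should work on the real matrix.</p>

```python
import pandas as pd

# Made-up ratings, for illustration only
ratings_demo = pd.DataFrame({
    'userId': [1, 1, 2, 3],
    'movieId': ['A', 'B', 'A', 'C'],
    'rating': [4.0, 5.0, 3.0, 2.0],
})
matrix_demo = ratings_demo.pivot_table(index='userId', columns='movieId', values='rating')

# Number of non-NaN cells and their share of the whole table
filled = matrix_demo.notna().sum().sum()
density = filled / matrix_demo.size
print(f'{filled} of {matrix_demo.size} cells are filled ({density:.1%})')
```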


In [131]:
def pearson_r(series1, series2):
    """Calculate a Pearson-style correlation between two rating series.

    Each series is centered on the mean of all its own non-NaN values,
    and NaN entries are skipped by the sums.
    """
    series1_c = series1 - series1.mean()
    series2_c = series2 - series2.mean()
    # The product is NaN wherever either of the two ratings is missing
    products = series1_c * series2_c
    pearson_r_value = np.sum(products) / np.sqrt(np.sum(series1_c ** 2) * np.sum(series2_c ** 2))
    return {'pearson_r_value': pearson_r_value,
            'series1_non_Nan': series1.count(),
            'series2_non_Nan': series2.count(),
            'np.sum(series1_c * series2_c)': np.sum(products),
            'non-NaN_products': products.count()}

<p>For two arbitrary movies that do not seem to be related in any way, the correlation coefficient is as follows:</p>

In [132]:
for k, v in pearson_r(matrix['Sherlock - A Study in Pink (2010)'], 
                      matrix['(500) Days of Summer (2009)']).items():
    print(k, '-->', v)

pearson_r_value --> 0.05222329678670935
series1_non_Nan --> 2
series2_non_Nan --> 42
np.sum(series1_c * series2_c) --> 0.125
non-NaN_products --> 2


<p>But is it really a reliable metric, given that it is based on only two rows where both movies have a non-NaN rating?</p>
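
<p>One possible way to address this question is to compute the coefficient only over users who rated both movies, and to report NaN when that overlap is too small. The sketch below uses a hypothetical helper, pearson_r_overlap, and an arbitrary threshold, min_overlap; neither is part of this notebook. Unlike pearson_r, it centers each series on the mean of the co-rated rows only, so its values will generally differ.</p>

```python
import numpy as np
import pandas as pd

def pearson_r_overlap(series1, series2, min_overlap=10):
    """Hypothetical helper: Pearson's r over rows where both series are non-NaN.

    Returns (r, n_overlap); r is NaN when fewer than min_overlap rows overlap
    or when one of the series is constant on the overlap.
    """
    pair = pd.concat([series1, series2], axis=1, keys=['a', 'b']).dropna()
    n_overlap = len(pair)
    if n_overlap < min_overlap:
        return np.nan, n_overlap
    a_c = pair['a'] - pair['a'].mean()
    b_c = pair['b'] - pair['b'].mean()
    denominator = np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum())
    if denominator == 0:
        return np.nan, n_overlap
    return (a_c * b_c).sum() / denominator, n_overlap

# Made-up ratings: only rows 0, 3 and 4 are rated in both series
a = pd.Series([5.0, 4.0, np.nan, 2.0, 1.0])
b = pd.Series([4.5, np.nan, 3.0, 2.5, 1.0])
print(pearson_r_overlap(a, b, min_overlap=3))
print(pearson_r_overlap(a, b, min_overlap=10))
```

<p>The threshold is a judgment call: a higher value gives more trustworthy coefficients but leaves fewer movie pairs with any coefficient at all.</p>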

<p>What about The Matrix? First we should find the exact titles of these movies.</p>


In [133]:
movies[movies['title'].str.match('Matr')]

Unnamed: 0,movieId,title,genres
1939,2571,"Matrix, The (1999)",Action|Sci-Fi|Thriller
4351,6365,"Matrix Reloaded, The (2003)",Action|Adventure|Sci-Fi|Thriller|IMAX
4639,6934,"Matrix Revolutions, The (2003)",Action|Adventure|Sci-Fi|Thriller|IMAX


In [134]:
pearson_r(matrix['Matrix Reloaded, The (2003)'], 
          matrix['Matrix Revolutions, The (2003)'])

{'pearson_r_value': 0.5492530881058778,
 'series1_non_Nan': 96,
 'series2_non_Nan': 79,
 'np.sum(series1_c * series2_c)': 54.24459388185654,
 'non-NaN_products': 64}

<p>Now the coefficient is a lot more reliable because we have 64 non-NaN products, and a value of about 0.55 indicates that the two movies are rated quite similarly.</p>


<p>What about Harry Potter? All movies from the series should be close to each other.</p>


In [135]:
movies[movies['title'].str.match('Harry')]

Unnamed: 0,movieId,title,genres
2528,3388,Harry and the Hendersons (1987),Children|Comedy
3574,4896,Harry Potter and the Sorcerer's Stone (a.k.a. ...,Adventure|Children|Fantasy
4076,5816,Harry Potter and the Chamber of Secrets (2002),Adventure|Fantasy
5166,8368,Harry Potter and the Prisoner of Azkaban (2004),Adventure|Fantasy|IMAX
6062,40815,Harry Potter and the Goblet of Fire (2005),Adventure|Fantasy|Thriller|IMAX
6522,54001,Harry Potter and the Order of the Phoenix (2007),Adventure|Drama|Fantasy|IMAX
7078,69844,Harry Potter and the Half-Blood Prince (2009),Adventure|Fantasy|Mystery|Romance|IMAX
7285,74948,Harry Brown (2009),Crime|Drama|Thriller
7465,81834,Harry Potter and the Deathly Hallows: Part 1 (...,Action|Adventure|Fantasy|IMAX
7644,88125,Harry Potter and the Deathly Hallows: Part 2 (...,Action|Adventure|Drama|Fantasy|Mystery|IMAX


In [136]:
pearson_r(matrix['Harry Potter and the Chamber of Secrets (2002)'], 
          matrix['Harry Potter and the Prisoner of Azkaban (2004)'])

{'pearson_r_value': 0.42701539417752366,
 'series1_non_Nan': 102,
 'series2_non_Nan': 93,
 'np.sum(series1_c * series2_c)': 27.891787897954877,
 'non-NaN_products': 76}

In [137]:
pearson_r(matrix['Harry Potter and the Goblet of Fire (2005)'], matrix['Matrix, The (1999)'])

{'pearson_r_value': 0.052211472313611715,
 'series1_non_Nan': 71,
 'series2_non_Nan': 278,
 'np.sum(series1_c * series2_c)': 4.98632080251292,
 'non-NaN_products': 55}

In [138]:
pearson_r(matrix['Harry Potter and the Goblet of Fire (2005)'], matrix['Harry Potter and the Prisoner of Azkaban (2004)'])

{'pearson_r_value': 0.4357134402128312,
 'series1_non_Nan': 71,
 'series2_non_Nan': 93,
 'np.sum(series1_c * series2_c)': 18.002726033621084,
 'non-NaN_products': 61}

<h2>What is the correlation between movies that belong to different genres?</h2>
<p>Let's pick a horror movie.</p>

In [146]:
movies[movies['genres'].str.match('Horror')].tail(15)

Unnamed: 0,movieId,title,genres
9335,160571,Lights Out (2016),Horror
9344,160872,Satanic (2016),Horror
9346,160978,Hellevator (2004),Horror|Sci-Fi
9389,163937,Blair Witch (2016),Horror|Thriller
9390,163981,31 (2016),Horror
9447,167538,Microwave Massacre (1983),Horror
9462,168250,Get Out (2017),Horror
9480,169670,The Void (2016),Horror
9508,170945,It Comes at Night (2017),Horror|Mystery|Thriller
9574,174479,Unedited Footage of a Bear (2014),Horror|Thriller


In [147]:
movies[movies['genres'].str.match('Comedy')].tail(15)

Unnamed: 0,movieId,title,genres
9684,183911,The Clapper (2018),Comedy
9685,183959,Tom Segura: Disgraceful (2018),Comedy
9686,184015,When We First Met (2018),Comedy
9691,184349,Elsa & Fred (2005),Comedy|Drama|Romance
9695,184791,Fred Armisen: Standup for Drummers (2018),Comedy
9698,184997,"Love, Simon (2018)",Comedy|Drama
9704,185473,Blockers (2018),Comedy
9712,188189,Sorry to Bother You (2018),Comedy|Fantasy|Sci-Fi
9715,188751,Mamma Mia: Here We Go Again! (2018),Comedy|Romance
9716,188797,Tag (2018),Comedy


In [163]:
pearson_r(matrix['Love, Simon (2018)'], matrix['Silver Spoon (2014)'])

{'pearson_r_value': nan,
 'series1_non_Nan': 1,
 'series2_non_Nan': 1,
 'np.sum(series1_c * series2_c)': 0.0,
 'non-NaN_products': 0}

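<h3>What are the genres?</h3>

<p>It turns out that our dataset is too small to compute a correlation for most arbitrary pairs of movies. In the example above each movie has a single rating and no user rated both, so the result is NaN.</p>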

In [159]:
movies[movies['genres'].str.match('Comedy')].tail(10)

Unnamed: 0,movieId,title,genres
9698,184997,"Love, Simon (2018)",Comedy|Drama
9704,185473,Blockers (2018),Comedy
9712,188189,Sorry to Bother You (2018),Comedy|Fantasy|Sci-Fi
9715,188751,Mamma Mia: Here We Go Again! (2018),Comedy|Romance
9716,188797,Tag (2018),Comedy
9718,189043,Boundaries (2018),Comedy|Drama
9723,189713,BlacKkKlansman (2018),Comedy|Crime|Drama
9726,190209,Jeff Ross Roasts the Border (2017),Comedy
9734,193571,Silver Spoon (2014),Comedy|Drama
9741,193609,Andrew Dice Clay: Dice Rules (1991),Comedy


In [191]:
movies['genres'].unique()[:25]

array(['Adventure|Animation|Children|Comedy|Fantasy',
       'Adventure|Children|Fantasy', 'Comedy|Romance',
       'Comedy|Drama|Romance', 'Comedy', 'Action|Crime|Thriller',
       'Adventure|Children', 'Action', 'Action|Adventure|Thriller',
       'Comedy|Horror', 'Adventure|Animation|Children', 'Drama',
       'Action|Adventure|Romance', 'Crime|Drama', 'Drama|Romance',
       'Action|Comedy|Crime|Drama|Thriller', 'Comedy|Crime|Thriller',
       'Crime|Drama|Horror|Mystery|Thriller', 'Drama|Sci-Fi',
       'Children|Drama', 'Adventure|Drama|Fantasy|Mystery|Sci-Fi',
       'Mystery|Sci-Fi|Thriller', 'Children|Comedy', 'Drama|War',
       'Action|Crime|Drama'], dtype=object)

In [164]:
len(movies['genres'].unique())

951

<p>The number of distinct genre values is high because each value is a pipe-separated combination of several single genres.</p>
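
<p>To count the single genres instead of their combinations, the genre strings can be split on the pipe character. This is a minimal sketch on a made-up three-row frame (movies_demo is an illustrative name); it also shows str.contains, which, unlike str.match, finds a genre anywhere in the string and not only at its start.</p>

```python
import pandas as pd

# Made-up sample in the same format as movies.csv
movies_demo = pd.DataFrame({
    'title': ['Toy Story (1995)', 'Heat (1995)', 'Tag (2018)'],
    'genres': ['Adventure|Animation|Children|Comedy|Fantasy',
               'Action|Crime|Thriller',
               'Comedy'],
})

# One row per (movie, single genre) pair
single_genres = movies_demo['genres'].str.split('|').explode()
genre_counts = single_genres.value_counts()
print(sorted(single_genres.unique()))

# Movies tagged with Comedy at any position in the genre string
comedies = movies_demo[movies_demo['genres'].str.contains('Comedy', regex=False)]
print(comedies['title'].tolist())
```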

<h2>The get_recommendations function</h2>

In [194]:
def get_recommendations(movie_name, matrix, number_of_recommendations):
    """Returns a list of recommendations of desired length."""
    reviews = []
    for title in matrix.columns:
        if title == movie_name:
            continue
        correlation_dict = pearson_r(matrix[movie_name], matrix[title])
        cor = correlation_dict['pearson_r_value']
        if np.isnan(cor):
            continue
        else:
            reviews.append((title,cor))
    reviews.sort(key=lambda tup: tup[1], reverse=True)
    return reviews[:number_of_recommendations]
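
<p>A possible variant, shown here only as a sketch, ranks titles with the built-in Series.corr and uses its min_periods argument to skip movies that share too few raters with the target. The function name recommend_with_min_overlap, the default threshold and the demo matrix are assumptions made for illustration. Series.corr works on co-rated rows only, so its values will generally differ from those of pearson_r.</p>

```python
import numpy as np
import pandas as pd

def recommend_with_min_overlap(movie_name, matrix, number_of_recommendations, min_overlap=20):
    """Hypothetical variant: rank titles by Series.corr, requiring min_overlap shared raters."""
    target = matrix[movie_name]
    correlations = matrix.drop(columns=[movie_name]).apply(
        lambda column: column.corr(target, min_periods=min_overlap))
    # Titles with too little overlap come back as NaN and are dropped
    return correlations.dropna().sort_values(ascending=False).head(number_of_recommendations)

# Made-up users-by-movies matrix: A and B share 4 raters, A and C share 3
matrix_demo = pd.DataFrame({
    'A': [5.0, 4.0, 1.0, 2.0, np.nan],
    'B': [4.0, 5.0, 2.0, 1.0, 3.0],
    'C': [1.0, np.nan, 5.0, 4.0, np.nan],
}, index=[1, 2, 3, 4, 5])

strict = recommend_with_min_overlap('A', matrix_demo, 5, min_overlap=4)
loose = recommend_with_min_overlap('A', matrix_demo, 5, min_overlap=3)
print(strict)
print(loose)
```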

In [193]:
for rec in get_recommendations('Lord of the Rings: The Two Towers, The (2002)', matrix, 30):
    print(rec)

('Lord of the Rings: The Fellowship of the Ring, The (2001)', 0.6675278465841459)
('Lord of the Rings: The Return of the King, The (2003)', 0.6078579357239998)
('Hobbit: The Desolation of Smaug, The (2013)', 0.25287914197028527)
('Dark Knight Rises, The (2012)', 0.23913434063058228)
('Hangover Part II, The (2011)', 0.23725565513038355)
('San Andreas (2015)', 0.2326167964643644)
('Logan (2017)', 0.22464275223058755)
('Hobbit: An Unexpected Journey, The (2012)', 0.22429471599406797)
('Confidence (2003)', 0.22057617917549227)
('X-Men: Days of Future Past (2014)', 0.2188155747695293)
('My Father the Hero (Mon père, ce héros.) (1991)', 0.21804769151881898)
("Winter's Bone (2010)", 0.21656170190503485)
("Pirates of the Caribbean: Dead Man's Chest (2006)", 0.2158769865484502)
('Spirited Away (Sen to Chihiro no kamikakushi) (2001)', 0.21563897748310537)
('Appleseed (Appurushîdo) (2004)', 0.2152278245807626)
('Marked for Death (1990)', 0.2114981895832457)
('Star Wars: Episode V - The Empire Str

In [188]:
for rec in get_recommendations('Harry Potter and the Goblet of Fire (2005)', matrix, 30):
    print(rec)

('Harry Potter and the Prisoner of Azkaban (2004)', 0.4357134402128312)
("Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone) (2001)", 0.43178996248200396)
('Harry Potter and the Order of the Phoenix (2007)', 0.4229389293942761)
('Harry Potter and the Chamber of Secrets (2002)', 0.3671601806736893)
('Fantastic Four (2005)', 0.3353315920406499)
('Harry Potter and the Half-Blood Prince (2009)', 0.3331589196927088)
('Akeelah and the Bee (2006)', 0.32993059904151223)
("Pirates of the Caribbean: Dead Man's Chest (2006)", 0.3256862595375306)
("Jet Li's Fearless (Huo Yuan Jia) (2006)", 0.3018063896287586)
('Boogeyman (2005)', 0.30044379975365604)
('Cry_Wolf (a.k.a. Cry Wolf) (2005)', 0.30044379975365604)
('Dear Frankie (2004)', 0.30044379975365604)
('Marine, The (2006)', 0.30044379975365604)
('Paper Clips (2004)', 0.30044379975365604)
('Sarah Silverman: Jesus Is Magic (2005)', 0.30044379975365604)
('Harry Potter and the Deathly Hallows: Part 1 (2010)', 0.29

<h2>Conclusion</h2>
<p>So this is a very basic collaborative recommender based on calculating Pearson's r correlation coefficient.</p>