<h1>Pearsons R Recommeder built with Python and Pandas</h1>

In [2]:
import pandas as pd
import numpy as np

<p>The dataset I've used is a popular MovieLens dataset available for public download <a href="https://grouplens.org/datasets/movielens/">here</a>.</p>
<p><b>Dataset description:</b> MovieLens 20M movie ratings. Stable benchmark dataset. 20 million ratings and 465,000 tag applications applied to 27,000 movies by 138,000 users. Includes tag genome data with 12 million relevance scores across 1,100 tags. Released 4/2015; updated 10/2016 to update links.csv and add tag genome data.8.</p>

In [3]:
movies = pd.read_csv('ml-20m/movies.csv')
print(f'movies dataframe has {movies.shape[0]} rows and {movies.shape[1]} columns')
movies.tail(5)

movies dataframe has 27278 rows and 3 columns


Unnamed: 0,movieId,title,genres
27273,131254,Kein Bund für's Leben (2007),Comedy
27274,131256,"Feuer, Eis & Dosenbier (2002)",Comedy
27275,131258,The Pirates (2014),Adventure
27276,131260,Rentun Ruusu (2001),(no genres listed)
27277,131262,Innocence (2014),Adventure|Fantasy|Horror


In [4]:
ratings = pd.read_csv('ml-20m/ratings.csv')
print(f'ratings dataframe has {ratings.shape[0]} rows and {ratings.shape[1]} columns')
ratings.head()

ratings dataframe has 20000263 rows and 4 columns


Unnamed: 0,userId,movieId,rating,timestamp
0,1,2,3.5,1112486027
1,1,29,3.5,1112484676
2,1,32,3.5,1112484819
3,1,47,3.5,1112484727
4,1,50,3.5,1112484580


<p>Will be convinient to replace movieId with movie title in the rating column. To do that we may use .map mathod that is <i>used for substituting each value in a Series with another value, that may be derived from a function, a dict or a Series.</i></p>


In [None]:
%%time
def replace_id_with_name(id):
    return movies[movies['movieId'] == id].title.values[0]

ratings.movieId = ratings.movieId.map(replace_id_with_name)

In [5]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,Toy Story (1995),4.0,964982703
1,1,Grumpier Old Men (1995),4.0,964981247
2,1,Heat (1995),4.0,964982224
3,1,Seven (a.k.a. Se7en) (1995),5.0,964983815
4,1,"Usual Suspects, The (1995)",5.0,964982931


<p>To create recommendations we have to build s pivot table using <i>pivot_table</i> command.</p>


In [11]:
matrix = ratings.pivot_table(index=['userId'],
                             columns=['movieId'],
                             values='rating')
print(f"""matrix_of_users_by_movies has {matrix.shape[0]} rows and {matrix.shape[1]} columns""")

matrix.head(20)

matrix_of_users_by_movies has 610 rows and 9719 columns


movieId,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,,,,,,,,,,,...,,,,,,,,,4.0,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,
6,,,,,,,,,,,...,,,,,,,,,,
7,,,,,,,,,,,...,,,,,,,,,,
8,,,,,,,,,,,...,,,,,,,,,,
9,,,,,,,,,,,...,,,,,,,1.0,,,
10,,,,,,,,,,,...,,,,,,,,,,


<p>This table is very sparse, it contains mostly NaN values which is kind of natural because not every person saw every movie.</p>


In [30]:
def pearson_r(series1, series2):
    """Takw two pd.Series objects and return a pearson correlation berween the two"""
    series1_c = series1 - series1.mean()
    series2_c = series2 - series2.mean()
    return np.sum(series1_c * series2_c) / np.sqrt(np.sum(series1_c ** 2) * np.sum(series2_c ** 2))

In [43]:
print('The correlation coefficient between movies is:')
pearson_r(matrix['Sherlock - A Study in Pink (2010)'], matrix['(500) Days of Summer (2009)'])

print('\nNumber of ratings for first movie:')
print(matrix['Sherlock - A Study in Pink (2010)'].count())
print('\nNumber of ratings for second movie:')
print(matrix['(500) Days of Summer (2009)'].count())

The correlation coefficient between movies is:

Number of ratings for first movie:
2

Number of ratings for second movie:
42


In [33]:
movies[movies['title'].str.match('Matr')]

Unnamed: 0,movieId,title,genres
1939,2571,"Matrix, The (1999)",Action|Sci-Fi|Thriller
4351,6365,"Matrix Reloaded, The (2003)",Action|Adventure|Sci-Fi|Thriller|IMAX
4639,6934,"Matrix Revolutions, The (2003)",Action|Adventure|Sci-Fi|Thriller|IMAX


In [34]:
pearson_r(matrix['Matrix, The (1999)'], matrix["Matrix Reloaded, The (2003)"])

0.24007237063979342

In [35]:
pearson_r(matrix['Matrix Revolutions, The (2003)'], matrix["Matrix Reloaded, The (2003)"])

0.5492530881058778

In [27]:
movies[movies['title'].str.match('Harry')]

Unnamed: 0,movieId,title,genres
2528,3388,Harry and the Hendersons (1987),Children|Comedy
3574,4896,Harry Potter and the Sorcerer's Stone (a.k.a. ...,Adventure|Children|Fantasy
4076,5816,Harry Potter and the Chamber of Secrets (2002),Adventure|Fantasy
5166,8368,Harry Potter and the Prisoner of Azkaban (2004),Adventure|Fantasy|IMAX
6062,40815,Harry Potter and the Goblet of Fire (2005),Adventure|Fantasy|Thriller|IMAX
6522,54001,Harry Potter and the Order of the Phoenix (2007),Adventure|Drama|Fantasy|IMAX
7078,69844,Harry Potter and the Half-Blood Prince (2009),Adventure|Fantasy|Mystery|Romance|IMAX
7285,74948,Harry Brown (2009),Crime|Drama|Thriller
7465,81834,Harry Potter and the Deathly Hallows: Part 1 (...,Action|Adventure|Fantasy|IMAX
7644,88125,Harry Potter and the Deathly Hallows: Part 2 (...,Action|Adventure|Drama|Fantasy|Mystery|IMAX


In [41]:
print('The correlation coefficient between movies is:')
print(pearson_r(matrix['Harry Potter and the Goblet of Fire (2005)'], 
                matrix["Harry Potter and the Order of the Phoenix (2007)"]))

print('\nNumber of ratings for first movie:')
print(matrix['Harry Potter and the Goblet of Fire (2005)'].count())
print('\nNumber of ratings for second movie:')
print(matrix['Harry Potter and the Order of the Phoenix (2007)'].count())

The correlation coefficient between movies is:
0.4229389293942761

Number of ratings for first movie:
71

Number of ratings for second movie:
58


In [37]:
pearson_r(matrix['Harry Potter and the Goblet of Fire (2005)'], matrix['Matrix, The (1999)'])

0.052211472313611715

In [45]:
pearson_r(matrix['Harry Potter and the Goblet of Fire (2005)'], matrix['Harry Potter and the Prisoner of Azkaban (2004)'])

0.4357134402128312