<h1>Pearson's r Recommender Built with Python and Pandas</h1>

In [1]:
import pandas as pd
import numpy as np

<p>The dataset I've used is the small MovieLens dataset (ml-latest-small), available for public download <a href="https://grouplens.org/datasets/movielens/">here</a>.</p>
<p><b>Dataset description:</b> 100,000 ratings and 3,600 tag applications applied to 9,000 movies by 600 users. Last updated 9/2018.</p>

In [16]:
movies = pd.read_csv('ml-latest-small/movies.csv')
print(f'movies dataframe has {movies.shape[0]} rows and {movies.shape[1]} columns')
movies.tail(50)

movies dataframe has 9742 rows and 3 columns


,movieId,title,genres
9692,184471,Tomb Raider (2018),Action|Adventure|Fantasy
9693,184641,Fullmetal Alchemist 2018 (2017),Action|Adventure|Fantasy
9694,184721,First Reformed (2017),Drama|Thriller
9695,184791,Fred Armisen: Standup for Drummers (2018),Comedy
9696,184931,Death Wish (2018),Action|Crime|Drama|Thriller
9697,184987,A Wrinkle in Time (2018),Adventure|Children|Fantasy|Sci-Fi
9698,184997,"Love, Simon (2018)",Comedy|Drama
9699,185029,A Quiet Place (2018),Drama|Horror|Thriller
9700,185031,Alpha (2018),Adventure|Thriller
9701,185033,I Kill Giants (2018),Drama|Fantasy|Thriller


In [3]:
ratings = pd.read_csv('ml-latest-small/ratings.csv')
print(f'ratings dataframe has {ratings.shape[0]} rows and {ratings.shape[1]} columns')
ratings.head()

ratings dataframe has 100836 rows and 4 columns


,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


<p>It will be convenient to replace movieId with the movie title in the ratings dataframe. To do that we can use the .map method, which is <i>used for substituting each value in a Series with another value, that may be derived from a function, a dict or a Series.</i></p>


In [4]:
%%time
def replace_id_with_name(movie_id):
    """Return the title of the movie with the given movieId."""
    return movies[movies['movieId'] == movie_id].title.values[0]

ratings.movieId = ratings.movieId.map(replace_id_with_name)

CPU times: user 52.5 s, sys: 8.58 ms, total: 52.5 s
Wall time: 52.5 s
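
<p>Most of the 52 seconds above goes into filtering the whole movies dataframe once per rating. A possible faster variant, sketched here on a few toy rows rather than the real files, builds the movieId-to-title lookup once and passes it to .map as a Series. The name <i>id_to_title</i> and the toy rows are assumptions made for illustration.</p>

```python
import pandas as pd

# Toy stand-ins for movies.csv and ratings.csv (illustrative rows, same column names)
movies = pd.DataFrame({
    'movieId': [1, 3, 6],
    'title': ['Toy Story (1995)', 'Grumpier Old Men (1995)', 'Heat (1995)'],
})
ratings = pd.DataFrame({
    'userId': [1, 1, 2],
    'movieId': [1, 6, 3],
    'rating': [4.0, 4.0, 3.5],
})

# Build the movieId -> title lookup once, as a Series indexed by movieId
id_to_title = movies.set_index('movieId')['title']

# Passing a Series to .map does one index lookup per value
# instead of scanning the movies dataframe for every rating
ratings['movieId'] = ratings['movieId'].map(id_to_title)
print(ratings)
```

<p>Any movieId missing from the lookup would come back as NaN instead of raising an error, so it may be worth checking for missing titles afterwards.</p>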


In [5]:
ratings.head()

,userId,movieId,rating,timestamp
0,1,Toy Story (1995),4.0,964982703
1,1,Grumpier Old Men (1995),4.0,964981247
2,1,Heat (1995),4.0,964982224
3,1,Seven (a.k.a. Se7en) (1995),5.0,964983815
4,1,"Usual Suspects, The (1995)",5.0,964982931


<p>To create recommendations we have to build a pivot table with one row per user and one column per movie, using the <i>pivot_table</i> method.</p>


In [11]:
matrix = ratings.pivot_table(index=['userId'],
                             columns=['movieId'],
                             values='rating')
print(f"""matrix_of_users_by_movies has {matrix.shape[0]} rows and {matrix.shape[1]} columns""")

matrix.head(20)

matrix_of_users_by_movies has 610 rows and 9719 columns


movieId,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
userId,,,,,,,,,,,,,,,,,,,,,
1,,,,,,,,,,,...,,,,,,,,,4.0,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,
6,,,,,,,,,,,...,,,,,,,,,,
7,,,,,,,,,,,...,,,,,,,,,,
8,,,,,,,,,,,...,,,,,,,,,,
9,,,,,,,,,,,...,,,,,,,1.0,,,
10,,,,,,,,,,,...,,,,,,,,,,


<p>This table is very sparse: at most 100,836 of its 610 × 9,719 (about 5.9 million) cells hold a rating, which is under 2%, and the rest are NaN. That is expected, because no user has rated every movie.</p>
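
<p>As a rough illustration of where the NaN values come from, the sketch below pivots a toy ratings table and measures the share of empty cells. The rows and the name <i>missing_share</i> are made up for the example and are not taken from the dataset.</p>

```python
import pandas as pd

# Toy ratings: three users, two movies, four ratings (illustrative values)
ratings = pd.DataFrame({
    'userId':  [1, 1, 2, 3],
    'movieId': ['Heat (1995)', 'Toy Story (1995)', 'Toy Story (1995)', 'Heat (1995)'],
    'rating':  [4.0, 4.0, 3.0, 5.0],
})

# One row per user, one column per movie; unrated pairs become NaN
matrix = ratings.pivot_table(index='userId', columns='movieId', values='rating')

# Share of cells with no rating
missing_share = matrix.isna().sum().sum() / matrix.size

print(matrix)
print(f'{missing_share:.0%} of the cells are NaN')
```

<p>Running the same two lines on the full matrix should give the exact sparsity of the real data.</p>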


In [12]:
def pearson_r(series1, series2):
    """Take two pd.Series objects and return the Pearson correlation between the two."""
    # Keep only the users who rated both movies, so every sum runs over the same rows
    both = series1.notna() & series2.notna()
    series1_c = series1[both] - series1[both].mean()
    series2_c = series2[both] - series2[both].mean()
    # Numerator: sum of products of the two centered series
    return np.sum(series1_c * series2_c) / np.sqrt(np.sum(series1_c ** 2) * np.sum(series2_c ** 2))

In [19]:
pearson_r(matrix['Sherlock - A Study in Pink (2010)'], matrix['(500) Days of Summer (2009)'])

In [22]:
pearson_r(matrix['Sherlock - A Study in Pink (2010)'], matrix["eXistenZ (1999)"])
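
<p>One way a pairwise correlation like this could be turned into recommendations is to score every other movie against a chosen title and sort the scores. The sketch below does that on a toy matrix. The movie names, the ratings and the names <i>target</i> and <i>ranked</i> are assumptions for illustration, and the <i>pearson_r</i> variant here assumes the correlation is computed only over users who rated both movies.</p>

```python
import numpy as np
import pandas as pd

def pearson_r(series1, series2):
    """Pearson correlation over the users who rated both movies (assumed variant)."""
    both = series1.notna() & series2.notna()
    s1 = series1[both] - series1[both].mean()
    s2 = series2[both] - series2[both].mean()
    return np.sum(s1 * s2) / np.sqrt(np.sum(s1 ** 2) * np.sum(s2 ** 2))

# Toy users-by-movies matrix with missing ratings (illustrative values)
toy = pd.DataFrame({
    'Movie A': [5.0, 4.0, 1.0, 2.0, np.nan],
    'Movie B': [4.0, 5.0, 2.0, 1.0, 3.0],
    'Movie C': [1.0, 2.0, 5.0, np.nan, 4.0],
}, index=pd.Index([1, 2, 3, 4, 5], name='userId'))

target = 'Movie A'

# Correlate the target column with every other column, then sort best-first
scores = toy.drop(columns=target).apply(lambda col: pearson_r(toy[target], col))
ranked = scores.sort_values(ascending=False)
print(ranked)
```

<p>On real data, correlations based on only a handful of shared raters are unreliable, so a minimum-overlap filter before ranking would probably be needed.</p>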