## MovieLens correlations

This notebook investigates correlations between movie ratings in the MovieLens data set.

In [2]:
import pandas as pd
import numpy as np
from scipy.stats import pearsonr

In [3]:
movies = pd.read_csv('movies.csv')
ratings = pd.read_csv('ratings.csv')
df = pd.merge(movies, ratings)
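
When `pd.merge` is called without `on`, it joins on every column name the two frames share. The sketch below names the key explicitly, assuming the usual MovieLens layout in which `movieId` is the only shared column. The toy frames and their values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for movies.csv and ratings.csv (illustrative values only).
movies_toy = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['A (2000)', 'B (2001)', 'C (2002)'],
})
ratings_toy = pd.DataFrame({
    'userId': [10, 10, 11],
    'movieId': [1, 2, 1],
    'rating': [4.0, 3.5, 5.0],
})

# Naming the key guards against accidental joins on other shared columns.
# validate='one_to_many' raises if movieId is duplicated in movies_toy.
merged_toy = pd.merge(movies_toy, ratings_toy, on='movieId', validate='one_to_many')
print(merged_toy)
```

The default join is an inner join, so a movie without any ratings (movie 3 here) does not appear in the result.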

Now, let's create a table with one row per user and one column per movie, where each cell holds that user's rating of that movie (NaN if the user did not rate it).

In [4]:
user_ratings = pd.pivot_table(df, index='userId', columns='title', values='rating')
user_ratings.head()

| userId | '71 (2014) | 'Hellboy': The Seeds of Creation (2004) | 'Round Midnight (1986) | 'Salem's Lot (2004) | 'Til There Was You (1997) | 'Tis the Season for Love (2015) | 'burbs, The (1989) | 'night Mother (1986) | (500) Days of Summer (2009) | *batteries not included (1987) | ... | Zulu (2013) | [REC] (2007) | [REC]² (2009) | [REC]³ 3 Génesis (2012) | anohana: The Flower We Saw That Day - The Movie (2013) | eXistenZ (1999) | xXx (2002) | xXx: State of the Union (2005) | ¡Three Amigos! (1986) | À nous la liberté (Freedom for Us) (1931) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 4.0 | NaN |
| 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
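
To make the reshaping step concrete, here is a minimal sketch on a toy long-format frame (the values are invented). It shows how `pivot_table` turns one row per rating into a users-by-movies grid, and how sparse that grid is.

```python
import pandas as pd

# Toy long-format ratings (illustrative values, not from the data set).
long_toy = pd.DataFrame({
    'userId': [1, 1, 2, 3],
    'title': ['A', 'B', 'A', 'B'],
    'rating': [4.0, 3.0, 5.0, 2.0],
})

# One row per user, one column per title; unrated pairs become NaN.
wide_toy = pd.pivot_table(long_toy, index='userId', columns='title', values='rating')
print(wide_toy)

# Fraction of user/movie cells that are empty.
sparsity = wide_toy.isna().to_numpy().mean()
print(f'sparsity: {sparsity:.2f}')
```

Note that `pivot_table` aggregates duplicate (user, title) pairs with the mean by default, so repeated ratings would be averaged rather than raise an error.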


Let's show the correlations for a favorite movie of mine, "Pulp Fiction (1994)".

In [5]:
fav_movie = 'Pulp Fiction (1994)'
corr = user_ratings.corrwith(user_ratings[fav_movie])
corr.sort_values(ascending=False)[0:10]


title
Coffee Town (2013)                            1.0
Wolfman, The (2010)                           1.0
Swiss Army Man (2016)                         1.0
When in Rome (2010)                           1.0
Tinker, Tailor, Soldier, Spy (1979)           1.0
War Zone, The (1999)                          1.0
Eddie Izzard: Dress to Kill (1999)            1.0
Hope Floats (1998)                            1.0
Hoodwinked! (2005)                            1.0
Claymation Christmas Celebration, A (1987)    1.0
dtype: float64
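
A top ten consisting entirely of 1.0 values is most likely an artifact of sparse data: the correlation is computed only over users who rated both movies, and with just two such users the Pearson correlation is always +1 or -1 (unless one set of ratings is constant). A small sketch with invented ratings illustrates this:

```python
import numpy as np
import pandas as pd

# Toy ratings: 'Obscure' shares only two raters with 'Popular'.
toy = pd.DataFrame({
    'Popular': [5.0, 3.0, 4.0, 2.0, 1.0],
    'Obscure': [4.0, 3.5, np.nan, np.nan, np.nan],
})

# Pearson correlation uses only rows where both columns are present.
r = toy['Obscure'].corr(toy['Popular'])

# Number of users who rated both movies.
overlap = int((toy['Obscure'].notna() & toy['Popular'].notna()).sum())
print(r, overlap)
```

Two points always lie on a straight line, so the coefficient says nothing about how similar the movies really are.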

In [6]:
reduced = user_ratings.dropna(axis=1, thresh=100)
corr = reduced.corrwith(reduced[fav_movie])
corr.sort_values(ascending=False)[0:10]

title
Pulp Fiction (1994)                                     1.000000
Fight Club (1999)                                       0.543465
Kill Bill: Vol. 1 (2003)                                0.504147
Trainspotting (1996)                                    0.437714
Kill Bill: Vol. 2 (2004)                                0.421685
Usual Suspects, The (1995)                              0.411700
Amelie (Fabuleux destin d'Amélie Poulain, Le) (2001)    0.402193
Eternal Sunshine of the Spotless Mind (2004)            0.401534
Reservoir Dogs (1992)                                   0.394687
Twelve Monkeys (a.k.a. 12 Monkeys) (1995)               0.391141
dtype: float64
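
`dropna(axis=1, thresh=100)` keeps only the movies with at least 100 ratings overall. That does not directly guarantee a large overlap with the favorite movie, so one possible refinement is to filter on the number of shared raters instead. The helper below is a hypothetical sketch of that idea, not part of pandas, and the toy data is invented:

```python
import numpy as np
import pandas as pd

def corr_with_min_overlap(ratings, title, min_overlap):
    """Correlate every column with `title`, keeping only columns that
    share at least `min_overlap` raters with it (hypothetical helper)."""
    target = ratings[title]
    # For each movie, count users who rated both it and the target.
    overlap = ratings.loc[target.notna()].notna().sum()
    corr = ratings.corrwith(target)
    return corr[overlap >= min_overlap].sort_values(ascending=False)

# Toy users-by-movies frame (illustrative values only).
toy = pd.DataFrame({
    'Fav': [5, 4, 1, 2, np.nan],
    'Similar': [4, 5, 2, 1, 3],
    'Sparse': [np.nan, np.nan, 1, 2, 5],
})

result = corr_with_min_overlap(toy, 'Fav', min_overlap=3)
print(result)
```

Here `Sparse` shares only two raters with `Fav`, so it is dropped even though its raw correlation would be a perfect 1.0.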

In [8]:
user_ratings_full = user_ratings.fillna(0)
user_ratings_full.corrwith(user_ratings_full[fav_movie]).sort_values(ascending=False)[0:10]

title
Pulp Fiction (1994)                 1.000000
Seven (a.k.a. Se7en) (1995)         0.517086
Usual Suspects, The (1995)          0.470340
Silence of the Lambs, The (1991)    0.464080
Reservoir Dogs (1992)               0.449298
Goodfellas (1990)                   0.443585
Shawshank Redemption, The (1994)    0.413893
Trainspotting (1996)                0.398049
Fargo (1996)                        0.394537
American History X (1998)           0.391515
dtype: float64
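
Filling missing ratings with 0 changes what the correlation measures: "not rated" is treated as a rating below any real one, so two movies rated by the same users tend to correlate positively even when those users disagree about them. This may be why the zero-filled ranking leans toward widely rated movies. A minimal sketch with invented ratings:

```python
import numpy as np
import pandas as pd

# Two movies rated by the same two users, who disagree completely.
toy = pd.DataFrame({
    'A': [5.0, 1.0, np.nan, np.nan, np.nan, np.nan],
    'B': [1.0, 5.0, np.nan, np.nan, np.nan, np.nan],
})

# Pairwise correlation over common raters only: perfectly negative.
r_pairwise = toy['A'].corr(toy['B'])

# After zero-filling, the shared block of zeros pulls the correlation positive.
filled = toy.fillna(0)
r_filled = filled['A'].corr(filled['B'])

print(r_pairwise, r_filled)
```

The zero-filled value therefore mixes two signals, shared audience and agreement in taste, and should be read with that in mind.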

In [9]:
user_ratings_full.corrwith(user_ratings_full['Toy Story (1995)']).sort_values(ascending=False)[0:10]

title
Toy Story (1995)                              1.000000
Toy Story 2 (1999)                            0.461761
Groundhog Day (1993)                          0.361540
Independence Day (a.k.a. ID4) (1996)          0.358473
Willy Wonka & the Chocolate Factory (1971)    0.357314
Mission: Impossible (1996)                    0.352847
Nutty Professor, The (1996)                   0.350295
Bug's Life, A (1998)                          0.345431
Lion King, The (1994)                         0.344248
Babe (1995)                                   0.341136
dtype: float64

In [10]:
user_ratings_full.corrwith(user_ratings_full['Titanic (1997)']).sort_values(ascending=False)[0:10]

title
Titanic (1997)                                      1.000000
Men in Black (a.k.a. MIB) (1997)                    0.413630
Star Wars: Episode I - The Phantom Menace (1999)    0.401172
Catch Me If You Can (2002)                          0.372946
Truman Show, The (1998)                             0.365216
Wedding Crashers (2005)                             0.358098
Shrek (2001)                                        0.354767
Finding Nemo (2003)                                 0.349419
Saving Private Ryan (1998)                          0.349046
Good Will Hunting (1997)                            0.345960
dtype: float64
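
The last three cells repeat the same expression with a different title. One way to avoid the repetition is a small helper; `top_correlated` is a hypothetical name introduced here for illustration, and the toy frame stands in for the real ratings table:

```python
import pandas as pd

def top_correlated(ratings, title, n=10):
    """Return the n movies whose rating columns correlate most strongly
    with `title` (hypothetical helper; `ratings` is users-by-movies)."""
    corr = ratings.corrwith(ratings[title])
    return corr.sort_values(ascending=False).head(n)

# Toy users-by-movies frame with no missing values (illustrative only).
toy = pd.DataFrame({
    'X': [5, 4, 1, 2],
    'Y': [4, 5, 2, 1],
    'Z': [1, 2, 5, 4],
})

top = top_correlated(toy, 'X', n=2)
print(top)
```

With the notebook's data, `top_correlated(user_ratings_full, 'Titanic (1997)')` should give the same ranking as the cell above.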