

# Recommender Systems



In [2]:
import pandas as pd
from scipy import sparse
from sklearn.metrics.pairwise import pairwise_distances

## Loading in `movies.csv` and `ratings.csv`
---

I'll be using the [MovieLens](https://grouplens.org/datasets/movielens/) dataset to build the recommendation engine. There are two CSVs (`movies.csv` and `ratings.csv`) that I'll eventually inner join on `movieId`.

In [4]:
# Read both MovieLens CSVs into DataFrames
movies = pd.read_csv('ml-latest-small/movies.csv')

ratings = pd.read_csv('ml-latest-small/ratings.csv')

## Drop unnecessary columns
---

I don't need the `timestamp` column from `ratings` or the `genres` column from `movies`, so I'll drop both.

In [5]:
ratings.drop('timestamp', axis=1, inplace=True)
movies.drop('genres', axis=1, inplace=True)

## Merge `movies` and `ratings`
---

In [6]:
# Use `pd.merge` to inner join `movies` with `ratings` on the `movieId` column.

df = pd.merge(ratings, movies, on = 'movieId')

df.shape

(100836, 4)
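
To make the join behaviour concrete, here is a minimal sketch on made-up data. The frames `toy_ratings` and `toy_movies` and their values are hypothetical stand-ins, not rows from MovieLens. It relies on `pd.merge` defaulting to `how='inner'`, so only `movieId` values present in both frames survive.

```python
import pandas as pd

# Hypothetical stand-ins for ratings.csv and movies.csv (values are made up)
toy_ratings = pd.DataFrame({
    'userId': [1, 1, 2, 3],
    'movieId': [10, 20, 10, 99],   # movieId 99 has no entry in toy_movies
    'rating': [4.0, 3.5, 5.0, 2.0],
})
toy_movies = pd.DataFrame({
    'movieId': [10, 20, 30],       # movieId 30 was never rated
    'title': ['Movie A', 'Movie B', 'Movie C'],
})

# how='inner' is the default: unmatched movieIds (30 and 99) are dropped
toy_df = pd.merge(toy_ratings, toy_movies, on='movieId')

print(toy_df)
print(toy_df.shape)
```

Each remaining row is one rating with the movie title attached, which is the same layout as `df` above.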

## Building a pivot table
---

Because I'm creating an item-based collaborative recommender (where the items are movies), I'll set up the pivot table as follows:
1. The `title` will be the index
2. The `userId` will be the column
3. The `rating` will be the value


In [7]:
pivot = pd.pivot_table(df, index = 'title', columns = 'userId', values = 'rating')

In [8]:
pivot.head()

title \ userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
'71 (2014),,,,,,,,,,,...,,,,,,,,,,4.0
'Hellboy': The Seeds of Creation (2004),,,,,,,,,,,...,,,,,,,,,,
'Round Midnight (1986),,,,,,,,,,,...,,,,,,,,,,
'Salem's Lot (2004),,,,,,,,,,,...,,,,,,,,,,
'Til There Was You (1997),,,,,,,,,,,...,,,,,,,,,,
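
The empty cells above are `NaN`: most users have not rated most movies. A minimal sketch of the same pivot on a tiny, made-up frame (`toy_df` and its values are hypothetical) shows where those gaps come from.

```python
import pandas as pd

# Hypothetical long-format ratings: one row per (user, movie) rating
toy_df = pd.DataFrame({
    'userId': [1, 2, 1, 3],
    'title': ['Movie A', 'Movie A', 'Movie B', 'Movie B'],
    'rating': [4.0, 5.0, 3.5, 2.0],
})

# Movies become rows, users become columns, ratings become the cell values
toy_pivot = pd.pivot_table(toy_df, index='title', columns='userId', values='rating')

print(toy_pivot)
# User 3 never rated Movie A and user 2 never rated Movie B, so those cells are NaN
print(toy_pivot.isna().sum().sum())
```

Note that `pd.pivot_table` aggregates with the mean by default, so a user who rated the same title twice would end up with the average of the two ratings in that cell.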


## Creating a sparse matrix
---

In a minute, I'll calculate the cosine similarity between every pair of movies. Before that, I need to convert the pivot table into a sparse matrix using `scipy`'s `sparse` module. Missing ratings are filled with 0 first, so only the actual ratings end up stored:
```python
sparse.csr_matrix(pivot.fillna(0))
```

In [9]:
sparse_pivot = sparse.csr_matrix(pivot.fillna(0))

In [10]:
print(sparse_pivot)

  (0, 609)	4.0
  (1, 331)	4.0
  (2, 331)	3.5
  (2, 376)	3.5
  (3, 344)	5.0
  (4, 112)	3.0
  (4, 344)	5.0
  (5, 20)	1.5
  (6, 11)	5.0
  (6, 18)	2.0
  (6, 90)	2.0
  (6, 94)	3.0
  (6, 171)	4.0
  (6, 216)	4.0
  (6, 287)	3.0
  (6, 293)	1.0
  (6, 306)	3.5
  (6, 376)	3.5
  (6, 413)	3.0
  (6, 473)	1.0
  (6, 476)	3.5
  (6, 519)	4.0
  (6, 554)	5.0
  (6, 560)	4.5
  (6, 598)	2.0
  :	:
  (9717, 26)	5.0
  (9717, 41)	5.0
  (9717, 56)	2.0
  (9717, 67)	4.0
  (9717, 87)	3.5
  (9717, 140)	3.5
  (9717, 197)	2.0
  (9717, 214)	2.5
  (9717, 216)	2.0
  (9717, 220)	3.5
  (9717, 238)	3.0
  (9717, 281)	4.0
  (9717, 293)	4.0
  (9717, 306)	2.5
  (9717, 312)	1.0
  (9717, 413)	3.0
  (9717, 420)	3.0
  (9717, 447)	3.0
  (9717, 473)	3.0
  (9717, 476)	3.5
  (9717, 554)	3.0
  (9717, 560)	4.0
  (9717, 596)	3.0
  (9717, 598)	2.5
  (9718, 526)	1.0
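
Each printed entry is `(row, column)  value`, where the row is a movie's position in `pivot.index` and the column is a user's position in `pivot.columns`. The sketch below, on a hypothetical 2x3 `toy_pivot`, shows that the zeros introduced by `fillna(0)` are not stored, which is where the memory saving comes from.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical pivot table: 2 movies x 3 users, NaN where a user gave no rating
toy_pivot = pd.DataFrame(
    [[4.0, 5.0, np.nan],
     [3.5, np.nan, 2.0]],
    index=['Movie A', 'Movie B'],
    columns=[1, 2, 3],
)

# Fill the gaps with 0, then keep only the non-zero entries in CSR format
toy_sparse = sparse.csr_matrix(toy_pivot.fillna(0))

print(toy_sparse.shape)   # same shape as the pivot table
print(toy_sparse.nnz)     # number of stored (non-zero) ratings

# Fraction of cells that actually hold a rating
density = toy_sparse.nnz / (toy_sparse.shape[0] * toy_sparse.shape[1])
print(round(density, 3))
```

One caveat worth keeping in mind: after `fillna(0)`, "not rated" and a rating of 0 are indistinguishable. MovieLens ratings start at 0.5, so no real rating is lost here.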


## Calculate cosine similarity
---

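`sklearn` has a built-in `pairwise_distances` function that we can use for our recommender. It returns a square matrix comparing every movie with every other movie in the dataset. With `metric='cosine'` the values are cosine *distances* (1 minus cosine similarity): 0 means two movies have rating vectors pointing in the same direction, and 1 means they share no raters.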

```python
pairwise_distances(sparse_pivot, metric='cosine')
```

In [14]:
from sklearn.metrics.pairwise import cosine_similarity

In [15]:
pairwise_distances(sparse_pivot, metric='cosine')

array([[0.        , 1.        , 1.        , ..., 0.67267316, 1.        ,
        1.        ],
       [1.        , 0.        , 0.29289322, ..., 1.        , 1.        ,
        1.        ],
       [1.        , 0.29289322, 0.        , ..., 1.        , 1.        ,
        1.        ],
       ...,
       [0.67267316, 1.        , 1.        , ..., 0.        , 1.        ,
        1.        ],
       [1.        , 1.        , 1.        , ..., 1.        , 0.        ,
        1.        ],
       [1.        , 1.        , 1.        , ..., 1.        , 1.        ,
        0.        ]])

In [19]:
recommender = cosine_similarity(sparse_pivot)
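
The two calls above are related by `similarity = 1 - distance`. Here is a minimal check on a small, made-up matrix (the array `toy` is hypothetical), using the same two `sklearn` functions imported earlier.

```python
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity, pairwise_distances

# Hypothetical ratings: 3 movies (rows) x 3 users (columns), 0 = not rated
toy = sparse.csr_matrix(np.array([
    [4.0, 0.0, 0.0],
    [4.0, 3.0, 0.0],
    [0.0, 0.0, 5.0],
]))

sim = cosine_similarity(toy)                      # higher = more alike
dist = pairwise_distances(toy, metric='cosine')   # lower = more alike

# Rows 0 and 1: dot product 16, norms 4 and 5, so similarity 16 / 20 = 0.8
print(sim[0, 1])
# Rows 0 and 2 share no raters, so similarity 0 and distance 1
print(sim[0, 2], dist[0, 2])
# The two matrices carry the same information
print(np.allclose(sim, 1 - dist))
```

Since `recommender` holds similarities, the most similar movies are the ones with the largest values, which is why the lookups later sort in descending order.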

## Create similarity DataFrame
---

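At this point, we essentially have a recommender. We'll load it into a `pandas` DataFrame for readability.

Because `recommender` was built with `cosine_similarity` rather than `pairwise_distances`, each movie has a similarity of 1 with itself (along the diagonal), and higher values mean more similar movies.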

In [21]:
recommender_df = pd.DataFrame(recommender, columns = pivot.index, index = pivot.index)

recommender_df.head()

title,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
'71 (2014),1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.141653,0.0,...,0.0,0.342055,0.543305,0.707107,0.0,0.0,0.139431,0.327327,0.0,0.0
'Hellboy': The Seeds of Creation (2004),0.0,1.0,0.707107,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
'Round Midnight (1986),0.0,0.707107,1.0,0.0,0.0,0.0,0.176777,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
'Salem's Lot (2004),0.0,0.0,0.0,1.0,0.857493,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
'Til There Was You (1997),0.0,0.0,0.0,0.857493,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Evaluate recommender performance
---

Now comes the fun part! Let's check out a few movies to see if the recommender aligns with our intuition. In the cell below we'll do the following:
1. Create a search term
2. Use that to find all titles matching the search query
3. For each title, we'll list off the following:
   1. The average rating
   2. The number of ratings
   3. The ten most similar movies

In [28]:
search = 'Die Hard'

#getting all the movies that have die hard in the title 
for title in movies.loc[movies['title'].str.contains(search), 'title']:
    print(title)
    
    print('Avg rating', pivot.loc[title, :].mean()) #avg rating for given movie 
    
    print('Count of ratings', pivot.T[title].count())
    
    print('')
    
    print('10 closest movies:')
    
    #getting 10 movies, index 0 is itself
    print(recommender_df[title].sort_values(ascending = False)[1:11])
    
    print('-----------------')

Die Hard: With a Vengeance (1995)
Avg rating 3.5555555555555554
Count of ratings 144

10 closest movies:
title
True Lies (1994)                     0.658258
Speed (1994)                         0.634087
Cliffhanger (1993)                   0.622068
Ace Ventura: Pet Detective (1994)    0.618543
GoldenEye (1995)                     0.615794
Clear and Present Danger (1994)      0.600089
Fugitive, The (1993)                 0.591703
Batman (1989)                        0.586716
Outbreak (1995)                      0.576947
Batman Forever (1995)                0.576758
Name: Die Hard: With a Vengeance (1995), dtype: float64
-----------------
Die Hard (1988)
Avg rating 3.8620689655172415
Count of ratings 145

10 closest movies:
title
Indiana Jones and the Last Crusade (1989)                                         0.663156
Terminator, The (1984)                                                            0.643057
Die Hard 2 (1990)                                                               

In [35]:
search = 'Twilight'

#getting all the movies whose title contains the search term
for title in movies.loc[movies['title'].str.contains(search), 'title']:
    print(title)
    
    print('Avg rating', pivot.loc[title, :].mean()) #avg rating for given movie 
    
    print('Count of ratings', pivot.T[title].count())
    
    print('')
    
    print('10 closest movies:')
    
    #getting 10 movies, index 0 is itself
    print(recommender_df[title].sort_values(ascending = False)[1:11])
    
    print('----------------------')

Twilight (1998)
Avg rating 2.3333333333333335
Count of ratings 3

10 closest movies:
title
Damsels in Distress (2011)                                 0.759257
Over the Garden Wall (2013)                                0.727607
Jane Eyre (2011)                                           0.727607
Janie Jones (2010)                                         0.727607
Company Men, The (2010)                                    0.727607
Sleepwalkers (1992)                                        0.727607
Extraordinary Adventures of Adèle Blanc-Sec, The (2010)    0.727607
Fulltime Killer (Chuen jik sat sau) (2001)                 0.727607
Little Drummer Boy, The (1968)                             0.727607
Funeral, The (1996)                                        0.727607
Name: Twilight (1998), dtype: float64
----------------------
Twilight Zone: The Movie (1983)
Avg rating 3.0
Count of ratings 3

10 closest movies:
title
From Paris with Love (2010)                                  0.927047
Gonzo:

In [34]:
search = 'Princess Bride'

#getting all the movies whose title contains the search term
for title in movies.loc[movies['title'].str.contains(search), 'title']:
    print(title)
    
    print('Avg rating', pivot.loc[title, :].mean()) #avg rating for given movie 
    
    print('Count of ratings', pivot.T[title].count())
    
    print('')
    
    print('10 closest movies:')
    
    #getting 10 movies, index 0 is itself
    print(recommender_df[title].sort_values(ascending = False)[1:11])
    
    print('-----------------')

Princess Bride, The (1987)
Avg rating 4.232394366197183
Count of ratings 142

10 closest movies:
title
Monty Python and the Holy Grail (1975)                                            0.631292
Star Wars: Episode V - The Empire Strikes Back (1980)                             0.602857
Ferris Bueller's Day Off (1986)                                                   0.601883
Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)    0.595145
Groundhog Day (1993)                                                              0.593322
Back to the Future (1985)                                                         0.571364
Star Wars: Episode IV - A New Hope (1977)                                         0.567963
Indiana Jones and the Last Crusade (1989)                                         0.547876
Star Wars: Episode VI - Return of the Jedi (1983)                                 0.543540
Breakfast Club, The (1985)                                                    

In [37]:
search = 'Shape of Water'

#getting all the movies whose title contains the search term
for title in movies.loc[movies['title'].str.contains(search), 'title']:
    print(title)
    
    print('Avg rating', pivot.loc[title, :].mean()) #avg rating for given movie 
    
    print('Count of ratings', pivot.T[title].count())
    
    print('')
    
    print('10 closest movies:')
    
    #getting 10 movies, index 0 is itself
    print(recommender_df[title].sort_values(ascending = False)[1:11])
    
    print('-----------------')

The Shape of Water (2017)
Avg rating 3.6875
Count of ratings 8

10 closest movies:
title
I, Tonya (2017)                          0.605703
The Edge of Seventeen (2016)             0.538383
Isle of Dogs (2018)                      0.536358
Untitled Spider-Man Reboot (2017)        0.525071
Okja (2017)                              0.507349
Angel's Egg (Tenshi no tamago) (1985)    0.507323
Comic Book Villains (2002)               0.506856
Jetée, La (1962)                         0.506856
Man's Best Friend (1993)                 0.506225
Dunkirk (2017)                           0.505933
Name: The Shape of Water (2017), dtype: float64
-----------------
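
The lookup repeated in the cells above could be wrapped in a small helper. This is only a sketch: `top_similar`, `toy_pivot`, and `toy_sim` are hypothetical names and the ratings are made up. It drops the query title by label instead of slicing `[1:11]`, on the assumption that another movie can tie with the query at a similarity of 1.0 (likely when a movie has very few ratings), in which case position 0 is not guaranteed to be the movie itself.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity


def top_similar(similarity_df, title, n=10):
    """Return the n titles most similar to `title`, excluding the title itself."""
    scores = similarity_df[title].drop(labels=title)
    return scores.sort_values(ascending=False).head(n)


# Hypothetical ratings: 4 movies x 3 users, 0 = not rated
toy_pivot = pd.DataFrame(
    [[5.0, 4.0, 0.0],
     [5.0, 4.0, 0.0],   # identical to Movie A, so similarity is 1.0
     [4.0, 5.0, 0.0],
     [0.0, 0.0, 3.0]],
    index=['Movie A', 'Movie B', 'Movie C', 'Movie D'],
    columns=[1, 2, 3],
)

# Same construction as recommender_df, on the toy data
toy_sim = pd.DataFrame(
    cosine_similarity(toy_pivot.to_numpy()),
    index=toy_pivot.index,
    columns=toy_pivot.index,
)

result = top_similar(toy_sim, 'Movie A', n=2)
print(result)
```

With the real data, the equivalent call would be `top_similar(recommender_df, title)` inside the search loop. The Twilight results above are also a reminder that neighbours of a movie with only 3 ratings are driven by a handful of users, so the count of ratings is worth reading alongside the list.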
