# Netflix Recommendation System 
In this project, I demonstrate two ways of building a recommender system:
1. Popularity Based
2. Content Based 

### Popularity based
A popularity-based recommender system recommends the items that are most popular overall, regardless of what an individual user has watched.

`For example: the top 10 movies watched in your region or country`

### Content based 
A content-based recommender system recommends the items most similar to the content being consumed.

`For example: watching The Godfather will give you the top N movies similar to The Godfather`

In [1]:
import pandas as pd 
import numpy as np 

In [2]:
movie_names = pd.read_csv('movies (1).csv')
movie_names.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [3]:
ratings = pd.read_csv('ratings.csv')
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


In [4]:
# df = pd.merge(movie_names, ratings, how='left', 
#               on='movieId')

In [5]:
# df.shape

In [6]:
df2 = pd.merge(movie_names, ratings, on='movieId')
df2.shape

(100004, 6)

## **Criteria** for Popularity Based Recommendation System

The criteria are:
1. Highest average rating
2. Number of ratings (used here as a proxy for number of views)
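
As a compact illustration of combining the two criteria, here is a minimal sketch on a made-up ratings table. The titles, ratings, and thresholds are assumptions chosen for the toy data, not values from the dataset used in this notebook.

```python
import pandas as pd

# Toy data for illustration only; these titles and ratings are made up.
toy = pd.DataFrame({
    'title':  ['A', 'A', 'A', 'B', 'B', 'C'],
    'rating': [4.0, 5.0, 4.5, 2.0, 3.0, 5.0],
})

# Named aggregation: one row per title with its mean rating and rating count.
stats = toy.groupby('title')['rating'].agg(rating='mean', rating_counts='count')

# Hypothetical thresholds, scaled down for the toy data.
min_rating, min_count = 3, 2
popular = stats[(stats['rating'] > min_rating) & (stats['rating_counts'] >= min_count)]
popular = popular.sort_values(by='rating', ascending=False)
print(popular)
```

Title `C` has a perfect mean from a single rating, so it only drops out once the count threshold is applied. That is why the mean rating alone is not a reliable popularity signal.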

In [7]:
df2.groupby('title')['rating'].mean().sort_values(ascending=False).head()

title
Ivan Vasilievich: Back to the Future (Ivan Vasilievich menyaet professiyu) (1973)    5.0
Alien Escape (1995)                                                                  5.0
Boiling Point (1993)                                                                 5.0
Bone Tomahawk (2015)                                                                 5.0
Borgman (2013)                                                                       5.0
Name: rating, dtype: float64

In [8]:
df2.groupby('title')['rating'].count().sort_values(ascending=False).head()

title
Forrest Gump (1994)                          341
Pulp Fiction (1994)                          324
Shawshank Redemption, The (1994)             311
Silence of the Lambs, The (1991)             304
Star Wars: Episode IV - A New Hope (1977)    291
Name: rating, dtype: int64

In [9]:
ratings_mean_count = pd.DataFrame(df2.groupby('title')['rating'].mean())
ratings_mean_count

Unnamed: 0_level_0,rating
title,Unnamed: 1_level_1
"""Great Performances"" Cats (1998)",1.750000
$9.99 (2008),3.833333
'Hellboy': The Seeds of Creation (2004),2.000000
'Neath the Arizona Skies (1934),0.500000
'Round Midnight (1986),2.250000
...,...
xXx (2002),2.478261
xXx: State of the Union (2005),1.000000
¡Three Amigos! (1986),3.258065
À nous la liberté (Freedom for Us) (1931),4.500000


In [10]:
ratings_mean_count['rating_counts'] = pd.DataFrame(df2.groupby('title')['rating'].count())
ratings_mean_count

Unnamed: 0_level_0,rating,rating_counts
title,Unnamed: 1_level_1,Unnamed: 2_level_1
"""Great Performances"" Cats (1998)",1.750000,2
$9.99 (2008),3.833333,3
'Hellboy': The Seeds of Creation (2004),2.000000,1
'Neath the Arizona Skies (1934),0.500000,1
'Round Midnight (1986),2.250000,2
...,...,...
xXx (2002),2.478261,23
xXx: State of the Union (2005),1.000000,1
¡Three Amigos! (1986),3.258065,31
À nous la liberté (Freedom for Us) (1931),4.500000,1


In [11]:
ratings_mean_count['rating'] = round(ratings_mean_count['rating'],1)

In [12]:
ratings_mean_count

Unnamed: 0_level_0,rating,rating_counts
title,Unnamed: 1_level_1,Unnamed: 2_level_1
"""Great Performances"" Cats (1998)",1.8,2
$9.99 (2008),3.8,3
'Hellboy': The Seeds of Creation (2004),2.0,1
'Neath the Arizona Skies (1934),0.5,1
'Round Midnight (1986),2.2,2
...,...,...
xXx (2002),2.5,23
xXx: State of the Union (2005),1.0,1
¡Three Amigos! (1986),3.3,31
À nous la liberté (Freedom for Us) (1931),4.5,1


In [13]:
ratings_mean_count = ratings_mean_count[(ratings_mean_count['rating'] > 3) & (ratings_mean_count['rating_counts'] > 100)]

In [14]:
ratings_mean_count = ratings_mean_count.sort_values(by='rating', ascending=False)
ratings_mean_count

Unnamed: 0_level_0,rating,rating_counts
title,Unnamed: 1_level_1,Unnamed: 2_level_1
"Godfather, The (1972)",4.5,200
"Shawshank Redemption, The (1994)",4.5,311
"Usual Suspects, The (1995)",4.4,201
"Godfather: Part II, The (1974)",4.4,135
Pulp Fiction (1994),4.3,324
...,...,...
Cliffhanger (1993),3.1,106
Dumb & Dumber (Dumb and Dumber) (1994),3.1,158
Home Alone (1990),3.1,129
"Mask, The (1994)",3.1,157


To recommend by region, take the subset of viewing records for that region and count how often each film was watched there.
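
A minimal sketch of that idea follows. The `region` column, the `views` table, and the helper `top_n_in_region` are hypothetical: the ratings data loaded above has no region information, so the rows here are made up.

```python
import pandas as pd

# Hypothetical viewing log: the 'region' column is an assumption for illustration;
# the ratings data used in this notebook has no such column.
views = pd.DataFrame({
    'region': ['UAE', 'UAE', 'UAE', 'UK', 'UK', 'UAE'],
    'title':  ['Movie A', 'Movie B', 'Movie A', 'Movie B', 'Movie C', 'Movie A'],
})

def top_n_in_region(views, region, n=10):
    """Return the n most-watched titles in one region with their view counts."""
    in_region = views[views['region'] == region]
    # value_counts() sorts by count in descending order
    return in_region['title'].value_counts().head(n)

top_uae = top_n_in_region(views, 'UAE', n=2)
print(top_uae)
```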

## Content Based Recommender System 

Calculating Cosine Similarity

In [15]:
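from math import sqrt

def square_rooted(x):
    # Euclidean norm of vector x
    return sqrt(sum(a * a for a in x))

def cosine_similarity(x, y):
    # dot product divided by the product of the two norms
    numerator = sum(a * b for a, b in zip(x, y))
    denominator = square_rooted(x) * square_rooted(y)
    # round to 3 decimals; round() without a digit count would collapse the score to 0 or 1
    return round(numerator / denominator, 3)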
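
To make the behaviour concrete, here is a small self-contained sketch with hand-checkable vectors. The function name `cosine_sim` and the zero-vector convention are choices made for this example only.

```python
from math import sqrt

def cosine_sim(x, y):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = sqrt(sum(a * a for a in x))
    norm_y = sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # convention chosen here: an all-zero vector is similar to nothing
    return dot / (norm_x * norm_y)

same_direction = cosine_sim([1, 2, 3], [2, 4, 6])   # parallel vectors
orthogonal = cosine_sim([1, 0], [0, 1])             # no shared components
partial = cosine_sim([1, 1, 0], [1, 0, 1])          # one shared component out of two
print(round(same_direction, 3), round(orthogonal, 3), round(partial, 3))
```

Parallel vectors score close to 1, vectors with nothing in common score 0, and the partially overlapping pair lands in between at about 0.5.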

In [67]:
from sklearn.metrics.pairwise import cosine_similarity # same computation as the function above, applied to whole matrices; this import replaces that function's name
from sklearn.feature_extraction.text import CountVectorizer 

pd.set_option('display.max_columns', 100)
new_movies = pd.read_csv('https://query.data.world/s/uikepcpffyo2nhig52xxeevdialfl7')

In [68]:
# .copy() gives an independent DataFrame, so the column assignments below do not raise SettingWithCopyWarning
df = new_movies[['Title', 'Genre', 'Director', 'Actors', 'Plot']].copy()

In [69]:
# splitting the comma-separated actors into a list and keeping only the first three names
df['Actors'] = df['Actors'].map(lambda x: x.split(',')[:3])

In [70]:
# putting the genres in a list of words
df['Genre'] = df['Genre'].map(lambda x: x.lower().split(','))

In [71]:
df['Director'] = df['Director'].map(lambda x: x.split(' '))

In [72]:
# Lowercase the names and remove spaces so that each person becomes a single token,
# e.g. 'ROBBIN' and 'robbin' are not counted as two different people.
# Whole columns are assigned here because writing to the rows yielded by iterrows()
# is not guaranteed to update the DataFrame.
df['Actors'] = df['Actors'].map(lambda names: [n.lower().replace(' ', '') for n in names])
df['Director'] = df['Director'].map(lambda parts: ''.join(parts).lower())

In [75]:
import rake_nltk
from rake_nltk import Rake
import nltk
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/mhlaghari/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /Users/mhlaghari/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [77]:
key_words = []

for plot in df['Plot']:
    # instantiating Rake
    r = Rake()

    # extracting key words by passing the text
    r.extract_keywords_from_text(plot)

    # getting the dictionary with key words and their scores
    key_words_dict_scores = r.get_word_degrees()

    # collecting the key words for this movie
    key_words.append(list(key_words_dict_scores.keys()))

# assigning the key words to the new column
df['Key_words'] = pd.Series(key_words, index=df.index)

# Dropping the plot column
df = df.drop(columns='Plot')

In [78]:
key_words_dict_scores

defaultdict(<function rake_nltk.rake.Rake._build_word_co_occurance_graph.<locals>.<lambda>()>,
            {'mumbai': 3,
             'teen': 3,
             'reflects': 3,
             'upbringing': 1,
             'slums': 1,
             'accused': 1,
             'cheating': 1,
             'indian': 2,
             'version': 2,
             'wants': 1,
             'millionaire': 2,
             '?"': 2})

In [80]:
df.head()

Unnamed: 0,Title,Genre,Director,Actors,Key_words
0,The Shawshank Redemption,"[crime, drama]",frankdarabont,"[timrobbins, morganfreeman, bobgunton]","[two, imprisoned, men, bond, number, years, fi..."
1,The Godfather,"[crime, drama]",francisfordcoppola,"[marlonbrando, alpacino, jamescaan]","[aging, patriarch, organized, crime, dynasty, ..."
2,The Godfather: Part II,"[crime, drama]",francisfordcoppola,"[alpacino, robertduvall, dianekeaton]","[early, life, career, vito, corleone, 1920s, n..."
3,The Dark Knight,"[action, crime, drama]",christophernolan,"[christianbale, heathledger, aaroneckhart]","[menace, known, joker, emerges, mysterious, pa..."
4,12 Angry Men,"[crime, drama]",sidneylumet,"[martinbalsam, johnfiedler, leej.cobb]","[jury, holdout, attempts, prevent, miscarriage..."


In [81]:
df.set_index('Title', inplace=True)

In [82]:
df.head()

Unnamed: 0_level_0,Genre,Director,Actors,Key_words
Title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
The Shawshank Redemption,"[crime, drama]",frankdarabont,"[timrobbins, morganfreeman, bobgunton]","[two, imprisoned, men, bond, number, years, fi..."
The Godfather,"[crime, drama]",francisfordcoppola,"[marlonbrando, alpacino, jamescaan]","[aging, patriarch, organized, crime, dynasty, ..."
The Godfather: Part II,"[crime, drama]",francisfordcoppola,"[alpacino, robertduvall, dianekeaton]","[early, life, career, vito, corleone, 1920s, n..."
The Dark Knight,"[action, crime, drama]",christophernolan,"[christianbale, heathledger, aaroneckhart]","[menace, known, joker, emerges, mysterious, pa..."
12 Angry Men,"[crime, drama]",sidneylumet,"[martinbalsam, johnfiedler, leej.cobb]","[jury, holdout, attempts, prevent, miscarriage..."


In [83]:
# Join the genre, director, actor and key word tokens of each movie into one string
feature_columns = ['Genre', 'Director', 'Actors', 'Key_words']
bags = []
for _, row in df.iterrows():
    words = ''
    for col in feature_columns:
        if col != 'Director':
            words = words + ' '.join(row[col]) + ' '   # list of tokens
        else:
            words = words + row[col] + ' '             # single string
    bags.append(words)
df['bag_of_words'] = bags

# Display only the combined column; df itself keeps the other columns
df[['bag_of_words']]

Unnamed: 0_level_0,bag_of_words
Title,Unnamed: 1_level_1
The Shawshank Redemption,crime drama frankdarabont timrobbins morganfr...
The Godfather,crime drama francisfordcoppola marlonbrando a...
The Godfather: Part II,crime drama francisfordcoppola alpacino rober...
The Dark Knight,action crime drama christophernolan christia...
12 Angry Men,crime drama sidneylumet martinbalsam johnfied...
...,...
The Lost Weekend,drama film-noir billywilder raymilland janewy...
Short Term 12,drama destindanielcretton brielarson johngalla...
His Girl Friday,comedy drama romance howardhawks carygrant r...
The Straight Story,biography drama davidlynch sissyspacek janega...


In [84]:
count = CountVectorizer()
count_matrix = count.fit_transform(df['bag_of_words'])

In [86]:
c= count_matrix.todense()
c

matrix([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]])

In [87]:
print(count_matrix[0,:])

  (0, 584)	1
  (0, 768)	1
  (0, 1011)	1
  (0, 2678)	1
  (0, 1810)	1
  (0, 306)	1
  (0, 2765)	1
  (0, 1269)	1
  (0, 1733)	1
  (0, 311)	1
  (0, 1899)	1
  (0, 2950)	1
  (0, 969)	1
  (0, 2481)	1
  (0, 888)	1
  (0, 2174)	1
  (0, 59)	1
  (0, 519)	1
  (0, 655)	1


In [88]:
#Generate cosine similarity matrix 
cos_sim = cosine_similarity(count_matrix, count_matrix)
cos_sim

array([[1.        , 0.15789474, 0.13764944, ..., 0.05263158, 0.05263158,
        0.05564149],
       [0.15789474, 1.        , 0.36706517, ..., 0.05263158, 0.05263158,
        0.05564149],
       [0.13764944, 0.36706517, 1.        , ..., 0.04588315, 0.04588315,
        0.04850713],
       ...,
       [0.05263158, 0.05263158, 0.04588315, ..., 1.        , 0.05263158,
        0.05564149],
       [0.05263158, 0.05263158, 0.04588315, ..., 0.05263158, 1.        ,
        0.05564149],
       [0.05564149, 0.05564149, 0.04850713, ..., 0.05564149, 0.05564149,
        1.        ]])
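
To show what the count matrix and the similarity matrix contain, here is a minimal pure-Python sketch. It uses `collections.Counter` with whitespace splitting as a stand-in for `CountVectorizer`, whose default tokenisation differs (for example, it lowercases text and drops single-character tokens). The film names and bags of words are made up.

```python
from collections import Counter
from math import sqrt

# Made-up bags of words in the same shape as the bag_of_words column.
bags = {
    'Film A': 'crime drama directorone actorone actortwo mafia family',
    'Film B': 'crime drama directorone actortwo actorthree mafia career',
    'Film C': 'comedy romance directortwo actorfour newspaper editor',
}

def count_vector(text, vocabulary):
    """Word counts of `text` laid out in vocabulary order (whitespace tokenisation)."""
    counts = Counter(text.split())
    return [counts[word] for word in vocabulary]

def cosine_sim(x, y):
    """Dot product divided by the product of the norms; 0.0 for an all-zero vector."""
    dot = sum(a * b for a, b in zip(x, y))
    norms = sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y))
    return dot / norms if norms else 0.0

# One column per distinct word, one row per film.
vocabulary = sorted({word for text in bags.values() for word in text.split()})
vectors = {title: count_vector(text, vocabulary) for title, text in bags.items()}

titles = list(bags)
sim = [[cosine_sim(vectors[a], vectors[b]) for b in titles] for a in titles]
for title, row in zip(titles, sim):
    print(title, [round(s, 2) for s in row])
```

Film A and Film B share five of their seven words, so their score is 5/7. Film C shares no words with either and scores 0 against both.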

In [89]:
# creating a series for the movie titles so they are associated with an ordered numerical list
indices = pd.Series(df.index)
indices[:20]

0                              The Shawshank Redemption
1                                         The Godfather
2                                The Godfather: Part II
3                                       The Dark Knight
4                                          12 Angry Men
5                                      Schindler's List
6         The Lord of the Rings: The Return of the King
7                                          Pulp Fiction
8                                            Fight Club
9     The Lord of the Rings: The Fellowship of the Ring
10                                         Forrest Gump
11       Star Wars: Episode V - The Empire Strikes Back
12                                            Inception
13                The Lord of the Rings: The Two Towers
14                      One Flew Over the Cuckoo's Nest
15                                           Goodfellas
16                                           The Matrix
17                   Star Wars: Episode IV - A N

In [90]:
# Function that takes in movie title as input and returns the top 10 recommended movies
def recommendations(title, cos_sim=cos_sim):
    
    recommended_movies = []
    
    # getting the index of the movie that matches the title
    idx= indices[indices == title].index[0]
    
    # creating a Series with the similarity score in descending order 
    score_series = pd.Series(cos_sim[idx]).sort_values(ascending=False)
    
    #getting the indexes of the 10 most similar movies 
    top_10_indexes = list(score_series.iloc[1:11].index)
    print(top_10_indexes)
    
    #populating the list with the title of the best 10 matching movies
    for i in top_10_indexes:
        recommended_movies.append(list(df.index)[i])
        
    return recommended_movies
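
A variation on the same lookup, sketched with NumPy on a made-up catalogue. The function `top_n_similar` and its parameters are hypothetical names for this example. It makes the number of results configurable and removes the query movie by position instead of assuming it is ranked first.

```python
import numpy as np

def top_n_similar(title, titles, sim, n=10):
    """Return up to n (title, score) pairs most similar to `title`, excluding itself."""
    if title not in titles:
        raise KeyError(f'{title!r} is not in the catalogue')
    idx = titles.index(title)
    scores = np.asarray(sim[idx], dtype=float)
    order = np.argsort(-scores, kind='stable')        # highest score first
    order = [i for i in order if i != idx][:n]         # drop the query movie by position
    return [(titles[i], float(scores[i])) for i in order]

# Toy catalogue and similarity matrix, made up for illustration.
titles = ['Film A', 'Film B', 'Film C', 'Film D']
sim = np.array([
    [1.0, 0.7, 0.1, 0.4],
    [0.7, 1.0, 0.2, 0.3],
    [0.1, 0.2, 1.0, 0.0],
    [0.4, 0.3, 0.0, 1.0],
])
print(top_n_similar('Film A', titles, sim, n=2))
```

Dropping the query by position matters when another movie has an identical bag of words: both then score 1.0, and slicing from rank 1 could leave the query itself in the results.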

### Now for the fun part: the recommendations my project gives for some of the movies I like

- Interstellar
- Snatch
- Blood Diamond
- The Godfather
- Fight Club

In [109]:
recommendations('Interstellar')

[40, 222, 55, 237, 74, 219, 12, 167, 199, 69]


['The Prestige',
 'The Martian',
 'Aliens',
 'The Revenant',
 '2001: A Space Odyssey',
 'The Avengers',
 'Inception',
 'The Truman Show',
 'Guardians of the Galaxy',
 'Eternal Sunshine of the Spotless Mind']

In [110]:
recommendations('Snatch')

[54, 115, 109, 218, 234, 151, 1, 125, 214, 43]


['Once Upon a Time in America',
 'The Wolf of Wall Street',
 'Lock, Stock and Two Smoking Barrels',
 'The Killing',
 'Blood Diamond',
 'Butch Cassidy and the Sundance Kid',
 'The Godfather',
 'The Big Lebowski',
 'Arsenic and Old Lace',
 'The Great Dictator']

In [112]:
recommendations('Blood Diamond')

[237, 34, 201, 98, 140, 239, 232, 147, 161, 63]


['The Revenant',
 'The Departed',
 'Jaws',
 'The Gold Rush',
 'Shutter Island',
 'The Manchurian Candidate',
 'JFK',
 'Stand by Me',
 'What Ever Happened to Baby Jane?',
 'Requiem for a Dream']

In [113]:
recommendations('The Godfather')

[2, 83, 128, 226, 100, 15, 123, 76, 110, 66]


['The Godfather: Part II',
 'Scarface',
 'Fargo',
 'Rope',
 'On the Waterfront',
 'Goodfellas',
 'Cool Hand Luke',
 'Baby Driver',
 'Casino',
 'A Clockwork Orange']

In [114]:
recommendations('Fight Club')

[137, 246, 123, 85, 135, 245, 243, 167, 53, 26]


['Gone Girl',
 'Short Term 12',
 'Cool Hand Luke',
 'Good Will Hunting',
 'Into the Wild',
 'The Lost Weekend',
 'Big Fish',
 'The Truman Show',
 'American Beauty',
 'American History X']