# Build a Movie Recommendation System Using Collaborative Filtering

---

## Outline

1. Introduction
2. Data Cleaning & Preparation
   * Reading Data in Pandas
   * Cleaning Movie Titles Using Regex
   * Creating a TFIDF Matrix 
   * Creating a Search Function 
3. Building an Interactive Search Box in Jupyter 
4. Building a Recommendation System
   * Finding Users Who Liked the Same Movie
   * Determining How Much Users Like Movies
   * Creating a Recommendation Score
   * Building a Recommendation Function
   * Creating an Interactive Recommendation Widget
---

## 1. Introduction

This project builds an interactive movie recommendation system: type in the title of a movie you have watched and immediately get ten recommendations for other movies you might want to watch. The recommendations come from collaborative filtering, that is, from the ratings of other users who liked the same movie.

In [1]:
import numpy as np
import pandas as pd
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import ipywidgets as widgets
from IPython.display import display

## 2. Data Cleaning & Preparation 

### Reading Data in Pandas

This dataset (ml-25m) describes 5-star rating and free-text tagging activity. It contains 25000095 ratings and 1093360 tag applications across 62423 movies. The data were created by 162541 users between January 9, 1995 and November 21, 2019, and the dataset was generated on November 21, 2019.

The dataset comes from [MovieLens](https://movielens.org).

In [2]:
movies=pd.read_csv("movies.csv")
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [3]:
movies.dtypes

movieId     int64
title      object
genres     object
dtype: object

In [4]:
ratings=pd.read_csv("ratings.csv")
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,296,5.0,1147880044
1,1,306,3.5,1147868817
2,1,307,5.0,1147868828
3,1,665,5.0,1147878820
4,1,899,3.5,1147868510


In [5]:
ratings.shape

(25000095, 4)

In [6]:
ratings["userId"].unique().size

162541

In [7]:
ratings.dtypes

userId         int64
movieId        int64
rating       float64
timestamp      int64
dtype: object

### Cleaning Movie Titles Using Regex
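Replace every character that is not a letter or digit (parentheses, commas, apostrophes and so on) with a space, so that punctuation does not interfere with the search.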

In [8]:
def clean_title(title):
    # Replace every non-alphanumeric character with a space
    new_title = re.sub(r"[^a-zA-Z0-9]", " ", title)
    return new_title

In [9]:
movies["clean_title"]=movies["title"].apply(clean_title)
movies.head()

Unnamed: 0,movieId,title,genres,clean_title
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,Toy Story 1995
1,2,Jumanji (1995),Adventure|Children|Fantasy,Jumanji 1995
2,3,Grumpier Old Men (1995),Comedy|Romance,Grumpier Old Men 1995
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,Waiting to Exhale 1995
4,5,Father of the Bride Part II (1995),Comedy,Father of the Bride Part II 1995
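As a quick check of what the substitution does to a single title, here is a minimal sketch. `clean_title` repeats the notebook's function; `clean_title_compact` is a hypothetical variant (not part of the notebook) that also collapses the runs of spaces left behind by the replaced punctuation.

```python
import re

def clean_title(title):
    # Same substitution as the notebook: non-alphanumerics become spaces
    return re.sub(r"[^a-zA-Z0-9]", " ", title)

def clean_title_compact(title):
    # Hypothetical variant: collapse repeated whitespace and trim the ends
    return re.sub(r"\s+", " ", clean_title(title)).strip()

raw = "Bug's Life, A (1998)"
print(repr(clean_title(raw)))          # spaces remain where punctuation was
print(repr(clean_title_compact(raw)))  # single spaces only
```

The extra spaces should not matter for the TF-IDF step, since the vectorizer splits the text into word tokens anyway; the compact form is mainly easier to read.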


### Creating a TFIDF Matrix 
Instead of looking only at individual words in a title, the vectorizer also looks at pairs of consecutive words (ngram_range=(1,2)), which makes the search more accurate. For example, for Toy Story 1995 it considers "toy story" and "story 1995" as terms in addition to "toy", "story" and "1995".

In [10]:
vectorizer=TfidfVectorizer(ngram_range=(1,2))
tfidf=vectorizer.fit_transform(movies["clean_title"])
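To make the effect of `ngram_range=(1,2)` concrete, here is a rough plain-Python sketch of the terms considered for one title. The helper `word_ngrams` is hypothetical: it only imitates the vectorizer's default tokenization (lowercasing and keeping tokens of two or more characters) and does not compute any TF-IDF weights.

```python
import re

def word_ngrams(text, n_min=1, n_max=2):
    # Lowercase, then keep tokens made of two or more word characters
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of n consecutive tokens over the title
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

print(word_ngrams("Toy Story 1995"))
print(word_ngrams("Toy Story 3"))  # the single-character token "3" is dropped
```

Under this tokenization the sequel number in a title such as Toy Story 3 does not become a term of its own, so sequels are told apart mostly by their release year.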

### Creating a Search Function
The search function computes the similarity between a search term and every title in the data.

It does the following:
* Clean the search term with clean_title
* Convert the search term into a TF-IDF vector using the fitted vectorizer
* Use cosine_similarity to compare that vector with the vectors of all titles
* Return the five most similar titles (np.argpartition selects the top five but does not sort them by similarity)
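For intuition about the similarity measure used in these steps, here is a minimal NumPy sketch of cosine similarity on small made-up vectors. The function `cosine` and the vectors are illustrative assumptions; the notebook itself relies on scikit-learn's `cosine_similarity`.

```python
import numpy as np

def cosine(a, b):
    # Dot product divided by the product of the vector lengths
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query = np.array([1.0, 1.0, 0.0])      # made-up term weights for a query
same = np.array([1.0, 1.0, 0.0])       # a title using the same terms
partial = np.array([1.0, 0.0, 0.0])    # a title sharing one of the two terms
unrelated = np.array([0.0, 0.0, 1.0])  # a title sharing no terms

for name, vec in [("same", same), ("partial", partial), ("unrelated", unrelated)]:
    print(name, round(cosine(query, vec), 4))
```

Titles sharing more terms with the query score closer to 1, and titles with no terms in common score 0.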

In [11]:
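def search(title):
    title = clean_title(title)
    # Vectorize the query with the vocabulary learned from the movie titles
    query_vec = vectorizer.transform([title])
    similarity = cosine_similarity(query_vec, tfidf).flatten()
    # Positions of the 5 highest similarities (not sorted among themselves)
    indices = np.argpartition(similarity, -5)[-5:]
    results = movies.iloc[indices].iloc[::-1]
    return results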

In [12]:
search("toy")

Unnamed: 0,movieId,title,genres,clean_title
14813,78499,Toy Story 3 (2010),Adventure|Animation|Children|Comedy|Fantasy|IMAX,Toy Story 3 2010
3021,3114,Toy Story 2 (1999),Adventure|Animation|Children|Comedy|Fantasy,Toy Story 2 1999
4823,4929,"Toy, The (1982)",Comedy,Toy The 1982
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,Toy Story 1995
59767,201588,Toy Story 4 (2019),Adventure|Animation|Children|Comedy,Toy Story 4 2019
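Because `np.argpartition` only guarantees that the selected positions hold the largest values, the rows above are not necessarily ordered from best to worst match. If a strict ranking is wanted, one option is to sort the selected positions afterwards. The sketch below uses made-up similarity values rather than the notebook's data.

```python
import numpy as np

similarity = np.array([0.10, 0.90, 0.30, 0.70, 0.50, 0.20, 0.80])  # made-up scores
k = 3

# Positions of the k largest values, in no particular order
top_unordered = np.argpartition(similarity, -k)[-k:]

# Order just those k positions from highest to lowest similarity
order = np.argsort(similarity[top_unordered])[::-1]
top_sorted = top_unordered[order]

print(top_sorted)              # positions, best match first
print(similarity[top_sorted])  # their similarity values
```

Sorting only the k selected entries keeps the cost low compared with sorting the full similarity array.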


## 3. Building an Interactive Search Box in Jupyter

In [13]:
movie_input = widgets.Text(value="  ", description="Movie Title:", disabled=False)
movie_search = widgets.Output()

def on_type(movie):
    # movie is the change notification; movie["new"] holds the current text
    with movie_search:
        movie_search.clear_output()
        title = movie["new"]
        # Only search once more than 5 characters have been typed
        if len(title) > 5:
            display(search(title))

movie_input.observe(on_type, names="value")

display(movie_input, movie_search)

Text(value='  ', description='Movie Title:')

Output()
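The callback registered with `observe` receives a dictionary-like change notification whose `"new"` entry holds the current text. The sketch below exercises that logic without a widget, using a plain dict as a stand-in. The helper `extract_query` and its whitespace stripping are assumptions for illustration; the notebook's `on_type` checks the raw length instead.

```python
def extract_query(change, min_chars=5):
    # Hypothetical helper: ignore surrounding whitespace before measuring length
    text = change["new"].strip()
    return text if len(text) > min_chars else None

# Plain dicts standing in for the change notification passed to the callback
typed_enough = {"name": "value", "old": "  ", "new": "  Toy Story", "type": "change"}
too_short = {"name": "value", "old": "  ", "new": "  Toy ", "type": "change"}

print(extract_query(typed_enough))
print(extract_query(too_short))
```

Stripping first means the two spaces used as the widget's initial value do not count toward the minimum length.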

## 4. Building a Recommendation System

### Finding Users Who Liked the Same Movie


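To recommend movies based on one we liked, we establish a threshold for recommendations. Here, "liked" means a rating above 4 (that is, 4.5 or 5).

* Find the users who also liked the same movie we liked.
* Find the other movies that they liked.
* Keep only the movies that more than 10% of those similar users liked.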



In [14]:
# Example: Toy Story (movieId 1)
movieID=1

In [15]:
#Find the users who also liked the same movie we liked 
samemovie_userid=ratings[(ratings["movieId"]==movieID)&(ratings["rating"]>4)]["userId"].unique()
samemovie_userid

array([    36,     75,     86, ..., 162527, 162530, 162533])

In [16]:
# 18835 users rated Toy Story above 4
samemovie_userid.shape

(18835,)

In [17]:
#Find the other movies that they liked 
otherlike_movieid=ratings[(ratings["userId"].isin(samemovie_userid))&(ratings["rating"]>4)]["movieId"]
otherlike_movieid

5101            1
5105           34
5111          110
5114          150
5127          260
            ...  
24998854    60069
24998861    67997
24998876    78499
24998884    81591
24998888    88129
Name: movieId, Length: 1358326, dtype: int64

In [18]:
# Keep only the movies that more than 10% of similar users liked
otherlike_ratio=otherlike_movieid.value_counts()/len(samemovie_userid)
otherlike_ratio=otherlike_ratio[otherlike_ratio>0.1]
otherlike_ratio

1        1.000000
318      0.445607
260      0.403770
356      0.370215
296      0.367295
           ...   
953      0.103053
551      0.101195
1222     0.100876
745      0.100345
48780    0.100186
Name: movieId, Length: 113, dtype: float64
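To see the same three steps on data small enough to check by hand, here is a sketch on a toy ratings table. All values are made up, and the 0.5 threshold is chosen only because the table is tiny; the notebook uses 0.1 on the full dataset.

```python
import pandas as pd

# Toy ratings table (made-up values, not from MovieLens)
toy = pd.DataFrame(
    [(1, 10, 5.0), (1, 20, 4.5),
     (2, 10, 5.0), (2, 20, 5.0), (2, 30, 4.5),
     (3, 10, 4.5), (3, 30, 3.0),
     (4, 10, 2.0), (4, 30, 5.0), (4, 40, 5.0),
     (5, 30, 5.0)],
    columns=["userId", "movieId", "rating"],
)
seed_movie = 10

# Step 1: users who rated the seed movie above 4
similar_users = toy[(toy["movieId"] == seed_movie) & (toy["rating"] > 4)]["userId"].unique()

# Step 2: every movie those users rated above 4
liked = toy[toy["userId"].isin(similar_users) & (toy["rating"] > 4)]["movieId"]

# Step 3: share of similar users who liked each movie, then apply the threshold
similar_ratio = liked.value_counts() / len(similar_users)
kept = similar_ratio[similar_ratio > 0.5]
print(kept)
```

Users 1, 2 and 3 liked the seed movie. Two of the three also liked movie 20, so it passes the threshold, while movie 30 (liked by one of three) does not.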

### Determining How Much Users Like Movies

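* Find all users who rated highly any movie in our set of recommended movies
* Find what fraction of those users liked each of these movies (the denominator is the number of users who liked at least one movie in the set, not every user in the dataset)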


In [19]:
#Find all users who rated a movie highly that is in our set of recommended movies
all_userlike=ratings[(ratings["movieId"].isin(otherlike_ratio.index))&(ratings["rating"]>4)]
all_userlike

Unnamed: 0,userId,movieId,rating,timestamp
0,1,296,5.0,1147880044
29,1,4973,4.5,1147869080
48,1,7361,5.0,1147880055
72,2,110,5.0,1141416589
76,2,260,5.0,1141417172
...,...,...,...,...
25000062,162541,5618,4.5,1240953299
25000065,162541,5952,5.0,1240952617
25000078,162541,7153,5.0,1240952613
25000081,162541,7361,4.5,1240953484


In [20]:
#Find what percentage of all users recommend each of these movies
alluserlike_ratio=all_userlike["movieId"].value_counts()/len(all_userlike["userId"].unique())
alluserlike_ratio

318      0.342220
296      0.284674
2571     0.244033
356      0.235266
593      0.225909
           ...   
551      0.040918
50872    0.039111
745      0.037031
78499    0.035131
2355     0.025091
Name: movieId, Length: 113, dtype: float64
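The choice of denominator affects these ratios. The sketch below, on a made-up toy table, compares the notebook's denominator (users who liked at least one candidate movie) with a hypothetical alternative that divides by every user in the table. The names `notebook_ratio` and `alt_ratio` are illustrative only.

```python
import pandas as pd

# Toy ratings table (made-up values, not from MovieLens)
toy = pd.DataFrame(
    [(1, 10, 5.0), (1, 20, 4.5),
     (2, 10, 5.0), (2, 20, 5.0), (2, 30, 4.5),
     (3, 10, 4.5), (3, 30, 3.0),
     (4, 10, 2.0), (4, 30, 5.0), (4, 40, 5.0),
     (5, 30, 5.0)],
    columns=["userId", "movieId", "rating"],
)
candidates = [20, 30]  # stand-in for otherlike_ratio.index

# High ratings of the candidate movies by any user
liked_any = toy[toy["movieId"].isin(candidates) & (toy["rating"] > 4)]
counts = liked_any["movieId"].value_counts()

# Notebook's approach: divide by users who liked at least one candidate
notebook_ratio = counts / liked_any["userId"].nunique()

# Hypothetical alternative: divide by every user in the table
alt_ratio = counts / toy["userId"].nunique()

print(notebook_ratio)
print(alt_ratio)
```

Both versions rank the movies identically, because each one divides every count by the same constant; only the absolute values differ.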

### Creating a Recommendation Score
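otherlike_ratio: the fraction of similar users (those who liked the same movie) who liked each candidate movie  
alluserlike_ratio: the fraction of users in general who liked each candidate movie  
Comparing the two shows which movies are liked unusually often by similar users rather than being popular with everyone.  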

* Concatenate similar user recommendations and all user recommendations.
* Create a score by dividing similar user recommendations by all user recommendations.
* Sort the values to show highest values first.
* Take the top 10 recommendations and merge them with movies data.


In [21]:
#Concatenate similar user recommendations and all user recommendations.
ratio_compare=pd.concat([otherlike_ratio, alluserlike_ratio], axis=1)
ratio_compare

Unnamed: 0,movieId,movieId.1
1,1.000000,0.124728
318,0.445607,0.342220
260,0.403770,0.222207
356,0.370215,0.235266
296,0.367295,0.284674
...,...,...
953,0.103053,0.045792
551,0.101195,0.040918
1222,0.100876,0.066877
745,0.100345,0.037031


In [22]:
ratio_compare.columns=["similar_users","all_users"]
ratio_compare

Unnamed: 0,similar_users,all_users
1,1.000000,0.124728
318,0.445607,0.342220
260,0.403770,0.222207
356,0.370215,0.235266
296,0.367295,0.284674
...,...,...
953,0.103053,0.045792
551,0.101195,0.040918
1222,0.100876,0.066877
745,0.100345,0.037031


In [23]:
#Create a score by dividing similar user recommendations by all user recommendations
ratio_compare["score"]=ratio_compare["similar_users"]/ratio_compare["all_users"]
ratio_compare

Unnamed: 0,similar_users,all_users,score
1,1.000000,0.124728,8.017414
318,0.445607,0.342220,1.302105
260,0.403770,0.222207,1.817089
356,0.370215,0.235266,1.573604
296,0.367295,0.284674,1.290232
...,...,...,...
953,0.103053,0.045792,2.250441
551,0.101195,0.040918,2.473085
1222,0.100876,0.066877,1.508376
745,0.100345,0.037031,2.709748
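The score is a plain ratio, so it can be verified by hand. The sketch below recomputes it for two rows of the table above from their printed (rounded) ratios, so the results match the table only up to rounding.

```python
# (similar_users, all_users) ratios copied from the table above
rows = {
    1: (1.000000, 0.124728),    # Toy Story
    318: (0.445607, 0.342220),
}

# score = share among similar users / share among users in general
scores = {movie_id: similar / overall for movie_id, (similar, overall) in rows.items()}

for movie_id, score in scores.items():
    print(movie_id, round(score, 4))
```

A score well above 1 means similar users like the movie far more often than users in general. A score near 1, as for movie 318, means the movie is about as popular with similar users as with everyone, so it says little about this particular taste.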


In [24]:
ratio_compare.sort_values("score",ascending=False).head(10).merge(movies,left_index=True,right_on="movieId")

Unnamed: 0,similar_users,all_users,score,movieId,title,genres,clean_title
0,1.0,0.124728,8.017414,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,Toy Story 1995
3021,0.280648,0.053706,5.225654,3114,Toy Story 2 (1999),Adventure|Animation|Children|Comedy|Fantasy,Toy Story 2 1999
2264,0.110539,0.025091,4.405452,2355,"Bug's Life, A (1998)",Adventure|Animation|Children|Comedy,Bug s Life A 1998
14813,0.15296,0.035131,4.354038,78499,Toy Story 3 (2010),Adventure|Animation|Children|Comedy|Fantasy|IMAX,Toy Story 3 2010
4780,0.235147,0.070811,3.320783,4886,"Monsters, Inc. (2001)",Adventure|Animation|Children|Comedy|Fantasy,Monsters Inc 2001
580,0.216618,0.067513,3.208539,588,Aladdin (1992),Adventure|Animation|Children|Comedy|Musical,Aladdin 1992
6258,0.228139,0.072268,3.156862,6377,Finding Nemo (2003),Adventure|Animation|Children|Comedy,Finding Nemo 2003
587,0.1794,0.059977,2.99115,595,Beauty and the Beast (1991),Animation|Children|Fantasy|Musical|Romance|IMAX,Beauty and the Beast 1991
8246,0.203504,0.068453,2.972889,8961,"Incredibles, The (2004)",Action|Adventure|Animation|Children|Comedy,Incredibles The 2004
359,0.253411,0.085764,2.954762,364,"Lion King, The (1994)",Adventure|Animation|Children|Drama|Musical|IMAX,Lion King The 1994


In [25]:
ratio_compare.sort_values("score",ascending=False).head(10).merge(movies,left_index=True,right_on="movieId")[["score","title","genres"]]

Unnamed: 0,score,title,genres
0,8.017414,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
3021,5.225654,Toy Story 2 (1999),Adventure|Animation|Children|Comedy|Fantasy
2264,4.405452,"Bug's Life, A (1998)",Adventure|Animation|Children|Comedy
14813,4.354038,Toy Story 3 (2010),Adventure|Animation|Children|Comedy|Fantasy|IMAX
4780,3.320783,"Monsters, Inc. (2001)",Adventure|Animation|Children|Comedy|Fantasy
580,3.208539,Aladdin (1992),Adventure|Animation|Children|Comedy|Musical
6258,3.156862,Finding Nemo (2003),Adventure|Animation|Children|Comedy
587,2.99115,Beauty and the Beast (1991),Animation|Children|Fantasy|Musical|Romance|IMAX
8246,2.972889,"Incredibles, The (2004)",Action|Adventure|Animation|Children|Comedy
359,2.954762,"Lion King, The (1994)",Adventure|Animation|Children|Drama|Musical|IMAX


### Building a Recommendation Function
This function combines the three steps above (finding similar users, measuring how much users in general like each candidate movie, and computing the score) and returns the ten highest-scoring movies for a given movie ID.

In [26]:
def recommend_movies(movie_id):
    #Finding Users Who Liked the Same Movie
    samemovie_userid=ratings[(ratings["movieId"]==movie_id)&(ratings["rating"]>4)]["userId"].unique()
    otherlike_movieid=ratings[ratings["userId"].isin(samemovie_userid)&(ratings["rating"]>4)]["movieId"]
    otherlike_ratio=otherlike_movieid.value_counts()/len(samemovie_userid)
    otherlike_ratio=otherlike_ratio[otherlike_ratio>0.1]
    #Determining How Much Users Like Movies
    all_userlike=ratings[(ratings["movieId"].isin(otherlike_ratio.index))&(ratings["rating"]>4)]
    alluserlike_ratio=all_userlike["movieId"].value_counts()/len(all_userlike["userId"].unique())
    #Creating a Recommendation Score
    ratio_compare=pd.concat([otherlike_ratio,alluserlike_ratio],axis=1)
    ratio_compare.columns=["similar_users","all_users"]
    ratio_compare["score"]=ratio_compare["similar_users"]/ratio_compare["all_users"]
    movies_list=ratio_compare.sort_values("score",ascending=False).head(10).merge(movies,left_index=True,right_on="movieId")
    return movies_list[["score","title","genres"]]  

In [27]:
recommend_movies(1)

Unnamed: 0,score,title,genres
0,8.017414,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
3021,5.225654,Toy Story 2 (1999),Adventure|Animation|Children|Comedy|Fantasy
2264,4.405452,"Bug's Life, A (1998)",Adventure|Animation|Children|Comedy
14813,4.354038,Toy Story 3 (2010),Adventure|Animation|Children|Comedy|Fantasy|IMAX
4780,3.320783,"Monsters, Inc. (2001)",Adventure|Animation|Children|Comedy|Fantasy
580,3.208539,Aladdin (1992),Adventure|Animation|Children|Comedy|Musical
6258,3.156862,Finding Nemo (2003),Adventure|Animation|Children|Comedy
587,2.99115,Beauty and the Beast (1991),Animation|Children|Fantasy|Musical|Romance|IMAX
8246,2.972889,"Incredibles, The (2004)",Action|Adventure|Animation|Children|Comedy
359,2.954762,"Lion King, The (1994)",Adventure|Animation|Children|Drama|Musical|IMAX
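In the output above the seed movie itself (Toy Story) ranks first, since every similar user liked it by definition. If that row is not wanted, one option is to drop the seed before taking the top rows. The sketch below is not part of the notebook's `recommend_movies`; it uses a small Series of scores indexed by movieId, with values rounded from the output above.

```python
import pandas as pd

# Scores indexed by movieId (rounded from the recommendation output)
scores = pd.Series({1: 8.02, 3114: 5.23, 2355: 4.41, 78499: 4.35}, name="score")
seed_movie = 1

# Remove the seed movie if present, then keep the highest-scoring rows
top = (
    scores.drop(labels=[seed_movie], errors="ignore")
    .sort_values(ascending=False)
    .head(3)
)
print(top)
```

Passing `errors="ignore"` keeps the call safe when the seed movie is not among the candidates.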


### Creating an Interactive Recommendation Widget

In [28]:
movie_input=widgets.Text(value="  ",description="Movie Title:",disabled=False)
movie_list=widgets.Output()

def on_type(movie):
    with movie_list:
        movie_list.clear_output()
        title = movie["new"]
        if len(title) > 3:
            results = search(title)
            # Use the first row returned by search as the movie to recommend from
            movie_id = results.iloc[0]["movieId"]
            display(recommend_movies(movie_id))
movie_input.observe(on_type, names="value")
        
display(movie_input,movie_list)

Text(value='  ', description='Movie Title:')

Output()

---

## References
https://reurl.cc/ymN4ra  
https://reurl.cc/deklG8  
https://grouplens.org/datasets/movielens/  
https://movielens.org  