# Movie Recommendation Systems

## First Part

Import pandas and read the first dataset.

In [None]:
import pandas as pd

movies = pd.read_csv("movies.csv")

Let's open our data and see what is inside it.

In [None]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


We have three main columns: movie ID, title, and genres. The title and genres are the meaningful ones for us.

Next, we will build a simple search engine.

That is, we want to type in the name of a film and see whether it is in the dataset, or see movies with similar names.

For that we will use the title column. It is full of free text, so we clean it first: anything except letters, digits, and spaces is removed.

In [None]:
import re

def clean_title(title):
    # Keep only letters, digits, and spaces; drop everything else
    title = re.sub("[^a-zA-Z0-9 ]", "", title)
    return title

Add the cleaned titles as a new column

In [None]:
movies["clean_title"] = movies["title"].apply(clean_title)

In [None]:
movies.head()

Unnamed: 0,movieId,title,genres,clean_title
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,Toy Story 1995
1,2,Jumanji (1995),Adventure|Children|Fantasy,Jumanji 1995
2,3,Grumpier Old Men (1995),Comedy|Romance,Grumpier Old Men 1995
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,Waiting to Exhale 1995
4,5,Father of the Bride Part II (1995),Comedy,Father of the Bride Part II 1995
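
The cleaning step can be checked on a few titles in isolation. The sketch below repeats `clean_title` so that it runs on its own; the `samples` list is made up for illustration. One thing it makes visible: because the pattern only keeps ASCII letters, accented characters are dropped as well.

```python
import re

def clean_title(title):
    # Keep only ASCII letters, digits, and spaces
    return re.sub("[^a-zA-Z0-9 ]", "", title)

# Illustrative titles; the last one contains an accented letter
samples = [
    "Toy Story (1995)",
    "Avengers, The (2012)",
    "Thor: The Dark World (2013)",
    "Amélie (2001)",
]

cleaned = [clean_title(s) for s in samples]
for before, after in zip(samples, cleaned):
    print(f"{before!r} -> {after!r}")
```

If titles with accented letters matter for the search, the pattern would need to be widened; that is outside what this notebook does.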


In the last lab, we learnt that text data cannot be used directly; it has to be converted into numbers first. For sentiment analysis we used Bag of Words. Today we will use TF-IDF, another method for converting text into numerical form.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(ngram_range=(1,2))

Next, the titles are converted into a matrix of numbers.

In [None]:
tfidf = vectorizer.fit_transform(movies["clean_title"])
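
To see what the vectorizer produces, here is a small sketch on three made-up cleaned titles (the `titles` list is illustrative, not taken from the dataset). With `ngram_range=(1, 2)` the vocabulary holds single words and pairs of adjacent words. Note that scikit-learn's default tokenizer lowercases the text and ignores single-character tokens, so the "2" in "Toy Story 2" does not become a feature.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative cleaned titles
titles = ["Toy Story 1995", "Toy Story 2 1999", "Jumanji 1995"]

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
tfidf = vectorizer.fit_transform(titles)

# vocabulary_ maps each word or word pair to its column index
vocabulary = sorted(vectorizer.vocabulary_)

print(tfidf.shape)   # (number of titles, number of features)
print(vocabulary)
```

Each row of the matrix is one title, and each column is one word or word pair from the vocabulary.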

How can we find out which movies are similar?

There are many ways to measure similarity between pieces of text.

We will use cosine similarity.
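
Cosine similarity is the dot product of two vectors divided by the product of their lengths. It is 1 when the vectors point in the same direction and 0 when they share no terms. The sketch below computes it by hand with NumPy on made-up vectors; the `cosine` helper is only for illustration, and the notebook itself relies on scikit-learn's `cosine_similarity`.

```python
import numpy as np

def cosine(a, b):
    # dot(a, b) / (|a| * |b|)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 1.0, 0.0])
b = np.array([2.0, 2.0, 0.0])   # same direction as a, different length
c = np.array([0.0, 0.0, 3.0])   # shares no non-zero component with a

same_direction = cosine(a, b)
no_overlap = cosine(a, c)

print(same_direction, no_overlap)
```

Because only the direction matters, a long title and a short title that use the same words can still score as highly similar.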

The function below takes the movie title you enter, cleans it, converts the text into a numeric vector, and then compares that vector with all the other movies to see which ones are similar.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def search(title):
    title = clean_title(title)
    query_vec = vectorizer.transform([title])
    similarity = cosine_similarity(query_vec, tfidf).flatten()
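    # argpartition finds the 5 best matches but does not order them, so rank them explicitly
    indices = np.argpartition(similarity, -5)[-5:]
    indices = indices[np.argsort(similarity[indices])[::-1]]  # best match first
    results = movies.iloc[indices]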

    return results
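
A note on the top-5 selection: `np.argpartition` only guarantees that the five largest similarities end up in the last five positions; it does not order them. The small sketch below, on made-up similarity values, shows one way to rank the selected indices so that the best match comes first.

```python
import numpy as np

# Illustrative similarity scores for seven movies
similarity = np.array([0.10, 0.90, 0.30, 0.75, 0.05, 0.60, 0.20])
k = 3

# Indices of the k largest values, in no particular order
top_k = np.argpartition(similarity, -k)[-k:]

# Order those k indices by their similarity, highest first
ranked = top_k[np.argsort(similarity[top_k])[::-1]]

print(ranked)
print(similarity[ranked])
```

This matters whenever the first row of the result is treated as the best match.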

This is not yet a proper recommendation system.

Below is an interactive way to do the search.

In [None]:
import ipywidgets as widgets
from IPython.display import display

movie_input = widgets.Text(
    value='Toy Story',
    description='Movie Title:',
    disabled=False
)
movie_list = widgets.Output()

def on_type(data):
    with movie_list:
        movie_list.clear_output()
        title = data["new"]
        if len(title) > 5:
            display(search(title))

movie_input.observe(on_type, names='value')


display(movie_input, movie_list)

Text(value='Toy Story', description='Movie Title:')

Output()

## Second Part

In [None]:
ratings = pd.read_csv("ratings.csv")

In [None]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,296,5.0,1147880044
1,1,306,3.5,1147868817
2,1,307,5.0,1147868828
3,1,665,5.0,1147878820
4,1,899,3.5,1147868510


Let's take an example movie to see how to build a proper recommendation system.

**Step 1** I liked the movie Avengers. How do we recommend movies for me?

In [None]:
movie_id = 89745
movie = movies[movies["movieId"] == movie_id] # movieId 89745 corresponds to Avengers, The (2012)

How does the recommendation system work?

Suppose I watched Avengers and liked it very much, so I gave it a 5-star rating.

**Step 2** Find all the other users who liked the same movie, Avengers. We will call them similar_users.

The system now searches for the other users who rated Avengers above 4 stars.

In [None]:
similar_users = ratings[(ratings["movieId"] == movie_id) & (ratings["rating"] > 4)]["userId"].unique()

In [None]:
similar_users # users who rated this movie above 4

array([    21,    187,    208, ..., 162469, 162485, 162532], dtype=int64)
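
The filter in Step 2 can be tried on a tiny hand-made table. The `ratings` frame below is invented for illustration and only mirrors the column names of the real file. It also shows that `rating > 4` keeps 4.5 and 5.0 but excludes a rating of exactly 4.0.

```python
import pandas as pd

# Illustrative ratings, not real data
ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3, 3, 4],
    "movieId": [10, 20, 10, 30, 10, 20, 20],
    "rating":  [5.0, 4.5, 4.0, 5.0, 4.5, 3.0, 5.0],
})

movie_id = 10

# Rows for the chosen movie with a rating strictly above 4
liked = (ratings["movieId"] == movie_id) & (ratings["rating"] > 4)
similar_users = ratings[liked]["userId"].unique()

print(similar_users)
```

User 2 rated movie 10 with exactly 4.0, so that user is not counted as a similar user.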

**Step 3** Next, find out which other movies these similar_users liked. We will call them similar_user_recs.

In [None]:
similar_user_recs = ratings[(ratings["userId"].isin(similar_users)) & (ratings["rating"] > 4)]["movieId"]

In [None]:
similar_user_recs # the movies that similar users rated above 4

3741           318
3742           527
3743           541
3744           589
3745           741
             ...  
24998517     91542
24998518     92259
24998522     98809
24998523    102125
24998524    112852
Name: movieId, Length: 577796, dtype: int64

**Step 4** From the cell above you can see there are a great many movies, and I might not be interested in all of them (I am only interested in movies similar to Avengers).

We will keep only the movies that more than 10% of the users similar to us liked.

In the cell above, 'Length: 577796' is the total number of high ratings given by similar users. We now take the value counts of each movie, that is, how many similar users liked it, and divide by the number of similar users to get a share. For example, for movie id 318: count how many similar users liked it, then divide by the total number of similar users.

In [None]:
similar_user_recs = similar_user_recs.value_counts() / len(similar_users)

In [None]:
similar_user_recs = similar_user_recs[similar_user_recs > .10]
similar_user_recs

movieId
89745    1.000000
58559    0.573393
59315    0.530649
79132    0.519715
2571     0.496687
           ...   
47610    0.103545
780      0.103380
88744    0.103048
1258     0.101226
1193     0.100895
Name: count, Length: 193, dtype: float64
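
The counting in Step 4 can be followed on a small made-up example. The data and the 0.5 cut-off below are chosen only so the numbers are easy to check by hand; the notebook itself uses 10%.

```python
import pandas as pd

# Illustrative ratings, not real data
ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 3, 3, 4, 4],
    "movieId": [10, 20, 30, 10, 20, 10, 40, 20, 30],
    "rating":  [5.0, 5.0, 4.5, 5.0, 4.5, 4.5, 5.0, 5.0, 5.0],
})

similar_users = [1, 2, 3]  # assumed: the users who liked movie 10

# Movies those users rated above 4
mask = ratings["userId"].isin(similar_users) & (ratings["rating"] > 4)
liked_by_similar = ratings[mask]["movieId"]

# Count per movie, divided by the number of similar users
share = liked_by_similar.value_counts() / len(similar_users)

# Keep only movies liked by more than half of the similar users
kept = share[share > 0.5]

print(share)
print(kept)
```

Movie 10 gets a share of 1.0 because every similar user liked it by construction, which is also why movie 89745 tops the list above with 1.000000.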

Of all the movies liked by similar users, the 193 movies above were each liked by more than 10% of them.

**Step 5** Some movies are liked by almost everyone, so I do not want them recommended just for being popular. Recommend only movies that are specifically similar to mine (Avengers).

In [None]:
all_users = ratings[(ratings["movieId"].isin(similar_user_recs.index)) & (ratings["rating"] > 4)]

In [None]:
all_user_recs = all_users["movieId"].value_counts() / len(all_users["userId"].unique())

**Step 6** Below is the share of all users (those who liked at least one of these 193 movies) who liked each movie.

In [None]:
all_user_recs

movieId
318       0.346395
296       0.288146
2571      0.247010
356       0.238136
593       0.228665
            ...   
86332     0.010142
91630     0.009324
122900    0.008573
122926    0.008070
106072    0.005289
Name: count, Length: 193, dtype: float64

In [None]:
rec_percentages = pd.concat([similar_user_recs, all_user_recs], axis=1)
rec_percentages.columns = ["similar", "all"]

In [None]:
rec_percentages

movieId,similar,all
89745,1.000000,0.040459
58559,0.573393,0.148256
59315,0.530649,0.054931
79132,0.519715,0.132987
2571,0.496687,0.247010
...,...,...
47610,0.103545,0.022770
780,0.103380,0.054723
88744,0.103048,0.010383
1258,0.101226,0.083887


**Step 7** Add a new column, score, that measures similarity: score = similar / all.

In [None]:
rec_percentages["score"] = rec_percentages["similar"] / rec_percentages["all"]

In [None]:
rec_percentages = rec_percentages.sort_values("score", ascending=False)

In [None]:
rec_percentages.head().merge(movies, left_index=True, right_on="movieId")

Unnamed: 0,similar,all,score,movieId,title,genres,clean_title
17067,1.0,0.040459,24.716368,89745,"Avengers, The (2012)",Action|Adventure|Sci-Fi|IMAX,Avengers The 2012
20513,0.103711,0.005289,19.610199,106072,Thor: The Dark World (2013),Action|Adventure|Fantasy|IMAX,Thor The Dark World 2013
25058,0.241054,0.012367,19.49177,122892,Avengers: Age of Ultron (2015),Action|Adventure|Sci-Fi,Avengers Age of Ultron 2015
19678,0.216534,0.012119,17.867419,102125,Iron Man 3 (2013),Action|Sci-Fi|Thriller|IMAX,Iron Man 3 2013
16725,0.215043,0.012052,17.843074,88140,Captain America: The First Avenger (2011),Action|Adventure|Sci-Fi|Thriller|War,Captain America The First Avenger 2011
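
The score is a ratio: how much more often similar users like a movie than users in general. In the table above, Avengers itself has similar = 1.0 and all = 0.040459, which gives roughly 24.7. The sketch below repeats the calculation on three made-up movies (the shares are invented) to show that a niche favourite can outrank a more widely liked film.

```python
import pandas as pd

# Illustrative shares, indexed by a made-up movieId
similar = pd.Series({1: 0.60, 2: 0.50, 3: 0.20})   # share among similar users
everyone = pd.Series({1: 0.30, 2: 0.05, 3: 0.20})  # share among all users

# concat with axis=1 lines the two Series up on their shared index
rec = pd.concat([similar, everyone], axis=1)
rec.columns = ["similar", "all"]

rec["score"] = rec["similar"] / rec["all"]
rec = rec.sort_values("score", ascending=False)

print(rec)
```

Movie 2 is liked by fewer similar users than movie 1, but almost nobody else likes it, so its score is the highest.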


**Final Step** Putting everything in a function

In [None]:
def find_similar_movies(movie_id):
    # Users who rated this movie above 4
    similar_users = ratings[(ratings["movieId"] == movie_id) & (ratings["rating"] > 4)]["userId"].unique()
    # Share of those users who rated each other movie above 4
    similar_user_recs = ratings[(ratings["userId"].isin(similar_users)) & (ratings["rating"] > 4)]["movieId"]
    similar_user_recs = similar_user_recs.value_counts() / len(similar_users)

    # Keep movies liked by more than 10% of the similar users
    similar_user_recs = similar_user_recs[similar_user_recs > .10]
    # Share of all users who rated each of those movies above 4
    all_users = ratings[(ratings["movieId"].isin(similar_user_recs.index)) & (ratings["rating"] > 4)]
    all_user_recs = all_users["movieId"].value_counts() / len(all_users["userId"].unique())
    rec_percentages = pd.concat([similar_user_recs, all_user_recs], axis=1)
    rec_percentages.columns = ["similar", "all"]

    # Score: how much more the similar users like a movie than users in general
    rec_percentages["score"] = rec_percentages["similar"] / rec_percentages["all"]
    rec_percentages = rec_percentages.sort_values("score", ascending=False)
    return rec_percentages.head(10).merge(movies, left_index=True, right_on="movieId")[["score", "title", "genres"]]
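
To try the whole pipeline without the large ratings file, the sketch below runs a variant of the function on a tiny invented dataset. It is not the notebook's exact function: as assumptions for illustration, it takes `ratings` and `movies` as arguments, exposes hypothetical `min_share` and `top_n` parameters, and returns an empty table when nobody liked the chosen movie.

```python
import pandas as pd

def find_similar_movies(movie_id, ratings, movies, min_share=0.10, top_n=10):
    # min_share and top_n are illustrative parameters, not part of the notebook
    liked = ratings[ratings["rating"] > 4]
    similar_users = liked[liked["movieId"] == movie_id]["userId"].unique()
    if len(similar_users) == 0:
        # Nobody rated this movie above 4: nothing to recommend
        return pd.DataFrame(columns=["score", "title", "genres"])

    # Share of similar users who liked each movie
    similar = liked[liked["userId"].isin(similar_users)]["movieId"].value_counts()
    similar = similar / len(similar_users)
    similar = similar[similar > min_share]

    # Share of all users (who liked any candidate movie) who liked each movie
    candidates = liked[liked["movieId"].isin(similar.index)]
    everyone = candidates["movieId"].value_counts() / candidates["userId"].nunique()

    rec = pd.concat([similar, everyone], axis=1)
    rec.columns = ["similar", "all"]
    rec["score"] = rec["similar"] / rec["all"]
    rec = rec.sort_values("score", ascending=False).head(top_n)
    return rec.merge(movies, left_index=True, right_on="movieId")[["score", "title", "genres"]]

# Illustrative data, not real
movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["A", "B", "C"],
    "genres": ["Action", "Action", "Drama"],
})
ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3, 3, 4, 5, 6, 7],
    "movieId": [1, 2, 1, 2, 1, 3, 3, 3, 2, 2],
    "rating":  [5.0, 5.0, 5.0, 4.5, 4.5, 5.0, 5.0, 5.0, 3.0, 5.0],
})

result = find_similar_movies(1, ratings, movies)
no_match = find_similar_movies(99, ratings, movies)
print(result)
```

In this toy data, movie B ranks above movie C because two of the three users who liked movie A also liked B, while only one of them liked C.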


In [None]:
import ipywidgets as widgets
from IPython.display import display

movie_name_input = widgets.Text(
    value='Toy Story',
    description='Movie Title:',
    disabled=False
)
recommendation_list = widgets.Output()

def on_type(data):
    with recommendation_list:
        recommendation_list.clear_output()
        title = data["new"]
        if len(title) > 5:
            results = search(title)
            movie_id = results.iloc[0]["movieId"]
            display(find_similar_movies(movie_id))

movie_name_input.observe(on_type, names='value')

display(movie_name_input, recommendation_list)

Text(value='Toy Story', description='Movie Title:')

Output()