# Find movies tailored to your tastes!

An interactive tool that lets you enter a movie and receive 5 recommendations for similar movies.

# Dataset

[Data: ml-25m](https://files.grouplens.org/datasets/movielens/ml-25m.zip)

GroupLens Research has collected and made available rating data sets from the MovieLens web site (https://movielens.org). 

"MovieLens 25M movie ratings. Stable benchmark dataset. 25 million ratings and one million tag applications applied to 62,000 movies by 162,000 users. Includes tag genome data with 15 million relevance scores across 1,129 tags. Released 12/2019" - Provided By [GroupLens Research](https://grouplens.org/datasets/movielens/)


# Imports

* pandas for analysis and DataFrames
* re for regular expression support
* sklearn for building our TF-IDF search engine
* numpy for the search engine's array operations
* ipywidgets & IPython.display for interactivity

In [1]:
import numpy as np
import pandas as pd
import re  # regular expressions
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# interaction
import ipywidgets as widgets
from IPython.display import display

# formatting
import jupyter_black

jupyter_black.load(lab=False)


# <p style="text-align: center;">Movies Data Dictionary</p>


|Column Name|Description|
|-----------|-----------|
|**movieId**|Unique movie ID|
|**title**|Name of the movie, with release year|
|**genres**|Pipe-separated categories/types of the movie|




In [2]:
movies = pd.read_csv("movies.csv")
movies

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
62418,209157,We (2018),Drama
62419,209159,Window of the Soul (2001),Documentary
62420,209163,Bad Poems (2018),Comedy|Drama
62421,209169,A Girl Thing (2001),(no genres listed)


In [3]:
def clean_title(title):
    """
    cleans titles
    removes special characters such as parentheses
    """

    # search & remove special characters
    return re.sub("[^a-zA-Z0-9 ]", "", title)

In [4]:
# apply clean_title to title column
# to create a new column
movies["clean_title"] = movies["title"].apply(clean_title)
movies

Unnamed: 0,movieId,title,genres,clean_title
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,Toy Story 1995
1,2,Jumanji (1995),Adventure|Children|Fantasy,Jumanji 1995
2,3,Grumpier Old Men (1995),Comedy|Romance,Grumpier Old Men 1995
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,Waiting to Exhale 1995
4,5,Father of the Bride Part II (1995),Comedy,Father of the Bride Part II 1995
...,...,...,...,...
62418,209157,We (2018),Drama,We 2018
62419,209159,Window of the Soul (2001),Documentary,Window of the Soul 2001
62420,209163,Bad Poems (2018),Comedy|Drama,Bad Poems 2018
62421,209169,A Girl Thing (2001),(no genres listed),A Girl Thing 2001


## Movie Search System

How does it work?

Convert titles into sets of numbers using a Term Frequency (TF) matrix.

Each column is a unique "term".

If a term occurs in a row's title, it's assigned a 1, else a 0.

Then we use the Inverse Document Frequency (IDF) method. This helps surface distinctive terms by giving rarer terms a higher, logarithmically scaled weight.

> "IDF looks at the number of times a term is used in other pieces of content in a database, assigning a higher value to words used less often. It is used to measure how much information a word adds to the piece of content. " - [Source](https://www.seobility.net/en/wiki/Inverse_Document_Frequency)

Combining our TF with IDF, we create a vector that "describes" each movie title.

We enter a title into our search engine. It gets converted into a numerical vector. This is then matched with other titles in our dataset.
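
To make the idea concrete, here is a minimal sketch of the same technique on four toy titles. The `toy_*` names and the tiny title list are assumptions for illustration only; they are not part of the notebook's data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy titles, already cleaned (illustrative, not the full dataset)
toy_titles = ["Toy Story 1995", "Toy Story 2 1999", "Jumanji 1995", "Heat 1995"]

# Build a TF-IDF matrix over single words and pairs of consecutive words
toy_vectorizer = TfidfVectorizer(ngram_range=(1, 2))
toy_matrix = toy_vectorizer.fit_transform(toy_titles)

# Vectorize a query with the same vocabulary and score it against every title
toy_query = toy_vectorizer.transform(["Toy Story"])
toy_scores = cosine_similarity(toy_query, toy_matrix).flatten()

# The title with the highest cosine similarity is the best match
best_match = toy_titles[toy_scores.argmax()]
print(best_match)
```

Titles that share no terms with the query ("Jumanji 1995", "Heat 1995") get a similarity of zero, while both Toy Story titles score well above that.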

In [5]:
# use unigrams and bigrams (single words and pairs of consecutive words)
# bigrams help the search match multi-word titles more accurately
vectorizer = TfidfVectorizer(ngram_range=(1, 2))

# create our matrix
tfidf = vectorizer.fit_transform(movies["clean_title"])

In [6]:
# compute the similarities
def search(title):
    """
    takes a search term
    cleans it
    vectorizes/transforms the term
    returns the 5 most similar movies, best match first
    """
    title = clean_title(title)

    # transform the cleaned query into a sparse TF-IDF vector
    query_vec = vectorizer.transform([title])

    # compare the query vector with our clean_title matrix
    # returns a numpy vector of similarity scores
    similarity = cosine_similarity(query_vec, tfidf).flatten()

    # indices of the 5 highest scores
    # argpartition does not order them, so sort by score, highest first
    indices = np.argpartition(similarity, -5)[-5:]
    indices = indices[np.argsort(similarity[indices])[::-1]]

    # look up those titles, most similar first
    results = movies.iloc[indices]

    return results
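
One detail worth illustrating: `np.argpartition` only guarantees that the last `k` positions hold the `k` largest values, not that they are in order. The sketch below, using a made-up `scores` array, shows one way to get a properly ranked top-k.

```python
import numpy as np

# Hypothetical similarity scores for six titles
scores = np.array([0.10, 0.90, 0.30, 0.75, 0.05, 0.60])
k = 3

# The last k entries are the indices of the k largest scores,
# but their relative order is unspecified
top_k = np.argpartition(scores, -k)[-k:]

# Sort just those k indices by score, highest first
top_k_sorted = top_k[np.argsort(scores[top_k])[::-1]]
print(top_k_sorted)  # [1 3 5]
```

Partitioning first and sorting only `k` items stays cheap even when the score vector has tens of thousands of entries.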

# Interaction

We can add interactivity to our notebook using ipywidgets and IPython's display capabilities.

This creates a user-friendly way to interact with our recommendation system.

In [7]:
# an input widget
# returns data
movie_input = widgets.Text(
    value="Enter A Movie Title", description="Movie Title:", disabled=False
)

movie_list = widgets.Output()


def on_type(data):
    """
    called when text is typed
    """
    with movie_list:
        # clear any text
        movie_list.clear_output()

        # grab the title
        # start searching
        title = data["new"]

        if len(title) > 5:
            display(search(title))


# call on_type whenever the widget's value changes
movie_input.observe(on_type, names="value")

display(movie_input, movie_list)

Text(value='Enter A Movie Title', description='Movie Title:')

Output()

# Recommendations Using Ratings

A system to find users who liked the same movie, and then suggest movies based on those users' ratings and taste.

In [8]:
ratings = pd.read_csv("ratings.csv")
ratings

Unnamed: 0,userId,movieId,rating,timestamp
0,1,296,5.0,1147880044
1,1,306,3.5,1147868817
2,1,307,5.0,1147868828
3,1,665,5.0,1147878820
4,1,899,3.5,1147868510
...,...,...,...,...
25000090,162541,50872,4.5,1240953372
25000091,162541,55768,2.5,1240951998
25000092,162541,56176,2.0,1240950697
25000093,162541,58559,4.0,1240953434


In [9]:
ratings.dtypes

userId         int64
movieId        int64
rating       float64
timestamp      int64
dtype: object

## Finding Users who liked the same movie

Similar User Criteria: 
* watched the same movie
* rated the movie > 4 (i.e. 4.5 or 5 stars)
* unique userId
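
As a small illustration of these criteria, the sketch below applies them to a made-up `toy_ratings` frame (the data and the `toy_*` names are assumptions, not taken from the dataset). It mirrors the strict `> 4` comparison used in the code cell, so a rating of exactly 4.0 does not count as a "like".

```python
import pandas as pd

# Hypothetical ratings: four users rated movie 1, one of them also rated movie 2
toy_ratings = pd.DataFrame(
    {
        "userId": [1, 2, 3, 4, 4],
        "movieId": [1, 1, 1, 1, 2],
        "rating": [5.0, 4.0, 4.5, 3.0, 5.0],
    }
)

# Boolean mask: the row is for movie 1 AND the rating is strictly above 4
liked_movie_1 = (toy_ratings["movieId"] == 1) & (toy_ratings["rating"] > 4)

# Unique users who satisfy both conditions
toy_similar_users = toy_ratings.loc[liked_movie_1, "userId"].unique()
print(toy_similar_users)  # [1 3]
```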

In [10]:
# test id
movie_id = 1

# holds similar users
# movieId matches + rating is over 4 + unique userIds
# returns numpy array
similar_users = ratings[(ratings["movieId"] == movie_id) & (ratings["rating"] > 4)][
    "userId"
].unique()

# array
similar_users

array([    36,     75,     86, ..., 162527, 162530, 162533])

## Find the movies those users liked
* match userId against the similar users array
* find the movies they rated > 4
* return their liked movies

In [11]:
# find userIds in our similar user array
similar_user_recs = ratings[
    (ratings["userId"].isin(similar_users)) & (ratings["rating"] > 4)
]["movieId"]

In [12]:
similar_user_recs

5101            1
5105           34
5111          110
5114          150
5127          260
            ...  
24998854    60069
24998861    67997
24998876    78499
24998884    81591
24998888    88129
Name: movieId, Length: 1358326, dtype: int64

## Ratings Overview

We can now find 2 things:

1. Users with similar tastes
2. Movies those users like.

This allows us to get movie recommendations using another dimension.
These two queries can be joined on the movieId column.

## Next Steps

Keep only the movies that more than 10% of similar users liked.

This lets us fine-tune our results, and thus increase the quality of our search.
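
A minimal sketch of that filter on made-up data follows. The counts and the names `n_similar_users`, `liked_movie_ids`, `share`, and `popular` are illustrative assumptions, not values from the dataset.

```python
import pandas as pd

# Hypothetical: 10 similar users, and one entry per (user, liked movie) pair
n_similar_users = 10
liked_movie_ids = pd.Series([1] * 10 + [318] * 4 + [260])

# Fraction of similar users who liked each movie
share = liked_movie_ids.value_counts() / n_similar_users

# Keep only movies liked by strictly more than 10% of similar users
popular = share[share > 0.1]
print(popular)
```

Because the comparison is strict, movie 260 (liked by exactly 1 of 10 users) is dropped, while movies 1 and 318 remain.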

In [13]:
# counts how many times a specific movieId appears
similar_user_recs.value_counts()

1         18835
318        8393
260        7605
356        6973
296        6918
          ...  
128478        1
125125        1
119701        1
107563        1
7625          1
Name: movieId, Length: 19282, dtype: int64

In [14]:
# convert counts to the fraction of similar users who liked each movie
similar_user_recs = similar_user_recs.value_counts() / len(similar_users)

# keep movies liked by more than 10% (0.1) of similar users
similar_user_recs = similar_user_recs[similar_user_recs > 0.1]

# verify: every remaining value is above 0.1
similar_user_recs

1        1.000000
318      0.445607
260      0.403770
356      0.370215
296      0.367295
           ...   
953      0.103053
551      0.101195
1222     0.100876
745      0.100345
48780    0.100186
Name: movieId, Length: 113, dtype: float64

# Finding a Niche

Instead of targeting specific users and their likes, we'll take a look at how much a movie is liked by the average user of our dataset.

Then we can filter out those results from our similar user recommendations.

This allows us to find the middle spot or "niche" recommendations within our similar users data.

## Steps: 
1. find every rating > 4, from any user, for the movies recommended by our "similar users"
2. compare the all-users set to the similar-users set

In [15]:
# all ratings > 4, by any user, for the movies in our similar user recs
# similar_user_recs.index holds the movieIds
all_users = ratings[
    (ratings["movieId"].isin(similar_user_recs.index)) & (ratings["rating"] > 4)
]

all_users

Unnamed: 0,userId,movieId,rating,timestamp
0,1,296,5.0,1147880044
29,1,4973,4.5,1147869080
48,1,7361,5.0,1147880055
72,2,110,5.0,1141416589
76,2,260,5.0,1141417172
...,...,...,...,...
25000062,162541,5618,4.5,1240953299
25000065,162541,5952,5.0,1240952617
25000078,162541,7153,5.0,1240952613
25000081,162541,7361,4.5,1240953484


In [16]:
# convert to percents
# find counts for each movie / amount of unique users
all_users_recs = all_users["movieId"].value_counts() / len(all_users["userId"].unique())

In [17]:
# this is used for comparison
all_users_recs

318      0.342220
296      0.284674
2571     0.244033
356      0.235266
593      0.225909
           ...   
551      0.040918
50872    0.039111
745      0.037031
78499    0.035131
2355     0.025091
Name: movieId, Length: 113, dtype: float64

# Recommendation Comparison and Scoring

Here we will compare our two series to compute a score based on nicheness.
1. combine our two series
2. divide the "similar" share by the "all" share to get each movie's score
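
Combining the two series works because `pd.concat(..., axis=1)` aligns rows by index label (here, the movieId) rather than by position. The sketch below uses two tiny, made-up series whose indexes are in different orders; the `similar` and `everyone` names and their values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical shares, indexed by movieId, deliberately in different orders
similar = pd.Series({1: 1.0, 318: 0.45})
everyone = pd.Series({318: 0.34, 1: 0.12})

# axis=1 places the series side by side, matching rows on the index label
combined = pd.concat([similar, everyone], axis=1)
combined.columns = ["similar", "all"]
print(combined)
```

Each row ends up with the two values that belong to the same movieId, regardless of the order in which they appeared in the inputs.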

In [27]:
recommendation_percents = pd.concat([similar_user_recs, all_users_recs], axis=1)
recommendation_percents.columns = ["similar", "all"]
recommendation_percents

Unnamed: 0,similar,all
1,1.000000,0.124728
318,0.445607,0.342220
260,0.403770,0.222207
356,0.370215,0.235266
296,0.367295,0.284674
...,...,...
953,0.103053,0.045792
551,0.101195,0.040918
1222,0.100876,0.066877
745,0.100345,0.037031


In [28]:
recommendation_percents["score/ratio"] = (
    recommendation_percents["similar"] / recommendation_percents["all"]
)

In [34]:
recommendation_percents.sort_values("score/ratio", ascending=False).head(10).merge(
    movies, left_index=True, right_on="movieId"
)

Unnamed: 0,similar,all,score/ratio,movieId,title,genres,clean_title
0,1.0,0.124728,8.017414,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,Toy Story 1995
3021,0.280648,0.053706,5.225654,3114,Toy Story 2 (1999),Adventure|Animation|Children|Comedy|Fantasy,Toy Story 2 1999
2264,0.110539,0.025091,4.405452,2355,"Bug's Life, A (1998)",Adventure|Animation|Children|Comedy,Bugs Life A 1998
14813,0.15296,0.035131,4.354038,78499,Toy Story 3 (2010),Adventure|Animation|Children|Comedy|Fantasy|IMAX,Toy Story 3 2010
4780,0.235147,0.070811,3.320783,4886,"Monsters, Inc. (2001)",Adventure|Animation|Children|Comedy|Fantasy,Monsters Inc 2001
580,0.216618,0.067513,3.208539,588,Aladdin (1992),Adventure|Animation|Children|Comedy|Musical,Aladdin 1992
6258,0.228139,0.072268,3.156862,6377,Finding Nemo (2003),Adventure|Animation|Children|Comedy,Finding Nemo 2003
587,0.1794,0.059977,2.99115,595,Beauty and the Beast (1991),Animation|Children|Fantasy|Musical|Romance|IMAX,Beauty and the Beast 1991
8246,0.203504,0.068453,2.972889,8961,"Incredibles, The (2004)",Action|Adventure|Animation|Children|Comedy,Incredibles The 2004
359,0.253411,0.085764,2.954762,364,"Lion King, The (1994)",Adventure|Animation|Children|Drama|Musical|IMAX,Lion King The 1994
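
To see how the score behaves, the sketch below recomputes it for two rows copied from the comparison table (movieId 1 and movieId 318). The `sample` frame is an illustrative construction; because the inputs are the rounded values shown in the table, the results may differ slightly in the last decimals.

```python
import pandas as pd

# Two rows copied from the comparison table:
# share of similar users vs. share of all users who rated the movie above 4
sample = pd.DataFrame(
    {"similar": [1.000000, 0.445607], "all": [0.124728, 0.342220]},
    index=pd.Index([1, 318], name="movieId"),
)

# A ratio well above 1 means similar users like the movie far more than average
sample["score"] = sample["similar"] / sample["all"]
print(sample.round(3))
```

Movie 318 is liked by many similar users, but it is also liked by many users overall, so its ratio stays close to 1 and it ranks below the more niche titles.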


# Similar Movie Function

In [35]:
def find_sim_movies(movie_id):
    """
    takes a movie id
    and returns a niche-filtered list of movie recommendations
    combines the steps above
    """
    # users who rated this movie above 4
    similar_users = ratings[(ratings["movieId"] == movie_id) & (ratings["rating"] > 4)][
        "userId"
    ].unique()

    # movies those users rated above 4, as a fraction of similar users
    similar_user_recs = ratings[
        (ratings["userId"].isin(similar_users)) & (ratings["rating"] > 4)
    ]["movieId"]
    similar_user_recs = similar_user_recs.value_counts() / len(similar_users)

    # keep movies liked by more than 10% of similar users
    similar_user_recs = similar_user_recs[similar_user_recs > 0.10]

    # how much all users like those same movies
    all_users = ratings[
        (ratings["movieId"].isin(similar_user_recs.index)) & (ratings["rating"] > 4)
    ]
    all_user_recs = all_users["movieId"].value_counts() / len(
        all_users["userId"].unique()
    )
    rec_percentages = pd.concat([similar_user_recs, all_user_recs], axis=1)
    rec_percentages.columns = ["similar", "all"]

    # score = similar share / all share; higher means more niche
    rec_percentages["score"] = rec_percentages["similar"] / rec_percentages["all"]
    rec_percentages = rec_percentages.sort_values("score", ascending=False)
    return rec_percentages.head(10).merge(movies, left_index=True, right_on="movieId")[
        ["score", "title", "genres"]
    ]

# Search Engine

In [39]:
movie_name_input = widgets.Text(
    value="Toy Story", description="Movie Title:", disabled=False
)
recommendation_list = widgets.Output()


def on_type(data):
    with recommendation_list:
        recommendation_list.clear_output()
        title = data["new"]
        if len(title) > 5:
            results = search(title)
            movie_id = results.iloc[0]["movieId"]
            display(find_sim_movies(movie_id))


movie_name_input.observe(on_type, names="value")

display(movie_name_input, recommendation_list)

Text(value='Toy Story', description='Movie Title:')

Output()