# Find movies tailored to your tastes!

An interactive tool that lets you enter a movie title and receive 5 recommendations for similar movies.

# Dataset

[Data: ml-25m](https://files.grouplens.org/datasets/movielens/ml-25m.zip)

GroupLens Research has collected and made available rating data sets from the MovieLens web site (https://movielens.org). 

"MovieLens 25M movie ratings. Stable benchmark dataset. 25 million ratings and one million tag applications applied to 62,000 movies by 162,000 users. Includes tag genome data with 15 million relevance scores across 1,129 tags. Released 12/2019" - Provided By [GroupLens Research](https://grouplens.org/datasets/movielens/)


# Imports

* Pandas for analysis and DataFrames
* re for Regular Expression Support
* sklearn for building our TF-IDF search engine
* numpy for ranking the search results
* ipywidgets & display for interactivity

In [2]:
import pandas as pd
import numpy as np
import re  # regular expressions
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# interaction
import ipywidgets as widgets
from IPython.display import display

# formatting
import jupyter_black

jupyter_black.load(lab=False)


# <p style="text-align: center;">Movies Data Dictionary</p>


|Column Name| Description|
|-----------|-----------|
|**movieid**|ID & Row Number|
|**title**|Name of the Movie|
|**genres**|Category/type of movie|




In [None]:
movies = pd.read_csv("movies.csv")
movies

In [None]:
def clean_title(title):
    """
    cleans titles
    removes special characters such as parentheses
    """

    # search & remove special characters
    return re.sub("[^a-zA-Z0-9 ]", "", title)

In [None]:
# apply clean_title to title column
# to create a new column
movies["clean_title"] = movies["title"].apply(clean_title)
movies
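
To illustrate what `clean_title` does, here is a minimal sketch applied to a tiny hand-made DataFrame. The three rows are stand-ins chosen for this example so that it runs without `movies.csv`.

```python
import re

import pandas as pd


def clean_title(title):
    """Remove every character that is not a letter, digit, or space."""
    return re.sub("[^a-zA-Z0-9 ]", "", title)


# Stand-in rows (assumed sample data, not loaded from movies.csv)
sample = pd.DataFrame(
    {"title": ["Toy Story (1995)", "Heat (1995)", "Seven (a.k.a. Se7en) (1995)"]}
)

# Same pattern as the notebook: apply the cleaner to build a new column
sample["clean_title"] = sample["title"].apply(clean_title)
print(sample["clean_title"].tolist())
```

Parentheses and periods are dropped, while letters, digits, and spaces are kept, so the release year stays part of the searchable title.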

## Movie Search System

How does it work?

Convert titles into sets of numbers using a Term Frequency (TF) matrix.

Each row is a title and each column is a unique "term".

If a "term" occurs in a row's title, its cell holds the number of occurrences (usually 1 for short titles), else 0.
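
As a rough illustration of the term frequency matrix, the sketch below counts terms for three made-up titles. It uses scikit-learn's `CountVectorizer` rather than the `TfidfVectorizer` used in this notebook, purely to show the raw counts before any IDF weighting. The titles and the names `counter` and `tf_table` are assumptions for this example.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Stand-in titles (assumed sample data)
titles = ["toy story 1995", "toy soldiers 1991", "story of us 1999"]

counter = CountVectorizer()
tf = counter.fit_transform(titles)

# Column order follows the index stored for each term in vocabulary_
terms = sorted(counter.vocabulary_, key=counter.vocabulary_.get)

# One row per title, one column per term, cell = occurrences of the term
tf_table = pd.DataFrame(tf.toarray(), index=titles, columns=terms)
print(tf_table)
```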

Then we apply Inverse Document Frequency (IDF). This highlights distinctive terms by giving each term a logarithmic weight that grows as the term becomes rarer across titles.

> "IDF looks at the number of times a term is used in other pieces of content in a database, assigning a higher value to words used less often. It is used to measure how much information a word adds to the piece of content. " - [Source](https://www.seobility.net/en/wiki/Inverse_Document_Frequency)
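
To make the IDF weighting concrete, this sketch fits a `TfidfVectorizer` on three made-up titles and reads the learned weights from its `idf_` attribute. With scikit-learn's default smoothing, the weight is `ln((1 + n) / (1 + df)) + 1`, where `n` is the number of titles and `df` is the number of titles containing the term. The titles and variable names are assumptions for this example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Stand-in titles (assumed sample data)
titles = ["toy story 1995", "toy soldiers 1991", "story of us 1999"]

idf_vectorizer = TfidfVectorizer()
idf_vectorizer.fit(titles)

# Map each term to its learned IDF weight
idf = {
    term: idf_vectorizer.idf_[col]
    for term, col in idf_vectorizer.vocabulary_.items()
}

# Recompute two weights by hand with the default smoothed formula
n_titles = len(titles)
expected_toy = np.log((1 + n_titles) / (1 + 2)) + 1  # "toy" is in 2 titles
expected_soldiers = np.log((1 + n_titles) / (1 + 1)) + 1  # "soldiers" is in 1

print(f"toy: {idf['toy']:.3f}, soldiers: {idf['soldiers']:.3f}")
```

The rarer term ("soldiers") receives the larger weight, which is what lets distinctive words dominate a match.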

Combining TF with IDF, we create a vector that "describes" each movie title.

When we enter a title into our search engine, it is converted into a numerical vector in the same way. That vector is then compared with the vectors of the other titles in our dataset, and the closest ones are returned.
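
The whole flow described above can be sketched on a handful of titles. This is a minimal example under stated assumptions: the four titles are stand-ins, and the names `demo_vectorizer`, `demo_tfidf`, and `scores` exist only in this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Stand-in cleaned titles (assumed sample data)
titles = ["Toy Story 1995", "Toy Soldiers 1991", "Heat 1995", "Jumanji 1995"]

# Learn the vocabulary (single words and word pairs) and build the matrix
demo_vectorizer = TfidfVectorizer(ngram_range=(1, 2))
demo_tfidf = demo_vectorizer.fit_transform(titles)

# Convert the query with the same fitted vectorizer, then score every title
query_vec = demo_vectorizer.transform(["Toy Story"])
scores = cosine_similarity(query_vec, demo_tfidf).flatten()

best = titles[scores.argmax()]
print(best, scores.round(3))
```

Titles that share no terms with the query score 0, a title sharing only "toy" scores somewhat higher, and the title sharing "toy", "story", and the pair "toy story" scores highest.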

In [None]:
# ngram_range=(1, 2): use single words and pairs of consecutive words
# word pairs help the search match on word order, not just shared words
vectorizer = TfidfVectorizer(ngram_range=(1, 2))

# create our matrix
tfidf = vectorizer.fit_transform(movies["clean_title"])

In [None]:
# compute the similarities
def search(title):
    """
    takes a search term
    cleans it
    vectorizes/transforms the term
    returns the 5 most similar movies
    """
    title = clean_title(title)

    # transform the cleaned title into a sparse TF-IDF vector
    query_vec = vectorizer.transform([title])

    # compare the query vector to our clean_title matrix
    # returns a 1-D numpy array with one score per movie
    similarity = cosine_similarity(query_vec, tfidf).flatten()

    # find the indices of the 5 highest scores
    # argpartition does not order them, so sort those 5 by score, best first
    indices = np.argpartition(similarity, -5)[-5:]
    indices = indices[np.argsort(similarity[indices])[::-1]]

    # look up those rows, most similar first
    results = movies.iloc[indices]

    return results
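
One detail worth checking when ranking results: `np.argpartition` selects the positions of the k largest values but does not promise any order among them. The sketch below, using a made-up score array, shows one way to rank the selected positions from best to worst.

```python
import numpy as np

# Made-up similarity scores, one per movie (assumed sample data)
similarity = np.array([0.10, 0.90, 0.30, 0.75, 0.05, 0.60, 0.20, 0.85])
k = 5

# Positions of the k largest scores, in no guaranteed order
top_k = np.argpartition(similarity, -k)[-k:]

# Sort only those k positions by score, highest first
ranked = top_k[np.argsort(similarity[top_k])[::-1]]

print(ranked, similarity[ranked])
```

Partitioning first and sorting only k items avoids sorting the full array, which matters little here but helps when there are tens of thousands of titles.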

In [None]:
# search(movies.iloc[0, 3])

# Interaction

We can add interactivity to our notebook using widgets and `display`.

In [5]:
# a text input widget for the movie title
movie_input = widgets.Text(
    value="Enter A Movie Title", description="Movie Title:", disabled=False
)
movie_input

Text(value='Enter A Movie Title', description='Movie Title:')