# Find movies tailored to your tastes!

An interactive tool that lets you enter a movie title and receive recommendations for similar movies.

# Dataset

[Data: ml-25m](https://files.grouplens.org/datasets/movielens/ml-25m.zip)

GroupLens Research has collected and made available rating data sets from the MovieLens web site (https://movielens.org). 

"MovieLens 25M movie ratings. Stable benchmark dataset. 25 million ratings and one million tag applications applied to 62,000 movies by 162,000 users. Includes tag genome data with 15 million relevance scores across 1,129 tags. Released 12/2019" - Provided By [GroupLens Research](https://grouplens.org/datasets/movielens/)


# Imports

* pandas for analysis and DataFrames
* re for regular expression support
* sklearn for building our TF-IDF search engine
* numpy, used indirectly: the search engine returns its similarity scores as a numpy array

In [1]:
import pandas as pd
import re  # regular expressions
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# formatting
import jupyter_black

jupyter_black.load(lab=False)


# <p style="text-align: center;">Movies Data Dictionary</p>


|Column Name|Description|
|-----------|-----------|
|**movieId**|Unique ID of the movie (not always the row number)|
|**title**|Name of the movie, followed by its release year in parentheses|
|**genres**|Pipe-separated categories/types of the movie|




In [2]:
movies = pd.read_csv("movies.csv")
movies

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
62418,209157,We (2018),Drama
62419,209159,Window of the Soul (2001),Documentary
62420,209163,Bad Poems (2018),Comedy|Drama
62421,209169,A Girl Thing (2001),(no genres listed)


In [3]:
def clean_title(title):
    """
    Clean a movie title by removing special characters,
    such as the parentheses around the release year.
    """

    # keep only ASCII letters, digits, and spaces; drop everything else
    return re.sub(r"[^a-zA-Z0-9 ]", "", title)

In [4]:
# apply clean_title to title column
# to create a new column
movies["clean_title"] = movies["title"].apply(clean_title)
movies

Unnamed: 0,movieId,title,genres,clean_title
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,Toy Story 1995
1,2,Jumanji (1995),Adventure|Children|Fantasy,Jumanji 1995
2,3,Grumpier Old Men (1995),Comedy|Romance,Grumpier Old Men 1995
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,Waiting to Exhale 1995
4,5,Father of the Bride Part II (1995),Comedy,Father of the Bride Part II 1995
...,...,...,...,...
62418,209157,We (2018),Drama,We 2018
62419,209159,Window of the Soul (2001),Documentary,Window of the Soul 2001
62420,209163,Bad Poems (2018),Comedy|Drama,Bad Poems 2018
62421,209169,A Girl Thing (2001),(no genres listed),A Girl Thing 2001


## Movie Search System

How does it work?

Titles are converted into sets of numbers using a term frequency (TF) matrix.

Each column represents a unique "term".

If a term occurs in a row's title, that cell is assigned a 1; otherwise it is assigned a 0.
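As a minimal sketch of the term matrix described above, the following builds a 0/1 matrix for three cleaned titles from the dataset using only the standard library. The helper name `term_matrix` is hypothetical and not part of the notebook, and the whitespace tokenizer is a simplification of what `TfidfVectorizer` does.

```python
def term_matrix(titles):
    """Return (vocabulary, rows): one 0/1 row per title, one column per term."""
    # Tokenize by lowercasing and splitting on whitespace (a simplification).
    tokenized = [title.lower().split() for title in titles]
    # The vocabulary is every unique term, sorted so the column order is stable.
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    # A cell is 1 when the column's term occurs in the row's title, else 0.
    rows = [[1 if term in tokens else 0 for term in vocabulary] for tokens in tokenized]
    return vocabulary, rows


titles = ["Toy Story 1995", "Jumanji 1995", "Grumpier Old Men 1995"]
vocabulary, rows = term_matrix(titles)

print(vocabulary)
for title, row in zip(titles, rows):
    print(row, title)
```

Every title contains `1995`, so that column is all ones and does little to tell the titles apart, which is the problem the IDF weighting addresses.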

Then we apply inverse document frequency (IDF) weighting. This helps surface distinctive terms by giving each term a logarithmic weight that grows as the term becomes rarer across titles.

> "IDF looks at the number of times a term is used in other pieces of content in a database, assigning a higher value to words used less often. It is used to measure how much information a word adds to the piece of content. " - [Source](https://www.seobility.net/en/wiki/Inverse_Document_Frequency)
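To make the idea concrete, here is a small sketch that computes IDF for a few illustrative cleaned titles using the textbook form `log(N / df)`, where `N` is the number of titles and `df` is the number of titles containing the term. This formula is an assumption chosen for clarity: scikit-learn's `TfidfVectorizer` applies a smoothed variant by default, so its values differ, but common terms still receive lower weights than rare ones.

```python
import math
from collections import Counter

# Illustrative cleaned titles (assumed for this sketch).
titles = ["Toy Story 1995", "Jumanji 1995", "Grumpier Old Men 1995", "Toy Story 2 1999"]
tokenized = [set(title.lower().split()) for title in titles]

# Document frequency: the number of titles in which each term appears.
doc_freq = Counter(term for tokens in tokenized for term in tokens)

# Textbook IDF: terms found in fewer titles receive a larger weight.
n_titles = len(titles)
idf = {term: math.log(n_titles / df) for term, df in doc_freq.items()}

# Print terms from least to most distinctive.
for term, weight in sorted(idf.items(), key=lambda item: item[1]):
    print(f"{term:10s} {weight:.3f}")
```

Here `1995` appears in three of the four titles and gets the smallest weight, while a term such as `jumanji`, which appears in only one title, gets the largest.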

Combining TF with IDF, we create a vector that "describes" each movie title.

When we enter a title into our search engine, it is converted into a numerical vector in the same way. That vector is then compared with the vectors of the other titles in our dataset.
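The flow just described can be sketched end to end on a handful of titles. The toy list, the `toy_` variable names, the `top_matches` helper, and its `n` parameter are all assumptions made for illustration. Ranking with `numpy.argsort` is one reasonable way to turn similarity scores into an ordered result list, not necessarily the approach this notebook takes.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# A toy stand-in for the clean_title column (assumed for this sketch).
titles = [
    "Toy Story 1995",
    "Jumanji 1995",
    "Grumpier Old Men 1995",
    "Waiting to Exhale 1995",
    "Father of the Bride Part II 1995",
]

# Fit on the toy titles using unigrams and bigrams.
toy_vectorizer = TfidfVectorizer(ngram_range=(1, 2))
toy_tfidf = toy_vectorizer.fit_transform(titles)


def top_matches(query, n=3):
    """Return the n titles most similar to the query, best match first."""
    query_vec = toy_vectorizer.transform([query])
    scores = cosine_similarity(query_vec, toy_tfidf).flatten()
    # argsort is ascending, so reverse it and keep the first n positions.
    best = np.argsort(scores)[::-1][:n]
    return [(titles[i], float(scores[i])) for i in best]


results = top_matches("toy story")
for title, score in results:
    print(f"{score:.3f}  {title}")
```

Only the first result shares any terms with the query, so the remaining entries have a score of zero. On a larger dataset, a cutoff on the score could be used to drop such non-matches.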

In [5]:
# use unigrams and bigrams (single words and pairs of consecutive words)
# bigrams help match multi-word titles more accurately
vectorizer = TfidfVectorizer(ngram_range=(1, 2))

# create our matrix
tfidf = vectorizer.fit_transform(movies["clean_title"])

In [10]:
# compute the similarities
def search(title):
    """
    Take a search term, clean it, and vectorize it.
    Return its cosine similarity to every title in the dataset.
    """
    title = clean_title(title)

    # transform the cleaned query into a sparse TF-IDF vector
    query_vec = vectorizer.transform([title])

    # compare the query vector to our clean_title matrix
    # returns a 1-D numpy array with one score per movie
    similarity = cosine_similarity(query_vec, tfidf).flatten()

    return similarity

In [11]:
search(movies.iloc[2, 3])  # row 2, column 3: the clean_title "Grumpier Old Men 1995"

array([0.06531543, 0.07277985, 1.        , ..., 0.        , 0.        ,
       0.        ])