<h1 align='center'>Final Capstone: <i>Betcha Can't Guess What I Watched</i></h1>
<h2 align='center'>Philip Bowman</h2>
<h1 align='center'>Modeling Pipeline and Evaluation</h1>

## Overview:
Now that the movies and variables viable for use in this project have been identified, it is time to create some models. The goal is to find one that appropriately identifies movies similar to a given movie, and then to limit that model's responses to movies deemed *obscure* by the definition and limitations set out in the [Data Wrangling and Exploration](https://github.com/philbowman212/Thinkful_repo/blob/master/projects/final_capstone/data_wrangling_and_exploration.ipynb) notebook. Ultimately, the steps taken from beginning to end will be put into a pipeline that can be called with a single function.

The prospective methods that could be used in a content recommender such as this are essentially endless. There are countless ways to tokenize text, numerous techniques for selecting corpus stop words, and the rules of the language itself to contend with. This project's intention is not to jump into the deep end of Natural Language Processing (NLP), but because any model's success is inherently tied to the textual data, decisions about the text need to be made with care. For the most part, simple solutions that work are probably better than overcomplicated ones that are hard to explain. The goal here is to find a relatively simple solution by trying a few different methods and tuning each to find its "best" results:
- Count vectorizer in conjunction with cosine similarity (the simplest)
- Latent Dirichlet allocation in conjunction with cosine similarity (relatively simple with the added step of reducing dimensions to their latent topics first)
- doc2vec (the most sophisticated of the three options presented, but also the most complicated)

Without further ado, onto the count vectorizer method.

***This product uses the TMDb API but is not endorsed or certified by TMDb.***

<img src="data:image/svg+xml;base64,PHN2ZyB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciIHhtbG5zOnhsaW5rPSJodHRw%0D%0AOi8vd3d3LnczLm9yZy8xOTk5L3hsaW5rIiB2aWV3Qm94PSIwIDAgMTg1LjA0IDEzMy40Ij48ZGVm%0D%0Acz48c3R5bGU+LmNscy0xe2ZpbGw6dXJsKCNsaW5lYXItZ3JhZGllbnQpO308L3N0eWxlPjxsaW5l%0D%0AYXJHcmFkaWVudCBpZD0ibGluZWFyLWdyYWRpZW50IiB5MT0iNjYuNyIgeDI9IjE4NS4wNCIgeTI9%0D%0AIjY2LjciIGdyYWRpZW50VW5pdHM9InVzZXJTcGFjZU9uVXNlIj48c3RvcCBvZmZzZXQ9IjAiIHN0%0D%0Ab3AtY29sb3I9IiM5MGNlYTEiLz48c3RvcCBvZmZzZXQ9IjAuNTYiIHN0b3AtY29sb3I9IiMzY2Jl%0D%0AYzkiLz48c3RvcCBvZmZzZXQ9IjEiIHN0b3AtY29sb3I9IiMwMGIzZTUiLz48L2xpbmVhckdyYWRp%0D%0AZW50PjwvZGVmcz48dGl0bGU+QXNzZXQgNDwvdGl0bGU+PGcgaWQ9IkxheWVyXzIiIGRhdGEtbmFt%0D%0AZT0iTGF5ZXIgMiI+PGcgaWQ9IkxheWVyXzEtMiIgZGF0YS1uYW1lPSJMYXllciAxIj48cGF0aCBj%0D%0AbGFzcz0iY2xzLTEiIGQ9Ik01MS4wNiw2Ni43aDBBMTcuNjcsMTcuNjcsMCwwLDEsNjguNzMsNDlo%0D%0ALS4xQTE3LjY3LDE3LjY3LDAsMCwxLDg2LjMsNjYuN2gwQTE3LjY3LDE3LjY3LDAsMCwxLDY4LjYz%0D%0ALDg0LjM3aC4xQTE3LjY3LDE3LjY3LDAsMCwxLDUxLjA2LDY2LjdabTgyLjY3LTMxLjMzaDMyLjlB%0D%0AMTcuNjcsMTcuNjcsMCwwLDAsMTg0LjMsMTcuN2gwQTE3LjY3LDE3LjY3LDAsMCwwLDE2Ni42Myww%0D%0AaC0zMi45QTE3LjY3LDE3LjY3LDAsMCwwLDExNi4wNiwxNy43aDBBMTcuNjcsMTcuNjcsMCwwLDAs%0D%0AMTMzLjczLDM1LjM3Wm0tMTEzLDk4aDYzLjlBMTcuNjcsMTcuNjcsMCwwLDAsMTAyLjMsMTE1Ljdo%0D%0AMEExNy42NywxNy42NywwLDAsMCw4NC42Myw5OEgyMC43M0ExNy42NywxNy42NywwLDAsMCwzLjA2%0D%0ALDExNS43aDBBMTcuNjcsMTcuNjcsMCwwLDAsMjAuNzMsMTMzLjM3Wm04My45Mi00OWg2LjI1TDEy%0D%0ANS41LDQ5aC04LjM1bC04LjksMjMuMmgtLjFMOTkuNCw0OUg5MC41Wm0zMi40NSwwaDcuOFY0OWgt%0D%0ANy44Wm0yMi4yLDBoMjQuOTVWNzcuMkgxNjcuMVY3MGgxNS4zNVY2Mi44SDE2Ny4xVjU2LjJoMTYu%0D%0AMjVWNDloLTI0Wk0xMC4xLDM1LjRoNy44VjYuOUgyOFYwSDBWNi45SDEwLjFaTTM5LDM1LjRoNy44%0D%0AVjIwLjFINjEuOVYzNS40aDcuOFYwSDYxLjlWMTMuMkg0Ni43NVYwSDM5Wm00MS4yNSwwaDI1VjI4%0D%0ALjJIODhWMjFoMTUuMzVWMTMuOEg4OFY3LjJoMTYuMjVWMGgtMjRabS03OSw0OUg5VjU3LjI1aC4x%0D%0AbDksMjcuMTVIMjRsOS4zLTI3LjE1aC4xVjg0LjRoNy44VjQ5SDI5LjQ1bC04LjIsMjMuMWgtLjFM%0D%0AMTMsNDlIMS4yWm0xMTIuMDksNDlIMTI2YTI0LjU5LDI0LjU5LDAsMCwwLDcuNTYtMS4xNSwxOS41%0
D%0AMiwxOS41MiwwLDAsMCw2LjM1LTMuMzcsMTYuMzcsMTYuMzcsMCwwLDAsNC4zNy01LjVBMTYuOTEs%0D%0AMTYuOTEsMCwwLDAsMTQ2LDExNS44YTE4LjUsMTguNSwwLDAsMC0xLjY4LTguMjUsMTUuMSwxNS4x%0D%0ALDAsMCwwLTQuNTItNS41M0ExOC41NSwxOC41NSwwLDAsMCwxMzMuMDcsOTksMzMuNTQsMzMuNTQs%0D%0AMCwwLDAsMTI1LDk4SDExMy4yOVptNy44MS0yOC4yaDQuNmExNy40MywxNy40MywwLDAsMSw0LjY3%0D%0ALjYyLDExLjY4LDExLjY4LDAsMCwxLDMuODgsMS44OCw5LDksMCwwLDEsMi42MiwzLjE4LDkuODcs%0D%0AOS44NywwLDAsMSwxLDQuNTIsMTEuOTIsMTEuOTIsMCwwLDEtMSw1LjA4LDguNjksOC42OSwwLDAs%0D%0AMS0yLjY3LDMuMzQsMTAuODcsMTAuODcsMCwwLDEtNCwxLjgzLDIxLjU3LDIxLjU3LDAsMCwxLTUs%0D%0ALjU1SDEyMS4xWm0zNi4xNCwyOC4yaDE0LjVhMjMuMTEsMjMuMTEsMCwwLDAsNC43My0uNSwxMy4z%0D%0AOCwxMy4zOCwwLDAsMCw0LjI3LTEuNjUsOS40Miw5LjQyLDAsMCwwLDMuMS0zLDguNTIsOC41Miww%0D%0ALDAsMCwxLjItNC42OCw5LjE2LDkuMTYsMCwwLDAtLjU1LTMuMiw3Ljc5LDcuNzksMCwwLDAtMS41%0D%0ANy0yLjYyLDguMzgsOC4zOCwwLDAsMC0yLjQ1LTEuODUsMTAsMTAsMCwwLDAtMy4xOC0xdi0uMWE5%0D%0ALjI4LDkuMjgsMCwwLDAsNC40My0yLjgyLDcuNDIsNy40MiwwLDAsMCwxLjY3LTUsOC4zNCw4LjM0%0D%0ALDAsMCwwLTEuMTUtNC42NSw3Ljg4LDcuODgsMCwwLDAtMy0yLjczLDEyLjksMTIuOSwwLDAsMC00%0D%0ALjE3LTEuMywzNC40MiwzNC40MiwwLDAsMC00LjYzLS4zMmgtMTMuMlptNy44LTI4LjhoNS4zYTEw%0D%0ALjc5LDEwLjc5LDAsMCwxLDEuODUuMTcsNS43Nyw1Ljc3LDAsMCwxLDEuNy41OCwzLjMzLDMuMzMs%0D%0AMCwwLDEsMS4yMywxLjEzLDMuMjIsMy4yMiwwLDAsMSwuNDcsMS44MiwzLjYzLDMuNjMsMCwwLDEt%0D%0ALjQyLDEuOCwzLjM0LDMuMzQsMCwwLDEtMS4xMywxLjIsNC43OCw0Ljc4LDAsMCwxLTEuNTcuNjUs%0D%0AOC4xNiw4LjE2LDAsMCwxLTEuNzguMkgxNjVabTAsMTQuMTVoNS45YTE1LjEyLDE1LjEyLDAsMCwx%0D%0ALDIuMDUuMTUsNy44Myw3LjgzLDAsMCwxLDIsLjU1LDQsNCwwLDAsMSwxLjU4LDEuMTcsMy4xMywz%0D%0ALjEzLDAsMCwxLC42MiwyLDMuNzEsMy43MSwwLDAsMS0uNDcsMS45NSw0LDQsMCwwLDEtMS4yMywx%0D%0ALjMsNC43OCw0Ljc4LDAsMCwxLTEuNjcuNyw4LjkxLDguOTEsMCwwLDEtMS44My4yaC03WiIvPjwv%0D%0AZz48L2c+PC9zdmc+"
width="100" height="50" align='left'>

## 1. Count Vectorizer and Cosine Similarity Method

Some quick housekeeping first: the data has to be pulled in and limited to the relevant features (the text data).

In [9]:
import os
from os.path import join
import pandas as pd
import numpy as np
import time
from datetime import timedelta
from time import asctime
from IPython.display import clear_output
import pickle

In [2]:
movies_dir = r'C:\Users\philb\Datasets\movies_post_exploration'
movies_file = 'movies.pkl'
unpop_file = 'less_popular_movies.pkl'
pool_file = 'recommendation_pool.pkl' #won't need this until later

In [3]:
#this cell takes all the textual features in the movies dataframe and puts them all together into the movies_text variable

movies = pd.read_pickle(join(movies_dir, movies_file))
unpop_movies_ids = pd.read_pickle(join(movies_dir, unpop_file)).index

text_features = ['spoken_languages', 'genres', 'overview', 'tagline', 'keywords', 'production_companies', 'acting_top_5', 'director', 'writers']
list_features = ['spoken_languages', 'genres', 'keywords', 'production_companies', 'acting_top_5', 'director', 'writers']

movies_features = movies[text_features].copy()
movies_features = movies_features.fillna(' ')

def unpack_list(x):
    string_rep = ' '
    for item in x:
        string_rep = string_rep + str(item) + ' '
    return string_rep

for column in list_features:
    movies_features[column] = movies_features[column].apply(unpack_list)
    
add_space_columns = [column for column in movies_features.columns if column not in list_features]

for column in add_space_columns:
    movies_features[column] = movies_features[column].apply(lambda x: x + ' ')
    
movies_text = movies_features.sum(axis=1)
movies_titles = movies.title
pop_movies_indicies = [index for index in movies_titles.index if index not in unpop_movies_ids]
pop_movies = movies_titles[pop_movies_indicies]

del movies, text_features, list_features, movies_features, add_space_columns, unpop_movies_ids

All the potential model features are now combined in `movies_text`. For simplicity's sake, the first attempt will be done with no cleaning whatsoever. This is going to be a very sparse representation of the data (as most of the feature vectors will tend to be), but it should give a decent baseline for what to expect moving forward. In other words, this is one of the simplest solutions possible, and it likely will not perform very well. For this particular method (count vectorizer + cosine similarity), the best way to verify the results is to actually check some similarities and inspect their accuracy. This is where my domain knowledge comes into play (I've watched a decent number of movies and I've often been told I should get into movie criticism/reviewing). Obviously, the results will be quite subjective and my knowledge of this industry is far from comprehensive, but you use what you've got, right?

In [4]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

Since there are far too many unique words in this corpus, only the top 2500 will be considered. English stop words will also be removed, as they won't add much value to the differences between movies.

In [79]:
cv = CountVectorizer(max_features=2500, stop_words='english')

In [81]:
feature_csr = cv.fit_transform(movies_text)

In [92]:
feature_df = pd.DataFrame(feature_csr.todense(), index=movies_text.index)

The model, in this case, is essentially the cosine similarity calculation (that is, the similarity of the angle) between the feature vectors. Now, with 170000 films with 2500 features apiece, creating the full similarity matrix and storing it for every movie would require a very large amount of space. So, to test the ability of this model, it makes sense to cherry-pick a few movies and see how they perform in general (before the obscurity filter). That way, this method can be reviewed without requiring a ton of storage space.
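To make the calculation concrete, here is a minimal, hand-rolled sketch of cosine similarity on raw word counts, using only the standard library. The three toy documents and the helper names (`bag_of_words`, `cosine`) are illustrative assumptions and not part of the project code; scikit-learn's `cosine_similarity` performs the same calculation on the sparse matrix. The last lines estimate the size of a dense all-pairs matrix, assuming 170000 films and 8-byte floats.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Count lowercase whitespace-separated tokens (no stop-word removal)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine of the angle between two count vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical documents, for illustration only.
docs = {
    'a': 'space rebel empire space',
    'b': 'space empire princess',
    'c': 'father son ocean fish',
}
vectors = {key: bag_of_words(text) for key, text in docs.items()}

sim_ab = cosine(vectors['a'], vectors['b'])  # shared terms, so the score is positive
sim_ac = cosine(vectors['a'], vectors['c'])  # no shared terms, so the score is 0.0

# Back-of-the-envelope size of a dense all-pairs matrix of float64 values.
n_movies = 170_000  # corpus size taken from the text above
full_matrix_gb = n_movies ** 2 * 8 / 1e9
print(round(sim_ab, 3), sim_ac, round(full_matrix_gb, 1))
```

Under these assumptions the full matrix would come to roughly 231 GB, which is why the similarities are computed one query movie at a time instead.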

In [104]:
pop_movies.head()

id
11            Star Wars
12         Finding Nemo
13         Forrest Gump
14      American Beauty
18    The Fifth Element
Name: title, dtype: object

This actually looks like a decent list of movies to look into, as it gives a sample of pretty different movies.

In [171]:
def top_10_similar(movie_id):
    cos_sim = cosine_similarity(feature_csr, feature_df.loc[movie_id].values.reshape(1,-1))
    top_10_idx = pd.DataFrame(cos_sim, index=feature_df.index).loc[:, 0].sort_values(ascending=False).iloc[1:11].index
    print(f'Top 10 Movies Similar to {movies_titles.loc[movie_id]}')
    return movies_titles.loc[top_10_idx]

In [172]:
top_10_similar(11)

Top 10 Movies Similar to Star Wars


id
1892                                Return of the Jedi
74849                    The Star Wars Holiday Special
1895      Star Wars: Episode III - Revenge of the Sith
140607                    Star Wars: The Force Awakens
1891                           The Empire Strikes Back
1893         Star Wars: Episode I - The Phantom Menace
1894      Star Wars: Episode II - Attack of the Clones
51686          Space Battleship Yamato: The New Voyage
9255                              Hot Shots! Part Deux
567097                             Star wars: Goretech
Name: title, dtype: object

This looks not terrible, actually. But Star Wars may be too easy of a test as the descriptions are probably very similar to one another (not to mention George Lucas probably appearing all over the place).

In [173]:
top_10_similar(12)

Top 10 Movies Similar to Finding Nemo


id
13205                               Bambi II
45791     When Did You Last See Your Father?
187585                         Ramen Samurai
151403                   La maison de Himiko
371642                    The Rogue Stallion
475594                        Father and Son
535624                               One Son
468813                       My Father Iqbal
418272                       A Boy Called Po
108954              Postmen in the Mountains
Name: title, dtype: object

Definitely looks like it's picking up on the father/son information likely found in the overview and perhaps keywords/tagline for all these movies. But is this too simple?

In [174]:
top_10_similar(13)

Top 10 Movies Similar to Forrest Gump


id
205065                  Land of Sorrows
272721                    Hicran Gecesi
11947               Emil and the Piglet
382717                         Sanam Re
5177                         Dark Horse
688       The Bridges of Madison County
357854                        Aishwarya
608124                         Chhalawa
24684                      An Education
232420                          Twinkle
Name: title, dtype: object

After digging into these, the [top pick](https://www.themoviedb.org/movie/205065) appears to be a movie about a family during the Vietnam War (which probably isn't too far off in topic). [Hicran Gecesi](https://www.themoviedb.org/movie/272721-hicran-gecesi) is a very obscure pick that appears to be about a love triangle. [Emil and the Piglet](https://www.themoviedb.org/movie/11947) appears to veer off in an interesting, unexpected direction. [An Education](https://www.themoviedb.org/movie/24684) is a love story where the woman's name is Jenny. So there are some interesting picks here, but the list is far from perfect.

In [175]:
top_10_similar(14)

Top 10 Movies Similar to American Beauty


id
11446     Welcome to the Dollhouse
31146             Box of Moonlight
486269      Under the Eiffel Tower
3              Shadows in Paradise
59210                   The Mother
152484          Oldies but Goldies
224917               Dear Sidewalk
531951         I Made This For You
436245                      Daphne
502061                  Early Fall
Name: title, dtype: object

[Welcome to the Dollhouse](https://www.themoviedb.org/movie/11446) looks to be quite similar in vein to American Beauty, though I haven't seen it. [Box of Moonlight](https://www.themoviedb.org/movie/31146) also looks to be loosely connected, as it revolves around a man having a mid-life crisis. [Under the Eiffel Tower](https://www.themoviedb.org/movie/486269) appears to be another movie about a man having a mid-life crisis. So this is definitely on the right track.

In [176]:
top_10_similar(18)

Top 10 Movies Similar to The Fifth Element


id
181808                       Star Wars: The Last Jedi
15493                   Star Wreck: In the Pirkinning
14460                         Battle Beyond the Stars
13836                          Race to Witch Mountain
339964    Valerian and the City of a Thousand Planets
563                                 Starship Troopers
44957      10,000 A.D.: The Legend of the Black Pearl
9278                                         Freejack
32064                               The Time Guardian
13475                                       Star Trek
Name: title, dtype: object

These recommendations seem to be pretty accurate as well. In all, these really aren't terrible recommendations based on the eye test.

It appears that this super-simple system actually works relatively well by picking up on the main keywords found throughout the corpus. Is it possible to improve this system? Perhaps by simply doubling the feature space it will be able to pick up on more minute similarities between movies? The pipeline for this model is also extremely simple.

In [6]:
def cosine_pipe(text_docs, feature_space):
    cv = CountVectorizer(max_features=feature_space, stop_words='english')
    feature_csr = cv.fit_transform(text_docs)
    all_vars = (cv, feature_csr)
    return all_vars

def top_X_pipe(movie_id, pipe_output, x_sim):
    cos_sim = cosine_similarity(pipe_output[1], pipe_output[1][movies_titles.index.get_loc(movie_id)])
    top_X_idx = pd.DataFrame(cos_sim, index=movies_titles.index).loc[:, 0].sort_values(ascending=False).iloc[1:(x_sim+1)].index
    print(f'Top {x_sim} Movies Similar to {movies_titles.loc[movie_id]}')
    return movies_titles.loc[top_X_idx]
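The function above sorts every similarity score just to keep the first few. As a hedged alternative, the sketch below uses `np.argpartition` to pull out only the `k` best positions and then sorts those. The name `top_k_positions` and the toy scores are assumptions for illustration; it also excludes the query explicitly rather than assuming the query always ranks first.

```python
import numpy as np

def top_k_positions(scores, k, skip=None):
    """Return positions of the k largest scores, best first.

    `skip` is an optional position to exclude (e.g. the query movie itself).
    """
    scores = np.asarray(scores, dtype=float).copy()
    if skip is not None:
        scores[skip] = -np.inf  # never recommend the query itself
    k = min(k, scores.size)
    if k <= 0:
        return np.array([], dtype=int)
    # argpartition places the k largest scores in the last k slots (unordered)
    candidates = np.argpartition(scores, -k)[-k:]
    # sort only those k candidates, highest score first
    return candidates[np.argsort(scores[candidates])[::-1]]

# Hypothetical similarity scores for six movies; position 0 is the query.
toy_scores = [1.0, 0.2, 0.9, 0.05, 0.7, 0.4]
best = top_k_positions(toy_scores, k=3, skip=0)
print(best)
```

The returned positions could then be mapped back to movie IDs with `movies_titles.index[best]`. Whether this is noticeably faster than a full sort for this corpus has not been measured here.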

In [195]:
cosine_5000_features = cosine_pipe(movies_text, 5000)

In [197]:
top_X_pipe(12, cosine_5000_features, 10)

Top 10 Movies Similar to Finding Nemo


id
13205                               Bambi II
371642                    The Rogue Stallion
45791     When Did You Last See Your Father?
151403                   La maison de Himiko
187585                         Ramen Samurai
475594                        Father and Son
398362                      Izzie's Way Home
410363                         Father's Acre
108954              Postmen in the Mountains
535624                               One Son
Name: title, dtype: object

In [206]:
for movie_id in pop_movies.index[:5]:
    print(top_X_pipe(movie_id, cosine_5000_features, 10))
    print()

Top 10 Movies Similar to Star Wars
id
1892                                Return of the Jedi
74849                    The Star Wars Holiday Special
140607                    Star Wars: The Force Awakens
1891                           The Empire Strikes Back
1895      Star Wars: Episode III - Revenge of the Sith
1893         Star Wars: Episode I - The Phantom Menace
1894      Star Wars: Episode II - Attack of the Clones
567097                             Star wars: Goretech
9255                              Hot Shots! Part Deux
181808                        Star Wars: The Last Jedi
Name: title, dtype: object

Top 10 Movies Similar to Finding Nemo
id
13205                               Bambi II
371642                    The Rogue Stallion
45791     When Did You Last See Your Father?
151403                   La maison de Himiko
187585                         Ramen Samurai
475594                        Father and Son
398362                      Izzie's Way Home
410363                      

Looks like there were a few changes, but nothing major. What if the features were increased to 25000?

In [203]:
cosine_25000_features = cosine_pipe(movies_text, 25000)

In [205]:
for movie_id in pop_movies.index[:5]:
    print(top_X_pipe(movie_id, cosine_25000_features, 10))
    print()

Top 10 Movies Similar to Star Wars
id
1892                                Return of the Jedi
74849                    The Star Wars Holiday Special
1891                           The Empire Strikes Back
140607                    Star Wars: The Force Awakens
1895      Star Wars: Episode III - Revenge of the Sith
1893         Star Wars: Episode I - The Phantom Menace
1894      Star Wars: Episode II - Attack of the Clones
567097                             Star wars: Goretech
181812                Star Wars: The Rise of Skywalker
181808                        Star Wars: The Last Jedi
Name: title, dtype: object

Top 10 Movies Similar to Finding Nemo
id
127380                          Finding Dory
13205                               Bambi II
371642                    The Rogue Stallion
282468                    Coral Sea Dreaming
45791     When Did You Last See Your Father?
468813                       My Father Iqbal
398362                      Izzie's Way Home
475594                      

Color me impressed: these recommendations look pretty good. Looks like some obscure films are already appearing on the list! Movies like Big Fish have hopped into the top for American Beauty (which seems fitting), Finding Dory has now popped into the most similar position to Finding Nemo, and the similarities between the top Star Wars films appear to have solidified. This is coming out pretty well. What if the features were increased even further?

In [208]:
cosine_100000_features = cosine_pipe(movies_text, 100000)

In [209]:
for movie_id in pop_movies.index[:5]:
    print(top_X_pipe(movie_id, cosine_100000_features, 10))
    print()

Top 10 Movies Similar to Star Wars
id
1891                           The Empire Strikes Back
1892                                Return of the Jedi
74849                    The Star Wars Holiday Special
140607                    Star Wars: The Force Awakens
1895      Star Wars: Episode III - Revenge of the Sith
1893         Star Wars: Episode I - The Phantom Menace
1894      Star Wars: Episode II - Attack of the Clones
181808                        Star Wars: The Last Jedi
567097                             Star wars: Goretech
181812                Star Wars: The Rise of Skywalker
Name: title, dtype: object

Top 10 Movies Similar to Finding Nemo
id
127380                          Finding Dory
371642                    The Rogue Stallion
13205                               Bambi II
282468                    Coral Sea Dreaming
45791     When Did You Last See Your Father?
468813                       My Father Iqbal
398362                      Izzie's Way Home
109349                      

The results appear to be pretty consistent going from 25000 to 100000 features. Interestingly, the order of the top 3 movies for Star Wars has changed slightly, into an order that seems to make sense. Since this runs so quickly, why not go all in and use all the features available?

In [7]:
cosine_all_features = cosine_pipe(movies_text, None)

In [211]:
for movie_id in pop_movies.index[:5]:
    print(top_X_pipe(movie_id, cosine_all_features, 10))
    print()

Top 10 Movies Similar to Star Wars
id
1891                           The Empire Strikes Back
1892                                Return of the Jedi
74849                    The Star Wars Holiday Special
140607                    Star Wars: The Force Awakens
1895      Star Wars: Episode III - Revenge of the Sith
1893         Star Wars: Episode I - The Phantom Menace
1894      Star Wars: Episode II - Attack of the Clones
181808                        Star Wars: The Last Jedi
567097                             Star wars: Goretech
181812                Star Wars: The Rise of Skywalker
Name: title, dtype: object

Top 10 Movies Similar to Finding Nemo
id
127380                          Finding Dory
371642                    The Rogue Stallion
13205                               Bambi II
282468                    Coral Sea Dreaming
45791     When Did You Last See Your Father?
468813                       My Father Iqbal
398362                      Izzie's Way Home
21291                       

Alright, so overall, this model appears to be quite solid thanks to its extreme simplicity, very fast runtime (which is important when searching for recommendations), and its surprisingly decent and seemingly accurate suggestions. The question becomes: can a recommendation system get much better from here? Also, if this model is ultimately used, the use case for the user comes into question. Will they search for a particular movie (as above) and get the obscure movies most similar to that film? Or will they be able to enter text freely, in which case the count vectorizer will transform their text in the same manner as it did all the movie information and then query the most similar obscure films based on that result? Or perhaps it could do both: enter a movie title or enter text freely. This is probably the best approach. A simple movie search function could perform the former and a free-text function could perform the latter. Luckily, the TMDb API makes it very easy to search for a movie title (that will definitely come in handy).
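The free-text use case can be sketched as follows. This is only an illustration under stated assumptions: the mini-corpus, the titles, and the function name `free_text_query` are hypothetical stand-ins, not the project's final query function. The key point is that the already-fitted vectorizer's `transform` is reused, so the query lands in the same feature space as the movies (words outside the learned vocabulary are simply ignored).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical mini-corpus standing in for movies_text.
toy_titles = ['Space Saga', 'Ocean Family', 'Desert Western']
toy_texts = [
    'rebel pilots fight an evil galactic empire in space',
    'a clownfish father searches the ocean for his lost son',
    'a retired gunslinger hunts outlaws across the desert',
]

toy_cv = CountVectorizer(stop_words='english')
toy_matrix = toy_cv.fit_transform(toy_texts)  # learn the vocabulary once

def free_text_query(query, vectorizer, matrix, titles, top_n=2):
    """Rank titles by cosine similarity to an arbitrary text query."""
    query_vec = vectorizer.transform([query])  # reuse the fitted vocabulary
    scores = cosine_similarity(matrix, query_vec).ravel()
    order = np.argsort(scores)[::-1][:top_n]
    return [(titles[i], float(scores[i])) for i in order]

results = free_text_query('father and son lost in the ocean',
                          toy_cv, toy_matrix, toy_titles)
print(results)
```

A title search would follow the same path after looking up the movie's existing row instead of transforming new text.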

## 2. LDA and Cosine Similarity Method

Next up is to compare movies in a similar vein to the previous model, but this model attempts to home in on the most relevant hidden topics found within the documents and gives each document a distribution over the topics created. This takes the count vectorizer's raw counts to another level by reducing them to a limited number of topics/themes.
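As a small illustration of what "a distribution over topics" means in practice, the sketch below fits scikit-learn's `LatentDirichletAllocation` on four made-up documents with two topics. The documents and variable names are assumptions for illustration only; with so little text the topics themselves are not meaningful, but the shape of the output is the same as in the real pipeline.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents, for illustration only.
toy_texts = [
    'space empire rebel starship battle',
    'starship crew explores distant space',
    'father son family ocean journey',
    'family reunion father daughter road trip',
]

toy_bow = CountVectorizer(stop_words='english').fit_transform(toy_texts)
toy_lda = LatentDirichletAllocation(n_components=2, random_state=27)

# One row per document, one column per topic.
doc_topics = toy_lda.fit_transform(toy_bow)

print(doc_topics.shape)
print(doc_topics.sum(axis=1))  # each row is a probability distribution
```

Each movie is therefore represented by `n_components` numbers instead of thousands of word counts, and cosine similarity is then computed between those shorter rows.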

In [212]:
from sklearn.decomposition import LatentDirichletAllocation

In [270]:
def lda_pipe(text_docs, feature_space, no_topics):
    cv = CountVectorizer(max_features=feature_space, stop_words='english')
    bow = cv.fit_transform(text_docs)
    lda = LatentDirichletAllocation(n_components=no_topics, learning_method='online', max_iter=5, random_state=27, n_jobs=3)
    lda_trans = lda.fit_transform(bow)
    all_vars = (cv, lda_trans, bow, lda)
    return all_vars

def lda_top_X_pipe(movie_id, pipe_output, x_sim):
    cos_sim = cosine_similarity(pipe_output[1], pipe_output[1][movies_titles.index.get_loc(movie_id)].reshape(1,-1))
    top_X_idx = pd.DataFrame(cos_sim, index=movies_titles.index).loc[:, 0].sort_values(ascending=False).iloc[1:(x_sim+1)].index
    print(f'Top {x_sim} Movies Similar to {movies_titles.loc[movie_id]}')
    return movies_titles.loc[top_X_idx]

In [247]:
lda_5000cv_100n = lda_pipe(movies_text, 5000, 100)

In [250]:
for movie_id in pop_movies.index[:5]:
    print(lda_top_X_pipe(movie_id, lda_5000cv_100n, 10))
    print()

Top 10 Movies Similar to Star Wars
id
1892              Return of the Jedi
404902                    The Sector
33                        Unforgiven
123907            Demonic Possession
150715    The Sheriff of Rock Spring
179921                   Black Spurs
465516              The Purple Hills
111942              Star in the Dust
51051              Dead Man's Bounty
12160                     Wyatt Earp
Name: title, dtype: object

Top 10 Movies Similar to Finding Nemo
id
346219     A Question of Love
592686    My Open Minded Wife
560441             They Fight
405225           The Triangle
406000            The Teacher
580844               Uda Aida
127380           Finding Dory
420590       De Zevende Hemel
90799        Hi Diddle Diddle
114              Pretty Woman
Name: title, dtype: object

Top 10 Movies Similar to Forrest Gump
id
263173                        Nguyen Van Troi
360648                     Ennu Ninte Moideen
407445                                Breathe
354308          

These results are certainly interesting. It looks as if things are connected, but only very loosely. Perhaps a greater number of topics is needed? Maybe more terms as well? The drawback of LDA is that it takes a decent amount of time to train; this particular instance took 17 minutes. That is definitely a knock against this method in comparison to the super simple one in the prior section. With unlimited computing power, this could turn up some interesting and useful results, but it looks like that would require a lot of tuning and time. Still, we'll give it a few more tries: one with double the count vectorizer features (10000) and one with double the number of topics (200).

In [273]:
lda_5000cv_200n = lda_pipe(movies_text, 5000, 200)
lda_10000cv_100n = lda_pipe(movies_text, 10000, 100)

In [275]:
for movie_id in pop_movies.index[:5]:
    print(lda_top_X_pipe(movie_id, lda_5000cv_200n, 10))
    print()
    
for movie_id in pop_movies.index[:5]:
    print(lda_top_X_pipe(movie_id, lda_10000cv_100n, 10))
    print()

Top 10 Movies Similar to Star Wars
id
1892                 Return of the Jedi
9255               Hot Shots! Part Deux
101665                         Hayabusa
245222    Treasure of the Golden Condor
1891            The Empire Strikes Back
298526                 St Mark's Gospel
490492                   The First Baby
301917              Crack in the Mirror
490655             Youth Will Be Served
487620                       Kill Order
Name: title, dtype: object

Top 10 Movies Similar to Finding Nemo
id
127380                                         Finding Dory
467366                                         Big Business
405225                                         The Triangle
205267                                    The Autism Puzzle
37718                                              The Muse
433571                                      False Pretenses
25890                Looking for Comedy in the Muslim World
86184     Doraemon: Nobita's Great Battle of the Mermaid...
396095       

As previously stated, with unlimited time and computing power, this appears to be a potentially useful method, but the results and speed at this point just don't appear to be as compelling as the really simple "Count Vectorizer + Cosine Similarity" combo. The giant drawback of LDA is the difficulty of tuning and the amount of time it takes to actually build a model (which makes tuning cumbersome). Perhaps doc2vec will give some noticeable improvements?

## 3. Doc2Vec

The final model type to be tested for this prototype is going to be doc2vec. This modeling technique builds off of word2vec, where each word is given its own vector (the more associated the words, the closer they appear in the vector-space). Instead of comparing words to one another, doc2vec compares documents to one another. So the goal here is to use doc2vec to train on the entire corpus, then use the model to find the most similar documents to a particular input.

In [278]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize

In [373]:
tagged_movies = [TaggedDocument(words=word_tokenize(text.lower()), tags=[i]) for i, text in zip(movies_text.index, movies_text.to_list())]

In [376]:
doc2vec = Doc2Vec(documents=tagged_movies, vector_size=100, min_count=1, window=3, max_vocab_size=5000, epochs=50, workers=4)

In [438]:
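def doc2vec_sim(movie_id, model, x_sim):
    top_X_idx = []
    j = 0
    #ask for a few extra neighbors in case some returned tags are not in movies_titles
    #use the model passed in rather than the global doc2vec variable
    for i in model.docvecs.most_similar(positive=movie_id, topn=x_sim+10):
        if j < x_sim:
            try:
                if movies_titles.loc[i[0]]:
                    top_X_idx.append(i[0])
                    j += 1
            except KeyError:
                #tag has no matching title, skip it
                pass
    print(f'Top {x_sim} Movies Similar to {movies_titles.loc[movie_id]}')
    return movies_titles.loc[top_X_idx]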

In [439]:
for movie_id in pop_movies.index[:5]:
    print(doc2vec_sim(movie_id, doc2vec, 10))
    print()

Top 10 Movies Similar to Star Wars
id
1891                          The Empire Strikes Back
168                     Star Trek IV: The Voyage Home
83768                                       Number 96
39230     Mobile Suit Gundam III: Encounters in Space
528646                                  Sentinel 2099
65665             Farewell to Space Battleship Yamato
14372                                       Leviathan
75612                                        Oblivion
227434     Star Worms II: Attack of the Pleasure Pods
147408                    Shaolin Temple Strikes Back
Name: title, dtype: object

Top 10 Movies Similar to Finding Nemo
id
13205                                 Bambi II
106182                         Alberto Express
16111                                  Hanuman
106207                              Sea People
151492    The Life Story of David Lloyd George
182212                           Ya La Hicimos
399975                        Fourwinds Island
402036               O T

These really aren't that great. I do think with some proper tuning and computing power, some useful movies could be found, but in this case, due to the amount of time this takes as well, it appears that the very clear winner looks to be the simplest bag of words and cosine similarity model.

## 4. Model Selection and Getting into the Correct Format for Queries

After reviewing the above models, the one that is the fastest to use and to query is the bag of words/count vectorizer + cosine similarity model. This model essentially queries the dataset whenever it needs to find the most similar movies. The next steps are to get this model into a format conducive to querying and to set up a function to perform those queries. This is essentially a case where, if it's simple and it works, why over-complicate things with fancier models? This is also the point where the obscurity filter comes into play. Only movies that fall into the particular pool of obscurity should be returned.

In [8]:
tokenizer = cosine_all_features[0]
movie_vectors = cosine_all_features[1]

Right now, when `top_X_pipe` is called, the X most similar movies based on cosine similarity are returned. Here's a quick example along with a reminder of the actual code.

In [450]:
top_X_pipe(11, cosine_all_features, 10)

Top 10 Movies Similar to Star Wars


id
1891                           The Empire Strikes Back
1892                                Return of the Jedi
74849                    The Star Wars Holiday Special
140607                    Star Wars: The Force Awakens
1895      Star Wars: Episode III - Revenge of the Sith
1893         Star Wars: Episode I - The Phantom Menace
1894      Star Wars: Episode II - Attack of the Clones
181808                        Star Wars: The Last Jedi
567097                             Star wars: Goretech
181812                Star Wars: The Rise of Skywalker
Name: title, dtype: object

In [451]:
# def top_X_pipe(movie_id, pipe_output, x_sim):
#     cos_sim = cosine_similarity(pipe_output[1], pipe_output[1][movies_titles.index.get_loc(movie_id)])
#     top_X_idx = pd.DataFrame(cos_sim, index=movies_titles.index).loc[:, 0].sort_values(ascending=False).iloc[1:(x_sim+1)].index
#     print(f'Top {x_sim} Movies Similar to {movies_titles.loc[movie_id]}')
#     return movies_titles.loc[top_X_idx]

Currently, the entire matrix is queried for similarities. That is no longer necessary: only obscure movies should be candidates for recommendation. It is still important to keep the original vectors for every movie, though. What if a user wants to search for movies that are similar to a well-known movie? To do that, the original vectors must still be available. We'll start with the simplest case, which is searching by a particular movie ID. But first, the obscure movies' indexes need to be brought in.

In [10]:
rec_pool = pd.read_pickle(join(movies_dir, pool_file))

For these movies, the position of each row within `movie_vectors` must be found, which is relatively easy to do with a list comprehension.

In [11]:
rec_idxs = [movies_titles.index.get_loc(movie_id) for movie_id in rec_pool.index]

In [12]:
rec_vectors = movie_vectors[rec_idxs]

And just like that, the pool of available movies and their respective vectors have been gathered.
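
The lookup above can be pictured without pandas. The sketch below, using made-up IDs and vectors, maps each movie ID to its row position once and then selects the rows for the recommendation pool. The names `all_ids`, `all_vectors`, and `pool_ids` are hypothetical stand-ins for `movies_titles.index`, `movie_vectors`, and `rec_pool.index`.

```python
# Hypothetical stand-ins: IDs in the same order as the rows of the vector matrix
all_ids = [11, 1891, 10179, 75311, 9703]
all_vectors = [
    [1, 0, 2],
    [1, 1, 2],
    [0, 1, 1],
    [1, 0, 0],
    [0, 2, 0],
]
pool_ids = [10179, 9703]  # the "obscure" subset

# Build the ID -> row position mapping once instead of searching per lookup
id_to_row = {movie_id: row for row, movie_id in enumerate(all_ids)}

# Translate pool IDs into row positions, then pull out those rows
pool_rows = [id_to_row[movie_id] for movie_id in pool_ids]
pool_vectors = [all_vectors[row] for row in pool_rows]

print(pool_rows)
print(pool_vectors)
```

In pandas, `Index.get_indexer` performs the same label-to-position lookup for a whole list of labels in one call (on a unique index), which may be worth considering as an alternative to calling `get_loc` in a loop.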

In [68]:
def top_X_obscure(movie_id, x_sim=10):
    # Compare the query movie's vector against the obscure pool only
    cos_sim = cosine_similarity(rec_vectors, movie_vectors[movies_titles.index.get_loc(movie_id)])
    ranked = pd.DataFrame(cos_sim, index=rec_pool.index).loc[:, 0].sort_values(ascending=False)
    # If the query movie is itself in the pool, drop it by ID so it is not recommended to itself.
    # `rec_pool.index` holds movie IDs, whereas `rec_idxs` holds row positions.
    top_X_idx = ranked.drop(movie_id, errors='ignore').iloc[:x_sim].index
    print(f'Top {x_sim} Obscure Movies Similar to {movies_titles.loc[movie_id]}')
    # Build a clickable TMDb link for each recommended movie
    movie_links = ['https://www.themoviedb.org/movie/'+str(i) for i in top_X_idx]
    movie_df = pd.DataFrame(movies_titles.loc[top_X_idx])
    movie_df['links'] = movie_links
    return movie_df.to_markdown()
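
One subtlety in a function like this is keeping the query movie out of its own results when it belongs to the pool. The sketch below is a simplified illustration with made-up IDs and scores. `rank_pool` is a hypothetical helper that removes the query by ID rather than assuming it sits in the first position after sorting, since ties (for example, two movies with identical text) could place another movie first.

```python
def rank_pool(query_id, pool_scores, top_n=3):
    """Return the top_n pool IDs by score, never including the query itself.

    pool_scores maps pool movie ID -> similarity to the query movie.
    """
    candidates = {mid: s for mid, s in pool_scores.items() if mid != query_id}
    return sorted(candidates, key=candidates.get, reverse=True)[:top_n]


# Made-up similarity scores for a pool of four movies with placeholder IDs
scores = {101: 1.0, 102: 0.62, 103: 0.40, 104: 0.35}

# Query movie 101 is in the pool (it scores 1.0 against itself) and is dropped
print(rank_pool(101, scores))
# Query movie 999 is not in the pool, so nothing needs removing
print(rank_pool(999, scores))
```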

In [70]:
for movie_id in pop_movies.index[:5]:
    print(top_X_obscure(movie_id, 10))
    print()

Top 10 Obscure Movies Similar to Star Wars
|     id | title                                             | links                                   |
|-------:|:--------------------------------------------------|:----------------------------------------|
|  10179 | The Ice Pirates                                   | https://www.themoviedb.org/movie/10179  |
|  75311 | The People vs. George Lucas                       | https://www.themoviedb.org/movie/75311  |
|   9703 | The Last Legion                                   | https://www.themoviedb.org/movie/9703   |
| 328429 | Approaching the Unknown                           | https://www.themoviedb.org/movie/328429 |
|  17277 | The Fall of the Roman Empire                      | https://www.themoviedb.org/movie/17277  |
|  19287 | Leprechaun 4: In Space                            | https://www.themoviedb.org/movie/19287  |
|  23719 | Trapped in Paradise                               | https://www.themoviedb.org/movie/23719  |
| 388885 | S

Alright, this version of the query appears to be working, but what about a user who wants to search for a movie by name? It should be assumed that the user doesn't know the ID of the movie being searched for. This is where the TMDb API comes in handy.

In [15]:
from tmdbv3api import Movie, TMDb

In [16]:
tmdb = TMDb()
movie = Movie()

In [17]:
API_KEY_PATH = r'C:\Users\philb\Datasets\API_KEY.txt'
# The `with` block closes the file automatically; strip the trailing newline from the key
with open(API_KEY_PATH) as f:
    API_KEY = f.readline().strip()
tmdb.api_key = API_KEY
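
As an optional variation, the key could be looked up in an environment variable first, with the file as a fallback, so that no machine-specific path is needed when the notebook runs elsewhere. This is only a sketch: the helper `load_api_key` and the variable name `TMDB_API_KEY` are assumptions made for illustration and are not part of `tmdbv3api` or the original notebook.

```python
import os
from pathlib import Path


def load_api_key(path=None, env_var="TMDB_API_KEY"):
    """Return an API key from an environment variable or, failing that, a file.

    Both the env_var name and this helper are illustrative assumptions.
    """
    key = os.environ.get(env_var, "").strip()
    if not key and path is not None:
        # Read only the first line and strip the trailing newline/whitespace
        lines = Path(path).read_text().splitlines()
        key = lines[0].strip() if lines else ""
    if not key:
        raise ValueError("No API key found in the environment or the key file")
    return key
```

The result would then be assigned the same way as before, for example `tmdb.api_key = load_api_key(API_KEY_PATH)`.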

In [24]:
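import time
from IPython.display import clear_output

def get_id(search_term):
    print('------------')
    print('MOVIE SEARCH')
    print('------------')
    try:
        # Number the first page of search results; `result` avoids shadowing the global `movie` object
        page1 = list(enumerate(movie.search(search_term)))
        for i, result in page1:
            print(i, result)
        time.sleep(.01)  # brief pause so the list is printed before the input prompt appears
        movie_sel = int(input('Enter index # of selected movie: '))
        movie_id = page1[movie_sel][1].id
        clear_output(wait=True)
        return movie_id
    except Exception:
        # Covers an empty result set, a non-numeric entry, or an out-of-range index
        print('No movies were found using that query.')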
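
The selection step is the part most likely to fail on bad input, so it can help to isolate it from the API call and the prompt. Below is a small sketch of that validation on its own; `choose_result` is a hypothetical helper and the result tuples are placeholders, not real API responses.

```python
def choose_result(results, raw_choice):
    """Return the item at the index typed by the user, or None if the input is invalid.

    raw_choice is the string returned by input().
    """
    try:
        index = int(raw_choice)
    except ValueError:
        return None
    if 0 <= index < len(results):
        return results[index]
    return None


# Placeholder (id, title) pairs standing in for search results
fake_results = [(101, "Example Movie A"), (102, "Example Movie B")]
print(choose_result(fake_results, "1"))    # valid index
print(choose_result(fake_results, "7"))    # out of range
print(choose_result(fake_results, "abc"))  # not a number
```

Rejecting negative indexes explicitly matters here because Python would otherwise accept `-1` and silently return the last result.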

In [27]:
top_X_obscure(get_id('the dark knight'))

Top 10 Obscure Movies Similar to The Dark Knight


id
464882                      Batman vs. Two-Face
342917         Batman Unlimited: Monster Mayhem
69735                          Batman: Year One
411736    Batman: Return of the Caped Crusaders
366924                        Batman: Bad Blood
21683           Batman: Mystery of the Batwoman
16234        Batman Beyond: Return of the Joker
624479      Superman II: The Richard Donner Cut
17445               Green Lantern: First Flight
30061      Justice League: Crisis on Two Earths
Name: title, dtype: object

Alright, that works pretty well. The results lean heavily toward animated films because those are the most similar based on the text representations of the movies; for Batman, this just happens to be the case. Still, the model is appropriately identifying movies with the same topics.

In [71]:
def user_query(user_input, x_sim=10):
    # `tokenizer` is the fitted vectorizer; map the free-text query into the same vector space
    vector = tokenizer.transform([user_input])
    cos_sim = cosine_similarity(rec_vectors, vector)
    # The query is not a movie in the pool, so no result needs to be skipped
    top_X_idx = pd.DataFrame(cos_sim, index=rec_pool.index).loc[:, 0].sort_values(ascending=False).iloc[0:(x_sim)].index
    print(f'Top {x_sim} Obscure Movies Similar to user search: "{user_input}"')
    movie_links = ['https://www.themoviedb.org/movie/'+str(i) for i in top_X_idx]
    movie_df = pd.DataFrame(movies_titles.loc[top_X_idx])
    movie_df['links'] = movie_links
    return movie_df.to_markdown()

In [73]:
print(user_query('spiderman'))

Top 10 Obscure Movies Similar to user search: "spiderman"
|     id | title                  | links                                   |
|-------:|:-----------------------|:----------------------------------------|
| 460738 | Super Singh            | https://www.themoviedb.org/movie/460738 |
| 268508 | The Mummy Resurrected  | https://www.themoviedb.org/movie/268508 |
|  24488 | Demonlover             | https://www.themoviedb.org/movie/24488  |
|  24499 | Scar                   | https://www.themoviedb.org/movie/24499  |
|  24505 | Necromentia            | https://www.themoviedb.org/movie/24505  |
|  24508 | The Lineup             | https://www.themoviedb.org/movie/24508  |
|  24518 | The Long, Long Trailer | https://www.themoviedb.org/movie/24518  |
|  24525 | Hurlyburly             | https://www.themoviedb.org/movie/24525  |
|  24527 | Undefeatable           | https://www.themoviedb.org/movie/24527  |
|  24528 | Killer Bean Forever    | https://www.themoviedb.org/movie/24528  |
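
A free-text query can only match on tokens that exist in the fitted vocabulary. A count vectorizer ignores unseen terms at transform time, so a query made up entirely of unknown words becomes an all-zero vector, every cosine similarity is 0, and the order of the returned rows carries no information. The sketch below shows one possible guard: keep only results whose similarity is above zero. The helper `rank_nonzero`, the IDs, and the scores are all hypothetical.

```python
def rank_nonzero(scores, top_n=10):
    """Return (id, score) pairs sorted by score, dropping zero-similarity rows."""
    matched = [(mid, s) for mid, s in scores.items() if s > 0]
    matched.sort(key=lambda pair: pair[1], reverse=True)
    return matched[:top_n]


# Made-up similarities: only two pool movies share any token with the query
scores = {1: 0.22, 2: 0.31, 3: 0.0, 4: 0.0}
print(rank_nonzero(scores))

# A query with no known tokens matches nothing, so the result is empty
print(rank_nonzero({1: 0.0, 2: 0.0}))
```

Returning fewer than `top_n` rows, or none at all, is arguably more honest than padding the list with movies that share nothing with the query.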


In [74]:
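def get_recommendations(query, kind='movie', x_sim=10):
    if kind == 'movie':
        movie_id = get_id(query)
        if movie_id is not None:  # get_id returns None when the search or selection fails
            print(top_X_obscure(movie_id, x_sim))
    elif kind == 'query':
        print(user_query(query, x_sim))
    else:
        raise ValueError(f"kind must be 'movie' or 'query', got {kind!r}")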

In [77]:
get_recommendations('high school musical', 'movie', 10)

Top 10 Obscure Movies Similar to High School Musical
|     id | title                         | links                                   |
|-------:|:------------------------------|:----------------------------------------|
|   8669 | Charlie Bartlett              | https://www.themoviedb.org/movie/8669   |
| 306943 | The Outcasts                  | https://www.themoviedb.org/movie/306943 |
|  24232 | Daria in 'Is It College Yet?' | https://www.themoviedb.org/movie/24232  |
|  12621 | Hamlet 2                      | https://www.themoviedb.org/movie/12621  |
|   5693 | Hoosiers                      | https://www.themoviedb.org/movie/5693   |
|  20224 | Remember the Daze             | https://www.themoviedb.org/movie/20224  |
|  10013 | Peggy Sue Got Married         | https://www.themoviedb.org/movie/10013  |
|  13259 | American Teen                 | https://www.themoviedb.org/movie/13259  |
|  26156 | Hiding Out                    | https://www.themoviedb.org/movie/26156  |
|  54555 | S

And there we go. It is looking pretty good: far from perfect, for sure, but the obscurity is definitely there, the relative similarities are apparent, and the results come with nice clickable links.