<h1 align='center'>Final Capstone: <i>Betcha Can't Guess What I Watched</i></h1>
<h2 align='center'>Philip Bowman</h2>
<h1 align='center'><u>Final Notebook: Project Summary</u></h1>

## Overview:
The goal of this project is to create a movie recommendation system with a twist: to recommend obscure/relatively unknown movies. Think of it as a hidden gem detector for films. The product's application will include the ability to enter keywords or another movie title, then generate a list of obscure movies that are highly associated with the query. The data used comes from [The Movie DB](https://www.themoviedb.org/) by means of one of their various APIs.

***This product uses the TMDb API but is not endorsed or certified by TMDb.***

<img src=".\TMDb_resources\blue_square_2-d537fb228cf3ded904ef09b136fe3fec72548ebc1fea3fbbd1ad9e36364db38b.svg"
width="100" height="50" align='left'>

## Notebook Breakdown:
This notebook summarizes the project as a whole. Most sections have ancillary notebooks or scripts that go deeper into how things were done; where one exists, the section title is a clickable link to it. The notebook serves primarily as a rundown of the most important highlights of each step. Its contents are as follows:
1. Opportunity and Limitations
2. [Data Collection](https://github.com/philbowman212/Thinkful_repo/blob/master/projects/final_capstone/data_collection.py)
3. [Data Wrangling and Exploration](https://github.com/philbowman212/Thinkful_repo/blob/master/projects/final_capstone/data_wrangling_and_exploration.ipynb)
4. [Modeling Pipeline and Evaluation](https://github.com/philbowman212/Thinkful_repo/blob/master/projects/final_capstone/modeling_pipeline_and_evaluation.ipynb)
5. [Resulting Product](https://github.com/philbowman212/Thinkful_repo/blob/master/projects/final_capstone/final_product.ipynb) and Summary

## 1. Opportunity and Limitations

Movie recommendation systems are pretty much ubiquitous in this day and age. But what if you want to watch something that isn't as popular or well known to the general moviegoer? For individuals who want to be the first in their friend group to watch a new movie, for movie buffs who want to round out their awareness, or for those who are just getting bored of the same old recommendations over and over again, this is the recommendation system to use. By using a content-based recommender with the caveat that the recommended movies should be obscure enough that you've probably never heard of them, you'll be watching some new, interesting, and memorable films. The question then arises: how is obscure defined? What makes a movie relatively unknown?

For the purposes of this project, an obscure movie is defined as one that falls under a certain threshold for number of ratings. The data collected includes the number of ratings submitted for each movie, so this variable is examined closely to identify an appropriate hard cutoff. In the future, it may be useful to create a more robust system for defining obscurity in the context of this problem. For now, in this prototypical phase, a hard cutoff on the volume of ratings is prudent. Note that this threshold limits the definition of obscurity: movies that were obscure at release but have since garnered a cult following will not be considered obscure. Also, recently released and soon-to-be-released movies are excluded, since they have not yet had time to garner "popularity" and so cannot yet be judged obscure or not.

So, to settle on this threshold for obscurity, a deep dive into the data must be conducted. But first, a quick explanation of the data collection process is in order.

## 2. [Data Collection](https://github.com/philbowman212/Thinkful_repo/blob/master/projects/final_capstone/data_collection.py)

As mentioned previously, the data was collected from [The Movie DB](https://www.themoviedb.org/) (aka TMDb). This source has been grown and built exclusively by its community since 2008 and holds an extremely large record of movies and television shows. One of its big selling points is the user-friendly nature of the database: there are a large number of API wrapper libraries in multiple languages for querying it. For this project, I utilized [tmdbv3api](https://github.com/AnthonyBloomer/tmdbv3api), "A lightweight Python library for The Movie Database (TMDb) API." Using this library in a Python script, [`data_collection.py`](https://github.com/philbowman212/Thinkful_repo/blob/master/projects/final_capstone/data_collection.py), movies and their relevant information were collected and stored in comma-separated values (CSV) format. The data includes many variables for each movie: overview, title, genre, id, original language, popularity, tagline, runtime, cast/crew, and more. In all, nearly 400,000 movies and their relevant data were collected. A quick snippet of how a movie's data was collected is below:
```python
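from tmdbv3api import Movie  # assumes the TMDb API key has already been set

movie = Movie()
current_movie = movie.details(page_id)  # page_id is the TMDb movie ID to fetch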
```
Now, the actual script is more verbose as it iterates over numerous `page_id`s and saves each run in a particular batch; it also skips over IDs that don't exist and films that are marked adult. But the above is essentially how the data was collected for every single movie. Extremely user friendly and simple!
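The batching and skipping described above could look roughly like the sketch below. This is not the project's actual script: `fetch_details` is a hypothetical callable standing in for `movie.details`, it is assumed to raise an exception for IDs that do not exist, and the column list and file handling are illustrative choices.

```python
import csv
from pathlib import Path

FIELDS = ["id", "title", "overview", "popularity", "vote_count"]  # illustrative subset


def collect_batch(ids, fetch_details, out_path):
    """Fetch each ID, skip missing or adult titles, and write one CSV batch.

    fetch_details is a hypothetical stand-in for movie.details; it is assumed
    to return a dict-like record and to raise an exception for unknown IDs.
    """
    rows = []
    for movie_id in ids:
        try:
            record = fetch_details(movie_id)
        except Exception:
            continue  # ID does not exist (or the request failed): skip it
        if record.get("adult"):
            continue  # films marked adult are excluded
        rows.append({field: record.get(field) for field in FIELDS})

    # One file per batch keeps a long collection run restartable
    with Path(out_path).open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Passing the fetcher in as an argument keeps the loop testable without network access; in a real run it would wrap the library call shown earlier.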

When all was said and done, there were a total of 45 CSV files of varying length and size containing information on 390,000 movies from TMDb. Considering TMDb currently claims to have [533,814 movies](https://www.themoviedb.org/faq/general) as of this project's creation, that means roughly 73% of all their movies have been collected for use in this project. This should hopefully make for a good basis for a recommendation system model. But before making a model, the data must first be explored and put into a form that a model can use.

In the future, it would be wise to collect and maintain the entire database and update changes made to films as they are updated, but for the prototyping phase, this was not done. It's best to make sure a model of some value could be created from a sample of the data before deploying it and setting up all the infrastructure needed to maintain all that information (i.e. if a good model is not found, that's a lot of overhead for nothing).

## 3. [Data Wrangling and Exploration](https://github.com/philbowman212/Thinkful_repo/blob/master/projects/final_capstone/data_wrangling_and_exploration.ipynb)

These two components have been lumped together because they go hand in hand. In order to wrangle data, one must explore and understand it, and in order to explore data, it must first be wrangled into a reasonable shape. In the data collection step, the data was partially wrangled in the sense that only potentially relevant information was collected for each movie. However, it was not yet ready for modeling; for that to happen, the data had to be explored further.

In this step, an extensive look into the variables at hand was done. First off, irrelevant and unhelpful variables were removed from the dataset to clean things up. Then a number of limitations were made to the data based on particular variables. Notable limitations and changes are as follows:
- Movies that have been released are the only ones included - the user is likely only interested in recommendations that have actually been released (not tentative titles or those still in production).

<img src='.\snips\status.PNG'>


- The movie runtime has been limited to those between 1hr (60mins) and 4hrs (240mins) - it is assumed the user would use this recommendation engine to find a movie they could watch in one sitting (and it's still considered a movie).

<img src='.\plots\runtime_distplot.png'>


- Extremely short outlines/summaries have been removed as they generally didn't hold any information about the film.

<img src='.\plots\boxplot_overview_length.png'>


- Extracted important features from dictionary and list variables such as language, top actors, director, genres, etc. (This makes up the bulk of the useful information for modeling.)
- Drilled down to find the most important features for defining a movie as obscure and found that the popularity variable and the vote count variable ended up being the most useful for determining a film that is generally unknown.

<img src='.\plots\vote_count_vs_popularity_scatter.png'>


- Decided on a particular threshold for the aforementioned variables after exploring many iterations: no more than 499 votes on a movie and a popularity score less than 11 were the decided limiting factors. Below is a scatter plot containing the movies in that range.

<img src='.\plots\vote_count_vs_popularity_zoomed_scatter.png'>


- Further limited the vote count parameter as movies with very low vote counts are likely very difficult to find (and ultimately watch). Therefore, only movies that have 20 or more votes are included, limiting the vote counts to between 20 and 499 (inclusive). A quick look at some "obscure" films below:

<img src='.\snips\obscure_sample1.png'>
<img src='.\snips\obscure_sample2.PNG'>
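Taken together, the filters above amount to a single predicate over each movie record. The sketch below is one way to express it, assuming records are plain dicts keyed by TMDb's field names (`status`, `runtime`, `overview`, `vote_count`, `popularity`). The `min_overview_chars` cutoff is a hypothetical value, since the exact overview-length threshold is not stated here, and the two sample movies are made up.

```python
def is_obscure_candidate(movie, min_overview_chars=50):
    """Return True when a movie record passes the filters listed above.

    min_overview_chars is a hypothetical cutoff chosen for illustration.
    """
    return (
        movie.get("status") == "Released"
        and 60 <= (movie.get("runtime") or 0) <= 240        # watchable in one sitting
        and len(movie.get("overview") or "") >= min_overview_chars
        and 20 <= (movie.get("vote_count") or 0) <= 499     # findable, but little known
        and (movie.get("popularity") or 0) < 11
    )


# Made-up records for illustration only
movies = [
    {"title": "Hypothetical Gem", "status": "Released", "runtime": 95,
     "overview": "A quiet drama about a lighthouse keeper and a stranded sailor.",
     "vote_count": 42, "popularity": 3.2},
    {"title": "Hypothetical Blockbuster", "status": "Released", "runtime": 130,
     "overview": "A loud sequel in which everything explodes at least twice over.",
     "vote_count": 12000, "popularity": 85.0},
]
obscure_pool = [m for m in movies if is_obscure_candidate(m)]
print([m["title"] for m in obscure_pool])
```

Keeping the thresholds in one function makes it easy to revisit them later, for example if the definition of obscurity is widened.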

So, after an extended exploration phase, the data was now more or less ready to move on to the modeling stage. The most important aspect of this section was to define what makes a movie obscure in this context and to get the data into a format that was conducive to modeling. Along the way some fun facts were discovered and an overall understanding of this data was created. Also, a pool of movies that are considered to be obscure and fit into the definitions and limitations of what obscure means in this context was produced for future use. The next logical step was to create some models using all the text data available and see if some reasonable recommendations could be made!

## 4. [Modeling Pipeline and Evaluation](https://github.com/philbowman212/Thinkful_repo/blob/master/projects/final_capstone/modeling_pipeline_and_evaluation.ipynb)

In this section, the textual data was transformed using either scikit-learn's `CountVectorizer` or NLTK's `word_tokenize`, depending on the model used. The models tested were as follows:
- CountVectorizer in conjunction with cosine similarity (the simplest)
- Latent Dirichlet allocation (LDA) in conjunction with cosine similarity (relatively simple with the added step of reducing dimensions to their latent topics first)
- doc2vec (the most sophisticated of the three options presented, but also the most complicated)

In order to validate and test results, the same five popular movies were queried for each iteration of each model to see how the parameters affected results. The ultimate goal here was to find a model that was consistent and that had seemingly good results when querying all movies. For this particular format of data and modeling, it was hard to find a proper quantitative way to validate results, so results were judged qualitatively, using some domain knowledge of movies, by inspecting each iteration's recommendations for these five movies. This is far from a robust system for validation and is definitely an area for improvement, as the results are really only as good as my research and domain knowledge.

With that being said, qualitatively decent results seemed to be appearing. Here's a look at *Star Wars*, *Finding Nemo*, and *American Beauty* under an iteration of each model type:

**CountVectorizer + Cosine Similarity**
<img src='.\snips\cosine_star_wars.PNG'>
<img src='.\snips\cosine_finding_nemo.PNG'>
<img src='.\snips\cosine_american_beauty.PNG'>

**LDA + Cosine Similarity**
<img src='.\snips\lda_star_wars.PNG'>
<img src='.\snips\lda_finding_nemo.PNG'>
<img src='.\snips\lda_american_beauty.PNG'>

**Doc2Vec**
<img src='.\snips\doc2vec_star_wars.PNG'>
<img src='.\snips\doc2vec_finding_nemo.PNG'>
<img src='.\snips\doc2vec_american_beauty.PNG'>

After looking into multiple iterations for most of the models above, it was decided that the simplest of the models (CountVectorizer + Cosine Similarity) actually turned out to perform the best in both execution time and quality of results!
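To make the bag-of-words plus cosine similarity idea concrete, here is a dependency-free sketch. It is not the project's pipeline: `bag_of_words` is only a rough stand-in for scikit-learn's `CountVectorizer` (its tokenization differs), and the function names and the `(title, text)` pool format are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count word tokens (a rough CountVectorizer stand-in)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(query_text, pool, top_n=10):
    """Rank the (title, text) pairs in pool by similarity to query_text."""
    query_vec = bag_of_words(query_text)
    scored = [(cosine_similarity(query_vec, bag_of_words(text)), title)
              for title, text in pool]
    return [title for score, title in sorted(scored, reverse=True)[:top_n]]
```

Because the ranking only ever sees the pool it is given, passing in just the movies that met the obscurity thresholds is enough to keep every recommendation obscure. The same function works whether `query_text` is another movie's text or a few keywords typed by a user, though very short queries share few tokens with any overview and so tend to score low.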

## 5. [Resulting Product](https://github.com/philbowman212/Thinkful_repo/blob/master/projects/final_capstone/final_product.ipynb) and Summary

After that selection was made, a filter had to be created to limit the results to only those movies that were defined to be *obscure* in the data wrangling and exploration phase. This was pretty easy to do, as the movies compared for similarity were simply limited to those in the pool of obscure movies. Next, the user shouldn't have to know the exact movie ID when querying movies similar to another. The TMDb API came in handy here as it has a very useful search feature, so that was built into the function in a GUI-like format. That way, if you search for "batman" but actually want "Batman & Robin", the movie will likely appear on the first page of results (which is the page shared with the user). The user is then instructed to enter the index for that particular movie on the page, and the function finds the movie ID associated with that film. One future addition here should be a way to cycle through pages if the desired result is not on the first page returned.
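The pick-by-index step might be sketched as follows. The function name, the shape of `results` (a list of dicts with `id` and `title` keys), and the injectable `read_input` parameter are all assumptions for illustration; the project's own `get_id` helper may work differently.

```python
def choose_movie_id(results, read_input=input):
    """List the first page of search results and return the chosen movie's ID.

    results is assumed to be a list of dicts with "id" and "title" keys;
    read_input is injectable so the prompt can be replaced in tests.
    """
    for index, movie in enumerate(results):
        print(f"[{index}] {movie['title']}")
    choice = int(read_input("Enter the index of the movie you meant: "))
    if not 0 <= choice < len(results):
        raise IndexError(f"index {choice} is not on this page")
    return results[choice]["id"]
```

For example, with a first page of `[{"id": 101, "title": "Batman"}, {"id": 102, "title": "Batman & Robin"}]` (placeholder IDs), entering `1` returns `102`.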

Also, the user shouldn't be limited to searching by comparison to other movies; the model should also be able to find movies similar to an input made up solely of the words a user enters. So an additional query parameter was added to indicate a search on general words rather than a film-specific comparison. This feature likely needs the most work, as user input is likely to be very sparse and high cosine similarities may be hard to generate. As it is, the more you enter, the better the recommendations should be. With that being said, here is the final function call that actually gathers recommendations based on function parameters:

```python
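def get_recommendations(query, kind='movie', x_sim=10):
    """Dispatch a request for `x_sim` obscure recommendations by query kind."""
    if kind == 'movie':
        # Resolve the title to a TMDb movie ID, then find similar obscure movies
        top_X_obscure(get_id(query), x_sim)
    elif kind == 'query':
        # Treat the input as free-text keywords
        user_query(query, x_sim)
    else:
        raise ValueError("kind must be 'movie' or 'query'")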
```

Now, there are a number of helper functions in place to get from query to recommendations, but the interface itself is relatively simple and quite intuitive. The result is far from perfect, but it makes for a pretty great initial prototype. Here's a look at a few queries:

Top 10 Obscure Movies Similar to user search: "space marine"

|     id | title                              | links                                   |
|-------:|:-----------------------------------|:----------------------------------------|
|  25952 | Conquest of Space                  | https://www.themoviedb.org/movie/25952  |
|  11802 | Space Chimps                       | https://www.themoviedb.org/movie/11802  |
| 438740 | Salyut-7                           | https://www.themoviedb.org/movie/438740 |
|  13766 | SpaceCamp                          | https://www.themoviedb.org/movie/13766  |
|  10916 | Babylon 5: A Call to Arms          | https://www.themoviedb.org/movie/10916  |
|    811 | Silent Running                     | https://www.themoviedb.org/movie/811    |
|  40430 | Journey to the Far Side of the Sun | https://www.themoviedb.org/movie/40430  |
|  34069 | Cargo                              | https://www.themoviedb.org/movie/34069  |
|  15493 | Star Wreck: In the Pirkinning      | https://www.themoviedb.org/movie/15493  |
| 522964 | Incoming                           | https://www.themoviedb.org/movie/522964 |

Top 10 Obscure Movies Similar to The Lord of the Rings: The Fellowship of the Ring

|    id | title                                         | links                                  |
|------:|:----------------------------------------------|:---------------------------------------|
|  7234 | Wizards of the Lost Kingdom                   | https://www.themoviedb.org/movie/7234  |
|  2274 | The Seeker: The Dark Is Rising                | https://www.themoviedb.org/movie/2274  |
| 24993 | Kickboxer 2:  The Road Back                   | https://www.themoviedb.org/movie/24993 |
|  9964 | Bad Taste                                     | https://www.themoviedb.org/movie/9964  |
| 87689 | The Dragon Ring                               | https://www.themoviedb.org/movie/87689 |
|  1362 | The Hobbit                                    | https://www.themoviedb.org/movie/1362  |
| 14034 | What's the Worst That Could Happen?           | https://www.themoviedb.org/movie/14034 |
| 11188 | Ring of the Nibelungs                         | https://www.themoviedb.org/movie/11188 |
| 12129 | Herbie Goes Bananas                           | https://www.themoviedb.org/movie/12129 |
| 82864 | King Kong: Peter Jackson's Production Diaries | https://www.themoviedb.org/movie/82864 |

**Things that could be improved:**
- Maintain an up to date database or create a more sophisticated method for making queries to TMDb
- Create filters based on other movie information (perhaps use some of the textual information used to compare films as filters instead and only use overview/keywords/taglines as comparable text)
- Create a more sophisticated way to track and define what makes a film *obscure*
- Ability to sort by obscurity (after the above bullet has been done)
- Ability to sort by user rating
- Ability to sort by number of votes and popularity (the current thresholds for *obscurity*)
- Make a webpage with a GUI that does all this in a user friendly format
- Limit and link to movies that are available to stream on popular platforms only (may need to widen threshold for obscurity in this case) OR automatically search streaming sites to see if particular movies are available
- Find a more sophisticated method for calculating similarity between movies and their vectors (beyond simple BoW and Cosine Similarity) --> LDA and Doc2Vec look to be reasonable, but require much more computing time and power to tune a workable model
- Incorporate user feedback ratings for results in order to actually be able to train a model to know if its recommendations were good or not

The ultimate goal was to present reasonable and obscure recommendations to a user based on a query. Most of the movies seen using this system are not recognizable, which was the goal! So, I would definitely call it a success in that respect. Now, the complexity and ability of the model itself definitely require improvement, but overall, it appears to catch specific and more general terms relatively well even in this prototypical phase.

<h1 align=center><i>Fin</i></h1>