## About Me 

This notebook focuses on querying TMDB for movies efficiently.

We start by using the logic in `eda1.parse_netflix` to split each title into sections and identify which entries are TV shows and which are movies.

Next, we take the movie list, search for each title with the TMDB movie API, and select the fields we want from the response.

The final logic can be found in the **get_movie_API_results** function.

Alternate idea:
* Create a database of all previously searched titles, so that titles already archived can be looked up without calling the TMDB API again
* Not necessary right now
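
The archive idea above could be sketched as a small pickle-backed cache. This is only a sketch and not part of the notebook's code: `cached_lookup`, `cache_path`, and `fetch` are hypothetical names, and `fetch` stands in for any function that maps a title to a result dict (for example `get_movie_API_results`). The demo uses a stand-in fetch function so no network call is made.

```python
import os
import pickle
import tempfile


def cached_lookup(title, fetch, cache_path):
    """Return the archived result for `title`; call `fetch` only on a cache miss."""
    # Load the archive of previous searches, if one exists.
    if os.path.isfile(cache_path):
        with open(cache_path, 'rb') as hnd:
            cache = pickle.load(hnd)
    else:
        cache = {}

    if title not in cache:
        # Cache miss: query once and archive the result on disk.
        cache[title] = fetch(title)
        with open(cache_path, 'wb') as hnd:
            pickle.dump(cache, hnd)
    return cache[title]


# Demo with a stand-in for the API call (hypothetical, records each invocation).
calls = []

def fake_fetch(title):
    calls.append(title)
    return {'title_query': title}

demo_path = os.path.join(tempfile.mkdtemp(), 'search_cache.pkl')
first = cached_lookup('The Founder', fake_fetch, demo_path)
second = cached_lookup('The Founder', fake_fetch, demo_path)
```

The second lookup is served from the pickle file, so `fake_fetch` runs only once. Rewriting the whole file on every miss is simple but would become slow for a large archive.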

In [1]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import tmdbsimple as tmdb
import os
import sys
import pickle
import time

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
sys.path.append(os.path.abspath('../src'))

In [4]:
with open('../Data/api_key.pkl', 'rb') as hnd:
    tmdb.API_KEY = pickle.load(hnd)['api_key']

In [5]:
data = pd.read_csv('../Data/NetflixViewingHistory.csv')

In [6]:
import gather_data as eda1

In [7]:
netflix_df = eda1.parse_netflix(data)

Total number of TV Show + Movies:  1405
TV Show vs Movie
TV_Show    1357
Movie        48
Name: TV_Show_flag, dtype: int64
Dataframe shape:  (1405, 6)


In [8]:
# The value counts printed above list the flag values as 'TV_Show' and 'Movie'
shows = netflix_df[netflix_df['TV_Show_flag'] == 'TV_Show']
movies = netflix_df[netflix_df['TV_Show_flag'] == 'Movie']

In [9]:
search = tmdb.Search()

## Start with Movies

In [10]:
movies.head()

Unnamed: 0,Title,Date,Show Name,Season,Episode Name,TV_Show_flag
3,Trevor Noah: Son of Patricia,2018-11-23,Trevor Noah,Son of Patricia,,Movie
19,Captain Underpants: The First Epic Movie,2018-06-07,Captain Underpants,The First Epic Movie,,Movie
66,Saving Capitalism,2017-12-07,Saving Capitalism,,,Movie
67,Betting on Zero,2017-12-07,Betting on Zero,,,Movie
81,Banking on Bitcoin,2017-11-04,Banking on Bitcoin,,,Movie


In [11]:
row1 = movies.iloc[0]
row1

Title           Trevor Noah: Son of Patricia
Date                     2018-11-23 00:00:00
Show Name                        Trevor Noah
Season                       Son of Patricia
Episode Name                            None
TV_Show_flag                           Movie
Name: 3, dtype: object

In [12]:
row2 = movies.iloc[6]
row2

Title                   The Founder
Date            2017-08-16 00:00:00
Show Name               The Founder
Season                         None
Episode Name                   None
TV_Show_flag                  Movie
Name: 148, dtype: object

In [13]:
search_results = search.movie(query=row1['Title'])
n_results = len(search_results['results'])
print(n_results)
temp_id = search_results['results'][0]['id']
full_movie_results = tmdb.Movies(temp_id)

1


In [14]:
full_movie_results.info()

{'adult': False,
 'backdrop_path': '/hDd2RWYR0mwGeF5oms5Ulr9zrhh.jpg',
 'belongs_to_collection': None,
 'budget': 0,
 'genres': [{'id': 35, 'name': 'Comedy'}, {'id': 10770, 'name': 'TV Movie'}],
 'homepage': 'https://www.netflix.com/title/80239932',
 'id': 558341,
 'imdb_id': 'tt9170648',
 'original_language': 'en',
 'original_title': 'Trevor Noah: Son of Patricia',
 'overview': 'Trevor Noah gets out from behind the "Daily Show" desk and takes the stage for a stand-up special that touches on racism, immigration, camping and more.',
 'popularity': 3.24,
 'poster_path': '/dmhDeV3RYq4jMwgIMmo0W05uH8L.jpg',
 'production_companies': [],
 'production_countries': [{'iso_3166_1': 'US',
   'name': 'United States of America'}],
 'release_date': '2018-11-20',
 'revenue': 0,
 'runtime': 63,
 'spoken_languages': [{'iso_639_1': 'en', 'name': 'English'},
  {'iso_639_1': 'xh', 'name': ''}],
 'status': 'Released',
 'tagline': '',
 'title': 'Trevor Noah: Son of Patricia',
 'video': False,
 'vote_average

In [15]:
normal_movie_fields = ['budget', 'genres', 'homepage', 'imdb_id', 'overview', 'popularity',
                       'release_date', 'revenue', 'runtime', 'vote_average', 'vote_count']

In [16]:
set(normal_movie_fields).difference(set(full_movie_results.info().keys()))

set()

In [17]:
dir(full_movie_results)

['BASE_PATH',
 'URLS',
 '_DELETE',
 '_GET',
 '_POST',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_get_complete_url',
 '_get_credit_id_path',
 '_get_guest_session_id_path',
 '_get_id_path',
 '_get_params',
 '_get_path',
 '_get_series_id_season_number_episode_number_path',
 '_get_series_id_season_number_path',
 '_request',
 '_set_attrs_to_values',
 'account_states',
 'adult',
 'alternative_titles',
 'backdrop_path',
 'base_uri',
 'belongs_to_collection',
 'budget',
 'changes',
 'credits',
 'external_ids',
 'genres',
 'headers',
 'homepage',
 'id',
 'images',
 'imdb_id',
 'info',
 'keywords',
 'latest',
 'lists',
 'now_playing',
 'original_language',
 'original_titl

In [18]:
def get_movie_API_results(movie_title):
    """Search TMDB for movie_title and return selected fields of the first hit as a dict."""
    # Fields to select from the response
    normal_movie_fields = ['budget', 'homepage', 'imdb_id', 'overview', 'popularity',
                           'release_date', 'revenue', 'runtime', 'vote_average', 'vote_count']

    # Find the movie in TMDB
    search_results = search.movie(query=movie_title)
    n_results = len(search_results['results'])
    if n_results == 0:
        # No match: return NaN for every field so the row can still be tabulated
        movie_results = {key: np.nan for key in normal_movie_fields}
        movie_results['Number of Search Results'] = n_results
        movie_results['title_query'] = movie_title
        return movie_results

    # Take the first search hit and fetch its full record
    temp_id = search_results['results'][0]['id']
    full_movie_results = tmdb.Movies(temp_id)
    # Call info() outside the assert so the request still runs under `python -O`
    info = full_movie_results.info()

    missing_fields = set(normal_movie_fields).difference(info.keys())
    assert not missing_fields, 'Movie result schema is missing a field'
    movie_results = {attr: info[attr] for attr in normal_movie_fields}
    # TODO Fix genre parsing

    # Append number of search results (in case there are multiple and we choose the wrong one)
    movie_results['Number of Search Results'] = n_results
    movie_results['title_query'] = movie_title

    # Throttle successive API calls
    time.sleep(0.6)
    return movie_results
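
For the genre TODO in the function above, one option is to flatten the list of dicts that `info()` returns under `genres` (the shape seen in the Trevor Noah response) into a single delimited string. `flatten_genres` is a hypothetical helper, not something the notebook or `tmdbsimple` provides.

```python
def flatten_genres(genres, sep=', '):
    """Join the 'name' of each genre dict; return '' for an empty or missing list."""
    if not genres:
        return ''
    return sep.join(g['name'] for g in genres)


# Same shape as the 'genres' value in the response shown earlier.
sample = [{'id': 35, 'name': 'Comedy'}, {'id': 10770, 'name': 'TV Movie'}]
genre_string = flatten_genres(sample)
```

A delimited string fits in one DataFrame column. If per-genre filtering were needed later, keeping a list of names instead would probably be easier to work with.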

In [19]:
r1 = get_movie_API_results(row1['Title'])
r1 

{'budget': 0,
 'homepage': 'https://www.netflix.com/title/80239932',
 'imdb_id': 'tt9170648',
 'overview': 'Trevor Noah gets out from behind the "Daily Show" desk and takes the stage for a stand-up special that touches on racism, immigration, camping and more.',
 'popularity': 3.24,
 'release_date': '2018-11-20',
 'revenue': 0,
 'runtime': 63,
 'vote_average': 7.2,
 'vote_count': 36,
 'Number of Search Results': 1,
 'title_query': 'Trevor Noah: Son of Patricia'}

In [20]:
r2 = get_movie_API_results(row2['Title'])
r2

{'budget': 25000000,
 'homepage': 'http://thefounderfilm.com/',
 'imdb_id': 'tt4276820',
 'overview': 'The true story of how Ray Kroc, a salesman from Illinois, met Mac and Dick McDonald, who were running a burger operation in 1950s Southern California. Kroc was impressed by the brothers’ speedy system of making the food and saw franchise potential. He maneuvered himself into a position to be able to pull the company from the brothers and create a billion-dollar empire.',
 'popularity': 12.725,
 'release_date': '2016-09-13',
 'revenue': 23964782,
 'runtime': 115,
 'vote_average': 7.0,
 'vote_count': 2017,
 'Number of Search Results': 7,
 'title_query': 'The Founder'}
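
The query for The Founder returned 7 search results, and the function takes the first one without checking it. A possible refinement, sketched below, is to prefer results whose title matches the query exactly and to break ties by popularity. `pick_best_match` is a hypothetical helper, and it assumes each search result dict carries `title` and `popularity` keys. The `toy_results` values are made up for illustration and are not real API output.

```python
def pick_best_match(results, title_query):
    """Prefer exact (case-insensitive) title matches; break ties by popularity."""
    if not results:
        return None
    wanted = title_query.strip().lower()
    exact = [r for r in results
             if (r.get('title') or '').strip().lower() == wanted]
    if exact:
        return max(exact, key=lambda r: r.get('popularity', 0))
    # No exact match: fall back to the first result, as the notebook does now.
    return results[0]


# Made-up search results, for illustration only.
toy_results = [
    {'id': 1, 'title': 'The Founders', 'popularity': 20.0},
    {'id': 2, 'title': 'The Founder', 'popularity': 1.5},
    {'id': 3, 'title': 'the founder', 'popularity': 12.0},
]
best = pick_best_match(toy_results, 'The Founder')
```

Here the entry with id 3 is selected: it matches the title exactly (ignoring case) and is more popular than the other exact match.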

In [21]:
demo_df = movies['Title'].iloc[:20].apply(get_movie_API_results)

In [22]:
demo_df.head()

3     {'budget': 0, 'homepage': 'https://www.netflix...
19    {'budget': 38000000, 'homepage': 'http://www.f...
66    {'budget': 0, 'homepage': 'http://www.netflix....
67    {'budget': 0, 'homepage': None, 'imdb_id': 'tt...
81    {'budget': 100000, 'homepage': 'https://invisi...
Name: Title, dtype: object

In [23]:
demo_df.to_list()

[{'budget': 0,
  'homepage': 'https://www.netflix.com/title/80239932',
  'imdb_id': 'tt9170648',
  'overview': 'Trevor Noah gets out from behind the "Daily Show" desk and takes the stage for a stand-up special that touches on racism, immigration, camping and more.',
  'popularity': 3.24,
  'release_date': '2018-11-20',
  'revenue': 0,
  'runtime': 63,
  'vote_average': 7.2,
  'vote_count': 36,
  'Number of Search Results': 1,
  'title_query': 'Trevor Noah: Son of Patricia'},
 {'budget': 38000000,
  'homepage': 'http://www.foxmovies.com/movies/captain-underpants-the-first-epic-movie',
  'imdb_id': 'tt2091256',
  'overview': 'Two mischievous kids hypnotize their mean elementary school principal and turn him into their comic book creation, the kind-hearted and elastic-banded Captain Underpants.',
  'popularity': 9.996,
  'release_date': '2017-06-01',
  'revenue': 125289450,
  'runtime': 89,
  'vote_average': 6.0,
  'vote_count': 565,
  'Number of Search Results': 1,
  'title_query': 'Capt

In [24]:
pd.DataFrame(demo_df.to_list())

Unnamed: 0,Number of Search Results,budget,homepage,imdb_id,overview,popularity,release_date,revenue,runtime,title_query,vote_average,vote_count
0,1,0.0,https://www.netflix.com/title/80239932,tt9170648,"Trevor Noah gets out from behind the ""Daily Sh...",3.24,2018-11-20,0.0,63.0,Trevor Noah: Son of Patricia,7.2,36.0
1,1,38000000.0,http://www.foxmovies.com/movies/captain-underp...,tt2091256,Two mischievous kids hypnotize their mean elem...,9.996,2017-06-01,125289450.0,89.0,Captain Underpants: The First Epic Movie,6.0,565.0
2,1,0.0,http://www.netflix.com/savingcapitalism,tt6185286,Former Secretary of Labor Robert Reich meets w...,1.736,2017-11-21,0.0,73.0,Saving Capitalism,7.1,17.0
3,1,0.0,,tt3762912,Controversial hedge fund titan Bill Ackman is ...,2.313,2017-03-17,0.0,99.0,Betting on Zero,7.4,45.0
4,1,100000.0,https://invisiblemoneydocumentary.wordpress.com/,tt5033790,Not since the invention of the Internet has th...,3.305,2016-12-30,0.0,90.0,Banking on Bitcoin,6.5,54.0
5,1,0.0,,tt6714534,From his days of testifying at the Watergate h...,2.771,2017-04-23,0.0,100.0,Get Me Roger Stone,7.1,56.0
6,7,25000000.0,http://thefounderfilm.com/,tt4276820,"The true story of how Ray Kroc, a salesman fro...",12.725,2016-09-13,23964782.0,115.0,The Founder,7.0,2017.0
7,1,0.0,,tt2545118,Notorious killer whale Tilikum is responsible ...,6.614,2013-06-07,2063312.0,83.0,Blackfish,8.0,660.0
8,1,0.0,https://www.tf1.fr/tf1/elections/videos/emmanu...,tt6866918,Deputy General Secretary at the Elysée to cand...,1.43,2017-05-08,0.0,90.0,Emmanuel Macron: Behind the Rise,6.7,21.0
9,1,0.0,https://www.netflix.com/title/80134781,tt6900644,"Comic Hasan Minhaj of ""The Daily Show"" shares ...",3.019,2017-05-23,0.0,72.0,Hasan Minhaj: Homecoming King,7.9,58.0


In [25]:
with open('../Data/all_movies_results_df.pkl', 'rb') as hnd:
    all_movies_results_df = pickle.load(hnd)

In [26]:
all_movies_results_df.head()

Unnamed: 0,Number of Search Results,budget,homepage,imdb_id,overview,popularity,release_date,revenue,runtime,title_query,vote_average,vote_count
0,1,0.0,https://www.netflix.com/title/80239932,tt9170648,"Trevor Noah gets out from behind the ""Daily Sh...",3.145,2018-11-20,0.0,63.0,Trevor Noah: Son of Patricia,7.1,35.0
1,1,38000000.0,http://www.foxmovies.com/movies/captain-underp...,tt2091256,Two mischievous kids hypnotize their mean elem...,11.473,2017-06-01,125289450.0,89.0,Captain Underpants: The First Epic Movie,6.0,565.0
2,1,0.0,http://www.netflix.com/savingcapitalism,tt6185286,Former Secretary of Labor Robert Reich meets w...,1.285,2017-11-21,0.0,73.0,Saving Capitalism,7.1,17.0
3,1,0.0,,tt3762912,Controversial hedge fund titan Bill Ackman is ...,2.099,2017-03-17,0.0,99.0,Betting on Zero,7.4,45.0
4,1,100000.0,https://invisiblemoneydocumentary.wordpress.com/,tt5033790,Not since the invention of the Internet has th...,2.766,2016-12-30,0.0,90.0,Banking on Bitcoin,6.5,54.0


In [27]:
all_movies_results_df.describe()

Unnamed: 0,Number of Search Results,budget,popularity,revenue,runtime,vote_average,vote_count
count,48.0,45.0,45.0,45.0,45.0,45.0,45.0
mean,3.729167,33602220.0,12.186178,154248200.0,99.977778,6.802222,2826.733333
std,5.804399,57572900.0,12.541833,314480600.0,30.735299,1.348482,4509.87172
min,0.0,0.0,0.6,0.0,25.0,0.0,0.0
25%,1.0,0.0,3.356,0.0,87.0,6.3,56.0
50%,1.0,8000000.0,10.409,3142154.0,99.0,7.0,565.0
75%,3.0,39000000.0,17.455,161025600.0,120.0,7.6,4015.0
max,20.0,220000000.0,74.008,1519558000.0,165.0,8.4,18827.0
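
The 25% quantiles above show that at least a quarter of the `budget` and `revenue` values are 0. For titles such as stand-up specials, a 0 more plausibly means "not reported" than a true zero, although that reading is an assumption rather than something the notebook confirms. Under that assumption, the zeros could be converted to NaN before summarising. The `toy` frame below is a small illustrative stand-in for `all_movies_results_df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for all_movies_results_df (illustrative values only).
toy = pd.DataFrame({
    'title_query': ['A', 'B', 'C'],
    'budget': [0.0, 38000000.0, 0.0],
    'revenue': [0.0, 125289450.0, 2063312.0],
})

money_cols = ['budget', 'revenue']
cleaned = toy.copy()
# Treat 0 as "unknown" in the money columns only; other columns are untouched.
cleaned[money_cols] = cleaned[money_cols].replace(0, np.nan)
```

With the zeros masked, `cleaned[money_cols].describe()` would summarise only the titles that report a figure.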


In [28]:
all_movies_results_df.isna().sum()

Number of Search Results     0
budget                       3
homepage                    27
imdb_id                      4
overview                     3
popularity                   3
release_date                 3
revenue                      3
runtime                      3
title_query                  0
vote_average                 3
vote_count                   3
dtype: int64

In [29]:
# Missing Rate
all_movies_results_df.isna().mean()

Number of Search Results    0.000000
budget                      0.062500
homepage                    0.562500
imdb_id                     0.083333
overview                    0.062500
popularity                  0.062500
release_date                0.062500
revenue                     0.062500
runtime                     0.062500
title_query                 0.000000
vote_average                0.062500
vote_count                  0.062500
dtype: float64

In [30]:
missing_movies = all_movies_results_df[all_movies_results_df['budget'].isna()]

In [31]:
missing_movies['title_query']

12    BoJack Horseman Christmas Special: Sabrina's C...
31                       Samurai Champloo: Unholy Union
45                     House of Cards: Season 1 (Recap)
Name: title_query, dtype: object
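
The missing counts above show 3 rows with no data at all but 4 rows with no `imdb_id`, so one title was found yet lacks an IMDb id. The sketch below separates the two cases using the `Number of Search Results` column. The `toy` frame and its values are invented for illustration and stand in for `all_movies_results_df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for all_movies_results_df (illustrative values only).
toy = pd.DataFrame({
    'title_query': ['Found', 'Found, no IMDb id', 'Not found'],
    'Number of Search Results': [1, 2, 0],
    'budget': [100000.0, 0.0, np.nan],
    'imdb_id': ['tt0000001', None, np.nan],
})

no_hit = toy['Number of Search Results'] == 0
# Titles the search could not find at all.
not_found = toy[no_hit]
# Titles that were found but have no IMDb id in the response.
partial = toy[~no_hit & toy['imdb_id'].isna()]
```

Filtering on the search-result count rather than on `budget` keeps the "not found" case distinct from records that exist but are incomplete.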

In [32]:
os.path.isfile('../Data/all_movies_results_df.pkl')

True

In [33]:
movie_df_raw = eda1.generate_movie_df(netflix_df=netflix_df)

Existing pickle exists
Number of movies:  48
Number of missing movies:  3
12    BoJack Horseman Christmas Special: Sabrina's C...
31                       Samurai Champloo: Unholy Union
45                     House of Cards: Season 1 (Recap)
Name: title_query, dtype: object


