# Movie Analysis Project

## Scope:
To determine which movies perform best at the box office, I focused on three main questions:
1. Which movie runtimes and genres / genre combinations tend to produce the most revenue and the highest average ratings?
2. How has domestic movie spend trended over time, and when is the best time of year to release a movie?
3. Which writers should be targeted for hire to maximize the chances of a high-grossing movie?



## Process:
1. (Runtime & Genre Analysis) - I followed the API instructions provided by themoviedb.org to register, generate an API key, and obtain an access token, which I used to authenticate API requests.  The API returns at most 20 results per page and 500 pages per request, so I had to split my request into smaller requests that the API could return in their entirety.  I chose to split each year into quarters: for each year from 2000 through 2020, I made 4 API requests to pull in all movie data and appended the results to a dataframe.  After these requests, additional movie details were still needed for runtime and genres, so for each movie ID in the dataframe I made another API call to pull in the details for that ID.  With all the detail pulled in, I cleaned the data, handled missing values, replaced placeholders where necessary, and evaluated the relationship between runtime and revenue.  After runtime, I moved on to genres, looking at which genres gross the most on a total and median basis.  I repeated the same analysis for total and median average vote to compare rating with revenue.

2. (Domestic Spend Trends) - I used BeautifulSoup to scrape domestic box office spend data from boxofficemojo.com.  Once the data was loaded, I created scatter plots of annual and monthly revenue over time to show trends.  I then grouped the data by month to show the distribution of gross domestic spend in each month across all years.  Because the variance differs considerably from month to month, I also created a boxplot of gross domestic spend by month.

3. (Writers) - I loaded the provided datasets from IMDB.  A movie can list multiple writers; in those cases, I created one row per writer.  After expanding writers onto separate rows, I joined writer names from a provided IMDB dataset and revenue from the themoviedb.org API calls made for question 1.  With all data loaded, I grouped by writer name to calculate median aggregate statistics, sorted by revenue in descending order, and produced bar charts.
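
The row-per-writer step in item 3 can be sketched with pandas as shown here. This is a minimal illustration rather than the notebook's actual code: the column names (`movie_id`, `writers`, `revenue`) and the sample values are made up, and it assumes the writers for a movie arrive as a single comma-separated string.

```python
import pandas as pd

# Hypothetical sample data: column names and values are illustrative only.
movies = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "writers": ["w1,w2", "w1", "w2,w3"],
    "revenue": [100, 300, 500],
})

# Split the comma-separated string into a list, then give each writer its own row.
writer_rows = movies.assign(writers=movies["writers"].str.split(",")).explode("writers")

# Median revenue per writer, sorted in descending order.
median_revenue = (
    writer_rows.groupby("writers")["revenue"]
    .median()
    .sort_values(ascending=False)
)
print(median_revenue)
```

A movie with two writers contributes its revenue to both writers' medians, which is worth keeping in mind when reading the resulting bar charts.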

## Question 1: Which movie runtimes and genres / genre combinations tend to produce the most revenue and the highest average ratings?

### Movie Runtime Analysis
Outline:
1. Import necessary libraries
2. Create function to build progress bar to keep track of running calculations
3. Pull in API / Access Keys and request from TMDB API a list of movie IDs from 2000 through 2020
4. Create function to check how many pages of requests are returned by the API
5. Create function to loop through all returned pages to extract movie IDs and other information
6. Create a dataframe using returned movie IDs
7. Clean data (handle missing data, duplicates, data that does not make sense in the context of this analysis)
8. Request movie details via API for every movie ID we have from our first set of API requests
9. Split dataset into different buckets based on movie runtime and analyze differences in revenue / ratings
10. Create Visualizations

In [1]:
##install seaborn
import sys
!{sys.executable} -m pip install seaborn






In [5]:
#import necessary libraries
import requests
import pandas as pd
import numpy as np
import seaborn as sns
import json
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')

Because the API calls take a long time to run, I wanted a way to show the status of the calculations.  I followed the step-by-step instructions laid out by Bartosz Mikulski here: https://www.mikulskibartosz.name/how-to-display-a-progress-bar-in-jupyter-notebook/

In [6]:
#start by importing libraries
import time, sys
from IPython.display import clear_output

def update_progress(progress):
    """
    Function to build progress bar.
    Takes progress as a fraction between 0 and 1, clears the cell output,
    and prints the percent completed along with a visual of progress so far.
    Returns nothing.  For use in loops.
    """
    bar_length = 20
    if isinstance(progress, int):
        progress = float(progress)
    if not isinstance(progress, float):
        progress = 0
    if progress < 0:
        progress = 0
    if progress >= 1:
        progress = 1
        
    block = int(round(bar_length * progress))
    clear_output(wait = True)
    text = "Progress: [{0}] {1:.1f}%".format( "#" * block + "-" * (bar_length - block), progress * 100)
    print(text) 

In [7]:
def get_keys(path):
    """
    Takes a filepath to a JSON file as input and returns the parsed contents as a dictionary
    """
    with open(path) as f:
        return json.load(f)

API instructions found here: 'https://developers.themoviedb.org/3/getting-started/introduction'
 * Register for API key
 * Enter API key on website to generate access token
 * Create necessary headers using access token

In [16]:
#Use get_keys() function to pull in specific API keys and access token 
keys = get_keys('/Users/Elvis/Documents/tmdbapi.json')
api_key = keys["api_key"]
access_token = keys['access_token']

In [17]:
#Use generated API Key and Access Token in headers to authorize access and API requests
headers = {'Authorization': 'Bearer {}'.format(access_token)
          ,'Content-Type': 'application/json;charset=utf-8'}

In [18]:
def get_num_pages(url, headers, start_date, end_date):
    """
    Takes as input an API url, headers containing authentication information, a start date, and end date.
    Returns the number of pages of results returned by the API call as an int. 
    """
    params = {'release_date.gte': start_date,
              'release_date.lte': end_date}
    returned_movies = requests.get(url=url, headers=headers, params=params).json()
    return returned_movies['total_pages']

Data returned by an API call is limited to 20 results per page and a maximum of 500 pages (10,000 results).  Because of this cap, it is not possible to pull movie data from 2000 through 2020 in one request.  Instead, each year is split into quarters; the data for each quarter is requested via the API and appended to a consolidated dataframe.
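
One way to generate the quarterly date ranges is sketched here. The helper name `quarter_ranges` is hypothetical, and the sketch assumes the API's release date filters accept dates as `YYYY-MM-DD` strings.

```python
from datetime import date, timedelta

def quarter_ranges(start_year, end_year):
    """Yield (start_date, end_date) ISO strings for every quarter, both years inclusive."""
    for year in range(start_year, end_year + 1):
        for first_month in (1, 4, 7, 10):
            start = date(year, first_month, 1)
            # The quarter ends the day before the next quarter begins.
            if first_month == 10:
                next_start = date(year + 1, 1, 1)
            else:
                next_start = date(year, first_month + 3, 1)
            end = next_start - timedelta(days=1)
            yield start.isoformat(), end.isoformat()

ranges = list(quarter_ranges(2000, 2020))
print(len(ranges))   # 21 years * 4 quarters = 84
print(ranges[0])     # ('2000-01-01', '2000-03-31')
```

Each pair could then be passed as `start_date` and `end_date` to `get_movies_data`, with the returned dataframes appended to the consolidated one.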

In [19]:
def get_movies_data(start_date, end_date, url, headers):
    """
    Takes a start date, end date, API url, and headers with authentication information.
    Uses get_num_pages function to check the number of pages returned by the API.
    Loops through all pages, requesting data from API, concatenating results to a dataframe.
    Returns dataframe of movie information between start and end date.
    """
    df = pd.DataFrame()
    num_pages = get_num_pages(url, headers, start_date, end_date)
    for i in range(1, num_pages+1):
        parameters = {'release_date.gte': start_date,
                      'release_date.lte': end_date,
                      'page': i}
        request = requests.get(url, headers=headers, params=parameters).json()
        df = pd.concat([df, pd.DataFrame(request['results'])], sort=False)
        
    return df
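
As an alternative to concatenating inside the loop, which copies the accumulated dataframe on every iteration, the pages can be collected in a list and concatenated once at the end. The sketch below is illustrative only: `collect_pages` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the real API call so the example runs offline.

```python
import pandas as pd

def collect_pages(fetch_page, num_pages):
    """
    Call fetch_page(page_number) for pages 1..num_pages and combine the results.
    fetch_page is a hypothetical callable returning a dict with a 'results' list.
    """
    frames = []
    for page in range(1, num_pages + 1):
        response = fetch_page(page)
        frames.append(pd.DataFrame(response["results"]))
    if not frames:
        return pd.DataFrame()
    # One concat at the end; ignore_index gives a clean 0..n-1 index.
    return pd.concat(frames, ignore_index=True, sort=False)

# Stand-in for the real API request (made-up data, two results per page).
def fake_fetch(page):
    return {"results": [{"id": page * 10 + k, "title": f"movie {page}-{k}"} for k in range(2)]}

df = collect_pages(fake_fetch, 3)
print(df.shape)
```

With `ignore_index=True` the combined dataframe has unique row labels, whereas the loop-based version repeats the labels 0 through 19 for every page.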