# Retrieve box office revenue and profit data

Shenyue Jia

**Task lists**

- [ ] Obtain the following data from TMDB
    - Revenue
    - Budget
    - Certification (G, PG, etc.)
- [ ] Perform Exploratory Data Analysis

In [8]:
# data wrangling
import numpy as np
import pandas as pd
import os, json, math, time
import tmdbsimple as tmdb
from tqdm.notebook import tqdm_notebook

# data visualization
import matplotlib.pyplot as plt
import seaborn as sns

## Get movie info of a pre-defined subset

- We will use the movies retrieved previously from [IMDB](https://www.imdb.com)
- The data are available in the repository of this project ([link](https://github.com/jiashenyue/project3-imdb-data/blob/main/Data/title_ratings.csv.gz))

In [3]:
# read the rating file to obtain the tconst of each movie for analysis
df_rating = pd.read_csv('Data/title_ratings.csv.gz')
df_rating.head()

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,1953
1,tt0000002,5.8,263
2,tt0000003,6.5,1787
3,tt0000004,5.6,179
4,tt0000005,6.2,2589


In [5]:
# how many movies we will analyze
print(f'There are {df_rating.shape[0]} movies in the df_rating dataframe.')

There are 1282624 movies in the df_rating dataframe.


## Extract data from TMDB

- [TMDB](https://www.themoviedb.org/) is a movie database that extends the basic information available from [IMDB](https://www.imdb.com/). The additional fields we need from TMDB include
    - Box office revenue generated by a movie
    - Budget of a movie (together with revenue, this lets us estimate profit)
- In this section, we will use data from TMDB to further analyze the patterns of best-selling movies

### Set up TMDB API and relevant python packages

- We will use the `tmdbsimple` package to query the TMDB API without manually constructing the URLs for our API calls.
    - [GitHub repository of `tmdbsimple`](https://github.com/celiao/tmdbsimple)
    - [PyPi link of `tmdbsimple`](https://pypi.org/project/tmdbsimple/)

In [6]:
# Install tmdbsimple (only need to run once)
# !pip install tmdbsimple



In [9]:
with open('/Users/Shenyue/.secret/tmdb.api.json', 'r') as f:
    login = json.load(f)
## Display the keys of the loaded dict
login.keys()

dict_keys(['API Key'])

In [10]:
tmdb.API_KEY =  login['API Key']
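The key is read above from a JSON file at a fixed path. As a minimal sketch of a slightly more portable variant, the helper below first checks an environment variable and only then falls back to the file. The function name `load_tmdb_api_key` and the variable name `TMDB_API_KEY` are assumptions made for illustration; they are not part of `tmdbsimple`.

```python
import json
import os
from pathlib import Path


def load_tmdb_api_key(path, env_var="TMDB_API_KEY"):
    """Return the TMDB API key from an environment variable or a JSON file.

    `env_var` is a hypothetical variable name. The JSON file is expected to
    contain an "API Key" entry, as in the cell above.
    """
    # Prefer the environment variable when it is set and non-empty
    key = os.environ.get(env_var)
    if key:
        return key

    # Otherwise fall back to the JSON file on disk
    with open(Path(path).expanduser(), "r") as f:
        login = json.load(f)
    return login["API Key"]
```

With this helper, the assignment above could be written as `tmdb.API_KEY = load_tmdb_api_key('/Users/Shenyue/.secret/tmdb.api.json')`.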

### Make some test queries to understand the TMDB data

In [11]:
## make a movie object using the tmdb.Movies class
movie = tmdb.Movies(603)

In [12]:
## movie objects have an .info() method that returns a dictionary
info = movie.info()
info

{'adult': False,
 'backdrop_path': '/waCRuAW5ocONRehP556vPexVXA9.jpg',
 'belongs_to_collection': {'id': 2344,
  'name': 'The Matrix Collection',
  'poster_path': '/bV9qTVHTVf0gkW0j7p7M0ILD4pG.jpg',
  'backdrop_path': '/bRm2DEgUiYciDw3myHuYFInD7la.jpg'},
 'budget': 63000000,
 'genres': [{'id': 28, 'name': 'Action'},
  {'id': 878, 'name': 'Science Fiction'}],
 'homepage': 'http://www.warnerbros.com/matrix',
 'id': 603,
 'imdb_id': 'tt0133093',
 'original_language': 'en',
 'original_title': 'The Matrix',
 'overview': 'Set in the 22nd century, The Matrix tells the story of a computer hacker who joins a group of underground insurgents fighting the vast and powerful computers who now rule the earth.',
 'popularity': 72.162,
 'poster_path': '/f89U3ADr1oiB1s9GkdPOEpXUk5H.jpg',
 'production_companies': [{'id': 79,
   'logo_path': '/tpFpsqbleCzEE2p5EgvUq6ozfCA.png',
   'name': 'Village Roadshow Pictures',
   'origin_country': 'US'},
  {'id': 372,
   'logo_path': None,
   'name': 'Groucho II Film

In [13]:
# print revenue
info['revenue']

463517383

In [14]:
# print budget
info['budget']

63000000

In [15]:
# print imdb id
info['imdb_id']

'tt0133093'
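The fields printed above are `revenue` and `budget`, so profit has to be derived. A minimal sketch, assuming profit is simply revenue minus budget (which ignores marketing and distribution costs) and reusing the two values printed for this movie:

```python
# Values copied from the `info` dictionary printed above (TMDB id 603)
revenue = 463517383
budget = 63000000

# Assumption: profit is estimated as revenue minus budget
profit = revenue - budget

# Return on investment, relative to the budget
roi = profit / budget

print(f"Profit: {profit:,}")  # Profit: 400,517,383
print(f"ROI: {roi:.2f}")
```

The same subtraction can later be applied column-wise once the data are in a dataframe.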

- The `imdb_id` field in the TMDB data corresponds to the `tconst` field in the data we retrieved from IMDB
- Next, we will see how to obtain the certification info from the TMDB data

In [16]:
# example from package README
# source = https://github.com/celiao/tmdbsimple
releases = movie.releases()
for c in releases['countries']:
    if c['iso_3166_1'] == 'US':
        print(c['certification'])

R
R


- Instead of printing the certification, we will store it in the `info` dictionary under a new `certification` key

In [17]:
# Get the movie object for the current id
movie = tmdb.Movies('tt1361336')
# save the .info() and .releases() dictionaries
info = movie.info()
releases = movie.releases()
# Loop through countries in releases
for c in releases['countries']:
    # if the country abbreviation == US
    if c['iso_3166_1'] == 'US':
        ## save a "certification" key in the info dict with the certification
        info['certification'] = c['certification']

### Prepare a function to obtain movie info

- Wrap the above tests into a function to return an info dictionary of a given `movie_id`

In [21]:
def get_movie_with_rating(movie_id):
    """Adapted from source = https://github.com/celiao/tmdbsimple"""
    # Get the movie object for the current id
    movie = tmdb.Movies(movie_id)
    
    # Save the .info() and .releases() dictionaries
    info = movie.info()
    releases = movie.releases()
    
    # Loop through countries in releases
    for c in releases['countries']:
        # if the country abbreviation == US
        if c['iso_3166_1'] == 'US':
            ## Save a "certification" key in info with the certification
            info['certification'] = c['certification']
    
    return info
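The loop in `get_movie_with_rating` overwrites `info['certification']` for every US entry, so the last one wins, and the key is never created when there is no US entry. One possible alternative, sketched below, returns the first non-empty US certification or `None`. The helper name `extract_us_certification` is hypothetical, the sample dictionary is hand-written to mimic the structure used above, and the idea that some entries may carry an empty certification string is an assumption.

```python
def extract_us_certification(releases, country="US"):
    """Return the first non-empty certification for `country`, or None.

    `releases` is assumed to have the shape used above: a dict with a
    "countries" list whose items carry "iso_3166_1" and "certification".
    """
    for c in releases.get("countries", []):
        if c.get("iso_3166_1") == country and c.get("certification"):
            return c["certification"]
    return None


# Hand-written sample mimicking the structure of movie.releases()
sample = {
    "countries": [
        {"iso_3166_1": "GB", "certification": "15"},
        {"iso_3166_1": "US", "certification": ""},
        {"iso_3166_1": "US", "certification": "R"},
    ]
}
print(extract_us_certification(sample))  # R
```

Inside the function above, this would amount to `info['certification'] = extract_us_certification(releases)`, which guarantees the key exists for every movie.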

In [25]:
# test the above function to see if certification is added to info
test = get_movie_with_rating("tt0848228")
test

{'adult': False,
 'backdrop_path': '/9BBTo63ANSmhC4e6r62OJFuK2GL.jpg',
 'belongs_to_collection': {'id': 86311,
  'name': 'The Avengers Collection',
  'poster_path': '/yFSIUVTCvgYrpalUktulvk3Gi5Y.jpg',
  'backdrop_path': '/zuW6fOiusv4X9nnW3paHGfXcSll.jpg'},
 'budget': 220000000,
 'genres': [{'id': 878, 'name': 'Science Fiction'},
  {'id': 28, 'name': 'Action'},
  {'id': 12, 'name': 'Adventure'}],
 'homepage': 'https://www.marvel.com/movies/the-avengers',
 'id': 24428,
 'imdb_id': 'tt0848228',
 'original_language': 'en',
 'original_title': 'The Avengers',
 'overview': 'When an unexpected enemy emerges and threatens global safety and security, Nick Fury, director of the international peacekeeping agency known as S.H.I.E.L.D., finds himself in need of a team to pull the world back from the brink of disaster. Spanning the globe, a daring recruitment effort begins!',
 'popularity': 185.145,
 'poster_path': '/RYMX2wcKCBAr24UyPD7xwmjaTn.jpg',
 'production_companies': [{'id': 420,
   'logo_path

## Obtain required data for all movies in `df_rating`

- We will use the function above to obtain the following information for all movies in `df_rating`
    - `revenue`
    - `budget`
    - `certification`

In [27]:
imdb_ids = df_rating['tconst']
type(imdb_ids)

pandas.core.series.Series
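Since `imdb_ids` holds more than a million entries, a single pass over all of them would take a long time and would lose everything if it were interrupted. One option, sketched here as an assumption rather than as the approach this notebook takes, is to split the IDs into chunks that can be processed and saved one at a time. The helper name `split_into_chunks` and the toy ID list are made up for illustration.

```python
import math


def split_into_chunks(ids, chunk_size):
    """Yield consecutive slices of `ids`, each with at most `chunk_size` items."""
    n_chunks = math.ceil(len(ids) / chunk_size)
    for i in range(n_chunks):
        yield ids[i * chunk_size:(i + 1) * chunk_size]


# Toy list standing in for df_rating['tconst']
toy_ids = [f"tt{n:07d}" for n in range(1, 8)]
for chunk in split_into_chunks(toy_ids, chunk_size=3):
    print(len(chunk), chunk[0], chunk[-1])
```

For the real data, converting the Series first with `imdb_ids.tolist()` keeps the slicing purely positional.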

In [None]:
## loop through all ids and collect the info dictionaries
results = []
for movie_id in imdb_ids:
    try:
        movie_info = get_movie_with_rating(movie_id)
        results.append(movie_info)
    except Exception:
        # skip ids that TMDB cannot resolve or that fail to download
        continue

pd.DataFrame(results)
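Once `results` is filled, only `revenue`, `budget`, and `certification` are needed, and they can be joined back to the ratings through `imdb_id` and `tconst`. The sketch below uses small stand-in versions of `results` and `df_rating`; the second record and all rating values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for `results` and `df_rating`; values are illustrative only
results = [
    {"imdb_id": "tt0133093", "revenue": 463517383, "budget": 63000000,
     "certification": "R", "popularity": 72.162},
    {"imdb_id": "tt9999999", "revenue": 0, "budget": 0},  # hypothetical record
]
df_rating = pd.DataFrame({
    "tconst": ["tt0133093", "tt9999999"],
    "averageRating": [8.0, 5.0],  # made-up ratings
    "numVotes": [100, 10],        # made-up vote counts
})

wanted = ["imdb_id", "revenue", "budget", "certification"]
# reindex keeps the wanted columns and fills absent values with NaN
df_tmdb = pd.DataFrame(results).reindex(columns=wanted)

# Join on the shared identifier: tconst (IMDB) equals imdb_id (TMDB)
merged = df_rating.merge(df_tmdb, left_on="tconst", right_on="imdb_id", how="inner")
print(merged)
```

Using `reindex` rather than plain column selection avoids a `KeyError` when none of the retrieved records has a `certification` key.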