# Movies Exploratory Analysis
The purpose of this notebook is to explore the data ingested from the various TMDB API endpoints. We will examine the characteristics of their outputs (e.g. data types and number of missing values) and plan the initial schema for the database.

In [1]:
import os
from pathlib import Path
import json

import pandas as pd
from dotenv import load_dotenv
import requests

# set the working directory to project root
load_dotenv()
PROJECT_ROOT = os.getenv("PROJECT_ROOT")
os.chdir(PROJECT_ROOT)

# set up the session
session = requests.Session()
api_key = os.getenv("TMDB_API_KEY")

## The Discover Movies Endpoint (/discover/movie)
Here we will explore the discover movies endpoint, which is where I got the key details for movies released from 2000 to 2024. Originally, this notebook was going to cover only this endpoint, but I realized it was missing many of the details I wanted, like cast, budget, and box office.

In [2]:
# set data_path
MOVIE_DATA_PATH = "data/movies.csv"

# read data into pandas. it seems the overview column has some non-ASCII characters,
# which threw errors when trying to use the default C engine
df_movies = pd.read_csv(MOVIE_DATA_PATH, engine='python')

In [3]:
# print summary statistics
df_movies.info()
df_movies.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 57574 entries, 0 to 57573
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   adult              57574 non-null  object 
 1   backdrop_path      53453 non-null  object 
 2   genre_ids          57573 non-null  object 
 3   id                 57573 non-null  object 
 4   original_language  57573 non-null  object 
 5   original_title     57572 non-null  object 
 6   overview           57008 non-null  object 
 7   popularity         57567 non-null  float64
 8   poster_path        57377 non-null  object 
 9   release_date       57561 non-null  object 
 10  title              57560 non-null  object 
 11  video              57561 non-null  object 
 12  vote_average       57561 non-null  float64
 13  vote_count         57561 non-null  float64
dtypes: float64(3), object(11)
memory usage: 6.1+ MB


       popularity  vote_average  vote_count
count  57567.0     57561.0       57561.0
mean   1.140957    6.041737      319.652699
std    3.48019     1.073187      1391.86866
min    0.0         1.2           10.0
25%    0.318       5.4           16.0
50%    0.5651      6.13          32.0
75%    1.07975     6.8           104.0
max    361.0938    10.0          37610.0


In [4]:
# get a sample of the top 5 rows
print(df_movies.head())

   adult                     backdrop_path                genre_ids     id  \
0  False  /7isarjYDEKZ5t1CgcvbuqEUby8P.jpg                     [27]   9532   
1  False  /Ar7QuJ7sJEiC0oP3I8fKBKIQD9u.jpg             [28, 18, 12]     98   
2  False  /mZj8EUr6F1x2PWZjKPxaeYd5WRw.jpg  [12, 16, 35, 10751, 14]  11688   
3  False  /24DZfupDlhXeTchmcOkoGRhP5Vg.jpg             [12, 28, 53]    955   
4  False   /zvmsyAMr3cVDdIu7UvDLSmRXlF.jpg          [35, 18, 10749]  22705   

  original_language            original_title  \
0                en         Final Destination   
1                en                 Gladiator   
2                en  The Emperor's New Groove   
3                en    Mission: Impossible II   
4                it             Tra(sgre)dire   

                                            overview  popularity  \
0  After a teenager has a terrifying vision of hi...     25.9950   
1  After the death of Emperor Marcus Aurelius, hi...     17.7690   
2  When self-centered Emperor Ku

In [5]:
# it seems there are few missing values; let's see which columns have the
# highest share of missing values
print("Percent missing from each column:")
df_movies.isnull().sum()/len(df_movies)

Percent missing from each column:


adult                0.000000
backdrop_path        0.071577
genre_ids            0.000017
id                   0.000017
original_language    0.000017
original_title       0.000035
overview             0.009831
popularity           0.000122
poster_path          0.003422
release_date         0.000226
title                0.000243
video                0.000226
vote_average         0.000226
vote_count           0.000226
dtype: float64
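
Note that the values above are fractions rather than percentages (for example, 0.071577 for backdrop_path is about 7.16%). A minimal sketch of a true percentage report follows. The `sample` DataFrame is a small hypothetical stand-in for df_movies, not real data.

```python
import pandas as pd

# Hypothetical stand-in for df_movies (made-up values)
sample = pd.DataFrame({
    "id": [1, 2, None, 4],
    "backdrop_path": ["/a.jpg", None, None, "/d.jpg"],
    "title": ["A", "B", "C", "D"],
})

# isna().mean() gives the fraction of missing values per column;
# multiply by 100 for a percentage and sort so the worst columns come first
pct_missing = sample.isna().mean().mul(100).sort_values(ascending=False)
print(pct_missing.round(2))
```

The same expression should work on df_movies directly, since it only relies on `isna()`.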

In [6]:
# most of the columns with missing values are optional fields, except for id
# let's check out the rows that are missing an id
df_missing_ids = df_movies[df_movies["id"].isna()]
print(df_missing_ids)

         adult backdrop_path genre_ids    id original_language original_title  \
11645   Extra:          None      None  None              None           None   

      overview  popularity poster_path release_date title video  vote_average  \
11645     None         NaN        None         None  None   NaN           NaN   

       vote_count  
11645         NaN  


In [7]:
# it's a strange-looking row that's missing almost all information; it's probably safe to drop

### Endpoint Summary
- We can drop the (one) entry that's missing an id
- For the initial phase of this project, we can probably ignore the backdrop_path and poster_path
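- pandas reads every attribute except popularity, vote_average, and vote_count as object (string) dtype; some of these, such as id, adult, video, release_date, and genre_ids, will likely need to be cast before loading
- popularity, vote_average, and vote_count are numeric (float64)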
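
The cleanup steps listed above could be sketched roughly as follows. This is only an illustration: `clean_movies` is a hypothetical helper, the `raw` DataFrame contains made-up rows that imitate the CSV, and the chosen target types (int64, bool, list, datetime) are assumptions rather than a final schema.

```python
import ast

import pandas as pd

# Hypothetical rows imitating the CSV: values arrive as strings,
# and one malformed row has no id (made-up data)
raw = pd.DataFrame({
    "adult": ["False", "False", "Extra:"],
    "id": ["101", "102", None],
    "genre_ids": ["[27]", "[28, 18, 12]", None],
    "release_date": ["2001-01-01", "2002-02-02", None],
    "video": ["False", "True", None],
    "backdrop_path": ["/a.jpg", "/b.jpg", None],
    "poster_path": ["/c.jpg", "/d.jpg", None],
})


def clean_movies(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows without an id, drop image paths, and cast column types."""
    out = df.dropna(subset=["id"]).copy()
    # Image paths are ignored in the initial phase
    out = out.drop(columns=["backdrop_path", "poster_path"], errors="ignore")
    out["id"] = out["id"].astype("int64")
    # Compare against the literal "True" so both strings and bools are handled
    out["adult"] = out["adult"].astype(str).eq("True")
    out["video"] = out["video"].astype(str).eq("True")
    # genre_ids is stored as text like "[28, 18, 12]"; parse it into a list
    out["genre_ids"] = out["genre_ids"].apply(
        lambda s: ast.literal_eval(s) if isinstance(s, str) else []
    )
    # Unparseable dates become NaT instead of raising
    out["release_date"] = pd.to_datetime(out["release_date"], errors="coerce")
    return out


cleaned = clean_movies(raw)
print(cleaned.dtypes)
```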

This endpoint gives us a good start, but it's missing some of the details I wanted to explore, like cast and box office performance. Thankfully, now that we have the movie IDs, it seems we can get this information from the movie details endpoint (/movie/{movie_id}).

## The Movie Details Endpoint (/movie/{movie_id})
Here we will explore the movie details endpoint, which contains information such as a movie's budget, revenue and cast.

In [8]:
# for this endpoint, we only need the movie id, which goes in the URL path.
# let's start with one of my favorite movies, "Mad Max: Fury Road"
df_movies[df_movies["original_title"].str.contains("fury road", case=False, na=False)]

Unnamed: 0,adult,backdrop_path,genre_ids,id,original_language,original_title,overview,popularity,poster_path,release_date,title,video,vote_average,vote_count
28567,False,/gqrnQA6Xppdl8vIb2eJc58VC1tW.jpg,"[28, 12, 878]",76341,en,Mad Max: Fury Road,An apocalyptic story set in the furthest reach...,15.1418,/hA2ple9q4qnwxp3hKVNhroipsir.jpg,2015-05-13,Mad Max: Fury Road,False,7.623,23177.0
36841,False,,[99],704725,en,Going Mad: The Battle of Fury Road,For 20 years director George Miller fought to ...,0.5158,/mr57dIi6Xi1LQvlyXd4vIsvMVAm.jpg,2017-08-01,Going Mad: The Battle of Fury Road,False,6.9,13.0


In [12]:
# let's set up the params
params = {
    "api_key": api_key,
}

# get the details
response = session.get("https://api.themoviedb.org/3/movie/76341", params=params)
response.raise_for_status()
data = response.json()

# print vertically for easier reading
for key, value in data.items():
    print(f"{key}: {value}")

adult: False
backdrop_path: /gqrnQA6Xppdl8vIb2eJc58VC1tW.jpg
belongs_to_collection: {'id': 8945, 'name': 'Mad Max Collection', 'poster_path': '/tRxkZboyyXnFgCthoViWBwISZ0r.jpg', 'backdrop_path': '/fhv3dWOuzeW9eXOSlr8MCHwo24t.jpg'}
budget: 150000000
genres: [{'id': 28, 'name': 'Action'}, {'id': 12, 'name': 'Adventure'}, {'id': 878, 'name': 'Science Fiction'}]
homepage: https://www.warnerbros.com/movies/mad-max-fury-road
id: 76341
imdb_id: tt1392190
origin_country: ['AU', 'US']
original_language: en
original_title: Mad Max: Fury Road
overview: An apocalyptic story set in the furthest reaches of our planet, in a stark desert landscape where humanity is broken, and most everyone is crazed fighting for the necessities of life. Within this world exist two rebels on the run who just might be able to restore order.
popularity: 13.9005
poster_path: /hA2ple9q4qnwxp3hKVNhroipsir.jpg
production_companies: [{'id': 174, 'logo_path': '/zhD3hhtKB5qyv7ZeL4uLpNxgMVU.png', 'name': 'Warner Bros. Pictures'

### Movie Details Summary
Seems like this endpoint has many of the fields I want for my analysis, such as budget and revenue. However, because we need to request the details one movie at a time, the data gathering script will have to cache responses as it goes so that it can resume if the connection is lost or we hit a rate limit. For the initial project, we also don't need to load all the fields into the database.
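
A minimal sketch of the caching idea described above, assuming one JSON file per movie on disk. The names `fetch_with_cache`, `fake_fetch`, `cache_dir`, and `delay` are hypothetical, and the fetcher is passed in as a function so the example runs without network access.

```python
import json
import tempfile
import time
from pathlib import Path
from typing import Callable, Iterable


def fetch_with_cache(
    movie_ids: Iterable[int],
    fetch: Callable[[int], dict],
    cache_dir: Path,
    delay: float = 0.0,
) -> dict:
    """Fetch details for each id, skipping ids already cached on disk."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    results = {}
    for movie_id in movie_ids:
        cache_file = cache_dir / f"{movie_id}.json"
        if cache_file.exists():
            # Already fetched in an earlier run: reuse it, no request made
            results[movie_id] = json.loads(cache_file.read_text(encoding="utf-8"))
            continue
        data = fetch(movie_id)
        # Write immediately so an interrupted run loses at most one movie
        cache_file.write_text(json.dumps(data), encoding="utf-8")
        results[movie_id] = data
        time.sleep(delay)  # crude pause between requests (assumed, not a TMDB rule)
    return results


# Demo with a fake fetcher that records which ids were actually requested
calls = []


def fake_fetch(movie_id: int) -> dict:
    calls.append(movie_id)
    return {"id": movie_id, "budget": 0}


with tempfile.TemporaryDirectory() as tmp:
    first = fetch_with_cache([1, 2], fake_fetch, Path(tmp))
    second = fetch_with_cache([1, 2, 3], fake_fetch, Path(tmp))

print(calls)
```

In the notebook, `fetch` could wrap the same `session.get(...)`, `raise_for_status()`, and `.json()` calls used in the cell above, so a failed request raises before anything is written to the cache.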