## TMDB Data Extraction Notebook
Author: Noah Jamal Nabila
Date: December, 2025
Purpose: Fetch movie data from TMDB API

This notebook handles:
1. API authentication
2. Data fetching for specified movie IDs
3. Response validation
4. Raw data storage


In [1]:
import pandas as pd
import requests
import json
import time
from datetime import datetime
import sys
sys.path.append('..')
from config import API_KEY, BASE_URL, MOVIE_IDS

print(f"Extraction started: {datetime.now()}")

Extraction started: 2025-12-08 16:31:23.601571
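
The `config` module imported above is not shown in this notebook. A minimal sketch of what it might contain follows, assuming the API key is read from an environment variable named `TMDB_API_KEY` (a hypothetical name) and that `MOVIE_IDS` matches the IDs that appear in the fetch log further down. The base URL is the public TMDB v3 endpoint.

```python
# config.py (hypothetical layout; the real module is not part of this notebook)
import os

# Assumption: the key lives in an environment variable rather than in source control.
# "TMDB_API_KEY" is an illustrative name, not one confirmed by the notebook.
API_KEY = os.environ.get("TMDB_API_KEY", "")

# Base URL of TMDB API version 3
BASE_URL = "https://api.themoviedb.org/3"

# IDs copied from the fetch log printed by this notebook (ID 0 already removed)
MOVIE_IDS = [
    299534, 19995, 140607, 299536, 597, 135397,
    420818, 24428, 168259, 99861, 284054, 12445,
    181808, 330457, 351286, 109445, 321612, 260513,
]
```

Keeping the key out of the notebook itself means the exported file can be shared without leaking credentials.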


#### Fetch detailed movie information from TMDB API
    

In [2]:
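def fetch_movie_details(movie_id, api_key):
    """Fetch one movie's details (with credits) from TMDB.

    Returns the parsed JSON dict, or None if the request fails.
    """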
    url = f"{BASE_URL}/movie/{movie_id}"
    params = {
        'api_key': api_key,
        'append_to_response': 'credits'
    }
    
    try:
        response = requests.get(url, params=params, timeout=10)
        
        if response.status_code == 200:
            return response.json()
        elif response.status_code == 404:
            print(f"Movie ID {movie_id} not found")
            return None
        else:
            print(f"Error {response.status_code} for movie {movie_id}")
            return None
            
    except requests.exceptions.RequestException as e:
        print(f"Request failed for movie {movie_id}: {e}")
        return None
    
    finally:
        time.sleep(0.25)  # Rate limiting: 4 requests/second
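
The fixed 0.25-second pause keeps the request rate low, but the function above gives up on the first non-200 response. One possible extension, sketched here under assumptions, retries when the server answers with HTTP 429 or a 5xx status. `get_with_retry` and its parameters are hypothetical names. The `get` argument is expected to behave like `requests.get`, and the sketch uses the `Retry-After` header only when it holds a plain number of seconds.

```python
import time


def get_with_retry(get, url, params, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call get(url, params=..., timeout=10), retrying on HTTP 429 or 5xx responses."""
    response = None
    for attempt in range(max_attempts):
        response = get(url, params=params, timeout=10)
        retryable = response.status_code == 429 or response.status_code >= 500
        if not retryable:
            return response
        if attempt < max_attempts - 1:
            # Prefer the server's Retry-After hint (in seconds) when it is numeric;
            # otherwise fall back to exponential backoff: base_delay, 2x, 4x, ...
            try:
                delay = float(response.headers.get("Retry-After"))
            except (TypeError, ValueError):
                delay = base_delay * (2 ** attempt)
            sleep(delay)
    # All attempts were retryable failures; hand the last response to the caller
    return response
```

Inside `fetch_movie_details`, the call `requests.get(url, params=params, timeout=10)` could then be swapped for `get_with_retry(requests.get, url, params)` while the status-code handling stays unchanged.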

### Fetching all movies
#### NB: Movie ID 0 was removed because it is invalid

In [3]:
# Fetch data for all movie IDs
movies_data = []
failed_ids = []

for movie_id in MOVIE_IDS:
    print(f"Fetching movie ID: {movie_id}...", end=" ")
    data = fetch_movie_details(movie_id, API_KEY)
    
    if data:
        movies_data.append(data)
        print("✓")
    else:
        failed_ids.append(movie_id)
        print("✗")

print(f"\nSuccessfully fetched: {len(movies_data)} movies")
print(f"Failed IDs: {failed_ids}")

Fetching movie ID: 299534... ✓
Fetching movie ID: 19995... ✓
Fetching movie ID: 140607... ✓
Fetching movie ID: 299536... ✓
Fetching movie ID: 597... ✓
Fetching movie ID: 135397... ✓
Fetching movie ID: 420818... ✓
Fetching movie ID: 24428... ✓
Fetching movie ID: 168259... ✓
Fetching movie ID: 99861... ✓
Fetching movie ID: 284054... ✓
Fetching movie ID: 12445... ✓
Fetching movie ID: 181808... ✓
Fetching movie ID: 330457... ✓
Fetching movie ID: 351286... ✓
Fetching movie ID: 109445... ✓
Fetching movie ID: 321612... ✓
Fetching movie ID: 260513... ✓

Successfully fetched: 18 movies
Failed IDs: []
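
The loop above treats any non-empty response as a success. Since response validation is one of this notebook's stated tasks, a stricter check could confirm that each record carries the fields needed later. The sketch below is one way to do that. `REQUIRED_KEYS` and `find_invalid_records` are hypothetical names, and the key set is an assumed subset of the columns this notebook prints for the raw DataFrame.

```python
# Assumed subset of fields that later steps rely on; adjust as needed
REQUIRED_KEYS = {"id", "title", "release_date", "budget", "revenue", "credits"}


def find_invalid_records(records, required_keys=REQUIRED_KEYS):
    """Return (movie id, sorted missing keys) for each record lacking a required key."""
    problems = []
    for record in records:
        missing = sorted(set(required_keys) - set(record))
        if missing:
            problems.append((record.get("id"), missing))
    return problems
```

Calling `find_invalid_records(movies_data)` returns an empty list when every record has all of the required keys.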


#### Saving raw data into a readable format

In [4]:
# Save raw JSON for reproducibility
import os
os.makedirs('../data/raw', exist_ok=True)

with open('../data/raw/movies_raw.json', 'w') as f:
    json.dump(movies_data, f, indent=2)

print("Raw data saved to: data/raw/movies_raw.json")
print(f"Total records: {len(movies_data)}")

Raw data saved to: data/raw/movies_raw.json
Total records: 18
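
Because the raw file is kept for reproducibility, it can be worth reading it back once to confirm that it parses and holds the expected number of records. A minimal sketch follows; `count_saved_records` is a hypothetical helper that assumes the file contains a single JSON list.

```python
import json


def count_saved_records(path):
    """Load a JSON file holding a list of records and return how many it contains."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    if not isinstance(records, list):
        raise ValueError(f"Expected a JSON list in {path}, got {type(records).__name__}")
    return len(records)
```

In this notebook the check would read `count_saved_records('../data/raw/movies_raw.json') == len(movies_data)`.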


### Creating the initial DataFrame

In [6]:
# Convert to DataFrame for initial inspection
df_raw = pd.DataFrame(movies_data)
print(f"Raw DataFrame shape: {df_raw.shape}")
print(f"\nColumns: {df_raw.columns.tolist()}")
print("\nFirst row sample:")
df_raw.head(5)

Raw DataFrame shape: (18, 27)

Columns: ['adult', 'backdrop_path', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id', 'imdb_id', 'origin_country', 'original_language', 'original_title', 'overview', 'popularity', 'poster_path', 'production_companies', 'production_countries', 'release_date', 'revenue', 'runtime', 'spoken_languages', 'status', 'tagline', 'title', 'video', 'vote_average', 'vote_count', 'credits']

First row sample:


,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,imdb_id,origin_country,original_language,...,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,credits
0,False,/9wXPKruA6bWYk2co5ix6fH59Qr8.jpg,"{'id': 86311, 'name': 'The Avengers Collection...",356000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 878, ...",https://www.marvel.com/movies/avengers-endgame,299534,tt4154796,[US],en,...,2799439100,181,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Avenge the fallen.,Avengers: Endgame,False,8.237,26974,"{'cast': [{'adult': False, 'gender': 2, 'id': ..."
1,False,/7JNzw1tSZZEgsBw6lu0VfO2X2Ef.jpg,"{'id': 87096, 'name': 'Avatar Collection', 'po...",237000000,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",https://www.avatar.com/movies/avatar,19995,tt0499549,[US],en,...,2923706026,162,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Enter the world of Pandora.,Avatar,False,7.594,32881,"{'cast': [{'adult': False, 'gender': 2, 'id': ..."
2,False,/8BTsTfln4jlQrLXUBquXJ0ASQy9.jpg,"{'id': 10, 'name': 'Star Wars Collection', 'po...",245000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 28, '...",http://www.starwars.com/films/star-wars-episod...,140607,tt2488496,[US],en,...,2068223624,136,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Every generation has a story.,Star Wars: The Force Awakens,False,7.3,20104,"{'cast': [{'adult': False, 'gender': 2, 'id': ..."
3,False,/mDfJG3LC3Dqb67AZ52x3Z0jU0uB.jpg,"{'id': 86311, 'name': 'The Avengers Collection...",300000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 28, '...",https://www.marvel.com/movies/avengers-infinit...,299536,tt4154756,[US],en,...,2052415039,149,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Destiny arrives all the same.,Avengers: Infinity War,False,8.235,31187,"{'cast': [{'adult': False, 'gender': 2, 'id': ..."
4,False,/xnHVX37XZEp33hhCbYlQFq7ux1J.jpg,,200000000,"[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n...",https://www.paramountmovies.com/movies/titanic,597,tt0120338,[US],en,...,2264162353,194,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Nothing on earth could come between them.,Titanic,False,7.903,26519,"{'cast': [{'adult': False, 'gender': 2, 'id': ..."
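
Several columns in the preview (`genres`, `credits`, and others) hold nested lists and dicts, which makes the raw table hard to read. For a quick look, each record can be reduced to a few flat values. The sketch below assumes the structure visible in the preview: `genres` is a list of dicts with a `name` key, and `credits` is a dict with a `cast` list. `summarize_record` is a hypothetical helper.

```python
def summarize_record(record):
    """Reduce one raw TMDB record to its title, genre names, and cast size."""
    # `or []` / `or {}` guard against keys that are present but set to None
    genres = [g["name"] for g in record.get("genres") or []]
    cast = (record.get("credits") or {}).get("cast") or []
    return {"title": record.get("title"), "genres": genres, "cast_size": len(cast)}
```

Applied to the whole list, `[summarize_record(m) for m in movies_data]` gives a compact overview that can also be passed to `pd.DataFrame`.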
