# TMDB Movie Data Analysis

## Project Overview
This project builds a movie data analysis pipeline with Python and pandas. We fetch movie data from the TMDB API, clean and transform the dataset, and compute key performance indicators (KPIs) that identify the best and worst movies by financial and popularity metrics.

## Objectives
1. **API Data Extraction**: Fetch movie data from the TMDB API.
2. **Data Cleaning & Transformation**: Process and structure the data for analysis.
3. **Exploratory Data Analysis (EDA)**: Perform an initial exploration to understand trends.
4. **Advanced Filtering & Ranking**: Identify the best and worst movies based on financial and popularity metrics.
5. **Franchise & Director Analysis**: Assess how franchises and directors perform over time.
6. **Visualization & Insights**: Present key findings using visualizations.

## Setup and Imports
First, we import the necessary libraries and set up our environment.

In [None]:
import os
import requests
import json
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

API_KEY = os.getenv('TMDB_API_KEY')
BASE_URL = "https://api.themoviedb.org/3"

if not API_KEY:
    print("WARNING: TMDB_API_KEY not found in environment variables.")
else:
    print("API Key loaded successfully.")

## Step 1: Fetch Movie Data from API
We need to fetch data for a specific list of movies provided in the assignment. We will define functions to fetch details for each movie ID, including credits (cast and crew).

In [None]:
def fetch_movie_details(movie_id):
    """
    Fetches details for a specific movie ID, including credits.
    """
    if not API_KEY:
        raise ValueError("TMDB_API_KEY not found.")
        
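    url = f"{BASE_URL}/movie/{movie_id}"
    # Pass query parameters separately so requests handles the URL encoding
    params = {"api_key": API_KEY, "language": "en-US", "append_to_response": "credits"}
    try:
        # A timeout keeps a stalled connection from hanging the notebook
        response = requests.get(url, params=params, timeout=10)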
        response.raise_for_status()
        return response.json()
    except requests.exceptions.RequestException as e:
        print(f"Error fetching movie {movie_id}: {e}")
        return None

def fetch_specific_movies(movie_ids):
    """
    Fetches data for a list of movie IDs.
    """
    movies = []
    for i, movie_id in enumerate(movie_ids):
        # print(f"Fetching movie {i+1}/{len(movie_ids)}: ID {movie_id}") # Uncomment for progress
        data = fetch_movie_details(movie_id)
        if data:
            movies.append(data)
    return movies
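
The request above can also be described as a base URL plus a dictionary of query parameters, which keeps the API key out of hand-built strings. The sketch below only assembles the request and does no network I/O. The helper name `build_movie_request` is hypothetical; the query parameters are the ones already used in `fetch_movie_details`.

```python
BASE_URL = "https://api.themoviedb.org/3"

def build_movie_request(movie_id, api_key, base_url=BASE_URL):
    """Return (url, params) for a movie-details request with credits appended.

    Hypothetical helper for illustration: it builds the request but does not send it.
    """
    url = f"{base_url}/movie/{movie_id}"
    params = {
        "api_key": api_key,
        "language": "en-US",
        "append_to_response": "credits",
    }
    return url, params

url, params = build_movie_request(299534, "YOUR_KEY")
print(url)  # https://api.themoviedb.org/3/movie/299534
```

The pair can then be passed on as `requests.get(url, params=params, timeout=10)`, where the timeout value is an arbitrary choice.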

In [None]:
# List of IDs from assignment
# Note: ID 0 is presumably an invalid placeholder; a failed fetch is logged and skipped.
movie_ids = [0, 299534, 19995, 140607, 299536, 597, 135397, 420818, 24428, 168259, 99861, 284054, 12445, 181808, 330457, 351286, 109445, 321612, 260513]

print("Fetching specific movies...")
raw_movies_data = fetch_specific_movies(movie_ids)
print(f"Successfully fetched {len(raw_movies_data)} movies.")

## Step 2: Data Cleaning and Preprocessing
Now that we have the raw data, we need to clean it. This involves:
1.  Dropping irrelevant columns.
2.  Extracting data from JSON-like columns (genres, production companies, etc.).
3.  **Inspecting extracted columns** for anomalies.
4.  Handling missing or incorrect data (e.g., 0 budget).
5.  **Handling vote_count = 0**.
6.  Converting data types.
7.  Calculating new metrics like ROI and Profit.
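
Steps 4 and 7 interact: once a zero budget or revenue is treated as missing, ROI and profit have to decide what to do with the gap. The sketch below uses made-up numbers to show one option, letting NaN propagate so that unknown inputs give unknown results. It illustrates that choice and is not necessarily what `process_data` does.

```python
import pandas as pd

# Toy data (made-up numbers): a budget or revenue of 0 stands for "unknown"
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "budget": [100_000_000, 0, 50_000_000],
    "revenue": [300_000_000, 20_000_000, 0],
})

for col in ["budget", "revenue"]:
    # Treat 0 as missing, then scale to million USD
    toy[f"{col}_musd"] = toy[col].replace(0, float("nan")) / 1_000_000

# NaN propagates, so rows with an unknown input get NaN instead of a misleading 0
toy["roi"] = toy["revenue_musd"] / toy["budget_musd"]
toy["profit"] = toy["revenue_musd"] - toy["budget_musd"]

print(toy[["title", "roi", "profit"]])
```

Only movie A has both values, so it is the only row with a numeric ROI (3.0) and profit (200 MUSD).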

In [None]:
def process_data(data_list):
    """
    Cleans and transforms the movie data list into a DataFrame.
    """
    df = pd.DataFrame(data_list)
    
    # 1. Drop irrelevant columns
    drop_cols = ['adult', 'imdb_id', 'original_title', 'video', 'homepage']
    df = df.drop(columns=drop_cols, errors='ignore')  # errors='ignore' skips columns that are absent
    
    # 2. Extract and clean key data points (JSON-like columns)
    def extract_names(x):
        if isinstance(x, list):
            return "|".join([i['name'] for i in x if 'name' in i])
        return ""

    json_cols = ['genres', 'belongs_to_collection', 'production_countries', 'production_companies', 'spoken_languages']
    for col in json_cols:
        if col in df.columns:
            # belongs_to_collection is a single dict (or None), not a list of dicts
            if col == 'belongs_to_collection':
                df[col] = df[col].apply(lambda x: x['name'] if isinstance(x, dict) and 'name' in x else "")
            else:
                df[col] = df[col].apply(extract_names)
                
    # Inspect extracted columns using value_counts() to identify anomalies
    print("\n--- Inspecting Extracted Columns ---")
    for col in json_cols:
        if col in df.columns:
            print(f"\nTop 5 values for {col}:")
            print(df[col].value_counts().head(5))

    # 3. Convert column datatypes
    numeric_cols = ['budget', 'id', 'popularity', 'revenue', 'vote_average', 'vote_count', 'runtime']
    for col in numeric_cols:
        df[col] = pd.to_numeric(df.get(col, 0), errors='coerce')
        
    df['release_date'] = pd.to_datetime(df['release_date'], errors='coerce')
    df['release_year'] = df['release_date'].dt.year
    
    # 4. Replace unrealistic values
    # Replace 0 with NaN for budget, revenue, runtime.
    # float('nan') keeps the columns numeric; pd.NA would turn them into object dtype.
    for col in ['budget', 'revenue', 'runtime']:
        df[col] = df[col].replace(0, float('nan'))
        
    # Convert budget and revenue to million USD
    df['budget_musd'] = df['budget'] / 1000000
    df['revenue_musd'] = df['revenue'] / 1000000
    
    # Handle vote_count = 0
    if 'vote_count' in df.columns and 'vote_average' in df.columns:
        df.loc[df['vote_count'] == 0, 'vote_average'] = 0
    
    # Replace placeholders in overview/tagline
    for col in ['overview', 'tagline']:
        if col in df.columns:
            df[col] = df[col].replace(['No Data', ''], pd.NA)

    # 5. Remove duplicates and drop rows with unknown 'id' or 'title'
    df = df.drop_duplicates(subset='id')
    df = df.dropna(subset=['id', 'title'])
    
    # 6. Keep only rows where at least 10 columns have non-NaN values
    df = df.dropna(thresh=10)
    
    # 7. Filter to include only 'Released' movies
    if 'status' in df.columns:
        df = df[df['status'] == 'Released']
        df = df.drop(columns=['status'])
        
    # 8. Reorder columns and extract credits
    target_cols = [
        'id', 'title', 'tagline', 'release_date', 'release_year', 'genres', 'belongs_to_collection', 
        'original_language', 'budget_musd', 'revenue_musd', 'production_companies', 
        'production_countries', 'vote_count', 'vote_average', 'popularity', 'runtime', 
        'overview', 'spoken_languages', 'poster_path'
    ]
    
    if 'credits' in df.columns:
        def get_director(x):
            if isinstance(x, dict) and 'crew' in x:
                for crew in x['crew']:
                    if crew.get('job') == 'Director':
                        return crew.get('name')
            return ""
            
        def get_cast(x):
            if isinstance(x, dict) and 'cast' in x:
                return "|".join([c['name'] for c in x['cast'][:5]]) # Top 5 cast
            return ""
            
        df['director'] = df['credits'].apply(get_director)
        df['cast'] = df['credits'].apply(get_cast)
        df['cast_size'] = df['credits'].apply(lambda x: len(x.get('cast', [])) if isinstance(x, dict) else 0)
        df['crew_size'] = df['credits'].apply(lambda x: len(x.get('crew', [])) if isinstance(x, dict) else 0)
        
        target_cols.extend(['cast', 'cast_size', 'director', 'crew_size'])
    
    # Select only existing columns from target list
    final_cols = [c for c in target_cols if c in df.columns]
    df = df[final_cols]
    
    # 9. Reset index
    df = df.reset_index(drop=True)
    
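    # Calculate ROI and Profit
    # Note: missing budget/revenue values are filled with 0 here, so ROI is reported
    # as 0 when the budget is unknown, and profit treats unknown amounts as 0.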
    df['budget_musd'] = df['budget_musd'].fillna(0)
    df['revenue_musd'] = df['revenue_musd'].fillna(0)
    
    df['roi'] = df.apply(lambda row: row['revenue_musd'] / row['budget_musd'] if row['budget_musd'] > 0 else 0, axis=1)
    df['profit'] = df['revenue_musd'] - df['budget_musd']
    
    return df

In [None]:
print("Processing data...")
df_clean = process_data(raw_movies_data)
display(df_clean.head())
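
To see the flattening of nested fields in isolation, the following sketch applies the same idea to a single made-up record, without pandas. The helper names `join_names` and `first_director` are hypothetical stand-ins for the nested helpers inside `process_data`, and the record values are invented for illustration.

```python
def join_names(items, sep="|"):
    """Join the 'name' field of each dict in a list; return '' for anything else."""
    if not isinstance(items, list):
        return ""
    return sep.join(
        item["name"] for item in items if isinstance(item, dict) and "name" in item
    )

def first_director(credits):
    """Return the first crew member whose job is 'Director', or '' if none is found."""
    if not isinstance(credits, dict):
        return ""
    for member in credits.get("crew", []):
        if member.get("job") == "Director":
            return member.get("name", "")
    return ""

# Made-up record shaped like the nested fields handled above
record = {
    "genres": [{"id": 1, "name": "Action"}, {"id": 2, "name": "Adventure"}],
    "belongs_to_collection": None,
    "credits": {
        "crew": [
            {"job": "Producer", "name": "P. Example"},
            {"job": "Director", "name": "D. Example"},
        ]
    },
}

print(join_names(record["genres"]))        # Action|Adventure
print(first_director(record["credits"]))   # D. Example
```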

## Step 3: KPI Implementation & Analysis
We now analyze the data to identify the best- and worst-performing movies, compare franchises with standalone films, and run a few specific filtered searches.
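
Ranking by a ratio such as ROI is sensitive to tiny denominators, which is why a minimum threshold is useful. The sketch below shows the idea on made-up numbers. The function name `top_by` and its parameters are hypothetical and only mirror the general pattern of sorting after filtering.

```python
import pandas as pd

def top_by(df, metric, n=3, ascending=False, min_col=None, min_val=None):
    """Return the top-n rows by `metric`, optionally keeping rows with min_col >= min_val."""
    data = df if min_col is None else df[df[min_col] >= min_val]
    return data.sort_values(metric, ascending=ascending).head(n)[["title", metric]]

# Toy data (made-up numbers): movie B has a huge ROI but a very small budget
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "budget_musd": [200.0, 5.0, 50.0, 120.0],
    "roi": [2.0, 30.0, 4.0, 1.5],
})

unfiltered = top_by(toy, "roi", n=2)
filtered = top_by(toy, "roi", n=2, min_col="budget_musd", min_val=10)

print(unfiltered["title"].tolist())  # ['B', 'C']
print(filtered["title"].tolist())    # ['C', 'A']
```

With the threshold in place, the low-budget outlier B no longer dominates the ranking.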

In [None]:
def analyze_movies(df):
    """Performs comprehensive analysis on the movie dataset."""
    
    # --- 1. Identify Best/Worst Performing Movies ---
    print("\n=== 1. Best/Worst Performing Movies ===")
    
    def rank_movies(df, metric, ascending=False, top_n=5, filter_col=None, filter_val=None):
        data = df.copy()
        if filter_col:
            data = data[data[filter_col] >= filter_val]
        
        ranked = data.sort_values(metric, ascending=ascending).head(top_n)
        return ranked[['title', metric]]

    # Highest Revenue
    print("\n--- Highest Revenue ---")
    display(rank_movies(df, 'revenue_musd'))
    
    # Highest Budget
    print("\n--- Highest Budget ---")
    display(rank_movies(df, 'budget_musd'))
    
    # Highest Profit
    print("\n--- Highest Profit ---")
    display(rank_movies(df, 'profit'))
    
    # Lowest Profit
    print("\n--- Lowest Profit ---")
    display(rank_movies(df, 'profit', ascending=True))
    
    # Highest ROI (Budget >= 10M)
    print("\n--- Highest ROI (Budget >= 10M) ---")
    display(rank_movies(df, 'roi', filter_col='budget_musd', filter_val=10))
    
    # Lowest ROI (Budget >= 10M)
    print("\n--- Lowest ROI (Budget >= 10M) ---")
    display(rank_movies(df, 'roi', ascending=True, filter_col='budget_musd', filter_val=10))
    
    # Most Voted
    print("\n--- Most Voted Movies ---")
    display(rank_movies(df, 'vote_count'))
    
    # Highest Rated (Votes >= 10)
    print("\n--- Highest Rated (Votes >= 10) ---")
    display(rank_movies(df, 'vote_average', filter_col='vote_count', filter_val=10))
    
    # Lowest Rated (Votes >= 10)
    print("\n--- Lowest Rated (Votes >= 10) ---")
    display(rank_movies(df, 'vote_average', ascending=True, filter_col='vote_count', filter_val=10))
    
    # Most Popular
    print("\n--- Most Popular Movies ---")
    display(rank_movies(df, 'popularity'))
    
    
    # --- 2. Advanced Movie Filtering ---
    print("\n=== 2. Advanced Movie Filtering ===")
    
    # Search 1: Best-rated Sci-Fi Action movies starring Bruce Willis
    mask_scifi = df['genres'].str.contains('Science Fiction', na=False)
    mask_action = df['genres'].str.contains('Action', na=False)
    mask_bruce = df['cast'].str.contains('Bruce Willis', na=False)
    
    bruce_movies = df[mask_scifi & mask_action & mask_bruce].sort_values('vote_average', ascending=False)
    print("\n--- Sci-Fi Action movies starring Bruce Willis ---")
    display(bruce_movies[['title', 'vote_average', 'release_date']])
    
    # Search 2: Movies starring Uma Thurman, directed by Quentin Tarantino (sorted by runtime)
    mask_uma = df['cast'].str.contains('Uma Thurman', na=False)
    mask_qt = df['director'].str.contains('Quentin Tarantino', na=False)
    
    uma_qt_movies = df[mask_uma & mask_qt].sort_values('runtime')
    print("\n--- Uma Thurman & Quentin Tarantino Movies (by Runtime) ---")
    display(uma_qt_movies[['title', 'runtime', 'release_date']])
    
    
    # --- 3. Franchise vs Standalone ---
    print("\n=== 3. Franchise vs Standalone Analysis ===")
    
    # Empty collection name -> standalone. This adds a column to the caller's DataFrame,
    # and plot_data() later relies on it.
    df['is_franchise'] = df['belongs_to_collection'].apply(bool)
    
    franchise_stats = df.groupby('is_franchise').agg({
        'revenue_musd': 'mean',
        'roi': 'median',
        'budget_musd': 'mean',
        'popularity': 'mean',
        'vote_average': 'mean'
    }).rename(index={True: 'Franchise', False: 'Standalone'})
    
    print("\n--- Franchise vs Standalone Stats ---")
    display(franchise_stats)
    
    
    # --- 4. Most Successful Franchises & Directors ---
    print("\n=== 4. Most Successful Franchises & Directors ===")
    
    # Franchises
    franchise_df = df[df['is_franchise']].groupby('belongs_to_collection').agg({
        'title': 'count',
        'budget_musd': ['sum', 'mean'],
        'revenue_musd': ['sum', 'mean'],
        'vote_average': 'mean'
    })
    franchise_df.columns = ['movie_count', 'total_budget', 'mean_budget', 'total_revenue', 'mean_revenue', 'mean_rating']
    print("\n--- Top 5 Franchises by Total Revenue ---")
    display(franchise_df.sort_values('total_revenue', ascending=False).head(5))
    
    # Directors
    director_df = df.groupby('director').agg({
        'title': 'count',
        'revenue_musd': 'sum',
        'vote_average': 'mean'
    })
    director_df.columns = ['movie_count', 'total_revenue', 'mean_rating']
    # Filter out empty director if any
    if "" in director_df.index:
        director_df = director_df.drop("")
        
    print("\n--- Top 5 Directors by Total Revenue ---")
    display(director_df.sort_values('total_revenue', ascending=False).head(5))

    return franchise_stats

In [None]:
franchise_stats = analyze_movies(df_clean)
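
The franchise versus standalone comparison reduces to a boolean flag followed by a grouped aggregation. The sketch below reproduces that pattern on made-up numbers. The collection name "Saga X" and the output column names are invented for illustration.

```python
import pandas as pd

# Toy data (made-up numbers); an empty collection name marks a standalone movie
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "belongs_to_collection": ["Saga X", "", "Saga X", ""],
    "revenue_musd": [900.0, 100.0, 700.0, 300.0],
    "roi": [3.0, 1.0, 5.0, 2.0],
})

toy["is_franchise"] = toy["belongs_to_collection"] != ""

summary = (
    toy.groupby("is_franchise")
    .agg(
        mean_revenue=("revenue_musd", "mean"),
        median_roi=("roi", "median"),
        movies=("title", "count"),
    )
    .rename(index={True: "Franchise", False: "Standalone"})
)

print(summary)
```

Here the franchise group averages 800 MUSD in revenue against 200 MUSD for standalone titles. With only two movies per group, this shows the mechanics and says nothing about real films.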

## Step 4: Data Visualization
Finally, we visualize the data to better understand the relationships between different variables.

In [None]:
def plot_data(df, franchise_stats):
    """Generates plots for analysis.

    Expects `df` to contain the 'is_franchise' column added by analyze_movies().
    """
    sns.set_theme(style="whitegrid")
    
    # 1. Revenue vs Budget
    plt.figure(figsize=(10, 6))
    sns.scatterplot(data=df, x='budget_musd', y='revenue_musd', hue='is_franchise', alpha=0.7)
    plt.title('Revenue vs Budget')
    plt.xlabel('Budget (MUSD)')
    plt.ylabel('Revenue (MUSD)')
    plt.show()
    
    # 2. ROI Distribution by Genre (Top 5 genres)
    # Explode genres first
    df_genres = df.assign(genre=df['genres'].str.split('|')).explode('genre')
    top_genres = df_genres['genre'].value_counts().head(5).index
    df_top_genres = df_genres[df_genres['genre'].isin(top_genres)]
    
    plt.figure(figsize=(12, 6))
    sns.boxplot(data=df_top_genres, x='genre', y='roi')
    plt.title('ROI Distribution by Top 5 Genres')
    plt.ylim(-1, 10) # Limit y-axis to see distribution better
    plt.show()
    
    # 3. Popularity vs Rating
    plt.figure(figsize=(10, 6))
    sns.scatterplot(data=df, x='vote_average', y='popularity', alpha=0.6)
    plt.title('Popularity vs Rating')
    plt.xlabel('Vote Average')
    plt.ylabel('Popularity')
    plt.show()
    
    # 4. Franchise vs Standalone Comparison (Bar Chart)
    # Reset index to plot
    franchise_plot = franchise_stats.reset_index()
    # Melt for seaborn
    franchise_melt = franchise_plot.melt(id_vars='is_franchise', value_vars=['revenue_musd', 'budget_musd'], var_name='Metric', value_name='Value (MUSD)')
    
    plt.figure(figsize=(10, 6))
    sns.barplot(data=franchise_melt, x='Metric', y='Value (MUSD)', hue='is_franchise')
    plt.title('Franchise vs Standalone: Revenue & Budget')
    plt.show()

    # 5. Yearly Trends in Box Office Performance
    yearly_stats = df.groupby('release_year')['revenue_musd'].sum().reset_index()
    plt.figure(figsize=(12, 6))
    sns.lineplot(data=yearly_stats, x='release_year', y='revenue_musd', marker='o')
    plt.title('Yearly Trends in Box Office Revenue')
    plt.xlabel('Year')
    plt.ylabel('Total Revenue (MUSD)')
    plt.show()

In [None]:
plot_data(df_clean, franchise_stats)
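
The genre plot depends on turning a pipe-separated string into one row per genre. The sketch below shows that reshaping step on made-up data. It also drops the empty label produced by movies with no genre, an extra step that is an assumption on our part and is not in `plot_data`.

```python
import pandas as pd

# Toy data (made-up numbers); movie C has no genre information
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genres": ["Action|Adventure", "Action", ""],
    "roi": [3.0, 2.0, 1.0],
})

# One row per (movie, genre) pair
exploded = toy.assign(genre=toy["genres"].str.split("|")).explode("genre")

# Splitting an empty string yields a single empty label; remove it
exploded = exploded[exploded["genre"] != ""]

counts = exploded["genre"].value_counts()
median_roi = exploded.groupby("genre")["roi"].median()

print(counts)
print(median_roi)
```

A movie with several genres contributes one row to each of them, so the per-genre counts add up to more than the number of movies.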

## Conclusion
In this analysis, we fetched movie data from the TMDB API, cleaned it, and performed various analyses to understand what makes a movie successful. We looked at financial metrics like Revenue and ROI, as well as popularity and ratings. We also compared franchises vs. standalone movies and identified top directors.