# TMDB Movie Data Analysis

This notebook analyzes the TMDB 5000 movie dataset (the credits file, plus the movies file when it is available) to explore movies, actors, and directors. We'll look at:

1. Movie metadata and popularity metrics
2. Actor and director networks and collaborations
3. Detailed genre analysis with budget and revenue information
4. Visualization of industry trends over time

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import json
import ast
from collections import Counter, defaultdict
import re
from tqdm.notebook import tqdm
import networkx as nx
from itertools import combinations

# Set visualization style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set(style="whitegrid")
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 12

## 1. Data Loading and Initial Exploration

In [None]:
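# Install networkx if it is missing. It is imported in the first cell, so on a
# fresh environment run this install before the imports. (plotly is installed
# here but is not used in this notebook.)
!pip install networkx plotly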

# Load the TMDB credits dataset
tmdb_credits = pd.read_csv('tmdb_5000_credits.csv')
print(f"Loaded TMDB credits dataset with {len(tmdb_credits)} movies")
print(f"Columns: {tmdb_credits.columns.tolist()}")

In [None]:
# Check if there's a corresponding TMDB movies dataset (common in Kaggle)
try:
    tmdb_movies = pd.read_csv('tmdb_5000_movies.csv')
    print(f"Loaded TMDB movies dataset with {len(tmdb_movies)} movies")
    print(f"Columns: {tmdb_movies.columns.tolist()}")
    has_movies_dataset = True
except FileNotFoundError:
    print("TMDB movies dataset not found. Working with credits dataset only.")
    has_movies_dataset = False
    
    # Try to download the dataset (the Kaggle CLI needs API credentials configured)
    print("Attempting to download the TMDB movies dataset...")
    try:
        !pip install kaggle
        !kaggle datasets download -d tmdb/tmdb-movie-metadata
        !unzip -o tmdb-movie-metadata.zip
        tmdb_movies = pd.read_csv('tmdb_5000_movies.csv')
        print(f"Successfully downloaded and loaded TMDB movies dataset with {len(tmdb_movies)} movies")
        has_movies_dataset = True
    except Exception as exc:
        # Failed "!" shell commands do not raise; a failure surfaces when read_csv cannot find the file
        print(f"Could not download the dataset ({exc}). Continuing with credits dataset only.")

In [None]:
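# Parse the JSON data in the cast and crew columns
def parse_json_column(json_str):
    """Parse a JSON-encoded list; return [] for missing or malformed values."""
    try:
        return json.loads(json_str.replace('\'\'', '\''))
    except (AttributeError, TypeError, ValueError):
        # AttributeError/TypeError: non-string input such as NaN; ValueError: invalid JSON
        return []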

# Extract cast information
tmdb_credits['cast_parsed'] = tmdb_credits['cast'].apply(parse_json_column)

# Extract crew information
tmdb_credits['crew_parsed'] = tmdb_credits['crew'].apply(parse_json_column)

# Extract director information from crew
def get_director(crew_list):
    for crew_member in crew_list:
        if isinstance(crew_member, dict) and crew_member.get('job') == 'Director':
            return crew_member.get('name')
    return None

tmdb_credits['director'] = tmdb_credits['crew_parsed'].apply(get_director)

# Extract top cast members (the first n entries, as ordered in the cast list)
def get_top_cast(cast_list, n=5):
    return [cast_member.get('name') for cast_member in cast_list[:n]
            if isinstance(cast_member, dict) and 'name' in cast_member]

tmdb_credits['top_cast'] = tmdb_credits['cast_parsed'].apply(get_top_cast)

# Display the processed data
tmdb_credits[['title', 'director', 'top_cast']].head()
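
The parsing and director-extraction steps above can be checked on a small hand-made value before running them on the full dataset. The sketch below uses only the standard library; the sample crew entries and the helper names `parse_json_list` and `first_director` are hypothetical stand-ins, not rows or functions from the dataset.

```python
import json

# Hypothetical sample mimicking the shape of one 'crew' cell (not real dataset rows)
sample_crew = json.dumps([
    {"name": "Alex Example", "job": "Producer"},
    {"name": "Sam Sample", "job": "Director"},
])

def parse_json_list(raw):
    """Return the decoded list, or [] when the value is missing or malformed."""
    try:
        return json.loads(raw)
    except (TypeError, ValueError):
        return []

def first_director(crew_list):
    """Return the name of the first crew entry whose job is 'Director'."""
    for member in crew_list:
        if isinstance(member, dict) and member.get("job") == "Director":
            return member.get("name")
    return None

director = first_director(parse_json_list(sample_crew))
# A NaN cell (as pandas produces for missing text) falls back to an empty list
missing = first_director(parse_json_list(float("nan")))
print(director, missing)
```

Returning an empty list for bad input keeps the later `apply` calls simple, at the cost of hiding how many rows failed to parse; counting empty results afterwards is one way to surface that.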

## 2. Actor and Director Analysis

In [None]:
# Analyze actor popularity and gender distribution
# Gender codes used below: 1 = female, 2 = male; any other value is treated as unknown
GENDER_LABELS = {1: 'Female', 2: 'Male'}

actor_data = []
for _, row in tmdb_credits.iterrows():
    for cast_member in row['cast_parsed']:
        if isinstance(cast_member, dict) and 'gender' in cast_member and 'name' in cast_member:
            actor_data.append({
                'name': cast_member.get('name'),
                'gender': GENDER_LABELS.get(cast_member.get('gender'), 'Unknown'),
                # Falls back to 0 when a cast entry has no 'popularity' key
                'popularity': cast_member.get('popularity', 0),
                'character': cast_member.get('character', ''),
                'movie_id': row['movie_id'],
                'movie_title': row['title']
            })

actor_df = pd.DataFrame(actor_data)

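# Analyze gender distribution
gender_counts = actor_df['gender'].value_counts()
gender_colors = {'Male': 'skyblue', 'Female': 'pink', 'Unknown': 'gray'}
plt.figure(figsize=(10, 6))
# value_counts() orders by frequency, so look colors up by label rather than by position
gender_counts.plot(kind='bar', color=[gender_colors[g] for g in gender_counts.index])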
plt.title('Gender Distribution of Actors in TMDB Dataset')
plt.xlabel('Gender')
plt.ylabel('Count')
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()
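
The gender step maps numeric codes to labels and then counts them. Because a frequency count is ordered by size rather than by label, bar colors are safest when looked up by label. A minimal sketch with made-up codes (the `codes` list and the `PALETTE` name are illustrative assumptions):

```python
from collections import Counter

GENDER_LABELS = {1: "Female", 2: "Male"}  # any other code is treated as unknown
PALETTE = {"Male": "skyblue", "Female": "pink", "Unknown": "gray"}

codes = [2, 2, 0, 1, 2, 0, 0, 2]  # made-up gender codes, not dataset values
labels = [GENDER_LABELS.get(code, "Unknown") for code in codes]

# most_common() sorts by count, much like pandas value_counts()
counts = Counter(labels).most_common()
bar_colors = [PALETTE[label] for label, _ in counts]
print(counts, bar_colors)
```

Here "Unknown" outnumbers "Female", so a fixed positional color list would have painted the wrong bar pink.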

In [None]:
# Analyze top actors by popularity (each actor's highest popularity value is kept)
top_actors_by_popularity = actor_df.sort_values('popularity', ascending=False).drop_duplicates('name').head(20)
plt.figure(figsize=(12, 8))
# dodge=False keeps one full-width bar per actor even though hue is set
sns.barplot(x='popularity', y='name', data=top_actors_by_popularity, hue='gender', dodge=False,
            palette={'Male': 'skyblue', 'Female': 'pink', 'Unknown': 'gray'})
plt.title('Top 20 Actors by Popularity')
plt.xlabel('Popularity Score')
plt.ylabel('Actor Name')
plt.tight_layout()
plt.show()

In [None]:
# Analyze actor-director collaborations
collaborations = []
for _, row in tmdb_credits.iterrows():
    director = row['director']
    if pd.notna(director):
        for actor in row['top_cast']:
            collaborations.append({
                'director': director,
                'actor': actor,
                'movie': row['title']
            })

collab_df = pd.DataFrame(collaborations)

# Count collaborations
collab_counts = collab_df.groupby(['director', 'actor']).size().reset_index(name='collaboration_count')
top_collabs = collab_counts.sort_values('collaboration_count', ascending=False).head(20)

# Visualize top collaborations
plt.figure(figsize=(12, 10))
sns.barplot(x='collaboration_count', y=top_collabs.apply(lambda x: f"{x['director']} & {x['actor']}", axis=1), data=top_collabs)
plt.title('Top 20 Director-Actor Collaborations')
plt.xlabel('Number of Collaborations')
plt.ylabel('Director-Actor Pair')
plt.tight_layout()
plt.show()
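
The collaboration count is a frequency count over (director, actor) pairs. The same logic can be sketched without pandas as a quick cross-check; the records below are hypothetical placeholders, not dataset rows.

```python
from collections import Counter

# Hypothetical (director, top_cast) records standing in for dataset rows
movies = [
    ("Director A", ["Actor X", "Actor Y"]),
    ("Director A", ["Actor X", "Actor Z"]),
    ("Director B", ["Actor X"]),
    (None, ["Actor Y"]),  # no director found, so this movie is skipped
]

# One count per (director, actor) pair per movie
pair_counts = Counter(
    (director, actor)
    for director, cast in movies
    if director is not None
    for actor in cast
)

top_pair, top_count = pair_counts.most_common(1)[0]
print(top_pair, top_count)
```

Note that pairs are keyed by name, so two different people who share a name would be merged; keying on the person `id` field, where available, would avoid that.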

In [None]:
# Create a network graph of actor collaborations
# Focus on actors who have worked together in the same movie
G = nx.Graph()

# Add edges for actors who worked together
for _, row in tmdb_credits.iterrows():
    actors = row['top_cast']
    # Add all pairs of actors as edges
    for actor1, actor2 in combinations(actors, 2):
        if G.has_edge(actor1, actor2):
            G[actor1][actor2]['weight'] += 1
        else:
            G.add_edge(actor1, actor2, weight=1)

# Get the largest connected component
largest_cc = max(nx.connected_components(G), key=len)
G_sub = G.subgraph(largest_cc).copy()

# Limit to a manageable size for visualization
if len(G_sub) > 100:
    # Keep only nodes with high degree
    degrees = dict(G_sub.degree())
    nodes_to_keep = sorted(degrees, key=degrees.get, reverse=True)[:100]
    G_sub = G_sub.subgraph(nodes_to_keep).copy()

# Calculate node sizes based on degree
node_size = [50 + 10 * G_sub.degree(node) for node in G_sub.nodes()]

# Calculate edge widths based on weight
edge_width = [0.5 + 0.5 * G_sub[u][v]['weight'] for u, v in G_sub.edges()]

# Plot the network
plt.figure(figsize=(16, 16))
pos = nx.spring_layout(G_sub, seed=42)
nx.draw_networkx_nodes(G_sub, pos, node_size=node_size, node_color='skyblue', alpha=0.8)
nx.draw_networkx_edges(G_sub, pos, width=edge_width, alpha=0.3, edge_color='gray')
nx.draw_networkx_labels(G_sub, pos, font_size=8, font_family='sans-serif')
plt.title('Actor Collaboration Network')
plt.axis('off')
plt.tight_layout()
plt.show()
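
The edge weights in the network above count how many movies each pair of actors shares. A minimal standard-library sketch of that weighting, useful for checking the graph on a tiny input, is shown below. The cast lists are made up, and sorting each pair is an assumption made here so that (a, b) and (b, a) share one key, which an undirected `nx.Graph` handles automatically.

```python
from collections import Counter
from itertools import combinations

# Made-up top-cast lists, one per movie
casts = [
    ["Actor X", "Actor Y", "Actor Z"],
    ["Actor Y", "Actor X"],
    ["Actor Z"],  # a single actor produces no pairs
]

edge_weights = Counter()
for cast in casts:
    # sorted(set(...)) removes duplicate names and gives each pair a fixed order
    for a, b in combinations(sorted(set(cast)), 2):
        edge_weights[(a, b)] += 1

# Degree = number of distinct co-stars, as used for node sizes
degree = Counter()
for a, b in edge_weights:
    degree[a] += 1
    degree[b] += 1

print(edge_weights, degree)
```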

## 3. Genre Analysis with Budget and Revenue

In [None]:
# If we have the movies dataset, analyze genres with budget and revenue
if has_movies_dataset:
    # Parse the genres JSON data
    def extract_genres(genres_json):
        genres = parse_json_column(genres_json)
        return [genre.get('name') for genre in genres if isinstance(genre, dict) and 'name' in genre]
    
    tmdb_movies['genres_list'] = tmdb_movies['genres'].apply(extract_genres)
    
    # Explode the genres list to analyze each genre separately
    genres_exploded = tmdb_movies.explode('genres_list')
    
    # Calculate average budget and revenue by genre
    genre_metrics = genres_exploded.groupby('genres_list').agg({
        'budget': 'mean',
        'revenue': 'mean',
        'vote_average': 'mean',
        'popularity': 'mean',
        'id': 'count'
    }).reset_index()
    
    genre_metrics.columns = ['Genre', 'Avg Budget', 'Avg Revenue', 'Avg Rating', 'Avg Popularity', 'Movie Count']
    
    # Calculate ROI (Return on Investment) from the genre-level averages
    # (a ratio of means, not the mean of per-movie ROI values)
    genre_metrics['ROI'] = (genre_metrics['Avg Revenue'] - genre_metrics['Avg Budget']) / genre_metrics['Avg Budget']
    
    # Sort by movie count
    genre_metrics_sorted = genre_metrics.sort_values('Movie Count', ascending=False)
    
    # Visualize budget vs. revenue by genre
    plt.figure(figsize=(14, 10))
    sns.scatterplot(x='Avg Budget', y='Avg Revenue', size='Movie Count', hue='Avg Rating',
                    sizes=(100, 1000), alpha=0.7, palette='viridis', data=genre_metrics_sorted)
    
    # Add genre labels
    for _, row in genre_metrics_sorted.iterrows():
        plt.annotate(row['Genre'], 
                     (row['Avg Budget'], row['Avg Revenue']),
                     xytext=(5, 5), 
                     textcoords='offset points',
                     fontsize=10)
    
    plt.title('Average Budget vs. Revenue by Genre')
    plt.xlabel('Average Budget ($)')
    plt.ylabel('Average Revenue ($)')
    plt.xscale('log')
    plt.yscale('log')
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()
    
    # Visualize ROI for the 15 genres with the most movies (rows are ordered by movie count, not by ROI)
    plt.figure(figsize=(12, 8))
    sns.barplot(x='ROI', y='Genre', data=genre_metrics_sorted.head(15))
    plt.title('Return on Investment (ROI) by Genre')
    plt.xlabel('ROI = (Revenue - Budget) / Budget')
    plt.ylabel('Genre')
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()
else:
    print("Skipping genre analysis with budget and revenue as the movies dataset is not available.")
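
The genre ROI above is computed from averaged budget and revenue. The sketch below illustrates, with made-up numbers, how that ratio of means can differ from the mean of per-movie ROI values, and how rows with a zero budget or revenue shift the result. Treating zeros as missing values is an assumption made for this illustration, not something the notebook establishes.

```python
from statistics import mean

# Made-up (budget, revenue) pairs for one genre; 0 is assumed to mean "unknown"
movies = [(100, 300), (10, 50), (0, 0), (50, 0)]

def roi_of_means(rows):
    """ROI computed from averages, as in the genre table."""
    avg_budget = mean(budget for budget, _ in rows)
    avg_revenue = mean(revenue for _, revenue in rows)
    return (avg_revenue - avg_budget) / avg_budget

def mean_of_rois(rows):
    """Average of each movie's own ROI; requires non-zero budgets."""
    return mean((revenue - budget) / budget for budget, revenue in rows)

all_rows_roi = roi_of_means(movies)

# Keep only rows where both values are known (positive)
known = [(b, r) for b, r in movies if b > 0 and r > 0]
known_roi = roi_of_means(known)
known_mean_roi = mean_of_rois(known)

print(all_rows_roi, known_roi, known_mean_roi)
```

Which definition is appropriate depends on the question: the ratio of means weights big-budget movies more heavily, while the mean of ratios treats every movie equally and is sensitive to very small budgets.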

## 4. Temporal Analysis of Movie Industry

In [None]:
# If we have the movies dataset, analyze trends over time
if has_movies_dataset:
    # Extract release year from release_date
    tmdb_movies['release_year'] = pd.to_datetime(tmdb_movies['release_date'], errors='coerce').dt.year
    
    # Group by year and calculate metrics
    yearly_metrics = tmdb_movies.groupby('release_year').agg({
        'budget': 'mean',
        'revenue': 'mean',
        'vote_average': 'mean',
        'popularity': 'mean',
        'id': 'count'
    }).reset_index()
    
    yearly_metrics.columns = ['Year', 'Avg Budget', 'Avg Revenue', 'Avg Rating', 'Avg Popularity', 'Movie Count']
    
    # Filter out years with too few movies and invalid years
    yearly_metrics = yearly_metrics[(yearly_metrics['Movie Count'] >= 10) & 
                                    (yearly_metrics['Year'] >= 1960) & 
                                    (yearly_metrics['Year'] <= 2017)]
    
    # Plot trends over time
    fig, axes = plt.subplots(2, 2, figsize=(16, 12))
    
    # Budget trend
    axes[0, 0].plot(yearly_metrics['Year'], yearly_metrics['Avg Budget'], marker='o', linewidth=2)
    axes[0, 0].set_title('Average Movie Budget Over Time')
    axes[0, 0].set_xlabel('Year')
    axes[0, 0].set_ylabel('Average Budget ($)')
    axes[0, 0].grid(True, alpha=0.3)
    
    # Revenue trend
    axes[0, 1].plot(yearly_metrics['Year'], yearly_metrics['Avg Revenue'], marker='o', linewidth=2, color='green')
    axes[0, 1].set_title('Average Movie Revenue Over Time')
    axes[0, 1].set_xlabel('Year')
    axes[0, 1].set_ylabel('Average Revenue ($)')
    axes[0, 1].grid(True, alpha=0.3)
    
    # Rating trend
    axes[1, 0].plot(yearly_metrics['Year'], yearly_metrics['Avg Rating'], marker='o', linewidth=2, color='orange')
    axes[1, 0].set_title('Average Movie Rating Over Time')
    axes[1, 0].set_xlabel('Year')
    axes[1, 0].set_ylabel('Average Rating')
    axes[1, 0].grid(True, alpha=0.3)
    
    # Movie count trend
    axes[1, 1].plot(yearly_metrics['Year'], yearly_metrics['Movie Count'], marker='o', linewidth=2, color='purple')
    axes[1, 1].set_title('Number of Movies Released Each Year')
    axes[1, 1].set_xlabel('Year')
    axes[1, 1].set_ylabel('Movie Count')
    axes[1, 1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
else:
    print("Skipping temporal analysis as the movies dataset is not available.")
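
The yearly aggregation has three steps: parse a year from each release date (dropping unparseable dates), group by year, and keep only years with enough movies. A plain-Python sketch of the same idea on made-up rows follows; `parse_year`, `MIN_MOVIES`, and the sample rows are hypothetical names and values chosen for illustration.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Made-up (release_date, budget) rows; one date is malformed on purpose
rows = [
    ("2010-05-01", 100),
    ("2010-11-20", 300),
    ("2011-02-14", 50),
    ("not a date", 999),
]

def parse_year(text):
    """Return the year, or None for missing or malformed dates."""
    try:
        return datetime.strptime(text, "%Y-%m-%d").year
    except (TypeError, ValueError):
        return None

budgets_by_year = defaultdict(list)
for date_text, budget in rows:
    year = parse_year(date_text)
    if year is not None:
        budgets_by_year[year].append(budget)

MIN_MOVIES = 2  # the notebook uses 10; lowered here to suit the tiny sample
yearly_avg_budget = {
    year: mean(values)
    for year, values in sorted(budgets_by_year.items())
    if len(values) >= MIN_MOVIES
}
print(yearly_avg_budget)
```

The minimum-count filter matters because an average over one or two movies can swing a trend line far more than the underlying change in the industry.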

## 5. Conclusion and Key Findings

In this notebook, we analyzed the TMDB movie dataset with a focus on actors, directors, genres, and industry trends. The main results:

1. **Actor and Director Networks**: We counted how often each director worked with actors from a movie's top five cast entries, and visualized the co-appearance network of those actors, restricted to the 100 highest-degree nodes of the largest connected component.

2. **Gender Distribution**: The cast data shows a gender imbalance, with male actors appearing more often than female actors; entries without a recorded gender are counted as "Unknown".

3. **Genre Economics**: We compared average budget, revenue, rating, and popularity across genres, and computed a genre-level ROI as (average revenue - average budget) / average budget.

4. **Industry Trends**: For 1960 to 2017, restricted to years with at least 10 movies, we plotted how average budget, average revenue, average rating, and the number of movies per year change over time.

These results describe the dataset from several angles and can serve as a starting point for further analysis and modeling.