# TMDB Movie Data Analysis using Spark and APIs

## Step 1: API Setup and Data Extraction

This section focuses on setting up the TMDB API connection and extracting movie data. We'll use PySpark for distributed data processing and the TMDB API to fetch detailed movie information including cast, crew, and financial data.

### Import Necessary Modules

Start by importing the required libraries for:
- **PySpark**: Distributed data processing and analysis
- **API calls**: HTTP requests to TMDB API
- **Data manipulation**: NumPy for numerical operations
- **Environment management**: Secure API key handling
- **Visualization**: Matplotlib for data plotting

In [83]:
# Initialize Spark Session for distributed data processing
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *

# Create Spark session with adaptive query execution enabled
spark = SparkSession.builder \
        .appName("TMDB Movie Data Analysis") \
        .config("spark.sql.adaptive.enabled", "true") \
        .getOrCreate()

In [84]:
# Import additional libraries for API calls and data analysis
import os
from dotenv import load_dotenv
import requests
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import time  # For handling API rate limits

### Environment Configuration

Load environment variables from a `.env` file so the TMDB API key can be read at runtime instead of being hardcoded in the notebook.

In [85]:
# Load environment variables from .env file
load_dotenv()

True

### API Key Configuration

Retrieve the TMDB API key from environment variables. This approach ensures:
- Security: API keys are not exposed in code
- Flexibility: Easy to change keys without code modification
- Best practices: Following secure development standards

In [86]:
# Retrieve TMDB API key from environment variables
API_KEY = os.getenv('TMDB_API_KEY')

# Validate API key is loaded
if not API_KEY:
    raise ValueError("TMDB_API_KEY not found in environment variables. Please check your .env file.")
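
The same check can be wrapped in a small helper so that whitespace-only values are rejected too. This is a minimal sketch, not part of the original notebook: `get_required_env` is a hypothetical name, and the `environ` argument exists only so the example runs without a real key.

```python
import os


def get_required_env(name, environ=None):
    """Return a non-empty environment variable or raise a clear error."""
    # Fall back to the real process environment when no mapping is passed in.
    environ = os.environ if environ is None else environ
    value = (environ.get(name) or "").strip()
    if not value:
        raise ValueError(
            f"{name} not found in environment variables. Please check your .env file."
        )
    return value


# Demo with an explicit mapping (hypothetical key value, not a real credential)
demo_key = get_required_env("TMDB_API_KEY", {"TMDB_API_KEY": " abc123 "})
```

In the notebook itself the call would simply be `API_KEY = get_required_env("TMDB_API_KEY")` after `load_dotenv()`.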

### Movie Data Extraction with Error Handling

This section implements robust movie data fetching with comprehensive error handling:

**Features:**
- **Timeout handling**: Prevents hanging requests
- **HTTP error handling**: Manages API response errors
- **Connection error handling**: Handles network issues
- **Rate limiting**: Respects API usage limits
- **Data validation**: Ensures safe data access
- **Progress tracking**: Shows real-time processing status

**Data collected for each movie:**
- Basic info: Title, release date, runtime, budget, revenue
- Ratings: Vote average and count
- Cast: Top 5 actors and total cast size
- Crew: Director and total crew size
- Genres and production details
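
The cast and crew fields listed above can be derived from a credits payload with one pure function, which makes the safe-access logic easy to test without calling the API. The sketch below is illustrative only: `summarize_credits` is a hypothetical helper, and the sample payload is invented to match the `cast`/`crew` shape the extraction cell relies on.

```python
def summarize_credits(credits, top_n=5):
    """Reduce a credits payload to the four per-movie fields stored in the notebook."""
    cast = credits.get("cast") or []
    crew = credits.get("crew") or []
    # Top-N cast names, falling back to "Unknown" when a name is missing
    top_cast = [member.get("name", "Unknown") for member in cast[:top_n]]
    # Every crew member whose job is exactly "Director" and who has a name
    directors = [m["name"] for m in crew if m.get("job") == "Director" and m.get("name")]
    return {
        "cast": "|".join(top_cast) if top_cast else "No cast data",
        "cast_size": len(cast),
        "director": directors[0] if directors else "Unknown Director",
        "crew_size": len(crew),
    }


# Hypothetical payload, not real TMDB data
sample = {
    "cast": [{"name": "Actor A"}, {"name": "Actor B"}],
    "crew": [
        {"name": "Person C", "job": "Producer"},
        {"name": "Person D", "job": "Director"},
    ],
}
summary = summarize_credits(sample)
empty_summary = summarize_credits({})
```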

In [87]:
# List of popular movie IDs from TMDB (including one invalid ID for testing error handling)
movie_ids = [
    0, 299534, 19995, 140607, 299536, 597, 135397,
    420818, 24428, 168259, 99861, 284054, 12445,
    181808, 330457, 351286, 109445, 321612, 260513
]

# TMDB API configuration
base_url = "https://api.themoviedb.org/3/movie"
params = {"api_key": API_KEY, "language": "en-US"}

# Initialize lists to store results
movies_data = []
failed_requests = []  # Track failed requests for analysis

print("Starting movie data extraction...")

for i, movie_id in enumerate(movie_ids, 1):
    try:
        print(f"Processing movie {i}/{len(movie_ids)} (ID: {movie_id})...", end=" ")
        
        # Fetch movie details with timeout and error handling
        movie_url = f"{base_url}/{movie_id}"
        movie_response = requests.get(movie_url, params=params, timeout=10)
        movie_response.raise_for_status()  # Raises HTTPError for bad responses
        
        data = movie_response.json()
        
        # Fetch credits with nested error handling
        try:
            credits_url = f"{base_url}/{movie_id}/credits"
            credits_response = requests.get(credits_url, params=params, timeout=10)
            credits_response.raise_for_status()
            
            credits = credits_response.json()
            
            # Extract main cast (top 5 names) with safe access
            cast_list = [member.get("name", "Unknown") for member in credits.get("cast", [])[:5]]
            cast_names = "|".join(cast_list) if cast_list else "No cast data"
            cast_size = len(credits.get("cast", []))
            
            # Extract directors from crew with safe access
            crew = credits.get("crew", [])
            directors = [m.get("name") for m in crew if m.get("job") == "Director" and m.get("name")]
            director_name = directors[0] if directors else "Unknown Director"
            crew_size = len(crew)
            
            # Add credit info to movie data
            data["cast"] = cast_names
            data["cast_size"] = cast_size
            data["director"] = director_name
            data["crew_size"] = crew_size
            
        except requests.exceptions.RequestException as e:
            print(f"Credits fetch failed: {str(e)[:50]}...")
            # Set default values if credits fetch fails
            data["cast"] = "Credits unavailable"
            data["cast_size"] = 0
            data["director"] = "Unknown Director"
            data["crew_size"] = 0
        
        except Exception as e:
            print(f"Credits processing error: {str(e)[:50]}...")
            data["cast"] = "Processing error"
            data["cast_size"] = 0
            data["director"] = "Unknown Director"
            data["crew_size"] = 0
        
        movies_data.append(data)
        print(f"Fetched movie: {data.get('title', 'Unknown Title')} (ID: {movie_id})")
        
        # Add small delay to respect API rate limits
        time.sleep(0.1)
        
    except requests.exceptions.HTTPError as e:
        error_msg = f"HTTP Error {e.response.status_code}: {e.response.reason}"
        print(f"Failed to fetch movie with ID: {movie_id}")
        failed_requests.append({"movie_id": movie_id, "error": error_msg})
        
    except requests.exceptions.Timeout:
        error_msg = "Request timeout (>10s)"
        print(f"Timeout error for movie ID: {movie_id}")
        failed_requests.append({"movie_id": movie_id, "error": error_msg})
        
    except requests.exceptions.ConnectionError:
        error_msg = "Connection error - check internet connection"
        print(f"Connection error for movie ID: {movie_id}")
        failed_requests.append({"movie_id": movie_id, "error": error_msg})
        
    except requests.exceptions.RequestException as e:
        error_msg = f"Request error: {str(e)[:50]}..."
        print(f"Request error for movie ID: {movie_id}")
        failed_requests.append({"movie_id": movie_id, "error": error_msg})
        
    except Exception as e:
        error_msg = f"Unexpected error: {str(e)[:50]}..."
        print(f"Unexpected error for movie ID: {movie_id}")
        failed_requests.append({"movie_id": movie_id, "error": error_msg})

# Summary of data extraction
print(f"\n=== Data Extraction Summary ===")
print(f"Total movies requested: {len(movie_ids)}")
print(f"Successfully fetched: {len(movies_data)}")
print(f"Failed requests: {len(failed_requests)}")

if failed_requests:
    print("\nFailed requests details:")
    for failure in failed_requests:
        print(f"  - Movie ID {failure['movie_id']}: {failure['error']}")

Starting movie data extraction...
Processing movie 1/19 (ID: 0)... Failed to fetch movie with ID: 0
Processing movie 2/19 (ID: 299534)... Fetched movie: Avengers: Endgame (ID: 299534)
Processing movie 3/19 (ID: 19995)... Fetched movie: Avatar (ID: 19995)
Processing movie 4/19 (ID: 140607)... Fetched movie: Star Wars: The Force Awakens (ID: 140607)
Processing movie 5/19 (ID: 299536)... Fetched movie: Avengers: Infinity War (ID: 299536)
Processing movie 6/19 (ID: 597)... Fetched movie: Titanic (ID: 597)
Processing movie 7/19 (ID: 135397)... Fetched movie: Jurassic World (ID: 135397)
Processing movie 8/19 (ID: 420818)... Fetched movie: The Lion King (ID: 420818)
Processing movie 9/19 (ID: 24428)... Fetched movie: The Avengers (ID: 24428)
Processing movie 10/19 (ID: 168259)... Fetched movie: Furious 7 (ID: 168259)
Processing movie 11/19 (ID: 99861)... Fetched movie: Avengers: Age of Ultron (ID: 99861)
Processing movie 12/19 (ID: 284054)... Fetched movie: Black Panther (ID: 284054)
Processi
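
The loop above waits a fixed 0.1 s between movies and records a failure on the first error. One possible refinement, shown here only as a sketch, is to retry transient failures with exponentially growing delays. `call_with_backoff` and its parameters are hypothetical. In the notebook, `retry_on` would plausibly be limited to transient errors such as `requests.exceptions.Timeout` and `requests.exceptions.ConnectionError`, so that a 404 for an invalid ID is not retried.

```python
import time


def call_with_backoff(fetch, max_attempts=4, base_delay=0.5,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call fetch() until it succeeds, doubling the wait after each failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            sleep(base_delay * 2 ** (attempt - 1))


# Demo: a stand-in for a request that fails twice, then succeeds
attempts = []
waits = []


def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("temporary failure")
    return "ok"


# Passing waits.append as `sleep` records the delays instead of actually waiting
result = call_with_backoff(flaky, retry_on=(RuntimeError,), sleep=waits.append)
```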

## Convert movies_data to Spark DataFrame

In [88]:
# Create Spark DataFrame from movies_data
df = spark.createDataFrame(movies_data)

# Display basic info
print(f"Created DataFrame with {df.count()} movies")
print(f"DataFrame columns: {len(df.columns)}")

# Show first few rows
df.select('title', 'release_date', 'vote_average', 'budget', 'revenue').show(3, truncate=False)


Created DataFrame with 18 movies
DataFrame columns: 30
+----------------------------+------------+------------+---------+----------+
|title                       |release_date|vote_average|budget   |revenue   |
+----------------------------+------------+------------+---------+----------+
|Avengers: Endgame           |2019-04-24  |8.238       |356000000|2799439100|
|Avatar                      |2009-12-15  |7.594       |237000000|2923706026|
|Star Wars: The Force Awakens|2015-12-15  |7.3         |245000000|2068223624|
+----------------------------+------------+------------+---------+----------+
only showing top 3 rows


## Step 2: Data Cleaning and Preprocessing

### Data Preparation & Cleaning

1. **Drop** irrelevant columns: ['adult', 'imdb_id', 'original_title', 'video', 'homepage']
2. **Evaluate** JSON-like columns (['belongs_to_collection', 'genres', 'production_countries', 'production_companies', 'spoken_languages'])
3. **Extract** and clean key data points:
   - Collection name (belongs_to_collection)
   - Genre names (genres → separate multiple genres with "|")
   - Spoken languages (spoken_languages → separate with "|")
   - Production countries (production_countries → separate with "|")
   - Production companies (production_companies → separate with "|")
4. **Inspect** extracted columns using value_counts() to identify anomalies

### Handling Missing & Incorrect Data

5. **Convert** column datatypes:
   - 'budget', 'id', 'popularity' → Numeric (set invalid values to NaN)
   - 'release_date' → Convert to datetime
6. **Replace unrealistic values**:
   - Budget/Revenue/Runtime = 0 → Replace with NaN
   - Convert 'budget' and 'revenue' to million USD
   - Movies with vote_count = 0 → Analyze their vote_average and adjust accordingly
   - 'overview' and 'tagline' → Replace known placeholders (e.g., 'No Data') with NaN
7. **Remove duplicates** and drop rows with unknown 'id' or 'title'
8. **Keep** only rows where at least **10 columns have non-NaN values**
9. **Filter** to include only 'Released' movies, then drop 'status'

### Reorder & Finalize DataFrame

10. **Reorder columns**: ['id', 'title', 'tagline', 'release_date', 'genres', 'belongs_to_collection', 'original_language', 'budget_musd', 'revenue_musd', 'production_companies', 'production_countries', 'vote_count', 'vote_average', 'popularity', 'runtime', 'overview', 'spoken_languages', 'poster_path', 'cast', 'cast_size', 'director', 'crew_size']
11. **Reset index**


### 2.1 Drop irrelevant columns


In [89]:
cols_to_drop = ['adult', 'imdb_id', 'original_title', 'video', 'homepage']
df1 = df.drop(*cols_to_drop)

print(f"Columns after dropping: {len(df1.columns)}")
df1.columns

Columns after dropping: 25


['backdrop_path',
 'belongs_to_collection',
 'budget',
 'cast',
 'cast_size',
 'crew_size',
 'director',
 'genres',
 'id',
 'origin_country',
 'original_language',
 'overview',
 'popularity',
 'poster_path',
 'production_companies',
 'production_countries',
 'release_date',
 'revenue',
 'runtime',
 'spoken_languages',
 'status',
 'tagline',
 'title',
 'vote_average',
 'vote_count']

### 2.2 Evaluate JSON-like columns

In [90]:
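# First pass: serialize each JSON-like column and strip brackets and the name key with regex.
# The ids remain embedded in the result (see output below); section 2.3 extracts the names cleanly.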
df2 = df1.withColumn("collection_name", 
    when(col("belongs_to_collection").isNotNull(), 
         col("belongs_to_collection")["name"])
    .otherwise(None))

df2 = df2.withColumn("genres", 
    regexp_replace(
        regexp_replace(to_json(col("genres")), "\\[|\\]|\\{|\\}", ""),
        '"name":', ""
    ))

df2 = df2.withColumn("production_countries",
    regexp_replace(
        regexp_replace(to_json(col("production_countries")), "\\[|\\]|\\{|\\}", ""),
        '"name":', ""
    ))

df2 = df2.withColumn("production_companies",
    regexp_replace(
        regexp_replace(to_json(col("production_companies")), "\\[|\\]|\\{|\\}", ""),
        '"name":', ""
    ))

df2 = df2.withColumn("spoken_languages",
    regexp_replace(
        regexp_replace(to_json(col("spoken_languages")), "\\[|\\]|\\{|\\}", ""),
        '"english_name":', ""
    ))

df2 = df2.drop("belongs_to_collection")

# Inspect extracted columns with grouped counts (the Spark analogue of pandas value_counts)
for col_name in ["collection_name", "genres", "spoken_languages", "production_countries"]:
    print(f"\nColumn: {col_name}")
    df2.groupBy(col_name).count().orderBy(desc("count")).show()



Column: collection_name
+--------------------+-----+
|     collection_name|count|
+--------------------+-----+
|The Avengers Coll...|    4|
|Star Wars Collection|    2|
|                NULL|    2|
|Jurassic Park Col...|    2|
|   Frozen Collection|    2|
|   Avatar Collection|    1|
|The Lion King (Re...|    1|
|The Fast and the ...|    1|
|Harry Potter Coll...|    1|
|Black Panther Col...|    1|
|The Incredibles C...|    1|
+--------------------+-----+


Column: genres
+--------------------+-----+
|              genres|count|
+--------------------+-----+
|"Adventure","id":...|    3|
|"Action","id":"28...|    2|
|"Action","id":"28...|    2|
|"Action","id":"28...|    1|
|"Adventure","id":...|    1|
|"Drama","id":"18"...|    1|
|"Science Fiction"...|    1|
|"Adventure","id":...|    1|
|"Action","id":"28...|    1|
|"Adventure","id":...|    1|
|"Family","id":"10...|    1|
|"Family","id":"10...|    1|
|"Action","id":"28...|    1|
|"Animation","id":...|    1|
+--------------------+-----+



### 2.3 Extract and clean key data points

In [92]:
# Collection name
df2 = df1.withColumn("collection_name", 
    when(col("belongs_to_collection").isNotNull(), 
         col("belongs_to_collection")["name"])
    .otherwise(None))

# Genre names (separate with "|")
df2 = df2.withColumn("genres", 
    expr("transform(genres, x -> x.name)")).withColumn("genres", 
    concat_ws("|", col("genres")))

# Spoken languages (separate with "|")
df2 = df2.withColumn("spoken_languages",
    expr("transform(spoken_languages, x -> x.english_name)")).withColumn("spoken_languages",
    concat_ws("|", col("spoken_languages")))

# Production countries (separate with "|")
df2 = df2.withColumn("production_countries",
    expr("transform(production_countries, x -> x.name)")).withColumn("production_countries",
    concat_ws("|", col("production_countries")))

# Production companies (separate with "|")
df2 = df2.withColumn("production_companies",
    expr("transform(production_companies, x -> x.name)")).withColumn("production_companies",
    concat_ws("|", col("production_companies")))

df2 = df2.drop("belongs_to_collection")
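
To make the effect of `transform(...)` followed by `concat_ws("|", ...)` concrete, here is a plain-Python equivalent for a single record. It is a sketch for intuition only: `join_names` is a hypothetical helper, and returning an empty string for a missing list is a choice made for this example.

```python
def join_names(items, key="name", sep="|"):
    """Join one field from a list of dicts, skipping entries where it is missing."""
    if items is None:
        return ""
    return sep.join(str(d[key]) for d in items if d and d.get(key) is not None)


# Sample values shaped like the genres and spoken_languages columns
genres = [{"id": 28, "name": "Action"}, {"id": 18, "name": "Drama"}]
languages = [{"english_name": "English"}, {"english_name": None}]

genres_joined = join_names(genres)
languages_joined = join_names(languages, key="english_name")
missing_joined = join_names(None)
```

Unlike the regex pass in section 2.2, this keeps only the requested field, so ids and other keys never leak into the result.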


### 2.4 Inspect Extracted columns

In [93]:
# Inspect extracted columns with grouped counts to identify anomalies
for col_name in ["collection_name", "genres", "spoken_languages", "production_countries", "production_companies"]:
    print(f"\nColumn: {col_name}")
    df2.groupBy(col_name).count().orderBy(desc("count")).show(10)



Column: collection_name
+--------------------+-----+
|     collection_name|count|
+--------------------+-----+
|The Avengers Coll...|    4|
|Star Wars Collection|    2|
|                NULL|    2|
|Jurassic Park Col...|    2|
|   Frozen Collection|    2|
|   Avatar Collection|    1|
|The Lion King (Re...|    1|
|The Fast and the ...|    1|
|Harry Potter Coll...|    1|
|Black Panther Col...|    1|
+--------------------+-----+
only showing top 10 rows

Column: genres
+--------------------+-----+
|              genres|count|
+--------------------+-----+
|Adventure|Action|...|    3|
|Action|Adventure|...|    2|
|Action|Adventure|...|    2|
|Adventure|Science...|    1|
|Action|Adventure|...|    1|
|       Drama|Romance|    1|
|Adventure|Drama|F...|    1|
|Science Fiction|A...|    1|
|Action|Crime|Thri...|    1|
|   Adventure|Fantasy|    1|
+--------------------+-----+
only showing top 10 rows

Column: spoken_languages
+--------------------+-----+
|    spoken_languages|count|
+------------

### 2.5 Convert Column Datatypes

In [94]:
# Convert to numeric (invalid values become null)
df3 = df2.withColumn("budget", col("budget").cast(DoubleType()))
df3 = df3.withColumn("id", col("id").cast(IntegerType()))  
df3 = df3.withColumn("popularity", col("popularity").cast(DoubleType()))

# Convert to datetime
df3 = df3.withColumn("release_date", to_date(col("release_date")))
df3.dtypes


[('backdrop_path', 'string'),
 ('budget', 'double'),
 ('cast', 'string'),
 ('cast_size', 'bigint'),
 ('crew_size', 'bigint'),
 ('director', 'string'),
 ('genres', 'string'),
 ('id', 'int'),
 ('origin_country', 'array<string>'),
 ('original_language', 'string'),
 ('overview', 'string'),
 ('popularity', 'double'),
 ('poster_path', 'string'),
 ('production_companies', 'string'),
 ('production_countries', 'string'),
 ('release_date', 'date'),
 ('revenue', 'bigint'),
 ('runtime', 'bigint'),
 ('spoken_languages', 'string'),
 ('status', 'string'),
 ('tagline', 'string'),
 ('title', 'string'),
 ('vote_average', 'double'),
 ('vote_count', 'bigint'),
 ('collection_name', 'string')]
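
The null-on-failure behaviour described in the cell comment ("invalid values become null") can be illustrated on single values in plain Python. This is a sketch of the intended semantics under that assumption, not of Spark's casting rules. `to_float_or_none` and `to_date_or_none` are hypothetical helpers.

```python
from datetime import date


def to_float_or_none(value):
    """Convert to float, returning None when the value cannot be parsed."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def to_date_or_none(value):
    """Parse an ISO 'YYYY-MM-DD' string, returning None when it is invalid."""
    try:
        return date.fromisoformat(value)
    except (TypeError, ValueError):
        return None


budget_ok = to_float_or_none("356000000")
budget_bad = to_float_or_none("n/a")
released_ok = to_date_or_none("2019-04-24")
released_bad = to_date_or_none("unknown")
```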

### 2.6 Replace unrealistic values

In [95]:
# Replace 0 values with null for budget, revenue, runtime
df4 = df3.withColumn("budget", when(col("budget") == 0, lit(None)).otherwise(col("budget"))) \
         .withColumn("revenue", when(col("revenue") == 0, lit(None)).otherwise(col("revenue"))) \
         .withColumn("runtime", when(col("runtime") == 0, lit(None)).otherwise(col("runtime")))

# Convert budget and revenue to million USD
df4 = df4.withColumn("budget_musd", col("budget") / 1000000) \
         .withColumn("revenue_musd", col("revenue") / 1000000)

# Handle vote_count = 0 by setting vote_average to null
df4 = df4.withColumn("vote_average", when(col("vote_count") == 0, lit(None)).otherwise(col("vote_average")))

# Replace placeholder text with null in overview and tagline
df4 = df4.withColumn("overview", when(col("overview").isin("No Data", ""), lit(None)).otherwise(col("overview"))) \
         .withColumn("tagline", when(col("tagline").isin("No Data", ""), lit(None)).otherwise(col("tagline")))
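
The rules applied in this cell can be checked on a single record in plain Python. The sketch below mirrors them one by one. `clean_financials` is a hypothetical helper and the sample values are made up for illustration.

```python
def clean_financials(record):
    """Return a copy with zero budget/revenue/runtime nulled and *_musd fields added."""
    out = dict(record)
    # A value of 0 means "unknown", not a real amount or duration
    for field in ("budget", "revenue", "runtime"):
        if out.get(field) == 0:
            out[field] = None
    # Express budget and revenue in million USD, propagating None
    for field in ("budget", "revenue"):
        value = out.get(field)
        out[f"{field}_musd"] = None if value is None else value / 1_000_000
    # An average over zero votes carries no information
    if out.get("vote_count") == 0:
        out["vote_average"] = None
    return out


row = clean_financials({
    "budget": 237000000,
    "revenue": 0,
    "runtime": 0,
    "vote_count": 0,
    "vote_average": 0.0,
})
```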


### Verify 2.6

In [96]:
# Confirm no zero values remain after replacement (each min should be greater than 0)
df4.select("budget", "revenue", "runtime", "vote_count").describe().show()

# Check for placeholder text
df4.select("overview", "tagline").filter(col("overview").isin("No Data", "") | col("tagline").isin("No Data", "")).show()

# Verify new million USD columns
df4.select("budget", "budget_musd", "revenue", "revenue_musd").show()


+-------+--------------------+--------------------+------------------+------------------+
|summary|              budget|             revenue|           runtime|        vote_count|
+-------+--------------------+--------------------+------------------+------------------+
|  count|                  18|                  18|                18|                18|
|   mean| 2.137777777777778E8|1.6918318277222223E9|138.05555555555554|20424.833333333332|
| stddev|6.1959717042185664E7| 5.210622853292365E8| 23.84871983793882|7742.1590486952755|
|    min|              1.25E8|          1243225667|               102|             10072|
|    max|              3.56E8|          2923706026|               194|             34230|
+-------+--------------------+--------------------+------------------+------------------+

+--------+-------+
|overview|tagline|
+--------+-------+
+--------+-------+

+------+-----------+----------+------------+
|budget|budget_musd|   revenue|revenue_musd|
+------+-----------+--

### Count null values

In [97]:
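# Count null values in each column
from pyspark.sql.functions import sum as spark_sum  # explicit alias, distinct from Python's built-in sum
df4.select([spark_sum(when(isnull(c), 1).otherwise(0)).alias(c) for c in df4.columns]).show()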


+-------------+------+----+---------+---------+--------+------+---+--------------+-----------------+--------+----------+-----------+--------------------+--------------------+------------+-------+-------+----------------+------+-------+-----+------------+----------+---------------+-----------+------------+
|backdrop_path|budget|cast|cast_size|crew_size|director|genres| id|origin_country|original_language|overview|popularity|poster_path|production_companies|production_countries|release_date|revenue|runtime|spoken_languages|status|tagline|title|vote_average|vote_count|collection_name|budget_musd|revenue_musd|
+-------------+------+----+---------+---------+--------+------+---+--------------+-----------------+--------+----------+-----------+--------------------+--------------------+------------+-------+-------+----------------+------+-------+-----+------------+----------+---------------+-----------+------------+
|            0|     0|   0|        0|        0|       0|     0|  0|            

### 2.7 Remove duplicates

In [98]:
# Filter rows with non-null id and title, then remove duplicates by id
df5 = df4.filter(col("id").isNotNull() & col("title").isNotNull()).dropDuplicates(["id"])
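
For intuition, the same filter-then-deduplicate step looks like this on a plain list of records. This is a sketch with a hypothetical helper name. It always keeps the first occurrence of each `id`, whereas Spark's `dropDuplicates` does not promise which duplicate row survives.

```python
def drop_incomplete_and_duplicates(records):
    """Drop records without an id or title, then keep the first record per id."""
    seen = set()
    kept = []
    for record in records:
        if record.get("id") is None or record.get("title") is None:
            continue  # unknown id or title: not usable downstream
        if record["id"] in seen:
            continue  # duplicate id: keep the earlier record only
        seen.add(record["id"])
        kept.append(record)
    return kept


rows = [
    {"id": 597, "title": "Titanic"},
    {"id": 597, "title": "Titanic"},    # duplicate
    {"id": None, "title": "Untitled"},  # missing id
    {"id": 19995, "title": None},       # missing title
    {"id": 19995, "title": "Avatar"},
]
deduped = drop_incomplete_and_duplicates(rows)
```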


### Verify 2.7

In [99]:
print(f"Before filtering: {df4.count()}")
print(f"After filtering: {df5.count()}")


Before filtering: 18
After filtering: 18


### 2.8 Keep only rows where at least 10 columns have non-NaN values

In [100]:
# Keep rows with at least 10 non-null values
from functools import reduce
non_null_count = reduce(lambda x, y: x + y, [when(col(c).isNotNull(), 1).otherwise(0) for c in df5.columns])
df6 = df5.withColumn("non_null_count", non_null_count).filter(col("non_null_count") >= 10).drop("non_null_count")


This step drops rows that have fewer than 10 non-null columns. No rows were removed (18 → 18), so every movie in this sample already has at least 10 populated columns, which suggests the API returned complete records.

This step is a data quality check - in larger datasets, you might have incomplete records that should be excluded from analysis.
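
The per-row count behind this filter is easy to see on a single record. The following plain-Python sketch uses a hypothetical helper and invented records. It counts non-`None` values and compares the total with the threshold.

```python
def has_min_non_null(record, threshold=10):
    """True when the record has at least `threshold` fields that are not None."""
    return sum(value is not None for value in record.values()) >= threshold


# Twelve fields each. The sparse record has only nine populated values.
complete = {f"col{i}": i for i in range(12)}
sparse = {f"col{i}": (i if i < 9 else None) for i in range(12)}

keep_complete = has_min_non_null(complete)
keep_sparse = has_min_non_null(sparse)
```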

In [101]:
print(f"Before filtering: {df5.count()}")
print(f"After filtering: {df6.count()}")


Before filtering: 18
After filtering: 18


### 2.9 Filter to include only 'Released' movies, then drop 'status'


In [102]:
# Filter for Released movies and drop status column
df7 = df6.filter(col("status") == "Released").drop("status")


In [103]:
print(f"Before filtering: {df6.count()}")
print(f"After filtering: {df7.count()}")
print("Status column dropped:", "status" not in df7.columns)


Before filtering: 18
After filtering: 18
Status column dropped: True


In [104]:
df7.show(1)

+--------------------+------+--------------------+---------+---------+-------------+-------------+---+--------------+-----------------+--------------------+----------+--------------------+--------------------+--------------------+------------+----------+-------+--------------------+--------------------+-------+------------+----------+---------------+-----------+------------+
|       backdrop_path|budget|                cast|cast_size|crew_size|     director|       genres| id|origin_country|original_language|            overview|popularity|         poster_path|production_companies|production_countries|release_date|   revenue|runtime|    spoken_languages|             tagline|  title|vote_average|vote_count|collection_name|budget_musd|revenue_musd|
+--------------------+------+--------------------+---------+---------+-------------+-------------+---+--------------+-----------------+--------------------+----------+--------------------+--------------------+--------------------+------------+-

                                                                                

In [105]:
# Reorder columns as specified
final_df = df7.select(
    'id', 'title', 'tagline', 'release_date', 'genres', 'collection_name',
    'original_language', 'budget_musd', 'revenue_musd', 'production_companies',
    'production_countries', 'vote_count', 'vote_average', 'popularity', 'runtime',
    'overview', 'spoken_languages', 'poster_path', 'cast', 'cast_size', 'director', 'crew_size'
)


In [106]:
final_df.show(1)

+---+-------+--------------------+------------+-------------+---------------+-----------------+-----------+------------+--------------------+--------------------+----------+------------+----------+-------+--------------------+--------------------+--------------------+--------------------+---------+-------------+---------+
| id|  title|             tagline|release_date|       genres|collection_name|original_language|budget_musd|revenue_musd|production_companies|production_countries|vote_count|vote_average|popularity|runtime|            overview|    spoken_languages|         poster_path|                cast|cast_size|     director|crew_size|
+---+-------+--------------------+------------+-------------+---------------+-----------------+-----------+------------+--------------------+--------------------+----------+------------+----------+-------+--------------------+--------------------+--------------------+--------------------+---------+-------------+---------+
|597|Titanic|Nothing on eart