# TMDB Movie Data Analysis

This notebook fetches, cleans, and analyzes movie data from the TMDB API. Helper functions defined in the `src` directory keep the notebook short and the logic modular.

## Objectives
1. **Data Extraction**: Fetch top-rated and popular movies from TMDB.
2. **Data Cleaning**: Process the raw JSON data into a structured DataFrame.
3. **Exploratory Data Analysis (EDA)**: Analyze trends and distributions.
4. **Ranking**: Identify top-performing movies.

In [None]:
import sys
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Add ../src to the import path (resolved relative to the current working directory)
sys.path.append(os.path.abspath(os.path.join('..', 'src')))

from fetch_data import fetch_movies, save_raw_data
from process_data import process_data, save_processed_data
from analysis import analyze_movies, plot_data

## 1. Data Extraction

We fetch data from two endpoints: `movie/top_rated` and `movie/popular`. We request 5 pages from each; TMDB returns 20 results per page, so this yields up to 200 movies before duplicates are removed.
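
The source of `fetch_movies` lives in `src/fetch_data.py` and is not reproduced in this notebook. As a rough illustration of paginated fetching, a minimal sketch might look like the following. The names `get_page`, `fetch_movies_sketch`, and `page_loader` are hypothetical, and the sketch assumes a TMDB v3 API key passed as the `api_key` query parameter.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://api.themoviedb.org/3"


def get_page(endpoint, page, api_key):
    """Request one page of a TMDB list endpoint and return the parsed JSON."""
    query = urllib.parse.urlencode({"api_key": api_key, "page": page})
    url = f"{BASE_URL}/{endpoint}?{query}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def fetch_movies_sketch(endpoint, pages=5, api_key=None, page_loader=get_page):
    """Collect the `results` lists of pages 1..pages into one flat list.

    `page_loader` is injectable (an assumption of this sketch) so the
    pagination logic can be exercised without network access.
    """
    movies = []
    for page in range(1, pages + 1):
        payload = page_loader(endpoint, page, api_key)
        movies.extend(payload.get("results", []))
    return movies
```

Keeping the HTTP call in a separate function is a design choice of this sketch, not necessarily of the real helper; it makes the loop easy to test with a stub loader.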

In [None]:
print("Fetching Top Rated Movies...")
top_rated = fetch_movies("movie/top_rated", pages=5)

print("Fetching Popular Movies...")
popular = fetch_movies("movie/popular", pages=5)

# Combine datasets, keeping one record per movie ID (a movie can appear in both lists)
all_movies_dict = {m['id']: m for m in top_rated + popular}
all_movies = list(all_movies_dict.values())

print(f"Total unique movies fetched: {len(all_movies)}")

# Save raw data
save_raw_data(all_movies, filename="../data/raw/movies.json")
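
The dictionary comprehension used for deduplication has two properties worth knowing: each ID keeps the position of its first occurrence, but the stored record is the last one seen. The toy example below uses made-up records (the `source` field is hypothetical and only there to show which copy survives).

```python
# Made-up records; "source" is a hypothetical field used only for illustration.
top_rated = [
    {"id": 1, "title": "A", "source": "top_rated"},
    {"id": 2, "title": "B", "source": "top_rated"},
]
popular = [
    {"id": 2, "title": "B", "source": "popular"},
    {"id": 3, "title": "C", "source": "popular"},
]

# Later entries overwrite earlier ones that share the same key.
merged = {m["id"]: m for m in top_rated + popular}
unique = list(merged.values())

print([m["id"] for m in unique])  # IDs in first-seen order
print(unique[1]["source"])        # the copy from `popular` replaced the earlier one
```

In the notebook this means that, for a movie present in both lists, the record from `movie/popular` is the one that is kept.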

## 2. Data Cleaning and Transformation

We convert the raw movie records into a pandas DataFrame and then apply the following cleaning steps:
- Select relevant columns (`id`, `title`, `release_date`, `vote_average`, `vote_count`, `popularity`, `original_language`, `overview`, `budget`, `revenue`).
- Drop rows with missing `title` or `release_date`.
- Convert data types (datetime, numeric).
- Extract `release_year` from `release_date`.
- Calculate `roi` (revenue / budget).
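
The actual `process_data` helper is defined in `src/process_data.py` and is not shown here. The steps listed above could be sketched in pandas roughly as follows. The function name `process_data_sketch` is hypothetical, and treating a non-positive budget as a missing ROI is an assumption of this sketch, not a documented behavior of the real helper.

```python
import pandas as pd

COLUMNS = [
    "id", "title", "release_date", "vote_average", "vote_count",
    "popularity", "original_language", "overview", "budget", "revenue",
]
NUMERIC_COLUMNS = ["vote_average", "vote_count", "popularity", "budget", "revenue"]


def process_data_sketch(df_raw):
    # Keep only the relevant columns; any column absent from the input is filled with NaN.
    df = df_raw.reindex(columns=COLUMNS).copy()

    # Unparseable or empty dates become NaT so they can be dropped below.
    df["release_date"] = pd.to_datetime(df["release_date"], errors="coerce")
    df = df.dropna(subset=["title", "release_date"]).copy()

    # Coerce numeric columns; values that cannot be parsed become NaN.
    for col in NUMERIC_COLUMNS:
        df[col] = pd.to_numeric(df[col], errors="coerce")

    df["release_year"] = df["release_date"].dt.year

    # Assumption: a budget of zero or less is treated as unknown, giving NaN ROI.
    df["roi"] = df["revenue"] / df["budget"].where(df["budget"] > 0)
    return df.reset_index(drop=True)
```

Masking non-positive budgets avoids infinite ROI values from division by zero, which would otherwise dominate any ranking by ROI.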

In [None]:
# Build a DataFrame from the in-memory movie records
df_raw = pd.DataFrame(all_movies)

# Process data
df_clean = process_data(df_raw)

# Display first few rows
display(df_clean.head())

# Save processed data
save_processed_data(df_clean, filename="../data/processed/movies_cleaned.csv")

## 3. Analysis and Visualization

We perform basic statistical analysis and generate plots to understand the data distribution.

In [None]:
# Analyze movies (prints stats and returns top lists)
top_movies, popular_movies, revenue_movies, roi_movies = analyze_movies(df_clean)

# Plot data
plot_data(df_clean)

### Top 10 Movies by Vote Average (Weighted)
Only movies with a sufficiently large number of votes are considered, so that a high average based on very few votes does not skew the ranking.
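
How `analyze_movies` applies this filter is not shown in the notebook. One simple way to do it, sketched under assumptions, is to drop rows below a vote-count threshold and then take the highest averages. The function name `top_by_vote_average`, the `min_votes` default of 1000, and the sample rows are all made up for illustration.

```python
import pandas as pd


def top_by_vote_average(df, min_votes=1000, n=10):
    """Return the n highest-rated rows among those with at least min_votes votes."""
    eligible = df[df["vote_count"] >= min_votes]
    return eligible.nlargest(n, "vote_average")


# Made-up sample rows, not real TMDB data.
sample = pd.DataFrame({
    "title": ["Obscure Gem", "Crowd Favorite", "Solid Classic"],
    "vote_average": [9.8, 8.7, 8.4],
    "vote_count": [12, 25000, 18000],
})

ranked = top_by_vote_average(sample, min_votes=1000, n=2)
print(ranked[["title", "vote_average", "vote_count"]])
```

Here the row with the highest average is excluded because only 12 votes support it, which is exactly the distortion the threshold is meant to prevent.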

In [None]:
display(top_movies[['title', 'vote_average', 'vote_count', 'release_year']])

### Top 10 Popular Movies
Movies with the highest popularity score.

In [None]:
display(popular_movies[['title', 'popularity', 'release_year']])

### Top 10 Movies by Revenue
Movies with the highest revenue.

In [None]:
display(revenue_movies[['title', 'revenue', 'release_year']])

### Top 10 Movies by ROI
Movies with the highest Return on Investment, computed here as revenue / budget, so a value of 1.0 means the movie earned back exactly its budget.

In [None]:
display(roi_movies[['title', 'roi', 'budget', 'revenue', 'release_year']])