# Academy Awards Analysis 🎬
## Investigating Trends in Oscar-Winning Movies
### Author: Judd Jacobs

This project analyzes historical **Academy Award-winning films** using data from **Wikipedia**, **The Movie Database (TMDb)**, and **The Open Movie Database (OMDb)**.

## **Key Analysis Areas**
- **Best Picture trends by genre** (from Wikipedia Scrape & TMDb API) 🏆
- **Box office revenue & IMDb ratings** (OMDb API) 🎭
- **Long-term trends in Oscar-winning films** 📈

## **Step 1:** Import necessary Python Libraries 💽

In [1]:
import pandas as pd
import numpy as np
import requests
import sqlite3
import matplotlib.pyplot as plt
import seaborn as sns
from bs4 import BeautifulSoup
from dotenv import load_dotenv
import os
from urllib.parse import quote
import re
import json

# # WordCloud and NLTK libraries are currently a stretch goal for future development
# from wordcloud import WordCloud
# import nltk
# from nltk.corpus import stopwords
# from nltk.tokenize import word_tokenize

# # Ensure necessary NLTK components are downloaded
# nltk.download("stopwords")
# nltk.download("punkt")

## **Step 2:** Data Acquisition 🗂

In [2]:
# 2.1: Fetch Wikipedia page and parse it with BeautifulSoup
wiki_url = "https://en.wikipedia.org/wiki/List_of_Academy_Award%E2%80%93winning_films"
response = requests.get(wiki_url)
soup = BeautifulSoup(response.text, "html.parser")

# 2.2: Locate the table and rows
table = soup.find("table", {"class": "wikitable"})
rows = table.find_all("tr")

# 2.3: Extract column headers
headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]

# 2.4: Extract table data and append 'Status' based on style
data = []
for row in rows[1:]:
    cells = row.find_all(["td", "th"])
    style = row.get("style", "")
    status = "Winner" if "background:#EEDD82" in style.replace(" ", "") else "Nominee"
    row_data = [cell.get_text(strip=True) for cell in cells]
    if row_data:
        row_data.append(status)
        data.append(row_data)

# 2.5: Append 'Status' to column headers
headers.append("Status")

# 2.6: Create initial DataFrame
best_picture_wikipedia = pd.DataFrame(data, columns=headers)

# 2.7: Normalize column names
best_picture_wikipedia.columns = (
    best_picture_wikipedia.columns
    .str.lower()
    .str.strip()
    .str.replace(" ", "_")
    .str.replace(r"[^\w_]", "", regex=True)
)

# 2.8: Initial conversion of 'year' to numeric
best_picture_wikipedia['year'] = pd.to_numeric(best_picture_wikipedia['year'], errors='coerce')

# 2.9: Further clean numeric columns
for col in ['awards', 'nominations']:
    if col in best_picture_wikipedia.columns:
        best_picture_wikipedia[col] = (
            best_picture_wikipedia[col]
            .str.replace(",", "")
            .str.extract(r"(\d+)", expand=False)  # expand=False returns a Series, not a one-column DataFrame
            .astype(float)
            .astype("Int64")
        )

# 2.10: Convert year to integer (nullable)
best_picture_wikipedia['year'] = best_picture_wikipedia['year'].astype("Int64")

# 2.11: Validate data integrity
expected_columns = ['film', 'year', 'awards', 'nominations', 'status']
assert all(col in best_picture_wikipedia.columns for col in expected_columns), "Missing expected columns"
assert best_picture_wikipedia['film'].notna().all(), "Null values found in 'film'"
assert best_picture_wikipedia['year'].notna().any(), "No valid 'year' entries found"
assert set(best_picture_wikipedia['status'].unique()) <= {'Winner', 'Nominee'}, "Unexpected status values"
assert pd.api.types.is_integer_dtype(best_picture_wikipedia['year']), "'year' column is not integer"

# 2.12: Preview final cleaned DataFrame
print(best_picture_wikipedia.head())
print(best_picture_wikipedia.dtypes)


             film  year  awards  nominations   status
0           Anora  2024       5            6   Winner
1   The Brutalist  2024       3           10  Nominee
2    Emilia Pérez  2024       2           13  Nominee
3          Wicked  2024       2           10  Nominee
4  Dune: Part Two  2024       2            5  Nominee
film           object
year            Int64
awards          Int64
nominations     Int64
status         object
dtype: object
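
To make the column-name normalization in step 2.7 concrete, here is a minimal pure-Python sketch that applies the same four string operations, in the same order, to single header strings. The helper name `normalize_column` and the sample headers are illustrative assumptions, not part of the notebook.

```python
import re


def normalize_column(name: str) -> str:
    """Mirror step 2.7: lower-case, strip, spaces to underscores, drop other symbols."""
    name = name.lower().strip().replace(" ", "_")
    # Remove everything that is not a word character or an underscore
    return re.sub(r"[^\w_]", "", name)


# Hypothetical raw headers, chosen to exercise each rule
for raw in ["Film", " Nominations ", "Box Office ($)"]:
    print(repr(raw), "->", repr(normalize_column(raw)))
```

Note that symbols are dropped rather than replaced, so a header such as `"Box Office ($)"` keeps a trailing underscore.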


### **Scrape** Wikipedia
Extract **Best Picture winners**, nominees, and relevant metadata available on Wikipedia here: https://en.wikipedia.org/wiki/List_of_Academy_Award–winning_films
- **`pandas.read_html()`** to extract the table structure.
- **`BeautifulSoup`** to identify "winning" rows based on Wikipedia background color.

In [None]:
# Wikipedia URL for Best Picture winners
wiki_url = "https://en.wikipedia.org/wiki/List_of_Academy_Award%E2%80%93winning_films"

# Use pandas to extract the table
tables = pd.read_html(wiki_url)

# Select the correct table, adjusting the index as needed
best_picture_wikipedia = tables[0]

# Print the first few rows to ensure the correct table was selected
print(best_picture_wikipedia.head())
print(best_picture_wikipedia.dtypes)

In [None]:
# Now find the Wikipedia table with BeautifulSoup
response_wikipedia = requests.get(wiki_url)
soup_wikipedia = BeautifulSoup(response_wikipedia.text, "html.parser")
wikipedia_table = soup_wikipedia.find_all("table", {"class": "wikitable"})[0]

# Extract all rows
rows = wikipedia_table.find_all("tr")

# List to store "Winner" status
winning_status = []

# Loop through rows and check for background color "#EEDD82" skipping the header row
for row in rows[1:]:
    style = row.get("style", "")
    
    # Check if the row has the background color for winners and remove spaces for consistency
    if "background:#EEDD82" in style.replace(" ", ""):
        winning_status.append("Winner")
    else:
        winning_status.append("Nominee")

# Ensure the list length matches the DataFrame
if len(winning_status) == len(best_picture_wikipedia):
    best_picture_wikipedia["Status"] = winning_status
else:
    print("List length does not match DataFrame length")

# Display updated DataFrame
print(best_picture_wikipedia.head())
print(best_picture_wikipedia.dtypes)
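
The winner check above only matches the exact text `background:#EEDD82` once spaces are removed, so a lower-case hex code or a `background-color` declaration would be read as a nominee. A more tolerant check could look like the sketch below. The helper name `is_winner_style` is hypothetical, and the color value is taken from the cell above; whether Wikipedia ever emits these variants is an assumption.

```python
import re

WINNER_COLOR = "#eedd82"  # assumption: the highlight color used in the check above


def is_winner_style(style) -> bool:
    """Return True if an inline style string sets the winner background color."""
    # Lower-case and drop all whitespace so "background: #EEDD82" and
    # "BACKGROUND:#eedd82" compare equal
    normalized = re.sub(r"\s+", "", style or "").lower()
    accepted = (f"background:{WINNER_COLOR}", f"background-color:{WINNER_COLOR}")
    # Compare each declaration in the style attribute separately
    return any(declaration in accepted for declaration in normalized.split(";"))


print(is_winner_style("background: #EEDD82;"))
print(is_winner_style("background-color:#eedd82; font-weight:bold"))
print(is_winner_style("color:#eedd82"))
```

It can replace the inline `if` test in the loop by calling `is_winner_style(row.get("style", ""))`.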

### Fetch Genres from TMDb API 🎭
Use **The Movie Database (TMDb) API** to retrieve **movie genres** for Best Picture winners listed in Wikipedia Dataset.

⚠️ Note: the cell below may take two minutes or longer to run. ⚠️

In [None]:
# Load environment variables from .env file
load_dotenv()

# Access the TMDB API keys stored in the .env file and define them here
tmdb_api_key = os.getenv('TMDB_API_KEY')
# The variable name must match the entry in your .env file exactly
tmdb_api_read_access_token = os.getenv('TMDB_API_READ_ACCESS_TOKEN')

tmdb_api_base_url = "https://api.themoviedb.org/3"

# Function to get genre mappings (ID -> Name)
def get_genre_mapping() -> dict:
    url = f"{tmdb_api_base_url}/genre/movie/list?language=en-US"
    headers = {"accept": "application/json", "Authorization": f"Bearer {tmdb_api_read_access_token}"}
    
    response = requests.get(url, headers=headers)
    data = response.json()
    
    if "genres" in data:
        return {genre["id"]: genre["name"] for genre in data["genres"]}
    return {}

# Function to query TMDB API and get genre names for movies
def get_movie_genres(film_titles) -> dict:
    headers = {
        "accept": "application/json",
        "Authorization": f"Bearer {tmdb_api_read_access_token}"
        }
    
    # Fetch genre ID-to-name mapping
    genre_mapping = get_genre_mapping()

    # Store results
    movie_genres = {}

    for title in film_titles:
        # Encode spaces and special characters for use in URL
        encoded_title = quote(title)
        
        url = f"{tmdb_api_base_url}/search/movie?query={encoded_title}&include_adult=false&language=en-US&page=1"
        response = requests.get(url, headers=headers)
        data = response.json()
        
        if "results" in data and data["results"]:
            # Ensure exact match
            exact_match = next((movie for movie in data["results"] if movie["title"] == title), None)
            
            if exact_match:
                genre_ids = exact_match["genre_ids"]
                genre_names = [genre_mapping.get(gid, "Unknown Genre") for gid in genre_ids]
                movie_genres[title] = genre_names
            else:
                movie_genres[title] = ["No exact match found"]
        else:
            movie_genres[title] = ["No results found"]
    
    return movie_genres

# Create a List of the movie titles from extracted Wikipedia data
movie_titles = best_picture_wikipedia["Film"].tolist()

# Get genre names for each movie
tmdb_genre_results = get_movie_genres(movie_titles)

# Convert to DataFrame for display
tmdb_genre_results = pd.DataFrame(list(tmdb_genre_results.items()), columns=["Title", "Genres"])
print(tmdb_genre_results.head())
print(tmdb_genre_results.dtypes)
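
The exact-match test in `get_movie_genres` compares titles character for character, so differences in case, accents, or spacing between Wikipedia and TMDb end up as "No exact match found". One possible way to loosen it is to normalize both sides before comparing, as sketched below. The helper names and the sample results are hypothetical; only the `title` key mirrors the search response used above.

```python
import unicodedata
from typing import List, Optional


def normalize_title(title: str) -> str:
    """Case-fold, strip accents and collapse whitespace, for comparison only."""
    decomposed = unicodedata.normalize("NFKD", title)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(without_accents.casefold().split())


def find_title_match(title: str, results: List[dict]) -> Optional[dict]:
    """Return the first search result whose normalized title equals the query."""
    wanted = normalize_title(title)
    return next(
        (movie for movie in results if normalize_title(movie.get("title", "")) == wanted),
        None,
    )


# Hypothetical search results with made-up genre IDs
sample_results = [
    {"title": "Emilia Perez", "genre_ids": [1]},
    {"title": "Anora", "genre_ids": [2, 3]},
]
print(find_title_match("Emilia Pérez", sample_results))
print(find_title_match("Wicked", sample_results))
```

Looser matching trades missed films for a higher chance of picking up a remake or an unrelated film with the same name, so comparing the release year as well may be worth considering.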

### Combine Data from Wikipedia and TMDb 🎞️

In [None]:
# Confirm column names for both DataFrames
print("best_picture_wikipedia columns:", best_picture_wikipedia.columns.tolist())
print("tmdb_genre_results columns:", tmdb_genre_results.columns.tolist())

In [None]:
# Merge best_picture_wikipedia and tmdb_genre_results DataFrames on "Film"/"Title"
best_picture_merged = best_picture_wikipedia.merge(
    tmdb_genre_results,
    left_on="Film",
    right_on="Title",
    how="left"
)

# Drop the now redundant "Title" column
best_picture_merged.drop("Title", axis=1, inplace=True)

# Split the genres into separate columns
best_picture_merged = best_picture_merged.explode("Genres")
best_picture_merged = pd.concat([best_picture_merged, best_picture_merged["Genres"].str.get_dummies()], axis=1)

# Print the updated DataFrame
print(best_picture_merged.head())
print(best_picture_merged.columns)

### Fetch Metadata from OMDb API 💸
Use **OMDb API** to fetch **box office revenue** and other metadata for Best Picture winners and combine with merged Wikipedia and TMDb data.

In [None]:
# Pull in OMDb API Key
omdb_api_key = os.getenv("OMDB_API_KEY")

# Local cache file
cache_file = "omdb_cache.json"

# Load cache if it exists
if os.path.exists(cache_file):
    with open(cache_file, "r") as f:
        omdb_cache = json.load(f)
else:
    omdb_cache = {}

# Function to clean movie titles
def clean_title(title: str) -> str:
    return re.sub(r'\s*\(.*?\)', '', title).strip()

# Function to fetch metadata (if not cached)
def get_full_movie_metadata(movie_title, api_key) -> dict:
    cleaned_title = clean_title(movie_title)

    # Use cleaned title as cache key
    if cleaned_title in omdb_cache:
        return omdb_cache[cleaned_title]

    encoded_title = quote(cleaned_title)
    omdb_url = f"http://www.omdbapi.com/?t={encoded_title}&apikey={api_key}"

    try:
        response = requests.get(omdb_url, timeout=10)
        response.raise_for_status()
        data = response.json()

        if "Error" in data:
            print(f"OMDb API Error for '{movie_title}' → '{cleaned_title}': {data['Error']}")
            data = {}

    except requests.exceptions.Timeout:
        print(f"Timeout error for '{movie_title}' → '{cleaned_title}'. Skipping...")
        data = {}
    except requests.exceptions.RequestException as e:
        print(f"API request failed for '{movie_title}' → '{cleaned_title}': {e}")
        data = {}
    except ValueError:
        print(f"Invalid JSON response for '{movie_title}' → '{cleaned_title}'.")
        data = {}

    # Store in cache regardless of success/failure
    omdb_cache[cleaned_title] = data
    return data

# Populate DataFrame using cached API responses
def populate_metadata(movie_df, title_column, metadata_func, metadata_fields, api_key) -> pd.DataFrame:
    for field in metadata_fields:
        if field not in movie_df.columns:
            movie_df[field] = ""

    for index, row in movie_df.iterrows():
        title = row[title_column]
        metadata = metadata_func(title, api_key)
        for field in metadata_fields:
            value = metadata.get(field, "N/A")
            if isinstance(value, (dict, list)):
                value = str(value)
            movie_df.at[index, field] = value

    return movie_df

# Fields to add to the DataFrame
metadata_fields = [
    "Title", "Year", "Rated", "Released", "Runtime", "Genre", "Director", "Writer", "Actors", "Plot",
    "Language", "Country", "Awards", "Poster", "Ratings", "Metascore", "imdbRating", "imdbVotes",
    "imdbID", "Type", "DVD", "BoxOffice", "Production", "Website", "Response"
]

# Run metadata enrichment
best_picture_all = populate_metadata(
    best_picture_merged.copy(),
    title_column="Film",
    metadata_func=get_full_movie_metadata,
    metadata_fields=metadata_fields,
    api_key=omdb_api_key
)

# Save updated cache
with open(cache_file, "w") as f:
    json.dump(omdb_cache, f, indent=2)

# Preview
print(best_picture_all.head())
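
The cache handling above can be factored into small helpers. The sketch below is one possible shape, not the notebook's own code: `load_cache`, `save_cache`, and `successful_only` are hypothetical names. It tolerates a missing or corrupt cache file, writes through a temporary file so an interrupted run does not leave a half-written cache, and can optionally drop empty entries so that failed lookups are retried on the next run.

```python
import json
import os
import tempfile


def load_cache(path: str) -> dict:
    """Load a JSON cache, falling back to an empty dict if missing or corrupt."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            data = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return {}
    return data if isinstance(data, dict) else {}


def save_cache(path: str, cache: dict) -> None:
    """Write to a temporary file in the same folder, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(cache, f, indent=2)
    os.replace(tmp_path, path)


def successful_only(cache: dict) -> dict:
    """Keep only non-empty entries, so failed lookups are fetched again later."""
    return {title: data for title, data in cache.items() if data}


# Round trip in a throwaway folder with made-up entries
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "omdb_cache_demo.json")
    save_cache(demo_path, {"Anora": {"Title": "Anora"}, "Unknown Film": {}})
    reloaded = load_cache(demo_path)
    kept = successful_only(reloaded)

print(sorted(reloaded), sorted(kept))
```

Since the notebook stores failed lookups as empty dicts, passing the cache through `successful_only` before saving would turn a one-off timeout into a retry instead of a permanent miss.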


## **Step 3:** Data Cleaning & Storage 🛠
The merged data is cleaned and then stored in a local **SQLite database**.

In [None]:
def clean_money(val) -> float:
    """Convert a money string such as "$1,234,567" to a float (NaN if missing or invalid)."""
    if pd.isna(val) or val == "N/A":
        return np.nan
    try:
        return float(str(val).replace("$", "").replace(",", ""))
    except ValueError:
        return np.nan

def clean_votes(val) -> float:
    """Convert a vote count such as "1,234" to a float (NaN if missing or invalid)."""
    if pd.isna(val) or val == "N/A":
        return np.nan
    try:
        return float(str(val).replace(",", ""))
    except ValueError:
        return np.nan

def clean_runtime(runtime_str) -> float:
    """Convert a runtime such as "139 min" to minutes (NaN if missing or invalid)."""
    if pd.isna(runtime_str) or runtime_str == "N/A":
        return np.nan
    try:
        # Returned as float so the annotation also covers the NaN case
        return float(str(runtime_str).split()[0])
    except (ValueError, IndexError):
        return np.nan

# Make a copy of the merged raw dataset to avoid mutating the original
best_picture_all_clean = best_picture_all.copy()

# Keep only rows where OMDb returned a successful response for a movie
is_valid_response = best_picture_all_clean["Response"].str.contains("True", regex=False, na=False, case=False)
is_movie = best_picture_all_clean["Type"].str.contains("movie", regex=False, na=False, case=False)
best_picture_all_clean = best_picture_all_clean[is_valid_response & is_movie]

# Drop unnecessary or empty columns
best_picture_all_clean.drop(columns=["Website", "DVD", "Production"], inplace=True, errors='ignore')

# Execute the cleaning functions
best_picture_all_clean["BoxOffice"] = best_picture_all_clean["BoxOffice"].apply(clean_money)

# Apply to imdbVotes only (since it's a string with commas)
best_picture_all_clean["imdbVotes"] = best_picture_all_clean["imdbVotes"].apply(clean_votes)

# Apply to Runtime column
best_picture_all_clean["Runtime"] = best_picture_all_clean["Runtime"].apply(clean_runtime)

# Conversion for numeric columns stored as text
numeric_columns = ["imdbRating", "Metascore"]
best_picture_all_clean[numeric_columns] = best_picture_all_clean[numeric_columns].apply(pd.to_numeric, errors="coerce")

# Rename the column
best_picture_all_clean.rename(columns={"Runtime": "Runtime (mins)"}, inplace=True)

# Convert "N/A" to NaN across object columns
best_picture_all_clean.replace("N/A", np.nan, inplace=True)

# Convert Year to numeric (values that cannot be parsed become NaN)
best_picture_all_clean["Year"] = pd.to_numeric(best_picture_all_clean["Year"], errors="coerce")

# Preview cleaned DataFrame
best_picture_all_clean.info()
best_picture_all_clean.head()


### Save cleaned data to local SQLite database

In [None]:
# 1. Deduplicate movie metadata for the movies table
movies = best_picture_all_clean.drop_duplicates(subset=["Film", "Year"]).copy()

# 2. Normalize genre columns
genre_columns = [
    "Action", "Adventure", "Animation", "Comedy", "Crime", "Documentary",
    "Drama", "Family", "Fantasy", "History", "Horror", "Music", "Mystery",
    "Romance", "Science Fiction", "TV Movie", "Thriller", "War", "Western"
]

genre = best_picture_all_clean[["Film", "Year"] + genre_columns].melt(
    id_vars=["Film", "Year"],
    value_vars=genre_columns,
    var_name="Genre",
    value_name="IsPresent"
)

# Keep only rows where genre is present
movie_genres = genre[genre["IsPresent"] == 1].drop(columns="IsPresent")

# 3. Store both in SQLite
with sqlite3.connect("academy_awards.db") as conn:
    movies.to_sql("movies", conn, if_exists="replace", index=False)
    movie_genres.to_sql("movie_genres", conn, if_exists="replace", index=False)

print("Data successfully split and stored in SQLite!")
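
The reason for splitting the data into two tables can be seen in a small in-memory sketch. With one row per film in `movies` and one row per film-genre pair in `movie_genres`, per-film totals stay correct, while summing after a join counts multi-genre films once per genre. The table and column names follow the cell above; the rows and values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (Film TEXT, Year INTEGER, BoxOffice REAL)")
conn.execute("CREATE TABLE movie_genres (Film TEXT, Year INTEGER, Genre TEXT)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?)",
    [("Film A", 2001, 100.0), ("Film B", 2002, 300.0)],
)
conn.executemany(
    "INSERT INTO movie_genres VALUES (?, ?, ?)",
    [("Film A", 2001, "Drama"), ("Film A", 2001, "Romance"), ("Film B", 2002, "Drama")],
)

# Per-film totals come straight from `movies`: one row per film
total_box_office = conn.execute("SELECT SUM(BoxOffice) FROM movies").fetchone()[0]

# Summing after the join counts Film A once per genre, which inflates the total
joined_total = conn.execute("""
    SELECT SUM(m.BoxOffice)
    FROM movies AS m
    JOIN movie_genres AS g ON g.Film = m.Film AND g.Year = m.Year
""").fetchone()[0]

# Per-genre questions are answered from the genre table
films_per_genre = conn.execute(
    "SELECT Genre, COUNT(*) FROM movie_genres GROUP BY Genre ORDER BY Genre"
).fetchall()
conn.close()

print(total_box_office, joined_total, films_per_genre)
```

Here the per-film total is 400.0, while the joined total is 500.0 because Film A appears under two genres.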


## START HERE


## Exploratory Data Analysis 📊
We will explore trends in **Best Picture winners** by genre and other relevant statistics.

In [None]:
# Connect to the database
with sqlite3.connect("academy_awards.db") as conn:
    cursor = conn.cursor()
    
    # Query to list all table names
    cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
    
    # Fetch and print table names
    tables = cursor.fetchall()
    print("Tables in database:")
    for table in tables:
        print(f"- {table[0]}")


In [None]:
# Connect and read the data
with sqlite3.connect("academy_awards.db") as conn:
    movies = pd.read_sql("SELECT * FROM movies", conn)
    movie_genres = pd.read_sql("SELECT * FROM movie_genres", conn)

print(f"Movies loaded: {len(movies)}")
print(f"Genre rows loaded: {len(movie_genres)}")


In [None]:
# Box Office Revenue vs Number of Nominations
box_office_vs_nomination = movies.dropna(subset=["BoxOffice", "Nominations"])

plt.figure(figsize=(10, 6))
plt.scatter(box_office_vs_nomination["Nominations"], box_office_vs_nomination["BoxOffice"], alpha=0.6)
plt.title("Box Office Revenue vs Number of Nominations")
plt.xlabel("Number of Nominations")
plt.ylabel("Box Office Revenue (USD)")
plt.grid(True)
plt.tight_layout()
plt.show()


In [None]:
# Box Office Revenue vs IMDb Rating (Per Movie)
box_office_vs_imdb_rating = movies.dropna(subset=["imdbRating", "BoxOffice"])

plt.figure(figsize=(10, 6))
plt.scatter(box_office_vs_imdb_rating["imdbRating"], box_office_vs_imdb_rating["BoxOffice"], alpha=0.6)
plt.title("Box Office Revenue vs IMDb Rating (Per Movie)")
plt.xlabel("IMDb Rating")
plt.ylabel("Box Office Revenue (USD)")
plt.grid(True)
plt.tight_layout()
plt.show()


## 🛠 SQL Queries for Data Exploration
Now that our data is stored in SQLite, we will perform **SQL queries** to explore trends:
- **Most Awarded Films**
- **Box Office Performance (Stretch Goal)**
- **Decade-wise Genre Trends**

In [None]:
# Connect to SQLite database
conn = sqlite3.connect("academy_awards.db")

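# Query: Top 10 Most Awarded Films
# Reads from the `movies` table written in Step 3. Assumption: `Awards` holds the
# numeric award count from Wikipedia; if OMDb's free-text `Awards` field replaced
# it during enrichment, restore the count before running this query.
query_awards = """
SELECT Film, Awards AS `Awards Won`
FROM movies
ORDER BY `Awards Won` DESC
LIMIT 10;
"""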

# Execute query
top_awarded_films = pd.read_sql(query_awards, conn)
print("Top 10 Most Awarded Films:")
print(top_awarded_films)

## 📊 Visualization: Top 10 Most Awarded Films
We will create a **bar chart** to visualize the **most awarded films** in Academy Award history.

In [None]:
# Plot Bar Chart for Most Awarded Films
plt.figure(figsize=(12, 6))
sns.barplot(y=top_awarded_films["Film"], x=top_awarded_films["Awards Won"], palette="Blues_r")
plt.xlabel("Awards Won")
plt.ylabel("Film")
plt.title("Top 10 Most Awarded Films in Academy Award History")
plt.gca().invert_yaxis()  # Invert y-axis for better readability
plt.show()

## 💰 SQL Query: Average Box Office Revenue by Genre (Stretch Goal)
If box office revenue data is available, we will analyze which **genres** tend to perform better financially.

In [None]:
# Query: Average Box Office Revenue by Genre
# Joins the `movies` and `movie_genres` tables written in Step 3. BoxOffice is
# already numeric there, so no string cleanup is needed in SQL.
query_revenue = """
SELECT g.Genre, AVG(m.BoxOffice) AS Avg_Revenue
FROM movie_genres AS g
JOIN movies AS m ON m.Film = g.Film AND m.Year = g.Year
WHERE m.BoxOffice IS NOT NULL
GROUP BY g.Genre
ORDER BY Avg_Revenue DESC;
"""

# Execute query
box_office_by_genre = pd.read_sql(query_revenue, conn)
print("Average Box Office Revenue by Genre:")
print(box_office_by_genre)

## 📊 Visualization: Box Office Revenue vs. Awards Won
We will create a **scatter plot** to visualize the relationship between **box office revenue and the number of awards won**.

In [None]:
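# NOTE: `awards_df` is not defined earlier in this notebook; load it (with
# "Box Office Revenue", "Awards Won" and "Genre" columns) before running this cell.
# Convert Box Office Revenue to numeric
awards_df["Box Office Revenue"] = (
    awards_df["Box Office Revenue"]
    .str.replace("$", "", regex=False)  # literal "$", not the regex end-of-string anchor
    .str.replace(",", "", regex=False)
    .astype(float)
)

# Scatter Plot
plt.figure(figsize=(10, 6))
sns.scatterplot(x=awards_df["Box Office Revenue"], y=awards_df["Awards Won"], hue=awards_df["Genre"], palette="coolwarm", alpha=0.8)
plt.xlabel("Box Office Revenue (USD)")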
plt.ylabel("Awards Won")
plt.title("Box Office Revenue vs. Awards Won")
plt.show()

## 📅 Visualization: Timeline of Best Picture Wins
We will visualize how **Academy Award wins have changed over the decades** using a **line chart**.

In [None]:
# Aggregate awards by decade
awards_df["Decade"] = (awards_df["Year"] // 10) * 10
awards_by_decade = awards_df.groupby("Decade")["Awards Won"].sum().reset_index()

# Line Plot for Award Wins Over Time
plt.figure(figsize=(12, 6))
sns.lineplot(x=awards_by_decade["Decade"], y=awards_by_decade["Awards Won"], marker="o", linestyle="-", color="b")
plt.xlabel("Decade")
plt.ylabel("Total Awards Won")
plt.title("Best Picture Wins Over the Decades")
plt.grid()
plt.show()
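
The decade bucketing used in the cell above (`Year // 10 * 10`) can also be done in SQL, which is one way to approach the decade-wise genre trends listed earlier. The sketch below assumes a `movie_genres` table shaped like the one written in Step 3 (`Film`, `Year`, `Genre`); the rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movie_genres (Film TEXT, Year INTEGER, Genre TEXT)")
conn.executemany(
    "INSERT INTO movie_genres VALUES (?, ?, ?)",
    [
        ("Film A", 1994, "Drama"),
        ("Film B", 1997, "Drama"),
        ("Film C", 1999, "Comedy"),
        ("Film D", 2003, "Drama"),
    ],
)

# Integer division truncates in SQLite, so 1994 / 10 * 10 gives 1990.
# The CAST guards against years stored as floating point values.
decade_query = """
SELECT (CAST(Year AS INTEGER) / 10) * 10 AS Decade, Genre, COUNT(*) AS Films
FROM movie_genres
GROUP BY Decade, Genre
ORDER BY Decade, Films DESC;
"""
genre_by_decade = conn.execute(decade_query).fetchall()
conn.close()

for decade, genre, films in genre_by_decade:
    print(decade, genre, films)
```

Against the real database, the same query string could be passed to `pd.read_sql` to get a DataFrame ready for plotting.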

## Updated Visualizations

In [None]:
# Remove duplicate rows to avoid double-counting movies
movies_deduped = best_picture_all_clean.drop_duplicates(subset=["Film", "Year"])
print(f"Original rows: {len(best_picture_all_clean)}, Deduplicated: {len(movies_deduped)}")


In [None]:
import matplotlib.pyplot as plt

scatter_data = movies_deduped.dropna(subset=["imdbRating", "BoxOffice"])

plt.figure(figsize=(10, 6))
plt.scatter(scatter_data["imdbRating"], scatter_data["BoxOffice"], alpha=0.6)
plt.title("Box Office Revenue vs IMDb Rating (Per Movie)")
plt.xlabel("IMDb Rating")
plt.ylabel("Box Office Revenue (USD)")
plt.grid(True)
plt.tight_layout()
plt.savefig("box_office_vs_rating.png")
plt.show()


In [None]:
genre_columns = [
    "Action", "Adventure", "Animation", "Comedy", "Crime", "Documentary",
    "Drama", "Family", "Fantasy", "History", "Horror", "Music", "Mystery",
    "Romance", "Science Fiction", "TV Movie", "Thriller", "War", "Western"
]

genre_by_year = best_picture_all_clean.groupby("Year")[genre_columns].sum().sort_index()

# DataFrame.plot() opens its own figure, so a separate plt.figure() call would leave an empty one behind
genre_by_year.plot(kind="area", stacked=True, alpha=0.85, figsize=(14, 7))
plt.title("Genre Distribution of Best Picture Films Over Time")
plt.xlabel("Year")
plt.ylabel("Number of Films (Genre-Normalized)")
plt.legend(loc="upper left", bbox_to_anchor=(1.0, 1.0))
plt.tight_layout()
plt.savefig("genre_distribution_over_time.png")
plt.show()


In [None]:
winner_group = movies_deduped.groupby("Status")[["imdbRating", "BoxOffice"]].mean().dropna()

winner_group.plot(kind="bar", figsize=(10, 6))
plt.title("Average IMDb Rating and Box Office: Winner vs Nominee (Per Movie)")
plt.ylabel("Average Value")
plt.xticks(rotation=0)
plt.grid(axis="y")
plt.tight_layout()
plt.savefig("winner_vs_nominee.png")
plt.show()


In [None]:
import seaborn as sns

# Unpivot genres for boxplot
genre_ratings = best_picture_all_clean.melt(
    id_vars=["imdbRating"], value_vars=genre_columns,
    var_name="Genre", value_name="IsPresent"
)

# Filter only rows where the genre is present
genre_ratings = genre_ratings[genre_ratings["IsPresent"] == 1]

plt.figure(figsize=(16, 6))
sns.boxplot(data=genre_ratings, x="Genre", y="imdbRating")
plt.title("IMDb Ratings by Genre (Genre-Normalized)")
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig("genre_vs_rating_boxplot.png")
plt.show()


In [None]:
winners = movies_deduped[movies_deduped["Status"] == "Winner"].dropna(subset=["Year", "imdbRating"])

plt.figure(figsize=(12, 6))
plt.plot(winners["Year"], winners["imdbRating"], marker="o", linestyle="-")
plt.title("IMDb Rating of Best Picture Winners Over Time")
plt.xlabel("Year")
plt.ylabel("IMDb Rating")
plt.grid(True)
plt.tight_layout()
plt.savefig("winner_timeline.png")
plt.show()


In [None]:
# Genre Distribution of Best Picture Films Over Time
genre_pivot = movie_genres.assign(count=1).pivot_table(
    index="Year", columns="Genre", values="count", aggfunc="sum", fill_value=0
).sort_index()

# DataFrame.plot() opens its own figure, so a separate plt.figure() call would leave an empty one behind
genre_pivot.plot(kind="area", stacked=True, alpha=0.85, figsize=(14, 7))
plt.title("Genre Distribution of Best Picture Films Over Time")
plt.xlabel("Year")
plt.ylabel("Number of Films (Genre-Normalized)")
plt.legend(loc="upper left", bbox_to_anchor=(1.0, 1.0))
plt.tight_layout()
plt.show()


In [None]:
# Average IMDb Rating and Box Office: Winner vs Nominee (Per Movie)
winner_group = movies.groupby("Status")[["imdbRating", "BoxOffice"]].mean().dropna()

winner_group.plot(kind="bar", figsize=(10, 6))
plt.title("Average IMDb Rating and Box Office: Winner vs Nominee (Per Movie)")
plt.ylabel("Average Value")
plt.xticks(rotation=0)
plt.grid(axis="y")
plt.tight_layout()
plt.show()


In [None]:
# IMDb Ratings by Genre (Genre-Normalized)
genre_ratings = movie_genres.merge(movies[["Film", "Year", "imdbRating"]], on=["Film", "Year"])
genre_ratings = genre_ratings.dropna(subset=["imdbRating"])

plt.figure(figsize=(16, 6))
sns.boxplot(data=genre_ratings, x="Genre", y="imdbRating")
plt.title("IMDb Ratings by Genre (Genre-Normalized)")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


In [None]:
# IMDb Rating of Best Picture Winners Over Time
winners = movies[movies["Status"] == "Winner"].dropna(subset=["Year", "imdbRating"])

plt.figure(figsize=(12, 6))
plt.plot(winners["Year"], winners["imdbRating"], marker="o", linestyle="-")
plt.title("IMDb Rating of Best Picture Winners Over Time")
plt.xlabel("Year")
plt.ylabel("IMDb Rating")
plt.grid(True)
plt.tight_layout()
plt.show()


## Stretch Goal: Word Cloud (Wikipedia Movie Summaries) ☁️
Placeholder for future development. I plan to create a word cloud from the plot summaries and, if Wikipedia summaries are accessible, a second **word cloud** from the most common words in those movie descriptions.

In [None]:
# # Sample Wikipedia summary text (replace with actual summaries if available)
# sample_text = "This is a sample summary of a Best Picture-winning film. It tells the story of love, ambition, and success."

# # Tokenize & remove stopwords
# tokens = word_tokenize(sample_text.lower())
# filtered_words = [word for word in tokens if word.isalnum() and word not in stopwords.words("english")]

# # Generate Word Cloud
# wordcloud = WordCloud(width=800, height=400, background_color="white").generate(" ".join(filtered_words))

# # Display Word Cloud
# plt.figure(figsize=(10, 5))
# plt.imshow(wordcloud, interpolation="bilinear")
# plt.axis("off")
# plt.title("Word Cloud of Wikipedia Movie Summaries")
# plt.show()
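
Until the WordCloud and NLTK dependencies are added, the word counts that a word cloud would be drawn from can be approximated with the standard library alone. The sketch below is a stand-in for the tokenizing and stopword steps in the commented-out cell, not the planned implementation; the tiny stopword set, the helper name `word_frequencies`, and the sample text are all assumptions.

```python
import re
from collections import Counter

# Small illustrative stopword set; NLTK's English list is far longer
STOPWORDS = {"a", "an", "and", "is", "it", "of", "the", "this", "to"}


def word_frequencies(text: str, top_n: int = 5) -> list:
    """Count lower-cased words, skipping stopwords, and return the most common."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(word for word in words if word not in STOPWORDS).most_common(top_n)


sample_text = "A story of love and ambition. The story of love."
print(word_frequencies(sample_text, top_n=3))
```

The resulting `(word, count)` pairs could later feed a word cloud library, replacing the joined string used in the commented-out cell.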

In [None]:
# oscars = pd.read_csv("data/oscars.csv", sep='\t', on_bad_lines='skip')
# oscars = oscars.dropna()
# oscars = oscars.drop_duplicates()
# oscars = oscars.reset_index(drop=True)

In [None]:
# # Step 4: Store Data in SQLite Database
# conn = sqlite3.connect("academy_awards.db")
# awards_df.to_sql("awards", conn, if_exists="replace", index=False)
# speech_df.to_sql("speeches", conn, if_exists="replace", index=False)

In [None]:
# # Step 5: SQL Queries & Analysis
# ## Query genres of Best Picture winners over decades
# query = """
# SELECT genre, COUNT(*) AS num_wins, strftime('%Y', award_year) AS decade
# FROM awards
# WHERE category = 'Best Picture'
# GROUP BY genre, decade
# ORDER BY decade ASC;
# """
# genre_trends_df = pd.read_sql(query, conn)

# ## Query word frequency in acceptance speeches
# query = """
# SELECT cleaned_speech FROM speeches;
# """
# speech_texts = pd.read_sql(query, conn)
