# 🎬 Academy Awards Analysis
## Investigating Trends in Oscar-Winning Movies
### Author: Judd Jacobs

This project analyzes historical **Academy Award-winning films** using data from **Wikipedia** (scraped via `pandas.read_html()`) and the **Kaggle Oscar Awards dataset**. 

#### **Key Analysis Areas:**
- 🏆 **Best Picture trends by genre** (from TMDb API)
- 🎭 **Box office revenue & IMDb ratings** (from Kaggle dataset)
- 📈 **Long-term trends in Oscar-winning films**
- ☁️ **Stretch Goal**: **Box office revenue from OMDb API** *(if available)*

In [95]:
# Import necessary libraries
import pandas as pd
import numpy as np
import requests
import sqlite3
import matplotlib.pyplot as plt
import seaborn as sns
from bs4 import BeautifulSoup
from wordcloud import WordCloud
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from dotenv import load_dotenv
import os
from urllib.parse import quote

# Set plotting style
sns.set_style("whitegrid")

# Ensure necessary NLTK components are downloaded
nltk.download("stopwords")
nltk.download("punkt")

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/juddjacobs/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/juddjacobs/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## 🗂 Data Collection: Scraping Wikipedia
We will extract **Best Picture winners** and relevant metadata using:
- **`pandas.read_html()`** to extract the table structure.
- **`BeautifulSoup`** to identify "winning" rows based on background color.

In [87]:
# Wikipedia list of Academy Award-winning films (Best Picture winners are highlighted in the table)
wiki_url = "https://en.wikipedia.org/wiki/List_of_Academy_Award%E2%80%93winning_films"

# Use pandas to extract the table
tables = pd.read_html(wiki_url)

# Select the table of films. As of 2025-03-17 it is the first table (index 0); adjust the index if the page layout changes
best_picture_wikipedia = tables[0]

# Print the first few rows to ensure the correct table was selected
best_picture_wikipedia.head()

             Film  Year  Awards  Nominations
0           Anora  2024       5            6
1   The Brutalist  2024       3           10
2    Emilia Pérez  2024       2           13
3          Wicked  2024       2           10
4  Dune: Part Two  2024       2            5


In [88]:
# Find the Wikipedia table with BeautifulSoup
response_wikipedia = requests.get(wiki_url)
soup_wikipedia = BeautifulSoup(response_wikipedia.text, "html.parser")
wikipedia_table = soup_wikipedia.find_all("table", {"class": "wikitable"})[0]

# Extract all rows
rows = wikipedia_table.find_all("tr")

# List to store "Winner" status
winning_status = []

# Loop through rows and check for background color "#EEDD82" skipping the header row
for row in rows[1:]:
    style = row.get("style", "")
    
    # Check if the row has the background color for winners and remove spaces for consistency
    if "background:#EEDD82" in style.replace(" ", ""):
        winning_status.append("Winner")
    else:
        winning_status.append("Nominee")

# Ensure the list length matches the DataFrame before attaching the column;
# stop here on a mismatch, because the filter below needs the "Status" column
if len(winning_status) != len(best_picture_wikipedia):
    raise ValueError("List length does not match DataFrame length")
best_picture_wikipedia["Status"] = winning_status

# Keep only the rows flagged as Best Picture winners
best_picture_winners = best_picture_wikipedia[best_picture_wikipedia["Status"] == "Winner"]

# Display updated DataFrame
best_picture_winners.head(100)

                                   Film     Year  Awards  Nominations  Status
0                                 Anora     2024       5            6  Winner
14                          Oppenheimer     2023       7           13  Winner
27    Everything Everywhere All at Once     2022       7           11  Winner
40                                 CODA     2021       3            3  Winner
55                            Nomadland  2020/21       3            6  Winner
...                                 ...      ...     ...          ...     ...
1219                            Rebecca     1940       2           11  Winner
1323                          Tom Jones     1963       4           10  Winner
1355                    West Side Story     1961      10           11  Winner
1367                              Wings  1927/28       2            2  Winner
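
The winner check above looks for one exact spelling of the inline style. A slightly more tolerant variant is sketched below, on the assumption that the page may write the color in a different case or use `background-color` instead of `background`. The helper name `is_winner_style` is hypothetical and not part of the notebook.

```python
WINNER_COLOR = "#eedd82"  # highlight color used in the cell above, lowercased


def is_winner_style(style: str, color: str = WINNER_COLOR) -> bool:
    """Return True if an inline style string sets the winner background color."""
    # Remove all whitespace and ignore case so that "background: #EEDD82"
    # and "background:#eedd82" are treated the same way
    normalized = "".join(style.split()).lower()
    return (
        f"background:{color}" in normalized
        or f"background-color:{color}" in normalized
    )


print(is_winner_style("background: #EEDD82;"))
print(is_winner_style("text-align:center"))
```

Inside the loop, the condition could then read `if is_winner_style(row.get("style", "")):`.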


In [89]:
# Connect to (or create) database
conn = sqlite3.connect("academy_awards.db")

# Save to SQLite table
best_picture_winners.to_sql("best_picture_winners", conn, if_exists="replace", index=False)
print("Data saved to SQLite database!")

Data saved to SQLite database!


The following cell is a quick check to ensure the data in the previous cell were saved to the SQLite DB.

In [90]:
# Connect to the database
conn = sqlite3.connect("academy_awards.db")

# Create a cursor object
cursor = conn.cursor()

# Execute the query and fetch all rows
cursor.execute("SELECT * FROM best_picture_winners")
rows = cursor.fetchall()

# Print the results
for row in rows:
    print(row)

# Close the connection
conn.close()

('Anora', '2024', '5', '6', 'Winner')
('Oppenheimer', '2023', '7', '13', 'Winner')
('Everything Everywhere All at Once', '2022', '7', '11', 'Winner')
('CODA', '2021', '3', '3', 'Winner')
('Nomadland', '2020/21', '3', '6', 'Winner')
('Parasite', '2019', '4', '6', 'Winner')
('Green Book', '2018', '3', '5', 'Winner')
('The Shape of Water', '2017', '4', '13', 'Winner')
('Moonlight', '2016', '3', '8', 'Winner')
('Spotlight', '2015', '2', '6', 'Winner')
('Birdman or (The Unexpected Virtue of Ignorance)', '2014', '4', '9', 'Winner')
('12 Years a Slave', '2013', '3', '9', 'Winner')
('Argo', '2012', '3', '7', 'Winner')
('The Artist', '2011', '5', '10', 'Winner')
("The King's Speech", '2010', '4', '12', 'Winner')
('The Hurt Locker', '2009', '6', '9', 'Winner')
('Slumdog Millionaire', '2008', '8', '10', 'Winner')
('No Country for Old Men', '2007', '4', '8', 'Winner')
('The Departed', '2006', '4', '5', 'Winner')
('Crash', '2005', '3', '6', 'Winner')
('Million Dollar Baby', '2004', '4', '7', 'Winne

In [None]:
# Create a List of the "Films"
movie_titles = best_picture_winners["Film"].tolist()
movie_titles

## 🎭 Fetching Genres from TMDb API
We will use **The Movie Database (TMDb) API** to retrieve **movie genres** for Best Picture winners.

In [None]:
# Load environment variables from .env file
load_dotenv()

# Access the TMDb credentials stored in the .env file
tmdb_api_key = os.getenv('TMDB_API_KEY')
# NOTE: "TMBD" below looks like a typo for "TMDB"; the name must match the key in the .env file exactly
tmdb_api_read_access_token = os.getenv('TMBD_API_READ_ACCESS_TOKEN')

tmdb_api_base_url = "https://api.themoviedb.org/3"

# Function to get genre mappings (ID -> Name)
def get_genre_mapping():
    url = f"{tmdb_api_base_url}/genre/movie/list?language=en-US"
    headers = {"accept": "application/json", "Authorization": f"Bearer {tmdb_api_read_access_token}"}
    
    response = requests.get(url, headers=headers)
    data = response.json()
    
    if "genres" in data:
        return {genre["id"]: genre["name"] for genre in data["genres"]}
    return {}

# Function to query TMDB API and get genre names for movies
def get_movie_genres(film_titles):
    headers = {
        "accept": "application/json",
        "Authorization": f"Bearer {tmdb_api_read_access_token}"}
    genre_mapping = get_genre_mapping()  # Fetch genre ID-to-name mapping
    
    movie_genres = {}  # Store results

    for title in film_titles:
        encoded_title = quote(title)  # Encode spaces and special characters
        
        url = f"{tmdb_api_base_url}/search/movie?query={encoded_title}&include_adult=false&language=en-US&page=1"
        response = requests.get(url, headers=headers)
        data = response.json()
        
        if "results" in data and data["results"]:
            # Ensure exact match
            exact_match = next((movie for movie in data["results"] if movie["title"] == title), None)
            
            if exact_match:
                genre_ids = exact_match["genre_ids"]
                genre_names = [genre_mapping.get(gid, "Unknown Genre") for gid in genre_ids]
                movie_genres[title] = genre_names
            else:
                movie_genres[title] = ["No exact match found"]
        else:
            movie_genres[title] = ["No results found"]
    
    return movie_genres

# Get genre names for each movie
genre_results = get_movie_genres(movie_titles)

# Convert to DataFrame for better visualization
df = pd.DataFrame(list(genre_results.items()), columns=["Title", "Genres"])
print(df)

                                Title                                 Genres
0                               Anora               [Drama, Comedy, Romance]
1                         Oppenheimer                       [Drama, History]
2   Everything Everywhere All at Once   [Action, Adventure, Science Fiction]
3                                CODA                [Drama, Music, Romance]
4                           Nomadland                                [Drama]
..                                ...                                    ...
92                            Rebecca    [Romance, Drama, Mystery, Thriller]
93                          Tom Jones  [Comedy, Adventure, History, Romance]
94                    West Side Story                [Crime, Drama, Romance]
95                              Wings          [Drama, Action, War, Romance]
96         You Can't Take It with You                      [Comedy, Romance]

[97 rows x 2 columns]
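
The exact-title match keeps the first search result whose title equals the query, so it can land on a different film that shares the title (several films are called "Titanic", for example). One possible refinement is sketched below. It also compares each result's release year with the `Year` value from the Wikipedia table. The function names and the sample records are hypothetical; the records only imitate the shape of the `results` list in a TMDb search response.

```python
def release_year(movie: dict):
    """Return the four-digit release year of a search result, or None if missing."""
    date = movie.get("release_date") or ""
    return int(date[:4]) if date[:4].isdigit() else None


def pick_best_match(results: list, title: str, award_year: int):
    """Among results with a matching title, pick the one released closest to award_year."""
    candidates = [
        movie for movie in results
        if movie.get("title", "").casefold() == title.casefold()
        and release_year(movie) is not None
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda movie: abs(release_year(movie) - award_year))


# Hypothetical sample data (genre IDs are placeholders)
sample_results = [
    {"title": "Titanic", "release_date": "1953-04-11", "genre_ids": [18]},
    {"title": "Titanic", "release_date": "1943-11-10", "genre_ids": [18]},
    {"title": "Titanic", "release_date": "1997-11-18", "genre_ids": [18, 10749]},
]

best = pick_best_match(sample_results, "Titanic", 1997)
print(best["release_date"])
```

Split years such as `2020/21` would need to be reduced to a single integer before being passed as `award_year`.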


In [None]:
# Function to store movie data in SQLite
def store_movie_data(movie_data) -> None:
    # Create/connect to the database
    conn = sqlite3.connect("academy_awards.db")
    cursor = conn.cursor()

    # Create table if it doesn't exist
    cursor.execute('''
        CREATE TABLE IF NOT EXISTS movies (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            title TEXT,
            release_date TEXT,
            overview TEXT,
            vote_average REAL,
            tmdb_id INTEGER UNIQUE
        )
    ''')

    # Extract movie details from API response
    if movie_data and movie_data.get("results"):
        for movie in movie_data["results"]:
            tmdb_id = movie.get("id")
            title = movie.get("title", "Unknown")
            release_date = movie.get("release_date", "N/A")
            overview = movie.get("overview", "No description available.")
            vote_average = movie.get("vote_average", 0.0)

            # Insert or ignore if the movie already exists (prevents duplicate entries)
            cursor.execute('''
                INSERT OR IGNORE INTO movies (tmdb_id, title, release_date, overview, vote_average)
                VALUES (?, ?, ?, ?, ?)
            ''', (tmdb_id, title, release_date, overview, vote_average))

    conn.commit()
    conn.close()

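# NOTE: fetch_movie_data() is not defined anywhere else in this notebook. The helper
# below is an assumed reconstruction that returns the raw TMDb search response.
def fetch_movie_data(title):
    url = f"{tmdb_api_base_url}/search/movie"
    headers = {"accept": "application/json", "Authorization": f"Bearer {tmdb_api_read_access_token}"}
    params = {"query": title, "include_adult": "false", "language": "en-US", "page": 1}
    response = requests.get(url, headers=headers, params=params)
    return response.json() if response.ok else None

# List of movies to fetch (the Best Picture titles collected earlier)
movies_list = movie_titles

# Loop through each movie, fetch data, and store in the database
for movie in movies_list:
    data = fetch_movie_data(movie)
    if data:
        store_movie_data(data)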

print("Movie data successfully stored in SQLite database!")

Movie data successfully stored in SQLite database!
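
The duplicate protection in `store_movie_data` comes from pairing a `UNIQUE` column with `INSERT OR IGNORE`. The sketch below shows that behavior in isolation, using a throwaway in-memory database and made-up rows (the titles and IDs are placeholders, not real TMDb data).

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(
    """
    CREATE TABLE movies (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        title TEXT,
        tmdb_id INTEGER UNIQUE
    )
    """
)

# The second row repeats tmdb_id 1, so SQLite skips it without raising an error
rows = [("Example Film", 1), ("Example Film", 1), ("Another Film", 2)]
conn.executemany("INSERT OR IGNORE INTO movies (title, tmdb_id) VALUES (?, ?)", rows)
conn.commit()

stored = conn.execute("SELECT title, tmdb_id FROM movies ORDER BY tmdb_id").fetchall()
print(stored)
conn.close()
```

Because ignored rows are dropped silently, comparing the number of rows sent with the number stored is one way to see how many duplicates were skipped.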


In [19]:
# Connect to the database
conn = sqlite3.connect("academy_awards.db")

# Create a cursor object
cursor = conn.cursor()

# Execute the query and fetch all rows
cursor.execute("SELECT * FROM movies")
rows = cursor.fetchall()

# Print the results
for row in rows:
    print(row)

# Close the connection
conn.close()

(1, 'Titanic', '1953-04-11', 'Unhappily married, Julia Sturges decides to go to America with her two children on the Titanic. Her husband, Richard also arranges passage on the luxury liner so as to have custody of their two children. All this fades to insignificance once the ship hits an iceberg.', 6.6, 16535)
(2, 'Titanic', '1943-11-10', "In 1912, the Titanic embarks on its inevitable collision course with history. In the wake of the over-spending required to build the largest luxury ship in the world, White Star Line executive Sir Bruce Ismay schemes to reverse the direction of his company's plummeting stock value. Onboard the Titanic, brave German 1st Officer Petersen struggles to convince his self-important British superiors not to overexert the ship's engines.", 6.1, 11021)
(3, 'Titanic', '1997-11-18', "101-year-old Rose DeWitt Bukater tells the story of her life aboard the Titanic, 84 years later. A young Rose boards the ship with her mother and fiancé. Meanwhile, Jack Dawson and

In [None]:
# This was my original approach; the cells above are a later experiment that appears to work.
def get_movie_genre(movie_title, tmdb_api_key) -> str:
    # Fetches movie genres from the TMDb API given a movie title.
    # TMDb v3 expects the key in the `api_key` query parameter; `params` also URL-encodes the title.
    search_url = "https://api.themoviedb.org/3/search/movie"
    search_response = requests.get(search_url, params={"api_key": tmdb_api_key, "query": movie_title}).json()

    if search_response.get("results"):
        # Use the first search hit and look up its full details
        movie_id = search_response["results"][0]["id"]
        movie_url = f"https://api.themoviedb.org/3/movie/{movie_id}"
        movie_response = requests.get(movie_url, params={"api_key": tmdb_api_key}).json()

        # Extract genres
        genre_list = [genre["name"] for genre in movie_response.get("genres", [])]
        return ", ".join(genre_list)
    else:
        return "Unknown"

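# Apply function to each movie in the dataset
# NOTE: awards_df is not created earlier in this notebook; as an assumption, start from the winners table
awards_df = best_picture_winners.copy()
awards_df["Genre"] = awards_df["Film"].apply(lambda title: get_movie_genre(title, tmdb_api_key))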

# Display updated DataFrame with Genres
awards_df.head()


## ☁️ Stretch Goal: Fetching Box Office Revenue from OMDb API
If available, we will use **OMDb API (IMDb)** to fetch **box office revenue** for Best Picture winners.

In [None]:
# # OMDb API Key (replace with your actual key)
# omdb_api_key = "your_omdb_api_key"

# def get_box_office(movie_title, api_key):
#     """Fetches box office revenue from OMDb API given a movie title."""
#     omdb_url = f"http://www.omdbapi.com/?t={movie_title}&apikey={api_key}"
#     omdb_response = requests.get(omdb_url).json()
    
#     return omdb_response.get("BoxOffice", "N/A")

# # Apply function to each movie in the dataset
# awards_df["Box Office Revenue"] = awards_df["Film"].apply(lambda title: get_box_office(title, omdb_api_key))

# # Display updated DataFrame
# awards_df.head()
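
If this stretch goal is picked up, the box office value will need to be converted before it can be plotted or averaged. The sketch below assumes the `BoxOffice` field arrives as a dollar-formatted string such as `$1,234,567`, or as the `"N/A"` placeholder that the commented code already falls back to. The helper name `parse_box_office` is hypothetical.

```python
from typing import Optional


def parse_box_office(value: Optional[str]) -> Optional[int]:
    """Convert a string such as "$1,234,567" to an integer, or None if unavailable."""
    if not value or value.strip().upper() == "N/A":
        return None
    # Keep only the digits, dropping the currency symbol and thousands separators
    digits = "".join(ch for ch in value if ch.isdigit())
    return int(digits) if digits else None


print(parse_box_office("$1,234,567"))
print(parse_box_office("N/A"))
```

Returning `None` rather than `0` keeps missing revenue from being mistaken for a film that earned nothing.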

## 📥 Data Collection: Kaggle Dataset
Load additional Oscar award data from a structured Kaggle dataset.

In [None]:
# Load Kaggle dataset
kaggle_df = pd.read_csv("data/full_data.csv", sep="\t")

# Display dataset structure
kaggle_df.head()

## 🛠 Data Cleaning & Storage
We will clean and merge Wikipedia & Kaggle data, then store it in an **SQLite database**.

In [None]:
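# Clean Wikipedia data: keep the first four-digit year (e.g. "2020/21" -> 2020)
# Assumes movies_df is the Wikipedia winners table built earlier (best_picture_winners)
movies_df = best_picture_winners.copy()
movies_df["Year"] = movies_df["Year"].astype(str).str.extract(r"(\d{4})", expand=False).astype(float)

# Merge Wikipedia and Kaggle data on the film title
# (the Wikipedia table calls this column "Film"; a "Film" column in the Kaggle data is assumed)
merged_df = pd.merge(movies_df, kaggle_df, on="Film", how="left")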

# Save to SQLite database
conn = sqlite3.connect("academy_awards.db")
merged_df.to_sql("best_picture_winners", conn, if_exists="replace", index=False)

print("Data successfully stored in SQLite database!")
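
The Wikipedia table lists some ceremonies with split years such as `2020/21` and `1927/28`, so the year has to be reduced to a single number before it can be used for merging or grouping. A minimal dependency-free sketch of that step follows. The helper name `parse_award_year` is hypothetical, and keeping the first year of a split pair is a simplifying assumption.

```python
import re
from typing import Optional


def parse_award_year(value) -> Optional[int]:
    """Return the first four-digit year found in value, or None if there is none."""
    match = re.search(r"\d{4}", str(value))
    return int(match.group()) if match else None


for raw in ["2024", "2020/21", "1927/28"]:
    print(raw, "->", parse_award_year(raw))
```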

## 📊 Exploratory Data Analysis
We will explore trends in **Best Picture winners** by genre and other relevant statistics.

In [None]:
# Check unique genres in Kaggle dataset
print(kaggle_df["Genre"].unique())

# Count of Best Picture winners by genre
genre_counts = kaggle_df["Genre"].value_counts()

# Plot the genre distribution
plt.figure(figsize=(12, 6))
sns.barplot(x=genre_counts.index, y=genre_counts.values, palette="Blues_r")
plt.xticks(rotation=45)
plt.xlabel("Genre")
plt.ylabel("Number of Wins")
plt.title("Best Picture Wins by Genre")
plt.show()

## 💰 Box Office & IMDb Ratings
We will analyze **box office revenue** and IMDb ratings of Best Picture winners.

In [None]:
# Scatter plot: Box Office Revenue vs IMDb Ratings
plt.figure(figsize=(10, 6))
sns.scatterplot(x=kaggle_df["BoxOffice"], y=kaggle_df["IMDb Rating"], hue=kaggle_df["Year"], palette="coolwarm")
plt.xlabel("Box Office Revenue (in millions)")
plt.ylabel("IMDb Rating")
plt.title("Box Office Revenue vs IMDb Ratings for Best Picture Winners")
plt.show()
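
A scatter plot shows the shape of the relationship, and a correlation coefficient puts a number on it. The sketch below computes Pearson's r by hand so that it runs without extra libraries. The revenue and rating values are made up for illustration and are not taken from the Kaggle data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return cov / math.sqrt(var_x * var_y)


# Made-up values: revenue in millions and rating on a 10-point scale
revenue = [10.0, 20.0, 30.0, 40.0]
rating = [6.0, 7.0, 8.0, 9.0]
print(round(pearson_r(revenue, rating), 3))
```

With the real data, rows with missing revenue or rating would need to be dropped first, and a value near 0 would suggest little linear relationship.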

## ☁️ Stretch Goal: Word Cloud (Wikipedia Movie Summaries)
If Wikipedia summaries are accessible, generate a **word cloud** from commonly used words in movie descriptions.

In [None]:
# Sample Wikipedia summary text (replace with actual summaries if available)
sample_text = "This is a sample summary of a Best Picture-winning film. It tells the story of love, ambition, and success."

# Tokenize & remove stopwords (build the stopword set once instead of on every token)
tokens = word_tokenize(sample_text.lower())
english_stopwords = set(stopwords.words("english"))
filtered_words = [word for word in tokens if word.isalnum() and word not in english_stopwords]

# Generate Word Cloud
wordcloud = WordCloud(width=800, height=400, background_color="white").generate(" ".join(filtered_words))

# Display Word Cloud
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.title("Word Cloud of Wikipedia Movie Summaries")
plt.show()
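
A word cloud is a picture of word frequencies, so it can help to inspect the counts directly before plotting. The sketch below does that with the standard library only. The regex tokenizer and the tiny stopword set are simplifying assumptions that stand in for NLTK's `word_tokenize` and English stopword list.

```python
import re
from collections import Counter

# Tiny hand-picked stopword set for illustration only
STOPWORDS = {"this", "is", "a", "of", "it", "the", "and"}


def word_frequencies(text: str, stopwords=STOPWORDS) -> Counter:
    """Count lowercase alphanumeric tokens in text, skipping stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(token for token in tokens if token not in stopwords)


sample_text = (
    "This is a sample summary of a Best Picture-winning film. "
    "It tells the story of love, ambition, and success."
)
freqs = word_frequencies(sample_text)
print(freqs.most_common(5))
```

A mapping like `freqs` can also be passed to `WordCloud.generate_from_frequencies()` in place of a joined string.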

In [None]:
oscars = pd.read_csv("data/oscars.csv", sep='\t', on_bad_lines='skip')
oscars = oscars.dropna()
oscars = oscars.drop_duplicates()
oscars = oscars.reset_index(drop=True)

## Step 3: Data Cleaning

In [None]:
# Clean movie metadata
def clean_movie_data(df):
    """
    Drop rows missing a title or release year, then fill missing box office
    values with 0 and missing runtimes with the median runtime.
    Returns a cleaned copy and leaves the input DataFrame unchanged.
    """
    df = df.dropna(subset=["title", "release_year"])
    df = df.fillna({"box_office": 0, "runtime": df["runtime"].median()})
    return df

# Clean speech transcripts (stretch goal)
# def preprocess_speech_text(text):
#     """
#     Tokenize and clean Oscar acceptance speech text for word frequency analysis.
#     """
#     nltk.download("stopwords")
#     nltk.download("punkt")
#     tokens = word_tokenize(text.lower())  # Convert to lowercase and tokenize
#     filtered_words = [word for word in tokens if word.isalnum() and word not in stopwords.words("english")]
#     return " ".join(filtered_words)

# speech_df["cleaned_speech"] = speech_df["speech_text"].apply(preprocess_speech_text)



In [None]:
# Step 4: Store Data in SQLite Database
conn = sqlite3.connect("academy_awards.db")
awards_df.to_sql("awards", conn, if_exists="replace", index=False)
speech_df.to_sql("speeches", conn, if_exists="replace", index=False)



In [None]:
# Step 5: SQL Queries & Analysis
## Query genres of Best Picture winners over decades
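# Derive the decade from the first four characters of award_year
# (works for a plain year such as 2024 or a date string such as "2024-03-10")
query = """
SELECT genre, COUNT(*) AS num_wins,
       (CAST(substr(award_year, 1, 4) AS INTEGER) / 10) * 10 AS decade
FROM awards
WHERE category = 'Best Picture'
GROUP BY genre, decade
ORDER BY decade ASC;
"""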
genre_trends_df = pd.read_sql(query, conn)

## Query word frequency in acceptance speeches
query = """
SELECT cleaned_speech FROM speeches;
"""
speech_texts = pd.read_sql(query, conn)
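
Grouping wins by decade comes down to integer arithmetic on the year. The sketch below shows this on an in-memory SQLite database with a hypothetical `awards` table and made-up rows. It assumes `award_year` is stored as a four-digit integer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE awards (genre TEXT, category TEXT, award_year INTEGER)")
conn.executemany(
    "INSERT INTO awards VALUES (?, ?, ?)",
    [
        ("Drama", "Best Picture", 1994),
        ("Drama", "Best Picture", 1997),
        ("Comedy", "Best Picture", 1998),
        ("Drama", "Best Picture", 2003),
    ],
)

# Integer division drops the last digit, and multiplying by 10 restores the
# decade's starting year (1994 -> 199 -> 1990)
decade_query = """
SELECT genre, (award_year / 10) * 10 AS decade, COUNT(*) AS num_wins
FROM awards
WHERE category = 'Best Picture'
GROUP BY genre, decade
ORDER BY decade ASC, genre ASC;
"""
decade_rows = conn.execute(decade_query).fetchall()
print(decade_rows)
conn.close()
```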



In [None]:
# Step 6: Data Visualization
## Bar Chart - Best Picture Wins by Genre
plt.figure(figsize=(12,6))
sns.barplot(x="genre", y="num_wins", hue="decade", data=genre_trends_df)
plt.xticks(rotation=45)
plt.title("Best Picture Wins by Genre Over Decades")
plt.show()

## Scatter Plot - Box Office vs IMDb Ratings
plt.figure(figsize=(10,5))
sns.scatterplot(x="box_office", y="imdb_rating", hue="decade", data=awards_df)
plt.title("Box Office Revenue vs IMDb Ratings for Oscar Winners")
plt.show()

## Word Cloud - Common Words in Acceptance Speeches
all_text = " ".join(speech_texts["cleaned_speech"])
wordcloud = WordCloud(width=800, height=400, background_color="white").generate(all_text)
plt.figure(figsize=(10,5))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.title("Most Common Words in Oscar Acceptance Speeches")
plt.show()



In [None]:
# Step 7: Conclusion & Interpretation
"""
- The bar chart shows which genres have dominated the Best Picture category over time.
- The scatter plot identifies any correlation between box office revenue and audience reception (IMDb ratings).
- The word cloud highlights common themes in Oscar speeches, reflecting industry trends and sentiments. (stretch goal)
"""


