
# The Golden Age Myth: A Data-Driven Investigation
**Author:** Prateek Chandra

## Project Overview
Is "old music/movies/TV" really better, or is it just **survivorship bias**? 
This project investigates the common belief that entertainment quality has declined over time. Using IMDb datasets, we analyze rating distributions, voting patterns, and content volume across decades to separate nostalgia from statistical reality.

**Key Questions:**
1. Do older movies actually have higher average ratings?
2. How does **Survivorship Bias** skew our perception of the past?
3. What happens when we account for **vote volume** (popularity)?
4. How has the **Content Explosion** (OTT era) affected these metrics?


In [1]:

import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
from pathlib import Path

# Set default plotting template
pio.templates.default = "plotly_white"

# Load Data
DATA_DIR = Path("../data/processed")
movies = pd.read_csv(DATA_DIR / "movies_features.csv")
tv = pd.read_csv(DATA_DIR / "tv_features.csv")

# Standardize Columns
for df in [movies, tv]:
    df["decade"] = pd.to_numeric(df["decade"], errors="coerce")
    df["averageRating"] = pd.to_numeric(df["averageRating"], errors="coerce")
    df["numVotes"] = pd.to_numeric(df["numVotes"], errors="coerce")
    df.dropna(subset=["decade", "averageRating"], inplace=True)
    df["decade"] = df["decade"].astype(int)

# Filter for relevant decades (e.g., 1920 onwards for consistency)
movies = movies[movies["decade"] >= 1920]
tv = tv[tv["decade"] >= 1920]

print("Data Loaded Successfully.")
print(f"Movies: {movies.shape[0]} titles")
print(f"TV Shows: {tv.shape[0]} titles")


Data Loaded Successfully.
Movies: 391929 titles
TV Shows: 133528 titles



## 1. The Illusion of Superiority: Rating Distributions
At first glance, older decades seem to have consistent quality. If we look at the raw distribution of ratings for Movies, earlier decades often show higher medians and more compact interquartile ranges.


In [2]:

# Filter for decades with enough data for a fair trend
MIN_TITLES = 50
decade_counts = movies["decade"].value_counts()
valid_decades = decade_counts[decade_counts >= MIN_TITLES].index
movies_plot = movies[movies["decade"].isin(valid_decades)]

fig = px.box(movies_plot, 
             x="decade", 
             y="averageRating", 
             title="Movie Rating Distribution by Decade (Raw Data)",
             labels={"decade": "Decade", "averageRating": "IMDb Rating"},
             color="decade")
fig.update_layout(showlegend=False)
fig.show()



> **Observation:** Notice how the median rating (the line inside the box) stays relatively high for the 1940s-1970s. However, the sheer *volume* of outliers (dots) increases drastically in recent decades.



## 2. The Filter of Time: Survivorship Bias
The "Golden Age" effect is heavily influenced by **Survivorship Bias**. We only remember (and rate) the masterpieces of the past. Mediocre films from the 1940s are largely forgotten and may not even appear in the dataset or have very few votes.
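
The mechanism can be illustrated with a small simulation before touching the real data. This is a minimal sketch under stated assumptions, not part of the IMDb analysis: both eras draw from the same made-up quality distribution, and the logistic "survival" curve and all of its constants are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the sketch is reproducible

# Assumption: "old" and "new" titles share the same underlying quality distribution.
N_TITLES = 100_000
old_quality = rng.normal(loc=6.0, scale=1.2, size=N_TITLES).clip(1, 10)
new_quality = rng.normal(loc=6.0, scale=1.2, size=N_TITLES).clip(1, 10)

# Assumption: the chance that an old title is still remembered (and rated) today
# rises with its quality. The logistic curve and its constants are arbitrary.
survival_prob = 1.0 / (1.0 + np.exp(-(old_quality - 7.0) * 2.0))
survived = rng.random(N_TITLES) < survival_prob

old_survivor_mean = old_quality[survived].mean()
new_mean = new_quality.mean()

print(f"All old titles:       {old_quality.mean():.2f}")
print(f"Surviving old titles: {old_survivor_mean:.2f}")
print(f"All new titles:       {new_mean:.2f}")
```

Even though both eras are equally good by construction, the surviving old titles average noticeably higher than the full set of new titles, because the filter alone shifts the mean.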

Let's test this by filtering for **Popularity** (Number of Votes). If the "Golden Age" holds true, older movies should still dominate even when we include only "well-known" titles.


In [3]:

# Function to visualize different vote thresholds
def plot_survivorship_bias(data, min_votes):
    filtered = data[data["numVotes"] >= min_votes]
    
    fig = px.box(filtered, 
                 x="decade", 
                 y="averageRating", 
                 title=f"Rating Distribution: Titles with at least {min_votes:,} Votes",
                 labels={"decade": "Decade", "averageRating": "Rating"},
                 color="decade")
    fig.update_layout(showlegend=False)
    fig.show()

# High Threshold: Only "Significant" Titles
plot_survivorship_bias(movies, min_votes=1000)



> **Insight:** When we restrict the analysis to titles with at least 1,000 votes (filtering out obscure/forgotten titles), the "superiority" of the past diminishes. Modern decades show competitive median ratings, suggesting that **bad old movies just disappear**, while **bad new movies are still visible**.
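
A single threshold is easier to trust when it is compared against others. The sketch below tabulates the median rating per decade at several vote thresholds. The helper `median_by_threshold` and the toy data are hypothetical, invented here to show the mechanics; only the column names `decade`, `averageRating`, and `numVotes` come from the dataset.

```python
import pandas as pd

def median_by_threshold(data, thresholds):
    """Median rating per decade, one column per minimum-vote threshold."""
    columns = {}
    for min_votes in thresholds:
        filtered = data[data["numVotes"] >= min_votes]
        columns[f">= {min_votes} votes"] = (
            filtered.groupby("decade")["averageRating"].median()
        )
    # Decades with no titles at a given threshold show up as NaN
    return pd.DataFrame(columns)

# Hypothetical toy data, for illustration only (not taken from the IMDb files)
toy = pd.DataFrame({
    "decade":        [1950, 1950, 1950, 2010, 2010, 2010, 2010, 2010],
    "averageRating": [8.0,  7.0,  7.5,  7.8,  7.2,  4.0,  3.5,  5.0],
    "numVotes":      [5000, 1200, 3000, 9000, 1500, 20,   15,   30],
})

summary = median_by_threshold(toy, thresholds=[0, 1000])
print(summary)
```

On the real data this could be called as `median_by_threshold(movies, [0, 100, 1000, 10000])` to check whether the pattern holds across thresholds rather than at 1,000 votes only.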



## 3. Correcting for Niche Appeal: Vote-Weighted Ratings
Older, niche films often have inflated ratings because they are rated by small groups of dedicated fans. To counter this, we calculate a **Vote-Weighted Mean**, weighting each title's rating by the logarithm of its vote count. This gives more importance to titles that have withstood the test of public scrutiny (high vote counts) without letting a handful of blockbusters dominate a decade.
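
A tiny worked example shows what the weighting does. It is a sketch with two hypothetical titles, using weights of log(1 + numVotes) as in the `weighted_mean` function of this section, with a linear vote weighting alongside for comparison.

```python
import numpy as np

# Hypothetical two-title decade: a niche favourite and a widely seen film
ratings = np.array([9.0, 6.0])
votes = np.array([10, 100_000])

raw_mean = ratings.mean()
linear_weighted = np.average(ratings, weights=votes)          # weight = numVotes
log_weighted = np.average(ratings, weights=np.log1p(votes))   # weight = log(1 + numVotes)

print(f"Raw mean:             {raw_mean:.2f}")
print(f"Linear vote-weighted: {linear_weighted:.2f}")
print(f"Log vote-weighted:    {log_weighted:.2f}")
```

The log-weighted value lands between the other two: the widely seen film counts for more than the niche one, but not 10,000 times more, as it would under linear weighting.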


In [4]:

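def weighted_mean(group):
    # Weight each rating by log(1 + numVotes): popular titles count for more,
    # but the log keeps massive blockbusters from dominating the decade.
    # Rows with a missing vote count are skipped; a NaN weight would make the result NaN.
    rated = group.dropna(subset=["numVotes"])
    return np.average(rated["averageRating"], weights=np.log1p(rated["numVotes"]))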

comparison = movies.groupby("decade").apply(
    lambda x: pd.Series({
        "Raw Mean": x["averageRating"].mean(),
        "Vote-Weighted Mean": weighted_mean(x)
    })
).reset_index()

fig = go.Figure()
fig.add_trace(go.Scatter(x=comparison["decade"], y=comparison["Raw Mean"],
                    mode='lines+markers', name='Raw Mean Rating'))
fig.add_trace(go.Scatter(x=comparison["decade"], y=comparison["Vote-Weighted Mean"],
                    mode='lines+markers', name='Vote-Weighted Mean'))

fig.update_layout(title="Raw vs. Vote-Weighted Average Ratings by Decade",
                   xaxis_title="Decade", yaxis_title="Average Rating")
fig.show()



> **Insight:** The Vote-Weighted mean (orange) is often lower than the raw mean (blue) in earlier decades, indicating that "boutique" high-rated films with few votes were pulling the average up. In recent years, the gap narrows or reverses.



## 4. The Content Explosion (OTT Era)
We cannot overlook the massive increase in production volume. The barrier to entry for filmmaking has lowered, and streaming services (Netflix, Prime, etc.) have flooded the market.

More content from a wider range of creators plausibly leads to **higher variance**: more masterpieces, but also significantly more "noise" (mediocre or bad content).
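
One caveat is worth checking with a quick simulation: volume on its own does not change the variance of a distribution, it only produces more titles at both extremes. The sketch below assumes two "eras" drawn from the same made-up distribution of unclipped scores that differ only in size.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumption: both "eras" draw ratings from the same distribution; only volume differs.
small_era = rng.normal(loc=6.0, scale=1.2, size=1_000)
large_era = rng.normal(loc=6.0, scale=1.2, size=100_000)

for name, era in [("small era", small_era), ("large era", large_era)]:
    print(f"{name}: n={era.size:>7}, variance={era.var(ddof=1):.2f}, "
          f"above 9.0: {(era > 9.0).sum()}, below 3.0: {(era < 3.0).sum()}")
```

If the real data shows variance rising with volume, that would point to a change in the mix of titles being made and rated (for example, more low-budget releases), not to the larger count by itself.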


In [5]:

# Content Volume
volume_by_decade = movies.groupby("decade").size().reset_index(name="count")

fig = px.bar(volume_by_decade, x="decade", y="count", 
             title="The Content Explosion: Movie Releases per Decade",
             labels={"decade": "Decade", "count": "Number of Releases"})
fig.show()

# Rating Variance
variance_by_decade = movies.groupby("decade")["averageRating"].var().reset_index(name="variance")

fig = px.line(variance_by_decade, x="decade", y="variance", markers=True,
              title="Rating Variance by Decade",
              labels={"decade": "Decade", "variance": "Rating Variance"})
fig.show()



> **Conclusion:** The explosion in releases (especially post-2000) coincides with increased rating variance. The "Golden Age" wasn't necessarily producing *better* content on average; it was just producing *less* content, which the passage of time has curated for us.
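
To put a number on the relationship between volume and variance, one option is a rank correlation across decades. The sketch below is one possible check, not part of the original analysis: the helper `volume_vs_variance` and the toy data are hypothetical, and Spearman's rho is computed as the Pearson correlation of the ranks.

```python
import pandas as pd

def volume_vs_variance(data):
    """Per-decade title count and rating variance, plus their rank correlation."""
    stats = data.groupby("decade")["averageRating"].agg(count="size", variance="var")
    # Spearman rank correlation, computed as the Pearson correlation of the ranks
    rho = stats["count"].rank().corr(stats["variance"].rank())
    return stats, rho

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "decade":        [1950, 1950, 1980, 1980, 1980, 2010, 2010, 2010, 2010],
    "averageRating": [7.0,  7.2,  6.0,  7.0,  7.5,  3.0,  5.5,  7.0,  9.0],
})

stats, rho = volume_vs_variance(toy)
print(stats)
print(f"Spearman rho between volume and variance: {rho:.2f}")
```

On the real data, `volume_vs_variance(movies)` would cover only about ten decades, so the coefficient should be read as a rough descriptive summary. A high value would not show that volume causes the variance.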



# Final Verdict
The "Golden Age Myth" is largely a statistical artifact caused by:
1. **Survivorship Bias:** We forget the bad stuff from the past.
2. **Content Volume:** Modern eras are flooded with content, diluting the average.
3. **Niche Inflation:** Old, obscure titles often have high ratings from small, devoted audiences.

When we control for these factors (voting thresholds, weighted averages), modern decades hold their own against the classics.
