# Project 5: A Quantitative Look at Representation in Pre-Code Hollywood

This final project uses our cleaned dataset to move from technical demonstration to critical analysis. We will investigate gender representation across different creative roles and genres during the Pre-Code era (1929-1934).

**Analytical Questions:**
1.  **Overall Roles:** What is the gender distribution for key roles like `actor`/`actress`, `director`, and `writer`?
2.  **Genre Breakdown:** How does the share of acting roles held by actresses vary across major film genres?
3.  **Role Prominence:** Using the presence of a credited character name as a proxy for a significant role, what is the gender breakdown for these more prominent parts?
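
As a rough illustration of question 1, the sketch below tallies how often each key role appears and what share of the total it represents. It runs on a small made-up table (`toy_df`, a hypothetical stand-in for `hollywood_df`) and assumes only that a `category` column holds the role label, as it does in the analysis cell. Note that `category` separates `actor` from `actress` but says nothing about the gender of directors or writers, so that part of the question would need an additional data source.

```python
import pandas as pd

# Hypothetical toy data; the real analysis would use hollywood_df instead.
toy_df = pd.DataFrame({
    "category": ["actor", "actor", "actress", "director",
                 "writer", "actor", "actress", "writer"],
})

key_roles = ["actor", "actress", "director", "writer"]

# Count rows per role; reindex so every key role appears even if its count is zero.
role_counts = (
    toy_df.loc[toy_df["category"].isin(key_roles), "category"]
    .value_counts()
    .reindex(key_roles, fill_value=0)
)

# Express each count as a percentage of all key-role credits.
role_share = (role_counts / role_counts.sum() * 100).round(1)

summary = pd.DataFrame({"count": role_counts, "share_pct": role_share})
print(summary)
```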

**Methodology:**
* We will use our `hollywood_df.pkl` dataset, which contains cleaned information on movies, principals, and their roles.
* We will use `pandas` for data aggregation to calculate counts and proportions.
* We will use `plotly.subplots` to create a dashboard of bar charts that presents our findings in a clear, comparative format.
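
The aggregation step can be sketched on toy data as follows. This is a minimal example, not the notebook's exact code: `toy_roles` is invented for illustration, while the `primary_genre` and `category` column names mirror those used in the analysis cell. The `reindex` call is an added safeguard so that both `actor` and `actress` columns exist even when a group has no rows for one of them.

```python
import pandas as pd

# Hypothetical toy data standing in for the acting-roles subset.
toy_roles = pd.DataFrame({
    "primary_genre": ["Drama", "Drama", "Drama", "Comedy", "Comedy", "Western"],
    "category": ["actor", "actress", "actress", "actor", "actress", "actor"],
})

# Count roles per (genre, category), then pivot categories into columns.
counts = (
    toy_roles.groupby(["primary_genre", "category"])
    .size()
    .unstack(fill_value=0)
    .reindex(columns=["actor", "actress"], fill_value=0)
)

# Turn counts into the percentage of roles held by actresses.
counts["total_roles"] = counts["actor"] + counts["actress"]
counts["actress_pct"] = counts["actress"] / counts["total_roles"] * 100

print(counts.sort_values("actress_pct", ascending=False))
```

On real data, a minimum `total_roles` threshold helps keep small genres from producing misleading percentages.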

In [1]:
import pandas as pd
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go

# --- 1. Load the Main Hollywood DataFrame ---
HOLLYWOOD_DF_PATH = "../data/processed/hollywood_df.pkl"
hollywood_df = pd.read_pickle(HOLLYWOOD_DF_PATH)
# Crew data (directors and writers) for question 1; loaded here but not used further in this cell
CREW_DF_PATH = "../data/raw_imdb/title.crew.tsv"
crew_df = pd.read_csv(CREW_DF_PATH, sep='\t', na_values='\\N')
print("Data loaded successfully.")

# --- 2. Analysis: Representation Across Genres ---
# Filter for actors and actresses only
acting_roles_df = hollywood_df[hollywood_df['category'].isin(['actor', 'actress'])].copy()

# For simplicity, treat the first listed genre as the primary genre
acting_roles_df['primary_genre'] = acting_roles_df['genres'].fillna('Unknown').str.split(',').str[0]

# Group by genre and role category to get counts
genre_gender_counts = acting_roles_df.groupby(['primary_genre', 'category']).size().unstack(fill_value=0)

# Calculate the percentage of roles that went to actresses in each genre
genre_gender_counts['total_roles'] = genre_gender_counts['actor'] + genre_gender_counts['actress']
genre_gender_counts['actress_pct'] = (genre_gender_counts['actress'] / genre_gender_counts['total_roles']) * 100

# Keep only major genres (more than 100 acting roles) for a cleaner plot
major_genres = genre_gender_counts[genre_gender_counts['total_roles'] > 100].sort_values(by='actress_pct', ascending=False)
print("Analysis of roles across genres complete.")

# --- 3. Analysis: Prominence of Roles (Named Characters) ---
# Create a boolean column for roles with a credited character name
acting_roles_df['has_character_name'] = acting_roles_df['characters'].notna() & (acting_roles_df['characters'] != '[]')

# Group by category and the new boolean column
prominence_counts = (
    acting_roles_df.groupby(['category', 'has_character_name']).size()
    .unstack(fill_value=0)
    .reindex(columns=[False, True], fill_value=0)  # guarantee both columns, in a fixed order
    .rename(columns={False: 'No Character Name', True: 'Has Character Name'})
)
print("Analysis of role prominence complete.")


# --- 4. Visualization: Create a Dashboard ---
fig = make_subplots(
    rows=2, cols=1,
    subplot_titles=("Percentage of Acting Roles Filled by Women Across Genres", "Prominence of Roles by Gender"),
    vertical_spacing=0.15
)

# Subplot 1: Genre Breakdown Bar Chart
fig.add_trace(go.Bar(
    x=major_genres.index,
    y=major_genres['actress_pct'],
    name='Actress %',
    marker_color='#33FF57'
), row=1, col=1)

# Subplot 2: Prominence Breakdown Bar Chart
fig.add_trace(go.Bar(
    name='Has Character Name',
    x=prominence_counts.index,
    y=prominence_counts['Has Character Name'],
    marker_color='#3357FF'
), row=2, col=1)
fig.add_trace(go.Bar(
    name='No Character Name',
    x=prominence_counts.index,
    y=prominence_counts['No Character Name'],
    marker_color='#FF5733'
), row=2, col=1)

# --- 5. Style the Dashboard ---
fig.update_layout(
    title_text="A Quantitative Look at Gender Representation in Pre-Code Hollywood",
    template='plotly_dark',
    height=900,
    showlegend=True,
    legend_title_text='Role Type',
    barmode='stack'  # Applies figure-wide; only the second plot has multiple traces to stack
)
fig.update_yaxes(title_text="Actress Percentage (%)", row=1, col=1, range=[0, 100])
fig.update_yaxes(title_text="Number of Roles", row=2, col=1)

fig.show()

Data loaded successfully.
Analysis of roles across genres complete.
Analysis of role prominence complete.
