# Project 4: Interactive Genre & Keyword Sunburst Chart

In this project, we'll build an interactive sunburst chart to explore the thematic composition of film genres. The chart shows the hierarchical relationship between a parent genre (like 'Horror') on the inner ring and its component keywords (like 'monster', 'supernatural', etc.) on the outer ring.

**Methodology:**
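1.  **Load Data:** We will use our final `hollywood_galaxy_df.pkl` file, which contains the movie titles, genres, and keywords. The primary genre of each film is taken as the first entry in its `genres` string.
2.  **Prepare Hierarchical Data:** The key step is to transform our data into a hierarchical format that Plotly's sunburst chart can understand. We split each film's keyword string into individual keywords, count the occurrences of each `(primary_genre, keyword)` pair, and keep only pairs that occur at least 5 times so the chart stays readable.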
3.  **Visualize:** We will use `plotly.express.sunburst` to generate the interactive chart, where the size of each slice represents its frequency.
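
To make step 2 concrete, here is a minimal sketch of the same reshaping on a toy table. The three films and their keywords are made up for illustration and are not taken from the real dataset; the names `toy_df`, `toy_exploded`, and `toy_counts` are likewise hypothetical.

```python
import pandas as pd

# Toy data standing in for the real dataset (titles and keywords are invented).
toy_df = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C"],
    "primary_genre": ["Horror", "Horror", "Drama"],
    "keywords": ["monster supernatural", "monster", "romance"],
})

# One row per (film, keyword): split the space-separated string, then explode.
toy_exploded = toy_df.assign(keyword=toy_df["keywords"].str.split()).explode("keyword")

# Count each (genre, keyword) pair; these counts become the slice sizes.
toy_counts = (
    toy_exploded.groupby(["primary_genre", "keyword"])
    .size()
    .reset_index(name="count")
)
print(toy_counts)
```

The result has one row per genre-keyword pair, with `('Horror', 'monster')` counted twice because two Horror films share that keyword. A table in this long format is what the `path` and `values` arguments of `px.sunburst` expect.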

In [2]:
import pandas as pd
import plotly.express as px

# --- 1. Load the Final, Processed Data ---
FINAL_DF_PATH = "../data/processed/hollywood_galaxy_df.pkl"
final_df = pd.read_pickle(FINAL_DF_PATH)

print("Final galaxy data loaded successfully.")

# --- 2. Prepare Data for the Sunburst Chart ---

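# Derive 'primary_genre' right after loading so it exists for every later step.
# The first comma-separated entry in 'genres' is treated as the primary genre;
# missing genres become 'Unknown', and stray whitespace is stripped.
final_df['primary_genre'] = (
    final_df['genres'].fillna('Unknown').str.split(',').str[0].str.strip()
)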

# Step 2a: Explode the keywords string into one keyword per row
final_df['keywords'] = final_df['keywords'].fillna('').astype(str)
keyword_df = final_df[final_df['keywords'] != ''].copy()
keyword_df['keyword_list'] = keyword_df['keywords'].str.split()
exploded_df = keyword_df.explode('keyword_list')

# Step 2b: Count the occurrences of each (genre, keyword) pair
hierarchy_counts = exploded_df.groupby(['primary_genre', 'keyword_list']).size().reset_index(name='count')

# Step 2c: Filter for more significant keywords to keep the chart clean
significant_keywords_df = hierarchy_counts[hierarchy_counts['count'] >= 5]

print(f"Prepared {len(significant_keywords_df)} significant genre-keyword relationships for the chart.")

# --- 3. Create the Interactive Sunburst Chart ---
fig = px.sunburst(
    significant_keywords_df,
    path=['primary_genre', 'keyword_list'],
    values='count',
    title="An Interactive Sunburst of Pre-Code Hollywood Genres and Their Keywords",
    color='primary_genre',
)

# --- 4. Style the Visualization ---
fig.update_layout(
    margin=dict(t=50, l=25, r=25, b=25),
    title={'y': 0.95, 'x': 0.5, 'xanchor': 'center', 'yanchor': 'top'}
)
fig.update_traces(
    textinfo='label+percent parent'
)

# --- 5. Show the Final Plot ---
fig.show()

Final galaxy data loaded successfully.
Prepared 571 significant genre-keyword relationships for the chart.
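
The chart labels use `textinfo='label+percent parent'`. As far as I understand Plotly's behavior, the percentage on a keyword slice is that keyword's share of its genre's total in the plotted data. The sketch below reproduces that calculation in pandas, assuming a table shaped like `significant_keywords_df`; the counts in `demo_counts` are invented for illustration.

```python
import pandas as pd

# Hypothetical counts in the same shape as significant_keywords_df.
demo_counts = pd.DataFrame({
    "primary_genre": ["Horror", "Horror", "Drama"],
    "keyword_list": ["monster", "supernatural", "romance"],
    "count": [30, 10, 20],
})

# Total count of each genre (the parent slice), aligned to every row.
genre_total = demo_counts.groupby("primary_genre")["count"].transform("sum")

# Share of each keyword within its parent genre, in percent.
demo_counts["pct_of_genre"] = demo_counts["count"] / genre_total * 100
print(demo_counts)
```

Here 'monster' accounts for 75% of the Horror slice and 'supernatural' for 25%. Because the `count >= 5` filter runs before plotting, these shares are relative to the retained keywords only, not to every keyword in the genre.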
