# Project 2: The Social Network of Pre-Code Hollywood

In this project, we'll shift our focus from movie similarity to the relationships between the people who made them. Our goal is to create an interactive social network graph of the actors and actresses of the Pre-Code era to visualize who worked together most frequently.

**Objective:**
- Identify pairs of actors who co-starred in films.
- Count the number of collaborations for each pair.
- Visualize this network using `pyvis`, where node size represents an actor's total film count and edge thickness represents how often a pair collaborated.

**Methodology:**
1.  **Load Data:** We will start with our cleaned `hollywood_df.pkl` file.
2.  **Filter for Actors:** We will create a DataFrame containing only actors and actresses.
3.  **Generate Co-star Pairs:** For each film, we will create a list of every possible pair of actors who appeared in it.
4.  **Aggregate Pairs:** We will count the occurrences of each pair across all films to find the most frequent collaborators.
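
Steps 3 and 4 are easier to follow on a toy example before running them on the full dataset. The sketch below uses made-up film IDs and actor names (all hypothetical, not taken from `hollywood_df.pkl`). It also deduplicates each cast with `set()`, which is an assumption added here so that an actor credited twice in one film is never paired with themselves.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: film ID -> credited actors (not from the real dataset).
toy_casts = {
    "film_1": ["Actor C", "Actor A", "Actor B"],
    "film_2": ["Actor B", "Actor A"],
    "film_3": ["Actor A", "Actor A", "Actor D"],  # duplicate credit on purpose
}

toy_pair_counts = Counter()
for actors in toy_casts.values():
    # set() drops duplicate credits (assumption: one credit per actor per film).
    # sorted() fixes the order inside each pair, so ("Actor A", "Actor B")
    # and ("Actor B", "Actor A") are counted as the same pair.
    toy_pair_counts.update(combinations(sorted(set(actors)), 2))

# Print pairs from most to least frequent.
for pair, count in toy_pair_counts.most_common():
    print(pair, count)
```

Here only `("Actor A", "Actor B")` appears in more than one film, so it is the only pair with a count above 1.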

In [1]:
import pandas as pd
import os
from itertools import combinations
from collections import Counter

# --- 1. Load the Hollywood DataFrame ---
HOLLYWOOD_DF_PATH = "../data/processed/hollywood_df.pkl"
hollywood_df = pd.read_pickle(HOLLYWOOD_DF_PATH)

print("Hollywood DataFrame loaded successfully.")

# --- 2. Filter for only actors/actresses ---
actors_df = hollywood_df[hollywood_df['category'].isin(['actor', 'actress'])].copy()
print(f"Filtered down to {len(actors_df):,} actor/actress roles.")

# --- 3. Generate Co-starring Pairs for Each Movie ---

# Group by movie (tconst) and list all actors in that movie
actor_lists_by_movie = actors_df.groupby('tconst')['primaryName'].apply(list)

print(f"Found {len(actor_lists_by_movie)} movies with actor lists.")

# Create a list to hold all pairs
all_pairs = []

# Iterate through each movie's actor list
for actors in actor_lists_by_movie:
    # Use itertools.combinations to get every unique pair of actors in the film.
    # Sorting the cast first fixes the order inside each pair, so
    # ('Actor A', 'Actor B') and ('Actor B', 'Actor A') count as the same pair.
    pairs = combinations(sorted(actors), 2)
    all_pairs.extend(pairs)

print(f"Generated {len(all_pairs):,} total co-starring pairs.")

# --- 4. Count the Pairs ---
# Use collections.Counter for a highly efficient way to count the pairs
pair_counts = Counter(all_pairs)

# Convert the counter to a DataFrame for easier manipulation
pair_counts_df = pd.DataFrame(pair_counts.items(), columns=['pair', 'count'])
pair_counts_df.sort_values(by='count', ascending=False, inplace=True)

print("\n--- Top 15 Most Frequent Collaborators in Pre-Code Hollywood ---")
display(pair_counts_df.head(15))

Hollywood DataFrame loaded successfully.
Filtered down to 41,492 actor/actress roles.
Found 4466 movies with actor lists.
Generated 180,261 total co-starring pairs.

--- Top 15 Most Frequent Collaborators in Pre-Code Hollywood ---


| | pair | count |
|---|---|---|
| 2534 | (Ken Maynard, Tarzan) | 28 |
| 799 | (Bob Steele, Perry Murdock) | 18 |
| 14047 | (Bert Wheeler, Robert Woolsey) | 15 |
| 3556 | (Bud Osborne, Cliff Lyons) | 15 |
| 8029 | (Oliver Hardy, Stan Laurel) | 14 |
| 30309 | (Jack Rockwell, Ken Maynard) | 14 |
| 9761 | (Frank Rice, Ken Maynard) | 13 |
| 54066 | (Bob Steele, George 'Gabby' Hayes) | 12 |
| 14042 | (Bert Wheeler, Dorothy Lee) | 12 |
| 74387 | (Earl Dwire, George 'Gabby' Hayes) | 12 |


## Part 2: Visualizing the Actor Network

Now that we have the collaboration data, we can build our interactive graph. To ensure the visualization is clear and meaningful, we will apply a filter to only show pairs who have co-starred in a significant number of films.

**Methodology:**
1.  **Calculate Node Sizes:** We will first count each actor's Pre-Code film credits. This count sets the size of the actor's node in the graph, so more prolific actors appear larger.
2.  **Filter for Strong Connections:** We will filter our `pair_counts_df` to only include pairs who have collaborated on **6 or more films**. This threshold removes noise and focuses the graph on the strongest relationships.
3.  **Construct the Graph:** We will iterate through our filtered list of pairs, adding each actor as a node and creating a weighted edge between them to represent the strength of their collaboration.
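
The three steps above can be sketched without `pyvis`, using a plain dict and list as stand-ins for the graph's nodes and edges. All names and numbers below are hypothetical toy values, and the `toy_` variables are illustrative only. The point is to show that an actor becomes a node only if they belong to at least one pair that meets the threshold.

```python
from collections import Counter

# Hypothetical toy inputs (not from the real dataset).
toy_pair_counts = Counter({
    ("Actor A", "Actor B"): 7,
    ("Actor A", "Actor C"): 2,
    ("Actor B", "Actor D"): 6,
})
toy_film_counts = {"Actor A": 40, "Actor B": 25, "Actor C": 5, "Actor D": 12}
toy_threshold = 6

toy_nodes = {}  # actor name -> node size (total film count)
toy_edges = []  # (actor1, actor2, collaboration count)

for (actor1, actor2), count in toy_pair_counts.items():
    if count < toy_threshold:
        continue  # weak connection: skip the edge entirely
    # Re-adding an existing actor just overwrites the same size value.
    for actor in (actor1, actor2):
        toy_nodes[actor] = toy_film_counts[actor]
    toy_edges.append((actor1, actor2, count))

print(toy_nodes)
print(toy_edges)
```

"Actor C" has no pair at or above the threshold, so it is left out of the graph even though it appears in the film counts.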

In [3]:
from pyvis.network import Network

# --- 1. Calculate Node Sizes (Total Films Per Actor) ---
actor_film_counts = actors_df['primaryName'].value_counts()

# --- 2. Filter for Strong Collaborations ---
threshold = 6
strong_pairs_df = pair_counts_df[pair_counts_df['count'] >= threshold]

print(f"Found {len(strong_pairs_df)} pairs of frequent collaborators (>= {threshold} films).")


# --- 3. Build the Interactive Graph ---
net = Network(height="800px", width="100%", bgcolor="#222222", font_color="white", notebook=True, cdn_resources='in_line')
net.force_atlas_2based(gravity=-50, spring_length=250)

# Iterate through our filtered DataFrame of strong pairs
for index, row in strong_pairs_df.iterrows():
    actor1, actor2 = row['pair']
    
    # Convert NumPy int64 values to plain Python ints: pyvis serializes the
    # graph to JSON, and the json module cannot encode NumPy integer types.
    count = int(row['count'])
    actor1_size = int(actor_film_counts.get(actor1, 10))
    actor2_size = int(actor_film_counts.get(actor2, 10))
    
    # Add the nodes and the edge using the converted int types
    net.add_node(actor1, label=actor1, size=actor1_size, title=f"{actor1}: {actor1_size} films")
    net.add_node(actor2, label=actor2, size=actor2_size, title=f"{actor2}: {actor2_size} films")
    net.add_edge(actor1, actor2, value=count, title=f"Co-starred in {count} films")

net.show_buttons(filter_=['physics'])

# Save and display the graph

Found 204 pairs of frequent collaborators (>= 6 films).
