# Project 2: The Social Network of Pre-Code Hollywood

In this project, we'll shift our focus from movie similarity to the relationships between the people who made them. Our goal is to create an interactive social network graph of the actors and actresses of the Pre-Code era to visualize who worked together most frequently.

**Objective:**
- Identify pairs of actors who co-starred in films.
- Count the number of collaborations for each pair.
- Visualize this network using `pyvis`, where node size represents an actor's total film count and edge thickness represents how many films a pair made together.

**Methodology:**
1.  **Load Data:** We will start with our cleaned `hollywood_df.pkl` file.
2.  **Filter for Actors:** We will create a DataFrame containing only actors and actresses.
3.  **Generate Co-star Pairs:** For each film, we will create a list of every possible pair of actors who appeared in it.
4.  **Aggregate Pairs:** We will count the occurrences of each pair across all films to find the most frequent collaborators.
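Steps 3 and 4 can be sketched on a tiny made-up example before running them on the full dataset. The film ids and actor names below are hypothetical placeholders, not values from `hollywood_df.pkl`, and the `set()` call is an added safeguard (an assumption, not part of the main cell) against an actor being listed twice in one film.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy casts keyed by a made-up film id (not taken from the dataset)
toy_casts = {
    "film_1": ["Actor B", "Actor A", "Actor C"],
    "film_2": ["Actor A", "Actor B"],
}

toy_pair_counts = Counter()
for cast in toy_casts.values():
    # Sorting gives each pair one canonical order; set() drops duplicate names
    toy_pair_counts.update(combinations(sorted(set(cast)), 2))

print(toy_pair_counts.most_common(1))
# [(('Actor A', 'Actor B'), 2)]
```

Because each cast is sorted before pairing, `('Actor A', 'Actor B')` and `('Actor B', 'Actor A')` collapse into a single key, which is what makes the counts meaningful.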

In [1]:
import pandas as pd
from itertools import combinations
from collections import Counter

# --- 1. Load the Hollywood DataFrame ---
HOLLYWOOD_DF_PATH = "../data/processed/hollywood_df.pkl"
hollywood_df = pd.read_pickle(HOLLYWOOD_DF_PATH)

print("Hollywood DataFrame loaded successfully.")

# --- 2. Filter for only actors/actresses ---
actors_df = hollywood_df[hollywood_df['category'].isin(['actor', 'actress'])].copy()
print(f"Filtered down to {len(actors_df):,} actor/actress roles.")

# --- 3. Generate Co-starring Pairs for Each Movie ---

# Group by movie (tconst) and list all actors in that movie
actor_lists_by_movie = actors_df.groupby('tconst')['primaryName'].apply(list)

print(f"Found {len(actor_lists_by_movie)} movies with actor lists.")

# Create a list to hold all pairs
all_pairs = []

# Iterate through each movie's actor list
for actors in actor_lists_by_movie:
    # Use itertools.combinations to get all unique pairs of 2
    # We sort the pair so that ('Actor A', 'Actor B') is the same as ('Actor B', 'Actor A')
    pairs = combinations(sorted(actors), 2)
    all_pairs.extend(pairs)

print(f"Generated {len(all_pairs):,} total co-starring pairs.")

# --- 4. Count the Pairs ---
# Use collections.Counter for a highly efficient way to count the pairs
pair_counts = Counter(all_pairs)

# Convert the counter to a DataFrame for easier manipulation
pair_counts_df = pd.DataFrame(pair_counts.items(), columns=['pair', 'count'])
pair_counts_df.sort_values(by='count', ascending=False, inplace=True)

print("\n--- Top 15 Most Frequent Collaborators in Pre-Code Hollywood ---")
display(pair_counts_df.head(15))

Hollywood DataFrame loaded successfully.
Filtered down to 41,492 actor/actress roles.
Found 4466 movies with actor lists.
Generated 180,261 total co-starring pairs.

--- Top 15 Most Frequent Collaborators in Pre-Code Hollywood ---


| | pair | count |
|---|---|---|
| 2534 | (Ken Maynard, Tarzan) | 28 |
| 799 | (Bob Steele, Perry Murdock) | 18 |
| 14047 | (Bert Wheeler, Robert Woolsey) | 15 |
| 3556 | (Bud Osborne, Cliff Lyons) | 15 |
| 8029 | (Oliver Hardy, Stan Laurel) | 14 |
| 30309 | (Jack Rockwell, Ken Maynard) | 14 |
| 9761 | (Frank Rice, Ken Maynard) | 13 |
| 54066 | (Bob Steele, George 'Gabby' Hayes) | 12 |
| 14042 | (Bert Wheeler, Dorothy Lee) | 12 |
| 74387 | (Earl Dwire, George 'Gabby' Hayes) | 12 |


## Part 2: Visualizing the Actor Network

Now that we have the collaboration data, we can build our interactive graph. To ensure the visualization is clear and meaningful, we will apply a filter to only show pairs who have co-starred in a significant number of films.

**Methodology:**
1.  **Calculate Node Sizes:** We will first calculate the total number of Pre-Code films for each actor. This count determines the size of each actor's node, so more prolific actors appear larger.
2.  **Filter for Strong Connections:** We will filter our `pair_counts_df` to only include pairs who have collaborated on **6 or more films**. This threshold removes noise and focuses the graph on the strongest relationships.
3.  **Construct the Graph:** We will iterate through our filtered list of pairs, adding each actor as a node and creating a weighted edge between them to represent the strength of their collaboration.
4.  **Detect Communities:** We will run Louvain community detection on the graph and color each node by the community it belongs to.
5.  **Enrich Tooltips:** We will attach the genres of each pair's shared films to the edge tooltip, alongside the collaboration count.
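The filtering and graph-construction steps can be illustrated with plain dictionaries. This is a minimal sketch under stated assumptions: the names and counts are invented, and a nested `dict` stands in for the `networkx` graph used in the actual cell.

```python
from collections import Counter, defaultdict

# Hypothetical pair counts (made-up names and numbers, for illustration only)
toy_pair_counts = Counter({
    ("Actor A", "Actor B"): 7,
    ("Actor A", "Actor C"): 2,
    ("Actor B", "Actor D"): 6,
})
toy_threshold = 6

# Keep only pairs at or above the threshold
strong_pairs = {pair: n for pair, n in toy_pair_counts.items() if n >= toy_threshold}

# Undirected weighted adjacency map: node -> {neighbour: weight}
adjacency = defaultdict(dict)
for (a, b), weight in strong_pairs.items():
    adjacency[a][b] = weight
    adjacency[b][a] = weight

print(sorted(adjacency))
# ['Actor A', 'Actor B', 'Actor D']
print(adjacency["Actor B"])
# {'Actor A': 7, 'Actor D': 6}
```

Note that "Actor C" never becomes a node: an actor only enters the graph through at least one edge that survives the threshold, so the final network contains frequent collaborators rather than every actor in the dataset.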

In [5]:
import pandas as pd
import networkx as nx
import community as community_louvain
from pyvis.network import Network
from collections import defaultdict

# --- 1. Data Prep: node sizes and strong-pair filter ---
actor_film_counts = actors_df['primaryName'].value_counts()
threshold = 6
strong_pairs_df = pair_counts_df[pair_counts_df['count'] >= threshold]
print(f"Found {len(strong_pairs_df)} pairs of frequent collaborators (>= {threshold} films).")

# --- 2. Prepare Genre Data for Edges ---
print("Preparing genre context for actor collaborations...")
# Create a dataframe with one row per movie, containing a list of its actors and its genres string
movie_to_actors_genres = actors_df.groupby('tconst').agg({
    'primaryName': list,
    'genres': 'first'  # 'first' gets the single genre string for that movie
})

# Create a dictionary to map pairs to a list of genres
edge_genres = defaultdict(list)

# Iterate through each movie
for tconst, row in movie_to_actors_genres.iterrows():
    actors = row['primaryName']
    genres = row['genres']
    if pd.notna(genres):
        # Generate all pairs of actors for this movie
        for pair in combinations(sorted(actors), 2):
            # For each pair, append the movie's genres (split into a list)
            edge_genres[pair].extend(genres.split(','))

# Create a summarized, unique list of genres for each pair
edge_labels = {
    pair: ", ".join(sorted(set(genres)))
    for pair, genres in edge_genres.items()
}
print("Genre context prepared.")

# --- 3. Build NetworkX Graph and Detect Communities (Louvain) ---
G = nx.Graph()
for index, row in strong_pairs_df.iterrows():
    G.add_edge(row['pair'][0], row['pair'][1], weight=int(row['count']))
partition = community_louvain.best_partition(G)
num_communities = len(set(partition.values()))
colors = ["#FF5733", "#33FF57", "#3357FF", "#FF33A1", "#A133FF", "#33FFA1", "#FFC300", 
          "#C70039", "#900C3F", "#581845", "#DAF7A6", "#FFC0CB", "#00FFFF", "#F0E68C"]
color_map = {i: colors[i % len(colors)] for i in range(num_communities)}

# --- 4. Build the Final Interactive Graph with Enriched Tooltips ---
net = Network(height="800px", width="100%", bgcolor="#222222", font_color="white", notebook=True, cdn_resources='in_line')
net.force_atlas_2based(gravity=-60, spring_length=250)

# Add nodes with community-based colors
for node, community_id in partition.items():
    actor_size = int(actor_film_counts.get(node, 10))
    net.add_node(node, label=node, size=actor_size, color=color_map[community_id],
                 title=f"{node}: {actor_size} films (Community {community_id})")

# Add edges with enriched tooltips
for index, row in strong_pairs_df.iterrows():
    actor1, actor2 = row['pair']
    count = int(row['count'])
    
    # Get the sorted pair to match the key in our edge_labels dictionary
    sorted_pair = tuple(sorted((actor1, actor2)))
    genres_label = edge_labels.get(sorted_pair, "N/A")
    
    # Create the detailed tooltip
    edge_title = f"Co-starred in {count} films\nGenres: {genres_label}"
    
    net.add_edge(actor1, actor2, value=count, title=edge_title)
    
net.show_buttons(filter_=['physics'])
net.show("pre_code_actor_network_final.html")
display(net)

Found 204 pairs of frequent collaborators (>= 6 films).
Preparing genre context for actor collaborations...
Genre context prepared.
pre_code_actor_network_final.html


<class 'pyvis.network.Network'> |N|=204 |E|=204