# Gephi Nodes Generator (Filtered by Connectivity)

This notebook creates nodes and edges CSV files for Gephi import. Nodes are the top N actors by number of movies (most connected) that meet a minimum Recognizability threshold. Each edge represents a movie in which two of those actors both appeared.

## Overview
1. Configure the number of top actors to include and the minimum Recognizability
2. Load actor and movie data from CSV files
3. Keep actors at or above the minimum Recognizability, then take the top N by number of movies (most connected)
4. Create actor nodes with a 'P-' prefix for IDs
5. Create edges between actors who appeared in the same movie
6. Export nodes_N_rR.csv and edges_N_rR.csv files for Gephi (N = number of actors, R = minimum Recognizability)

In [31]:
# Import Required Libraries
import pandas as pd
import json
import os

In [32]:
# Configuration: Number of top actors to include and minimum Recognizability
TOP_N_ACTORS = 1000
MIN_RECOGNIZABILITY = 8

print(f"🎯 Configured to generate network for top {TOP_N_ACTORS} actors (min Recognizability: {MIN_RECOGNIZABILITY})")

🎯 Configured to generate network for top 1000 actors (min Recognizability: 8)


## Configuration

Set the number of top actors to include in the network (based on number of movies they appear in) and minimum Recognizability threshold.

In [33]:
# Load CSV Data Files
# Load actor data from CSV
actors_df = pd.read_csv('../data/actor_details.csv')

# Load movie data from CSV
movies_df = pd.read_csv('../data/movie_details.csv')

print(f"Loaded {len(actors_df)} actor records")
print(f"Loaded {len(movies_df)} movies")

# Display sample data
print("\nSample actor data:")
print(actors_df.head())
print("\nSample movie data:")
print(movies_df.head())

Loaded 22922 actor records
Loaded 3200 movies

Sample actor data:
   person_id           name  popularity  Recognizability
0          2    Mark Hamill      2.0847                8
1          3  Harrison Ford      3.2373               10
2          4  Carrie Fisher      1.1380                9
3          5  Peter Cushing      0.9095                7
4      12248  Alec Guinness      0.5721                9

Sample movie data:
   movie_id                                              title  \
0        11                                          Star Wars   
1        12                                       Finding Nemo   
2        13                                       Forrest Gump   
3        14                                    American Beauty   
4        22  Pirates of the Caribbean: The Curse of the Bla...   

                                      original_title original_language  \
0                                          Star Wars                en   
1                          

In [34]:
# Filter to Top N Actors by Number of Movies (Most Connected) with Minimum Recognizability
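# Count rows per person_id in actors_df.
# Note: if actor_details.csv holds one row per actor, every count is 1
# and the ranking below does not reflect how many movies an actor appears in.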
movie_counts = actors_df.groupby('person_id').size().reset_index(name='movie_count')

# Merge with actor details
actor_movie_counts = actors_df[['person_id', 'name', 'Recognizability']].drop_duplicates('person_id').merge(
    movie_counts, on='person_id'
)

# Filter by minimum Recognizability first
actor_movie_counts = actor_movie_counts[actor_movie_counts['Recognizability'] >= MIN_RECOGNIZABILITY]

print(f"After Recognizability filter (>= {MIN_RECOGNIZABILITY}): {len(actor_movie_counts)} actors remaining")

# Sort by number of movies (descending) and select top N
actor_movie_counts = actor_movie_counts.sort_values('movie_count', ascending=False)
top_actors_df = actor_movie_counts.head(TOP_N_ACTORS)

print(f"Selected top {len(top_actors_df)} actors by number of movies")
print(f"Movie count range: {top_actors_df['movie_count'].max()} to {top_actors_df['movie_count'].min()} movies")
print(f"Recognizability range: {top_actors_df['Recognizability'].max():.2f} to {top_actors_df['Recognizability'].min():.2f}")
print("\nTop 10 most connected actors:")
print(top_actors_df[['name', 'movie_count', 'Recognizability']].head(10))

# Create actor nodes with P- prefix
actor_nodes = []

for _, actor in top_actors_df.iterrows():
    node = {
        'id': f"P-{actor['person_id']}",
        'name': actor['name'],
        'type': 'Actor',
        'Recognizability': actor['Recognizability'],
        'movie_count': actor['movie_count']
    }
    actor_nodes.append(node)

actor_df = pd.DataFrame(actor_nodes)
print(f"\nCreated {len(actor_df)} actor nodes")
print("Sample actor nodes:")
print(actor_df.head())

After Recognizability filter (>= 8): 1718 actors remaining
Selected top 1000 actors by number of movies
Movie count range: 1 to 1 movies
Recognizability range: 10.00 to 8.00

Top 10 most connected actors:
                      name  movie_count  Recognizability
0              Mark Hamill            1                8
16427        Colin Farrell            1                9
16417          Ethan Hawke            1                9
16406   Edward James Olmos            1                8
16405           Sean Young            1                8
16404         Rutger Hauer            1                8
16393       Joe Pantoliano            1                8
16392     Carrie-Anne Moss            1                8
16391           Guy Pearce            1                8
16389  Christina Applegate            1                8

Created 1000 actor nodes
Sample actor nodes:
        id                name   type  Recognizability  movie_count
0      P-2         Mark Hamill  Actor                8
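The output above reports a movie count of 1 for every selected actor, which suggests that actor_details.csv holds one row per actor, so grouping it by `person_id` cannot measure connectivity. Below is a minimal sketch of one alternative, assuming the movie-actor pairs in movie_cast_mapping.csv (loaded later in this notebook) are the intended source of the counts. The small in-memory frames and the names in them are made-up stand-ins for the real files.

```python
import pandas as pd

# Hypothetical stand-ins for actor_details.csv and movie_cast_mapping.csv
actors_df = pd.DataFrame({
    "person_id": [2, 3, 4],
    "name": ["Actor A", "Actor B", "Actor C"],
    "Recognizability": [8, 10, 7],
})
mapping_df = pd.DataFrame({
    "movie_id": [11, 11, 12, 12, 13],
    "person_id": [2, 3, 3, 4, 3],
})

# Count distinct movies per actor from the mapping table, not the actor table
movie_counts = (
    mapping_df.groupby("person_id")["movie_id"]
    .nunique()
    .reset_index(name="movie_count")
)

# Attach the counts to the actor details; actors with no mapping row get 0
actor_movie_counts = actors_df.merge(movie_counts, on="person_id", how="left")
actor_movie_counts["movie_count"] = (
    actor_movie_counts["movie_count"].fillna(0).astype(int)
)

ranked = actor_movie_counts.sort_values("movie_count", ascending=False)
print(ranked)
```

With counts computed this way, the Recognizability filter and the `head(TOP_N_ACTORS)` selection could be applied to `ranked` as before.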

In [35]:
# Export Actor Nodes
# Export actor nodes to CSV for Gephi
output_file = f'../data/nodes_{TOP_N_ACTORS}_r{MIN_RECOGNIZABILITY}.csv'
actor_df.to_csv(output_file, index=False)

print(f"Total nodes: {len(actor_df)}")
print(f"Actor nodes: {len(actor_df)}")

print(f"\nNodes exported to: {output_file}")

# Display first few rows
print("\nFirst 10 nodes:")
print(actor_df.head(10))

Total nodes: 1000
Actor nodes: 1000

Nodes exported to: ../data/nodes_1000_r8.csv

First 10 nodes:
        id                 name   type  Recognizability  movie_count
0      P-2          Mark Hamill  Actor                8            1
1  P-72466        Colin Farrell  Actor                9            1
2    P-569          Ethan Hawke  Actor                9            1
3    P-587   Edward James Olmos  Actor                8            1
4    P-586           Sean Young  Actor                8            1
5    P-585         Rutger Hauer  Actor                8            1
6    P-532       Joe Pantoliano  Actor                8            1
7    P-530     Carrie-Anne Moss  Actor                8            1
8    P-529           Guy Pearce  Actor                8            1
9  P-18979  Christina Applegate  Actor                8            1


In [36]:
# Verify the output file
# Check if file was created and display info
output_file = f'../data/nodes_{TOP_N_ACTORS}_r{MIN_RECOGNIZABILITY}.csv'
if os.path.exists(output_file):
    file_size = os.path.getsize(output_file)
    print("✅ File created successfully!")
    print(f"📁 File location: {output_file}")
    print(f"📊 File size: {file_size:,} bytes")
    
    # Read back and verify structure
    verify_df = pd.read_csv(output_file)
    print(f"🔍 Verified {len(verify_df)} nodes in output file")
    print("\nColumn structure:")
    print(verify_df.dtypes)
else:
    print("❌ Error: File was not created")

✅ File created successfully!
📁 File location: ../data/nodes_1000_r8.csv
📊 File size: 31,956 bytes
🔍 Verified 1000 nodes in output file

Column structure:
id                 object
name               object
type               object
Recognizability     int64
movie_count         int64
dtype: object


## Edge Creation for Gephi

Now we'll create edges between actors who appeared in the same movie. Each edge will represent a movie connection between two actors.
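As a small illustration of the pairing rule (a sketch with made-up IDs, not data from this notebook), a movie with k actors from the filtered set contributes k(k-1)/2 undirected edges:

```python
from itertools import combinations

# Hypothetical cast of one movie, already restricted to the filtered actors
cast = [2, 3, 4, 5]

# Every unordered pair of cast members becomes one undirected edge
pairs = list(combinations(cast, 2))
print(pairs)

# The number of pairs matches k * (k - 1) / 2
print(len(pairs) == len(cast) * (len(cast) - 1) // 2)
```

Because the edge count grows quadratically with cast size, movies with many filtered actors account for a large share of the edge list.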

In [37]:
# Load Movie Cast Mapping Data
# Read the movie_cast_mapping.csv file
mapping_df = pd.read_csv('../data/movie_cast_mapping.csv')

print(f"Loaded {len(mapping_df)} movie-actor relationships")
print("Sample mapping data:")
print(mapping_df.head(10))

Loaded 53686 movie-actor relationships
Sample mapping data:
   movie_id  person_id
0        11          2
1        11          3
2        11          4
3        11          5
4        11      12248
5        11          6
6        11        130
7        11      24343
8        11      24342
9        11      33032


In [38]:
# Create Edges Between Actors Who Appeared in Same Movies
# Get the set of actor IDs from our filtered top actors
available_actor_ids = set(top_actors_df['person_id'])
available_movie_ids = set(movies_df['movie_id'])

print(f"Available actor IDs (top {TOP_N_ACTORS}): {len(available_actor_ids)}")
print(f"Available movie IDs: {len(available_movie_ids)}")

# Group the mapping data by movie to find actors who worked together
from itertools import combinations

edges = []
movie_groups = mapping_df.groupby('movie_id')

for movie_id, group in movie_groups:
    # Only process movies that exist in our dataset
    if movie_id not in available_movie_ids:
        continue
    
    # Get all actors in this movie who are in our filtered actor dataset
    actors_in_movie = [pid for pid in group['person_id'] if pid in available_actor_ids]
    
    # Skip if less than 2 actors from our filtered set are in this movie
    if len(actors_in_movie) < 2:
        continue
    
    # Look up the movie details once per movie rather than once per pair
    movie_info = movies_df[movies_df['movie_id'] == movie_id].iloc[0]
    
    # Create edges between all pairs of actors in this movie
    for actor1, actor2 in combinations(actors_in_movie, 2):
        edge = {
            'Source': f"P-{actor1}",
            'Target': f"P-{actor2}",
            'Type': 'Undirected',
            'Weight': 1,
            'movie_id': movie_id,
            'movie_title': movie_info['title'],
            'release_date': movie_info['release_date']
        }
        edges.append(edge)

edges_df = pd.DataFrame(edges)
print(f"\nCreated {len(edges_df)} edges between actors")
print(f"These edges represent {len(edges_df['movie_id'].unique())} movies")
print("\nSample edges:")
print(edges_df.head(10))

Available actor IDs (top 1000): 1000
Available movie IDs: 3200

Created 17312 edges between actors
These edges represent 2274 movies

Sample edges:
  Source   Target        Type  Weight  movie_id     movie_title release_date
0    P-2      P-3  Undirected       1        11       Star Wars   1977-05-25
1   P-62    P-287  Undirected       1        63  Twelve Monkeys   1995-12-29
2   P-62    P-290  Undirected       1        63  Twelve Monkeys   1995-12-29
3  P-287    P-290  Undirected       1        63  Twelve Monkeys   1995-12-29
4  P-325    P-326  Undirected       1        65          8 Mile   2002-11-08
5  P-325    P-328  Undirected       1        65          8 Mile   2002-11-08
6  P-325  P-53650  Undirected       1        65          8 Mile   2002-11-08
7  P-326    P-328  Undirected       1        65          8 Mile   2002-11-08
8  P-326  P-53650  Undirected       1        65          8 Mile   2002-11-08
9  P-328  P-53650  Undirected       1        65          8 Mile   2002-11-08

Crea
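Because one edge is emitted per shared movie, two actors who share several movies end up with several parallel edges, each with Weight 1. If a single weighted edge per pair is preferred, one option is to collapse them before export. This is a minimal sketch on a made-up edge list. The `weighted` table and the choice to drop the per-movie columns are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

# Hypothetical edge list in the same Source/Target layout as edges_df
edges_df = pd.DataFrame({
    "Source": ["P-2", "P-3", "P-2", "P-4"],
    "Target": ["P-3", "P-2", "P-4", "P-5"],
    "movie_id": [11, 12, 13, 13],
})

# Put each pair in a fixed order so (P-2, P-3) and (P-3, P-2) count as one edge.
# Plain string ordering is used here; any consistent ordering would do.
ordered = [tuple(sorted(p)) for p in zip(edges_df["Source"], edges_df["Target"])]
pairs = pd.DataFrame(ordered, columns=["Source", "Target"])

# One row per actor pair, with Weight = number of shared movies
weighted = (
    pairs.groupby(["Source", "Target"])
    .size()
    .reset_index(name="Weight")
)
weighted["Type"] = "Undirected"
print(weighted)
```

The trade-off is that the collapsed table no longer carries `movie_id`, `movie_title`, or `release_date`, since a single edge may now stand for several movies.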

In [39]:
# Export Edges to CSV
# Save edges to CSV file for Gephi import
edges_output_file = f'../data/edges_{TOP_N_ACTORS}_r{MIN_RECOGNIZABILITY}.csv'
edges_df.to_csv(edges_output_file, index=False)

print(f"✅ Edges exported to: {edges_output_file}")

# Verify the edges file
if os.path.exists(edges_output_file):
    edges_file_size = os.path.getsize(edges_output_file)
    print(f"📁 File location: {edges_output_file}")
    print(f"📊 File size: {edges_file_size:,} bytes")
    
    # Read back and verify structure
    verify_edges_df = pd.read_csv(edges_output_file)
    print(f"🔍 Verified {len(verify_edges_df)} edges in output file")
    print("\nEdge types distribution:")
    print(verify_edges_df['Type'].value_counts())
    print("\nColumn structure:")
    print(verify_edges_df.dtypes)
else:
    print("❌ Error: Edges file was not created")

✅ Edges exported to: ../data/edges_1000_r8.csv
📁 File location: ../data/edges_1000_r8.csv
📊 File size: 1,084,718 bytes
🔍 Verified 17312 edges in output file

Edge types distribution:
Type
Undirected    17312
Name: count, dtype: int64

Column structure:
Source          object
Target          object
Type            object
Weight           int64
movie_id         int64
movie_title     object
release_date    object
dtype: object
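Beyond checking that each file exists, it can help to confirm that the two files agree with each other. The sketch below uses made-up tables in place of the exported CSVs and checks two things: that every edge endpoint is a known node ID, and which nodes never appear in any edge. The variable names are illustrative only.

```python
import pandas as pd

# Hypothetical stand-ins for the exported nodes and edges tables
nodes = pd.DataFrame({"id": ["P-2", "P-3", "P-4", "P-5"]})
edges = pd.DataFrame({
    "Source": ["P-2", "P-2", "P-3"],
    "Target": ["P-3", "P-4", "P-9"],
})

node_ids = set(nodes["id"])

# Edges with at least one endpoint missing from the nodes table
dangling = edges[
    ~edges["Source"].isin(node_ids) | ~edges["Target"].isin(node_ids)
]

# Nodes that never appear in any edge (isolated in the network)
connected = set(edges["Source"]) | set(edges["Target"])
isolated = node_ids - connected

print(f"Dangling edges: {len(dangling)}")
print(f"Isolated nodes: {sorted(isolated)}")
```

In this notebook the edge endpoints are drawn from the same filtered actor set as the nodes, so dangling edges would indicate a mismatch between runs, for example node and edge files generated with different settings.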


In [40]:
# Final Summary
print("🎯 Gephi Import Files Summary")
print("=" * 40)
print(f"📄 Nodes file: ../data/nodes_{TOP_N_ACTORS}_r{MIN_RECOGNIZABILITY}.csv")
print(f"   - Total nodes: {len(actor_df):,}")
print(f"   - Top {TOP_N_ACTORS} most connected actors (min Recognizability: {MIN_RECOGNIZABILITY})")
print()
print(f"🔗 Edges file: ../data/edges_{TOP_N_ACTORS}_r{MIN_RECOGNIZABILITY}.csv")
print(f"   - Total edges: {len(edges_df):,}")
print("   - Relationships: Actor ↔ Actor (via movies)")
print(f"   - Movies represented: {len(edges_df['movie_id'].unique()):,}")
print()
print("✅ Both files are ready for Gephi import!")
print("\nNext steps:")
print("1. Open Gephi")
print(f"2. Import nodes_{TOP_N_ACTORS}_r{MIN_RECOGNIZABILITY}.csv as nodes table")
print(f"3. Import edges_{TOP_N_ACTORS}_r{MIN_RECOGNIZABILITY}.csv as edges table")
print("4. Explore the actor collaboration network!")
print("\nNote: Each edge represents a movie where two actors worked together.")
print("\n💡 To change the number of actors or Recognizability threshold, modify TOP_N_ACTORS and MIN_RECOGNIZABILITY in the configuration cell and re-run.")

🎯 Gephi Import Files Summary
========================================
📄 Nodes file: ../data/nodes_1000_r8.csv
   - Total nodes: 1,000
   - Top 1000 most connected actors (min Recognizability: 8)

🔗 Edges file: ../data/edges_1000_r8.csv
   - Total edges: 17,312
   - Relationships: Actor ↔ Actor (via movies)
   - Movies represented: 2,274

✅ Both files are ready for Gephi import!

Next steps:
1. Open Gephi
2. Import nodes_1000_r8.csv as nodes table
3. Import edges_1000_r8.csv as edges table
4. Explore the actor collaboration network!

Note: Each edge represents a movie where two actors worked together.

💡 To change the number of actors or Recognizability threshold, modify TOP_N_ACTORS and MIN_RECOGNIZABILITY in the configuration cell and re-run.
