# Gephi Nodes Generator

This notebook builds the nodes.csv and edges.csv files for a Gephi import. Each node is an actor, and each edge links two actors who appeared in the same movie.

## Overview
1. Load actor and movie data from CSV files
2. Create actor nodes with 'P-' prefix for IDs
3. Create edges between actors who appeared in the same movie
4. Export nodes.csv and edges.csv files for Gephi

In [1]:
# Import Required Libraries
import pandas as pd
import os

In [2]:
# Load CSV Data Files
# Load actor data from CSV
actors_df = pd.read_csv('../data/actor_details.csv')

# Load movie data from CSV
movies_df = pd.read_csv('../data/movie_details.csv')

print(f"Loaded {len(actors_df)} actor records")
print(f"Loaded {len(movies_df)} movies")

# Display sample data
print("\nSample actor data:")
print(actors_df.head())
print("\nSample movie data:")
print(movies_df.head())

Loaded 22966 actor records
Loaded 3200 movies

Sample actor data:
   person_id           name  popularity known_for_department  \
0          2    Mark Hamill      2.0847               Acting   
1          3  Harrison Ford      3.2373               Acting   
2          4  Carrie Fisher      1.1380               Acting   
3          5  Peter Cushing      0.9095               Acting   
4      12248  Alec Guinness      0.5721               Acting   

              character  order  adult  movie_id  
0        Luke Skywalker      0  False        11  
1              Han Solo      1  False        11  
2  Princess Leia Organa      2  False        11  
3     Grand Moff Tarkin      3  False        11  
4  Obi-Wan "Ben" Kenobi      4  False        11  

Sample movie data:
   movie_id                                              title  \
0        11                                          Star Wars   
1        12                                       Finding Nemo   
2        13                    

In [3]:
# Process Actor Data
# Create actor nodes with P- prefix (each actor appears only once in the CSV)
# Build all node columns in one vectorized step instead of looping over rows
actor_df = pd.DataFrame({
    'id': 'P-' + actors_df['person_id'].astype(str),
    'name': actors_df['name'],
    'type': 'Actor',
    'popularity': actors_df['popularity'],
    'department': actors_df['known_for_department']
})
print(f"Created {len(actor_df)} actor nodes")
print("Sample actor nodes:")
print(actor_df.head())

Created 22966 actor nodes
Sample actor nodes:
        id           name   type  popularity department
0      P-2    Mark Hamill  Actor      2.0847     Acting
1      P-3  Harrison Ford  Actor      3.2373     Acting
2      P-4  Carrie Fisher  Actor      1.1380     Acting
3      P-5  Peter Cushing  Actor      0.9095     Acting
4  P-12248  Alec Guinness  Actor      0.5721     Acting
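
The node-building cell assumes that each actor appears only once in the CSV. A minimal sketch of how that assumption could be checked before export is shown here. The `sample_actors` frame and its values are hypothetical stand-ins, not rows from `actor_details.csv`.

```python
import pandas as pd

# Hypothetical miniature of the actor table with one repeated person_id
sample_actors = pd.DataFrame({
    "person_id": [1, 2, 1],
    "name": ["Actor A", "Actor B", "Actor A"],
    "movie_id": [100, 100, 200],
})

# Count rows whose person_id has already been seen earlier in the frame
n_duplicates = int(sample_actors["person_id"].duplicated().sum())
print(f"Duplicate person_id rows: {n_duplicates}")

# Keep the first row per actor so that every node ID is unique
unique_actors = sample_actors.drop_duplicates(subset="person_id", keep="first")
print(unique_actors)
```

If the count is zero on the real data, the deduplication step is a no-op and can be left out.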


In [4]:
# Export Actor Nodes
# Export actor nodes to CSV for Gephi
output_file = '../data/nodes.csv'
actor_df.to_csv(output_file, index=False)

print(f"Total nodes: {len(actor_df)}")
print(f"Actor nodes: {len(actor_df)}")

print(f"\nNodes exported to: {output_file}")

# Display first few rows
print("\nFirst 10 nodes:")
print(actor_df.head(10))

Total nodes: 22966
Actor nodes: 22966

Nodes exported to: ../data/nodes.csv

First 10 nodes:
        id             name   type  popularity department
0      P-2      Mark Hamill  Actor      2.0847     Acting
1      P-3    Harrison Ford  Actor      3.2373     Acting
2      P-4    Carrie Fisher  Actor      1.1380     Acting
3      P-5    Peter Cushing  Actor      0.9095     Acting
4  P-12248    Alec Guinness  Actor      0.5721     Acting
5      P-6  Anthony Daniels  Actor      0.4214     Acting
6    P-130      Kenny Baker  Actor      0.3543     Acting
7  P-24343     Peter Mayhew  Actor      0.3098     Acting
8  P-24342     David Prowse  Actor      0.2570     Acting
9  P-33032       Phil Brown  Actor      0.1588     Acting
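
Gephi's spreadsheet importer conventionally works with node columns named `Id` and `Label`. Assuming the actor name should appear as the node label, the columns could be renamed before export along these lines. The `nodes_sample` frame is a hypothetical two-row copy of the output above, and whether the rename is needed depends on the Gephi version and import settings.

```python
import pandas as pd

# Two rows copied from the sample node output, for illustration only
nodes_sample = pd.DataFrame({
    "id": ["P-2", "P-3"],
    "name": ["Mark Hamill", "Harrison Ford"],
    "type": ["Actor", "Actor"],
})

# Rename to the column names Gephi conventionally uses for node tables
gephi_nodes = nodes_sample.rename(columns={"id": "Id", "name": "Label"})
print(gephi_nodes.columns.tolist())
```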


In [5]:
# Verify the output file
# Check if file was created and display info
if os.path.exists('../data/nodes.csv'):
    file_size = os.path.getsize('../data/nodes.csv')
    print(f"✅ File created successfully!")
    print(f"📁 File location: ../data/nodes.csv")
    print(f"📊 File size: {file_size:,} bytes")
    
    # Read back and verify structure
    verify_df = pd.read_csv('../data/nodes.csv')
    print(f"🔍 Verified {len(verify_df)} nodes in output file")
    print("\nColumn structure:")
    print(verify_df.dtypes)
else:
    print("❌ Error: File was not created")

✅ File created successfully!
📁 File location: ../data/nodes.csv
📊 File size: 988,757 bytes
🔍 Verified 22966 nodes in output file

Column structure:
id             object
name           object
type           object
popularity    float64
department     object
dtype: object


## Edge Creation for Gephi

Now we'll create edges between actors who appeared in the same movie. Each edge will represent a movie connection between two actors.
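
As a small illustration of the pairing step, `itertools.combinations` yields every unordered pair exactly once, so a cast of n actors produces n(n-1)/2 edges. The `cast` list below reuses four `person_id` values from the sample actor data. It is a sketch of the idea, not the full edge-building cell.

```python
from itertools import combinations
from math import comb

# Four person_id values taken from the sample actor data
cast = [2, 3, 4, 5]

# Every unordered pair of actors becomes one (Source, Target) edge
pairs = [(f"P-{a}", f"P-{b}") for a, b in combinations(cast, 2)]
print(pairs)

# The number of pairs matches the binomial coefficient C(n, 2)
print(len(pairs) == comb(len(cast), 2))
```

Because the pair count grows quadratically with cast size, movies with large casts account for most of the edges.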

In [6]:
# Load Movie Cast Mapping Data
# Read the movie_cast_mapping.csv file
mapping_df = pd.read_csv('../data/movie_cast_mapping.csv')

print(f"Loaded {len(mapping_df)} movie-actor relationships")
print("Sample mapping data:")
print(mapping_df.head(10))

Loaded 53686 movie-actor relationships
Sample mapping data:
   movie_id  person_id
0        11          2
1        11          3
2        11          4
3        11          5
4        11      12248
5        11          6
6        11        130
7        11      24343
8        11      24342
9        11      33032


In [7]:
# Create Edges Between Actors Who Appeared in Same Movies
# Get the set of actor IDs from our loaded actor data
available_actor_ids = set(actors_df['person_id'])
available_movie_ids = set(movies_df['movie_id'])

print(f"Available actor IDs: {len(available_actor_ids)}")
print(f"Available movie IDs: {len(available_movie_ids)}")

# Group the mapping data by movie to find actors who worked together
from itertools import combinations

edges = []
movie_groups = mapping_df.groupby('movie_id')

for movie_id, group in movie_groups:
    # Only process movies that exist in our dataset
    if movie_id not in available_movie_ids:
        continue
    
    # Get all actors in this movie who are in our actor dataset
    actors_in_movie = [pid for pid in group['person_id'] if pid in available_actor_ids]
    
    # Look up the movie details once per movie rather than once per pair
    movie_info = movies_df[movies_df['movie_id'] == movie_id].iloc[0]

    # Create edges between all pairs of actors in this movie
    for actor1, actor2 in combinations(actors_in_movie, 2):
        
        edge = {
            'Source': f"P-{actor1}",
            'Target': f"P-{actor2}",
            'Type': 'Undirected',
            'Weight': 1,
            'movie_id': movie_id,
            'movie_title': movie_info['title'],
            'release_date': movie_info['release_date']
        }
        edges.append(edge)

edges_df = pd.DataFrame(edges)
print(f"Created {len(edges_df)} edges between actors")
print(f"These edges represent {len(edges_df['movie_id'].unique())} movies")
print("\nSample edges:")
print(edges_df.head(10))

Available actor IDs: 22966
Available movie IDs: 3200
Created 434119 edges between actors
These edges represent 3198 movies

Sample edges:
  Source    Target        Type  Weight  movie_id movie_title release_date
0    P-2       P-3  Undirected       1        11   Star Wars   1977-05-25
1    P-2       P-4  Undirected       1        11   Star Wars   1977-05-25
2    P-2       P-5  Undirected       1        11   Star Wars   1977-05-25
3    P-2   P-12248  Undirected       1        11   Star Wars   1977-05-25
4    P-2       P-6  Undirected       1        11   Star Wars   1977-05-25
5    P-2     P-130  Undirected       1        11   Star Wars   1977-05-25
6    P-2   P-24343  Undirected       1        11   Star Wars   1977-05-25
7    P-2   P-24342  Undirected       1        11   Star Wars   1977-05-25
8    P-2   P-33032  Undirected       1        11   Star Wars   1977-05-25
9    P-2  P-131625  Undirected       1        11   Star Wars   1977-05-25
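
The cell above writes one row per shared movie with `Weight` fixed at 1, so two actors who share several movies get several parallel rows. If a single weighted edge per pair is preferred, the rows could be collapsed roughly as follows. The `sample_edges` frame and its IDs are hypothetical, and counting distinct movies as the weight is only one possible choice.

```python
import pandas as pd

# Hypothetical edge rows: P-1 and P-2 share two movies, listed in both orders
sample_edges = pd.DataFrame({
    "Source": ["P-1", "P-2", "P-1"],
    "Target": ["P-2", "P-1", "P-3"],
    "movie_id": [100, 200, 100],
})

# Order each pair so that (A, B) and (B, A) map to the same key
ordered = [tuple(sorted(p)) for p in zip(sample_edges["Source"], sample_edges["Target"])]
sample_edges["Source"] = [p[0] for p in ordered]
sample_edges["Target"] = [p[1] for p in ordered]

# One row per pair, weighted by the number of distinct shared movies
weighted = (
    sample_edges.groupby(["Source", "Target"], as_index=False)
    .agg(Weight=("movie_id", "nunique"))
)
weighted["Type"] = "Undirected"
print(weighted)
```

This drops the per-movie columns (`movie_title`, `release_date`), so it suits analyses of collaboration strength rather than of individual films.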

In [8]:
# Export Edges to CSV
# Save edges to CSV file for Gephi import
edges_output_file = '../data/edges.csv'
edges_df.to_csv(edges_output_file, index=False)

print(f"✅ Edges exported to: {edges_output_file}")

# Verify the edges file
if os.path.exists(edges_output_file):
    edges_file_size = os.path.getsize(edges_output_file)
    print(f"📁 File location: {edges_output_file}")
    print(f"📊 File size: {edges_file_size:,} bytes")
    
    # Read back and verify structure
    verify_edges_df = pd.read_csv(edges_output_file)
    print(f"🔍 Verified {len(verify_edges_df)} edges in output file")
    print("\nEdge types distribution:")
    print(verify_edges_df['Type'].value_counts())
    print("\nColumn structure:")
    print(verify_edges_df.dtypes)
else:
    print("❌ Error: Edges file was not created")

✅ Edges exported to: ../data/edges.csv
📁 File location: ../data/edges.csv
📊 File size: 27,361,654 bytes
🔍 Verified 434119 edges in output file

Edge types distribution:
Type
Undirected    434119
Name: count, dtype: int64

Column structure:
Source          object
Target          object
Type            object
Weight           int64
movie_id         int64
movie_title     object
release_date    object
dtype: object
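
Beyond checking row counts and dtypes, it can be useful to confirm that every edge endpoint refers to an existing node, since an unmatched ID would otherwise only surface during the Gephi import. A minimal sketch under that assumption follows. `nodes_check` and `edges_check` are hypothetical stand-ins for the two exported files.

```python
import pandas as pd

# Hypothetical miniature node and edge tables
nodes_check = pd.DataFrame({"id": ["P-2", "P-3", "P-4"]})
edges_check = pd.DataFrame({
    "Source": ["P-2", "P-2", "P-3"],
    "Target": ["P-3", "P-4", "P-99"],  # P-99 is deliberately absent from the nodes
})

# Collect every ID used as an edge endpoint and compare against the node IDs
known_ids = set(nodes_check["id"])
endpoints = pd.concat([edges_check["Source"], edges_check["Target"]])
missing = sorted(set(endpoints) - known_ids)
print(f"Endpoints without a node: {missing}")
```

An empty list means the two files are consistent with each other.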


In [9]:
# Final Summary
print("🎯 Gephi Import Files Summary")
print("=" * 40)
print(f"📄 Nodes file: ../data/nodes.csv")
print(f"   - Total nodes: {len(actor_df):,}")
print(f"   - All nodes are actors")
print()
print(f"🔗 Edges file: ../data/edges.csv")
print(f"   - Total edges: {len(edges_df):,}")
print(f"   - Relationships: Actor ↔ Actor (via movies)")
print(f"   - Movies represented: {len(edges_df['movie_id'].unique()):,}")
print()
print("✅ Both files are ready for Gephi import!")
print("\nNext steps:")
print("1. Open Gephi")
print("2. Import nodes.csv as nodes table")
print("3. Import edges.csv as edges table")
print("4. Explore the actor collaboration network!")
print("\nNote: Each edge represents a movie where two actors worked together.")

🎯 Gephi Import Files Summary
📄 Nodes file: ../data/nodes.csv
   - Total nodes: 22,966
   - All nodes are actors

🔗 Edges file: ../data/edges.csv
   - Total edges: 434,119
   - Relationships: Actor ↔ Actor (via movies)
   - Movies represented: 3,198

✅ Both files are ready for Gephi import!

Next steps:
1. Open Gephi
2. Import nodes.csv as nodes table
3. Import edges.csv as edges table
4. Explore the actor collaboration network!

Note: Each edge represents a movie where two actors worked together.
