# Gephi Nodes Generator

This notebook builds a nodes.csv file for Gephi import by combining actor and movie data from CSV files, then builds a matching edges.csv file from the movie-cast mapping.

## Overview
1. Load actor and movie CSV data
2. Build actor nodes with a 'P-' ID prefix
3. Build movie nodes with an 'M-' ID prefix
4. Combine both into a single nodes.csv file for Gephi
5. Build edges.csv from the movie-cast mapping

In [11]:
# Import Required Libraries
import os

import pandas as pd

In [12]:
# Load CSV Data Files
# Load actor data from CSV
actors_df = pd.read_csv('../data/actor_details.csv')

# Load movie data from CSV
movies_df = pd.read_csv('../data/movie_details.csv')

print(f"Loaded {len(actors_df)} actor records")
print(f"Loaded {len(movies_df)} movies")

# Display sample data
print("\nSample actor data:")
print(actors_df.head())
print("\nSample movie data:")
print(movies_df.head())

Loaded 22966 actor records
Loaded 3200 movies

Sample actor data:
   person_id           name  popularity known_for_department  \
0          2    Mark Hamill      2.0847               Acting   
1          3  Harrison Ford      3.2373               Acting   
2          4  Carrie Fisher      1.1380               Acting   
3          5  Peter Cushing      0.9095               Acting   
4      12248  Alec Guinness      0.5721               Acting   

              character  order  adult  movie_id  
0        Luke Skywalker      0  False        11  
1              Han Solo      1  False        11  
2  Princess Leia Organa      2  False        11  
3     Grand Moff Tarkin      3  False        11  
4  Obi-Wan "Ben" Kenobi      4  False        11  

Sample movie data:
   movie_id                                              title  \
0        11                                          Star Wars   
1        12                                       Finding Nemo   
2        13                    

In [13]:
# Process Actor Data
# Create actor nodes with P- prefix (each actor appears only once in the CSV)
actor_nodes = []

for _, actor in actors_df.iterrows():
    node = {
        'id': f"P-{actor['person_id']}",
        'name': actor['name'],
        'type': 'Actor',
        'popularity': actor['popularity'],
        'department': actor['known_for_department']
    }
    actor_nodes.append(node)

actor_df = pd.DataFrame(actor_nodes)
print(f"Created {len(actor_df)} actor nodes")
print("Sample actor nodes:")
print(actor_df.head())

Created 22966 actor nodes
Sample actor nodes:
        id           name   type  popularity department
0      P-2    Mark Hamill  Actor      2.0847     Acting
1      P-3  Harrison Ford  Actor      3.2373     Acting
2      P-4  Carrie Fisher  Actor      1.1380     Acting
3      P-5  Peter Cushing  Actor      0.9095     Acting
4  P-12248  Alec Guinness  Actor      0.5721     Acting
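
The row-by-row loop above works, but the same node table can also be built with column operations, which tends to be faster on tens of thousands of rows. A minimal sketch, assuming the column names shown in the sample output; `build_actor_nodes` is a hypothetical helper name and `sample_actors` is small stand-in data, not the real file.

```python
import pandas as pd


def build_actor_nodes(actors: pd.DataFrame) -> pd.DataFrame:
    """Build Gephi actor nodes from an actor table without iterating rows."""
    return pd.DataFrame({
        # Prefix each numeric ID so it cannot collide with a movie ID
        "id": "P-" + actors["person_id"].astype(str),
        "name": actors["name"],
        "type": "Actor",  # scalar is broadcast to every row
        "popularity": actors["popularity"],
        "department": actors["known_for_department"],
    })


# Stand-in rows mirroring the sample output above (illustration only)
sample_actors = pd.DataFrame({
    "person_id": [2, 12248],
    "name": ["Mark Hamill", "Alec Guinness"],
    "popularity": [2.0847, 0.5721],
    "known_for_department": ["Acting", "Acting"],
})

actor_nodes_df = build_actor_nodes(sample_actors)
print(actor_nodes_df)
```

The same pattern would apply to the movie nodes, with an 'M-' prefix and the movie columns.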


In [14]:
# Process Movie Data
# Create movie nodes with M- prefix
movie_nodes = []

for _, movie in movies_df.iterrows():
    node = {
        'id': f"M-{movie['movie_id']}",
        'name': movie['original_title'],
        'type': 'Movie',
        'title': movie['title'],
        'release_date': movie['release_date'],
        'vote_average': movie['vote_average'],
        'vote_count': movie['vote_count']
    }
    movie_nodes.append(node)

movie_df = pd.DataFrame(movie_nodes)
print(f"Created {len(movie_df)} movie nodes")
print("Sample movie nodes:")
print(movie_df.head())

Created 3200 movie nodes
Sample movie nodes:
     id                                               name   type  \
0  M-11                                          Star Wars  Movie   
1  M-12                                       Finding Nemo  Movie   
2  M-13                                       Forrest Gump  Movie   
3  M-14                                    American Beauty  Movie   
4  M-22  Pirates of the Caribbean: The Curse of the Bla...  Movie   

                                               title release_date  \
0                                          Star Wars   1977-05-25   
1                                       Finding Nemo   2003-05-30   
2                                       Forrest Gump   1994-06-23   
3                                    American Beauty   1999-09-15   
4  Pirates of the Caribbean: The Curse of the Bla...   2003-07-09   

   vote_average  vote_count  
0         8.200       21443  
1         7.800       19883  
2         8.465       28583  
3    

In [15]:
# Combine and Export Nodes
# Combine actor and movie nodes
all_nodes = pd.concat([actor_df, movie_df], ignore_index=True)

print(f"Total nodes: {len(all_nodes)}")
print(f"Actor nodes: {len(actor_df)}")
print(f"Movie nodes: {len(movie_df)}")

# Display summary
print("\nNode type distribution:")
print(all_nodes['type'].value_counts())

# Export to CSV for Gephi
output_file = '../data/nodes.csv'
all_nodes.to_csv(output_file, index=False)
print(f"\nNodes exported to: {output_file}")

# Display first few rows
print("\nFirst 10 nodes:")
print(all_nodes.head(10))

Total nodes: 26166
Actor nodes: 22966
Movie nodes: 3200

Node type distribution:
type
Actor    22966
Movie     3200
Name: count, dtype: int64

Nodes exported to: ../data/nodes.csv

First 10 nodes:
        id             name   type  popularity department title release_date  \
0      P-2      Mark Hamill  Actor      2.0847     Acting   NaN          NaN   
1      P-3    Harrison Ford  Actor      3.2373     Acting   NaN          NaN   
2      P-4    Carrie Fisher  Actor      1.1380     Acting   NaN          NaN   
3      P-5    Peter Cushing  Actor      0.9095     Acting   NaN          NaN   
4  P-12248    Alec Guinness  Actor      0.5721     Acting   NaN          NaN   
5      P-6  Anthony Daniels  Actor      0.4214     Acting   NaN          NaN   
6    P-130      Kenny Baker  Actor      0.3543     Acting   NaN          NaN   
7  P-24343     Peter Mayhew  Actor      0.3098     Acting   NaN          NaN   
8  P-24342     David Prowse  Actor      0.2570     Acting   NaN          NaN   
9  

In [16]:
# Verify the output file
# Check if file was created and display info
if os.path.exists('../data/nodes.csv'):
    file_size = os.path.getsize('../data/nodes.csv')
    print("✅ File created successfully!")
    print("📁 File location: ../data/nodes.csv")
    print(f"📊 File size: {file_size:,} bytes")
    
    # Read back and verify structure
    verify_df = pd.read_csv('../data/nodes.csv')
    print(f"🔍 Verified {len(verify_df)} nodes in output file")
    print("\nColumn structure:")
    print(verify_df.dtypes)
else:
    print("❌ Error: File was not created")

✅ File created successfully!
📁 File location: ../data/nodes.csv
📊 File size: 1,312,375 bytes
🔍 Verified 26166 nodes in output file

Column structure:
id               object
name             object
type             object
popularity      float64
department       object
title            object
release_date     object
vote_average    float64
vote_count      float64
dtype: object
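
`vote_count` is reported as `float64` above because actor rows have no vote count: `pd.concat` fills those cells with `NaN`, and pandas upcasts the integer column to float. If whole-number counts are preferred in the exported CSV, the nullable `Int64` dtype is one option. A small sketch using made-up stand-in rows rather than the real data:

```python
import pandas as pd

# Stand-in frames with different columns, like actor_df and movie_df
actor_part = pd.DataFrame({"id": ["P-2"], "type": ["Actor"], "popularity": [2.0847]})
movie_part = pd.DataFrame({"id": ["M-11"], "type": ["Movie"], "vote_count": [21443]})

combined = pd.concat([actor_part, movie_part], ignore_index=True)
print(combined["vote_count"].dtype)  # float64, because the actor row is NaN

# Nullable integer dtype keeps missing values without forcing floats
combined["vote_count"] = combined["vote_count"].astype("Int64")
print(combined["vote_count"].dtype)  # Int64

# Missing values are written as empty cells, counts as plain integers
csv_text = combined.to_csv(index=False)
print(csv_text)
```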


## Edge Creation for Gephi

Now we'll create edges from the movie-cast mapping data to show relationships between movies and actors.

In [17]:
# Load Movie Cast Mapping Data
# Read the movie_cast_mapping.csv file
mapping_df = pd.read_csv('../data/movie_cast_mapping.csv')

print(f"Loaded {len(mapping_df)} movie-actor relationships")
print("Sample mapping data:")
print(mapping_df.head(10))

Loaded 53686 movie-actor relationships
Sample mapping data:
   movie_id  person_id
0        11          2
1        11          3
2        11          4
3        11          5
4        11      12248
5        11          6
6        11        130
7        11      24343
8        11      24342
9        11      33032


In [18]:
# Create Edges with Prefixed IDs
# Collect the actor and movie IDs present in the loaded data
available_actor_ids = set(actors_df['person_id'])
available_movie_ids = set(movies_df['movie_id'])

print(f"Available actor IDs: {len(available_actor_ids)}")
print(f"Available movie IDs: {len(available_movie_ids)}")

# Process the mapping data to create edges for Gephi
edges = []
skipped_count = 0

for _, row in mapping_df.iterrows():
    # Only create edge if both the person_id and movie_id exist in our datasets
    if row['person_id'] in available_actor_ids and row['movie_id'] in available_movie_ids:
        edge = {
            'Source': f"M-{row['movie_id']}",
            'Target': f"P-{row['person_id']}",
            'Type': 'Undirected',
            'Weight': 1
        }
        edges.append(edge)
    else:
        skipped_count += 1

edges_df = pd.DataFrame(edges)
print(f"Created {len(edges_df)} edges (skipped {skipped_count} edges for missing actors/movies)")
print("Sample edges:")
print(edges_df.head(10))

Available actor IDs: 22966
Available movie IDs: 3200
Created 53686 edges (skipped 0 edges for missing actors/movies)
Sample edges:
  Source   Target        Type  Weight
0   M-11      P-2  Undirected       1
1   M-11      P-3  Undirected       1
2   M-11      P-4  Undirected       1
3   M-11      P-5  Undirected       1
4   M-11  P-12248  Undirected       1
5   M-11      P-6  Undirected       1
6   M-11    P-130  Undirected       1
7   M-11  P-24343  Undirected       1
8   M-11  P-24342  Undirected       1
9   M-11  P-33032  Undirected       1
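
The filter-and-prefix loop above can also be expressed with `Series.isin` and string concatenation. A minimal sketch, assuming the `movie_id` and `person_id` columns shown in the mapping sample; `build_edges` is a hypothetical helper and the three mapping rows are stand-in data.

```python
import pandas as pd


def build_edges(mapping: pd.DataFrame, actor_ids: set, movie_ids: set):
    """Return (edges, skipped_count) for mapping rows whose endpoints both exist."""
    keep = mapping["person_id"].isin(actor_ids) & mapping["movie_id"].isin(movie_ids)
    kept = mapping[keep]
    edges = pd.DataFrame({
        "Source": "M-" + kept["movie_id"].astype(str),
        "Target": "P-" + kept["person_id"].astype(str),
        "Type": "Undirected",
        "Weight": 1,
    }).reset_index(drop=True)
    skipped = int((~keep).sum())
    return edges, skipped


# Stand-in mapping: the last row points at a movie that is not in movie_ids
sample_mapping = pd.DataFrame({"movie_id": [11, 11, 99], "person_id": [2, 3, 2]})
sample_edges, sample_skipped = build_edges(sample_mapping, {2, 3}, {11})
print(sample_edges)
print(f"skipped: {sample_skipped}")
```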


In [19]:
# Export Edges to CSV
# Save edges to CSV file for Gephi import
edges_output_file = '../data/edges.csv'
edges_df.to_csv(edges_output_file, index=False)

print(f"✅ Edges exported to: {edges_output_file}")

# Verify the edges file
if os.path.exists(edges_output_file):
    edges_file_size = os.path.getsize(edges_output_file)
    print(f"📁 File location: {edges_output_file}")
    print(f"📊 File size: {edges_file_size:,} bytes")
    
    # Read back and verify structure
    verify_edges_df = pd.read_csv(edges_output_file)
    print(f"🔍 Verified {len(verify_edges_df)} edges in output file")
    print("\nEdge types distribution:")
    print(verify_edges_df['Type'].value_counts())
    print("\nColumn structure:")
    print(verify_edges_df.dtypes)
else:
    print("❌ Error: Edges file was not created")

✅ Edges exported to: ../data/edges.csv
📁 File location: ../data/edges.csv
📊 File size: 1,561,284 bytes
🔍 Verified 53686 edges in output file

Edge types distribution:
Type
Undirected    53686
Name: count, dtype: int64

Column structure:
Source    object
Target    object
Type      object
Weight     int64
dtype: object
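
Beyond row counts and dtypes, a cheap consistency check before import is to confirm that every `Source` and `Target` in the edges file appears as an `id` in the nodes file. A minimal sketch with stand-in frames; `find_dangling_edges` is a hypothetical helper name, not part of pandas or Gephi.

```python
import pandas as pd


def find_dangling_edges(nodes: pd.DataFrame, edges: pd.DataFrame) -> pd.DataFrame:
    """Return the edges whose Source or Target is missing from nodes['id']."""
    node_ids = set(nodes["id"])
    missing = ~edges["Source"].isin(node_ids) | ~edges["Target"].isin(node_ids)
    return edges[missing]


# Stand-in data: the second edge references an actor that has no node
sample_nodes = pd.DataFrame({"id": ["M-11", "P-2"]})
sample_edge_rows = pd.DataFrame({
    "Source": ["M-11", "M-11"],
    "Target": ["P-2", "P-999"],
})

dangling = find_dangling_edges(sample_nodes, sample_edge_rows)
print(f"Dangling edges: {len(dangling)}")
print(dangling)
```

With the real files, the two frames would be the ones read back from ../data/nodes.csv and ../data/edges.csv; an empty result means every edge endpoint has a node.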


In [20]:
# Final Summary
print("🎯 Gephi Import Files Summary")
print("=" * 40)
print("📄 Nodes file: ../data/nodes.csv")
print(f"   - Total nodes: {len(all_nodes):,}")
print(f"   - Actors: {len(actor_df):,}")
print(f"   - Movies: {len(movie_df):,}")
print()
print("🔗 Edges file: ../data/edges.csv")
print(f"   - Total edges: {len(edges_df):,}")
print("   - Relationships: Movie ↔ Actor")
print()
print("✅ Both files are ready for Gephi import!")
print("\nNext steps:")
print("1. Open Gephi")
print("2. Import nodes.csv as nodes table")
print("3. Import edges.csv as edges table")
print("4. Enjoy exploring your movie-actor network!")

🎯 Gephi Import Files Summary
========================================
📄 Nodes file: ../data/nodes.csv
   - Total nodes: 26,166
   - Actors: 22,966
   - Movies: 3,200

🔗 Edges file: ../data/edges.csv
   - Total edges: 53,686
   - Relationships: Movie ↔ Actor

✅ Both files are ready for Gephi import!

Next steps:
1. Open Gephi
2. Import nodes.csv as nodes table
3. Import edges.csv as edges table
4. Enjoy exploring your movie-actor network!
