# Creating Networks from JSON Data

This notebook reads movie data from the file `../data/imdb_movies_1985to2022.json` and constructs a graph of actors. The dataset contains a sample of movies released between 2000 and 2022, along with their titles, genres, release years, ratings, and top-billed actors.

Using this dataset, we build a graph and perform some rudimentary graph analysis, extracting centrality metrics from it.
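
The loading code in this notebook reads the file one line at a time and expects each line to be a JSON object whose `actors` field is a list of `[actor_id, actor_name]` pairs. As a rough sketch of that shape, the record below is made up for illustration: only the `actors` key is taken from the notebook's code, while the other field names (`title`, `year`, `rating`, `genres`) and the placeholder IDs are assumptions, not the dataset's confirmed schema.

```python
import json

# A made-up record in JSON Lines style (one JSON object per line).
# Only the 'actors' key is confirmed by the notebook's loading code;
# every other field name and all values here are hypothetical.
sample_line = (
    '{"title": "Example Movie", "year": 2001, "rating": 7.1, '
    '"genres": ["Drama"], '
    '"actors": [["nm_a", "Actor A"], ["nm_b", "Actor B"], ["nm_c", "Actor C"]]}'
)

movie = json.loads(sample_line)

# Each entry in 'actors' unpacks into an (id, name) pair
actor_ids = [actor_id for actor_id, actor_name in movie["actors"]]
actor_names = [actor_name for actor_id, actor_name in movie["actors"]]
print(actor_ids)
print(actor_names)
```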

In [67]:
%matplotlib inline

In [68]:
import json
import random

import numpy as np
import pandas as pd
import networkx as nx


## Exercise 1: Build Graph of Actors, Finding Most Prolific Actor

The dataset contains a list of movies. We want to convert that list into a network of actors, where each node represents an actor and an edge between two actors represents the movies in which they have co-starred. The edge weight counts how many movies the pair shares.

From there, we rank the actors by degree, that is, by the number of distinct co-stars to whom each is connected, and print the top 20.
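
Before running this on the full dataset, it may help to see the same idea on a toy example. The sketch below uses made-up movies and actor IDs, and it generates the actor pairs with `itertools.combinations` instead of the index-based loop used in this notebook; both approaches should visit each unordered pair of cast members once per movie.

```python
from itertools import combinations

import networkx as nx

# Toy data: each movie is a list of (actor_id, actor_name) pairs (hypothetical IDs)
toy_movies = [
    [("a1", "Actor One"), ("a2", "Actor Two"), ("a3", "Actor Three")],
    [("a1", "Actor One"), ("a2", "Actor Two")],
    [("a1", "Actor One"), ("a4", "Actor Four")],
]

toy_g = nx.Graph()
for cast in toy_movies:
    ids = [actor_id for actor_id, _ in cast]
    toy_g.add_nodes_from(ids)
    # Every unordered pair of cast members co-starred in this movie
    for left, right in combinations(ids, 2):
        if toy_g.has_edge(left, right):
            toy_g[left][right]["weight"] += 1
        else:
            toy_g.add_edge(left, right, weight=1)

# Degree counts distinct co-stars; the edge weight counts shared movies
ranked = sorted(toy_g.degree(), key=lambda pair: pair[1], reverse=True)
print(ranked[0])
print(toy_g["a1"]["a2"]["weight"])
```

Here `a1` has three distinct co-stars, so it ranks first by degree, while the `a1`/`a2` edge carries a weight of 2 because that pair appears together in two movies.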

In [69]:
g = nx.Graph() # Build the graph

In [70]:
with open("imdb_movies_1985to2022.json", "r") as in_file:
    for line in in_file:
        
        # Load the movie from this line
        this_movie = json.loads(line)
            
        # Create a node for every actor, storing the name as a node attribute
        for actor_id, actor_name in this_movie['actors']:
            g.add_node(actor_id, name=actor_name)
            
        # Iterate through the list of actors, generating all pairs.
        # Starting with the first actor in the list, pair it with all subsequent actors,
        # then continue to the second actor in the list and repeat.
        for i, (left_actor_id, left_actor_name) in enumerate(this_movie['actors']):
            for right_actor_id, right_actor_name in this_movie['actors'][i+1:]:
                if g.has_edge(left_actor_id, right_actor_id):
                    # This pair has co-starred before: increment the edge weight
                    g[left_actor_id][right_actor_id]['weight'] += 1
                else:
                    # First shared movie: add an edge for these actors
                    g.add_edge(left_actor_id, right_actor_id, weight=1)

# Rank actors by degree (number of distinct co-stars) and print the top 20
degree_list = list(g.degree())
degree_list.sort(key=lambda x: x[1], reverse=True)
for entry in degree_list[:20]:
    print(entry)

('nm0000616', 646)
('nm0000514', 346)
('nm0001744', 308)
('nm0261724', 303)
('nm0001803', 277)
('nm0000115', 259)
('nm0442207', 256)
('nm0004193', 243)
('nm0000246', 241)
('nm0000448', 237)
('nm0222881', 232)
('nm0001595', 225)
('nm0000799', 225)
('nm0000168', 225)
('nm2278431', 222)
('nm0001002', 219)
('nm0001367', 218)
('nm0000353', 214)
('nm0000151', 209)
('nm0159008', 209)
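
The ranking above lists actor IDs only. One possible way to make such output easier to read, shown here on a toy graph and not on the real data, is to store each actor's name as a node attribute and look it up when printing. The attribute key `name` and the IDs below are illustrative assumptions.

```python
import networkx as nx

toy_g = nx.Graph()
# Hypothetical actors; 'name' is an attribute key chosen for this sketch
toy_g.add_node("x1", name="Actor X")
toy_g.add_node("y1", name="Actor Y")
toy_g.add_node("z1", name="Actor Z")
toy_g.add_edge("x1", "y1", weight=1)
toy_g.add_edge("x1", "z1", weight=2)

# Sort (node, degree) pairs by degree, highest first
ranked = sorted(toy_g.degree(), key=lambda pair: pair[1], reverse=True)

# Translate each ID into its stored name; fall back to the ID if no name is set
named = [(toy_g.nodes[node].get("name", node), degree) for node, degree in ranked]
print(named)
```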


In [71]:
print("Nodes:", len(g.nodes))

Nodes: 270156


In [72]:
# If you want to explore this graph in Gephi or some other
# graph analysis tool, NetworkX makes it easy to export data.
# Here, we use the GraphML format, which Gephi can read
# natively and which preserves node and edge attributes (such as edge weights).
nx.write_graphml(g, "actors.graphml")
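
To check what an export like this preserves, the following sketch writes a small toy graph to a temporary GraphML file and reads it back with `nx.read_graphml`. The node IDs, the `name` attribute key, and the file name `toy_actors.graphml` are illustrative assumptions; the temporary directory keeps the sketch from touching `actors.graphml`.

```python
import os
import tempfile

import networkx as nx

toy_g = nx.Graph()
toy_g.add_node("x1", name="Actor X")  # 'name' is an illustrative attribute key
toy_g.add_node("y1", name="Actor Y")
toy_g.add_edge("x1", "y1", weight=3)

# Write to a temporary file so the sketch does not overwrite any real export
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_actors.graphml")
    nx.write_graphml(toy_g, path)
    reloaded = nx.read_graphml(path)

# Node and edge attributes should survive the round trip
print(reloaded.nodes["x1"]["name"])
print(reloaded["x1"]["y1"]["weight"])
```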

In [73]:
top_k = 10 # how many of the most central nodes to print