[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/jeljov/NAP2025/blob/main/SNA_Tutorial_Part1.ipynb)

## Introduction to Social Network Analysis: Network Creation and Visualization

We'll start by loading the required libraries. The only new one is [NetworkX](https://networkx.org/documentation/stable/index.html), the main Python library for network analysis.

In [None]:
from pathlib import Path
from collections import defaultdict

import pandas as pd

import matplotlib.pyplot as plt

import networkx as nx

import pickle

### Load and explore the data

We will use a data set about actors and the movies they played in, as well as the actors' nominations for the American Academy Awards (Oscars).
The data set is available from Kaggle as ["Oscar nominations and filmographies since 1972"](https://www.kaggle.com/datasets/milanjanosov/oscar-nominations-and-filmographies-since-1972). More information about this data set, and one approach to its use in network analysis, is available in [this DataViz article](https://nightingaledvs.com/50-years-of-oscars-acting-success-and-collaboration/).

A side note: you might be interested in exploring the [Cinemagoer](https://cinemagoer.github.io/) Python package for retrieving data about movies and people from the IMDb movie database.

In [None]:
from google.colab import files

data_files = files.upload()

In [None]:
filmographies = pd.read_csv('filmographies.csv')
nominations = pd.read_csv('nomination_stats.csv')

In [None]:
# Offline version

# filmographies = pd.read_csv(Path.cwd() / 'data' / 'filmographies.csv')
# nominations = pd.read_csv(Path.cwd() / 'data' / 'nomination_stats.csv')

We'll start by exploring the data.

In [None]:
filmographies.head()

In [None]:
filmographies.info()

In [None]:
filmographies.kind.value_counts()

In [None]:
filmographies.position.value_counts()

In [None]:
nominations.head()

In [None]:
nominations.info()

In [None]:
nominations.outcome.value_counts()

### Create a network of actors who acted in the same movie

We will create an undirected graph, since the relationship of acting in the same movie is mutual.

In [None]:
G = nx.Graph()

#### Add nodes to the network

Add a node for each actor / actress

In [None]:
actors_df = filmographies.loc[filmographies.position.isin(['actor', 'actress']) & (filmographies.kind=="movie")]
actors_df.shape

Check whether we need `name_id` as a unique identifier, or whether we can rely on actor / actress names to identify them uniquely.

In [None]:
actors_df.name.nunique() == actors_df.name_id.nunique()

Create a node for each actor / actress using their names as node identifiers and labels.

There are different ways of adding nodes to a graph, but the most commonly used one is the `add_nodes_from()` method, which receives either a list of node names or a list of tuples, each holding a node name and a dictionary of that node's attributes.
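
To make the two input forms concrete, here is a minimal sketch on a toy graph. The node names and attribute values are made up for illustration and are not taken from the data set.

```python
import networkx as nx

# Toy graph; names and attribute values are invented for illustration
toy = nx.Graph()

# Form 1: a plain list of node names
toy.add_nodes_from(["Actor A", "Actor B"])

# Form 2: a list of (node, attribute_dict) tuples
toy.add_nodes_from([
    ("Actor C", {"num_films": 12}),
    ("Actor D", {"num_films": 3}),
])

print(toy.number_of_nodes())
print(toy.nodes["Actor C"])  # attributes supplied via the tuple form
print(toy.nodes["Actor A"])  # no attributes were supplied for this node
```

In this tutorial we use the first form and attach the attributes in a separate step.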

In [None]:
actors = actors_df.name.unique().tolist()
print(f"Number of actors: {len(actors)}")

In [None]:
G.add_nodes_from(actors)
print(G)

To access graph nodes, use `G.nodes()` which returns an object of type `NodeView`.
It is a dict-like and set-like object that provides a *view* into the nodes and their attributes within a graph. It does not store a separate copy of the node data but rather reflects the current state of the graph.
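
The "view" behaviour can be checked with a small sketch on a toy graph (the node names are arbitrary): a `NodeView` obtained before a node is added still sees that node afterwards, whereas a list made from it is a frozen snapshot.

```python
import networkx as nx

toy = nx.Graph()
toy.add_nodes_from(["a", "b"])

view = toy.nodes            # NodeView: a live window onto the graph's nodes
snapshot = list(toy.nodes)  # plain list: an independent copy

toy.add_node("c")           # modify the graph after taking both

print("c" in view)          # set-like membership test on the view
print(len(view), len(snapshot))
print(view["a"])            # dict-like access to a node's attribute dict
```

This is also why the cells in this notebook wrap `G.nodes()` in `list(...)` before slicing: the view itself does not support slicing with `[:5]`.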

In [None]:
for node in list(G.nodes())[:5]:
    print(node)

Next, we add attributes to nodes. In particular, we will add three attributes:
* number of movies an actor / actress played in,
* number of Oscar nominations,
* number of Oscars won

In [None]:
num_films = actors_df.groupby('name').size()
num_films

In [None]:
num_films.describe()

In [None]:
for node in G.nodes():
    G.nodes[node]['num_films'] = num_films.get(node, 0)

In [None]:
nominees_df = nominations.loc[nominations.outcome == 'nominee']
num_nominations = nominees_df.groupby('name').size()
num_nominations

In [None]:
num_nominations.describe()

In [None]:
# num_nominations[num_nominations == 21]

In [None]:
for node in G.nodes():
    G.nodes[node]['num_nominations'] = num_nominations.get(node, 0)

In [None]:
winners_df = nominations.loc[nominations.outcome == 'winner']
num_awards = winners_df.groupby('name').size()
for node in G.nodes():
    G.nodes[node]['num_awards'] = num_awards.get(node, 0)

In [None]:
# setting data to True to show attribute values
for node in list(G.nodes(data=True))[:10]:
    print(node)
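
As an alternative to the explicit loops above, NetworkX offers `nx.set_node_attributes()`, which sets one attribute for many nodes at once. The sketch below uses a toy graph and a made-up dictionary of counts (in the notebook, something like `num_films.to_dict()` could play that role); it is one possible way to do this, not the only one.

```python
import networkx as nx

toy = nx.Graph()
toy.add_nodes_from(["a", "b", "c"])

# Made-up counts; note that node "c" has no entry
counts = {"a": 5, "b": 2}

# Step 1: give every node a default value of 0
nx.set_node_attributes(toy, 0, name="num_films")
# Step 2: overwrite the default for the nodes that have a count
nx.set_node_attributes(toy, counts, name="num_films")

for node, attrs in toy.nodes(data=True):
    print(node, attrs)
```

The two-step pattern mirrors what `num_films.get(node, 0)` does in the loops: nodes without a count end up with 0.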

#### Adding edges to the graph

Add edges between actors who acted in the same movie. The edge weight is the number of movies the two actors appeared in together.

In [None]:
edges_dict = dict()

for title_id, group in actors_df.groupby('title_id'):
    title_actors = group['name'].tolist()
    for i in range(len(title_actors)):
        for j in range(i+1, len(title_actors)):
            if (title_actors[i], title_actors[j]) in edges_dict.keys():
                edges_dict[(title_actors[i], title_actors[j])] += 1
            elif (title_actors[j], title_actors[i]) in edges_dict.keys():
                edges_dict[(title_actors[j], title_actors[i])] += 1
            else:
                edges_dict[(title_actors[i], title_actors[j])] = 1

In [None]:
for actor_pair, freq in edges_dict.items():
    source, target = actor_pair
    G.add_edge(source, target, weight=freq)

print("Edge count", G.number_of_edges())
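
The same co-occurrence counting can be written more compactly with `itertools.combinations` and `collections.Counter`. The sketch below runs on made-up cast lists rather than on `actors_df`. Note one assumption that differs from the loop above: it removes duplicate names within a cast before pairing, so an actor listed twice for the same title cannot be paired with themselves.

```python
from collections import Counter
from itertools import combinations

import networkx as nx

# Made-up cast lists keyed by a hypothetical title id
casts = {
    "t1": ["Ann", "Bob", "Cy"],
    "t2": ["Bob", "Ann"],
    "t3": ["Cy", "Dee"],
}

pair_counts = Counter()
for cast in casts.values():
    # sorted(set(...)) drops duplicates and puts every pair in one canonical
    # order, so ("Ann", "Bob") and ("Bob", "Ann") are counted together
    for pair in combinations(sorted(set(cast)), 2):
        pair_counts[pair] += 1

toy = nx.Graph()
toy.add_weighted_edges_from((u, v, w) for (u, v), w in pair_counts.items())

print(toy.number_of_edges())
print(toy["Ann"]["Bob"]["weight"])  # Ann and Bob share two titles
```

Sorting each pair removes the need to check both `(a, b)` and `(b, a)` in the dictionary.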

We can access edges via `G.edges()` and if we want to access edge attributes (e.g., weight), we set the `data` argument to True:

In [None]:
for edge in list(G.edges(data=True))[:10]:
    print(edge)

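A graph is commonly represented as an adjacency matrix or as an edge list. (Internally, NetworkX stores a graph as nested dictionaries, but it can convert the graph to either of these representations.)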

The **adjacency matrix** is a matrix of NxN dimension, where N is the number of nodes. It is typically sparse, with non-zero values indicating an edge between two nodes (in a weighted graph, the value is the edge weight). For undirected graphs, such as the current one, the matrix is symmetric.

In NetworkX, we can obtain the matrix representation as follows (here as a pandas DataFrame labelled with node names):

In [None]:
adj_matrix = nx.to_pandas_adjacency(G)
adj_matrix.iloc[:5,:5]
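
The symmetry property is easy to verify on a toy graph. The sketch below uses `nx.to_numpy_array()` instead of the pandas variant, purely to keep the example small; the node names and weights are invented.

```python
import networkx as nx

toy = nx.Graph()
toy.add_weighted_edges_from([("a", "b", 2), ("b", "c", 1)])

# Fix the row / column order explicitly so the output is easy to read
A = nx.to_numpy_array(toy, nodelist=["a", "b", "c"], weight="weight")

print(A)
print((A == A.T).all())  # an undirected graph gives a symmetric matrix
```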

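For large graphs, such as the current one, a dense NxN matrix is wasteful because most of its entries are zero. A more efficient representation is an **edge list**, which is a list of source - target node pairs, with the associated weight (or other edge attributes) if available. In NetworkX, the `adjacency_matrix()` function returns the adjacency matrix in a SciPy sparse format that stores only the non-zero entries. Printing it therefore produces an edge-list-like output: (row index, column index) pairs, each followed by the edge weight. Since the graph is undirected, each edge appears twice, once per direction: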

In [None]:
adj_matrix = nx.adjacency_matrix(G)
print(adj_matrix)
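
For an edge list with node names rather than matrix indices, one option is to iterate over `G.edges(data="weight")`; `nx.to_pandas_edgelist(G)` gives a comparable result as a DataFrame. The sketch below, on a toy path graph, also shows why the edge list is the more compact of the two representations.

```python
import networkx as nx

toy = nx.path_graph(5)                       # nodes 0..4 joined in a chain
nx.set_edge_attributes(toy, 1, name="weight")

# Edge list: one (source, target, weight) triple per edge
edge_list = list(toy.edges(data="weight"))
print(edge_list)

# Storage comparison: cells in a dense adjacency matrix vs. edge-list rows
n = toy.number_of_nodes()
print(n * n, len(edge_list))
```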

#### Visualising a graph

Since the actors graph is too large for an initial visualisation, we'll focus on a subset of it (a subgraph): actors who won multiple Oscars, connected if they co-starred in the same movie.

In [None]:
# default of 0 guards against nodes that lack the attribute
multi_oscars_filter = [node for node, attr in G.nodes(data=True) if attr.get('num_awards', 0) > 1]
G_multi_oscar_winners = G.subgraph(multi_oscars_filter)
print(G_multi_oscar_winners)
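
Note that `subgraph()` returns a read-only view of the original graph rather than a new graph. The toy sketch below (arbitrary node names) illustrates the difference between the view and an independent copy made with `.copy()`.

```python
import networkx as nx

toy = nx.Graph()
toy.add_edges_from([("a", "b"), ("b", "c"), ("c", "d")])

view = toy.subgraph(["a", "b", "c"])  # read-only view onto toy
independent = view.copy()             # standalone graph with its own data

toy.add_edge("a", "c")                # change the original afterwards

print(view.has_edge("a", "c"))        # the view reflects the change
print(independent.has_edge("a", "c")) # the copy does not
print(nx.is_frozen(view))             # the view cannot be modified directly
```

For visualisation the view is sufficient; a copy is needed only if the subgraph itself has to be modified.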

A key decision when creating a graph visualisation is how to lay out the graph nodes, that is, which layout algorithm, and hence which NetworkX layout function, to use.

NetworkX includes a variety of built-in layout algorithms. The following are those that are most often used:
* `spring_layout` (Fruchterman-Reingold algorithm): The layout that `nx.draw()` uses by default. It takes a force-directed approach to positioning nodes, simulating attractive forces between connected nodes and repulsive forces between all pairs of nodes.
* `circular_layout`: Arranges nodes in a circle. This is useful for emphasizing cyclic structures (when present).
* `random_layout`: Places nodes randomly within a unit square.
* `kamada_kawai_layout`: Uses a path-length cost-function to position nodes, often resulting in aesthetically pleasing layouts that reflect the graph's structure well.
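
Whichever layout function is chosen, the result has the same shape: a dictionary that maps each node to an (x, y) coordinate pair, which is then passed to the drawing functions as `pos`. A minimal sketch on a toy cycle graph:

```python
import networkx as nx

toy = nx.cycle_graph(6)

# Each layout function returns {node: array([x, y])}
layouts = {
    "spring": nx.spring_layout(toy, seed=1),
    "circular": nx.circular_layout(toy),
    "random": nx.random_layout(toy, seed=1),
}

for name, pos in layouts.items():
    x, y = pos[0]
    print(f"{name:>8}: node 0 at ({x:.2f}, {y:.2f})")
```

Because the layouts share this output format, swapping one for another in a plot only requires changing the line that computes `pos`.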

First, we create just a basic visualisation.

In [None]:
plt.figure(figsize=(12,12))

pos = nx.spring_layout(G_multi_oscar_winners,
                       seed=1,
                       k=0.75) # controls how close the nodes are; increasing it moves nodes apart

nx.draw_networkx_nodes(G_multi_oscar_winners, pos)
nx.draw_networkx_edges(G_multi_oscar_winners, pos)
nx.draw_networkx_labels(G_multi_oscar_winners, pos)

plt.title("Actors and actresses with multiple Oscars")

plt.axis('off')
plt.show()

We can make the visualisation more informative by:
* scaling nodes so that their size reflects one of the nodes' attributes, for example, the number of movies an actor played in
* setting the node colour to represent another node feature - for example, number of awards an actor received
* setting edge width to correspond to the strength of connection between two nodes

In [None]:
# making the node size a function of the number of films the actor / actress acted in
node_sizes = [30 + 9*G_multi_oscar_winners.nodes[n]['num_films'] for n in G_multi_oscar_winners.nodes()]

# getting the number of awards each actor / actress won, to serve for node color selection from a color map,
# so that the color of a node reflects the number of awards won
node_colors = [G_multi_oscar_winners.nodes[n]['num_awards'] for n in G_multi_oscar_winners.nodes()]

# extract edge weights to be used for determining the edge width
edge_width = [w['weight'] for _, _, w in G_multi_oscar_winners.edges(data=True)]

In [None]:
plt.figure(figsize=(12,12))

pos = nx.spring_layout(G_multi_oscar_winners, seed=18, k=0.75)
# pos = nx.kamada_kawai_layout(G_multi_oscar_winners, weight='weight')

nx.draw_networkx_nodes(G_multi_oscar_winners, pos,
                       node_size=node_sizes,
                       node_color=node_colors, cmap='coolwarm', alpha=0.4)
nx.draw_networkx_edges(G_multi_oscar_winners, pos,
                       width=edge_width, alpha=0.55)
nx.draw_networkx_labels(G_multi_oscar_winners, pos,
                        font_size=9,
                        horizontalalignment='right')

plt.title("Actors and actresses with multiple Oscars")

plt.axis('off')
plt.show()
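
When node colour encodes a numeric attribute, a colour bar helps the reader decode it. The sketch below shows one way to add it, relying on the fact that `draw_networkx_nodes()` returns a Matplotlib collection that can be passed to `colorbar()`. The toy graph and its `num_awards` values are made up.

```python
import matplotlib.pyplot as plt
import networkx as nx

# Toy graph with an invented num_awards attribute
toy = nx.Graph()
toy.add_nodes_from([
    ("a", {"num_awards": 2}),
    ("b", {"num_awards": 3}),
    ("c", {"num_awards": 4}),
])
toy.add_edges_from([("a", "b"), ("b", "c")])

awards = [toy.nodes[n]["num_awards"] for n in toy.nodes()]
pos = nx.circular_layout(toy)

fig, ax = plt.subplots(figsize=(4, 4))
# Keep the returned collection: the colour bar is built from it
drawn_nodes = nx.draw_networkx_nodes(toy, pos, node_color=awards,
                                     cmap="coolwarm", ax=ax)
nx.draw_networkx_edges(toy, pos, ax=ax)
nx.draw_networkx_labels(toy, pos, ax=ax)

fig.colorbar(drawn_nodes, ax=ax, label="Number of Oscars won")
ax.set_axis_off()
```

In a notebook the figure is displayed at the end of the cell; in a script, finish with `plt.show()`.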

#### Save graph

There are several common formats for storing graphs:
* GraphML: An XML-based format that is often used for storing and sharing edge and node attribute data.
* GML (Graph Modeling Language): A simpler, text-based format supported by many tools.
* GEXF (Graph Exchange XML Format): Another XML-based format, often used with visualization tools like Gephi.
* Edge List: A simple format where each line holds a pair of source and target nodes representing an edge

The use of these formats is recommended if one needs to share the graph with other, non-Python software.
However, if a graph is to be used in Python only, then serialising it in the usual way, via pickle, is recommended.
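
As an illustration of an attribute-preserving format, the sketch below writes a toy graph to GraphML and reads it back. The names, attribute values, and file name are made up, and a temporary directory is used so that nothing is left on disk.

```python
import tempfile
from pathlib import Path

import networkx as nx

# Toy graph with invented node and edge attributes
toy = nx.Graph()
toy.add_node("Ann Lee", num_awards=2)
toy.add_node("Bob Ray", num_awards=3)
toy.add_edge("Ann Lee", "Bob Ray", weight=4)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "toy.graphml"
    nx.write_graphml(toy, path)
    restored = nx.read_graphml(path)

print(restored.nodes(data=True))
print(restored.edges(data=True))
```

Both node and edge attributes survive the round trip, which is what makes GraphML suitable for sharing attribute data with other tools.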

In [None]:
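# Note: despite the .adjlist extension, write_edgelist() produces an edge list.
# Actor names contain spaces, so a tab delimiter is used (instead of the default
# single space) to keep the file parseable by nx.read_edgelist(..., delimiter='\t').
nx.write_edgelist(G, 'actors.adjlist', delimiter='\t')
files.download('actors.adjlist')

nx.write_edgelist(G_multi_oscar_winners, 'multi_oscar_winners.adjlist', delimiter='\t')
files.download('multi_oscar_winners.adjlist')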
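
One thing to watch for with edge lists: `write_edgelist()` separates fields with a single space by default, which clashes with node names that themselves contain spaces, as actor names do. The sketch below shows a round trip that sidesteps the problem with a tab delimiter. The names and the file name are made up.

```python
import tempfile
from pathlib import Path

import networkx as nx

toy = nx.Graph()
toy.add_edge("Ann Lee", "Bob Ray", weight=2)  # invented names containing spaces

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "toy.edgelist"
    # The same delimiter must be used for writing and for reading
    nx.write_edgelist(toy, path, delimiter="\t")
    restored = nx.read_edgelist(path, delimiter="\t")

print(restored.edges(data=True))
```

Note that an edge list stores edges and their attributes only, so node attributes and isolated nodes are not preserved.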

In [None]:
# Offline version

# nx.write_edgelist(G, Path.cwd() / 'graphs' / 'actors.adjlist', delimiter='\t')
# nx.write_edgelist(G_multi_oscar_winners, Path.cwd() / 'graphs' / 'multi_oscar_winners.adjlist', delimiter='\t')

In [None]:
with open('actors.pkl', 'wb') as fobj:
    pickle.dump(G, fobj)
files.download('actors.pkl')

with open('multi_oscar_winners.pkl', 'wb') as fobj:
    pickle.dump(G_multi_oscar_winners, fobj)
files.download('multi_oscar_winners.pkl')
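
Loading a pickled graph back is the mirror image of saving it. A minimal sketch on a toy graph, with a made-up file name in a temporary directory:

```python
import pickle
import tempfile
from pathlib import Path

import networkx as nx

toy = nx.Graph()
toy.add_node("a", num_awards=2)
toy.add_edge("a", "b", weight=3)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "toy.pkl"
    with open(path, "wb") as fobj:
        pickle.dump(toy, fobj)
    # Read in binary mode ('rb'), matching the 'wb' used for writing
    with open(path, "rb") as fobj:
        restored = pickle.load(fobj)

print(restored.nodes(data=True))
print(restored.edges(data=True))
```

Unlike the edge list, the pickle keeps node attributes as well as edge attributes. As with any pickle file, only load files that come from a trusted source.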

In [None]:
# Offline version

# with open(Path.cwd() / 'graphs' / 'actors.pkl', 'wb') as fobj:
#     pickle.dump(G, fobj)

# with open(Path.cwd() / 'graphs' / 'multi_oscar_winners.pkl', 'wb') as fobj:
#     pickle.dump(G_multi_oscar_winners, fobj)