---
title: "Generative<br>Network Analysis"
subtitle: "[DAY FOUR]{.kn-pink} [GESIS Fall Seminar in Computational Social Science]{.kn-blue}"
author:
  - name: John McLevey
    affiliations:
      - name: University of Waterloo
  - name: Johannes B. Gruber
    affiliations:
      - name: VU Amsterdam
output-dir: "../docs/"
format:
  revealjs:
    theme: [default, custom.scss]
    width: 1600
    height: 900
    embed-resources: true
    execute:
      echo: true
      warning: false
      cache: true
      freeze: true
    slide-number: false
    chalkboard: false
    preview-links: auto
    smaller: true
    fig-align: left
    fig-format: svg
    lightbox: true
    scrollable: true
    code-overflow: scroll
    code-fold: false
    code-line-numbers: true
    code-copy: hover
    reference-location: document
    tbl-cap-location: margin
    logo: media/logo_gesis.png
    footer: "[CC BY-SA 4.0]{.nord-footer}"
    email-obfuscation: javascript
highlight-style: "nord"
bibliography: references.bib
---



<!-- {{< include _day-4-origins.qmd >}} -->
<!-- {{< include _day-4-sna-topics.qmd >}} -->
<!-- {{< include _day-4-learning-objectives.qmd >}} -->
<!-- {{< include _day-4-community-detection.qmd >}} -->
<!-- {{< include _day-4-modularity-maximization-is-bad.qmd >}} -->
<!-- {{< include _day-4-force-directed-visualization.qmd >}} -->
<!-- {{< include _day-4-generative-modelling-networks.qmd >}} -->
<!-- Tutorials -->
<!-- {{< include _day-4-tutorial-plan.qmd >}} -->
<!-- {{< include _day-4-tutorial-political-blogs.qmd >}} -->

## Iterative Modelling

:::: {.columns}
::: {.column width="65%"}
The Enron Email Network

<br><br>

We'll use `graph-tool` to

- iteratively fit, improve, and compare Nested Stochastic Blockmodels by analyzing a directed email communication network between Enron employees
- conduct rigorous posterior inference about the network from a generative modelling perspective

[This will be a bit of a modelling marathon, so if all of this is new to you,<br>*focus on high-level logic*.]{.nord-light}
:::

::: {.column width="35%"}
:::
::::

::: {.notes}
In this tutorial, you'll learn how to use `graph-tool` to iteratively fit, improve, and compare Nested Stochastic Blockmodels (NSBMs) by analyzing a directed email communication network between Enron employees. You'll also learn how to conduct rigorous posterior inference about the network from a generative modelling perspective.

The models we'll develop here start simple and gradually increase in complexity, incorporating additional information about the Enron network or applying refinements to better estimate its structure. We will visualize, assess, and compare these models. Finally, we will analyze the posterior distribution of block partitions to quantify uncertainty, compute marginal probabilities for node block assignments, and determine whether there are other plausible explanations for the structure of our observed network.
:::

## Setup

With the `gt` Conda environment activated, import the packages we'll need:

```python
import math
import pickle
import random
from pprint import pprint

import graph_tool.all as gt
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.metrics import homogeneity_score

import icsspy
from icsspy.networks import (
    get_consensus_partition_from_posterior,
    plot_line_comparison,
)
from icsspy.paths import enron


icsspy.set_style()
print(f'Using graph-tool version {gt.__version__}')
```

> Using graph-tool version 2.77 (commit , )


## Load the Enron Data

We can load the Enron email data (Crick 2022)^[Tyler Crick. 2022. "The Enron email dataset, cleaned and validated." *Computational Social Science Lab Data*.] from the `icsspy` course package. The network itself has already been prepared and can be loaded directly into `graph-tool`.

```python
enron_email_network = str(enron / 'enron_graph.gt')
g = gt.load_graph(enron_email_network)
print(g)
```

Like the political blogs network, this network has internal property maps that store data about the nodes, the edges, and the graph itself. We can list the available property maps:

```python
g.list_properties()
```

- `vertex_lookup` is an internal dictionary that maps each email address to a unique integer ID.
- `label` is a string variable containing the email address.
- `position` is a string variable containing information about the job position associated with the email account.
- `edge weight` is the number of emails that vertex `i` sent to vertex `j`. Because the network is directed, emails from `i` to `j` and from `j` to `i` are counted separately.
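
To make the lookup and edge-weight descriptions concrete, here is a minimal pure-Python sketch of how directed email counts could be derived from raw sender-recipient pairs. The addresses and the names `emails`, `toy_lookup`, and `toy_weights` are hypothetical. This is not how the prepared Enron graph was built, only an illustration of what the two properties represent.

```python
from collections import Counter

# Hypothetical (sender, recipient) pairs; made-up addresses, not Enron data.
emails = [
    ("ann@example.com", "bob@example.com"),
    ("ann@example.com", "bob@example.com"),
    ("bob@example.com", "ann@example.com"),
    ("cat@example.com", "ann@example.com"),
]

# Give each address a unique integer ID in order of first appearance,
# analogous to the role the lookup dictionary plays for the real graph.
toy_lookup = {}
for sender, recipient in emails:
    for address in (sender, recipient):
        toy_lookup.setdefault(address, len(toy_lookup))

# Count emails per ordered pair (i, j). Direction matters, so
# (0, 1) and (1, 0) are different edges with their own weights.
toy_weights = Counter((toy_lookup[s], toy_lookup[r]) for s, r in emails)

for (i, j), count in sorted(toy_weights.items()):
    print(f"{i} -> {j}: {count}")
```

In this toy example, the edge from vertex 0 to vertex 1 has weight 2, while the reverse edge has weight 1, which is the asymmetry a directed, weighted network preserves.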



<!-- {{< include _day-4-tutorial-youtube.qmd >}} -->




# References

##