# Network Formation with Hamiltonian Markov Chain Monte Carlo





## Introduction

This project analyzes the formation and structure of trust networks in historical financial markets, using data from the New York Stock Exchange (NYSE) spanning 1883 to 1930. We explore how social connections, ethnicity, and latent characteristics influenced the sponsorship of new NYSE members and the subsequent approval process by existing members.

### Importing Libraries

In [1]:
import tensorflow as tf
import pandas as pd
import tensorflow_probability as tfp

### Constants

## Data

Given the structure of the data and the network formation model, the limiting likelihood function can be written in terms of the following components:

1. Node attributes:
$x_i = (\text{ethnicity}_i, \text{everCommittee}_i, \text{everSponsor}_i)$
2. Edge formation: Let $L_{ij,t}$ be an indicator for whether node $i$ chose node $j$ as a sponsor in transaction $t$.
3. Payoff structure:
$U_{ij,t} = U^*(x_i, x_j; s_{it}, s_{jt}, t_{ij,t}) + \sigma\varepsilon_{ij,t}$
    where:
    - $s_{it}$, $s_{jt}$ are node-specific network statistics (e.g., degree centrality)
    - $t_{ij,t}$ are edge-specific network statistics (e.g., common ethnicity)

4. Aggregate state variables:
    - $M^*(x, s \mid R)$: Reference distribution
    - $H^*(x; s)$: Inclusive value function

5. Limiting link frequency probability:
$$\mu_0(L_{ij,t} = 1, s_{it}, s_{jt} \mid x_i, x_j) = \frac{s_{it}\, s_{jt} \exp\left(U^*(x_i, x_j; s_{it}, s_{jt}) + U^*(x_j, x_i; s_{jt}, s_{it})\right)}{\left(1 + H^*(x_i; s_{it})\right)\left(1 + H^*(x_j; s_{jt})\right)} \times M^*(x_i, s_{it} \mid R)\, M^*(x_j, s_{jt} \mid R)$$

6. Committee voting: Let $v_{kt}$ be the vote (white ball or black ball) of committee member $k$ in transaction $t$, and let $j1$ and $j2$ index the two sponsors.
$$P(v_{kt} = \text{white} \mid x_i, x_{j1}, x_{j2}, x_k) = \Phi\left(\alpha x_i + \beta_1 x_{j1} + \beta_2 x_{j2} + \gamma x_k + \delta_1 d(Z_i, Z_k) + \delta_2 d(Z_{j1}, Z_k) + \delta_3 d(Z_{j2}, Z_k)\right)$$
    where:
    - $\Phi$ is the standard normal CDF
    - $Z_i$, $Z_{j1}$, $Z_{j2}$, $Z_k$ are latent positions in a low-dimensional Euclidean space
    - $d(\cdot,\cdot)$ is a distance function in this space
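
To make the limiting link frequency in item 5 concrete, here is a minimal sketch that evaluates it for a single pair in plain Python. The function name `limiting_link_frequency` and all numeric inputs are hypothetical; $U^*$, $H^*$, and $M^*$ are passed in as precomputed scalars rather than estimated from data.

```python
import math

def limiting_link_frequency(s_i, s_j, u_ij, u_ji, h_i, h_j, m_i, m_j):
    """Evaluate the item-5 expression for one (i, j) pair from scalar inputs."""
    numerator = s_i * s_j * math.exp(u_ij + u_ji)
    denominator = (1.0 + h_i) * (1.0 + h_j)
    return numerator / denominator * m_i * m_j

# Hypothetical values chosen only for illustration
mu = limiting_link_frequency(s_i=2.0, s_j=3.0, u_ij=0.0, u_ji=0.0,
                             h_i=1.0, h_j=2.0, m_i=0.5, m_j=0.5)
print(mu)
```

Nothing in this expression bounds it by 1 for arbitrary inputs (large $s$ or $U^*$ values push it up), so any code that takes `1 - prob` should check that the inputs keep the result inside the unit interval.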


In [2]:
# Read the data files
node_data = pd.read_csv('data/nyse_node_sp1.csv', header=None, 
                        names=['name', 'ever_committee', 'node_id', 'ethnicity', 'ever_sponsor'])
edge_data = pd.read_csv('data/nyse_edge_buy_sp_sp1.csv', header=None,
                        names=['buyer_id', 'sponsor1_id', 'sponsor2_id', 'f1', 'f2', 'f3', 'f4', 'blackballs', 'whiteballs', 'year'])
committee_data = pd.read_csv('data/nyse_edge_buy_com1.csv', header=None,
                             names=['buyer_id', 'committee_id', 'f1', 'f2', 'f3', 'f4', 'blackballs', 'whiteballs', 'year'])



In [3]:
def process_data(node_data, edge_data, committee_data):
    node_attrs = node_data.set_index('node_id').to_dict('index')
    
    # Initialize network statistics
    network_stats = {node_id: {'degree': 0, 'sponsor_count': 0} for node_id in node_attrs}
    
    transactions = []
    for _, row in edge_data.iterrows():
        buyer_id = row['buyer_id']
        sponsor1_id = row['sponsor1_id']
        sponsor2_id = row['sponsor2_id']
        year = row['year']
        
        # Update network statistics
        network_stats[buyer_id]['degree'] += 2
        network_stats[sponsor1_id]['degree'] += 1
        network_stats[sponsor2_id]['degree'] += 1
        network_stats[sponsor1_id]['sponsor_count'] += 1
        network_stats[sponsor2_id]['sponsor_count'] += 1
        
        committee_members = committee_data[(committee_data['buyer_id'] == buyer_id) & 
                                           (committee_data['year'] == year)]['committee_id'].tolist()
        
        transactions.append({
            'buyer_id': buyer_id,
            'sponsor1_id': sponsor1_id,
            'sponsor2_id': sponsor2_id,
            'committee_members': committee_members,
            'year': year,
            'whiteballs': row['whiteballs'],
            'blackballs': row['blackballs']
        })
    
    return node_attrs, transactions, network_stats

node_attrs, transactions, network_stats = process_data(node_data, edge_data, committee_data)
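
The model indexes network statistics by transaction ($s_{it}$), whereas `process_data` accumulates them over the full sample. One possible alternative, sketched below, records each node's statistics as they stood before a transaction. The helper `snapshot_stats` and the toy transactions are hypothetical, and the sketch assumes that ordering by `year` is an acceptable proxy for transaction order.

```python
from collections import defaultdict

def snapshot_stats(transactions):
    """Return, per transaction, the statistics of the buyer and sponsors
    as they stood before that transaction was recorded."""
    running = defaultdict(lambda: {'degree': 0, 'sponsor_count': 0})
    snapshots = []
    for t in sorted(transactions, key=lambda t: t['year']):
        ids = (t['buyer_id'], t['sponsor1_id'], t['sponsor2_id'])
        # Copy the current state before applying this transaction's updates
        snapshots.append({node_id: dict(running[node_id]) for node_id in ids})
        running[t['buyer_id']]['degree'] += 2
        for sponsor_id in ids[1:]:
            running[sponsor_id]['degree'] += 1
            running[sponsor_id]['sponsor_count'] += 1
    return snapshots

# Hypothetical toy transactions (ids and years are made up)
toy = [
    {'buyer_id': 3, 'sponsor1_id': 1, 'sponsor2_id': 2, 'year': 1890},
    {'buyer_id': 4, 'sponsor1_id': 1, 'sponsor2_id': 3, 'year': 1891},
]
snaps = snapshot_stats(toy)
print(snaps[1][1])  # sponsor 1 as seen at the second transaction
```

The update rule mirrors the one in `process_data` (the buyer gains two links, each sponsor one), so the final running totals match `network_stats`; only the point in time at which they are read differs.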

## Model

In [1]:
# Define utility function
@tf.function
def U_star(xi, xj, si, sj, tij, theta):
    return tf.tensordot(theta, tf.concat([xi, xj, si, sj, tij], axis=0), axes=1)

# Define reference distribution
@tf.function
def M_star(x, s, R, eta):
    return tf.exp(tf.tensordot(eta, tf.concat([x, s, R], axis=0), axes=1))

# Define inclusive value function
@tf.function
def H_star(x, s, gamma):
    return tf.exp(tf.tensordot(gamma, tf.concat([x, s], axis=0), axes=1))

# Define limiting link probability
@tf.function
def mu_star_0(Lijt, sit, sjt, xi, xj, R, theta, eta, gamma):
    numerator = sit * sjt * tf.exp(U_star(xi, xj, sit, sjt, tf.constant([]), theta) + 
                                   U_star(xj, xi, sjt, sit, tf.constant([]), theta))
    denominator = (1 + H_star(xi, sit, gamma)) * (1 + H_star(xj, sjt, gamma))
    prob = (numerator / denominator) * M_star(xi, sit, R, eta) * M_star(xj, sjt, R, eta)
    
    # Use Lijt to determine whether we want P(link = 1) or P(link = 0)
    return tf.where(Lijt == 1, prob, 1 - prob) 

# Define committee voting probability
@tf.function
def P_vote(v, xi, xj1, xj2, xk, zi, zj1, zj2, zk, alpha, beta1, beta2, gamma, delta):
    # Distances from the committee member's latent position (zk) to those of
    # the buyer (zi) and the two sponsors (zj1, zj2)
    latent_term = delta[0]*tf.norm(zi - zk) + \
                  delta[1]*tf.norm(zj1 - zk) + \
                  delta[2]*tf.norm(zj2 - zk)
    # The latent term is subtracted here, so each delta carries the opposite
    # sign of the corresponding coefficient in the voting equation in the Data section
    index = tf.tensordot(alpha, xi, axes=1) + tf.tensordot(beta1, xj1, axes=1) + \
            tf.tensordot(beta2, xj2, axes=1) + tf.tensordot(gamma, xk, axes=1) - latent_term
    prob = tfp.distributions.Normal(loc=0., scale=1.).cdf(index)
    
    # Use v to determine whether we want P(vote = 1) or P(vote = 0)
    return tf.where(v == 1, prob, 1 - prob)


In [ ]:
# Log limiting likelihood function
@tf.function
def log_likelihood(params, node_attrs, transactions, Z):
    theta, eta, gamma, alpha, beta1, beta2, gamma_vote, delta, mu_Z, Sigma_Z = params
    
    # Encode each node's attributes as a float vector in the order of x_i
    # (assumes ethnicity is numerically coded in the CSV)
    x = {node_id: tf.constant([a['ethnicity'], a['ever_committee'], a['ever_sponsor']],
                              dtype=tf.float32)
         for node_id, a in node_attrs.items()}
    
    def stats(node_id):
        # Node-specific network statistics (degree, sponsor_count)
        s = network_stats[node_id]
        return tf.constant([s['degree'], s['sponsor_count']], dtype=tf.float32)
    
    ##TODO: what is R? An empty tensor stands in for it until it is specified.
    R = tf.constant([])
    
    ll = tf.constant(0., dtype=tf.float32)
    
    for t in transactions:
        buyer_id = t['buyer_id']
        sponsor1_id = t['sponsor1_id']
        sponsor2_id = t['sponsor2_id']
        buyer_stats = stats(buyer_id)
        
        # Sponsor choice likelihood
        for sponsor_id in (sponsor1_id, sponsor2_id):
            ll += tf.math.log(mu_star_0(tf.constant(1.), buyer_stats, stats(sponsor_id),
                                        x[buyer_id], x[sponsor_id], R, theta, eta, gamma))
        
        # Add log probability for not choosing other potential sponsors.
        # mu_star_0 already returns 1 - prob when Lijt == 0, so it is not
        # complemented again, and the candidate's own statistics are used.
        for potential_sponsor_id in node_attrs:
            if potential_sponsor_id not in (sponsor1_id, sponsor2_id):
                ll += tf.math.log(mu_star_0(tf.constant(0.), buyer_stats, stats(potential_sponsor_id),
                                            x[buyer_id], x[potential_sponsor_id],
                                            R, theta, eta, gamma))
        
        # Committee voting likelihood: every member is assigned the
        # transaction's majority outcome
        v = tf.constant(1. if t['whiteballs'] > t['blackballs'] else 0., dtype=tf.float32)
        for committee_id in t['committee_members']:
            ll += tf.math.log(P_vote(v, x[buyer_id], x[sponsor1_id], x[sponsor2_id], x[committee_id],
                                     Z[buyer_id], Z[sponsor1_id], Z[sponsor2_id], Z[committee_id],
                                     alpha, beta1, beta2, gamma_vote, delta))
    
    # Add prior for latent positions. scale_diag takes standard deviations,
    # so use the square root of the variances on the diagonal of Sigma_Z.
    prior_Z = tfp.distributions.MultivariateNormalDiag(
        loc=mu_Z, scale_diag=tf.sqrt(tf.linalg.diag_part(Sigma_Z)))
    for node_id in Z:
        ll += prior_Z.log_prob(Z[node_id])
    
    return ll
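
The committee voting term can be checked by hand with a small standalone calculation. The sketch below evaluates the probit probability from item 6 in plain Python, with the distance terms added as in the equation. The names `normal_cdf` and `vote_probability` are hypothetical, the covariate contribution is collapsed into a single `linear_term`, and the latent positions and coefficients are made-up values.

```python
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def vote_probability(linear_term, delta, z_buyer, z_s1, z_s2, z_member):
    """P(white ball) with the three distance terms added to the probit index."""
    distances = (math.dist(z_buyer, z_member),
                 math.dist(z_s1, z_member),
                 math.dist(z_s2, z_member))
    index = linear_term + sum(coef * d for coef, d in zip(delta, distances))
    return normal_cdf(index)

# Hypothetical inputs: no covariate contribution, and negative coefficients
# so that latent distance lowers the approval probability
delta = (-1.0, -1.0, -1.0)
p_near = vote_probability(0.0, delta, (0, 0), (0, 0), (0, 0), (0, 0))
p_far = vote_probability(0.0, delta, (3, 4), (0, 0), (0, 0), (0, 0))
print(p_near, p_far)
```

Probabilities this far in the tail make the log term very large in magnitude, and in float32 the CDF can round to exactly 0 or 1, which turns the log into an infinity. Clipping probabilities away from 0 and 1 before taking logs is a common safeguard.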

## Inference

In [ ]:
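# Set up HMC
num_results = 1000
num_burnin_steps = 1000

# Sizes of the parameter blocks, in the order log_likelihood unpacks them
block_sizes = [10, 5, 5, 5, 5, 5, 5, 3, 2, 2]
num_params = sum(block_sizes)
node_ids = list(node_attrs)
latent_dim = 2

# Initialize parameters
initial_state = tf.concat([
    tf.zeros(10),  # theta
    tf.zeros(5),   # eta
    tf.zeros(5),   # gamma
    tf.zeros(5),   # alpha
    tf.zeros(5),   # beta1
    tf.zeros(5),   # beta2
    tf.zeros(5),   # gamma_vote
    tf.zeros(3),   # delta
    tf.zeros(2),   # mu_Z
    tf.ones(2),    # diag(Sigma_Z)
    tf.random.normal([len(node_ids) * latent_dim])  # Z (flattened)
], axis=0)

# HMC passes the flat state as a single argument, so unpack it into the
# parameter blocks and per-node latent positions before calling log_likelihood
def target_log_prob(state):
    *blocks, diag_Sigma_Z = tf.split(state[:num_params], block_sizes)
    Z_matrix = tf.reshape(state[num_params:], [len(node_ids), latent_dim])
    Z = {node_id: Z_matrix[i] for i, node_id in enumerate(node_ids)}
    # log_likelihood reads the diagonal of Sigma_Z, so rebuild the matrix
    params = (*blocks, tf.linalg.diag(diag_Sigma_Z))
    return log_likelihood(params, node_attrs, transactions, Z)

# Define the HMC transition kernel
step_size = tf.Variable(0.01)
adaptive_hmc = tfp.mcmc.SimpleStepSizeAdaptation(
    tfp.mcmc.HamiltonianMonteCarlo(
        target_log_prob_fn=target_log_prob,
        num_leapfrog_steps=10,
        step_size=step_size),
    num_adaptation_steps=int(num_burnin_steps * 0.8))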
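
For intuition about what the kernel above does on each transition, here is a minimal plain-Python sketch of one HMC step with leapfrog integration. It targets a one-dimensional standard normal rather than the model's posterior, uses a fixed step size instead of the adaptation wrapper, and all names (`hmc_step`, `log_prob`, `grad_log_prob`) are hypothetical. It illustrates the technique and is not how TensorFlow Probability implements it internally.

```python
import math
import random

def log_prob(q):
    """Log density of a standard normal, up to an additive constant."""
    return -0.5 * q * q

def grad_log_prob(q):
    return -q

def hmc_step(q, step_size, num_leapfrog_steps, rng):
    """One HMC transition: sample momentum, run leapfrog, accept or reject."""
    p = rng.gauss(0.0, 1.0)
    q_new, p_new = q, p
    # Leapfrog: half momentum step, alternating full steps, closing half step
    p_new += 0.5 * step_size * grad_log_prob(q_new)
    for step in range(num_leapfrog_steps):
        q_new += step_size * p_new
        if step < num_leapfrog_steps - 1:
            p_new += step_size * grad_log_prob(q_new)
    p_new += 0.5 * step_size * grad_log_prob(q_new)
    # Metropolis correction based on the change in total energy
    h_old = -log_prob(q) + 0.5 * p * p
    h_new = -log_prob(q_new) + 0.5 * p_new * p_new
    if rng.random() < math.exp(min(0.0, h_old - h_new)):
        return q_new, True
    return q, False

rng = random.Random(0)
q, samples, accepted = 0.0, [], 0
for _ in range(2000):
    q, ok = hmc_step(q, step_size=0.15, num_leapfrog_steps=10, rng=rng)
    samples.append(q)
    accepted += ok

mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(accepted / len(samples), mean, var)
```

The sample mean and variance should land near 0 and 1. In the notebook's setup, `SimpleStepSizeAdaptation` tunes the step size during the first 80% of the burn-in steps instead of fixing it by hand.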