In [None]:
%run clone_git_on_colab.py

In [None]:
from course_settings import set_tf_nthreads
set_tf_nthreads(4)

# Deep sets and graph networks

The ML models we have looked at so far make the assumption that we have a fixed-dimensional vector of input features. In reality that might not always be the case. Some examples:

* Sequences (text, audio, video)
* Point clouds (e.g. points in 3D space)
* Lists of objects (e.g. particles in a collision)
* Graphs with different numbers of nodes and different numbers of connections for each node

For sequences, one approach is to use recurrent neural networks (RNNs), which maintain a state that is updated as the input is processed step by step. However, these still need a defined ordering of the inputs, and they have known disadvantages: most prominently, difficulty modeling "long-range" correlations between inputs and difficulty parallelizing, since they are sequential in nature.

Another approach is to use models that apply **permutation invariant** transformations to the inputs. Both deep sets and graph networks make use of this. [**Transformers**](https://arxiv.org/abs/1706.03762), which are very popular as of 2023, can be viewed as graph networks in which all nodes are connected to each other.

## Deep sets

The simplest approach to a permutation invariant transformation is a **per-point transformation** ($\phi$) followed by a **permutation invariant aggregation**, typically the sum/mean or min/max. The aggregated output can then be transformed ($\rho$) by any means, e.g. another MLP.

![](figures/deep_set_transformation.png)

See [arXiv:1703.06114](https://arxiv.org/abs/1703.06114) for a detailed discussion.
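
The structure described above can be sketched in a few lines of NumPy. This is only an illustration: the weights `W_phi` and `W_rho` are random placeholders (not trained, not part of the notebook), and single linear maps stand in for the MLPs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical weights for illustration: phi maps 3 features -> 8, rho maps 8 -> 1
W_phi = rng.normal(size=(3, 8))
W_rho = rng.normal(size=(8, 1))

def phi(points):
    # per-point transformation, applied to every row independently
    return np.maximum(points @ W_phi, 0.0)

def rho(pooled):
    # transformation of the aggregated representation
    return pooled @ W_rho

def deep_set(points):
    # the sum over the set axis is the permutation invariant aggregation
    return rho(phi(points).sum(axis=0))

points = rng.normal(size=(5, 3))        # a set of 5 points with 3 features each
shuffled = points[rng.permutation(5)]   # same set, different order

out_original = deep_set(points)
out_shuffled = deep_set(shuffled)
print(np.allclose(out_original, out_shuffled))
```

Because the aggregation removes the set axis, the same function also accepts sets of a different size, e.g. `deep_set(points[:2])`.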


### Application to jets in Higgs dataset

Remember the missing values in the dataset for the [HiggsChallenge](HiggsChallenge.ipynb)? Those occurred since we had a non-fixed length list of jets in each event (0, 1 or 2). Maybe we can embed the jets into a fixed length vector using a permutation invariant transformation.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import RobustScaler
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input, GlobalAveragePooling1D, Masking
from tensorflow.keras.callbacks import History

In [None]:
df = pd.read_csv('data/atlas-higgs-challenge-2014-v2.csv.gz')
n_sig_tot = df["Weight"][df.Label == "s"].sum()
n_bkg_tot = df["Weight"][df.Label == "b"].sum()
# comment this out if you want to run on the full dataset
df = df.sample(frac=0.1)

First, we separate the jet features and other features:

In [None]:
jet_cols = sum([[f"PRI_{obj}_{field}" for field in ["pt", "eta", "phi"]] for obj in ["jet_leading", "jet_subleading"]], [])
jet_cols

We also exclude variables that are derived from the jets:

In [None]:
excluded_cols = ['DER_deltaeta_jet_jet', 'DER_mass_jet_jet', 'DER_prodeta_jet_jet', 'DER_lep_eta_centrality']

In [None]:
other_cols = [
    col for col in df.columns
    if col.startswith(("PRI", "DER")) and col not in jet_cols and col not in excluded_cols
]
other_cols

We will turn the jet features into a 3-D array of shape `(nevents, max_njets, n_jet_features)`:

In [None]:
X_jet = df[jet_cols].to_numpy().reshape(-1, 2, 3)
X_jet
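
To see what this reshape does, here is a small sketch with made-up numbers (not taken from the dataset). It assumes the six columns are ordered as in `jet_cols`, i.e. leading jet first, then subleading jet.

```python
import numpy as np

# Two made-up events with the six jet columns in the order of jet_cols:
# leading (pt, eta, phi), then subleading (pt, eta, phi)
flat = np.array([
    [50.0, 1.2, -0.3, 30.0, -0.5, 2.1],        # two jets
    [40.0, 0.1, 1.0, -999.0, -999.0, -999.0],  # one jet, second slot missing
])

# each row of 6 values becomes a (2 jets) x (3 features) block
jets = flat.reshape(-1, 2, 3)
print(jets.shape)
print(jets[1, 1])  # the missing subleading jet of the second event
```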

The rest of the features just stays a 2-D array as usual:

In [None]:
X_other = df[other_cols].to_numpy()
X_other

We still need to replace missing values, which can occur for the quantity `DER_mass_MMC`, by 0:

In [None]:
X_other[X_other == -999] = 0

In [None]:
y = (df.Label == "s").to_numpy()
weight = df['Weight'].to_numpy()

In [None]:
(
    X_jet_train, X_jet_test,
    X_other_train, X_other_test,
    y_train, y_test,
    weight_train, weight_test,
) = train_test_split(X_jet, X_other, y, weight)

Now, let's scale the features. For the jets we have to be a bit careful only to consider non-missing values in the scaling. Also the scikit-learn scalers can only deal with 2D arrays - so let's define a custom scaler:

In [None]:
class JetScaler:
    def __init__(self, mask_value=-999):
        self.mask_value = mask_value
        self.scaler = RobustScaler()
    
    def fill_nan(self, X):
        "replace missing values by nan"
        X[(X == self.mask_value).all(axis=-1)] = np.nan
        
    def fit(self, X):
        X = np.array(X) # copy
        self.fill_nan(X)
        X = X.reshape(-1, X.shape[-1]) # make 2D
        self.scaler.fit(X)
        
    def transform(self, X):
        orig_shape = X.shape
        X = np.array(X).reshape(-1, X.shape[-1])
        self.fill_nan(X)
        X = self.scaler.transform(X)
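        # replace missing values by 0 (the second positional argument
        # of nan_to_num is `copy`, so pass the fill value by keyword)
        X = np.nan_to_num(X, nan=0.0)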
        return X.reshape(*orig_shape) # turn back into 3D
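
The idea behind `JetScaler` can be illustrated without scikit-learn. The sketch below assumes that `RobustScaler` with default settings centers on the median and scales by the interquartile range while disregarding NaNs during the fit. The numbers are made up.

```python
import numpy as np

X = np.array([
    [[50.0, 1.0, 0.5], [30.0, -1.0, 2.0]],
    [[40.0, 0.0, 1.0], [-999.0, -999.0, -999.0]],  # second jet missing
    [[60.0, 2.0, -1.0], [20.0, 0.5, 0.0]],
])

# flatten to 2D and mark missing jets as NaN
flat = X.reshape(-1, X.shape[-1]).copy()
missing = (flat == -999).all(axis=-1)
flat[missing] = np.nan

# NaN-aware median and interquartile range, computed per feature
center = np.nanmedian(flat, axis=0)
q25, q75 = np.nanpercentile(flat, [25, 75], axis=0)

# scale, then replace the missing entries by 0 and restore the 3D shape
scaled = (flat - center) / (q75 - q25)
scaled = np.nan_to_num(scaled, nan=0.0).reshape(X.shape)
print(scaled[1])
```

The missing jet does not pull the median towards -999, and after scaling it ends up as a row of zeros.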

In [None]:
jet_scaler = JetScaler()
jet_scaler.fit(X_jet_train)

In [None]:
X_jet_train_scaled = jet_scaler.transform(X_jet_train)

In [None]:
other_scaler = RobustScaler()
other_scaler.fit(X_other_train)

In [None]:
X_other_train_scaled = other_scaler.transform(X_other_train)

As before, we balance the weights so that signal and background have the same sum of weights and the average weight is 1:

In [None]:
class_weight_signal = 1 / weight_train[y_train==1].sum()
class_weight_background = 1 / weight_train[y_train==0].sum()

In [None]:
def transform_weight(weight, y):
    weight = np.array(weight)
    weight[y==0] *= class_weight_background
    weight[y==1] *= class_weight_signal
    return weight / weight.mean()

In [None]:
weight_train_scaled = transform_weight(weight_train, y_train)
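
A toy example (made-up weights and labels) shows the effect of this reweighting: both classes end up with the same total weight, and the mean weight is 1.

```python
import numpy as np

weight = np.array([0.5, 1.5, 2.0, 4.0, 10.0])
y = np.array([1, 1, 0, 0, 0])

balanced = weight.copy()
# normalize each class to a total weight of 1
balanced[y == 1] /= weight[y == 1].sum()
balanced[y == 0] /= weight[y == 0].sum()
# rescale so that the average weight is 1
balanced /= balanced.mean()

sum_sig = balanced[y == 1].sum()
sum_bkg = balanced[y == 0].sum()
print(sum_sig, sum_bkg, balanced.mean())
```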

Now the model - we use the functional API of keras

**Note:** When the keras `Dense` layer is applied to a 3D array, it acts on the last axis only, so the same weights are applied independently to each element along the second dimension. This is precisely what we want for our per-point transformation $\phi$.
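
The note above can be checked with plain NumPy. The sketch assumes that a dense layer on a 3D input contracts only the last axis with its kernel; `W` and `b` are random placeholders for the kernel and bias.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(4, 2, 3))   # (batch, n_jets, n_features)
W = rng.normal(size=(3, 5))      # placeholder kernel
b = rng.normal(size=5)           # placeholder bias

# contraction over the last axis only, broadcast over batch and jet axes
batched = x @ W + b

# the same weights applied to each jet slot separately
per_jet = np.stack([x[:, j] @ W + b for j in range(x.shape[1])], axis=1)

print(np.allclose(batched, per_jet))
```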

In [None]:
def make_model():
    input_jets = Input(shape=(2, 3), name="jets")
    jets = input_jets
    input_other = Input(shape=(X_other_train.shape[1],), name="other")

    # embed the jets using 3 hidden layers (shared per-jet)
    jets = Dense(100, activation="relu")(jets)
    jets = Dense(100, activation="relu")(jets)
    jets = Dense(100, activation="relu")(jets)
    
    # take the mean/average as a permutation invariant operation
    # note: since we still process a sequence of fixed length 2, this could in principle
    # receive contributions from non-existing jets if the NN encodes the 0s into a non-zero vector.
    # We could use a Masking layer, but that has problems (produces NaN) when the sequence
    # is completely empty, so we would need something custom, which we don't do here
    # (it seems to still work reasonably well)
    jets = tf.keras.layers.GlobalAveragePooling1D()(jets)
    
    # 3 hidden layers for the other features
    other = input_other
    other = Dense(100, activation="relu")(other)
    other = Dense(100, activation="relu")(other)
    other = Dense(100, activation="relu")(other)
    
    # concatenate embedded jets and other features and add final hidden layer + output
    out = tf.keras.layers.concatenate([jets, other])
    out = Dense(100, activation="relu")(out)
    out = Dense(1, activation="sigmoid")(out)

    return tf.keras.Model(inputs=[input_jets, input_other], outputs=[out])

model = make_model()
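
The comments in the model mention that a custom aggregation would be needed to ignore non-existing jets while staying well defined for events without any jets. One possible way to do this is sketched below in NumPy. `masked_mean` is a hypothetical helper and not part of the notebook or of Keras.

```python
import numpy as np

def masked_mean(embedded, valid):
    """Average over the set axis using only the valid entries."""
    valid = valid[..., np.newaxis].astype(embedded.dtype)   # (batch, n, 1)
    total = (embedded * valid).sum(axis=1)
    count = valid.sum(axis=1)
    # events without any valid entry get a zero vector instead of 0/0
    return total / np.maximum(count, 1.0)

embedded = np.array([
    [[1.0, 2.0], [3.0, 4.0]],   # two jets
    [[5.0, 6.0], [9.0, 9.0]],   # one jet, second slot is padding
    [[9.0, 9.0], [9.0, 9.0]],   # no jets at all
])
valid = np.array([[True, True], [True, False], [False, False]])

pooled = masked_mean(embedded, valid)
print(pooled)
```

Clamping the count to 1 is a design choice: empty events are mapped to a zero vector, which the subsequent layers can still process.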

In [None]:
tf.keras.utils.plot_model(model, show_shapes=True)

In [None]:
model.compile(loss="binary_crossentropy", optimizer="Adam")

In [None]:
history = History()

In [None]:
model.fit(
    {"jets": X_jet_train_scaled, "other": X_other_train_scaled},
    y_train,
    sample_weight=weight_train_scaled,
    epochs=10,
    batch_size=64,
    shuffle=True,
    validation_split=0.2,
    callbacks=[history],
)

In [None]:
pd.DataFrame(history.history).plot()

In [None]:
X_jet_test_scaled = jet_scaler.transform(X_jet_test)
X_other_test_scaled = other_scaler.transform(X_other_test)
weight_test_scaled = transform_weight(weight_test, y_test)

In [None]:
y_pred_train = model.predict({"jets": X_jet_train_scaled, "other": X_other_train_scaled}, verbose=True)[:, 0]
y_pred_test = model.predict({"jets": X_jet_test_scaled, "other": X_other_test_scaled}, verbose=True)[:, 0]

In [None]:
from sklearn.metrics import roc_curve

In [None]:
from mltools import ams

In [None]:
def ams_scan(y, y_prob, weights, label):
    fpr, tpr, thr = roc_curve(y, y_prob, sample_weight=weights)
    ams_vals = ams(tpr * n_sig_tot, fpr * n_bkg_tot)
    print("{}: Maximum AMS {:.3f} for pcut {:.3f}".format(label, ams_vals.max(), thr[np.argmax(ams_vals)]))
    return thr, ams_vals
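
The `ams` function comes from the course's `mltools` module, whose implementation is not shown here. The sketch below assumes it follows the approximate median significance as defined for the Higgs challenge, with a regularisation term `b_reg = 10`. The name `ams_sketch` is hypothetical.

```python
import numpy as np

def ams_sketch(s, b, b_reg=10.0):
    """Approximate median significance for signal yield s and background yield b."""
    s = np.asarray(s, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.sqrt(2.0 * ((s + b + b_reg) * np.log1p(s / (b + b_reg)) - s))

# for s << b the AMS approaches the simple estimate s / sqrt(b + b_reg)
small = ams_sketch(10.0, 1000.0)
simple = 10.0 / np.sqrt(1000.0 + 10.0)
print(small, simple)
```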

In [None]:
plt.plot(*ams_scan(y_train, y_pred_train, weight_train, "Train"), label="Train")
plt.plot(*ams_scan(y_test, y_pred_test, weight_test, "Test"), label="Test")
plt.xlim(0.8, 1.)
plt.legend()

### Application to top-tagging dataset

Sets are a natural representation for objects in particle physics. Let's apply this to the jet constituents of the dataset from the [CNNTopTagging](CNNTopTagging.ipynb) notebook.

We have prepared a subset of this dataset in original form containing the 4-momenta (Energy, px, py, pz) of up to 200 jet constituents:

In [None]:
top_tagging_path = "data/top_tagging_with_adjacency.npz"

In [None]:
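import os

# download the file only if it is not already present
if not os.path.exists(top_tagging_path):
    import requests
    url = "https://cloud.physik.lmu.de/index.php/s/AtESAET6JK6DiWZ/download"
    res = requests.get(url)
    res.raise_for_status()  # avoid writing an error page to disk
    with open(top_tagging_path, "wb") as f:
        f.write(res.content)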

In [None]:
npz_file = np.load(top_tagging_path)

In [None]:
X = npz_file["jet_4mom"]
y = npz_file["y"]

Here, the missing values are filled with `0`:

In [None]:
X

We can reuse the `JetScaler` we defined for the Higgs Dataset:

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

scaler = JetScaler(mask_value=0)
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

Here we can use a simple Sequential stack of layers since we only use the jet constituents as inputs:

In [None]:
model = tf.keras.Sequential([
    Masking(input_shape=X_train.shape[1:]),
    Dense(100, activation="relu"),
    Dense(100, activation="relu"),
    Dense(100, activation="relu"),
    GlobalAveragePooling1D(),
    Dense(100, activation="relu"),
    Dense(100, activation="relu"),
    Dense(100, activation="relu"),
    Dense(1, activation="sigmoid"),
])

Here we were able to use a [Masking](https://keras.io/guides/understanding_masking_and_padding/) layer since the sequence is never completely empty.

Again, the first layers operate independently on each constituent:

In [None]:
model.summary()

In [None]:
model.compile(loss="binary_crossentropy", optimizer="Adam")

In [None]:
history = History()

In [None]:
history = model.fit(
    X_train,
    y_train,
    validation_split=0.2,
    epochs=10,
    batch_size=32,
    shuffle=True,
    callbacks=[history],
)

In [None]:
pd.DataFrame(history.history).plot()

In [None]:
scores = model.predict(X_test)

In [None]:
fpr, tpr, thr = roc_curve(y_test, scores)

In [None]:
def plot_top_tagging_performance(fpr, tpr):
    plt.plot(tpr, 1. / fpr)
    plt.ylabel("QCD jet rejection")
    plt.xlabel("Top quark jet efficiency")
    plt.yscale("log")

    print("Top quark jet selection efficiency at 10^3 QCD jet rejection: ", np.max(tpr[fpr < 0.001]))
    print("QCD jet rejection at 30% Top quark jet efficiency: ", 1. / np.min(fpr[tpr > 0.3]))
    
plot_top_tagging_performance(fpr, tpr)

## Graph convolutions/Graph neural networks

Similar to convolutional networks, where we update the state of each pixel by aggregating over neighboring pixels, we can perform a *graph convolution* by aggregating over neighboring nodes in a graph:

![cnn vs gcn](figures/cnn_vs_gcn.jpg)

(figure from https://zhuanlan.zhihu.com/p/51990489)

In the "Deep sets" language, such a graph convolution corresponds to a *permutation equivariant* transformation of the set of nodes: if the nodes are reordered, the outputs are reordered in the same way, provided the aggregation is done in a permutation invariant way (e.g. sum/mean/min/max).

A rather simple implementation is given by the update rule introduced in [arXiv:1609.02907](https://arxiv.org/abs/1609.02907):

$ H^{(l+1)} = \sigma(\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}H^{(l)}W^{(l)}) $

where $A$ is the *adjacency matrix*, $D$ the *degree matrix*, $H^{(l)}$ the hidden state of layer $l$ and $W^{(l)}$ the weight matrix of layer $l$. The tilde above $A$ and $D$ indicates that self-loops were added (all nodes are neighbors of themselves), i.e. $\tilde{A} = A + I$.

An equivalent formulation is

$ h_i^{(l+1)} = \sigma\left(\sum\limits_{j\in\mathcal{N}(i)}\frac{1}{c_{ij}}h^{(l)}_j W^{(l)}\right) $

where $ \mathcal{N}(i) $ is the set of neighbors of node $i$ and $c_{ij} = \sqrt{N_i}\sqrt{N_j}$, with $N_i$ being the number of neighbors of node $i$ (including the node itself, because of the self-loops).
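
The equivalence of the two formulations, and the permutation equivariance of the update, can be checked numerically on a small made-up graph. The activation $\sigma$ is left out since it acts element-wise, and `H` and `W` are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)

# small undirected graph with 4 nodes (edges 0-1, 1-2, 2-3), self-loops added
A = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)
A_tilde = A + np.eye(4)
deg = A_tilde.sum(axis=1)   # N_i, counting the self-loop

H = rng.normal(size=(4, 3))  # hidden state, one row per node
W = rng.normal(size=(3, 2))  # weight matrix

# matrix form
D_inv_sqrt = np.diag(deg ** -0.5)
matrix_form = D_inv_sqrt @ A_tilde @ D_inv_sqrt @ H @ W

# per-node form: sum over neighbors j of node i with weight 1 / c_ij
node_form = np.zeros((4, 2))
for i in range(4):
    for j in range(4):
        if A_tilde[i, j]:
            node_form[i] += H[j] @ W / np.sqrt(deg[i] * deg[j])

# permutation equivariance: relabeling the nodes relabels the output rows
perm = np.array([2, 0, 3, 1])
A_perm = A_tilde[perm][:, perm]
D_perm = np.diag(A_perm.sum(axis=1) ** -0.5)
permuted_form = D_perm @ A_perm @ D_perm @ H[perm] @ W

print(np.allclose(matrix_form, node_form))
print(np.allclose(permuted_form, matrix_form[perm]))
```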

In [None]:
def normalize_adjacency(adj):
    """
    calculate outer product of the degree vector to the power -1/2 and multiply element-wise with the adjacency matrix
    
    this corresponds to the D^{-1/2}AD^{-1/2} normalization suggested in Kipf & Welling (arXiv:1609.02907)
    """
    deg_diag = tf.reduce_sum(adj, axis=2)
    deg12_diag = tf.where(deg_diag > 0, deg_diag ** -0.5, 0)
    return (
        tf.matmul(
            tf.expand_dims(deg12_diag, axis=2),
            tf.expand_dims(deg12_diag, axis=1),
        )
        * adj
    )

In [None]:
class GraphConv(tf.keras.layers.Layer):
    """
    Simple graph convolution. Should be equivalent to Kipf & Welling (arXiv:1609.02907)
    """

    def __init__(self, units, activation="relu"):
        super().__init__()
        self.dense = tf.keras.layers.Dense(units)
        self.activation = tf.keras.activations.get(activation)

    def call(self, inputs):
        feat, adjacency = inputs
        return self.activation(tf.matmul(normalize_adjacency(adjacency), self.dense(feat)))

One question now is: what is the graph in our dataset? Since the CNN architecture worked well, it makes sense to define the graph by taking a certain number of nearest neighbors in the $\eta-\phi$ plane, which was previously also used to define the image pixels.
We prepared adjacency matrices for 7 nearest neighbors:
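
How the prepared adjacency matrices were built is not shown in this notebook. A minimal sketch of one way to construct a k-nearest-neighbor adjacency matrix in the $\eta-\phi$ plane could look as follows. `knn_adjacency` is a hypothetical helper, and details such as self-loops or symmetrization may differ from the prepared file.

```python
import numpy as np

def knn_adjacency(eta, phi, k):
    """Connect every point to its k nearest neighbors in the eta-phi plane."""
    d_eta = eta[:, None] - eta[None, :]
    d_phi = phi[:, None] - phi[None, :]
    d_phi = (d_phi + np.pi) % (2 * np.pi) - np.pi   # wrap to [-pi, pi)
    dist = np.hypot(d_eta, d_phi)
    np.fill_diagonal(dist, np.inf)                  # a point is not its own neighbor
    nearest = np.argsort(dist, axis=1)[:, :k]       # indices of the k closest points
    adj = np.zeros((len(eta), len(eta)))
    rows = np.repeat(np.arange(len(eta)), k)
    adj[rows, nearest.ravel()] = 1.0
    return adj

rng = np.random.default_rng(3)
eta = rng.normal(scale=0.4, size=10)   # made-up constituent positions
phi = rng.normal(scale=0.4, size=10)
adj = knn_adjacency(eta, phi, k=3)
print(adj.sum(axis=1))
```

Note that a matrix built this way is in general not symmetric: node $j$ can be among the nearest neighbors of node $i$ without the reverse being true.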

In [None]:
npz_file = np.load(top_tagging_path)

In [None]:
X = npz_file["jet_4mom"]
y = npz_file["y"]
A = npz_file["adj"]

In [None]:
def ptetaphi(X):
    px = X[..., 1]
    py = X[..., 2]
    pz = X[..., 3]
    pt = np.hypot(px, py)
    eta = np.arcsinh(pz / pt)
    # arctan2 covers the full range (-pi, pi]; arcsin would fold px < 0 onto px > 0
    phi = np.arctan2(py, px)
    return np.stack([pt, eta, phi], axis=-1)

In [None]:
def plot_graph(x, a):
    plt.figure(figsize=(12, 8))
    nconst = (~(a == 0).all(axis=-1)).sum()
    x = x[:nconst]
    x = ptetaphi(x)
    plt.scatter(x[:, 1], x[:, 2], s=100)
    for i in range(nconst):
        for j in range(nconst):
            if a[i, j] or a[j, i]:
                plt.plot([x[i, 1], x[j, 1]], [x[i, 2], x[j, 2]], color="C0")

Let's plot a few random graphs:

In [None]:
i = np.random.randint(0, len(X))
plot_graph(X[i], A[i])

In [None]:
X_train, X_test, y_train, y_test, A_train, A_test = train_test_split(X, y, A)
scaler = JetScaler(mask_value=0)
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

In [None]:
def get_model(units=100, num_nodes=200, num_features=4):
    adjacency_input = Input(shape=(num_nodes, num_nodes), name='adjacency')
    feature_input = Input(shape=(num_nodes, num_features), name='features')

    # constituent-level transformations
    p = feature_input
    for i in range(3):
        p = Dense(units, activation="relu")(p)

    for i in range(3):
        p = GraphConv(units, activation="relu")([p, adjacency_input])

    x = GlobalAveragePooling1D()(p)

    # event-level transformations
    for i in range(3):
        x = Dense(units, activation="relu")(x)

    output = Dense(1, activation="sigmoid")(x)

    return tf.keras.models.Model(
        inputs=[adjacency_input, feature_input],
        outputs=[output]
    )
model = get_model()

In [None]:
tf.keras.utils.plot_model(model, show_shapes=True)

In [None]:
model.compile(loss="binary_crossentropy", optimizer="Adam")

In [None]:
history = History()

In [None]:
model.fit(
    {"features": X_train, "adjacency": A_train},
    y_train,
    validation_split=0.2,
    epochs=10,
    batch_size=32,
    shuffle=True,
    callbacks=[history]
)

In [None]:
pd.DataFrame(history.history).plot()

In [None]:
scores = model.predict({"features": X_test, "adjacency": A_test})

In [None]:
fpr, tpr, thr = roc_curve(y_test, scores)

In [None]:
plot_top_tagging_performance(fpr, tpr)

Some Notes:

- We made the task quite hard for the neural network by feeding in the raw 4-momentum information
- Possible improvements:
  - Go to the $\eta-\phi$ plane
  - Transform coordinates to be relative to the jet center
  - Use graph operations that depend on the distance between points instead of absolute position (e.g. [EdgeConv](https://arxiv.org/abs/1801.07829))
  - Just train longer and/or on more data (we only used 10k samples)
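
As an illustration of the first two improvements, the following sketch converts the 4-momenta of one jet to $(p_T, \Delta\eta, \Delta\phi)$ relative to the jet axis. It assumes that the jet axis is taken from the summed constituent momenta and that padded (all-zero) constituents have been removed beforehand. `relative_eta_phi` is a hypothetical helper and the numbers are made up.

```python
import numpy as np

def relative_eta_phi(constituents):
    """constituents: array of shape (n, 4) holding (E, px, py, pz)."""
    px, py, pz = constituents[:, 1], constituents[:, 2], constituents[:, 3]
    pt = np.hypot(px, py)
    eta = np.arcsinh(pz / pt)
    phi = np.arctan2(py, px)

    # jet axis from the summed constituent momenta
    jet_px, jet_py, jet_pz = px.sum(), py.sum(), pz.sum()
    jet_eta = np.arcsinh(jet_pz / np.hypot(jet_px, jet_py))
    jet_phi = np.arctan2(jet_py, jet_px)

    d_eta = eta - jet_eta
    # wrap the azimuthal difference to [-pi, pi)
    d_phi = (phi - jet_phi + np.pi) % (2 * np.pi) - np.pi
    return np.stack([pt, d_eta, d_phi], axis=-1)

# a jet pointing close to phi = pi, where naive differences would wrap around
constituents = np.array([
    [110.0, -100.0,  5.0, 40.0],
    [ 55.0,  -50.0, -4.0, 20.0],
    [ 12.0,  -10.0,  1.0,  5.0],
])
rel = relative_eta_phi(constituents)
print(rel)
```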

## Further possibilities

We only scratched the surface of what is possible with graph neural networks. In general, you can have arbitrary update rules that, in each step, update the features of nodes (V), edges (E) and global aggregated features (u). Each of these 3 categories can receive input from any of the others:

![graph network general update rule](figures/graph-network.png)

(figure from [arXiv:1806.01261](https://arxiv.org/abs/1806.01261))
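
One step of such a general update could be sketched as follows. This is a toy version under several assumptions: the update functions `phi_e`, `phi_v` and `phi_u` are random linear maps followed by `tanh` (stand-ins for small MLPs), all aggregations are sums, and the graph with its `senders` and `receivers` is made up.

```python
import numpy as np

rng = np.random.default_rng(4)

def layer(n_in, n_out):
    # hypothetical stand-in for a small MLP: one random linear map followed by tanh
    W = rng.normal(size=(n_in, n_out))
    return lambda x: np.tanh(x @ W)

n_nodes, n_edges, d_v, d_e, d_u = 4, 5, 3, 2, 2
V = rng.normal(size=(n_nodes, d_v))   # node features
E = rng.normal(size=(n_edges, d_e))   # edge features
u = rng.normal(size=d_u)              # global features
senders = np.array([0, 1, 2, 3, 0])   # edge k goes from senders[k] ...
receivers = np.array([1, 2, 3, 0, 2]) # ... to receivers[k]

phi_e = layer(d_e + 2 * d_v + d_u, d_e)
phi_v = layer(d_e + d_v + d_u, d_v)
phi_u = layer(d_e + d_v + d_u, d_u)

# 1. edge update from the edge itself, both end nodes and the global features
u_edges = np.broadcast_to(u, (n_edges, d_u))
E_new = phi_e(np.concatenate([E, V[receivers], V[senders], u_edges], axis=1))

# 2. node update from the summed incoming edges, the node itself and the global features
incoming = np.zeros((n_nodes, d_e))
np.add.at(incoming, receivers, E_new)
u_nodes = np.broadcast_to(u, (n_nodes, d_u))
V_new = phi_v(np.concatenate([incoming, V, u_nodes], axis=1))

# 3. global update from aggregated edges, aggregated nodes and the old global features
u_new = phi_u(np.concatenate([E_new.sum(axis=0), V_new.sum(axis=0), u]))

print(E_new.shape, V_new.shape, u_new.shape)
```

The graph convolution used earlier in this notebook is a special case without edge or global features.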

More info/tutorials:

http://tkipf.github.io/graph-convolutional-networks/  
https://docs.dgl.ai/tutorials/models/1_gnn/1_gcn.html  
https://docs.dgl.ai/generated/dgl.nn.pytorch.conv.GraphConv.html#

For more advanced applications with graph neural networks have a look at specialized libraries:

[Spektral (tensorflow)](https://graphneural.network/)  
[DGL (mainly pytorch, but also tensorflow)](https://dgl.ai)  
[PyTorch Geometric](https://pytorch-geometric.readthedocs.io)

<div class="alert alert-warning">
If you actually want to implement graph networks, better consult these instead of manually building them. The examples in this tutorial are meant for educational purposes!
</div>