In [1]:
import sys

import numpy as np

sys.path.append("..")
from src.model.model import HyMMSBM
from src.data.data_io import PREPROCESSED_DATASETS, load_data

# Training the *Hy-MMSBM* model

In this notebook we show how to train the *Hy-MMSBM* model on a given dataset. 

Here we show the inner workings of our implementation. A command line interface for training the model 
is also available through `main_inference.py`, which lets you train the model without working with the underlying Python code. 

To run that script directly, see the instructions in `README.md`.

## Data loading

In [2]:
# If you have not downloaded the data yet, do so by running the following command
# !python ../download_data.py

Load a real hypergraph. The internal representation used by our code is an instance of `Hypergraph` (see `src.data.representation`). 

We suggest loading the data via the `load_data` function, which works with three types of inputs:
- the name of one of the preprocessed real hypergraphs we used in our experimental analyses;
- the paths to two *.txt* files containing the hyperedge list and weight list respectively;
- the path to a [pickle](https://docs.python.org/3/library/pickle.html) file containing an instance of `Hypergraph`.

We suggest the first option for the real datasets we make available, the second for custom datasets.

For example, the following three commands load the same hypergraph:

```python
load_data(real_dataset="justice")

load_data(
    hye_file="../data/examples/justice_dataset/hyperedges.txt",
    weight_file="../data/examples/justice_dataset/weights.txt",
)

load_data(
    pickle_file="../data/examples/justice_dataset/justice.pkl",
)
```

A complete list of the available real datasets is provided here:

In [3]:
PREPROCESSED_DATASETS

['arxiv',
 'amazon_5core',
 'curated_gene_disease_associations',
 'enron-email',
 'high_school',
 'hospital',
 'house-bills',
 'house-committees',
 'justice',
 'primary_school',
 'senate-bills',
 'senate-committees',
 'trivago-clicks_2core',
 'trivago-clicks_5core',
 'trivago-clicks_10core',
 'walmart-trips_2core',
 'walmart-trips_3core',
 'walmart-trips_4core',
 'workspace1']

Let's load the justice dataset.

In [4]:
justice_hyg = load_data(real_dataset="justice")

## Model training

Training the model simply requires specifying the number $K$ of communities and whether the model needs to be assortative.

In [5]:
%%time

model = HyMMSBM(
    K=2,
    assortative=False,
)
model.fit(justice_hyg)

CPU times: user 15.1 ms, sys: 2.32 ms, total: 17.4 ms
Wall time: 15.8 ms


After inference, the parameters can be retrieved as attributes of the model.

In [6]:
model.u[:5]

array([[9.13714317e-02, 4.15569806e+00],
       [2.33500300e-01, 1.96606102e-01],
       [3.85680385e-02, 5.45674009e+00],
       [1.93659181e-04, 8.90207793e-02],
       [1.90403422e-01, 7.08358947e-02]])

In [7]:
model.w

array([[32.16840873,  0.25994342],
       [ 0.25994342,  8.72325656]])
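
Since `model.u` is a plain NumPy array, it can be post-processed with standard array operations. The sketch below copies the five rows printed above and derives a hard community label per node, together with row-normalized memberships. This post-processing is only an illustration and is not part of the package; the names `hard_labels` and `relative_u` are hypothetical.

```python
import numpy as np

# First five rows of `model.u`, copied from the output above.
u = np.array([
    [9.13714317e-02, 4.15569806e+00],
    [2.33500300e-01, 1.96606102e-01],
    [3.85680385e-02, 5.45674009e+00],
    [1.93659181e-04, 8.90207793e-02],
    [1.90403422e-01, 7.08358947e-02],
])

# Hard assignment: index of the community with the largest membership weight.
hard_labels = u.argmax(axis=1)

# Relative memberships: rescale each row so that it sums to one.
# Assumption: no row is identically zero, otherwise this divides by zero.
relative_u = u / u.sum(axis=1, keepdims=True)

print(hard_labels)  # [1 0 1 1 0]
```

The hard labels discard the mixed-membership information, so they are best read as a summary rather than as the model output itself.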

#### Additional training options

Other options can be specified:
- in the model initialization, one can specify:
    - the maximum hyperedge size (which is otherwise inferred once a hypergraph is observed).
    - the priors for $w$ and $u$, as rates of exponential distributions. These can be specified as non-negative numbers (priors equal to 0 correspond to no prior), or as numpy arrays if a non-uniform prior is expected.
- at inference time, one can specify the number of EM steps.

For example:

In [8]:
%%time

model = HyMMSBM(
    K=2,
    assortative=True,
    max_hye_size=15,
    u_prior=1.,
    w_prior=10.,
)
model.fit(
    justice_hyg,
    n_iter=500,
)

CPU times: user 539 ms, sys: 2.89 ms, total: 542 ms
Wall time: 541 ms
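
To give some intuition for the prior rates: an exponential prior with rate $\lambda$ on a non-negative parameter $x$ has log-density $\log \lambda - \lambda x$, so, up to an additive constant, it contributes a penalty $-\lambda x$ to the objective. The sketch below evaluates that penalty for a scalar rate and for an array of rates. The function `exponential_log_prior` is a hypothetical helper written for this illustration; it is not the package's implementation, which may handle priors differently in detail.

```python
import numpy as np


def exponential_log_prior(values, rate):
    """Sum of exponential log-priors, up to an additive constant.

    `rate` is either a scalar (uniform prior) or an array that
    broadcasts against `values` (non-uniform prior).
    """
    values = np.asarray(values, dtype=float)
    rate = np.asarray(rate, dtype=float)
    if np.any(rate < 0):
        raise ValueError("prior rates must be non-negative")
    return -float(np.sum(rate * values))


toy_u = np.array([[0.5, 2.0], [1.0, 0.0]])  # toy values, not real output

uniform = exponential_log_prior(toy_u, 1.0)
no_prior = exponential_log_prior(toy_u, 0.0)  # rate 0: zero penalty
per_entry = exponential_log_prior(toy_u, np.array([[1.0, 0.0], [2.0, 0.0]]))
```

With a rate of 0 the penalty vanishes, which matches the statement that a prior equal to 0 corresponds to no prior.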


Note that in `main_inference.py` and in all our experiments, we repeat the procedure above several times (set in the script through the command line argument `--training_rounds`) and return only the model realization with the highest log-likelihood.

As a final option, if either $w$ or $u$ is provided at initialization, it is treated as a fixed parameter and is not inferred. For example, one can fix the affinity matrix $w$ and infer only the community assignments $u$:
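
The repeat-and-keep-the-best procedure can be sketched generically as follows. How the log-likelihood of a fitted model is computed is not shown in this notebook, so `train_once` and `score` are hypothetical placeholders: the first is assumed to train one model from a given seed, the second to return the quantity to maximize. The toy usage replaces both with simple functions so that the block runs on its own.

```python
from typing import Callable, Tuple, TypeVar

M = TypeVar("M")


def best_of_rounds(
    train_once: Callable[[int], M],
    score: Callable[[M], float],
    rounds: int,
) -> Tuple[M, float]:
    """Run `train_once` for `rounds` seeds and keep the best-scoring result."""
    if rounds < 1:
        raise ValueError("rounds must be at least 1")
    best_model = train_once(0)
    best_score = score(best_model)
    for seed in range(1, rounds):
        candidate = train_once(seed)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best_model, best_score = candidate, candidate_score
    return best_model, best_score


# Toy stand-ins: the "model" is just the seed, and the score peaks at 3.
toy_model, toy_score = best_of_rounds(
    train_once=lambda seed: seed,
    score=lambda m: -((m - 3) ** 2),
    rounds=6,
)
```

In practice, `train_once` would build and fit a fresh `HyMMSBM` instance, and `score` would evaluate its log-likelihood on the training hypergraph.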

In [9]:
fixed_w = np.eye(2)

model = HyMMSBM(
    K=2,
    w=fixed_w,
    u_prior=0.,
)
model.fit(
    justice_hyg,
    n_iter=500,
)

The affinity matrix is unchanged, while the community assignments have been inferred as usual:

In [10]:
model.w is fixed_w

True
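
The check above uses `is`, which tests object identity: it confirms that the model holds the very same array object that was passed in. Identity says nothing about values on its own, so a value-based comparison can be a useful complement, for instance if an array is copied somewhere along the way. A minimal sketch of the difference, using only NumPy and variables made up for the illustration:

```python
import numpy as np

fixed_w = np.eye(2)

alias = fixed_w          # second name for the same array object
copied = fixed_w.copy()  # equal values, but a distinct object

is_alias_identical = alias is fixed_w              # True: same object
is_copy_identical = copied is fixed_w              # False: different object
is_copy_equal = np.array_equal(copied, fixed_w)    # True: same values
```

To also verify that a fixed matrix was not modified in place during training, one option is to keep a copy taken before fitting and compare against it with `np.array_equal` afterwards.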

In [11]:
model.u[:5]

array([[0.        , 1.34752201],
       [0.        , 0.25226192],
       [0.        , 1.62877116],
       [0.        , 0.02700151],
       [0.        , 0.16165644]])