# Big Graph Generation (MAG240m)

## Overview

In this notebook, we walk through the complete process of generating a synthetic dataset based on the MAG240m dataset.
We will cover advanced SynGen features such as memory mapping and independent chunk generation.

## Prepare the dataset

In [1]:
data_path = '/raid/ogbn_mag240m/'
preprocessed_path = '/raid/ogbn_mag240m_syngen'

In [2]:
!python -m syngen preprocess --source-path=$data_path --dataset=ogbn_mag240m --destination-path=$preprocessed_path --cpu --use-cache

INFO:__main__:|    Synthetic Graph Generation Tool    |
INFO:syngen.utils.io_utils:writing to file /raid/ogbn_mag240m_syngen/writes_list.parquet parquet
INFO:syngen.utils.io_utils:writing to file /raid/ogbn_mag240m_syngen/affiliated_with_list.parquet parquet
INFO:syngen.utils.io_utils:writing to file /raid/ogbn_mag240m_syngen/cites_list.parquet parquet
INFO:syngen.cli.commands.preprocess:ogbn_mag240m successfully preprocessed into /raid/ogbn_mag240m_syngen


In [3]:
!cat $preprocessed_path/graph_metadata.json

{
    "nodes": [
        {
            "name": "paper",
            "count": 121751666,
            "features": [
                {
                    "name": "feat_0",
                    "dtype": "float16",
                    "feature_type": "continuous",
                    "feature_file": "paper_feats.npy"
                },
                {
                    "name": "feat_1",
                    "dtype": "float16",
                    "feature_type": "continuous",
                    "feature_file": "paper_feats.npy"
                },
                {
                    "name": "feat_2",
                    "dtype": "float16",
                    "feature_type": "continuous",
                    "feature_file": "paper_feats.npy"
                },
                {
                    "name": "feat_3",
                    "dtype": "float16",
                    "feature_type": "continuous",
                    "feature_file": "paper_feats.npy"
                },
          

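The metadata file is long because every feature column gets its own entry, so `cat` output is hard to scan. As a minimal sketch, the helper below condenses the `nodes` section into one line per node type. It relies only on the keys visible in the output above (`name`, `count`, `features`, `feature_file`); the function name `summarize_nodes` and the trimmed sample are illustrative and not part of SynGen.

```python
import json


def summarize_nodes(cfg):
    """Return one summary line per entry in the config's "nodes" list."""
    lines = []
    for node in cfg.get("nodes", []):
        features = node.get("features", [])
        # Collect the distinct files that back this node's features.
        files = sorted({feat.get("feature_file", "?") for feat in features})
        lines.append(
            f"node {node['name']}: count={node['count']}, "
            f"{len(features)} features in {files}"
        )
    return lines


# Trimmed stand-in that mirrors the first two feature entries shown above.
sample = json.loads("""
{
    "nodes": [
        {
            "name": "paper",
            "count": 121751666,
            "features": [
                {"name": "feat_0", "dtype": "float16",
                 "feature_type": "continuous", "feature_file": "paper_feats.npy"},
                {"name": "feat_1", "dtype": "float16",
                 "feature_type": "continuous", "feature_file": "paper_feats.npy"}
            ]
        }
    ]
}
""")

summary = summarize_nodes(sample)
for line in summary:
    print(line)
```

To apply it to the real file, load `graph_metadata.json` from `preprocessed_path` with `json.load` and pass the resulting dictionary to the helper.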
## Create Configurations Directory

In [4]:
configs_dir = 'ogbn_mag240m_configs'

In [5]:
!mkdir -p $configs_dir

## Prepare simple SynGen Configuration

In [6]:
!python -m syngen mimic-dataset --output-file=$configs_dir/simple.json --dataset-path $preprocessed_path --tab-gen uniform

INFO:__main__:|    Synthetic Graph Generation Tool    |
INFO:syngen.cli.commands.mimic_dataset:SynGen Configuration saved into ogbn_mag240m_configs/simple.json


In [7]:
!cat $configs_dir/simple.json

{
    "nodes": [
        {
            "name": "paper",
            "count": 121751666,
            "features": [
                {
                    "name": "feat_0",
                    "dtype": "float16",
                    "feature_type": "continuous",
                    "feature_file": "paper_feats.npy"
                },
                {
                    "name": "feat_1",
                    "dtype": "float16",
                    "feature_type": "continuous",
                    "feature_file": "paper_feats.npy"
                },
                {
                    "name": "feat_2",
                    "dtype": "float16",
                    "feature_type": "continuous",
                    "feature_file": "paper_feats.npy"
                },
                {
                    "name": "feat_3",
                    "dtype": "float16",
                    "feature_type": "continuous",
                    "feature_file": "paper_feats.npy"
                },
          

In [8]:
!python -m syngen synthesize --config-path $configs_dir/simple.json --save-path /raid/ogbn_mag240m_simple --verbose

INFO:__main__:|    Synthetic Graph Generation Tool    |
0it [00:00, ?it/s]
100%|█████████████████████████████████████████| 768/768 [10:29<00:00,  1.22it/s]
100%|█████████████████████████████████████████████| 2/2 [00:00<00:00,  2.97it/s]
0it [00:00, ?it/s]
NODE paper FIT TOOK: 906.03
FIT NODES TOOK: 906.03
DEBUG:root:Initialized logger
DEBUG:root:Using seed: 66
DEBUG:root:Fit results dst_src: None
DEBUG:root:Fit results src_dst: (0.4493778749717661, 0.16335407041150202, 0.1362311795696754, 0.2510368750470564)
EDGE writes STRUCTURAL FIT TOOK: 110.91
DEBUG:root:Initialized logger
DEBUG:root:Using seed: 20
DEBUG:root:Fit results dst_src: None
DEBUG:root:Fit results src_dst: (0.37499999944120643, 0.12499999906867743, 0.12500000055879357, 0.3750000009313226)
EDGE affiliated_with STRUCTURAL FIT TOOK: 6.61
DEBUG:root:Initialized logger
DEBUG:root:Using seed: 99
DEBUG:root:Fit results: (0.4468597402097436, 0.14895324673658117, 0.14895324673658117, 0.25523376631709405)
EDGE cites STRUCTURAL FIT 

In [9]:
!du -ah /raid/ogbn_mag240m_simple

2.9G	/raid/ogbn_mag240m_simple/writes_list.parquet
2.1G	/raid/ogbn_mag240m_simple/paper_tabular_features/year_label.npy
193G	/raid/ogbn_mag240m_simple/paper_tabular_features/paper_feats.npy
195G	/raid/ogbn_mag240m_simple/paper_tabular_features
5.3G	/raid/ogbn_mag240m_simple/cites_list.parquet
251M	/raid/ogbn_mag240m_simple/affiliated_with_list.parquet
200K	/raid/ogbn_mag240m_simple/graph_metadata.json
203G	/raid/ogbn_mag240m_simple


## Prepare SynGen Configuration that stores fitted generators

Fitting the generators takes a significant share of the total generation time (in the run above, fitting the `paper` node generator alone took about 906 s), so we can store the fitted generators and reuse them in future experiments.
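
To quantify the fitting cost, the `TOOK` lines printed with `--verbose` can be parsed. The sketch below assumes the line format seen in the captured output above (`NODE <name> <stage> TOOK: <seconds>` and `EDGE <name> <stage> TOOK: <seconds>`); the helper `parse_timings` is hypothetical and not part of SynGen, and the sample contains only lines copied from the first run.

```python
import re

# Per-component lines look like "NODE paper FIT TOOK: 906.03".
# Aggregate lines such as "FIT NODES TOOK: ..." do not start with NODE/EDGE,
# so they are skipped and nothing is counted twice.
TIMING_RE = re.compile(r"^(NODE|EDGE) (\S+) (.+?) TOOK: ([0-9.]+)$")


def parse_timings(log_text):
    """Return {(kind, name, stage): seconds} for each per-component timing line."""
    timings = {}
    for line in log_text.splitlines():
        match = TIMING_RE.match(line.strip())
        if match:
            kind, name, stage, seconds = match.groups()
            timings[(kind, name, stage)] = float(seconds)
    return timings


sample_log = """\
NODE paper FIT TOOK: 906.03
FIT NODES TOOK: 906.03
DEBUG:root:Using seed: 66
EDGE writes STRUCTURAL FIT TOOK: 110.91
EDGE affiliated_with STRUCTURAL FIT TOOK: 6.61
"""

timings = parse_timings(sample_log)
# Sum every stage whose name ends in "FIT" (node fit and structural edge fit).
fit_seconds = sum(t for (_, _, stage), t in timings.items() if stage.endswith("FIT"))
print(f"fit time in the captured lines: {fit_seconds:.2f} s")
```

The same helper also picks up generation lines such as `EDGE writes STRUCT GEN TOOK: ...`, so fit and generation time can be compared once a full log is available.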

In [10]:
generators_dump_dir = 'ogbn_mag240m_gens'

In [10]:
!python -m syngen mimic-dataset --gen-dump-path=$generators_dump_dir --output-file=$configs_dir/with_gen_dump.json --dataset-path $preprocessed_path --tab-gen uniform

INFO:__main__:|    Synthetic Graph Generation Tool    |
INFO:syngen.cli.commands.mimic_dataset:SynGen Configuration saved into ogbn_mag240m_configs/with_gen_dump.json


In [11]:
!cat $configs_dir/with_gen_dump.json

{
    "nodes": [
        {
            "name": "paper",
            "count": 121751666,
            "features": [
                {
                    "name": "feat_0",
                    "dtype": "float16",
                    "feature_type": "continuous",
                    "feature_file": "paper_feats.npy"
                },
                {
                    "name": "feat_1",
                    "dtype": "float16",
                    "feature_type": "continuous",
                    "feature_file": "paper_feats.npy"
                },
                {
                    "name": "feat_2",
                    "dtype": "float16",
                    "feature_type": "continuous",
                    "feature_file": "paper_feats.npy"
                },
                {
                    "name": "feat_3",
                    "dtype": "float16",
                    "feature_type": "continuous",
                    "feature_file": "paper_feats.npy"
                },
          

In [12]:
!python -m syngen synthesize --config-path $configs_dir/with_gen_dump.json --save-path /raid/ogbn_mag240m_with_gen_dump --verbose

INFO:__main__:|    Synthetic Graph Generation Tool    |
0it [00:00, ?it/s]
100%|█████████████████████████████████████████| 768/768 [10:16<00:00,  1.24it/s]
100%|█████████████████████████████████████████████| 2/2 [00:00<00:00,  2.99it/s]
0it [00:00, ?it/s]
NODE paper FIT TOOK: 705.66
FIT NODES TOOK: 705.66
DEBUG:root:Initialized logger
DEBUG:root:Using seed: 20
DEBUG:root:Fit results dst_src: None
DEBUG:root:Fit results src_dst: (0.4493778749717661, 0.16335407041150202, 0.1362311795696754, 0.2510368750470564)
EDGE writes STRUCTURAL FIT TOOK: 112.65
DEBUG:root:Initialized logger
DEBUG:root:Using seed: 99
DEBUG:root:Fit results dst_src: None
DEBUG:root:Fit results src_dst: (0.37499999944120643, 0.12499999906867743, 0.12500000055879357, 0.3750000009313226)
EDGE affiliated_with STRUCTURAL FIT TOOK: 6.82
DEBUG:root:Initialized logger
DEBUG:root:Using seed: 1
DEBUG:root:Fit results: (0.4468597402097436, 0.14895324673658117, 0.14895324673658117, 0.25523376631709405)
EDGE cites STRUCTURAL FIT T

In [13]:
!du -ah /raid/ogbn_mag240m_with_gen_dump

2.9G	/raid/ogbn_mag240m_with_gen_dump/writes_list.parquet
2.1G	/raid/ogbn_mag240m_with_gen_dump/paper_tabular_features/year_label.npy
193G	/raid/ogbn_mag240m_with_gen_dump/paper_tabular_features/paper_feats.npy
195G	/raid/ogbn_mag240m_with_gen_dump/paper_tabular_features
4.9G	/raid/ogbn_mag240m_with_gen_dump/cites_list.parquet
251M	/raid/ogbn_mag240m_with_gen_dump/affiliated_with_list.parquet
200K	/raid/ogbn_mag240m_with_gen_dump/graph_metadata.json
202G	/raid/ogbn_mag240m_with_gen_dump


## Prepare SynGen Configuration that scales the dataset

In [11]:
import os

In [12]:
scale_config_files = {}
for scale in [1, 2, 4]:
    edges_scale = scale ** 3
    out_file = f'{configs_dir}/scale_nodes_{scale}_edges_{edges_scale}.json'
    scale_config_files[scale] = out_file
    if os.path.exists(out_file):
        continue
    !python -m syngen mimic-dataset --node-scale=$scale --edge-scale=$edges_scale --gen-dump-path=$generators_dump_dir --output-file=$out_file --dataset-path $preprocessed_path --tab-gen uniform

In [15]:
def generate_scale(scale):
    config_file = scale_config_files[scale]
    out_dir = f"/raid/scale_{scale}"
    !python -m syngen synthesize --config-path $config_file --save-path $out_dir --verbose
    !du -ah $out_dir

In [17]:
generate_scale(2)

INFO:__main__:|    Synthetic Graph Generation Tool    |
NODE paper FIT TOOK: 0.02
FIT NODES TOOK: 0.02
EDGE writes STRUCTURAL FIT TOOK: 0.00
EDGE affiliated_with STRUCTURAL FIT TOOK: 0.00
EDGE cites STRUCTURAL FIT TOOK: 0.00
FIT EDGES TOOK: 0.01
FIT TOOK: 0.03
100%|█████████████████████████████████████████| 256/256 [01:59<00:00,  2.15it/s]
INFO:syngen.utils.io_utils:writing to file /raid/scale_2/writes_list.parquet parquet
EDGE writes STRUCT GEN TOOK: 206.77
100%|██████████████████████████████████████| 4096/4096 [00:36<00:00, 111.33it/s]
INFO:syngen.utils.io_utils:writing to file /raid/scale_2/affiliated_with_list.parquet parquet
EDGE affiliated_with STRUCT GEN TOOK: 66.71
100%|█████████████████████████████████████████| 528/528 [03:28<00:00,  2.53it/s]
INFO:syngen.utils.io_utils:writing to file /raid/scale_2/cites_list.parquet parquet
EDGE cites STRUCT GEN TOOK: 394.74
GEN STRUCT TOOK: 668.22
100%|███████████████████████████████████████████| 79/79 [09:52<00:00,  7.50s/it]
GEN TABULAR N

## Memory-mapped files for edge lists

Instead of concatenating chunks after generation, SynGen can write edge lists to memory-mapped files, which lets multiple processes write into a single file. To enable this feature, set each edge's `structure_path` to a `.npy` file.
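
The idea behind memory-mapped output can be illustrated with plain NumPy. This is a generic sketch, not SynGen's implementation: the full array is pre-allocated on disk, and each chunk is written at its own offset, so no concatenation step is needed. The helper `write_chunk`, the sizes, and the `(src, dst)` row layout are assumptions made for the example. Chunks are written sequentially here; since every call reopens the file independently, each call could equally run in a separate process.

```python
import os
import tempfile

import numpy as np


def write_chunk(path, start, chunk):
    """Reopen the shared .npy file in place and write one chunk at its offset."""
    out = np.load(path, mmap_mode="r+")
    out[start:start + len(chunk)] = chunk
    out.flush()


# Hypothetical sizes: 10 edges written in chunks of 4, one (src, dst) pair per row.
num_edges, chunk_size = 10, 4
path = os.path.join(tempfile.mkdtemp(), "edge_list.npy")

# Pre-allocate the whole array on disk; the .npy header stores shape and dtype.
allocated = np.lib.format.open_memmap(
    path, mode="w+", dtype=np.int64, shape=(num_edges, 2)
)
allocated.flush()
del allocated

# Each iteration touches only its own slice of the file.
for start in range(0, num_edges, chunk_size):
    stop = min(start + chunk_size, num_edges)
    ids = np.arange(start, stop)
    write_chunk(path, start, np.stack([ids, ids + 100], axis=1))

# Read back lazily; only the pages that are accessed get loaded into memory.
edges = np.load(path, mmap_mode="r")
print(edges.shape, edges[-1])
```

Opening the result with `mmap_mode="r"` is also a practical way to inspect large generated `.npy` files without loading them fully into RAM.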

In [13]:
import json

In [14]:
memmap_scale_config_files = {}

for scale, config_file in scale_config_files.items():
    with open(config_file, 'r') as f:
        cfg = json.load(f)
    
    for edge_info in cfg["edges"]:
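        # Swap only the file extension; splitting on the first '.' would truncate paths that contain other dots.
        edge_info['structure_path'] = os.path.splitext(edge_info['structure_path'])[0] + '.npy'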
    
    memmap_cfg_file = config_file[:-5] + "_memmap.json"
    memmap_scale_config_files[scale] = memmap_cfg_file
    
    if os.path.exists(memmap_cfg_file):
        continue
    
    with open(memmap_cfg_file, 'w') as f:
        json.dump(cfg, f, indent=4)
    

In [15]:
def generate_scale_memmap(scale):
    config_file = memmap_scale_config_files[scale]
    out_dir = f"/raid/scale_{scale}_memmap"
    !python -m syngen synthesize --config-path $config_file --save-path $out_dir --verbose
    !du -ah $out_dir

In [16]:
generate_scale_memmap(1)

DGL backend not selected or invalid.  Assuming PyTorch for now.
Setting the default backend to "pytorch". You can change it in the ~/.dgl/config.json file or export the DGLBACKEND environment variable.  Valid options are: pytorch, mxnet, tensorflow (all lowercase)
INFO:__main__:|    Synthetic Graph Generation Tool    |
NODE paper FIT TOOK: 0.08
FIT NODES TOOK: 0.08
EDGE writes STRUCTURAL FIT TOOK: 0.03
EDGE affiliated_with STRUCTURAL FIT TOOK: 0.04
EDGE cites STRUCTURAL FIT TOOK: 0.02
FIT EDGES TOOK: 0.10
FIT TOOK: 0.17
100%|█████████████████████████████████████████████| 4/4 [00:30<00:00,  7.72s/it]
EDGE writes STRUCT GEN TOOK: 32.58
EDGE affiliated_with STRUCT GEN TOOK: 4.20
100%|███████████████████████████████████████████| 10/10 [00:23<00:00,  2.31s/it]
EDGE cites STRUCT GEN TOOK: 24.36
GEN STRUCT TOOK: 61.14
100%|███████████████████████████████████████████| 40/40 [03:17<00:00,  4.93s/it]
GEN TABULAR NODE FEATURES TOOK: 209.58
GEN TABULAR EDGE FEATURES TOOK: 0.00
GEN ALIGNMENT TAKE: 

In [17]:
generate_scale_memmap(2)

INFO:__main__:|    Synthetic Graph Generation Tool    |
NODE paper FIT TOOK: 0.01
FIT NODES TOOK: 0.01
EDGE writes STRUCTURAL FIT TOOK: 0.00
EDGE affiliated_with STRUCTURAL FIT TOOK: 0.01
EDGE cites STRUCTURAL FIT TOOK: 0.01
FIT EDGES TOOK: 0.02
FIT TOOK: 0.03
100%|█████████████████████████████████████████| 256/256 [01:47<00:00,  2.37it/s]
EDGE writes STRUCT GEN TOOK: 110.77
100%|██████████████████████████████████████| 4096/4096 [00:25<00:00, 159.79it/s]
EDGE affiliated_with STRUCT GEN TOOK: 39.71
100%|█████████████████████████████████████████| 528/528 [01:54<00:00,  4.62it/s]
EDGE cites STRUCT GEN TOOK: 116.67
GEN STRUCT TOOK: 267.15
100%|███████████████████████████████████████████| 78/78 [06:55<00:00,  5.33s/it]
GEN TABULAR NODE FEATURES TOOK: 438.53
GEN TABULAR EDGE FEATURES TOOK: 0.00
GEN ALIGNMENT TAKE: 0.00
24G	/raid/scale_2_memmap/writes_list.npy
2.7G	/raid/scale_2_memmap/affiliated_with_list.npy
4.0G	/raid/scale_2_memmap/paper_tabular_features/year_label.npy
385G	/raid/scale_2_

In [18]:
generate_scale_memmap(4)

INFO:__main__:|    Synthetic Graph Generation Tool    |
NODE paper FIT TOOK: 0.01
FIT NODES TOOK: 0.01
EDGE writes STRUCTURAL FIT TOOK: 0.01
EDGE affiliated_with STRUCTURAL FIT TOOK: 0.01
EDGE cites STRUCTURAL FIT TOOK: 0.01
FIT EDGES TOOK: 0.02
FIT TOOK: 0.03
100%|█████████████████████████████████████| 16384/16384 [27:41<00:00,  9.86it/s]
EDGE writes STRUCT GEN TOOK: 1669.94
100%|███████████████████████████████████████| 4096/4096 [01:30<00:00, 45.09it/s]
EDGE affiliated_with STRUCT GEN TOOK: 107.36
100%|███████████████████████████████████████| 8256/8256 [21:34<00:00,  6.38it/s]
EDGE cites STRUCT GEN TOOK: 1301.40
GEN STRUCT TOOK: 3078.70
100%|█████████████████████████████████████████| 157/157 [16:05<00:00,  6.15s/it]
100%|█████████████████████████████████████████████| 3/3 [00:31<00:00, 10.60s/it]
GEN TABULAR NODE FEATURES TOOK: 1009.78
GEN TABULAR EDGE FEATURES TOOK: 0.00
GEN ALIGNMENT TAKE: 0.00
185G	/raid/scale_4_memmap/writes_list.npy
22G	/raid/scale_4_memmap/affiliated_with_list.n