A synthetic point cloud consisting of planted motifs interspersed with noise points can be generated by running:
python -m SPMiner.common.run_syn_ptcloud \
--out datasets/pointcloud/raw/points.npz \
--grid 20 10 10 \
--grid_spacing 2.0 \
--motif_size $MOTIF \
--noise_points_per_cell $NPC \
--enable_random_transform \
--variant_rates 0.3 0.2 0.1 0.05
Some configuration options: --enable_random_transform randomly rotates and translates the planted motifs. --variant_rates includes variants of the planted motif with the same node type composition but different relative positions; the values specify the fraction of base motifs replaced by each variant. --type_vocab defines the node types available, and --type_probs sets custom sampling probabilities over those node types.
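As a rough illustration of what the generator produces (a hypothetical re-implementation for intuition, not the repo's actual code), each grid cell receives one copy of the motif plus uniformly sampled noise points:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_cell(origin, motif, noise_points_per_cell, cell_size=2.0):
    """Place one copy of the motif at `origin` plus uniform noise points."""
    planted = motif + origin  # translate the motif into this cell
    noise = origin + rng.uniform(0, cell_size, size=(noise_points_per_cell, 3))
    return np.vstack([planted, noise])

# toy 4-point motif; real runs use --motif_size points
motif = rng.uniform(0, 1, size=(4, 3))
cells = [make_cell(np.array([i * 2.0, 0.0, 0.0]), motif, noise_points_per_cell=3)
         for i in range(5)]
cloud = np.vstack(cells)
print(cloud.shape)  # 5 cells * (4 motif + 3 noise) points -> (35, 3)
```

With --enable_random_transform, each planted copy would additionally be rotated and translated before placement.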
To convert the point cloud into a graph dataset, apply k-NN to the points:
python -m SPMiner.common.pointcloud_to_deepsnap \
--input datasets/pointcloud/raw/points.npz \
--out datasets/pointcloud/graphs.pkl \
--k 1
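Conceptually, the k-NN step connects each point to its k nearest neighbours. A minimal sketch with NumPy and NetworkX (not the repo's implementation) is:

```python
import numpy as np
import networkx as nx

def knn_graph(points, k):
    """Connect each point to its k nearest neighbours (undirected)."""
    d = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)        # exclude self-loops
    G = nx.Graph()
    G.add_nodes_from(range(len(points)))
    for i, row in enumerate(d):
        for j in np.argsort(row)[:k]:  # k closest points to i
            G.add_edge(i, int(j))
    return G

# two tight pairs far apart: with k = 1 the graph splits into two components
pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
G = knn_graph(pts, k=1)
print(sorted(G.edges()))  # [(0, 1), (2, 3)]
```

This also illustrates the caveat noted later: with low noise and k = 1, the resulting graph can fall apart into many small disconnected components.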
Also included in this repo is a naïve baseline for the subgraph mining problem that considers only the node type composition.
python -m naive_cluster.baseline_naive --data path/to/points.npz --mode knn --k $MOTIF --sample-frac 1 --n-clusters 4 --runs 10
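The idea behind the baseline can be sketched as follows (hypothetical code, not the script itself): summarise each candidate cluster by the multiset of its node types, ignoring connectivity entirely, and report the most frequent composition:

```python
from collections import Counter

# hypothetical clusters: each is the list of node types in one neighbourhood
clusters = [
    ["A", "A", "B"],   # matches the planted motif's composition
    ["A", "B", "A"],   # same multiset, different order -> same composition
    ["B", "B", "C"],   # noise cluster
]

# a composition is an order-independent multiset of node types
compositions = [tuple(sorted(c)) for c in clusters]
most_common, count = Counter(compositions).most_common(1)[0]
print(most_common, count)  # ('A', 'A', 'B') 2
```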
The naïve baseline can also be run while varying motif size or noise level using the scripts results_naive_runs_motif.py and results_naive_runs_noise.py; each outputs a CSV file and a PNG plot.
To train the point cloud encoder:
python -m pt_cloud.matching.train
All config options can be found under pt_cloud/matching/config.py.
This is a modified version of SPMiner that is node-type aware and can process point cloud datasets as input.
Running the pipeline consists of training the encoder on synthetic data, then running the decoder on the dataset from which to mine patterns. The config argument args.model_path (subgraph_matching/config.py) specifies the location of the saved checkpoint. Full configuration options can be found in SPMiner/subgraph_mining/config.py
Train the encoder: python3 -m SPMiner.subgraph_matching.train --node_anchored --num_types 4 --type_emb_dim 16. Note that a trained order embedding model checkpoint is provided in SPMiner/ckpt/model.pt. It was trained with 4 node types, so mining with more than 4 node types requires retraining.
By default, the encoder is trained with on-the-fly generated synthetic data (--dataset=syn-balanced). The dataset argument can be used to change to a real-world dataset (e.g. --dataset=enzymes), or an imbalanced class version of a dataset (e.g. --dataset=syn-imbalanced). It is recommended to train on a balanced dataset.
Full configuration options for training the encoder can be found in SPMiner/subgraph_matching/config.py
To run the decoder on the point cloud dataset:
python -m SPMiner.subgraph_mining.decoder --dataset pointcloud --search_strategy greedy --method_type order --min_pattern_size $MOTIF --max_pattern_size $MOTIF --out_path results/mined.pkl --model_path ckpt/model.pt --out_batch_size 64 --num_types 4 --type_emb_dim 16
Useful configuration options: --min_neighborhood_size and --max_neighborhood_size define the range of neighborhood sizes considered when searching for patterns. Note that setting max_neighborhood_size larger than the size of the largest connected component will result in an infinite loop.
Outputs: plots/cluster/{k}-{i}.png and plots/cluster/{k}-{i}.json (the i-th motif exemplar of size k), and results/out-patterns.p (a pickle of all mined motifs).
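Assuming the pickle holds a list of NetworkX pattern graphs with a node_type attribute (the exact schema may differ), the mined motifs can be inspected along these lines:

```python
import pickle
import networkx as nx

def summarise(patterns):
    """Print size and node type composition of each mined pattern."""
    for i, g in enumerate(patterns):
        types = sorted(nx.get_node_attributes(g, "node_type").values())
        print(f"pattern {i}: {g.number_of_nodes()} nodes, types={types}")

# demo with an in-memory pattern; in practice load results/out-patterns.p
g = nx.path_graph(3)
nx.set_node_attributes(g, {0: "A", 1: "B", 2: "A"}, "node_type")
blob = pickle.dumps([g])

patterns = pickle.loads(blob)
summarise(patterns)
```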
python SPMiner/tools/count_patterns_typed.py \
--dataset pointcloud \
--graphs_pkl datasets/pointcloud/raw/graphs.pkl \
--patterns_glob 'plots/cluster/*.json' \
--node_attr node_type \
--count_method freq \
--out_path results/counts.json
To evaluate the performance of the model we define a hit rate: a run counts as a hit if one of the top-k mined motifs matches the ground truth. The hit rate is evaluated over the configured number of runs, so with runs = 10 the score is out of ten. For each run we report both a node type match and an exact match, which checks connectivity as well as node types.
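The two match criteria could be checked roughly as follows (a sketch, not the evaluation script itself): a type match compares multisets of node types only, while an exact match additionally requires a node-type-preserving isomorphism:

```python
import networkx as nx

def type_match(g, h):
    """Same multiset of node types, ignoring connectivity."""
    types = lambda G: sorted(nx.get_node_attributes(G, "node_type").values())
    return types(g) == types(h)

def exact_match(g, h):
    """Type match plus a node-type-preserving isomorphism."""
    return nx.is_isomorphic(
        g, h, node_match=lambda a, b: a["node_type"] == b["node_type"])

# ground-truth motif: a path A-B-A
gt = nx.path_graph(3)
nx.set_node_attributes(gt, {0: "A", 1: "B", 2: "A"}, "node_type")

# mined candidate: same types but a triangle, so only a type match
cand = nx.cycle_graph(3)
nx.set_node_attributes(cand, {0: "A", 1: "B", 2: "A"}, "node_type")

print(type_match(gt, cand), exact_match(gt, cand))  # True False
```

A run would then score a hit if any of the top-k candidates satisfies the chosen criterion.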
To evaluate SPMiner while varying the noise level:
python -m SPMiner.tools.spminer_eval_noise --motif_size 6 --noise_multipliers 0 0.25 0.5 1 2 4 --runs 10 --k 2 --ckpt ckpt/model.pt --min_pattern_size 6 --max_pattern_size 6 --num_types 4 --type_emb_dim 16 --use_counts --topk 4 --plot_dir plots
To consider only the top-k most frequent motifs, use --use_counts and --topk $K. The k parameter controls the k-NN applied to the point cloud. Setting k > 1 is necessary here because at low noise levels the graph may otherwise break into many small disconnected components.
python -m SPMiner.tools.spminer_eval_motif --motif_min 4 --motif_max 20 --noise_ratio 1.0 --runs 10 --ckpt ckpt/model.pt --num_types 4 --type_emb_dim 16 --use_counts --topk 4 --plot_dir plots
The --noise_ratio parameter sets the total amount of noise as a ratio of noise points to motif points in the point cloud. For example, noise_ratio = 1 with motif_size = 6 means the point cloud is generated with noise_points_per_cell = 6.
The visualizations folder contains comparative plots and analyses of the three methods described above.
The library uses PyTorch and PyTorch Geometric to implement message passing graph neural networks (GNNs).
It also uses DeepSNAP, which keeps an internal graph object (such as a NetworkX graph) synchronized with the PyTorch Geometric Data object, making it easy to apply graph algorithms (such as subgraph and matching operations) at every training iteration.
Detailed library requirements can be found in requirements.txt