A synthetic point cloud consisting of planted motifs interspersed with noise points can be generated by running:
python -m SPMiner.common.run_syn_ptcloud \
--out datasets/pointcloud/raw/points.npz \
--grid 20 10 10 \
--grid_spacing 2.0 \
--motif_size $MOTIF \
--noise_points_per_cell $NPC \
--enable_random_transform \
--variant_rates 0.3 0.2 0.1 0.05
Some configuration options: --enable_random_transform randomly rotates and translates the planted motifs. --variant_rates includes variants of the planted motif with the same node type composition but different relative positions; the values specify the fraction of base motifs replaced by each variant. --type_vocab defines the node types available, and --type_probs sets custom sampling probabilities over those node types.
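As a rough illustration of what the generator produces (a hypothetical re-implementation for intuition, not the repo's actual code), each grid cell receives one copy of the motif plus uniformly sampled noise points:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_cell(origin, motif, noise_points_per_cell, cell_size=2.0):
    """Place one copy of the motif at `origin` plus uniform noise points."""
    planted = motif + origin  # translate the motif into this cell
    noise = origin + rng.uniform(0, cell_size, size=(noise_points_per_cell, 3))
    return np.vstack([planted, noise])

# toy 4-point motif; real runs use --motif_size points
motif = rng.uniform(0, 1, size=(4, 3))
cells = [make_cell(np.array([i * 2.0, 0.0, 0.0]), motif, noise_points_per_cell=3)
         for i in range(5)]
cloud = np.vstack(cells)
print(cloud.shape)  # 5 cells * (4 motif + 3 noise) points -> (35, 3)
```

With --enable_random_transform, each planted copy would additionally be rotated and translated before placement.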
To convert the point cloud into a graph dataset, apply k-NN to the points:
python -m SPMiner.common.pointcloud_to_deepsnap \
--input datasets/pointcloud/raw/points.npz \
--out datasets/pointcloud/graphs.pkl \
--k 1
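Conceptually, the k-NN step connects each point to its k nearest neighbours. A minimal sketch with NumPy and NetworkX (not the repo's implementation) is:

```python
import numpy as np
import networkx as nx

def knn_graph(points, k):
    """Connect each point to its k nearest neighbours (undirected)."""
    d = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)        # exclude self-loops
    G = nx.Graph()
    G.add_nodes_from(range(len(points)))
    for i, row in enumerate(d):
        for j in np.argsort(row)[:k]:  # k closest points to i
            G.add_edge(i, int(j))
    return G

# two tight pairs far apart: with k = 1 the graph splits into two components
pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
G = knn_graph(pts, k=1)
print(sorted(G.edges()))  # [(0, 1), (2, 3)]
```

This also illustrates the caveat noted later: with low noise and k = 1, the resulting graph can fall apart into many small disconnected components.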
Also included in this repo is a naïve baseline for the subgraph mining problem that considers only the node type composition.
python -m naive_cluster.baseline_naive --data path/to/points.npz --mode knn --k $MOTIF --sample-frac 1 --n-clusters 4 --runs 10
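The idea behind the baseline can be sketched as follows (hypothetical code, not the script itself): summarise each candidate cluster by the multiset of its node types, ignoring connectivity entirely, and report the most frequent composition:

```python
from collections import Counter

# hypothetical clusters: each is the list of node types in one neighbourhood
clusters = [
    ["A", "A", "B"],   # matches the planted motif's composition
    ["A", "B", "A"],   # same multiset, different order -> same composition
    ["B", "B", "C"],   # noise cluster
]

# a composition is an order-independent multiset of node types
compositions = [tuple(sorted(c)) for c in clusters]
most_common, count = Counter(compositions).most_common(1)[0]
print(most_common, count)  # ('A', 'A', 'B') 2
```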
The naïve baseline can also be run while varying motif size or noise level using the scripts results_naive_runs_motif.py and results_naive_runs_noise.py; each outputs a CSV file and a PNG plot.
To train the point cloud encoder:
python -m pt_cloud.matching.train
All config options can be found under pt_cloud/matching/config.py.
This is a modified version of SPMiner that is node-type aware and can process point cloud datasets as input.
Running the pipeline consists of training the encoder on synthetic data, then running the decoder on the dataset from which to mine patterns. The config argument args.model_path (subgraph_matching/config.py) specifies the location of the saved checkpoint. Full configuration options can be found in SPMiner/subgraph_mining/config.py
Train the encoder: python3 -m SPMiner.subgraph_matching.train --node_anchored --num_types 4 --type_emb_dim 16. Note that a trained order embedding model checkpoint is provided in SPMiner/ckpt/model.pt. It was trained with 4 node types, so mining with more than 4 node types requires retraining.
By default, the encoder is trained with on-the-fly generated synthetic data (--dataset=syn-balanced). The dataset argument can be used to change to a real-world dataset (e.g. --dataset=enzymes), or an imbalanced class version of a dataset (e.g. --dataset=syn-imbalanced). It is recommended to train on a balanced dataset.
Full configuration options for training the encoder can be found in SPMiner/subgraph_matching/config.py
To run the decoder on the point cloud dataset:
python -m SPMiner.subgraph_mining.decoder --dataset pointcloud --search_strategy greedy --method_type order --min_pattern_size $MOTIF --max_pattern_size $MOTIF --out_path results/mined.pkl --model_path ckpt/model.pt --out_batch_size 64 --num_types 4 --type_emb_dim 16
Useful configuration options: --min_neighborhood_size and --max_neighborhood_size define the range of neighborhood sizes considered when searching for patterns. Note that setting max_neighborhood_size larger than the size of the largest connected component will result in an infinite loop.
Outputs: plots/cluster/{k}-{i}.png and plots/cluster/{k}-{i}.json (the i-th motif exemplar of size k), and results/out-patterns.p (a pickle of all mined motifs).
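Assuming the pickle holds a list of NetworkX pattern graphs with a node_type attribute (the exact schema may differ), the mined motifs can be inspected along these lines:

```python
import pickle
import networkx as nx

def summarise(patterns):
    """Print size and node type composition of each mined pattern."""
    for i, g in enumerate(patterns):
        types = sorted(nx.get_node_attributes(g, "node_type").values())
        print(f"pattern {i}: {g.number_of_nodes()} nodes, types={types}")

# demo with an in-memory pattern; in practice load results/out-patterns.p
g = nx.path_graph(3)
nx.set_node_attributes(g, {0: "A", 1: "B", 2: "A"}, "node_type")
blob = pickle.dumps([g])

patterns = pickle.loads(blob)
summarise(patterns)
```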
python SPMiner/tools/count_patterns_typed.py \
--dataset pointcloud \
--graphs_pkl datasets/pointcloud/raw/graphs.pkl \
--patterns_glob 'plots/cluster/*.json' \
--node_attr node_type \
--count_method freq \
--out_path results/counts.json
To evaluate the performance of the model we define a hit rate: a run counts as a hit if one of the top-k mined motifs matches the ground truth. The hit rate is evaluated over the configured number of runs, so with runs = 10 the score is out of ten. For each run we report both a node type match and an exact match, which checks connectivity as well as node types.
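The two match criteria could be checked roughly as follows (a sketch, not the evaluation script itself): a type match compares multisets of node types only, while an exact match additionally requires a node-type-preserving isomorphism:

```python
import networkx as nx

def type_match(g, h):
    """Same multiset of node types, ignoring connectivity."""
    types = lambda G: sorted(nx.get_node_attributes(G, "node_type").values())
    return types(g) == types(h)

def exact_match(g, h):
    """Type match plus a node-type-preserving isomorphism."""
    return nx.is_isomorphic(
        g, h, node_match=lambda a, b: a["node_type"] == b["node_type"])

# ground-truth motif: a path A-B-A
gt = nx.path_graph(3)
nx.set_node_attributes(gt, {0: "A", 1: "B", 2: "A"}, "node_type")

# mined candidate: same types but a triangle, so only a type match
cand = nx.cycle_graph(3)
nx.set_node_attributes(cand, {0: "A", 1: "B", 2: "A"}, "node_type")

print(type_match(gt, cand), exact_match(gt, cand))  # True False
```

A run would then score a hit if any of the top-k candidates satisfies the chosen criterion.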
To evaluate SPMiner while varying the noise level:
python -m SPMiner.tools.spminer_eval_noise --motif_size 6 --noise_multipliers 0 0.25 0.5 1 2 4 --runs 10 --k 2 --ckpt ckpt/model.pt --min_pattern_size 6 --max_pattern_size 6 --num_types 4 --type_emb_dim 16 --use_counts --topk 4 --plot_dir plots
To consider only the top-k most frequent motifs, use --use_counts and --topk $K. The k parameter controls the k-NN applied to the point cloud. Setting k > 1 is necessary here because at low noise levels the graph may otherwise break into many small disconnected components.
python -m SPMiner.tools.spminer_eval_motif --motif_min 4 --motif_max 20 --noise_ratio 1.0 --runs 10 --ckpt ckpt/model.pt --num_types 4 --type_emb_dim 16 --use_counts --topk 4 --plot_dir plots
The --noise_ratio parameter sets the total amount of noise as a ratio of noise points to motif points in the point cloud. For example, noise_ratio = 1 with motif_size = 6 means the point cloud is generated with noise_points_per_cell = 6.
The visualizations folder contains comparative plots and analyses of the three methods described above.
The library uses PyTorch and PyTorch Geometric to implement message passing graph neural networks (GNNs).
It also uses DeepSNAP, which keeps an internal graph object (such as a NetworkX graph) synchronized with the PyTorch Geometric Data object, making it easy to apply graph algorithms (such as subgraph and matching operations) at every training iteration.
Detailed library requirements can be found in requirements.txt