This repository contains the code and data to replicate the experimental results in the manuscript *Heterogeneous Graph Tree Networks*.
**Datasets:** ACM, IMDB, and DBLP

**Hardware:** RTX 3060 (12 GB)

**Dependencies:**
- Pytorch
- Pytorch Geometric
- DGL
- Scikit-Learn
- cogdl (used for the SimpleHGN implementation)
- imblearn 0.8.0 (used only for the label split)
- If the ACM raw data is not already available, place `ACM.mat` (from https://github.com/Jhy1993/HAN/tree/master/data/acm) into the `./data/ACM` folder.
- Data preprocessing for all test models: run the command `python gen_preprocessed_data.py`.

  Note: the above command generates the corresponding input data for each model. Preprocessing is fast for all models except MAGNN, which takes about one hour for the ACM and IMDB datasets and runs out of memory (OOM) and out of time (OOT) on the DBLP dataset. We therefore use the preprocessed DBLP data from the original source (https://github.com/cynricfu/MAGNN): place it inside the folder `./MAGNN/data/preprocessed/DBLP_processed`, then run `python gen_preprocessed_data.py`. If you do not want to test MAGNN and do not need its preprocessed data, run `python gen_preprocessed_data.py --skip_MAGNN` instead.
- Run the baseline models and the proposed HetGTCN and HetGTAN models:
- The test results of all models except SimpleHGN are recorded in Jupyter notebooks located in the folder `./result`. The file names follow the format `{metric}_{model}_{optional hop#}.ipynb`. For example, the Macro-F1 scores of HetGTCN and HetGTAN with 5 model layers are in `f1-macro_HetGTCN_hop5.ipynb` and `f1-macro_HetGTAN_hop5.ipynb`, respectively. You can simply open a notebook in the results folder and run it directly without any extra work; the model settings are clearly described in each notebook.
- You can also run a test without using Jupyter Notebook:
  `python train.py` with the optional input arguments:
  - `--data`, default = `ACM`: name of the dataset; can be ACM, IMDB, or DBLP
  - `--data_path`, default = `data/preprocessed/`: folder path of the saved preprocessed data
  - `--model`, default = `HetGTAN`: heterogeneous GNN model; choices: HAN, HetGTCN, HetGTAN, HGT, HetGCN, HetGAT, RGCN, HetGTAN_NoSem, HetGTAN_LW, HetGTAN_mean, HetGTCN_mean, HetGTCN_LW
  - `--target_node_type`, default = `paper`: target node type (paper for the ACM data)
  - `--n_hid`, type = int, default = 64: number of hidden features
  - `--num_heads`, type = int, default = 8: number of heads for the attention layer
  - `--dropout`, type = float, default = 0.8: initial-layer dropout, or dropout for HetGCN
  - `--dropout2`, type = float, default = 0.2: intermediate-layer dropout for HetGTAN or HetGTCN, or attention dropout for HetGAT
  - `-lr`, `--learning_rate`, type = float, default = 0.005: learning rate
  - `-wd`, `--weight_decay`, type = float, default = 0.00005: weight decay in the Adam optimizer
  - `--patience`, type = int, default = 100: early-stopping patience
  - `--num_iter`, type = int, default = 500: maximum number of epochs to run
  - `--num_test`, type = int, default = 30: number of runs used to test accuracy
  - `--hop`, type = int, default = 5: hop count, i.e. number of layers of the GNN models
  - `--num_bases`, type = int, default = 5: number of bases for the RGCN model
  - `--filter_pct`, type = float, default = 0.1: remove the top and bottom `filter_pct` points before computing statistics of the test accuracy
  - `--log_step`, type = int, default = 1000: training log step
  - `-lw`, `--layer_wise`, action = `store_true`, default = False: whether to share parameters across different layers
  - `--average`, default = `macro`: F1 average; can be either macro or micro
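The option list above suggests a standard `argparse` interface. The following is a hypothetical sketch of how such a parser could be defined (defaults taken from the list above; the actual `train.py` in this repository may declare its arguments differently):

```python
import argparse

def build_parser():
    # Hypothetical parser mirroring the documented options of train.py.
    # Only a representative subset is shown; defaults follow the README.
    p = argparse.ArgumentParser(description="Train heterogeneous GNN models")
    p.add_argument("--data", default="ACM",
                   help="Name of dataset: ACM, IMDB, or DBLP")
    p.add_argument("--data_path", default="data/preprocessed/",
                   help="folder path of saved preprocessed data")
    p.add_argument("--model", default="HetGTAN",
                   help="Heterogeneous GNN model, e.g. HAN, HetGTCN, HetGTAN")
    p.add_argument("--target_node_type", default="paper",
                   help="Target node type: paper for ACM data")
    p.add_argument("--n_hid", type=int, default=64)
    p.add_argument("--num_heads", type=int, default=8)
    p.add_argument("--dropout", type=float, default=0.8)
    p.add_argument("--dropout2", type=float, default=0.2)
    p.add_argument("-lr", "--learning_rate", type=float, default=0.005)
    p.add_argument("-wd", "--weight_decay", type=float, default=0.00005)
    p.add_argument("--patience", type=int, default=100)
    p.add_argument("--num_iter", type=int, default=500)
    p.add_argument("--num_test", type=int, default=30)
    p.add_argument("--hop", type=int, default=5)
    p.add_argument("-lw", "--layer_wise", action="store_true", default=False)
    p.add_argument("--average", default="macro")
    return p

# Example: parse the IMDB invocation from the examples below.
args = build_parser().parse_args(
    ["--data", "IMDB", "--target_node_type", "movie", "-lw"]
)
```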
- Examples:
  - Run a five-layer HetGTAN on the ACM dataset with the F1-macro metric 30 times:
    `python train.py -lw`
  - Run a five-layer HetGTAN on the IMDB dataset with the F1-macro metric 30 times:
    `python train.py --data IMDB --target_node_type movie -lw`
  - Run a five-layer HetGTCN on the DBLP dataset with the F1-macro metric 10 times:
    `python train.py --model HetGTCN --data DBLP --target_node_type author --dropout 0.8 --dropout2 0.6 -wd 1e-5 --num_test 10 -lw`
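The `--patience` option used by the runs above controls early stopping: training halts once the validation score has not improved for `patience` consecutive evaluations. A generic sketch of that logic (an illustration only, not the repository's actual implementation):

```python
class EarlyStopping:
    """Stop training when the monitored validation score has not improved
    for `patience` consecutive checks. Generic sketch, not the repo's code."""

    def __init__(self, patience=100):
        self.patience = patience
        self.best = float("-inf")   # best validation score seen so far
        self.counter = 0            # checks since the last improvement

    def step(self, score):
        """Record a new validation score; return True when training should stop."""
        if score > self.best:
            self.best = score
            self.counter = 0
        else:
            self.counter += 1
        return self.counter >= self.patience
```

In a training loop one would call `stopper.step(val_f1)` once per epoch and break out of the loop when it returns True.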
- The test results for the SimpleHGN model are recorded as Jupyter notebooks located in the folder `./SimpleHGN`. For instance, `f1-macro_SimpleHGN_hop2.ipynb` contains the test result of a two-layer SimpleHGN model. You can simply open the notebooks and run them to reproduce our results, or run `python train_simpleHGN.py` for the same purpose.
- To run MAGNN, go to the folder `./MAGNN` and run `python main.py`.
- To run GTN, go to the folder `./GTN` and run `python main_sparse.py`.
- To run DMGI, go to the folder `./DMGI` and run `python main.py`.