This repository contains the ANCKA clustering framework for general attributed networks, extended from the previously published AHCKA algorithm for attributed hypergraph. If you find it helpful, please cite our work below:
@article{LiGSYSLL24,
author = {Yiran Li and Gongyao Guo and Jieming Shi and Renchi Yang and Shiqi Shen and Qing Li and Jun Luo},
title = {A versatile framework for attributed network clustering via K-nearest neighbor augmentation},
journal = {{VLDB} J.},
volume = {33},
number = {6},
pages = {1913--1943},
year = {2024}
}@article{LiYS23,
author = {Yiran Li and Renchi Yang and Jieming Shi},
title = {Efficient and Effective Attributed Hypergraph Clustering via K-Nearest Neighbor Augmentation},
journal = {Proc. {ACM} Manag. Data},
volume = {1},
number = {2},
pages = {116:1--116:23},
year = {2023}
}The following attributed network datasets used in this work are also made available:
-
Attributed undirected/directed graphs: Cora, Undirected Citeseer, Wiki, Directed Citeseer, Tweibo*, Amazon2M*.
-
Attributed hypergraphs: Cora-CA, Cora-CC, Query, Citeseer, 20News, DBLP, Amazon*, MAG-PM*.
-
Attributed multiplex graphs: ACM, DBLP, IMDB.
*Available at Zenodo upload due to space limit of GitHub repository.
-
Prepare Python environment with pip or conda:
-
Python version 3.9
-
Install required libraries: numpy, scipy, scikit-learn
-
-
(Optional) Extract attributed multiplex datasets:
unzip data/multi.zip -d data/
-
(Optional) To run ANCKA-GPU, install cupy and faiss-gpu=1.7.3.
-
(Optional) For testing large-scale datasets:
-
Install KNN search libraries: faiss-cpu and scann
-
Download datasets and KNN index files from Zenodo upload
-
Extract files:
cd ANCKA unzip ~/Download_path/ANCKA_data.zip -d data/
-
To run ANCKA clustering algorithm please specify the name of dataset by command-line parameter --dataset, and for hypergraph datasets also specify the type of dataset by --data. Besides, also provide the type of attributed network by --network_type. Datasets supported by our implementation include the following:
| Type of dataset | --data | --dataset | --network_type |
|---|---|---|---|
| Co-authorship hypergraph in .pickle files | coauthorship | cora, dblp | HG |
| Co-citation hypergraph in .pickle files | cocitation | cora, citeseer | HG |
| Attributed hypergraph stored in .npz file | npz | query, 20news, amazon, magpm | HG |
| Attributed undirected graph | - | cora,citeseer-UG,wiki,Amazon2M | UG |
| Attributed directed graph | - | citeseer-DG,Tweibo | DG |
| Attributed multiplex graph | - | ACM,IMDB,DBLP-MG | MG |
Other parameters are optional:
| Parameter | Default | Description |
|---|---|---|
| --knn_k | 10 |
|
| --alpha | 0.2 |
|
| --beta | 0.5 |
|
| --tmax | 200 |
|
| --interval | 5 |
|
| --scale | - | Apply settings for large-scale data: approximate KNN with ScaNN or Faiss; simplified initialization ( |
| --gpu | - | Use ANCKA-GPU implementation for clustering |
| --times | 10 | Rerun ANCKA-GPU algorithm to obtain average metrics |
| --caltime | - | Measure the time costs of different segments in ANCKA or ANCKA-GPU algorithm |
| --verbose | - | Produce verbose command-line output |
To reproduce the experiment results of ANCKA and ANCKA-GPU in our paper, please refer to command.sh for corresponding datasets.
Sample output of running bash command.sh:
CPU based
dataset:dblp data:coauthorship network_type:HG
parameter setting: k=10 init_iter=25 beta=0.5
ACC=0.797 F1=0.774 NMI=0.632 ARI=0.632 Time=30.899s
dataset:cora data:none network_type:UG
parameter setting: k=50 init_iter=25 beta=0.5
ACC=0.723 F1=0.686 NMI=0.556 ARI=0.484 Time=1.110s
dataset:citeseer data:none network_type:DG
parameter setting: k=50 init_iter=25 beta=0.4
ACC=0.696 F1=0.651 NMI=0.444 ARI=0.460 Time=1.152s
dataset:acm data:none network_type:MG
parameter setting: k=50 init_iter=25 beta=0.7
ACC=0.928 F1=0.928 NMI=0.739 ARI=0.796 Time=1.602s
GPU based
dataset:dblp data:coauthorship network_type:HG
parameter setting: k=10 init_iter=25 beta=0.5
ACC=0.808 F1=0.787 NMI=0.643 ARI=0.646 Time=0.663s
dataset:cora data:none network_type:UG
parameter setting: k=50 init_iter=25 beta=0.4
ACC=0.683 F1=0.621 NMI=0.533 ARI=0.470 Time=0.377s
dataset:citeseer data:none network_type:DG
parameter setting: k=50 init_iter=25 beta=0.4
ACC=0.694 F1=0.652 NMI=0.441 ARI=0.454 Time=0.839s
dataset:acm data:none network_type:MG
parameter setting: k=50 init_iter=25 beta=0.7
ACC=0.924 F1=0.924 NMI=0.730 ARI=0.786 Time=0.332s