Wiki-CS is a Wikipedia-based dataset for benchmarking Graph Neural Networks. This repository contains the dataset files, the data pipeline used to create it, and the experiments performed to benchmark node classification and link prediction methods.
You can load the dataset easily using the `torch_geometric.datasets.WikiCS` class in PyTorch Geometric. Note that the `train_mask`, `val_mask` and `stopping_mask` fields are matrices whose rows correspond to 20 different dataset splits; please use all of them and report averaged results for standardised evaluation. (The `test_mask` has a single version.)
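For example, a minimal loading sketch with PyTorch Geometric might look like the following; the `root` path is arbitrary, and the per-split training loop is left as a comment for your own model:

```python
from torch_geometric.datasets import WikiCS

# A minimal sketch, assuming PyTorch Geometric is installed; the dataset is
# downloaded to the given root directory on first use.
dataset = WikiCS(root='data/WikiCS')
data = dataset[0]

print(data.x.shape)            # node features: [num_nodes, num_features]
print(data.train_mask.shape)   # [num_nodes, 20] -- one column per split

results = []
for split in range(data.train_mask.size(1)):
    train_mask = data.train_mask[:, split]
    val_mask = data.val_mask[:, split]
    stopping_mask = data.stopping_mask[:, split]
    # Train your model here using this split's masks; data.test_mask is shared
    # by all 20 splits. We record the training-set size as a stand-in metric.
    results.append(int(train_mask.sum()))

print(f'Average over {len(results)} splits: {sum(results) / len(results):.2f}')
```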
The `dataset/data.json` file contains the vectorised representation of the dataset. It includes the node features, adjacency lists, labels and dataset splits. Note that we provide 20 different training splits to avoid overfitting to a specific small set of training nodes for semi-supervised node classification. See our paper for more details.
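If you prefer to read the raw file directly, a rough sketch is shown below. The top-level key names used here (`features`, `labels`, `links`) are our assumption about the schema; check the file if your copy differs.

```python
import json

# Read the vectorised dataset without any graph library.
with open('dataset/data.json') as f:
    data = json.load(f)

features = data['features']   # per-node feature vectors (assumed key name)
labels = data['labels']       # one class index per node (assumed key name)
adjacency = data['links']     # adjacency lists: adjacency[i] = neighbours of node i

# Flatten the adjacency lists into an edge list.
edges = [(src, dst) for src, neighbours in enumerate(adjacency) for dst in neighbours]

print(f'{len(features)} nodes, {len(edges)} edges, {len(set(labels))} classes')
```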
See `experiments/linkprediction/load_wiki.py` and `experiments/node_classification/load_graph_data.py` for reference data loaders in PyTorch Geometric and DGL, respectively.
Metadata about the nodes and labels can be found in `dataset/metadata.json`, with the same ordering of nodes and labels as the vectorised data file. For each node, this describes which Wikipedia page it was derived from and what textual content was used for its features. For each label, the corresponding category is named.
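As an illustration, the sketch below pairs the metadata with the vectorised data to print page titles and category names for a few nodes. The field names (`nodes`, `labels`, `title`) are assumptions about the schema, not guaranteed; consult the file for the exact layout.

```python
import json

with open('dataset/metadata.json') as f:
    metadata = json.load(f)
with open('dataset/data.json') as f:
    data = json.load(f)

# Print the source Wikipedia page and category name for the first few nodes.
# 'nodes', 'labels' and 'title' are assumed field names.
for node_id in range(5):
    node_info = metadata['nodes'][node_id]
    label_name = metadata['labels'][data['labels'][node_id]]
    print(node_id, node_info['title'], '->', label_name)
```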
Our experiments were performed using Python 3.5, CUDA 10.1 and the dependencies noted in `requirements.txt`.
For node classification models, run the following, with `#MODEL_NAME#` one of `svm`, `mlp`, `gcn`, `gat` or `appnp`:
```
cd experiments
python -m node_classification.#MODEL_NAME#.#MODEL_NAME#_train --dataset=wiki
```
Add the hyperparameters as follows to replicate our results:
```
cd experiments
python -m node_classification.svm.svm_train --dataset=wiki --self-loop --kernel rbf --c 8 --test
python -m node_classification.mlp.mlp_train --dataset=wiki --self-loop --n-hidden-layers 1 --n-hidden-units 35 --dropout 0.35 --lr 0.003 --test
python -m node_classification.gcn.gcn_train --dataset=wiki --self-loop --n-hidden-layers 1 --n-hidden-units 33 --dropout 0.25 --lr 0.02 --test
python -m node_classification.gat.gat_train --dataset=wiki --self-loop --n-hidden-layers 1 --n-hidden-units 14 --in-drop 0.5 --attn-drop 0.5 --n-heads 5 --lr 0.007 --test
python -m node_classification.appnp.appnp_train --dataset=wiki --self-loop --n-hidden-units 14 --k 2 --alpha 0.11 --in-drop 0.4 --edge-drop 0.4 --lr 0.02 --test
```
The SVM and VGAE benchmarks for link prediction are included in this repository:
```
cd experiments/linkprediction
python train_vgae.py --dataset=wiki --test
python train_svm.py --dataset=wiki --c=10 --test
```
If you use our dataset, please cite our paper (BibTeX below).
```
@article{mernyei2020wiki,
  title={Wiki-CS: A Wikipedia-Based Benchmark for Graph Neural Networks},
  author={Mernyei, P{\'e}ter and Cangea, C{\u{a}}t{\u{a}}lina},
  journal={arXiv preprint arXiv:2007.02901},
  year={2020}
}
```
- The dataset pipeline includes a modified version of the Wikipedia category sanitizer by Boldi and Monti for extracting and sanitizing category labels.
- `wikiextractor` was used to extract article text data.
- `mysqldump-to-csv` was used for processing hyperlink data.
- The GCN, GAT and APPNP implementations were taken from the DGL examples repository.