Supervised Learning on Relational Databases with Graph Neural Networks
This is code to reproduce the results in the paper Supervised Learning on Relational Databases with Graph Neural Networks.
docker/whole_project/environment.yml lists all dependencies you need to install to run this code.
You can follow the instructions here to automatically install a conda environment from this file.
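Concretely, creating the environment from that file looks like the following (a sketch, assuming conda is installed; the environment's name is whatever the name: field in environment.yml specifies):

```shell
# Create the conda environment from the project's dependency file.
conda env create -f docker/whole_project/environment.yml
# Activate it, substituting the name defined in the "name:" field of environment.yml.
conda activate <env_name>
```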
You can also build a docker container which contains all dependencies. You'll need docker (or nvidia-docker if you want to use a GPU) installed to do this.
docker/whole_project/Dockerfile builds a container that can run all experiments.
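For example, the image could be built and entered along these lines (a sketch: the rdb-experiments tag and the build context are assumptions, not names the repo defines):

```shell
# Build the image from the project Dockerfile (the tag name is arbitrary).
docker build -t rdb-experiments -f docker/whole_project/Dockerfile .
# Start an interactive shell in the container; add GPU flags (or use nvidia-docker) for GPU access.
docker run -it rdb-experiments bash
```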
I would love to have a link here where you could just download the prepared datasets. But unfortunately that would violate the Kaggle terms of service.
So you either need to follow the instructions below and build them yourself, or reach out to me by email and I may be able to provide them to you.
Preparing the datasets yourself
Set data_root in /__init__.py to be the location where you'd like to install the datasets (a default is provided there).
Download raw dataset files from Kaggle. You need a Kaggle account to do this. You only need to download the datasets you're interested in.
a) Put the Acquire Valued Shoppers Challenge data in
data_root/raw_data/acquirevaluedshopperschallenge. Extract any compressed files.
b) Put the Home Credit Default Risk data in
data_root/raw_data/homecreditdefaultrisk. Extract any compressed files.
c) Put the KDD Cup 2014 data in
data_root/raw_data/kddcup2014. Extract any compressed files.
Build the docker container specified in
docker/neo4j/Dockerfile. This creates a container with the neo4j graph database installed, which is used to build the datasets.
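The build step might look like this, tagging the image rdb-neo4j to match the image name the docker run command below expects (the build context is an assumption):

```shell
# Build the neo4j image used for dataset construction; the tag must match
# the image name used when starting the database server (rdb-neo4j).
docker build -t rdb-neo4j -f docker/neo4j/Dockerfile docker/neo4j
```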
Start the database server(s) for the datasets you want to build:
docker run -d -e "NEO4J_dbms_active__database=<db_name>.graph.db" --publish=7474:<port_for_browser> --publish=7687:<port_for_db> --mount type=bind,source=<path_to_code>/data/datasets/<db_name>,target=/data rdb-neo4j
where <db_name> is the dataset's name (acquirevaluedshopperschallenge, homecreditdefaultrisk, or kddcup2014), <path_to_code> is the location of this repo on your system, <port_for_browser> is an optional port for using the built-in neo4j data viewer (you can set it as 7474 if you don't care), and <port_for_db> is the database port for the dataset you're building (e.g. 10687).
Once the database server is running, run python -m data.<db_name>.build_database_from_kaggle_files from the root directory of this repo.
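To make the placeholder substitution concrete, here is a sketch that fills in the template for one dataset and prints the resulting command rather than executing it (the repo path and the port pairing are hypothetical values; use the ones appropriate for your system and dataset):

```shell
# Hypothetical values -- substitute your own.
DB_NAME=kddcup2014
PATH_TO_CODE=$HOME/relational-gnn   # hypothetical location of this repo
PORT_FOR_BROWSER=7474
PORT_FOR_DB=10687

# Print the fully substituted docker run command (echoed rather than executed,
# since it requires the rdb-neo4j image and the raw Kaggle data to be in place).
echo docker run -d -e "NEO4J_dbms_active__database=${DB_NAME}.graph.db" \
  --publish=7474:${PORT_FOR_BROWSER} --publish=7687:${PORT_FOR_DB} \
  --mount "type=bind,source=${PATH_TO_CODE}/data/datasets/${DB_NAME},target=/data" rdb-neo4j
```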
(optional) To view the dataset in the built-in neo4j data viewer, navigate to
<your_machine's_ip_address>:7474 in a web browser, run :server disconnect to log off whatever your web browser thinks is the default neo4j server, and log into the right one by specifying <port_for_browser> in the web interface.
Run python -m data.<db_name>.build_dataset_from_database from the root directory of this repo.
Run python -m data.<db_name>.build_db_info from the root directory of this repo.
(optional) to create the tabular and DFS datasets used in the experiments, run
python -m data.<db_name>.build_DFS_features from the root directory of this repo. Then run python -m data.<db_name>.build_tabular_datasets from the root directory of this repo.
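Putting the per-dataset steps together, the build sequence expands like this for a concrete dataset (kddcup2014 as the example; the commands are echoed here because they must be run from the repo root with the database server up):

```shell
DB_NAME=kddcup2014
# The build steps, in order: database -> dataset -> db_info -> DFS features -> tabular datasets.
for step in build_database_from_kaggle_files build_dataset_from_database \
            build_db_info build_DFS_features build_tabular_datasets; do
  echo "python -m data.${DB_NAME}.${step}"
done
```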
Add your own datasets
If you have your own relational dataset you'd like to use this system with, you can copy and modify the code in one of the dataset directories (e.g. data/kddcup2014) to suit your purposes.
The main thing you have to do is create the
.cypher script to get your data into a neo4j database. Once you've done that, nearly all the dataset building code is reusable.
You'll also have to add your dataset's name in a few places in the codebase, e.g. in the relevant __init__ methods.
Running experiments
All experiments are started with the scripts in the experiments directory.
For example, to recreate the
PoolMLP row in paper tables 3 and 4, you would run
python -m experiments.GNN.PoolMLP from the root directory of this repo to start training, then run
python -m experiments.evaluate_experiments when training is finished, and finally run
python -m experiments.GNN.print_and_plot_results.
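Collected in one place, the PoolMLP example above amounts to the following three commands, run in order from the repo root (echoed here, since they require the prepared datasets and a finished training run):

```shell
# 1. Train, 2. evaluate once training is finished, 3. print/plot results.
for cmd in experiments.GNN.PoolMLP experiments.evaluate_experiments \
           experiments.GNN.print_and_plot_results; do
  echo "python -m ${cmd}"
done
```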
By default, experiments run in tmux windows on your local machine. But you can also change the argument in the
run_script_with_kwargs command at the bottom of each experiment script to run them in a local docker container.
Or you can export the docker image built with
docker/whole_project/Dockerfile to AWS ECR and modify the arguments in
experiments/utils/run_script_with_kwargs to run all experiments on AWS Batch.