This repository is intended purely to demonstrate how to make a graph dataset for PyTorch Geometric from graph vertices and edges stored in CSV files.
The demonstration is done through a node-prediction GNN training/evaluation example with a very small amount of code and data.
The main intended use is to read through the "*.py" scripts.
The scripts can also be executed, for example:
# train GNN
python run_training.py
# test a trained GNN model saved after epoch 5
python run_evaluation.py testing_data ./output_saved_trained_models/epoch5.pth
# test multiple saved trained GNN models
find "./output_saved_trained_models/" -type f | sort -V | xargs python ./run_evaluation.py testing_data
# use a single trained model to predict vertex values for a single graph, and save the predictions to a file
python run_inference_for_one_graph.py \
./output_saved_trained_models/epoch15.pth \
./input_graph_CSV_files/data/1A22/1A22_sr1_vertices_in.csv \
./input_graph_CSV_files/data/1A22/1A22_sr1_edges.csv \
./output_vertex_predictions.txt
Example of installing prerequisites with Miniconda:
# download and install the latest Miniconda
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
source ~/miniconda3/bin/activate
# install PyTorch using instructions from 'https://pytorch.org/get-started/locally/'
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
# install PyTorch Geometric using instructions from 'https://pytorch-geometric.readthedocs.io/en/latest/install/installation.html'
pip install torch_geometric
pip install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-2.1.0+cu121.html
# install Pandas, to use for reading CSV files
conda install pandas
Alternatively, an Apptainer container can be built from the definition file ("container.def") supplied among the files:
# build Apptainer container
apptainer build container.sif container.def
# enter the container
apptainer shell container.sif
Note that the Apptainer container may not be able to use the host's GPU.
The included input graphs are already prepared for GNN training and application. The graph preparation code is not included, but below are the main recommendations for the graphs to work with the provided training and inference code.
The graphs should have bidirectional connections and self-connections. That is, in any '*_edges.csv' file:
- if there is an (i -> j) edge, there should also be a (j -> i) edge with the same weight
- there should be an (i -> i) edge with an appropriate weight for every vertex id i
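The two requirements above can be checked before training. The following is a minimal sketch, not part of the repository: the function name `check_edges` and the representation of edges as (source, target, weight) tuples are assumptions made for illustration, and reading the tuples from a '*_edges.csv' file is left out because the column layout is not described here.

```python
import math


def check_edges(vertex_ids, edges, tol=1e-9):
    """Return a list of problems found in a directed, weighted edge list.

    vertex_ids: iterable of vertex ids (hypothetical input format)
    edges: iterable of (source, target, weight) tuples (hypothetical input format)
    """
    weights = {(i, j): w for i, j, w in edges}
    problems = []
    for (i, j), w in weights.items():
        if i == j:
            continue  # self-connections are checked separately below
        reverse = weights.get((j, i))
        if reverse is None:
            problems.append(f"missing reverse edge ({j} -> {i})")
        elif i < j and not math.isclose(w, reverse, abs_tol=tol):
            # report each mismatched pair only once
            problems.append(f"weight mismatch between ({i} -> {j}) and ({j} -> {i})")
    for v in vertex_ids:
        if (v, v) not in weights:
            problems.append(f"missing self-connection ({v} -> {v})")
    return problems


# Vertex 2 has no self-connection and edge (1 -> 2) has no reverse edge.
example_edges = [(0, 0, 1.0), (1, 1, 1.0), (0, 1, 0.5), (1, 0, 0.5), (1, 2, 0.3)]
example_problems = check_edges([0, 1, 2], example_edges)
for problem in example_problems:
    print(problem)
```

An empty result means the edge list satisfies both requirements.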
All vertex and edge feature values should be normalized globally, that is, using the same statistics for every graph rather than statistics computed separately for each graph. For example, each feature can be converted to z-scores using mean and standard deviation values that are either known beforehand or derived from all the graphs used in training:
z_score = (x - mean) / standard_deviation
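As an illustration of the difference between global and per-graph normalization, here is a minimal sketch for a single feature. The function names `global_stats` and `to_z_scores` and the sample values are hypothetical, and the use of the population standard deviation is an arbitrary choice for this example. The key point is that the statistics are computed once over all training graphs and then reused unchanged for any graph seen later.

```python
import statistics


def global_stats(graphs):
    """Compute one mean and one standard deviation over the values of all graphs."""
    all_values = [x for values in graphs for x in values]
    return statistics.fmean(all_values), statistics.pstdev(all_values)


def to_z_scores(values, mean, std):
    """Apply z_score = (x - mean) / standard_deviation to every value."""
    return [(x - mean) / std for x in values]


# One feature, two training graphs with different numbers of vertices (made-up values).
training_graphs = [[1.0, 2.0, 3.0], [4.0, 5.0]]
mean, std = global_stats(training_graphs)
normalized = [to_z_scores(values, mean, std) for values in training_graphs]

# A graph seen only at inference time is normalized with the same training statistics.
new_graph = to_z_scores([3.0, 6.0], mean, std)
```

With several features, the same procedure would be applied to each feature column separately.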