This repository contains the code and report for my final project in Dr. Carothers' Parallel Computing course, taught in Spring 2021 at RPI.
Graph Convolutional Networks (GCNs) can be used to solve node-level classification problems on graph-structured data. In a node-level classification problem, we assume a given dataset has an underlying graph structure in which each node contains a data point and only some nodes (typically a small percentage) carry labels. The goal is to learn a classifier that performs well on the nodes with unknown labels. As graph datasets grow in size, distributed training of GCNs becomes necessary. This work applies concepts from decentralized consensus optimization, namely the Decentralized Gradient Descent (DGD) method, to train GCNs on graph-structured data. We perform numerical experiments on the AiMOS supercomputer at RPI and include a technical report.
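At each DGD iteration, every agent mixes its model parameters with its neighbors' and then takes a gradient step on its local loss. The sketch below is a minimal, hypothetical illustration of that update pattern with mpi4py, using a toy quadratic loss and uniform all-to-all averaging (both are assumptions made for this example); the actual trainer lives in `distributed_gcn_trainer.py` and may differ.

```python
# Hypothetical sketch of a DGD loop with mpi4py. The toy quadratic loss
# and the uniform (all-to-all) mixing are illustrative assumptions; the
# real trainer is distributed_gcn_trainer.py.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

def local_gradient(x):
    # Stand-in for backprop through this agent's local GCN and subgraph.
    # Toy local loss: f_i(x) = 0.5 * ||x - i||^2, so the gradient is x - i.
    return x - rank

x = np.zeros(4)   # this agent's copy of the model parameters
alpha = 0.1       # DGD step size

for k in range(200):
    # Consensus step: average parameters across all agents. A sparser
    # topology would use a neighbor-weighted mixing matrix instead.
    x_mixed = comm.allreduce(x, op=MPI.SUM) / size
    # Local step: descend this agent's local loss.
    x = x_mixed - alpha * local_gradient(x)

if rank == 0:
    print(x)
```

Launched with, e.g., `mpirun -np 4 python dgd_sketch.py` (a hypothetical filename), every agent's parameters end up within a step-size-dependent neighborhood of the consensus optimum, which is the behavior DGD guarantees with a constant step size.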
- `code`: folder containing the code used to produce the technical report associated with this repository
- `data`: folder containing the Cora dataset and empty folders to store the partitioned data
- `gcn`: folder containing a custom implementation of the GCN architecture used for solving node classification problems (see the layer sketch after this list)
- `graph_partitioner`: folder containing the code that partitions a graph dataset among $N$ agents using PyMetis (see the PyMetis example after this list)
- `distributed_gcn_trainer.py`: python file that performs DGD using mpi4py
- `make_plots.py`: python file that produces the figures in the technical report
- `requirements.txt`: a text file containing the packages required in the conda environment that a user sets up to run these experiments
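For context, a single GCN layer computes $H' = \sigma(\hat{A} H W)$, where $\hat{A}$ is the degree-normalized adjacency matrix with self-loops. The sketch below is a minimal, hypothetical PyTorch-style layer written for illustration; the custom implementation in the `gcn` folder may be organized differently.

```python
# Minimal, hypothetical GCN layer for illustration; the repo's gcn
# folder contains the actual custom implementation.
import torch

class GCNLayer(torch.nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = torch.nn.Parameter(0.01 * torch.randn(in_dim, out_dim))

    def forward(self, a_hat, h):
        # a_hat: (n, n) normalized adjacency matrix with self-loops
        # h:     (n, in_dim) node features
        # Aggregate neighbor features, then apply the linear map and ReLU.
        return torch.relu(a_hat @ h @ self.weight)
```

For Cora, the input dimension of the first layer would be 1433 (the bag-of-words feature size), e.g. `GCNLayer(1433, 16)` for a 16-dimensional hidden representation.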
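PyMetis wraps METIS's k-way partitioner: `pymetis.part_graph(nparts, adjacency=...)` returns the number of cut edges and a per-node part assignment. Below is a small self-contained example on a made-up toy graph; the repo's actual partitioning logic lives in `graph_partitioner` and is driven by `create_coarse_graph.py`.

```python
# Toy PyMetis partitioning example; the real dataset (Cora) and the
# partitioning logic live in data/ and graph_partitioner/.
import pymetis

# Adjacency list: node i maps to a list of its neighbors (symmetric).
adjacency = [[1, 2], [0, 2], [0, 1, 3], [2, 4], [3, 5], [4]]

# Split into 2 parts; membership[i] gives the part assigned to node i.
n_cuts, membership = pymetis.part_graph(2, adjacency=adjacency)
print(n_cuts)      # number of edges cut by the partition
print(membership)  # e.g. [0, 0, 0, 1, 1, 1] (exact labels may vary)
```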
AiMOS is the supercomputer located on RPI's campus. As usernames and project names are sensitive material, below is a mock `.sh` script that would need to be placed in the `code` folder to run these experiments with multiple GPUs. We utilize the NPL Cluster on AiMOS.
Note: anything in `<>` brackets is a user argument and must be specified before running on AiMOS. These are:
- `<path_to_conda>`: points to the directory where miniconda is located
- `<your_env>`: the name of your conda environment that contains the packages in the `requirements.txt` file
- `<path_to_code_folder>`: absolute path to the `code` folder
- `<num_nodes>`: number of computing nodes to use (max used in the report is 4)
- `<num_gpus>`: number of GPUs in each computing node (max is 8 for the NPL Cluster)
```bash
#!/bin/bash -x
# ----- GATHER SLURM INFORMATION FOR RUNNING ON MULTIPLE COMPUTE NODES ----- #
if [ "x$SLURM_NPROCS" = "x" ]
then
if [ "x$SLURM_NTASKS_PER_NODE" = "x" ]
then
SLURM_NTASKS_PER_NODE=1
fi
SLURM_NPROCS=`expr $SLURM_JOB_NUM_NODES \* $SLURM_NTASKS_PER_NODE`
else
if [ "x$SLURM_NTASKS_PER_NODE" = "x" ]
then
SLURM_NTASKS_PER_NODE=`expr $SLURM_NPROCS / $SLURM_JOB_NUM_NODES`
fi
fi
# ----- BUILD THE MPI HOSTFILE (node names + slots per node) ----- #
srun hostname -s | sort -u > /tmp/hosts.$SLURM_JOB_ID
awk "{ print \$0 \" slots=$SLURM_NTASKS_PER_NODE\"; }" /tmp/hosts.$SLURM_JOB_ID > /tmp/tmp.$SLURM_JOB_ID
mv /tmp/tmp.$SLURM_JOB_ID /tmp/hosts.$SLURM_JOB_ID
# ----- LOAD CONDA ENVIRONMENT ----- #
source ~/<path_to_conda>/conda.sh
conda activate <your_env>
# ----- LOAD GCC FOR GRAPH PARTITIONING ----- #
module load gcc
# ----- PERFORM GRAPH PARTITIONING ----- #
# CHECK IF SUBGRAPHS EXIST
SUBGRAPH_DIR=<path_to_code_folder>/code/data/subgraph_list
if [ -d "$SUBGRAPH_DIR" ]
then
rm -r <path_to_code_folder>/code/data/subgraph_list
rm -r <path_to_code_folder>/code/data/adj_matrices
echo deleting old files...
fi
# MAKE NEW DIRECTORIES FOR SUBGRAPH INFORMATION
echo making new files ...
mkdir <path_to_code_folder>/code/data/subgraph_list
mkdir <path_to_code_folder>/code/data/adj_matrices
# PARTITION THE GRAPH WITH PYMETIS
python <path_to_code_folder>/code/create_coarse_graph.py --num_nodes=$SLURM_NPROCS
# ----- PARSE USER INPUTS ----- #
while getopts e:l: flag
do
case "${flag}" in
e) epochs=${OPTARG};;
l) lr=${OPTARG};;
esac
done
# ----- RUN COMMAND ON MULTIPLE NODES ----- #
mpirun -hostfile /tmp/hosts.$SLURM_JOB_ID -np $SLURM_NPROCS python <path_to_code_folder>/code/distributed_gcn_trainer.py --num_nodes=$SLURM_NPROCS --epochs=$epochs --lr=$lr
rm /tmp/hosts.$SLURM_JOB_ID
```
Once the `.sh` file has been included in the `code` folder, the following command can be run on AiMOS to train a GCN with DGD:

```bash
sbatch -N<num_nodes> --ntasks-per-node=<num_gpus> --gres=gpu:<num_gpus> -t 5:00 -o <path_to_code_folder>/code/output_file_name.out <path_to_code_folder>/code/runScript.sh -e 350 -l 50
```

Note: we must be in the `code` directory when we submit the above script.