This repository contains the code and report for my final project in Dr. Carothers' Parallel Computing course, taught in Spring 2021 at RPI.
Graph Convolutional Networks (GCNs) can be used to solve node-level classification problems on graph-structured data. In a node-level classification problem, we assume a given dataset has an underlying graph structure in which each node contains a data point and only some nodes (typically a small percentage) carry labels. The goal is to learn a classifier that performs well on the nodes with unknown labels. As graph datasets grow in size, distributed training of GCNs becomes necessary. This work applies concepts from decentralized consensus optimization, namely the Decentralized Gradient Descent (DGD) method, to train GCNs on graph-structured data. We perform numerical experiments on the AiMOS supercomputer at RPI and include a technical report.
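At each DGD iteration, every agent mixes its model parameters with its neighbors' and then takes a gradient step on its local loss. The sketch below is a minimal, hypothetical illustration of that update pattern with mpi4py, using a toy quadratic loss and uniform all-to-all averaging (both are assumptions made for this example); the actual trainer lives in `distributed_gcn_trainer.py` and may differ.

```python
# Hypothetical sketch of a DGD loop with mpi4py. The toy quadratic loss
# and the uniform (all-to-all) mixing are illustrative assumptions; the
# real trainer is distributed_gcn_trainer.py.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

def local_gradient(x):
    # Stand-in for backprop through this agent's local GCN and subgraph.
    # Toy local loss: f_i(x) = 0.5 * ||x - i||^2, so the gradient is x - i.
    return x - rank

x = np.zeros(4)   # this agent's copy of the model parameters
alpha = 0.1       # DGD step size

for k in range(200):
    # Consensus step: average parameters across all agents. A sparser
    # topology would use a neighbor-weighted mixing matrix instead.
    x_mixed = comm.allreduce(x, op=MPI.SUM) / size
    # Local step: descend this agent's local loss.
    x = x_mixed - alpha * local_gradient(x)

if rank == 0:
    print(x)
```

Launched with, e.g., `mpirun -np 4 python dgd_sketch.py` (a hypothetical filename), every agent's parameters end up within a step-size-dependent neighborhood of the consensus optimum, which is the behavior DGD guarantees with a constant step size.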
- `code`: folder containing the code used to produce the technical report associated with this repository
- `data`: folder containing the Cora dataset and empty folders to store the partitioned data
- `gcn`: folder containing a custom implementation of the GCN architecture used for solving node classification problems (see the layer sketch after this list)
- `graph_partitioner`: folder containing the code that partitions a graph dataset among $N$ agents using PyMetis (see the PyMetis example after this list)
- `distributed_gcn_trainer.py`: python file that performs DGD using mpi4py
- `make_plots.py`: python file that produces the figures in the technical report
- `requirements.txt`: a text file containing the packages required in the conda environment that a user sets up to run these experiments
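For context, a single GCN layer computes $H' = \sigma(\hat{A} H W)$, where $\hat{A}$ is the degree-normalized adjacency matrix with self-loops. The sketch below is a minimal, hypothetical PyTorch-style layer written for illustration; the custom implementation in the `gcn` folder may be organized differently.

```python
# Minimal, hypothetical GCN layer for illustration; the repo's gcn
# folder contains the actual custom implementation.
import torch

class GCNLayer(torch.nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = torch.nn.Parameter(0.01 * torch.randn(in_dim, out_dim))

    def forward(self, a_hat, h):
        # a_hat: (n, n) normalized adjacency matrix with self-loops
        # h:     (n, in_dim) node features
        # Aggregate neighbor features, then apply the linear map and ReLU.
        return torch.relu(a_hat @ h @ self.weight)
```

For Cora, the input dimension of the first layer would be 1433 (the bag-of-words feature size), e.g. `GCNLayer(1433, 16)` for a 16-dimensional hidden representation.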
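PyMetis wraps METIS's k-way partitioner: `pymetis.part_graph(nparts, adjacency=...)` returns the number of cut edges and a per-node part assignment. Below is a small self-contained example on a made-up toy graph; the repo's actual partitioning logic lives in `graph_partitioner` and is driven by `create_coarse_graph.py`.

```python
# Toy PyMetis partitioning example; the real dataset (Cora) and the
# partitioning logic live in data/ and graph_partitioner/.
import pymetis

# Adjacency list: node i maps to a list of its neighbors (symmetric).
adjacency = [[1, 2], [0, 2], [0, 1, 3], [2, 4], [3, 5], [4]]

# Split into 2 parts; membership[i] gives the part assigned to node i.
n_cuts, membership = pymetis.part_graph(2, adjacency=adjacency)
print(n_cuts)      # number of edges cut by the partition
print(membership)  # e.g. [0, 0, 0, 1, 1, 1] (exact labels may vary)
```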
AiMOS is the supercomputer located on RPI's campus. As usernames and project names are sensitive material, below is a mock `.sh` script that would need to be placed in the `code` folder to run these experiments with multiple GPUs. We utilize the NPL Cluster on AiMOS.
Note: anything in `<>` brackets is a user argument and must be specified before running on AiMOS. These are:
- `<path_to_conda>`: points to the directory where miniconda is located
- `<your_env>`: the name of your conda environment that contains the packages in the `requirements.txt` file
- `<path_to_code_folder>`: absolute path to the `code` folder
- `<num_nodes>`: number of computing nodes to use (max used in the report is 4)
- `<num_gpus>`: number of GPUs in each computing node (max is 8 for the NPL Cluster)
```bash
#!/bin/bash -x
# ----- GATHER SLURM INFORMATION FOR RUNNING ON MULTIPLE COMPUTE NODES ----- #
if [ "x$SLURM_NPROCS" = "x" ]
then
if [ "x$SLURM_NTASKS_PER_NODE" = "x" ]
then
SLURM_NTASKS_PER_NODE=1
fi
SLURM_NPROCS=`expr $SLURM_JOB_NUM_NODES \* $SLURM_NTASKS_PER_NODE`
else
if [ "x$SLURM_NTASKS_PER_NODE" = "x" ]
then
SLURM_NTASKS_PER_NODE=`expr $SLURM_NPROCS / $SLURM_JOB_NUM_NODES`
fi
fi
# ----- BUILD THE MPI HOSTFILE (node names + slots per node) ----- #
srun hostname -s | sort -u > /tmp/hosts.$SLURM_JOB_ID
awk "{ print \$0 \" slots=$SLURM_NTASKS_PER_NODE\"; }" /tmp/hosts.$SLURM_JOB_ID > /tmp/tmp.$SLURM_JOB_ID
mv /tmp/tmp.$SLURM_JOB_ID /tmp/hosts.$SLURM_JOB_ID
# ----- LOAD CONDA ENVIRONMENT ----- #
source ~/<path_to_conda>/conda.sh
conda activate <your_env>
# ----- LOAD GCC FOR GRAPH PARTITIONING ----- #
module load gcc
# ----- PERFORM GRAPH PARTITIONING ----- #
# CHECK IF SUBGRAPHS EXIST
SUBGRAPH_DIR=<path_to_code_folder>/code/data/subgraph_list
if [ -d "$SUBGRAPH_DIR" ]
then
rm -r <path_to_code_folder>/code/data/subgraph_list
rm -r <path_to_code_folder>/code/data/adj_matrices
echo deleting old files...
fi
# MAKE NEW DIRECTORIES FOR SUBGRAPH INFORMATION
echo making new files ...
mkdir <path_to_code_folder>/code/data/subgraph_list
mkdir <path_to_code_folder>/code/data/adj_matrices
# PARTITION THE GRAPH WITH PYMETIS
python <path_to_code_folder>/code/create_coarse_graph.py --num_nodes=$SLURM_NPROCS
# ----- PARSE USER INPUTS ----- #
while getopts e:l: flag
do
case "${flag}" in
e) epochs=${OPTARG};;
l) lr=${OPTARG};;
esac
done
# ----- RUN COMMAND ON MULTIPLE NODES ----- #
mpirun -hostfile /tmp/hosts.$SLURM_JOB_ID -np $SLURM_NPROCS python <path_to_code_folder>/code/distributed_gcn_trainer.py --num_nodes=$SLURM_NPROCS --epochs=$epochs --lr=$lr
rm /tmp/hosts.$SLURM_JOB_ID
```
Once the `.sh` file has been included in the `code` folder, the following command can be run on AiMOS to train a GCN with DGD:

```bash
sbatch -N<num_nodes> --ntasks-per-node=<num_gpus> --gres=gpu:<num_gpus> -t 5:00 -o <path_to_code_folder>/code/output_file_name.out <path_to_code_folder>/code/runScript.sh -e 350 -l 50
```

Note: we must be in the `code` directory when we submit the above script.