RDMA-P2P, the underlying transfer layer for FaaScale, is implemented by adding multiple RDMA interfaces to the RDMC module of Derecho.
Create a shared directory for nodes and a workspace for RDMA-P2P:

```bash
# at the place where you want to store your node files
mkdir -p nodes
cd nodes
mkdir -p workspace
```

Clone the repository into the workspace:
```bash
# change directory to the workspace
cd workspace
git clone git@github.com:lambda-scale/rdma-p2p.git
```

- Experiment automation
Prepare derecho_node.cfg for automation:

```bash
# copy derecho_node-sample.cfg to ./
cp src/conf/derecho_node-sample.cfg ./
mv derecho_node-sample.cfg derecho_node.cfg
```

Example derecho_node.cfg:
```ini
# derecho_node.cfg sample
[Automation]
num_experiments = 1 # how many runs; the total number of workers doubles with each run
local_rdma_dir = /sample/workspace/RDMA-P2P # set to the workspace created in the previous step
node_config_file = /sample/workspace/RDMA-P2P/node-sample.cfg # typically ${local_rdma_dir}/node.cfg
# seconds to wait before launching the next worker; the leader/contact node
# must be started before the other workers, at least for RDMA
worker_comm_establish_wait_time = 8
worker_inter_wait_time = 4
controller_wait_time = 4 # seconds to wait after the controller Python application starts
client_wait_time = 8 # seconds to wait after sending a request
```

Request resources from the SLURM cluster:
```bash
salloc --nodes=${node_number} --ntasks-per-node=1 --partition=gpu --gres=gpu:a40:1 --cpus-per-task=16 --mem-per-cpu=8G
```

Run the SLURM automation script:
```bash
# change directory to the workspace
cd workspace
bash slurm/deploy/automate_experiment.slurm
```

- Pull nvidia base image
- Build GPUDirect RDMA (GDR) image
- Launch GDR cluster
- Automate experiment
```bash
docker pull nvcr.io/nvidia/pytorch:24.04-py3
```

The build step copies three files into the image (the private key, the public key, and install_dependencies.sh), so make sure all three are present.
```bash
cd dockerfiles
docker build -t gpu-scaling-gdr:latest .
```

The current image is built as a long-running instance; change dockerfiles/Dockerfile if you want it to be short-lived.
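The per-worker launch commands below differ only in the container name. As a convenience, they can be generated in a loop; here is a dry-run sketch that prints the commands so they can be reviewed or piped to `sh` (the worker count, image tag, GPU index, and InfiniBand device paths are taken from this guide and may need adjusting for your cluster):

```bash
# print one docker run command per worker (dry run; pipe to sh to execute)
launch_workers_dryrun() {
  for i in 0 1 2 3; do
    echo docker run -dit --name "worker${i}" \
      --gpus '"device=1"' --network host --privileged \
      --device /dev/infiniband/uverbs1 \
      --device /dev/infiniband/rdma_cm \
      gdr-py:latest /bin/bash
  done
}
launch_workers_dryrun
```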
```bash
docker run -dit --name worker0 --gpus '"device=1"' --network host --privileged --device /dev/infiniband/uverbs1 --device /dev/infiniband/rdma_cm gdr-py:latest /bin/bash
...
docker run -dit --name worker3 --gpus '"device=1"' --network host --privileged --device /dev/infiniband/uverbs1 --device /dev/infiniband/rdma_cm gdr-py:latest /bin/bash
```

After launching the GDR cluster, we first need to configure node.cfg with each container's ID and IP address.
```bash
# copy node-sample.cfg to ./
cp src/conf/node-sample.cfg ./
mv node-sample.cfg node.cfg
vim node.cfg
```
...
```
# example node.cfg for 4 GDR instances
# leader node goes first
0,192.168.0.10
1,192.168.0.20
2,192.168.0.30
3,192.168.0.40
```

Now we are ready to run the [automation script]. The automation includes the following processes:
- Configure the derecho.cfg file on each remote host (the application and RDMA share this config file)
- Clean up TCP ports and log files from the last experiment
- Collect log files into the [experiment-res] directory on the local host
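The node.cfg lines shown earlier are simple `id,ip` pairs with the leader node first. A minimal sketch of how such a file can be read from the shell, for quick inspection (the function name is ours, not part of the repository):

```bash
# read node.cfg ("id,ip" per line, leader node first) and print each entry;
# comment lines starting with '#' and blank lines are skipped
parse_node_cfg() {
  while IFS=, read -r id ip; do
    case "$id" in "#"*|"") continue ;; esac
    echo "node ${id} -> ${ip}"
  done < "$1"
}
```

For the four-instance example above, `parse_node_cfg node.cfg` prints `node 0 -> 192.168.0.10` through `node 3 -> 192.168.0.40`.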
```bash
# copy derecho_node-sample.cfg to ./
cp src/conf/derecho_node-sample.cfg ./
mv derecho_node-sample.cfg derecho_node.cfg
```

Here is an example configuration for the automate_experiment.sh script:
```ini
[Automation]
num_experiments = 1 # how many runs; the total number of workers doubles with each run
total_workers = 3 # total workers in the current cluster
ssh_port = 2222 # port number for sshd on each host
local_rdma_dir = /home/rui/workspace/RDMA-P2P # used in multiple places; make sure it is correct
node_config_file = /home/rui/workspace/RDMA-P2P/node.cfg # absolute path of the node.cfg file
worker_comm_establish_wait_time = 5 # seconds to wait after launching our system on each worker
# seconds to wait before launching the next worker; the leader/contact node
# must be started before the other workers, at least for RDMA
worker_inter_wait_time = 4
controller_wait_time = 4 # seconds to wait after the controller Python application starts
client_wait_time = 8 # seconds to wait after sending a request
```
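The automation scripts presumably parse these `key = value` pairs themselves; for quick sanity checks from the shell, a minimal lookup sketch can be handy (the helper name `cfg_get` is ours, not part of the repository):

```bash
# print the value of one key from a derecho_node.cfg-style file,
# stripping the inline '#' comment and surrounding whitespace
cfg_get() {
  awk -F'=' -v k="$2" '
    $1 ~ ("^[ \t]*" k "[ \t]*$") {
      sub(/#.*/, "", $2)                 # drop the inline comment
      gsub(/^[ \t]+|[ \t]+$/, "", $2)    # trim whitespace
      print $2
    }' "$1"
}
```

For the example above, `cfg_get derecho_node.cfg total_workers` prints `3`.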