RDMA-P2P, the underlying transfer layer for FaaScale, is implemented by adding multiple RDMA interfaces to the RDMC module of Derecho.
Create a shared directory for nodes and a workspace for RDMA-P2P:

```bash
# at the place where you want to store your node files
mkdir -p nodes
cd nodes
mkdir -p workspace
```

Clone the repository into the workspace:
```bash
# change directory to the workspace
cd workspace
git clone git@github.com:lambda-scale/rdma-p2p.git
```

- Experiment automation
Prepare derecho_node.cfg for automation:

```bash
# copy derecho_node-sample.cfg to ./
cp src/conf/derecho_node-sample.cfg ./
mv derecho_node-sample.cfg derecho_node.cfg
```

Example derecho_node.cfg:
```ini
# derecho_node.cfg sample
[Automation]
num_experiments = 1 # how many runs; the total number of workers doubles with each run
local_rdma_dir = /sample/workspace/RDMA-P2P # set to the workspace created in the previous step
node_config_file = /sample/workspace/RDMA-P2P/node-sample.cfg # typically ${local_rdma_dir}/node.cfg
# seconds to wait before launching the next worker; the leader/contact node
# must be started before the other workers, at least for RDMA
worker_comm_establish_wait_time = 8
worker_inter_wait_time = 4
controller_wait_time = 4 # seconds to wait after the controller Python application starts
client_wait_time = 8 # seconds to wait after sending a request
```

Request resources from the SLURM cluster:
```bash
salloc --nodes=${node_number} --ntasks-per-node=1 --partition=gpu --gres=gpu:a40:1 --cpus-per-task=16 --mem-per-cpu=8G
```

Run the SLURM automation script:
```bash
# change directory to the workspace
cd workspace
bash slurm/deploy/automate_experiment.slurm
```

- Pull nvidia base image
- Build GPUDirect RDMA (GDR) image
- Launch GDR cluster
- Automate experiment
```bash
docker pull nvcr.io/nvidia/pytorch:24.04-py3
```

The build step copies three files into the image (the private key, the public key, and install_dependencies.sh), so make sure all three are present.
```bash
cd dockerfiles
docker build -t gpu-scaling-gdr:latest .
```

The current image is built as a long-running instance; change dockerfiles/Dockerfile if you want it to be short-lived.
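The per-worker launch commands below differ only in the container name. As a convenience, they can be generated in a loop; here is a dry-run sketch that prints the commands so they can be reviewed or piped to `sh` (the worker count, image tag, GPU index, and InfiniBand device paths are taken from this guide and may need adjusting for your cluster):

```bash
# print one docker run command per worker (dry run; pipe to sh to execute)
launch_workers_dryrun() {
  for i in 0 1 2 3; do
    echo docker run -dit --name "worker${i}" \
      --gpus '"device=1"' --network host --privileged \
      --device /dev/infiniband/uverbs1 \
      --device /dev/infiniband/rdma_cm \
      gdr-py:latest /bin/bash
  done
}
launch_workers_dryrun
```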
```bash
docker run -dit --name worker0 --gpus '"device=1"' --network host --privileged --device /dev/infiniband/uverbs1 --device /dev/infiniband/rdma_cm gdr-py:latest /bin/bash
...
docker run -dit --name worker3 --gpus '"device=1"' --network host --privileged --device /dev/infiniband/uverbs1 --device /dev/infiniband/rdma_cm gdr-py:latest /bin/bash
```

After launching the GDR cluster, we first need to configure node.cfg with each container's ID and IP address.
```bash
# copy node-sample.cfg to ./
cp src/conf/node-sample.cfg ./
mv node-sample.cfg node.cfg
vim node.cfg
```
...
```
# example node.cfg for 4 GDR instances
# leader node goes first
0,192.168.0.10
1,192.168.0.20
2,192.168.0.30
3,192.168.0.40
```

Now we are ready to run the [automation script]. The automation includes the following processes:
- Configure the derecho.cfg file on each remote host (the application and RDMA share this config file)
- Clean up TCP ports and log files from the last experiment
- Collect log files into the [experiment-res] directory on the local host
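The node.cfg lines shown earlier are simple `id,ip` pairs with the leader node first. A minimal sketch of how such a file can be read from the shell, for quick inspection (the function name is ours, not part of the repository):

```bash
# read node.cfg ("id,ip" per line, leader node first) and print each entry;
# comment lines starting with '#' and blank lines are skipped
parse_node_cfg() {
  while IFS=, read -r id ip; do
    case "$id" in "#"*|"") continue ;; esac
    echo "node ${id} -> ${ip}"
  done < "$1"
}
```

For the four-instance example above, `parse_node_cfg node.cfg` prints `node 0 -> 192.168.0.10` through `node 3 -> 192.168.0.40`.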
```bash
# copy derecho_node-sample.cfg to ./
cp src/conf/derecho_node-sample.cfg ./
mv derecho_node-sample.cfg derecho_node.cfg
```

Here is an example configuration for the automate_experiment.sh script:
```ini
[Automation]
num_experiments = 1 # how many runs; the total number of workers doubles with each run
total_workers = 3 # total workers in the current cluster
ssh_port = 2222 # port number for sshd on each host
local_rdma_dir = /home/rui/workspace/RDMA-P2P # used in multiple places; make sure it is correct
node_config_file = /home/rui/workspace/RDMA-P2P/node.cfg # absolute path of the node.cfg file
worker_comm_establish_wait_time = 5 # seconds to wait after launching our system on each worker
# seconds to wait before launching the next worker; the leader/contact node
# must be started before the other workers, at least for RDMA
worker_inter_wait_time = 4
controller_wait_time = 4 # seconds to wait after the controller Python application starts
client_wait_time = 8 # seconds to wait after sending a request
```
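The automation scripts presumably parse these `key = value` pairs themselves; for quick sanity checks from the shell, a minimal lookup sketch can be handy (the helper name `cfg_get` is ours, not part of the repository):

```bash
# print the value of one key from a derecho_node.cfg-style file,
# stripping the inline '#' comment and surrounding whitespace
cfg_get() {
  awk -F'=' -v k="$2" '
    $1 ~ ("^[ \t]*" k "[ \t]*$") {
      sub(/#.*/, "", $2)                 # drop the inline comment
      gsub(/^[ \t]+|[ \t]+$/, "", $2)    # trim whitespace
      print $2
    }' "$1"
}
```

For the example above, `cfg_get derecho_node.cfg total_workers` prints `3`.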