RDMA-P2P Documentation

RDMA-P2P, the underlying transfer layer for FaaScale, is implemented by adding multiple RDMA interfaces to the RDMC module of Derecho.

SLURM deployment

Step 1: Create Node Directory and Workspace

Create a shared directory for nodes and a workspace for RDMA-P2P:

# in the directory where you want to store your node files
mkdir -p nodes
cd nodes
mkdir -p workspace

Step 2: Clone RDMA-P2P Repository

Clone the repository into the workspace:

# change directory to the workspace
cd workspace
git clone git@github.com:lambda-scale/rdma-p2p.git
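After cloning, the directory layout should look like the sketch below. Note that later configuration examples refer to this checkout as RDMA-P2P; adjust the path casing to match your clone.

nodes/
└── workspace/
    └── rdma-p2p/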

Step 3: Configure Key Files

Prepare derecho_node.cfg for automation:

# copy derecho_node-sample.cfg into the current directory and rename it
cp src/conf/derecho_node-sample.cfg ./
mv derecho_node-sample.cfg derecho_node.cfg

Example derecho_node.cfg:

# derecho_node.cfg sample
[Automation]
num_experiments = 1 # how many runs; the total number of workers doubles for each run
local_rdma_dir = /sample/workspace/RDMA-P2P # set to the workspace you created in the previous step
node_config_file = /sample/workspace/RDMA-P2P/node-sample.cfg # typically set to ${local_rdma_dir}/node.cfg

worker_comm_establish_wait_time = 8 # seconds to wait after launching our system on each worker
# seconds to wait before launching the next worker; the leader/contact node must be
# started before the other workers, at least for RDMA
worker_inter_wait_time = 4
controller_wait_time = 4 # seconds to wait after the controller Python application has started
client_wait_time = 8 # seconds to wait after sending a request

Step 4: Allocate SLURM Resources

Request resources from the SLURM cluster, replacing ${node_number} with the number of nodes to allocate:

salloc --nodes=${node_number} --ntasks-per-node=1 --partition=gpu --gres=gpu:a40:1 --cpus-per-task=16 --mem-per-cpu=8G
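For example, to request four nodes (the partition and GRES names here follow the command above; adjust them for your cluster) and confirm the allocation:

# request 4 nodes, then check that the allocation was granted
salloc --nodes=4 --ntasks-per-node=1 --partition=gpu --gres=gpu:a40:1 --cpus-per-task=16 --mem-per-cpu=8G
squeue -u $USER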

Step 5: Run Experiment Automation

Run the SLURM automation script:

# change directory to the workspace
cd workspace
bash slurm/deploy/automate_experiment.slurm
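Once the run finishes, the collected logs can be inspected. Assuming the SLURM flow collects them into an experiment-res directory like the Docker flow described below:

# list collected experiment logs (directory name is an assumption based on the Docker workflow)
ls experiment-res/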

Docker deployment (deprecated)

Steps

  • Pull the NVIDIA base image
  • Build the GPUDirect RDMA (GDR) image
  • Launch the GDR cluster
  • Run the experiment automation

Step 1: Pull the base image nvcr.io/nvidia/pytorch:24.04-py3

docker pull nvcr.io/nvidia/pytorch:24.04-py3

Step 2: Build GPUDirect RDMA (GDR) image

This step copies three files into the image (the SSH private and public keys and install_dependencies.sh), so make sure these three files are present before building.

cd dockerfiles
docker build -t gpu-scaling-gdr:latest .
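Before building, it may help to confirm the three files are present; the key filenames below are assumptions and should match the COPY lines in your Dockerfile:

# hypothetical key names; install_dependencies.sh is named in the text above
ls id_rsa id_rsa.pub install_dependencies.sh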

Step 3: Launch GDR cluster (4 Docker instances)

The current image is built as a long-running instance; change dockerfiles/Dockerfile if you want it to be short-lived.

docker run -dit --name worker0 --gpus '"device=1"' --network host --privileged --device /dev/infiniband/uverbs1 --device /dev/infiniband/rdma_cm gpu-scaling-gdr:latest /bin/bash
...
docker run -dit --name worker3 --gpus '"device=1"' --network host --privileged --device /dev/infiniband/uverbs1 --device /dev/infiniband/rdma_cm gpu-scaling-gdr:latest /bin/bash
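Since workers 1 and 2 follow the same pattern, all four containers can be launched with a single loop (a sketch; adjust the GPU selection and InfiniBand devices to your hosts):

# launch worker0..worker3 with identical device mappings
for i in 0 1 2 3; do
  docker run -dit --name worker$i --gpus '"device=1"' --network host --privileged \
    --device /dev/infiniband/uverbs1 --device /dev/infiniband/rdma_cm \
    gpu-scaling-gdr:latest /bin/bash
done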

Step 4: Experiment automation

1. Configure node.cfg

After launching the GDR cluster, first configure node.cfg with each container's ID and IP address.

# copy node-sample.cfg to ./
cp src/conf/node-sample.cfg ./
mv node-sample.cfg node.cfg
vim node.cfg
...

# example node.cfg for 4 GDR instances
# the leader node goes first
0,192.168.0.10
1,192.168.0.20
2,192.168.0.30
3,192.168.0.40
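Because the containers run with --network host, each one shares its host's IP address. One way to gather the entries for node.cfg (a sketch, assuming one worker per host):

# on each host: the worker container's name and ID, and the host IP it shares
docker ps --filter name=worker --format '{{.Names}} {{.ID}}'
hostname -I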

2. Run automation script

Now we are ready to run the automation script. The automation includes the following steps:

  • Configure the derecho.cfg file on each remote host (the application and the RDMA layer share this config file)
  • Clean up TCP ports and log files from the last experiment
  • Collect log files into the experiment-res directory on the local host
# copy derecho_node-sample.cfg into the current directory and rename it
cp src/conf/derecho_node-sample.cfg ./
mv derecho_node-sample.cfg derecho_node.cfg

# example derecho_node.cfg used by the automate_experiment.sh script
[Automation]
num_experiments = 1 # how many runs; the total number of workers doubles for each run
total_workers = 3 # total workers in the current cluster
ssh_port = 2222 # port number for sshd on each host
local_rdma_dir = /home/rui/workspace/RDMA-P2P # this path is used in multiple places; make sure it is correct
node_config_file = /home/rui/workspace/RDMA-P2P/node.cfg # absolute path of the node.cfg file
worker_comm_establish_wait_time = 5 # seconds to wait after launching our system on each worker

# seconds to wait before launching the next worker; the leader/contact node must be
# started before the other workers, at least for RDMA
worker_inter_wait_time = 4

controller_wait_time = 4 # seconds to wait after the controller Python application has started
client_wait_time = 8 # seconds to wait after sending a request
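With derecho_node.cfg in place, run the automation script; the script name comes from the comment above, and its location in the repository is an assumption:

bash automate_experiment.sh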
