
How to run inference of a (very) large model across multiple GPUs? #2007

Open

jorgeantonio21 opened this issue Apr 4, 2024 · 3 comments

@jorgeantonio21 (Contributor)

It is mentioned in the README that candle supports multi-GPU inference, using NCCL under the hood. How can this be implemented? I wonder if there is an available example to look at.

Also, I know PyTorch has things like DDP and FSDP; is candle's support for multi-GPU inference comparable to these techniques?

@EricLBuehler (Member)

Please see the llama multiprocess example. Multi-GPU inference is done by creating parallelized (tensor-parallel) linear layers:

fn load(vb: VarBuilder, cache: &Cache, cfg: &Config, comm: Rc<Comm>) -> Result<Self> {
    // Column-parallel: the q/k/v projections are split along the output dimension,
    // so each rank only loads and computes its own slice of the attention heads.
    let qkv_proj = TensorParallelColumnLinear::load_multi(
        vb.clone(),
        &["q_proj", "k_proj", "v_proj"],
        comm.clone(),
    )?;
    // Row-parallel: the output projection is split along the input dimension and the
    // partial results are summed across ranks (all-reduce over the NCCL communicator).
    let o_proj = TensorParallelRowLinear::load(vb.pp("o_proj"), comm.clone())?;
    Ok(Self {
        qkv_proj,
        o_proj,
        // Each rank only handles its share of the attention and key-value heads.
        num_attention_heads: cfg.num_attention_heads / comm.world_size(),
        num_key_value_heads: cfg.num_key_value_heads / comm.world_size(),
        head_dim: cfg.hidden_size / cfg.num_attention_heads,
        cache: cache.clone(),
    })
}
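
Each rank loads only its shard, so every process also needs its own NCCL communicator before calling load. Below is a minimal sketch of that per-rank setup, loosely following the example and assuming cudarc's NCCL bindings (Id, Comm::from_rank); names and error handling here are illustrative rather than the example's exact code.

// Sketch only: per-rank NCCL setup. The exact cudarc/candle APIs may differ
// between versions; check the llama_multiprocess example for the real code.
use std::rc::Rc;
use cudarc::driver::CudaDevice;
use cudarc::nccl::safe::{Comm, Id};

fn init_comm(rank: usize, world_size: usize, id: Id) -> anyhow::Result<Rc<Comm>> {
    // One CUDA device per rank; the unique Id must be identical on every rank
    // (rank 0 creates it with Id::new() and shares it with the other ranks).
    let device = CudaDevice::new(rank)?;
    let comm = Comm::from_rank(device, rank, world_size, id)
        .map_err(|e| anyhow::anyhow!("NCCL init failed: {e:?}"))?;
    Ok(Rc::new(comm))
}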

@b0xtch (Contributor) commented Apr 12, 2024

That example is for a single node. How about multiple nodes? Can we just run the example with mpirun -n 2 --hostfile ../../hostfile target/release/llama_multiprocess 2 2000?

Update:

I guess I would have to modify the code to handle the MPI world rank. Sticking to NCCL as the backend seems preferable, but is there support in cudarc for cross-node communication? (See the sketch below.)

Found this library https://github.com/oddity-ai/async-cuda
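
For the cross-node case, the missing piece is getting the NCCL unique Id (created on one rank) to every other rank; MPI is a convenient transport for that handshake, after which the per-rank Comm setup is the same as in the single-node example. A rough sketch follows, assuming the rsmpi (mpi) crate; id_to_bytes / id_from_bytes are hypothetical helpers standing in for whatever raw-byte accessor a given cudarc version exposes on Id.

// Sketch only: broadcast the NCCL unique Id over MPI, then build one Comm per rank.
use cudarc::nccl::safe::{Comm, Id};
use mpi::traits::*;

// Hypothetical helpers: adapt to the Id byte accessors of your cudarc version.
fn id_to_bytes(_id: &Id) -> [u8; 128] { todo!("extract the raw ncclUniqueId bytes") }
fn id_from_bytes(_bytes: &[u8; 128]) -> Id { todo!("rebuild an Id from raw bytes") }

fn main() -> anyhow::Result<()> {
    let universe = mpi::initialize().expect("MPI init failed");
    let world = universe.world();
    let rank = world.rank() as usize;
    let world_size = world.size() as usize;

    // Rank 0 creates the unique Id; its 128 raw bytes are broadcast to all ranks.
    let mut id_bytes = [0u8; 128];
    if rank == 0 {
        let id = Id::new().expect("failed to create NCCL Id");
        id_bytes = id_to_bytes(&id);
    }
    world.process_at_rank(0).broadcast_into(&mut id_bytes[..]);

    // Every rank rebuilds the same Id and joins the communicator.
    let id = id_from_bytes(&id_bytes);
    let local_gpu = rank % 8; // assumption: 8 GPUs per node; use the MPI local rank in practice
    let device = cudarc::driver::CudaDevice::new(local_gpu)?;
    let _comm = Comm::from_rank(device, rank, world_size, id)
        .map_err(|e| anyhow::anyhow!("NCCL init failed: {e:?}"))?;
    // ...load the model shard and run inference, as in the single-node example...
    Ok(())
}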

@b0xtch (Contributor) commented Jun 28, 2024

I started a draft here for splitting a model across multiple GPUs on different nodes. There is a mapping feature, as I linked above, in the mistral.rs repo.
