trtllm-build Mixtral-8x7B-v0.1 fp16 failed #2043
Comments
Hi @kaiyux, it's already working, thank you.
System Info
Who can help?
@byshiue
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
Create the model config file Mixtral-8x7B-v0.1.json.
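As a rough illustration only, a config file of this kind could look like the sketch below. The key names follow the TensorRT-LLM v0.11 pretrained-config layout and the values are taken from the published Mixtral-8x7B-v0.1 architecture; both are assumptions here, so the actual file used in perf-overview may differ:

```
# Hypothetical sketch of the model config; key names and values are assumptions
# (TensorRT-LLM v0.11 pretrained-config layout, public Mixtral-8x7B-v0.1 specs),
# not the actual file attached to this issue.
cat > Mixtral-8x7B-v0.1.json <<'EOF'
{
  "architecture": "MixtralForCausalLM",
  "dtype": "float16",
  "num_hidden_layers": 32,
  "num_attention_heads": 32,
  "num_key_value_heads": 8,
  "hidden_size": 4096,
  "intermediate_size": 14336,
  "vocab_size": 32000,
  "max_position_embeddings": 32768,
  "hidden_act": "swiglu",
  "moe": { "num_experts": 8, "top_k": 2 },
  "mapping": { "world_size": 2, "tp_size": 2 }
}
EOF
```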
Then build the engine with trtllm-build:
```
export max_batch_size=2048
export tp_size=2
export max_num_tokens=2048

trtllm-build --model_config $model_cfg \
    --use_fused_mlp \
    --gpt_attention_plugin float16 \
    --output_dir $engine_dir \
    --max_batch_size $max_batch_size \
    --max_input_len 2048 \
    --max_output_len 2048 \
    --reduce_fusion disable \
    --workers $tp_size \
    --max_num_tokens $max_num_tokens \
    --use_paged_context_fmha enable \
    --multiple_profiles enable
```
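As an aside, the build log below warns that --max_input_len is ignored when padding removal and fMHA are both enabled, and that --max_output_len is deprecated in favor of --max_seq_len. An equivalent v0.11 invocation might therefore look like the sketch below, where 4096 (2048 input + 2048 output) is an assumed value rather than a documented setting:

```
# Variant using --max_seq_len, as the deprecation warning in the log suggests.
# 4096 (= 2048 input + 2048 output) is an assumption, not from the original repro.
trtllm-build --model_config $model_cfg \
    --use_fused_mlp \
    --gpt_attention_plugin float16 \
    --output_dir $engine_dir \
    --max_batch_size $max_batch_size \
    --max_seq_len 4096 \
    --reduce_fusion disable \
    --workers $tp_size \
    --max_num_tokens $max_num_tokens \
    --use_paged_context_fmha enable \
    --multiple_profiles enable
```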
Expected behavior
Build success
Actual behavior
OutOfMemory
trtllm-build output log:
[07/29/2024-03:08:43] [TRT-LLM] [W] Found pynvml==11.5.3 and cuda driver version 525.105.17. Please use pynvml>=11.5.0 and cuda driver>=526 to get accurate memory usage.
[TensorRT-LLM] TensorRT-LLM version: 0.11.0
[07/29/2024-03:08:45] [TRT-LLM] [I] Set bert_attention_plugin to auto.
[07/29/2024-03:08:45] [TRT-LLM] [I] Set gpt_attention_plugin to float16.
[07/29/2024-03:08:45] [TRT-LLM] [I] Set gemm_plugin to None.
[07/29/2024-03:08:45] [TRT-LLM] [I] Set gemm_swiglu_plugin to None.
[07/29/2024-03:08:45] [TRT-LLM] [I] Set nccl_plugin to auto.
[07/29/2024-03:08:45] [TRT-LLM] [I] Set lookup_plugin to None.
[07/29/2024-03:08:45] [TRT-LLM] [I] Set lora_plugin to None.
[07/29/2024-03:08:45] [TRT-LLM] [I] Set moe_plugin to auto.
[07/29/2024-03:08:45] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
[07/29/2024-03:08:45] [TRT-LLM] [I] Set context_fmha to True.
[07/29/2024-03:08:45] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[07/29/2024-03:08:45] [TRT-LLM] [I] Set paged_kv_cache to True.
[07/29/2024-03:08:45] [TRT-LLM] [I] Set remove_input_padding to True.
[07/29/2024-03:08:45] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[07/29/2024-03:08:45] [TRT-LLM] [I] Set reduce_fusion to False.
[07/29/2024-03:08:45] [TRT-LLM] [I] Set multi_block_mode to False.
[07/29/2024-03:08:45] [TRT-LLM] [I] Set enable_xqa to True.
[07/29/2024-03:08:45] [TRT-LLM] [I] Set tokens_per_block to 64.
[07/29/2024-03:08:45] [TRT-LLM] [I] Set use_paged_context_fmha to True.
[07/29/2024-03:08:45] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[07/29/2024-03:08:45] [TRT-LLM] [I] Set multiple_profiles to True.
[07/29/2024-03:08:45] [TRT-LLM] [I] Set paged_state to True.
[07/29/2024-03:08:45] [TRT-LLM] [I] Set streamingllm to False.
[07/29/2024-03:08:45] [TRT-LLM] [W] Implicitly setting LLaMAConfig.rope_theta = 1000000.0
[07/29/2024-03:08:45] [TRT-LLM] [W] --max_output_len has been deprecated in favor of --max_seq_len
[07/29/2024-03:08:45] [TRT-LLM] [W] padding removal and fMHA are both enabled, max_input_len is not required and will be ignored
[07/29/2024-03:08:47] [TRT-LLM] [W] Found pynvml==11.5.3 and cuda driver version 525.105.17. Please use pynvml>=11.5.0 and cuda driver>=526 to get accurate memory usage.
[07/29/2024-03:08:49] [TRT-LLM] [W] Logger level already set from environment. Discard new verbosity: error
[TensorRT-LLM] TensorRT-LLM version: 0.11.0
[07/29/2024-03:08:49] [TRT-LLM] [W] Logger level already set from environment. Discard new verbosity: info
[07/29/2024-03:08:52] [TRT] [W] profileSharing0806 is on by default in TensorRT 10.0. This flag is deprecated and has no effect.
[07/29/2024-03:09:09] [TRT] [W] Unused Input: position_ids
[07/29/2024-03:09:09] [TRT] [W] Detected layernorm nodes in FP16.
[07/29/2024-03:09:09] [TRT] [W] Running layernorm after self-attention in FP16 may cause overflow. Exporting the model to the latest available ONNX opset (later than opset 17) to use the INormalizationLayer, or forcing layernorm layers to run in FP32 precision can help with preserving accuracy.
[07/29/2024-03:09:09] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[07/29/2024-03:10:11] [TRT] [E] [virtualMemoryBuffer.cpp::resizePhysical::140] Error Code 2: OutOfMemory (Requested size was 80815849472 bytes.)
[07/29/2024-03:10:11] [TRT] [E] [virtualMemoryBuffer.cpp::resizePhysical::140] Error Code 2: OutOfMemory (Requested size was 80815849472 bytes.)
[07/29/2024-03:10:11] [TRT] [W] Requested amount of GPU memory (80815849472 bytes) could not be allocated. There may not be enough free memory for allocation to succeed.
[07/29/2024-03:10:12] [TRT] [E] Error Code: 2: Requested size was 80815849472 bytes.
[07/29/2024-03:10:12] [TRT] [E] [globWriter.cpp::makeResizableGpuMemory::424] Error Code 2: OutOfMemory (Requested size was 80815849472 bytes.)
[07/29/2024-03:10:12] [TRT-LLM] [E] Engine building failed, please check the error log.
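Since --multiple_profiles enable asks the builder to materialize several optimization profiles at the maximum shapes above, one hedged guess is that shrinking the shape budgets would avoid the failed build-time allocation. A purely illustrative rebuild (the reduced values below are assumptions, not the perf-overview configuration) could look like:

```
# Illustrative lower-memory rebuild: smaller batch budget and a single
# optimization profile. Values are assumptions, not the perf-overview settings.
export max_batch_size=256
export max_num_tokens=2048
trtllm-build --model_config $model_cfg \
    --use_fused_mlp \
    --gpt_attention_plugin float16 \
    --output_dir $engine_dir \
    --max_batch_size $max_batch_size \
    --max_seq_len 4096 \
    --reduce_fusion disable \
    --workers $tp_size \
    --max_num_tokens $max_num_tokens \
    --use_paged_context_fmha enable \
    --multiple_profiles disable
```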
Additional notes
I attempted to reproduce the performance tests described in perf-overview and succeeded for Llama-2-7b-hf and Meta-Llama-3-8B. However, building Mixtral-8x7B-v0.1 fails with an out-of-memory error. I am using two A100 80GB GPUs. What type of GPU is required to build Mixtral-8x7B-v0.1?
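For context, here is a back-of-the-envelope check of where the memory goes. The ~46.7B parameter count is the published Mixtral-8x7B figure; the rest is plain arithmetic against the 80,815,849,472-byte request in the log:

```
# Rough memory arithmetic; assumes ~46.7B parameters (published Mixtral-8x7B
# figure) at 2 bytes each in fp16.
echo $(( 46700000000 * 2 / 1073741824 ))     # total fp16 weights: ~86 GiB
echo $(( 46700000000 * 2 / 2 / 1073741824 )) # weights per GPU at tp_size=2: ~43 GiB
echo $(( 80815849472 / 1073741824 ))         # builder's failed request: ~75 GiB
```

If that arithmetic holds, ~43 GiB of weights plus a ~75 GiB build-time buffer exceeds a single 80 GB A100, which would make the failure plausible even at tp_size=2.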