trtllm-build Mixtral-8x7B-v0.1 fp16 failed #2043
Comments
Hi @kaiyux, it's already working, thank you.
System Info
Who can help?
@byshiue
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
Create the model config file Mixtral-8x7B-v0.1.json.
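As a rough illustration only, a config file of this kind could look like the sketch below. The key names follow the TensorRT-LLM v0.11 pretrained-config layout and the values are taken from the published Mixtral-8x7B-v0.1 architecture; both are assumptions here, so the actual file used in perf-overview may differ:

```
# Hypothetical sketch of the model config; key names and values are assumptions
# (TensorRT-LLM v0.11 pretrained-config layout, public Mixtral-8x7B-v0.1 specs),
# not the actual file attached to this issue.
cat > Mixtral-8x7B-v0.1.json <<'EOF'
{
  "architecture": "MixtralForCausalLM",
  "dtype": "float16",
  "num_hidden_layers": 32,
  "num_attention_heads": 32,
  "num_key_value_heads": 8,
  "hidden_size": 4096,
  "intermediate_size": 14336,
  "vocab_size": 32000,
  "max_position_embeddings": 32768,
  "hidden_act": "swiglu",
  "moe": { "num_experts": 8, "top_k": 2 },
  "mapping": { "world_size": 2, "tp_size": 2 }
}
EOF
```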
Then build the engine with trtllm-build:
```
export max_batch_size=2048
export tp_size=2
export max_num_tokens=2048

trtllm-build --model_config $model_cfg \
    --use_fused_mlp \
    --gpt_attention_plugin float16 \
    --output_dir $engine_dir \
    --max_batch_size $max_batch_size \
    --max_input_len 2048 \
    --max_output_len 2048 \
    --reduce_fusion disable \
    --workers $tp_size \
    --max_num_tokens $max_num_tokens \
    --use_paged_context_fmha enable \
    --multiple_profiles enable
```
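As an aside, the build log below warns that --max_input_len is ignored when padding removal and fMHA are both enabled, and that --max_output_len is deprecated in favor of --max_seq_len. An equivalent v0.11 invocation might therefore look like the sketch below, where 4096 (2048 input + 2048 output) is an assumed value rather than a documented setting:

```
# Variant using --max_seq_len, as the deprecation warning in the log suggests.
# 4096 (= 2048 input + 2048 output) is an assumption, not from the original repro.
trtllm-build --model_config $model_cfg \
    --use_fused_mlp \
    --gpt_attention_plugin float16 \
    --output_dir $engine_dir \
    --max_batch_size $max_batch_size \
    --max_seq_len 4096 \
    --reduce_fusion disable \
    --workers $tp_size \
    --max_num_tokens $max_num_tokens \
    --use_paged_context_fmha enable \
    --multiple_profiles enable
```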
Expected behavior
Build success
Actual behavior
OutOfMemory
trtllm-build output log:
[07/29/2024-03:08:43] [TRT-LLM] [W] Found pynvml==11.5.3 and cuda driver version 525.105.17. Please use pynvml>=11.5.0 and cuda driver>=526 to get accurate memory usage.
[TensorRT-LLM] TensorRT-LLM version: 0.11.0
[07/29/2024-03:08:45] [TRT-LLM] [I] Set bert_attention_plugin to auto.
[07/29/2024-03:08:45] [TRT-LLM] [I] Set gpt_attention_plugin to float16.
[07/29/2024-03:08:45] [TRT-LLM] [I] Set gemm_plugin to None.
[07/29/2024-03:08:45] [TRT-LLM] [I] Set gemm_swiglu_plugin to None.
[07/29/2024-03:08:45] [TRT-LLM] [I] Set nccl_plugin to auto.
[07/29/2024-03:08:45] [TRT-LLM] [I] Set lookup_plugin to None.
[07/29/2024-03:08:45] [TRT-LLM] [I] Set lora_plugin to None.
[07/29/2024-03:08:45] [TRT-LLM] [I] Set moe_plugin to auto.
[07/29/2024-03:08:45] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
[07/29/2024-03:08:45] [TRT-LLM] [I] Set context_fmha to True.
[07/29/2024-03:08:45] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[07/29/2024-03:08:45] [TRT-LLM] [I] Set paged_kv_cache to True.
[07/29/2024-03:08:45] [TRT-LLM] [I] Set remove_input_padding to True.
[07/29/2024-03:08:45] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[07/29/2024-03:08:45] [TRT-LLM] [I] Set reduce_fusion to False.
[07/29/2024-03:08:45] [TRT-LLM] [I] Set multi_block_mode to False.
[07/29/2024-03:08:45] [TRT-LLM] [I] Set enable_xqa to True.
[07/29/2024-03:08:45] [TRT-LLM] [I] Set tokens_per_block to 64.
[07/29/2024-03:08:45] [TRT-LLM] [I] Set use_paged_context_fmha to True.
[07/29/2024-03:08:45] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[07/29/2024-03:08:45] [TRT-LLM] [I] Set multiple_profiles to True.
[07/29/2024-03:08:45] [TRT-LLM] [I] Set paged_state to True.
[07/29/2024-03:08:45] [TRT-LLM] [I] Set streamingllm to False.
[07/29/2024-03:08:45] [TRT-LLM] [W] Implicitly setting LLaMAConfig.rope_theta = 1000000.0
[07/29/2024-03:08:45] [TRT-LLM] [W] --max_output_len has been deprecated in favor of --max_seq_len
[07/29/2024-03:08:45] [TRT-LLM] [W] padding removal and fMHA are both enabled, max_input_len is not required and will be ignored
[07/29/2024-03:08:47] [TRT-LLM] [W] Found pynvml==11.5.3 and cuda driver version 525.105.17. Please use pynvml>=11.5.0 and cuda driver>=526 to get accurate memory usage.
[07/29/2024-03:08:49] [TRT-LLM] [W] Logger level already set from environment. Discard new verbosity: error
[TensorRT-LLM] TensorRT-LLM version: 0.11.0
[07/29/2024-03:08:49] [TRT-LLM] [W] Logger level already set from environment. Discard new verbosity: info
[07/29/2024-03:08:52] [TRT] [W] profileSharing0806 is on by default in TensorRT 10.0. This flag is deprecated and has no effect.
[07/29/2024-03:09:09] [TRT] [W] Unused Input: position_ids
[07/29/2024-03:09:09] [TRT] [W] Detected layernorm nodes in FP16.
[07/29/2024-03:09:09] [TRT] [W] Running layernorm after self-attention in FP16 may cause overflow. Exporting the model to the latest available ONNX opset (later than opset 17) to use the INormalizationLayer, or forcing layernorm layers to run in FP32 precision can help with preserving accuracy.
[07/29/2024-03:09:09] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[07/29/2024-03:10:11] [TRT] [E] [virtualMemoryBuffer.cpp::resizePhysical::140] Error Code 2: OutOfMemory (Requested size was 80815849472 bytes.)
[07/29/2024-03:10:11] [TRT] [E] [virtualMemoryBuffer.cpp::resizePhysical::140] Error Code 2: OutOfMemory (Requested size was 80815849472 bytes.)
[07/29/2024-03:10:11] [TRT] [W] Requested amount of GPU memory (80815849472 bytes) could not be allocated. There may not be enough free memory for allocation to succeed.
[07/29/2024-03:10:12] [TRT] [E] Error Code: 2: Requested size was 80815849472 bytes.
[07/29/2024-03:10:12] [TRT] [E] [globWriter.cpp::makeResizableGpuMemory::424] Error Code 2: OutOfMemory (Requested size was 80815849472 bytes.)
[07/29/2024-03:10:12] [TRT-LLM] [E] Engine building failed, please check the error log.
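Since --multiple_profiles enable asks the builder to materialize several optimization profiles at the maximum shapes above, one hedged guess is that shrinking the shape budgets would avoid the failed build-time allocation. A purely illustrative rebuild (the reduced values below are assumptions, not the perf-overview configuration) could look like:

```
# Illustrative lower-memory rebuild: smaller batch budget and a single
# optimization profile. Values are assumptions, not the perf-overview settings.
export max_batch_size=256
export max_num_tokens=2048
trtllm-build --model_config $model_cfg \
    --use_fused_mlp \
    --gpt_attention_plugin float16 \
    --output_dir $engine_dir \
    --max_batch_size $max_batch_size \
    --max_seq_len 4096 \
    --reduce_fusion disable \
    --workers $tp_size \
    --max_num_tokens $max_num_tokens \
    --use_paged_context_fmha enable \
    --multiple_profiles disable
```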
Additional notes
I attempted to reproduce the performance tests described in perf-overview and succeeded for Llama-2-7b-hf and Meta-Llama-3-8B. However, building Mixtral-8x7B-v0.1 fails with an out-of-memory error. I am using two A100 80GB GPUs. What type of GPU is required to build Mixtral-8x7B-v0.1?
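For context, here is a back-of-the-envelope check of where the memory goes. The ~46.7B parameter count is the published Mixtral-8x7B figure; the rest is plain arithmetic against the 80,815,849,472-byte request in the log:

```
# Rough memory arithmetic; assumes ~46.7B parameters (published Mixtral-8x7B
# figure) at 2 bytes each in fp16.
echo $(( 46700000000 * 2 / 1073741824 ))     # total fp16 weights: ~86 GiB
echo $(( 46700000000 * 2 / 2 / 1073741824 )) # weights per GPU at tp_size=2: ~43 GiB
echo $(( 80815849472 / 1073741824 ))         # builder's failed request: ~75 GiB
```

If that arithmetic holds, ~43 GiB of weights plus a ~75 GiB build-time buffer exceeds a single 80 GB A100, which would make the failure plausible even at tp_size=2.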