trtllm-build Mixtral-8x7B-v0.1 fp16 failed #2043

Closed
vonchenplus opened this issue Jul 29, 2024 · 8 comments
Labels
bug (Something isn't working) · waiting for feedback

Comments

@vonchenplus

System Info

  • NVIDIA A100 80GB × 2
  • Libraries
    • TensorRT-LLM 0.11.0
  • Driver Version: 525.105.17
  • CUDA Version: 12.4

Who can help?

@byshiue

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Create the model config file:
    Mixtral-8x7B-v0.1.json

  2. Build the model with the following command:

    export max_batch_size=2048
    export tp_size=2
    export max_num_tokens=2048

    trtllm-build --model_config $model_cfg \
        --use_fused_mlp \
        --gpt_attention_plugin float16 \
        --output_dir $engine_dir \
        --max_batch_size $max_batch_size \
        --max_input_len 2048 \
        --max_output_len 2048 \
        --reduce_fusion disable \
        --workers $tp_size \
        --max_num_tokens $max_num_tokens \
        --use_paged_context_fmha enable \
        --multiple_profiles enable

Expected behavior

Build success

Actual behavior

OutOfMemory

trtllm-build output log:
[07/29/2024-03:08:43] [TRT-LLM] [W] Found pynvml==11.5.3 and cuda driver version 525.105.17. Please use pynvml>=11.5.0 and cuda driver>=526 to get accurate memory usage.
[TensorRT-LLM] TensorRT-LLM version: 0.11.0
[07/29/2024-03:08:45] [TRT-LLM] [I] Set bert_attention_plugin to auto.
[07/29/2024-03:08:45] [TRT-LLM] [I] Set gpt_attention_plugin to float16.
[07/29/2024-03:08:45] [TRT-LLM] [I] Set gemm_plugin to None.
[07/29/2024-03:08:45] [TRT-LLM] [I] Set gemm_swiglu_plugin to None.
[07/29/2024-03:08:45] [TRT-LLM] [I] Set nccl_plugin to auto.
[07/29/2024-03:08:45] [TRT-LLM] [I] Set lookup_plugin to None.
[07/29/2024-03:08:45] [TRT-LLM] [I] Set lora_plugin to None.
[07/29/2024-03:08:45] [TRT-LLM] [I] Set moe_plugin to auto.
[07/29/2024-03:08:45] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
[07/29/2024-03:08:45] [TRT-LLM] [I] Set context_fmha to True.
[07/29/2024-03:08:45] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[07/29/2024-03:08:45] [TRT-LLM] [I] Set paged_kv_cache to True.
[07/29/2024-03:08:45] [TRT-LLM] [I] Set remove_input_padding to True.
[07/29/2024-03:08:45] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[07/29/2024-03:08:45] [TRT-LLM] [I] Set reduce_fusion to False.
[07/29/2024-03:08:45] [TRT-LLM] [I] Set multi_block_mode to False.
[07/29/2024-03:08:45] [TRT-LLM] [I] Set enable_xqa to True.
[07/29/2024-03:08:45] [TRT-LLM] [I] Set tokens_per_block to 64.
[07/29/2024-03:08:45] [TRT-LLM] [I] Set use_paged_context_fmha to True.
[07/29/2024-03:08:45] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[07/29/2024-03:08:45] [TRT-LLM] [I] Set multiple_profiles to True.
[07/29/2024-03:08:45] [TRT-LLM] [I] Set paged_state to True.
[07/29/2024-03:08:45] [TRT-LLM] [I] Set streamingllm to False.
[07/29/2024-03:08:45] [TRT-LLM] [W] Implicitly setting LLaMAConfig.rope_theta = 1000000.0
[07/29/2024-03:08:45] [TRT-LLM] [W] --max_output_len has been deprecated in favor of --max_seq_len
[07/29/2024-03:08:45] [TRT-LLM] [W] padding removal and fMHA are both enabled, max_input_len is not required and will be ignored
[07/29/2024-03:08:47] [TRT-LLM] [W] Found pynvml==11.5.3 and cuda driver version 525.105.17. Please use pynvml>=11.5.0 and cuda driver>=526 to get accurate memory usage.
[07/29/2024-03:08:49] [TRT-LLM] [W] Logger level already set from environment. Discard new verbosity: error
[TensorRT-LLM] TensorRT-LLM version: 0.11.0
[07/29/2024-03:08:49] [TRT-LLM] [W] Logger level already set from environment. Discard new verbosity: info
[07/29/2024-03:08:52] [TRT] [W] profileSharing0806 is on by default in TensorRT 10.0. This flag is deprecated and has no effect.
[07/29/2024-03:09:09] [TRT] [W] Unused Input: position_ids
[07/29/2024-03:09:09] [TRT] [W] Detected layernorm nodes in FP16.
[07/29/2024-03:09:09] [TRT] [W] Running layernorm after self-attention in FP16 may cause overflow. Exporting the model to the latest available ONNX opset (later than opset 17) to use the INormalizationLayer, or forcing layernorm layers to run in FP32 precision can help with preserving accuracy.
[07/29/2024-03:09:09] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[07/29/2024-03:10:11] [TRT] [E] [virtualMemoryBuffer.cpp::resizePhysical::140] Error Code 2: OutOfMemory (Requested size was 80815849472 bytes.)
[07/29/2024-03:10:11] [TRT] [E] [virtualMemoryBuffer.cpp::resizePhysical::140] Error Code 2: OutOfMemory (Requested size was 80815849472 bytes.)
[07/29/2024-03:10:11] [TRT] [W] Requested amount of GPU memory (80815849472 bytes) could not be allocated. There may not be enough free memory for allocation to succeed.
[07/29/2024-03:10:12] [TRT] [E] Error Code: 2: Requested size was 80815849472 bytes.
[07/29/2024-03:10:12] [TRT] [E] [globWriter.cpp::makeResizableGpuMemory::424] Error Code 2: OutOfMemory (Requested size was 80815849472 bytes.)
[07/29/2024-03:10:12] [TRT-LLM] [E] Engine building failed, please check the error log.

Additional notes

I attempted to reproduce the performance tests described in perf-overview and successfully ran them for Llama-2-7b-hf and Meta-Llama-3-8B. However, when building Mixtral-8x7B-v0.1, I encountered an out-of-memory error. I am using two A100 80GB GPUs. What type of GPU is required to build Mixtral-8x7B-v0.1?

vonchenplus added the bug label Jul 29, 2024
@Kefeng-Duan
Collaborator

Kefeng-Duan commented Jul 29, 2024

Hi, vonchenplus
Please decrease max_batch_size/max_input_len/max_output_len when building the engine.
A rough estimation:

  1. For Mixtral-8x7B in fp16 precision, the weights consume ~87 GB.
  2. The fp16 KV cache consumes ~65536 * 2048 (max_bs) * 4096 (max_in + max_out) * 2 (fp16 bytes) / 1024^3 ≈ 1024 GB.
    The sum of the two is far larger than your GPUs' capacity (160 GB).
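
For reference, the same estimate can be re-derived with a quick sketch. The per-token figure of 65536 comes from Mixtral-8x7B's public shape (32 layers, 8 KV heads, head dim 128) and the ~46.7B parameter count is the commonly cited total for the model; both are assumptions beyond the numbers quoted above.

    # Sketch: reproduce the rough memory estimate quoted above.
    # Model shape is the public Mixtral-8x7B-v0.1 config, not taken from this thread.
    num_layers = 32
    num_kv_heads = 8
    head_dim = 128
    bytes_per_fp16 = 2

    # K and V for every KV head in every layer -> KV-cache elements per token.
    kv_elems_per_token = 2 * num_kv_heads * head_dim * num_layers   # = 65536

    max_batch_size = 2048
    max_seq_len = 2048 + 2048                                       # max_input_len + max_output_len

    kv_cache_gib = kv_elems_per_token * max_batch_size * max_seq_len * bytes_per_fp16 / 1024**3
    weights_gib = 46.7e9 * bytes_per_fp16 / 1024**3                 # ~46.7B params in fp16

    print(f"KV cache ~{kv_cache_gib:.0f} GiB, weights ~{weights_gib:.0f} GiB")
    # -> KV cache ~1024 GiB, weights ~87 GiB, versus 160 GiB total on 2x A100 80GB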

@vonchenplus
Author

I have a few questions I'd like to ask:

  1. When I set the workers parameter to 2 during trtllm-build, is only one GPU being used in the end? I see that only one GPU's memory is being utilized.
  2. When you tested Mixtral-8x7B-v0.1 in fp16, which GPUs did you use?

@Kefeng-Duan
Collaborator

Kefeng-Duan commented Jul 30, 2024

Hi, @vonchenplus

  1. How did you observe that only one GPU was used? Could you help double-check by printing workers where it is computed, i.e. workers = min(torch.cuda.device_count(), args.workers)?
  2. I think 2x A100 is OK.
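
A quick sanity check along those lines (a minimal sketch; the CUDA_VISIBLE_DEVICES check and the hard-coded worker count are assumptions for illustration, not from this thread):

    # Sketch: confirm how many GPUs the process can see and what workers resolves to.
    import os
    import torch

    print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))  # None means all GPUs visible
    print("torch.cuda.device_count() =", torch.cuda.device_count())

    requested_workers = 2  # the value passed via --workers in this issue
    print("effective workers =", min(torch.cuda.device_count(), requested_workers))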

@vonchenplus
Author

Hi, @Kefeng-Duan,

  1. I checked yesterday and confirmed that workers is set to 2, but it seems that only one GPU is being used.
  2. However, in my current test, building with two GPUs is indeed not working.

@Kefeng-Duan
Collaborator

Hi, @vonchenplus

  1. Is torch.cuda.device_count() 1 or 2 on your machine?
  2. Does it still fail even when you decrease max_bs/max_seqin/max_seqout to a proper range? What is the failure message?
    Thanks

@vonchenplus
Author

Hi @Kefeng-Duan

  1. torch.cuda.device_count() is 2.
  2. I didn't reduce max_bs/max_seqin/max_seqout because I wanted to reproduce the performance mentioned in your report.

@kaiyux
Member

kaiyux commented Aug 1, 2024

Hi @vonchenplus, in the Mixtral-8x7B-v0.1.json file you shared, tp_size is set to 1. If you want to do TP2 with dummy weights, you'll need to set mapping as follows:

"mapping": {
        "world_size": 2,
        "tp_size": 2,
        "pp_size": 1
    },
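
For context, world_size should equal tp_size * pp_size, and the build command above already passes --workers $tp_size. A minimal sketch of patching the config file accordingly, assuming mapping is a top-level key exactly as in the snippet above:

    # Sketch: set the mapping block of an existing model config for TP2.
    import json

    cfg_path = "Mixtral-8x7B-v0.1.json"  # the config file attached to this issue
    with open(cfg_path) as f:
        cfg = json.load(f)

    tp_size, pp_size = 2, 1
    cfg["mapping"] = {
        "world_size": tp_size * pp_size,  # must equal tp_size * pp_size
        "tp_size": tp_size,
        "pp_size": pp_size,
    }

    with open(cfg_path, "w") as f:
        json.dump(cfg, f, indent=4)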

@vonchenplus
Author

Hi @kaiyux,

It's working now, thank you.
