This repository was archived by the owner on Jul 4, 2025. It is now read-only.

bug: TensorRT-LLM error #1315

@mafischer

Description

Cortex version

0.5.1-rc2

Describe the Bug

Running `cortex-beta run openhermes-2.5-7b-tensorrt-llm-linux-ada` fails with the logs below.

Steps to Reproduce

  1. `cortex-beta run openhermes-2.5-7b-tensorrt-llm-linux-ada`

Screenshots / Logs

20240923 19:58:25.229834 UTC 8237 DEBUG [LoadModel] Reset all resources and states before loading new model - tensorrt-llm_engine.cc:380
20240923 19:58:25.229878 UTC 8237 INFO Reset all resources and states - tensorrt-llm_engine.cc:616
20240923 19:58:25.229884 UTC 8237 DEBUG [LoadModel] n_parallel: 1, batch_size: 16 - tensorrt-llm_engine.cc:388
[TensorRT-LLM][INFO] Set logger level by INFO
20240923 19:58:25.276219 UTC 8237 INFO Successully loaded the tokenizer - tensorrt-llm_engine.h:105
20240923 19:58:25.276238 UTC 8237 INFO Loaded tokenizer from /home/ubuntu/cortexcpp-beta/models/openhermes-2.5-7b-tensorrt-llm-linux-ada/tokenizer.model - tensorrt-llm_engine.cc:414
[TensorRT-LLM][INFO] Engine version 0.11.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Parameter layer_types cannot be read from json:
[TensorRT-LLM][INFO] [json.exception.out_of_range.403] key 'layer_types' not found
[TensorRT-LLM][INFO] Parameter has_position_embedding cannot be read from json:
[TensorRT-LLM][INFO] [json.exception.out_of_range.403] key 'has_position_embedding' not found
[TensorRT-LLM][INFO] Parameter has_token_type_embedding cannot be read from json:
[TensorRT-LLM][INFO] [json.exception.out_of_range.403] key 'has_token_type_embedding' not found
[TensorRT-LLM][INFO] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][INFO] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 1
[TensorRT-LLM][INFO] Initialized MPI
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
20240923 19:58:25.765816 UTC 8237 INFO Loaded config from /home/ubuntu/cortexcpp-beta/models/openhermes-2.5-7b-tensorrt-llm-linux-ada/config.json - tensorrt-llm_engine.cc:421
[TensorRT-LLM][INFO] Engine version 0.11.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Parameter layer_types cannot be read from json:
[TensorRT-LLM][INFO] [json.exception.out_of_range.403] key 'layer_types' not found
[TensorRT-LLM][INFO] Parameter has_position_embedding cannot be read from json:
[TensorRT-LLM][INFO] [json.exception.out_of_range.403] key 'has_position_embedding' not found
[TensorRT-LLM][INFO] Parameter has_token_type_embedding cannot be read from json:
[TensorRT-LLM][INFO] [json.exception.out_of_range.403] key 'has_token_type_embedding' not found
[TensorRT-LLM][INFO] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][INFO] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 16
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 16
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 32768
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 32768
[TensorRT-LLM][INFO] TRTGptModel computeContextLogits: 0
[TensorRT-LLM][INFO] TRTGptModel computeGenerationLogits: 0
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 8192 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
20240923 19:58:25.890898 UTC 8237 ERROR Failed to load model: [TensorRT-LLM][ERROR] CUDA runtime error in cudaDeviceGetDefaultMemPool(&memPool, device): operation not supported (/home/runner/actions-runner/_work/cortex.tensorrt-llm/cortex.tensorrt-llm/cpp/tensorrt_llm/runtime/bufferManager.cpp:211)
1 0x7f9ee02bfdb3 void tensorrt_llm::common::check(cudaError, char const*, char const*, int) + 147
2 0x7f9e19d56fe4 tensorrt_llm::runtime::BufferManager::initMemoryPool(int) + 148
3 0x7f9e19d58e5f tensorrt_llm::runtime::BufferManager::BufferManager(std::shared_ptr<tensorrt_llm::runtime::CudaStream>, bool) + 431
4 0x7f9e19e38593 tensorrt_llm::runtime::TllmRuntime::TllmRuntime(tensorrt_llm::runtime::RawEngine const&, nvinfer1::ILogger*, float, bool) + 451
5 0x7f9e1a086579 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TrtGptModelInflightBatching(std::shared_ptr<nvinfer1::ILogger>, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::runtime::RawEngine const&, bool, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 937
6 0x7f9e1a0a965b tensorrt_llm::executor::Executor::Impl::createModel(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::executor::ExecutorConfig const&) + 443
7 0x7f9e1a0aee60 tensorrt_llm::executor::Executor::Impl::loadModel(std::optional<std::filesystem::__cxx11::path> const&, std::optional<std::vector<unsigned char, std::allocator<unsigned char> > > const&, tensorrt_llm::runtime::GptJsonConfig const&, tensorrt_llm::executor::ExecutorConfig const&, bool) + 1408
8 0x7f9e1a0aff6a tensorrt_llm::executor::Executor::Impl::Impl(std::filesystem::__cxx11::path const&, std::optional<std::filesystem::__cxx11::path> const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 1978
9 0x7f9e1a0a4cee tensorrt_llm::executor::Executor::Executor(std::filesystem::__cxx11::path const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 62
10 0x7f9ee02bdcef tensorrtllm::TensorrtllmEngine::LoadModel(std::shared_ptr<Json::Value>, std::function<void (Json::Value&&, Json::Value&&)>&&) + 3983
11 0x55b453c9b0a6 cortex-beta(+0x2560a6) [0x55b453c9b0a6]
12 0x55b453cabe5c cortex-beta(+0x266e5c) [0x55b453cabe5c]
13 0x55b453cabaef cortex-beta(+0x266aef) [0x55b453cabaef]
14 0x55b453cab858 cortex-beta(+0x266858) [0x55b453cab858]
15 0x55b45428f7d1 cortex-beta(+0x84a7d1) [0x55b45428f7d1]
16 0x55b4541fd9f2 cortex-beta(+0x7b89f2) [0x55b4541fd9f2]
17 0x55b45420b863 cortex-beta(+0x7c6863) [0x55b45420b863]
18 0x55b454209c30 cortex-beta(+0x7c4c30) [0x55b454209c30]
19 0x55b454207b1b cortex-beta(+0x7c2b1b) [0x55b454207b1b]
20 0x55b4541fd28a cortex-beta(+0x7b828a) [0x55b4541fd28a]
21 0x55b4541fcfd9 cortex-beta(+0x7b7fd9) [0x55b4541fcfd9]
22 0x55b4541fc619 cortex-beta(+0x7b7619) [0x55b4541fc619]
23 0x55b4541fbcae cortex-beta(+0x7b6cae) [0x55b4541fbcae]
24 0x55b45420c952 cortex-beta(+0x7c7952) [0x55b45420c952]
25 0x55b45420b024 cortex-beta(+0x7c6024) [0x55b45420b024]
26 0x55b45420911a cortex-beta(+0x7c411a) [0x55b45420911a]
27 0x55b4548cc2ef cortex-beta(+0xe872ef) [0x55b4548cc2ef]
28 0x55b4548be629 cortex-beta(+0xe79629) [0x55b4548be629]
29 0x55b4548bd12d cortex-beta(+0xe7812d) [0x55b4548bd12d]
30 0x55b4548c838c cortex-beta(+0xe8338c) [0x55b4548c838c]
31 0x55b4548c655a cortex-beta(+0xe8155a) [0x55b4548c655a]
32 0x55b4548c51ef cortex-beta(+0xe801ef) [0x55b4548c51ef]
33 0x55b453bbd11a cortex-beta(+0x17811a) [0x55b453bbd11a]
34 0x55b4548b6969 cortex-beta(+0xe71969) [0x55b4548b6969]
35 0x55b4548b6830 cortex-beta(+0xe71830) [0x55b4548b6830]
36 0x55b45489754f cortex-beta(+0xe5254f) [0x55b45489754f]
37 0x55b45489af28 cortex-beta(+0xe55f28) [0x55b45489af28]
38 0x55b45489a9eb cortex-beta(+0xe559eb) [0x55b45489a9eb]
39 0x55b45489baba cortex-beta(+0xe56aba) [0x55b45489baba]
40 0x55b45489ba7d cortex-beta(+0xe56a7d) [0x55b45489ba7d]
41 0x55b45489ba2a cortex-beta(+0xe56a2a) [0x55b45489ba2a]
42 0x55b45489b9fe cortex-beta(+0xe569fe) [0x55b45489b9fe]
43 0x55b45489b9e2 cortex-beta(+0xe569e2) [0x55b45489b9e2]
44 0x7f9ee50f5253 /lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f9ee50f5253]
45 0x7f9ee4d7bac3 /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f9ee4d7bac3]
46 0x7f9ee4e0d850 /lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7f9ee4e0d850] - tensorrt-llm_engine.cc:439
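
The `operation not supported` error from `cudaDeviceGetDefaultMemPool` typically means the device, driver, or platform does not expose CUDA stream-ordered memory pools, which `BufferManager::initMemoryPool` depends on. A minimal standalone check of that capability, assuming a standard CUDA runtime installation (this helper is illustrative and not part of cortex):

```cpp
// check_mempool.cu -- illustrative diagnostic, not part of cortex.
// Queries whether the device supports stream-ordered memory pools,
// the capability behind cudaDeviceGetDefaultMemPool.
#include <cstdio>
#include <cuda_runtime_api.h>

int main() {
  int device = 0;
  int poolsSupported = 0;
  // Where this attribute is 0 (e.g. some virtualized or older-driver
  // setups), cudaDeviceGetDefaultMemPool fails with
  // cudaErrorNotSupported -- the "operation not supported" seen above.
  cudaDeviceGetAttribute(&poolsSupported, cudaDevAttrMemoryPoolsSupported, device);
  std::printf("memory pools supported: %d\n", poolsSupported);
  if (poolsSupported) {
    cudaMemPool_t pool;
    cudaError_t err = cudaDeviceGetDefaultMemPool(&pool, device);
    std::printf("cudaDeviceGetDefaultMemPool: %s\n", cudaGetErrorString(err));
  }
  return 0;
}
```

The remainder of the log, from the subsequent load attempt, follows.
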
[TensorRT-LLM][INFO] Engine version 0.11.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Parameter layer_types cannot be read from json:
[TensorRT-LLM][INFO] [json.exception.out_of_range.403] key 'layer_types' not found
[TensorRT-LLM][INFO] Parameter has_position_embedding cannot be read from json:
[TensorRT-LLM][INFO] [json.exception.out_of_range.403] key 'has_position_embedding' not found
[TensorRT-LLM][INFO] Parameter has_token_type_embedding cannot be read from json:
[TensorRT-LLM][INFO] [json.exception.out_of_range.403] key 'has_token_type_embedding' not found
[TensorRT-LLM][INFO] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][INFO] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 16
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 16
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 32768
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 32768
[TensorRT-LLM][INFO] TRTGptModel computeContextLogits: 0
[TensorRT-LLM][INFO] TRTGptModel computeGenerationLogits: 0
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 8192 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Loaded engine size: 3958 MiB
[TensorRT-LLM][ERROR] Error Code: 6: The engine plan file is generated on an incompatible device, expecting compute 5.2 got compute 8.9, please rebuild.
[TensorRT-LLM][ERROR] [engine.cpp::deserializeEngine::1233] Error Code 2: Internal Error (Assertion engine->deserialize(start, size, allocator, runtime) failed. )
terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
what(): [TensorRT-LLM][ERROR] CUDA runtime error in cudaDeviceGetDefaultMemPool(&memPool, device): operation not supported (/home/runner/actions-runner/_work/cortex.tensorrt-llm/cortex.tensorrt-llm/cpp/tensorrt_llm/runtime/bufferManager.cpp:258)
1 0x7f9ee02bfdb3 void tensorrt_llm::common::check(cudaError, char const*, char const*, int) + 147
2 0x7f9e19d56eb9 tensorrt_llm::runtime::BufferManager::memoryPoolTrimTo(int, unsigned long) + 73
3 0x7f9e180ac791 /home/ubuntu/cortexcpp-beta/engines/cortex.tensorrt-llm/libtensorrt_llm.so(+0x73a791) [0x7f9e180ac791]
4 0x7f9e1a086579 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TrtGptModelInflightBatching(std::shared_ptr<nvinfer1::ILogger>, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::runtime::RawEngine const&, bool, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 937
5 0x7f9e1a0a965b tensorrt_llm::executor::Executor::Impl::createModel(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::executor::ExecutorConfig const&) + 443
6 0x7f9e1a0aee60 tensorrt_llm::executor::Executor::Impl::loadModel(std::optional<std::filesystem::__cxx11::path> const&, std::optional<std::vector<unsigned char, std::allocator<unsigned char> > > const&, tensorrt_llm::runtime::GptJsonConfig const&, tensorrt_llm::executor::ExecutorConfig const&, bool) + 1408
7 0x7f9e1a0aff6a tensorrt_llm::executor::Executor::Impl::Impl(std::filesystem::__cxx11::path const&, std::optional<std::filesystem::__cxx11::path> const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 1978
8 0x7f9e1a0a4cee tensorrt_llm::executor::Executor::Executor(std::filesystem::__cxx11::path const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 62
9 0x7f9ee02c147a std::_MakeUniq<tensorrt_llm::executor::Executor>::__single_object std::make_unique<tensorrt_llm::executor::Executor, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig&>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, tensorrt_llm::executor::ModelType&&, tensorrt_llm::executor::ExecutorConfig&) + 138
10 0x7f9ee02a9f57 /home/ubuntu/cortexcpp-beta/engines/cortex.tensorrt-llm/libengine.so(+0x88f57) [0x7f9ee02a9f57]
11 0x55b453c9b0a6 cortex-beta(+0x2560a6) [0x55b453c9b0a6]
12 0x55b453cabe5c cortex-beta(+0x266e5c) [0x55b453cabe5c]
13 0x55b453cabaef cortex-beta(+0x266aef) [0x55b453cabaef]
14 0x55b453cab858 cortex-beta(+0x266858) [0x55b453cab858]
15 0x55b45428f7d1 cortex-beta(+0x84a7d1) [0x55b45428f7d1]
16 0x55b4541fd9f2 cortex-beta(+0x7b89f2) [0x55b4541fd9f2]
17 0x55b45420b863 cortex-beta(+0x7c6863) [0x55b45420b863]
18 0x55b454209c30 cortex-beta(+0x7c4c30) [0x55b454209c30]
19 0x55b454207b1b cortex-beta(+0x7c2b1b) [0x55b454207b1b]
20 0x55b4541fd28a cortex-beta(+0x7b828a) [0x55b4541fd28a]
21 0x55b4541fcfd9 cortex-beta(+0x7b7fd9) [0x55b4541fcfd9]
22 0x55b4541fc619 cortex-beta(+0x7b7619) [0x55b4541fc619]
23 0x55b4541fbcae cortex-beta(+0x7b6cae) [0x55b4541fbcae]
24 0x55b45420c952 cortex-beta(+0x7c7952) [0x55b45420c952]
25 0x55b45420b024 cortex-beta(+0x7c6024) [0x55b45420b024]
26 0x55b45420911a cortex-beta(+0x7c411a) [0x55b45420911a]
27 0x55b4548cc2ef cortex-beta(+0xe872ef) [0x55b4548cc2ef]
28 0x55b4548be629 cortex-beta(+0xe79629) [0x55b4548be629]
29 0x55b4548bd12d cortex-beta(+0xe7812d) [0x55b4548bd12d]
30 0x55b4548c838c cortex-beta(+0xe8338c) [0x55b4548c838c]
31 0x55b4548c655a cortex-beta(+0xe8155a) [0x55b4548c655a]
32 0x55b4548c51ef cortex-beta(+0xe801ef) [0x55b4548c51ef]
33 0x55b453bbd11a cortex-beta(+0x17811a) [0x55b453bbd11a]
34 0x55b4548b6969 cortex-beta(+0xe71969) [0x55b4548b6969]
35 0x55b4548b6830 cortex-beta(+0xe71830) [0x55b4548b6830]
36 0x55b45489754f cortex-beta(+0xe5254f) [0x55b45489754f]
37 0x55b45489af28 cortex-beta(+0xe55f28) [0x55b45489af28]
38 0x55b45489a9eb cortex-beta(+0xe559eb) [0x55b45489a9eb]
39 0x55b45489baba cortex-beta(+0xe56aba) [0x55b45489baba]
40 0x55b45489ba7d cortex-beta(+0xe56a7d) [0x55b45489ba7d]
41 0x55b45489ba2a cortex-beta(+0xe56a2a) [0x55b45489ba2a]
42 0x55b45489b9fe cortex-beta(+0xe569fe) [0x55b45489b9fe]
43 0x55b45489b9e2 cortex-beta(+0xe569e2) [0x55b45489b9e2]
44 0x7f9ee50f5253 /lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f9ee50f5253]
45 0x7f9ee4d7bac3 /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f9ee4d7bac3]
46 0x7f9ee4e0d850 /lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7f9ee4e0d850]

MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.

HTTP error: Failed to read connection
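
The second failure points at the likely root cause: TensorRT reports that the serialized engine plan was built for compute capability 5.2 while the GPU reports 8.9 (Ada), so deserialization is refused and the engine must be rebuilt for the device it runs on. To compare the local GPU's compute capability against what the engine expects, a small sketch using the CUDA runtime API (again illustrative, not part of cortex):

```cpp
// check_sm.cu -- illustrative diagnostic, not part of cortex.
// Prints each GPU's compute capability so it can be compared with the
// SM version the TensorRT engine plan was serialized for (8.9 = Ada).
#include <cstdio>
#include <cuda_runtime_api.h>

int main() {
  int count = 0;
  cudaGetDeviceCount(&count);
  for (int d = 0; d < count; ++d) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, d);
    std::printf("GPU %d: %s, compute %d.%d\n", d, prop.name, prop.major, prop.minor);
  }
  return 0;
}
```

Both sketches compile with `nvcc`; on recent drivers, `nvidia-smi --query-gpu=compute_cap --format=csv` reports the same information without any code.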

What is your OS?

  • [ ] MacOS
  • [ ] Windows
  • [x] Linux

What engine are you running?

  • [ ] cortex.llamacpp (default)
  • [x] cortex.tensorrt-llm (Nvidia GPUs)
  • [ ] cortex.onnx (NPUs, DirectML)
