
bug: can't swap back to llamacpp after using trtllm #2358

Closed
0xSage opened this issue Mar 14, 2024 · 7 comments
0xSage (Contributor) commented Mar 14, 2024

Describe the bug
After using a TensorRT-LLM model, the app can't swap back to a llama.cpp (GGUF) model; it gets stuck starting the model.

v0.4.8-322
Windows, AMD Ryzen CPU with an NVIDIA RTX 4070

Steps to reproduce

  1. Use Llamacorn
  2. Wait for response to complete
  3. Create a new thread
  4. Get a brief popup: "Unable to stop inference. The model does not support stopping inference"
  5. Select another model, e.g. openhermes
  6. App is stuck in "starting model ___" indefinitely
  7. Ctrl+R refresh doesn't help
  8. Restarting the app (killing all processes) does not help; the app remains in a corrupted state

Additional things I've tried:

  • Stopping the model in System Monitor doesn't seem to do anything
  • Starting a new thread (when inference has completed) shows the same popup (unable to stop inference)
  • Starting a new thread seems to keep the cache from the previous thread. It's all just one context right now?


0xSage added the "P1: important" and "type: bug" labels Mar 14, 2024
dan-homebrew (Contributor) commented Mar 14, 2024

I was able to reproduce this, but I don't think it should block the 0.4.9 release. We can handle this next sprint.

[1710424192] [D:\a\nitro\nitro\controllers\llamaCPP.h: 1585][llama_server_context::update_slots] slot 0 released (1638 tokens in cache)

2024-03-14T13:49:52.993Z [NITRO]::Debug: 20240314 13:49:52.991000 UTC 26240 INFO Wait for task to be released:6 - llamaCPP.cc:405

20240314 13:49:52.991000 UTC 43012 DEBUG [makeHeaderString] send stream with transfer-encoding chunked - HttpResponseImpl.cc:535

[1710424192] [D:\a\nitro\nitro\controllers\llamaCPP.h: 882][llama_server_context::launch_slot_with_data] slot 0 is processing [task id: 6]

2024-03-14T13:49:53.000Z [NITRO]::Debug: [1710424192] [D:\a\nitro\nitro\controllers\llamaCPP.h: 1722][llama_server_context::update_slots] slot 0 : kv cache rm - [0, end)

2024-03-14T13:50:10.996Z [NITRO]::Debug: [1710424210] [D:\a\nitro\nitro\controllers\llamaCPP.h: 475][llama_client_slot::print_timings]

[1710424210] [D:\a\nitro\nitro\controllers\llamaCPP.h: 480][llama_client_slot::print_timings] print_timings: prompt eval time = 14552.11 ms / 1682 tokens ( 8.65 ms per token, 115.58 tokens per second)

[1710424210] [D:\a\nitro\nitro\controllers\llamaCPP.h: 485][llama_client_slot::print_timings] print_timings: eval time = 3452.00 ms / 94 runs ( 36.72 ms per token, 27.23 tokens per second)

[1710424210] [D:\a\nitro\nitro\controllers\llamaCPP.h: 487][llama_client_slot::print_timings] print_timings: total time = 18004.10 ms

[1710424210] [D:\a\nitro\nitro\controllers\llamaCPP.h: 1585][llama_server_context::update_slots] slot 0 released (1777 tokens in cache)

2024-03-14T13:50:23.932Z [NITRO]::Debug: Request to kill Nitro

2024-03-14T13:50:23.935Z [NITRO]::Debug: 20240314 13:50:10.993000 UTC 43012 INFO reached result stop - llamaCPP.cc:365

20240314 13:50:10.993000 UTC 43012 INFO End of result - llamaCPP.cc:338

20240314 13:50:11.068000 UTC 26240 INFO Task completed, release it - llamaCPP.cc:408

20240314 13:50:23.934000 UTC 2088 INFO Program is exitting, goodbye! - processManager.cc:8

20240314 13:50:23.934000 UTC 2088 INFO changed to false - llamaCPP.cc:680

[1710424223] [D:\a\nitro\nitro\controllers\llamaCPP.h: 1585][llama_server_context::update_slots] slot 0 released (1777 tokens in cache)

2024-03-14T13:50:24.953Z [TENSORRT_LLM_NITRO]::Debug:Request to kill engine

2024-03-14T13:50:27.489Z [NITRO]::Debug: Nitro process is terminated

2024-03-14T13:50:27.490Z [TENSORRT_LLM_NITRO]::Debug:Engine process is terminated

2024-03-14T13:50:27.490Z [TENSORRT_LLM_NITRO]::Debug:Spawning engine subprocess...

2024-03-14T13:50:27.490Z [TENSORRT_LLM_NITRO]::Debug:Spawn nitro at path: C:\Users\dan\jan\extensions\@janhq\tensorrt-llm-extension\dist\bin\nitro.exe, and args: 1,127.0.0.1,3928

2024-03-14T13:50:27.495Z [NITRO]::Debug: Nitro exited with code: 3221226505
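(Annotation, not part of the original log: exit code 3221226505 is the unsigned form of the Windows NTSTATUS 0xC0000409, STATUS_STACK_BUFFER_OVERRUN, which is also what a fail-fast abort reports. A one-liner to check:)

```python
# Decode the Windows exit code from the log line above
print(hex(3221226505))  # -> 0xc0000409 (STATUS_STACK_BUFFER_OVERRUN / fail-fast)
```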

2024-03-14T13:50:27.682Z [TENSORRT_LLM_NITRO]::Debug: [Nitro ASCII-art startup banner; ANSI escape sequences omitted]

2024-03-14T13:50:27.808Z [TENSORRT_LLM_NITRO]::Debug:Engine is ready

2024-03-14T13:50:27.809Z [TENSORRT_LLM_NITRO]::Debug:Loading model with params {"engine_path":"C:\\Users\\dan\\jan\\models\\llamacorn-1.1b-chat-fp16","ctx_len":2048}

2024-03-14T13:50:28.731Z [TENSORRT_LLM_NITRO]::Debug: [remainder of the ASCII-art banner omitted]

20240314 13:50:27.681000 UTC 10304 INFO Nitro version: undefined - main.cc:57

20240314 13:50:27.681000 UTC 10304 INFO Server started, listening at: 127.0.0.1:3928 - main.cc:59

20240314 13:50:27.681000 UTC 10304 INFO Please load your model - main.cc:60

20240314 13:50:27.681000 UTC 10304 INFO Number of thread is:1 - main.cc:68

[TensorRT-LLM][INFO] Set logger level by INFO

2024-03-14T13:50:28.746Z [TENSORRT_LLM_NITRO]::Debug:20240314 13:50:28.743000 UTC 12044 INFO Successully loaded the tokenizer - tensorrtllm.h:53

20240314 13:50:28.743000 UTC 12044 INFO Loaded tokenizer - tensorrtllm.cc:354

[TensorRT-LLM][INFO] Engine version 0.8.0 found in the config file, assuming engine(s) built by new builder API.

[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be array, but is null

[TensorRT-LLM][WARNING] Optional value for parameter lora_target_modules will not be set.

[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:

[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found

[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null

[TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set.

[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null

[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.

[TensorRT-LLM][INFO] Initializing MPI with thread mode 1

2024-03-14T13:50:28.753Z [TENSORRT_LLM_NITRO]::Debug:[TensorRT-LLM][INFO] MPI size: 1, rank: 0

2024-03-14T13:50:31.055Z [TENSORRT_LLM_NITRO]::Debug:20240314 13:50:28.753000 UTC 12044 INFO Engine Path : C:\Users\dan\jan\models\llamacorn-1.1b-chat-fp16\rank0.engine - tensorrtllm.cc:361

[TensorRT-LLM][INFO] Loaded engine size: 2100 MiB

2024-03-14T13:50:31.063Z [TENSORRT_LLM_NITRO]::Debug:[TensorRT-LLM][WARNING] Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors.

2024-03-14T13:50:31.537Z [TENSORRT_LLM_NITRO]::Debug:[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 20760, GPU 3251 (MiB)

2024-03-14T13:50:31.550Z [TENSORRT_LLM_NITRO]::Debug:[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +6, GPU +8, now: CPU 20766, GPU 3259 (MiB)

2024-03-14T13:50:31.567Z [TENSORRT_LLM_NITRO]::Debug:[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +2098, now: CPU 0, GPU 2098 (MiB)

2024-03-14T13:50:31.574Z [TENSORRT_LLM_NITRO]::Debug:[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 20813, GPU 3353 (MiB)

2024-03-14T13:50:31.576Z [TENSORRT_LLM_NITRO]::Debug:[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +6, GPU +8, now: CPU 20819, GPU 3361 (MiB)

2024-03-14T13:50:31.688Z [TENSORRT_LLM_NITRO]::Debug:[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 2098 (MiB)

2024-03-14T13:50:31.694Z [TENSORRT_LLM_NITRO]::Debug:[TensorRT-LLM][INFO] Allocate 255328256 bytes for k/v cache.

[TensorRT-LLM][INFO] Using 201984 tokens in paged KV cache.
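(Side note, ours: the two figures above are mutually consistent at roughly 1.26 KB per cached token.)

```python
# Back-of-the-envelope check on the two log lines above
kv_cache_bytes = 255_328_256
kv_cache_tokens = 201_984
print(kv_cache_bytes / kv_cache_tokens)  # ~1264.1 bytes per paged-KV-cache token
```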

2024-03-14T13:50:31.822Z [TENSORRT_LLM_NITRO]::Debug:Load model success with response {}

2024-03-14T13:50:31.916Z [TENSORRT_LLM_NITRO]::Debug:20240314 13:50:31.856000 UTC 12044 DEBUG [makeHeaderString] send stream with transfer-encoding chunked - HttpResponseImpl.cc:535

[TensorRT-LLM][ERROR] 3: [executionContext.cpp::nvinfer1::rt::ExecutionContext::setInputShape::2309] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::nvinfer1::rt::ExecutionContext::setInputShape::2309, condition: satisfyProfile Runtime dimension does not satisfy any optimization profile.)

job aborted:

[ranks] message

[0] application aborted

aborting MPI_COMM_WORLD (comm=0x44000000), error 1, comm rank 0
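For readers hitting this error for the first time: a TensorRT engine is built with one or more optimization profiles that bound each dynamic input's shape, and a runtime request whose shape falls outside every profile aborts exactly like this. A minimal sketch of how those bounds are declared at build time (TensorRT Python API; the input name "input_ids" and the dimensions are hypothetical, not taken from this engine):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
config = builder.create_builder_config()

# Hypothetical bounds: batch size fixed at 1, sequence length 1..2048 tokens.
profile = builder.create_optimization_profile()
profile.set_shape("input_ids", min=(1, 1), opt=(1, 512), max=(1, 2048))
config.add_optimization_profile(profile)

# An engine built with this config rejects any setInputShape() call whose
# dimensions fall outside [min, max] with "Runtime dimension does not
# satisfy any optimization profile" -- the error in the log above.
```

One plausible trigger here is a request shaped for the previous llama.cpp session being routed to the freshly loaded TensorRT engine.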

Van-QA added this to the v0.4.9 milestone Mar 14, 2024
0xSage (Contributor, Author) commented Mar 15, 2024

Still experiencing this on 324.

Also, interesting: when I switch from trt to gguf, there is a brief "starting [the old trt model]" loading state (for like 1 second), before getting stuck in the "starting [new gguf model]" loading state.

If it's a UI glitch, we also need to fix that.

louis-jan (Contributor) commented:

Please attach the app.log, @0xSage 🙏

louis-jan (Contributor) commented:

Root cause: the nitro binary is missing due to a failed app update.
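A minimal sketch of the kind of pre-spawn guard that would surface this early (illustrative Python, not Jan's actual code; the path and arguments are copied from the log above):

```python
import os
import subprocess

# Taken from the "Spawn nitro at path" log line above.
nitro_path = r"C:\Users\dan\jan\extensions\@janhq\tensorrt-llm-extension\dist\bin\nitro.exe"

# Fail loudly if a partial app update left the binary missing, instead of
# hanging indefinitely in the "starting model" state.
if not os.path.isfile(nitro_path):
    raise FileNotFoundError(f"nitro binary missing (failed update?): {nitro_path}")

subprocess.Popen([nitro_path, "1", "127.0.0.1", "3928"])
```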

Van-QA (Contributor) commented Mar 18, 2024

The main issue is resolved and working fine in Jan v0.4.8-326 ✅. As for the Nitro issue, we will resolve it in this follow-up ticket: janhq/cortex.tensorrt-llm#27

Van-QA closed this as completed Mar 18, 2024
Van-QA (Contributor) commented Mar 18, 2024

> Still experiencing this on 324.
>
> Also, interesting: when I switch from trt to gguf, there is a brief "starting [the old trt model]" loading state (for like 1 second), before getting stuck in the "starting [new gguf model]" loading state.
>
> If it's a UI glitch, we also need to fix that.

@louis-jan Related to the UI glitch: after observation, we need to correct the behavior of the status message when switching between models (see the sketch after this list):

  • Expected ✅: stopping > starting > generating
  • Current ❌: generating > stopping > starting > generating
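A minimal sketch of the intended sequencing (illustrative Python, not Jan's actual implementation):

```python
from enum import Enum, auto

class ModelStatus(Enum):
    STOPPING = auto()
    STARTING = auto()
    GENERATING = auto()

def switch_model(stop_old, start_new, set_status):
    """Walk stopping -> starting -> generating, in that order.

    The buggy flow re-enters GENERATING for the old model before stopping
    it, producing the extra leading "generating" state listed above.
    """
    set_status(ModelStatus.STOPPING)
    stop_old()        # fully unload the previous engine first
    set_status(ModelStatus.STARTING)
    start_new()       # spawn/load the replacement engine
    set_status(ModelStatus.GENERATING)
```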

Van-QA (Contributor) commented Mar 18, 2024

Tested and looking good as of Jan v0.4.8-328 ✅

Van-QA closed this as completed Mar 18, 2024