
Lorax Hanging in production #149

Closed · 1 of 4 tasks
karlbernard2 opened this issue Dec 22, 2023 · 9 comments · Fixed by #156
Assignees
Labels
bug Something isn't working

Comments

@karlbernard2

System Info

ghcr.io/predibase/lorax:latest
Running within Kubernetes on H100

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

When we put the instance into production and it receives simultaneous requests for different adapters, it just hangs: /generate and /health stop answering, but /info and /docs remain available.

No errors are displayed in the logs:
(screenshot: CleanShot 2023-12-22 at 16 19 41)

I'm not sure of the best way to diagnose this, but it looks like it has trouble fetching multiple adapters in parallel while processing queued requests at the same time.
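Roughly, the load pattern is a burst of concurrent requests like the following (hypothetical prompts and adapter names; assuming the standard /generate request body with adapter_id under parameters):

# Sketch of the request pattern only -- host, prompt, and adapter names are placeholders.
HOST=http://localhost:8001
for adapter in org/adapter_a org/adapter_b org/adapter_c org/adapter_d; do
  curl -s "$HOST/generate" \
    -H "Content-Type: application/json" \
    -d "{\"inputs\": \"Hello\", \"parameters\": {\"adapter_id\": \"$adapter\", \"max_new_tokens\": 64}}" &
done
wait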

Expected behavior

Should handle live requests for multiple adapters

@tgaddair
Contributor

Hey @karlbernard2, thanks for reporting. It sounds like there's a deadlock occurring here that may be triggered under very specific conditions (requests coming in at just the wrong time). Can you share any additional details about your setup (args to lorax-launcher, for example) that could help with reproducing the error?

One thing that stands out from the logs you provided is that the adapter NextDayAI/xtraspicy1.0_13b_r32_800 was loaded, successfully processed a request, then offloaded, then loaded back, but never successfully processed any other requests. It's curious that it was offloaded at all, as it looks like only two adapters were loaded, while by default we will allow up to 128 to be loaded before doing any offloading. So whatever is causing the deadlock may be related to that behavior.

I'll try and take a closer look, but if there's anything you can provide to help me repro that would be helpful.

@tgaddair tgaddair self-assigned this Dec 22, 2023
@tgaddair tgaddair added the bug Something isn't working label Dec 22, 2023
@tgaddair
Contributor

The fact that the /health endpoint is unresponsive but the /info endpoint works would suggest that there's an issue with the Python server, rather than the router. It's possible that the Python server is stuck on some operation.

Something you could try:

  • Make sure your container is running in privileged mode by adding SYS_PTRACE to the security context of the container as shown here (see the sketch below for roughly what this looks like).
  • SSH into the pod with kubectl exec -it <pod_name> -- /bin/bash
  • Install py-spy so you can get a backtrace from the Python server: pip install py-spy
  • Find the Python server process: ps aux | grep python
  • Run py-spy on the Python server to obtain the backtrace: sudo py-spy dump -p <pid>

If you're able to run that on one of the hung pods, that would be very helpful for debugging the error.
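Put together, the sequence looks roughly like this (pod name and PID are placeholders; the exact securityContext location is an assumption):

# Assumed securityContext addition on the lorax container (needed for py-spy to attach):
#   securityContext:
#     capabilities:
#       add: ["SYS_PTRACE"]
kubectl exec -it <pod_name> -- /bin/bash
# then, inside the pod:
pip install py-spy
ps aux | grep python        # note the PID of the lorax Python server process
py-spy dump --pid <pid>     # prints the backtrace of every thread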

@karlbernard2
Author

Thanks for the detailed instructions, I'll try to do that.

Here's how I launched the container:
containers:
  - name: lorax-container
    image: ghcr.io/predibase/lorax:latest
    ports:
      - containerPort: 8001
    env:
      - name: HUGGING_FACE_HUB_TOKEN
        value: hf_secret
      - name: PORT
        value: "8001"
      - name: ROPE_SCALING
        value: "dynamic"
      - name: ROPE_FACTOR
        value: "2.0"
    args:
      - "--max-input-length=7900"
      - "--max-total-tokens=8192"
      - "--max-batch-prefill-tokens=8192"
      - "--model-id=NextDayAI/extraspicy"

@karlbernard2
Author

karlbernard2 commented Dec 23, 2023

@tgaddair My first attempt to replicate didn't have the same issue (although earlier today I got it all the time, so I will try more).

However, since you mentioned offloading that shouldn't happen, you might find these logs strange:

2023-12-23T03:04:06.827779Z INFO HTTP request{otel.name=POST /generate http.client_ip= http.flavor=1.1 http.host=inference.spicychat.ai:7001 http.method=POST http.route=/generate http.scheme=HTTP http.target=/generate http.user_agent=axios/1.4.0 otel.kind=server trace_id=1b65c393d478941d6b16797446fc1519}:generate{parameters=GenerateParameters { adapter_id: Some("NextDayAI/xtraspicy1.0_13b_r32_720_adapter"), adapter_source: None, api_token: None, best_of: None, temperature: Some(0.7), repetition_penalty: Some(1.1), top_k: Some(90), top_p: Some(0.7), typical_p: None, do_sample: true, max_new_tokens: 180, return_full_text: None, truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None } total_time="3.414291187s" validation_time="3.11ms" queue_time="46.871µs" inference_time="3.41113458s" time_per_token="18.950747ms" seed="Some(18257989878521111275)"}: lorax_router::server: router/src/server.rs:298: Success
2023-12-23T03:04:07.058713Z INFO lorax_router::loader: router/src/loader.rs:241: adapter __base_model__ offloaded
2023-12-23T03:04:07.058731Z INFO lorax_router::queue: router/src/queue.rs:135: set adapter __base_model__ status to Downloaded
2023-12-23T03:04:07.095727Z INFO lorax_router::loader: router/src/loader.rs:197: adapter NextDayAI/xtraspicy1.0_13b_r32_760_adapter loaded
2023-12-23T03:04:07.095745Z INFO lorax_router::queue: router/src/queue.rs:135: set adapter NextDayAI/xtraspicy1.0_13b_r32_760_adapter status to Ready
2023-12-23T03:04:07.588268Z INFO HTTP request{otel.name=POST /generate http.client_ip= http.flavor=1.1 http.host=inference.spicychat.ai:7001 http.method=POST http.route=/generate http.scheme=HTTP http.target=/generate http.user_agent=axios/1.4.0 otel.kind=server trace_id=f94f4f52e417b624f40329c484ee954f}:generate{parameters=GenerateParameters { adapter_id: Some("NextDayAI/xtraspicy1.0_13b_r32_760_adapter"), adapter_source: None, api_token: None, best_of: None, temperature: Some(0.7), repetition_penalty: Some(1.1), top_k: Some(90), top_p: Some(0.7), typical_p: None, do_sample: true, max_new_tokens: 180, return_full_text: None, truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None } total_time="541.9253ms" validation_time="2.590788ms" queue_time="64.482779ms" inference_time="474.851989ms" time_per_token="23.742599ms" seed="Some(14941048975292732004)"}: lorax_router::server: router/src/server.rs:298: Success
2023-12-23T03:04:07.625223Z INFO HTTP request{otel.name=POST /generate http.client_ip= http.flavor=1.1 http.host=inference.spicychat.ai:7001 http.method=POST http.route=/generate http.scheme=HTTP http.target=/generate http.user_agent=axios/1.4.0 otel.kind=server trace_id=676504f7982ff96d058121f59d961dd4}:generate{parameters=GenerateParameters { adapter_id: Some("NextDayAI/xtraspicy1.0_13b_r32_800_adapter"), adapter_source: None, api_token: None, best_of: None, temperature: Some(0.7), repetition_penalty: Some(1.1), top_k: Some(90), top_p: Some(0.7), typical_p: None, do_sample: true, max_new_tokens: 300, return_full_text: None, stop: ["\nEva:", "\nShizuka:", "\n###"], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None } total_time="1.211921872s" validation_time="778.706µs" queue_time="13.944244ms" inference_time="1.197199151s" time_per_token="21.003493ms" seed="Some(5853084379632762489)"}: lorax_router::server: router/src/server.rs:298: Success
2023-12-23T03:04:09.218011Z INFO lorax_router::loader: router/src/loader.rs:241: adapter NextDayAI/xtraspicy1.0_13b_r32_800_adapter offloaded
2023-12-23T03:04:09.218033Z INFO lorax_router::queue: router/src/queue.rs:135: set adapter NextDayAI/xtraspicy1.0_13b_r32_800_adapter status to Downloaded
2023-12-23T03:04:09.218716Z INFO lorax_router::loader: router/src/loader.rs:241: adapter NextDayAI/xtraspicy1.0_13b_r32_720_adapter offloaded
2023-12-23T03:04:09.218739Z INFO lorax_router::queue: router/src/queue.rs:135: set adapter NextDayAI/xtraspicy1.0_13b_r32_720_adapter status to Downloaded
2023-12-23T03:04:09.239236Z INFO lorax_router::loader: router/src/loader.rs:197: adapter NextDayAI/xtraspicy1.0_13b_r32_400_adapter loaded
2023-12-23T03:04:09.239269Z INFO lorax_router::queue: router/src/queue.rs:135: set adapter NextDayAI/xtraspicy1.0_13b_r32_400_adapter status to Ready
2023-12-23T03:04:09.258709Z INFO lorax_router::loader: router/src/loader.rs:197: adapter NextDayAI/xtraspicy1.0_13b_r32_720_adapter loaded
2023-12-23T03:04:09.258724Z INFO lorax_router::queue: router/src/queue.rs:135: set adapter NextDayAI/xtraspicy1.0_13b_r32_720_adapter status to Ready
2023-12-23T03:04:09.701608Z INFO HTTP request{otel.name=POST /generate http.client_ip= http.flavor=1.1 http.host=inference.spicychat.ai:7001 http.method=POST http.route=/generate http.scheme=HTTP http.target=/generate http.user_agent=axios/1.4.0 otel.kind=server trace_id=a9611795aa4b0336970442689421c739}:generate{parameters=GenerateParameters { adapter_id: Some("NextDayAI/xtraspicy1.0_13b_r32_400_adapter"), adapter_source: None, api_token: None, best_of: None, temperature: Some(0.7), repetition_penalty: Some(1.1), top_k: Some(90), top_p: Some(0.7), typical_p: None, do_sample: true, max_new_tokens: 180, return_full_text: None, stop: ["\nYour roommate Amber :", "\nWinston:", "\n###"], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None } total_time="489.291069ms" validation_time="4.884384ms" queue_time="22.199583ms" inference_time="462.207337ms" time_per_token="24.326701ms" seed="Some(16751434335624526013)"}: lorax_router::server: router/src/server.rs:298: Success

Screenshot might be easier to read:
(screenshot: CleanShot 2023-12-22 at 22 09 05@2x)

We are only dealing with 4 adapters

@karlbernard2
Author

Here's the trace:

(screenshot: CleanShot 2023-12-22 at 22 41 28@2x)

@tgaddair
Contributor

Thanks for the backtrace @karlbernard2, this is very helpful!

Definitely looks like the hanging is occurring in the SGMV kernel.

In the short term, you can try disabling SGMV with an environment variable: DISABLE_SGMV=1. That's not a great long-term solution since SGMV is very fast when you have lots of adapters, but it should at least unblock you while I try to repro the issue, and the performance hit shouldn't be very noticeable with fewer than 10 adapters.
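For example, assuming the pods come from a Deployment (the name below is a placeholder), the variable can be set with kubectl or added to the manifest shared above:

# Hypothetical deployment name; adjust to your setup.
kubectl set env deployment/lorax DISABLE_SGMV=1
# Or add an extra entry under the container's env list in the manifest:
#   - name: DISABLE_SGMV
#     value: "1"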

I'll see if I can repro this behavior with the adapters you're using here.

@tgaddair
Contributor

Hey @karlbernard2, an update on this: I tried running some stress tests today with a variety of request patterns to try to replicate your setup, but I was unable to trigger the hanging behavior.

Can you share a few more details about your environment:

  • What GPU are you running on?
  • What Nvidia device driver version are you using (from nvidia-smi; see the sketch below)?
  • Is this running on prem or in the cloud? If cloud, which one?
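For the driver version, something like this should work without restarting anything (pod name is a placeholder):

# Query the GPU name and driver version from inside the running pod.
kubectl exec -it <pod_name> -- nvidia-smi --query-gpu=name,driver_version --format=csv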

Thanks.

@karlbernard2
Author

We’re running H100 on NebiusAI Kubernetes. I’ll have to get back to you on Tuesday with info on drivers.

@tgaddair
Contributor

tgaddair commented Jan 4, 2024

Hey @karlbernard2, I managed to track down the root cause of the deadlock, and it has been fixed in #156.
