
Lorax Hanging in production #149

Closed · 1 of 4 tasks
karlbernard2 opened this issue Dec 22, 2023 · 9 comments · Fixed by #156
Assignees
Labels
bug Something isn't working

Comments

@karlbernard2

System Info

ghcr.io/predibase/lorax:latest
Running within Kubernetes on H100

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

When we put the instance into production and it receives simultaneous requests for different adapters, it just hangs: /generate and /health stop answering, but /info and /docs remain available.

No errors are displayed in the logs:
(screenshot: CleanShot 2023-12-22 at 16 19 41)

I'm not sure of the best way to diagnose this, but it looks like it has trouble fetching multiple adapters in parallel while processing queued requests at the same time.
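Roughly, the load pattern is a burst of concurrent requests like the following (hypothetical prompts and adapter names; assuming the standard /generate request body with adapter_id under parameters):

# Sketch of the request pattern only -- host, prompt, and adapter names are placeholders.
HOST=http://localhost:8001
for adapter in org/adapter_a org/adapter_b org/adapter_c org/adapter_d; do
  curl -s "$HOST/generate" \
    -H "Content-Type: application/json" \
    -d "{\"inputs\": \"Hello\", \"parameters\": {\"adapter_id\": \"$adapter\", \"max_new_tokens\": 64}}" &
done
wait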

Expected behavior

Should handle live requests for multiple adapters

@tgaddair
Contributor

Hey @karlbernard2, thanks for reporting. It sounds like there's a deadlock occurring here that may be triggered under very specific conditions (requests coming in at just the wrong time). Can you share any additional details about your setup (args to lorax-launcher, for example) that could help with reproducing the error?

One thing that stands out from the logs you provided is that the adapter NextDayAI/xtraspicy1.0_13b_r32_800 was loaded, successfully processed a request, then offloaded, then loaded back, but never successfully processed any other requests. It's curious that it was offloaded at all, as it looks like only two adapters were loaded, while by default we will allow up to 128 to be loaded before doing any offloading. So whatever is causing the deadlock may be related to that behavior.

I'll try and take a closer look, but if there's anything you can provide to help me repro that would be helpful.

@tgaddair tgaddair self-assigned this Dec 22, 2023
@tgaddair tgaddair added the bug Something isn't working label Dec 22, 2023
@tgaddair
Contributor

The fact that the /health endpoint is unresponsive but the /info endpoint works would suggest that there's an issue with the Python server, rather than the router. It's possible that the Python server is stuck on some operation.

Something you could try:

  • Make sure your container is running in privileged mode by adding SYS_PTRACE to the security context of the container as shown here (see the sketch below for roughly what this looks like).
  • SSH into the pod with kubectl exec -it <pod_name> -- /bin/bash
  • Install py-spy so you can get a backtrace from the Python server: pip install py-spy
  • Find the Python server process: ps aux | grep python
  • Run py-spy on the Python server to obtain the backtrace: sudo py-spy dump -p <pid>

If you're able to run that on one of the hung pods, that would be very helpful for debugging the error.
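Put together, the sequence looks roughly like this (pod name and PID are placeholders; the exact securityContext location is an assumption):

# Assumed securityContext addition on the lorax container (needed for py-spy to attach):
#   securityContext:
#     capabilities:
#       add: ["SYS_PTRACE"]
kubectl exec -it <pod_name> -- /bin/bash
# then, inside the pod:
pip install py-spy
ps aux | grep python        # note the PID of the lorax Python server process
py-spy dump --pid <pid>     # prints the backtrace of every thread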

@karlbernard2
Author

Thanks for the detailed instructions, I'll try to do that.

Here's how I launched the container:
containers:
  - name: lorax-container
    image: ghcr.io/predibase/lorax:latest
    ports:
      - containerPort: 8001
    env:
      - name: HUGGING_FACE_HUB_TOKEN
        value: hf_secret
      - name: PORT
        value: "8001"
      - name: ROPE_SCALING
        value: "dynamic"
      - name: ROPE_FACTOR
        value: "2.0"
    args:
      - "--max-input-length=7900"
      - "--max-total-tokens=8192"
      - "--max-batch-prefill-tokens=8192"
      - "--model-id=NextDayAI/extraspicy"

@karlbernard2
Author

karlbernard2 commented Dec 23, 2023

@tgaddair My first attempt to replicate didn't have the same issue (although earlier today I got it all the time, so I will try more).

However, since you mentioned offloading that shouldn't happen, you might find these logs strange:

2023-12-23T03:04:06.827779Z INFO HTTP request{otel.name=POST /generate http.client_ip= http.flavor=1.1 http.host=inference.spicychat.ai:7001 http.method=POST http.route=/generate http.scheme=HTTP http.target=/generate http.user_agent=axios/1.4.0 otel.kind=server trace_id=1b65c393d478941d6b16797446fc1519}:generate{parameters=GenerateParameters { adapter_id: Some("NextDayAI/xtraspicy1.0_13b_r32_720_adapter"), adapter_source: None, api_token: None, best_of: None, temperature: Some(0.7), repetition_penalty: Some(1.1), top_k: Some(90), top_p: Some(0.7), typical_p: None, do_sample: true, max_new_tokens: 180, return_full_text: None, truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None } total_time="3.414291187s" validation_time="3.11ms" queue_time="46.871µs" inference_time="3.41113458s" time_per_token="18.950747ms" seed="Some(18257989878521111275)"}: lorax_router::server: router/src/server.rs:298: Success
2023-12-23T03:04:07.058713Z INFO lorax_router::loader: router/src/loader.rs:241: adapter __base_model__ offloaded
2023-12-23T03:04:07.058731Z INFO lorax_router::queue: router/src/queue.rs:135: set adapter __base_model__ status to Downloaded
2023-12-23T03:04:07.095727Z INFO lorax_router::loader: router/src/loader.rs:197: adapter NextDayAI/xtraspicy1.0_13b_r32_760_adapter loaded
2023-12-23T03:04:07.095745Z INFO lorax_router::queue: router/src/queue.rs:135: set adapter NextDayAI/xtraspicy1.0_13b_r32_760_adapter status to Ready
2023-12-23T03:04:07.588268Z INFO HTTP request{otel.name=POST /generate http.client_ip= http.flavor=1.1 http.host=inference.spicychat.ai:7001 http.method=POST http.route=/generate http.scheme=HTTP http.target=/generate http.user_agent=axios/1.4.0 otel.kind=server trace_id=f94f4f52e417b624f40329c484ee954f}:generate{parameters=GenerateParameters { adapter_id: Some("NextDayAI/xtraspicy1.0_13b_r32_760_adapter"), adapter_source: None, api_token: None, best_of: None, temperature: Some(0.7), repetition_penalty: Some(1.1), top_k: Some(90), top_p: Some(0.7), typical_p: None, do_sample: true, max_new_tokens: 180, return_full_text: None, truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None } total_time="541.9253ms" validation_time="2.590788ms" queue_time="64.482779ms" inference_time="474.851989ms" time_per_token="23.742599ms" seed="Some(14941048975292732004)"}: lorax_router::server: router/src/server.rs:298: Success
2023-12-23T03:04:07.625223Z INFO HTTP request{otel.name=POST /generate http.client_ip= http.flavor=1.1 http.host=inference.spicychat.ai:7001 http.method=POST http.route=/generate http.scheme=HTTP http.target=/generate http.user_agent=axios/1.4.0 otel.kind=server trace_id=676504f7982ff96d058121f59d961dd4}:generate{parameters=GenerateParameters { adapter_id: Some("NextDayAI/xtraspicy1.0_13b_r32_800_adapter"), adapter_source: None, api_token: None, best_of: None, temperature: Some(0.7), repetition_penalty: Some(1.1), top_k: Some(90), top_p: Some(0.7), typical_p: None, do_sample: true, max_new_tokens: 300, return_full_text: None, stop: ["\nEva:", "\nShizuka:", "\n###"], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None } total_time="1.211921872s" validation_time="778.706µs" queue_time="13.944244ms" inference_time="1.197199151s" time_per_token="21.003493ms" seed="Some(5853084379632762489)"}: lorax_router::server: router/src/server.rs:298: Success
2023-12-23T03:04:09.218011Z INFO lorax_router::loader: router/src/loader.rs:241: adapter NextDayAI/xtraspicy1.0_13b_r32_800_adapter offloaded
2023-12-23T03:04:09.218033Z INFO lorax_router::queue: router/src/queue.rs:135: set adapter NextDayAI/xtraspicy1.0_13b_r32_800_adapter status to Downloaded
2023-12-23T03:04:09.218716Z INFO lorax_router::loader: router/src/loader.rs:241: adapter NextDayAI/xtraspicy1.0_13b_r32_720_adapter offloaded
2023-12-23T03:04:09.218739Z INFO lorax_router::queue: router/src/queue.rs:135: set adapter NextDayAI/xtraspicy1.0_13b_r32_720_adapter status to Downloaded
2023-12-23T03:04:09.239236Z INFO lorax_router::loader: router/src/loader.rs:197: adapter NextDayAI/xtraspicy1.0_13b_r32_400_adapter loaded
2023-12-23T03:04:09.239269Z INFO lorax_router::queue: router/src/queue.rs:135: set adapter NextDayAI/xtraspicy1.0_13b_r32_400_adapter status to Ready
2023-12-23T03:04:09.258709Z INFO lorax_router::loader: router/src/loader.rs:197: adapter NextDayAI/xtraspicy1.0_13b_r32_720_adapter loaded
2023-12-23T03:04:09.258724Z INFO lorax_router::queue: router/src/queue.rs:135: set adapter NextDayAI/xtraspicy1.0_13b_r32_720_adapter status to Ready
2023-12-23T03:04:09.701608Z INFO HTTP request{otel.name=POST /generate http.client_ip= http.flavor=1.1 http.host=inference.spicychat.ai:7001 http.method=POST http.route=/generate http.scheme=HTTP http.target=/generate http.user_agent=axios/1.4.0 otel.kind=server trace_id=a9611795aa4b0336970442689421c739}:generate{parameters=GenerateParameters { adapter_id: Some("NextDayAI/xtraspicy1.0_13b_r32_400_adapter"), adapter_source: None, api_token: None, best_of: None, temperature: Some(0.7), repetition_penalty: Some(1.1), top_k: Some(90), top_p: Some(0.7), typical_p: None, do_sample: true, max_new_tokens: 180, return_full_text: None, stop: ["\nYour roommate Amber :", "\nWinston:", "\n###"], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None } total_time="489.291069ms" validation_time="4.884384ms" queue_time="22.199583ms" inference_time="462.207337ms" time_per_token="24.326701ms" seed="Some(16751434335624526013)"}: lorax_router::server: router/src/server.rs:298: Success

Screenshot might be easier to read:
(screenshot: CleanShot 2023-12-22 at 22 09 05@2x)

We are only dealing with 4 adapters

@karlbernard2
Author

Here's the trace:

(screenshot: CleanShot 2023-12-22 at 22 41 28@2x)

@tgaddair
Contributor

Thanks for the backtrace @karlbernard2, this is very helpful!

Definitely looks like the hanging is occurring in the SGMV kernel.

In the short term, you can try disabling SGMV with an environment variable: DISABLE_SGMV=1. That's not a great long-term solution since SGMV is very fast when you have lots of adapters, but it should at least unblock you while I try to repro the issue, and the performance hit shouldn't be very noticeable with fewer than 10 adapters.
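For example, assuming the pods come from a Deployment (the name below is a placeholder), the variable can be set with kubectl or added to the manifest shared above:

# Hypothetical deployment name; adjust to your setup.
kubectl set env deployment/lorax DISABLE_SGMV=1
# Or add an extra entry under the container's env list in the manifest:
#   - name: DISABLE_SGMV
#     value: "1"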

I'll see if I can repro this behavior with the adapters you're using here.

@tgaddair
Contributor

Hey @karlbernard2, an update on this: I tried running some stress tests today with a variety of request patterns to try to replicate your setup, but I was unable to trigger the hanging behavior.

Can you share a few more details about your environment:

  • What GPU are you running on?
  • What Nvidia device driver version are you using (from nvidia-smi; see the sketch below)?
  • Is this running on prem or in the cloud? If cloud, which one?
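For the driver version, something like this should work without restarting anything (pod name is a placeholder):

# Query the GPU name and driver version from inside the running pod.
kubectl exec -it <pod_name> -- nvidia-smi --query-gpu=name,driver_version --format=csv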

Thanks.

@karlbernard2
Author

We’re running H100 on NebiusAI Kubernetes. I’ll have to get back to you on Tuesday with info on drivers.

@tgaddair
Contributor

tgaddair commented Jan 4, 2024

Hey @karlbernard2, I managed to track down the root cause of the deadlock, and it has been fixed in #156.
