@ehhuang (Contributor) commented Sep 11, 2025

What does this PR do?

Test Plan

See updated README.md

@meta-cla bot added the CLA Signed label Sep 11, 2025
@ehhuang force-pushed the pr3421 branch 3 times, most recently from 859ac24 to dc6acf8 on September 12, 2025 at 23:08
@ehhuang changed the title from "guidellm" to "chore(perf): run guidellm benchmarks" Sep 12, 2025
@ehhuang force-pushed the pr3421 branch 4 times, most recently from 9ec5404 to 9bcad5d on September 15, 2025 at 17:20
@ehhuang marked this pull request as ready for review September 15, 2025 17:27
@ehhuang force-pushed the pr3421 branch 3 times, most recently from 2d0c298 to 06739b3 on September 15, 2025 at 20:22
@ehhuang mentioned this pull request Sep 17, 2025
@ehhuang force-pushed the pr3421 branch 2 times, most recently from 4ddc5e8 to 5662e9e on September 18, 2025 at 16:28
ehhuang added a commit that referenced this pull request Sep 19, 2025
# What does this PR do?
As shown in #3421, we can scale the stack to handle more RPS with k8s
replicas. This PR enables a multi-process stack via uvicorn --workers, so
that we can achieve the same scaling without being in k8s.

To achieve that, we refactor main to split out the app construction
logic; this method needs to be non-async. We created a new `Stack` class
to house impls, with a `start()` method that is called in lifespan to
start background tasks, instead of starting them in the old
`construct_stack`. This way we avoid having to manage an event loop
manually.


## Test Plan
CI

> uv run --with llama-stack python -m llama_stack.core.server.server benchmarking/k8s-benchmark/stack_run_config.yaml

works.

> LLAMA_STACK_CONFIG=benchmarking/k8s-benchmark/stack_run_config.yaml uv run uvicorn llama_stack.core.server.server:create_app --port 8321 --workers 4

works.
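The refactor described in the commit message could be sketched roughly as follows. The names `Stack`, `start()`, and `create_app` come from the PR itself; everything inside them here (the heartbeat task, the config argument, the lifespan wiring) is an illustrative assumption, kept dependency-free rather than using the real FastAPI app:

```python
import asyncio
from contextlib import asynccontextmanager

class Stack:
    """Houses resolved impls; background work begins only in start()."""

    def __init__(self, config_path: str):
        self.config_path = config_path
        self._tasks: list[asyncio.Task] = []
        self.started = False

    async def start(self) -> None:
        # Called from the app's lifespan, where an event loop is already
        # running -- no manual loop management needed.
        self._tasks.append(asyncio.create_task(self._heartbeat()))
        self.started = True

    async def shutdown(self) -> None:
        for t in self._tasks:
            t.cancel()
        await asyncio.gather(*self._tasks, return_exceptions=True)
        self.started = False

    async def _heartbeat(self) -> None:
        # Stand-in for the real background tasks.
        while True:
            await asyncio.sleep(3600)

def create_app(config_path: str = "run.yaml"):
    """Non-async factory: safe for uvicorn to call once per worker process."""
    stack = Stack(config_path)

    @asynccontextmanager
    async def lifespan(app):
        await stack.start()      # background tasks begin here, not in the factory
        try:
            yield
        finally:
            await stack.shutdown()

    # The real server would build FastAPI(lifespan=lifespan) here; returning
    # the pieces keeps this sketch runnable without FastAPI installed.
    return stack, lifespan
```

Because the factory itself is synchronous, `uvicorn app_module:create_app --workers N` can import and call it independently in each worker, and per-process startup work happens inside the lifespan.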
```python
# "matplotlib",
# ]
# ///
"""
```
Contributor:

nit: clean up above

@ehhuang (Author):

this is used by uv run python scripts/generated_charts.py

```shell
# "stack 4 2"
"stack 8 2"
# "vllm 1 2"
)
```
Contributor:

nit: do we need to keep these, are these various configs we use for running the benchmarks?

@ehhuang (Author):

updated to correspond to the checked in results/

## Benchmark Results

We use [GuideLLM](https://github.com/neuralmagic/guidellm) against our k8s deployment for comprehensive performance testing.
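A GuideLLM run against the deployed endpoint might look like the sketch below. The target URL, model workload shape, and rate values are assumptions, and GuideLLM's flags have changed between releases, so treat this as illustrative and check `guidellm benchmark --help` for your installed version:

```shell
# Sketch: drive a constant request rate at an OpenAI-compatible endpoint.
# Flags and data keys vary by GuideLLM version; verify before running.
guidellm benchmark \
  --target "http://localhost:8321/v1" \
  --rate-type constant \
  --rate 8 \
  --max-seconds 60 \
  --data "prompt_tokens=512,output_tokens=128"
```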

Contributor:

Are we still using the scripts run-all-benchmarks.sh and run-benchmark.sh?

@ehhuang (Author):

updated.

@ehhuang (Author) commented Sep 19, 2025

Will update this with the new workers support on stack server

@ehhuang marked this pull request as draft September 19, 2025 21:00
@ehhuang force-pushed the pr3421 branch 2 times, most recently from 4a81a4e to 6678849 on September 23, 2025 at 23:06
@ehhuang marked this pull request as ready for review September 23, 2025 23:12
@slekkala1 (Contributor):

lgtm

@ehhuang merged commit 48a551e into llamastack:main Sep 24, 2025
46 checks passed
iamemilio pushed a commit to iamemilio/llama-stack that referenced this pull request Sep 24, 2025
