diff --git a/blog/2025-05-07_welcome_llmd.md b/blog/2025-05-07_welcome_llmd.md deleted file mode 100644 index 0d281c1..0000000 --- a/blog/2025-05-07_welcome_llmd.md +++ /dev/null @@ -1,14 +0,0 @@ ---- -title: Welcome to a new way to do distributed inference! -description: Opening post for the llm-d news blog -slug: welcome-llm-d-demonstration -authors: [Huey, Dewey, Louie] - -tags: [hello, welcome, llm-d] -image: https://i.imgur.com/mErPwqL.png -hide_table_of_contents: false ---- - -# A new way to do distributed inference: llm-d - -Welcome to the llm-d news blog. This blog is created with [**Docusaurus**](https://docusaurus.io/). \ No newline at end of file diff --git a/blog/2025-05-19.md b/blog/2025-05-19.md deleted file mode 100644 index dca6194..0000000 --- a/blog/2025-05-19.md +++ /dev/null @@ -1,13 +0,0 @@ ---- -title: Here's a Blog Post! -description: Blog post with one author -slug: heres-a-blog-post -authors: [kahuna] - -tags: [blog] -image: https://i.imgur.com/mErPwqL.png -hide_table_of_contents: false ---- - -# Here's a basic blog post -Something big happened today! \ No newline at end of file diff --git a/blog/2025-05-20_announce.md b/blog/2025-05-20_announce.md new file mode 100644 index 0000000..0db4116 --- /dev/null +++ b/blog/2025-05-20_announce.md @@ -0,0 +1,209 @@ +--- +title: Announcing the llm-d community! +description: Debut announcement of llm-d project and community +slug: llm-d-announce +date: 2025-05-20T08:00 + +authors: + - name: Robert Shaw + title: Director of Engineering, Red Hat + url: https://github.com/robertgshaw2-redhat + image_url: https://avatars.githubusercontent.com/u/114415538?v=4 + email: robshaw@redhat.com + + - name: Clayton Coleman + title: Distinguished Engineer, Google + url: https://github.com/smarterclayton + image_url: https://avatars.githubusercontent.com/u/1163175?v=4 + email: claytoncoleman@google.com + + - name: Carlos Costa + title: Distinguished Engineer, IBM + url: https://github.com/chcost + image_url: https://avatars.githubusercontent.com/u/26551701?v=4 + email: chcost@us.ibm.com + +tags: [hello, welcome, llm-d] +hide_table_of_contents: false +--- + +## Announcing the llm-d community + +llm-d is a Kubernetes-native high-performance distributed LLM inference framework + \- a well-lit path for anyone to serve at scale, with the fastest time-to-value and competitive performance per dollar for most models across most hardware accelerators. + +With llm-d, users can operationalize gen AI deployments with a modular, high-performance, end-to-end serving solution that leverages the latest distributed inference optimizations like KV-cache aware routing and disaggregated serving, co-designed and integrated with the Kubernetes operational tooling in [Inference Gateway (IGW)](https://github.com/kubernetes-sigs/gateway-api-inference-extension?tab=readme-ov-file). + + + +### LLM Inference Goes Distributed + +#### Why Standard Scale Out Falls Short + +Kubernetes typically scales out application workloads with uniform replicas and round-robin load balancing. 
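+
+For readers less familiar with this baseline, a minimal sketch of the pattern is shown below; the container image, model, replica count, and ports are illustrative only, not an llm-d configuration:
+
+```yaml
+# Uniform replicas of an inference server behind a plain Service.
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: vllm
+spec:
+  replicas: 3                                  # identical, interchangeable replicas
+  selector:
+    matchLabels:
+      app: vllm
+  template:
+    metadata:
+      labels:
+        app: vllm
+    spec:
+      containers:
+        - name: vllm
+          image: vllm/vllm-openai:latest       # illustrative image tag
+          args: ["--model", "meta-llama/Llama-3.2-3B-Instruct"]
+          ports:
+            - containerPort: 8000
+---
+# The Service spreads requests across replicas with no model-aware logic.
+apiVersion: v1
+kind: Service
+metadata:
+  name: vllm
+spec:
+  selector:
+    app: vllm
+  ports:
+    - port: 80
+      targetPort: 8000
+```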
+
+![Figure 1: Deploying a service to multiple vLLM instances](../docs/assets/images/image5_46.png)
+
+This simple pattern is very effective for most request patterns, which have the following characteristics:
+
+* Requests are short-lived and generally uniform in resource utilization
+* Requests have generally uniform latency service level objectives (SLOs)
+* Each replica can process each request equally well
+* Specializing variants and coordinating replicas to process a single request is not useful
+
+#### LLM Serving Is Unique
+
+The LLM inference workload, however, is unique, with slow, non-uniform, expensive requests. This means that typical scale-out and load-balancing patterns fall short of optimal performance.
+
+![Figure 2: Comparison of modern HTTP requests](../docs/assets/images/image7_33.png)
+
+Let’s take a look at each one step by step:
+
+*A. Requests are expensive with significant variance in resource utilization.*
+
+* Each LLM inference request has a different “shape” to it, as measured by the number of input tokens and output tokens. There is significant variance in these parameters across requests and workloads.
+  * RAG has long inputs \- prompt and retrieved docs \- and short generated outputs
+  * Reasoning has short or medium inputs and long generated outputs
+
+![Figure 3: Comparing the RAG pattern and Thinking/Reasoning pattern with prefill and decode stages](../docs/assets/images/image2_4.jpg)
+
+* These differences in request times can lead to significant imbalances across instances, which are compounded as loaded instances get overwhelmed. Overloads lead to longer ITL (Inter-Token Latency), which leads to more load, which leads to more ITL.
+
+*B. Routing to specific replicas with cached prior computation can achieve orders of magnitude better latency.*
+
+* Many common LLM workloads have “multi-turn” request patterns, where the same prompt is sent iteratively to the same instance.
+  * Agentic (tool calls are an iterative request flow)
+  * Code completion tasks (requests reuse the current codebase as context)
+
+![The agentic pattern sequence](../docs/assets/images/image8_0.jpg)
+
+* LLM inference servers like vLLM implement a method called “automatic prefix caching”, which enables “skipping” a significant amount of prefill computation when there is a cache hit. If requests are routed to vLLM replicas that have the data in the cache, we skip computation. Increasing the likelihood of prefix cache hits with a larger cache size can dramatically improve tail latencies.
+
+![The prefix caching method](../docs/assets/images/image3.jpg)
+
+*C. Specializing and coordinating replicas to process a single request can lead to more throughput per GPU.*
+
+* Inference is split into two phases – prefill and decode. Prefill generates the first output token and runs in parallel over all the prompt tokens \- this phase is compute bound. Decode generates tokens one at a time by doing a full pass over the model, making this phase memory bandwidth-bound.
+
+* Standard LLM deployments perform the prefill and decode phases of inference within a single replica. Given that the prefill and decode phases of inference have different resource requirements, co-locating these phases on the same replica leads to inefficient resource use, especially for long sequences.
+
+* **Disaggregation** (e.g. [DistServe](https://arxiv.org/abs/2401.09670)) separates prefill and decode phases onto different variants, enabling independent optimization and scaling of each phase.
+ * Google [leverages disaggregated serving on TPUs](https://cloud.google.com/blog/products/compute/whats-new-with-ai-hypercomputer) to provide better first-token latency and simplify operational scaling. + + * DeepSeek released a [discussion of the design of their inference system](https://github.com/deepseek-ai/open-infra-index/blob/main/202502OpenSourceWeek/day_6_one_more_thing_deepseekV3R1_inference_system_overview.md), which leverages aggressive disaggregation to achieve remarkable performance at scale. + +![Disaggregation separates the prefill and decode phases](../docs/assets/images/image4_57.png) + +*D. Production deployments often have a range of quality of service (QoS) requirements.* + +* Use cases for a single LLM endpoint can have a wide variety of quality of service requirements. Consider the following examples: + * Latency is the most important factor: Code completion requests and search responses need to minimize latency to provide an “in the loop” experience. O(ms) latency tolerance. + * Latency is important: Chat agent sessions and email drafting with interactive use cases. O(seconds) latency tolerance. + * Latency tolerant: Video call and email summarization and “deep research” agents with daily or hourly usage patterns. O(minutes) latency tolerance. + * Latency agnostic: Overnight batch processing workloads, meeting minute generation, and autonomous agents. O(hours) latency tolerance. + +* Given the compute intensity (and, therefore, high costs) of LLMs, tight latency SLOs are substantially more expensive to achieve. This spectrum of latency requirements presents an opportunity to further optimize infrastructure efficiency – the more latency tolerant a workload is, the more we can optimize infrastructure efficiency amongst other workloads. + +### Why llm-d? + +To exploit these characteristics and achieve optimal performance for LLM workloads, the inference serving landscape is rapidly transitioning towards distributed cluster-scale architectures. For instance, in its “Open Source Week”, the DeepSeek team published the design of its [inference system](https://github.com/deepseek-ai/open-infra-index/blob/main/202502OpenSourceWeek/day_6_one_more_thing_deepseekV3R1_inference_system_overview.md), which aggressively leverages disaggregation and KV caching to achieve remarkable performance per $ of compute. + +However, for most GenAI innovators, ML platform teams, and IT operations groups, these benefits remain out of reach. Building and operating a complex, monolithic system is time-consuming and challenging, especially in the context of the rapid pace of innovation and enterprise deployments with tens or hundreds of models for divergent use cases. This complexity risks time to market, higher operational costs and sprawl, and difficulty adopting and experimenting. + +#### Our Objective + +The objective of llm-d is to create a well-lit path for anyone to adopt the leading distributed inference optimizations *within their existing deployment framework* \- Kubernetes. 
+ +To achieve this goal, we have the following design principles for the project: + +* **Operationalizability:** modular and resilient architecture with native integration into Kubernetes via Inference Gateway API +* **Flexibility:** cross-platform (active work to support NVIDIA, Google TPU, AMD, and Intel), with extensible implementations of key composable layers of the stack +* **Performance**: leverage distributed optimizations like disaggregation and prefix-aware routing to achieve the highest tok/$ while meeting SLOs + +#### Architecture + +To achieve this objective, we designed llm-d with a modular and layered architecture on top of industry-standard open-source technologies \- vLLM, Kubernetes, and Inference Gateway. + + +* [**vLLM**. vLLM](https://docs.vllm.ai/en/latest/) is the leading open-source LLM inference engine, supporting a wide range of models (including Llama and DeepSeek) and hardware accelerators (including NVIDIA GPU, Google TPU, AMD ) with high performance. + +* [**Kubernetes**](https://kubernetes.io/docs/home/) **(K8s)**. K8s is an open source container orchestration engine for automating deployment, scaling, and management of containerized applications. It is the industry standard for deploying and updating LLM inference engines across various hardware accelerators. + +* [**Inference Gateway**](https://gateway-api-inference-extension.sigs.k8s.io/) **(IGW)**. IGW is an official Kubernetes project that extends the [Gateway API](https://gateway-api.sigs.k8s.io/) (the next generation of Kubernetes Ingress and Load Balancing API) with inference-specific routing. IGW includes many important features like model routing, serving priority, and extensible scheduling logic for “smart” load balancing. IGW integrates with many different gateway implementations, such as Envoy, making it widely portable across Kubernetes clusters. + +![](../docs/assets/images/llm-d-arch-simplified.svg) + +And our key new contributions: + +* **vLLM Optimized Inference Scheduler** \- IGW defines a pattern for customizable “smart” load-balancing via the [Endpoint Picker Protocol (EPP)](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/004-endpoint-picker-protocol). Leveraging enhanced operational telemetry exposed by vLLM, the inference scheduler implements the filtering and scoring algorithms necessary to make “smart” scheduling decisions around disaggregated serving, prefix-cache-awareness, and load-awareness, validated to be used out-of-the-box by llm-d users. Advanced teams can also tweak or implement their own scorers and filterers to further customize for their use cases, while still benefiting from upcoming operational features in the inference gateway, like flow control and latency-aware balancing. + * For more details, see our Northstar: [\[PUBLIC\] llm-d Scheduler Northstar](https://docs.google.com/document/d/1kE1LY8OVjiOgKVD9-9Po96HODbTIbgHp4qgvw06BCOc/edit?tab=t.0) + +* **Disaggregated Serving with [vLLM](https://github.com/vllm-project/vllm) \-** llm-d leverages vLLM’s recently enabled support for disaggregated serving via a pluggable KV Connector API to run prefill and decode on independent instances, using high-performance transport libraries like [NVIDIA’s NIXL](https://github.com/ai-dynamo/nixl). 
+
+  In llm-d, we plan to support two “well-lit” paths for prefill/decode (P/D) disaggregation:
+  * Latency-optimized implementation using fast interconnects (IB, RDMA, ICI)
+  * Throughput-optimized implementation using data center networking
+  * For more details, see our Northstar: [\[PUBLIC\] llm-d Disaggregated Serving Northstar](https://docs.google.com/document/d/1FNN5snmipaTxEA1FGEeSH7Z_kEqskouKD1XYhVyTHr8/edit?tab=t.0#heading=h.ycwld2oth1kj)
+
+* **Disaggregated Prefix Caching with vLLM** \- llm-d uses the same vLLM KV connector API used in disaggregated serving to provide a pluggable cache for previous calculations, including offloading KVs to host, remote storage, and systems like [LMCache](https://github.com/LMCache/LMCache).
+
+  In llm-d, we plan to support two “well-lit” paths for KV cache disaggregation:
+  * Independent caching with basic offloading to host memory and disk, providing a zero-operational-cost mechanism that utilizes all system resources
+  * Shared caching with KV transfer between instances and shared storage with global indexing, providing the potential for higher performance at the cost of a more operationally complex system
+  * For more details, see our Northstar: [\[PUBLIC\] llm-d Prefix Caching Northstar](https://docs.google.com/document/d/1d-jKVHpTJ_tkvy6Pfbl3q2FM59NpfnqPAh__Uz_bEZ8/edit?tab=t.0#heading=h.6qazyl873259)
+
+* **Variant Autoscaling over Hardware, Workload, and Traffic** \- Accelerator hardware varies dramatically in compute, memory, and cost; workloads sharing the same models vary by their required quality of service; the distinct phases of LLM inference and large mixture-of-experts models vary on whether they are compute, memory, or network bound; and incoming traffic varies over time and by workload. Today, all of these decisions are made at deployment time, and almost all deployers struggle to enable autoscaling to reduce their costs safely.
+
+  Drawing on extensive experience from end users and OSS collaborators like AIBrix, we plan to implement a traffic- and hardware-aware autoscaler that:
+  * Measures the capacity of each model server instance
+  * Derives a load function that takes into account different request shapes and QoS
+  * Uses the recent traffic mix \- QPS (Queries Per Second), QoS, and shape distribution \- to calculate the optimal mix of instances to handle prefill, decode, and latency-tolerant requests, and labels each instance with a grouping
+  * Reports load metrics per grouping that allow Kubernetes horizontal pod autoscaling to match the hardware in use to the hardware needed without violating SLOs
+  * For more details, see our Northstar: [\[PUBLIC\] llm-d Autoscaling Northstar](https://docs.google.com/document/d/1inTneLEZTv3rDEBB9KLOB9K6oMq8c3jkogARJqdt_58/edit?tab=t.0)
+
+#### Example llm-d Features
+
+llm-d integrates IGW and vLLM together, enabling a high-performance distributed serving stack. Let’s discuss some of the example features enabled by llm-d.
+
+**Prefix and KV cache-aware routing**
+
+The first key collaboration between IGW and vLLM in llm-d was developing prefix-cache aware routing to complement the existing KV cache utilization aware load balancing in IGW.
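+
+As a rough illustration of how these scorers surface to users, the llm-d-deployer quickstart exposes them as feature flags in its Helm values; a minimal values override enabling cache-aware routing might look like the sketch below (key and flag names are taken from the quickstart examples, and defaults may differ between releases):
+
+```yaml
+# Sketch of a quickstart values override enabling the routing scorers.
+redis:
+  enabled: true                      # Redis is required by the KV-cache-aware scorer
+modelservice:
+  epp:
+    defaultEnvVarsOverride:
+      - name: ENABLE_KVCACHE_AWARE_SCORER
+        value: "true"
+      - name: ENABLE_PREFIX_AWARE_SCORER
+        value: "true"
+      - name: ENABLE_LOAD_AWARE_SCORER
+        value: "true"
+```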
+
+We conducted a series of experiments to evaluate the performance of the [llm-d-inference-scheduler](https://github.com/llm-d/llm-d-inference-scheduler) with prefix-aware routing on 2 NVIDIA 8xH100 nodes, using [LMBenchmark](https://github.com/LMCache/LMBenchmark/tree/main/synthetic-multi-round-qa) in a long-input/short-output configuration designed to stress KV cache reuse and routing decision quality.
+
+| | Model | Configuration | ISL | OSL | Latency SLO |
+| :---- | :---- | :---- | :---- | :---- | :---- |
+| **S1** | Llama 4 Scout FP8 | TP2, 2 replicas | 20,000 | 100 | None |
+| **S2** | Llama 4 Scout FP8 | TP2, 4 replicas | 12,000 | 100 | P95 TTFT \<= 2s |
+| **S3** | Llama 3.1 70B FP16 | TP2, 4 replicas | 8,000 | 100 | P95 TTFT \<= 2s |
+
+![](../docs/assets/images/image1_116.png)
+
+**Key Observations:**
+
+* **S1:** At 4 QPS, llm-d achieves a mean TTFT approximately 3X lower than the baseline (lower is better).
+* **S2:** llm-d delivers \~50% higher QPS than the baseline while meeting SLO requirements (higher is better).
+* **S3:** llm-d sustains 2X the baseline QPS under SLO constraints (higher is better).
+
+These results show that llm-d’s cache- and prefix-aware scheduling effectively reduces TTFT and increases QPS compared to the baseline, while consistently meeting SLO requirements.
+
+Try it out with the `base.yaml` config in our [quickstart](https://github.com/llm-d/llm-d-deployer/tree/main/quickstart). And as a customization example, see the [template](https://github.com/llm-d/llm-d-inference-scheduler/blob/main/docs/create_new_filter.md) for adding your own scheduler filter.
+
+**P/D disaggregation**
+
+We’ve completed an initial implementation of P/D disaggregation with vLLM and llm-d-inference-scheduler, which delivers promising speedups for prefill-heavy workloads (20:1 ISL | OSL). Our next focus is finalizing the implementation with heterogeneous TP and completing comprehensive benchmarks for disaggregated serving. Short-term priorities include enabling heterogeneous TP, scaling with high-performance P/D \+ EP\<\>DP for large-scale MoEs, and DP-aware load balancing. We will follow up with a detailed performance blog in the coming weeks.
+
+Try it out with the `pd-nixl.yaml` config in our [quickstart](https://github.com/llm-d/llm-d-deployer/tree/main/quickstart).
+
+### Get started with llm-d
+
+llm-d brings together the performance of vLLM with the operationalizability of Kubernetes, creating a modular architecture for distributed LLM inference, targeting high performance on the latest models and agentic architectures.
+
+We welcome AI engineers and researchers to join the llm-d community and contribute:
+
+* Check out our repository on GitHub: [https://github.com/llm-d/llm-d](https://github.com/llm-d/llm-d)
+* Join our developer Slack: [https://inviter.co/llm-d-slack](https://inviter.co/llm-d-slack)
+* Try out our quick starts to deploy llm-d on your Kubernetes cluster: [https://github.com/llm-d/llm-d-deployer/tree/main/quickstart](https://github.com/llm-d/llm-d-deployer/tree/main/quickstart)
+
+Please join us. The future of AI is open.
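+
+As a concrete starting point, the quickstart flow from the llm-d-deployer repository looks roughly like the following sketch; see the quickstart README for the authoritative, up-to-date steps and prerequisites:
+
+```bash
+# Clone the deployer and move into the quickstart directory
+git clone https://github.com/llm-d/llm-d-deployer.git
+cd llm-d-deployer/quickstart
+
+# A Hugging Face token with access to the default model is required
+export HF_TOKEN="your-token"
+
+# Install llm-d using the base example configuration
+./llmd-installer.sh --values-file ./examples/base.yaml
+
+# Send a test request through the inference gateway
+./test-request.sh
+```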
+ diff --git a/blog/authors.yml b/blog/authors.yml index f6b7380..9d3e5e2 100644 --- a/blog/authors.yml +++ b/blog/authors.yml @@ -1,22 +1,26 @@ -Huey: - name: Huw - title: The Nephew in Red - -Dewey: - name: Dewydd - title: The one in Blue +redhat: + name: RedHat + url: https://redhat.com + image_url: https://www.redhat.com/rhdc/managed-files/Logo-Red_Hat-Hat_icon-Standard-RGB.svg -Louie: - name: Lewellyn - title: That one in green +robshaw: + name: Robert Shaw + title: Director of Engineering, Red Hat + url: https://github.com/robertgshaw2-redhat + image_url: https://avatars.githubusercontent.com/u/114415538?v=4 + email: robshaw@redhat.com -kahuna: - name: Big kahuna - title: The one in charge +smarterclayton: + name: Clayton Coleman + title: Distinguished Engineer, Google + url: https://github.com/smarterclayton + image_url: https://avatars.githubusercontent.com/u/1163175?v=4 + email: claytoncoleman@google.com -redhat-author: - name: RedHat - title: One of the sponsors - url: https://redhat.com - image_url: https://www.redhat.com/rhdc/managed-files/Logo-Red_Hat-Hat_icon-Standard-RGB.svg \ No newline at end of file +chcost: + name: Carlos Costa + title: Distinguished Engineer, IBM + url: https://github.com/chcost + image_url: https://avatars.githubusercontent.com/u/26551701?v=4 + email: chcost@us.ibm.com \ No newline at end of file diff --git a/blog/tags.yml b/blog/tags.yml index 72a68f0..00208f1 100644 --- a/blog/tags.yml +++ b/blog/tags.yml @@ -19,7 +19,7 @@ llm-d: description: llm-d tag description news: - label: News Releases! + label: News Releases permalink: /news-releases description: Used for "official" news releases in the blog @@ -34,6 +34,12 @@ hola: description: Hola tag description blog: - label: just a blog + label: blog posts permalink: /blog description: everyday blog posts + + +announce: + label: Announcements + permalink: /announce + description: Announcements that aren't news releases diff --git a/docs/architecture/00_architecture.md b/docs/architecture/00_architecture.md index 08f4285..d4522b9 100644 --- a/docs/architecture/00_architecture.md +++ b/docs/architecture/00_architecture.md @@ -3,7 +3,7 @@ sidebar_position: 0 label: llm-d Architecture --- # Overview of llm-d architecture -`llm-d` is a Kubernetes-native distributed inference serving stack - a well-lit path for anyone to serve at scale, with the fastest time-to-value and competitive performance per dollar for most models across most hardware accelerators. +`llm-d` is a Kubernetes-native distributed inference serving stack - a well-lit path for anyone to serve large language models at scale, with the fastest time-to-value and competitive performance per dollar for most models across most hardware accelerators. With `llm-d`, users can operationalize GenAI deployments with a modular solution that leverages the latest distributed inference optimizations like KV-cache aware routing and disaggregated serving, co-designed and integrated with the Kubernetes operational tooling in [Inference Gateway (IGW)](https://github.com/kubernetes-sigs/gateway-api-inference-extension). @@ -14,7 +14,7 @@ Built by leaders in the Kubernetes and vLLM projects, `llm-d` is a community-dri `llm-d` adopts a layered architecture on top of industry-standard open technologies: vLLM, Kubernetes, and Inference Gateway. 
-![llm-d Architecture](../assets/images/llm-d-arch.svg) +![llm-d Architecture](../assets/images/llm-d-arch-simplified.svg) @@ -31,6 +31,7 @@ Key features of `llm-d` include: - **Variant Autoscaling over Hardware, Workload, and Traffic** (🚧): We plan to implement a traffic- and hardware-aware autoscaler that (a) measures the capacity of each model server instance, (b) derive a load function that takes into account different request shapes and QoS, and (c) assesses recent traffic mix (QPS, QoS, and shapes) Using the recent traffic mix to calculate the optimal mix of instances to handle prefill, decode, and latency-tolerant requests, enabling use of HPA for SLO-level efficiency. [See our Northstar design](https://docs.google.com/document/d/1inTneLEZTv3rDEBB9KLOB9K6oMq8c3jkogARJqdt_58/edit?tab=t.0) +For more, see the [project proposal](https://github.com/llm-d/llm-d/blob/dev/docs/proposals/llm-d.md) ## Getting Started diff --git a/docs/architecture/Component Architecture/02_inf-extension.md b/docs/architecture/Component Architecture/02_inf-extension.md deleted file mode 100644 index 385ac90..0000000 --- a/docs/architecture/Component Architecture/02_inf-extension.md +++ /dev/null @@ -1,113 +0,0 @@ ---- -sidebar_position: 2 -sidebar_label: Inference Extension ---- - -[![Go Report Card](https://goreportcard.com/badge/sigs.k8s.io/gateway-api-inference-extension)](https://goreportcard.com/report/sigs.k8s.io/gateway-api-inference-extension) -[![Go Reference](https://pkg.go.dev/badge/sigs.k8s.io/gateway-api-inference-extension.svg)](https://pkg.go.dev/sigs.k8s.io/gateway-api-inference-extension) -[![License](https://img.shields.io/github/license/kubernetes-sigs/gateway-api-inference-extension/LICENSE)] - -# Gateway API Inference Extension - -Gateway API Inference Extension optimizes self-hosting Generative Models on Kubernetes. -This is achieved by leveraging Envoy's [External Processing] (ext-proc) to extend any gateway that supports both ext-proc and [Gateway API] into an **[inference gateway]**. - - -[Inference Gateway]:#concepts-and-definitions - -## Concepts and Definitions - -The following specific terms to this project: - -- **Inference Gateway (IGW)**: A proxy/load-balancer which has been coupled with an - `Endpoint Picker`. It provides optimized routing and load balancing for - serving Kubernetes self-hosted generative Artificial Intelligence (AI) - workloads. It simplifies the deployment, management, and observability of AI - inference workloads. -- **Inference Scheduler**: An extendable component that makes decisions about which endpoint is optimal (best cost / - best performance) for an inference request based on `Metrics and Capabilities` - from [Model Serving](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/003-model-server-protocol/README.md). -- **Metrics and Capabilities**: Data provided by model serving platforms about - performance, availability and capabilities to optimize routing. Includes - things like [Prefix Cache] status or [LoRA Adapters] availability. -- **Endpoint Picker(EPP)**: An implementation of an `Inference Scheduler` with additional Routing, Flow, and Request Control layers to allow for sophisticated routing strategies. Additional info on the architecture of the EPP [here](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/0683-epp-architecture-proposal). 
- - -The following are key industry terms that are important to understand for -this project: - -- **Model**: A generative AI model that has learned patterns from data and is - used for inference. Models vary in size and architecture, from smaller - domain-specific models to massive multi-billion parameter neural networks that - are optimized for diverse language tasks. -- **Inference**: The process of running a generative AI model, such as a large - language model, diffusion model etc, to generate text, embeddings, or other - outputs from input data. -- **Model server**: A service (in our case, containerized) responsible for - receiving inference requests and returning predictions from a model. -- **Accelerator**: specialized hardware, such as Graphics Processing Units - (GPUs) that can be attached to Kubernetes nodes to speed up computations, - particularly for training and inference tasks. - - -For deeper insights and more advanced concepts, refer to our [proposals](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals). - -[Inference]:https://www.digitalocean.com/community/tutorials/llm-inference-optimization -[Gateway API]:https://github.com/kubernetes-sigs/gateway-api -[Prefix Cache]:https://docs.vllm.ai/en/stable/design/v1/prefix_caching.html -[LoRA Adapters]:https://docs.vllm.ai/en/stable/features/lora.html -[External Processing]:https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/ext_proc_filter - -## Technical Overview - -This extension upgrades an [ext-proc](https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/ext_proc_filter) capable proxy or gateway - such as Envoy Gateway, kGateway, or the GKE Gateway - to become an **[inference gateway]** - supporting inference platform teams self-hosting Generative Models (with a current focus on large language models) on Kubernetes. This integration makes it easy to expose and control access to your local [OpenAI-compatible chat completion endpoints](https://platform.openai.com/docs/api-reference/chat) to other workloads on or off cluster, or to integrate your self-hosted models alongside model-as-a-service providers in a higher level **AI Gateway** like LiteLLM, Solo AI Gateway, or Apigee. - -The Inference Gateway: - -* Improves the tail latency and throughput of LLM completion requests against Kubernetes-hosted model servers using an extensible request scheduling alogrithm that is kv-cache and request cost aware, avoiding evictions or queueing as load increases -* Provides [Kubernetes-native declarative APIs](https://gateway-api-inference-extension.sigs.k8s.io/concepts/api-overview/) to route client model names to use-case specific LoRA adapters and control incremental rollout of new adapter versions, A/B traffic splitting, and safe blue-green base model and model server upgrades -* Adds end to end observability around service objective attainment -* Ensures operational guardrails between different client model names, allowing a platform team to safely serve many different GenAI workloads on the same pool of shared foundation model servers for higher utilization and fewer required accelerators - -![Architecture Diagram](../../assets/images/inference-gateway-architecture.svg) - -It currently requires a version of vLLM that supports the necessary metrics to predict traffic load which is defined in the [model server protocol](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/003-model-server-protocol). 
Support for Google's Jetstream, nVidia Triton, text-generation-inference, and SGLang is coming soon. - -## Status - -This project is [alpha (0.3 release)](https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/tag/v0.3.0). It should not be used in production yet. - -## Getting Started - -Follow our [Getting Started Guide](https://gateway-api-inference-extension.sigs.k8s.io/guides/) to get the inference-extension up and running on your cluster! - -See our website at https://gateway-api-inference-extension.sigs.k8s.io/ for detailed API documentation on leveraging our Kubernetes-native declarative APIs - -## Roadmap - -As Inference Gateway builds towards a GA release. We will continue to expand our capabilities, namely: -1. Prefix-cache aware load balancing with interfaces for remote caches -1. Recommended LoRA adapter pipeline for automated rollout -1. Fairness and priority between workloads within the same criticality band -1. HPA support for autoscaling on aggregate metrics derived from the load balancer -1. Support for large multi-modal inputs and outputs -1. Support for other GenAI model types (diffusion and other non-completion protocols) -1. Heterogeneous accelerators - serve workloads on multiple types of accelerator using latency and request cost-aware load balancing -1. Disaggregated serving support with independently scaling pools - - -## End-to-End Tests - -Follow this link to the [e2e README](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/test/e2e/epp/README.md) to learn more about running the inference-extension end-to-end test suite on your cluster. - -## Contributing - -Our community meeting is weekly at Thursday 10AM PDT ([Zoom](https://zoom.us/j/9955436256?pwd=Z2FQWU1jeDZkVC9RRTN4TlZyZTBHZz09), [Meeting Notes](https://www.google.com/url?q=https://docs.google.com/document/d/1frfPE5L1sI3737rdQV04IcDGeOcGJj2ItjMg6z2SRH0/edit?usp%3Dsharing&sa=D&source=calendar&usd=2&usg=AOvVaw1pUVy7UN_2PMj8qJJcFm1U)). - -We currently utilize the [#wg-serving](https://kubernetes.slack.com/?redir=%2Fmessages%2Fwg-serving) slack channel for communications. - -Contributions are readily welcomed, follow the [dev guide](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/docs/dev.md) to start contributing! - -### Code of conduct - -Participation in the Kubernetes community is governed by the [Kubernetes Code of Conduct](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/code-of-conduct.md). diff --git a/docs/architecture/Component Architecture/02_inf-simulator.md b/docs/architecture/Component Architecture/02_inf-simulator.md new file mode 100644 index 0000000..18c56f6 --- /dev/null +++ b/docs/architecture/Component Architecture/02_inf-simulator.md @@ -0,0 +1,120 @@ +--- +sidebar_position: 2 +sidebar_label: Inference Simulator +--- +# vLLM Simulator +To help with development and testing we have developed a light weight vLLM simulator. It does not truly +run inference, but it does emulate responses to the HTTP REST endpoints of vLLM. +Currently it supports partial OpenAI-compatible API: +- /v1/chat/completions +- /v1/completions +- /v1/models + +In addition, it supports a subset of vLLM's Prometheus metrics. These metrics are exposed via the /metrics HTTP REST endpoint. Currently supported are the following metrics: +- vllm:lora_requests_info + +The simulated inferense has no connection with the model and LoRA adapters specified in the command line parameters. 
The /v1/models endpoint returns simulated results based on those same command line parameters.
+
+The simulator supports two modes of operation:
+- `echo` mode: the response contains the same text that was received in the request. For `/v1/chat/completions`, the last message for the role=`user` is used.
+- `random` mode: the response is randomly chosen from a set of pre-defined sentences.
+
+Timing of the response is defined by two parameters: `time-to-first-token` and `inter-token-latency`.
+
+For a request with `stream=true`: `time-to-first-token` defines the delay before the first token is returned, and `inter-token-latency` defines the delay between subsequent tokens in the stream.
+
+For a request with `stream=false`: the response is returned after a delay of `time-to-first-token + (inter-token-latency * (number-of-output-tokens - 1))`.
+
+It can be run standalone or in a Pod for testing under tools such as Kind.
+
+## Limitations
+API responses contain a subset of the fields provided by the OpenAI API.
+
+
+ Click to show the structure of requests/responses + +- `/v1/chat/completions` + - **request** + - stream + - model + - messages + - role + - content + - **response** + - id + - created + - model + - choices + - index + - finish_reason + - message +- `/v1/completions` + - **request** + - stream + - model + - prompt + - max_tokens (for future usage) + - **response** + - id + - created + - model + - choices + - text +- `/v1/models` + - **response** + - object (list) + - data + - id + - object (model) + - created + - owned_by + - root + - parent +
+
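+
+As an illustration of the request and response fields listed above, exercising a locally running simulator might look like the following sketch (the port and model name match the standalone example below and are otherwise arbitrary):
+
+```bash
+# List the 'loaded' models
+curl -s http://localhost:8000/v1/models
+
+# Request a (simulated) completion; in `random` mode the returned text is a canned sentence
+curl -s http://localhost:8000/v1/completions \
+  -H 'Content-Type: application/json' \
+  -d '{"model": "my_model", "prompt": "Hello, world", "stream": false}'
+```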
+For more details, see the [vLLM documentation](https://docs.vllm.ai/en/stable/getting_started/quickstart.html#openai-completions-api-with-vllm)
+
+## Command line parameters
+- `port`: the port the simulator listens on, mandatory
+- `model`: the currently 'loaded' model, mandatory
+- `lora`: a list of available LoRA adapters, separated by commas, optional, by default empty
+- `mode`: the simulator mode, optional, by default `random`
+  - `echo`: returns the same text that was sent in the request
+  - `random`: returns a sentence chosen at random from a set of pre-defined sentences
+- `time-to-first-token`: the time to the first token (in milliseconds), optional, by default zero
+- `inter-token-latency`: the time to 'generate' each additional token (in milliseconds), optional, by default zero
+- `max-loras`: maximum number of LoRAs in a single batch, optional, default is one
+- `max-cpu-loras`: maximum number of LoRAs to store in CPU memory, optional, must be >= max-loras, default is max-loras
+- `max-running-requests`: maximum number of inference requests that can be processed at the same time
+
+
+## Working with the Docker image
+
+### Building
+To build a Docker image of the vLLM Simulator, run:
+```bash
+make build-llm-d-inference-sim-image
+```
+
+### Running
+To run the vLLM Simulator image under Docker, run:
+```bash
+docker run --rm --publish 8000:8000 ai-aware-router/llm-d-inference-sim:0.0.1 /ai-aware-router/llm-d-inference-sim --port 8000 --model "Qwen/Qwen2.5-1.5B-Instruct" --lora "tweet-summary-0,tweet-summary-1"
+```
+**Note:** The above command exposes the simulator on port 8000 and serves the Qwen/Qwen2.5-1.5B-Instruct model.
+
+## Standalone testing
+
+### Building
+To build the vLLM simulator, run:
+```bash
+make build-llm-d-inference-sim
+```
+
+### Running
+To run the simulator in a standalone test environment, run:
+```bash
+./bin/llm-d-inference-sim --model my_model --port 8000
+```
+
diff --git a/docs/assets/images/carlos costa.jpeg b/docs/assets/images/carlos costa.jpeg new file mode 100644 index 0000000..15d1439 Binary files /dev/null and b/docs/assets/images/carlos costa.jpeg differ
diff --git a/docs/assets/images/clayton coleman.jpeg b/docs/assets/images/clayton coleman.jpeg new file mode 100644 index 0000000..4192aca Binary files /dev/null and b/docs/assets/images/clayton coleman.jpeg differ
diff --git a/docs/assets/images/image1_116.png b/docs/assets/images/image1_116.png new file mode 100644 index 0000000..c46e14c Binary files /dev/null and b/docs/assets/images/image1_116.png differ
diff --git a/docs/assets/images/image2_4.jpg b/docs/assets/images/image2_4.jpg new file mode 100644 index 0000000..b9b3bf2 Binary files /dev/null and b/docs/assets/images/image2_4.jpg differ
diff --git a/docs/assets/images/image3.jpg b/docs/assets/images/image3.jpg new file mode 100644 index 0000000..689c945 Binary files /dev/null and b/docs/assets/images/image3.jpg differ
diff --git a/docs/assets/images/image4_57.png b/docs/assets/images/image4_57.png new file mode 100644 index 0000000..de41556 Binary files /dev/null and b/docs/assets/images/image4_57.png differ
diff --git a/docs/assets/images/image5_46.png b/docs/assets/images/image5_46.png new file mode 100644 index 0000000..9ad10a9 Binary files /dev/null and b/docs/assets/images/image5_46.png differ
diff --git a/docs/assets/images/image7_33.png b/docs/assets/images/image7_33.png new file mode 100644 index 0000000..f921791 Binary files /dev/null and b/docs/assets/images/image7_33.png differ
diff --git
a/docs/assets/images/image8_0.jpg b/docs/assets/images/image8_0.jpg new file mode 100644 index 0000000..0a04e26 Binary files /dev/null and b/docs/assets/images/image8_0.jpg differ diff --git a/docs/assets/images/llm-d-arch-simplified.svg b/docs/assets/images/llm-d-arch-simplified.svg new file mode 100644 index 0000000..825cb91 --- /dev/null +++ b/docs/assets/images/llm-d-arch-simplified.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/docs/assets/images/robert shaw headshot.jpeg b/docs/assets/images/robert shaw headshot.jpeg new file mode 100644 index 0000000..e485fb0 Binary files /dev/null and b/docs/assets/images/robert shaw headshot.jpeg differ diff --git a/docs/community/contact_us.md b/docs/community/contact_us.md index 8be3107..48c532f 100644 --- a/docs/community/contact_us.md +++ b/docs/community/contact_us.md @@ -9,6 +9,8 @@ There are several ways you can join the community effort to develop and enhance - Via the [**Github pages for llm-d:** https://github.com/llm-d](https://github.com/llm-d)** - Via our [**Slack Workspace:** https://llm-d.slack.com](https://llm-d.slack.com) - Via [**Reddit**: Reddit:https://www.reddit.com/r/llm_d/](Reddit:https://www.reddit.com/r/llm_d/) +- We host a weekly standup for contributors on Wednesdays at 1230pm ET. Please join: [Meeting Details](https://calendar.google.com/calendar/event?action=TEMPLATE&tmeid=NG9yZ3AyYTN0N3VlaW01b21xbWV2c21uNjRfMjAyNTA1MjhUMTYzMDAwWiByb2JzaGF3QHJlZGhhdC5jb20&tmsrc=robshaw%40redhat.com&scp=ALL) +- We use Google Groups to share architecture diagrams and other content. Please join: [Google Group](https://groups.google.com/g/llm-d-contributors) You can also find us on diff --git a/docs/guide/Installation/prerequisites.md b/docs/guide/Installation/prerequisites.md index 3046e6b..051991a 100644 --- a/docs/guide/Installation/prerequisites.md +++ b/docs/guide/Installation/prerequisites.md @@ -1,30 +1,39 @@ --- sidebar_position: 1 +sidebar_label: Prerequisites --- -# Prerequisites for running llm-d +# Prerequisites for running the llm-d QuickStart -**Note that these are the prerequisites for running the QuickStart Demo. +### Target Platforms -## Compute Resources +Since the llm-d-deployer is based on helm charts, llm-d can be deployed on a variety of Kubernetes platforms. As more platforms are supported, the installer will be updated to support them. - +Documentation for example cluster setups are provided in the [infra](https://github.com/llm-d/llm-d-deployer/tree/main/quickstart/infra) directory of the llm-d-deployer repository. -### Hardware Profiles +- [OpenShift on AWS](https://github.com/llm-d/llm-d-deployer/tree/main/quickstart/infra/openshift-aws.md) -The QuickStart has been tested on: +#### Minikube -- Minikube on AWS - - single g6e.12xlarge -- Red Hat OpenShift on AWS - - 6 x m5.4xlarge - - 2 x g6e.2xlarge - - OpenShift 4.17.21 - - NVIDIA GPU Operator 24.9.2 - - OpenShift Data Foundation 4.17.6 +This can be run on a minimum ec2 node type [g6e.12xlarge](https://aws.amazon.com/ec2/instance-types/g6e/) (4xL40S 48GB but only 2 are used by default) to infer the model meta-llama/Llama-3.2-3B-Instruct that will get spun up. +> ⚠️ If your cluster has no available GPUs, the **prefill** and **decode** pods will remain in **Pending** state. -### Target Platforms +Verify you have properly installed the container toolkit with the runtime of your choice. 
+ +```bash +# Podman +podman run --rm --security-opt=label=disable --device=nvidia.com/gpu=all ubuntu nvidia-smi +# Docker +sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi +``` + +#### OpenShift + +- OpenShift - This quickstart was tested on OpenShift 4.17. Older versions may work but have not been tested. +- NVIDIA GPU Operator and NFD Operator - The installation instructions can be found [here](https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/steps-overview.html). +- NO Service Mesh or Istio installation as Istio CRDs will conflict with the gateway +- Cluster administrator privileges are required to install the llm-d cluster scoped resources #### Kubernetes @@ -51,11 +60,27 @@ sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi ## Software prerequisites -- Client Configuration +## Client Configuration + +### Get the code + +Clone the llm-d-deployer repository. + +```bash +git clone https://github.com/llm-d/llm-d-deployer.git +``` + +Navigate to the quickstart directory + +```bash +cd llm-d-deployer/quickstart +``` + ### Required tools Following prerequisite are required for the installer to work. -- [yq – installation & releases](https://github.com/mikefarah/yq#install) +- [yq (mikefarah) – installation](https://github.com/mikefarah/yq?tab=readme-ov-file#install) - [jq – download & install guide](https://stedolan.github.io/jq/download/) - [git – installation guide](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git) - [Helm – quick-start install](https://helm.sh/docs/intro/install/) @@ -71,12 +96,13 @@ You can use the installer script that installs all the required dependencies. C ### Required credentials and configuration -- [llm-d-deployer GitHub repo – clone here](https://github.com/neuralmagic/llm-d-deployer.git) -- [Quay.io Registry – sign-up & credentials](https://quay.io/) +- [llm-d-deployer GitHub repo – clone here](https://github.com/llm-d/llm-d-deployer.git) +- [ghcr.io Registry – credentials](https://github.com/settings/tokens) You must have a GitHub account and a "classic" personal access token with `read:packages` access to the llm-d-deployer repository. - [Red Hat Registry – terms & access](https://access.redhat.com/registry/) - [HuggingFace HF_TOKEN](https://huggingface.co/docs/hub/en/security-tokens) with download access for the model you want to use. By default the sample application will use [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct). - > ⚠️ You may need to visit Hugging Face [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) and - > accept the usage terms to pull this with your HF token if you have not already done so. + +> ⚠️ Your Hugging Face account must have access to the model you want to use. You may need to visit Hugging Face [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) and +> accept the usage terms if you have not already done so. 
Registry Authentication: The installer looks for an auth file in: @@ -91,15 +117,11 @@ If not found, you can create one with the following commands: Create with Docker: ```bash -docker --config ~/.config/containers/ login quay.io -docker --config ~/.config/containers/ login registry.redhat.io +docker --config ~/.config/containers/ login ghcr.io ``` Create with Podman: ```bash -podman login quay.io --authfile ~/.config/containers/auth.json -podman login registry.redhat.io --authfile ~/.config/containers/auth.json -``` - - +podman login ghcr.io --authfile ~/.config/containers/auth.json +``` \ No newline at end of file diff --git a/docs/guide/Installation/quickstart.md b/docs/guide/Installation/quickstart.md index 1cc9990..333b0b4 100644 --- a/docs/guide/Installation/quickstart.md +++ b/docs/guide/Installation/quickstart.md @@ -4,91 +4,32 @@ sidebar_label: Quick Start installer --- # Trying llm-d via the Quick Start installer +Getting Started with llm-d on Kubernetes. For specific instructions on how to install llm-d on minikube, see the [README-minikube.md](https://github.com/llm-d/llm-d-deployer/blob/main/quickstart/README-minikube.md) instructions. -For more information on llm-d, see the llm-d git repository [here](https://github.com/llm-d/llm-d) and website [here](https://llmd.io). +For more information on llm-d in general, see the llm-d git repository [here](https://github.com/llm-d/llm-d) and website [here](https://llm-d.ai). ## Overview This guide will walk you through the steps to install and deploy llm-d on a Kubernetes cluster, using an opinionated flow in order to get up and running as quickly as possible. -## Client Configuration - -### Required tools - -Following prerequisite are required for the installer to work. - -- [yq (mikefarah) – installation](https://github.com/mikefarah/yq?tab=readme-ov-file#install) -- [jq – download & install guide](https://stedolan.github.io/jq/download/) -- [git – installation guide](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git) -- [Helm – quick-start install](https://helm.sh/docs/intro/install/) -- [Kustomize – official install docs](https://kubectl.docs.kubernetes.io/installation/kustomize/) -- [kubectl – install & setup](https://kubernetes.io/docs/tasks/tools/install-kubectl/) - -You can use the installer script that installs all the required dependencies. Currently only Linux is supported. - -```bash -# Currently Linux only -./install-deps.sh -``` - -### Required credentials and configuration - -- [llm-d-deployer GitHub repo – clone here](https://github.com/llm-d/llm-d-deployer.git) -- [ghcr.io Registry – credentials](https://github.com/settings/tokens) You must have a GitHub account and a "classic" personal access token with `read:packages` access to the llm-d-deployer repository. -- [Red Hat Registry – terms & access](https://access.redhat.com/registry/) -- [HuggingFace HF_TOKEN](https://huggingface.co/docs/hub/en/security-tokens) with download access for the model you want to use. By default the sample application will use [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct). - -> ⚠️ Your Hugging Face account must have access to the model you want to use. You may need to visit Hugging Face [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) and -> accept the usage terms if you have not already done so. 
- -Registry Authentication: The installer looks for an auth file in: - -```bash -~/.config/containers/auth.json -# or -~/.config/containers/config.json -``` - -If not found, you can create one with the following commands: - -Create with Docker: - -```bash -docker --config ~/.config/containers/ login ghcr.io -``` - -Create with Podman: - -```bash -podman login ghcr.io --authfile ~/.config/containers/auth.json -``` - -### Target Platforms - -#### Kubernetes +For more information on llm-d, see the llm-d git repository [here](https://github.com/llm-d/llm-d) and website [here](https://llmd.io). -This can be run on a minimum ec2 node type [g6e.12xlarge](https://aws.amazon.com/ec2/instance-types/g6e/) (4xL40S 48GB but only 2 are used by default) to infer the model meta-llama/Llama-3.2-3B-Instruct that will get spun up. +## Prerequisites -> ⚠️ If your cluster has no available GPUs, the **prefill** and **decode** pods will remain in **Pending** state. +First ensure you have all the tools and resources as described in [Prerequisites](./prerequisites.md) -Verify you have properly installed the container toolkit with the runtime of your choice. + -```bash -# Podman -podman run --rm --security-opt=label=disable --device=nvidia.com/gpu=all ubuntu nvidia-smi -# Docker -sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi -``` +## llm-d Installation -#### OpenShift + - Change to the directory holding your clone of the llm-d-deployer code + - Navigate to the quickstart directory, e.g. -- OpenShift - This quickstart was tested on OpenShift 4.18. Older versions may work but have not been tested. -- NVIDIA GPU Operator and NFD Operator - The installation instructions can be found [here](https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/steps-overview.html). -- NO Service Mesh or Istio installation as Istio CRDs will conflict with the gateway + ```bash + cd llm-d-deployer/quickstart + ``` - - -## llm-d Installation +Only a single installation of llm-d on a cluster is currently supported. In the future, multiple model services will be supported. Until then, [uninstall llm-d](#uninstall) before reinstalling. The llm-d-deployer contains all the helm charts necessary to deploy llm-d. To facilitate the installation of the helm charts, the `llmd-installer.sh` script is provided. This script will populate the necessary manifests in the `manifests` directory. After this, it will apply all the manifests in order to bring up the cluster. @@ -104,11 +45,11 @@ The llmd-installer.sh script aims to simplify the installation of llm-d using th It also supports uninstalling the llm-d infrastructure and the sample app. -Before proceeding with the installation, ensure you have completed the prerequisites and are able to issue kubectl commands to your cluster by configuring your `~/.kube/config` file or by using the `oc login` command. +Before proceeding with the installation, ensure you have completed the prerequisites and are able to issue `kubectl` or `oc` commands to your cluster by configuring your `~/.kube/config` file or by using the `oc login` command. ### Usage -The installer needs to be run from the `llm-d-deployer/quickstart` directory. +The installer needs to be run from the `llm-d-deployer/quickstart` directory as a cluster admin with CLI access to the cluster. ```bash ./llmd-installer.sh [OPTIONS] @@ -116,18 +57,20 @@ The installer needs to be run from the `llm-d-deployer/quickstart` directory. 
### Flags -| Flag | Description | Example | -|--------------------------------|---------------------------------------------------------------------------------------------------------|------------------------------------------------------------------| -| `--hf-token TOKEN` | HuggingFace API token (or set `HF_TOKEN` env var) | `./llmd-installer.sh --hf-token "abc123"` | -| `--auth-file PATH` | Path to your registry auth file ig not in one of the two listed files in the auth section of the readme | `./llmd-installer.sh --auth-file ~/.config/containers/auth.json` | -| `--storage-size SIZE` | Size of storage volume (default: 7Gi) | `./llmd-installer.sh --storage-size 15Gi` | -| `--skip-download-model` | Skip downloading the model to PVC if modelArtifactURI is pvc based | `./llmd-installer.sh --skip-download-model` | -| `--storage-class CLASS` | Storage class to use (default: efs-sc) | `./llmd-installer.sh --storage-class ocs-storagecluster-cephfs` | -| `--namespace NAME` | Kubernetes namespace to use (default: `llm-d`) | `./llmd-installer.sh --namespace foo` | -| `--values-file NAME` | Absolute path to a Helm values.yaml file (default: llm-d-deployer/charts/llm-d/values.yaml) | `./llmd-installer.sh --values-file /path/to/values.yaml` | -| `--uninstall` | Uninstall llm-d and cleanup resources | `./llmd-installer.sh --uninstall` | -| `--disable-metrics-collection` | Disable metrics collection (Prometheus will not be installed) | `./llmd-installer.sh --disable-metrics-collection` | -| `-h`, `--help` | Show help and exit | `./llmd-installer.sh --help` | +| Flag | Description | Example | +|--------------------------------------|---------------------------------------------------------------|------------------------------------------------------------------| +| `-a`, `--auth-file PATH` | Path to containers auth.json | `./llmd-installer.sh --auth-file ~/.config/containers/auth.json` | +| `-z`, `--storage-size SIZE` | Size of storage volume | `./llmd-installer.sh --storage-size 15Gi` | +| `-c`, `--storage-class CLASS` | Storage class to use (default: efs-sc) | `./llmd-installer.sh --storage-class ocs-storagecluster-cephfs` | +| `-n`, `--namespace NAME` | K8s namespace (default: llm-d) | `./llmd-installer.sh --namespace foo` | +| `-f`, `--values-file PATH` | Path to Helm values.yaml file (default: values.yaml) | `./llmd-installer.sh --values-file /path/to/values.yaml` | +| `-u`, `--uninstall` | Uninstall the llm-d components from the current cluster | `./llmd-installer.sh --uninstall` | +| `-d`, `--debug` | Add debug mode to the helm install | `./llmd-installer.sh --debug` | +| `-i`, `--skip-infra` | Skip the infrastructure components of the installation | `./llmd-installer.sh --skip-infra` | +| `-t`, `--download-timeout` | Timeout for model download job | `./llmd-installer.sh --download-timeout` | +| `-D`, `--download-model` | Download the model to PVC from Hugging Face | `./llmd-installer.sh --download-model` | +| `-m`, `--disable-metrics-collection` | Disable metrics collection (Prometheus will not be installed) | `./llmd-installer.sh --disable-metrics-collection` | +| `-h`, `--help` | Show this help and exit | `./llmd-installer.sh --help` | ## Examples @@ -140,7 +83,7 @@ export HF_TOKEN="your-token" ### Install on OpenShift -Before running the installer, ensure you have logged into the cluster. For example: +Before running the installer, ensure you have logged into the cluster as a cluster administrator. 
For example: ```bash oc login --token=sha256~yourtoken --server=https://api.yourcluster.com:6443 @@ -151,35 +94,13 @@ export HF_TOKEN="your-token" ./llmd-installer.sh ``` - - ### Validation The inference-gateway serves as the HTTP ingress point for all inference requests in our deployment. It’s implemented as a Kubernetes Gateway (`gateway.networking.k8s.io/v1`) using either kgateway or istio as the gatewayClassName, and sits in front of your inference pods to handle path-based routing, load balancing, retries, and metrics. This example validates that the gateway itself is routing your completion requests correctly. -You can execute the [`test-request.sh`](https://github.com/llm-d/llm-d-deployer/blob/main/quickstart/test-request.sh) script to test on the cluster. - -In addition, if you're using an OpenShift Cluster or have created an ingress, you can test the endpoint from an external location. - -```bash -INGRESS_ADDRESS=$(kubectl get ingress -n "$NAMESPACE" | tail -n1 | awk '{print $3}') - -curl -sS -X GET "http://${INGRESS_ADDRESS}/v1/models" \ - -H 'accept: application/json' \ - -H 'Content-Type: application/json' - -MODEL_ID=meta-llama/Llama-3.2-3B-Instruct - -curl -sS -X POST "http://${INGRESS_ADDRESS}/v1/completions" \ - -H 'accept: application/json' \ - -H 'Content-Type: application/json' \ - -d '{ - "model":"'"$MODEL_ID"'", - "prompt": "You are a helpful AI assistant. Please introduce yourself in one sentence.", - }' -``` +You can execute the [`test-request.sh`](https://github.com/llm-d/llm-d-deployer/blob/main/quickstart/test-request.sh) script in the quickstart folder to test on the cluster. > If you receive an error indicating PodSecurity "restricted" violations when running the smoke-test script, you > need to remove the restrictive PodSecurity labels from the namespace. Once these labels are removed, re-run the @@ -194,39 +115,70 @@ kubectl label namespace \ pod-security.kubernetes.io/audit-version- ``` -### Bring Your Own Model +### Customizing your deployment + +The helm charts can be customized by modifying the [values.yaml](https://github.com/llm-d/llm-d-deployer/blob/main/charts/llm-d/values.yaml) file. However, it is recommended to override values in the `values.yaml` by creating a custom yaml file and passing it to the installer using the `--values-file` flag. +Several examples are provided in the [examples](https://github.com/llm-d/llm-d-deployer/blob/main/quickstart/examples) directory. You would invoke the installer with the following command: + +```bash +./llmd-installer.sh --values-file ./examples/base.yaml +``` + +These files are designed to be used as a starting point to customize your deployment. Refer to the [values.yaml](https://github.com/llm-d/llm-d-deployer/blob/main/charts/llm-d/values.yaml) file for all the possible options. + +#### Sample Application and Model Configuration -There is a default sample application that by loads [`meta-llama/Llama-3.2-3B-Instruct`](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) -based on the sample application [values.yaml](https://github.com/llm-d/llm-d-deployer/blob/main/charts/llm-d/values.yaml) file. If you want to swap that model out with -another [vllm compatible model](https://docs.vllm.ai/en/latest/models/supported_models.html). Simply modify the -values file with the model you wish to run. 
+Some of the more common options for changing the sample application model are: -Here is an example snippet of the default model values being replaced with -[`meta-llama/Llama-3.2-1B-Instruct`](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct). +- `sampleApplication.model.modelArtifactURI` - The URI of the model to use. This is the path to the model either to Hugging Face (`hf://meta-llama/Llama-3.2-3B-Instruct`) or a persistent volume claim (PVC) (`pvc://model-pvc/meta-llama/Llama-3.2-1B-Instruct`). Using a PVC can be paired with the `--download-model` flag to download the model to PVC. +- `sampleApplication.model.modelName` - The name of the model to use. This will be used in the naming of deployed resources and also the model ID when using the API. +- `sampleApplication.baseConfigMapRefName` - The name of the preset base configuration to use. This will depend on the features you want to enable. +- `sampleApplication.prefill.replicas` - The number of prefill replicas to deploy. +- `sampleApplication.decode.replicas` - The number of decode replicas to deploy. ```yaml +sampleApplication: model: - # -- Fully qualified pvc URI: pvc:/// - modelArtifactURI: pvc://llama-3.2-1b-instruct-pvc/models/meta-llama/Llama-3.2-1B-Instruct - - # # -- Fully qualified hf URI: pvc:/// - # modelArtifactURI: hf://meta-llama/Llama-3.2-3B-Instruct - - # -- Name of the model - modelName: "Llama-3.2-1B-Instruct" - - # -- Aliases to the Model named vllm will serve with - servedModelNames: [] - - auth: - # -- HF token auth config via k8s secret. Required if using hf:// URI or not using pvc:// URI with `--skip-download-model` in quickstart - hfToken: - # -- If the secret should be created or one already exists - create: true - # -- Name of the secret to create to store your huggingface token - name: llm-d-hf-token - # -- Value of the token. Do not set this but use `envsubst` in conjunction with the helm chart - key: HF_TOKEN + modelArtifactURI: hf://meta-llama/Llama-3.2-1B-Instruct + modelName: "llama3-1B" + baseConfigMapRefName: basic-gpu-with-nixl-and-redis-lookup-preset + prefill: + replicas: 1 + decode: + replicas: 1 +``` + +#### Feature Flags + +`redis.enabled` - Whether to enable Redis needed to enable the KV Cache Aware Scorer +`modelservice.epp.defaultEnvVarsOverride` - The environment variables to override for the model service. For each feature flag, you can set the value to `true` or `false` to enable or disable the feature. 
+ +```yaml +redis: + enabled: true +modelservice: + epp: + defaultEnvVarsOverride: + - name: ENABLE_KVCACHE_AWARE_SCORER + value: "false" + - name: ENABLE_PREFIX_AWARE_SCORER + value: "true" + - name: ENABLE_LOAD_AWARE_SCORER + value: "true" + - name: ENABLE_SESSION_AWARE_SCORER + value: "false" + - name: PD_ENABLED + value: "false" + - name: PD_PROMPT_LEN_THRESHOLD + value: "10" + - name: PREFILL_ENABLE_KVCACHE_AWARE_SCORER + value: "false" + - name: PREFILL_ENABLE_LOAD_AWARE_SCORER + value: "false" + - name: PREFILL_ENABLE_PREFIX_AWARE_SCORER + value: "false" + - name: PREFILL_ENABLE_SESSION_AWARE_SCORER + value: "false" ``` ### Metrics Collection @@ -257,8 +209,8 @@ kubectl port-forward -n llm-d-monitoring --address 0.0.0.0 svc/prometheus-grafan Access the UIs at: -- Prometheus: `````` -- Grafana: ``` (default credentials: admin/admin)``` +- Prometheus: [http://YOUR_IP:9090](#) +- Grafana: [http://YOUR_IP:3000](#) (default credentials: admin/admin) ##### Option 2: Ingress (Optional) @@ -333,7 +285,33 @@ When running in a cloud environment (like EC2), make sure to: ### Troubleshooting The various images can take some time to download depending on your connectivity. Watching events -and logs of the prefill and decode pods is a good place to start. +and logs of the prefill and decode pods is a good place to start. Here are some examples to help +you get started. + +```bash +# View the status of the pods in the default llm-d namespace. Replace "llm-d" if you used a custom namespace on install +kubectl get pods -n llm-d + +# Describe all prefill pods: +kubectl describe pods -l llm-d.ai/role=prefill -n llm-d + +# Fetch logs from each prefill pod: +kubectl logs -l llm-d.ai/role=prefill --all-containers=true -n llm-d --tail=200 + +# Describe all decode pods: +kubectl describe pods -l llm-d.ai/role=decode -n llm-d + +# Fetch logs from each decode pod: +kubectl logs -l llm-d.ai/role=decode --all-containers=true -n llm-d --tail=200 + +# Describe all endpoint-picker pods: +kubectl describe pod -n llm-d -l llm-d.ai/epp + +# Fetch logs from each endpoint-picker pod: +kubectl logs -n llm-d -l llm-d.ai/epp --all-containers=true --tail=200 +``` + +More examples of debugging logs can be found [here](https://github.com/llm-d/llm-d-deployer/blob/main/quickstart/examples/no-features/README.md). ### Uninstall diff --git a/src/components/Install/index.js b/src/components/Install/index.js index d6e6b74..e75f634 100644 --- a/src/components/Install/index.js +++ b/src/components/Install/index.js @@ -46,7 +46,7 @@ export default function Install() { alt="3. " src={require('/docs/assets/counting-03.png').default} > - Explore llm-d! + Explore llm-d! {/* -------------------------------------------------------------------------- */}