diff --git a/blog/2025-05-07_welcome_llmd.md b/blog/2025-05-07_welcome_llmd.md deleted file mode 100644 index 0d281c1..0000000 --- a/blog/2025-05-07_welcome_llmd.md +++ /dev/null @@ -1,14 +0,0 @@ ---- -title: Welcome to a new way to do distributed inference! -description: Opening post for the llm-d news blog -slug: welcome-llm-d-demonstration -authors: [Huey, Dewey, Louie] - -tags: [hello, welcome, llm-d] -image: https://i.imgur.com/mErPwqL.png -hide_table_of_contents: false ---- - -# A new way to do distributed inference: llm-d - -Welcome to the llm-d news blog. This blog is created with [**Docusaurus**](https://docusaurus.io/). \ No newline at end of file diff --git a/blog/2025-05-19.md b/blog/2025-05-19.md deleted file mode 100644 index dca6194..0000000 --- a/blog/2025-05-19.md +++ /dev/null @@ -1,13 +0,0 @@ ---- -title: Here's a Blog Post! -description: Blog post with one author -slug: heres-a-blog-post -authors: [kahuna] - -tags: [blog] -image: https://i.imgur.com/mErPwqL.png -hide_table_of_contents: false ---- - -# Here's a basic blog post -Something big happened today! \ No newline at end of file diff --git a/blog/authors.yml b/blog/authors.yml index f6b7380..9d3e5e2 100644 --- a/blog/authors.yml +++ b/blog/authors.yml @@ -1,22 +1,26 @@ -Huey: - name: Huw - title: The Nephew in Red - -Dewey: - name: Dewydd - title: The one in Blue +redhat: + name: RedHat + url: https://redhat.com + image_url: https://www.redhat.com/rhdc/managed-files/Logo-Red_Hat-Hat_icon-Standard-RGB.svg -Louie: - name: Lewellyn - title: That one in green +robshaw: + name: Robert Shaw + title: Director of Engineering, Red Hat + url: https://github.com/robertgshaw2-redhat + image_url: https://avatars.githubusercontent.com/u/114415538?v=4 + email: robshaw@redhat.com -kahuna: - name: Big kahuna - title: The one in charge +smarterclayton: + name: Clayton Coleman + title: Distinguished Engineer, Google + url: https://github.com/smarterclayton + image_url: https://avatars.githubusercontent.com/u/1163175?v=4 + email: claytoncoleman@google.com -redhat-author: - name: RedHat - title: One of the sponsors - url: https://redhat.com - image_url: https://www.redhat.com/rhdc/managed-files/Logo-Red_Hat-Hat_icon-Standard-RGB.svg \ No newline at end of file +chcost: + name: Carlos Costa + title: Distinguished Engineer, IBM + url: https://github.com/chcost + image_url: https://avatars.githubusercontent.com/u/26551701?v=4 + email: chcost@us.ibm.com \ No newline at end of file diff --git a/blog/tags.yml b/blog/tags.yml index 72a68f0..00208f1 100644 --- a/blog/tags.yml +++ b/blog/tags.yml @@ -19,7 +19,7 @@ llm-d: description: llm-d tag description news: - label: News Releases! 
+ label: News Releases permalink: /news-releases description: Used for "official" news releases in the blog @@ -34,6 +34,12 @@ hola: description: Hola tag description blog: - label: just a blog + label: blog posts permalink: /blog description: everyday blog posts + + +announce: + label: Announcements + permalink: /announce + description: Announcements that aren't news releases diff --git a/docs/architecture/00_architecture.md b/docs/architecture/00_architecture.md index 08f4285..d4522b9 100644 --- a/docs/architecture/00_architecture.md +++ b/docs/architecture/00_architecture.md @@ -3,7 +3,7 @@ sidebar_position: 0 label: llm-d Architecture --- # Overview of llm-d architecture -`llm-d` is a Kubernetes-native distributed inference serving stack - a well-lit path for anyone to serve at scale, with the fastest time-to-value and competitive performance per dollar for most models across most hardware accelerators. +`llm-d` is a Kubernetes-native distributed inference serving stack - a well-lit path for anyone to serve large language models at scale, with the fastest time-to-value and competitive performance per dollar for most models across most hardware accelerators. With `llm-d`, users can operationalize GenAI deployments with a modular solution that leverages the latest distributed inference optimizations like KV-cache aware routing and disaggregated serving, co-designed and integrated with the Kubernetes operational tooling in [Inference Gateway (IGW)](https://github.com/kubernetes-sigs/gateway-api-inference-extension). @@ -14,7 +14,7 @@ Built by leaders in the Kubernetes and vLLM projects, `llm-d` is a community-dri `llm-d` adopts a layered architecture on top of industry-standard open technologies: vLLM, Kubernetes, and Inference Gateway. -![llm-d Architecture](../assets/images/llm-d-arch.svg) +![llm-d Architecture](../assets/images/llm-d-arch-simplified.svg) @@ -31,6 +31,7 @@ Key features of `llm-d` include: - **Variant Autoscaling over Hardware, Workload, and Traffic** (🚧): We plan to implement a traffic- and hardware-aware autoscaler that (a) measures the capacity of each model server instance, (b) derive a load function that takes into account different request shapes and QoS, and (c) assesses recent traffic mix (QPS, QoS, and shapes) Using the recent traffic mix to calculate the optimal mix of instances to handle prefill, decode, and latency-tolerant requests, enabling use of HPA for SLO-level efficiency. 
[See our Northstar design](https://docs.google.com/document/d/1inTneLEZTv3rDEBB9KLOB9K6oMq8c3jkogARJqdt_58/edit?tab=t.0) +For more, see the [project proposal](https://github.com/llm-d/llm-d/blob/dev/docs/proposals/llm-d.md) ## Getting Started diff --git a/docs/architecture/Component Architecture/02_inf-extension.md b/docs/architecture/Component Architecture/02_inf-extension.md deleted file mode 100644 index 385ac90..0000000 --- a/docs/architecture/Component Architecture/02_inf-extension.md +++ /dev/null @@ -1,113 +0,0 @@ ---- -sidebar_position: 2 -sidebar_label: Inference Extension ---- - -[![Go Report Card](https://goreportcard.com/badge/sigs.k8s.io/gateway-api-inference-extension)](https://goreportcard.com/report/sigs.k8s.io/gateway-api-inference-extension) -[![Go Reference](https://pkg.go.dev/badge/sigs.k8s.io/gateway-api-inference-extension.svg)](https://pkg.go.dev/sigs.k8s.io/gateway-api-inference-extension) -[![License](https://img.shields.io/github/license/kubernetes-sigs/gateway-api-inference-extension/LICENSE)] - -# Gateway API Inference Extension - -Gateway API Inference Extension optimizes self-hosting Generative Models on Kubernetes. -This is achieved by leveraging Envoy's [External Processing] (ext-proc) to extend any gateway that supports both ext-proc and [Gateway API] into an **[inference gateway]**. - - -[Inference Gateway]:#concepts-and-definitions - -## Concepts and Definitions - -The following specific terms to this project: - -- **Inference Gateway (IGW)**: A proxy/load-balancer which has been coupled with an - `Endpoint Picker`. It provides optimized routing and load balancing for - serving Kubernetes self-hosted generative Artificial Intelligence (AI) - workloads. It simplifies the deployment, management, and observability of AI - inference workloads. -- **Inference Scheduler**: An extendable component that makes decisions about which endpoint is optimal (best cost / - best performance) for an inference request based on `Metrics and Capabilities` - from [Model Serving](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/003-model-server-protocol/README.md). -- **Metrics and Capabilities**: Data provided by model serving platforms about - performance, availability and capabilities to optimize routing. Includes - things like [Prefix Cache] status or [LoRA Adapters] availability. -- **Endpoint Picker(EPP)**: An implementation of an `Inference Scheduler` with additional Routing, Flow, and Request Control layers to allow for sophisticated routing strategies. Additional info on the architecture of the EPP [here](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/0683-epp-architecture-proposal). - - -The following are key industry terms that are important to understand for -this project: - -- **Model**: A generative AI model that has learned patterns from data and is - used for inference. Models vary in size and architecture, from smaller - domain-specific models to massive multi-billion parameter neural networks that - are optimized for diverse language tasks. -- **Inference**: The process of running a generative AI model, such as a large - language model, diffusion model etc, to generate text, embeddings, or other - outputs from input data. -- **Model server**: A service (in our case, containerized) responsible for - receiving inference requests and returning predictions from a model. 
-- **Accelerator**: specialized hardware, such as Graphics Processing Units - (GPUs) that can be attached to Kubernetes nodes to speed up computations, - particularly for training and inference tasks. - - -For deeper insights and more advanced concepts, refer to our [proposals](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals). - -[Inference]:https://www.digitalocean.com/community/tutorials/llm-inference-optimization -[Gateway API]:https://github.com/kubernetes-sigs/gateway-api -[Prefix Cache]:https://docs.vllm.ai/en/stable/design/v1/prefix_caching.html -[LoRA Adapters]:https://docs.vllm.ai/en/stable/features/lora.html -[External Processing]:https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/ext_proc_filter - -## Technical Overview - -This extension upgrades an [ext-proc](https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/ext_proc_filter) capable proxy or gateway - such as Envoy Gateway, kGateway, or the GKE Gateway - to become an **[inference gateway]** - supporting inference platform teams self-hosting Generative Models (with a current focus on large language models) on Kubernetes. This integration makes it easy to expose and control access to your local [OpenAI-compatible chat completion endpoints](https://platform.openai.com/docs/api-reference/chat) to other workloads on or off cluster, or to integrate your self-hosted models alongside model-as-a-service providers in a higher level **AI Gateway** like LiteLLM, Solo AI Gateway, or Apigee. - -The Inference Gateway: - -* Improves the tail latency and throughput of LLM completion requests against Kubernetes-hosted model servers using an extensible request scheduling alogrithm that is kv-cache and request cost aware, avoiding evictions or queueing as load increases -* Provides [Kubernetes-native declarative APIs](https://gateway-api-inference-extension.sigs.k8s.io/concepts/api-overview/) to route client model names to use-case specific LoRA adapters and control incremental rollout of new adapter versions, A/B traffic splitting, and safe blue-green base model and model server upgrades -* Adds end to end observability around service objective attainment -* Ensures operational guardrails between different client model names, allowing a platform team to safely serve many different GenAI workloads on the same pool of shared foundation model servers for higher utilization and fewer required accelerators - -![Architecture Diagram](../../assets/images/inference-gateway-architecture.svg) - -It currently requires a version of vLLM that supports the necessary metrics to predict traffic load which is defined in the [model server protocol](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/003-model-server-protocol). Support for Google's Jetstream, nVidia Triton, text-generation-inference, and SGLang is coming soon. - -## Status - -This project is [alpha (0.3 release)](https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/tag/v0.3.0). It should not be used in production yet. - -## Getting Started - -Follow our [Getting Started Guide](https://gateway-api-inference-extension.sigs.k8s.io/guides/) to get the inference-extension up and running on your cluster! - -See our website at https://gateway-api-inference-extension.sigs.k8s.io/ for detailed API documentation on leveraging our Kubernetes-native declarative APIs - -## Roadmap - -As Inference Gateway builds towards a GA release. 
We will continue to expand our capabilities, namely: -1. Prefix-cache aware load balancing with interfaces for remote caches -1. Recommended LoRA adapter pipeline for automated rollout -1. Fairness and priority between workloads within the same criticality band -1. HPA support for autoscaling on aggregate metrics derived from the load balancer -1. Support for large multi-modal inputs and outputs -1. Support for other GenAI model types (diffusion and other non-completion protocols) -1. Heterogeneous accelerators - serve workloads on multiple types of accelerator using latency and request cost-aware load balancing -1. Disaggregated serving support with independently scaling pools - - -## End-to-End Tests - -Follow this link to the [e2e README](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/test/e2e/epp/README.md) to learn more about running the inference-extension end-to-end test suite on your cluster. - -## Contributing - -Our community meeting is weekly at Thursday 10AM PDT ([Zoom](https://zoom.us/j/9955436256?pwd=Z2FQWU1jeDZkVC9RRTN4TlZyZTBHZz09), [Meeting Notes](https://www.google.com/url?q=https://docs.google.com/document/d/1frfPE5L1sI3737rdQV04IcDGeOcGJj2ItjMg6z2SRH0/edit?usp%3Dsharing&sa=D&source=calendar&usd=2&usg=AOvVaw1pUVy7UN_2PMj8qJJcFm1U)). - -We currently utilize the [#wg-serving](https://kubernetes.slack.com/?redir=%2Fmessages%2Fwg-serving) slack channel for communications. - -Contributions are readily welcomed, follow the [dev guide](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/docs/dev.md) to start contributing! - -### Code of conduct - -Participation in the Kubernetes community is governed by the [Kubernetes Code of Conduct](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/code-of-conduct.md). diff --git a/docs/architecture/Component Architecture/02_inf-simulator.md b/docs/architecture/Component Architecture/02_inf-simulator.md new file mode 100644 index 0000000..18c56f6 --- /dev/null +++ b/docs/architecture/Component Architecture/02_inf-simulator.md @@ -0,0 +1,120 @@ +--- +sidebar_position: 2 +sidebar_label: Inference Simulator +--- +# vLLM Simulator +To help with development and testing we have developed a light weight vLLM simulator. It does not truly +run inference, but it does emulate responses to the HTTP REST endpoints of vLLM. +Currently it supports partial OpenAI-compatible API: +- /v1/chat/completions +- /v1/completions +- /v1/models + +In addition, it supports a subset of vLLM's Prometheus metrics. These metrics are exposed via the /metrics HTTP REST endpoint. Currently supported are the following metrics: +- vllm:lora_requests_info + +The simulated inferense has no connection with the model and LoRA adapters specified in the command line parameters. The /v1/models endpoint returns simulated results based on those same command line parameters. + +The simulator supports two modes of operation: +- `echo` mode: the response contains the same text that was received in the request. For `/v1/chat/completions` the last message for the role=`user` is used. +- `random` mode: the response is randomly chosen from a set of pre-defined sentences. + +Timing of the response is defined by two parameters: `time-to-first-token` and `inter-token-latency`. + +For a request with `stream=true`: `time-to-first-token` defines the delay before the first token is returned, `inter-token-latency` defines the delay between subsequent tokens in the stream. 
+ +For a request with `stream=false`: the response is returned after a delay of `time-to-first-token + (inter-token-latency * (number-of-output-tokens - 1))` + +It can be run standalone or in a Pod for testing, for example under tools such as KinD. + +## Limitations +API responses contain a subset of the fields provided by the OpenAI API.
+The following fields are supported in requests and responses: + +- `/v1/chat/completions` + - **request** + - stream + - model + - messages + - role + - content + - **response** + - id + - created + - model + - choices + - index + - finish_reason + - message +- `/v1/completions` + - **request** + - stream + - model + - prompt + - max_tokens (for future usage) + - **response** + - id + - created + - model + - choices + - text +- `/v1/models` + - **response** + - object (list) + - data + - id + - object (model) + - created + - owned_by + - root + - parent
+
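As an illustrative sketch (not part of the upstream simulator docs), the following requests show what the supported endpoints look like against a locally running simulator, assuming it was started as in the standalone example below with `--model my_model --port 8000`:

```bash
# Non-streaming completion request. In `random` mode the returned text is one of the
# pre-defined sentences; in `echo` mode it repeats the prompt.
curl -s http://localhost:8000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "my_model",
        "prompt": "Hello from the simulator",
        "stream": false
      }'

# List the simulated models; the result reflects the --model and --lora command line parameters.
curl -s http://localhost:8000/v1/models
```

The same completion request with `"stream": true` returns tokens spaced according to the `time-to-first-token` and `inter-token-latency` settings.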
+For more details see the [vLLM documentation](https://docs.vllm.ai/en/stable/getting_started/quickstart.html#openai-completions-api-with-vllm) + +## Command line parameters +- `port`: the port the simulator listents on, mandatory +- `model`: the currently 'loaded' model, mandatory +- `lora`: a list of available LoRA adapters, separated by commas, optional, by default empty +- `mode`: the simulator mode, optional, by default `random` + - `echo`: returns the same text that was sent in the request + - `random`: returns a sentence chosen at random from a set of pre-defined sentences +- `time-to-first-token`: the time to the first token (in milliseconds), optional, by default zero +- `inter-token-latency`: the time to 'generate' each additional token (in milliseconds), optional, by default zero +- `max-loras`: maximum number of LoRAs in a single batch, optional, default is one +- `max-cpu-loras`: maximum number of LoRAs to store in CPU memory, optional, must be >= than max_loras, default is max_loras +- `max-running-requests`: maximum number of inference requests that could be processed at the same time + + +## Working with docker image + +### Building +To build a Docker image of the vLLM Simulator, run: +```bash +make build-llm-d-inference-sim-image +``` + +### Running +To run the vLLM Simulator image under Docker, run: +```bash +docker run --rm --publish 8000:8000 ai-aware-router/llm-d-inference-sim:0.0.1 /ai-aware-router/llm-d-inference-sim --port 8000 --model "Qwen/Qwen2.5-1.5B-Instruct" --lora "tweet-summary-0,tweet-summary-1" +``` +**Note:** The above command exposes the simulator on port 8000, and serves the Qwen/Qwen2.5-1.5B-Instruct model. + +## Standalone testing + +### Building +To build the vLLM simulator, run: +```bash +make build-llm-d-inference-sim +``` + +### Running +To run the router in a standalone test environment, run: +```bash +./bin/llm-d-inference-sim --model my_model --port 8000 +``` + + diff --git a/docs/assets/images/carlos costa.jpeg b/docs/assets/images/carlos costa.jpeg new file mode 100644 index 0000000..15d1439 Binary files /dev/null and b/docs/assets/images/carlos costa.jpeg differ diff --git a/docs/assets/images/clayton coleman.jpeg b/docs/assets/images/clayton coleman.jpeg new file mode 100644 index 0000000..4192aca Binary files /dev/null and b/docs/assets/images/clayton coleman.jpeg differ diff --git a/docs/assets/images/llm-d-arch-simplified.svg b/docs/assets/images/llm-d-arch-simplified.svg new file mode 100644 index 0000000..825cb91 --- /dev/null +++ b/docs/assets/images/llm-d-arch-simplified.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/docs/assets/images/robert shaw headshot.jpeg b/docs/assets/images/robert shaw headshot.jpeg new file mode 100644 index 0000000..e485fb0 Binary files /dev/null and b/docs/assets/images/robert shaw headshot.jpeg differ diff --git a/docs/community/contact_us.md b/docs/community/contact_us.md index 8be3107..48c532f 100644 --- a/docs/community/contact_us.md +++ b/docs/community/contact_us.md @@ -9,6 +9,8 @@ There are several ways you can join the community effort to develop and enhance - Via the [**Github pages for llm-d:** https://github.com/llm-d](https://github.com/llm-d)** - Via our [**Slack Workspace:** https://llm-d.slack.com](https://llm-d.slack.com) - Via [**Reddit**: Reddit:https://www.reddit.com/r/llm_d/](Reddit:https://www.reddit.com/r/llm_d/) +- We host a weekly standup for contributors on Wednesdays at 1230pm ET. 
Please join: [Meeting Details](https://calendar.google.com/calendar/event?action=TEMPLATE&tmeid=NG9yZ3AyYTN0N3VlaW01b21xbWV2c21uNjRfMjAyNTA1MjhUMTYzMDAwWiByb2JzaGF3QHJlZGhhdC5jb20&tmsrc=robshaw%40redhat.com&scp=ALL) +- We use Google Groups to share architecture diagrams and other content. Please join: [Google Group](https://groups.google.com/g/llm-d-contributors) You can also find us on diff --git a/docs/guide/Installation/prerequisites.md b/docs/guide/Installation/prerequisites.md index 3046e6b..051991a 100644 --- a/docs/guide/Installation/prerequisites.md +++ b/docs/guide/Installation/prerequisites.md @@ -1,30 +1,39 @@ --- sidebar_position: 1 +sidebar_label: Prerequisites --- -# Prerequisites for running llm-d +# Prerequisites for running the llm-d QuickStart -**Note that these are the prerequisites for running the QuickStart Demo. +### Target Platforms -## Compute Resources +Since the llm-d-deployer is based on helm charts, llm-d can be deployed on a variety of Kubernetes platforms. As more platforms are supported, the installer will be updated to support them. - +Documentation for example cluster setups are provided in the [infra](https://github.com/llm-d/llm-d-deployer/tree/main/quickstart/infra) directory of the llm-d-deployer repository. -### Hardware Profiles +- [OpenShift on AWS](https://github.com/llm-d/llm-d-deployer/tree/main/quickstart/infra/openshift-aws.md) -The QuickStart has been tested on: +#### Minikube -- Minikube on AWS - - single g6e.12xlarge -- Red Hat OpenShift on AWS - - 6 x m5.4xlarge - - 2 x g6e.2xlarge - - OpenShift 4.17.21 - - NVIDIA GPU Operator 24.9.2 - - OpenShift Data Foundation 4.17.6 +This can be run on a minimum ec2 node type [g6e.12xlarge](https://aws.amazon.com/ec2/instance-types/g6e/) (4xL40S 48GB but only 2 are used by default) to infer the model meta-llama/Llama-3.2-3B-Instruct that will get spun up. +> ⚠️ If your cluster has no available GPUs, the **prefill** and **decode** pods will remain in **Pending** state. -### Target Platforms +Verify you have properly installed the container toolkit with the runtime of your choice. + +```bash +# Podman +podman run --rm --security-opt=label=disable --device=nvidia.com/gpu=all ubuntu nvidia-smi +# Docker +sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi +``` + +#### OpenShift + +- OpenShift - This quickstart was tested on OpenShift 4.17. Older versions may work but have not been tested. +- NVIDIA GPU Operator and NFD Operator - The installation instructions can be found [here](https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/steps-overview.html). +- NO Service Mesh or Istio installation as Istio CRDs will conflict with the gateway +- Cluster administrator privileges are required to install the llm-d cluster scoped resources #### Kubernetes @@ -51,11 +60,27 @@ sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi ## Software prerequisites -- Client Configuration +## Client Configuration + +### Get the code + +Clone the llm-d-deployer repository. + +```bash +git clone https://github.com/llm-d/llm-d-deployer.git +``` + +Navigate to the quickstart directory + +```bash +cd llm-d-deployer/quickstart +``` + ### Required tools Following prerequisite are required for the installer to work. 
-- [yq – installation & releases](https://github.com/mikefarah/yq#install) +- [yq (mikefarah) – installation](https://github.com/mikefarah/yq?tab=readme-ov-file#install) - [jq – download & install guide](https://stedolan.github.io/jq/download/) - [git – installation guide](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git) - [Helm – quick-start install](https://helm.sh/docs/intro/install/) @@ -71,12 +96,13 @@ You can use the installer script that installs all the required dependencies. C ### Required credentials and configuration -- [llm-d-deployer GitHub repo – clone here](https://github.com/neuralmagic/llm-d-deployer.git) -- [Quay.io Registry – sign-up & credentials](https://quay.io/) +- [llm-d-deployer GitHub repo – clone here](https://github.com/llm-d/llm-d-deployer.git) +- [ghcr.io Registry – credentials](https://github.com/settings/tokens) You must have a GitHub account and a "classic" personal access token with `read:packages` access to the llm-d-deployer repository. - [Red Hat Registry – terms & access](https://access.redhat.com/registry/) - [HuggingFace HF_TOKEN](https://huggingface.co/docs/hub/en/security-tokens) with download access for the model you want to use. By default the sample application will use [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct). - > ⚠️ You may need to visit Hugging Face [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) and - > accept the usage terms to pull this with your HF token if you have not already done so. + +> ⚠️ Your Hugging Face account must have access to the model you want to use. You may need to visit Hugging Face [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) and +> accept the usage terms if you have not already done so. Registry Authentication: The installer looks for an auth file in: @@ -91,15 +117,11 @@ If not found, you can create one with the following commands: Create with Docker: ```bash -docker --config ~/.config/containers/ login quay.io -docker --config ~/.config/containers/ login registry.redhat.io +docker --config ~/.config/containers/ login ghcr.io ``` Create with Podman: ```bash -podman login quay.io --authfile ~/.config/containers/auth.json -podman login registry.redhat.io --authfile ~/.config/containers/auth.json -``` - - +podman login ghcr.io --authfile ~/.config/containers/auth.json +``` \ No newline at end of file diff --git a/docs/guide/Installation/quickstart.md b/docs/guide/Installation/quickstart.md index 1cc9990..333b0b4 100644 --- a/docs/guide/Installation/quickstart.md +++ b/docs/guide/Installation/quickstart.md @@ -4,91 +4,32 @@ sidebar_label: Quick Start installer --- # Trying llm-d via the Quick Start installer +Getting Started with llm-d on Kubernetes. For specific instructions on how to install llm-d on minikube, see the [README-minikube.md](https://github.com/llm-d/llm-d-deployer/blob/main/quickstart/README-minikube.md) instructions. -For more information on llm-d, see the llm-d git repository [here](https://github.com/llm-d/llm-d) and website [here](https://llmd.io). +For more information on llm-d in general, see the llm-d git repository [here](https://github.com/llm-d/llm-d) and website [here](https://llm-d.ai). ## Overview This guide will walk you through the steps to install and deploy llm-d on a Kubernetes cluster, using an opinionated flow in order to get up and running as quickly as possible. 
-## Client Configuration - -### Required tools - -Following prerequisite are required for the installer to work. - -- [yq (mikefarah) – installation](https://github.com/mikefarah/yq?tab=readme-ov-file#install) -- [jq – download & install guide](https://stedolan.github.io/jq/download/) -- [git – installation guide](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git) -- [Helm – quick-start install](https://helm.sh/docs/intro/install/) -- [Kustomize – official install docs](https://kubectl.docs.kubernetes.io/installation/kustomize/) -- [kubectl – install & setup](https://kubernetes.io/docs/tasks/tools/install-kubectl/) - -You can use the installer script that installs all the required dependencies. Currently only Linux is supported. - -```bash -# Currently Linux only -./install-deps.sh -``` - -### Required credentials and configuration - -- [llm-d-deployer GitHub repo – clone here](https://github.com/llm-d/llm-d-deployer.git) -- [ghcr.io Registry – credentials](https://github.com/settings/tokens) You must have a GitHub account and a "classic" personal access token with `read:packages` access to the llm-d-deployer repository. -- [Red Hat Registry – terms & access](https://access.redhat.com/registry/) -- [HuggingFace HF_TOKEN](https://huggingface.co/docs/hub/en/security-tokens) with download access for the model you want to use. By default the sample application will use [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct). - -> ⚠️ Your Hugging Face account must have access to the model you want to use. You may need to visit Hugging Face [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) and -> accept the usage terms if you have not already done so. - -Registry Authentication: The installer looks for an auth file in: - -```bash -~/.config/containers/auth.json -# or -~/.config/containers/config.json -``` - -If not found, you can create one with the following commands: - -Create with Docker: - -```bash -docker --config ~/.config/containers/ login ghcr.io -``` - -Create with Podman: - -```bash -podman login ghcr.io --authfile ~/.config/containers/auth.json -``` - -### Target Platforms - -#### Kubernetes +For more information on llm-d, see the llm-d git repository [here](https://github.com/llm-d/llm-d) and website [here](https://llmd.io). -This can be run on a minimum ec2 node type [g6e.12xlarge](https://aws.amazon.com/ec2/instance-types/g6e/) (4xL40S 48GB but only 2 are used by default) to infer the model meta-llama/Llama-3.2-3B-Instruct that will get spun up. +## Prerequisites -> ⚠️ If your cluster has no available GPUs, the **prefill** and **decode** pods will remain in **Pending** state. +First ensure you have all the tools and resources as described in [Prerequisites](./prerequisites.md) -Verify you have properly installed the container toolkit with the runtime of your choice. + -```bash -# Podman -podman run --rm --security-opt=label=disable --device=nvidia.com/gpu=all ubuntu nvidia-smi -# Docker -sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi -``` +## llm-d Installation -#### OpenShift + - Change to the directory holding your clone of the llm-d-deployer code + - Navigate to the quickstart directory, e.g. -- OpenShift - This quickstart was tested on OpenShift 4.18. Older versions may work but have not been tested. 
-- NVIDIA GPU Operator and NFD Operator - The installation instructions can be found [here](https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/steps-overview.html). -- NO Service Mesh or Istio installation as Istio CRDs will conflict with the gateway + ```bash + cd llm-d-deployer/quickstart + ``` - - -## llm-d Installation +Only a single installation of llm-d on a cluster is currently supported. In the future, multiple model services will be supported. Until then, [uninstall llm-d](#uninstall) before reinstalling. The llm-d-deployer contains all the helm charts necessary to deploy llm-d. To facilitate the installation of the helm charts, the `llmd-installer.sh` script is provided. This script will populate the necessary manifests in the `manifests` directory. After this, it will apply all the manifests in order to bring up the cluster. @@ -104,11 +45,11 @@ The llmd-installer.sh script aims to simplify the installation of llm-d using th It also supports uninstalling the llm-d infrastructure and the sample app. -Before proceeding with the installation, ensure you have completed the prerequisites and are able to issue kubectl commands to your cluster by configuring your `~/.kube/config` file or by using the `oc login` command. +Before proceeding with the installation, ensure you have completed the prerequisites and are able to issue `kubectl` or `oc` commands to your cluster by configuring your `~/.kube/config` file or by using the `oc login` command. ### Usage -The installer needs to be run from the `llm-d-deployer/quickstart` directory. +The installer needs to be run from the `llm-d-deployer/quickstart` directory as a cluster admin with CLI access to the cluster. ```bash ./llmd-installer.sh [OPTIONS] @@ -116,18 +57,20 @@ The installer needs to be run from the `llm-d-deployer/quickstart` directory. 
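Before invoking the installer, it can help to confirm that your CLI session really points at the intended cluster and has cluster-admin level permissions. The following generic checks are a sketch, not an official llm-d requirement list:

```bash
# Confirm which context kubectl will use and that the cluster is reachable
kubectl config current-context
kubectl cluster-info

# The installer creates cluster-scoped resources, so both of these should
# answer "yes" for the account you are logged in with
kubectl auth can-i create namespaces
kubectl auth can-i create customresourcedefinitions
```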
### Flags -| Flag | Description | Example | -|--------------------------------|---------------------------------------------------------------------------------------------------------|------------------------------------------------------------------| -| `--hf-token TOKEN` | HuggingFace API token (or set `HF_TOKEN` env var) | `./llmd-installer.sh --hf-token "abc123"` | -| `--auth-file PATH` | Path to your registry auth file ig not in one of the two listed files in the auth section of the readme | `./llmd-installer.sh --auth-file ~/.config/containers/auth.json` | -| `--storage-size SIZE` | Size of storage volume (default: 7Gi) | `./llmd-installer.sh --storage-size 15Gi` | -| `--skip-download-model` | Skip downloading the model to PVC if modelArtifactURI is pvc based | `./llmd-installer.sh --skip-download-model` | -| `--storage-class CLASS` | Storage class to use (default: efs-sc) | `./llmd-installer.sh --storage-class ocs-storagecluster-cephfs` | -| `--namespace NAME` | Kubernetes namespace to use (default: `llm-d`) | `./llmd-installer.sh --namespace foo` | -| `--values-file NAME` | Absolute path to a Helm values.yaml file (default: llm-d-deployer/charts/llm-d/values.yaml) | `./llmd-installer.sh --values-file /path/to/values.yaml` | -| `--uninstall` | Uninstall llm-d and cleanup resources | `./llmd-installer.sh --uninstall` | -| `--disable-metrics-collection` | Disable metrics collection (Prometheus will not be installed) | `./llmd-installer.sh --disable-metrics-collection` | -| `-h`, `--help` | Show help and exit | `./llmd-installer.sh --help` | +| Flag | Description | Example | +|--------------------------------------|---------------------------------------------------------------|------------------------------------------------------------------| +| `-a`, `--auth-file PATH` | Path to containers auth.json | `./llmd-installer.sh --auth-file ~/.config/containers/auth.json` | +| `-z`, `--storage-size SIZE` | Size of storage volume | `./llmd-installer.sh --storage-size 15Gi` | +| `-c`, `--storage-class CLASS` | Storage class to use (default: efs-sc) | `./llmd-installer.sh --storage-class ocs-storagecluster-cephfs` | +| `-n`, `--namespace NAME` | K8s namespace (default: llm-d) | `./llmd-installer.sh --namespace foo` | +| `-f`, `--values-file PATH` | Path to Helm values.yaml file (default: values.yaml) | `./llmd-installer.sh --values-file /path/to/values.yaml` | +| `-u`, `--uninstall` | Uninstall the llm-d components from the current cluster | `./llmd-installer.sh --uninstall` | +| `-d`, `--debug` | Add debug mode to the helm install | `./llmd-installer.sh --debug` | +| `-i`, `--skip-infra` | Skip the infrastructure components of the installation | `./llmd-installer.sh --skip-infra` | +| `-t`, `--download-timeout` | Timeout for model download job | `./llmd-installer.sh --download-timeout` | +| `-D`, `--download-model` | Download the model to PVC from Hugging Face | `./llmd-installer.sh --download-model` | +| `-m`, `--disable-metrics-collection` | Disable metrics collection (Prometheus will not be installed) | `./llmd-installer.sh --disable-metrics-collection` | +| `-h`, `--help` | Show this help and exit | `./llmd-installer.sh --help` | ## Examples @@ -140,7 +83,7 @@ export HF_TOKEN="your-token" ### Install on OpenShift -Before running the installer, ensure you have logged into the cluster. For example: +Before running the installer, ensure you have logged into the cluster as a cluster administrator. 
For example: ```bash oc login --token=sha256~yourtoken --server=https://api.yourcluster.com:6443 @@ -151,35 +94,13 @@ export HF_TOKEN="your-token" ./llmd-installer.sh ``` - - ### Validation The inference-gateway serves as the HTTP ingress point for all inference requests in our deployment. It’s implemented as a Kubernetes Gateway (`gateway.networking.k8s.io/v1`) using either kgateway or istio as the gatewayClassName, and sits in front of your inference pods to handle path-based routing, load balancing, retries, and metrics. This example validates that the gateway itself is routing your completion requests correctly. -You can execute the [`test-request.sh`](https://github.com/llm-d/llm-d-deployer/blob/main/quickstart/test-request.sh) script to test on the cluster. - -In addition, if you're using an OpenShift Cluster or have created an ingress, you can test the endpoint from an external location. - -```bash -INGRESS_ADDRESS=$(kubectl get ingress -n "$NAMESPACE" | tail -n1 | awk '{print $3}') - -curl -sS -X GET "http://${INGRESS_ADDRESS}/v1/models" \ - -H 'accept: application/json' \ - -H 'Content-Type: application/json' - -MODEL_ID=meta-llama/Llama-3.2-3B-Instruct - -curl -sS -X POST "http://${INGRESS_ADDRESS}/v1/completions" \ - -H 'accept: application/json' \ - -H 'Content-Type: application/json' \ - -d '{ - "model":"'"$MODEL_ID"'", - "prompt": "You are a helpful AI assistant. Please introduce yourself in one sentence.", - }' -``` +You can execute the [`test-request.sh`](https://github.com/llm-d/llm-d-deployer/blob/main/quickstart/test-request.sh) script in the quickstart folder to test on the cluster. > If you receive an error indicating PodSecurity "restricted" violations when running the smoke-test script, you > need to remove the restrictive PodSecurity labels from the namespace. Once these labels are removed, re-run the @@ -194,39 +115,70 @@ kubectl label namespace \ pod-security.kubernetes.io/audit-version- ``` -### Bring Your Own Model +### Customizing your deployment + +The helm charts can be customized by modifying the [values.yaml](https://github.com/llm-d/llm-d-deployer/blob/main/charts/llm-d/values.yaml) file. However, it is recommended to override values in the `values.yaml` by creating a custom yaml file and passing it to the installer using the `--values-file` flag. +Several examples are provided in the [examples](https://github.com/llm-d/llm-d-deployer/blob/main/quickstart/examples) directory. You would invoke the installer with the following command: + +```bash +./llmd-installer.sh --values-file ./examples/base.yaml +``` + +These files are designed to be used as a starting point to customize your deployment. Refer to the [values.yaml](https://github.com/llm-d/llm-d-deployer/blob/main/charts/llm-d/values.yaml) file for all the possible options. + +#### Sample Application and Model Configuration -There is a default sample application that by loads [`meta-llama/Llama-3.2-3B-Instruct`](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) -based on the sample application [values.yaml](https://github.com/llm-d/llm-d-deployer/blob/main/charts/llm-d/values.yaml) file. If you want to swap that model out with -another [vllm compatible model](https://docs.vllm.ai/en/latest/models/supported_models.html). Simply modify the -values file with the model you wish to run. 
+Some of the more common options for changing the sample application model are: -Here is an example snippet of the default model values being replaced with -[`meta-llama/Llama-3.2-1B-Instruct`](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct). +- `sampleApplication.model.modelArtifactURI` - The URI of the model to use. This is the path to the model either to Hugging Face (`hf://meta-llama/Llama-3.2-3B-Instruct`) or a persistent volume claim (PVC) (`pvc://model-pvc/meta-llama/Llama-3.2-1B-Instruct`). Using a PVC can be paired with the `--download-model` flag to download the model to PVC. +- `sampleApplication.model.modelName` - The name of the model to use. This will be used in the naming of deployed resources and also the model ID when using the API. +- `sampleApplication.baseConfigMapRefName` - The name of the preset base configuration to use. This will depend on the features you want to enable. +- `sampleApplication.prefill.replicas` - The number of prefill replicas to deploy. +- `sampleApplication.decode.replicas` - The number of decode replicas to deploy. ```yaml +sampleApplication: model: - # -- Fully qualified pvc URI: pvc:/// - modelArtifactURI: pvc://llama-3.2-1b-instruct-pvc/models/meta-llama/Llama-3.2-1B-Instruct - - # # -- Fully qualified hf URI: pvc:/// - # modelArtifactURI: hf://meta-llama/Llama-3.2-3B-Instruct - - # -- Name of the model - modelName: "Llama-3.2-1B-Instruct" - - # -- Aliases to the Model named vllm will serve with - servedModelNames: [] - - auth: - # -- HF token auth config via k8s secret. Required if using hf:// URI or not using pvc:// URI with `--skip-download-model` in quickstart - hfToken: - # -- If the secret should be created or one already exists - create: true - # -- Name of the secret to create to store your huggingface token - name: llm-d-hf-token - # -- Value of the token. Do not set this but use `envsubst` in conjunction with the helm chart - key: HF_TOKEN + modelArtifactURI: hf://meta-llama/Llama-3.2-1B-Instruct + modelName: "llama3-1B" + baseConfigMapRefName: basic-gpu-with-nixl-and-redis-lookup-preset + prefill: + replicas: 1 + decode: + replicas: 1 +``` + +#### Feature Flags + +`redis.enabled` - Whether to enable Redis needed to enable the KV Cache Aware Scorer +`modelservice.epp.defaultEnvVarsOverride` - The environment variables to override for the model service. For each feature flag, you can set the value to `true` or `false` to enable or disable the feature. 
+ +```yaml +redis: + enabled: true +modelservice: + epp: + defaultEnvVarsOverride: + - name: ENABLE_KVCACHE_AWARE_SCORER + value: "false" + - name: ENABLE_PREFIX_AWARE_SCORER + value: "true" + - name: ENABLE_LOAD_AWARE_SCORER + value: "true" + - name: ENABLE_SESSION_AWARE_SCORER + value: "false" + - name: PD_ENABLED + value: "false" + - name: PD_PROMPT_LEN_THRESHOLD + value: "10" + - name: PREFILL_ENABLE_KVCACHE_AWARE_SCORER + value: "false" + - name: PREFILL_ENABLE_LOAD_AWARE_SCORER + value: "false" + - name: PREFILL_ENABLE_PREFIX_AWARE_SCORER + value: "false" + - name: PREFILL_ENABLE_SESSION_AWARE_SCORER + value: "false" ``` ### Metrics Collection @@ -257,8 +209,8 @@ kubectl port-forward -n llm-d-monitoring --address 0.0.0.0 svc/prometheus-grafan Access the UIs at: -- Prometheus: `````` -- Grafana: ``` (default credentials: admin/admin)``` +- Prometheus: [http://YOUR_IP:9090](#) +- Grafana: [http://YOUR_IP:3000](#) (default credentials: admin/admin) ##### Option 2: Ingress (Optional) @@ -333,7 +285,33 @@ When running in a cloud environment (like EC2), make sure to: ### Troubleshooting The various images can take some time to download depending on your connectivity. Watching events -and logs of the prefill and decode pods is a good place to start. +and logs of the prefill and decode pods is a good place to start. Here are some examples to help +you get started. + +```bash +# View the status of the pods in the default llm-d namespace. Replace "llm-d" if you used a custom namespace on install +kubectl get pods -n llm-d + +# Describe all prefill pods: +kubectl describe pods -l llm-d.ai/role=prefill -n llm-d + +# Fetch logs from each prefill pod: +kubectl logs -l llm-d.ai/role=prefill --all-containers=true -n llm-d --tail=200 + +# Describe all decode pods: +kubectl describe pods -l llm-d.ai/role=decode -n llm-d + +# Fetch logs from each decode pod: +kubectl logs -l llm-d.ai/role=decode --all-containers=true -n llm-d --tail=200 + +# Describe all endpoint-picker pods: +kubectl describe pod -n llm-d -l llm-d.ai/epp + +# Fetch logs from each endpoint-picker pod: +kubectl logs -n llm-d -l llm-d.ai/epp --all-containers=true --tail=200 +``` + +More examples of debugging logs can be found [here](https://github.com/llm-d/llm-d-deployer/blob/main/quickstart/examples/no-features/README.md). ### Uninstall diff --git a/src/components/Install/index.js b/src/components/Install/index.js index d6e6b74..e75f634 100644 --- a/src/components/Install/index.js +++ b/src/components/Install/index.js @@ -46,7 +46,7 @@ export default function Install() { alt="3. " src={require('/docs/assets/counting-03.png').default} > - Explore llm-d! + Explore llm-d! {/* -------------------------------------------------------------------------- */}