diff --git a/ai/api-reference/llm.mdx b/ai/api-reference/llm.mdx
index c3a7380c..579e6b73 100644
--- a/ai/api-reference/llm.mdx
+++ b/ai/api-reference/llm.mdx
@@ -1,24 +1,11 @@
---
openapi: post /llm
---
-
-We are currently deploying the Large Language Model (LLM) pipeline to our gateway infrastructure.
-This warning will be removed once all listed gateways have successfully transitioned to serving the LLM pipeline, ensuring a seamless and enhanced user experience.
-
-
-The LLM pipeline supports streaming response by setting `stream=true` in the request. The response is then streamed with Server Sent Events (SSE)
-in chunks as the tokens are generated.
-
-Each streaming response chunk will have the following format:
-
-`data: {"chunk": "word "}`
-
-The final chunk of the response will be indicated by the following format:
-
-`data: {"chunk": "[DONE]", "tokens_used": 256, "done": true}`
-The Response type below is for non-streaming responses that will return all of the response in one
+
+ The LLM pipeline is OpenAI API-compatible but does **not** implement all features of the OpenAI API.
+
The default Gateway used in this guide is the public
[Livepeer.cloud](https://www.livepeer.cloud/) Gateway. It is free to use but
@@ -28,3 +15,48 @@ The Response type below is for non-streaming responses that will return all of t
Gateway node or partner with one via the `ai-video` channel on
[Discord](https://discord.gg/livepeer).
+
+### Streaming Responses
+
+
+ Ensure your client supports SSE and processes each `data:` line as it arrives.
+
+
+By default, the `/llm` endpoint returns a single JSON response in the OpenAI
+[chat/completions](https://platform.openai.com/docs/api-reference/chat/object)
+format, as shown in the sidebar.
+
+To receive responses token-by-token, set `"stream": true` in the request body. The server will then use **Server-Sent Events (SSE)** to stream output in real time.
+
+
+Each streamed chunk will look like:
+
+```json
+data: {
+ "choices": [
+ {
+ "delta": {
+ "content": "...token...",
+ "role": "assistant"
+ },
+ "finish_reason": null
+ }
+ ]
+}
+```
+
+The final chunk will have empty content and `"finish_reason": "stop"`:
+
+```json
+data: {
+ "choices": [
+ {
+ "delta": {
+ "content": "",
+ "role": "assistant"
+ },
+ "finish_reason": "stop"
+ }
+ ]
+}
+```
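+
+As a minimal sketch, the stream can be requested and inspected directly with
+`curl`; the gateway address and token are placeholders for your own setup, the
+model name follows the example used elsewhere in these docs, and `-N` simply
+disables curl's output buffering so each SSE `data:` line is printed as soon as
+it arrives:
+
+```bash
+# Request a streamed completion and print SSE events as they arrive.
+curl -N -X POST "https://<GATEWAY_IP>/llm" \
+  -H "Authorization: Bearer <TOKEN>" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+    "messages": [{ "role": "user", "content": "Tell a robot story." }],
+    "stream": true
+  }'
+```
+
+A client should parse the JSON after each `data:` prefix and append
+`choices[0].delta.content` until a chunk arrives with `"finish_reason": "stop"`.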
diff --git a/ai/pipelines/llm.mdx b/ai/pipelines/llm.mdx
new file mode 100644
index 00000000..65161df6
--- /dev/null
+++ b/ai/pipelines/llm.mdx
@@ -0,0 +1,156 @@
+---
+title: LLM
+---
+
+## Overview
+
+The `llm` pipeline provides an OpenAI-compatible interface for text generation,
+designed to integrate seamlessly into media workflows.
+
+## Models
+
+The `llm` pipeline supports **any Hugging Face-compatible LLM model**. Since
+models evolve quickly, the set of warm (preloaded) models on Orchestrators
+changes regularly.
+
+To see which models are currently available, check the
+[Network Capabilities dashboard](https://tools.livepeer.cloud/ai/network-capabilities).
+At the time of writing, the most commonly available model is
+[meta-llama/Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct).
+
+
+ For faster responses with a different
+ [LLM](https://huggingface.co/models?pipeline_tag=text-generation)
+ model, ask Orchestrators to load it on their GPU via the `ai-research` channel
+ in the [Livepeer Discord Server](https://discord.gg/livepeer).
+
+
+## Basic Usage Instructions
+
+
+ For a detailed understanding of the `llm` endpoint and to experiment with the
+ API, see the [Livepeer AI API Reference](/ai/api-reference/llm).
+
+
+To generate text with the `llm` pipeline, send a `POST` request to the Gateway's
+`llm` API endpoint:
+
+```bash
+curl -X POST "https:///llm" \
+ -H "Authorization: Bearer " \
+ -H "Content-Type: application/json" \
+ -d '{
+ "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+ "messages": [
+ { "role": "user", "content": "Tell a robot story." }
+ ]
+ }'
+```
+
+In this command:
+
+- `<GATEWAY_IP>` should be replaced with your AI Gateway's IP address.
+- `<TOKEN>` should be replaced with your API token if required by the AI Gateway.
+- `model` is the LLM model to use for generation.
+- `messages` is the conversation or prompt input for the model.
+
+For additional optional parameters such as `temperature`, `max_tokens`, or
+`stream`, refer to the [Livepeer AI API Reference](/ai/api-reference/llm).
+
+After execution, the Orchestrator processes the request and returns the response
+to the Gateway, which forwards it back to the client that made the request.
+
+A truncated non-streaming response looks like this:
+
+```json
+{
+ "id": "chatcmpl-abc123",
+ "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+ "choices": [
+ {
+ "message": {
+ "role": "assistant",
+ "content": "Once upon a time, in a gleaming city of circuits..."
+ }
+ }
+ ]
+}
+```
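+
+If you only need the generated text, the JSON above can be post-processed with
+any JSON tool. For example, `jq` (an optional command-line convenience, not part
+of the API) can pull out the assistant message:
+
+```bash
+# Extract just the assistant's text from the non-streaming response.
+curl -s -X POST "https://<GATEWAY_IP>/llm" \
+  -H "Authorization: Bearer <TOKEN>" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+    "messages": [{ "role": "user", "content": "Tell a robot story." }]
+  }' | jq -r '.choices[0].message.content'
+```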
+
+By default, responses are returned as a single JSON object. To stream output
+token-by-token using **Server-Sent Events (SSE)**, set `"stream": true` in the
+request body.
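+
+For instance, a streaming request that also sets the optional `temperature` and
+`max_tokens` parameters could look like the sketch below (parameter values are
+illustrative, and `-N` disables curl's buffering so tokens are printed as they
+arrive):
+
+```bash
+# Stream tokens as they are generated by the model.
+curl -N -X POST "https://<GATEWAY_IP>/llm" \
+  -H "Authorization: Bearer <TOKEN>" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+    "messages": [{ "role": "user", "content": "Tell a robot story." }],
+    "stream": true,
+    "temperature": 0.7,
+    "max_tokens": 256
+  }'
+```
+
+Each line of the streamed output is an SSE `data:` event in the chunk format
+described in the [Livepeer AI API Reference](/ai/api-reference/llm).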
+
+## Orchestrator Configuration
+
+To configure your Orchestrator to serve the `llm` pipeline, refer to the
+[Orchestrator Configuration](/ai/orchestrators/get-started) guide.
+
+### Tuning Environment Variables
+
+The `llm` pipeline supports several environment variables that can be adjusted
+to optimize performance based on your hardware and workload. These are
+particularly helpful for managing memory usage and parallelism when running
+large models.
+
+
+ Enables 8-bit quantization using `bitsandbytes` for lower memory usage. Set to
+ `true` to enable. Defaults to `false`.
+
+
+ Number of pipeline parallel stages. Defaults to `1`.
+
+
+ Number of tensor parallel units. Must divide evenly into the number of
+ attention heads in the model. Defaults to `1`.
+
+
+ Maximum number of tokens per input sequence. Defaults to `8192`.
+
+
+ Maximum number of tokens processed in a single batch. Should be greater than
+ or equal to `MAX_MODEL_LEN`. Defaults to `8192`.
+
+
+ Maximum number of sequences processed per batch. Defaults to `128`.
+
+
+ Target GPU memory utilization as a float between `0` and `1`. Higher values
+ make fuller use of GPU memory. Defaults to `0.85`.
+
+
+### System Requirements
+
+The following system requirements are recommended for optimal performance:
+
+- [NVIDIA GPU](https://developer.nvidia.com/cuda-gpus) with **at least 16GB** of
+ VRAM.
+
+## Recommended Pipeline Pricing
+
+
+ We are planning to simplify the pricing in the future so orchestrators can set
+ one AI price per compute unit and have the system automatically scale based on
+ the model's compute requirements.
+
+
+The `/llm` pipeline is currently priced based on the **maximum output tokens**
+specified in the request — not actual usage — due to current payment system
+limitations. We're actively working to support usage-based pricing to better
+align with industry standards.
+
+The LLM pricing landscape is highly competitive and rapidly evolving.
+Orchestrators should set prices based on their infrastructure costs and
+[market positioning](https://llmpricecheck.com/). As a reference, inference on
+`llama-3-8b-instruct` is currently around `0.08 USD` per 1 million **output
+tokens**.
+
+## API Reference
+
+
+ Explore the `llm` endpoint and experiment with the API in the Livepeer AI API
+ Reference.
+
diff --git a/ai/pipelines/overview.mdx b/ai/pipelines/overview.mdx
index fd987c61..f50f2bd9 100644
--- a/ai/pipelines/overview.mdx
+++ b/ai/pipelines/overview.mdx
@@ -98,4 +98,8 @@ pipelines:
The upscale pipeline transforms low-resolution images into high-quality ones
without distortion
+
+ The LLM pipeline provides an OpenAI-compatible interface for text
+ generation, enabling seamless integration into media workflows.
+
diff --git a/api-reference/generate/llm.mdx b/api-reference/generate/llm.mdx
new file mode 100644
index 00000000..2e9d3cc5
--- /dev/null
+++ b/api-reference/generate/llm.mdx
@@ -0,0 +1,156 @@
+---
+title: LLM
+---
+
+## Overview
+
+The `llm` pipeline provides an OpenAI-compatible interface for text generation,
+designed to integrate seamlessly into media workflows.
+
+## Models
+
+The `llm` pipeline supports **any Hugging Face-compatible LLM model**. Since
+models evolve quickly, the set of warm (preloaded) models on Orchestrators
+changes regularly.
+
+To see which models are currently available, check the
+[Network Capabilities dashboard](https://tools.livepeer.cloud/ai/network-capabilities).
+At the time of writing, the most commonly available model is
+[meta-llama/Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct).
+
+
+ For faster responses with a different
+ [LLM](https://huggingface.co/models?pipeline_tag=text-generation)
+ model, ask Orchestrators to load it on their GPU via the `ai-video` channel
+ in the [Livepeer Discord Server](https://discord.gg/livepeer).
+
+
+## Basic Usage Instructions
+
+
+ For a detailed understanding of the `llm` endpoint and to experiment with the
+ API, see the [Livepeer AI API Reference](/ai/api-reference/llm).
+
+
+To generate text with the `llm` pipeline, send a `POST` request to the Gateway's
+`llm` API endpoint:
+
+```bash
+curl -X POST "https:///llm" \
+ -H "Authorization: Bearer " \
+ -H "Content-Type: application/json" \
+ -d '{
+ "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+ "messages": [
+ { "role": "user", "content": "Tell a robot story." }
+ ]
+ }'
+```
+
+In this command:
+
+- `<GATEWAY_IP>` should be replaced with your AI Gateway's IP address.
+- `<TOKEN>` should be replaced with your API token.
+- `model` is the LLM model to use for generation.
+- `messages` is the conversation or prompt input for the model.
+
+For additional optional parameters such as `temperature`, `max_tokens`, or
+`stream`, refer to the [Livepeer AI API Reference](/ai/api-reference/llm).
+
+After execution, the Orchestrator processes the request and returns the response
+to the Gateway:
+
+```json
+{
+ "id": "chatcmpl-abc123",
+ "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+ "choices": [
+ {
+ "message": {
+ "role": "assistant",
+ "content": "Once upon a time, in a gleaming city of circuits..."
+ }
+ }
+ ]
+}
+```
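+
+To work with just the generated text, the response can be post-processed with a
+JSON tool such as `jq` (an optional command-line convenience, not part of the
+API):
+
+```bash
+# Extract only the assistant's text from the non-streaming response.
+curl -s -X POST "https://<GATEWAY_IP>/llm" \
+  -H "Authorization: Bearer <TOKEN>" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+    "messages": [{ "role": "user", "content": "Tell a robot story." }]
+  }' | jq -r '.choices[0].message.content'
+```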
+
+By default, responses are returned as a single JSON object. To stream output
+token-by-token using **Server-Sent Events (SSE)**, set `"stream": true` in the
+request body.
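+
+As a sketch, a streaming request with the optional `temperature` and
+`max_tokens` parameters could look like this (values are illustrative; `-N`
+disables curl's buffering so tokens print as they arrive):
+
+```bash
+# Stream tokens as they are generated by the model.
+curl -N -X POST "https://<GATEWAY_IP>/llm" \
+  -H "Authorization: Bearer <TOKEN>" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+    "messages": [{ "role": "user", "content": "Tell a robot story." }],
+    "stream": true,
+    "temperature": 0.7,
+    "max_tokens": 256
+  }'
+```
+
+Each line of the streamed output is an SSE `data:` event; see the
+[Livepeer AI API Reference](/ai/api-reference/llm) for the chunk format.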
+
+## Orchestrator Configuration
+
+To configure your Orchestrator to serve the `llm` pipeline, refer to the
+[Orchestrator Configuration](/ai/orchestrators/get-started) guide.
+
+### Tuning Environment Variables
+
+The `llm` pipeline supports several environment variables that can be adjusted
+to optimize performance based on your hardware and workload. These are
+particularly helpful for managing memory usage and parallelism when running
+large models.
+
+
+ Enables 8-bit quantization using `bitsandbytes` for lower memory usage. Set to
+ `true` to enable. Defaults to `false`.
+
+
+ Number of pipeline parallel stages. Should not exceed the number of model
+ layers. Defaults to `1`.
+
+
+ Number of tensor parallel units. Must divide evenly into the number of
+ attention heads in the model. Defaults to `1`.
+
+
+ Maximum number of tokens per input sequence. Defaults to `8192`.
+
+
+ Maximum number of tokens processed in a single batch. Should be greater than
+ or equal to `MAX_MODEL_LEN`. Defaults to `8192`.
+
+
+ Maximum number of sequences processed per batch. Defaults to `128`.
+
+
+ Target GPU memory utilization as a float between `0` and `1`. Higher values
+ make fuller use of GPU memory. Defaults to `0.97`.
+
+
+### System Requirements
+
+The following system requirements are recommended for optimal performance:
+
+- [NVIDIA GPU](https://developer.nvidia.com/cuda-gpus) with **at least 16GB** of
+ VRAM.
+
+## Recommended Pipeline Pricing
+
+
+ We are planning to simplify the pricing in the future so orchestrators can set
+ one AI price per compute unit and have the system automatically scale based on
+ the model's compute requirements.
+
+
+The `/llm` pipeline is currently priced based on the **maximum output tokens**
+specified in the request — not actual usage — due to current payment system
+limitations. We're actively working to support usage-based pricing to better
+align with industry standards.
+
+The LLM pricing landscape is highly competitive and rapidly evolving.
+Orchestrators should set prices based on their infrastructure costs and
+[market positioning](https://llmpricecheck.com/). As a reference, inference on
+`llama-3-8b-instruct` is currently around `0.08 USD` per 1 million **output
+tokens**.
+
+## API Reference
+
+
+ Explore the `llm` endpoint and experiment with the API in the Livepeer AI API
+ Reference.
+
diff --git a/mint.json b/mint.json
index 5e48e1ab..b70dc247 100644
--- a/mint.json
+++ b/mint.json
@@ -537,6 +537,7 @@
"ai/pipelines/image-to-image",
"ai/pipelines/image-to-text",
"ai/pipelines/image-to-video",
+ "ai/pipelines/llm",
"ai/pipelines/segment-anything-2",
"ai/pipelines/text-to-image",
"ai/pipelines/text-to-speech",
@@ -604,6 +605,7 @@
"ai/api-reference/image-to-image",
"ai/api-reference/image-to-text",
"ai/api-reference/image-to-video",
+ "ai/api-reference/llm",
"ai/api-reference/segment-anything-2",
"ai/api-reference/text-to-image",
"ai/api-reference/text-to-speech",
@@ -837,6 +839,7 @@
"api-reference/generate/text-to-image",
"api-reference/generate/image-to-image",
"api-reference/generate/image-to-video",
+ "api-reference/generate/llm",
"api-reference/generate/segment-anything-2",
"api-reference/generate/upscale"
]