64 changes: 48 additions & 16 deletions ai/api-reference/llm.mdx
@@ -1,24 +1,11 @@
---
openapi: post /llm
---
<Warning>
We are currently deploying the Large Language Model (LLM) pipeline to our gateway infrastructure.
This warning will be removed once all listed gateways have successfully transitioned to serving the LLM pipeline, ensuring a seamless and enhanced user experience.
</Warning>
<Note>
The LLM pipeline supports streaming responses by setting `stream=true` in the request. The response is then streamed with Server-Sent Events (SSE)
in chunks as the tokens are generated.

Each streaming response chunk will have the following format:

`data: {"chunk": "word "}`

The final chunk of the response will be indicated by the following format:

`data: {"chunk": "[DONE]", "tokens_used": 256, "done": true}`

The Response type below is for non-streaming responses that will return all of the response in one
<Note>
The LLM pipeline is OpenAI API-compatible but does **not** implement all features of the OpenAI API.
</Note>

<Info>
The default Gateway used in this guide is the public
[Livepeer.cloud](https://www.livepeer.cloud/) Gateway. It is free to use but
@@ -28,3 +15,48 @@ The Response type below is for non-streaming responses that will return all of t
Gateway node or partner with one via the `ai-video` channel on
[Discord](https://discord.gg/livepeer).
</Info>

### Streaming Responses

<Note>
Ensure your client supports SSE and processes each `data:` line as it arrives.
</Note>

By default, the `/llm` endpoint returns a single JSON response in the OpenAI
[chat/completions](https://platform.openai.com/docs/api-reference/chat/object)
format, as shown in the sidebar.

To receive responses token-by-token, set `"stream": true` in the request body. The server will then use **Server-Sent Events (SSE)** to stream output in real time.
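
For example, a streaming request can be made with `curl` as in the sketch
below. `<GATEWAY_IP>` and `<TOKEN>` are placeholders for your Gateway's address
and API token, and the `-N` flag disables curl's output buffering so chunks are
printed as they arrive:

```bash
curl -N -X POST "https://<GATEWAY_IP>/llm" \
  -H "Authorization: Bearer <TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [
      { "role": "user", "content": "Tell a robot story." }
    ],
    "stream": true
  }'
```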


Each streamed chunk will look like:

```json
data: {
"choices": [
{
"delta": {
"content": "...token...",
"role": "assistant"
},
"finish_reason": null
}
]
}
```

The final chunk will have empty content and `"finish_reason": "stop"`:

```json
data: {
"choices": [
{
"delta": {
"content": "",
"role": "assistant"
},
"finish_reason": "stop"
}
]
}
```
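
As a minimal client-side sketch, the stream can be reduced to just the
generated text. This assumes a GNU userland, that `jq` is installed, and that
every `data:` line carries JSON shaped like the chunks above:

```bash
# Sketch: make a streaming request and print only the generated tokens.
# Assumes every `data:` line contains JSON in the chunk format shown above.
curl -sN -X POST "https://<GATEWAY_IP>/llm" \
  -H "Authorization: Bearer <TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [{ "role": "user", "content": "Tell a robot story." }],
    "stream": true
  }' \
  | grep --line-buffered '^data:' \
  | sed -u 's/^data: *//' \
  | jq -r --unbuffered '.choices[0].delta.content // empty'
```
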
156 changes: 156 additions & 0 deletions ai/pipelines/llm.mdx
@@ -0,0 +1,156 @@
---
title: LLM
---

## Overview

The `llm` pipeline provides an OpenAI-compatible interface for text generation,
designed to integrate seamlessly into media workflows.

## Models

The `llm` pipeline supports **any Hugging Face-compatible LLM model**. Since
models evolve quickly, the set of warm (preloaded) models on Orchestrators
changes regularly.

To see which models are currently available, check the
[Network Capabilities dashboard](https://tools.livepeer.cloud/ai/network-capabilities).
At the time of writing, the most commonly available model is
[meta-llama/Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct).

<Tip>
For faster responses with a different
[LLM](https://huggingface.co/models?pipeline_tag=text-generation)
model, ask Orchestrators to load it on their GPU via the `ai-research` channel
on the [Discord Server](https://discord.gg/livepeer).
</Tip>

## Basic Usage Instructions

<Tip>
For a detailed understanding of the `llm` endpoint and to experiment with the
API, see the [Livepeer AI API Reference](/ai/api-reference/llm).
</Tip>

To generate text with the `llm` pipeline, send a `POST` request to the Gateway's
`llm` API endpoint:

```bash
curl -X POST "https://<GATEWAY_IP>/llm" \
-H "Authorization: Bearer <TOKEN>" \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"messages": [
{ "role": "user", "content": "Tell a robot story." }
]
}'
```

In this command:

- `<GATEWAY_IP>` should be replaced with your AI Gateway's IP address.
- `<TOKEN>` should be replaced with your API token if required by the AI Gateway.
- `model` is the LLM model to use for generation.
- `messages` is the conversation or prompt input for the model.

For additional optional parameters such as `temperature`, `max_tokens`, or
`stream`, refer to the [Livepeer AI API Reference](/ai/api-reference/llm).

After execution, the Orchestrator processes the request and returns the response
to the Gateway, which forwards it to the client that made the request.

A partial example of a non-streaming response is shown below:
```json
{
"id": "chatcmpl-abc123",
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"choices": [
{
"message": {
"role": "assistant",
"content": "Once upon a time, in a gleaming city of circuits..."
}
}
]
}
```

By default, responses are returned as a single JSON object. To stream output
token-by-token using **Server-Sent Events (SSE)**, set `"stream": true` in the
request body.
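
As an illustrative sketch (the parameter values are arbitrary examples, not
recommended defaults), a streaming request with a few optional parameters might
look like this; the `-N` flag disables curl's output buffering:

```bash
curl -N -X POST "https://<GATEWAY_IP>/llm" \
  -H "Authorization: Bearer <TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [
      { "role": "user", "content": "Tell a robot story." }
    ],
    "temperature": 0.7,
    "max_tokens": 256,
    "stream": true
  }'
```

Each chunk then arrives as an SSE `data:` line; see the
[Livepeer AI API Reference](/ai/api-reference/llm) for the exact chunk format.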

## Orchestrator Configuration

To configure your Orchestrator to serve the `llm` pipeline, refer to the
[Orchestrator Configuration](/ai/orchestrators/get-started) guide.

### Tuning Environment Variables

The `llm` pipeline supports several environment variables that can be adjusted
to optimize performance based on your hardware and workload. These are
particularly helpful for managing memory usage and parallelism when running
large models.

<ParamField path="USE_8BIT" type="boolean">
Enables 8-bit quantization using `bitsandbytes` for lower memory usage. Set to
`true` to enable. Defaults to `false`.
</ParamField>
<ParamField path="PIPELINE_PARALLEL_SIZE" type="integer">
Number of pipeline parallel stages. Defaults to `1`.
</ParamField>
<ParamField path="TENSOR_PARALLEL_SIZE" type="integer">
Number of tensor parallel units. Must divide evenly into the number of
attention heads in the model. Defaults to `1`.
</ParamField>
<ParamField path="MAX_MODEL_LEN" type="integer">
Maximum number of tokens per input sequence. Defaults to `8192`.
</ParamField>
<ParamField path="MAX_NUM_BATCHED_TOKENS" type="integer">
Maximum number of tokens processed in a single batch. Should be greater than
or equal to `MAX_MODEL_LEN`. Defaults to `8192`.
</ParamField>
<ParamField path="MAX_NUM_SEQS" type="integer">
Maximum number of sequences processed per batch. Defaults to `128`.
</ParamField>
<ParamField path="GPU_MEMORY_UTILIZATION" type="float">
Target GPU memory utilization as a float between `0` and `1`. Higher values
make fuller use of GPU memory. Defaults to `0.85`.
</ParamField>
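
As a hedged illustration, these variables are typically passed into the AI
runner's environment when it is started. The container image name below is a
placeholder and the exact launch flow depends on your deployment; see the
[Orchestrator Configuration](/ai/orchestrators/get-started) guide for the
authoritative steps.

```bash
# Illustrative sketch: pass tuning variables as environment variables when
# starting the AI runner container. <AI_RUNNER_IMAGE> is a placeholder.
docker run --gpus all \
  -e USE_8BIT=false \
  -e PIPELINE_PARALLEL_SIZE=1 \
  -e TENSOR_PARALLEL_SIZE=1 \
  -e MAX_MODEL_LEN=8192 \
  -e MAX_NUM_BATCHED_TOKENS=8192 \
  -e MAX_NUM_SEQS=128 \
  -e GPU_MEMORY_UTILIZATION=0.85 \
  <AI_RUNNER_IMAGE>
```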

### System Requirements

The following system requirements are recommended for optimal performance:

- [NVIDIA GPU](https://developer.nvidia.com/cuda-gpus) with **at least 16GB** of
VRAM.

## Recommended Pipeline Pricing

<Note>
We are planning to simplify the pricing in the future so orchestrators can set
one AI price per compute unit and have the system automatically scale based on
the model's compute requirements.
</Note>

The `/llm` pipeline is currently priced based on the **maximum output tokens**
specified in the request — not actual usage — due to current payment system
limitations. We're actively working to support usage-based pricing to better
align with industry standards.

The LLM pricing landscape is highly competitive and rapidly evolving.
Orchestrators should set prices based on their infrastructure costs and
[market positioning](https://llmpricecheck.com/). As a reference, inference on
`llama-3-8b-instruct` is currently around `0.08 USD` per 1 million **output
tokens**. At that rate, for example, a request specifying a maximum of 1,000
output tokens would be priced at roughly `0.00008 USD` under the current
maximum-token scheme, regardless of how many tokens are actually generated.

## API Reference

<Card
title="API Reference"
icon="rectangle-terminal"
href="/ai/api-reference/llm"
>
Explore the `llm` endpoint and experiment with the API in the Livepeer AI API
Reference.
</Card>
4 changes: 4 additions & 0 deletions ai/pipelines/overview.mdx
@@ -98,4 +98,8 @@ pipelines:
The upscale pipeline transforms low-resolution images into high-quality ones
without distortion
</Card>
<Card title="LLM" icon="rectangle-terminal" href="/ai/pipelines/llm">
The LLM pipeline provides an OpenAI-compatible interface for text
generation, enabling seamless integration into media workflows.
</Card>
</CardGroup>
156 changes: 156 additions & 0 deletions api-reference/generate/llm.mdx
@@ -0,0 +1,156 @@
---
title: LLM
---

## Overview

The `llm` pipeline provides an OpenAI-compatible interface for text generation,
designed to integrate seamlessly into media workflows.

## Models

The `llm` pipeline supports **any Hugging Face-compatible LLM model**. Since
models evolve quickly, the set of warm (preloaded) models on Orchestrators
changes regularly.

To see which models are currently available, check the
[Network Capabilities dashboard](https://tools.livepeer.cloud/ai/network-capabilities).
At the time of writing, the most commonly available model is
[meta-llama/Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct).

<Tip>
For faster responses with a different
[LLM](https://huggingface.co/models?pipeline_tag=text-generation)
model, ask Orchestrators to load it on their GPU via the `ai-video` channel
on the [Discord Server](https://discord.gg/livepeer).
</Tip>

## Basic Usage Instructions

<Tip>
For a detailed understanding of the `llm` endpoint and to experiment with the
API, see the [Livepeer AI API Reference](/ai/api-reference/llm).
</Tip>

To generate text with the `llm` pipeline, send a `POST` request to the Gateway's
`llm` API endpoint:

```bash
curl -X POST "https://<GATEWAY_IP>/llm" \
-H "Authorization: Bearer <TOKEN>" \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"messages": [
{ "role": "user", "content": "Tell a robot story." }
]
}'
```

In this command:

- `<GATEWAY_IP>` should be replaced with your AI Gateway's IP address.
- `<TOKEN>` should be replaced with your API token.
- `model` is the LLM model to use for generation.
- `messages` is the conversation or prompt input for the model.

For additional optional parameters such as `temperature`, `max_tokens`, or
`stream`, refer to the [Livepeer AI API Reference](/ai/api-reference/llm).

After execution, the Orchestrator processes the request and returns the response
to the Gateway:

```json
{
"id": "chatcmpl-abc123",
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"choices": [
{
"message": {
"role": "assistant",
"content": "Once upon a time, in a gleaming city of circuits..."
}
}
]
}
```

By default, responses are returned as a single JSON object. To stream output
token-by-token using **Server-Sent Events (SSE)**, set `"stream": true` in the
request body.
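
As an illustrative sketch (the parameter values are arbitrary examples, not
recommended defaults), a streaming request with a few optional parameters might
look like this; the `-N` flag disables curl's output buffering:

```bash
curl -N -X POST "https://<GATEWAY_IP>/llm" \
  -H "Authorization: Bearer <TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [
      { "role": "user", "content": "Tell a robot story." }
    ],
    "temperature": 0.7,
    "max_tokens": 256,
    "stream": true
  }'
```

Each chunk then arrives as an SSE `data:` line; see the
[Livepeer AI API Reference](/ai/api-reference/llm) for the exact chunk format.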

## Orchestrator Configuration

To configure your Orchestrator to serve the `llm` pipeline, refer to the
[Orchestrator Configuration](/ai/orchestrators/get-started) guide.

### Tuning Environment Variables

The `llm` pipeline supports several environment variables that can be adjusted
to optimize performance based on your hardware and workload. These are
particularly helpful for managing memory usage and parallelism when running
large models.

<ParamField path="USE_8BIT" type="boolean">
Enables 8-bit quantization using `bitsandbytes` for lower memory usage. Set to
`true` to enable. Defaults to `false`.
</ParamField>
<ParamField path="PIPELINE_PARALLEL_SIZE" type="integer">
Number of pipeline parallel stages. Should not exceed the number of model
layers. Defaults to `1`.
</ParamField>
<ParamField path="TENSOR_PARALLEL_SIZE" type="integer">
Number of tensor parallel units. Must divide evenly into the number of
attention heads in the model. Defaults to `1`.
</ParamField>
<ParamField path="MAX_MODEL_LEN" type="integer">
Maximum number of tokens per input sequence. Defaults to `8192`.
</ParamField>
<ParamField path="MAX_NUM_BATCHED_TOKENS" type="integer">
Maximum number of tokens processed in a single batch. Should be greater than
or equal to `MAX_MODEL_LEN`. Defaults to `8192`.
</ParamField>
<ParamField path="MAX_NUM_SEQS" type="integer">
Maximum number of sequences processed per batch. Defaults to `128`.
</ParamField>
<ParamField path="GPU_MEMORY_UTILIZATION" type="float">
Target GPU memory utilization as a float between `0` and `1`. Higher values
make fuller use of GPU memory. Defaults to `0.97`.
</ParamField>
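
As a hedged illustration, these variables are typically passed into the AI
runner's environment when it is started. The container image name below is a
placeholder and the exact launch flow depends on your deployment; see the
[Orchestrator Configuration](/ai/orchestrators/get-started) guide for the
authoritative steps.

```bash
# Illustrative sketch: pass tuning variables as environment variables when
# starting the AI runner container. <AI_RUNNER_IMAGE> is a placeholder.
docker run --gpus all \
  -e USE_8BIT=false \
  -e PIPELINE_PARALLEL_SIZE=1 \
  -e TENSOR_PARALLEL_SIZE=1 \
  -e MAX_MODEL_LEN=8192 \
  -e MAX_NUM_BATCHED_TOKENS=8192 \
  -e MAX_NUM_SEQS=128 \
  -e GPU_MEMORY_UTILIZATION=0.97 \
  <AI_RUNNER_IMAGE>
```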

### System Requirements

The following system requirements are recommended for optimal performance:

- [NVIDIA GPU](https://developer.nvidia.com/cuda-gpus) with **at least 16GB** of
VRAM.

## Recommended Pipeline Pricing

<Note>
We are planning to simplify the pricing in the future so orchestrators can set
one AI price per compute unit and have the system automatically scale based on
the model's compute requirements.
</Note>

The `/llm` pipeline is currently priced based on the **maximum output tokens**
specified in the request — not actual usage — due to current payment system
limitations. We're actively working to support usage-based pricing to better
align with industry standards.

The LLM pricing landscape is highly competitive and rapidly evolving.
Orchestrators should set prices based on their infrastructure costs and
[market positioning](https://llmpricecheck.com/). As a reference, inference on
`llama-3-8b-instruct` is currently around `0.08 USD` per 1 million **output
tokens**. At that rate, for example, a request specifying a maximum of 1,000
output tokens would be priced at roughly `0.00008 USD` under the current
maximum-token scheme, regardless of how many tokens are actually generated.

## API Reference

<Card
title="API Reference"
icon="rectangle-terminal"
href="/ai/api-reference/llm"
>
Explore the `llm` endpoint and experiment with the API in the Livepeer AI API
Reference.
</Card>