In production systems with user interaction, streaming output from LLMs greatly improves the user experience. Streaming lets you build real-time systems that minimize the time to first token (TTFT) instead of waiting for the entire response to complete before processing it.


Guardrails natively supports validation of streaming output, with both synchronous and asynchronous approaches.

In [None]:
from rich import print
import guardrails as gd
import litellm
from IPython.display import clear_output
import time

To stream with a Guard, pass stream=True when calling it.

In [None]:
guard = gd.Guard()

fragment_generator = guard(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me about LLM streaming APIs."},
    ],
    max_tokens=1024,
    temperature=0,
    stream=True,
)


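# Redraw the cell output each time a new streamed result arrives
for op in fragment_generator:
    clear_output(wait=True)
    print(op)
    time.sleep(0.5)  # Slow the loop down so each update is visible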
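
To put a number on the TTFT benefit, the sketch below times how long a stream takes to deliver its first chunk compared with its last. It makes no Guardrails or LLM calls: `simulated_stream` and `measure_stream` are hypothetical helpers written for this illustration, and the 10 ms delay is an arbitrary stand-in for network and generation latency.

```python
import time
from typing import Iterable, Iterator, Optional, Tuple


def simulated_stream(chunks: Iterable[str], delay: float = 0.01) -> Iterator[str]:
    """Hypothetical stand-in for a streaming LLM call: yields chunks after a delay."""
    for chunk in chunks:
        time.sleep(delay)  # pretend we are waiting on the model
        yield chunk


def measure_stream(stream: Iterable[str]) -> Tuple[Optional[float], float, int]:
    """Consume a stream and return (seconds to first chunk, total seconds, chunk count)."""
    start = time.perf_counter()
    first: Optional[float] = None  # stays None if the stream is empty
    count = 0
    for _ in stream:
        if first is None:
            first = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    return first, total, count


ttft, total, n_chunks = measure_stream(
    simulated_stream(["Streaming ", "cuts ", "perceived ", "latency."])
)
print(f"first chunk after {ttft:.3f}s, all {n_chunks} chunks after {total:.3f}s")
```

Because `measure_stream` accepts any iterable, the same helper could in principle wrap a real streaming generator such as `fragment_generator`, though the timings would then include validation overhead as well as model latency.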

When many network calls run concurrently (for example, many LLM calls at once), an asynchronous LLM client can be beneficial. Guardrails also natively supports asynchronous streaming calls.

In [None]:
import asyncio

guard = gd.AsyncGuard()

fragment_generator = await guard(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me about the streaming API of guardrails."},
    ],
    max_tokens=1024,
    temperature=0,
    stream=True,
)


async for op in fragment_generator:
    clear_output(wait=True)
    print(op)
    # Non-blocking sleep: time.sleep() here would stall the event loop
    await asyncio.sleep(0.5)
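
The payoff of the asynchronous approach shows up when several streams run at once. The sketch below illustrates the pattern with `asyncio.gather` and no Guardrails or LLM calls: `simulated_async_stream`, `collect`, and `run_all` are hypothetical helpers, and the short `asyncio.sleep` stands in for waiting on the network.

```python
import asyncio
from typing import AsyncIterator, List


async def simulated_async_stream(
    name: str, n_chunks: int, delay: float = 0.01
) -> AsyncIterator[str]:
    """Hypothetical stand-in for an async streaming LLM call."""
    for i in range(n_chunks):
        await asyncio.sleep(delay)  # yields control to the event loop while "waiting"
        yield f"{name}-chunk{i}"


async def collect(stream: AsyncIterator[str]) -> List[str]:
    """Drain one stream into a list of chunks."""
    return [chunk async for chunk in stream]


async def run_all() -> List[List[str]]:
    """Consume three streams concurrently; gather keeps results in input order."""
    streams = [simulated_async_stream(f"call{i}", 3) for i in range(3)]
    return await asyncio.gather(*(collect(s) for s in streams))


# In a plain script; inside a notebook cell, use `results = await run_all()` instead.
results = asyncio.run(run_all())
print(results)
```

While one stream is waiting on its next chunk, the event loop services the others, so the three streams finish in roughly the time of one rather than three in sequence.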