# Streaming: Real-Time Token Output

Learn how to stream LLM responses token-by-token for better user experience and real-time feedback.

## Why Streaming Matters

### The Problem

Traditional LLM calls wait for the entire response before displaying anything:

```python
# Non-streaming: Wait 10+ seconds, then see everything at once
response = await model.invoke(...)  # 😴 User waits...
print(response)  # 💥 Everything appears at once
```

### The Solution

Streaming provides tokens as they're generated:

```python
# Streaming: See tokens immediately as they arrive
async for chunk in model.invoke_stream(...):
    print(chunk, end="", flush=True)  # ⚡ Real-time output
```

### Benefits

1. **Better UX**: User sees immediate feedback
2. **Perceived Speed**: Feels faster even when the total generation time is the same
3. **Memory Efficiency**: Process chunks without buffering entire response
4. **Progressive Rendering**: Update UI as data arrives
5. **Early Termination**: Stop generation if needed

## Setup

In [None]:
import asyncio
import time

from lionpride import Session
from lionpride.services import iModel

## 1. Basic Streaming with invoke_stream()

The simplest way to stream: call `invoke_stream()` and iterate over the chunks with `async for`.

In [None]:
async def basic_streaming():
    """Stream tokens as they're generated"""
    session = Session()

    # Create model (streaming is enabled via invoke_stream method)
    model = iModel(provider="openai", model="gpt-4o-mini", temperature=0.7)
    session.services.register(model)

    # Stream response using invoke_stream()
    print("Assistant: ", end="", flush=True)

    response_chunks = []
    async for chunk in model.invoke_stream(
        model="gpt-4o-mini", messages=[{"role": "user", "content": "Write a haiku about coding"}]
    ):
        response_chunks.append(chunk)
        print(chunk, end="", flush=True)  # Print immediately

    print()  # Newline at end

    full_response = "".join(response_chunks)
    print(f"\n✅ Received {len(response_chunks)} chunks ({len(full_response)} chars)")

    return full_response


# Run the example
result = await basic_streaming()

## 2. Comparing Stream vs Non-Stream

Let's see the timing difference side-by-side.

In [None]:
async def compare_streaming():
    """Compare streaming vs non-streaming"""
    session = Session()

    # Create model
    model = iModel(provider="openai", model="gpt-4o-mini", temperature=0.7)
    session.services.register(model)

    instruction = "Explain async/await in Python in 2 paragraphs"
    messages = [{"role": "user", "content": instruction}]

    # Non-streaming: Wait for everything
    print("=" * 60)
    print("NON-STREAMING (wait for complete response)")
    print("=" * 60)

    start = time.time()
    calling = await model.invoke(model="gpt-4o-mini", messages=messages)
    elapsed_sync = time.time() - start

    # Extract response from calling object
    response_sync = calling.execution.response.data if calling.execution.response else ""

    print(f"⏱️  Waited {elapsed_sync:.2f}s, then got everything at once:\n")
    print(str(response_sync)[:200] + "...\n")

    # Streaming: See tokens immediately
    print("=" * 60)
    print("STREAMING (tokens appear in real-time)")
    print("=" * 60)

    start = time.time()
    first_token_time = None
    chunks = []

    async for chunk in model.invoke_stream(model="gpt-4o-mini", messages=messages):
        if first_token_time is None:
            first_token_time = time.time() - start
        chunks.append(chunk)
        print(chunk, end="", flush=True)

    elapsed_stream = time.time() - start

    print(f"\n\n⚡ First token after {first_token_time:.2f}s")
    print(f"⏱️  Total time: {elapsed_stream:.2f}s")
    print(f"\n✨ First output appeared {(elapsed_sync / first_token_time):.1f}x sooner than the non-streaming response")


await compare_streaming()

## 3. Streaming with Progress Indicators

Show progress while streaming for long responses.

In [None]:
async def stream_with_progress():
    """Show progress while streaming"""
    session = Session()
    model = iModel(provider="openai", model="gpt-4o-mini", temperature=0.7)
    session.services.register(model)

    messages = [
        {
            "role": "user",
            "content": "Explain how transformers work in machine learning (detailed explanation)",
        }
    ]

    print("Generating response", end="", flush=True)

    tokens_received = 0
    response_chunks = []
    words_seen = 0

    async for chunk in model.invoke_stream(model="gpt-4o-mini", messages=messages):
        tokens_received += 1
        response_chunks.append(chunk)

        # Approximate word count (chunks can split mid-word, so this may overcount)
        words_seen += len(chunk.split())

        # Show progress every 10 chunks
        if tokens_received % 10 == 0:
            print(".", end="", flush=True)

    print(f"\n\n✅ Received {tokens_received} chunks (~{words_seen} words)\n")
    print("Response:")
    print("=" * 60)
    print("".join(response_chunks))
    print("=" * 60)

    return "".join(response_chunks)


result = await stream_with_progress()

## 4. Streaming with Conversation Context

Stream responses while maintaining conversation history.

In [None]:
async def stream_with_context():
    """Stream responses while maintaining conversation context"""
    session = Session()
    model = iModel(
        provider="anthropic",
        endpoint="messages",
        model="claude-3-5-sonnet-20241022",
        temperature=0.7,
    )
    session.services.register(model)

    branch = session.create_branch(name="streaming-chat")

    # Turn 1: Build context using communicate()
    from lionpride.operations import communicate

    print("User: Tell me about Python decorators\n")

    first_response = await communicate(
        session=session,
        branch=branch,
        parameters={
            "instruction": "Tell me about Python decorators in 2 sentences",
            "imodel": model.name,
        },
    )

    print(f"Assistant: {first_response}\n")

    # Turn 2: Streaming follow-up with context
    print("User: Can you show me a practical example?\n")
    print("Assistant: ", end="", flush=True)

    # Get conversation history for streaming
    from lionpride.session.messages.utils import prepare_messages_for_chat

    messages = [session.messages[msg_id] for msg_id in branch.order]
    chat_messages = list(prepare_messages_for_chat(messages=messages, progression=branch))

    # Add new instruction
    chat_messages.append({"role": "user", "content": "Can you show me a practical example?"})

    # Stream response
    full_response = []
    async for chunk in model.invoke_stream(
        model="claude-3-5-sonnet-20241022", messages=chat_messages
    ):
        full_response.append(chunk)
        print(chunk, end="", flush=True)

    print("\n")

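    # Save the follow-up question and the streamed response to the branch,
    # so the stored history contains both sides of the second turn
    from lionpride.session.messages import AssistantResponseContent, InstructionContent, Message

    user_message = Message(
        content=InstructionContent(instruction="Can you show me a practical example?")
    )
    response_message = Message(
        content=AssistantResponseContent(assistant_response="".join(full_response))
    )
    session.add_message(user_message, branches=branch)
    session.add_message(response_message, branches=branch)

    print(f"✅ Conversation has {len(branch.order)} messages")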

    return session, branch


session, branch = await stream_with_context()

## 5. Building a Streaming Chat UI Pattern

A reusable pattern for chat applications with streaming.

In [None]:
class StreamingChat:
    """Reusable streaming chat pattern"""

    def __init__(self, model: iModel):
        self.model = model
        self.session = Session()
        self.session.services.register(model)
        self.branch = self.session.create_branch(name="main")

    async def chat(self, user_input: str, show_stats: bool = True):
        """Send message and stream response"""
        from lionpride.session.messages import AssistantResponseContent, InstructionContent, Message
        from lionpride.session.messages.utils import prepare_messages_for_chat

        # Display user message
        print(f"\n👤 User: {user_input}")
        print("🤖 Assistant: ", end="", flush=True)

        # Get conversation history
        messages = [self.session.messages[msg_id] for msg_id in self.branch.order]
        chat_messages = list(prepare_messages_for_chat(messages=messages, progression=self.branch))

        # Add new message
        chat_messages.append({"role": "user", "content": user_input})

        # Stream response with stats
        start_time = time.time()
        first_token_time = None
        chunks = []

        async for chunk in self.model.invoke_stream(
            model=self.model.backend.config.model, messages=chat_messages
        ):
            if first_token_time is None:
                first_token_time = time.time() - start_time

            chunks.append(chunk)
            print(chunk, end="", flush=True)

        total_time = time.time() - start_time
        full_response = "".join(chunks)

        print()  # Newline

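        # Skip stats if the stream produced no chunks (first_token_time stays None)
        if show_stats and first_token_time is not None:
            print(
                f"\n📊 Stats: {len(chunks)} chunks | "
                f"First token: {first_token_time:.2f}s | "
                f"Total: {total_time:.2f}s"
            )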

        # Save to conversation
        user_msg = Message(content=InstructionContent(instruction=user_input))
        response_msg = Message(content=AssistantResponseContent(assistant_response=full_response))

        self.session.add_message(user_msg, branches=self.branch)
        self.session.add_message(response_msg, branches=self.branch)

        return full_response

    def history(self):
        """Display conversation history"""
        print("\n" + "=" * 60)
        print("CONVERSATION HISTORY")
        print("=" * 60)

        for msg_id in self.branch.order:
            msg = self.session.messages[msg_id]
            role = "👤" if msg.role.value == "user" else "🤖"

            if hasattr(msg.content, "instruction"):
                content = msg.content.instruction
            elif hasattr(msg.content, "assistant_response"):
                content = msg.content.assistant_response
            else:
                content = str(msg.content)

            preview = content[:100] + "..." if len(content) > 100 else content
            print(f"{role} {msg.role.value}: {preview}\n")


# Create streaming chat instance
model = iModel(provider="openai", model="gpt-4o-mini", temperature=0.7)
chat = StreamingChat(model)

# Have a multi-turn conversation
await chat.chat("What are the key benefits of async programming?")
await chat.chat("Can you give me a code example?")
await chat.chat("What are common mistakes to avoid?")

# Show history
chat.history()

## 6. Parallel Streaming from Multiple Models

Compare responses from different models streaming simultaneously.

In [None]:
async def stream_one_model(model: iModel, instruction: str, label: str):
    """Stream from one model with label"""
    messages = [{"role": "user", "content": instruction}]

    print(f"\n[{label}] Starting...")
    response = []

    start = time.time()
    async for chunk in model.invoke_stream(model=model.backend.config.model, messages=messages):
        response.append(chunk)
        # Show first few chunks
        if len(response) <= 5:
            print(f"[{label}] {chunk}", end="", flush=True)

    elapsed = time.time() - start
    full_response = "".join(response)
    print(f"\n[{label}] Complete ({len(response)} chunks, {elapsed:.2f}s)")

    return full_response


async def parallel_streaming():
    """Stream from multiple models simultaneously"""
    session = Session()

    gpt = iModel(provider="openai", model="gpt-4o-mini", temperature=0.7, name="gpt")

    claude = iModel(
        provider="anthropic",
        endpoint="messages",
        model="claude-3-5-haiku-20241022",
        temperature=0.7,
        name="claude",
    )

    session.services.register(gpt)
    session.services.register(claude)

    question = "What makes a good API design? Answer in 3 bullet points."

    print("=" * 60)
    print(f"Streaming from both models: {question}")
    print("=" * 60)

    # Stream from both models in parallel
    results = await asyncio.gather(
        stream_one_model(gpt, question, "GPT"),
        stream_one_model(claude, question, "Claude"),
    )

    print("\n" + "=" * 60)
    print("FINAL RESPONSES")
    print("=" * 60)

    print(f"\n🤖 GPT Response ({len(results[0])} chars):")
    print("-" * 60)
    print(results[0])

    print(f"\n🤖 Claude Response ({len(results[1])} chars):")
    print("-" * 60)
    print(results[1])

    return results


results = await parallel_streaming()
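
When several streams print directly, their output interleaves on the terminal, which is why `stream_one_model()` only shows the first few chunks. One possible alternative is to funnel labeled chunks through an `asyncio.Queue` and let a single consumer decide how to render them. The sketch below is an illustration under assumptions: `fake_stream()`, `pump()`, and `merge_streams()` are hypothetical helpers, and `fake_stream()` stands in for `model.invoke_stream()`.

```python
import asyncio


async def fake_stream(chunks, delay: float):
    """Hypothetical stand-in for model.invoke_stream()."""
    for chunk in chunks:
        await asyncio.sleep(delay)
        yield chunk


async def pump(label: str, stream, queue: asyncio.Queue) -> None:
    """Forward every chunk to the shared queue, tagged with its label."""
    try:
        async for chunk in stream:
            await queue.put((label, chunk))
    finally:
        # Always send a sentinel so the consumer never waits forever.
        await queue.put((label, None))


async def merge_streams(streams: dict, on_chunk=None) -> dict:
    """Consume all streams concurrently; return the full text per label."""
    queue: asyncio.Queue = asyncio.Queue()
    tasks = [asyncio.create_task(pump(label, s, queue)) for label, s in streams.items()]
    parts = {label: [] for label in streams}
    remaining = len(tasks)
    while remaining:
        label, chunk = await queue.get()
        if chunk is None:
            remaining -= 1  # this stream has finished
            continue
        parts[label].append(chunk)
        if on_chunk is not None:
            on_chunk(label, chunk)  # single place to render output
    await asyncio.gather(*tasks)  # re-raise any producer error
    return {label: "".join(p) for label, p in parts.items()}


async def demo() -> dict:
    return await merge_streams(
        {
            "GPT": fake_stream(["Clear ", "naming"], delay=0.01),
            "Claude": fake_stream(["Stable ", "contracts"], delay=0.015),
        },
        on_chunk=lambda label, chunk: print(f"[{label}] {chunk!r}"),
    )


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Because only the consumer prints, each line carries exactly one label, and the order of lines reflects the order in which chunks actually arrived.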

## 7. Advanced: Buffered Streaming

Buffer chunks before displaying for smoother output.

In [None]:
async def buffered_stream(model: iModel, messages: list, buffer_size: int = 5):
    """Buffer chunks before displaying"""
    print(f"Buffering every {buffer_size} chunks...\n")
    print("Response: ", end="", flush=True)

    buffer = []
    total_chunks = 0

    async for chunk in model.invoke_stream(model=model.backend.config.model, messages=messages):
        buffer.append(chunk)
        total_chunks += 1

        if len(buffer) >= buffer_size:
            # Flush buffer
            print("".join(buffer), end="", flush=True)
            buffer = []

    # Flush remaining
    if buffer:
        print("".join(buffer), end="", flush=True)

    print(f"\n\n✅ Buffered {total_chunks} chunks (flushed every {buffer_size} chunks)")


# Example usage
session = Session()
model = iModel(provider="openai", model="gpt-4o-mini", temperature=0.7)
session.services.register(model)

messages = [{"role": "user", "content": "Explain the benefits of microservices architecture"}]

await buffered_stream(model, messages, buffer_size=10)
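
The same buffering idea can also be packaged as a reusable async generator, so callers keep a plain `async for` loop and the batching logic lives in one place. This is a sketch, not a lionpride feature: `batched_chunks()` is a hypothetical helper and `fake_stream()` stands in for `model.invoke_stream()`.

```python
import asyncio


async def fake_stream():
    """Hypothetical stand-in for model.invoke_stream()."""
    for chunk in ["a", "b", "c", "d", "e", "f", "g"]:
        await asyncio.sleep(0)
        yield chunk


async def batched_chunks(chunks, size: int):
    """Yield the concatenation of every `size` chunks, plus any remainder."""
    if size < 1:
        raise ValueError("size must be at least 1")
    batch = []
    async for chunk in chunks:
        batch.append(chunk)
        if len(batch) >= size:
            yield "".join(batch)
            batch = []
    if batch:  # flush whatever is left when the stream ends
        yield "".join(batch)


async def collect(size: int) -> list:
    return [piece async for piece in batched_chunks(fake_stream(), size)]


if __name__ == "__main__":
    print(asyncio.run(collect(3)))
```

Larger batch sizes reduce the number of UI updates at the cost of a longer wait before each update.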

## 8. Error Handling in Streams

Handle errors gracefully during streaming.

In [None]:
async def safe_stream(model: iModel, messages: list):
    """Stream with error handling"""
    chunks = []

    try:
        print("Streaming with error handling...\n")
        print("Response: ", end="", flush=True)

        async for chunk in model.invoke_stream(model=model.backend.config.model, messages=messages):
            chunks.append(chunk)
            print(chunk, end="", flush=True)

        print("\n\n✅ Streaming completed successfully")
        print(f"Received {len(chunks)} chunks")

    except TimeoutError:
        print(f"\n\n⏱️  Streaming timeout after {len(chunks)} chunks")
        print("Partial response available")

    except Exception as e:
        print(f"\n\n❌ Streaming error: {type(e).__name__}: {e}")
        print(f"Received {len(chunks)} chunks before error")

    # Return whatever was received. Returning after the try block (rather than
    # inside `finally`) avoids silently swallowing cancellation or KeyboardInterrupt.
    response = "".join(chunks)
    print(f"\nResponse length: {len(response)} chars")
    return response


# Example usage
session = Session()
model = iModel(provider="openai", model="gpt-4o-mini", temperature=0.7)
session.services.register(model)

messages = [{"role": "user", "content": "Explain the concept of event-driven architecture"}]

result = await safe_stream(model, messages)
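
The `except TimeoutError` branch in `safe_stream()` only fires if something actually enforces a timeout. If the client library does not already do so, one option is to bound the wait for each individual chunk with `asyncio.wait_for()`. The sketch below assumes the stream is an async generator; `stalling_stream()` and `stream_with_chunk_timeout()` are hypothetical names used for illustration.

```python
import asyncio


async def stalling_stream():
    """Hypothetical stream that hangs after two chunks."""
    yield "Hello"
    yield " world"
    await asyncio.sleep(10)  # simulate a stalled connection
    yield "!"


async def stream_with_chunk_timeout(stream, timeout: float):
    """Return (text, completed). Stop if any single chunk takes longer than `timeout` seconds."""
    chunks = []
    completed = True
    iterator = stream.__aiter__()
    try:
        while True:
            try:
                chunk = await asyncio.wait_for(iterator.__anext__(), timeout)
            except StopAsyncIteration:
                break  # stream finished normally
            except asyncio.TimeoutError:
                completed = False  # keep the partial text collected so far
                break
            chunks.append(chunk)
    finally:
        # Async generators expose aclose(); other async iterators may not.
        aclose = getattr(iterator, "aclose", None)
        if aclose is not None:
            await aclose()
    return "".join(chunks), completed


if __name__ == "__main__":
    print(asyncio.run(stream_with_chunk_timeout(stalling_stream(), timeout=0.05)))
```

A per-chunk limit tolerates long responses while still catching a stalled connection, which a single limit on the whole stream cannot distinguish.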

## Key Concepts

### Streaming Architecture

```
LLM Provider → HTTP SSE/Streaming → lionpride iModel → invoke_stream() → async iterator → Your code
```
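
To make the middle of this pipeline concrete, the sketch below shows how raw server-sent event lines could be turned into the plain string chunks your code receives. It is not lionpride's implementation. The payload shape follows the OpenAI chat completions streaming format (`choices[0].delta.content`, terminated by `data: [DONE]`), other providers use different event shapes, and `fake_sse_lines()` and `text_chunks()` are hypothetical helpers.

```python
import asyncio
import json


async def fake_sse_lines():
    """Hypothetical stand-in for the lines of an HTTP SSE response body."""
    lines = [
        'data: {"choices": [{"delta": {"content": "Hel"}}]}',
        "",  # a blank line separates SSE events
        'data: {"choices": [{"delta": {"content": "lo"}}]}',
        "",
        'data: {"choices": [{"delta": {}}]}',  # a delta without text content
        "",
        "data: [DONE]",
    ]
    for line in lines:
        await asyncio.sleep(0)
        yield line


async def text_chunks(lines):
    """Turn raw SSE lines into plain string chunks."""
    async for line in lines:
        if not line.startswith("data:"):
            continue  # skip separators and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        delta = json.loads(payload)["choices"][0]["delta"]
        text = delta.get("content")
        if text:
            yield text


async def collect() -> list:
    return [chunk async for chunk in text_chunks(fake_sse_lines())]


if __name__ == "__main__":
    print(asyncio.run(collect()))
```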

### Stream vs Non-Stream

```python
# Non-streaming: Wait for complete response
model = iModel(provider="openai", model="gpt-4o-mini")
calling = await model.invoke(...)  # Returns Calling object
response = calling.execution.response.data

# Streaming: Iterate over chunks
async for chunk in model.invoke_stream(...):  # Yields string chunks
    process(chunk)
```

### Chunk Structure

```python
# Each chunk is a string (text content)
async for chunk in model.invoke_stream(...):
    print(chunk)  # Direct string output
```

## Common Pitfalls

### 1. Using invoke() instead of invoke_stream()

```python
# ❌ Wrong - invoke() returns Calling, not iterator
async for chunk in model.invoke(...):  # TypeError!
    ...

# ✅ Right - use invoke_stream() for streaming
async for chunk in model.invoke_stream(...):  # Works
    ...
```

### 2. Not flushing print buffer

```python
# ❌ Wrong - output may be buffered
print(chunk, end="")

# ✅ Right - flush immediately
print(chunk, end="", flush=True)
```

### 3. Blocking operations in stream loop

```python
# ❌ Wrong - blocks the event loop, stalling every other task
async for chunk in model.invoke_stream(...):
    time.sleep(0.1)  # Don't block!
    print(chunk)

# ✅ Right - use async sleep if needed
async for chunk in model.invoke_stream(...):
    await asyncio.sleep(0.1)  # Non-blocking
    print(chunk)
```
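
Sleeping is only one kind of blocking call. If each chunk needs synchronous work that cannot be rewritten as async (a CPU-light library call, a blocking write), `asyncio.to_thread()` (Python 3.9+) can run it in a worker thread so the event loop stays free. In this sketch `fake_stream()` stands in for `model.invoke_stream()` and `slow_postprocess()` is a hypothetical blocking step.

```python
import asyncio
import time


async def fake_stream():
    """Hypothetical stand-in for model.invoke_stream()."""
    for chunk in ["a", "b", "c"]:
        await asyncio.sleep(0)
        yield chunk


def slow_postprocess(chunk: str) -> str:
    """Hypothetical blocking step, e.g. a synchronous library call."""
    time.sleep(0.01)
    return chunk.upper()


async def process_stream() -> list:
    results = []
    async for chunk in fake_stream():
        # Runs in a worker thread; other tasks keep running meanwhile.
        results.append(await asyncio.to_thread(slow_postprocess, chunk))
    return results


if __name__ == "__main__":
    print(asyncio.run(process_stream()))
```

Note that the loop still waits for each result before requesting the next chunk, so this keeps other tasks responsive but does not speed up the stream itself.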

### 4. Not handling incomplete chunks

```python
# ❌ Wrong - assumes complete words
async for chunk in model.invoke_stream(...):
    words = chunk.split()  # May split mid-word!

# ✅ Right - accumulate and process complete units
buffer = ""
async for chunk in model.invoke_stream(...):
    buffer += chunk
    # Process complete sentences/paragraphs
```
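
One way to fill in the "process complete sentences" step is to split the buffer on sentence boundaries and keep the unfinished tail for the next chunk. The sketch below uses a deliberately naive boundary rule (`.`, `!` or `?` followed by whitespace), so abbreviations such as "e.g." will be split incorrectly. `fake_stream()` stands in for `model.invoke_stream()` and `sentences()` is a hypothetical helper.

```python
import asyncio
import re

# Naive rule: a sentence ends at ., ! or ? followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


async def fake_stream():
    """Hypothetical stand-in for model.invoke_stream(); chunks split mid-word."""
    for chunk in ["Stream", "ing is fa", "st. It fee", "ls respon", "sive! Done"]:
        await asyncio.sleep(0)
        yield chunk


async def sentences(chunks):
    """Yield complete sentences as soon as they are available."""
    buffer = ""
    async for chunk in chunks:
        buffer += chunk
        parts = SENTENCE_END.split(buffer)
        for sentence in parts[:-1]:  # everything but the last part is complete
            yield sentence
        buffer = parts[-1]  # keep the unfinished tail
    if buffer.strip():
        yield buffer.strip()  # flush the remainder when the stream ends


async def collect() -> list:
    return [s async for s in sentences(fake_stream())]


if __name__ == "__main__":
    for sentence in asyncio.run(collect()):
        print(sentence)
```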

## When to Use Streaming

### ✅ Use streaming for:

- Long responses (>100 tokens)
- Interactive applications
- Real-time feedback requirements
- Large-scale generation
- Better perceived performance

### ❌ Skip streaming for:

- Short responses (<50 tokens)
- Batch processing
- Responses needing validation before display
- Structured output (wait for complete JSON)
- When latency isn't critical
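
The structured-output case is easy to demonstrate: a partial JSON document does not parse, so nothing useful can be shown until the last chunk arrives. If streaming is used anyway, a reasonable approach is to accumulate silently and parse once at the end. In this sketch `fake_json_stream()` stands in for `model.invoke_stream()`, and the helper names are hypothetical.

```python
import asyncio
import json


async def fake_json_stream():
    """Hypothetical stand-in for model.invoke_stream() returning JSON text."""
    for chunk in ['{"title": "Stre', 'aming", "to', 'kens": 42}']:
        await asyncio.sleep(0)
        yield chunk


def is_valid_json(text: str) -> bool:
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


async def collect_json(chunks) -> dict:
    """Accumulate every chunk, then parse the complete document once."""
    parts = []
    async for chunk in chunks:
        parts.append(chunk)
    return json.loads("".join(parts))


if __name__ == "__main__":
    print(is_valid_json('{"title": "Stre'))  # the first chunk alone is not valid JSON
    print(asyncio.run(collect_json(fake_json_stream())))
```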

## Summary

You've learned how to:

1. ✅ **Stream responses** with `model.invoke_stream()`
2. ✅ **Iterate over chunks** with `async for`
3. ✅ **Display real-time output** with `flush=True`
4. ✅ **Track progress** with chunk counters
5. ✅ **Maintain context** across turns
6. ✅ **Build chat patterns** with streaming
7. ✅ **Stream in parallel** from multiple models
8. ✅ **Handle errors** gracefully

### Key Takeaways

- **Use `invoke_stream()`** for streaming (not `invoke()`)
- **First token latency** shapes perceived responsiveness more than total generation time
- **Always flush output** for real-time display
- **Buffer wisely** for smoother output
- **Handle errors** to preserve partial responses

### Next Steps

- **Notebook 08**: Error handling patterns
- **Cookbook**: [Error Handling](../docs/cookbook/error_handling.md)

**Streaming makes LLM applications feel faster and more responsive. Use it for better user experience!**