# 🧩 Mini-Lab: Streaming Responses

**Module 2: LLM Core Concepts** | **Duration: ~20 min** | **Type: Mini-Lab**

---

## Learning Objectives

By the end of this mini-lab, you will be able to:

1. **Understand** how autoregressive generation works (token-by-token)
2. **Understand** how streaming differs from regular API responses
3. **Implement** streaming with OpenAI's API
4. **Process** tokens as they arrive in real-time
5. **Build** interactive user experiences with streaming

## Target Concepts

| Concept | Description |
|---------|-------------|
| Autoregressive Generation | LLMs generate text one token at a time, each depending on previous tokens |
| Streaming | Receiving tokens incrementally as they're generated |

## 🧠 How LLMs Generate Text: Autoregressive Generation

**LLMs generate text one token at a time.** This is called **autoregressive generation**:

```
Prompt: "The capital of France is"

Step 1: Model predicts → "Paris" (based on prompt)
Step 2: Model predicts → "." (based on prompt + "Paris")
Step 3: Model predicts → "\n" (based on prompt + "Paris" + ".")
...and so on until a stop condition is met
```

### Why This Matters

1. **Each token depends on ALL previous tokens** - the model "reads" the prompt plus everything generated so far before predicting the next token
2. **Generation is sequential** - you can't generate token 5 before tokens 1-4
3. **This is why streaming is possible** - we can send each token as it's generated
4. **This is why context matters** - earlier tokens influence all later predictions
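
The loop described above can be sketched with a toy stand-in for the model. Everything here is hypothetical: `TOY_MODEL` is a hand-written lookup table rather than a real LLM, and `predict_next` and `generate` are illustrative names. A real model scores every token in its vocabulary and samples from that distribution, but the feed-the-output-back-in loop has the same shape.

```python
EOS = "<eos>"  # hypothetical end-of-sequence marker

# Toy "model": maps the full context seen so far to the next token.
TOY_MODEL = {
    "The capital of France is": " Paris",
    "The capital of France is Paris": ".",
    "The capital of France is Paris.": EOS,
}

def predict_next(context):
    """Return the next token for the given context (EOS if unknown)."""
    return TOY_MODEL.get(context, EOS)

def generate(prompt, max_new_tokens=10):
    """Yield tokens one at a time, feeding each one back in as context."""
    context = prompt
    for _ in range(max_new_tokens):
        token = predict_next(context)
        if token == EOS:      # stop condition
            return
        context += token      # the new token becomes part of the input
        yield token           # a streaming API could send this token now

for token in generate("The capital of France is"):
    print(repr(token))
```

Because `generate` is a generator, the caller receives each token as soon as it is produced, which is the property streaming relies on.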

### Visualizing the Process

```
Time →
┌─────┬──────┬───────┬──────┬─────┬─────┐
│ The │ cap  │ ital  │ of   │ Fr  │ance │  ← Input (prompt)
└─────┴──────┴───────┴──────┴─────┴─────┘
                                        ↓
                                    ┌───────┐
                                    │ is    │  Token 1 (generated)
                                    └───────┘
                                            ↓
                                        ┌───────┐
                                        │ Paris │  Token 2 (generated)
                                        └───────┘
                                                ↓
                                            ┌───┐
                                            │ . │  Token 3 (generated)
                                            └───┘
```

**Streaming exposes this token-by-token generation to your application!**

## Why Streaming Matters

- **Reduced perceived latency**: Users see output immediately
- **Better UX**: Feels more interactive and responsive
- **Early termination**: Can stop generation if output is wrong
- **Real-time processing**: Process tokens as they arrive

## 1. Setup

In [1]:
import os
import time
from dotenv import load_dotenv
from openai import OpenAI
from IPython.display import display, Markdown, clear_output

load_dotenv()
client = OpenAI()

def md(text):
    display(Markdown(text))

print("✓ Setup complete")

✓ Setup complete


## 2. Non-Streaming vs Streaming Comparison

First, let's compare the two approaches:

In [2]:
def compare_latency(prompt):
    """Compare perceived latency between streaming and non-streaming."""
    
    print("\n📊 Latency Comparison")
    print("="*50)
    print(f"Prompt: {prompt}\n")
    
    # Non-streaming: measure time to first character
    print("1️⃣ NON-STREAMING:")
    start = time.time()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=200,
        stream=False
    )
    first_char_time = time.time() - start
    total_time = first_char_time  # Same as total for non-streaming
    content = response.choices[0].message.content
    print(f"   Time to first character: {first_char_time:.2f}s")
    print(f"   Total time: {total_time:.2f}s")
    print(f"   Response length: {len(content)} chars")
    
    # Streaming: measure time to first token
    print("\n2️⃣ STREAMING:")
    start = time.time()
    stream = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=200,
        stream=True
    )
    
    first_token_time = None
    streamed_content = ""
    
    for chunk in stream:
        if chunk.choices[0].delta.content:
            if first_token_time is None:
                first_token_time = time.time() - start
            streamed_content += chunk.choices[0].delta.content
    
    total_time_stream = time.time() - start
    
    if first_token_time is None:
        # No content arrived, so there is nothing to compare
        print("   No content received from the stream")
        return
    
    print(f"   Time to first character: {first_token_time:.2f}s")
    print(f"   Total time: {total_time_stream:.2f}s")
    print(f"   Response length: {len(streamed_content)} chars")
    
    # Calculate improvement
    improvement = ((first_char_time - first_token_time) / first_char_time) * 100
    print(f"\n✨ Streaming is {improvement:.0f}% faster for first character!")

compare_latency("Explain the concept of machine learning in 3 sentences.")


📊 Latency Comparison
Prompt: Explain the concept of machine learning in 3 sentences.

1️⃣ NON-STREAMING:
   Time to first character: 1.71s
   Total time: 1.71s
   Response length: 464 chars

2️⃣ STREAMING:
   Time to first character: 0.31s
   Total time: 1.51s
   Response length: 473 chars

✨ Streaming is 82% faster for first character!
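
The 82% figure comes from the two time-to-first-character values in the run above. The arithmetic is small enough to check by hand, as in this sketch. `first_output_speedup` is a hypothetical helper name, not part of the lab code.

```python
def first_output_speedup(blocking_s, first_token_s):
    """Percent reduction in the wait before the first visible output."""
    if first_token_s is None:
        raise ValueError("the stream produced no content")
    return (blocking_s - first_token_s) / blocking_s * 100

# Values copied from the run above: 1.71s non-streaming, 0.31s streaming
print(f"{first_output_speedup(1.71, 0.31):.0f}%")  # 82%
```

The total times in the same run (1.71s and 1.51s) are close to each other. Streaming changes when the first output becomes visible, not how long the model takes to generate the whole response. The two requests also returned different text (464 vs 473 characters), so a single run like this is an illustration, not a benchmark.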


## 3. Basic Streaming Implementation

In [15]:
def basic_streaming(prompt):
    """Basic streaming demonstration."""
    
    print(f"📝 Prompt: {prompt}\n")
    print("📤 Streaming response:\n")
    
    stream = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=200,
        stream=True
    )
    
    full_response = ""
    
    for chunk in stream:
        # Each chunk contains a delta (partial content)
        if chunk.choices[0].delta.content:
            content = chunk.choices[0].delta.content
            # Option 1: stream with print (no trailing newline)
            # print(content, end="", flush=True)
            full_response += content

            # Option 2: stream as rendered markdown:
            # clear and re-render for a live-update effect
            clear_output(wait=True)
            md(full_response)

    print("\n")  # Final newline
    
    return full_response

basic_streaming("List 5 benefits of learning programming, briefly.");

Here are five benefits of learning programming:

1. **Problem-Solving Skills**: Programming enhances logical thinking and problem-solving abilities, enabling individuals to break down complex issues into manageable parts and find effective solutions.

2. **Career Opportunities**: Proficiency in programming opens doors to numerous career paths in technology, software development, data analysis, and more, often accompanied by high demand and competitive salaries.

3. **Creativity and Innovation**: Learning to code allows individuals to bring their ideas to life, fostering creativity and the ability to create unique applications, websites, and software solutions.

4. **Understanding Technology**: Knowledge of programming helps individuals understand how technology works, making them more adept at utilizing and adapting to new technologies in an increasingly digital world.

5. **Collaboration and Teamwork**: Programming often involves working in teams, leading to improved collaboration skills as individuals learn to communicate ideas and work together efficiently on projects.
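
The `if chunk.choices[0].delta.content:` check in the loop matters because not every chunk carries text. The sketch below wraps that check in a small generator and exercises it offline. `iter_text` and `fake_chunk` are hypothetical names, and the fake chunks only mimic the attributes this lab reads (`choices`, `delta.content`); they are not real API objects.

```python
from types import SimpleNamespace

def iter_text(stream):
    """Yield only the non-empty text deltas from a chat-completion stream."""
    for chunk in stream:
        if not chunk.choices:      # e.g. a usage-only chunk with no choices
            continue
        text = chunk.choices[0].delta.content
        if text:                   # skips both None and ""
            yield text

def fake_chunk(content=None, with_choice=True):
    """Hypothetical stand-in for a streamed chunk, for offline testing."""
    delta = SimpleNamespace(content=content)
    choices = [SimpleNamespace(delta=delta)] if with_choice else []
    return SimpleNamespace(choices=choices)

fake_stream = [
    fake_chunk(""),                 # opening chunk with empty content
    fake_chunk("Hel"),
    fake_chunk("lo"),
    fake_chunk(None),               # closing chunk without content
    fake_chunk(with_choice=False),  # chunk with an empty choices list
]
print("".join(iter_text(fake_stream)))  # Hello
```

With a helper like this, the display loop reduces to `for text in iter_text(stream): ...`, and the guard logic lives in one place.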





## 4. Interactive Streaming with Live Updates

In [16]:
def interactive_streaming(prompt):
    """Stream with live markdown rendering in Jupyter."""
    
    stream = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "user", "content": prompt + "\n\nFormat your response in markdown."}
        ],
        max_tokens=300,
        stream=True
    )
    
    full_response = ""
    
    for chunk in stream:
        if chunk.choices[0].delta.content:
            content = chunk.choices[0].delta.content
            full_response += content
            
            # Clear and re-render for live update effect
            clear_output(wait=True)
            md(f"**Prompt:** *{prompt}*\n\n---\n\n{full_response}▌")
    
    # Final render without cursor
    clear_output(wait=True)
    md(f"**Prompt:** *{prompt}*\n\n---\n\n{full_response}")
    
    return full_response

interactive_streaming("What are the key principles of good software design?");

**Prompt:** *What are the key principles of good software design?*

---

# Key Principles of Good Software Design

Good software design is crucial for creating maintainable, scalable, and robust software applications. Here are some key principles to consider:

## 1. Separation of Concerns
- **Definition**: Divide a program into distinct sections, each addressing a separate concern or functionality.
- **Benefit**: Enhances maintainability and allows for easier debugging and testing.

## 2. DRY (Don't Repeat Yourself)
- **Definition**: Avoid duplication of code and logic.
- **Benefit**: Reduces redundancy, minimizes errors, and makes changes easier to implement across the codebase.

## 3. KISS (Keep It Simple, Stupid)
- **Definition**: Design systems in the simplest way possible; avoid over-complication.
- **Benefit**: Simplifies understanding and reduces the chance of introducing bugs.

## 4. YAGNI (You Aren't Gonna Need It)
- **Definition**: Do not add functionality until it is necessary.
- **Benefit**: Prevents bloat and ensures that the software remains focused on the actual requirements.

## 5. Single Responsibility Principle (SRP)
- **Definition**: A module or class should have only one reason to change, meaning it should only have one responsibility.
- **Benefit**: Enhances code clarity and minimizes the impact of changes.

## 6. Open/Closed Principle
- **Definition**: Software entities (classes, modules, functions, etc
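
The response above stops mid-sentence, which is what hitting the `max_tokens=300` limit looks like. Streamed chunks also carry a `finish_reason` field on each choice, and a value of `"length"` indicates the token limit ended the output. The following is a minimal sketch for detecting that case. `collect` and `fake_chunk` are hypothetical names, and the fake chunks only imitate the fields used here.

```python
from types import SimpleNamespace

def collect(stream):
    """Return (text, finish_reason) from a stream of chat-completion chunks."""
    parts = []
    finish_reason = None
    for chunk in stream:
        if not chunk.choices:
            continue
        choice = chunk.choices[0]
        if choice.delta.content:
            parts.append(choice.delta.content)
        if choice.finish_reason is not None:
            finish_reason = choice.finish_reason
    return "".join(parts), finish_reason

def fake_chunk(content, finish_reason=None):
    """Hypothetical stand-in for a streamed chunk, for offline testing."""
    delta = SimpleNamespace(content=content)
    choice = SimpleNamespace(delta=delta, finish_reason=finish_reason)
    return SimpleNamespace(choices=[choice])

stream = [
    fake_chunk("Software entities (classes, modules, functions, etc"),
    fake_chunk(None, finish_reason="length"),
]
text, reason = collect(stream)
if reason == "length":
    print("Output was cut off by the token limit")
```

A UI could use this signal to show a "response truncated" note or to offer a "continue" action instead of leaving a dangling sentence.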

## 5. Processing Streamed Content

Process tokens as they arrive for various use cases:

In [5]:
def stream_with_word_detection(prompt, watch_words):
    """Stream and detect specific words in real-time."""
    
    print(f"📝 Prompt: {prompt}")
    print(f"👀 Watching for: {watch_words}\n")
    
    stream = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=200,
        stream=True
    )
    
    full_response = ""
    found_words = []
    
    for chunk in stream:
        if chunk.choices[0].delta.content:
            content = chunk.choices[0].delta.content
            full_response += content
            
            # Check for watch words
            for word in watch_words:
                if word.lower() in full_response.lower() and word not in found_words:
                    found_words.append(word)
                    print(f"\n🎯 Found '{word}'!")
            
            print(content, end="", flush=True)
    
    print(f"\n\n📊 Summary: Found {len(found_words)}/{len(watch_words)} watch words")
    
    # Display as rendered markdown for better readability
    print("\n📋 Rendered Output:\n")
    md(full_response)
    
    return full_response, found_words

stream_with_word_detection(
    "Explain the benefits of neural networks and deep learning for AI applications.",
    ["neural", "deep", "learning", "training", "model"]
)

📝 Prompt: Explain the benefits of neural networks and deep learning for AI applications.
👀 Watching for: ['neural', 'deep', 'learning', 'training', 'model']

Ne
🎯 Found 'neural'!
ural networks and
🎯 Found 'deep'!
 deep
🎯 Found 'learning'!
 learning have revolutionized the field of artificial intelligence (AI) by providing powerful tools for tackling complex problems. Here are some of the key benefits of these technologies in AI applications:

1. **High Performance on Complex Tasks**:
   - Neural networks, especially deep learning
🎯 Found 'model'!
 models, excel at handling complex tasks such as image recognition, natural language processing, and game playing. Their ability to model intricate patterns and relationships in data allows them to achieve state-of-the-art performance in various domains.

2. **Automated Feature Extraction**:
   - Traditional machine learning techniques often require manual feature engineering, which can be time-consuming and requires domain e

Neural networks and deep learning have revolutionized the field of artificial intelligence (AI) by providing powerful tools for tackling complex problems. Here are some of the key benefits of these technologies in AI applications:

1. **High Performance on Complex Tasks**:
   - Neural networks, especially deep learning models, excel at handling complex tasks such as image recognition, natural language processing, and game playing. Their ability to model intricate patterns and relationships in data allows them to achieve state-of-the-art performance in various domains.

2. **Automated Feature Extraction**:
   - Traditional machine learning techniques often require manual feature engineering, which can be time-consuming and requires domain expertise. Deep learning models automatically learn to extract relevant features from raw data, reducing the need for manual intervention and allowing for more efficient model development.

3. **Scalability**:
   - Neural networks can scale effectively with the availability of data. As the amount of training data increases, deep learning models can continue to improve their performance, making them

('Neural networks and deep learning have revolutionized the field of artificial intelligence (AI) by providing powerful tools for tackling complex problems. Here are some of the key benefits of these technologies in AI applications:\n\n1. **High Performance on Complex Tasks**:\n   - Neural networks, especially deep learning models, excel at handling complex tasks such as image recognition, natural language processing, and game playing. Their ability to model intricate patterns and relationships in data allows them to achieve state-of-the-art performance in various domains.\n\n2. **Automated Feature Extraction**:\n   - Traditional machine learning techniques often require manual feature engineering, which can be time-consuming and requires domain expertise. Deep learning models automatically learn to extract relevant features from raw data, reducing the need for manual intervention and allowing for more efficient model development.\n\n3. **Scalability**:\n   - Neural networks can scale 
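
The detector above uses substring matching on the accumulated text, so `model` is reported when `models` arrives, and a word split across chunks (`Ne` + `ural`) is matched as soon as its last piece lands. If whole-word matches are wanted instead, one option is a regex with word boundaries that ignores a trailing word that may still be growing. This is a sketch of that alternative, not the lab's method; `detect_words` is a hypothetical name and it takes plain text pieces rather than API chunks.

```python
import re

def detect_words(pieces, watch_words):
    """Return (text, found): watch words seen as whole words, in order found."""
    patterns = {
        w: re.compile(rf"\b{re.escape(w)}\b", re.IGNORECASE) for w in watch_words
    }
    text = ""
    found = []

    def scan(candidate):
        for word, pattern in patterns.items():
            if word not in found and pattern.search(candidate):
                found.append(word)

    for piece in pieces:
        text += piece
        # Drop a trailing partial word so "Ne" (before "ural") cannot match yet
        scan(re.sub(r"\w+\Z", "", text))
    scan(text)  # the stream is over, so the last word is complete
    return text, found

pieces = ["Ne", "ural networks and", " deep", " learning", " models", " excel."]
text, found = detect_words(pieces, ["neural", "deep", "learning", "training", "model"])
print(found)  # ['neural', 'deep', 'learning']
```

Here `model` is not reported because only `models` appears. Whether that is the desired behavior depends on the use case; the substring version is simpler and catches plural forms.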

In [17]:
def stream_with_early_stop(prompt, stop_phrase):
    """Stream and stop when a specific phrase is detected."""
    
    print(f"📝 Prompt: {prompt}")
    print(f"🛑 Will stop at: '{stop_phrase}'\n")
    
    stream = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=500,
        stream=True
    )
    
    full_response = ""
    stopped_early = False
    
    for chunk in stream:
        if chunk.choices[0].delta.content:
            content = chunk.choices[0].delta.content
            full_response += content
            print(content, end="", flush=True)
            
            # Check for stop phrase
            if stop_phrase.lower() in full_response.lower():
                print(f"\n\n⏹️ Stopped: Found '{stop_phrase}'")
                stopped_early = True
                break
    
    if not stopped_early:
        print("\n\n✅ Completed without finding stop phrase")
    
    return full_response, stopped_early

stream_with_early_stop(
    "Count from 1 to 20, writing each number on a new line.",
    stop_phrase="10"
);

📝 Prompt: Count from 1 to 20, writing each number on a new line.
🛑 Will stop at: '10'

Sure! Here you go:

1  
2  
3  
4  
5  
6  
7  
8  
9  
10

⏹️ Stopped: Found '10'
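
In the function above, the chunk that completes the stop phrase is printed and stored in full, so any text after the phrase in that same chunk is kept. The early-stop logic can be separated from the API call and made to cut the text exactly at the phrase, as in this sketch. `take_until` and `counting` are hypothetical names, and the generator stands in for a real stream.

```python
def take_until(pieces, stop_phrase):
    """Accumulate text until stop_phrase appears; cut right after the phrase."""
    text = ""
    needle = stop_phrase.lower()
    for piece in pieces:
        text += piece
        idx = text.lower().find(needle)
        if idx != -1:
            # Stop consuming the stream and drop anything after the phrase
            return text[: idx + len(stop_phrase)], True
    return text, False

def counting():
    """Hypothetical stand-in for a stream: yields '1\\n', '2\\n', ... '20\\n'."""
    for n in range(1, 21):
        yield f"{n}\n"

text, stopped = take_until(counting(), "10")
print(repr(text[-5:]), stopped)
```

Searching the accumulated text rather than the single piece means a phrase split across two chunks is still found. With the real client, leaving the loop stops reading from the connection; recent versions of the `openai` package also appear to expose `close()` on the stream object, which is worth verifying against the installed version before relying on it.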


## 6. Token Counting During Streaming

In [18]:
def stream_with_stats(prompt, max_tokens=200):
    """Stream with real-time statistics."""
    
    start_time = time.time()
    
    stream = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        stream=True,
        stream_options={"include_usage": True}  # Get token counts
    )
    
    full_response = ""
    chunk_count = 0
    first_token_time = None
    usage = None
    
    for chunk in stream:
        chunk_count += 1
        
        # Capture usage from final chunk
        if chunk.usage:
            usage = chunk.usage
        
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_time is None:
                first_token_time = time.time() - start_time
            
            content = chunk.choices[0].delta.content
            full_response += content
            print(content, end="", flush=True)
    
    total_time = time.time() - start_time
    
    print(f"\n\n{'='*50}")
    print("📊 Streaming Statistics:")
    print(f"   Chunks received: {chunk_count}")
    print(f"   Time to first token: {first_token_time:.3f}s")
    print(f"   Total time: {total_time:.3f}s")
    print(f"   Characters: {len(full_response)}")
    
    if usage:
        print(f"   Prompt tokens: {usage.prompt_tokens}")
        print(f"   Completion tokens: {usage.completion_tokens}")
        tokens_per_second = usage.completion_tokens / total_time
        print(f"   Speed: {tokens_per_second:.1f} tokens/second")
    
    # Display as rendered markdown for better readability
    print(f"\n{'='*50}")
    print("📋 Rendered Output:\n")
    md(full_response)
    
    return full_response

stream_with_stats("Write a short poem about coding.");

In lines of logic, dreams take flight,  
With keystrokes dancing, day and night.  
A whisper of syntax, a spark of thought,  
Building worlds with the codes that we’ve wrought.  

Loops and functions, a rhythm so clear,  
Debugging the chaos, conquering fear.  
Each semicolon like a heartbeat's pause,  
In the realm of pixels, we’re the hidden cause.  

From zero to one, creation's embrace,  
In virtual landscapes, we find our place.  
So here's to the coders, both near and far,  
Crafting the future, one line - a star.

📊 Streaming Statistics:
   Chunks received: 132
   Time to first token: 0.490s
   Total time: 2.488s
   Characters: 523
   Prompt tokens: 14
   Completion tokens: 129
   Speed: 51.8 tokens/second

📋 Rendered Output:



In lines of logic, dreams take flight,  
With keystrokes dancing, day and night.  
A whisper of syntax, a spark of thought,  
Building worlds with the codes that we’ve wrought.  

Loops and functions, a rhythm so clear,  
Debugging the chaos, conquering fear.  
Each semicolon like a heartbeat's pause,  
In the realm of pixels, we’re the hidden cause.  

From zero to one, creation's embrace,  
In virtual landscapes, we find our place.  
So here's to the coders, both near and far,  
Crafting the future, one line - a star.
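
The reported speed divides completion tokens by the total time, which includes the wait for the first token. Subtracting that wait gives a second, higher figure for the generation phase alone. The sketch below reproduces both numbers from the statistics above; `tokens_per_second` is a hypothetical helper name.

```python
def tokens_per_second(completion_tokens, total_s, first_token_s=0.0):
    """Throughput over the interval that starts at first_token_s."""
    return completion_tokens / (total_s - first_token_s)

# Values copied from the run above
end_to_end = tokens_per_second(129, 2.488)          # includes time to first token
decode_only = tokens_per_second(129, 2.488, 0.490)  # generation phase only

print(f"{end_to_end:.1f} tokens/second end to end")
print(f"{decode_only:.1f} tokens/second after the first token")
```

The run also shows 132 chunks for 129 completion tokens. That is consistent with roughly one token per content chunk plus a few chunks that carry no text (such as the usage chunk requested via `stream_options`), though chunk boundaries are not guaranteed to line up with tokens.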

## 7. Building a Chat Interface with Streaming

In [32]:
class StreamingChat:
    """Simple streaming chat interface."""
    
    def __init__(self, system_prompt="You are a helpful assistant."):
        self.messages = [{"role": "system", "content": system_prompt}]
        self.client = OpenAI()
    
    def chat(self, user_message):
        """Send message and stream response."""
        
        # Add user message
        self.messages.append({"role": "user", "content": user_message})
        
        print(f"👤 You: {user_message}")
        print(f"🤖 Assistant: ", end="")
        
        # Stream response
        stream = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=self.messages,
            max_tokens=300,
            stream=True
        )
        
        assistant_response = ""
        
        # Create a display handle for live updates
        display_handle = display(Markdown(""), display_id=True)
        
        for chunk in stream:
            if chunk.choices[0].delta.content:
                content = chunk.choices[0].delta.content
                assistant_response += content
                
                # Update the specific display without clearing other output
                display_handle.update(Markdown(assistant_response + "▌"))
        
        # Final update without cursor
        display_handle.update(Markdown(assistant_response))
        print()  # Add spacing after response
        
        # Add assistant response to history
        self.messages.append({"role": "assistant", "content": assistant_response})
        
        return assistant_response
    
    def clear_history(self):
        """Clear conversation history, keeping system prompt."""
        self.messages = [self.messages[0]]
        print("🗑️ History cleared")

chat = StreamingChat("You are a helpful coding tutor. Keep responses concise.")
chat.chat("What is a function in programming?");
chat.chat("Can you show a simple example in Python?");


👤 You: What is a function in programming?
🤖 Assistant: 

A function in programming is a reusable block of code that performs a specific task. It can take inputs, known as parameters, and may return a result. Functions help organize code, improve readability, and facilitate code reuse.


👤 You: Can you show a simple example in Python?
🤖 Assistant: 

Certainly! Here’s a simple example of a function in Python that adds two numbers:

```python
def add_numbers(a, b):
    return a + b

# Using the function
result = add_numbers(3, 5)
print(result)  # Output: 8
```

In this example, `add_numbers` is a function that takes two parameters, `a` and `b`, and returns their sum.




In [31]:
chat.chat("What about functions with parameters?");

👤 You: What about functions with parameters?
🤖 Assistant: 

It seems like you might be asking for a deeper exploration of functions with parameters. Here's a more detailed look at the topic:

### Types of Parameters in Functions

1. **Required Parameters**: These must be provided when the function is called.
2. **Optional Parameters**: You can provide default values.
3. **Variable-Length Parameters**: You can allow for an arbitrary number of arguments.

### Examples

1. **Required Parameters**:

   ```python
   def subtract(a, b):
       return a - b

   result = subtract(10, 4)  # Output: 6
   ```

2. **Optional Parameters** (with default values):

   ```python
   def multiply(a, b=2):
       return a * b

   print(multiply(5))    # Output: 10 (5 * 2)
   print(multiply(5, 3)) # Output: 15 (5 * 3)
   ```

3. **Variable-Length Parameters**:

   ```python
   def summarize(*args):
       return sum(args)

   print(summarize(1, 2, 3))          # Output: 6
   print(summarize(1, 2, 3, 4, 5))    # Output: 15
   ```

### Explanation

- **Required Parameters**: The function must receive exactly the specified parameters.
- **Optional Parameters**: If an




## 🎯 Summary

### Key Takeaways

1. **Streaming Benefits**
   - Faster perceived response time
   - Better user experience
   - Ability to process/stop early

2. **Implementation**
   - Set `stream=True` in API call
   - Iterate over chunks
   - Access content via `chunk.choices[0].delta.content`

3. **Common Patterns**
   - Live markdown rendering
   - Word/phrase detection
   - Early termination
   - Token counting with `stream_options`

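4. **Best Practices**
   - Pass `flush=True` when printing chunks so text appears immediately instead of waiting in the output buffer
   - Skip chunks whose `delta.content` is `None` or empty, and check that `chunk.choices` is non-empty when `include_usage` is set
   - Track both the raw chunk count and the accumulated content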

### Next Steps

- **mini-model-compare**: Compare streaming across models
- **lab-llm-playground**: Build complete interactive playground
- **Module 9**: Implement streaming in production APIs