# Real-Time Interaction with Streaming

Hello everyone. So far, our API calls have followed a standard pattern: we send a request, wait for the *entire* response to be generated, and then receive it all at once. This is fine for quick tasks, but for longer generations, it can lead to a poor user experience with a lot of waiting.

This is where **streaming** comes in. Instead of waiting for the full response, streaming lets us receive the output **a few tokens at a time, as they're being generated.** This is how applications like ChatGPT feel so fast and responsive.

Let's see how to implement this simple but powerful feature.

## Setup

To appreciate the benefit of streaming, we need a task that isn't instantaneous. We'll ask the model to explain a programming concept, which takes a few seconds and makes the difference between blocking and streaming very clear. Responses are capped at `MAX_TOKENS = 200`, so the sample outputs below end mid-sentence.

In [16]:
import time
import litellm
from textwrap import dedent
from dotenv import load_dotenv

load_dotenv()

MODEL_NAME = "openai/gpt-4o-mini"
MAX_TOKENS = 200

long_prompt = [
    {
        "role": "user",
        "content": dedent("""
            You are a senior Python developer tutoring a junior colleague.
            Explain the difference between concurrency and parallelism in Python.
            Keep it concise.
        """)
    }
]

## The Blocking Request

First, let's make a standard API call. When you run the cell below, pay close attention to the pause between when you execute it and when the full text appears. This waiting period is what we want to eliminate.

In [17]:
print("--- Making a standard blocking call---")
start_time = time.perf_counter()

response_blocking = litellm.completion(
    model=MODEL_NAME,
    messages=long_prompt,
    max_tokens=MAX_TOKENS
)

content = response_blocking.choices[0].message.content
end_time = time.perf_counter()

print(f"Request completed in {end_time - start_time:.4f} seconds.")
print("\nFull response:")
print(content)

--- Making a standard blocking call---
Request completed in 5.5574 seconds.

Full response:
Certainly! 

**Concurrency** and **Parallelism** are both concepts related to executing multiple tasks, but they have distinct meanings:

1. **Concurrency**:
   - Refers to the ability of a program to manage multiple tasks at once. This doesn’t necessarily mean they are executed simultaneously; rather, they may progress independently within the same time frame.
   - In Python, concurrency is often achieved using frameworks like `asyncio`, which allows for handling asynchronous operations. The tasks share a single thread but can yield control while waiting for resources, making it seem like they are running at the same time.

2. **Parallelism**:
   - Involves executing multiple tasks simultaneously, leveraging multiple CPU cores. This means tasks are truly running at the same time, which can lead to significant performance enhancements for CPU-bound tasks.
   - In Python, parallelism can be achie

## The Streaming Request

Now, let's make the exact same request, but this time we'll add one crucial parameter: `stream=True`.

When we do this, the function no longer returns a complete response object. Instead, it returns an iterable stream that behaves like a **generator**. We can loop over it to get each "chunk" of the response as it's created, so the first words appear on screen almost immediately. The total generation time typically stays about the same; what changes is how soon the user starts seeing output.
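
To make the generator behavior concrete without calling any API, here is a minimal sketch. `fake_token_stream` and `collect_stream` are hypothetical helpers, not part of litellm: the first stands in for the model by yielding one word at a time after a small artificial delay, and the second prints each piece as soon as it arrives.

```python
import time
from typing import Iterator


def fake_token_stream(text: str, delay: float = 0.05) -> Iterator[str]:
    """Hypothetical stand-in for a streaming model: yields one word at a time."""
    for word in text.split(" "):
        time.sleep(delay)  # simulate per-token generation time
        yield word + " "


def collect_stream(stream: Iterator[str]) -> str:
    """Print each piece as it arrives and return the assembled text."""
    pieces = []
    for piece in stream:
        print(piece, end="", flush=True)
        pieces.append(piece)
    print()
    return "".join(pieces)


if __name__ == "__main__":
    collect_stream(fake_token_stream("Streaming shows text as it is generated."))
```

Nothing is produced until the loop asks for the next piece, which is why output can start appearing before the full text exists.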

In [19]:
print("--- Making a streaming call---")
start_time = time.perf_counter()
full_response_streaming = ""

response_streaming = litellm.completion(
    model=MODEL_NAME,
    messages=long_prompt,
    max_tokens=MAX_TOKENS,
    stream=True
)

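for chunk in response_streaming:
    # Each chunk carries only the newly generated text. It can be None or
    # empty (for instance on the final chunk), so skip those.
    chunk_content = chunk.choices[0].delta.content
    if chunk_content:
        print(chunk_content, end="", flush=True)
        full_response_streaming += chunk_content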

end_time = time.perf_counter()

print(f"\n\nRequest completed in {end_time - start_time:.4f} seconds.")

--- Making a streaming call---
Certainly! 

**Concurrency** and **parallelism** are two concepts often used in programming, particularly in Python, but they serve different purposes.

- **Concurrency** is about dealing with many tasks at the same time but not necessarily executing them simultaneously. This means the system can handle multiple tasks by switching between them, making progress on each task as resources become available. In Python, this is often achieved using asynchronous programming (e.g., `asyncio`) or threading. However, due to Global Interpreter Lock (GIL), only one thread can execute Python bytecode at a time in CPython, which limits true parallel execution in threads.

- **Parallelism**, on the other hand, involves executing multiple tasks at the exact same time, which is possible when you have multiple processors or cores. In Python, this can be achieved using the `multiprocessing` module, which creates separate processes that can run independently and truly execut
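
Total time is not the number that streaming improves; time to first token is. The following is a minimal sketch of measuring both. `consume_stream` and `fake_chunk` are hypothetical helpers, not litellm functions. The sketch only assumes that each chunk exposes `choices[0].delta.content`, as in the streaming loop above, so the demo feeds it hand-built fake chunks instead of making a real API call.

```python
import time
from types import SimpleNamespace
from typing import Iterable, Optional, Tuple


def consume_stream(chunks: Iterable) -> Tuple[str, Optional[float], float]:
    """Return (full_text, seconds_to_first_token, total_seconds)."""
    start = time.perf_counter()
    first_token_at = None
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # defensively skip chunks with no choices
            continue
        piece = chunk.choices[0].delta.content
        if not piece:  # None or "" carries no text
            continue
        if first_token_at is None:
            first_token_at = time.perf_counter() - start
        print(piece, end="", flush=True)
        parts.append(piece)
    total = time.perf_counter() - start
    return "".join(parts), first_token_at, total


def fake_chunk(text):
    """Build a hypothetical chunk shaped like the ones in the loop above."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


if __name__ == "__main__":
    demo = [fake_chunk("Hello"), fake_chunk(None), fake_chunk(", world")]
    text, ttft, total = consume_stream(demo)
    print(f"\nFirst token after {ttft:.4f}s, finished after {total:.4f}s")
```

With a real request, the streaming object returned by `litellm.completion(..., stream=True)` could be passed in place of `demo`. A stream can only be consumed once, so make a fresh request before measuring.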