## Streaming

*(Coding along with the [Anthropic API fundamentals](https://github.com/anthropics/courses/tree/master/anthropic_api_fundamentals) course in Anthropic's courses GitHub repo)*

The goals of this section are to understand how streaming works and to work with stream events.

## Basic setup

In [1]:
# https://github.com/anthropics/courses/blob/master/anthropic_api_fundamentals/05_Streaming.ipynb
from anthropic import Anthropic
import pandas as pd

anthropic_api_key = pd.read_csv("~/tmp/anthropic/anthropic-key-1.txt", sep=" ", header=None)[0][0]
print("Don't be a fool and send your API key to GitHub")

# instantiating the client
client = Anthropic(api_key=anthropic_api_key)
MODEL_NAME = "claude-3-5-sonnet-20241022"

Don't be a fool and send your API key to GitHub


In [2]:
# just as a recap, the syntax we've used so far:
response = client.messages.create(
    messages=[
        {
            "role": "user",
            "content": "Write me an essay about macaws and clay licks in the Amazon",
        }
    ],
    model=MODEL_NAME,
    max_tokens=800,
    temperature=0,
)
print("We have a response back!")
print("========================")
print(response.content[0].text)

We have a response back!
Here's an essay about macaws and clay licks in the Amazon:

Macaws and Clay Licks: A Fascinating Natural Phenomenon in the Amazon

Deep within the Amazon rainforest, one of nature's most spectacular displays occurs daily at clay licks, where hundreds of brilliant scarlet macaws, blue-and-yellow macaws, and other parrot species gather to consume clay from exposed riverbank walls. This remarkable behavior, known as geophagy, is not merely a curious habit but rather a vital aspect of these birds' survival strategy.

Clay licks, locally known as "collpas," are natural clay banks typically found along river edges throughout the Amazon basin, particularly in Peru, Brazil, and Ecuador. These sites serve as essential mineral supplementation stations for various species of parrots, but macaws are perhaps the most striking visitors due to their large size and vibrant plumage.

The primary reason macaws engage in clay consumption is to neutralize toxic compounds found in 

With the approach we just used, we only get content back from the API once all of it has been generated: nothing is printed until the entire response has arrived, and then it is printed all at once. If we don't want to wait for the whole response, we can use __streaming__. With this approach, content is sent to the client (for example, a user's browser) and displayed while the model is still generating the response.

### Working with streams

To get a streaming response from the API, we have to pass `stream=True` as a parameter to `client.messages.create`.

In [3]:
stream = client.messages.create(
    messages=[
        {
            "role": "user",
            "content": "Write me a 3 word sentence, without a preamble.  Just give me 3 words",
        }
    ],
    model=MODEL_NAME,
    max_tokens=100,
    temperature=0,
    stream=True,
)

In [4]:
# taking a look at the stream variable
stream # stream object

<anthropic.Stream at 0x112273f80>

In [5]:
# the stream object is an iterator that yields individual server-sent events (SSE) as they are received from the API
# we need to iterate over the stream object and work with each individual server-sent event
# iterating over the stream response:
for event in stream:
    print(event)

RawMessageStartEvent(message=Message(id='msg_01WP4BHRURLro5RBcLwciaAZ', content=[], model='claude-3-5-sonnet-20241022', role='assistant', stop_reason=None, stop_sequence=None, type='message', usage=Usage(cache_creation_input_tokens=0, cache_read_input_tokens=0, input_tokens=30, output_tokens=1)), type='message_start')
RawContentBlockStartEvent(content_block=TextBlock(citations=None, text='', type='text'), index=0, type='content_block_start')
RawContentBlockDeltaEvent(delta=TextDelta(text='Dogs', type='text_delta'), index=0, type='content_block_delta')
RawContentBlockDeltaEvent(delta=TextDelta(text=' chase cats.', type='text_delta'), index=0, type='content_block_delta')
RawContentBlockStopEvent(index=0, type='content_block_stop')
RawMessageDeltaEvent(delta=Delta(stop_reason='end_turn', stop_sequence=None), type='message_delta', usage=MessageDeltaUsage(output_tokens=7))
RawMessageStopEvent(type='message_stop')


#### __Color-coded explanation of the streaming events:__

<img src="../../assets/images/streaming_events.png" width="70%" />

Each stream contains a series of events in the following order:

- __MessageStartEvent__ - A message with empty content
- __Series of content blocks__ - Each of which contains:
  - A __ContentBlockStartEvent__
  - One or more __ContentBlockDeltaEvents__
  - A __ContentBlockStopEvent__
- One or more __MessageDeltaEvents__ which indicate top-level changes to the final message
- A final __MessageStopEvent__
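
The ordering described above can be checked with a small sketch. The events here are hypothetical stand-ins built with `SimpleNamespace`, not real SDK objects, and `check_event_order` is an illustrative helper name that does not exist in the `anthropic` package. It only looks at the `type` field of each event.

```python
from types import SimpleNamespace


def check_event_order(events):
    """Return True if the event types follow the order described in the list above."""
    types = [event.type for event in events]
    # A stream starts with message_start and ends with message_stop.
    if len(types) < 2 or types[0] != "message_start" or types[-1] != "message_stop":
        return False

    inside_block = False        # are we between content_block_start and content_block_stop?
    seen_message_delta = False  # message_delta events come after all content blocks
    for event_type in types[1:-1]:
        if event_type == "content_block_start":
            if inside_block or seen_message_delta:
                return False
            inside_block = True
        elif event_type == "content_block_delta":
            if not inside_block:
                return False
        elif event_type == "content_block_stop":
            if not inside_block:
                return False
            inside_block = False
        elif event_type == "message_delta":
            if inside_block:
                return False
            seen_message_delta = True
        else:
            # Unknown event types are rejected in this simplified sketch.
            return False
    return not inside_block and seen_message_delta


# Hypothetical event sequence mirroring the printed output above.
fake_events = [
    SimpleNamespace(type=name)
    for name in [
        "message_start",
        "content_block_start",
        "content_block_delta",
        "content_block_delta",
        "content_block_stop",
        "message_delta",
        "message_stop",
    ]
]
print(check_event_order(fake_events))
```

A real consumer would usually not validate the order like this; the sketch is only meant to make the nesting of content-block events inside the message events explicit.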

#### __Events associated with a single content block:__

<img src="../../assets/images/content_block_events.png" width="70%" />

As the diagram shows, the model-generated content we care about comes from the ContentBlockDeltaEvents. Each of them has its `type` set to `"content_block_delta"`. To get the content itself, we need to access the `text` property inside of `delta`.

*(Image source: https://github.com/anthropics/courses/blob/master/anthropic_api_fundamentals/05_Streaming.ipynb)*
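
To make the access path `event.delta.text` concrete without calling the API, here is a minimal sketch that works on hypothetical stand-in events (`SimpleNamespace` objects shaped like the printed events above). The helper names `make_delta_event` and `collect_text` are assumptions made for illustration only.

```python
from types import SimpleNamespace


def make_delta_event(text):
    """Build a stand-in for a content_block_delta event carrying a text delta."""
    return SimpleNamespace(
        type="content_block_delta",
        index=0,
        delta=SimpleNamespace(type="text_delta", text=text),
    )


def collect_text(events):
    """Join the text of all text deltas; every other event is skipped."""
    parts = []
    for event in events:
        if event.type == "content_block_delta" and event.delta.type == "text_delta":
            parts.append(event.delta.text)
    return "".join(parts)


fake_events = [
    SimpleNamespace(type="message_start"),
    make_delta_event("Dogs"),
    make_delta_event(" chase cats."),
    SimpleNamespace(type="message_stop"),
]
print(collect_text(fake_events))  # Dogs chase cats.
```

Checking `delta.type` in addition to the event type is a defensive choice: it keeps the helper from failing on a delta that carries no `text` attribute.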

In [6]:
stream = client.messages.create(
    messages=[
        {
            "role": "user",
            "content": "Write me a 3 word sentence, without a preamble.  Just give me 3 words",
        }
    ],
    model=MODEL_NAME,
    max_tokens=100,
    temperature=0,
    stream=True,
)

for event in stream:
    if event.type == "content_block_delta":
        print(event.delta.text)

Dogs
 chase cats.


To format the printed content a bit more nicely, we can make use of two additional arguments that we pass to `print()`:

- `end=""`: By default, `print()` appends a newline character (`\n`) to the printed text. Setting `end=""` suppresses that newline, so the next `print()` call continues on the same line.
- `flush=True`: Forces the output to be written to the console (standard output) immediately, without waiting for a newline character or for the buffer to fill up. This ensures that the text is displayed in real time as it is received from the streaming response.
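
The effect of `end=""` can be seen in isolation with a small sketch that writes the two chunks from the example into in-memory buffers instead of the console. The `chunks` list is a hypothetical stand-in for the text deltas of a stream.

```python
import io

# Stand-in for the text deltas received from a stream.
chunks = ["Dogs", " chase cats."]

# Default behaviour: print() appends "\n" after every chunk.
default_out = io.StringIO()
for chunk in chunks:
    print(chunk, file=default_out)

# With end="" the chunks are written back to back; flush=True pushes each
# chunk to the target immediately instead of leaving it in a buffer.
joined_out = io.StringIO()
for chunk in chunks:
    print(chunk, end="", flush=True, file=joined_out)

print(repr(default_out.getvalue()))  # 'Dogs\n chase cats.\n'
print(repr(joined_out.getvalue()))   # 'Dogs chase cats.'
```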



In [7]:
stream = client.messages.create(
    messages=[
        {
            "role": "user",
            "content": "Write me a 3 word sentence, without a preamble.  Just give me 3 words",
        }
    ],
    model=MODEL_NAME,
    max_tokens=100,
    temperature=0,
    stream=True,
)
for event in stream:
    if event.type == "content_block_delta":
        print(event.delta.text, flush=True, end="")


Dogs chase cats.

In [8]:
# now for a longer text response
stream = client.messages.create(
    messages=[
        {
            "role": "user",
            "content": "How do large language models work?",
        }
    ],
    model=MODEL_NAME,
    max_tokens=1000,
    temperature=0,
    stream=True,
)
for event in stream:
    if event.type == "content_block_delta":
        print(event.delta.text, flush=True, end="")

Large Language Models (LLMs) work through several key components and processes. Here's a simplified explanation:

 Process:ng
d on massive amounts of text data from the internet, books, and other sources
 patterns in language through a process called "unsupervised learning"
 involves predicting the next word in a sequence based on previous words

2. Key Components:
 Transformer architecture: The underlying neural network structure
: Help the model focus on relevant parts of input text
 of adjustable values that store learned patterns

3. Basic Operation:
 text input
 multiple layers of neural networks
d patterns to generate appropriate responses
dicts most likely next words based on context

:. Key Capabilities
- Pattern recognition in language
- Understanding context
- Text generation
 adaptation

5. Limitations:
- No true understanding or consciousness
 incorrect information
off dated to training data cut
- No real-time information

 the actual technical details are quite complex and

### <span style="color:green">Interlude: A brief conversation with Claude.ai</span>

>Hello Claude! I'm working with the streaming example of the Claude Python SDK. 
>
>I'm using the following example code on a M1 Mac:
>
>```
>stream = client.messages.create(
>    messages=[
>        {
>            "role": "user",
>           "content": "How do large language models work?",
>        }
>    ],
>    model=MODEL_NAME,
>    max_tokens=100,
>    temperature=0,
>    stream=True,
>)
>for event in stream:
>    if event.type == "content_block_delta":
>        print(event.delta.text, flush=True, end="")
>```
><br>
>
>In the printout that I'm getting the first characters of every paragraph (except for the first one) are missing. I assume that the problem is in the code line `print(event.delta.text, flush=True, end="")`. Do you have suggestions how to modify the code to get a correct output?

__Claude.ai:__ This is a common issue with streaming outputs where newline characters and the first characters after them can sometimes get separated across different chunks. Here's how you can modify the code to handle this better:

```python
stream = client.messages.create(
    messages=[
        {
            "role": "user",
            "content": "How do large language models work?",
        }
    ],
    model=MODEL_NAME,
    max_tokens=100,
    temperature=0,
    stream=True,
)

# Buffer to store accumulated text
text_buffer = ""

for event in stream:
    if event.type == "content_block_delta":
        chunk = event.delta.text
        text_buffer += chunk
        
        # If we have a complete line (ends with newline), print it
        while '\n' in text_buffer:
            line, text_buffer = text_buffer.split('\n', 1)
            print(line)
            print()  # Add blank line after each paragraph
            
# Print any remaining text in the buffer
if text_buffer:
    print(text_buffer, end="")
```

The key changes are:
1. We create a text buffer to accumulate chunks
2. We only print complete lines when we encounter a newline character
3. We split the buffer at newlines and keep the remainder for the next iteration
4. At the end, we print any remaining text that didn't end with a newline

This approach ensures that paragraphs are printed completely and correctly, without missing any characters. The streaming chunks will be reassembled properly before being displayed.

In [9]:
# code example generated by claude.ai:
stream = client.messages.create(
    messages=[
        {
            "role": "user",
            "content": "How do large language models work?",
        }
    ],
    model=MODEL_NAME,
    max_tokens=1000,
    temperature=0,
    stream=True,
)

# Buffer to store accumulated text
text_buffer = ""

for event in stream:
    if event.type == "content_block_delta":
        chunk = event.delta.text
        text_buffer += chunk
        
        # If we have a complete line (ends with newline), print it
        while '\n' in text_buffer:
            line, text_buffer = text_buffer.split('\n', 1)
            print(line)
            # print()  # Add blank line after each paragraph
            
# Print any remaining text in the buffer
if text_buffer:
    print(text_buffer, end="")

Large Language Models (LLMs) work through several key components and processes. Here's a simplified explanation:

1. Architecture:
- LLMs are based on transformer architecture, which uses self-attention mechanisms
- They consist of billions of parameters (weights and biases) arranged in neural networks
- The models are typically organized in multiple layers that process information sequentially

2. Training:
- Models are trained on massive amounts of text data from the internet, books, and other sources
- They learn patterns in language through unsupervised learning
- Training involves predicting the next word in a sequence, helping them understand context and relationships

3. Key processes:
- Tokenization: Breaking input text into smaller units (tokens)
- Attention mechanisms: Weighing the importance of different words in context
- Pattern recognition: Identifying relationships between words and concepts
- Probability distribution: Generating likely responses based on training data



In [10]:
# Claude's more concise version, with the streaming logic separated into its own function
def print_stream_response(stream):
    buffer = ""
    for event in stream:
        if event.type == "content_block_delta":
            buffer += event.delta.text
            while '\n' in buffer:
                line, buffer = buffer.split('\n', 1)
                # print(line, '\n')
                print(line)
    if buffer:
        print(buffer, end="")

# API call
stream = client.messages.create(
    messages=[{"role": "user", "content": "How do large language models work?"}],
    model=MODEL_NAME,
    max_tokens=1000,
    temperature=0,
    stream=True,
)

print_stream_response(stream)

Large Language Models (LLMs) work through several key components and processes. Here's a simplified explanation:

1. Training Process:
- LLMs are trained on massive amounts of text data from the internet, books, and other sources
- They learn patterns in language through a process called "unsupervised learning"
- Training involves predicting the next word in a sequence based on previous words

2. Key Components:
- Transformer architecture: The underlying neural network structure that processes text
- Attention mechanisms: Allow the model to focus on relevant parts of input text
- Parameters: Billions of adjustable values that store learned patterns

3. Basic Operation:
- Takes text input (prompt)
- Processes it through multiple layers of neural networks
- Generates predictions for the most likely next words
- Outputs coherent text based on learned patterns

4. Key Capabilities:
- Pattern recognition in language
- Understanding context
- Text generation
- Task adaptation through prompti

In this modified version Claude has made the following __improvements__:

- Moved the streaming logic into a dedicated function for better organization
- Simplified variable names (text_buffer → buffer)
- Removed redundant comments
- Condensed the messages list into a single line
- Combined the print statements for the line and newline

The function remains robust but is now more concise and reusable. You can easily use it for any streaming response from the Claude API.
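
The buffering idea can also be tried without an API call. The following sketch applies the same split-at-newline logic to plain strings; `iter_complete_lines` and the `fake_chunks` list are hypothetical names and data chosen for illustration, with chunk boundaries deliberately placed in the middle of words and lines.

```python
def iter_complete_lines(chunks):
    """Yield complete lines from an iterable of arbitrary text chunks."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        # Emit every complete line currently in the buffer; keep the remainder.
        while "\n" in buffer:
            line, buffer = buffer.split("\n", 1)
            yield line
    # Whatever is left did not end with a newline.
    if buffer:
        yield buffer


# Made-up chunks that cut through words and lines.
fake_chunks = [
    "1. Trai",
    "ning Process:\n- LLMs are trai",
    "ned on text\n",
    "- They learn patterns",
]
lines = list(iter_complete_lines(fake_chunks))
for line in lines:
    print(line)
```

The trade-off of this approach is that text now appears line by line rather than chunk by chunk, so a long paragraph without a newline is only shown once it is complete.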

In [11]:
# the same example once again, this time printing out every event in full
stream = client.messages.create(
    messages=[
        {
            "role": "user",
            "content": "How do large language models work?",
        }
    ],
    model=MODEL_NAME,
    max_tokens=1000,
    temperature=0,
    stream=True,
)

for event in stream:
    print(event)

RawMessageStartEvent(message=Message(id='msg_01KdyNHrwqmq9jd4woh5wcbR', content=[], model='claude-3-5-sonnet-20241022', role='assistant', stop_reason=None, stop_sequence=None, type='message', usage=Usage(cache_creation_input_tokens=0, cache_read_input_tokens=0, input_tokens=14, output_tokens=2)), type='message_start')
RawContentBlockStartEvent(content_block=TextBlock(citations=None, text='', type='text'), index=0, type='content_block_start')
RawContentBlockDeltaEvent(delta=TextDelta(text='Large Language', type='text_delta'), index=0, type='content_block_delta')
RawContentBlockDeltaEvent(delta=TextDelta(text=' Models (LLMs) work through', type='text_delta'), index=0, type='content_block_delta')
RawContentBlockDeltaEvent(delta=TextDelta(text=" several key components and processes. Here's a", type='text_delta'), index=0, type='content_block_delta')
RawContentBlockDeltaEvent(delta=TextDelta(text=' simplified explanation:\n\n1. Architecture', type='text_delta'), index=0, type='content_block

In [12]:
# alternative approach with the same result
with client.messages.stream(
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": "How do large language models work?",
        }
    ],
    model=MODEL_NAME,
    temperature=0,
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

Large Language Models (LLMs) work through several key components and processes. Here's a simplified explanation:

 Process:ng
d on massive amounts of text data from the internet, books, and other sources
upervised learning"ns in language through a process called "uns
dicting the next word in a sequence based on previous words

2. Key Components:
 architecture: The underlying neural network structure
 Attention mechanisms: Help the model focus on relevant parts of input text
 of adjustable values that store learned patterns

 Operation:
- Takes text input
 multiple layers of neural networks
d patterns to generate appropriate responses
dicts most likely next words based on context

 Key Capabilities:
 language recognition in
- Understanding context
 generation
- Task adaptation

5. Limitations:
- No true understanding or consciousness
 incorrect or biased information
d to training data cutoff date
 real-time information

d overview - the actual technical details are quite complex and inv

In [13]:
'''
response = client.messages.create(
    messages=[
        {
            "role": "user",
            "content": "How do large language models work?",
        }
    ],
    model=MODEL_NAME,
    max_tokens=1000,
    temperature=0,
)
print(response.content[0].text)
'''

'\nresponse = client.messages.create(\n    messages=[\n        {\n            "role": "user",\n            "content": "How do large language models work?",\n        }\n    ],\n    model=MODEL_NAME,\n    max_tokens=1000,\n    temperature=0,\n)\nprint(response.content[0].text)\n'

#### __Accessing information about our token usage:__

- `MessageStartEvent` contains our input (prompt) token usage information
- `MessageDeltaEvent` contains information on how many output tokens were generated

<img src="../../assets/images/events_token_usage.png" width="70%" />

*(Image source: https://github.com/anthropics/courses/blob/master/anthropic_api_fundamentals/05_Streaming.ipynb)*
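
As a sketch of where the two numbers live, the following helper pulls them out of a list of hypothetical stand-in events. The events are `SimpleNamespace` objects shaped like the ones printed earlier (30 input tokens, 7 output tokens), and `extract_usage` is an illustrative name, not an SDK function.

```python
from types import SimpleNamespace


def extract_usage(events):
    """Return (input_tokens, output_tokens) found in a sequence of stream events."""
    input_tokens = None
    output_tokens = None
    for event in events:
        if event.type == "message_start":
            # Prompt token count is reported once, at the start of the message.
            input_tokens = event.message.usage.input_tokens
        elif event.type == "message_delta":
            # Output token count is reported on the message delta near the end.
            output_tokens = event.usage.output_tokens
    return input_tokens, output_tokens


# Stand-in events mirroring the three-word example from the beginning.
fake_events = [
    SimpleNamespace(
        type="message_start",
        message=SimpleNamespace(usage=SimpleNamespace(input_tokens=30, output_tokens=1)),
    ),
    SimpleNamespace(
        type="content_block_delta",
        delta=SimpleNamespace(type="text_delta", text="Dogs"),
    ),
    SimpleNamespace(type="message_delta", usage=SimpleNamespace(output_tokens=7)),
    SimpleNamespace(type="message_stop"),
]
print(extract_usage(fake_events))  # (30, 7)
```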

In [14]:
# printing out how many tokens are used in our prompt and how many tokens the model generates:
stream = client.messages.create(
    messages=[
        {
            "role": "user",
            "content": "How do large language models work?",
        }
    ],
    model=MODEL_NAME,
    max_tokens=1000,
    stream=True,
)
for event in stream:
    if event.type == "message_start":
        input_tokens = event.message.usage.input_tokens
        print("MESSAGE START EVENT", flush=True)
        print(f"Input tokens used: {input_tokens}", flush=True)
        print("========================")
    elif event.type == "content_block_delta":
        print(event.delta.text, flush=True, end="")
    elif event.type == "message_delta":
        output_tokens = event.usage.output_tokens
        print("\n========================", flush=True)
        print("MESSAGE DELTA EVENT", flush=True)
        print(f"Output tokens used: {output_tokens}", flush=True)
        

MESSAGE START EVENT
Input tokens used: 14
 Models (LLMs) work through several key components and processes. Here's a simplified explanation:

:. Architecture
d on transformer architecture, which uses attention mechanisms
d biases)nsist of billions of parameters (weights an
 sequences of tokensas

:. Training Process
 amounts of text datasive
 patterns and relationships between words/tokens
 self-supervised learning to predict next words in sequences
 through backpropagation

3. Key Features:
 mechanisms to focus on relevant parts of input
 Context understanding through positional encoding
 recognition across different levels of language

:. Operation
split into smaller units) (
 transformer layersrough multiple
d on learned patternss base
 distributions for next tokens

:. Capabilities
- Text completion
- Question answering
- Translation
- Summarization
 generation
- General language understanding

 is a simplified overview - the actual implementation involves complex mathematics and c

#### Other streaming event types


- __Ping events__: Streams may also include any number of ping events.
- __Error events__: Error events may occasionally occur in the event stream. During periods of high usage you may receive an `overloaded_error` (which corresponds to an HTTP 529 in a non-streaming context).

Error event example:
```
event: error
data: {"type": "error", "error": {"type": "overloaded_error", "message": "Overloaded"}}
```
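
To illustrate the wire format of such an event, here is a minimal sketch that parses the raw text shown above using only the standard library. The SDK normally does this parsing itself, so `parse_sse_event` is a hypothetical helper written purely for illustration, and it only understands the `event:` and `data:` fields.

```python
import json


def parse_sse_event(raw):
    """Parse one server-sent event given as text into (event_name, payload)."""
    event_name = None
    data_lines = []
    for line in raw.splitlines():
        if line.startswith("event:"):
            event_name = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data_lines.append(line[len("data:"):].strip())
    # The data field holds a JSON document.
    payload = json.loads("\n".join(data_lines)) if data_lines else None
    return event_name, payload


raw_event = (
    "event: error\n"
    'data: {"type": "error", "error": {"type": "overloaded_error", "message": "Overloaded"}}'
)
name, payload = parse_sse_event(raw_event)
if name == "error":
    print(f"Stream error: {payload['error']['type']} ({payload['error']['message']})")
```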

### Time to first token (TTFT)

Streaming reduces the time it takes for us to receive the first bit of model-generated content, the __time to first token (TTFT)__.
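
The measurement idea can be sketched without the API. Here `fake_token_stream` is a hypothetical generator that simulates a model emitting one chunk every few milliseconds, and `measure_ttft` is an illustrative helper that records when the first chunk arrives and when the stream ends.

```python
import time


def fake_token_stream(n_tokens=5, delay=0.01):
    """Simulate a stream that produces one chunk every `delay` seconds."""
    for i in range(n_tokens):
        time.sleep(delay)
        yield f"token{i} "


def measure_ttft(chunks):
    """Return (time_to_first_chunk, total_time, chunk_count) for any iterable."""
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in chunks:
        if ttft is None:
            # First chunk has arrived: this is the time to first token.
            ttft = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    return ttft, total, count


ttft, total, count = measure_ttft(fake_token_stream())
print(f"Time to first chunk: {ttft:.3f} s, total: {total:.3f} s, chunks: {count}")
```

With a non-streaming call the whole response behaves like a single chunk, so the time to first token equals the total time.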

In [15]:
# (1) non-streaming approach, asking the model to generate a very long piece of text but cut it off at 500 tokens
import time
def measure_non_streaming_ttft():
    start_time = time.time()

    response = client.messages.create(
        max_tokens=500,
        messages=[
            {
                "role": "user",
                "content": "Write me a long essay explaining the history of the American Revolution",
            }
        ],
        temperature=0,
        model=MODEL_NAME,
    )

    response_time = time.time() - start_time

    print(f"Time to receive first token: {response_time:.3f} seconds")
    print(f"Time to receive complete response: {response_time:.3f} seconds")
    print(f"Total tokens generated: {response.usage.output_tokens}")

    print(response.content[0].text)

In [16]:
measure_non_streaming_ttft()

Time to receive first token: 9.874 seconds
Time to receive complete response: 9.874 seconds
Total tokens generated: 500
Here's a comprehensive essay on the American Revolution:

The American Revolution (1765-1783) was a pivotal moment in world history that led to the creation of the United States of America and marked the first successful colonial rebellion against a European power. The revolution was the result of growing tensions between Great Britain and its thirteen American colonies, ultimately leading to a full-scale war and the birth of a new nation.

Origins and Early Tensions:
The seeds of revolution were planted in the aftermath of the French and Indian War (1754-1763). Although Britain emerged victorious, the war left it with substantial debt. To recover these costs, Parliament began imposing new taxes on the American colonies. The British government believed that since they had protected the colonies during the war, the colonists should help pay for it.

In 1764, the Sugar 

In [17]:
# same thing using a streaming approach (note: this call hard-codes Claude 3 Haiku instead of MODEL_NAME)
def measure_streaming_ttft():
    start_time = time.time()

    stream = client.messages.create(
        max_tokens=500,
        messages=[
            {
                "role": "user",
                "content": "Write me a long essay explaining the history of the American Revolution",
            }
        ],
        temperature=0,
        model="claude-3-haiku-20240307",
        stream=True
    )
    have_received_first_token = False
    for event in stream:
        if event.type == "content_block_delta":
            if not have_received_first_token:
                ttft = time.time() - start_time
                have_received_first_token = True
            print(event.delta.text, flush=True, end="")
        elif event.type == "message_delta":
            output_tokens = event.usage.output_tokens
            total_time = time.time() - start_time

    print(f"\nTime to receive first token: {ttft:.3f} seconds", flush=True)
    print(f"Time to receive complete response: {total_time:.3f} seconds", flush=True)
    print(f"Total tokens generated: {output_tokens}", flush=True)

In [18]:
measure_streaming_ttft()

Here is a long essay explaining the history of the American Revolution:

 history of the United States, marking the country's transition from a collection of British colonies to an independent nation. The roots of the revolution can be traced back to the French and Indian War, which was fought between Britain and France from 1754 to 1763. This conflict resulted in the British gaining control of much of North America, but it also left the British government with a significant debt that they sought to recoup by imposing new taxes and regulations on the colonies.

 major point of contention between the colonies and the British government was the Stamp Act, which was passed in 1765. This act required all printed materials in the colonies, including newspapers, pamphlets, bills, legal documents, licenses, almanacs, dice, and playing cards, to carry an embossed revenue stamp. The colonists viewed this as a violation of their rights as British subjects, as they had no representation in the Br

Comparing the results:

- __Without Streaming:__
  - Time to receive first token: 8.193 seconds
  - Time to receive complete response: 8.193 seconds
  - Total tokens generated: 500
- __With Streaming:__
  - Time to receive first token: 0.492 seconds
  - Time to receive complete response: 4.127 seconds
  - Total tokens generated: 500
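
The numbers in this comparison can be put into proportion with a little arithmetic. The sketch below only uses the figures listed above. The two runs may have used different models (the streaming function above hard-codes `claude-3-haiku-20240307`), so the difference in total time should not be attributed to streaming alone.

```python
# Figures copied from the comparison above.
non_streaming = {"ttft": 8.193, "total": 8.193, "tokens": 500}
streaming = {"ttft": 0.492, "total": 4.127, "tokens": 500}

# How much earlier the first content arrived with streaming.
ttft_speedup = non_streaming["ttft"] / streaming["ttft"]

# Average generation rate of the streaming run over its whole duration.
streaming_rate = streaming["tokens"] / streaming["total"]

print(f"First content arrived about {ttft_speedup:.1f}x sooner with streaming")
print(f"Streaming run averaged about {streaming_rate:.0f} tokens per second")
```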

In [19]:
# let's try an example that generates 1000 tokens using Opus
def compare_ttft():
    def measure_streaming_ttft():
        start_time = time.time()

        stream = client.messages.create(
            max_tokens=1000,
            messages=[
                {
                    "role": "user",
                    "content": "Write me a very very long essay explaining the history of the American Revolution",
                }
            ],
            temperature=0,
            model="claude-3-opus-20240229",
            stream=True
        )
        have_received_first_token = False
        for event in stream:
            if event.type == "content_block_delta":
                if not have_received_first_token:
                    ttft = time.time() - start_time
                    have_received_first_token = True
            elif event.type == "message_delta":
                output_tokens = event.usage.output_tokens
                total_time = time.time() - start_time
        return (ttft, output_tokens)
    
    def measure_non_streaming_ttft():
        start_time = time.time()

        response = client.messages.create(
            max_tokens=1000,
            messages=[
                {
                    "role": "user",
                    "content": "Write me a very very long essay explaining the history of the American Revolution",
                }
            ],
            temperature=0,
            model="claude-3-opus-20240229"
        )
        ttft = time.time() - start_time
        return (ttft, response.usage.output_tokens)
    
    streaming_ttft, streaming_tokens = measure_streaming_ttft()
    non_streaming_ttft, non_streaming_tokens = measure_non_streaming_ttft()

    print("OPUS STREAMING")
    print(f"Time to first token: {streaming_ttft}")
    print(f"Tokens generated: {streaming_tokens}")
    print("#########################################################")
    print("OPUS NON STREAMING")
    print(f"Time to first token: {non_streaming_ttft}")
    print(f"Tokens generated: {non_streaming_tokens}")

In [20]:
compare_ttft()

OPUS STREAMING
Time to first token: 0.7362079620361328
Tokens generated: 1000
#########################################################
OPUS NON STREAMING
Time to first token: 35.947712898254395
Tokens generated: 1000


### Streaming helpers

In [21]:
from anthropic import AsyncAnthropic

client = AsyncAnthropic(api_key=anthropic_api_key)

async def streaming_with_helpers():
    async with client.messages.stream(
        max_tokens=1024,
        messages=[
            {
                "role": "user",
                "content": "Write me a sonnet about orchids",
            }
        ],
        model="claude-3-opus-20240229",
    ) as stream:
        async for text in stream.text_stream:
            print(text, end="", flush=True)

        # read the final accumulated message while the stream context is still open
        final_message = await stream.get_final_message()
        print("\n\nSTREAMING IS DONE.  HERE IS THE FINAL ACCUMULATED MESSAGE: ")
        print(final_message.to_json())

await streaming_with_helpers()

In gardens fair, where beauty reigns supreme,
, a queen among the blooms,
, a sight that's sure to gleam,
's most exquisite plumes.

 rich, from pure white to deep red,
d patterns intricate, beyond compare,
 blossom seems to have a story read,
 grace and charm beyond repair.

 fairies in the breeze,
 presence grand and strong,
 dancing to the hum of busy bees,
 the gentle birds' melodious song.

 orchid, how your beauty does beguile,
 in our sight.e, forever

STREAMING IS DONE.  HERE IS THE FINAL ACCUMULATED MESSAGE: 
{
  "id": "msg_01H885AdbChf5eZRynx61Jas",
  "content": [
    {
      "citations": null,
      "text": "In gardens fair, where beauty reigns supreme,\nThe orchid stands, a queen among the blooms,\nHer petals soft, a sight that's sure to gleam,\nAdorned in nature's most exquisite plumes.\n\nWith colors rich, from pure white to deep red,\nAnd patterns intricate, beyond compare,\nEach blossom seems to have a story read,\nA tale of grace and charm beyond repair.\n\nSome delica

In [22]:
# once again with a little help from Claude:
# here's how we can integrate the original print_stream_response function with the async example:
async def print_stream_response(stream):
    buffer = ""
    # using client.messages.stream allows us to easily access generated text content as it streams in
    # without having to manually check the stream event type;
    # stream.text_stream provides an iterator over just the text deltas in the stream
    async for text in stream.text_stream:
        buffer += text
        while '\n' in buffer:
            line, buffer = buffer.split('\n', 1)
            print(line)
    if buffer:
        print(buffer, end="")

async def streaming_with_helpers():
    # code uses client.messages.stream which allows us to use helpers like stream.text_stream (in print_stream_response())
    # to easily access generated text content as it streams in without having to manually check the stream event type
    async with client.messages.stream(
        max_tokens=1024,
        messages=[
            {
                "role": "user",
                "content": "Write me a sonnet about orchids",
            }
        ],
        model="claude-3-opus-20240229",
    ) as stream:
        await print_stream_response(stream)
        # another useful helper method is get_final_message,
        # which returns the final accumulated message once the stream has been read to completion;
        # it is useful if you want to use streaming
        # but also need access to the entire finished text generation when it's complete
        final_message = await stream.get_final_message()
        print("\n\nSTREAMING IS DONE. HERE IS THE FINAL ACCUMULATED MESSAGE:")
        print(final_message.to_json())

await streaming_with_helpers()

In gardens fair, where beauty's reign is true,
There grows a flower of grace and elegance rare,
With petals soft, of varied shape and hue,
The orchid stands, a sight beyond compare.

Some bloom in shades of purple, pink, or white,
While others bear a pattern wild and bold,
Each blossom is a masterpiece of light,
A work of art, a wonder to behold.

Their fragrance sweet, a gift from nature's hand,
Enchants the senses, stirs the heart's desire,
In forests deep or cultivated land,
The orchid's charm will never cease to inspire.

Oh, lovely orchid, symbol of refined taste,
Your beauty shines, forever undefaced.

STREAMING IS DONE. HERE IS THE FINAL ACCUMULATED MESSAGE:
{
  "id": "msg_01NWc4ybWV7k5Tx3jf6h2Zi9",
  "content": [
    {
      "citations": null,
      "text": "In gardens fair, where beauty's reign is true,\nThere grows a flower of grace and elegance rare,\nWith petals soft, of varied shape and hue,\nThe orchid stands, a sight beyond compare.\n\nSome bloom in shades of purple, pin

#### Passing an Event Handler argument to client.messages.stream

The example that the documentation provides raises the following error: `TypeError: AsyncMessages.stream() got an unexpected keyword argument 'event_handler'`. The recent version of the Claude Python SDK seems to have changed how event handlers are implemented. The example I'm using here was rewritten with the help of Claude.ai.

The code was modified to:
1. Remove the event_handler parameter since it's no longer supported
2. Directly process the message deltas in the stream loop
3. Keep the color formatting for the output
4. Still show different event types
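
The pattern behind these changes, handling each event inside the loop depending on its type, can be sketched independently of the SDK. The events below are hypothetical `SimpleNamespace` stand-ins, and `HANDLERS`, `handle_text_delta`, `handle_other` and `dispatch` are illustrative names, not part of the `anthropic` package.

```python
from types import SimpleNamespace

# ANSI color codes, as in the example that follows.
GREEN = "\033[32m"
RESET = "\033[0m"


def handle_text_delta(event, out):
    """Render model text in green."""
    out.append(f"{GREEN}{event.delta.text}{RESET}")


def handle_other(event, out):
    """Report any other event by its type."""
    out.append(f"\nEvent type: {event.type}\n")


# Map event types to handler functions; unknown types fall back to handle_other.
HANDLERS = {"content_block_delta": handle_text_delta}


def dispatch(events):
    """Process events one by one in the loop instead of via an event_handler argument."""
    out = []
    for event in events:
        handler = HANDLERS.get(event.type, handle_other)
        handler(event, out)
    return "".join(out)


fake_events = [
    SimpleNamespace(type="message_start"),
    SimpleNamespace(
        type="content_block_delta",
        delta=SimpleNamespace(type="text_delta", text="Whispers echo"),
    ),
    SimpleNamespace(type="message_stop"),
]
rendered = dispatch(fake_events)
print(rendered)
```

A dictionary of handlers keeps the loop body short when more event types need special treatment; for two cases, a plain `if`/`else` as in the notebook cell is just as readable.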

In [23]:
from anthropic import AsyncAnthropic, AsyncMessageStream

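class ColoredStream(AsyncMessageStream):
    # Note: stream_with_handlers() below only uses this class for its color constants;
    # the on_text/on_stream_event methods are never called because the class is not passed to the SDK.
    # ANSI color codes
    green = '\033[32m'
    reset = '\033[0m'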
    
    async def on_text(self, text, snapshot):
        # Print model's text in green
        print(f"{self.green}{text}{self.reset}", end="", flush=True)

    async def on_stream_event(self, event):
        # Handle different event types
        if event.type == "message_start":
            print("\nMessage started!")
        elif event.type == "content_block_start":
            print("\nContent block started!")
        elif event.type == "content_block_stop":
            print("\nContent block stopped!")
        elif event.type == "message_delta":
            print("\nMessage delta received")
        elif event.type == "message_stop":
            print("\nMessage completed!")

async def stream_with_handlers():
    client = AsyncAnthropic(api_key=anthropic_api_key)
    
    async with client.messages.stream(
        max_tokens=1024,
        messages=[
            {
                "role": "user",
                "content": "Generate a 5-word poem",
            }
        ],
        model="claude-3-opus-20240229",
    ) as stream:
        async for message_delta in stream:
            if message_delta.type == "content_block_delta":
                print(f"{ColoredStream.green}{message_delta.delta.text}{ColoredStream.reset}", end="", flush=True)
            else:
                print(f"\nEvent type: {message_delta.type}")
        
        final_message = await stream.get_final_message()
        print("\n\nFinal accumulated message:", final_message.to_json())

# Usage
await stream_with_handlers()


Event type: message_start

Event type: content_block_start
Whispers echo through silent nights.
Event type: text

Event type: content_block_stop

Event type: message_delta

Event type: message_stop


Final accumulated message: {
  "id": "msg_01E4bAaADS2M2YFSbiRkwLnL",
  "content": [
    {
      "citations": null,
      "text": "Whispers echo through silent nights.",
      "type": "text"
    }
  ],
  "model": "claude-3-opus-20240229",
  "role": "assistant",
  "stop_reason": "end_turn",
  "stop_sequence": null,
  "type": "message",
  "usage": {
    "cache_creation_input_tokens": 0,
    "cache_read_input_tokens": 0,
    "input_tokens": 14,
    "output_tokens": 11
  }
}
