## Streaming

*(Coding along with the [Anthropic API fundamentals](https://github.com/anthropics/courses/tree/master/anthropic_api_fundamentals) course from Anthropic's courses GitHub repo)*

The goals of this section are to understand how streaming works and to work with stream events.

## Basic setup

In [2]:
# https://github.com/anthropics/courses/blob/master/anthropic_api_fundamentals/05_Streaming.ipynb
from anthropic import Anthropic
import pandas as pd

anthropic_api_key = pd.read_csv("~/tmp/anthropic/anthropic-key-1.txt", sep=" ", header=None)[0][0]
print("Don't be a fool and send your API key to GitHub")

# instantiating the client
client = Anthropic(api_key=anthropic_api_key)
MODEL_NAME = "claude-3-5-sonnet-20241022"

Don't be a fool and send your API key to GitHub


In [3]:
# just as a recap, the syntax we've used so far:
response = client.messages.create(
    messages=[
        {
            "role": "user",
            "content": "Write me an essay about macaws and clay licks in the Amazon",
        }
    ],
    model=MODEL_NAME,
    max_tokens=800,
    temperature=0,
)
print("We have a response back!")
print("========================")
print(response.content[0].text)

We have a response back!
Here's an essay about macaws and clay licks in the Amazon:

Macaws and Clay Licks: A Fascinating Natural Phenomenon in the Amazon

Deep within the Amazon rainforest, one of nature's most spectacular displays occurs daily at clay licks, where hundreds of brilliant macaws and other parrots gather to consume clay from exposed riverbank walls. This remarkable behavior, known as geophagy, is not only a stunning visual spectacle but also plays a crucial role in these birds' survival and well-being.

Clay licks, or "colpas" as they are known locally, are natural clay banks typically found along river edges throughout the Amazon basin, particularly in Peru, Bolivia, and Brazil. These sites attract numerous species of parrots, but it is the large, charismatic macaws that create the most impressive displays. With their vibrant plumage ranging from scarlet red to brilliant blue and yellow, species such as the Scarlet Macaw, Blue-and-yellow Macaw, and Red-and-green Macaw t

With the approach we just used, we only get content back from the API once all of it has been generated: nothing is printed until the entire response arrives at once. If we don't want to wait for the full response, we can use __streaming__. With this approach, content is streamed to the user (for example, to a browser) and displayed while the model is still generating the rest of the response.
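
The difference can be sketched without calling the API at all. In the snippet below, `fake_model_chunks` is a hypothetical stand-in for a model producing text piece by piece (it is not part of the Anthropic SDK); `blocking_response` mimics the non-streaming call and `streamed_response` mimics consuming a stream.

```python
import time
from typing import Iterator


def fake_model_chunks(delay: float = 0.0) -> Iterator[str]:
    """Hypothetical stand-in for a model that produces text piece by piece."""
    for chunk in ["Macaws ", "gather ", "at ", "clay ", "licks."]:
        time.sleep(delay)  # pretend each chunk takes some time to generate
        yield chunk


def blocking_response() -> str:
    """Non-streaming: the caller gets nothing until every chunk exists."""
    return "".join(fake_model_chunks())


def streamed_response() -> Iterator[str]:
    """Streaming: each chunk is handed over as soon as it is produced."""
    yield from fake_model_chunks()


# Non-streaming: one print, after everything has been generated
print(blocking_response())

# Streaming: one print per chunk, as the chunks arrive
for chunk in streamed_response():
    print(chunk, end="", flush=True)
print()
```

Both variants end up with the same text; only the moment at which the caller can start displaying it differs.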

### Working with streams

To get a streaming response from the API, we have to pass `stream=True` as a parameter to `client.messages.create`.

In [8]:
stream = client.messages.create(
    messages=[
        {
            "role": "user",
            "content": "Write me a 3 word sentence, without a preamble.  Just give me 3 words",
        }
    ],
    model=MODEL_NAME,
    max_tokens=100,
    temperature=0,
    stream=True,
)

In [9]:
# taking a look at the stream variable
stream # stream object

<anthropic.Stream at 0x11a72a420>

In [10]:
# the stream object is an iterable that yields individual server-sent events (SSE) as they are received from the API
# to use it, we iterate over the stream and handle each server-sent event in turn
# iterating over the stream response:
for event in stream:
    print(event)

RawMessageStartEvent(message=Message(id='msg_01KLchS6G7qG6EDWX5ndAEYz', content=[], model='claude-3-5-sonnet-20241022', role='assistant', stop_reason=None, stop_sequence=None, type='message', usage=Usage(cache_creation_input_tokens=0, cache_read_input_tokens=0, input_tokens=30, output_tokens=1)), type='message_start')
RawContentBlockStartEvent(content_block=TextBlock(citations=None, text='', type='text'), index=0, type='content_block_start')
RawContentBlockDeltaEvent(delta=TextDelta(text='Dogs', type='text_delta'), index=0, type='content_block_delta')
RawContentBlockDeltaEvent(delta=TextDelta(text=' chase cats.', type='text_delta'), index=0, type='content_block_delta')
RawContentBlockStopEvent(index=0, type='content_block_stop')
RawMessageDeltaEvent(delta=Delta(stop_reason='end_turn', stop_sequence=None), type='message_delta', usage=MessageDeltaUsage(output_tokens=7))
RawMessageStopEvent(type='message_stop')


#### __Color-coded explanation of the streaming events:__

<img src="../../assets/images/streaming_events.png" width="70%" />

Each stream contains a series of events in the following order:

- __MessageStartEvent__ - A message with empty content
- __Series of content blocks__ - Each of which contains:
  - A __ContentBlockStartEvent__
  - One or more __ContentBlockDeltaEvents__
  - A __ContentBlockStopEvent__
- One or more __MessageDeltaEvents__ which indicate top-level changes to the final message
- A final __MessageStopEvent__
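
As a rough sketch, the ordering rules in this list can be written down as a small check over the `type` strings seen in the printed events. The function name `is_valid_event_order` is made up for illustration, and the check is deliberately simplified: it only knows the event types listed here and rejects anything else, so it is not a validator for everything a real stream may contain.

```python
def is_valid_event_order(event_types: list[str]) -> bool:
    """Check a list of event type strings against the order described above."""
    if len(event_types) < 2:
        return False
    if event_types[0] != "message_start" or event_types[-1] != "message_stop":
        return False

    inside_block = False  # True between content_block_start and content_block_stop
    seen_delta = False    # True once the currently open block has produced a delta
    for event_type in event_types[1:-1]:
        if event_type == "content_block_start":
            if inside_block:
                return False  # a new block cannot start inside an open one
            inside_block, seen_delta = True, False
        elif event_type == "content_block_delta":
            if not inside_block:
                return False  # deltas only make sense inside a block
            seen_delta = True
        elif event_type == "content_block_stop":
            if not (inside_block and seen_delta):
                return False  # a block needs a start and at least one delta
            inside_block = False
        elif event_type == "message_delta":
            if inside_block:
                return False  # message-level changes come after the blocks
        else:
            return False  # event types outside the list are rejected in this sketch
    return not inside_block


# The event types from the "Dogs chase cats." stream printed earlier
observed = [
    "message_start",
    "content_block_start",
    "content_block_delta",
    "content_block_delta",
    "content_block_stop",
    "message_delta",
    "message_stop",
]
print(is_valid_event_order(observed))
```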

#### __Events associated with a single content block:__

<img src="../../assets/images/content_block_events.png" width="70%" />

As we can see in the diagram, the model-generated content we care about comes from the ContentBlockDeltaEvents. Each of them has its `type` set to `"content_block_delta"`. To get the content itself, we need to access the `text` property inside of `delta`.

*(Image source: https://github.com/anthropics/courses/blob/master/anthropic_api_fundamentals/05_Streaming.ipynb)*
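
If we want the full text as one string rather than printed pieces, the delta texts can be collected and joined. The sketch below runs on hand-built stand-in objects (`SimpleNamespace`) that only mimic the `type`, `delta.type` and `delta.text` fields visible in the printed events; they are not SDK classes, and `collect_text` is a hypothetical helper name.

```python
from types import SimpleNamespace


def collect_text(events) -> str:
    """Join the text of all text deltas found in an iterable of events."""
    parts = []
    for event in events:
        # only content_block_delta events carry model-generated text
        if event.type == "content_block_delta" and event.delta.type == "text_delta":
            parts.append(event.delta.text)
    return "".join(parts)


# Stand-ins mirroring the events of the "Dogs chase cats." stream shown earlier
fake_events = [
    SimpleNamespace(type="message_start"),
    SimpleNamespace(type="content_block_start", index=0),
    SimpleNamespace(
        type="content_block_delta",
        index=0,
        delta=SimpleNamespace(type="text_delta", text="Dogs"),
    ),
    SimpleNamespace(
        type="content_block_delta",
        index=0,
        delta=SimpleNamespace(type="text_delta", text=" chase cats."),
    ),
    SimpleNamespace(type="content_block_stop", index=0),
    SimpleNamespace(type="message_delta"),
    SimpleNamespace(type="message_stop"),
]

print(collect_text(fake_events))  # Dogs chase cats.
```

Under the assumption that the real events expose the same attributes, passing the real `stream` object instead of `fake_events` should work the same way.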

In [13]:
stream = client.messages.create(
    messages=[
        {
            "role": "user",
            "content": "Write me a 3 word sentence, without a preamble.  Just give me 3 words",
        }
    ],
    model=MODEL_NAME,
    max_tokens=100,
    temperature=0,
    stream=True,
)

for event in stream:
    if event.type == "content_block_delta":
        print(event.delta.text)

Dogs
 chase cats.


To format the printed content a bit more nicely, we can make use of two additional arguments that we pass to `print()`:

- `end=""`: By default, `print()` adds a newline character (`\n`) at the end of the printed text. Setting `end=""` means the printed text is not followed by a newline, so the next `print()` call continues on the same line.
- `flush=True`: Forces the output to be written to the console (standard output) immediately, without waiting for a newline character or for the buffer to fill. This ensures that the text is displayed in real time as it is received from the streaming response.
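
The effect of `end=""` is easy to see in isolation. The following sketch writes the two chunks from the example above into in-memory buffers (`io.StringIO`) instead of the console, once with the default settings and once with `end=""` and `flush=True`, and then shows exactly which characters were written.

```python
import io

chunks = ["Dogs", " chase cats."]

# default behaviour: every print() call appends "\n"
default_out = io.StringIO()
for chunk in chunks:
    print(chunk, file=default_out)

# streaming-friendly behaviour: no newline, written out immediately
inline_out = io.StringIO()
for chunk in chunks:
    print(chunk, end="", flush=True, file=inline_out)

print(repr(default_out.getvalue()))  # 'Dogs\n chase cats.\n'
print(repr(inline_out.getvalue()))   # 'Dogs chase cats.'
```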



In [14]:
stream = client.messages.create(
    messages=[
        {
            "role": "user",
            "content": "Write me a 3 word sentence, without a preamble.  Just give me 3 words",
        }
    ],
    model=MODEL_NAME,
    max_tokens=100,
    temperature=0,
    stream=True,
)
for event in stream:
    if event.type == "content_block_delta":
        print(event.delta.text, flush=True, end="")


Dogs chase cats.

In [25]:
# now for a longer text response
stream = client.messages.create(
    messages=[
        {
            "role": "user",
            "content": "How do large language models work?",
        }
    ],
    model=MODEL_NAME,
    max_tokens=1000,
    temperature=0,
    stream=True,
)
for event in stream:
    if event.type == "content_block_delta":
        print(event.delta.text, flush=True, end="")

Large Language Models (LLMs) work through several key components and processes. Here's a simplified explanation:

1. Training Process:
- LLMs are trained on massive amounts of text data from the internet, books, and other sources
- [...] a process called "unsupervised learning"
- [...] next word in a sequence based on previous words

2. Key Components:
- [...] neural network structure that processes text
- Attention Mechanisms: Allow the model to focus on relevant parts of input text
- [...] of adjustable values that store learned patterns

3. Basic Operation:
- [...] text input
- [...] multiple layers of neural networks
- [...] for appropriate responses
- [...] text based on learned patterns

4. Key Capabilities:
- Pattern recognition in language
- Understanding context
- Text generation
- Task adaptation

5. Limitations:
- No true understanding or consciousness
- [...] incorrect or biased information
- Limited to training data cutoff date
- No real-time information

[...] the actual technical details are quite complex and involve advanced mathematics and computer science concepts.

In [27]:
# the same example once more, this time printing out each whole event
stream = client.messages.create(
    messages=[
        {
            "role": "user",
            "content": "How do large language models work?",
        }
    ],
    model=MODEL_NAME,
    max_tokens=1000,
    temperature=0,
    stream=True,
)

for event in stream:
    print(event)

RawMessageStartEvent(message=Message(id='msg_01DS5cC9quTM3pstNGwmRvaV', content=[], model='claude-3-5-sonnet-20241022', role='assistant', stop_reason=None, stop_sequence=None, type='message', usage=Usage(cache_creation_input_tokens=0, cache_read_input_tokens=0, input_tokens=14, output_tokens=2)), type='message_start')
RawContentBlockStartEvent(content_block=TextBlock(citations=None, text='', type='text'), index=0, type='content_block_start')
RawContentBlockDeltaEvent(delta=TextDelta(text='Large Language', type='text_delta'), index=0, type='content_block_delta')
RawContentBlockDeltaEvent(delta=TextDelta(text=' Models (LLMs) work', type='text_delta'), index=0, type='content_block_delta')
RawContentBlockDeltaEvent(delta=TextDelta(text=" through several key components and processes. Here's a", type='text_delta'), index=0, type='content_block_delta')
RawContentBlockDeltaEvent(delta=TextDelta(text=' simplified explanation:\n\n1. Architecture', type='text_delta'), index=0, type='content_block

In [17]:
# alternative approach with the same result
with client.messages.stream(
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": "How do large language models work?",
        }
    ],
    model=MODEL_NAME,
    temperature=0,
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

Large Language Models (LLMs) work through several key components and processes. Here's a simplified explanation:

1. Training Process:
- LLMs are trained on massive amounts of text data from the internet, books, and other sources
- [...] patterns in language through a process called "unsupervised learning"
- [...] involves predicting the next word in a sequence based on previous words

2. Key Components:
- Transformer architecture: The underlying neural network structure
- Attention mechanisms: Help the model focus on relevant parts of input text
- [...]: Billions of adjustable values that store learned patterns

3. Basic Operation:
- [...] text input
- [...] multiple layers of neural networks
- [...] to generate appropriate responses
- [...] likely next words based on context

4. Key Capabilities:
- Pattern recognition in language
- Understanding context
- Text generation
- Task adaptation

5. Limitations:
- No true understanding or consciousness
- [...] incorrect information
- Limited to training data cutoff date
- No real-time information

 actual technical 

In [22]:
'''
response = client.messages.create(
    messages=[
        {
            "role": "user",
            "content": "How do large language models work?",
        }
    ],
    model=MODEL_NAME,
    max_tokens=1000,
    temperature=0,
)
print(response.content[0].text)
'''


#### __Accessing information about our token usage:__

- `MessageStartEvent` contains our input (prompt) token usage information
- `MessageDeltaEvent` contains information on how many output tokens were generated

<img src="../../assets/images/events_token_usage.png" width="70%" />

*(Image source: https://github.com/anthropics/courses/blob/master/anthropic_api_fundamentals/05_Streaming.ipynb)*
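
The two lookups can also be wrapped in a small helper that returns the counts instead of printing them. This is only a sketch: `read_token_usage` is a hypothetical name, the events are `SimpleNamespace` stand-ins carrying the numbers from the "Dogs chase cats." stream printed earlier, and keeping the last `message_delta` value assumes that the count reported there is a running total.

```python
from types import SimpleNamespace


def read_token_usage(events):
    """Return (input_tokens, output_tokens) found in an iterable of stream events."""
    input_tokens = None
    output_tokens = None
    for event in events:
        if event.type == "message_start":
            # prompt tokens are reported once, on the message inside the start event
            input_tokens = event.message.usage.input_tokens
        elif event.type == "message_delta":
            # keep the latest value in case there is more than one message_delta
            output_tokens = event.usage.output_tokens
    return input_tokens, output_tokens


# Stand-ins with the usage numbers from the first printed event stream
fake_events = [
    SimpleNamespace(
        type="message_start",
        message=SimpleNamespace(usage=SimpleNamespace(input_tokens=30, output_tokens=1)),
    ),
    SimpleNamespace(type="content_block_start"),
    SimpleNamespace(type="content_block_delta"),
    SimpleNamespace(type="content_block_stop"),
    SimpleNamespace(type="message_delta", usage=SimpleNamespace(output_tokens=7)),
    SimpleNamespace(type="message_stop"),
]

print(read_token_usage(fake_events))  # (30, 7)
```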

In [24]:
# printing out how many tokens are used in our prompt and how many tokens the model generates:
stream = client.messages.create(
    messages=[
        {
            "role": "user",
            "content": "How do large language models work?",
        }
    ],
    model=MODEL_NAME,
    max_tokens=1000,
    stream=True,
)
for event in stream:
    if event.type == "message_start":
        input_tokens = event.message.usage.input_tokens
        print("MESSAGE START EVENT", flush=True)
        print(f"Input tokens used: {input_tokens}", flush=True)
        print("========================")
    elif event.type == "content_block_delta":
        print(event.delta.text, flush=True, end="")
    elif event.type == "message_delta":
        output_tokens = event.usage.output_tokens
        print("\n========================", flush=True)
        print("MESSAGE DELTA EVENT", flush=True)
        print(f"Output tokens used: {output_tokens}", flush=True)
        

MESSAGE START EVENT
Input tokens used: 14
Large Language Models (LLMs) like myself work through several key components and processes. Here's a simplified explanation:

1. Training Data:
- LLMs are trained on massive amounts of text data from the internet, books, and other sources
- [...] language and information about the world

2. [...] Architecture:
- [...] "transformer" architecture
- This consists of layers of neural networks that process text using attention mechanisms
- [...] weigh the importance of different words in context

3. Process:
- [...] predicts the next word in a sequence based on previous words
- [...] to understand patterns, grammar, facts, and reasoning through this process
- [...] (adjustable weights) are fine-tuned during training

4. [...] Capabilities:
- Pattern recognition in language
- Understanding context and relationships
- [...]like text responses
- [...] synthesizing information

5. Limitations:
- No true understanding or consciousness
- Can produce incorrect information
- Limited to training data cutoff date
- Cannot learn from conversations

 technical details are quite