# How to stream runnables

Streaming is critical in making applications based on LLMs feel responsive to end-users.

Important LangChain primitives like chat models, output parsers, prompts, retrievers, and agents implement the LangChain Runnable Interface.

This interface provides two general approaches to stream content:

1. sync `stream` and async `astream`: a **default implementation** of streaming that streams the **final output** from the chain.
2. async `astream_events` and async `astream_log`: these provide a way to stream both **intermediate steps** and **final output** from the chain.

Let's take a look at both approaches and how to use them.


## Using Stream

All `Runnable` objects implement a sync method called `stream` and an async variant called `astream`.

These methods are designed to stream the final output in chunks, yielding each chunk as soon as it is available.

Streaming is only possible if all steps in the program know how to process an input stream; i.e., process an input chunk one at a time, and yield a corresponding output chunk.
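To make the chunk-at-a-time requirement concrete, here is a plain-Python sketch (no LangChain involved) that models each step as a generator. `fake_model` and `shout` are hypothetical stand-ins, not LangChain APIs. The event log shows that each token passes through every step before the next token is even produced.

```python
from typing import Iterator

# Hypothetical event log so we can see the order in which work happens
events: list[str] = []


def fake_model() -> Iterator[str]:
    """Stand-in for a model: yields tokens one at a time."""
    for token in ["The", " sky", " is", " blue"]:
        events.append(f"model -> {token!r}")
        yield token


def shout(chunks: Iterator[str]) -> Iterator[str]:
    """A streaming-friendly step: transforms each chunk as soon as it arrives."""
    for chunk in chunks:
        out = chunk.upper()
        events.append(f"shout -> {out!r}")
        yield out


# Compose the steps; nothing runs until the consumer pulls the first chunk
for piece in shout(fake_model()):
    events.append(f"consumer <- {piece!r}")

print("\n".join(events))
```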

The complexity of this processing can vary, from straightforward tasks like emitting tokens produced by an LLM, to more challenging ones like streaming parts of JSON results before the entire JSON is complete.
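A rough sketch of the harder case, assuming the model emits a JSON object a few characters at a time: the hypothetical helper `try_parse_partial` closes any open string and brackets in the text received so far and attempts `json.loads`, and `stream_json` yields a new snapshot whenever that succeeds. This is a simplified, best-effort illustration (it simply waits when the buffer ends in a state like a half-written key), not how LangChain's own JSON parser is implemented.

```python
import json
from typing import Any, Iterator, Optional


def try_parse_partial(buffer: str) -> Optional[Any]:
    """Best effort: close any open string/brackets, then try json.loads."""
    stack: list[str] = []
    in_string = False
    escaped = False
    for ch in buffer:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            stack.append("}" if ch == "{" else "]")
        elif ch in "}]" and stack:
            stack.pop()
    candidate = buffer
    if in_string:
        if escaped:
            candidate = candidate[:-1]  # drop a dangling backslash
        candidate += '"'
    candidate += "".join(reversed(stack))
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None  # e.g. a key without a value yet: wait for more text


def stream_json(tokens: Iterator[str]) -> Iterator[Any]:
    """Yield a parsed snapshot each time the growing buffer becomes parseable."""
    buffer = ""
    last = None
    for token in tokens:
        buffer += token
        parsed = try_parse_partial(buffer)
        if parsed is not None and parsed != last:
            last = parsed
            yield parsed


tokens = ['{"name": "Par', 'rot", "col', 'ors": ["gre', 'en", "red"]}']
for snapshot in stream_json(iter(tokens)):
    print(snapshot)
```

Note that intermediate snapshots can contain truncated values (such as `"Par"`), which is the price of showing results before the JSON is complete.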

The best place to start exploring streaming is with the single most important component in LLM apps: the LLMs themselves!

## LLMs and Chat Models

Large language models and their chat variants are the primary bottleneck in LLM-based apps.

Large language models can take **several seconds** to generate a complete response to a query. This is far slower than the **~200-300 ms** threshold at which an application feels responsive to an end user.

The key strategy for making the application feel more responsive is to show intermediate progress, i.e., to stream the output from the model **token by token**.
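The benefit of streaming shows up in time-to-first-token rather than in total time. The plain-Python sketch below simulates a hypothetical `slow_model` that needs a fixed delay per token (the delay and token count are made up for illustration, not measurements of any real model). The full response takes roughly ten times as long as the first token, and streaming lets the user start reading after the first one.

```python
import time
from typing import Iterator, Optional


def slow_model(n_tokens: int = 10, delay: float = 0.05) -> Iterator[str]:
    """Hypothetical model that needs `delay` seconds to produce each token."""
    for i in range(n_tokens):
        time.sleep(delay)
        yield f"tok{i} "


start = time.perf_counter()
first_token_at: Optional[float] = None
for token in slow_model():
    if first_token_at is None:
        # With streaming, the user sees output here, after about one token's delay
        first_token_at = time.perf_counter() - start
total = time.perf_counter() - start

print(f"first token after {first_token_at:.2f}s, full response after {total:.2f}s")
```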

We will show examples of streaming using a chat model.

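```python
import os

import yaml
from langchain_community.chat_models import ChatZhipuAI

# Load the ZhipuAI API key from a local YAML config file
with open('../utils/config.yml', 'r') as f:
    zhipuai_api_key = yaml.safe_load(f)['api_key']

# Make the key available to ChatZhipuAI via the environment
os.environ['ZHIPUAI_API_KEY'] = zhipuai_api_key

# temperature=0 for less random, more repeatable output
model = ChatZhipuAI(
    model='glm-4-plus',
    temperature=0,
)
```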

```python
chunks = []
# Print each chunk's content as soon as it arrives, separated by '|'
for chunk in model.stream('what color is the sky?'):
    chunks.append(chunk)
    print(chunk.content, end='|', flush=True)
```

```
The| color| of| the| sky| can| vary| depending| on| the| time| of| day|,| weather| conditions|,| and| the| observer|'s| location|.| Generally|:

-| **|During| the| day|**:| The| sky| appears| blue| due| to| Ray|leigh| scattering|,| where| shorter| (|blue|)| wavelengths| of| sunlight| are| scattered| in| all| directions| by| the| gases| and| particles| in| the| Earth|'s| atmosphere|.

-| **|At| sunrise| and| sunset|**:| The| sky| can| take| on| hues| of| orange|,| pink|,| and| red|.| This| occurs| because| the| sun| is| lower| on| the| horizon|,| and| its| light| has| to| pass| through| more| of| the| Earth|'s| atmosphere|,| scattering| the| shorter| wavelengths| and| allowing| the| longer| (|red| and| orange|)| wavelengths| to| dominate|.

-| **|On| over|cast| days|**:| The| sky| might| appear| gray| or| white| due| to| the| presence| of| clouds| that| scatter| and| reflect| all| wavelengths| of| light| equally|.

-| **|At| night|**:| The| sky| is| typically| dark|,| often| appearing| 
```

Alternatively, if you're working in an async environment, you may consider using the async `astream` API:

```python
chunks = []
async for chunk in model.astream('what color is the sky?'):
    chunks.append(chunk)
    print(chunk.content, end='|', flush=True)
```

```
The| color| of| the| sky| can| vary| depending| on| the| time| of| day|,| weather| conditions|,| and| the| observer|'s| location|.| During| the| day|,| when| the| sun| is| high| and| the| weather| is| clear|,| the| sky| typically| appears| blue|.| This| blue| color| is| due| to| Ray|leigh| scattering|,| where| the| Earth|'s| atmosphere| sc|atters| sunlight| in| all| directions| and| blue| light| is| scattered| more| because| it| travels| in| shorter|,| smaller| waves|.

At| sunrise| and| sunset|,| the| sky| can| display| a| range| of| colors| including| red|s|,| oranges|,| and| pur|ples|.| This| happens| because| the| sun| is| lower| on| the| horizon|,| and| its| light| has| to| pass| through| a| thicker| layer| of| the| atmosphere|,| scattering| the| shorter|-w|avelength| light| (|blue| and| violet|)| out| of| our| line| of| sight| and| letting| the| red|s| and| oranges| dominate|.

During| over|cast| or| storm|y| weather|,| the| sky| might| appear| gray| or| even| black|.| At| night|
```
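The cell above relies on the notebook's already-running event loop: a bare `async for` at the top level works in Jupyter but raises a `SyntaxError` in a regular Python script. A minimal sketch of the script equivalent wraps the loop in a coroutine and starts it with `asyncio.run`. To keep it self-contained, `fake_astream` is a hypothetical async generator standing in for `model.astream(...)`.

```python
import asyncio
from typing import AsyncIterator


async def fake_astream(text: str) -> AsyncIterator[str]:
    """Hypothetical stand-in for model.astream(): yields words asynchronously."""
    for word in text.split(" "):
        await asyncio.sleep(0.01)  # simulate waiting on the network
        yield word


async def main() -> list[str]:
    chunks = []
    async for chunk in fake_astream("the sky is blue"):
        chunks.append(chunk)
        print(chunk, end="|", flush=True)
    print()
    return chunks


# In a script (outside Jupyter), start the event loop explicitly
collected = asyncio.run(main())
```

With a real model, `fake_astream("the sky is blue")` would be replaced by `model.astream('what color is the sky?')` and each `chunk` printed via `chunk.content`.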

Let's inspect one of the chunks:

```python
chunks[0]
```

```
AIMessageChunk(content='The', id='run-516bff67-34e2-465d-a16a-6f3eb56f68b4')
```

We got back something called an `AIMessageChunk`. This chunk represents a part of an `AIMessage`.

Message chunks are additive by design: one can simply add them up to get the state of the response so far!

```python
chunks[0] + chunks[1] + chunks[2] + chunks[3] + chunks[4]
```

```
AIMessageChunk(content='The color of the sky', id='run-516bff67-34e2-465d-a16a-6f3eb56f68b4')
```
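Because chunks support `+`, there is no need to add them up by hand: `functools.reduce(operator.add, ...)` folds a whole list into one chunk. The sketch below models this with a hypothetical `Chunk` dataclass whose `__add__` concatenates content. It mimics the additive behavior described above but is not LangChain's `AIMessageChunk` implementation; the same `reduce` call should also work on the `chunks` list collected earlier, since it relies only on `+`.

```python
import operator
from dataclasses import dataclass
from functools import reduce


@dataclass
class Chunk:
    """Hypothetical stand-in for a message chunk."""
    content: str
    id: str

    def __add__(self, other: "Chunk") -> "Chunk":
        # Adding two chunks concatenates their content and keeps the id
        return Chunk(content=self.content + other.content, id=self.id)


demo_chunks = [Chunk(t, "run-1") for t in ["The", " color", " of", " the", " sky"]]

# Running total: the state of the response after each new chunk
partial = demo_chunks[0]
for c in demo_chunks[1:]:
    partial = partial + c
    print(repr(partial.content))

# Or fold the whole list at once
full = reduce(operator.add, demo_chunks)
print(full)
```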

## Chains

Virtually all LLM applications involve more steps than just a call to a language model.

Let's build a simple chain using LangChain Expression Language (LCEL) that combines a prompt, a model, and a parser, and verify that streaming works.

We will use `StrOutputParser` to parse the output from the model. This is a simple parser that extracts the `content` field from an `AIMessageChunk`, giving us the `token` returned by the model.

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_template('tell me a joke about {topic}')
parser = StrOutputParser()
chain = prompt | model | parser

async for chunk in chain.astream({'topic': 'parrot'}):
    print(chunk, end='|', flush=True)
```

```
Sure|,| here|'s| one| for| you|:

Why| do| par|rots| only| ever| say| one| thing|?

Because| they|'re| always| getting| called| to| "|p|aws|"|!| 🦜😄|
```

Note that we're getting streaming output even though we're using `parser` at the end of the chain above. The `parser` operates on each streaming chunk individually. Many of the LCEL primitives also support this kind of transform-style passthrough streaming, which can be very convenient when constructing apps.

Custom functions can be designed to return generators, which are able to operate on streams.
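For example, a custom step might need to regroup tokens rather than pass them through one-to-one. The plain-Python sketch below (the function name `split_into_list` is hypothetical) buffers incoming text and yields each comma-separated item as soon as it is complete, so consumers still receive results incrementally. LangChain can wrap generator functions of this shape as chain steps (see `RunnableGenerator`); the sketch itself does not depend on LangChain.

```python
from typing import Iterator


def split_into_list(chunks: Iterator[str]) -> Iterator[str]:
    """Buffer incoming text and emit each comma-separated item once complete."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        # Emit every complete item currently sitting in the buffer
        while "," in buffer:
            item, buffer = buffer.split(",", 1)
            yield item.strip()
    # Flush whatever is left after the input stream ends
    if buffer.strip():
        yield buffer.strip()


tokens = ["lion", ", ele", "phant", ", ti", "ger, ze", "bra"]
for item in split_into_list(iter(tokens)):
    print(item, flush=True)
```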

Certain runnables, like prompt templates and chat models, cannot process individual chunks and instead aggregate all previous steps. Such runnables can interrupt the streaming process.
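The effect of such a step can be seen with plain generators. In this sketch (all names hypothetical), `collect_all` stands in for a step that needs its whole input before producing anything: even though the upstream step streams, the consumer receives nothing until every chunk has been produced.

```python
from typing import Iterator

events: list[str] = []


def producer() -> Iterator[str]:
    """Streaming upstream step: yields one token at a time."""
    for token in ["a", "b", "c"]:
        events.append(f"produced {token}")
        yield token


def collect_all(chunks: Iterator[str]) -> Iterator[str]:
    """Non-streaming step: must see the whole input before emitting output."""
    full = "".join(chunks)  # blocks until the upstream generator is exhausted
    events.append("aggregated")
    yield full


for out in collect_all(producer()):
    events.append(f"consumer received {out!r}")

print(events)
```

Any steps placed after such an aggregating step can only start once it emits its single output.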