# Streaming and Batching

In this notebook you'll learn how to stream model responses and handle multiple chat completion requests in batches.

---

## Objectives

By the time you complete this notebook, you will be able to:

- Stream model responses.
- Batch model responses.
- Compare the performance of batch processing to single-prompt chat completion.

---

## Imports

In [1]:
!pip install groq langchain-groq

Collecting groq
  Downloading groq-0.31.1-py3-none-any.whl.metadata (16 kB)
Collecting langchain-groq
  Downloading langchain_groq-0.3.7-py3-none-any.whl.metadata (2.6 kB)
Downloading groq-0.31.1-py3-none-any.whl (134 kB)
Downloading langchain_groq-0.3.7-py3-none-any.whl (16 kB)
Installing collected packages: groq, langchain-groq
Successfully installed groq-0.31.1 langchain-groq-0.3.7


In [2]:
import os
import getpass

os.environ["GROQ_API_KEY"] = getpass.getpass("GROQ API Key:\n")

GROQ API Key:
··········
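
Re-running the cell above prompts for the key every time. A minimal sketch of prompting only when the variable is missing is shown here. The helper name `ensure_env_key` and the variable `DEMO_API_KEY` are hypothetical, and a stand-in prompt function replaces `getpass.getpass` so the example runs without user input.

```python
import getpass
import os


def ensure_env_key(name, prompt_fn=getpass.getpass):
    """Prompt for a secret only when it is not already set in the environment."""
    if not os.environ.get(name):
        os.environ[name] = prompt_fn(f"{name}:\n")
    return os.environ[name]


# Demonstration with a stand-in prompt function (hypothetical values).
os.environ.pop("DEMO_API_KEY", None)
first = ensure_env_key("DEMO_API_KEY", prompt_fn=lambda _: "demo-secret")
second = ensure_env_key("DEMO_API_KEY", prompt_fn=lambda _: "ignored")
print(first == second)  # the second call reuses the stored value
```

In the notebook itself, the analogous call would be `ensure_env_key("GROQ_API_KEY")`, which falls back to `getpass.getpass` when the key is not set.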


## Create a Model Instance

In [3]:
from langchain_groq import ChatGroq
llm = ChatGroq(model_name="llama-3.3-70b-versatile", temperature=0.7)

## Sanity Check

Before proceeding with new use cases, let's sanity check that we can interact with the Groq-hosted model via LangChain.

In [4]:
prompt = 'Where and when was NVIDIA founded?'
result = llm.invoke(prompt)

In [5]:
print(result.content)

NVIDIA was founded on April 5, 1993, in Santa Clara, California, USA. It was founded by Jensen Huang, Chris Malachowsky, and Curtis Priem.


---

## Streaming Responses

As an alternative to the `invoke` method, you can use the `stream` method to receive the model response in chunks. This way, you don't have to wait for the entire response to be generated, and you can see the output as it is being produced. Especially for long responses, or in user-facing applications, streaming output can result in a much better user experience.

Let's create a prompt that generates a longer response.

In [6]:
prompt = 'Explain who you are in roughly 500 words.'

Given this prompt, let's see how the `stream` method works.

In [14]:
for chunk in llm.stream(prompt):
    print(chunk.content, end='')

I am an artificial intelligence language model, which means I'm a computer program designed to understand, generate, and process human-like language. My primary function is to assist and communicate with users like you, providing information, answering questions, and engaging in conversation. I'm a type of machine learning model, trained on a massive dataset of text from various sources, including books, articles, and online content.

My training data is sourced from a vast array of topics and domains, allowing me to possess a broad knowledge base. I can provide information on subjects such as history, science, technology, literature, and more. I'm not limited to just providing factual information; I can also generate creative content, like stories or poems, and even help with language-related tasks, such as translation or text summarization.

One of my key features is my ability to learn and improve over time. Through interactions with users, I can refine my understanding of language 

The `stream` method in LangChain shows the response as it is being generated. This makes interaction with the LLM feel more responsive and can improve the user experience.
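
Printing chunks as they arrive discards the full text once the loop ends. A minimal sketch of displaying a stream while also keeping the complete response follows. Both `fake_stream` and `collect_stream` are hypothetical names, and the stand-in generator replaces a real model call so the example runs offline.

```python
from typing import Iterable, Iterator


def fake_stream() -> Iterator[str]:
    """Stand-in for the text pieces of a streamed response (hypothetical data)."""
    yield from ["Streaming ", "arrives ", "in ", "pieces."]


def collect_stream(text_chunks: Iterable[str]) -> str:
    """Print each piece of text as it arrives and return the full text."""
    parts = []
    for text in text_chunks:
        print(text, end="", flush=True)  # show output immediately
        parts.append(text)               # keep it for later use
    print()
    return "".join(parts)


full_text = collect_stream(fake_stream())
```

With the notebook's `llm`, the analogous call would be `collect_stream(chunk.content for chunk in llm.stream(prompt))`.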

---

## Batching Responses

You can also use `batch` to call the model on a list of prompts. Calling `batch` returns a list of responses in the same order as the prompts were passed in.

`batch` is convenient when you have a collection of inputs that all need a response from an LLM. It is also designed to process multiple prompts concurrently, running the requests in parallel as much as possible. This reduces the overall time needed to generate responses for a list of prompts and improves throughput compared with sending the prompts one at a time.
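
As a rough mental model of concurrent, order-preserving batching, the sketch below fans a list of prompts out over a thread pool. It is an illustration under stated assumptions, not LangChain's implementation: `fake_invoke` is a hypothetical stand-in that sleeps instead of calling a model, and `batch_sketch` is a hypothetical helper name.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_invoke(prompt: str) -> str:
    """Stand-in for a single model call; sleeps to simulate network latency."""
    time.sleep(0.1)
    return f"answer to: {prompt}"


def batch_sketch(prompts, max_workers=4):
    """Run the calls concurrently and return results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs,
        # regardless of which call finishes first.
        return list(pool.map(fake_invoke, prompts))


answers = batch_sketch(["first", "second", "third"])
print(answers)
```

Because the waiting time of the calls overlaps, the total time is closer to that of one call than to the sum of all calls. LangChain's `batch` also accepts a `config` argument, and a `max_concurrency` entry there is intended to cap the number of parallel calls; check the documentation for your installed version before relying on it.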

We'll demonstrate the functionality and performance benefits of batching by using this list of prompts about state capitals.

In [15]:
state_capital_questions = [
    'What is the capital of California?',
    'What is the capital of Texas?',
    'What is the capital of New York?',
    'What is the capital of Florida?',
    'What is the capital of Illinois?',
    'What is the capital of Ohio?'
]

Using `batch` we can pass in the entire list...

In [16]:
capitals = llm.batch(state_capital_questions)

... and get back a list of responses.

In [17]:
len(capitals)

6

In [18]:
for capital in capitals:
    print(capital.content)

The capital of California is Sacramento.
The capital of Texas is Austin.
The capital of New York is Albany.
The capital of Florida is Tallahassee.
The capital of Illinois is Springfield.
The capital of Ohio is Columbus.


---

## Comparing batch and invoke Performance

To get a quick sense of the potential performance gains from batching, here we time a call to `batch`. Note the `Wall time`.

In [19]:
%%time
llm.batch(state_capital_questions)

CPU times: user 60.4 ms, sys: 7.42 ms, total: 67.8 ms
Wall time: 441 ms


[AIMessage(content='The capital of California is Sacramento.', additional_kwargs={}, response_metadata={'token_usage': {'completion_tokens': 8, 'prompt_tokens': 42, 'total_tokens': 50, 'completion_time': 0.001833084, 'prompt_time': 0.011459491, 'queue_time': 0.185998124, 'total_time': 0.013292575}, 'model_name': 'llama-3.3-70b-versatile', 'system_fingerprint': 'fp_2ddfbb0da0', 'service_tier': 'on_demand', 'finish_reason': 'stop', 'logprobs': None}, id='run--5e43a46e-303b-4cea-9e78-b5212fea679b-0', usage_metadata={'input_tokens': 42, 'output_tokens': 8, 'total_tokens': 50}),
 AIMessage(content='The capital of Texas is Austin.', additional_kwargs={}, response_metadata={'token_usage': {'completion_tokens': 8, 'prompt_tokens': 42, 'total_tokens': 50, 'completion_time': 0.000317005, 'prompt_time': 0.016788008, 'queue_time': 0.185887093, 'total_time': 0.017105013}, 'model_name': 'llama-3.3-70b-versatile', 'system_fingerprint': 'fp_2ddfbb0da0', 'service_tier': 'on_demand', 'finish_reason': 's

And now to compare, we iterate over the `state_capital_questions` list and call `invoke` on each item. Again, note the `Wall time` and compare it to the results from batching above.

In [20]:
%%time
for cq in state_capital_questions:
    llm.invoke(cq)

CPU times: user 32.9 ms, sys: 3.13 ms, total: 36 ms
Wall time: 1.95 s
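
The `%%time` magic only works in notebook cells. A small sketch of measuring wall time in plain Python, and of turning the two wall times reported above into a speedup factor, is shown here. The helper name `timed` is hypothetical, and the numbers are copied from this single run, so they will vary between runs.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Wall times reported in the cells above (one run only).
batch_wall = 0.441   # seconds for llm.batch on six prompts
invoke_wall = 1.95   # seconds for six sequential llm.invoke calls

speedup = invoke_wall / batch_wall
print(f"batch was about {speedup:.1f}x faster in this run")
```

With the notebook's objects, `timed(llm.batch, state_capital_questions)` would return the responses together with the elapsed time.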


---

## Exercise: Batch Process to Create an FAQ Document

For this exercise you'll use batch processing to respond to a variety of LLM-related questions in service of creating an FAQ document (in this notebook setting the document will just be something we print to screen).

Here is a list of LLM-related questions.

In [None]:
faq_questions = [
    'What is a Large Language Model (LLM)?',
    'How do LLMs work?',
    'What are some common applications of LLMs?',
    'What is fine-tuning in the context of LLMs?',
    'How do LLMs handle context?',
    'What are some limitations of LLMs?',
    'How do LLMs generate text?',
    'What is the importance of prompt engineering in LLMs?',
    'How can LLMs be used in chatbots?',
    'What are some ethical considerations when using LLMs?'
]

Your job is to populate `faq_answers` below with a list of responses, one for each of the questions. Use the `batch` method to make this very easy.

Upon successful completion, you should be able to print the return value of calling the following `create_faq_document` with `faq_questions` and `faq_answers` and get an FAQ document for all of the LLM-related questions above.

In [None]:
def create_faq_document(faq_questions, faq_answers):
    faq_document = ''
    for question, response in zip(faq_questions, faq_answers):
        faq_document += f'{question.upper()}\n\n'
        faq_document += f'{response.content}\n\n'
        faq_document += '-'*30 + '\n\n'

    return faq_document
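
To see the output format before calling the model, the function can be exercised with stand-in responses. The sketch below repeats `create_faq_document` from the cell above so that it runs on its own, and uses `SimpleNamespace` objects with a `content` attribute as hypothetical placeholders for model responses. The sample questions and answers are made up for illustration.

```python
from types import SimpleNamespace


def create_faq_document(faq_questions, faq_answers):
    faq_document = ''
    for question, response in zip(faq_questions, faq_answers):
        faq_document += f'{question.upper()}\n\n'
        faq_document += f'{response.content}\n\n'
        faq_document += '-'*30 + '\n\n'

    return faq_document


# Stand-in data: any object with a `.content` attribute will do.
sample_questions = ['What is a token?', 'What is a prompt?']
sample_answers = [
    SimpleNamespace(content='A placeholder answer.'),
    SimpleNamespace(content='Another placeholder answer.'),
]

sample_document = create_faq_document(sample_questions, sample_answers)
print(sample_document)
```

Each entry consists of the upper-cased question, the response text, and a separator line of 30 hyphens.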

If you get stuck, check out the *Solution* below.

### Your Work Here

In [None]:
faq_answers = []

In [None]:
# This should work after you successfully populate `faq_answers` with LLM responses.
print(create_faq_document(faq_questions, faq_answers))

### Solution

In [None]:
faq_answers = llm.batch(faq_questions)

In [None]:
def create_faq_document(faq_questions, faq_answers):
    faq_document = ''
    for question, response in zip(faq_questions, faq_answers):
        faq_document += f'{question.upper()}\n\n'
        faq_document += f'{response.content}\n\n'
        faq_document += '-'*30 + '\n\n'

    return faq_document

In [None]:
print(create_faq_document(faq_questions, faq_answers))

WHAT IS A LARGE LANGUAGE MODEL (LLM)?

A Large Language Model (LLM) is a type of artificial intelligence (AI) designed to process and understand human language. It's a computer program that uses complex algorithms and statistical models to learn patterns and relationships in language data, allowing it to generate text, answer questions, and even converse with humans.

LLMs are trained on vast amounts of text data, often sourced from the internet, books, and other digital sources. This training data enables the model to learn the structure, syntax, and semantics of language, including grammar, vocabulary, and context.

Some key characteristics of LLMs include:

1. **Scalability**: LLMs are designed to handle massive amounts of data and can process thousands of parameters, making them highly scalable.
2. **Deep learning**: LLMs use deep learning techniques, such as neural networks, to analyze and represent language data.
3. **Language understanding**: LLMs aim to understand the meaning a

---