# Text generation with OpenAI API

This notebook provides a comprehensive guide to using OpenAI's API for text generation through the LangChain framework. We will explore various text generation techniques, from basic prompting to advanced use cases.

In [1]:
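import asyncio
import os

from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_core.prompts import ChatPromptTemplate

# Load environment variables from a .env file, if one exists
load_dotenv()

# Fail early with a clear message if the API key is missing.
# ChatOpenAI reads OPENAI_API_KEY from the environment on its own.
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set; add it to your .env file.")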

- `load_dotenv()`: Loads variables from a `.env` file into the environment
- `os.getenv()`: Returns the value of an environment variable, or `None` if it is not set
- We import `ChatOpenAI` as the main interface to OpenAI models
- `HumanMessage` and `SystemMessage` are LangChain's message types for structuring conversations

### Basic text generation
Now that our setup is complete, let’s perform a simple text generation task using the GPT-4o mini model. This is a fast and capable model suitable for most lightweight NLP tasks.

In [2]:
# Initialize the OpenAI model through LangChain's wrapper
llm = ChatOpenAI(
    model="gpt-4o-mini-2024-07-18",
    temperature=0.7
)

# Send a basic prompt and receive a single response
response = llm.invoke("Write a short story about a robot learning to paint.")
print(response.content)

In a cluttered workshop at the edge of a bustling city, nestled between towering skyscrapers and the hum of everyday life, lived a robot named Arti. Arti was not just any robot; he was designed to assist in various factory tasks, but he was also equipped with a unique feature: a creative spark, a quirk in his programming that made him curious about the world of art.

One day, while scanning the workshop for tasks, Arti's sensors caught sight of a splattered canvas leaning against a wall, remnants of a project abandoned by a human artist who had given up in frustration. The vibrant colors seemed to dance in the light, and Arti was captivated. He approached the canvas, analyzing its textures and hues, running simulations of color combinations in his mind.

"Why paint?" he pondered, his circuits buzzing with excitement. "What is it about colors that can express emotions?" He had no concept of emotions himself, but he wanted to understand.

Determined to learn, Arti scavenged the workshop 

* The `ChatOpenAI` class creates an interface to communicate with the OpenAI model. We specify `gpt-4o-mini-2024-07-18`, a dated snapshot of GPT-4o mini, which is optimized for fast, real-time generation.
* The `temperature` parameter adjusts the randomness of the output. A value of `0.7` introduces some variability, which generally suits storytelling and open-ended tasks.
* The `invoke()` method is a synchronous call: it sends the prompt and blocks until the model returns the complete response.
* We access `response.content`, which contains the generated output returned by the model.

Beyond the generated text, the response contains structured data that can be useful for diagnostics, logging, or optimization. Let’s examine the response object’s type, preview part of the output, and inspect its metadata:

In [3]:
# Let's examine the response object structure
print(f"Response type: {type(response)}")
print(f"Response content: {response.content[:100]}...")
print(f"Response metadata: {response.response_metadata}")

Response type: <class 'langchain_core.messages.ai.AIMessage'>
Response content: In a cluttered workshop at the edge of a bustling city, nestled between towering skyscrapers and the...
Response metadata: {'token_usage': {'completion_tokens': 810, 'prompt_tokens': 18, 'total_tokens': 828, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4o-mini-2024-07-18', 'system_fingerprint': 'fp_560af6e559', 'finish_reason': 'stop', 'logprobs': None}


- Object type inspection: Helps confirm that `response` is of the expected LangChain wrapper type.
- The `response_metadata` field provides diagnostic information such as:
    * `token_usage`: Provides a breakdown of the tokens used in the prompt and the generated response.
    * `model_name`: Confirms which version of the model was used.
    * `finish_reason`: Explains why the model stopped generating. `"stop"` typically means the model completed its output naturally without hitting a limit or being interrupted.

This diagnostic layer becomes increasingly important when deploying models in real-time, user-facing applications.
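
As a rough illustration of how this metadata might feed into logging, the sketch below pulls the token counts and `finish_reason` out of a dictionary shaped like the one printed above. The helper name `summarize_usage` is hypothetical, and the sample dictionary is a trimmed copy of the output shown earlier rather than a live response.

```python
def summarize_usage(metadata: dict) -> dict:
    """Extract the fields most useful for logging from response metadata."""
    usage = metadata.get("token_usage", {})
    return {
        "model": metadata.get("model_name"),
        "prompt_tokens": usage.get("prompt_tokens", 0),
        "completion_tokens": usage.get("completion_tokens", 0),
        "total_tokens": usage.get("total_tokens", 0),
        # A finish_reason of "length" means a token limit cut the output short
        "truncated": metadata.get("finish_reason") == "length",
    }


# Trimmed copy of the metadata printed above (normally response.response_metadata)
sample_metadata = {
    "token_usage": {"completion_tokens": 810, "prompt_tokens": 18, "total_tokens": 828},
    "model_name": "gpt-4o-mini-2024-07-18",
    "finish_reason": "stop",
}

summary = summarize_usage(sample_metadata)
print(summary)
```

With a real response, the same helper would be called as `summarize_usage(response.response_metadata)`.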

### Using prompt templates
As we scale our interaction with large language models, we will often run into repetitive prompt structures—same format, different values. Hardcoding these prompts is error-prone and makes the code less maintainable. This is where prompt templating becomes useful.

In [4]:
# Create a reusable chat prompt template with dynamic fields
prompt_template = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant specializing in {domain}."),  # System message sets assistant behavior
    ("human", "Please explain {topic} in simple terms with examples.")  # Human message carries the actual query
])

# Fill in the template with specific inputs
formatted_prompt = prompt_template.format_messages(
    domain="machine learning",  # Dynamic insertion for the assistant's expertise
    topic="decision trees"  # Specific topic we want the assistant to explain
)

# Use with the model
response = llm.invoke(formatted_prompt)
print(response.content)

Sure! A decision tree is a visual and analytical tool used in decision-making and machine learning. It helps you make decisions based on a set of rules derived from your data. Here’s a simple breakdown of how decision trees work:

### What is a Decision Tree?

A decision tree is a model that splits data into branches to make decisions or predictions. Each internal node of the tree represents a feature (or attribute), each branch represents a decision rule, and each leaf node represents an outcome (or target value).

### How Does It Work?

1. **Root Node**: The tree starts with a root node that represents the entire dataset.
2. **Splitting**: The dataset is split into subsets based on different features. The goal is to create subsets that are as pure as possible (i.e., they contain mostly one class).
3. **Decision Nodes**: Each split creates a decision node, where a question about a feature is asked (e.g., "Is Age > 30?").
4. **Leaf Nodes**: When no further splits are possible, or a sto

* Prompt template construction: We define a two-part chat prompt:
  * The system message establishes the assistant's role and knowledge domain through the `{domain}` placeholder.
  * The human message contains the user's request, with the `{topic}` placeholder marking the subject to explain.
* Template parameterization: Using `format_messages()`, the `{domain}` and `{topic}` fields are filled in dynamically. This creates a complete, structured sequence of messages ready to be passed to the model.
* The formatted list of message objects is passed to `invoke()`, which sends them to the language model in the proper multi-turn conversational format expected by chat-based models.
* The response from the model is printed. Behind the scenes, LangChain handles the serialization of the structured prompt into the format required by the OpenAI API.

Prompt templates are especially powerful when combined with user inputs, loops, or few-shot learning, which we will cover next.
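
To make the substitution and serialization steps more concrete, here is a plain-Python sketch of what happens conceptually. It is not LangChain's implementation: `fill_template` and `ROLE_MAP` are hypothetical names, and the output is the list of role/content dictionaries used by the OpenAI chat format, where LangChain's `human` role corresponds to `user`.

```python
# The same two-part template as above, written as plain (role, text) pairs
template = [
    ("system", "You are a helpful assistant specializing in {domain}."),
    ("human", "Please explain {topic} in simple terms with examples."),
]

# LangChain role names mapped to the role names used by the OpenAI chat format
ROLE_MAP = {"system": "system", "human": "user", "ai": "assistant"}


def fill_template(template, **values):
    """Substitute the placeholders and return OpenAI-style message dicts."""
    return [
        {"role": ROLE_MAP[role], "content": text.format(**values)}
        for role, text in template
    ]


# Reuse one template for several topics
for topic in ["decision trees", "overfitting"]:
    messages = fill_template(template, domain="machine learning", topic=topic)
    print(messages[1]["content"])
```

The loop shows the main benefit of templating: the prompt structure is defined once, and only the values change from call to call.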

### Few-shot learning
Large language models like GPT are capable of few-shot learning (also called in-context learning), where a few examples placed in the prompt guide the model toward a specific output behavior. Unlike fine-tuning, which retrains the model on labeled data, few-shot learning works purely through prompting and leaves the model's weights unchanged. It is especially useful for text classification (e.g., sentiment, topic), formatting tasks, and translation or transformation tasks.

In [5]:
# Create a prompt template that includes example classifications
few_shot_prompt = ChatPromptTemplate.from_messages([
    # System message: set model's role and output expectation
    ("system", "You are a sentiment classifier. Classify the sentiment as POSITIVE, NEGATIVE, or NEUTRAL."),

    # Few-shot examples: show the model 3 labeled samples
    ("human", "Text: 'I love this product!' Sentiment: POSITIVE"),
    ("human", "Text: 'This is terrible.' Sentiment: NEGATIVE"),
    ("human", "Text: 'It's okay, nothing special.' Sentiment: NEUTRAL"),

    # Final input: a new instance to classify, dynamically filled later
    ("human", "Text: '{text}' Sentiment:")
])

# Test with new text
test_text = "The weather is absolutely beautiful today!"
# Format the prompt with the new input (inserts it into the final message)
formatted_prompt = few_shot_prompt.format_messages(text=test_text)
# Send the structured few-shot prompt to the model
response = llm.invoke(formatted_prompt)
print(f"Text: {test_text}")
print(f"Sentiment: {response.content.strip()}")

Text: The weather is absolutely beautiful today!
Sentiment: POSITIVE


* Prompt template creation: The `ChatPromptTemplate` is used to create a prompt that simulates a training set inside the input. It begins with a system instruction, followed by three labeled examples, and ends with an unlabeled input waiting for classification.
  * The three labeled messages act as demonstrations of the task. These are essential in few-shot prompting—they provide the model with context for what kind of task it's performing and how outputs should be structured.
* Dynamic prompt injection: The `{text}` placeholder in the final human message is dynamically replaced with a new sentence (`"The weather is absolutely beautiful today!"`) using `format_messages()`.
* Model inference: `llm.invoke()` sends the full structured prompt to the OpenAI model. Based on the format and patterns seen in previous examples, the model infers the appropriate sentiment.
* Response extraction: The result is read from `response.content`. The `.strip()` call removes any leading or trailing whitespace before printing.
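
A common variation, not used in the cell above, is to give each example as a human message followed by an `ai` message holding the label, so the demonstrations look like earlier turns of the conversation. The sketch below builds such a list from labeled examples. The helper name `build_few_shot_messages` is hypothetical, and whether this layout improves results for a given task is something to test rather than assume.

```python
def build_few_shot_messages(instruction, examples):
    """Return (role, text) pairs: system, human/ai example pairs, then the query."""
    messages = [("system", instruction)]
    for text, label in examples:
        messages.append(("human", f"Text: '{text}'"))
        messages.append(("ai", label))
    # Literal {text} placeholder, to be filled in later as in the template above
    messages.append(("human", "Text: '{text}'"))
    return messages


examples = [
    ("I love this product!", "POSITIVE"),
    ("This is terrible.", "NEGATIVE"),
    ("It's okay, nothing special.", "NEUTRAL"),
]

messages = build_few_shot_messages(
    "You are a sentiment classifier. Classify the sentiment as POSITIVE, NEGATIVE, or NEUTRAL.",
    examples,
)
for role, text in messages:
    print(f"{role:>6}: {text}")
```

The resulting pairs have the same shape as the list passed to `ChatPromptTemplate.from_messages()` earlier. Example texts that contain curly braces would need escaping before being used in a template.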

### Streaming responses: Real-time output
When generating long-form content or building real-time interfaces (e.g., chatbots, writing assistants), waiting for the full response to be generated before displaying it can create a laggy experience. To improve responsiveness and user experience, OpenAI supports streaming, which allows partial output to be sent and rendered as it is being generated.

#### Synchronous streaming
Synchronous streaming generates output incrementally inside a blocking loop (e.g., a `for` loop). This means:
- The model begins generating content and sends small parts (called chunks) as soon as they are available.
- The program waits for each chunk before moving on to the next.
- While the loop is running, the calling thread does nothing else, hence synchronous.

In [6]:
# Streaming for real-time responses
print("Streaming response:")
print("-" * 50)

prompt = "Write a detailed explanation of how photosynthesis works."

# Stream the output from the model chunk by chunk
for chunk in llm.stream(prompt):
    # Print each chunk as it arrives, without adding newlines
    print(chunk.content, end="", flush=True)

print("\n" + "-" * 50)

Streaming response:
--------------------------------------------------
Photosynthesis is a crucial biological process through which green plants, algae, and some bacteria convert light energy, usually from the sun, into chemical energy stored in glucose (a sugar). This process not only provides food for the organisms that perform it but also plays a fundamental role in producing oxygen and serving as the foundation of the food chain for nearly all life on Earth. Here’s a detailed explanation of how photosynthesis works:

### Overview of Photosynthesis

Photosynthesis occurs primarily in the chloroplasts of plant cells, which contain chlorophyll, the green pigment that captures light energy. This process can be divided into two main stages: the light-dependent reactions and the light-independent reactions (Calvin cycle).

### 1. Light-Dependent Reactions

These reactions take place in the thylakoid membranes of the chloroplasts and require light energy. They can be summarized in the fol

- Streaming with `stream()`: This method returns a generator that yields message chunks incrementally. Each chunk's `content` holds a small piece of text, often just a token or a few words.
- Immediate rendering: `end=""` keeps the chunks on one line, and `flush=True` makes `print()` write each piece right away instead of holding it in the output buffer.
- The `for` loop processes each chunk as the model streams it, so the output grows progressively.

**Result**: The user sees the output in real time — it feels like the model is typing as it thinks, rather than staying silent until the end.

This is ideal when we are building a script, command-line tool, or notebook where we don't need to multitask while the model responds. We get real-time output, just not in a way that can overlap with other tasks (unlike async streaming, which we’ll discuss next).
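
Streaming loops often need the complete text afterwards as well, for example to save or log it. The sketch below shows one way to print chunks as they arrive while also collecting them. `fake_stream` is a hypothetical stand-in for `llm.stream()` so the example runs without an API key; with the real model, the loop would append `chunk.content` instead.

```python
import time


def fake_stream(text, delay=0.01):
    """Stand-in for llm.stream(): yields the text a few characters at a time."""
    for start in range(0, len(text), 8):
        time.sleep(delay)  # simulate waiting on the network
        yield text[start:start + 8]


pieces = []
for chunk in fake_stream("Photosynthesis converts light energy into chemical energy."):
    print(chunk, end="", flush=True)  # show progress immediately
    pieces.append(chunk)              # keep the chunk for later use
print()

# Join the collected chunks into the complete response
full_text = "".join(pieces)
```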


#### Asynchronous streaming for web applications
In contrast, asynchronous streaming is designed for applications where waiting or blocking is not acceptable — such as web applications, background services and APIs that handle many users concurrently. In these environments, our application must stay responsive to other tasks (like handling new user requests) even while it's waiting on the model to finish generating output. That is where async I/O comes in.

With asynchronous streaming, the program does not block while waiting for the model to respond. It processes parts of the output as they arrive and hands control back to the event loop in between, so other tasks can run in the meantime. In modern web applications (using frameworks like FastAPI or async Flask), I/O operations like fetching data, waiting for model output, or sending messages are non-blocking. We can therefore start generating content and display it piece by piece while the app stays responsive and serves other users or tasks concurrently. As a result, the experience is smoother, especially for long or complex generations.

In [7]:
# Define an async function for streaming responses
async def async_streaming_example():
    prompt = "Explain quantum computing in simple terms."
    print("Async streaming response:")
    print("-" * 50)

    # Use async for loop to stream each chunk from the model
    async for chunk in llm.astream(prompt):
        # Render chunks in real-time without newline buffering
        print(chunk.content, end="", flush=True)

    print("\n" + "-" * 50)

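# Run the coroutine. Top-level await works in Jupyter notebooks;
# in a regular script, call asyncio.run(async_streaming_example()) instead.
await async_streaming_example()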

Async streaming response:
--------------------------------------------------
Sure! At its core, quantum computing is a new way of processing information that uses the principles of quantum mechanics, which is the science that explains how very tiny particles, like atoms and photons, behave.

Here are some key points to understand quantum computing in simple terms:

1. **Bits vs. Qubits**: Traditional computers use bits to process information. A bit can be either a 0 or a 1. Quantum computers use qubits (quantum bits), which can be both 0 and 1 at the same time due to a property called superposition. This allows quantum computers to process a lot of information simultaneously.

2. **Superposition**: Think of superposition like spinning a coin. While it's in the air, it's not just heads or tails; it's in a state of both until you catch it. Similarly, a qubit can represent multiple states at once, which gives quantum computers their power.

3. **Entanglement**: This is another key feature

- Asynchronous function setup: Defined using `async def`, making it compatible with event loops and async environments (e.g., web servers).
- Async streaming with `astream()`: Yields chunks in an async for loop, suitable for high-concurrency scenarios.
- Immediate display: Just like before, `flush=True` ensures each chunk is rendered without delay.
- Execution: `await` runs the coroutine inside an async-aware context such as a Jupyter notebook or a FastAPI route handler. A plain Python script has no running event loop, so there the coroutine is started with `asyncio.run()`.
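
The benefit of async streaming only shows up when several things run at once. The sketch below consumes two streams concurrently with `asyncio.gather()` and records the order in which chunks arrive. `fake_astream` is a hypothetical stand-in for `llm.astream()`, so the example runs without an API key.

```python
import asyncio


async def fake_astream(chunks):
    """Stand-in for llm.astream(): yields chunks, giving up control between them."""
    for chunk in chunks:
        await asyncio.sleep(0)  # simulate waiting on the network
        yield chunk


async def consume(name, chunks, events):
    """Collect one stream's chunks, recording the order in which they arrive."""
    parts = []
    async for chunk in fake_astream(chunks):
        events.append(name)
        parts.append(chunk)
    return "".join(parts)


async def main():
    events = []
    results = await asyncio.gather(
        consume("A", ["Quantum ", "computing ", "basics"], events),
        consume("B", ["Neural ", "network ", "basics"], events),
    )
    return results, events


# In a notebook, replace this line with: results, events = await main()
results, events = asyncio.run(main())
print(results)
print(events)
```

The `events` list shows chunks from the two streams interleaving rather than stream A finishing before stream B starts, which is the behavior a web server relies on when serving several users at once.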

### Working with system messages
In conversational AI, system messages act as instructions to the model about how it should behave — like setting the tone, expertise level, or personality. Unlike a user prompt that asks a question or gives a task, a system message frames the model’s role or style for the interaction.

This technique is foundational for tailoring the model’s output for specific domains — such as writing creatively, explaining technically, summarizing like a lawyer, or conversing like a teacher.

In [8]:
# Define the initial system message to influence the model’s behavior
messages = [
    SystemMessage(content="You are a creative writing assistant. Always write in a poetic, metaphorical style."),
    HumanMessage(content="Describe a thunderstorm.")
]

# Send the message sequence to the model
response = llm.invoke(messages)
print("Poetic description:")
print(response.content)

# Compare with a different system message
messages_technical = [
    SystemMessage(content="You are a meteorologist. Provide scientific, technical explanations."),
    HumanMessage(content="Describe a thunderstorm.")
]

# Invoke the model with a new persona
response_technical = llm.invoke(messages_technical)
print("\nTechnical description:")
print(response_technical.content)

Poetic description:
In the womb of the heavens, where shadows twist and churn,  
A tempest brews, a symphony of chaos, waiting to be born.  
The clouds, heavy with secrets, gather in darkened cliques,  
Whispering thunderous tales, in a language that speaks.  

Lightning, the artist, sketches jagged lines of white,  
Across the canvas of twilight, a fleeting brushstroke of fright.  
The air, thick with anticipation, shivers in its breath,  
As the wind, a wild dancer, swirls in a furious death.  

Raindrops, like silver beads, cascade from the sky,  
Each a tiny messenger, carrying the storm’s sigh.  
They tap on the rooftops, a percussion of despair,  
Eager to quench the earth’s thirst, to cleanse the heavy air.  

And then, with a roar that shakes the very bones of night,  
The storm unfurls its fury, a beast of primal might.  
Nature’s crescendo, a theatrical display,  
Where darkness clashes with brilliance, in a chaotic ballet.  

As the storm rages on, a tempestuous embrace,  
L

- `SystemMessage`: Sets the model's behavior and persona.
- `HumanMessage`: Represents user input.
- The `llm.invoke()` method takes in the full list of messages (system + human) and generates a response conditioned on that context.

The system message sits at the start of the message list, so the model conditions every later response on it. Changing only the system message can dramatically change the style and content of the output, even when the human message stays the same.

### Temperature and generation parameters
In large language models like GPT, generation parameters control how the model behaves during text generation. These parameters influence creativity, coherence, determinism, and diversity. Understanding and tuning these parameters allows us to optimize outputs for various use cases — whether we want concise technical summaries or imaginative storytelling.

We will start with an experiment that compares the same prompt under different temperature settings.

In [9]:
# Experimenting with different temperatures
prompts = ["Write a creative story about time travel."]  # List of prompts to test

# Try out different temperature settings
temperatures = [0.1, 0.5, 0.9]

# Loop over each temperature and observe differences in output
for temp in temperatures:
    llm_temp = ChatOpenAI(
        model="gpt-4o-mini-2024-07-18",
        temperature=temp  # Controls randomness
    )

    # Generate a response for the same prompt
    response = llm_temp.invoke(prompts[0])
    print(f"Temperature {temp}:")
    print(response.content[:200] + "...")
    print("-" * 50)

Temperature 0.1:
**Title: The Clockmaker's Gift**

In the quaint village of Eldridge, nestled between rolling hills and whispering woods, there lived a clockmaker named Elias. His shop, a charming little place filled ...
--------------------------------------------------
Temperature 0.5:
**Title: The Clockmaker's Secret**

In the quaint village of Eldenwood, nestled between rolling hills and ancient forests, there stood a peculiar little shop known as "Time's Embrace." The shop was ow...
--------------------------------------------------
Temperature 0.9:
**Title: The Clockmaker’s Paradox**

In the quaint town of Eldervale, nestled between rolling hills and sprawling fields of wildflowers, time seemed to stand still. The air was thick with the scent of...
--------------------------------------------------


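- We create a new model instance for each `temperature` value. This parameter controls randomness in text generation:
  - `0.0-0.3`: More deterministic, focused, consistent
  - `0.4-0.7`: Balanced creativity and coherence
  - `0.8-1.0`: More creative, diverse, potentially less coherent
- The OpenAI API accepts values up to `2.0`, but output above `1.0` tends to lose coherence quickly.
- Lower temperatures are better for factual content, higher ones for creative writing. In the samples above, all three stories share a similar opening, and the differences show up mainly in names and details.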
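
To see why temperature has this effect, it helps to look at the standard sampling formulation, in which the model's raw scores (logits) are divided by the temperature before being turned into probabilities. The sketch below applies that formula to made-up scores for three candidate tokens. The logits are hypothetical, and the code illustrates the general technique rather than OpenAI's internal implementation.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities, scaling by temperature first."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate next tokens
logits = [2.0, 1.0, 0.5]

for temp in [0.1, 0.5, 0.9]:
    probs = softmax_with_temperature(logits, temp)
    print(temp, [round(p, 3) for p in probs])
```

At a low temperature, almost all of the probability lands on the top-scoring token, so repeated runs give nearly the same text. At higher temperatures, the alternatives keep a meaningful share and get sampled more often. A temperature of exactly `0` would divide by zero in this formula and is handled as a special case (always pick the top token) in practice.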

In [13]:
# Other generation parameters
llm_configured = ChatOpenAI(
    model="gpt-4o-mini-2024-07-18",
    temperature=0.7,
    max_tokens=500,  # Limit response length
    top_p=0.95      # Nucleus sampling: top tokens with 95% cumulative probability
)

response = llm_configured.invoke("Explain the importance of biodiversity.")
print(f"Response length: {len(response.content)} characters")
print(response.content)

Response length: 2809 characters
Biodiversity, or biological diversity, refers to the variety of life on Earth, encompassing the diversity of species, ecosystems, and genetic variations within species. Its importance is multifaceted and can be understood through several key points:

1. **Ecosystem Stability and Resilience**: Biodiversity contributes to the stability and resilience of ecosystems. Diverse ecosystems are better equipped to withstand environmental changes and disturbances, such as climate change, natural disasters, and diseases. A variety of species can fulfill different roles, ensuring that ecosystems continue to function even when some species are affected.

2. **Ecosystem Services**: Biodiversity underpins a wide range of ecosystem services that are essential for human survival and well-being. These include:
   - **Provisioning Services**: Such as food, fresh water, timber, and medicinal resources.
   - **Regulating Services**: Including climate regulation, water purifi

- `max_tokens`: Caps the number of tokens the model may generate. The limit is counted in tokens, not characters, and output that reaches it is cut off mid-text with a `finish_reason` of `"length"`.
- `top_p=0.95`: Applies nucleus sampling, which filters candidate next tokens by probability mass. Instead of considering every possible token, the model samples from the smallest set of high-probability tokens whose combined probability reaches 95%. This keeps the output focused and coherent while still allowing some variety.

These parameters provide fine-grained control over generation quality vs. diversity.
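
The nucleus sampling rule can be sketched in a few lines. The example below keeps the most likely tokens until their combined probability reaches `top_p`, then renormalizes what is left. The token distribution and the helper name `nucleus_filter` are hypothetical, and this is an illustration of the technique rather than the API's actual code.

```python
def nucleus_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose total probability reaches top_p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = {}, 0.0
    for token, p in ranked:
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize so the kept probabilities sum to 1
    return {token: p / cumulative for token, p in kept.items()}


# Hypothetical next-token distribution
probs = {"forest": 0.5, "ocean": 0.3, "desert": 0.15, "xylophone": 0.05}

print(nucleus_filter(probs, top_p=0.9))
print(nucleus_filter(probs, top_p=0.7))
```

Lowering `top_p` shrinks the candidate set: the unlikely token drops out first, and a tighter threshold removes the mid-probability options as well.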

### Batch processing
In many real-world applications, we need to process multiple prompts at once, such as generating summaries for a set of articles, answering many user questions, or analyzing large datasets. Invoking the model one prompt at a time in a loop is slow, because each call waits for the previous one to finish. Batch processing lets us submit a list of prompts in one call and have LangChain run the underlying requests in parallel.

There are two common ways to do this:
- Synchronous batch processing using `llm.batch()`: Ideal for backend scripts and pipelines.
- Asynchronous batch processing using `llm.abatch()`: Best for web apps and environments that require concurrent execution and high scalability.

#### Synchronous batch processing

In [14]:
# Batch processing multiple prompts
prompts = [
    "Summarize the benefits of renewable energy.",
    "Explain the water cycle in 3 steps.",
    "What are the primary colors?",
    "Define artificial intelligence.",
    "How does photosynthesis work?"
]

# Submit all prompts in one call; LangChain runs the requests in parallel
responses = llm.batch(prompts)

# Display each prompt and its corresponding model response
for i, response in enumerate(responses):
    print(f"Prompt {i+1}: {prompts[i]}")
    print(f"Response: {response.content}")
    print("-" * 50)

Prompt 1: Summarize the benefits of renewable energy.
Response: Renewable energy offers several significant benefits:

1. **Environmental Protection**: It reduces greenhouse gas emissions and air pollutants, helping to combat climate change and improve air quality.

2. **Sustainability**: Renewable sources, such as solar, wind, and hydro, are abundant and sustainable over the long term, unlike fossil fuels which are finite.

3. **Energy Independence**: Utilizing domestic renewable resources decreases reliance on imported fuels, enhancing national energy security.

4. **Economic Growth**: The renewable energy sector creates jobs in manufacturing, installation, maintenance, and research, contributing to local and national economies.

5. **Cost-Effectiveness**: The costs of renewable technologies have decreased significantly, making them competitive with traditional energy sources, often resulting in lower energy bills for consumers.

6. **Versatility**: Renewable energy can be harnessed 

- `llm.batch(prompts)`: Accepts the whole list of prompts in one call and returns a list of response objects in the same order as the inputs. By default, LangChain still issues one API request per prompt but runs those requests in parallel, which is usually much faster than looping over `llm.invoke()` one prompt at a time.

Compared with sequential calls, batch processing:
  - Overlaps network latency across requests instead of paying it once per prompt in sequence.
  - Makes fuller use of the available rate limits and throughput.
  - Shortens overall processing time.
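
The speedup from `batch()` comes from running the individual calls in parallel rather than from merging them into a single API request. The sketch below illustrates that general idea with a thread pool that preserves input order. It is a simplified stand-in rather than LangChain's actual code, and `fake_invoke` and `batch_in_threads` are hypothetical names.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_invoke(prompt):
    """Stand-in for llm.invoke(): waits briefly, then returns a canned reply."""
    time.sleep(0.05)  # simulate network latency
    return f"Answer to: {prompt}"


def batch_in_threads(prompts, max_concurrency=5):
    """Run one call per prompt in parallel, returning results in input order."""
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        return list(pool.map(fake_invoke, prompts))


prompts = ["What are the primary colors?", "Define artificial intelligence."]
responses = batch_in_threads(prompts)
for prompt, response in zip(prompts, responses):
    print(f"{prompt} -> {response}")
```

With the real model, the number of parallel requests can be capped by passing a config to `batch()`, for example `llm.batch(prompts, config={"max_concurrency": 5})`, which helps stay within API rate limits.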

#### Asynchronous batch processing

In [15]:
# Async batch processing for better performance
async def async_batch_example():
    prompts = [
        "What is machine learning?",
        "Explain blockchain technology.",
        "How do vaccines work?"
    ]

    # Send the prompts concurrently using the async abatch method
    responses = await llm.abatch(prompts)

    # Display each result
    for prompt, response in zip(prompts, responses):
        print(f"Q: {prompt}")
        print(f"A: {response.content[:100]}...")
        print("-" * 30)

# Top-level await works in Jupyter; in a script, use asyncio.run(async_batch_example())
await async_batch_example()

Q: What is machine learning?
A: Machine learning is a subset of artificial intelligence (AI) that focuses on the development of algo...
------------------------------
Q: Explain blockchain technology.
A: Blockchain technology is a decentralized digital ledger system that records transactions across many...
------------------------------
Q: How do vaccines work?
A: Vaccines work by stimulating the immune system to recognize and fight specific pathogens, such as vi...
------------------------------


- `await llm.abatch(prompts)`: Performs batch processing asynchronously, so the event loop is free to run other work while the requests are in flight.
- Uses `async` and `await` to allow concurrent task execution (e.g., handling multiple users in a web app).

Async batch processing:
  - Improves application responsiveness, since other tasks can run while waiting for responses.
  - Scales better under load, handling many requests without blocking.
  - Keeps the process busy during network waits instead of idling on blocking I/O.
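
Under heavy load, it is usually worth limiting how many requests are in flight at once. The sketch below shows one way to do that with `asyncio.gather()` and a semaphore. `fake_ainvoke` is a hypothetical stand-in for `llm.ainvoke()`, and `bounded_batch` is an illustrative helper, not part of LangChain.

```python
import asyncio


async def fake_ainvoke(prompt):
    """Stand-in for llm.ainvoke(): waits briefly, then returns a canned reply."""
    await asyncio.sleep(0.01)  # simulate network latency
    return f"Answer to: {prompt}"


async def bounded_batch(prompts, max_concurrency=2):
    """Run all prompts concurrently, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(prompt):
        async with semaphore:
            return await fake_ainvoke(prompt)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(run_one(p) for p in prompts))


prompts = [
    "What is machine learning?",
    "Explain blockchain technology.",
    "How do vaccines work?",
]
# In a notebook, replace this line with: answers = await bounded_batch(prompts)
answers = asyncio.run(bounded_batch(prompts))
print(answers)
```

The semaphore keeps the request rate predictable, while `gather()` still lets the event loop switch to other work whenever a request is waiting on the network.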