# Text generation with Gemini API

This notebook provides a comprehensive guide to using Google's Gemini API for text generation through the LangChain framework. We will explore various text generation techniques, from basic prompting to advanced use cases.

In [1]:
import os
from dotenv import load_dotenv
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_core.prompts import ChatPromptTemplate
import asyncio

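# Load environment variables from a local .env file, if one exists
load_dotenv()

# Fail early with a clear message if the API key is missing
if not os.getenv("GOOGLE_API_KEY"):
    raise RuntimeError("GOOGLE_API_KEY is not set; add it to your .env file or environment.")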

- `load_dotenv()`: Loads variables from a `.env` file into the process environment.
- `os.getenv()`: Reads an environment variable and returns `None` (rather than raising an error) if it is not set.
- We import `ChatGoogleGenerativeAI` as the main interface to Gemini models
- `HumanMessage` and `SystemMessage` are LangChain's message types for structuring conversations

### Basic text generation
Now that our setup is complete, let’s perform a simple text generation task using Gemini 2.5 Flash, a fast, low-latency Gemini model suited to most lightweight NLP tasks.

In [2]:
# Initialize the Gemini model through LangChain's wrapper
llm = ChatGoogleGenerativeAI(
    model="gemini-2.5-flash",
    temperature=0.7
)

# Send a basic prompt and receive a single response
response = llm.invoke("Write a short story about a robot learning to paint.")
print(response.content)

KAI-1 was built for precision. Its articulated arm, a marvel of servo-motors and polished chrome, could assemble microchips with sub-nanometer accuracy or perform delicate surgical procedures with unwavering steadiness. Its optical sensors processed data faster than any human eye, categorizing, analyzing, and optimizing.

One Tuesday, its human, a gentle old woman named Elara, placed a canvas on a sturdy easel in the sunlit studio. Next to it, she arranged tubes of vibrant paint and a collection of brushes.

"Alright, KAI-1," she said, her voice soft but firm. "Today, we paint."

KAI-1’s internal processors whirred. *Painting. Definition: The application of pigment to a surface. Purpose: Aesthetic expression. Data insufficient for optimal execution.*

Its first attempt was a still life of an apple. KAI-1 scanned the fruit, identifying its precise shape, the exact shade of crimson, the subtle curve of its stem. Its arm moved with flawless efficiency, applying the paint in perfectly unif

* The `ChatGoogleGenerativeAI` class creates an interface to communicate with the Gemini model. We specify `gemini-2.5-flash`, which is optimized for fast, real-time generation.
* The `temperature` parameter adjusts the creativity and variability of the output. A value of `0.7` introduces some randomness, which is generally good for storytelling and open-ended tasks.
* The `invoke()` method is a synchronous call—it sends the prompt and waits for the complete response from the model.
* We access `response.content`, which contains the generated output returned by the model.

Beyond the generated text, the response contains structured data that can be useful for diagnostics, logging, or optimization. Let’s examine the response object’s type, preview part of the output, and inspect its metadata:

In [3]:
# Let's examine the response object structure
print(f"Response type: {type(response)}")
print(f"Response content: {response.content[:100]}...")
print(f"Response metadata: {response.response_metadata}")

Response type: <class 'langchain_core.messages.ai.AIMessage'>
Response content: KAI-1 was built for precision. Its articulated arm, a marvel of servo-motors and polished chrome, co...
Response metadata: {'prompt_feedback': {'block_reason': 0, 'safety_ratings': []}, 'finish_reason': 'STOP', 'model_name': 'gemini-2.5-flash', 'safety_ratings': []}


- Object type inspection: Confirms that `response` is the expected LangChain message type (`AIMessage`).
- The `response_metadata` field provides diagnostic information such as:
  * `model_name`: Confirms which model produced the response.
  * `finish_reason`: Explains why the model stopped generating. `"STOP"` typically means the model completed its output naturally without hitting a limit or being interrupted.
  * `prompt_feedback.block_reason`: Indicates whether the prompt was blocked for safety reasons. A value of `0` means no blocking occurred.
  * `prompt_feedback.safety_ratings` and `safety_ratings`: Hold structured safety evaluations (such as categories of sensitive or harmful content) when the API returns them. An empty list (`[]`) means no ratings were reported for this request.

This diagnostic layer becomes increasingly important when deploying models in real-time, user-facing applications.
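
As a minimal sketch of how this metadata might feed a logging or retry decision, the helper below inspects a metadata dictionary shaped like the one printed above. The function name `summarize_generation` is hypothetical, and the sketch assumes only the keys visible in that output; other models or library versions may report different fields.

```python
from typing import Any


def summarize_generation(metadata: dict[str, Any]) -> dict[str, Any]:
    """Reduce a response_metadata-style dict to a few fields worth logging.

    Assumes the shape shown in the printed output above; missing keys are
    treated as unknown rather than as errors.
    """
    feedback = metadata.get("prompt_feedback") or {}
    finish_reason = metadata.get("finish_reason")
    return {
        "model": metadata.get("model_name"),
        "finish_reason": finish_reason,
        # "STOP" is the value the notebook observed for a natural completion
        "completed_normally": finish_reason == "STOP",
        # The notebook's output used 0 to mean the prompt was not blocked
        "prompt_blocked": feedback.get("block_reason", 0) != 0,
    }


# Metadata copied from the output printed above
sample_metadata = {
    "prompt_feedback": {"block_reason": 0, "safety_ratings": []},
    "finish_reason": "STOP",
    "model_name": "gemini-2.5-flash",
    "safety_ratings": [],
}
summary = summarize_generation(sample_metadata)
print(summary)
```

With a real call, `summarize_generation(response.response_metadata)` would produce the same kind of summary, which can then be logged or used to decide whether to retry.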

### Using prompt templates
As we scale our interactions with large language models, we will often run into repetitive prompt structures: same format, different values. Hardcoding these prompts is error-prone and makes the code less maintainable. This is where prompt templating becomes useful.

In [4]:
# Create a reusable chat prompt template with dynamic fields
prompt_template = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant specializing in {domain}."),  # System message sets assistant behavior
    ("human", "Please explain {topic} in simple terms with examples.")  # Human message carries the actual query
])

# Fill in the template with specific inputs
formatted_prompt = prompt_template.format_messages(
    domain="machine learning",  # Dynamic insertion for the assistant's expertise
    topic="neural networks"  # Specific topic we want the assistant to explain
)

# Use with the model
response = llm.invoke(formatted_prompt)
print(response.content)

Imagine a Neural Network as a **"Smart Decision-Making Helper"** that learns from experience, much like how we learn.

It's inspired by the human brain, but it's a computer program designed to recognize patterns and make predictions.

---

### The Core Idea: Learning from Examples

Think about teaching a child to recognize a **dog**.
*   You show them pictures of many different dogs (inputs).
*   You tell them, "This is a dog!" (correct output).
*   You also show them pictures of cats, birds, and cars, saying, "This is NOT a dog!"
*   Over time, the child learns to distinguish a dog from other animals, even if they see a breed they've never encountered before.

A Neural Network learns in a very similar way!

---

### The "Parts" of Our Smart Helper

Our "Smart Decision-Making Helper" has a few key components:

1.  **Input Layer (The Senses):**
    *   This is where you feed information into the network.
    *   **Analogy:** If you're trying to decide if it's a good day for a picnic, yo

* Prompt template construction: We define a two-part chat prompt:
  * The system message is used to establish the role and knowledge domain of the assistant.
  * The human message contains the user’s request, using a variable placeholder `{topic}`.
* Template parameterization: Using `format_messages()`, the `{domain}` and `{topic}` fields are filled in dynamically. This creates a complete, structured sequence of messages ready to be passed to the model.
* The formatted list of message objects is passed to `invoke()`, which sends them to the language model in the proper multi-turn conversational format expected by chat-based models.
* The response from the model is printed. Behind the scenes, LangChain handles the serialization of the structured prompt into the format required by the Gemini API.

Prompt templates are especially powerful when combined with user inputs, loops, or few-shot learning, which we will cover next.
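
To make the reuse pattern concrete, here is a plain-Python sketch of the idea behind templating: one list of role and template pairs, filled in repeatedly inside a loop. The `render_messages` helper is a hypothetical stand-in written with `str.format`; it is not how LangChain implements `format_messages()`, only an illustration of the substitution step.

```python
# One template, reused for several inputs
TEMPLATE = [
    ("system", "You are a helpful assistant specializing in {domain}."),
    ("human", "Please explain {topic} in simple terms with examples."),
]


def render_messages(template, **values):
    """Fill every placeholder in a list of (role, template_string) pairs."""
    return [(role, text.format(**values)) for role, text in template]


topics = ["neural networks", "decision trees", "gradient descent"]
rendered = [
    render_messages(TEMPLATE, domain="machine learning", topic=topic)
    for topic in topics
]

for messages in rendered:
    # Show only the human turn; the system turn is identical each time
    print(messages[1][1])
```

With the notebook's `prompt_template`, the equivalent loop would call `prompt_template.format_messages(domain=..., topic=...)` for each topic and pass the result to `llm.invoke()`.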

### Few-shot learning
Large language models like Gemini are capable of few-shot learning—a method where you provide a few examples within the prompt to guide the model toward a specific output behavior. Unlike fine-tuning, which requires retraining the model on labeled data, few-shot learning works purely through prompting. This is especially useful for text classification (e.g., sentiment, topic), formatting tasks and translation or transformation tasks.


In [5]:
# Create a prompt template that includes example classifications
few_shot_prompt = ChatPromptTemplate.from_messages([
    # System message: set model's role and output expectation
    ("system", "You are a sentiment classifier. Classify the sentiment as POSITIVE, NEGATIVE, or NEUTRAL."),

    # Few-shot examples: show the model 3 labeled samples
    ("human", "Text: 'I love this product!' Sentiment: POSITIVE"),
    ("human", "Text: 'This is terrible.' Sentiment: NEGATIVE"),
    ("human", "Text: 'It's okay, nothing special.' Sentiment: NEUTRAL"),

    # Final input: a new instance to classify, dynamically filled later
    ("human", "Text: '{text}' Sentiment:")
])

# Test with new text
test_text = "The weather is absolutely beautiful today!"
# Format the prompt with the new input (inserts it into the final message)
formatted_prompt = few_shot_prompt.format_messages(text=test_text)
# Send the structured few-shot prompt to the model
response = llm.invoke(formatted_prompt)
print(f"Text: {test_text}")
print(f"Sentiment: {response.content.strip()}")

Text: The weather is absolutely beautiful today!
Sentiment: POSITIVE


* Prompt template creation: The `ChatPromptTemplate` is used to create a prompt that simulates a training set inside the input. It begins with a system instruction, followed by three labeled examples, and ends with an unlabeled input waiting for classification.
  * The three labeled messages act as demonstrations of the task. These are essential in few-shot prompting—they provide the model with context for what kind of task it's performing and how outputs should be structured.
* Dynamic prompt injection: The `{text}` placeholder in the final human message is dynamically replaced with a new sentence (`"The weather is absolutely beautiful today!"`) using `format_messages()`.
* Model inference: `llm.invoke()` sends the full structured prompt to the Gemini model. Based on the format and patterns seen in previous examples, the model infers the appropriate sentiment.
* Response extraction: The result is extracted from `response.content`. The `.strip()` ensures no extra whitespace is printed.
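
A common variation, sketched below, presents each example as a human turn followed by an assistant turn, so the label appears as something the model said rather than as part of the user's text. The helper `build_few_shot_messages` is hypothetical; it returns role and content pairs of the kind passed to `ChatPromptTemplate.from_messages()` in the cell above, and the `"ai"` role name is assumed here to be the assistant counterpart of `"human"`. Whether this layout classifies better than the single-role layout is something to test on your own data.

```python
EXAMPLES = [
    ("I love this product!", "POSITIVE"),
    ("This is terrible.", "NEGATIVE"),
    ("It's okay, nothing special.", "NEUTRAL"),
]


def escape_braces(text: str) -> str:
    """Double any braces so example text is not read as a template variable."""
    return text.replace("{", "{{").replace("}", "}}")


def build_few_shot_messages(examples):
    """Return (role, content) pairs: system, alternating human/ai, final query."""
    messages = [
        ("system", "You are a sentiment classifier. "
                   "Classify the sentiment as POSITIVE, NEGATIVE, or NEUTRAL."),
    ]
    for text, label in examples:
        messages.append(("human", f"Text: '{escape_braces(text)}'"))
        messages.append(("ai", label))
    # The final human turn keeps a {text} placeholder to be filled in later
    messages.append(("human", "Text: '{text}'"))
    return messages


few_shot_messages = build_few_shot_messages(EXAMPLES)
for role, content in few_shot_messages:
    print(f"{role:>6}: {content}")
```

Keeping the examples in a plain list also makes it easy to add, remove, or swap demonstrations without editing the template by hand.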

### Streaming responses: Real-time output
When generating long-form content or building real-time interfaces (e.g., chatbots, writing assistants), waiting for the full response to be generated before displaying it can create a laggy experience. To improve responsiveness and user experience, Gemini supports streaming, which allows partial output to be sent and rendered as it is being generated.

#### Synchronous streaming
Synchronous streaming generates output incrementally inside a blocking loop (e.g., a `for` loop). This means:
- The model begins generating content and sends small parts (called chunks) as soon as they are available.
- The program waits for each chunk before moving to the next.
- While the loop is running, the calling thread does nothing else, hence synchronous.

In [6]:
# Streaming for real-time responses
print("Streaming response:")
print("-" * 50)

prompt = "Write a detailed explanation of how photosynthesis works."

# Stream the output from the model chunk by chunk
for chunk in llm.stream(prompt):
    # Print each chunk as it arrives, without adding newlines
    print(chunk.content, end="", flush=True)

print("\n" + "-" * 50)

Streaming response:
--------------------------------------------------
Photosynthesis is arguably the most fundamental biological process on Earth, converting light energy into chemical energy in the form of sugars. It's the primary way energy enters most ecosystems and is responsible for producing the oxygen we breathe.

Let's break down how this incredible process works in detail.

---

## What is Photosynthesis?

At its core, photosynthesis is the process by which **plants, algae, and certain bacteria (cyanobacteria)** use **sunlight, water, and carbon dioxide** to create **glucose (a sugar)** and **oxygen**.

The overall simplified chemical equation for photosynthesis is:

$6CO_2 (Carbon Dioxide) + 6H_2O (Water) + Light Energy \rightarrow C_6H_{12}O_6 (Glucose) + 6O_2 (Oxygen)$

---

## Where Does Photosynthesis Occur? The Chloroplast

In plants and algae, photosynthesis takes place within specialized organelles called **chloroplasts**. These "solar factories" are found primarily i

- Streaming with `stream()`: This method returns a generator that yields content chunks (e.g., sentences, phrases, or tokens) incrementally.
- Immediate rendering: The `flush=True` in the `print()` function ensures each piece is rendered on the terminal or UI without buffering delays.
- The `for` loop processes each new token or chunk as it's streamed by the model, allowing output to grow progressively.

**Result**: The user sees the output in real time — it feels like the model is typing as it thinks, rather than staying silent until the end.

This is ideal when we are building a script, command-line tool, or notebook where we don't need to multitask while the model responds. We get real-time output, just not in a way that can overlap with other tasks (unlike async streaming, which we’ll discuss next).
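
In practice we often want the complete text as well as the live display, for example to save it once streaming finishes. The sketch below accumulates chunks while printing them. To keep it runnable without an API key, `fake_stream` is a hypothetical stand-in that yields plain strings; with the real model, the loop would iterate over `llm.stream(prompt)` and read `chunk.content` instead.

```python
from typing import Iterable, Iterator


def fake_stream(text: str, chunk_size: int = 12) -> Iterator[str]:
    """Hypothetical stand-in for a streaming model: yields the text in pieces."""
    for start in range(0, len(text), chunk_size):
        yield text[start:start + chunk_size]


def stream_and_collect(chunks: Iterable[str]) -> str:
    """Print each chunk as it arrives and return the full text at the end."""
    parts = []
    for chunk in chunks:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()
    # Joining once at the end avoids repeated string concatenation
    return "".join(parts)


source_text = "Photosynthesis converts light energy into chemical energy."
full_text = stream_and_collect(fake_stream(source_text))
```

Collecting the pieces in a list and joining them once keeps the loop body cheap, which matters when a long response arrives as many small chunks.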


#### Asynchronous streaming for web applications
In contrast, asynchronous streaming is designed for applications where waiting or blocking is not acceptable — such as web applications, background services and APIs that handle many users concurrently. In these environments, our application must stay responsive to other tasks (like handling new user requests) even while it's waiting on the model to finish generating output. That is where async I/O comes in.

Asynchronous streaming is a technique where our program doesn't block while waiting for the model to respond. Instead, it processes parts of the output as they arrive and yields control to other tasks in the meantime. In modern web applications (using frameworks like FastAPI or async Flask), I/O operations like fetching data, waiting for model output, or sending messages are non-blocking. This means that we can start generating content and display it piece by piece while the app remains responsive, handling other users or tasks concurrently. As a result, the experience is smoother, especially for long or complex generations.

In [7]:
# Define an async function for streaming responses
async def async_streaming_example():
    prompt = "Explain quantum computing in simple terms."
    print("Async streaming response:")
    print("-" * 50)

    # Use async for loop to stream each chunk from the model
    async for chunk in llm.astream(prompt):
        # Render chunks in real-time without newline buffering
        print(chunk.content, end="", flush=True)

    print("\n" + "-" * 50)

# Top-level await works in Jupyter/IPython; in a plain script, use asyncio.run(async_streaming_example())
await async_streaming_example()

Async streaming response:
--------------------------------------------------
Imagine your regular computer is like a light switch. It's either **ON (1)** or **OFF (0)**. This is called a **bit**.

Quantum computing is different because it uses something called a **qubit**. Think of a qubit not just as a light switch, but as a **spinning coin**.

Here's how that spinning coin (the qubit) gives quantum computers their power:

1.  **Superposition (The Spinning Coin):**
    *   A regular bit is either 0 or 1.
    *   A qubit, thanks to a quantum phenomenon called superposition, can be **0, 1, or both 0 and 1 at the same time** (like a coin spinning in the air, it's neither heads nor tails until it lands).
    *   This means a single qubit can hold much more information than a single bit. Two qubits can represent four states at once, three can represent eight, and so on. The number of possibilities grows exponentially!

2.  **Entanglement (Magically Linked Coins):**
    *   Imagine you have

- Asynchronous function setup: Defined using `async def`, making it compatible with event loops and async environments (e.g., web servers).
- Async streaming with `astream()`: Yields chunks in an async for loop, suitable for high-concurrency scenarios.
- Immediate display: Just like before, `flush=True` ensures each chunk is rendered without delay.
- Execution: `await` is used to execute the coroutine in an async-aware context (e.g., Jupyter, FastAPI, or other async runtimes).
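
The benefit of the async version shows up when several streams run at once. The sketch below starts two streams with `asyncio.gather` and records the order in which chunks arrive, so the interleaving is visible. `fake_astream` is a hypothetical async generator standing in for `llm.astream(prompt)`, with `asyncio.sleep` simulating network latency. The block uses `asyncio.run(...)` as it would appear in a plain script; inside Jupyter, where an event loop is already running, the equivalent is `await main()`.

```python
import asyncio


async def fake_astream(words, delay):
    """Hypothetical stand-in for an async model stream."""
    for word in words:
        await asyncio.sleep(delay)  # Simulated wait for the next chunk
        yield word


async def consume(name, words, delay, log):
    """Read one stream to the end, logging which stream produced each chunk."""
    collected = []
    async for chunk in fake_astream(words, delay):
        log.append(name)
        collected.append(chunk)
    return " ".join(collected)


async def main():
    log = []
    results = await asyncio.gather(
        consume("A", ["alpha", "beta", "gamma"], 0.01, log),
        consume("B", ["one", "two", "three"], 0.01, log),
    )
    return results, log


results, arrival_order = asyncio.run(main())
print(results)
print(arrival_order)
```

The arrival log mixes entries from both streams, showing that neither one had to finish before the other could make progress.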

### Working with system messages
In conversational AI, system messages act as instructions to the model about how it should behave — like setting the tone, expertise level, or personality. Unlike a user prompt that asks a question or gives a task, a system message frames the model’s role or style for the interaction.

This technique is foundational for tailoring the model’s output for specific domains — such as writing creatively, explaining technically, summarizing like a lawyer, or conversing like a teacher.

In [8]:
# Define the initial system message to influence the model’s behavior
messages = [
    SystemMessage(content="You are a creative writing assistant. Always write in a poetic, metaphorical style."),
    HumanMessage(content="Describe a thunderstorm.")
]

# Send the message sequence to the model
response = llm.invoke(messages)
print("Poetic description:")
print(response.content)

# Compare with a different system message
messages_technical = [
    SystemMessage(content="You are a meteorologist. Provide scientific, technical explanations."),
    HumanMessage(content="Describe a thunderstorm.")
]

# Invoke the model with a new persona
response_technical = llm.invoke(messages_technical)
print("\nTechnical description:")
print(response_technical.content)

Poetic description:
The air grew thick, a velvet shroud draped over the world, heavy with the scent of dust. The sky, a bruised canvas, began to deepen, bleeding from pale grey to an ominous, ink-stained violet. A hush fell, a profound silence where the very breath of the world seemed held captive, poised on the precipice of a great revelation.

Then, a sudden, blinding scar ripped across the heavens – a jagged, silver vein, momentarily etching a tree of light against the deepening gloom. A heartbeat later, the earth itself groaned, a deep, guttural roar that rattled the very bones of the world, a celestial drumbeat echoing the ancient heart of creation.

The sky opened its floodgates, not with a gentle sigh, but with a mighty exhalation. Drops, at first scattered pearls, became a liquid tapestry, a shimmering curtain drawn across the landscape. The wind, an unseen titan, began its furious dance, a howling lament through the bending boughs, whipping the rain into a frenzy of liquid whi

- `SystemMessage`: Sets the model's behavior and persona.
- `HumanMessage`: Represents user input.
- The `llm.invoke()` method takes in the full list of messages (system + human) and generates a response conditioned on that context.

System messages are processed before the conversation and influence all subsequent responses. Different system messages can dramatically change the model's output style and content.

### Temperature and generation parameters
In large language models like Gemini, generation parameters control how the model behaves during text generation. These parameters influence creativity, coherence, determinism, and diversity. Understanding and tuning these parameters allows us to optimize outputs for various use cases — whether we want concise technical summaries or imaginative storytelling.

We will start with an experiment that compares the same prompt under different temperature settings.

In [9]:
# Experimenting with different temperatures
prompts = ["Write a creative story about time travel."]  # List of prompts to test

# Try out different temperature settings
temperatures = [0.1, 0.5, 0.9]

# Loop over each temperature and observe differences in output
for temp in temperatures:
    llm_temp = ChatGoogleGenerativeAI(
        model="gemini-1.5-flash",
        temperature=temp  # Controls randomness
    )

    # Generate a response for the same prompt
    response = llm_temp.invoke(prompts[0])
    print(f"Temperature {temp}:")
    print(response.content[:200] + "...")
    print("-" * 50)

Temperature 0.1:
Elara wasn't your typical time traveler. She didn't pilfer paradoxes or chase historical figures. Elara collected whispers.  Her time-traveling device, a battered gramophone affectionately nicknamed "...
--------------------------------------------------
Temperature 0.5:
Elara wasn't your typical time traveler. She didn't pilfer paradoxes or chase historical figures. Elara collected whispers.  Her time machine wasn't a gleaming metal contraption, but a worn, leather-b...
--------------------------------------------------
Temperature 0.9:
Elara wasn't your typical time traveler. She didn't covet glittering empires or forgotten civilizations. Elara collected smells.  Her time machine, a battered, brass contraption she affectionately cal...
--------------------------------------------------


- We create a new model instance each time with a different `temperature` value. This controls randomness in text generation:
  - `0.0-0.3`: More deterministic, focused, consistent
  - `0.4-0.7`: Balanced creativity and coherence
  - `0.8-1.0`: More creative, diverse, potentially less coherent
- Lower temperatures are better for factual content, higher for creative writing.
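
To see why temperature has this effect, it helps to look at the arithmetic. Sampling typically divides the model's raw scores (logits) by the temperature before applying softmax, so a low temperature sharpens the distribution and a high one flattens it. The sketch below applies this to three made-up logits; the numbers are illustrative and not taken from Gemini.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after scaling by 1/temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [value / temperature for value in logits]
    # Subtract the maximum before exponentiating for numerical stability
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


logits = [2.0, 1.0, 0.5]  # Illustrative scores for three candidate tokens
for temperature in (0.1, 0.5, 0.9):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

At the lowest setting nearly all of the probability sits on the top-scoring token, which is why low temperatures produce more repeatable output.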

In [10]:
# Other generation parameters
llm_configured = ChatGoogleGenerativeAI(
    model="gemini-1.5-flash",
    temperature=0.7,
    max_tokens=500,  # Limit response length
    top_p=0.95,      # Nucleus sampling: top tokens with 95% cumulative probability
    top_k=40         # Top-k sampling: select from top 40 probable tokens
)

response = llm_configured.invoke("Explain the importance of biodiversity.")
print(f"Response length: {len(response.content)} characters")
print(response.content)

Response length: 2699 characters
Biodiversity, encompassing the variety of life at all levels from genes to ecosystems, is crucial for the health and well-being of our planet and humanity. Its importance can be understood through several key aspects:

**1. Ecosystem Services:** Biodiversity underpins the functioning of ecosystems, which provide essential services vital for human survival and well-being. These include:

* **Provisioning services:** Food (crops, livestock, fish, etc.), fresh water, raw materials (timber, fiber, etc.), and medicinal resources.  A diverse gene pool allows for crops more resilient to disease and pests, and provides a wider range of potential medicines.
* **Regulating services:** Climate regulation (carbon sequestration, oxygen production), water purification, pollination, disease control, and pest control.  A diverse ecosystem is more resilient to disturbances and better at regulating these vital functions.
* **Supporting services:** Nutrient cycling, soil 

- `max_tokens`: Caps the length of the generated text, measured in tokens rather than characters.
- `top_p=0.95`: Applies nucleus sampling, a method that filters possible next words based on probability mass. Instead of considering all possible tokens, the model chooses from the smallest group of high-probability options whose combined likelihood exceeds 95%. This keeps the output focused and coherent while still allowing for some flexibility.
- `top_k=40`: Applies top-k sampling, which restricts the model to choosing from the top 40 most probable next tokens at each generation step — regardless of their total probability. It enforces a hard cutoff, ensuring less likely (and potentially irrelevant) options are ignored.

While `top_p` filters based on cumulative probability, `top_k` uses a fixed count of candidates. Using them together gives us more nuanced control:
- `top_p` ensures responses stay within a probabilistically “safe zone.”
- `top_k` prevents rare, low-probability outliers from entering — even if the cumulative probability allows it.

These parameters provide fine-grained control over generation quality vs. diversity.
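
The two filters can be sketched on a toy distribution. The code below keeps the `top_k` most probable tokens, then keeps the smallest prefix of those whose cumulative probability reaches `top_p`, and finally renormalizes. The tokens and probabilities are invented for illustration, and the order of operations is an assumption of this sketch; a production sampler may combine the filters differently.

```python
def filter_top_k_top_p(probs, top_k, top_p):
    """Apply top-k, then top-p, to a {token: probability} dict and renormalize."""
    # Rank tokens from most to least probable
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    # Top-k: hard cutoff on the number of candidates
    candidates = ranked[:top_k]
    # Top-p: keep adding tokens until the cumulative mass reaches top_p
    kept, cumulative = [], 0.0
    for token, prob in candidates:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


# Invented next-token distribution (sums to 1.0)
toy_probs = {"cat": 0.5, "dog": 0.25, "fox": 0.125, "emu": 0.0625, "yak": 0.0625}

only_k = filter_top_k_top_p(toy_probs, top_k=2, top_p=1.0)
only_p = filter_top_k_top_p(toy_probs, top_k=5, top_p=0.8)
print(sorted(only_k))
print(sorted(only_p))
```

With `top_k=2` the cutoff is purely by count, while with `top_p=0.8` the number of surviving tokens depends on how the probability mass is spread across them.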

### Batch processing
In many real-world applications, we will need to process multiple prompts at once, such as generating summaries for a set of articles, answering many user questions in parallel, or analyzing large datasets. Instead of writing a loop that invokes the model once per input and waits for each response in turn (which is slow), we can use LangChain's batch methods to submit a list of prompts in one call and let the library run the requests concurrently.

There are two common ways to do this:
- Synchronous batch processing using `llm.batch()` — Ideal for backend scripts or low-latency pipelines.
- Asynchronous batch processing using `llm.abatch()` — Best for web apps and environments that require concurrent execution and high scalability.

#### Synchronous batch processing

In [11]:
# Batch processing multiple prompts
prompts = [
    "Summarize the benefits of renewable energy.",
    "Explain the water cycle in 3 steps.",
    "What are the primary colors?",
    "Define artificial intelligence.",
    "How does photosynthesis work?"
]

# Submit all prompts in one call; LangChain runs the requests concurrently
responses = llm.batch(prompts)

# Display each prompt and its corresponding model response
for i, response in enumerate(responses):
    print(f"Prompt {i+1}: {prompts[i]}")
    print(f"Response: {response.content}")
    print("-" * 50)

Prompt 1: Summarize the benefits of renewable energy.
Response: Renewable energy offers a wide array of significant benefits, making it a crucial component for a sustainable future:

1.  **Combating Climate Change:** Drastically reduces greenhouse gas (GHG) emissions, which are the primary drivers of global warming, by replacing fossil fuels.
2.  **Improved Public Health & Air Quality:** Eliminates air pollutants (like sulfur dioxide, nitrogen oxides, and particulate matter) associated with burning fossil fuels, leading to fewer respiratory and cardiovascular diseases.
3.  **Energy Independence & Security:** Reduces reliance on volatile and geopolitically sensitive fossil fuel imports, enhancing national energy security and stability.
4.  **Stable & Decreasing Energy Costs:** Once installed, the "fuel" (sunlight, wind, water) is free, leading to predictable and often lower long-term operating costs compared to fluctuating fossil fuel prices.
5.  **Job Creation & Economic Growth:** Driv

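- `llm.batch(prompts)`: Submits all the input prompts in one call and returns a list of response objects in the same order as the inputs. By default, LangChain runs the underlying `invoke()` calls concurrently in a thread pool, so each prompt is still a separate API request, but the total wait is usually much shorter than looping over `llm.invoke()` one prompt at a time.

Compared with sequential calls, batch processing:
  - Overlaps network round trips, so total processing time is shorter.
  - Keeps the calling code simple: one call in, one ordered list out.
  - Lets us cap parallelism (for example, `config={"max_concurrency": 2}`) to stay within API rate limits.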
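
The concurrency behind a batch call can be illustrated with the standard library alone. In the sketch below, `fake_invoke` is a hypothetical stand-in for a blocking model call (it just sleeps), and `ThreadPoolExecutor.map` runs several of them at once while returning results in input order. This mirrors the general idea of running independent I/O-bound calls in a thread pool; it is not LangChain's actual implementation.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_invoke(prompt: str) -> str:
    """Hypothetical stand-in for a blocking model call."""
    time.sleep(0.05)  # Simulated network and generation latency
    return f"answer to: {prompt}"


def run_batch(prompts, max_concurrency=5):
    """Run fake_invoke over all prompts concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        return list(pool.map(fake_invoke, prompts))


demo_prompts = [f"question {i}" for i in range(5)]

start = time.perf_counter()
batch_results = run_batch(demo_prompts)
elapsed = time.perf_counter() - start

print(batch_results[0])
print(f"5 simulated calls finished in {elapsed:.2f} s")
```

Because the simulated calls spend their time waiting rather than computing, running them in threads lets the waits overlap; the same reasoning applies to real API requests.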

#### Asynchronous batch processing


In [12]:
# Async batch processing for better performance
async def async_batch_example():
    prompts = [
        "What is machine learning?",
        "Explain blockchain technology.",
        "How do vaccines work?"
    ]

    # Send the prompts concurrently using the async abatch method
    responses = await llm.abatch(prompts)

    # Display each result
    for prompt, response in zip(prompts, responses):
        print(f"Q: {prompt}")
        print(f"A: {response.content[:100]}...")
        print("-" * 30)

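# Top-level await works in Jupyter/IPython; in a plain script, use asyncio.run(async_batch_example())
await async_batch_example()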

Q: What is machine learning?
A: Machine learning (ML) is a **subset of Artificial Intelligence (AI)** that enables systems to **lear...
------------------------------
Q: Explain blockchain technology.
A: Blockchain technology is a **decentralized, distributed, and immutable ledger system** that records ...
------------------------------
Q: How do vaccines work?
A: Vaccines work by **training your immune system to recognize and fight off specific diseases without ...
------------------------------


- `await llm.abatch(prompts)`: Runs the batch asynchronously, so the event loop is free to do other work while the requests are in flight.
- `async` and `await` let other coroutines run concurrently (e.g., handling multiple users in a web app).

Async batch processing:
  - Improves application responsiveness, because other tasks can run while waiting for responses.
  - Scales better under load, handling many requests without blocking.
  - Makes better use of time that would otherwise be spent waiting on network I/O.
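
When many prompts are submitted at once, it is often worth capping how many requests are in flight so the API's rate limits are respected. The sketch below shows one way to do that with `asyncio.Semaphore` and `asyncio.gather`. `fake_ainvoke` is a hypothetical stand-in for an async model call, and the limit of 2 is an arbitrary choice for the demonstration. As written it uses `asyncio.run(...)` for a plain script; in Jupyter the equivalent is `await bounded_batch(demo_prompts)`.

```python
import asyncio


async def fake_ainvoke(prompt: str) -> str:
    """Hypothetical stand-in for an async model call."""
    await asyncio.sleep(0.01)  # Simulated network latency
    return f"answer to: {prompt}"


async def bounded_batch(prompts, limit=2):
    """Run all prompts concurrently, with at most `limit` in flight at once."""
    semaphore = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def run_one(prompt):
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await fake_ainvoke(prompt)
            finally:
                in_flight -= 1

    # gather returns results in the same order as the inputs
    answers = await asyncio.gather(*(run_one(p) for p in prompts))
    return answers, peak


demo_prompts = [
    "What is machine learning?",
    "Explain blockchain technology.",
    "How do vaccines work?",
]
answers, peak_in_flight = asyncio.run(bounded_batch(demo_prompts))
print(answers)
print(f"Peak concurrent calls: {peak_in_flight}")
```

LangChain's batch methods also accept a `max_concurrency` setting in their `config` argument, which serves a similar purpose without a hand-written semaphore.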