##### Copyright 2025 Google LLC.

In [None]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

## Gemini API: Getting started with information grounding for Gemini models

<a target="_blank" href="https://colab.research.google.com/github/google-gemini/cookbook/blob/main/quickstarts/Grounding.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" height=30/></a>

In this notebook you will learn how to use information grounding with [Gemini models](https://ai.google.dev/gemini-api/docs/models/).

Information grounding is the process of connecting these models to specific, verifiable information sources to enhance the accuracy, relevance, and factual correctness of their responses. While LLMs are trained on vast amounts of data, this knowledge can be general, outdated, or lack specific context for particular tasks or domains. Grounding helps to bridge this gap by providing the LLM with access to curated, up-to-date information.

Here you will experiment with:
- Grounding responses with Google Search
- Adding a YouTube link to your prompt as a source of context
- Using URL context to include the content of a web page as context for your prompt

## Set up the SDK

This guide uses the [`google-genai`](https://pypi.org/project/google-genai) Python SDK to connect to the Gemini models.

### Install SDK

The **[Google Gen AI SDK](https://github.com/googleapis/python-genai)** provides programmatic access to Gemini models using both the [Google AI for Developers](https://ai.google.dev/gemini-api/docs/models/) and [Vertex AI](https://cloud.google.com/vertex-ai/generative-ai/docs/overview) APIs. With a few exceptions, code that runs on one platform will run on both. This means that you can prototype an application using the Developer API and then migrate the application to Vertex AI without rewriting your code.

You can find more details about this new SDK in the [documentation](https://googleapis.github.io/python-genai/) or in the [Getting started](./Get_started.ipynb) notebook.

In [None]:
%pip install -q -U "google-genai>=1.16.0"

### Set up your API key

To run the following cell, your API key must be stored in a Colab Secret named `GOOGLE_API_KEY`. If you don't already have an API key, or you're not sure how to create a Colab Secret, see the [Authentication](https://github.com/google-gemini/gemini-api-cookbook/blob/main/quickstarts/Authentication.ipynb) quickstart for an example.

In [2]:
from google.colab import userdata

GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')

### Select model and initialize SDK client

Select the model you want to use in this guide, either by picking one from the list or by typing its name. Keep in mind that some models, such as the 2.5 ones, are thinking models and thus take slightly more time to respond (see the [thinking notebook](./Get_started_thinking.ipynb) for more details, in particular how to switch thinking off).

In [3]:
from google import genai
from google.genai import types

client = genai.Client(api_key=GOOGLE_API_KEY)

MODEL_ID = "gemini-2.5-flash-preview-05-20" # @param ["gemini-2.5-flash-preview-05-20", "gemini-2.5-pro-preview-06-05", "gemini-2.0-flash", "gemini-2.0-flash-lite"] {"allow-input":true, isTemplate: true}
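
As a rough sketch of what switching thinking off can look like, the helper below builds a config dictionary for `generate_content`. It assumes the SDK accepts a `thinking_config` entry with a `thinking_budget` field, and that a budget of `0` disables thinking on models that allow it (not every model does). `make_thinking_config` is a hypothetical name, and the thinking notebook remains the reference for the exact behavior.

```python
def make_thinking_config(thinking_budget=None):
    """Return a config dict; keep the model's default thinking when no budget is given."""
    if thinking_budget is None:
        return {}
    # Assumption: the config accepts a `thinking_config` entry with `thinking_budget`.
    return {"thinking_config": {"thinking_budget": thinking_budget}}


print(make_thinking_config(0))
```

The returned dictionary could then be passed as `config=` to `client.models.generate_content`, merged with any other settings you need.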

## Use Google Search grounding

Google Search grounding is particularly useful for queries that require current information or external knowledge. Using Google Search, Gemini can access near real-time information and give better responses.

In [4]:
from IPython.display import HTML, Markdown

response = client.models.generate_content(
    model=MODEL_ID,
    contents='What was the latest Indian Premier League match and who won?',
    config={"tools": [{"google_search": {}}]},
)

# print the response
display(Markdown(f"Response:\n {response.text}"))
# print the search details
print(f"Search Query: {response.candidates[0].grounding_metadata.web_search_queries}")
# titles of the pages used for grounding
print(f"Search Pages: {', '.join([site.web.title for site in response.candidates[0].grounding_metadata.grounding_chunks])}")

display(HTML(response.candidates[0].grounding_metadata.search_entry_point.rendered_content))

Response:
 The latest Indian Premier League (IPL) match, which was the final of the 2024 season, took place on May 26, 2024.

In this match, the Kolkata Knight Riders (KKR) defeated the Sunrisers Hyderabad (SRH) by 8 wickets to win their third IPL title. The Sunrisers Hyderabad were bundled out for 113 runs, which was the lowest total in an IPL final.

Search Query: ['latest Indian Premier League match and winner', 'When did IPL 2024 end?']
Search Pages: wikipedia.org, jagranjosh.com, livemint.com, thehindu.com
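
The cell above assumes that `grounding_metadata` and `grounding_chunks` are always populated. When the model answers without searching they may be `None`, in which case the list comprehension fails. A minimal, defensive sketch follows. The helper name `list_grounding_sources` is hypothetical, and it relies only on the attribute names already used above plus `web.uri`, which is assumed to sit next to `web.title` on each chunk. The `fake_response` object is a stand-in for illustration, not real API output.

```python
from types import SimpleNamespace


def list_grounding_sources(response):
    """Return (title, uri) pairs for the web chunks of the first candidate.

    Returns an empty list when the response carries no grounding data.
    """
    candidates = getattr(response, "candidates", None) or []
    if not candidates:
        return []
    metadata = getattr(candidates[0], "grounding_metadata", None)
    chunks = getattr(metadata, "grounding_chunks", None) or []
    sources = []
    for chunk in chunks:
        web = getattr(chunk, "web", None)
        if web is None:
            continue  # skip chunks that are not web results
        sources.append((getattr(web, "title", None), getattr(web, "uri", None)))
    return sources


# Stand-in object that mimics the shape of a grounded response (illustration only).
fake_response = SimpleNamespace(
    candidates=[
        SimpleNamespace(
            grounding_metadata=SimpleNamespace(
                grounding_chunks=[
                    SimpleNamespace(
                        web=SimpleNamespace(title="wikipedia.org", uri="https://example.com/a")
                    )
                ]
            )
        )
    ]
)
print(list_grounding_sources(fake_response))
```

With a real response, `list_grounding_sources(response)` would replace the `Search Pages` line and simply return an empty list when no search took place.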


Running the same prompt without search grounding may give you outdated information, because the model can only rely on its training data:

In [5]:
from IPython.display import Markdown

response = client.models.generate_content(
    model=MODEL_ID,
    contents='What was the latest Indian Premier League match and who won?',
)

# print the response
display(Markdown(response.text))

The latest Indian Premier League match played was the **final of the 2024 season**, which took place on **May 26, 2024**.

**Match:** Sunrisers Hyderabad (SRH) vs. Kolkata Knight Riders (KKR)
**Winner:** **Kolkata Knight Riders (KKR)** won by 8 wickets.

## Grounding with YouTube links

You can directly include a public YouTube URL in your prompt. The Gemini models will then process the video content to perform tasks like summarization and answering questions about the content.

This capability leverages Gemini's multimodal understanding, allowing it to analyze and interpret video data alongside any text prompts provided.

You need to explicitly declare the video URL you want the model to process as part of the contents of the request. Here is a simple interaction where you ask the model to summarize a YouTube video:

In [6]:
yt_link = "https://www.youtube.com/watch?v=XV1kOFo1C8M"

response = client.models.generate_content(
    model=MODEL_ID,
    contents=types.Content(
        parts=[
            types.Part(text="Summarize this video."),
            types.Part(
                file_data=types.FileData(file_uri=yt_link)
            )
        ]
    )
)

Markdown(response.text)

This video introduces "Gemma Chess," a new application of Google DeepMind's Gemma language model to the game of chess. The speaker, Ju-yeong Ji, explains that unlike traditional chess engines (like AlphaZero) which are "super smart calculators" focused on finding the best moves, Gemma aims to add a "new dimension" by leveraging its ability to understand and generate human-like text.

Gemma's key applications in chess include:

1.  **Explaining Chess:** It can analyze complex games (like Kasparov vs. Deep Blue) and explain *why* specific moves are significant, detailing strategies, tactical opportunities, and potential dangers in plain language, rather than just technical move sequences or numbers.
2.  **Storytelling:** Gemma can transform chess game data into engaging narratives, describing the flow of matches, the players involved, and the overall dramatic arc, making the games more accessible and relatable.
3.  **Supporting Chess Learning:** It acts as a "super helpful study buddy," explaining chess concepts (e.g., Sicilian Defense, passed pawn) in natural language, adapting to the user's skill level (beginner, intermediate, advanced), and even offering explanations in different languages. This provides a personalized, 24/7 "chess coach."

In summary, Gemma combines the computational power of chess AI with advanced linguistic understanding, offering a more intuitive and human-centric way to learn, analyze, and experience chess.

But you can also use the link as the source of truth for your request. In this example, you will first ask how Gemma models can help with chess games, without providing the video:

In [7]:
response = client.models.generate_content(
    model=MODEL_ID,
    contents=types.Content(
        parts=[
            types.Part(text="How Gemma models can help on chess games?"),
        ]
    )
)

Markdown(response.text)

Gemma models, as Large Language Models (LLMs), don't play chess directly like dedicated chess engines (e.g., Stockfish, AlphaZero). They lack the internal game tree search, move validation, and evaluation functions that chess engines possess.

However, Gemma's strengths lie in its ability to understand, generate, and process human language, which can be immensely helpful in various aspects of chess learning, analysis, and communication. Think of Gemma as a powerful analytical and educational *companion* rather than an opponent or a direct player.

Here's how Gemma models can help in chess games:

1.  **Learning and Understanding Chess Concepts:**
    *   **Explaining Rules and Mechanics:** For beginners, Gemma can clearly explain how pieces move, special rules (castling, en passant), and basic objectives.
    *   **Defining Terminology:** Ask Gemma to explain terms like "zugzwang," "fork," "skewer," "pawn structure," "tempo," "initiative," "desperado," etc.
    *   **Illustrating Concepts with Examples:** Gemma can describe positions or sequences of moves that demonstrate specific tactics or strategic ideas.
    *   **Opening and Endgame Theory:** While not an exhaustive database, Gemma can summarize the main ideas, common lines, and strategic goals behind various openings (e.g., "Explain the ideas behind the Sicilian Defense" or "What are the basic principles of king and pawn vs. king endgames?").

2.  **Post-Game Analysis and Improvement:**
    *   **Interpreting Engine Analysis:** If you have an engine's output (like a FEN string with an evaluation, or a recommended move sequence), Gemma can help translate that into human-understandable explanations. For example, "Why does Stockfish recommend this move in this FEN?" or "Explain the tactical idea behind this engine line."
    *   **Summarizing Game Logs (PGNs):** You can feed a PGN (Portable Game Notation) to Gemma and ask it to summarize the key moments, critical mistakes, turning points, or strategic themes of the game.
    *   **Identifying Common Mistakes (with context):** If you describe your typical errors (e.g., "I often blunder pawns in the middlegame"), Gemma can offer general advice or common reasons why such mistakes occur, or suggest drills.
    *   **Explaining Tactical Puzzles:** If you describe a chess puzzle position (FEN or visual description), Gemma can explain the solution, the underlying tactical motif, and why other moves don't work.

3.  **Strategic Planning and Brainstorming:**
    *   **Developing Game Plans:** Given a specific opening or middlegame position, Gemma can brainstorm potential strategic plans for both sides, considering factors like pawn structures, piece activity, and king safety.
    *   **Opponent Analysis (with input):** If you provide information about an opponent's past games or playing style, Gemma could help summarize their tendencies or suggest counter-strategies (e.g., "My opponent often plays the Caro-Kann; what are some aggressive lines to consider against it?").
    *   **Generating Ideas:** For a complex position, Gemma can suggest different strategic approaches, even if it can't calculate the precise best move.

4.  **Content Creation and Communication:**
    *   **Generating Commentary:** Gemma can help write descriptive commentary for a chess game, explaining moves, player intentions, and the flow of the match.
    *   **Creating Study Material:** It can help generate questions, explanations, or summaries for chess lessons or study guides.
    *   **Translating Chess Content:** If you have chess articles, videos, or commentary in a foreign language, Gemma can assist with translation.

**Important Limitations to Keep in Mind:**

*   **Gemma is NOT a Chess Engine:** It cannot play chess, calculate moves, validate moves, or perform the deep search required for optimal play. Its "understanding" of chess comes from text data, not a game engine's algorithms.
*   **No Real-time Assistance (Ethical/Practical):** Using Gemma during a live game for move suggestions would be cheating. Its utility is primarily for learning, analysis, and preparation *outside* of live play.
*   **Relies on Input:** Gemma needs clear and accurate input (FEN, PGN, detailed descriptions) to provide useful chess-related responses. It cannot "see" a chessboard.
*   **Hallucinations/Inaccuracies:** Like any LLM, Gemma can sometimes generate plausible but incorrect information. Always cross-reference critical chess advice with dedicated chess engines or trusted human experts.
*   **Knowledge Cutoff:** Gemma's training data has a cutoff, meaning it might not be aware of the absolute latest developments in opening theory or recent games.

In summary, Gemma models are excellent linguistic tools that can demystify complex chess concepts, aid in post-game analysis by interpreting engine output, and assist in strategic planning and content creation. They serve as an intelligent assistant for learning and understanding the game, rather than a player or a direct tactical calculator.

Then you can ask the same question, this time giving the model the YouTube video as context:

In [8]:
yt_link = "https://www.youtube.com/watch?v=XV1kOFo1C8M"

response = client.models.generate_content(
    model=MODEL_ID,
    contents=types.Content(
        parts=[
            types.Part(text="How Gemma models can help on chess games?"),
            types.Part(
                file_data=types.FileData(file_uri=yt_link)
            )
        ]
    )
)

Markdown(response.text)

Gemma models, as large language models (LLMs), can significantly enhance the chess experience by leveraging their natural language understanding and generation capabilities, rather than directly calculating the best moves like traditional chess engines.

Here's how Gemma can help in chess games, based on the video:

1.  **Enhanced Analysis and Explanation (The Explainer):**
    *   **Translating Complexity:** Traditional chess engines often provide numerical evaluations and complex move sequences (like "Nf3 d5 2.g3 Bg4"). Gemma can take this technical output and translate it into plain, human-understandable text.
    *   **Explaining "Why":** Instead of just showing a move, Gemma can explain the *strategic ideas* and *tactical reasons* behind a move. For example, it can explain why a pawn sacrifice is interesting due to disrupting the opponent's plans or its psychological impact.
    *   **Summarizing Key Moments:** For long or complicated games, Gemma can pick out the most important tactical and strategic moments, helping players quickly grasp the critical turning points.

2.  **Interactive Learning and Coaching (Supporting Chess Learning):**
    *   **Personalized Explanations:** Gemma can act as a "study buddy" or "personal chess coach." You can ask it to explain chess concepts (like the "Sicilian Defense" or a "passed pawn") and it can tailor the explanation to your skill level (beginner, intermediate, advanced).
    *   **Multilingual Support:** As demonstrated in the video, Gemma can understand and explain concepts in various languages (e.g., Korean), making chess learning more accessible globally.
    *   **Targeted Feedback:** It can provide feedback on your understanding of chess ideas and even suggest areas where you might want to improve.

3.  **Storytelling and Narrative Generation (Storytellers):**
    *   **Bringing Games to Life:** Gemma can take the raw data of a chess game (like PGN notation, including player names and tournament info) and weave it into a compelling narrative or short story.
    *   **Adding Human Context:** It can describe the atmosphere of a match, infer hypothetical thoughts or emotions of the players, and highlight dramatic turns, making the game more engaging and relatable than just a sequence of moves. This is like getting a "cool backstory" for a puzzle, making it more interesting.

4.  **Combining Strengths (Hybrid Approach):**
    *   **Complementing Engines:** By integrating with traditional chess engines (which excel at calculation), Gemma can take the optimal moves identified by the engine and then *explain* the reasoning behind them in natural language. This blends the raw computational power of chess AI with the human-like understanding and communication of an LLM, offering a more intuitive and insightful analysis experience.

In essence, Gemma models don't play chess themselves, but they act as an intelligent interpreter and communicator, making chess analysis, learning, and enjoyment more accessible and profound for human players.

The answer is now more specific to the topic, because it uses the knowledge shared in the video, which is not necessarily part of the model's own knowledge.
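
The two requests above differ only in whether the video part is present. A small helper can make that comparison explicit. This is a sketch, not part of the SDK: `build_contents` is a hypothetical name, and it assumes that `generate_content` accepts a plain dictionary with the same `parts`, `text`, `file_data`, and `file_uri` fields as the typed objects used above.

```python
def build_contents(question, video_url=None):
    """Build a request body with a text part and an optional video part."""
    parts = [{"text": question}]
    if video_url:
        # Mirrors types.Part(file_data=types.FileData(file_uri=...)) from the cells above.
        parts.append({"file_data": {"file_uri": video_url}})
    return {"parts": parts}


question = "How Gemma models can help on chess games?"
without_video = build_contents(question)
with_video = build_contents(question, "https://www.youtube.com/watch?v=XV1kOFo1C8M")

# The only difference between the two requests is the extra video part.
print(len(without_video["parts"]), len(with_video["parts"]))
```

Either result could then be passed as `contents=` to `client.models.generate_content`, which keeps the prompt identical between the grounded and ungrounded runs.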

## Grounding information using URL context

The URL Context tool lets Gemini models directly access and process content from specific web page URLs that you provide in your API requests. This allows your applications to work with live web information without you having to manually pre-process that content and feed it to the model.

URL Context is effective because it allows the model to base its responses and analysis directly on the content of the designated web pages. Instead of relying solely on its general training data or on broad web searches (which are also valuable grounding tools), the model anchors its understanding to the specific information present at those URLs.

In [9]:
prompt = """
based on https://ai.google.dev/gemini-api/docs/models, what are the key
differences between Gemini 1.5, Gemini 2.0 and Gemini 2.5 models?
Create a markdown table comparing the differences.
"""

# Enable the URL context tool (pass an instance, not the class itself)
tools = [types.Tool(url_context=types.UrlContext())]

config = types.GenerateContentConfig(
    tools=tools,
)

response = client.models.generate_content(
    contents=[prompt],
    model=MODEL_ID,
    config=config,
)

display(Markdown(response.text))

Here's a comparison of key differences between Gemini 1.5, Gemini 2.0, and Gemini 2.5 models based on the provided documentation:

| Feature                 | Gemini 1.5 Pro                                                                                                                                                                                                                                               | Gemini 1.5 Flash                                                                      | Gemini 2.0 Flash                                                                                              | Gemini 2.5 Pro (Preview)                                                                                                                                                                                                                                                                                                                             | Gemini 2.5 Flash (Preview)                                                                                                                                                                                                                                                                                                                             |
| :---------------------- | :----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :------------------------------------------------------------------------------------ | :------------------------------------------------------------------------------------------------------------ | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Optimization/Purpose** | Mid-size multimodal model optimized for a wide range of reasoning tasks, capable of processing large amounts of data (e.g., 2 hours of video, 19 hours of audio, 60,000 lines of code, or 2,000 pages of text).                                         | Fast and versatile performance across a diverse variety of tasks.                 | Newest multimodal model with next-generation features, improved capabilities, low latency, enhanced performance, built to power agentic experiences. | Our most powerful thinking model with maximum response accuracy and state-of-the-art performance, best for complex coding, reasoning, and multimodal understanding, and analyzing large databases.                                                                                                                                                        | Our best model in terms of price-performance, offering well-rounded capabilities, best for low latency, high volume tasks that require thinking.                                                                                                                                                                                                 |
| **Input Modalities**    | Audio, images, video, and text.                                                                                                                                                                                                                          | Audio, images, video, and text.                                                   | Audio, images, video, and text.                                                   | Audio, images, video, and text.                                                                                                                                                                                                                                                                                                                  | Audio, images, video, and text.                                                                                                                                                                                                                                                                                                                  |
| **Output Modalities**   | Text.                                                                                                                                                                                                                                                    | Text.                                                                             | Text.                                                                             | Text.                                                                                                                                                                                                                                                                                                                                            | Text.                                                                                                                                                                                                                                                                                                                                            |
| **Input Token Limit**   | 2,097,152.                                                                                                                                                                                                                                               | 1,048,576.                                                                        | 1,048,576.                                                                        | 1,048,576.                                                                                                                                                                                                                                                                                                                                       | 1,048,576.                                                                                                                                                                                                                                                                                                                                       |
| **Output Token Limit**  | 8,192.                                                                                                                                                                                                                                                   | 8,192.                                                                            | 8,192.                                                                            | 65,536.                                                                                                                                                                                                                                                                                                                                          | 65,536.                                                                                                                                                                                                                                                                                                                                          |
| **Key Capabilities**    | Handles complex reasoning tasks, large datasets, and long context.                                                                                                                                                                                       | Fast and versatile performance.                                                   | Next-gen features, speed, thinking, real-time streaming, built for agentic experiences. | Enhanced thinking and reasoning, multimodal understanding, advanced coding.                                                                                                                                                                                                                                                                          | Adaptive thinking, cost efficiency, with well-rounded capabilities.                                                                                                                                                                                                                                                                                    |
| **Latest Update**       | September 2024.                                                                                                                                                                                                                                          | September 2024.                                                                   | February 2025.                                                                    | May 2025.                                                                                                                                                                                                                                                                                                                                        | May 2025.                                                                                                                                                                                                                                                                                                                                        |

**Summary of Key Differences:**

*   **Generational Advancements**: Gemini 2.0 and 2.5 represent newer generations with "next-generation features" and "improved capabilities" compared to 1.5.
*   **Performance and Purpose**:
    *   **Pro models (1.5 Pro, 2.5 Pro)** are designed for complex reasoning, multimodal understanding, and handling large amounts of data. Gemini 2.5 Pro is presented as the most powerful thinking model with maximum accuracy.
    *   **Flash models (1.5 Flash, 2.0 Flash, 2.5 Flash)** prioritize speed, cost efficiency, and versatility. Gemini 2.5 Flash offers the best price-performance ratio for low-latency, high-volume tasks.
*   **Output Token Limit**: A significant difference is the output token limit for Gemini 2.5 models (65,536 tokens) compared to Gemini 1.5 and 2.0 models (8,192 tokens).
*   **Long Context Window**: Gemini 1.5 Pro stands out with a larger input token limit (2,097,152) than other models listed (1,048,576), indicating its superior capability for processing extensive inputs.
*   **Experimental/Preview Status**: Gemini 2.5 models are currently in "Preview" status, meaning they may have more restrictive rate limits compared to stable models.
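
To check which pages the model actually retrieved, the response can be inspected for URL context metadata. The sketch below is defensive and rests on assumptions: it assumes the first candidate exposes `url_context_metadata` with a `url_metadata` list whose entries carry `retrieved_url` and `url_retrieval_status`, so verify these names against the SDK version you have installed. `list_retrieved_urls` is a hypothetical helper, and `fake_response` is a stand-in with illustrative values.

```python
from types import SimpleNamespace


def list_retrieved_urls(response):
    """Return (url, status) pairs for pages fetched through URL context, if any."""
    candidates = getattr(response, "candidates", None) or []
    if not candidates:
        return []
    metadata = getattr(candidates[0], "url_context_metadata", None)
    entries = getattr(metadata, "url_metadata", None) or []
    return [
        (getattr(entry, "retrieved_url", None), getattr(entry, "url_retrieval_status", None))
        for entry in entries
    ]


# Stand-in object with illustrative values (not real API output).
fake_response = SimpleNamespace(
    candidates=[
        SimpleNamespace(
            url_context_metadata=SimpleNamespace(
                url_metadata=[
                    SimpleNamespace(
                        retrieved_url="https://ai.google.dev/gemini-api/docs/models",
                        url_retrieval_status="URL_RETRIEVAL_STATUS_SUCCESS",
                    )
                ]
            )
        )
    ]
)
for url, status in list_retrieved_urls(fake_response):
    print(f"{status}: {url}")
```

An empty result would suggest that the model answered from its own knowledge rather than from the page named in the prompt.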

For comparison, here is the answer to the same question without the URL context tool, so the model cannot use the official models documentation as a reference:

In [10]:
prompt = """
what are the key differences between Gemini 1.5, Gemini 2.0 and Gemini 2.5
models? Create a markdown table comparing the differences.
"""

response = client.models.generate_content(
    contents=[prompt],
    model=MODEL_ID,
)

Markdown(response.text)

It's important to clarify that as of my last update, **Gemini 1.5 Pro** is the current major iteration of Google's flagship multimodal model, and it represents a significant leap from the initial **Gemini 1.0** series (which included Ultra, Pro, and Nano).

There hasn't been a publicly announced, distinct model called "Gemini 2.0" or "Gemini 2.5" in the same way that 1.0 and 1.5 Pro were launched. It's possible you might be thinking of:
*   The **Gemini 1.0 series** as the initial release.
*   **Gemini 1.5 Pro** as the "next generation" (which includes the revolutionary long context window).
*   **Future, unannounced versions** that would logically follow 1.5, which might eventually be named 2.0 or 2.x. Google often refers to underlying architectural shifts or future capabilities internally, but these don't always translate to immediate, distinct public model names.

Therefore, the key comparison is primarily between the **Gemini 1.0 family** and **Gemini 1.5 Pro**. Any mention of "2.0" or "2.5" would refer to speculative future models or a misinterpretation of current naming conventions.

Here's a table comparing Gemini 1.0 and Gemini 1.5 Pro, and addressing the "2.0" and "2.5" points:

---

## Comparison of Gemini Models

| Feature                 | Gemini 1.0 (e.g., Ultra, Pro, Nano)             | Gemini 1.5 Pro                                            | Gemini 2.0 / 2.5 (Future/Speculative)             |
| :---------------------- | :---------------------------------------------- | :-------------------------------------------------------- | :------------------------------------------------- |
| **Launch/Announcement** | December 2023                                   | February 2024 (private preview), April 2024 (public preview) | Not publicly announced as distinct models yet.     |
| **Core Architecture**   | Traditional Transformer-based                  | **Mixture-of-Experts (MoE)** architecture; highly efficient | Likely further advancements in MoE or novel architectures |
| **Context Window Size** | Up to **32K tokens**                            | Standard: **1 Million tokens**; Experimental: **2 Million tokens** | Expected to be even larger or more efficient in processing. |
| **Multimodality**       | Native understanding of text, images, audio, video. | Enhanced native understanding across modalities, especially for long-form video/audio. | Deeper, more integrated multimodal reasoning; potentially new modalities. |
| **Performance**         | State-of-the-art at launch; strong general reasoning. | Surpasses Gemini 1.0 Ultra on many benchmarks, especially in long-context tasks (e.g., summarization, code analysis). | Expected to significantly outperform 1.5 Pro across all metrics. |
| **Key Innovation**      | First broadly available truly multimodal model from Google. | **Revolutionary long context window**; MoE efficiency for performance and cost. | Unknown, but likely breakthrough in reasoning, agency, or real-world interaction. |
| **Use Cases**           | General chatbot, content generation, coding, image analysis. | Advanced long-document analysis, video summarization, large codebase understanding, complex problem-solving. | Future applications requiring even greater autonomy, complex reasoning over vast data, and real-time interaction. |
| **Current Status**      | Generally Available (API, Gemini Advanced/Bard) | Public Preview / General Availability (API, select products) | Not a defined public model; refers to future generations of Gemini models. |

---

**In summary:**

*   **Gemini 1.0** was the groundbreaking initial release of Google's multimodal model family.
*   **Gemini 1.5 Pro** is the current flagship, distinguished by its **massive context window** (1M+ tokens) and efficient **Mixture-of-Experts (MoE)** architecture, making it exceptionally powerful for complex, long-form tasks.
*   "**Gemini 2.0**" and "**Gemini 2.5**" are not currently distinct, publicly launched models. If and when new major versions are released after 1.5 Pro, they would likely feature even more advanced capabilities, but their naming and specific features are yet to be announced by Google.

As you can see, when it relies only on its own knowledge, the model does not know about the newer Gemini 2.0 and 2.5 model families.

## Next steps

<a name="next_steps"></a>

* For more details about using Google Search grounding, check out the [Search Grounding cookbook](./Search_Grounding.ipynb).
* If you are looking for other scenarios using videos, take a look at the [Video understanding cookbook](./Video_understanding.ipynb).

Also check out the other Gemini capabilities in the [Gemini quickstarts](https://github.com/google-gemini/cookbook/tree/main/quickstarts/).