##### Copyright 2025 Google LLC.

In [None]:
# @title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Using Gemini API with Qdrant vector search for hybrid retrieval in legal AI


<a target="_blank" href="https://colab.research.google.com/github/google-gemini/cookbook/blob/main/examples/qdrant/Hybrid_Search_Legal.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" height=30/></a>

<!-- Community Contributor Badge -->
<table>
  <tr>
    <!-- Author Avatar Cell -->
    <td bgcolor="#d7e6ff">
      <a href="https://github.com/mrscoopers" target="_blank" title="View Jenny's profile on GitHub">
        <img src="https://github.com/mrscoopers.png?size=100"
             alt="Jenny's GitHub avatar"
             width="100"
             height="100">
      </a>
    </td>
    <!-- Text Content Cell -->
    <td bgcolor="#d7e6ff">
      <h2><font color='black'>This notebook was contributed by <a href="https://github.com/mrscoopers" target="_blank"><font color='#217bfe'><strong>Jenny</strong></font></a>.</font></h2>
      <h5><font color='black'><a href="https://www.linkedin.com/in/evgeniya-sukhodolskaya/"><font color="#078efb">Jenny's LinkedIn</font></a></font></h5><br>
      <!-- Footer -->
      <font color='black'><small><em>Have a cool Gemini example? Feel free to <a href="https://github.com/google-gemini/cookbook/blob/main/CONTRIBUTING.md" target="_blank"><font color="#078efb">share it too</font></a>!</em></small></font>
    </td>
  </tr>
</table>

<!-- Pricing warning Badge -->
<table>
  <tr>
    <!-- Emoji -->
    <td bgcolor="#f5949e">
      <font size=30>⚠️</font>
    </td>
    <!-- Text Content Cell -->
    <td bgcolor="#f5949e">
      <h3><font color=black>This notebook requires paid tier rate limits to run properly.<br>  
(cf. <a href="https://ai.google.dev/pricing#veo2"><font color='#217bfe'>pricing</font></a> for more details).</font></h3>
    </td>
  </tr>
</table>

## Overview

![Astronaut judge](https://storage.googleapis.com/qdrant-examples/astronaut_judge_2.png)

In the legal domain, **accuracy** and **factual correctness** are critical.

A Legal AI startup that collaborated with [Qdrant](https://qdrant.tech/) has outlined **an approach to securing both in Legal AI applications** (for example, Retrieval Augmented Generation (RAG)-based or agentic):

> *“Turn everything into a retrieval problem where you're retrieving ground truth. If you frame it that way, you don't have to worry about hallucinations, as everything given to the user is grounded in some part of a valid document.”*

Many Legal AI businesses therefore depend on **high-quality retrieval** in their applications. To get there, you need:
- Knowledge of the tools and techniques that increase search relevance;
- A well-suited embedding model;
- A willingness to experiment! :)

### This notebook

In this notebook, you’ll learn how to combine `gemini-embedding-001` with the tools provided by the Qdrant vector search engine to build a **legal QA retrieval pipeline**.

You'll learn how to:
- Set up a hybrid search (dense + keyword) in Qdrant;
- Use [Matryoshka Representations](https://huggingface.co/blog/matryoshka) of Gemini embeddings to trade off quality vs. cost.

## Setup

### Install SDK

- `google-genai` for `gemini-embedding-001` embeddings;
- `qdrant-client[fastembed]`, Qdrant's Python client;
- Hugging Face `datasets` to load open-source legal Q&A datasets.

In [1]:
%pip install -q -U "google-genai>=1.0.0" "qdrant-client[fastembed]" datasets

### Set up your API keys:

- `GOOGLE_API_KEY`, required for using `gemini-embedding-001` embeddings  
  (look up how to generate it [here](https://ai.google.dev/gemini-api/docs/api-key))

- `QDRANT_API_KEY` and `QDRANT_URL` from a **free-forever** Qdrant Cloud cluster  
(you'll be guided on how to get both in the [Qdrant Cloud UI](https://cloud.qdrant.io/))

To run the following cell, store your API keys in Colab's Secrets tab.

In [2]:
from google.colab import userdata

GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')
QDRANT_API_KEY = userdata.get('QDRANT_API_KEY')
QDRANT_URL = userdata.get('QDRANT_URL')
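
Outside Colab, `google.colab` isn't available. As a minimal sketch (the `get_secret` helper is hypothetical and not part of the original notebook), you could fall back to environment variables with the same names:

```python
import os
from typing import Optional


def get_secret(name: str) -> Optional[str]:
    """Return a secret from Colab's Secrets tab, or from an environment variable."""
    try:
        # `google.colab` is only importable inside a Colab runtime.
        from google.colab import userdata
    except ImportError:
        return os.environ.get(name)
    try:
        return userdata.get(name)
    except Exception:
        # Assumption: a missing or inaccessible Colab secret raises an error,
        # in which case the environment variable is used instead.
        return os.environ.get(name)
```

For example, `GOOGLE_API_KEY = get_secret("GOOGLE_API_KEY")` would then work both in Colab and in a local Jupyter session where the variable has been exported.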

## Step 1: Download the Dataset

You'll use one of the [Hugging Face datasets from Isaacus](https://huggingface.co/isaacus), a legal artificial intelligence research company.

A common use case in legal AI is a Retrieval-Augmented Generation (RAG) chatbot. To evaluate retrieval performance for such applications, you need a Question-Answer (QA) dataset.

### Choosing a Dataset

- [Open Australian Legal QA](https://huggingface.co/datasets/isaacus/open-australian-legal-qa) looks interesting. However, all of its LLM-generated questions mention the exact name of the legal case, which also appears in the answer. The dataset also maps each question to exactly one answer (1:1), so building a perfect retriever becomes trivial, which is far from a real-life scenario :)

- Instead, let's consider [LegalQAEval](https://huggingface.co/datasets/isaacus/LegalQAEval). It looks more like the kind of questions a user might ask a RAG-based legal chatbot. For example:  
  * "*How are pharmacists regulated in most jurisdictions?*"
  * "*what is ncts*"

#### LegalQAEval

This dataset contains ~2400 QA pairs and includes:

- `id`: a unique string identifier;
- `question`: a natural language question;
- `text`: a chunk of text that *may* contain the answer;
- `answers`: a list of answers (and their positions within the text), or `null` if the `text` does not have the answer.

Load the legal QA corpus; you'll combine its `val` and `test` splits.

In [3]:
from datasets import load_dataset, concatenate_datasets

corpus = concatenate_datasets(load_dataset('isaacus/LegalQAEval', split=['val', 'test']))

### Text chunk deduplication

Since several questions in the dataset can refer to the same `text` chunk, first deduplicate the `text` field so that identical chunks aren't stored several times.

In [4]:
import pandas as pd
import datasets

# Convert the Hugging Face dataset to a pandas DataFrame
df = corpus.to_pandas()

# Group by 'text' and aggregate 'id' into a list
grouped_corpus = df.groupby('text')['id'].apply(list).reset_index().rename(columns={'id': 'ids'})

corpus_deduplicated = datasets.Dataset.from_pandas(grouped_corpus)

## Step 2: Define the use case configuration

In a typical legal chatbot scenario, users ask a question, and an LLM generates an answer based on a relevant text chunk.

To imitate this, you'll store numerical representations (embeddings) of the `text` chunks in Qdrant.  
During retrieval, a `question` is converted into a vector in the same embedding space, and the (approximately) nearest `text` chunk is then found in the vector index.
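
To make "nearest chunk" concrete, here is a toy sketch of exact nearest-neighbor search by cosine similarity in pure Python. The vectors are made-up 3-dimensional values, and the helper names are illustrative only. A vector index such as the one in Qdrant typically approximates this search rather than comparing the query against every vector.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_chunk(question_vec, chunk_vecs):
    """Return (index, similarity) of the chunk vector closest to the question."""
    scores = [cosine_similarity(question_vec, v) for v in chunk_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Toy 3-dimensional "embeddings" (made-up values, for illustration only).
chunk_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.6, 0.8, 0.0]]
question_vec = [0.5, 0.9, 0.0]

best_index, best_score = nearest_chunk(question_vec, chunk_vecs)
print(best_index, round(best_score, 3))
```

In this toy example the third chunk (index 2) points in almost the same direction as the question, so it is the one returned.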

> The Gemini embedding model [supports](https://ai.google.dev/gemini-api/docs/embeddings#supported-task-types) RAG-style Q&A retrieval (task type `QUESTION_ANSWERING`).

Now, to fully define our storage configuration, let's consider several factors relevant to a common RAG use case in the legal AI domain.

### Cost versus accuracy: matryoshka representations

Gemini `gemini-embedding-001` embeddings are 3072-dimensional.  
In a RAG setup with ~1 million chunks, storing such embeddings in RAM (for fast retrieval) as 32-bit floats would require about **12 GB** (1,000,000 × 3072 × 4 bytes).

The Gemini embedding model offers a way to balance retrieval accuracy and cost. It is trained using [Matryoshka Representation Learning (MRL)](https://ai.google.dev/gemini-api/docs/embeddings#control-embedding-size), meaning that the most important information about the encoded text is stored in the first dimensions of the embedding.

So, you can, for example:
- Use only the first 768 dimensions of the Gemini embedding for **faster retrieval**;
- And then **rerank** the retrieved results using the full 3072-dimensional embeddings for higher precision.
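
The storage side of this trade-off can be sketched with simple arithmetic. The snippet below assumes 4 bytes (float32) per dimension and counts raw vector data only; any index overhead is ignored, and the function name is illustrative.

```python
def raw_vector_storage_gb(num_vectors: int, dim: int, bytes_per_value: int = 4) -> float:
    """Raw vector storage in GB (10**9 bytes), excluding any index overhead."""
    return num_vectors * dim * bytes_per_value / 1e9


full_gb = raw_vector_storage_gb(1_000_000, 3072)
truncated_gb = raw_vector_storage_gb(1_000_000, 768)
print(f"3072 dims: {full_gb:.2f} GB, 768 dims: {truncated_gb:.2f} GB")
# prints "3072 dims: 12.29 GB, 768 dims: 3.07 GB"
```

Under these assumptions, keeping only the 768-dimensional vectors in RAM needs roughly a quarter of the memory, while the full vectors can stay on disk for reranking.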

### Accuracy from the best of both worlds: hybrid search

In legal use cases, it is often beneficial to combine the strengths of:  
- **Keyword-based search (lexical)** for more direct control over matches;  
- **Embedding-based search (semantic)** for handling questions phrased in a conversational way.

> **In Qdrant, both approaches can be combined in [hybrid & multi-stage queries](https://qdrant.tech/documentation/concepts/hybrid-queries/)**.  

For the keyword-based part, Qdrant supports multiple options, from traditional BM25 to sparse neural retrievers like [SPLADE](https://qdrant.tech/documentation/fastembed/fastembed-splade/). Among the options, there's [**our custom improvement of BM25 called miniCOIL**](https://qdrant.tech/documentation/fastembed/fastembed-minicoil/), which you will use in this notebook.
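
For reference, the classic BM25 score of a document $D$ for a query $Q$ with terms $q_1, \dots, q_n$ is usually written as:

$$
\text{BM25}(D, Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)}
$$

Here $f(q_i, D)$ is the frequency of term $q_i$ in $D$, $|D|$ is the document length in words, $\text{avgdl}$ is the average document length in the corpus, and $k_1$ and $b$ are tuning parameters. This is the textbook formula, shown only to explain where the `avg_len`, `k`, and `b` options used later in this notebook come from; how exactly miniCOIL modifies it is described in the Qdrant documentation and is not reproduced here.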

> In Qdrant, keyword-based retrieval is achieved using [sparse vectors](https://qdrant.tech/documentation/concepts/vectors/#sparse-vectors).


### Collection configuration
Configure a Qdrant collection for the legal QA retrieval pipeline.

In [12]:
from qdrant_client import QdrantClient, models

qdrant_client = QdrantClient(  # Initializing Qdrant client.
    url=QDRANT_URL,
    api_key=QDRANT_API_KEY,
)

COLLECTION_NAME = "legal_AI_QA"
GEMINI_EMBEDDING_RETRIEVAL_SIZE = 768
GEMINI_EMBEDDING_FULL_SIZE = 3072

if not qdrant_client.collection_exists(collection_name=COLLECTION_NAME):
    qdrant_client.create_collection(
        collection_name=COLLECTION_NAME,
        vectors_config={
            "gemini_embedding_retrieve": models.VectorParams(
                size=GEMINI_EMBEDDING_RETRIEVAL_SIZE,  # Smaller embeddings for faster retrieval.
                distance=models.Distance.COSINE,
            ),
            "gemini_embedding_rerank": models.VectorParams(
                size=GEMINI_EMBEDDING_FULL_SIZE,  # Full-sized embeddings for precision-boosting reranking.
                distance=models.Distance.COSINE,
                hnsw_config=models.HnswConfigDiff(
                    m=0  # Since these embeddings aren't used for retrieval, you don't need to spend resources on building a vector index.
                ),
                on_disk=True,  # To save on RAM used for retrieval.
            ),
        },
        sparse_vectors_config={
            "miniCOIL": models.SparseVectorParams(
                modifier=models.Modifier.IDF  # Inverse Document Frequency statistic, computed on the Qdrant side.
            )
        },
    )

## Step 3: Embed texts & index data in Qdrant

To speed up the process of converting the data, you'll:

1. Embed all `text` chunks with Gemini in batches, using the `get_embeddings_batch` function.
2. Upload the results to Qdrant in batches.

The Qdrant Python client provides the functions `upload_collection` and `upload_points`, which handle batching, retries, and parallelization. They accept generators as input, so you'll create a generator function `qdrant_points_stream` for this purpose.

> **Note:** Qdrant automatically normalizes uploaded embeddings if the collection's distance function is set to `COSINE` (cosine similarity). This means you don’t need to normalize the truncated Gemini Matryoshka embeddings yourself, a step that the [Gemini documentation recommends](https://ai.google.dev/gemini-api/docs/embeddings#quality-for-smaller-dimensions) for smaller dimensions.
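
If you use truncated embeddings somewhere that does not normalize them for you, you would do it yourself. A minimal pure Python sketch, with an illustrative helper name and a toy 4-dimensional vector standing in for a real embedding:

```python
import math


def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` values and rescale the result to unit L2 norm."""
    truncated = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in truncated))
    if norm == 0.0:
        return list(truncated)  # Avoid division by zero for an all-zero vector.
    return [x / norm for x in truncated]


# Toy vector (made-up values); a real Gemini embedding has 3072 dimensions.
toy_embedding = [3.0, 4.0, 12.0, 84.0]
unit_vec = truncate_and_normalize(toy_embedding, 2)
print(unit_vec)
```

Here the first two values `[3.0, 4.0]` have length 5, so the normalized result has unit length.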


In [None]:
GEMINI_MODEL_ID = "gemini-embedding-001" # @param ["gemini-embedding-001"] {"allow-input":true, isTemplate: true}

In [6]:
from google import genai
from google.genai import types
from google.api_core import retry
import uuid

google_client = genai.Client(api_key=GOOGLE_API_KEY)

@retry.Retry(timeout=300)
def get_embeddings_batch(texts, task_type: str = "RETRIEVAL_DOCUMENT"):
    """Generates embeddings for a batch of texts.

    Args:
        texts: A list of strings to embed.
        task_type: The task type for the embedding model.

    Returns:
        A list of embedding vectors.

    Raises:
        Exception: If an error occurs during embedding generation.
    """
    try:
        res = google_client.models.embed_content(
            model=GEMINI_MODEL_ID,
            contents=texts,
            config=types.EmbedContentConfig(task_type=task_type),
        )
        return [e.values for e in res.embeddings]
    except Exception as e:
        print(f"An error occurred while getting embeddings: {e}")
        raise


def qdrant_points_stream(corpus, avg_corpus_text_length, gemini_batch_size: int = 8):
    """Streams Qdrant points with embeddings for a given corpus.

    Args:
        corpus: The dataset to process.
        avg_corpus_text_length: The average text length for miniCOIL (based on BM25 formula).
        gemini_batch_size: The batch size for Gemini embedding requests.

    Yields:
        Qdrant PointStruct objects.
    """
    for start in range(0, len(corpus), gemini_batch_size):  # Iterate over the dataset in batches.
        end = min(start + gemini_batch_size, len(corpus))
        batch = corpus.select(range(start, end))  # Current batch slice.

        gemini_embeddings_full = get_embeddings_batch(
            [row["text"] for row in batch], task_type="RETRIEVAL_DOCUMENT"
            )  # Generate embeddings for this batch.

        for batch_item, gemini_embedding_full in zip(batch, gemini_embeddings_full):
            yield models.PointStruct(
                id=str(uuid.uuid4()),  # Unique ID (string UUID or integer supported by Qdrant).
                payload={  # Metadata stored alongside the vector.
                    "text": batch_item["text"],  # Raw text for users/LLMs.
                    "ids": batch_item["ids"],  # IDs of the QA pairs related to this `text` (for later evaluation).
                },
                vector={  # Embeddings.
                    "gemini_embedding_rerank": gemini_embedding_full,  # Full Gemini embedding for reranking.
                    "gemini_embedding_retrieve": gemini_embedding_full[:768],  # Truncated Gemini embedding for retrieval.
                    "miniCOIL": models.Document(  # Custom Qdrant-optimized BM25 replacement.
                        text=batch_item["text"],
                        model="Qdrant/minicoil-v1",
                        options={"avg_len": avg_corpus_text_length, "k": 0.9, "b": 0.4},  # Corpus avg length, k_1 & b from BM25 formula.
                    ),
                },
            )


Now you'll embed the data and upload the embeddings.

> Try experimenting with different batch sizes when generating embeddings and uploading them to Qdrant.  
> The fastest setup usually depends on your network speed and available RAM/CPU/GPU. Keep in mind that embedding inference is not a fast process.

> The sparse representations used for the keyword-based part of hybrid search are produced by Qdrant's tooling (the `fastembed` extra installed with the client).  
> In Colab, the required model (here, **Qdrant/minicoil-v1**) is downloaded the first time you use it, as it’s needed to convert `text` chunks into sparse representations.


In [8]:
import tqdm

COLLECTION_NAME = "legal_AI_QA"

# Estimate the average text length (in words) on a subsample of 1000 texts, for use in the BM25-inspired keyword-based retrieval.
SUBSET_SIZE = 1000
avg_corpus_text_length = sum(len(text.split()) for text in corpus["text"][:SUBSET_SIZE]) / SUBSET_SIZE

qdrant_client.upload_points(
    collection_name=COLLECTION_NAME,
    points=tqdm.tqdm(
        qdrant_points_stream(corpus_deduplicated,
                            avg_corpus_text_length=avg_corpus_text_length,
                            gemini_batch_size=4),
        desc="Uploading points",
    ),
    batch_size=4,
)

## Step 4: Experiment & evaluate

Every retrieval task benefits from experimenting with different tools & running evaluations against a sensible metric.

### Metric
In RAG, the goal is usually to get the **correct result within the top-N retrieved results, using a very small N**. Those results are what the LLM uses to generate a grounded answer, and a small N saves context window space and token costs.

You'll use the metric **`hit@1`**, meaning the top-ranked text chunk actually contains the answer to the question.

### Eval set
For experiments, you should only use questions where the `answers` field is not `null`, since only then is the paired `text` chunk guaranteed to contain the answer to the question.


In [9]:
questions = corpus.filter(lambda item: bool(item['answers']))  # Keep questions whose `answers` is neither null nor empty.

Compute Gemini embeddings for all the questions once, up front, so you can experiment freely without spending extra time or money.

In [10]:
import tqdm

question_embeddings = {}

question_texts = [q['question'] for q in questions]
question_ids = [q['id'] for q in questions]
all_embeddings = []

BATCH_SIZE = 32

for i in tqdm.tqdm(range(0, len(question_texts), BATCH_SIZE), desc="Embedding questions"):
    batch_texts = question_texts[i:i + BATCH_SIZE]
    embeddings = get_embeddings_batch(batch_texts, task_type="QUESTION_ANSWERING")
    all_embeddings.extend(embeddings)

question_embeddings = {qid: emb for qid, emb in zip(question_ids, all_embeddings)}

Then randomly select a test subset of 500 questions.

In [11]:
questions = questions.shuffle(seed=42).select(range(500))

### Experiment

There are many ways to improve search results. For example, **reranking** alone can be done with high-dimensional embeddings like Gemini, [multivectors like ColBERT](https://qdrant.tech/documentation/advanced-tutorials/using-multivector-representations/) or cross-encoders.

For simplicity, you'll focus on three straightforward experiments that are a good starting point for domains that demand high precision, like legal:

**Experiment 1: Vanilla Retrieval**  
Use truncated Gemini embeddings for vanilla retrieval.  
This gives you a simple reference point to compare improvements against.  

**Experiment 2: Reranking**  
Rerank the retrieved subset with full-sized Gemini embeddings.  
Larger embeddings capture finer semantic details that the retriever may miss.

**Experiment 3: Hybrid Search**  
Combine semantic (captures meaning) and keyword-based (ensures exact matches) retrieval in **Hybrid Search**.  
*For this toy dataset, keyword matching may not add much, since all the questions are phrased conversationally and contain few distinctive keywords. In real-life legal AI retrieval, however, it makes a difference.*

The setup is as follows:
1. Run two searches with the same query: one semantic and one keyword-based.  
2. Merge the results into a single list with a fusion algorithm. Here you'll use [Reciprocal Rank Fusion (RRF)](https://qdrant.tech/documentation/concepts/hybrid-queries/#hybrid-search), a simple well-known zero-shot method of fusion.  
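
To illustrate the idea behind RRF, here is a minimal pure Python sketch. It is not Qdrant's implementation: the function name is illustrative, and the constant `k=60` is an assumption taken from common usage, so Qdrant's internal value may differ. Each document scores `1 / (k + rank)` in every list it appears in, and the scores are summed.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of IDs into a single list, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)  # Higher ranks contribute more.
    return sorted(scores, key=scores.get, reverse=True)


# Toy rankings (made-up IDs) from a semantic and a keyword-based search.
semantic_hits = ["a", "b", "c"]
keyword_hits = ["b", "c", "a"]

fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
print(fused)  # prints ['b', 'a', 'c']
```

Document `b` wins because it ranks near the top of both lists, even though it is first in only one of them. RRF uses ranks only, so the two searches do not need comparable score scales.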

---
Compare the results of all three experiments on our evaluation set using the chosen metric.  

In [None]:
COLLECTION_NAME = "legal_AI_QA"
N = 1

def hit_at_n(results: list[models.ScoredPoint], question_id: str, N: int) -> int:
    """Calculates if the correct document is within the top N retrieved results.

    Args:
        results: A list of scored points from a Qdrant search.
        question_id: The ID of the question to check for.
        N: The number of top results to check.

    Returns:
        1 if the correct document is found, 0 otherwise.
    """
    for result in results[:N]:  # Only the top N results count towards the metric.
        if question_id in result.payload["ids"]:
            return 1
    return 0

hits_baseline = 0
hits_rerank = 0
hits_hybrid = 0

for question in tqdm.tqdm(questions, desc=f"Evaluating hits@{N}"):
    full_gemini_embedding = question_embeddings[question['id']] # Embedding of the question.

    ## Experiment 1: Baseline
    result_baseline = qdrant_client.query_points(
        collection_name=COLLECTION_NAME,
        query=full_gemini_embedding[:768],  # Use the first quarter of the Gemini embedding for retrieval.
        limit=N,
        using="gemini_embedding_retrieve",
        with_payload=True,
    )
    hits_baseline += hit_at_n(result_baseline.points, question_id=question['id'], N=N)

    ## Experiment 2: Reranking
    result_rerank = qdrant_client.query_points(
        collection_name=COLLECTION_NAME,
        prefetch=models.Prefetch( # First, retrieve 50 candidates with the smaller embedding.
            query=full_gemini_embedding[:768],
            using="gemini_embedding_retrieve",
            limit=50
        ),
        query=full_gemini_embedding, # Then rerank those 50 results with the full embedding.
        limit=N,
        using="gemini_embedding_rerank",
        with_payload=True
    )
    hits_rerank += hit_at_n(result_rerank.points, question_id=question['id'], N=N)

    ## Experiment 3: Hybrid search (semantic + keyword)
    result_hybrid = qdrant_client.query_points(
        collection_name=COLLECTION_NAME,
        prefetch=[
            models.Prefetch( # Retrieve 25 results using semantic search (Gemini truncated embeddings).
                query=full_gemini_embedding[:768],
                using="gemini_embedding_retrieve",
                limit=25
            ),
            models.Prefetch( # Retrieve 25 results using miniCOIL (Qdrant’s custom improved version of BM25-based keyword search).
                query=models.Document(
                    text=question['question'],
                    model="Qdrant/minicoil-v1"
                ),
                using="miniCOIL",
                limit=25
            )
        ],
        query=models.FusionQuery(fusion="rrf"), # Fuse the two result sets with Reciprocal Rank Fusion (RRF).
        limit=N,
        with_payload=True
    )
    hits_hybrid += hit_at_n(result_hybrid.points, question_id=question['id'], N=N)


hits_at_N_baseline = hits_baseline / len(questions) # Compute average hits@N for each experiment.
hits_at_N_rerank = hits_rerank / len(questions)
hits_at_N_hybrid = hits_hybrid / len(questions)

print("\n")
print(f"Retrieval Avg Hits@{N}: {hits_at_N_baseline:.3f} ({hits_baseline}/{len(questions)})")
print(f"Rerank Avg Hits@{N}: {hits_at_N_rerank:.3f} ({hits_rerank}/{len(questions)})")
print(f"Hybrid Search Avg Hits@{N}: {hits_at_N_hybrid:.3f} ({hits_hybrid}/{len(questions)})")

## Next steps

In this notebook, you set up a **retrieval pipeline behind a typical legal RAG chatbot** with **Qdrant Vector Search Engine** and **Gemini Embeddings**.   
You tried several approaches to retrieval, making use of Gemini's capability to generate Matryoshka representations & Qdrant's tooling for retrieval with reranking and hybrid search.

Of course, legal applications require much more than a plain zero-shot pipeline. Retrieval quality always depends on the dataset and use case, so there’s no silver bullet besides experimenting & iterating.

### Where to go from zero-shot:
- Analyze the queries that miss;
- Tune vector index parameters (for example, `hnsw_ef` for search at scale in Qdrant);
- Experiment with different fusion strategies & parameters in hybrid search;
- Try query expansion (or filter extraction);
- ...

Use this notebook as a baseline to build on and experiment to find what works best for you!

![Astronaut judge 2](https://storage.googleapis.com/qdrant-examples/astronaut_court_1.png)