[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mongodb-developer/GenAI-Showcase/blob/main/notebooks/advanced_techniques/contextual_chunk_embedding.ipynb)

[![View Article](https://img.shields.io/badge/View%20Article-blue)](https://www.mongodb.com/company/blog/technical/contextualized-chunk-embeddings-combining-local-detail-with-global-context/?utm_campaign=devrel&utm_source=cross-post&utm_medium=organic_social&utm_content=https%3A%2F%2Fgithub.com%2Fmongodb-developer%2FGenAI-Showcase&utm_term=apoorva.joshi)

# Contextualized chunk embeddings: Combining local detail with global context

This notebook shows you how to implement and evaluate Voyage AI's _voyage-context-3_ contextualized chunk embedding model.

## Step 1: Install required libraries

- **datasets**: Python library to get access to datasets available on Hugging Face Hub
- **pdfplumber**: Python library to parse and analyze PDFs
- **langchain-text-splitters**: Text chunking utilities in LangChain
- **tiktoken**: Token counting and encoding library
- **voyageai**: Python library to interact with Voyage AI's APIs

In [None]:
!pip install -qU datasets==4.3.0 pdfplumber==0.11.7 langchain-text-splitters==0.3.11 voyageai==0.3.5 tiktoken==0.12.0

## Step 2: Set up prerequisites

Follow the steps [here](https://dashboard.voyageai.com/organization/api-keys) to obtain a Voyage AI API key.

In [29]:
import getpass
import os

import voyageai

In [None]:
# Set Voyage AI API key as an environment variable
os.environ["VOYAGE_API_KEY"] = getpass.getpass("Enter your VoyageAI API key:")
# Initialize the Voyage AI client
voyage_client = voyageai.Client()

## Step 3: Download the dataset

In [31]:
from datasets import load_dataset

# Download a dataset from Hugging Face
docs = load_dataset("MongoDB/legal-docs", split="train")

In [32]:
# Get the first PDF in the dataset
pdf = docs[0]["pdf"]

In [33]:
# Get the number of pages in the PDF
len(pdf.pages)

40

In [34]:
# Preview the first page in the PDF
pdf.pages[0].extract_text()

'Exhibit 10.2\nExecution Version\nINTELLECTUAL PROPERTY AGREEMENT\nThis INTELLECTUAL PROPERTY AGREEMENT (this “Agreement”), dated as of December 31, 2018 (the “Effective Date”) is entered into by and\nbetween Armstrong Flooring, Inc., a Delaware corporation (“Seller”) and AFI Licensing LLC, a Delaware limited liability company (“Licensing” and\ntogether with Seller, “Arizona”) and AHF Holding, Inc. (formerly known as Tarzan HoldCo, Inc.), a Delaware corporation (“Buyer”) and Armstrong\nHardwood Flooring Company, a Tennessee corporation (the “Company” and together with Buyer the “Buyer Entities”) (each of Arizona on the one hand\nand the Buyer Entities on the other hand, a “Party” and collectively, the “Parties”).\nWHEREAS, Seller and Buyer have entered into that certain Stock Purchase Agreement, dated November 14, 2018 (the “Stock Purchase\nAgreement”); WHEREAS, pursuant to the Stock Purchase Agreement, Seller has agreed to sell and transfer, and Buyer has agreed to purchase and\nacqui

## Step 4: Chunk the PDF content

In [35]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [36]:
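# "" is the final fallback: it matches any text, so separators listed after it are never tried.
# The Markdown header separators ("#", "##", "###") that followed it were unreachable and are omitted.
separators = ["\n\n", "\n", " ", ""]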
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    model_name="gpt-4", separators=separators, chunk_size=200, chunk_overlap=0
)
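
A note on separator order: `RecursiveCharacterTextSplitter` tries separators in the order given, and the empty string matches any text, so it acts as the last resort. The sketch below is a simplified stand-in for that selection step, not LangChain's actual code; `pick_separator` is a hypothetical helper introduced only for illustration.

```python
def pick_separator(text, separators):
    """Return the first separator found in text; "" matches any text."""
    for sep in separators:
        if sep == "" or sep in text:
            return sep
    # Only reached if the list has no "" fallback and nothing matched
    return separators[-1]


# "#" is listed after "", so it can never be selected
with_dead_entries = ["\n\n", "\n", " ", "", "#"]
# "#" is listed before "", so it is selected when present
markdown_first = ["#", "\n\n", "\n", " ", ""]

text = "#Heading"
chosen_a = pick_separator(text, with_dead_entries)
chosen_b = pick_separator(text, markdown_first)
print(repr(chosen_a), repr(chosen_b))
```

If Markdown-aware splitting is wanted, the header separators would need to come before `""`. Doing so changes the chunk boundaries, and therefore the chunk IDs used in the evaluation.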

In [37]:
chunked_docs = []
# Iterate through the documents
for doc_id, doc in enumerate(docs):
    pages = doc["pdf"].pages
    # Keep track of chunk IDs per document
    chunk_id = 0
    # Iterate through the pages in each document
    for page in pages:
        chunks = text_splitter.split_text(page.extract_text())
        for chunk in chunks:
            chunked_docs.append(
                {"chunk": chunk, "chunk_id": chunk_id, "doc_id": doc_id}
            )
            chunk_id += 1

In [38]:
chunked_docs[0]

{'chunk': 'Exhibit 10.2\nExecution Version\nINTELLECTUAL PROPERTY AGREEMENT\nThis INTELLECTUAL PROPERTY AGREEMENT (this “Agreement”), dated as of December 31, 2018 (the “Effective Date”) is entered into by and\nbetween Armstrong Flooring, Inc., a Delaware corporation (“Seller”) and AFI Licensing LLC, a Delaware limited liability company (“Licensing” and\ntogether with Seller, “Arizona”) and AHF Holding, Inc. (formerly known as Tarzan HoldCo, Inc.), a Delaware corporation (“Buyer”) and Armstrong\nHardwood Flooring Company, a Tennessee corporation (the “Company” and together with Buyer the “Buyer Entities”) (each of Arizona on the one hand\nand the Buyer Entities on the other hand, a “Party” and collectively, the “Parties”).',
 'chunk_id': 0,
 'doc_id': 0}

## Step 5: Embed the chunks

In [39]:
from typing import List

In [40]:
def get_std_embeddings(input: List[str], input_type: str) -> List[List[float]]:
    """
    Generate context-agnostic embeddings.

    Args:
        input (List[str]): List of document chunks or query wrapped in a list
        input_type: Either "document" or "query"

    Returns:
        List[List[float]]: List of embedding vectors
    """
    response = voyage_client.embed(input, model="voyage-3-large", input_type=input_type)
    return response.embeddings

In [41]:
def get_contextualized_embeddings(
    input: List[List[str]], input_type: str
) -> List[List[float]]:
    """
    Generate contextualized chunk embeddings.

    Args:
        input (List[List[str]]): List of document chunks or query wrapped in a list of lists
        input_type: Either "document" or "query"

    Returns:
        List[List[float]]: List of embedding vectors
    """
    response = voyage_client.contextualized_embed(
        input, model="voyage-context-3", input_type=input_type
    )
    return [emb for r in response.results for emb in r.embeddings]
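
The two helpers differ in input shape: the standard one takes a flat list of strings, while the contextualized one takes a list of lists (one inner list of chunks per document) and flattens the per-document results back into a single list. The sketch below illustrates only that flattening step. The `SimpleNamespace` objects are assumed stand-ins that mimic the `results` and `embeddings` attributes used above; they are not real Voyage AI response types, and the vectors are made up.

```python
from types import SimpleNamespace

# Mock response: two documents, with 2 and 1 chunk embeddings respectively
mock_response = SimpleNamespace(
    results=[
        SimpleNamespace(embeddings=[[0.1, 0.2], [0.3, 0.4]]),  # document 0
        SimpleNamespace(embeddings=[[0.5, 0.6]]),  # document 1
    ]
)

# Same comprehension as in get_contextualized_embeddings
flat_embeddings = [emb for r in mock_response.results for emb in r.embeddings]

# One embedding per chunk, in document order and then chunk order
print(len(flat_embeddings))
```

Because the flattened list keeps document order and then chunk order, position `i` in the result can be matched back to the `i`-th chunk that was sent, provided the chunks were grouped in that same order.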

## Step 6: Evaluation

In [42]:
import numpy as np

In [43]:
queries = [
    {
        "question": "Which state’s law governs the agreement between Armstrong Flooring and AHF Holding?",
        "doc_id": 0,
        "chunk_id": 44,
    },
    {
        "question": "In the Armstrong-AHF agreement, how many days' notice is required to remedy a breach?",
        "doc_id": 0,
        "chunk_id": 9,
    },
    {
        "question": "Where will disputes be resolved under the agreement between Armstrong Flooring and AHF Holding?",
        "doc_id": 0,
        "chunk_id": 44,
    },
    {
        "question": "What happens if either party materially breaches the Armstrong-AHF Intellectual agreement?",
        "doc_id": 0,
        "chunk_id": 35,
    },
    {
        "question": "Under the Armstrong Flooring-AHF Holding agreement, what is the minimum logo size?",
        "doc_id": 0,
        "chunk_id": 94,
    },
    {
        "question": "When does Playa Hotels & Resorts' right of first offer expire?",
        "doc_id": 1,
        "chunk_id": 4,
    },
    {
        "question": "Which state’s law governs the agreement between Hyatt Franchising Latin America and Playa Hotels & Resorts B.V.?",
        "doc_id": 1,
        "chunk_id": 13,
    },
    {
        "question": "What countries can Hyatt Franchising Latin America and Playa develop Hyatt All-Inclusive Resorts in?",
        "doc_id": 1,
        "chunk_id": 1,
    },
    {
        "question": "How many years of hotel experience must arbitrators have under the Hyatt-Playa agreement?",
        "doc_id": 1,
        "chunk_id": 15,
    },
    {
        "question": "Where will arbitration take place for disputes under the Hyatt-Playa agreement?",
        "doc_id": 1,
        "chunk_id": 15,
    },
    {
        "question": "When was the Quaker/Gulf Houghton agreement effective?",
        "doc_id": 2,
        "chunk_id": 0,
    },
    {
        "question": "Which state’s law governs the Quaker/Gulf Houghton agreement?",
        "doc_id": 2,
        "chunk_id": 18,
    },
    {
        "question": "What is the geographic scope of the Quaker/Gulf Houghton agreement?",
        "doc_id": 2,
        "chunk_id": 9,
    },
    {
        "question": "What percentage of publicly traded securities can Gulf Houghton sellers own as passive investors?",
        "doc_id": 2,
        "chunk_id": 9,
    },
    {
        "question": "How long must Gulf Houghton sellers wait before they can hire former employees?",
        "doc_id": 2,
        "chunk_id": 12,
    },
]

In [44]:
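def calculate_metrics(query, chunk_embds, embd_type, k):
    """
    Compute recall@k and the rank of the golden chunk for a single query.

    Args:
        query (dict): Query with "question", "doc_id" and "chunk_id" keys
        chunk_embds: Chunk embeddings, in the same order as chunked_docs
        embd_type (str): Either "standard" or "contextual"
        k (int): Number of top results to consider

    Returns:
        Tuple[int, Optional[int]]: Recall (1 or 0) and the 1-based rank of the
        golden chunk, or None if it is not in the top k
    """
    # Get query embeddings
    if embd_type == "standard":
        query_embd = get_std_embeddings([query["question"]], "query")[0]
    elif embd_type == "contextual":
        query_embd = get_contextualized_embeddings([[query["question"]]], "query")[0]
    else:
        raise ValueError(f"Unknown embd_type: {embd_type}")
    # Calculate the dot product similarity between the query and every chunk
    similarities = np.dot(chunk_embds, query_embd)
    # Get indices of the top k by similarity
    top_k_idxs = np.argsort(similarities)[::-1][:k]
    # Get the top k most similar chunks
    top_k_docs = [chunked_docs[i] for i in top_k_idxs]
    rank = None
    for i, doc in enumerate(top_k_docs):
        # Check for golden chunk
        if doc["doc_id"] == query["doc_id"] and doc["chunk_id"] == query["chunk_id"]:
            rank = i + 1
            break

    recall = 1 if rank else 0
    return recall, rank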
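
To make the ranking logic concrete without calling the API, here is a minimal sketch on hand-made 2-dimensional vectors. It mirrors the steps above (dot-product similarity, sort descending, find the golden chunk in the top k) in plain Python instead of NumPy. The names `rank_of_golden`, `toy_meta`, `toy_embds`, and `toy_query` are hypothetical, and the vectors are illustrative values, not real embeddings.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def rank_of_golden(query_embd, chunk_embds, chunk_meta, golden, k):
    """Return the 1-based rank of the golden chunk in the top k, else None."""
    sims = [dot(embd, query_embd) for embd in chunk_embds]
    # Indices of the k most similar chunks, highest similarity first
    top_k = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
    for position, idx in enumerate(top_k, start=1):
        if chunk_meta[idx] == golden:
            return position
    return None


# (doc_id, chunk_id) for each toy chunk, aligned with toy_embds
toy_meta = [(0, 0), (0, 1), (1, 0), (1, 1)]
toy_embds = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0], [-1.0, 0.0]]
toy_query = [0.8, 0.6]

rank_hit = rank_of_golden(toy_query, toy_embds, toy_meta, golden=(0, 0), k=3)
rank_miss = rank_of_golden(toy_query, toy_embds, toy_meta, golden=(1, 1), k=3)
print(rank_hit, rank_miss)
```

Here chunk `(0, 1)` is the most similar to the query, `(0, 0)` comes second, and `(1, 1)` falls outside the top 3, so its rank is `None` and its recall would be 0.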

### Standard Embeddings

In [45]:
std_embds = get_std_embeddings([record["chunk"] for record in chunked_docs], "document")

In [46]:
recalls = []
reciprocal_ranks = []
for query in queries:
    recall, rank = calculate_metrics(query, std_embds, "standard", 5)
    recalls.append(recall)
    print(f"{query['question']}: {rank}")
    reciprocal_ranks.append(1 / rank if rank else 0.0)

print(f"Mean recall: {np.mean(recalls) * 100:.2f}%")
print(f"Mean reciprocal rank: {np.mean(reciprocal_ranks) * 100:.2f}%")

Which state’s law governs the agreement between Armstrong Flooring and AHF Holding?: 1
In the Armstrong-AHF agreement, how many days' notice is required to remedy a breach?: None
Where will disputes be resolved under the agreement between Armstrong Flooring and AHF Holding?: None
What happens if either party materially breaches the Armstrong-AHF Intellectual agreement?: 1
Under the Armstrong Flooring-AHF Holding agreement, what is the minimum logo size?: 2
When does Playa Hotels & Resorts' right of first offer expire?: 5
Which state’s law governs the agreement between Hyatt Franchising Latin America and Playa Hotels & Resorts B.V.?: 3
What countries can Hyatt Franchising Latin America and Playa develop Hyatt All-Inclusive Resorts in?: 2
How many years of hotel experience must arbitrators have under the Hyatt-Playa agreement?: 1
Where will arbitration take place for disputes under the Hyatt-Playa agreement?: 2
When was the Quaker/Gulf Houghton agreement effective?: 1
Which state’s law g
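
The two summary figures are plain averages over the per-query ranks: recall@k counts a query as 1 when the golden chunk appears in the top k, and the reciprocal rank is `1 / rank`, or 0 when the chunk was not found. A minimal sketch, using made-up ranks rather than the results above (`summarize` and `toy_ranks` are hypothetical names):

```python
def summarize(ranks):
    """Return (mean recall, mean reciprocal rank) for a list of ranks.

    Each rank is a 1-based position, or None if the golden chunk was not
    retrieved in the top k.
    """
    recalls = [1 if rank else 0 for rank in ranks]
    reciprocal_ranks = [1 / rank if rank else 0.0 for rank in ranks]
    n = len(ranks)
    return sum(recalls) / n, sum(reciprocal_ranks) / n


# Illustrative ranks only: found at 1, not found, found at 2, found at 4
toy_ranks = [1, None, 2, 4]
mean_recall, mrr = summarize(toy_ranks)
print(f"Mean recall: {mean_recall * 100:.2f}%")
print(f"Mean reciprocal rank: {mrr * 100:.2f}%")
```

Recall only records whether the golden chunk was retrieved, while the reciprocal rank also rewards placing it nearer the top, so two models with the same recall can still differ in mean reciprocal rank.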

### Contextualized Embeddings

In [47]:
from collections import defaultdict

In [48]:
# Convert chunked_docs to a list of lists of chunks, one inner list per document
grouped_docs = defaultdict(list)
for chunk in chunked_docs:
    grouped_docs[chunk["doc_id"]].append(chunk["chunk"])

chunks_by_doc = list(grouped_docs.values())
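
The evaluation looks up retrieved chunks by their position in `chunked_docs`, so the flattened contextualized embeddings need to follow the same order. The grouping preserves that order because dictionaries keep insertion order (Python 3.7+) and the chunks were appended document by document. A small sanity check on toy records (all `toy_*` names and chunk texts are hypothetical):

```python
from collections import defaultdict

toy_chunked_docs = [
    {"chunk": "a0", "chunk_id": 0, "doc_id": 0},
    {"chunk": "a1", "chunk_id": 1, "doc_id": 0},
    {"chunk": "b0", "chunk_id": 0, "doc_id": 1},
]

# Group chunk texts by document, as in the cell above
toy_grouped = defaultdict(list)
for record in toy_chunked_docs:
    toy_grouped[record["doc_id"]].append(record["chunk"])
toy_chunks_by_doc = list(toy_grouped.values())

# Flattening the groups should reproduce the original chunk order
flattened = [chunk for doc_chunks in toy_chunks_by_doc for chunk in doc_chunks]
is_aligned = flattened == [record["chunk"] for record in toy_chunked_docs]
print(toy_chunks_by_doc, is_aligned)
```

If the records were ever shuffled or interleaved across documents, this check would fail, and the positions of the contextualized embeddings would no longer line up with `chunked_docs`.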

In [49]:
ctxt_embds = get_contextualized_embeddings(chunks_by_doc, "document")

In [50]:
recalls = []
reciprocal_ranks = []
for query in queries:
    recall, rank = calculate_metrics(query, ctxt_embds, "contextual", 5)
    recalls.append(recall)
    print(f"{query['question']}: {rank}")
    reciprocal_ranks.append(1 / rank if rank else 0.0)

print(f"Mean recall: {np.mean(recalls) * 100:.2f}%")
print(f"Mean reciprocal rank: {np.mean(reciprocal_ranks) * 100:.2f}%")

Which state’s law governs the agreement between Armstrong Flooring and AHF Holding?: 1
In the Armstrong-AHF agreement, how many days' notice is required to remedy a breach?: None
Where will disputes be resolved under the agreement between Armstrong Flooring and AHF Holding?: 2
What happens if either party materially breaches the Armstrong-AHF Intellectual agreement?: 1
Under the Armstrong Flooring-AHF Holding agreement, what is the minimum logo size?: 2
When does Playa Hotels & Resorts' right of first offer expire?: 5
Which state’s law governs the agreement between Hyatt Franchising Latin America and Playa Hotels & Resorts B.V.?: 1
What countries can Hyatt Franchising Latin America and Playa develop Hyatt All-Inclusive Resorts in?: 3
How many years of hotel experience must arbitrators have under the Hyatt-Playa agreement?: 1
Where will arbitration take place for disputes under the Hyatt-Playa agreement?: 1
When was the Quaker/Gulf Houghton agreement effective?: 1
Which state’s law gove