[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mongodb-developer/GenAI-Showcase/blob/main/notebooks/advanced_techniques/contextual_chunk_embeddings.ipynb)

[![View Article](https://img.shields.io/badge/View%20Article-blue)](https://www.mongodb.com/developer/products/atlas/contextual-chunk-embeddings/?utm_campaign=devrel&utm_source=cross-post&utm_medium=organic_social&utm_content=https%3A%2F%2Fgithub.com%2Fmongodb-developer%2FGenAI-Showcase&utm_term=apoorva.joshi)

## Step 1: Install required libraries

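- **datasets**: Python library to get access to datasets available on Hugging Face Hub
- **pdfplumber**: Python library to extract text and tables from PDF files
- **langchain-text-splitters**: LangChain's text-splitting utilities, used here to chunk the PDF text
- **tiktoken**: Tokenizer library used to measure chunk sizes in tokens
- **voyageai**: Python client to interact with Voyage AI APIs
- **pymongo**: Python driver to interact with MongoDB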

In [15]:
!pip install -qU datasets pdfplumber langchain-text-splitters tiktoken voyageai pymongo 

## Step 2: Setup prerequisites

* **Voyage AI**
  * [**Obtain a Voyage AI API key**](https://dashboard.voyageai.com/organization/api-keys)

* **MongoDB**
  * **Register for a [free MongoDB Atlas account](https://www.mongodb.com/cloud/atlas/register)**
  * **Create a database cluster**: Once you register and sign into your Atlas account for the first time, you will be directed to the Cluster Deployment page.
    * Select the _Free_ option to create a free tier cluster.
    * Click _Create Deployment_ to create the cluster.
    * In the modal that appears, click _Create database user_. Then click _Choose a connection method_.
    * In the next screen, click _Drivers_.
    * Next, copy the connection string (starts with `mongodb+srv://`) to a safe place.
    * Click _Done_.
  * **Allow Access from anywhere**: To connect to your MongoDB cluster from this notebook, you will need to open up network access to your cluster.
    * From the side navigation bar in the Atlas UI, select _Security_ > _Network Access_.
    * On the screen that appears, click _Add IP Address_.
    * In the modal that appears, click _Allow Access From Anywhere_ and click _Confirm_.

NOTE: Opening access to your MongoDB clusters from anywhere is not recommended in production environments. We are just doing it for easy access here.


In [18]:
import getpass
import os

import voyageai
from pymongo import MongoClient

In [19]:
# Set Voyage AI API Key as an environment variable
os.environ["VOYAGE_API_KEY"] = getpass.getpass("Enter your VoyageAI API key:")
# Initialize the Voyage AI client
voyage_client = voyageai.Client()

Enter your VoyageAI API key: ········


In [20]:
# Set your MongoDB connection string
MONGODB_URI = getpass.getpass("Enter your MongoDB URI: ")
# Initialize the MongoDB client
mongodb_client = MongoClient(
    MONGODB_URI, appname="devrel.showcase.contextual_embeddings_tutorial"
)
mongodb_client.admin.command("ping")

Enter your MongoDB URI:  ········


{'ok': 1.0,
 '$clusterTime': {'clusterTime': Timestamp(1760126544, 1),
  'signature': {'hash': b'\xb9$\x83\x89y\xb0\xfdF\x10V\x81n\xa5\xb7odn\xab[\x10',
   'keyId': 7522922054039896066}},
 'operationTime': Timestamp(1760126544, 1)}

## Step 3: Download a dataset

In [22]:
from datasets import Pdf, load_dataset

# Download a dataset from Hugging Face
docs = load_dataset("MongoDB/legal-docs", split="train")

In [23]:
# Get the first PDF in the dataset
pdf = docs[0]["pdf"]

In [24]:
# Get the number of pages in the PDF
len(pdf.pages)

40

In [25]:
# Preview the first page in the PDF
pdf.pages[0].extract_text()

'Exhibit 10.2\nExecution Version\nINTELLECTUAL PROPERTY AGREEMENT\nThis INTELLECTUAL PROPERTY AGREEMENT (this “Agreement”), dated as of December 31, 2018 (the “Effective Date”) is entered into by and\nbetween Armstrong Flooring, Inc., a Delaware corporation (“Seller”) and AFI Licensing LLC, a Delaware limited liability company (“Licensing” and\ntogether with Seller, “Arizona”) and AHF Holding, Inc. (formerly known as Tarzan HoldCo, Inc.), a Delaware corporation (“Buyer”) and Armstrong\nHardwood Flooring Company, a Tennessee corporation (the “Company” and together with Buyer the “Buyer Entities”) (each of Arizona on the one hand\nand the Buyer Entities on the other hand, a “Party” and collectively, the “Parties”).\nWHEREAS, Seller and Buyer have entered into that certain Stock Purchase Agreement, dated November 14, 2018 (the “Stock Purchase\nAgreement”); WHEREAS, pursuant to the Stock Purchase Agreement, Seller has agreed to sell and transfer, and Buyer has agreed to purchase and\nacqui

## Step 4: Chunk the PDF content

In [26]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [27]:
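# NOTE: The splitter tries separators in order, and the empty string always matches,
# so the Markdown heading separators listed after "" are never reached.
# The list is left as-is so that the chunk IDs used in the evaluation below stay valid.
separators = ["\n\n", "\n", " ", "", "#", "##", "###"]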
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    model_name="gpt-4", separators=separators, chunk_size=200, chunk_overlap=0
)

In [88]:
chunked_docs = []
# Iterate through the documents
for doc_id, doc in enumerate(docs):
    pages = doc["pdf"].pages
    # Keep track of chunk IDs per document
    chunk_id = 0
    # Iterate through the pages in each document
    for page in pages:
        chunks = text_splitter.split_text(page.extract_text())
        for chunk in chunks:
            chunked_docs.append(
                {"chunk": chunk, "chunk_id": chunk_id, "doc_id": doc_id}
            )
            chunk_id += 1
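
The bookkeeping in the loop above matters for the evaluation later: `chunk_id` keeps counting across the pages of a document and resets to 0 for each new document. A minimal sketch of that behavior on toy data follows. `toy_split` is a hypothetical stand-in for `text_splitter.split_text` (it splits on word counts, not tokens), and plain strings stand in for PDF pages.

```python
from typing import Dict, List


def toy_split(text: str, max_words: int = 3) -> List[str]:
    """Hypothetical stand-in for text_splitter.split_text: fixed-size word windows."""
    words = text.split()
    return [" ".join(words[i : i + max_words]) for i in range(0, len(words), max_words)]


def chunk_documents(docs_pages: List[List[str]]) -> List[Dict]:
    """Mirror the loop above: chunk_id runs across pages and resets per document."""
    records = []
    for doc_id, pages in enumerate(docs_pages):
        chunk_id = 0  # reset for every document
        for page_text in pages:
            for chunk in toy_split(page_text):
                records.append({"chunk": chunk, "chunk_id": chunk_id, "doc_id": doc_id})
                chunk_id += 1  # keeps counting across page boundaries
    return records


toy_docs = [
    ["a b c d e", "f g"],  # document 0 has two pages
    ["h i j k"],  # document 1 has one page
]
records = chunk_documents(toy_docs)
ids = [(r["doc_id"], r["chunk_id"]) for r in records]
print(ids)  # [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1)]
```

A `(doc_id, chunk_id)` pair therefore identifies a chunk uniquely, which is how the golden chunks are referenced in the evaluation queries.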

## Step 5: Embed the chunks

In [78]:
from typing import List

In [79]:
def get_std_embeddings(input: List[str], input_type: str) -> List[List[float]]:
    """
    Generate context-agnostic embeddings.

    Args:
        input (List[str]): List of document chunks or query wrapped in a list
        input_type (str): Either "document" or "query"

    Returns:
        List[List[float]]: One embedding vector per input string
    """
    response = voyage_client.embed(input, model="voyage-3-large", input_type=input_type)
    return response.embeddings

In [82]:
def get_contextualized_embeddings(
    input: List[List[str]], input_type: str
) -> List[List[float]]:
    """
    Generate contextualized chunk embeddings.

    Args:
        input (List[List[str]]): One inner list per document, holding that document's chunks in order; a query is wrapped in a list of lists
        input_type (str): Either "document" or "query"

    Returns:
        List[List[float]]: One embedding vector per chunk, flattened across documents in input order
    """
    response = voyage_client.contextualized_embed(
        input, model="voyage-context-3", input_type=input_type
    )
    return [emb for r in response.results for emb in r.embeddings]
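
Embedding APIs usually cap how many inputs a single request can carry, so a larger corpus may need to be sent in batches. The sketch below shows one way to do that for the standard (context-agnostic) embeddings. `embed_in_batches`, `fake_embed`, and the `batch_size` value are hypothetical names and numbers chosen for illustration; check the Voyage AI documentation for the actual limits.

```python
from typing import Callable, List


def embed_in_batches(
    texts: List[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    batch_size: int = 128,  # assumed value, not an official limit
) -> List[List[float]]:
    """Call embed_fn on consecutive slices of texts and concatenate the results in order."""
    embeddings: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start : start + batch_size]
        embeddings.extend(embed_fn(batch))
    return embeddings


def fake_embed(batch: List[str]) -> List[List[float]]:
    """Stand-in for a real embedding call: one 1-d vector per text."""
    return [[float(len(text))] for text in batch]


vectors = embed_in_batches(["a", "bb", "ccc", "dddd", "eeeee"], fake_embed, batch_size=2)
print(vectors)  # [[1.0], [2.0], [3.0], [4.0], [5.0]]
```

In this notebook, `embed_fn` could be `lambda batch: get_std_embeddings(batch, "document")`. The same slicing should not be applied blindly to contextualized embeddings: each inner list passed to `get_contextualized_embeddings` represents one document, so splitting a document's chunks across requests would presumably drop the shared context.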

## Step 6: Evaluation

In [100]:
import numpy as np

In [104]:
queries = [
    {
        "question": "Which state’s law governs the Intellectual Property Agreement between Armstrong Flooring and AHF Holding?",
        "doc_id": 0,
        "chunk_id": 44,
    },
    {
        "question": "Under the Armstrong Flooring-AHF Holding agreement, how long is the Arizona Trademark License Term?",
        "doc_id": 0,
        "chunk_id": 9,
    },
    {
        "question": "In the agreement between Armstrong Flooring, Inc. and AHF Holding, Inc., how long is the Diamond Trademark License Term?",
        "doc_id": 0,
        "chunk_id": 11,
    },
    {
        "question": "Where will disputes be resolved under the Intellectual Property Agreement between Armstrong Flooring and AHF Holding?",
        "doc_id": 0,
        "chunk_id": 44,
    },
    {
        "question": "What happens if either party materially breaches the Armstrong-AHF Intellectual Property Agreement?",
        "doc_id": 0,
        "chunk_id": 35,
    },
    {
        "question": "When does Playa Hotels & Resorts' right of first offer expire?",
        "doc_id": 1,
        "chunk_id": 4,
    },
    {
        "question": "Which state’s law governs the agreement between Hyatt Franchising Latin America and Playa Hotels & Resorts B.V.?",
        "doc_id": 1,
        "chunk_id": 13,
    },
    {
        "question": "What countries can Hyatt Franchising Latin America and Playa develop Hyatt All-Inclusive Resorts in?",
        "doc_id": 1,
        "chunk_id": 1,
    },
    {
        "question": "How many years of hotel experience must arbitrators have under the Hyatt-Playa agreement?",
        "doc_id": 1,
        "chunk_id": 15,
    },
    {
        "question": "Where will arbitration take place for disputes under the Hyatt-Playa Strategic Alliance Agreement?",
        "doc_id": 1,
        "chunk_id": 15,
    },
    {
        "question": "When was the Quaker/Gulf Houghton non-compete agreement effective?",
        "doc_id": 2,
        "chunk_id": 0,
    },
    {
        "question": "Which state’s law governs the Quaker/Gulf Houghton non-competition agreement?",
        "doc_id": 2,
        "chunk_id": 18,
    },
    {
        "question": "What is the geographic scope of the Quaker/Gulf Houghton non-compete?",
        "doc_id": 2,
        "chunk_id": 9,
    },
    {
        "question": "What percentage of publicly traded securities can Gulf Houghton sellers own as passive investors?",
        "doc_id": 2,
        "chunk_id": 9,
    },
    {
        "question": "How long must former employees be terminated before Gulf Houghton sellers can hire them?",
        "doc_id": 2,
        "chunk_id": 12,
    },
]

In [145]:
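def calculate_metrics(query, chunk_embds, embd_type, k):
    """
    Check whether the golden chunk for a query is among the top-k retrieved chunks.

    Args:
        query (dict): Has "question", "doc_id" and "chunk_id" keys
        chunk_embds: Chunk embeddings, in the same order as `chunked_docs`
        embd_type (str): Either "standard" or "contextual"
        k (int): Number of top results to consider

    Returns:
        (recall, golden_rank): 1 and the 1-based rank if found in the top k, else 0 and None
    """
    if embd_type == "standard":
        query_embd = get_std_embeddings([query["question"]], "query")[0]
    elif embd_type == "contextual":
        query_embd = get_contextualized_embeddings([[query["question"]]], "query")[0]
    else:
        raise ValueError(f"Unknown embd_type: {embd_type}")
    # Dot-product similarity between the query and every chunk
    similarities = np.dot(chunk_embds, query_embd)
    # Indices of the k most similar chunks, best first
    top_k_idxs = np.argsort(similarities)[::-1][:k]
    top_k_docs = [chunked_docs[i] for i in top_k_idxs]
    golden_rank = None
    for rank, doc in enumerate(top_k_docs):
        if doc["doc_id"] == query["doc_id"] and doc["chunk_id"] == query["chunk_id"]:
            golden_rank = rank + 1
            break

    recall = 1 if golden_rank else 0
    return recall, golden_rank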
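
To make the two metrics concrete, here is a small worked example with hand-picked 2-d vectors and no API calls. It is a sketch, not the notebook's implementation: `golden_rank_at_k` and `summarize` are hypothetical helpers, pure Python replaces NumPy, and the vectors are assumed to be unit-normalized so that the dot product behaves like cosine similarity.

```python
from typing import List, Optional, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def golden_rank_at_k(
    query_vec: Sequence[float],
    chunk_vecs: List[Sequence[float]],
    golden_idx: int,
    k: int,
) -> Optional[int]:
    """Return the 1-based rank of the golden chunk if it is in the top k, else None."""
    scores = [dot(vec, query_vec) for vec in chunk_vecs]
    top_k = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return top_k.index(golden_idx) + 1 if golden_idx in top_k else None


def summarize(ranks: List[Optional[int]]) -> Tuple[float, float]:
    """Mean recall@k and mean reciprocal rank over a list of golden ranks."""
    recall = sum(r is not None for r in ranks) / len(ranks)
    mrr = sum(1 / r if r else 0.0 for r in ranks) / len(ranks)
    return recall, mrr


chunk_vecs = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
query_vec = [0.0, 1.0]  # scores are 0.0, 0.8 and 1.0, so the ranking is chunk 2, 1, 0

# Three toy "queries" that share a query vector but have different golden chunks
ranks = [golden_rank_at_k(query_vec, chunk_vecs, golden, k=2) for golden in (1, 0, 2)]
print(ranks)  # [2, None, 1]

recall, mrr = summarize(ranks)
print(f"recall@2={recall:.2f}, MRR={mrr:.2f}")  # recall@2=0.67, MRR=0.50
```

A miss contributes 0 to both metrics, while a hit contributes 1 to recall and `1 / rank` to the reciprocal rank, so MRR rewards placing the golden chunk higher and recall only checks that it was retrieved.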

### Standard Embeddings

In [97]:
std_embds = get_std_embeddings([record["chunk"] for record in chunked_docs], "document")

In [147]:
recalls = []
reciprocal_ranks = []
for query in queries:
    recall, rank = calculate_metrics(query, std_embds, "standard", 10)
    recalls.append(recall)
    print(f"{query['question']}: {rank}")
    reciprocal_ranks.append(1 / rank if rank else 0.0)

print(f"Mean recall: {np.mean(recalls) * 100:.2f}%")
print(f"Mean reciprocal rank: {np.mean(reciprocal_ranks) * 100:.2f}%")

Which state’s law governs the Intellectual Property Agreement between Armstrong Flooring and AHF Holding?: 9
Under the Armstrong Flooring-AHF Holding agreement, how long is the Arizona Trademark License Term?: 4
In the agreement between Armstrong Flooring, Inc. and AHF Holding, Inc., how long is the Diamond Trademark License Term?: 2
Where will disputes be resolved under the Intellectual Property Agreement between Armstrong Flooring and AHF Holding?: 7
What happens if either party materially breaches the Armstrong-AHF Intellectual Property Agreement?: 1
When does Playa Hotels & Resorts' right of first offer expire?: 5
Which state’s law governs the agreement between Hyatt Franchising Latin America and Playa Hotels & Resorts B.V.?: 3
What countries can Hyatt Franchising Latin America and Playa develop Hyatt All-Inclusive Resorts in?: 2
How many years of hotel experience must arbitrators have under the Hyatt-Playa agreement?: 1
Where will arbitration take place for disputes under the Hyat

### Contextualized Embeddings

In [116]:
from collections import defaultdict

In [118]:
grouped_docs = defaultdict(list)
for chunk in chunked_docs:
    grouped_docs[chunk["doc_id"]].append(chunk["chunk"])

chunks_by_doc = list(grouped_docs.values())
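
`calculate_metrics` maps embedding positions back to `chunked_docs` by index, so the grouped input must flatten back into the same order as `chunked_docs`. That holds when all chunks of a document are contiguous, as they are here. A quick sanity check on toy records could look like the sketch below; `group_by_doc` and `is_aligned` are hypothetical helpers, not part of the notebook.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_doc(records: List[Dict]) -> List[List[str]]:
    """Same grouping as above: one list of chunk texts per doc_id, in first-seen order."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["doc_id"]].append(record["chunk"])
    return list(grouped.values())


def is_aligned(records: List[Dict]) -> bool:
    """True if flattening the grouped chunks reproduces the original record order."""
    flattened = [chunk for doc_chunks in group_by_doc(records) for chunk in doc_chunks]
    return flattened == [record["chunk"] for record in records]


toy_chunked_docs = [
    {"chunk": "a0", "chunk_id": 0, "doc_id": 0},
    {"chunk": "a1", "chunk_id": 1, "doc_id": 0},
    {"chunk": "b0", "chunk_id": 0, "doc_id": 1},
]
print(is_aligned(toy_chunked_docs))  # True

# Interleaving documents breaks the index correspondence
interleaved = [toy_chunked_docs[0], toy_chunked_docs[2], toy_chunked_docs[1]]
print(is_aligned(interleaved))  # False
```

Running `is_aligned(chunked_docs)` before embedding is a cheap guard if the chunking loop is ever changed to process documents out of order.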

In [120]:
ctxt_embds = get_contextualized_embeddings(chunks_by_doc, "document")

In [148]:
recalls = []
reciprocal_ranks = []
for query in queries:
    recall, rank = calculate_metrics(query, ctxt_embds, "contextual", 10)
    recalls.append(recall)
    print(f"{query['question']}: {rank}")
    reciprocal_ranks.append(1 / rank if rank else 0.0)

print(f"Mean recall: {np.mean(recalls) * 100:.2f}%")
print(f"Mean reciprocal rank: {np.mean(reciprocal_ranks) * 100:.2f}%")

Which state’s law governs the Intellectual Property Agreement between Armstrong Flooring and AHF Holding?: 1
Under the Armstrong Flooring-AHF Holding agreement, how long is the Arizona Trademark License Term?: 7
In the agreement between Armstrong Flooring, Inc. and AHF Holding, Inc., how long is the Diamond Trademark License Term?: None
Where will disputes be resolved under the Intellectual Property Agreement between Armstrong Flooring and AHF Holding?: 2
What happens if either party materially breaches the Armstrong-AHF Intellectual Property Agreement?: 1
When does Playa Hotels & Resorts' right of first offer expire?: 5
Which state’s law governs the agreement between Hyatt Franchising Latin America and Playa Hotels & Resorts B.V.?: 1
What countries can Hyatt Franchising Latin America and Playa develop Hyatt All-Inclusive Resorts in?: 3
How many years of hotel experience must arbitrators have under the Hyatt-Playa agreement?: 1
Where will arbitration take place for disputes under the H