# Semantic Search with Voyage AI Embeddings

This notebook is a companion to the [Semantic Search with Voyage AI](https://www.mongodb.com/docs/voyageai/tutorials/semantic-search/) tutorial. Refer to that page for setup instructions and detailed explanations.

This notebook shows how to perform semantic search with Voyage AI models. It includes examples for basic and advanced use cases: search with reranking, and multilingual, multimodal, contextualized chunk, and large corpus retrieval.

<a target="_blank" href="https://colab.research.google.com/github/mongodb/docs-notebooks/blob/main/voyageai/notebooks/semantic-search.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

## Set Up Your Environment

Before you begin, install libraries and set your model API key.

In [None]:
!pip install --upgrade voyageai numpy datasets pillow

In [None]:
import os

# Set your model API key
os.environ["VOYAGE_API_KEY"] = "<your-model-api-key>"
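
Hard-coding a key in a notebook makes it easy to leak when the notebook is shared. As one alternative, the sketch below reads the key from the environment and prompts for it only when it is missing or still set to the placeholder. The helper name `ensure_api_key` and its parameters are hypothetical and not part of the `voyageai` package; the sketch assumes the client picks up `VOYAGE_API_KEY` from the environment, as the cell above implies.

```python
import os
from getpass import getpass


def ensure_api_key(var_name="VOYAGE_API_KEY", prompt_fn=getpass):
    """Return the API key from the environment, prompting once if it is unset.

    Hypothetical helper for illustration; not part of the voyageai package.
    """
    key = os.environ.get(var_name, "").strip()
    if not key or key.startswith("<"):
        # Unset, empty, or still a "<...>" placeholder: ask the user for a value
        key = prompt_fn(f"Enter {var_name}: ").strip()
        if not key:
            raise ValueError(f"{var_name} must not be empty")
        os.environ[var_name] = key
    return key
```

Calling `ensure_api_key()` before `voyageai.Client()` could then stand in for the assignment above.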

## Basic Semantic Search

Find similar documents using simple vector similarity.

In [None]:
import voyageai
import numpy as np

# Initialize Voyage AI client
vo = voyageai.Client()

# Sample documents
documents = [
    "The Mediterranean diet emphasizes fish, olive oil, and vegetables, believed to reduce chronic diseases.",
    "Photosynthesis in plants converts light energy into glucose and produces essential oxygen.",
    "20th-century innovations, from radios to smartphones, centered on electronic advancements.",
    "Rivers provide water, irrigation, and habitat for aquatic species, vital for ecosystems.",
    "Apple's conference call to discuss fourth fiscal quarter results and business updates is scheduled for Thursday, November 2, 2023 at 2:00 p.m. PT / 5:00 p.m. ET.",
    "Shakespeare's works, like 'Hamlet' and 'A Midsummer Night's Dream,' endure in literature."
]

# Search query
query = "When is Apple's conference call scheduled?"

# Generate embeddings for documents
doc_embeddings = vo.embed(
    texts=documents,
    model="voyage-4-large",
    input_type="document"
).embeddings

# Generate embedding for query
query_embedding = vo.embed(
    texts=[query],
    model="voyage-4-large",
    input_type="query"
).embeddings[0]

# Calculate similarity scores using dot product
similarities = np.dot(doc_embeddings, query_embedding)

# Sort documents by similarity (highest to lowest)
ranked_indices = np.argsort(-similarities)

# Display results
print(f"Query: '{query}'\n")
for rank, idx in enumerate(ranked_indices, 1):
    print(f"{rank}. {documents[idx]}")
    print(f"   Similarity: {similarities[idx]:.4f}\n")
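
The dot product matches cosine similarity only when the vectors have unit length. The sketch below uses small made-up vectors in place of real embeddings to show the relationship, along with a defensive normalization step. It assumes, rather than verifies, that the embeddings returned by the API are already normalized; `cosine_similarities` is a hypothetical helper name.

```python
import numpy as np


def cosine_similarities(doc_vectors, query_vector):
    """Cosine similarity between each row of doc_vectors and query_vector."""
    docs = np.asarray(doc_vectors, dtype=float)
    query = np.asarray(query_vector, dtype=float)
    # Scale every vector to unit length so the dot product equals the cosine
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    query = query / np.linalg.norm(query)
    return docs @ query


# Toy vectors standing in for embeddings (not real model output)
toy_docs = [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]]
toy_query = [0.6, 0.8]  # already unit length

cosine = cosine_similarities(toy_docs, toy_query)
raw_dot = np.dot(toy_docs, toy_query)

print(cosine)   # cosine scores, independent of vector length
print(raw_dot)  # raw dot products, scaled up for longer vectors
```

If the inputs are already unit length, both computations give the same scores, so the plain `np.dot` call in the cell above is enough.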

## Semantic Search with Reranker

Refine the ranking from embedding similarity with a reranking model, which scores each document directly against the query.

In [None]:
import voyageai
import numpy as np

# Initialize Voyage AI client
vo = voyageai.Client()

# Sample documents
documents = [
    "The Mediterranean diet emphasizes fish, olive oil, and vegetables, believed to reduce chronic diseases.",
    "Photosynthesis in plants converts light energy into glucose and produces essential oxygen.",
    "20th-century innovations, from radios to smartphones, centered on electronic advancements.",
    "Rivers provide water, irrigation, and habitat for aquatic species, vital for ecosystems.",
    "Apple's conference call to discuss fourth fiscal quarter results and business updates is scheduled for Thursday, November 2, 2023 at 2:00 p.m. PT / 5:00 p.m. ET.",
    "Shakespeare's works, like 'Hamlet' and 'A Midsummer Night's Dream,' endure in literature."
]

# Search query
query = "When is Apple's conference call scheduled?"

# Generate embeddings for documents
doc_embeddings = vo.embed(
    texts=documents,
    model="voyage-4-large",
    input_type="document"
).embeddings

# Generate embedding for query
query_embedding = vo.embed(
    texts=[query],
    model="voyage-4-large",
    input_type="query"
).embeddings[0]

# Calculate similarity scores using dot product
similarities = np.dot(doc_embeddings, query_embedding)

# Sort by similarity (highest to lowest)
ranked_indices = np.argsort(-similarities)

# Display results before reranking
print(f"Query: '{query}'\n")
print("Before reranker (embedding similarity only):")
for rank, idx in enumerate(ranked_indices[:3], 1):
    print(f"{rank}. {documents[idx]}")
    print(f"   Similarity Score: {similarities[idx]:.4f}\n")

# Rerank documents for improved accuracy
rerank_results = vo.rerank(
    query=query,
    documents=documents,
    model="rerank-2.5"
)

# Display results after reranking
print("\nAfter reranker:")
for rank, result in enumerate(rerank_results.results[:3], 1):
    print(f"{rank}. {documents[result.index]}")
    print(f"   Relevance Score: {result.relevance_score:.4f}\n")
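
With a larger collection, a common pattern is to rerank only the top candidates from the embedding search rather than every document. The sketch below shows the index bookkeeping for that pattern with made-up similarity scores. `select_candidates` is a hypothetical helper, and the reranker order is hard-coded for illustration; a real run would take it from `result.index` in the `vo.rerank` response.

```python
import numpy as np


def select_candidates(similarities, documents, k):
    """Return (original_indices, candidate_texts) for the k highest scores."""
    order = np.argsort(-np.asarray(similarities))[:k]
    return order.tolist(), [documents[i] for i in order]


# Made-up scores standing in for embedding similarities
toy_documents = ["doc A", "doc B", "doc C", "doc D"]
toy_similarities = [0.12, 0.71, 0.05, 0.64]

candidate_ids, candidates = select_candidates(toy_similarities, toy_documents, k=2)

# A reranker reports positions within the candidate list it was given.
# Hard-coded here for illustration: it prefers the second candidate.
reranked_positions = [1, 0]

# Map reranker positions back to positions in the full document list
final_ids = [candidate_ids[pos] for pos in reranked_positions]
for rank, doc_id in enumerate(final_ids, 1):
    print(f"{rank}. {toy_documents[doc_id]}")
```

Applied to the cell above, `candidates` would be passed as `documents` to `vo.rerank`, and each `result.index` would be looked up in `candidate_ids`.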

## Multilingual Semantic Search

Run the same search in English, Spanish, and Chinese, matching each query against documents in its own language.

In [None]:
import voyageai
import numpy as np

# Initialize Voyage AI client
vo = voyageai.Client()

# English documents (mixed topics: technology, diet, literature)
english_docs = [
    "Apple announced record-breaking revenue in its latest quarterly earnings report.",
    "The Mediterranean diet emphasizes fish, olive oil, and vegetables.",
    "Microsoft is investing heavily in artificial intelligence and cloud computing.",
    "Shakespeare's plays continue to influence modern literature and theater."
]

# Spanish documents (same topics as the English set)
spanish_docs = [
    "Apple anunció ingresos récord en su último informe trimestral de ganancias.",
    "La dieta mediterránea enfatiza el pescado, el aceite de oliva y las verduras.",
    "Microsoft está invirtiendo fuertemente en inteligencia artificial y computación en la nube.",
    "Las obras de Shakespeare continúan influenciando la literatura y el teatro modernos."
]

# Chinese documents (same topics as the English set)
chinese_docs = [
    "苹果公司在最新季度财报中宣布创纪录的收入。",
    "地中海饮食强调鱼类、橄榄油和蔬菜。",
    "微软正在大力投资人工智能和云计算。",
    "莎士比亚的作品继续影响现代文学和戏剧。"
]

# Perform semantic search in English
english_query = "tech company earnings"

# Generate embeddings for English documents
english_embeddings = vo.embed(
    texts=english_docs,
    model="voyage-4-large",
    input_type="document"
).embeddings

# Generate embedding for English query
english_query_embedding = vo.embed(
    texts=[english_query],
    model="voyage-4-large",
    input_type="query"
).embeddings[0]

# Calculate similarity scores using dot product
english_similarities = np.dot(english_embeddings, english_query_embedding)

# Sort by similarity (highest to lowest)
english_ranked = np.argsort(-english_similarities)

print(f"English Query: '{english_query}'\n")
for rank, idx in enumerate(english_ranked[:2], 1):
    print(f"{rank}. {english_docs[idx]}")
    print(f"   Similarity: {english_similarities[idx]:.4f}\n")

# Perform semantic search in Spanish
spanish_query = "ganancias de empresas tecnológicas"

# Generate embeddings for Spanish documents
spanish_embeddings = vo.embed(
    texts=spanish_docs,
    model="voyage-4-large",
    input_type="document"
).embeddings

# Generate embedding for Spanish query
spanish_query_embedding = vo.embed(
    texts=[spanish_query],
    model="voyage-4-large",
    input_type="query"
).embeddings[0]

# Calculate similarity scores using dot product
spanish_similarities = np.dot(spanish_embeddings, spanish_query_embedding)

# Sort by similarity (highest to lowest)
spanish_ranked = np.argsort(-spanish_similarities)

print(f"Spanish Query: '{spanish_query}'\n")
for rank, idx in enumerate(spanish_ranked[:2], 1):
    print(f"{rank}. {spanish_docs[idx]}")
    print(f"   Similarity: {spanish_similarities[idx]:.4f}\n")

# Perform semantic search in Chinese
chinese_query = "科技公司收益"

# Generate embeddings for Chinese documents
chinese_embeddings = vo.embed(
    texts=chinese_docs,
    model="voyage-4-large",
    input_type="document"
).embeddings

# Generate embedding for Chinese query
chinese_query_embedding = vo.embed(
    texts=[chinese_query],
    model="voyage-4-large",
    input_type="query"
).embeddings[0]

# Calculate similarity scores using dot product
chinese_similarities = np.dot(chinese_embeddings, chinese_query_embedding)

# Sort by similarity (highest to lowest)
chinese_ranked = np.argsort(-chinese_similarities)

print(f"Chinese Query: '{chinese_query}'\n")
for rank, idx in enumerate(chinese_ranked[:2], 1):
    print(f"{rank}. {chinese_docs[idx]}")
    print(f"   Similarity: {chinese_similarities[idx]:.4f}\n")
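
The three searches above repeat the same embed, score, and sort steps. One way to factor that out is a helper that takes the embedding call as a parameter, sketched below. `semantic_search` and `embed_fn` are hypothetical names, and `fake_embed` is a stand-in that counts letters so the example runs without an API key; it carries no real semantics.

```python
import numpy as np


def semantic_search(embed_fn, docs, query, top_k=2):
    """Rank docs against query; embed_fn(texts, input_type) returns a list of vectors."""
    doc_vectors = np.asarray(embed_fn(docs, "document"), dtype=float)
    query_vector = np.asarray(embed_fn([query], "query")[0], dtype=float)
    scores = doc_vectors @ query_vector
    order = np.argsort(-scores)[:top_k]
    return [(docs[i], float(scores[i])) for i in order]


def fake_embed(texts, input_type):
    """Stand-in embedder: unit vectors built from counts of 'a' and 'e' plus a constant."""
    vectors = []
    for text in texts:
        vec = np.array([text.count("a"), text.count("e"), 1.0])
        vectors.append((vec / np.linalg.norm(vec)).tolist())
    return vectors


results = semantic_search(fake_embed, ["banana", "eel", "sky"], "papaya", top_k=2)
for rank, (doc, score) in enumerate(results, 1):
    print(f"{rank}. {doc} ({score:.4f})")
```

With the Voyage client, `embed_fn` could be a small wrapper such as `lambda texts, input_type: vo.embed(texts=texts, model="voyage-4-large", input_type=input_type).embeddings`, called once per language.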

## Multimodal Semantic Search

Search inputs that combine text and images, or images alone, using a text query.

> **Note:** Search for sample images and save them in your project directory. The following code example assumes you have images of a cat, dog, and banana.

In [None]:
import voyageai
import numpy as np
from PIL import Image

# Initialize Voyage AI client
vo = voyageai.Client()

# Prepare interleaved text + image inputs
interleaved_inputs = [
    ["An orange cat", Image.open('cat.jpg')],
    ["A golden retriever", Image.open('dog.jpg')],
    ["A banana", Image.open('banana.jpg')],
]

# Prepare image-only inputs
image_only_inputs = [
    [Image.open('cat.jpg')],
    [Image.open('dog.jpg')],
    [Image.open('banana.jpg')],
]

# Labels for display
labels = ["cat.jpg", "dog.jpg", "banana.jpg"]

# Search query
query = "a cute pet"

# Generate embeddings for interleaved text + image inputs
interleaved_embeddings = vo.multimodal_embed(
    inputs=interleaved_inputs,
    model="voyage-multimodal-3.5"
).embeddings

# Generate embedding for query
query_embedding = vo.multimodal_embed(
    inputs=[[query]],
    model="voyage-multimodal-3.5"
).embeddings[0]

# Calculate similarity scores using dot product
interleaved_similarities = np.dot(interleaved_embeddings, query_embedding)

# Sort by similarity (highest to lowest)
interleaved_ranked = np.argsort(-interleaved_similarities)

print(f"Query: '{query}'\n")
print("Search with interleaved text + image:")
for rank, idx in enumerate(interleaved_ranked, 1):
    print(f"{rank}. {interleaved_inputs[idx][0]}")
    print(f"   Similarity: {interleaved_similarities[idx]:.4f}\n")

# Generate embeddings for image-only inputs
image_only_embeddings = vo.multimodal_embed(
    inputs=image_only_inputs,
    model="voyage-multimodal-3.5"
).embeddings

# Calculate similarity scores using dot product
image_only_similarities = np.dot(image_only_embeddings, query_embedding)

# Sort by similarity (highest to lowest)
image_only_ranked = np.argsort(-image_only_similarities)

print("\nSearch with image-only:")
for rank, idx in enumerate(image_only_ranked, 1):
    print(f"{rank}. {labels[idx]}")
    print(f"   Similarity: {image_only_similarities[idx]:.4f}\n")

## Search with Contextualized Chunk Embeddings

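Generate an embedding for each chunk that also reflects the other chunks in the same document, so a chunk such as "The company's revenue increased" can still be matched to the right company.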

In [None]:
import voyageai
import numpy as np

# Initialize Voyage AI client
vo = voyageai.Client()

# Sample documents (each document is a list of chunks that share context)
documents = [
    [
        "This is the SEC filing on Greenery Corp.'s Q2 2024 performance.",
        "The company's revenue increased by 7% compared to the previous quarter."
    ],
    [
        "This is the SEC filing on Leafy Inc.'s Q2 2024 performance.",
        "The company's revenue increased by 15% compared to the previous quarter."
    ],
    [
        "This is the SEC filing on Elephant Ltd.'s Q2 2024 performance.",
        "The company's revenue decreased by 2% compared to the previous quarter."
    ]
]

# Search query
query = "What was the revenue growth for Leafy Inc. in Q2 2024?"

# Generate contextualized embeddings (preserves relationships between chunks)
contextualized_result = vo.contextualized_embed(
    inputs=documents,
    model="voyage-context-3",
    input_type="document"
)

# Flatten the embeddings and chunks for semantic search
contextualized_embeddings = []
all_chunks = []
chunk_to_doc = []  # Maps chunk index to document index

for doc_idx, result in enumerate(contextualized_result.results):
    for emb, chunk in zip(result.embeddings, documents[doc_idx]):
        contextualized_embeddings.append(emb)
        all_chunks.append(chunk)
        chunk_to_doc.append(doc_idx)

# Generate contextualized query embedding
query_embedding_ctx = vo.contextualized_embed(
    inputs=[[query]],
    model="voyage-context-3",
    input_type="query"
).results[0].embeddings[0]

# Calculate similarity scores using dot product
similarities_ctx = np.dot(contextualized_embeddings, query_embedding_ctx)

# Sort by similarity (highest to lowest)
ranked_indices_ctx = np.argsort(-similarities_ctx)

# Display top 3 results
print(f"Query: '{query}'\n")
for rank, idx in enumerate(ranked_indices_ctx[:3], 1):
    doc_idx = chunk_to_doc[idx]
    print(f"{rank}. {all_chunks[idx]}")
    print(f"   (From document: {documents[doc_idx][0]})")
    print(f"   Similarity: {similarities_ctx[idx]:.4f}\n")
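
The example above starts from documents that are already split into chunks. For raw text, the nested list-of-lists input has to be built first. The sketch below shows one simple strategy, splitting on blank lines; `split_into_chunks` is a hypothetical helper, the sample text is made up, and other chunking strategies may suit real documents better.

```python
def split_into_chunks(text):
    """Split text into paragraph chunks on blank lines, dropping empty pieces."""
    paragraphs = [p.strip() for p in text.split("\n\n")]
    # Collapse internal whitespace and skip paragraphs that were only whitespace
    return [" ".join(p.split()) for p in paragraphs if p]


# Made-up raw documents with inconsistent blank-line spacing
raw_documents = [
    "Filing for Company A.\n\nRevenue rose this quarter.",
    "Filing for Company B.\n\n\n\nRevenue fell this quarter.\n",
]

# One inner list of chunks per source document
nested_inputs = [split_into_chunks(doc) for doc in raw_documents]
print(nested_inputs)
```

The resulting `nested_inputs` has the same shape as `documents` in the cell above, so it could be passed as `inputs` to `vo.contextualized_embed`.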

## Semantic Search with Large Corpus

This example demonstrates semantic search using the [MTEB LegalBench Consumer Contracts QA](https://huggingface.co/datasets/mteb/legalbench_consumer_contracts_qa) benchmark dataset, which contains legal questions and contract clauses.

The dataset includes human-annotated relevance scores indicating which documents are relevant to each query. In this example, you use semantic similarity to find documents, then cross-reference your results against these ground truth labels to evaluate search quality.

In [None]:
import voyageai
import numpy as np
from datasets import load_dataset
from collections import defaultdict

# Initialize Voyage AI client
vo = voyageai.Client()

# Load legal benchmark dataset
corpus_ds = load_dataset("mteb/legalbench_consumer_contracts_qa", "corpus")["corpus"]
queries_ds = load_dataset("mteb/legalbench_consumer_contracts_qa", "queries")["queries"]
qrels_ds = load_dataset("mteb/legalbench_consumer_contracts_qa")["test"]

# Extract corpus and query data
corpus_ids = [row["_id"] for row in corpus_ds]
corpus_texts = [row["text"] for row in corpus_ds]
query_ids = [row["_id"] for row in queries_ds]
query_texts = [row["text"] for row in queries_ds]

# Build relevance mapping from dataset's ground truth (human-annotated) labels
qrels = defaultdict(set)
for row in qrels_ds:
    if row["score"] > 0:
        qrels[row["query-id"]].add(row["corpus-id"])

# Generate embeddings for the entire corpus
print(f"Generating embeddings for {len(corpus_texts)} documents...")
corpus_embeddings = vo.embed(
    texts=corpus_texts,
    model="voyage-4-large",
    input_type="document"
).embeddings

# Select a sample query (change query_idx to explore other queries)
# Query: 'Will Google come to a users assistance in the event of an alleged violation of such users IP rights?'
query_idx = 1
query = query_texts[query_idx]
query_id = query_ids[query_idx]

# Generate embedding for the query
query_embedding = vo.embed(
    texts=[query],
    model="voyage-4-large",
    input_type="query"
).embeddings[0]

# Calculate similarity scores using dot product
similarities = np.dot(corpus_embeddings, query_embedding)

# Sort by similarity (highest to lowest)
ranked_indices = np.argsort(-similarities)

# Display top 5 results
print(f"Query: {query}\n")
print("Top 5 Results:")
for rank, idx in enumerate(ranked_indices[:5], 1):
    doc_id = corpus_ids[idx]
    is_relevant = "✓" if doc_id in qrels[query_id] else "✗"
    print(f"{rank}. [{is_relevant}] Document ID: {doc_id}")
    print(f"   Similarity: {similarities[idx]:.4f}")
    print(f"   Text: {corpus_texts[idx][:100]}...\n")

# Show a ground truth relevant document (any document labeled relevant for this query)
relevant_ids = sorted(qrels[query_id])
if relevant_ids:
    relevant_id = relevant_ids[0]
    relevant_idx = corpus_ids.index(relevant_id)
    print("Ground truth relevant document:")
    print(f"Document ID: {relevant_id}")
    print(f"Rank in results: {np.where(ranked_indices == relevant_idx)[0][0] + 1}")
    print(f"Similarity: {similarities[relevant_idx]:.4f}")
else:
    print("No ground truth relevant documents for this query.")
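
The cell above checks a single query. To summarize quality across many queries, the ranked results can be scored against the relevance labels with standard retrieval metrics. The sketch below computes recall@k and mean reciprocal rank on made-up IDs; `recall_at_k` and `reciprocal_rank` are hypothetical helper names, and the toy dictionaries only mimic the shape of the `qrels` mapping built above.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, 1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


# Toy data: query ID -> relevant corpus IDs, and query ID -> ranked corpus IDs
toy_qrels = {"q1": {"d3"}, "q2": {"d1", "d4"}}
toy_rankings = {"q1": ["d2", "d3", "d1", "d4"], "q2": ["d1", "d2", "d3", "d4"]}

mean_recall = sum(
    recall_at_k(toy_rankings[q], toy_qrels[q], k=2) for q in toy_qrels
) / len(toy_qrels)
mrr = sum(
    reciprocal_rank(toy_rankings[q], toy_qrels[q]) for q in toy_qrels
) / len(toy_qrels)

print(f"Mean recall@2: {mean_recall:.2f}")
print(f"MRR: {mrr:.2f}")
```

For the dataset above, the ranked IDs for a query would come from `[corpus_ids[i] for i in ranked_indices]`, repeated for each entry in `query_ids`.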