# LangChain Document Indexing Comprehensive Guide

## **Introduction**

The **LangChain Indexing API** streamlines loading documents into a vector store and keeping them in sync with their source. It avoids three common sources of waste: writing duplicate content, re-writing unchanged content, and re-computing embeddings for unchanged content. A **Record Manager** tracks document writes so that only new or updated content is processed, which saves both time and compute. This works even when documents have gone through several transformation steps (e.g., text chunking) since they were loaded.
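The idea of tracking writes can be sketched with a content fingerprint. The snippet below is a minimal illustration using only the standard library, not LangChain's actual hashing scheme or storage. The names `content_key`, `needs_write`, and `seen_keys` are hypothetical, and the in-memory set merely stands in for the Record Manager.

```python
import hashlib
import json


def content_key(page_content: str, metadata: dict) -> str:
    """Return a stable fingerprint for a document's content and metadata."""
    payload = json.dumps(
        {"page_content": page_content, "metadata": metadata},
        sort_keys=True,  # key order in metadata must not change the fingerprint
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Hypothetical stand-in for the record manager's bookkeeping.
seen_keys = set()


def needs_write(page_content: str, metadata: dict) -> bool:
    """Return True the first time a fingerprint is seen, False afterwards."""
    key = content_key(page_content, metadata)
    if key in seen_keys:
        return False
    seen_keys.add(key)
    return True


print(needs_write("kitty", {"source": "kitty.txt"}))     # True: new content
print(needs_write("kitty", {"source": "kitty.txt"}))     # False: unchanged
print(needs_write("kitty v2", {"source": "kitty.txt"}))  # True: content changed
```

Because the fingerprint depends only on content and metadata, an unchanged document can be recognized without embedding it again.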

The API supports multiple **deletion modes**, allowing users to choose how existing documents in the vector store are handled during indexing. Whether you need to avoid automatic cleanup, continuously clean up old versions, or perform a full refresh of the vector store, the LangChain Indexing API provides the flexibility to meet your needs. This guide explores the key features of the API, its deletion modes, and how to use it effectively with compatible vector stores.

### **Comparison of Deletion Modes**

| **Feature**                        | **None**                  | **Incremental**           | **Full**                   | **Scoped_Full**             |
|------------------------------------|---------------------------|---------------------------|----------------------------|-----------------------------|
| **De-Duplicates Content**          | ✅                        | ✅                        | ✅                         | ✅                          |
| **Parallelizable**                 | ✅                        | ✅                        | ❌                         | ✅                          |
| **Cleans Up Deleted Source Docs**  | ❌                        | ❌                        | ✅                         | ❌                          |
| **Cleans Up Mutations of Source/Derived Docs** | ❌                        | ✅                        | ✅                         | ✅                          |
| **Clean Up Timing**                | -                         | Continuously              | At end of indexing         | At end of indexing          |
| **Best Use Case**                  | Manual control over deletions; no automatic cleanup. | Frequent updates with minimal overlap between old and new versions. | Complete dataset refresh or handling deletions of source documents. | Partial dataset refresh with parallel processing. |

### **How to Use This Table**
- **De-Duplicates Content**: Check if the mode avoids re-indexing duplicate content.
- **Parallelizable**: Determine if the mode supports parallel processing for faster indexing.
- **Cleans Up Deleted Source Docs**: See if the mode automatically removes documents that are no longer in the input.
- **Cleans Up Mutations**: Check if the mode handles updates to source or derived documents.
- **Clean Up Timing**: Understand when the cleanup occurs (continuously or at the end of indexing).
- **Best Use Case**: Match the mode to your specific workflow requirements.
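
As a rough aid for reading the table, its rows can be encoded as data and filtered by the capabilities a workflow needs. This is only a sketch: the `MODES` dictionary and the `candidate_modes` helper are hypothetical names, and the capability flags are copied from the table rather than queried from LangChain.

```python
# Capability flags copied from the comparison table.
# Keys are the values passed as `cleanup=` in the examples of this guide.
MODES = {
    None: {"dedup": True, "parallel": True,
           "deleted_sources": False, "mutations": False},
    "incremental": {"dedup": True, "parallel": True,
                    "deleted_sources": False, "mutations": True},
    "full": {"dedup": True, "parallel": False,
             "deleted_sources": True, "mutations": True},
    "scoped_full": {"dedup": True, "parallel": True,
                    "deleted_sources": False, "mutations": True},
}


def candidate_modes(*required: str) -> list:
    """Return the modes that provide every capability named in `required`."""
    return [
        mode
        for mode, capabilities in MODES.items()
        if all(capabilities[name] for name in required)
    ]


print(candidate_modes("deleted_sources"))        # ['full']
print(candidate_modes("mutations", "parallel"))  # ['incremental', 'scoped_full']
```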

---

## Preparation

### Installing Required Libraries
This section installs the Python libraries needed to work with LangChain, OpenAI embeddings, and the Chroma vector store:
- `langchain-openai`: Provides integration with OpenAI's embedding models.
- `langchain_community`: Contains community-contributed modules and tools for LangChain.
- `langchain_experimental`: Includes experimental features and utilities for LangChain.
- `langchain-chroma`: Enables integration with the Chroma vector database.
- `chromadb`: The core library for the Chroma vector database.

In [1]:
!pip install -qU langchain-openai
!pip install -qU langchain_community
!pip install -qU langchain_experimental
!pip install -qU "langchain-chroma>=0.1.2"
!pip install -qU chromadb

### Initializing OpenAI Embeddings
This section demonstrates how to securely fetch an OpenAI API key using Kaggle's `UserSecretsClient` and initialize the OpenAI embedding model. The `OpenAIEmbeddings` class is used to create an embedding model instance, which will be used to convert text into numerical embeddings.

Key steps:
1. **Fetch API Key**: The OpenAI API key is securely retrieved using Kaggle's `UserSecretsClient`.
2. **Initialize Embeddings**: The `OpenAIEmbeddings` class is initialized with the `text-embedding-3-small` model and the fetched API key.

This setup ensures that the embedding model is ready for use in downstream tasks, such as caching embeddings or creating vector stores.

In [2]:
from langchain_openai import OpenAIEmbeddings
from kaggle_secrets import UserSecretsClient

# Fetch API key securely
user_secrets = UserSecretsClient()
my_api_key = user_secrets.get_secret("api-key-openai")

# Initialize OpenAI embeddings
embed = OpenAIEmbeddings(model="text-embedding-3-small", api_key=my_api_key)
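
Outside Kaggle, `kaggle_secrets` is not importable, so the cell above fails. One possible workaround, sketched here under the assumption that the key is exported as the `OPENAI_API_KEY` environment variable, is to check the environment first and fall back to Kaggle secrets. The helper name `resolve_api_key` is hypothetical.

```python
import os


def resolve_api_key(env_var: str = "OPENAI_API_KEY",
                    secret_name: str = "api-key-openai") -> str:
    """Return the API key from the environment, else from Kaggle secrets."""
    key = os.environ.get(env_var)
    if key:
        return key
    try:
        # Only importable inside a Kaggle notebook session.
        from kaggle_secrets import UserSecretsClient
    except ImportError as exc:
        raise RuntimeError(
            f"Set {env_var} or run inside Kaggle with the "
            f"'{secret_name}' secret configured."
        ) from exc
    return UserSecretsClient().get_secret(secret_name)
```

The result can then be passed as `api_key=resolve_api_key()` when constructing `OpenAIEmbeddings`.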

---

## 1. **None Deletion Mode**
This mode **does not perform automatic cleanup** of old content. It ensures that duplicate content is not re-indexed, but it leaves existing documents untouched unless explicitly removed.

### **Key Features of None Deletion Mode**
1. **No Automatic Cleanup**: Existing documents in the vector store are **not deleted**, even if they are no longer part of the input.
2. **De-Duplication**: Ensures that duplicate content is **not re-indexed**, saving time and resources.
3. **Manual Control**: You retain full control over document deletions, making it suitable for scenarios where you want to manage deletions explicitly.

### **When to Use None Deletion Mode**
- When you want to **avoid automatic deletions** of existing documents.
- When you need to **manually manage** the lifecycle of documents in the vector store.
- When you want to ensure **no unintended data loss** during indexing.

In [3]:
from langchain.indexes import SQLRecordManager, index
from langchain_core.documents import Document
from langchain_chroma import Chroma

# Initialize Chroma Vectorstore with OpenAI embeddings
collection_name = "test_index"
vectorstore = Chroma(collection_name=collection_name, embedding_function=embed)

# Initialize a record manager to track document writes
namespace = f"chroma/{collection_name}"
record_manager = SQLRecordManager(namespace, db_url="sqlite:///record_manager_cache.sql")
record_manager.create_schema()

# Define test documents
doc1 = Document(page_content="kitty", metadata={"source": "kitty.txt"})
doc2 = Document(page_content="doggy", metadata={"source": "doggy.txt"})

# Helper function to clear content (used for setup)
def _clear():
    index([], record_manager, vectorstore, cleanup="full", source_id_key="source")

# Clear the vector store and record manager (setup for a clean state)
_clear()

In [4]:
# Index documents with None deletion mode
# Feature: No automatic cleanup of old content
# Explanation: Only one unique document is added, even though `doc1` is provided multiple times.
index([doc1, doc1, doc1, doc1, doc1], record_manager, vectorstore, cleanup=None, source_id_key="source")

{'num_added': 1, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}

In [5]:
# Index new documents
# Explanation: `doc1` is skipped (already indexed), and `doc2` is added.
index([doc1, doc2], record_manager, vectorstore, cleanup=None, source_id_key="source")

{'num_added': 1, 'num_updated': 0, 'num_skipped': 1, 'num_deleted': 0}

In [6]:
# Second run skips all content
# Explanation: Both documents are already indexed, so nothing is added or updated.
index([doc1, doc2], record_manager, vectorstore, cleanup=None, source_id_key="source")

{'num_added': 0, 'num_updated': 0, 'num_skipped': 2, 'num_deleted': 0}
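
The counts in the three runs above can be reproduced with a small toy model. It is a simplification for intuition only, not LangChain's implementation: documents are plain `(content, source)` tuples, the tuple itself stands in for the content hash, and the `known` set stands in for the Record Manager. The function name `toy_index_no_cleanup` is hypothetical.

```python
def toy_index_no_cleanup(docs, known):
    """Mimic cleanup=None: add unseen documents, skip known ones, delete nothing.

    docs is a list of (content, source) pairs. known is a set of such pairs
    that persists between calls.
    """
    # Collapse duplicates inside the input before comparing with known docs.
    batch = list(dict.fromkeys(docs))
    added = [doc for doc in batch if doc not in known]
    known.update(added)
    return {
        "num_added": len(added),
        "num_updated": 0,
        "num_skipped": len(batch) - len(added),
        "num_deleted": 0,
    }


known = set()
kitty = ("kitty", "kitty.txt")
doggy = ("doggy", "doggy.txt")

print(toy_index_no_cleanup([kitty] * 5, known))
# {'num_added': 1, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}
print(toy_index_no_cleanup([kitty, doggy], known))
# {'num_added': 1, 'num_updated': 0, 'num_skipped': 1, 'num_deleted': 0}
print(toy_index_no_cleanup([kitty, doggy], known))
# {'num_added': 0, 'num_updated': 0, 'num_skipped': 2, 'num_deleted': 0}
```

Note that duplicates within a single input are collapsed before counting, which is why the first run reports one added document and zero skipped.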

---

## 2. **Incremental Deletion Mode**
This mode **continuously cleans up old versions of content** as new versions are indexed. It ensures that the vector store stays up-to-date by removing outdated documents while minimizing the time window during which both old and new versions coexist.

### **Key Features of Incremental Deletion Mode**
1. **Continuous Cleanup**: Old versions of documents are **automatically deleted** as new versions are indexed, ensuring the vector store reflects the latest content.
2. **Efficient Updates**: Only **changed or new documents** are processed, avoiding unnecessary re-indexing of unchanged content.
3. **Minimized Overlap**: The time window during which both old and new versions coexist is minimized, reducing the risk of returning outdated results.

### **When to Use Incremental Deletion Mode**
- When you want to **keep the vector store up-to-date** with the latest versions of your documents.
- When you need to **efficiently handle frequent updates** to your source documents.
- When you want to **avoid a full rebuild** of the vector store while ensuring consistency.

### **Example Workflow**
1. **Initial Indexing**: Add documents to the vector store.
2. **Subsequent Updates**: Update or add new documents. Outdated versions are automatically cleaned up.
3. **No Changes**: If no changes are detected, every document is skipped, saving time and resources.

In [7]:
from langchain.indexes import SQLRecordManager, index
from langchain_core.documents import Document
from langchain_chroma import Chroma

# Initialize Chroma Vectorstore with OpenAI embeddings
collection_name = "test_index"
vectorstore = Chroma(collection_name=collection_name, embedding_function=embed)

# Initialize a record manager to track document writes
namespace = f"chroma/{collection_name}"
record_manager = SQLRecordManager(namespace, db_url="sqlite:///record_manager_cache.sql")
record_manager.create_schema()

# Define test documents
doc1 = Document(page_content="kitty", metadata={"source": "kitty.txt"})
doc2 = Document(page_content="doggy", metadata={"source": "doggy.txt"})

# Helper function to clear content (used for setup)
def _clear():
    index([], record_manager, vectorstore, cleanup="full", source_id_key="source")

# Clear the vector store and record manager (setup for a clean state)
_clear()

In [8]:
# Index documents with Incremental deletion mode
# Feature: Automatically cleans up old versions of content
# Explanation: Both documents are added to the vector store.
index([doc1, doc2], record_manager, vectorstore, cleanup="incremental", source_id_key="source")

{'num_added': 2, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}

In [9]:
# Second run skips both documents (no changes)
# Explanation: No changes are detected, so both documents are skipped.
index([doc1, doc2], record_manager, vectorstore, cleanup="incremental", source_id_key="source")

{'num_added': 0, 'num_updated': 0, 'num_skipped': 2, 'num_deleted': 0}

In [10]:
# No changes if no documents are provided
# Explanation: Incremental cleanup only considers sources present in the input, so an empty input adds and deletes nothing.
index([], record_manager, vectorstore, cleanup="incremental", source_id_key="source")

{'num_added': 0, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}

In [11]:
# Mutate a document and index the new version
# Explanation: The new version of `doc2` is added, and the old version is deleted.
changed_doc_2 = Document(page_content="puppy", metadata={"source": "doggy.txt"})
index([changed_doc_2], record_manager, vectorstore, cleanup="incremental", source_id_key="source")

{'num_added': 1, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 1}
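
The cleanup rule behind these results can be sketched as a toy model: after writing new documents, delete stored documents whose source appears in the input but whose content does not. This is an illustration under simplifying assumptions, not LangChain's implementation. The whole input is treated as one batch, documents are `(content, source)` tuples, and the name `toy_index_incremental` is hypothetical.

```python
def toy_index_incremental(docs, store):
    """Mimic cleanup="incremental" for a single batch.

    docs is a list of (content, source) pairs. store is a set of such pairs
    that persists between calls and stands in for the vector store plus
    record manager.
    """
    batch = set(docs)
    added = batch - store
    store.update(added)
    # Only sources that appear in this input are eligible for cleanup.
    batch_sources = {source for _, source in batch}
    stale = {
        doc for doc in store
        if doc[1] in batch_sources and doc not in batch
    }
    store.difference_update(stale)
    return {
        "num_added": len(added),
        "num_updated": 0,
        "num_skipped": len(batch) - len(added),
        "num_deleted": len(stale),
    }


store = set()
print(toy_index_incremental([("kitty", "kitty.txt"), ("doggy", "doggy.txt")], store))
# {'num_added': 2, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}
print(toy_index_incremental([], store))
# {'num_added': 0, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}
print(toy_index_incremental([("puppy", "doggy.txt")], store))
# {'num_added': 1, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 1}
print(sorted(store))
# [('kitty', 'kitty.txt'), ('puppy', 'doggy.txt')]
```

The `kitty.txt` document survives the last run because its source was not part of the input.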

---

## 3. **Full Deletion Mode**
This mode treats the documents passed to `index` as the complete dataset: **only the documents provided in the current call** are retained in the vector store, and any existing documents not included are **automatically deleted**. You should therefore pass every document that should remain indexed, not just the ones that changed. This is particularly useful for handling **deletions of source documents** or performing **complete dataset refreshes**.

### **Key Features of Full Deletion Mode**
1. **Complete Control**: Ensures that the vector store contains **only the documents explicitly provided** in the current batch.
2. **Handles Deletions**: Automatically deletes documents that are no longer part of the input, making it ideal for **removing outdated or deleted source documents**.
3. **Dataset Refresh**: Suitable for scenarios where you need to perform a **complete refresh** of the vector store with a new set of documents.

### **When to Use Full Deletion Mode**
- When you want to **ensure the vector store matches exactly** the documents you provide in the current batch.
- When you need to **handle deletions of source documents** (e.g., files or records that no longer exist).
- When performing a **full dataset refresh** or replacing the entire content of the vector store.

### **Example Workflow**
1. **Initial Indexing**: Add a set of documents to the vector store.
2. **Subsequent Updates**: Provide a new batch of documents. Any documents not included in the batch are **automatically deleted**.
3. **Handling Deletions**: If a source document is removed from the input, it is also removed from the vector store.

In [12]:
from langchain.indexes import SQLRecordManager, index
from langchain_core.documents import Document
from langchain_chroma import Chroma

# Initialize Chroma Vectorstore with OpenAI embeddings
collection_name = "test_index"
vectorstore = Chroma(collection_name=collection_name, embedding_function=embed)

# Initialize a record manager to track document writes
namespace = f"chroma/{collection_name}"
record_manager = SQLRecordManager(namespace, db_url="sqlite:///record_manager_cache.sql")
record_manager.create_schema()

# Define test documents
doc1 = Document(page_content="kitty", metadata={"source": "kitty.txt"})
doc2 = Document(page_content="doggy", metadata={"source": "doggy.txt"})

# Helper function to clear content (used for setup)
def _clear():
    index([], record_manager, vectorstore, cleanup="full", source_id_key="source")

# Clear the vector store and record manager (setup for a clean state)
_clear()

In [13]:
# Index documents with Full deletion mode
# Feature: Only the provided documents are retained; others are deleted
# Explanation: Both documents are added to the vector store.
all_docs = [doc1, doc2]
index(all_docs, record_manager, vectorstore, cleanup="full", source_id_key="source")

{'num_added': 2, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}

In [14]:
# Simulate deletion of the first document (e.g., "kitty.txt" is no longer needed)
del all_docs[0]  # Remove `doc1` from the batch
# Explanation: `doc1` is deleted from the vector store because it is no longer in the batch.
index(all_docs, record_manager, vectorstore, cleanup="full", source_id_key="source")

{'num_added': 0, 'num_updated': 0, 'num_skipped': 1, 'num_deleted': 1}
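
In set terms, Full mode amounts to a difference between what is stored and what was just provided. The toy model below illustrates this under simplifying assumptions and is not LangChain's implementation: documents are `(content, source)` tuples and the name `toy_index_full` is hypothetical.

```python
def toy_index_full(docs, store):
    """Mimic cleanup="full": afterwards the store equals the input exactly.

    docs is a list of (content, source) pairs. store is a set of such pairs
    that persists between calls.
    """
    batch = set(docs)
    added = batch - store   # provided but not yet stored
    stale = store - batch   # stored but no longer provided
    store.update(added)
    store.difference_update(stale)
    return {
        "num_added": len(added),
        "num_updated": 0,
        "num_skipped": len(batch) - len(added),
        "num_deleted": len(stale),
    }


store = set()
kitty = ("kitty", "kitty.txt")
doggy = ("doggy", "doggy.txt")

print(toy_index_full([kitty, doggy], store))
# {'num_added': 2, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}
print(toy_index_full([doggy], store))
# {'num_added': 0, 'num_updated': 0, 'num_skipped': 1, 'num_deleted': 1}
print(toy_index_full([], store))
# {'num_added': 0, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 1}
```

The last call shows why the `_clear()` helper in this guide works: indexing an empty list in Full mode marks everything stored as stale.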

---

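## 4. **Scoped_Full Deletion Mode**
This mode behaves like Full mode, but limits cleanup to the sources that appear in the current indexing run. Old versions of documents from those sources are deleted at the end of indexing, while documents from sources not seen in the run are left untouched.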

### **Key Features of Scoped_Full Deletion Mode**
1. **Partial Dataset Refresh**: Only the documents within the specified scope (e.g., `kitty.txt` and `doggy.txt`) are updated and cleaned up.
2. **Parallel Processing**: Supports parallel execution, making it efficient for large datasets.
3. **Cleanup at End of Indexing**: Old versions of the specified documents are deleted after the new versions are indexed.
4. **No Cleanup of Unrelated Documents**: Documents outside the scope (e.g., `birdie.txt`) remain untouched.

### **When to Use Scoped_Full Deletion Mode**
- When you need to **update a subset of documents** in the vector store without affecting the rest.
- When you want to **leverage parallel processing** for faster indexing.
- When you need to **clean up old versions** of specific documents after updating them.

### **Workflow Summary**
1. **Initial Setup**: Populate the vector store with an initial set of documents using **Full Deletion Mode**.
2. **Partial Update**: Use **Scoped_Full Deletion Mode** to update and clean up only the specified documents.
3. **Verification**: Check the vector store to ensure the updates were applied correctly and unrelated documents remain unchanged.

In [15]:
from langchain.indexes import SQLRecordManager, index
from langchain_core.documents import Document
from langchain_chroma import Chroma

# Initialize Chroma Vectorstore with OpenAI embeddings
collection_name = "test_index"
vectorstore = Chroma(collection_name=collection_name, embedding_function=embed)

# Initialize a record manager to track document writes
namespace = f"chroma/{collection_name}"
record_manager = SQLRecordManager(namespace, db_url="sqlite:///record_manager_cache.sql")
record_manager.create_schema()

# Define test documents
doc1 = Document(page_content="kitty", metadata={"source": "kitty.txt"})
doc2 = Document(page_content="doggy", metadata={"source": "doggy.txt"})
doc3 = Document(page_content="birdie", metadata={"source": "birdie.txt"})

# Helper function to clear content (used for setup)
def _clear():
    index([], record_manager, vectorstore, cleanup="full", source_id_key="source")

# Clear the vector store and record manager (setup for a clean state)
_clear()

In [16]:
# Initial indexing with Full Deletion Mode to populate the vector store
all_docs = [doc1, doc2, doc3]
index(all_docs, record_manager, vectorstore, cleanup="full", source_id_key="source")

{'num_added': 3, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}

In [17]:
# Simulate a partial update: Only update documents related to "kitty.txt" and "doggy.txt"
updated_doc1 = Document(page_content="kitty v2", metadata={"source": "kitty.txt"})
updated_doc2 = Document(page_content="doggy v2", metadata={"source": "doggy.txt"})

# Use Scoped_Full Deletion Mode to update only the specified documents
# - The old versions of `doc1` and `doc2` are deleted.
# - The new versions of `doc1` and `doc2` are added.
# - `doc3` remains unchanged in the vector store.
index([updated_doc1, updated_doc2], record_manager, vectorstore, cleanup="scoped_full", source_id_key="source")

{'num_added': 2, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 2}

In [18]:
# Verify the updated vector store
results = vectorstore.similarity_search("kitty", k=5)
for result in results:
    print(result.page_content)

kitty v2
doggy v2
birdie
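
The difference between Full and Scoped_Full cleanup can be made concrete with a toy comparison on the same data. This is a sketch for intuition, not LangChain's implementation: documents are `(content, source)` tuples and the helper name `stale_after_run` is hypothetical.

```python
def stale_after_run(store, batch, scoped):
    """Return the stored documents an end-of-run cleanup would delete.

    store and batch hold (content, source) pairs. With scoped=False the rule
    mimics cleanup="full"; with scoped=True it mimics cleanup="scoped_full".
    """
    batch = set(batch)
    unseen = set(store) - batch
    if not scoped:
        # Full: everything missing from the input is deleted.
        return unseen
    # Scoped_Full: only sources that appeared in the input are cleaned up.
    seen_sources = {source for _, source in batch}
    return {doc for doc in unseen if doc[1] in seen_sources}


store = {
    ("kitty", "kitty.txt"),
    ("doggy", "doggy.txt"),
    ("birdie", "birdie.txt"),
}
update = [("kitty v2", "kitty.txt"), ("doggy v2", "doggy.txt")]

print(sorted(stale_after_run(store, update, scoped=True)))
# [('doggy', 'doggy.txt'), ('kitty', 'kitty.txt')]
print(sorted(stale_after_run(store, update, scoped=False)))
# [('birdie', 'birdie.txt'), ('doggy', 'doggy.txt'), ('kitty', 'kitty.txt')]
```

With the scoped rule, `birdie.txt` is kept because its source never appeared in the update, which matches the verification output above.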


## **Conclusion**

The **LangChain Indexing API** is a versatile and efficient solution for managing documents in vector stores. By automating tasks like de-duplication, avoiding unnecessary re-computations, and handling document deletions, it significantly reduces the complexity and cost of maintaining up-to-date vector stores. The availability of multiple **deletion modes** ensures that you can tailor the indexing process to your specific needs, whether you require manual control, continuous updates, or a complete dataset refresh.

When choosing a deletion mode, consider the following:
- Use **None Deletion Mode** when you want full control over deletions and no automatic cleanup.
- Use **Incremental Deletion Mode** for frequent updates with minimal overlap between old and new versions.
- Use **Full Deletion Mode** for complete dataset refreshes or handling deletions of source documents.
- Use **Scoped_Full Deletion Mode** for partial dataset refreshes with parallel processing.

By understanding the strengths and use cases of each mode, you can optimize your indexing workflow and ensure that your vector store remains accurate, efficient, and up-to-date. The LangChain Indexing API, combined with its compatibility with a wide range of vector stores, makes it an essential tool for any application involving document indexing and retrieval.