# Indexing Basics

Here, we will look at a basic indexing workflow using the LangChain indexing API.

The main features of the API are to help:

* Avoid writing duplicated content into the vectorstore
* Avoid over-writing content if it's unchanged

The indexing API works even with documents that have gone through several 
transformation steps (e.g., via text chunking) with respect to the original source document.

## How

LangChain indexing makes use of a persistence layer (`RecordManager`) that keeps track of document writes into the vectorstore.

When indexing content, hashes are computed for each document, and the following information is stored in the record manager: 

- the document hash (hashed content of both page content and metadata)
- write time
- the source id: each document should include information in its metadata that allows us to determine its ultimate source
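
The three fields above can be illustrated with a small standalone sketch. It does not use LangChain: `Doc`, `doc_hash`, and `make_record` are hypothetical names, and the JSON-plus-SHA-256 scheme is an assumption chosen for illustration, not necessarily how the library computes its hashes.

```python
import hashlib
import json
from dataclasses import dataclass, field
from time import time


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def doc_hash(doc: Doc) -> str:
    """Hash page content together with metadata (illustrative scheme only)."""
    # sort_keys makes the hash independent of metadata key order.
    payload = json.dumps(
        {"page_content": doc.page_content, "metadata": doc.metadata},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def make_record(doc: Doc, source_id_key: str = "source") -> dict:
    """Build the three fields described above for one document."""
    return {
        "hash": doc_hash(doc),
        "written_at": time(),
        "source_id": doc.metadata[source_id_key],
    }


a = Doc("hello said the little kitty", {"source": "text1.txt"})
b = Doc("hello said the little kitty", {"source": "text1.txt"})
c = Doc("hello said the little kitty", {"source": "text2.txt"})

print(doc_hash(a) == doc_hash(b))  # identical content and metadata
print(doc_hash(a) == doc_hash(c))  # same text, different metadata
print(make_record(a)["source_id"])
```

Because metadata is part of the hash in this sketch, two chunks with the same text but different sources are treated as different documents.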

## Deletion Modes

Indexing has 3 deletion modes:

| Delete Mode | De-Duplicates Content | Parallelizable | Handles Deletion of Source Docs | Handles Mutations of Source Docs and/or Derived Docs | Clean Up Timing   |
|-------------|-----------------------|---------------|----------------------------------|----------------------------------------------------|---------------------|
| None        | ✅                    | ✅            | ❌                               | ❌                                                 | -                  |
| Incremental | ✅                    | ✅            | ❌                               | ✅                                                 | Continuously       |
| Full        | ✅                    | ❌            | ✅                               | ✅                                                 | At end of indexing |


* If the source document has been deleted, only the `full` delete mode will be able to delete it from the vectorstore correctly.
* If the source document has been mutated or documents derived from it have been changed (e.g., by changing chunking parameters), either `incremental` or `full` modes will be able to clean up the previous versions of the content.


When content is mutated (e.g., the source PDF file was revised) there will be a period of time during indexing when both the new and old versions may be returned to the user. This happens after the new content was written, but before the old version was deleted.

* `incremental` indexing minimizes this period of time, as it is able to do clean up continuously.
* `full` mode does the clean up after all batches have been written.
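
The difference between the modes can be sketched with a toy, in-memory model. This is not the LangChain implementation: `ToyIndex` is a hypothetical class, documents are plain `(source_id, text)` tuples, and it assumes that all documents from one source arrive consecutively so that each source can be treated as one batch.

```python
import hashlib
from itertools import groupby
from typing import Dict, Iterable, Optional, Tuple

ToyDoc = Tuple[str, str]  # (source_id, text)


class ToyIndex:
    """Toy stand-in for a record manager plus vectorstore."""

    def __init__(self) -> None:
        # content hash -> (source_id, number of the run that last saw it)
        self.records: Dict[str, Tuple[str, int]] = {}
        self.run = 0

    def _delete_stale(self, source_id: Optional[str]) -> int:
        """Delete records not seen in the current run (optionally one source)."""
        stale = [
            key
            for key, (src, last_run) in self.records.items()
            if last_run < self.run and (source_id is None or src == source_id)
        ]
        for key in stale:
            del self.records[key]
        return len(stale)

    def index(
        self, docs: Iterable[ToyDoc], delete_mode: Optional[str] = None
    ) -> Dict[str, int]:
        self.run += 1
        stats = {"num_added": 0, "num_skipped": 0, "num_deleted": 0}
        for source_id, group in groupby(docs, key=lambda doc: doc[0]):
            for _, text in group:
                raw = f"{source_id}\n{text}".encode("utf-8")
                key = hashlib.sha256(raw).hexdigest()
                stats["num_skipped" if key in self.records else "num_added"] += 1
                self.records[key] = (source_id, self.run)
            if delete_mode == "incremental":
                # Clean up old versions of this source as soon as it is written.
                stats["num_deleted"] += self._delete_stale(source_id)
        if delete_mode == "full":
            # Clean up everything not seen in this run, after all writes.
            stats["num_deleted"] += self._delete_stale(None)
        return stats


store = ToyIndex()
store.index([("a.txt", "hello"), ("b.txt", "bye")], "incremental")

# a.txt is mutated: incremental mode removes its old version.
print(store.index([("a.txt", "hello again"), ("b.txt", "bye")], "incremental"))

# b.txt disappears from the input: only full mode removes it.
remaining = [("a.txt", "hello again")]
print(store.index(remaining, "incremental"))
print(store.index(remaining, "full"))
```

In this model, incremental cleanup only ever looks at sources that appear in the current input, which is why a deleted source goes unnoticed until a full run.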

## Vectorstore Requirements

This code only works with LangChain Vectorstores that support:

* document addition by id (`add_documents` method with `ids` argument)
* deletion by id (`delete` method with `ids` argument)
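
As a rough illustration of these two requirements, the sketch below defines a hypothetical `IndexableStore` protocol and a dictionary-backed `InMemoryStore` that satisfies it. Neither name comes from LangChain, and the exact signatures are assumptions; only the method names `add_documents` and `delete` are taken from the list above.

```python
from typing import Any, Dict, List, Optional, Protocol, runtime_checkable


@runtime_checkable
class IndexableStore(Protocol):
    """Hypothetical protocol naming the two methods listed above."""

    def add_documents(self, documents: List[Any], *, ids: List[str]) -> List[str]:
        ...

    def delete(self, ids: Optional[List[str]] = None) -> Any:
        ...


class InMemoryStore:
    """Dictionary-backed store used only to illustrate the protocol."""

    def __init__(self) -> None:
        self.docs: Dict[str, Any] = {}

    def add_documents(self, documents: List[Any], *, ids: List[str]) -> List[str]:
        # Writing under an existing id overwrites it, so re-adding is idempotent.
        for doc_id, doc in zip(ids, documents):
            self.docs[doc_id] = doc
        return list(ids)

    def delete(self, ids: Optional[List[str]] = None) -> None:
        for doc_id in ids or []:
            self.docs.pop(doc_id, None)


store = InMemoryStore()
# runtime_checkable only verifies that the methods exist, not their signatures.
print(isinstance(store, IndexableStore))

store.add_documents(["first version"], ids=["doc-1"])
store.add_documents(["second version"], ids=["doc-1"])
print(store.docs)
```

Addressing documents by a caller-supplied id is what lets an indexer overwrite or remove exactly the entries it previously wrote.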

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from __future__ import annotations
import langchain

In [3]:
from typing import Iterator, List, Optional, Sequence

import lancedb
from langchain.document_loaders import DocumentPipeline, TextLoader
from langchain.document_loaders.base import BaseLoader
from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers.txt import TextParser
from langchain.embeddings import CacheBackedEmbeddings, OpenAIEmbeddings
from langchain.indexes import SQLRecordManager, index
from langchain.schema import BaseDocumentTransformer, Document
from langchain.storage import LocalFileStore
from langchain.text_splitter import CharacterTextSplitter, TextSplitter
from langchain.vectorstores import Chroma, LanceDB

In [4]:
from langchain.vectorstores import Chroma

embedder = OpenAIEmbeddings()
vectorstore = Chroma(embedding_function=embedder)

RuntimeError: Your system has an unsupported version of sqlite3. Chroma requires sqlite3 >= 3.35.0. Please visit https://docs.trychroma.com/troubleshooting#sqlite to learn how to upgrade.

**Suggestion:** Use a namespace that takes into account both the vectorstore and the collection name in the vectorstore, e.g., 'chromadb/my_docs' or 'postgres/my_docs'.
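
One way to follow this suggestion consistently is a small helper that builds the namespace string. `make_namespace` is a hypothetical function written for this note, not part of LangChain.

```python
def make_namespace(store_name: str, collection_name: str) -> str:
    """Join a vectorstore name and a collection name into one namespace."""
    parts = [store_name.strip("/"), collection_name.strip("/")]
    if not all(parts):
        raise ValueError("both store_name and collection_name are required")
    return "/".join(parts)


print(make_namespace("chromadb", "my_docs"))
print(make_namespace("postgres", "my_docs"))
```

Keeping the vectorstore name in the namespace prevents two stores that share a collection name from sharing records.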

In [None]:
namespace = 'my_docs'
record_manager = SQLRecordManager(namespace, db_url="sqlite:///record_manager_cache.sql")

Create a schema before using the record manager.

In [None]:
record_manager.create_schema()

Instantiate the vector database with a cache-backed embedder.

Create a vectorstore

In [9]:
# db = lancedb.connect("/tmp/lancedb")
# table = db.create_table(
#     "my_table",
#     data=[
#         {
#             "vector": embedder.embed_query("Hello World"),
#             "text": "Hello World",
#             "id": "1",
#         }
#     ],
#     mode="overwrite",
# )
# table = db.open_table('my_table')
# vectorstore = LanceDB(table, embedding=embedder)

Let's create some test data to use

In [10]:
# from langchain.vectorstores import Chroma

# faiss = Chroma(embedding_function=embedder)

In [11]:
import os
import tempfile

sample_data_dir = tempfile.mkdtemp(prefix="sample_data")

text1_content = "hello said the little kitty"
text2_content = "byebye said the big doggy"

with open(os.path.join(sample_data_dir, "text1.txt"), "w") as text1_file:
    text1_file.write(text1_content)

with open(os.path.join(sample_data_dir, "text2.txt"), "w") as text2_file:
    text2_file.write(text2_content)


We'll load these files, split them with a text splitter, and then index them.

In [13]:
file_loader = GenericLoader.from_filesystem(
    sample_data_dir, glob="*.txt", parser=TextParser()
)

In [14]:
list(file_loader.lazy_load())

[Document(page_content='hello said the little kitty', metadata={'source': '/tmp/sample_data92gr1kxd/text1.txt'}),
 Document(page_content='byebye said the big doggy', metadata={'source': '/tmp/sample_data92gr1kxd/text2.txt'})]

In [None]:
%%time
index(file_loader, record_manager, vectorstore, delete_mode='incremental', source_id_key='source')

If we re-run it, no content will be re-written, since the original content has not changed!

In [122]:
%%time
index(file_loader, record_manager, vectorstore, delete_mode='incremental', source_id_key='source')

CPU times: user 6.98 ms, sys: 4.67 ms, total: 11.6 ms
Wall time: 27.3 ms


{'num_added': 0, 'num_updated': 0, 'num_skipped': 2, 'num_deleted': 0}

## Run an update!

Change the loader to only pick up 1 of the files!

In [123]:
file_loader = GenericLoader.from_filesystem(
    sample_data_dir, glob="text1.txt", parser=TextParser()
)

In [125]:
list(file_loader.lazy_load())

[Document(page_content='hello said the little kitty', metadata={'source': '/tmp/sample_data4z34vlfe/text1.txt'})]

In [67]:
%%time
index(file_loader, record_manager, vectorstore, delete_mode='incremental', source_id_key='source')

NameError: name 'timestamped_set' is not defined

Run another update

In [19]:
%%time
index(pipeline_loader, record_manager, vectorstore, delete_mode='incremental', source_id_key='source')

Created a chunk of size 21, which is longer than the specified 4


CPU times: user 9.24 ms, sys: 9.44 ms, total: 18.7 ms
Wall time: 45.3 ms


{'num_added': 0, 'num_updated': 0, 'num_skipped': 4, 'num_deleted': 0}

# More complex pipeline

Let's create a document pipeline that applies a sequence of transformers to the documents produced by a loader.

In [None]:
class Pipeline(BaseLoader):
    """A document pipeline that can be used to load documents."""

    def __init__(
        self,
        loader: BaseLoader,
        *,
        transformers: Sequence[BaseDocumentTransformer] = (),
    ) -> None:
        """Initialize the document pipeline.
        Args:
            loader: The loader to use for loading the documents.
            transformers: The transformers to use for transforming the documents.
        """
        self.loader = loader
        self.transformers = transformers

    def lazy_load(self) -> Iterator[Document]:
        """Fetch the data from the data selector."""
        try:
            documents = self.loader.lazy_load()
        except NotImplementedError:
            documents = iter(self.loader.load())

        for document in documents:
            _docs = [document]
            for transformer in self.transformers:
                # List below is needed because of typing issue in langchain
                _docs = list(transformer.transform_documents(_docs))
            yield from _docs

    def load(self) -> List[Document]:
        """Fetch the data from the data selector."""
        raise NotImplementedError("Use lazy_load instead")

    def load_and_split(
        self, text_splitter: Optional[TextSplitter] = None
    ) -> List[Document]:
        """Fetch the data from the data selector."""
        raise NotImplementedError("Use lazy_load instead")
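
The load-then-transform pattern in `lazy_load` can also be shown without any LangChain types. The sketch below is a simplified analogue that works on plain strings: `lazy_pipeline`, `split_on_spaces`, and `upper` are hypothetical names, and transformers are modeled as plain functions rather than `BaseDocumentTransformer` instances.

```python
from typing import Callable, Iterable, Iterator, List, Sequence

Transformer = Callable[[List[str]], Iterable[str]]


def lazy_pipeline(
    source: Iterable[str], transformers: Sequence[Transformer] = ()
) -> Iterator[str]:
    """Run each source item through every transformer, one item at a time."""
    for item in source:
        batch = [item]
        for transform in transformers:
            # A transformer may return more (or fewer) items than it received.
            batch = list(transform(batch))
        yield from batch


def split_on_spaces(texts: List[str]) -> Iterator[str]:
    """Toy splitter: one output item per whitespace-separated word."""
    for text in texts:
        yield from text.split()


def upper(texts: List[str]) -> List[str]:
    return [text.upper() for text in texts]


chunks = lazy_pipeline(
    ["hello said the little kitty", "byebye said the big doggy"],
    [split_on_spaces, upper],
)
# Only the first source item has been processed when this line runs.
print(next(chunks))
print(list(chunks))
```

Processing one source item at a time keeps memory use bounded and lets downstream consumers start work before the whole source has been read.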