# Indexing Basics

Here, we will look at a basic indexing workflow using the LangChain indexing API.

The main features of the API are to help:

* Avoid writing duplicated content into the vectorstore
* Avoid over-writing content if it's unchanged

The indexing API works even with documents that have gone through several
transformation steps (e.g., via text chunking) with respect to the original source document.

## How

LangChain indexing makes use of a persistence layer (`RecordManager`) that keeps track of document writes into the vectorstore.

When indexing content, hashes are computed for each document, and the following information is stored in the record manager: 

- the document hash (hashed content of both page content and metadata)
- write time
- the source id -- each document should include information in its metadata that allows us to determine the ultimate source of the document
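
As a rough illustration of the hashing idea above, the sketch below hashes page content together with metadata using only the standard library. The helper name `document_hash` and the choice of SHA-256 over a JSON payload are assumptions made for this example; they are not LangChain's actual hashing scheme.

```python
import hashlib
import json


def document_hash(page_content: str, metadata: dict) -> str:
    """Hypothetical helper: hash page content together with metadata."""
    # sort_keys makes the hash independent of metadata key order.
    payload = json.dumps(
        {"page_content": page_content, "metadata": metadata},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


a = document_hash("kitty", {"source": "kitty.txt"})
b = document_hash("kitty", {"source": "kitty.txt"})
c = document_hash("kitty", {"source": "other.txt"})

print(a == b)  # same content and metadata give the same hash
print(a == c)  # changing the metadata changes the hash
```

Because the metadata is part of the hash, two documents with identical text but different sources are treated as different documents.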

## Deletion Modes

Indexing has 3 deletion modes:

| Delete Mode | De-Duplicates Content | Parallelizable | Cleans Up Deleted Source Docs | Cleans Up Mutations of Source Docs and/or Derived Docs | Clean Up Timing |
|-------------|-----------------------|---------------|----------------------------------|----------------------------------------------------|---------------------|
| None        | ✅                    | ✅            | ❌                               | ❌                                                 | -                  |
| Incremental | ✅                    | ✅            | ❌                               | ✅                                                 | Continuously       |
| Full        | ✅                    | ❌            | ✅                               | ✅                                                 | At end of indexing |


`None` does not do any automatic clean up, leaving the user to clean up old content manually.

`incremental` and `full` offer the following automated clean up:


* If the content of the source document or derived documents has **changed**, both `incremental` and `full` modes will clean up previous versions of the content.
* If the source document has been **deleted**, the `full` delete mode will delete it from the vectorstore correctly, but the `incremental` mode will not.


When content is mutated (e.g., the source PDF file was revised) there will be a period of time during indexing when both the new and old versions may be returned to the user. This happens after the new content was written, but before the old version was deleted.

* `incremental` indexing minimizes this period of time as it is able to do clean up continuously.
* `full` mode does the clean up after all batches have been written.
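
The clean-up rules described above can be modeled with a small toy function. This is a sketch under stated assumptions, not LangChain's implementation: `toy_index` is a hypothetical name, a plain dict stands in for both the record manager and the vectorstore, documents are `(page_content, source)` tuples, and everything is processed as a single batch (so the continuous clean up of `incremental` mode is not modeled).

```python
import hashlib
from typing import Dict, List, Optional, Tuple


def toy_index(
    docs: List[Tuple[str, str]],
    records: Dict[str, str],
    mode: Optional[str],
) -> Dict[str, int]:
    """Toy model of the three deletion modes (hypothetical, single batch).

    docs:    list of (page_content, source) tuples
    records: maps document hash -> source id; mutated in place
    mode:    None, "incremental", or "full"
    """
    added = skipped = deleted = 0
    seen = set()
    for content, source in docs:
        key = hashlib.sha256(f"{source}\n{content}".encode("utf-8")).hexdigest()
        if key in seen:
            continue  # duplicate within this batch: de-duplicated in every mode
        seen.add(key)
        if key in records:
            skipped += 1  # unchanged content is not written again
        else:
            records[key] = source
            added += 1

    if mode == "incremental":
        # Only clean up old records whose source appears in this batch.
        sources = {source for _, source in docs}
        stale = [k for k, s in records.items() if s in sources and k not in seen]
    elif mode == "full":
        # Clean up every record that was not part of this batch.
        stale = [k for k in records if k not in seen]
    else:
        stale = []  # None mode: no automatic clean up

    for key in stale:
        del records[key]
        deleted += 1
    return {"num_added": added, "num_skipped": skipped, "num_deleted": deleted}


records: Dict[str, str] = {}
first = toy_index([("kitty", "kitty.txt"), ("doggy", "doggy.txt")], records, "incremental")
second = toy_index([("puppy", "doggy.txt")], records, "incremental")
third = toy_index([("puppy", "doggy.txt")], records, "full")
print(first)   # both documents are new
print(second)  # the old doggy.txt version is replaced; kitty.txt is untouched
print(third)   # kitty.txt was not passed in, so full mode removes it
```

In this model the only difference between `incremental` and `full` is which existing records are eligible for deletion: those sharing a source with the incoming documents, or all of them.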

## Vectorstore Requirements

This code only works with LangChain Vectorstores that support:

* document addition by id (`add_documents` method with `ids` argument)
* delete by id (`delete` method with `ids` argument)
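
The two requirements above can be written down as a structural type. The sketch below is illustrative only: `SupportsIndexing` and `InMemoryStore` are hypothetical names, documents are plain strings rather than `Document` objects, and the signatures are simplified relative to any real vectorstore.

```python
from typing import Dict, List, Optional, Protocol, Sequence, runtime_checkable


@runtime_checkable
class SupportsIndexing(Protocol):
    """Hypothetical protocol: the two capabilities the indexing code relies on."""

    def add_documents(
        self, documents: Sequence[str], ids: Optional[List[str]] = None
    ) -> List[str]: ...

    def delete(self, ids: Optional[List[str]] = None) -> None: ...


class InMemoryStore:
    """Toy store keyed by id; it satisfies the protocol above."""

    def __init__(self) -> None:
        self.docs: Dict[str, str] = {}

    def add_documents(self, documents, ids=None):
        if ids is None:
            raise ValueError("this toy store requires explicit ids")
        for doc_id, doc in zip(ids, documents):
            self.docs[doc_id] = doc  # writing an existing id overwrites it
        return list(ids)

    def delete(self, ids=None):
        for doc_id in ids or []:
            self.docs.pop(doc_id, None)  # unknown ids are ignored


store = InMemoryStore()
print(isinstance(store, SupportsIndexing))  # runtime check looks at method names only
store.add_documents(["kitty", "doggy"], ids=["a", "b"])
store.delete(ids=["a"])
print(store.docs)
```

Addressing documents by id is what lets an indexer remove exactly the stale entries it previously wrote.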

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores.redis import Redis
from langchain.indexes import SQLRecordManager, index
from langchain.schema import Document

Initialize a vectorstore and set up the embeddings

In [3]:
collection_name = 'my_docs'

embeddings = OpenAIEmbeddings()
vectorstore = Redis('redis://localhost:6379', index_name=collection_name, embedding_function=embeddings.embed_documents)

Initialize a record manager with an appropriate namespace.

**Suggestion** Use a namespace that takes into account both the vectorstore and the collection name in the vectorstore; e.g., 'redis/my_docs', 'chromadb/my_docs' or 'postgres/my_docs'

In [4]:
namespace = f'redis/{collection_name}'
record_manager = SQLRecordManager(namespace, db_url="sqlite:///record_manager_cache.sql")

Create a schema before using the record manager

In [5]:
record_manager.create_schema()

Let's index some test documents

In [6]:
doc1 = Document(page_content='kitty',  metadata={'source': 'kitty.txt'})
doc2 = Document(page_content='doggy',  metadata={'source': 'doggy.txt'})

Indexing into an empty vectorstore

In [7]:
def _clear():
    """Hacky way to clear up content. See the `full` mode section to understand why it works."""
    index([], record_manager, vectorstore, delete_mode='full', source_id_key='source')

## None Mode

This mode does not do automatic clean up of old versions of content; however, it still takes care of content de-duplication.

In [8]:
_clear()

In [9]:
index([doc1, doc1, doc1, doc1, doc1], record_manager, vectorstore, delete_mode=None, source_id_key='source')

{'num_added': 1, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}

In [10]:
_clear()

In [11]:
index([doc1, doc2], record_manager, vectorstore, delete_mode=None, source_id_key='source')

{'num_added': 1, 'num_updated': 0, 'num_skipped': 1, 'num_deleted': 0}

Second time around all content will be skipped

In [12]:
index([doc1, doc2], record_manager, vectorstore, delete_mode=None, source_id_key='source')

{'num_added': 0, 'num_updated': 0, 'num_skipped': 2, 'num_deleted': 0}

## Incremental Mode

In [13]:
_clear()

In [14]:
index([doc1, doc2], record_manager, vectorstore, delete_mode='incremental', source_id_key='source')

{'num_added': 0, 'num_updated': 0, 'num_skipped': 2, 'num_deleted': 0}

Indexing again should result in both documents getting **skipped** -- also skipping the embedding operation!

In [15]:
index([doc1, doc2], record_manager, vectorstore, delete_mode='incremental', source_id_key='source')

{'num_added': 0, 'num_updated': 0, 'num_skipped': 2, 'num_deleted': 0}

If we provide no documents with incremental indexing mode, nothing will change

In [16]:
index([], record_manager, vectorstore, delete_mode='incremental', source_id_key='source')

{'num_added': 0, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}

If we mutate a document, the new version will be written and all old versions sharing the same source will be deleted.

In [17]:
changed_doc_2 = Document(page_content='puppy',  metadata={'source': 'doggy.txt'})

In [18]:
index([changed_doc_2], record_manager, vectorstore, delete_mode='incremental', source_id_key='source')

{'num_added': 1, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 1}

## Full Mode

In `full` mode the user should pass the full universe of content that should be indexed into the indexing function.

Any documents that are not passed into the indexing function and are present in the vectorstore will be deleted!

This behavior is useful to handle deletions of source documents.

In [19]:
_clear()

In [20]:
all_docs = [doc1, doc2]

In [21]:
index(all_docs, record_manager, vectorstore, delete_mode='full', source_id_key='source')

{'num_added': 2, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}

Say someone deleted the first doc

In [22]:
del all_docs[0]

In [23]:
all_docs

[Document(page_content='doggy', metadata={'source': 'doggy.txt'})]

Using full mode will clean up the deleted content as well

In [24]:
index(all_docs, record_manager, vectorstore, delete_mode='full', source_id_key='source')

{'num_added': 0, 'num_updated': 0, 'num_skipped': 1, 'num_deleted': 2}

## Source 

The metadata attribute contains a field called `source`. This source should be pointing at the *ultimate* provenance associated with the given document.

For example, if these documents are representing chunks of some parent document, the `source` for both documents should be the same and reference the parent document.
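
To make the provenance rule concrete, here is a minimal sketch in which every chunk inherits the `source` of its parent document. The fixed-width splitter `split_with_source` is a hypothetical stand-in for a real text splitter, and chunks are plain dicts rather than `Document` objects.

```python
from collections import defaultdict
from typing import Dict, List


def split_with_source(text: str, source: str, chunk_size: int) -> List[dict]:
    """Hypothetical fixed-width splitter: each chunk keeps the parent's source."""
    return [
        {"page_content": text[i:i + chunk_size], "metadata": {"source": source}}
        for i in range(0, len(text), chunk_size)
    ]


chunks = split_with_source("kitty kitty kitty", "kitty.txt", 6)
chunks += split_with_source("doggy doggy", "doggy.txt", 6)

# Group chunk texts by their ultimate source.
by_source: Dict[str, List[str]] = defaultdict(list)
for chunk in chunks:
    by_source[chunk["metadata"]["source"]].append(chunk["page_content"])

for source, texts in by_source.items():
    print(source, texts)
```

Grouping by `source` like this is what allows old chunks of a revised parent document to be found and cleaned up together.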

In [25]:
from langchain.text_splitter import CharacterTextSplitter

In [26]:
doc1 = Document(page_content='kitty kitty kitty kitty kitty',  metadata={'source': 'kitty.txt'})
doc2 = Document(page_content='doggy doggy the doggy',  metadata={'source': 'doggy.txt'})

In [27]:
new_docs = CharacterTextSplitter(separator='t', keep_separator=True, chunk_size=12, chunk_overlap=2).split_documents([doc1, doc2])
new_docs

[Document(page_content='kitty kit', metadata={'source': 'kitty.txt'}),
 Document(page_content='tty kitty ki', metadata={'source': 'kitty.txt'}),
 Document(page_content='tty kitty', metadata={'source': 'kitty.txt'}),
 Document(page_content='doggy doggy', metadata={'source': 'doggy.txt'}),
 Document(page_content='the doggy', metadata={'source': 'doggy.txt'})]

In [28]:
_clear()

In [29]:
index(new_docs, record_manager, vectorstore, delete_mode='incremental', source_id_key='source')

{'num_added': 5, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}

In [30]:
changed_doggy_docs = [
    Document(page_content='woof woof',  metadata={'source': 'doggy.txt'}), 
    Document(page_content='woof woof woof',  metadata={'source': 'doggy.txt'})
]

This should delete the old versions of documents associated with `doggy.txt` source and replace them with the new versions

In [31]:
index(changed_doggy_docs, record_manager, vectorstore, delete_mode='incremental', source_id_key='source')

{'num_added': 2, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 2}

## Using with Loaders

Indexing can accept either an iterable of documents or else any loader.

**Attention** The loader **MUST** set source keys correctly.
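
One way to guard against a loader that forgets to set source keys is to validate documents as they stream through. The wrapper below is a sketch, not part of LangChain: `check_sources` is a hypothetical helper, and documents are modeled as plain dicts with a `metadata` entry.

```python
from typing import Iterable, Iterator


def check_sources(docs: Iterable[dict], source_id_key: str = "source") -> Iterator[dict]:
    """Hypothetical guard: lazily yield docs, failing on a missing source key."""
    for position, doc in enumerate(docs):
        if not doc.get("metadata", {}).get(source_id_key):
            raise ValueError(
                f"document at position {position} has no '{source_id_key}' in its metadata"
            )
        yield doc


good = [
    {"page_content": "woof woof", "metadata": {"source": "doggy.txt"}},
    {"page_content": "woof woof woof", "metadata": {"source": "doggy.txt"}},
]
bad = good + [{"page_content": "meow", "metadata": {}}]

print(len(list(check_sources(good))))

try:
    list(check_sources(bad))
except ValueError as err:
    print("rejected:", err)
```

Because the wrapper is a generator, it keeps the lazy behavior of a loader's `lazy_load` and only raises when the offending document is reached.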

In [36]:
from langchain.document_loaders.base import BaseLoader

class MyCustomLoader(BaseLoader):
    def lazy_load(self):
        text_splitter = CharacterTextSplitter(separator='t', keep_separator=True, chunk_size=12, chunk_overlap=2)
        docs = [
            Document(page_content='woof woof',  metadata={'source': 'doggy.txt'}), 
            Document(page_content='woof woof woof',  metadata={'source': 'doggy.txt'})
        ]
        yield from text_splitter.split_documents(docs)

    def load(self):
        return list(self.lazy_load())

In [37]:
_clear()

In [38]:
loader = MyCustomLoader()

In [39]:
loader.load()

[Document(page_content='woof woof', metadata={'source': 'doggy.txt'}),
 Document(page_content='woof woof woof', metadata={'source': 'doggy.txt'})]

In [40]:
index(loader, record_manager, vectorstore, delete_mode='full', source_id_key='source')

{'num_added': 2, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}