# Indexing Basics

This notebook shows a basic indexing workflow using the LangChain indexing API.

This example shows indexing using LangConnect.
In this example, intermediate documents are not cached, but a cache is used for embeddings.

Incremental updates and cleanups are supported under the following conditions:
* The loader returns the entire universe of data on every run, which means that nothing can be parallelized at the moment.
* The user does not control the uids of the documents in the vector store (documents are indexed by hash).
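
To make "indexed by hash" concrete, here is a minimal sketch of how a deterministic uid might be derived from a document's content and metadata. The function name `document_uid`, the `NAMESPACE` constant, and the choice of SHA-1 plus `uuid5` are assumptions for illustration, not necessarily the scheme the indexing API uses.

```python
import hashlib
import json
import uuid

# Hypothetical namespace for deriving UUIDs; any fixed UUID would do.
NAMESPACE = uuid.UUID(int=1984)


def document_uid(page_content: str, metadata: dict) -> str:
    """Derive a deterministic uid from a document's content and metadata."""
    # Serialize with sorted keys so that dict ordering does not change the hash.
    payload = json.dumps(
        {"page_content": page_content, "metadata": metadata}, sort_keys=True
    )
    digest = hashlib.sha1(payload.encode("utf-8")).hexdigest()
    return str(uuid.uuid5(NAMESPACE, digest))


# Identical documents map to the same uid; different documents do not.
uid_a = document_uid("hello said the little kitty", {"source": "text1.txt"})
uid_b = document_uid("hello said the little kitty", {"source": "text1.txt"})
uid_c = document_uid("byebye said the big doggy", {"source": "text2.txt"})
```

Because the uid depends only on the document itself, re-running the loader over unchanged data yields the same uids, which is what lets an indexer skip documents it has already seen.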

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import langchain

In [3]:
from langchain.indexes import SQLRecordManager, index

In [4]:
from langchain.embeddings import CacheBackedEmbeddings, OpenAIEmbeddings
from langchain.storage import LocalFileStore

In [5]:
from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers.txt import TextParser
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma

from langchain.document_loaders import TextLoader
from langchain.vectorstores import LanceDB

import lancedb

Use a namespace to differentiate between different vector stores.

In [6]:
namespace = 'my_docs'

In [7]:
record_manager = SQLRecordManager(namespace, db_url="sqlite:///cache2.sql")

Create a schema before using the record manager

In [8]:
record_manager.create_schema()
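
As a rough illustration of why the namespace matters, the sketch below keeps per-namespace records in an in-memory SQLite table. The table layout and the helper names `upsert` and `list_keys` are hypothetical and are not the schema that `SQLRecordManager` creates.

```python
import sqlite3
import time
from typing import List

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per (namespace, key) pair.
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS records (
        namespace TEXT NOT NULL,
        key TEXT NOT NULL,
        updated_at REAL NOT NULL,
        PRIMARY KEY (namespace, key)
    )
    """
)


def upsert(namespace: str, keys: List[str]) -> None:
    """Insert keys for a namespace, refreshing the timestamp of existing ones."""
    now = time.time()
    conn.executemany(
        "INSERT OR REPLACE INTO records (namespace, key, updated_at) VALUES (?, ?, ?)",
        [(namespace, key, now) for key in keys],
    )


def list_keys(namespace: str) -> List[str]:
    """Return the keys recorded under a single namespace, sorted by name."""
    rows = conn.execute(
        "SELECT key FROM records WHERE namespace = ? ORDER BY key", (namespace,)
    )
    return [row[0] for row in rows]


# The same key can live in two namespaces without colliding.
upsert("my_docs", ["uid-1", "uid-2"])
upsert("other_docs", ["uid-1"])
```

Here `list_keys("my_docs")` and `list_keys("other_docs")` return separate sets of keys, so cleaning up one vector store's records would not touch another's.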

Create a vector store that uses a cache-backed embedder.

In [51]:
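# Wrap the OpenAI embedder in a file-backed cache.
# Assumption: the cache directory "./cache/" is illustrative only.
underlying_embeddings = OpenAIEmbeddings()
cache_backed_embedder = CacheBackedEmbeddings.from_bytes_store(
    underlying_embeddings,
    LocalFileStore("./cache/"),
    namespace=underlying_embeddings.model,
)

# Seed a LanceDB table with one row so that its schema is defined.
connection = lancedb.connect("/tmp/lancedb")
table = connection.create_table(
    "my_table",
    data=[
        {
            "vector": cache_backed_embedder.embed_query("Hello World"),
            "text": "Hello World",
            "id": "1",
        }
    ],
    mode="overwrite",
)
# Wrap the table in a LangChain vector store that reuses the cached embedder.
db = LanceDB(table, embedding=cache_backed_embedder)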
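
The idea behind a cache-backed embedder can be illustrated without any external service. In the sketch below, `CachingEmbedder` and `toy_embedding` are hypothetical stand-ins (not LangChain classes): the wrapper calls the underlying embedding function only for texts it has not seen before.

```python
import hashlib
from typing import Callable, Dict, List


class CachingEmbedder:
    """Hypothetical wrapper that memoizes an embedding function by text hash."""

    def __init__(self, embed_fn: Callable[[str], List[float]]) -> None:
        self._embed_fn = embed_fn
        self._cache: Dict[str, List[float]] = {}
        self.misses = 0  # number of calls that reached the underlying function

    def embed(self, text: str) -> List[float]:
        key = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._embed_fn(text)
        return self._cache[key]


def toy_embedding(text: str) -> List[float]:
    """Stand-in for a real model: text length and vowel count as a 2-d vector."""
    return [float(len(text)), float(sum(ch in "aeiou" for ch in text.lower()))]


embedder = CachingEmbedder(toy_embedding)
first = embedder.embed("Hello World")   # computed by toy_embedding
second = embedder.embed("Hello World")  # served from the cache
```

After both calls, `embedder.misses` is 1, which is the saving a cache provides when the same chunks are re-indexed on every run.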

Let's create some test data to use

In [52]:
sample_data_dir = "./sample_data/"

In [53]:
!mkdir -p ./sample_data
!echo "hello said the little kitty" > ./sample_data/text1.txt
!echo "byebye said the big doggy" > ./sample_data/text2.txt

We'll load these files, split them with a text splitter, and then index the resulting chunks.

In [56]:
loader = GenericLoader.from_filesystem(
    sample_data_dir, glob="*.txt", parser=TextParser()
)

In [57]:
data_loader = DocumentPipeline(loader, transformers=[transformer1, transformer2])

NameError: name 'DocumentPipeline' is not defined
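
The `NameError` shows that `DocumentPipeline` is not available in this session. As a rough idea of what such a wrapper might do, here is a minimal sketch assuming it only needs to expose `lazy_load()` and apply a chain of transformers, each mapping a list of documents to a list of documents. `Doc`, `PipelineSketch`, and `split_fixed` are hypothetical names, not LangChain classes.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Iterator, List


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""

    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


Transformer = Callable[[List[Doc]], List[Doc]]


class PipelineSketch:
    """Hypothetical wrapper: iterate over documents, applying each transformer in order."""

    def __init__(self, docs: Iterable[Doc], transformers: List[Transformer]) -> None:
        self._docs = docs
        self._transformers = transformers

    def lazy_load(self) -> Iterator[Doc]:
        for doc in self._docs:
            batch = [doc]
            for transform in self._transformers:
                batch = transform(batch)
            yield from batch


def split_fixed(width: int) -> Transformer:
    """Build a transformer that cuts each document into fixed-width pieces."""

    def transform(docs: List[Doc]) -> List[Doc]:
        return [
            Doc(d.page_content[i : i + width], dict(d.metadata))
            for d in docs
            for i in range(0, len(d.page_content), width)
        ]

    return transform


pipeline = PipelineSketch(
    [Doc("hello said the little kitty", {"source": "text1.txt"})],
    transformers=[split_fixed(10), split_fixed(5)],
)
chunks = [d.page_content for d in pipeline.lazy_load()]
```

Each chunk keeps a copy of its parent's metadata, so the `source` field survives both splitting steps.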

In [36]:
%%time
index(data_loader, record_manager, chroma, delete_old=True)

Created a chunk of size 20, which is longer than the specified 4
Created a chunk of size 21, which is longer than the specified 4


CPU times: user 32.4 ms, sys: 993 µs, total: 33.4 ms
Wall time: 66.9 ms


{'num_added': 0, 'num_updated': 0, 'num_skipped': 8, 'num_deleted': 0}

## Run an update!

Change the loader so that it picks up only one of the files!

In [17]:
loader = GenericLoader.from_filesystem(
    sample_data_dir, glob="file2.txt", parser=TextParser()
)

data_loader = DocumentPipeline(loader, transformers=[transformer1, transformer2])

list(data_loader.lazy_load())

Created a chunk of size 21, which is longer than the specified 4


[Document(lc_kwargs={'page_content': 'This is th', 'metadata': {'source': 'sample_data/file2.txt'}}, page_content='This is th', metadata={'source': 'sample_data/file2.txt'}),
 Document(lc_kwargs={'page_content': 'smallest', 'metadata': {'source': 'sample_data/file2.txt'}}, page_content='smallest', metadata={'source': 'sample_data/file2.txt'}),
 Document(lc_kwargs={'page_content': 'ile. Fil', 'metadata': {'source': 'sample_data/file2.txt'}}, page_content='ile. Fil', metadata={'source': 'sample_data/file2.txt'}),
 Document(lc_kwargs={'page_content': '#2.', 'metadata': {'source': 'sample_data/file2.txt'}}, page_content='#2.', metadata={'source': 'sample_data/file2.txt'})]

In [18]:
%%time
index(data_loader, record_manager, chroma, delete_old=True)

Created a chunk of size 21, which is longer than the specified 4


CPU times: user 20.7 ms, sys: 265 µs, total: 21 ms
Wall time: 55.3 ms


{'num_added': 0, 'num_updated': 0, 'num_skipped': 4, 'num_deleted': 4}

## Run another update (note that no more deletions are necessary)!

In [19]:
%%time
index(data_loader, record_manager, chroma, delete_old=True)

Created a chunk of size 21, which is longer than the specified 4


CPU times: user 9.24 ms, sys: 9.44 ms, total: 18.7 ms
Wall time: 45.3 ms


{'num_added': 0, 'num_updated': 0, 'num_skipped': 4, 'num_deleted': 0}
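
The pattern in the result dictionaries above can be reproduced with a small simulation. This is a sketch under stated assumptions, not the library's implementation: `index_sketch` is a hypothetical function, the vector store is modelled as a plain set of uids, and the uids themselves (`a1`, `b1`, and so on) are made up to stand for four chunks per file.

```python
from typing import Dict, Iterable, Set


def index_sketch(
    doc_uids: Iterable[str], store: Set[str], delete_old: bool
) -> Dict[str, int]:
    """Sync `store` (the uids already indexed) with what the loader returned."""
    seen: Set[str] = set()
    result = {"num_added": 0, "num_updated": 0, "num_skipped": 0, "num_deleted": 0}
    for uid in doc_uids:
        if uid in seen:
            continue  # ignore duplicates within a single run
        seen.add(uid)
        if uid in store:
            result["num_skipped"] += 1  # unchanged content hashes to the same uid
        else:
            store.add(uid)
            result["num_added"] += 1
    if delete_old:
        # Anything in the store that the loader did not return is stale.
        # This is why the loader must return the entire universe of data.
        stale = store - seen
        store.difference_update(stale)
        result["num_deleted"] = len(stale)
    return result


store: Set[str] = set()
both_files = ["a1", "a2", "a3", "a4", "b1", "b2", "b3", "b4"]
only_second = ["b1", "b2", "b3", "b4"]

first_run = index_sketch(both_files, store, delete_old=True)
rerun = index_sketch(both_files, store, delete_old=True)
narrowed = index_sketch(only_second, store, delete_old=True)
narrowed_again = index_sketch(only_second, store, delete_old=True)
```

In this simulation `rerun` skips all 8 chunks, `narrowed` skips 4 and deletes 4, and `narrowed_again` skips 4 and deletes nothing, mirroring the three outputs shown in the notebook. With hash-based uids, `num_updated` stays at 0 in this sketch because a changed document appears as one addition plus one deletion.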