# Parent vectorstore

When splitting documents for retrieval, there are often conflicting desires:

1. You may want small documents, so that their embeddings reflect their meaning as accurately as possible. If a document is too long, its embedding can lose meaning.
2. You may want documents long enough to retain the context of each chunk.

The `ParentVectorStore` strikes that balance by splitting documents and storing the small chunks. During retrieval, it first fetches the small chunks, then looks up the parent ids of those chunks and returns the larger parent documents.
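
To make the idea concrete, here is a minimal pure-Python sketch of "search small, return large". It is not the library's implementation: the `parents` dictionary, the `words` bag-of-words scoring (a stand-in for real embeddings) and `retrieve_parents` are all hypothetical names chosen for illustration.

```python
import re
from collections import Counter

# Two toy "parent" documents, keyed by a hypothetical parent id.
parents = {
    "doc-algebra": "Algebra studies symbols and the rules for manipulating them. "
                   "Linear algebra focuses on vectors and matrices.",
    "doc-geometry": "Geometry studies shapes, sizes and positions. "
                    "Topology studies properties preserved under deformation.",
}


def words(text):
    """Lower-cased bag of words, standing in for a real embedding."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


# Index small chunks (one sentence each) and remember each chunk's parent id.
chunk_index = []
for parent_id, text in parents.items():
    for sentence in text.split(". "):
        chunk_index.append((words(sentence), parent_id))


def retrieve_parents(query, k=1):
    """Rank chunks by word overlap, then return the distinct parent documents."""
    q = words(query)
    ranked = sorted(chunk_index, key=lambda c: -sum((c[0] & q).values()))
    parent_ids = []
    for _, parent_id in ranked[:k]:
        if parent_id not in parent_ids:
            parent_ids.append(parent_id)
    return [parents[pid] for pid in parent_ids]


result = retrieve_parents("vectors and matrices")
```

The query matches a single short sentence, but the caller receives the whole parent document that contains it.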

The challenge is to manage the life cycle of the three levels of documents correctly:
- original documents
- chunks extracted from the original documents
- transformations of chunks in order to have more vectors with which to retrieve them

The `ParentVectorStore`, in combination with other components, is designed for exactly that.

In [None]:
#!pip install 'langchain-parent' openai tiktoken
!poetry install -q

For this sample, we use a set of documents from Wikipedia.

In [None]:
top_k_results = 1

In [None]:
query = "names the major mathematical disciplines"


In [None]:
!pip install --quiet --upgrade pip langchain wikipedia > /dev/null
from langchain.retrievers import WikipediaRetriever

documents = WikipediaRetriever(top_k_results=top_k_results).get_relevant_documents("mathematic")
len(documents), query

# Select provider
## Select the LLM
Before starting, you need to:
- Set the environment variables.
- Choose an LLM, get its context size, and set the `max_tokens` to generate.

In [None]:
import os
from dotenv import load_dotenv

load_dotenv(override=True)
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = "XXXXX"
if "COHERE_API_KEY" not in os.environ:
    os.environ["COHERE_API_KEY"] = "XXXX"

In [None]:
!pip install --quiet openai tiktoken
from langchain.llms import OpenAI

context_size = 512  # For the demonstration, use a small context_size.
max_tokens = int(context_size * (10 / 100))  # 10% for the response
max_input_tokens = context_size - max_tokens
llm = OpenAI(
    max_tokens=max_tokens,  # use the response budget computed above
)
context_size, max_tokens, max_input_tokens

## Select the embedding implementation

In [None]:
from langchain.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

In [None]:
# Add a cache
import tempfile

from langchain.embeddings import CacheBackedEmbeddings
from langchain.storage import LocalFileStore

CACHE_EMBEDDING_PATH = tempfile.gettempdir() + "/cache_embedding"
fs = LocalFileStore(CACHE_EMBEDDING_PATH)

# Use the model name as the namespace so that vectors from different models never mix.
embeddings = CacheBackedEmbeddings.from_bytes_store(
    embeddings, fs, namespace=embeddings.model if hasattr(embeddings, "model") else "unknown"
)
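
The caching idea can be sketched without any external service. The snippet below is only an illustration under stated assumptions: `CachedEmbedder` and `fake_embedding` are hypothetical names, and a dictionary stands in for the files that a byte store would write to disk.

```python
import hashlib


class CachedEmbedder:
    """Toy cache: compute each text's vector only once per namespace."""

    def __init__(self, embed_fn, namespace):
        self.embed_fn = embed_fn
        self.namespace = namespace
        self.store = {}   # stands in for the files on disk
        self.misses = 0   # number of real embedding computations

    def _key(self, text):
        # The namespace prefix keeps vectors from different models apart.
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        return f"{self.namespace}/{digest}"

    def embed_documents(self, texts):
        vectors = []
        for text in texts:
            key = self._key(text)
            if key not in self.store:
                self.misses += 1
                self.store[key] = self.embed_fn(text)
            vectors.append(self.store[key])
        return vectors


def fake_embedding(text):
    """Hypothetical embedding: text length and vowel count."""
    return [float(len(text)), float(sum(text.count(v) for v in "aeiou"))]


cached = CachedEmbedder(fake_embedding, namespace="fake-model")
first = cached.embed_documents(["algebra", "geometry"])
second = cached.embed_documents(["algebra", "topology"])
```

Only three embeddings are computed for the four requested texts, because `"algebra"` is served from the cache the second time.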

# Transform documents
The idea is to transform a document into several versions, and calculate the vector for each one.

In [None]:
from langchain.text_splitter import *
from langchain_parent.document_transformers import *

The first step is to split the document to be compatible with the `max_input_tokens`.

In [None]:
parent_transformer = TokenTextSplitter(
    chunk_size=max_input_tokens, chunk_overlap=0
)

Apply the transformation

In [None]:
child_documents = parent_transformer.transform_documents(documents)
f"before:{len(documents)} (big documents), after:{len(child_documents)} (chunk of documents)"
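
As a rough illustration of what a token-based split does, the sketch below cuts a text into fixed-size windows. It assumes whitespace-separated words as "tokens", which is a simplification: a real tokenizer such as the one used by `TokenTextSplitter` counts tokens differently. The function name `split_by_tokens` is hypothetical.

```python
def split_by_tokens(text, chunk_size, chunk_overlap=0):
    """Split text into windows of at most chunk_size whitespace tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    tokens = text.split()
    step = chunk_size - chunk_overlap
    return [
        " ".join(tokens[start:start + chunk_size])
        for start in range(0, len(tokens), step)
    ]


# Ten toy tokens split into windows of four, without overlap.
text = " ".join(f"w{i}" for i in range(10))
chunks = split_by_tokens(text, chunk_size=4)
```

With ten tokens and a window of four, the last chunk is shorter than the others, which is also what happens at the end of a real document.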

In [None]:
child_transformer = DocumentTransformerPipeline(
    transformers=[
        CopyDocumentTransformer(),
        GenerateQuestions.from_llm(llm),
        SummarizeTransformer.from_llm(llm),
    ]
)

Note that we need every transformation of each chunk, as well as the original chunk itself. That is why we add the `CopyDocumentTransformer()`.

In [None]:
variations_of_chunks = child_transformer.transform_documents(child_documents)
variations_of_chunks[0].page_content
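
The sketch below shows one way such a pipeline could fan out each chunk into several variations. It is an assumption about the behavior described above, not the code of `DocumentTransformerPipeline`: the three transformer functions are hypothetical stand-ins that need no LLM.

```python
def copy_chunk(chunk):
    """Keep the original chunk among the variations."""
    return [chunk]


def fake_questions(chunk):
    """Stand-in for an LLM that writes a question about the chunk."""
    return [f"What does this passage say about {chunk.split()[0]}?"]


def fake_summary(chunk):
    """Stand-in for an LLM summary: the first three words."""
    return [" ".join(chunk.split()[:3]) + "..."]


def run_pipeline(chunks, transformers):
    """Apply every transformer to every chunk and collect all the outputs."""
    variations = []
    for chunk in chunks:
        for transform in transformers:
            variations.extend(transform(chunk))
    return variations


chunks = ["Algebra studies symbols and rules", "Geometry studies shapes"]
variations = run_pipeline(chunks, [copy_chunk, fake_questions, fake_summary])
```

Each chunk yields three variations here. Without `copy_chunk`, the original text would not be indexed at all.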

# Save all variations in a vector store
Now we want to save the chunks and their variations in a vector store. During retrieval, the store first fetches the small chunks, then looks up the parent ids of those chunks and returns the larger documents.

A specialized vectorstore is designed for that: `ParentVectorStore`.
It's not a real vectorstore, but a wrapper around another vectorstore. When you add a document, it is transformed with the `parent_transformer`, and each resulting chunk is enriched with different versions via the `child_transformer`.

We must first create some persistent components:
- A classical vectorstore
- A docstore to save each original chunk returned by the retriever

In [None]:
from langchain.vectorstores import Chroma

chroma_vectorstore = Chroma(
    collection_name="all_variations_of_chunks",
    embedding_function=embeddings
)

In [None]:
import pickle

from langchain.schema import Document
from langchain.storage import EncoderBackedStore
from langchain.storage import LocalFileStore

DOCSTORE_PATH = tempfile.gettempdir() + "/chunks"

# Chunks are pickled to local files; keys are already strings, so they are stored unchanged.
docstore = EncoderBackedStore[str, Document](
    store=LocalFileStore(root_path=DOCSTORE_PATH),
    key_encoder=lambda x: x,
    value_serializer=pickle.dumps,
    value_deserializer=pickle.loads
)
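
The role of the serializer pair can be sketched with the standard library alone. The class below is a hypothetical, in-memory stand-in (`PickleStore` is not a LangChain class): values are converted to bytes on write and rebuilt on read, which is what allows a byte-oriented store to hold structured chunks.

```python
import pickle


class PickleStore:
    """Toy key-value store that keeps only bytes, like a file-backed store."""

    def __init__(self):
        self._bytes = {}

    def mset(self, pairs):
        """Serialize and save several (key, value) pairs."""
        for key, value in pairs:
            self._bytes[key] = pickle.dumps(value)

    def mget(self, keys):
        """Load several values; unknown keys yield None."""
        return [
            pickle.loads(self._bytes[key]) if key in self._bytes else None
            for key in keys
        ]


store = PickleStore()
chunk = {"page_content": "Algebra studies symbols.", "metadata": {"source": "wiki"}}
store.mset([("chunk-1", chunk)])
loaded = store.mget(["chunk-1", "missing"])
```

Keep in mind that unpickling data from an untrusted source is unsafe, so this kind of store is best kept for data you wrote yourself.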

Then, you must select a metadata key to identify each chunk.

In [None]:
id_key = 'id'

In [None]:
from langchain_parent.vectorstores import ParentVectorStore

vectorstore = ParentVectorStore(
    vectorstore=chroma_vectorstore,
    docstore=docstore,
    parent_id_key="source",
    id_key=id_key,
    parent_transformer=parent_transformer,
    child_transformer=child_transformer,
)

Now it's time to add documents to this vectorstore.
- If the `parent_transformer` is set, each document is transformed into a new list of chunk documents (generally a split phase).
- Then, each chunk document is transformed with the `child_transformer`.
- Every transformation of every chunk is added to the destination vector store (Chroma in this sample).
- All chunks are saved in the docstore, each with the list of its associated transformations.
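
The steps above can be sketched with two dictionaries. This is an illustrative model only, under stated assumptions: `vector_entries`, `chunk_store`, `add_document` and the id scheme are hypothetical, and the real `ParentVectorStore` may organize its data differently.

```python
vector_entries = {}  # variation id -> (text, metadata); stands in for the vector store
chunk_store = {}     # chunk id -> chunk text and its variation ids; stands in for the docstore


def add_document(text, source, split, make_variations):
    """Split a document, derive variations of each chunk, and store both levels."""
    chunk_ids = []
    for i, chunk in enumerate(split(text)):
        chunk_id = f"{source}#{i}"
        variation_ids = []
        for j, variation in enumerate(make_variations(chunk)):
            variation_id = f"{chunk_id}/v{j}"
            # Each variation remembers its source document and its chunk.
            vector_entries[variation_id] = (variation, {"source": source, "id": chunk_id})
            variation_ids.append(variation_id)
        chunk_store[chunk_id] = {"chunk": chunk, "variation_ids": variation_ids}
        chunk_ids.append(chunk_id)
    return chunk_ids


ids = add_document(
    "Algebra studies symbols. Geometry studies shapes",
    source="wiki/math",
    split=lambda text: text.split(". "),
    make_variations=lambda chunk: [chunk, chunk.upper()],
)
```

The returned ids identify chunks, not variations, which matches the way the ids are used for deletion in the next cell.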

In [None]:
ids = vectorstore.add_documents(documents)
ids

`ids` is the list of ids, one per chunk. You can use it to delete some chunks; all their variations are deleted as well.

In [None]:
vectorstore.delete(ids[:1])
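
A cascading delete of this kind could look like the sketch below. It is a hypothetical model, not the library's code: `chunk_store` plays the docstore, `vector_entries` plays the vector store, and `delete_chunks` is an illustrative name.

```python
# Toy state: two chunks, the first one with two variations.
vector_entries = {
    "c0/v0": "Algebra studies symbols",
    "c0/v1": "What is algebra?",
    "c1/v0": "Geometry studies shapes",
}
chunk_store = {
    "c0": {"chunk": "Algebra studies symbols", "variation_ids": ["c0/v0", "c0/v1"]},
    "c1": {"chunk": "Geometry studies shapes", "variation_ids": ["c1/v0"]},
}


def delete_chunks(chunk_ids):
    """Remove each chunk and every variation that was derived from it."""
    for chunk_id in chunk_ids:
        record = chunk_store.pop(chunk_id, None)
        if record is None:
            continue  # unknown id: nothing to delete
        for variation_id in record["variation_ids"]:
            vector_entries.pop(variation_id, None)


delete_chunks(["c0"])
```

The list of variation ids kept with each chunk is what makes the cascade possible without scanning the whole vector store.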

Looking at this API, you may wonder where to keep the document ids returned by the vector store.

# Index vectorstore
To manage the life cycle of the documents in the vectorstore, you can use `index()`.
It relies on a *record manager* to detect when data is inserted or updated, or when a record must be deleted.

In [None]:
from langchain.indexes import index, SQLRecordManager

record_manager = SQLRecordManager(
    namespace="record_manager_cache",
    db_url=f"sqlite:///{tempfile.gettempdir()}/record_manager_cache.db"
)
record_manager.create_schema()

In [None]:
# Import documents through the whole pipeline:
# - record manager
# - docstore
# - vectorstore
index(
    docs_source=documents,  # PPR: a loader can be placed here
    record_manager=record_manager,
    vector_store=vectorstore,
    cleanup="incremental",
    source_id_key="source",
)
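
To give an intuition of what an incremental cleanup does, here is a simplified sketch based on content hashes. It is an approximation under stated assumptions, not the algorithm of `index()`: the `records` and `store` dictionaries and the `incremental_index` function are hypothetical.

```python
import hashlib


def content_hash(doc):
    """Fingerprint of a document: its source plus its text."""
    raw = doc["source"] + "\x00" + doc["text"]
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def incremental_index(docs, records, store):
    """Add new documents, skip unchanged ones, drop stale versions of seen sources."""
    stats = {"num_added": 0, "num_skipped": 0, "num_deleted": 0}
    seen_hashes, seen_sources = set(), set()
    for doc in docs:
        h = content_hash(doc)
        seen_hashes.add(h)
        seen_sources.add(doc["source"])
        if h in records:
            stats["num_skipped"] += 1
        else:
            records[h] = doc["source"]
            store[h] = doc["text"]
            stats["num_added"] += 1
    # Only sources present in this batch are cleaned up.
    for h, source in list(records.items()):
        if source in seen_sources and h not in seen_hashes:
            del records[h]
            store.pop(h, None)
            stats["num_deleted"] += 1
    return stats


records, store = {}, {}
first = incremental_index(
    [{"source": "a", "text": "v1"}, {"source": "b", "text": "x"}], records, store
)
second = incremental_index([{"source": "a", "text": "v2"}], records, store)
```

In the second run, the old version of source `a` is removed while source `b`, which is absent from the batch, is left untouched.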

It's important to know that parts of the data are saved in three places:
- In the *vectorstore*: the chunk variations, their metadata and the associated embedding vectors
- In the *docstore*: the original chunk, before the *child_transformer* is applied
- In the *SQLRecordManager*: the references of the chunks (FIXME: or of the original document?)

None of these stores manages transactions. If a problem occurs while adding a document, it is highly likely that the stores will become inconsistent.

# Use a retriever
As with a standard vector store, it's possible to convert it to a `VectorStoreRetriever`.

In [None]:
retriever = vectorstore.as_retriever()

In [None]:
selected_chunks = retriever.get_relevant_documents(query)

In [None]:
len(selected_chunks)

## Specialize retrievers
It's possible to refine the retrievers.

In [None]:
from langchain.retrievers.multi_query import MultiQueryRetriever

if True:
    retriever = MultiQueryRetriever.from_llm(
        llm=llm,
        retriever=retriever,
    )
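
The principle behind a multi-query retriever can be sketched as follows: several rewordings of the question are searched, and the results are merged without duplicates. This is a toy model with hypothetical names (`corpus`, `search`, `rewrite`, `multi_query_retrieve`); in the real retriever the rewordings come from the LLM.

```python
# Toy index: a keyword maps to the documents it retrieves.
corpus = {
    "branches": ["algebra overview", "geometry overview"],
    "areas": ["geometry overview", "analysis overview"],
}


def search(query):
    """Stand-in for a vector search: return documents for each keyword found."""
    hits = []
    for keyword, docs in corpus.items():
        if keyword in query:
            hits.extend(docs)
    return hits


def rewrite(query):
    """Stand-in for the LLM that rephrases the question."""
    return ["major branches of mathematics", "main areas of math"]


def multi_query_retrieve(query):
    """Search every rewording and keep the first occurrence of each document."""
    seen, merged = set(), []
    for variant in rewrite(query):
        for doc in search(variant):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged


merged = multi_query_retrieve("names the major mathematical disciplines")
```

Neither rewording alone finds all three documents, which is the motivation for querying with several phrasings.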

In [None]:
if False:
    # TODO: SelfQuery
    pass

# Use a compressor
It's possible to use a *compressor* to filter the selection.

In [None]:
!pip install --quiet cohere
from langchain.retrievers.document_compressors import CohereRerank
cohere_rerank=CohereRerank(
    top_n=top_k_results
)


In [None]:
from langchain.retrievers.document_compressors import *
from langchain.retrievers import ContextualCompressionRetriever

similarity_filter = EmbeddingsFilter(
    embeddings=embeddings,
    similarity_threshold=0.8,  # Minimum similarity to the query for a document to be kept.
)

# Filter that drops documents that aren't relevant to the query.
drop_filter = LLMChainFilter.from_llm(llm)


compressor = DocumentCompressorPipeline(
    transformers=[
        similarity_filter,
        cohere_rerank,
    ]
)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=retriever
)

compressed_documents = compression_retriever.get_relevant_documents(query)
len(selected_chunks),len(compressed_documents)
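
A two-stage compression of this kind can be sketched with plain cosine similarity. The example is a simplification under stated assumptions: the vectors are made up, `similarity_filter` and `rerank` are hypothetical functions, and the reranking step uses cosine similarity as a stand-in for a dedicated reranking model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_filter(query_vec, docs, threshold):
    """Stage 1: drop documents that are not similar enough to the query."""
    return [d for d in docs if cosine(query_vec, d["vector"]) >= threshold]


def rerank(query_vec, docs, top_n):
    """Stage 2: keep the top_n most similar documents, best first."""
    ordered = sorted(docs, key=lambda d: cosine(query_vec, d["vector"]), reverse=True)
    return ordered[:top_n]


query_vec = [1.0, 0.0]
docs = [
    {"name": "C", "vector": [0.0, 1.0]},
    {"name": "B", "vector": [0.9, 0.1]},
    {"name": "A", "vector": [1.0, 0.0]},
]
kept = similarity_filter(query_vec, docs, threshold=0.8)
best = rerank(query_vec, kept, top_n=1)
names = [d["name"] for d in best]
```

Running the cheap filter first means the more expensive reranking stage only sees documents that already passed the threshold.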

# Ask a question
Now, it's possible to use this architecture to ask the question.

In [None]:
from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_llm(
    llm,
    retriever=retriever)
qa_chain(query)

# Short version
Now it's time to simplify the code.

## Prepare the import

In [39]:
from langchain.vectorstores import Chroma
from langchain.schema import Document
from langchain.storage import EncoderBackedStore
from langchain.storage import LocalFileStore
from langchain_parent.vectorstores import ParentVectorStore
from langchain.indexes import index, SQLRecordManager
from langchain.text_splitter import *
from langchain_parent.document_transformers import *

import pickle
import tempfile

id_key = 'id'
top_k_results = 4

DOCSTORE_PATH = tempfile.gettempdir() + "/chunks"

docstore = EncoderBackedStore[str, Document](
    store=LocalFileStore(root_path=DOCSTORE_PATH),
    key_encoder=lambda x: x,
    value_serializer=pickle.dumps,
    value_deserializer=pickle.loads
)

record_manager = SQLRecordManager(
    namespace="record_manager_cache",
    db_url=f"sqlite:///{tempfile.gettempdir()}/record_manager_cache.db"
)
record_manager.create_schema()

parent_transformer = TokenTextSplitter(
    chunk_size=max_input_tokens, chunk_overlap=0
)
child_transformer = DocumentTransformerPipeline(
    transformers=[
        CopyDocumentTransformer(),
        GenerateQuestions.from_llm(llm),
        SummarizeTransformer.from_llm(llm),
    ]
)
vectorstore = ParentVectorStore(
    vectorstore=Chroma(
        collection_name="all_variations_of_chunks",
        embedding_function=embeddings
        ),
    docstore=docstore,
    parent_id_key="source",
    id_key=id_key,
    parent_transformer=parent_transformer,
    child_transformer=child_transformer,
)

## Import documents

In [40]:
from langchain.retrievers import WikipediaRetriever

documents = WikipediaRetriever(top_k_results=top_k_results).get_relevant_documents("mathematic")


In [41]:
index(
    docs_source=documents,
    record_manager=record_manager,
    vector_store=vectorstore,
    cleanup="incremental",
    source_id_key="source",
)



{'num_added': 4, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}

## Use the vectorstore

In [42]:
from langchain.retrievers import *
from langchain.retrievers.document_compressors import *

retriever = vectorstore.as_retriever()
#TODO: retriever = SelfQueryRetriever. ...
retriever= MultiQueryRetriever.from_llm(
    llm=llm,
    retriever=retriever, 
)


In [43]:
similarity_filter = EmbeddingsFilter(
    embeddings=embeddings,
    similarity_threshold=0.8,  # Minimum similarity to the query for a document to be kept.
)
cohere_rerank=CohereRerank(
    top_n=top_k_results
)
# Filter that drops documents that aren't relevant to the query.
drop_filter = LLMChainFilter.from_llm(llm)


retriever = ContextualCompressionRetriever(
    base_compressor=DocumentCompressorPipeline(
        transformers=[
            similarity_filter,
            cohere_rerank,
            drop_filter,
        ]
    ),
    base_retriever=retriever
)

In [44]:
from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_llm(
    llm,
    retriever=retriever)
qa_chain("names the major mathematical disciplines")



{'query': 'names the major mathematical disciplines',
 'result': '\n\nThe major mathematical disciplines include algebra, geometry, calculus, statistics, and trigonometry.'}