# How to use Late Chunking in RAG

Based on the [Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models](https://arxiv.org/abs/2409.04701) paper.

This notebook explains how to apply the late chunking embedding support in `LangChain`.

**Notes:**
- The key idea behind Late Chunking is to embed the entire text first and split it into chunks afterwards, so that each chunk embedding carries context from the whole document. To implement Late Chunking in LangChain, we use the `LateChunkQdrant` vector store, which applies the late chunking technique.

- It can be combined with any text splitter available in LangChain, or you can customize it with the [chunking code](https://github.com/jina-ai/late-chunking/blob/main/chunked_pooling/chunking.py) used in the paper. The example below follows the same approach as the paper's authors.
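To make the idea above concrete, here is a minimal pure-Python sketch of the pooling step: token-level embeddings for the whole text are averaged inside each chunk's token span. The function name `late_chunk_pool`, the toy vectors, and the spans are all illustrative assumptions; this is not the code that `LateChunkQdrant` or the Jina API runs internally.

```python
from typing import List, Sequence, Tuple


def late_chunk_pool(
    token_embeddings: Sequence[Sequence[float]],
    spans: Sequence[Tuple[int, int]],
) -> List[List[float]]:
    """Mean-pool token embeddings inside each (start, end) token span.

    `token_embeddings` is assumed to come from ONE forward pass over the
    full text, so every token vector already reflects document-wide context.
    """
    pooled = []
    for start, end in spans:
        window = token_embeddings[start:end]
        if not window:
            raise ValueError(f"empty span: ({start}, {end})")
        dim = len(window[0])
        # Average each dimension over the tokens that belong to this chunk.
        pooled.append([sum(vec[i] for vec in window) / len(window) for i in range(dim)])
    return pooled


# Toy stand-in for the token embeddings of a 5-token document (hypothetical values).
token_embeddings = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0], [0.0, 6.0]]
# Chunk boundaries expressed as token indices: tokens 0-1 and tokens 2-4.
spans = [(0, 2), (2, 5)]

chunk_vectors = late_chunk_pool(token_embeddings, spans)
print(chunk_vectors)  # [[2.0, 0.0], [0.0, 4.0]]
```

In conventional chunking, each chunk is passed through the model separately, so its tokens never see the rest of the document. Here the split happens only at the pooling stage.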

## Setup

In [None]:
%pip install -qU langchain langchain-community qdrant-client beautifulsoup4 transformers

## Credentials
To access Jina embedding models, go to https://jina.ai/embeddings/ and get an API key.

In [1]:
import os
import getpass

if not os.getenv("JINA_API_KEY"):
    os.environ["JINA_API_KEY"] = getpass.getpass("Enter your key: ") # "jina_*"

## Instantiation

In [2]:
from langchain_community.embeddings import JinaLateChunkEmbeddings

text_embeddings = JinaLateChunkEmbeddings(
    jina_api_key=os.environ.get("JINA_API_KEY"),
    model_name="jina-embeddings-v3"
)

For our purposes, the input text must fit within the model’s context length, so we use the Hugging Face tokenizer to check the tokenized input length.

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
from transformers import AutoTokenizer


tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v3')

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=0,
    length_function=len,
    is_separator_regex=False,
)
text_splitter.tokenizer = tokenizer
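When a document is longer than the context length, one option is to pack paragraphs into windows that each stay under a token budget before embedding. The sketch below illustrates that idea only: `group_into_windows` is a hypothetical helper, not something `LateChunkQdrant` is known to do, and a whitespace counter stands in for the real tokenizer so the example runs offline.

```python
from typing import Callable, List


def group_into_windows(
    paragraphs: List[str],
    count_tokens: Callable[[str], int],
    max_tokens: int,
) -> List[str]:
    """Greedily pack paragraphs into windows of at most `max_tokens` tokens.

    A single paragraph that exceeds the budget is kept as its own window;
    splitting it further is left to the caller.
    """
    windows: List[str] = []
    current: List[str] = []
    used = 0
    for paragraph in paragraphs:
        n = count_tokens(paragraph)
        # Start a new window when adding this paragraph would exceed the budget.
        if current and used + n > max_tokens:
            windows.append("\n\n".join(current))
            current, used = [], 0
        current.append(paragraph)
        used += n
    if current:
        windows.append("\n\n".join(current))
    return windows


# Whitespace token counter as an offline stand-in (assumption). With the
# Hugging Face tokenizer loaded above, one could instead pass
# lambda s: len(tokenizer(s)["input_ids"]).
def count_tokens(text: str) -> int:
    return len(text.split())


windows = group_into_windows(["a b c", "d e", "f g h i"], count_tokens, max_tokens=5)
print(windows)  # ['a b c\n\nd e', 'f g h i']
```

Per-paragraph counts are only an approximation of the joined window's length, since a real tokenizer adds special tokens, so it is worth leaving some headroom below the model's actual limit.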

## LateChunkQdrant

Configure the parameters used by the database:

In [None]:
class Config:
    ROOT = "demo-qdrant"
    CLT_NAME = "demo"
    TOPK = 10

Here we create the `LateChunkQdrant` database. The retriever is configured to return the top 10 documents.

In [None]:
import os

from qdrant_client import QdrantClient
from langchain_community.vectorstores import LateChunkQdrant
from langchain_community.docstore.document import Document 


client = QdrantClient()
vectorstore = LateChunkQdrant(
    client, 
    collection_name=Config.CLT_NAME,
    embeddings=text_embeddings, 
    text_splitter=text_splitter
)

if os.path.isdir(os.path.join(Config.ROOT, "collection", Config.CLT_NAME)):
    print(f"===== Load existing collection: {Config.CLT_NAME} ======")
    vectorstore = vectorstore.from_existing_collection(
        embedding=text_embeddings, 
        path=Config.ROOT,
        collection_name=Config.CLT_NAME,  # Config defines CLT_NAME, not COLLECTION_NAME
        text_splitter=text_splitter
    )
    
else:
    print(f"===== Create new collection: {Config.CLT_NAME} ======")
    with open("state_of_the_union.txt") as f:
        state_of_the_union = f.read()

    documents  = [
        Document(
            page_content=state_of_the_union, 
            metadata={"source": "state_of_the_union.txt"}
        ),
    ]
    
    vectorstore = vectorstore.from_documents(
        documents=documents, 
        embedding=text_embeddings, 
        text_splitter=text_splitter,
        path=Config.ROOT, 
        collection_name=Config.CLT_NAME
    )

# Wrap the vector store in a retriever that returns the top-k chunks
vectorstore = vectorstore.as_retriever(search_kwargs={"k": Config.TOPK})



Token indices sequence length is longer than the specified maximum sequence length for this model (8872 > 8194). Running this sequence through the model will result in indexing errors


Once your vector store has been created and the relevant documents have been added, you will most likely want to query it while running your chain or agent.

In [7]:
query = "what did the president say about ketanji brown jackson?"

docs = vectorstore.invoke(query)
for idx, doc in enumerate(docs):
    print(f"Doc {idx}: ", doc.page_content, "\n")

Doc 0:  One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 

Doc 1:  And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence. 

Doc 2:  As I said last year, especially to our younger transgender Americans, I will always have your back as your President, so you can be yourself and reach your God-given potential. 

Doc 3:  Revise our laws so businesses have the workers they need and families don’t wait decades to reunite. 

It’s not only the right thing to do—it’s the economically smart thing to do. 

Doc 4:  We’re putting in place dedicated immigration judges so families fleeing persecution and violence can have their cases heard faster. 

Doc 5:  If you’re suffering from addiction, know you are not alone. I believe in recovery, and I celebrate the 23 million American