When working with files such as PDFs, you're likely to encounter text that exceeds your language model's context window. To process this text, consider these strategies:

1. **Change LLM** Choose a different LLM that supports a larger context window.
2. **Brute Force** Chunk the document, and extract content from each chunk.
3. **RAG** Chunk the document, index the chunks, and only extract content from a subset of chunks that look "relevant".

Keep in mind that these strategies have different trade-offs, and the best strategy likely depends on the application you're designing!
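Before picking a strategy, it can help to check roughly whether a text fits in the context window at all. The sketch below is only a heuristic: the names `estimate_tokens` and `fits_in_context`, the 4-characters-per-token ratio, and the reserved-token budget are all assumptions made for illustration, not part of any library. Real tokenizers (especially for non-English text) can give quite different counts.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; chars_per_token is a heuristic, not a tokenizer."""
    return int(len(text) / chars_per_token) + 1


def fits_in_context(
    text: str, context_window_tokens: int, reserved_tokens: int = 1000
) -> bool:
    """Return True if the text likely fits, leaving room for the prompt and the output."""
    return estimate_tokens(text) <= context_window_tokens - reserved_tokens


# A stand-in text of roughly 79,000 characters
sample = "x" * 79_251
print(fits_in_context(sample, context_window_tokens=8_000))
print(fits_in_context(sample, context_window_tokens=128_000))
```

If the text does not fit, chunking (brute force or RAG) is the usual fallback; if it does, switching to a larger-context model may be the simplest option.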

## Set up

We need some example data! Let's download an article about [cars from Wikipedia](https://en.wikipedia.org/wiki/Car) and load it as a LangChain `Document`.

In [2]:
import re

import requests
from langchain_community.document_loaders import BSHTMLLoader

# Download the content
response = requests.get("https://en.wikipedia.org/wiki/Car")
# Write it to a file
with open("car.html", "w", encoding="utf-8") as f:
    f.write(response.text)
# Load it with an HTML parser
loader = BSHTMLLoader("car.html")
document = loader.load()[0]
# Clean up the text:
# collapse consecutive newlines into one and replace non-breaking spaces
document.page_content = re.sub("\n\n+", "\n", document.page_content).replace(
    "\xa0", " "
)

In [3]:
print(len(document.page_content))

79251


## Define the schema

Here, we'll define a schema to extract key developments from the text.

In [4]:
from typing import List, Optional

from langchain_community.chat_models.gigachat import GigaChat
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.pydantic_v1 import BaseModel, Field

class KeyDevelopment(BaseModel):
    """Важная историческая дата."""

    # ^ The docstring above is passed as the function description
    # and can help improve the LLM's results.

    # Note:
    # 1. Each field has a `description`; it is passed as the description of the function's arguments
    # and can help improve the results.
    year: Optional[int] = Field(
        ..., description="Год исторического события. Не может быть null."
    )
    description: str = Field(
        ..., description="Описание. Что произошло в этом году? Каково было развитие?"
    )
    evidence: str = Field(
        ...,
        description="Повтори дословно предложения из текста, из которых были извлечены год и описание.",
    )


class ExtractionData(BaseModel):
    """Извлеченная информация о ключевых событиях в истории."""

    key_developments: List[KeyDevelopment]


# Define a custom prompt to provide instructions and any additional context.
# 1) You can add examples into the prompt template to improve extraction quality
# 2) Introduce additional parameters to take context into account (e.g., include metadata
#    about the document from which the text was extracted.)
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "Ты эксперт в извлечении важных исторических дат из текста. "
            "Извлекай только важные исторические события с годами. "
            "Если ты не можешь извлечь год, не записывай это в историческое событие",
        ),
        MessagesPlaceholder(
            "examples"
        ),  # Keep on reading through this use case to see how to use examples to improve performance
        ("human", "{text}"),
    ]
)

llm = GigaChat(
    verify_ssl_certs=False,
    timeout=6000,
    model="GigaChat-Pro",
    temperature=0.01,
)

extractor = prompt | llm.with_structured_output(
    schema=ExtractionData,
)


## Brute force approach

Split the document into chunks so that each chunk fits into the LLM's context window.

In [5]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    # Controls the size of each chunk
    chunk_size=2000,
    # Controls overlap between chunks
    chunk_overlap=20,
)

texts = text_splitter.split_text(document.page_content)
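To make `chunk_size` and `chunk_overlap` concrete, here is a simplified, pure-Python sketch of fixed-size chunking with overlap. The helper `split_with_overlap` is hypothetical and is not how `RecursiveCharacterTextSplitter` works internally: the real splitter prefers to break on separators such as paragraph and line boundaries, while this sketch cuts at exact character offsets.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start : start + chunk_size])
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one. A larger overlap lowers the chance of cutting a fact in half, at the cost of more chunks and more duplicated extractions.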

In [6]:
from typing import List, TypedDict
import itertools

from langchain_core.messages import (
    AIMessage,
    BaseMessage,
    HumanMessage,
    FunctionMessage,
)
from langchain_core.pydantic_v1 import BaseModel

#
# In this block we add examples of function calls to improve extraction quality.
# For more on examples, see example.ipynb.
#

class Example(TypedDict):
    """An example of function calls."""

    input: str  # Example input text
    function_calls: List[BaseModel]  # Pydantic models with example extractions
    function_outputs: List[str]


def tool_example_to_messages(example: Example) -> List[BaseMessage]:
    """Convert function-call examples into a message history."""
    messages: List[BaseMessage] = [HumanMessage(content=example["input"])]
    for function_call, function_output in itertools.zip_longest(
        example["function_calls"], example.get("function_outputs", [])
    ):
        messages.append(
            AIMessage(
                content="",
                additional_kwargs={
                    "function_call": {
                        # The name of the function right now corresponds
                        # to the name of the pydantic model
                        # This is implicit in the API right now,
                        # and will be improved over time.
                        "name": function_call.__class__.__name__,
                        "arguments": function_call.dict(),
                    },
                },
            )
        )
        output = "You have correctly called this tool."
        if function_output:
            output = function_output
        messages.append(
            FunctionMessage(name=function_call.__class__.__name__, content=output)
        )
    return messages


examples = [
    (
        "Техногенная авария «Размыв» Ленинградского-Петербургского метрополитена "
        "является крупнейшей в мировой практике метростроения[34]; была экранизирована "
        "в фильме «Прорыв» и послужила вдохновением для фильма «Метро»[35].",
        ExtractionData(
            key_developments=[
                KeyDevelopment(
                    year=None,
                    description="Техногенная авария 'Размыв' в "
                    "Ленинградском-Петербургском метрополитене является "
                    "крупнейшей в мировой практике метростроения",
                    evidence="была экранизирована в фильме 'Прорыв' и послужила "
                    "вдохновением для фильма 'Метро'",
                )
            ]
        ),
        """pydantic.v1.error_wrappers.ValidationError: 1 validation error for KeyDevelopment
year
  none is not an allowed value (type=type_error.none.not_allowed)""",
    ),
    (
        "In 1891, Auguste Doriot and his Peugeot colleague Louis Rigoulot completed "
        "the longest trip by a petrol-driven vehicle when their self-designed and "
        "built Daimler powered Peugeot Type 3 completed 2,100 kilometres (1,300 mi) "
        "from Valentigney to Paris and Brest and back again. They were attached to "
        "the first Paris–Brest–Paris bicycle race, but finished six days "
        "after the winning cyclist, Charles Terront.",
        ExtractionData(
            key_developments=[
                KeyDevelopment(
                    year=1891,
                    description="Август Дорио и его коллега Луи Риголу "
                    "завершают самую длинную поездку на бензиновом автомобиле",
                    evidence="In 1891, Auguste Doriot and his Peugeot colleague Louis Rigoulot completed the longest trip by a petrol-driven vehicle",
                )
            ]
        ),
        "You have correctly called this tool.",
    ),
    (
        "I love cats and dogs.",
        ExtractionData(key_developments=[]),
        "You have correctly called this tool.",
    ),
]


messages = []

for text, tool_call, function_output in examples:
    messages.extend(
        tool_example_to_messages(
            {
                "input": text,
                "function_calls": [tool_call],
                "function_outputs": [function_output],
            }
        )
    )

Use `.batch` to run the extraction in **parallel** across the chunks!

:::{.callout-tip}
You can often use `.batch()` to parallelize the extractions! `batch` uses a thread pool under the hood to help you parallelize workloads.

If your model is exposed via an API, this will likely speed up your extraction flow!
:::
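As a rough mental model of what a thread-pool-backed batch call does, the sketch below runs a function over several inputs with a cap on concurrency. This is an illustration built on the standard library only, not LangChain's actual implementation; `run_in_parallel` and `fake_extract` are hypothetical names, and `fake_extract` merely simulates a slow API call.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def run_in_parallel(
    func: Callable[[str], int], inputs: List[str], max_concurrency: int
) -> List[int]:
    """Apply func to every input using at most max_concurrency worker threads."""
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        # pool.map returns results in the same order as the inputs
        return list(pool.map(func, inputs))


def fake_extract(text: str) -> int:
    """Stand-in for a network-bound extraction call."""
    time.sleep(0.05)
    return len(text)


lengths = run_in_parallel(fake_extract, ["a", "bb", "ccc"], max_concurrency=2)
print(lengths)  # [1, 2, 3]
```

Threads help here because the work is dominated by waiting on the network; the concurrency cap keeps you under the provider's rate limits.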

In [None]:
# Limit to the first 10 chunks
# so the code can be re-run quickly
first_few = texts[:10]

extractions = extractor.batch(
    [{"text": text, "examples": messages} for text in first_few],
    {"max_concurrency": 5},  # limit the concurrency by passing max concurrency!
)

### Merge results

After extracting data from across the chunks, we'll want to merge the extractions together.

In [11]:
key_developments = []

for extraction in extractions:
    key_developments.extend(extraction.key_developments)

key_developments[:20]

[KeyDevelopment(year=None, description='Car, or an automobile, is a motor vehicle with wheels.', evidence='A car, or an automobile, is a motor vehicle with wheels.'),
 KeyDevelopment(year=1769, description='Французский изобретатель Николя-Жозеф Кюньо создал первый паровой автомобиль в 1769 году', evidence='French inventor Nicolas-Joseph Cugnot built the first steam-powered road vehicle in 1769'),
 KeyDevelopment(year=1886, description='Немецкий изобретатель Карл Бенц запатентовал свой Benz Patent-Motorwagen в 1886 году', evidence='The modern car—a practical, marketable automobile for everyday use—was invented in 1886, when German inventor Carl Benz patented his Benz Patent-Motorwagen.'),
 KeyDevelopment(year=1908, description='Модель Т, американский автомобиль, произведенный компанией Ford Motor Company, стал доступным для масс в 1908 году', evidence='One of the first cars affordable by the masses was the 1908 Model T, an American car manufactured by the Ford Motor Company.'),
 KeyDeve
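Overlapping chunks can yield the same item more than once, so a de-duplication pass after merging is often useful. The sketch below is one possible approach, not part of LangChain: it uses a hypothetical `Development` dataclass as a stand-in for `KeyDevelopment`, and it assumes that two items with the same year and the same whitespace- and case-normalized evidence are duplicates.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Development:
    """Stand-in for the KeyDevelopment model, kept dependency-free."""

    year: Optional[int]
    description: str
    evidence: str


def deduplicate(items: List[Development]) -> List[Development]:
    """Keep the first item for each (year, normalized evidence) pair."""
    seen = set()
    unique: List[Development] = []
    for item in items:
        key = (item.year, " ".join(item.evidence.lower().split()))
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


merged = [
    Development(1886, "Benz patents the Motorwagen", "Carl Benz  patented"),
    Development(1886, "Patent for the Motorwagen", "carl benz patented"),
    Development(1908, "Model T introduced", "the 1908 Model T"),
]
print(len(deduplicate(merged)))  # 2
```

Keying on the evidence rather than the description is a design choice: the evidence is meant to be a verbatim quote, so it tends to vary less between runs than a free-form description.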

## RAG based approach

Another simple idea is to chunk up the text, but instead of extracting information from every chunk, focus only on the most relevant chunks.

:::{.callout-caution}
It can be difficult to identify which chunks are relevant.

For example, in the `car` article we're using here, most of the article contains key development information. So by using
**RAG**, we'll likely be throwing out a lot of relevant information.

We suggest experimenting with your use case and determining whether this approach works or not.
:::
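To show what "relevant" means in practice, here is a toy retrieval sketch that ranks chunks by bag-of-words cosine similarity to a query and keeps the top `k`. It is a simplified stand-in for an embedding-based vector store: the helpers `bag_of_words`, `cosine`, and `top_k_chunks` and the sample chunks are made up for illustration, and word overlap is a much cruder signal than learned embeddings.

```python
import math
import re
from collections import Counter
from typing import List


def bag_of_words(text: str) -> Counter:
    """Count lowercase word occurrences."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_chunks(query: str, chunks: List[str], k: int) -> List[str]:
    """Return the k chunks most similar to the query."""
    query_vec = bag_of_words(query)
    ranked = sorted(
        chunks, key=lambda c: cosine(query_vec, bag_of_words(c)), reverse=True
    )
    return ranked[:k]


toy_chunks = [
    "Carl Benz patented the Motorwagen in 1886.",
    "Cars have controls for driving and parking.",
    "The Model T was introduced in 1908.",
]
print(top_k_chunks("When was the Motorwagen patented?", toy_chunks, k=1))
```

Only the selected chunks are sent to the extractor, which is why anything the ranking misses is lost.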

Here's a simple example that relies on the `FAISS` vectorstore.

In [40]:
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
from langchain_core.runnables import RunnableLambda, RunnableParallel
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter

texts = text_splitter.split_text(document.page_content)
vectorstore = FAISS.from_texts(texts, embedding=OpenAIEmbeddings())

retriever = vectorstore.as_retriever(
    search_kwargs={"k": 3}
)  # Only extract from the 3 most similar chunks

In this case the RAG extractor only looks at the top 3 chunks.

In [41]:
def combine_docs(docs):
    return "\n\n".join([doc.page_content for doc in docs])

rag_extractor = {
    "text": retriever | combine_docs,  # fetch and join the content of the top chunks
    "examples": lambda x: messages,
} | extractor

In [42]:
results = rag_extractor.invoke("Key developments")

Giga generation stopped with reason: function_call


In [43]:
for key_development in results.key_developments:
    print(key_development)

year=2018 description='Рост популярности автомобилей и поездок привел к заторам на дорогах.' evidence='Так, Москва, Стамбул, Богота, Мехико и Сан-Паулу были самыми загруженными городами в 2018 году, согласно данным компании INRIX, специализирующейся на анализе данных.'
year=1924 description='В Европе происходило то же самое.' evidence='Morris начал производство на конвейере в Ковли в 1924 году и вскоре стал продавать больше автомобилей, чем Ford, а также начал следовать практике вертикальной интеграции Ford, покупая двигатели, коробки передач и радиаторы у других компаний.'
year=None description='В Японии производство автомобилей было ограничено до Второй мировой войны.' evidence='Только несколько компаний производили автомобили в ограниченном количестве, и эти автомобили были небольшими, трехколесными для коммерческих целей или были результатом партнерства с европейскими компаниями.'
year=None description='Большинство автомобилей, используемых в начале 2020-х годов, работают на бензин

## Common issues

Different methods have their own pros and cons related to cost, speed, and accuracy.

Watch out for these issues:

* Chunking content means that the LLM can fail to extract information if the information is spread across multiple chunks.
* Large chunk overlap may cause the same information to be extracted twice, so be prepared to de-duplicate!
* LLMs can make up data. If you're looking for a single fact in a large text with the brute force approach, every chunk is another chance for the model to invent an answer, so you may end up with more made-up data.
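One inexpensive guard against made-up data is to check that each item's `evidence` actually appears in the source text. The sketch below is an assumption-laden illustration: `normalize` and `is_grounded` are hypothetical helpers, plain dicts stand in for the extracted models, and the match is a simple normalized substring test. It will also reject evidence that the model paraphrased or translated, so treat a failed check as a flag for review rather than proof of fabrication.

```python
from typing import Dict, List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line breaks do not block a match."""
    return " ".join(text.lower().split())


def is_grounded(evidence: str, source_text: str) -> bool:
    """Return True if the evidence occurs verbatim (after normalization) in the source."""
    needle = normalize(evidence)
    return bool(needle) and needle in normalize(source_text)


source = (
    "Carl Benz patented his Benz Patent-Motorwagen in 1886.\n"
    "The Model T followed in 1908."
)
candidates: List[Dict[str, object]] = [
    {"year": 1886, "evidence": "Carl Benz patented his Benz Patent-Motorwagen in 1886."},
    {"year": 1901, "evidence": "The first highway opened in 1901."},
]

grounded = [c for c in candidates if is_grounded(str(c["evidence"]), source)]
print([c["year"] for c in grounded])  # [1886]
```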