# LongContext Extraction
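Process a very long context by passing it to the LLM one chunk at a time, then combine the per-chunk results into a summary of the whole.
- Source material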
  - [How to handle long text when doing extraction](https://python.langchain.com/docs/how_to/extraction_long_text)

## Package installation

In [29]:
%pip install -qU langchain-community lxml faiss-cpu langchain-openai
%pip install -qU unstructured

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


## Loading environment variables

In [30]:
from dotenv import load_dotenv
import os

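# load_dotenv() reads .env and copies its variables into os.environ
load_dotenv()

# Fail early with a clear message if a key is missing
# (assigning os.getenv(...) back into os.environ raises TypeError when the value is None)
for key in ("OPENAI_API_KEY", "LANGSMITH_API_KEY"):
    if not os.getenv(key):
        raise RuntimeError(f"{key} is not set; add it to .env")

os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGSMITH_PROJECT"] = "langchain-study"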

## Fetching the source text for the context
Fetch a web page from the internet and convert it into a LangChain Document instance.

In [44]:
import re

import requests
from langchain_community.document_loaders import BSHTMLLoader

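# Fetch the web page from the internet
response = requests.get("https://en.wikipedia.org/wiki/Car")
response.raise_for_status()  # stop on an HTTP error instead of saving an error page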

# Write the page to a local file
with open("car.html", "w", encoding="utf-8") as f:
    f.write(response.text)

# Load the local file and create a Document instance
loader = BSHTMLLoader("car.html")
document = loader.load()[0]

# Collapse each run of two or more newlines into a single newline (removes blank lines)
document.page_content = re.sub("\n\n+", "\n", document.page_content)

print(len(document.page_content))

# print(document.page_content[:100])

79186
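
To see what the `re.sub("\n\n+", "\n", ...)` call above does, here is a minimal, self-contained sketch on a made-up string (the sample text is an assumption for illustration, not taken from the Wikipedia page). Any run of two or more newlines collapses to a single newline, so blank lines disappear while single line breaks are kept.

```python
import re

# Hypothetical sample text with runs of blank lines, as HTML-to-text output often has
sample = "Car\n\n\n\nA car is a motor vehicle.\nIt has wheels.\n\nHistory"

# Collapse every run of two or more newlines into a single newline
cleaned = re.sub(r"\n\n+", "\n", sample)

print(repr(cleaned))
```

The character count printed by the notebook is measured after this cleanup, so it is smaller than the raw page text.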


### Reference
A variant that fetches the web page directly from the URL without saving it to a file.

https://python.langchain.com/docs/integrations/document_loaders/url/

In [45]:
from langchain_community.document_loaders import UnstructuredURLLoader

urls = [
    "https://en.wikipedia.org/wiki/Car",
]
loader = UnstructuredURLLoader(urls=urls)
document = loader.load()[0]

# Collapse each run of two or more newlines into a single newline (removes blank lines)
document.page_content = re.sub("\n\n+", "\n", document.page_content)

print(len(document.page_content))
# print(document.page_content[3000:4000])

75209


### Splitting into chunks

In [32]:
from langchain_text_splitters import TokenTextSplitter

text_splitter = TokenTextSplitter(
    # Controls the size of each chunk
    chunk_size=2000,
    # Controls overlap between chunks
    chunk_overlap=20,
)

texts = text_splitter.split_text(document.page_content)

In [33]:
print(type(texts))
print(len(texts))
print(type(texts[0]))
print(len(texts[0]))

<class 'list'>
10
<class 'str'>
3508
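
`chunk_size` and `chunk_overlap` are counted in tokens, not characters, which is why `len(texts[0])` prints a character count rather than 2000. The sliding-window idea can be sketched with the standard library alone. This is an illustration under simplifying assumptions, not `TokenTextSplitter`'s actual implementation: the helper `split_with_overlap` is hypothetical, and whitespace-separated words stand in for real tokenizer tokens.

```python
from typing import List


def split_with_overlap(tokens: List[str], chunk_size: int, chunk_overlap: int) -> List[List[str]]:
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        chunks.append(tokens[start:end])
        if end == len(tokens):  # the window reached the end of the input
            break
        start += step
    return chunks


# Whitespace-separated words stand in for tokenizer tokens (an assumption)
words = "a b c d e f g h i j".split()
chunks = split_with_overlap(words, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

Each chunk here starts with the last token of the previous one. With an overlap that is small relative to the chunk size, as in the cell above, a fact that straddles a boundary may still end up split across two chunks.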


## Defining the schema
Define the shape of the summary to produce for each chunk.

In [34]:
from typing import List, Optional
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from pydantic import BaseModel, Field

# Schema for a single item of the summary
class ExtractionSchema(BaseModel):
    """Information about a development in the history of cars."""

    year: int = Field(
        ..., description="The year when there was an important historic development."
    )
    description: str = Field(
        ..., description="What happened in this year? What was the development?"
    )
    evidence: str = Field(
        ...,
        description="Repeat in verbatim the sentence(s) from which the year and description information were extracted",
    )


# Schema for the summary as a whole
# The summary is built as a list of ExtractionSchema items
class ExtractionData(BaseModel):
    """Extracted information about key developments in the history of cars."""

    extraction_result: List[ExtractionSchema]
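
Because the `evidence` field asks the model to repeat source sentences verbatim, each extracted item can be checked against the chunk it came from. The following is a minimal sketch of such a check and is not part of the original notebook: the helper `evidence_is_grounded` is a hypothetical name, and the sample strings are made up for illustration. Whitespace and case are normalized first, on the assumption that the model may reflow line breaks or change capitalization at the start of a quote.

```python
import re


def _normalize(text: str) -> str:
    """Collapse all whitespace runs to single spaces and lowercase the text."""
    return re.sub(r"\s+", " ", text).strip().lower()


def evidence_is_grounded(evidence: str, source_text: str) -> bool:
    """Return True if the evidence appears in the source, ignoring case and whitespace layout."""
    normalized = _normalize(evidence)
    return bool(normalized) and normalized in _normalize(source_text)


# Made-up chunk and evidence strings for illustration
chunk = "The modern car was invented in 1886,\nwhen Carl Benz patented his Benz Patent-Motorwagen."
grounded = evidence_is_grounded("invented in 1886, when Carl Benz patented", chunk)
ungrounded = evidence_is_grounded("invented in 1900 by someone else", chunk)
print(grounded, ungrounded)
```

Items that fail such a check are candidates for manual review, since the quoted sentence could not be found in the chunk.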

## Defining the extractor

In [35]:
from langchain_openai import ChatOpenAI


model = ChatOpenAI(model="gpt-4o-mini", temperature=0)

prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an expert at identifying key historic development in text. "
            "Only extract important historic developments. Extract nothing if no important information can be found in the text.",
        ),
        ("human", "{text}"),
    ]
)



extractor = prompt | model.with_structured_output(
    schema=ExtractionData,
    include_raw=False,
)

## Running the extractor
Pass the chunks to the extractor so that the content of each chunk is summarized according to the `ExtractionData` schema.

In [36]:
# For verification, process only the first three text chunks here
first_few = texts[:3]

results = extractor.batch(
    [{"text": text} for text in first_few],
    {"max_concurrency": 5},  # limit the concurrency by passing max concurrency!
)
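
`max_concurrency` caps how many chunks are processed at the same time. The effect can be imitated with a thread pool from the standard library. This is a sketch under stated assumptions rather than LangChain's implementation: `fake_extract` and `run_batch` are hypothetical names, `fake_extract` stands in for the extractor call, and no LLM is contacted.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def fake_extract(inputs: Dict[str, str]) -> Dict[str, int]:
    """Hypothetical stand-in for the extractor: returns the text length instead of calling an LLM."""
    return {"n_chars": len(inputs["text"])}


def run_batch(batch_inputs: List[Dict[str, str]], max_concurrency: int) -> List[Dict[str, int]]:
    """Process all inputs with at most max_concurrency running at once, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        # Executor.map yields results in the order of the inputs, not completion order
        return list(pool.map(fake_extract, batch_inputs))


sample_chunks = ["first chunk", "second chunk", "third"]
batch_results = run_batch([{"text": t} for t in sample_chunks], max_concurrency=2)
print(batch_results)
```

The results line up with the inputs by position, so they can be iterated in chunk order afterwards.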

Collect the results into a single list.

In [37]:
extraction_result = []

for extraction in results:
    extraction_result.extend(extraction.extraction_result)

extraction_result[:10]

[ExtractionSchema(year=1769, description='The first steam-powered road vehicle was built by the French inventor Nicolas-Joseph Cugnot.', evidence='The French inventor Nicolas-Joseph Cugnot built the first steam-powered road vehicle in 1769.'),
 ExtractionSchema(year=1808, description='The first internal combustion-powered automobile was designed and constructed by the Swiss inventor François Isaac de Rivaz.', evidence='the Swiss inventor François Isaac de Rivaz designed and constructed the first internal combustion-powered automobile in 1808.'),
 ExtractionSchema(year=1886, description='The modern car was invented when the German inventor Carl Benz patented his Benz Patent-Motorwagen.', evidence='the modern car—a practical, marketable automobile for everyday use—was invented in 1886, when the German inventor Carl Benz patented his Benz Patent-Motorwagen.'),
 ExtractionSchema(year=1901, description='The Oldsmobile Curved Dash was widely considered the first mass-produced car.', evidence