# Kor

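- Kor is a thin wrapper on top of LLMs (large language models) that helps extract structured data from text.
- To use Kor, specify the schema of the data to extract and provide a few extraction examples.
- Because LLM output is not always correct, the extraction can occasionally be wrong.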

In [1]:
!pip install -U --quiet kor==1.0.0


## Schema

To parse out the content you want, Kor needs a schema that describes the data to extract.

In [2]:
from kor.extraction import create_extraction_chain
from kor.nodes import Object, Text

In [3]:
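schema = Object(
    id="person",
    description="개인 정보",  # "personal information"
    examples=[
        # "Kim Cheol-su and Hong Gil-dong are friends" -> two first names
        ("김철수와 홍길동은 친구입니다", [{"first_name": "철수"}, {"first_name": "길동"}])
    ],
    attributes=[
        Text(
            id="first_name",
            description="사람의 이름.",  # "a person's name"
        )
    ],
    many=True,  # the text may mention more than one person
)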

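- The schema above consists of a single object node that contains one text attribute, `first_name`.
- Because the object can repeat (`many=True`), several objects are extracted when the text contains several names.
- As part of the schema, we provide a description of what to extract and one example (a sentence that contains two names).
- Including both a description and examples helps improve extraction quality.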

## LangChain

In [4]:
from langchain_openai.chat_models import ChatOpenAI

In [5]:
llm = ChatOpenAI(
    model_name="gpt-3.5-turbo",
    temperature=0,
    max_tokens=2000
)

In [6]:
chain = create_extraction_chain(llm, schema)



## Response parsing

In [7]:
# Prompt: "Make up two random Korean names"
llm.invoke("랜덤한 한국인 이름 두개 만들어줘")

AIMessage(content='1. 김지영\n2. 이승우', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 16, 'prompt_tokens': 25, 'total_tokens': 41, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'id': 'chatcmpl-BMBF716KCfpb07uaxbsZhALaczbuB', 'finish_reason': 'stop', 'logprobs': None}, id='run-a8f54475-e41c-4d41-b0e0-73eecd72f4c3-0', usage_metadata={'input_tokens': 25, 'output_tokens': 16, 'total_tokens': 41, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'audio': 0, 'reasoning': 0}})

In [8]:
# Input: "The students' names are Kim Ji-young and Lee Min-ho"
chain.run("학생들의 이름은 김지영, 이민호이다")["data"]


{'person': [{'first_name': '지영'}, {'first_name': '민호'}]}

In [8]:
chain.run("학생들의 이름은 김지영, 이민호이다")


{'data': {'person': [{'first_name': '지영'}, {'first_name': '민호'}]},
 'raw': 'first_name\n지영\n민호',
 'errors': [],
 'validated_data': {}}
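
Since extraction can fail, it may be worth checking the `errors` field before trusting `data`. The following is a minimal sketch, assuming the result dictionary has the shape printed above. `first_names` is a hypothetical helper written for this illustration, not part of Kor.

```python
def first_names(result: dict) -> list[str]:
    """Return the extracted first names, or raise if the chain reported errors."""
    if result.get("errors"):
        raise ValueError(f"extraction failed: {result['errors']}")
    # "person" is the schema id; a missing key is treated as "nothing found".
    people = result.get("data", {}).get("person", [])
    return [p["first_name"] for p in people if "first_name" in p]


# The dictionary returned by chain.run(...) above, copied here as a literal.
result = {
    "data": {"person": [{"first_name": "지영"}, {"first_name": "민호"}]},
    "raw": "first_name\n지영\n민호",
    "errors": [],
    "validated_data": {},
}
print(first_names(result))
```

Raising on a non-empty `errors` list is only one possible policy; logging and returning an empty list would be another.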

## How it works - Prompt Engineering

In [9]:
print(chain.prompt.format_prompt(text="[user input]").to_string())

Your goal is to extract structured information from the user's input that matches the form described below. When extracting information please make sure it matches the type information exactly. Do not add any attributes that do not appear in the schema shown below.

```TypeScript

person: Array<{ // 개인 정보
 first_name: string // 사람의 이름.
}>
```


Please output the extracted information in CSV format in Excel dialect. Please use a | as the delimiter. 
 Do NOT add any clarifying information. Output MUST follow the schema above. Do NOT add any additional columns that do not appear in the schema.



Input: 김철수와 홍길동은 친구입니다
Output: first_name
철수
길동

Input: [user input]
Output:
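
The prompt above embeds the schema example as an `Input:`/`Output:` pair, with the expected output written as `|`-delimited CSV. The sketch below reproduces that encoding with the standard library only. It is an illustration of the idea, not Kor's actual implementation, and `encode_rows` and `render_example` are hypothetical names.

```python
import csv
import io


def encode_rows(rows: list[dict], columns: list[str]) -> str:
    """Encode rows as CSV (Excel dialect) using '|' as the delimiter."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=columns, delimiter="|", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue().rstrip("\n")


def render_example(text: str, rows: list[dict], columns: list[str]) -> str:
    """Format one few-shot example the way it appears in the prompt."""
    return f"Input: {text}\nOutput: {encode_rows(rows, columns)}"


example_block = render_example(
    "김철수와 홍길동은 친구입니다",
    [{"first_name": "철수"}, {"first_name": "길동"}],
    ["first_name"],
)
print(example_block)
```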


## With pydantic

In [10]:
from kor import from_pydantic
from pydantic import BaseModel, Field

In [11]:
class Person(BaseModel):
    first_name: str = Field(description="사람 이름")  # "person's name"

In [12]:
schema, validator = from_pydantic(
    Person,
    description="개인정보",  # <-- description ("personal information")
    examples=[  # <-- examples
        ("김철수와 홍길동은 친구입니다", [{"first_name": "철수"}, {"first_name": "길동"}])
    ],
    many=True,  # <-- more than one person may appear
)

chain = create_extraction_chain(llm, schema, encoder_or_encoder_class="json", validator=validator)

In [13]:
print(chain.prompt.format_prompt(text="[user input]").to_string())

Your goal is to extract structured information from the user's input that matches the form described below. When extracting information please make sure it matches the type information exactly. Do not add any attributes that do not appear in the schema shown below.

```TypeScript

person: Array<{ // 개인정보
 first_name: string // 사람 이름
}>
```


Please output the extracted information in JSON format. Do not output anything except for the extracted information. Do not add any clarifying information. Do not add any fields that are not in the schema. If the text contains attributes that do not appear in the schema, please ignore them. All output must be in JSON format and follow the schema specified above. Wrap the JSON in <json> tags.



Input: 김철수와 홍길동은 친구입니다
Output: <json>{"person": [{"first_name": "철수"}, {"first_name": "길동"}]}</json>
Input: [user input]
Output:


In [14]:
chain.run("학생들의 이름은 김지영, 이민호이다")

{'data': {'person': [{'first_name': '지영'}, {'first_name': '민호'}]},
 'raw': '<json>{"person": [{"first_name": "지영"}, {"first_name": "민호"}]}</json>',
 'errors': [],
 'validated_data': [Person(first_name='지영'), Person(first_name='민호')]}
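
With the JSON encoder, the model is asked to wrap its answer in `<json>` tags, and the `raw` field above shows that wrapper. Below is a minimal sketch of how such a response could be decoded with the standard library. It is an assumption about the general approach rather than Kor's own decoder, and `decode_json_block` is a hypothetical name.

```python
import json
import re

# Non-greedy match so only the first <json>...</json> block is captured.
JSON_BLOCK = re.compile(r"<json>(.*?)</json>", re.DOTALL)


def decode_json_block(raw: str) -> dict:
    """Extract and parse the JSON payload wrapped in <json> tags."""
    match = JSON_BLOCK.search(raw)
    if match is None:
        raise ValueError("no <json>...</json> block found")
    return json.loads(match.group(1))


# The `raw` value from the result above.
raw = '<json>{"person": [{"first_name": "지영"}, {"first_name": "민호"}]}</json>'
print(decode_json_block(raw))
```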

In [16]:
chain = create_extraction_chain(llm, schema, encoder_or_encoder_class="csv", validator=validator)

In [17]:
print(chain.prompt.format_prompt(text="[user input]").to_string())

Your goal is to extract structured information from the user's input that matches the form described below. When extracting information please make sure it matches the type information exactly. Do not add any attributes that do not appear in the schema shown below.

```TypeScript

person: Array<{ // 개인정보
 first_name: string // 사람 이름
}>
```


Please output the extracted information in CSV format in Excel dialect. Please use a | as the delimiter. 
 Do NOT add any clarifying information. Output MUST follow the schema above. Do NOT add any additional columns that do not appear in the schema.



Input: 김철수와 홍길동은 친구입니다
Output: first_name
철수
길동

Input: [user input]
Output:


In [18]:
chain.run("학생들의 이름은 김지영, 이민호이다")

{'data': {'person': [{'first_name': '지영'}, {'first_name': '민호'}]},
 'raw': 'first_name\n지영\n민호',
 'errors': [],
 'validated_data': [Person(first_name='지영'), Person(first_name='민호')]}
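
With the CSV encoder, `raw` is a header line followed by one row per extracted object. The sketch below shows how that text could map back to the `data` structure, assuming the `|` delimiter requested in the prompt. It uses only the standard library and is not Kor's actual decoder; `decode_rows` is a hypothetical name.

```python
import csv
import io


def decode_rows(raw: str) -> list[dict]:
    """Parse '|'-delimited CSV text (header row first) into a list of dicts."""
    reader = csv.DictReader(io.StringIO(raw), delimiter="|")
    return [dict(row) for row in reader]


# The `raw` value from the result above.
raw = "first_name\n지영\n민호"
data = {"person": decode_rows(raw)}
print(data)
```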