## III. Information Extraction

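- Extraction is the process of parsing data out of a piece of text.
- It is usually used together with an extraction parser to structure the data.

1. Extract a structured row from a sentence to insert into a database
2. Extract multiple rows from a long document to insert into a database
3. Extract parameters from a user query to make an API call
4. The most popular extraction library at the moment is KOR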
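As a rough illustration of the first use case, the sketch below takes a record that an extraction step has already produced and inserts it into a database. The table name `fruits`, the helper `insert_row`, and the hard-coded record are all assumptions made for this example; only the standard-library `sqlite3` module is used.

```python
import sqlite3

# Hypothetical result of an extraction step; in practice this dict would come
# from parsing the model's reply.
extracted_row = {"name": "Apple", "emoji": "🍎"}


def insert_row(conn, row):
    """Insert one extracted record using named placeholders (no string formatting)."""
    conn.execute("INSERT INTO fruits (name, emoji) VALUES (:name, :emoji)", row)


# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fruits (name TEXT, emoji TEXT)")
insert_row(conn, extracted_row)
conn.commit()

rows = conn.execute("SELECT name, emoji FROM fruits").fetchall()
print(rows)
```

Named placeholders let the extracted dict be passed straight to `execute`, which avoids building SQL from model-generated text.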

### 1. Manual format conversion

In [None]:
import sys
sys.path.append("../")
from models import azure_chat_model


In [None]:

from langchain.schema import HumanMessage
from langchain.prompts import PromptTemplate, ChatPromptTemplate, HumanMessagePromptTemplate


# Parse the output and obtain structured data
from langchain.output_parsers import StructuredOutputParser, ResponseSchema

In [None]:
# Vanilla Extraction
instructions = """
You will be given a sentence with fruit names, extract those fruit names and assign an emoji to them
Return the fruit name and emojis in a python dictionary
"""

fruit_names = """
Apple, Pear, this is an kiwi
"""

In [5]:
# Make your prompt which combines the instructions w/ the fruit names
prompt = (instructions + fruit_names)

# Call the LLM
output = azure_chat_model([HumanMessage(content=prompt)])

print (output.content)
print (type(output.content))

{'Apple': '🍎', 'Pear': '🍐', 'kiwi': '🥝'}
<class 'str'>


In [6]:
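import ast

# ast.literal_eval only evaluates Python literals, so unlike eval it cannot run arbitrary code from the model's reply
output_dict = ast.literal_eval(output.content)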

print (output_dict)
print (type(output_dict))

{'Apple': '🍎', 'Pear': '🍐', 'kiwi': '🥝'}
<class 'dict'>
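The model is not guaranteed to reply with a clean dictionary literal, so the string-to-dict step is worth guarding. The following is a minimal sketch, not part of the original notebook: `parse_dict_reply` is a hypothetical helper that tries a Python literal first and falls back to JSON, and the sample reply is hard-coded.

```python
import ast
import json


def parse_dict_reply(text):
    """Parse a model reply that is expected to contain a dict, without eval."""
    text = text.strip()
    try:
        # Handles Python-style literals such as {'Apple': '🍎'}
        value = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        # Fall back to JSON, e.g. {"a": true}; raises ValueError on bad input
        value = json.loads(text)
    if not isinstance(value, dict):
        raise ValueError(f"expected a dict, got {type(value).__name__}")
    return value


reply = "{'Apple': '🍎', 'Pear': '🍐', 'kiwi': '🥝'}"
fruit_dict = parse_dict_reply(reply)
print(fruit_dict)
```

Any reply that is neither a Python literal nor valid JSON surfaces as a `ValueError`, which the caller can catch and handle.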


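### 2. Automatic format conversion

Automatically generate a prompt that includes format instructions.

This way there is no need to worry about prompt engineering for the output format; that part is left entirely to LangChain.

The LLM output is then converted into a Python object.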

In [7]:
response_schemas = [
    ResponseSchema(name="artist", description="The name of the musical artist"),
    ResponseSchema(name="song", description="The name of the song that the artist plays")
]

# The parser will parse the LLM output against the schema defined above and return the expected structured data
output_parser = StructuredOutputParser.from_response_schemas(response_schemas)


In [8]:
format_instructions = output_parser.get_format_instructions()
print(format_instructions)

The output should be a markdown code snippet formatted in the following schema, including the leading and trailing "```json" and "```":

```json
{
	"artist": string  // The name of the musical artist
	"song": string  // The name of the song that the artist plays
}
```
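The format instructions ask the model to wrap a JSON object in a fenced block. To make the parsing step less of a black box, here is a minimal standard-library sketch of how such a reply could be turned into a dict. It is an illustration only and not LangChain's actual implementation; `parse_fenced_json`, `FENCE`, and the sample reply are assumptions made for this example.

```python
import json
import re

# Three backticks, built programmatically so the example does not contain a literal fence
FENCE = "`" * 3


def parse_fenced_json(reply, expected_keys):
    """Extract the JSON object from a fenced block and check the expected keys."""
    pattern = re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE)
    match = re.search(pattern, reply, re.DOTALL)
    # If no fence is found, try to parse the whole reply as JSON
    payload = match.group(1) if match else reply
    data = json.loads(payload)
    missing = [key for key in expected_keys if key not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data


sample_reply = (
    FENCE + 'json\n{"artist": "Portugal. The Man", "song": "So Young"}\n' + FENCE
)
parsed = parse_fenced_json(sample_reply, ["artist", "song"])
print(parsed)
```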


In [9]:
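# This prompt differs from the prompts we built earlier for the chat model:
# it is a ChatPromptTemplate with the format instructions pre-filled, so that output_parser can convert the reply into a Python object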
prompt = ChatPromptTemplate(
    messages=[
        HumanMessagePromptTemplate.from_template("Given a command from the user, extract the artist and song names \n \
                                                    {format_instructions}\n{user_prompt}")  
    ],
    input_variables=["user_prompt"],
    partial_variables={"format_instructions": format_instructions}
)

In [10]:
fruit_query = prompt.format_prompt(user_prompt="I really like So Young by Portugal. The Man")

print (fruit_query.messages[0].content)

Given a command from the user, extract the artist and song names 
                                                     The output should be a markdown code snippet formatted in the following schema, including the leading and trailing "```json" and "```":

```json
{
	"artist": string  // The name of the musical artist
	"song": string  // The name of the song that the artist plays
}
```
I really like So Young by Portugal. The Man


In [11]:
fruit_output = azure_chat_model(fruit_query.to_messages())
output = output_parser.parse(fruit_output.content)

print (output)
print (type(output))
# Note: because we are using a turbo model, the generated result is not guaranteed to be the same on every run
# Switching to a GPT-4 model may be a better choice

{'artist': 'Portugal. The Man', 'song': 'So Young'}
<class 'dict'>
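Because replies can vary between runs, one common precaution is to retry when a reply fails to parse. The sketch below shows that idea in isolation. `extract_with_retry` is a hypothetical helper, the model call is faked with a fixed list of replies, and it assumes the parser signals failure with a `ValueError`; adjust the exception type to whatever your parser actually raises.

```python
import json


def extract_with_retry(call_model, parse, max_attempts=3):
    """Call the model until its reply parses, up to max_attempts times."""
    last_error = None
    for _ in range(max_attempts):
        reply = call_model()
        try:
            return parse(reply)
        except ValueError as err:  # assumption: the parser raises ValueError
            last_error = err
    raise RuntimeError(
        f"no parseable reply after {max_attempts} attempts"
    ) from last_error


# Fake model for illustration: the first reply is unusable, the second is valid JSON.
fake_replies = iter([
    "sorry, I cannot do that",
    '{"artist": "Portugal. The Man", "song": "So Young"}',
])
result = extract_with_retry(lambda: next(fake_replies), json.loads)
print(result)
```

In the notebook's setting, `call_model` would wrap the chat model call and `parse` would be the output parser's `parse` method.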
