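# Information Extraction
- Extraction is the process of parsing structured data out of a piece of text.
- It is usually paired with an output parser, which turns the model's reply into structured data.

Typical use cases:

1. Extracting a structured row from a sentence to insert into a database
2. Extracting multiple rows from a long document to insert into a database
3. Extracting parameters from a user query to make an API call

At the time of writing, the most popular dedicated extraction library was Kor.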
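The first use case can be sketched with the standard library alone. This is a minimal illustration, not part of the original notebook: the `people` table, its columns, and the hard-coded `llm_reply` string are assumptions standing in for a real model reply.

```python
import json
import sqlite3

# Hypothetical LLM reply for the sentence "Alice is 30 and lives in Paris".
# In practice this string would come from the model; it is hard-coded here.
llm_reply = '{"name": "Alice", "age": 30, "city": "Paris"}'


def insert_person(conn: sqlite3.Connection, reply: str) -> None:
    """Parse one JSON object from the reply and insert it as a row."""
    record = json.loads(reply)
    conn.execute(
        "INSERT INTO people (name, age, city) VALUES (?, ?, ?)",
        (record["name"], int(record["age"]), record["city"]),
    )


# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER, city TEXT)")
insert_person(conn, llm_reply)

rows = conn.execute("SELECT name, age, city FROM people").fetchall()
print(rows)
```

The parameterized `INSERT` keeps model-generated text from being interpolated directly into SQL.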

In [1]:


# Imports
from typing import Any, List, Mapping, Optional, Dict
from langchain_core.callbacks.manager import CallbackManagerForLLMRun
from langchain_core.language_models.llms import LLM
from zhipuai import ZhipuAI

import os

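# Custom LLM wrapper that subclasses langchain_core.language_models.llms.LLM
class ZhipuAILLM(LLM):
    # Model name; defaults to glm-3-turbo
    model: str = "glm-3-turbo"
    # Sampling temperature
    temperature: float = 0.1
    # API key, read from the ZHIPUAI_API_KEY environment variable instead of being hard-coded
    api_key: str = os.environ.get("ZHIPUAI_API_KEY", "")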
    
    def _call(self, prompt : str, stop: Optional[List[str]] = None,
                run_manager: Optional[CallbackManagerForLLMRun] = None,
                **kwargs: Any):
        client = ZhipuAI(
            api_key = self.api_key
        )

        def gen_glm_params(prompt):
            '''
            Build the `messages` request parameter for the GLM model.

            Args:
                prompt: the user prompt
            '''
            messages = [{"role": "user", "content": prompt}]
            return messages
        
        messages = gen_glm_params(prompt)
        response = client.chat.completions.create(
            model = self.model,
            messages = messages,
            temperature = self.temperature
        )

        if len(response.choices) > 0:
            return response.choices[0].message.content
        return "generate answer error"


    # Default parameters used when calling the API
    @property
    def _default_params(self) -> Dict[str, Any]:
        """Return the default parameters for API calls."""
        normal_params = {
            "temperature": self.temperature,
            }
        return {**normal_params}

    @property
    def _llm_type(self) -> str:
        return "Zhipu"

    @property
    def _identifying_params(self) -> Mapping[str, Any]:
        """Get the identifying parameters."""
        return {**{"model": self.model}, **self._default_params}

In [2]:
llm = ZhipuAILLM()

## 1. Manual format conversion

In [3]:
from langchain.schema import HumanMessage, AIMessage
from langchain.prompts import PromptTemplate, ChatPromptTemplate, HumanMessagePromptTemplate

# Parse the output and get structured data
from langchain.output_parsers import StructuredOutputParser, ResponseSchema

In [4]:
# Vanilla Extraction
instructions = """
You will be given a sentence with fruit names, extract those fruit names and assign an emoji to them
Return the fruit name and emojis in a python dictionary
"""

fruit_names = """
Apple, Pear, this is an kiwi
"""

In [18]:
prompt = instructions + fruit_names

# Call the LLM
output = llm.invoke(prompt)

print(output)

Here's a Python dictionary with the fruit names and their corresponding emojis:

```python
fruit_dict = {
    'Apple': '🍎',
    'Pear': '🍐',
    'Kiwi': '🥝'
}
```

Note: I've included "Kiwi" as it was mentioned in your sentence. If you meant "kiwi" to be a part of a larger word (like "this is an kiwi"), please clarify, and I'll adjust the entry accordingly.
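The reply above is still a plain string: the dictionary is wrapped in prose and a code fence, so it has to be cut out and converted by hand. One way this could be done is sketched below; `extract_dict` is a hypothetical helper, the reply is a shortened copy of the output above, and the non-greedy pattern assumes the dictionary is flat (no nested braces).

```python
import ast
import re

# The fence marker is built at runtime so this example can sit inside a fenced block.
FENCE = "`" * 3

# Shortened copy of the model reply shown above.
reply = (
    "Here's a Python dictionary with the fruit names and their corresponding emojis:\n\n"
    f"{FENCE}python\n"
    "fruit_dict = {\n"
    "    'Apple': '🍎',\n"
    "    'Pear': '🍐',\n"
    "    'Kiwi': '🥝'\n"
    "}\n"
    f"{FENCE}\n\n"
    "Note: I've included \"Kiwi\" as it was mentioned in your sentence."
)


def extract_dict(text: str) -> dict:
    """Return the first flat {...} literal found in the reply as a Python dict."""
    match = re.search(r"\{.*?\}", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no dictionary literal found in the reply")
    # literal_eval only accepts Python literals, so it does not execute arbitrary code.
    return ast.literal_eval(match.group(0))


fruits = extract_dict(reply)
print(fruits)
```

This works only as long as the model keeps the same layout, which is the fragility that motivates the parser-based approach in the next section.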


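## 2. Automatic format conversion
Instead of writing the format instructions by hand, let LangChain generate a prompt that already contains them.

This hands most of the output-format prompt engineering over to LangChain.

An output parser then converts the LLM's reply into a Python object.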

In [19]:
response_schemas = [
    ResponseSchema(name="artist", description="The name of the musical artist"),
    ResponseSchema(name="song", description="The name of the song that the artist plays")
]

# The parser parses the LLM output against the schema defined above and returns the expected structured data
output_parser = StructuredOutputParser.from_response_schemas(response_schemas)

In [20]:
format_instructions = output_parser.get_format_instructions()
print(format_instructions)

The output should be a markdown code snippet formatted in the following schema, including the leading and trailing "```json" and "```":

```json
{
	"artist": string  // The name of the musical artist
	"song": string  // The name of the song that the artist plays
}
```


In [21]:
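# This prompt differs from the prompts we built earlier for chat models: it is a ChatPromptTemplate
# that injects the parser's format instructions. The output parser, not the template, converts the reply into a Python object.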
prompt = ChatPromptTemplate(
    messages=[
        HumanMessagePromptTemplate.from_template("Given a command from the user, extract the artist and song names \n \
                                                    {format_instructions}\n{user_prompt}")  
    ],
    input_variables=["user_prompt"],
    partial_variables={"format_instructions": format_instructions}
)

In [22]:

fruit_query = prompt.format_prompt(user_prompt="I really like So Young by Portugal. The Man")

print (fruit_query.messages[0].content)

Given a command from the user, extract the artist and song names 
                                                     The output should be a markdown code snippet formatted in the following schema, including the leading and trailing "```json" and "```":

```json
{
	"artist": string  // The name of the musical artist
	"song": string  // The name of the song that the artist plays
}
```
I really like So Young by Portugal. The Man


In [31]:
# fruit_output = llm(fruit_query.to_messages())

# The custom ZhipuAI (ChatGLM) class is called differently from an OpenAI chat model: pass the prompt value to invoke()
fruit_output = llm.invoke(fruit_query)
print(fruit_output)

output = output_parser.parse(fruit_output)

print (output)
print (type(output))

```json
{
	"artist": "Portugal. The Man",
	"song": "So Young"
}
```
{'artist': 'Portugal. The Man', 'song': 'So Young'}
<class 'dict'>
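To make the parsing step less opaque, here is a simplified sketch of the idea behind it: find the fenced `json` block, load it, and check that the expected keys are present. This is an illustration under stated assumptions, not LangChain's actual implementation; `parse_fenced_json` is a hypothetical helper and the reply is copied from the output above.

```python
import json
import re
from typing import Dict, List

# The fence marker is built at runtime so this example can sit inside a fenced block.
FENCE = "`" * 3

# Reply text copied from the model output above.
reply = "\n".join([
    FENCE + "json",
    "{",
    '\t"artist": "Portugal. The Man",',
    '\t"song": "So Young"',
    "}",
    FENCE,
])


def parse_fenced_json(text: str, expected_keys: List[str]) -> Dict[str, str]:
    """Extract the first fenced json block and verify the expected keys exist."""
    pattern = FENCE + r"json\s*(.*?)" + FENCE
    match = re.search(pattern, text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no fenced json block found in the reply")
    data = json.loads(match.group(1))
    missing = [key for key in expected_keys if key not in data]
    if missing:
        raise ValueError(f"reply is missing expected keys: {missing}")
    return data


parsed = parse_fenced_json(reply, ["artist", "song"])
print(parsed)
print(type(parsed))
```

Checking the keys up front means a malformed reply fails at the parsing step rather than later, when a missing field is first accessed.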
