# Tagging and Extraction

- [I. Set OpenAI API Key](#I. Set OpenAI-API-Key)
- [II. Tagging](#II. Tagging)
- [2.1 Create Tagging Function](#2.1-Create Tagging Function)
- [2.2 Implement Tagging through LangChain](#2.2-Implement Tagging through LangChain)
- [2.3 Structured Parsing Tagging Results](#2.3-Structural Parsing Tagging Results)
- [III. Extraction](#III. Extraction)
- [3.1 Create Extraction Function](#3.1-Create Extraction Function)
- [3.2 Implement Extraction Function through LangChain](#3.2-Implement Extraction Function through LangChain)
- [3.3 Structured Parsing Extraction Results](#3.3-Structural Parsing Extraction Results)
- [IV. Application Cases](#IV. Application Cases)
- [4.1 Loading data](#4.1-Loading data)
- [4.2 Extract article overview](#4.2-Extract article overview)
- [4.3 Extract article information](#4.3-Extract article information)
- [4.4 Block text extraction](#4.4-Block text extraction)

# 1. Set OpenAI API Key

For details, see the `Set OpenAI_API_KEY.ipynb` notebook.

# 2. Tagging

What is `Tagging`?
- Given a function description, the LLM selects arguments from the input text and returns them as a structured function call.
- More generally, the LLM can evaluate input text and generate **structured output**.
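
To make the two points above concrete, here is a minimal sketch of the shape of a function description and of the function call a model might return for it. The reply below is a hand-written stand-in, not real model output, and only the standard `json` module is used.

```python
import json

# Function description in the OpenAI function-calling format
# (a trimmed-down, illustrative version of what is generated later in this notebook).
function_description = {
    "name": "Tagging",
    "description": "Tag the piece of text with particular info.",
    "parameters": {
        "type": "object",
        "properties": {
            "sentiment": {"type": "string"},
            "language": {"type": "string"},
        },
        "required": ["sentiment", "language"],
    },
}

# Hand-written stand-in for a model reply: the arguments arrive as a JSON string.
function_call = {
    "name": "Tagging",
    "arguments": '{"sentiment": "pos", "language": "en"}',
}

# Decode the arguments and check that every required parameter was filled in.
arguments = json.loads(function_call["arguments"])
required = function_description["parameters"]["required"]
missing = [key for key in required if key not in arguments]

print(arguments)
print(missing)
```

The structured output is simply the decoded `arguments` dictionary; the function itself never has to be executed.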

## 2.1 Create Tagging function

We define a `Tagging` class that inherits from Pydantic's `BaseModel`, so its fields are validated against their declared types. The class has two fields, `sentiment` and `language`:
- `sentiment`: the sentiment of the text, one of positive (`pos`), negative (`neg`), or neutral (`neutral`).
- `language`: the language the text is written in, given as an ISO 639-1 code.
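
Declaring both fields as `str` checks only the type, not the allowed values. As a rough sketch of how a tagging result could be checked afterwards, the hypothetical helper `check_tagging` below assumes the English labels `pos`, `neg` and `neutral`; it is not part of LangChain or Pydantic.

```python
import re

# Assumed label set, taken from the field description above.
ALLOWED_SENTIMENTS = {"pos", "neg", "neutral"}


def check_tagging(result: dict) -> list:
    """Return a list of problems found in a tagging result (empty if none)."""
    problems = []
    if result.get("sentiment") not in ALLOWED_SENTIMENTS:
        problems.append("sentiment must be one of pos, neg, neutral")
    # ISO 639-1 codes are two lowercase letters. This checks the shape only,
    # not whether the code is actually assigned to a language.
    if not re.fullmatch(r"[a-z]{2}", str(result.get("language", ""))):
        problems.append("language must be a two-letter ISO 639-1 code")
    return problems


print(check_tagging({"sentiment": "pos", "language": "zh"}))
print(check_tagging({"sentiment": "happy", "language": "Chinese"}))
```

If the field descriptions ask for labels in another language, the allowed set would need to be changed to match.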

In [None]:
# Import modules
from typing import List  
from pydantic import BaseModel, Field  
from langchain.utils.openai_functions import convert_pydantic_to_openai_function

In [None]:
# Create the Tagging class
# It labels the input text with a sentiment (positive, negative or neutral) and a language
class Tagging(BaseModel):
    """Mark this text with specific information."""
    # Sentiment label of the text: positive, negative or neutral
    sentiment: str = Field(description="文本的情绪，请从“正面”、“负面”或“中立”中选择")
    # Language of the text, as an ISO 639-1 code
    language: str = Field(description="文本语言(应采用ISO 639-1代码)")

In [None]:
# Convert Tagging data model to OpenAI function
convert_pydantic_to_openai_function(Tagging)

{'name': 'Tagging',
 'description': '用特定信息标记这段文本。',
 'parameters': {'title': 'Tagging',
  'description': '用特定信息标记这段文本。',
  'type': 'object',
  'properties': {'sentiment': {'title': 'Sentiment',
    'description': '文本的情绪，请从“正面”、“负面”或“中立”中选择',
    'type': 'string'},
   'language': {'title': 'Language',
    'description': '文本语言(应采用ISO 639-1代码)',
    'type': 'string'}},
  'required': ['sentiment', 'language']}}

## 2.2 Tagging through LangChain

Next, we convert the `Tagging` class into a function description that the OpenAI API can recognize.

In [None]:
# Import modules
from langchain.prompts import ChatPromptTemplate 
from langchain.chat_models import ChatOpenAI

In [None]:
# Create a ChatOpenAI model instance with a temperature of 0
model = ChatOpenAI(temperature=0)  

In [None]:
# Apply Tagging
tagging_functions = [convert_pydantic_to_openai_function(Tagging)]

With the function description in hand, we use `LCEL` syntax to build a chain in three steps: create the prompt, bind the function description to the model, and pipe the prompt into the bound model.
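
The `|` operator used by `LCEL` composes steps from left to right. The toy class below is only an illustration of that idea, under the assumption that each step exposes an `invoke` method; `Step`, `fake_prompt` and `fake_model` are hypothetical names, and LangChain's real runnables do considerably more.

```python
class Step:
    """Toy stand-in for a runnable: wraps a function and supports `|`."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # Compose left to right: the output of self feeds into other.
        return Step(lambda value: other.invoke(self.invoke(value)))

    def invoke(self, value):
        return self.func(value)


# Hypothetical stand-ins for a prompt template and a model.
fake_prompt = Step(lambda inputs: f"Tag this text: {inputs['input']}")
fake_model = Step(lambda text: {"echo": text})

toy_chain = fake_prompt | fake_model
print(toy_chain.invoke({"input": "I love langchain"}))
# prints {'echo': 'Tag this text: I love langchain'}
```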

In [None]:
# Use the from_messages method of ChatPromptTemplate to create a chat prompt template
prompt = ChatPromptTemplate.from_messages([
    ("system", "仔细思考，然后按指示标记文本"),
    ("user", "{input}")
])

In [None]:
# Bind the model to the function and specify the name of the function call
model_with_functions = model.bind(
    functions=tagging_functions,
    function_call={"name": "Tagging"}
)

In [None]:
# Create the tagging chain by combining the prompt template and the model
tagging_chain = prompt | model_with_functions

In [None]:
# Call the tagging chain and pass in the input text
tagging_chain.invoke({"input": "我爱langchain"})

AIMessage(content='', additional_kwargs={'function_call': {'name': 'Tagging', 'arguments': '{\n  "sentiment": "正面",\n  "language": "zh"\n}'}}, example=False)

In [None]:
# Call the tagging chain again and pass in another input text
tagging_chain.invoke({"input": "我想要问的不是这些问题"})

AIMessage(content='', additional_kwargs={'function_call': {'name': 'Tagging', 'arguments': '{\n  "sentiment": "中立",\n  "language": "zh"\n}'}}, example=False)

## 2.3 Structured parsing of Tagging results

The output above is an `AIMessage` whose function-call arguments are stored as a JSON string. To get a plain Python dictionary instead, we can append a JSON output parser to the chain using `LCEL` syntax.
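
Conceptually, such a parser only has to decode the JSON string stored under `function_call` → `arguments`. The sketch below does this by hand with the standard `json` module, using the same payload as the first `AIMessage` output above; `parse_function_arguments` is a hypothetical helper, not LangChain's implementation.

```python
import json

# Same payload as the `additional_kwargs` of the first AIMessage shown above.
additional_kwargs = {
    "function_call": {
        "name": "Tagging",
        "arguments": '{\n  "sentiment": "正面",\n  "language": "zh"\n}',
    }
}


def parse_function_arguments(kwargs: dict) -> dict:
    """Decode the JSON string stored under function_call.arguments."""
    return json.loads(kwargs["function_call"]["arguments"])


parsed = parse_function_arguments(additional_kwargs)
print(parsed)
# prints {'sentiment': '正面', 'language': 'zh'}
```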

In [None]:
# Import JsonOutputFunctionsParser from the langchain.output_parsers.openai_functions module
from langchain.output_parsers.openai_functions import JsonOutputFunctionsParser  

In [None]:
# Create a new tagging chain that combines the prompt template, the model and a JsonOutputFunctionsParser
tagging_chain = prompt | model_with_functions | JsonOutputFunctionsParser()

In [None]:
# Call the tagging chain and pass in the input text
tagging_chain.invoke({"input": "我爱langchain"})

{'sentiment': '正面', 'language': 'zh'}

# 3. Extraction

What is Extraction?
- Extraction is similar to Tagging, but it pulls out multiple pieces of information at once.
- Given a JSON schema as input, the LLM, which has been fine-tuned for this task, finds and fills in the parameters of that schema.
- This capability is not limited to function calling and can be used for general-purpose extraction.

## 3.1 Create Extraction function

In [None]:
# Import modules
from typing import Optional  
from pydantic import BaseModel, Field  

We define two classes, `Person` and `Information`:
- The `Person` class has two fields, `name` and `age`, where `age` is optional.
- The `Information` class has a single `people` field, which is a list (`List`) of `Person` objects.
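
The nesting described above, a list of people in which `age` may be absent, can be sketched with the standard library alone. `PersonRecord` and `load_people` are hypothetical names, and the input string is hand-written sample data rather than model output.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PersonRecord:
    name: str
    age: Optional[int] = None  # a missing age stays None instead of being guessed


def load_people(arguments: str) -> List[PersonRecord]:
    """Turn a function-call arguments string into PersonRecord objects."""
    payload = json.loads(arguments)
    return [PersonRecord(**item) for item in payload["people"]]


people = load_people('{"people": [{"name": "Joe", "age": 30}, {"name": "Martha"}]}')
print(people)
```

Keeping `age` optional lets an extractor leave the field out when the text does not state it.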

In [None]:
# Create the Person class
class Person(BaseModel):
    """personal information"""
    name: str = Field(description="人的名字")  # The person's name
    age: Optional[int] = Field(description="人的年龄")  # The person's age, optional field

In [None]:
# Create the Information class
class Information(BaseModel):
    """Information to extract"""
    people: List[Person] = Field(description="关于人的信息列表")  # List of information about people

In [None]:
# Convert the Information data model to an OpenAI function
convert_pydantic_to_openai_function(Information)

{'name': 'Information',
 'description': '要提取的信息',
 'parameters': {'title': 'Information',
  'description': '要提取的信息',
  'type': 'object',
  'properties': {'people': {'title': 'People',
    'description': '关于人的信息列表',
    'type': 'array',
    'items': {'title': 'Person',
     'description': '个人信息',
     'type': 'object',
     'properties': {'name': {'title': 'Name',
       'description': '人的名字',
       'type': 'string'},
      'age': {'title': 'Age', 'description': '人的年龄', 'type': 'integer'}},
     'required': ['name']}}},
  'required': ['people']}}

In [None]:
# Create a list of extraction functions and bind it to the model
extraction_functions = [convert_pydantic_to_openai_function(Information)]  
extraction_model = model.bind(functions=extraction_functions, function_call={"name": "Information"})  

In [None]:
# Call the extraction model and pass in the text
extraction_model.invoke("乔30岁，他妈妈叫玛莎")

AIMessage(content='', additional_kwargs={'function_call': {'name': 'Information', 'arguments': '{\n  "people": [\n    {\n      "name": "乔",\n      "age": 30\n    },\n    {\n      "name": "玛莎",\n      "age": 0\n    }\n  ]\n}'}}, example=False)

## 3.2 Creating an Extraction Function through LangChain

In [None]:
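# Use ChatPromptTemplate to create a prompt template that tells the model not to guess
# missing values (note the invented "age": 0 in the output above)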
prompt = ChatPromptTemplate.from_messages([
    ("system", "提取相关信息，如果没有明确提供不要猜测。可以提取部分信息"), 
    ("human", "{input}")  
])

In [None]:
# Create an extraction chain, combining the prompt template and the extraction model
extraction_chain = prompt | extraction_model

In [None]:
# Call the extraction chain and pass in the input text
extraction_chain.invoke({"input": "乔30岁，他妈妈叫玛莎"})

AIMessage(content='', additional_kwargs={'function_call': {'name': 'Information', 'arguments': '{\n  "people": [\n    {\n      "name": "乔",\n      "age": 30\n    },\n    {\n      "name": "玛莎"\n    }\n  ]\n}'}}, example=False)

In [None]:
# Create a new extraction chain and add JsonOutputFunctionsParser to parse the output
extraction_chain = prompt | extraction_model | JsonOutputFunctionsParser()

In [None]:
# Call the extraction chain again
extraction_chain.invoke({"input": "乔30岁，他妈妈叫玛莎"})

{'people': [{'name': '乔', 'age': 30}, {'name': '玛莎'}]}

## 3.3 Structured Parsing Extraction Results

In [None]:
# Import modules
from langchain.output_parsers.openai_functions import JsonKeyOutputFunctionsParser 

In [None]:
# Create an extraction chain and specify the key "people" so that only that value is returned
extraction_chain = prompt | extraction_model | JsonKeyOutputFunctionsParser(key_name="people")

In [None]:
# Call the extraction chain and pass in the input text
extraction_chain.invoke({"input": "乔30岁，他妈妈叫玛莎"})

[{'name': '乔', 'age': 30}, {'name': '玛莎'}]

# 4. Application Cases

We can apply tagging and extraction to a larger body of text. For example, we load a blog post and run both on part of its text.

## 4.1 Loading data

In [None]:
# Loading documents using WebBaseLoader
from langchain.document_loaders import WebBaseLoader  
loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/") 
documents = loader.load() 

In [None]:
# Get the first document
doc = documents[0]  

In [None]:
# Get the first 10,000 characters of the page content
page_content = doc.page_content[:10000]  

## 4.2 Extract article overview

In [None]:
# Import BaseModel and Field from pydantic to create data models
from pydantic import BaseModel, Field  

Define a Pydantic class `Overview` with three fields:
- `summary`: a summary of the article content
- `language`: the language the article is written in
- `keywords`: keywords related to the article

In [None]:
# Create the Overview class
class Overview(BaseModel):
    """An overview of a text"""
    summary: str = Field(description="提供内容的简明总结。")  # Content summary
    language: str = Field(description="提供编写内容所用的语言。")  # Content language
    keywords: str = Field(description="提供与内容相关的关键字。")  # Keywords

In [None]:
# Convert the Overview data model to an OpenAI function
overview_tagging_function = [
    convert_pydantic_to_openai_function(Overview)
]
tagging_model = model.bind(
    functions=overview_tagging_function,
    function_call={"name":"Overview"}  # Force the model to call Overview
)
tagging_chain = prompt | tagging_model | JsonOutputFunctionsParser()  # Create the tagging chain and append the parser

In [None]:
# Call the tagging chain
tagging_chain.invoke({"input": page_content})

{'summary': 'LLM Powered Autonomous Agents is a concept of building agents with LLM (large language model) as its core controller. It involves several key components such as planning, memory, and tool use. The agent breaks down tasks into smaller subgoals, utilizes short-term and long-term memory, and learns to call external APIs for additional information. Self-reflection is also an important aspect for agents to improve iteratively. There are various techniques and frameworks, such as Chain of Thought, ReAct, Reflexion, and Chain of Hindsight, that enable agents to plan, reflect, and improve their performance.',
 'language': 'English',
 'keywords': 'LLM, autonomous agents, planning, memory, tool use, self-reflection, Chain of Thought, ReAct, Reflexion, Chain of Hindsight'}

## 4.3 Extracting article information

In [None]:
# Create a Paper class holding a title and an author
class Paper(BaseModel):
    """Information about the mentioned paper."""
    title: str  # Paper title
    author: Optional[str]  # Author, optional field

# Create an Info class used to extract a list of papers
class Info(BaseModel):
    """Information to extract"""
    papers: List[Paper] 

In [None]:
# Convert the Info data model to an OpenAI function
paper_extraction_function = [
    convert_pydantic_to_openai_function(Info)
]
extraction_model = model.bind(
    functions=paper_extraction_function, 
    function_call={"name":"Info"}  # Force the model to call Info
)

In [None]:
# Create an extraction chain and add a parser
extraction_chain = prompt | extraction_model | JsonKeyOutputFunctionsParser(key_name="papers") 

In [None]:
# Call the extraction chain. It returns the title of the article itself rather than
# the papers the article mentions, so we refine the prompt next
extraction_chain.invoke({"input": page_content})  

[{'title': 'LLM Powered Autonomous Agents', 'author': 'Lilian Weng'}]

In [None]:
template = """
An article will be passed to you. Extract from it all papers that are mentioned by this article. 
Do not extract the name of the article itself. If no papers are mentioned that's fine - you don't need to extract any! Just return an empty list.
Do not make up or guess ANY extra information. Only extract what exactly is in the text.
"""

template_chinese = """
一篇文章将转交给你。把这篇文章中提到的所有论文都摘录出来。
不要提取文章本身的名称。如果没有提到论文，那很好——你不需要提取任何论文!只返回一个空列表。
不要编造或猜测任何额外的信息。只提取文本中的内容。
"""

In [None]:
# Create a chat prompt using a custom prompt template
prompt = ChatPromptTemplate.from_messages([
    ("system", template_chinese),
    ("human", "{input}")
])

In [None]:
# Recreate the extraction chain
extraction_chain = prompt | extraction_model | JsonKeyOutputFunctionsParser(key_name="papers")  

In [None]:
# Call the extraction chain again
extraction_chain.invoke({"input": page_content})  

[{'title': 'Chain of thought (CoT; Wei et al. 2022)', 'author': ''},
 {'title': 'Tree of Thoughts (Yao et al. 2023)', 'author': ''},
 {'title': 'LLM+P (Liu et al. 2023)', 'author': ''},
 {'title': 'ReAct (Yao et al. 2023)', 'author': ''},
 {'title': 'Reflexion (Shinn & Labash 2023)', 'author': ''},
 {'title': 'Chain of Hindsight (CoH; Liu et al. 2023)', 'author': ''},
 {'title': 'Algorithm Distillation (AD; Laskin et al. 2023)', 'author': ''}]

In [None]:
# Calling the extraction chain with irrelevant input returns an empty list
extraction_chain.invoke({"input": "hi"})  

[]

## 4.4 Chunk text extraction

In [None]:
# Import modules
from langchain.text_splitter import RecursiveCharacterTextSplitter 

# Instantiate the text segmenter
text_splitter = RecursiveCharacterTextSplitter(chunk_overlap=0)  

In [None]:
# Split the document content; text_splitter breaks a long text into several shorter chunks
splits = text_splitter.split_text(doc.page_content)  

# Get the number of chunks
len(splits)  

14
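
For intuition about what splitting does, here is a deliberately naive fixed-width splitter. It is a much simpler technique than `RecursiveCharacterTextSplitter`, which prefers to break on separators such as paragraph and line boundaries. The name `split_fixed` and the chunk size of 4000 are assumptions made for this illustration only.

```python
def split_fixed(text: str, chunk_size: int = 4000) -> list:
    """Cut text into consecutive pieces of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


chunks = split_fixed("a" * 9000, chunk_size=4000)
print([len(c) for c in chunks])
# prints [4000, 4000, 1000]
```

A fixed-width cut can land in the middle of a word or sentence, which is one reason a separator-aware splitter is used in this notebook.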

In [None]:
# Define a function to flatten a list
def flatten(matrix):
    flat_list = []
    for row in matrix:
        flat_list += row
    return flat_list  

In [None]:
# Example calling the flatten function
flatten([[1, 2], [3, 4]])  

[1, 2, 3, 4]
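
As a side note, the same one-level flattening can be written with the standard library. This sketch uses `itertools.chain.from_iterable`; `flatten_iter` is a hypothetical name chosen to avoid clashing with `flatten` above.

```python
import itertools


def flatten_iter(matrix):
    """Flatten one level of nesting using itertools.chain.from_iterable."""
    return list(itertools.chain.from_iterable(matrix))


print(flatten_iter([[1, 2], [3, 4]]))
# prints [1, 2, 3, 4]
```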

In [None]:
# Print the last thousand characters of the first split text block
print(splits[0][-1000:])  

lemented by several key components:

Planning

Subgoal and decomposition: The agent breaks down large tasks into smaller, manageable subgoals, enabling efficient handling of complex tasks.
Reflection and refinement: The agent can do self-criticism and self-reflection over past actions, learn from mistakes and refine them for future steps, thereby improving the quality of final results.


Memory

Short-term memory: I would consider all the in-context learning (See Prompt Engineering) as utilizing short-term memory of the model to learn.
Long-term memory: This provides the agent with the capability to retain and recall (infinite) information over extended periods, often by leveraging an external vector store and fast retrieval.


Tool use

The agent learns to call external APIs for extra information that is missing from the model weights (often hard to change after pre-training), including current information, code execution capability, access to proprietary information sources and more.

In [None]:
# Import modules
from langchain.schema.runnable import RunnableLambda  

In [None]:
# Create a Lambda function to preprocess text
prep = RunnableLambda(
    lambda x: [{"input": doc} for doc in text_splitter.split_text(x)]  
)

In [None]:
# Test prep
print(prep.invoke("hi"))
print(len(prep.invoke("hi")))

# Put a long text in and it will be split into multiple short texts
print(len(prep.invoke(doc.page_content)))

[{'input': 'hi'}]
1
14


In [None]:
# Create chain calls, including preprocessing, mapping extraction
# Use extraction_chain to extract multiple short texts separately, and flatten the resulting list together using the flatten function
chain = prep | extraction_chain.map() | flatten  
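
The data flow of this chain (split, extract from each piece, merge the results) can be mimicked in plain Python. In this sketch the text is split on line breaks and `fake_extract` is a stub that stands in for the model call; both are assumptions for illustration and do not reproduce what `extraction_chain` returns.

```python
def fake_extract(chunk: dict) -> list:
    """Stub extractor: one record per word that starts with an uppercase letter."""
    return [{"title": word} for word in chunk["input"].split() if word[:1].isupper()]


def run_pipeline(text: str) -> list:
    # 1. prep: split the text and wrap each piece as {"input": ...}
    pieces = [{"input": line} for line in text.split("\n")]
    # 2. map: run the extractor on every piece independently
    per_piece = [fake_extract(piece) for piece in pieces]
    # 3. flatten: merge the per-piece lists into one list
    return [item for sublist in per_piece for item in sublist]


print(run_pipeline("see ReAct here\nsee Reflexion and CoT there"))
# prints [{'title': 'ReAct'}, {'title': 'Reflexion'}, {'title': 'CoT'}]
```

Because each piece is processed independently, the same item can be reported by more than one piece, so the merged list may contain near-duplicates.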

In [None]:
chain.invoke(doc.page_content)

[{'title': 'AutoGPT', 'author': ''},
 {'title': 'GPT-Engineer', 'author': ''},
 {'title': 'BabyAGI', 'author': ''},
 {'title': 'Chain of thought (CoT; Wei et al. 2022)', 'author': ''},
 {'title': 'Tree of Thoughts (Yao et al. 2023)', 'author': ''},
 {'title': 'LLM+P (Liu et al. 2023)', 'author': ''},
 {'title': 'ReAct (Yao et al. 2023)', 'author': ''},
 {'title': 'Reflexion (Shinn & Labash 2023)', 'author': ''},
 {'title': 'Reflexion: A Framework for Self-Reflection in Reinforcement Learning',
  'author': 'Shinn & Labash'},
 {'title': 'Chain of Hindsight: Improving Reinforcement Learning with Sequential Feedback',
  'author': 'Liu et al.'},
 {'title': 'Algorithm Distillation: Learning Process of Reinforcement Learning',
  'author': 'Laskin et al.'},
 {'title': 'Algorithm Distillation', 'author': 'Laskin et al. 2023'},
 {'title': 'ED (expert distillation)', 'author': ''},
 {'title': 'RL^2', 'author': 'Duan et al. 2017'},
 {'title': 'Maximum Inner Product Search (MIPS)', 'author': ''},
 

# 5. English Versions of the Templates

**2.1 Create Tagging Function**

In [None]:
class Tagging(BaseModel):
    """Tag the piece of text with particular info."""
    sentiment: str = Field(description="sentiment of text, should be `pos`, `neg`, or `neutral`")
    language: str = Field(description="language of text (should be ISO 639-1 code)")

**2.2 Tagging through LangChain**

In [None]:
prompt = ChatPromptTemplate.from_messages([
    ("system", "Think carefully, and then tag the text as instructed"),
    ("user", "{input}")
])

**3.1 Create Extraction Function**

In [None]:
class Person(BaseModel):
    """Information about a person."""
    name: str = Field(description="person's name")  
    age: Optional[int] = Field(description="person's age")  

In [None]:
class Information(BaseModel):
    """Information to extract."""
    people: List[Person] = Field(description="List of info about people")

**3.2 Creating Extraction Functions through LangChain**

In [None]:
prompt = ChatPromptTemplate.from_messages([
    ("system", "Extract the relevant information, if not explicitly provided do not guess. Extract partial info"), 
    ("human", "{input}")  
])

**4.2 Extracting Article Overview**

In [None]:
class Overview(BaseModel):
    """Overview of a section of text."""
    summary: str = Field(description="Provide a concise summary of the content.") 
    language: str = Field(description="Provide the language that the content is written in.") 
    keywords: str = Field(description="Provide keywords related to the content.") 

**4.3 Extracting article information**

In [None]:
class Paper(BaseModel):
    """Information about papers mentioned."""
    title: str  
    author: Optional[str]  

class Info(BaseModel):
    """Information to extract"""
    papers: List[Paper] 

The English `template` used to build the prompt:

In [None]:
template = """
An article will be passed to you. Extract from it all papers that are mentioned by this article. 
Do not extract the name of the article itself. If no papers are mentioned that's fine - you don't need to extract any! Just return an empty list.
Do not make up or guess ANY extra information. Only extract what exactly is in the text.
"""