# Tagging and Extraction Using OpenAI functions 🏷️

## Introduction 
This notebook demonstrates how to perform tagging and extraction using OpenAI functions in combination with LangChain. The focus is on tagging text with specific attributes, such as sentiment and language, and on extracting structured information from text.

Sentiment analysis is a natural language processing (NLP) technique for determining the sentiment, or emotional tone, of a piece of text, typically classified as positive, negative, or neutral. It is widely used in applications such as social media monitoring, customer feedback analysis, and market research.

### Set Up the Environment, OpenAI API Key, and Imports
First, we import the necessary libraries and set up the environment and the OpenAI API key:

In [1]:
# Import necessary libraries
import os
import openai
from dotenv import load_dotenv, find_dotenv

# Load environment variables from .env file
_ = load_dotenv(find_dotenv()) # read local .env file
openai.api_key = os.environ['OPENAI_API_KEY']

# Print OpenAI API key (masked)
print(f"OPENAI_API_KEY: {os.getenv('OPENAI_API_KEY')[:5]}*****")

**Note** Ensure you have the required packages installed:
```py
%pip install pydantic==1.10.8
%pip install rich
```

In [3]:
# Import necessary modules from rich library that helps to improve the readability of nested dictionary outputs
from rich import print
from rich.pretty import Pretty

In [None]:
# Import necessary modules from Pydantic 
from typing import List
from pydantic import BaseModel, Field

# Import necessary modules from  Langchain
from langchain.utils.openai_functions import convert_pydantic_to_openai_function

### Example Using OpenAI and Langchain
Here is a simplified example of how you can perform sentiment analysis using OpenAI and Langchain:

- Define a Tagging model using Pydantic to specify the sentiment and language fields.

In [5]:
# Define a Pydantic model for Tagging
class Tagging(BaseModel):
    """Tag the piece of text with particular info."""
    sentiment: str = Field(description="sentiment of text, should be `pos`, `neg`, or `neutral`")
    language: str = Field(description="language of text (should be ISO 639-1 code)")

- Use the `convert_pydantic_to_openai_function` utility to convert this model into an OpenAI function.

In [None]:
# Convert the Pydantic model to an OpenAI function (once, then reuse the result)
tagging_function = convert_pydantic_to_openai_function(Tagging)

# Display the converted function in a readable format
print(Pretty(tagging_function))

{'name': 'Tagging',
 'description': 'Tag the piece of text with particular info.',
 'parameters': {'properties': {'sentiment': {'description': 'sentiment of text, should be `pos`, `neg`, or `neutral`',
    'type': 'string'},
   'language': {'description': 'language of text (should be ISO 639-1 code)',
    'type': 'string'}},
  'required': ['sentiment', 'language'],
  'type': 'object'}}
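
The dictionary above describes the function's parameters in JSON Schema form. As a rough illustration of what that schema promises, the sketch below checks a set of arguments against it by hand. `check_arguments` is a hypothetical helper written for this example (it is not part of LangChain or the OpenAI client), and it only covers the `required` list and the `string` type used here.

```python
# Schema copied from the output above (field descriptions omitted for brevity).
tagging_schema = {
    "name": "Tagging",
    "description": "Tag the piece of text with particular info.",
    "parameters": {
        "properties": {
            "sentiment": {"type": "string"},
            "language": {"type": "string"},
        },
        "required": ["sentiment", "language"],
        "type": "object",
    },
}


def check_arguments(schema, arguments):
    """Hypothetical helper: return a list of problems found in `arguments`."""
    params = schema["parameters"]
    problems = []
    # Every field listed under "required" must be present.
    for field in params["required"]:
        if field not in arguments:
            problems.append(f"missing required field: {field}")
    # Fields declared as "string" must hold a Python str.
    for field, spec in params["properties"].items():
        if field in arguments and spec.get("type") == "string":
            if not isinstance(arguments[field], str):
                problems.append(f"field {field} should be a string")
    return problems


print(check_arguments(tagging_schema, {"sentiment": "pos", "language": "en"}))
print(check_arguments(tagging_schema, {"sentiment": "pos"}))
```

An empty list means the arguments satisfy the parts of the schema that this sketch inspects.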

In [8]:
# Import necessary modules for Langchain prompts and chat models
from langchain.prompts import ChatPromptTemplate
from langchain.chat_models import ChatOpenAI

- Create a ChatOpenAI model and bind it with the tagging functions.

In [9]:
# Initialize the ChatOpenAI model
model = ChatOpenAI(temperature=0)

In [10]:
# Define the tagging functions
tagging_functions = [convert_pydantic_to_openai_function(Tagging)]

- Finally, let's create a prompt template and a tagging chain to analyze the sentiment of the input

In [11]:
# Define the prompt template for tagging
prompt = ChatPromptTemplate.from_messages([
    ("system", "Think carefully, and then tag the text as instructed"),
    ("user", "{input}")
])

In [12]:
# Bind the model with functions
model_with_functions = model.bind(
    functions=tagging_functions,
    function_call={"name": "Tagging"}
)

In [13]:
# Create a tagging chain
tagging_chain = prompt | model_with_functions
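
The `|` operator composes steps so that the output of one becomes the input of the next. The toy class below imitates that idea in plain Python. `Step`, `format_prompt`, and `fake_model` are made-up names for illustration, and this is only a sketch of the concept, not how LangChain implements its runnables.

```python
class Step:
    """Toy runnable: wraps a function and supports `|` composition."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # The composed step feeds this step's output into the next step.
        return Step(lambda value: other.invoke(self.invoke(value)))


# Stand-ins for the prompt template and the model (no API call is made).
format_prompt = Step(lambda data: f"Tag this text: {data['input']}")
fake_model = Step(lambda prompt_text: {"prompt": prompt_text, "sentiment": "pos"})

toy_chain = format_prompt | fake_model
print(toy_chain.invoke({"input": "I love langchain"}))
```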

- Test the tagging chain.

In [None]:
# Test the tagging chain (invoke once and reuse the result to avoid a second API call)
result = tagging_chain.invoke({"input": "I love langchain"})

# Display the result in a readable format
print(Pretty(result))

**Explanation of the output:** The result is an `AIMessage`, which wraps the model's response to the input ("I love langchain"). Because the model was forced to call the `Tagging` function, the message carries a function call whose arguments tag the text as having a positive sentiment and being written in English.
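
With the function-calling interface used here, the tags arrive as a JSON string stored under the message's `additional_kwargs["function_call"]["arguments"]`. The sketch below uses a plain dictionary that mimics that shape (the values are illustrative, not a recorded response) to show how the string can be decoded by hand.

```python
import json

# Illustrative stand-in for AIMessage.additional_kwargs; not real model output.
additional_kwargs = {
    "function_call": {
        "name": "Tagging",
        "arguments": '{"sentiment": "pos", "language": "en"}',
    }
}

function_call = additional_kwargs["function_call"]
# The arguments are a JSON-encoded string, so decode them into a dict.
tags = json.loads(function_call["arguments"])

print(function_call["name"])
print(tags["sentiment"], tags["language"])
```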

In [None]:
# Test the tagging chain with a different input (invoke once and reuse the result)
result = tagging_chain.invoke({"input": "Ik houd niet van het Nederlandse weer"})

# Display the result in a readable format
print(Pretty(result))

**Explanation of the output:** With a different input, the model tags the text as having a negative sentiment and being written in Dutch.

In [17]:
# Import the JsonOutputFunctionsParser
from langchain.output_parsers.openai_functions import JsonOutputFunctionsParser

In [18]:
# Create a tagging chain with JSON output parser
tagging_chain = prompt | model_with_functions | JsonOutputFunctionsParser()

In [16]:
# Test the tagging chain with JSON output parser
tagging_chain.invoke({"input": "Ik houd niet van het Nederlandse weer"})

{'sentiment': 'neg', 'language': 'nl'}
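
The field descriptions ask for `pos`, `neg`, or `neutral` and for an ISO 639-1 code, but descriptions are only hints to the model. A minimal sketch of a post-check on the parsed dictionary follows. `validate_tags` is a hypothetical helper, and its language test checks only the shape of the code (two lowercase ASCII letters), not whether it is a registered ISO 639-1 code.

```python
ALLOWED_SENTIMENTS = {"pos", "neg", "neutral"}


def validate_tags(tags):
    """Hypothetical helper: raise ValueError if the tags break the expected format."""
    sentiment = tags.get("sentiment")
    language = tags.get("language")
    if sentiment not in ALLOWED_SENTIMENTS:
        raise ValueError(f"unexpected sentiment: {sentiment!r}")
    # ISO 639-1 codes are two letters; this checks the shape only.
    looks_like_code = (
        isinstance(language, str)
        and len(language) == 2
        and language.isascii()
        and language.isalpha()
        and language.islower()
    )
    if not looks_like_code:
        raise ValueError(f"unexpected language code: {language!r}")
    return tags


print(validate_tags({"sentiment": "neg", "language": "nl"}))
```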

## Extraction

Extraction is similar to tagging, but it is used to pull multiple pieces of information out of the text, such as a list of the people it mentions.

In [19]:
# Define a Pydantic model for Person
from typing import Optional
class Person(BaseModel):
    """Information about a person."""
    name: str = Field(description="person's name")
    age: Optional[int] = Field(description="person's age")

In [20]:
# Define a Pydantic model for Information
class Information(BaseModel):
    """Information to extract."""
    people: List[Person] = Field(description="List of info about people")

In [None]:
# Convert the Pydantic model to an OpenAI function (once, then reuse the result)
information_function = convert_pydantic_to_openai_function(Information)

# Display the converted function in a readable format
print(Pretty(information_function))

In [22]:
# Define the extraction functions
extraction_functions = [convert_pydantic_to_openai_function(Information)]
extraction_model = model.bind(functions=extraction_functions, function_call={"name": "Information"})

In [None]:
# Test the extraction model (invoke once and reuse the result to avoid a second API call)
result = extraction_model.invoke("Pinco is 30, his mom is Jane")

# Display the result in a readable format
print(Pretty(result))


In [25]:
# Define the prompt template for extraction
prompt = ChatPromptTemplate.from_messages([
    ("system", "Extract the relevant information, if not explicitly provided do not guess. Extract partial info"),
    ("human", "{input}")
])

In [26]:
# Create an extraction chain
extraction_chain = prompt | extraction_model

In [None]:
# Test the extraction chain (invoke once and reuse the result to avoid a second API call)
result = extraction_chain.invoke({"input": "Pinco is 30, his mom is Jane"})

# Display the result in a readable format
print(Pretty(result))

In [29]:
# Create an extraction chain with JSON output parser
extraction_chain = prompt | extraction_model | JsonOutputFunctionsParser()

In [30]:
# Test the extraction chain with JSON output parser
extraction_chain.invoke({"input": "Pinco is 30, his mom is Jane"})

{'people': [{'name': 'Pinco', 'age': 30}, {'name': 'Jane'}]}

In [32]:
from langchain.output_parsers.openai_functions import JsonKeyOutputFunctionsParser

In [None]:
# Create an extraction chain where the input prompt is processed through the 
# extraction_model, and the resulting output is parsed to extract the value 
# associated with the key "people" using JsonKeyOutputFunctionsParser.
extraction_chain = prompt | extraction_model | JsonKeyOutputFunctionsParser(key_name="people")

In [34]:
extraction_chain.invoke({"input": "Pinco is 30, his mom is Jane"})

[{'name': 'Pinco', 'age': 30}, {'name': 'Jane', 'age': None}]
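
Conceptually, selecting a key amounts to decoding the function-call arguments and returning one entry of the resulting dictionary. The sketch below reproduces that idea in plain Python on a hand-written arguments string (illustrative, not a recorded response). It also fills in `age` as `None` when it is absent, so every record has the same keys. `extract_key` is a hypothetical helper, not LangChain's implementation.

```python
import json


def extract_key(arguments_json, key_name):
    """Hypothetical helper: decode the arguments string and return one key."""
    return json.loads(arguments_json)[key_name]


# Hand-written example of function-call arguments.
arguments = '{"people": [{"name": "Pinco", "age": 30}, {"name": "Jane"}]}'

people = extract_key(arguments, "people")
# Give every record an explicit "age" entry, using None when it is missing.
people = [{"name": person["name"], "age": person.get("age")} for person in people]
print(people)
```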

## Doing it for real

We can apply tagging to a larger body of text.

For example, let's load this blog post and extract tag information from a subset of the text.

In [36]:
# Load a document from the web using WebBaseLoader
from langchain.document_loaders import WebBaseLoader
loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")
documents = loader.load()

USER_AGENT environment variable not set, consider setting it to identify your requests.


**Note:** The output above is a warning that the `USER_AGENT` environment variable is not set. Setting it identifies your requests to the web server, which can improve compatibility and compliance with the server's policies. In this case the warning does not affect the analysis; it is only a recommendation.
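
One way to address the warning is to set the variable in the process environment before the loader is imported, on the assumption that the check runs when the loader module is first loaded. The value below is a made-up identifier, and `setdefault` leaves any value that is already set untouched.

```python
import os

# Hypothetical identifier; replace it with something that describes your project.
os.environ.setdefault("USER_AGENT", "tagging-extraction-notebook/0.1")

print(os.environ["USER_AGENT"])
```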

In [37]:
# Retrieve the first document from the loaded documents
doc = documents[0]

In [38]:
# Extract the first 10,000 characters of the document's content
page_content = doc.page_content[:10000]

In [39]:
# Print the first 1,000 characters of the page content for a quick preview
print(page_content[:1000])

In [40]:
# Define a Pydantic model for Overview
class Overview(BaseModel):
    """Overview of a section of text."""
    summary: str = Field(description="Provide a concise summary of the content.")
    language: str = Field(description="Provide the language that the content is written in.")
    keywords: str = Field(description="Provide keywords related to the content.")

In [41]:
# Convert the Pydantic model to an OpenAI function
overview_tagging_function = [
    convert_pydantic_to_openai_function(Overview)
]

# Bind the model with the tagging functions and specify the function call
tagging_model = model.bind(
    functions=overview_tagging_function,
    function_call={"name":"Overview"}
)

# Create a tagging chain using the prompt, model, and JSON output parser
tagging_chain = prompt | tagging_model | JsonOutputFunctionsParser()

In [None]:
# Test the tagging chain with the page content (invoke once and reuse the result)
result = tagging_chain.invoke({"input": page_content})

# Display the result in a readable format
print(Pretty(result))

In [44]:
class Paper(BaseModel):
    """Information about papers mentioned."""
    title: str
    author: Optional[str]


class Info(BaseModel):
    """Information to extract"""
    papers: List[Paper]

In [45]:
paper_extraction_function = [
    convert_pydantic_to_openai_function(Info)
]
extraction_model = model.bind(
    functions=paper_extraction_function, 
    function_call={"name":"Info"}
)
extraction_chain = prompt | extraction_model | JsonKeyOutputFunctionsParser(key_name="papers")

In [46]:
extraction_chain.invoke({"input": page_content})

[{'title': 'LLM Powered Autonomous Agents', 'author': 'Lilian Weng'}]

In [47]:
template = """A article will be passed to you. Extract from it all papers that are mentioned by this article follow by its author. 

Do not extract the name of the article itself. If no papers are mentioned that's fine - you don't need to extract any! Just return an empty list.

Do not make up or guess ANY extra information. Only extract what exactly is in the text."""

prompt = ChatPromptTemplate.from_messages([
    ("system", template),
    ("human", "{input}")
])

In [None]:
# Create an extraction chain where the input prompt is processed through the extraction_model,
# and the resulting output is parsed to extract the value associated with the key "papers"
# using JsonKeyOutputFunctionsParser.
extraction_chain = prompt | extraction_model | JsonKeyOutputFunctionsParser(key_name="papers")

In [None]:
# Invoke the extraction chain on page_content once and reuse the result.
result = extraction_chain.invoke({"input": page_content})

# Display the extracted papers in a readable format
print(Pretty(result))

In [51]:
extraction_chain.invoke({"input": "hi"})

[{'title': 'Paper A', 'author': 'Author A'},
 {'title': 'Paper B', 'author': 'Author B'}]

In [52]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_overlap=0)

In [None]:
# Split the document's page content into smaller chunks using the text_splitter.
splits = text_splitter.split_text(doc.page_content)

**Note:** "Chunks" are smaller segments of a larger text. Depending on the application and the splitting method, a chunk may be a sentence, a paragraph, or another subdivision of the text.
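
To make the idea of chunks concrete, here is a deliberately naive splitter that cuts text into fixed-size pieces with no overlap. It is not the algorithm `RecursiveCharacterTextSplitter` uses (which tries to split on separators such as paragraph and line breaks first). `split_fixed` and the chunk size of 10 are assumptions made for illustration.

```python
def split_fixed(text, chunk_size):
    """Cut `text` into consecutive pieces of at most `chunk_size` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


sample = "Agents plan, remember, and use tools."
chunks = split_fixed(sample, 10)
print(len(chunks))
print(chunks)
```

With no overlap, joining the chunks back together reproduces the original text exactly.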

In [None]:
# Calculate the number of chunks created by the text_splitter from the document's page content.
len(splits)

15

In [None]:
# Define a function to flatten a 2D matrix into a 1D list by concatenating all rows.
def flatten(matrix):
    flat_list = []
    for row in matrix:
        flat_list += row
    return flat_list

In [56]:
flatten([[1, 2], [3, 4]])

[1, 2, 3, 4]
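
The same flattening can be written with the standard library. The variant below is an equivalent sketch built on `itertools.chain.from_iterable`; the name `flatten_rows` is chosen here only to avoid clashing with the function defined above.

```python
import itertools


def flatten_rows(matrix):
    """Concatenate the rows of a list of lists into one flat list."""
    return list(itertools.chain.from_iterable(matrix))


print(flatten_rows([[1, 2], [3, 4]]))
```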

In [None]:
# Print the first chunk from the list of text chunks created by the text_splitter.
print(splits[0])

In [58]:
from langchain.schema.runnable import RunnableLambda

In [None]:
# Define a RunnableLambda that prepares the input by splitting the text into chunks using text_splitter,
# and then formats each chunk as a dictionary with the key "input".
# The loop variable is named `chunk` so it is not confused with the `doc` loaded earlier.
prep = RunnableLambda(
    lambda x: [{"input": chunk} for chunk in text_splitter.split_text(x)]
)

In [60]:
prep.invoke("hi")

[{'input': 'hi'}]

In [None]:
# Create a processing chain where the input is first prepared by splitting the text into chunks,
# then each chunk is processed through the extraction_chain in parallel using the map function,
# and finally, the results are flattened into a single list.
chain = prep | extraction_chain.map() | flatten
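
The shape of this pipeline can be imitated without any model calls. In the sketch below, `fake_extract` is a stub standing in for `extraction_chain` (it merely picks out capitalized words), `prepare` splits on line breaks instead of using the text splitter, and a thread pool plays the role of `.map()`. All of these are assumptions made for illustration and do not reflect how LangChain schedules the calls internally.

```python
from concurrent.futures import ThreadPoolExecutor


def prepare(text):
    """Split text into lines and wrap each one as an {"input": ...} dict."""
    return [{"input": line} for line in text.splitlines()]


def fake_extract(item):
    """Stub extractor: return one record per capitalized word in the chunk."""
    return [{"title": word} for word in item["input"].split() if word[:1].isupper()]


def flatten(matrix):
    """Concatenate a list of lists into a single list."""
    return [element for row in matrix for element in row]


def run_pipeline(text):
    items = prepare(text)
    # executor.map preserves input order, so results line up with the chunks.
    with ThreadPoolExecutor() as executor:
        per_chunk = list(executor.map(fake_extract, items))
    return flatten(per_chunk)


print(run_pipeline("ReAct and Reflexion\nChain of thought"))
```

Each chunk yields its own list of records, which is why the final flattening step is needed to obtain a single list.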

In [None]:
# Invoke the chain on the full document (once, then reuse the result)
papers = chain.invoke(doc.page_content)

# Display the result in a readable format
print(Pretty(papers))

[{'title': 'AutoGPT', 'author': None},
 {'title': 'GPT-Engineer', 'author': None},
 {'title': 'BabyAGI', 'author': None},
 {'title': 'Chain of thought', 'author': 'Wei et al. 2022'},
 {'title': 'Tree of Thoughts', 'author': 'Yao et al. 2023'},
 {'title': 'LLM+P', 'author': 'Liu et al. 2023'},
 {'title': 'ReAct', 'author': 'Yao et al. 2023'},
 {'title': 'Reflexion', 'author': 'Shinn & Labash 2023'},
 {'title': 'Chain of Hindsight (CoH)', 'author': 'Liu et al. 2023'},
 {'title': 'Algorithm Distillation (AD)', 'author': 'Laskin et al. 2023'},
 {'title': 'Miller 1956', 'author': None},
 {'title': 'Duan et al. 2017', 'author': None},
 {'title': 'LSH: Locality-Sensitive Hashing', 'author': None},
 {'title': 'ANNOY: Approximate Nearest Neighbors Oh Yeah', 'author': None},
 {'title': 'HNSW: Hierarchical Navigable Small World', 'author': None},
 {'title': 'FAISS: Facebook AI Similarity Search', 'author': None},
 {'title': 'ScaNN: Scalable Nearest Neighbors', 'author': None},
 {'title': 'MRKL (K

## Conclusion 
This notebook demonstrated how to use OpenAI functions in combination with LangChain to perform tagging and extraction on text. By defining Pydantic models and converting them to OpenAI functions, we were able to tag text with specific attributes and extract structured information. This approach can be extended to more complex use cases and integrated into larger applications. Sentiment analysis in particular is a useful tool for understanding the emotional tone of text, and the tagging chain shown here offers a straightforward way to add it to an application and gain insights from textual data.