Added support for a Pandas DataFrame OutputParser (#13257)
**Description:**

Added support for a Pandas DataFrame OutputParser with format
instructions, along with unit tests and a demo notebook. Namely, we've
added the ability to request data from a DataFrame, have the LLM parse
the request, and then use that request to retrieve a well-formatted
response.

Within LangChain, it seamlessly integrates with language models like
OpenAI's `text-davinci-003`, facilitating streamlined interaction using
the format instructions (just like the other output parsers).

This parser structures its requests as
`<operation>:<column or row>[<optional_array_params>]`. The instructions
detail permissible operations, valid columns, and array formats, so
that the LLM's output stays within the required format.
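
As a rough illustration of this request format, the sketch below splits a request string into its parts with a regular expression. The names `REQUEST_RE`, `split_request`, and `expand_array` are hypothetical and are not taken from the parser's implementation; reading `1..3` as an inclusive range is an assumption that matches the `{'mean': 4.0}` result in the demo notebook.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical pattern for "<operation>:<target>[<optional array>]".
# This is an illustration, not the regex used by PandasDataFrameOutputParser.
REQUEST_RE = re.compile(r"^(?P<op>\w+):(?P<target>\w+)(?:\[(?P<array>[^\]]*)\])?$")


def split_request(request: str) -> Tuple[str, str, Optional[str]]:
    """Split a request into (operation, target, raw array text or None)."""
    match = REQUEST_RE.match(request.strip())
    if match is None:
        raise ValueError(f"Badly formatted request: {request!r}")
    return match.group("op"), match.group("target"), match.group("array")


def expand_array(raw: str) -> List[int]:
    """Expand "1,3,5" or a range such as "1..3" (assumed inclusive) into indices."""
    if ".." in raw:
        start, end = raw.split("..")
        return list(range(int(start), int(end) + 1))
    return [int(item) for item in raw.split(",")]


print(split_request("mean:num_legs[1..3]"))
print(expand_array("1..3"))
```

Note that `expand_array` only handles numeric arrays; a request such as `row:1[num_legs]` carries column names in the brackets and would need separate handling.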

For example:

- When the LLM receives the input: "Retrieve the mean of `num_legs` from
rows 1 to 3."
- The provided format instructions guide the LLM to structure the
request as: "mean:num_legs[1..3]".

The parser then processes this formatted request and computes the mean
of `num_legs` over rows 1 to 3 of the Pandas DataFrame.
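
For reference, here is what that request amounts to in plain pandas, using the same data as the demo notebook. This is only a sketch of the equivalent operation, assuming the range `1..3` includes both endpoints; it does not reproduce the parser's internal code.

```python
import pandas as pd

# Same data as the demo notebook in this commit.
df = pd.DataFrame(
    {
        "num_legs": [2, 4, 8, 0],
        "num_wings": [2, 0, 0, 0],
        "num_specimen_seen": [10, 2, 1, 8],
    }
)

# "mean:num_legs[1..3]": select rows 1 through 3 (assumed inclusive), then average.
rows = list(range(1, 3 + 1))
result = {"mean": float(df["num_legs"].iloc[rows].mean())}
print(result)  # {'mean': 4.0}
```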

This integration lets users phrase requests in natural language, with
the LLM translating them into structured commands understood by the
`PandasDataFrameOutputParser`. The format instructions act as a bridge
between natural-language queries and precise DataFrame operations.
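
The bridging role of the format instructions can be sketched with plain string templating. The instruction text below is an abbreviated, hypothetical stand-in for the full template added in this commit, and joining the column names with `", "` is an assumption; the prompt layout is the one used in the demo notebook.

```python
# Abbreviated, hypothetical stand-in for the full instruction text in this commit.
FORMAT_INSTRUCTIONS = (
    "The output should be formatted as the operation, a colon, the column or row, "
    "and optional array parameters.\n"
    "Here are the possible columns:\n{columns}\n"
)

# Prompt layout used by the demo notebook.
PROMPT = "Answer the user query.\n{format_instructions}\n{query}\n"

columns = ["num_legs", "num_wings", "num_specimen_seen"]

# Fill in the valid columns first, then place the instructions ahead of the query.
instructions = FORMAT_INSTRUCTIONS.format(columns=", ".join(columns))
prompt_text = PROMPT.format(
    format_instructions=instructions,
    query="Retrieve the mean of num_legs from rows 1 to 3.",
)
print(prompt_text)
```

The LLM sees the allowed columns and the required request shape before the user's query, which is what lets it answer with a string such as `mean:num_legs[1..3]`.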

**Issue:**

- #11532

**Dependencies:**

No additional dependencies :)

**Tag maintainer:**

@baskaryan 

**Twitter handle:**

No need. :)

---------

Co-authored-by: Wasee Alam <waseealam@protonmail.com>
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
3 people committed Nov 30, 2023
1 parent 235bdb9 commit 41a4c06
Showing 6 changed files with 521 additions and 0 deletions.
229 changes: 229 additions & 0 deletions docs/docs/modules/model_io/output_parsers/pandas_dataframe.ipynb
@@ -0,0 +1,229 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Pandas DataFrame Parser\n",
"\n",
"A Pandas DataFrame is a popular data structure in the Python programming language, commonly used for data manipulation and analysis. It provides a comprehensive set of tools for working with structured data, making it a versatile option for tasks such as data cleaning, transformation, and analysis.\n",
"\n",
"This output parser allows users to specify an arbitrary Pandas DataFrame and query LLMs for data from it, returned as a formatted dictionary. Keep in mind that large language models are leaky abstractions! You'll have to use an LLM with sufficient capacity to generate a well-formed query as per the defined format instructions.\n",
"\n",
"Use Pandas' DataFrame object to declare the DataFrame you wish to perform queries on."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pprint\n",
"from typing import Any, Dict\n",
"\n",
"import pandas as pd\n",
"from langchain.llms import OpenAI\n",
"from langchain.output_parsers import PandasDataFrameOutputParser\n",
"from langchain.prompts import PromptTemplate"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"model_name = \"text-davinci-003\"\n",
"temperature = 0.5\n",
"model = OpenAI(model_name=model_name, temperature=temperature)"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# Solely for documentation purposes.\n",
"def format_parser_output(parser_output: Dict[str, Any]) -> None:\n",
"    # Convert each pandas object to a plain dict so it pretty-prints cleanly.\n",
"    for key in parser_output.keys():\n",
"        parser_output[key] = parser_output[key].to_dict()\n",
"    pprint.PrettyPrinter(width=4, compact=True).pprint(parser_output)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"# Define your desired Pandas DataFrame.\n",
"df = pd.DataFrame(\n",
"    {\n",
"        \"num_legs\": [2, 4, 8, 0],\n",
"        \"num_wings\": [2, 0, 0, 0],\n",
"        \"num_specimen_seen\": [10, 2, 1, 8],\n",
"    }\n",
"\n",
"# Set up a parser + inject instructions into the prompt template.\n",
"parser = PandasDataFrameOutputParser(dataframe=df)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"LLM Output: column:num_wings\n",
"{'num_wings': {0: 2,\n",
" 1: 0,\n",
" 2: 0,\n",
" 3: 0}}\n"
]
}
],
"source": [
"# Here's an example of a column operation being performed.\n",
"df_query = \"Retrieve the num_wings column.\"\n",
"\n",
"# Set up the prompt.\n",
"prompt = PromptTemplate(\n",
"    template=\"Answer the user query.\\n{format_instructions}\\n{query}\\n\",\n",
"    input_variables=[\"query\"],\n",
"    partial_variables={\"format_instructions\": parser.get_format_instructions()},\n",
")\n",
"\n",
"_input = prompt.format_prompt(query=df_query)\n",
"output = model(_input.to_string())\n",
"print(\"LLM Output:\", output)\n",
"parser_output = parser.parse(output)\n",
"\n",
"format_parser_output(parser_output)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"LLM Output: row:1\n",
"{'1': {'num_legs': 4,\n",
" 'num_specimen_seen': 2,\n",
" 'num_wings': 0}}\n"
]
}
],
"source": [
"# Here's an example of a row operation being performed.\n",
"df_query = \"Retrieve the first row.\"\n",
"\n",
"# Set up the prompt.\n",
"prompt = PromptTemplate(\n",
"    template=\"Answer the user query.\\n{format_instructions}\\n{query}\\n\",\n",
"    input_variables=[\"query\"],\n",
"    partial_variables={\"format_instructions\": parser.get_format_instructions()},\n",
")\n",
"\n",
"_input = prompt.format_prompt(query=df_query)\n",
"output = model(_input.to_string())\n",
"print(\"LLM Output:\", output)\n",
"parser_output = parser.parse(output)\n",
"\n",
"format_parser_output(parser_output)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"LLM Output: mean:num_legs[1..3]\n"
]
},
{
"data": {
"text/plain": [
"{'mean': 4.0}"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Here's an example of an arbitrary Pandas DataFrame operation restricted to a range of rows.\n",
"df_query = \"Retrieve the average of the num_legs column from rows 1 to 3.\"\n",
"\n",
"# Set up the prompt.\n",
"prompt = PromptTemplate(\n",
"    template=\"Answer the user query.\\n{format_instructions}\\n{query}\\n\",\n",
"    input_variables=[\"query\"],\n",
"    partial_variables={\"format_instructions\": parser.get_format_instructions()},\n",
")\n",
"\n",
"_input = prompt.format_prompt(query=df_query)\n",
"output = model(_input.to_string())\n",
"print(\"LLM Output:\", output)\n",
"parser.parse(output)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Here's an example of a query that references a column missing from the DataFrame.\n",
"df_query = \"Retrieve the mean of the num_fingers column.\"\n",
"\n",
"# Set up the prompt.\n",
"prompt = PromptTemplate(\n",
"    template=\"Answer the user query.\\n{format_instructions}\\n{query}\\n\",\n",
"    input_variables=[\"query\"],\n",
"    partial_variables={\"format_instructions\": parser.get_format_instructions()},\n",
")\n",
"\n",
"_input = prompt.format_prompt(query=df_query)\n",
"output = model(_input.to_string())  # Expected Output: \"Invalid column: num_fingers\".\n",
"print(\"LLM Output:\", output)\n",
"parser.parse(output)  # Expected Output: Will raise an OutputParserException."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
2 changes: 2 additions & 0 deletions libs/langchain/langchain/output_parsers/__init__.py
@@ -28,6 +28,7 @@
JsonOutputToolsParser,
PydanticToolsParser,
)
from langchain.output_parsers.pandas_dataframe import PandasDataFrameOutputParser
from langchain.output_parsers.pydantic import PydanticOutputParser
from langchain.output_parsers.rail_parser import GuardrailsOutputParser
from langchain.output_parsers.regex import RegexParser
@@ -47,6 +48,7 @@
"MarkdownListOutputParser",
"NumberedListOutputParser",
"OutputFixingParser",
"PandasDataFrameOutputParser",
"PydanticOutputParser",
"RegexDictParser",
"RegexParser",
22 changes: 22 additions & 0 deletions libs/langchain/langchain/output_parsers/format_instructions.py
@@ -41,3 +41,25 @@
```
{tags}
```"""


PANDAS_DATAFRAME_FORMAT_INSTRUCTIONS = """The output should be formatted as a string as the operation, followed by a colon, followed by the column or row to be queried on, followed by optional array parameters.
1. The column names are limited to the possible columns below.
2. Arrays must either be a comma-separated list of numbers formatted as [1,3,5], or a range of numbers formatted as [0..4].
3. Remember that arrays are optional and not necessarily required.
4. If the column is not in the possible columns or the operation is not a valid Pandas DataFrame operation, return why it is invalid as a sentence starting with either "Invalid column" or "Invalid operation".
As an example, for the formats:
1. String "column:num_legs" is a well-formatted instance which gets the column num_legs, where num_legs is a possible column.
2. String "row:1" is a well-formatted instance which gets row 1.
3. String "column:num_legs[1,2]" is a well-formatted instance which gets the column num_legs for rows 1 and 2, where num_legs is a possible column.
4. String "row:1[num_legs]" is a well-formatted instance which gets row 1, but for just column num_legs, where num_legs is a possible column.
5. String "mean:num_legs[1..3]" is a well-formatted instance which takes the mean of num_legs from rows 1 to 3, where num_legs is a possible column and mean is a valid Pandas DataFrame operation.
6. String "do_something:num_legs" is a badly-formatted instance, where do_something is not a valid Pandas DataFrame operation.
7. String "mean:invalid_col" is a badly-formatted instance, where invalid_col is not a possible column.
Here are the possible columns:
```
{columns}
```
"""
