Added support for a Pandas DataFrame OutputParser (#13257)
**Description:** Added support for a Pandas DataFrame OutputParser with format instructions, along with unit tests and a demo notebook. Namely, we've added the ability to request data from a DataFrame, have the LLM parse the request, and then use that request to retrieve a well-formatted response.

Within LangChain, it seamlessly integrates with language models like OpenAI's `text-davinci-003`, facilitating streamlined interaction using the format instructions (just like the other output parsers).

This parser structures its requests as `<operation/column/row>[<optional_array_params>]`. The instructions detail permissible operations, valid columns, and array formats, ensuring clarity and adherence to the required format.

For example:
- When the LLM receives the input: "Retrieve the mean of `num_legs` from rows 1 to 3."
- The provided format instructions guide the LLM to structure the request as: "mean:num_legs[1..3]".

The parser processes this formatted request, leveraging the LLM's understanding to extract the mean of `num_legs` from rows 1 to 3 within the Pandas DataFrame.

This integration allows users to communicate requests naturally, with the LLM transforming these instructions into structured commands understood by the `PandasDataFrameOutputParser`. The format instructions act as a bridge between natural language queries and precise DataFrame operations, optimizing communication and data retrieval.

**Issue:**
- #11532

**Dependencies:** No additional dependencies :)

**Tag maintainer:** @baskaryan

**Twitter handle:** No need. :)

---------

Co-authored-by: Wasee Alam <waseealam@protonmail.com>
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
1 parent 235bdb9, commit 41a4c06
Showing 6 changed files with 521 additions and 0 deletions.
229 changes: 229 additions & 0 deletions
docs/docs/modules/model_io/output_parsers/pandas_dataframe.ipynb
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,229 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Pandas DataFrame Parser\n", | ||
"\n", | ||
"A Pandas DataFrame is a popular data structure in the Python programming language, commonly used for data manipulation and analysis. It provides a comprehensive set of tools for working with structured data, making it a versatile option for tasks such as data cleaning, transformation, and analysis.\n", | ||
"\n", | ||
"This output parser lets you supply an arbitrary Pandas DataFrame and ask an LLM to generate a query against it; the parser then runs that query on the DataFrame and returns the extracted data as a formatted dictionary. Keep in mind that large language models are leaky abstractions! You'll have to use an LLM with sufficient capacity to generate a well-formed query as per the defined format instructions.\n", | ||
"\n", | ||
"Use Pandas' DataFrame object to declare the DataFrame you wish to perform queries on." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 1, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"import pprint\n", | ||
"from typing import Any, Dict\n", | ||
"\n", | ||
"import pandas as pd\n", | ||
"from langchain.llms import OpenAI\n", | ||
"from langchain.output_parsers import PandasDataFrameOutputParser\n", | ||
"from langchain.prompts import PromptTemplate" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"model_name = \"text-davinci-003\"\n", | ||
"temperature = 0.5\n", | ||
"model = OpenAI(model_name=model_name, temperature=temperature)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 2, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Solely for documentation purposes.\n", | ||
"def format_parser_output(parser_output: Dict[str, Any]) -> None:\n", | ||
"    # Convert each pandas object to a plain dict without mutating the caller's data.\n", | ||
"    printable = {key: value.to_dict() for key, value in parser_output.items()}\n", | ||
"    pprint.PrettyPrinter(width=4, compact=True).pprint(printable)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 3, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Define your desired Pandas DataFrame.\n", | ||
"df = pd.DataFrame(\n", | ||
"    {\n", | ||
"        \"num_legs\": [2, 4, 8, 0],\n", | ||
"        \"num_wings\": [2, 0, 0, 0],\n", | ||
"        \"num_specimen_seen\": [10, 2, 1, 8],\n", | ||
"    }\n", | ||
")\n", | ||
"\n", | ||
"# Set up a parser + inject instructions into the prompt template.\n", | ||
"parser = PandasDataFrameOutputParser(dataframe=df)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 4, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"LLM Output: column:num_wings\n", | ||
"{'num_wings': {0: 2,\n", | ||
"               1: 0,\n", | ||
"               2: 0,\n", | ||
"               3: 0}}\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"# Here's an example of a column operation being performed.\n", | ||
"df_query = \"Retrieve the num_wings column.\"\n", | ||
"\n", | ||
"# Set up the prompt.\n", | ||
"prompt = PromptTemplate(\n", | ||
"    template=\"Answer the user query.\\n{format_instructions}\\n{query}\\n\",\n", | ||
"    input_variables=[\"query\"],\n", | ||
"    partial_variables={\"format_instructions\": parser.get_format_instructions()},\n", | ||
")\n", | ||
"\n", | ||
"_input = prompt.format_prompt(query=df_query)\n", | ||
"output = model(_input.to_string())\n", | ||
"print(\"LLM Output:\", output)\n", | ||
"parser_output = parser.parse(output)\n", | ||
"\n", | ||
"format_parser_output(parser_output)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 5, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"LLM Output: row:1\n", | ||
"{'1': {'num_legs': 4,\n", | ||
"       'num_specimen_seen': 2,\n", | ||
"       'num_wings': 0}}\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"# Here's an example of a row operation being performed.\n", | ||
"df_query = \"Retrieve the first row.\"\n", | ||
"\n", | ||
"# Set up the prompt.\n", | ||
"prompt = PromptTemplate(\n", | ||
"    template=\"Answer the user query.\\n{format_instructions}\\n{query}\\n\",\n", | ||
"    input_variables=[\"query\"],\n", | ||
"    partial_variables={\"format_instructions\": parser.get_format_instructions()},\n", | ||
")\n", | ||
"\n", | ||
"_input = prompt.format_prompt(query=df_query)\n", | ||
"output = model(_input.to_string())\n", | ||
"print(\"LLM Output:\", output)\n", | ||
"parser_output = parser.parse(output)\n", | ||
"\n", | ||
"format_parser_output(parser_output)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 6, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"LLM Output: mean:num_legs[1..3]\n" | ||
] | ||
}, | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"{'mean': 4.0}" | ||
] | ||
}, | ||
"execution_count": 6, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
], | ||
"source": [ | ||
"# Here's an example of an arbitrary Pandas DataFrame operation restricted to a range of rows.\n", | ||
"df_query = \"Retrieve the average of the num_legs column from rows 1 to 3.\"\n", | ||
"\n", | ||
"# Set up the prompt.\n", | ||
"prompt = PromptTemplate(\n", | ||
"    template=\"Answer the user query.\\n{format_instructions}\\n{query}\\n\",\n", | ||
"    input_variables=[\"query\"],\n", | ||
"    partial_variables={\"format_instructions\": parser.get_format_instructions()},\n", | ||
")\n", | ||
"\n", | ||
"_input = prompt.format_prompt(query=df_query)\n", | ||
"output = model(_input.to_string())\n", | ||
"print(\"LLM Output:\", output)\n", | ||
"parser.parse(output)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Here's an example of a query that references a column missing from the DataFrame.\n", | ||
"df_query = \"Retrieve the mean of the num_fingers column.\"\n", | ||
"\n", | ||
"# Set up the prompt.\n", | ||
"prompt = PromptTemplate(\n", | ||
"    template=\"Answer the user query.\\n{format_instructions}\\n{query}\\n\",\n", | ||
"    input_variables=[\"query\"],\n", | ||
"    partial_variables={\"format_instructions\": parser.get_format_instructions()},\n", | ||
")\n", | ||
"\n", | ||
"_input = prompt.format_prompt(query=df_query)\n", | ||
"output = model(_input.to_string())  # Expected Output: \"Invalid column: num_fingers\".\n", | ||
"print(\"LLM Output:\", output)\n", | ||
"parser.parse(output)  # Expected Output: Will raise an OutputParserException." | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "venv", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.11.2" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 2 | ||
} |
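
The request strings recorded in the notebook above (`column:num_wings`, `row:1`, `mean:num_legs[1..3]`) can be illustrated with a small standalone sketch. This is not the `PandasDataFrameOutputParser` implementation from this commit: `TABLE`, `REQUEST`, and `run_request` are hypothetical names, a plain dict of lists stands in for the DataFrame so the sketch runs without pandas, only three operations are handled, and `ValueError` stands in for `OutputParserException`. The only behavior taken from the notebook is the request shape and the recorded outputs, including the inclusive `1..3` range.

```python
import re
from statistics import mean

# Hypothetical stand-in for the notebook's DataFrame: column name -> values.
TABLE = {
    "num_legs": [2, 4, 8, 0],
    "num_wings": [2, 0, 0, 0],
    "num_specimen_seen": [10, 2, 1, 8],
}

# Assumed request shape: operation:target, optionally followed by [start..end].
REQUEST = re.compile(
    r"^(?P<op>\w+):(?P<target>\w+)(?:\[(?P<start>\d+)\.\.(?P<end>\d+)\])?$"
)


def run_request(request: str, table: dict) -> dict:
    """Evaluate one request string against a column -> values mapping."""
    match = REQUEST.match(request.strip())
    if match is None:
        raise ValueError(f"Malformed request: {request!r}")
    op, target = match["op"], match["target"]

    if op == "row":
        # The target is a zero-based row index; the result is keyed by it.
        index = int(target)
        return {target: {name: values[index] for name, values in table.items()}}

    if target not in table:
        raise ValueError(f"Invalid column: {target}")
    column = table[target]

    # Both bounds are inclusive, matching the notebook's mean of rows 1 to 3.
    start = int(match["start"]) if match["start"] is not None else 0
    end = int(match["end"]) if match["end"] is not None else len(column) - 1
    values = column[start : end + 1]

    if op == "column":
        return {target: dict(enumerate(values, start=start))}
    if op == "mean":
        return {"mean": float(mean(values))}
    raise ValueError(f"Unsupported operation: {op}")


print(run_request("column:num_wings", TABLE))     # {'num_wings': {0: 2, 1: 0, 2: 0, 3: 0}}
print(run_request("mean:num_legs[1..3]", TABLE))  # {'mean': 4.0}
```

Calling `run_request("mean:num_fingers", TABLE)` raises `ValueError`, mirroring the notebook's last cell, where the real parser is expected to raise an `OutputParserException`.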