# Testing LangChain's Capabilities with Pandas

In [84]:
from langchain.chat_models import ChatOpenAI
from langchain_experimental.agents.agent_toolkits import create_pandas_dataframe_agent
from langchain.agents.agent_types import AgentType
from langchain.schema.output_parser import StrOutputParser
from langchain.output_parsers import PandasDataFrameOutputParser
from langchain.prompts import ChatPromptTemplate, PromptTemplate
import pandas as pd
import numpy as np
from langchain.llms import OpenAI
import os

### Load Sample DF

In [2]:
df = pd.read_csv("CSVs/titanic.csv")

In [3]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


### Initialize GPT

In [4]:
os.environ["OPENAI_API_KEY"] = ""

In [12]:
model_name = "gpt-3.5-turbo-16k"

# Reuse model_name so the model is defined in one place only
llm_model = ChatOpenAI(model=model_name, temperature=0, verbose=True, streaming=True)

### Call the DF with LangChain Pandas Parser (LCEL)

In [42]:
df_parser = PandasDataFrameOutputParser(dataframe=df)

The DF parser provides the format instructions that tell the model how to phrase its answer. **These instructions list the column names, which are specific to the DF used.**

The instructions are narrow, so the set of queries the parser can handle is **limited**.

In [53]:
print(df_parser.get_format_instructions())

The output should be formatted as a string as the operation, followed by a colon, followed by the column or row to be queried on, followed by optional array parameters.
1. The column names are limited to the possible columns below.
2. Arrays must either be a comma-separated list of numbers formatted as [1,3,5], or it must be in range of numbers formatted as [0..4].
3. Remember that arrays are optional and not necessarily required.
4. If the column is not in the possible columns or the operation is not a valid Pandas DataFrame operation, return why it is invalid as a sentence starting with either "Invalid column" or "Invalid operation".

As an example, for the formats:
1. String "column:num_legs" is a well-formatted instance which gets the column num_legs, where num_legs is a possible column.
2. String "row:1" is a well-formatted instance which gets row 1.
3. String "column:num_legs[1,2]" is a well-formatted instance which gets the column num_legs for rows 1 and 2, where num_legs is a p
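
To make the request format above more concrete, here is a minimal sketch of how a string such as `column:num_legs[1,2]` could be split into an operation, a target, and optional indices. `parse_request` and `REQUEST_PATTERN` are hypothetical names written for illustration; this is not LangChain's implementation, and treating `[0..4]` as an inclusive range is an assumption.

```python
import re

# Hypothetical helper, NOT part of LangChain: it only illustrates the
# "operation:target[optional array]" shape described in the instructions.
REQUEST_PATTERN = re.compile(
    r"^(?P<operation>\w+):(?P<target>[^\[\]]+?)(?:\[(?P<array>[^\]]*)\])?$"
)


def parse_request(request):
    """Split a request string into (operation, target, indices or None)."""
    match = REQUEST_PATTERN.match(request.strip())
    if match is None:
        raise ValueError(f"Malformed request: {request!r}")

    indices = None
    array = match.group("array")
    if array is not None:
        range_match = re.fullmatch(r"(\d+)\.\.(\d+)", array)
        if range_match:
            start, end = map(int, range_match.groups())
            # Assumption: the range [0..4] includes both endpoints.
            indices = list(range(start, end + 1))
        elif re.fullmatch(r"\d+(,\d+)*", array):
            indices = [int(n) for n in array.split(",")]
        else:
            raise ValueError(f"Invalid array format in {request!r}")

    return match.group("operation"), match.group("target"), indices


print(parse_request("column:num_legs[1,2]"))
print(parse_request("row:1"))
```

Under this grammar the bracketed part can only hold row numbers, so a boolean filter such as `Age[Sex=='male']` has no valid encoding and is rejected.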

Notice how the **`PromptTemplate` takes the format instructions as *partial_variables***

*`ChatPromptTemplate` did not appear to work here; the commented-out attempt is kept below for reference.*

In [80]:
# template = """
# Answer the following {query}.

# Use the provided {format_instructions}.

# """

# prompt = ChatPromptTemplate.from_template(template)

prompt = PromptTemplate(
    template="Answer the user query.\n{format_instructions}\n{query}\n",
    input_variables=["query"],
    partial_variables={"format_instructions": df_parser.get_format_instructions()},
)


In [81]:
chain = prompt | llm_model | df_parser

In [82]:
chain.invoke({
    "query":"Retrieve the first row."
})

{'0': PassengerId                          1
 Survived                             0
 Pclass                               3
 Name           Braund, Mr. Owen Harris
 Sex                               male
 Age                               22.0
 SibSp                                1
 Parch                                0
 Ticket                       A/5 21171
 Fare                              7.25
 Cabin                              NaN
 Embarked                             S
 Name: 0, dtype: object}

It does a pretty decent job. Let's try a more complex operation:

In [83]:
chain.invoke({
    "query":"What is the mean of the Age column?"
})

{'mean': 29.69911764705882}

Let's check...

In [85]:
np.mean(df.Age)

29.69911764705882

It is correct. Something a bit more complicated now:

In [86]:
chain.invoke({
    "query":"How big is the dataframe?"
})

OutputParserException: Invalid operation: "How big is the dataframe?" is not a valid Pandas DataFrame operation.. Please check the format instructions.

In [87]:
chain.invoke({
    "query":"What is the median age of the male passengers?"
})

OutputParserException: Invalid array format in 'Age[Sex=='male']'.                     Please check the format instructions.

##### The parser seems to be **limited** to specific pandas operations. We'll need to use **agents** (see below)

### Call the DF with LangChain Agent

Try this by varying the **`return_intermediate_steps`** argument and leaving out the **`agent_type`** option. Returning the intermediate steps gives you access to a semi-parsable answer. Including **`agent_type`** makes things run much more smoothly.

In [30]:
pandas_df_agent = create_pandas_dataframe_agent(llm=llm_model, df=df, verbose=True, 
                                                handle_parsing_errors=True,
                                                agent_type=AgentType.OPENAI_FUNCTIONS,
                                                return_intermediate_steps=True)

##### Let's try the complex questions we couldn't solve above with the DF parser:

In [31]:
pandas_result = pandas_df_agent.invoke("How big is this dataframe?")



> Entering new AgentExecutor chain...

Invoking: `python_repl_ast` with `{'query': 'df.shape'}`

(891, 12)
The dataframe has 891 rows and 12 columns.

> Finished chain.


In [32]:
pandas_result

{'input': 'How big is this dataframe?',
 'output': 'The dataframe has 891 rows and 12 columns.',
 'intermediate_steps': [(AgentActionMessageLog(tool='python_repl_ast', tool_input={'query': 'df.shape'}, log="\nInvoking: `python_repl_ast` with `{'query': 'df.shape'}`\n\n\n", message_log=[AIMessage(content='', additional_kwargs={'function_call': {'arguments': '{\n  "query": "df.shape"\n}', 'name': 'python_repl_ast'}})]),
   (891, 12))]}

In [35]:
pandas_result["intermediate_steps"][0]

(AgentActionMessageLog(tool='python_repl_ast', tool_input={'query': 'df.shape'}, log="\nInvoking: `python_repl_ast` with `{'query': 'df.shape'}`\n\n\n", message_log=[AIMessage(content='', additional_kwargs={'function_call': {'arguments': '{\n  "query": "df.shape"\n}', 'name': 'python_repl_ast'}})]),
 (891, 12))

In [89]:
pandas_result["intermediate_steps"][0][1]

(891, 12)
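
Each entry in `intermediate_steps` is an `(action, observation)` tuple, so the raw tool outputs can be collected without indexing step by step. The sketch below assumes the result dictionary has the shape shown above. `collect_observations` is a hypothetical helper, and the action object is replaced by a placeholder string so the snippet runs without calling the model.

```python
def collect_observations(result):
    """Return the tool output (second tuple element) of every intermediate step."""
    steps = result.get("intermediate_steps", [])
    return [observation for _action, observation in steps]


# Stand-in for the agent result shown above; the real first element is an
# AgentActionMessageLog, replaced here by a plain string.
example_result = {
    "input": "How big is this dataframe?",
    "output": "The dataframe has 891 rows and 12 columns.",
    "intermediate_steps": [("<AgentActionMessageLog ...>", (891, 12))],
}

print(collect_observations(example_result))  # [(891, 12)]
```

This returns an empty list when the agent was created without `return_intermediate_steps=True`, since the key is then absent from the result.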

In [90]:
pandas_result = pandas_df_agent.invoke("What is the median age of the male passengers?")



> Entering new AgentExecutor chain...

Invoking: `python_repl_ast` with `{'query': "df[df['Sex'] == 'male']['Age'].median()"}`

29.0
The median age of the male passengers is 29.0.

> Finished chain.


In [91]:
pandas_result["intermediate_steps"][0][1]

29.0

Let's confirm this:

In [100]:
(df
 .query("Sex=='male'")
 .Age
 .median()
 )

29.0

In [101]:
(df
 .Age
 .median()
 )

28.0

##### This is correct, and much better than the DF parser, which could not handle this query
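
Since the parser chain is strict and the agent is flexible, one possible way to combine them is to try the parser first and fall back to the agent only when parsing fails. This is a minimal sketch, not something the notebook ran: `answer_with_fallback` is a hypothetical helper, and the two stand-in functions replace the real chain and agent so the snippet runs without an API key.

```python
def answer_with_fallback(query, primary, fallback, errors=(Exception,)):
    """Try primary(query); if it raises one of `errors`, return fallback(query)."""
    try:
        return primary(query)
    except errors:
        return fallback(query)


# Stand-ins (assumptions) that mimic the behaviour seen in this notebook:
# the parser chain rejects the query, the agent answers it.
def strict_parser_chain(query):
    raise ValueError("Invalid operation")


def flexible_agent(query):
    return {"input": query, "output": "answered by the agent"}


result = answer_with_fallback(
    "How big is the dataframe?", strict_parser_chain, flexible_agent
)
print(result["output"])  # answered by the agent
```

In this notebook, `primary` could be `lambda q: chain.invoke({"query": q})` and `fallback` could be `pandas_df_agent.invoke`, with `errors` narrowed to the `OutputParserException` raised in the failed cells above.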