### In this notebook, we implement the Pandas DataFrame agent provided by LangChain as one of the methods for answering questions about the Chronic Disease dataset.

First, we import the required libraries from LangChain (https://python.langchain.com/docs/get_started/introduction).

In [1]:
from langchain.agents.agent_types import AgentType
from langchain.chat_models import ChatOpenAI
from langchain.chat_models.azure_openai import AzureChatOpenAI
from langchain_experimental.agents.agent_toolkits import create_pandas_dataframe_agent
import pandas as pd

Then we read the dataframe produced by preprocessing the data files. We work with the unpivoted (long) table format.

In [2]:
df = pd.read_csv("Loading_PreprocessingData/filtered_data.csv", low_memory = False)

In [3]:
df.head(5)

Unnamed: 0,YearStart,YearEnd,LocationAbbr,LocationDesc,DataSource,Topic,Question,DataValueUnit,DataValueType,DataValue,DataValueFootnoteSymbol,DatavalueFootnote,LowConfidenceLimit,HighConfidenceLimit,StratificationCategory1,Stratification1,GeoLocation,ProcessedDataValue,ProcessedDataValue_z_score
0,2014,2014,AR,Arkansas,SEDD; SID,Asthma,Hospitalizations for asthma,,Number,916,,,,,Gender,Male,POINT (-92.27449074299966 34.74865012400045),,
1,2018,2018,CO,Colorado,SEDD; SID,Asthma,Hospitalizations for asthma,,Number,2227,,,,,Overall,Overall,POINT (-106.13361092099967 38.843840757000464),,
2,2018,2018,DC,District of Columbia,SEDD; SID,Asthma,Hospitalizations for asthma,,Number,708,,,,,Overall,Overall,POINT (-77.036871 38.907192),,
3,2017,2017,GA,Georgia,SEDD; SID,Asthma,Hospitalizations for asthma,,Number,3520,,,,,Gender,Female,POINT (-83.62758034599966 32.83968109300048),,
4,2010,2010,MI,Michigan,SEDD; SID,Asthma,Hospitalizations for asthma,,Number,123,,,,,Race/Ethnicity,Hispanic,POINT (-84.71439026999968 44.6613195430005),,


Now we call the Azure OpenAI API to create our LLM, using GPT-4 as the main model.
Depending on your setup, you might need to replace AzureChatOpenAI with ChatOpenAI. We use the 'Chat' model classes because we are building a chatbot.

In [4]:
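import os

llm = AzureChatOpenAI(
    openai_api_type="azure",
    openai_api_base="https://efgopenaiswedencenteral.openai.azure.com/",
    openai_api_version="2023-07-01-preview",
    deployment_name="GPT-4",
    # Read the key from an environment variable instead of hard-coding it.
    # The variable name AZURE_OPENAI_API_KEY is an assumption; use the one you have set.
    openai_api_key=os.environ["AZURE_OPENAI_API_KEY"],
)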

Let's use our agent. The agent should be able to get answers from the dataframe by writing Python code and then chat with the user.

We will use the agent's default parameters.

In [5]:
agent = create_pandas_dataframe_agent(llm, df, verbose=True)

Here are the agent's default parameters:

In [6]:
print(agent)

verbose=True agent=ZeroShotAgent(llm_chain=LLMChain(prompt=PromptTemplate(input_variables=['agent_scratchpad', 'input'], partial_variables={'df_head': '|    |   YearStart |   YearEnd | LocationAbbr   | LocationDesc         | DataSource   | Topic   | Question                    |   DataValueUnit | DataValueType   |   DataValue |   DataValueFootnoteSymbol |   DatavalueFootnote |   LowConfidenceLimit |   HighConfidenceLimit | StratificationCategory1   | Stratification1   | GeoLocation                                    |   ProcessedDataValue |   ProcessedDataValue_z_score |\n|---:|------------:|----------:|:---------------|:---------------------|:-------------|:--------|:----------------------------|----------------:|:----------------|------------:|--------------------------:|--------------------:|---------------------:|----------------------:|:--------------------------|:------------------|:-----------------------------------------------|---------------------:|---------------------------

Now, let's create our chatbot using the Gradio library in Python!

In [7]:
from langchain.schema import AIMessage, HumanMessage
import time
import gradio as gr

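def predict(message, history):
    # Convert the Gradio history into LangChain messages.
    # Note: this list is built but not passed to the agent, so each
    # question is answered without the earlier conversation as context.
    history_langchain_format = []
    for human, ai in history:
        history_langchain_format.append(HumanMessage(content=human))
        history_langchain_format.append(AIMessage(content=ai))
    history_langchain_format.append(HumanMessage(content=message))

    # Run the agent on the latest message only.
    gpt_response = agent.run(message)

    # Stream the finished answer character by character for a typing effect.
    response_message = ""
    for character in gpt_response:
        response_message += character
        time.sleep(0.01)
        yield response_message
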
# Create a Gradio interface
chat_interface = gr.ChatInterface(predict)

# Enable the queue
chat_interface.queue()

# Launch the Gradio interface
chat_interface.launch()

# Use this line instead to share the chatbot with others and require a login
#chat_interface.launch(share = True, auth=("admin", "pass1234"))

Running on local URL:  http://127.0.0.1:7877

To create a public link, set `share=True` in `launch()`.
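
The typing effect in `predict` comes from a generator that yields progressively longer prefixes of the finished answer, and Gradio redraws the message on every yield. A minimal standalone sketch of that pattern is shown below; the function name `stream_prefixes` and its `delay` parameter are illustrative and not part of Gradio or LangChain.

```python
import time
from typing import Iterator


def stream_prefixes(text: str, delay: float = 0.0) -> Iterator[str]:
    """Yield 'H', 'He', 'Hel', ... until the whole text has been emitted."""
    partial = ""
    for character in text:
        partial += character
        if delay:
            time.sleep(delay)  # purely cosmetic pause between updates
        yield partial


print(list(stream_prefixes("Hi!")))  # ['H', 'Hi', 'Hi!']
```

Because `agent.run` returns only after the agent has finished, the streaming here is cosmetic: it does not shorten the wait for the first character.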




Interesting! You can check the debug output printed after the chatbot to see how the agent works.
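
With `verbose=True`, LangChain colors the trace using ANSI escape codes. When the trace is captured as plain text, these can show up as sequences such as `[32;1m`. If you want to save or compare traces, a small helper along these lines can remove them; `strip_ansi` is a hypothetical name, and the pattern only targets color/style codes.

```python
import re

# Matches ANSI color/style sequences such as ESC[1m, ESC[32;1m, ESC[0m.
ANSI_STYLE = re.compile(r"\x1b\[[0-9;]*m")


def strip_ansi(text: str) -> str:
    """Return the text with ANSI color/style codes removed."""
    return ANSI_STYLE.sub("", text)


raw = "\x1b[1m> Entering new AgentExecutor chain...\x1b[0m"
print(strip_ansi(raw))  # > Entering new AgentExecutor chain...
```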

Here are some sample evaluation questions we use to assess the performance of the various language model methods we're implementing:

In [8]:
# List Cancer related indicators available in the dataset.

answer_1 = df[df.Topic == 'Cancer']['Question'].unique()
answer_1

array(['Cancer of the oral cavity and pharynx, mortality',
       'Cancer of the prostate, mortality',
       'Invasive cancer (all sites combined), mortality',
       'Cancer of the female breast, mortality',
       'Cancer of the female cervix, mortality',
       'Cancer of the colon and rectum (colorectal), mortality',
       'Cancer of the lung and bronchus, mortality',
       'Invasive melanoma, incidence', 'Melanoma, mortality',
       'Invasive cancer of the oral cavity or pharynx, incidence',
       'Invasive cancer of the prostate, incidence',
       'Invasive cancer (all sites combined), incidence',
       'Invasive cancer of the female breast, incidence',
       'Invasive cancer of the cervix, incidence',
       'Cancer of the colon and rectum (colorectal), incidence',
       'Cancer of the lung and bronchus, incidence',
       'Mammography use among women aged 50-74 years',
       'Fecal occult blood test, sigmoidoscopy, or colonoscopy among adults aged 50-75 years',
      

In [9]:
# List Cancer-related indicators for which there are data in the last year of any available data.
# (The expression below counts the matching rows.)
answer_2 =  len(df[(df.Topic == 'Cancer') & (df.YearStart == df.YearStart.max())])
answer_2

0
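
The count is 0 because `df.YearStart.max()` is taken over every topic, and no Cancer row starts in that year. Another possible reading of the question is "the last year for which Cancer data exist". The sketch below follows that reading on a small made-up frame; the rows are hypothetical and not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the notebook's dataframe.
toy = pd.DataFrame({
    "Topic": ["Cancer", "Cancer", "Cancer", "Asthma"],
    "Question": [
        "Melanoma, mortality",
        "Melanoma, mortality",
        "Invasive melanoma, incidence",
        "Hospitalizations for asthma",
    ],
    "YearStart": [2014, 2015, 2015, 2021],
})

# Restrict to Cancer first, then take the latest year within that subset.
cancer = toy[toy["Topic"] == "Cancer"]
last_cancer_year = int(cancer["YearStart"].max())
indicators = list(cancer.loc[cancer["YearStart"] == last_cancer_year, "Question"].unique())

print(last_cancer_year, indicators)
```

On the real dataframe the same two steps would replace `toy` with `df`; which reading the evaluation question intends is a judgment call.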

In [10]:
# How many different data value types are in the dataset?
answer_3 = len(df['DataValueType'].unique())
answer_3

12

In [11]:
# What is the mean value for indicator 'Fecal occult blood test, sigmoidoscopy, or colonoscopy among adults aged 50-75 years' in the last year of available data? Provide answer for Overall stratification group.

max_year = df[df.Question == 'Fecal occult blood test, sigmoidoscopy, or colonoscopy among adults aged 50-75 years']['YearStart'].max()

answer_4 = df[(df.Question == 'Fecal occult blood test, sigmoidoscopy, or colonoscopy among adults aged 50-75 years') & (df.YearStart == max_year) & (df.Stratification1 == "Overall")]['DataValue'].astype(float).mean()
answer_4

73.12222222222222

In [12]:
# What is the value for 'Cancer of the female breast, mortality' in Wisconsin? Provide answer for Overall stratification group.

answer_5 = df[(df.Question == 'Cancer of the female breast, mortality') & (df.Stratification1 == 'Overall') & (df.LocationAbbr == 'WI')][['YearStart', 'YearEnd', 'DataValue']].sort_values(by='YearStart', ascending=False)
answer_5

Unnamed: 0,YearStart,YearEnd,DataValue
18180,2015,2019,720
17772,2014,2018,729
18992,2013,2017,726
18543,2012,2016,743
18720,2011,2015,753
17331,2010,2014,763
19238,2009,2013,767
18357,2008,2012,758




> Entering new AgentExecutor chain...
Thought: This is not a question, but a greeting. Therefore, no specific action is required.
Final Answer: Hello! How can I assist you today?

> Finished chain.


> Entering new AgentExecutor chain...
Thought: To find the cancer related indicators, I first need to check the unique values available in the "Topic" column. If "Cancer" is one of the topics, I then need to find the unique values in the "Question" column where the "Topic" is "Cancer". These unique questions will represent the cancer related indicators available in the dataset.
Action: python_repl_ast
Action Input: df['Topic'].unique()
Observation: ['Asthma' 'Cancer' 'Chronic Kidney Disease'
 'Chronic Obstructive Pulmonary Disease' 'Cardiovascular Disease'
 'Diabetes' 'Disability' 'Reproductive Health' 'Tobacco' 'Alcohol'
 'Arthritis' 'Nutrition, Physical Activity, and Weight Status'
 'Mental Health' 'Older Adults' 'O



> Entering new AgentExecutor chain...
Thought: I need to filter the dataframe based on the 'Question', 'LocationDesc' and 'Stratification1' columns.
Action: python_repl_ast
Action Input: df[(df['Question'] == 'Cancer of the female breast, mortality') & (df['LocationDesc'] == 'Wisconsin') & (df['Stratification1'] == 'Overall')]['DataValue']
Observation: 17331    763
17772    729
18180    720
18357    758
18543    743
18720    753
18992    726
19238    767
Name: DataValue, dtype: object
Thought: There are multiple values for 'Cancer of the female breast, mortality' in Wisconsin for the Overall stratification group. I should take the most recent value.
Action: python_repl_ast
Action Input: df[(df['Question'] == 'Cancer of the female breast, mortality') & (df['LocationDesc'] == 'Wisconsin') & (df['Stratification1'] == 'Overall')].sort_values('YearStart', ascending=False)['DataValue'].iloc[0]
Observation: 720
Thoug

Thought: The last year in the 'YearStart' column is 2021. I will now filter the DataFrame to exclude this year.
Action: python_repl_ast
Action Input: df[df['YearStart'] != 2021]
Observation:         YearStart  YearEnd LocationAbbr          LocationDesc DataSource  \
0            2014     2014           AR              Arkansas  SEDD; SID   
1            2018     2018           CO              Colorado  SEDD; SID   
2            2018     2018           DC  District of Columbia  SEDD; SID   
3            2017     2017           GA               Georgia  SEDD; SID   
4            2010     2010           MI              Michigan  SEDD; SID   
...           ...      ...          ...                   ...        ...   
567424       2020     2020           WY               Wyoming      BRFSS   
567425       2020     2020           WY               Wyoming      BRFSS   
567426       2020     2020           WY               Wyoming      BRFSS   
567427       2020   

And now, the moment of truth! The agent aced 4 out of 5 questions (one of them with an incomplete answer). Great job! 🌟

That's it! Give it a go on your own.

That concludes our look at LangChain's Pandas DataFrame agent. To improve its performance, consider tuning its parameters and experimenting with different approaches. For instance, a useful step is to provide the agent with the unique values of the categorical columns, which helps prevent answers like "The data is not provided" (sometimes the data is there, but the agent fails to find the value in the dataframe).
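
One way to hand the agent those unique values is to render them as a short text block and include it with the question. The helper below is a rough sketch under that assumption; `build_category_hints` and `max_values` are hypothetical names, and the sample values are only for illustration.

```python
from typing import Iterable, Mapping


def build_category_hints(columns: Mapping[str, Iterable], max_values: int = 20) -> str:
    """Render the distinct values of each column as one bullet line per column."""
    lines = ["Known values for categorical columns:"]
    for name, values in columns.items():
        unique = sorted({str(value) for value in values})
        shown = unique[:max_values]
        extra = len(unique) - len(shown)
        suffix = f" (and {extra} more)" if extra > 0 else ""
        lines.append(f"- {name}: {', '.join(shown)}{suffix}")
    return "\n".join(lines)


hints = build_category_hints({
    "Topic": ["Asthma", "Cancer", "Asthma"],
    "Stratification1": ["Overall", "Male", "Female"],
})
print(hints)
```

With the notebook's dataframe, the input could be built as `{c: df[c].dropna().unique() for c in ("Topic", "Stratification1")}`, and the result prepended to the user's message, for example `agent.run(hints + "\n\n" + message)`. Long lists increase prompt size, so capping them with `max_values` is a deliberate trade-off.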

Additionally, it's beneficial to simplify the column names to enhance the agent's understanding.
Another potential enhancement is providing the agent with a clear problem objective and description.
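
For the column-name suggestion, one simple option is to convert the CamelCase headers to snake_case before creating the agent. This is only a sketch: `to_snake_case` is a hypothetical helper, and whether shorter or more descriptive names actually help the agent more is something to test.

```python
import re


def to_snake_case(name: str) -> str:
    """Insert '_' before each capital that follows a lowercase letter or digit, then lowercase."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()


for column in ["YearStart", "LocationAbbr", "DataValueType", "ProcessedDataValue_z_score"]:
    print(column, "->", to_snake_case(column))
```

Since `DataFrame.rename` accepts a function for `columns`, the helper can be applied with `df = df.rename(columns=to_snake_case)`.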

There are several aspects to consider for enhancements. In this project, we are just exploring the power of LLMs.