<a href="https://colab.research.google.com/github/nkkchem/SCA24_Hackathon/blob/main/Agent_Tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Large Language Model Agents in LangChain





This is a brief introduction to building Large Language Model (LLM) agents with `LangChain`.

LLM agents are essentially LLMs that are given tools specific to their task. If you want an LLM to configure a database or access the internet, an agent is how you give it that ability.
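To make "given tools" concrete, here is a minimal sketch of a tool registry in plain Python, with no LLM involved. The tool names (`word_count`, `shout`) and the `call_tool` helper are hypothetical and exist only for illustration; LangChain's own tool classes are richer than this.

```python
from typing import Callable, Dict

# Hypothetical tools: plain Python functions the "agent" is allowed to call.
def word_count(text: str) -> str:
    return str(len(text.split()))

def shout(text: str) -> str:
    return text.upper()

# The registry maps the name the model must emit to the function to run.
TOOLS: Dict[str, Callable[[str], str]] = {
    "word_count": word_count,
    "shout": shout,
}

def call_tool(name: str, tool_input: str) -> str:
    """Run the named tool, or report which tools exist if the name is unknown."""
    if name not in TOOLS:
        return f"{name} is not a valid tool, try one of {sorted(TOOLS)}."
    return TOOLS[name](tool_input)

print(call_tool("word_count", "agents are LLMs with tools"))  # 5
print(call_tool("search", "anything"))  # reports that 'search' is not a valid tool
```

The model never runs code itself. It only names a tool and supplies an input, and the surrounding program decides whether that name maps to something it is willing to execute.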

This tutorial walks through one potential application of LLM agents: a pandas dataframe manager. We will see how to load a local LLM through `vllm`, as well as how to use an API-based model such as gpt-4, gemini-pro, or claude.

This tutorial is adapted from a tutorial built by Pinecone at https://github.com/pinecone-io/examples

---

To start, let's install the packages we will be using for this example.

In [1]:
!pip install -qU torch==2.3.0 vllm==0.4.2 langchain langchain_community langchain_core langchain_experimental openai google-search-results pandas langchain_openai langchain_google_genai datasets

## Pandas Agent - Setup

Suppose we have a pandas dataframe (or several) that we would like an LLM to interact with. This type of integration provides a natural-language way of querying your data, which can help non-experts use your group's data to form insights.

We will first need to create a pandas dataframe, then set up an LLM to interface with the data.

⚠️ In this tutorial we will try to use the GPU resources of Google Colab, but the free tier of Colab cannot run larger models, which may decrease the reliability of the LLM agents. For this reason, we also provide a loader for the OpenAI API models, but you will need an API key to use this resource. ⚠️

----

## The Large Language Models

In the following cells, we will create an LLM while pulling an example pandas dataframe from github. We will first provide instructions for spinning up a `vllm`-based LLM which pulls in models from `HuggingFace`. Then we will show the equivalent steps for OpenAI and Google.

IMPORTANT: The rest of this notebook should use only one of the three approaches. Choose one of the three cells below to run.
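One way to decide which cell to run is sketched below. The `pick_backend` helper and its priority order (OpenAI key, then Google key, then local GPU) are assumptions made for illustration; they are not part of LangChain or of the original notebook.

```python
from typing import Mapping

def pick_backend(env: Mapping[str, str], has_gpu: bool) -> str:
    """Return which of the three loader cells to run (hypothetical helper)."""
    if env.get("OPENAI_API_KEY"):
        return "openai"
    if env.get("GOOGLE_API_KEY"):
        return "google"
    if has_gpu:
        return "vllm"
    raise RuntimeError("No API key found and no GPU available.")

print(pick_backend({"OPENAI_API_KEY": "placeholder"}, has_gpu=False))  # openai
print(pick_backend({}, has_gpu=True))  # vllm
```

In Colab, the mapping could be filled from the Secrets tab, and `has_gpu` from whatever GPU check your runtime offers.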

In [None]:
# RUN THIS CELL FOR VLLM MODELS

import pandas as pd
from langchain_experimental.agents.agent_toolkits import create_pandas_dataframe_agent
from langchain_community.llms import VLLM

llm = VLLM(
    model="microsoft/Phi-3-mini-4k-instruct",  # Any HF model name; prefer smaller models, since they must fit in 15 GB of VRAM
    trust_remote_code=True,  # needed for HF models that ship custom code
    temperature=0,
    dtype="float16",  # This is only required when using the T4 GPU on Colab
)

df = pd.read_csv(
    "https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv"
)

In [2]:
# RUN THIS CELL FOR OPENAI MODELS

import os
import pandas as pd
from langchain_experimental.agents.agent_toolkits import create_pandas_dataframe_agent
from langchain_openai import ChatOpenAI
from google.colab import userdata

os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')

# Requires OPENAI_API_KEY. If it is not set, add it in the Secrets tab on the left side of the screen (click the key icon).
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

df = pd.read_csv(
    "https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv"
)

In [None]:
# RUN THIS CELL FOR GOOGLE MODELS

import os
import pandas as pd
from langchain_experimental.agents.agent_toolkits import create_pandas_dataframe_agent
from langchain_google_genai import ChatGoogleGenerativeAI
from google.colab import userdata

os.environ["GOOGLE_API_KEY"] = userdata.get("GOOGLE_API_KEY")

llm = ChatGoogleGenerativeAI(model="gemini-pro", temperature=0)

df = pd.read_csv(
    "https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv"
)

----

## The Agent

Now that we have our `llm` loaded, we can wrap it in an `AgentExecutor`, which runs the loop in which the model reasons, calls a tool, and reads the result until it reaches a final answer.

There are a few different pre-made Agent types that LangChain provides:

1. Zero Shot React
2. Conversational React
3. React Docstore

or variations of these, which can be found in the documentation [here](https://api.python.langchain.com/en/latest/agents/langchain.agents.agent_types.AgentType.html)


---

We are going to use the Zero Shot React agent type. This agent treats each input as a standalone ('zero shot') task: it may take several reasoning and tool-use steps within a single call, but it keeps no memory between calls, so later questions cannot build on earlier ones.
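The reason-act-observe loop behind this agent type can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not LangChain's implementation: `run_agent`, `scripted_model`, and the `row_count` tool are hypothetical, and the "model" is a hard-coded function rather than an LLM.

```python
import re
from typing import Callable, Dict

# Regex for the "Action: ... / Action Input: ..." lines seen in ReAct-style output.
ACTION_RE = re.compile(r"Action:\s*(.+?)\s*\nAction Input:\s*(.+)", re.DOTALL)

def run_agent(
    model: Callable[[str], str],
    tools: Dict[str, Callable[[str], str]],
    question: str,
    max_iterations: int = 5,
) -> str:
    """Toy ReAct loop: ask the model, run the requested tool, feed the result back."""
    scratchpad = f"Question: {question}\n"  # rebuilt on every call, so no memory
    for _ in range(max_iterations):
        reply = model(scratchpad)
        if "Final Answer:" in reply:
            return reply.split("Final Answer:", 1)[1].strip()
        match = ACTION_RE.search(reply)
        if match is None:
            observation = "Could not parse an action."
        else:
            name, tool_input = match.group(1), match.group(2).strip()
            if name in tools:
                observation = tools[name](tool_input)
            else:
                observation = f"{name} is not a valid tool, try one of {list(tools)}."
        scratchpad += f"{reply}\nObservation: {observation}\n"
    return "Agent stopped due to iteration limit or time limit."

# Scripted stand-in for an LLM: first asks for a tool, then answers.
def scripted_model(scratchpad: str) -> str:
    if "Observation:" not in scratchpad:
        return "Thought: I should count the rows.\nAction: row_count\nAction Input: df"
    return "Thought: I now know the final answer\nFinal Answer: There are 3 rows."

tools = {"row_count": lambda _: "3"}
print(run_agent(scripted_model, tools, "how many rows are there?"))  # There are 3 rows.
```

Because the scratchpad is created fresh inside `run_agent`, nothing carries over from one question to the next, which is the "no memory" property described above. The loop also shows why an agent can stop without an answer: if the model never emits a valid tool name or a final answer, the iteration limit is the only exit.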

In [3]:
# create_pandas_dataframe_agent was already imported in the LLM cell above
from langchain.agents.agent_types import AgentType

agent_executor = create_pandas_dataframe_agent(
    llm=llm,
    df=df,
    verbose=True,
    agent_type=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    handle_parsing_errors=True,
)



### Single DataFrame

Now we can run some example questions to see how well the agent handles the example dataframe.

In [4]:
agent_executor.invoke(
    "how many rows are there?"
)



> Entering new AgentExecutor chain...
Thought: To find out how many rows are in the dataframe, I can use the `shape` attribute of the dataframe.
Action: python_repl_ast
Action Input: df.shape[0]
891
I now know the final answer
Final Answer: There are 891 rows in the dataframe.

> Finished chain.


{'input': 'how many rows are there?',
 'output': 'There are 891 rows in the dataframe.'}

In [5]:
# Note: .run() is deprecated in recent LangChain releases in favor of .invoke(), which returns a dict instead of a string.
agent_executor.run("how many people have more than 3 siblings")



> Entering new AgentExecutor chain...
Thought: We need to filter the dataframe based on the condition of having more than 3 siblings and then count the number of rows that meet this condition.

Action: python_repl_ast
Action Input: df[df['SibSp'] > 3]['PassengerId'].count()
30
I now know the final answer

Final Answer: 30

> Finished chain.


'30'

In [6]:
agent_executor.invoke("whats the square root of the average age?")



> Entering new AgentExecutor chain...
Thought: To find the square root of the average age, we need to calculate the average age first.
Action: python_repl_ast
Action Input: df['Age'].mean()
29.69911764705882
Now that we have the average age, let's calculate the square root.
Action: python_repl_ast
Action Input: import math; math.sqrt(df['Age'].mean())
5.449689683556195
I now know the final answer
Final Answer: The square root of the average age is approximately 5.45

> Finished chain.


{'input': 'whats the square root of the average age?',
 'output': 'The square root of the average age is approximately 5.45'}
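Since the agent's answers come from generated code, it is worth checking them directly in pandas. The sketch below repeats the three questions on a tiny stand-in frame (the values are made up; only the column names match the Titanic data), so it runs without downloading anything. Note that `SibSp` in the Titanic data counts siblings and spouses together, so "more than 3 siblings" is only an approximation of what the column holds.

```python
import math
import pandas as pd

# Tiny stand-in frame using the same column names as the Titanic data.
toy = pd.DataFrame(
    {
        "PassengerId": [1, 2, 3, 4, 5],
        "SibSp": [0, 4, 1, 5, 8],
        "Age": [16.0, 25.0, None, 36.0, 23.0],
    }
)

n_rows = len(toy)                                 # same as toy.shape[0]
n_many_siblings = int((toy["SibSp"] > 3).sum())   # rows where SibSp exceeds 3
sqrt_mean_age = math.sqrt(toy["Age"].mean())      # mean() skips missing ages

print(n_rows, n_many_siblings, sqrt_mean_age)  # 5 3 5.0
```

Running the same three expressions on the real `df` gives reference values to compare against the agent's output.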

### Multiple DataFrames

Now that we have shown that the LLM agent can run over a single dataframe, we can add a slightly altered copy of the dataframe and pass multiple dataframes to the agent.

In [7]:
df1 = df.copy()
df1["Age"] = df1["Age"].fillna(df1["Age"].mean())

In [8]:
agent_executor = create_pandas_dataframe_agent(
    llm=llm,
    df=[df,df1],
    verbose=True,
    agent_type=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    handle_parsing_errors=True,
)



In [15]:
agent_executor.invoke("Count how many rows in the age column are different?")



> Entering new AgentExecutor chain...
Action: [python_repl_ast]
Action Input: df1['Age'].nunique()
[python_repl_ast] is not a valid tool, try one of [python_repl_ast].
I should use the python_repl_ast tool to count the unique values in the 'Age' column for both dataframes.
Action: [python_repl_ast]
Action Input: df1['Age'].nunique()
[python_repl_ast] is not a valid tool, try one of [python_repl_ast].
I should use the python_repl_ast tool to count the unique values in the 'Age' column for both dataframes.
Action: [python_repl_ast]
Action Input: df1['Age'].nunique()
[python_repl_ast] is not a valid tool, try one of [python_repl_ast].
I should use the python_repl_ast tool to count the unique values in the 'Age' column for both dataframes.
Action: [python_repl_ast]
Action Input: df1['Age'].nunique()
[python_repl_ast] is not a valid tool, try one of [python_repl_ast].
I should use the python_repl_ast tool to 

{'input': 'Count how many rows in the age column are different? python_repl_ast',
 'output': 'Agent stopped due to iteration limit or time limit.'}
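In the run above, the model wrapped the tool name in brackets (`[python_repl_ast]`), the executor did not recognize that as a valid tool, and the agent repeated the same step until it hit the iteration limit. The code it attempted, `nunique()`, would also not have answered the question, since that counts distinct values rather than differing rows. For reference, the comparison can be done directly in pandas. The sketch below uses toy frames with made-up values that mirror the `df` / `df1` setup.

```python
import pandas as pd

# Toy frames mirroring the setup above: df1 fills missing ages with the mean.
df_toy = pd.DataFrame({"Age": [20.0, None, 40.0, None]})
df1_toy = df_toy.copy()
df1_toy["Age"] = df1_toy["Age"].fillna(df1_toy["Age"].mean())

# NaN never compares equal to anything, so each filled row counts as different.
n_different = int((df_toy["Age"] != df1_toy["Age"]).sum())
n_missing = int(df_toy["Age"].isna().sum())

print(df1_toy["Age"].tolist())  # [20.0, 30.0, 40.0, 30.0]
print(n_different, n_missing)   # 2 2
```

Because `df1` differs from `df` only where `Age` was missing, the number of differing rows equals the number of missing ages in the original frame.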

## Scientific Example - Dataframe of Cells

Let's look at a scientific example. We will load in a dataframe that contains information on cells. These cells are located in various tissue types and organisms. We can use this to illustrate how LLM-agents might be integrated into scientific databases.


In [17]:
df = pd.read_parquet("hf://datasets/pankajrajdeo/PubChem_cell/PubChem_cell_text.parquet")

In [18]:
agent_executor = create_pandas_dataframe_agent(
    llm=llm,
    df=df,
    verbose=True,
    agent_type=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    handle_parsing_errors=True,
)



In [26]:
agent_executor.invoke("Count how many tissue types there are.")



> Entering new AgentExecutor chain...
Thought: We need to find the unique values in the 'tissue' column to count how many tissue types are there.
Action: python_repl_ast
Action Input: df['tissue'].nunique()
57
There are 57 unique tissue types in the dataframe.
Final Answer: 57

> Finished chain.


{'input': 'Count how many tissue types are there?', 'output': '57'}

In [27]:
agent_executor.invoke("What is the most common organism?")



> Entering new AgentExecutor chain...
Thought: To find the most common organism, we need to count the occurrences of each organism in the dataframe.
Action: Use value_counts() on the 'organism' column.
Action Input: df['organism'].value_counts()
Use value_counts() on the 'organism' column. is not a valid tool, try one of [python_repl_ast].
I will use python_repl_ast to get the value counts of the 'organism' column.
Action: python_repl_ast
Action Input: df['organism'].value_counts()
organism
Homo sapiens (human)                                 2403
Cricetulus griseus (Chinese hamster)                  214
Mus musculus (house mouse)                            207
Rattus norvegicus (Norway rat)                         72
Bos taurus (cattle)                                    48
                                                     ... 
Pelodiscus sinensis (Chinese soft-shelled turtle)       1
Gallus gallus (chicken)                     

{'input': 'What is the most common organism?',
 'output': 'The most common organism in the dataframe is Homo sapiens (human) with 2403 occurrences.'}
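The two pandas calls the agent relied on here, `nunique()` and `value_counts()`, can be tried on a small frame to see what they return. The data below is invented for illustration; only the column names `tissue` and `organism` come from the example above.

```python
import pandas as pd

# Made-up rows; one tissue value is missing on purpose.
cells = pd.DataFrame(
    {
        "tissue": ["lung", "liver", "lung", "skin", None],
        "organism": ["human", "human", "mouse", "human", "rat"],
    }
)

n_tissues = cells["tissue"].nunique()       # distinct non-missing values
counts = cells["organism"].value_counts()   # sorted, most frequent first
most_common = counts.idxmax()               # label with the highest count

print(n_tissues, most_common, int(counts.iloc[0]))  # 3 human 3
```

By default `nunique()` ignores missing values, so a count reported by the agent reflects only rows where the column is filled in.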