<a href="https://colab.research.google.com/github/jaredmullane/LLM_Class/blob/main/Claude_3_Opus_structured_query_agent.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

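Install the dependencies: LlamaIndex core, the Hugging Face embeddings and Anthropic LLM integrations, and the file readers.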

In [None]:
!pip install llama-index-core llama-index-embeddings-huggingface llama-index-llms-anthropic llama-index-readers-file

Collecting llama-index-core
  Downloading llama_index_core-0.10.16-py3-none-any.whl (15.3 MB)
     15.3/15.3 MB 41.6 MB/s eta 0:00:00
Collecting llama-index-embeddings-huggingface
  Downloading llama_index_embeddings_huggingface-0.1.4-py3-none-any.whl (7.7 kB)
Collecting llama-index-llms-anthropic
  Downloading llama_index_llms_anthropic-0.1.5-py3-none-any.whl (4.4 kB)
Collecting llama-index-readers-file
  Downloading llama_index_readers_file-0.1.7-py3-none-any.whl (34 kB)
Collecting dataclasses-json (from llama-index-core)
  Downloading dataclasses_json-0.6.4-py3-none-any.whl (28 kB)
Collecting deprecated>=1.2.9.3 (from llama-index-core)
  Downloading Deprecated-1.2.14-py2.py3-none-any.whl (9.6 kB)
Collecting dirtyjson<2.0.0,>=1.0.8 (from llama-index-core)
  Downloading dirtyjson-1.0.8-py3-none-any.whl (25 kB)
Collecting httpx (from llama-index-core)
  Downloading httpx-0.27.0-py3-none-any.whl (75 k

Import everything we need.

In [None]:
import pandas as pd
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.core.query_engine import PandasQueryEngine
from llama_index.core.embeddings import resolve_embed_model
from llama_index.llms.anthropic import Anthropic
from llama_index.core.tools import QueryEngineTool, ToolMetadata
from llama_index.core.agent import ReActAgent
from llama_index.core.agent.react.formatter import ReActChatFormatter
from google.colab import userdata

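Set Claude 3 Opus as the default LLM and use a local Hugging Face embedding model (`BAAI/bge-small-en-v1.5`). The Anthropic API key is read from the Colab secret named `anthropic-key`.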

In [None]:
Settings.llm = Anthropic(
    model="claude-3-opus-20240229",
    api_key=userdata.get('anthropic-key')
)
Settings.embed_model = resolve_embed_model("local:BAAI/bge-small-en-v1.5")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Download the test data: a CSV file of food truck permits in San Francisco and a PDF containing a table that explains what the fields mean.

In [None]:
!wget "https://www.dropbox.com/scl/fi/qfbsbwlxz5gyy5b3xlf1x/DPW_DataDictionary_Mobile-Food-Facility-Permit.pdf?rlkey=utazqw34fimzawsxq1ely11f0&dl=0" -q -O ./data_dictionary.pdf
!wget "https://www.dropbox.com/scl/fi/r3litz78wt08cvp1y5aui/Mobile_Food_Facility_Permit_20240111.csv?rlkey=kg4hpkfdqregnccpmlict7ds5&dl=0" -q -O ./food_truck_permits.csv

First, load the CSV into a DataFrame and ask a `PandasQueryEngine` to count the total number of rows.

In [None]:
df = pd.read_csv("./food_truck_permits.csv")

pandas_query_engine = PandasQueryEngine(df=df, verbose=True)
response = pandas_query_engine.query("How many rows are in the dataframe?")
print(response)

> Pandas Instructions:
```
len(df)
```
> Pandas Output: 481
481
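
The verbose output above suggests the general pattern: the model turns the question into a pandas expression, and the engine evaluates that expression against `df`. The following is only a toy sketch of that generate-then-evaluate idea, not LlamaIndex's implementation. `fake_llm` and `answer` are hypothetical names, the rows are made up, and a plain list of dicts stands in for the DataFrame so the block runs without any dependencies.

```python
# Toy sketch of "generate an expression, then evaluate it".
# Nothing here is LlamaIndex code; all names and data are illustrative.

def fake_llm(question):
    """Hypothetical stand-in for the model: maps one known question to an expression."""
    canned = {"How many rows are in the dataframe?": "len(df)"}
    return canned[question]


def answer(question, df):
    """Ask the stand-in model for an expression and evaluate it against df."""
    expression = fake_llm(question)
    print("> Instructions:", expression)
    # Expose only `df` and `len` to the expression. Emptying the builtins
    # narrows what the expression can see, but it is not a real sandbox.
    result = eval(expression, {"__builtins__": {}}, {"df": df, "len": len})
    print("> Output:", result)
    return result


# Made-up rows standing in for the permit table.
rows = [
    {"Applicant": "A", "FacilityType": "Truck"},
    {"Applicant": "B", "FacilityType": "Push Cart"},
    {"Applicant": "C", "FacilityType": "Truck"},
]
row_count = answer("How many rows are in the dataframe?", rows)
```

Because the expression comes from a model, evaluating it amounts to running generated code, so this pattern is probably best kept to trusted data and isolated environments such as a throwaway Colab session.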


Then index the PDF and query it to find out which column in the table records the type of vehicle being permitted (a food truck or a push cart).

In [None]:
data_dictionary_docs = SimpleDirectoryReader(input_files=["./data_dictionary.pdf"]).load_data()
index = VectorStoreIndex.from_documents(data_dictionary_docs)
data_dictionary_engine = index.as_query_engine()

response = data_dictionary_engine.query("What field describes what type of vehicle the mobile food facility is?")
print(response)

The FacilityType field describes the type of vehicle the mobile food facility is, such as a truck or push cart.
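
The data dictionary names the field, but the answer does not say how the values are spelled or capitalized in the CSV itself. One way to check this without involving the LLM is to tally the distinct values directly. The sketch below assumes the CSV header contains a column named exactly `FacilityType`; the helper name `facility_type_counts` and the sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def facility_type_counts(csv_file, column="FacilityType"):
    """Count rows per value in `column`, ignoring case and surrounding whitespace."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        value = (row.get(column) or "").strip().lower()
        # Empty or absent values are grouped under a placeholder key.
        counts[value or "<missing>"] += 1
    return counts


# Made-up sample rows for illustration only; not taken from the real file.
sample = io.StringIO(
    "Applicant,FacilityType\n"
    "A,Truck\n"
    "B,Push Cart\n"
    "C,truck\n"
    "D,\n"
)
counts = facility_type_counts(sample)
print(counts)
```

To run it on the downloaded data, open the file with `open("./food_truck_permits.csv", newline="")` and pass the file object to `facility_type_counts`. Seeing the actual values makes it easier to judge whether an exact-match filter on this column would catch every row.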


Now wrap both query engines as tools for a ReAct agent, then create a task for the agent to work through one step at a time.

In [None]:
query_engine_tools = [
    QueryEngineTool(
        query_engine=data_dictionary_engine,
        metadata=ToolMetadata(
            name="datadictionary",
            description=(
                "Accepts natural-language questions about the column names used "
                "to describe data about mobile food facilities in San Francisco. "
                "These can be food trucks or carts."
            ),
        ),
    ),
    QueryEngineTool(
        query_engine=pandas_query_engine,
        metadata=ToolMetadata(
            name="permitdata",
            description=(
                "Accepts natural-language questions about mobile food facilities "
                "in San Francisco. You should be explicit about what column names "
                "to use by getting them from the data dictionary."
            ),
        ),
    ),
]

agent = ReActAgent.from_tools(query_engine_tools, verbose=True)
task = agent.create_task("How many food trucks are there in San Francisco?")

In [None]:
step_output = agent.run_step(task.task_id)
print(step_output)

Thought: I need to use a tool to help me answer the question.
Action: datadictionary
Action Input: {'input': 'What column name would tell me the number of food trucks?'}
Observation: The "FacilityType" column in the data dictionary indicates whether each mobile food facility is a truck or push cart. To determine the total number of food trucks, you would need to count the number of rows where the "FacilityType" value is "truck".
Observation: The "FacilityType" column in the data dictionary indicates whether each mobile food facility is a truck or push cart. To determine the total number of food trucks, you would need to count the number of rows where the "FacilityType" value is "truck".


In [None]:
step_output = agent.run_step(task.task_id)
print(step_output)

Thought: Now that I know the column name to use, I can query the permit data to get the number of food trucks.
Action: permitdata
Action Input: {'input': "How many rows have a FacilityType value of 'Truck'?"}
> Pandas Instructions:
```
len(df[df['FacilityType'] == 'Truck'])
```
> Pandas Output: 419
Observation: 419
Observation: 419
