function-calling-test-suite (FCTS) is a pragmatic test framework for assessing the function calling capabilities of large language models (LLMs).
Test spec files contain YAML streams that define the metadata, input, and expected output for a set of test cases.
For example:
```yaml
---
categories:
- basic
description: >-
  Asserts that the model can make a function call with a given argument and
  conveys the result to the user
prompt: Call funcA with 1 and respond with the result of the call
available_functions:
- name: funcA
  description: Performs funcA
  parameters:
    type: object
    properties:
      param1:
        type: integer
        description: Param 1
expected_function_calls:
- name: funcA
  arguments:
    param1: 1
  result: This is the output of funcA(1)
final_answer_should: >-
  The answer should indicate that the result of calling funcA with 1 is "This is
  the output of funcA(1)"
---
# ...
```
Every test spec has three primary components: metadata, input, and expected output.

Each test spec must include a `description` and `categories`. The `description` outlines the test's goal, while `categories` tag the capabilities being tested. Categorizing test cases helps identify the model's strengths and weaknesses.
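For instance, a spec that exercises more than one capability can carry multiple categories so its results count toward each of them (the `parallel` category and `funcB` below are hypothetical, used only for illustration):

```yaml
categories:
- basic
- parallel
description: >-
  Asserts that the model can call funcA and funcB in a single response
```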
When executed, the framework uses the `prompt` and `available_functions` to generate an initial request. It compares the model's response to `expected_function_calls`. If they match, the framework continues making requests with the `result` field until all expected calls are completed or a call fails.
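As an illustration of this loop (not the framework's actual implementation), a minimal driver against an OpenAI-compatible chat completion API might look like the sketch below. The `spec` dictionary mirrors the YAML fields shown above, and the `FCTS_*` environment variables are the ones configured later in this README.

```python
import json
import os

from openai import OpenAI

# Illustrative spec mirroring the YAML example above.
spec = {
    "prompt": "Call funcA with 1 and respond with the result of the call",
    "available_functions": [
        {
            "name": "funcA",
            "description": "Performs funcA",
            "parameters": {
                "type": "object",
                "properties": {
                    "param1": {"type": "integer", "description": "Param 1"},
                },
            },
        },
    ],
    "expected_function_calls": [
        {
            "name": "funcA",
            "arguments": {"param1": 1},
            "result": "This is the output of funcA(1)",
        },
    ],
}

client = OpenAI(
    api_key=os.environ["FCTS_API_KEY"],
    base_url=os.environ["FCTS_BASE_URL"],
)
tools = [{"type": "function", "function": f} for f in spec["available_functions"]]
messages = [{"role": "user", "content": spec["prompt"]}]

# Walk the expected calls in order, feeding each canned result back to the model.
# Simplified: assumes the model makes exactly one call per turn.
for expected in spec["expected_function_calls"]:
    message = client.chat.completions.create(
        model=os.environ["FCTS_MODEL"],
        messages=messages,
        tools=tools,
    ).choices[0].message

    calls = message.tool_calls or []
    assert len(calls) == 1, "expected exactly one function call"
    assert calls[0].function.name == expected["name"]
    assert json.loads(calls[0].function.arguments) == expected["arguments"]

    messages.append(message)
    messages.append(
        {"role": "tool", "tool_call_id": calls[0].id, "content": expected["result"]}
    )

# The model's last response is the final answer, which still needs to be verified.
final_answer = client.chat.completions.create(
    model=os.environ["FCTS_MODEL"],
    messages=messages,
    tools=tools,
).choices[0].message.content
```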
Even if a model completes all expected function calls, the final response still needs to be verified. To this end, specs can optionally include a `final_answer_should` field to describe valid answers using natural language.
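Because `final_answer_should` is natural language, it can't be checked by string matching; the suite uses gpt-4-turbo to judge model responses (see the configuration below). A minimal sketch of that kind of judgment, with an assumed prompt and YES/NO parsing rather than the framework's actual logic, could look like:

```python
import os

from openai import OpenAI


def judge_final_answer(final_answer: str, final_answer_should: str) -> bool:
    """Ask the judge model whether the final answer satisfies the spec's requirement.

    Illustrative only: the prompt wording and YES/NO parsing are assumptions,
    not the framework's actual judging logic.
    """
    judge = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    verdict = judge.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {
                "role": "user",
                "content": (
                    f"Final answer:\n{final_answer}\n\n"
                    f"Requirement:\n{final_answer_should}\n\n"
                    "Does the final answer satisfy the requirement? Reply YES or NO."
                ),
            }
        ],
    ).choices[0].message.content
    return verdict.strip().upper().startswith("YES")
```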
The default suite of spec files can be found in the `specs` directory.
Initialize the test environment:

```bash
poetry shell
poetry install
```
Configure FCTS to judge model responses with gpt-4-turbo:

```bash
export OPENAI_API_KEY='<openai-api-key>'
```
Target a model to test:

```bash
export FCTS_API_KEY='<model-provider-api-key>'
export FCTS_BASE_URL='<model-provider-api-base-url>'
export FCTS_MODEL='<model-name>'
```
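As a concrete example, to point the suite directly at OpenAI's API (the model name here is illustrative; any OpenAI-compatible endpoint works):

```bash
export FCTS_API_KEY="$OPENAI_API_KEY"
export FCTS_BASE_URL='https://api.openai.com/v1'
export FCTS_MODEL='gpt-4o'
```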
Run the default test suite with verbose output enabled:

```bash
poetry run pytest -vvv
```
```
$ poetry run pytest -h
usage: pytest [options] [file_or_dir] [file_or_dir] [...]
...
Custom options:
  --spec-run-count=SPEC_RUN_COUNT
                        Number of times each test spec should be run
  --spec-filter=SPEC_FILTER
                        Filter which test specs are run by their generated test IDs
  --spec-dir=SPEC_DIR   Directory containing JSON test spec files
  --stream=STREAM       Enables streaming for all chat completion requests
  --use-system-prompt=USE_SYSTEM_PROMPT
                        Add a default system prompt to all chat completion requests
  --aggregate-summary-file=AGGREGATE_SUMMARY_FILE
                        Add results for the model to an aggregate CSV file
  --request-delay=REQUEST_DELAY
                        Delay in seconds between chat completion requests
...
```
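Several of these options can be combined in a single run; for example (the spec directory and CSV file name below are just placeholders):

```bash
poetry run pytest -vvv \
  --spec-dir=./specs \
  --spec-run-count=5 \
  --request-delay=1 \
  --aggregate-summary-file=results.csv
```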
GPTScript's alternative model provider shims can be used to test models that don't support OpenAI's chat completion API. The `test.sh` script automates the configuration and deployment of provider shims for a few popular models:
- claude-3.5-sonnet
- gemini-1.5-pro
- mistral-large-latest
Before using `test.sh`, ensure the `OPENAI_API_KEY` environment variable is set and the following CLIs are installed on your system:
To run the test suite, pass the name of the model to the script, followed by the desired pytest arguments. For example:

```bash
./test.sh gemini-1.5-pro --spec-run-count=10 --spec-filter='*basic.yaml-0*'
```
Note: The script will prompt for auth tokens if necessary.
You can also run the provider shims and configure the test suite manually. See the sections below for some examples.
Set an Anthropic key:

```bash
export ANTHROPIC_API_KEY='<anthropic-key>'
```
Clone the claude3-anthropic-provider:

```bash
git clone https://github.com/gptscript-ai/claude3-anthropic-provider
```
Follow the Development instructions in the repo's README.md:

```bash
cd claude3-anthropic-provider
export GPTSCRIPT_DEBUG=true
python -m venv .venv
source ./.venv/bin/activate
pip install -r requirements.txt
```
Run the shim:

```bash
./run.sh
```
In another terminal, target the provider shim:

```bash
export FCTS_MODEL='claude-3-5-sonnet-20240620'
export FCTS_BASE_URL='http://127.0.0.1:8000/v1'
export FCTS_API_KEY='foo'
```
Note: The API key can be set to any arbitrary value, but it must be set.
Run the tests:

```bash
poetry shell
poetry install
poetry run pytest --stream=true
```
Note: Streaming must be enabled because the claude3-anthropic-provider doesn't support non-streaming responses.
Ensure the following requirements are met:
- gcloud CLI
- VertexAI access
Configure the `gcloud` CLI to use your VertexAI project and account:

```bash
gcloud config set project <project-name>
gcloud config set billing/quota_project <project-name>
gcloud config set account <account-email>
gcloud components update
```
Afterwards, your configuration should look something like this:

```
$ gcloud config list
[billing]
quota_project = acorn-io
[core]
account = nick@acorn.io
disable_usage_reporting = False
project = acorn-io

Your active configuration is: [default]
```
Authenticate with the `gcloud` CLI:

```bash
gcloud auth application-default login
```
Clone the gemini-vertexai-provider repo:

```bash
git clone https://github.com/gptscript-ai/gemini-vertexai-provider
```
Follow the Development instructions in the repo's README.md:

```bash
cd gemini-vertexai-provider
export GPTSCRIPT_DEBUG=true
python -m venv .venv
source ./.venv/bin/activate
pip install -r requirements.txt
```
Run the shim:

```bash
./run.sh
```
In another terminal, target the provider shim:

```bash
export FCTS_MODEL='gemini-1.5-pro-preview-0409'
export FCTS_BASE_URL='http://127.0.0.1:8081/v1'
export FCTS_API_KEY='foo'
```
Note: The API key can be set to any arbitrary value, but it must be set.
Run the tests:

```bash
poetry shell
poetry install
poetry run pytest --stream=true
```
Note: Streaming must be enabled because gemini-1.5-pro-preview-0409 doesn't support non-streaming responses.