You can find my blog post/tutorial for this repo here.
A framework for testing and evaluating LLMs' ability to select tools and respond to user queries across different scenarios. Compatible with Gemini and OpenAI models.
```
genai-agent-tool-selection-testing/
├── main.py                  # Main entry point and test orchestration
├── models.py                # Model implementations (OpenAI, Gemini)
├── evaluator.py             # Evaluation logic and metrics
├── model_tester.py          # Test execution engine
├── utils.py                 # Utility functions for processing responses
├── tools/
│   ├── functions.py         # Function definitions for tool calling
│   └── function_registry.py # Function registry and model-specific formatting
├── datasets/
│   └── test_dataset.json    # Combined test dataset
├── prompts/
│   ├── semantic_judge_tool_selection.txt
│   ├── semantic_judge_error.txt
│   ├── semantic_judge_clarifying.txt
│   ├── semantic_judge_no_tool.txt
│   └── semantic_judge_not_supported.txt
├── results/                 # Test run outputs
├── requirements.txt
└── README.md
```
The framework includes several specialized test datasets that are combined into a unified test dataset:
- **Tool Selection Tests**
  - Tests the model's ability to select appropriate functions
  - Includes ground truth function calls with arguments
  - Covers various domains (weather, navigation, translation, etc.)
- **Clarification Tests**
  - Tests the model's ability to request missing information
  - Validates appropriate clarifying questions
  - Ensures models don't make assumptions with incomplete data
- **Error Handling Tests**
  - Tests the model's response to invalid inputs
  - Includes cases like invalid dates and non-existent locations
  - Validates appropriate error messages
- **Direct Response Tests**
  - Tests the model's knowledge-based responses
  - No function calling required
  - Factual questions with clear ground truth
- **Not Supported Tests**
  - Tests the model's ability to gracefully handle unsupported actions
  - Includes requests for device control, media playback, and real-time data
  - Validates clear communication of limitations
  - Ensures helpful alternative suggestions when possible
The framework includes a model-agnostic tool repository that defines available functions and automatically converts them to the appropriate format for each LLM at runtime.
Tools are defined using a simple, model-independent format:
```python
Function(
    name="get_weather",
    description="Get the weather in a given location",
    parameters=[
        FunctionParameter(
            name="location",
            type="string",
            description="The city name of the location."
        )
    ]
)
```
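The `Function` and `FunctionParameter` classes live in `tools/functions.py`; their exact definitions aren't reproduced here. As a minimal sketch, model-independent containers along these lines would support the usage above (the field defaults are assumptions):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class FunctionParameter:
    """One parameter of a tool, in a model-agnostic form."""
    name: str
    type: str              # JSON Schema type name, e.g. "string"
    description: str
    required: bool = True  # optional parameters pass required=False

@dataclass
class Function:
    """A tool definition that the registry later converts per provider."""
    name: str
    description: str
    parameters: List[FunctionParameter] = field(default_factory=list)
```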
The framework includes common tools for testing:
- Weather information retrieval
- Time zone conversions
- Text translation
- Route calculation
- Hotel booking
- Stock price lookup
- Email sending
- News headlines retrieval
- Unit conversion
- Reminder setting
- Restaurant search
- Calendar event retrieval
- Math problem solving
- Word definition lookup
- Traffic condition checking
The tool registry automatically converts function definitions to model-specific formats:
- **OpenAI Format**
  - Converts to OpenAI's function calling format
  - Includes required parameters specification
  - Maintains JSON Schema compatibility
- **Gemini Format**
  - Converts to Gemini's FunctionDeclaration format
  - Preserves parameter descriptions and types
  - Packages functions into a Tool object
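The registry's conversion code isn't reproduced here, but both targets can be sketched roughly as below. The helper names and the use of the Vertex AI SDK are illustrative assumptions built on the `Function`/`FunctionParameter` sketch above, not the framework's actual implementation:

```python
from typing import List

def to_openai_schema(fn: Function) -> dict:
    """Sketch: map a model-agnostic Function to OpenAI's function-calling format."""
    return {
        "name": fn.name,
        "description": fn.description,
        "parameters": {
            "type": "object",
            "properties": {
                p.name: {"type": p.type, "description": p.description}
                for p in fn.parameters
            },
            # OpenAI's JSON Schema expects required parameter names as a list
            "required": [p.name for p in fn.parameters if p.required],
        },
    }

def to_gemini_tool(functions: List[Function]) -> "Tool":
    """Sketch: package Functions as a single Gemini Tool of FunctionDeclarations."""
    from vertexai.generative_models import FunctionDeclaration, Tool

    declarations = [
        FunctionDeclaration(
            name=fn.name,
            description=fn.description,
            # FunctionDeclaration accepts an OpenAPI-style schema dict, so the
            # OpenAI parameter conversion can be reused here
            parameters=to_openai_schema(fn)["parameters"],
        )
        for fn in functions
    ]
    return Tool(function_declarations=declarations)
```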
To add new tools to the framework:
- Define the function in `tools/functions.py`:
```python
registry.register(Function(
    name="your_function",
    description="Description of what your function does",
    parameters=[
        FunctionParameter(
            name="param_name",
            type="string",
            description="Parameter description",
            required=True  # or False for optional parameters
        )
    ]
))
```
- The function will automatically be available to both Gemini and OpenAI models with the appropriate format conversion handled by the registry.
The testing framework operates in three main stages:
- **Model Testing**
  - Loads test cases from the dataset
  - Initializes the model (Gemini or OpenAI) with the appropriate configuration
  - Executes each test case, capturing raw model responses
  - Handles both function calls and text responses based on `expected_response_type`
  - Manages API interactions and error handling
- **Response Processing**
  - Converts model-specific response formats into a standardized structure
  - Extracts relevant information (function calls, arguments, text responses)
  - Handles the different response structures of Gemini and OpenAI
  - Prepares responses for evaluation
- **Evaluation**
  - Performs multi-level response validation (a sketch follows this list):
    - a. **Response Type Check**: verifies that the model provided the expected response type (function call or text)
    - b. **Exact Match Check**: for function calls, matches the function name and arguments exactly; for text responses, direct string comparison
    - c. **Semantic Evaluation**: uses an LLM as a semantic judge with type-specific prompts, evaluating response meaning and intent while allowing for variations in phrasing and format
    - d. **Mismatch Analysis**: categorizes types of mismatches, identifies specific differences in function calls, and analyzes parameter variations
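The real logic lives in evaluator.py; purely as an illustration of the cascade above, with all names assumed rather than taken from the codebase (the `judge` callable stands in for the LLM semantic judge):

```python
def evaluate_case(case: dict, model_function_call, model_text, judge) -> dict:
    """Illustrative cascade: type check -> exact match -> semantic judge."""
    truth = case["ground_truth"]

    # a. Response Type Check
    want_call = truth["expected_response_type"] == "function_call"
    if (model_function_call is not None) != want_call:
        return {"result": "fail", "reason": "response_type_mismatch"}

    # b. Exact Match Check
    expected = truth["function_call"] if want_call else truth["text"]
    actual = model_function_call if want_call else model_text
    if actual == expected:
        return {"result": "pass", "method": "exact_match"}

    # c. Semantic Evaluation: an LLM judge with a type-specific prompt decides
    # whether the response means the same thing despite different phrasing
    if judge(case, actual):
        return {"result": "pass", "method": "semantic_judge"}

    # d. Mismatch Analysis would categorize the difference (wrong function,
    # differing arguments, divergent text) before reporting the failure
    return {"result": "fail", "reason": "semantic_mismatch"}
```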
You can use the test datasets with your own model and bring the responses back for evaluation:
- **Load and Use Test Dataset**
  - Import test cases from the dataset
  - Use the cases to test your own model implementation
  - Generate responses in the standardized format (see the loading sketch after the format below)
- **Format Responses**: Your responses need to be in a JSON file with the following structure:
{ "test_results": [ { "id": "A001", "type": "tool_selection", "user_query": "What's the weather like in New York?", "ground_truth": { "function_call": { "name": "get_weather", "arguments": { "location": "New York" } }, "text": null, "expected_response_type": "function_call" }, "model_function_call": { "name": "get_weather", "arguments": { "location": "New York" } }, "model_text": null }, { "id": "B001", "type": "text_response", "user_query": "What is the capital of France?", "ground_truth": { "function_call": null, "text": "The capital of France is Paris.", "expected_response_type": "text" }, "model_function_call": null, "model_text": "Paris is the capital of France." } ] }
- **Run Evaluation**: Use the evaluation-only mode to assess your model's responses:

```bash
python main.py \
  --eval-only \
  --processed-responses path/to/your/responses.json \
  --semantic-judge-model gemini-1.5-flash-002
```
```bash
# Test function calling with Gemini
python main.py --model-type gemini --mode function_call --dataset datasets/test_tool_selection.json

# Test direct responses with OpenAI
python main.py --model-type openai --mode no_function --dataset datasets/test_no_tool.json --openai-api-key YOUR_KEY
```
The application supports various command-line arguments for customization:
- `--model-type`: Type of model to test (`gemini` or `openai`)
- `--dataset`: Path to the test dataset file
  - Example: `--dataset datasets/test_tool_selection.json`
  - Not required in eval-only mode
- `--eval-only`: Run in evaluation-only mode
  - Flag only, no value needed
  - Example: `--eval-only`
- `--processed-responses`: Path to your processed responses file
  - Example: `--processed-responses results/my_responses.json`
- `--mode`: Testing mode (default: `function_call`)
  - Choices: `function_call`, `no_function`
  - Example: `--mode no_function`
- `--openai-model-name`: OpenAI model to use (default: `gpt-4-1106-preview`)
  - Example: `--openai-model-name gpt-4`
- `--gemini-model-id`: Gemini model ID (default: `gemini-1.5-pro-002`)
  - Example: `--gemini-model-id gemini-1.0-pro`
- `--openai-api-key`: OpenAI API key (required for OpenAI models)
  - Example: `--openai-api-key sk-...`
- `--semantic-judge-model`: Model for semantic evaluation
  - Example: `--semantic-judge-model gpt-4`
- `--semantic-judge-prompt`: Path to a custom semantic judge prompt file
  - Example: `--semantic-judge-prompt prompts/custom_prompt.txt`
- `--skip-evaluation`: Skip the evaluation phase after running tests
  - Flag only, no value needed
  - Example: `--skip-evaluation`
- Basic Gemini test with function calling:

```bash
python main.py --model-type gemini --dataset datasets/test_tool_selection.json
```

- OpenAI test with custom model and evaluation settings:

```bash
python main.py \
  --model-type openai \
  --mode no_function \
  --dataset datasets/test_no_tool.json \
  --openai-api-key YOUR_KEY \
  --openai-model-name gpt-4 \
  --semantic-judge-model gpt-4 \
  --semantic-judge-prompt prompts/custom_prompt.txt
```

- Gemini test with evaluation skipped:

```bash
python main.py \
  --model-type gemini \
  --dataset datasets/test_clarifying.json \
  --skip-evaluation
```

- Evaluate your own model responses:

```bash
python main.py \
  --eval-only \
  --processed-responses results/my_model_responses.json \
  --mode no_function \
  --semantic-judge-model gemini-1.5-pro-002 \
  --semantic-judge-prompt prompts/semantic_judge_not_supported.txt
```
- Python 3.7 or higher
- OpenAI API key (for OpenAI models)
- Google Cloud credentials (for Gemini models)
- Required packages listed in `requirements.txt`
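Install the dependencies with:

```bash
pip install -r requirements.txt
```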
This project is licensed under the MIT License.