Use `pragmatics` to query a language model based on a template and then evaluate the response. This is useful for testing the capabilities of language models in a variety of domains, including programming, security, and general knowledge. `pragmatics` implements various custom checks that let you evaluate the quality of the response, including code compilation, code execution, and more.
Check out the pragmatics-templates repository containing templates for testing language models.
Please note that this project is not intended for rigorous evaluation of model capabilities or for determining which model is more capable, knowledgeable, factual, biased, harmful, aligned, or helpful.
This project is heavily inspired by Nicholas Carlini's GitHub repository Yet Another Applied LLM Benchmark, as well as nuclei. Most of the initial templates were adapted from his repository. More to come soon.
| 🚧 Disclaimer 🚧 |
|---|
| This project is in active development. Expect breaking changes with releases. |

```
$ python3 pragmatics.py -t code/python/ -m gpt-4-0125-preview claude-3-opus-20240229 gemini-1.0-pro mistral-large-latest -r 5 -o ./output/result --workers 4
                                           __  _
    ____  _________ _____ _____ ___  ____ _/ /_(_)_________
   / __ \/ ___/ __ `/ __ `/ __ `__ \/ __ `/ __/ / ___/ ___/
  / /_/ / /  / /_/ / /_/ / / / / / / /_/ / /_/ / /__(__  )
 / .___/_/   \__,_/\__, /_/ /_/ /_/\__,_/\__/_/\___/____/  v0.0.0
/_/               /____/
                                          phibenz.github.io
[+] Selected models: 4
[+] pragmatics-templates version 0.0.0 up-to-date
[+] Selected templates: 5
❌ [c-convert-sqrt-from-python][gemini-1.0-pro] (1/5)
❌ [c-weird-expression][gemini-1.0-pro] (1/5)
✅ [c-crc32][claude-3-opus-20240229] (1/5)
[...]
❌ [c-rref][mistral-large-latest] (5/5)
❌ [c-weird-expression][gpt-4-0125-preview] (5/5)
✅ [c-convert-dp-from-python][claude-3-opus-20240229] (5/5)
+-------------------------------------------------------+
|                 Pragmatics Leaderboard                 |
+-------------------------------------------------------+
| Rank | Model                  | Acc (All) | Acc (Any) |
+------+------------------------+-----------+-----------+
| 1    | claude-3-opus-20240229 | 56.00%    | 60.00%    |
| 2    | mistral-large-latest   | 44.00%    | 60.00%    |
| 3    | gpt-4-0125-preview     | 44.00%    | 60.00%    |
| 4    | gemini-1.0-pro         | 8.00%     | 40.00%    |
+------+------------------------+-----------+-----------+
```
Set your API keys in `./config/api-keys.yaml`:

```
openai: "sk-xxx"
anthropic: "sk-ant-xxx"
mistralai: "xxx"
googleai: "AIxxx"
```

You can use `./config/api-keys.yaml.example` as a template:

```
cp ./config/api-keys.yaml.example ./config/api-keys.yaml
```
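
If you want to sanity-check the file before a run, a small standalone script along these lines will do. This helper is not part of pragmatics; it only assumes the flat YAML layout shown above and that PyYAML is installed:

```python
# check_keys.py - hypothetical helper, not shipped with pragmatics.
# Assumes the flat ./config/api-keys.yaml layout shown above.
import yaml

with open("./config/api-keys.yaml") as f:
    keys = yaml.safe_load(f) or {}

for provider in ("openai", "anthropic", "mistralai", "googleai"):
    value = keys.get(provider, "")
    # Treat the documented "xxx" placeholders as not configured.
    status = "set" if value and "xxx" not in value else "missing or placeholder"
    print(f"{provider:10s} {status}")
```
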
| USAGE AT YOUR OWN RISK |
|---|
| It is HIGHLY recommended to use Docker, since `pragmatics` will blindly execute code provided by the language model. This includes (over)writing files, executing code, and more. If the model decides to provide malicious code, such as `rm -rf /`, it could be executed on your machine. |

Make sure that Docker is installed on your machine. You can run the pragmatics Docker image with the following command:

```
docker run -it --rm \
    -v $(pwd)/config:/pragmatics/config \
    -v $(pwd)/output:/pragmatics/output \
    phibenz/pragmatics:latest \
    -t code/python/python-hello-world.yaml -m gpt-3.5-turbo-0125 --output ./output
```

Build the Docker image with:

```
docker build -t pragmatics .
```
If you are bold/insane enough, you can run pragmatics outside of Docker. NOT RECOMMENDED! Make sure you have Python 3.8 or higher installed on your machine. You can install the required dependencies with:

```
pip install -r requirements.txt
```
```
Pragmatics: A benchmarking tool for language models

options:
  -h, --help                    show this help message and exit
  -t, --templates               List of templates or a template directory to run
  -m, --models                  Models to evaluate (specific model or vendor)
  -r, --repetitions             Number of repetitions
  --workers                     Number of workers
  --timeout                     Timeout for each job
  --silent                      Do not print the banner
  --log-level                   Log level (debug,info,warning,error,critical)
  --list-models                 List available models
  --skip-clean-up               Skip clean up
  --skip-leaderboard            Skip leaderboard
  -o, --output                  Path to the output folder to save the results
  --download-templates          Download the latest templates
  --pragmatics-templates-path   Overwrite path to the pragmatics-templates folder
                                (default: ~/pragmatics-templates)
```
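
Because everything runs through this one CLI, batch jobs can be scripted around it. Below is a minimal sketch (not part of pragmatics; the repetition count and output paths are example choices) that loops over the template directories mentioned in this README using only the flags listed above:

```python
# batch_run.py - hypothetical wrapper around the documented CLI flags.
import subprocess

TEMPLATE_DIRS = ["code/python/", "code/c/"]   # template directories from the examples below
MODELS = ["gpt-3.5-turbo-0125"]               # any model from --list-models

for tdir in TEMPLATE_DIRS:
    out = f"./output/{tdir.strip('/').replace('/', '-')}"  # e.g. ./output/code-python
    subprocess.run(
        ["python3", "pragmatics.py",
         "-t", tdir,
         "-m", *MODELS,
         "-r", "3",
         "--workers", "4",
         "-o", out],
        check=True,
    )
```
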
To see a list of available models, use the `--list-models` flag or have a look in the `config/models` folder:

```
python3 pragmatics.py --list-models --silent
Available models:
- open-mistral-7b
- open-mixtral-8x7b
- mistral-small-latest
- mistral-medium-latest
- mistral-large-latest
- gpt-4-0125-preview
- gpt-4-turbo-preview
- gpt-4-1106-preview
- gpt-4-0613
- gpt-4-32k-0613
- gpt-3.5-turbo-0125
- gpt-3.5-turbo-1106
- claude-3-haiku-20240307
- claude-3-sonnet-20240229
- claude-3-opus-20240229
- gemini-1.0-pro
```
With the following command, you can run the template `python-hello-world.yaml` with the model `gpt-3.5-turbo-0125`:

```
python3 pragmatics.py -t code/python/python-hello-world.yaml -m gpt-3.5-turbo-0125
```
Run all templates in the `code/c/` directory with the Anthropic models and save the results in the `./output/anthropic-c` directory:

```
python3 pragmatics.py -t code/c/ -m anthropic -o ./output/anthropic-c
```
Use multiple workers to run the templates in parallel:

```
python3 pragmatics.py -t code/c/ -m gemini-1.0-pro -o ./output/multi-workers --workers 4
```
For each job, an emoji indicates the status of the job:

✅ - The job was successfully evaluated
❌ - The job was not successfully evaluated
🐞 - The job failed, likely a code or model issue. You should investigate.
⏳ - The job timed out (see the `--timeout` flag)

Afterwards, the template ID, the model, and the repetition number are indicated:

```
⏳ [c-crc32][gemini-1.0-pro] (2/5)
❌ [c-rref][gpt-4-0125-preview] (3/5)
✅ [c-crc32][claude-3-opus-20240229] (3/5)
```
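
Since the status lines follow the fixed pattern `<emoji> [template-id][model] (repetition/total)`, a captured console log can be tallied with a few lines of Python. A rough sketch, assuming you redirected the output to a file named `run.log` yourself (pragmatics does not create such a file for you):

```python
# Tally job statuses per model from a captured console log.
# Sketch only; assumes the status-line format shown above and a file "run.log".
import re
from collections import Counter

STATUS = re.compile(r"^(✅|❌|🐞|⏳) \[([^\]]+)\]\[([^\]]+)\] \((\d+)/(\d+)\)")
counts = Counter()

with open("run.log", encoding="utf-8") as f:
    for line in f:
        match = STATUS.match(line.strip())
        if match:
            emoji, model = match.group(1), match.group(3)
            counts[(model, emoji)] += 1

for (model, emoji), n in sorted(counts.items()):
    print(f"{model:25s} {emoji} x{n}")
```
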
If you specify the `--output` flag, the results are saved in the specified directory. Please note that if you specify an existing folder, the results are overwritten and updated. Several files are created:

`score.json`: The results are saved in the following format:
```
{
    "model": {
        "template-id": [
            true,   # These are the results of the evaluations x repetitions
            false,
            true,
            false,
            false
        ],
        [...]
    },
    [...]
}
```
`accuracy.json`: The accuracy of each model is saved in the following format:
```
{
    "model": {
        "all": 0.6,                # The accuracy of the model over all templates & repetitions
        "any": 0.6666666666666666  # The accuracy of the model, where at least one repetition per template is correct
    },
    [...]
}
```
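
To make the two numbers concrete: "all" averages the boolean results over every repetition of every template, while "any" counts a template as solved if at least one of its repetitions passed. A small standalone sketch of that calculation (illustrative only, not the code pragmatics uses internally; the path assumes the example run above):

```python
# Recompute "all" and "any" accuracy from a score.json produced by a run.
import json

with open("./output/result/score.json") as f:  # output path from the example run above
    score = json.load(f)

for model, templates in score.items():
    all_runs = [result for reps in templates.values() for result in reps]
    acc_all = sum(all_runs) / len(all_runs)                                   # every repetition counts
    acc_any = sum(any(reps) for reps in templates.values()) / len(templates)  # >= 1 pass per template
    print(f"{model:25s} all={acc_all:.2%} any={acc_any:.2%}")
```
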
`leaderboard.json`: The leaderboard, as shown at the end of the run, is stored in this file.
Additionally, for each model, template, and repetition, a JSON file is created that stores information about the job, the response, and the evaluation.
For example, here is the log for a model that prints "hello world" and evaluates the output:
{"type": "language_model", "name": "AnthropicModel", "input": [{"role": "user", "content": "Write a python program that prints the string \"hello world\". Print only the code."}], "output": "print(\"hello world\")", "success": true, "error": null, "args": {"model": "claude-3-opus-20240229", "api_key": "api-key-redacted"}}
{"type": "processor", "name": "ExtractCodeProcessor", "input": "print(\"hello world\")", "output": "print(\"hello world\")", "success": true, "error": null, "args": {}}
{"type": "processor", "name": "PythonRunProcessor", "input": "print(\"hello world\")", "output": "hello world", "success": true, "error": null, "args": {}}
{"type": "evaluator", "name": "WordEvaluator", "input": "hello world", "output": "True", "success": true, "error": null, "args": {"patterns": ["hello world"], "condition": "and"}}
{"type": "template_job:final_eval", "name": "TemplateJob", "input": [true], "output": true, "success": true, "error": null, "args": {"query_evaluators_condition": "and"}}
{"type": "template_job:clean_up", "name": "TemplateJob", "input": "/tmp/pragmatics/python-hello-world", "output": null, "success": true, "error": null, "args": {}}
Have a look at the collection of existing templates in the pragmatics-templates repository. The easiest way to get started is to copy one of the existing templates and modify it.
For example, here is a template that queries the model to write a simple Python function and then evaluates the model's output by running the function with input arguments and comparing the result to the expected output:
```
id: python-string-length
info:
  name: Python String Length Function
  author: phibenz
  difficulty: easy
  description: Write a python function that returns the input string and appends its length.
  tags:
    - code
    - python
query:
  - name: question
    text: |-
      Write a python function that returns the input string and appends its length.
      Call the function "print_length".
      For example if you input "hello world", the function returns "hello world 11".
      Print only the code.
    processors:
      - type: extract-code
      - type: python-compile
        fn-name: print_length
    evaluators:
      - type: python-fn
        fn-name: print_length
        args:
          - "hello world"
        output: "hello world 11"
      - type: python-fn
        fn-name: print_length
        args:
          - "test"
        output: "test 4"
```
This project is far from finished. Contributions are welcome! For major changes, please open an issue first to discuss what you would like to change.