We have prepared some example tasks to run auto evaluation. You can run them by following the steps below:
- Complete the `evaluator_config.json` (referring to the schema in `evaluator_config_template.json`) under the `auto_eval` folder and the `taskweaver_config.json` under the `taskweaver` folder.
- Go to the `auto_eval` folder.
- Run the command below to start the auto evaluation for a single case:

  ```bash
  python taskweaver_eval.py -m single -p cases/echo
  ```

  Or run the command below to start the auto evaluation for multiple cases:

  ```bash
  python taskweaver_eval.py -m batch -p ./cases
  ```
The script accepts the following parameters:

- `-m/--mode`: specifies the evaluation mode, which can be `single` or `batch`.
- `-p/--path`: specifies the path to the test case file or the directory containing test case files.
- `-r/--result`: specifies the path to the result file for batch evaluation mode. This parameter is only valid in batch mode. The default value is `sample_case_results.csv`.
- `-f/--fresh`: specifies whether to flush the result file. This parameter is only valid in batch mode. The default value is `False`, which means that already-evaluated cases will not be evaluated again. If you want to re-evaluate the cases, set this parameter to `True`.
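For example, a batch run that writes to a custom result file and re-evaluates all cases could look like the command below. The exact syntax for passing the boolean `--fresh` value depends on how `taskweaver_eval.py` parses its arguments, so check `python taskweaver_eval.py --help` if it differs:

```bash
python taskweaver_eval.py -m batch -p ./cases -r my_results.csv -f True
```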
A sample task can be configured in a YAML file that contains the following fields:

- `config_var` (optional): sets the configuration values for TaskWeaver if needed.
- `app_dir`: the path to the project directory for TaskWeaver.
- `dependencies`: the list of Python package dependencies that are required to run the task. If the current environment is not compatible with the dependencies, an error is reported.
- `data_files`: the list of data files that are required to run the task.
- `task_description`: the description of the task.
- `scoring_points`:
  - `score_point`: describes a criterion for the agent's response.
  - `weight`: the value that determines how important that criterion is.
  - `eval_code` (optional): evaluation code that will be run to determine whether the criterion is met. In this case, the scoring point will not be evaluated using the LLM.
💡 For the `eval_code` field, you can use the variable `chat_history` in your evaluation code snippet. It is a list of LangChain Message objects that contain the chat history between the virtual user and the agent. The `eval_code` should use the `assert` statement to check the criterion.
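For illustration, a task file might look like the sketch below. All concrete values here (the `sample_data.csv` data file, the task description, and the scoring criteria) are invented for this example; the field names follow the list above:

```yaml
app_dir: ../project
config_var:
  llm.model: gpt-4
dependencies:
  - pandas
data_files:
  - sample_data.csv
task_description: Load sample_data.csv and report how many rows it contains.
scoring_points:
  - score_point: The agent's reply states the number of rows in the file
    weight: 1
  - score_point: The agent's final reply mentions the file name
    weight: 0.5
    eval_code: |-
      assert any("sample_data.csv" in message.content for message in chat_history)
```

Note that the first scoring point has no `eval_code`, so it would be judged by the LLM, while the second is checked deterministically by the `assert` statement.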
Our evaluation framework is designed to be generic and can be used to evaluate agents other than TaskWeaver. If you want to evaluate another agent, follow the steps below.

- Create a new Python file under `auto_eval`, create a new class, and inherit `VirtualUser` from `evaluator.py`, just like below:
```python
from typing import Optional

from evaluator import VirtualUser


class YourVirtualUser(VirtualUser):
    def __init__(self, task_description: str, app_dir: str, config_var: Optional[dict] = None):
        """
        Initialize the VirtualUser.
        """
        super().__init__(task_description)
        ...

    def get_reply_from_agent(self, message: str) -> str:
        # Custom code to get the reply from your agent.
        ...
```
- You can get the config values from the `config_var` parameter in the `__init__` method to set the config values for your agent if needed.
- Implement the `get_reply_from_agent` method to get the reply from your agent (see the sketch right after this list).
- Develop the evaluation logic, using the `Evaluator` class to evaluate the agent's response:
  - load the task case
  - check the package versions
  - create the VirtualUser and Evaluator
  - assign the task to the agent via the virtual user
  - evaluate the agent's response
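A minimal sketch of `get_reply_from_agent` might look like this, where `self.agent` and its `chat` method are hypothetical placeholders for your agent's actual interface:

```python
def get_reply_from_agent(self, message: str) -> str:
    # Forward the virtual user's message to your agent and return its reply.
    return self.agent.chat(message)
```

The evaluation logic from the last step can then be put together as follows: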
```python
from typing import Tuple

from evaluator import Evaluator, ScoringPoint, VirtualUser
from utils import check_package_version, load_task_case


def auto_evaluate_for_your_agent(
    eval_case_dir: str,
) -> Tuple[float, float]:
    # Load the task case
    eval_meta_data = load_task_case(eval_case_dir)
    app_dir = eval_meta_data["app_dir"]
    config_var = eval_meta_data.get("config_var", None)
    task_description = eval_meta_data["task_description"]
    dependencies = eval_meta_data.get("dependencies", [])
    data_files = eval_meta_data.get("data_files", [])

    # Check the package versions
    for dependency in dependencies:
        check_package_version(dependency)

    # Create the VirtualUser and Evaluator
    vuser = YourVirtualUser(task_description, app_dir, config_var)
    evaluator = Evaluator()

    # Assign the task to the agent via the virtual user
    chat_history = vuser.talk_with_agent()

    # Evaluate the agent's response against the scoring points
    score_points = eval_meta_data["scoring_points"]
    score_points = [ScoringPoint(**score_point) for score_point in score_points]
    score, normalized_score = evaluator.evaluate(task_description, chat_history, score_points)

    return score, normalized_score
```
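A hypothetical driver for the function above, reusing the `cases/echo` sample case from earlier, could then be as simple as:

```python
if __name__ == "__main__":
    # Run the evaluation for a single case directory and report both scores.
    score, normalized_score = auto_evaluate_for_your_agent("cases/echo")
    print(f"score: {score}, normalized score: {normalized_score}")
```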