<a href="https://colab.research.google.com/github/microsoft/autogen/blob/main/notebook/agenteval_cq_math.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Demonstrating the `AgentEval` framework using the task of solving the GAIA benchmark

This notebook demonstrates how `AgentEval`, implemented through [AutoGen](https://github.com/microsoft/autogen), works, using the task of solving GAIA benchmark questions as an example.
`AgentEval` consists of two key components:

- `CriticAgent`: This is an LLM-based agent that generates a list of criteria $(c_1, \dots, c_n)$ to help evaluate the utility of a given task.

- `QuantifierAgent`: This agent quantifies the performance of any sample task against the criteria designed by the `CriticAgent`, assigning each criterion a value: $(c_1=a_1, \dots, c_n=a_n)$.

![AgentEval](../website/blog/2023-11-11-AgentEval/img/agenteval-CQ.png)

For more detailed explanations, please refer to the accompanying [blog post](https://microsoft.github.io/autogen/blog/2023/11/11/AgentEval).
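
As a rough illustration of what the two components exchange, the sketch below shows the shape of a criteria dictionary and a matching quantifier assignment, and checks that every assigned value is one of the accepted values. The criterion names, values, and the `invalid_assignments` helper are made up for illustration; they are not produced by or part of `AgentEval`.

```python
# Hypothetical criteria, in the format the critic is asked to return:
# {criterion: {"description": ..., "accepted_values": [...]}}
criteria = {
    "answer_accuracy": {
        "description": "Whether the final answer matches the expected answer.",
        "accepted_values": ["incorrect", "partially correct", "correct"],
    },
    "reasoning_clarity": {
        "description": "How easy the reasoning trace is to follow.",
        "accepted_values": ["poor", "fair", "good", "excellent"],
    },
}

# Hypothetical quantifier output: one accepted value per criterion (c_i = a_i).
assessment = {"answer_accuracy": "correct", "reasoning_clarity": "fair"}


def invalid_assignments(criteria, assessment):
    """Return the criteria whose assigned value is missing or not accepted."""
    return [
        name
        for name, spec in criteria.items()
        if assessment.get(name) not in spec["accepted_values"]
    ]


print(invalid_assignments(criteria, assessment))  # → []
```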

## Requirements

AutoGen requires `Python>=3.8`. To run this notebook example, please install `pyautogen`, `scipy`, and `matplotlib`:


In [8]:
%pip install "pyautogen>=0.2.3"
%pip install scipy
%pip install matplotlib

Note: you may need to restart the kernel to use updated packages.


## Set your API Endpoint

* The [`config_list_openai_aoai`](https://microsoft.github.io/autogen/docs/reference/oai/openai_utils#config_list_openai_aoai) function tries to create a list of configurations using Azure OpenAI endpoints and OpenAI endpoints. It assumes the API keys and API bases are stored in the corresponding environment variables or local txt files:
  - OpenAI API key: `os.environ["OPENAI_API_KEY"]` or `openai_api_key_file="key_openai.txt"`.
  - Azure OpenAI API key: `os.environ["AZURE_OPENAI_API_KEY"]` or `aoai_api_key_file="key_aoai.txt"`. Multiple keys can be stored, one per line.
  - Azure OpenAI API base: `os.environ["AZURE_OPENAI_API_BASE"]` or `aoai_api_base_file="base_aoai.txt"`. Multiple bases can be stored, one per line.
* The [`config_list_from_json`](https://microsoft.github.io/autogen/docs/reference/oai/openai_utils#config_list_from_json) function loads a list of configurations from an environment variable or a JSON file. It first looks for an environment variable with the specified name, whose value must be a valid JSON string. If that variable is not found, it looks for a JSON file with the same name. It then filters the configurations by `filter_dict`.

You can set the value of `config_list` in any way you prefer. Please refer to this [notebook](https://github.com/microsoft/autogen/blob/main/notebook/oai_openai_utils.ipynb) for full code examples of the different methods.
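
To make the filtering step concrete, here is a minimal sketch of what a configuration list might look like and how filtering by `model` narrows it. It uses a plain-Python stand-in (`filter_configs`, a hypothetical helper) rather than AutoGen's own implementation, and the keys shown are placeholders.

```python
import json

# Placeholder JSON in the style of an OAI_CONFIG_LIST file; the keys are dummies.
raw = """
[
  {"model": "gpt-4", "api_key": "<your-key>"},
  {"model": "gpt-3.5-turbo", "api_key": "<your-key>"}
]
"""


def filter_configs(configs, filter_dict):
    """Keep configs whose value for every key in filter_dict is in the allowed list."""
    return [
        cfg
        for cfg in configs
        if all(cfg.get(key) in allowed for key, allowed in filter_dict.items())
    ]


configs = json.loads(raw)
selected = filter_configs(configs, {"model": ["gpt-4"]})
print([cfg["model"] for cfg in selected])  # → ['gpt-4']
```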


In [11]:
import autogen
import json


print(autogen.__version__)

# config_list = json.loads(secret_string)

config_list = autogen.config_list_from_json(
    env_or_file="../OAI_CONFIG_LIST",
    file_location=".",
    filter_dict={
        "model": ["gpt-4"],
    },
)

print(config_list[0]["base_url"])
# Avoid printing the API key itself; only confirm that one is configured.
print("api_key set:", bool(config_list[0].get("api_key")))



0.2.9
https://gcraoai8sw2.openai.azure.com/
api_key set: True


In [12]:
def read_gaia_logs(file_name, correctness):
    """
    Read the GAIA logs line by line and extract specific fields.

    Args:
    - file_name (str): Path to the single log file to be evaluated.
    - correctness (str): Label attached to every test case in this file (e.g. "correct" or "incorrect").

    Returns:
    - list: A list of (task_id, test_case, correctness) tuples.
    """
    results = []

    with open(file_name, "r") as f:
        for line in f:
            #print("Line: "+ line)
            try:
                data = json.loads(line)
                task_id = data.get('task_id','')
                question = data.get('Question', '')
                reasoning_trace = json.dumps(data.get('reasoning_trace', ''))
                test_case = "Question is "+ question + " The agent reasoning is " + reasoning_trace
                results.append((task_id, test_case[:128000], correctness))
            except json.JSONDecodeError as e:
                print(f"Error decoding JSON: {e}")

    return results


def read_gaia_logs_without_label(file_name):
    """
    Read the GAIA logs line by line and extract specific fields, without a correctness label.

    Args:
    - file_name (str): Path to the single log file to be evaluated.

    Returns:
    - list: A list of (task_id, test_case) tuples.
    """
    results = []

    with open(file_name, "r") as f:
        for line in f:
            #print("Line: "+ line)
            try:
                data = json.loads(line)
                task_id = data.get('task_id','')
                question = data.get('Question', '')
                reasoning_trace = json.dumps(data.get('reasoning_trace', ''))
                test_case = "Question is "+ question + " The agent reasoning is " + reasoning_trace
                results.append((task_id, test_case[:128000]))
            except json.JSONDecodeError as e:
                print(f"Error decoding JSON: {e}")

    return results
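
For reference, the functions above expect a JSON Lines file in which each line is an object with `task_id`, `Question`, and `reasoning_trace` fields. The sketch below writes one made-up record to a temporary file and builds the test-case string the same way. The record contents are invented for illustration, and unlike the functions above it skips malformed lines silently instead of printing an error.

```python
import json
import os
import tempfile

# Invented record using the field names read by the functions above.
record = {
    "task_id": "example-task",
    "Question": "What is 2 + 2?",
    "reasoning_trace": ["step 1: add the numbers", "step 2: answer 4"],
}

# Write one valid line and one malformed line to a temporary JSONL file.
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as tmp:
    tmp.write(json.dumps(record) + "\n")
    tmp.write("not valid json\n")
    path = tmp.name

cases = []
with open(path, "r") as f:
    for line in f:
        try:
            data = json.loads(line)
        except json.JSONDecodeError:
            continue  # malformed lines are skipped rather than aborting the read
        trace = json.dumps(data.get("reasoning_trace", ""))
        text = "Question is " + data.get("Question", "") + " The agent reasoning is " + trace
        cases.append((data.get("task_id", ""), text[:128000]))

os.remove(path)
print(cases[0][0], len(cases))  # → example-task 1
```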



## Construct `CriticAgent`

We construct the planning agent named `critic` and a user proxy agent for the critic named `critic_user`. We specify `human_input_mode` as "NEVER" in the user proxy agent, ensuring that it will never ask for human feedback. 

Here, the critic goes over all the GAIA samples and produces a set of criteria for each one.


In [14]:
path_to_gaia_logs = "../logs/gaia_validation_level_1__Orchestrator.oai.jsonl"

results = read_gaia_logs_without_label(path_to_gaia_logs)
begin = 32
end = 50

# Repeat criteria generation with a different cache seed on each pass.
for i in range(begin, end):

    critic = autogen.AssistantAgent(
        name = "critic",
        llm_config = {"config_list": config_list, "max_retries": 10, "cache_seed": i},
        system_message = """You are a helpful assistant. You suggest criteria for evaluating different tasks. They should be distinguishable, quantifiable and not redundant.
        Convert the evaluation criteria into a dictionary where the keys are the criteria.
        The value of each key is a dictionary as follows {"description": criteria description , "accepted_values": possible accepted inputs for this key}
        Make sure the keys are criteria for assessing the given task.  "accepted_values" include the acceptable inputs for each key that are fine-grained and preferably multi-graded levels. "description" includes the criterion description.
        Return the dictionary in the json format with no extra text."""
    )

    critic_user = autogen.UserProxyAgent(
        name = "critic_user",
        max_consecutive_auto_reply = 0,  # terminate without auto-reply
        human_input_mode = "NEVER",
        code_execution_config={"use_docker": False},
    )

    for result in results:
        task = {
            "name": "GAIA",
            "description": "The task is to answer the real-world, challenging question provided in the field Question. The proposed solution is generated by a multi-agent system. To achieve the goal, the agent needs to perform a number of steps. Answering the questions requires successful completion of some number of steps, which cannot easily be brute-forced due to their diversity. The possibility to check the reasoning trace, the accuracy required in the answers, and their absence in plain text from the internet prevent possible data contamination. ",
            "solution": result,
        }

        sys_msg = f"""Task: {task["name"]}.
        Task description: {task["description"]}
        Solution: {task["solution"]}"""

        gen_criteria = critic_user.initiate_chat(critic, message=sys_msg)
        criteria = critic_user.last_message()
        task_id = result[0]  # first element of the (task_id, test_case) tuple
        print("Task ID:", task_id)
        # Save the generated criteria for this seed and task.
        with open(f"../logs/solution-based/gaia-{i}-{task_id}.json", "w") as cr_file:
            cr_file.write(criteria["content"])


critic_user (to critic):

Task: GAIA.
        Task description: The task is to answer the real-world, challenging question provided in the field Question. The proposed solution is generated by a multi-agent system. To achieve the goal, the agent needs to perform a number of steps. Answering the questions requires successful completion of some number of steps, which cannot easily be brute-forced due to their diversity. The possibility to check the reasoning trace, the accuracy required in the answers, and their absence in plain text from the internet prevent possible data contamination. 
        Solution: ('9318445f-fe6a-4e1b-acbf-c68228c9906a', 'Question is  The agent reasoning is ["Neither CUDA nor MPS are available - defaulting to CPU. Note: This module is much faster with a GPU.\\nrequested package: openai None\\nfound package openai 1.12.0\\nEnvironment variable OPENAI_API_KEY is set\\nrequested package: easyocr None\\nfound package easyocr 1.7.1\\ncomputer_terminal (to orchestrator)

KeyboardInterrupt: 

# Run the Critic

To run the solution-based critic, you need to go over all the responses:


# The Criteria
Now, we print the designed criteria. 

In [None]:

# NOTE: `current_task_name` is not defined earlier in this notebook; set it before running this cell.
gen_criteria = critic_user.initiate_chat(critic, message=sys_msg[:128000])
criteria = critic_user.last_message()
with open(f"../test/test_files/agenteval-in-out/{current_task_name}_criteria.json", "w") as cr_file:
    cr_file.write(criteria["content"])

*Note:* You can also define and use your own criteria by editing `criteria.txt`.

# The `QuantifierAgent`

Once we have the criteria, we need to quantify a new sample against them and their accepted values. This is done by the `QuantifierAgent`, as follows.
You can skip the criteria designed by the agent and use your own criteria in `criteria_file`. Before going to the next step, check that the file contains only a dictionary; the critic sometimes produces extra text.
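
Because the critic's reply may wrap the dictionary in extra text, a small check before moving on can help. The following is one possible sketch: `extract_json_dict` is a hypothetical helper (not part of AutoGen) that takes the span from the first `{` to the last `}` in a string and parses it as JSON. The sample reply is invented.

```python
import json


def extract_json_dict(text):
    """Parse the span from the first '{' to the last '}' in text as JSON.

    Raises ValueError if no such span exists or it is not valid JSON
    (json.JSONDecodeError is a subclass of ValueError).
    """
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found")
    return json.loads(text[start : end + 1])


# Invented reply with text before and after the dictionary.
reply = (
    'Here are the criteria: {"accuracy": {"description": "Answer correctness", '
    '"accepted_values": ["low", "high"]}} Hope this helps!'
)
print(list(extract_json_dict(reply)))  # → ['accuracy']
```

If the helper raises, the criteria file likely needs manual cleanup before it is passed to the quantifier.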

In [10]:
import json

criteria_file = f"../test/test_files/agenteval-in-out/{current_task_name}_criteria.json"

quantifier = autogen.AssistantAgent(
    name = "quantifier",
    llm_config = {"config_list": config_list, "max_retries": 10},
    system_message = """You are a helpful assistant. You quantify the output of different tasks based on the given criteria.
    The criteria are given in a dictionary format where each key is a distinct criterion.
    The value of each key is a dictionary as follows {"description": criteria description , "accepted_values": possible accepted inputs for this key}
    You are going to quantify each of the criteria for a given task based on the task description.
    Return a dictionary where the keys are the criteria and the values are the assessed performance based on accepted values for each criterion.
    Return only the dictionary."""
)

quantifier_user = autogen.UserProxyAgent(
    name = "quantifier_user",
    max_consecutive_auto_reply = 0,  # terminate without auto-reply
    human_input_mode = "NEVER",
    code_execution_config={"use_docker": False},
)

with open(criteria_file, "r") as f:
    dictionary_for_eval = f.read()


## Running the quantifier on GAIA logs

In [None]:
import os
import json




def get_quantifier(success_file, failed_file, criteria_file):
    """
    Run the quantifier agent on each logged test case and save the results.

    Args:
    - success_file (str): The log path for successful cases.
    - failed_file (str): The log path for failed cases.
    - criteria_file (str): The criteria JSON file path.

    Returns:
    - None. Each result, {"task_id": task_id, "actual_success": actual_label, "estimated_performance": quantified_results["content"]},
      is appended to a JSONL file, and all results keyed by task_id are written to a JSON file.
    """
    with open(criteria_file, "r") as f:
        dictionary_for_eval = f.read()
    results = []

    # Append results for successful and failed cases
    results.extend(read_gaia_logs(success_file, "correct"))
    results.extend(read_gaia_logs(failed_file, "incorrect"))
  
    
    output_file_path = "../test/test_files/agenteval-in-out/gaia_evaluated_problems.json"
    output_file_path_line_by_line = "../test/test_files/agenteval-in-out/per_line_gaia_evaluated_problems.jsonl"
 
    all_data = {}

    # Iterate through the loop where you have access to task_id, test_case, and actual_label
    for task_id, test_case, actual_label in results:
        quantifier_user.initiate_chat(quantifier, message=sys_msg + \
                                                "Evaluation dictionary: " + str(dictionary_for_eval) + \
                                                "Actual test case to evaluate: " + test_case)
        quantified_results = quantifier_user.last_message()

        # Construct the nested dictionary for each file
        nested_dict = {
            "task_id": task_id,
            "actual_success": actual_label,
            "estimated_performance": quantified_results["content"]
        }

        # Add the nested dictionary to the all_data dictionary
        all_data[task_id] = nested_dict

        # Write each line of the nested dictionary to the JSONL file
        with open(output_file_path_line_by_line, "a") as file:
            json.dump(nested_dict, file)
            file.write('\n')  # Add a newline after each JSON object


    # Write the entire dictionary to the JSON file
    with open(output_file_path, "w") as file:
        json.dump(all_data, file, indent=2)


get_quantifier("../test/test_files/agenteval-in-out/sample-gaia_1output_equal.jsonl", "../test/test_files/agenteval-in-out/sample-gaia_1output_not_equal.jsonl", criteria_file)  
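
Once the JSON Lines output exists, the estimates can be compared against the actual labels. The sketch below assumes each `estimated_performance` string parses as a JSON dictionary (which may not always hold) and tallies, per criterion, the values assigned to correct versus incorrect cases. The records are invented for illustration; in practice the lines would be read from the per-line output file.

```python
import json
from collections import Counter, defaultdict

# Invented records in the shape written to the per-line output file.
lines = [
    json.dumps({"task_id": "a", "actual_success": "correct",
                "estimated_performance": json.dumps({"accuracy": "high"})}),
    json.dumps({"task_id": "b", "actual_success": "incorrect",
                "estimated_performance": json.dumps({"accuracy": "low"})}),
    json.dumps({"task_id": "c", "actual_success": "incorrect",
                "estimated_performance": "Sorry, I cannot assess this."}),
]

# tallies[criterion][actual_label] counts the values the quantifier assigned.
tallies = defaultdict(lambda: defaultdict(Counter))
skipped = []

for line in lines:
    record = json.loads(line)
    try:
        estimate = json.loads(record["estimated_performance"])
    except json.JSONDecodeError:
        skipped.append(record["task_id"])  # reply was not a bare dictionary
        continue
    for criterion, value in estimate.items():
        tallies[criterion][record["actual_success"]][value] += 1

print(dict(tallies["accuracy"]["correct"]))  # → {'high': 1}
print(skipped)  # → ['c']
```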

