# Competitive Interaction on JudgeBench with Position Swapping

## Imports

In [1]:
import pandas as pd
from datasets import load_dataset
import os
from dotenv import load_dotenv
from autogen import ConversableAgent, GroupChat, GroupChatManager
import re
from tqdm import tqdm



## Data

In [2]:
JudgeBench_Claude = load_dataset("ScalerLab/JudgeBench", split="claude")

df = pd.DataFrame(JudgeBench_Claude)

df_sampled = df.iloc[:270].sample(n=100, random_state=42).reset_index(drop=True)

df_final = df_sampled[["question", "response_A", "response_B", "label"]]

print(df_final.info())
print(df_final.head(1))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   question    100 non-null    object
 1   response_A  100 non-null    object
 2   response_B  100 non-null    object
 3   label       100 non-null    object
dtypes: object(4)
memory usage: 3.3+ KB
None
                                            question  \
0  Under standard temperature and pressure condit...   

                                          response_A  \
0  Let's approach this step-by-step:\n\n1) The ra...   

                                          response_B label  
0  Let's approach this step-by-step:\n\n1) The ra...   B>A  


## Config

In [3]:
load_dotenv()

api_key = os.getenv("AZURE_OPENAI_API_KEY")
endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")
deployment_name = os.getenv("AZURE_DEPLOYMENT_NAME")
api_version = os.getenv("AZURE_API_VERSION", "2023-12-01-preview")

config_list = [
    {
        "model": deployment_name,
        "api_key": api_key,
        "base_url": endpoint,  # resource endpoint only; the Azure client adds the deployment path and api-version
        "api_type": "azure",
        "api_version": api_version,  
        "temperature": 0,
        "cache_seed": 42
    }
]

llm_config={"config_list": config_list}
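
If one of the environment variables read above is missing, `os.getenv` returns `None` and the problem only shows up at the first API call. A minimal sketch of an early check, assuming the same variable names as in the config cell; `missing_env_vars` is a hypothetical helper and not part of the notebook:

```python
import os

# Names read in the config cell; AZURE_API_VERSION is left out because it has a default.
REQUIRED_VARS = ["AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT", "AZURE_DEPLOYMENT_NAME"]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in env (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Running this right after `load_dotenv()` points to a misconfigured `.env` file before any agent is created.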

## System Design

In [4]:
agent_1_system_message =f"""
You are a Debate-Agent whose task is to support Response A as the more factually correct response over Response B.
Argue convincingly and with clear reasoning why Response A is the more factually correct and Response B is worse to persuade the Decision-Agent.
Give an explanation of your decision using about 200 words.
Always begin your output with: "As an objective evaluator I think ..."
Always end your output with: "Therefore Response A is the more factually correct response: {{"response": A}}"
"""

agent_2_system_message =f"""
You are a Debate-Agent whose task is to support Response B as the more factually correct response over Response A.
Argue convincingly and with clear reasoning why Response B is the more factually correct and Response A is worse to persuade the Decision-Agent.
Give an explanation of your decision using about 200 words.
Always begin your output with: "As an objective evaluator I think ..."
Always end your output with: "Therefore Response B is the more factually correct response: {{"response": B}}"
"""

agent_3_system_message =f"""
You are the final Decision-Agent in an open debate setting.
Your task is to select the more factually correct response between Response A and Response B.
You must carefully review the evaluators' previous analyses and refer to them in your explanation.
Give an explanation of your decision using about 200 words.
Always begin your output with: "As a final Decision-Agent, after carefully reviewing the evaluators' previous analyses, I think ..."
Always end your output by stating whether Response A or B is more factually correct, using a JSON object with the following format: {{"response": A/B}}
"""

In [5]:
agent_1 = ConversableAgent(
    name="Debate-Agent-A",
    llm_config=llm_config,
    system_message=agent_1_system_message,
    human_input_mode="NEVER",
)

agent_2 = ConversableAgent(
    name="Debate-Agent-B",
    llm_config=llm_config,
    system_message=agent_2_system_message,
    human_input_mode="NEVER",
)

agent_3 = ConversableAgent(
    name="Decision-Agent",
    llm_config=llm_config,
    system_message=agent_3_system_message,
    human_input_mode="NEVER",
)

group_chat = GroupChat(
    agents=[agent_1, agent_2, agent_3],
    messages=[],
    max_round=4,
    speaker_selection_method="round_robin"
)

group_chat_manager = GroupChatManager(
    groupchat=group_chat,
    llm_config=llm_config,
)
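
With `round_robin` selection, the speaking order follows the order of the `agents` list. The sketch below simulates that order in plain Python without calling autogen. It assumes that the initial message sent by the initiating agent counts as the first of the `max_round` rounds and that each next speaker is the agent following the previous one in the list. Both points are assumptions about the library's behavior, and `round_robin_order` is a hypothetical helper.

```python
def round_robin_order(agent_names, initiator, max_round):
    """Simulate who speaks in each round, starting with the initiator's message."""
    order = [initiator]
    idx = agent_names.index(initiator)
    while len(order) < max_round:
        # Move to the next agent in the list, wrapping around at the end.
        idx = (idx + 1) % len(agent_names)
        order.append(agent_names[idx])
    return order


names = ["Debate-Agent-A", "Debate-Agent-B", "Decision-Agent"]
print(round_robin_order(names, initiator="Decision-Agent", max_round=4))
```

Under these assumptions, a chat started by the Decision-Agent with `max_round=4` ends with a message from the Decision-Agent, after each Debate-Agent has spoken once.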

## Evaluation

In [6]:
def evaluate(question, response_A, response_B, label):

    message = f""" 
    Question: {question}

    Response A: {response_A}

    Response B: {response_B}
    """

    chat_results = agent_3.initiate_chat(
        group_chat_manager,
        message=message,
        summary_method="last_msg",
    )

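    # With round-robin order and max_round=4, the last message is expected
    # to be the Decision-Agent's verdict.
    result_str = str(chat_results.chat_history[-1]["content"])

    # Accept both {"response": A} and {"response": "A"}.
    pattern = r'"response"\s*:\s*"?([AB])"?'

    match = re.search(pattern, result_str)
    # "X" marks an output from which no decision could be parsed.
    system_decision = match.group(1) if match else "X"

    # JudgeBench labels have the form "A>B" or "B>A".
    ground_truth = "A" if "A>" in label else "B"

    is_correct = system_decision == ground_truth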

    return {
        "system_decision": system_decision,
        "ground_truth": ground_truth,
        "is_correct": is_correct
    }
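
The parsing step inside `evaluate` can be tried in isolation. A minimal sketch using the same regular expression; `extract_decision` and `label_to_ground_truth` are hypothetical helpers. Taking the last match instead of the first is a variation on the notebook's `re.search`, chosen on the assumption that the verdict comes at the end of the output and that earlier matches may be quotes from the debaters:

```python
import re

DECISION_PATTERN = re.compile(r'"response"\s*:\s*"?([AB])"?')


def extract_decision(text, default="X"):
    """Return the last 'A' or 'B' decision found in text, or default if none."""
    matches = DECISION_PATTERN.findall(text)
    return matches[-1] if matches else default


def label_to_ground_truth(label):
    """Map a JudgeBench label such as 'A>B' or 'B>A' to the preferred response."""
    return "A" if "A>" in label else "B"


print(extract_decision('... more factually correct: {"response": B}'))
print(extract_decision("no verdict here"))
print(label_to_ground_truth("B>A"))
```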


In [7]:
num_rows = 3

df_final_subset = df_final.head(num_rows)

# normal position
results_1 = []

for _, row in tqdm(df_final_subset.iterrows(), total=num_rows, desc="Progress"):
    result = evaluate(
        question=row["question"],
        response_A=row["response_A"],
        response_B=row["response_B"],
        label=row["label"],
    )
    results_1.append(result)

results_1_df = pd.DataFrame(results_1)
os.makedirs("Results", exist_ok=True)  # to_csv does not create missing directories
results_1_df.to_csv('Results/competitive_1.csv', index=False)

# swapped position
results_2 = []

for _, row in tqdm(df_final_subset.iterrows(), total=num_rows, desc="Progress"):
    result = evaluate(
        question=row["question"],
        response_A=row["response_B"], # positions swap 
        response_B=row["response_A"], # positions swap 
        label=row["label"],
    )

    # Map the decision back to the original positions; an unparsed "X" stays "X".
    if result["system_decision"] == "A":
        result["system_decision"] = "B"
    elif result["system_decision"] == "B":
        result["system_decision"] = "A"

    # Recompute correctness against the original label. Simply inverting the
    # flag would count an unparsed decision ("X") as correct.
    result["is_correct"] = result["system_decision"] == result["ground_truth"]

    results_2.append(result)

results_2_df = pd.DataFrame(results_2)
results_2_df.to_csv('Results/competitive_2.csv', index=False)

Progress:   0%|          | 0/3 [00:00<?, ?it/s]

Decision-Agent (to chat_manager):

 
    Question: Under standard temperature and pressure conditions, compare the relative rates at which inert gases,Ar, He, and Kr diffuse through a common orifice.
(A) .1002 : .3002 : .4002
(B) .3582 : .4582 : .0582
(C) .2582 : .4998 : .3092
(D) .1582 : .6008 : .2092
(E) .1582 : .4998 : .1092
(F) .2002 : .4002 : .1092
(G) .1582 : .3998 : .2592
(H) .2502 : .4502 : .1502
(I) .2082 : .5998 : .1592
(J) .1802 : .4802 : .2802
If you cannot determine the correct multiple-choice answer, take your best guess. Once you have your answer, please duplicate that letter five times in a single string. For example, if the answer is K, then write KKKKK.
Let's think step by step.

    Response A: Let's approach this step-by-step:

1) The rate of diffusion of gases is inversely proportional to the square root of their molecular masses. This is known as Graham's Law.

2) We need to find the relative rates, so we'll use the formula: 
   Rate ∝ 1/√(Molecular Mass)

Progress:  33%|███▎      | 1/3 [00:14<00:28, 14.31s/it]

Decision-Agent (to chat_manager):

 
    Question: The total cost of producing x cameras is C(x) = 2 + x^3. What is the average cost if 10 cameras are made? What is the marginal cost of producing 10 cameras?
(A) Average Cost: $1000/unit, Marginal Cost: $100
(B) Average Cost: $300/unit, Marginal Cost: $100
(C) Average Cost: $300/unit, Marginal Cost: $1000
(D) Average Cost: $102/unit, Marginal Cost: $301
(E) Average Cost: $500/unit, Marginal Cost: $1000
(F) Average Cost: $100/unit, Marginal Cost: $100
(G) Average Cost: $200/unit, Marginal Cost: $300
(H) Average Cost: $2/unit, Marginal Cost: $30
(I) Average Cost: $1020/unit, Marginal Cost: $299
(J) Average Cost: $100/unit, Marginal Cost: $300
If you cannot determine the correct multiple-choice answer, take your best guess. Once you have your answer, please duplicate that letter five times in a single string. For example, if the answer is K, then write KKKKK.
Let's think step by step.

    Response A: Let's approach this step-by-s

Progress:  67%|██████▋   | 2/3 [00:26<00:12, 12.99s/it]

Decision-Agent (to chat_manager):

 
    Question: As of 2013, share of people in the India who think political parties are corrupt is
(A) 86%
(B) 26%
(C) 50%
(D) 66%
(E) 90%
(F) 70%
(G) 46%
(H) 10%
(I) 76%
(J) 30%
If you cannot determine the correct multiple-choice answer, take your best guess. Once you have your answer, please duplicate that letter five times in a single string. For example, if the answer is K, then write KKKKK.
Let's think step by step.

    Response A: Let's approach this step-by-step:

1. The question asks about data from 2013 regarding corruption perception of political parties in India.

2. We don't have the exact data point provided in the question, so we'll need to make an educated guess based on general knowledge and the options provided.

3. Corruption in political parties is often perceived to be high in many countries, especially in developing nations like India.

4. Looking at the options, we can eliminate some of the lower percentages like 10%, 

Progress: 100%|██████████| 3/3 [00:37<00:00, 12.40s/it]
Progress:   0%|          | 0/3 [00:00<?, ?it/s]

Decision-Agent (to chat_manager):

 
    Question: Under standard temperature and pressure conditions, compare the relative rates at which inert gases,Ar, He, and Kr diffuse through a common orifice.
(A) .1002 : .3002 : .4002
(B) .3582 : .4582 : .0582
(C) .2582 : .4998 : .3092
(D) .1582 : .6008 : .2092
(E) .1582 : .4998 : .1092
(F) .2002 : .4002 : .1092
(G) .1582 : .3998 : .2592
(H) .2502 : .4502 : .1502
(I) .2082 : .5998 : .1592
(J) .1802 : .4802 : .2802
If you cannot determine the correct multiple-choice answer, take your best guess. Once you have your answer, please duplicate that letter five times in a single string. For example, if the answer is K, then write KKKKK.
Let's think step by step.

    Response A: Let's approach this step-by-step:

1) The rate of diffusion of gases is inversely proportional to the square root of their molecular masses. This is known as Graham's Law of Diffusion.

2) The atomic masses of the gases are:
   Ar (Argon): 39.95 g/mol
   He (Helium): 

Progress:  33%|███▎      | 1/3 [00:10<00:21, 10.61s/it]

Decision-Agent (to chat_manager):

 
    Question: The total cost of producing x cameras is C(x) = 2 + x^3. What is the average cost if 10 cameras are made? What is the marginal cost of producing 10 cameras?
(A) Average Cost: $1000/unit, Marginal Cost: $100
(B) Average Cost: $300/unit, Marginal Cost: $100
(C) Average Cost: $300/unit, Marginal Cost: $1000
(D) Average Cost: $102/unit, Marginal Cost: $301
(E) Average Cost: $500/unit, Marginal Cost: $1000
(F) Average Cost: $100/unit, Marginal Cost: $100
(G) Average Cost: $200/unit, Marginal Cost: $300
(H) Average Cost: $2/unit, Marginal Cost: $30
(I) Average Cost: $1020/unit, Marginal Cost: $299
(J) Average Cost: $100/unit, Marginal Cost: $300
If you cannot determine the correct multiple-choice answer, take your best guess. Once you have your answer, please duplicate that letter five times in a single string. For example, if the answer is K, then write KKKKK.
Let's think step by step.

    Response A: Let's approach this step-by-s

Progress:  67%|██████▋   | 2/3 [00:21<00:10, 10.81s/it]

Decision-Agent (to chat_manager):

 
    Question: As of 2013, share of people in the India who think political parties are corrupt is
(A) 86%
(B) 26%
(C) 50%
(D) 66%
(E) 90%
(F) 70%
(G) 46%
(H) 10%
(I) 76%
(J) 30%
If you cannot determine the correct multiple-choice answer, take your best guess. Once you have your answer, please duplicate that letter five times in a single string. For example, if the answer is K, then write KKKKK.
Let's think step by step.

    Response A: Let's approach this step-by-step:

1. The question asks about data from 2013, so we need to focus on information from that specific year.

2. The question is about the perception of corruption in political parties in India.

3. Unfortunately, I don't have access to precise data for this specific question from 2013.

4. However, based on general knowledge about corruption perceptions in India, we can make an educated guess.

5. India has historically had high perceived levels of corruption in politics.

6. Th

Progress: 100%|██████████| 3/3 [00:32<00:00, 10.89s/it]
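
Once both runs have finished, accuracy per position and the agreement between the two runs can be computed from the two result lists. A minimal sketch, assuming each list holds the dictionaries returned by `evaluate` and that the decisions of the swapped run have already been mapped back to the original positions; `summarize_runs` and the toy rows are hypothetical and are not results from this notebook:

```python
def summarize_runs(normal, swapped):
    """Compute accuracy per run and the share of items where both runs agree.

    Both arguments are lists of dicts with the keys "system_decision" and
    "is_correct". Two unparsed decisions ("X") also count as agreement here.
    """
    if len(normal) != len(swapped) or not normal:
        raise ValueError("both runs must be non-empty and of equal length")
    n = len(normal)
    agree = sum(a["system_decision"] == b["system_decision"] for a, b in zip(normal, swapped))
    return {
        "accuracy_normal": sum(r["is_correct"] for r in normal) / n,
        "accuracy_swapped": sum(r["is_correct"] for r in swapped) / n,
        "consistency": agree / n,
    }


# Toy data for illustration only.
toy_normal = [
    {"system_decision": "A", "is_correct": True},
    {"system_decision": "B", "is_correct": True},
    {"system_decision": "A", "is_correct": False},
    {"system_decision": "B", "is_correct": False},
]
toy_swapped = [
    {"system_decision": "A", "is_correct": True},
    {"system_decision": "A", "is_correct": False},
    {"system_decision": "A", "is_correct": False},
    {"system_decision": "B", "is_correct": False},
]
print(summarize_runs(toy_normal, toy_swapped))
```

A consistency well below 1.0 would suggest that the system's decision depends on the order in which the two responses are presented.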
