# Cooperative Interaction on SummEval

## Imports

In [3]:
import pandas as pd
from datasets import load_dataset
import os
from dotenv import load_dotenv
from autogen import ConversableAgent, GroupChat, GroupChatManager
import re
from tqdm import tqdm
import numpy as np

## Data

In [4]:
SummEval = load_dataset("mteb/summeval")

df = pd.DataFrame(SummEval["test"])[["text", "machine_summaries", "relevance", "coherence", "fluency", "consistency"]]

problematic_indices = [5, 7, 8, 9, 10, 11, 18, 20, 26, 27, 33, 34, 39, 46, 61, 64, 68, 73, 75, 79, 85, 86, 88, 92, 96, 99]
df_filtered = df.drop(index=problematic_indices).reset_index(drop=True)

df_exploded = df_filtered.explode(["machine_summaries", "relevance", "coherence", "fluency", "consistency"]).reset_index(drop=True)

df_sampled = df_exploded.iloc[:1200].sample(n=100, random_state=42).reset_index(drop=True)

columns_to_round = ["relevance", "coherence", "fluency", "consistency"]

# Round ratings to the nearest integer; a fractional part of exactly 0.5 rounds down.
df_sampled[columns_to_round] = df_sampled[columns_to_round].map(lambda x: int(np.ceil(x) if x % 1 > 0.5 else np.floor(x)))

df_final = df_sampled

print(df_final.info())
print(df_final.head(1))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   text               100 non-null    object
 1   machine_summaries  100 non-null    object
 2   relevance          100 non-null    int64 
 3   coherence          100 non-null    int64 
 4   fluency            100 non-null    int64 
 5   consistency        100 non-null    int64 
dtypes: int64(4), object(2)
memory usage: 4.8+ KB
None
                                                text  \
0  Boss Nigel Pearson has urged Leicester to keep...   

                                   machine_summaries  relevance  coherence  \
0  jamie vardy scored an injury-time winner again...          2          2   

   fluency  consistency  
0        3            2  
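The rounding rule in the data cell is easy to misread, so here is a minimal sketch of how it behaves on a few fractional ratings. It mirrors the lambda above but uses `math` instead of `numpy` so it runs on its own; the helper name `round_rating` and the sample values are illustrative assumptions, not part of the notebook.

```python
import math

def round_rating(x):
    """Round up only when the fractional part is strictly greater than 0.5."""
    return int(math.ceil(x) if x % 1 > 0.5 else math.floor(x))

# Illustrative inputs only; these are not values taken from SummEval.
examples = [1.0, 2.33, 2.5, 2.67, 4.99]
rounded = [round_rating(x) for x in examples]
print(rounded)  # [1, 2, 2, 3, 5]
```

Note that 2.5 maps to 2, not 3: the comparison is `> 0.5`, so exact halves round down.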


## Config

In [5]:
load_dotenv()

api_key = os.getenv("AZURE_OPENAI_API_KEY")
endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")
deployment_name = os.getenv("AZURE_DEPLOYMENT_NAME")
api_version = os.getenv("AZURE_API_VERSION", "2023-12-01-preview")

config_list = [
    {
        "model": deployment_name,
        "api_key": api_key,
        "base_url": f"{endpoint}/openai/deployments/{deployment_name}/chat/completions?api-version={api_version}",
        "api_type": "azure",
        "api_version": api_version,  
        "temperature": 0,
        "cache_seed": 42
    }
]

llm_config={"config_list": config_list}
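Because `os.getenv` silently returns `None` for unset variables, a missing entry in `.env` only surfaces later as a malformed URL or an authentication error. A small check along the following lines could catch that earlier. This is a sketch: the helper `missing_env_vars` is hypothetical, and the variable names are simply the ones read in the cell above.

```python
import os

REQUIRED_VARS = [
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_DEPLOYMENT_NAME",
]

def missing_env_vars(names, environ=os.environ):
    """Return the names that are unset or empty in the given environment mapping."""
    return [name for name in names if not environ.get(name)]

missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```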

## System Design

In [6]:
agent_1_system_message =f"""
You are an Analyst-Agent in a cooperative reasoning chain.
In this task you will evaluate the quality of a summary written for a news article.
To correctly solve this task, follow these steps:

    1. Carefully read the news article, be aware of the information it contains.
    2. Read the proposed summary.
    3. Rate the summary with integer values on a scale from 1 (worst) to 5 (best) by its relevance, consistency, fluency, and coherence.

Definitions:
    Relevance:
        - The rating measures how well the summary captures the key points of the article.
        - Consider whether all and only the important aspects are contained in the summary.
    Consistency:
        - The rating measures whether the facts in the summary are consistent with the facts in the original article.
        - Consider whether the summary does reproduce all facts accurately and does not make up untrue information.
    Fluency:
        - The rating measures the quality of individual sentences: are they well-written and grammatically correct?
        - Consider the quality of individual sentences.
    Coherence:
        - The rating measures the quality of all sentences collectively: do they fit together and sound natural?
        - Consider the quality of the summary as a whole.

Use a structured format and about 200 words to present your arguments.
Always begin your output with: "As an Analyst-Agent in an initial assessment, I think ..."
Always end your output with a JSON object with the following format:{{"relevance": score, "coherence": score, "fluency": score, "consistency": score}} 
"""


agent_2_system_message =f"""
You are a Critical-Enhancer-Agent in a cooperative reasoning chain.
Your task is to carefully review the Analyst-Agent's initial assessment and improve upon it.
You may add missing facts, point out inaccuracies or refine the reasoning in a constructive way.

In this task you will evaluate the quality of a summary written for a news article.
To correctly solve this task, follow these steps:

    1. Carefully read the news article, be aware of the information it contains.
    2. Read the proposed summary.
    3. Rate the summary with integer values on a scale from 1 (worst) to 5 (best) by its relevance, consistency, fluency, and coherence.

Definitions:
    Relevance:
        - The rating measures how well the summary captures the key points of the article.
        - Consider whether all and only the important aspects are contained in the summary.
    Consistency:
        - The rating measures whether the facts in the summary are consistent with the facts in the original article.
        - Consider whether the summary does reproduce all facts accurately and does not make up untrue information.
    Fluency:
        - The rating measures the quality of individual sentences: are they well-written and grammatically correct?
        - Consider the quality of individual sentences.
    Coherence:
        - The rating measures the quality of all sentences collectively: do they fit together and sound natural?
        - Consider the quality of the summary as a whole.

Use a structured format and about 200 words to present your arguments.
Always begin your output with: "Building on the Analyst-Agent's initial assessment, I ..."
Always end your output with a JSON object with the following format:{{"relevance": score, "coherence": score, "fluency": score, "consistency": score}} 
"""

agent_3_system_message ="""
You are the Decision-Agent in a cooperative reasoning chain.
Your task is to integrate the insights from the Analyst-Agent and the Critical-Enhancer-Agent into a final judgement.

In this task you will evaluate the quality of a summary written for a news article.
To correctly solve this task, follow these steps:

    1. Carefully read the news article, be aware of the information it contains.
    2. Read the proposed summary.
    3. Rate the summary with integer values on a scale from 1 (worst) to 5 (best) by its relevance, consistency, fluency, and coherence.

Definitions:
    Relevance:
        - The rating measures how well the summary captures the key points of the article.
        - Consider whether all and only the important aspects are contained in the summary.
    Consistency:
        - The rating measures whether the facts in the summary are consistent with the facts in the original article.
        - Consider whether the summary does reproduce all facts accurately and does not make up untrue information.
    Fluency:
        - The rating measures the quality of individual sentences: are they well-written and grammatically correct?
        - Consider the quality of individual sentences.
    Coherence:
        - The rating measures the quality of all sentences collectively: do they fit together and sound natural?
        - Consider the quality of the summary as a whole.

Give an explanation for your final conclusion using about 200 words.
Always begin your output with: "As a final Decision-Agent, after carefully reviewing the Analyst-Agent's and Critical-Enhancer-Agent's previous analyses, I think ..."
Always end your output with a JSON object with the following format: {"relevance": score, "coherence": score, "fluency": score, "consistency": score}
"""

In [7]:
agent_1 = ConversableAgent(
    name="Analyst-Agent",
    llm_config=llm_config,
    system_message=agent_1_system_message,
    human_input_mode="NEVER",
)

agent_2 = ConversableAgent(
    name="Critical-Enhancer-Agent",
    llm_config=llm_config,
    system_message=agent_2_system_message,
    human_input_mode="NEVER",
)

agent_3 = ConversableAgent(
    name="Decision-Agent",
    llm_config=llm_config,
    system_message=agent_3_system_message,
    human_input_mode="NEVER",
)

group_chat = GroupChat(
    agents=[agent_1, agent_2, agent_3],
    messages=[],
    max_round=4,
    speaker_selection_method="round_robin"
)

group_chat_manager = GroupChatManager(
    groupchat=group_chat,
    llm_config=llm_config,
)
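With `speaker_selection_method="round_robin"` and `max_round=4`, the speaking order depends on which agent opens the chat. The following pure-Python sketch simulates that order; it assumes the opening message counts toward `max_round`, which is consistent with the output shown later, where the last message comes from the Decision-Agent. The function `round_robin_order` is illustrative and does not call AutoGen.

```python
def round_robin_order(agent_names, initiator, max_round):
    """Return the speaker of each message, starting with the initiator."""
    speakers = [initiator]
    idx = agent_names.index(initiator)
    while len(speakers) < max_round:
        idx = (idx + 1) % len(agent_names)  # next agent in list order, wrapping around
        speakers.append(agent_names[idx])
    return speakers

order = round_robin_order(
    ["Analyst-Agent", "Critical-Enhancer-Agent", "Decision-Agent"],
    initiator="Decision-Agent",
    max_round=4,
)
print(order)
# ['Decision-Agent', 'Analyst-Agent', 'Critical-Enhancer-Agent', 'Decision-Agent']
```

Under this assumption, letting the Decision-Agent send the article and summary means the three replies arrive in the intended chain: analysis, critique, final decision.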

## Evaluation

In [8]:
def evaluate(text, machine_summaries, relevance, coherence, fluency, consistency):
    message = f""" 
    Article: {text}

    Summary: {machine_summaries}
    """

    chat_results = agent_3.initiate_chat(
        group_chat_manager,
        message=message,
        summary_method="last_msg",
    )

    result_str = str(chat_results.chat_history[-1]["content"])
    print(result_str)

    # Human ratings for each dimension, keyed by the name used in the JSON output.
    ground_truth = {
        "relevance": relevance,
        "coherence": coherence,
        "fluency": fluency,
        "consistency": consistency,
    }

    # Extract each integer score from the final message and compare it with the human rating.
    scores = {}
    for dimension, truth in ground_truth.items():
        match = re.search(rf'"{dimension}"\s*:\s*(\d+)', result_str)
        if match is None:
            raise ValueError(f"No '{dimension}' score found in the final message")
        system_decision = int(match.group(1))
        scores[dimension] = {
            "ground_truth": truth,
            "system_decision": system_decision,
            "deviation": abs(truth - system_decision),
        }

    return scores
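Since every agent is told to end its reply with a JSON object, an alternative to per-key regular expressions is to parse that trailing object directly. The sketch below is one way to do so under the assumption that the object contains no nested braces; `parse_scores` is a hypothetical helper, not part of the notebook.

```python
import json
import re

EXPECTED_KEYS = {"relevance", "coherence", "fluency", "consistency"}

def parse_scores(reply):
    """Parse the last flat {...} object in the reply into a dict of integer scores."""
    candidates = re.findall(r"\{[^{}]*\}", reply)
    if not candidates:
        raise ValueError("No JSON object found in the reply")
    scores = json.loads(candidates[-1])  # raises json.JSONDecodeError if malformed
    missing = EXPECTED_KEYS - scores.keys()
    if missing:
        raise ValueError(f"Missing scores: {sorted(missing)}")
    return {key: int(scores[key]) for key in sorted(EXPECTED_KEYS)}

# Illustrative reply text, not actual model output.
reply = 'I think the summary is weak. {"relevance": 2, "coherence": 2, "fluency": 3, "consistency": 4}'
print(parse_scores(reply))
```

Taking the last object rather than the first guards against braces that appear earlier in the explanation.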

In [9]:
num_rows = 3

results = []

df_subset = df_final.head(num_rows)

for _, row in tqdm(df_subset.iterrows(), total=num_rows, desc="Progress"):
    result = evaluate(
        text=row["text"],
        machine_summaries=row["machine_summaries"],
        relevance=row["relevance"],
        coherence=row["coherence"],
        fluency=row["fluency"],
        consistency=row["consistency"]
    )
    results.append(result)

results_df = pd.DataFrame(results)
os.makedirs("Results", exist_ok=True)  # to_csv does not create missing directories
results_df.to_csv('Results/cooperative.csv', index=False)

Progress:   0%|          | 0/3 [00:00<?, ?it/s]

Decision-Agent (to chat_manager):

 
    Article: Boss Nigel Pearson has urged Leicester to keep their cool and ignore their relegation rivals. The Foxes host Swansea on Saturday just three points from safety in the Barclays Premier League after back-to-back wins. Last week's 3-2 win at West Brom handed them a survival lifeline, although they remain bottom of the table. Jamie Vardy scored an injury-time winner against West Bromwich Albion on Saturday to improve his side's slim chance of Premier League survival Vardy celebrates in front of the travelling away fans after hitting the winner against West Brom But after their mini-revival, Pearson wants his side to remain focused on their own jobs. 'I'm very wary of people flipping the emphasis,' he said. 'Our future is in our own hands and if we go into the last game with that we have given ourselves a realistic chance. 'We need to make sure our own run-in is what we want it to be. Leicester manager Nigel Pearson has urged his pla

Progress:  33%|███▎      | 1/3 [00:11<00:23, 11.64s/it]

As a final Decision-Agent, after carefully reviewing the Analyst-Agent's and Critical-Enhancer-Agent's previous analyses, I think the proposed summary falls short in several key areas. 

In terms of relevance, while it mentions Jamie Vardy's injury-time winner and Pearson's focus on the team's performance, it fails to capture the broader context of Leicester's precarious position in the league and the significance of their recent back-to-back wins. The summary inaccurately states that Swansea is hosting Leicester, which is a critical error that misrepresents the situation, leading to a significant consistency issue. 

Fluency is somewhat acceptable, as the sentences are clear, but they lack complexity and depth, which could enhance readability. The summary could benefit from varied sentence structures and more descriptive language to better convey the article's tone and urgency. 

Coherence is another area where the summary falls short. The points presented do not connect logically, ma

Progress:  67%|██████▋   | 2/3 [00:23<00:11, 11.75s/it]

As a final Decision-Agent, after carefully reviewing the Analyst-Agent's and Critical-Enhancer-Agent's previous analyses, I think the summary does a fair job of capturing the main points of the article but ultimately falls short in several areas. 

In terms of relevance, the summary mentions Tevez's desire to leave Juventus and return to Boca Juniors, which are indeed key points. However, it neglects to include important details such as Tevez's status as the league's top scorer and the fact that his contract runs until 2016. These omissions detract from the overall understanding of the situation and the implications of his potential move.

The summary is consistent with the original article, accurately reflecting the facts without introducing inaccuracies. Fluency is satisfactory, with grammatically correct sentences, but the writing could be more engaging with varied sentence structures.

Coherence is where the summary struggles the most. The ideas presented feel somewhat disjointed, 

Progress: 100%|██████████| 3/3 [00:35<00:00, 11.75s/it]

As a final Decision-Agent, after carefully reviewing the Analyst-Agent's and Critical-Enhancer-Agent's previous analyses, I think the summary provided does capture some key points from the article, but it ultimately falls short in several critical areas. 

In terms of **relevance**, while it mentions the auction and the sale price of the dress, it omits significant details such as the dress's original color, its condition, and the background of the collector, James Tumblin. This lack of context diminishes the overall relevance of the summary, leading to a score of 2. 

Regarding **consistency**, the summary accurately reflects the facts presented in the article, with no inaccuracies noted, which justifies a score of 5. 

For **fluency**, the summary suffers from grammatical issues, including inconsistent capitalization and punctuation, which hinder readability. The sentences are fragmented and lack variety, resulting in a score of 2. 

Lastly, the **coherence** of the summary is weak, 


