# Parallel Interaction on SummEval

## Imports

In [1]:
import pandas as pd
from datasets import load_dataset
import os
from dotenv import load_dotenv
from autogen import ConversableAgent
import re
from tqdm import tqdm
import numpy as np



## Data

In [2]:
SummEval = load_dataset("mteb/summeval")

df = pd.DataFrame(SummEval["test"])[["text", "machine_summaries", "relevance", "coherence", "fluency", "consistency"]]

problematic_indices = [5, 7, 8, 9, 10, 11, 18, 20, 26, 27, 33, 34, 39, 46, 61, 64, 68, 73, 75, 79, 85, 86, 88, 92, 96, 99]
df_filtered = df.drop(index=problematic_indices).reset_index(drop=True)

df_exploded = df_filtered.explode(["machine_summaries", "relevance", "coherence", "fluency", "consistency"]).reset_index(drop=True)

df_sampled = df_exploded.iloc[:1200].sample(n=100, random_state=42).reset_index(drop=True)

columns_to_round = ["relevance", "coherence", "fluency", "consistency"]

# Round to the nearest integer; a fractional part of exactly 0.5 is rounded down.
df_sampled[columns_to_round] = df_sampled[columns_to_round].map(lambda x: int(np.ceil(x) if x % 1 > 0.5 else np.floor(x)))

df_final = df_sampled

print(df_final.info())
print(df_final.head(1))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   text               100 non-null    object
 1   machine_summaries  100 non-null    object
 2   relevance          100 non-null    int64 
 3   coherence          100 non-null    int64 
 4   fluency            100 non-null    int64 
 5   consistency        100 non-null    int64 
dtypes: int64(4), object(2)
memory usage: 4.8+ KB
None
                                                text  \
0  Boss Nigel Pearson has urged Leicester to keep...   

                                   machine_summaries  relevance  coherence  \
0  jamie vardy scored an injury-time winner again...          2          2   

   fluency  consistency  
0        3            2  
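
The rounding rule in the data cell sends a fractional part above 0.5 up and everything else, including exactly 0.5, down. A minimal sketch that isolates this behaviour, assuming the raw scores are floats; the helper name `round_half_down` is hypothetical and `math` stands in for NumPy so the snippet has no dependencies:

```python
import math

def round_half_down(x):
    """Round to the nearest integer, sending a fractional part of exactly 0.5 downward."""
    return int(math.ceil(x) if x % 1 > 0.5 else math.floor(x))

# Illustrative inputs only, not values taken from the dataset.
examples = [2.0, 2.33, 2.5, 2.67]
rounded = [round_half_down(x) for x in examples]
print(rounded)  # [2, 2, 2, 3]
```

Note that this differs from the common "round half up" convention, under which 2.5 would become 3.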


## Config

In [3]:
load_dotenv()

api_key = os.getenv("AZURE_OPENAI_API_KEY")
endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")
deployment_name = os.getenv("AZURE_DEPLOYMENT_NAME")
api_version = os.getenv("AZURE_API_VERSION", "2023-12-01-preview")

config_list = [
    {
        "model": deployment_name,
        "api_key": api_key,
        "base_url": f"{endpoint}/openai/deployments/{deployment_name}/chat/completions?api-version={api_version}",
        "api_type": "azure",
        "api_version": api_version,  
        "temperature": 0,
        "cache_seed": 42,
    }
]
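
If one of the environment variables is missing, `os.getenv` silently returns `None` and the failure only surfaces later, during the first request. A small guard can fail early instead. The sketch below is one possible approach; `require_env` is a hypothetical helper, not part of the notebook, and the placeholder deployment name is for demonstration only:

```python
import os

def require_env(name, default=None):
    """Return the environment variable `name`; raise if it is unset or empty and no default is given."""
    value = os.getenv(name, default)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value

# Placeholder so the sketch runs standalone; real values would come from the .env file.
os.environ.setdefault("AZURE_DEPLOYMENT_NAME", "example-deployment")

deployment_name = require_env("AZURE_DEPLOYMENT_NAME")
api_version = require_env("AZURE_API_VERSION", "2023-12-01-preview")
```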

## System Design

In [4]:
agent_1_system_message =f"""
You are a Factual-Journalist-Agent.
You are a professional journalist focused on verifying factual accuracy, capturing the core message of articles, and ensuring summaries remain true to the source. 
You value clear, complete, and correct information above all.

In this task you will evaluate the quality of a summary written for a news article.
To correctly solve this task, follow these steps:

    1. Carefully read the news article, be aware of the information it contains.
    2. Read the proposed summary.
    3. Rate the summary with integer values on a scale from 1 (worst) to 5 (best) by its relevance, consistency, fluency, and coherence.

Definitions:
    Relevance:
        - The rating measures how well the summary captures the key points of the article.
        - Consider whether all and only the important aspects are contained in the summary.
    Consistency:
        - The rating measures whether the facts in the summary are consistent with the facts in the original article.
        - Consider whether the summary does reproduce all facts accurately and does not make up untrue information.
    Fluency:
        - This rating measures the quality of individual sentences: whether they are well written and grammatically correct.
        - Consider the quality of individual sentences.
    Coherence:
        - The rating measures the quality of all sentences collectively: whether they fit together and sound natural.
        - Consider the quality of the summary as a whole.

Give an explanation on your evaluation using about 200 words.
Always begin your output with: "As a Factual-Journalist-Agent I think ..."
Always end your output with a JSON object with the following format:{{"relevance": score, "coherence": score, "fluency": score, "consistency": score}} 
"""

agent_2_system_message =f"""
You are a Language-Stylist-Agent.
You are an expert in writing, grammar, and expression, evaluating how summaries communicate effectively through style, clarity, and sentence construction. 
You care about linguistic precision, elegance, and flow while maintaining content accuracy.

In this task you will evaluate the quality of a summary written for a news article.
To correctly solve this task, follow these steps:

    1. Carefully read the news article, be aware of the information it contains.
    2. Read the proposed summary.
    3. Rate the summary with integer values on a scale from 1 (worst) to 5 (best) by its relevance, consistency, fluency, and coherence.

Definitions:
    Relevance:
        - The rating measures how well the summary captures the key points of the article.
        - Consider whether all and only the important aspects are contained in the summary.
    Consistency:
        - The rating measures whether the facts in the summary are consistent with the facts in the original article.
        - Consider whether the summary does reproduce all facts accurately and does not make up untrue information.
    Fluency:
        - This rating measures the quality of individual sentences: whether they are well written and grammatically correct.
        - Consider the quality of individual sentences.
    Coherence:
        - The rating measures the quality of all sentences collectively: whether they fit together and sound natural.
        - Consider the quality of the summary as a whole.

Give an explanation on your evaluation using about 200 words.
Always begin your output with: "As a Language-Stylist-Agent I think ..."
Always end your output with a JSON object with the following format:{{"relevance": score, "coherence": score, "fluency": score, "consistency": score}} 
"""

agent_3_system_message =f"""
You are a Cognitive-Agent.
You specialize in how readers process, understand, and retain information. 
You assess summaries based on logical structure, mental clarity, and how effectively the content can be followed and integrated by readers.

In this task you will evaluate the quality of a summary written for a news article.
To correctly solve this task, follow these steps:

    1. Carefully read the news article, be aware of the information it contains.
    2. Read the proposed summary.
    3. Rate the summary with integer values on a scale from 1 (worst) to 5 (best) by its relevance, consistency, fluency, and coherence.

Definitions:
    Relevance:
        - The rating measures how well the summary captures the key points of the article.
        - Consider whether all and only the important aspects are contained in the summary.
    Consistency:
        - The rating measures whether the facts in the summary are consistent with the facts in the original article.
        - Consider whether the summary does reproduce all facts accurately and does not make up untrue information.
    Fluency:
        - This rating measures the quality of individual sentences: whether they are well written and grammatically correct.
        - Consider the quality of individual sentences.
    Coherence:
        - The rating measures the quality of all sentences collectively: whether they fit together and sound natural.
        - Consider the quality of the summary as a whole.

Give an explanation on your evaluation using about 200 words.
Always begin your output with: "As a Cognitive-Agent I think ..."
Always end your output with a JSON object with the following format:{{"relevance": score, "coherence": score, "fluency": score, "consistency": score}} 
"""

In [5]:
initializer = ConversableAgent(
    "initializer", 
    llm_config={"config_list": config_list},
    human_input_mode="NEVER",
    )

agent_1 = ConversableAgent(
    "Factual-Journalist-Agent",
    llm_config={"config_list": config_list},
    system_message=agent_1_system_message,
    human_input_mode="NEVER",
)
agent_2 = ConversableAgent(
    "Language-Stylist-Agent",
    llm_config={"config_list": config_list},
    system_message=agent_2_system_message,
    human_input_mode="NEVER",
)
agent_3 = ConversableAgent(
    "Cognitive-Agent",
    llm_config={"config_list": config_list},
    system_message=agent_3_system_message,
    human_input_mode="NEVER",
)

## Evaluation

In [6]:
def evaluate(text, machine_summaries, relevance, coherence, fluency, consistency):
    message = f""" 
    Article: {text}

    Summary: {machine_summaries}
    """

    agents = [agent_1, agent_2, agent_3]

    ground_truth = {
        "relevance": relevance,
        "coherence": coherence,
        "fluency": fluency,
        "consistency": consistency,
    }
    scores = {criterion: [] for criterion in ground_truth}

    # Each agent rates the summary independently in its own single-turn chat.
    for agent in agents:
        result = initializer.initiate_chat(agent, message=message, max_turns=1)
        result_str = str(result)

        for criterion in scores:
            match = re.search(rf'"{criterion}"\s*:\s*(\d+)', result_str)
            if match is None:
                # Fail with a clear message instead of an AttributeError on None.
                raise ValueError(f"{agent.name} returned no '{criterion}' score")
            scores[criterion].append(int(match.group(1)))

    # The system decision is the rounded mean of the three agents' scores.
    final_scores = {key: round(sum(vals) / len(vals)) for key, vals in scores.items()}

    return {
        criterion: {
            "ground_truth": truth,
            "system_decision": final_scores[criterion],
            "deviation": abs(truth - final_scores[criterion]),
        }
        for criterion, truth in ground_truth.items()
    }
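
Since the agents are instructed to end their output with a JSON object, an alternative to per-key regular expressions is to parse that object directly and validate the score range. The sketch below is not the notebook's method; `extract_scores` is a hypothetical helper, and the sample reply is invented for illustration:

```python
import json
import re

CRITERIA = ("relevance", "coherence", "fluency", "consistency")

def extract_scores(reply):
    """Return the four integer scores from the last valid flat JSON object in `reply`."""
    # Find every brace-delimited span without nested braces, then try the last one first.
    candidates = re.findall(r"\{[^{}]*\}", reply)
    for raw in reversed(candidates):
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue
        # Accept only objects that carry all four criteria as integers from 1 to 5.
        if all(isinstance(data.get(key), int) and 1 <= data[key] <= 5 for key in CRITERIA):
            return {key: data[key] for key in CRITERIA}
    raise ValueError("no valid score object found in reply")

# Invented reply text, for illustration only.
reply = 'As a Cognitive-Agent I think ... {"relevance": 4, "coherence": 3, "fluency": 5, "consistency": 4}'
print(extract_scores(reply))
```

Parsing the whole object also rejects replies in which a score falls outside the 1 to 5 scale, which the regular-expression approach would accept.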

In [7]:
num_rows = 3

results = []

df_subset = df_final.head(num_rows)

for _, row in tqdm(df_subset.iterrows(), total=len(df_subset), desc="Progress"):
    result = evaluate(
        text=row["text"],
        machine_summaries=row["machine_summaries"],
        relevance=row["relevance"],
        coherence=row["coherence"],
        fluency=row["fluency"],
        consistency=row["consistency"]
    )
    results.append(result)

results_df = pd.DataFrame(results)
os.makedirs("Results", exist_ok=True)  # to_csv does not create missing directories
results_df.to_csv('Results/parallel.csv', index=False)

Progress:   0%|          | 0/3 [00:00<?, ?it/s]

initializer (to Factual-Journalist-Agent):

 
    Article: Boss Nigel Pearson has urged Leicester to keep their cool and ignore their relegation rivals. The Foxes host Swansea on Saturday just three points from safety in the Barclays Premier League after back-to-back wins. Last week's 3-2 win at West Brom handed them a survival lifeline, although they remain bottom of the table. Jamie Vardy scored an injury-time winner against West Bromwich Albion on Saturday to improve his side's slim chance of Premier League survival Vardy celebrates in front of the travelling away fans after hitting the winner against West Brom But after their mini-revival, Pearson wants his side to remain focused on their own jobs. 'I'm very wary of people flipping the emphasis,' he said. 'Our future is in our own hands and if we go into the last game with that we have given ourselves a realistic chance. 'We need to make sure our own run-in is what we want it to be. Leicester manager Nigel Pearson has urge

Progress:  33%|███▎      | 1/3 [00:12<00:25, 12.50s/it]

initializer (to Factual-Journalist-Agent):

 
    Article: Carlos Tevez has been told to terminate his contract with Juventus to complete a return to his former club Boca Juniors in Argentina. The former Manchester City striker's deal with the Serie A champions does not expire until the end of next season but he has reportedly told the club he wishes to leave this summer. Boca have confirmed they are close to completing a deal for the 31-year-old, but club president Daniel Angelici has stressed that Tevez must terminate his contract with the Italians first. Carlos Tevez has shocked Juventus by suggesting he wants to leave the club this summer Tevez is on course to win a second Serie A title with the Old Lady and still has a shot at European glory 'We must be careful', Angelici told TYC Sports. 'We know that he wants to return to Argentina with Boca Juniors but he must first terminate his contract with Juventus, which runs until 2016. 'We are close to sealing his return and it 

Progress:  67%|██████▋   | 2/3 [00:23<00:11, 11.67s/it]

initializer (to Factual-Journalist-Agent):

 
    Article: A dress worn by Vivien Leigh when she played Scarlett O'Hara in the classic 1939 film Gone With the Wind has fetched $137,000 at auction. Heritage Auctions offered the gray jacket and skirt, featuring a black zigzag applique, plus more than 150 other items from the Academy Award-winning film at auction on Saturday in Beverly Hills, California. The dress - a jacket and full skirt ensemble - was worn in several key scenes in the 1939 movie, including when Scarlett O'Hara encounters Rhett Butler, played by Clark Gable, and when she gets attacked in the shanty town. Scroll down for video An outfit worn in several scenes of the 1939 film Gone With The Wind by Vivien Leigh as she played Scarlett O'Hara sold for $137,000 at auction on Saturday The dress - a jacket and full skirt ensemble - was worn in several key scenes in the 1939 movie but has suffered a little with age and has faded to light gray from original slate blue-g

Progress: 100%|██████████| 3/3 [00:46<00:00, 15.50s/it]
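
Each entry in `results` is a nested dictionary, so the saved CSV holds one dictionary per cell. For analysis it may be more convenient to flatten each result into one row of scalar columns and to summarise the deviation per criterion. A minimal sketch, assuming the structure returned by `evaluate`; the helper names are hypothetical and the numbers below are placeholders, not results from this run:

```python
from statistics import mean

def flatten_result(result):
    """Turn {"relevance": {"ground_truth": ...}, ...} into {"relevance_ground_truth": ..., ...}."""
    return {
        f"{criterion}_{field}": value
        for criterion, fields in result.items()
        for field, value in fields.items()
    }

def mean_absolute_deviation(results):
    """Average the per-row absolute deviation for each criterion."""
    criteria = results[0].keys()
    return {c: mean(r[c]["deviation"] for r in results) for c in criteria}

# Placeholder entries in the shape produced by evaluate(); values are made up.
example_results = [
    {"relevance": {"ground_truth": 2, "system_decision": 3, "deviation": 1},
     "coherence": {"ground_truth": 2, "system_decision": 2, "deviation": 0}},
    {"relevance": {"ground_truth": 4, "system_decision": 4, "deviation": 0},
     "coherence": {"ground_truth": 3, "system_decision": 3, "deviation": 0}},
]

flat_rows = [flatten_result(r) for r in example_results]
summary = mean_absolute_deviation(example_results)
print(flat_rows[0])
print(summary)
```

The flattened rows can be passed to `pd.DataFrame(flat_rows)` before calling `to_csv`, so that every CSV column holds a single number.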
