README

Title: Pairwise Evaluation

Overview:
This script evaluates and compares climate disclosure responses from two companies using OpenAI's GPT-4o-mini model. The prompt asks the model to judge which company's response is stronger based on action taken, specificity, and clarity, while ignoring length, order, and missing responses.

Main Features:
- Uses a pairwise judging prompt to guide GPT-4o-mini in evaluating two company responses.
- Requests token-level log probabilities to assess the model's confidence in its answer.
- Reads in a list of selected company responses and compares each one against a fixed set of 24 comparison companies.
- Outputs a file with the explanation (if requested), the decision (which company had the better response), and the respective log probs (stored as percentage probabilities).
- Uses those values to calculate a final 'score' for each company, which is written to a new file.

How to Run:
1. Install dependencies:
    openai
    pandas
    numpy

2. Set up your OpenAI API key in `api_keys.py`.
    ex. # api_keys.py
        OPEN_AI_API_KEYS = ["your-api-key-here"]

3. Adjust the input and output file paths as needed in the script.


Note:
Make sure the company responses are aligned with the model's input requirements.

In [None]:
import openai

MODEL = 'gpt-4o-mini'
from api_keys import OPEN_AI_API_KEYS
key = OPEN_AI_API_KEYS[0]
client = openai.OpenAI(api_key=key)
openai.api_key = key

import numpy as np
import pandas as pd
import csv

Converts a list of prompt strings into a list of message dictionaries formatted for OpenAI's chat API. Sends the user prompts to the OpenAI model using the given client and retrieves the model's full text response.

In [148]:
def prepare_messages(prompts):
	messages = []
	for prompt in prompts:
		line = {"role": "user", "content": prompt}
		messages.append(line)
	return messages

def get_model_output_initial_sentences(prompts, client): 
	# Send the prompts as user messages and return the model's text reply
	completion = client.chat.completions.create(
		model = MODEL,
		messages = prepare_messages(prompts),
		temperature = 0
	)
	output = completion.choices[0].message.content
	return output
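
As a quick illustration of the message format built above, the sketch below repeats `prepare_messages` in a compact form and runs it on two made-up prompt strings. No API call is involved, and the sample strings are placeholders rather than prompts from this project.

```python
def prepare_messages(prompts):
    # One chat message per prompt string, all with the "user" role
    return [{"role": "user", "content": prompt} for prompt in prompts]

# Placeholder prompts, for illustration only
messages = prepare_messages(["First prompt", "Second prompt"])
print(messages)
```

Each prompt string becomes its own user message, so a list of N prompts yields N entries in `messages`.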

Calls the OpenAI chat API with full control over parameters like temperature, max tokens, stop sequences, tools, and log probabilities, and returns the full completion object rather than just the text.

In [149]:
def get_completion(
    messages: list[dict[str, str]],
    model: str = "gpt-4o-mini",
    max_tokens=500,
    temperature=0,
    stop=None,
    seed=123,
    tools=None,
    logprobs=None,  # if True, return the log probability of each output token in the message content
    top_logprobs=None,  # number of most likely candidate tokens to return per position (requires logprobs=True)
) -> "openai.types.chat.ChatCompletion":
    params = {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "stop": stop,
        "seed": seed,
        "logprobs": logprobs,
        "top_logprobs": top_logprobs,
    }
    if tools:
        params["tools"] = tools

    completion = client.chat.completions.create(**params)
    return completion

This is the prompt that we used. If you do not want an explanation, comment out the lines marked in the code and uncomment the "Please do not provide an explanation." line.

In [161]:
prompts = [
    "Please act as an impartial judge and evaluate the quality of the following responses provided by two ",
    "companies to the following questions. The companies may answer both questions or just one.",
    "Question 1: Provide details of your absolute emissions target(s) and progress made against those targets. - Please explain target coverage and identify any exclusions ",
    "Question 2: Provide details of your emissions intensity target(s) and progress made against those target(s). - Please explain target coverage and identify any exclusions ",
    "Your evaluation should be primarily based on the tangible action that has already been taken, ",
    "and can also consider planned actions for the future. Your evaluation should also consider factors ",
    "such as the specificity, clarity, completeness, and depth of their responses.",
    "Do not allow the length of the responses to influence your evaluation. ",
    "Do not allow a missing response to influence your decision, but simply ignore it and focus on the other response. ",
    "Do not fall for greenwashing tactics.",

    # comment out this line if you don't want an explanation
    "Please strictly keep your explanation under 40 words.", 

    "Do not allow the order of the responses to influence your decision. ",
    "Please decide which company gave a better response, based on the considerations above. ",

    # comment out this line if you don't want an explanation
    "Before giving your answer, please provide a short explanation of 40 or fewer words discussing the factors that contributed to your decision. ",

    # comment out this line if you don't want an explanation
    "After giving your explanation, ",

    # uncomment this line if you don't want an explanation
    # "Please do not provide an explanation. ",

    # "Output your final answer by strictly following this format: A if company A’s response is better, and B if company B’s response is better. "
    "Output your final answer by strictly following this format: Better Response: A if company A’s response is better, and Better Response: B if company B’s response is better. ",

    "[The Start of Company A’s Question 1 Response] {answer1a} [The End of Company A’s Response]",
    "[The Start of Company A’s Question 2 Response] {answer1b} [The End of Company A’s Response]",
    "[The Start of Company B’s Question 1 Response] {answer2a} [The End of Company B’s Response]",
    "[The Start of Company B’s Question 2 Response] {answer2b} [The End of Company B’s Response]",
    
]
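
Instead of commenting lines in and out by hand, the two prompt variants could be assembled from a flag. The sketch below is only an illustration: `build_prompt_template` is a hypothetical helper, and the prompt pieces are shortened stand-ins rather than the full wording used above.

```python
def build_prompt_template(with_explanation: bool) -> str:
    """Hypothetical helper: join shortened stand-in prompt pieces into one template."""
    parts = ["Please decide which company gave a better response. "]
    if with_explanation:
        parts.append("Please strictly keep your explanation under 40 words. ")
        parts.append("Before giving your answer, please provide a short explanation. ")
    else:
        parts.append("Please do not provide an explanation. ")
    parts.append("Output your final answer as Better Response: A or Better Response: B. ")
    parts.append("[Company A, Question 1] {answer1a} ")
    parts.append("[Company A, Question 2] {answer1b} ")
    parts.append("[Company B, Question 1] {answer2a} ")
    parts.append("[Company B, Question 2] {answer2b}")
    return "".join(parts)

# Fill the template once with placeholder answers
template = build_prompt_template(with_explanation=False)
prompt = template.format(answer1a="a1", answer1b="a2", answer2a="b1", answer2b="b2")
print(prompt)
```

Every piece ends with a space so that sentences do not run together after `"".join`.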

This function:
- Fills in the templated prompt using four answer strings (answer1a, answer1b, answer2a, answer2b)
- Sends that prompt to OpenAI's model
- Retrieves the top 2 candidate tokens and their log probabilities for the final output token (the A or B of the decision)

Returns:
- The model's generated response
- A list of (token, probability) pairs for those top 2 candidates, with each log probability converted to a percentage

On error, it prints a message and returns two empty lists.



In [156]:
def get_log_prob(answer1a, answer1b, answer2a, answer2b, company_1, company_2):
    try:
        formatted_prompt = "".join(prompts).format(
            answer1a=answer1a, 
            answer1b=answer1b, 
            answer2a=answer2a, 
            answer2b=answer2b
        )
   
        # formatted_prompt is already filled in; formatting it a second time
        # would fail on any response text that contains curly braces
        API_RESPONSE = get_completion(
            [{"role": "user", "content": formatted_prompt}],
            model="gpt-4o-mini",
            logprobs=True,
            top_logprobs=2,
        )

        generated_answer = API_RESPONSE.choices[0].message.content
        # print(f"Generated answer: {generated_answer}")
        # Top 2 candidates for the final output token (the A/B of "Better Response: X")
        top_two_logprobs = API_RESPONSE.choices[0].logprobs.content[-1].top_logprobs

        token_probabilities = []
        for logprob in top_two_logprobs:
            token = logprob.token
            # Convert the log probability to a percentage, rounded to 2 decimals
            probability = np.round(np.exp(logprob.logprob)*100,2)
            token_probabilities.append((token, probability))
            
        return generated_answer, token_probabilities
    except KeyError as e:
        print(f"KeyError: {e}")
        print("Variables: answer1a:", answer1a, "answer1b:", answer1b, "answer2a:", answer2a, "answer2b:", answer2b)
        return [],[]
    except Exception as e:
        print(f"An error occurred: {e}")
        return [],[]
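
The conversion from log probability to percentage used in `get_log_prob` can be checked in isolation. In this sketch, `SimpleNamespace` objects stand in for the API's top-logprob entries (only the `token` and `logprob` attributes are mimicked), and the probabilities are made up for illustration.

```python
import math
from types import SimpleNamespace

# Stand-ins for the entries of logprobs.content[-1].top_logprobs;
# the values are invented, not real model output.
top_two = [
    SimpleNamespace(token="A", logprob=math.log(0.9)),
    SimpleNamespace(token="B", logprob=math.log(0.1)),
]

# exp(logprob) recovers the probability; * 100 expresses it as a percentage
token_probabilities = [
    (entry.token, round(math.exp(entry.logprob) * 100, 2))
    for entry in top_two
]
print(token_probabilities)
```

Because the values are percentages rather than raw log probabilities, the two entries for a confident decision sum to roughly 100.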

In [157]:
# column names of the two target questions in each year's dataset
a_2019 = "C4.1a_C12_Provide details of your absolute emissions target(s) and progress made against those targets. - Please explain"
b_2019 = "C4.1b_C13_Provide details of your emissions intensity target(s) and progress made against those target(s). - Please explain"
a_2020 = "C4.1a_C15_Provide details of your absolute emissions target(s) and progress made against those targets. - Please explain (including target coverage)"
b_2020 = "C4.1b_C18_Provide details of your emissions intensity target(s) and progress made against those target(s). - Please explain (including target coverage)"
a_2021 = "C4.1a_C16_Provide details of your absolute emissions target(s) and progress made against those targets. - Please explain (including target coverage)"
b_2021 = "C4.1b_C19_Provide details of your emissions intensity target(s) and progress made against those target(s). - Please explain (including target coverage)"
a_2022 = "C4.1a_C27_Provide details of your absolute emissions target(s) and progress made against those targets. - Please explain target coverage and identify any exclusions"
b_2022 = "C4.1b_C30_Provide details of your emissions intensity target(s) and progress made against those target(s). - Please explain target coverage and identify any exclusions"

In [162]:
import pandas as pd
import csv

# input files:
#   input_file - 2022 merged dataset with all valid European company responses
#   companies_to_look_at - company responses you want to score
#       we scored the A List company responses and a stratified set of 147 companies

input_file = 'merged_files/2022_merged_dataset.csv'
companies_to_look_at = "a-list_data_files/2022_a_list.csv"

# output_file format: Company A, Company B, Industry, Decision, Token 1, LogProb 1, Token 2, LogProb 2
# (the LogProb columns hold percentage probabilities, i.e. exp(logprob) * 100)
output_file = "pairwise_testing_new_prompt/testing_with_explanation/A_List_Comparisons_12.csv"

# 24 comparison companies from a stratified set that each company gets compared to
comparison_set = [
    # insert companies you want to compare against
    "Pandora A/S", "thyssenkrupp AG", "J Sainsbury Plc", "AMG Advanced Metallurgical Group NV", 
    "Sonaca", "Havas", "Norwegian Property ASA", "Phoenix Group Holdings", 
    "SGS Société Générale de Surveillance SA", "TDC NET A/S", "DUCRON CONSTRUCTION", 
    "Vistry Group plc", "LAMBERT SMITH HAMPTON", 
    "Nouvelle Societe Mineral Products Internation", "Recordati SpA", "SKF", 
    "Amadeus IT Group, S.A.", "Vitesco Technologies", "SSAB", 
    "Coloplast A/S", "Severn Trent", "Ernst & Young Global Ltd", 
    "Softcat", "WHISTL UK LTD"
]

# Load datasets
df = pd.read_csv(input_file)
a_df = pd.read_csv(companies_to_look_at)

# Extract columns
company_name_list = df.loc[:, "Organization"]
industry_list = df.loc[:, 'Primary industry']
a_response_list = df.loc[:, a_2022] 
b_response_list = df.loc[:, b_2022]  

# Extract the names of the companies to score
companies_to_look_at_list = a_df["Company Name"].tolist()

fields = ["Company A", "Company B", "Industry", "Decision", "Token 1", "LogProb 1", "Token 2", "LogProb 2"]

# Write output to CSV
with open(output_file, 'w', newline='') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=fields, quoting=csv.QUOTE_ALL)
    writer.writeheader()

    # Loop through the companies to score
    for index, companyA in enumerate(companies_to_look_at_list):
        
        # Check if the current company exists in the dataset
        if companyA in company_name_list.values:
            row = df[df["Organization"] == companyA].iloc[0]  # Get the row for the current company
            industry = row["Primary industry"]
            response1a = row[a_2022]
            response1b = row[b_2022]
            
            if pd.notna(industry) and industry != "" and isinstance(industry, str):
                # Compare against the fixed list of 24 companies
                filtered_df = df[df["Organization"].isin(comparison_set)]
                a_random_response_list = filtered_df.loc[:, a_2022]
                b_random_response_list = filtered_df.loc[:, b_2022]
                random_company_name_list = filtered_df.loc[:, "Organization"]

                # Loop through the 24 fixed companies
                for random_list_index in range(len(filtered_df)):
                    response2a = a_random_response_list.iloc[random_list_index]
                    response2b = b_random_response_list.iloc[random_list_index]
                    companyB = random_company_name_list.iloc[random_list_index]

                    print(f"Comparing {companyA} with {companyB}")

                    # Call the function to get log probabilities and tokens
                    answer1, list_log_probs = get_log_prob(response1a, response1b, response2a, response2b, companyA, companyB)
                    if list_log_probs:
                        # Write data to CSV
                        line = {
                            "Company A": companyA, 
                            "Company B": companyB, 
                            "Industry": industry, 
                            "Decision": answer1, 
                            "Token 1": list_log_probs[0][0], 
                            "LogProb 1": list_log_probs[0][1], 
                            "Token 2": list_log_probs[1][0],
                            "LogProb 2": list_log_probs[1][1]
                        }
                        writer.writerow(line)
                    else:
                        print(f"Insufficient log probabilities for comparison between {companyA} and {companyB}.")

print(f"LogProbs have been written to {output_file}")

Comparing Symrise AG with AMG Advanced Metallurgical Group NV
Comparing Symrise AG with Amadeus IT Group, S.A.
Comparing Symrise AG with Coloplast A/S
Comparing Symrise AG with DUCRON CONSTRUCTION
Comparing Symrise AG with Ernst & Young Global Ltd
Comparing Symrise AG with Havas
Comparing Symrise AG with J Sainsbury Plc
Comparing Symrise AG with LAMBERT SMITH HAMPTON
Comparing Symrise AG with Norwegian Property ASA
Comparing Symrise AG with Nouvelle Societe Mineral Products Internation
Comparing Symrise AG with Pandora A/S
Comparing Symrise AG with Phoenix Group Holdings
Comparing Symrise AG with Recordati SpA
Comparing Symrise AG with SGS Société Générale de Surveillance SA
Comparing Symrise AG with SKF
Comparing Symrise AG with SSAB
Comparing Symrise AG with Severn Trent
Comparing Symrise AG with Softcat
Comparing Symrise AG with Sonaca
Comparing Symrise AG with TDC NET A/S
Comparing Symrise AG with Vistry Group plc
Comparing Symrise AG with Vitesco Technologies
Comparing Symrise AG 

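This script processes a comparison CSV produced by the step above and computes, for each company, the average probability (in percent) that the model assigned to 'A', that is, to that company giving the better response, across all of its comparisons.

Outputs a file with Company and Average (the pairwise score).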

In [166]:
import pandas as pd

# File paths
input_file = "pairwise_testing_new_prompt/with_explanation/147_Non_A_List_Comparisons.csv"
output_file = "pairwise_testing_new_prompt/with_explanation/Averages_147_Non_A_List.csv"

# Read input CSV
df = pd.read_csv(input_file)

# Initialize tracking variables
log_prob_data = {}
current_company = None
total_score = 0.0
total_count = 0

# Iterate over each row
for index, row in df.iterrows():
    company_a = row['Company A']
    token_1 = row['Token 1'].strip()
    logprob_1 = row['LogProb 1']
    logprob_2 = row['LogProb 2']

    # Check for a new company
    if company_a != current_company:
        # Finalize the previous company's calculations
        if current_company is not None and total_count > 0:
            avg_log_prob = total_score / total_count
            log_prob_data[current_company] = avg_log_prob

        # Reset counters for the new company
        current_company = company_a
        total_score = 0.0
        total_count = 0

    # Add the probability the model gave to 'A' (Company A being better)
    if token_1 == 'A':
        total_score += logprob_1
    else:  # Token 1 is assumed to be 'B', so Token 2 is taken to be 'A'
        total_score += logprob_2
    total_count += 1

# Finalize the last company's calculations
if current_company is not None and total_count > 0:
    avg_log_prob = total_score / total_count
    log_prob_data[current_company] = avg_log_prob

# Create a DataFrame from the results
log_prob_df = pd.DataFrame([
    {'Company': company, 'Average': avg_log_prob}
    for company, avg_log_prob in log_prob_data.items()
])

# Save to CSV
log_prob_df.to_csv(output_file, index=False)

print("Average log probabilities have been calculated and saved.")

Average log probabilities have been calculated and saved.
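
The same scoring rule can also be sketched with a pandas `groupby`. This is an alternative formulation, not the code used above: it groups all rows by company rather than tracking consecutive rows, which gives the same result when each company's rows are contiguous. The rows below are made up for illustration.

```python
import pandas as pd

# Toy comparison rows in the output format of the pairwise step (values invented)
rows = pd.DataFrame({
    "Company A": ["X Corp", "X Corp", "Y Corp"],
    "Token 1": ["A", "B", " A"],
    "LogProb 1": [90.0, 70.0, 80.0],
    "LogProb 2": [10.0, 30.0, 20.0],
})

# Probability assigned to 'A': LogProb 1 if Token 1 is 'A', otherwise LogProb 2
a_is_top = rows["Token 1"].str.strip() == "A"
rows["P(A better)"] = rows["LogProb 1"].where(a_is_top, rows["LogProb 2"])

# Average per company, keeping the order of first appearance
averages = rows.groupby("Company A", sort=False)["P(A better)"].mean().reset_index()
averages.columns = ["Company", "Average"]
print(averages)
```

As in the loop above, this assumes that whenever Token 1 is not 'A', Token 2 is.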
