Title: Rating Evaluation

Overview:
This script evaluates companies' climate disclosure responses using OpenAI's GPT-4o-mini model. The prompt instructs the model to judge each company's responses on action taken, specificity, and clarity, and to ignore length, order, and missing data.

Main Features:
- Uses a rating prompt to guide GPT-4o-mini in evaluating each company's response.
- Requests token-level log probabilities to gauge the model's confidence in its answer.
- Writes an output file with the explanation (if requested), the decision (a numerical rating on a scale of 1 to 5), and the corresponding token probabilities (derived from the log probs).
- Uses those probabilities to calculate a final 'score', which is written to a new file.

How to Run:
1. Install dependencies:
    openai
    pandas
    numpy

2. Set up your OpenAI API key in `api_keys.py`.
    Example:
        # api_keys.py
        OPEN_AI_API_KEYS = ["your-api-key-here"]

3. Adjust the input and output file paths as needed in the script.


Note:
Make sure the input file matches what the script expects: an `Organization` column, a `Primary industry` column, and the two question responses in the sixth and seventh columns.

In [None]:
import openai
from api_keys import OPEN_AI_API_KEYS
MODEL = 'gpt-4o-mini'
key = OPEN_AI_API_KEYS[0]
client = openai.OpenAI(api_key=key)

In [None]:
import pandas as pd
import numpy as np
import re
import csv
import os

Calls the OpenAI chat completions API, exposing parameters such as temperature, max tokens, stop sequences, tools, and log probabilities. Returns the full completion object rather than just the message text.

In [None]:
def get_completion(
    messages: list[dict[str, str]],
    model: str = "gpt-4o-mini",
    max_tokens=500,
    temperature=0,
    stop=None,
    seed=123,
    tools=None,
    logprobs=None,  # if True, return the log probability of each output token in the message content
    top_logprobs=None,  # number of most likely candidate tokens to return at each position (used with logprobs=True)
) -> "openai.types.chat.ChatCompletion":
    params = {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "stop": stop,
        "seed": seed,
        "logprobs": logprobs,
        "top_logprobs": top_logprobs,
    }
    if tools:
        params["tools"] = tools

    completion = client.chat.completions.create(**params)
    return completion

Set up the input data and the results file

In [None]:
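# NOTE: data_file (a pandas DataFrame of company responses) and result_file_name
# (path of the results CSV) are not defined in this notebook; set them before running this cell.
indices = data_file.index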

# Create results csv file
result_file = result_file_name
fields = ["Company", "Industry", "Output", "Result 1-1", "Percent 1-1", "Result 1-2", "Percent 1-2", "Result 1-3", "Percent 1-3", "Result 1-4", "Percent 1-4", "Result 1-5", "Percent 1-5"]

# prepare other parts of prompt
question_4a = data_file.columns[5]
question_4b = data_file.columns[6]

response_list_4a = data_file[question_4a]
response_list_4b = data_file[question_4b]
company_list = data_file["Organization"]
industry_list = data_file["Primary industry"]
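
The cell above assumes that `data_file` is an already-loaded pandas DataFrame and that `result_file_name` holds the output path; neither is defined in this notebook. A minimal sketch of that setup follows, with an in-memory CSV standing in for the real export. The file contents, the placeholder column names `c2` to `c4`, the response column names, and the path `results.csv` are all hypothetical; only `Organization`, `Primary industry`, and the column positions 5 and 6 come from the code above.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real responses export (assumption: a CSV file).
# The question responses must sit in the sixth and seventh columns (positions 5 and 6).
sample_csv = io.StringIO(
    "Organization,Primary industry,c2,c3,c4,Q4a response,Q4b response\n"
    "Acme Corp,Manufacturing,x,x,x,We cut Scope 1 emissions by 20%.,\n"
)

# keep_default_na=False keeps empty answers as "" instead of NaN,
# so a missing response is not rendered as the text "nan" inside the prompt
data_file = pd.read_csv(sample_csv, keep_default_na=False)
result_file_name = "results.csv"  # hypothetical output path

print(data_file.columns[5], "|", data_file.columns[6])
print(repr(data_file[data_file.columns[6]][0]))
```

Reading empty answers as empty strings is a design choice made for this sketch; the original script does not state how missing responses are stored.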

All prompt parts

In [None]:
final_prompt_start = """
Please act as an impartial judge and evaluate the quality of the following response provided by a company to the following questions. The company may answer both questions or just one.
"""

final_prompt_general = """
Your evaluation should be primarily based on the tangible action that has already been taken, and can also consider planned actions for the future. Your evaluation should also consider factors such as the specificity, clarity, completeness, and depth of their responses.
Do not allow the length of the responses to influence your evaluation. Do not allow a missing response to influence your decision, but simply ignore it and focus on the other response. Do not fall for greenwashing tactics.
Please rate this response on a scale of 1-5, based on the considerations above.
"""

final_prompt_reference_3 = """
Following is an example of a response that should receive a score of 3. Use this response as a reference for your decision. 
"""
example_response_a_3 = "To grow our business sustainably, we are cutting carbon and maximising energy efficiency. The Science Based Targets initiative (SBTi) has approved our Science Based Targets for Scopes 1, 2 in line with a 1.5-degree pathway and Scope 3 in line with a well below 2-degree pathway.For Scopes 1 and 2, this includes the reduction of greenhouse gas (GHG) emissions from our own operations, aligned with a temperature pathway to limit global warming to 1.5°C. We have committed to achieve Net Zero by 2035. For Scope 3 we are reducing our GHG emissions in the value chain by 30 per cent by 2030.To grow our business sustainably, we are cutting carbon and maximising energy efficiency. The Science Based Targets initiative (SBTi) has approved our Science Based Targets for Scopes 1, 2 in line with a 1.5-degree pathway and Scope 3 in line with a well below 2-degree pathway.For Scopes 1 and 2, these include the reduction of greenhouse gas (GHG) emissions from our own operations, aligned with a temperature pathway to limit global warming to 1.5°C and we have committed to achieve Net Zero by 2035. For Scope 3 we are reducing our GHG emissions in the value chain by 30 per cent by 2030 and this includes reducing emissions from purchased goods, upstream transport and distribution, services sold and our customers'  direct use and consumption of the products we sell."
example_response_b_3 = ""

final_prompt_reference_5 = """
Following is an example of a response that should receive a score of 5. Use this response as a reference for your decision. 
"""

example_response_a_5 = "In 2019 we achieved our original 2030 SBT 11 years early, reducing our carbon intensity by 48% since 2014. In line with our aim to lead our sector, in 2019 we became the first UK REIT to increase the ambition level of our science-based carbon reduction target, aligning it to a 1.5-degree scenario (1.5DS). This commitment is the foundation of our transition to net zero.Our updated science-based target, in line with the 1.5DS, is to reduce our absolute carbon emissions (tCO2e) by 70% by 2030 compared to a 2013/14 baseline, for property under our management for at least two years, excluding those properties which are acquired, sold or included in the development pipeline at any time within the last two years. We understand that this two-year period reflects the amount of time needed to undertake sustainability assessments and start implementing changes to the assets; once properties complete the minimum required time under our operational control, they will be included into the commitment portfolio at the start of the following reporting year. This target includes Scope 1 and 2 emissions, and Scope 3 emissions associated with downstream leased assets (gas and electricity procured by us and used by our occupiers) but excludes Scope 1 emissions associated with refrigerant gas. To develop this target, the Absolute Contraction Approach was adopted, which applies the annual emission reduction pathway aligned to a 1.5DS to the baseline emissions of the company, and the pathway is defined by a 4.2% annual linear reduction, which has been derived by the Science Based Targets initiative (SBTi). We worked with the Carbon Trust in order to calculate the emissions pathway for our SBT; the annual reduction aligned to the 1.5DS was applied to our baseline footprint, resulting in the absolute emissions pathway and reduction targets."
example_response_b_5 = ""

final_prompt_indicative = """
Here is the scale you should use to build your answer:
1. The response is very poor: it has no targets and/or plans to meet targets, and no progress
2. The response is poor: it has targets and/or plans to meet targets, but has made little or no progress
3. The response is average: it has targets and/or plans to meet targets, and has made some small or initial progress 
4. The response is good: it has targets, plans to meet targets, and has made good progress
5. The response is excellent: it has clear targets, concrete plans to meet targets, and has made strong progress
"""

final_prompt_end_explanation = """
Before giving your answer, please provide a short explanation of 40 or fewer words discussing the factors that contributed to your decision.
After giving your explanation, output your final verdict by strictly following this format of a single number 1, 2, 3, 4, or 5. It should be a single character without quotations or spaces. 
"""

final_prompt_end = """
Please do not provide an explanation. 
Output your final verdict by strictly following this format of a single number 1, 2, 3, 4, or 5. It should be a single character without quotations or spaces. 
"""
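
These parts are combined later by joining a list with spaces, and sections are switched on or off by commenting lines out. As an alternative, the sketch below gathers the same choices into flags. The helpers `wrap_answers` and `build_prompt`, their parameters, and the short placeholder strings are hypothetical; in the notebook the real prompt parts defined above would be passed in.

```python
from typing import Optional


def wrap_answers(answer_a: str, answer_b: str) -> list:
    """Wrap the two answers in the delimiters used throughout the prompt."""
    return [
        f"[The Start of Question 1 Answer] {answer_a} [The End of Question 1 Answer]",
        f"[The Start of Question 2 Answer] {answer_b} [The End of Question 2 Answer]",
    ]


def build_prompt(parts: dict, answers: tuple,
                 reference: Optional[tuple] = None,
                 use_scale: bool = False, explain: bool = False) -> str:
    """Assemble the prompt in the same order as the evaluation loop.

    parts: sections keyed by 'start', 'questions', 'general', 'scale',
           'end', and 'end_explanation'.
    reference: optional (intro text, example answer 1, example answer 2).
    """
    pieces = [parts["start"], parts["questions"], parts["general"]]
    if reference is not None:
        intro, example_a, example_b = reference
        pieces += [intro, *wrap_answers(example_a, example_b)]
    if use_scale:
        pieces.append(parts["scale"])
    # exactly one of the two closing instructions is used
    pieces.append(parts["end_explanation"] if explain else parts["end"])
    pieces += wrap_answers(*answers)
    return " ".join(pieces)


# Placeholder text only, standing in for the real prompt parts
demo_parts = {
    "start": "START", "questions": "Q1 Q2", "general": "GENERAL",
    "scale": "SCALE", "end": "END", "end_explanation": "END_EXPLANATION",
}
prompt = build_prompt(demo_parts, ("answer one", ""), reference=("REF5", "example", ""))
print(prompt)
```

With flags, each prompt configuration is recorded in a single call, which may make runs easier to compare than tracking which lines were commented out.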

CDP questionnaire questions to use in the prompt

In [None]:
question_4a_22 = "Question 1: Provide details of your absolute emissions target(s) and progress made against those targets. - Please explain target coverage and identify any exclusions"
question_4b_22 = "Question 2: Provide details of your emissions intensity target(s) and progress made against those target(s). - Please explain target coverage and identify any exclusions"

In [None]:
# if it crashes partway through, can rerun and this segment will keep track of companies already seen
# if company i is already in csv, move to next index
processed_companies = set()
try:
    with open(result_file, 'r', newline='') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            processed_companies.add(row["Company"])
except FileNotFoundError:
    # If the file doesn't exist, no companies have been processed yet
    pass


with open(result_file, 'a', newline='') as csvfile:

    # creating a csv dict writer object
    writer = csv.DictWriter(csvfile, fieldnames=fields, quoting=csv.QUOTE_ALL)

    if csvfile.tell() == 0:
        writer.writeheader()

    # get responses one at a time and get decision
    for i in indices:
        
        if company_list[i] in processed_companies:
            continue
                    
        answer_a = response_list_4a[i]
        answer_b = response_list_4b[i]

        # comment out whichever sections needed depending on prompt configuration
        full_prompt = [
                    final_prompt_start,

                    question_4a_22,
                    question_4b_22, 

                    final_prompt_general,

                    # if using explanations
                    # "Please strictly keep your explanation under 40 words.",

                    # final_prompt_reference_3,
                    # f"[The Start of Question 1 Answer] {example_response_a_3} [The End of Question 1 Answer]",
                    # f"[The Start of Question 2 Answer] {example_response_b_3} [The End of Question 2 Answer]",

                    final_prompt_reference_5,
                    f"[The Start of Question 1 Answer] {example_response_a_5} [The End of Question 1 Answer]",
                    f"[The Start of Question 2 Answer] {example_response_b_5} [The End of Question 2 Answer]",

                    # final_prompt_indicative,
                    
                    # use either end or end_explanation below
                    # final_prompt_end_explanation,
                    final_prompt_end, 

                    f"[The Start of Question 1 Answer] {answer_a} [The End of Question 1 Answer]",
                    f"[The Start of Question 2 Answer] {answer_b} [The End of Question 2 Answer]"

                    ]

        full_prompt_string = ' '.join(full_prompt)

        decision = []
        percent = []
        
        output = get_completion(
            [{"role": "user", "content": full_prompt_string}],
            model="gpt-4o-mini",
            logprobs=True,
            top_logprobs=5,
        )

        # Extract the actual output text
        generated_text = output.choices[0].message.content

        # print the word count of the output as a quick sanity check
        print(len(generated_text.split()))

        # extract the top-5 candidate tokens and their log probs at the verdict position

        # without explanation: the verdict is the first output token
        # top_logprobs = output.choices[0].logprobs.content[0].top_logprobs

        # with explanation: the verdict is the last output token
        # (this also works without an explanation when the output is a single token)
        top_logprobs = output.choices[0].logprobs.content[-1].top_logprobs

        for logprob in top_logprobs:
            decision.append(logprob.token)
            # convert the log prob to a percentage
            percent.append(np.round(np.exp(logprob.logprob) * 100, 2))

        # write new line to csv file
        line = {"Company": company_list[i], "Industry": industry_list[i], "Output": generated_text}
        for rank in range(5):
            line[f"Result 1-{rank + 1}"] = decision[rank]
            line[f"Percent 1-{rank + 1}"] = percent[rank]
        writer.writerow(line)
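
The loop above takes the candidate tokens at the verdict position and converts each log probability to a percentage with `exp(logprob) * 100`. The sketch below isolates that step so it can be tried without an API call. The helper `verdict_percentages` and the mocked response built from `SimpleNamespace` are hypothetical; the attribute path mirrors the one used in the loop, and the log prob values are made up for illustration.

```python
import math
from types import SimpleNamespace


def verdict_percentages(completion, position=-1):
    """Return (token, percent) pairs for the candidates at one output position."""
    candidates = completion.choices[0].logprobs.content[position].top_logprobs
    return [(c.token, round(math.exp(c.logprob) * 100, 2)) for c in candidates]


def mock_completion(candidates):
    """Build a stand-in object with the same attribute path as a real response."""
    top = [SimpleNamespace(token=t, logprob=lp) for t, lp in candidates]
    token_info = SimpleNamespace(top_logprobs=top)
    choice = SimpleNamespace(logprobs=SimpleNamespace(content=[token_info]))
    return SimpleNamespace(choices=[choice])


# Made-up log probs corresponding to probabilities 0.7, 0.2, and 0.1
fake = mock_completion([("4", math.log(0.7)), ("3", math.log(0.2)), ("5", math.log(0.1))])
pairs = verdict_percentages(fake)
print(pairs)
```

Passing `position=0` reads the first output token instead of the last, which matches the "without explanation" variant commented out in the loop.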


Convert log probs to weighted average between 1 and 5
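
As a worked example of the conversion: each candidate token that parses as an integer contributes its rating multiplied by its probability, and any other token is skipped. The sketch below uses made-up values and a hypothetical helper `weighted_score`; it follows the same arithmetic as the notebook's conversion cell but is not a replacement for it.

```python
def weighted_score(tokens, percents):
    """Sum rating * probability over the candidate tokens that are integers."""
    score = 0.0
    mass = 0.0  # total probability carried by the integer tokens
    for token, percent in zip(tokens, percents):
        try:
            rating = int(token)
        except ValueError:
            continue  # non-numeric candidate, e.g. a word or punctuation
        score += rating * percent / 100
        mass += percent / 100
    return score, mass


# Made-up row: 60% on "4", 30% on "3", 8% on "5", 2% on a non-numeric token
score, mass = weighted_score(["4", "3", "5", "The"], [60.0, 30.0, 8.0, 2.0])
print(round(score, 2), round(mass, 2))
```

Here the score is 4 × 0.60 + 3 × 0.30 + 5 × 0.08 = 3.7. Because non-numeric candidates are skipped, the weights can sum to less than 1 (0.98 in this row), which pulls the score slightly downward. Dividing `score` by `mass` would renormalize it; that is an optional variant and not something the script does.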

In [None]:
# helper function to cast to int
def safe_cast_to_int(s):
    try:
        return int(s)
    except ValueError:
        return None 

In [None]:

# named *_path to avoid shadowing the built-in input() and the API response stored in `output`
input_path = "sample_set_log_probs.csv"
output_path = "sample_set_averages.csv"

# get data
df_non_a = pd.read_csv(input_path)
companies_average = {}

for company_number in range(len(df_non_a)):
    average = 0
    for rating_number in range(1, 6):
        rating_column = "Result 1-" + str(rating_number)
        percent_column = "Percent 1-" + str(rating_number)

        rating = df_non_a[rating_column].iloc[company_number]
        # pandas returns NumPy integers or strings depending on the column contents
        if isinstance(rating, np.integer):
            rating = int(rating)
        if isinstance(rating, str):
            rating = safe_cast_to_int(rating)

        # candidate tokens that are not integers contribute nothing to the score
        if isinstance(rating, int):
            average += rating * (df_non_a[percent_column].iloc[company_number] / 100)

    # store the score once all five candidates have been accumulated
    companies_average[df_non_a['Company'].iloc[company_number]] = average

df_average_companies = pd.DataFrame(list(companies_average.items()), columns=['Company', 'Average'])

df_average_companies.to_csv(output_path, index=False)
