# Climate-related classification tasks on ChatGPT - using techniques for better prompt engineering
<hr>
<h3>This Notebook applies several prompt-engineering techniques to improve results, including prompt chaining and the following prompt patterns: <i>"The Persona Pattern"</i>, which makes the model take a certain point of view or role (in our case, a climate, sustainability, and environmental expert); <i>“The Fact Check List Pattern”</i>, which instructs the model to output the most important points of a text and then uses those points as the input of a follow-up prompt; and the <i>“Reflection Pattern”</i>, in which the model is asked to explain the reasoning behind its response.</h3>
<h3>For that purpose, this script defines a function that connects to the OpenAI API using the API key and sends the data and the provided prompts to ChatGPT in batches of a chosen size. Additionally, a function that chains prompts was created: it receives the response to the first prompt and forwards it as input to the second prompt.</h3>
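As a rough illustration of the batching idea, the sketch below groups texts into numbered batches and prepends the prompt to each batch. The helper name `build_batched_prompts` is hypothetical and is not one of this Notebook's functions; no API call is made.

```python
def build_batched_prompts(prompt, texts, batch_size):
    """Yield one prompt string per batch of up to `batch_size` texts (hypothetical helper)."""
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        if batch_size > 1:
            # Number each text so the model can refer to the items individually.
            body = "".join(f"{start + k}. {t}\n\n" for k, t in enumerate(batch))
        else:
            # A single text is sent as-is, without numbering.
            body = batch[0]
        yield prompt + body


prompts = list(build_batched_prompts("Classify: ", ["a", "b", "c"], 2))
for p in prompts:
    print(repr(p))
```

With three texts and a batch size of 2, this produces two prompts: one with two numbered texts and one with the remaining text.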
<h3>In order to be able to run the scripts and the tasks, first the OpenAI key needs to be set.</h3>
<h3>Because this key is a secret and gives access to your OpenAI account, it should be hidden and not available in plain text to the public. It is advisable to store such keys in files on your computer or in cloud storage, such as Google Drive, where other people cannot access them, and then to read them in the Notebook and set the keys via variables. That way they stay protected from the public.</h3>
<h3>In our approach, we used text files on Google Drive to store the key and we open them in the Notebook, set the appropriate variable and then use the variable to set the key.</h3>
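A minimal sketch of this key-handling idea, assuming the key sits alone in a text file. The helper name `load_api_key` and the fallback to an environment variable named `OPENAI_API_KEY` are illustrative assumptions, not part of the original Notebook.

```python
import os
from pathlib import Path


def load_api_key(path=None, env_var="OPENAI_API_KEY"):
    """Return the key from `path` if given, otherwise from the environment variable."""
    if path is not None:
        # strip() removes the trailing newline that most editors add to the file
        return Path(path).read_text(encoding="utf-8").strip()
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"No key file given and {env_var} is not set")
    return key
```

The returned string could then be assigned to `openai.api_key`, exactly as the code cell below does with the value read from the file.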
<hr>
<h3>To use this script, you need to set your OpenAI key. If you use the same approach as us, store your key in a file on Google Drive and then change only the path to that file in the code below; the script will then work.</h3>
<h3>Alternative approaches include uploading your locally stored files to the Colab Notebook, using a GitHub repository or using alternative storage solutions.</h3>
<h3>The following link describes ways to work with your files on various storage providers: <a href="https://neptune.ai/blog/google-colab-dealing-with-files">https://neptune.ai/blog/google-colab-dealing-with-files</a></h3>
<hr>
<h3>Each task is structured in its own Colab Notebook. To get the results for a task, first set the appropriate keys in the Notebook; after that, the whole Notebook can simply be run, either by collapsing a section and running all of its cells at once or by running each cell one by one, and the results are displayed at the end of the section. Some steps are optional, for example saving the results in a .csv file, and may be skipped.</h3>

In [None]:
#With the following commands, your Google Drive gets mounted to the Notebook at /content/drive

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!pip install openai
!pip install tiktoken
import openai
import tiktoken
encoding = tiktoken.get_encoding("gpt2")
with open('Here put the path to your OpenAI API key', 'r') as file:
    key = file.readline().strip()  # strip the trailing newline so it does not end up in the key

#Alternatively, you can just insert your keys as plain text in the appropriate places, but this is not advised since your keys would be visible to anyone who has access to your Notebook
#For using other approaches, please visit the link provided in the description above that instructs use and import of files from other storage solutions

openai.api_key = key


In [None]:
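import json
import time

import numpy as np  # needed for np.array_split below


def batch_gpt(prompt,target_texts, batch_size):
    l = len(target_texts)
    # number of chunks so that each chunk holds at most batch_size texts
    size = max(1, (l + batch_size - 1) // batch_size)
    text_list = np.array_split(target_texts, size)
    print(f"Total records {l}, number of chunks = {size}")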
    rez_keys = []
    rez_vals = []
    rez = []
    for i, texts in enumerate(text_list):
        text = "\n".join(texts)
        p = prompt + text
        print("prompt", p)
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": p}]
        )
        r = response["choices"][0]["message"]["content"]
        print("result", r)
        rez.append(r)
        try:
            dictData= json.loads(r)
            keys = list(dictData.keys())
            values = list(dictData.values())
            rez_keys += keys
            rez_vals += values
        except:
            print("error parsing"+r)
        print(f"i={i + 1}: shape:{len(rez_keys)}")
    return rez_keys, rez_vals, rez
def batch_gpt_len(prompt,target_texts, batch_size):
    l = len(target_texts)
    rez_keys = []
    rez_vals = []
    rez = []
    i =0
    while i < l:
        text = ""
        for j in range(batch_size):
            if i<l:
                if batch_size > 1:
                    text += f"{i}. "+target_texts[i]+"\n\n"
                else:
                    text += target_texts[i]
            i += 1
        p = prompt + text
        #print(f"Prompt ({len(p)}):", p)
        # print(i, len(p))

        # try the API call; if it fails, wait 10 seconds and retry (up to 6 attempts)
        for j in range(6):
            try:
                response = openai.ChatCompletion.create(
                    model="gpt-3.5-turbo",
                    messages=[{"role": "user", "content": p}]
                )
                break
            except:
                print("error calling API, retrying...")
                time.sleep(10)

        r = response["choices"][0]["message"]["content"]
        rez.append(r)
        # convert r to int
        try:
            b = int(r[:1])
        except:
            b = -1
            print("error parsing"+r)
        print(i, len(p), "-", r, b, end=": ")
        # if i % 10 print
        if i % 10 == 0:
            print()
        rez_keys.append(b)
    return rez,rez_keys


def batch_gpt_chained(prompt,chained_prompt,target_texts, batch_size):
    l = len(target_texts)
    rez_keys = []
    rez_vals = []
    rez = []
    i =0
    while i < l:
        text = ""
        for j in range(batch_size):
            if i<l:
                if batch_size > 1:
                    text += f"{i}. "+target_texts[i]+"\n\n"
                else:
                    text += target_texts[i]
            i += 1
        p = prompt + text
        #print(f"Prompt ({len(p)}):", p)
        # print(i, len(p))

        # try the API call; if it fails, wait 10 seconds and retry (up to 6 attempts)
        for j in range(6):
            try:
                response = openai.ChatCompletion.create(
                    model="gpt-3.5-turbo",
                    messages=[{"role": "user", "content": p}]
                )
                break
            except:
                print("error calling API, retrying...")
                time.sleep(10)

        r1 = response["choices"][0]["message"]["content"]
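        # second call: retry like the first one, so that a failed call
        # does not silently reuse the response of the first call
        for j in range(6):
            try:
                response = openai.ChatCompletion.create(
                    model="gpt-3.5-turbo",
                    messages=[{"role": "user", "content": chained_prompt + text}]
                )
                break
            except:
                print("error calling API, retrying...")
                time.sleep(10)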

        r = response["choices"][0]["message"]["content"]
        rez.append(r)
        # convert r to int
        try:
            b = int(r[:1])
        except:
            b = -1
            print("error parsing"+r)
        print(i, len(p), "-", r, b, end=": ")
        # if i % 10 print
        if i % 10 == 0:
            print()
        rez_keys.append(b)
    return rez,rez_keys



def batch_gpt_resp_as_part_prompt(prompt,chained_prompt,target_texts, batch_size):
    l = len(target_texts)
    rez_keys = []
    rez_vals = []
    rez = []
    summarized_points = []
    i =0
    while i < l:
        text = ""
        for j in range(batch_size):
            if i<l:
                if batch_size > 1:
                    text += f"{i}. "+target_texts[i]+"\n\n"
                else:
                    text += target_texts[i]
            i += 1
        p = prompt + text
        #print(f"Prompt ({len(p)}):", p)
        # print(i, len(p))

        # try the API call; if it fails, wait 10 seconds and retry (up to 6 attempts)
        for j in range(6):
            try:
                response = openai.ChatCompletion.create(
                    model="gpt-3.5-turbo",
                    messages=[{"role": "user", "content": p}]
                )
                break
            except:
                print("error calling API, retrying...")
                time.sleep(10)

        r1 = response["choices"][0]["message"]["content"]
        summarized_points.append(r1)
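        # second call: retry like the first one, so that a failed call
        # does not silently reuse the summary (r1) as the classification result
        for j in range(6):
            try:
                response = openai.ChatCompletion.create(
                    model="gpt-3.5-turbo",
                    messages=[{"role": "user", "content": chained_prompt + r1}]
                )
                break
            except:
                print("error calling API, retrying...")
                time.sleep(10)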

        r = response["choices"][0]["message"]["content"]
        rez.append(r)
        # convert r to int
        try:
            b = int(r[:1])
        except:
            b = -1
            print("error parsing"+r)
        print(i, len(p), "-", r, b, end=": ")
        # if i % 10 print
        if i % 10 == 0:
            print()
        rez_keys.append(b)
    return rez,rez_keys, summarized_points




def batch_gpt_verify_cot_prompts(start_prompt,end_prompt,target_texts, batch_size):
    l = len(target_texts)
    rez_keys = []
    rez_vals = []
    rez = []
    i =0
    while i < l:
        text = ""
        for j in range(batch_size):
            if i<l:
                if batch_size > 1:
                    text += f"{i}. "+target_texts[i]+"\n\n"
                else:
                    text += target_texts[i]
            i += 1
        p = start_prompt + text + end_prompt
        #print(f"Prompt ({len(p)}):", p)
        # print(i, len(p))

        # try the API call; if it fails, wait 10 seconds and retry (up to 6 attempts)
        for j in range(6):
            try:
                response = openai.ChatCompletion.create(
                    model="gpt-3.5-turbo",
                    messages=[{"role": "user", "content": p}]
                )
                break
            except:
                print("error calling API, retrying...")
                time.sleep(10)

        r = response["choices"][0]["message"]["content"]
        rez.append(r)
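        # take the last quoted part of the answer and map "yes..." to 1, anything else to 0
        try:
            parts = r.split("\"")
            # use a separate name here: reusing i would overwrite the loop counter
            last = len(parts)-2
            b = parts[last]
            b = b.lower()
            if b.startswith("yes"):
                b = 1
            else:
                b = 0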
        except:
            b = -1
            print("error parsing"+r)
        # print(i, len(p), "-", r, b, end=": ")
        print(f"{i}: | response: {r} | result: {b}\n\n")
        # if i % 10 print
        if i % 10 == 0:
            print()
        rez_keys.append(b)
    return rez,rez_keys
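
The retry loop that is repeated in the functions above could be factored into one helper. The following is a sketch under assumptions: the name `call_with_retries` and its parameters are hypothetical, and the helper raises an error after the last failed attempt instead of continuing with an undefined or stale response.

```python
import time


def call_with_retries(fn, max_attempts=6, wait_seconds=10, sleep=time.sleep):
    """Call `fn()` until it succeeds or `max_attempts` attempts have failed."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as err:  # in practice, narrow this to the API's error types
            last_error = err
            print(f"error calling API (attempt {attempt}/{max_attempts})")
            if attempt < max_attempts:
                sleep(wait_seconds)
    # Fail loudly so the caller never continues with a missing or stale response.
    raise RuntimeError("all attempts failed") from last_error
```

An API call would be wrapped in a function without arguments, for example a `lambda` around the `openai.ChatCompletion.create(...)` call used above, and passed as `fn`.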

<h1>Climate Classification on commitments and actions</h1>
<h4>In the Climate Commitments and Actions task, paragraphs are classified according to whether or not they talk about climate commitments and actions.</h4>
<hr>
<h4>Classification classes:</h4>
<h4>0 - paragraph is not about climate commitments and actions</h4>
<h4>1 - paragraph is about climate commitments and actions</h4>
<hr>
<h4>First, the required library, datasets, is installed and imported in order to work with the dataset; the corresponding dataset is then downloaded from HuggingFace and loaded into the dataset variable.</h4>


In [None]:
!pip install datasets
from datasets import load_dataset

dataset = load_dataset("climatebert/climate_commitments_actions")


In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 320
    })
})

In [None]:
dataset['test']

Dataset({
    features: ['text', 'label'],
    num_rows: 320
})

<h4>After that, the paragraphs and the labels are extracted from the dataset and loaded into a Pandas DataFrame, which makes the data easier to manipulate and to view as tables.</h4>

In [None]:
data = []
# read each column once and pair every text with its label
for text, label in zip(dataset['test']['text'], dataset['test']['label']):
  data.append([text, label])

print(data)

[['Sustainable strategy ‘red lines’ For our sustainable strategy range, we incorporate a series of proprietary ‘red lines’ in order to ensure the poorest- performing companies from an ESG perspective are not eligible for investment.', 1], ['Verizon’s environmental, health and safety management system provides a framework for identifying, controlling, and reducing the risks associated with the environments in which we operate. Besides regular management system assessments, internal and third-party compliance audits and inspections are performed annually at hundreds of facilities worldwide. The goal of these assessments is to identify and correct site-specific issues, and to educate and empower facility managers and supervisors to implement corrective actions. Verizon’s environment, health and safety efforts are directed and supported by experienced experts around the world that support our operations and facilities.', 0], ['In 2019, the Company closed a series of transactions related to

In [None]:
import pandas as pd
df = pd.DataFrame(data=data,columns=["text","label"])

In [None]:
df

Unnamed: 0,text,label
0,Sustainable strategy ‘red lines’ For our susta...,1
1,"Verizon’s environmental, health and safety man...",0
2,"In 2019, the Company closed a series of transa...",1
3,"In December 2020, the AUC approved the Electri...",0
4,"Finally, there is a reputational risk linked t...",0
...,...,...
315,Indirect emissions result from operational act...,0
316,"All data in this TCFD report is as of, or for ...",0
317,Outcome: The bank explained that it would be w...,1
318,"In 2020, Banco do Brasil Foundation celebrated...",1


<h4>The initial prompt that was sent to ChatGPT was the following: "You are the sustainability, environment, and climate change expert. Read the following paragraph and extract the most important points from the text and return only the points and their explanations:"</h4>
<h4>After the response was received, it was used as the input for the second prompt, in which ChatGPT is instructed to perform the classification. The chained prompt was the following: "Read the following points and answer only with one number that is the overall class of all points summarized without any explanations. Answer only with 0 if the points are not about climate commitments and actions, and answer only with 1 if the points are about climate commitments and actions:"</h4>
<hr>
<h4>The results were received both in a numerical (categorical) and in a textual representation.</h4>
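
The two-step chain described above can be sketched independently of the API. In this sketch, `classify_via_summary` and `fake_model` are hypothetical names, the short prompts are placeholders, and `fake_model` returns canned replies in place of a real ChatGPT call.

```python
def classify_via_summary(call_model, summarize_prompt, classify_prompt, text):
    """Two-step chain: summarize `text`, then classify the summary."""
    summary = call_model(summarize_prompt + text)
    answer = call_model(classify_prompt + summary)
    return summary, answer


def fake_model(prompt):
    # Stand-in for the real API call; the replies are made up for this demo.
    if prompt.startswith("Extract"):
        return "1. The company commits to a reduction target."
    return "1"


summary, answer = classify_via_summary(
    fake_model, "Extract the key points: ", "Classify these points: ", "Some paragraph."
)
print(summary)
print(answer)
```

Passing the model call in as a function keeps the chaining logic testable without network access.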

In [None]:
prompt = 'You are the sustainability, environment, and climate change expert. Read the following paragraph and extract the most important points from the text and return only the points and their explanations: \n\n'
chain_prompt = 'Read the following points and answer only with one number that is the overall class of all points summarized without any explanations. Answer only with 0 if the points are not about climate commitments and actions, and answer only with 1 if the points are about climate commitments and actions: \n\n'
texts = df["text"].to_list()
rez, rez_keys, summarized_points = batch_gpt_resp_as_part_prompt(prompt,chain_prompt, texts, 1)

1 431 - 1 1: 2 889 - 0 0: 3 818 - 0 0: 4 1131 - 0 0: 5 621 - 1 1: 6 590 - 1 1: 7 587 - 0 0: 8 731 - 1 1: 9 506 - 1 1: 10 462 - 1 1: 
11 474 - 1 1: 12 428 - 1 1: 13 487 - 0 0: 14 648 - 1 1: 15 571 - 1 1: 16 402 - 1 1: 17 878 - 0 0: 18 637 - 0 0: 19 913 - 1 1: 20 858 - 1 1: 
21 813 - 0 0: 22 564 - 1 1: 23 513 - 1 1: 24 453 - 1 1: 25 684 - 1 1: 26 816 - 1 1: 27 814 - 1 1: 28 411 - 1 1: 29 510 - 1 1: 30 705 - 1 1: 
31 869 - 1 1: 32 1039 - 1 1: 33 794 - 1. 1: 34 536 - 1 1: 35 709 - 1 1: 36 526 - 1 1: 37 626 - 1 1: 38 504 - 1 1: 39 781 - 1 1: 40 660 - 1 1: 
41 452 - 1 1: 42 499 - 1 1: 43 759 - 1 1: 44 756 - 1 1: 45 470 - 0 0: 46 431 - 1 1: 47 420 - 1 1: 48 657 - 1 1: 49 526 - 1 1: error calling API, retrying...
error parsing- Sustainability issues in the tire industry include sustainable natural rubber and microplastics.
- The sustainability of the natural rubber supply chain is associated with human rights, environmental protection, transparent management, productivity, quality and quality 

<h4>The received predictions are stored in variables and then added to the Pandas DataFrame in both numerical and textual form, so that they can be compared and evaluated.</h4>
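
The functions above convert a response to a number with `int(r[:1])`, which looks only at the first character. A slightly stricter variant is sketched below as an assumption, not as the Notebook's method: the hypothetical helper `parse_label` accepts only a leading 0 or 1 that stands alone and returns -1 otherwise.

```python
import re

# A leading 0 or 1 (optionally after whitespace) that is not part of a longer number.
_LABEL_RE = re.compile(r"^\s*([01])\b")


def parse_label(response):
    """Return 0 or 1 if the response starts with that digit alone, else -1."""
    match = _LABEL_RE.match(response)
    return int(match.group(1)) if match else -1


for r in ["1", "1.", " 0\n", "- Sustainability issues ...", "10"]:
    print(repr(r), parse_label(r))
```

Unlike `int(r[:1])`, this rejects answers such as "10" or "2" instead of mapping them to a digit.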

In [None]:
rez

['1',
 '0',
 '0',
 '0',
 '1',
 '1',
 '0',
 '1',
 '1',
 '1',
 '1',
 '1',
 '0',
 '1',
 '1',
 '1',
 '0',
 '0',
 '1',
 '1',
 '0',
 '1',
 '1',
 '1',
 '1',
 '1',
 '1',
 '1',
 '1',
 '1',
 '1',
 '1',
 '1.',
 '1',
 '1',
 '1',
 '1',
 '1',
 '1',
 '1',
 '1',
 '1',
 '1',
 '1',
 '0',
 '1',
 '1',
 '1',
 '1',
 "- Sustainability issues in the tire industry include sustainable natural rubber and microplastics.\n- The sustainability of the natural rubber supply chain is associated with human rights, environmental protection, transparent management, productivity, quality and quality of life.\n- Hankook Tire & Technology is addressing these issues through the Global Platform for Sustainable Natural Rubber GPSNR.\n- Hankook Tire & Technology is also researching microplastics for tire wear particles.\n- LG Chem is a major supplier of Hankook Tire & Technology and its sustainability efforts will impact the company's responsible sourcing policy and products.\n- Collaboration among companies or industries is ne

In [None]:
rez_keys

[1,
 0,
 0,
 0,
 1,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 0,
 0,
 1,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 1,
 -1,
 1,
 0,
 0,
 1,
 1,
 1,
 1,
 0,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 0,
 0,
 1,
 1,
 0,
 0,
 1,
 0,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 0,
 0,
 0,
 1,
 1,
 1,
 1,
 0,
 0,
 0,
 1,
 1,
 1,
 1,
 -1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 0,
 0,
 0,
 0,
 1,
 1,
 0,
 0,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 0,
 0,
 1,
 0,
 0,
 1,
 1,
 0,
 1,
 1,
 1,
 0,
 0,
 1,
 1,
 0,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1

In [None]:
df['gpt-label'] = rez_keys

In [None]:
df['gpt-explanations'] = rez

In [None]:
df['gpt-summarized'] = summarized_points

In [None]:
df

Unnamed: 0,text,label,gpt-label,gpt-explanations,gpt-summarized
0,Sustainable strategy ‘red lines’ For our susta...,1,1,1,Points: \n1. Sustainable strategy range \n2. P...
1,"Verizon’s environmental, health and safety man...",0,0,0,"- Verizon has an environmental, health, and sa..."
2,"In 2019, the Company closed a series of transa...",1,0,0,Points:\n1. The Company sold its Canadian foss...
3,"In December 2020, the AUC approved the Electri...",0,0,0,Points:\n1. AUC approved deferral of compulsor...
4,"Finally, there is a reputational risk linked t...",0,1,1,Points: \n1. Reputational risk for oil compani...
...,...,...,...,...,...
315,Indirect emissions result from operational act...,0,1,1,1. Indirect emissions come from activities not...
316,"All data in this TCFD report is as of, or for ...",0,0,0,"1. TCFD report data is up until December 31, 2..."
317,Outcome: The bank explained that it would be w...,1,1,1,Points: \n1. The bank is winding down its foss...
318,"In 2020, Banco do Brasil Foundation celebrated...",1,0,0,Points: \n1. Banco do Brasil Foundation celebr...


<h4>Labels that could not be parsed automatically are marked with -1. They were mapped manually by inspecting their textual counterparts; in the cells below, the four affected rows are listed and then assigned to class 1.</h4>

In [None]:
df['gpt-label'].value_counts()

 1    243
 0     73
-1      4
Name: gpt-label, dtype: int64

In [None]:
df[df['gpt-label']==-1]

Unnamed: 0,text,label,gpt-label,gpt-explanations,gpt-summarized
49,What are the latest sustainability issues in y...,1,-1,- Sustainability issues in the tire industry i...,- Sustainability issues in the tire industry i...
159,Innovation and Digital Two enablers will set u...,0,-1,- Two enablers for success: deep and broad inn...,- Two enablers for success: deep and broad inn...
292,In addition to capital investments in the Regu...,1,-1,- Canadian Utilities plans to invest in:\n\n1....,- Canadian Utilities plans to invest in:\n\n1....
294,We also anticipate that the potential effects ...,0,-1,- Climate change effects will impact operation...,- Climate change effects will impact operation...


In [None]:
df['gpt-label'] = df['gpt-label'].replace(-1,1)

In [None]:
df['gpt-label'].value_counts()

1    247
0     73
Name: gpt-label, dtype: int64

<h4>The DataFrame is also stored on Google Drive, for later viewing and analysis. This step can be skipped.</h4>

In [None]:
df.to_csv("/content/drive/MyDrive/DS-Environment-Project/ChatGPT Results/chatgpt_climate_commitments_and_actions_chained_prompts.csv",index=False)

<h4>In the following section, the predicted labels are compared to the actual labels and the results are displayed.</h4>
<hr>
<h4>In the first row of the output, the macro-averaged metrics are displayed as a tuple in the following order: <h6>(precision, recall, fscore, support - optional, may be None)</h6></h4>
<h4>In the second row, only the F1 Score is displayed, for better clarity.</h4>
<h4>In the third row the confusion matrix is displayed.</h4>
<h4>In the fourth row the full classification report is displayed, with the per-class precision, recall, F1 score and support, the overall accuracy, and the macro and weighted averages of each metric.</h4>

In [None]:
# calculate precision, recall and F1 score for the df columns label and gpt-label
from sklearn.metrics import precision_recall_fscore_support,f1_score
sent_col = 'gpt-label'
print(precision_recall_fscore_support(df['label'], df[sent_col], average='macro'))

# f1 score only
print(f1_score(df['label'], df[sent_col], average='macro'))

# confusion matrix
from sklearn.metrics import confusion_matrix
print(confusion_matrix(df['label'], df[sent_col]))

# performance report

from sklearn.metrics import classification_report
print(classification_report(df['label'], df[sent_col]))

(0.53865564860518, 0.5320371391799963, 0.41832473593711617, None)
0.41832473593711617
[[ 55 167]
 [ 18  80]]
              precision    recall  f1-score   support

           0       0.75      0.25      0.37       222
           1       0.32      0.82      0.46        98

    accuracy                           0.42       320
   macro avg       0.54      0.53      0.42       320
weighted avg       0.62      0.42      0.40       320
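
To make the link between the confusion matrix and the reported scores explicit, the following sketch recomputes the metrics by hand from the matrix printed above. It assumes scikit-learn's convention that rows are true labels and columns are predicted labels; the helper name `per_class_scores` is hypothetical.

```python
# Confusion matrix from the output above: rows = true labels, columns = predictions.
cm = [[55, 167],
      [18, 80]]


def per_class_scores(cm, k):
    """Return (precision, recall, f1) for class k."""
    tp = cm[k][k]
    predicted_k = sum(row[k] for row in cm)  # column sum: everything predicted as k
    actual_k = sum(cm[k])                    # row sum: everything that truly is k
    precision = tp / predicted_k
    recall = tp / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


scores = [per_class_scores(cm, k) for k in range(len(cm))]
macro_precision = sum(s[0] for s in scores) / len(scores)
macro_recall = sum(s[1] for s in scores) / len(scores)
macro_f1 = sum(s[2] for s in scores) / len(scores)
accuracy = sum(cm[k][k] for k in range(len(cm))) / sum(map(sum, cm))

print(scores)
print(macro_precision, macro_recall, macro_f1, accuracy)
```

The macro averages are the unweighted means of the per-class values, which is why the macro F1 score lies between the two per-class F1 scores regardless of the class sizes.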

