# Classification of paragraphs on climate-related tasks with ChatGPT using Scikit-LLM
## Classification of paragraphs into one of the categories recommended by the Task Force on Climate-Related Financial Disclosures (TCFD)
<hr>
<h3>To send data to ChatGPT through the Scikit-LLM classifiers, the Scikit-LLM module must first be installed and the OpenAI API keys must be set. In the following section, the Scikit-LLM library is installed and then both the API key and the corresponding organization key are set.</h3>
<h3>Because these keys are secret and give access to your OpenAI account, they should not be available to the public in plain text. We advise storing the keys in files on your computer or in cloud storage that other people cannot access, such as Google Drive, then opening those files in the Notebook and setting the keys through variables.</h3>
<h3>In our approach, the keys are stored in text files on Google Drive. The Notebook opens these files, assigns their contents to variables and uses the variables to set the keys.</h3>
<hr>
<h3>To use this script, you need to set your OpenAI keys. If you follow our approach, first store your keys in files on Google Drive; after that, only the paths to those files need to be changed in the Notebook for the script to work.</h3>
<h3>Alternative approaches include uploading your locally stored files to the Colab Notebook, using a GitHub repository or using other storage solutions.</h3>
<h3>The following link describes ways to work with files on various storage providers: <a href="https://neptune.ai/blog/google-colab-dealing-with-files">https://neptune.ai/blog/google-colab-dealing-with-files</a></h3>
<hr>
<h3>Each task has its own Colab Notebook. To get the results for a task, first set the appropriate keys in the Notebook and then run the whole Notebook, either by collapsing a section and running all of its cells at once or by running the cells one by one. The results are displayed at the end of the section. Some steps, such as saving the results to a .csv file, are optional and may be skipped.</h3>

In [None]:
#This code is for mounting your Google Drive to the Notebook. The path where you can access your whole Google Drive is /content/drive
#Alternatively, the Google Drive may be mounted by clicking the folder icon on the left side menu and then clicking the third icon from the
#left, the dark icon with a folder and the logo of Google Drive

from google.colab import drive
drive.mount('/content/drive')

In [None]:
!pip install scikit-llm

#If you use the same approach as us, with Google Drive, you need to change the paths to your relevant files where the keys are stored on Google Drive

with open('Here put the path to your OpenAI API key', 'r') as file1:
    key = file1.readline().strip()  # strip the trailing newline so it is not passed along with the key

with open('Here put the path to your OpenAI organization key', 'r') as file2:
    org_key = file2.readline().strip()



from skllm.config import SKLLMConfig

#Alternatively, you can just insert your keys as plain text in the appropriate places, but this is not advised since your keys would be visible to anyone who has access to your Notebook
#For using other approaches, please visit the link provided in the description above that instructs use and import of files from other storage solutions

SKLLMConfig.set_openai_key(key)
SKLLMConfig.set_openai_org(org_key)

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting scikit-llm
  Downloading scikit_llm-0.2.0-py3-none-any.whl (29 kB)
Collecting openai>=0.27.0 (from scikit-llm)
  Downloading openai-0.27.8-py3-none-any.whl (73 kB)
Collecting annoy>=1.17.2 (from scikit-llm)
  Downloading annoy-1.17.2.tar.gz (647 kB)
  Preparing metadata (setup.py) ... done
Collecting aiohttp (from openai>=0.27.0->scikit-llm)
  Downloading aiohttp-3.8.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.0 MB)
Collecting multidict<7.0,>=4.5 (from aiohttp->openai>=0.27.0->scikit-llm
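<h3>As a minimal sketch of the key-loading step described above, the helper below reads a key from a text file and strips the trailing newline that editors often add. The function name <code>read_key</code>, the placeholder key and the temporary file are hypothetical and serve only as an illustration; the real paths depend on where your keys are stored.</h3>

```python
import os
import tempfile


def read_key(path):
    """Return the first line of the file at `path` without surrounding whitespace."""
    with open(path, "r") as key_file:
        return key_file.readline().strip()


# Demo with a throwaway file; a real key file would live on Google Drive or similar storage.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("sk-example-not-a-real-key\n")
    demo_path = tmp.name

demo_key = read_key(demo_path)
os.remove(demo_path)  # clean up the demo file
print(repr(demo_key))
```

<h3>The returned value could then be passed to <code>SKLLMConfig.set_openai_key</code> in the same way as the <code>key</code> variable above.</h3>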

<h1>Climate Multi-label classification using TCFD Recommendations</h1>
<hr>
<h4>In this task, each paragraph is classified into one or more of the categories recommended by the TCFD for climate change-related texts. From the predicted classes, the first one is then kept so that a classification report can be produced. The multi-label classification was done mainly to evaluate how this kind of classifier from the library performs.</h4>
<h4>First, the required library, datasets, is installed and imported. The dataset is then downloaded from HuggingFace and loaded into the dataset variable.</h4>

In [None]:
!pip install datasets
from datasets import load_dataset
dataset = load_dataset("climatebert/tcfd_recommendations")

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


Downloading readme:   0%|          | 0.00/4.64k [00:00<?, ?B/s]

Downloading and preparing dataset None/None to /root/.cache/huggingface/datasets/climatebert___parquet/climatebert--tcfd_recommendations-8f7123f770abbd61/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec...


Downloading data files:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/360k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/132k [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/2 [00:00<?, ?it/s]

Generating train split:   0%|          | 0/1300 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/400 [00:00<?, ? examples/s]

Dataset parquet downloaded and prepared to /root/.cache/huggingface/datasets/climatebert___parquet/climatebert--tcfd_recommendations-8f7123f770abbd61/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

<h4>After that, the paragraphs and the labels are extracted from the test split of the dataset and loaded into a Pandas DataFrame, which makes the data easier to manipulate and to display as tables.</h4>

In [None]:
# Collect [text, label] pairs from the test split, iterating over the rows once
data = [[row['text'], row['label']] for row in dataset['test']]

print(data)

[['Sustainable strategy ‘red lines’ For our sustainable strategy range, we incorporate a series of proprietary ‘red lines’ in order to ensure the poorest- performing companies from an ESG perspective are not eligible for investment.', 2], ['Verizon’s environmental, health and safety management system provides a framework for identifying, controlling, and reducing the risks associated with the environments in which we operate. Besides regular management system assessments, internal and third-party compliance audits and inspections are performed annually at hundreds of facilities worldwide. The goal of these assessments is to identify and correct site-specific issues, and to educate and empower facility managers and supervisors to implement corrective actions. Verizon’s environment, health and safety efforts are directed and supported by experienced experts around the world that support our operations and facilities.', 3], ['In 2019, the Company closed a series of transactions related to

In [None]:
import pandas as pd
df = pd.DataFrame(data=data,columns=["text","label"])

In [None]:
df

Unnamed: 0,text,label
0,Sustainable strategy ‘red lines’ For our susta...,2
1,"Verizon’s environmental, health and safety man...",3
2,"In 2019, the Company closed a series of transa...",2
3,"In December 2020, the AUC approved the Electri...",2
4,"Finally, there is a reputational risk linked t...",2
...,...,...
395,"In 2020, Banco do Brasil Foundation celebrated...",2
396,Climate change is producing changes in weather...,2
397,A sound and certain regulatory and fiscal envi...,0
398,"Across our global workforce, 20% of Gold Field...",0
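<h4>The numeric labels in the table are easier to read next to their category names. The sketch below assumes the mapping 0 = none, 1 = metrics, 2 = strategy, 3 = risk, 4 = governance, which follows the order in which the candidate labels are given to the classifier later in this Notebook. The short names, the <code>ID_TO_NAME</code> dictionary and the toy DataFrame are illustrative only.</h4>

```python
import pandas as pd

# Assumed id-to-name mapping (mirrors the order of the candidate labels used later).
ID_TO_NAME = {0: "none", 1: "metrics", 2: "strategy", 3: "risk", 4: "governance"}

# Toy frame standing in for the real df; the texts are placeholders.
toy_df = pd.DataFrame({"text": ["a", "b", "c", "d"], "label": [2, 3, 2, 0]})
toy_df["label_name"] = toy_df["label"].map(ID_TO_NAME)

# Class distribution by name
name_counts = toy_df["label_name"].value_counts()
print(name_counts)
```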


<h4>In the following step, the multi-label zero-shot classifier is imported, the paragraphs are stored in the variable X and the correct labels in the variable Y. The candidate labels are provided to the classifier and the classification begins, with the paragraphs being sent to the model.</h4>

In [None]:
from skllm import MultiLabelZeroShotGPTClassifier

X = df['text']
Y = df['label']

candidate_labels = [
    "the provided text is not about sustainability, environment and climate change",
    "the provided text is about metrics for sustainability, environment and climate change",
    "the provided text is about strategy for sustainability, environment and climate change",
    "the provided text is about risk for sustainability, environment and climate change",
    "the provided text is about governance for sustainability, environment and climate change"
]


clf = MultiLabelZeroShotGPTClassifier(max_labels=5)
clf.fit(None, [candidate_labels])
preds = clf.predict(X)

100%|██████████| 400/400 [18:46<00:00,  2.82s/it]


<h4>The received predictions are stored in a variable and then added to the Pandas DataFrame in both textual and numerical form, so that they can be compared with the true labels and evaluated.</h4>

In [None]:
preds

[['the provided text is about strategy for sustainability, environment and climate change'],
 ['the provided text is about risk for sustainability, environment and climate change',
  'the provided text is about governance for sustainability, environment and climate change'],
 ['the provided text is not about sustainability, environment and climate change'],
 ['the provided text is not about sustainability, environment and climate change'],
 ['the provided text is about risk for sustainability, environment and climate change'],
 ['the provided text is about metrics for sustainability, environment and climate change',
  'the provided text is about strategy for sustainability, environment and climate change'],
 ['the provided text is about risk for sustainability, environment and climate change'],
 ['the provided text is not about sustainability, environment and climate change'],
 ['the provided text is about governance for sustainability, environment and climate change',
  'the provided 

In [None]:
df['gpt-explanations'] = preds

In [None]:
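# Each cell of 'gpt-explanations' holds a list of labels, so comparing the column with a plain string matches no rows (hence the empty result below)
df[df['gpt-explanations'] == "the provided text is not about sustainability, environment and climate change"]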

Unnamed: 0,text,label,gpt-explanations
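<h4>The filter above returns no rows because each cell of the <code>gpt-explanations</code> column holds a list of labels, and a list never equals a plain string. A minimal sketch of a membership-based filter, shown on a toy DataFrame with made-up rows (the names <code>NOT_CLIMATE</code>, <code>RISK</code> and <code>toy_df</code> are hypothetical):</h4>

```python
import pandas as pd

NOT_CLIMATE = "the provided text is not about sustainability, environment and climate change"
RISK = "the provided text is about risk for sustainability, environment and climate change"

# Toy stand-in for df: each cell of 'gpt-explanations' is a list of predicted labels.
toy_df = pd.DataFrame({
    "text": ["p0", "p1", "p2"],
    "gpt-explanations": [[RISK], [NOT_CLIMATE], [RISK, NOT_CLIMATE]],
})

# Comparing the list-valued column with a string matches nothing.
direct_match = toy_df[toy_df["gpt-explanations"] == NOT_CLIMATE]

# Testing membership inside each list finds the intended rows.
has_label = toy_df["gpt-explanations"].apply(lambda labels: NOT_CLIMATE in labels)
membership_match = toy_df[has_label]

print(len(direct_match), len(membership_match))
```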


In [None]:
# Convert each prediction to a numeric class id, keeping only the first predicted label.
# The id of a label is its position in candidate_labels (0 = not climate-related, ..., 4 = governance).
label_to_id = {text: idx for idx, text in enumerate(candidate_labels)}
labels = [label_to_id[pred[0]] for pred in preds]

labels

[2,
 3,
 0,
 0,
 3,
 1,
 3,
 0,
 4,
 3,
 0,
 3,
 1,
 3,
 1,
 0,
 0,
 1,
 3,
 3,
 3,
 0,
 3,
 1,
 3,
 3,
 2,
 1,
 2,
 3,
 3,
 3,
 3,
 3,
 1,
 1,
 1,
 2,
 3,
 2,
 3,
 1,
 2,
 1,
 2,
 1,
 4,
 3,
 2,
 0,
 3,
 1,
 2,
 3,
 1,
 1,
 3,
 1,
 0,
 1,
 1,
 2,
 3,
 3,
 1,
 1,
 0,
 3,
 1,
 3,
 3,
 0,
 3,
 1,
 3,
 3,
 1,
 2,
 3,
 0,
 3,
 1,
 0,
 1,
 4,
 2,
 2,
 2,
 3,
 2,
 0,
 3,
 1,
 3,
 3,
 3,
 1,
 1,
 1,
 1,
 1,
 2,
 0,
 0,
 1,
 2,
 3,
 2,
 1,
 3,
 3,
 0,
 3,
 1,
 1,
 3,
 3,
 3,
 0,
 0,
 2,
 1,
 0,
 1,
 1,
 3,
 3,
 0,
 3,
 3,
 0,
 2,
 1,
 3,
 1,
 2,
 1,
 3,
 4,
 2,
 1,
 3,
 3,
 0,
 3,
 3,
 3,
 4,
 4,
 1,
 3,
 3,
 3,
 3,
 1,
 1,
 1,
 1,
 1,
 4,
 1,
 2,
 3,
 1,
 3,
 2,
 1,
 3,
 3,
 1,
 1,
 0,
 2,
 1,
 1,
 3,
 1,
 3,
 3,
 3,
 4,
 0,
 2,
 1,
 1,
 2,
 2,
 1,
 2,
 0,
 3,
 1,
 0,
 1,
 2,
 1,
 3,
 4,
 1,
 0,
 2,
 0,
 3,
 3,
 1,
 3,
 3,
 3,
 1,
 1,
 4,
 1,
 1,
 3,
 3,
 2,
 1,
 0,
 3,
 3,
 1,
 0,
 0,
 3,
 1,
 1,
 2,
 0,
 2,
 3,
 1,
 4,
 3,
 3,
 3,
 2,
 0,
 3,
 4,
 3,
 1,
 2,
 2,
 3,
 2,
 0,
 2,
 3,
 2,
 2,


In [None]:
df['gpt-label'] = labels

In [None]:
df

Unnamed: 0,text,label,gpt-explanations,gpt-label
0,Sustainable strategy ‘red lines’ For our susta...,2,[the provided text is about strategy for susta...,2
1,"Verizon’s environmental, health and safety man...",3,[the provided text is about risk for sustainab...,3
2,"In 2019, the Company closed a series of transa...",2,[the provided text is not about sustainability...,0
3,"In December 2020, the AUC approved the Electri...",2,[the provided text is not about sustainability...,0
4,"Finally, there is a reputational risk linked t...",2,[the provided text is about risk for sustainab...,3
...,...,...,...,...
395,"In 2020, Banco do Brasil Foundation celebrated...",2,[the provided text is about strategy for susta...,2
396,Climate change is producing changes in weather...,2,[the provided text is about risk for sustainab...,3
397,A sound and certain regulatory and fiscal envi...,0,[the provided text is about risk for sustainab...,3
398,"Across our global workforce, 20% of Gold Field...",0,[the provided text is about metrics for sustai...,1


In [None]:
df['gpt-label'].value_counts()

3    138
1    119
0     66
2     62
4     15
Name: gpt-label, dtype: int64

<h4>The DataFrame is also stored on Google Drive, for later viewing and analysis. This step can be skipped.</h4>

In [None]:
df.to_csv("/content/drive/MyDrive/DS-Environment-Project/ChatGPT Results/scikit-llm/chatgpt_climate_tcfd_recommendations.csv",index=False)
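
<h4>One point to keep in mind when the file is read back: a CSV file stores the list-valued <code>gpt-explanations</code> column as text. A minimal sketch of restoring the lists with <code>ast.literal_eval</code> is shown below; it uses an in-memory buffer and a toy DataFrame with made-up values instead of the Google Drive path.</h4>

```python
import ast
import io

import pandas as pd

# Toy stand-in for df with a list-valued column.
toy_df = pd.DataFrame({
    "text": ["p0", "p1"],
    "gpt-explanations": [["label a"], ["label b", "label c"]],
    "gpt-label": [2, 3],
})

# Write to an in-memory buffer instead of a file on Google Drive.
buffer = io.StringIO()
toy_df.to_csv(buffer, index=False)
buffer.seek(0)

# Convert the stringified lists back into Python lists while reading.
restored = pd.read_csv(buffer, converters={"gpt-explanations": ast.literal_eval})
print(type(restored.loc[1, "gpt-explanations"]))
```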

<h4>In the following section, the predicted labels are compared to the actual labels and the results are displayed.</h4>
<hr>
<h4>In the first row of the output, four values are displayed in the following order: (precision, recall, F-score, support). Because macro averaging is used, support is None.</h4>
<h4>In the second row, only the F1 score is displayed, for clarity.</h4>
<h4>In the third row, the confusion matrix is displayed, with the true classes as rows and the predicted classes as columns.</h4>
<h4>In the fourth row, the whole classification report is displayed: precision, recall, F1 score and support for each class, followed by the overall accuracy and the macro and weighted averages of each metric.</h4>

In [None]:
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    f1_score,
    precision_recall_fscore_support,
)

# Column holding the predicted labels
sent_col = 'gpt-label'

# Macro-averaged precision, recall and F1 score of the predictions against the true labels
print(precision_recall_fscore_support(df['label'], df[sent_col], average='macro'))

# F1 score only
print(f1_score(df['label'], df[sent_col], average='macro'))

# Confusion matrix (rows: true labels, columns: predicted labels)
print(confusion_matrix(df['label'], df[sent_col]))

# Performance report
print(classification_report(df['label'], df[sent_col]))

(0.469885351565081, 0.4836182454239017, 0.4063071635396307, None)
0.4063071635396307
[[42  5  4 24  5]
 [ 3 38  6  2  0]
 [20 59 44 72  2]
 [ 1 13  3 30  1]
 [ 0  4  5 10  7]]
              precision    recall  f1-score   support

           0       0.64      0.53      0.58        80
           1       0.32      0.78      0.45        49
           2       0.71      0.22      0.34       197
           3       0.22      0.62      0.32        48
           4       0.47      0.27      0.34        26

    accuracy                           0.40       400
   macro avg       0.47      0.48      0.41       400
weighted avg       0.57      0.40      0.40       400
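
<h4>The macro-averaged values can be reproduced by hand from the confusion matrix printed above, which may help in reading the report. The sketch below copies that matrix and recomputes the metrics with NumPy; the variable names are illustrative.</h4>

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true class, columns: predicted class).
cm = np.array([
    [42,  5,  4, 24,  5],
    [ 3, 38,  6,  2,  0],
    [20, 59, 44, 72,  2],
    [ 1, 13,  3, 30,  1],
    [ 0,  4,  5, 10,  7],
])

true_positives = np.diag(cm)
precision = true_positives / cm.sum(axis=0)  # correct predictions / all predictions of that class
recall = true_positives / cm.sum(axis=1)     # correct predictions / all true members of that class
f1 = 2 * precision * recall / (precision + recall)

# Macro averages weight every class equally, regardless of its support.
macro_precision, macro_recall, macro_f1 = precision.mean(), recall.mean(), f1.mean()
accuracy = true_positives.sum() / cm.sum()
print(macro_precision, macro_recall, macro_f1, accuracy)
```

<h4>The printed values agree with the first row of the output and with the accuracy of 0.40 in the report. The per-class arrays also show that class 2, the largest class with 197 paragraphs, has a recall of only 0.22.</h4>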

