# Climate-related classification tasks on ClimateBert - the current State-of-the-art Model for climate-related NLP tasks
<hr>
<h3>This notebook runs climate-related classification tasks with ClimateBert, the current state-of-the-art model for climate-related NLP tasks.</h3>
<h3>Each task has its own Colab notebook. To get the results for a task, run the whole notebook, either by collapsing the section and running all of its cells at once or by running the cells one by one. The results are displayed at the end of the section.</h3>


<h1>Running the Climate Detection Task on ClimateBert</h1>
<h4>In the Climate Detection task, each paragraph is classified as climate-related or not.</h4>
<hr>
<h4>Classification classes:</h4>
<h4>0 - is not climate change-related</h4>
<h4>1 - is climate change-related</h4>
<hr>
<h4>First, the required libraries, transformers and datasets, are installed. The dataset is then downloaded from HuggingFace and loaded into the climate_detection_dataset variable.</h4>

In [None]:
!pip install transformers==4.28.0
!pip install datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers==4.28.0
  Downloading transformers-4.28.0-py3-none-any.whl (7.0 MB)
Collecting huggingface-hub<1.0,>=0.11.0 (from transformers==4.28.0)
  Downloading huggingface_hub-0.15.1-py3-none-any.whl (236 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers==4.28.0)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.15.1 tokenizers-0.13.3 transfor

In [None]:
from datasets import load_dataset
climate_detection_dataset = load_dataset("climatebert/climate_detection")


Downloading and preparing dataset None/None to /root/.cache/huggingface/datasets/climatebert___parquet/climatebert--climate_detection-eefa7af3c8031d26/0.0.0/14a00e99c0d15a23649d0db8944380ac81082d4b021f398733dd84f3a6c569a7...



Generating train split:   0%|          | 0/1300 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/400 [00:00<?, ? examples/s]

Dataset parquet downloaded and prepared to /root/.cache/huggingface/datasets/climatebert___parquet/climatebert--climate_detection-eefa7af3c8031d26/0.0.0/14a00e99c0d15a23649d0db8944380ac81082d4b021f398733dd84f3a6c569a7. Subsequent calls will reuse this data.



<h4>Next, the ClimateBert model fine-tuned for this task is loaded, along with its tokenizer.</h4>

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

climate_change_tokenizer = AutoTokenizer.from_pretrained("climatebert/distilroberta-base-climate-detector")

climate_change_model = AutoModelForSequenceClassification.from_pretrained("climatebert/distilroberta-base-climate-detector")


In [None]:
# Tokenize every split; truncation=True cuts inputs that exceed the model's maximum length.
climate_detection_encoded_dataset = climate_detection_dataset.map(
    lambda t: climate_change_tokenizer(t['text'], truncation=True),
    batched=True,
    load_from_cache_file=False,
)

Map:   0%|          | 0/1300 [00:00<?, ? examples/s]

Map:   0%|          | 0/400 [00:00<?, ? examples/s]
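
<h4>With batched=True, the mapped function receives a dictionary of columns (each a list of values) rather than a single example, and it returns new columns of the same length. The sketch below imitates that calling convention in plain Python. The names toy_tokenize and toy_batched_map are hypothetical stand-ins invented for illustration; they are not the real tokenizer or the datasets implementation, and whitespace splitting is only a placeholder for real subword tokenization.</h4>

```python
def toy_tokenize(batch, max_length=5):
    """Hypothetical stand-in for the tokenizer: split on whitespace and
    keep at most max_length tokens (mimicking truncation=True)."""
    return {"tokens": [text.split()[:max_length] for text in batch["text"]]}


def toy_batched_map(columns, fn, batch_size=2):
    """Hypothetical stand-in for a batched map: call fn on slices of the
    columns and collect the new columns it returns."""
    n_rows = len(columns["text"])
    new_columns = {}
    for start in range(0, n_rows, batch_size):
        # Each batch is a dict of lists, one list per column.
        batch = {name: values[start:start + batch_size]
                 for name, values in columns.items()}
        for name, values in fn(batch).items():
            new_columns.setdefault(name, []).extend(values)
    # Original columns are kept; returned columns are added alongside them.
    return {**columns, **new_columns}


# Made-up rows for illustration; not taken from the climate_detection dataset.
toy_data = {
    "text": ["a b c d e f g", "x y", "climate risk"],
    "label": [1, 0, 1],
}
toy_encoded = toy_batched_map(toy_data, toy_tokenize)
print(toy_encoded["tokens"])
```

<h4>The first row is cut to five tokens while the shorter rows pass through unchanged, which is the same effect truncation has on paragraphs longer than the model's maximum input length.</h4>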

<h4>In the following steps, the TrainingArguments and Trainer classes are imported and a Trainer object is configured around the fine-tuned model so that it can run the predictions. The encoded test split is then passed to the model.</h4>

In [None]:
from transformers import TrainingArguments, Trainer

In [None]:
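climate_change_arg = TrainingArguments(
    "label",  # output_dir: directory where the Trainer writes its files
    # The training hyperparameters below are not used by predict(); no training is run here.
    learning_rate=5e-5,
    num_train_epochs=4,
    per_device_eval_batch_size=32,  # batch size used during prediction
    per_device_train_batch_size=32,
    seed=19
)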

In [None]:
climate_change_trainer = Trainer(
    model=climate_change_model,
    args=climate_change_arg,
    tokenizer=climate_change_tokenizer,
    eval_dataset=climate_detection_encoded_dataset['test']
)

In [None]:
y_pred_climate_change = climate_change_trainer.predict(climate_detection_encoded_dataset['test'])

You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


<h4>The predictions are returned as raw scores (logits), one per class for each paragraph. Taking the argmax of each row converts them into class labels, which are stored in a list.</h4>

In [None]:
y_pred_climate_change = y_pred_climate_change.predictions

In [None]:
import numpy as np
# Pick the highest-scoring class for each paragraph.
y_pred_climate_change = np.argmax(y_pred_climate_change, axis=1).tolist()
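
<h4>To make the conversion from scores to labels concrete, here is a minimal plain-Python sketch. The logits are made-up values chosen for illustration, not output from the model, and the helper names softmax and argmax are defined here rather than taken from the notebook. The label names follow the class list at the top of this notebook.</h4>

```python
import math

# Class ids as listed at the top of this notebook.
ID2LABEL = {0: "not climate change-related", 1: "climate change-related"}


def softmax(logits):
    """Turn one row of logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest entry, i.e. the predicted class id."""
    return max(range(len(values)), key=lambda i: values[i])


# Made-up logits for two paragraphs; NOT taken from the model.
example_logits = [[-2.0, 3.0], [1.5, -0.5]]

example_labels = [argmax(row) for row in example_logits]
example_probs = [softmax(row) for row in example_logits]

for row, probs, label in zip(example_logits, example_probs, example_labels):
    print(row, [round(p, 3) for p in probs], ID2LABEL[label])
```

<h4>Softmax preserves the ordering of the scores, so the argmax of the logits and the argmax of the probabilities always agree. That is why the labels can be read directly from the logits without computing probabilities first.</h4>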

In [None]:
y_pred_climate_change

[1,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 0,
 1,
 1,
 0,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 0,
 1,
 1,
 0,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 1,
 0,
 0,
 1,
 0,
 1,
 0,
 1,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 0,
 1,
 0,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 0,
 0,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 0,
 1,
 0,
 1,
 0,
 0,
 1,
 1,
 1,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 1,
 0,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 1,


<h4>In the following section, the predicted labels are compared with the actual labels and the results are displayed.</h4>
<hr>
<h4>The first output is the macro-averaged F1 score.</h4>
<h4>The second output is the full classification report: precision, recall, F1 score and support per class, the overall accuracy, and the macro and weighted averages of each metric.</h4>
<h4>The third output is the confusion matrix.</h4>

In [None]:
from sklearn.metrics import classification_report, f1_score, confusion_matrix
print(f1_score(climate_detection_encoded_dataset['test']['label'], y_pred_climate_change, average='macro'))

0.9572313105687265


In [None]:
print(classification_report(climate_detection_encoded_dataset['test']['label'],y_pred_climate_change))

              precision    recall  f1-score   support

           0       0.93      0.94      0.93        80
           1       0.98      0.98      0.98       320

    accuracy                           0.97       400
   macro avg       0.96      0.96      0.96       400
weighted avg       0.97      0.97      0.97       400



In [None]:
print(confusion_matrix(climate_detection_encoded_dataset['test']['label'], y_pred_climate_change))

[[ 75   5]
 [  6 314]]
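
<h4>The reported metrics can be recomputed by hand from the confusion matrix above, which is a useful sanity check. The sketch below uses only the four counts shown in the output and assumes scikit-learn's convention that rows are true classes and columns are predicted classes. The helper per_class_metrics is defined here for illustration and is not part of the notebook.</h4>

```python
# Confusion matrix from the output above.
# Rows = true class, columns = predicted class (scikit-learn's convention).
cm = [[75, 5],
      [6, 314]]


def per_class_metrics(cm, k):
    """Precision, recall and F1 for class k, computed from the confusion matrix."""
    tp = cm[k][k]
    fp = sum(cm[r][k] for r in range(len(cm))) - tp  # predicted k, true class differs
    fn = sum(cm[k]) - tp                             # true k, predicted class differs
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


metrics = [per_class_metrics(cm, k) for k in range(len(cm))]
supports = [sum(row) for row in cm]  # number of true examples per class
total = sum(supports)

accuracy = sum(cm[k][k] for k in range(len(cm))) / total
macro_f1 = sum(m[2] for m in metrics) / len(metrics)  # unweighted mean over classes
weighted_f1 = sum(m[2] * s for m, s in zip(metrics, supports)) / total  # support-weighted mean

for k, (p, r, f) in enumerate(metrics):
    print(f"class {k}: precision={p:.2f} recall={r:.2f} f1={f:.2f} support={supports[k]}")
print(f"accuracy={accuracy:.4f} macro_f1={macro_f1:.4f} weighted_f1={weighted_f1:.4f}")
```

<h4>The macro F1 weights both classes equally, while the weighted average scales each class by its support (80 versus 320 paragraphs). Because class 1 is four times larger and scores higher, the weighted F1 is slightly above the macro F1.</h4>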
