# Fine-tuning a BERT-like model with Hugging Face

[Hugging Face](https://huggingface.co/) is a platform hosting thousands of **pretrained** models, as well as libraries and resources that make it easy for us to **fine-tune them**.

:::{attention}
The code below runs much faster on a GPU and may stretch some machines' resources. To run a minimal version, take smaller samples from the data or use smaller models (e.g. [BERT tiny](https://huggingface.co/prajjwal1/bert-tiny)).
:::

In [1]:
import os
os.chdir('../../../')

In the background, Hugging Face's `transformers` uses either [PyTorch](https://pytorch.org/) or [TensorFlow](https://www.tensorflow.org/), and at least one of these has to be installed. In this example, we will use the PyTorch backend (see requirements.txt).
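A quick way to see which backend is installed, without importing the heavy packages themselves, is to ask Python's import system. This is a minimal sketch using only the standard library; the helper name `available_backends` is ours and not part of `transformers`.

```python
from importlib.util import find_spec


def available_backends():
    """Return the names of the deep-learning backends that can be imported."""
    # find_spec returns None when a top-level package is not installed
    return [name for name in ("torch", "tensorflow") if find_spec(name) is not None]


backends = available_backends()
print(backends if backends else "No backend found: install torch or tensorflow")
```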

## Datasets

The first step is to get our data (shown below with a very small sample) into the Hugging Face [datasets](https://huggingface.co/docs/datasets/index) format.

In [2]:
from datasets import Dataset
import pandas as pd
df = pd.read_feather('data/labels.feather').sample(256, random_state=2023).reset_index(drop=True)
print(df.head().title.values)
print(df.head().INCLUDE.values)
dataset = Dataset.from_dict({"text": df['abstract'], "label": df['INCLUDE']})
dataset

['The pitfalls and promises of climate adaptation planning'
 'How to support growth with less energy'
 'Impact of CO2 Emissions on Low Volume Road Maintenance Policy: Case Study of Serbia'
 'Research on Emission Reduction in Supply Chain under the Energy Performance Contracting Mode'
 'Linking energy efficiency to economic productivity: recommendations for improving the robustness of the U.S. economy']
[0. 0. 1. 1. 0.]


Dataset({
    features: ['text', 'label'],
    num_rows: 256
})

## Tokenization

The next step is to **tokenize** our texts. Tokenizers are model-specific. In this tutorial we will use [DistilRoBERTa](https://huggingface.co/distilroberta-base) ([Ro](https://arxiv.org/abs/1907.11692) indicates improvements to the BERT training procedure, [Distil](https://arxiv.org/abs/1910.01108) indicates a smaller, pruned or *distilled* version of the model).

In [3]:
from transformers import AutoTokenizer
model_name = 'distilroberta-base'
#model_name = 'climatebert/distilroberta-base-climate-f'
tokenizer = AutoTokenizer.from_pretrained(model_name)
def tokenize_function(examples):
    # Pad every text to the model's maximum length and truncate longer texts to fit
    return tokenizer(examples["text"], padding="max_length", truncation=True)
tokenized_dataset = dataset.map(tokenize_function, batched=True)
tokenized_dataset





Dataset({
    features: ['text', 'label', 'input_ids', 'attention_mask'],
    num_rows: 256
})
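The two new columns, `input_ids` and `attention_mask`, can be illustrated with a toy example. The sketch below is not the Hugging Face tokenizer: the token ids are made up, the helper `pad_batch` is hypothetical, and `pad_id=1` is an assumption about the padding id. It only shows the idea of padding every sequence to a fixed length and marking real tokens with 1 and padding with 0.

```python
def pad_batch(batch_ids, max_length, pad_id=1):
    """Pad (or cut) each list of token ids to max_length and build the matching mask."""
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        ids = ids[:max_length]  # simplified truncation: just cut off the tail
        n_pad = max_length - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two made-up "tokenized" texts of different lengths
batch = pad_batch([[0, 713, 2], [0, 713, 16, 10, 2]], max_length=4)
print(batch["input_ids"])
print(batch["attention_mask"])
```

The model uses the mask to ignore padded positions, so short and long texts can share one rectangular batch.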

We put this into a function, {mod}`mlmap.hf_tokenize_data`, so that it's simple to create a dataset in the right format. Before using the function, we need to make sure the data has a `text` column and a `labels` column. For the text, we would usually use the abstract, or the title and the abstract.

In [4]:
from mlmap import hf_tokenize_data
df['text'] = df['title'] + ' ' + df['abstract']
df['labels'] = df['INCLUDE'].dropna().astype(int)
dataset = hf_tokenize_data(df, model_name)
dataset





Dataset({
    features: ['labels', 'input_ids', 'attention_mask'],
    num_rows: 256
})

## Training our model

In [5]:
from transformers import AutoModelForSequenceClassification, Trainer
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
trainer = Trainer(model=model, train_dataset=dataset)
# Once this has been instantiated we can apply the train() method
trainer.train()

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at distilroberta-base and are newly initialized: ['classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight', 'classifier.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


TrainOutput(global_step=96, training_loss=0.4565093517303467, metrics={'train_runtime': 23.319, 'train_samples_per_second': 32.935, 'train_steps_per_second': 4.117, 'total_flos': 101734962167808.0, 'train_loss': 0.4565093517303467, 'epoch': 3.0})

Now we have fine-tuned a model!
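The `global_step=96` in the training output can be checked by hand. The sketch below assumes the `Trainer` ran with the default `TrainingArguments` (a batch size of 8 per device and 3 epochs, the latter matching `'epoch': 3.0` in the output), on a single device and without gradient accumulation. The helper `expected_steps` is ours, for illustration only.

```python
import math


def expected_steps(n_examples, batch_size=8, epochs=3):
    """Number of optimizer steps: batches per epoch times number of epochs."""
    # The last batch of an epoch may be smaller, hence the ceiling
    return math.ceil(n_examples / batch_size) * epochs


print(expected_steps(256))  # 256 examples -> 32 batches per epoch, 3 epochs
```

To train for longer or with a different batch size, a `TrainingArguments` object can be passed to the `Trainer` through its `args` parameter.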

## Making predictions with our model

In [6]:
texts = [
  'Designing effective and efficient CO2 mitigation policies in line with Paris Agreement targets',
  'Climate model derived anthropogenic forcing contributions to hurricane intensity '
]
new_df = pd.DataFrame({'text': texts})
dataset = hf_tokenize_data(new_df, model_name)
pred = trainer.predict(dataset)
pred





PredictionOutput(predictions=array([[-0.8902483,  1.1045303],
       [ 2.5690665, -1.9498619]], dtype=float32), label_ids=None, metrics={'test_runtime': 0.021, 'test_samples_per_second': 95.428, 'test_steps_per_second': 47.714})

At the moment, these are logits. To convert them into probabilities, which are more useful (though they will not be well calibrated), we need an activation function. The Softmax function ensures that the probabilities across classes add up to 1 for each document (suitable for binary classification, when this is represented as a negative and a positive class). The Sigmoid function is useful when we have multiple labels that can be true at the same time.

In [7]:
from torch import tensor
from torch.nn import Sigmoid, Softmax
activation = Softmax(dim=1)  # normalise across the class dimension, one row per document
activation(tensor(pred.predictions))


tensor([[0.1198, 0.8802],
        [0.9892, 0.0108]])
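To make the difference between the two activation functions concrete, the sketch below recomputes the probabilities by hand from the logits printed above, using only the standard library. The function names are ours. Softmax normalises each row so that it sums to 1, while the sigmoid treats every logit independently, so a row need not sum to 1.

```python
import math


def softmax(row):
    """Softmax over one row of logits (shifted by the max for numerical stability)."""
    m = max(row)
    exps = [math.exp(z - m) for z in row]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


# Logits copied from the prediction output above
logits = [[-0.8902483, 1.1045303], [2.5690665, -1.9498619]]

softmax_probs = [softmax(row) for row in logits]
sigmoid_probs = [[sigmoid(z) for z in row] for row in logits]

print([[round(p, 4) for p in row] for row in softmax_probs])
print([round(sum(row), 4) for row in sigmoid_probs])  # row sums, not constrained to 1
```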

In our codebase, we subclass the `Trainer` class to give it a `predict_proba` method. This will automatically output probabilities when we make predictions.
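The general shape of such an extension might look like the sketch below. This is a hypothetical mixin, not the code in `mlmap`: it assumes only that the host class has a `predict(dataset)` method whose result exposes a `.predictions` array of logits, as `Trainer.predict` does. The meaning given to the `binary` flag here (softmax when `True`, element-wise sigmoid when `False`) is also an assumption.

```python
import math


def _softmax(row):
    m = max(row)
    exps = [math.exp(z - m) for z in row]
    total = sum(exps)
    return [e / total for e in exps]


class PredictProbaMixin:
    """Hypothetical mixin that adds predict_proba to a class with a predict() method."""

    def predict_proba(self, dataset, binary=True):
        logits = self.predict(dataset).predictions
        if binary:
            # Assumption: mutually exclusive classes, so each row sums to 1
            return [_softmax(row) for row in logits]
        # Assumption: independent labels, so each logit is squashed on its own
        return [[1.0 / (1.0 + math.exp(-z)) for z in row] for row in logits]
```

Under these assumptions, combining it with the real class would read `class ProbaTrainer(PredictProbaMixin, Trainer): pass`.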

## Multilabel predictions

For the instrument type and the sector, we want a model that predicts which, if any, sectors or instrument types (out of a set of possible values) a document mentions.

To do this, we need to feed our model a matrix of labels, with one row per document and one column per instrument type or sector.
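Such a label matrix is often called a multi-hot encoding: each document gets a vector with a 1 for every category that applies and a 0 elsewhere. The sketch below builds one by hand. The sector names follow this dataset's sector columns, but the three example documents and the helper `multi_hot` are made up for illustration.

```python
sectors = ["AFOLU", "Buildings", "Industry", "Energy", "Transport", "Waste", "Cross-sectoral"]


def multi_hot(doc_sectors, classes):
    """Return a 0/1 vector with one entry per class."""
    return [1 if c in doc_sectors else 0 for c in classes]


# Hypothetical documents: one sector, two sectors, and none at all
docs = [{"Transport"}, {"AFOLU", "Energy"}, set()]
label_matrix = [multi_hot(d, sectors) for d in docs]

for row in label_matrix:
    print(row)
```

Unlike a single class label, a row can contain several ones or none, which is why the default single-label setup no longer fits.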

Only included documents have instrument types and sectors, so let's get a small set of included documents and their sectors.

In [8]:
import re
df = pd.read_feather('data/labels.feather').query('INCLUDE==1').sample(512, random_state=2023).reset_index(drop=True)
y_prefix = '8 -'
targets = [x for x in df.columns if re.match(f'^{y_prefix}',x)]
print(len(targets))
df['labels'] = list(df[targets].values.astype(int))
df['text'] = df['title'] + ' ' + df['abstract']
dataset = hf_tokenize_data(df, model_name)
df[['text','labels']]


7






| | text | labels |
|---|---|---|
| 0 | Innovation and Climate Change Policy This pape... | [0, 0, 0, 0, 0, 0, 1] |
| 1 | Global carbon budgets and the viability of new... | [0, 0, 0, 1, 0, 0, 0] |
| 2 | RENEWABLE ENERGY RESOURCES IN AGRICULTURE: POT... | [1, 0, 0, 1, 0, 0, 0] |
| 3 | Ways of Seeing in Environmental Law: How Defor... | [1, 0, 0, 0, 0, 0, 1] |
| 4 | Price transmission mechanism and socio-economi... | [0, 0, 1, 1, 1, 0, 1] |
| ... | ... | ... |
| 507 | Regulating Automakers for Climate Change: US R... | [0, 0, 0, 0, 1, 0, 0] |
| 508 | Environmental and economic benefits of carbon ... | [1, 0, 0, 0, 0, 0, 0] |
| 509 | Strategy of Developing Innovative Technology f... | [0, 0, 0, 0, 0, 0, 1] |
| 510 | CHINA'S REGIONAL CARBON TRADING EXPERIMENTS AN... | [0, 0, 0, 0, 0, 0, 1] |


We'll need to use a different loss function from the default. We can do this by subclassing `Trainer` and overriding its `compute_loss` method with our own (see {meth}`mlmap.CustomTrainer.compute_loss`).
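A common choice of loss for multilabel problems is binary cross-entropy applied to every (document, label) pair independently. Whether `mlmap.CustomTrainer` uses exactly this loss is an assumption on our part. The sketch below computes it from raw logits in plain Python to show the arithmetic; in PyTorch the same quantity is available as `torch.nn.BCEWithLogitsLoss`, which expects the targets as floats.

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over all (document, label) pairs, from raw logits."""
    total, count = 0.0, 0
    for row_logits, row_targets in zip(logits, targets):
        for z, y in zip(row_logits, row_targets):
            # Numerically stable form of -[y*log(s(z)) + (1-y)*log(1-s(z))]
            total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
            count += 1
    return total / count


# One made-up document with three labels: confident hit, confident miss, undecided
loss = bce_with_logits([[2.0, -1.0, 0.0]], [[1, 0, 0]])
print(round(loss, 4))
```

With targets between 0 and 1, every term in this sum is non-negative, so the loss itself cannot drop below zero.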

In [9]:
from mlmap import CustomTrainer

model = AutoModelForSequenceClassification.from_pretrained(
  model_name, num_labels=len(targets)
)

trainer = CustomTrainer(model=model, train_dataset=dataset)
trainer.train()

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at distilroberta-base and are newly initialized: ['classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight', 'classifier.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


TrainOutput(global_step=192, training_loss=-1.2397517016522206e+17, metrics={'train_runtime': 45.5734, 'train_samples_per_second': 33.704, 'train_steps_per_second': 4.213, 'total_flos': 203488067321856.0, 'train_loss': -1.2397517016522206e+17, 'epoch': 3.0})

In [10]:
texts = [
  'Optimal CO2 pricing of light vehicles, trucks, and flights. This paper calculates the optimal CO2 price to reduce emissions from the transport sector. This works out to a tax of €0.20 per liter in 2025 of petrol, rising to €0.50 a liter in 2050. The policy would have large health benefits, through reducing PM2.5 emissions.',
  'The Paris Agreement and its implications for land use, forestry and agriculture. REDD'
]
new_df = pd.DataFrame({'text': texts})
dataset = hf_tokenize_data(new_df, model_name)


pred = trainer.predict_proba(dataset, binary=False)
pred_df = pd.DataFrame(pred)
pred_df.columns=targets
pred_df.style.format(precision=2)





| | 8 - 01. AFOLU | 8 - 02. Buildings | 8 - 03. Industry | 8 - 04. Energy | 8 - 05. Transport | 8 - 06. Waste | 8 - 15. Cross-sectoral |
|---|---|---|---|---|---|---|---|
| 0 | 0.08 | 0.11 | 0.14 | 0.24 | 0.56 | 0.07 | 0.15 |
| 1 | 0.71 | 0.1 | 0.12 | 0.18 | 0.1 | 0.11 | 0.25 |
