# Fine-tuning a BERT-like model with Hugging Face

[Hugging Face](https://huggingface.co/) is a platform hosting thousands of **pretrained** models, as well as libraries and resources that make it easy for us to **fine-tune them**.

In [1]:
import os
os.chdir('../../../')

In the background, Hugging Face's `transformers` uses either [PyTorch](https://pytorch.org/) or [TensorFlow](https://www.tensorflow.org/). At least one of these has to be installed. In this example, we will use the PyTorch backend (see requirements.txt).
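A quick way to confirm that a backend is available before going further is to look for the packages without importing them. This is a minimal sketch using only the standard library; the helper name `available_backends` is our own and is not part of `transformers`.

```python
from importlib.util import find_spec

def available_backends():
    """Report which deep learning backends are installed in this environment."""
    # find_spec returns None when a top-level package cannot be found
    return {name: find_spec(name) is not None for name in ("torch", "tensorflow")}

backends = available_backends()
print(backends)
if not any(backends.values()):
    print("Install PyTorch or TensorFlow before loading transformers models.")
```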

## Datasets

The first step is to get our data (shown below with a very small sample) into the Hugging Face [datasets](https://huggingface.co/docs/datasets/index) format.

In [2]:
from datasets import Dataset
import pandas as pd
df = pd.read_feather('data/labels.feather').sample(32, random_state=2023).reset_index(drop=True)
print(df.title.values)
print(df.INCLUDE.values)
dataset = Dataset.from_dict({"text": df['abstract'], "label": df['INCLUDE']})
dataset

['Utilizing GIS to Examine the Relationship Between State Renewable Portfolio Standards and the Adoption of Renewable Energy Technologies'
 'The Way Forward after the Durban Climate Change Conference: A Strategic Analysis'
 'A grassland strategy for farming systems in Europe to mitigate GHG emissions-An integrated spatially differentiated modelling approach'
 'A Lagrangian Relaxation-Based Solution Method for a Green Vehicle Routing Problem to Minimize Greenhouse Gas Emissions'
 'The environment, international standards, asset health management and condition monitoring: An integrated strategy'
 'The effects of electricity pricing on PHEV competitiveness'
 'Efficiency Analysis of Carbon Emission Quotas'
 'Optimal timing of CO2 mitigation policies for a cost-effectiveness model'
 'Green supply chain network design considering chain-to-chain competition on price and carbon emission'
 'Assessing the strength of the monsoon during the late Pleistocene in southwestern United States'
 'Biogas


Dataset({
    features: ['text', 'label'],
    num_rows: 32
})

## Tokenization

The next step is to **tokenize** our texts. Tokenizers are model specific. In this tutorial we will use [DistilRoBERTa](https://huggingface.co/distilroberta-base): [Ro](https://arxiv.org/abs/1907.11692) indicates improvements to the BERT training procedure, and [Distil](https://arxiv.org/abs/1910.01108) indicates a smaller, pruned or *distilled* version of the model.

In [3]:
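from transformers import AutoTokenizer
model_name = 'distilroberta-base'
# Load the tokenizer that matches the pretrained checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
def tokenize_function(examples):
    # Pad every text to the model's maximum length; truncate texts that exceed it
    return tokenizer(examples["text"], padding="max_length", truncation=True)
# batched=True passes many rows to the tokenizer at once
tokenized_dataset = dataset.map(tokenize_function, batched=True)
tokenized_dataset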

Dataset({
    features: ['text', 'label', 'input_ids', 'attention_mask'],
    num_rows: 32
})
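To make the two new columns concrete, here is a toy illustration of what `input_ids` and `attention_mask` contain when texts are padded to a fixed length. It is a sketch only: `toy_tokenize` is a hypothetical whitespace tokenizer with made-up ids, not the DistilRoBERTa tokenizer, which splits text into subword units and uses its own vocabulary and padding id.

```python
def toy_tokenize(texts, max_length=8, pad_id=0):
    """Toy whitespace tokenizer illustrating padding and attention masks."""
    vocab = {}
    batch = {"input_ids": [], "attention_mask": []}
    for text in texts:
        # Give each new word the next integer id; ids start at 1 so that
        # they never collide with the default pad_id of 0
        ids = [vocab.setdefault(w, len(vocab) + 1) for w in text.lower().split()]
        ids = ids[:max_length]  # truncate texts that are too long
        n_pad = max_length - len(ids)
        batch["input_ids"].append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding that the model should ignore
        batch["attention_mask"].append([1] * len(ids) + [0] * n_pad)
    return batch

batch = toy_tokenize(["carbon tax policy", "the effects of carbon pricing"], max_length=6)
print(batch["input_ids"])       # [[1, 2, 3, 0, 0, 0], [4, 5, 6, 1, 7, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 1, 0]]
```

Every row has the same length, which is what allows a batch of texts to be stacked into a single tensor.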

We put this into a function, [`hf_tokenize_data`](reference:api), so that it's simple to create a dataset in the right format. Before using the function, we need to make sure our data has a `text` column and a `labels` column. Usually, we would use the abstract, or the title and the abstract, as the text; here we use only the title.

In [4]:
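from mlmap import hf_tokenize_data
# Use the title only here; uncomment the rest of the line to append the abstract
df['text'] = df['title'] #+ ' ' + df['abstract']
# Drop unlabelled rows first so that `labels` stays an integer column
df = df.dropna(subset=['INCLUDE']).reset_index(drop=True)
df['labels'] = df['INCLUDE'].astype(int)
dataset = hf_tokenize_data(df, model_name)
dataset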

Dataset({
    features: ['labels', 'input_ids', 'attention_mask'],
    num_rows: 32
})

## Training our model

In [5]:
from transformers import AutoModelForSequenceClassification, Trainer
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
trainer = Trainer(model=model, train_dataset=dataset)
# Once this has been instantiated we can apply the train() method
trainer.train()

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at distilroberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.out_proj.weight', 'classifier.dense.weight', 'classifier.out_proj.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


TrainOutput(global_step=12, training_loss=0.6786517302195231, metrics={'train_runtime': 3.3983, 'train_samples_per_second': 28.249, 'train_steps_per_second': 3.531, 'total_flos': 12716870270976.0, 'train_loss': 0.6786517302195231, 'epoch': 3.0})

Now we have fine-tuned a model!
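The `global_step=12` and `'epoch': 3.0` in the training output above follow from the `Trainer` defaults, since we passed no training arguments. The arithmetic below is a small sketch that assumes a single device and no gradient accumulation; the default batch size of 8 and the default of 3 epochs come from `TrainingArguments`.

```python
import math

n_examples = 32   # rows in the training dataset above
batch_size = 8    # default per-device training batch size
n_epochs = 3      # default number of training epochs

steps_per_epoch = math.ceil(n_examples / batch_size)
total_steps = steps_per_epoch * n_epochs
print(steps_per_epoch, total_steps)  # 4 12
```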

## Making predictions with our model

In [6]:
texts = [
  'Designing effective and efficient CO2 mitigation policies in line with Paris Agreement targets',
  'Climate model derived anthropogenic forcing contributions to hurricane intensity '
]
new_df = pd.DataFrame({'text': texts})
dataset = hf_tokenize_data(new_df, model_name)
pred = trainer.predict(dataset)
pred

PredictionOutput(predictions=array([[ 0.21125597, -0.03911743],
       [ 0.22655225, -0.05965398]], dtype=float32), label_ids=None, metrics={'test_runtime': 0.0212, 'test_samples_per_second': 94.502, 'test_steps_per_second': 47.251})

At the moment, these are logits: raw, unnormalised scores for each class. To convert them into probabilities, which are more useful (though they will not be well calibrated), we need an activation function. The Softmax function ensures that the probabilities across classes add up to 1 for each document, which suits binary classification represented as a negative and a positive class. The Sigmoid function treats each class independently, which is useful when we have multiple labels that can be true at the same time.

In [7]:
from torch import tensor
from torch.nn import Sigmoid, Softmax
# dim=1 normalises across the classes of each document (each row)
activation = Softmax(dim=1)
activation(tensor(pred.predictions))


tensor([[0.5623, 0.4377],
        [0.5711, 0.4289]])
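To see how the two activations differ, the same logits can be worked through by hand. This is a plain-Python sketch for illustration; the helper functions `softmax` and `sigmoid` are written out here and are not the `torch.nn` modules.

```python
import math

def softmax(row):
    """Normalise one row of logits so that the outputs sum to 1."""
    m = max(row)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(row):
    """Squash each logit independently into the interval (0, 1)."""
    return [1.0 / (1.0 + math.exp(-x)) for x in row]

logits = [[0.21125597, -0.03911743],
          [0.22655225, -0.05965398]]

for row in logits:
    print("softmax:", [round(p, 4) for p in softmax(row)],
          "sigmoid:", [round(p, 4) for p in sigmoid(row)])
```

The softmax rows reproduce the tensor above and each sums to 1, while the sigmoid values are computed per class and need not sum to 1.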

In our codebase, we subclass the `Trainer` class to give it a `predict_proba` method, which automatically outputs probabilities when we make predictions.
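The following is a minimal sketch of how such a method could look; it is not the implementation from the codebase. The names `ProbaMixin`, `logits_to_proba` and the `multilabel` flag are hypothetical, and the only assumption about the trainer is that `predict()` returns an object with a `predictions` array of logits, as in the output above.

```python
import numpy as np

def logits_to_proba(logits, multilabel=False):
    """Convert an (n_documents, n_classes) array of logits into probabilities."""
    logits = np.asarray(logits, dtype=float)
    if multilabel:
        # Sigmoid: each class is scored independently
        return 1.0 / (1.0 + np.exp(-logits))
    # Softmax: the classes of each document compete and sum to 1
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=-1, keepdims=True)

class ProbaMixin:
    """Adds predict_proba to any class that provides a Trainer-style predict()."""

    def predict_proba(self, dataset, multilabel=False):
        output = self.predict(dataset)  # provided by the class this is mixed into
        return logits_to_proba(output.predictions, multilabel=multilabel)
```

With `transformers` installed, the mixin could be combined with the real class as `class ProbaTrainer(ProbaMixin, Trainer): pass` and used in place of `Trainer`.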

## Multilabel predictions

For the instrument type and the sector, we want a model that predicts which, if any, sectors or instrument types (out of a set of possible values) a document mentions.

To do this, we need to feed a matrix of labels for each instrument type to our model.
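As a toy illustration of such a label matrix, the sketch below builds one from columns that share a prefix. The data and the column names (`4 - Tax`, `4 - Subsidy`) are invented for this example and are not taken from `labels.feather`.

```python
import pandas as pd

# Hypothetical toy data: one 0/1 column per instrument type
toy = pd.DataFrame({
    "title": ["doc a", "doc b", "doc c"],
    "4 - Tax": [1, 0, 1],
    "4 - Subsidy": [0, 0, 1],
})

prefix = "4 -"
targets = [c for c in toy.columns if c.startswith(prefix)]

# One row per document, one column per label; a row may contain several 1s or none
label_matrix = toy[targets].to_numpy().astype(int)
print(targets)
print(label_matrix)

# Store each row of the matrix as a single cell of a "labels" column
toy["labels"] = list(label_matrix)
```

Here the second document has no instrument type and the third has two, which is exactly the situation a single-label classifier cannot represent.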

Only included documents have instrument types, so let's get a small set of included documents and their instrument types.

In [8]:
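import re
df = pd.read_feather('data/labels.feather').query('INCLUDE==1').sample(32, random_state=2023).reset_index(drop=True)
y_prefix = '4 -'
# Select the label columns whose names start with the prefix
targets = [x for x in df.columns if re.match(f'^{re.escape(y_prefix)}', x)]
# Store one row of the label matrix per document
df['labels'] = list(df[targets].values.astype(int))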