# Train an SDG Classifier with the Hugging Face Hub Training Notebook and Annotated Data from OSDG (https://osdg.ai/)

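In this notebook we'll take a look at fine-tuning a Transformer model for text classification. The code below uses [DistilBERT](https://huggingface.co/distilbert-base-uncased); a multilingual checkpoint such as [XLM-RoBERTa](https://huggingface.co/xlm-roberta-base) should work as well if you change `model_checkpoint`. By the end of this notebook you should know how to: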

* Load and process a dataset with pandas and 🤗 Datasets
* Map string labels to the integer IDs a classifier expects
* Fine-tune and evaluate a pretrained model on your data
* Push a model to the Hugging Face Hub

Let's get started!

In [1]:
%%capture
! pip install datasets transformers sentencepiece huggingface_hub
! apt install git-lfs

To be able to share your model with the community there are a few more steps to follow.

First, store your authentication token from the Hugging Face website (sign up there if you haven't already). Then execute the following cells and enter your credentials when prompted; recent versions of `huggingface_hub` ask for an access token rather than a username and password:

In [2]:
!git config --global credential.helper store

In [3]:
from huggingface_hub import notebook_login

notebook_login()



## The dataset

In this notebook we'll use pandas to download the OSDG community dataset and 🤗 Datasets to preprocess it.

In [5]:
import pandas as pd
df_osdg = pd.read_csv('https://zenodo.org/record/5550238/files/osdg-community-dataset-v21-09-30.csv',sep='\t')

In [6]:
print(df_osdg.shape)
df_osdg[:3]

(32120, 7)


Unnamed: 0,doi,text_id,text,sdg,labels_negative,labels_positive,agreement
0,10.6027/9789289342698-7-en,00021941702cd84171ff33962197ca1f,"From a gender perspective, Paulgaard points ou...",5,1,7,0.75
1,10.18356/eca72908-en,00028349a7f9b2485ff344ae44ccfd6b,Labour legislation regulates maximum working h...,11,2,1,0.333333
2,10.1787/9789264289062-4-en,0004eb64f96e1620cd852603d9cbe4d4,The average figure also masks large difference...,3,1,6,0.714286


In [7]:
for i in range(1001,1005):
  print(df_osdg.iloc[i]['text'])
  print(df_osdg.iloc[i]['sdg'])

In particular, those that show the impact of policies in cities at advanced stages of the ageing process will become useful information for cities in the early stages of ageing. One such example is the EU Ageing Report (European Commission, 2012a), which includes a set of indicators that illustrate expenditure projections for a large older population, covering pensions, healthcare, long-term care, education and unemployment. Internationally applicable indicators that help assess the social sustainability of cities and their urban form are particularly important.
11
The share of children under the age of 3 enrolled in ECEC is on the rise in most countries and has increased on average from 25% to 31% between 2010 and 2016 (Table B2.1b). This is particularly marked in many European countries, as a result of further stimulus by the 2010 objectives set by the European Union (EU) at its Barcelona meeting (to supply subsidised full-day places for one-third of children under the age of 3 by 20

In [8]:
df_osdg.iloc[3]['sdg']

3

In [9]:
df_osdg['sdg'].value_counts()

5     4327
4     3740
6     2826
7     2808
1     2740
3     2698
2     2463
11    2286
13    2085
8     1517
9     1093
14    1087
10    1027
15     960
12     463
Name: sdg, dtype: int64
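The counts above are quite uneven, so it helps to know what accuracy a trivial classifier would reach before judging a fine-tuned model. The following is a minimal sketch that copies the counts from the output above and computes the accuracy of always predicting the most frequent class; the variable names are illustrative and not part of the original notebook.

```python
# Class counts copied from the value_counts() output above.
sdg_counts = {
    5: 4327, 4: 3740, 6: 2826, 7: 2808, 1: 2740,
    3: 2698, 2: 2463, 11: 2286, 13: 2085, 8: 1517,
    9: 1093, 14: 1087, 10: 1027, 15: 960, 12: 463,
}

total = sum(sdg_counts.values())

# Most frequent class and the accuracy of always predicting it.
majority_sdg, majority_count = max(sdg_counts.items(), key=lambda kv: kv[1])
baseline_accuracy = majority_count / total

# Ratio between the largest and the smallest class.
imbalance_ratio = majority_count / min(sdg_counts.values())

print(total, majority_sdg, round(baseline_accuracy, 3), round(imbalance_ratio, 1))
```

Any trained model should clearly beat this majority-class baseline; the imbalance ratio also suggests that per-class metrics could be worth reporting alongside accuracy.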

In [10]:
# Heuristic: take the last two characters of the DOI as a language code (most DOIs end in "-en")
df_osdg['lang'] = [te[-2:] for te in list(df_osdg['doi'])]
df_osdg['lang'].value_counts()

en    32039
r3       33
jb       23
d8        8
lt        8
9x        6
bn        2
q6        1
Name: lang, dtype: int64
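Values such as `r3` or `9x` in the output above are probably DOI fragments rather than language codes, because the last two characters are taken regardless of what precedes them. A slightly stricter sketch is shown below. It assumes that a language suffix is a hyphen followed by exactly two lowercase letters at the end of the DOI; the helper name `doi_language` and the second example DOI are hypothetical.

```python
import re

# Assumption: a language suffix is a trailing hyphen plus two lowercase letters.
_LANG_SUFFIX = re.compile(r"-([a-z]{2})$")

def doi_language(doi):
    """Return the two-letter suffix of a DOI such as '...-en', else None."""
    match = _LANG_SUFFIX.search(doi)
    return match.group(1) if match else None

# DOI taken from the first row of the table above.
print(doi_language("10.6027/9789289342698-7-en"))
# Hypothetical DOI without a language suffix.
print(doi_language("10.1000/example-abc9x"))
```

Applied with `df_osdg['doi'].map(doi_language)`, rows without a recognisable suffix would end up as missing values instead of two arbitrary characters.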

In [11]:
len_lst = []
for t in range(0,2000):
  text = df_osdg.iloc[t]['text']
  len_lst.append(len(text.split()))

In [12]:
import numpy as np
np.quantile(len_lst,.9)

133.0
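The quantile above counts whitespace-separated words in the first 2,000 rows. A subword tokenizer produces at least one token per word and adds special tokens, so a token limit close to this value will likely truncate more than 10% of the texts. The sketch below shows one way to check how many sequences a given limit would cut off; the helper `share_truncated` and the example lengths are hypothetical and not taken from the OSDG data.

```python
def share_truncated(lengths, max_length):
    """Return the fraction of sequences that are longer than max_length."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    return sum(1 for n in lengths if n > max_length) / len(lengths)

# Hypothetical token counts per text (not taken from the OSDG data).
example_lengths = [80, 95, 120, 134, 135, 136, 150, 210]

print(share_truncated(example_lengths, max_length=135))
```

In practice the lengths would come from the tokenizer itself, for example by counting the `input_ids` of each text when tokenizing without truncation.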

In [13]:
#df_osdg['label'] = df_osdg['sdg']
#df_osdg['sdg'] = [str(a) for a in list(df_osdg['label'])]

df_osdg['sdg'] = df_osdg['sdg'].apply(lambda x: 'sdg_'+str(x))

In [14]:
df_osdg.describe()

Unnamed: 0,labels_negative,labels_positive,agreement
count,32120.0,32120.0,32120.0
mean,1.332098,4.046762,0.687803
std,6.938317,12.216903,0.326972
min,0.0,0.0,0.0
25%,0.0,2.0,0.333333
50%,1.0,3.0,0.75
75%,2.0,5.0,1.0
max,777.0,864.0,1.0


In [15]:
len(df_osdg['sdg'].unique())

15

In [16]:
from datasets import Dataset
dataset =  Dataset.from_pandas(df_osdg)

In [17]:
dataset

Dataset({
    features: ['doi', 'text_id', 'text', 'sdg', 'labels_negative', 'labels_positive', 'agreement', 'lang'],
    num_rows: 32120
})

In [18]:
dataset[100]

{'agreement': 0.6,
 'doi': '10.1787/9789264245914-5-en',
 'labels_negative': 1,
 'labels_positive': 4,
 'lang': 'en',
 'sdg': 'sdg_4',
 'text': 'A Research and Innovation Framework was developed by late 2011 and innovation became part of the Department’s strategic plan. The focus has been on equity, excellence and sustainability, identifying and up-scaling innovation, as well as establishing system-wide directions for innovation. The focus was primarily on improvement in literacy and numeracy initially, subsequently extended to other areas of the curriculum.',
 'text_id': '00e6c8802fe3757149fe6c3aebfd1f65'}

## Mapping Labels

During training, 🤗 Transformers expects the labels to be integer IDs from 0 to N-1. Our labels are strings such as `sdg_5`, so let's map them to IDs. While we're at it, we'll create a mapping between the label IDs and names, which will be handy later on when we want to run inference with our model. First we'll define the label mapping from ID to name:

In [19]:
labels = df_osdg['sdg'].unique()
labels.sort()
labels

array(['sdg_1', 'sdg_10', 'sdg_11', 'sdg_12', 'sdg_13', 'sdg_14',
       'sdg_15', 'sdg_2', 'sdg_3', 'sdg_4', 'sdg_5', 'sdg_6', 'sdg_7',
       'sdg_8', 'sdg_9'], dtype=object)

In [20]:
label_names = labels
id2label = {idx:label for idx, label in enumerate(label_names)}
label2id = {label:idx for idx, label in enumerate(label_names)}
label2id

{'sdg_1': 0,
 'sdg_10': 1,
 'sdg_11': 2,
 'sdg_12': 3,
 'sdg_13': 4,
 'sdg_14': 5,
 'sdg_15': 6,
 'sdg_2': 7,
 'sdg_3': 8,
 'sdg_4': 9,
 'sdg_5': 10,
 'sdg_6': 11,
 'sdg_7': 12,
 'sdg_8': 13,
 'sdg_9': 14}

In [21]:
#id2label = {int(label): 'sdg_'+ str(label) for  label in labels}
id2label

{0: 'sdg_1',
 1: 'sdg_10',
 2: 'sdg_11',
 3: 'sdg_12',
 4: 'sdg_13',
 5: 'sdg_14',
 6: 'sdg_15',
 7: 'sdg_2',
 8: 'sdg_3',
 9: 'sdg_4',
 10: 'sdg_5',
 11: 'sdg_6',
 12: 'sdg_7',
 13: 'sdg_8',
 14: 'sdg_9'}
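Because the label strings are sorted lexicographically, `sdg_10` receives ID 1 and `sdg_2` receives ID 7. This is harmless for training, but it is easy to misread later. If you prefer IDs that follow the numeric SDG order, a minimal sketch could look like this; the helper `sdg_number` and the `*_numeric` names are illustrative, and the rest of this notebook keeps the lexicographic mapping.

```python
def sdg_number(label):
    """Extract the integer from a label such as 'sdg_10'."""
    return int(label.split("_")[1])

# A few label strings in the lexicographic order shown above.
label_names = ["sdg_1", "sdg_10", "sdg_11", "sdg_2", "sdg_9"]

# Sort by the numeric part instead of lexicographically.
ordered = sorted(label_names, key=sdg_number)

id2label_numeric = dict(enumerate(ordered))
label2id_numeric = {label: idx for idx, label in id2label_numeric.items()}

print(ordered)
```

Whichever ordering you choose, the same `id2label` and `label2id` dictionaries must be used for training and for inference.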

We can then apply this mapping to our whole dataset by using the `Dataset.map()` method. Similar to the `Dataset.filter()` method, this one expects a function which receives examples as input, but returns a Python dictionary as output. The keys of the dictionary correspond to the columns, while the values correspond to the column entries. The following function creates two new columns:

* A `labels` column which holds the integer ID of the SDG label
* A `label_name` column which holds the label string for that ID

In [22]:
def map_labels(example):
    # Look up the integer ID for the SDG label string
    label_id = label2id[example["sdg"]]
    return {"labels": label_id, "label_name": id2label[label_id]}

In [23]:
dataset = dataset.map(map_labels)
# Peek at the first example
dataset[0]


{'agreement': 0.75,
 'doi': '10.6027/9789289342698-7-en',
 'label_name': 'sdg_5',
 'labels': 10,
 'labels_negative': 1,
 'labels_positive': 7,
 'lang': 'en',
 'sdg': 'sdg_5',
 'text': 'From a gender perspective, Paulgaard points out that the labour markets of the fishing villages have been highly gender-segregated in terms of the existence of "male jobs" and "female jobs"; however, the new business opportunities have led to the male population of the peripheral areas now working in the service industry in former "female jobs": "That boys and girls are doing the same jobs indicates change, because traditional boundaries between women and men\'s work are being crossed. But the fact that young people are still working represents continuity with the past" (Paulgaard 2002: 102). When Paulgaard refers to continuity with traditions, she refers to the expectations of young adults to participate in adult culture, thus these fishing villages traditionally have no actual youth culture. As describ

In [24]:
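dataset = dataset.shuffle(seed=42)
# Note: train_test_split() shuffles again by default and no seed is passed here,
# so the exact split can differ between runs
dataset = dataset.train_test_split(test_size=0.1)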
dataset

DatasetDict({
    train: Dataset({
        features: ['doi', 'text_id', 'text', 'sdg', 'labels_negative', 'labels_positive', 'agreement', 'lang', 'labels', 'label_name'],
        num_rows: 28908
    })
    test: Dataset({
        features: ['doi', 'text_id', 'text', 'sdg', 'labels_negative', 'labels_positive', 'agreement', 'lang', 'labels', 'label_name'],
        num_rows: 3212
    })
})

## From text to tokens

Like other machine learning models, Transformers expect their inputs in the form of numbers (not strings) and so some form of preprocessing is required. For NLP, this preprocessing step is called _tokenization_. Tokenization converts strings into atomic chunks called tokens, and these tokens are subsequently encoded as numerical vectors. 


In [25]:
from transformers import AutoTokenizer

# Checkpoint to fine-tune; alternatives: "bert-base-cased", "xlm-roberta-base"
model_checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
print(tokenizer.vocab_size)

def tokenize_function(examples):
    # Pad or truncate every text to exactly 135 tokens
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=135)


tokenized_datasets = dataset.map(tokenize_function, batched=True)


30522



In [26]:
tokenizer.special_tokens_map

{'cls_token': '[CLS]',
 'mask_token': '[MASK]',
 'pad_token': '[PAD]',
 'sep_token': '[SEP]',
 'unk_token': '[UNK]'}

In [27]:
label2id = {v:k for k,v in id2label.items()}

When you feed strings to the tokenizer, you'll get at least two fields (some models have more, depending on how they're trained):

* `input_ids`: These correspond to the numerical encodings that map each token to an integer
* `attention_mask`: This indicates to the model which tokens should be ignored when computing self-attention

Let's see how this works with a simple example. First we encode the string:

In [28]:
encoded_str = tokenizer("Today I'm giving an NLP workshop at MLT")
encoded_str

{'input_ids': [101, 2651, 1045, 1005, 1049, 3228, 2019, 17953, 2361, 8395, 2012, 19875, 2102, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
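In the output above every `attention_mask` entry is 1 because the single string was not padded. With `padding="max_length"`, as in `tokenize_function`, shorter texts are filled up with padding IDs and those positions get a mask value of 0. The toy sketch below illustrates the idea with a hypothetical whitespace vocabulary; it is not DistilBERT's WordPiece tokenizer, and the function `encode` and its IDs are assumptions made for illustration.

```python
def encode(text, vocab, max_length, pad_id=0, unk_id=1):
    """Whitespace-tokenize, truncate, and pad to exactly max_length."""
    ids = [vocab.get(tok, unk_id) for tok in text.lower().split()]
    ids = ids[:max_length]               # truncation
    mask = [1] * len(ids)                # 1 = real token
    padding = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * padding,
        "attention_mask": mask + [0] * padding,  # 0 = padding, ignored
    }

# Hypothetical toy vocabulary; IDs 0 and 1 are reserved for padding and unknown.
toy_vocab = {"clean": 2, "water": 3, "for": 4, "all": 5}

encoded = encode("Clean water for all", toy_vocab, max_length=6)
print(encoded)
```

The real tokenizer additionally splits words into subword pieces and adds the `[CLS]` and `[SEP]` tokens (IDs 101 and 102 in the output above).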

## Loading a pretrained model

Loading a pretrained model from the Hub is quite simple: just select the appropriate `AutoModelForXxx` class and use the `from_pretrained()` function with the model checkpoint. In our case, we're dealing with 15 classes (one for each SDG present in the dataset), so to initialise the model we'll provide this information along with the label mappings:

In [29]:
from transformers import AutoModelForSequenceClassification

num_labels = len(labels)
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels, label2id=label2id, id2label=id2label)


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_projector.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias', 'pre_classifier

These warnings are perfectly normal - they are telling us that the weights in the head of the network are randomly initialised and so we should fine-tune the model on a downstream task.

Now that we have a model, the next step is to initialise a `Trainer` that will take care of the training loop for us. Let's do that next.

## Creating a Trainer

To create a `Trainer`, we usually need a few basic ingredients:

* A `TrainingArguments` class to define all the hyperparameters
* A `compute_metrics` function to compute metrics during evaluation
* Datasets to train and evaluate on

In [1]:
from transformers import TrainingArguments

model_name = model_checkpoint.split("/")[-1]
batch_size = 32
num_train_epochs = 7
logging_steps = len(tokenized_datasets["train"]) // (batch_size * num_train_epochs)

args = TrainingArguments(
    output_dir=f"{model_name}-finetuned-osdg",
    evaluation_strategy="epoch",
    save_strategy = "epoch",
    learning_rate=6e-7,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_train_epochs,
    #weight_decay=0.005,
    logging_steps=logging_steps,
    push_to_hub=True,
    #push_to_hub_organization="Deutsche Gesellschaft für internationale Zusammenarbeit"
)


In [32]:
model_name

'distilbert-base-uncased'

Here we've defined `output_dir` to save our checkpoints and tweaked some of the default hyperparameters like the learning rate, batch size and number of epochs (weight decay stays at its default because that line is commented out). The `push_to_hub` argument will push each saved checkpoint to the Hub automatically for us, so we can reuse the model at any point in the future!
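The number of optimization steps implied by these settings can be worked out by hand. The sketch below assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept; the training set size of 28,908 rows comes from the split shown earlier.

```python
import math

num_examples = 28908      # rows in the training split
batch_size = 32
num_train_epochs = 7

# Batches per epoch, counting the final partial batch.
steps_per_epoch = math.ceil(num_examples / batch_size)

# Total optimizer updates over the whole run.
total_steps = steps_per_epoch * num_train_epochs

# Same formula as in the TrainingArguments cell above.
logging_steps = num_examples // (batch_size * num_train_epochs)

print(steps_per_epoch, total_steps, logging_steps)
```

With these numbers an epoch has 904 steps and the full run has 6,328, and the training loss is logged every 129 steps.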

Now that we've defined the hyperparameters, the next step is to define the metrics. The SDG labels are categories with no meaningful order, so an ordinal measure such as the mean absolute error does not apply here. We'll report accuracy instead, which we can get easily from Scikit-learn as follows:

In [33]:
import numpy as np
from sklearn.metrics import accuracy_score

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {"Acc": accuracy_score(labels, predictions)}
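To see what `compute_metrics` does with its input, here is a small sketch on hand-made values. The `Trainer` passes a pair of arrays: the raw logits with one row per example and one column per class, and the true label IDs. The helper `accuracy_from_logits` is hypothetical and uses plain NumPy instead of `accuracy_score`; the logits are invented for illustration and are not model output.

```python
import numpy as np

def accuracy_from_logits(logits, labels):
    """Pick the highest-scoring class per row and compare with the labels."""
    predictions = np.argmax(logits, axis=1)
    return float(np.mean(predictions == labels))

# Hypothetical logits for 4 examples and 3 classes (not model output).
logits = np.array([
    [2.0, 0.1, -1.0],   # highest score: class 0
    [0.3, 1.5, 0.2],    # highest score: class 1
    [0.0, 0.2, 3.1],    # highest score: class 2
    [1.2, 0.9, 0.1],    # highest score: class 0
])
labels = np.array([0, 1, 2, 1])

accuracy = accuracy_from_logits(logits, labels)
print(accuracy)
```

Three of the four rows are classified correctly, so the accuracy is 0.75. No softmax is needed because the argmax of the logits equals the argmax of the probabilities.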

In [34]:
tokenized_datasets['train']

Dataset({
    features: ['doi', 'text_id', 'text', 'sdg', 'labels_negative', 'labels_positive', 'agreement', 'lang', 'labels', 'label_name', 'input_ids', 'attention_mask'],
    num_rows: 28908
})

In [35]:
from transformers import Trainer 

trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

Cloning https://huggingface.co/peter2000/distilbert-base-uncased-finetuned-osdg into local empty directory.


In [36]:
trainer.train()

The following columns in the training set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: labels_negative, text, sdg, agreement, labels_positive, text_id, lang, doi, label_name. If labels_negative, text, sdg, agreement, labels_positive, text_id, lang, doi, label_name are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 28908
  Num Epochs = 7
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 6328


Epoch,Training Loss,Validation Loss,Acc
1,1.072,1.02738,0.724159
2,0.8997,0.916817,0.748132
3,0.8137,0.868669,0.75467
4,0.7966,0.840048,0.761831
5,0.7209,0.828021,0.767746
6,0.6666,0.83,0.767123
7,0.6655,0.8231,0.769614


The following columns in the evaluation set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: labels_negative, text, sdg, agreement, labels_positive, text_id, lang, doi, label_name. If labels_negative, text, sdg, agreement, labels_positive, text_id, lang, doi, label_name are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 3212
  Batch size = 32
Saving model checkpoint to distilbert-base-uncased-finetuned-osdg/checkpoint-904
Configuration saved in distilbert-base-uncased-finetuned-osdg/checkpoint-904/config.json
Model weights saved in distilbert-base-uncased-finetuned-osdg/checkpoint-904/pytorch_model.bin
tokenizer config file saved in distilbert-base-uncased-finetuned-osdg/checkpoint-904/tokenizer_config.json
Special tokens file saved in distilbert-base-uncased-finetuned-osdg/checkpoint-904/special_tokens_map.json
tokenizer confi

TrainOutput(global_step=6328, training_loss=0.8676624083338134, metrics={'train_runtime': 1476.0144, 'train_samples_per_second': 137.096, 'train_steps_per_second': 4.287, 'total_flos': 7069514264782200.0, 'train_loss': 0.8676624083338134, 'epoch': 7.0})
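The per-epoch table can also be inspected programmatically, for example to see where the validation loss stops improving. The sketch below copies the reported values into a list and picks the epoch with the lowest validation loss; the variable names are illustrative.

```python
# (epoch, training loss, validation loss, accuracy) copied from the table above.
history = [
    (1, 1.072, 1.02738, 0.724159),
    (2, 0.8997, 0.916817, 0.748132),
    (3, 0.8137, 0.868669, 0.75467),
    (4, 0.7966, 0.840048, 0.761831),
    (5, 0.7209, 0.828021, 0.767746),
    (6, 0.6666, 0.83, 0.767123),
    (7, 0.6655, 0.8231, 0.769614),
]

# Epoch with the lowest validation loss.
best_epoch, _, best_val_loss, best_acc = min(history, key=lambda row: row[2])

# Drop in validation loss between epoch 5 and epoch 7.
late_gain = history[4][2] - history[6][2]

print(best_epoch, best_val_loss, best_acc, round(late_gain, 4))
```

The last epoch is the best one here, but the validation loss falls by less than 0.005 over the final two epochs, which suggests that further epochs at this learning rate would bring only small gains.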

In [None]:
#trainer.save_model()



In [None]:
#fine_tuned_model = AutoModel.from_pretrained("roberta-base-finetuned-sdg")

In [37]:
trainer.push_to_hub()

Saving model checkpoint to distilbert-base-uncased-finetuned-osdg
Configuration saved in distilbert-base-uncased-finetuned-osdg/config.json
Model weights saved in distilbert-base-uncased-finetuned-osdg/pytorch_model.bin
tokenizer config file saved in distilbert-base-uncased-finetuned-osdg/tokenizer_config.json
Special tokens file saved in distilbert-base-uncased-finetuned-osdg/special_tokens_map.json
Several commits (2) will be pushed upstream.
The progress bars may be unreliable.



remote: Enforcing permissions...        
remote: Allowed refs: all        
To https://huggingface.co/peter2000/distilbert-base-uncased-finetuned-osdg
   b2d2638..f57b75d  main -> main

Dropping the following result as it does not have all the necessary fields:
{'task': {'name': 'Text Classification', 'type': 'text-classification'}}
remote: Enforcing permissions...        
remote: Allowed refs: all        
To https://huggingface.co/peter2000/distilbert-base-uncased-finetuned-osdg
   f57b75d..bd31a4b  main -> main



'https://huggingface.co/peter2000/distilbert-base-uncased-finetuned-osdg/commit/f57b75da2e220aa49ed5a1767d4e6c619615af58'

In [None]:
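# Not executed here. Trainer.push_to_hub() takes a commit message as its first positional
# argument, so check the signature of your installed version before running this call.
trainer.push_to_hub('sdg_classifier', organization="Deutsche Gesellschaft für internationale Zusammenarbeit")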

In [None]:
trainer.train()