<a href="https://colab.research.google.com/github/goerlitz/nlp-classification/blob/main/notebooks/10kGNAD/colab/27_default_huggingface_classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using HuggingFace Trainer for Classification

Inspired by
 * HuggingFace Tutorial - [Fine Tuning a pretrained model](https://huggingface.co/course/chapter3)
 * HuggingFace example notebook - [text_classification.ipynb](https://github.com/huggingface/notebooks/blob/master/examples/text_classification.ipynb)

## Steps

1. Download 10kGNAD dataset
2. Create `DatasetDict` for *train* and *test* data
3. Load Tokenizer
4. Tokenize dataset
5. Define Training Parameters
6. Create Model
7. Train Model

## Prerequisites

In [1]:
checkpoint = "distilbert-base-german-cased"

In [2]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Select the Runtime > "Change runtime type" menu to enable a GPU accelerator, ')
  print('and then re-execute this cell.')
else:
  print(gpu_info)

Thu Jul  1 22:14:53 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.27       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   46C    P0    27W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [3]:
# install transformers
!pip install -q --upgrade tqdm==4.47.0 >/dev/null
!pip install -q --upgrade transformers datasets wandb >/dev/null

# check installed version
!pip freeze | grep transformers
# transformers==4.8.2

transformers==4.8.2


In [4]:
import pandas as pd
import numpy as np
from pathlib import Path
import os

from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import logging
import wandb

# hide progress bar when downloading tokenizers - a workaround!
logging.get_verbosity = lambda : logging.NOTSET

# disable transformer warnings like "Some weights of the model checkpoint ..."
logging.set_verbosity_error()

# disable logging of wandb
os.environ["WANDB_SILENT"] = "true"

## Load Dataset from HuggingFace

In [5]:
from datasets import load_dataset

dataset_name = "gnad10"
raw_datasets = load_dataset(dataset_name)

print(raw_datasets)
raw_datasets['train'].to_pandas().head()

Using custom data configuration default



Downloading and preparing dataset gnad10/default (download: 25.90 MiB, generated: 25.92 MiB, post-processed: Unknown size, total: 51.82 MiB) to /root/.cache/huggingface/datasets/gnad10/default/1.1.0/3a8445be65795ad88270af4d797034c3d99f70f8352ca658c586faf1cf960881...


Dataset gnad10 downloaded and prepared to /root/.cache/huggingface/datasets/gnad10/default/1.1.0/3a8445be65795ad88270af4d797034c3d99f70f8352ca658c586faf1cf960881. Subsequent calls will reuse this data.
DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 9245
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 1028
    })
})


|   | label | text |
|---|---|---|
| 0 | 4 | 21-Jähriger fällt wohl bis Saisonende aus. Wie... |
| 1 | 8 | Erfundene Bilder zu Filmen, die als verloren g... |
| 2 | 0 | Der frischgekürte CEO Sundar Pichai setzt auf ... |
| 3 | 3 | Putin: "Einigung, dass wir Menge auf Niveau vo... |
| 4 | 5 | Estland sieht den künftigen österreichischen P... |


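## Load Dataset from Web

Download the 10k German News Articles Dataset (10kGNAD) as CSV files from GitHub. Unlike the HuggingFace version loaded above, these files contain the category names as strings, and this copy is the one used in the rest of the notebook.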

In [6]:
%env DIR=data

!mkdir -p $DIR
!wget -nc https://github.com/tblock/10kGNAD/blob/master/train.csv?raw=true -nv -O $DIR/train.csv
!wget -nc https://github.com/tblock/10kGNAD/blob/master/test.csv?raw=true -nv -O $DIR/test.csv
!ls -lAh $DIR | cut -d " " -f 5-

env: DIR=data
2021-07-01 22:15:32 URL:https://raw.githubusercontent.com/tblock/10kGNAD/master/train.csv [24405789/24405789] -> "data/train.csv" [1]
2021-07-01 22:15:33 URL:https://raw.githubusercontent.com/tblock/10kGNAD/master/test.csv [2755020/2755020] -> "data/test.csv" [1]

2.7M Jul  1 22:15 test.csv
 24M Jul  1 22:15 train.csv


## Create DatasetDict

In [7]:
from datasets import DatasetDict

data_dir = Path(os.getenv("DIR"))

csv_files = {
    "train": str(data_dir / 'train.csv'),
    "test": str(data_dir / 'test.csv'),
}

columns = ["labels", "text"]

gnad10k_ds = DatasetDict.from_csv(csv_files, sep=";", quotechar="'", names=columns)

print(gnad10k_ds)
gnad10k_ds['train'].to_pandas().head()

Using custom data configuration default-576082dbcf589892


Downloading and preparing dataset csv/default (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /root/.cache/huggingface/datasets/csv/default-576082dbcf589892/0.0.0...


Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-576082dbcf589892/0.0.0. Subsequent calls will reuse this data.
DatasetDict({
    train: Dataset({
        features: ['labels', 'text'],
        num_rows: 9245
    })
    test: Dataset({
        features: ['labels', 'text'],
        num_rows: 1028
    })
})


|   | labels | text |
|---|---|---|
| 0 | Sport | 21-Jähriger fällt wohl bis Saisonende aus. Wie... |
| 1 | Kultur | Erfundene Bilder zu Filmen, die als verloren g... |
| 2 | Web | Der frischgekürte CEO Sundar Pichai setzt auf ... |
| 3 | Wirtschaft | Putin: "Einigung, dass wir Menge auf Niveau vo... |
| 4 | Inland | Estland sieht den künftigen österreichischen P... |
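To make the `sep`, `quotechar` and `names` arguments above more concrete, here is a minimal sketch using only Python's standard `csv` module. The two sample lines are hypothetical and only mimic the layout implied by the call (no header row, `;` as separator, `'` as quote character); they are not taken from the actual files.

```python
import csv
import io

# Hypothetical sample in the assumed layout of train.csv:
# no header row, ';' between label and text, "'" as quote character.
sample = (
    "Sport;21-Jähriger fällt wohl bis Saisonende aus.\n"
    "Web;'Text mit Semikolon; in Anführungszeichen'\n"
)

# Column names are supplied by hand because the files have no header.
columns = ["labels", "text"]

reader = csv.reader(io.StringIO(sample), delimiter=";", quotechar="'")
rows = [dict(zip(columns, fields)) for fields in reader]

for row in rows:
    print(row)
```

The second row shows why the quote character matters: the semicolon inside the quoted text is kept as part of the `text` field instead of starting a third column.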


## Load Tokenizer

In [8]:
%%time
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

CPU times: user 7.94 s, sys: 1.63 s, total: 9.57 s
Wall time: 14.2 s


### Encode Labels

In [9]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(gnad10k_ds['train']['labels'])

encoded_ds = gnad10k_ds.map(lambda ds: {'labels': le.transform(ds['labels'])}, batched=True)

encoded_ds['train'].to_pandas().head()

|   | labels | text |
|---|---|---|
| 0 | 5 | 21-Jähriger fällt wohl bis Saisonende aus. Wie... |
| 1 | 3 | Erfundene Bilder zu Filmen, die als verloren g... |
| 2 | 6 | Der frischgekürte CEO Sundar Pichai setzt auf ... |
| 3 | 7 | Putin: "Einigung, dass wir Menge auf Niveau vo... |
| 4 | 1 | Estland sieht den künftigen österreichischen P... |
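The integer ids above follow from how `LabelEncoder` works: it sorts the unique class names and uses each name's position as its id. The following plain-Python sketch mimics that behaviour. The list of nine category names is an assumption (the notebook never prints them), but it reproduces the ids shown for the first five rows.

```python
# Assumed category names of 10kGNAD (not listed in this notebook).
categories = ["Web", "Panorama", "International", "Wirtschaft", "Sport",
              "Inland", "Etat", "Wissenschaft", "Kultur"]

# Like LabelEncoder: sort the unique names, then enumerate them.
classes = sorted(set(categories))
label2id = {name: idx for idx, name in enumerate(classes)}
id2label = {idx: name for name, idx in label2id.items()}

# Labels of the first five training rows, as shown earlier.
first_rows = ["Sport", "Kultur", "Web", "Wirtschaft", "Inland"]
print([label2id[name] for name in first_rows])  # [5, 3, 6, 7, 1]
```

With the fitted encoder from the cell above, `le.classes_` holds the sorted names and `le.inverse_transform` maps predicted ids back to category names.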


### Tokenize Data

In [10]:
%%time
tokenized_ds = encoded_ds.map(lambda ds: tokenizer(ds['text'], truncation=True, padding=True), batched=True)

print(tokenized_ds)
display(tokenized_ds['train'].to_pandas().head())


DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'input_ids', 'labels', 'text'],
        num_rows: 9245
    })
    test: Dataset({
        features: ['attention_mask', 'input_ids', 'labels', 'text'],
        num_rows: 1028
    })
})


|   | attention_mask | input_ids | labels | text |
|---|---|---|---|---|
| 0 | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | [102, 1735, 232, 19231, 693, 5844, 2134, 378, ... | 5 | 21-Jähriger fällt wohl bis Saisonende aus. Wie... |
| 1 | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | [102, 11806, 646, 30881, 4195, 205, 13165, 818... | 3 | Erfundene Bilder zu Filmen, die als verloren g... |
| 2 | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | [102, 351, 13236, 124, 7847, 123, 26074, 12309... | 6 | Der frischgekürte CEO Sundar Pichai setzt auf ... |
| 3 | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | [102, 16679, 853, 224, 12205, 818, 377, 268, 5... | 7 | Putin: "Einigung, dass wir Menge auf Niveau vo... |
| 4 | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | [102, 18600, 2671, 190, 13458, 13239, 30882, 5... | 1 | Estland sieht den künftigen österreichischen P... |


CPU times: user 22.7 s, sys: 454 ms, total: 23.2 s
Wall time: 13.2 s


In [11]:
# look at result of tokenization
print(tokenizer.convert_ids_to_tokens(tokenized_ds['train']['input_ids'][0]))

['[CLS]', '21', '-', 'Jähr', '##iger', 'fällt', 'wohl', 'bis', 'Saisonende', 'aus', '.', 'Wien', '–', 'Rapid', 'muss', 'wohl', 'bis', 'Saisonende', 'auf', 'Offensiv', '##spieler', 'Thomas', 'Mur', '##g', 'verzichten', '.', 'Der', 'im', 'Winter', 'aus', 'Ried', 'gekommen', '##e', '21', '-', 'Jähr', '##ige', 'erlitt', 'beim', '0', ':', '4', '-', 'Heim', '##deb', '##akel', 'gegen', 'Ad', '##mir', '##a', 'Wa', '##cker', 'Mö', '##dl', '##ing', 'am', 'Samstag', 'einen', 'Teil', '##riss', 'des', 'Innen', '##band', '##es', 'im', 'linken', 'Knie', ',', 'wie', 'eine', 'Magnet', '##res', '##onanz', '-', 'Untersuchung', 'am', 'Donnerstag', 'ergab', '.', 'Mur', '##g', 'erhielt', 'eine', 'Schiene', ',', 'muss', 'aber', 'nicht', 'oper', '##iert', 'werden', '.', 'Dennoch', 'steht', 'ihm', 'eine', 'mehr', '##wöch', '##ige', 'Pause', 'bevor', '.', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD
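In the token list above, pieces starting with `##` continue the previous token, while `[CLS]`, `[SEP]` and `[PAD]` are special tokens added by the tokenizer. The sketch below shows how such pieces could be merged back into words. The helper `merge_wordpieces` is hypothetical and the token list is a shortened copy of the output above, so treat it as an illustration rather than the tokenizer's own decoding logic.

```python
# Shortened, hand-copied excerpt of the token output above.
tokens = ['[CLS]', '21', '-', 'Jähr', '##iger', 'fällt', 'wohl', 'bis',
          'Saisonende', 'aus', '.', '[SEP]', '[PAD]', '[PAD]']

SPECIAL_TOKENS = {'[CLS]', '[SEP]', '[PAD]'}

def merge_wordpieces(tokens):
    """Drop special tokens and glue '##' continuation pieces to the previous token."""
    words = []
    for tok in tokens:
        if tok in SPECIAL_TOKENS:
            continue
        if tok.startswith('##') and words:
            words[-1] += tok[2:]  # strip the '##' marker and append
        else:
            words.append(tok)
    return words

print(merge_wordpieces(tokens))
# ['21', '-', 'Jähriger', 'fällt', 'wohl', 'bis', 'Saisonende', 'aus', '.']
```

For real use, `tokenizer.decode(input_ids, skip_special_tokens=True)` converts ids back to text directly.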

In [12]:
# tokenized_ds.set_format("torch", columns=["input_ids", "attention_mask", "labels"])

In [13]:
# check size of tokenized data
samples = tokenized_ds["train"][:8]
samples = {
    k: v for k, v in samples.items() if k not in ["text"]
}
[len(x) for x in samples["input_ids"]]

[512, 512, 512, 512, 512, 512, 512, 512]

In [14]:
samples.keys()

dict_keys(['attention_mask', 'input_ids', 'labels'])

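### Apply Padding per Batch (GPU only)

`DataCollatorWithPadding` pads each batch to the length of its longest sequence (dynamic padding). TPUs work best with fixed input shapes, so dynamic padding is only recommended on GPU. Note that the texts were already padded during tokenization (`padding=True`), so the samples checked above all have length 512 and the collator has nothing left to shorten.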

In [15]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [16]:
batch = data_collator(samples)
{k: v.shape for k, v in batch.items()}

{'attention_mask': torch.Size([8, 512]),
 'input_ids': torch.Size([8, 512]),
 'labels': torch.Size([8])}
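To show the difference between dynamic (per-batch) and fixed-length padding, here is a small plain-Python sketch. The helper `pad_batch`, the token ids and the pad id `0` are assumptions made for illustration; the helper does not truncate and is not how `DataCollatorWithPadding` is implemented internally.

```python
def pad_batch(batch, pad_id=0, max_length=None):
    """Pad all sequences to max_length, or to the longest sequence if None."""
    target = max_length or max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (target - len(seq)) for seq in batch]
    # 1 marks real tokens, 0 marks padding positions.
    attention_mask = [[1] * len(seq) + [0] * (target - len(seq)) for seq in batch]
    return input_ids, attention_mask

# Hypothetical token ids of three short texts.
batch = [[102, 1735, 103], [102, 351, 13236, 124, 103], [102, 103]]

dynamic_ids, dynamic_mask = pad_batch(batch)       # pad to longest in batch
fixed_ids, _ = pad_batch(batch, max_length=8)      # pad to a fixed length

print([len(x) for x in dynamic_ids])  # [5, 5, 5]
print([len(x) for x in fixed_ids])    # [8, 8, 8]
print(dynamic_mask[0])                # [1, 1, 1, 0, 0]
```

With dynamic padding, a batch of short texts stays short, which saves computation compared to padding everything to 512 tokens. To benefit from it in this notebook, the tokenization step would presumably have to be run without `padding=True`.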

## Training

In [17]:
# Initialize a new wandb run
wandb.init(project="vanilla_huggingface");


In [18]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="10kgnad_distilbert_default",
    num_train_epochs=2,
    # eval_steps=500,
    # evaluation_strategy="steps",
    evaluation_strategy="epoch",
    per_device_train_batch_size=32,
    disable_tqdm=False,
    # fp16=True,
)
print(training_args)

TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_find_unused_parameters=None,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_steps=500,
evaluation_strategy=IntervalStrategy.EPOCH,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
gradient_accumulation_steps=1,
greater_is_better=None,
group_by_length=False,
ignore_data_skip=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=5e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=-1,
log_level=-1,
log_level_replica=-1,
log_on_each_node=True,
logging_dir=10kgnad_distilbert_default/runs/Jul01_22-16-22_ce3d58451c13,
logging_first_step=False,
logging_steps=500,
logging_strategy=IntervalStrategy.STEPS,
lr_scheduler_type=SchedulerType.LINEAR,
max_grad_norm=1.0,
max_steps=-

## Create Model

In [None]:
num_labels = len(set(gnad10k_ds["train"]["labels"]))
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=num_labels)

In [19]:
from sklearn.metrics import f1_score, accuracy_score, precision_score, recall_score

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    preds = np.argmax(logits, axis=-1)
    return {
        "acc": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds, average='macro'),
        "precision": precision_score(labels, preds, average='macro'),
        "recall": recall_score(labels, preds, average='macro'),
        }
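
As a rough illustration of what `average='macro'` means in the metrics above: the score is computed per class and then averaged with equal weight per class, regardless of class size. The plain-Python sketch below does this for F1 on a tiny made-up example. The function `macro_f1` and the logits are hypothetical; it is meant to explain the idea, not to replace `sklearn.metrics.f1_score`.

```python
def macro_f1(labels, preds):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(labels) | set(preds))
    scores = []
    for c in classes:
        tp = sum(1 for y, p in zip(labels, preds) if y == c and p == c)
        fp = sum(1 for y, p in zip(labels, preds) if y != c and p == c)
        fn = sum(1 for y, p in zip(labels, preds) if y == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        scores.append(2 * precision * recall / total if total else 0.0)
    return sum(scores) / len(scores)

# Made-up logits for four samples and three classes.
logits = [[2.0, 0.1, 0.3], [0.2, 1.5, 0.1], [0.9, 0.8, 0.1], [0.1, 0.2, 3.0]]
labels = [0, 1, 1, 2]

# Same idea as np.argmax(logits, axis=-1): index of the largest logit per row.
preds = [max(range(len(row)), key=row.__getitem__) for row in logits]

print(preds)                    # [0, 1, 0, 2]
print(macro_f1(labels, preds))
```

Here classes 0 and 1 each get an F1 of 2/3 and class 2 gets 1.0, so the macro average is 7/9.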

In [20]:
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_ds["train"].remove_columns('text'),
    eval_dataset=tokenized_ds["test"].remove_columns('text'),
    # data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)

In [21]:
%%time
trainer.train()

***** Running training *****
  Num examples = 9245
  Num Epochs = 2
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 578
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"


| Epoch | Training Loss | Validation Loss | Acc | F1 | Precision | Recall |
|---|---|---|---|---|---|---|
| 1 | No log | 0.394606 | 0.86965 | 0.869866 | 0.867372 | 0.873857 |
| 2 | 0.523900 | 0.356992 | 0.879377 | 0.878296 | 0.878315 | 0.87898 |


***** Running Evaluation *****
  Num examples = 1028
  Batch size = 8
Saving model checkpoint to 10kgnad_distilbert_default/checkpoint-500
Configuration saved in 10kgnad_distilbert_default/checkpoint-500/config.json
Model weights saved in 10kgnad_distilbert_default/checkpoint-500/pytorch_model.bin
tokenizer config file saved in 10kgnad_distilbert_default/checkpoint-500/tokenizer_config.json
Special tokens file saved in 10kgnad_distilbert_default/checkpoint-500/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 1028
  Batch size = 8


Training completed. Do not forget to share your model on huggingface.co/models =)




CPU times: user 8min 39s, sys: 3.3 s, total: 8min 42s
Wall time: 8min 41s


TrainOutput(global_step=578, training_loss=0.49043702914228077, metrics={'train_runtime': 521.1258, 'train_samples_per_second': 35.481, 'train_steps_per_second': 1.109, 'total_flos': 3828737593866240.0, 'train_loss': 0.49043702914228077, 'epoch': 2.0})
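
The step count in the training log can be reproduced by hand. With `dataloader_drop_last=False`, each epoch runs ceil(9245 / 32) batches. The sketch below redoes this arithmetic with values printed in the log and in `TrainingArguments`; it is a back-of-the-envelope check, not part of the training code.

```python
import math

num_examples = 9245    # "Num examples" in the training log
batch_size = 32        # per_device_train_batch_size, single GPU
num_epochs = 2
logging_steps = 500    # from the printed TrainingArguments

# The last, smaller batch still counts as a step (drop_last is False).
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

# Epoch in which the first training loss gets logged.
first_logged_epoch = math.ceil(logging_steps / steps_per_epoch)

print(steps_per_epoch, total_steps, first_logged_epoch)  # 289 578 2
```

Since `logging_steps=500` exceeds the 289 steps of one epoch, the first training loss is only logged during epoch 2, which is consistent with the `No log` entry for epoch 1 in the results table.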

In [22]:
%%time
trainer.evaluate(eval_dataset=tokenized_ds["test"])

The following columns in the evaluation set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text.
***** Running Evaluation *****
  Num examples = 1028
  Batch size = 8


CPU times: user 10.4 s, sys: 128 ms, total: 10.5 s
Wall time: 10.5 s


{'epoch': 2.0,
 'eval_acc': 0.8793774319066148,
 'eval_f1': 0.8782962921477768,
 'eval_loss': 0.3569917678833008,
 'eval_precision': 0.8783149254885126,
 'eval_recall': 0.8789796109806409,
 'eval_runtime': 10.4422,
 'eval_samples_per_second': 98.446,
 'eval_steps_per_second': 12.354}
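
The throughput numbers in this dictionary can be sanity-checked from the example count, the evaluation batch size of 8 shown in the log, and the reported runtime. This is a quick arithmetic sketch; small deviations in the last digit are expected because `eval_runtime` is itself rounded.

```python
import math

num_eval_examples = 1028   # "Num examples" in the evaluation log
eval_batch_size = 8        # "Batch size" in the evaluation log
eval_runtime = 10.4422     # seconds, as reported above

# Number of evaluation batches, including the last partial one.
eval_steps = math.ceil(num_eval_examples / eval_batch_size)

samples_per_second = num_eval_examples / eval_runtime
steps_per_second = eval_steps / eval_runtime

print(eval_steps)  # 129
print(round(samples_per_second, 2), round(steps_per_second, 2))
```

Both values come out close to the reported `eval_samples_per_second` and `eval_steps_per_second`.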

In [23]:
wandb.finish()