# Training German BERT for German text classification

In this notebook I fine-tune a pretrained German model for text classification on the 10kGNAD dataset. Since several pretrained German models and the dataset are available on the Hugging Face Hub, I use the Hub to load the data and train the classifier. I personally trained it on Colab GPUs.

In [2]:
# On Colab, the following libraries need to be installed:
!pip install transformers datasets


In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification, 
                          TrainingArguments, Trainer)
import numpy as np
from sklearn.preprocessing import OrdinalEncoder
from datasets import (load_dataset_builder, load_dataset, 
                      get_dataset_split_names, DatasetDict)

## Loading using 🤗 Datasets

As the dataset is available on the [Hugging Face Hub](https://huggingface.co/datasets/gnad10), I load it directly from there for easier use in the training steps. The dataset is already split (90% train, 10% test) with a shuffled stratified split, because the classes are imbalanced.

In [4]:
# First, print some information about the dataset and store the number of labels
dataset_builder = load_dataset_builder('gnad10')
print(dataset_builder.cache_dir)
print(dataset_builder.info.features)
print(dataset_builder.info.splits)
n_labels = dataset_builder.info.features['label'].num_classes


/root/.cache/huggingface/datasets/gnad10/default/1.1.0/3a8445be65795ad88270af4d797034c3d99f70f8352ca658c586faf1cf960881
{'text': Value(dtype='string', id=None), 'label': ClassLabel(num_classes=9, names=['Web', 'Panorama', 'International', 'Wirtschaft', 'Sport', 'Inland', 'Etat', 'Wissenschaft', 'Kultur'], names_file=None, id=None)}
{'train': SplitInfo(name='train', num_bytes=24418224, num_examples=9245, dataset_name='gnad10'), 'test': SplitInfo(name='test', num_bytes=2756405, num_examples=1028, dataset_name='gnad10')}


In [5]:
dataset = load_dataset('gnad10')


The dataset only has train and test splits, so it is convenient to divide the original train split into train and validation sets to get feedback during training. The version of the datasets library used here does not offer a stratify option, so the split is random (more recent releases add a `stratify_by_column` argument to `train_test_split`).

In [6]:
# Hold out 10% of the original train split for validation
split = dataset['train'].train_test_split(test_size=0.1, shuffle=True, seed=42)
train_dataset, val_dataset = split['train'], split['test']
# Update the variable with the newly split datasets
dataset = DatasetDict({
    'train': train_dataset,
    'validation': val_dataset,
    'test': dataset['test']})
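
Because the split above is random, the class proportions of the validation set can drift away from those of the training set. A minimal sketch of a stratified alternative in plain Python: group the row indices by label, shuffle each group, and hold out the same fraction of every group. The helper name `stratified_indices` and the toy labels are illustrative and not part of the original notebook.

```python
import random
from collections import defaultdict


def stratified_indices(labels, test_size=0.1, seed=42):
    """Split row indices so each label keeps roughly the same share in both parts."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, val_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Hold out at least one example per class
        # (assumes every class has more than one example)
        n_val = max(1, round(len(indices) * test_size))
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


# Toy labels standing in for dataset['train']['label'] (hypothetical data)
toy_labels = [0] * 50 + [1] * 30 + [2] * 20
train_idx, val_idx = stratified_indices(toy_labels)
print(len(train_idx), len(val_idx))
```

With the real data, the two index lists could be passed to `Dataset.select` to build the train and validation subsets.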

## Training with 🤗 Transformers

For training, I will use the [German BERT model by deepset](https://huggingface.co/bert-base-german-cased), which is available on the Hugging Face Hub. That model was evaluated on this dataset, where it achieved ~90% accuracy, slightly better than multilingual BERT, so I will try to get similar results by fine-tuning it on the current dataset.

In [7]:
tokenizer = AutoTokenizer.from_pretrained('bert-base-german-cased')


In [8]:
# tokenizing the whole dataset
encoded_dataset = dataset.map(lambda examples: tokenizer(examples['text'], truncation=True, padding='max_length'), batched=True)


In [None]:
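# No manual device transfer is needed here: a DatasetDict has no .to() method,
# and the Trainer moves each batch to the GPU automatically when one is available.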

Articles longer than 512 tokens are truncated to their first 512 tokens, the maximum sequence length the model was pretrained with.
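
As a rough illustration of what `truncation=True` does for a single sequence, the sketch below keeps the leading tokens and reserves two positions for the `[CLS]` and `[SEP]` markers, so at most 510 word pieces of an article survive. The function `truncate_for_bert` and the toy tokens are hypothetical; the real tokenizer handles this internally.

```python
MAX_LENGTH = 512  # maximum sequence length of bert-base-german-cased


def truncate_for_bert(tokens, max_length=MAX_LENGTH):
    """Keep the leading tokens and wrap them in BERT's special tokens."""
    # Two positions are reserved for [CLS] and [SEP]
    kept = tokens[: max_length - 2]
    return ["[CLS]"] + kept + ["[SEP]"]


long_article = [f"tok{i}" for i in range(1000)]  # toy stand-in for word pieces
encoded = truncate_for_bert(long_article)
print(len(encoded), encoded[1], encoded[-2])
```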

In [9]:
model = AutoModelForSequenceClassification.from_pretrained('bert-base-german-cased', num_labels=n_labels)


Some weights of the model checkpoint at bert-base-german-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoi
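
An optional refinement, not part of the original run: the model config can carry the human-readable class names, so that predictions are reported as `Sport` rather than as a generic label index. The sketch below builds the two mappings from the label names printed by the dataset builder earlier. Passing them to `from_pretrained` as `id2label` and `label2id` is shown only as a commented-out assumption about how they would be used.

```python
# Label names as printed by dataset_builder.info.features
label_names = ['Web', 'Panorama', 'International', 'Wirtschaft', 'Sport',
               'Inland', 'Etat', 'Wissenschaft', 'Kultur']

# Map class index -> name and name -> class index
id2label = dict(enumerate(label_names))
label2id = {name: idx for idx, name in id2label.items()}

print(id2label[4], label2id['Kultur'])

# Assumed usage when loading the model (not run in the original notebook):
# model = AutoModelForSequenceClassification.from_pretrained(
#     'bert-base-german-cased', num_labels=len(label_names),
#     id2label=id2label, label2id=label2id)
```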

To train the model, I will use PyTorch with the Trainer API, which is optimized for 🤗 Transformers models and offers a wide range of training options along with built-in features such as logging, gradient accumulation, and mixed precision.

In [11]:
training_args = TrainingArguments("/content/drive/MyDrive/colab/",
                                  save_strategy="steps",
                                  evaluation_strategy="steps",
                                  num_train_epochs=2,
                                  load_best_model_at_end=True,
                                  logging_steps=100,
                                  save_steps=100,
                                  report_to='all')
trainer = Trainer(
    model=model, args=training_args, train_dataset=encoded_dataset['train'],
    eval_dataset=encoded_dataset['validation'], tokenizer=tokenizer
)
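
With this configuration the Trainer only reports losses. To track accuracy as well, a `compute_metrics` function can be passed to `Trainer`. The following is a minimal, dependency-free sketch: it assumes the Trainer hands over a `(logits, labels)` pair, and the toy values at the bottom are made up for illustration.

```python
def compute_metrics(eval_pred):
    """Return accuracy from a (logits, labels) pair."""
    logits, labels = eval_pred
    # Predicted class = index of the largest logit in each row
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Toy check with three examples and three classes (made-up numbers)
toy_logits = [[2.0, 0.1, 0.3], [0.2, 1.5, 0.1], [0.9, 0.2, 0.4]]
toy_labels = [0, 1, 2]
metrics = compute_metrics((toy_logits, toy_labels))
print(metrics)
```

It would be wired in with `Trainer(..., compute_metrics=compute_metrics)`, after which each evaluation step should also log the accuracy on the validation set.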

We can finally fine-tune the model:

In [12]:
trainer.train()
trainer.save_model(training_args.output_dir)  # same directory as the checkpoints

The following columns in the training set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text.
***** Running training *****
  Num examples = 8320
  Num Epochs = 2
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 2080
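
The numbers in this log follow from the split sizes and the batch size, and can be checked with a few lines of arithmetic. The rounding convention used below (validation size rounded up, the rest kept for training) is my reading of how the 9,245 original training articles become 8,320 + 925. It matches the log, but it is an assumption about the library's behaviour.

```python
import math

n_original_train = 9245   # size of the original train split
test_size = 0.1           # fraction held out for validation
batch_size = 8            # per-device batch size reported in the log
num_epochs = 2

n_val = math.ceil(n_original_train * test_size)     # validation examples
n_train = n_original_train - n_val                  # training examples
steps_per_epoch = math.ceil(n_train / batch_size)   # optimizer steps per epoch
total_steps = steps_per_epoch * num_epochs

print(n_train, n_val, steps_per_epoch, total_steps)
```

Each epoch therefore covers 1,040 optimization steps, so evaluating every 100 steps gives about ten evaluations per epoch.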


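| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 100  | 1.033         | 0.68742         |
| 200  | 0.6703        | 0.593864        |
| 300  | 0.6579        | 0.495211        |
| 400  | 0.5068        | 0.541656        |
| 500  | 0.5439        | 0.500148        |
| 600  | 0.5429        | 0.472044        |
| 700  | 0.4817        | 0.597054        |
| 800  | 0.5023        | 0.442909        |
| 900  | 0.4756        | 0.492309        |
| 1000 | 0.4772        | 0.451961        |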


The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text.
***** Running Evaluation *****
  Num examples = 925
  Batch size = 8
Saving model checkpoint to /content/drive/MyDrive/colab/checkpoint-100
Configuration saved in /content/drive/MyDrive/colab/checkpoint-100/config.json
Model weights saved in /content/drive/MyDrive/colab/checkpoint-100/pytorch_model.bin
tokenizer config file saved in /content/drive/MyDrive/colab/checkpoint-100/tokenizer_config.json
Special tokens file saved in /content/drive/MyDrive/colab/checkpoint-100/special_tokens_map.json
The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text.
***** Running Evaluation *****
  Num examples = 925
  Batch size = 8
Saving model checkpoint to /content/drive/MyDrive/colab/checkpoint-200
Configuration saved in /content/drive/MyDrive/colab/chec

After training, I pushed the model to [the Hugging Face Hub](https://huggingface.co/Mathking/bert-base-german-cased-gnad10) so that it is easier to reuse afterwards.