<a href="https://colab.research.google.com/github/goerlitz/nlp-classification/blob/main/notebooks/10kGNAD/colab/20_transformer_tokenization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Analysis of Tokenizers used by German Transformer Models

Most NLP algorithms start by tokenizing the input text, i.e. splitting a document, paragraph, or sentence into tokens that represent words, subwords, and punctuation.
German differs from English in ways that affect tokenization, e.g. compound words and umlauts, so the tokenizers of pretrained German models can split the same text differently.
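
To make the effect of compound words concrete, here is a minimal sketch of greedy longest-match-first subword splitting, the strategy used by WordPiece tokenizers. The function `wordpiece` and both vocabularies are hypothetical and only serve as an illustration; real models ship vocabularies with roughly 30,000 entries.

```python
from typing import List, Set


def wordpiece(word: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Split a single word greedily into the longest matching vocabulary pieces."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # nothing matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabularies, not taken from any real model.
vocab_small = {"Saison", "##ende", "Jähr", "##iger"}
vocab_large = vocab_small | {"Saisonende"}

print(wordpiece("Saisonende", vocab_small))  # ['Saison', '##ende']
print(wordpiece("Saisonende", vocab_large))  # ['Saisonende']
print(wordpiece("Jähriger", vocab_small))    # ['Jähr', '##iger']
```

Whether a compound such as "Saisonende" stays one token or is split depends only on what the vocabulary contains, which is why models with different vocabularies produce different token counts for the same text.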

## Objective

* Investigate how tokenization is done in different pretrained German Transformer models.

## Learnings

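* The German Transformer models use different vocabulary sizes (30,000 to 32,000 tokens for the models shown in the tables below).
* Some tokenizers (those of the uncased models) convert all tokens to lower case.
* Some tokenizers strip accents from the tokens, which turns German umlauts into their base letters (e.g. `Jähr` becomes `jahr`).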
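
The last two points can be reproduced with the standard library alone. The sketch below assumes that accent stripping works like the usual BERT-style normalization (Unicode NFD decomposition followed by removal of combining marks); the function `normalize` is a hypothetical helper, not part of any tokenizer API.

```python
import unicodedata


def normalize(text: str, do_lower_case: bool, strip_accents: bool) -> str:
    """Apply lower casing and accent stripping to a text."""
    if do_lower_case:
        text = text.lower()
    if strip_accents:
        # NFD splits 'ä' into 'a' plus a combining diaeresis (category 'Mn').
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return text


print(normalize("21-Jähriger fällt", do_lower_case=True, strip_accents=True))
# 21-jahriger fallt
print(normalize("21-Jähriger fällt", do_lower_case=False, strip_accents=False))
# 21-Jähriger fällt
```

For German this is a lossy step: after stripping, "fällt" and "fallt" are indistinguishable.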

**IMPORTANT:** A classification model of the `SimpleTransformers` library that is based on a pretrained **uncased** German Transformer model does not automatically apply the correct lower-case setting; `do_lower_case` has to be set explicitly.

## Prerequisites

In [1]:
# install transformers
!pip install -q -U tqdm==4.47.0 transformers simpletransformers >/dev/null

# check installed version
!pip freeze | grep transformers
# simpletransformers==0.61.6
# transformers==4.6.1

ERROR: google-colab 1.0.0 has requirement ipykernel~=4.10, but you'll have ipykernel 5.5.5 which is incompatible.
simpletransformers==0.61.6
transformers==4.6.1


In [2]:
import pandas as pd
from pathlib import Path

from transformers import AutoTokenizer, AutoModel
from simpletransformers.classification import ClassificationModel

# try to hide the progress bar when downloading tokenizers (monkey-patch workaround)
from transformers import logging
logging.get_verbosity = lambda: logging.NOTSET

# PROBLEM: this is not possible through the regular API, because the check in
# https://github.com/huggingface/transformers/blob/master/src/transformers/file_utils.py
# 'disable=bool(logging.get_verbosity() == logging.NOTSET)' can never be True:
# 'get_verbosity()' always returns the effective log level.
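
The behaviour described in the comment can be illustrated with the standard `logging` module. This is only a sketch under the assumption that the library reads its verbosity via `getEffectiveLevel()`, as the comment suggests; the logger name `demo_library` is hypothetical.

```python
import logging

# The root logger defaults to WARNING; it is set explicitly here for clarity.
logging.getLogger().setLevel(logging.WARNING)

lib_logger = logging.getLogger("demo_library")  # hypothetical library logger
lib_logger.setLevel(logging.NOTSET)

# The logger's own level is NOTSET ...
print(lib_logger.level == logging.NOTSET)                 # True
# ... but the effective level is inherited from the root logger.
print(lib_logger.getEffectiveLevel() == logging.WARNING)  # True
```

A logger set to `NOTSET` delegates to its ancestors, so a comparison of the effective level against `NOTSET` fails as long as any ancestor has a level configured.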

## Download Data

This notebook uses the [10k German News Articles Dataset](https://tblock.github.io/10kGNAD/) (10kGNAD).

In [3]:
%env DIR=data

!mkdir -p $DIR
!wget -nc "https://github.com/tblock/10kGNAD/blob/master/train.csv?raw=true" -nv -O $DIR/train.csv
!wget -nc "https://github.com/tblock/10kGNAD/blob/master/test.csv?raw=true" -nv -O $DIR/test.csv
!ls -lAh $DIR | cut -d " " -f 5-

env: DIR=data
2021-06-12 22:06:00 URL:https://raw.githubusercontent.com/tblock/10kGNAD/master/train.csv [24405789/24405789] -> "data/train.csv" [1]
2021-06-12 22:06:02 URL:https://raw.githubusercontent.com/tblock/10kGNAD/master/test.csv [2755020/2755020] -> "data/test.csv" [1]

2.7M Jun 12 22:06 test.csv
 24M Jun 12 22:06 train.csv


## Import Data

In [4]:
data_dir = Path("data/")

train_file = data_dir / 'train.csv'
test_file = data_dir / 'test.csv'

def read_csv_10kGNAD(filepath: Path, columns=("labels", "text")) -> pd.DataFrame:
    """Load a 10kGNAD csv file (no header row, ';' as separator, "'" as quote character)."""
    # a tuple default avoids sharing one mutable list between calls
    return pd.read_csv(filepath, sep=";", quotechar="'", names=list(columns))
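
The file format handled above can be illustrated without downloading anything. The following sketch parses two made-up rows with the standard `csv` module, using the same separator and quote character as the `read_csv` call; the sample rows are hypothetical and not taken from the dataset.

```python
import csv
import io

# Hypothetical sample in the 10kGNAD layout: label;text, with ' as quote character.
sample = "Sport;'Ein Text; mit Semikolon'\nWeb;Kurzer Text\n"

rows = list(csv.reader(io.StringIO(sample), delimiter=";", quotechar="'"))

for label, text in rows:
    print(f"{label}: {text}")
# Sport: Ein Text; mit Semikolon
# Web: Kurzer Text
```

The quote character matters because article texts can contain semicolons themselves; without it, such a text would be split into extra columns.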

In [5]:
train_df = read_csv_10kGNAD(train_file)
print(f"{train_df.shape[0]:,} articles")
display(train_df.head())

9,245 articles


Unnamed: 0,labels,text
0,Sport,21-Jähriger fällt wohl bis Saisonende aus. Wie...
1,Kultur,"Erfundene Bilder zu Filmen, die als verloren g..."
2,Web,Der frischgekürte CEO Sundar Pichai setzt auf ...
3,Wirtschaft,"Putin: ""Einigung, dass wir Menge auf Niveau vo..."
4,Inland,Estland sieht den künftigen österreichischen P...


## Create Tokenizers

Load the tokenizers of different pretrained German Transformer models (plus two multilingual models for comparison).

In [6]:
models = [
          "bert-base-german-cased",
          "distilbert-base-german-cased",
          "dbmdz/bert-base-german-cased",
          "dbmdz/bert-base-german-uncased",
          "dbmdz/bert-base-german-europeana-cased",
          "dbmdz/bert-base-german-europeana-uncased",
          "dbmdz/distilbert-base-german-europeana-cased",
          "deepset/gbert-base",
          "deepset/gbert-large",
          "deepset/gelectra-base",
          "deepset/gelectra-large",
          "german-nlp-group/electra-base-german-uncased",
          "bert-base-multilingual-cased",
          "distilbert-base-multilingual-cased",
]

In [7]:
%%time
toks = {}

for model_name in models:
    print(f"loading tokenizer for {model_name}")
    toks[model_name] = AutoTokenizer.from_pretrained(model_name)

loading tokenizer for bert-base-german-cased
loading tokenizer for distilbert-base-german-cased
loading tokenizer for dbmdz/bert-base-german-cased
loading tokenizer for dbmdz/bert-base-german-uncased
loading tokenizer for dbmdz/bert-base-german-europeana-cased
loading tokenizer for dbmdz/bert-base-german-europeana-uncased
loading tokenizer for dbmdz/distilbert-base-german-europeana-cased
loading tokenizer for deepset/gbert-base
loading tokenizer for deepset/gbert-large
loading tokenizer for deepset/gelectra-base
loading tokenizer for deepset/gelectra-large
loading tokenizer for german-nlp-group/electra-base-german-uncased
loading tokenizer for bert-base-multilingual-cased
loading tokenizer for distilbert-base-multilingual-cased
CPU times: user 2.65 s, sys: 274 ms, total: 2.92 s
Wall time: 1min 43s


### Analyse Tokenizer Settings

In [8]:
def getTokenizerParams(toks: dict) -> pd.DataFrame:
    """Extract Tokenizer parameters."""

    props = ["name_or_path", "do_lower_case", "vocab_size",
             "model_max_length", "padding_side", "is_fast",
             "unk_token", "sep_token", "pad_token", "cls_token", "mask_token" ]

    # put tokenizer parameters in DataFrame
    conf = [{p:getattr(v, p) for p in props} for (k,v) in toks.items()]
    conf_df = pd.DataFrame(conf).set_index("name_or_path")

    # get 'strip_accents', which is not exposed as a tokenizer attribute
    # (read from the internal normalizer, so this only works for fast tokenizers)
    strip = {k:v._tokenizer.normalizer.strip_accents for (k,v) in toks.items()}
    strip_s = pd.Series(strip, name="strip_accents")

    # combine both
    return pd.concat([strip_s, conf_df], axis=1)

params_df = getTokenizerParams(toks)
display(params_df.reset_index())

Unnamed: 0,name_or_path,strip_accents,do_lower_case,vocab_size,model_max_length,padding_side,is_fast,unk_token,sep_token,pad_token,cls_token,mask_token
0,bert-base-german-cased,,False,30000,512,right,True,[UNK],[SEP],[PAD],[CLS],[MASK]
1,distilbert-base-german-cased,,False,31102,512,right,True,[UNK],[SEP],[PAD],[CLS],[MASK]
2,dbmdz/bert-base-german-cased,,False,31102,512,right,True,[UNK],[SEP],[PAD],[CLS],[MASK]
3,dbmdz/bert-base-german-uncased,,True,31102,512,right,True,[UNK],[SEP],[PAD],[CLS],[MASK]
4,dbmdz/bert-base-german-europeana-cased,,False,32000,512,right,True,[UNK],[SEP],[PAD],[CLS],[MASK]
5,dbmdz/bert-base-german-europeana-uncased,,True,32000,512,right,True,[UNK],[SEP],[PAD],[CLS],[MASK]
6,dbmdz/distilbert-base-german-europeana-cased,False,False,32000,512,right,True,[UNK],[SEP],[PAD],[CLS],[MASK]
7,deepset/gbert-base,False,False,31102,512,right,True,[UNK],[SEP],[PAD],[CLS],[MASK]
8,deepset/gbert-large,False,False,31102,512,right,True,[UNK],[SEP],[PAD],[CLS],[MASK]
9,deepset/gelectra-base,False,False,31102,512,right,True,[UNK],[SEP],[PAD],[CLS],[MASK]


## Tokenize Example Text

In [9]:
text = train_df.text[0]
text

'21-Jähriger fällt wohl bis Saisonende aus. Wien – Rapid muss wohl bis Saisonende auf Offensivspieler Thomas Murg verzichten. Der im Winter aus Ried gekommene 21-Jährige erlitt beim 0:4-Heimdebakel gegen Admira Wacker Mödling am Samstag einen Teilriss des Innenbandes im linken Knie, wie eine Magnetresonanz-Untersuchung am Donnerstag ergab. Murg erhielt eine Schiene, muss aber nicht operiert werden. Dennoch steht ihm eine mehrwöchige Pause bevor.'

In [10]:
def applyTokenizer(toks: dict, text: str) -> pd.DataFrame:
    """Apply tokenizers on given text."""
    tokens_s = pd.Series({k:t.tokenize(text) for (k,t) in toks.items()})

    return pd.concat({"length": tokens_s.map(len),
                      "tokens": tokens_s.map(lambda s: "|".join(s)),
                      }, axis=1)

tokens_df = applyTokenizer(toks, text)
display(params_df[["strip_accents", "do_lower_case", "vocab_size"]].join(tokens_df).reset_index())

Unnamed: 0,name_or_path,strip_accents,do_lower_case,vocab_size,length,tokens
0,bert-base-german-cased,,False,30000,106,21|-|J|##ähr|##iger|fällt|wohl|bis|Saisonende|...
1,distilbert-base-german-cased,,False,31102,101,21|-|Jähr|##iger|fällt|wohl|bis|Saisonende|aus...
2,dbmdz/bert-base-german-cased,,False,31102,101,21|-|Jähr|##iger|fällt|wohl|bis|Saisonende|aus...
3,dbmdz/bert-base-german-uncased,,True,31102,98,21|-|jahr|##iger|fall|##t|wohl|bis|saisonende|...
4,dbmdz/bert-base-german-europeana-cased,,False,32000,108,21|-|Jähr|##iger|fällt|wohl|bis|Saison|##ende|...
5,dbmdz/bert-base-german-europeana-uncased,,True,32000,105,21|-|jahr|##iger|fallt|wohl|bis|saison|##ende|...
6,dbmdz/distilbert-base-german-europeana-cased,False,False,32000,108,21|-|Jähr|##iger|fällt|wohl|bis|Saison|##ende|...
7,deepset/gbert-base,False,False,31102,101,21|-|Jähr|##iger|fällt|wohl|bis|Saisonende|aus...
8,deepset/gbert-large,False,False,31102,101,21|-|Jähr|##iger|fällt|wohl|bis|Saisonende|aus...
9,deepset/gelectra-base,False,False,31102,101,21|-|Jähr|##iger|fällt|wohl|bis|Saisonende|aus...


## Problem with Tokenizers in SimpleTransformers Classifiers

The tokenizer that `SimpleTransformers` loads automatically together with the pretrained model does not set `do_lower_case` correctly for uncased models.

In [11]:
import logging

# disable warning about uninitialized weights
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.ERROR)

### Get Tokenizers of Classification Models

In [12]:
%%time
toks_cm = {}

for model_name in models:
    print(f"loading classification model for {model_name}")

    model_type = "electra" if "electra" in model_name else "distilbert" if "distilbert" in model_name else "bert"
    model = ClassificationModel(model_type, model_name, use_cuda=False)

    toks_cm[model_name] = model.tokenizer

loading classification model for bert-base-german-cased
loading classification model for distilbert-base-german-cased
loading classification model for dbmdz/bert-base-german-cased
loading classification model for dbmdz/bert-base-german-uncased
loading classification model for dbmdz/bert-base-german-europeana-cased
loading classification model for dbmdz/bert-base-german-europeana-uncased
loading classification model for dbmdz/distilbert-base-german-europeana-cased
loading classification model for deepset/gbert-base
loading classification model for deepset/gbert-large
loading classification model for deepset/gelectra-base
loading classification model for deepset/gelectra-large
loading classification model for german-nlp-group/electra-base-german-uncased
loading classification model for bert-base-multilingual-cased
loading classification model for distilbert-base-multilingual-cased
CPU times: user 2min 35s, sys: 40.8 s, total: 3min 16s
Wall time: 10min 21s
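
The conditional expression that picks `model_type` in the loop above could be factored into a small helper. This is a sketch of one possible refactoring with the same precedence (`electra` first, then `distilbert`, otherwise `bert`); the name `infer_model_type` is hypothetical and not part of `SimpleTransformers`.

```python
def infer_model_type(model_name: str) -> str:
    """Derive the SimpleTransformers model type from a model name."""
    for model_type in ("electra", "distilbert"):
        if model_type in model_name:
            return model_type
    return "bert"  # fallback for all remaining names in this notebook


print(infer_model_type("deepset/gelectra-base"))         # electra
print(infer_model_type("distilbert-base-german-cased"))  # distilbert
print(infer_model_type("dbmdz/bert-base-german-cased"))  # bert
```

The substring check is only a heuristic that works for the model names used here; it would not recognize other architectures.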


In [13]:
params_cm_df = getTokenizerParams(toks_cm)
display(params_cm_df.reset_index())

Unnamed: 0,name_or_path,strip_accents,do_lower_case,vocab_size,model_max_length,padding_side,is_fast,unk_token,sep_token,pad_token,cls_token,mask_token
0,bert-base-german-cased,,False,30000,512,right,True,[UNK],[SEP],[PAD],[CLS],[MASK]
1,distilbert-base-german-cased,,False,31102,512,right,True,[UNK],[SEP],[PAD],[CLS],[MASK]
2,dbmdz/bert-base-german-cased,,False,31102,512,right,True,[UNK],[SEP],[PAD],[CLS],[MASK]
3,dbmdz/bert-base-german-uncased,,False,31102,512,right,True,[UNK],[SEP],[PAD],[CLS],[MASK]
4,dbmdz/bert-base-german-europeana-cased,,False,32000,512,right,True,[UNK],[SEP],[PAD],[CLS],[MASK]
5,dbmdz/bert-base-german-europeana-uncased,,False,32000,512,right,True,[UNK],[SEP],[PAD],[CLS],[MASK]
6,dbmdz/distilbert-base-german-europeana-cased,False,False,32000,512,right,True,[UNK],[SEP],[PAD],[CLS],[MASK]
7,deepset/gbert-base,False,False,31102,512,right,True,[UNK],[SEP],[PAD],[CLS],[MASK]
8,deepset/gbert-large,False,False,31102,512,right,True,[UNK],[SEP],[PAD],[CLS],[MASK]
9,deepset/gelectra-base,False,False,31102,512,right,True,[UNK],[SEP],[PAD],[CLS],[MASK]


### Compare Tokenizer Parameters

The comparison shows that only the `do_lower_case` setting differs, and only for the uncased models.

In [14]:
params_df.compare(params_cm_df)

name_or_path,do_lower_case (self),do_lower_case (other)
dbmdz/bert-base-german-uncased,1.0,0.0
dbmdz/bert-base-german-europeana-uncased,1.0,0.0
german-nlp-group/electra-base-german-uncased,1.0,0.0
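
A similar check is possible without a reference tokenizer by relying on the naming convention alone. The sketch below assumes that every uncased model has "uncased" in its name, which holds for the models in this notebook but is not guaranteed in general; `find_case_mismatches` is a hypothetical helper.

```python
from typing import Dict, List


def find_case_mismatches(lower_case_flags: Dict[str, bool]) -> List[str]:
    """Return model names whose do_lower_case flag contradicts their name."""
    return [
        name
        for name, flag in lower_case_flags.items()
        if ("uncased" in name) != bool(flag)
    ]


# Flags as reported for the SimpleTransformers tokenizers above.
flags = {
    "dbmdz/bert-base-german-cased": False,
    "dbmdz/bert-base-german-uncased": False,
}
print(find_case_mismatches(flags))  # ['dbmdz/bert-base-german-uncased']

# In the notebook this could be called as:
# find_case_mismatches({k: t.do_lower_case for k, t in toks_cm.items()})
```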


### Problems in Tokenized Texts

Not setting `do_lower_case` correctly leads to `[UNK]` tokens.
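
The mechanism can be sketched with a strongly simplified whole-word lookup: an uncased vocabulary presumably contains only lower-case entries, so capitalized German words find no match unless the text is lower-cased first. The function `lookup` and the vocabulary are hypothetical; a real tokenizer additionally splits words into subwords.

```python
from typing import List, Set


def lookup(words: List[str], vocab: Set[str], do_lower_case: bool) -> List[str]:
    """Map each word to itself if it is in the vocabulary, else to [UNK]."""
    tokens = []
    for word in words:
        key = word.lower() if do_lower_case else word
        tokens.append(key if key in vocab else "[UNK]")
    return tokens


# Hypothetical uncased vocabulary: lower-case entries only.
uncased_vocab = {"wohl", "bis", "saisonende", "wien"}
words = ["wohl", "bis", "Saisonende", "Wien"]

print(lookup(words, uncased_vocab, do_lower_case=False))
# ['wohl', 'bis', '[UNK]', '[UNK]']
print(lookup(words, uncased_vocab, do_lower_case=True))
# ['wohl', 'bis', 'saisonende', 'wien']
```

Since German capitalizes all nouns, a wrong setting affects a large share of the content words.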

In [15]:
# reuse the helper defined above instead of repeating its body
tokens_df = applyTokenizer(toks_cm, text)

display(params_cm_df[["strip_accents", "do_lower_case"]].join(tokens_df).reset_index())

Unnamed: 0,name_or_path,strip_accents,do_lower_case,length,tokens
0,bert-base-german-cased,,False,106,21|-|J|##ähr|##iger|fällt|wohl|bis|Saisonende|...
1,distilbert-base-german-cased,,False,101,21|-|Jähr|##iger|fällt|wohl|bis|Saisonende|aus...
2,dbmdz/bert-base-german-cased,,False,101,21|-|Jähr|##iger|fällt|wohl|bis|Saisonende|aus...
3,dbmdz/bert-base-german-uncased,,False,83,21|-|[UNK]|fällt|wohl|bis|[UNK]|aus|.|[UNK]|–|...
4,dbmdz/bert-base-german-europeana-cased,,False,108,21|-|Jähr|##iger|fällt|wohl|bis|Saison|##ende|...
5,dbmdz/bert-base-german-europeana-uncased,,False,85,21|-|[UNK]|fällt|wohl|bis|[UNK]|aus|.|[UNK]|[U...
6,dbmdz/distilbert-base-german-europeana-cased,False,False,108,21|-|Jähr|##iger|fällt|wohl|bis|Saison|##ende|...
7,deepset/gbert-base,False,False,101,21|-|Jähr|##iger|fällt|wohl|bis|Saisonende|aus...
8,deepset/gbert-large,False,False,101,21|-|Jähr|##iger|fällt|wohl|bis|Saisonende|aus...
9,deepset/gelectra-base,False,False,101,21|-|Jähr|##iger|fällt|wohl|bis|Saisonende|aus...
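
To quantify the problem rather than spot it by eye, the share of unknown tokens per tokenizer can be computed. This is a minimal sketch; `unk_ratio` is a hypothetical helper, and the sample is only the visible start of the `dbmdz/bert-base-german-uncased` row above.

```python
from typing import List


def unk_ratio(tokens: List[str], unk_token: str = "[UNK]") -> float:
    """Return the fraction of tokens that equal the unknown token."""
    if not tokens:
        return 0.0
    return sum(token == unk_token for token in tokens) / len(tokens)


sample = "21|-|[UNK]|fällt|wohl|bis|[UNK]|aus|.|[UNK]".split("|")
print(unk_ratio(sample))  # 0.3

# In the notebook this could be applied to all tokenizers:
# pd.Series({k: unk_ratio(t.tokenize(text)) for k, t in toks_cm.items()})
```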


## Fixed Classification Model of SimpleTransformers

The initialization of a classification model can be fixed by loading the tokenizer first and passing its `do_lower_case` value explicitly in the model arguments.

In [16]:
%%time
toks_fixed = {}

for model_name in models:
    print(f"loading classification model for {model_name}")

    # get tokenizer first to determine lower case setting
    tok = AutoTokenizer.from_pretrained(model_name)
    args = {"do_lower_case": tok.do_lower_case}

    model_type = "electra" if "electra" in model_name else "distilbert" if "distilbert" in model_name else "bert"
    model = ClassificationModel(model_type, model_name, args=args, use_cuda=False)

    toks_fixed[model_name] = model.tokenizer

loading classification model for bert-base-german-cased
loading classification model for distilbert-base-german-cased
loading classification model for dbmdz/bert-base-german-cased
loading classification model for dbmdz/bert-base-german-uncased
loading classification model for dbmdz/bert-base-german-europeana-cased
loading classification model for dbmdz/bert-base-german-europeana-uncased
loading classification model for dbmdz/distilbert-base-german-europeana-cased
loading classification model for deepset/gbert-base
loading classification model for deepset/gbert-large
loading classification model for deepset/gelectra-base
loading classification model for deepset/gelectra-large
loading classification model for german-nlp-group/electra-base-german-uncased
loading classification model for bert-base-multilingual-cased
loading classification model for distilbert-base-multilingual-cased
CPU times: user 29.3 s, sys: 7.75 s, total: 37.1 s
Wall time: 3min 53s


In [17]:
params_fixed_df = getTokenizerParams(toks_fixed)
display(params_fixed_df.reset_index())

Unnamed: 0,name_or_path,strip_accents,do_lower_case,vocab_size,model_max_length,padding_side,is_fast,unk_token,sep_token,pad_token,cls_token,mask_token
0,bert-base-german-cased,,False,30000,512,right,True,[UNK],[SEP],[PAD],[CLS],[MASK]
1,distilbert-base-german-cased,,False,31102,512,right,True,[UNK],[SEP],[PAD],[CLS],[MASK]
2,dbmdz/bert-base-german-cased,,False,31102,512,right,True,[UNK],[SEP],[PAD],[CLS],[MASK]
3,dbmdz/bert-base-german-uncased,,True,31102,512,right,True,[UNK],[SEP],[PAD],[CLS],[MASK]
4,dbmdz/bert-base-german-europeana-cased,,False,32000,512,right,True,[UNK],[SEP],[PAD],[CLS],[MASK]
5,dbmdz/bert-base-german-europeana-uncased,,True,32000,512,right,True,[UNK],[SEP],[PAD],[CLS],[MASK]
6,dbmdz/distilbert-base-german-europeana-cased,False,False,32000,512,right,True,[UNK],[SEP],[PAD],[CLS],[MASK]
7,deepset/gbert-base,False,False,31102,512,right,True,[UNK],[SEP],[PAD],[CLS],[MASK]
8,deepset/gbert-large,False,False,31102,512,right,True,[UNK],[SEP],[PAD],[CLS],[MASK]
9,deepset/gelectra-base,False,False,31102,512,right,True,[UNK],[SEP],[PAD],[CLS],[MASK]


In [18]:
params_df.compare(params_fixed_df)

name_or_path
