<a href="https://colab.research.google.com/github/goerlitz/nlp-classification/blob/main/notebooks/10kGNAD/colab/TransformerTokenization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Testing the Tokenization Used by Different Transformer Models

In [None]:
!pip install -q transformers simpletransformers > /dev/null

# check installed version
!pip freeze | grep transformers
# simpletransformers==0.61.4
# transformers==4.6.0

simpletransformers==0.61.4
transformers==4.6.0


## Download Data

This notebook uses the [10k German News Articles Dataset (10kGNAD)](notebooks/10kGNAD/README.md).

In [None]:
%env DIR=data

!mkdir -p $DIR
!wget -nc "https://github.com/tblock/10kGNAD/blob/master/train.csv?raw=true" -nv -O $DIR/train.csv
!wget -nc "https://github.com/tblock/10kGNAD/blob/master/test.csv?raw=true" -nv -O $DIR/test.csv
!ls -lAh $DIR | cut -d " " -f 5-

env: DIR=data

2.7M May 15 12:18 test.csv
 24M May 15 12:18 train.csv


## Import Data

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path

In [None]:
data_dir = Path("data/")

train_file = data_dir / 'train.csv'
test_file = data_dir / 'test.csv'

In [None]:
def load_file(filepath: Path) -> pd.DataFrame:
    # The file is read without a header row; fields are separated by ';' and quoted with single quotes.
    return pd.read_csv(filepath, sep=";", quotechar="'", names=['labels', 'text'])

In [None]:
train_df = load_file(train_file)
print(train_df.shape[0], 'articles')
display(train_df.head())

9245 articles


|   | labels | text |
|---|--------|------|
| 0 | Sport | 21-Jähriger fällt wohl bis Saisonende aus. Wie... |
| 1 | Kultur | Erfundene Bilder zu Filmen, die als verloren g... |
| 2 | Web | Der frischgekürte CEO Sundar Pichai setzt auf ... |
| 3 | Wirtschaft | Putin: "Einigung, dass wir Menge auf Niveau vo... |
| 4 | Inland | Estland sieht den künftigen österreichischen P... |
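
The `sep=";"` and `quotechar="'"` arguments passed to `pd.read_csv` matter because article text can itself contain semicolons. The following is a minimal sketch of that parsing behaviour using only the standard-library `csv` module. The two sample rows are made up for illustration and are not taken from the dataset.

```python
import csv
import io

# Hypothetical sample in the assumed layout: label;text, with ' as the quote character.
sample = (
    "Sport;'Ein Satz; mit Semikolon.'\n"
    "Web;Ein Satz ohne Anführungszeichen.\n"
)

# The same delimiter and quote character as in load_file above.
reader = csv.reader(io.StringIO(sample), delimiter=";", quotechar="'")
rows = [dict(zip(["labels", "text"], row)) for row in reader]

for row in rows:
    print(row["labels"], "->", row["text"])
```

The semicolon inside the quoted field stays part of the text instead of starting a third column.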


## Create Model

In [None]:
from transformers import AutoTokenizer, AutoModel

import warnings
warnings.simplefilter('ignore')

In [None]:
%%time
models = [
          "bert-base-german-cased",
          "distilbert-base-german-cased",
          "dbmdz/bert-base-german-cased",
          "dbmdz/bert-base-german-uncased",
          "dbmdz/bert-base-german-europeana-cased",
          "dbmdz/bert-base-german-europeana-uncased",
          "dbmdz/distilbert-base-german-europeana-cased",
          "deepset/gbert-base",
          "deepset/gbert-large",
          "deepset/gelectra-base",
          "deepset/gelectra-large",
          "german-nlp-group/electra-base-german-uncased",
          "bert-base-multilingual-cased",
          "distilbert-base-multilingual-cased",
]

text = train_df.text[0]
results = {}

for model_name in models:
    print(f"loading tokenizer for {model_name}")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # model = AutoModel.from_pretrained(model_name)

    tokens = "|".join(tokenizer.tokenize(text))
    results[model_name] = tokens

loading tokenizer for bert-base-german-cased
loading tokenizer for distilbert-base-german-cased
loading tokenizer for dbmdz/bert-base-german-cased
loading tokenizer for dbmdz/bert-base-german-uncased
loading tokenizer for dbmdz/bert-base-german-europeana-cased
loading tokenizer for dbmdz/bert-base-german-europeana-uncased
loading tokenizer for dbmdz/distilbert-base-german-europeana-cased
loading tokenizer for deepset/gbert-base
loading tokenizer for deepset/gbert-large
loading tokenizer for deepset/gelectra-base
loading tokenizer for deepset/gelectra-large
loading tokenizer for german-nlp-group/electra-base-german-uncased
loading tokenizer for bert-base-multilingual-cased
loading tokenizer for distilbert-base-multilingual-cased
CPU times: user 1.64 s, sys: 151 ms, total: 1.79 s
Wall time: 17.6 s


In [None]:
print(pd.Series(results))

bert-base-german-cased                          21|-|J|##ähr|##iger|fällt|wohl|bis|Saisonende|...
distilbert-base-german-cased                    21|-|Jähr|##iger|fällt|wohl|bis|Saisonende|aus...
dbmdz/bert-base-german-cased                    21|-|Jähr|##iger|fällt|wohl|bis|Saisonende|aus...
dbmdz/bert-base-german-uncased                  21|-|jahr|##iger|fall|##t|wohl|bis|saisonende|...
dbmdz/bert-base-german-europeana-cased          21|-|Jähr|##iger|fällt|wohl|bis|Saison|##ende|...
dbmdz/bert-base-german-europeana-uncased        21|-|jahr|##iger|fallt|wohl|bis|saison|##ende|...
dbmdz/distilbert-base-german-europeana-cased    21|-|Jähr|##iger|fällt|wohl|bis|Saison|##ende|...
deepset/gbert-base                              21|-|Jähr|##iger|fällt|wohl|bis|Saisonende|aus...
deepset/gbert-large                             21|-|Jähr|##iger|fällt|wohl|bis|Saisonende|aus...
deepset/gelectra-base                           21|-|Jähr|##iger|fällt|wohl|bis|Saisonende|aus...
deepset/gelectra-lar
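
The `##` prefix in the output above marks a sub-word piece that continues the preceding token, as in WordPiece-style tokenizers. Whether a word such as `Saisonende` stays whole or is split into `Saison|##ende` depends on each model's vocabulary. The sketch below is a simplified greedy longest-match-first splitter, not the Hugging Face implementation. The function name `wordpiece` and both toy vocabularies are hypothetical and chosen only to reproduce the splits seen above.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into WordPiece-style tokens."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word becomes unknown
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabularies (not the real model vocabularies).
vocab_a = {"Saisonende", "Saison", "##ende", "Jähr", "##iger"}
vocab_b = {"Saison", "##ende", "J", "##ähr", "##iger"}

print(wordpiece("Saisonende", vocab_a))  # ['Saisonende']
print(wordpiece("Saisonende", vocab_b))  # ['Saison', '##ende']
print(wordpiece("Jähriger", vocab_a))    # ['Jähr', '##iger']
print(wordpiece("Jähriger", vocab_b))    # ['J', '##ähr', '##iger']
```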

In [None]:
tokenizer = AutoTokenizer.from_pretrained("german-nlp-group/electra-base-german-uncased")
print(tokenizer, tokenizer.do_lower_case)
print("|".join(tokenizer.tokenize(text))[:50])

PreTrainedTokenizerFast(name_or_path='german-nlp-group/electra-base-german-uncased', vocab_size=32767, model_max_len=512, is_fast=True, padding_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}) True
21|-|jähriger|fällt|wohl|bis|saisonende|aus|.|wien
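
This tokenizer lowercases the text but keeps the umlauts (`jähriger`, `fällt`), whereas the `dbmdz` uncased models above produce `jahr|##iger` and `fallt`. A plausible explanation is that those tokenizers also strip accents when lowercasing, while this one presumably has accent stripping disabled. The sketch below illustrates the two normalization variants with the standard library only. The function `lowercase` is a hypothetical helper, not part of `transformers`.

```python
import unicodedata


def lowercase(text, strip_accents):
    """Lowercase text; optionally remove combining accent marks (ä -> a)."""
    text = text.lower()
    if strip_accents:
        # NFD splits 'ä' into 'a' plus a combining diaeresis (category Mn), which is then dropped.
        text = unicodedata.normalize("NFD", text)
        text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    return text


print(lowercase("Jähriger fällt", strip_accents=True))   # jahriger fallt
print(lowercase("Jähriger fällt", strip_accents=False))  # jähriger fällt
```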


In [None]:
tokenizer.do_lower_case

True

In [None]:
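from simpletransformers.classification import ClassificationModel
# Set explicitly here; the original run reused the last values left over from the loop cell below.
model_type, model_name = "distilbert", "distilbert-base-multilingual-cased"
model = ClassificationModel(model_type, model_name, tokenizer_name="german-nlp-group/electra-base-german-uncased", args={"do_lower_case":tokenizer.do_lower_case}, use_cuda=False)
print(model.tokenizer, model.tokenizer.do_lower_case)
print("|".join(model.tokenizer.tokenize(text))[:50])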

Some weights of the model checkpoint at distilbert-base-multilingual-cased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_transform.weight', 'vocab_projector.weight', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-multilingual-cased and are newly initialized: ['pre_classifier.bias', 'pre_class

PreTrainedTokenizerFast(name_or_path='german-nlp-group/electra-base-german-uncased', vocab_size=32767, model_max_len=512, is_fast=True, padding_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}) True
21|-|jähriger|fällt|wohl|bis|saisonende|aus|.|wien


In [None]:
%%time
from simpletransformers.classification import ClassificationModel

# import warnings
# warnings.filterwarnings("ignore")

results2 = {}

for model_name in models:
    model_type = "electra" if "electra" in model_name else "distilbert" if "distilbert" in model_name else "bert"
    print(model_type, model_name)

    tokenizer = AutoTokenizer.from_pretrained(model_name)

    model_args = { "do_lower_case": tokenizer.do_lower_case }
    model = ClassificationModel(model_type, model_name, tokenizer_name=model_name, args=model_args, use_cuda=False)

    tokens = "|".join(model.tokenizer.tokenize(text))
    results2[model_name] = tokens

bert bert-base-german-cased


Some weights of the model checkpoint at bert-base-german-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoi

distilbert distilbert-base-german-cased


Some weights of the model checkpoint at distilbert-base-german-cased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_transform.weight', 'vocab_projector.weight', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-german-cased and are newly initialized: ['pre_classifier.bias', 'pre_classifier.weight

bert dbmdz/bert-base-german-cased


Some weights of the model checkpoint at dbmdz/bert-base-german-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at dbmdz/bert-base-german-cased and are newly initi

bert dbmdz/bert-base-german-uncased


Some weights of the model checkpoint at dbmdz/bert-base-german-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model 

bert dbmdz/bert-base-german-europeana-cased


Some weights of the model checkpoint at dbmdz/bert-base-german-europeana-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassificat

bert dbmdz/bert-base-german-europeana-uncased


Some weights of the model checkpoint at dbmdz/bert-base-german-europeana-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassific

distilbert dbmdz/distilbert-base-german-europeana-cased


Some weights of the model checkpoint at dbmdz/distilbert-base-german-europeana-cased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_transform.weight', 'vocab_projector.weight', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at dbmdz/distilbert-base-german-europeana-cased and are newly initialized: ['pre_classifi

bert deepset/gbert-base


Some weights of the model checkpoint at deepset/gbert-base were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint a

bert deepset/gbert-large


Some weights of the model checkpoint at deepset/gbert-large were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initia

electra deepset/gelectra-base


Some weights of the model checkpoint at deepset/gelectra-base were not used when initializing ElectraForSequenceClassification: ['discriminator_predictions.dense_prediction.weight', 'discriminator_predictions.dense.weight', 'discriminator_predictions.dense_prediction.bias', 'discriminator_predictions.dense.bias']
- This IS expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at deepset/gelectra-base and are newly initialized: ['pooler.dense.weight', 'pooler.dense.

electra deepset/gelectra-large


Some weights of the model checkpoint at deepset/gelectra-large were not used when initializing ElectraForSequenceClassification: ['discriminator_predictions.dense_prediction.weight', 'discriminator_predictions.dense.weight', 'discriminator_predictions.dense_prediction.bias', 'discriminator_predictions.dense.bias']
- This IS expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at deepset/gelectra-large and are newly initialized: ['pooler.dense.weight', 'pooler.dens

electra german-nlp-group/electra-base-german-uncased


Some weights of the model checkpoint at german-nlp-group/electra-base-german-uncased were not used when initializing ElectraForSequenceClassification: ['discriminator_predictions.dense_prediction.weight', 'discriminator_predictions.dense.weight', 'discriminator_predictions.dense_prediction.bias', 'discriminator_predictions.dense.bias']
- This IS expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at german-nlp-group/electra-base-german-uncased and are newly initi

bert bert-base-multilingual-cased


Some weights of the model checkpoint at bert-base-multilingual-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model ch

distilbert distilbert-base-multilingual-cased


Some weights of the model checkpoint at distilbert-base-multilingual-cased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_transform.weight', 'vocab_projector.weight', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-multilingual-cased and are newly initialized: ['pre_classifier.bias', 'pre_class

CPU times: user 27.3 s, sys: 8.78 s, total: 36.1 s
Wall time: 3min 32s


In [None]:
print(pd.Series(results2))

bert-base-german-cased                          21|-|J|##ähr|##iger|fällt|wohl|bis|Saisonende|...
distilbert-base-german-cased                    21|-|Jähr|##iger|fällt|wohl|bis|Saisonende|aus...
dbmdz/bert-base-german-cased                    21|-|Jähr|##iger|fällt|wohl|bis|Saisonende|aus...
dbmdz/bert-base-german-uncased                  21|-|jahr|##iger|fall|##t|wohl|bis|saisonende|...
dbmdz/bert-base-german-europeana-cased          21|-|Jähr|##iger|fällt|wohl|bis|Saison|##ende|...
dbmdz/bert-base-german-europeana-uncased        21|-|jahr|##iger|fallt|wohl|bis|saison|##ende|...
dbmdz/distilbert-base-german-europeana-cased    21|-|Jähr|##iger|fällt|wohl|bis|Saison|##ende|...
deepset/gbert-base                              21|-|Jähr|##iger|fällt|wohl|bis|Saisonende|aus...
deepset/gbert-large                             21|-|Jähr|##iger|fällt|wohl|bis|Saisonende|aus...
deepset/gelectra-base                           21|-|Jähr|##iger|fällt|wohl|bis|Saisonende|aus...
deepset/gelectra-lar

In [None]:
results == results2

True
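
The comparison returns `True`, so both approaches tokenize the sample text identically for every model. Had it returned `False`, a bare dictionary comparison would not show which models differ. The following sketch shows one way to list the differing entries. The helper `diff_results` and the toy dictionaries are hypothetical and only illustrate the idea.

```python
def diff_results(a, b):
    """Return {key: (a_value, b_value)} for keys whose values differ; a missing key shows up as None."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


# Hypothetical toy data standing in for results and results2.
left = {"model-x": "21|-|Jähr|##iger", "model-y": "21|-|jahr|##iger"}
right = {"model-x": "21|-|Jähr|##iger", "model-y": "21|-|jähriger"}

print(diff_results(left, right))
print(diff_results(left, left))  # an empty dict means the two runs agree
```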