<a href="https://colab.research.google.com/github/goerlitz/nlp-classification/blob/main/notebooks/10kGNAD/colab/TransformerTokenization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Testing the Tokenization Used by Different Transformer Models

In [11]:
!pip install -q transformers > /dev/null

# check installed version
!pip freeze | grep transformers
# transformers==4.6.0

transformers==4.6.0


## Download Data

This notebook uses the [10k German News Articles Dataset (10kGNAD)](notebooks/10kGNAD/README.md).

In [12]:
%env DIR=data

!mkdir -p $DIR
!wget -nc "https://github.com/tblock/10kGNAD/blob/master/train.csv?raw=true" -nv -O $DIR/train.csv
!wget -nc "https://github.com/tblock/10kGNAD/blob/master/test.csv?raw=true" -nv -O $DIR/test.csv
!ls -lAh $DIR | cut -d " " -f 5-

env: DIR=data

2.7M May 15 12:18 test.csv
 24M May 15 12:18 train.csv
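
The download step can also be written in plain Python, which helps when `wget` is not available. The following is a minimal sketch, not part of the original notebook: the helper name `download_if_missing` is hypothetical, and it only mirrors the "skip if the file already exists" behaviour of `wget -nc`.

```python
from pathlib import Path
from urllib.request import urlretrieve


def download_if_missing(url: str, dest: Path) -> bool:
    """Download url to dest unless dest already exists.

    Returns True if a download happened, False if it was skipped.
    Hypothetical helper; it mimics the no-clobber idea of `wget -nc`.
    """
    if dest.exists():
        # Same idea as `wget -nc`: never overwrite an existing file.
        return False
    dest.parent.mkdir(parents=True, exist_ok=True)
    urlretrieve(url, str(dest))
    return True
```

A call such as `download_if_missing("https://github.com/tblock/10kGNAD/blob/master/train.csv?raw=true", Path("data/train.csv"))` would then fetch the training file only on the first run.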


## Import Data

In [13]:
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path

In [14]:
data_dir = Path("data/")

train_file = data_dir / 'train.csv'
test_file = data_dir / 'test.csv'

In [15]:
def load_file(filepath: Path) -> pd.DataFrame:
    """Read a 10kGNAD CSV file (';'-separated, "'"-quoted) into a labels/text DataFrame."""
    return pd.read_csv(filepath, sep=";", quotechar="'", names=['labels', 'text'])
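
To see what the `sep=";"` and `quotechar="'"` arguments do, the sketch below parses two made-up rows with the standard-library `csv` module, using the same separator and quote character. The second sample row is hypothetical and only there to show that a semicolon inside a single-quoted field does not split the field.

```python
import csv
import io

# Hypothetical two-row sample in the assumed layout of train.csv:
# label;text, with ';' as separator and a single quote as quote character.
sample = (
    "Sport;21-Jähriger fällt wohl bis Saisonende aus.\n"
    "Web;'a text with ; inside stays one field'\n"
)

reader = csv.reader(io.StringIO(sample), delimiter=";", quotechar="'")
rows = [(label, text) for label, text in reader]

for label, text in rows:
    print(f"{label!r}: {text!r}")
```

Double quotes inside an article are treated as ordinary characters under this setting, because only the single quote acts as the quote character.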

In [16]:
train_df = load_file(train_file)
print(train_df.shape[0], 'articles')
display(train_df.head())

9245 articles


| | labels | text |
|---|---|---|
| 0 | Sport | 21-Jähriger fällt wohl bis Saisonende aus. Wie... |
| 1 | Kultur | Erfundene Bilder zu Filmen, die als verloren g... |
| 2 | Web | Der frischgekürte CEO Sundar Pichai setzt auf ... |
| 3 | Wirtschaft | Putin: "Einigung, dass wir Menge auf Niveau vo... |
| 4 | Inland | Estland sieht den künftigen österreichischen P... |


## Load Tokenizers

In [17]:
from transformers import AutoTokenizer, AutoModel

import warnings
warnings.simplefilter('ignore')

In [18]:
%%time
models = [
          "bert-base-german-cased",
          "distilbert-base-german-cased",
          "dbmdz/bert-base-german-cased",
          "dbmdz/bert-base-german-uncased",
          "dbmdz/bert-base-german-europeana-cased",
          "dbmdz/bert-base-german-europeana-uncased",
          "dbmdz/distilbert-base-german-europeana-cased",
          "deepset/gbert-base",
          "deepset/gbert-large",
          "deepset/gelectra-base",
          "deepset/gelectra-large",
          "german-nlp-group/electra-base-german-uncased",
          "bert-base-multilingual-cased",
          "distilbert-base-multilingual-cased",
]

text = train_df.text[0]
results = {}

for model_name in models:
    print(f"loading tokenizer for {model_name}")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # model = AutoModel.from_pretrained(model_name)

    tokens = "|".join(tokenizer.tokenize(text))
    results[model_name] = tokens

loading tokenizer for bert-base-german-cased
loading tokenizer for distilbert-base-german-cased
loading tokenizer for dbmdz/bert-base-german-cased
loading tokenizer for dbmdz/bert-base-german-uncased
loading tokenizer for dbmdz/bert-base-german-europeana-cased
loading tokenizer for dbmdz/bert-base-german-europeana-uncased
loading tokenizer for dbmdz/distilbert-base-german-europeana-cased
loading tokenizer for deepset/gbert-base
loading tokenizer for deepset/gbert-large
loading tokenizer for deepset/gelectra-base
loading tokenizer for deepset/gelectra-large
loading tokenizer for german-nlp-group/electra-base-german-uncased
loading tokenizer for bert-base-multilingual-cased
loading tokenizer for distilbert-base-multilingual-cased
CPU times: user 1.74 s, sys: 126 ms, total: 1.87 s
Wall time: 18.8 s
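
The BERT-style tokenizers loaded above split words into subword pieces, where a `##` prefix marks a piece that continues the previous one. The sketch below is a simplified, self-contained illustration of that idea: greedy longest-match-first splitting, plus the lowercasing and accent stripping that uncased BERT tokenizers apply by default. The two vocabularies are tiny hypothetical examples, not the real model vocabularies, and the functions `normalize` and `wordpiece` are illustrative names rather than the `transformers` API.

```python
import unicodedata


def normalize(word: str, uncased: bool) -> str:
    """Lowercase and strip accents for uncased models; leave cased input as is."""
    if not uncased:
        return word
    decomposed = unicodedata.normalize("NFD", word.lower())
    # Drop combining marks (Unicode category Mn), e.g. the dots of "ä".
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def wordpiece(word: str, vocab: set, unk: str = "[UNK]") -> list:
    """Greedy longest-match-first split of a single word into vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches at this position: the whole word is unknown.
            return [unk]
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabularies, NOT the real model vocabularies.
cased_vocab = {"Saison", "##ende", "Jähr", "##iger", "fällt"}
uncased_vocab = {"saisonende", "jahr", "##iger", "fall", "##t"}

words = ["Jähriger", "fällt", "Saisonende"]
cased = [wordpiece(normalize(w, uncased=False), cased_vocab) for w in words]
uncased = [wordpiece(normalize(w, uncased=True), uncased_vocab) for w in words]

print(cased)
print(uncased)
```

Whether a word such as `Saisonende` stays whole or is split depends only on which pieces the vocabulary contains, which is why models trained on different corpora can tokenize the same sentence differently.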


In [19]:
print(pd.Series(results))

bert-base-german-cased                          21|-|J|##ähr|##iger|fällt|wohl|bis|Saisonende|...
distilbert-base-german-cased                    21|-|Jähr|##iger|fällt|wohl|bis|Saisonende|aus...
dbmdz/bert-base-german-cased                    21|-|Jähr|##iger|fällt|wohl|bis|Saisonende|aus...
dbmdz/bert-base-german-uncased                  21|-|jahr|##iger|fall|##t|wohl|bis|saisonende|...
dbmdz/bert-base-german-europeana-cased          21|-|Jähr|##iger|fällt|wohl|bis|Saison|##ende|...
dbmdz/bert-base-german-europeana-uncased        21|-|jahr|##iger|fallt|wohl|bis|saison|##ende|...
dbmdz/distilbert-base-german-europeana-cased    21|-|Jähr|##iger|fällt|wohl|bis|Saison|##ende|...
deepset/gbert-base                              21|-|Jähr|##iger|fällt|wohl|bis|Saisonende|aus...
deepset/gbert-large                             21|-|Jähr|##iger|fällt|wohl|bis|Saisonende|aus...
deepset/gelectra-base                           21|-|Jähr|##iger|fällt|wohl|bis|Saisonende|aus...
deepset/gelectra-lar