<a href="https://colab.research.google.com/github/goerlitz/nlp-classification/blob/main/notebooks/10kGNAD/colab/TransformersEvaluation10kGNAD.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Evaluation of Pre-trained Transformer Models for German Text Classification

**Goal:** Evaluate the performance of different pre-trained Transformer models from the Hugging Face model hub, namely BERT, DistilBERT, and ELECTRA variants.

**Experiment Setup:**

This evaluation relies on a simple, common training baseline for all models and deliberately leaves out details such as hyperparameter optimization. Therefore, we use [SimpleTransformers](https://simpletransformers.ai/) with its default settings and the [Ten Thousand German News Articles Dataset](https://tblock.github.io/10kGNAD/) (10kGNAD).

## Prerequisites

### List of Models

In [87]:
model_names = [
          "bert-base-german-cased",
          "distilbert-base-german-cased",
          "dbmdz/bert-base-german-cased",
          "dbmdz/bert-base-german-uncased",
          "dbmdz/bert-base-german-europeana-cased",
          "dbmdz/bert-base-german-europeana-uncased",
          "dbmdz/distilbert-base-german-europeana-cased",
          "deepset/gbert-base",   # newest base model
          "deepset/gbert-large",
          "deepset/gelectra-base",
          "deepset/gelectra-large",   # current SOTA
          "german-nlp-group/electra-base-german-uncased",
          
          "bert-base-multilingual-cased",
          "distilbert-base-multilingual-cased",
]

### Check GPU

In [88]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Select the Runtime > "Change runtime type" menu to enable a GPU accelerator, ')
  print('and then re-execute this cell.')
else:
  print(gpu_info)

Mon May 17 21:37:55 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   38C    P0    34W / 250W |   5283MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               


### Install Transformers

In [89]:
# install transformers
!pip install -q --upgrade tqdm==4.47.0 >/dev/null
!pip install -q --upgrade transformers simpletransformers >/dev/null

# check installed version
!pip freeze | grep transformers
# simpletransformers==0.61.4
# transformers==4.6.0

simpletransformers==0.61.4
transformers==4.6.0


In [90]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path
import time
import os

from simpletransformers.classification import ClassificationModel
from transformers import AutoTokenizer

os.environ["WANDB_SILENT"] = "true"

### Connect Google Drive

In [91]:
from google.colab import drive

In [92]:
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


### Download Data

Download the 10k German News Articles Dataset (10kGNAD).

In [93]:
%env DIR=data

!mkdir -p $DIR
!wget -nc "https://github.com/tblock/10kGNAD/blob/master/train.csv?raw=true" -nv -O $DIR/train.csv
!wget -nc "https://github.com/tblock/10kGNAD/blob/master/test.csv?raw=true" -nv -O $DIR/test.csv
!ls -lAh $DIR | cut -d " " -f 5-

env: DIR=data

2.7M May 17 11:51 test.csv
 24M May 17 11:51 train.csv


## Import Data

Load the training and test datasets.

In [94]:
data_dir = Path(os.getenv("DIR"))

train_file = data_dir / 'train.csv'
test_file = data_dir / 'test.csv'

def load_10kgnad(filepath: Path, columns=['labels', 'text']) -> pd.DataFrame:
    """Load 10kGNAD data file"""
    f = pd.read_csv(filepath, sep=";", quotechar="'", names=columns)
    return f
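
The file layout that `load_10kgnad` expects (semicolon-separated fields, single quote as the quote character, no header row) can be illustrated without pandas. The sketch below uses the standard-library `csv` module on two made-up rows; the sample text and the `records` name are assumptions for illustration, not data from 10kGNAD.

```python
import csv
import io

# Two made-up rows in the layout used above: label;text
# A text that contains the separator is wrapped in single quotes.
sample = (
    "Sport;'Sieg im Derby; Trainer zufrieden'\n"
    "Web;Neues Smartphone vorgestellt\n"
)

# Same separator and quote character as in the pd.read_csv call.
rows = list(csv.reader(io.StringIO(sample), delimiter=";", quotechar="'"))

# Attach the column names that the notebook passes via `names=columns`.
columns = ["labels", "text"]
records = [dict(zip(columns, row)) for row in rows]

for record in records:
    print(record)
```

The quoted semicolon in the first row stays inside the text field, which is why the quote character has to be set explicitly.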

In [95]:
train_df = load_10kgnad(train_file)
print(train_df.shape[0], 'articles')
display(train_df.head())

9245 articles


|   | labels | text |
|---|--------|------|
| 0 | Sport | 21-Jähriger fällt wohl bis Saisonende aus. Wie... |
| 1 | Kultur | Erfundene Bilder zu Filmen, die als verloren g... |
| 2 | Web | Der frischgekürte CEO Sundar Pichai setzt auf ... |
| 3 | Wirtschaft | Putin: "Einigung, dass wir Menge auf Niveau vo... |
| 4 | Inland | Estland sieht den künftigen österreichischen P... |


In [96]:
test_df = load_10kgnad(test_file)
print(test_df.shape[0], 'articles')
display(test_df.head())

1028 articles


|   | labels | text |
|---|--------|------|
| 0 | Wirtschaft | Die Gewerkschaft GPA-djp lanciert den "All-in-... |
| 1 | Sport | Franzosen verteidigen 2:1-Führung – Kritische ... |
| 2 | Web | Neues Video von Designern macht im Netz die Ru... |
| 3 | Sport | 23-jähriger Brasilianer muss vier Spiele pausi... |
| 4 | International | Aufständische verwendeten Chemikalie bei Gefec... |


## Prepare for Model Training

Requirements of SimpleTransformers:

* columns should be named `labels` and `text` (already done during import)
* labels must be int values starting at `0`
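
As a rough illustration of the second requirement, the sketch below assigns integer ids by label frequency, most frequent first. It uses `collections.Counter` as a stand-in for the pandas `value_counts()` call in the notebook, and the label list is made up for the example.

```python
from collections import Counter

# Made-up label column; the real labels come from train_df.
labels = ["Web", "Sport", "Web", "Panorama", "Web", "Sport"]

# Order labels by frequency (descending), similar to value_counts().
id_to_label = [label for label, _ in Counter(labels).most_common()]

# Invert the ordering into a label -> integer lookup starting at 0.
label_to_id = {label: idx for idx, label in enumerate(id_to_label)}

# Encode the column as integers.
encoded = [label_to_id[label] for label in labels]

print(label_to_id)
print(encoded)
```

Keeping `id_to_label` around makes it possible to translate predicted ids back into category names later.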

In [97]:
# map label to integers
mapping_s = pd.Series(train_df.labels.value_counts().index)
mapping_s

0         Panorama
1              Web
2    International
3       Wirtschaft
4            Sport
5           Inland
6             Etat
7     Wissenschaft
8           Kultur
dtype: object

In [98]:
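# replace labels with integers starting at 0
# (explicit column assignment avoids chained in-place replacement, which newer pandas versions warn about)
label_to_id = {label: idx for idx, label in mapping_s.items()}
train_df["labels"] = train_df["labels"].map(label_to_id)
test_df["labels"] = test_df["labels"].map(label_to_id)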
display(train_df.head())
display(test_df.head())

|   | labels | text |
|---|--------|------|
| 0 | 4 | 21-Jähriger fällt wohl bis Saisonende aus. Wie... |
| 1 | 8 | Erfundene Bilder zu Filmen, die als verloren g... |
| 2 | 1 | Der frischgekürte CEO Sundar Pichai setzt auf ... |
| 3 | 3 | Putin: "Einigung, dass wir Menge auf Niveau vo... |
| 4 | 5 | Estland sieht den künftigen österreichischen P... |


|   | labels | text |
|---|--------|------|
| 0 | 3 | Die Gewerkschaft GPA-djp lanciert den "All-in-... |
| 1 | 4 | Franzosen verteidigen 2:1-Führung – Kritische ... |
| 2 | 1 | Neues Video von Designern macht im Netz die Ru... |
| 3 | 4 | 23-jähriger Brasilianer muss vier Spiele pausi... |
| 4 | 2 | Aufständische verwendeten Chemikalie bei Gefec... |


## Evaluation Setup

We want to evaluate many different German (and multilingual) language models. SimpleTransformers needs a model type for each of them, which we derive from the model name.

In [99]:
def get_model_type(model_name: str):
    if "electra" in model_name:
        return "electra"

    if "distilbert" in model_name:
        return "distilbert"

    if "bert" in model_name:
        return "bert"

    return None

model_df = pd.DataFrame([(get_model_type(m), m) for m in model_names], columns=["type", "name"])
model_df

|   | type | name |
|---|------|------|
| 0 | bert | bert-base-german-cased |
| 1 | distilbert | distilbert-base-german-cased |
| 2 | bert | dbmdz/bert-base-german-cased |
| 3 | bert | dbmdz/bert-base-german-uncased |
| 4 | bert | dbmdz/bert-base-german-europeana-cased |
| 5 | bert | dbmdz/bert-base-german-europeana-uncased |
| 6 | distilbert | dbmdz/distilbert-base-german-europeana-cased |
| 7 | bert | deepset/gbert-base |
| 8 | bert | deepset/gbert-large |
| 9 | electra | deepset/gelectra-base |

In [106]:
mdl = model_df.iloc[10]
print(f"using model: '{mdl['name']}'")

using model: 'deepset/gelectra-large'


In [101]:
import wandb

# Weights & Biases project that SimpleTransformers logs training runs to
project_name = "german_news_article_classification2"

In [102]:
# define hyperparameters
train_args = {
    "reprocess_input_data": True,
    "fp16": False,
    "num_train_epochs": 4,
    # "weight": train_weights_s.values,
    "evaluate_during_training": True,
    "overwrite_output_dir": True,
    "wandb_project": project_name,
}

from sklearn.metrics import f1_score, accuracy_score, precision_score, recall_score

def f1_multiclass(labels, preds):
    return f1_score(labels, preds, average='macro')

def precision_multiclass(labels, preds):
    return precision_score(labels, preds, average='macro')

def recall_multiclass(labels, preds):
    return recall_score(labels, preds, average='macro')
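
To make clear what `average='macro'` means for the metrics above, here is a pure-Python sketch of macro-averaged F1: the F1 score is computed per class and then averaged with equal weight per class. This is an illustration under that definition, not scikit-learn's implementation, and the toy labels are made up.

```python
def macro_f1(labels, preds):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(labels) | set(preds))
    scores = []
    for c in classes:
        tp = sum(1 for y, p in zip(labels, preds) if y == c and p == c)
        fp = sum(1 for y, p in zip(labels, preds) if y != c and p == c)
        fn = sum(1 for y, p in zip(labels, preds) if y == c and p != c)
        denom = 2 * tp + fp + fn
        # A class that is never predicted correctly contributes 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Imbalanced toy data: a classifier that always predicts class 0.
labels = [0] * 8 + [1] * 2
preds = [0] * 10

accuracy = sum(y == p for y, p in zip(labels, preds)) / len(labels)
macro = macro_f1(labels, preds)

print(f"accuracy: {accuracy:.3f}")
print(f"macro F1: {macro:.3f}")
```

On imbalanced data the macro average penalizes a model that ignores a rare class, while accuracy alone can still look high.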

In [103]:
def init_classifier(model_type:str, model_name:str, num_labels, train_args) -> ClassificationModel:

    # need to create a tokenizer first and adjust train args with lower case setting
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    train_args = {**train_args, **{ "do_lower_case": tokenizer.do_lower_case }}

    # Create a ClassificationModel
    return ClassificationModel(model_type, model_name, tokenizer_name=model_name, num_labels=num_labels, args=train_args)

def train_model(model:ClassificationModel, train_df, eval_df, show_loss=False):
    return model.train_model(train_df, eval_df=eval_df, verbose=False, show_running_loss=show_loss, f1=f1_multiclass, acc=accuracy_score, precision=precision_multiclass, recall=recall_multiclass)

def eval_model(model:ClassificationModel, eval_df):
    # evaluate on the DataFrame passed in, not on the global test_df
    return model.eval_model(eval_df, wandb_log=False, f1=f1_multiclass, acc=accuracy_score, precision=precision_multiclass, recall=recall_multiclass)

def log_results(model_name, start, end, details):
    eval_df = pd.DataFrame(details)[-1:].reset_index(drop=True).round(4)
    result_df = pd.DataFrame({
        "start_time": [time.strftime('%Y-%m-%d %H:%M:%S %Z', time.gmtime(start))],
        "runtime": [int(end - start)],
        "model_name": [model_name],
        })
    output_df = pd.concat([result_df, eval_df], axis=1)

    eval_log = Path("/content/gdrive/My Drive/Colab Notebooks/nlp-classification/data") / "eval_log2.txt"

    if eval_log.exists():
        output_df.to_csv(eval_log, mode='a', header=False, index=False)
    else:
        output_df.to_csv(eval_log, index=False)
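
The append-or-create logic in `log_results` writes the header only when the log file does not exist yet. A minimal sketch of the same pattern with the standard-library `csv` module is shown below; `append_row`, the temporary file name, and the example rows are hypothetical and only serve the illustration.

```python
import csv
import tempfile
from pathlib import Path


def append_row(log_file: Path, row: dict) -> None:
    """Append one result row; write the header only for a new file."""
    write_header = not log_file.exists()
    with log_file.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(row))
        if write_header:
            writer.writeheader()
        writer.writerow(row)


with tempfile.TemporaryDirectory() as tmp:
    log_file = Path(tmp) / "eval_log_demo.csv"
    # Placeholder rows, not results from this notebook.
    append_row(log_file, {"model_name": "model-a", "runtime": 10})
    append_row(log_file, {"model_name": "model-b", "runtime": 12})
    lines = log_file.read_text(encoding="utf-8").splitlines()

print(lines)
```

Because later rows are appended without a header, every run has to log the same columns in the same order.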

In [None]:
# run training multiple times
num_runs = 11

for i in range(num_runs):

    model = init_classifier(mdl["type"], mdl["name"], len(mapping_s), train_args)
    
    start = time.time()
    steps, details = train_model(model, train_df, test_df)
    end = time.time()
    
    wandb.finish()
    log_results(mdl["name"], start, end, details)
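
Since each run appends one row to the evaluation log, the repeated runs can later be summarized per model, for example as mean and standard deviation of a metric. The sketch below shows this with the standard library only; the scores are made-up placeholders and not results from this notebook, and the `summary` layout is an assumption.

```python
from statistics import mean, stdev

# Made-up F1 scores of repeated runs of one model (illustration only).
f1_scores = [0.90, 0.91, 0.89, 0.92]

summary = {
    "runs": len(f1_scores),
    "mean": round(mean(f1_scores), 4),
    "std": round(stdev(f1_scores), 4),  # sample standard deviation
}

print(summary)
```

Reporting the spread alongside the mean helps to judge whether a difference between two models is larger than the run-to-run variation.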