<a href="https://colab.research.google.com/github/goerlitz/nlp-classification/blob/main/notebooks/10kGNAD/colab/21_10kGNAD_simpletransformers_default.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Classifying German News Articles with SimpleTransformers

## Objectives

1. Train a text classifier with transfer learning based on a pretrained German transformer model.
2. Keep the implementation to a few lines of code by using the SimpleTransformers library, which also ships with sensible default model settings.


## Approach

Use the following pretrained models on the 10k German News Articles Dataset (10kGNAD) to classify articles into 9 news topics.

* `distilbert-base-german-cased`
* `deepset/gbert-base`
* `deepset/gelectra-large`

## Learnings

...

## Prerequisites

In [1]:
model_type = "distilbert"
model_name = "distilbert-base-german-cased"

# model_type = "bert"
# model_name = "deepset/gbert-base"

# model_type = "electra"
# model_name = "deepset/gelectra-base"

project_name = "10kgnad_default__" + model_name.replace("/", "_")

In [2]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Select the Runtime > "Change runtime type" menu to enable a GPU accelerator, ')
  print('and then re-execute this cell.')
else:
  print(gpu_info)

Sun Nov 14 21:11:52 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.44       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   51C    P0    30W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               

In [4]:
# install transformers
!pip install -q -U tqdm transformers simpletransformers >/dev/null

# check installed version
!pip freeze | grep transformers
!pip freeze | grep torch
# simpletransformers==0.61.6 / 0.63.3
# transformers==4.6.1 / 4.12.3
# torch==1.8.1+cu101 / 1.10.0

simpletransformers==0.63.3
transformers==4.12.3
torch @ https://download.pytorch.org/whl/cu111/torch-1.10.0%2Bcu111-cp37-cp37m-linux_x86_64.whl
torchsummary==1.5.1
torchtext==0.11.0
torchvision @ https://download.pytorch.org/whl/cu111/torchvision-0.11.1%2Bcu111-cp37-cp37m-linux_x86_64.whl


In [5]:
import numpy as np
import pandas as pd
from pathlib import Path
import os

from simpletransformers.classification import ClassificationModel
from transformers import AutoTokenizer
from transformers import logging
import wandb

# hide progress bar when downloading tokenizers - a workaround!
logging.get_verbosity = lambda : logging.NOTSET

# disable transformer warnings like "Some weights of the model checkpoint"
logging.set_verbosity_error()

# disable logging of wandb
os.environ["WANDB_SILENT"] = "true"

## Download Data

Using the [10k German News Articles Dataset](https://tblock.github.io/10kGNAD/)

In [6]:
%env DIR=data

!mkdir -p $DIR
!wget -nc https://github.com/tblock/10kGNAD/blob/master/train.csv?raw=true -nv -O $DIR/train.csv
!wget -nc https://github.com/tblock/10kGNAD/blob/master/test.csv?raw=true -nv -O $DIR/test.csv
!ls -lAh $DIR | cut -d " " -f 5-

env: DIR=data
2021-11-14 21:15:01 URL:https://raw.githubusercontent.com/tblock/10kGNAD/master/train.csv [24405789/24405789] -> "data/train.csv" [1]
2021-11-14 21:15:03 URL:https://raw.githubusercontent.com/tblock/10kGNAD/master/test.csv [2755020/2755020] -> "data/test.csv" [1]

2.7M Nov 14 21:15 test.csv
 24M Nov 14 21:15 train.csv


## Import Data

In [7]:
data_dir = Path("data/")

train_file = data_dir / 'train.csv'
test_file = data_dir / 'test.csv'

def read_csv_10kGNAD(filepath: Path, columns=("labels", "text")) -> pd.DataFrame:
    """Load a 10kGNAD csv file (no header row, ';' as separator, "'" as quote character)."""
    # a tuple default avoids sharing one mutable list between calls
    return pd.read_csv(filepath, sep=";", quotechar="'", names=list(columns))
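
The `sep` and `quotechar` arguments matter because a semicolon inside an article text would otherwise be read as a column separator. As a minimal sketch using only the standard library, the following parses a made-up line (an illustration, not an actual record from the dataset) with the same delimiter and quote character:

```python
import csv
import io

# Hypothetical sample line for illustration; not an actual 10kGNAD record.
sample = "Sport;'Ein Tor; dann der Ausgleich.'\n"

# Same delimiter and quote character as in read_csv_10kGNAD.
reader = csv.reader(io.StringIO(sample), delimiter=";", quotechar="'")
rows = list(reader)

label, text = rows[0]
print(label)  # Sport
print(text)   # Ein Tor; dann der Ausgleich.
```

The semicolon inside the quoted text stays part of the second field, so each line yields exactly two columns.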

In [37]:
train_df = read_csv_10kGNAD(data_dir / 'train.csv')
print(train_df.shape[0], 'articles')
display(train_df.head())

9245 articles


| | labels | text |
|---|---|---|
| 0 | Sport | 21-Jähriger fällt wohl bis Saisonende aus. Wie... |
| 1 | Kultur | Erfundene Bilder zu Filmen, die als verloren g... |
| 2 | Web | Der frischgekürte CEO Sundar Pichai setzt auf ... |
| 3 | Wirtschaft | Putin: "Einigung, dass wir Menge auf Niveau vo... |
| 4 | Inland | Estland sieht den künftigen österreichischen P... |


In [38]:
test_df = read_csv_10kGNAD(data_dir / 'test.csv')
print(test_df.shape[0], 'articles')
display(test_df.head())

1028 articles


| | labels | text |
|---|---|---|
| 0 | Wirtschaft | Die Gewerkschaft GPA-djp lanciert den "All-in-... |
| 1 | Sport | Franzosen verteidigen 2:1-Führung – Kritische ... |
| 2 | Web | Neues Video von Designern macht im Netz die Ru... |
| 3 | Sport | 23-jähriger Brasilianer muss vier Spiele pausi... |
| 4 | International | Aufständische verwendeten Chemikalie bei Gefec... |


## Prepare Data for Model Training

There are a few requirements for feeding training data into SimpleTransformers:

* columns should be labeled `labels` and `text` (already done when reading the data)
* labels must be encoded as int values (starting at `0`!)

Additionally, we can handle imbalanced datasets by

* computing class weights for training

### Label Encoding

In [39]:
from sklearn.preprocessing import LabelEncoder

def encode_labels(train: pd.DataFrame, test: pd.DataFrame):
    le = LabelEncoder()

    train_labels = le.fit_transform(train.labels)
    test_labels = le.transform(test.labels)

    return train.assign(labels=train_labels), test.assign(labels=test_labels), le

# Caution: this overwrites train_df and test_df with their label-encoded versions
train_df, test_df, le = encode_labels(train_df, test_df)
display(train_df.head())

| | labels | text |
|---|---|---|
| 0 | 5 | 21-Jähriger fällt wohl bis Saisonende aus. Wie... |
| 1 | 3 | Erfundene Bilder zu Filmen, die als verloren g... |
| 2 | 6 | Der frischgekürte CEO Sundar Pichai setzt auf ... |
| 3 | 7 | Putin: "Einigung, dass wir Menge auf Niveau vo... |
| 4 | 1 | Estland sieht den künftigen österreichischen P... |
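
To see where these integers come from: `LabelEncoder` sorts the distinct class names and numbers them from `0`. The following is a minimal pure-Python sketch of that mapping for the nine topics. The helper names `label_to_id` and `id_to_label` are illustrative and not part of the notebook.

```python
# The nine topic names of the dataset, in arbitrary order.
topics = ["Sport", "Kultur", "Web", "Wirtschaft", "Inland",
          "Etat", "International", "Panorama", "Wissenschaft"]

# Sort the distinct names and number them from 0, as LabelEncoder does.
classes = sorted(set(topics))
label_to_id = {name: idx for idx, name in enumerate(classes)}
id_to_label = {idx: name for name, idx in label_to_id.items()}

print(label_to_id["Sport"])  # 5
print(id_to_label[7])        # Wirtschaft
```

In the notebook itself, the fitted encoder `le` keeps the same ordering in `le.classes_`, and `le.inverse_transform` maps integer labels back to topic names.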


### Computing Class Weights (not used yet)

In [11]:
from sklearn.utils.class_weight import compute_class_weight

def class_weights(labels: pd.Series) -> pd.DataFrame:
    """Compute class weights for imbalanced data."""
    uniq_labels = labels.unique()
    counts_s = labels.value_counts().reindex(uniq_labels)
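    # classes and y are passed by keyword; recent scikit-learn versions no longer accept them positionally
    weights = compute_class_weight("balanced", classes=uniq_labels, y=labels)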
    return pd.DataFrame({"count": counts_s, "weight": weights}).sort_index()

weights_df = class_weights(train_df.labels)
display(weights_df)

| label | count | weight |
|---|---|---|
| 0 | 601 | 1.709188 |
| 1 | 913 | 1.125106 |
| 2 | 1360 | 0.75531 |
| 3 | 485 | 2.117984 |
| 4 | 1510 | 0.68028 |
| 5 | 1081 | 0.950252 |
| 6 | 1509 | 0.68073 |
| 7 | 1270 | 0.808836 |
| 8 | 516 | 1.990741 |
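
The `"balanced"` heuristic assigns each class the weight `n_samples / (n_classes * count)`, so rare topics get weights above 1 and frequent topics get weights below 1. As a small worked example, the weights can be reproduced from the counts alone:

```python
# Per-class article counts from the training set (index = encoded label).
counts = [601, 913, 1360, 485, 1510, 1081, 1509, 1270, 516]

n_samples = sum(counts)  # 9245 training articles
n_classes = len(counts)  # 9 topics

# "balanced" heuristic: n_samples / (n_classes * count_of_class)
weights = [n_samples / (n_classes * c) for c in counts]

for label_id, w in enumerate(weights):
    print(label_id, round(w, 6))
```

These weights are not used in this notebook. As far as the library documentation describes, `ClassificationModel` accepts a `weight` list of length `num_labels` for this purpose.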


## Model Setup

In [12]:
from sklearn.metrics import f1_score, accuracy_score, precision_score, recall_score

def f1_multiclass(labels, preds):
    return f1_score(labels, preds, average='macro')

def precision_multiclass(labels, preds):
    return precision_score(labels, preds, average='macro')

def recall_multiclass(labels, preds):
    return recall_score(labels, preds, average='macro')

In [15]:
train_args = {
    "reprocess_input_data": True,
    "overwrite_output_dir": True,
    "evaluate_during_training": True,
    "evaluate_during_training_steps": 200,    
    "evaluate_during_training_verbose": False,
    "evaluate_during_training_silent": True,
    "silent": True,
    "fp16": False,
    "wandb_project": project_name,
    }
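
With `evaluate_during_training_steps` set to 200, the model is evaluated (and a checkpoint written) every 200 optimizer steps. A rough back-of-the-envelope sketch of how many steps one epoch has follows. The batch size of 8 is an assumption (it is believed to be the SimpleTransformers default and is not set explicitly in `train_args`).

```python
import math

n_train = 9245     # number of training articles
batch_size = 8     # assumption: default train_batch_size, not set in train_args
eval_every = 200   # evaluate_during_training_steps from train_args

steps_per_epoch = math.ceil(n_train / batch_size)
eval_steps = list(range(eval_every, steps_per_epoch + 1, eval_every))

print(steps_per_epoch)  # 1156
print(eval_steps)       # [200, 400, 600, 800, 1000]
```

This agrees with the checkpoint directories that appear in `outputs/` after training (`checkpoint-200` through `checkpoint-1000`, plus `checkpoint-1156-epoch-1`).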

In [69]:
def train():

    # create the tokenizer first so the train args can adopt its lower-casing setting
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    args = {**train_args, "do_lower_case": tokenizer.do_lower_case}

    # Create a ClassificationModel
    model = ClassificationModel(model_type,
                                model_name,
                                num_labels=train_df.labels.nunique(),
                                args=args)


    steps, details = model.train_model(train_df,
                                       eval_df=test_df,
                                       verbose=False,
                                       f1=f1_multiclass,
                                       acc=accuracy_score,
                                       precision=precision_multiclass,
                                       recall=recall_multiclass)
    
    result, _, _ = model.eval_model(test_df,
                                    f1=f1_multiclass,
                                    acc=accuracy_score,
                                    precision=precision_multiclass,
                                    recall=recall_multiclass,
                                    wandb_log=False)
    
    print(", ".join(f"{k}={v:.4}" for k,v in pd.Series(result).items()))

    # close the current wandb run (wandb.join() is the deprecated name for this)
    wandb.finish()

In [70]:
# run four experiments to see how much the results vary between runs
for _ in range(4):
    train()

mcc=0.8641, f1=0.8796, acc=0.8813, precision=0.8812, recall=0.8786, eval_loss=0.3743
mcc=0.8585, f1=0.8737, acc=0.8765, precision=0.8758, recall=0.8729, eval_loss=0.4117
mcc=0.8699, f1=0.8811, acc=0.8862, precision=0.8826, recall=0.882, eval_loss=0.3799
mcc=0.8685, f1=0.8821, acc=0.8852, precision=0.8839, recall=0.8808, eval_loss=0.379
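
Since the four runs differ only in their random initialization and data shuffling, it helps to summarize them by mean and standard deviation. The following sketch parses the printed result lines with the standard library; the helper `parse_metrics` is illustrative and not part of the notebook.

```python
import statistics

# Result lines of the four runs, copied from the output.
lines = [
    "mcc=0.8641, f1=0.8796, acc=0.8813, precision=0.8812, recall=0.8786, eval_loss=0.3743",
    "mcc=0.8585, f1=0.8737, acc=0.8765, precision=0.8758, recall=0.8729, eval_loss=0.4117",
    "mcc=0.8699, f1=0.8811, acc=0.8862, precision=0.8826, recall=0.882, eval_loss=0.3799",
    "mcc=0.8685, f1=0.8821, acc=0.8852, precision=0.8839, recall=0.8808, eval_loss=0.379",
]

def parse_metrics(line):
    """Turn 'a=1.0, b=2.0' into {'a': 1.0, 'b': 2.0}."""
    pairs = (item.split("=") for item in line.split(", "))
    return {name: float(value) for name, value in pairs}

runs = [parse_metrics(line) for line in lines]

# Mean and sample standard deviation per metric across the runs.
summary = {}
for metric in runs[0]:
    values = [run[metric] for run in runs]
    summary[metric] = (statistics.mean(values), statistics.stdev(values))

for metric, (mean, std) in summary.items():
    print(f"{metric}: mean={mean:.4f} std={std:.4f}")
```

Four runs give only a rough estimate of the spread, so small differences between model variants should be read with that in mind.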


---
## Evaluate Best Model

In [71]:
!ls -la outputs/

total 264112
drwxr-xr-x 9 root root      4096 Nov 14 21:22 .
drwxr-xr-x 1 root root      4096 Nov 14 21:19 ..
drwxr-xr-x 2 root root      4096 Nov 14 21:20 best_model
drwxr-xr-x 2 root root      4096 Nov 14 21:21 checkpoint-1000
drwxr-xr-x 2 root root      4096 Nov 14 21:22 checkpoint-1156-epoch-1
drwxr-xr-x 2 root root      4096 Nov 14 21:20 checkpoint-200
drwxr-xr-x 2 root root      4096 Nov 14 21:20 checkpoint-400
drwxr-xr-x 2 root root      4096 Nov 14 21:21 checkpoint-600
drwxr-xr-x 2 root root      4096 Nov 14 21:21 checkpoint-800
-rw-r--r-- 1 root root      1024 Nov 14 22:29 config.json
-rw-r--r-- 1 root root       162 Nov 14 22:29 eval_results.txt
-rw-r--r-- 1 root root      2685 Nov 14 22:29 model_args.json
-rw-r--r-- 1 root root 269663345 Nov 14 22:29 pytorch_model.bin
-rw-r--r-- 1 root root       112 Nov 14 22:29 special_tokens_map.json
-rw-r--r-- 1 root root       339 Nov 14 22:29 tokenizer_config.json
-rw-r--r-- 1 root root    479105 Nov 14 22:29 tokenizer.json

In [72]:
# Load the model that SimpleTransformers stored as the best one.
# CAUTION: for some reason this seems to be the last model, not the best model.
model = ClassificationModel(model_type, "outputs/best_model")

In [73]:
result, model_outputs, wrong_predictions = model.eval_model(
    test_df,
    f1=f1_multiclass,
    acc=accuracy_score,
    precision=precision_multiclass,
    recall=recall_multiclass,
    wandb_log=False,
)
pd.Series(result)

mcc          0.868474
f1           0.882109
acc          0.885214
precision    0.883880
recall       0.880802
eval_loss    0.379012
dtype: float64

In [74]:
preds = pd.DataFrame(model_outputs, columns=le.classes_)
preds

| | Etat | Inland | International | Kultur | Panorama | Sport | Web | Wirtschaft | Wissenschaft |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.876530 | 2.946143 | -0.962646 | -2.103028 | -0.840753 | -2.275208 | -0.945771 | 3.987202 | -1.873073 |
| 1 | -1.506068 | -1.769839 | -1.334047 | -0.853971 | -0.629229 | 6.310894 | -1.118965 | -0.756840 | -1.061794 |
| 2 | -0.916884 | -1.182561 | -1.041860 | -1.375201 | -1.185755 | -2.032864 | 6.058130 | -0.062409 | -1.349943 |
| 3 | -1.494297 | -1.778045 | -1.204552 | -0.870875 | -0.467632 | 6.212179 | -1.074701 | -0.802107 | -1.107176 |
| 4 | -1.329643 | -1.451951 | 5.778667 | -1.430867 | 0.092532 | -2.052147 | -0.307395 | -0.449287 | -1.569076 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1023 | -0.180114 | -1.154644 | -1.008521 | -1.415806 | -1.289979 | -2.289228 | 5.715694 | 0.211121 | -1.754068 |
| 1024 | -0.538880 | 5.011572 | -0.189598 | -1.434116 | -0.386076 | -2.254290 | -0.726566 | 0.364774 | -1.741417 |
| 1025 | -1.408488 | -1.489617 | -1.456370 | -0.808149 | -0.511240 | 6.276954 | -1.294935 | -0.762925 | -1.022554 |
| 1026 | -1.506702 | -1.736403 | -1.427306 | -0.748358 | -0.626613 | 6.271152 | -1.051021 | -0.722355 | -1.005033 |


In [75]:
# preds.to_csv("data/predictions.csv", index=False)

In [76]:
pred_s = pd.DataFrame(model_outputs).idxmax(axis=1)
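
The values in `model_outputs` are raw scores (logits), not probabilities; taking the column with the highest score per row gives the predicted class. As an illustrative sketch, the first row of `preds` can be turned into probabilities with a softmax and decoded to a topic name. The `softmax` helper is written here for illustration and is not part of the notebook.

```python
import math

# Class names in encoder order (le.classes_) and the first row of model outputs.
class_names = ["Etat", "Inland", "International", "Kultur", "Panorama",
               "Sport", "Web", "Wirtschaft", "Wissenschaft"]
logits = [-0.876530, 2.946143, -0.962646, -2.103028, -0.840753,
          -2.275208, -0.945771, 3.987202, -1.873073]

def softmax(values):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)  # same index as argmax of the logits

print(class_names[best])  # Wirtschaft
print(round(probs[best], 2))
```

Softmax does not change which class wins, but it shows how close the runner-up is: for this article, `Inland` has the second-highest score.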

In [77]:
import sklearn.metrics as skm
skm.confusion_matrix(test_df.labels, pred_s)

array([[ 58,   1,   3,   1,   0,   0,   2,   2,   0],
       [  1,  85,   2,   2,   5,   0,   0,   5,   2],
       [  1,   1, 127,   0,  12,   2,   1,   6,   1],
       [  1,   1,   0,  46,   4,   0,   0,   0,   2],
       [  1,   7,   8,   2, 141,   0,   1,   5,   3],
       [  0,   0,   0,   0,   2, 118,   0,   0,   0],
       [  1,   2,   0,   1,   1,   0, 163,   0,   0],
       [  0,   8,   3,   0,   6,   0,   1, 122,   1],
       [  0,   1,   3,   0,   1,   1,   0,   1,  50]])
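
In this matrix, rows are true labels and columns are predicted labels, both in encoder order. As a worked example, accuracy and the macro-averaged metrics reported during evaluation can be recomputed from the matrix alone:

```python
# Confusion matrix from above: rows = true labels, columns = predicted labels.
cm = [
    [58, 1, 3, 1, 0, 0, 2, 2, 0],
    [1, 85, 2, 2, 5, 0, 0, 5, 2],
    [1, 1, 127, 0, 12, 2, 1, 6, 1],
    [1, 1, 0, 46, 4, 0, 0, 0, 2],
    [1, 7, 8, 2, 141, 0, 1, 5, 3],
    [0, 0, 0, 0, 2, 118, 0, 0, 0],
    [1, 2, 0, 1, 1, 0, 163, 0, 0],
    [0, 8, 3, 0, 6, 0, 1, 122, 1],
    [0, 1, 3, 0, 1, 1, 0, 1, 50],
]
n = len(cm)

# Accuracy: correctly classified articles (diagonal) over all articles.
total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(n)) / total

row_sums = [sum(row) for row in cm]                             # articles per true class
col_sums = [sum(cm[r][c] for r in range(n)) for c in range(n)]  # articles per predicted class

precision = [cm[i][i] / col_sums[i] for i in range(n)]
recall = [cm[i][i] / row_sums[i] for i in range(n)]
f1 = [2 * p * r / (p + r) for p, r in zip(precision, recall)]

# Macro average: unweighted mean over classes, so every topic counts equally.
macro_precision = sum(precision) / n
macro_recall = sum(recall) / n
macro_f1 = sum(f1) / n

print(round(accuracy, 4))  # 0.8852
print(round(macro_f1, 4))  # 0.8821
```

Because macro averaging weights all nine topics equally, small classes such as `Kultur` or `Wissenschaft` influence the score as much as large ones.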

In [78]:
print(skm.classification_report(test_df.labels, pred_s, target_names=le.classes_))

               precision    recall  f1-score   support

         Etat       0.92      0.87      0.89        67
       Inland       0.80      0.83      0.82       102
International       0.87      0.84      0.86       151
       Kultur       0.88      0.85      0.87        54
     Panorama       0.82      0.84      0.83       168
        Sport       0.98      0.98      0.98       120
          Web       0.97      0.97      0.97       168
   Wirtschaft       0.87      0.87      0.87       141
 Wissenschaft       0.85      0.88      0.86        57

     accuracy                           0.89      1028
    macro avg       0.88      0.88      0.88      1028
 weighted avg       0.89      0.89      0.89      1028

