<a href="https://colab.research.google.com/github/goerlitz/nlp-classification/blob/main/notebooks/10kGNAD/colab/TextClassifierDistilbertGerman.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Classifying German News Articles with SimpleTransformers

## Objectives

1. Train a text classifier with transfer learning based on a pretrained German DistilBERT transformer model.
2. Keep the implementation simple (just a few lines of code) by using the SimpleTransformers library.

## Approach

This solution is heavily inspired by the article https://www.philschmid.de/bert-text-classification-in-a-different-language/, which fine-tunes the pretrained `distilbert-base-german-cased` language model on a downstream task: identifying offensive language in German tweets (the [Germeval 2019](https://projects.fzai.h-da.de/iggsa/projekt/) dataset).

In the following, the same pretrained `distilbert-base-german-cased` model is fine-tuned on the 10k German News Articles dataset to classify articles into 9 news topics.

## Learnings

...

## Prerequisites

In [None]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Select the Runtime > "Change runtime type" menu to enable a GPU accelerator, ')
  print('and then re-execute this cell.')
else:
  print(gpu_info)

Sat Jun 12 23:23:18 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.27       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   33C    P0    38W / 300W |   3379MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
# install transformers
!pip install -q -U tqdm==4.47.0 transformers simpletransformers >/dev/null

# check installed version
!pip freeze | grep transformers
!pip freeze | grep torch
# simpletransformers==0.61.6
# transformers==4.6.1
# torch==1.8.1+cu101

simpletransformers==0.61.6
transformers==4.6.1
torch==1.8.1+cu101
torchsummary==1.5.1
torchtext==0.9.1
torchvision==0.9.1+cu101


In [None]:
import numpy as np
import pandas as pd
from pathlib import Path
import os

from simpletransformers.classification import ClassificationModel
from transformers import AutoTokenizer

# hide progress bar when downloading tokenizers - a workaround!
from transformers import logging
logging.get_verbosity = lambda : logging.NOTSET

# suppress "Some parameters of your model ..." when loading a pretrained model
logging.set_verbosity_error()

os.environ["WANDB_SILENT"] = "true"

## Download Data

The data comes from the [10k German News Articles Dataset](https://tblock.github.io/10kGNAD/).

In [None]:
%env DIR=data

!mkdir -p $DIR
!wget -nc https://github.com/tblock/10kGNAD/blob/master/train.csv?raw=true -nv -O $DIR/train.csv
!wget -nc https://github.com/tblock/10kGNAD/blob/master/test.csv?raw=true -nv -O $DIR/test.csv
!ls -lAh $DIR | cut -d " " -f 5-

env: DIR=data

2.7M Jun 12 22:32 test.csv
 24M Jun 12 22:32 train.csv


## Import Data

In [None]:
data_dir = Path("data/")

train_file = data_dir / 'train.csv'
test_file = data_dir / 'test.csv'

def read_csv_10kGNAD(filepath: Path, columns=("labels", "text")) -> pd.DataFrame:
    """Load 10kGNAD csv file, handling its specific file format."""
    # the files have no header row, use ';' as separator and "'" as quote character
    return pd.read_csv(filepath, sep=";", quotechar="'", names=list(columns))
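
The file format handled by `read_csv_10kGNAD` can be illustrated with the standard library alone. The following is a minimal sketch on a hypothetical two-line sample (the sample text is made up, not taken from the dataset); it only shows how a `;` separator combined with a `'` quote character splits each line into a label and a text.

```python
import csv
import io

# Hypothetical sample in the 10kGNAD layout: no header row, ';' as separator,
# and "'" as quote character, so the text itself may contain ';' and '"'.
sample = (
    "Sport;'Ein Beispieltext; mit Semikolon'\n"
    'Wirtschaft;Putin: "Einigung" erzielt\n'
)

reader = csv.reader(io.StringIO(sample), delimiter=";", quotechar="'")
rows = [row for row in reader]

for label, text in rows:
    print(f"{label:<12} {text}")
```

Because the double quote is not the quote character here, quotations inside an article are kept as ordinary text.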

In [None]:
train_df = read_csv_10kGNAD(data_dir / 'train.csv')
print(train_df.shape[0], 'articles')
display(train_df.head())

9245 articles


| | labels | text |
|---|---|---|
| 0 | Sport | 21-Jähriger fällt wohl bis Saisonende aus. Wie... |
| 1 | Kultur | Erfundene Bilder zu Filmen, die als verloren g... |
| 2 | Web | Der frischgekürte CEO Sundar Pichai setzt auf ... |
| 3 | Wirtschaft | Putin: "Einigung, dass wir Menge auf Niveau vo... |
| 4 | Inland | Estland sieht den künftigen österreichischen P... |


In [None]:
test_df = read_csv_10kGNAD(data_dir / 'test.csv')
print(test_df.shape[0], 'articles')
display(test_df.head())

1028 articles


| | labels | text |
|---|---|---|
| 0 | Wirtschaft | Die Gewerkschaft GPA-djp lanciert den "All-in-... |
| 1 | Sport | Franzosen verteidigen 2:1-Führung – Kritische ... |
| 2 | Web | Neues Video von Designern macht im Netz die Ru... |
| 3 | Sport | 23-jähriger Brasilianer muss vier Spiele pausi... |
| 4 | International | Aufständische verwendeten Chemikalie bei Gefec... |


## Prepare Data for Model Training

There are a few requirements for feeding training data into SimpleTransformers:

* columns should be labeled `labels` and `text` (already done when reading the data)
* labels must be encoded as int values (starting at `0`!)

Additionally, we can handle imbalanced datasets by

* computing class weights for training
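
The label requirement can be checked before training. The sketch below uses a hypothetical helper, `check_labels`, that is not part of SimpleTransformers; it assumes the labels are plain Python ints and that every class occurs at least once.

```python
def check_labels(labels, num_labels):
    """Return True if labels are ints covering exactly 0..num_labels-1."""
    all_ints = all(isinstance(label, int) for label in labels)
    return all_ints and set(labels) == set(range(num_labels))


print(check_labels([0, 2, 1, 1], num_labels=3))      # labels 0, 1, 2 are present
print(check_labels([1, 2, 3], num_labels=3))         # labels start at 1, not 0
print(check_labels(["Sport", "Web"], num_labels=2))  # labels are still strings
```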

### Label Encoding

In [None]:
from sklearn.preprocessing import LabelEncoder

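def encode_labels(train: pd.DataFrame, test: pd.DataFrame):
    """Encode string labels as ints (0..n-1) and return the int-to-name mapping."""
    le = LabelEncoder()

    train_labels = le.fit_transform(train.labels)
    test_labels = le.transform(test.labels)

    # index = encoded int label, value = original class name
    mapping = pd.Series(le.classes_)

    return train.assign(labels=train_labels), test.assign(labels=test_labels), mapping

# mapping_s is needed later to label the evaluation results
train_df, test_df, mapping_s = encode_labels(train_df, test_df)
display(train_df.head())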

| | labels | text |
|---|---|---|
| 0 | 5 | 21-Jähriger fällt wohl bis Saisonende aus. Wie... |
| 1 | 3 | Erfundene Bilder zu Filmen, die als verloren g... |
| 2 | 6 | Der frischgekürte CEO Sundar Pichai setzt auf ... |
| 3 | 7 | Putin: "Einigung, dass wir Menge auf Niveau vo... |
| 4 | 1 | Estland sieht den künftigen österreichischen P... |
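
`LabelEncoder` numbers the classes in sorted order of their names, which explains why `Sport` becomes `5` and `Inland` becomes `1`. The following sketch reproduces that mapping in plain Python. The full list of nine topic names is an assumption here, since only some of them appear in the outputs above.

```python
# The nine 10kGNAD topics (assumed; only some appear in the outputs above).
classes = ["Web", "Panorama", "International", "Wirtschaft", "Sport",
           "Inland", "Etat", "Wissenschaft", "Kultur"]

# Sorting the unique names and enumerating them yields the int encoding.
name_to_id = {name: i for i, name in enumerate(sorted(set(classes)))}

# The inverse mapping turns predicted ints back into readable topic names.
id_to_name = {i: name for name, i in name_to_id.items()}

print(name_to_id["Sport"], name_to_id["Kultur"], name_to_id["Web"])
print(id_to_name[7], id_to_name[1])
```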


### Computing Class Weights (not used yet)

In [None]:
from sklearn.utils.class_weight import compute_class_weight

def class_weights(labels: pd.Series) -> pd.DataFrame:
    """Compute class weights for imbalanced data."""
    uniq_labels = labels.unique()
    counts_s = labels.value_counts().reindex(uniq_labels)
    # pass classes and y as keywords; newer scikit-learn versions require this
    weights = compute_class_weight("balanced", classes=uniq_labels, y=labels)
    return pd.DataFrame({"count": counts_s, "weight": weights}).sort_index()

weights_df = class_weights(train_df.labels)
display(weights_df)

| label | count | weight |
|---|---|---|
| 0 | 601 | 1.709188 |
| 1 | 913 | 1.125106 |
| 2 | 1360 | 0.75531 |
| 3 | 485 | 2.117984 |
| 4 | 1510 | 0.68028 |
| 5 | 1081 | 0.950252 |
| 6 | 1509 | 0.68073 |
| 7 | 1270 | 0.808836 |
| 8 | 516 | 1.990741 |
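
The `"balanced"` heuristic computes each weight as `n_samples / (n_classes * count)`, so rare classes receive weights above 1 and frequent classes weights below 1. As a sketch, the weights can be recomputed by hand from the class counts shown above:

```python
# Class counts from the output above (list index = encoded label).
counts = [601, 913, 1360, 485, 1510, 1081, 1509, 1270, 516]

n_samples = sum(counts)
n_classes = len(counts)

# "balanced" heuristic: n_samples / (n_classes * count_of_class)
weights = [n_samples / (n_classes * c) for c in counts]

for label, (count, weight) in enumerate(zip(counts, weights)):
    print(f"{label}: count={count:5d} weight={weight:.6f}")
```

A perfectly balanced dataset would give every class a weight of exactly 1.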


## Model Setup

In [None]:
from sklearn.metrics import f1_score, accuracy_score, precision_score, recall_score

def f1_multiclass(labels, preds):
    return f1_score(labels, preds, average='macro')

def precision_multiclass(labels, preds):
    return precision_score(labels, preds, average='macro')

def recall_multiclass(labels, preds):
    return recall_score(labels, preds, average='macro')
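
The metric helpers above use `average='macro'`, which scores each class separately and then takes the unweighted mean, so small classes count as much as large ones. The sketch below is a plain-Python illustration on made-up toy labels, not a replacement for the scikit-learn functions. It shows how macro F1 can be far lower than accuracy when a minority class is never predicted.

```python
def macro_f1(labels, preds):
    """Unweighted mean of per-class F1 scores (a class without true positives scores 0)."""
    scores = []
    for cls in sorted(set(labels) | set(preds)):
        tp = sum(1 for l, p in zip(labels, preds) if l == cls and p == cls)
        fp = sum(1 for l, p in zip(labels, preds) if l != cls and p == cls)
        fn = sum(1 for l, p in zip(labels, preds) if l == cls and p != cls)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data: the classifier always predicts the majority class 0.
labels = [0, 0, 0, 0, 1]
preds = [0, 0, 0, 0, 0]

accuracy = sum(l == p for l, p in zip(labels, preds)) / len(labels)
print(accuracy, macro_f1(labels, preds))
```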

In [None]:
import wandb

# initialize weights & biases logging
project_name = "10kGNAD_SimpleTransformers_base"

# define training parameters
train_args = {"reprocess_input_data": True,
              "fp16": False,
              "num_train_epochs": 1,
              # "weight": list(weights_df.weight),
              "evaluate_during_training": True,
              "evaluate_during_training_steps": 200,
              "overwrite_output_dir": True,
              "wandb_project": project_name}

model_type = "distilbert"
model_name = "distilbert-base-german-cased"

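def train():
    """Fine-tune the pretrained model once and return the trained model."""

    # need to create a tokenizer first and adjust train args with lower case setting
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    args = {**train_args, "do_lower_case": tokenizer.do_lower_case}

    # Create a ClassificationModel
    model = ClassificationModel(model_type,
                                model_name,
                                num_labels=train_df.labels.nunique(),
                                args=args)

    # extra keyword arguments are evaluated as additional metrics
    steps, details = model.train_model(train_df,
                                       eval_df=test_df,
                                       verbose=False,
                                       f1=f1_multiclass,
                                       acc=accuracy_score,
                                       precision=precision_multiclass,
                                       recall=recall_multiclass)

    # close the Weights & Biases run
    wandb.join()

    return model

In [None]:
# repeats training runs until the cell is interrupted manually;
# `model` keeps the most recently completed run for the evaluation below
while True:
    model = train()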
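
To get a feeling for how often `evaluate_during_training_steps` triggers an evaluation, the number of optimization steps can be estimated from the dataset size. This is only a rough sketch: the batch size of 8 is an assumption (it is not set in `train_args`, so the library default applies), and evaluations that the library may run at other points are not counted.

```python
import math

n_train = 9245          # training articles, from the output above
train_batch_size = 8    # assumption: library default, not set in train_args
eval_every = 200        # "evaluate_during_training_steps"
num_epochs = 1          # "num_train_epochs"

steps_per_epoch = math.ceil(n_train / train_batch_size)
total_steps = steps_per_epoch * num_epochs

# number of evaluations triggered by the step interval alone
n_evals = total_steps // eval_every

print(steps_per_epoch, n_evals)
```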

---
## Evaluate Best Model

In [None]:
result, model_outputs, wrong_predictions = model.eval_model(test_df,
                                                            f1=f1_multiclass,
                                                            acc=accuracy_score,
                                                            precision=precision_multiclass,
                                                            recall=recall_multiclass,
                                                            wandb_log=False)
pd.Series(result)

In [None]:
preds = pd.DataFrame(model_outputs, columns=mapping_s)
preds

In [None]:
# preds.to_csv("data/predictions.csv", index=False)

In [None]:
pred_s = pd.DataFrame(model_outputs).idxmax(axis=1)

In [None]:
mapping_s.values

In [None]:
import sklearn.metrics as skm
skm.confusion_matrix(test_df.labels, pred_s)

In [None]:
print(skm.classification_report(test_df.labels, pred_s, target_names=mapping_s.values))
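
The evaluation cells above turn per-class scores into predictions by taking the column with the highest value and then compare them with the true labels. The sketch below walks through the same steps in plain Python on made-up numbers. It assumes the model outputs are unnormalized scores, and the three class names are only an illustrative subset.

```python
import math

# Hypothetical raw model outputs: one row per article, one column per class.
outputs = [[2.0, 0.5, -1.0],
           [0.1, 0.2, 3.0],
           [1.5, 1.6, 0.0]]
names = ["Etat", "Inland", "International"]  # illustrative subset of class names
true_ids = [0, 2, 0]


def softmax(row):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


probs = [softmax(row) for row in outputs]

# The predicted class is the column with the highest score (same idea as idxmax).
pred_ids = [row.index(max(row)) for row in outputs]
pred_names = [names[i] for i in pred_ids]

# Confusion matrix: rows = true class, columns = predicted class.
n = len(names)
confusion = [[0] * n for _ in range(n)]
for t, p in zip(true_ids, pred_ids):
    confusion[t][p] += 1

print(pred_names)
print(confusion)
```

Softmax does not change which class has the highest score; it only makes the scores comparable as probabilities.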