<a href="https://colab.research.google.com/github/goerlitz/nlp-classification/blob/main/notebooks/10kGNAD/colab/22_10kGNAD_simpletransformers_hyperparam_distilbert.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hyperparameter Optimization for Classification of German News Articles

## Objective

Train a classifier that outperforms one trained with the default SimpleTransformers setup by optimizing its hyperparameters.

Use different pretrained models for optimization:
* `distilbert-base-german-cased`
* `deepset/gbert-base`

## Approach

Use SimpleTransformers together with Weights & Biases sweeps, which evaluate different hyperparameter combinations.

See also:
* https://simpletransformers.ai/docs/tips-and-tricks/#hyperparameter-optimization
* https://towardsdatascience.com/hyperparameter-optimization-for-optimum-transformer-models-b95a32b70949


## Prerequisites

In [1]:
# model_type = "distilbert"
# model_name = "distilbert-base-german-cased"

model_type = "bert"
model_name = "deepset/gbert-base"

project_name = "10kgnad_sweep__" + model_name.replace("/", "_")

In [2]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Select the Runtime > "Change runtime type" menu to enable a GPU accelerator, ')
  print('and then re-execute this cell.')
else:
  print(gpu_info)

Tue Jun 15 21:09:55 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.27       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

### Install Libraries

In [3]:
# install transformers
!pip install -q --upgrade tqdm==4.47.0 >/dev/null
!pip install -q --upgrade transformers simpletransformers >/dev/null

# check installed version
!pip freeze | grep transformers
# simpletransformers==0.61.6
# transformers==4.6.1

ERROR: google-colab 1.0.0 has requirement ipykernel~=4.10, but you'll have ipykernel 5.5.5 which is incompatible.
simpletransformers==0.61.6
transformers==4.6.1


In [4]:
import pandas as pd
from pathlib import Path
import os

from simpletransformers.classification import ClassificationModel
from transformers import AutoTokenizer
from transformers import logging
import wandb

# hide progress bar when downloading tokenizers - a workaround!
logging.get_verbosity = lambda : logging.NOTSET

# disable transformer warnings like "Some weights of the model checkpoint"
logging.set_verbosity_error()

# disable logging of wandb
os.environ["WANDB_SILENT"] = "true"

### Download Data

Get the 10k German News Articles Dataset

In [5]:
%env DIR=data

!mkdir -p $DIR
!wget -nc https://github.com/tblock/10kGNAD/blob/master/train.csv?raw=true -nv -O $DIR/train.csv
!wget -nc https://github.com/tblock/10kGNAD/blob/master/test.csv?raw=true -nv -O $DIR/test.csv
!ls -lAh $DIR | cut -d " " -f 5-

env: DIR=data
2021-06-15 21:10:28 URL:https://raw.githubusercontent.com/tblock/10kGNAD/master/train.csv [24405789/24405789] -> "data/train.csv" [1]
2021-06-15 21:10:29 URL:https://raw.githubusercontent.com/tblock/10kGNAD/master/test.csv [2755020/2755020] -> "data/test.csv" [1]

2.7M Jun 15 21:10 test.csv
 24M Jun 15 21:10 train.csv


## Import Data

Load the training and test datasets.

In [6]:
data_dir = Path(os.getenv("DIR"))

train_file = data_dir / 'train.csv'
test_file = data_dir / 'test.csv'

def read_csv_10kGNAD(filepath: Path, columns=("labels", "text")) -> pd.DataFrame:
    """Load a 10kGNAD csv file: semicolon-separated, single-quote quoting, no header row."""
    return pd.read_csv(filepath, sep=";", quotechar="'", names=list(columns))
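
The file format handled by the loader can be illustrated with the standard `csv` module. This is only a sketch: the two sample rows are invented, and the assumption is that `'` is chosen as quote character so that double quotes inside article text are read literally.

```python
import csv
import io

# Two invented rows in the assumed 10kGNAD layout: label;text, no header row.
sample = (
    'Sport;Der Trainer sagte: "Wir sind bereit."\n'
    "Web;'Neues Update; jetzt verfügbar'\n"
)

# Same delimiter and quote character as passed to pd.read_csv in the loader.
reader = csv.reader(io.StringIO(sample), delimiter=";", quotechar="'")
rows = [dict(zip(["labels", "text"], row)) for row in reader]

for row in rows:
    print(row)
```

The double quotes in the first row stay part of the text, while the semicolon inside the single-quoted field of the second row is not treated as a separator.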

In [7]:
train_df = read_csv_10kGNAD(train_file)
print(train_df.shape[0], 'articles')
display(train_df.head())

9245 articles


|   | labels | text |
|---|--------|------|
| 0 | Sport | 21-Jähriger fällt wohl bis Saisonende aus. Wie... |
| 1 | Kultur | Erfundene Bilder zu Filmen, die als verloren g... |
| 2 | Web | Der frischgekürte CEO Sundar Pichai setzt auf ... |
| 3 | Wirtschaft | Putin: "Einigung, dass wir Menge auf Niveau vo... |
| 4 | Inland | Estland sieht den künftigen österreichischen P... |


In [8]:
test_df = read_csv_10kGNAD(test_file)
print(test_df.shape[0], 'articles')
display(test_df.head())

1028 articles


|   | labels | text |
|---|--------|------|
| 0 | Wirtschaft | Die Gewerkschaft GPA-djp lanciert den "All-in-... |
| 1 | Sport | Franzosen verteidigen 2:1-Führung – Kritische ... |
| 2 | Web | Neues Video von Designern macht im Netz die Ru... |
| 3 | Sport | 23-jähriger Brasilianer muss vier Spiele pausi... |
| 4 | International | Aufständische verwendeten Chemikalie bei Gefec... |


## Prepare for Model Training

Model Input Requirements:

* columns should be labeled `labels` and `text` (already done during import)
* labels must be int values starting at `0`

### Label Encoding

In [9]:
from sklearn.preprocessing import LabelEncoder

def encode_labels(train: pd.DataFrame, test: pd.DataFrame):
    le = LabelEncoder()

    train_labels = le.fit_transform(train.labels)
    test_labels = le.transform(test.labels)

    return train.assign(labels=train_labels), test.assign(labels=test_labels)

train_df, test_df = encode_labels(train_df, test_df)
display(train_df.head())

|   | labels | text |
|---|--------|------|
| 0 | 5 | 21-Jähriger fällt wohl bis Saisonende aus. Wie... |
| 1 | 3 | Erfundene Bilder zu Filmen, die als verloren g... |
| 2 | 6 | Der frischgekürte CEO Sundar Pichai setzt auf ... |
| 3 | 7 | Putin: "Einigung, dass wir Menge auf Niveau vo... |
| 4 | 1 | Estland sieht den künftigen österreichischen P... |
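
The integer codes above follow from how `LabelEncoder` works: it sorts the distinct class names and numbers them from `0`. A minimal pure-Python sketch of that mapping, using the nine 10kGNAD categories (`label_to_id` and `encode` are illustrative names, not part of the notebook):

```python
# The nine 10kGNAD categories, listed in arbitrary order.
categories = ["Web", "Panorama", "International", "Wirtschaft", "Sport",
              "Inland", "Etat", "Wissenschaft", "Kultur"]

# Sort the distinct class names and number them from 0, as LabelEncoder does.
label_to_id = {label: idx for idx, label in enumerate(sorted(set(categories)))}

def encode(labels):
    """Map string labels to their integer codes."""
    return [label_to_id[label] for label in labels]

# The first five training labels shown earlier in the notebook.
print(encode(["Sport", "Kultur", "Web", "Wirtschaft", "Inland"]))  # [5, 3, 6, 7, 1]
```

Because the encoder is fitted on the training labels only, a label that appears only in the test set would raise an error in `le.transform`.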


### Compute Class Weights

In [10]:
from sklearn.utils.class_weight import compute_class_weight

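def class_weights(labels: pd.Series):
    """Compute "balanced" class weights, ordered by class id."""
    uniq_labels = labels.unique()
    # recent scikit-learn releases require keyword arguments here
    weights = compute_class_weight(class_weight="balanced", classes=uniq_labels, y=labels)
    return pd.Series(weights, index=uniq_labels).sort_index()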

weights_s = class_weights(train_df.labels)
list(weights_s.values)

[1.7091883897208358,
 1.1251064865522697,
 0.7553104575163399,
 2.117983963344788,
 0.6802796173657101,
 0.9502518244423888,
 0.6807304322214859,
 0.8088363954505686,
 1.9907407407407407]
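
For reference, the `"balanced"` heuristic assigns each class the weight `n_samples / (n_classes * class_count)`, so rare classes get weights above 1 and frequent ones below 1. A minimal sketch of that formula without scikit-learn (the helper name and the toy labels are made up for illustration):

```python
from collections import Counter

def balanced_class_weights(labels):
    """Return {class: n_samples / (n_classes * class_count)}, sorted by class."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * counts[cls]) for cls in sorted(counts)}

# Toy example (invented): class 0 is three times as frequent as class 1.
toy_weights = balanced_class_weights([0, 0, 0, 1])
print(toy_weights)
```

In the toy example the rare class `1` gets weight `2.0` and the frequent class `0` gets `2/3`.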

## Evaluation Setup

In [11]:
sweep_config = {
    "method": "bayes",  # alternatives: "grid", "random"
    "metric": {"name": "f1", "goal": "maximize"},
    "parameters": {
        "num_train_epochs": {"values": [1, 2, 3, 4, 5]},
        "learning_rate": {"min": 1e-5, "max": 1e-4},
        "class_weights": {"values": [0, 1]},
        "train_batch_size": {"values": [8, 16, 24, 32]},
    },
}

sweep_id = wandb.sweep(sweep_config, project=project_name)

<IPython.core.display.Javascript object>

Create sweep with ID: jlrir8v2
Sweep URL: https://wandb.ai/goerlitz/10kgnad_sweep__deepset_gbert-base/sweeps/jlrir8v2
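
To make the search space concrete, the sketch below draws one candidate configuration from the same parameter specification. It is an illustration only: `sample_config` is a hypothetical helper that samples uniformly at random, whereas the sweep above uses the `bayes` method, whose selection strategy is not reproduced here.

```python
import random

# Parameter space copied from sweep_config["parameters"].
parameters = {
    "num_train_epochs": {"values": [1, 2, 3, 4, 5]},
    "learning_rate": {"min": 1e-5, "max": 1e-4},
    "class_weights": {"values": [0, 1]},
    "train_batch_size": {"values": [8, 16, 24, 32]},
}

def sample_config(space, rng):
    """Hypothetical helper: draw one configuration uniformly at random."""
    config = {}
    for name, spec in space.items():
        if "values" in spec:
            # discrete parameter: pick one of the listed values
            config[name] = rng.choice(spec["values"])
        else:
            # continuous parameter: pick a float between min and max
            config[name] = rng.uniform(spec["min"], spec["max"])
    return config

rng = random.Random(0)  # fixed seed so the sketch is repeatable
candidate = sample_config(parameters, rng)
print(candidate)
```

Each sweep run receives one such dictionary through `wandb.config`.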


In [12]:
train_args = {
    "reprocess_input_data": True,
    "overwrite_output_dir": True,
    "evaluate_during_training": True,
    "fp16": False,
    "evaluate_during_training_verbose": False,
    "evaluate_during_training_silent": True,
    "wandb_project": project_name,
    "silent": True,
}

In [13]:
from sklearn.metrics import f1_score, accuracy_score, precision_score, recall_score

def f1_multiclass(labels, preds):
    return f1_score(labels, preds, average='macro')

def precision_multiclass(labels, preds):
    return precision_score(labels, preds, average='macro')

def recall_multiclass(labels, preds):
    return recall_score(labels, preds, average='macro')
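
With `average='macro'`, the score is computed per class and then averaged without weighting, so small classes count as much as large ones. A minimal pure-Python sketch of macro F1 (the function name and the toy data are invented for illustration):

```python
def macro_f1(labels, preds):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(labels) | set(preds))
    scores = []
    for cls in classes:
        tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
        fp = sum(1 for y, p in zip(labels, preds) if y != cls and p == cls)
        fn = sum(1 for y, p in zip(labels, preds) if y == cls and p != cls)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy example: per-class F1 is 2/3, 4/5 and 1, so the macro average is 37/45.
score = macro_f1([0, 0, 1, 1, 2], [0, 1, 1, 1, 2])
print(round(score, 4))  # 0.8222
```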

In [14]:
def train():
    # Initialize a new wandb run
    wandb.init()

    print(wandb.config)

    # create the tokenizer first so its lower-casing setting can be copied into the model args
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model_args = {**train_args, "do_lower_case": tokenizer.do_lower_case}

    # use the precomputed class weights only if the sweep enables them (1), otherwise none (0)
    weight = None if wandb.config["class_weights"] == 0 else list(weights_s.values)

    # Create a ClassificationModel
    model = ClassificationModel(
        model_type,
        model_name,
        num_labels=train_df.labels.nunique(),
        weight=weight,
        args=model_args,
        sweep_config=wandb.config
    )

    # Train the model
    model.train_model(
        train_df,
        eval_df=test_df,
        verbose=False,
        show_running_loss=False,
        f1=f1_multiclass,
        acc=accuracy_score,
        precision=precision_multiclass,
        recall=recall_multiclass,
    )

    # Finish the run and sync it to wandb (wandb.join() is deprecated)
    wandb.finish()

In [None]:
wandb.agent(sweep_id, train)

{'class_weights': 1, 'learning_rate': 4.8841260080517326e-05, 'num_train_epochs': 2, 'train_batch_size': 8}
