# Auto Trainer Tutorial

This tutorial walks through a simple example of how to use Knodle out of the box. It consists of three main parts:
1. Download the data from the Knodle server and load it into memory.
2. Initialize the model (DistilBert) and prepare the data.
3. Use AutoTrainer for a simple training run.

All the steps are discussed in more detail below.

### Download data
 
We will use a preprocessed version of the IMDb dataset. 
- In https://github.com/knodle/knodle/examples, you can find a tutorial showing how the data was preprocessed and transformed into the Knodle format. Instead of using the download in the cells below, you can also follow that tutorial to create the data yourself.
- The IMDb dataset holds movie reviews. The task is to classify whether a text is a positive or a negative movie review.

In [None]:
%load_ext autoreload
%autoreload 2

import os

imdb_data_dir = os.path.join(os.getcwd(), "datasets", "imdb")
processed_data_dir = os.path.join(imdb_data_dir, "processed")
os.makedirs(processed_data_dir, exist_ok=True)


In [None]:
from minio import Minio
from tqdm.auto import tqdm

client = Minio("knodle.dm.univie.ac.at", secure=False)
files = [
    "df_train.csv", "df_dev.csv", "df_test.csv",
    "train_rule_matches_z.lib", "dev_rule_matches_z.lib", "test_rule_matches_z.lib",
    "mapping_rules_labels_t.lib"
]

for file in tqdm(files):
    client.fget_object(
        bucket_name="knodle",
        object_name=f"datasets/imdb/processed/{file}",  # object keys use "/", even on Windows
        file_path=os.path.join(processed_data_dir, file),
    )

In [None]:
import joblib
import pandas as pd

df_train = pd.read_csv(os.path.join(processed_data_dir, "df_train.csv"))
df_dev = pd.read_csv(os.path.join(processed_data_dir, "df_dev.csv"))
df_test = pd.read_csv(os.path.join(processed_data_dir, "df_test.csv"))

mapping_rules_labels_t = joblib.load(os.path.join(processed_data_dir, "mapping_rules_labels_t.lib"))

train_rule_matches_z = joblib.load(os.path.join(processed_data_dir, "train_rule_matches_z.lib"))
dev_rule_matches_z = joblib.load(os.path.join(processed_data_dir, "dev_rule_matches_z.lib"))
test_rule_matches_z = joblib.load(os.path.join(processed_data_dir, "test_rule_matches_z.lib"))

### Data description

We have three splits: train, development, and test. For each split, there is:
- a DataFrame holding the text. The training DataFrame holds only text, whereas the development and test DataFrames hold text and labels.
- a Z matrix relating instances (rows in the DataFrame) to rules.

For more information, see the tutorial on creating the dataset at https://github.com/knodle/knodle/example.
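
To make the two matrices concrete, here is a toy sketch with made-up values (three instances, four rules, two classes; none of these numbers come from the IMDb data). `Z` has one row per instance and one column per rule. `T`, loaded above as `mapping_rules_labels_t`, maps rules to classes; we assume here that it has one row per rule and one column per class. Their product counts, for each instance, how many matching rules vote for each class.

```python
import numpy as np

# Toy Z matrix: 3 instances x 4 rules (1 = rule matches the instance).
toy_z = np.array([
    [1, 0, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],  # no rule matches this instance
])

# Toy T matrix: 4 rules x 2 classes (1 = rule votes for that class).
# Assumption: rows are rules, columns are classes.
toy_t = np.array([
    [1, 0],
    [1, 0],
    [0, 1],
    [0, 1],
])

# Votes per class for every instance: shape (3 instances, 2 classes).
toy_votes = toy_z @ toy_t
print(toy_votes)
```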

In [None]:
df_train.head()

In [None]:
df_test.head()

In [None]:
print(f"Train Z dimension: {train_rule_matches_z.shape}")
print(f"Train avg. matches per sample: {train_rule_matches_z.sum() / train_rule_matches_z.shape[0]}")
print(f"Develop avg. matches per sample: {dev_rule_matches_z.sum() / dev_rule_matches_z.shape[0]}")
print(f"Test avg. matches per sample: {test_rule_matches_z.sum() / test_rule_matches_z.shape[0]}")

Here we can already see the difficulty: on average, each instance has 34 matching rules. The challenge is therefore to determine which rule, or which combination of rules, is correct.
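
One way to quantify this difficulty is to count how many instances are matched by rules that disagree on the label. The sketch below uses toy matrices with made-up values and assumes that `T` has one row per rule and one column per class. The helper name `count_conflicting_instances` is ours and not part of Knodle.

```python
import numpy as np


def count_conflicting_instances(z: np.ndarray, t: np.ndarray) -> int:
    """Count instances whose matching rules vote for more than one class."""
    votes = z @ t  # shape: (instances, classes)
    classes_with_votes = (votes > 0).sum(axis=1)
    return int((classes_with_votes > 1).sum())


toy_z = np.array([
    [1, 0, 1],  # rules 0 and 2 match
    [1, 1, 0],  # rules 0 and 1 match
    [0, 0, 1],  # only rule 2 matches
])
toy_t = np.array([
    [1, 0],  # rule 0 -> class 0
    [1, 0],  # rule 1 -> class 0
    [0, 1],  # rule 2 -> class 1
])

n_conflicts = count_conflicting_instances(toy_z, toy_t)
print(f"Instances with conflicting rules: {n_conflicts} of {toy_z.shape[0]}")
```

If the loaded matrices are SciPy sparse matrices, convert them with `.toarray()` before passing them to this helper.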

### Preprocess data to DistilBert input

- Tokenize text. See https://huggingface.co/transformers/ on how to use Transformer-based models.
- Transform the data into the PyTorch tensor format. More specifically, the current Trainers accept a TensorDataset, which holds a list of tensors. In the future, more specialized Datasets might be useful.
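
To illustrate what padding and truncation produce, here is a library-free toy sketch. It is not how the Hugging Face tokenizer is implemented; the function `pad_and_truncate`, the token ids, and the pad id `0` are made up for illustration. Every sequence is cut to at most `max_length` tokens and padded to the length of the longest remaining sequence. The attention mask marks real tokens with 1 and padding with 0.

```python
from typing import List, Tuple


def pad_and_truncate(
    sequences: List[List[int]], max_length: int, pad_id: int = 0
) -> Tuple[List[List[int]], List[List[int]]]:
    """Return (input_ids, attention_mask) as equally long rows."""
    # Truncation: drop everything beyond max_length tokens.
    truncated = [seq[:max_length] for seq in sequences]
    # Padding: fill shorter rows up to the longest remaining row.
    target_len = max(len(seq) for seq in truncated)
    input_ids, attention_mask = [], []
    for seq in truncated:
        n_pad = target_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


toy_ids, toy_mask = pad_and_truncate([[5, 6, 7, 8, 9], [3, 4]], max_length=4)
print(toy_ids)   # [[5, 6, 7, 8], [3, 4, 0, 0]]
print(toy_mask)  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```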

In [None]:
from typing import List

import numpy as np
import scipy.sparse as sp
import torch
from torch.utils.data import TensorDataset


def convert_text_to_transformer_input(tokenizer, texts: List[str]) -> TensorDataset:
    encoding = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    input_ids = encoding.get('input_ids')
    attention_mask = encoding.get('attention_mask')

    input_values_x = TensorDataset(input_ids, attention_mask)

    return input_values_x


def np_array_to_tensor_dataset(x: np.ndarray) -> TensorDataset:
    if isinstance(x, sp.csr_matrix):
        x = x.toarray()
    x = torch.from_numpy(x)
    x = TensorDataset(x)
    return x

In [None]:
from transformers import AutoTokenizer


model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

X_train = convert_text_to_transformer_input(tokenizer, df_train["sample"].tolist())
X_dev = convert_text_to_transformer_input(tokenizer, df_dev["sample"].tolist())
X_test = convert_text_to_transformer_input(tokenizer, df_test["sample"].tolist())

y_dev = np_array_to_tensor_dataset(df_dev['label'].values)
y_test = np_array_to_tensor_dataset(df_test['label'].values)

## Training and evaluation

In general, Knodle uses the "Trainer" data structure to handle training. This is a widely used pattern in deep learning frameworks, used e.g. by PyTorch Lightning (https://pytorch-lightning.readthedocs.io/en/latest/common/trainer.html) or Hugging Face's Transformers library (https://huggingface.co/transformers/training.html#trainer).
A Trainer takes data and a configuration, which together define the training run. For each denoising method, a custom Trainer is built. Here, we use the "MajorityVoteTrainer" through the convenience wrapper "AutoTrainer", which gives access to different denoising methods via a keyword, e.g. "majority" in our case.
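
The keyword lookup can be pictured as a small registry that maps names to Trainer classes. The sketch below is purely illustrative: `ToyAutoTrainer`, `ToyMajorityTrainer`, and `ToyKnnTrainer` are hypothetical stand-ins, and Knodle's actual `AutoTrainer` may be implemented differently.

```python
class ToyMajorityTrainer:
    """Hypothetical stand-in for a majority-vote Trainer."""

    def __init__(self, **kwargs):
        self.kwargs = kwargs


class ToyKnnTrainer:
    """Hypothetical stand-in for a k-NN Trainer."""

    def __init__(self, **kwargs):
        self.kwargs = kwargs


class ToyAutoTrainer:
    """Look up a Trainer class by keyword and instantiate it."""

    _registry = {"majority": ToyMajorityTrainer, "knn": ToyKnnTrainer}

    def __new__(cls, name, **kwargs):
        if name not in cls._registry:
            raise ValueError(f"Unknown trainer name: {name!r}")
        # Return an instance of the registered class, not of ToyAutoTrainer.
        return cls._registry[name](**kwargs)


toy_trainer = ToyAutoTrainer(name="knn", k=3)
print(type(toy_trainer).__name__)  # ToyKnnTrainer
```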

The following code shows how to train using the MajorityVoteTrainer. It is a simple baseline, using the following steps:

1. Restrict the data to samples where at least one rule matches.
2. Use majority vote to determine the instance labels. If there's no clear winner, randomly choose between labels.
3. Train DistilBert on these weakly formed labels. See https://huggingface.co/transformers/ on how to use Transformer-based models.
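
The first two steps can be sketched with NumPy as follows. This is a simplified illustration on toy matrices, not Knodle's implementation; the helper name `majority_vote_labels` is ours, and we assume that `T` has one row per rule and one column per class.

```python
import numpy as np


def majority_vote_labels(z, t, seed=0):
    """Return (kept_indices, labels) for instances with at least one match."""
    rng = np.random.default_rng(seed)
    # Step 1: keep only instances where at least one rule matches.
    kept = np.flatnonzero(z.sum(axis=1) > 0)
    votes = z[kept] @ t  # shape: (kept instances, classes)
    # Step 2: pick the class with the most votes; break ties at random.
    labels = np.empty(len(kept), dtype=int)
    for i, row in enumerate(votes):
        winners = np.flatnonzero(row == row.max())
        labels[i] = rng.choice(winners)
    return kept, labels


toy_z = np.array([
    [1, 1, 0],  # two votes for class 0
    [0, 0, 0],  # no match: filtered out
    [0, 0, 1],  # one vote for class 1
    [1, 0, 1],  # tie between class 0 and class 1
])
toy_t = np.array([[1, 0], [1, 0], [0, 1]])

kept, labels = majority_vote_labels(toy_z, toy_t)
print(kept)    # indices of the instances that were kept
print(labels)  # one weak label per kept instance
```

The resulting weak labels would then serve as training targets in step 3.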

Afterwards, we show two alternative equivalent initializations of the Trainer. Then we show how easy it is to use a different method.

In [None]:
from transformers import AutoModelForSequenceClassification, AdamW

from knodle.trainer import AutoTrainer, AutoConfig


model = AutoModelForSequenceClassification.from_pretrained(model_name)

trainer_type = "majority"
custom_model_config = AutoConfig.create_config(
    name=trainer_type,
    optimizer=AdamW,
    lr=1e-4,
    batch_size=16,
    epochs=2,
    filter_non_labelled=True
)

print(custom_model_config.__dict__)

trainer = AutoTrainer(
    name="majority",
    model=model,
    mapping_rules_labels_t=mapping_rules_labels_t,
    model_input_x=X_train,
    rule_matches_z=train_rule_matches_z,
    dev_model_input_x=X_dev,
    dev_gold_labels_y=y_dev,
    trainer_config=custom_model_config,
)

trainer.train()

In [None]:
eval_dict, _ = trainer.test(X_test, y_test)
print(f"Accuracy: {eval_dict.get('accuracy')}")

### Alternative usages

The following two examples provide exactly the same functionality; only the initialization differs.

Here we use the MajorityVoteTrainer explicitly. The code above just provides some convenience.

In [None]:
from transformers import AutoModelForSequenceClassification, AdamW

from knodle.trainer import MajorityVoteTrainer, MajorityConfig


model = AutoModelForSequenceClassification.from_pretrained(model_name)

custom_model_config = MajorityConfig(
    optimizer=AdamW,
    lr=1e-4,
    batch_size=16,
    epochs=2,
    filter_non_labelled=True
)

print(custom_model_config.__dict__)

trainer = MajorityVoteTrainer(
    model=model,
    mapping_rules_labels_t=mapping_rules_labels_t,
    model_input_x=X_train,
    rule_matches_z=train_rule_matches_z,
    dev_model_input_x=X_dev,
    dev_gold_labels_y=y_dev,
    trainer_config=custom_model_config,
)

trainer.train()

Here, we use a configuration dictionary. This eases the creation of benchmarks as you just have to loop over dictionary values.

In [None]:
from transformers import AutoModelForSequenceClassification, AdamW

from knodle.trainer import AutoTrainer, AutoConfig


model = AutoModelForSequenceClassification.from_pretrained(model_name)

config_args = {
    "name": "majority",
    "optimizer": AdamW,
    "lr": 1e-4,
    "batch_size": 16,
    "epochs": 2,
    "filter_non_labelled": True
}
custom_model_config = AutoConfig.create_config(**config_args)

print(custom_model_config.__dict__)

trainer = AutoTrainer(
    name=config_args["name"],
    model=model,
    mapping_rules_labels_t=mapping_rules_labels_t,
    model_input_x=X_train,
    rule_matches_z=train_rule_matches_z,
    dev_model_input_x=X_dev,
    dev_gold_labels_y=y_dev,
    trainer_config=custom_model_config,
)
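
The benchmarking idea can be sketched as a loop over hyperparameter combinations. The grid values below are made up for illustration, and the sketch only builds the dictionaries. In a real run, each one would be passed to `AutoConfig.create_config` and `AutoTrainer` as in the cell above.

```python
from itertools import product

# Base settings shared by all runs (the optimizer is omitted to keep the
# sketch free of third-party imports).
base_args = {"name": "majority", "batch_size": 16, "filter_non_labelled": True}

# Hypothetical grid of values to benchmark.
grid = {"lr": [1e-4, 1e-5], "epochs": [2, 3]}

all_config_args = []
for values in product(*grid.values()):
    # Combine the shared settings with one point of the grid.
    config_args_i = {**base_args, **dict(zip(grid.keys(), values))}
    all_config_args.append(config_args_i)

for config_args_i in all_config_args:
    print(config_args_i)
```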

### Use the k-NN Trainer

- The following code snippet shows you how easy it is to use a different denoising method.
- The k-NN Trainer takes the k nearest neighbors and adds up the matching rules. Then again, it uses majority voting.
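
A toy NumPy sketch of this idea follows. It is not Knodle's implementation: the helper `knn_denoise_z`, the 2-d feature vectors, and the use of Euclidean distance are assumptions made for illustration, since finding neighbors requires some feature representation of the samples. Each instance's rule matches are replaced by the summed matches of its k nearest neighbors, where we assume each instance counts as its own nearest neighbor.

```python
import numpy as np


def knn_denoise_z(features, z, k):
    """Sum the rule matches of each instance's k nearest neighbors."""
    # Pairwise squared Euclidean distances, shape (n, n).
    diffs = features[:, None, :] - features[None, :, :]
    dists = (diffs ** 2).sum(axis=-1)
    # Indices of the k closest instances (the instance itself is included).
    neighbors = np.argsort(dists, axis=1, kind="stable")[:, :k]
    # z[neighbors] has shape (n, k, rules); sum over the k neighbors.
    return z[neighbors].sum(axis=1)


toy_features = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
toy_z = np.array([
    [1, 0],
    [0, 0],  # no rule matches, but its neighbor has one
    [0, 1],
    [0, 1],
])

toy_z_knn = knn_denoise_z(toy_features, toy_z, k=2)
print(toy_z_knn)
```

In this toy example, the second instance has no rule match of its own but inherits one from its neighbor. Majority voting would then be applied to the aggregated matrix.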

In [None]:
config_args["name"] = "knn"
config_args["k"] = 3

custom_model_config = AutoConfig.create_config(**config_args)

print(custom_model_config.__dict__)

trainer = AutoTrainer(
    name=config_args["name"],
    model=model,
    mapping_rules_labels_t=mapping_rules_labels_t,
    model_input_x=X_train,
    rule_matches_z=train_rule_matches_z,
    dev_model_input_x=X_dev,
    dev_gold_labels_y=y_dev,
    trainer_config=custom_model_config,
)

trainer.train()

## Further reading

We encourage you to head over to our repository
[knodle-experiments](https://github.com/knodle/knodle-experiments)
which adds a new layer of abstraction on top of Knodle, allowing you to easily create full benchmarking setups.