# Auto Trainer Tutorial

This tutorial shows a minimal example of how to use Knodle out of the box. It covers the following steps:
1. Download the data from the Knodle server and load it into memory
2. Initialize the model and the data for a TF-IDF and logistic regression example
3. Use AutoTrainer to train with two weak training strategies

All the steps are discussed in more detail below.

## Data

We will use a preprocessed version of the IMDb dataset. Instead of downloading it, you can also create the data yourself by following the corresponding preprocessing tutorial.

In [1]:
%load_ext autoreload
%autoreload 2

import os

imdb_data_dir = os.path.join(os.getcwd(), "data", "imdb")
processed_data_dir = os.path.join(imdb_data_dir, "processed")
os.makedirs(processed_data_dir, exist_ok=True)


In [2]:
from minio import Minio
from tqdm.auto import tqdm

client = Minio("knodle.dm.univie.ac.at", secure=False)
files = [
    "df_train.csv", "df_dev.csv", "df_test.csv",
    "train_rule_matches_z.lib", "dev_rule_matches_z.lib", "test_rule_matches_z.lib",
    "mapping_rules_labels_t.lib"
]

for file in tqdm(files):
    client.fget_object(
        bucket_name="knodle",
        # Object names always use "/" as separator, independent of the local OS.
        object_name=f"datasets/imdb/processed/{file}",
        file_path=os.path.join(processed_data_dir, file),
    )
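
The loop above fetches every file on each run. A minimal sketch of how already downloaded files could be skipped is shown below; `missing_files` is a hypothetical helper introduced only for illustration and is not part of Knodle or MinIO.

```python
import os
import tempfile


def missing_files(target_dir, file_names):
    """Hypothetical helper: return the names that are not yet present in target_dir."""
    return [
        name for name in file_names
        if not os.path.isfile(os.path.join(target_dir, name))
    ]


# Small self-contained demonstration in a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    # Pretend that one of the two files has already been downloaded
    open(os.path.join(tmp_dir, "df_train.csv"), "w").close()
    todo = missing_files(tmp_dir, ["df_train.csv", "df_dev.csv"])

print(todo)
```

Iterating over `missing_files(processed_data_dir, files)` instead of `files` in the download loop would then fetch only the files that are absent.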

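The download consists of three kinds of files. The `df_*.csv` files hold the review texts in the column `sample`; the dev and test splits also provide gold labels in the column `label`. The `*_rule_matches_z.lib` files hold the rule match matrices Z, with one row per sample and one column per rule. The file `mapping_rules_labels_t.lib` holds the matrix T, which maps each rule to a label.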

In [3]:
import joblib
import pandas as pd

df_train = pd.read_csv(os.path.join(processed_data_dir, "df_train.csv"))
df_dev = pd.read_csv(os.path.join(processed_data_dir, "df_dev.csv"))
df_test = pd.read_csv(os.path.join(processed_data_dir, "df_test.csv"))

mapping_rules_labels_t = joblib.load(os.path.join(processed_data_dir, "mapping_rules_labels_t.lib"))

train_rule_matches_z = joblib.load(os.path.join(processed_data_dir, "train_rule_matches_z.lib"))
dev_rule_matches_z = joblib.load(os.path.join(processed_data_dir, "dev_rule_matches_z.lib"))
test_rule_matches_z = joblib.load(os.path.join(processed_data_dir, "test_rule_matches_z.lib"))

In [4]:
df_train.head()

| Unnamed: 0.1 | Unnamed: 0 | sample |
|---|---|---|
| 0 | 1979 | This film mildly entertaining neglects acknowl... |
| 1 | 25454 | I originally saw premiere UK. I mesmerised it,... |
| 2 | 436 | This excellent film characters adult swimming ... |
| 3 | 3313 | I really like Traci Lords. She greatest actres... |
| 4 | 6626 | I picked DVD 1 discount, having idea it's (bu... |


In [5]:
print(f"Train Z dimension: {train_rule_matches_z.shape}")
print(f"Train avg. matches per sample: {train_rule_matches_z.sum() / train_rule_matches_z.shape[0]}")

Train Z dimension: (40000, 6786)
Train avg. matches per sample: 33.965325
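
Besides the average, it can be useful to know how many samples are matched by at least one rule, because the majority trainer used later only trains on those. The following sketch computes both numbers on a toy matrix; it assumes Z is a dense NumPy array, which may not hold for the downloaded files.

```python
import numpy as np

# Toy rule match matrix: 4 samples (rows) x 3 rules (columns)
z = np.array([
    [1, 0, 1],
    [0, 0, 0],
    [0, 1, 0],
    [1, 1, 1],
])

matches_per_sample = z.sum(axis=1)          # number of rule hits for each sample
avg_matches = z.sum() / z.shape[0]          # same formula as in the cell above
coverage = (matches_per_sample > 0).mean()  # share of samples with at least one match

print(f"avg. matches per sample: {avg_matches}, coverage: {coverage}")
```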


## Preprocess data to TF-IDF values

In [6]:
from typing import List, Optional, Union

import numpy as np
import scipy.sparse as sp
import torch
from torch.utils.data import TensorDataset

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from sklearn.feature_extraction.text import TfidfVectorizer


def remove_stop_words(text: str) -> str:
    """Drop English stop words from a whitespace-tokenized string."""
    return " ".join(word for word in text.split() if word not in ENGLISH_STOP_WORDS)


def np_to_tensor_dataset(x: Union[np.ndarray, sp.csr_matrix]) -> TensorDataset:
    """Wrap a NumPy array or a SciPy CSR matrix in a TensorDataset."""
    if isinstance(x, sp.csr_matrix):
        x = x.toarray()  # densify so that torch can consume it
    return TensorDataset(torch.from_numpy(x))


def create_tfidf_dataset(
        text_data: List[str], force_create_new: bool = False, max_features: Optional[int] = None
) -> TensorDataset:
    """Takes a list of strings, e.g. sentences, and transforms them into a simple TF-IDF representation.

    Note: `force_create_new` is currently unused, and a new vectorizer is fit on every call.
    """
    text_data = [remove_stop_words(t) for t in text_data]
    vectorizer = TfidfVectorizer(min_df=2, max_features=max_features)
    transformed_data = vectorizer.fit_transform(text_data)
    return np_to_tensor_dataset(transformed_data)



In [7]:
X_train = create_tfidf_dataset(df_train["sample"].tolist(), max_features=5000)
X_dev = create_tfidf_dataset(df_dev["sample"].tolist(), max_features=5000)
X_test = create_tfidf_dataset(df_test["sample"].tolist(), max_features=5000)

y_dev = np_to_tensor_dataset(df_dev['label'].values)
y_test = np_to_tensor_dataset(df_test['label'].values)
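
Note that `create_tfidf_dataset` fits a new `TfidfVectorizer` on every call, so the feature columns of `X_train`, `X_dev` and `X_test` are not guaranteed to refer to the same words, which may hurt evaluation. A minimal sketch of the usual alternative, fitting the vectorizer once on the training texts and reusing it for the other splits, is shown below. The toy texts and variable names are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for the train and dev texts
train_texts = [
    "a great and moving film",
    "a dull and boring film",
    "great acting, great plot",
]
dev_texts = ["boring plot", "a moving film with great acting"]

vectorizer = TfidfVectorizer(min_df=1)
x_train = vectorizer.fit_transform(train_texts)  # learns vocabulary and IDF weights
x_dev = vectorizer.transform(dev_texts)          # reuses them, so columns line up

print(x_train.shape, x_dev.shape)
```

With this approach, column `j` stands for the same word in every split, and words unseen during fitting are ignored at transform time.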

## Training and evaluation

The following code shows the usage of the majority trainer. It is a simple baseline, using the following steps:
1. Restrict data to samples where at least one rule matches
2. Use majority vote for cases where multiple rules match. If there's no clear winner, randomly choose between labels.
3. Train on majority vote
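
The first two steps above can be sketched in NumPy as follows. This is an illustration under the assumption that Z and T are dense arrays. It is not Knodle's actual implementation, and `majority_vote_labels` is a hypothetical helper name.

```python
import numpy as np


def majority_vote_labels(rule_matches_z, mapping_rules_labels_t, seed=0):
    """Hypothetical helper: derive one label per sample that has at least one rule match."""
    rng = np.random.default_rng(seed)
    # votes[i, c] = number of rules matching sample i that point to class c
    votes = rule_matches_z @ mapping_rules_labels_t
    # Step 1: keep only samples where at least one rule matched
    keep = votes.sum(axis=1) > 0
    votes = votes[keep]
    # Step 2: majority vote, breaking ties randomly among the top classes
    is_top = votes == votes.max(axis=1, keepdims=True)
    labels = np.array([rng.choice(np.flatnonzero(row)) for row in is_top])
    return keep, labels


# Toy example: 4 samples, 3 rules, 2 classes
z = np.array([[1, 0, 0], [0, 1, 1], [0, 0, 0], [1, 1, 0]])
t = np.array([[1, 0], [0, 1], [0, 1]])
keep, labels = majority_vote_labels(z, t)
print(keep, labels)
```

The third sample has no matching rule and is dropped, while the fourth sample receives one vote per class, so its label is drawn at random. The model is then trained on the kept samples and their labels.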

In [21]:
from knodle.model.logistic_regression_model import LogisticRegressionModel
from knodle.trainer.auto_trainer import AutoTrainer


model = LogisticRegressionModel(X_train.tensors[0].shape[1], 2)

maj_trainer = AutoTrainer(
    name="majority",
    model=model,
    mapping_rules_labels_t=mapping_rules_labels_t,
    model_input_x=X_train,
    rule_matches_z=train_rule_matches_z,
    dev_model_input_x=X_dev,
    dev_gold_labels_y=y_dev
)

maj_trainer.train()

2021-04-01 11:49:31,925 knodle.trainer.trainer INFO     Training starts
2021-04-01 11:49:31,927 knodle.trainer.trainer INFO     Epoch: 0


2021-04-01 11:49:32,075 knodle.trainer.trainer INFO     Train loss: 0.691, Train accuracy: 0.763
2021-04-01 11:49:32,205 knodle.trainer.trainer INFO     Train loss: 0.690, Train accuracy: 0.762
2021-04-01 11:49:32,333 knodle.trainer.trainer INFO     Train loss: 0.689, Train accuracy: 0.761
2021-04-01 11:49:32,461 knodle.trainer.trainer INFO     Train loss: 0.689, Train accuracy: 0.763
2021-04-01 11:49:32,588 knodle.trainer.trainer INFO     Train loss: 0.688, Train accuracy: 0.763
2021-04-01 11:49:32,712 knodle.trainer.trainer INFO     Train loss: 0.687, Train accuracy: 0.762
2021-04-01 11:49:32,836 knodle.trainer.trainer INFO     Train loss: 0.687, Train accuracy: 0.762
2021-04-01 11:49:32,956 knodle.trainer.trainer INFO     Train loss: 0.686, Train accuracy: 0.761
2021-04-01 11:49:33,082 knodle.trainer.trainer INFO     Train loss: 0.686, Train accuracy: 0.761
2021-04-01 11:49:33,184 knodle.trainer.trainer INFO     Epoch train loss: 0.6853720800132509
2021-04-01 11:49:33,185 knodle.tra

2021-04-01 11:49:33,339 knodle.trainer.trainer INFO     Epoch development accuracy: 0.4962
2021-04-01 11:49:33,339 knodle.trainer.trainer INFO     Epoch: 1


2021-04-01 11:49:33,480 knodle.trainer.trainer INFO     Train loss: 0.680, Train accuracy: 0.755
2021-04-01 11:49:33,598 knodle.trainer.trainer INFO     Train loss: 0.680, Train accuracy: 0.760
2021-04-01 11:49:33,718 knodle.trainer.trainer INFO     Train loss: 0.680, Train accuracy: 0.759
2021-04-01 11:49:33,835 knodle.trainer.trainer INFO     Train loss: 0.679, Train accuracy: 0.761
2021-04-01 11:49:33,950 knodle.trainer.trainer INFO     Train loss: 0.679, Train accuracy: 0.761
2021-04-01 11:49:34,068 knodle.trainer.trainer INFO     Train loss: 0.679, Train accuracy: 0.760
2021-04-01 11:49:34,183 knodle.trainer.trainer INFO     Train loss: 0.679, Train accuracy: 0.760
2021-04-01 11:49:34,303 knodle.trainer.trainer INFO     Train loss: 0.678, Train accuracy: 0.762
2021-04-01 11:49:34,420 knodle.trainer.trainer INFO     Train loss: 0.678, Train accuracy: 0.762
2021-04-01 11:49:34,512 knodle.trainer.trainer INFO     Epoch train loss: 0.6780885476974925
2021-04-01 11:49:34,513 knodle.tra

2021-04-01 11:49:34,661 knodle.trainer.trainer INFO     Epoch development accuracy: 0.4962
2021-04-01 11:49:34,661 knodle.trainer.trainer INFO     Epoch: 2


2021-04-01 11:49:34,797 knodle.trainer.trainer INFO     Train loss: 0.677, Train accuracy: 0.757
2021-04-01 11:49:34,916 knodle.trainer.trainer INFO     Train loss: 0.676, Train accuracy: 0.761
2021-04-01 11:49:35,035 knodle.trainer.trainer INFO     Train loss: 0.676, Train accuracy: 0.758
2021-04-01 11:49:35,149 knodle.trainer.trainer INFO     Train loss: 0.676, Train accuracy: 0.760
2021-04-01 11:49:35,266 knodle.trainer.trainer INFO     Train loss: 0.676, Train accuracy: 0.761
2021-04-01 11:49:35,379 knodle.trainer.trainer INFO     Train loss: 0.675, Train accuracy: 0.761
2021-04-01 11:49:35,498 knodle.trainer.trainer INFO     Train loss: 0.675, Train accuracy: 0.762
2021-04-01 11:49:35,614 knodle.trainer.trainer INFO     Train loss: 0.675, Train accuracy: 0.762
2021-04-01 11:49:35,732 knodle.trainer.trainer INFO     Train loss: 0.675, Train accuracy: 0.762
2021-04-01 11:49:35,826 knodle.trainer.trainer INFO     Epoch train loss: 0.6750082468530935
2021-04-01 11:49:35,827 knodle.tra

2021-04-01 11:49:35,973 knodle.trainer.trainer INFO     Epoch development accuracy: 0.4962
2021-04-01 11:49:35,974 knodle.trainer.trainer INFO     Training done


In [22]:
eval_dict, _ = maj_trainer.test(X_test, y_test)
print(f"Accuracy: {eval_dict.get('accuracy')}")

Accuracy: 0.5024


## Further reading

We encourage you to head over to our repository
[knodle-experiments](https://github.com/knodle/knodle-experiments),
which adds a new layer of abstraction on top of Knodle and allows you to easily create full benchmarking setups.