# Auto Trainer Tutorial

This tutorial shows a minimal example of how to use Knodle out of the box. It contains
the following steps:
1. Download the data from the Knodle server and load it into memory.
2. Initialize the model and data for a TF-IDF and logistic regression example.
3. Use AutoTrainer to apply 2 weak training strategies.

All the steps are discussed in more detail below.

### Download data
 
We will use a preprocessed version of the IMDb dataset. In https://github.com/knodle/knodle/examples, you can find a tutorial showing how the data was preprocessed and transformed into the Knodle format. Instead of downloading the data in the cells below, you can also follow that tutorial to create the data yourself.

In [1]:
%load_ext autoreload
%autoreload 2

import os

imdb_data_dir = os.path.join(os.getcwd(), "data", "imdb")
processed_data_dir = os.path.join(imdb_data_dir, "processed")
os.makedirs(processed_data_dir, exist_ok=True)


In [12]:
from minio import Minio
from tqdm.auto import tqdm

client = Minio("knodle.dm.univie.ac.at", secure=False)
files = [
    "df_train.csv", "df_dev.csv", "df_test.csv",
    "train_rule_matches_z.lib", "dev_rule_matches_z.lib", "test_rule_matches_z.lib",
    "mapping_rules_labels_t.lib"
]

for file in tqdm(files):
    client.fget_object(
        bucket_name="knodle",
        object_name=f"datasets/imdb/processed/{file}",  # object keys always use "/", regardless of OS
        file_path=os.path.join(processed_data_dir, file),
    )
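
To avoid downloading everything again on each run, one option is to fetch only the files that are not on disk yet. The helper below is a hypothetical addition (`missing_files` is not part of Knodle or MinIO); it only inspects the local directory and does not touch the network.

```python
import os
from typing import List


def missing_files(directory: str, file_names: List[str]) -> List[str]:
    """Return the names from file_names that do not yet exist in directory."""
    return [
        name for name in file_names
        if not os.path.isfile(os.path.join(directory, name))
    ]
```

The download loop above could then iterate over `missing_files(processed_data_dir, files)` instead of `files`.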

### Data description

We have three splits: train, development and test. For each split, there is a DataFrame holding the text and a Z matrix relating instances (rows of the DataFrame) to rules. Again, for more information, we refer to the tutorial on creating the dataset at https://github.com/knodle/knodle/example.

In [13]:
import joblib
import pandas as pd

df_train = pd.read_csv(os.path.join(processed_data_dir, "df_train.csv"))
df_dev = pd.read_csv(os.path.join(processed_data_dir, "df_dev.csv"))
df_test = pd.read_csv(os.path.join(processed_data_dir, "df_test.csv"))

mapping_rules_labels_t = joblib.load(os.path.join(processed_data_dir, "mapping_rules_labels_t.lib"))

train_rule_matches_z = joblib.load(os.path.join(processed_data_dir, "train_rule_matches_z.lib"))
dev_rule_matches_z = joblib.load(os.path.join(processed_data_dir, "dev_rule_matches_z.lib"))
test_rule_matches_z = joblib.load(os.path.join(processed_data_dir, "test_rule_matches_z.lib"))

In [14]:
df_train.head()

| Unnamed: 0.1 | Unnamed: 0 | sample |
|---|---|---|
| 0 | 1979 | This film mildly entertaining neglects acknowl... |
| 1 | 25454 | I originally saw premiere UK. I mesmerised it,... |
| 2 | 436 | This excellent film characters adult swimming ... |
| 3 | 3313 | I really like Traci Lords. She greatest actres... |
| 4 | 6626 | I picked DVD 1 discount, having idea it's (bu... |


In [6]:
print(f"Train Z dimension: {train_rule_matches_z.shape}")
print(f"Train avg. matches per sample: {train_rule_matches_z.sum() / train_rule_matches_z.shape[0]}")

Train Z dimension: (40000, 6786)
Train avg. matches per sample: 33.965325
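
To make these numbers concrete, here is a toy version of the matrix format. It assumes that Z has shape (samples x rules) with a 1 wherever a rule matches a sample, and that T has shape (rules x classes) and maps each rule to one label; the values themselves are invented for illustration.

```python
import numpy as np

# Toy Z: 4 samples x 3 rules; z[i, j] = 1 if rule j matches sample i.
z = np.array([
    [1, 0, 1],
    [0, 0, 0],
    [0, 1, 0],
    [1, 1, 1],
])

# Toy T: 3 rules x 2 classes; here each rule votes for exactly one class.
t = np.array([
    [1, 0],
    [0, 1],
    [1, 0],
])

# Same statistic as printed above: average number of matching rules per sample.
avg_matches = z.sum() / z.shape[0]

# z @ t counts, per sample, how many matching rules vote for each class.
votes = z @ t
```

In this toy case the second sample is matched by no rule, so its row in `votes` is all zeros.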


### Preprocess data to TF-IDF values


This simple tutorial shows how to use an AutoTrainer. It requires two inputs:
1. A model, in our case logistic regression.
2. Data in the X, Z, T format. See the README at https://github.com/knodle/ to understand our data format.

In our case, X takes the form of one vector per sample. We use TF-IDF values; for more information on TF-IDF, see https://en.wikipedia.org/wiki/Tf%E2%80%93idf. At a high level, TF-IDF reflects how important a word is to a document, relative to a corpus.

Lastly, we use simple routines from Scikit-Learn to compute the vectors and wrap them in a TensorDataset.
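
As a rough illustration of what TF-IDF computes, the following pure-Python sketch weights each word by its count in the document times a smoothed inverse document frequency, then scales each document vector to unit length. It uses the variant idf = ln((1 + n) / (1 + df)) + 1 with L2 normalization, which corresponds to the default settings of `TfidfVectorizer`. Tokenization is simplified to pre-split word lists, and the function name `tfidf` and the example documents are made up for this sketch.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Compute L2-normalized TF-IDF weights for a list of tokenized documents."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each word occur?
    df = Counter(word for doc in docs for word in set(doc))
    # Smoothed inverse document frequency: rarer words get larger weights.
    idf = {w: math.log((1 + n_docs) / (1 + c)) + 1 for w, c in df.items()}

    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        weights = {w: tf[w] * idf[w] for w in tf}
        # Guard against empty documents, whose norm would be zero.
        norm = math.sqrt(sum(v * v for v in weights.values())) or 1.0
        vectors.append({w: v / norm for w, v in weights.items()})
    return vectors


docs = [["great", "film"], ["boring", "film"], ["great", "great", "cast"]]
vectors = tfidf(docs)
```

In the second document, "boring" occurs in fewer documents than "film" and therefore receives the larger weight.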

In [7]:
from typing import List, Optional, Union

import numpy as np
import scipy.sparse as sp
import torch
from torch.utils.data import TensorDataset

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from sklearn.feature_extraction.text import TfidfVectorizer


def remove_stop_words(text: str) -> str:
    """Drop English stop words from a whitespace-tokenized text."""
    return " ".join(word for word in text.split() if word not in ENGLISH_STOP_WORDS)


def np_array_to_tensor_dataset(x: Union[np.ndarray, sp.csr_matrix]) -> TensorDataset:
    """Wrap a dense array (or a sparse matrix, densified first) in a TensorDataset."""
    if isinstance(x, sp.csr_matrix):
        x = x.toarray()
    return TensorDataset(torch.from_numpy(x))


def create_tfidf_dataset(
        text_data: List[str], force_create_new: bool = False, max_features: Optional[int] = None
) -> TensorDataset:
    """Takes a list of strings, e.g. sentences, and transforms them into a simple TF-IDF representation.

    Note: `force_create_new` is currently unused.
    """
    text_data = [remove_stop_words(t) for t in text_data]
    # A new vectorizer is fit on every call, i.e. on whatever texts are passed in.
    vectorizer = TfidfVectorizer(min_df=2, max_features=max_features)
    transformed_data = vectorizer.fit_transform(text_data)
    return np_array_to_tensor_dataset(transformed_data)



In [9]:
X_train = create_tfidf_dataset(df_train["sample"].tolist(), max_features=5000)
X_dev = create_tfidf_dataset(df_dev["sample"].tolist(), max_features=5000)
X_test = create_tfidf_dataset(df_test["sample"].tolist(), max_features=5000)

y_dev = np_array_to_tensor_dataset(df_dev['label'].values)
y_test = np_array_to_tensor_dataset(df_test['label'].values)
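
Note that `create_tfidf_dataset` fits a new `TfidfVectorizer` on every call, so the columns of `X_train`, `X_dev` and `X_test` come from three separately built vocabularies and need not refer to the same words. A common alternative is to fit the vectorizer once on the training texts and reuse it for the other splits. The sketch below shows that pattern on made-up sentences; `fit_tfidf` and `apply_tfidf` are hypothetical helpers, not part of Knodle.

```python
from typing import List, Optional, Tuple

import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfVectorizer


def fit_tfidf(
        train_texts: List[str], max_features: Optional[int] = None
) -> Tuple[TfidfVectorizer, sp.spmatrix]:
    """Fit the vocabulary and idf weights on the training texts only."""
    vectorizer = TfidfVectorizer(max_features=max_features)
    return vectorizer, vectorizer.fit_transform(train_texts)


def apply_tfidf(vectorizer: TfidfVectorizer, texts: List[str]) -> sp.spmatrix:
    """Project other texts into the feature space learned on the training texts."""
    return vectorizer.transform(texts)


train_texts = ["a great film", "a boring film", "great cast"]
dev_texts = ["boring cast", "an unseen word"]

vectorizer, x_train_toy = fit_tfidf(train_texts)
x_dev_toy = apply_tfidf(vectorizer, dev_texts)
```

Both matrices now have one column per word of the training vocabulary, and words that never occurred in the training texts are simply ignored. The outputs shown in this tutorial were produced with the per-split vectorizers from the cell above.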

## Training and evaluation

The following code shows the usage of the majority trainer. It is a simple baseline, using the following steps:
1. Restrict data to samples where at least one rule matches
2. Use majority vote for cases where multiple rules match. If there's no clear winner, randomly choose between labels.
3. Train on majority vote
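
The first two steps can be sketched in a few lines of NumPy. This is an illustration of the idea rather than Knodle's implementation; it assumes dense arrays, with Z of shape (samples x rules) and T of shape (rules x classes), and the function name `majority_vote` is made up.

```python
import numpy as np


def majority_vote(z: np.ndarray, t: np.ndarray, seed: int = 0):
    """Return (kept_row_indices, labels) using majority vote with random tie-breaking."""
    rng = np.random.default_rng(seed)
    votes = z @ t  # (samples x classes): votes per class for each sample

    # Step 1: keep only samples where at least one rule matches.
    kept = np.flatnonzero(votes.sum(axis=1) > 0)

    # Step 2: pick the class with the most votes; break ties at random.
    labels = []
    for row in votes[kept]:
        winners = np.flatnonzero(row == row.max())
        labels.append(int(rng.choice(winners)))
    return kept, np.array(labels, dtype=int)


z = np.array([[1, 0, 1], [0, 0, 0], [0, 1, 0], [1, 1, 0]])
t = np.array([[1, 0], [0, 1], [1, 0]])
kept, labels = majority_vote(z, t)
```

Here the second sample is dropped because no rule matches it, and the last sample receives one vote per class, so its label is drawn at random. Step 3 then trains the model on the kept samples with these labels.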

In [10]:
from knodle.model.logistic_regression_model import LogisticRegressionModel
from knodle.trainer.auto_trainer import AutoTrainer


model = LogisticRegressionModel(X_train.tensors[0].shape[1], 2)

maj_trainer = AutoTrainer(
    name="majority",
    model=model,
    mapping_rules_labels_t=mapping_rules_labels_t,
    model_input_x=X_train,
    rule_matches_z=train_rule_matches_z,
    dev_model_input_x=X_dev,
    dev_gold_labels_y=y_dev
)

maj_trainer.train()

2021-04-20 10:08:30,938 knodle.trainer.trainer INFO     Train loss: 0.689, Train accuracy: 0.742
2021-04-20 10:08:31,178 knodle.trainer.trainer INFO     Train loss: 0.686, Train accuracy: 0.751
2021-04-20 10:08:31,411 knodle.trainer.trainer INFO     Train loss: 0.684, Train accuracy: 0.755
2021-04-20 10:08:31,630 knodle.trainer.trainer INFO     Train loss: 0.682, Train accuracy: 0.757
2021-04-20 10:08:31,852 knodle.trainer.trainer INFO     Train loss: 0.681, Train accuracy: 0.758
2021-04-20 10:08:32,073 knodle.trainer.trainer INFO     Train loss: 0.681, Train accuracy: 0.758
2021-04-20 10:08:32,289 knodle.trainer.trainer INFO     Train loss: 0.680, Train accuracy: 0.759
2021-04-20 10:08:32,514 knodle.trainer.trainer INFO     Train loss: 0.679, Train accuracy: 0.759
2021-04-20 10:08:32,731 knodle.trainer.trainer INFO     Train loss: 0.678, Train accuracy: 0.758
2021-04-20 10:08:32,954 knodle.trainer.trainer INFO     Train loss: 0.678, Train accuracy: 0.759
2021-04-20 10:08:33,161 knodle.trainer.trainer INFO     Epoch development accuracy: 0.4962
2021-04-20 10:08:33,162 knodle.trainer.trainer INFO     Epoch: 1


2021-04-20 10:08:33,405 knodle.trainer.trainer INFO     Train loss: 0.672, Train accuracy: 0.756
2021-04-20 10:08:33,624 knodle.trainer.trainer INFO     Train loss: 0.671, Train accuracy: 0.760
2021-04-20 10:08:33,834 knodle.trainer.trainer INFO     Train loss: 0.671, Train accuracy: 0.760
2021-04-20 10:08:34,042 knodle.trainer.trainer INFO     Train loss: 0.671, Train accuracy: 0.761
2021-04-20 10:08:34,252 knodle.trainer.trainer INFO     Train loss: 0.671, Train accuracy: 0.762
2021-04-20 10:08:34,457 knodle.trainer.trainer INFO     Train loss: 0.670, Train accuracy: 0.761
2021-04-20 10:08:34,668 knodle.trainer.trainer INFO     Train loss: 0.670, Train accuracy: 0.762
2021-04-20 10:08:34,879 knodle.trainer.trainer INFO     Train loss: 0.670, Train accuracy: 0.763
2021-04-20 10:08:35,108 knodle.trainer.trainer INFO     Train loss: 0.670, Train accuracy: 0.764
2021-04-20 10:08:35,319 knodle.trainer.trainer INFO     Train loss: 0.669, Train accuracy: 0.766
2021-04-20 10:08:35,509 knodle.trainer.trainer INFO     Epoch development accuracy: 0.496
2021-04-20 10:08:35,510 knodle.trainer.trainer INFO     Epoch: 2


2021-04-20 10:08:35,746 knodle.trainer.trainer INFO     Train loss: 0.667, Train accuracy: 0.770
2021-04-20 10:08:35,955 knodle.trainer.trainer INFO     Train loss: 0.667, Train accuracy: 0.774
2021-04-20 10:08:36,162 knodle.trainer.trainer INFO     Train loss: 0.666, Train accuracy: 0.776
2021-04-20 10:08:36,376 knodle.trainer.trainer INFO     Train loss: 0.666, Train accuracy: 0.777
2021-04-20 10:08:36,586 knodle.trainer.trainer INFO     Train loss: 0.665, Train accuracy: 0.781
2021-04-20 10:08:36,796 knodle.trainer.trainer INFO     Train loss: 0.665, Train accuracy: 0.782
2021-04-20 10:08:37,009 knodle.trainer.trainer INFO     Train loss: 0.665, Train accuracy: 0.784
2021-04-20 10:08:37,224 knodle.trainer.trainer INFO     Train loss: 0.665, Train accuracy: 0.787
2021-04-20 10:08:37,438 knodle.trainer.trainer INFO     Train loss: 0.665, Train accuracy: 0.788
2021-04-20 10:08:37,648 knodle.trainer.trainer INFO     Train loss: 0.664, Train accuracy: 0.790
2021-04-20 10:08:37,835 knodle.trainer.trainer INFO     Epoch development accuracy: 0.4954
2021-04-20 10:08:37,836 knodle.trainer.trainer INFO     Training done


In [11]:
eval_dict, _ = maj_trainer.test(X_test, y_test)
print(f"Accuracy: {eval_dict.get('accuracy')}")

Accuracy: 0.5026


## Further reading

1. Go to the examples/ folder to see further tutorials.
2. The next tutorial in order is Crossweigh/.
3. Finally, we encourage you to head over to our repository [knodle-experiments](https://github.com/knodle/knodle-experiments), which adds a new layer of abstraction on top of Knodle, allowing you to easily create full benchmarking setups.