In [1]:
%load_ext autoreload
%autoreload 2

import sys
import os
import joblib
import pandas as pd
import numpy as np

sys.path.append('../../')


# Majority vote + DistilBERT tutorial

This tutorial shows how to use a model from Hugging Face's transformers library within our framework. To do so, we use the data from our IMDb data creation tutorial.
We make use of
- Majority voting
- DistilBERT classification




## Read data

We begin by loading the data.
- Make sure you ran the IMDb movie data tutorial before you start. Alternatively, you can download the data with the following command.
- Afterwards, we perform a train-test split. Note that we use only a fraction of the data for demonstration purposes. If you want, you can increase the number of samples.

In [2]:
from tutorials.baseline.baseline_training_example import read_evaluation_data


imdb_dataset, rule_matches_z, mapping_rules_labels_t = read_evaluation_data()

review_series = imdb_dataset.reviews_preprocessed
label_ids = imdb_dataset.label_id

2021-01-17 00:17:38,695 root         INFO     Initalized logger


In [3]:
from sklearn.model_selection import train_test_split

train_data, test_data = train_test_split(
    imdb_dataset[0:100], test_size=0.2, random_state=0
)

## Preprocess data

We now preprocess the data into a format our trainer can work with. The steps are:
- Load the DistilBERT tokenizer and tokenize each movie review
- Build the matrices X, Z and T needed for training
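
To make the roles of Z and T more concrete, here is a minimal sketch of how majority voting could turn them into labels. It assumes that Z has shape (instances, rules) and marks which rules match each review, and that T has shape (rules, classes) and maps each rule to its label. The function name `majority_vote_probs`, the toy matrices and the uniform fallback for reviews without any matching rule are illustrative assumptions, not the trainer's actual implementation.

```python
import numpy as np


def majority_vote_probs(rule_matches_z, mapping_rules_labels_t):
    """Turn rule matches into per-class vote shares (illustrative sketch).

    Assumed shapes: Z is (n_instances, n_rules), T is (n_rules, n_classes).
    """
    # Each matching rule casts one vote for the class it maps to.
    rule_counts = rule_matches_z @ mapping_rules_labels_t
    totals = rule_counts.sum(axis=1, keepdims=True)

    # Assumption: instances without any matching rule get a uniform
    # distribution instead of a division by zero.
    n_classes = mapping_rules_labels_t.shape[1]
    probs = np.full(rule_counts.shape, 1.0 / n_classes, dtype=float)
    np.divide(rule_counts, totals, out=probs, where=totals > 0)
    return probs


# Toy example: 4 reviews, 3 rules, 2 classes (hypothetical values).
toy_z = np.array([[1, 1, 0],
                  [0, 0, 1],
                  [0, 0, 0],
                  [1, 0, 1]])
toy_t = np.array([[1, 0],
                  [1, 0],
                  [0, 1]])

toy_probs = majority_vote_probs(toy_z, toy_t)
toy_labels = toy_probs.argmax(axis=1)  # ties resolve to the lowest class id
```

The vote shares can be used directly as soft training targets, or collapsed to hard labels with `argmax` as in the last line.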


In [4]:
import torch
from torch.utils.data import TensorDataset
from transformers import DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

train_encoding = tokenizer(train_data.reviews_preprocessed.tolist(), return_tensors='pt', padding=True, truncation=True)
train_input_ids = train_encoding['input_ids']
train_attention_mask = train_encoding['attention_mask']

train_x = TensorDataset(train_input_ids, train_attention_mask)
train_y = TensorDataset(torch.from_numpy(train_data.label_id.values))
train_rule_matches_z = rule_matches_z[train_data.index]

  return torch._C._cuda_getDeviceCount() > 0
2021-01-17 00:17:41,260 urllib3.connectionpool DEBUG    Starting new HTTPS connection (1): huggingface.co:443
2021-01-17 00:17:41,604 urllib3.connectionpool DEBUG    https://huggingface.co:443 "HEAD /bert-base-uncased/resolve/main/vocab.txt HTTP/1.1" 200 0


In [5]:
test_encoding = tokenizer(test_data.reviews_preprocessed.tolist(), return_tensors='pt', padding=True, truncation=True)
test_input_ids = test_encoding['input_ids']
test_attention_mask = test_encoding['attention_mask']

test_x = TensorDataset(test_input_ids, test_attention_mask)
test_y = TensorDataset(torch.from_numpy(test_data.label_id.values))

## Load Model 

Once data preparation is finished, we can set up the machine learning components: the model, the training configuration and the trainer itself. To make it easier to get started with knodle, we follow the same structure as the popular transformers library.
- Model: We use DistilBERT because it performs reasonably well while being a comparatively small Transformer-based model.
- Config: We try to stay close to Hugging Face's configuration https://huggingface.co/transformers/custom_datasets.html
- Trainer: A custom trainer, which can be found in this folder. It resembles the baseline trainer but replaces the logistic regression model with a Transformer-compatible one.

In [6]:
from transformers import DistilBertForSequenceClassification, AdamW
from knodle.trainer.config.trainer_config import TrainerConfig

from tutorials.baseline.bert.majority_bert_trainer import MajorityBertTrainer

model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
model.train()

custom_model_config = TrainerConfig(
    model=model, optimizer_=AdamW(model.parameters(), lr=0.01), batch_size=4
)

trainer = MajorityBertTrainer(
    model,
    mapping_rules_labels_t=mapping_rules_labels_t,
    model_input_x=train_x,
    rule_matches_z=train_rule_matches_z,
    trainer_config=custom_model_config,
)


2021-01-17 00:17:42,286 urllib3.connectionpool DEBUG    Starting new HTTPS connection (1): huggingface.co:443
2021-01-17 00:17:42,607 urllib3.connectionpool DEBUG    https://huggingface.co:443 "HEAD /distilbert-base-uncased/resolve/main/config.json HTTP/1.1" 200 0
2021-01-17 00:17:42,640 urllib3.connectionpool DEBUG    Starting new HTTPS connection (1): huggingface.co:443
2021-01-17 00:17:42,979 urllib3.connectionpool DEBUG    https://huggingface.co:443 "HEAD /distilbert-base-uncased/resolve/main/pytorch_model.bin HTTP/1.1" 302 0
Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializ

As we can see, the model is a standard DistilBERT with an additional classification layer whose binary output
represents the movie sentiment.

In [7]:
model

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
       

## Run training

To run the training procedure, we just need to call the trainer's train() method.

In [8]:
trainer.train()

  rule_counts_probs = rule_counts / rule_counts.sum(axis=1).reshape(-1, 1)
2021-01-17 00:17:46,511 tutorials.baseline.bert.majority_bert_trainer INFO     Training starts


HBox(children=(FloatProgress(value=0.0, max=1.0), HTML(value='')))

2021-01-17 00:17:46,550 tutorials.baseline.bert.majority_bert_trainer INFO     Epoch: 0
2021-01-17 00:20:55,848 tutorials.baseline.bert.majority_bert_trainer INFO     Epoch loss: 3.2868103981018066
2021-01-17 00:20:55,849 tutorials.baseline.bert.majority_bert_trainer INFO     Epoch Accuracy: 0.5625
2021-01-17 00:20:55,857 tutorials.baseline.bert.majority_bert_trainer INFO     Training done





## Run test set

Finally, we can run the test() method. If you want to evaluate on several test datasets,
you can call this method multiple times.

In [9]:
trainer.test(test_features=test_x, test_labels=test_y)

  'precision', 'predicted', average, warn_for)
2021-01-17 00:21:11,885 knodle.trainer.ds_model_trainer.ds_model_trainer INFO     Accuracy is 0.4


{'0': {'precision': 0.4,
  'recall': 1.0,
  'f1-score': 0.5714285714285715,
  'support': 8},
 '1': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 12},
 'accuracy': 0.4,
 'macro avg': {'precision': 0.2,
  'recall': 0.5,
  'f1-score': 0.28571428571428575,
  'support': 20},
 'weighted avg': {'precision': 0.16,
  'recall': 0.4,
  'f1-score': 0.2285714285714286,
  'support': 20}}
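
The numbers in this report can be reproduced by hand. A recall of 1.0 for class 0 together with a recall of 0.0 for class 1 means that all 20 test reviews were predicted as class 0. The sketch below recomputes the metrics from that observation in plain Python. The helper `precision_recall_f1` is a hypothetical name introduced for illustration, and the convention of reporting 0.0 for an undefined precision is an assumption chosen to match the report.

```python
# Labels implied by the report: support of 8 for class 0 and 12 for class 1,
# with every review predicted as class 0.
y_true = [0] * 8 + [1] * 12
y_pred = [0] * 20


def precision_recall_f1(y_true, y_pred, label):
    """Compute precision, recall and F1 for one class (illustrative helper)."""
    true_pos = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)
    actual = sum(t == label for t in y_true)
    # Assumption: undefined ratios are reported as 0.0.
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


prec_0, rec_0, f1_0 = precision_recall_f1(y_true, y_pred, 0)
prec_1, rec_1, f1_1 = precision_recall_f1(y_true, y_pred, 1)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro_f1 = (f1_0 + f1_1) / 2                 # unweighted mean over classes
weighted_f1 = (8 * f1_0 + 12 * f1_1) / 20    # mean weighted by class support
```

Since the model never predicts class 1, the accuracy of 0.4 is simply the share of class 0 in the test split (8 of 20).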

In [None]:
trainer.test(test_features=train_x, test_labels=train_y)