## Opinion Role Labeling

### Using the NER approach
-------

**Requirements**
```
!pip install transformers datasets evaluate seqeval
```

Alternatively, use the following command (with *Anaconda*):
```
!conda install -y -c conda-forge transformers datasets evaluate seqeval
```

**Load a dataset in the following format**
```
[CLS] Das immer wieder mit dem Separatismus liebäugelnde französischsprachige Quebec wollte sich mit dem Verfassungskompromiss , TARGET die neun vorwiegend englischsprachigen HOLDER schliesslich zugestimmt hatten , nicht abfinden . [SEP] 
[CLS] Das immer wieder mit dem Separatismus liebäugelnde französischsprachige Quebec wollte sich mit dem Verfassungskompromiss , dem die neun vorwiegend englischsprachigen Provinzen schliesslich zugestimmt hatten , nicht abfinden . [SEP] 
[CLS] Von den beiden grossen Gewerkschaften hatte die HOLDER ( Unión General del Trabajo ) dieses TARGET unterstützt , während die Comisiones Obreras ( CCOO ) es im Vorfeld abgelehnt hatten , sich jetzt aber mit dem Resultat abfinden . [SEP] 
[CLS] Von den beiden grossen Gewerkschaften hatte die UGT ( Unión General del Trabajo ) dieses Vorhaben unterstützt , während die Comisiones Obreras ( CCOO ) es im Vorfeld abgelehnt hatten , sich jetzt aber mit dem Resultat abfinden . [SEP] 
[CLS] Der HOLDER hat keine TARGET verletzt . » Petkovic lehnt laut italienischen Medien auch ein Angebot ab , mit 30 Prozent der Summe abgefunden zu werden . [SEP] 
[CLS] Der Trainer hat keine Regeln verletzt . » Petkovic lehnt laut italienischen Medien auch ein Angebot ab , mit 30 Prozent der Summe abgefunden zu werden . [SEP] 
```
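
The file alternates between a masked line and its unmasked counterpart. As a minimal sketch (using a shortened, made-up sentence pair and a hypothetical helper `tags_from_pair`), the role of each token can be read off by comparing the two lines token by token:

```python
# Illustrative only: a shortened sentence pair in the format shown above.
masked = "[CLS] Der HOLDER hat keine TARGET verletzt . [SEP]"
real = "[CLS] Der Trainer hat keine Regeln verletzt . [SEP]"

ROLE_LABELS = ["O", "HOLDER", "TARGET", "PEXP"]  # same order as used later on


def tags_from_pair(masked_line, real_line):
    """Return (token, role) pairs; `tags_from_pair` is a hypothetical helper."""
    pairs = []
    for m_tok, r_tok in zip(masked_line.split(" "), real_line.split(" ")):
        # A token carries a role only where the masked line shows a role placeholder.
        role = m_tok if m_tok != r_tok and m_tok in ROLE_LABELS else "O"
        pairs.append((r_tok, role))
    return pairs[1:-1]  # drop [CLS] and [SEP]


for token, role in tags_from_pair(masked, real):
    print(f"{token}\t{role}")
```

The cells below do the same for the full files, but store the index of each role in `NER_labels` instead of its name.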

In [27]:
# CONFIG VARIABLES

# DATASET_PATH="../../etl/data/raw/03_holder_target.txt"

TRAIN_DATASET_PATH="../../etl/data/processed/ORLConverter/01_train_orl.txt"
VAL_DATASET_PATH="../../etl/data/processed/ORLConverter/01_val_orl.txt"
TEST_DATASET_PATH="../../etl/data/processed/ORLConverter/01_test_orl.txt"

GENERATE_DATASET=True

BASE_MODEL_NAME="distilbert-base-german-cased"

TRAINED_MODEL_NAME="./data/trained_model_german_bert"

In [28]:
def read_sents(dataset_path):
    sents = []

    with open(dataset_path, "r", encoding="utf-8") as f:
        for line in f:
            sents.append(line.strip())

    # Der Minister findet die Debatte langweilig.
    real_sents = sents[1::2]

    # Der HOLDER findet die TARGET langweilig.
    masked_sents = sents[0::2]

    assert len(real_sents) == len(masked_sents), "Number of masked and real sentences is not equal!"

    print(f"Total rows in the data: {len(real_sents) + len(masked_sents)}")
    
    return real_sents, masked_sents

train_sents_real, train_sents_masked = read_sents(TRAIN_DATASET_PATH)
val_sents_real, val_sents_masked = read_sents(VAL_DATASET_PATH)
test_sents_real, test_sents_masked = read_sents(TEST_DATASET_PATH)

Total rows in the data: 8382
Total rows in the data: 2406
Total rows in the data: 2390


In [29]:
NER_labels = ["O", "HOLDER", "TARGET", "PEXP"]

test_sent = ["Peter sowie Lucy mag Katzen sowie auch Hunde ."]
test_sent_masked = ["HOLDER sowie HOLDER mag TARGET sowie auch TARGET ."]

def align_sentences(real_sentences, masked_sentences, NER_labels):
    """Given two sentences that have their word-level tokens delimited by a white-space, 
    create NER-tags for each sentence token.
    """
    dataset = []
    
    max_multiple_holders = 0
    max_multiple_targets = 0
    max_multiple_pexps = 0

    for i, (real_sentence, masked_sentence) in enumerate(zip(real_sentences, masked_sentences)):
        ner_tags = []
        real_sentence_split = real_sentence.split(" ")
        masked_sentence_split = masked_sentence.split(" ")
        
        # target / holder counting
        target_cnt = 0
        holder_cnt = 0
        pexp_cnt = 0

        assert len(real_sentence_split) == len(masked_sentence_split), "Misalignment of length of tokens in sentence." \
        + str(real_sentence_split) + str(masked_sentence_split) + str(i)

        for real_token, masked_token in zip(real_sentence_split, masked_sentence_split):
            if real_token == masked_token:
                ner_tags.append(0)
            elif masked_token == "HOLDER":
                holder_cnt += 1
                tag_index = NER_labels.index(masked_token)
                ner_tags.append(tag_index)
            elif masked_token == "TARGET":
                target_cnt += 1
                tag_index = NER_labels.index(masked_token)
                ner_tags.append(tag_index)
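            elif masked_token == "PEXP":
                pexp_cnt += 1
                tag_index = NER_labels.index(masked_token)
                ner_tags.append(tag_index)
            else:
                # Tokens differ but the masked token is not a known role label:
                # tag as "O" so that ner_tags stays aligned with the tokens.
                ner_tags.append(0)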

        dataset.append({
            "id": i,
            # remove the magic tokens from the sentences, 
            # since we will pass the sentences through the tokenizer again.
            "ner_tags": ner_tags[1:-1],
            "tokens": real_sentence_split[1:-1],
        })
        # track the maximum per-sentence counts across the whole dataset
        max_multiple_holders = max(max_multiple_holders, holder_cnt)
        max_multiple_targets = max(max_multiple_targets, target_cnt)
        max_multiple_pexps = max(max_multiple_pexps, pexp_cnt)

    return dataset, max_multiple_holders, max_multiple_targets, max_multiple_pexps

# align for training
train_aligned_sents, mmh, mmt, mmp = align_sentences(train_sents_real, train_sents_masked, NER_labels)
print("Max holders: ", mmh, "Max targets: ", mmt, "Max polar expressions: ", mmp)
val_aligned_sents, mmh, mmt, mmp = align_sentences(val_sents_real, val_sents_masked, NER_labels)
print("Max holders: ", mmh, "Max targets: ", mmt, "Max polar expressions: ", mmp)
test_aligned_sents, mmh, mmt, mmp = align_sentences(test_sents_real, test_sents_masked, NER_labels)
print("Max holders: ", mmh, "Max targets: ", mmt, "Max polar expressions: ", mmp)

# UNCOMMENT FOR TEST
# aligned_sents, mmh, mmt, mmp = align_sentences(test_sent, test_sent_masked, NER_labels)

print(train_aligned_sents[:10])

Max holders:  1 Max targets:  1 Max polar expressions:  1
Max holders:  1 Max targets:  1 Max polar expressions:  1
Max holders:  1 Max targets:  1 Max polar expressions:  1
[{'id': 0, 'ner_tags': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0], 'tokens': ['Die', 'Beförderung', 'Lichtensteins', 'ist', 'auch', 'insofern', 'leicht', 'nachvollziehbar', ',', 'als', 'der', 'ehemalige', 'Investmentbanker', 'das', 'Vertrauen', 'der', 'Vertreter', 'von', 'Chem', 'China', 'nicht', 'erst', 'als', 'Chef', 'von', 'Adama', 'erworben', 'hat', '.']}, {'id': 1, 'ner_tags': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 3, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'tokens': ['In', '«', 'Tiere', 'der', 'Grossstadt', '»', 'tragen', 'surreale', 'Ansichten', 'und', 'Bilderrätsel', 'die', 'Handlung', ',', 'was', 'Zeit', 'lässt', 'für', 'den', 'Blick', 'hinter', 'die', 'Fassaden', ':', 'Eine', 'wunderbar
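
Because the alignment relies on both lines of a pair having identical tokenization, a quick sanity check on the resulting records can catch silent misalignments. A minimal sketch, assuming records shaped like the ones printed above (`find_misaligned` is a hypothetical helper, and the sample records are made up):

```python
def find_misaligned(records, num_labels=4):
    """Return ids of records whose tags and tokens disagree in length,
    or whose tags fall outside the range 0..num_labels-1."""
    bad_ids = []
    for rec in records:
        same_length = len(rec["ner_tags"]) == len(rec["tokens"])
        valid_tags = all(0 <= t < num_labels for t in rec["ner_tags"])
        if not (same_length and valid_tags):
            bad_ids.append(rec["id"])
    return bad_ids


# Made-up records for illustration: the second one is missing a tag.
sample = [
    {"id": 0, "ner_tags": [1, 3, 2, 0], "tokens": ["Peter", "mag", "Katzen", "."]},
    {"id": 1, "ner_tags": [1, 3, 0], "tokens": ["Lucy", "mag", "Hunde", "."]},
]
print(find_misaligned(sample))
```

Running such a check on `train_aligned_sents`, `val_aligned_sents` and `test_aligned_sents` before building the dataset would flag records that could later fail during label alignment.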

In [30]:
# Make a huggingface dataset out of the record-based dataset
from datasets import Dataset, DatasetDict, load_from_disk

if GENERATE_DATASET:
    train_dataset = Dataset.from_list(train_aligned_sents)
    val_dataset = Dataset.from_list(val_aligned_sents)
    test_dataset = Dataset.from_list(test_aligned_sents)

    # randomly shuffle
    train_dataset = train_dataset.shuffle(seed=42)
    val_dataset = val_dataset.shuffle(seed=42)
    test_dataset = test_dataset.shuffle(seed=42)

    # dataset = dataset.train_test_split(test_size=0.2, shuffle=False)
    # train_test_dataset = dataset.train_test_split(test_size=0.2, shuffle=False)

    # Split the 10% test + valid in half test, half valid
    # test_valid = train_test_dataset['test'].train_test_split(test_size=0.4, shuffle=False)

    # gather everyone if you want to have a single DatasetDict
    dataset = DatasetDict({
        'train': train_dataset,
        'valid': val_dataset,
        'test': test_dataset})

    dataset.save_to_disk("./data/split_dataset.hf")
else:
    dataset = load_from_disk("./data/split_dataset.hf")

print(dataset)

DatasetDict({
    train: Dataset({
        features: ['id', 'ner_tags', 'tokens'],
        num_rows: 4191
    })
    valid: Dataset({
        features: ['id', 'ner_tags', 'tokens'],
        num_rows: 1203
    })
    test: Dataset({
        features: ['id', 'ner_tags', 'tokens'],
        num_rows: 1195
    })
})


**Use DistilBERT tokenizer and embeddings**

In [31]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_NAME)

In [32]:
# tokenize input
tokenized_input = tokenizer(dataset["train"][0]["tokens"], truncation=True, is_split_into_words=True)
print(tokenized_input)

# output subwords
print(tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"]))

{'input_ids': [102, 446, 746, 8565, 20614, 125, 4161, 2448, 1021, 386, 842, 1943, 566, 103], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
['[CLS]', 'Auf', 'seiner', 'Flucht', 'verletzte', 'der', 'Deutsche', 'mehrere', 'Menschen', 'zum', 'Teil', 'schwer', '.', '[SEP]']


In [52]:
def tokenize_and_align_labels(examples):

    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)

    labels = []
    
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)  # Map tokens to their respective word.
        previous_word_idx = None
        label_ids = []
        
        try:
            for word_idx in word_ids:  # Set the special tokens to -100.
                if word_idx is None:
                    label_ids.append(-100)
                elif word_idx != previous_word_idx:  # Only label the first token of a given word.
                    label_ids.append(label[word_idx])
                else:
                    label_ids.append(-100)
                previous_word_idx = word_idx
            labels.append(label_ids)
        except Exception as e:
            print(f"Skipping example due to the following: {e}")
            print(" ".join(examples["tokens"][i]))
            print(label)
            continue

    tokenized_inputs["labels"] = labels

    return tokenized_inputs

tokenized_dataset = dataset.map(tokenize_and_align_labels, batched=True)

tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'ner_tags', 'tokens', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 4191
    })
    valid: Dataset({
        features: ['id', 'ner_tags', 'tokens', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1203
    })
    test: Dataset({
        features: ['id', 'ner_tags', 'tokens', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1195
    })
})

In [53]:
# check that the conversion worked.
print(tokenized_dataset["train"][0]["labels"])
print(tokenizer.convert_ids_to_tokens(tokenized_dataset["train"][0]["input_ids"]))

[-100, 0, 0, 0, 3, 0, 1, 0, 2, 0, 0, 0, 0, -100]
['[CLS]', 'Auf', 'seiner', 'Flucht', 'verletzte', 'der', 'Deutsche', 'mehrere', 'Menschen', 'zum', 'Teil', 'schwer', '.', '[SEP]']
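
In the row above, `-100` appears only at `[CLS]` and `[SEP]` because every word of this sentence maps to a single subword. To illustrate what happens when a word is split, here is a small sketch of the same first-subword rule applied to a hand-written `word_ids` list (the word split and the helper name `first_subword_labels` are assumptions for illustration, not tokenizer output):

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def first_subword_labels(word_ids, word_labels):
    """Label the first subword of each word; mask everything else."""
    labels = []
    previous = None
    for word_id in word_ids:
        if word_id is None:            # special token such as [CLS] / [SEP]
            labels.append(IGNORE_INDEX)
        elif word_id != previous:      # first subword of a new word
            labels.append(word_labels[word_id])
        else:                          # continuation subword
            labels.append(IGNORE_INDEX)
        previous = word_id
    return labels


# Hypothetical example: 3 words, the second one split into two subwords.
word_ids = [None, 0, 1, 1, 2, None]
word_labels = [1, 3, 2]  # HOLDER, PEXP, TARGET
print(first_subword_labels(word_ids, word_labels))
```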


**Define a DataCollator (for efficient padding of the tokens)**

In [35]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
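
Conceptually, the collator pads every sequence in a batch to the length of the longest one, filling `input_ids` with the padding id and `labels` with `-100` so that padded positions do not contribute to the loss. A simplified pure-Python sketch of that idea (this is not the library's implementation; `pad_batch` is a hypothetical helper, and the pad id of 0 is taken from the `pad_token_id` in the model config printed further down):

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Pad input_ids, attention_mask and labels to the longest sequence."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        missing = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * missing)
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * missing)
        batch["labels"].append(f["labels"] + [label_pad_id] * missing)
    return batch


# Made-up token ids, for illustration only.
features = [
    {"input_ids": [102, 446, 566, 103], "labels": [-100, 1, 0, -100]},
    {"input_ids": [102, 746, 103], "labels": [-100, 2, -100]},
]
batch = pad_batch(features)
print(batch["input_ids"])
print(batch["labels"])
```

Padding per batch instead of to a global maximum length keeps batches of short sentences small, which is the efficiency gain the heading refers to.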

**Download the model**

In [36]:
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer

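# num_labels is 4, one per entry in NER_labels ("O", "HOLDER", "TARGET", "PEXP");
# -100 is only an ignore index for the loss, not a class of its own.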
model = AutoModelForTokenClassification.from_pretrained(BASE_MODEL_NAME, num_labels=4)

Some weights of the model checkpoint at distilbert-base-german-cased were not used when initializing DistilBertForTokenClassification: ['vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-german-cased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probab
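
The newly initialized classification head has one output per role label. Since no label names are passed, the model falls back to generic names (`LABEL_0` to `LABEL_3`), which is what the saved config and the pipeline outputs further down show. A small sketch of how readable mappings could be derived from `NER_labels`; as far as the `transformers` documentation describes, `from_pretrained` accepts `id2label` and `label2id` keyword arguments that end up in the model config, but treat the usage shown in the comment as an assumption to check against the installed version:

```python
NER_labels = ["O", "HOLDER", "TARGET", "PEXP"]

# Index -> name and name -> index, in the same order as NER_labels.
id2label = dict(enumerate(NER_labels))
label2id = {label: idx for idx, label in id2label.items()}

print(id2label)
print(label2id)

# Possible usage (assumption, not run here):
# model = AutoModelForTokenClassification.from_pretrained(
#     BASE_MODEL_NAME, num_labels=len(NER_labels),
#     id2label=id2label, label2id=label2id)
```

With such a mapping the pipeline would report role names instead of `LABEL_n`, so any evaluation code comparing label strings would need to use the same names.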

**Train the model**

In [38]:
training_args = TrainingArguments(
    output_dir="./data/results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["valid"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()

trainer.save_model(TRAINED_MODEL_NAME)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
The following columns in the training set don't have a corresponding argument in `DistilBertForTokenClassification.forward` and have been ignored: tokens, id, ner_tags. If tokens, id, ner_tags are not expected by `DistilBertForTokenClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 4191
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 786


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | No log        | 0.208197        |
| 2     | 0.166400      | 0.216775        |
| 3     | 0.166400      | 0.2308          |
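
The 786 optimization steps reported in the log follow from the dataset size and the batch size: each epoch needs ⌈4191 / 16⌉ = 262 steps, and three epochs give 786. A quick check, assuming a single device and no gradient accumulation (as the log states):

```python
import math

num_examples = 4191      # "Num examples" in the training log
batch_size = 16          # per-device batch size, single device
num_epochs = 3
grad_accum_steps = 1     # "Gradient Accumulation steps = 1"

steps_per_epoch = math.ceil(num_examples / (batch_size * grad_accum_steps))
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 262 786
```

Assuming the default logging interval of 500 steps, this would also explain the "No log" entry for epoch 1 (it ends at step 262, before the first logged value) and why epochs 2 and 3 show the same training loss (only one value, at step 500, has been logged by then).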


The following columns in the evaluation set don't have a corresponding argument in `DistilBertForTokenClassification.forward` and have been ignored: tokens, id, ner_tags. If tokens, id, ner_tags are not expected by `DistilBertForTokenClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1203
  Batch size = 16
Saving model checkpoint to ./data/results/checkpoint-500
Configuration saved in ./data/results/checkpoint-500/config.json
Model weights saved in ./data/results/checkpoint-500/pytorch_model.bin
tokenizer config file saved in ./data/results/checkpoint-500/tokenizer_config.json
Special tokens file saved in ./data/results/checkpoint-500/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `DistilBertForTokenClassification.forward` and have been ignored: tokens, id, ner_tags. If tokens, id, ner_tags are not expected by `DistilBertForTokenClassification.forward`,  you can safely

In [39]:
# In case no model was loaded up until now.
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_path = TRAINED_MODEL_NAME

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForTokenClassification.from_pretrained(model_path)

loading file vocab.txt
loading file tokenizer.json
loading file added_tokens.json
loading file special_tokens_map.json
loading file tokenizer_config.json
loading configuration file ./data/trained_model_german_bert/config.json
Model config DistilBertConfig {
  "_name_or_path": "./data/trained_model_german_bert",
  "activation": "gelu",
  "architectures": [
    "DistilBertForTokenClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3"
  },
  "initializer_range": 0.02,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": true,
  "tie_weights_": true,
  "torch_dtype": "float32",
  "transformers_ve

In [54]:
from transformers import pipeline

pipe = pipeline(task="token-classification",
                # model=trainer.model, -- in case freshly trained
                model=model,
                tokenizer=tokenizer)

pipe("Er sagt, dass der Präsident dem Volk etwas vorgemacht hat.")

[{'entity': 'LABEL_0',
  'score': 0.99969065,
  'index': 1,
  'word': 'Er',
  'start': 0,
  'end': 2},
 {'entity': 'LABEL_0',
  'score': 0.99962413,
  'index': 2,
  'word': 'sagt',
  'start': 3,
  'end': 7},
 {'entity': 'LABEL_0',
  'score': 0.99936646,
  'index': 3,
  'word': ',',
  'start': 7,
  'end': 8},
 {'entity': 'LABEL_0',
  'score': 0.9995933,
  'index': 4,
  'word': 'dass',
  'start': 9,
  'end': 13},
 {'entity': 'LABEL_0',
  'score': 0.99962044,
  'index': 5,
  'word': 'der',
  'start': 14,
  'end': 17},
 {'entity': 'LABEL_1',
  'score': 0.99183625,
  'index': 6,
  'word': 'Präsident',
  'start': 18,
  'end': 27},
 {'entity': 'LABEL_0',
  'score': 0.9996166,
  'index': 7,
  'word': 'dem',
  'start': 28,
  'end': 31},
 {'entity': 'LABEL_2',
  'score': 0.9307748,
  'index': 8,
  'word': 'Volk',
  'start': 32,
  'end': 36},
 {'entity': 'LABEL_0',
  'score': 0.9917257,
  'index': 9,
  'word': 'etwas',
  'start': 37,
  'end': 42},
 {'entity': 'LABEL_3',
  'score': 0.9075023,
  'i

In [55]:
pipe("Peter hat etwas gegen Fritz!")

[{'entity': 'LABEL_1',
  'score': 0.94375,
  'index': 1,
  'word': 'Peter',
  'start': 0,
  'end': 5},
 {'entity': 'LABEL_0',
  'score': 0.967745,
  'index': 2,
  'word': 'hat',
  'start': 6,
  'end': 9},
 {'entity': 'LABEL_0',
  'score': 0.91697866,
  'index': 3,
  'word': 'etwas',
  'start': 10,
  'end': 15},
 {'entity': 'LABEL_0',
  'score': 0.85223156,
  'index': 4,
  'word': 'gegen',
  'start': 16,
  'end': 21},
 {'entity': 'LABEL_2',
  'score': 0.50546294,
  'index': 5,
  'word': 'Fritz',
  'start': 22,
  'end': 27},
 {'entity': 'LABEL_0',
  'score': 0.92054045,
  'index': 6,
  'word': '!',
  'start': 27,
  'end': 28}]
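
The generic `LABEL_n` names correspond to the indices of `NER_labels`. A minimal sketch for turning pipeline output into readable roles, using the output above reduced to the relevant fields (`extract_roles` is a hypothetical helper):

```python
NER_labels = ["O", "HOLDER", "TARGET", "PEXP"]

# Pipeline output from the cell above, reduced to the fields used here.
predictions = [
    {"entity": "LABEL_1", "score": 0.94375, "word": "Peter"},
    {"entity": "LABEL_0", "score": 0.967745, "word": "hat"},
    {"entity": "LABEL_0", "score": 0.91697866, "word": "etwas"},
    {"entity": "LABEL_0", "score": 0.85223156, "word": "gegen"},
    {"entity": "LABEL_2", "score": 0.50546294, "word": "Fritz"},
    {"entity": "LABEL_0", "score": 0.92054045, "word": "!"},
]


def extract_roles(preds, labels=NER_labels):
    """Map LABEL_n to the n-th role name and drop tokens tagged "O"."""
    roles = []
    for p in preds:
        role = labels[int(p["entity"].rsplit("_", 1)[1])]
        if role != "O":
            roles.append((p["word"], role, round(p["score"], 2)))
    return roles


print(extract_roles(predictions))
```

For this sentence no token is tagged as a polar expression, and the score for "Fritz" as target is barely above 0.5, so that prediction should be read with caution.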

**Evaluation on the test / val set**

In [56]:
# Convert the numerical labels of the tokenized dataset into string labels for evaluation.

def labelize(example):
    # Note: ignored positions (-100, i.e. special tokens and non-first subwords)
    # are mapped to "LABEL_0" here, so they count as "O" in the evaluation below.
    id_to_name = {-100: "LABEL_0", 0: "LABEL_0", 1: "LABEL_1", 2: "LABEL_2", 3: "LABEL_3"}
    example["labels"] = [id_to_name[label] for label in example["labels"]]
    return example

def subword_ids_to_strings(example):
    example["subword_tokens"] = tokenizer.convert_ids_to_tokens(example["input_ids"])
    return example

tokenized_dataset["test"] = tokenized_dataset["test"].map(labelize)
tokenized_dataset["test"] = tokenized_dataset["test"].map(subword_ids_to_strings)

In [60]:
# verify
tokenizer.decode(tokenized_dataset["test"][0]["input_ids"])
len(tokenized_dataset["test"])

small_ds = tokenized_dataset["test"].select(range(8))

len(small_ds)

" ".join(small_ds[0]["tokens"])

'Aber es kam anders : Die beiden Frauen und die beiden Männer haben sich nicht nur freundschaftlich , sondern neugierig aufeinander eingelassen , ohne es an Niveau fehlen zu lassen .'

In [58]:
from seqeval.metrics import accuracy_score
from seqeval.metrics import classification_report
from seqeval.metrics import f1_score
from tqdm import tqdm

y_total_pred = []
y_total_true = []

for e in tqdm(tokenized_dataset["test"]):
    pred = [x["entity"] for x in pipe(" ".join(e["tokens"]))]
    # print("Prediction length:", len(pred))
    # print("Subword length:", len(e["subword_tokens"][1:-1]))
    # print("Labels length:", len(e["labels"][1:-1]))
    # print("----")
    y_total_pred.append(pred)
    true = e["labels"]
    y_total_true.append(true[1:-1])

print(classification_report(y_total_true, y_total_pred))

100%|███████████████████████████████████████| 1195/1195 [00:38<00:00, 30.65it/s]


              precision    recall  f1-score   support

      ABEL_0       0.53      0.44      0.48      4250
      ABEL_1       0.74      0.75      0.75      1195
      ABEL_2       0.61      0.35      0.45      1195
      ABEL_3       0.83      0.56      0.67      1195

   micro avg       0.62      0.49      0.55      7835
   macro avg       0.68      0.53      0.59      7835
weighted avg       0.62      0.49      0.55      7835
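
The shortened class names (`ABEL_0` etc.) are most likely a side effect of `seqeval` expecting IOB-style tags such as `B-PER`: it treats the first character of each tag as the chunk prefix and the rest as the entity type. It also means that `LABEL_0`, the "O" class, is scored as if it were an entity type. One possible remedy, sketched here as an assumption rather than the notebook's method, is to rename the labels to IOB-compatible tags before calling `classification_report` (`to_iob` is a hypothetical helper; tagging every role token with `B-` treats each token as its own one-token span):

```python
ROLE_NAMES = {"LABEL_1": "HOLDER", "LABEL_2": "TARGET", "LABEL_3": "PEXP"}


def to_iob(sequence):
    """Map LABEL_0 to "O" and LABEL_n to "B-<ROLE>" for every token."""
    return ["O" if tag == "LABEL_0" else f"B-{ROLE_NAMES[tag]}" for tag in sequence]


example = ["LABEL_0", "LABEL_1", "LABEL_0", "LABEL_3", "LABEL_2"]
print(to_iob(example))

# Possible usage (assumption, not run here):
# y_true_iob = [to_iob(seq) for seq in y_total_true]
# y_pred_iob = [to_iob(seq) for seq in y_total_pred]
# print(classification_report(y_true_iob, y_pred_iob))
```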



In [59]:
pred

['LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_1',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_2',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_2',
 'LABEL_0',
 'LABEL_0',
 'LABEL_3',
 'LABEL_0']
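
`pred` holds one tag per subword of the last test sentence, whereas the model was only trained on the first subword of each word. Since `labelize` maps the ignored positions (`-100`) to `LABEL_0`, predictions on continuation subwords are compared against "O", which may lower the reported scores. A possible alternative, sketched with made-up values and a hypothetical helper `drop_ignored`, is to drop those positions on both sides using the numeric labels from before `labelize` was applied:

```python
IGNORE_INDEX = -100


def drop_ignored(pred_tags, gold_ids):
    """Keep only positions whose gold label is not the ignore index.

    `pred_tags` and `gold_ids` must be aligned subword by subword,
    i.e. with [CLS] and [SEP] already removed from the gold labels.
    """
    if len(pred_tags) != len(gold_ids):
        raise ValueError("predictions and gold labels differ in length")
    kept = [(p, f"LABEL_{g}") for p, g in zip(pred_tags, gold_ids) if g != IGNORE_INDEX]
    preds = [p for p, _ in kept]
    golds = [g for _, g in kept]
    return preds, golds


# Made-up example: the second word is split into two subwords.
pred_tags = ["LABEL_1", "LABEL_3", "LABEL_3", "LABEL_2"]
gold_ids = [1, 3, IGNORE_INDEX, 2]
print(drop_ignored(pred_tags, gold_ids))
```

The length check also guards against sentences that were truncated by the tokenizer on one side but not the other.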