## Opinion Role Labeling

### Using the NER approach
-------

**Requirements**
```
!pip install transformers datasets evaluate seqeval
```

Alternatively, use the following command (on *Anaconda*):
```
!conda install -y -c conda-forge transformers datasets evaluate seqeval
```

**Load a dataset in the following format (masked sentence first, then the original sentence)**
```
[CLS] Das immer wieder mit dem Separatismus liebäugelnde französischsprachige Quebec wollte sich mit dem Verfassungskompromiss , TARGET die neun vorwiegend englischsprachigen HOLDER schliesslich zugestimmt hatten , nicht abfinden . [SEP] 
[CLS] Das immer wieder mit dem Separatismus liebäugelnde französischsprachige Quebec wollte sich mit dem Verfassungskompromiss , dem die neun vorwiegend englischsprachigen Provinzen schliesslich zugestimmt hatten , nicht abfinden . [SEP] 
[CLS] Von den beiden grossen Gewerkschaften hatte die HOLDER ( Unión General del Trabajo ) dieses TARGET unterstützt , während die Comisiones Obreras ( CCOO ) es im Vorfeld abgelehnt hatten , sich jetzt aber mit dem Resultat abfinden . [SEP] 
[CLS] Von den beiden grossen Gewerkschaften hatte die UGT ( Unión General del Trabajo ) dieses Vorhaben unterstützt , während die Comisiones Obreras ( CCOO ) es im Vorfeld abgelehnt hatten , sich jetzt aber mit dem Resultat abfinden . [SEP] 
[CLS] Der HOLDER hat keine TARGET verletzt . » Petkovic lehnt laut italienischen Medien auch ein Angebot ab , mit 30 Prozent der Summe abgefunden zu werden . [SEP] 
[CLS] Der Trainer hat keine Regeln verletzt . » Petkovic lehnt laut italienischen Medien auch ein Angebot ab , mit 30 Prozent der Summe abgefunden zu werden . [SEP] 
```
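The pairing above can be sketched in a few lines. This is only an illustration: `tag_pair` and `ROLE_LABELS` are hypothetical names, and the example pair is a shortened version of the third sentence above. Wherever the masked line differs from the original line, the mask is taken as that token's role; everything else is tagged `O`.

```python
# Hypothetical helper illustrating the masked/original line pairing.
ROLE_LABELS = ["O", "HOLDER", "TARGET", "PEXP"]

def tag_pair(masked_line, real_line):
    """Return (token, role) pairs for one masked/original sentence pair."""
    masked = masked_line.split()
    real = real_line.split()
    assert len(masked) == len(real), "both lines must have the same number of tokens"
    pairs = []
    for m, r in zip(masked, real):
        # A token only carries a role if it was replaced by a known mask.
        role = m if (m != r and m in ROLE_LABELS) else "O"
        pairs.append((r, role))
    return pairs

# Shortened example pair (assumption: same layout as the dataset lines).
masked_line = "[CLS] Der HOLDER hat keine TARGET verletzt . [SEP]"
real_line = "[CLS] Der Trainer hat keine Regeln verletzt . [SEP]"

pairs = tag_pair(masked_line, real_line)
print(pairs)
```

The notebook's own `align_sentences` below does the same thing on the full dataset and additionally strips the `[CLS]` / `[SEP]` tokens.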

In [10]:
# CONFIG VARIABLES

# DATASET_PATH="../../etl/data/raw/03_holder_target.txt"

TRAIN_DATASET_PATH="../../etl/data/processed/ORLConverter/01_train_orl.txt"
VAL_DATASET_PATH="../../etl/data/processed/ORLConverter/01_val_orl.txt"
TEST_DATASET_PATH="../../etl/data/processed/ORLConverter/01_test_orl.txt"

GENERATE_DATASET=True

BASE_MODEL_NAME="xlm-roberta-base"

TRAINED_MODEL_NAME="./data/trained_model_xlm_roberta_base"

USE_WANDB=True

In [11]:
%env WANDB_PROJECT=erre_er_component

env: WANDB_PROJECT=erre_er_component


In [12]:
def read_sents(dataset_path):
    sents = []

    with open(dataset_path, "r", encoding="utf-8") as f:
        for line in f:
            sents.append(line.strip())

    # Original sentences (every second line), e.g.
    # "Der Minister findet die Debatte langweilig."
    real_sents = sents[1::2]

    # Masked sentences (the lines in between), e.g.
    # "Der HOLDER findet die TARGET langweilig."
    masked_sents = sents[0::2]

    assert len(real_sents) == len(masked_sents), "Nr of masked and real sentences are not equal!"

    print(f"Total rows in the data: {len(real_sents) + len(masked_sents)}")
    
    return real_sents, masked_sents

train_sents_real, train_sents_masked = read_sents(TRAIN_DATASET_PATH)
val_sents_real, val_sents_masked = read_sents(VAL_DATASET_PATH)
test_sents_real, test_sents_masked = read_sents(TEST_DATASET_PATH)

Total rows in the data: 7666
Total rows in the data: 2744
Total rows in the data: 2766


In [13]:
NER_labels = ["O", "HOLDER", "TARGET", "PEXP"]

test_sent = ["Peter sowie Lucy mag Katzen sowie auch Hunde ."]
test_sent_masked = ["HOLDER sowie HOLDER mag TARGET sowie auch TARGET ."]

def align_sentences(real_sentences, masked_sentences, NER_labels):
    """Given two sentences that have their word-level tokens delimited by a white-space, 
    create NER-tags for each sentence token.
    """
    dataset = []
    
    max_multiple_holders = 0
    max_multiple_targets = 0
    max_multiple_pexps = 0

    for i, (real_sentence, masked_sentence) in enumerate(zip(real_sentences, masked_sentences)):
        ner_tags = []
        real_sentence_split = real_sentence.split(" ")
        masked_sentence_split = masked_sentence.split(" ")
        
        # target / holder counting
        target_cnt = 0
        holder_cnt = 0
        pexp_cnt = 0

        assert len(real_sentence_split) == len(masked_sentence_split), "Misalignment of length of tokens in sentence." \
        + str(real_sentence_split) + str(masked_sentence_split) + str(i)

        for real_token, masked_token in zip(real_sentence_split, masked_sentence_split):
            if real_token == masked_token:
                # Unmasked token: tag as "O".
                ner_tags.append(0)
            elif masked_token in NER_labels:
                # Masked token: count the role and look up its tag index.
                if masked_token == "HOLDER":
                    holder_cnt += 1
                elif masked_token == "TARGET":
                    target_cnt += 1
                elif masked_token == "PEXP":
                    pexp_cnt += 1
                ner_tags.append(NER_labels.index(masked_token))
            else:
                # Unknown mask: fall back to "O" so that tags stay aligned with tokens.
                ner_tags.append(0)

        dataset.append({
            "id": i,
            # remove the magic tokens from the sentences, 
            # since we will pass the sentences through the tokenizer again.
            "ner_tags": ner_tags[1:-1],
            "tokens": real_sentence_split[1:-1],
        })
        # Update the running maxima once per sentence (inside the loop).
        max_multiple_holders = max(max_multiple_holders, holder_cnt)
        max_multiple_targets = max(max_multiple_targets, target_cnt)
        max_multiple_pexps = max(max_multiple_pexps, pexp_cnt)

    return dataset, max_multiple_holders, max_multiple_targets, max_multiple_pexps

# align for training
train_aligned_sents, mmh, mmt, mmp = align_sentences(train_sents_real, train_sents_masked, NER_labels)
print("Max holders: ", mmh, "Max targets: ", mmt, "Max polar expressions: ", mmp)
val_aligned_sents, mmh, mmt, mmp = align_sentences(val_sents_real, val_sents_masked, NER_labels)
print("Max holders: ", mmh, "Max targets: ", mmt, "Max polar expressions: ", mmp)
test_aligned_sents, mmh, mmt, mmp = align_sentences(test_sents_real, test_sents_masked, NER_labels)
print("Max holders: ", mmh, "Max targets: ", mmt, "Max polar expressions: ", mmp)

# UNCOMMENT FOR TEST
# aligned_sents, mmh, mmt, mmp = align_sentences(test_sent, test_sent_masked, NER_labels)

print(train_aligned_sents[:10])

Max holders:  1 Max targets:  1 Max polar expressions:  1
Max holders:  1 Max targets:  1 Max polar expressions:  1
Max holders:  1 Max targets:  1 Max polar expressions:  1
[{'id': 0, 'ner_tags': [0, 1, 3, 0, 0, 2, 0, 0], 'tokens': ['Die', 'Grasshoppers', 'nutzten', 'den', 'personellen', 'Vorteil', 'aus', '.']}, {'id': 1, 'ner_tags': [0, 0, 1, 0, 0, 0, 0, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'tokens': ['Als', 'sich', 'Prinzessin', 'Haya', 'für', 'die', 'jungen', 'Frauen', 'eingesetzt', 'habe', ',', 'sei', 'auch', 'sie', 'ins', 'Visier', 'des', 'Emirs', 'geraten', 'und', 'mit', 'dem', 'Tod', 'bedroht', 'worden', '.']}, {'id': 2, 'ner_tags': [1, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0], 'tokens': ['Maurer', 'erreichte', 'in', 'dem', 'Departement', ',', 'das', 'er', 'bis', '2015', 'führte', ',', 'eine', 'gewisse', 'Beruhigung', 'der', 'Verhältnisse', '.']}, {'id': 3, 'ner_tags': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 1, 0, 0, 0, 0, 0, 2, 

In [14]:
# Make a huggingface dataset out of the record-based dataset
from datasets import Dataset, DatasetDict, load_from_disk

if GENERATE_DATASET:
    train_dataset = Dataset.from_list(train_aligned_sents)
    val_dataset = Dataset.from_list(val_aligned_sents)
    test_dataset = Dataset.from_list(test_aligned_sents)

    # randomly shuffle
    train_dataset = train_dataset.shuffle(seed=42)
    val_dataset = val_dataset.shuffle(seed=42)
    test_dataset = test_dataset.shuffle(seed=42)

    # dataset = dataset.train_test_split(test_size=0.2, shuffle=False)
    # train_test_dataset = dataset.train_test_split(test_size=0.2, shuffle=False)

    # Split the 10% test + valid in half test, half valid
    # test_valid = train_test_dataset['test'].train_test_split(test_size=0.4, shuffle=False)

    # gather everyone if you want to have a single DatasetDict
    dataset = DatasetDict({
        'train': train_dataset,
        'valid': val_dataset,
        'test': test_dataset})

    dataset.save_to_disk("./data/split_dataset.hf")
else:
    dataset = load_from_disk("./data/split_dataset.hf")

print(dataset)

DatasetDict({
    train: Dataset({
        features: ['id', 'ner_tags', 'tokens'],
        num_rows: 3833
    })
    valid: Dataset({
        features: ['id', 'ner_tags', 'tokens'],
        num_rows: 1372
    })
    test: Dataset({
        features: ['id', 'ner_tags', 'tokens'],
        num_rows: 1383
    })
})


**Use the XLM-RoBERTa tokenizer and embeddings**

In [15]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_NAME)


In [16]:
# tokenize input
tokenized_input = tokenizer(dataset["train"][0]["tokens"], truncation=True, is_split_into_words=True)
print(tokenized_input)

# output subwords
print(tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"]))

{'input_ids': [0, 2991, 7911, 9, 81405, 42, 45331, 56, 10264, 18370, 444, 615, 18146, 505, 644, 68, 78481, 9026, 74831, 1716, 4383, 933, 6, 190848, 33, 165, 9318, 542, 37616, 2046, 16463, 126, 6, 5, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
['<s>', '▁Ein', '▁64', '-', 'jährige', 'r', '▁Amerikan', 'er', '▁hatte', '▁dort', '▁am', '▁1.', '▁Oktober', '▁2017', '▁auf', '▁die', '▁Besucher', '▁eines', '▁Country', 'kon', 'zer', 'ts', '▁', 'geschoss', 'en', '▁und', '▁58', '▁von', '▁ihnen', '▁get', 'öt', 'et', '▁', '.', '</s>']


In [17]:
def tokenize_and_align_labels(examples):

    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)

    labels = []
    
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)  # Map tokens to their respective word.
        previous_word_idx = None
        label_ids = []
        
        try:
            for word_idx in word_ids:  # Set the special tokens to -100.
                if word_idx is None:
                    label_ids.append(-100)
                elif word_idx != previous_word_idx:  # Only label the first token of a given word.
                    label_ids.append(label[word_idx])
                else:
                    label_ids.append(-100)
                previous_word_idx = word_idx
        except IndexError as e:
            # Tags and tokens are misaligned: mask the whole example with -100
            # so that `labels` still holds one entry per example in the batch.
            print(f"Ignoring example due to the following: {e}")
            print(" ".join(examples["tokens"][i]))
            print(label)
            label_ids = [-100] * len(word_ids)
        labels.append(label_ids)

    tokenized_inputs["labels"] = labels

    return tokenized_inputs

tokenized_dataset = dataset.map(tokenize_and_align_labels, batched=True)

tokenized_dataset


DatasetDict({
    train: Dataset({
        features: ['id', 'ner_tags', 'tokens', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 3833
    })
    valid: Dataset({
        features: ['id', 'ner_tags', 'tokens', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1372
    })
    test: Dataset({
        features: ['id', 'ner_tags', 'tokens', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1383
    })
})

In [18]:
# check that the conversion worked.
print(tokenized_dataset["train"][0]["labels"])
print(tokenizer.convert_ids_to_tokens(tokenized_dataset["train"][0]["input_ids"]))

[-100, 0, 0, -100, -100, -100, 1, -100, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, -100, -100, -100, 3, -100, -100, 0, 0, 0, 0, 0, -100, -100, 0, -100, -100]
['<s>', '▁Ein', '▁64', '-', 'jährige', 'r', '▁Amerikan', 'er', '▁hatte', '▁dort', '▁am', '▁1.', '▁Oktober', '▁2017', '▁auf', '▁die', '▁Besucher', '▁eines', '▁Country', 'kon', 'zer', 'ts', '▁', 'geschoss', 'en', '▁und', '▁58', '▁von', '▁ihnen', '▁get', 'öt', 'et', '▁', '.', '</s>']
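The output above follows the "label only the first subword" rule: special tokens and word continuations get `-100`, which the loss ignores. A minimal, tokenizer-free sketch of that rule, using a hand-written `word_ids` list (an assumption standing in for `tokenized_inputs.word_ids(...)`) and a hypothetical helper name `align_labels`:

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Give each word's label to its first subword; ignore everything else."""
    aligned = []
    previous = None
    for word_idx in word_ids:
        if word_idx is None or word_idx == previous:
            # Special token or continuation of the previous word.
            aligned.append(ignore_index)
        else:
            aligned.append(word_labels[word_idx])
        previous = word_idx
    return aligned

# Toy input: three words, the second one split into three subwords.
word_ids = [None, 0, 1, 1, 1, 2, None]
word_labels = [0, 1, 3]  # O, HOLDER, PEXP

aligned = align_labels(word_ids, word_labels)
print(aligned)  # [-100, 0, 1, -100, -100, 3, -100]
```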


**Define a DataCollator (for efficient padding of the tokens)**

In [19]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

**Download the model**

In [20]:
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer

# num_labels is 4: O, HOLDER, TARGET and PEXP (-100 is not a class, it only masks positions in the loss)
model = AutoModelForTokenClassification.from_pretrained(BASE_MODEL_NAME, num_labels=4)

Some weights of the model checkpoint at xlm-roberta-base were not used when initializing XLMRobertaForTokenClassification: ['lm_head.layer_norm.bias', 'lm_head.bias', 'lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.decoder.weight', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing XLMRobertaForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of XLMRobertaForTokenClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-st
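With only `num_labels=4`, the model reports generic names such as `LABEL_1` at inference time. One possible refinement, sketched here under the assumption that the label order matches `NER_labels` from the alignment cell, is to build explicit id/name mappings; `from_pretrained` accepts them as the `id2label` and `label2id` configuration arguments.

```python
# Assumption: same label order as NER_labels in the alignment cell.
NER_LABELS = ["O", "HOLDER", "TARGET", "PEXP"]

id2label = dict(enumerate(NER_LABELS))
label2id = {label: idx for idx, label in id2label.items()}

print(id2label)
print(label2id)

# These could then be passed along when loading the model, e.g.
# AutoModelForTokenClassification.from_pretrained(
#     BASE_MODEL_NAME, num_labels=len(NER_LABELS),
#     id2label=id2label, label2id=label2id)
```

The rest of this notebook keeps the generic `LABEL_n` names, so the evaluation code below would need adjusting if the mappings were used.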

**Train the model**

In [23]:
training_args = TrainingArguments(
    output_dir="./data/results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    # "none" disables all logging integrations; None may fall back to the library default.
    report_to=("wandb" if USE_WANDB else "none"),
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["valid"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()

trainer.save_model(TRAINED_MODEL_NAME)

PyTorch: setting up devices


RuntimeError: WandbCallback requires wandb to be installed. Run `pip install wandb`.
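The error above occurs because `USE_WANDB` is `True` while the `wandb` package is missing. A small guard, sketched here with a hypothetical helper `pick_report_to`, could check for the package before the training arguments are built and fall back to `"none"` otherwise:

```python
import importlib.util

def pick_report_to(use_wandb):
    """Return "wandb" only if it is requested and importable, else "none"."""
    if use_wandb and importlib.util.find_spec("wandb") is not None:
        return "wandb"
    return "none"

report_to = pick_report_to(use_wandb=True)
print(report_to)
```

The result could then be passed as `report_to=report_to` to `TrainingArguments`.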

In [24]:
%pip install wandb

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
/bin/bash: line 1: /home/parallels/envs/nbdev/bin/pip: cannot execute: required file not found


In [None]:
# In case no model was loaded up until now.
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_path = TRAINED_MODEL_NAME

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForTokenClassification.from_pretrained(model_path)

NameError: name 'TRAINED_MODEL_NAME' is not defined

In [None]:
from transformers import pipeline

pipe = pipeline(task="token-classification",
                # model=trainer.model, -- in case freshly trained
                model=model,
                tokenizer=tokenizer)

pipe("Er sagt, dass der Präsident dem Volk etwas vorgemacht hat.")

[{'entity': 'LABEL_0',
  'score': 0.99943393,
  'index': 1,
  'word': 'Er',
  'start': 0,
  'end': 2},
 {'entity': 'LABEL_0',
  'score': 0.99932384,
  'index': 2,
  'word': 'sagt',
  'start': 3,
  'end': 7},
 {'entity': 'LABEL_0',
  'score': 0.99940085,
  'index': 3,
  'word': ',',
  'start': 7,
  'end': 8},
 {'entity': 'LABEL_0',
  'score': 0.99937755,
  'index': 4,
  'word': 'dass',
  'start': 9,
  'end': 13},
 {'entity': 'LABEL_0',
  'score': 0.9993318,
  'index': 5,
  'word': 'der',
  'start': 14,
  'end': 17},
 {'entity': 'LABEL_1',
  'score': 0.98201424,
  'index': 6,
  'word': 'Präsident',
  'start': 18,
  'end': 27},
 {'entity': 'LABEL_0',
  'score': 0.9995322,
  'index': 7,
  'word': 'dem',
  'start': 28,
  'end': 31},
 {'entity': 'LABEL_2',
  'score': 0.8740551,
  'index': 8,
  'word': 'Volk',
  'start': 32,
  'end': 36},
 {'entity': 'LABEL_0',
  'score': 0.98325515,
  'index': 9,
  'word': 'etwas',
  'start': 37,
  'end': 42},
 {'entity': 'LABEL_3',
  'score': 0.9126522,
  '

In [None]:
pipe("Peter hat etwas gegen Fritz!")

[{'entity': 'LABEL_1',
  'score': 0.9424176,
  'index': 1,
  'word': 'Peter',
  'start': 0,
  'end': 5},
 {'entity': 'LABEL_0',
  'score': 0.9950648,
  'index': 2,
  'word': 'hat',
  'start': 6,
  'end': 9},
 {'entity': 'LABEL_0',
  'score': 0.9327351,
  'index': 3,
  'word': 'etwas',
  'start': 10,
  'end': 15},
 {'entity': 'LABEL_0',
  'score': 0.9905109,
  'index': 4,
  'word': 'gegen',
  'start': 16,
  'end': 21},
 {'entity': 'LABEL_2',
  'score': 0.8834155,
  'index': 5,
  'word': 'Fritz',
  'start': 22,
  'end': 27},
 {'entity': 'LABEL_0',
  'score': 0.9602219,
  'index': 6,
  'word': '!',
  'start': 27,
  'end': 28}]
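The raw pipeline output is verbose. A short sketch of how it could be condensed into roles, assuming `LABEL_n` indexes into the same label list as `NER_labels` (the helper `roles_from_pipeline` is hypothetical, and the predictions are abbreviated from the output above):

```python
NER_LABELS = ["O", "HOLDER", "TARGET", "PEXP"]

def roles_from_pipeline(predictions):
    """Group predicted words by role name, skipping "O" tokens."""
    roles = {}
    for p in predictions:
        label_id = int(p["entity"].rsplit("_", 1)[1])  # "LABEL_2" -> 2
        name = NER_LABELS[label_id]
        if name != "O":
            roles.setdefault(name, []).append(p["word"])
    return roles

# Abbreviated copy of the pipeline output shown above.
predictions = [
    {"entity": "LABEL_1", "word": "Peter"},
    {"entity": "LABEL_0", "word": "hat"},
    {"entity": "LABEL_0", "word": "etwas"},
    {"entity": "LABEL_0", "word": "gegen"},
    {"entity": "LABEL_2", "word": "Fritz"},
    {"entity": "LABEL_0", "word": "!"},
]

roles = roles_from_pipeline(predictions)
print(roles)  # {'HOLDER': ['Peter'], 'TARGET': ['Fritz']}
```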

**Evaluation on the test / val set**

In [None]:
# Tokenized dataset, convert numerical labels to labels ready for evaluation.

def labelize(example):
    # Map label ids to the pipeline's "LABEL_<id>" strings.
    # Note: ignored subword positions (-100) are folded into "LABEL_0" here.
    example["labels"] = [
        "LABEL_0" if label == -100 else f"LABEL_{label}"
        for label in example["labels"]
    ]
    return example

def subword_ids_to_strings(example):
    example["subword_tokens"] = tokenizer.convert_ids_to_tokens(example["input_ids"])
    return example

tokenized_dataset["test"] = tokenized_dataset["test"].map(labelize)
tokenized_dataset["test"] = tokenized_dataset["test"].map(subword_ids_to_strings)

  0%|          | 0/933 [00:00<?, ?ex/s]

  0%|          | 0/933 [00:00<?, ?ex/s]

In [None]:
# verify
tokenizer.decode(tokenized_dataset["test"][0]["input_ids"])
len(tokenized_dataset["test"])

small_ds = tokenized_dataset["test"].select(range(8))

len(small_ds)

" ".join(small_ds[0]["tokens"])

'Vor allem der berüchtigten « Hölle des Nordens » von Paris nach Roubaix ( 17. April ) gilt seine Liebe .'

In [None]:
from seqeval.metrics import accuracy_score
from seqeval.metrics import classification_report
from seqeval.metrics import f1_score
from tqdm import tqdm

y_total_pred = []
y_total_true = []

for e in tqdm(tokenized_dataset["test"]):
    pred = [x["entity"] for x in pipe(" ".join(e["tokens"]))]
    # print("Prediction length:", len(pred))
    # print("Subword length:", len(e["subword_tokens"][1:-1]))
    # print("Labels length:", len(e["labels"][1:-1]))
    # print("----")
    y_total_pred.append(pred)
    true = e["labels"]
    y_total_true.append(true[1:-1])

print(classification_report(y_total_true, y_total_pred))

100%|███████████████████████████████████████████████████████████████| 933/933 [00:30<00:00, 30.88it/s]


              precision    recall  f1-score   support

      ABEL_0       0.66      0.69      0.68      3310
      ABEL_1       0.73      0.82      0.77       933
      ABEL_2       0.67      0.66      0.66       932
      ABEL_3       0.83      0.88      0.86       933

   micro avg       0.70      0.73      0.72      6108
   macro avg       0.72      0.76      0.74      6108
weighted avg       0.70      0.73      0.72      6108
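Two caveats apply to this report. The row names read `ABEL_n` because seqeval appears to treat the first character of every tag as an IOB-style prefix, and the gold labels count ignored subword positions (`-100`) as `LABEL_0`. As a cross-check, token-level precision and recall can be computed over only the positions that carry a real gold label. The sketch below is a plain-Python alternative, not the seqeval method; `token_level_scores` and the toy sequences are assumptions for illustration.

```python
from collections import Counter

def token_level_scores(gold_ids, pred_ids, ignore_index=-100):
    """Per-label (precision, recall), skipping positions whose gold label is ignored."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold_seq, pred_seq in zip(gold_ids, pred_ids):
        for g, p in zip(gold_seq, pred_seq):
            if g == ignore_index:
                continue  # subword continuation or special token
            if g == p:
                tp[g] += 1
            else:
                fp[p] += 1
                fn[g] += 1
    scores = {}
    for label in sorted(set(tp) | set(fp) | set(fn)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        scores[label] = (precision, recall)
    return scores

# Toy data: one sentence with numeric label ids (0=O, 1=HOLDER, 2=TARGET, 3=PEXP).
gold = [[-100, 0, 1, -100, 3, 2, -100]]
pred = [[0, 0, 1, 1, 3, 0, 0]]

scores = token_level_scores(gold, pred)
print(scores)
```

In the toy data the predictions on ignored positions do not affect the scores; only the missed TARGET lowers precision for `O` and recall for `TARGET`.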



In [None]:
pred

['LABEL_0',
 'LABEL_1',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_3',
 'LABEL_0',
 'LABEL_2',
 'LABEL_2',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0']