## Opinion Role Labeling

### Using the NER approach
-------

**Requirements**
```
pip install transformers datasets evaluate seqeval
```

**Load a dataset in the following format**
```
[CLS] Das immer wieder mit dem Separatismus liebäugelnde französischsprachige Quebec wollte sich mit dem Verfassungskompromiss , TARGET die neun vorwiegend englischsprachigen HOLDER schliesslich zugestimmt hatten , nicht abfinden . [SEP] 
[CLS] Das immer wieder mit dem Separatismus liebäugelnde französischsprachige Quebec wollte sich mit dem Verfassungskompromiss , dem die neun vorwiegend englischsprachigen Provinzen schliesslich zugestimmt hatten , nicht abfinden . [SEP] 
[CLS] Von den beiden grossen Gewerkschaften hatte die HOLDER ( Unión General del Trabajo ) dieses TARGET unterstützt , während die Comisiones Obreras ( CCOO ) es im Vorfeld abgelehnt hatten , sich jetzt aber mit dem Resultat abfinden . [SEP] 
[CLS] Von den beiden grossen Gewerkschaften hatte die UGT ( Unión General del Trabajo ) dieses Vorhaben unterstützt , während die Comisiones Obreras ( CCOO ) es im Vorfeld abgelehnt hatten , sich jetzt aber mit dem Resultat abfinden . [SEP] 
[CLS] Der HOLDER hat keine TARGET verletzt . » Petkovic lehnt laut italienischen Medien auch ein Angebot ab , mit 30 Prozent der Summe abgefunden zu werden . [SEP] 
[CLS] Der Trainer hat keine Regeln verletzt . » Petkovic lehnt laut italienischen Medien auch ein Angebot ab , mit 30 Prozent der Summe abgefunden zu werden . [SEP] 
```

In [213]:
DATASET_PATH="data/silver_standard/holder_target.txt"

sents = []

with open(DATASET_PATH, "r", encoding="utf-8") as f:
    for line in f:
        sents.append(line.strip())

# The file alternates between masked and original sentences; the masked one comes first.
# Original sentences, e.g. "Der Minister findet die Debatte langweilig."
real_sents = sents[1::2]

# Masked sentences, e.g. "Der HOLDER findet die TARGET langweilig."
masked_sents = sents[0::2]

assert len(real_sents) == len(masked_sents), "Number of masked and real sentences differs!"

print(f"Total rows in the data: {len(real_sents) + len(masked_sents)}")

Total rows in the data: 130844


In [214]:
NER_labels = ["O", "HOLDER", "TARGET"]

def align_sentences(real_sentences, masked_sentences, NER_labels):
    """Create one NER tag per whitespace-delimited token.

    A token that is identical in both sentences gets tag 0 ("O"). A token that is
    replaced by HOLDER or TARGET in the masked sentence gets that label's index
    in NER_labels.
    """
    dataset = []
    for i, (real_sentence, masked_sentence) in enumerate(zip(real_sentences, masked_sentences)):
        ner_tags = []
        real_sentence_split = real_sentence.split(" ")
        masked_sentence_split = masked_sentence.split(" ")
        assert len(real_sentence_split) == len(masked_sentence_split), "Misalignment of length of tokens in sentence."
        for real_token, masked_token in zip(real_sentence_split, masked_sentence_split):
            if real_token == masked_token:
                ner_tags.append(0)
            elif masked_token in ("HOLDER", "TARGET"):
                ner_tags.append(NER_labels.index(masked_token))
            else:
                # Without this branch a mismatching token would be skipped silently
                # and the tags would drift out of alignment with the tokens.
                raise ValueError(f"Unexpected token pair in sentence {i}: {real_token!r} vs. {masked_token!r}")
        dataset.append({
            "id": i,
            # Drop [CLS] and [SEP] here, since the tokenizer adds its own
            # special tokens when the sentences are passed through it later.
            "ner_tags": ner_tags[1:-1],
            "tokens": real_sentence_split[1:-1],
        })
    return dataset
         
    
dataset = align_sentences(real_sents, masked_sents, NER_labels)

print(dataset[:10])

[{'id': 0, 'ner_tags': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0], 'tokens': ['Das', 'immer', 'wieder', 'mit', 'dem', 'Separatismus', 'liebäugelnde', 'französischsprachige', 'Quebec', 'wollte', 'sich', 'mit', 'dem', 'Verfassungskompromiss', ',', 'dem', 'die', 'neun', 'vorwiegend', 'englischsprachigen', 'Provinzen', 'schliesslich', 'zugestimmt', 'hatten', ',', 'nicht', 'abfinden', '.']}, {'id': 1, 'ner_tags': [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'tokens': ['Von', 'den', 'beiden', 'grossen', 'Gewerkschaften', 'hatte', 'die', 'UGT', '(', 'Unión', 'General', 'del', 'Trabajo', ')', 'dieses', 'Vorhaben', 'unterstützt', ',', 'während', 'die', 'Comisiones', 'Obreras', '(', 'CCOO', ')', 'es', 'im', 'Vorfeld', 'abgelehnt', 'hatten', ',', 'sich', 'jetzt', 'aber', 'mit', 'dem', 'Resultat', 'abfinden', '.']}, {'id': 2, 'ner_tags': [0, 1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
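
As a quick sanity check, the numeric tags can be mapped back to the tokens they mark. The helper below is a minimal sketch: `extract_roles` is a hypothetical name that does not appear in the notebook, and it assumes the tag ids follow the order of `NER_labels`.

```python
NER_labels = ["O", "HOLDER", "TARGET"]

def extract_roles(tokens, ner_tags, labels=NER_labels):
    """Collect the tokens tagged with each label other than "O"."""
    roles = {name: [] for name in labels if name != "O"}
    for token, tag in zip(tokens, ner_tags):
        name = labels[tag]
        if name != "O":
            roles[name].append(token)
    return roles

# First clause of the third sentence pair from the format example.
tokens = ["Der", "Trainer", "hat", "keine", "Regeln", "verletzt", "."]
ner_tags = [0, 1, 0, 0, 2, 0, 0]

roles = extract_roles(tokens, ner_tags)
print(roles)
```

Running this over a handful of records from `dataset` is a cheap way to spot misaligned tags before training.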

In [215]:
# Make a huggingface dataset out of the record-based dataset
from datasets import Dataset

dataset = Dataset.from_list(dataset)
dataset = dataset.shuffle(seed=42)
dataset = dataset.train_test_split(test_size=0.2, shuffle=False)

print(dataset)

DatasetDict({
    train: Dataset({
        features: ['id', 'ner_tags', 'tokens'],
        num_rows: 52337
    })
    test: Dataset({
        features: ['id', 'ner_tags', 'tokens'],
        num_rows: 13085
    })
})


**Use DistilBERT tokenizer and embeddings**

In [193]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

In [194]:
# tokenize input
tokenized_input = tokenizer(dataset["train"][0]["tokens"], truncation=True, is_split_into_words=True)
print(tokenized_input)

# output subwords
print(tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"]))

{'input_ids': [101, 9033, 2063, 4895, 7747, 8525, 5753, 2618, 16950, 16846, 2480, 27412, 1047, 29443, 11837, 24759, 20501, 2063, 6519, 2358, 18727, 13719, 17076, 4355, 5349, 2618, 6151, 2162, 6519, 3280, 21200, 14758, 5575, 4315, 8040, 21886, 6499, 12871, 8017, 19205, 2102, 1010, 3393, 7295, 2618, 7632, 15465, 6914, 27665, 2078, 1047, 5596, 4183, 6519, 2270, 10523, 6151, 13970, 7096, 3126, 22894, 7389, 2094, 4315, 9944, 5511, 11113, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
['[CLS]', 'si', '##e', 'un', '##ters', '##tu', '##tz', '##te', 'zu', '##sat', '##z', '##liche', 'k', '##rip', '##pen', '##pl', '##atz', '##e', 'fur', 'st', '##adt', '##ische', 'ang', '##est', '##ell', '##te', 'und', 'war', 'fur', 'die', 'auf', '##stock', '##ung', 'der', 'sc', '##hul', '##so', '##zia', '##lar', '##bei', '##

In [216]:
def tokenize_and_align_labels(examples):

    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)

    labels = []
    
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)  # Map tokens to their respective word.
        previous_word_idx = None
        label_ids = []

        for word_idx in word_ids:  # Set the special tokens to -100.
            if word_idx is None:
                label_ids.append(-100)
            elif word_idx != previous_word_idx:  # Only label the first token of a given word.
                label_ids.append(label[word_idx])
            else:
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)

    tokenized_inputs["labels"] = labels

    return tokenized_inputs

tokenized_dataset = dataset.map(tokenize_and_align_labels, batched=True)

tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'ner_tags', 'tokens', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 52337
    })
    test: Dataset({
        features: ['id', 'ner_tags', 'tokens', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 13085
    })
})

In [64]:
# check that the conversion worked.
print(tokenized_dataset["train"][0]["labels"])
print(tokenizer.convert_ids_to_tokens(tokenized_dataset["train"][0]["input_ids"]))

[-100, 1, 0, -100, -100, 0, -100, 0, -100, -100, -100, -100, 0, 0, -100, 2, -100, 0, 0, -100, -100, 0, 0, 0, -100, 0, 0, 0, 0, -100, -100, -100, -100, 0, -100, 0, 0, 0, 0, -100, -100, 0, -100, -100, 0, -100, 0, -100, -100, 0, -100, -100, -100, 0, -100]
['[CLS]', 'man', 'sol', '##lt', '##e', 'si', '##e', 'un', '##ters', '##tu', '##tze', '##n', ',', 'stat', '##t', 'si', '##e', 'zu', 'best', '##raf', '##en', '»', ',', 'sa', '##gt', 'martin', ',', 'dem', 'na', '##ch', '##ges', '##ag', '##t', 'wi', '##rd', ',', 'auf', 'die', 'hi', '##es', '##ige', 'reg', '##ier', '##ung', 'gross', '##en', 'ein', '##fl', '##uss', 'aus', '##zu', '##ub', '##en', '.', '[SEP]']
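
Reading the two printed lists side by side is error-prone. The following is a minimal sketch of a helper that keeps only the positions carrying a real label and shows them with their label names. `labeled_positions` is a hypothetical name, and the example input is just the first five positions of the output above.

```python
NER_labels = ["O", "HOLDER", "TARGET"]
IGNORE_INDEX = -100  # positions skipped by the loss (special tokens, non-first subwords)

def labeled_positions(subwords, label_ids, labels=NER_labels):
    """Return (subword, label name) pairs for positions that carry a label."""
    return [
        (subword, labels[label_id])
        for subword, label_id in zip(subwords, label_ids)
        if label_id != IGNORE_INDEX
    ]

subwords = ["[CLS]", "man", "sol", "##lt", "##e"]
label_ids = [-100, 1, 0, -100, -100]

pairs = labeled_positions(subwords, label_ids)
print(pairs)
```

Only the first subword of each word appears in the result, which mirrors the labeling rule in `tokenize_and_align_labels`.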


**Define a DataCollator (for efficient padding of the tokens)**

In [65]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
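
Conceptually, the collator pads each batch only up to its longest sequence and fills the padded label positions with -100 so that the loss ignores them. The function below is a simplified pure-Python sketch of that idea, not the library's implementation. `pad_batch`, the `pad_token_id=0` default, and the token ids in the example are assumptions for illustration.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Pad input ids, attention mask and labels to the longest sequence in the batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_real = len(f["input_ids"])
        n_pad = max_len - n_real
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * n_real + [0] * n_pad)
        batch["labels"].append(f["labels"] + [label_pad_id] * n_pad)
    return batch

# Two toy examples of different length (made-up token ids).
features = [
    {"input_ids": [101, 2158, 102], "labels": [-100, 1, -100]},
    {"input_ids": [101, 2158, 14017, 7096, 102], "labels": [-100, 1, 0, -100, -100]},
]

batch = pad_batch(features)
print(batch["input_ids"][0])
print(batch["labels"][0])
```

Padding per batch rather than to a global maximum length keeps short batches small, which is where the efficiency gain comes from.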

**Download the model**

In [56]:
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer

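# Note: -100 is only the ignore index of the loss and needs no class of its own,
# so 3 labels (O, HOLDER, TARGET) would suffice. num_labels=4 is kept here
# because the recorded run below was trained with it.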
model = AutoModelForTokenClassification.from_pretrained("distilbert-base-uncased", num_labels=4)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForTokenClassification: ['vocab_projector.weight', 'vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_projector.bias', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weigh
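
The pipeline outputs further down show generic names such as `LABEL_1` because the model config carries no label names. One option is to build `id2label` and `label2id` mappings and pass them as keyword arguments to `from_pretrained`. The sketch below only builds the mappings. The recorded outputs in this notebook were produced without them, so they still show `LABEL_*`.

```python
NER_labels = ["O", "HOLDER", "TARGET"]

# Index -> name and name -> index, in the order used for the ner_tags.
id2label = dict(enumerate(NER_labels))
label2id = {name: idx for idx, name in id2label.items()}

print(id2label)
print(label2id)

# Assumed usage (not executed here):
# model = AutoModelForTokenClassification.from_pretrained(
#     "distilbert-base-uncased",
#     num_labels=len(NER_labels),
#     id2label=id2label,
#     label2id=label2id,
# )
```

With such a mapping in place, the pipeline would report `HOLDER` and `TARGET` directly, and the evaluation code would have to compare against those names instead of `LABEL_1` and `LABEL_2`.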

**Train the model**

In [66]:
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()

trainer.save_model("./models/ORL")

***** Running training *****
  Num examples = 52337
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 9816
The following columns in the training set don't have a corresponding argument in `DistilBertForTokenClassification.forward` and have been ignored: id, ner_tags, tokens. If id, ner_tags, tokens are not expected by `DistilBertForTokenClassification.forward`,  you can safely ignore this message.
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 0.0783        | 0.065323        |
| 2     | 0.0503        | 0.052069        |
| 3     | 0.0395        | 0.0506          |


Saving model checkpoint to ./results/checkpoint-500
Configuration saved in ./results/checkpoint-500/config.json
Model weights saved in ./results/checkpoint-500/pytorch_model.bin
tokenizer config file saved in ./results/checkpoint-500/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-500/special_tokens_map.json
Saving model checkpoint to ./results/checkpoint-1000
Configuration saved in ./results/checkpoint-1000/config.json
Model weights saved in ./results/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in ./results/checkpoint-1000/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-1000/special_tokens_map.json
Saving model checkpoint to ./results/checkpoint-1500
Configuration saved in ./results/checkpoint-1500/config.json
Model weights saved in ./results/checkpoint-1500/pytorch_model.bin
tokenizer config file saved in ./results/checkpoint-1500/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-1500/special_toke

TrainOutput(global_step=9816, training_loss=0.07171938918719178, metrics={'train_runtime': 21575.4297, 'train_samples_per_second': 7.277, 'train_steps_per_second': 0.455, 'total_flos': 4189931551985328.0, 'train_loss': 0.07171938918719178, 'epoch': 3.0})
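
The step count reported in the log follows directly from the dataset size, batch size and number of epochs. A short worked check, assuming a single device and no gradient accumulation (both as stated in the log):

```python
import math

num_examples = 52337  # "Num examples" in the training log
batch_size = 16       # per-device train batch size
num_epochs = 3

# The last, partially filled batch of each epoch still counts as one step.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, total_steps)
```

The result matches both `Total optimization steps` and `global_step` in the output above.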

In [28]:
# In case no model was loaded up until now.
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Must point to the directory passed to trainer.save_model() above.
model_path = "./models/ORL"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForTokenClassification.from_pretrained(model_path)

In [185]:
from transformers import pipeline

pipe = pipeline(task="token-classification",
                # model=trainer.model, -- in case freshly trained
                model=model,
                tokenizer=tokenizer)

pipe("Er sagt, dass der Präsident dem Volk etwas vorgemacht hat.")

[{'entity': 'LABEL_0',
  'score': 0.9949757,
  'index': 1,
  'word': 'er',
  'start': 0,
  'end': 2},
 {'entity': 'LABEL_0',
  'score': 0.99967515,
  'index': 2,
  'word': 'sa',
  'start': 3,
  'end': 5},
 {'entity': 'LABEL_0',
  'score': 0.9995388,
  'index': 3,
  'word': '##gt',
  'start': 5,
  'end': 7},
 {'entity': 'LABEL_0',
  'score': 0.9998976,
  'index': 4,
  'word': ',',
  'start': 7,
  'end': 8},
 {'entity': 'LABEL_0',
  'score': 0.9997031,
  'index': 5,
  'word': 'das',
  'start': 9,
  'end': 12},
 {'entity': 'LABEL_0',
  'score': 0.99986494,
  'index': 6,
  'word': '##s',
  'start': 12,
  'end': 13},
 {'entity': 'LABEL_0',
  'score': 0.9998369,
  'index': 7,
  'word': 'der',
  'start': 14,
  'end': 17},
 {'entity': 'LABEL_1',
  'score': 0.9823893,
  'index': 8,
  'word': 'pr',
  'start': 18,
  'end': 20},
 {'entity': 'LABEL_0',
  'score': 0.9886596,
  'index': 9,
  'word': '##asi',
  'start': 20,
  'end': 23},
 {'entity': 'LABEL_0',
  'score': 0.9971238,
  'index': 10,
  'w

In [75]:
pipe("Peter hat etwas gegen Fritz!")

[{'entity': 'LABEL_1',
  'score': 0.9838028,
  'index': 1,
  'word': 'peter',
  'start': 0,
  'end': 5},
 {'entity': 'LABEL_0',
  'score': 0.9998976,
  'index': 2,
  'word': 'hat',
  'start': 6,
  'end': 9},
 {'entity': 'LABEL_2',
  'score': 0.66772044,
  'index': 3,
  'word': 'et',
  'start': 10,
  'end': 12},
 {'entity': 'LABEL_0',
  'score': 0.9953389,
  'index': 4,
  'word': '##was',
  'start': 12,
  'end': 15},
 {'entity': 'LABEL_0',
  'score': 0.9999118,
  'index': 5,
  'word': 'ge',
  'start': 16,
  'end': 18},
 {'entity': 'LABEL_0',
  'score': 0.9997837,
  'index': 6,
  'word': '##gen',
  'start': 18,
  'end': 21},
 {'entity': 'LABEL_2',
  'score': 0.6101386,
  'index': 7,
  'word': 'fritz',
  'start': 22,
  'end': 27},
 {'entity': 'LABEL_0',
  'score': 0.9999211,
  'index': 8,
  'word': '!',
  'start': 27,
  'end': 28}]
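
The pipeline returns one prediction per subword, while training only labeled the first subword of each word. A minimal sketch of collapsing the output to word level by keeping the label of each word's first piece follows. `merge_subwords` is a hypothetical helper, and it assumes WordPiece continuation pieces start with `##`, as in the output above.

```python
def merge_subwords(text, predictions):
    """Group WordPiece predictions into words; each word keeps its first piece's label."""
    words = []
    for p in predictions:
        if p["word"].startswith("##") and words:
            # Continuation piece: extend the character span of the current word.
            words[-1]["end"] = p["end"]
        else:
            words.append({"start": p["start"], "end": p["end"], "entity": p["entity"]})
    # Slice the original text so that casing and umlauts are preserved.
    return [(text[w["start"]:w["end"]], w["entity"]) for w in words]

text = "Peter hat etwas gegen Fritz!"
# Copied from the pipeline output above, scores and indices omitted.
predictions = [
    {"entity": "LABEL_1", "word": "peter", "start": 0, "end": 5},
    {"entity": "LABEL_0", "word": "hat", "start": 6, "end": 9},
    {"entity": "LABEL_2", "word": "et", "start": 10, "end": 12},
    {"entity": "LABEL_0", "word": "##was", "start": 12, "end": 15},
    {"entity": "LABEL_0", "word": "ge", "start": 16, "end": 18},
    {"entity": "LABEL_0", "word": "##gen", "start": 18, "end": 21},
    {"entity": "LABEL_2", "word": "fritz", "start": 22, "end": 27},
    {"entity": "LABEL_0", "word": "!", "start": 27, "end": 28},
]

merged = merge_subwords(text, predictions)
print(merged)
```

Predictions on non-first pieces are discarded here because the model never received a training signal for those positions.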

**Evaluation on the test / val set**

In [217]:
# Tokenized dataset, convert numerical labels to labels ready for evaluation.

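def labelize(example):
    # Map label ids to the label strings used by the pipeline. Ignored positions
    # (-100: special tokens and non-first subwords) are counted as LABEL_0 so that
    # the list lines up with the pipeline's per-subword predictions.
    id_to_name = {-100: "LABEL_0", 0: "LABEL_0", 1: "LABEL_1", 2: "LABEL_2"}
    example["labels"] = [id_to_name[label] for label in example["labels"]]
    return example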

def subword_ids_to_strings(example):
    example["subword_tokens"] = tokenizer.convert_ids_to_tokens(example["input_ids"])
    return example

tokenized_dataset["test"] = tokenized_dataset["test"].map(labelize)
tokenized_dataset["test"] = tokenized_dataset["test"].map(subword_ids_to_strings)

In [219]:
# Sanity checks on the tokenized test split.
tokenizer.decode(tokenized_dataset["test"][0]["input_ids"])
len(tokenized_dataset["test"])

small_ds = tokenized_dataset["test"].select(range(8))

len(small_ds)

" ".join(small_ds[0]["tokens"])

') , der 2001 schon im Kantonsrat sass und gegen Flughafen-Erweiterungen kämpfte , sagt : « Nach dem Grounding fühlten wir uns bestätigt , weil wir gegen die Mega-Hub-Phantasien gekämpft hatten .'

In [204]:
from seqeval.metrics import accuracy_score
from seqeval.metrics import classification_report
from seqeval.metrics import f1_score
from tqdm import tqdm

y_total_pred = []
y_total_true = []

for e in tqdm(tokenized_dataset["test"]):
    pred = [x["entity"] for x in pipe(" ".join(e["tokens"]))]
    # print("Prediction length:", len(pred))
    # print("Subword length:", len(e["subword_tokens"][1:-1]))
    # print("Labels length:", len(e["labels"][1:-1]))
    # print("----")
    y_total_pred.append(pred)
    true = e["labels"]
    y_total_true.append(true[1:-1])

print(classification_report(y_total_true, y_total_pred))

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13085/13085 [12:03<00:00, 18.08it/s]


              precision    recall  f1-score   support

      ABEL_0       0.89      0.90      0.90     36771
      ABEL_1       0.90      0.94      0.92     13056
      ABEL_2       0.90      0.91      0.91     12941

   micro avg       0.90      0.91      0.90     62768
   macro avg       0.90      0.92      0.91     62768
weighted avg       0.90      0.91      0.90     62768
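
The row names `ABEL_0` to `ABEL_2` are consistent with seqeval expecting IOB-style tags (`B-HOLDER`, `I-HOLDER`, `O`) and reading the leading `L` of `LABEL_0` as the chunk prefix. In that reading `LABEL_0` is scored as an entity type of its own rather than as the outside class. A minimal sketch of converting the label strings before scoring follows. It assumes every HOLDER and TARGET is a single token, as in the masked data above; `LABEL_TO_IOB` and `to_iob` are hypothetical names.

```python
LABEL_TO_IOB = {
    "LABEL_0": "O",
    "LABEL_1": "B-HOLDER",
    "LABEL_2": "B-TARGET",
}

def to_iob(sequences):
    """Convert lists of LABEL_* strings into IOB-style tags."""
    return [[LABEL_TO_IOB[label] for label in seq] for seq in sequences]

y_pred = [["LABEL_1", "LABEL_0", "LABEL_2", "LABEL_0"]]
iob_pred = to_iob(y_pred)
print(iob_pred)

# Assumed usage with the lists built in the evaluation loop:
# print(classification_report(to_iob(y_total_true), to_iob(y_total_pred)))
```

After the conversion the report would be expected to contain only `HOLDER` and `TARGET` rows, so its averages would no longer be dominated by the outside class.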



In [227]:
pred

['LABEL_0',
 'LABEL_1',
 'LABEL_0',
 'LABEL_0',
 'LABEL_1',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_2',
 'LABEL_0']