## Opinion Role Labeling

### Using the NER approach

**Requirements**
```
pip install transformers datasets evaluate seqeval
```

**Load a dataset in the following format** (each masked sentence is followed by its unmasked original)
```
[CLS] Das immer wieder mit dem Separatismus liebäugelnde französischsprachige Quebec wollte sich mit dem Verfassungskompromiss , TARGET die neun vorwiegend englischsprachigen HOLDER schliesslich zugestimmt hatten , nicht abfinden . [SEP] 
[CLS] Das immer wieder mit dem Separatismus liebäugelnde französischsprachige Quebec wollte sich mit dem Verfassungskompromiss , dem die neun vorwiegend englischsprachigen Provinzen schliesslich zugestimmt hatten , nicht abfinden . [SEP] 
[CLS] Von den beiden grossen Gewerkschaften hatte die HOLDER ( Unión General del Trabajo ) dieses TARGET unterstützt , während die Comisiones Obreras ( CCOO ) es im Vorfeld abgelehnt hatten , sich jetzt aber mit dem Resultat abfinden . [SEP] 
[CLS] Von den beiden grossen Gewerkschaften hatte die UGT ( Unión General del Trabajo ) dieses Vorhaben unterstützt , während die Comisiones Obreras ( CCOO ) es im Vorfeld abgelehnt hatten , sich jetzt aber mit dem Resultat abfinden . [SEP] 
[CLS] Der HOLDER hat keine TARGET verletzt . » Petkovic lehnt laut italienischen Medien auch ein Angebot ab , mit 30 Prozent der Summe abgefunden zu werden . [SEP] 
[CLS] Der Trainer hat keine Regeln verletzt . » Petkovic lehnt laut italienischen Medien auch ein Angebot ab , mit 30 Prozent der Summe abgefunden zu werden . [SEP] 
```
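As a rough illustration of how one such line pair could be produced, here is a minimal sketch. The helper `make_pair` and its index arguments are hypothetical and not part of the original notebook; it assumes that holder and target are single tokens, as in the examples above.

```python
def wrap(tokens):
    """Surround a token list with the [CLS] / [SEP] markers used in the file."""
    return " ".join(["[CLS]", *tokens, "[SEP]"])


def make_pair(tokens, holder_idx, target_idx):
    """Return (masked_line, real_line) for one sentence (hypothetical helper)."""
    masked = list(tokens)
    masked[holder_idx] = "HOLDER"  # replace the opinion holder
    masked[target_idx] = "TARGET"  # replace the opinion target
    return wrap(masked), wrap(tokens)


tokens = ["Der", "Trainer", "hat", "keine", "Regeln", "verletzt", "."]
masked_line, real_line = make_pair(tokens, holder_idx=1, target_idx=4)
print(masked_line)
print(real_line)
```

The masked line is written first and the real line second, which is the order the loading code below relies on.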

In [38]:
# CONFIG VARIABLES

DATASET_PATH="data/silver_standard/holder_target.txt"

GENERATE_DATASET=False

BASE_MODEL_NAME="distilbert-base-german-cased"

In [1]:
sents = []

# Read one sentence per line; masked and real sentences alternate.
with open(DATASET_PATH, "r", encoding="utf-8") as f:
    for line in f:
        sents.append(line.strip())

# Der Minister findet die Debatte langweilig.
real_sents = sents[1::2]

# Der HOLDER findet die TARGET langweilig.
masked_sents = sents[0::2]

assert len(real_sents) == len(masked_sents), "The number of masked and real sentences differs!"

print(f"Total rows in the data: {len(real_sents) + len(masked_sents)}")

Total rows in the data: 130844


In [26]:
NER_labels = ["O", "HOLDER", "TARGET"]

def align_sentences(real_sentences, masked_sentences, NER_labels):
    """Given real sentences and their masked counterparts, each with word-level tokens
    delimited by a single space, create one NER tag per token.
    """
    dataset = []
    for i, (real_sentence, masked_sentence) in enumerate(zip(real_sentences, masked_sentences)):
        ner_tags = []
        real_sentence_split = real_sentence.split(" ")
        masked_sentence_split = masked_sentence.split(" ")
        assert len(real_sentence_split) == len(masked_sentence_split), "Misalignment of length of tokens in sentence."
        for real_token, masked_token in zip(real_sentence_split, masked_sentence_split):
            if real_token == masked_token:
                ner_tags.append(0)
            elif masked_token in ("HOLDER", "TARGET"):
                ner_tags.append(NER_labels.index(masked_token))
            else:
                # The tokens differ but the mask is not a known role: fail loudly
                # rather than silently producing fewer tags than tokens.
                raise ValueError(f"Unexpected masked token {masked_token!r} in sentence {i}")
        dataset.append({
            "id": i,
            # Drop the [CLS] and [SEP] markers from the sentences,
            # since we will pass the sentences through the tokenizer again.
            "ner_tags": ner_tags[1:-1],
            "tokens": real_sentence_split[1:-1],
        })
    return dataset

aligned_sents = align_sentences(real_sents, masked_sents, NER_labels)

print(aligned_sents[:10])

[{'id': 0, 'ner_tags': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0], 'tokens': ['Das', 'immer', 'wieder', 'mit', 'dem', 'Separatismus', 'liebäugelnde', 'französischsprachige', 'Quebec', 'wollte', 'sich', 'mit', 'dem', 'Verfassungskompromiss', ',', 'dem', 'die', 'neun', 'vorwiegend', 'englischsprachigen', 'Provinzen', 'schliesslich', 'zugestimmt', 'hatten', ',', 'nicht', 'abfinden', '.']}, {'id': 1, 'ner_tags': [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'tokens': ['Von', 'den', 'beiden', 'grossen', 'Gewerkschaften', 'hatte', 'die', 'UGT', '(', 'Unión', 'General', 'del', 'Trabajo', ')', 'dieses', 'Vorhaben', 'unterstützt', ',', 'während', 'die', 'Comisiones', 'Obreras', '(', 'CCOO', ')', 'es', 'im', 'Vorfeld', 'abgelehnt', 'hatten', ',', 'sich', 'jetzt', 'aber', 'mit', 'dem', 'Resultat', 'abfinden', '.']}, {'id': 2, 'ner_tags': [0, 1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

In [46]:
# Make a huggingface dataset out of the record-based dataset
from datasets import Dataset, DatasetDict, load_from_disk

if GENERATE_DATASET:
    dataset = Dataset.from_list(aligned_sents)

    # Shuffle once with a fixed seed so that the splits are reproducible.
    dataset = dataset.shuffle(seed=42)

    # Hold out 20% of the data for testing and validation.
    train_test_dataset = dataset.train_test_split(test_size=0.2, shuffle=False)

    # Split the held-out 20% into 40% test and 60% validation
    # (roughly 8% and 12% of the full dataset).
    test_valid = train_test_dataset['test'].train_test_split(test_size=0.4, shuffle=False)

    # Gather all three splits in a single DatasetDict.
    dataset = DatasetDict({
        'train': train_test_dataset['train'],
        'test': test_valid['test'],
        'valid': test_valid['train']})

    dataset.save_to_disk("./data/split_dataset.hf")
else:
    dataset = load_from_disk("./data/split_dataset.hf")

print(dataset)

DatasetDict({
    train: Dataset({
        features: ['id', 'ner_tags', 'tokens'],
        num_rows: 52337
    })
    test: Dataset({
        features: ['id', 'ner_tags', 'tokens'],
        num_rows: 5234
    })
    valid: Dataset({
        features: ['id', 'ner_tags', 'tokens'],
        num_rows: 7851
    })
})
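The row counts printed above can be turned into split proportions with a few lines of arithmetic. This is only a sanity check on the numbers shown in the output; nothing here touches the dataset itself.

```python
# Row counts copied from the DatasetDict printed above.
rows = {"train": 52337, "test": 5234, "valid": 7851}

total = sum(rows.values())
shares = {name: round(count / total, 3) for name, count in rows.items()}

print(total)   # number of sentence pairs
print(shares)  # share of each split in the full dataset
```

The total is half of the 130844 lines read from the file, which is what one would expect when every example occupies two lines.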


**Use DistilBERT tokenizer and embeddings**

In [39]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_NAME)

In [41]:
# tokenize input
tokenized_input = tokenizer(dataset["train"][0]["tokens"], truncation=True, is_split_into_words=True)
print(tokenized_input)

# output subwords
print(tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"]))

{'input_ids': [102, 3098, 15685, 307, 153, 128, 18338, 12845, 17157, 4278, 30901, 222, 1132, 143, 136, 5069, 199, 2861, 12898, 19422, 125, 21221, 3524, 566, 103], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
['[CLS]', 'Damit', 'greift', 'sie', 'in', 'die', 'Gesetzgebung', '##sko', '##mp', '##eten', '##z', 'des', 'Bundes', 'ein', 'und', 'verletzt', 'das', 'Gleich', '##behandlung', '##sprinzip', 'der', 'Bundesver', '##fassung', '.', '[SEP]']


In [42]:
def tokenize_and_align_labels(examples):

    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)

    labels = []
    
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)  # Map tokens to their respective word.
        previous_word_idx = None
        label_ids = []

        for word_idx in word_ids:  # Set the special tokens to -100.
            if word_idx is None:
                label_ids.append(-100)
            elif word_idx != previous_word_idx:  # Only label the first token of a given word.
                label_ids.append(label[word_idx])
            else:
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)

    tokenized_inputs["labels"] = labels

    return tokenized_inputs

tokenized_dataset = dataset.map(tokenize_and_align_labels, batched=True)

tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'ner_tags', 'tokens', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 52337
    })
    test: Dataset({
        features: ['id', 'ner_tags', 'tokens', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 5234
    })
    valid: Dataset({
        features: ['id', 'ner_tags', 'tokens', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 7851
    })
})

In [43]:
# check that the conversion worked.
print(tokenized_dataset["train"][0]["labels"])
print(tokenizer.convert_ids_to_tokens(tokenized_dataset["train"][0]["input_ids"]))

[-100, 0, 0, 1, 0, 0, 0, -100, -100, -100, -100, 0, 0, 0, 0, 0, 0, 2, -100, -100, 0, 0, -100, 0, -100]
['[CLS]', 'Damit', 'greift', 'sie', 'in', 'die', 'Gesetzgebung', '##sko', '##mp', '##eten', '##z', 'des', 'Bundes', 'ein', 'und', 'verletzt', 'das', 'Gleich', '##behandlung', '##sprinzip', 'der', 'Bundesver', '##fassung', '.', '[SEP]']
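To make the alignment rule easier to follow without loading a tokenizer, the same logic can be sketched on a hand-written `word_ids` list. The function name `align_labels` and the toy input are assumptions for illustration; the rule itself (label only the first subword of each word, mask everything else with -100) mirrors `tokenize_and_align_labels` above.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Spread word-level labels over subword positions.

    Special tokens (word id None) and non-initial subwords receive
    ignore_index, so that the loss skips them.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None or word_id == previous:
            aligned.append(ignore_index)
        else:
            aligned.append(word_labels[word_id])
        previous = word_id
    return aligned


# Toy input: [CLS], word 0, word 1 split into two pieces, word 2, [SEP]
word_ids = [None, 0, 1, 1, 2, None]
word_labels = [1, 0, 2]  # HOLDER, O, TARGET
print(align_labels(word_ids, word_labels))
```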


**Define a DataCollator (for efficient padding of the tokens)**

In [44]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
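Conceptually, the collator pads every sequence in a batch to the length of the longest one and pads the labels with -100 so that padded positions are ignored by the loss. The following pure-Python sketch mimics that behaviour for illustration only: `pad_batch` is a hypothetical helper, and the padding id of 0 is an assumption (the real collator takes it from the tokenizer and returns tensors rather than lists).

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Pad input_ids, attention_mask and labels to the longest sequence in the batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        length = len(f["input_ids"])
        n_pad = max_len - length
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * length + [0] * n_pad)
        batch["labels"].append(f["labels"] + [label_pad_id] * n_pad)
    return batch


features = [
    {"input_ids": [102, 5, 103], "labels": [-100, 1, -100]},
    {"input_ids": [102, 5, 6, 7, 103], "labels": [-100, 0, 2, -100, -100]},
]
batch = pad_batch(features)
print(batch["labels"][0])
```

Padding per batch instead of to a global maximum length keeps short batches small, which is why the tokenizer call above does not pad.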

**Download the model**

In [45]:
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer

# Three labels (O, HOLDER, TARGET). -100 only marks positions that the loss ignores; it is not a class.
model = AutoModelForTokenClassification.from_pretrained(BASE_MODEL_NAME, num_labels=len(NER_labels))

Some weights of the model checkpoint at distilbert-base-german-cased were not used when initializing DistilBertForTokenClassification: ['vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias', 'vocab_transform.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probab
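
Because no label names are stored with the model, the pipeline outputs further down show generic names such as `LABEL_1`. A minimal sketch of the mapping back to role names, assuming the label order of `NER_labels` defined earlier; `readable` is a hypothetical helper and not part of the notebook.

```python
NER_labels = ["O", "HOLDER", "TARGET"]

# Index <-> name mappings derived from the label list.
id2label = dict(enumerate(NER_labels))
label2id = {label: idx for idx, label in id2label.items()}


def readable(entity):
    """Translate a generic pipeline label such as 'LABEL_1' into its role name."""
    return id2label[int(entity.rsplit("_", 1)[1])]


print(readable("LABEL_1"))
print(readable("LABEL_2"))

# Alternatively, the two dictionaries can be passed as id2label=... and
# label2id=... to from_pretrained, so that a newly trained model reports
# the role names directly.
```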

**Train the model**

In [47]:
training_args = TrainingArguments(
    output_dir="./data/results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["valid"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()

trainer.save_model("./data/trained_model")

The following columns in the training set don't have a corresponding argument in `DistilBertForTokenClassification.forward` and have been ignored: tokens, ner_tags, id. If tokens, ner_tags, id are not expected by `DistilBertForTokenClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 52337
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 9816
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss


KeyboardInterrupt: 

In [11]:
# In case no model was loaded up until now.
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_path = "./data/trained_model"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForTokenClassification.from_pretrained(model_path)

In [12]:
from transformers import pipeline

pipe = pipeline(task="token-classification",
                # model=trainer.model, -- in case freshly trained
                model=model,
                tokenizer=tokenizer)

pipe("Er sagt, dass der Präsident dem Volk etwas vorgemacht hat.")

[{'entity': 'LABEL_0',
  'score': 0.9949757,
  'index': 1,
  'word': 'er',
  'start': 0,
  'end': 2},
 {'entity': 'LABEL_0',
  'score': 0.99967515,
  'index': 2,
  'word': 'sa',
  'start': 3,
  'end': 5},
 {'entity': 'LABEL_0',
  'score': 0.9995388,
  'index': 3,
  'word': '##gt',
  'start': 5,
  'end': 7},
 {'entity': 'LABEL_0',
  'score': 0.9998976,
  'index': 4,
  'word': ',',
  'start': 7,
  'end': 8},
 {'entity': 'LABEL_0',
  'score': 0.9997031,
  'index': 5,
  'word': 'das',
  'start': 9,
  'end': 12},
 {'entity': 'LABEL_0',
  'score': 0.99986494,
  'index': 6,
  'word': '##s',
  'start': 12,
  'end': 13},
 {'entity': 'LABEL_0',
  'score': 0.9998369,
  'index': 7,
  'word': 'der',
  'start': 14,
  'end': 17},
 {'entity': 'LABEL_1',
  'score': 0.9823893,
  'index': 8,
  'word': 'pr',
  'start': 18,
  'end': 20},
 {'entity': 'LABEL_0',
  'score': 0.9886596,
  'index': 9,
  'word': '##asi',
  'start': 20,
  'end': 23},
 {'entity': 'LABEL_0',
  'score': 0.9971238,
  'index': 10,
  'w

In [13]:
pipe("Peter hat etwas gegen Fritz!")

[{'entity': 'LABEL_1',
  'score': 0.9838028,
  'index': 1,
  'word': 'peter',
  'start': 0,
  'end': 5},
 {'entity': 'LABEL_0',
  'score': 0.9998976,
  'index': 2,
  'word': 'hat',
  'start': 6,
  'end': 9},
 {'entity': 'LABEL_2',
  'score': 0.66772044,
  'index': 3,
  'word': 'et',
  'start': 10,
  'end': 12},
 {'entity': 'LABEL_0',
  'score': 0.9953389,
  'index': 4,
  'word': '##was',
  'start': 12,
  'end': 15},
 {'entity': 'LABEL_0',
  'score': 0.9999118,
  'index': 5,
  'word': 'ge',
  'start': 16,
  'end': 18},
 {'entity': 'LABEL_0',
  'score': 0.9997837,
  'index': 6,
  'word': '##gen',
  'start': 18,
  'end': 21},
 {'entity': 'LABEL_2',
  'score': 0.6101386,
  'index': 7,
  'word': 'fritz',
  'start': 22,
  'end': 27},
 {'entity': 'LABEL_0',
  'score': 0.9999211,
  'index': 8,
  'word': '!',
  'start': 27,
  'end': 28}]
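
The pipeline reports one prediction per subword, whereas training only labelled the first subword of each word. One way to read the output at word level is to glue the `##` pieces back together and keep the label of the first piece. The sketch below does this on the predictions shown above (scores and offsets omitted); `merge_subwords` is a hypothetical helper. The pipeline's own `aggregation_strategy` argument may be an alternative worth trying.

```python
def merge_subwords(predictions):
    """Collapse WordPiece-level predictions into (word, label) pairs.

    A piece starting with '##' continues the previous word; the word keeps
    the label predicted for its first piece.
    """
    words = []
    for item in predictions:
        piece = item["word"]
        if piece.startswith("##") and words:
            word, label = words[-1]
            words[-1] = (word + piece[2:], label)
        else:
            words.append((piece, item["entity"]))
    return words


predictions = [
    {"word": "peter", "entity": "LABEL_1"},
    {"word": "hat", "entity": "LABEL_0"},
    {"word": "et", "entity": "LABEL_2"},
    {"word": "##was", "entity": "LABEL_0"},
    {"word": "ge", "entity": "LABEL_0"},
    {"word": "##gen", "entity": "LABEL_0"},
    {"word": "fritz", "entity": "LABEL_2"},
    {"word": "!", "entity": "LABEL_0"},
]
for word, label in merge_subwords(predictions):
    print(word, label)
```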

**Evaluation on the test / val set**

In [14]:
# Convert the numerical labels of the tokenized dataset to the string labels that the pipeline emits.

def labelize(example):
    # -100 (special tokens and non-initial subwords) is mapped to LABEL_0.
    id_to_name = {-100: "LABEL_0", 0: "LABEL_0", 1: "LABEL_1", 2: "LABEL_2"}
    example["labels"] = [id_to_name[label] for label in example["labels"]]
    return example

def subword_ids_to_strings(example):
    example["subword_tokens"] = tokenizer.convert_ids_to_tokens(example["input_ids"])
    return example

tokenized_dataset["test"] = tokenized_dataset["test"].map(labelize)
tokenized_dataset["test"] = tokenized_dataset["test"].map(subword_ids_to_strings)

In [15]:
# Sanity-check the test split, then take a small subset for quick experiments.
tokenizer.decode(tokenized_dataset["test"][0]["input_ids"])
len(tokenized_dataset["test"])

small_ds = tokenized_dataset["test"].select(range(8))

len(small_ds)

" ".join(small_ds[0]["tokens"])

') , der 2001 schon im Kantonsrat sass und gegen Flughafen-Erweiterungen kämpfte , sagt : « Nach dem Grounding fühlten wir uns bestätigt , weil wir gegen die Mega-Hub-Phantasien gekämpft hatten .'

In [16]:
from seqeval.metrics import accuracy_score
from seqeval.metrics import classification_report
from seqeval.metrics import f1_score
from tqdm import tqdm

y_total_pred = []
y_total_true = []

for e in tqdm(tokenized_dataset["test"]):
    pred = [x["entity"] for x in pipe(" ".join(e["tokens"]))]
    # print("Prediction length:", len(pred))
    # print("Subword length:", len(e["subword_tokens"][1:-1]))
    # print("Labels length:", len(e["labels"][1:-1]))
    # print("----")
    y_total_pred.append(pred)
    true = e["labels"]
    y_total_true.append(true[1:-1])

print(classification_report(y_total_true, y_total_pred))

  1%|█▌        | 127/13085 [00:05<09:09, 23.58it/s]


KeyboardInterrupt: 
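
Since the run above was interrupted, no scores are available. As a lightweight alternative for quick checks, token-level precision, recall and F1 can be computed directly on integer labels while skipping the -100 positions. This is a sketch under stated assumptions, not the notebook's evaluation: `token_prf` is a hypothetical helper, gold and predicted sequences are assumed to have equal length, and label 0 (`O`) is left out of the report.

```python
from collections import Counter


def token_prf(gold_seqs, pred_seqs, ignore_index=-100, skip_label=0):
    """Per-label (precision, recall, F1) over positions whose gold label is not ignored."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(gold_seqs, pred_seqs):
        for g, p in zip(gold, pred):
            if g == ignore_index:
                continue  # special token or non-initial subword
            if g == p:
                tp[g] += 1
            else:
                fp[p] += 1
                fn[g] += 1
    report = {}
    for label in sorted(set(tp) | set(fp) | set(fn)):
        if label == skip_label:
            continue
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = (precision, recall, f1)
    return report


gold = [[-100, 0, 1, -100, 2, -100]]
pred = [[0, 0, 1, 1, 0, 0]]
print(token_prf(gold, pred))
```

In this toy example the holder (label 1) is found and the target (label 2) is missed, so the two labels end up at opposite ends of the scale.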

In [48]:
pred

['LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_1',
 'LABEL_2',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0',
 'LABEL_0']