# Relation extraction with Entity Markers

We will learn how to implement a relation extractor using the [RoBERTa](https://arxiv.org/abs/1907.11692) architecture and the Entity Marker context representation proposed by [Soares et al. 2019](https://www.aclweb.org/anthology/P19-1279.pdf). We will use the ACE-2005 dataset to train and evaluate our model. As you will notice, you will need to understand code of some complexity, in which we extend a transformer-based encoder for use in a relation extraction task.

You don't need to write any code in this notebook, but it is worth spending some time understanding the modules that we use (`ACERelationClassificationDataset` and `RobertaForTupleClassification`).

The goals of this lab are 1) to learn to modify the tokenizer to include special tokens (entity markers in this case), 2) to learn to extend a neural architecture on top of the pretrained models provided by Huggingface, and 3) to review advanced concepts of deep learning and relation extraction.
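
To make the entity-marker idea concrete, here is a minimal sketch of how the four marker tokens can be wrapped around two entity mentions in a tokenized sentence. The function name `add_entity_markers` and the span convention (start inclusive, end exclusive) are assumptions for illustration; this is not the code used by `ACERelationClassificationDataset`, which reads data that has already been preprocessed.

```python
from typing import List, Tuple


def add_entity_markers(
    tokens: List[str],
    e1_span: Tuple[int, int],
    e2_span: Tuple[int, int],
) -> List[str]:
    """Wrap two non-overlapping entity spans with marker tokens.

    Spans are (start, end) token offsets with an exclusive end
    (an assumed convention, chosen for this sketch).
    """
    marked = []
    for i in range(len(tokens) + 1):
        # Close a span before opening another one at the same position.
        if i == e1_span[1]:
            marked.append("<e1e>")
        if i == e2_span[1]:
            marked.append("<e2e>")
        if i == e1_span[0]:
            marked.append("<e1s>")
        if i == e2_span[0]:
            marked.append("<e2s>")
        if i < len(tokens):
            marked.append(tokens[i])
    return marked


sentence = ["Bill", "Gates", "founded", "Microsoft", "."]
print(" ".join(add_entity_markers(sentence, (0, 2), (3, 4))))
# <e1s> Bill Gates <e1e> founded <e2s> Microsoft <e2e> .
```

The model then sees the markers as ordinary input tokens, which is why the tokenizer has to know about them.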
    

In [None]:
# Mount Drive files
from google.colab import drive
drive.mount('/content/drive')

## Set up

We first need to install the transformers library from Huggingface, which is done directly with pip. After the installation, we have to set up the environment to use our custom implementation of the entity markers, which is built on top of transformers. If you want to inspect the code, these are the most relevant parts you need to understand:

- Module for managing datasets: __dataset.py__
  + We will make use of the `ACERelationClassificationDataset` class, which reads data from a preprocessed JSON file.
- Implementation of the model: __models.py__
   + We will use `RobertaForTupleClassification` for relation extraction with entity markers.

The code for training and testing is implemented in the notebook.

In [None]:
!pip install transformers

To make all the imports work, we need to add the path that contains the modules for entity-marker-based relation extraction. Make sure you have all the modules in your Drive, and check that your path to the Python modules included in `entity_markers` is correct.

In [None]:
import sys
em_dir = 'drive/MyDrive/Colab Notebooks/nlp-app-II/labs/entity_marker'
sys.path.append(em_dir)
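
If the imports in the next cell fail, a quick check of the directory contents can tell you whether the path is wrong. The helper below is a small sketch; `missing_modules` is a hypothetical name, and the file list is simply the files this notebook mentions (`dataset.py`, `models.py`, `config_re.json`).

```python
import os
from typing import Iterable, List


def missing_modules(directory: str, filenames: Iterable[str]) -> List[str]:
    """Return the entries of `filenames` that are not files inside `directory`."""
    return [
        name for name in filenames
        if not os.path.isfile(os.path.join(directory, name))
    ]


# Path copied from the cell above; adjust it to your own Drive layout.
em_dir = 'drive/MyDrive/Colab Notebooks/nlp-app-II/labs/entity_marker'
missing = missing_modules(em_dir, ["dataset.py", "models.py", "config_re.json"])
if missing:
    print(f"Missing in {em_dir}: {missing}")
else:
    print("All expected files were found.")
```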

In [None]:
import torch
from torch.utils.data import DataLoader
import argparse
import json
import os

#from tqdm import tqdm
from tqdm.notebook import tqdm

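from transformers import (
    AutoConfig,
    AutoTokenizer,
    set_seed,
    get_linear_schedule_with_warmup
)
# The AdamW class in transformers is deprecated (and missing from recent releases),
# so we import the PyTorch implementation instead.
from torch.optim import AdamW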

from models import RobertaForTupleClassification
from dataset import ACERelationClassificationDataset

In [None]:
def load_defaults(opt: argparse.Namespace) -> argparse.Namespace:
    # Model arguments
    opt.model_name_or_path = getattr(opt, 'model_name_or_path', 'xlm-roberta-base')
    opt.use_fast = getattr(opt, 'use_fast', False)
    opt.fp16 = getattr(opt, 'fp16', False) if hasattr(torch.cuda, 'amp') else False

    # Training arguments
    opt.seed = getattr(opt, 'seed', 0)
    opt.cache_dir = getattr(opt, 'cache_dir', None)
    opt.save_dir = getattr(opt, 'save_dir', 'output/')
    opt.batch_size = getattr(opt, 'batch_size', 16)
    opt.do_eval = getattr(opt, 'do_eval', False)
    opt.do_test = getattr(opt, 'do_test', False)
    opt.learning_rate = getattr(opt, 'learning_rate', 5e-5)
    opt.epochs = getattr(opt, 'epochs', 3)
    opt.gradient_accumulation_steps = getattr(opt, 'gradient_accumulation_steps', 1)
    opt.num_warmup_steps = getattr(opt, 'num_warmup_steps', 0)
    opt.max_grad_norm = getattr(opt, 'max_grad_norm', 5.0)

    # Data arguments
    opt.negative_class = getattr(opt, 'negative_class', False)
    opt.add_trigger_info = getattr(opt, 'add_trigger_info', False)
    opt.add_clean_context = getattr(opt, 'add_clean_context', False)

    return opt

We need to load a configuration file that specifies the main settings used to train and evaluate the model (data paths, hyperparameters, etc.). Open the config file from the notebook so you can edit it directly.

In [None]:
config_file = em_dir + "/config_re.json"
with open(config_file) as f:
    opt = json.load(f)
    opt = argparse.Namespace(**opt)
opt = load_defaults(opt)
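
The contents of `config_re.json` are not shown in this notebook. As a rough sketch, it is a flat JSON object whose keys become attributes of `opt`: the notebook later reads `opt.train_data` and `opt.eval_data`, and any key that is absent falls back to the value set in `load_defaults`. The paths and values below are hypothetical placeholders, not the contents of the real file.

```python
import argparse
import json

# Hypothetical config contents; the real config_re.json may use other keys and values.
example_config = """
{
    "train_data": "path/to/train.json",
    "eval_data": "path/to/dev.json",
    "model_name_or_path": "roberta-base",
    "do_eval": true,
    "epochs": 5
}
"""

opt_example = argparse.Namespace(**json.loads(example_config))

# Same pattern as load_defaults: keep the value if present, otherwise use the default.
opt_example.batch_size = getattr(opt_example, "batch_size", 16)
opt_example.epochs = getattr(opt_example, "epochs", 3)

print(opt_example.batch_size, opt_example.epochs)  # 16 5
```

Because `getattr` only fills in missing attributes, anything you set in the JSON file takes precedence over the defaults.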

## Prepare logs, load tokenizers and the pretrained model
In this section we will load the pretrained model and prepare the tokenizer to include the entity markers as special tokens. We want the tokenizer to treat these markers as single tokens rather than splitting them into subwords, and we want them to encode the contextual representation of the entity pair that exhibits a relation (similar to what we do with [CLS] and [SEP]).



In [None]:
# Log
train_log = vars(opt).copy()
train_log['epochs'] = []

# Set the seed
set_seed(opt.seed)

# Make the output dir if not exists
os.makedirs(opt.save_dir, exist_ok=True)

# Load the configuration
num_labels = len(ACERelationClassificationDataset.label2id)
label2id = ACERelationClassificationDataset.label2id
id2label = {int(value): key for key, value in label2id.items()}

config = AutoConfig.from_pretrained(
    opt.model_name_or_path,
    num_labels=num_labels,
    id2label=id2label,
    label2id=label2id,
    cache_dir=opt.cache_dir
)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    opt.model_name_or_path,
    cache_dir=opt.cache_dir,
    use_fast=opt.use_fast
)
# Include entity markers as tokens in the tokenizer
new_tokens = ['<e1s>', '<e1e>', '<e2s>', '<e2e>']
tokenizer.add_tokens(new_tokens)
tokenizer.save_pretrained(opt.save_dir)

# Update the config with new information
config.first_marker_token_id = tokenizer.convert_tokens_to_ids(['<e1s>', '<e1e>'])
config.second_marker_token_id = tokenizer.convert_tokens_to_ids( ['<e2s>', '<e2e>'])

# Load the model
model = RobertaForTupleClassification.from_pretrained(
    opt.model_name_or_path,
    config=config,
    cache_dir=opt.cache_dir
)

# Resize the vocab
config.vocab_size = len(tokenizer)
model.resize_token_embeddings(config.vocab_size)

model.save_pretrained(opt.save_dir)
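
The effect of adding the markers and resizing the embeddings can be pictured with a toy example. This is only a sketch with a hand-made vocabulary and a list-of-lists embedding table (both assumptions for illustration). It mimics conceptually what `tokenizer.add_tokens` and `model.resize_token_embeddings` do, not how Huggingface implements them, and the random initialisation of the new rows is a simplification.

```python
import random
from typing import Dict, List


def toy_add_tokens(vocab: Dict[str, int], new_tokens: List[str]) -> int:
    """Append unseen tokens to the vocabulary and return how many were added."""
    added = 0
    for token in new_tokens:
        if token not in vocab:
            vocab[token] = len(vocab)  # new ids are appended at the end
            added += 1
    return added


def toy_resize_embeddings(table: List[List[float]], new_size: int, dim: int, seed: int = 0) -> None:
    """Grow the embedding table in place, adding freshly initialised rows."""
    rng = random.Random(seed)
    while len(table) < new_size:
        table.append([rng.gauss(0.0, 0.02) for _ in range(dim)])


toy_vocab = {"<s>": 0, "</s>": 1, "Bill": 2, "Gates": 3}
toy_embeddings = [[0.0] * 4 for _ in toy_vocab]

toy_add_tokens(toy_vocab, ['<e1s>', '<e1e>', '<e2s>', '<e2e>'])
toy_resize_embeddings(toy_embeddings, len(toy_vocab), dim=4)

print(toy_vocab['<e1s>'], len(toy_embeddings))  # 4 8
```

The new rows carry no pretrained knowledge, so the marker embeddings are learned during fine-tuning.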


### Exercise 1 
- Inspect the code and indicate the lines where we update the tokenizer with entity markers.

### Exercise 2

- Inspect the code for the class `RobertaForTupleClassification` in `models.py`. Can you identify how the entity-marker approach is implemented?
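
As a hint for what to look for, one variant described by Soares et al. builds the relation representation by concatenating the encoder's hidden states at the two entity start markers and feeding the result to a classifier. The sketch below illustrates that idea with plain Python lists; the function name, toy ids and toy vectors are assumptions, and `RobertaForTupleClassification` may pool the markers differently.

```python
from typing import List, Sequence


def entity_start_representation(
    input_ids: Sequence[int],
    hidden_states: Sequence[Sequence[float]],
    e1_start_id: int,
    e2_start_id: int,
) -> List[float]:
    """Concatenate the hidden states found at the two entity start markers."""
    e1_pos = list(input_ids).index(e1_start_id)  # position of <e1s>
    e2_pos = list(input_ids).index(e2_start_id)  # position of <e2s>
    return list(hidden_states[e1_pos]) + list(hidden_states[e2_pos])


# Toy sequence: ids 100 and 102 stand in for <e1s> and <e2s>.
toy_ids = [0, 100, 7, 101, 8, 102, 9, 103, 2]
# One 2-dimensional "hidden state" per token, just to make positions visible.
toy_hidden = [[float(i), float(i) + 0.5] for i in range(len(toy_ids))]

print(entity_start_representation(toy_ids, toy_hidden, 100, 102))
# [1.0, 1.5, 5.0, 5.5]
```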

## Load ACE data for RE

In [None]:
# Load the data
train_dataset = ACERelationClassificationDataset(
    file_path=opt.train_data,
    tokenizer=tokenizer,
    from_preprocessed=True
)

train_dataloader = DataLoader(
    train_dataset, collate_fn=train_dataset.collate_fn,
    batch_size=opt.batch_size, shuffle=True
)

eval_dataset = ACERelationClassificationDataset(
    file_path=opt.eval_data,
    tokenizer=tokenizer,
    from_preprocessed=True
)

eval_dataloader = DataLoader(
    eval_dataset, collate_fn=eval_dataset.collate_fn,
    batch_size=opt.batch_size
)

### Exercise 3
Inspect the training data. Can you see the entity markers? 

In [None]:
train_dataset[0]

## Prepare the training

In [None]:
no_decay = ["bias", "LayerNorm.weight"]
optimizer_grouped_parameters = [
    {
        "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
        "weight_decay": getattr(opt, 'weight_decay', 0.01)
    },
    {
        "params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
        "weight_decay": 0.0,
    },
]
optimizer = AdamW(
    optimizer_grouped_parameters,
    lr=opt.learning_rate
)

num_training_steps = len(train_dataset) * opt.epochs // (opt.batch_size * opt.gradient_accumulation_steps)
lr_scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=opt.num_warmup_steps,
    num_training_steps=num_training_steps
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model.to(device)

if opt.fp16:
    scaler = torch.cuda.amp.GradScaler()

# Training loop
print("***** Running training *****")
print(f"  FP16 enabled: {opt.fp16}")
print(f"  Num examples = {len(train_dataset)}")
print(f"  Num Epochs = {opt.epochs}")
print(f"  Train batch size = {opt.batch_size}")
print(f"  Gradient Accumulation steps = {opt.gradient_accumulation_steps}")
print(f"  Total optimization steps = {num_training_steps}")
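
The scheduler created in this cell scales the learning rate linearly from 0 up to `opt.learning_rate` during warmup, and then linearly back down to 0 at `num_training_steps`. The function below re-implements that multiplier for illustration only (the name `linear_schedule_factor` and the step counts in the loop are assumptions); in the notebook the schedule comes from `get_linear_schedule_with_warmup`.

```python
def linear_schedule_factor(step: int, num_warmup_steps: int, num_training_steps: int) -> float:
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Warmup phase: grow linearly from 0 to 1.
        return step / max(1, num_warmup_steps)
    # Decay phase: shrink linearly from 1 to 0.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 5e-5
for step in (0, 50, 100, 550, 1000):
    factor = linear_schedule_factor(step, num_warmup_steps=100, num_training_steps=1000)
    print(step, base_lr * factor)
```

With the default `num_warmup_steps` of 0 there is no warmup phase, and the learning rate simply decays from its initial value.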

## Train

In [None]:
    best_eval_score = 0.0
        
    for epoch in range(getattr(opt, 'epochs', 3)):
        # Prepare the model for epoch
        model.zero_grad()
        model.train()

        train_log['epochs'].append({})

        step, tp, p, r, total, total_loss = 0, 0, 0, 0, 0, 0
        progress = tqdm(train_dataloader, total=(len(train_dataset)//opt.batch_size),
                        desc=f"Epoch: {epoch} - Loss: {total_loss}"
        )
        for batch, _ in progress:
            # Batch to cuda
            batch = {key: value.to(device) for key, value in batch.items()}

            # Forward pass
            if opt.fp16:
                with torch.cuda.amp.autocast():
                    loss, output = model(**batch)
            else:
                loss, output = model(**batch)
            
            total_loss += loss.item()
            loss = loss / opt.gradient_accumulation_steps 

            # Backward pass
            if opt.fp16:
                scaler.scale(loss).backward()
            else:
                loss.backward()
            
            if (step + 1) % opt.gradient_accumulation_steps == 0:
                if opt.fp16:
                    scaler.unscale_(optimizer)
                    
                torch.nn.utils.clip_grad_norm_(
                    model.parameters(), opt.max_grad_norm
                )

                if opt.fp16:
                    scaler.step(optimizer)
                    scaler.update()
                else:
                    optimizer.step()

                lr_scheduler.step()
                model.zero_grad()

            # Evaluation 
            pred = output.detach().argmax(-1)
            true = batch['labels'].detach()
            p += (pred != 0).sum().item()
            r += (true != 0).sum().item()
            tp += ((pred == true).float() @ (true != 0).float()).item()

            precision = tp*100/p if p > 0.0 else 0.0
            recall = tp*100/r if r > 0.0 else 0.0
            f1_score = (2*precision*recall) / (precision+recall) if precision+recall else 0.0
            progress.set_description(f"Epoch: {epoch} - Loss: {total_loss/(step+1):.3f} - P/R/F: {precision:.2f}/{recall:.2f}/{f1_score:.2f}")
            
            step += 1
            
        print(f"Training evaluation results:")
        precision = tp*100/p if p > 0.0 else 0.0
        recall = tp*100/r if r > 0.0 else 0.0
        f1_score = (2*precision*recall) / (precision+recall) if precision+recall else 0.0
        print(f"Precision/Recall/F-Score: {precision:.2f}/{recall:.2f}/{f1_score:.2f}")
        train_log['epochs'][-1]["train"] = {
            # step already equals the number of batches processed in this epoch
            'loss': total_loss/max(step, 1),
            'precision': precision,
            'recall': recall,
            'f1-score': f1_score
        }

        # Evaluation loop
        model.eval()
        if opt.do_eval:
            step, tp, p, r, total, total_loss = 0, 0, 0, 0, 0, 0
            
            progress = tqdm(eval_dataloader, total=(len(eval_dataset)//opt.batch_size),
                            desc=f"Epoch: {epoch} - Loss: {total_loss}"
            )
            for batch, _ in progress:
                batch = {key: value.to(device) for key, value in batch.items()}
                
                # Forward pass
                with torch.no_grad():
                    if opt.fp16:
                        with torch.cuda.amp.autocast():
                            loss, output = model(**batch)
                    else:
                        loss, output = model(**batch)
                
                total_loss += loss.item()
                # Evaluation 
                pred = output.detach().argmax(-1)
                true = batch['labels'].detach()
                p += (pred != 0).sum().item()
                r += (true != 0).sum().item()
                tp += ((pred == true).float() @ (true != 0).float()).item()

                precision = tp*100/p if p > 0.0 else 0.0
                recall = tp*100/r if r > 0.0 else 0.0
                f1_score = (2*precision*recall) / (precision+recall) if precision+recall else 0.0
                progress.set_description(f"Epoch: {epoch} - Loss: {total_loss/(step+1):.3f} - P/R/F: {precision:.2f}/{recall:.2f}/{f1_score:.2f}")
                
                step += 1

            print(f"Development evaluation results:")
            precision = tp*100/p if p > 0.0 else 0.0
            recall = tp*100/r if r > 0.0 else 0.0
            f1_score = (2*precision*recall) / (precision+recall) if precision+recall else 0.0
            print(f"Precision/Recall/F-Score: {precision:.2f}/{recall:.2f}/{f1_score:.2f}")
            train_log['epochs'][-1]["dev"] = {
                # step already equals the number of batches processed in this evaluation
                'loss': total_loss/max(step, 1),
                'precision': precision,
                'recall': recall,
                'f1-score': f1_score
            }
            eval_score = f1_score

            if eval_score > best_eval_score:
                print("Saving best model...")
                model.save_pretrained(opt.save_dir)
                best_eval_score = eval_score

        else:
            model.save_pretrained(opt.save_dir)
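
The precision, recall and F-score reported in the loop are micro-averaged over the positive relation labels: as the `!= 0` checks suggest, label id 0 is treated as the negative class, so it counts neither as a predicted relation nor as a gold relation. The same computation on plain Python lists looks like this (a sketch; the function name `micro_prf` is an assumption):

```python
from typing import Sequence, Tuple


def micro_prf(
    pred: Sequence[int], true: Sequence[int], negative_id: int = 0
) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 (in %) over non-negative labels."""
    p = sum(1 for y in pred if y != negative_id)  # predicted relations
    r = sum(1 for y in true if y != negative_id)  # gold relations
    # A true positive is a correct prediction of a non-negative gold label.
    tp = sum(1 for y_hat, y in zip(pred, true) if y_hat == y and y != negative_id)
    precision = tp * 100 / p if p > 0 else 0.0
    recall = tp * 100 / r if r > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


print(micro_prf(pred=[1, 2, 0, 0], true=[1, 0, 2, 0]))  # (50.0, 50.0, 50.0)
```

Note that correctly predicting the negative class does not improve any of the three scores; it only avoids lowering precision.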

In [None]:
with open(os.path.join(opt.save_dir, 'train_log.json'), 'wt') as f:
    json.dump(train_log, f)
print("Train log saved in: {}".format(os.path.join(opt.save_dir, 'train_log.json')))

The following code obtains the predictions, prints them, and plots the confusion matrix.


In [None]:
predictions = []
info = []
eval_dataloader = DataLoader(
    eval_dataset, collate_fn=eval_dataset.collate_fn,
    batch_size=1
)
with torch.no_grad():
    for batch, inst_info in eval_dataloader:
        batch = {key: value.to(device) for key, value in batch.items()}
        if opt.fp16:
            with torch.cuda.amp.autocast():
                loss, output = model(**batch)
        else:
            loss, output = model(**batch)

        output = output.argmax(dim=-1).detach().cpu().tolist()
        info.append(inst_info)
        predictions.extend(output)

In [None]:
def remove_markers(marked_tokens):
    return [token for token in marked_tokens if token not in ["<e1s>", "<e1e>", "<e2s>", "<e2e>"]]

for i, prediction in enumerate(predictions):
    predicted_label = id2label[prediction]
    gold_label = eval_dataset[i]['label']
    sentence =  " ".join(eval_dataset[i]['tokens'])
    print("{} - {}: {}".format(predicted_label, gold_label, sentence))

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

gold = [eval_dataset[i]['label'] for i in range(len(eval_dataset))]
pred = [id2label[prediction] for prediction in predictions]
cm = confusion_matrix(gold, pred, labels=list(label2id.keys()))
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=list(label2id.keys()))

disp.plot(xticks_rotation='vertical')


### Exercise 4: 
- Where is the confusion located? That is, which relation types does the model mix up most often?
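
One way to answer this without reading the plot cell by cell is to count the most frequent (gold, predicted) pairs that disagree. This is a small sketch, assuming that `gold` and `pred` are lists of label strings as in the confusion matrix cell; the helper name `top_confusions` and the toy labels are illustrative.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def top_confusions(
    gold: Sequence[str], pred: Sequence[str], k: int = 5
) -> List[Tuple[Tuple[str, str], int]]:
    """Return the k most frequent (gold, predicted) pairs where the labels differ."""
    errors = Counter((g, p) for g, p in zip(gold, pred) if g != p)
    return errors.most_common(k)


# Toy labels for illustration only; use the `gold` and `pred` lists from the notebook instead.
toy_gold = ["PHYS", "PHYS", "PART-WHOLE", "ORG-AFF", "PHYS"]
toy_pred = ["PART-WHOLE", "PART-WHOLE", "PART-WHOLE", "ORG-AFF", "PHYS"]
print(top_confusions(toy_gold, toy_pred))  # [(('PHYS', 'PART-WHOLE'), 2)]
```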