# Options (ALWAYS RUN)
## Transformer Model
*   **ENTITY_DELIM_TOKENS** - Entity delimiter tokens used in the dataset.

## Model Training
*   **TRAIN_DATA_PATH** - File paths of the data used for model training. If multiple paths are given, the data will be pooled.

*   **VAL_DATA_PATH** - File paths of the data used for model validation. If multiple paths are given, the data will be pooled; if no paths are given, no validation will take place *(useful for final model training)*.

*   **TRAINING_LOG_PATH** - File path to which training logs will be saved. If no path is given, logs won't be saved.

*   **BERT_MODEL** - Name of the pre-trained BERT model to load and use as part of the relation extraction model.

*   **ENTITY_TRANSFORMER_NUM_HEAD_VALS** - Hyperparameter values to iterate over for the number of heads used in the model's entity transformer. *(Note: for model testing and evaluation, the first value will be used)*

*   **ENTITY_TRANSFORMER_FF_DIM_VALS** - Hyperparameter values to iterate over for the feedforward dimension used in the model's entity transformer. *(Note: for model testing and evaluation, the first value will be used)*

*   **ENTITY_TRANSFORMER_NUM_LAYER_VALS** - Hyperparameter values to iterate over for the number of transformer layers used in the model's entity transformer. *(Note: for model testing and evaluation, the first value will be used)*

*   **NUM_EPOCHS** - Number of epochs to use during training.

*   **BATCH_SIZE** - Batch size to use during training.

*   **LEARNING_RATE** - Learning rate of the Adam optimizer used during training.

*   **DROPOUT_RATE** - Dropout rate to use during training.

*   **USE_CLASS_WEIGHTING** - Whether to use a class weighting scheme during training.

*   **MODEL_SAVE_PATH** - File path to save the trained model to. The file is overwritten on every save, so it will always contain the **LAST** model trained.
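
The three `ENTITY_TRANSFORMER_*_VALS` options together describe a hyperparameter grid. As a minimal sketch (the search values below are hypothetical, not the ones used for the saved model), the combinations can be enumerated like this:

```python
from itertools import product

# Hypothetical search values; the names mirror the options above.
ENTITY_TRANSFORMER_NUM_HEAD_VALS = (2, 4)
ENTITY_TRANSFORMER_FF_DIM_VALS = (64, 128)
ENTITY_TRANSFORMER_NUM_LAYER_VALS = (1,)

# Every (heads, feedforward dim, layers) combination, in nested-loop order.
grid = list(product(
    ENTITY_TRANSFORMER_NUM_HEAD_VALS,
    ENTITY_TRANSFORMER_FF_DIM_VALS,
    ENTITY_TRANSFORMER_NUM_LAYER_VALS,
))

# Testing and evaluation only use the first value of each option.
eval_config = (
    ENTITY_TRANSFORMER_NUM_HEAD_VALS[0],
    ENTITY_TRANSFORMER_FF_DIM_VALS[0],
    ENTITY_TRANSFORMER_NUM_LAYER_VALS[0],
)

for num_head, ff_dim, num_layer in grid:
    print(f"heads={num_head}, ff_dim={ff_dim}, layers={num_layer}")
```

Because the save path only ever holds the last model trained, a search over more than one combination leaves only the final combination on disk. Note also that PyTorch's multi-head attention requires the number of heads to divide the embedding size (768 for `bert-base-uncased`).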

## Model Evaluation

*   **TEST_DATA_PATH** - File paths of the data used for model testing. If multiple paths are given, the data will be pooled.

*   **TESTING_LOG_PATH** - File path to which testing logs will be saved. If no path is given, logs won't be saved.
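
"Pooling" here means merging the per-relation sample lists of every listed file. The sketch below illustrates this with two tiny made-up files. The file names, relation IDs and sample texts are assumptions for illustration, and `load_pooled` is a hypothetical helper rather than a function defined in this notebook.

```python
import json
import os
import tempfile

def load_pooled(paths):
    """Merge {relation: [samples]} JSON files into one dictionary."""
    pooled = {}
    for path in paths:
        with open(path) as f:
            for relation, samples in json.load(f).items():
                pooled.setdefault(relation, []).extend(samples)
    return pooled

# Two tiny hypothetical files; relation IDs and sample texts are made up.
file_a = {
    "P39": [{"tokens": "[E1S]A[E1E] leads [E2S]B[E2E]."}],
    "P106": [{"tokens": "[E1S]C[E1E] works as a [E2S]D[E2E]."}],
}
file_b = {"P39": [{"tokens": "[E1S]E[E1E] chairs [E2S]F[E2E]."}]}

with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for name, content in (("a.json", file_a), ("b.json", file_b)):
        path = os.path.join(tmp, name)
        with open(path, "w") as f:
            json.dump(content, f)
        paths.append(path)
    pooled = load_pooled(paths)

# Relation IDs follow the sorted relation names, so the mapping is only
# reproducible while the set of relations stays the same.
relation_id_map = {rel: i for i, rel in enumerate(sorted(pooled))}
print(relation_id_map)
```

Since the evaluation cell rebuilds the relation-to-ID mapping from the test data alone, the test files presumably need to contain exactly the same relations as the training files for the IDs to line up.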

## Model Demo

*   **RELATION_TO_DESCRIPTION_PATH** - File path of the relation-ID-to-description mapping file.

*   **NUM_PRINTED_PREDICTIONS** - Number of predictions to print per input during the demo.


In [2]:
# === OPTIONS ==================================================================

# ====== TRANSFORMER MODEL ==================================================

ENTITY_DELIM_TOKENS = {
    "h": ("[E1S]", "[E1E]"),
    "t": ("[E2S]", "[E2E]")
}

# ====== MODEL TRAINING =====================================================

TRAIN_DATA_PATH = (
    "../dataset/train.json",
    "../dataset/val.json"
)

VAL_DATA_PATH = []

TRAINING_LOG_PATH = "logs/transformer_final_training.log"

BERT_MODEL = "bert-base-uncased"

ENTITY_TRANSFORMER_NUM_HEAD_VALS = (2,) # (2, 4, 8)

ENTITY_TRANSFORMER_FF_DIM_VALS = (256,) # (64, 128, 256)

ENTITY_TRANSFORMER_NUM_LAYER_VALS = (3,) # (1, 2, 3)

NUM_EPOCHS = 3

BATCH_SIZE = 16

LEARNING_RATE = 2e-5

DROPOUT_RATE = 0.1

USE_CLASS_WEIGHTING = True

MODEL_SAVE_PATH = "transformer.pt"

# ====== MODEL EVALUATION ===================================================

TEST_DATA_PATH = (
    "../dataset/test.json",
)

TESTING_LOG_PATH = "logs/transformer_testing.log"

# ====== MODEL DEMO =========================================================

RELATION_TO_DESCRIPTION_PATH = "../dataset/pid2name_filtered.json"

NUM_PRINTED_PREDICTIONS = 3

# ==============================================================================

# Transformer Model (ALWAYS RUN)

Our relation extraction transformer model is based on the architecture described in the paper [Enriching Pre-trained Language Model with Entity Information for Relation Classification](https://dl.acm.org/doi/abs/10.1145/3357384.3358119).

In our model, however, we have made a novel adaptation: entity token embeddings are combined using weighted averages in place of plain averages. To calculate the weights used during entity token averaging, we have inserted a new transformer encoder plus a feedforward softmax layer into the architecture, which takes entity token embeddings as input and outputs entity token weights.

We have made this adaptation on the assumption that the tokens within an entity differ in how important they are for relation extraction. For example, in the entity phrase "King of England", the word "of" is likely to be less important than the words "King" or "England".
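
To make the idea concrete, here is a minimal pure-Python sketch of a softmax-weighted average. The 2-dimensional embeddings and the scores are invented for illustration; in the model, the scores come from the entity transformer plus feedforward layer, and padding positions are scored `-inf` so that they receive zero weight.

```python
import math

def softmax(scores):
    """Numerically stable softmax; -inf scores receive zero weight."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_average(embeddings, scores):
    """Softmax-weighted average of equal-length embedding vectors."""
    weights = softmax(scores)
    dim = len(embeddings[0])
    return [
        sum(w * emb[d] for w, emb in zip(weights, embeddings))
        for d in range(dim)
    ]

# Toy embeddings for "King", "of", "England" plus one padding slot.
# Both the vectors and the scores are invented for illustration.
embeddings = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.0, 0.0]]
scores = [2.0, 0.0, 2.0, float("-inf")]  # padding is scored -inf

weights = softmax(scores)
entity_embedding = weighted_average(embeddings, scores)
print(weights)
print(entity_embedding)
```

With equal scores for every real token, the result reduces to the plain average used in the original architecture, so the weighted version can be seen as a generalisation of it.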

In [1]:
from itertools import chain
import torch
import torch.nn as nn
from torch.nn.utils.rnn import (
    pad_sequence, pack_padded_sequence, pad_packed_sequence
)
from torch.utils.data import Dataset
from transformers import BertModel, BertTokenizer

# Relation Extraction model
class REModel(nn.Module):

    # Create BERT tokenizer and add special entity delimiter tokens
    # (called on the class itself, so declared as a static method)
    @staticmethod
    def create_tokenizer(bert_model, entity_delim_tokens):
        REModel.tokenizer = BertTokenizer.from_pretrained(bert_model)
        REModel.tokenizer.add_special_tokens(
            {
                'additional_special_tokens': list(
                    chain.from_iterable(entity_delim_tokens.values())
                )
            }
        )

    # Tokenization function (uses the tokenizer shared by the class)
    @staticmethod
    def tokenize(inputs):
        return REModel.tokenizer(
            inputs,
            padding=True,
            return_tensors="pt",
            return_token_type_ids=False
        )

    def __init__(
            self,
            bert_model,
            entity_delim_tokens,
            entity_transformer_nhead,
            entity_transformer_dim_ff,
            entity_transformer_num_layers,
            num_outputs,
            dropout_rate
    ):
        super(REModel, self).__init__()

        # Get entity delimiter token IDs
        self.entity_delim_ids = {}
        for entity_id, delim_tokens in entity_delim_tokens.items():
            self.entity_delim_ids[
                entity_id
            ] = REModel.tokenizer.convert_tokens_to_ids(delim_tokens)

        # Create BERT model
        self.bert = BertModel.from_pretrained(bert_model)
        self.bert.resize_token_embeddings(len(REModel.tokenizer))
        bert_hidden_size = self.bert.config.hidden_size

        # Create feedforward layer for [CLS] token
        self.cls_feedforward = nn.Linear(
            bert_hidden_size,
            bert_hidden_size
        )

        # Create transformer encoder for calculating entity weights
        entity_transformer_layer = nn.TransformerEncoderLayer(
            d_model=bert_hidden_size,
            nhead=entity_transformer_nhead,
            dim_feedforward=entity_transformer_dim_ff,
            batch_first=True
        )
        self.entity_transformer = nn.TransformerEncoder(
            entity_transformer_layer,
            num_layers=entity_transformer_num_layers,
        )

        # Create feedforward layer for calculating entity weights
        self.entity_weight_feedforward = nn.Linear(bert_hidden_size, 1)

        # Create feedforward layer for weighted average entity embeddings
        self.entity_feedforward = nn.Linear(
            bert_hidden_size,
            bert_hidden_size
        )

        # Create feedforward layer for final relation embedding
        self.relation_feedforward = nn.Linear(
            bert_hidden_size * (1 + len(entity_delim_tokens)),
            num_outputs
        )

        # Create dropout layer
        self.dropout = nn.Dropout(p=dropout_rate)

    def forward(self, input_ids, attention_mask):

        # Pass inputs through BERT model
        bert_outputs = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask
        ).last_hidden_state

        # Pass [CLS] token embedding through tanh + feedforward layer
        cls_embedding = self.cls_feedforward(
            torch.tanh(self.dropout(bert_outputs[:, 0, :]))
        )

        # Calculate weighted average entity embeddings and concatenate with
        # [CLS] token embedding to create final relation embedding
        #
        # TODO : Could probably fully vectorise this across entity types to
        # remove the for loop here. I think it'd be pretty complicated to do
        # this though and, since our use case only requires 2 entity types, I
        # don't think the performance boost would be too big anyway.
        #
        relation_embedding = cls_embedding
        for start_id, end_id in self.entity_delim_ids.values():

            # Find entity delimiter positions
            start_poses = input_ids == start_id
            end_poses = input_ids == end_id

            # Get entity mask
            start_cumsum = torch.cumsum(start_poses, dim=1)
            end_cumsum = torch.cumsum(end_poses, dim=1)
            entity_mask = (
                (start_cumsum > end_cumsum) & (start_cumsum > 0) & ~start_poses
            )

            # Extract entity token embeddings
            entity_tokens = torch.split(
                bert_outputs[entity_mask],
                entity_mask.sum(dim=1).tolist()
            )

            # Normalise token embedding counts across inputs using padding
            padded_entity_tokens = pad_sequence(
                entity_tokens,
                batch_first=True,
                padding_value=0
            )

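            # Create padding mask for padded entity token embeddings (True
            # marks padding). A boolean mask is used because PyTorch adds a
            # float key padding mask to the attention scores instead of
            # masking with it
            entity_token_attention = (
                padded_entity_tokens == 0
            ).all(dim=-1)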

            # Pass entity token embeddings through entity transformer encoder
            entity_transformer_outputs = self.entity_transformer(
                padded_entity_tokens,
                src_key_padding_mask=entity_token_attention
            )

            # Pack all entity token embedding sequences together, ignoring
            # padding tokens
            packed_entity_transformer_outputs = pack_padded_sequence(
                entity_transformer_outputs,
                (entity_token_attention == 0).sum(dim=1).to("cpu"),
                batch_first=True,
                enforce_sorted=False
            )

            # Pass packed entity token embeddings through feedforward layer
            packed_entity_embedding_weights = self.entity_weight_feedforward(
                packed_entity_transformer_outputs.data
            )

            # Unpack entity embedding weights to restore per input data
            batch_size = entity_transformer_outputs.shape[0]
            entity_token_sequence_len = entity_transformer_outputs.shape[1]
            entity_embedding_weights = torch.full(
                (batch_size * entity_token_sequence_len,),
                float('-inf'),
                device=input_ids.device
            )
            entity_embedding_weights[
                (entity_token_attention.flatten() == 0).nonzero().squeeze()
            ] = packed_entity_embedding_weights.flatten()
            entity_embedding_weights = entity_embedding_weights.view(
                (batch_size, entity_token_sequence_len)
            )

            # Softmax entity embedding weights over each input
            entity_embedding_weights = nn.functional.softmax(
                entity_embedding_weights, dim=-1
            )

            # Calculate weighted average entity embedding
            weighted_entity_embedding = (
                padded_entity_tokens * entity_embedding_weights.unsqueeze(-1)
            ).sum(dim=1)

            # Pass weighted average entity embedding through tanh + feedforward
            # layer
            avg_entity_output = self.entity_feedforward(
                torch.tanh(self.dropout(weighted_entity_embedding))
            )

            # Concatenate average entity output with final relation embedding
            relation_embedding = torch.cat(
                [relation_embedding, avg_entity_output], dim=-1
            )

        # Pass final relation embedding through feedforward layer
        output = self.relation_feedforward(self.dropout(relation_embedding))

        return output

# Define PyTorch dataset class for dataloader
class REDataset(Dataset):
    def __init__(self, relation_id_map, data, tokenizer):

        # Get list of tokens and labels from data
        tokens = []
        self.labels = []
        for relation, samples in data.items():
            tokens += [sample["tokens"] for sample in samples]
            self.labels += [relation_id_map[relation]] * len(samples)

        # Tokenize sentences
        tokenized = tokenizer(tokens)
        self.input_ids = tokenized["input_ids"]
        self.attention_mask = tokenized["attention_mask"]

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return (
            self.input_ids[idx],
            self.attention_mask[idx],
            torch.tensor(self.labels[idx], dtype=torch.long)
        )
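
The entity mask in `REModel.forward` relies on a cumulative-sum trick: a token lies inside an entity when more start delimiters than end delimiters have been seen so far and the token is not itself a start delimiter. The pure-Python sketch below mirrors that logic on a single sequence. The token IDs are hypothetical, and the `start_cumsum > 0` term is left out because it is implied by the first comparison.

```python
from itertools import accumulate

def entity_mask(input_ids, start_id, end_id):
    """Mark tokens strictly between a start and an end delimiter."""
    starts = [int(tok == start_id) for tok in input_ids]
    ends = [int(tok == end_id) for tok in input_ids]
    start_cumsum = list(accumulate(starts))
    end_cumsum = list(accumulate(ends))
    return [
        s_cum > e_cum and not is_start
        for s_cum, e_cum, is_start in zip(start_cumsum, end_cumsum, starts)
    ]

# Hypothetical token IDs: 900 and 901 stand in for "[E1S]" and "[E1E]".
ids = [101, 900, 7, 8, 901, 5, 900, 9, 901, 102]
mask = entity_mask(ids, start_id=900, end_id=901)
print([tok for tok, keep in zip(ids, mask) if keep])  # [7, 8, 9]
```

Tokens from every occurrence of the entity in the input are collected into the same sequence, which is why the demo asks for delimiters around all occurrences.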




# Model Training

In [None]:
import json
import time
from torch.utils.data import DataLoader

# Load training and validation data
train_data = {}
val_data = {}
for data_dict, data_paths in (
    (train_data, TRAIN_DATA_PATH),
    (val_data, VAL_DATA_PATH)
):
    for data_path in data_paths:
        with open(data_path) as f:
            new_data = json.load(f)
            for relation, samples in new_data.items():
                data_dict.setdefault(relation, []).extend(samples)

# Create relation to ID mapping
relation_id_map = {}
for relation in sorted(list(train_data.keys())):
    relation_id_map[relation] = len(relation_id_map)

# Create REModel tokenizer
REModel.create_tokenizer(BERT_MODEL, ENTITY_DELIM_TOKENS)

# Create training and validation datasets
train_dataset = REDataset(relation_id_map, train_data, REModel.tokenize)
if len(val_data) > 0:
    val_dataset = REDataset(relation_id_map, val_data, REModel.tokenize)
else:
    val_dataset = None
del train_data, val_data

# Create training and validation dataloaders
train_dataloader = DataLoader(
    train_dataset, batch_size=BATCH_SIZE, shuffle=True
)
if val_dataset:
    val_dataloader = DataLoader(val_dataset, batch_size=BATCH_SIZE)

# Get device (GPU or CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Calculate class weights if needed
class_weights = None
if USE_CLASS_WEIGHTING:
    class_counts = torch.bincount(torch.tensor(train_dataset.labels))
    class_weights = (
        len(train_dataset.labels) / (len(class_counts) * class_counts)
    ).float().to(device)

# Reset log file
if TRAINING_LOG_PATH:
    open(TRAINING_LOG_PATH, "w+").close()

# Loop over all required hyperparameter combinations
for num_head in ENTITY_TRANSFORMER_NUM_HEAD_VALS:
    for ff_dim in ENTITY_TRANSFORMER_FF_DIM_VALS:
        for num_layer in ENTITY_TRANSFORMER_NUM_LAYER_VALS:
            log_text = ""

            # Log model hyperparameters
            text = (
                f"Num. Heads: {num_head}, "
                f"F.F. Dim.: {ff_dim}, "
                f"Num. Layers: {num_layer}\n"
            )
            print(text)
            log_text += f"{text}\n"

            # Create model
            model = REModel(
                BERT_MODEL,
                ENTITY_DELIM_TOKENS,
                num_head,
                ff_dim,
                num_layer,
                len(relation_id_map),
                DROPOUT_RATE
            )
            model.to(device)

            # Create loss function and optimizer
            loss_func = nn.CrossEntropyLoss(weight=class_weights)
            optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)

            # Train model
            model.train()
            prev_val_loss = float("inf")
            for epoch in range(NUM_EPOCHS):
                train_loss = 0
                curr_sample = 0
                start_time = time.time()

                # Log current epoch
                text = (f"\tEpoch: {epoch + 1}/{NUM_EPOCHS}")
                print(text)
                log_text += f"{text}"

                # Loop over all batches
                for input_ids, attention_mask, label in train_dataloader:
                    input_ids, attention_mask, label = (
                        input_ids.to(device),
                        attention_mask.to(device),
                        label.to(device)
                    )

                    # Get model output
                    output = model(input_ids, attention_mask)

                    # Calculate loss
                    loss = loss_func(output, label)
                    train_loss += loss.item()

                    # Optimize model
                    optimizer.zero_grad()
                    loss.backward()
                    optimizer.step()

                    # Log batch information
                    curr_sample += len(label)
                    time_per_sample = (time.time() - start_time) / curr_sample
                    text = (
                        f"\tSample: {curr_sample}/{len(train_dataset)}, "
                        f"Loss: {loss.item()}, "
                        f"Time Rem.: "
                        f"{round(time_per_sample * (len(train_dataset) - curr_sample))}s"
                    )
                    print(f"\r{text}", end=" "*10)
                    log_text += f"\n{text}"

                # Log epoch training information
                text = (
                    f"\n\tTraining Loss: {train_loss / len(train_dataloader)}" +
                    ("\n" if not val_dataset else "")
                )
                print(text)
                log_text += text

                # Calculate validation loss
                if val_dataset:
                    model.eval()
                    val_loss = 0
                    with torch.no_grad():
                        for input_ids, attention_mask, label in val_dataloader:
                            input_ids, attention_mask, label = (
                                input_ids.to(device),
                                attention_mask.to(device),
                                label.to(device)
                            )

                            # Get model output
                            output = model(input_ids, attention_mask)

                            # Calculate loss
                            loss = loss_func(output, label)
                            val_loss += loss.item()

                    # Log epoch validation information
                    text = (
                        f"\tValidation Loss: {val_loss / len(val_dataloader)}\n"
                    )
                    print(text)
                    log_text += f"\n{text}\n"

                    model.train()

            # Write log to file
            if TRAINING_LOG_PATH:
                with open(TRAINING_LOG_PATH, "a") as f:
                    f.write(log_text)

# Save model weights to file
if MODEL_SAVE_PATH:
    torch.save(model.state_dict(), MODEL_SAVE_PATH)


Num. Heads: 2, F.F. Dim.: 256, Num. Layers: 3

	Epoch: 1/3
	Sample: 54108/54108, Loss: 0.4776819944381714, Time Rem.: 0s          
	Training Loss: 1.0128691790009712

	Epoch: 2/3
	Sample: 54108/54108, Loss: 0.401192843914032, Time Rem.: 0s          
	Training Loss: 0.33581728160398944

	Epoch: 3/3
	Sample: 54108/54108, Loss: 0.33058929443359375, Time Rem.: 0s          
	Training Loss: 0.2132370227318321
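
When `USE_CLASS_WEIGHTING` is enabled, the training cell weights each class by `n_samples / (n_classes * class_count)`, so that rarer relations contribute more to the loss. The following sketch reproduces that formula in plain Python on made-up labels. `balanced_class_weights` is a hypothetical helper, and it assumes every class occurs at least once.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {
        cls: n_samples / (n_classes * count)
        for cls, count in sorted(counts.items())
    }

# Made-up labels: class 0 is three times as common as class 1.
labels = [0, 0, 0, 0, 0, 0, 1, 1]
weights = balanced_class_weights(labels)
print(weights)
```

A useful property of this scheme is that the weights, summed over all samples, add up to the number of samples, so the overall scale of the loss stays comparable to the unweighted case.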



# Model Evaluation

In [None]:
!pip install torchmetrics -q

import json
import time
from torchmetrics import Accuracy, F1Score, Precision, Recall
from torch.utils.data import DataLoader

# Load testing data
test_data = {}
for data_path in TEST_DATA_PATH:
    with open(data_path) as f:
        new_data = json.load(f)
        for relation, samples in new_data.items():
            test_data.setdefault(relation, []).extend(samples)

# Create relation to ID mapping
relation_id_map = {}
for relation in sorted(list(test_data.keys())):
    relation_id_map[relation] = len(relation_id_map)

# Create REModel tokenizer
REModel.create_tokenizer(BERT_MODEL, ENTITY_DELIM_TOKENS)

# Create test dataset
test_dataset = REDataset(relation_id_map, test_data, REModel.tokenize)
del test_data

# Create test dataloader
test_dataloader = DataLoader(test_dataset, batch_size=BATCH_SIZE)

# Get device (GPU or CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load model for evaluation
model = REModel(
    BERT_MODEL,
    ENTITY_DELIM_TOKENS,
    ENTITY_TRANSFORMER_NUM_HEAD_VALS[0],
    ENTITY_TRANSFORMER_FF_DIM_VALS[0],
    ENTITY_TRANSFORMER_NUM_LAYER_VALS[0],
    len(relation_id_map),
    DROPOUT_RATE
)
model.load_state_dict(torch.load(
    MODEL_SAVE_PATH, weights_only=True
), strict=False)
model.to(device)
model.eval()

# Evaluate model using test data
predicted = []
actual = []
curr_sample = 0
start_time = time.time()
log_text = ""
for input_ids, attention_mask, label in test_dataloader:
    input_ids, attention_mask = (
        input_ids.to(device),
        attention_mask.to(device)
    )

    # Get model output (no gradients are needed for evaluation)
    with torch.no_grad():
        output = model(input_ids, attention_mask)

    # Get predicted relation ID
    output = nn.functional.softmax(output, dim=-1)
    predicted.extend(torch.argmax(output, dim=-1).tolist())

    # Get actual relation ID
    actual.extend(label.tolist())

    # Log batch information
    curr_sample += len(label)
    time_per_sample = (time.time() - start_time) / curr_sample
    text = (
        f"Sample: {curr_sample}/{len(test_dataset)}, "
        f"Time Rem.: "
        f"{round(time_per_sample * (len(test_dataset) - curr_sample))}s"
    )
    print(f"\r{text}", end=" "*10)
    log_text += f"{text}\n"

# Calculate evaluation metrics
print("\n")
for metric_name, metric_func in (
    ("Accuracy", Accuracy(
            task="multiclass",
            num_classes=len(relation_id_map),
            average="macro"
        )
    ),
    ("F1 Score", F1Score(
            task="multiclass",
            num_classes=len(relation_id_map),
            average="macro"
        )
    ),
    ("Precision", Precision(
            task="multiclass",
            num_classes=len(relation_id_map),
            average="macro"
        )
    ),
    ("Recall", Recall(
            task="multiclass",
            num_classes=len(relation_id_map),
            average="macro"
        )
    )
):

    # Calculate metric
    result = metric_func(torch.tensor(predicted), torch.tensor(actual))

    # Log metric information
    text = f"{metric_name}: {result.item()}"
    print(f"{text}")
    log_text += f"\n{text}"

# Save log to file (skipped if no path is given)
if TESTING_LOG_PATH:
    with open(TESTING_LOG_PATH, "w+") as f:
        f.write(log_text)


Sample: 6012/6012, Time Rem.: 0s          

Accuracy: 0.8835148811340332
F1 Score: 0.8724994659423828
Precision: 0.8668336868286133
Recall: 0.8835148811340332
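
The reported accuracy and recall are identical. This appears to follow from the `average="macro"` setting: as far as we understand torchmetrics, macro-averaged multiclass accuracy is the mean of the per-class accuracies, which is the same quantity as macro recall. The sketch below computes macro metrics by hand on made-up predictions and contrasts them with plain (micro) accuracy. `macro_metrics` is a hypothetical helper, not part of torchmetrics.

```python
def macro_metrics(predicted, actual, num_classes):
    """Macro precision, recall and F1, plus plain (micro) accuracy."""
    precisions, recalls, f1s = [], [], []
    for cls in range(num_classes):
        pairs = list(zip(predicted, actual))
        tp = sum(p == cls and a == cls for p, a in pairs)
        fp = sum(p == cls and a != cls for p, a in pairs)
        fn = sum(p != cls and a == cls for p, a in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (
            2 * precision * recall / (precision + recall)
            if precision + recall else 0.0
        )
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    correct = sum(p == a for p, a in zip(predicted, actual))
    return {
        "macro_precision": sum(precisions) / num_classes,
        "macro_recall": sum(recalls) / num_classes,
        "macro_f1": sum(f1s) / num_classes,
        "micro_accuracy": correct / len(actual),
    }

# Made-up predictions for an imbalanced three-class problem.
actual = [0, 0, 0, 0, 0, 0, 1, 2]
predicted = [0, 0, 0, 0, 0, 0, 1, 0]
metrics = macro_metrics(predicted, actual, num_classes=3)
print(metrics)
```

In this toy case, 7 of 8 samples are classified correctly, yet macro recall is only 2/3 because the single sample of class 2 is missed. Macro averaging treats every relation equally regardless of how many test samples it has.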


# Model Demo

**Inputs must include the entity delimiters "[E1S]" and "[E1E]" around occurrences of the first entity, and "[E2S]" and "[E2E]" around occurrences of the second entity.**

**E.g.: "[E1S]Charles[E1E] is the [E2S]King of England[E2E]."**
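
If the entities are known as plain strings, the delimiters can be added programmatically. The helper below is a hypothetical convenience function (it is not part of the notebook) that uses simple substring replacement, so it assumes that neither entity string occurs inside the other.

```python
ENTITY_DELIM_TOKENS = {
    "h": ("[E1S]", "[E1E]"),
    "t": ("[E2S]", "[E2E]"),
}

def mark_entities(sentence, head, tail, delims=ENTITY_DELIM_TOKENS):
    """Wrap every occurrence of the head and tail entities in delimiters.

    Hypothetical helper using plain substring replacement, so it assumes
    that neither entity string occurs inside the other.
    """
    for entity, (start, end) in ((head, delims["h"]), (tail, delims["t"])):
        if entity not in sentence:
            raise ValueError(f"entity not found: {entity!r}")
        sentence = sentence.replace(entity, f"{start}{entity}{end}")
    return sentence

marked = mark_entities(
    "Charles is the King of England.", "Charles", "King of England"
)
print(marked)  # [E1S]Charles[E1E] is the [E2S]King of England[E2E].
```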

In [13]:
import json

# Load relation to description mapping
with open(RELATION_TO_DESCRIPTION_PATH) as f:
    relation_to_desc = json.load(f)

# Create ID to relation mapping
relation_id_map = {}
for relation in sorted(list(relation_to_desc.keys())):
    relation_id_map[relation] = len(relation_id_map)
id_relation_map = {iid: rel for rel, iid in relation_id_map.items()}

# Get device (GPU or CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Create REModel tokenizer
REModel.create_tokenizer(BERT_MODEL, ENTITY_DELIM_TOKENS)

# Load model for demo
print(f"Loading model from {MODEL_SAVE_PATH}")
model = REModel(
    BERT_MODEL,
    ENTITY_DELIM_TOKENS,
    ENTITY_TRANSFORMER_NUM_HEAD_VALS[0],
    ENTITY_TRANSFORMER_FF_DIM_VALS[0],
    ENTITY_TRANSFORMER_NUM_LAYER_VALS[0],
    len(relation_to_desc),
    DROPOUT_RATE
)
model.load_state_dict(torch.load(
    MODEL_SAVE_PATH, weights_only=True
), strict=False)
model.to(device)
model.eval()

# Demo loop
while True:

    # Get input
    inp = input("\nInput: ")
    if inp == "":
        break

    # Convert input to required model input format
    inp = REModel.tokenizer(inp)
    input_ids = torch.tensor(inp["input_ids"]).unsqueeze(0).to(device)
    attention_mask = torch.tensor(inp["attention_mask"]).unsqueeze(0).to(device)

    # Get model output (no gradients are needed for inference)
    with torch.no_grad():
        output = model(input_ids, attention_mask)
    output = nn.functional.softmax(output, dim=-1)

    # Get the top NUM_PRINTED_PREDICTIONS relation IDs and probabilities
    top_probs, top_ids = torch.topk(output, k=NUM_PRINTED_PREDICTIONS)
    top_probs *= 100

    # Print results
    for i in range(NUM_PRINTED_PREDICTIONS):
        desc = relation_to_desc[id_relation_map[top_ids[0][i].item()]]
        print(
            f"\n{top_probs[0][i].item():.2f}% - "
            f"{desc[0]} ({desc[1]})"
        )


Loading model from transformer.pt

Input: [E1S]Charles[E1E] is the [E2S]King of England[E2E].

93.73% - position held (subject currently or formerly holds the object position or public office)

5.96% - occupation (occupation of a person; see also "field of work" (Property:P101), "position held" (Property:P39))

0.04% - military rank (military rank achieved by a person (should usually have a "start time" qualifier), or military rank associated with a position)

Input: 
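
The percentages printed by the demo are softmax probabilities of the model's output scores, sorted in descending order and cut off after `NUM_PRINTED_PREDICTIONS` entries. A minimal pure-Python sketch of that post-processing, using invented scores and a hypothetical `top_k_predictions` helper, looks like this:

```python
import math

def top_k_predictions(logits, labels, k):
    """Return the k most probable (label, percentage) pairs."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(
        zip(labels, probs), key=lambda pair: pair[1], reverse=True
    )
    # Slicing silently clamps k to the number of available classes.
    return [(label, 100 * p) for label, p in ranked[:k]]

# Invented scores for four relation labels.
labels = ["position held", "occupation", "military rank", "spouse"]
logits = [5.0, 2.0, -3.0, -4.0]
top3 = top_k_predictions(logits, labels, k=3)
for label, pct in top3:
    print(f"{pct:.2f}% - {label}")
```

Unlike this sketch, `torch.topk` raises an error when `k` is larger than the number of classes, so `NUM_PRINTED_PREDICTIONS` should not exceed the number of relations the model was trained on.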
