# Dependency Parsing with the Catalan AnCora UD Treebank
For this challenge I created two models (using pure PyTorch rather than any NLP-specific library, per the guidelines). The first is the "oracle", which predicts "shift", "leftArc", or "rightArc" at each step of the dependency parsing process. The second predicts the dependency relation column of each token. The two architectures are similar: both use bidirectional multilayer LSTMs, with a few differences that follow from how the data is formatted for each task.

The features I used are specified in the `PARAMS` variable (this was originally meant to be customizable, but changing it does not currently work). Both models take these features as the representation of each token and feed them into the LSTM to produce outputs. For the oracle, each input is the pair of the top two tokens on the stack at time *t* in the labeling process.

At the end I use both models to compute LAS and UAS scores: the oracle's predicted operators determine the dependency heads, and the dependency relation model's outputs are used directly as the labels.

## Imports
I started this challenge before Dr. Scannell provided his code for parsing the data, so I am using a library that does essentially the same thing. In addition I'm using vanilla PyTorch for the oracle and the dependency labeler. The datasets I create are subclasses of the PyTorch `Dataset` class.

In [1]:
from conllu import parse

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader

# Fall back to the CPU when no GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

## Customizable Parameters
Besides the features and files, there's a set of customizable parameters for each model. These are all standard parameters I have used in past challenges, such as learning rate, dropout, embedding size, and hidden size. All parameters were tuned by hand for the best validation accuracy I could reach on each model.

In [90]:
PARAMS = {
    'features': ['form', 'lemma', 'upos'],
    'train': "ca_ancora-ud-train.conllu",
    'dev': "ca_ancora-ud-dev.conllu",
    'test': "ca_ancora-ud-test.conllu",
    'oracle': {
        'epochs': 20,
        'lr': 0.001,
        'dropout': 0.3,
        'embedding': 500,
        'hidden': 200,
        'n_layers': 4,
        'bidirectional': True
    },
    'deprel': {
        'epochs': 20,
        'lr': 0.001,
        'dropout': 0.3,
        'embedding': 500,
        'hidden': 200,
        'n_layers': 4,
        'bidirectional': True
    }
}

## Operator Dataset to Feed the Oracle
The `OperatorDataset` constructs the data the oracle will be trained on. The function of importance here is `get_samples`. It determines the sequence of "shift", "leftArc", and "rightArc" operators necessary to reconstruct the dependencies. Each of these labels is paired with the features of the top two tokens in the stack, which are the tokens used to make the decision. When taken as part of a full list of tokens in sequential order the oracle's LSTM sees the entire order of operators in which the dependencies are determined.

In [67]:
class OperatorDataset(Dataset):
    def __init__(self, filename, features=None):
        # Read the whole file and close the handle when done
        with open(filename, "r", encoding="utf-8") as file:
            self.raw = file.read()
        self.operators = self.construct_id_dict(['shift', 'rightArc', 'leftArc'])
        self.parsed = self.clean_data(parse(self.raw))
        self.features = self.get_features() if features is None else features
        self.data = [self.get_samples(sentence, as_tensor=True) for sentence in self.parsed]

    def __len__(self):
        return len(self.parsed)

    def __getitem__(self, idx):
        tokens = self.data[idx]
        features = torch.stack([token[0] for token in tokens])
        labels = torch.stack([token[1] for token in tokens])
        return features, labels
    
    def get_features(self):
        assert hasattr(self, 'parsed') and self.parsed is not None
        features = set(['<unk>', '<ROOT>', '<pad>'])
        for sentence in self.parsed:
            for token in sentence:
                for feature in PARAMS['features']:
                    features.add(token[feature])
        return self.construct_id_dict(features)
    
    def label_operator(self, stack, remaining_tokens):
        if len(stack) >= 2 and stack[1]['head'] == stack[0]['id']:
            stack.pop(1)
            return stack, remaining_tokens, 'leftArc'
        elif len(stack) >= 2 and stack[1]['id'] == stack[0]['head'] and stack[0]['id'] not in [tok['head'] for tok in remaining_tokens]:
            stack.pop(0)
            return stack, remaining_tokens, 'rightArc'
        else:
            stack.insert(0, remaining_tokens.pop(0))
            return stack, remaining_tokens, 'shift'

    def get_samples(self, tokens: list, as_tensor=True):
        token_list = tokens.copy()
        samples = []
        # start with 2 roots so there are always 2 tokens in the stack, this keeps dimensions consistent
        root = {'form': '<ROOT>', 'lemma': '<ROOT>', 'upos': '<ROOT>', 'head': None, 'id': 0}
        stack = [root, root]
        while len(stack) > 2 or len(token_list) > 0:
            stack_top = stack[:2]
            stack, token_list, op = self.label_operator(stack, token_list)
            samples.append((stack_top, op))
        return samples if not as_tensor else self.samples_to_tensor(samples)
    
    def samples_to_tensor(self, samples: list):
        new_samples = []
        for sample in samples:
            features, operator = sample
            new_sample = torch.LongTensor([])
            for token in features:
                cleaned_features = [token[feature] if token[feature] in self.features['feature_to_id'] else '<unk>' for feature in PARAMS['features']]
                token_features = torch.LongTensor([self.features['feature_to_id'][f] for f in cleaned_features])
                new_sample = torch.cat((new_sample, token_features), 0)
            new_samples.append((new_sample, torch.LongTensor([self.operators['feature_to_id'][operator]])))
        return new_samples

    @staticmethod
    def construct_id_dict(set_of_features: set):
        features = {
            'feature_to_id': {feature: i for i, feature in enumerate(set_of_features)},
            'id_to_feature': {i: feature for i, feature in enumerate(set_of_features)}
        }
        return features
    
    def clean_data(self, data: list) -> list:
        clean_data = []
        for sentence in data:
            bad_sentence = False
            # catch nonprojective trees: the oracle runs out of tokens to shift and raises
            try:
                self.get_samples(sentence, as_tensor=False)
            except Exception:
                bad_sentence = True
            if not bad_sentence:
                clean_data.append(sentence)
        return clean_data
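
To make the operator sequence concrete, here is a minimal sketch of the same static-oracle rules applied to a three-token toy sentence. It is not the notebook's class: the function name `oracle_ops`, the toy tokens, and the single ROOT entry (the notebook uses two so tensor shapes stay fixed) are assumptions made for illustration.

```python
def oracle_ops(tokens):
    """Return the shift/leftArc/rightArc sequence for a projective tree.

    `tokens` is a list of dicts with 'id' and 'head'. Index 0 of `stack`
    is the top of the stack, as in the notebook.
    """
    buffer = list(tokens)
    stack = [{'id': 0, 'head': None}]  # single ROOT (assumption; the notebook doubles it)
    ops = []
    while len(stack) > 1 or buffer:
        top = stack[0]
        below = stack[1] if len(stack) > 1 else None
        if below is not None and below['head'] == top['id']:
            # the element under the top depends on the top
            stack.pop(1)
            ops.append('leftArc')
        elif (below is not None and top['head'] == below['id']
              and all(tok['head'] != top['id'] for tok in buffer)):
            # the top depends on the element below and has no dependents left to collect
            stack.pop(0)
            ops.append('rightArc')
        else:
            # raises IndexError when the buffer is empty (tree not derivable)
            stack.insert(0, buffer.pop(0))
            ops.append('shift')
    return ops


# "El gat dorm": El -> gat, gat -> dorm, dorm -> ROOT
toy_sentence = [
    {'id': 1, 'head': 2},
    {'id': 2, 'head': 3},
    {'id': 3, 'head': 0},
]
toy_ops = oracle_ops(toy_sentence)
print(toy_ops)
```

For a projective tree over n tokens this yields n shifts and n arcs. When a tree cannot be derived, the loop eventually tries to shift from an empty buffer and raises `IndexError`, which is the failure `clean_data` relies on.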

Construct the operator datasets. The training set determines its own features based on the training data. The validation and test sets use the training set's features. If a feature not seen in the training set appears in the validation or test data, that feature is replaced with the `<unk>` feature.

In [91]:
oracle_train_set = OperatorDataset(PARAMS['train'])
oracle_val_set = OperatorDataset(PARAMS['dev'], oracle_train_set.features)
oracle_test_set = OperatorDataset(PARAMS['test'], oracle_train_set.features)
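
The `<unk>` fallback can be sketched in a few lines. The helper names `build_vocab` and `encode` and the toy values are hypothetical, and sorting the vocabulary is a choice made for this sketch only: the notebook enumerates a Python `set`, whose string ordering can change between interpreter sessions, which matters if a trained model is reloaded later.

```python
def build_vocab(values, specials=('<unk>', '<pad>')):
    """Map each distinct value to an integer id, special tokens first.

    Sorting makes the ids reproducible across runs (an assumption of this
    sketch, not something the notebook does).
    """
    ordered = list(specials) + sorted(set(values) - set(specials))
    return {value: i for i, value in enumerate(ordered)}


def encode(values, vocab):
    """Look up ids, mapping anything unseen during training to '<unk>'."""
    return [vocab.get(value, vocab['<unk>']) for value in values]


# Vocabulary built from "training" values only
train_vocab = build_vocab(['el', 'gat', 'dorm', 'DET', 'NOUN', 'VERB'])

# 'gos' never appeared in training, so it falls back to the '<unk>' id
ids = encode(['el', 'gos', 'dorm'], train_vocab)
print(ids)
```

As in the notebook, forms, lemmas, and UPOS tags share one id space here.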

Dataloader creation. I left the batch size at the default of 1 for the sake of simplicity. Each sample drawn from the dataset is still a full sequential list comprising a single sentence in the data so the LSTM can operate properly.

In [92]:
oracle_train_loader = DataLoader(oracle_train_set)
oracle_val_loader = DataLoader(oracle_val_set)
oracle_test_loader = DataLoader(oracle_test_set)

## The Oracle Model
The oracle is a bidirectional multilayer LSTM. Before the LSTM, the model embeds the provided feature ids and then uses a 2D convolutional layer whose kernel covers the whole `(2 * n_features) x embedding_size` embedding matrix, reducing it to a single value. This way the LSTM sees a sequence of single values, each representing the top two tokens in the stack at a given step. As far as I know, this Embedding + 2D convolution reduction is fairly unconventional, but it seems to be working!

In [70]:
class Oracle(nn.Module):
    def __init__(self, vocab_size, output_size, embedding_size=500, hidden_size=200, lstm_layers=4, is_bidirectional=True):
        super(Oracle, self).__init__()
        self.embedding_size = embedding_size
        self.vocab_size = vocab_size
        self.output_size = output_size
        self.hidden_size = hidden_size
        self.lstm_layers = lstm_layers
        self.is_bidirectional = is_bidirectional
        
        self.embedding = nn.Embedding(self.vocab_size, self.embedding_size)
        self.conv = nn.Conv2d(1, 1, [len(PARAMS['features'])*2, self.embedding_size])
        self.lstm = nn.LSTM(1, hidden_size, num_layers=self.lstm_layers, bidirectional=self.is_bidirectional)
        self.linear = nn.Linear(hidden_size if not self.is_bidirectional else hidden_size*2, self.output_size)
        self.dropout = nn.Dropout(PARAMS['oracle']['dropout'])
        self.relu = F.relu
    
    def forward(self, inputs, hidden=None):
        emb = self.embedding(inputs).unsqueeze(1)
        emb = self.dropout(emb)
        emb = self.conv(self.relu(emb)).squeeze(-1)
        
        outputs, hidden = self.lstm(self.relu(emb), hidden)
        outputs = self.dropout(outputs)
        outputs = self.linear(self.relu(outputs).squeeze(1))
        
        return F.softmax(outputs, dim=-1), hidden
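
Because the convolution kernel has the same height and width as its input, with no padding, it produces exactly one output value per step: a weighted sum of every embedding entry plus a bias. The pure-Python sketch below illustrates that arithmetic with toy sizes; `full_kernel_conv` and the numbers are hypothetical, and no PyTorch is involved.

```python
def full_kernel_conv(patch, kernel, bias=0.0):
    """Cross-correlate a kernel with an input of identical shape.

    With equal shapes and no padding there is only one valid position,
    so the result is a single scalar.
    """
    assert len(patch) == len(kernel)
    assert all(len(p_row) == len(k_row) for p_row, k_row in zip(patch, kernel))
    return bias + sum(
        p * k
        for p_row, k_row in zip(patch, kernel)
        for p, k in zip(p_row, k_row)
    )


# Toy sizes: 2 feature slots x 3 embedding dims
# (the oracle uses 6 slots x 500 dims with the parameters above)
step_embeddings = [[1.0, 0.0, 2.0],
                   [0.5, 1.0, 0.0]]
kernel = [[0.1, 0.2, 0.3],
          [0.4, 0.5, 0.6]]

value = full_kernel_conv(step_embeddings, kernel, bias=0.05)
print(round(value, 6))
```

One consequence of this design is that each parser step reaches the LSTM as a single number, so `nn.LSTM` is declared with an input size of 1.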

Create the model using specified parameters. The vocab size is the number of unique features and is used to generate the embeddings. The output size is the number of unique operators, which is 3.

In [71]:
VOCAB_SIZE = len(oracle_train_set.features['feature_to_id'])
ORACLE_OUTPUT_SIZE = len(oracle_train_set.operators['feature_to_id'])

oracle = Oracle(
    VOCAB_SIZE, 
    ORACLE_OUTPUT_SIZE, 
    PARAMS['oracle']['embedding'], 
    PARAMS['oracle']['hidden'],
    PARAMS['oracle']['n_layers'],
    PARAMS['oracle']['bidirectional']).to(device)

## Loss Function and Optimizer
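I'm using the most standard loss function and optimizer I could to keep things simple: cross-entropy loss with an Adam optimizer. I also tried a learning rate scheduler but wasn't able to match or beat these results with it. One caveat: `nn.CrossEntropyLoss` applies log-softmax to its input, and both models already return softmax probabilities, so the reported loss cannot fall much below about 0.55 for the three-class oracle even when accuracy is high.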

In [72]:
oracle_criterion = nn.CrossEntropyLoss().to(device)
oracle_optimizer = torch.optim.Adam(oracle.parameters(), lr=PARAMS['oracle']['lr'])
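
`nn.CrossEntropyLoss` expects raw logits and applies log-softmax itself, while the models here return softmax probabilities. The sketch below works out, with plain `math`, what that implies for the smallest loss value that can be reached; `ce_on_probs` and `loss_floor` are hypothetical helper names used only for this illustration.

```python
import math


def ce_on_probs(probs, target):
    """Cross-entropy with log-softmax applied to `probs` as if they were logits."""
    log_norm = math.log(sum(math.exp(p) for p in probs))
    return log_norm - probs[target]


def loss_floor(num_classes):
    """Lowest loss reachable when the inputs are probabilities in [0, 1].

    The best case is probability 1 on the target and 0 elsewhere, giving
    log(e + (K - 1)) - 1 = log(1 + (K - 1) / e).
    """
    return math.log(1 + (num_classes - 1) / math.e)


perfect_prediction = [1.0, 0.0, 0.0]  # fully confident and correct
best_case = ce_on_probs(perfect_prediction, target=0)
floor3 = loss_floor(3)
print(round(best_case, 4), round(floor3, 4))
```

A three-class floor of roughly 0.55 is consistent with the oracle loss values printed during training, which stay near 0.57 while accuracy is above 98%. Returning raw logits from `forward` would remove the floor without changing the `argmax` predictions.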

## Oracle Training

In [73]:
def train_oracle(model, loader, criterion, optimizer):
    running_loss = 0
    running_acc = 0
    
    model.train()
    for sentence in loader:
        optimizer.zero_grad()  # use the optimizer passed in, not the global one
        inputs, labels = sentence
        outputs, hidden = model(inputs.squeeze(0).to(device))
        
        loss = criterion(outputs, labels.squeeze().to(device))

        loss.backward()
        optimizer.step()
        
        running_loss += loss.item()
        running_acc += torch.sum(torch.argmax(outputs, dim=-1).cpu() == labels.squeeze()).item() / labels.squeeze().shape[0]
        torch.cuda.empty_cache()
        
    return running_loss / len(loader), running_acc / len(loader)

## Oracle Evaluation

In [74]:
def eval_oracle(model, loader, criterion):
    running_loss = 0
    running_acc = 0
    
    model.eval()
    with torch.no_grad():  # evaluation needs no gradient tracking
        for sentence in loader:
            inputs, labels = sentence
            outputs, hidden = model(inputs.squeeze(0).to(device))

            loss = criterion(outputs, labels.squeeze().to(device))

            running_loss += loss.item()
            running_acc += torch.sum(torch.argmax(outputs, dim=-1).cpu() == labels.squeeze()).item() / labels.squeeze().shape[0]
            torch.cuda.empty_cache()
        
    return running_loss / len(loader), running_acc / len(loader)
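
Both functions above average the per-sentence accuracies, so a short sentence counts as much as a long one. The sketch below contrasts that with a per-operator average on toy data; the function names and the numbers are hypothetical.

```python
def macro_accuracy(sentences):
    """Mean of per-sentence accuracies (the averaging used by running_acc)."""
    per_sentence = [
        sum(p == g for p, g in zip(pred, gold)) / len(gold)
        for pred, gold in sentences
    ]
    return sum(per_sentence) / len(per_sentence)


def micro_accuracy(sentences):
    """Fraction of all individual predictions that are correct."""
    correct = sum(p == g for pred, gold in sentences for p, g in zip(pred, gold))
    total = sum(len(gold) for _, gold in sentences)
    return correct / total


# (predicted operator ids, gold operator ids) for two toy sentences
toy = [
    ([0, 0], [0, 1]),                                       # 1 of 2 correct
    ([0, 1, 2, 0, 1, 2, 0, 1], [0, 1, 2, 0, 1, 2, 0, 1]),   # 8 of 8 correct
]
print(macro_accuracy(toy), micro_accuracy(toy))
```

The two figures can differ noticeably when sentence lengths vary, which is worth keeping in mind when comparing against numbers reported elsewhere.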

The highest accuracy I was able to achieve predicting operators with the oracle model is **98.4%** on the validation set.

In [77]:
for epoch in range(PARAMS['oracle']['epochs']):
    train_loss, train_acc = train_oracle(oracle, oracle_train_loader, oracle_criterion, oracle_optimizer)
    val_loss, val_acc = eval_oracle(oracle, oracle_val_loader, oracle_criterion)
    
    print("EPOCH {}".format(epoch+1))
    print("Train Loss:\t{:.3f}\t|  Train Accuracy:\t{:.2f}%".format(train_loss, train_acc*100))
    print("Val. Loss:\t{:.3f}\t|  Val. Accuracy:\t{:.2f}%".format(val_loss, val_acc*100))
    print("\n")

EPOCH 16
Train Loss:	0.570	|  Train Accuracy:	98.15%
Val. Loss:	0.571	|  Val. Accuracy:	98.01%


EPOCH 17
Train Loss:	0.569	|  Train Accuracy:	98.19%
Val. Loss:	0.570	|  Val. Accuracy:	98.16%


EPOCH 18
Train Loss:	0.568	|  Train Accuracy:	98.26%
Val. Loss:	0.569	|  Val. Accuracy:	98.24%


EPOCH 19
Train Loss:	0.568	|  Train Accuracy:	98.30%
Val. Loss:	0.567	|  Val. Accuracy:	98.40%


EPOCH 20
Train Loss:	0.567	|  Train Accuracy:	98.38%
Val. Loss:	0.568	|  Val. Accuracy:	98.33%




## Oracle Test Results (Not UAS/LAS)
This is the accuracy for the operators, not the dependency heads.

In [93]:
test_loss, test_acc = eval_oracle(oracle, oracle_test_loader, oracle_criterion)
print("Test Loss:\t{:.3f}\t|  Test Accuracy:\t{:.2f}%".format(test_loss, test_acc*100))

Test Loss:	0.567	|  Test Accuracy:	98.44%


## Dependency Relation Dataset
The dependency dataset is much simpler than the operator dataset in its behavior. It still does have the operator functions to clean the data of any non-projective trees, but its actual outputs are simply tensors of the feature IDs for each token's features, with the label of their dependency relation in sequential order.

In [78]:
class DepRelDataset(Dataset):
    def __init__(self, filename, features=None, dependency_labels=None):
        # Read the whole file and close the handle when done
        with open(filename, "r", encoding="utf-8") as file:
            self.raw = file.read()
        self.operators = self.construct_id_dict(['shift', 'rightArc', 'leftArc'])
        self.parsed = self.clean_data(parse(self.raw))
        if features is None:
            self.features = self.get_features()
            self.dependency_labels = self.get_dependency_labels()
        else:
            self.features = features
            self.dependency_labels = dependency_labels

    def __len__(self):
        return len(self.parsed)

    def __getitem__(self, idx):
        token_list = self.parsed[idx]
        inputs_dict = {feature: [] for feature in PARAMS['features']}
        dependency_labels = []
        for token in token_list:
            for feature in PARAMS['features']:
                if token[feature] not in self.features['feature_to_id']:
                    feature_id = self.features['feature_to_id']['<unk>']
                else:
                    feature_id = self.features['feature_to_id'][token[feature]]
                inputs_dict[feature].append(feature_id)
            if token['deprel'] not in self.dependency_labels['feature_to_id']:
                dep_id = self.dependency_labels['feature_to_id']['<unk>']
            else:
                dep_id = self.dependency_labels['feature_to_id'][token['deprel']]
            dependency_labels.append(dep_id)
        
        inputs = [feature for feature in inputs_dict.values()]
        return torch.LongTensor(inputs), torch.LongTensor(dependency_labels)
    
    def get_features(self):
        assert hasattr(self, 'parsed') and self.parsed is not None
        features = set(['<unk>','<pad>'])
        for sentence in self.parsed:
            for token in sentence:
                for feature in PARAMS['features']:
                    features.add(token[feature])
        return self.construct_id_dict(features)
    
    def get_dependency_labels(self):
        assert hasattr(self, 'parsed') and self.parsed is not None
        dependency_labels = set(['<unk>','<pad>'])
        for sentence in self.parsed:
            for token in sentence:
                dependency_labels.add(token['deprel'])
        return self.construct_id_dict(dependency_labels)
    
    def label_operator(self, stack, remaining_tokens):
        if len(stack) >= 2 and stack[1]['head'] == stack[0]['id']:
            stack.pop(1)
            return stack, remaining_tokens, 'leftArc'
        elif len(stack) >= 2 and stack[1]['id'] == stack[0]['head'] and stack[0]['id'] not in [tok['head'] for tok in remaining_tokens]:
            stack.pop(0)
            return stack, remaining_tokens, 'rightArc'
        else:
            stack.insert(0, remaining_tokens.pop(0))
            return stack, remaining_tokens, 'shift'

    def get_operators(self, tokens: list):
        token_list = tokens.copy()
        operators = []
        stack = [{'head': None, 'id': 0}]
        while len(stack) > 1 or len(token_list) > 0:
            stack, token_list, op = self.label_operator(stack, token_list)
            operators.append(op)
        return [self.operators['feature_to_id'][op] for op in operators]

    @staticmethod
    def construct_id_dict(set_of_features: set):
        features = {
            'feature_to_id': {feature: i for i, feature in enumerate(set_of_features)},
            'id_to_feature': {i: feature for i, feature in enumerate(set_of_features)}
        }
        return features
    
    def clean_data(self, data: list) -> list:
        clean_data = []
        for sentence in data:
            bad_sentence = False
            # catch nonprojective trees: the oracle runs out of tokens to shift and raises
            try:
                self.get_operators(sentence)
            except Exception:
                bad_sentence = True
            if not bad_sentence:
                clean_data.append(sentence)
        return clean_data
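
The filter above detects non-projective trees indirectly, by letting the oracle fail. A direct check for crossing arcs is sketched below as a point of comparison, assuming 1-based token ids with 0 as ROOT; `is_projective` is a hypothetical helper, not part of the notebook.

```python
def is_projective(heads):
    """Return True when no two dependency arcs cross.

    `heads[i]` is the head id of token i + 1; id 0 stands for ROOT,
    which is treated as sitting at position 0.
    """
    arcs = [(min(head, dep), max(head, dep))
            for dep, head in enumerate(heads, start=1)]
    for a_lo, a_hi in arcs:
        for b_lo, b_hi in arcs:
            # arc b starts strictly inside arc a and ends strictly outside it
            if a_lo < b_lo < a_hi < b_hi:
                return False
    return True


print(is_projective([2, 3, 0]))     # El -> gat -> dorm -> ROOT
print(is_projective([3, 4, 0, 3]))  # arcs 1-3 and 2-4 cross
```

The check is quadratic in sentence length, which is fine for a one-off filter. Unlike the `try`/`except` approach, it cannot hide an unrelated error behind the same "bad sentence" flag.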

As before, the validation and test sets use the training set's features, and in this case its dependency relation labels as well. Sharing the label set means that any label in the validation or test data that was not seen during training is mapped to `<unk>`.

In [94]:
deprel_train_set = DepRelDataset(PARAMS['train'])
deprel_val_set = DepRelDataset(PARAMS['dev'], deprel_train_set.features, deprel_train_set.dependency_labels)
deprel_test_set = DepRelDataset(PARAMS['test'], deprel_train_set.features, deprel_train_set.dependency_labels)

In [95]:
deprel_train_loader = DataLoader(deprel_train_set)
deprel_val_loader = DataLoader(deprel_val_set)
deprel_test_loader = DataLoader(deprel_test_set)

## Dependency Relation Model
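The `DependencyModel` is almost identical to the oracle. The differences are that the convolution kernel spans one token's features instead of two, and that `forward` neither adds a channel dimension nor takes or returns a hidden state. I kept the two classes separate so each can be customized for its specific task.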

In [81]:
class DependencyModel(nn.Module):
    def __init__(self, vocab_size, output_size, embedding_size=500, hidden_size=200, lstm_layers=4, is_bidirectional=True):
        super(DependencyModel, self).__init__()
        self.embedding_size = embedding_size
        self.vocab_size = vocab_size
        self.output_size = output_size
        self.hidden_size = hidden_size
        self.lstm_layers = lstm_layers
        self.is_bidirectional = is_bidirectional
        
        self.embedding = nn.Embedding(self.vocab_size, self.embedding_size)
        self.conv = nn.Conv2d(1, 1, [len(PARAMS['features']), self.embedding_size])
        self.lstm = nn.LSTM(1, hidden_size, num_layers=self.lstm_layers, bidirectional=self.is_bidirectional)
        self.linear = nn.Linear(hidden_size if not self.is_bidirectional else hidden_size*2, self.output_size)
        self.dropout = nn.Dropout(PARAMS['deprel']['dropout'])
        self.relu = F.relu
    
    def forward(self, inputs):
        emb = self.embedding(inputs)
        emb = self.dropout(emb)
        emb = self.conv(self.relu(emb)).squeeze(-1)
        
        outputs, (hidden, cell) = self.lstm(self.relu(emb))
        outputs = self.dropout(outputs)
        outputs = self.linear(self.relu(outputs).squeeze(1))
        return F.softmax(outputs, dim=-1)

Declare the `DependencyModel` just like the oracle and using the same features. This time use the dependency relation labels as outputs.

In [84]:
VOCAB_SIZE = len(deprel_train_set.features['feature_to_id'])
DEP_OUTPUT_SIZE = len(deprel_train_set.dependency_labels['feature_to_id'])

dependency_model = DependencyModel(
    VOCAB_SIZE, 
    DEP_OUTPUT_SIZE, 
    PARAMS['deprel']['embedding'], 
    PARAMS['deprel']['hidden'],
    PARAMS['deprel']['n_layers'],
    PARAMS['deprel']['bidirectional']).to(device)

## Loss Function and Optimizer
Same as with the oracle.

In [85]:
dependency_criterion = nn.CrossEntropyLoss().to(device)
dependency_optimizer = torch.optim.Adam(dependency_model.parameters(), lr=PARAMS['deprel']['lr'])

## Train Dependency Relation Model
This function differs slightly from the oracle training loop because the inputs arrive as `(1, n_features, n_tokens)` and are permuted to `(n_tokens, 1, n_features)` before the forward pass.

In [86]:
def train_deprel(model, loader, criterion, optimizer):
    running_acc = 0
    running_loss = 0

    model.train()
    for sentence in loader:
        optimizer.zero_grad()
        
        inputs, labels = sentence
        
        outputs = model(inputs.permute(2,0,1).to(device))
        loss = criterion(outputs.squeeze(), labels.squeeze().to(device))
        
        loss.backward()
        optimizer.step()
        
        running_loss += loss.item()
        running_acc += torch.sum(torch.argmax(outputs, dim=-1).cpu() == labels.squeeze()).item() / labels.squeeze().shape[0]

        torch.cuda.empty_cache()
    
    return running_loss / len(loader), running_acc / len(loader)
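
The dimension bookkeeping can be traced without running the model. The sketch below follows the shapes implied by the code above, assuming stride 1, no padding, and one output channel for the convolution; `permute` and `conv2d_out` are hypothetical helpers that operate on shape tuples only, not tensors.

```python
def permute(shape, dims):
    """Shape that tensor.permute(*dims) would produce."""
    return tuple(shape[d] for d in dims)


def conv2d_out(shape, kernel):
    """Output shape of a 1-channel Conv2d with stride 1 and no padding."""
    n, _, height, width = shape
    return (n, 1, height - kernel[0] + 1, width - kernel[1] + 1)


n_features, n_tokens, emb = 3, 7, 500

batched = (1, n_features, n_tokens)              # DataLoader adds a batch dim of 1
model_in = permute(batched, (2, 0, 1))           # (n_tokens, 1, n_features)
embedded = model_in + (emb,)                     # nn.Embedding appends the embedding dim
reduced = conv2d_out(embedded, (n_features, emb))  # one value per token
lstm_in = reduced[:-1]                           # squeeze(-1): (seq_len, batch, input_size)

print(model_in, embedded, reduced, lstm_in)
```

Read this way, the token axis becomes the convolution's batch axis and then the LSTM's sequence axis, with a batch of one sentence.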

## Evaluate Dependency Relation Model

In [87]:
def eval_deprel(model, loader, criterion):
    running_acc = 0
    running_loss = 0
    
    model.eval()
    with torch.no_grad():  # evaluation needs no gradient tracking
        for sentence in loader:
            inputs, labels = sentence

            outputs = model(inputs.permute(2,0,1).to(device))
            loss = criterion(outputs.squeeze(), labels.squeeze().to(device))

            running_loss += loss.item()
            running_acc += torch.sum(torch.argmax(outputs, dim=-1).cpu() == labels.squeeze()).item() / labels.squeeze().shape[0]

            torch.cuda.empty_cache()
    
    return running_loss / len(loader), running_acc / len(loader)

In [88]:
for epoch in range(PARAMS['deprel']['epochs']):
    train_loss, train_acc = train_deprel(dependency_model, deprel_train_loader, dependency_criterion, dependency_optimizer)
    val_loss, val_acc = eval_deprel(dependency_model, deprel_val_loader, dependency_criterion)
    
    print("EPOCH {}".format(epoch+1))
    print("Train Loss:\t{:.3f}\t|  Train Accuracy:\t{:.2f}%".format(train_loss, train_acc*100))
    print("Val. Loss:\t{:.3f}\t|  Val. Accuracy:\t{:.2f}%".format(val_loss, val_acc*100))
    print("\n")

EPOCH 1
Train Loss:	3.029	|  Train Accuracy:	57.59%
Val. Loss:	2.928	|  Val. Accuracy:	67.77%


EPOCH 2
Train Loss:	2.906	|  Train Accuracy:	69.79%
Val. Loss:	2.888	|  Val. Accuracy:	71.60%


EPOCH 3
Train Loss:	2.884	|  Train Accuracy:	72.00%
Val. Loss:	2.876	|  Val. Accuracy:	72.74%


EPOCH 4
Train Loss:	2.868	|  Train Accuracy:	73.56%
Val. Loss:	2.856	|  Val. Accuracy:	74.78%


EPOCH 5
Train Loss:	2.863	|  Train Accuracy:	74.14%
Val. Loss:	2.825	|  Val. Accuracy:	77.87%


EPOCH 6
Train Loss:	2.828	|  Train Accuracy:	77.58%
Val. Loss:	2.814	|  Val. Accuracy:	78.94%


EPOCH 7
Train Loss:	2.795	|  Train Accuracy:	80.87%
Val. Loss:	2.839	|  Val. Accuracy:	76.47%


EPOCH 8
Train Loss:	2.786	|  Train Accuracy:	81.80%
Val. Loss:	2.793	|  Val. Accuracy:	81.08%


EPOCH 9
Train Loss:	2.780	|  Train Accuracy:	82.40%
Val. Loss:	2.769	|  Val. Accuracy:	83.40%


EPOCH 10
Train Loss:	2.776	|  Train Accuracy:	82.77%
Val. Loss:	2.760	|  Val. Accuracy:	84.40%


EPOCH 11
Train Loss:	2.769	|  Train Acc

## Dependency Relation Results

In [96]:
test_loss, test_acc = eval_deprel(dependency_model, deprel_test_loader, dependency_criterion)
print("Test Loss:\t{:.3f}\t|  Test Accuracy:\t{:.2f}%".format(test_loss, test_acc*100))

Test Loss:	2.747	|  Test Accuracy:	85.60%


## Evaluate Dependency Heads Using Operators
The `get_head_corrects` function performs the evaluation of the tokens using the operators and compares the predicted heads to the ground truth labels found in the tokens.

In [97]:
def get_head_corrects(tokens, operator_preds: torch.Tensor, operator_dict: dict):
    remaining_tokens = tokens.copy()
    stack = [{'form': '<ROOT>', 'lemma': '<ROOT>', 'upos': '<ROOT>', 'head': None, 'id': 0}]
    corrects = []
    idx = 0
    while (len(stack) > 1 or len(remaining_tokens) > 0) and idx < operator_preds.shape[0]:
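        if len(remaining_tokens) == 0:
            # nothing left to shift: choose the better-scoring arc operator (ids 1 and 2)
            operator = torch.argmax(operator_preds[idx][1:]).item()+1
        elif len(stack) < 2:
            # fewer than two items on the stack: only shift (id 0) is possible
            operator = 0
        else:
            operator = torch.argmax(operator_preds[idx]).item()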
        
        if operator_dict[operator] == "shift":
            stack.insert(0, remaining_tokens.pop(0))
        elif operator_dict[operator] == "leftArc":
            corrects.append(stack[0]['id'] == stack[1]['head'])
            stack.pop(1)
        elif operator_dict[operator] == "rightArc":
            corrects.append(stack[1]['id'] == stack[0]['head'])
            stack.pop(0)
        idx += 1
    return corrects
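
The same replay logic can be written so that it returns the predicted head of each token rather than a list of hits, which makes the result easy to inspect. This is a sketch with hypothetical names (`heads_from_ops`), working on token ids only. The guard that stops ROOT from being attached as a dependent is an extra constraint added here, not something the function above enforces.

```python
def heads_from_ops(n_tokens, ops):
    """Replay operators and return {dependent_id: head_id}.

    Token ids run from 1 to n_tokens; 0 is ROOT. Index 0 of `stack`
    is the top of the stack. Operators that are impossible in the
    current state are skipped.
    """
    stack = [0]
    buffer = list(range(1, n_tokens + 1))
    heads = {}
    for op in ops:
        if op == 'shift' and buffer:
            stack.insert(0, buffer.pop(0))
        elif op == 'leftArc' and len(stack) >= 2 and stack[1] != 0:
            heads[stack[1]] = stack[0]  # element below the top attaches to the top
            stack.pop(1)
        elif op == 'rightArc' and len(stack) >= 2:
            heads[stack[0]] = stack[1]  # top attaches to the element below it
            stack.pop(0)
    return heads


predicted = heads_from_ops(3, ['shift', 'shift', 'leftArc', 'shift', 'leftArc', 'rightArc'])
print(predicted)
```

Keying the result by token id also keeps head predictions aligned with per-token outputs such as dependency labels, regardless of the order in which arcs are created.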

## Score Function for LAS and UAS
The `score` function uses the predicted heads and dependency relations to determine the "Labeled Attachment Score" and the "Unlabeled Attachment Score".

In [98]:
def score(oracle_set, deprel_set):
    running_las = 0
    running_uas = 0
    
    assert len(oracle_set) == len(deprel_set)

    for idx, sentence in enumerate(oracle_set.parsed):
        deprel_inputs, deprel_labels = deprel_set[idx]
        deprel_preds = dependency_model(deprel_inputs.unsqueeze(0).permute(2,0,1).to(device))
        deprel_corrects = (torch.argmax(deprel_preds, dim=-1).cpu() == deprel_labels.squeeze()).tolist()

        oracle_inputs, _ = oracle_set[idx]
        oracle_preds = oracle(oracle_inputs.to(device))[0]
        oracle_corrects = get_head_corrects(sentence, oracle_preds.cpu(), oracle_set.operators['id_to_feature'])

        running_uas += sum(oracle_corrects) / max(1, len(oracle_corrects))
        assert len(oracle_corrects) == len(deprel_corrects)
        running_las += sum([oracle_correct and deprel_corrects[i] for i, oracle_correct in enumerate(oracle_corrects)]) / max(1, len(oracle_corrects))
    
    return running_uas / len(oracle_set.parsed), running_las / len(oracle_set.parsed)
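
For reference, UAS and LAS are usually defined per token: UAS counts tokens whose predicted head is correct, and LAS counts tokens whose head and label are both correct. The sketch below computes both on a toy sentence with predictions keyed by token id; `attachment_scores` and the data are hypothetical. Two things to keep in mind when comparing with `score` above: it averages per sentence, and `get_head_corrects` appends results in the order arcs are created, whereas the label hits are in token order.

```python
def attachment_scores(gold, pred):
    """Return (UAS, LAS) over the tokens of one sentence.

    `gold` and `pred` map token id -> (head id, dependency label).
    Tokens missing from `pred` count as wrong.
    """
    total = len(gold)
    uas_hits = sum(
        pred.get(tok, (None, None))[0] == head
        for tok, (head, _) in gold.items()
    )
    las_hits = sum(
        pred.get(tok) == (head, label)
        for tok, (head, label) in gold.items()
    )
    return uas_hits / total, las_hits / total


gold = {1: (2, 'det'), 2: (3, 'nsubj'), 3: (0, 'root')}
pred = {1: (2, 'det'), 2: (3, 'obj'), 3: (2, 'root')}

uas, las = attachment_scores(gold, pred)
print(round(uas, 3), round(las, 3))
```

Token 2 has the right head but the wrong label, so it counts toward UAS only. Token 3 has the right label but the wrong head, so it counts toward neither.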

In [99]:
UAS_train, LAS_train = score(oracle_train_set, deprel_train_set)
UAS_val, LAS_val = score(oracle_val_set, deprel_val_set)
UAS_test, LAS_test = score(oracle_test_set, deprel_test_set)
print("Train Unlabeled Attachment Score:\t{:.2f}\t|  Train Labeled Attachment Score:\t{:.2f}".format(UAS_train*100, LAS_train*100))
print("Val. Unlabeled Attachment Score:\t{:.2f}\t|  Val. Labeled Attachment Score:\t{:.2f}".format(UAS_val*100, LAS_val*100))
print("Test Unlabeled Attachment Score:\t{:.2f}\t|  Test Labeled Attachment Score:\t{:.2f}".format(UAS_test*100, LAS_test*100))

Train Unlabeled Attachment Score:	86.82	|  Train Labeled Attachment Score:	75.42
Val. Unlabeled Attachment Score:	83.06	|  Val. Labeled Attachment Score:	71.96
Test Unlabeled Attachment Score:	85.60	|  Test Labeled Attachment Score:	73.93


My final test UAS was **85.60**, compared to the 87.3 reported [here](https://core.ac.uk/download/pdf/78635826.pdf). The LAS was **73.93**, compared to their reported 84.9.