## 11411/611 – NLP (Spring 23)
## HW7 – Dependency Parsing

Deadline: April 20, 2023 at 11:59pm EST


Dependency trees are representations used for the syntactic analysis of sentences. To build a dependency tree, we decide for each word which other word 
it depends on (its head) and what relationship the two share.

This is done using the arc-standard system, in which each parse state is a configuration $\mathcal{C} = \{\sigma, \beta, \alpha\}$, where $\sigma$ is the stack of processed words, $\beta$ is the buffer containing the remaining unprocessed words, and $\alpha$ is the set of dependency arcs produced by the actions performed so far. Such a parser was described in lecture, along with an example of a sentence being parsed using this system.
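
To make the three transitions concrete, here is a minimal sketch of an arc-standard parser state. The `Configuration` class, the `apply_action` helper, the toy sentence, and its action sequence are all illustrative assumptions and are not taken from the provided `src/` code, which may represent configurations differently.

```python
from dataclasses import dataclass, field


@dataclass
class Configuration:
    stack: list   # sigma: processed words, top of the stack is the last element
    buffer: list  # beta: unprocessed words, the next word is the first element
    arcs: set = field(default_factory=set)  # (head, label, dependent) triples


def apply_action(config, action):
    """Apply one arc-standard transition to the configuration in place."""
    name, _, label = action.partition(":")
    if name == "SHIFT":
        # Move the next buffer word onto the stack.
        config.stack.append(config.buffer.pop(0))
    elif name == "LEFT-ARC":
        # The second-from-top word becomes a dependent of the top word.
        dependent = config.stack.pop(-2)
        config.arcs.add((config.stack[-1], label, dependent))
    elif name == "RIGHT-ARC":
        # The top word becomes a dependent of the second-from-top word.
        dependent = config.stack.pop()
        config.arcs.add((config.stack[-1], label, dependent))
    else:
        raise ValueError(f"unknown action: {action}")


# Hypothetical toy sentence and gold action sequence.
config = Configuration(stack=["<root>"], buffer=["She", "reads", "books"])
actions = ["SHIFT", "SHIFT", "LEFT-ARC:nsubj", "SHIFT", "RIGHT-ARC:obj", "RIGHT-ARC:root"]
for action in actions:
    apply_action(config, action)

print(sorted(config.arcs))
```

Parsing ends when the buffer is empty and only `<root>` remains on the stack; every word has then received exactly one head.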

Borrowing from a paper by Danqi Chen and Christopher D. Manning from Stanford, 
titled ["A Fast and Accurate Dependency Parser using Neural Networks"](https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=&ved=2ahUKEwiswpujhYL-AhVVVTUKHa3KB2QQFnoECBsQAQ&url=https%3A%2F%2Faclanthology.org%2FD14-1082.pdf&usg=AOvVaw2ORPPa4A6vPNJQIXMYmhyJ), we will be using features that describe the configuration at a given step to train a Neural Network to predict the next action.

It may seem like there is a lot of information before your tasks actually begin. However, this is to clearly explain what the dependencies are doing to the data and to make clear which aspect of dependency parsing we are actually implementing. 

In [None]:
#@title Installing dependencies { display-mode: "form" }
! git clone https://github.com/sparekh9/nlp_hw_dep
%cd nlp_hw_dep/

! pip install 2to3
%mkdir data outputs
! 2to3 --write --nobackups -x import src/*.py

Cloning into 'nlp_hw_dep'...
remote: Enumerating objects: 188, done.
remote: Counting objects: 100% (144/144), done.
remote: Compressing objects: 100% (97/97), done.
remote: Total 188 (delta 55), reused 107 (delta 45), pack-reused 44
Receiving objects: 100% (188/188), 6.17 MiB | 4.13 MiB/s, done.
Resolving deltas: 100% (70/70), done.
/content/nlp_hw_dep
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting 2to3
  Downloading 2to3-1.0-py3-none-any.whl (1.7 kB)
Installing collected packages: 2to3
Successfully installed 2to3-1.0
RefactoringTool: Skipping optional fixer: buffer
RefactoringTool: Skipping optional fixer: idioms
RefactoringTool: Skipping optional fixer: set_literal
RefactoringTool: Skipping optional fixer: ws_comma
RefactoringTool: No changes to src/configuration.py
RefactoringTool: No changes to src/decoder.py
RefactoringTool: No changes to src/depModel.py
RefactoringTool: Refactored src/eval.py
--- src/eval

In [None]:
import os
import torch
import numpy as np
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
import torch.optim as optim

# Preprocessing

A UD (Universal Dependencies) treebank is a collection of sentences that have been parsed by humans. We will be using the [GUM treebank](https://gucorpling.org/gum/annotations.html) to extract training data. Specifically, GUM's parses were produced with the Stanza package and then corrected by human collaborators.

The treebank is formatted as `.conll` files, which are the standard format for parsing. Feel free to look through any of the files available [here](https://github.com/amir-zeldes/gum/tree/master/dep) in the GitHub repository for the GUM treebank. 
These files specify information about each sentence, such as the words, the stems of the words, the POS-tags, the head of each word (the word it is a dependent of, i.e. where the arc begins) and the nature of the dependence.
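
As a rough illustration of how one token line can be read, the sketch below assumes the ten tab-separated columns of the CoNLL-U layout (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC). The `Token` type, the `parse_token_line` helper, and the example line are made up for this sketch; the `.conll` files shipped with the homework may order their columns differently, so check a file before relying on these indices.

```python
from typing import NamedTuple


class Token(NamedTuple):
    index: int   # position of the word in the sentence (1-based)
    form: str    # the word itself
    lemma: str   # stem / lemma of the word
    upos: str    # POS tag
    head: int    # index of the head word, 0 for the root
    deprel: str  # dependency label


def parse_token_line(line):
    """Parse one token line, assuming the CoNLL-U column order."""
    cols = line.rstrip("\n").split("\t")
    return Token(
        index=int(cols[0]),
        form=cols[1],
        lemma=cols[2],
        upos=cols[3],
        head=int(cols[6]),
        deprel=cols[7],
    )


# Hypothetical token line, not copied from the treebank.
example = "2\treads\tread\tVERB\tVBZ\t_\t0\troot\t_\t_"
token = parse_token_line(example)
print(token)
```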

The sentences have been categorized into a variety of topics. We have taken an 80-20 split over all of these topics to create the `train.conll` and `dev.conll` files.



This data needs to be transformed into configurations, i.e. usable training data. The code provided parses each sentence using the information in the `.conll` files
and captures the configuration of the parser at each step, producing the data needed to train our model, including the gold action that was performed: SHIFT, or one of LEFT-ARC or RIGHT-ARC with the corresponding dependency label. We will be using 52 features: 20 word features, 20 POS features, and 12 dependency-label features. These are our input vectors, i.e. our `X`, and the gold action is the output, i.e. the `y` we are trying to predict.




In [None]:
# Trees to Instances (Data Generation)
! python src/gen.py trees/train.conll data/train.data
! python src/gen.py trees/dev.conll data/dev.data

100...200...300...400...500...600...700...800...900...1000...1100...1200...1300...1400...1500...1600...1700...1800...1900...2000...2100...2200...2300...2400...2500...2600...2700...2800...2900...3000...3100...3200...3300...3400...3500...3600...3700...3800...3900...4000...4100...4200...4300...4400...4500...4600...4700...4800...4900...5000...5100...5200...5300...5400...5500...5600...5700...5800...5900...6000...6100...6200...6300...6400...6500...6600...6700...6800...6900...7000...7100...7200...7300...7400...7500...7600...7700...7800...7900...8000...8100...8200...8300...done!
100...200...300...400...500...600...700...800...900...1000...1100...1200...1300...1400...1500...1600...1700...1800...1900...2000...2100...2200...2300...2400...2500...2600...2700...2800...2900...3000...3100...3200...done!


The following line of code generates integer mappings for the words, POS-tags, dependency labels, and actions that appear in the parser configurations; these mappings will be used to train the model. This is necessary since the data is currently in the form of strings, which cannot be fed directly into a model. The mappings will be stored in the `data` folder as `vocabs.*` files (`vocabs.word`, `vocabs.pos`, `vocabs.labels`, and `vocabs.actions`).
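
The idea of an integer mapping can be sketched in a few lines. The `build_vocab` and `encode` helpers below are hypothetical and only illustrate the concept; `src/gen_vocab.py` may assign ids in a different order and handle special tokens differently.

```python
def build_vocab(tokens, specials=("<unk>",)):
    """Assign consecutive integer ids, reserving the first ids for special tokens."""
    vocab = {tok: i for i, tok in enumerate(specials)}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def encode(tokens, vocab, unk="<unk>"):
    """Map strings to ids, falling back to the unknown id for unseen strings."""
    return [vocab.get(tok, vocab[unk]) for tok in tokens]


# Toy POS-tag sequence used only for illustration.
pos_vocab = build_vocab(["DET", "NOUN", "VERB", "DET", "NOUN"])
ids = encode(["DET", "NOUN", "ADJ"], pos_vocab)
print(pos_vocab, ids)
```

Here `ADJ` never appeared when the vocabulary was built, so it is encoded with the id reserved for `<unk>`.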

In [None]:
# Vocab Creation
! python src/gen_vocab.py trees/train.conll data/vocabs

We use these mappings to convert the string for a configuration into a vector of integers by defining a class `DataSamples` that inherits from the PyTorch class 
`Dataset`. You may look through the code by double-clicking the following cell.

In [None]:
#@title Converting configuration strings into integer vectors
# Data Loader
def remap_vocab(vocab, null_tok="<null>"):
    # we want to map null/unk to 0
    id2vocab = [null_tok] + [v for v in vocab if v != null_tok]
    vocab = {v: i for i, v in enumerate(id2vocab)}
    return vocab

def print_dict(d):
  for i, e in enumerate(d):
    if i == 0:
      print("{  '" + e + "': " + str(d[e]) + ",")
    else:
      print("   '" + e + "': " + str(d[e]) + ",")
    if i == 5: break
  print("   ...")
  print("}")

class DataSamples(torch.utils.data.Dataset):
    def __init__(self, data_path, partition="train"):
        self.data_path = data_path
        self.partition = partition
        self.create_dictionaries()
        with open(self.data_path + partition + ".data", "r") as f:
            self.raw_data = f.readlines()
        self.data = [a.split(" ") for a in self.raw_data]
        self.create_vectors()

    def __len__(self):
        return len(self.data)

    def __getitem__(self, ind):
        return torch.from_numpy(self.data[ind])

    # generating 53-length vector for a given configuration
    def create_vectors(self):
        abset_pos_count = 0
        abset_poss = set()
        abset_label_count = 0
        abset_labels = set()
        abset_action_count = 0
        abset_actions = set()

        for ind, inst in enumerate(self.data):
            for i in range(len(inst)):
                if i < 20:
                    if inst[i] in self.word_dict:
                        inst[i] = self.word_dict[inst[i]]
                    else:
                        inst[i] = self.word_dict["<unk>"]
                elif 20 <= i < 40:
                    if inst[i] in self.pos_dict:
                        inst[i] = self.pos_dict[inst[i]]
                    else:
                        abset_pos_count += 1
                        abset_poss.add(inst[i])
                        inst[i] = self.pos_dict["<null>"]
                elif 40 <= i < 52:
                    if inst[i] in self.label_dict:
                        inst[i] = self.label_dict[inst[i]]
                    else:
                        abset_label_count += 1
                        abset_labels.add(inst[i])
                        inst[i] = self.label_dict["<null>"]
                else:
                    if inst[i] in self.action_dict:
                        inst[i] = self.action_dict[inst[i]]
                    elif inst[i][:-1] in self.action_dict:
                        inst[i] = self.action_dict[inst[i][:-1]]
                    else:
                        abset_action_count += 1
                        abset_actions.add(inst[i][:-1])
                        inst[i] = self.action_dict[
                            inst[i][:-1].split(":")[0] + ":<null>"
                        ]
        self.data = np.array(self.data)

    def create_dictionaries(self):
        with open(self.data_path + "vocabs.word", "r") as f:
            words = f.readlines()
        words = [a.split(" ") for a in words]

        with open(self.data_path + "vocabs.pos", "r") as f:
            poss = f.readlines()
        poss = [a.split(" ") for a in poss]

        with open(self.data_path + "vocabs.labels", "r") as f:
            labels = f.readlines()
        labels = [a.split(" ") for a in labels]

        with open(self.data_path + "vocabs.actions", "r") as f:
            actions = f.readlines()
        actions = [a.split(" ") for a in actions]

        self.word_dict = remap_vocab({a[0]: int(a[1]) for a in words}, "<unk>")
        self.pos_dict = remap_vocab({a[0]: int(a[1]) for a in poss}, "<null>")
        self.label_dict = remap_vocab({a[0]: int(a[1]) for a in labels}, "<null>")
        self.action_dict = {a[0]: int(a[1]) for a in actions}

In [None]:
train = DataSamples("data/", partition="train")
dev = DataSamples("data/", partition="dev")

After running this, we have 4 dictionaries that contain the integer mappings. These can be accessed through `train.word_dict`, `train.pos_dict`, `train.label_dict`, and `train.action_dict`. Have a look at what some of the elements in these look like.



This dictionary maps each word present in the sentences of the original dataset to an integer ID.

In [None]:
print_dict(train.word_dict)

{  '<unk>': 0,
   '<null>': 1,
   '<root>': 2,
   '1': 3,
   'Introduction': 4,
   'and': 5,
   ...
}


This dictionary maps each POS-tag present in the sentences of the original dataset to an integer ID.

In [None]:
print_dict(train.pos_dict)

{  '<null>': 0,
   'ADJ': 1,
   'SYM': 2,
   'VERB': 3,
   'PART': 4,
   'AUX': 5,
   ...
}


This dictionary maps each dependency label present in the original dataset to an integer ID.

In [None]:
print_dict(train.label_dict)

{  '<null>': 0,
   'obl:agent': 1,
   'det': 2,
   'advmod': 3,
   'reparandum': 4,
   'punct': 5,
   ...
}


This dictionary maps each gold-standard action present in the original dataset to an integer ID.

In [None]:
print_dict(train.action_dict)

{  'SHIFT': 0,
   'LEFT-ARC:obl:agent': 1,
   'LEFT-ARC:det': 2,
   'LEFT-ARC:advmod': 3,
   'LEFT-ARC:reparandum': 4,
   'LEFT-ARC:punct': 5,
   ...
}
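
Note that some dependency labels, such as `obl:agent`, themselves contain a colon, so an action string should be split on the first colon only. The `split_action` helper below is a hypothetical illustration of this and is not part of the provided code.

```python
def split_action(action):
    """Split an action string into (transition, label); label is None for SHIFT."""
    transition, _, label = action.partition(":")  # splits at the first colon only
    return transition, (label or None)


print(split_action("LEFT-ARC:obl:agent"))
print(split_action("SHIFT"))
```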


A key feature of the `Dataset` class is that we specify a function `__getitem__(self, ind)` which returns the item at the given index. Items can then be accessed by indexing into the object as you would a list. Here's an example of what an element in `train` looks like. Its length should equal our expected length for `X` plus one for our `y`.


In [None]:
train[0]

tensor([ 2,  1,  1,  1,  3,  4,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
         1,  1, 18,  0,  0,  0, 14, 10,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0])

In [None]:
assert(len(train[0]) == 53)
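
To see how the 53 values break down, the sketch below copies the values printed for `train[0]` above into a plain Python list and slices it by feature group. The variable names are illustrative; the slice boundaries follow the 20 / 20 / 12 / 1 layout used in `create_vectors`.

```python
# Values copied from the train[0] output shown above.
instance = (
    [2, 1, 1, 1, 3, 4] + [1] * 14       # 20 word ids
    + [18, 0, 0, 0, 14, 10] + [0] * 14  # 20 POS-tag ids
    + [0] * 12                          # 12 dependency-label ids
    + [0]                               # gold action id
)

word_feats = instance[:20]
pos_feats = instance[20:40]
label_feats = instance[40:52]
gold_action = instance[52]

print(len(instance), gold_action)
```

The gold action id here is 0, which the `action_dict` printed earlier maps to `SHIFT`.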

As the final step, we wrap the train and dev `DataSamples` objects in PyTorch `DataLoader` objects in order to split the data into batches to be used while training the model.

In [None]:
train_loader = DataLoader(train, batch_size=1024, shuffle=True, num_workers=0)
dev_loader = DataLoader(dev, batch_size=1024, shuffle=False, num_workers=0)
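
Conceptually, batching just cuts the dataset into consecutive chunks (after shuffling, when `shuffle=True`). The plain-Python sketch below mimics that behaviour with a hypothetical `iter_batches` helper and a made-up dataset size; it is not how `DataLoader` is implemented.

```python
import math


def iter_batches(items, batch_size):
    """Yield consecutive chunks of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


samples = list(range(2500))  # hypothetical dataset size
batches = list(iter_batches(samples, 1024))

print([len(b) for b in batches])
print(len(batches) == math.ceil(len(samples) / 1024))
```

Because `DataLoader` keeps the final partial batch by default (`drop_last=False`), the last batch of an epoch is usually smaller than 1024.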

Ensure that you are using the GPU and `device=cuda` when you run this cell.

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"PyTorch device: {device}")

PyTorch device: cuda


# Task 1 - Defining the Classifier

You will be defining the classifier using the `nn.Module` class from PyTorch.

As explained in the handout, we will be using an Embedding layer which will be separated into 3 parts. The output of these will be passed into the hidden layers you define using the `layer_sizes` variable. 

Note that traditionally for classification problems we would apply `Softmax` or `Sigmoid` to the output layer. However, we won't be working with the probabilities: the raw, unnormalized values are enough to pick the best action, which is all we need to parse the sentence. 
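
The reason the raw scores suffice is that softmax is strictly increasing in each score, so it never changes which entry is largest. A small standard-library check, with made-up scores and hypothetical helper names:

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


logits = [1.2, -0.7, 3.4, 0.1]  # made-up raw scores for four actions
probs = softmax(logits)

print(argmax(logits), argmax(probs))
```

Both calls select the same action index, so skipping the normalization does not affect which action the parser takes.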

All the necessary layers can be obtained using the `nn` module which has already been imported. For further information, you may look through the [documentation](https://pytorch.org/docs/stable/nn.html).


In [None]:
class Classifier(nn.Module):
    def __init__(
        self,
        word_vocab_size,
        word_emb_size,
        pos_vocab_size,
        pos_emb_size,
        depl_vocab_size,
        depl_emb_size,
        out_size,
        layer_sizes,
        dropout,
    ):
        super(Classifier, self).__init__()

        # TODO: Define the embedding layers according to the relevant vocab_size and embedding size
        self.word_emb = 
        self.pos_emb = 
        self.depl_emb = 

        # TODO: Determine the input size to the hidden layers using the formula provided in the handout
        in_size = 

        input_sizes = [in_size] + layer_sizes[:-1]
        output_sizes = layer_sizes
        layers = []

        # Adding the hidden layers
        for s1, s2 in zip(input_sizes, output_sizes):

            # TODO: Append a linear layer to layers with the input and output size
            

            # TODO: Append a LeakyReLU layer, feel free to experiment here
            

            # TODO: Append a dropout layer with our input dropout value
            

        # TODO: Append the output layer using the size of the last hidden layer
        # as the input and out_size as the output
        

        # TODO: Initialize the layers of the networks by passing the 
        # unpacked list through a PyTorch Sequential container
        self.layers = 

    def forward(self, inputs):
        # TODO: Define the inputs to each embedding layer
        # Hint: Each element in inputs has length 52, to be separated into three parts
        word_emb_input = 
        pos_emb_input = 
        depl_emb_input = 

        # Here we pass the inputs through the embedding layers and concatenate their outputs.
        embs = [
            *self.word_emb(word_emb_input).split(1, dim=1),
            *self.pos_emb(pos_emb_input).split(1, dim=1),
            *self.depl_emb(depl_emb_input).split(1, dim=1),
        ]
        embs = [torch.squeeze(tens) for tens in embs]
        embs = torch.cat(embs, dim=-1)

        # TODO: Pass the output of the embeddings layer through the rest of the layers.
        output = 
        return output

Testing your definition of the `Classifier` class

In [None]:
# We define a dummy classifier, and perform a singular forward pass to confirm whether the layers 
# have been added correctly
dummy_in = torch.randint(1,10,(1024,52))
dummy_classifier = Classifier(7421,64,19,32,54,32,109, [2048, 512, 256], 0.3)
dummy_out = dummy_classifier.forward(dummy_in)

assert(dummy_out.shape == (1024,109))

Once you pass this test case, copy the code into the `classifier.py` file in the handout and make a preliminary submission to Gradescope.

# Task 2 - Initializing the model

Having defined the Classifier class, we will use this to create a model of our own. 

While initializing the model, the following parameters should be initialized according to the lengths of the dictionaries that contain the vocabularies i.e. number of unique words, POS-tags, dependency labels, and finally actions:

*   `word_vocab_size`
*   `pos_vocab_size`
*   `depl_vocab_size`
*   `out_size`

Think about what the `out_size` corresponds to and define it accordingly.

The remaining variables may be considered hyperparameters to be tuned by you, though we have some suggestions:

* `word_emb_size` : < 100
* `pos_emb_size` : < 100
* `depl_emb_size` : < 100
* `layer_sizes` : 3 layers are sufficient to achieve the benchmark accuracy, and too many may result in overfitting. Populate the list with the number of neurons in each layer, in order.
* `dropout` : < 0.5

The same can be said for the learning rate and the number of `epochs`. We'd suggest keeping `epochs` < 50, again to prevent overfitting. If the learning rate is too high, the model won't be able to reach optimal performance; if it is too low, the model won't train quickly enough. We'd suggest a value between 0.0005 and 0.05.

In [None]:
# TODO: Initialize the model, read the above instructions before beginning
model = Classifier(
    word_vocab_size = ,
    word_emb_size = ,
    pos_vocab_size = ,
    pos_emb_size = ,
    depl_vocab_size = ,
    depl_emb_size = ,
    out_size = ,
    layer_sizes = [],
    dropout =
)

# TODO: Initialize an AdamW optimizer with the model parameters and an appropriate learning rate
optimizer = 

# TODO: Initialize the loss function to cross-entropy loss
criterion =

# We define a scheduler to decay the learning rate as we move through the epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.3)

# TODO: Initialize the number of epochs
epochs = 

# Setting the model to run and train on the GPU
model.to(device)
model.train()
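
To get a feel for the scheduler settings above, the sketch below reproduces the `StepLR` rule, `lr = initial_lr * gamma ** (epoch // step_size)`, in plain Python. It assumes `scheduler.step()` is called once at the end of every epoch (without that call the learning rate stays constant), and the initial rate of 0.001 is an arbitrary value chosen for illustration. The `step_lr` helper is hypothetical.

```python
def step_lr(initial_lr, epoch, step_size=2, gamma=0.3):
    """Learning rate in a given epoch under a StepLR-style decay."""
    return initial_lr * gamma ** (epoch // step_size)


# Arbitrary initial learning rate, for illustration only.
schedule = [step_lr(0.001, epoch) for epoch in range(6)]
for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch}: lr = {lr:.6f}")
```

With `step_size=2` and `gamma=0.3` the rate drops by more than a factor of ten every four epochs, which is worth keeping in mind when choosing the initial learning rate and the number of epochs.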

# Task 3 - Training the Model

Having initialized the model with its parameters, we will train it on the training data using the `train_loader (DataLoader)` object. This is a typical NN training loop which you have likely seen before.

Your model should reach a training accuracy of 0.9 fairly quickly; this will likely be necessary to obtain the desired performance while parsing.

You may refer to the Train the Network section of [this](https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html) tutorial.

In [None]:
for epoch in range(epochs):
    total_loss = 0
    total_acc = 0
    for idx, data_instance in enumerate(train_loader):

        # TODO: Extract the inputs and labels from the data_instance
        # and set them to the device 
        # Hint: data_instance has the shape (1024, 53)
        inputs = 
        labels = 

        # TODO: Pass the inputs to the model to get the output
        output = 

        # TODO: Calculate the loss by using the loss function 
        # defined earlier, passing in outputs and labels.
        # Remember to convert labels to an int64 tensor
        loss = 

        # TODO: Reset the optimizer for each batch
        

        # TODO: Backward propagate through the loss function
        

        # TODO: Step through the optimizer
        

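        # Calculating the training accuracy
        total_loss += loss.item()
        correct = (labels == torch.argmax(output, dim=1)).float().sum()
        total_acc += correct.item()  # .item() keeps the running count a plain Python float
    # Advance the StepLR scheduler once per epoch so the learning rate actually decays
    scheduler.step()
    accuracy = total_acc / len(train)
    print(
        "Epoch Number: {}, Loss: {}, Accuracy: {}".format(epoch, total_loss, accuracy)
    )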

# Evaluation on the Dev Set

The naive way to evaluate the model would be to measure its accuracy at predicting the gold actions for the dev set. While this is a good starting metric, for parsing we prefer the Labelled Attachment Score (LAS), which measures the fraction of words that are assigned both the correct head and the correct dependency label, rather than whether each individual action was predicted correctly.
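
As a rough sketch of the metric, the snippet below computes attachment scores for a single toy sentence. The `attachment_scores` helper and the example data are hypothetical; the `evaluate` function in the provided `utils` module may differ in details such as how punctuation is treated.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) given one (head, label) pair per word."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same words")
    total = len(gold)
    # Unlabelled: only the head has to match.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    # Labelled: both the head and the dependency label have to match.
    both_hits = sum(g == p for g, p in zip(gold, predicted))
    return head_hits / total, both_hits / total


# Toy three-word sentence: the last word has the right head but the wrong label.
gold = [(2, "nsubj"), (0, "root"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (2, "obl")]
uas, las = attachment_scores(gold, pred)

print(f"UAS = {uas:.2f}, LAS = {las:.2f}")
```

A single wrong action early in a parse can change many arcs, which is why action accuracy and LAS can diverge noticeably.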

In [None]:
#@title Testing the accuracy of predicting gold actions on the dev set

correct = 0
model.eval()  # disable dropout while evaluating
with torch.no_grad():  # gradients are not needed for evaluation
    for data_instance in dev_loader:
        x = data_instance[:, :-1].to(device)
        y = data_instance[:, -1].to(device)
        y_pred = torch.argmax(model(x), dim=1)
        correct += (y == y_pred).cpu().numpy().sum()
dev_accuracy = correct / len(dev)

In [None]:
assert(dev_accuracy >= 0.80)

The following cell specifies the paths of the input `dev.conll` file and your output file.

In [None]:
input_p = os.path.abspath("trees/dev.conll")
output_p = os.path.abspath("outputs/dev.out")

In [None]:
#@title Calculating the Labelled Attachment Score
def get_score_func(model, train_data, device):
    def _score_fn(inputs):
        nonlocal model, train_data, device
        words = [train_data.word_dict[w] if w in train_data.word_dict else 0 for w in inputs[:20]]
        pos = [train_data.pos_dict[w] if w in train_data.pos_dict else 0 for w in inputs[20:40]]
        labels = [train_data.label_dict[w] if w in train_data.label_dict else 0 for w in inputs[40:]]
        ipt = torch.tensor([[*words, *pos, *labels]]).to(device)
        with torch.no_grad():
            preds = model(ipt).cpu().numpy().squeeze()
        return preds
    return _score_fn

model.eval()  # make sure dropout is disabled while parsing
score_fn = get_score_func(model, train, device)
actions_map = {i:act for act,i in train.action_dict.items()}

%cd src
from decoder import Decoder
Decoder(score_fn, actions_map).parse(input_p, output_p)

from utils import evaluate
_ , las = evaluate(input_p, output_p)
%cd ..
print("Labeled attachment score", round(las, 2))

In [None]:
assert(las >= 0.75)

Download the `dev.out` file from the outputs folder and submit it along with the `classifier.py` file from Task 1.

We would like to acknowledge our use of the Dependency Parsing with
Feed-Forward Neural Network HW from Columbia University by Prof. Michael Collins in the design of this homework.