# Assignment 2
Submit the **UniversityNumber.ipynb** file and your final prediction file **UniversityNumber.test.out** to Moodle. Make sure your code does not depend on your local files and that the results are reproducible. Before submitting, please **run your notebook and keep all running logs** so that we can check them.

## 1 Data

We will conduct experiments on CoNLL-2003, whose training set contains 14,041 sentences, each annotated with the corresponding named entity tags. You can download the dataset from the following link: https://data.deepai.org/conll2003.zip. We only focus on the tokens and NER tags, which are the first and last columns in the dataset. The dataset is in the IOB (inside-outside-beginning) format, a common and simple way to represent named entity tags: `B-` marks the first token of an entity, `I-` marks a token inside an entity, and `O` marks a token outside any entity. For example, the sentence “I went to New York City last week” is annotated as follows:

    I O
    went O
    to O
    New B-LOC
    York I-LOC
    City I-LOC
    last O
    week O
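
To make the tagging scheme concrete, the following minimal sketch recovers entity spans from a list of IOB tags. The helper name `iob_to_spans` is hypothetical and not part of the assignment. It is deliberately lenient: an `I-` tag that does not continue an open entity of the same type is assumed to start a new one.

```python
def iob_to_spans(tags):
    """Collect (entity_type, start, end) spans from IOB tags; `end` is exclusive.

    Hypothetical helper for illustration only.
    """
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags):
        # "B-LOC" -> prefix "B", type "LOC"; "O" -> prefix "O", type "".
        prefix, _, ent_type = tag.partition("-")
        continues = prefix == "I" and ent_type == current
        if not continues:
            # Close the open span, if any.
            if current is not None:
                spans.append((current, start, i))
                start, current = None, None
            # A B- tag (or a stray I- tag) opens a new span.
            if prefix in ("B", "I"):
                start, current = i, ent_type
    # Close a span that runs to the end of the sentence.
    if current is not None:
        spans.append((current, start, len(tags)))
    return spans


tokens = "I went to New York City last week".split()
tags = ["O", "O", "O", "B-LOC", "I-LOC", "I-LOC", "O", "O"]
spans = iob_to_spans(tags)
entities = [(" ".join(tokens[s:e]), t) for t, s, e in spans]
print(entities)
```

Entity-level evaluation works on spans like these rather than on individual tokens, which is why a single wrong tag inside an entity counts as a missed entity.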

In [None]:
def load_data():
    # Load the train, dev, and test splits as lists of stripped lines.
    import os

    data_dir = os.path.join(os.getcwd(), "data", "prep")
    # Assumes the training split is named 'train.in', following the
    # naming of 'dev.in' and 'test.in'.
    train_path = os.path.join(data_dir, 'train.in')
    dev_path = os.path.join(data_dir, 'dev.in')
    test_path = os.path.join(data_dir, 'test.in')

    with open(train_path, 'r', encoding="utf-8") as f:
        train = [l.strip() for l in f.readlines()]
    with open(dev_path, 'r', encoding="utf-8") as f:
        dev = [l.strip() for l in f.readlines()]
    with open(test_path, 'r', encoding="utf-8") as f:
        test = [l.strip() for l in f.readlines()]

    return train, dev, test
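
`load_data` returns flat lists of lines, so the lines still have to be grouped into sentences. The sketch below shows one way to do that. It assumes the raw CoNLL-2003 layout: whitespace-separated columns with the token first and the NER tag last, blank lines between sentences, and `-DOCSTART-` lines marking document boundaries. The function name `lines_to_sentences` is hypothetical; adjust it if your preprocessed files differ.

```python
def lines_to_sentences(lines):
    """Group stripped lines into (tokens, tags) pairs, one pair per sentence."""
    sentences, tokens, tags = [], [], []
    for line in lines:
        # Blank lines and document markers end the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        columns = line.split()
        tokens.append(columns[0])   # first column: the token
        tags.append(columns[-1])    # last column: the NER tag
    # Keep the final sentence if the file does not end with a blank line.
    if tokens:
        sentences.append((tokens, tags))
    return sentences


sample = [
    "-DOCSTART- -X- -X- O",
    "",
    "EU NNP B-NP B-ORG",
    "rejects VBZ B-VP O",
    "",
    "Peter NNP B-NP B-PER",
    "Blackburn NNP I-NP I-PER",
]
sentences = lines_to_sentences(sample)
print(sentences)
```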

## 2 Tagger
You will train your tagger on the train set and evaluate it on the dev set. You may then tune the hyperparameters of your tagger to get the best performance on the dev set. Finally, you will evaluate your tagger on the test set to get the final performance.

For background on the IOB format, see https://en.wikipedia.org/wiki/Inside–outside–beginning_(tagging).

There are some key points you should pay attention to:

- You will batch the sentences in the dataset to accelerate the training process. To batch the sentences, you may need to pad them to the same length.
- You are free to design the model architecture with (Bi)LSTM or Transformer units for each part, but please do not use any pretrained weights in your basic taggers.
- You will adjust the hyperparameters of your tagger to get the best performance on the dev set. The hyperparameters include the learning rate, batch size, the number of hidden units, the number of layers, the dropout rate, etc.
- You will use seqeval to evaluate your tagger on the dev set and the test set.
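
The batching point above can be sketched as follows. This is one possible approach, not a required design. It pads word indices with an assumed padding index of 0 and pads tag indices with -100, so that `nn.CrossEntropyLoss(ignore_index=-100)` skips padded positions. The names `collate`, `PAD_WORD_IX`, and `PAD_TAG_IX` are hypothetical.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence

PAD_WORD_IX = 0     # assumed vocabulary index reserved for padding
PAD_TAG_IX = -100   # assumed tag value that the loss ignores


def collate(batch):
    """batch: list of (word_ids, tag_ids) pairs of 1-D LongTensors."""
    words = pad_sequence([w for w, _ in batch], batch_first=True,
                         padding_value=PAD_WORD_IX)
    tags = pad_sequence([t for _, t in batch], batch_first=True,
                        padding_value=PAD_TAG_IX)
    lengths = torch.tensor([len(w) for w, _ in batch])
    return words, tags, lengths


batch = [
    (torch.tensor([4, 7, 2]), torch.tensor([1, 0, 0])),
    (torch.tensor([5, 3]), torch.tensor([2, 0])),
]
words, tags, lengths = collate(batch)
print(words)
print(tags)

# Stand-in logits of shape (batch, seq_len, num_tags); a real model produces these.
num_tags = 3
logits = torch.randn(words.size(0), words.size(1), num_tags)

# Flatten so every token is one classification example; padding is ignored.
loss_fn = nn.CrossEntropyLoss(ignore_index=PAD_TAG_IX)
loss = loss_fn(logits.reshape(-1, num_tags), tags.reshape(-1))
print(loss.item())
```

A function like `collate` can be passed as the `collate_fn` argument of `torch.utils.data.DataLoader`, which then yields padded batches directly.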


### 2.1 LSTM Tagger
We will first use an LSTM tagger to solve the NER problem. There is a very simple implementation of an LSTM tagger on the PyTorch website: https://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html. You can refer to this implementation when writing your own LSTM tagger.


In [None]:
# Author: Robert Guthrie

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)

In [None]:
lstm = nn.LSTM(3, 3)  # Input dim is 3, output dim is 3
inputs = [torch.randn(1, 3) for _ in range(5)]  # make a sequence of length 5

# initialize the hidden state.
hidden = (torch.randn(1, 1, 3),
          torch.randn(1, 1, 3))
for i in inputs:
    # Step through the sequence one element at a time.
    # after each step, hidden contains the hidden state.
    out, hidden = lstm(i.view(1, 1, -1), hidden)

# alternatively, we can do the entire sequence all at once.
# the first value returned by LSTM is all of the hidden states throughout
# the sequence. the second is just the most recent hidden state
# (compare the last slice of "out" with "hidden" below, they are the same)
# The reason for this is that:
# "out" will give you access to all hidden states in the sequence
# "hidden" will allow you to continue the sequence and backpropagate,
# by passing it as an argument  to the lstm at a later time
# Add the extra 2nd dimension
inputs = torch.cat(inputs).view(len(inputs), 1, -1)
hidden = (torch.randn(1, 1, 3), torch.randn(1, 1, 3))  # clean out hidden state
out, hidden = lstm(inputs, hidden)
print(out)
print(hidden)

In [None]:
# prepare the data
def prepare_sequence(seq, to_ix):
    idxs = [to_ix[w] for w in seq]
    return torch.tensor(idxs, dtype=torch.long)


training_data = [
    # Tags are: DET - determiner; NN - noun; V - verb
    # For example, the word "The" is a determiner
    ("The dog ate the apple".split(), ["DET", "NN", "V", "DET", "NN"]),
    ("Everybody read that book".split(), ["NN", "V", "DET", "NN"])
]
word_to_ix = {}
# For each words-list (sentence) and tags-list in each tuple of training_data
for sent, tags in training_data:
    for word in sent:
        if word not in word_to_ix:  # word has not been assigned an index yet
            word_to_ix[word] = len(word_to_ix)  # Assign each word with a unique index
print(word_to_ix)
tag_to_ix = {"DET": 0, "NN": 1, "V": 2}  # Assign each tag with a unique index

# These will usually be more like 32 or 64 dimensional.
# We will keep them small, so we can see how the weights change as we train.
EMBEDDING_DIM = 6
HIDDEN_DIM = 6
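
The toy vocabulary above only covers words seen in training, so `prepare_sequence` would raise a `KeyError` on any unseen word in the dev or test set. A common workaround, sketched below, is to reserve indices for a padding token and an unknown-word token. The names `PAD`, `UNK`, `build_vocab`, and `encode` are hypothetical and not part of the tutorial.

```python
from collections import Counter

PAD, UNK = "<PAD>", "<UNK>"  # hypothetical special tokens


def build_vocab(sentences, min_count=1):
    """Map each word seen at least `min_count` times to a unique index."""
    counts = Counter(word for sent in sentences for word in sent)
    word_to_ix = {PAD: 0, UNK: 1}
    for word, count in counts.items():
        if count >= min_count:
            word_to_ix[word] = len(word_to_ix)
    return word_to_ix


def encode(seq, to_ix):
    """Convert words to indices, falling back to UNK for unseen words."""
    unk = to_ix[UNK]
    return [to_ix.get(w, unk) for w in seq]


train_sentences = [
    "The dog ate the apple".split(),
    "Everybody read that book".split(),
]
vocab = build_vocab(train_sentences)
encoded = encode("The cat ate the apple".split(), vocab)
print(encoded)  # "cat" is unseen, so it maps to the UNK index
```

Raising `min_count` turns rare training words into `<UNK>` as well, which gives the model some exposure to the unknown-word embedding during training.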

In [None]:
# create the model
class LSTMTagger(nn.Module):

    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        super(LSTMTagger, self).__init__()
        self.hidden_dim = hidden_dim

        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)

        # The LSTM takes word embeddings as inputs, and outputs hidden states
        # with dimensionality hidden_dim.
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)

        # The linear layer that maps from hidden state space to tag space
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)

    def forward(self, sentence):
        embeds = self.word_embeddings(sentence)
        lstm_out, _ = self.lstm(embeds.view(len(sentence), 1, -1))
        tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
        tag_scores = F.log_softmax(tag_space, dim=1)
        return tag_scores
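
The tutorial model processes one sentence at a time. For the assignment you will want a batched version. The sketch below is one possible design under stated assumptions, not a reference solution: a bidirectional LSTM over padded batches, with index 0 reserved for padding and sequence packing so that the LSTM skips padded positions. The class name and default hyperparameters are illustrative only.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence


class BatchedBiLSTMTagger(nn.Module):
    def __init__(self, vocab_size, tagset_size, embedding_dim=32,
                 hidden_dim=64, num_layers=1, dropout=0.1, pad_ix=0):
        super().__init__()
        # padding_idx keeps the padding embedding fixed at zero.
        self.embed = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_ix)
        self.dropout = nn.Dropout(dropout)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=num_layers,
                            batch_first=True, bidirectional=True)
        # Forward and backward states are concatenated, hence 2 * hidden_dim.
        self.hidden2tag = nn.Linear(2 * hidden_dim, tagset_size)

    def forward(self, words, lengths):
        # words: (batch, seq_len) padded indices; lengths: (batch,) true lengths.
        embeds = self.dropout(self.embed(words))
        packed = pack_padded_sequence(embeds, lengths.cpu(), batch_first=True,
                                      enforce_sorted=False)
        packed_out, _ = self.lstm(packed)
        out, _ = pad_packed_sequence(packed_out, batch_first=True,
                                     total_length=words.size(1))
        # Unnormalized tag scores of shape (batch, seq_len, tagset_size).
        return self.hidden2tag(self.dropout(out))


model = BatchedBiLSTMTagger(vocab_size=10, tagset_size=9)
words = torch.tensor([[4, 7, 2], [5, 3, 0]])
lengths = torch.tensor([3, 2])
logits = model(words, lengths)
print(logits.shape)
```

Because this model returns raw logits rather than log-probabilities, it pairs with `nn.CrossEntropyLoss` instead of the `log_softmax` plus `nn.NLLLoss` combination used in the tutorial.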

In [None]:
# train the model
model = LSTMTagger(EMBEDDING_DIM, HIDDEN_DIM, len(word_to_ix), len(tag_to_ix))
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

# See what the scores are before training
# Note that element i,j of the output is the score for tag j for word i.
# Here we don't need to train, so the code is wrapped in torch.no_grad()
with torch.no_grad():
    inputs = prepare_sequence(training_data[0][0], word_to_ix)
    tag_scores = model(inputs)
    print(tag_scores)

for epoch in range(300):  # again, normally you would NOT do 300 epochs, it is toy data
    for sentence, tags in training_data:
        # Step 1. Remember that PyTorch accumulates gradients.
        # We need to clear them out before each instance
        model.zero_grad()

        # Step 2. Get our inputs ready for the network, that is, turn them into
        # Tensors of word indices.
        sentence_in = prepare_sequence(sentence, word_to_ix)
        targets = prepare_sequence(tags, tag_to_ix)

        # Step 3. Run our forward pass.
        tag_scores = model(sentence_in)

        # Step 4. Compute the loss, gradients, and update the parameters by
        #  calling optimizer.step()
        loss = loss_function(tag_scores, targets)
        loss.backward()
        optimizer.step()

# See what the scores are after training
with torch.no_grad():
    inputs = prepare_sequence(training_data[0][0], word_to_ix)
    tag_scores = model(inputs)

    # The sentence is "The dog ate the apple". Element i,j is the score for tag j
    # for word i. The predicted tag is the maximum-scoring tag.
    # Here, we can see the predicted sequence below is 0 1 2 0 1,
    # since 0 is the index of the maximum value in the first row,
    # 1 is the index of the maximum value in the second row, etc.
    # That is DET NN V DET NN, the correct sequence!
    print(tag_scores)
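
Reading the predicted tags off the printed score matrix by eye does not scale, so in practice the argmax is converted back to tag names. The sketch below uses made-up scores in place of real model output; `ix_to_tag` is a hypothetical inverse of the tutorial's `tag_to_ix`.

```python
import torch

tag_to_ix = {"DET": 0, "NN": 1, "V": 2}
# Invert the mapping so indices can be turned back into tag names.
ix_to_tag = {ix: tag for tag, ix in tag_to_ix.items()}

# Illustrative log-scores (not real model output): one row per word,
# one column per tag, for "The dog ate the apple".
tag_scores = torch.log(torch.tensor([
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.05, 0.15, 0.80],
    [0.85, 0.10, 0.05],
    [0.10, 0.70, 0.20],
]))

# argmax over the tag dimension gives the best tag index for each word.
predicted_ix = tag_scores.argmax(dim=1).tolist()
predicted = [ix_to_tag[i] for i in predicted_ix]
print(predicted)
```

Lists of tag names in this form, one list per sentence, are also the shape that sequence-labeling evaluation tools typically expect for gold and predicted labels.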