## CS310 Natural Language Processing
## Lab 5 (part 2): Data preparation for Named Entity Recognition (NER)

This lab uses the English portion of the CoNLL-2003 named entity recognition (NER) dataset, a collection of Reuters news articles.

The dataset is annotated with four types of named entities: persons, locations, organizations, and miscellaneous entities that do not belong to the previous three types. The dataset is divided into three parts: **training**, **development**, and **testing**. 


In [7]:
from pprint import pprint
import torch.nn as nn
import torch.nn.functional as F  # needed for F.log_softmax in the model below

In [1]:
TRAIN_PATH = 'data/train.txt'
DEV_PATH = 'data/dev.txt'
TEST_PATH = 'data/test.txt'
EMBEDDINGS_PATH = 'data/glove.6B.100d.txt' 
# Download from https://nlp.stanford.edu/data/glove.6B.zip
# The archive contains vectors of dimension 50, 100, 200, and 300.

The dataset is in the IOB format, a simple text chunking scheme that assigns a label to every token. The label of a token inside a named entity combines two parts: the position of the token in the entity and the type of the entity. The position is B (beginning) for the first token of an entity and I (inside) for the tokens that follow it; the type is one of the four types mentioned above (PER, LOC, ORG, MISC). Tokens that do not belong to any named entity get the label O (outside), which carries no type. For example, in the sentence "I live in New York", the word "New" is labeled "B-LOC", the word "York" is labeled "I-LOC", and the word "I" is labeled "O".
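To make the labeling scheme concrete, the following is a minimal sketch that groups IOB tags back into entity spans. The helper name `iob_to_spans` is hypothetical and not part of the lab code; it assumes tags of the form `B-TYPE`, `I-TYPE`, or `O`, and it treats an `I-` tag with no open entity as the start of a new one.

```python
def iob_to_spans(tags):
    """Collect (entity_type, start, end) spans from IOB tags; end is exclusive."""
    spans = []
    start, current_type = None, None
    for i, tag in enumerate(tags):
        if tag == 'O':
            prefix, tag_type = 'O', None
        else:
            prefix, tag_type = tag.split('-', 1)
        # An open entity ends on 'O', on a new 'B-', or when the type changes.
        if current_type is not None and (prefix != 'I' or tag_type != current_type):
            spans.append((current_type, start, i))
            start, current_type = None, None
        # An entity starts on 'B-', or on an 'I-' when no entity is open.
        if prefix != 'O' and current_type is None:
            start, current_type = i, tag_type
    # Close an entity that runs to the end of the sequence.
    if current_type is not None:
        spans.append((current_type, start, len(tags)))
    return spans


words = ['I', 'live', 'in', 'New', 'York']
tags = ['O', 'O', 'O', 'B-LOC', 'I-LOC']
for entity_type, start, end in iob_to_spans(tags):
    print(entity_type, ' '.join(words[start:end]))
```

Span-level grouping like this is also what entity-level evaluation is usually based on, as opposed to per-token accuracy.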

In [2]:
def read_ner_data(path_to_file):
    words = []
    tags = []
    with open(path_to_file, 'r', encoding='utf-8') as file:
        for line in file:
            splitted = line.split()
            if len(splitted) == 0:
                continue
            word = splitted[0]
            if word == '-DOCSTART-':
                continue
            entity = splitted[-1]
            words.append(word)
            tags.append(entity)
        return words, tags

In [3]:
train_words, train_tags = read_ner_data(TRAIN_PATH)
dev_words, dev_tags = read_ner_data(DEV_PATH)
test_words, test_tags = read_ner_data(TEST_PATH)

In [9]:
pprint(list(zip(train_words[:10], train_tags[:10])))

[('EU', 'B-ORG'),
 ('rejects', 'O'),
 ('German', 'B-MISC'),
 ('call', 'O'),
 ('to', 'O'),
 ('boycott', 'O'),
 ('British', 'B-MISC'),
 ('lamb', 'O'),
 ('.', 'O'),
 ('Peter', 'B-PER')]


**Note** that
- Each sentence ends with the token '.' and the tag 'O'. Sentences are separated by a blank line.
- The same padding and packing pipeline as in the previous lab needs to be used for the NER data, too.
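Because `read_ner_data` returns flat lists, sentence boundaries have to be recovered before padding and packing. One possible approach, sketched below, is to group tokens by the blank lines while reading. The function name `read_ner_sentences` and the sample lines are illustrative assumptions, not part of the lab code; the sketch takes any iterable of lines, so an open file handle would work as well.

```python
def read_ner_sentences(lines):
    """Group (word, tag) pairs into sentences, using blank lines as boundaries."""
    sentences, current = [], []
    for line in lines:
        fields = line.split()
        # A blank line or a document marker closes the current sentence.
        if not fields or fields[0] == '-DOCSTART-':
            if current:
                sentences.append(current)
                current = []
            continue
        # First column is the word, last column is the NER tag.
        current.append((fields[0], fields[-1]))
    # Keep the last sentence if the input does not end with a blank line.
    if current:
        sentences.append(current)
    return sentences


# Illustrative sample in a four-column layout (word, POS, chunk, NER tag).
sample = [
    '-DOCSTART- -X- -X- O\n',
    '\n',
    'EU NNP B-NP B-ORG\n',
    'rejects VBZ B-VP O\n',
    '. . O O\n',
    '\n',
    'Peter NNP B-NP B-PER\n',
    'Blackburn NNP I-NP I-PER\n',
]
sentences = read_ner_sentences(sample)
print(len(sentences))
print(sentences[1])
```

The resulting per-sentence lists have different lengths, which is what the padding and packing steps are meant to handle.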

---

### T1. Build vocabularies for both words and labels (tags)

Use *ALL* the data from the train, dev, and test sets to build two vocabularies: one for words and one for labels (tags).

In [17]:
### START YOUR CODE ###
def build_vocab(*data_lists):
    all_words = [] 
    all_tags = []  

    for data_list in data_lists:
        if isinstance(data_list, tuple):
            all_words.extend(data_list[0])
            all_tags.extend(data_list[1])
        else:
            all_words.extend(data_list)
    
    unique_words = set(all_words)
    unique_tags = set(all_tags)
    
    return unique_words, unique_tags

vocab_words, vocab_tags = build_vocab((train_words, train_tags), (dev_words, dev_tags), (test_words, test_tags))


### END YOUR CODE ###

print('Word vocabulary size:', len(vocab_words))
print('Tag vocabulary size:', len(vocab_tags))

Word vocabulary size: 30289
Tag vocabulary size: 9
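The vocabularies above are plain sets, which have no stable order, while an embedding layer needs integer indices. A minimal sketch of one way to derive index mappings follows. The helper `build_index` and the special symbols `<PAD>` and `<UNK>` are assumptions made for illustration; sorting is used only to make the mapping reproducible across runs.

```python
def build_index(items, specials=()):
    """Map each item to an integer id; special symbols get the lowest ids."""
    index = {symbol: i for i, symbol in enumerate(specials)}
    for item in sorted(items):
        if item not in index:
            index[item] = len(index)
    return index


# Toy vocabularies standing in for vocab_words and vocab_tags.
toy_words = {'EU', 'rejects', 'German', 'call'}
toy_tags = {'O', 'B-ORG', 'B-MISC'}

word2idx = build_index(toy_words, specials=('<PAD>', '<UNK>'))  # assumed special symbols
tag2idx = build_index(toy_tags)

# Words outside the vocabulary fall back to the <UNK> id.
sentence = ['EU', 'rejects', 'French', 'call']
ids = [word2idx.get(w, word2idx['<UNK>']) for w in sentence]
print(ids)
```

Reserving index 0 for padding is a common convention, but whether the tag mapping also needs a padding entry depends on how the loss is masked in the rest of the pipeline.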


### Model Architecture

In the `__init__` method, initialize `word_embeddings` with a pretrained embedding weight matrix loaded from `glove.6B.100d.txt`.

For some model variants, e.g., the maximum entropy Markov model (MEMM), you also need to initialize `tag_embeddings` with a random weight matrix.

The `forward` method takes the sequence of word indices (and sequence lengths) as input and returns the log probabilities of the predicted labels (tags). 
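As a rough illustration of how the pretrained weight matrix could be assembled, the sketch below parses GloVe-style lines (a word followed by space-separated floats) into a row-per-word matrix. The function `load_glove_matrix`, the toy data, and the choice of small random values for words missing from GloVe are all assumptions, not the lab's reference solution.

```python
import random


def load_glove_matrix(lines, word2idx, dim, seed=0):
    """Build a len(word2idx) x dim matrix; rows for words missing from GloVe stay random."""
    rng = random.Random(seed)
    matrix = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
              for _ in range(len(word2idx))]
    found = 0
    for line in lines:
        parts = line.rstrip().split(' ')
        word, values = parts[0], parts[1:]
        # Overwrite the random row only for words that are in our vocabulary.
        if word in word2idx and len(values) == dim:
            matrix[word2idx[word]] = [float(v) for v in values]
            found += 1
    return matrix, found


# Toy stand-ins for the GloVe file and the word index.
toy_glove = ['the 0.1 0.2 0.3\n', 'eu -0.5 0.0 0.5\n']
toy_word2idx = {'<PAD>': 0, 'the': 1, 'eu': 2, 'lamb': 3}
matrix, found = load_glove_matrix(toy_glove, toy_word2idx, dim=3)
print(found, matrix[2])
```

With the real file, `lines` would be the opened `EMBEDDINGS_PATH` and `dim` would be 100; the nested list can then be converted with `torch.tensor(matrix, dtype=torch.float)` before being passed to `nn.Embedding.from_pretrained`. Since the `glove.6B` vectors are uncased, lowercasing words before the lookup will probably increase the number of matches.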

In [8]:
class LSTMTagger(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size, pretrained_embeddings):
        super(LSTMTagger, self).__init__()
        self.word_embeddings = nn.Embedding.from_pretrained(pretrained_embeddings)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, tagset_size)
        

    def forward(self, seq, seq_lens):
        embeds = self.word_embeddings(seq)
        packed_embeds = nn.utils.rnn.pack_padded_sequence(embeds, seq_lens, batch_first=True, enforce_sorted=False)
        lstm_out, _ = self.lstm(packed_embeds)
        lstm_out, _ = nn.utils.rnn.pad_packed_sequence(lstm_out, batch_first=True)
        tag_space = self.fc(lstm_out)
        tag_scores = F.log_softmax(tag_space, dim=2)
        
        return tag_scores