This CoNLL-2003 example shows how tokens are labeled for three tasks: named entity recognition (NER), part-of-speech (POS) tagging, and chunking. Each field of the JSON object corresponds to a different annotation layer for the sentence:

1. **Tokens**: These are the individual words or punctuation marks from the text. In this case, the sentence "EU rejects German call to boycott British lamb." is split into tokens:
   - "EU"
   - "rejects"
   - "German"
   - "call"
   - "to"
   - "boycott"
   - "British"
   - "lamb"
   - "."

2. **POS Tags**: This array contains the POS tag for each token. The tags are encoded as integers, each an index into the dataset's list of POS labels, which follows the Penn Treebank tag set:
   - "EU" is tagged as 22 (NNP), which represents a singular proper noun.
   - "rejects" is tagged as 42 (VBZ), indicating a third-person singular present-tense verb.
   - And so forth.

3. **Chunk Tags**: This array marks phrase chunk boundaries and types (like NP for noun phrase, VP for verb phrase). Each integer is an index into the dataset's list of chunk labels, which use a B-/I- prefix to mark the beginning or the inside of a chunk:
   - "EU" begins a noun phrase, hence 11 (B-NP).
   - "rejects" begins a verb phrase, indicated by 21 (B-VP).
   - The chunk tags help in parsing the sentence into linguistically meaningful phrases.

4. **NER Tags**: These tags are used for named entity recognition. They identify whether each token is part of a named entity and the type of that entity (person, organization, location, or miscellaneous):
   - "EU" is tagged as 3 (B-ORG), denoting the beginning of an organization name.
   - "German" and "British" are tagged as 7 (B-MISC), the miscellaneous category, which covers nationalities among other things.
   - Other tokens are tagged as 0, meaning they are not recognized as part of any named entity.
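
The integer NER tags can be mapped back to label names by indexing into the dataset's label list. The following is a minimal sketch, assuming the label order that `features["ner_tags"].feature.names` prints for this dataset and the tag values described above; `NER_LABELS`, `example`, and `decode` are illustrative names, not part of any library:

```python
# Label order as reported by the dataset's `ner_tags` feature (assumed here).
NER_LABELS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG",
              "B-LOC", "I-LOC", "B-MISC", "I-MISC"]

# The example sentence with its NER layer (values as described above).
example = {
    "tokens": ["EU", "rejects", "German", "call", "to",
               "boycott", "British", "lamb", "."],
    "ner_tags": [3, 0, 7, 0, 0, 0, 7, 0, 0],
}


def decode(tag_ids, label_names):
    """Map integer tag ids to their string labels."""
    return [label_names[i] for i in tag_ids]


ner_labels = decode(example["ner_tags"], NER_LABELS)
for token, label in zip(example["tokens"], ner_labels):
    print(f"{token:10s} {label}")
```

The same lookup works for the POS and chunk layers, given their respective label lists.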

Homework:

1. Load a NER dataset (e.g. CoNLL-2003) using the script provided below
   - Create a custom nn.Module class that takes GloVe word embeddings as input, passes them through a linear layer, and outputs NER tags
   - Train the model using cross-entropy loss and evaluate its performance using entity-level F1 score
   - Analyze the model's predictions and visualize the confusion matrix to identify common errors
2. Build a multi-layer perceptron (MLP) for NER using GloVe embeddings
   - Extend the previous exercise by creating an nn.Module class that defines an MLP architecture on top of GloVe embeddings
   - Experiment with different hidden layer sizes and numbers of layers
   - Evaluate the trained model using entity-level precision, recall, and F1 scores
   - Compare the performance of the MLP model with the simple linear model from exercise 1
3. Explore the effects of different activation functions and regularization techniques for NER
   - Modify the MLP model from exercise 2 to allow configurable activation functions (e.g. ReLU, tanh, sigmoid)
   - Train models with different activation functions
   - Visualize the learned entity embeddings using dimensionality reduction techniques like PCA or t-SNE
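
Entity-level scores count a prediction as correct only when both the span boundaries and the entity type match the gold annotation exactly. The following is a rough sketch of that computation over BIO label sequences; `extract_spans` and `entity_f1` are hypothetical helper names, and an `I-` tag with no matching open span is leniently treated as starting a new span. In practice a dedicated library such as `seqeval` is commonly used instead.

```python
def extract_spans(labels):
    """Return a set of (entity_type, start, end) spans from a BIO label list."""
    spans, start, ent_type = set(), None, None
    for i, label in enumerate(labels + ["O"]):  # sentinel closes a trailing span
        inside = label.startswith("I-") and label[2:] == ent_type
        if start is not None and not inside:
            spans.add((ent_type, start, i))  # end index is exclusive
            start, ent_type = None, None
        if label.startswith("B-") or (label.startswith("I-") and start is None):
            start, ent_type = i, label[2:]
    return spans


def entity_f1(gold_seqs, pred_seqs):
    """Micro-averaged entity-level precision, recall, and F1."""
    tp = n_gold = n_pred = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
        tp += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy check: the prediction misses the second B-MISC entity.
gold = [["B-ORG", "O", "B-MISC", "O", "O", "O", "B-MISC", "O", "O"]]
pred = [["B-ORG", "O", "B-MISC", "O", "O", "O", "O", "O", "O"]]
precision, recall, f1 = entity_f1(gold, pred)
print(precision, recall, f1)
```

Positions labeled -100 (padding and sub-word pieces) should be filtered out before converting ids to label strings and calling these helpers.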

In [1]:
import torch
import torch.nn as nn
from transformers import AutoTokenizer
from datasets import load_dataset
from torch.utils.data import DataLoader
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Load and preprocess CoNLL-2003 dataset
conll2003 = load_dataset("conll2003")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

from torch.nn.utils.rnn import pad_sequence

def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None or word_idx == previous_word_idx:
                label_ids.append(-100)
            else:
                label_ids.append(label[word_idx])
            previous_word_idx = word_idx
        labels.append(label_ids)
    # Keep plain Python lists here: `map` stores columns as lists anyway,
    # and `collate_fn` converts them to tensors when batching.
    tokenized_inputs["labels"] = labels
    return tokenized_inputs


def collate_fn(batch):
    input_ids = [torch.tensor(item["input_ids"]) for item in batch]
    labels = [torch.tensor(item["labels"]) for item in batch]
    input_ids = pad_sequence(input_ids, batch_first=True)
    labels = pad_sequence(labels, batch_first=True, padding_value=-100)
    return {"input_ids": input_ids, "labels": labels}



tokenized_conll2003 = conll2003.map(tokenize_and_align_labels, batched=True, remove_columns=conll2003["train"].column_names)
train_dataloader = DataLoader(tokenized_conll2003["train"], batch_size=32, shuffle=True, collate_fn=collate_fn)
val_dataloader = DataLoader(tokenized_conll2003["validation"], batch_size=32, collate_fn=collate_fn)
test_dataloader = DataLoader(tokenized_conll2003["test"], batch_size=32, collate_fn=collate_fn)


Map: 100%|██████████| 14041/14041 [00:00<00:00, 18368.40 examples/s]
Map: 100%|██████████| 3250/3250 [00:00<00:00, 21613.68 examples/s]
Map: 100%|██████████| 3453/3453 [00:00<00:00, 23084.67 examples/s]
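
The label alignment rule used in `tokenize_and_align_labels` can be checked on a toy input without downloading a tokenizer. The sketch below applies the same rule: special tokens and sub-word continuations get -100, and only the first piece of each word keeps its label. The `word_ids` list is hand-written and hypothetical (a real tokenizer may split the words differently), and `align_labels` is an illustrative name:

```python
IGNORE_INDEX = -100  # default `ignore_index` of nn.CrossEntropyLoss


def align_labels(word_ids, word_labels):
    """Assign each word's label to its first sub-token; mask everything else."""
    label_ids, previous_word_idx = [], None
    for word_idx in word_ids:
        if word_idx is None or word_idx == previous_word_idx:
            # Special token (None) or continuation piece of the same word.
            label_ids.append(IGNORE_INDEX)
        else:
            label_ids.append(word_labels[word_idx])
        previous_word_idx = word_idx
    return label_ids


# Hypothetical split: [CLS], word 0, word 1 (two pieces), word 2, [SEP]
word_ids = [None, 0, 1, 1, 2, None]
word_labels = [3, 0, 7]  # B-ORG, O, B-MISC
aligned = align_labels(word_ids, word_labels)
print(aligned)  # [-100, 3, 0, -100, 7, -100]
```

Because the masked positions use -100, the loss skips them and each word contributes exactly one term.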


In [2]:
# print all labels
print(conll2003["train"].features["ner_tags"].feature.names)

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']


In [None]:
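from torchtext.vocab import GloVe

class ModelNER(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_tags):
        super().__init__()
        # GloVe(...) returns a Vectors object; its `.vectors` attribute is the weight tensor
        glove = GloVe(name='6B', dim=embed_dim)
        self.embedding = nn.Embedding.from_pretrained(glove.vectors)
        # add here

    def forward(self, x):
        # add here
        return x

# Initialize model, loss function, and optimizer
# NOTE: input_ids come from the BERT tokenizer, so they index BERT's vocabulary, not GloVe's.
# Map tokens to GloVe indices (e.g. via `glove.stoi`) for the pretrained vectors to be meaningful.
vocab_size = tokenizer.vocab_size
embed_dim = 100  # GloVe 6B is available in 50, 100, 200, and 300 dimensions
num_tags = len(conll2003["train"].features["ner_tags"].feature.names)

model = ModelNER(vocab_size, embed_dim, num_tags)
criterion = nn.CrossEntropyLoss()  # ignore_index defaults to -100, so masked positions are skipped
optimizer = torch.optim.Adam(model.parameters())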

# Train the model
num_epochs = 10
for epoch in range(num_epochs):
    train_loss = 0
    for batch in train_dataloader:
        input_ids = batch["input_ids"]
        labels = batch["labels"]
        optimizer.zero_grad()
        outputs = model(input_ids)
        loss = criterion(outputs.view(-1, num_tags), labels.view(-1))
        loss.backward()
        optimizer.step()
        train_loss += loss.item()
    print(f"Epoch {epoch+1}, Train Loss: {train_loss / len(train_dataloader)}")
