# bertchunker: default program

In [2]:
from default import *
import os, sys

## Run the default solution on dev

In [3]:
chunker = FinetuneTagger(os.path.join('..', 'data', 'chunker'), modelsuffix='.pt')
decoder_output = chunker.decode(os.path.join('..', 'data', 'input', 'dev.txt'))

100%|██████████| 1027/1027 [02:32<00:00,  6.72it/s]


Ignore the warnings from the transformers library. They are expected to occur.

## Evaluate the default output

In [4]:
flat_output = [ output for sent in decoder_output for output in sent ]
sys.path.append('..')
import conlleval
true_seqs = []
with open(os.path.join('..', 'data', 'reference', 'dev.out')) as r:
    for sent in conlleval.read_file(r):
        true_seqs += sent.split()
conlleval.evaluate(true_seqs, flat_output)

processed 23663 tokens with 11896 phrases; found: 12296 phrases; correct: 10648.
accuracy:  92.51%; (non-O)
accuracy:  92.87%; precision:  86.60%; recall:  89.51%; FB1:  88.03
             ADJP: precision:  64.37%; recall:  74.34%; FB1:  68.99  261
             ADVP: precision:  63.88%; recall:  76.88%; FB1:  69.78  479
            CONJP: precision:  41.67%; recall:  71.43%; FB1:  52.63  12
             INTJ: precision:   0.00%; recall:   0.00%; FB1:   0.00  18
               NP: precision:  86.08%; recall:  89.82%; FB1:  87.91  6508
               PP: precision:  97.01%; recall:  93.12%; FB1:  95.03  2343
              PRT: precision:  64.41%; recall:  84.44%; FB1:  73.08  59
             SBAR: precision:  75.71%; recall:  78.90%; FB1:  77.27  247
              UCP: precision:   0.00%; recall:   0.00%; FB1:   0.00  10
               VP: precision:  87.71%; recall:  89.80%; FB1:  88.74  2359


(86.59726740403383, 89.50907868190988, 88.02910052910053)
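
The three values in the returned tuple follow from the phrase counts on the first line of the report. As a quick check, assuming the usual chunk-level definitions (precision = correct / found, recall = correct / gold, FB1 = harmonic mean of the two), the numbers can be recomputed by hand:

```python
# Phrase counts copied from the conlleval report above.
gold_phrases = 11896     # phrases in the reference
found_phrases = 12296    # phrases predicted by the chunker
correct_phrases = 10648  # predicted phrases that match the reference exactly

precision = 100.0 * correct_phrases / found_phrases
recall = 100.0 * correct_phrases / gold_phrases
fb1 = 2 * precision * recall / (precision + recall)

print(f"precision: {precision:.2f}%; recall: {recall:.2f}%; FB1: {fb1:.2f}")
```

The printed values agree with the overall precision, recall and FB1 in the report, so the dev score quoted in the rest of this notebook is this FB1 value.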

## Change 1
We can adopt a more sophisticated approach than simply using the tag of the first subword for the entire word. A better strategy could involve voting or averaging predictions across the subwords of a single word. Since voting is simpler and more straightforward for classification tasks like tagging, we implement that approach.

    Subword Processing: This method processes each subword associated with a word, collects the predicted tags for all subwords of a word, and then decides on the final tag for the word based on a voting mechanism among its subwords.

    Voting Mechanism: For each word, the predicted tags from its subwords are aggregated, and the most frequently predicted tag is chosen as the final tag for the word. This approach considers the contribution of each subword to the overall prediction, potentially leading to more accurate and consistent tagging.
    
    Error Handling: An assertion is added to ensure the length of the input sequence matches the length of the output predictions, enhancing the method's robustness.

This improved method leverages the information from all subwords associated with a word, potentially increasing the accuracy of the tagging system by considering more context than simply the first subword's prediction.
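
The voting step on its own can be sketched without the model. This is a minimal illustration in plain Python, not the notebook's code: the function name `vote_word_tags`, the toy tags and the `"O"` fallback for a word with no subwords are all assumptions made for the example.

```python
from collections import Counter

def vote_word_tags(subword_tags, word_ids, num_words):
    """Collapse per-subword tags into one tag per word by majority vote."""
    votes = [[] for _ in range(num_words)]
    for tag, word_id in zip(subword_tags, word_ids):
        if word_id is not None:  # special tokens such as [CLS]/[SEP] map to no word
            votes[word_id].append(tag)
    # most_common(1) keeps first-seen order among equal counts, so a tie
    # falls back to the earliest subword's tag. "O" is an assumed fallback.
    return [Counter(v).most_common(1)[0][0] if v else "O" for v in votes]

# Hypothetical sentence of 3 words; the second word is split into 3 subwords.
word_ids = [None, 0, 1, 1, 1, 2, None]
subword_tags = ["O", "B-NP", "I-NP", "B-VP", "I-NP", "B-VP", "O"]
print(vote_word_tags(subword_tags, word_ids, num_words=3))
```

Here the second word receives `I-NP` because two of its three subwords predict it, even though the middle subword disagrees.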

In [5]:
def argmax(self, model, seq):
    output = [[] for _ in seq]
    with torch.no_grad():
        inputs = self.prepare_sequence(seq).to(device)
        tag_scores = model(inputs.input_ids).squeeze(0)
        word_ids = inputs.encodings[0].word_ids  # Get word IDs for subwords
        predictions = [[] for _ in range(len(seq))]  # Prepare a list to hold predictions for each word

        # Iterate through each subword prediction
        for i, word_id in enumerate(word_ids):
            if word_id is not None:  # Ignore special tokens
                # Append the predicted tag (as index) for each subword to the corresponding word's predictions
                predictions[word_id].append(int(tag_scores[i].argmax(dim=0)))

        # For each word, determine the most common predicted tag among its subwords
        for i, word_predictions in enumerate(predictions):
            if word_predictions:  # Ensure there are predictions to process
                # Use a voting mechanism to decide on the tag for each word
                most_common_tag_idx = max(set(word_predictions), key=word_predictions.count)
                output[i] = self.ix_to_tag[most_common_tag_idx]

    assert len(seq) == len(output), "The length of the sequence and output do not match."
    return output

The dev score is 90.3237

To specifically improve the subword-to-word resolution strategy in the argmax method as per the comments, let's explore a nuanced approach that aims to better capture the essence of the entire word from its constituent subwords. This involves a combination of strategies that could potentially address the drop in development scores:

    Enhanced Majority Voting: Instead of a simple majority vote, consider the frequency and confidence (softmax probabilities) of predicted tags for a more informed decision.

    Confidence-weighted Voting: Use the softmax probabilities to weight the votes for each tag, giving more influence to predictions made with higher confidence.

    Hybrid Approach: Combine the first subword's tag with a voting mechanism for the rest of the subwords. This approach acknowledges the importance of the first subword in many BERT-like models while still considering the context provided by subsequent subwords.

Implementing a confidence-weighted voting mechanism requires modifying the model's forward method to return softmax probabilities in addition to the argmax indices. However, to keep the modifications feasible within the original code structure and without altering the model's forward method, we'll focus on a hybrid approach that combines the insights from both the first subword and a simple majority vote for all subwords associated with a word.
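
For reference, the confidence-weighted variant that we did not pursue could look roughly like the sketch below. It is written in plain Python over a list of per-subword log-probabilities, on the assumption that such scores are available; `confidence_weighted_tags` and the toy numbers are hypothetical and were not used for any score reported here.

```python
import math

def confidence_weighted_tags(log_probs, word_ids, ix_to_tag):
    """Sum each tag's probability over a word's subwords and pick the largest total."""
    num_words = 1 + max(w for w in word_ids if w is not None)
    totals = [[0.0] * len(ix_to_tag) for _ in range(num_words)]
    for row, word_id in zip(log_probs, word_ids):
        if word_id is None:  # skip special tokens
            continue
        for tag_ix, lp in enumerate(row):
            totals[word_id][tag_ix] += math.exp(lp)  # log-probability -> probability
    return [ix_to_tag[max(range(len(t)), key=t.__getitem__)] for t in totals]

ix_to_tag = ["O", "B-NP", "I-NP"]
word_ids = [None, 0, 0, 0, None]
probs = [
    [0.80, 0.10, 0.10],  # special token, ignored
    [0.05, 0.90, 0.05],  # first subword: confident B-NP
    [0.10, 0.40, 0.50],  # second subword: weak I-NP
    [0.10, 0.42, 0.48],  # third subword: weak I-NP
    [0.80, 0.10, 0.10],  # special token, ignored
]
log_probs = [[math.log(p) for p in row] for row in probs]
print(confidence_weighted_tags(log_probs, word_ids, ix_to_tag))
```

In this toy case a plain majority vote would choose `I-NP` (two of three subwords), while the weighted vote chooses `B-NP` because the first subword is far more confident than the other two.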


In [6]:
def argmax(self, model, seq):
    output = [''] * len(seq)  # Initialize output with empty strings for each word in the input sequence
    with torch.no_grad():
        inputs = self.prepare_sequence(seq).to(device)
        tag_scores = model(inputs.input_ids).squeeze(0)
        word_ids = inputs.encodings[0].word_ids  # Get word IDs for subwords
        
        # Initialize a list to hold all tag predictions for each word
        word_tag_predictions = [[] for _ in range(len(seq))]
        
        for i, word_id in enumerate(word_ids):
            if word_id is not None:  # Ignore special tokens
                predicted_tag_idx = int(tag_scores[i].argmax(dim=0))
                word_tag_predictions[word_id].append(predicted_tag_idx)
        
        # Decide the tag for each word
        for word_id, tag_idxs in enumerate(word_tag_predictions):
            if tag_idxs:
                # If the first subword's tag differs from the majority, consider both
                first_tag = tag_idxs[0]
                majority_tag = max(set(tag_idxs), key=tag_idxs.count)
                
                # Prioritize the first subword's tag if there's a tie or the list is dominated by a single tag
                if first_tag == majority_tag or len(set(tag_idxs)) == 1:
                    output[word_id] = self.ix_to_tag[first_tag]
                else:
                    # Hybrid approach: Consider the first subword's tag but note the presence of other tags
                    output[word_id] = self.ix_to_tag[majority_tag]  # Fallback to majority if there's a clear preference
                
    assert len(seq) == len(output), "The length of the sequence and output do not match."
    return output

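The dev score is 90.3237, the same as with plain majority voting. This is expected: whenever the first subword's tag differs from the majority tag, the code falls back to the majority tag, so both versions always output the majority tag.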


Label Weights Calculation: Before training, we calculate the weights for each label based on their frequency in the training data, which are then used to create a weighted NLLLoss. This helps the model pay more attention to infrequent labels.

Freezing Encoder Layers: In the initial epochs, the encoder's parameters are set to not require gradients, effectively freezing those layers. This allows the model's classification head to adapt to the task without disruptive updates from the pre-trained layers. After a specified number of epochs, the encoder is unfrozen to fine-tune the entire model.
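
The label weight calculation can be checked on a toy example. The sketch below mirrors the inverse-frequency idea in plain Python; the function name and the toy tag sequences are assumptions, and unlike the `1e-6` smoothing in the training code it simply gives a weight of zero to a label that never occurs.

```python
from collections import Counter

def inverse_frequency_weights(tag_sequences, tag_to_ix):
    """Weight each label by 1/count, rescaled so the weights sum to the number of labels."""
    counts = Counter(tag for tags in tag_sequences for tag in tags)
    inverse = [0.0] * len(tag_to_ix)
    for tag, ix in tag_to_ix.items():
        inverse[ix] = 1.0 / counts[tag] if counts[tag] else 0.0
    scale = len(tag_to_ix) / sum(inverse)
    return [w * scale for w in inverse]

tag_to_ix = {"O": 0, "B-NP": 1, "I-NP": 2}
tag_sequences = [
    ["B-NP", "I-NP", "O", "O"],
    ["B-NP", "O", "O", "B-NP", "O", "O"],
]
weights = inverse_frequency_weights(tag_sequences, tag_to_ix)
print(weights)
```

With counts of 6, 3 and 1 for `O`, `B-NP` and `I-NP`, the rarest label ends up with six times the weight of the most frequent one, which is the effect the weighted loss relies on.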

In [7]:
def train(self):
        self.load_training_data(self.trainfile)
        self.model = TransformerModel(self.basemodel, len(self.tag_to_ix), lr=self.lr).to(device)
        # TODO You may want to set the weights in the following line to increase the effect of
        #   gradients for infrequent labels and reduce the dominance of the frequent labels
        # Calculate label weights for weighted loss
        label_counts = np.zeros(len(self.tag_to_ix))
        for _, tags in self.training_data:
            for tag in tags:
                label_counts[self.tag_to_ix[tag]] += 1
        label_weights = 1.0 / (label_counts + 1e-6)  # Prevent division by zero
        label_weights = label_weights / label_weights.sum() * len(self.tag_to_ix)  # Normalize
        label_weights = torch.tensor(label_weights, dtype=torch.float).to(device)
        
        loss_function = nn.NLLLoss(weight=label_weights)
        self.model.train()
        #loss = float("inf")
        total_loss = 0
        loss_count = 0
        for epoch in range(self.epochs):
            
            if epoch < 2:  
                for param in self.model.encoder.parameters():
                    param.requires_grad = False
            else:  
                for param in self.model.encoder.parameters():
                    param.requires_grad = True

            train_iterator = tqdm.tqdm(self.training_data)
            batch = []
            for tokenized_sentence, tags in train_iterator:
                # Step 1. Get our inputs ready for the network, that is, turn them into
                # Tensors of subword indices. Pre-trained transformer based models come with their fixed
                # input tokenizer which in our case will receive the words in a sentence and will convert the words list
                # into a list of subwords (e.g. you can look at https://aclanthology.org/P16-1162.pdf to get a better
                # understanding about BPE subword vocabulary creation technique).
                # The expected labels will be copied as many times as the size of the subwords list for each word and
                # returned in targets label.
                batch.append(self.prepare_sequence(tokenized_sentence, tags))
                if len(batch) < self.batchsize:
                    continue
                pad_id = self.tokenizer.pad_token_id
                o_id = self.tag_to_ix['O']
                max_len = max([x[1].size(0) for x in batch])
                # in the next two lines we pad the batch items so that each sequence comes to the same size before
                #  feeding the input batch to the model and calculating the loss over the target values.
                input_batch = [x[0].input_ids[0].tolist() + [pad_id] * (max_len - x[0].input_ids[0].size(0)) for x in batch]
                target_batch = [x[1].tolist() + [o_id] * (max_len - x[0].input_ids[0].size(0)) for x in batch]
                sentence_in = torch.LongTensor(input_batch).to(device)
                targets = torch.LongTensor(target_batch).to(device)
                # Step 2. Remember that Pytorch accumulates gradients.
                # We need to clear them out before each instance
                self.model.zero_grad()
                # Step 3. Run our forward pass.
                tag_scores = self.model(sentence_in)
                # Step 4. Compute the loss, gradients, and update the parameters by
                #  calling optimizer.step()
                loss = loss_function(tag_scores.view(-1, len(self.tag_to_ix)), targets.view(-1))
                total_loss += loss.item()
                loss_count += 1
                loss.backward()
                # TODO you may want to freeze the BERT encoder for a couple of epochs
                #   and then start performing full fine-tuning.
                for optimizer in self.model.optimizers:
                    optimizer.step()
                # HINT: getting the value of loss below 2.0 might mean your model is moving in the right direction!
                train_iterator.set_description(f"loss: {total_loss/loss_count:.3f}")
                del batch[:]

            if epoch == self.epochs - 1:
                epoch_str = '' 
            else:
                epoch_str = str(epoch)
            savefile = self.modelfile + epoch_str + self.modelsuffix
            print(f"Saving model file: {savefile}", file=sys.stderr)
            torch.save({
                        'epoch': epoch,
                        'model_state_dict': self.model.state_dict(),
                        'optimizer_state_dict': self.model.optimizers[0].state_dict(),
                        'loss': loss,
                        'tag_to_ix': self.tag_to_ix,
                        'ix_to_tag': self.ix_to_tag,
                    }, savefile)

The dev score is 90.5883

Improving the TransformerModel class involves several enhancements, particularly initializing a Conditional Random Field (CRF) layer for sequence labeling and adjusting the optimization strategy to use different learning rates for different parts of the model.

### Adding a CRF Layer

A CRF layer on top of the classification output can significantly improve the model's performance on sequence labeling tasks by considering the dependencies between labels in a sequence. The torchcrf library provides a PyTorch implementation of CRF which can be used here.

    Initialize the CRF Layer: After the classification head, initialize a CRF layer that will operate on the logits provided by the classification head.
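
To illustrate what "considering the dependencies between labels" means in practice, here is a small Viterbi decoder in plain Python. It is a simplified sketch, not the torchcrf implementation: it leaves out the start and end transition scores, and the emission and transition numbers are made up for the example.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring tag index sequence for one sentence.

    emissions[t][k] is the score of tag k at position t;
    transitions[j][k] is the score of moving from tag j to tag k.
    """
    num_tags = len(transitions)
    score = list(emissions[0])
    backpointers = []
    for emission in emissions[1:]:
        step_scores, step_back = [], []
        for cur in range(num_tags):
            prev = max(range(num_tags), key=lambda p: score[p] + transitions[p][cur])
            step_back.append(prev)
            step_scores.append(score[prev] + transitions[prev][cur] + emission[cur])
        score = step_scores
        backpointers.append(step_back)
    best = max(range(num_tags), key=score.__getitem__)
    path = [best]
    for step_back in reversed(backpointers):
        best = step_back[best]
        path.append(best)
    return path[::-1]

tags = ["O", "B-NP", "I-NP"]
emissions = [[2.0, 1.0, 0.0], [0.0, 1.5, 2.0]]
transitions = [[0.0, 0.0, -10.0],  # O -> I-NP is heavily penalised
               [0.0, 0.0, 0.0],
               [0.0, 0.0, 0.0]]

greedy = [max(range(len(tags)), key=row.__getitem__) for row in emissions]
best_path = viterbi(emissions, transitions)
print([tags[i] for i in greedy], [tags[i] for i in best_path])
```

The per-token argmax picks `O` followed by `I-NP`, which is not a valid chunk sequence, while the decoder that takes the transition penalty into account picks `O` followed by `B-NP`.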

### Adjusting Optimizers

Different parts of the model might benefit from being optimized at different learning rates. For example, the pre-trained encoder might require a smaller learning rate to avoid catastrophic forgetting, while the newly added layers (like the classification head and CRF layer) might use a higher learning rate to speed up their convergence.

In [8]:
class TransformerModel(nn.Module):

    def __init__(
            self,
            basemodel,
            tagset_size,
            lr=5e-5
        ):
        torch.manual_seed(1)
        super(TransformerModel, self).__init__()
        self.basemodel = basemodel
        # the encoder will be a BERT-like model that receives an input text in subwords and maps each subword into
        # contextual representations
        self.encoder = None
        # the hidden dimension of the BERT-like model will be automatically set in the init function!
        self.encoder_hidden_dim = 0
        # The linear layer that maps the subword contextual representation space to tag space
        self.classification_head = None
        # The CRF layer on top of the classification head to make sure the model learns to move from/to relevant tags
        # self.crf_layer = None
        # optimizers will be initialized in the init_model_from_scratch function
        self.optimizers = None
        self.init_model_from_scratch(basemodel, tagset_size, lr)

    def init_model_from_scratch(self, basemodel, tagset_size, lr):
        self.encoder = AutoModel.from_pretrained(basemodel)
        self.encoder_hidden_dim = self.encoder.config.hidden_size
        self.classification_head = nn.Linear(self.encoder_hidden_dim, tagset_size)
        # TODO initialize self.crf_layer in here as well.
        self.crf_layer = CRF(tagset_size, batch_first=True)
        # TODO modify the optimizers in a way that each model part is optimized with a proper learning rate!
        encoder_params = list(self.encoder.parameters())
        classifier_params = list(self.classification_head.parameters()) + list(self.crf_layer.parameters())
        self.optimizers = {
            'encoder': optim.Adam(encoder_params, lr=lr * 0.1),  # Lower lr for the encoder
            'classifier': optim.Adam(classifier_params, lr=lr)   # Specified lr for classifier and CRF
        }

    def forward(self, sentence_input, labels=None):
        encoded = self.encoder(sentence_input).last_hidden_state
        tag_space = self.classification_head(encoded)
        #tag_scores = F.log_softmax(tag_space, dim=-1)
        # TODO modify the tag_scores to use the parameters of the crf_layer
        if labels is not None:
            loss = -self.crf_layer(tag_space, labels, reduction='mean')  # Compute loss for training
            return loss
        else:
            return self.crf_layer.decode(tag_space)
        #return tag_scores

The code did not run. Decoding with the previously saved model failed with this error:
$ /bin/python3 /home/mihir/nlpclass-1241-g-CtrlAltDefeat/hw2/answer/default.py
Found modelfile data/chunker.pt. Starting decoding.
Traceback (most recent call last):
  File "/home/mihir/nlpclass-1241-g-CtrlAltDefeat/hw2/answer/default.py", line 337, in <module>
    decoder_output = chunker.decode(opts.inputfile)
  File "/home/mihir/nlpclass-1241-g-CtrlAltDefeat/hw2/answer/default.py", line 281, in decode
    model.load_state_dict(saved_model['model_state_dict'])
  File "/home/mihir/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2153, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for TransformerModel:
        Missing key(s) in state_dict: "crf_layer.start_transitions", "crf_layer.end_transitions", "crf_layer.transitions". 


So we removed the saved model file and ran the program again to retrain the model, but training failed with the following error:
$ /bin/python3 /home/mihir/nlpclass-1241-g-CtrlAltDefeat/hw2/answer/default.py
Could not find modelfile data/chunker.pt or -f used. Starting training.
  0%|                   | 0/8936 [00:00<?, ?it/s]We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.
/home/mihir/.local/lib/python3.10/site-packages/torchcrf/__init__.py:305: UserWarning: where received a uint8 condition tensor. This behavior is deprecated and will be removed in a future version of PyTorch. Use a boolean condition instead. (Triggered internally at ../aten/src/ATen/native/TensorCompare.cpp:519.)
  score = torch.where(mask[i].unsqueeze(1), next_score, score)
  0%|          | 15/8936 [00:00<06:39, 22.35it/s]
Traceback (most recent call last):
  File "/home/mihir/nlpclass-1241-g-CtrlAltDefeat/hw2/answer/default.py", line 332, in <module>
    chunker.train()
  File "/home/mihir/nlpclass-1241-g-CtrlAltDefeat/hw2/answer/default.py", line 228, in train
    loss = loss_function(tag_scores.view(-1, len(self.tag_to_ix)), targets.view(-1))
AttributeError: 'list' object has no attribute 'view'

When training a model with a CRF layer, you typically pass the logits (the output of the classification head before applying softmax) and the true labels to the CRF layer to calculate the negative log-likelihood loss. This means you don't need to calculate the loss separately using loss_function(tag_scores.view(-1, len(self.tag_to_ix)), targets.view(-1)). Instead, you use the CRF layer's own method for computing the loss.
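
To make this loss concrete, the sketch below computes the negative log-likelihood of a gold tag sequence by brute force for a toy sentence. It is only an illustration under simplifying assumptions: it enumerates every possible tag sequence (real CRF implementations use the forward algorithm instead), it omits start and end transition scores, and all numbers are made up.

```python
import itertools
import math

def path_score(emissions, transitions, path):
    """Sum of emission scores along the path plus transition scores between neighbours."""
    score = emissions[0][path[0]]
    for t in range(1, len(path)):
        score += transitions[path[t - 1]][path[t]] + emissions[t][path[t]]
    return score

def crf_nll(emissions, transitions, gold_path):
    """Negative log-likelihood: log of the sum over all paths minus the gold path score."""
    num_tags = len(transitions)
    all_paths = itertools.product(range(num_tags), repeat=len(emissions))
    log_z = math.log(sum(math.exp(path_score(emissions, transitions, p)) for p in all_paths))
    return log_z - path_score(emissions, transitions, gold_path)

emissions = [[2.0, 1.0, 0.0], [0.0, 1.5, 2.0]]
transitions = [[0.0, 0.0, -10.0],
               [0.0, 0.0, 0.0],
               [0.0, 0.0, 0.0]]
gold = (0, 1)  # tag indices of the reference sequence
print(f"NLL of gold path: {crf_nll(emissions, transitions, gold):.3f}")
```

Because the loss is defined over whole sequences, there is no per-token score tensor to reshape with `.view(...)`, which is why the old loss call has to go.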



In [9]:
def train(self):
        self.load_training_data(self.trainfile)
        self.model = TransformerModel(self.basemodel, len(self.tag_to_ix), lr=self.lr).to(device)
        self.model.train()

        total_loss = 0
        for epoch in range(self.epochs):
            # Optionally freeze the encoder during the first few epochs
            if epoch < 2:  # Example: freeze for the first 2 epochs
                for param in self.model.encoder.parameters():
                    param.requires_grad = False
            else:
                for param in self.model.encoder.parameters():
                    param.requires_grad = True

            train_iterator = tqdm.tqdm(self.training_data)
            for tokenized_sentence, tags in train_iterator:
                self.model.zero_grad()
                inputs, targets = self.prepare_sequence(tokenized_sentence, tags)
                attention_mask = inputs['attention_mask'].to(device)
                inputs = inputs['input_ids'].to(device)
                targets = targets.to(device)

                # The forward pass now returns the loss directly
                loss = self.model(inputs, attention_mask=attention_mask, labels=targets)

                loss.backward()
                for optimizer in self.model.optimizers.values():
                    optimizer.step()

                total_loss += loss.item()

            # Display average loss for the epoch
            avg_loss = total_loss / len(self.training_data)
            train_iterator.set_description(f"Epoch {epoch} Loss: {avg_loss:.3f}")

            # Save model checkpoint
            if epoch == self.epochs - 1:
                savefile = self.modelfile + self.modelsuffix  # Last epoch
            else:
                savefile = f"{self.modelfile}_epoch_{epoch}{self.modelsuffix}"
            torch.save({
                'epoch': epoch,
                'model_state_dict': self.model.state_dict(),
                'optimizer_state_dict': {name: opt.state_dict() for name, opt in self.model.optimizers.items()},
                'loss': avg_loss,
                'tag_to_ix': self.tag_to_ix,
                'ix_to_tag': self.ix_to_tag,
            }, savefile)
            print(f"Model saved to {savefile}")

Now we got the following error:
$ /bin/python3 /home/mihir/nlpclass-1241-g-CtrlAltDefeat/hw2/answer/default.py
Could not find modelfile data/chunker.pt or -f used. Starting training.
  0%|                   | 0/8936 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/mihir/nlpclass-1241-g-CtrlAltDefeat/hw2/answer/default.py", line 301, in <module>
    chunker.train()
  File "/home/mihir/nlpclass-1241-g-CtrlAltDefeat/hw2/answer/default.py", line 195, in train
    loss = self.model(inputs, attention_mask=attention_mask, labels=targets)
  File "/home/mihir/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/mihir/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
TypeError: TransformerModel.forward() got an unexpected keyword argument 'attention_mask'

This means the forward function did not accept an attention_mask argument, so we tried to fix it as follows:

In [10]:
## change
def forward(self, input_ids, attention_mask=None, labels=None):
        # Encode the inputs
        outputs = self.encoder(input_ids, attention_mask=attention_mask)
        sequence_output = outputs.last_hidden_state

        # Pass the sequence output through the classification head
        logits = self.classification_head(sequence_output)

        if labels is not None:
            # If labels are provided, calculate and return the loss using the CRF layer
            loss = -self.crf_layer(logits, labels, mask=attention_mask.byte(), reduction='mean')
            return loss
        else:
            # Otherwise, return the decoded sequences from the CRF layer
            return self.crf_layer.decode(logits, mask=attention_mask.byte())


Now we got the following error:
$ /bin/python3 /home/mihir/nlpclass-1241-g-CtrlAltDefeat/hw2/answer/default.py
Could not find modelfile data/chunker.pt or -f used. Starting training.
  0%|                   | 0/8936 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/mihir/nlpclass-1241-g-CtrlAltDefeat/hw2/answer/default.py", line 305, in <module>
    chunker.train()
  File "/home/mihir/nlpclass-1241-g-CtrlAltDefeat/hw2/answer/default.py", line 199, in train
    loss = self.model(inputs, attention_mask=attention_mask, labels=targets)
  File "/home/mihir/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/mihir/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/mihir/nlpclass-1241-g-CtrlAltDefeat/hw2/answer/default.py", line 78, in forward
    loss = -self.crf_layer(logits, labels, mask=attention_mask.byte(), reduction='mean')
  File "/home/mihir/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/mihir/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/mihir/.local/lib/python3.10/site-packages/torchcrf/__init__.py", line 90, in forward
    self._validate(emissions, tags=tags, mask=mask)
  File "/home/mihir/.local/lib/python3.10/site-packages/torchcrf/__init__.py", line 155, in _validate
    raise ValueError(
ValueError: the first two dimensions of emissions and tags must match, got (1, 41) and (41,)

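The error shows that the emissions have a batch dimension while the targets do not. We realised that the previous change to forward had not been accounted for in train, so we made the following change: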

In [11]:
def train(self):
    self.load_training_data(self.trainfile)
    self.model = TransformerModel(self.basemodel, len(self.tag_to_ix), lr=self.lr).to(device)
    self.model.train()

    for epoch in range(self.epochs):
        
        if epoch < 2:  
            for param in self.model.encoder.parameters():
                param.requires_grad = False
        else:
            for param in self.model.encoder.parameters():
                param.requires_grad = True

        total_loss = 0
        train_iterator = tqdm.tqdm(self.training_data, total=len(self.training_data), desc=f"Epoch {epoch}")
        
        for tokenized_sentence, tags in train_iterator:
            self.model.zero_grad()

            # Prepare inputs and targets
            inputs, targets = self.prepare_sequence(tokenized_sentence, tags)
            # Ensure inputs, targets, and attention_mask are on the correct device
            input_ids = inputs['input_ids'].to(device)
            attention_mask = inputs['attention_mask'].to(device)
            targets = targets.to(device)

            # Ensure targets tensor has a batch dimension
            targets = targets.unsqueeze(0) if targets.dim() == 1 else targets

            # Forward pass and calculate loss
            loss = self.model(input_ids, attention_mask=attention_mask, labels=targets)

            # Backward pass and optimize
            loss.backward()
            for optimizer in self.model.optimizers.values():
                optimizer.step()

            total_loss += loss.item()

        avg_loss = total_loss / len(self.training_data)
        print(f"Average loss for epoch {epoch}: {avg_loss:.4f}")

        # Save model at the end of each epoch
        save_path = f"{self.modelfile}_epoch_{epoch}{self.modelsuffix}" if epoch < self.epochs - 1 else f"{self.modelfile}{self.modelsuffix}"
        torch.save({
            'epoch': epoch,
            'model_state_dict': self.model.state_dict(),
            'optimizer_state_dict': {name: opt.state_dict() for name, opt in self.model.optimizers.items()},
            'loss': avg_loss,
            'tag_to_ix': self.tag_to_ix,
            'ix_to_tag': self.ix_to_tag,
        }, save_path)
        print(f"Model saved to {save_path}")


Before running this to completion, we decided to drop the CRF because it was taking too much time to train. Without it, we got a dev score of 90.5883.

To add the suggested feed-forward network (FFN) to the classification head without changing anything else in the code:

    We define a new module, the FFN, which includes dropout and additional linear layers.
    We replace the current single linear layer in self.classification_head with this new module.

In [13]:
class TransformerModel(nn.Module):

    def __init__(
            self,
            basemodel,
            tagset_size,
            lr=5e-5
        ):
        torch.manual_seed(1)
        super(TransformerModel, self).__init__()
        self.basemodel = basemodel
        # the encoder will be a BERT-like model that receives an input text in subwords and maps each subword into
        # contextual representations
        self.encoder = None
        # the hidden dimension of the BERT-like model will be automatically set in the init function!
        self.encoder_hidden_dim = 0
        # The linear layer that maps the subword contextual representation space to tag space
        self.classification_head = None
        # The CRF layer on top of the classification head to make sure the model learns to move from/to relevant tags
        # self.crf_layer = None
        # optimizers will be initialized in the init_model_from_scratch function
        self.optimizers = None
        self.init_model_from_scratch(basemodel, tagset_size, lr)

    def init_model_from_scratch(self, basemodel, tagset_size, lr):
        self.encoder = AutoModel.from_pretrained(basemodel)
        self.encoder_hidden_dim = self.encoder.config.hidden_size
        self.classification_head = nn.Sequential(
            nn.Dropout(p=0.1),
            nn.Linear(self.encoder_hidden_dim, 3072),
            nn.ReLU(),  
            nn.Linear(3072, 768),
            nn.ReLU(),  
            nn.Linear(768, tagset_size)
        )
        # TODO initialize self.crf_layer in here as well.
        # TODO modify the optimizers in a way that each model part is optimized with a proper learning rate!
        self.optimizers = [
            optim.Adam(
                list(self.encoder.parameters()) + list(self.classification_head.parameters()),
                lr=lr
            )
        ]

    def forward(self, sentence_input):
        encoded = self.encoder(sentence_input).last_hidden_state
        tag_space = self.classification_head(encoded)
        tag_scores = F.log_softmax(tag_space, dim=-1)
        # TODO modify the tag_scores to use the parameters of the crf_layer
        return tag_scores

The dev score is 88.0291.

Next, we changed the load_training_data function. It reads the training data from a file, builds a copy of each sentence with misspelled words (intended to improve model robustness), and maps each unique tag found in the data to a unique index. The data is then used to train the chunker. The function also logs the tag-to-index mappings for reference.
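
The augment_with_misspellings helper itself is not shown in this notebook. As a rough idea of what such a helper could do, here is a hypothetical stand-in that swaps two adjacent interior characters; its signature (an explicit `rng` and `prob`) and its behaviour are assumptions for illustration, not our actual implementation.

```python
import random

def augment_with_misspellings(word, rng, prob=0.1):
    """With probability `prob`, swap two adjacent interior characters of `word`."""
    if len(word) < 4 or rng.random() >= prob:
        return word  # short words and most words are left untouched
    i = rng.randrange(1, len(word) - 2)  # keep the first and last characters in place
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

rng = random.Random(0)  # fixed seed so the augmentation is reproducible
sentence = ["Confidence", "in", "the", "pound", "is", "widely", "expected"]
print([augment_with_misspellings(w, rng, prob=1.0) for w in sentence])
```

Keeping the first and last characters fixed is a design choice of this sketch: it keeps the noisy word recognisable while still changing how the tokenizer splits it into subwords.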

In [14]:
def load_training_data(self, trainfile):
        augmented_training_data = []
        if trainfile[-3:] == '.gz':
            with gzip.open(trainfile, 'rt') as f:
                self.training_data = read_conll(f)
        else:
            with open(trainfile, 'r') as f:
                self.training_data = read_conll(f)
        # Build a misspelled copy of every sentence; the tags stay the same.
        for sentence, tags in self.training_data:
            augmented_sentence = [self.augment_with_misspellings(word) for word in sentence]
            augmented_training_data.append((augmented_sentence, tags))
        # NOTE: augmented_training_data is never added to self.training_data,
        # so training still sees only the original sentences.

        for sent, tags in self.training_data:
            for tag in tags:
                if tag not in self.tag_to_ix:
                    self.tag_to_ix[tag] = len(self.tag_to_ix)
                    self.ix_to_tag.append(tag)

        # logging uses %-style formatting, so the value needs a %s placeholder
        logging.info("tag_to_ix: %s", self.tag_to_ix)
        logging.info("ix_to_tag: %s", self.ix_to_tag)

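The dev score is 88.0291, the same as before. This is consistent with the code above: the augmented sentences are built but never added to self.training_data.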

Since our score was not going up, we decided to go back to our previous submission, which has a dev score of 90.5883.