# bertchunker: default program

In [4]:
from bertchunker import *
import os, sys

## Run the default solution on dev

In [5]:
chunker = FinetuneTagger(os.path.join('..', 'data', 'chunker'), modelsuffix='.pt')
decoder_output = chunker.decode(os.path.join('..', 'data', 'input', 'dev.txt'))

100%|██████████| 1027/1027 [00:27<00:00, 36.79it/s]


Ignore the warnings from the transformers library. They are expected to occur.

## Evaluate the default output

In [6]:
flat_output = [ output for sent in decoder_output for output in sent ]
sys.path.append('..')
import conlleval
true_seqs = []
with open(os.path.join('..', 'data', 'reference', 'dev.out')) as r:
    for sent in conlleval.read_file(r):
        true_seqs += sent.split()
conlleval.evaluate(true_seqs, flat_output)

processed 23663 tokens with 11896 phrases; found: 11952 phrases; correct: 11294.
accuracy:  96.70%; (non-O)
accuracy:  96.67%; precision:  94.49%; recall:  94.94%; FB1:  94.72
             ADJP: precision:  70.97%; recall:  77.88%; FB1:  74.26  248
             ADVP: precision:  80.75%; recall:  81.16%; FB1:  80.95  400
            CONJP: precision:  50.00%; recall:  71.43%; FB1:  58.82  10
             INTJ: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
               NP: precision:  95.29%; recall:  95.45%; FB1:  95.37  6247
               PP: precision:  97.83%; recall:  97.91%; FB1:  97.87  2443
              PRT: precision:  78.72%; recall:  82.22%; FB1:  80.43  47
             SBAR: precision:  89.70%; recall:  88.19%; FB1:  88.94  233
               VP: precision:  94.71%; recall:  95.53%; FB1:  95.12  2324


(94.49464524765729, 94.9394754539341, 94.71653807447164)
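The tuple above can be reproduced from the counts on the first line of the report (11294 correct phrases, 11952 found, 11896 in the reference). The following is a minimal sketch assuming the usual definitions of precision, recall and FB1; `chunk_scores` is a hypothetical helper written for illustration and is not part of `conlleval`.

```python
def chunk_scores(correct, found, gold):
    """Return (precision, recall, FB1) as percentages from phrase counts."""
    # Precision: share of predicted phrases that are correct.
    precision = 100.0 * correct / found if found else 0.0
    # Recall: share of reference phrases that were recovered.
    recall = 100.0 * correct / gold if gold else 0.0
    # FB1: harmonic mean of precision and recall.
    denom = precision + recall
    fb1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, fb1


# Counts taken from the evaluation report above.
precision, recall, fb1 = chunk_scores(correct=11294, found=11952, gold=11896)
print(f"precision: {precision:.2f}%; recall: {recall:.2f}%; FB1: {fb1:.2f}")
# prints "precision: 94.49%; recall: 94.94%; FB1: 94.72"
```

The zero guards also account for the INTJ row: with no INTJ phrases found and none correct, all three scores fall back to 0.00.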

## Documentation

We add noise to the training data by injecting spelling errors into words, so that the model learns to handle typos:
```
def introduce_spelling_errors(word, error_rate=0.4):
    """
    Introduce a spelling error into a word with probability error_rate.

    The error is one of: swapping two adjacent letters, inserting a random letter,
    deleting a random letter, or replacing a letter with another random letter.

    Parameters:
    - word (str): The input word to which spelling errors will be introduced.
    - error_rate (float): The probability of introducing a spelling error, ranging from 0 to 1.

    Returns:
    - str: The modified word with or without spelling errors.

    Example (outputs vary from run to run because the error is random):
    >>> introduce_spelling_errors("example")
    'examlpe'

    >>> introduce_spelling_errors("python", error_rate=0.2)
    'pythoon'
    """
```
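Only the docstring of the function is shown above. One way such a function could be implemented is sketched below. This is an assumption-laden illustration, not the code used for the reported results: the `rng` parameter, the rule that leaves single-letter words unchanged, the restriction to lowercase ASCII letters, and the `add_noise` helper are all hypothetical.

```python
import random
import string


def introduce_spelling_errors(word, error_rate=0.4, rng=random):
    """Return `word` with one random typo, applied with probability `error_rate`."""
    # Assumption: leave single-letter words alone, since a swap needs two letters.
    if len(word) < 2 or rng.random() >= error_rate:
        return word

    operation = rng.choice(["swap", "insert", "delete", "replace"])
    letters = list(word)

    if operation == "swap":
        # Swap two adjacent letters.
        i = rng.randrange(len(letters) - 1)
        letters[i], letters[i + 1] = letters[i + 1], letters[i]
    elif operation == "insert":
        # Insert a random lowercase letter at a random position.
        i = rng.randrange(len(letters) + 1)
        letters.insert(i, rng.choice(string.ascii_lowercase))
    elif operation == "delete":
        # Delete one letter at a random position.
        del letters[rng.randrange(len(letters))]
    else:
        # Replace one letter with a random lowercase letter.
        i = rng.randrange(len(letters))
        letters[i] = rng.choice(string.ascii_lowercase)

    return "".join(letters)


def add_noise(tokens, error_rate=0.4, rng=random):
    """Perturb tokens one by one; the token count is unchanged, so tags stay aligned."""
    return [introduce_spelling_errors(token, error_rate, rng) for token in tokens]
```

Because each token is perturbed independently and never split or dropped, the chunk tags of a training sentence can be reused as they are. Passing a seeded `random.Random` instance as `rng` makes the generated noise reproducible.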

We also modify the network structure of the classification head, using a feed-forward network with two hidden layers:
```
nn.Sequential(
    nn.Dropout(p=0.1),
    nn.Linear(in_features=768, out_features=100, bias=True),
    nn.ReLU(),
    nn.Linear(in_features=100, out_features=100, bias=True),
    nn.ReLU(),
    nn.Linear(in_features=100, out_features=22, bias=True)
)
```
```
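To give a sense of the size of the head above, its trainable parameters can be counted by hand from the layer sizes. This is a small worked calculation; `linear_params` is a hypothetical helper, and dropout and ReLU contribute no parameters.

```python
# Layer widths of the head: 768 -> 100 -> 100 -> 22.
layer_sizes = [768, 100, 100, 22]


def linear_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


per_layer = [linear_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)
print(per_layer, total)  # prints "[76900, 10100, 2222] 89222"
```

The head therefore has roughly 89 thousand parameters, most of them in the first projection from the 768-dimensional input.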

We give the classification head its own optimizer with a learning rate of 5e-4, separate from the encoder's learning rate:
```
encoder_optimizer = optim.Adam(self.encoder.parameters(), lr=lr)
classification_head_optimizer = optim.Adam(self.classification_head.parameters(), lr=5e-4)
self.optimizers = [encoder_optimizer, classification_head_optimizer]
```
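With two optimizers, every training step has to zero and step both of them. The sketch below shows one such step on dummy data. It makes several assumptions: a small linear layer stands in for the BERT encoder, the encoder learning rate `lr = 1e-5` is a hypothetical value, and the tensors are random rather than real chunking data.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)

# Stand-in for the encoder: any module that emits 768-dimensional token vectors.
encoder = nn.Linear(16, 768)
classification_head = nn.Sequential(
    nn.Dropout(p=0.1),
    nn.Linear(768, 100), nn.ReLU(),
    nn.Linear(100, 100), nn.ReLU(),
    nn.Linear(100, 22),
)

lr = 1e-5  # hypothetical encoder learning rate, not taken from the source
optimizers = [
    optim.Adam(encoder.parameters(), lr=lr),
    optim.Adam(classification_head.parameters(), lr=5e-4),
]
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randn(8, 16)        # 8 dummy token feature vectors
tags = torch.randint(0, 22, (8,))  # 8 dummy gold tag ids

# One training step: clear gradients, backpropagate once, step both optimizers.
for optimizer in optimizers:
    optimizer.zero_grad()
logits = classification_head(encoder(tokens))  # shape (8, 22)
loss = loss_fn(logits, tags)
loss.backward()
for optimizer in optimizers:
    optimizer.step()
```

A single `optim.Adam` built from two parameter groups, with `'lr': 5e-4` set on the head's group, should have the same effect; keeping two optimizers makes the split between encoder and head explicit.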

## Analysis

* The FB1 score on the dev set improved to 94.7165.
* Precision and recall both improved significantly, to 94.49% and 94.94% respectively.
* Every category except INTJ improved significantly; INTJ stays at 0.00 because the model predicted no INTJ phrases.


Ideas that worked
* Adding noise to the training data so that the model learns to handle typos.
* Adding more hidden layers to the feed-forward network of the classification head.

Ideas that did not work
* Setting `error_rate` to 0.5 or higher to increase the share of noisy words lowered the FB1 score.