# Bertchunker Program

In [3]:
from default import *
import os, sys

## Running Solution on Dev

In [4]:
chunker = FinetuneTagger(os.path.join('..', 'data', 'chunker'), modelsuffix='.pt')
decoder_output = chunker.decode(os.path.join('..', 'data', 'input', 'dev.txt'))

tokenizer_config.json: 100%|██████████| 28.0/28.0 [00:00<?, ?B/s]
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
config.json: 100%|██████████| 483/483 [00:00<?, ?B/s] 
vocab.txt: 100%|██████████| 232k/232k [00:00<00:00, 1.63MB/s]
tokenizer.json: 100%|██████████| 466k/466k [00:00<00:00, 12.6MB/s]
model.safetensors: 100%|██████████| 268M/268M [00:03<00:00, 76.4MB/s] 
100%|██████████| 1027/1027 [00:34<00:00, 30.18it/s]


The warnings from the transformers library in the output above are expected and can be ignored.

## Evaluate the Output

In [5]:
flat_output = [ output for sent in decoder_output for output in sent ]
sys.path.append('..')
import conlleval
true_seqs = []
with open(os.path.join('..', 'data', 'reference', 'dev.out')) as r:
    for sent in conlleval.read_file(r):
        true_seqs += sent.split()
conlleval.evaluate(true_seqs, flat_output)

processed 23663 tokens with 11896 phrases; found: 12023 phrases; correct: 11264.
accuracy:  96.34%; (non-O)
accuracy:  96.42%; precision:  93.69%; recall:  94.69%; FB1:  94.18
             ADJP: precision:  73.03%; recall:  77.88%; FB1:  75.37  241
             ADVP: precision:  79.70%; recall:  78.89%; FB1:  79.29  394
            CONJP: precision:  45.45%; recall:  71.43%; FB1:  55.56  11
             INTJ: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
               NP: precision:  94.34%; recall:  95.17%; FB1:  94.76  6292
               PP: precision:  97.66%; recall:  97.34%; FB1:  97.50  2433
              PRT: precision:  69.23%; recall:  80.00%; FB1:  74.23  52
             SBAR: precision:  90.00%; recall:  91.14%; FB1:  90.57  240
               VP: precision:  93.43%; recall:  95.70%; FB1:  94.55  2360


(93.68709972552608, 94.68728984532616, 94.18453948743677)
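
The three numbers in the returned tuple match the overall precision, recall, and FB1 printed in the summary. As a sanity check, they can be recomputed from the phrase counts in the first line of the report. The helper name `chunk_scores` below is hypothetical and not part of `conlleval`; it is only a sketch of the standard formulas.

```python
def chunk_scores(correct, found, gold):
    """Return (precision, recall, F1) as percentages from phrase counts."""
    precision = 100.0 * correct / found if found else 0.0
    recall = 100.0 * correct / gold if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Counts copied from the evaluation summary above:
# 11896 reference phrases, 12023 predicted phrases, 11264 correct.
p, r, f1 = chunk_scores(correct=11264, found=12023, gold=11896)
print(f"{p:.2f} {r:.2f} {f1:.2f}")  # 93.69 94.69 94.18
```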

# Documentation

## Transformer-based Sequence Tagger

### Overview
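This program implements a Transformer-based model for sequence tagging. The notebook above applies it to phrase chunking (labeling NP, VP, PP, and similar phrases); the same setup also suits related tagging tasks such as Named Entity Recognition (NER). It uses the Hugging Face Transformers library for the model architecture and the pre-trained encoder.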

### Components

#### 1. create_mispelling(word)
Description: Generates misspelled versions of input words to augment training data.
Parameters:
- word: Input word for misspelling.

Returns:
- Misspelled version of the input word.
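
A minimal sketch of one misspelling variant, assuming the swap is between two adjacent characters chosen at random. The function name `swap_adjacent_chars` and the `rng` parameter are hypothetical; the actual `create_mispelling` may choose characters differently.

```python
import random

def swap_adjacent_chars(word, rng=random):
    """Return `word` with one random pair of adjacent characters swapped."""
    if len(word) < 2:
        return word  # nothing to swap in a one-character word
    i = rng.randrange(len(word) - 1)  # index of the left character of the pair
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

# A seeded generator keeps the example reproducible.
rng = random.Random(0)
misspelled = swap_adjacent_chars("chunker", rng)
print(misspelled)
```

The result always has the same length and the same characters as the input, so it stays aligned with the original label.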

#### 2. read_conll(handle, input_idx=0, label_idx=2)
Description: Reads CoNLL-formatted data from a file, handling both input words and their corresponding labels.

Parameters:
- handle: File handle for reading data.
- input_idx: Index of the input word in each line (default: 0).
- label_idx: Index of the label in each line (default: 2).

Returns:
- List of tuples containing input words and their labels.
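
The reading logic can be sketched as follows, assuming whitespace-separated columns (word, part-of-speech tag, chunk tag) and blank lines between sentences. The name `read_conll_sketch` is hypothetical, and the real `read_conll` may differ in details such as how it handles unlabeled input.

```python
import io

def read_conll_sketch(handle, input_idx=0, label_idx=2):
    """Collect one (words, labels) tuple per blank-line-separated sentence."""
    sentences, words, labels = [], [], []
    for line in handle:
        fields = line.split()
        if not fields:  # a blank line ends the current sentence
            if words:
                sentences.append((words, labels))
                words, labels = [], []
            continue
        words.append(fields[input_idx])
        labels.append(fields[label_idx])
    if words:  # the last sentence may lack a trailing blank line
        sentences.append((words, labels))
    return sentences

# Two tiny sentences in the assumed three-column layout.
sample = io.StringIO("He PRP B-NP\nreckons VBZ B-VP\n\nIt PRP B-NP\n")
data = read_conll_sketch(sample)
print(data)
```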

#### 3. TransformerModel
Description: Custom Transformer-based model for sequence tagging.

Architecture:
- Uses a pre-trained Transformer-based encoder (e.g., BERT) for contextual word representations.
- Adds a linear layer on top of the encoder that classifies each token into one of the output tags.

Methods:
- init_model_from_scratch: Initializes the model with specified parameters.
- forward: Defines the forward pass of the model.
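
The encoder-plus-linear-layer design can be sketched in PyTorch as below. To keep the example free of model downloads, an `nn.Embedding` stands in for the pre-trained encoder; the class name `TaggerSketch` and all sizes are assumptions for illustration, not the actual `TransformerModel`.

```python
import torch
import torch.nn as nn

class TaggerSketch(nn.Module):
    """Encoder followed by a per-token linear classifier."""

    def __init__(self, encoder, hidden_size, num_tags):
        super().__init__()
        self.encoder = encoder
        self.classifier = nn.Linear(hidden_size, num_tags)

    def forward(self, input_ids):
        # Stand-in encoder returns (batch, seq_len, hidden_size) directly.
        # With a Hugging Face encoder one would typically use
        # encoder(input_ids).last_hidden_state here instead.
        hidden = self.encoder(input_ids)
        return self.classifier(hidden)  # (batch, seq_len, num_tags)

# Stand-in encoder: a plain embedding table with 100 ids and 16 dimensions.
encoder = nn.Embedding(num_embeddings=100, embedding_dim=16)
model = TaggerSketch(encoder, hidden_size=16, num_tags=5)
logits = model(torch.randint(0, 100, (2, 7)))
print(logits.shape)  # torch.Size([2, 7, 5])
```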

#### 4. FinetuneTagger
Description: Class for training and decoding the sequence tagger model.

Attributes:
- tokenizer: Tokenizer for processing input sequences.
- trainfile: Path to the training data file.
- modelfile: File to save the trained model.
- modelsuffix: Suffix for the model file.
- basemodel: Pre-trained base model for the encoder.
- epochs: Number of training epochs.
- batchsize: Batch size for training.
- lr: Learning rate for fine-tuning.
- training_data: List of tuples containing training data.
- tag_to_ix: Dictionary mapping tags to indices.
- ix_to_tag: List mapping indices to tags.
- model: Instance of the Transformer model.

Methods:
- load_training_data: Loads and preprocesses training data.
- prepare_sequence: Prepares input sequences for training or inference.
- argmax: Performs decoding to obtain predicted labels.
- train: Trains the sequence tagger model.
- model_str: Returns a string representation of the trained model.
- decode: Decodes input sequences using the trained model.
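
To illustrate how `tag_to_ix`, `ix_to_tag`, and `argmax` fit together, here is a minimal pure-Python sketch. It assumes the training data is a list of `(words, labels)` tuples and that the model produces one row of scores per token; the helper names `build_tag_maps` and `argmax_decode` are hypothetical.

```python
def build_tag_maps(training_data):
    """Assign each distinct tag an index in order of first appearance."""
    tag_to_ix, ix_to_tag = {}, []
    for _, labels in training_data:
        for tag in labels:
            if tag not in tag_to_ix:
                tag_to_ix[tag] = len(ix_to_tag)
                ix_to_tag.append(tag)
    return tag_to_ix, ix_to_tag

def argmax_decode(scores, ix_to_tag):
    """Pick the highest-scoring tag for each token."""
    return [ix_to_tag[max(range(len(row)), key=row.__getitem__)] for row in scores]

training_data = [
    (["He", "reckons"], ["B-NP", "B-VP"]),
    (["the", "deficit"], ["B-NP", "I-NP"]),
]
tag_to_ix, ix_to_tag = build_tag_maps(training_data)

# One row of made-up scores per token, one column per tag index.
scores = [[0.1, 2.0, -1.0], [3.0, 0.5, 0.2]]
predicted = argmax_decode(scores, ix_to_tag)
print(predicted)  # ['B-VP', 'B-NP']
```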

### Usage
1. Training:
    - Initialize FinetuneTagger with appropriate parameters.
    - Call the train method to train the model.
2. Decoding:
    - Initialize FinetuneTagger with the trained model file.
    - Call the decode method with input data to obtain predictions.

### Dependencies
- Python 3.x
- PyTorch
- Hugging Face Transformers
- tqdm (for progress bars)

### References
- Hugging Face Transformers Documentation: https://huggingface.co/transformers/
- PyTorch Documentation: https://pytorch.org/docs/stable/index.html

## Analysis

### Initial Approach
- Misspelling Function: Implemented a function to generate misspelled words, primarily swapping two characters.
- Augmenting Training Set: Initially replaced original words in the training set with misspelled versions.

### Experimentation
1. Increasing Variations:
    - Expanded misspelling variations (e.g., 1 character deletion, addition, and replacement).
    - Limited impact on model performance observed.

2. Augmenting Strategy:
    - Augmented the training set by adding misspelled versions alongside original words.
    - Preserved original data while enhancing diversity.

3. Optimizing Misspelling Rate:
    - Found that increasing misspelling rate to 40% yielded optimal results.
    - Balanced diversity and data integrity effectively.

4. Doubling Training Set:
    - Doubled training set by adding misspelled versions at an 80% rate.
    - Substantial performance improvement achieved (~94.5% accuracy).
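
The doubling strategy can be sketched as follows. This assumes the rate is a per-word probability of being misspelled and that one noisy copy is appended per original sentence; the names `misspell` and `augment` are hypothetical, and the swap-based `misspell` is only one of the variations described above.

```python
import random

def misspell(word, rng):
    """Toy misspelling: swap two adjacent characters (assumed variant)."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]

def augment(training_data, rate, rng):
    """Keep every original sentence and append one noisy copy of each."""
    augmented = list(training_data)
    for words, labels in training_data:
        # Each word is misspelled independently with probability `rate`.
        noisy = [misspell(w, rng) if rng.random() < rate else w for w in words]
        augmented.append((noisy, labels))  # labels are unchanged
    return augmented

rng = random.Random(13)
train = [(["Rockwell", "said", "the", "agreement"], ["B-NP", "B-VP", "B-NP", "I-NP"])]
doubled = augment(train, rate=0.8, rng=rng)
print(len(train), len(doubled))  # 1 2
```

Because the originals are kept, the training set doubles in size, which is consistent with the longer training time noted in the observations below.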

### Observations
1. Effective Strategies:
    - Augmenting training data with diverse misspellings enhanced model performance.
    - Optimizing misspelling rate and maintaining data integrity were key factors.
2. Trade-offs:
    - Doubling training set improved performance but increased training time significantly.
    - Balancing augmentation benefits with computational cost is critical.

### Conclusion
- Effective Approaches:
    - Diverse misspelling augmentation and optimized misspelling rate improved model accuracy.
- Future Considerations:
    - Further experimentation with misspelling variations and rates could refine model performance.
    - Exploring methods to mitigate computational overhead while augmenting training data is essential.