# Summary

This notebook demonstrates several advanced text classification techniques for BERT-like LLMs. The examples are based on a project that detects misconceptions in student responses, a challenging NLP task. Some examples use a private dataset developed for this research. Because the dataset is not publicly available, we include a section that presents actual samples and gives an overview of its structure. For more details on the research and the dataset, please refer to my [thesis](https://digitalcommons.unf.edu/etd/1234/).

# Dataset Overview

### Background

The data is collected from a quiz administered on the fifth day of an introductory circuit analysis course. In the quiz, students are asked to determine and explain the values of each element in an electrical circuit, given that the value of a specific element has changed. Student answers are provided in paragraph format and include domain-specific terminology, abbreviations, acronyms, nomenclature, and equations.

Seven distinct misconceptions have been identified by experts in the student answers. In most cases, misconceptions are identified from a single sentence. However, some misconceptions exhibit inter-sentence dependencies, making them undetectable when sentences are analyzed independently. These misconceptions can only be identified when a sentence is examined in the context of a preceding sentence. As a result, the student answers are annotated at the sentence level.

### Example of Student Answer

Vs is an ideal component, so changing R2 will not affect it. R1 is in series with iR23 so changing R2 will also not affect the power associated with it. R3 will slowly have a decrease in power as R2's resistance goes to 0, with R2 at zero coinciding with no power in R3. R2 will increase in power, because the current is going to increase with a decrease in resistance. So Vs is the same, R1 is the same, R2 is larger and R3 is smaller.

### Dataset Structure

The following rows correspond to the student answer above, one row per sentence. The question and reference answer are truncated for brevity. The `hypothesis` column holds the sentence under evaluation, while the `context` column holds all sentences that precede it (empty for the first sentence). In this student answer, the second sentence (`id = 2`) contains a misconception. The label `none` means that no misconception is present.

<table>
   <thead>
      <tr>
         <th style="width: 1%; text-align: center;">id</th>
         <th style="width: 15%; text-align: center;">question</th>
         <th style="width: 15%; text-align: center;">reference_answer</th>
         <th style="width: 35%; text-align: center;">context</th>
         <th style="width: 35%; text-align: center;">hypothesis</th>
         <th style="width: 1%; text-align: center;">label</th>
      </tr>
   </thead>
   <tbody>
      <tr>
         <td>1</td>
         <td>Resistors R1 and R2 are...</td>
         <td>As the resistance of R2...</td>
         <td></td>
         <td>Vs is an ideal component, so changing R2 will not affect it.</td>
         <td>none</td>
      </tr>
      <tr>
         <td>2</td>
         <td>Resistors R1 and R2 are...</td>
         <td>As the resistance of R2...</td>
         <td>Vs is an ideal component, so changing R2 will not affect it.</td>
         <td>R1 is in series with iR23 so changing R2 will also not affect the power associated with it.</td>
         <td>SM</td>
      </tr>
      <tr>
         <td>3</td>
         <td>Resistors R1 and R2 are...</td>
         <td>As the resistance of R2...</td>
         <td>Vs is an ideal component, so changing R2 will not affect it. R1 is in series with iR23 so changing R2 will also not affect the power associated with it.</td>
         <td>R3 will slowly have a decrease in power as R2's resistance goes to 0, with R2 at zero coinciding with no power in R3.</td>
         <td>none</td>
      </tr>
      <tr>
         <td>4</td>
         <td>Resistors R1 and R2 are...</td>
         <td>As the resistance of R2...</td>
         <td>Vs is an ideal component, so changing R2 will not affect it. R1 is in series with iR23 so changing R2 will also not affect the power associated with it. R3 will slowly have a decrease in power as R2's resistance goes to 0, with R2 at zero coinciding with no power in R3.</td>
         <td>R2 will increase in power, because the current is going to increase with a decrease in resistance.</td>
         <td>none</td>
      </tr>
      <tr>
         <td>5</td>
         <td>Resistors R1 and R2 are...</td>
         <td>As the resistance of R2...</td>
         <td>Vs is an ideal component, so changing R2 will not affect it. R1 is in series with iR23 so changing R2 will also not affect the power associated with it. R3 will slowly have a decrease in power as R2's resistance goes to 0, with R2 at zero coinciding with no power in R3. R2 will increase in power, because the current is going to increase with a decrease in resistance.</td>
         <td>So Vs is the same, R1 is the same, R2 is larger and R3 is smaller.</td>
         <td>none</td>
      </tr>
   </tbody>
</table>
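
The relationship between `context` and `hypothesis` in the table can be sketched in plain Python. This is an illustration only: the helper name `build_rows` is hypothetical, and it assumes the answer has already been split into sentences and annotated. The private dataset may have been assembled differently.

```python
def build_rows(sentences, labels, question, reference_answer):
    """Create one row per sentence; `context` holds all preceding sentences."""
    rows = []
    for i, (sentence, label) in enumerate(zip(sentences, labels)):
        rows.append({
            'id': i + 1,
            'question': question,
            'reference_answer': reference_answer,
            # Preceding sentences joined with a space (empty for the first sentence)
            'context': ' '.join(sentences[:i]),
            'hypothesis': sentence,
            'label': label,
        })
    return rows


sentences = [
    'Vs is an ideal component, so changing R2 will not affect it.',
    'R1 is in series with iR23 so changing R2 will also not affect the power associated with it.',
]
labels = ['none', 'SM']

rows = build_rows(sentences, labels, 'Resistors R1 and R2 are...', 'As the resistance of R2...')
for row in rows:
    print(row['id'], repr(row['context']), '->', row['label'])
```

Each row carries its full history, so a sentence whose misconception only becomes visible in light of an earlier sentence can still be classified on its own row.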

In [1]:
from datasets import DatasetDict

dataset = DatasetDict.load_from_disk('private_dataset')

In [2]:
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['id', 'question', 'reference_answer', 'context', 'hypothesis', 'label'],
        num_rows: 1275
    })
    eval: Dataset({
        features: ['id', 'question', 'reference_answer', 'context', 'hypothesis', 'label'],
        num_rows: 204
    })
    test: Dataset({
        features: ['id', 'question', 'reference_answer', 'context', 'hypothesis', 'label'],
        num_rows: 208
    })
})


# Input Engineering

We frame misconception detection as a Recognizing Textual Entailment (RTE) task, where the goal is to determine whether a premise entails a hypothesis. In our dataset, each example consists of a question, reference answer, context, and hypothesis. Here, the reference answer serves as the premise. In our previous research, we have shown that the question provides valuable information for the model, while the context is essential for detecting misconceptions that involve inter-sentence dependencies.

However, BERT-like LLMs are designed to process single or paired inputs, so we need a strategy to integrate all four components into one input pair. Simply combining the question, reference answer, and context into the premise is problematic, because incorrect information in the context could conflict with the reference answer and confuse the model. Therefore, we combine the context with the hypothesis while keeping them clearly separated: the two are concatenated with a single newline character, which appears in neither the context nor the hypothesis. The question and the reference answer are joined with a single space instead, so that the model sees them as one continuous text.

### Pre-defined Variables

In [1]:
model_name = 'FacebookAI/roberta-large'

### Tokenization

In [2]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)



In [3]:
def tokenization_function(examples):
    # Integrate components/columns into input pairs.
    # With batched=True each column is a list of values; otherwise it is a single string.
    if isinstance(examples['question'], list): # batched examples
        premises = [f'{a} {b}' for a, b in zip(examples['question'], examples['reference_answer'])]
        hypotheses = [f'{a}\n{b}' for a, b in zip(examples['context'], examples['hypothesis'])]
    else: # single example
        premises = f'{examples["question"]} {examples["reference_answer"]}'
        hypotheses = f'{examples["context"]}\n{examples["hypothesis"]}'
    
    # Tokenize the (premise, hypothesis) pairs
    return tokenizer(text = premises, text_pair = hypotheses)

In [4]:
dataset = dataset.map(tokenization_function, batched = True)
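
The pairing scheme can also be checked in plain Python, without a tokenizer. The following is a minimal sketch; `build_pair` is a hypothetical helper name and is not part of the notebook or of any library. It shows that, as long as neither field contains a newline, the context and the hypothesis remain recoverable by splitting on the last newline.

```python
def build_pair(example):
    """Return the (premise, hypothesis-side) strings for one example dict."""
    # Question and reference answer form one continuous premise
    premise = f"{example['question']} {example['reference_answer']}"
    # Context and hypothesis are kept apart by a single newline
    hypothesis_side = f"{example['context']}\n{example['hypothesis']}"
    return premise, hypothesis_side


example = {
    'question': 'Resistors R1 and R2 are...',
    'reference_answer': 'As the resistance of R2...',
    'context': 'Vs is an ideal component, so changing R2 will not affect it.',
    'hypothesis': 'R1 is in series with iR23 so changing R2 will also not affect the power associated with it.',
}

premise, hypothesis_side = build_pair(example)

# The newline separator makes the two parts unambiguous
context, hypothesis = hypothesis_side.rsplit('\n', 1)
print(repr(premise))
print(repr(context))
print(repr(hypothesis))
```

For the first sentence of an answer the context is empty, so the hypothesis side simply starts with the newline.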



# Adding Domain-specific Words to LLMs

Student answers in our dataset contain domain-specific terminology, abbreviations, acronyms, and nomenclature. Some of these words are unknown to models like RoBERTa. In such cases, the tokenizer either splits these words into subword pieces that do not reflect their meaning or, for tokenizers without a byte-level fallback, maps them to an unknown token (e.g., `[UNK]` in BERT, `<unk>` in RoBERTa), causing the model to misinterpret or ignore them. Depending on their frequency, such words can introduce noise into the model input and hurt performance. In this section, we demonstrate how to add domain-specific words to both the tokenizer and the model before fine-tuning so that these terms are tokenized and interpreted correctly. The process of extracting domain-specific words from text is not included in this example, as it is complex and, at present, no library provides out-of-the-box tools that would allow us to summarize it in a few lines of code.
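
Although the full extraction process is outside the scope of this notebook, one crude way to shortlist candidates is to look for words that a tokenizer breaks into several pieces. The sketch below is only a heuristic and not the procedure used in this research: `find_fragmented_words`, the `min_pieces` threshold, and the `toy_tokenize` stand-in are all hypothetical. In practice, a real tokenizer's `tokenize` method could be passed in place of the toy function.

```python
from collections import Counter


def find_fragmented_words(texts, tokenize, min_pieces=2):
    """Count words that `tokenize` splits into at least `min_pieces` pieces."""
    counts = Counter()
    for text in texts:
        for word in text.split():
            # Drop surrounding punctuation but keep inner characters
            word = word.strip('.,;:!?()"\'')
            if word and len(tokenize(word)) >= min_pieces:
                counts[word] += 1
    return counts


def toy_tokenize(word):
    """Toy stand-in for a subword tokenizer (assumption, for illustration only).

    Words in a tiny 'known' vocabulary stay whole; everything else is cut
    into two-character chunks.
    """
    known = {'the', 'states', 'sum', 'of', 'in', 'a', 'loop', 'is', 'zero'}
    if word.lower() in known:
        return [word]
    return [word[i:i + 2] for i in range(0, len(word), 2)]


texts = ["Kirchhoff's KVL states the sum of voltages in a loop is zero."]
candidates = find_fragmented_words(texts, toy_tokenize)
print(sorted(candidates))
```

The shortlist still needs manual review: ordinary words that merely happen to split (such as "voltages" here) are flagged alongside genuine domain terms.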

### Pre-defined Variables

In [1]:
from tokenizers import AddedToken

In [2]:
model_name = 'FacebookAI/roberta-large'

# Domain-specific words to add to the tokenizer as whole tokens
domain_specific_tokens = [
    'KVL',
    AddedToken("Kirchhoff", single_word=True)
]

### Extending Tokenizer

In [3]:
from transformers import AutoTokenizer

In [4]:
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)



In [5]:
# Tokenization before adding domain-specific tokens
print(tokenizer.tokenize("Kirchhoff's KVL states the sum of voltages in a loop is zero."))

['K', 'ir', 'ch', 'hoff', "'s", 'ĠK', 'VL', 'Ġstates', 'Ġthe', 'Ġsum', 'Ġof', 'Ġvolt', 'ages', 'Ġin', 'Ġa', 'Ġloop', 'Ġis', 'Ġzero', '.']


In [6]:
# Log the number of domain-specific tokens in the list
print(f'Number of domain-specific tokens: {len(domain_specific_tokens):,}')

# Add domain-specific tokens to the tokenizer's vocabulary.
# Tokens that do not already exist in the current vocabulary will be added.
# New tokens will be appended to the end of the vocabulary but they will be
# kept isolated from the original vocabulary.
num_added_tokens = tokenizer.add_tokens(domain_specific_tokens)
print(f'Number of added tokens: {num_added_tokens:,}')

# Log original vocabulary size
print(f'Original vocabulary size: {tokenizer.vocab_size:,}')

# Log vocabulary size after adding domain-specific words
print(f'Extended vocabulary size: {len(tokenizer):,}')

Number of domain-specific tokens: 2
Number of added tokens: 2
Original vocabulary size: 50,265
Extended vocabulary size: 50,267

In [7]:
# Tokenization after adding domain-specific tokens
print(tokenizer.tokenize("Kirchhoff's KVL states the sum of voltages in a loop is zero."))

['Kirchhoff', "'s", 'Ġ', 'KVL', 'Ġstates', 'Ġthe', 'Ġsum', 'Ġof', 'Ġvolt', 'ages', 'Ġin', 'Ġa', 'Ġloop', 'Ġis', 'Ġzero', '.']


### Extending Model

In [8]:
from transformers import AutoModelForSequenceClassification

# Load model
model = AutoModelForSequenceClassification.from_pretrained(model_name)



In [9]:
# Resize the token embedding matrix to accommodate new tokens.
# The model will initialize the new embeddings based on existing embeddings.
# The model will learn about the new tokens during fine-tuning.
model.resize_token_embeddings(len(tokenizer))

The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`


Embedding(50267, 1024, padding_idx=1)
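
The warning above says that the new rows are drawn from a multivariate normal distribution with the mean and covariance of the old embeddings. The idea can be illustrated with a dependency-free toy sketch. The helper names (`mean_and_covariance`, `cholesky`, `sample_new_rows`) are hypothetical, the "embeddings" are a tiny made-up matrix, and the library's actual implementation operates on tensors and may differ in details such as how the covariance is scaled.

```python
import random


def mean_and_covariance(rows):
    """Column-wise mean and sample covariance matrix of a list of vectors."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    cov = [[sum((r[i] - mean[i]) * (r[j] - mean[j]) for r in rows) / (n - 1)
            for j in range(d)] for i in range(d)]
    return mean, cov


def cholesky(matrix, jitter=1e-9):
    """Lower-triangular factor L with L @ L.T approximately equal to matrix."""
    d = len(matrix)
    L = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                # Jitter guards against tiny negative values from rounding
                L[i][j] = max(matrix[i][i] + jitter - s, jitter) ** 0.5
            else:
                L[i][j] = (matrix[i][j] - s) / L[j][j]
    return L


def sample_new_rows(old_rows, num_new, seed=0):
    """Draw `num_new` vectors from N(mean, cov) estimated on `old_rows`."""
    rng = random.Random(seed)
    mean, cov = mean_and_covariance(old_rows)
    L = cholesky(cov)
    d = len(mean)
    new_rows = []
    for _ in range(num_new):
        z = [rng.gauss(0.0, 1.0) for _ in range(d)]
        # x = mean + L z has the desired mean and covariance
        new_rows.append([mean[i] + sum(L[i][k] * z[k] for k in range(i + 1))
                         for i in range(d)])
    return new_rows


# Toy "embedding matrix": 5 existing tokens, 3 dimensions
old = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4], [0.3, 0.1, -0.2],
       [-0.2, 0.0, 0.1], [0.05, 0.3, 0.0]]

# Two new tokens are appended after the existing rows
extended = old + sample_new_rows(old, num_new=2)
print(len(extended), len(extended[0]))
```

The existing rows are left untouched and the new rows start out statistically similar to them, which gives fine-tuning a reasonable starting point for the added tokens.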

### Save Extended Tokenizer and Model

In [10]:
# Directory to save the tokenizer and model
output_dir = f'{model_name.split("/")[-1]}-extended-vocab'

In [11]:
# Save Tokenizer
# New tokens will be saved in "added_tokens.json"
tokenizer.save_pretrained(output_dir)

('roberta-large-extended-vocab/tokenizer_config.json',
 'roberta-large-extended-vocab/special_tokens_map.json',
 'roberta-large-extended-vocab/vocab.json',
 'roberta-large-extended-vocab/merges.txt',
 'roberta-large-extended-vocab/added_tokens.json',
 'roberta-large-extended-vocab/tokenizer.json')

In [12]:
# Save Model
model.save_pretrained(output_dir)