# Lab 8: Finetuning 

### COSC 410B: Spring 2025, Colgate University

In this lab you will finetune a pretrained language model, DistilBERT, on Natural Language Inference. 

#### Goals for this lab:

* Work with Hugging Face datasets and format them appropriately for a given task
* Work with pre-trained language models (load, finetune, evaluate)

Note that you will train the model on only a small set of examples, because the goal is to help you figure out how to set up a finetuning pipeline for your final projects, not to train a good model.

## Part 1: Understand the task

You will be working with the Natural Language Inference task using the [MNLI dataset](https://huggingface.co/datasets/SetFit/mnli).

Briefly describe: 
* What is the input to this task?
* What is the output?
* Do you need to preprocess the input in some way before you can pass it into the model?
* What data and task was [DistilBERT](https://arxiv.org/pdf/1910.01108) pre-trained on? 
* What does it mean to finetune this model on NLI? 

**WRITE YOUR ANSWERS HERE**

## Part 2: Load and preprocess the data

In order to train the model, you need to get the data in the following format. 
```
DatasetDict({
    train: Dataset({
        features: ['id', 'text', 'label'],
        num_rows: 100
    })
    test: Dataset({
        features: ['id', 'text', 'label'],
        num_rows: 100
    })
})
```

In this part, load the [MNLI dataset](https://huggingface.co/datasets/SetFit/mnli). Take only the first 100 examples from each of train and test for the purposes of this lab (training on the full dataset would take extremely long). Then preprocess the data so that it can be passed into the model. A helper function is provided, and the model name is pre-specified.

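```python
from transformers import AutoTokenizer

def tokenize_data(dataset, modelname):
    """Tokenize the 'text' column of a dataset with the tokenizer for modelname."""
    tokenizer = AutoTokenizer.from_pretrained(modelname)

    def tokenize_function(examples):
        # Pad or truncate every example to the model's maximum input length
        return tokenizer(examples["text"], padding="max_length", truncation=True)

    return dataset.map(tokenize_function, batched=True)
```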

```python
model_name = "distilbert-base-cased"
```

## Part 3: Evaluate model before finetuning

In this part you will load the model and evaluate it before finetuning. You will use the [AutoModelForSequenceClassification](https://huggingface.co/transformers/v3.0.2/model_doc/auto.html#automodelforsequenceclassification). A helper function is provided. 
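To see what a sequence classification model returns, the sketch below builds a deliberately tiny, randomly initialized DistilBERT from a config and runs one forward pass. It is not the pretrained checkpoint and needs no download. The sizes (`vocab_size=100`, `dim=32`, and so on) and the name `toy_model` are arbitrary assumptions chosen to keep it fast. `num_labels=3` matches the three MNLI classes.

```python
import torch
from transformers import AutoModelForSequenceClassification, DistilBertConfig

# Tiny random model for illustration only (NOT distilbert-base-cased).
config = DistilBertConfig(
    vocab_size=100,
    max_position_embeddings=16,
    n_layers=1,
    n_heads=2,
    dim=32,
    hidden_dim=64,
    num_labels=3,
)
toy_model = AutoModelForSequenceClassification.from_config(config)
toy_model.eval()

# A fake batch: 4 sequences of 16 token ids, with no padding.
input_ids = torch.randint(0, 100, (4, 16))
attention_mask = torch.ones_like(input_ids)

with torch.no_grad():
    outputs = toy_model(input_ids=input_ids, attention_mask=attention_mask)

# One row of logits per example, one column per label.
print(outputs.logits.shape)
toy_predictions = torch.argmax(outputs.logits, dim=-1)
```

With the real checkpoint, the analogous call would be `AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)`, which downloads the pretrained weights.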

```python
import torch

def get_predictions(model, data):
    """Return (predicted labels, true labels) for a classification model on tokenized data."""
    model.eval()
    model.to('cpu')
    # Disable gradient tracking: evaluation needs no backward pass
    with torch.no_grad():
        outputs = model(input_ids=torch.tensor(data['input_ids']),
                        attention_mask=torch.tensor(data['attention_mask']))
    predictions = torch.argmax(outputs.logits, dim=-1)
    return predictions, torch.tensor(data['label'])
```
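The helper returns two tensors, so one simple way to score the model is accuracy. The sketch below is one possible metric, not a required one. The function name `accuracy` and the example tensors are made up for illustration.

```python
import torch


def accuracy(predictions, labels):
    """Fraction of examples whose predicted class equals the true class."""
    return (predictions == labels).float().mean().item()


# Hand-made example: positions 0 and 2 match, positions 1 and 3 do not.
example_predictions = torch.tensor([0, 2, 1, 1])
example_labels = torch.tensor([0, 1, 1, 2])
print(accuracy(example_predictions, example_labels))  # 0.5
```

In the lab, the two arguments would be the pair returned by `get_predictions`.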

## Part 4: Finetune the model 

Finetune the model and then evaluate the model after finetuning. A helper function is given. 

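```python
from transformers import Trainer, TrainingArguments

def train_model(model, train_data, num_epochs):
    """Finetune a classification model on train_data['train'], with train_data['test'] as the eval set."""
    training_args = TrainingArguments(output_dir="test_trainer",
                                      num_train_epochs=num_epochs)

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_data['train'],
        eval_dataset=train_data['test']
    )
    # Nothing is returned: the model passed in is updated in place
    trainer.train()
```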

## Part 5: Discussion

Answer the following questions:

* What is being initialized when you load the model?
* What weights are being learned during finetuning?
* Why might pre-training help a model on this task? 