# Template for HW4
Use this file as a template. In your submission, save all computation results and outputs in the file before you submit it, so we can see them when we grade.

You'll be using the tutorial at https://medium.com/data-science-in-your-pocket/modernbert-for-text-classification-04d7fba42dae as a starting point.
Then you'll modify it to use the dataset "sentence-transformers/all-nli" instead. To do this you'll need to really understand how the code works: look up the documentation for the individual functions, and read about the dataset to find out its structure.

In [17]:
#load modernbert (base) and your dataset here. Follow the tutorial (with any necessary modifications) at: https://medium.com/data-science-in-your-pocket/modernbert-for-text-classification-04d7fba42dae
from datasets import load_dataset
from datasets.arrow_dataset import Dataset
from datasets.dataset_dict import DatasetDict, IterableDatasetDict
from datasets.iterable_dataset import IterableDataset
import numpy as np

# Dataset id from huggingface.co/datasets
# dataset_id = "argilla/synthetic-domain-text-classification"
dataset_id = "sentence-transformers/all-nli"

# Load raw dataset. Only the first 3% of the 'pair-class' train split is loaded,
# so that fine-tuning is more computationally feasible.
# train_dataset = load_dataset(dataset_id, split='train')
train_dataset = load_dataset(dataset_id, 'pair-class', split='train[:3%]')


split_dataset = train_dataset.train_test_split(test_size=0.1)
# print(split_dataset['train'][5:7])
print(split_dataset)

# create a new column in split_dataset which concatenates the "premise" and "hypothesis" columns
split_dataset = split_dataset.map(lambda x: {"text": x["premise"] + " <s> " + x["hypothesis"]}, remove_columns=["premise", "hypothesis"])
# print(split_dataset['train'][5:7])
# print(split_dataset)

from transformers import AutoTokenizer

# Model id to load the tokenizer
model_id = "answerdotai/ModernBERT-base"

# Load Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Tokenize helper function
def tokenize(batch):
    # Pad/truncate every example to 500 tokens (rather than tokenizer.model_max_length)
    return tokenizer(batch['text'], truncation=True, padding="max_length", return_tensors="pt", max_length=500)


# Tokenize dataset
if "label" in split_dataset["train"].features.keys():
    split_dataset =  split_dataset.rename_column("label", "labels") # to match Trainer
tokenized_dataset = split_dataset.map(tokenize, batched=True, remove_columns=["text"])

from transformers import AutoModelForSequenceClassification

# Model id to load the model (same checkpoint as the tokenizer)
model_id = "answerdotai/ModernBERT-base"

# Prepare model labels - useful for inference
labels = tokenized_dataset["train"].features["labels"].names
num_labels = len(labels)
label2id, id2label = dict(), dict()
for i, label in enumerate(labels):
    label2id[label] = str(i)
    id2label[str(i)] = label

# Download the model from huggingface.co/models
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, num_labels=num_labels, label2id=label2id, id2label=id2label,
)

DatasetDict({
    train: Dataset({
        features: ['premise', 'hypothesis', 'label'],
        num_rows: 25435
    })
    test: Dataset({
        features: ['premise', 'hypothesis', 'label'],
        num_rows: 2827
    })
})



Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at answerdotai/ModernBERT-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [20]:
# train your model here, using fine-tuning. Print out training data and save it as part of this notebook file. Again follow the tutorial (with any necessary modifications) at: https://medium.com/data-science-in-your-pocket/modernbert-for-text-classification-04d7fba42dae

from transformers import Trainer, TrainingArguments

# Define training args
training_args = TrainingArguments(
    output_dir= "ModernBERT-domain-classifier",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=2,
    learning_rate=5e-5,
    num_train_epochs=2,
    bf16=True, # bfloat16 training
    optim="adamw_torch_fused", # improved optimizer
    # logging & evaluation strategies
    logging_strategy="steps",
    eval_strategy="steps",
    eval_steps=500,
    save_strategy="steps",
    save_total_limit=2,
    load_best_model_at_end=True,
)

# Create a Trainer instance
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"]  # note: the test set doubles as the validation set here, which is normally a no-no!
)
trainer.train()

Step,Training Loss,Validation Loss
500,1.1109,0.819322
1000,0.7777,0.822206
1500,0.6973,0.709097
2000,0.6663,0.576347
2500,0.6534,0.558026
3000,0.6442,0.616792
3500,0.5848,0.674804
4000,0.6046,0.557949
4500,0.6149,0.53745
5000,0.6157,0.585584


TrainOutput(global_step=12718, training_loss=0.5062742627703707, metrics={'train_runtime': 9977.2292, 'train_samples_per_second': 5.099, 'train_steps_per_second': 1.275, 'total_flos': 1.692819511767e+16, 'train_loss': 0.5062742627703707, 'epoch': 2.0})

In [None]:
import numpy as np

# test your trained model on the test set here. Report statistics and save the test data as part of this notebook file. 
predictions = trainer.predict(tokenized_dataset["test"])

# Process the prediction results (predictions, label_ids, metrics)
predicted_labels = np.argmax(predictions.predictions, axis=1)
predicted_label = id2label[str(predicted_labels[0])]


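# predicted_labels[0] is the prediction for the first *test* example,
# so the example shown must come from the test split as well (not train)
example_data = [split_dataset['test'][0]]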
print("Example Input:", example_data[0]['text'])
print(f"Predicted Label: {predicted_label}")
print(f"Actual Label: {id2label[str(example_data[0]['labels'])]}")



Input: A woman and a girl are playing in a field of leaves <s> A mother and daughter are playing in a field.
Predicted Label: neutral
Actual Label: neutral


### Question 1
Analyze the true positive / false positive / true negative / false negative rates overall, treating every answer as either wrong or right (e.g., if the correct answer was 0, then only 0 is correct and 1 or 2 are both wrong). What results do you get? What does this tell you?
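
As a starting point, here is a minimal sketch of one way to pool right/wrong counts across all classes. It assumes integer label ids like those in `predicted_labels` and `predictions.label_ids`; the helper names `confusion_counts` and `pooled_counts` and the toy label lists are made up for illustration and are not results from the model above. It uses plain Python; `sklearn.metrics.confusion_matrix` would be a reasonable alternative for building the matrix.

```python
def confusion_counts(y_true, y_pred, num_classes):
    """Build a num_classes x num_classes matrix: rows are true ids, columns are predicted ids."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[int(t)][int(p)] += 1
    return matrix


def pooled_counts(matrix):
    """Sum one-vs-rest TP/FP/FN/TN over all classes (micro-style pooling)."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    tp = sum(matrix[k][k] for k in range(n))
    fn = total - tp  # each wrong answer is a miss for its true class
    fp = total - tp  # ...and a false alarm for the class that was predicted
    tn = n * total - tp - fp - fn  # every example is scored once per class
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


# Toy labels for illustration only.
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]

toy_matrix = confusion_counts(toy_true, toy_pred, num_classes=3)
toy_pooled = pooled_counts(toy_matrix)
toy_accuracy = toy_pooled["TP"] / len(toy_true)

print(toy_matrix)
print(toy_pooled, f"accuracy={toy_accuracy:.3f}")
```

Note that with exactly one label per example, the pooled FP and FN counts are always equal, which is one reason the pooled view hides differences between categories.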

-------

**answer this question here. Add additional code blocks to show your work and/or display figures/charts.**

In [None]:
# any code used to answer Question 1 goes here

### Question 2
Now analyze the TP/FP/TN/FN rates *for each category* (contradiction, entailment, neutral). What do you learn? What does this tell you that you couldn't tell from the answer to Question 1?
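
A per-category breakdown can be read off a confusion matrix one class at a time. The sketch below assumes a matrix whose rows are true ids and whose columns are predicted ids; `per_class_counts`, the toy matrix, and the placeholder class names are hypothetical. In the notebook, the real names would come from the `labels` list defined earlier.

```python
def per_class_counts(matrix, class_names):
    """One-vs-rest TP/FP/FN/TN plus precision and recall for each class."""
    total = sum(sum(row) for row in matrix)
    report = {}
    for k, name in enumerate(class_names):
        tp = matrix[k][k]
        fn = sum(matrix[k]) - tp                 # true class k, predicted something else
        fp = sum(row[k] for row in matrix) - tp  # predicted k, true class was something else
        tn = total - tp - fn - fp
        report[name] = {
            "TP": tp, "FP": fp, "FN": fn, "TN": tn,
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall": tp / (tp + fn) if tp + fn else 0.0,
        }
    return report


# Toy matrix and placeholder names for illustration only.
toy_matrix = [[1, 1, 0], [0, 2, 0], [1, 0, 1]]
toy_names = ["class_a", "class_b", "class_c"]

toy_report = per_class_counts(toy_matrix, toy_names)
for name, stats in toy_report.items():
    print(name, stats)
```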

-------

**answer this question here. Add additional code blocks to show your work and/or display figures/charts.**

In [None]:
# any code used to answer Question 2 goes here

### Question 3
Analyze specific errors. For example, look at questions that had (x) as a correct answer but the model guessed (y) instead. Look at specific cases. What do you notice? Come up with at least 4 hypotheses about what kinds of problems it's getting right and wrong.
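
One possible way to pull out specific cases is to count each (true, predicted) pair and then list the texts behind the pair you want to study. This is only a sketch: `error_pairs`, `examples_for_pair`, and the toy texts and labels are hypothetical. In the notebook, the texts could come from `split_dataset['test']['text']` and the ids from `predictions.label_ids` and `predicted_labels`.

```python
from collections import Counter


def error_pairs(y_true, y_pred):
    """Count how often each (true id, predicted id) mistake occurs."""
    return Counter(
        (int(t), int(p)) for t, p in zip(y_true, y_pred) if int(t) != int(p)
    )


def examples_for_pair(texts, y_true, y_pred, true_id, pred_id, limit=5):
    """Return up to `limit` texts whose true id is true_id but were predicted as pred_id."""
    hits = [
        text
        for text, t, p in zip(texts, y_true, y_pred)
        if int(t) == true_id and int(p) == pred_id
    ]
    return hits[:limit]


# Toy data for illustration only.
toy_texts = ["pair one", "pair two", "pair three", "pair four"]
toy_true = [0, 2, 2, 1]
toy_pred = [0, 1, 1, 1]

toy_errors = error_pairs(toy_true, toy_pred)
toy_hits = examples_for_pair(toy_texts, toy_true, toy_pred, true_id=2, pred_id=1)

print(toy_errors.most_common())
print(toy_hits)
```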

-------

**answer this question here. Add additional code blocks to show your work and/or display figures/charts.**

In [None]:
# any code used to answer Question 3 goes here

### Question 4

Given your hypotheses from Question 3, create a few test cases manually and see whether they confirm or refute your hypotheses.
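
Manual test cases need to be formatted the same way as the training text, i.e. premise and hypothesis joined with " <s> ". The sketch below shows one way to organize such cases. `format_pair`, `run_cases`, and `stub_classify` are hypothetical helpers, the sentences are made up, and the expected ids are arbitrary placeholders (check `id2label` for the real mapping). The stub stands in for the fine-tuned model so the example runs on its own.

```python
SEPARATOR = " <s> "  # same separator used when the "text" column was built


def format_pair(premise, hypothesis):
    """Join a premise and a hypothesis the way the training text was built."""
    return premise + SEPARATOR + hypothesis


def run_cases(cases, classify):
    """cases: (premise, hypothesis, expected_id) tuples; classify: list of texts -> list of ids."""
    texts = [format_pair(premise, hypothesis) for premise, hypothesis, _ in cases]
    predicted = classify(texts)
    return [
        {"text": text, "expected": expected, "predicted": int(pred),
         "correct": int(pred) == expected}
        for text, (_, _, expected), pred in zip(texts, cases, predicted)
    ]


def stub_classify(texts):
    """Placeholder classifier that always answers 0; swap in the real model."""
    return [0 for _ in texts]


# Made-up cases; the expected ids are placeholders, not the dataset's real mapping.
toy_cases = [
    ("A dog runs on the beach.", "An animal is outside.", 0),
    ("A dog runs on the beach.", "A cat sleeps indoors.", 1),
]

toy_results = run_cases(toy_cases, stub_classify)
for row in toy_results:
    print(row)
```

To test the fine-tuned model instead of the stub, `classify` could tokenize the formatted texts with `tokenizer`, pass them through `model`, and take the argmax of the returned logits.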

-------

**answer this question here. Add additional code blocks to show your work and/or display figures/charts.**

In [None]:
# any code used to answer Question 4 goes here