## Installation

In [1]:
# Install required packages
!pip install transformers datasets peft wandb evaluate


## Imports

In [15]:
import os
import numpy as np
import pandas as pd
import torch
import random
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    EarlyStoppingCallback
)
from peft import (
    get_peft_model,
    LoraConfig,
    TaskType,
    PeftModel,
    PeftConfig
)
import wandb
import evaluate
from sklearn.metrics import mean_squared_error, mean_absolute_error

## Dataset

In [3]:
# Load the dataset
dataset = load_dataset("Short-Answer-Feedback/saf_communication_networks_english")

# Let's look at an example from the training set
print(dataset['train'][0])

# Let's get some basic statistics
print(f"Train set size: {len(dataset['train'])}")
print(f"Validation set size: {len(dataset['validation'])}")
print(f"Test set size: {len(dataset['test_unseen_answers'])}")

Generating train split: 1700 examples
Generating validation split: 427 examples
Generating test_unseen_answers split: 375 examples
Generating test_unseen_questions split: 479 examples

{'id': '6a31b925382d4e31a417cc78399dbff2', 'question': 'What is "frame bursting"? Also, give 1 advantage and disadvantage compared to the carrier extension.', 'reference_answer': 'Frame bursting reduces the overhead for transmitting small frames by concatenating a sequence of multiple frames in one single transmission, without ever releasing control of the channel.\nAdvantage :it is more efficient than carrier extension as single frames not filled up with garbage.\nDisadvantage :need frames waiting for transmission or buffering and delay of frames', 'provided_answer': 'Frame bursting is a feature for the IEEE 802.3z standard.\nAdvantage: better efficiency\nDisadvantage: station has to wait for enough data to send so frames need to wait (n-to-n delay)', 'answer_feedback': 'The response correctly answers the advantage and disadvantage part of the question. However, the definition is missing in the answer. The correct definition is that frame bursting is used to concatenate a sequence of 

### Pre-processing

In [4]:
# Load the tokenizer
model_name = "roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# roberta-base already defines a pad token (<pad>), so no pad-token override is needed

In [5]:
# Define the preprocessing function
def preprocess_function(examples):
    # Combine question and answer into a single text
    inputs = [
        f"Question: {q}\nStudent Answer: {a}"
        for q, a in zip(examples["question"], examples["provided_answer"])
    ]

    # Tokenize the inputs; Dataset.map stores plain lists, so no tensor conversion is needed here
    model_inputs = tokenizer(
        inputs,
        padding="max_length",
        truncation=True,
        max_length=512,
    )

    # Store scores as floats so the single-output head is trained as a regressor
    model_inputs["labels"] = [float(score) for score in examples["score"]]

    return model_inputs

In [6]:
# Apply preprocessing to the datasets
tokenized_train = dataset["train"].map(preprocess_function, batched=True, remove_columns=dataset["train"].column_names)
tokenized_val = dataset["validation"].map(preprocess_function, batched=True, remove_columns=dataset["validation"].column_names)
tokenized_test = dataset["test_unseen_answers"].map(preprocess_function, batched=True, remove_columns=dataset["test_unseen_answers"].column_names)

# Set the format for PyTorch
tokenized_train.set_format("torch")
tokenized_val.set_format("torch")
tokenized_test.set_format("torch")

## Evaluation metric

In [7]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = predictions.flatten()

    # Calculate metrics
    rmse = np.sqrt(mean_squared_error(labels, predictions))
    mae = mean_absolute_error(labels, predictions)

    # Calculate Pearson correlation
    pearson_corr = np.corrcoef(predictions, labels)[0, 1]

    return {
        "rmse": rmse,
        "mae": mae,
        "pearson_correlation": pearson_corr
    }
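
To make the three metrics concrete, here is a minimal dependency-free sketch that spells out the same formulas on a handful of made-up scores. The function name `regression_metrics` and the toy values are hypothetical and are not part of the notebook or the dataset.

```python
import math

def regression_metrics(predictions, labels):
    """Return RMSE, MAE and Pearson correlation for two equal-length sequences."""
    n = len(labels)
    errors = [p - y for p, y in zip(predictions, labels)]
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n

    # Pearson correlation: covariance divided by the product of standard deviations
    mean_p = sum(predictions) / n
    mean_y = sum(labels) / n
    cov = sum((p - mean_p) * (y - mean_y) for p, y in zip(predictions, labels))
    var_p = sum((p - mean_p) ** 2 for p in predictions)
    var_y = sum((y - mean_y) ** 2 for y in labels)
    pearson = cov / math.sqrt(var_p * var_y)

    return {"rmse": rmse, "mae": mae, "pearson_correlation": pearson}

# Hypothetical scores, not taken from the dataset
toy_labels = [0.0, 0.5, 1.0, 1.0]
toy_predictions = [0.1, 0.4, 0.8, 1.1]
toy_metrics = regression_metrics(toy_predictions, toy_labels)
print(toy_metrics)
```

RMSE penalizes the single larger miss (0.8 against 1.0) more heavily than MAE does, while the Pearson correlation only measures whether predictions and labels rise and fall together.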

## Model

In [8]:
# Initialize W&B
wandb.login()
wandb.init()

# Initialize the model
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)

# Define LoRA configuration
peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    inference_mode=False,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["query","key","value"]  # Attention projection layers in RoBERTa
)

# Get the PEFT model
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter:

 ··········


[34m[1mwandb[0m: No netrc file found, creating one.
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mmani696701[0m ([33mmani696701-northeastern-university[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/499M [00:00<?, ?B/s]

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


trainable params: 1,033,729 || all params: 125,680,130 || trainable%: 0.8225
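
The trainable-parameter count can be reproduced by hand. The sketch below assumes the standard roberta-base dimensions (12 encoder layers, hidden size 768) and assumes that the classification head is trained in full alongside the LoRA adapters, which is consistent with the total reported above. The variable names are illustrative only.

```python
# Assumed roberta-base dimensions: 12 encoder layers, hidden size 768
num_layers = 12
hidden_size = 768
lora_rank = 8                  # r in LoraConfig
target_modules_per_layer = 3   # query, key, value

# Each adapted projection gains two low-rank matrices: A (r x in) and B (out x r)
lora_params_per_module = lora_rank * (hidden_size + hidden_size)
lora_params = num_layers * target_modules_per_layer * lora_params_per_module

# Classification head: dense (768 -> 768) and out_proj (768 -> 1), both with biases
head_params = (hidden_size * hidden_size + hidden_size) + (hidden_size * 1 + 1)

trainable_params = lora_params + head_params
all_params = 125_680_130       # total reported by print_trainable_parameters()
trainable_percent = round(100 * trainable_params / all_params, 4)

print(f"LoRA adapters: {lora_params:,}")
print(f"Classifier head: {head_params:,}")
print(f"Trainable: {trainable_params:,} ({trainable_percent}%)")
```

Under these assumptions the adapters account for 442,368 parameters and the head for 591,361, so more than half of the trainable weights sit in the regression head rather than in the LoRA matrices.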


## PEFT Fine-tuning using LoRA

In [9]:
# Define training arguments
training_args = TrainingArguments(
    output_dir="./results/roberta-student-answer-scoring",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=10,
    weight_decay=0.01,
    eval_strategy="epoch",  # named evaluation_strategy in transformers releases before 4.41
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False,

    # Logging configuration
    logging_dir="./logs",
    logging_strategy="steps",
    logging_steps=50,  # Log every 50 steps
    logging_first_step=True,

    report_to="wandb",
    metric_for_best_model="rmse",
    greater_is_better=False,
)

# Create Trainer instance
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)]
)

# Train the model
trainer.train()

# Stop W&B
wandb.finish()

No label_names provided for model class `PeftModelForSequenceClassification`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


| Epoch | Training Loss | Validation Loss | RMSE | MAE | Pearson Correlation |
|---|---|---|---|---|---|
| 1 | 0.5491 | 0.506507 | 0.711693 | 0.452876 | 0.503121 |
| 2 | 0.2814 | 0.219704 | 0.468726 | 0.347058 | 0.803265 |
| 3 | 0.2257 | 0.207204 | 0.455197 | 0.28671 | 0.820684 |
| 4 | 0.1762 | 0.210482 | 0.458784 | 0.278734 | 0.814384 |
| 5 | 0.1588 | 0.223268 | 0.472513 | 0.277825 | 0.820678 |
| 6 | 0.127 | 0.264609 | 0.514401 | 0.300041 | 0.82203 |


W&B run summary (final logged values):

| Metric | Value |
|---|---|
| eval/loss | 0.26461 |
| eval/mae | 0.30004 |
| eval/pearson_correlation | 0.82203 |
| eval/rmse | 0.5144 |
| eval/runtime | 6.3517 |
| eval/samples_per_second | 67.226 |
| eval/steps_per_second | 8.502 |
| total_flos | 2716099946496000.0 |
| train/epoch | 6.0 |
| train/global_step | 1278.0 |
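
The run stopped at epoch 6 of the 10 configured epochs. The sketch below replays the patience rule on the validation RMSE values from the training log to show why. It is a simplified model of the callback's behavior, not its actual implementation, and the step count assumes a single device with no gradient accumulation.

```python
import math

# Validation RMSE per epoch, copied from the training log
rmse_per_epoch = [0.711693, 0.468726, 0.455197, 0.458784, 0.472513, 0.514401]
patience = 3  # early_stopping_patience passed to EarlyStoppingCallback

best_rmse = None
best_epoch = None
epochs_without_improvement = 0
stopped_at = None

for epoch, rmse in enumerate(rmse_per_epoch, start=1):
    if best_rmse is None or rmse < best_rmse:
        # Lower RMSE is better (greater_is_better=False), so reset the counter
        best_rmse, best_epoch = rmse, epoch
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1
    if epochs_without_improvement >= patience:
        stopped_at = epoch
        break

# 1700 training examples in batches of 8 (assumes one device, no accumulation)
steps_per_epoch = math.ceil(1700 / 8)
total_steps = steps_per_epoch * stopped_at

print(f"best epoch: {best_epoch}, stopped at epoch: {stopped_at}, steps: {total_steps}")
```

Epoch 3 has the lowest validation RMSE, and the three following epochs fail to improve on it, which exhausts the patience budget at epoch 6. Because `load_best_model_at_end=True` is set, the weights used afterwards should be those of the best checkpoint rather than those of the final epoch.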


## Saving model adapters

In [10]:
# Save the model and adapters
model_save_path = "./results/roberta-student-answer-scoring/final"
model.save_pretrained(model_save_path)
tokenizer.save_pretrained(model_save_path)

('./results/roberta-student-answer-scoring/final/tokenizer_config.json',
 './results/roberta-student-answer-scoring/final/special_tokens_map.json',
 './results/roberta-student-answer-scoring/final/vocab.json',
 './results/roberta-student-answer-scoring/final/merges.txt',
 './results/roberta-student-answer-scoring/final/added_tokens.json',
 './results/roberta-student-answer-scoring/final/tokenizer.json')

## Evaluation

In [16]:
# Evaluate on the test set
wandb.init()
test_results = trainer.evaluate(tokenized_test)
print(f"Test results: {test_results}")

# Load the model and configuration
peft_model_id = model_save_path
config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    config.base_model_name_or_path,
    num_labels=1
)
model = PeftModel.from_pretrained(model, peft_model_id)
model.eval()  # switch off dropout so predictions are deterministic

# Prediction function
def predict_score(question, answer):
    input_text = f"Question: {question}\nStudent Answer: {answer}"
    inputs = tokenizer(
        input_text,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=512
    )

    with torch.no_grad():
        outputs = model(**inputs)

    predicted_score = outputs.logits.item()
    return predicted_score

# Pick 5 random test examples on each run
num_examples = 5
indices = random.sample(range(len(dataset["test_unseen_answers"])), num_examples)

for i in range(num_examples):
    idx = indices[i]
    example = dataset["test_unseen_answers"][idx]
    question = example["question"]
    answer = example["provided_answer"]
    true_score = example["score"]

    predicted_score = predict_score(question, answer)

    print(f"Example {i+1}:")
    print(f"Question: {question}")
    print(f"Answer: {answer}")
    print(f"True Score: {true_score}")
    print(f"Predicted Score: {predicted_score}")
    print("-" * 50)



Test results: {'eval_loss': 0.140742689371109, 'eval_rmse': 0.37515688634371225, 'eval_mae': 0.24730032682418823, 'eval_pearson_correlation': 0.8655981542766457, 'eval_runtime': 5.1141, 'eval_samples_per_second': 73.326, 'eval_steps_per_second': 9.19, 'epoch': 6.0}
Example 1:
Question: In the lecture you have learned about congestion control with TCP. Name the 2 phases of congestion control and explain how the Congestion Window (cwnd) and the Slow Start Threshold (ss_thresh) change in each phase (after initialization, where cwnd = 1 and ss_thresh = advertised window size) in 1-4 sentences .
Answer: The two phases of congestion control are a slow start and congestion avoidance. Lets assume the slow start threshold is X. After initialization, the "slow start" begins with checking if one segment arrives at the sender (receiving ACK) and increases the number of segments until the procedure throws an error. The new ss_thresh is half of the last segments that arrived successfully (cwnd/2). T