In this notebook, I fine-tune the first of my three models. This model is fine-tuned only on the persuasive essay sets (1 and 2).

I begin by establishing a connection with the Hugging Face Hub and installing necessary libraries. I then import different classes and functions for machine learning libraries that I'll be using. Note, I import AutoModelForSequenceClassification since the task at hand fundamentally involves predicting a label (i.e. score) for sequences of text.

In [None]:
from huggingface_hub import notebook_login

notebook_login()



In [None]:
!pip install transformers
!pip install accelerate
!pip install peft
!pip install datasets
!pip install bitsandbytes



In [None]:
import torch
from transformers import (AutoTokenizer,
                          AutoModelForSequenceClassification,
                          BitsAndBytesConfig,
                          Trainer,
                          TrainingArguments)
from datasets import load_dataset
from peft import (LoraConfig,
                  PeftConfig,
                  PeftModel,
                  get_peft_model,
                  prepare_model_for_kbit_training)

I now specify the base model (Llama-3 8B-Instruct) and create a configuration for BitsAndBytes. This latter part is crucial since it was impossible to train the model otherwise with my setup due to the computational burden. I first specify loading the model parameters in 4-bit precision, then specify the type of 4-bit quantization--it will leverage a data type called 4-bit NormalFloat. I then set the computation data type to bfloat16 and enable double quantization, which quantizes the quantization constants themselves to save a little more memory.

This configuration was taken from https://medium.com/@vmn11/llm-fine-tuning-for-text-classification-using-qlora-13a7d3a256f6 as was the peft configuration below. This is a very useful article about QLoRA.

In [None]:
base_model = "meta-llama/Meta-Llama-3-8B-Instruct"

In [None]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit= True,
    bnb_4bit_quant_type= 'nf4',
    bnb_4bit_compute_dtype= torch.bfloat16,
    bnb_4bit_use_double_quant= True,
)
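
As a rough back-of-the-envelope check on why 4-bit loading matters, the sketch below estimates the memory needed just to hold the weights. The parameter count is a rounded assumption for an "8B" model, and the per-parameter overhead of the quantization constants (0.5 bits without double quantization, roughly 0.127 bits with it, for blocks of 64 weights) is taken from the QLoRA paper. Activations, gradients, and optimizer state are not included.

```python
# Rough weight-memory estimate; every figure here is an approximation.
N_PARAMS = 8e9  # assumed round number for an "8B" model


def weight_gib(bits_per_param, n_params=N_PARAMS):
    """Memory needed to store the weights, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


bf16 = weight_gib(16)                # unquantized 16-bit weights
nf4 = weight_gib(4 + 0.5)            # 4-bit weights + one 32-bit constant per 64 weights
nf4_double = weight_gib(4 + 0.127)   # constants themselves quantized (double quantization)

print(f"bf16:               {bf16:.1f} GiB")
print(f"nf4:                {nf4:.1f} GiB")
print(f"nf4 + double quant: {nf4_double:.1f} GiB")
```

The gap between the last two lines is the saving from bnb_4bit_use_double_quant; the much larger gap from the first line is what makes training feasible on a single GPU at all.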

I'll now load the pre-trained model for sequence classification tasks. The key point is that the widest score range in this dataset is 1-6, so there are 6 labels. The quantization configuration is simply bnb_config, which was defined in the previous step.

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(
        base_model,
        num_labels=6,
        quantization_config=bnb_config,
        device_map='auto', # automatically distribute layers across available GPUs
        trust_remote_code=True # trust remote code in case there are custom layers
)



Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Some weights of LlamaForSequenceClassification were not initialized from the model checkpoint at meta-llama/Meta-Llama-3-8B-Instruct and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


The following code is also taken largely from https://medium.com/@vmn11/llm-fine-tuning-for-text-classification-using-qlora-13a7d3a256f6. I set up the LoraConfig with a number of parameters:

lora_alpha controls the scaling factor for LoRA's low-rank matrices, i.e. the extent to which the matrices modify the existing model. 16 is pretty aggressive, but larger models do a better job handling these major updates without losing much performance on their general capabilities.


lora_dropout controls the dropout rate for the LoRA layers (prevents overfitting)

r is the rank; 2 is quite low but I have to prioritize efficiency.

Next, bias='lora_only' means that, besides the low-rank matrices, only the bias terms of the LoRA-adapted layers are trained, to try to stabilize the learning process.

The task type is sequence classification, and we specifically target projection layers since those layers determine the calculation of attention across the input sequence. Thus, we can hopefully influence how the model integrates focus and information without really modifying the architecture too much.

In [None]:
model = prepare_model_for_kbit_training(model)

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=2,
    bias='lora_only',
    task_type='SEQ_CLS',
    target_modules=['q_proj', 'v_proj', 'k_proj']
)

model = get_peft_model(model, peft_config)
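
To get a feel for how small this adapter is, here is a minimal sketch that counts the parameters the LoRA matrices add. The hidden size (4096) and the number of layers (32) match the model printout later in this notebook; the output width of k_proj and v_proj (1024) is an assumption based on Llama-3 8B using grouped-query attention, and the newly initialized classification head is not counted.

```python
def lora_params(in_features, out_features, r):
    """Parameters added by one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * in_features + out_features * r


r, alpha, n_layers = 2, 16, 32
hidden = 4096
kv_out = 1024  # assumption: grouped-query attention gives k_proj/v_proj a narrower output

per_layer = (lora_params(hidden, hidden, r)     # q_proj
             + lora_params(hidden, kv_out, r)   # k_proj
             + lora_params(hidden, kv_out, r))  # v_proj
total = per_layer * n_layers
scaling = alpha / r  # factor applied to the low-rank update B @ A

print(per_layer, total, scaling)
```

Under these assumptions the adapters add on the order of a million parameters, a tiny fraction of the base model. PEFT's model.print_trainable_parameters() reports the actual figure for the wrapped model.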


We load an appropriate tokenizer using AutoTokenizer and set its padding token to be the same as the end-of-sequence token.

In [None]:
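tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
tokenizer.pad_token = tokenizer.eos_token
# Truncate over-long inputs from the start (the prompt) rather than the end (the essay);
# the library default is to truncate on the right
tokenizer.truncation_side = "left"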

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Next, we need the data (now on S3).

In [1]:
# enable AWS functionalities
!pip install boto3
!pip install s3fs


Collecting botocore<1.35.0,>=1.34.101 (from boto3)
  Using cached botocore-1.34.101-py3-none-any.whl (12.2 MB)
Installing collected packages: botocore
  Attempting uninstall: botocore
    Found existing installation: botocore 1.34.69
    Uninstalling botocore-1.34.69:
      Successfully uninstalled botocore-1.34.69
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
aiobotocore 2.12.3 requires botocore<1.34.70,>=1.34.41, but you have botocore 1.34.101 which is incompatible.
Successfully installed botocore-1.34.101
Collecting botocore<1.34.70,>=1.34.41 (from aiobotocore<3.0.0,>=2.5.4->s3fs)
  Using cached botocore-1.34.69-py3-none-any.whl (12.0 MB)
Installing collected packages: botocore
  Attempting uninstall: botocore
    Found existing installation: botocore 1.34.101
    Uninstalling botocore-1.34.101:
      Successfully uninstalled botocore-1.3

In [3]:
import pandas as pd
train_persuasive_model_input = pd.read_csv('s3://698modeldata/train_persuasive_model_input.csv')
eval_persuasive_model_input = pd.read_csv('s3://698modeldata/eval_persuasive_model_input.csv')
test_persuasive_model_input = pd.read_csv('s3://698modeldata/test_persuasive_model_input.csv')

I build a class called CustomDataset to handle the datasets. It tokenizes and encodes the final_input column, adds features (really the score_type), extracts the target scores, and converts them into a one-hot encoded format. Note: this class requires only minor changes when the models for the other datasets are being built. One thing that's noteworthy is that the max_length is 2000, which is more than enough for most observations. For the few that are longer, the intent is to truncate from the front of the input--this means the prompt is more likely to be cut than the essay itself.
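
To make the truncation point concrete, here is a toy sketch of the difference between cutting the front and cutting the end of an over-long input. It works on a plain list of made-up token names and assumes the prompt comes before the essay in final_input; it is an illustration, not the tokenizer's own code. In transformers, the side that gets cut is governed by the tokenizer's truncation_side attribute, which defaults to "right".

```python
def truncate(tokens, max_length, side="right"):
    """Keep at most max_length tokens, dropping from the given side."""
    if len(tokens) <= max_length:
        return list(tokens)
    if side == "left":
        return list(tokens[len(tokens) - max_length:])  # drop the beginning
    return list(tokens[:max_length])                    # drop the end


# Hypothetical input: prompt tokens (P*) followed by essay tokens (E*)
tokens = ["P1", "P2", "P3", "E1", "E2", "E3", "E4"]

left = truncate(tokens, 5, side="left")
right = truncate(tokens, 5, side="right")
print(left)
print(right)
```

With side="left" the whole essay survives and part of the prompt is lost; with side="right" the end of the essay is lost instead.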

In [None]:
import torch
from torch.utils.data import Dataset
import torch.nn.functional as F

class CustomDataset(Dataset):
    def __init__(self, df, tokenizer, feature_columns, max_length=2000, max_score=6):
        self.tokenizer = tokenizer
        self.data = df
        self.feature_columns = feature_columns
        self.max_length = max_length

        # Tokenize every input up front, padding or truncating to max_length.
        # Which end is cut is controlled by tokenizer.truncation_side; the deprecated
        # truncation_strategy keyword is not needed alongside truncation=True.
        self.encodings = tokenizer(
            list(df['final_input']),
            truncation=True,
            padding="max_length",
            max_length=self.max_length,
            return_tensors="pt",
        )

        self.features = torch.tensor(df[self.feature_columns].values, dtype=torch.float32)

        self.labels = torch.tensor(df['target_score'].tolist(), dtype=torch.long) - 1
        self.labels = F.one_hot(self.labels, num_classes=max_score).float()

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['features'] = self.features[idx]
        item['labels'] = self.labels[idx]
        return item

# One-hot encoded score_type indicator columns
feature_columns = ['score_type_domain1_score', 'score_type_domain2_score',
       'score_type_rater1_domain1', 'score_type_rater2_domain1']


Next, I create instances of the CustomDataset class just defined for the training, evaluation, and testing sets. Then I create DataLoader instances for each set, with smaller batches for training. It isn't necessary to shuffle the data for evaluation and testing.

In [None]:
from torch.utils.data import DataLoader

# instances of the dataset for training, evaluation, and testing
train_dataset = CustomDataset(train_persuasive_model_input, tokenizer, feature_columns)
eval_dataset = CustomDataset(eval_persuasive_model_input, tokenizer, feature_columns)
test_dataset = CustomDataset(test_persuasive_model_input, tokenizer, feature_columns)

# DataLoaders for each dataset
train_loader = DataLoader(train_dataset, batch_size=4, shuffle=True, pin_memory=True)
eval_loader = DataLoader(eval_dataset, batch_size=8, shuffle=False, pin_memory=True)

test_loader = DataLoader(test_dataset, batch_size=8, shuffle=False)



In [None]:
model_path = '/content/drive/MyDrive/DATA698/models/best_persuasive_model'


Next, I set up the training process and run it. There are quite a few training arguments, and I'll comment on them directly. It's worth noting that I had to take a number of steps to reduce memory usage--these will likely impact model performance to some degree.

In [None]:
from transformers import Trainer, TrainingArguments

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model.config.pad_token_id = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)


training_args = TrainingArguments(
    output_dir='./results',         # where outputs will be saved
    num_train_epochs=3,             # total number of training epochs
    per_device_train_batch_size=4,  # small batch size to limit memory usage
    gradient_accumulation_steps=4,  # accumulate gradients over 4 batches before each optimizer step
    per_device_eval_batch_size=8,   # evaluation batch size
    warmup_steps=500,               # number of warmup steps - stabilize early training
    weight_decay=0.01,              # to prevent overfitting
    logging_dir='./logs',
    logging_steps=10,               # log every 10 steps
    evaluation_strategy="epoch",    # evaluation strategy
    save_strategy="epoch",          # save strategy after each epoch
    load_best_model_at_end=True,    # load best model at the end of training
    metric_for_best_model='loss',   # metric to use for best model
    report_to="none",
    fp16=True,                      # mixed precision - reducing memory usage
)

# Initialize  Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset
)

# Train
trainer.train()

# Save model + tokenizer
model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)




Epoch,Training Loss,Validation Loss
0,0.303,0.308251
2,0.2508,0.254656




Epoch,Training Loss,Validation Loss
0,0.303,0.308251
2,0.2145,0.228527




('/content/drive/MyDrive/DATA698/models/best_persuasive_model/tokenizer_config.json',
 '/content/drive/MyDrive/DATA698/models/best_persuasive_model/special_tokens_map.json',
 '/content/drive/MyDrive/DATA698/models/best_persuasive_model/tokenizer.json')
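
A quick way to reason about the batch-related arguments above is to work out the effective batch size and the number of optimizer steps. The sketch below is an approximation of that arithmetic, assuming a single GPU; the dataset size is a made-up figure because the number of training rows is not stated in this notebook, and the Trainer's exact step counting may differ slightly between versions.

```python
import math


def training_schedule(n_examples, per_device_batch=4, grad_accum=4,
                      n_devices=1, epochs=3):
    """Effective batch size and approximate optimizer-step counts for a run."""
    effective_batch = per_device_batch * grad_accum * n_devices
    batches_per_epoch = math.ceil(n_examples / (per_device_batch * n_devices))
    steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return effective_batch, steps_per_epoch, steps_per_epoch * epochs


# n_examples is hypothetical; substitute len(train_dataset) for the real figure.
effective, per_epoch, total = training_schedule(n_examples=4000)
print(effective, per_epoch, total)
```

Comparing the total number of steps with warmup_steps=500 shows how much of the run is spent warming up the learning rate, which is worth checking when the training set is small.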

Now I create an InferenceDataset class to manage the way that the test data is handled. Data from the final_input column is tokenized as it was tokenized earlier. Critically, the essay id and the feature (i.e. the score type) will be included. I then create an instance with the test data and verify its keys.

In [None]:
class InferenceDataset(Dataset):
    def __init__(self, df, tokenizer, feature_columns, max_length=2000):
        self.tokenizer = tokenizer
        self.data = df
        self.feature_columns = feature_columns
        self.max_length = max_length
        self.essay_ids = df['essay_id'].tolist()

        # Tokenization (same settings as in CustomDataset)
        self.encodings = tokenizer(
            list(df['final_input']),
            truncation=True,
            padding="max_length",
            max_length=self.max_length,
            return_tensors="pt",
        )

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # The encodings are already tensors, so index them directly
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['essay_id'] = self.essay_ids[idx]
        # Determine  active feature
        active_feature = next((f for f in self.feature_columns if self.data.iloc[idx][f] == 1), None)
        item['feature'] = active_feature
        return item

# Creating an instance with the test data
test_dataset = InferenceDataset(df=test_persuasive_model_input, tokenizer=tokenizer, feature_columns=feature_columns)
test_item = test_dataset[0]
print(test_item.keys())  # Check keys again


dict_keys(['input_ids', 'attention_mask', 'essay_id', 'feature'])




And indeed, essay_id and feature are there. Now I create a custom collate function to be able to iterate over items in a batch. This is useful because we can just aggregate each item's non-tensor data (i.e. essay_id and feature) into a list which maintains their order. I then use this function within DataLoader for the test data.

In [None]:
def custom_collate(batch):
    collated_batch = {key: torch.stack([item[key] for item in batch]) for key in batch[0] if key in ['input_ids', 'attention_mask', 'labels']}
    # Non-tensor data
    for key in ['essay_id', 'feature']:
        collated_batch[key] = [item[key] for item in batch]
    return collated_batch

# Update DataLoader
test_loader = DataLoader(test_dataset, batch_size=8, shuffle=False, collate_fn=custom_collate)
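
The idea behind custom_collate can be seen on a tiny hand-made batch. This is a dependency-free sketch: plain lists stand in for tensors (so list-of-lists replaces torch.stack), and the function name and the two sample items are hypothetical.

```python
def collate_mixed(batch, stack_keys=("input_ids", "attention_mask"),
                  list_keys=("essay_id", "feature")):
    """Group a list of per-item dicts into one dict of per-key batches."""
    collated = {}
    for key in stack_keys:
        if key in batch[0]:
            collated[key] = [item[key] for item in batch]  # stand-in for torch.stack
    for key in list_keys:
        collated[key] = [item[key] for item in batch]      # order is preserved
    return collated


# Hypothetical two-item batch
batch = [
    {"input_ids": [1, 2, 3], "attention_mask": [1, 1, 1], "essay_id": 10, "feature": "a"},
    {"input_ids": [4, 5, 0], "attention_mask": [1, 1, 0], "essay_id": 11, "feature": "b"},
]
out = collate_mixed(batch)
print(out["essay_id"], out["input_ids"])
```

Because every per-key list is built in the same item order, position i in essay_id and feature always refers to the same row as position i in input_ids.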


The evaluation process can now begin. First, we try to use a GPU and move the model to device.

In [None]:
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)


PeftModelForSequenceClassification(
  (base_model): LoraModel(
    (model): LlamaForSequenceClassification(
      (model): LlamaModel(
        (embed_tokens): Embedding(128256, 4096)
        (layers): ModuleList(
          (0-31): 32 x LlamaDecoderLayer(
            (self_attn): LlamaSdpaAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=2, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=2, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (k_proj): lora.Linear4bit(
           

Now we switch the model to eval mode and loop over test_loader, moving through batches of the dataset. The inputs are moved to the same device as the model (a necessary step). The model processes the inputs and produces outputs (in the form of logits). The argmax over the logits gives the predicted class for each row. Because the labels were shifted down by one during training, these predictions are zero-based class indices (0-5) rather than scores on the original scale (1-6), so they need to be shifted back up by one before being compared with target_score. Each time a prediction is made, the essay_id, feature, and prediction go into a dictionary. predictions_info is a list of those dictionaries. The list of dictionaries is then converted into a dataframe which I'll save.

In [None]:
# go to eval mode
model.eval()
predictions_info = []

with torch.no_grad():
    for batch in test_loader:
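        input_ids = batch['input_ids'].to(device)
        # pass the attention mask too, so padding positions are masked explicitly
        attention_mask = batch['attention_mask'].to(device)
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)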
        predictions = outputs.logits.argmax(dim=-1)

        for i in range(len(predictions)):
            pred_info = {
                'essay_id': batch['essay_id'][i],
                'feature': batch['feature'][i],
                'prediction': predictions[i].item()
            }
            predictions_info.append(pred_info)

#convert to df
import pandas as pd
persuasive_predictions_df = pd.DataFrame(predictions_info)
print(persuasive_predictions_df)





      essay_id                    feature  prediction
0         3587   score_type_domain1_score           2
1         3587   score_type_domain2_score           2
2          926  score_type_rater1_domain1           2
3          926  score_type_rater2_domain1           2
4          648  score_type_rater1_domain1           3
...        ...                        ...         ...
1071       185  score_type_rater2_domain1           3
1072      1541  score_type_rater1_domain1           3
1073      1541  score_type_rater2_domain1           3
1074      3956   score_type_domain1_score           3
1075      3956   score_type_domain2_score           3

[1076 rows x 3 columns]
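
The prediction column above holds zero-based class indices, since the training labels were shifted down by one before one-hot encoding. A minimal sketch of turning a row of logits into a score on the original scale is shown below; the logits are made-up numbers and the function name is hypothetical, and it assumes the downstream comparison is against the 1-based target_score.

```python
def logits_to_score(logits):
    """Pick the highest-scoring class and convert it to a 1-based score."""
    class_index = max(range(len(logits)), key=lambda i: logits[i])  # argmax
    return class_index + 1


# Hypothetical logits for one essay over six classes
example_logits = [-1.2, 0.3, 2.1, 0.8, -0.5, -2.0]
print(logits_to_score(example_logits))
```

Applied to the dataframe, the same shift would be a single column operation (prediction + 1) before any agreement metric is computed.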


In [None]:
print(persuasive_predictions_df.head())


   essay_id                    feature  prediction
0      3587   score_type_domain1_score           2
1      3587   score_type_domain2_score           2
2       926  score_type_rater1_domain1           2
3       926  score_type_rater2_domain1           2
4       648  score_type_rater1_domain1           3


In [None]:
save_path = '/content/drive/MyDrive/DATA698/models/predictions/persuasive_predictions_df.csv'


In [None]:
persuasive_predictions_df.to_csv(save_path, index=False)


Returning to the actual test_data, I'll extract the feature (again, really the score_type for each row) and then merge that dataframe with the predictions dataframe such that it's easier to compare the target_scores with the predictions. Finally, I clean it up a little bit, only keeping key columns, and save the df. We will then be ready to see how well the fine-tuned model did (I'll dedicate another notebook to all the scoring).

In [None]:
test_data = pd.read_csv('s3://698modeldata/test_persuasive_model_input.csv')

In [None]:
def get_active_feature(row, feature_columns):
    for feature in feature_columns:
        if row[feature] == 1:
            return feature
    return None

test_data['feature'] = test_data.apply(lambda row: get_active_feature(row, feature_columns), axis=1)


In [None]:
merged_test = pd.merge(persuasive_predictions_df, test_data, on=['essay_id', 'feature'], how='left')


In [None]:
# the merge key is named 'feature', so that is the column to keep
final_df = merged_test.loc[:, ['essay_id', 'feature', 'prediction', 'target_score']]


In [None]:
save_path = '/content/drive/MyDrive/DATA698/models/predictions/final_persuasive_predictions_df.csv'
final_df.to_csv(save_path, index=False)

