<a href="https://colab.research.google.com/github/hsakkout/genai-peft-finetune/blob/development/LightweightFineTuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lightweight Llama 3.1 Fine-Tuning Project


### Overview

* PEFT technique:  [**LoRA**](https://huggingface.co/docs/peft/main/en/conceptual_guides/lora)
* Model: [**meta-llama/Llama-3.1-8B-Instruct**](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)
* Evaluation approach: **Pronoun disambiguation (a dynamic text classification problem for an LLM), scored with [ROUGE](https://huggingface.co/spaces/evaluate-metric/rouge)**
* Fine-tuning dataset:  [**Lots-of-LoRAs/task249_enhanced_wsc_pronoun_disambiguation**](https://huggingface.co/datasets/Lots-of-LoRAs/task249_enhanced_wsc_pronoun_disambiguation)

### Set Up Google Colab Environment

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
from google.colab import userdata

# First set the token in Colab's secrets manager
# Then access it securely:
token = userdata.get('hf_personal_default')

# Login using the stored token
!huggingface-cli login --token {token}

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [3]:
# Google Colab instance installs
# Reminder: Restart runtime after installation

!pip install peft
!pip install bitsandbytes
!pip install transformers
!pip install datasets
!pip install evaluate
!pip install rouge_score
!pip install bert_score
!pip install wandb



Collecting peft
  Downloading peft-0.13.2-py3-none-any.whl.metadata (13 kB)
Downloading peft-0.13.2-py3-none-any.whl (320 kB)
Installing collected packages: peft
Successfully installed peft-0.13.2
Collecting bitsandbytes
  Downloading bitsandbytes-0.44.1-py3-none-manylinux_2_24_x86_64.whl.metadata (3.5 kB)
Downloading bitsandbytes-0.44.1-py3-none-manylinux_2_24_x86_64.whl (122.4 MB)
Installing collected packages: bitsandbytes
Successfully installed bitsandbytes-0.44.1
Collecting datasets
  Downloading datasets-3.0.2-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_

## Load and Evaluate a Foundation Model

- Choose and load a pre-trained Hugging Face model
- Load a data set for evaluation
- Evaluate its performance prior to Parameter-*Efficient* Fine-Tuning


### Choose and Set Up Model

In [25]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, set_seed
import torch
import random
import numpy as np
import pandas as pd

# Random seeds for reproducibility
SEED = 42

set_seed(SEED)
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

# CUDA seeds
torch.cuda.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)

# Make cuDNN operations deterministic on the GPU
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False


In [26]:

# The base LLM is large, so load it with 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

def setup_model(model_name):
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    tokenizer.padding_side = "left"

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto",
        pad_token_id=tokenizer.pad_token_id
    )

    return model, tokenizer
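A rough, back-of-the-envelope view of why 4-bit loading helps is sketched below. The decoder layer shapes are the ones the model printout in this notebook reports for this checkpoint. The rest is assumption on my part: that only the decoder's linear layers are quantized, that the embeddings, an untied `lm_head`, and the norm weights stay in 16-bit, and that quantization constants are negligible. Treat the totals as estimates, not measurements.

```python
# Approximate weight memory for Llama 3.1 8B, fp16 vs. 4-bit (estimate only).
HIDDEN, KV_DIM, MLP_DIM = 4096, 1024, 14336   # from the printed module shapes
VOCAB, N_LAYERS = 128256, 32

# Linear layers inside one decoder block (the Linear4bit modules)
linear_per_layer = (
    HIDDEN * HIDDEN          # q_proj
    + HIDDEN * KV_DIM        # k_proj
    + HIDDEN * KV_DIM        # v_proj
    + HIDDEN * HIDDEN        # o_proj
    + 3 * HIDDEN * MLP_DIM   # gate_proj, up_proj, down_proj
)
quantized_params = N_LAYERS * linear_per_layer

# Assumed to stay in 16-bit: input embeddings, an untied lm_head, RMSNorm weights
other_params = 2 * VOCAB * HIDDEN + (2 * N_LAYERS + 1) * HIDDEN

total_params = quantized_params + other_params
fp16_gb = total_params * 2 / 1e9                                   # 2 bytes per weight
four_bit_gb = (quantized_params * 0.5 + other_params * 2) / 1e9   # 0.5 byte per quantized weight

print(f"total parameters: {total_params:,}")
print(f"fp16 weights: ~{fp16_gb:.1f} GB, 4-bit weights: ~{four_bit_gb:.1f} GB")
```

Under these assumptions the weights shrink from roughly 16 GB to under 6 GB. The parameter total of 8,030,261,248 also equals the "all params" figure printed later in the notebook minus the LoRA trainable parameters, which suggests the shape bookkeeping is right.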



In [27]:
# Llama 3.1 is a powerful foundation model, and its 8B-parameter variant is still a manageable size

model_name = "meta-llama/Llama-3.1-8B-Instruct"  #  Note: use of this model is subject to Meta's license agreement and access approval

model, tokenizer = setup_model(model_name)

model

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096, padding_idx=128009)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): Llama

### Evaluate its performance


#### Configure the generation to make the answers less verbose

In [28]:
from transformers import StoppingCriteria, StoppingCriteriaList


# I will evaluate Llama 3.1 on a dynamic classification problem, where the classes need to be stated by the LLM succinctly.
# The following aims to reduce the verbosity of the response from Llama 3.1.  Without configuration, it tends to produce long, chatty responses.

class StopOnPeriod(StoppingCriteria):
    def __init__(self, tokenizer, stop_id):
        self.tokenizer = tokenizer
        self.stop_id = stop_id

    def __call__(self, input_ids, scores, **kwargs):
        # Only the last token of the first sequence in the batch is checked
        return self.stop_id in input_ids[0][-1:]

class StopOnNewline(StoppingCriteria):
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        self.newline_ids = tokenizer.encode('\n', add_special_tokens=False)

    def __call__(self, input_ids, scores, **kwargs):
        n = len(self.newline_ids)
        if n > len(input_ids[0]):
            return False
        if input_ids[0][-n:].tolist() == self.newline_ids:
            return True
        return False


stop_token_id = tokenizer.encode(".", add_special_tokens=False)[-1]
stopping_criteria = StoppingCriteriaList([StopOnPeriod(tokenizer, stop_token_id),StopOnNewline(tokenizer)])


# Define bad words to avoid, again to reduce chatter
bad_phrases = ['The final', 'The answer', 'The correct', 'Answer:',
    'A', 'B', 'A)', 'B)', 'Option A', 'Option B',
    'Choice A', 'Choice B', '(A)', '(B)']
bad_words_ids = tokenizer(bad_phrases, add_special_tokens=False).input_ids

#### Define a method for interactive experimentation with the configured foundational model

In [29]:

# This method is for ad hoc text generation, to experiment interactively with the configured model's outputs.


def generate_text(model_in, tokenizer_in, prompt, max_new_tokens=20):
    # Prepare inputs
    inputs = tokenizer_in(
        prompt,
        return_tensors="pt",
        padding=True,
        truncation=True
    )

    # Move inputs to GPU
    inputs = {k: v.to(model_in.device) for k, v in inputs.items()}

    # Generate text
    outputs = model_in.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        eos_token_id=tokenizer_in.eos_token_id,
        pad_token_id=tokenizer_in.pad_token_id,
        stopping_criteria=stopping_criteria,
        bad_words_ids=bad_words_ids,
        no_repeat_ngram_size=3,
        temperature=0.1
    )

    # Calculate the length of the input prompt
    input_length = inputs['input_ids'].shape[1]

    # Slice only the generated tokens (excluding the prompt) - this Llama 3.1 model returns the combined prompt and answer.
    generated_tokens = outputs[:, input_length:]

    # Decode only the generated tokens
    generated_text = tokenizer_in.batch_decode(generated_tokens, skip_special_tokens=True)

    return generated_text

# Adjusted prompt
prompt = (
    "Definition: In each example you will read a short sentence (or two). "
    "Then, a pronoun in the text is mentioned. Your task is to choose a referent which the mentioned pronoun refers to. "
    "You will be given two options in each case and one choice should seem much more likely to you than the other.\n\n"
    "Positive Example 1 - Input: sentence: I couldn't put the saucepan on the rack because it was too tall. "
    "pronoun: it. A) the saucepan B) the rack Output: the rack\n\n"
    "Positive Example 2 - Input: sentence: Arnold greatly influenced Jackson, though he lived two centuries earlier. "
    "pronoun: he. A) jackson B) arnold Output: arnold\n\n"
    "Negative Example 1 - Input: sentence: Joe and David's uncles can still beat them at tennis, even though they are 30 years younger. "
    "pronoun: they. A) joe and david B) joe and david's uncles Output: joe and david's uncles\n\n"
    "Negative Example 2 - Input: sentence: Gaston passed the half-empty plate to Hershel because he was full. "
    "pronoun: he. A) Gaston B) Hershel Output: Hershel\n\n"
    "Now complete the following example - Input: sentence: Joe and Steve paid the detectives after they delivered the final report on the case. "
    "pronoun: they. A) joe and steve B) the detectives Output:"
)

result = generate_text(model, tokenizer, prompt, max_new_tokens=10)

print(f'RESULT: \n{result}\n\n')
print(f'STRIPPED RESULT: \n{result[0].strip()}')

RESULT: 
[' \nAnswer:\nthe detectives\n']


STRIPPED RESULT: 
Answer:
the detectives
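The prompt-stripping slice in `generate_text` works because the tokenizer pads on the left, so every prompt in a batch ends at the same column. The toy sketch below illustrates this with plain Python lists. The token IDs and the `PAD` value are made up for illustration, and no real tokenizer or model is involved.

```python
PAD = 0  # hypothetical pad token ID

def left_pad(seqs, pad_id=PAD):
    """Left-pad token ID lists to a common length, as padding_side='left' does."""
    width = max(len(s) for s in seqs)
    return [[pad_id] * (width - len(s)) + s for s in seqs]

prompts = [[11, 12, 13], [21, 22]]   # two prompts of different length
batch = left_pad(prompts)            # [[11, 12, 13], [0, 21, 22]]
input_length = len(batch[0])         # identical for every row after padding

# Pretend the model appended two new tokens to each row
new_tokens = [[101, 102], [201, 202]]
outputs = [row + new for row, new in zip(batch, new_tokens)]

# List equivalent of the tensor slice outputs[:, input_length:]
generated = [row[input_length:] for row in outputs]
print(generated)  # [[101, 102], [201, 202]]
```

With right padding, the new tokens would instead be generated after trailing pad tokens, which decoder-only models tend to handle poorly.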


#### Load the two datasets (train and test)




In [30]:
# Load the datasets for more rigorous evaluation

from datasets import load_dataset

# Load the training split
train_dataset = load_dataset(
    'Lots-of-LoRAs/task249_enhanced_wsc_pronoun_disambiguation',
    split='train'
)

# Load the test split
test_dataset = load_dataset(
    'Lots-of-LoRAs/task249_enhanced_wsc_pronoun_disambiguation',
    split='test'
)

# Print the column names
print(f'Train Dataset Columns: {train_dataset.column_names}')
print(f'Train Dataset Rows : {len(train_dataset)}')
print(f'Test Dataset Columns: {test_dataset.column_names}')
print(f'Test Dataset Rows : {len(test_dataset)}')


Train Dataset Columns: ['input', 'output', 'id']
Train Dataset Rows : 547
Test Dataset Columns: ['input', 'output', 'id']
Test Dataset Rows : 69


#### Normalize the data

In [31]:
# Preprocess raw data set inputs into local 'prompts' and 'references' columns

def preprocess_function(examples):
    inputs = examples['input']
    raw_references = examples['output']

    prompts = []
    references = []

    for input_text, ref in zip(inputs, raw_references):
        # The dataset input already holds the full few-shot prompt; only trim whitespace
        prompt = input_text.strip()
        prompts.append(prompt)

        # Process the reference
        if isinstance(ref, list):
            ref = ref[0] if ref else ''
        else:
            ref = str(ref)

        references.append(ref.strip())

    return {
        'prompts': prompts,
        'references': references,
    }


#### Method for Systematic Evaluation

In [32]:
# Systematic evaluation method

from tqdm import tqdm
from rouge_score import rouge_scorer

def evaluate_quality(rawdataset, model_in, tokenizer_in):
    dest_device = model_in.device

    data = rawdataset.map(preprocess_function, batched=True)
    # Generate model outputs in batches
    batch_size = 5  # Adjust based on GPU memory
    generated_answers = []
    references = []
    questions_list = []

    for i in tqdm(range(0, len(data), batch_size)):
        batch = data[i:i+batch_size]
        prompts = batch['prompts']
        refs = batch['references']

        # Clear CUDA cache
        torch.cuda.empty_cache()

        # Tokenize inputs
        inputs = tokenizer_in(
            prompts,
            return_tensors='pt',
            truncation=True,
            padding=True,
            max_length=400
        )

        # Move inputs to the same device as the model
        inputs = {k: v.to(dest_device) for k, v in inputs.items()}

        # Generate outputs
        outputs = model_in.generate(
            **inputs,
            max_new_tokens=15,  # Adjust as needed
            eos_token_id=tokenizer_in.eos_token_id,
            pad_token_id=tokenizer_in.pad_token_id,
            stopping_criteria=stopping_criteria,
            bad_words_ids=bad_words_ids,
            no_repeat_ngram_size=3,
            temperature=0.1,
            num_beams=5,
            early_stopping=True
        )

        # Calculate the length of the input prompt
        input_length = inputs['input_ids'].shape[1]

        # Extract only the generated tokens (excluding the prompt)
        generated_tokens = outputs[:, input_length:]

        # Decode outputs
        decoded_outputs = tokenizer_in.batch_decode(generated_tokens, skip_special_tokens=True)

        # Post-process outputs if necessary
        processed_outputs = [output.strip() for output in decoded_outputs]

        generated_answers.extend(processed_outputs)
        references.extend(refs)
        questions_list.extend(prompts)


    # Minimal post-processing to clean the outputs (optional)
    cleaned_outputs = []
    for output in generated_answers:
        # Split at the first newline or period
        output = output.split('\n')[0].split('.')[0].strip()
        cleaned_outputs.append(output)

    # Compute evaluation metrics and store detailed results
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

    detailed_results = []

    for idx, (question, ref, pred) in enumerate(zip(
        questions_list, references, cleaned_outputs)):
        # Compute ROUGE scores for our model's prediction
        scores = scorer.score(ref, pred)

        # Store the detailed results
        detailed_results.append({
            'No.': idx + 1,
            'Question': question,
            'Abbrv Question': question[-300:],
            'Reference': ref,
            'Our Prediction': pred,
            'Our ROUGE-1 F1': scores['rouge1'].fmeasure,
            'Our ROUGE-2 F1': scores['rouge2'].fmeasure,
            'Our ROUGE-L F1': scores['rougeL'].fmeasure,
        })

    # Collect the detailed results in a pandas DataFrame
    df = pd.DataFrame(detailed_results)

    # Set pandas options for better display
    pd.set_option('display.max_colwidth', None)  # Don't truncate text in cells

    # Display the DataFrame in the notebook
    display(df[['No.', 'Abbrv Question', 'Reference', 'Our Prediction',
                'Our ROUGE-1 F1', 'Our ROUGE-2 F1', 'Our ROUGE-L F1']])

    # Also print the average scores
    avg_rouge1 = df['Our ROUGE-1 F1'].mean()
    avg_rouge2 = df['Our ROUGE-2 F1'].mean()
    avg_rougeL = df['Our ROUGE-L F1'].mean()

    print("\nAverage ROUGE Scores for Our Model:")
    print(f"ROUGE-1 F1 Score: {avg_rouge1:.4f}")
    print(f"ROUGE-2 F1 Score: {avg_rouge2:.4f}")
    print(f"ROUGE-L F1 Score: {avg_rougeL:.4f}")


In [33]:

evaluate_quality(test_dataset, model, tokenizer)

100%|██████████| 14/14 [00:35<00:00,  2.51s/it]


Unnamed: 0,No.,Abbrv Question,Reference,Our Prediction,Our ROUGE-1 F1,Our ROUGE-2 F1,Our ROUGE-L F1
0,1,e: gaston passed the half-empty plate to hershel because he was full .pronoun: he. A) Gaston B) hershel\nOutput: hershel\n\nNow complete the following example -\nInput: sentence: freddy was upset with travis because the toaster he had bought from him didn't work .pronoun: he. A) freddy B) travis\nOutput:,freddy,travis,0.0,0.0,0.0
1,2,"te to hershel because he was full .pronoun: he. A) Gaston B) hershel\nOutput: hershel\n\nNow complete the following example -\nInput: sentence: Fred covered his eye with his hand , because the wind was blowing sand around . He lowered it when the wind stopped .pronoun: it. A) his eye B) his hand\nOutput:",his hand,his hand,1.0,1.0,1.0
2,3,ut: sentence: gaston passed the half-empty plate to hershel because he was full .pronoun: he. A) Gaston B) hershel\nOutput: hershel\n\nNow complete the following example -\nInput: sentence: luke couldn't see the stage with adam in front of him because he is so short .pronoun: he. A) luke B) adam\nOutput:,luke,adam,0.0,0.0,0.0
3,4,f-empty plate to hershel because he was full .pronoun: he. A) Gaston B) hershel\nOutput: hershel\n\nNow complete the following example -\nInput: sentence: the board of aldermen refused the protesters a permit because they feared violence .pronoun: they. A) the board of aldermen B) the protesters\nOutput:,the board of aldermen,A,0.0,0.0,0.0
4,5,uncles\n\nNegative Example 2 -\nInput: sentence: gaston passed the half-empty plate to hershel because he was full .pronoun: he. A) Gaston B) hershel\nOutput: hershel\n\nNow complete the following example -\nInput: sentence: The man held the boy against his will .pronoun: his. A) the man B) the boy\nOutput:,the boy,A) The man,0.4,0.0,0.4
...,...,...,...,...,...,...,...
64,65,e Example 2 -\nInput: sentence: gaston passed the half-empty plate to hershel because he was full .pronoun: he. A) Gaston B) hershel\nOutput: hershel\n\nNow complete the following example -\nInput: sentence: Bill did not pass the ball to Steven although he was open .pronoun: he. A) bill B) steven\nOutput:,steven,steven,1.0,0.0,1.0
65,66,"to hershel because he was full .pronoun: he. A) Gaston B) hershel\nOutput: hershel\n\nNow complete the following example -\nInput: sentence: Susan and Hannah knew that Ann and Polly's sons had been in car accidents , so they told them about it .pronoun: they. A) susan and hannah B) ann and polly\nOutput:",susan and hannah,susan,0.5,0.0,0.5
66,67,aston passed the half-empty plate to hershel because he was full .pronoun: he. A) Gaston B) hershel\nOutput: hershel\n\nNow complete the following example -\nInput: sentence: Peter did a lot worse than his good friend Jeff on the test because he had studied so hard .pronoun: he. A) peter B) jeff\nOutput:,jeff,jeff,1.0,0.0,1.0
67,68,2 -\nInput: sentence: gaston passed the half-empty plate to hershel because he was full .pronoun: he. A) Gaston B) hershel\nOutput: hershel\n\nNow complete the following example -\nInput: sentence: I put the heavy books on the tables and they broke .pronoun: they. A) the heavy books B) the tables\nOutput:,the tables,B,0.0,0.0,0.0



Average ROUGE Scores for Our Model:
ROUGE-1 F1 Score: 0.3619
ROUGE-2 F1 Score: 0.0848
ROUGE-L F1 Score: 0.3619
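To see how individual rows in the table receive their scores, the sketch below recomputes unigram (ROUGE-1 style) F1 in plain Python. It is a simplified stand-in for `rouge_score`, not the library itself: it lowercases and splits on non-alphanumeric characters but skips stemming, and the helper name `unigram_f1` is my own.

```python
import re
from collections import Counter

def unigram_f1(reference, prediction):
    """Unigram-overlap F1 between two strings (no stemming)."""
    ref = Counter(re.findall(r"[a-z0-9]+", reference.lower()))
    pred = Counter(re.findall(r"[a-z0-9]+", prediction.lower()))
    overlap = sum((ref & pred).values())   # clipped count of shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(round(unigram_f1("the boy", "A) The man"), 4))   # 0.4
print(unigram_f1("the tables", "B"))                    # 0.0
print(unigram_f1("his hand", "his hand"))               # 1.0
```

Rows 5 and 68 of the table show two side effects of scoring a classification task this way. A wrong answer such as `A) The man` still earns 0.4 because it shares the word "the" with the reference, while the bare letter `B` scores 0 even though option B is the reference. Exact-match accuracy on a normalized answer may therefore be a useful companion metric.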


## Perform Parameter-Efficient Fine-Tuning

- Create a PEFT model from the loaded model
- Run a training loop
- Save the PEFT model weights

### Set Up Trainable LoRA Model

In [34]:

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (
    Trainer,
    TrainingArguments,
    DataCollatorForLanguageModeling
)

# Prepare the model for k-bit training
model = prepare_model_for_kbit_training(model)

# Apply LoRA configuration
config = LoraConfig(
    task_type="CAUSAL_LM",
    inference_mode=False,
    r=16,
    lora_alpha=50,
    lora_dropout=0.05,
    bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "gate_proj", "up_proj"],
    # Note: these modules are adapted in every decoder layer; no subset of layers is selected
)

lora_model = get_peft_model(model, config)

# Optionally, print the trainable parameters to verify
lora_model.print_trainable_parameters()



trainable params: 28,311,552 || all params: 8,058,572,800 || trainable%: 0.3513
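The reported count of trainable parameters can be reproduced by hand. Each adapted linear layer with `in_features` inputs and `out_features` outputs gains two low-rank matrices, A (r × in) and B (out × r), so it adds r · (in + out) parameters. The sketch below applies that to the five target modules using the shapes from the model printout. It assumes LoRA is attached to all 32 decoder layers and that no biases are trained (`bias="none"`).

```python
R = 16          # LoRA rank from the config above
N_LAYERS = 32   # decoder layers in the model printout

# (in_features, out_features) of each targeted module
target_shapes = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
}

# A is (R x in) and B is (out x R), so each module adds R * (in + out) weights
per_layer = sum(R * (n_in + n_out) for n_in, n_out in target_shapes.values())
trainable = N_LAYERS * per_layer

print(f"{trainable:,}")  # 28,311,552
```

In standard LoRA the low-rank update is scaled by `lora_alpha / r`, here 50 / 16 = 3.125. That factor affects the size of the update but not the parameter count.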


### Train the LoRA Model

#### Prepare and handle the data

In [35]:

train_data = train_dataset.map(preprocess_function, batched=True)

# Filter out any examples with empty prompts or references
train_data = train_data.filter(lambda example: example['prompts'] and example['references'])


def tokenize_function(examples):
    # Tokenize prompts and references separately
    tokenized_prompts = tokenizer(
        examples['prompts'],
        max_length=512,
        truncation=True,
        padding=False,
        add_special_tokens=False,
    )

    tokenized_references = tokenizer(
        examples['references'],
        max_length=512,
        truncation=True,
        padding=False,
        add_special_tokens=False,
    )

    input_ids = []
    labels = []
    attention_masks = []

    for prompt_ids, reference_ids in zip(tokenized_prompts['input_ids'], tokenized_references['input_ids']):
        # Concatenate prompt and reference IDs
        input_id = prompt_ids + reference_ids + [tokenizer.eos_token_id]
        input_ids.append(input_id)

        # Create labels by masking the prompt tokens
        label = [-100] * len(prompt_ids) + reference_ids + [tokenizer.eos_token_id]
        labels.append(label)

        # Create attention mask
        attention_mask = [1] * len(input_id)
        attention_masks.append(attention_mask)

    return {
        'input_ids': input_ids,
        'attention_mask': attention_masks,
        'labels': labels,
    }

# Apply the tokenization function
tokenized_datasets = train_data.map(
    tokenize_function,
    batched=True,
    remove_columns=train_data.column_names,  # also drops the derived 'prompts' and 'references' columns
)


# Define a custom data collator to handle dynamic padding
def data_collator(features):
    batch = {}

    # Get max sequence length in the batch
    max_length = max(len(f['input_ids']) for f in features)

    # Pad input_ids and labels
    input_ids = []
    attention_mask = []
    labels = []
    for f in features:
        pad_length = max_length - len(f['input_ids'])
        input_ids.append(f['input_ids'] + [tokenizer.pad_token_id] * pad_length)
        attention_mask.append([1] * len(f['input_ids']) + [0] * pad_length)
        labels.append(f['labels'] + [-100] * pad_length)

    batch['input_ids'] = torch.tensor(input_ids, dtype=torch.long)
    batch['attention_mask'] = torch.tensor(attention_mask, dtype=torch.long)
    batch['labels'] = torch.tensor(labels, dtype=torch.long)

    return batch
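The label-masking scheme used in `tokenize_function` and `data_collator` can be checked on toy data. The sketch below uses made-up token IDs and hypothetical `EOS` and `PAD` values in place of the real tokenizer. It shows that only the reference tokens and the end-of-sequence token keep real labels, while prompt and padding positions are set to -100, the index that PyTorch's cross-entropy loss ignores by default.

```python
IGNORE = -100       # label value skipped by the loss
EOS, PAD = 2, 0     # hypothetical special token IDs

def build_example(prompt_ids, reference_ids):
    """Concatenate prompt + reference + EOS and mask the prompt in the labels."""
    input_ids = prompt_ids + reference_ids + [EOS]
    labels = [IGNORE] * len(prompt_ids) + reference_ids + [EOS]
    return {"input_ids": input_ids, "labels": labels}

def pad_batch(features):
    """Right-pad every field to the longest sequence in the batch."""
    width = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n = len(f["input_ids"])
        gap = width - n
        batch["input_ids"].append(f["input_ids"] + [PAD] * gap)
        batch["attention_mask"].append([1] * n + [0] * gap)
        batch["labels"].append(f["labels"] + [IGNORE] * gap)
    return batch

batch = pad_batch([
    build_example([5, 6, 7], [8, 9]),   # 3 prompt tokens, 2 answer tokens
    build_example([5, 6], [8]),         # 2 prompt tokens, 1 answer token
])
print(batch["labels"])
# [[-100, -100, -100, 8, 9, 2], [-100, -100, 8, 2, -100, -100]]
```

If the pad token falls back to the EOS token, as `setup_model` allows, masking the padded label positions explicitly matters: otherwise padding would be trained as extra end-of-sequence predictions.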


#### Train the Model

In [36]:

# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=4,  # Result of experimentation
    gradient_accumulation_steps=4,  # Result of experimentation
    evaluation_strategy='no',
    save_strategy='epoch',
    logging_steps=10,
    save_total_limit=2,
    fp16=True,
    learning_rate=2e-4,
    max_grad_norm=0.3,
    warmup_steps=100,
    weight_decay=0.01,
    report_to='none',
    seed=SEED,
)

# Initialize the HF Trainer
trainer = Trainer(
    model=lora_model,
    args=training_args,
    train_dataset=tokenized_datasets,
    data_collator=data_collator,
)


# Clear CUDA cache
torch.cuda.empty_cache()

# Start training
trainer.train()



| Step | Training Loss |
|------|---------------|
| 10   | 5.2248        |
| 20   | 4.7949        |
| 30   | 2.4105        |
| 40   | 0.3104        |
| 50   | 0.2621        |
| 60   | 0.2091        |
| 70   | 0.1372        |
| 80   | 0.131         |
| 90   | 0.1004        |
| 100  | 0.1234        |




TrainOutput(global_step=102, training_loss=1.3457130424532235, metrics={'train_runtime': 276.7183, 'train_samples_per_second': 5.93, 'train_steps_per_second': 0.369, 'total_flos': 2.308054007370547e+16, 'train_loss': 1.3457130424532235, 'epoch': 2.978102189781022})
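The step and epoch figures in the training output can be reconstructed from the configuration. The sketch below assumes a single GPU, that all 547 training rows survive the filter, and that a trailing, incomplete accumulation group at the end of each epoch is not counted as an optimizer step. Under those assumptions the numbers match the logged `global_step` and `epoch`, though other `transformers` versions may count differently.

```python
import math

n_examples = 547          # rows in the training split
per_device_batch = 4
grad_accum = 4
epochs = 3
warmup_steps = 100

batches_per_epoch = math.ceil(n_examples / per_device_batch)   # 137
steps_per_epoch = batches_per_epoch // grad_accum              # 34
total_steps = steps_per_epoch * epochs                         # 102
epochs_seen = total_steps * grad_accum / batches_per_epoch

print(total_steps, round(epochs_seen, 4))                  # 102 2.9781
print(f"warmup share: {warmup_steps / total_steps:.0%}")   # warmup share: 98%
```

With `warmup_steps=100`, the learning rate is still ramping up for nearly the whole run, so a shorter warmup may be worth trying in further experiments.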

#### Save the Result

In [37]:
# lora_model.save_pretrained("llama3_1_8b-lora-pretraining")
lora_model.save_pretrained("llama3_1_8b-lora-posttraining-4")

## Performing Inference with a PEFT Model

- Load the saved PEFT model weights
- Evaluate the performance of the trained PEFT model
- Compare the results to the results from prior to fine-tuning

### Load the Saved Model

In [38]:
from peft import PeftModel, PeftConfig


# Load a fresh copy of the base foundation model

print(f'Recreating a new copy of Model {model_name}')
model_copy, tokenizer_copy = setup_model(model_name)



# Load the LoRA model

lora_weights_path = "./llama3_1_8b-lora-posttraining-4"
config = PeftConfig.from_pretrained(lora_weights_path)

lora_model_copy = PeftModel.from_pretrained(
    model_copy,
    lora_weights_path,
    device_map="auto"
)

Recreating a new copy of Model meta-llama/Llama-3.1-8B-Instruct


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

### Evaluate its quality using our standardized evaluator

In [39]:
evaluate_quality(test_dataset, lora_model_copy, tokenizer_copy)

100%|██████████| 14/14 [00:44<00:00,  3.21s/it]


Unnamed: 0,No.,Abbrv Question,Reference,Our Prediction,Our ROUGE-1 F1,Our ROUGE-2 F1,Our ROUGE-L F1
0,1,e: gaston passed the half-empty plate to hershel because he was full .pronoun: he. A) Gaston B) hershel\nOutput: hershel\n\nNow complete the following example -\nInput: sentence: freddy was upset with travis because the toaster he had bought from him didn't work .pronoun: he. A) freddy B) travis\nOutput:,freddy,freddy,1.000000,0.0,1.000000
1,2,"te to hershel because he was full .pronoun: he. A) Gaston B) hershel\nOutput: hershel\n\nNow complete the following example -\nInput: sentence: Fred covered his eye with his hand , because the wind was blowing sand around . He lowered it when the wind stopped .pronoun: it. A) his eye B) his hand\nOutput:",his hand,his hand,1.000000,1.0,1.000000
2,3,ut: sentence: gaston passed the half-empty plate to hershel because he was full .pronoun: he. A) Gaston B) hershel\nOutput: hershel\n\nNow complete the following example -\nInput: sentence: luke couldn't see the stage with adam in front of him because he is so short .pronoun: he. A) luke B) adam\nOutput:,luke,luke,1.000000,0.0,1.000000
3,4,f-empty plate to hershel because he was full .pronoun: he. A) Gaston B) hershel\nOutput: hershel\n\nNow complete the following example -\nInput: sentence: the board of aldermen refused the protesters a permit because they feared violence .pronoun: they. A) the board of aldermen B) the protesters\nOutput:,the board of aldermen,the protesters,0.333333,0.0,0.333333
4,5,uncles\n\nNegative Example 2 -\nInput: sentence: gaston passed the half-empty plate to hershel because he was full .pronoun: he. A) Gaston B) hershel\nOutput: hershel\n\nNow complete the following example -\nInput: sentence: The man held the boy against his will .pronoun: his. A) the man B) the boy\nOutput:,the boy,the boy,1.000000,1.0,1.000000
...,...,...,...,...,...,...,...
64,65,e Example 2 -\nInput: sentence: gaston passed the half-empty plate to hershel because he was full .pronoun: he. A) Gaston B) hershel\nOutput: hershel\n\nNow complete the following example -\nInput: sentence: Bill did not pass the ball to Steven although he was open .pronoun: he. A) bill B) steven\nOutput:,steven,steven,1.000000,0.0,1.000000
65,66,"to hershel because he was full .pronoun: he. A) Gaston B) hershel\nOutput: hershel\n\nNow complete the following example -\nInput: sentence: Susan and Hannah knew that Ann and Polly's sons had been in car accidents , so they told them about it .pronoun: they. A) susan and hannah B) ann and polly\nOutput:",susan and hannah,susan and hanna,0.666667,0.5,0.666667
66,67,aston passed the half-empty plate to hershel because he was full .pronoun: he. A) Gaston B) hershel\nOutput: hershel\n\nNow complete the following example -\nInput: sentence: Peter did a lot worse than his good friend Jeff on the test because he had studied so hard .pronoun: he. A) peter B) jeff\nOutput:,jeff,jeff,1.000000,0.0,1.000000
67,68,2 -\nInput: sentence: gaston passed the half-empty plate to hershel because he was full .pronoun: he. A) Gaston B) hershel\nOutput: hershel\n\nNow complete the following example -\nInput: sentence: I put the heavy books on the tables and they broke .pronoun: they. A) the heavy books B) the tables\nOutput:,the tables,the tables,1.000000,1.0,1.000000



Average ROUGE Scores for Our Model:
ROUGE-1 F1 Score: 0.7998
ROUGE-2 F1 Score: 0.4068
ROUGE-L F1 Score: 0.7961


## Comparison

With very little experimentation, and without selecting specific layers for retraining, LoRA fine-tuning substantially improved pronoun disambiguation quality on the held-out test set. Mean ROUGE-1 F1 rose from 0.3619 to 0.7998:


| Model           | ROUGE-1 F1 Mean | ROUGE-2 F1 Mean | ROUGE-L F1 Mean |
|-----------------|-----------------|-----------------|-----------------|
| Original        |          0.3619 |          0.0848 |          0.3619 |
| LoRA Fine-Tuned |          0.7998 |          0.4068 |          0.7961 |

