# LLaMa Supervised Fine-Tuning
In this notebook I'll focus on fine-tuning the LLaMa 2 7B model on the weather extraction task.

Since fine-tuning the model takes quite some time (5 epochs = ca. 12 minutes), I won't run many experiments and I'll focus on only 2 dimensions of the analysis:
1. Is there any difference in performance between the 2 prompts?
2. Does performance improve with a higher `lora_alpha` (i.e. a greater influence of the adapter weights), for both prompts?

At the end I'll load the best models and evaluate (and quantify) the results, provided they make sense.
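
To make the `lora_alpha` dimension concrete: in standard LoRA the adapter update is multiplied by `lora_alpha / r` before it is added to the frozen weights. Below is a minimal sketch, assuming the standard (non rank-stabilized) scaling and the rank `r=64` used in the `LoraConfig` later in this notebook. `lora_scaling` is a hypothetical helper written for illustration, not a `peft` function.

```python
def lora_scaling(lora_alpha: float, r: int) -> float:
    """Factor applied to the low-rank update in standard LoRA (assumption: no rsLoRA)."""
    return lora_alpha / r


R = 64  # rank used in this notebook's LoraConfig

for alpha in (16, 64):
    print(f"lora_alpha={alpha}, r={R} -> scaling={lora_scaling(alpha, R):.2f}")
```

With `r=64`, `lora_alpha=16` scales the adapter update by 0.25 and `lora_alpha=64` by 1.0, so the second setting gives the adapter four times more weight.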

### Imports
Uncomment and run the following cell if executing in a Google Colab environment.

In [None]:
# !pip install -r ../requirements.txt

In [1]:
import json
import gc
import re

import numpy as np
import torch
import transformers
from datasets import Dataset
from peft import LoraConfig, AutoPeftModelForCausalLM
from transformers import TrainingArguments, BitsAndBytesConfig, AutoModelForCausalLM, AutoTokenizer, EarlyStoppingCallback
from trl import SFTTrainer


HF_MODEL_ID = "philschmid/Llama-2-7b-hf" # I'm using an ungated copy of the 7B model
TRAIN_DATASET_PATH = "../data/hf_training"
VALIDATION_DATASET_PATH = "../data/hf_validation"

MAX_EPOCHS = 10
EARLY_STOPPING_PATIENCE = 2
LOGS_DIR = "outputs"
LORA_ALPHAS = [16, 64]

RANDOM_STATE = 42


#### Load the data

In [3]:
train_dataset = Dataset.load_from_disk(TRAIN_DATASET_PATH)
validation_dataset = Dataset.load_from_disk(VALIDATION_DATASET_PATH)

#### Setup prompt formatting functions

In [4]:
def instruction_prompt_formatting_func(example):
    prompts = []
    for i in range(len(example["weather_description"])):
        prompt = f"""{instruction_prompt_inputs(example["weather_description"][i])}
<json>
{json.dumps(example['weather_conditions'][i])}
</json>"""

        prompts.append(prompt)

    return prompts


def instruction_prompt_inputs(weather_description):
    return f"""\
Below is an instruction that describes an information extraction task.

### Instruction:
Read the following weather description and extract weather attributes, such as temperature, humidity, wind.
Write them down in a JSON file format.

### Input:
{weather_description}

### Response:"""


def custom_prompt_formatting_func(example):
    prompts = []
    for i in range(len(example["weather_description"])):
        prompt = f"""{custom_prompt_inputs(example["weather_description"][i])}
<json>
{json.dumps(example['weather_conditions'][i])}
</json>"""

        prompts.append(prompt)

    return prompts


def custom_prompt_inputs(weather_description):
    return f"""\
Given the following WEATHER DESCRIPTION extract WEATHER CONDITIONS in a JSON file format.

### WEATHER DESCRIPTION
{weather_description}

### WEATHER CONDITIONS"""

In [5]:
for prompt_example in instruction_prompt_formatting_func(validation_dataset.select(range(3)).to_pandas().to_dict("list")):
    print("="*20)
    print(prompt_example)

Below is an instruction that describes an information extraction task.

### Instruction:
Read the following weather description and extract weather attributes, such as temperature, humidity, wind.
Write them down in a JSON file format.

### Input:
Humidity levels are currently registered at 51% for the day. Winds will be a bit of a nuisance today, with speeds of 8 km/h and a East direction. It's a -41 degrees degree day, so rug up if you're heading out.

### Response:
<json>
{"humidity": "51%", "temperature": "-41 degrees", "weather": null, "wind_direction": "East", "wind_speed": "8 km/h"}
</json>
Below is an instruction that describes an information extraction task.

### Instruction:
Read the following weather description and extract weather attributes, such as temperature, humidity, wind.
Write them down in a JSON file format.

### Input:
The current temperature is a brisk -1 degrees degrees. It's going to be a windy one, with gusts reaching 12 km/h and coming from the NW. The weathe

#### Train model variants

In [6]:
def get_model_and_tokenizer(model_path):
    # Load base model
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        # quantization for memory saving purposes
        quantization_config=BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.float16
        ),
        device_map="auto"
    )
    model.config.use_cache = False
    model.config.pretraining_tp = 1 # faster computation of the linear layers

    # Load LLaMA tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "right"


    return model, tokenizer


def get_trainer(lora_alpha, formatting_func, model_path = HF_MODEL_ID):

    model, tokenizer = get_model_and_tokenizer(model_path)

    peft_config = LoraConfig(
        r=64,  # rank of the low-rank update matrices
        lora_alpha=lora_alpha,
        lora_dropout=0.1, # dropout probability for the LoRA layers
        bias="none",
        task_type="CAUSAL_LM",
    )

    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=train_dataset,
        eval_dataset=validation_dataset,
        packing=False, # I first tried packing, but I couldn't fit the data on a T4 GPU
        formatting_func=formatting_func,
        args=TrainingArguments(
                per_device_train_batch_size=4, # maximum batch size before OOM error on the T4
                per_device_eval_batch_size=4,
                gradient_accumulation_steps=1,
                warmup_ratio=0.03,
                num_train_epochs=MAX_EPOCHS,
                evaluation_strategy="epoch",
                save_strategy="epoch",
                save_total_limit=1,
                max_steps=-1,
                learning_rate=2e-4, # This learning rate seems to stabilize training according to blogs/papers
                fp16=False,
                bf16=False,
                output_dir=LOGS_DIR,
                logging_steps=1,
                optim="paged_adamw_8bit",
                report_to="none",
                load_best_model_at_end=True
        ),
        peft_config=peft_config,
        callbacks=[EarlyStoppingCallback(early_stopping_patience=EARLY_STOPPING_PATIENCE)]
    )

    return trainer
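
As a rough illustration of why the 4-bit quantization above matters for memory, here is a back-of-the-envelope estimate of the space taken by the base model weights alone. It assumes a round 7e9 parameters and ignores quantization constants, activations, LoRA weights and optimizer state, so the real footprint is higher. `weight_memory_gib` is a hypothetical helper written for this sketch.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory needed to store the weights, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


NUM_PARAMS = 7e9  # assumption: round figure for a "7B" model

for label, bits in [("float16", 16), ("4-bit (nf4)", 4)]:
    print(f"{label:>12}: ~{weight_memory_gib(NUM_PARAMS, bits):.1f} GiB")
```

Under these assumptions the weights shrink by a factor of 4 when going from 16-bit to 4-bit storage, which leaves room on a small GPU for activations and the adapter's optimizer state.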

Run the training loop

In [7]:
transformers.set_seed(RANDOM_STATE)

results = []
for lora_alpha in LORA_ALPHAS:
    for prompt_format in [instruction_prompt_formatting_func, custom_prompt_formatting_func]:
        trainer = get_trainer(lora_alpha, prompt_format)
        trainer.train()
        evaluation_results = trainer.evaluate()

        results.append((lora_alpha, prompt_format, evaluation_results))

        model_path = f"alpha={str(lora_alpha)}_prompt={prompt_format.__name__}"
        trainer.model.save_pretrained(model_path)
        trainer.tokenizer.save_pretrained(model_path)

        # Clear memory for the next iteration
        del trainer
        gc.collect()

        torch.cuda.empty_cache()


| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 0.8476 | 0.772568 |
| 2 | 0.4013 | 0.432811 |
| 3 | 0.2623 | 0.406261 |
| 4 | 0.211 | 0.423231 |
| 5 | 0.163 | 0.469207 |


Checkpoint destination directory outputs/checkpoint-57 already exists and is non-empty. Saving will proceed but saved results may be invalid.



| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 0.9179 | 0.887952 |
| 2 | 0.406 | 0.455948 |
| 3 | 0.2741 | 0.440066 |
| 4 | 0.2352 | 0.466842 |
| 5 | 0.1831 | 0.506499 |



| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 0.4573 | 0.479415 |
| 2 | 0.2707 | 0.414562 |
| 3 | 0.2145 | 0.395296 |
| 4 | 0.1881 | 0.440828 |
| 5 | 0.1538 | 0.526534 |



| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 0.5084 | 0.502014 |
| 2 | 0.2894 | 0.445293 |
| 3 | 0.2252 | 0.443863 |
| 4 | 0.1982 | 0.482433 |
| 5 | 0.1674 | 0.553562 |



#### Display results

In [8]:
for lora_alpha, prompt_f, metrics in results:
    model_path = f"alpha={str(lora_alpha)}_prompt={prompt_f.__name__}"
    print(f"LoRA Alpha: {lora_alpha}, Prompt Function: {prompt_f.__name__}, Validation Loss: {metrics['eval_loss']:.3f}")
    print("Model Save Path: ", model_path, "\n")

LoRA Alpha: 16, Prompt Function: instruction_prompt_formatting_func, Validation Loss: 0.406
Model Save Path:  alpha=16_prompt=instruction_prompt_formatting_func 

LoRA Alpha: 16, Prompt Function: custom_prompt_formatting_func, Validation Loss: 0.440
Model Save Path:  alpha=16_prompt=custom_prompt_formatting_func 

LoRA Alpha: 64, Prompt Function: instruction_prompt_formatting_func, Validation Loss: 0.395
Model Save Path:  alpha=64_prompt=instruction_prompt_formatting_func 

LoRA Alpha: 64, Prompt Function: custom_prompt_formatting_func, Validation Loss: 0.444
Model Save Path:  alpha=64_prompt=custom_prompt_formatting_func 



#### Conclusions
* Losses for the 2 prompts shouldn't be compared directly, because the formatted training text differs slightly between them. Still, it's interesting to see that the `instruction prompt format` achieved lower evaluation loss for both alphas (16 and 64)
* Essentially every model variant starts to overfit after epoch 3: training loss keeps falling while validation loss rises
* The higher alpha achieved a better evaluation loss for the instruct variant (0.395 vs. 0.406), but a slightly worse one for the custom variant (0.444 vs. 0.440)
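
The overfitting pattern can be checked against the logged validation losses. The sketch below replays a simplified patience rule (stop once the loss has failed to improve for `patience` consecutive evaluations) on the four loss curves from the training output. `simulate_early_stopping` is a hypothetical helper and only approximates what `EarlyStoppingCallback` does. The run labels are my assumption, inferred from the order of the training loop.

```python
def simulate_early_stopping(val_losses, patience):
    """Return (best_epoch, last_epoch_run) under a simple patience rule."""
    best_loss = float("inf")
    best_epoch = None
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses)


# Validation losses copied from the training logs above
# (labels inferred from the loop order, not printed by the trainer).
runs = {
    "alpha=16, instruction": [0.772568, 0.432811, 0.406261, 0.423231, 0.469207],
    "alpha=16, custom": [0.887952, 0.455948, 0.440066, 0.466842, 0.506499],
    "alpha=64, instruction": [0.479415, 0.414562, 0.395296, 0.440828, 0.526534],
    "alpha=64, custom": [0.502014, 0.445293, 0.443863, 0.482433, 0.553562],
}

for name, losses in runs.items():
    best_epoch, last_epoch = simulate_early_stopping(losses, patience=2)
    print(f"{name}: best epoch {best_epoch}, stopped after epoch {last_epoch}")
```

For all four runs the best epoch is 3 and training stops after epoch 5, which matches the 5 logged epochs out of `MAX_EPOCHS = 10` with `EARLY_STOPPING_PATIENCE = 2`.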

#### Prepare parsing and evaluation functions

In [12]:
CATEGORIES = ["humidity", "temperature", "weather", "wind_direction", "wind_speed"]
PARSE_PATTERN = re.compile(r'<json>(.*?)</json>', re.DOTALL)


def parse_output(generation, pattern):
    # Check if anything is generated
    responses = pattern.findall(generation)
    if not responses:
        return 1 # No response

    # Check if anything is JSON
    json_parsed = []
    for response in responses:
        try:
            json_parsed.append(json.loads(response.strip()))

        except json.JSONDecodeError:
            pass # Skip responses that aren't valid JSON

    if not json_parsed:
        return 2 # No response in JSON format

    # Check if what's left has correct keys
    for parsed in json_parsed:
        if isinstance(parsed, dict) and all(k in CATEGORIES for k in parsed):
            return parsed

    return 3 # JSON, but with hallucinated keys


def evaluate(y_true, y_pred):

    scores = {cat: [] for cat in CATEGORIES}
    for ex_true, ex_pred in zip(y_true, y_pred):
        for cat, val in ex_true.items():
            if isinstance(ex_pred, dict) and ex_pred.get(cat, "") == val:
                scores[cat].append(1)

            else: # Simplified evaluation (doesn't take into account partial matches or missing keys)
                scores[cat].append(0)

    return scores


def run_evaluation(model_path, model_prefix, evaluation_df, parser_pattern):

    model, tokenizer = get_model_and_tokenizer(model_path)
    pipe = transformers.pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        device_map="auto",
        max_length=200 # cap on total length (prompt + generation) as a countermeasure against over-generation
    )

    evaluation_df[f"{model_prefix}_raw_results"] = evaluation_df[f"{model_prefix}_prompt"].map(lambda x: pipe(x)[0]["generated_text"])
    evaluation_df[f"{model_prefix}_parsed_results"] = evaluation_df[f"{model_prefix}_raw_results"].map(lambda x: parse_output(x, parser_pattern))

    scores = evaluate(evaluation_df["weather_conditions"].tolist(), evaluation_df[f"{model_prefix}_parsed_results"].tolist())
    for cat, score in scores.items():
        print(f"Category: {cat}, Accuracy: {np.mean(score)}")


    del model
    del tokenizer
    del pipe
    gc.collect()

    return scores
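
To illustrate the parse-then-compare flow without loading a model, here is a small standalone sketch that runs on a hand-written generation string. `extract_first_json` is a hypothetical, simplified stand-in for `parse_output`: it returns `None` instead of the numeric error codes and does not check the keys. The sample values are taken from the first validation example shown earlier.

```python
import json
import re

PATTERN = re.compile(r"<json>(.*?)</json>", re.DOTALL)


def extract_first_json(generation):
    """Return the first <json>...</json> payload that parses to a dict, else None."""
    for payload in PATTERN.findall(generation):
        try:
            parsed = json.loads(payload.strip())
        except json.JSONDecodeError:
            continue  # not valid JSON, try the next tagged span
        if isinstance(parsed, dict):
            return parsed
    return None


generation = """### Response:
<json>
{"humidity": "51%", "temperature": "-41 degrees", "weather": null, "wind_direction": "East", "wind_speed": "8 km/h"}
</json>"""

expected = {
    "humidity": "51%",
    "temperature": "-41 degrees",
    "weather": None,
    "wind_direction": "East",
    "wind_speed": "8 km/h",
}

predicted = extract_first_json(generation)
# Exact-match comparison per category, mirroring the simplified scoring above.
matches = {key: predicted.get(key, "") == value for key, value in expected.items()}
print(matches)
```

Note that JSON `null` is parsed to Python `None`, so a missing `weather` value in the ground truth still counts as an exact match when the model emits `null`.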

#### Evaluate best models

In [13]:
BEST_INSTRUCT_PATH = "alpha=64_prompt=instruction_prompt_formatting_func"
BEST_CUSTOM_PATH = "alpha=16_prompt=custom_prompt_formatting_func"

In [14]:
val_df = validation_dataset.to_pandas()

val_df["custom_prompt"] = val_df["weather_description"].map(custom_prompt_inputs)
val_df["instruct_prompt"] = val_df["weather_description"].map(instruction_prompt_inputs)

Instruct Prompt

In [15]:
instruct_scores = run_evaluation(
    model_path=BEST_INSTRUCT_PATH,
    model_prefix="instruct",
    evaluation_df=val_df,
    parser_pattern=PARSE_PATTERN
)


Category: humidity, Accuracy: 1.0
Category: temperature, Accuracy: 1.0
Category: weather, Accuracy: 1.0
Category: wind_direction, Accuracy: 1.0
Category: wind_speed, Accuracy: 1.0


In [16]:
# Examples of parsed predictions
val_df["instruct_parsed_results"].head().tolist()

[{'humidity': '51%',
  'temperature': '-41 degrees',
  'weather': None,
  'wind_direction': 'East',
  'wind_speed': '8 km/h'},
 {'humidity': '48%',
  'temperature': '-1 degrees',
  'weather': 'cloudy',
  'wind_direction': 'NW',
  'wind_speed': '12 km/h'},
 {'humidity': '74%',
  'temperature': '24 degrees',
  'weather': 'sunny',
  'wind_direction': 'East',
  'wind_speed': '20 km/h'},
 {'humidity': '88%',
  'temperature': '-15 degrees',
  'weather': 'rainy',
  'wind_direction': 'West',
  'wind_speed': '0 km/h'},
 {'humidity': '20%',
  'temperature': '-23 degrees',
  'weather': 'rainy',
  'wind_direction': 'SE',
  'wind_speed': '5 km/h'}]

Custom Prompt

In [17]:
custom_scores = run_evaluation(
    model_path=BEST_CUSTOM_PATH,
    model_prefix="custom",
    evaluation_df=val_df,
    parser_pattern=PARSE_PATTERN
)


Category: humidity, Accuracy: 1.0
Category: temperature, Accuracy: 1.0
Category: weather, Accuracy: 1.0
Category: wind_direction, Accuracy: 1.0
Category: wind_speed, Accuracy: 1.0


In [18]:
# Examples of parsed predictions
val_df["custom_parsed_results"].head().tolist()

[{'humidity': '51%',
  'temperature': '-41 degrees',
  'weather': None,
  'wind_direction': 'East',
  'wind_speed': '8 km/h'},
 {'humidity': '48%',
  'temperature': '-1 degrees',
  'weather': 'cloudy',
  'wind_direction': 'NW',
  'wind_speed': '12 km/h'},
 {'humidity': '74%',
  'temperature': '24 degrees',
  'weather': 'sunny',
  'wind_direction': 'East',
  'wind_speed': '20 km/h'},
 {'humidity': '88%',
  'temperature': '-15 degrees',
  'weather': 'rainy',
  'wind_direction': 'West',
  'wind_speed': '0 km/h'},
 {'humidity': '20%',
  'temperature': '-23 degrees',
  'weather': 'rainy',
  'wind_direction': 'SE',
  'wind_speed': '5 km/h'}]