# Final Project Check-in 4
The goal of this check in is to fully implement your desired training approach for your task and evaluate pre and post training performance on your full evaluation benchmark datasets. For RAG tasks, this will involve implementing your full RAG pipeline and fully evaluating your pipeline on your RAG-specific benchmarks. This notebook will guide you through the necessary steps. 

## Step 1: Choose Your Training/RAG Approach (15 points)
In a markdown cell below, state which training/RAG method you plan to implement for your task. In one-two paragraphs, summarize why you chose that training/RAG approach, citing both things you have learned in class and your empirical results from implementing different training/RAG approaches for your task in the past two homeworks or your own experiments (in the case of RAG). Also describe any drawbacks you anticipate from the training/RAG approach you chose and why the advantages you anticipate outweigh those drawbacks. 

I plan to implement the PEFT method of LoRA for my training task. Specifically I plan to:

1. Load the base model: `Qwen/Qwen3-4B-Instruct-2507`
2. Load the dataset that I created previously (live forecast data/human forecast/instruction pairs)
3. Apply a chat template
4. Benchmark the model using BERTScore/ROUGE-L/BLEU
5. Train the model
6. Save the model and compute the new performance metrics



In the past two homeworks we implemented trainers that fine-tuned a base model toward specific outcomes. My task has no established benchmark dataset, so there is no ready-made quantitative metric for tracking how the model learns. The best metric I could find was BERTScore, which measures the semantic similarity between two texts by comparing their contextual token embeddings. I can score the model's outputs against the references in the validation split and check whether the BERTScore F1 increases from the pre-training to the post-training version of the model.

I have chosen to implement LoRA because I do not need to change the model's behavior so much that full fine-tuning becomes necessary. With the LoRA PEFT method I can approach the quality of full fine-tuning while, hopefully, preserving the model's other capabilities and reducing training time. The biggest downside of this approach is that training will still take a long time, as I plan to let it run for at least 5 full epochs on the full training dataset. This means the code will need to run while I am not actively monitoring its progress. The advantage is clear: I get a model fine-tuned for my specific task that should perform better than the simple in-context learning from previous check-ins.

The performance metrics will be more difficult to implement, so I plan to use the `lm_eval` library to generate the results after tuning and then directly compare the generated text with the reference text to compute the BERTScore/ROUGE-L/BLEU scores.
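
To make the overlap metrics concrete, here is a minimal sketch of ROUGE-L computed from the longest common subsequence (LCS) of whitespace tokens. It is a simplification for illustration (lowercasing only, no stemming or special tokenization) and is not the implementation used by the `evaluate` library; the function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                # Extend the best subsequence that ended before both tokens
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise carry over the better of the two neighbours
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction, reference):
    """Simplified ROUGE-L F1 on lowercased whitespace tokens."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return 0.0
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Because the score depends on shared word sequences, a forecast that is semantically right but phrased differently from the human reference can still score low, which is why it is paired with BERTScore here.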


## Step 2: Benchmark Your Model (20 points)
As you have done in your past two homeworks, use the `lm_eval` package or custom code to evaluate your model's pre-training/pre-RAG performance on your 3 benchmark tasks and testing split from your training/RAG data (use the same testing split that you used in the homeworks for weeks 9 and 10), this time on the full datasets without setting a limit (5 points). For RAG tasks, this will involve implementing the benchmark prompts without any retrieval of relevant documents (the model should perform very poorly, showing a need for RAG). Make sure to log samples as we have done in the past and print the results in a code cell below (5 points). For open-ended generation tasks, you can use a slurm script to benchmark your model since they may take a long time to run. If you go this route, make sure to save your results and model responses as separate json files and load and display them in a code cell below. 

With either approach, make sure to print 2 model responses (10 points)

## Load the Data
 * I saved the training dataset as a `.csv` previously and can access it here.

In [1]:
import os 
os.environ["HF_HOME"] = "/scratch/ezq9qu/models/cache"
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch
import pandas as pd
import numpy as np
from datasets import load_dataset, Dataset, DatasetDict
from peft import get_peft_model, LoraConfig
from tqdm import tqdm
from lm_eval import evaluator
import evaluate
from sklearn.model_selection import train_test_split

In [2]:

all_data = pd.read_csv("training_data(1).csv")

# Copy the two columns so later edits do not act on a slice of all_data
training_data = all_data[["nws_forecast", "human_forecast"]].copy()
training_data = training_data.rename(columns={"nws_forecast": "prompt_text", "human_forecast": "Response"})


instruction_text = """
Output a human-readable surf-forecast similar to that of a veteran surf-obsever. The response should take into account the winds, sea-state, and wave period. The final output should be a few short sentences, with some surfing lingo and flair. The data is as follows:
"""

training_data["Instruct"] = "Q: " + instruction_text + training_data["prompt_text"]+" Let's think step by step\nA: "


In [3]:
final_data = training_data

In [4]:
df_train, df_temp = train_test_split(final_data, train_size=0.8, random_state=126)
df_val, df_test = train_test_split(df_temp, train_size=0.5, random_state=126)

In [5]:
dataset = DatasetDict({
    "train":Dataset.from_pandas(df_train),
    "validation":Dataset.from_pandas(df_val),
    "test": Dataset.from_pandas(df_test)
})

There is now an 80/10/10 train/validation/test split on the dataset.

In [8]:
from huggingface_hub import login

# Option 1: Interactive login (recommended)
login() 


In [21]:
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-it", padding_side = 'left')
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-2b-it",
    torch_dtype="auto",
    device_map="auto"
)


In [22]:
num_virtual_tokens = 10
num_epochs = 5
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id

Next, I generate a response for each example prompt and compare it with the reference response to produce the baseline metrics. A standard `lm_eval` task cannot be used here because there is no built-in task for my specific use case.

In [23]:
train = dataset["train"].map(lambda samples: tokenizer(samples['Instruct']), batched = True)
val = dataset["validation"].map(lambda samples: tokenizer(samples['Instruct']), batched = True)

Map:   0%|          | 0/716 [00:00<?, ? examples/s]

Map:   0%|          | 0/89 [00:00<?, ? examples/s]
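
The cell above tokenizes only the `Instruct` column, so with a causal language-modeling collator the labels are copies of the prompt tokens and the text in `Response` does not contribute to the training loss. One possible adjustment, sketched here as an assumption rather than the approach used in this notebook, is to tokenize the prompt and the reference forecast together. The helper names and the `max_length` value are hypothetical.

```python
def build_training_text(batch, eos_token):
    """Join each prompt with its reference forecast and an end-of-sequence marker."""
    return [
        prompt + response + eos_token
        for prompt, response in zip(batch["Instruct"], batch["Response"])
    ]


def tokenize_prompt_and_response(batch, tokenizer, max_length=512):
    """Tokenize prompt + response so the causal LM loss covers the forecast text."""
    texts = build_training_text(batch, tokenizer.eos_token)
    # max_length=512 is an illustrative cap, not a value taken from the notebook
    return tokenizer(texts, truncation=True, max_length=max_length)
```

It could be applied with `dataset["train"].map(lambda b: tokenize_prompt_and_response(b, tokenizer), batched=True)`. The prompts passed to the generation pipeline would stay unchanged, since they should still end at `A: `.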

In [24]:
train[0]

{'prompt_text': 'Wind: SW winds 5 kt, Seas: 2 ft, Wave Detail: E 1 ft at 4 seconds and S 1 ft at 4 seconds.',
 'Response': 'Hey everyone, there’s no waves out back behind the shop right now. It looks like a lake out there. Buoys are reading 2.3 ft @ 3.8 seconds, with winds blowing 14 mph SSW. Clouds are still lingering over the beach, but besides that it’s a fairly nice day out. The waves that are breaking are […]',
 'Instruct': "Q: \nOutput a human-readable surf-forecast similar to that of a veteran surf-obsever. The response should take into account the winds, sea-state, and wave period. The final output should be a few short sentences, with some surfing lingo and flair. The data is as follows:\nWind: SW winds 5 kt, Seas: 2 ft, Wave Detail: E 1 ft at 4 seconds and S 1 ft at 4 seconds. Let's think step by step\nA: ",
 '__index_level_0__': 731,
 'input_ids': [2,
  235368,
  235292,
  235248,
  108,
  6140,
  476,
  3515,
  235290,
  78823,
  23238,
  235290,
  82773,
  3968,
  577,
  6

In [25]:
text_gen = pipeline(
    "text-generation",
    model = model,
    tokenizer = tokenizer,
    dtype = torch.bfloat16,
    device_map = "auto",
    do_sample = False
)

Device set to use cuda:0
The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


In [26]:
outputs = text_gen(
    val["Instruct"],
    batch_size = 16
)

In [27]:
predictions = []
for output in outputs:
    full_text = output[0]['generated_text']
    text = full_text.rsplit("A:",1)[-1]
    predictions.append(text.strip())
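
Splitting on the last `A:` can discard part of a response if the model itself writes `A:` somewhere in its answer. A minimal alternative, assuming the pipeline echoes the prompt at the start of `generated_text` (its default behaviour for plain-text inputs), is to slice the prompt off by length. `strip_prompt` is a hypothetical helper name.

```python
def strip_prompt(prompt, generated_text):
    """Return only the continuation when the generated text starts with the prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    # Fall back to the notebook's approach if the prompt was not echoed verbatim
    return generated_text.rsplit("A:", 1)[-1].strip()
```

Usage would look like `predictions = [strip_prompt(p, o[0]["generated_text"]) for p, o in zip(val["Instruct"], outputs)]`. Passing `return_full_text=False` when calling the text-generation pipeline should achieve a similar effect without post-processing.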

Now we can look at the predictions and the expected results prior to fine-tuning. The difference is going to be rather large.

In [33]:
predictions[0]

'1. Start with the wind: NE 5 kt is light and offshore, which is great — it helps keep the waves clean and prevents wind chop.  \n2. Look at the sea state: 2 ft is a modest swell, not massive, but manageable.  \n3. Analyze the wave detail: SE 2 ft at 8 seconds — that’s a long-period swell, which means it’ll ride smooth and carve well, perfect for experienced surfers. The E 1 ft at 5 seconds is a shorter, more choppy wave, likely from local wind or swell, not ideal for long rides.  \n4. Combine the elements: The long-period wave from the SE gives you a solid, predictable ride with good hold and shape. The offshore wind keeps the surface clean, and the wave period is ideal for carving.  \n5. Final output:  \n"Good clean conditions out there — offshore NE wind keeps the water smooth, and the SE swell at 8 seconds delivers long, powerful, and rideable waves. Perfect for carving and catching the cut. Watch out for the shorter, choppy E swell though — not ideal for long rides. Go out with a 

In [28]:
references = val["Response"]

In [35]:
references[0]

'Hey guys! There is a little longboard wave out back, but it is pretty calm. The ocean surface is clean, knee high, with barely any wind. The tide is going out, with low tide at 6:30pm. If you have time this evening, try to paddle out! Keep an eye on the cam and check back […]'

We can now calculate some metrics

In [29]:
rouge_metric = evaluate.load('rouge')
bert_metric = evaluate.load('bertscore')
bleu_metric = evaluate.load('bleu')

In [30]:
rouge_scores = rouge_metric.compute(predictions=predictions,references=references)
bert_scores = bert_metric.compute(predictions=predictions,references=references, lang= 'en')
bleu_score = bleu_metric.compute(predictions=predictions,references=references)

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [31]:
def show_results(rouge_score, bert_score, bleu_metric):
    """Print the first prediction/reference pair, then the aggregate metric scores."""
    # predictions and references are the notebook-level lists built above
    print(predictions[0])
    print("\n")
    print(references[0])


    print("\n--- ROUGE Scores ---")
    print(f"  ROUGE-1: {rouge_score['rouge1']:.4f}")
    print(f"  ROUGE-2: {rouge_score['rouge2']:.4f}")
    print(f"  **ROUGE-L: {rouge_score['rougeL']:.4f}**")

    print("\n--- BLEU Score ---")
    # bleu_metric holds the computed BLEU result dict, despite its name
    print(f"  **BLEU: {bleu_metric['bleu']:.4f}**")

    print("\n--- BERTScore ---")
    avg_f1 = np.mean(bert_score['f1'])
    print(f"  **Average F1: {avg_f1:.4f}**")



In [32]:
show_results(rouge_score=rouge_scores,bert_score=bert_scores,bleu_metric=bleu_score)

**Step 1: Analyze the wind:**

* NE winds at 5 knots are a bit of a mixed bag. They'll be pushing the waves around, but not too much to make things gnarly.

**Step 2: Assess the sea state:**

* 2 feet of swell is a solid base for some fun. It's not huge, but it'll provide some decent size and shape.

**Step 3: Examine the wave detail:**

* The wave set-up is a bit of a puzzle. We've got a solid 2-foot set at 8 seconds, which is a good indicator of some fun, clean waves. But, there's also a smaller set at 5 seconds, which could mean some choppy sections.

**Step 4: Combine the information:**

* Overall, it looks like a decent day for a solid session. The 2-foot waves with a good period will be fun, but keep an eye out for the choppy sections. 

**Final Output:**

"Looks like a solid day for some beach breaks, with a mix of clean 2-footers and some choppy sections.  Keep an eye out for the 5


Hey guys! There is a little longboard wave out back, but it is pretty calm. The ocean surface i

This shows that the model currently falls back on its pre-trained style: it walks through the data step by step in generic language without using any real surfer lingo. This is evidenced by the low ROUGE and BLEU scores. However, the high BERTScore shows that the model does fairly well semantically, so the starting point is reasonable. The goal will be to raise the ROUGE and BLEU scores (by bringing in more surfer lingo) while maintaining the high BERTScore.

## Step 3: Train Your Model or Implement Your RAG Pipeline (25 points)
For finetuning tasks:
As you have done in the past two homeworks, prepare your data for training and train your model using the HuggingFace `trainer` (making sure to do a train/eval split of your training data (use the same train/eval split you used in your homeworks for weeks 9 and 10) and log metrics during training), loading and saving the best model at the end (10 points). Try at least three different hyperparameter combinations to find values that work well for your task (10 points). You can show results for each hyperparameter combination, or you can just show results for the best combination. In a markdown cell below your results in one paragraph, describe the values you tried, why you chose them, and which ones worked best (5 points).

For RAG tasks: 
Pull in your custom dataset and set up your RAG pipeline, including tokenizing your dataset, embedding it, storing the embeddings, and setting up the retrieval module based on a similarity metric (10 points). Try at least 3 different combinations of embedding and retrieval metrics to find what pipeline works best for your task, comparing performance for the different combinations based on performance on your manually-constructed test prompts for your task (10 points). In a markdown cell below your results in one paragraph, describe the combinations you tried, why you chose them, and which ones worked best (5 points). 

In [40]:
LORA_R = 64 
LORA_ALPHA = 64
LORA_DROPOUT = .05
lora_config = LoraConfig(
    r = LORA_R, #the lower dimension of the low-rank matrices
    lora_alpha = LORA_ALPHA, #scaling factor for the low-rank update
    lora_dropout = LORA_DROPOUT, #dropout factor to prevent overfitting
    bias = "none",
    task_type = "CAUSAL_LM", #set language modeling as task type
    target_modules = ["q_proj", "v_proj"], #add LoRA modules to every query and value matrix in the attention layers
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

trainable params: 23,592,960 || all params: 4,046,061,056 || trainable%: 0.5831
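
The trainable-parameter count can be checked by hand. Each adapted linear layer with input size `d_in` and output size `d_out` gains `r * (d_in + d_out)` parameters from its two low-rank matrices. The sketch below assumes the Qwen3-4B shapes (36 layers, hidden size 2560, 32 query heads and 8 key/value heads of dimension 128); these values are not stated in the notebook and are assumptions based on the model's published configuration.

```python
def lora_params(d_in, d_out, r):
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


# Assumed Qwen3-4B shapes (not printed anywhere in this notebook)
NUM_LAYERS = 36
HIDDEN = 2560
HEAD_DIM = 128
Q_OUT = 32 * HEAD_DIM  # q_proj output size
V_OUT = 8 * HEAD_DIM   # v_proj output size
LORA_R = 64

per_layer = lora_params(HIDDEN, Q_OUT, LORA_R) + lora_params(HIDDEN, V_OUT, LORA_R)
total_trainable = NUM_LAYERS * per_layer
print(f"{total_trainable:,}")  # 23,592,960
```

Under these assumptions the result matches the `trainable params` figure printed above.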


In [68]:
from transformers import TrainingArguments

def create_training_arguments(path, learning_rate = 0.000001, epochs = 5, eval_steps = 100):
    training_args = TrainingArguments(
        output_dir = path, #specify path for trained model weights
        auto_find_batch_size = True, #automatically find batch size
        learning_rate = learning_rate,
        num_train_epochs = epochs, 
        logging_steps = eval_steps, #this is how often we log training results
        eval_strategy = "steps", #evaluate every `eval_steps` steps
        eval_steps = eval_steps,
        save_steps = eval_steps,
        load_best_model_at_end = True,
    )
    return training_args

In [69]:
working_dir = "/scratch/ezq9qu"
output_directory = os.path.join(working_dir, "trained_lora_model_surfer_one")
os.makedirs(output_directory, exist_ok=True)  # create the folder if it does not exist yet

In [70]:
training_args = create_training_arguments(output_directory)

In [71]:
from transformers import Trainer, DataCollatorForLanguageModeling
def create_trainer(model, training_args, train_dataset, eval_dataset):
    trainer = Trainer(
        model = model,
        args = training_args,
        train_dataset = train_dataset,
        eval_dataset = eval_dataset,
        data_collator = DataCollatorForLanguageModeling(tokenizer,
                                                        mlm= False),
    )
    return trainer

In [72]:
trainer = create_trainer(model, training_args, train, val)



In [73]:
trainer.train()

| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 100 | 0.2835 | 0.27654 |
| 200 | 0.2795 | 0.274848 |
| 300 | 0.2811 | 0.274187 |
| 400 | 0.2782 | 0.273797 |


TrainOutput(global_step=450, training_loss=0.28052342096964517, metrics={'train_runtime': 204.1422, 'train_samples_per_second': 17.537, 'train_steps_per_second': 2.204, 'total_flos': 1.0844837372276736e+16, 'train_loss': 0.28052342096964517, 'epoch': 5.0})
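
The step count in `TrainOutput` can be reproduced from the dataset size. This sketch assumes an effective batch size of 8 (the `TrainingArguments` default that `auto_find_batch_size` starts from) with no gradient accumulation; the notebook does not print the batch size, so treat it as an assumption.

```python
import math

train_examples = 716  # from the Map progress output for the training split
epochs = 5
batch_size = 8        # assumption: default per-device batch size
eval_every = 100      # eval_steps passed to TrainingArguments

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
eval_points = list(range(eval_every, total_steps + 1, eval_every))

print(steps_per_epoch, total_steps, eval_points)  # 90 450 [100, 200, 300, 400]
```

That gives 450 steps and four evaluation points, consistent with the logged table and `global_step=450`.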

In [67]:
trainer.model.save_pretrained(f"{output_directory}/best_model")

This is how I implemented a single training run. I then tuned the hyperparameters using a Python script and Slurm. Google Gemini helped create the Python file for sweeping through the hyperparameters.

The Python file essentially looped through different hyperparameter combinations and trained a new model for each one; the `wandb` library supports this natively through sweeps. It allowed me to save the best-performing model to a file in my `/scratch` folder. The best hyperparameters were:

* learning_rate: 0.00012929155174079129 
* lora_r: 64
* lora_alpha: 32
* lora_dropout: 0.1
* num_train_epochs: 5
* per_device_train_batch_size: 8
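
For context on `lora_r` and `lora_alpha`: in standard LoRA the low-rank update is multiplied by `lora_alpha / r` before it is added to the frozen weight, so the two values together set the strength of the update. A quick comparison of the first notebook run and the sweep's best run, assuming the default (not rank-stabilized) scaling:

```python
def lora_scale(lora_alpha, r):
    """Scaling factor applied to the low-rank update in standard LoRA."""
    return lora_alpha / r


notebook_run = lora_scale(lora_alpha=64, r=64)  # 1.0
sweep_best = lora_scale(lora_alpha=32, r=64)    # 0.5
print(notebook_run, sweep_best)
```

So the best sweep configuration applies the adapter update at half the strength of the notebook run, while using a learning rate roughly two orders of magnitude larger.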

In [12]:
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B-Instruct-2507", padding_side = 'left')
model = AutoModelForCausalLM.from_pretrained('/scratch/ezq9qu/wandb-sweeps/stellar-sweep-4/best_model/', device_map="auto",dtype=torch.bfloat16)

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

## Step 4: Assess Post-training Benchmark Performance (20 points)
Repeat Step 2 using your trained model or RAG pipeline. 

In [13]:
text_gen = pipeline(
    "text-generation",
    model = model,
    tokenizer = tokenizer,
    dtype = torch.bfloat16,
    device_map = "auto",
    do_sample = False
)

Device set to use cuda:0


In [14]:
outputs = text_gen(
    val["Instruct"],
    batch_size = 16
)

In [16]:
predictions = []
for output in outputs:
    full_text = output[0]['generated_text']
    text = full_text.rsplit("A:",1)[-1]
    predictions.append(text.strip())

In [20]:
references = val["Response"]

In [17]:
rouge_metric = evaluate.load('rouge')
bert_metric = evaluate.load('bertscore')
bleu_metric = evaluate.load('bleu')

In [23]:
rouge_scores = rouge_metric.compute(predictions=predictions,references=references)
bert_scores = bert_metric.compute(predictions=predictions,references=references, lang= 'en')
bleu_score = bleu_metric.compute(predictions=predictions,references=references)

In [24]:
show_results(rouge_score=rouge_scores,bert_score=bert_scores,bleu_metric=bleu_score)

2 ft SE waves at 8 seconds, with a bit of E 1 ft at 5 seconds. Light winds and seas, so it could be a bit of surf. 2 ft SE at 8 seconds, with a bit of E 1 ft at 5 seconds. Light winds and seas, so it could be a bit of surf. 2 ft SE at 8 seconds, with a bit of E 1 ft at 5 seconds. Light winds and seas, so it could be a bit


Hey guys! There is a little longboard wave out back, but it is pretty calm. The ocean surface is clean, knee high, with barely any wind. The tide is going out, with low tide at 6:30pm. If you have time this evening, try to paddle out! Keep an eye on the cam and check back […]

--- ROUGE Scores ---
  ROUGE-1: 0.1477
  ROUGE-2: 0.0162
  **ROUGE-L: 0.1069**

--- BLEU Score ---
  **BLEU: 0.0046**

--- BERTScore ---
  **Average F1: 0.8132**


## Step 5: Interpretation of Results (20 points)

For finetuning tasks:
In one-two paragraphs, summarize the results from training your model, noting how the outputs improved post training, how performance on the benchmarks and testing split changed post training, and whether and how the improvements and drawbacks from training you noticed empirically matched those you anticipated in Step 1 above. 

For RAG tasks:
In one-two paragraphs, summarize the results from implementing RAG with your model, noting how the outputs improved after implementing the RAG pipeline, how performance on the benchmarks and testing split changed after implementing RAG, and whether and how the improvements and drawbacks from RAG you noticed empirically matched those you anticipated in Step 1 above. 

My training did not produce the desired results. The ROUGE and BLEU scores did not increase, and the BERTScore went down slightly. This does not get me what I need, so to get better results going forward I have to add some way to incentivize the correct output formatting. One option is to combine in-context learning with fine-tuning: first use in-context learning to steer the output style, then follow with fine-tuning (possibly full fine-tuning). By implementing the training this way I hope to get the model to learn the correct output style first, after which fine-tuning can add more specifics. I should be able to accomplish the first stage with few-shot prompting.
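
The few-shot idea could be sketched as follows: prepend a handful of training pairs to each prompt so the model sees the target style before it generates. The helper name, the number of shots, and the `Q:`/`A:` layout (borrowed from the `Instruct` column) are assumptions for illustration.

```python
def build_few_shot_prompt(examples, instruction, forecast_data):
    """Build a prompt with (data, human forecast) pairs shown before the new query."""
    parts = [instruction.strip(), ""]
    for data, forecast in examples:
        parts.append(f"Q: {data}")
        parts.append(f"A: {forecast}")
        parts.append("")  # blank line between examples
    parts.append(f"Q: {forecast_data}")
    parts.append("A: ")
    return "\n".join(parts)


# Hypothetical shots; in practice these would be rows from the training split
shots = [("Wind: SW 5 kt", "Flat today.")]
prompt = build_few_shot_prompt(shots, "Write a surf forecast.", "Wind: NE 5 kt")
print(prompt)
```

Drawing the shots only from the training split keeps the validation and test references out of the prompt.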

## Additional Code 

Here is the code I wrote for the Python training script and the Slurm job.

In [None]:

#!/bin/bash
#SBATCH -A ds5002 ##Define allocation
#SBATCH --partition=gpu ##Define GPU partition
#SBATCH --gres=gpu:2 ##Specify desired number of GPUs
#SBATCH --constraint=a6000 ##Optional: specify type of GPU
#SBATCH --ntasks=1 ##Specify number of tasks 

#SBATCH --cpus-per-task=2 ##Specify number of CPUs per task
#SBATCH --mem=10G ##Specify amount of CPU memory needed
#SBATCH -t 1-00:00:00 ##Specify time constraint in Days-Hours:Minutes:Seconds format
#SBATCH -J wandb-sweep ##Name the job
#SBATCH -o wandb-sweep-%A.out ##Provide a name for the .out file- this is where any output printed to console will be stored after the job finishes
#SBATCH -e wandb-sweep-%A.err ##Provide a name for the .err file- this is where any error messages and output will be printed during the job
#SBATCH --mail-user=ezq9qu@virginia.edu
#SBATCH --mail-type=ALL

module purge ##Purge any existing modules on the compute resources
module load miniforge ##Load miniforge for python
source activate /scratch/ezq9qu/llm_course ##Load your virtual environment
python sweep.py ##Run your .py file



In [None]:

import os
import pandas as pd
import numpy as np
import torch
import wandb  
from datasets import load_dataset, Dataset, DatasetDict
from peft import get_peft_model, LoraConfig
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling
)
from sklearn.model_selection import train_test_split
import evaluate



os.environ["HF_HOME"] = "/scratch/ezq9qu/models/cache" 
os.environ["WANDB_PROJECT"] = "surf-forecast-lora-sweep"
WORKING_DIR = "/scratch/ezq9qu"


def load_surf_dataset():
    """
    Loads and preprocesses the surf forecast dataset.
    """
    all_data = pd.read_csv("training_data(1).csv")
    training_data = all_data[["nws_forecast", "human_forecast"]]

    training_data = training_data.rename(columns={"nws_forecast": "prompt_text", "human_forecast": "Response"})

    instruction_text = """
Output a human-readable surf-forecast similar to that of a veteran surf-obsever. The response should take into account the winds, sea-state, and wave period. The final output should be a few short sentences, with some surfing lingo and flair. The data is as follows:
"""
    # Create the "Instruct" column
    training_data = training_data.assign(
        Instruct="Q: " + instruction_text + training_data["prompt_text"] + " Let's think step by step\nA: "
    )
    
    final_data = training_data
    
    df_train, df_temp = train_test_split(final_data, train_size=0.8, random_state=126)
    df_val, df_test = train_test_split(df_temp, train_size=0.5, random_state=126)
    
    dataset = DatasetDict({
        "train": Dataset.from_pandas(df_train),
        "validation": Dataset.from_pandas(df_val),
        "test": Dataset.from_pandas(df_test)
    })
    return dataset

# --- Tokenization Function ---
def tokenize_dataset(dataset, tokenizer):
    """
    Applies tokenization to the dataset.
    """
    def tokenize_function(samples):
        # Tokenize the 'Instruct' field
        return tokenizer(samples['Instruct'], truncation=True, padding=False)

    train_dataset = dataset["train"].map(tokenize_function, batched=True)
    val_dataset = dataset["validation"].map(tokenize_function, batched=True)
    
    return train_dataset, val_dataset

def train():
    """
    This function is called by the wandb agent for each trial.
    """
    # 1. Initialize wandb run
    # This will fetch the hyperparams for this specific run
    run = wandb.init() 
    config = wandb.config

    # --- Load Tokenizer and Datasets ---
    model_name = "Qwen/Qwen3-4B-Instruct-2507"
    
    tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side='left')
    # Set pad token
    tokenizer.pad_token = tokenizer.eos_token
    
    dataset = load_surf_dataset()
    if dataset is None:
        return # Stop if data loading failed

    train_dataset, val_dataset = tokenize_dataset(dataset, tokenizer)

    # --- Load Base Model ---
    # Must be loaded fresh for each new run
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.bfloat16, # More efficient than "auto"
        device_map="auto"
    )
    # Ensure the model's pad_token_id matches the tokenizer's pad token
    model.config.pad_token_id = tokenizer.pad_token_id

    # --- PEFT (LoRA) Configuration ---
    # Pull hyperparameters directly from wandb.config
    lora_config = LoraConfig(
        r=config.lora_r,
        lora_alpha=config.lora_alpha,
        lora_dropout=config.lora_dropout,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules=["q_proj", "v_proj"], # From your notebook
    )
    
    model = get_peft_model(model, lora_config)
    print(f"--- Run: {run.name} ---")
    model.print_trainable_parameters()

    # --- Training Arguments ---
    # Define a unique output dir for each run based on its wandb name
    output_dir = os.path.join(WORKING_DIR, "wandb-sweeps", run.name)

    training_args = TrainingArguments(
        output_dir=output_dir,
        
        # --- Hyperparameters from wandb config ---
        num_train_epochs=config.num_train_epochs,
        learning_rate=config.learning_rate,
        per_device_train_batch_size=config.per_device_train_batch_size,
        
        # --- Other Training Params ---
        per_device_eval_batch_size=config.per_device_train_batch_size * 2, # Eval can use larger batch
        gradient_accumulation_steps=2, # Accumulate gradients to simulate larger batch size
        optim="paged_adamw_8bit",    # Efficient optimizer for LoRA
        
        logging_steps=50,  # Log more frequently
        eval_strategy="steps",
        eval_steps=100,
        save_steps=100,
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss", # This is what the sweep will track
        greater_is_better=False,

        report_to="wandb",  
        
        bf16=torch.cuda.is_bf16_supported(), # Use bfloat16 if available
        fp16=False, # Mutually exclusive with bf16
    )

    # --- Trainer ---
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )

    # --- Train ---
    try:
        print(f"Starting training for run {run.name}...")
        trainer.train()
        
        # Save the best model from this run
        best_model_path = os.path.join(output_dir, "best_model")
        trainer.model.save_pretrained(best_model_path)
        print(f"Best model for run {run.name} saved to {best_model_path}")
        
    except Exception as e:
        print(f"Error during training for run {run.name}: {e}")
    
    finally:
        # 7. Finish the wandb run
        print(f"Finishing run {run.name}.")
        run.finish()

# --- Main execution to set up and run the sweep ---
if __name__ == "__main__":
    
    # 1. Define the Sweep Configuration
    # This dictionary tells wandb what hyperparameters to try
    sweep_config = {
        'method': 'bayes',  # Use Bayesian optimization
        'metric': {
            'name': 'eval/loss', # This is what Trainer logs
            'goal': 'minimize'     # We want to minimize the validation loss
        },
        'parameters': {
            'learning_rate': {
                'distribution': 'log_uniform_values',
                'min': 1e-5,  # Start from 0.00001
                'max': 5e-4   # Go up to 0.0005
            },
            'lora_r': {
                'values': [16, 32, 64] # Your notebook used 64
            },
            'lora_alpha': {
                'values': [32, 64, 128] # Often 1x or 2x of r
            },
            'lora_dropout': {
                'values': [0.05, 0.1]
            },
            'per_device_train_batch_size': {
                'values': [4, 8] # Adjust based on your GPU memory
            },
            'num_train_epochs': {
                'values': [3, 5] # Your notebook planned 5
            }
        }
    }

    # 2. Initialize the Sweep
    # This tells wandb to create the sweep project
    # You only need to run this line ONCE to create the sweep.
    try:
        sweep_id = wandb.sweep(sweep_config, project=os.environ["WANDB_PROJECT"])
        print(f"Sweep created successfully. Sweep ID: {sweep_id}")
        
        # 3. Start the agent
        # This will run 10 trials sequentially on this machine.
        # The agent will ask the wandb server for hyperparameters,
        # then call the `train` function with them.
        print("Starting wandb agent to run 10 trials...")
        wandb.agent(sweep_id, function=train, count=10)
        
        print("\n--- Sweep Finished ---")
        print(f"View all results at: https://wandb.ai/{wandb.api.default_entity}/{os.environ['WANDB_PROJECT']}/sweeps/{sweep_id}")

    except Exception as e:
        print(f"Error setting up or running sweep: {e}")
    except KeyboardInterrupt:
        print("\nSweep interrupted by user. Exiting.")
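
For a sense of scale, the sweep configuration above defines 72 discrete combinations on top of a continuous learning rate, while the agent runs 10 trials. The sketch below counts the combinations and mimics a log-uniform draw between the configured bounds using only the standard library. It illustrates the sampling idea and is not W&B's own sampler; `sample_log_uniform` is a hypothetical helper.

```python
import math
import random

# Discrete choices copied from sweep_config
discrete = {
    "lora_r": [16, 32, 64],
    "lora_alpha": [32, 64, 128],
    "lora_dropout": [0.05, 0.1],
    "per_device_train_batch_size": [4, 8],
    "num_train_epochs": [3, 5],
}
n_combinations = math.prod(len(values) for values in discrete.values())


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
lr = sample_log_uniform(1e-5, 5e-4, rng)
print(n_combinations, lr)
```

With only 10 trials the Bayesian search sees a small fraction of this space, so the reported best hyperparameters are the best among those sampled rather than a global optimum.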