<a href="https://colab.research.google.com/github/leokeyan-lab/1.Python-Week-0-2-/blob/main/Fine_Tune_Model_v1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuning Qwen2 0.5B Instruct for Text-to-SQL Generation

## Introduction

This notebook demonstrates the fine-tuning of the Qwen2 0.5B Instruct model on a synthetic Text-to-SQL dataset. We'll explore different parameter-efficient fine-tuning methods (LoRA, QLoRA, and Prefix-Tuning) and compare them with full fine-tuning. The goal is to evaluate the model's ability to generate SQL queries from natural language instructions.

## Setup and Dependencies

First, let's install the necessary libraries. We'll use Unsloth for efficient fine-tuning of the model.

In [1]:
# Check if running in Colab
import os
IN_COLAB = 'COLAB_GPU' in os.environ

if IN_COLAB:
    # Install dependencies for Colab
    !pip install -q unsloth
    !pip install -q datasets evaluate rouge_score nltk bert_score py-rouge sacrebleu gleu textstat
    !pip install -q detoxify sentence_transformers
    !pip install -q accelerate peft transformers
    !pip install -q bitsandbytes
    #!pip uninstall -y bitsandbytes
    #!pip install bitsandbytes

In [2]:
import os
import numpy as np
import pandas as pd
import torch
import random
import nltk
import textstat
import re
import matplotlib.pyplot as plt
import seaborn as sns
import datasets # Import the entire datasets module
from datasets import load_dataset # Keep load_dataset for clarity
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, PrefixTuningConfig, get_peft_model, PeftModel
from unsloth import FastLanguageModel # This import is redundant if unsloth is already imported
from tqdm.auto import tqdm
from sklearn.model_selection import train_test_split
from collections import Counter

# For evaluation metrics
import evaluate
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score
from bert_score import BERTScorer
from detoxify import Detoxify
from sentence_transformers import SentenceTransformer, util

# Download necessary NLTK data
nltk.download('punkt')
nltk.download('wordnet')

# Set random seed for reproducibility
seed = 42
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)

# Create output directory if it doesn't exist
os.makedirs('outputs', exist_ok=True)
os.makedirs('outputs/full_fine_tuning', exist_ok=True)
os.makedirs('outputs/lora_fine_tuning', exist_ok=True)
os.makedirs('outputs/qlora_fine_tuning', exist_ok=True)
os.makedirs('outputs/prefix_tuning', exist_ok=True)


Please restructure your imports with 'import unsloth' at the top of your file.
  from unsloth import FastLanguageModel # This import is redundant if unsloth is already imported


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


## Data Preparation

We'll load the Synthetic Text-to-SQL dataset and preprocess it for our fine-tuning task.

In [3]:
# Load the dataset
dataset = load_dataset("gretelai/synthetic_text_to_sql")
print(f"Dataset loaded with {len(dataset['train'])} examples")

# Display a few examples to understand the structure
print("\nSample examples:")
for i in range(5):
    print(f"\nExample {i+1}:")
    print(f"Question: {dataset['train'][i]['sql_prompt']}")
    print(f"SQL: {dataset['train'][i]['sql']}")

Dataset loaded with 100000 examples

Sample examples:

Example 1:
Question: What is the total volume of timber sold by each salesperson, sorted by salesperson?
SQL: SELECT salesperson_id, name, SUM(volume) as total_volume FROM timber_sales JOIN salesperson ON timber_sales.salesperson_id = salesperson.salesperson_id GROUP BY salesperson_id, name ORDER BY total_volume DESC;

Example 2:
Question: List all the unique equipment types and their corresponding total maintenance frequency from the equipment_maintenance table.
SQL: SELECT equipment_type, SUM(maintenance_frequency) AS total_maintenance_frequency FROM equipment_maintenance GROUP BY equipment_type;

Example 3:
Question: How many marine species are found in the Southern Ocean?
SQL: SELECT COUNT(*) FROM marine_species WHERE location = 'Southern Ocean';

Example 4:
Question: What is the total trade value and average price for each trader and stock in the trade_history table?
SQL: SELECT trader_id, stock, SUM(price * quantity) as total

In [4]:
# Preprocess the dataset
def preprocess_dataset(examples):
    # Format the input and output for instruction fine-tuning
    formatted_inputs = []
    formatted_outputs = []

    for question, query in zip(examples["sql_prompt"], examples["sql"]):
        # Clean and format the input
        input_text = f"### Instruction: Convert the following text to SQL query.\n\n### Input: {question}\n\n### Response:"

        # Clean and format the output
        output_text = query

        formatted_inputs.append(input_text)
        formatted_outputs.append(output_text)

    return {"input": formatted_inputs, "output": formatted_outputs}

# Apply preprocessing
processed_dataset = dataset.map(
    preprocess_dataset,
    batched=True,
    remove_columns=dataset["train"].column_names
)

# Convert the processed dataset to a pandas DataFrame for splitting
processed_df = processed_dataset["train"].to_pandas()

# Split the dataset into train, validation, and test sets
train_val_df, test_df = train_test_split(
    processed_df,
    test_size=0.1,
    random_state=seed
)

train_df, val_df = train_test_split(
    train_val_df,
    test_size=0.1,
    random_state=seed
)

# Convert the DataFrames back to Dataset objects if needed for later steps
train_dataset = datasets.Dataset.from_pandas(train_df)
val_dataset = datasets.Dataset.from_pandas(val_df)
test_dataset = datasets.Dataset.from_pandas(test_df)


print(f"Train set: {len(train_dataset)} examples")
print(f"Validation set: {len(val_dataset)} examples")
print(f"Test set: {len(test_dataset)} examples")

# Display a few processed examples
print("\nSample processed examples:")
for i in range(2):
    print(f"\nExample {i+1}:")
    print(f"Input: {train_dataset[i]['input']}")
    print(f"Output: {train_dataset[i]['output']}")

Map:   0%|          | 0/100000 [00:00<?, ? examples/s]

Map:   0%|          | 0/5851 [00:00<?, ? examples/s]

Train set: 81000 examples
Validation set: 9000 examples
Test set: 10000 examples

Sample processed examples:

Example 1:
Input: ### Instruction: Convert the following text to SQL query.

### Input: What is the total climate finance for Caribbean countries with expenditure higher than 750,000 in 2021?

### Response:
Output: SELECT SUM(amount) FROM climate_finance_caribbean WHERE year = 2021 AND country IN ('Bahamas', 'Jamaica', 'Trinidad and Tobago');

Example 2:
Input: ### Instruction: Convert the following text to SQL query.

### Input: Identify the number of mental health parity violations by region?

### Response:
Output: SELECT region, SUM(violation_count) as total_violations FROM mental_health_parity_violations GROUP BY region;


## Model Fine-Tuning

We'll set up the Qwen2 0.5B Instruct model and fine-tune it using different methods.

### 1. Full Fine-Tuning

In [5]:
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer, BitsAndBytesConfig, DataCollatorForLanguageModeling
import torch
from unsloth import FastLanguageModel # Keep this import as other cells use FastLanguageModel
from datasets import Dataset # Import Dataset

# Load the model and tokenizer for full fine-tuning using FastLanguageModel
model_name = "Qwen/Qwen2-0.5B-Instruct"

# Determine dtype based on BF16 support
if torch.cuda.is_bf16_supported():
    dtype = torch.bfloat16
else:
    dtype = torch.float16 # Fallback to FP16 if BF16 is not supported

# Initialize the model and tokenizer using FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=2048,
    dtype=dtype, # Use determined dtype
    load_in_4bit=False,  # Full fine-tuning - do not quantize
)

# Explicitly move model to CUDA device to ensure all parameters are on GPU
model.to('cuda')

# Convert the processed dataset to a Hugging Face Dataset object
# Assuming processed_df was created and split into train_df, val_df, and test_df
# Convert train_df to a Dataset object if it's not already
if not isinstance(train_dataset, Dataset):
    train_dataset = Dataset.from_pandas(train_df)
if not isinstance(val_dataset, Dataset):
    val_dataset = Dataset.from_pandas(val_df)


# Tokenize the dataset
def tokenize_function(examples):
    # Concatenate input and output for causal language modeling
    # The tokenizer will handle adding BOS/EOS tokens as needed
    text = [input_text + output_text + tokenizer.eos_token for input_text, output_text in zip(examples["input"], examples["output"])]
    tokenized_inputs = tokenizer(
        text,
        padding="max_length",
        truncation=True,
        max_length=model.max_seq_length,
        return_attention_mask=True,
    )
    # For causal language modeling, labels are the same as input_ids
    tokenized_inputs["labels"] = tokenized_inputs["input_ids"].copy()
    return tokenized_inputs

tokenized_train_dataset = train_dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=["input", "output", "__index_level_0__"] # Remove original columns
)
tokenized_val_dataset = val_dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=["input", "output", "__index_level_0__"] # Remove original columns
)


# Set up training arguments
training_args = TrainingArguments(
    output_dir="outputs/full_fine_tuning",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    warmup_steps=5,
    max_steps=100,
    learning_rate=2e-5,
    fp16=torch.cuda.is_bf16_supported(), # Set fp16 based on bf16 support
    bf16=torch.cuda.is_bf16_supported(), # Set bf16 based on support
    logging_steps=5,
    optim="adamw_bnb_8bit", # Use adamw_bnb_8bit optimizer
    weight_decay=0.01,
    lr_scheduler_type="linear",
    seed=seed,
    max_grad_norm=0, # Disable gradient clipping for debugging
)

# Initialize a data collator
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_val_dataset, # Add validation dataset
    tokenizer=tokenizer, # Keep tokenizer for logging and potential use in callbacks
    data_collator=data_collator,
)

# Train the model using the Trainer
trainer.train()

# Save the model
model.save_pretrained("outputs/full_fine_tuning/final_model")
tokenizer.save_pretrained("outputs/full_fine_tuning/final_model")

==((====))==  Unsloth 2025.8.8: Fast Qwen2 patching. Transformers: 4.55.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu128. CUDA: 7.5. CUDA Toolkit: 12.8. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/988M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/265 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

added_tokens.json:   0%|          | 0.00/80.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/367 [00:00<?, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

Map:   0%|          | 0/81000 [00:00<?, ? examples/s]

Map:   0%|          | 0/9000 [00:00<?, ? examples/s]

  trainer = Trainer(
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 81,000 | Num Epochs = 1 | Total steps = 100
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 357,898,112 of 494,032,768 (72.44% trained)


<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize?ref=models
wandb: Paste an API key from your profile and hit enter:

 ··········


[34m[1mwandb[0m: No netrc file found, creating one.
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mleokeyan[0m ([33mleokeyan-thinkyai[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


Step,Training Loss
5,5.4688
10,2.341
15,1.6323
20,1.4735
25,1.2775
30,1.3284
35,1.2159
40,1.2307
45,1.1276
50,1.1887


('outputs/full_fine_tuning/final_model/tokenizer_config.json',
 'outputs/full_fine_tuning/final_model/special_tokens_map.json',
 'outputs/full_fine_tuning/final_model/chat_template.jinja',
 'outputs/full_fine_tuning/final_model/vocab.json',
 'outputs/full_fine_tuning/final_model/merges.txt',
 'outputs/full_fine_tuning/final_model/added_tokens.json',
 'outputs/full_fine_tuning/final_model/tokenizer.json')

### 2. LoRA Fine-Tuning

In [6]:
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer, BitsAndBytesConfig
import torch
from unsloth import FastLanguageModel
from peft import LoraConfig, PrefixTuningConfig, get_peft_model, PeftModel # Keep these imports
from datasets import Dataset # Import Dataset

# Initialize the model for LoRA fine-tuning
model_name = "Qwen/Qwen2-0.5B-Instruct"

# Determine dtype based on BF16 support
if torch.cuda.is_bf16_supported():
    dtype = torch.bfloat16
else:
    dtype = torch.float16 # Fallback to FP16 if BF16 is not supported

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=2048,
    dtype=dtype, # Use determined dtype
    load_in_4bit=True,  # Quantize the model
)

# Set up LoRA configuration parameters
lora_rank = 16
lora_alpha = 32
lora_dropout = 0.05

# Add LoRA adapters to the model
# Correctly pass model and integer rank 'r'
model = FastLanguageModel.get_peft_model(
    model, # Pass the model instance
    r = lora_rank, # Pass the integer rank
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_alpha = lora_alpha,
    lora_dropout = lora_dropout,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = seed,
)

# Manually format and tokenize the dataset
def format_and_tokenize_dataset(examples):
    formatted_texts = []
    for input_text, output_text in zip(examples["input"], examples["output"]):
        # Combine input and output into a single string with a clear separator
        # Using the same format as in preprocess_dataset
        full_text = f"{input_text} {output_text}{tokenizer.eos_token}"
        formatted_texts.append(full_text)

    # Tokenize the formatted texts
    # Add padding and truncation
    return tokenizer(formatted_texts, padding="max_length", truncation=True, max_length=2048)

# Apply the formatting and tokenization to the training dataset
tokenized_train_dataset = train_dataset.map(
    format_and_tokenize_dataset,
    batched=True,
    remove_columns=["input", "output", "__index_level_0__"], # Remove original columns and pandas index
)

# Apply the formatting and tokenization to the validation dataset (important for evaluation)
tokenized_val_dataset = val_dataset.map(
    format_and_tokenize_dataset,
    batched=True,
    remove_columns=["input", "output", "__index_level_0__"], # Remove original columns and pandas index
)


# Set the format to torch, which is expected by the Trainer
tokenized_train_dataset.set_format("torch")
tokenized_val_dataset.set_format("torch")


# Set up training arguments
training_args = TrainingArguments(
    output_dir="outputs/lora_fine_tuning",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    warmup_steps=5,
    max_steps=100,
    learning_rate=2e-4,
    fp16=not torch.cuda.is_bf16_supported(), # Set fp16 based on bf16 support
    bf16=torch.cuda.is_bf16_supported(), # Set bf16 based on support
    logging_steps=5,
    optim="adamw_8bit",
    weight_decay=0.01,
    lr_scheduler_type="linear",
    seed=seed,
    max_grad_norm=0, # Disable gradient clipping for debugging
)

# Initialize a data collator
# DataCollatorForLanguageModeling is suitable for causal language modeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)


# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_dataset, # Pass the manually tokenized dataset
    eval_dataset=tokenized_val_dataset, # Add validation dataset
    tokenizer=tokenizer, # Keep tokenizer for logging and potential use in callbacks
    data_collator=data_collator,
)

# Train the model
trainer.train()

# Save the model
model.save_pretrained("outputs/lora_fine_tuning/final_model")
tokenizer.save_pretrained("outputs/lora_fine_tuning/final_model")

==((====))==  Unsloth 2025.8.8: Fast Qwen2 patching. Transformers: 4.55.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu128. CUDA: 7.5. CUDA Toolkit: 12.8. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/457M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/265 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

added_tokens.json:   0%|          | 0.00/80.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/367 [00:00<?, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

Unsloth: Dropout = 0 is supported for fast patching. You are using dropout = 0.05.
Unsloth will patch all other layers, except LoRA matrices, causing a performance hit.
Unsloth 2025.8.8 patched 24 layers with 0 QKV layers, 0 O layers and 0 MLP layers.


Map:   0%|          | 0/81000 [00:00<?, ? examples/s]

Map:   0%|          | 0/9000 [00:00<?, ? examples/s]

  trainer = Trainer(
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 81,000 | Num Epochs = 1 | Total steps = 100
O^O/ \_/ \    Batch size per device = 4 | Gradient accumulation steps = 2
\        /    Data Parallel GPUs = 1 | Total batch size (4 x 2 x 1) = 8
 "-____-"     Trainable parameters = 8,798,208 of 502,830,976 (1.75% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
5,2.0266
10,1.4024
15,1.1437
20,1.0919
25,1.0992
30,1.1364
35,1.0362
40,1.0467
45,1.0001
50,1.0695


('outputs/lora_fine_tuning/final_model/tokenizer_config.json',
 'outputs/lora_fine_tuning/final_model/special_tokens_map.json',
 'outputs/lora_fine_tuning/final_model/chat_template.jinja',
 'outputs/lora_fine_tuning/final_model/vocab.json',
 'outputs/lora_fine_tuning/final_model/merges.txt',
 'outputs/lora_fine_tuning/final_model/added_tokens.json',
 'outputs/lora_fine_tuning/final_model/tokenizer.json')

### 3. QLoRA Fine-Tuning

In [7]:
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer, BitsAndBytesConfig
import torch
from unsloth import FastLanguageModel
from peft import LoraConfig, PrefixTuningConfig, get_peft_model, PeftModel # Keep these imports
from datasets import Dataset # Import Dataset

# Initialize the model for QLoRA fine-tuning
model_name = "Qwen/Qwen2-0.5B-Instruct"

# Determine dtype based on BF16 support
if torch.cuda.is_bf16_supported():
    dtype = torch.bfloat16
else:
    dtype = torch.float16 # Fallback to FP16 if BF16 is not supported

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=2048,
    dtype=dtype, # Use determined dtype
    load_in_4bit=True,    # Quantize the model to 4-bit for QLoRA
)

# Set up LoRA configuration for QLoRA
# Note: FastLanguageModel.get_peft_model doesn't directly take a LoraConfig object
# We will extract the relevant parameters from the config
qlora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM", # task_type is not a direct argument for FastLanguageModel.get_peft_model
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"], # Define target modules here
)

# Add LoRA adapters to the model using parameters from qlora_config
model = FastLanguageModel.get_peft_model(
    model, # Pass the model instance
    r = qlora_config.r, # Pass the integer rank from the config
    target_modules = qlora_config.target_modules, # Pass target modules from the config
    lora_alpha = qlora_config.lora_alpha, # Pass lora_alpha from the config
    lora_dropout = qlora_config.lora_dropout, # Pass lora_dropout from the config
    bias = qlora_config.bias, # Pass bias from the config
    use_gradient_checkpointing = "unsloth", # Enable long context finetuning
    random_state = seed,
)

# Format the dataset for training using the manual tokenization function
# We will reuse the tokenized datasets created earlier for consistency with other methods
# Assuming tokenized_train_dataset and tokenized_val_dataset are defined
# If not, the user needs to run the data preprocessing cell first.

# Check if tokenized datasets exist, otherwise prompt the user to run the preprocessing cell
if 'tokenized_train_dataset' not in locals() or 'tokenized_val_dataset' not in locals():
    raise NameError("tokenized_train_dataset or tokenized_val_dataset is not defined. Please run the 'Data Preparation' cell first.")


# Set up training arguments using TrainingArguments class
training_args = TrainingArguments(
    output_dir="outputs/qlora_fine_tuning",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    warmup_steps=5,
    max_steps=100,
    learning_rate=2e-4,
    fp16=not torch.cuda.is_bf16_supported(), # Set fp16 based on bf16 support
    bf16=torch.cuda.is_bf16_supported(), # Set bf16 based on support
    logging_steps=5,
    optim="adamw_bnb_8bit", # Use adamw_bnb_8bit optimizer
    weight_decay=0.01,
    lr_scheduler_type="linear",
    seed=seed,
    max_grad_norm=0, # Disable gradient clipping for debugging
)

# Initialize a data collator
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)


# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_dataset, # Pass the tokenized training dataset
    eval_dataset=tokenized_val_dataset, # Pass the tokenized validation dataset
    tokenizer=tokenizer, # Keep tokenizer for logging and potential use in callbacks
    data_collator=data_collator,
)

# Train the model using the Trainer
trainer.train()

# Save the model
model.save_pretrained("outputs/qlora_fine_tuning/final_model")
tokenizer.save_pretrained("outputs/qlora_fine_tuning/final_model")

==((====))==  Unsloth 2025.8.8: Fast Qwen2 patching. Transformers: 4.55.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu128. CUDA: 7.5. CUDA Toolkit: 12.8. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


  trainer = Trainer(
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 81,000 | Num Epochs = 1 | Total steps = 100
O^O/ \_/ \    Batch size per device = 4 | Gradient accumulation steps = 2
\        /    Data Parallel GPUs = 1 | Total batch size (4 x 2 x 1) = 8
 "-____-"     Trainable parameters = 4,399,104 of 498,431,872 (0.88% trained)


Step,Training Loss
5,2.1017
10,1.6648
15,1.2238
20,1.1318
25,1.1329
30,1.1504
35,1.0564
40,1.0662
45,1.0228
50,1.0894


('outputs/qlora_fine_tuning/final_model/tokenizer_config.json',
 'outputs/qlora_fine_tuning/final_model/special_tokens_map.json',
 'outputs/qlora_fine_tuning/final_model/chat_template.jinja',
 'outputs/qlora_fine_tuning/final_model/vocab.json',
 'outputs/qlora_fine_tuning/final_model/merges.txt',
 'outputs/qlora_fine_tuning/final_model/added_tokens.json',
 'outputs/qlora_fine_tuning/final_model/tokenizer.json')

### 4. Prefix-Tuning

In [12]:
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer, BitsAndBytesConfig, DataCollatorForLanguageModeling
import torch
# from unsloth import FastLanguageModel # Removed FastLanguageModel import to avoid patching interference
from peft import LoraConfig, PrefixTuningConfig, get_peft_model # Keep these imports
from datasets import Dataset # Import Dataset

# Initialize the model for Prefix-Tuning
model_name = "Qwen/Qwen2-0.5B-Instruct"

# Determine dtype based on BF16 support
if torch.cuda.is_bf16_supported():
    dtype = torch.bfloat16
else:
    dtype = torch.float16 # Fallback to FP16 if BF16 is not supported

# Set up BitsAndBytesConfig for 4-bit loading
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",  # Specify the quantization type
    bnb_4bit_compute_dtype=dtype, # Use determined dtype for computation
    bnb_4bit_use_double_quant=True, # Enable double quantization
)

# Load the base model using AutoModelForCausalLM with quantization
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config, # Pass the bnb_config
    torch_dtype=dtype, # Use determined dtype
    device_map="auto", # Automatically map model to available devices
)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Set pad_token_id to eos_token_id if it's not set
if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id


# Set up Prefix-Tuning configuration parameters
num_virtual_tokens = 20
prefix_projection = True


# Define the PrefixTuningConfig
peft_config = PrefixTuningConfig(
    task_type="CAUSAL_LM",
    num_virtual_tokens=num_virtual_tokens,
    prefix_projection=prefix_projection,
)

# Add Prefix-Tuning adapters to the model using the standard get_peft_model function from peft
model = get_peft_model(model, peft_config)

# Print trainable parameters to verify Prefix-Tuning is applied
model.print_trainable_parameters()


# Manually format and tokenize the dataset
def format_and_tokenize_dataset(examples):
    formatted_texts = []
    for input_text, output_text in zip(examples["input"], examples["output"]):
        # Combine input and output into a single string with a clear separator
        # Using the same format as in preprocess_dataset
        full_text = f"{input_text} {output_text}{tokenizer.eos_token}"
        formatted_texts.append(full_text)

    # Tokenize the formatted texts
    # Add padding and truncation
    return tokenizer(formatted_texts, padding="max_length", truncation=True, max_length=1024) # Reduced max sequence length

# Apply the formatting and tokenization to the training dataset
# Assuming train_dataset and val_dataset are defined from the Data Preparation section
if 'train_dataset' not in locals() or 'val_dataset' not in locals():
     raise NameError("train_dataset or val_dataset is not defined. Please run the 'Data Preparation' cell first.")

tokenized_train_dataset = train_dataset.map(
    format_and_tokenize_dataset,
    batched=True,
    remove_columns=["input", "output", "__index_level_0__"], # Remove original columns and pandas index
)

tokenized_val_dataset = val_dataset.map(
    format_and_tokenize_dataset,
    batched=True,
    remove_columns=["input", "output", "__index_level_0__"], # Remove original columns and pandas index
)


# Set the format to torch, which is expected by the Trainer
tokenized_train_dataset.set_format("torch")
tokenized_val_dataset.set_format("torch")


# Set up training arguments using TrainingArguments class
training_args = TrainingArguments(
    output_dir="outputs/prefix_tuning",
    per_device_train_batch_size=2, # Reduced batch size
    gradient_accumulation_steps=4, # Increased accumulation steps to maintain effective batch size
    warmup_steps=5,
    max_steps=100,
    learning_rate=1e-3,
    fp16=not torch.cuda.is_bf16_supported(), # Set fp16 based on bf16 support
    bf16=torch.cuda.is_bf16_supported(), # Set bf16 based on support
    logging_steps=5,
    optim="adamw_8bit", # Use 8bit optimizer
    weight_decay=0.01,
    lr_scheduler_type="linear",
    seed=seed,
    max_grad_norm=0, # Disable gradient clipping for debugging
    # Add evaluation parameters
    eval_strategy="steps", # Set evaluation strategy to "steps"
    eval_steps=10, # Evaluate every 10 steps
    save_steps=100, # Save checkpoint every 100 steps
    load_best_model_at_end=True, # Load the best model based on evaluation metric
    metric_for_best_model="eval_loss", # Metric to monitor for best model
    greater_is_better=False, # Lower eval_loss is better
)

# Initialize a data collator
# DataCollatorForLanguageModeling is suitable for causal language modeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)


# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_dataset, # Pass the manually tokenized dataset
    eval_dataset=tokenized_val_dataset, # Add validation dataset
    tokenizer=tokenizer, # Keep tokenizer for logging and potential use in callbacks
    data_collator=data_collator,
)

# Train the model using the Trainer
trainer.train()

# Save the model (PEFT adapters only)
model.save_pretrained("outputs/prefix_tuning/final_model")
tokenizer.save_pretrained("outputs/prefix_tuning/final_model")

trainable params: 811,648 || all params: 494,844,416 || trainable%: 0.1640


Map:   0%|          | 0/81000 [00:00<?, ? examples/s]

Map:   0%|          | 0/9000 [00:00<?, ? examples/s]

  trainer = Trainer(
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 81,000 | Num Epochs = 1 | Total steps = 100
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 811,648 of 494,844,416 (0.16% trained)


AttributeError: 'Qwen2Attention' object has no attribute 'apply_qkv'

```

id: gNUS-SVgG8gM
cell_type: code

## Text Generation

Now, let's use the fine-tuned models to generate SQL queries for our test set.

In [13]:
# Function to generate text using the fine-tuned models
import torch # Import torch here
from peft import PeftModel # Import PeftModel as it's used in this cell
from unsloth import FastLanguageModel # Import FastLanguageModel as it's used in this cell
from tqdm.auto import tqdm # Import tqdm for progress bar
import pandas as pd # Import pandas as it's used to save results
from datasets import Dataset # Import Dataset for compatibility

# Check if test_dataset is defined. If not, prompt the user to run the data preparation cell.
if 'test_dataset' not in locals():
    raise NameError("test_dataset is not defined. Please run the 'Data Preparation' cell first.")


def generate_text(model, tokenizer, prompt, max_new_tokens=256):
    # Ensure the model is on the correct device (GPU)
    model.to('cuda')
    model.eval() # Set model to evaluation mode

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Generate text
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id
        )

    # Decode the generated text
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Extract only the response part (after the prompt)
    # Find the index of the response separator "### Response:"
    response_separator = "### Response:"
    separator_index = generated_text.find(response_separator)

    if separator_index != -1:
        # Extract text after the separator and strip leading/trailing whitespace
        response = generated_text[separator_index + len(response_separator):].strip()
    else:
        # If separator not found, return the whole generated text after the original prompt
        response = generated_text[len(prompt):].strip()


    return response

# Function to generate responses for the test set
def generate_responses(model, tokenizer, test_data):
    responses = []

    # Ensure model is in evaluation mode and on the correct device before generation loop
    model.to('cuda')
    model.eval()

    for example in tqdm(test_data):
        prompt = example["input"]
        # Pass the model instance directly to generate_text
        response = generate_text(model, tokenizer, prompt)
        responses.append(response)

    return responses

# Load the fine-tuned models and generate responses
model_paths = [
    "outputs/full_fine_tuning/final_model",
    "outputs/lora_fine_tuning/final_model",
    "outputs/qlora_fine_tuning/final_model",
    # "outputs/prefix_tuning/final_model" # Commented out as Prefix-Tuning was skipped
]

model_names = ["Full Fine-Tuning", "LoRA", "QLoRA"] # Updated model names list to exclude Prefix-Tuning
all_responses = {}

for model_path, model_name in zip(model_paths, model_names):
    print(f"\nGenerating responses for {model_name}...")

    # Determine dtype based on BF16 support
    if torch.cuda.is_bf16_supported():
        dtype = torch.bfloat16
    else:
        dtype = torch.float16 # Fallback to FP16 if BF16 is not supported

    # Load the model and tokenizer using FastLanguageModel for all methods
    if model_name == "Full Fine-Tuning":
         model, tokenizer = FastLanguageModel.from_pretrained(
            model_name = model_path, # Load from the saved path
            max_seq_length = 2048,
            dtype = dtype,
            load_in_4bit = False, # Full model not quantized
         )
    else:
        # For parameter-efficient methods, load the base model first then the adapters
        model, tokenizer = FastLanguageModel.from_pretrained(
            "Qwen/Qwen2-0.5B-Instruct", # Base model name
            max_seq_length = 2048,
            dtype = dtype,
            load_in_4bit = True, # Load in 4bit for PEFT models
        )
        # Load PEFT adapters
        model = PeftModel.from_pretrained(model, model_path)

    tokenizer.pad_token = tokenizer.eos_token

    # Ensure the model is on the GPU and in evaluation mode before generating
    model.to('cuda')
    model.eval()

    # Generate responses
    responses = generate_responses(model, tokenizer, test_dataset)
    all_responses[model_name] = responses

# Save the ground truth and generated responses
results_df = pd.DataFrame({
    "input": [example["input"] for example in test_dataset],
    "ground_truth": [example["output"] for example in test_dataset]
})

for model_name, responses in all_responses.items():
    results_df[model_name] = responses

results_df.to_csv("outputs/generation_results.csv", index=False)
print("\nGeneration results saved to outputs/generation_results.csv")


Generating responses for Full Fine-Tuning...
==((====))==  Unsloth 2025.8.8: Fast Qwen2 patching. Transformers: 4.55.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu128. CUDA: 7.5. CUDA Toolkit: 12.8. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


  0%|          | 0/10000 [00:00<?, ?it/s]


Generating responses for LoRA...
==((====))==  Unsloth 2025.8.8: Fast Qwen2 patching. Transformers: 4.55.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu128. CUDA: 7.5. CUDA Toolkit: 12.8. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


  0%|          | 0/10000 [00:00<?, ?it/s]


Generating responses for QLoRA...
==((====))==  Unsloth 2025.8.8: Fast Qwen2 patching. Transformers: 4.55.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu128. CUDA: 7.5. CUDA Toolkit: 12.8. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


  0%|          | 0/10000 [00:00<?, ?it/s]


Generation results saved to outputs/generation_results.csv


## Evaluation

Now, let's evaluate the performance of the fine-tuned models using appropriate metrics. Since the task is Text-to-SQL generation, we'll use metrics that assess the similarity and correctness of the generated SQL queries compared to the ground truth. We will focus on metrics like:

- **Exact Match:** Percentage of generated queries that exactly match the ground truth.
- **BLEU:** A metric for evaluating the quality of text which has been machine-translated, but can be used here to compare generated SQL with ground truth.
- **ROUGE:** A set of metrics for evaluating automatic summarization and machine translation. We will use ROUGE-L for measuring the longest common subsequence.
- **BERTScore:** Measures the similarity between two sentences based on their BERT embeddings.

In [17]:
import pandas as pd
import evaluate
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score
from bert_score import BERTScorer
from datasets import Dataset # Import Dataset for compatibility
import textstat # Import textstat for Flesch Reading Ease
from detoxify import Detoxify # Import Detoxify for toxicity score
from sentence_transformers import SentenceTransformer, util # Import for CoSIM, Novelty, Diversity
import numpy as np
from collections import Counter # Import Counter for diversity and repetition


# Load the generation results
try:
    results_df = pd.read_csv("outputs/generation_results.csv")
except FileNotFoundError:
    print("Error: generation_results.csv not found. Please run the text generation cell first.")
    exit()

# Initialize evaluation metrics
bleu_metric = evaluate.load("bleu")
rouge_metric = evaluate.load("rouge")
bertscore_metric = BERTScorer(model_type='bert-base-uncased')
smoothie = SmoothingFunction().method4
detoxify_model = Detoxify('original')
sentence_transformer_model = SentenceTransformer('all-MiniLM-L6-v2')


# Store evaluation results
evaluation_results = {}

# Evaluate each model
model_names = ["Full Fine-Tuning", "LoRA", "QLoRA"] # Updated model names list to exclude Prefix-Tuning

for model_name in model_names:
    print(f"\nEvaluating {model_name}...")

    ground_truth_queries = results_df["ground_truth"].tolist()
    generated_queries = results_df[model_name].tolist()

    # Ensure all queries are strings before processing
    ground_truth_queries = [str(query) for query in ground_truth_queries]
    generated_queries = [str(query) for query in generated_queries]

    # Initialize a dictionary to store metrics for the current model
    model_metrics = {}

    # 1. Exact Match
    exact_match_count = sum([1 for gt, gen in zip(ground_truth_queries, generated_queries) if gt.strip() == gen.strip()])
    model_metrics["Exact Match"] = exact_match_count / len(ground_truth_queries) * 100

    # 2. BLEU (using NLTK's sentence_bleu for individual sentences)
    bleu_scores = []
    for gt, gen in zip(ground_truth_queries, generated_queries):
        reference = [gt.split()]
        candidate = gen.split()
        bleu_scores.append(sentence_bleu(reference, candidate, smoothing_function=smoothie))
    model_metrics["BLEU"] = sum(bleu_scores) / len(bleu_scores) * 100

    # 3. ROUGE (using the evaluate library)
    rouge_results = rouge_metric.compute(
        predictions=generated_queries,
        references=ground_truth_queries,
        use_stemmer=True,
    )
    model_metrics["ROUGE-1"] = rouge_results['rouge1'] * 100
    model_metrics["ROUGE-2"] = rouge_results['rouge2'] * 100
    model_metrics["ROUGE-L"] = rouge_results['rougeL'] * 100

    # 4. METEOR (using NLTK)
    # meteor_score expects tokenized sentences (list of words)
    meteor_scores = [meteor_score([gt.split()], gen.split()) for gt, gen in zip(ground_truth_queries, generated_queries)]
    model_metrics["METEOR"] = sum(meteor_scores) / len(meteor_scores) * 100

    # 5. GLEU (using NLTK - for sentence level)
    # NLTK's gleu_score requires tokenized sentences
    gleu_scores = []
    for gt, gen in zip(ground_truth_queries, generated_queries):
        reference = [gt.split()]
        candidate = gen.split()
        # Use min_len and max_len for GLEU
        gleu_scores.append(nltk.translate.gleu_score.sentence_gleu(reference, candidate))
    model_metrics["GLEU"] = sum(gleu_scores) / len(gleu_scores) * 100


    # 6. Repetition Rate (word level)
    total_words = 0
    repeated_words = 0
    for gen in generated_queries:
        words = gen.split()
        total_words += len(words)
        word_counts = Counter(words)
        for word, count in word_counts.items():
            if count > 1:
                repeated_words += (count - 1)
    model_metrics["Repetition Rate"] = (repeated_words / total_words) * 100 if total_words > 0 else 0


    # 7. Flesch Reading Ease (using textstat)
    # textstat expects a single string, so we'll average over all generated queries
    flesch_scores = [textstat.flesch_reading_ease(gen) for gen in generated_queries]
    model_metrics["Flesch Reading Ease"] = sum(flesch_scores) / len(flesch_scores) if len(flesch_scores) > 0 else 0

    # 8. CoSIM (Cosine Similarity between generated and ground truth embeddings)
    # This can be computationally expensive for a large dataset
    # We'll calculate embeddings in batches for efficiency
    cos_sim_scores = []
    batch_size = 64 # Adjust batch size based on available memory
    for i in tqdm(range(0, len(generated_queries), batch_size), desc="Calculating CoSIM"):
        gen_batch = generated_queries[i:i+batch_size]
        gt_batch = ground_truth_queries[i:i+batch_size]
        gen_embeddings = sentence_transformer_model.encode(gen_batch, convert_to_tensor=True)
        gt_embeddings = sentence_transformer_model.encode(gt_batch, convert_to_tensor=True)
        # Calculate cosine similarity for corresponding pairs
        cosine_scores = util.cos_sim(gen_embeddings, gt_embeddings)
        # Extract diagonal elements as they represent similarity between generated and ground truth pairs
        cos_sim_scores.extend(cosine_scores.diag().tolist())
    model_metrics["CoSIM"] = sum(cos_sim_scores) / len(cos_sim_scores) * 100 if len(cos_sim_scores) > 0 else 0


    # 9. BERTScore (using the bert_score library)
    P, R, F1 = bertscore_metric.score(generated_queries, ground_truth_queries, verbose=False) # Set verbose to False for cleaner output
    model_metrics["BERT Score"] = F1.mean().item() * 100


    # 10. Toxicity (using Detoxify)
    # Detoxify can also be computationally expensive, process in batches
    toxicity_scores = []
    for i in tqdm(range(0, len(generated_queries), batch_size), desc="Calculating Toxicity"):
        gen_batch = generated_queries[i:i+batch_size]
        results = detoxify_model.predict(gen_batch)
        # Use the 'toxicity' score
        toxicity_scores.extend(results['toxicity'])
    model_metrics["Toxicity"] = sum(toxicity_scores) / len(toxicity_scores) * 100 if len(toxicity_scores) > 0 else 0


    # 11. Novelty (percentage of unique generated queries)
    unique_generated_queries = set(generated_queries)
    model_metrics["Novelty"] = (len(unique_generated_queries) / len(generated_queries)) * 100 if len(generated_queries) > 0 else 0

    # 12. Diversity (average number of unique words per generated query)
    diversity_scores = []
    for gen in generated_queries:
        words = gen.split()
        if len(words) > 0:
            diversity_scores.append(len(set(words)) / len(words))
        else:
            diversity_scores.append(0)
    model_metrics["Diversity"] = sum(diversity_scores) / len(diversity_scores) * 100 if len(diversity_scores) > 0 else 0


    # Store the metrics for the current model
    evaluation_results[model_name] = model_metrics

# Display the evaluation results in a vertical table
print("\nEvaluation Results:")

# Create a list of metrics in the desired order
metric_order = [
    "Exact Match",
    "BLEU",
    "ROUGE-1",
    "ROUGE-2",
    "ROUGE-L",
    "METEOR",
    "GLEU",
    "Repetition Rate",
    "Flesch Reading Ease",
    "CoSIM",
    "BERT Score",
    "Toxicity",
    "Novelty",
    "Diversity",
]

# Create a dictionary for the vertical table format
vertical_results = {"Metric": metric_order}
for model_name in model_names:
    vertical_results[model_name] = [evaluation_results[model_name].get(metric, "N/A") for metric in metric_order]

evaluation_df_vertical = pd.DataFrame(vertical_results)

display(evaluation_df_vertical)

# Save the evaluation results
evaluation_df_vertical.to_csv("outputs/evaluation_results_vertical.csv", index=False)
print("\nEvaluation results saved to outputs/evaluation_results_vertical.csv")


Evaluating Full Fine-Tuning...


Calculating CoSIM:   0%|          | 0/157 [00:00<?, ?it/s]

Calculating Toxicity:   0%|          | 0/157 [00:00<?, ?it/s]


Evaluating LoRA...


Calculating CoSIM:   0%|          | 0/157 [00:00<?, ?it/s]

Calculating Toxicity:   0%|          | 0/157 [00:00<?, ?it/s]


Evaluating QLoRA...


Calculating CoSIM:   0%|          | 0/157 [00:00<?, ?it/s]

Calculating Toxicity:   0%|          | 0/157 [00:00<?, ?it/s]


Evaluation Results:


Unnamed: 0,Metric,Full Fine-Tuning,LoRA,QLoRA
0,Exact Match,2.15,2.57,2.54
1,BLEU,12.530289,15.212284,14.854203
2,ROUGE-1,56.562796,59.598573,59.209583
3,ROUGE-2,30.37678,34.283356,33.763359
4,ROUGE-L,52.273995,55.732931,55.33805
5,METEOR,39.38936,42.529829,41.88502
6,GLEU,18.846889,21.894156,21.520329
7,Repetition Rate,11.331919,7.083098,7.608599
8,Flesch Reading Ease,28.080069,27.277557,26.394054
9,CoSIM,81.171222,82.402178,82.248649



Evaluation results saved to outputs/evaluation_results_vertical.csv


## Conclusion

Based on the evaluation results, we can compare the performance of the different fine-tuning methods and draw conclusions about their effectiveness for the Text-to-SQL generation task with the Qwen2 0.5B Instruct model.

**Finish task:** Present the findings from the evaluation.

### Note on Prefix-Tuning

Due to compatibility issues encountered with Unsloth's patching and the standard PEFT implementation for Prefix-Tuning with the Qwen2 model in this environment, the Prefix-Tuning experiment has been skipped. The subsequent text generation and evaluation steps will only include the Full Fine-Tuning, LoRA, and QLoRA methods.