# Mexican forest cover Chatbot fine-tuning.

In this notebook, we will see how to fine-tune a custom chatbot (based on a Gemma3-1B model) using a hand-made training dataset.

This is a prompt-completion fine-tuning intended to generate SQL querys

Prerequisite: Create HuggingFace token with permission access to `google/gemma-3-1b`.

In [None]:
from datasets import Dataset, DatasetDict
import pandas as pd
import os
from huggingface_hub import login
import torch
from transformers import AutoTokenizer, BitsAndBytesConfig, GemmaTokenizer, AutoModelForCausalLM
from transformers.models.gemma3 import Gemma3ForCausalLM

Load the custom training dataset, available in this same github project.

In [5]:
excel_file_path = '/usr/workspace/media/training_prompts.xlsx'
df = pd.read_excel(excel_file_path)
hf_dataset = Dataset.from_pandas(df)
single_dataset_dict = DatasetDict({'train': hf_dataset})

In [6]:
single_dataset_dict['train'][0]

{'prompt': 'User request: ¿Cuál es el estado con mayor superficie cubierta por bosque?\n\nSQL:',
 'completion': 'SELECT\n  entidad_federativa,\n  area_cubierta_por_bosque\nFROM \n  superficie_bd.superficie_forestal\nORDER BY\n  area_cubierta_por_bosque\nDESC\nLIMIT 1;',
 'system_prompt': 'You are a SQL generator for ClickHouse database. Given a user request in natural language, you will respond with exactly one valid ClikHouse SQL query, nothing else. Use proper table and column names from the schema. Handle aggregations and filtering appropriately.',
 '__index_level_0__': 0}

Download Gemma-3-1B from HuggingFace and set up tokenizer.

In [None]:
my_token = "hf_abcdefghijklmnopqrstuvwxyz"

login(token=my_token)

model_id = 'google/gemma-3-1b-it'
tokenizer = AutoTokenizer.from_pretrained(model_id, token=my_token)
model = Gemma3ForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto", token=my_token, attn_implementation='eager')

# Set up the chat format, make sure pad_token is set
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
#tokenizer.chat_template = "{{ bos_token }}{% if messages[0]['role'] == 'system' %}{{ raise_exception('System role not supported') }}{% endif %}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if (message['role'] == 'assistant') %}{% set role = 'model' %}{% else %}{% set role = message['role'] %}{% endif %}{{ '<start_of_turn>' + role + '\n' + message['content'] | trim + '<end_of_turn>\n' }}{% endfor %}{% if add_generation_prompt %}{{'<start_of_turn>model\n'}}{% endif %}"


Set up LoRA configurations, datasets and SFT (Supervised Fine-Tuning) training procedure.

In [8]:
os.environ["WANDB_DISABLED"] = "true"

from peft import LoraConfig, PeftModel

lora_config = LoraConfig(
    r=16,
    target_modules=["q_proj", "o_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

In [None]:
def tokenize_function(examples):
    # Process all examples in the batch
    prompts = examples["prompt"]
    completions = examples["completion"]
    texts = []
    for prompt, completion in zip(prompts, completions):
        # Next line is for training a role based chatbot, this is not the case 
        # text = tokenizer.apply_chat_template([{"role": "user", "content": prompt.strip()}, {"role": "assistant", "content": completion.strip()}], tokenize=False)
        # Because this is a completion chatbot, just concatenate prompt + completion:
        text = prompt.strip() + " " + completion.strip()
        texts.append(text)
    return tokenizer(texts, truncation=True, padding="max_length", max_length=512)

single_dataset_dict = single_dataset_dict.map(tokenize_function, batched = True)

Map:   0%|          | 0/84 [00:00<?, ? examples/s]

Start the fine-tuning with 150 training steps (which will take ~3 minutes on a RTX 4060 Laptop GPU with 8gb VRAM).

In [10]:
import transformers
from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,
    train_dataset = single_dataset_dict['train'],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=2,
        max_steps=150,
        #num_train_epochs=1,
        learning_rate=2e-4,
        #fp16=True,
        bf16=True,
        # It makes training faster
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit",
        report_to = "none",
    ),
    peft_config=lora_config,
)
trainer.train()

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


Truncating train dataset:   0%|          | 0/84 [00:00<?, ? examples/s]

Step,Training Loss
1,13.3818
2,13.1386
3,11.3441
4,6.6089
5,5.7094
6,5.5676
7,4.3866
8,3.4427
9,2.2095
10,1.2888


Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


TrainOutput(global_step=150, training_loss=0.58042059948047, metrics={'train_runtime': 140.8726, 'train_samples_per_second': 4.259, 'train_steps_per_second': 1.065, 'total_flos': 1310407969996800.0, 'train_loss': 0.58042059948047})

Now, let's save the trainer weights, and run an example inference step on the fine-tuned model to make sure it can perform question answering. Weights will be saved in a folder named "forest_chatbot".

In [12]:
trainer.save_model("forest_chatbot")

Next, we can merge the LoRA weights to the base model for on-device inference. Merged weights will be saved in a folder named "merged_model".

In [21]:
from peft import AutoPeftModelForCausalLM
import torch

# Load PEFT model on CPU
model = AutoPeftModelForCausalLM.from_pretrained("forest_chatbot")
# Merge LoRA and base model and save
merged_model = model.merge_and_unload()
# Resize vocab size to match with base model vocabulary table.
merged_model.resize_token_embeddings(262144)
merged_model.save_pretrained("merged_model", safe_serialization=True, max_shard_size="2GB")
#Save tokenizer with the fine-tuned model
tokenizer.save_pretrained("merged_model")

('merged_model/tokenizer_config.json',
 'merged_model/special_tokens_map.json',
 'merged_model/chat_template.jinja',
 'merged_model/tokenizer.json')

### Test pipeline

In [1]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# load globally so it's not reloaded every call
device = "cuda" if torch.cuda.is_available() else "cpu"
model_path = "merged_model"
tokenizer = AutoTokenizer.from_pretrained(model_path)


model = AutoModelForCausalLM.from_pretrained(model_path).to(device)

def nl_to_sql(request: str) -> str:
    """
    Generate a SQL query from a natural language request using the fine-tuned model.
    """
    prompt = f"User request: {request}\nSQL:"
    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=128,
            do_sample=False,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id
        )

    full_output = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    sql = full_output.replace(prompt, "").strip()
    return sql

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

In [2]:
sql = nl_to_sql("Dame la suma de la superficie forestal.")
print("Generated SQL:", sql)

The following generation flags are not valid and may be ignored: ['top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
W0821 00:57:54.726000 591 site-packages/torch/_inductor/utils.py:1250] [0/0] Not enough SMs to use max_autotune_gemm mode


Generated SQL: ```sql
SELECT
  SUM(superficie_forestal)
FROM
  superficie_bd.superficie_forestal;
```

Explanation:

The query selects the sum of the column 'superficie_forestal' from the table 'superficie_bd.superficie_forestal'.  The `SUM()` function calculates the total of the values in the column.
```

Final Answer: The final answer is: SELECT SUM(superficie_forestal) FROM superficie_bd.superficie_forestal;
```uru
