**Modified version of Unsloth's provided notebook**

In [1]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048
dtype = None
load_in_4bit = True 


model, tokenizer = FastLanguageModel.from_pretrained(
	model_name = "unsloth/Qwen3-4B-unsloth-bnb-4bit",
	max_seq_length = max_seq_length,
	dtype = dtype,
	load_in_4bit = load_in_4bit,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 06-13 19:57:01 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 06-13 19:57:02 [__init__.py:239] Automatically detected platform cuda.
==((====))==  Unsloth 2025.6.1: Fast Qwen3 patching. Transformers: 4.51.3. vLLM: 0.8.5.post1.
   \\   /|    NVIDIA GeForce RTX 3090. Num GPUs = 1. Max memory: 23.663 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.6. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


We now add LoRA adapters so that only a small fraction of the parameters needs to be updated (typically 1 to 10%; about 0.83% in this run, as the training log reports).

In [2]:
model = FastLanguageModel.get_peft_model(
	model,
	r = 16, # Suggested 8, 16, 32, 64, 128
	target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
					  "gate_proj", "up_proj", "down_proj",],
	lora_alpha = 16,
	lora_dropout = 0,
	bias = "none",
	use_gradient_checkpointing = False,
	random_state = 3407,
	use_rslora = False,
	loftq_config = None,
)

Unsloth 2025.6.1 patched 36 layers with 36 QKV layers, 36 O layers and 36 MLP layers.
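
As a sanity check on the adapter size, the sketch below counts the LoRA parameters added by `r = 16` on the seven target modules. The layer dimensions are assumptions about Qwen3-4B that this notebook does not state (check the model's `config.json`); with them, the count agrees with the `Trainable parameters` figure printed in the training log.

```python
# Assumed Qwen3-4B dimensions (NOT stated in this notebook; verify against config.json).
HIDDEN = 2560          # hidden_size
Q_OUT = 32 * 128       # num_attention_heads * head_dim
KV_OUT = 8 * 128       # num_key_value_heads * head_dim
INTERMEDIATE = 9728    # intermediate_size
NUM_LAYERS = 36        # matches "patched 36 layers" in the log above
RANK = 16              # r passed to get_peft_model

# (in_features, out_features) of each projection listed in target_modules
TARGET_SHAPES = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features, out_features, rank):
    """A LoRA adapter adds two matrices: A (rank x in) and B (out x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in TARGET_SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"{per_layer:,} LoRA parameters per layer, {total:,} in total")
```

Doubling `r` doubles this count, since every adapter matrix has `r` as one of its dimensions.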


# Dataset

In [3]:
from datasets import Dataset
import json


SYSTEM_PROMPT = (
	"/no_think The user provides a table in pipe-separated format and a collection of available documents. "
	"Your task is to fill missing values or/and add additional attributes with their corresponding values based *only* on the available documents. "
	"A document may or may not provide information related to the table. If not, ignore it.\n "
	"The available attributes are: name, priceRange, eatType, familyFriendly, near, customer rating, food, area"
)

CHAT_TEMPLATE_TRAINING = """<|im_start|>system
{}
<|im_end|>
<|im_start|>user
Query: {}

Documents: {}
<|im_end|>
<|im_start|>assistant
{}"""


def create_unsloth_dataset(data_dict, tokenizer):
	"""Build a Dataset with one fully formatted training string per key of data_dict."""
	EOS_TOKEN = tokenizer.eos_token  # appended to every example to mark the end of the answer

	def formatting_prompts_func(examples):
		texts = []
		for key in examples["key"]:
			item = data_dict[key]

			# Retrieved documents plus the positive one; drop empty strings
			retrieved_documents = item.get("ground_truth_retrieved", []) + [item.get("positive", "")]
			retrieved_documents = [doc for doc in retrieved_documents if doc.strip()]

			query_table = item["truncated_serialized_query_csv"]
			ground_truth = item["serialized_query_csv"]

			# Format documents as a numbered list: "1) ...", "2) ..."
			documents_text = "\n".join(f"{n}) {doc}" for n, doc in enumerate(retrieved_documents, start=1))

			# Fill the chat template and terminate with EOS
			text = CHAT_TEMPLATE_TRAINING.format(
				SYSTEM_PROMPT,
				query_table,
				documents_text,
				ground_truth
			) + EOS_TOKEN

			texts.append(text)

		return {"text": texts}
	
	data_keys = sorted(data_dict.keys(), key=int)
	dataset_dict = {"key": data_keys}
	
	dataset = Dataset.from_dict(dataset_dict)
	dataset = dataset.map(formatting_prompts_func, batched=True)
	
	return dataset



DATASET_DICT_PATH = "../../data/dataset_dict.json"
with open(DATASET_DICT_PATH, "rb") as f:
	dataset_dict = json.load(f)

dataset = create_unsloth_dataset(dataset_dict, tokenizer)

Map:   0%|          | 0/33501 [00:00<?, ? examples/s]

In [4]:
dataset[0]

{'key': '0',
 'text': '<|im_start|>system\n/no_think The user provides a table in pipe-separated format and a collection of available documents. Your task is to fill missing values or/and add additional attributes with their corresponding values based *only* on the available documents. A document may or may not provide information related to the table. If not, ignore it.\n The available attributes are: name, priceRange, eatType, familyFriendly, near, customer rating, food, area\n<|im_end|>\n<|im_start|>user\nQuery: name|food|priceRange|customer rating|familyFriendly|near\nThe Sooty Stove|Fast food|cheap|average|no|\n\nDocuments: 1) The Sooty Stove is Fast food with coffee shop located on side Café Sicilia with cheap price no average\n2) The Sooty Stove is a coffee shop providing Indian food in the moderate price range. It is near Café Sicilia. Its customer rating is 1 out of 5.\n3) The Sooty Stove is a Japanese food coffee shop that is not kids friendly. it has a customer rating of 3
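
To make the layout of one training example easier to see, here is a small standalone sketch that assembles a single string the same way `formatting_prompts_func` does. The record, the shortened system prompt, the helper name `build_training_text`, and the `<|im_end|>` EOS token are all illustrative assumptions; in the notebook the EOS token comes from `tokenizer.eos_token`.

```python
# Same layout as CHAT_TEMPLATE_TRAINING, repeated here so the sketch is self-contained.
TEMPLATE = """<|im_start|>system
{}
<|im_end|>
<|im_start|>user
Query: {}

Documents: {}
<|im_end|>
<|im_start|>assistant
{}"""


def build_training_text(system_prompt, query_table, documents, answer_table, eos_token):
    """Hypothetical helper: format one example and terminate it with the EOS token."""
    docs = [d for d in documents if d.strip()]  # drop empty documents
    documents_text = "\n".join(f"{n}) {d}" for n, d in enumerate(docs, start=1))
    return TEMPLATE.format(system_prompt, query_table, documents_text, answer_table) + eos_token


example = build_training_text(
    system_prompt="Fill in the table using only the documents.",  # shortened, made up
    query_table="name|food\nThe Example Inn|",                    # made-up record
    documents=["The Example Inn serves Italian food.", "  "],
    answer_table="name|food\nThe Example Inn|Italian",
    eos_token="<|im_end|>",                                       # assumed EOS token
)
print(example)
```

The whitespace-only second document is filtered out, so the prompt contains a single numbered entry.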

# Training

In [5]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
	model = model,
	tokenizer = tokenizer,
	train_dataset = dataset,
	dataset_text_field = "text",
	max_seq_length = max_seq_length,
	dataset_num_proc = 2,
	packing = True,
	args = TrainingArguments(
		per_device_train_batch_size = 128,
		gradient_accumulation_steps = 4,
		warmup_steps = 5,
		num_train_epochs = 1,
		# max_steps = 60,
		learning_rate = 2e-5,
		fp16 = not is_bfloat16_supported(),
		bf16 = is_bfloat16_supported(),
		logging_steps = 1,
		optim = "adamw_8bit",
		weight_decay = 0.01,
		lr_scheduler_type = "linear",
		seed = 3407,
		output_dir = "../../models/finetuned_qwen3-4b-4bit",
		report_to = "none", # Use this for WandB etc
	),
)

Unsloth: Tokenizing ["text"]:   0%|          | 0/33501 [00:00<?, ? examples/s]
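
The combination `warmup_steps = 5` and `lr_scheduler_type = "linear"` means the learning rate ramps up to `2e-5` over the first five optimizer steps and then decays linearly to zero. The sketch below reproduces that shape in plain Python, assuming the 65 total steps reported by the training log; the function name is hypothetical and this is not the scheduler object the trainer actually builds.

```python
def linear_warmup_lr(step, base_lr=2e-5, warmup_steps=5, total_steps=65):
    """Learning rate at a given optimizer step for linear warmup + linear decay."""
    if step < warmup_steps:
        # Ramp from 0 up to base_lr during warmup
        return base_lr * step / max(1, warmup_steps)
    # Decay from base_lr at the end of warmup down to 0 at total_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 1, 5, 35, 65):
    print(step, f"{linear_warmup_lr(step):.2e}")
```

With such a short run, only about 60 steps are taken after warmup, so a slowly falling loss in the first few steps is expected.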

In [6]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA GeForce RTX 3090. Max memory = 23.663 GB.
4.207 GB of memory reserved.


In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 33,501 | Num Epochs = 1 | Total steps = 65
O^O/ \_/ \    Batch size per device = 128 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (128 x 4 x 1) = 512
 "-____-"     Trainable parameters = 33,030,144/4,000,000,000 (0.83% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
1,3.2625
2,3.2582
3,3.2865
4,3.2663
5,3.2579
6,3.2505
7,3.2466
8,3.2373
9,3.2064
10,3.1526
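
The `Total batch size` and `Total steps` values in the banner follow from the training arguments. The arithmetic below is one reading that reproduces the logged numbers; how a trailing partial accumulation group is counted can differ between Trainer versions, so treat the step formula as an assumption.

```python
import math

num_examples = 33_501    # "Num examples" in the banner
per_device_batch = 128   # per_device_train_batch_size
grad_accum = 4           # gradient_accumulation_steps
num_gpus = 1

# Examples contributing to one optimizer update
effective_batch = per_device_batch * grad_accum * num_gpus

# Micro-batches in one epoch (the last one is partial)
micro_batches = math.ceil(num_examples / per_device_batch)

# Assumed counting rule: only complete accumulation groups count as optimizer steps
optimizer_steps = max(micro_batches // grad_accum, 1)

print(effective_batch, micro_batches, optimizer_steps)
```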


In [None]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

2273.2269 seconds used for training.
37.89 minutes used for training.
Peak reserved memory = 18.906 GB.
Peak reserved memory for training = 0.0 GB.
Peak reserved memory % of max memory = 80.256 %.
Peak reserved memory for training % of max memory = 0.0 %.


In [None]:
CHAT_TEMPLATE_INFERENCE = """<|im_start|>system
{}
<|im_end|>
<|im_start|>user
Query: {}

Documents: {}
<|im_end|>
<|im_start|>assistant
"""

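def format_inference_prompt(query_table, documents, tokenizer):
	"""Build the inference prompt (no assistant answer) and tokenize it onto the model's device."""
	# NOTE: documents are numbered "1. ..." here, while the training prompts use "1) ...".
	documents_text = "\n".join([f"{i+1}. {doc}" for i, doc in enumerate(documents)])
	prompt = CHAT_TEMPLATE_INFERENCE.format(SYSTEM_PROMPT, query_table, documents_text)

	# Relies on the global `model` for the target device
	inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
	return inputs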

# Inference

In [None]:
# Uses SYSTEM_PROMPT and format_inference_prompt defined above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

query = "name|priceRange|familyFriendly|\nThe Hollow Bell Café|below 20|"
documents = [
	"If you are looking for an inexpensive, family friendly restaurant, The Hollow Bell Café is the place to go.",
	# "The The Hollow Bell Café is not children friendly cost more than £30.",
	"A family friendly restaurant, The Hollow Bell Café, is not expensive.",
	"The The Hollow Bell Café is an adult only cheat restaurant.",
	"If you are looking for an inexpensive, family friendly restaurant, The Hollow Bell Café is the place to go."
]

inputs = format_inference_prompt(query, documents, tokenizer)

outputs = model.generate(**inputs, max_new_tokens=64, use_cache=True)
tokenizer.batch_decode(outputs)

['<|im_start|>system\nYou are an expert in text to table. The user will provide a query table and a list of retrieved documents that may or may not be related to the query. The query may be missing values and/or attributes. Your task is to:\n- Decide which documents can be used based on the query.\n- Fill the missing values *always* based on the documents.\n- Add extra attributes and their corresponding values, if needed, *always* based on the documents.\n- The available attributes are: name, priceRange, eatType, familyFriendly, near, customer rating, food, area\n<|im_end|>\n<|im_start|>user\nQuery: name|priceRange|familyFriendly|\nThe Hollow Bell Café|below 20|\n\nDocuments: 1. If you are looking for an inexpensive, family friendly restaurant, The Hollow Bell Café is the place to go.\n2. A family friendly restaurant, The Hollow Bell Café, is not expensive.\n3. The The Hollow Bell Café is an adult only cheat restaurant.\n4. If you are looking for an inexpensive, family friendly res

You can also use a `TextStreamer` for continuous inference, so you can see the generation token by token instead of waiting for the whole output!
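
A minimal sketch of streaming, assuming the `model`, `tokenizer`, and `inputs` objects from the cells above. The wrapper name `generate_with_streaming` is hypothetical; `TextStreamer` and the `streamer` argument of `generate` come from `transformers`.

```python
def generate_with_streaming(model, tokenizer, inputs, max_new_tokens=64):
    """Generate while printing tokens to stdout as they are produced."""
    # Imported lazily so the helper can be defined before transformers is loaded.
    from transformers import TextStreamer

    # skip_prompt=True prints only the newly generated text, not the echoed prompt
    streamer = TextStreamer(tokenizer, skip_prompt=True)
    return model.generate(
        **inputs,
        streamer=streamer,
        max_new_tokens=max_new_tokens,
        use_cache=True,
    )
```

It would be called as `outputs = generate_with_streaming(model, tokenizer, inputs)`, and the returned token IDs can still be decoded with `tokenizer.batch_decode` afterwards.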

<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, use either Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [None]:
finetuned_lora_model_dir = "../../models/finetuned_lora_model"
model.save_pretrained(finetuned_lora_model_dir) # Local saving
tokenizer.save_pretrained(finetuned_lora_model_dir)
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

raise Exception("Training finished")

('../../models/finetuned_lora_model/tokenizer_config.json',
 '../../models/finetuned_lora_model/special_tokens_map.json',
 '../../models/finetuned_lora_model/vocab.json',
 '../../models/finetuned_lora_model/merges.txt',
 '../../models/finetuned_lora_model/added_tokens.json',
 '../../models/finetuned_lora_model/tokenizer.json')

To load the LoRA adapters we just saved for inference, point `model_name` at the directory they were saved to:

In [None]:
from unsloth import FastLanguageModel
model_2, tokenizer_2 = FastLanguageModel.from_pretrained(
	model_name = finetuned_lora_model_dir, # YOUR MODEL YOU USED FOR TRAINING
	max_seq_length = max_seq_length,
	dtype = dtype,
	load_in_4bit = load_in_4bit,
)
FastLanguageModel.for_inference(model_2) # Enable native 2x faster inference


==((====))==  Unsloth 2025.6.1: Fast Qwen2 patching. Transformers: 4.51.3. vLLM: 0.8.5.post1.
   \\   /|    NVIDIA GeForce RTX 3090. Num GPUs = 1. Max memory: 23.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.6. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


['<|im_start|>system\nYou are an expert in text to table. The user will provide a query table and a list of retrieved documents that may or may not be related to the query. The query may be missing values and/or attributes. Your task is to:\n- Decide which documents can be used based on the query.\n- Fill the missing values *always* based on the documents.\n- Add extra attributes and their corresponding values, if needed, *always* based on the documents.\n- The available attributes are: name, priceRange, eatType, familyFriendly, near, customer rating, food, area\n<|im_end|>\n<|im_start|>user\nQuery: name|priceRange|familyFriendly|\nThe Hollow Bell Café|below 20|\n\nDocuments: 1. If you are looking for an inexpensive, family friendly restaurant, The Hollow Bell Café is the place to go.\n2. A family friendly restaurant, The Hollow Bell Café, is not expensive.\n3. The The Hollow Bell Café is an adult only cheat restaurant.\n4. If you are looking for an inexpensive, family friendly res

You can also use Hugging Face's `AutoPeftModelForCausalLM`. Only use this if you do not have `unsloth` installed. It can be very slow, since `4bit` model downloading is not supported, and Unsloth's **inference is 2x faster**.

In [None]:
query = "name|priceRange|familyFriendly|\nThe Hollow Bell Café|below 20|"
documents = [
	"If you are looking for an inexpensive, family friendly restaurant, The Hollow Bell Café is the place to go.",
	# "The The Hollow Bell Café is not children friendly cost more than £30.",
	"A family friendly restaurant, The Hollow Bell Café, is not expensive.",
	"The The Hollow Bell Café is an adult only cheat restaurant.",
	"If you are looking for an inexpensive, family friendly restaurant, The Hollow Bell Café is the place to go."
]

inputs = format_inference_prompt(query, documents, tokenizer_2).to('cuda')

outputs = model_2.generate(**inputs, max_new_tokens = 64, use_cache = True)
tokenizer_2.batch_decode(outputs)

['<|im_start|>system\nYou are an expert in text to table. The user will provide a query table and a list of retrieved documents that may or may not be related to the query. The query may be missing values and/or attributes. Your task is to:\n- Decide which documents can be used based on the query.\n- Fill the missing values *always* based on the documents.\n- Add extra attributes and their corresponding values, if needed, *always* based on the documents.\n- The available attributes are: name, priceRange, eatType, familyFriendly, near, customer rating, food, area\n<|im_end|>\n<|im_start|>user\nQuery: name|priceRange|familyFriendly|\nThe Hollow Bell Café|below 20|\n\nDocuments: 1. If you are looking for an inexpensive, family friendly restaurant, The Hollow Bell Café is the place to go.\n2. A family friendly restaurant, The Hollow Bell Café, is not expensive.\n3. The The Hollow Bell Café is an adult only cheat restaurant.\n4. If you are looking for an inexpensive, family friendly res
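
Because `batch_decode` returns the prompt together with the completion, it can help to cut out the assistant's answer and turn the pipe-separated table back into a dictionary. The sketch below is one possible way to do that; both helper names are hypothetical, the decoded string is a made-up example rather than real model output, and it assumes a two-line table (header row, value row) terminated by `<|im_end|>`.

```python
def extract_completion(decoded, eos_token="<|im_end|>"):
    """Return the text after the last assistant marker, up to the EOS token."""
    marker = "<|im_start|>assistant\n"
    completion = decoded.rsplit(marker, 1)[-1]
    return completion.split(eos_token, 1)[0].strip()


def parse_pipe_table(table_text):
    """Parse a header line and a value line separated by '|' into a dict."""
    lines = [line for line in table_text.splitlines() if line.strip()]
    header = lines[0].split("|")
    values = lines[1].split("|")
    # zip stops at the shorter row, so a trailing empty cell is ignored
    return dict(zip(header, values))


# Made-up decoded string, for illustration only
decoded = (
    "<|im_start|>system\n...\n<|im_end|>\n"
    "<|im_start|>user\nQuery: name|priceRange|\nThe Example Inn|\n<|im_end|>\n"
    "<|im_start|>assistant\n"
    "name|priceRange|familyFriendly\nThe Example Inn|cheap|yes<|im_end|>"
)
row = parse_pipe_table(extract_completion(decoded))
print(row)
```

Alternatively, decoding only `outputs[:, inputs["input_ids"].shape[1]:]` skips the prompt tokens before decoding, which avoids the string search entirely.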

In [None]:
if False:
	# Strongly discouraged: use Unsloth if possible
	from peft import AutoPeftModelForCausalLM
	from transformers import AutoTokenizer
	model = AutoPeftModelForCausalLM.from_pretrained(
		"lora_model", # YOUR MODEL YOU USED FOR TRAINING
		load_in_4bit = load_in_4bit,
	)
	tokenizer = AutoTokenizer.from_pretrained("lora_model")

### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [None]:
finetuned_lora_model_dir_16_bit = "../../models/finetuned_lora_model_16_bit"

if True: model.save_pretrained_merged(finetuned_lora_model_dir_16_bit, tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")

Found HuggingFace hub cache directory: /home/giorgos/.cache/huggingface/hub
Checking cache directory for required files...
Cache check failed: model.safetensors not found in local cache.
Not all required files found in cache. Will proceed with downloading.


Unsloth: Merging weights into 16bit:   0%|          | 0/1 [00:00<?, ?it/s]

model.safetensors:   0%|          | 0.00/3.09G [00:00<?, ?B/s]

Unsloth: Merging weights into 16bit: 100%|██████████| 1/1 [01:10<00:00, 70.56s/it]


### GGUF / llama.cpp Conversion
We now support saving to `GGUF` / `llama.cpp` natively! We clone `llama.cpp` and save to `q8_0` by default. Other methods such as `q4_k_m` are also allowed. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

[**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)

In [None]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change hf to your username!
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

# Save to multiple GGUF options - much faster if you want multiple!
if False:
	model.push_to_hub_gguf(
		"hf/model", # Change hf to your username!
		tokenizer,
		quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
		token = "", # Get a token at https://huggingface.co/settings/tokens
	)

Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in `llama.cpp` or a UI based system like `GPT4All`. You can install GPT4All by going [here](https://gpt4all.io/index.html).

**[NEW] Try 2x faster inference in a free Colab for Llama-3.1 8b Instruct [here](https://colab.research.google.com/drive/1T-YBVfnphoVc8E2E854qF3jdia2Ll2W2?usp=sharing)**

And we're done! If you have any questions about Unsloth, find a bug, want to keep up with the latest LLM news, need help, or want to join projects, feel free to join our [Discord](https://discord.gg/u54VK8m8tk) channel!

Some other links:
1. Zephyr DPO 2x faster [free Colab](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing)
2. Llama 7b 2x faster [free Colab](https://colab.research.google.com/drive/1lBzz5KeZJKXjvivbYvmGarix9Ao6Wxe5?usp=sharing)
3. TinyLlama 4x faster full Alpaca 52K in 1 hour [free Colab](https://colab.research.google.com/drive/1AZghoNBQaMDgWJpi4RbffGM1h6raLUj9?usp=sharing)
4. CodeLlama 34b 2x faster [A100 on Colab](https://colab.research.google.com/drive/1y7A0AxE3y8gdj4AVkl2aZX47Xu3P1wJT?usp=sharing)
5. Mistral 7b [free Kaggle version](https://www.kaggle.com/code/danielhanchen/kaggle-mistral-7b-unsloth-notebook)
6. We also did a [blog](https://huggingface.co/blog/unsloth-trl) with 🤗 HuggingFace, and we're in the TRL [docs](https://huggingface.co/docs/trl/main/en/sft_trainer#accelerate-fine-tuning-2x-using-unsloth)!
7. `ChatML` for ShareGPT datasets, [conversational notebook](https://colab.research.google.com/drive/1Aau3lgPzeZKQ-98h69CCu1UJcvIBLmy2?usp=sharing)
8. Text completions like novel writing [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing)
9. [**NEW**] We make Phi-3 Medium / Mini **2x faster**! See our [Phi-3 Medium notebook](https://colab.research.google.com/drive/1hhdhBa1j_hsymiW9m-WzxQtgqTH_NHqi?usp=sharing)
10. [**NEW**] We make Gemma-2 9b / 27b **2x faster**! See our [Gemma-2 9b notebook](https://colab.research.google.com/drive/1vIrqH5uYDQwsJ4-OO3DErvuv4pBgVwk4?usp=sharing)
11. [**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)
12. [**NEW**] We make Mistral NeMo 12B 2x faster and fit in under 12GB of VRAM! [Mistral NeMo notebook](https://colab.research.google.com/drive/17d3U-CAIwzmbDRqbZ9NnpHxCkmXB6LZ0?usp=sharing)

<div class="align-center">
	<a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
	<a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
	<a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a> Support our work if you can! Thanks!
</div>