## Load the Dataset

In [1]:
from datasets import load_dataset

ds = load_dataset("yahma/alpaca-cleaned", split="train").select(range(1000))
ds

Dataset({
    features: ['output', 'input', 'instruction'],
    num_rows: 1000
})

You can use a Hugging Face dataset or any other kind of dataset, including your own custom data.
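
As a minimal sketch of the custom-data case, the snippet below maps plain question/answer pairs onto the same `instruction`/`input`/`output` fields that the dataset above exposes. The `raw_pairs` records and the `to_alpaca` helper are hypothetical names introduced only for illustration.

```python
# Hypothetical custom data: plain question/answer pairs (not from the notebook).
raw_pairs = [
    {"question": "What is LoRA?", "answer": "A parameter-efficient fine-tuning method."},
    {"question": "Translate to French: hello", "answer": "bonjour"},
]


def to_alpaca(pair):
    """Map one custom record onto the instruction/input/output fields."""
    return {
        "instruction": pair["question"],
        "input": "",  # no extra context for these examples
        "output": pair["answer"],
    }


custom_records = [to_alpaca(p) for p in raw_pairs]
print(custom_records[0])
```

A list like `custom_records` has the same shape as `ds.to_list()`, so it could be written to JSON in the same way as the dataset used in this notebook.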

In [2]:
ds[0]

{'output': '1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and vegetables, lean protein, whole grains, and healthy fats. This helps to provide your body with the essential nutrients to function at its best and can help prevent chronic diseases.\n\n2. Engage in regular physical activity: Exercise is crucial for maintaining strong bones, muscles, and cardiovascular health. Aim for at least 150 minutes of moderate aerobic exercise or 75 minutes of vigorous exercise each week.\n\n3. Get enough sleep: Getting enough quality sleep is crucial for physical and mental well-being. It helps to regulate mood, improve cognitive function, and supports healthy growth and immune function. Aim for 7-9 hours of sleep each night.',
 'input': '',
 'instruction': 'Give three tips for staying healthy.'}

## Load the Model

In [3]:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B")
messages = [
    {"role": "user", "content": "what is the capital city of Egypt?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.


 الجزائ
 الجزائ
 الجزائ
 الجزائ
 الجزائ
 الجزائ
 الجزائ
 الجزائ
 الجزائ
 الجزائ
 الجزائ
 الجزائ
 الجزائ
 الجزائ
 الجزائ
 الجزائ
 الجزائ
 الجزائ
 الجزائ
 الجزائ



You can use either an instruction-tuned model or a base model.

## Install LLaMA-Factory

In [4]:
%%capture
!git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
%cd LLaMA-Factory
!pip install -e ".[torch,metrics]" --no-build-isolation
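
Because the install cell runs under `%%capture`, a failed install would be silent. A small check such as the following sketch can confirm that the `llamafactory-cli` command used later is on the `PATH`; the `cli_available` helper is a hypothetical name, not part of LLaMA-Factory.

```python
import shutil


def cli_available(command="llamafactory-cli"):
    """Return True if the given command can be found on the PATH."""
    return shutil.which(command) is not None


# Prints True once the editable install above has succeeded.
print(cli_available())
```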

## Optional: Give access to wandb to monitor the training process

In [8]:
from google.colab import userdata
import wandb

wandb.login(key=userdata.get('w&b_token'))

wandb: No netrc file found, creating one.
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Currently logged in as: mohamedelsayad866 (mohamedelsayad866-met) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


True

## Save the Data in a JSON File

LLaMA-Factory needs a description of the dataset in the following structure, which maps its column roles to the field names in your JSON file:

```json
{
  "dataset_name": {
    "file_name": "data.json",
    "columns": {
      "prompt": "instruction",
      "query": "input",
      "response": "output",
      "system": "system",
      "history": "history"
    }
  }
}
```

In [7]:
import json

data_list = ds.to_list()

json_filename = "/content/drive/MyDrive/dataset.json"

with open(json_filename, 'w') as f:
    json.dump(data_list, f, indent=4)

print(f"Dataset saved to {json_filename}")

Dataset saved to /content/drive/MyDrive/dataset.json
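
Before registering the file, it can help to confirm that it loads back and that every record carries the expected fields. The sketch below does this on a temporary demo file; `check_alpaca_file` and `REQUIRED_KEYS` are hypothetical names, and in the notebook you would pass `json_filename` instead of the demo path.

```python
import json
import os
import tempfile

REQUIRED_KEYS = {"instruction", "input", "output"}


def check_alpaca_file(path):
    """Return the number of records, raising if any record lacks a required key."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    for i, record in enumerate(records):
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            raise ValueError(f"record {i} is missing keys: {sorted(missing)}")
    return len(records)


# Demo on a temporary file so the sketch runs on its own.
demo_path = os.path.join(tempfile.mkdtemp(), "dataset.json")
with open(demo_path, "w", encoding="utf-8") as f:
    json.dump([{"instruction": "Say hi.", "input": "", "output": "Hi!"}], f)

num_records = check_alpaca_file(demo_path)
print(num_records)
```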


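**Now open the following file and register your dataset in it:**

LLaMA-Factory/data/dataset_info.json

The entry name must match the `dataset` value in the training configuration, and `file_name` must point to the JSON file you saved. The entry below shows the format for a ranking (preference) dataset with `chosen` and `rejected` columns; for the instruction data saved above, use the `prompt`/`query`/`response` mapping shown earlier instead.

```json
{
  "dataset_name": {
    "file_name": "data.json",
    "ranking": true,
    "columns": {
      "prompt": "instruction",
      "query": "input",
      "chosen": "chosen",
      "rejected": "rejected"
    }
  }
}
```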
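
Instead of editing the file by hand, the entry can also be added from Python. This is only a sketch under stated assumptions: `register_dataset` is a hypothetical helper, the demo writes to a temporary file rather than the real `LLaMA-Factory/data/dataset_info.json`, and `dataset.json` is an assumed file name relative to the data folder. The entry name `alpaca-cleaned` is chosen to match the `dataset` value in the training configuration.

```python
import json
import os
import tempfile


def register_dataset(info_path, name, file_name, columns):
    """Add or replace one entry in a dataset_info.json-style file."""
    if os.path.exists(info_path):
        with open(info_path, encoding="utf-8") as f:
            info = json.load(f)
    else:
        info = {}
    info[name] = {"file_name": file_name, "columns": columns}
    with open(info_path, "w", encoding="utf-8") as f:
        json.dump(info, f, indent=2, ensure_ascii=False)
    return info


# Demo against a temporary file so the sketch runs on its own.
demo_info_path = os.path.join(tempfile.mkdtemp(), "dataset_info.json")
info = register_dataset(
    demo_info_path,
    name="alpaca-cleaned",     # matches `dataset:` in the YAML config
    file_name="dataset.json",  # assumed file name inside the data folder
    columns={"prompt": "instruction", "query": "input", "response": "output"},
)
print(json.dumps(info, indent=2))
```

Loading the existing file first keeps the other registered datasets intact, which matters because `dataset_info.json` ships with many predefined entries.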

## Add the configuration file to the following path:

/content/LLaMA-Factory/examples/train_lora/

Edit the following configuration as needed.

In [12]:
%%writefile /content/LLaMA-Factory/examples/train_lora/alpaca-cleaned.yaml

### model
model_name_or_path: Qwen/Qwen2.5-1.5B
trust_remote_code: true

### method
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 64
lora_target: all

### dataset
dataset: alpaca-cleaned
template: qwen
cutoff_len: 1024
# max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16
dataloader_num_workers: 4

### output
output_dir: saves/qwen/lora/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true
save_only_model: false
report_to: wandb  # choices: [none, wandb, tensorboard, swanlab, mlflow]

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
resume_from_checkpoint: null

### eval
# eval_dataset: alpaca_en_demo
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500


Overwriting /content/LLaMA-Factory/examples/train_lora/alpaca-cleaned.yaml
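
To see how the batch and schedule settings interact, the following sketch estimates the training schedule implied by the configuration above for the 1000 examples loaded earlier. It assumes a single GPU and a simple floor-division step count; the exact rounding depends on the trainer version, so treat the numbers as approximate. `estimate_schedule` is a hypothetical helper.

```python
import math


def estimate_schedule(num_examples, val_size, per_device_batch,
                      grad_accum, epochs, warmup_ratio, num_devices=1):
    """Roughly estimate optimizer steps for a run (rounding is an assumption)."""
    train_examples = num_examples - int(round(num_examples * val_size))
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = max(train_examples // effective_batch, 1)
    total_steps = math.ceil(epochs * steps_per_epoch)
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "train_examples": train_examples,
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


# Values taken from the YAML above and the 1000-row subset.
plan = estimate_schedule(
    num_examples=1000, val_size=0.1, per_device_batch=1,
    grad_accum=8, epochs=3.0, warmup_ratio=0.1,
)
print(plan)
```

Under this estimate the run ends well before step 500, so `save_steps: 500` and `eval_steps: 500` would not trigger during training. Lowering both values is one option if you want intermediate checkpoints and evaluation points.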


## Finally, use this command to train

Replace the YAML file path with your own.

In [13]:
!cd /content/LLaMA-Factory && llamafactory-cli train /content/LLaMA-Factory/examples/train_lora/alpaca-cleaned.yaml

2025-09-20 13:05:46.957871: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1758373546.978953   12718 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1758373546.986099   12718 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1758373547.001905   12718 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1758373547.001933   12718 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1758373547.001937   12718 computation_placer.cc:177] computation placer alr

## Load the adapter from the path you selected in the yaml file

In [14]:
model.load_adapter("saves/qwen/lora/sft")
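
If `load_adapter` fails, a common cause is that the output directory does not contain the adapter files yet. The sketch below checks for the files a PEFT adapter is normally saved as (`adapter_config.json` plus `adapter_model.safetensors` or `adapter_model.bin`). `adapter_files_present` is a hypothetical helper, and the demo uses a temporary directory; in the notebook you would pass `"saves/qwen/lora/sft"`.

```python
import os
import tempfile


def adapter_files_present(adapter_dir):
    """Return True if the directory holds an adapter config and weight file."""
    has_config = os.path.isfile(os.path.join(adapter_dir, "adapter_config.json"))
    has_weights = any(
        os.path.isfile(os.path.join(adapter_dir, name))
        for name in ("adapter_model.safetensors", "adapter_model.bin")
    )
    return has_config and has_weights


# Demo: an empty directory fails the check, a populated one passes.
demo_dir = tempfile.mkdtemp()
empty_result = adapter_files_present(demo_dir)
for name in ("adapter_config.json", "adapter_model.safetensors"):
    open(os.path.join(demo_dir, name), "w").close()
filled_result = adapter_files_present(demo_dir)
print(empty_result, filled_result)
```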

## Test the model after fine-tuning

In [15]:
messages = [
    {"role": "user", "content": "what is the capital city of Egypt?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.


The capital city of Egypt is Cairo. It is located in the Nile Delta region and is the largest city in the country, with a population of over 16 million people. Cairo is also the


**That's It!**

Eng. Mohamed Elsayad