# Llama_3_2_3B_law_finetuning_with_Unsloth

❤️ Created by [Ahmed Boulahia](https://www.linkedin.com/in/ahmed-boulahia/).

In [None]:
%%capture
!pip install unsloth
# Also get the latest nightly Unsloth!
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

* We support Llama, Mistral, Phi-3, Gemma, Yi, DeepSeek, Qwen, TinyLlama, Vicuna, Open Hermes etc
* We support 16bit LoRA or 4bit QLoRA. Both 2x faster.
* `max_seq_length` can be set to anything, since we do automatic RoPE Scaling via [kaiokendev's](https://kaiokendev.github.io/til) method.
* [**NEW**] We make Gemma-2 9b / 27b **2x faster**! See our [Gemma-2 9b notebook](https://colab.research.google.com/drive/1vIrqH5uYDQwsJ4-OO3DErvuv4pBgVwk4?usp=sharing)
* [**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)
* [**NEW**] We make Mistral NeMo 12B 2x faster and fit in under 12GB of VRAM! [Mistral NeMo notebook](https://colab.research.google.com/drive/17d3U-CAIwzmbDRqbZ9NnpHxCkmXB6LZ0?usp=sharing)
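
To put the 16-bit versus 4-bit choice in perspective, the weights alone need roughly parameters × bits / 8 bytes. The sketch below is back-of-the-envelope arithmetic under stated assumptions: the helper name `weight_memory_gb` and the 3-billion parameter count are illustrative, and the estimate ignores quantization metadata, activations, optimizer state, and the LoRA adapters themselves.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    return n_params * bits_per_param / 8 / 1e9

n_params = 3e9  # assumed round figure for a "3B" model
fp16_gb = weight_memory_gb(n_params, 16)
int4_gb = weight_memory_gb(n_params, 4)
print(f"16-bit: {fp16_gb:.1f} GB, 4-bit: {int4_gb:.1f} GB, ratio: {fp16_gb / int4_gb:.0f}x")
```

The 4x ratio is also why the pre-quantized 4-bit checkpoints download faster than their 16-bit counterparts.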

In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = False # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",      # Llama-3.1 2x faster
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",    # 4bit for 405b!
    "unsloth/Mistral-Small-Instruct-2409",     # Mistral 22b 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3.5-mini-instruct",           # Phi-3.5 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",            # Gemma 2x faster!

    "unsloth/Llama-3.2-1B-bnb-4bit",           # NEW! Llama 3.2 models
    "unsloth/Llama-3.2-1B-Instruct-bnb-4bit",
    "unsloth/Llama-3.2-3B-bnb-4bit",
    "unsloth/Llama-3.2-3B-Instruct-bnb-4bit",
] # More models at https://huggingface.co/unsloth
hf_token = ''
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "silma-ai/SILMA-9B-Instruct-v1.0", # or choose "unsloth/Llama-3.2-1B-Instruct"
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    token = hf_token, # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
Unsloth: If you want to finetune Gemma 2, install flash-attn to make it faster!
To install flash-attn, do the below:

pip install --no-deps --upgrade "flash-attn>=2.6.3"
==((====))==  Unsloth 2025.2.12: Fast Gemma2 patching. Transformers: 4.48.3.
   \\   /|    GPU: NVIDIA L4. Max memory: 22.161 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.9. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors.index.json:   0%|          | 0.00/39.1k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/5 [00:00<?, ?it/s]

model-00001-of-00005.safetensors:   0%|          | 0.00/3.91G [00:00<?, ?B/s]

model-00002-of-00005.safetensors:   0%|          | 0.00/3.96G [00:00<?, ?B/s]

model-00003-of-00005.safetensors:   0%|          | 0.00/3.96G [00:00<?, ?B/s]

model-00004-of-00005.safetensors:   0%|          | 0.00/3.96G [00:00<?, ?B/s]

model-00005-of-00005.safetensors:   0%|          | 0.00/2.69G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/168 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/46.9k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/4.24M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/636 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.5M [00:00<?, ?B/s]

RuntimeError: Unsloth: The tokenizer `silma-ai/SILMA-9B-Instruct-v1.0`
does not have a {% if add_generation_prompt %} for generation purposes.
Please file a bug report immediately - thanks!
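
The error says the tokenizer's chat template has no `{% if add_generation_prompt %}` branch. A minimal sketch of that kind of check follows. It is a plain string test written for illustration, not Unsloth's actual validation code, and the two template strings are made-up examples. With a real tokenizer, the string to inspect would be `tokenizer.chat_template`.

```python
def has_generation_prompt_branch(chat_template):
    """Return True if the Jinja chat template mentions add_generation_prompt."""
    return "add_generation_prompt" in (chat_template or "")

# Hypothetical templates, for illustration only
template_without = "{% for m in messages %}{{ m['role'] }}: {{ m['content'] }}\n{% endfor %}"
template_with = template_without + "{% if add_generation_prompt %}assistant: {% endif %}"

print(has_generation_prompt_branch(template_without))
print(has_generation_prompt_branch(template_with))
```

A template that fails this test gives the model no marker for where the assistant turn starts, which is the situation the error message describes.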

In [None]:
!pip install -U bitsandbytes


In [None]:
!pip install -q bitsandbytes datasets accelerate loralib
!pip install -q git+https://github.com/huggingface/transformers.git@main git+https://github.com/huggingface/peft.git

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


In [None]:
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

peft_model_id = "hatim00101/SILMA-9B-Instruct-v1.0_poems_v1.2"
config = PeftConfig.from_pretrained(peft_model_id, token=hf_token) # Provide the Hugging Face token here

# Load the base model in 8-bit. Passing `load_in_8bit=True` directly is deprecated,
# so the setting goes through a BitsAndBytesConfig instead.
model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    return_dict=True,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map='auto',
    attn_implementation='eager',
)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

# Load the LoRA adapter on top of the base model, providing the token again
model = PeftModel.from_pretrained(model, peft_model_id, token=hf_token)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


adapter_model.safetensors:   0%|          | 0.00/71.6M [00:00<?, ?B/s]

We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [None]:

# Load the PEFT model, providing the token again
model = PeftModel.from_pretrained(model, peft_model_id, token=hf_token)

In [None]:
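import os

# Disable TorchDynamo completely. The variable is spelled TORCHDYNAMO_DISABLE,
# and it is set before importing torch so that it is picked up reliably.
os.environ["TORCHDYNAMO_DISABLE"] = "1"

import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer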

peft_model_id = "hatim00101/SILMA-9B-Instruct-v1.0_poems_v1.2"
config = PeftConfig.from_pretrained(peft_model_id, token=hf_token)

# Specify an offload directory
offload_dir = "offload_folder"  # Create this directory if it doesn't exist
os.makedirs(offload_dir, exist_ok=True)

# Load the base model without 8-bit quantization, and using the same device_map as during training
# and pass the offload_dir to the from_pretrained method
model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    return_dict=True,
    device_map='auto',
    trust_remote_code=True,
    offload_folder=offload_dir  # Pass the offload directory here
) # trust_remote_code=True for Silma models
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

# Load the Peft model, and pass the offload_dir to the from_pretrained method
model = PeftModel.from_pretrained(
    model,
    peft_model_id,
    token=hf_token,
    offload_folder=offload_dir  # Pass the offload directory here
)

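# device_map='auto' has already placed the weights, so the model is not moved again:
# calling .to('cuda') on a model with offloaded modules raises an error in accelerate.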

# Prepare the input batch correctly
test_prompt = """
### Input:
الموضوع: عَفـا جـانِبُ الْبَطْحاءِ مِنْ إِبْنِ هاشِمٍ
البحر: طويل
القافية: م

### Response:
"""
batch = tokenizer(test_prompt, return_tensors='pt').to('cuda')

# Generate output with correct settings
output_tokens = model.generate(
    **batch,
    max_new_tokens=200,
    temperature=1.0,  # Default temperature: logits are left unscaled
    top_p=0.7,  # Use nucleus sampling
    do_sample=True,  # Enable sampling for Top-P to work
    no_repeat_ngram_size=2,  # Prevent repeated phrases
    eos_token_id=tokenizer.eos_token_id  # Ensure stopping
)

# Decode and print the output
print('\n\n', tokenizer.decode(output_tokens[0], skip_special_tokens=True))

Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s]



KeyError: 'base_model.model.model.model.layers.35.input_layernorm'

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 32, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2025.2.12 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.
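
The number of trainable parameters follows directly from the rank and the shapes of the adapted layers. The sketch below is an illustration under assumptions: the hidden, key/value, and MLP widths are not given anywhere in this notebook and were picked because, with `r = 32` and 28 layers, they reproduce the 80,740,352 trainable parameters that the training banner reports further down. It also assumes the query and output projections are square.

```python
def lora_trainable_params(r, n_layers, hidden, kv_dim, mlp_dim):
    """Count LoRA parameters when adapting all seven projection matrices per layer.

    Each adapted Linear(d_in, d_out) gains two matrices, A (r x d_in) and
    B (d_out x r), i.e. r * (d_in + d_out) extra parameters.
    """
    shapes = {
        "q_proj": (hidden, hidden),
        "k_proj": (hidden, kv_dim),
        "v_proj": (hidden, kv_dim),
        "o_proj": (hidden, hidden),
        "gate_proj": (hidden, mlp_dim),
        "up_proj": (hidden, mlp_dim),
        "down_proj": (mlp_dim, hidden),
    }
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return n_layers * per_layer

# Assumed dimensions (not stated in the notebook); r and n_layers come from the cells above.
total = lora_trainable_params(r=32, n_layers=28, hidden=3584, kv_dim=512, mlp_dim=18944)
print(f"{total:,}")
```

Because the count is linear in `r`, halving the rank halves both the trainable parameters and the size of the saved adapter.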


<a name="Data"></a>
### Data Prep
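The original Unsloth template uses the Alpaca dataset from [yahma](https://huggingface.co/datasets/yahma/alpaca-cleaned), a filtered version of the 52K-example original [Alpaca dataset](https://crfm.stanford.edu/2023/03/13/alpaca.html). This notebook replaces that section with its own data prep: it loads the Arabic poetry dataset `faisalq/poemsDataset` (2,090,907 hemistich-pair rows), keeps a single meter, and groups the verses of each poem into one training example.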

**[NOTE]** To train only on completions (ignoring the user's input) read TRL's docs [here](https://huggingface.co/docs/trl/sft_trainer#train-on-completions-only).

**[NOTE]** Remember to add the **EOS_TOKEN** to the tokenized output!! Otherwise you'll get infinite generations!

If you want to use the `llama-3` template for ShareGPT datasets, try our conversational [notebook](https://colab.research.google.com/drive/1XamvWYinY6FOSX9GLvnqSjjsNflxdhNc?usp=sharing).

For text completions like novel writing, try this [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing).

In [None]:
import re
from datasets import load_dataset
from transformers import AutoTokenizer

# Load dataset
dataset = load_dataset("faisalq/poemsDataset")

README.md:   0%|          | 0.00/25.0 [00:00<?, ?B/s]

poemsDataset.csv:   0%|          | 0.00/823M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/2090907 [00:00<?, ? examples/s]

In [None]:
import pandas as pd

# Convert the train split to a pandas DataFrame
df = dataset['train'].to_pandas()



KeyError: 'البحر : طويل'

In [None]:
df[df['poem_title'] == 'لعمري لقد طالت بواسط ليلتي'].sample(5)


| Unnamed: 0 | poem_title | first_hemistich | second_hemistich | poet | meter | sub_meter | البحر | جزء البحر | era | rhyme | قافية | type_en | type_ar | link | gender |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1839568 | لعمري لقد طالت بواسط ليلتي | حفظن ذمار الود لما تذمرت | بنو أرضنا منحفظ كل ذمار | حمزة قفطان | taweel | complete | طويل | تام | Modern | raa | ر | | | https://poetry.dctabudhabi.ae/#/poems/102039-%... | m |
| 1839610 | لعمري لقد طالت بواسط ليلتي | لعمرو أبي الأيام رنقن مشربي | وأكثرن في طرق الحياة عثاري | حمزة قفطان | taweel | complete | طويل | تام | Modern | raa | ر | | | https://poetry.dctabudhabi.ae/#/poems/102039-%... | m |
| 1839585 | لعمري لقد طالت بواسط ليلتي | يقولون حفظ النوع في الناس واجب | على الناس من باد هناك وقاري | حمزة قفطان | taweel | complete | طويل | تام | Modern | raa | ر | | | https://poetry.dctabudhabi.ae/#/poems/102039-%... | m |
| 1839613 | لعمري لقد طالت بواسط ليلتي | وإن جل ما بي أن ينهنه بالكبا | شدا باسما في الطرس شدو هزار | حمزة قفطان | taweel | complete | طويل | تام | Modern | raa | ر | | | https://poetry.dctabudhabi.ae/#/poems/102039-%... | m |
| 1839576 | لعمري لقد طالت بواسط ليلتي | إذا لا رتضيت العيش غضا بقربها | وألبست ثوب العز غير معار | حمزة قفطان | taweel | complete | طويل | تام | Modern | raa | ر | | | https://poetry.dctabudhabi.ae/#/poems/102039-%... | m |


In [None]:
from collections import defaultdict
EOS_TOKEN = tokenizer.eos_token  # Define EOS token

# Cleaning function
def clean_text(text):
    if text is None:
        return ""
    text = re.sub(r"[^\u0600-\u06FF\s]", "", text)  # Keep only Arabic characters
    text = re.sub(r"\s+", " ", text).strip()  # Normalize spaces
    return text

# Select relevant columns and clean text
dataset = dataset.map(lambda example: {
    'first_hemistich': clean_text(example['first_hemistich']),
    'second_hemistich': clean_text(example['second_hemistich']),
    'البحر': clean_text(example['البحر']),
    'sub_meter': clean_text(example['sub_meter']),
    'قافية': clean_text(example['قافية']),
})

# Filter dataset to keep only poems with a specific meter
selected_meter = "طويل"  # Can be changed to any other meter
dataset = dataset.filter(lambda example: example['البحر'] == selected_meter)

# Formatting prompt structure
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.

### Instruction:
Generate an Arabic poem based on the given theme, meter, and rhyme.

### Input:
الموضوع: {poem_title}
البحر: {البحر}
القافية: {قافية}

### Response:
{first_hemistich} {second_hemistich}""" + EOS_TOKEN

def format_poetry(examples):
    poem_dict = defaultdict(list)
    poem_meta = {}  # poem_title -> (meter, rhyme) of its first verse

    # Group all verses (أبيات) by poem_title. With batched=True the grouping is done
    # per batch, so a poem that straddles a batch boundary yields two examples.
    for poem_title, meter, rhyme, first_hemistich, second_hemistich in zip(
        examples["poem_title"], examples["البحر"], examples["قافية"],
        examples["first_hemistich"], examples["second_hemistich"]
    ):
        poem_dict[poem_title].append(f"{first_hemistich} {second_hemistich}")
        poem_meta.setdefault(poem_title, (meter, rhyme))

    # Format the full poem in the required structure
    formatted_texts = []
    for poem_title, أبيات in poem_dict.items():
        meter, rhyme = poem_meta[poem_title]

        # End every example with EOS_TOKEN so the model learns where a poem stops
        text = f"### Input:\nالموضوع: {poem_title}  \nالبحر: {meter}  \nالقافية: {rhyme}  \n\n### Response:\n" + "\n".join(أبيات) + EOS_TOKEN
        formatted_texts.append(text)

    return {"text": formatted_texts}

# Apply formatting
dataset = dataset.map(format_poetry, batched=True, remove_columns=dataset["train"].column_names)

# Sample Output
print(dataset["train"][0]["text"])


Map:   0%|          | 0/2090907 [00:00<?, ? examples/s]

Filter:   0%|          | 0/2090907 [00:00<?, ? examples/s]

Map:   0%|          | 0/442763 [00:00<?, ? examples/s]

### Input:
الموضوع: عَفـا جـانِبُ الْبَطْحاءِ مِنْ إِبْنِ هاشِمٍ  
البحر: طويل  
القافية: م  

### Response:
عفا جانب البطحاء من إبن هاشم وجاور لحدا خارجا في الغماغم
دعته المنايا دعوة فأجابها وما تركت في الناس مثل ابن هاشم
عشية راحوا يحملون سريره تعاوره أصحابه في التزاحم
فإن يك غالته المنايا وريبها فقد كان معطاء كثير التراحم
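
It is worth checking what `clean_text` does to non-Arabic values. The snippet below repeats the function from the cell above so that it runs on its own; the sample strings are illustrative. Because the pattern deletes everything outside U+0600 to U+06FF and whitespace, a Latin-script value such as the `complete` seen in the `sub_meter` column is reduced to an empty string, while Arabic text passes through.

```python
import re

def clean_text(text):
    if text is None:
        return ""
    text = re.sub(r"[^\u0600-\u06FF\s]", "", text)  # drop everything outside the Arabic block
    text = re.sub(r"\s+", " ", text).strip()        # collapse runs of whitespace
    return text

print(repr(clean_text("complete")))          # Latin-only input
print(repr(clean_text("  طويل  (tawil) ")))  # mixed input
print(repr(clean_text(None)))                # missing value
```

Since `sub_meter` is not used in the prompt, this has no effect on the training text here, but it matters if that column is ever added to the template.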


In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 38302
    })
})

In [None]:
dataset.shape

{'train': (38302, 1)}

In [None]:
print(dataset['train']['text'][22])

### Input:
الموضوع: ألا رُبَّ ذِي فَقْرٍ وَإِنْ كانَ مُثْرِياً  
البحر: طويل  
القافية: ر  

### Response:
ألا رب ذي فقر وإن كان مثريا يروح عليه شأوه وأباعره
وكم مخرب مجدا تولى بناءه سواه فأودى عزه ومفاخره
تحيف منه اللؤم أكناف مجده فقد خرب البيت الذي هو عامره
وزال عموداه ورثت حباله وأصلح أولاه وأفسد أخره


<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We run 60 steps to keep the demo short; for a full run, set `num_train_epochs=1` and remove the `max_steps` cap. We also support TRL's `DPOTrainer`!

### Key Concepts:
1. **Dataset Size**: The total number of data samples (rows) in your dataset.
   - Let’s call this `N`, the number of samples in your dataset.

2. **Batch Size**: The number of data samples processed before the model updates its parameters.
   - Let’s call this `B`, the batch size.

3. **Epoch**: One epoch means that the model has seen **all** the training data once.
   - So, in one epoch, the model processes all `N` samples in batches of size `B`.

4. **Steps per Epoch**: The number of batches required to process the entire dataset once.
   - This is calculated as:
     \[
     \text{steps per epoch} = \frac{N}{B}
     \]
   - For example, if your dataset has 10,000 samples and your batch size is 100, then you’ll need 100 steps to complete one epoch.

5. **Max Steps**: The total number of steps (iterations) the training will run for.
   - If `max_steps` is specified, it limits the total number of steps the model will take, regardless of the number of epochs.
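
To see what a step cap means in practice, it helps to convert steps back into examples. The sketch below assumes a single GPU and that one step is one optimizer update over `per_device_train_batch_size × gradient_accumulation_steps` examples; the helper name is illustrative. The figures plugged in are the ones this notebook uses (60 steps, batch size 2, accumulation 4) and the 38,302 training examples reported for the formatted dataset.

```python
def examples_seen(max_steps: int, per_device_batch: int, grad_accum: int, num_devices: int = 1) -> int:
    """Number of training examples consumed after max_steps optimizer updates."""
    return max_steps * per_device_batch * grad_accum * num_devices

seen = examples_seen(max_steps=60, per_device_batch=2, grad_accum=4)
fraction = seen / 38_302  # dataset size taken from the notebook's output
print(f"{seen} examples, about {fraction:.1%} of one epoch")
```

A 60-step run therefore touches only a small slice of the data, which is worth keeping in mind when reading the loss values.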



In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset['train'],
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

Tokenizing train dataset (num_proc=2):   0%|          | 0/38302 [00:00<?, ? examples/s]


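Given a dataset of 1,642 samples, `max_steps` for a target number of epochs can be worked out as follows (the 38,302-example poetry dataset in this notebook follows the same arithmetic, giving 38,302 / 8 ≈ 4,787 steps per epoch):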

1. **Effective Batch Size**: The `per_device_train_batch_size` is `2` and `gradient_accumulation_steps` is `4`, so the effective batch size is:
   \[
   \text{effective batch size} = 2 \times 4 = 8
   \]

2. **Steps per Epoch**: To calculate how many steps it takes to go through the entire dataset (1 epoch):
   \[
   \text{steps per epoch} = \frac{1642}{8} \approx 205.25 \quad \text{(round down to 205 steps)}
   \]

3. **Best Max Steps**:
   - If you want to train for **1 epoch**, set `max_steps = 205`.
   - For **multiple epochs**, multiply by the number of desired epochs. For example, if you want to train for 3 epochs:
     \[
     \text{max\_steps} = 205 \times 3 = 615
     \]

### Updated Code:
If you want to train for 1 epoch, use:
```python
max_steps = 205
```

For 3 epochs:
```python
max_steps = 615
```

Adjust `max_steps` depending on how many epochs you want to train for!
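
The same calculation can be written as a small helper. It is a sketch that assumes one GPU and rounds down as the text does; the exact count used by the trainer can differ by one depending on how the final partial batch is handled. It is evaluated for a small 1,642-example dataset and for the 38,302 examples that the trainer reports for this notebook's dataset.

```python
def max_steps_for_epochs(n_examples: int, per_device_batch: int, grad_accum: int, epochs: int = 1) -> int:
    """Optimizer steps for the requested number of epochs (partial final step dropped)."""
    effective_batch = per_device_batch * grad_accum
    steps_per_epoch = n_examples // effective_batch  # round down, as in the text above
    return steps_per_epoch * epochs

for n in (1_642, 38_302):
    one_epoch = max_steps_for_epochs(n, 2, 4)
    three_epochs = max_steps_for_epochs(n, 2, 4, epochs=3)
    print(f"N={n}: 1 epoch = {one_epoch} steps, 3 epochs = {three_epochs} steps")
```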

In [None]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.748 GB.
2.635 GB of memory reserved.


In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 38,302 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 80,740,352




| Step | Training Loss |
|---|---|
| 1 | 6.2137 |
| 2 | 6.7151 |
| 3 | 6.4346 |
| 4 | 6.4798 |
| 5 | 6.4601 |
| 6 | 6.65 |
| 7 | 6.2376 |
| 8 | 6.0847 |
| 9 | 6.235 |
| 10 | 5.7892 |


In [None]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

706.033 seconds used for training.
11.77 minutes used for training.
Peak reserved memory = 8.459 GB.
Peak reserved memory for training = 2.902 GB.
Peak reserved memory % of max memory = 57.357 %.
Peak reserved memory for training % of max memory = 19.677 %.


<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!
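
Because only the adapter matrices are written, the saved file is small compared with the base model. As a rough check, assuming the adapters are stored as 32-bit floats (an assumption; the notebook does not state the dtype) and ignoring the file header, the 80,740,352 trainable parameters reported during training work out to roughly the 323M `adapter_model.safetensors` upload shown in the output below.

```python
def adapter_size_mb(trainable_params: int, bytes_per_param: int = 4) -> float:
    """Approximate adapter file size in decimal megabytes, ignoring the file header."""
    return trainable_params * bytes_per_param / 1e6

size_mb = adapter_size_mb(80_740_352)  # trainable-parameter count from the training banner
print(f"about {size_mb:.0f} MB")
```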

In [None]:
model.save_pretrained("lora_model") # Local saving
tokenizer.save_pretrained("lora_model")
model.push_to_hub("hatim00101/DeepseekB-LoRA", token = hf_token) # Online saving
tokenizer.push_to_hub("hatim00101/DeepseekB-LoRA", token = hf_token) # Online saving

README.md:   0%|          | 0.00/593 [00:00<?, ?B/s]

  0%|          | 0/1 [00:00<?, ?it/s]

adapter_model.safetensors:   0%|          | 0.00/323M [00:00<?, ?B/s]

Saved model to https://huggingface.co/hatim00101/DeepseekB-LoRA


  0%|          | 0/1 [00:00<?, ?it/s]

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

In [None]:
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-Instruct-bnb-4bit", # YOUR MODEL YOU USED FOR TRAINING
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Answer the following question in Arabic:

### Input:
{}

### Response:
{}"""

inputs = tokenizer(
[
    alpaca_prompt.format(
        "ما هي أعراض مرض السكري؟", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)

==((====))==  Unsloth 2024.10.3: Fast Llama patching. Transformers = 4.44.2.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.4.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Answer the following question in Arabic:

### Input:
ما هي أعراض مرض السكري؟

### Response:
الجروح أو الأورام التي تظهر في الأطراف القريبة من الجلد، وعدم القدرة على التغلب على التهاب الأوعية الدموية، وعدم القدرة على التغلب على التهاب القلب، وعدم القدرة على التغلب على التهاب الجهاز الهضمي، وعدم القدرة على التغلب على التهاب الساقين، وعدم القدرة على التغلب على 

In [None]:
from transformers import AutoTokenizer, TextStreamer
FastLanguageModel.for_inference(model)  # ✅ Enable fast inference

# ✅ Define the test prompt (example)
test_prompt = """### Input:
الموضوع: عَفـا جـانِبُ الْبَطْحاءِ مِنْ إِبْنِ هاشِمٍ
البحر: طويل
القافية: م

### Response:
"""

'''
عفا جانب البطحاء من إبن هاشم وجاور لحدا خارجا في الغماغم
دعته المنايا دعوة فأجابها وما تركت في الناس مثل ابن هاشم
عشية راحوا يحملون سريره تعاوره أصحابه في التزاحم
فإن يك غالته المنايا وريبها فقد كان معطاء كثير التراحم
'''


# ✅ Tokenize input
inputs = tokenizer([test_prompt], return_tensors="pt").to("cuda")

# ✅ Generate text
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer=text_streamer, max_new_tokens=128)



<｜begin▁of▁sentence｜>### Input:
الموضوع: عَفـا جـانِبُ الْبَطْحاءِ مِنْ إِبْنِ هاشِمٍ  
البحر: طويل  
القافية: م  

### Response:
عافا جانب الباطح من ابن هاشم وعافا جانب الباطح من ابن هاشم
وكم ذا جوته بخري جانب وكم ذا جوته بخري جانب
فهل من كأن عافا جانب الباطح من ابن هاشم يعافى بخري جانب
فهل من كأن عافا جانب الباطح من ابن هاشم يعافى بخري جانب
وكم ذا جوته
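
A fine-tuned model is generally expected to behave best when the inference prompt matches the training text character for character. The training formatter in this notebook puts two spaces before each newline in the `### Input:` block, whereas the hand-written `test_prompt` above does not. The helper below is a hypothetical convenience (its name is not from the original notebook) that rebuilds the prompt from the same template so the two cannot drift apart. Whether the whitespace difference matters for this particular model has not been measured here.

```python
def build_poem_prompt(title: str, meter: str, rhyme: str) -> str:
    """Build an inference prompt with the same layout as the training text."""
    return (
        f"### Input:\nالموضوع: {title}  \nالبحر: {meter}  \nالقافية: {rhyme}  \n\n"
        "### Response:\n"
    )

prompt = build_poem_prompt("عَفـا جـانِبُ الْبَطْحاءِ مِنْ إِبْنِ هاشِمٍ", "طويل", "م")
print(prompt)
```

The resulting string can be passed to the tokenizer in place of `test_prompt`.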


In [None]:
# Tokenize input for the model (reuses `test_prompt` from the previous cell)
inputs = tokenizer([test_prompt], return_tensors="pt").to("cuda")

# Stream and generate output
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer=text_streamer, max_new_tokens=128)

To load the LoRA adapters for inference instead of the base model, pass the adapter repository (or the local `lora_model` directory saved above) as `model_name`:

In [None]:
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "AhmedBou/Arabic-Medicine-Meta-Llama-3.2-3B-LoRA", # YOUR MODEL YOU USED FOR TRAINING
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Answer the following question in Arabic:

### Input:
{}

### Response:
{}"""

inputs = tokenizer(
[
    alpaca_prompt.format(
        "ما هي أعراض مرض السكري؟", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)

==((====))==  Unsloth 2024.10.3: Fast Llama patching. Transformers = 4.44.2.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.4.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


adapter_model.safetensors:   0%|          | 0.00/97.3M [00:00<?, ?B/s]

<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Answer the following question in Arabic:

### Input:
ما هي أعراض مرض السكري؟

### Response:
زيادة معدل إنتاج السكر في الدم.<|eot_id|>


### GGUF / llama.cpp Conversion
Saving to `GGUF` / `llama.cpp` is now supported natively: we clone `llama.cpp` and save to `q8_0` by default, and other methods such as `q4_k_m` are also allowed. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to the Hugging Face Hub.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

[**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)
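
The practical effect of a quantization method is easiest to read off the file sizes. The upload log further down reports 6.43G for `unsloth.F16.gguf` and 2.02G for `unsloth.Q4_K_M.gguf`. The sketch below turns those two numbers into an average bits-per-weight figure, assuming the F16 file stores exactly 16 bits per weight and ignoring metadata, so treat the result as an estimate.

```python
def effective_bits_per_weight(quant_size_gb: float, f16_size_gb: float) -> float:
    """Estimate average bits per weight, taking the F16 file as 16 bits per weight."""
    return 16 * quant_size_gb / f16_size_gb

bpw = effective_bits_per_weight(quant_size_gb=2.02, f16_size_gb=6.43)  # sizes from the upload log
print(f"q4_k_m: about {bpw:.2f} bits per weight, {6.43 / 2.02:.1f}x smaller than F16")
```

An average above 4 bits is consistent with `q4_k_m` keeping some tensors at Q6_K rather than Q4_K.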

In [None]:
model.save_pretrained_gguf("Arabic-Law-Meta-Llama-3.2-3B-GGUF", tokenizer, quantization_method = "q4_k_m")
model.push_to_hub_gguf("AhmedBou/Arabic-Law-Meta-Llama-3.2-3B-GGUF", tokenizer, quantization_method = "q4_k_m", token = hf_token)

Unsloth: You have 1 CPUs. Using `safe_serialization` is 10x slower.
We shall switch to Pytorch saving, which will take 3 minutes and not 30 minutes.
To force `safe_serialization`, set it to `None` instead.
Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 2.2G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 5.58 out of 12.67 RAM for saving.


100%|██████████| 28/28 [00:02<00:00, 13.57it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... This might take 5 minutes for Llama-7b...
Unsloth: Saving Arabic-Law-Meta-Llama-3.2-3B-GGUF/pytorch_model-00001-of-00002.bin...
Unsloth: Saving Arabic-Law-Meta-Llama-3.2-3B-GGUF/pytorch_model-00002-of-00002.bin...
Done.


Unsloth: Converting llama model. Can use fast conversion = False.


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp will take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits will take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q4_k_m'] will take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: [0] Installing llama.cpp. This will take 3 minutes...
Unsloth: [1] Converting model at Arabic-Law-Meta-Llama-3.2-3B-GGUF into f16 GGUF format.
The output location will be /content/Arabic-Law-Meta-Llama-3.2-3B-GGUF/unsloth.F16.gguf
This will take 3 minutes...
INFO:hf-to-gguf:Loading model: Arabic-Law-Meta-Llama-3.2-3B-GGUF
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:rope_freqs.weight,           torch.float32 --> F32, shape = {64}
INFO:hf-to-gguf:gguf: loading model weight map from 'pytorch_model.bin.index.json'
INFO:hf-to-gguf:gguf: loading model part 'pytorch_model-

100%|██████████| 28/28 [00:00<00:00, 31.69it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... This might take 5 minutes for Llama-7b...
Unsloth: Saving AhmedBou/Arabic-Law-Meta-Llama-3.2-3B-GGUF/pytorch_model-00001-of-00002.bin...
Unsloth: Saving AhmedBou/Arabic-Law-Meta-Llama-3.2-3B-GGUF/pytorch_model-00002-of-00002.bin...
Done.
==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp will take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits will take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q4_k_m'] will take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: [0] Installing llama.cpp. This will take 3 minutes...
Unsloth: [1] Converting model at AhmedBou/Arabic-Law-Meta-Llama-3.2-3B-GGUF into f16 GGUF format.
The output location will be /content/AhmedBou/Arabic-Law-Meta-Llama-3.2-3B-GGUF/unsloth.F16.gguf
This will take 3 minutes...
INFO:hf-to-gguf:Loading model: Arabic-Law-Meta-Llama-3.2-3B-GGUF
INFO:ggu

  0%|          | 0/1 [00:00<?, ?it/s]

unsloth.F16.gguf:   0%|          | 0.00/6.43G [00:00<?, ?B/s]

Saved GGUF to https://huggingface.co/AhmedBou/Arabic-Law-Meta-Llama-3.2-3B-GGUF
Unsloth: Uploading GGUF to Huggingface Hub...


  0%|          | 0/1 [00:00<?, ?it/s]

unsloth.Q4_K_M.gguf:   0%|          | 0.00/2.02G [00:00<?, ?B/s]

No files have been modified since last commit. Skipping to prevent empty commit.


Saved GGUF to https://huggingface.co/AhmedBou/Arabic-Law-Meta-Llama-3.2-3B-GGUF
