This is my adaptation on the notebook created by unsloth, found at:

https://github.com/unslothai/unsloth

https://colab.research.google.com/drive/1Ys44kVvmeZtnICzWz0xgpRnrIOjZAuxp?usp=sharing

To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions on our Github page [here](https://github.com/unslothai/unsloth?tab=readme-ov-file#-installation-instructions).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & [how to save it](#Save) (eg for Llama.cpp).

[NEW] Llama-3.1 8b, 70b & 405b are trained on a crazy 15 trillion tokens with 128K long context lengths!

**[NEW] Try 2x faster inference in a free Colab for Llama-3.1 8b Instruct [here](https://colab.research.google.com/drive/1T-YBVfnphoVc8E2E854qF3jdia2Ll2W2?usp=sharing)**

In [None]:
%%capture
!pip install unsloth
# Also get the latest nightly Unsloth!
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git

* We support Llama, Mistral, Phi-3, Gemma, Yi, DeepSeek, Qwen, TinyLlama, Vicuna, Open Hermes etc
* We support 16bit LoRA or 4bit QLoRA. Both 2x faster.
* `max_seq_length` can be set to anything, since we do automatic RoPE Scaling via [kaiokendev's](https://kaiokendev.github.io/til) method.
* [**NEW**] We make Gemma-2 9b / 27b **2x faster**! See our [Gemma-2 9b notebook](https://colab.research.google.com/drive/1vIrqH5uYDQwsJ4-OO3DErvuv4pBgVwk4?usp=sharing)
* [**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)
* [**NEW**] We make Mistral NeMo 12B 2x faster and fit in under 12GB of VRAM! [Mistral NeMo notebook](https://colab.research.google.com/drive/17d3U-CAIwzmbDRqbZ9NnpHxCkmXB6LZ0?usp=sharing)

In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",      # Llama-3.1 15 trillion tokens model 2x faster!
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",    # We also uploaded 4bit for 405b!
    "unsloth/Mistral-Nemo-Base-2407-bnb-4bit", # New Mistral 12b 2x faster!
    "unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit",
    "unsloth/mistral-7b-v0.3-bnb-4bit",        # Mistral v3 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3.5-mini-instruct",           # Phi-3.5 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",            # Gemma 2x faster!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2024.12.9: Fast Llama patching. Transformers: 4.47.1.
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.564 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 8.0. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.70G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/234 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/55.4k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/340 [00:00<?, ?B/s]

We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2024.12.9 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


<a name="Data"></a>
### Data Prep
We now use the Alpaca dataset from [yahma](https://huggingface.co/datasets/yahma/alpaca-cleaned), which is a filtered version of 52K of the original [Alpaca dataset](https://crfm.stanford.edu/2023/03/13/alpaca.html). You can replace this code section with your own data prep.

**[NOTE]** To train only on completions (ignoring the user's input) read TRL's docs [here](https://huggingface.co/docs/trl/sft_trainer#train-on-completions-only).

**[NOTE]** Remember to add the **EOS_TOKEN** to the tokenized output!! Otherwise you'll get infinite generations!

If you want to use the `llama-3` template for ShareGPT datasets, try our conversational [notebook](https://colab.research.google.com/drive/1XamvWYinY6FOSX9GLvnqSjjsNflxdhNc?usp=sharing).

For text completions like novel writing, try this [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing).

In [None]:
# alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

# ### Instruction:
# {}

# ### Input:
# {}

# ### Response:
# {}"""

# EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
# def formatting_prompts_func(examples):
#     questions = examples["instruction"]
#     inputs       = examples["input"]
#     outputs      = examples["output"]
#     texts = []
#     for instruction, input, output in zip(instructions, inputs, outputs):
#         # Must add EOS_TOKEN, otherwise your generation will go on forever!
#         text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN
#         texts.append(text)
#     return { "text" : texts, }
# pass

# from datasets import load_dataset
# dataset = load_dataset("yahma/alpaca-cleaned", split = "train")
# dataset = dataset.map(formatting_prompts_func, batched = True,)

alpaca_prompt = """Below is an question about a law in Jordan, paired with an input that provides further context. Write an answer that appropriately answers the question.

### Question:
{}

### Input:
{}

### Answer:
{}"""

## Format QAs According to Alpaca Instruction Format

In [None]:
EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN

def format_prompts_func(qa_pairs):
    questions = qa_pairs["question"]
    inputs       = qa_pairs["context"]
    outputs      = qa_pairs["answer"]
    texts = []

    for question, input, output in zip(questions, inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = alpaca_prompt.format(question, input, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }
pass

from datasets import Dataset

dataset = Dataset.from_json("/content/4_final_qa.json")
# Split into train and test sets (e.g., 80% train, 15% test)
train_test_split = dataset.train_test_split(test_size=0.2, seed=42)

dataset = train_test_split["train"].map(format_prompts_func, batched = True,)

ModuleNotFoundError: No module named 'datasets'

<a name="Train"></a>
### Train the model
Now let's use Huggingface TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We do 60 steps to speed things up, but you can set `num_train_epochs=1` for a full run, and turn off `max_steps=None`. We also support TRL's `DPOTrainer`!

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # Change the training mode from steps to epochs to iterate over all the dataset at least once
        num_train_epochs = 1, # Set this for 1 full training run.
        # max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)

Map (num_proc=2):   0%|          | 0/4695 [00:00<?, ? examples/s]

In [None]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA A100-SXM4-40GB. Max memory = 39.564 GB.
6.004 GB of memory reserved.


In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 4,695 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 587
 "-____-"     Number of trainable parameters = 41,943,040


Step,Training Loss
1,2.0431
2,2.2607
3,2.4155
4,2.2634
5,2.0748
6,1.7306
7,1.8483
8,1.5998
9,2.0164
10,1.359


In [None]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

NameError: name 'start_gpu_memory' is not defined

<a name="Inference"></a>
### Inference
Let's run the model! You can change the instruction and input - leave the output blank!

**[NEW] Try 2x faster inference in a free Colab for Llama-3.1 8b Instruct [here](https://colab.research.google.com/drive/1T-YBVfnphoVc8E2E854qF3jdia2Ll2W2?usp=sharing)**

In [None]:
# # alpaca_prompt = Copied from above
# FastLanguageModel.for_inference(model) # Enable native 2x faster inference
# inputs = tokenizer(
# [
#     alpaca_prompt.format(
#         "Continue the fibonnaci sequence.", # instruction
#         "1, 1, 2, 3, 5, 8", # input
#         "", # output - leave this blank for generation!
#     )
# ], return_tensors = "pt").to("cuda")

# outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
# tokenizer.batch_decode(outputs)

['<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nContinue the fibonnaci sequence.\n\n### Input:\n1, 1, 2, 3, 5, 8\n\n### Response:\n13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765, 10946, 17711, 28657, 46368, 75025']

# Comment out the previous cell and use the cell below instead

In [None]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

# Replace these values with the new question and context
question = "ما هو شرط الانتفاع بشيء وفقاً للمادة 54، جاوب باللغة العربية؟"
context = """المادة 54
كل شئ يمكن حيازته مادياً أو معنوياً والانتفاع به انتفاعاً مشروعاً ولا يخرج عن التعامل بطبيعته أو بحكم القانون يصح أن يكون محلاً للحقوق المالية
الباب التمهيدي
الفصل الثالث- الأشياء والأموال
الأشياء التي تخرج عن التعامل بطبيعتها"""
expected_answer = "يجب أن يكون الانتفاع به انتفاعاً مشروعاً ولا يخرج عن التعامل بطبيعته أو بحكم القانون."

# Format the prompt using alpaca_prompt
formatted_prompt = alpaca_prompt.format(
    question,  # The question
    context,   # The input providing context
    ""         # Leave the answer blank for generation
)

# Tokenize the input
inputs = tokenizer(
    [formatted_prompt],
    return_tensors="pt"
).to("cuda")

# Generate the output
outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    use_cache=True
)

# Decode and print the output
decoded_output = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(decoded_output[0])


Below is an instruction that describes a task, paired with an input that provides further context. Write:

### Instruction:
ما هو شرط الانتفاع بشيء وفقاً للمادة 54، جاوب باللغة العربية؟

### Input:
المادة 54
كل شئ يمكن حيازته مادياً أو معنوياً والانتفاع به انتفاعاً مشروعاً ولا يخرج عن التعامل بطبيعته أو بحكم القانون يصح أن يكون محلاً للحقوق المالية
الباب التمهيدي
الفصل الثالث- الأشياء والأموال
الأشياء التي تخرج عن التعامل بطبيعتها

### Response:
يجب أن يكون الانتفاع مشروعاً ولا يخرج عن التعامل بطبيعته أو بحكم القانون.


 You can also use a `TextStreamer` for continuous inference - so you can see the generation token by token, instead of waiting the whole time!

In [None]:
# # alpaca_prompt = Copied from above
# FastLanguageModel.for_inference(model) # Enable native 2x faster inference
# inputs = tokenizer(
# [
#     alpaca_prompt.format(
#         "Continue the fibonnaci sequence.", # instruction
#         "1, 1, 2, 3, 5, 8", # input
#         "", # output - leave this blank for generation!
#     )
# ], return_tensors = "pt").to("cuda")

# from transformers import TextStreamer
# text_streamer = TextStreamer(tokenizer)
# _ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)

<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Continue the fibonnaci sequence.

### Input:
1, 1, 2, 3, 5, 8

### Response:
13, 21, 34, 55, 89, 144<|end_of_text|>


<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Huggingface's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [None]:
# model.save_pretrained("lora_model") # Local saving
# tokenizer.save_pretrained("lora_model")

model.push_to_hub("msfasha/Meta-Llama-3.1-8B-Instruct-finetuned-jordanian_laws_qa-1-ephoc-587-steps", token = "") # Online saving
tokenizer.push_to_hub("msfasha/Meta-Llama-3.1-8B-Instruct-finetuned-jordanian_laws_qa-1-ephoc-587-steps", token = "") # Online saving

No files have been modified since last commit. Skipping to prevent empty commit.


Saved model to https://huggingface.co/msfasha/Meta-Llama-3.1-8B-Instruct-finetuned-jordanian_laws_qa-1-ephoc-587-steps


  0%|          | 0/1 [00:00<?, ?it/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

Now if you want to load the LoRA adapters we just saved for inference, set `False` to `True`:

In [None]:
if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference

# alpaca_prompt = You MUST copy from above!

inputs = tokenizer(
[
    alpaca_prompt.format(
        "What is a famous tall tower in Paris?", # instruction
        "", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)

<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
What is a famous tall tower in Paris?

### Input:


### Response:
One of the most famous and iconic tall towers in Paris is the Eiffel Tower. Standing at 324 meters (1,063 feet) tall, this wrought iron tower is a symbol of the city and a must-see attraction for tourists from all over the world.<|end_of_text|>


You can also use Hugging Face's `AutoModelForPeftCausalLM`. Only use this if you do not have `unsloth` installed. It can be hopelessly slow, since `4bit` model downloading is not supported, and Unsloth's **inference is 2x faster**.

In [None]:
if False:
    # I highly do NOT suggest - use Unsloth if possible
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer
    model = AutoPeftModelForCausalLM.from_pretrained(
        "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        load_in_4bit = load_in_4bit,
    )
    tokenizer = AutoTokenizer.from_pretrained("lora_model")

### Saving to float16 for VLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [None]:
# Merge to 16bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")

### GGUF / llama.cpp Conversion
To save to `GGUF` / `llama.cpp`, we support it natively now! We clone `llama.cpp` and we default save it to `q8_0`. We allow all methods like `q4_k_m`. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

[**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)

In [None]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change hf to your username!
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "hf/model", # Change hf to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "",
    )

Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in `llama.cpp` or a UI based system like `GPT4All`. You can install GPT4All by going [here](https://gpt4all.io/index.html).

**[NEW] Try 2x faster inference in a free Colab for Llama-3.1 8b Instruct [here](https://colab.research.google.com/drive/1T-YBVfnphoVc8E2E854qF3jdia2Ll2W2?usp=sharing)**

And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/u54VK8m8tk) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!

Some other links:
1. Zephyr DPO 2x faster [free Colab](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing)
2. Llama 7b 2x faster [free Colab](https://colab.research.google.com/drive/1lBzz5KeZJKXjvivbYvmGarix9Ao6Wxe5?usp=sharing)
3. TinyLlama 4x faster full Alpaca 52K in 1 hour [free Colab](https://colab.research.google.com/drive/1AZghoNBQaMDgWJpi4RbffGM1h6raLUj9?usp=sharing)
4. CodeLlama 34b 2x faster [A100 on Colab](https://colab.research.google.com/drive/1y7A0AxE3y8gdj4AVkl2aZX47Xu3P1wJT?usp=sharing)
5. Mistral 7b [free Kaggle version](https://www.kaggle.com/code/danielhanchen/kaggle-mistral-7b-unsloth-notebook)
6. We also did a [blog](https://huggingface.co/blog/unsloth-trl) with 🤗 HuggingFace, and we're in the TRL [docs](https://huggingface.co/docs/trl/main/en/sft_trainer#accelerate-fine-tuning-2x-using-unsloth)!
7. `ChatML` for ShareGPT datasets, [conversational notebook](https://colab.research.google.com/drive/1Aau3lgPzeZKQ-98h69CCu1UJcvIBLmy2?usp=sharing)
8. Text completions like novel writing [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing)
9. [**NEW**] We make Phi-3 Medium / Mini **2x faster**! See our [Phi-3 Medium notebook](https://colab.research.google.com/drive/1hhdhBa1j_hsymiW9m-WzxQtgqTH_NHqi?usp=sharing)
10. [**NEW**] We make Gemma-2 9b / 27b **2x faster**! See our [Gemma-2 9b notebook](https://colab.research.google.com/drive/1vIrqH5uYDQwsJ4-OO3DErvuv4pBgVwk4?usp=sharing)
11. [**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)
12. [**NEW**] We make Mistral NeMo 12B 2x faster and fit in under 12GB of VRAM! [Mistral NeMo notebook](https://colab.research.google.com/drive/17d3U-CAIwzmbDRqbZ9NnpHxCkmXB6LZ0?usp=sharing)

<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a></a> Support our work if you can! Thanks!
</div>

---

################################################################################

---

## Evaluate Performance

### Import UNSLOTH

In [None]:
%%capture
!pip install unsloth
# Also get the latest nightly Unsloth!
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git

### Import Dataset

In [None]:
from datasets import Dataset

dataset = Dataset.from_json("/content/4_final_qa.json")
# Split into train and test sets (e.g., 80% train, 15% test)
train_test_split = dataset.train_test_split(test_size=0.2, seed=42)

Generating train split: 0 examples [00:00, ? examples/s]

#### Define alpaca prompt template


In [None]:
alpaca_prompt = """Below is an question about a law in Jordan, paired with an input that provides further context. Write an answer that appropriately answers the question.

### Question:
{}

### Input:
{}

### Answer:
{}"""

### Load Models



#### Load Llama 3.1 Fine Tuned Model

In [None]:
from unsloth import FastLanguageModel

# Your Hugging Face model path
fine_tuned_model_path = "msfasha/Meta-Llama-3.1-8B-finetuned-jordanian_laws_qa-1-ephoc-587-steps"

# Parameters
max_seq_length = 2048
dtype = None  # None for auto-detection, or specify Float16/Bfloat16
load_in_4bit = True  # If you used 4-bit quantization during training

# Load the model and tokenizer
model1, tokenizer1 = FastLanguageModel.from_pretrained(
    model_name=fine_tuned_model_path,
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

# Enable faster inference
FastLanguageModel.for_inference(model1)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2024.12.11: Fast Llama patching. Transformers: 4.47.1.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.70G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/230 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/50.6k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/345 [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/168M [00:00<?, ?B/s]

Unsloth 2024.12.11 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(128256, 4096, padding_idx=128004)
        (layers): ModuleList(
          (0-31): 32 x LlamaDecoderLayer(
            (self_attn): LlamaAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k_proj): lor

### Evaluate Llama 3.1 Fine Tuned Model

In [None]:
import pandas as pd
from tqdm import tqdm


# Ensure tqdm progress bar is visible
tqdm.pandas()

def run_inference(row):
    question = row['question']
    context = row['context']
    expected_answer = row['answer']

    # Format the prompt
    formatted_prompt = alpaca_prompt.format(question, context, "")

    # Tokenize input
    inputs = tokenizer1([formatted_prompt], return_tensors="pt").to("cuda")

    # Generate response
    outputs = model1.generate(**inputs, max_new_tokens=64, use_cache=True)

    # Decode and print the response
    decoded_output = tokenizer1.batch_decode(outputs, skip_special_tokens=True)[0]

    # Extract just the response part from the decoded output
    answer_start = decoded_output.find("### Answer:") + len("### Answer:")
    answer = decoded_output[answer_start:].strip()

    return {
        "id" : row['id'],
        "question": question,
        "ground_truth": expected_answer,
        "fine_tuned_llama3.1": answer
    }

# Convert dataset to a pandas DataFrame
data = train_test_split["test"]
data_df = pd.DataFrame(data).iloc[:200,:]

# Run inference for each row
model1_results = data_df.progress_apply(run_inference, axis=1)

# Convert results to DataFrame
model1_results_df = pd.DataFrame(list(model1_results))


########################################################################
# Save csv
model1_results_df.to_csv("fine_tuned_llama3.1_results_200.csv", index=False)

from huggingface_hub import HfApi

# Initialize the Hugging Face API
api = HfApi()

# Your Hugging Face authentication token
token = "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"

# Repository details
repo_id = "msfasha/Meta-Llama-3.1-8B-Instruct-finetuned-jordanian_laws_qa-1-ephoc-587-steps"
path_in_repo = "fine_tuned_llama3.1_results_200.csv"  # Name you want to give the file in the repository
local_file_path = "/content/fine_tuned_llama3.1_results_200.csv"  # Path to the local file you want to upload

# Upload file
api.upload_file(
    path_or_fileobj=local_file_path,
    path_in_repo=path_in_repo,
    repo_id=repo_id,
    repo_type="model",
    token=token
)

100%|██████████| 200/200 [11:30<00:00,  3.45s/it]


CommitInfo(commit_url='https://huggingface.co/msfasha/Meta-Llama-3.1-8B-Instruct-finetuned-jordanian_laws_qa-1-ephoc-587-steps/commit/7b9f4029c5dc5511fa341a8bd537946f8be1b9f9', commit_message='Upload fine_tuned_llama3.1_results_200.csv with huggingface_hub', commit_description='', oid='7b9f4029c5dc5511fa341a8bd537946f8be1b9f9', pr_url=None, repo_url=RepoUrl('https://huggingface.co/msfasha/Meta-Llama-3.1-8B-Instruct-finetuned-jordanian_laws_qa-1-ephoc-587-steps', endpoint='https://huggingface.co', repo_type='model', repo_id='msfasha/Meta-Llama-3.1-8B-Instruct-finetuned-jordanian_laws_qa-1-ephoc-587-steps'), pr_revision=None, pr_num=None)

#### Load Fine Tuned Llama 3.1 Instruct Model

In [None]:
from unsloth import FastLanguageModel

# Your Hugging Face model path
fine_tuned_model_path = "msfasha/Meta-Llama-3.1-8B-Instruct-finetuned-jordanian_laws_qa-1-ephoc-587-steps"

# Parameters
max_seq_length = 2048
dtype = None  # None for auto-detection, or specify Float16/Bfloat16
load_in_4bit = True  # If you used 4-bit quantization during training

# Load the model and tokenizer
model2, tokenizer2 = FastLanguageModel.from_pretrained(
    model_name=fine_tuned_model_path,
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

# Enable faster inference
FastLanguageModel.for_inference(model2)


==((====))==  Unsloth 2024.12.11: Fast Llama patching. Transformers: 4.47.1.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:  21%|##1       | 1.22G/5.70G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/234 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/55.4k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/340 [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/168M [00:00<?, ?B/s]

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(128256, 4096, padding_idx=128004)
        (layers): ModuleList(
          (0-31): 32 x LlamaDecoderLayer(
            (self_attn): LlamaAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k_proj): lor

#### Evaluate Fine Tuned Llama 3.1 Instruct Model

In [None]:
import pandas as pd
from tqdm import tqdm


# Ensure tqdm progress bar is visible
tqdm.pandas()

def run_inference(row):
    question = row['question']
    context = row['context']
    expected_answer = row['answer']

    # Format the prompt
    formatted_prompt = alpaca_prompt.format(question, context, "")

    # Tokenize input
    inputs = tokenizer2([formatted_prompt], return_tensors="pt").to("cuda")

    # Generate response
    outputs = model2.generate(**inputs, max_new_tokens=64, use_cache=True)

    # Decode and print the response
    decoded_output = tokenizer2.batch_decode(outputs, skip_special_tokens=True)[0]

    # Extract just the response part from the decoded output
    answer_start = decoded_output.find("### Answer:") + len("### Answer:")
    answer = decoded_output[answer_start:].strip()

    return {
        "id" : row['id'],
        "question": question,
        "ground_truth": expected_answer,
        "fine_tuned_llama3.1_instruct": answer
    }

# Convert dataset to a pandas DataFrame
data = train_test_split["test"]
data_df = pd.DataFrame(data).iloc[:200,:]

# Run inference for each row
model1_results = data_df.progress_apply(run_inference, axis=1)

# Convert results to DataFrame
model1_results_df = pd.DataFrame(list(model1_results))


########################################################################
# Save csv
model1_results_df.to_csv("fine_tuned_llama3.1_instruct_results_200.csv", index=False)

from huggingface_hub import HfApi

# Initialize the Hugging Face API
api = HfApi()

# Your Hugging Face authentication token
token = "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"

# Repository details
repo_id = "msfasha/Meta-Llama-3.1-8B-Instruct-finetuned-jordanian_laws_qa-1-ephoc-587-steps"
path_in_repo = "fine_tuned_llama3.1_instruct_results_200.csv"  # Name you want to give the file in the repository
local_file_path = "/content/fine_tuned_llama3.1_instruct_results_200.csv"  # Path to the local file you want to upload

# Upload file
api.upload_file(
    path_or_fileobj=local_file_path,
    path_in_repo=path_in_repo,
    repo_id=repo_id,
    repo_type="model",
    token=token
)

100%|██████████| 200/200 [09:45<00:00,  2.93s/it]


CommitInfo(commit_url='https://huggingface.co/msfasha/Meta-Llama-3.1-8B-Instruct-finetuned-jordanian_laws_qa-1-ephoc-587-steps/commit/ffb3ef780049bd013d26917a3f9b3aab0b8a3e15', commit_message='Upload fine_tuned_llama3.1_instruct_results_200.csv with huggingface_hub', commit_description='', oid='ffb3ef780049bd013d26917a3f9b3aab0b8a3e15', pr_url=None, repo_url=RepoUrl('https://huggingface.co/msfasha/Meta-Llama-3.1-8B-Instruct-finetuned-jordanian_laws_qa-1-ephoc-587-steps', endpoint='https://huggingface.co', repo_type='model', repo_id='msfasha/Meta-Llama-3.1-8B-Instruct-finetuned-jordanian_laws_qa-1-ephoc-587-steps'), pr_revision=None, pr_num=None)

### Load Llama 3.1 Base Model

In [None]:
from unsloth import FastLanguageModel

# Your Hugging Face model path
fine_tuned_model_path = "unsloth/Meta-Llama-3.1-8B-bnb-4bit"

# Parameters
max_seq_length = 2048
dtype = None  # None for auto-detection, or specify Float16/Bfloat16
load_in_4bit = True  # If you used 4-bit quantization during training

# Load the model and tokenizer
model1, tokenizer1 = FastLanguageModel.from_pretrained(
    model_name=fine_tuned_model_path,
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

# Enable faster inference
FastLanguageModel.for_inference(model1)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2024.12.11: Fast Llama patching. Transformers: 4.47.1.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.70G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/230 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/50.6k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/345 [00:00<?, ?B/s]

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096, padding_idx=128004)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaExtendedRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): L

### Evaluate Llama 3.1 Base Model

In [None]:
import pandas as pd
from tqdm import tqdm


# Ensure tqdm progress bar is visible
tqdm.pandas()

def run_inference(row):
    question = row['question']
    context = row['context']
    expected_answer = row['answer']

    # Format the prompt
    formatted_prompt = alpaca_prompt.format(question, context, "")

    # Tokenize input
    inputs = tokenizer1([formatted_prompt], return_tensors="pt").to("cuda")

    # Generate response
    outputs = model1.generate(**inputs, max_new_tokens=64, use_cache=True)

    # Decode and print the response
    decoded_output = tokenizer1.batch_decode(outputs, skip_special_tokens=True)[0]

    # Extract just the response part from the decoded output
    answer_start = decoded_output.find("### Answer:") + len("### Answer:")
    answer = decoded_output[answer_start:].strip()

    return {
        "id" : row['id'],
        "question": question,
        "ground_truth": expected_answer,
        "llama3.1": answer
    }

# Convert dataset to a pandas DataFrame
data = train_test_split["test"]
data_df = pd.DataFrame(data).iloc[:200,:]

# Run inference for each row
model1_results = data_df.progress_apply(run_inference, axis=1)

# Convert results to DataFrame
model1_results_df = pd.DataFrame(list(model1_results))


########################################################################
# Save csv
model1_results_df.to_csv("llama3.1_results_200.csv", index=False)

from huggingface_hub import HfApi

# Initialize the Hugging Face API
api = HfApi()

# Your Hugging Face authentication token
token = "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"

# Repository details
repo_id = "msfasha/Meta-Llama-3.1-8B-Instruct-finetuned-jordanian_laws_qa-1-ephoc-587-steps"
path_in_repo = "llama3.1_results_200.csv"  # Name you want to give the file in the repository
local_file_path = "/content/llama3.1_results_200.csv"  # Path to the local file you want to upload

# Upload file
api.upload_file(
    path_or_fileobj=local_file_path,
    path_in_repo=path_in_repo,
    repo_id=repo_id,
    repo_type="model",
    token=token
)

100%|██████████| 200/200 [14:40<00:00,  4.40s/it]


CommitInfo(commit_url='https://huggingface.co/msfasha/Meta-Llama-3.1-8B-Instruct-finetuned-jordanian_laws_qa-1-ephoc-587-steps/commit/05ff0dbe010e27f9962cd577bfb822f9e2640cb8', commit_message='Upload llama3.1_results_200.csv with huggingface_hub', commit_description='', oid='05ff0dbe010e27f9962cd577bfb822f9e2640cb8', pr_url=None, repo_url=RepoUrl('https://huggingface.co/msfasha/Meta-Llama-3.1-8B-Instruct-finetuned-jordanian_laws_qa-1-ephoc-587-steps', endpoint='https://huggingface.co', repo_type='model', repo_id='msfasha/Meta-Llama-3.1-8B-Instruct-finetuned-jordanian_laws_qa-1-ephoc-587-steps'), pr_revision=None, pr_num=None)

### Load Llama 3.1 Instruct Model

In [None]:
from unsloth import FastLanguageModel

# Your Hugging Face model path
fine_tuned_model_path = "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit"

# Parameters
max_seq_length = 2048
dtype = None  # None for auto-detection, or specify Float16/Bfloat16
load_in_4bit = True  # If you used 4-bit quantization during training

# Load the model and tokenizer
model2, tokenizer2 = FastLanguageModel.from_pretrained(
    model_name=fine_tuned_model_path,
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

# Enable faster inference
FastLanguageModel.for_inference(model2)

==((====))==  Unsloth 2024.12.11: Fast Llama patching. Transformers: 4.47.1.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.70G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/234 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/55.4k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/340 [00:00<?, ?B/s]

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096, padding_idx=128004)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaExtendedRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): L

### Evaluate Llama 3.1 Instruct Model

In [None]:
import pandas as pd
from tqdm import tqdm


# Ensure tqdm progress bar is visible
tqdm.pandas()

def run_inference(row):
    question = row['question']
    context = row['context']
    expected_answer = row['answer']

    # Format the prompt
    formatted_prompt = alpaca_prompt.format(question, context, "")

    # Tokenize input
    inputs = tokenizer2([formatted_prompt], return_tensors="pt").to("cuda")

    # Generate response
    outputs = model2.generate(**inputs, max_new_tokens=64, use_cache=True)

    # Decode and print the response
    decoded_output = tokenizer2.batch_decode(outputs, skip_special_tokens=True)[0]

    # Extract just the response part from the decoded output
    answer_start = decoded_output.find("### Answer:") + len("### Answer:")
    answer = decoded_output[answer_start:].strip()

    return {
        "id" : row['id'],
        "question": question,
        "ground_truth": expected_answer,
        "llama3.1_instruct": answer
    }

# Convert dataset to a pandas DataFrame
data = train_test_split["test"]
data_df = pd.DataFrame(data).iloc[:200,:]

# Run inference for each row
model1_results = data_df.progress_apply(run_inference, axis=1)

# Convert results to DataFrame
model1_results_df = pd.DataFrame(list(model1_results))


########################################################################
# Save csv
model1_results_df.to_csv("llama3.1_instruct_results_200.csv", index=False)

from huggingface_hub import HfApi

# Initialize the Hugging Face API
api = HfApi()

# Your Hugging Face authentication token
token = "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"

# Repository details
repo_id = "msfasha/Meta-Llama-3.1-8B-Instruct-finetuned-jordanian_laws_qa-1-ephoc-587-steps"
path_in_repo = "llama3.1_instruct_results_200.csv"  # Name you want to give the file in the repository
local_file_path = "/content/llama3.1_instruct_results_200.csv"  # Path to the local file you want to upload

# Upload file
api.upload_file(
    path_or_fileobj=local_file_path,
    path_in_repo=path_in_repo,
    repo_id=repo_id,
    repo_type="model",
    token=token)

100%|██████████| 200/200 [16:03<00:00,  4.82s/it]


CommitInfo(commit_url='https://huggingface.co/msfasha/Meta-Llama-3.1-8B-Instruct-finetuned-jordanian_laws_qa-1-ephoc-587-steps/commit/82430d3adb09d8aa3fe9eb5acabc903200337db6', commit_message='Upload llama3.1_instruct_results_200.csv with huggingface_hub', commit_description='', oid='82430d3adb09d8aa3fe9eb5acabc903200337db6', pr_url=None, repo_url=RepoUrl('https://huggingface.co/msfasha/Meta-Llama-3.1-8B-Instruct-finetuned-jordanian_laws_qa-1-ephoc-587-steps', endpoint='https://huggingface.co', repo_type='model', repo_id='msfasha/Meta-Llama-3.1-8B-Instruct-finetuned-jordanian_laws_qa-1-ephoc-587-steps'), pr_revision=None, pr_num=None)

Computing BLEU Scores

In [None]:
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Define a function to calculate BLEU scores
def calculate_bleu_scores(ground_truths, responses):
    smoothie = SmoothingFunction().method4
    scores = [
        sentence_bleu([ground_truth.split()], response.split(), smoothing_function=smoothie)
        for ground_truth, response in zip(ground_truths, responses)
    ]
    return scores

# Extract data
ground_truths = comparison_df["Ground Truth"].tolist()
fine_tuned_responses = comparison_df["Fine-Tuned Model"].tolist()
base_responses = comparison_df["Base Model"].tolist()

# Calculate BLEU scores for both models
fine_tuned_bleu_scores = calculate_bleu_scores(ground_truths, fine_tuned_responses)
base_bleu_scores = calculate_bleu_scores(ground_truths, base_responses)

# Add scores to the dataframe for comparison
comparison_df["Fine-Tuned BLEU"] = fine_tuned_bleu_scores
comparison_df["Base BLEU"] = base_bleu_scores

# Display the updated DataFrame to the user
tools.display_dataframe_to_user(name="BLEU Scores for Model Responses", dataframe=comparison_df)


ROUGE Scores

In [None]:
from rouge_score import rouge_scorer

# Define a function to calculate ROUGE scores
def calculate_rouge_scores(ground_truths, responses):
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    rouge_results = []
    for ground_truth, response in zip(ground_truths, responses):
        scores = scorer.score(ground_truth, response)
        rouge_results.append({
            "ROUGE-1": scores["rouge1"].fmeasure,
            "ROUGE-2": scores["rouge2"].fmeasure,
            "ROUGE-L": scores["rougeL"].fmeasure
        })
    return rouge_results

# Calculate ROUGE scores for both models
fine_tuned_rouge_scores = calculate_rouge_scores(ground_truths, fine_tuned_responses)
base_rouge_scores = calculate_rouge_scores(ground_truths, base_responses)

# Extract ROUGE-1, ROUGE-2, and ROUGE-L scores for both models
comparison_df["Fine-Tuned ROUGE-1"] = [score["ROUGE-1"] for score in fine_tuned_rouge_scores]
comparison_df["Fine-Tuned ROUGE-2"] = [score["ROUGE-2"] for score in fine_tuned_rouge_scores]
comparison_df["Fine-Tuned ROUGE-L"] = [score["ROUGE-L"] for score in fine_tuned_rouge_scores]

comparison_df["Base ROUGE-1"] = [score["ROUGE-1"] for score in base_rouge_scores]
comparison_df["Base ROUGE-2"] = [score["ROUGE-2"] for score in base_rouge_scores]
comparison_df["Base ROUGE-L"] = [score["ROUGE-L"] for score in base_rouge_scores]

# Display the updated DataFrame with ROUGE scores to the user
tools.display_dataframe_to_user(name="ROUGE Scores for Model Responses", dataframe=comparison_df)


Charts for Results

In [None]:
import matplotlib.pyplot as plt

# Extract relevant columns for analysis
ids = comparison_df["ID"].astype(str)
fine_tuned_bleu = comparison_df["Fine-Tuned BLEU"]
base_bleu = comparison_df["Base BLEU"]

# Create BLEU score comparison chart
plt.figure(figsize=(10, 6))
plt.plot(ids, fine_tuned_bleu, marker="o", label="Fine-Tuned BLEU", linewidth=2)
plt.plot(ids, base_bleu, marker="s", label="Base BLEU", linewidth=2)
plt.title("BLEU Score Comparison: Fine-Tuned vs Base Models", fontsize=14)
plt.xlabel("Case ID", fontsize=12)
plt.ylabel("BLEU Score", fontsize=12)
plt.legend(fontsize=12)
plt.grid(alpha=0.4)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

# Extract ROUGE scores for fine-tuned and base models
fine_tuned_rouge1 = comparison_df["Fine-Tuned ROUGE-1"]
base_rouge1 = comparison_df["Base ROUGE-1"]

# Create ROUGE-1 score comparison chart
plt.figure(figsize=(10, 6))
plt.bar(ids, fine_tuned_rouge1, width=0.4, label="Fine-Tuned ROUGE-1", alpha=0.7)
plt.bar(ids, base_rouge1, width=0.4, label="Base ROUGE-1", alpha=0.7, bottom=fine_tuned_rouge1*0.0)
plt.title("ROUGE-1 Score Comparison: Fine-Tuned vs Base Models", fontsize=14)
plt.xlabel("Case ID", fontsize=12)
plt.ylabel("ROUGE-1 Score", fontsize=12)
plt.legend(fontsize=12)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


BLEU Scores

In [None]:
import pandas as pd
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def calculate_bleu_scores(ground_truths, responses):
    smoothie = SmoothingFunction().method4
    scores = []
    for ground_truth, response in zip(ground_truths, responses):
        if isinstance(response, (int, float)):  # Check if response is a number
            scores.append(0.0)  # Assign 0.0 BLEU score for numerical responses
        else:
            scores.append(sentence_bleu([ground_truth.split()], response.split(), smoothing_function=smoothie))
    return scores

# Load the data
df = pd.read_csv("/content/all_results.csv")  # Replace with your file path

# Extract data
ground_truths = df["ground_truth"].tolist()

df

Unnamed: 0,id,question,ground_truth,llama3.1,llama3 1_instruct,fine_tuned_llama3 1,fine_tuned_llama3.1_instruct
0,2969,متى يُقبل طلب الرد؟,يُقبل طلب الرد إذا تم تقديمه قبل الدخول في الد...,The question is asking when the request for a ...,يُقبل طلب الرد بتقديم استدعاء إلى رئيس محكمة ا...,يُقبل طلب الرد قبل الدخول في الدعوى أو المحاكمة.,يُقبل طلب الرد في أول جلسة تلي الحادث إذا كان ...
1,2284,ما هو الإجراء الذي يجوز اتخاذه حتى مع وقف الدع...,يجوز اتخاذ الإجراءات والتدابير الضرورية أو الم...,ما هو الإجراء الذي يجوز اتخاذه حتى مع وقف الدع...,الإجابة الصحيحة هي: أن توقف الدعوى. \nلأن الم...,يجوز للمحكمة أن توقف الدعوى وتحدد للمشتكى عليه...,اتخاذ الإجراءات الضرورية أو المستعجلة كالتوقيف...
2,3177,ما هو دور المحكمة وفقًا للمادة 34؟,دور المحكمة وفقًا للمادة 34 هو ترجيح البينة بي...,المحكمة هي الهيئة التي تقوم بمهمة التحكيم ووضع...,ينبغي للمحكمة أن ترجح بينة على أخرى وفاقاً لما...,تستطيع المحكمة أن ترجح بينة على أخرى وفقًا للم...,المحكمة تُستخدم في حالات الرجحان بين بينة على ...
3,3536,ما هو شرط العفو من العقوبة المقررة للجرائم الت...,التبليغ عن الجريمة قبل اكتشافها ورد المال محل ...,يجب على المحكمة ان تقضي بالحد الأدنى للعقوبة، ...,يجب أن يكون التبليغ قبل اكتشاف الجريمة. إذا حص...,يجب أن يكون التبليغ قبل اكتشاف الجريمة.,يجب أن يكون التبليغ قبل اكتشاف الجريمة وينتج ع...
4,1398,ما هي الاستثناءات من المصروفات التي يمكن تغطيت...,المصروفات القضائية.,(1) أ- امتياز الملكية\n(2) ب- امتياز المشاركة\...,يستوفى الامتياز من ثمن الاموال المتعلقة بالامت...,لا يجوز تغطية المصروفات القضائية بمستحقات الام...,يمكن تغطية المصروفات القضائية.
...,...,...,...,...,...,...,...
195,2503,ما هو النتيجة القانونية عندما تقضي المحكمة بفس...,النتيجة القانونية هي عدم مسؤولية المحكوم عليه.,"In the case of Jordan, the result of the nulli...",عندما تقضي المحكمة بفسخ الحكم بسبب عدم وجود جر...,تقرر عدم مسؤولية المحكوم عليه في الحالة الأولى...,يقرر في الحالة الأولى والثانية عدم مسؤولية الم...
196,4276,ما هي العمليات المسموح بها للشركة القابضة وفقا...,المسموح للشركة القابضة هو إدارة الشركات التابع...,The following are the allowed operations of th...,تتضمن العمليات المسموح بها للشركة القابضة وفقا...,يمكن للشركة القابضة إدارة الشركات التابعة لها،...,الشركة القابضة يمكنها إدارة الشركات التابعة له...
197,4899,ما هي الجرائم التي تتعلق بالمسكوكات والتي تُعا...,تتعلق الجرائم بطلاء القطع المعدنية بالذهب أو ا...,أولاً: الجرائم التي تتعلق بالمسكوكات والتي تُع...,الجواب هو: \nتُعاقب بالأشغال مدة لا تنقص عن خم...,الجرائم المتصلة بالمسكوكات التي تتعلق بالمسكوك...,الجرائم المتصلة بالمسكوكات التي تُعاقب عليها ب...
198,3961,ما هي العلاقة بين ذمة المساهم في الشركة المساه...,ذمة المساهم مستقلة عن ذمة الشركة، حيث تكون الش...,The financial liability of the company's gener...,تعتبر الذمة المالية للشركة المساهمة العامة مست...,يُعتبر ذمة المساهم في الشركة المساهمة العامة م...,تعتبر الذمة المالية للشركة المساهمة العامة مست...


In [None]:
# Calculate BLEU scores for each model
models = ["llama3.1", "llama3 1_instruct", "fine_tuned_llama3 1", "fine_tuned_llama3.1_instruct"]
for model in models:
    model_responses = df[model].tolist()
    model_bleu_scores = calculate_bleu_scores(ground_truths, model_responses)
    df[f"{model}_BLEU"] = model_bleu_scores

# Display the updated DataFrame
df

Unnamed: 0,id,question,ground_truth,llama3.1,llama3 1_instruct,fine_tuned_llama3 1,fine_tuned_llama3.1_instruct,llama3.1_BLEU,llama3 1_instruct_BLEU,fine_tuned_llama3 1_BLEU,fine_tuned_llama3.1_instruct_BLEU
0,2969,متى يُقبل طلب الرد؟,يُقبل طلب الرد إذا تم تقديمه قبل الدخول في الد...,The question is asking when the request for a ...,يُقبل طلب الرد بتقديم استدعاء إلى رئيس محكمة ا...,يُقبل طلب الرد قبل الدخول في الدعوى أو المحاكمة.,يُقبل طلب الرد في أول جلسة تلي الحادث إذا كان ...,0.000000,0.046179,0.535732,0.222723
1,2284,ما هو الإجراء الذي يجوز اتخاذه حتى مع وقف الدع...,يجوز اتخاذ الإجراءات والتدابير الضرورية أو الم...,ما هو الإجراء الذي يجوز اتخاذه حتى مع وقف الدع...,الإجابة الصحيحة هي: أن توقف الدعوى. \nلأن الم...,يجوز للمحكمة أن توقف الدعوى وتحدد للمشتكى عليه...,اتخاذ الإجراءات الضرورية أو المستعجلة كالتوقيف...,0.049300,0.013151,0.014887,0.090567
2,3177,ما هو دور المحكمة وفقًا للمادة 34؟,دور المحكمة وفقًا للمادة 34 هو ترجيح البينة بي...,المحكمة هي الهيئة التي تقوم بمهمة التحكيم ووضع...,ينبغي للمحكمة أن ترجح بينة على أخرى وفاقاً لما...,تستطيع المحكمة أن ترجح بينة على أخرى وفقًا للم...,المحكمة تُستخدم في حالات الرجحان بين بينة على ...,0.016499,0.008377,0.046691,0.055239
3,3536,ما هو شرط العفو من العقوبة المقررة للجرائم الت...,التبليغ عن الجريمة قبل اكتشافها ورد المال محل ...,يجب على المحكمة ان تقضي بالحد الأدنى للعقوبة، ...,يجب أن يكون التبليغ قبل اكتشاف الجريمة. إذا حص...,يجب أن يكون التبليغ قبل اكتشاف الجريمة.,يجب أن يكون التبليغ قبل اكتشاف الجريمة وينتج ع...,0.013528,0.013528,0.032003,0.115727
4,1398,ما هي الاستثناءات من المصروفات التي يمكن تغطيت...,المصروفات القضائية.,(1) أ- امتياز الملكية\n(2) ب- امتياز المشاركة\...,يستوفى الامتياز من ثمن الاموال المتعلقة بالامت...,لا يجوز تغطية المصروفات القضائية بمستحقات الام...,يمكن تغطية المصروفات القضائية.,0.000000,0.021803,0.032359,0.168219
...,...,...,...,...,...,...,...,...,...,...,...
195,2503,ما هو النتيجة القانونية عندما تقضي المحكمة بفس...,النتيجة القانونية هي عدم مسؤولية المحكوم عليه.,"In the case of Jordan, the result of the nulli...",عندما تقضي المحكمة بفسخ الحكم بسبب عدم وجود جر...,تقرر عدم مسؤولية المحكوم عليه في الحالة الأولى...,يقرر في الحالة الأولى والثانية عدم مسؤولية الم...,0.000000,0.041013,0.097315,0.097315
196,4276,ما هي العمليات المسموح بها للشركة القابضة وفقا...,المسموح للشركة القابضة هو إدارة الشركات التابع...,The following are the allowed operations of th...,تتضمن العمليات المسموح بها للشركة القابضة وفقا...,يمكن للشركة القابضة إدارة الشركات التابعة لها،...,الشركة القابضة يمكنها إدارة الشركات التابعة له...,0.000000,0.032051,0.037513,0.040541
197,4899,ما هي الجرائم التي تتعلق بالمسكوكات والتي تُعا...,تتعلق الجرائم بطلاء القطع المعدنية بالذهب أو ا...,أولاً: الجرائم التي تتعلق بالمسكوكات والتي تُع...,الجواب هو: \nتُعاقب بالأشغال مدة لا تنقص عن خم...,الجرائم المتصلة بالمسكوكات التي تتعلق بالمسكوك...,الجرائم المتصلة بالمسكوكات التي تُعاقب عليها ب...,0.026162,0.022969,0.002655,0.008627
198,3961,ما هي العلاقة بين ذمة المساهم في الشركة المساه...,ذمة المساهم مستقلة عن ذمة الشركة، حيث تكون الش...,The financial liability of the company's gener...,تعتبر الذمة المالية للشركة المساهمة العامة مست...,يُعتبر ذمة المساهم في الشركة المساهمة العامة م...,تعتبر الذمة المالية للشركة المساهمة العامة مست...,0.000000,0.033432,0.123399,0.042421


In [None]:
# Calculate and print the mean for each of the remaining BLEU score columns
for model in models:
    model_bleu_col = f"{model}_BLEU"
    average_bleu = df[model_bleu_col].mean()
    print(f"Average BLEU Score for {model}: {average_bleu}")

Average BLEU Score for llama3.1: 0.05885008270815453
Average BLEU Score for llama3 1_instruct: 0.12854347737580796
Average BLEU Score for fine_tuned_llama3 1: 0.2911996850859672
Average BLEU Score for fine_tuned_llama3.1_instruct: 0.2739888359841367


### Compute Rogue

In [7]:
!pip install rouge-score ace_tools

Collecting ace_tools
  Downloading ace_tools-0.0-py3-none-any.whl.metadata (300 bytes)
Downloading ace_tools-0.0-py3-none-any.whl (1.1 kB)
Installing collected packages: ace_tools
Successfully installed ace_tools-0.0


In [10]:
import pandas as pd
from rouge_score import rouge_scorer

# Load the dataset
file_path = '/content/all_results.csv'
data = pd.read_csv(file_path)


# Ensure all text fields are strings
data = data.astype(str)

# Initialize Rouge Scorer
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

# Define a function to compute ROUGE scores
def compute_rouge_scores(reference, candidate):
    scores = scorer.score(reference, candidate)
    return {
        'rouge1': scores['rouge1'].fmeasure,
        'rouge2': scores['rouge2'].fmeasure,
        'rougeL': scores['rougeL'].fmeasure
    }

# Compute ROUGE scores for all models
models = [col for col in data.columns if col != 'ground_truth']
rouge_results = {model: {'rouge1': [], 'rouge2': [], 'rougeL': []} for model in models}

for model in models:
    for index, row in data.iterrows():
        scores = compute_rouge_scores(row['ground_truth'], row[model])
        rouge_results[model]['rouge1'].append(scores['rouge1'])
        rouge_results[model]['rouge2'].append(scores['rouge2'])
        rouge_results[model]['rougeL'].append(scores['rougeL'])

# Calculate average ROUGE scores for each model
average_rouge_scores = {
    model: {
        'average_rouge1': sum(rouge_results[model]['rouge1']) / len(rouge_results[model]['rouge1']),
        'average_rouge2': sum(rouge_results[model]['rouge2']) / len(rouge_results[model]['rouge2']),
        'average_rougeL': sum(rouge_results[model]['rougeL']) / len(rouge_results[model]['rougeL']),
    }
    for model in models
}

# Convert results to a DataFrame and display them
rouge_summary = pd.DataFrame.from_dict(average_rouge_scores, orient='index')

# Save the results to a CSV file
output_file = "rouge_score_comparison.csv"
rouge_summary.to_csv(output_file, index=True)

print(f"ROGUE score comparison results have been saved to {output_file}.")


ROGUE score comparison results have been saved to rouge_score_comparison.csv.
