In [1]:
%%capture
!pip install unsloth
# Also get the latest nightly Unsloth!
#!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git


# Unsloth achieves up to 2x faster fine-tuning speeds compared to traditional methods, with a significant reduction in memory usage (up to 70%).
# This makes it suitable for environments with constrained computational resources, like Google Colab or low-end GPUs​

# Unsloth leverages LoRA (Low-Rank Adaptation), which modifies only a small fraction (1-10%) of a model's parameters during training,
# instead of fine-tuning the entire model. This drastically reduces the computational and memory requirements while achieving comparable performance.
# It allows models to adapt to domain-specific tasks without retraining the entire network, enabling faster iterations and greater flexibility.

# By supporting 4-bit quantization, Unsloth minimizes memory usage during training and inference.
# Quantization reduces the precision of the weights and activations, which reduces memory demands and accelerates computation while preserving accuracy​.
#     - the weights of the models use only 4-bits representation.

In [2]:
model_path = "emeses/lab2_model"

In [None]:
from google.colab import drive
drive.mount('/content/drive')

#model_path = "/content/drive/My Drive/Colab Notebooks/lora_model2"

!ls "/content/drive/My Drive/Colab Notebooks"

In [3]:
from unsloth import FastLanguageModel
import torch


model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_path, # YOUR MODEL YOU USED FOR TRAINING
    max_seq_length = 2048,
    #max_seq_length = 512,
    dtype = None,
    #dtype = torch.bfloat32,
    load_in_4bit = True,
    #load_in_4bit = False,
)
#).to("cpu")

FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "I have beef, chicken, and lettuce, what can I make?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")
#).to("cpu")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128,
                   use_cache = True, temperature = 1.5, min_p = 0.1)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2024.12.4: Fast Llama patching. Transformers:4.46.3.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/2.24G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/184 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/54.6k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/97.3M [00:00<?, ?B/s]

Unsloth 2024.12.4 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


1. Grilled Chicken and Steak Salad with Chicken or Beef, if desired
2. Grilled Chicken and Steak with Roasted Lettuce
3. Chicken or Steak Stir-Fry with Vegetables and Lettuce
4. Lettuce Wraps with Chicken, Beef, or Steak
5. Grilled Chicken and Steak with a Side Salad or Roasted Lettuce
6. Chicken or Steak Quesadilla or Wrap with Lettuce
7. Beef or Chicken and Vegetable Kabobs with Roasted Lettuce<|eot_id|>


We now use `standardize_sharegpt` to convert ShareGPT style datasets into HuggingFace's generic format. This changes the dataset from looking like:
```
{"from": "system", "value": "You are an assistant"}
{"from": "human", "value": "What is 2+2?"}
{"from": "gpt", "value": "It's 4."}
```
to
```
{"role": "system", "content": "You are an assistant"}
{"role": "user", "content": "What is 2+2?"}
{"role": "assistant", "content": "It's 4."}
```

In [None]:
#print("Loaded model path:", model_path)
#print("Model configuration:", model.config)
#print("Tokenizer vocab size:", tokenizer.vocab_size)
#print("Tokenizer config:", tokenizer)

Model configuration: LlamaConfig {
  "_attn_implementation_autoset": true,
  "_name_or_path": "unsloth/Llama-3.2-3B-Instruct-bnb-4bit",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": [
    128001,
    128008,
    128009
  ],
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 3072,
  "initializer_range": 0.02,
  "intermediate_size": 8192,
  "max_position_embeddings": 131072,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 24,
  "num_hidden_layers": 28,
  "num_key_value_heads": 8,
  "pad_token_id": 128004,
  "pretraining_tp": 1,
  "quantization_config": {
    "bnb_4bit_compute_dtype": "float16",
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_use_double_quant": true,
    "llm_int8_enable_fp32_cpu_offload": false,
    "llm_int8_has_fp16_weight": false,
    "llm_int8_skip_modules": null,
    "llm_int8_threshold": 6.0,
    "load_in_4bit": true,
    "load_in_8b

In [4]:

    from unsloth import FastLanguageModel

    FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128,
                   use_cache = True, temperature = 1.5, min_p = 0.1)

In 1858, the Virgin Mary appeared to Saint Bernadette Soubirous, a young peasant girl who would become known as "Our Lady of Lourdes" in France. Saint Bernadette described these apparitions to others, leading to the establishment of Lourdes as a famous Catholic pilgrimage site.<|eot_id|>


EVALUATE model performance with SQUAD dataset. 
Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.

In [6]:
from datasets import load_dataset
test_dataset = load_dataset("rajpurkar/squad", split ='validation')

README.md:   0%|          | 0.00/7.62k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/14.5M [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/1.82M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/87599 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/10570 [00:00<?, ? examples/s]

You can also use Hugging Face's `AutoModelForPeftCausalLM`. Only use this if you do not have `unsloth` installed. It can be hopelessly slow, since `4bit` model downloading is not supported, and Unsloth's **inference is 2x faster**.

In [None]:
import math
# Set the percentage of questions to process

shuffled_dataset = test_dataset.shuffle(seed=42)

percentage = 5  # Process 5% of the dataset
num_samples = math.ceil(len(test_dataset) * (percentage / 100))

# Slice the dataset to the desired size
subset_dataset = test_dataset.select(range(num_samples))

# Enable native 2x faster inference
FastLanguageModel.for_inference(model)

# Initialize tokenizer and streamer
text_streamer = TextStreamer(tokenizer, skip_prompt=True)

# List to store results
predicted_answers = []

# Iterate through the subset dataset
for item in subset_dataset:
    question = item["question"]  # Get the question
    context = item["context"]    # Optionally include context in prompt

    # Prepare the input message
    messages = [
        {
            "role": "user",
            "content": question
        }
    ]

    # Tokenize and prepare inputs
    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,  # Must add for generation
        return_tensors="pt",
    ).to("cuda")

    # Generate response
    output = model.generate(
        input_ids=inputs,
        streamer=text_streamer,
        max_new_tokens=128,
        use_cache=True,
        temperature=1.5,
        min_p=0.1,
    )

    # Save result as a dictionary
    predicted_answers.append({
        "id": item["id"],
        "question": question,
        "generated_answer": tokenizer.decode(output[0], skip_special_tokens=True),
        "context": context,  # Save the context if you need to evaluate later
       "ground_truth": item["answers"]["text"],  # Ground truth answer
        })

# The `results` list now contains the output for the selected percentage of questions


The Carolina Panthers represented the AFC at Super Bowl 50. They lost to the Denver Broncos 24-10 in the game.<|eot_id|>
The New England Patriots represented the American Football Conference (AFC) at Super Bowl 50, while the Carolina Panthers represented the NFC.<|eot_id|>
Super Bowl 50, also known as the Super Bowl XLVII or the 50th edition of the National Football League championship game, took place at Levi's Stadium in Santa Clara, California, on February 7, 2016. The event was the culmination of the 2015 NFL season, in which the Denver Broncos defeated the Carolina Panthers 24-10.<|eot_id|>
Super Bowl 50 was played on February 7, 2016. Denver Broncos won the championship by defeating the Carolina Panthers 24-10.<|eot_id|>
The 50th anniversary of the Super Bowl was celebrated on February 6, 2012. The official color for the celebration was midnight blue.<|eot_id|>
The Super Bowl is a major event that brings people together from across the United States and around the world to celebr

Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in `llama.cpp` or a UI based system like `GPT4All`. You can install GPT4All by going [here](https://gpt4all.io/index.html).

**[NEW] Try 2x faster inference in a free Colab for Llama-3.1 8b Instruct [here](https://colab.research.google.com/drive/1T-YBVfnphoVc8E2E854qF3jdia2Ll2W2?usp=sharing)**

In [None]:
from tabulate import tabulate
import re
from collections import Counter
import nltk
from nltk.corpus import wordnet

# Download WordNet for synonyms (only needed once)
nltk.download('wordnet')

# Function to clean the text (remove non-word characters and split into words)
def clean_text(text):
    text = text.lower()  # Case insensitive
    text = re.sub(r'\W+', ' ', text)  # Remove non-word characters
    return text.split()

# Function to compare generated answer with ground truth
def compare_answers(generated_answer, ground_truth):
    gen_words = set(clean_text(generated_answer))
    gt_words = set(clean_text(ground_truth))

    # Exact match of words
    common_words = gen_words.intersection(gt_words)
    match_percentage = len(common_words) / len(gt_words) * 100

    return match_percentage, common_words

# Prepare table data with required columns
table_data = [
    [predicted_answer["id"], predicted_answer["question"], predicted_answer["generated_answer"], ", ".join(predicted_answer["ground_truth"])]
    for predicted_answer in predicted_answers
]

# Process the table and compare answers
processed_table = []
match_ranges = {100: 0, 75: 0, 50: 0, 30: 0, 0: 0}  # Store counts of different match ranges
total_rows = 0

for row in table_data:
    raw_answer = row[2]  # The third column is the generated answer
    # Extract text starting from "assistant" onward
    match = re.search(r'assistant\s*(.*)', raw_answer, re.DOTALL)
    cleaned_answer = match.group(1).strip() if match else raw_answer.strip()

    ground_truth = row[3]

    # Compare the generated answer with the ground truth
    match_percentage, common_words = compare_answers(cleaned_answer, ground_truth)

    # Determine the match range (based on match percentage)
    if match_percentage == 100:
        match_ranges[100] += 1
    elif match_percentage >= 75:
        match_ranges[75] += 1
    elif match_percentage >= 50:
        match_ranges[50] += 1
    elif match_percentage >= 30:
        match_ranges[30] += 1
    else:
        match_ranges[0] += 1
     # Add the cleaned-up row with comparison results
    processed_table.append([row[1], cleaned_answer, ground_truth, match_percentage, ", ".join(common_words)])

    # Update total row count
    total_rows += 1

# Print the cleaned-up and formatted table with comparison results
headers = ["Question", "Generated Answer", "Ground Truth", "Match Percentage", "Common Words"]
print(tabulate(processed_table, headers=headers, tablefmt="grid"))


In [None]:
# Print statistics
print("\nStatistics:")
print(f"Total Rows Processed: {total_rows}")
for percentage, count in match_ranges.items():
    print(f"{percentage}% Match: {count} rows")

And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/u54VK8m8tk) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!

Some other links:
1. Zephyr DPO 2x faster [free Colab](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing)
2. Llama 7b 2x faster [free Colab](https://colab.research.google.com/drive/1lBzz5KeZJKXjvivbYvmGarix9Ao6Wxe5?usp=sharing)
3. TinyLlama 4x faster full Alpaca 52K in 1 hour [free Colab](https://colab.research.google.com/drive/1AZghoNBQaMDgWJpi4RbffGM1h6raLUj9?usp=sharing)
4. CodeLlama 34b 2x faster [A100 on Colab](https://colab.research.google.com/drive/1y7A0AxE3y8gdj4AVkl2aZX47Xu3P1wJT?usp=sharing)
5. Mistral 7b [free Kaggle version](https://www.kaggle.com/code/danielhanchen/kaggle-mistral-7b-unsloth-notebook)
6. We also did a [blog](https://huggingface.co/blog/unsloth-trl) with 🤗 HuggingFace, and we're in the TRL [docs](https://huggingface.co/docs/trl/main/en/sft_trainer#accelerate-fine-tuning-2x-using-unsloth)!
7. `ChatML` for ShareGPT datasets, [conversational notebook](https://colab.research.google.com/drive/1Aau3lgPzeZKQ-98h69CCu1UJcvIBLmy2?usp=sharing)
8. Text completions like novel writing [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing)
9. [**NEW**] We make Phi-3 Medium / Mini **2x faster**! See our [Phi-3 Medium notebook](https://colab.research.google.com/drive/1hhdhBa1j_hsymiW9m-WzxQtgqTH_NHqi?usp=sharing)
10. [**NEW**] We make Gemma-2 9b / 27b **2x faster**! See our [Gemma-2 9b notebook](https://colab.research.google.com/drive/1vIrqH5uYDQwsJ4-OO3DErvuv4pBgVwk4?usp=sharing)
11. [**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)
12. [**NEW**] We make Mistral NeMo 12B 2x faster and fit in under 12GB of VRAM! [Mistral NeMo notebook](https://colab.research.google.com/drive/17d3U-CAIwzmbDRqbZ9NnpHxCkmXB6LZ0?usp=sharing)

<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a></a> Support our work if you can! Thanks!
</div>