To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">GitHub</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions on our Github page [here](https://github.com/unslothai/unsloth#installation-instructions---conda).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), and [how to save it](#Save) (e.g. for Llama.cpp).

**[NEW] Llama-3 8b is trained on a crazy 15 trillion tokens! Llama-2 was 2 trillion.**

Use our [Llama-3 8b Instruct](https://colab.research.google.com/drive/1XamvWYinY6FOSX9GLvnqSjjsNflxdhNc?usp=sharing) notebook for conversational style finetunes.

In [None]:
%%capture
# Installs Unsloth, Xformers (Flash Attention) and all other packages!
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate bitsandbytes
!pip install rouge
!pip install evaluate
!pip install rouge_score
#!pip install sari
!pip install prettytable
!pip install nltk
!pip install pandas
!pip install sacrebleu


* We support Llama, Mistral, Phi-3, Gemma, Yi, DeepSeek, Qwen, TinyLlama, Vicuna, Open Hermes etc
* We support 16bit LoRA or 4bit QLoRA. Both 2x faster.
* `max_seq_length` can be set to anything, since we do automatic RoPE Scaling via [kaiokendev's](https://kaiokendev.github.io/til) method.
* With [PR 26037](https://github.com/huggingface/transformers/pull/26037), we support downloading 4bit models **4x faster**! [Our repo](https://huggingface.co/unsloth) has Llama, Mistral 4bit models.
* [**NEW**] We make Phi-3 Medium / Mini **2x faster**! See our [Phi-3 Medium notebook](https://colab.research.google.com/drive/1hhdhBa1j_hsymiW9m-WzxQtgqTH_NHqi?usp=sharing)
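
The RoPE scaling mentioned in the list above can be pictured as linear position interpolation: positions are divided by a scale factor before the rotary angles are computed, so a longer sequence maps back onto the position range the model saw during training. The block below is a minimal sketch of that idea only, not Unsloth's implementation; the function name `rope_angles` and the tiny `dim` are illustrative assumptions.

```python
def rope_angles(position, dim=8, base=10000.0, scale=1.0):
    """Rotary angles for one token position.

    `scale` > 1 compresses positions (linear interpolation), so position
    4096 with scale=2 gets the same angles as position 2048 with scale=1.
    """
    pos = position / scale
    # One angle per pair of embedding dimensions
    return [pos / (base ** (2 * i / dim)) for i in range(dim // 2)]


# A model trained on 2048 positions, stretched to 4096 with scale = 2
print(rope_angles(2048))
print(rope_angles(4096, scale=2.0))
```

Both calls print the same list, which is the point of the interpolation: the extended positions reuse angles the model already knows.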

In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/mistral-7b-v0.3-bnb-4bit",      # New Mistral v3 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/llama-3-8b-bnb-4bit",           # Llama-3 15 trillion tokens model 2x faster!
    "unsloth/llama-3-8b-Instruct-bnb-4bit",
    "unsloth/llama-3-70b-bnb-4bit",
    "unsloth/Phi-3-mini-4k-instruct",        # Phi-3 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/mistral-7b-bnb-4bit",
    "unsloth/gemma-7b-bnb-4bit",             # Gemma 2.2x faster!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/mistral-7b-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
==((====))==  Unsloth: Fast Mistral patching release 2024.8
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.26.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/4.13G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/971 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/438 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

In [None]:

#SSirikonda
#Save the pretrained model to a variable
pretrained_model, pretrained_tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/mistral-7b-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)



==((====))==  Unsloth: Fast Mistral patching release 2024.8
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.26.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2024.8 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
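
The number of parameters that LoRA trains can be worked out by hand: an adapter of rank `r` on a linear layer with `in` inputs and `out` outputs adds `r * (in + out)` parameters. The sketch below applies this to the seven target modules above. The layer dimensions are assumptions taken from Mistral-7B's published configuration (hidden size 4096, 8 key/value heads of size 128, MLP size 14336, 32 layers); they are not stated in this notebook.

```python
# Assumed Mistral-7B dimensions (from the published config, not from this notebook)
HIDDEN = 4096
KV_DIM = 8 * 128      # 8 key/value heads of size 128
MLP = 14336
NUM_LAYERS = 32
R = 16                # LoRA rank used in get_peft_model above

# (in_features, out_features) of each targeted linear layer
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}


def lora_params(in_features, out_features, r):
    """Parameters in the two low-rank matrices A (r x in) and B (out x r)."""
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o, R) for i, o in shapes.values())
total = per_layer * NUM_LAYERS
print(f"{total:,} trainable parameters")  # 41,943,040
```

The result matches the `Number of trainable parameters` line that Unsloth prints when training starts. Against a model of roughly 7.2 billion parameters (an approximate figure), that is about 0.6% of the weights.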


<a name="Data"></a>
### Data Prep
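This notebook formats each example with the Alpaca prompt template and trains on an email subject-line dataset, `ssirikon/AESLC_Unsloth_Train_Subset` (967 examples), which is loaded in the cell below. The original Unsloth notebook used the Alpaca dataset from [yahma](https://huggingface.co/datasets/yahma/alpaca-cleaned), which is a filtered version of the 52K-example original [Alpaca dataset](https://crfm.stanford.edu/2023/03/13/alpaca.html). You can replace this code section with your own data prep.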

**[NOTE]** To train only on completions (ignoring the user's input) read TRL's docs [here](https://huggingface.co/docs/trl/sft_trainer#train-on-completions-only).

**[NOTE]** Remember to add the **EOS_TOKEN** to the tokenized output!! Otherwise you'll get infinite generations!

If you want to use the `llama-3` template for ShareGPT datasets, try our conversational [notebook](https://colab.research.google.com/drive/1XamvWYinY6FOSX9GLvnqSjjsNflxdhNc?usp=sharing).

For text completions like novel writing, try this [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing).
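
To make the EOS note concrete, here is a small sketch of how one training example is assembled. The `"</s>"` value and the sample email are made-up placeholders for illustration; in the notebook the real value comes from `tokenizer.eos_token`, and `format_example` is a hypothetical helper name.

```python
PROMPT = """Below is an instruction that describes a task, paired with an input that provides further context. A response that appropriately completes the request is needed.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = "</s>"  # assumption: placeholder; use tokenizer.eos_token in the notebook


def format_example(instruction, input_text, output):
    """Fill the template and append EOS so the model learns where to stop."""
    return PROMPT.format(instruction, input_text, output) + EOS_TOKEN


text = format_example(
    "Summarize the email and create a subject line.",
    "Hi team, the budget meeting has moved to 3pm on Friday.",  # made-up email
    "Budget meeting moved to Friday 3pm",
)
print(text)
```

Without the trailing EOS token the model never sees an example of a finished response, which is why generation can run on until `max_new_tokens` is reached.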

In [None]:
#Training with subset of train data
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. A response that appropriately completes the request is needed.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs       = examples["input"]
    outputs      = examples["output"]
    texts = []
    for instruction, input, output in zip(instructions, inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }
pass

from datasets import load_dataset
# Uncomment the line below to train on the full data, and comment out the subset line
#dataset = load_dataset("ssirikon/AESLC_Unsloth_Train", split = "train")
dataset = load_dataset("ssirikon/AESLC_Unsloth_Train_Subset", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True,)

Downloading data:   0%|          | 0.00/809k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/967 [00:00<?, ? examples/s]

Map:   0%|          | 0/967 [00:00<?, ? examples/s]

<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We do 20 steps to speed things up, but for a full run you can set `num_train_epochs=1` and remove the `max_steps` argument. We also support TRL's `DPOTrainer`!
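
To see how much of the data a short run actually covers, the arithmetic below uses the values from this notebook's training cell (batch size 1, gradient accumulation 2, `max_steps = 20`) and the 967 examples in the training subset. It is a back-of-the-envelope sketch for a single GPU, not something the trainer requires.

```python
import math

num_examples = 967            # size of the training subset
per_device_batch_size = 1
gradient_accumulation_steps = 2
max_steps = 20

# Examples consumed per optimizer step on one GPU
effective_batch = per_device_batch_size * gradient_accumulation_steps
examples_seen = effective_batch * max_steps
steps_per_epoch = math.ceil(num_examples / effective_batch)

print(f"Effective batch size: {effective_batch}")
print(f"Examples seen in {max_steps} steps: {examples_seen} "
      f"({examples_seen / num_examples:.1%} of one epoch)")
print(f"Steps needed for one full epoch: {steps_per_epoch}")
```

With these settings the run touches 40 examples, a small fraction of one epoch, so the resulting losses and evaluation scores are best read as a smoke test.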

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        #per_device_train_batch_size = 2,
        per_device_train_batch_size = 1,
        #gradient_accumulation_steps = 4,
        gradient_accumulation_steps = 2,
        warmup_steps = 5,
        max_steps = 20,
        #max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        #fp16 = False, # Explicitly set fp16 to False
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",


    ),
)

Map (num_proc=2):   0%|          | 0/967 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs
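
The learning rate used at each step follows from `warmup_steps = 5`, `max_steps = 20`, `learning_rate = 2e-4`, and `lr_scheduler_type = "linear"`. The function below is a sketch of a linear warmup followed by linear decay to zero, written from the usual definition of that schedule; `linear_lr` is a hypothetical helper, not a Transformers API.

```python
def linear_lr(step, base_lr=2e-4, warmup_steps=5, total_steps=20):
    """Learning rate after `step` optimizer steps: linear warmup, then linear decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = total_steps - step
    return base_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))


for step in (0, 1, 5, 10, 20):
    print(step, f"{linear_lr(step):.2e}")
```

The rate climbs to its peak at step 5 and reaches zero at step 20, so a 20-step run spends a quarter of its steps warming up.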


In [None]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.748 GB.
8.842 GB of memory reserved.
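
A rough calculation shows why 4-bit loading matters on a T4. The sketch below estimates the memory for the weights alone; the parameter count of 7.24 billion is an approximate, assumed figure for Mistral-7B, and the estimate ignores layers kept in higher precision, activations, LoRA weights, and optimizer state.

```python
GIB = 1024 ** 3  # the stats cell above also divides by 1024 three times


def weight_gib(n_params, bits_per_param):
    """Memory needed to hold n_params weights at the given precision, in GiB."""
    return n_params * bits_per_param / 8 / GIB


N_PARAMS = 7.24e9   # assumption: approximate Mistral-7B parameter count
T4_GIB = 14.748     # "Max memory" reported above

for label, bits in [("16-bit", 16), ("4-bit", 4)]:
    gib = weight_gib(N_PARAMS, bits)
    print(f"{label} weights: {gib:.2f} GiB ({gib / T4_GIB:.0%} of the T4)")
```

The 16-bit weights alone come to about 13.5 GiB, nearly the whole card, while the 4-bit weights need about 3.4 GiB. This notebook has loaded the 4-bit checkpoint (a 4.13 GB download) twice by this point, once as `model` and once as `pretrained_model`, which is broadly consistent with the 8.842 GB already reserved before training starts.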


In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 967 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 1 | Gradient Accumulation steps = 2
\        /    Total batch size = 2 | Total steps = 20
 "-____-"     Number of trainable parameters = 41,943,040


Step,Training Loss
1,2.2356
2,3.0029
3,2.5459
4,2.5297
5,2.9232
6,2.1615
7,1.6683
8,2.2312
9,2.1162
10,1.4129


In [None]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

54.4124 seconds used for training.
0.91 minutes used for training.
Peak reserved memory = 9.473 GB.
Peak reserved memory for training = 0.631 GB.
Peak reserved memory % of max memory = 64.232 %.
Peak reserved memory for training % of max memory = 4.279 %.


In [None]:
#Ssirikonda save finetuned model with different name
# model.save_pretrained() writes the LoRA adapters to disk and returns None,
# so its result is not assigned to a variable.
model.save_pretrained("finetuned_model") # Local saving
tokenizer.save_pretrained("finetuned_model") # Keep the tokenizer next to the adapters

In [None]:
#SSirikonda
#Reload the saved fine-tuned model into separate variables
finetuned_model, finetuned_tokenizer = FastLanguageModel.from_pretrained(
    model_name = "finetuned_model",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

# Enable gradient checkpointing
finetuned_model.gradient_checkpointing_enable()

# No cast to bfloat16 here: the T4 reports Bfloat16 = FALSE, and casting a
# 4-bit bitsandbytes model with .to(dtype=...) is not supported.

==((====))==  Unsloth: Fast Mistral patching release 2024.8
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.26.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


ValueError: Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit the quantized model. If you want to dispatch the model on the CPU or the disk while keeping these modules in 32-bit, you need to set `load_in_8bit_fp32_cpu_offload=True` and pass a custom `device_map` to `from_pretrained`. Check https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu for more details. 
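
The `ValueError` above is consistent with the GPU already holding two copies of the model (`model` and `pretrained_model`) when a third load is attempted. One common way to make room is to drop a model that is no longer needed and release PyTorch's cached memory before loading the next one. The helper below is a sketch under that assumption; `free_gpu_memory` is a hypothetical name, and whether enough memory is recovered depends on what else still references the model.

```python
import gc

try:
    import torch
except ImportError:  # keeps the sketch importable on machines without PyTorch
    torch = None


def free_gpu_memory():
    """Collect garbage, release cached CUDA blocks, and return reserved GiB."""
    gc.collect()
    if torch is not None and torch.cuda.is_available():
        torch.cuda.empty_cache()
        return round(torch.cuda.memory_reserved() / 1024 ** 3, 3)
    return 0.0


# Typical use in the notebook (assumes pretrained_model is no longer needed):
#   del pretrained_model, pretrained_tokenizer
#   print(free_gpu_memory(), "GB still reserved")
print(free_gpu_memory())
```

Evaluating the base model first, deleting it, and only then loading the fine-tuned copy keeps at most two models on the card at a time.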

<a name="Inference"></a>
### Inference
Let's run the model! You can change the instruction and input - leave the output blank!

In [None]:
from datasets import load_dataset

dataset_val = load_dataset("ssirikon/AESLC_Unsloth_Val", split = "validation")

dataset_test = load_dataset("ssirikon/AESLC_Unsloth_Test", split = "test")

dataset_val_subset = load_dataset("ssirikon/AESLC_Unsloth_Val_Subset", split = "validation")

dataset_test_subset = load_dataset("ssirikon/AESLC_Unsloth_Test_Subset", split = "test")


Downloading data:   0%|          | 0.00/1.95M [00:00<?, ?B/s]

Generating validation split:   0%|          | 0/1962 [00:00<?, ? examples/s]

Downloading data:   0%|          | 0.00/1.68M [00:00<?, ?B/s]

Generating test split:   0%|          | 0/1906 [00:00<?, ? examples/s]

Downloading data:   0%|          | 0.00/8.71k [00:00<?, ?B/s]

Generating validation split:   0%|          | 0/10 [00:00<?, ? examples/s]

Downloading data:   0%|          | 0.00/7.84k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/10 [00:00<?, ? examples/s]

In [None]:
import re
def regextract(text):
    """
    Extracts the generated text that follows the '### Output:' header, up to the next blank line.

    Args:
      text: A decoded model output string.

    Returns:
      The extracted text, or the string 'None' if no match is found.
    """
    # Lazily capture everything after the '### Output:' line up to the first blank line.
    match = re.search(r'### Output:\n(.*?)\n{2}', text, re.DOTALL)
    if match:
        print('match found')
        return match.group(1).strip()
    return 'None'  # The string 'None' (not the None object) signals that nothing matched
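
The pattern above only matches when the generated text is followed by a blank line, which is worth knowing when reading the results. The sketch below shows the behaviour on three made-up strings and compares it with a more lenient variant that also accepts the end of the string. The `LENIENT` pattern and the `extract` helper are illustrative assumptions, not part of the notebook's pipeline.

```python
import re

STRICT = re.compile(r'### Output:\n(.*?)\n{2}', re.DOTALL)            # pattern used above
LENIENT = re.compile(r'### Output:\n(.*?)(?:\n{2}|\Z)', re.DOTALL)    # hypothetical variant


def extract(pattern, text):
    """Return the stripped capture, or None when the pattern does not match."""
    match = pattern.search(text)
    return match.group(1).strip() if match else None


with_blank = "### Output:\n    Budget review\n\n### Next"
without_blank = "### Output:\n    Budget review"
empty_output = "### Output:\n\n\n"

print(extract(STRICT, with_blank))      # Budget review
print(extract(STRICT, without_blank))   # None
print(extract(STRICT, empty_output))    # prints an empty line
print(extract(LENIENT, without_blank))  # Budget review
```

The second and third cases correspond to the `'None'` and `''` entries that can appear in the extracted results: the first means no blank line followed the output, the second means the model emitted a blank line straight away.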

In [None]:
import pandas as pd  # Needed for the DataFrame check below
def extract_and_format(input_texts):
    """
    Applies regextract to each text and formats the results into a list of lists.

    Args:
        input_texts: A list, Pandas Series, or DataFrame containing text data.
            For a DataFrame, the 'text' column is used.

    Returns:
        A list of lists, where each inner list contains a single extracted text string.
    """
    if isinstance(input_texts, pd.DataFrame): # Check if input_texts is a DataFrame
        input_texts = input_texts['text'] # Extract the 'text' column
    result = [[regextract(str(text))] for text in input_texts]
    return result

In [None]:
import pandas as pd # Import the Pandas library

def run_model(model, tokenizer, inputs):
    outputs_decoded = []
    alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. A response that appropriately completes the request is expected.

    ### Instruction:
    {}

    ### Input:
    {}

    ### Output:
    {}"""
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference

    for item in inputs:  # Iterate over each row of the inputs Dataset
        # Dataset rows arrive as dictionaries
        if isinstance(item, dict):
            instruction = item.get('instruction', '')  # Use .get() to handle missing keys
            email = item.get('Email', '')
        # If the row is a plain string, split it into instruction and email on '###'
        elif isinstance(item, str):
            parts = item.split('###')
            instruction = parts[0] if len(parts) > 0 else ''
            email = parts[1] if len(parts) > 1 else ''
        else:
            # Fall back to empty fields for any other type
            instruction = ''
            email = ''
        input_text = alpaca_prompt.format(instruction, email, "")

        inputs_tokenized = tokenizer(
            [input_text],
            return_tensors="pt",
            padding=True,
            truncation=True
        ).to("cuda")
        outputs = model.generate(
            input_ids=inputs_tokenized['input_ids'],
            attention_mask=inputs_tokenized['attention_mask'],
            max_new_tokens=64,
            use_cache=True, #Enable caching
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.eos_token_id,
        )
        # Decode and join the output strings
        decoded_output = tokenizer.batch_decode(outputs, skip_special_tokens=True)
        print('decoded---',decoded_output)
        outputs_decoded.append(" ".join(decoded_output))  # Join the list of strings into a single string
    #print('output length-----',len(outputs_decoded))

    return outputs_decoded

In [None]:
'''
import pandas as pd # Import the Pandas library

def run_model(model, tokenizer, inputs):
    outputs_decoded = []
    alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. A response that appropriately completes the request is expected.

    ### Instruction:
    {}

    ### Input:
    {}

    ### Output:
    {}"""

    outputs = []
    for item in inputs:
        # Extract instruction and input from the item
        instruction = item['instruction']  # Assuming 'instruction' is a column in your dataset
        input_value = item['Email']  # Assuming 'input' is a column in your dataset

        # Create input_text using alpaca_prompt
        input_text = alpaca_prompt.format(instruction, input_value, "")

        inputs_tokenized = tokenizer(
            [input_text], # Now input_text is the formatted prompt
            return_tensors="pt",
            padding=True,
            truncation=True
        ).to("cuda")

        # Generate outputs without using past key values (no caching)
        outputs = model.generate(
            input_ids=inputs_tokenized['input_ids'],
            attention_mask=inputs_tokenized['attention_mask'],
            max_new_tokens=64,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.eos_token_id,
        )

        # Decode the generated output
        output_decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
        outputs_decoded.append(output_decoded)

    print('output length-----',len(outputs_decoded))

    return outputs_decoded
  '''


In [None]:
dataset_test_subset["Subject"]

['Fariba Karimi looking for another role Feb 1st  ',
 'Reutes Kobra Changes  ',
 'Draft ICAP WG AGENDA FOR OCt. 5  ',
 'Natural Gas Origination  ',
 'Tyson Update  ',
 'Lexis-Nexis Training: Houston & Worldwide / Dow Jones Training  ',
 'Final version  ',
 'Origination Opportunities in Global Markets  ',
 'Congratulations  ',
 'Meeting on Tuesday, November 30  ']

In [None]:
extract_and_format(run_model(pretrained_model,pretrained_tokenizer, dataset_test_subset)) #pretrained base model

[[''],
 ['### Answer:\n    Reuters Kobra Permissioning Database Changes'],
 ['10/5 ICAP Working Group Agenda'],
 ['### Correct:\n    Summarize the email and create a subject line.'],
 ['1.5-pager that describes the meeting optics/logistics/attendee attributes, etc.'],
 ['### Answer:\n    Lexis-Nexis Training'],
 ['1. Jeff, yet another email. This article just got re-written but obviously needs to be updated. What do you think about putting in your name on this? Mike'],
 ['None'],
 [''],
 ['2 pm (Sao Paulo) time']]

In [None]:
# Recursively remove the tokenizer cache directory
!rm -rf huggingface_tokenizers_cache

In [None]:
extract_and_format(run_model(finetuned_model,finetuned_tokenizer, dataset_test_subset)) #fine-tuned model reloaded from disk

OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB. GPU 

In [None]:
extract_and_format(run_model(model,tokenizer, dataset_test_subset)) #lora finetuned

In [None]:

import pandas as pd
from IPython.display import display
from evaluate import load
import evaluate
import nltk
import sacrebleu
nltk.download('punkt') # Download the 'punkt' resource for tokenization
nltk.download('wordnet') # Download the 'wordnet' resource for METEOR
from nltk.translate.meteor_score import meteor_score
#from nltk.translate.cider import Cider
'''
def create_results_table(datasetType, model, input_texts, outputs_decoded):
  """
  Calculates various evaluation metrics and displays the results in a Pandas DataFrame.

  Args:
    model: The name of the model being evaluated.
    input_texts: A list of reference texts.
    outputs_decoded: A list of generated texts.
  """
  # Extract the strings from the single-element arrays and handle potential non-string elements
  candidates = [text[0] if isinstance(text[0], str) else str(text[0]) for text in extract_and_format(outputs_decoded)]
  references = [text[0] if isinstance(text[0], str) else str(text[0]) for text in extract_and_format(input_texts)]


  # Tokenize the reference texts for METEOR
  tokenized_candidates = [nltk.word_tokenize(can) for can in candidates]
  tokenized_references = [nltk.word_tokenize(ref) for ref in references]

  # Calculate ROUGE scores
  rouge = evaluate.load('rouge')
  rouge_results = rouge.compute(predictions=candidates, references=[[ref] for ref in references])

  # Calculate BLEU scores
  bleu = evaluate.load('bleu')
  # Handle potential zero-length predictions for BLEU
  try:
    bleu_results = bleu.compute(predictions=tokenized_candidates, references=[[ref] for ref in references])
  except ZeroDivisionError:
    print("Warning: ZeroDivisionError encountered during BLEU calculation. Setting BLEU score to 0.")
    bleu_results = {'bleu': 0}  # Set BLEU to 0 in case of error

  # Calculate SacreBLEU score
  sacrebleu = evaluate.load('sacrebleu')
  sacrebleu_results = sacrebleu.compute(predictions=candidates, references=[[ref] for ref in references])

  # Calculate METEOR scores
  meteor_scores = []
  meteor = evaluate.load('meteor')
  meteor_results = meteor.compute(predictions=tokenized_candidates, references=tokenized_references)

  avg_meteor_score = sum(meteor_results['score']) / len(meteor_results['score'])


  data = {'Dataset': [datasetType],
          'Model': [model],
          'ROUGE1': [rouge_results['rouge1']],
          'ROUGE2': [rouge_results['rouge2']],
          'ROUGEL': [rouge_results['rougeL']],
          'ROUGELsum': [rouge_results['rougeLsum']],
          'BLEU': [bleu_results['bleu']],
          'METEOR': [avg_meteor_score],
          'SacreBLEU': [sacrebleu_results['score']]
          }
  df = pd.DataFrame(data)
  display(df)
  '''

In [None]:
def create_results_table(datasetType, model, input_texts, outputs_decoded):
  """
  Calculates various evaluation metrics and displays the results in a Pandas DataFrame.

  Args:
    datasetType: Label for the dataset split, e.g. 'Test' or 'Validation'.
    model: The name of the model being evaluated.
    input_texts: A list of reference texts (the true subject lines).
    outputs_decoded: Generated texts, either plain strings or the single-element
      lists returned by extract_and_format.
  """
  # Unwrap single-element lists. The callers already ran extract_and_format on the
  # outputs, so the regex is not applied a second time here, and it is never applied
  # to the references, which contain no '### Output:' header.
  candidates = [str(text[0]) if isinstance(text, (list, tuple)) else str(text) for text in outputs_decoded]
  references = [str(text[0]).strip() if isinstance(text, (list, tuple)) else str(text).strip() for text in input_texts]

  # Check if candidates or references are empty and handle the case
  if not candidates or not references:
    print("Warning: Empty candidates or references. Skipping evaluation.")
    return

  # Calculate ROUGE scores
  rouge = evaluate.load('rouge')
  rouge_results = rouge.compute(predictions=candidates, references=[[ref] for ref in references])

  # Calculate BLEU scores. The metric tokenizes internally, so it takes raw strings.
  bleu = evaluate.load('bleu')
  # BLEU divides by the total prediction length, so guard against all-empty predictions
  if not any(can.split() for can in candidates):
    print("Warning: Empty candidates encountered during BLEU calculation. Setting BLEU score to 0.")
    bleu_results = {'bleu': 0}  # Set BLEU to 0 in case of error
  else:
    bleu_results = bleu.compute(predictions=candidates, references=[[ref] for ref in references])

  # Calculate SacreBLEU score (named sacrebleu_metric so it does not shadow the sacrebleu module)
  sacrebleu_metric = evaluate.load('sacrebleu')
  sacrebleu_results = sacrebleu_metric.compute(predictions=candidates, references=[[ref] for ref in references])

  # Calculate METEOR. The metric takes raw strings and returns the average under the 'meteor' key.
  meteor = evaluate.load('meteor')
  meteor_results = meteor.compute(predictions=candidates, references=references)
  avg_meteor_score = meteor_results['meteor']

  data = {'Dataset': [datasetType],
          'Model': [model],
          'ROUGE1': [rouge_results['rouge1']],
          'ROUGE2': [rouge_results['rouge2']],
          'ROUGEL': [rouge_results['rougeL']],
          'ROUGELsum': [rouge_results['rougeLsum']],
          'BLEU': [bleu_results['bleu']],
          'METEOR': [avg_meteor_score],
          'SacreBLEU': [sacrebleu_results['score']]
          }
  df = pd.DataFrame(data)
  display(df)

In [None]:
def pipelineStepsForPretrainedModel(dataset, datasetType):
  input_texts = dataset["Subject"]
  #run model
  outputs_decoded = run_model(pretrained_model,pretrained_tokenizer, dataset)
  #then extract response
  outputs_decoded  = extract_and_format(outputs_decoded)
  # Check if outputs_decoded is not empty before proceeding
  if outputs_decoded:
    print(outputs_decoded)
    # Pass a short label rather than the model object, since it is stored in the 'Model' column
    create_results_table(datasetType, 'Pretrained', input_texts, outputs_decoded)
    # Pair each extracted output with its reference subject line
    df = pd.DataFrame({'Output': [out[0] for out in outputs_decoded],
                       'Input': list(input_texts)})
    print(df)
  else:
    print("Warning: Empty outputs_decoded. Skipping evaluation.")

In [None]:
def pipelineStepsForFineTunedModel(dataset, datasetType):
  input_texts = dataset["Subject"]
  #run model
  outputs_decoded = run_model(model,tokenizer, dataset)
  #then extract response
  outputs_decoded  = extract_and_format(outputs_decoded)
  # Check if outputs_decoded is not empty before proceeding
  if outputs_decoded:
    print(outputs_decoded)
    # Pass a short label rather than the model object, since it is stored in the 'Model' column
    create_results_table(datasetType, 'Fine-tuned', input_texts, outputs_decoded)
    # Pair each extracted output with its reference subject line
    df = pd.DataFrame({'Output': [out[0] for out in outputs_decoded],
                       'Input': list(input_texts)})
    print(df)
  else:
    print("Warning: Empty outputs_decoded. Skipping evaluation.")


In [None]:
# Run the pipeline on the subsets
# ---- dataset_test_subset ----
pipelineStepsForPretrainedModel(dataset_test_subset, 'Test')
pipelineStepsForFineTunedModel(dataset_test_subset, 'Test')
# ---- dataset_val_subset ----
#pipelineStepsForPretrainedModel(dataset_val_subset, 'Validation')
#pipelineStepsForFineTunedModel(dataset_val_subset, 'Validation')




In [None]:
# Call the pipeline on the full datasets
# ---- dataset_val ----
pipelineStepsForPretrainedModel(dataset_val, 'Validation')
pipelineStepsForFineTunedModel(dataset_val, 'Validation')
# ---- dataset_test ----
pipelineStepsForPretrainedModel(dataset_test, 'Test')
pipelineStepsForFineTunedModel(dataset_test, 'Test')


#https://klu.ai/glossary/rouge-score
#https://huggingface.co/spaces/evaluate-metric/rouge


What are some alternatives to ROUGE?
There are several alternative metrics for evaluating the quality of text summaries:

BLEU (Bilingual Evaluation Understudy): A widely used metric from machine translation, BLEU measures the similarity between a candidate summary and one or more reference summaries by counting the n-grams that appear in both. It is precision-oriented, and like ROUGE it is computed automatically, without human judgments.

METEOR (Metric for Evaluation of Translation with Explicit ORdering): METEOR matches words between candidate and reference using stems and synonyms as well as exact forms, so it captures more of the semantic similarity than exact n-gram overlap. It also penalizes matches that appear in a different order, making it a useful alternative to ROUGE for evaluating text summarization tasks.

CIDEr (Consensus-based Image Description Evaluation): Originally developed for image captioning, CIDEr is not part of the ROUGE family. It weights n-grams by term frequency-inverse document frequency (TF-IDF) and scores a candidate by its cosine similarity to a set of reference texts. The weighting reduces the impact of common words on the overall score and rewards the words and phrases the references agree on.

ROUGE-L: A variant of ROUGE based on the longest common subsequence (LCS) between candidate and reference summaries. Because the LCS respects word order without requiring the words to be adjacent, ROUGE-L can be useful for assessing how well a summary follows the structure of the reference.

SARI (System output Against References and against the Input sentence): A metric designed for text simplification, SARI compares the system output with both the reference texts and the original input, and scores how well words are added, deleted, and kept. It is most useful when the task is to edit or rewrite the input rather than only to condense it.

By considering these alternative metrics, you can gain a broader understanding of how well your text summaries perform and identify areas for improvement in your system or approach.
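
To show what the overlap-based scores measure, here is a simplified, pure-Python sketch of ROUGE-1 and ROUGE-L F1 for a single candidate and reference. It tokenizes by lowercasing and splitting on whitespace, which is an assumption made for brevity; the `rouge` metric used in this notebook applies its own tokenization and options, so its numbers can differ.

```python
from collections import Counter


def tokens(text):
    return text.lower().split()


def rouge1_f1(candidate, reference):
    """F1 over unigram overlap (clipped counts)."""
    c, r = Counter(tokens(candidate)), Counter(tokens(reference))
    overlap = sum((c & r).values())
    total = sum(c.values()) + sum(r.values())
    return 2 * overlap / total if total else 0.0


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rougeL_f1(candidate, reference):
    """F1 based on the longest common subsequence."""
    c, r = tokens(candidate), tokens(reference)
    total = len(c) + len(r)
    return 2 * lcs_length(c, r) / total if total else 0.0


# One generated subject line and its reference from the test subset above
candidate = "Reuters Kobra Permissioning Database Changes"
reference = "Reutes Kobra Changes"
print(rouge1_f1(candidate, reference))  # 0.5
print(rougeL_f1(candidate, reference))  # 0.5
```

Two of the three reference words are recovered, but the misspelled "Reutes" in the reference never matches "Reuters", which illustrates how exact-match metrics penalize a candidate that a human reader might prefer.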

In [None]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "Summarize the email and create a subject line.", # instruction
        "Greg/Phillip,  Attached is the Grande Communications Service Agreement. The business points can be found in Exhibit C.  I Can get the Non-Disturbance agreement after it has been executed by you and Grande. I will fill in the Legal description of the property one I have received it. Please execute and send to:  Grande Communications, 401 Carlson Circle, San Marcos Texas, 78666 Attention Hunter Williams. <<Bishopscontract.doc>>  ", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
tokenizer.batch_decode(outputs)
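`tokenizer.batch_decode(outputs)` returns the full prompt followed by the generated text. If only the generated part is needed, for example to feed it to the metrics above, a small helper along these lines can strip the prompt. This is a sketch: the `### Response:` marker assumes the Alpaca-style template defined earlier in the notebook, and the end-of-sequence string is an assumed placeholder (in practice, pass `tokenizer.eos_token`).

```python
def extract_response(decoded, marker="### Response:", eos_token="<|end_of_text|>"):
    """Return the text after the last response marker, without the EOS token.

    `marker` and `eos_token` are assumptions about the prompt template and
    tokenizer; adjust them to match your own setup.
    """
    _, found, tail = decoded.rpartition(marker)
    if not found:
        # Marker missing: fall back to the whole string.
        return decoded.strip()
    return tail.replace(eos_token, "").strip()

# Made-up decoded string, for illustration only.
decoded = (
    "### Instruction:\nSummarize the email and create a subject line.\n\n"
    "### Response:\nSubject: Service Agreement<|end_of_text|>"
)
print(extract_response(decoded))
```

With the cell above, this could be applied as `[extract_response(text, eos_token = tokenizer.eos_token) for text in tokenizer.batch_decode(outputs)]`.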

You can also use a `TextStreamer` for continuous inference, so you can watch the generation token by token instead of waiting for the full output!

In [None]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "Summarize the email and create a subject line.", # instruction
        "All -    In preparation for another round of Trading Track interviews for the ENA group, please be aware of the following dates...   October 10 - October 16 :  Initial phone interviews by two traders for External candidates. October 24, 3:00 - 6:00pm :  Final interviews for internal and external candidates. Please send either Karen Buckley or me the names of internal individuals who you feel would be a great candidate for the ENA Trading Track. We look forward to your active participation. Kind regards,  ", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)

<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Huggingface's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [None]:
model.save_pretrained("lora_model") # Local saving
tokenizer.save_pretrained("lora_model")
model.push_to_hub("your_name/lora_model", token = "") # Online saving - supply your own token
tokenizer.push_to_hub("your_name/lora_model", token = "") # Online saving - supply your own token
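Rather than pasting an access token into a notebook cell, where it can leak when the notebook is shared, one option is to read it from an environment variable. The helper below is a hypothetical sketch and not part of Unsloth; `HF_TOKEN` is the variable name that `huggingface_hub` itself looks for.

```python
import os

def get_hf_token(env_var="HF_TOKEN"):
    """Return the token stored in an environment variable, or None if unset or blank."""
    token = os.environ.get(env_var, "").strip()
    return token or None

# Example use with the calls above (not executed here):
# model.push_to_hub("your_name/lora_model", token = get_hf_token())
print("Token found" if get_hf_token() else "No token set")
```

Returning `None` when the variable is missing makes it easy to detect the problem up front instead of failing during the upload.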

Now, if you want to load the LoRA adapters we just saved for inference, change `False` to `True`:

In [None]:
if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference

# alpaca_prompt = You MUST copy from above!

inputs = tokenizer(
[
    alpaca_prompt.format(
        "Summarize the email and create a subject line.", # instruction
        "John,  As discussed, the AIG exposure is $57MM, and it is distributed among the price, option, and exotic books. The attached spreadsheet details the dollar value and volume by month by book. Please call if you have questions. Tanya", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
tokenizer.batch_decode(outputs)

You can also use Hugging Face's `AutoPeftModelForCausalLM`. Only use this if you do not have `unsloth` installed. It can be very slow, since `4bit` model downloading is not supported, and Unsloth's **inference is 2x faster**.

In [None]:
if False:
    # Not recommended - use Unsloth if possible
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer
    model = AutoPeftModelForCausalLM.from_pretrained(
        "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        load_in_4bit = load_in_4bit,
    )
    tokenizer = AutoTokenizer.from_pretrained("lora_model")

### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [None]:
# Merge to 16bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")

### GGUF / llama.cpp Conversion
We now support saving to `GGUF` / `llama.cpp` natively! We clone `llama.cpp` and save to `q8_0` by default. All other methods, such as `q4_k_m`, are also available. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.
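To get a feel for the resource trade-off between formats, the file size can be approximated as the number of parameters times the bits stored per weight. The sketch below is back-of-the-envelope arithmetic, not an Unsloth or `llama.cpp` API. It ignores metadata and assumes every tensor uses the same format. `f16` stores 16 bits per weight, and `q8_0` stores 8.5 bits per weight (32 one-byte values plus a 2-byte scale per block of 32 weights). The 8 billion parameter count is a round number chosen for illustration.

```python
def estimate_gguf_size_gb(n_params, bits_per_weight):
    """Approximate file size in GB (1e9 bytes), ignoring metadata overhead."""
    return n_params * bits_per_weight / 8 / 1e9

n_params = 8e9  # assumption: a round 8 billion parameters
for name, bits in [("f16", 16), ("q8_0", 8.5)]:
    print(f"{name}: about {estimate_gguf_size_gb(n_params, bits):.1f} GB")
```

The mixed `k` formats such as `q4_k_m` use different bit widths for different tensors, so their effective bits per weight has to be measured from an actual file rather than assumed.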

In [None]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

Now, use the `model-unsloth.gguf` file or the `model-unsloth-Q4_K_M.gguf` file in `llama.cpp` or in a UI-based system like `GPT4All`. You can install GPT4All by going [here](https://gpt4all.io/index.html).

And we're done! If you have any questions about Unsloth, find a bug, need help, or want to join projects and keep up with the latest LLM news, feel free to join our [Discord](https://discord.gg/u54VK8m8tk) channel!

Some other links:
1. Zephyr DPO 2x faster [free Colab](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing)
2. Llama 7b 2x faster [free Colab](https://colab.research.google.com/drive/1lBzz5KeZJKXjvivbYvmGarix9Ao6Wxe5?usp=sharing)
3. TinyLlama 4x faster full Alpaca 52K in 1 hour [free Colab](https://colab.research.google.com/drive/1AZghoNBQaMDgWJpi4RbffGM1h6raLUj9?usp=sharing)
4. CodeLlama 34b 2x faster [A100 on Colab](https://colab.research.google.com/drive/1y7A0AxE3y8gdj4AVkl2aZX47Xu3P1wJT?usp=sharing)
5. Mistral 7b [free Kaggle version](https://www.kaggle.com/code/danielhanchen/kaggle-mistral-7b-unsloth-notebook)
6. We also did a [blog](https://huggingface.co/blog/unsloth-trl) with 🤗 HuggingFace, and we're in the TRL [docs](https://huggingface.co/docs/trl/main/en/sft_trainer#accelerate-fine-tuning-2x-using-unsloth)!
7. `ChatML` for ShareGPT datasets, [conversational notebook](https://colab.research.google.com/drive/1Aau3lgPzeZKQ-98h69CCu1UJcvIBLmy2?usp=sharing)
8. Text completions like novel writing [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing)
9. [**NEW**] We make Phi-3 Medium / Mini **2x faster**! See our [Phi-3 Medium notebook](https://colab.research.google.com/drive/1hhdhBa1j_hsymiW9m-WzxQtgqTH_NHqi?usp=sharing)

<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a> Support our work if you can! Thanks!
</div>