To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">GitHub</a></i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions on our GitHub page [here](https://github.com/unslothai/unsloth#installation-instructions---conda).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & [how to save it](#Save) (e.g. for llama.cpp).

**[NEW] Llama-3 8b is trained on a crazy 15 trillion tokens! Llama-2 was 2 trillion.**

Use our [Llama-3 8b Instruct](https://colab.research.google.com/drive/1XamvWYinY6FOSX9GLvnqSjjsNflxdhNc?usp=sharing) notebook for conversational style finetunes.

In [1]:
%%capture
# Installs Unsloth, Xformers (Flash Attention) and all other packages!
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate bitsandbytes
# Evaluation and reporting dependencies (bitsandbytes is already installed above).
!pip install rouge evaluate rouge_score prettytable nltk pandas sacrebleu
#!pip install sari



* We support Llama, Mistral, Phi-3, Gemma, Yi, DeepSeek, Qwen, TinyLlama, Vicuna, Open Hermes etc
* We support 16bit LoRA or 4bit QLoRA. Both 2x faster.
* `max_seq_length` can be set to anything, since we do automatic RoPE Scaling via [kaiokendev's](https://kaiokendev.github.io/til) method.
* With [PR 26037](https://github.com/huggingface/transformers/pull/26037), we support downloading 4bit models **4x faster**! [Our repo](https://huggingface.co/unsloth) has Llama, Mistral 4bit models.
* [**NEW**] We make Phi-3 Medium / Mini **2x faster**! See our [Phi-3 Medium notebook](https://colab.research.google.com/drive/1hhdhBa1j_hsymiW9m-WzxQtgqTH_NHqi?usp=sharing)
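
To make the RoPE scaling bullet above more concrete, here is a minimal sketch of linear position interpolation, the idea described in kaiokendev's notes: position indices are divided by a scale factor so that a longer sequence maps back into the position range seen during pretraining. The function names, the `original_max_len` value, and the tiny `head_dim` are illustrative assumptions, not Unsloth's internals.

```python
def linear_scale_factor(max_seq_length, original_max_len):
    """Scale factor for linear RoPE scaling; never below 1 (no scaling needed)."""
    return max(1.0, max_seq_length / original_max_len)

def rope_angles(position, head_dim, base=10000.0, scale=1.0):
    """Rotary angles for one position. A scale > 1 compresses the position index."""
    scaled_position = position / scale
    return [scaled_position / (base ** (2 * i / head_dim)) for i in range(head_dim // 2)]

# Hypothetical numbers: a model pretrained on 2048 positions, run at 8192.
scale = linear_scale_factor(max_seq_length=8192, original_max_len=2048)
print(scale)

# Position 8188 in the long sequence gets the same angles as position 2047 unscaled.
scaled = rope_angles(8188, head_dim=8, scale=scale)
unscaled = rope_angles(2047, head_dim=8)
print(scaled == unscaled)
```

When `max_seq_length` is no larger than the pretraining length, the factor stays at 1 and the angles are unchanged.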

In [2]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/mistral-7b-v0.3-bnb-4bit",      # New Mistral v3 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/llama-3-8b-bnb-4bit",           # Llama-3 15 trillion tokens model 2x faster!
    "unsloth/llama-3-8b-Instruct-bnb-4bit",
    "unsloth/llama-3-70b-bnb-4bit",
    "unsloth/Phi-3-mini-4k-instruct",        # Phi-3 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/mistral-7b-bnb-4bit",
    "unsloth/gemma-7b-bnb-4bit",             # Gemma 2.2x faster!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gemma-7b-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
==((====))==  Unsloth 2024.8: Fast Gemma patching. Transformers = 4.43.3.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.26.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.57G [00:00<?, ?B/s]

`config.hidden_act` is ignored, you should use `config.hidden_activation` instead.
Gemma's activation function will be set to `gelu_pytorch_tanh`. Please, use
`config.hidden_activation` if you want to override this behaviour.
See https://github.com/huggingface/transformers/pull/29402 for more details.


generation_config.json:   0%|          | 0.00/137 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/40.0k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/4.24M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/636 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.5M [00:00<?, ?B/s]

We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [3]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2024.8 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.
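
As a sanity check on how small the LoRA update is, the sketch below counts the adapter parameters by hand. Each adapted linear layer of shape `(in, out)` gains two low-rank matrices with `r * (in + out)` parameters in total. The projection shapes are an assumption about Gemma-7B's configuration (hidden size 3072, attention width 4096, MLP width 24576); with `r = 16` and 28 layers they reproduce the trainable-parameter count that the training log in this notebook reports.

```python
# Assumed Gemma-7B projection shapes as (in_features, out_features).
GEMMA_7B_SHAPES = {
    "q_proj": (3072, 4096),
    "k_proj": (3072, 4096),
    "v_proj": (3072, 4096),
    "o_proj": (4096, 3072),
    "gate_proj": (3072, 24576),
    "up_proj": (3072, 24576),
    "down_proj": (24576, 3072),
}

def lora_param_count(shapes, rank, num_layers):
    """LoRA adds A (rank x in) and B (out x rank) per adapted layer."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in shapes.values())
    return per_layer * num_layers

total = lora_param_count(GEMMA_7B_SHAPES, rank=16, num_layers=28)
print(f"{total:,}")  # 50,003,968
```

Doubling `r` doubles this count, which is the main cost of choosing a larger rank.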


<a name="Data"></a>
### Data Prep
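This notebook fine-tunes on email subject line generation: the cells below load `ssirikon/AESLC_Unsloth_Train` from the Hugging Face Hub and wrap each email (`input`) and its subject line (`output`) in an Alpaca-style prompt. The prompt layout follows the Alpaca dataset from [yahma](https://huggingface.co/datasets/yahma/alpaca-cleaned), which is a filtered version of the 52K original [Alpaca dataset](https://crfm.stanford.edu/2023/03/13/alpaca.html). You can replace this code section with your own data prep.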

**[NOTE]** To train only on completions (ignoring the user's input) read TRL's docs [here](https://huggingface.co/docs/trl/sft_trainer#train-on-completions-only).

**[NOTE]** Remember to add the **EOS_TOKEN** to the tokenized output!! Otherwise you'll get infinite generations!
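
The EOS note above can be sketched in a few lines. The helper name `with_eos` is hypothetical, and the `"<eos>"` string is an assumption about Gemma's tokenizer; in the notebook itself, reading `tokenizer.eos_token` is the safer choice.

```python
def with_eos(text, eos_token):
    """Append the end-of-sequence marker exactly once."""
    return text if text.endswith(eos_token) else text + eos_token

EOS_TOKEN = "<eos>"  # assumption: replace with tokenizer.eos_token in the notebook

example = with_eos("### Response:\nMeeting on Tuesday, November 30", EOS_TOKEN)
print(example)
```

Without the marker, the model never sees an example of where a subject line stops, so generation tends to run until `max_new_tokens`.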

If you want to use the `llama-3` template for ShareGPT datasets, try our conversational [notebook](https://colab.research.google.com/drive/1XamvWYinY6FOSX9GLvnqSjjsNflxdhNc?usp=sharing).

For text completions like novel writing, try this [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing).

In [30]:
def format_test(x):
  """Wrap one training row (email in 'input', subject in 'output') in the Alpaca-style prompt."""
  instruction = "Generate a subject line for the following email."
  # Read the EOS marker from the tokenizer instead of hard-coding it.
  eos_token = tokenizer.eos_token

  if x['input']:
    formatted_text = f"""Below is an instruction that describes a task. \
    Write a response that appropriately completes the request.

    ### Instruction:
    {instruction}

    ### Input:
    {x['input']}

    ### Response:
    {x['output']}{eos_token}"""

  else:
    formatted_text = f"""Below is an instruction that describes a task. \
    Write a response that appropriately completes the request.

    ### Instruction:
    {instruction}

    ### Response:
    {x['output']}{eos_token}"""

  # Return a dictionary so that dataset.map() adds a "text" column
  return {"text": formatted_text}


In [29]:
from datasets import load_dataset
# Full training set. For a quicker run, comment this line out and uncomment the subset line below.
dataset = load_dataset("ssirikon/AESLC_Unsloth_Train", split = "train")
#dataset = load_dataset("ssirikon/AESLC_Unsloth_Train_Subset", split = "train")
dataset = dataset.map(format_test)


<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We run only 10 steps (`max_steps = 10`) to speed things up; for a full run, set `num_train_epochs = 1` and remove `max_steps`. We also support TRL's `DPOTrainer`!
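
To see how short this demo run is, here is a rough back-of-the-envelope sketch using the values from the training cell in this notebook (batch size 1, gradient accumulation 2, 14,455 examples, `max_steps = 10`). The helper name `training_plan` is hypothetical, and the steps-per-epoch figure is an approximation: the trainer's own rounding may differ by one step.

```python
import math

def training_plan(num_examples, per_device_batch_size, grad_accum_steps, max_steps, num_gpus=1):
    """Approximate how much of the dataset a capped run actually sees."""
    effective_batch_size = per_device_batch_size * grad_accum_steps * num_gpus
    steps_per_epoch = math.ceil(num_examples / effective_batch_size)
    examples_seen = min(max_steps * effective_batch_size, num_examples)
    return {
        "effective_batch_size": effective_batch_size,
        "steps_per_epoch": steps_per_epoch,
        "examples_seen": examples_seen,
        "fraction_of_epoch": examples_seen / num_examples,
    }

plan = training_plan(num_examples=14455, per_device_batch_size=1,
                     grad_accum_steps=2, max_steps=10)
print(plan)
```

With these settings the run touches 20 examples, well under 1% of an epoch, so the resulting metrics should be read as a smoke test rather than a trained model.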

In [31]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        #per_device_train_batch_size = 2,
        per_device_train_batch_size = 1,
        #gradient_accumulation_steps = 4,
        gradient_accumulation_steps = 2,
        warmup_steps = 5,
        max_steps = 10,
        #max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        #fp16 = False, # Explicitly set fp16 to False
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "paged_adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",


    ),
)

Map (num_proc=2):   0%|          | 0/14455 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs
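
The arguments above combine `warmup_steps = 5`, `max_steps = 10`, `learning_rate = 2e-4`, and `lr_scheduler_type = "linear"`. A minimal sketch of that schedule, assuming the usual linear warmup followed by linear decay to zero (the function name `linear_lr` is hypothetical):

```python
def linear_lr(step, base_lr=2e-4, warmup_steps=5, total_steps=10):
    """Learning rate at a given optimizer step: linear warmup, then linear decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

schedule = [linear_lr(step) for step in range(11)]
for step, lr in enumerate(schedule):
    print(step, f"{lr:.2e}")
```

In a 10-step run half of the steps are spent warming up, so the peak rate is reached only at step 5 and decays immediately afterwards.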


In [7]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.748 GB.
5.83 GB of memory reserved.


In [36]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 14,455 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 1 | Gradient Accumulation steps = 2
\        /    Total batch size = 2 | Total steps = 10
 "-____-"     Number of trainable parameters = 50,003,968


Step,Training Loss
1,1.984
2,1.744
3,1.4506
4,1.69
5,2.0401
6,1.9402
7,1.6722
8,1.5022
9,1.4791
10,1.6082


In [37]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

20.1241 seconds used for training.
0.34 minutes used for training.
Peak reserved memory = 13.266 GB.
Peak reserved memory for training = 7.436 GB.
Peak reserved memory % of max memory = 89.951 %.
Peak reserved memory for training % of max memory = 50.42 %.


<a name="Inference"></a>
### Inference
Let's run the model! You can change the instruction and input - leave the output blank!

In [38]:
from datasets import load_dataset

dataset_val = load_dataset("ssirikon/AESLC_Unsloth_Val", split = "validation")

dataset_test = load_dataset("ssirikon/AESLC_Unsloth_Test", split = "test")

dataset_val_subset = load_dataset("ssirikon/AESLC_Unsloth_Val_Subset", split = "validation")

dataset_test_subset = load_dataset("ssirikon/AESLC_Unsloth_Test_Subset", split = "test")


In [39]:
import re
def regextract(text):
    """
    Extracts the text between '### Output:\n' and the next blank line.

    Note: the prompts in this notebook use '### Response:', not '### Output:',
    so adjust the pattern if you apply this to the generations below.

    Args:
      text: A string.

    Returns:
      The extracted text, or the string 'None' if no match is found.
    """
    # Non-greedy match up to the first pair of newlines; DOTALL lets '.' span lines.
    match = re.search(r'### Output:\n(.*?)\n{2}', text, re.DOTALL)
    if match:
        return match.group(1).strip()
    return 'None'  # The string 'None' (not the None object) marks a missing match

In [40]:
import pandas as pd

def extract_and_format(input_texts):
    """
    Runs `regextract` over each text and strips leftover labels and newlines.

    Args:
        input_texts: A Pandas Series, a DataFrame with a 'text' column, or any iterable of strings.

    Returns:
        A flat list of cleaned strings, one per input text.
    """
    if isinstance(input_texts, pd.DataFrame): # Check if input_texts is a DataFrame
        input_texts = input_texts['text'] # Extract the 'text' column
    result = [regextract(str(text)) for text in input_texts]
    result = [str(text).replace('### Answer:',"").replace('\n',"").replace('### Correct:',"").replace("  ","").replace('Subject:',"") for text in result ]

    return result

In [41]:
def format_func(instruction, email):
  """Build the inference prompt: the training template with the response left empty."""

  if email:
    formatted_text = f"""Below is an instruction that describes a task. \
    Write a response that appropriately completes the request.

    ### Instruction:
    {instruction}

    ### Input:
    {email}

    ### Response:
    """

  else:
    formatted_text = f"""Below is an instruction that describes a task. \
    Write a response that appropriately completes the request.

    ### Instruction:
    {instruction}

    ### Response:
    """

  # Return a dictionary instead of a string
  return {"text": formatted_text}

In [42]:
def extract_subject(text):
    start_tag = "### Response:"

    # Find where the response section starts
    start_idx = text.find(start_tag)

    # Check that the tag is present
    if start_idx == -1:
        return None  # Tag not found

    # Keep everything after the tag
    subject = text[start_idx + len(start_tag):].strip()

    return subject
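
`extract_subject` keeps everything after the response tag, which works when the model stops right after the subject line. As a hedged variant, the sketch below keeps only the first non-empty line of the response, which can help if a generation runs on past the subject. The function name `extract_first_response_line` and the sample string are illustrative assumptions.

```python
def extract_first_response_line(decoded, tag="### Response:"):
    """Return the first line after the response tag, or None if the tag is absent."""
    start = decoded.find(tag)
    if start == -1:
        return None
    tail = decoded[start + len(tag):].strip()
    return tail.splitlines()[0].strip() if tail else ""

sample = (
    "### Instruction:\nGenerate a subject line for the following email.\n\n"
    "### Response:   \n    Enquiry regarding Faria's release date\n\nSome trailing text"
)
print(extract_first_response_line(sample))
```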

In [51]:
import pandas as pd # Import the Pandas library
from transformers import TextStreamer

# Note: this rebinds `dataset` (previously the training set) to the test subset.
dataset = dataset_test_subset

original_subjects = []
emails = []
result_subjects = []

ann0 = []
ann1 = []
ann2 = []

FastLanguageModel.for_inference(model) # Enable native 2x faster inference
text_streamer = TextStreamer(tokenizer)

for item in dataset:
    email = item['Email']
    prompt = format_func("Generate a subject line for the following email.", email)['text']

    # The model is already on the GPU after loading, so only the inputs need moving.
    inputs = tokenizer([prompt], return_tensors = "pt").to("cuda")

    generated_ids = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 512, repetition_penalty = 1.5)

    # Decode prompt + completion, then keep only the text after "### Response:".
    decoded = tokenizer.decode(generated_ids[0], skip_special_tokens = True)
    result_subjects.append(extract_subject(decoded))

    emails.append(email)
    original_subjects.append(item['Subject'])
    ann0.append(item['Ann0'])
    ann1.append(item['Ann1'])
    ann2.append(item['Ann2'])

zipped_subjects = list(zip(emails, original_subjects, result_subjects, ann0, ann1, ann2))

model_with_Unsloth_df = pd.DataFrame(zipped_subjects, columns = ['Emails','True_subjects', 'result_subjects', 'ann0', 'ann1', 'ann2'])
model_with_Unsloth_df.to_csv('Gemma_with_Unsloth.csv')


<bos>Below is an instruction that describes a task.     Write a response that appropriately completes the request.

    ### Instruction:
    Generate a subject line for the following email.

    ### Input:
    Sally and Kevin,  Fariba Karimi does the Accounting for Equity Trading. She has been in our group for about a year as a Senior Specialist. She will be released from her current role at the end of January. She is very interested in finding another spot at Enron. If you are aware of anything in either Operations or Accounting, please give me or her a call at 5-2510. Thanks,  

    ### Response:   
    Enquiry regarding Faria's release date<eos>
<bos>Below is an instruction that describes a task.     Write a response that appropriately completes the request.

    ### Instruction:
    Generate a subject line for the following email.

    ### Input:
    Market Data has recently made changes to the Reuters Kobra permissioning database. If you no longer have access to information that y

In [52]:
gemma_subjects=result_subjects

In [53]:
gemma_subjects

["Enquiry regarding Faria's release date",
 'KOBRA Access Changes',
 'Draft Agenda - Oct-04 Meeting',
 'Gas Business Changes',
 'Meeting of Governors & Business Leaders - Logistics Document',
 'Training',
 "Mike's Article",
 'Resume',
 'Super Saturdays',
 'Reschedule Meeting Time - Sao Paolo, Brazil meeting on May10th at noon Central Standard US/Canada or Noon Mexico City Localtime']

In [54]:
import evaluate

# Load the ROUGE metric through Hugging Face `evaluate`.
rouge = evaluate.load('rouge')

# Score the generated subjects against the original subject lines.
original_model_results = rouge.compute(
    predictions=gemma_subjects,
    references=original_subjects,
    use_aggregator=True,
    use_stemmer=True,
)

# Score the generated subjects against each of the three annotator references.
ann0_original_model_results = rouge.compute(
    predictions=gemma_subjects,
    references=ann0,
    use_aggregator=True,
    use_stemmer=True,
)

ann1_original_model_results = rouge.compute(
    predictions=gemma_subjects,
    references=ann1,
    use_aggregator=True,
    use_stemmer=True,
)

ann2_original_model_results = rouge.compute(
    predictions=gemma_subjects,
    references=ann2,
    use_aggregator=True,
    use_stemmer=True,
)

print('Gemma_Subjects Vs. Original Subjects:')
print(original_model_results)

print('\nGemma_Subject Vs. ann0:')
print(ann0_original_model_results)

print('\nGemma_Subject Vs. ann1:')
print(ann1_original_model_results)

print('\nGemma_Subject Vs. ann2:')
print(ann2_original_model_results)

Gemma_Subjects Vs. Original Subjects:
{'rouge1': 0.18233333333333335, 'rouge2': 0.008695652173913044, 'rougeL': 0.18266666666666667, 'rougeLsum': 0.18333333333333332}

Gemma_Subject Vs. ann0:
{'rouge1': 0.3284768740031898, 'rouge2': 0.05, 'rougeL': 0.28984848484848486, 'rougeLsum': 0.29283492822966506}

Gemma_Subject Vs. ann1:
{'rouge1': 0.17879227053140098, 'rouge2': 0.02857142857142857, 'rougeL': 0.14322463768115942, 'rougeLsum': 0.1434057971014493}

Gemma_Subject Vs. ann2:
{'rouge1': 0.3017885562713149, 'rouge2': 0.06666666666666667, 'rougeL': 0.29813565744600223, 'rougeLsum': 0.29965390930908165}


In [47]:
dataset_test_subset["Subject"]

['Fariba Karimi looking for another role Feb 1st  ',
 'Reutes Kobra Changes  ',
 'Draft ICAP WG AGENDA FOR OCt. 5  ',
 'Natural Gas Origination  ',
 'Tyson Update  ',
 'Lexis-Nexis Training: Houston & Worldwide / Dow Jones Training  ',
 'Final version  ',
 'Origination Opportunities in Global Markets  ',
 'Congratulations  ',
 'Meeting on Tuesday, November 30  ']

#https://klu.ai/glossary/rouge-score
#https://huggingface.co/spaces/evaluate-metric/rouge
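
To build intuition for the ROUGE-1 numbers above, here is a simplified sketch of unigram-overlap F1. It lowercases and splits on whitespace only, so it will not reproduce the `evaluate` scores exactly (that implementation tokenizes differently and applies stemming when `use_stemmer=True`). The function name `rouge1_f1` is hypothetical.

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    """F1 over overlapping unigrams, counting each word at most as often as it appears in both."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# One pair taken from the lists printed in this notebook.
score = rouge1_f1("KOBRA Access Changes", "Reutes Kobra Changes")
print(round(score, 3))
```

Two of the three words match in each direction, so precision and recall are both 2/3 here.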


What are some alternatives to ROUGE?
There are several alternative metrics for evaluating the quality of text summaries:

BLEU (Bilingual Evaluation Understudy): A widely used machine translation metric. BLEU measures how many n-grams in a candidate summary also appear in one or more reference summaries (modified n-gram precision), with a brevity penalty for candidates that are too short. It is fully automatic, so no human judgments are needed at scoring time.

METEOR (Metric for Evaluation of Translation with Explicit ORdering): METEOR aligns candidate and reference words using exact, stem, and synonym matches, then combines precision and recall with a penalty for fragmented word order. This lets it capture some semantic similarity that pure n-gram overlap misses.

CIDEr (Consensus-based Image Description Evaluation): Originally developed for image captioning, CIDEr weights n-grams by term frequency-inverse document frequency (TF-IDF) and scores the cosine similarity between candidate and reference n-gram vectors. Words that are common across the whole corpus therefore contribute less to the score.

ROUGE-L: Strictly a ROUGE variant rather than an alternative (it is already reported in the results above). ROUGE-L scores the longest common subsequence (LCS) between candidate and reference, so it rewards words that appear in the same order without requiring them to be adjacent.

SARI (System output Against References and against the Input sentence): Designed for text simplification, SARI compares the system output with both the input and the references, and scores the words the system correctly adds, deletes, and keeps. It is most relevant when the task is to edit an input text rather than only to condense it.

By considering these alternative metrics, you can gain a broader understanding of how well your text summaries perform and identify areas for improvement in your system or approach.
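
The longest-common-subsequence idea behind ROUGE-L can be sketched as follows. This is a simplified illustration under the same assumptions as a whitespace tokenizer with no stemming, so it will not match a library implementation exactly; the function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists (dynamic programming)."""
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]

def rouge_l_f1(prediction, reference):
    """F1 computed from the LCS length instead of raw n-gram overlap."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Made-up pair: "draft", "agenda", "oct" appear in the same order in both strings.
score_l = rouge_l_f1("draft agenda oct meeting", "draft icap wg agenda for oct 5")
print(round(score_l, 3))
```

Because the subsequence need not be contiguous, extra words between matches lower recall or precision but do not break the match.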

You can also use a `TextStreamer` for continuous inference, so you can watch the generation token by token instead of waiting for the full output!

In [48]:
# Build the prompt with format_func from above (alpaca_prompt is not defined in this notebook).
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
prompt = format_func(
    "Generate a subject line for the following email.", # instruction
    "All -    In preparation for another round of Trading Track interviews for the ENA group, please be aware of the following dates...   October 10 - October 16 :  Initial phone interviews by two traders for External candidates. October 24, 3:00 - 6:00pm :  Final interviews for internal and external candidates. Please send either Karen Buckley or me the names of internal individuals who you feel would be a great candidate for the ENA Trading Track. We look forward to your active participation. Kind regards,  ", # input
)['text']
inputs = tokenizer([prompt], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)

<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Huggingface's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!
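
Since `push_to_hub` needs a write token, one way to avoid pasting it into the notebook is to read it from an environment variable. This is a minimal sketch: the helper name `get_hf_token` is hypothetical, and it assumes you have stored the token under `HF_TOKEN` beforehand.

```python
import os

def get_hf_token(env_var="HF_TOKEN"):
    """Read the Hugging Face token from the environment instead of hard-coding it."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable before pushing to the Hub.")
    return token

# Usage in the save cell (left commented so this sketch runs without a token):
# model.push_to_hub("your_name/lora_model", token = get_hf_token())
# tokenizer.push_to_hub("your_name/lora_model", token = get_hf_token())
```

A token that has been committed or shared in a notebook should be revoked from the Hugging Face settings page and replaced.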

In [None]:
model.save_pretrained("lora_model") # Local saving
tokenizer.save_pretrained("lora_model")
model.push_to_hub("your_name/lora_model", token = "hf_...") # Online saving; use your own token and never commit a real one
tokenizer.push_to_hub("your_name/lora_model", token = "hf_...") # Online saving

Now if you want to load the LoRA adapters we just saved for inference, change `if False` to `if True`:

In [None]:
if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference

# Build the prompt with format_func from above (alpaca_prompt is not defined in this notebook).
prompt = format_func(
    "Generate a subject line for the following email.", # instruction
    "John,  As discussed, the AIG exposure is $57MM, and it is distributed among the price, option, and exotic books. The attached spreadsheet details the dollar value and volume by month by book. Please call if you have questions. Tanya", # input
)['text']
inputs = tokenizer([prompt], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
tokenizer.batch_decode(outputs)

You can also use Hugging Face PEFT's `AutoPeftModelForCausalLM`. Only use this if you do not have `unsloth` installed. It can be hopelessly slow, since `4bit` model downloading is not supported, and Unsloth's **inference is 2x faster**.

In [None]:
if False:
    # Not recommended - use Unsloth if possible
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer
    model = AutoPeftModelForCausalLM.from_pretrained(
        "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        load_in_4bit = load_in_4bit,
    )
    tokenizer = AutoTokenizer.from_pretrained("lora_model")

### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [None]:
# Merge to 16bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")

### GGUF / llama.cpp Conversion
We now support saving to `GGUF` / `llama.cpp` natively! We clone `llama.cpp` and save to `q8_0` by default. All other methods, such as `q4_k_m`, are also available. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

In [None]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in `llama.cpp` or a UI based system like `GPT4All`. You can install GPT4All by going [here](https://gpt4all.io/index.html).

And we're done! If you have any questions about Unsloth, find a bug, need help, or want to keep up with the latest LLM news and join projects, feel free to join our [Discord](https://discord.gg/u54VK8m8tk) channel!

Some other links:
1. Zephyr DPO 2x faster [free Colab](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing)
2. Llama 7b 2x faster [free Colab](https://colab.research.google.com/drive/1lBzz5KeZJKXjvivbYvmGarix9Ao6Wxe5?usp=sharing)
3. TinyLlama 4x faster full Alpaca 52K in 1 hour [free Colab](https://colab.research.google.com/drive/1AZghoNBQaMDgWJpi4RbffGM1h6raLUj9?usp=sharing)
4. CodeLlama 34b 2x faster [A100 on Colab](https://colab.research.google.com/drive/1y7A0AxE3y8gdj4AVkl2aZX47Xu3P1wJT?usp=sharing)
5. Mistral 7b [free Kaggle version](https://www.kaggle.com/code/danielhanchen/kaggle-mistral-7b-unsloth-notebook)
6. We also did a [blog](https://huggingface.co/blog/unsloth-trl) with 🤗 HuggingFace, and we're in the TRL [docs](https://huggingface.co/docs/trl/main/en/sft_trainer#accelerate-fine-tuning-2x-using-unsloth)!
7. `ChatML` for ShareGPT datasets, [conversational notebook](https://colab.research.google.com/drive/1Aau3lgPzeZKQ-98h69CCu1UJcvIBLmy2?usp=sharing)
8. Text completions like novel writing [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing)
9. [**NEW**] We make Phi-3 Medium / Mini **2x faster**! See our [Phi-3 Medium notebook](https://colab.research.google.com/drive/1hhdhBa1j_hsymiW9m-WzxQtgqTH_NHqi?usp=sharing)

<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a> Support our work if you can! Thanks!
</div>