To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a> Join Discord if you need help + support us if you can!
</div>

To install Unsloth on your own computer, follow the installation instructions on our GitHub page [here](https://github.com/unslothai/unsloth#installation-instructions---conda).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), and [how to save it](#Save) (e.g. for llama.cpp).

**[NEW] Llama-3 8b is trained on a crazy 15 trillion tokens! Llama-2 was 2 trillion.**

In [None]:
%%capture
import torch
major_version, minor_version = torch.cuda.get_device_capability()
# Must install separately since Colab has torch 2.2.1, which breaks packages
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate bitsandbytes # Downgrade xformers to a compatible version
if major_version >= 8:
    # Use this for new GPUs like Ampere, Hopper GPUs (RTX 30xx, RTX 40xx, A100, H100, L40)
    !pip install --no-deps packaging ninja einops flash-attn xformers trl peft accelerate bitsandbytes
else:
    # Use this for older GPUs (V100, Tesla T4, RTX 20xx)

    !pip install --no-deps xformers trl peft accelerate bitsandbytes
pass

In [None]:
#!pip install xformers --upgrade # Upgrade xformers to the latest version


* We support Llama, Mistral, CodeLlama, TinyLlama, Vicuna, Open Hermes etc
* And Yi, Qwen ([llamafied](https://huggingface.co/models?sort=trending&search=qwen+llama)), Deepseek, all Llama, Mistral derived archs.
* We support 16bit LoRA or 4bit QLoRA. Both 2x faster.
* `max_seq_length` can be set to anything, since we do automatic RoPE Scaling via [kaiokendev's](https://kaiokendev.github.io/til) method.
* [**NEW**] With [PR 26037](https://github.com/huggingface/transformers/pull/26037), we support downloading 4bit models **4x faster**! [Our repo](https://huggingface.co/unsloth) has Llama, Mistral 4bit models.

In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
# fourbit_models = [
#     "unsloth/mistral-7b-bnb-4bit",
#     "unsloth/mistral-7b-instruct-v0.2-bnb-4bit",
#     "unsloth/llama-2-7b-bnb-4bit",
#     "unsloth/gemma-7b-bnb-4bit",
#     "unsloth/gemma-7b-it-bnb-4bit", # Instruct version of Gemma 7b
#     "unsloth/gemma-2b-bnb-4bit",
#     "unsloth/gemma-2b-it-bnb-4bit", # Instruct version of Gemma 2b
#     "unsloth/llama-3-8b-bnb-4bit", # [NEW] 15 Trillion token Llama-3
# ] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
==((====))==  Unsloth 2024.8: Fast Llama patching. Transformers = 4.43.3.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.26.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.70G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/172 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/50.6k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/464 [00:00<?, ?B/s]

We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [None]:
 model = FastLanguageModel.get_peft_model(
     model,
     r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
     target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                       "gate_proj", "up_proj", "down_proj",],
     lora_alpha = 16,
     lora_dropout = 0, # Supports any, but = 0 is optimized
     bias = "none",    # Supports any, but = "none" is optimized
     # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
     use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
     random_state = 3407,
     use_rslora = False,  # We support rank stabilized LoRA
     loftq_config = None, # And LoftQ
 )

Unsloth 2024.8 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
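To make the size of the adapters concrete, here is a minimal sketch that counts the weights LoRA adds for the `r = 16` configuration above. The layer shapes are assumptions about Llama-3 8B (hidden size 4096, key/value width 1024, MLP width 14336, 32 layers); they are not stated anywhere in this notebook, and `lora_param_count` is a hypothetical helper, not part of Unsloth or PEFT.

```python
# Assumed Llama-3 8B shapes (not stated in this notebook): hidden size 4096,
# grouped-query K/V width 1024, MLP width 14336, 32 decoder layers.
HIDDEN, KV, MLP, LAYERS = 4096, 1024, 14336, 32

# (in_features, out_features) of every projection listed in target_modules.
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV),
    "v_proj": (HIDDEN, KV),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}


def lora_param_count(r: int) -> int:
    """Each adapted matrix gains A (r x in) and B (out x r): r * (in + out) weights."""
    per_layer = sum(r * (n_in + n_out) for n_in, n_out in SHAPES.values())
    return per_layer * LAYERS


lora_params = lora_param_count(r=16)
print(f"{lora_params:,} LoRA parameters")
```

Under these assumptions the count is 41,943,040, which is roughly half a percent of an 8-billion-parameter model. Doubling `r` doubles the count, so the rank is the main knob for adapter size.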


<a name="Data"></a>
### Data Prep
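The stock notebook uses the Alpaca dataset from [yahma](https://huggingface.co/datasets/yahma/alpaca-cleaned), a cleaned version of the original 52K-example [Alpaca dataset](https://crfm.stanford.edu/2023/03/13/alpaca.html). This run swaps it for a small custom invoice dataset (`invoice_train.xlsx`) with the same `instruction`, `input` and `output` columns. You can replace this code section with your own data prep.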

**[NOTE]** To train only on completions (ignoring the user's input) read TRL's docs [here](https://huggingface.co/docs/trl/sft_trainer#train-on-completions-only).

**[NOTE]** Remember to add the **EOS_TOKEN** to the tokenized output!! Otherwise you'll get infinite generations!

If you want to use the `ChatML` template for ShareGPT datasets, try our conversational [notebook](https://colab.research.google.com/drive/1Aau3lgPzeZKQ-98h69CCu1UJcvIBLmy2?usp=sharing).

For text completions like novel writing, try this [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing).
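The EOS note above is easy to check in isolation. The following is a small sketch of the batched formatting step, without any model or tokenizer loaded. The shortened `TEMPLATE`, the literal `EOS_TOKEN` string and the sample rows are stand-ins for illustration; in the notebook the real values come from `alpaca_prompt` and `tokenizer.eos_token`.

```python
# Hypothetical stand-ins: a shortened template and a made-up EOS string.
# In the notebook these come from alpaca_prompt and tokenizer.eos_token.
TEMPLATE = "### Instruction:\n{}\n\n### Input:\n{}\n\n### Response:\n{}"
EOS_TOKEN = "<|end_of_text|>"


def build_texts(batch: dict) -> dict:
    """Turn a batched dict of columns into one training string per row."""
    texts = [
        TEMPLATE.format(instruction, inp, out) + EOS_TOKEN
        for instruction, inp, out in zip(
            batch["instruction"], batch["input"], batch["output"]
        )
    ]
    return {"text": texts}


# Two illustrative rows in the same column layout as the dataset.
batch = {
    "instruction": ["Extract the invoice number.", "Extract the invoice date."],
    "input": ["Invoice No. :1841", "Invoice Date:30/01/2024"],
    "output": ["1841", "30/01/2024"],
}
result = build_texts(batch)
print(result["text"][0])
```

Every string ends with the EOS marker, which is what teaches the model to stop after the response instead of generating indefinitely.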

In [None]:
import pandas as pd
from datasets import load_dataset, Dataset
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(examples):
    # Column names must match those in your dataset (here: 'invoice_train.xlsx')
    instructions = examples["instruction"]
    inputs       = examples["input"]
    outputs      = examples["output"]
    texts = []
    for instruction, input, output in zip(instructions, inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }

# Load the dataset from an Excel file (pd.read_excel needs openpyxl for .xlsx)
data_path = "/content/invoice_train.xlsx"  # Path to your Excel file
df = pd.read_excel(data_path)
dataset = Dataset.from_pandas(df)
dataset = dataset.map(formatting_prompts_func, batched = True,)
print(dataset)

Map:   0%|          | 0/5 [00:00<?, ? examples/s]

Dataset({
    features: ['instruction', 'input', 'output', 'text'],
    num_rows: 5
})


<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We do 60 steps to speed things up; for a full run, set `num_train_epochs=1` and remove the `max_steps` argument. We also support TRL's `DPOTrainer`!

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
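from peft import get_peft_model, LoraConfig, TaskType

# NOTE: `model` already carries LoRA adapters from FastLanguageModel.get_peft_model
# above. Wrapping it again with a second LoraConfig (r=8, PEFT's default target
# modules) is likely unintended; consider deleting this block and keeping only
# model.print_trainable_parameters().
peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    inference_mode=False,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1
)

model = get_peft_model(model, peft_config)
model.print_trainable_parameters()  # Verify that only LoRA parameters are trainable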
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

trainable params: 38,535,168 || all params: 8,068,796,416 || trainable%: 0.4776


Map (num_proc=2):   0%|          | 0/5 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs


In [None]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.748 GB.
5.605 GB of memory reserved.


In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 5 | Num Epochs = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 38,535,168


Step,Training Loss
1,1.3402
2,0.4851
3,0.7941
4,0.8364
5,0.4495
6,1.1348
7,1.0704
8,0.3389
9,0.6405
10,0.5508
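The banner above reports `Num Epochs = 60` for only 5 examples. A rough sketch of the arithmetic follows. It is assumed to approximate how the trainer counts optimizer steps per epoch and is not taken from the library source; `epochs_needed` is a hypothetical helper.

```python
import math


def epochs_needed(num_examples: int, per_device_batch: int,
                  grad_accum: int, max_steps: int) -> int:
    """Estimate how many passes over the data a fixed step budget implies."""
    # Mini-batches produced by one pass over the dataset.
    batches_per_epoch = math.ceil(num_examples / per_device_batch)
    # Optimizer updates per pass: one per grad_accum mini-batches, at least one.
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return math.ceil(max_steps / updates_per_epoch)


epochs = epochs_needed(num_examples=5, per_device_batch=2,
                       grad_accum=4, max_steps=60)
print(epochs)
```

With the values used in this run the estimate is 60, matching the banner. Each of the 5 examples is therefore seen about 60 times, so on a dataset this small it may be worth lowering `max_steps` or adding data to limit overfitting.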


In [None]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

1231.1341 seconds used for training.
20.52 minutes used for training.
Peak reserved memory = 10.049 GB.
Peak reserved memory for training = 4.444 GB.
Peak reserved memory % of max memory = 68.138 %.
Peak reserved memory for training % of max memory = 30.133 %.


<a name="Inference"></a>
### Inference
Let's run the model! You can change the instruction and input - leave the output blank!

In [None]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        """

    You are an invoice extraction tool and you are going to help me to
    extract the field value information from the invoice data.
 Given the provided invoice text,
    extract the following key information. Use the below given structure/format/template and
    return the extracted information in the following structure.
 {schema}
 Please ensure that
    the extracted information is accurate, well-organized, and presented in the specified
    structure/format/template. Please note that if some fields were not provided in the invoice
    then mark them as 'Not Provided'.

    # Schema for the extracted fields
    {
        "Invoice Number": "value",
        "Invoice Date": "value",
        "Due Date": "value",
        "IRN": "value",
        "Vendor Name": "value",
        "Vendor Address": "value",
        "Vendor Pin Code": "value",
        "Vendor State": "value",
        "Vendor State Code": "value",
        "Vendor GST Number": "value",
        "Buyer Name": "value",
        "Buyer Address": "value",
        "Buyer Pin Code": "value",
        "Buyer State": "value",
        "Buyer State Code": "value",
        "Buyer GST Number": "value",
        "Buyer PAN Number": "value",
        "PO Number": "value",
        "SGST Percent": "value",
        "SGST Amount": "value",
        "CGST Percent": "value",
        "CGST Amount": "value",
        "IGST Percent": "value",
        "IGST Amount": "value",
        "Round Off": "value",
        "Invoice Amount (Without Tax)": "value",
        "Invoice Total Amount (With Tax)": "value",
        "Grand Total": "value",
        "Engine Number": "value",
        "Engine No": "value",
        "Serial No": "value",
        "Chassis Number": "value",
        "Phone Number": "value",
        "Aadhar Number": "value",
        "Mobile Number": "value",
        "Mob No": "value",
        "Cell No": "value",
        "Ph": "value",
        "Email": "value",
        "Financed": "value",
        "Bill Type": "value",
        "Booking No": "value",
        "Ex Showroom Price Rs": "value",
        "Company": "value",
        "Ref No": "value",
        "Reference Invoice Number": "value",
        "Hypothecation with": "value",
        "Hypo": "value",
        "Hypo By": "value",
        "Financed By": "value",
        "Delivery Note": "value",
        "Mode/Terms of Payment": "value",
        "Reference No and Date": "value",
        "Other References": "value",
        "Buyer's Order No": "value",
        "Dated": "value",
        "Dispatch Doc No": "value",
        "Delivery Note Date": "value",
        "Dispatched Through": "value",
        "Destination": "value",
        "Terms of Delivery": "value",
        "Colour": "value",
        "HSN Code": "value",
        "Vehicle ID": "value",
        "Vehicle Description": "value",
        "Place Of Supply": "value",
        "TR No": "value",
        "Bank Name": "value",
        "Bank A/c No": "value",
        "IFSC Code": "value",
        "Branch": "value",
        "Key No": "value",
        "Dispatch From": "value",
        "Dispatch To": "value",
        "Sales Executives": "value",
        "Policy Number": "value",
        "Reverse Charge": "value",
        "Name of Insured/Proposer": "value",
        "Period of Insurance": "value",
        "Address of Service Provider": "value",
        "Area Code": "value",
        "Intermediary Name/Code": "value",
        "Date of Issue/Invoice": "value",
        "Nature of Service": "value",
        "Customer Code": "value",
        "Hire Purchase": "value",
        "Electronic Reference Number": "value",
        "CIN": "value",
        "Mode of Payment": "value",
        "Incoterms": "value",
        "Transporter": "value",
        "Truck No": "value",
        "Eway Bill Number": "value",
        "Financier DO ref": "value",
        "Financier DO Dt": "value",
        "Aadhar No": "value",
        "Sale Order": "value",
        "Store": "value",
        "Receipt No": "value",
        "Hypo with": "value",
        "Dealer Ref No": "value",
        "Amount In Words": "value",
        "Condition of Sale": "value",
        "Payment Term": "value",
        "Delivery Term": "value",
        "Delivery Instruction": "value",
        "Consignee Address": "value",
        "Tin No": "value",
        "Model": "value",
        "Packing List No": "value",
        "Mode of Transport": "value",
        "Carrier's Name": "value",
        "Vehicle No": "value"
    }

    """, # instruction
        """
¦
1/30/24, 4:08 PM
Invoice No.
:1841
Vehicle Invoice
BANGALORE AUTOMOBILES AGENCY DIV LLP
218, OMALUR ROAD,
NEAR NEW BUS STAND
SALEM636004
Tamil Nadu
Ph:04272447445
Dir GST .:33AADFT7320H1ZE
Particulars
Print
Invoice Date:30/01/2024 12:08
Mr. RAJESH.T.S S.O.SRINIVASAN.T.S
NO-93/107, RAM NAGAR
FORT
SALEM,SALEM - 636001
Tamil Nadu
Mob: 9944234515
Bill Type: Credit
Qty
Rate Disc
Taxable HSN SGST%
Rate CGST%
Rate
JUPITER125-JUPITER
1 78050.78 0.00
78050.78 87112019 14.00
125 DI
10927.11 14.00 10927.11
Total
1
Sub Total
Net Total
0.00
78050.78
10927.11
(Rupees Ninety Nine Thousand Nine Hundred and Five Only - Includes HSRP and Fittings & Helmet)
Part Description
Frame No
TVS JUPITER 125-OBDIIA DISC SX MT
C BRNZ
Ex Showroom Price Rs. 99905.00
Engine No
MD626AK47R1AD6120 BK4AR17C9817
CWI BookItNo
Booking No.
: 1949
Received
1. Tools
:(Y/N)
2. Manual Book - E-
Manual
:(Y/N)
3. Duplicate Keys
:(Y/N)
10927.11
99905.00
99905.00
;
KeyNo
(40568611-Mr. RAJESH.T.S S.O.SRINIVASAN. T.S)
HP Company : INDUSIND BANK LTD
Note: Goods once sold will not be taken back
I hereby agree to opt-in to receive promotional/ service communication via email, SMS, mailers, calls & social
media from TVS Motor Company Limited Authorised Dealers / OEM. : (Y/N)'
For BANGALORE AUTOMOBILES AGENCY DIV LLP
about:blank
Алев
Authorised Signatory
1/2
""", # input
        "", # output - leave this blank for generation!

    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
tokenizer.batch_decode(outputs)

['<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\n\n\n    You are an invoice extraction tool and you are going to help me to\n    extract the field value information from the invoice data.\n Given the provided invoice text,\n    extract the following key information. Use the below given structure/format/template and\n    return the extracted information in the following structure.\n {schema}\n Please ensure that\n    the extracted information is accurate, well-organized, and presented in the specified\n    structure/format/template. Please note that if some fields were not provided in the invoice\n    then mark them as \'Not Provided\'.\n\n    # Schema for the extracted fields\n    {\n        "Invoice Number": "value",\n        "Invoice Date": "value",\n        "Due Date": "value",\n        "IRN": "value",\n        "Vendor Name": "value",\n
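Note that the decoded prompt above still contains the literal text `{schema}`: `str.format` only fills the placeholders of the template it is called on and never re-scans its arguments. If the intent is to inject the field list at that spot, one option is to format the instruction first. This is a sketch with a shortened, illustrative field list and template; `FIELDS`, `INSTRUCTION` and `TEMPLATE` are assumed names, not taken from the notebook.

```python
import json

# Shortened, illustrative field list; the notebook's schema is much longer.
FIELDS = ["Invoice Number", "Invoice Date", "Vendor Name"]
schema = json.dumps({name: "value" for name in FIELDS}, indent=4)

# Instruction text with a named placeholder for the schema.
INSTRUCTION = (
    "Extract the following fields from the invoice text and return them in "
    "this structure:\n{schema}\n"
    "Mark any field missing from the invoice as 'Not Provided'."
)
instruction = INSTRUCTION.format(schema=schema)

# Stand-in for alpaca_prompt, shortened for the example.
TEMPLATE = "### Instruction:\n{}\n\n### Input:\n{}\n\n### Response:\n{}"
prompt = TEMPLATE.format(instruction, "Invoice No. :1841", "")

# For comparison: braces inside an argument are passed through untouched.
unfilled = TEMPLATE.format("Use this structure: {schema}", "", "")
print(prompt)
```

Building the schema with `json.dumps` also keeps the field list in one Python object, so the training rows and the inference prompt can share the same definition.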

You can also use a `TextStreamer` for continuous inference, so you can watch the generation token by token instead of waiting for the full output!

<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!
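Because the upload calls take a `token` argument, it is safer to read the token from the environment than to paste it into a cell that may be shared. A minimal sketch, assuming the token is stored in an environment variable named `HF_TOKEN`; `get_hf_token` is a hypothetical helper, not part of Unsloth or `huggingface_hub`.

```python
import os


def get_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Return the Hugging Face token from the environment, or fail loudly."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable before pushing to the Hub."
        )
    return token
```

The result can then be passed as `token = get_hf_token()` to `push_to_hub` and the other upload calls, so no secret ends up in the saved notebook.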

In [None]:
#model.save_pretrained("llama3_finetuned_model_lora") # Local saving
import unsloth
from huggingface_hub import hf_api

# Replace with your actual Hugging Face token
token = ""

# Create the repository on the Hugging Face Hub
#hf_api.create_repo(repo_id="llama3_finetuned_model_lora", token=token, private=True)
model.push_to_hub("janani90/llama3_finetuned_model_lora",token=token) # Online saving

README.md:   0%|          | 0.00/575 [00:00<?, ?B/s]

  0%|          | 0/1 [00:00<?, ?it/s]

adapter_model.safetensors:   0%|          | 0.00/168M [00:00<?, ?B/s]

Saved model to https://huggingface.co/janani90/llama3_finetuned_model_lora


In [None]:
model.push_to_hub_gguf("janani90/llama3_finetuned_model_lora", tokenizer, quantization_method = "q4_k_m", token = "hf_...") # Use your own token; never commit a real one

Unsloth: You have 1 CPUs. Using `safe_serialization` is 10x slower.
We shall switch to Pytorch saving, which will take 3 minutes and not 30 minutes.
To force `safe_serialization`, set it to `None` instead.
Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 5.7G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 6.32 out of 12.67 RAM for saving.


 50%|█████     | 16/32 [00:02<00:01,  9.18it/s]We will save to Disk and not RAM now.
100%|██████████| 32/32 [01:31<00:00,  2.86s/it]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... This might take 5 minutes for Llama-7b...
Unsloth: Saving janani90/llama3_finetuned_model_lora/pytorch_model-00001-of-00004.bin...
Unsloth: Saving janani90/llama3_finetuned_model_lora/pytorch_model-00002-of-00004.bin...
Unsloth: Saving janani90/llama3_finetuned_model_lora/pytorch_model-00003-of-00004.bin...
Unsloth: Saving janani90/llama3_finetuned_model_lora/pytorch_model-00004-of-00004.bin...
Done.


Unsloth: Converting llama model. Can use fast conversion = False.


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp will take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits will take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q4_k_m'] will take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: [0] Installing llama.cpp. This will take 3 minutes...
Unsloth: [1] Converting model at janani90/llama3_finetuned_model_lora into f16 GGUF format.
The output location will be ./janani90/llama3_finetuned_model_lora/unsloth.F16.gguf
This will take 3 minutes...
INFO:hf-to-gguf:Loading model: llama3_finetuned_model_lora
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model weight map from 'pytorch_model.bin.index.json'
INFO:hf-to-gguf:gguf: loading model part 'pytorch_model-00001-of-00004.bin'
INFO:hf-to-gguf:token_embd.weight,           torch.float16 --> F16, 

  0%|          | 0/1 [00:00<?, ?it/s]

unsloth.F16.gguf:   0%|          | 0.00/16.1G [00:00<?, ?B/s]

Saved GGUF to https://huggingface.co/janani90/llama3_finetuned_model_lora
Unsloth: Uploading GGUF to Huggingface Hub...


  0%|          | 0/1 [00:00<?, ?it/s]

unsloth.Q4_K_M.gguf:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

Saved GGUF to https://huggingface.co/janani90/llama3_finetuned_model_lora


Now if you want to load the LoRA adapters we just saved for inference, change `False` to `True`:

In [None]:
if True:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "llama3_finetuned_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference

# alpaca_prompt = You MUST copy from above!

inputs = tokenizer(
[
    alpaca_prompt.format(
    """
    parse key details properly from the string below by analysing the surrounding words and format the information into a JSON object with spaces between words. Include:
  - Full Name
  - Date of Birth
  - Document Type
  If the value does not exist or if the date of birth is not present , add the term "Not Provided"
  Ensure accuracy in the extraction, especially in the full name, where it should not be mistaken for other terms. Ensure the full name is correct and also taken the middle name too.
  For example it should be Jane Citizen and not Jane Australian. If there are terms like "Driver's License" or "Indian Driving Licence" or " Australian Driving Licence" give the term as "Driving Licence".  We want the date in the format dd-mm-yyyy.
    """,# instruction
        "Election Commission of India QLLLLLUTLLELECTOR PHOTOiDENTITY CARD WJP1127240 8em1f16o: 60T60f1 EPIC Elector's JANANI Name Guun Relation's SELVARAJ Name", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
tokenizer.batch_decode(outputs)

==((====))==  Unsloth: Fast Llama patching release 2024.8
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.26.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


['<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\n\n    parse key details properly from the string below by analysing the surrounding words and format the information into a JSON object with spaces between words. Include:\n  - Full Name\n  - Date of Birth\n  - Document Type\n  If the value does not exist or if the date of birth is not present, add the term "Not Provided"\n  Ensure accuracy in the extraction, especially in the full name, where it should not be mistaken for other terms. Ensure the full name is correct and also taken the middle name too.\n  For example it should be Jane Citizen and not Jane Australian. If there are terms like "Driver\'s License" or "Indian Driving Licence" or " Australian Driving Licence" give the term as "Driving Licence".  We want the date in the format dd-mm-yyyy.\n    \n\n### Input:\nElection Commission of
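`tokenizer.batch_decode(outputs)` returns the prompt followed by the generation, as the output above shows. A small sketch of pulling out only the answer is given below. The `### Response:` marker comes from `alpaca_prompt`; the default EOS string is an assumption, so in the notebook pass `tokenizer.eos_token` instead. `extract_response` is a hypothetical helper.

```python
RESPONSE_MARKER = "### Response:"


def extract_response(decoded: str, eos_token: str = "<|end_of_text|>") -> str:
    """Return the text generated after the last response marker, without EOS."""
    _, marker, tail = decoded.rpartition(RESPONSE_MARKER)
    if not marker:
        # The marker is missing, for example because the text was truncated.
        return ""
    return tail.split(eos_token)[0].strip()


# Illustrative decoded string, not real model output.
decoded = (
    "<|begin_of_text|>### Instruction:\nExtract the name.\n\n"
    "### Input:\nElector's Name JANANI\n\n"
    '### Response:\n{"Full Name": "JANANI"}<|end_of_text|>'
)
answer = extract_response(decoded)
print(answer)
```

If the model is expected to emit JSON, the extracted string can then be handed to `json.loads` inside a `try` block, since generations cut off by `max_new_tokens` will not parse.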

You can also use Hugging Face PEFT's `AutoPeftModelForCausalLM`. Only use this if you do not have `unsloth` installed. It can be hopelessly slow, since `4bit` model downloading is not supported, and Unsloth's **inference is 2x faster**.

In [None]:
if True:
    # Strongly NOT recommended - use Unsloth if possible
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer
    model = AutoPeftModelForCausalLM.from_pretrained(
        "janani90/llama3_finetuned_model", # YOUR MODEL YOU USED FOR TRAINING
        load_in_4bit = load_in_4bit
    )
    tokenizer = AutoTokenizer.from_pretrained("janani90/llama3_finetuned_model")

OSError: unsloth is not a valid git identifier (branch name, tag name or commit id) that exists for this model name. Check the model page at 'https://huggingface.co/unsloth/llama-3-8b-bnb-4bit' for available revisions.

### Saving to float16 for vLLM

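We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. Note that Unsloth rejects a plain `merged_4bit` with a `RuntimeError` (shown in an output later in this notebook) because merging to 4bit loses accuracy for later conversions; if you are certain, use `merged_4bit_forced`. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.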

In [None]:
# Merge to 16bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if True: model.push_to_hub_merged("janani90/llama3_finetuned_model_lora", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")

### GGUF / llama.cpp Conversion
We now support saving to `GGUF` / `llama.cpp` natively! We clone `llama.cpp` and save to `q8_0` by default. Other methods such as `q4_k_m` are also supported. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.
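To get a feel for what these methods mean in practice, the sketch below converts the GGUF file sizes reported in the upload logs earlier in this notebook (16.1G for `unsloth.F16.gguf`, 4.92G for `unsloth.Q4_K_M.gguf`) into average bits per weight. The parameter count of about 8.03 billion and the reading of "G" as 10^9 bytes are assumptions; `bits_per_weight` is a hypothetical helper.

```python
# Assumed parameter count for Llama-3 8B (not stated in this notebook).
PARAMS = 8.03e9


def bits_per_weight(file_size_gb: float, params: float = PARAMS) -> float:
    """Average storage cost per parameter, treating 1 GB as 10**9 bytes."""
    return file_size_gb * 1e9 * 8 / params


f16_bits = bits_per_weight(16.1)   # size of unsloth.F16.gguf from the logs
q4_bits = bits_per_weight(4.92)    # size of unsloth.Q4_K_M.gguf from the logs
print(f"f16: {f16_bits:.1f} bits/weight, q4_k_m: {q4_bits:.1f} bits/weight")
```

Under these assumptions the f16 file comes out near 16 bits per weight, as expected, and `q4_k_m` near 5 bits rather than exactly 4, which is consistent with some tensors being kept at higher precision.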

In [None]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

In [None]:
!pip install llama-cpp-python


Collecting llama-cpp-python
  Downloading llama_cpp_python-0.2.83.tar.gz (49.4 MB)
     49.4/49.4 MB 14.3 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting diskcache>=5.6.1 (from llama-cpp-python)
  Downloading diskcache-5.6.3-py3-none-any.whl.metadata (20 kB)
Downloading diskcache-5.6.3-py3-none-any.whl (45 kB)
   45.5/45.5 kB 4.1 MB/s eta 0:00:00
Building wheels for collected packages: llama-cpp-python
  Building wheel for llama-cpp-python (pyproject.toml) ... done
  Created wheel for llama-cpp-python: filename=llama_cpp_python-0.2.83-cp310-cp310-linux_x86_64.whl size=2860113 sha256=edc75c01d1a5e8e595

SyntaxError: invalid syntax (<ipython-input-50-f5aba8f9ca67>, line 1)

In [None]:
import unsloth

In [None]:
model.save_pretrained_gguf("llama3_finetuned_model", tokenizer,quantization_method = "f16")

Unsloth: You have 1 CPUs. Using `safe_serialization` is 10x slower.
We shall switch to Pytorch saving, which will take 3 minutes and not 30 minutes.
To force `safe_serialization`, set it to `None` instead.
Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 5.7G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 6.22 out of 12.67 RAM for saving.


  3%|▎         | 1/32 [00:00<00:07,  4.43it/s]We will save to Disk and not RAM now.
100%|██████████| 32/32 [02:45<00:00,  5.17s/it]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... This might take 5 minutes for Llama-7b...
Unsloth: Saving llama3_finetuned_model/pytorch_model-00001-of-00004.bin...
Unsloth: Saving llama3_finetuned_model/pytorch_model-00002-of-00004.bin...
Unsloth: Saving llama3_finetuned_model/pytorch_model-00003-of-00004.bin...
Unsloth: Saving llama3_finetuned_model/pytorch_model-00004-of-00004.bin...
Done.


Unsloth: Converting llama model. Can use fast conversion = False.


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp will take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits will take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['f16'] will take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: [0] Installing llama.cpp. This will take 3 minutes...


RuntimeError: Unsloth: The file 'llama.cpp/llama-quantize' or 'llama.cpp/quantize' does not exist.
But we expect this file to exist! Maybe the llama.cpp developers changed the name?

In [None]:
import unsloth
from huggingface_hub import hf_api

# Replace with your actual Hugging Face token
token = "hf_..."

# Create the repository on the Hugging Face Hub
hf_api.create_repo(repo_id="llama3_finetuned_model_4b", token=token, private=True)

# Now push the model to the newly created repository
if True:
    model.push_to_hub_merged("llama3_finetuned_model", tokenizer,
                              save_method = "merged_4bit", token = token)

RuntimeError: Unsloth: Merging into 4bit will cause your model to lose accuracy if you plan
to merge to GGUF or others later on. I suggest you to do this as a final step
if you're planning to do multiple saves.
If you are certain, change `save_method` to `merged_4bit_forced`.

Now, use the generated `.gguf` file (in this run, `unsloth.F16.gguf` or `unsloth.Q4_K_M.gguf`) in `llama.cpp` or a UI-based system like `GPT4All`. You can install GPT4All by going [here](https://gpt4all.io/index.html).

And we're done! If you have any questions about Unsloth, find a bug, need help, or want to keep up with the latest LLM news and join projects, feel free to join our [Discord](https://discord.gg/u54VK8m8tk)!

Some other links:
1. Zephyr DPO 2x faster [free Colab](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing)
2. Llama 7b 2x faster [free Colab](https://colab.research.google.com/drive/1lBzz5KeZJKXjvivbYvmGarix9Ao6Wxe5?usp=sharing)
3. TinyLlama 4x faster full Alpaca 52K in 1 hour [free Colab](https://colab.research.google.com/drive/1AZghoNBQaMDgWJpi4RbffGM1h6raLUj9?usp=sharing)
4. CodeLlama 34b 2x faster [A100 on Colab](https://colab.research.google.com/drive/1y7A0AxE3y8gdj4AVkl2aZX47Xu3P1wJT?usp=sharing)
5. Mistral 7b [free Kaggle version](https://www.kaggle.com/code/danielhanchen/kaggle-mistral-7b-unsloth-notebook)
6. We also did a [blog](https://huggingface.co/blog/unsloth-trl) with 🤗 HuggingFace, and we're in the TRL [docs](https://huggingface.co/docs/trl/main/en/sft_trainer#accelerate-fine-tuning-2x-using-unsloth)!
7. `ChatML` for ShareGPT datasets, [conversational notebook](https://colab.research.google.com/drive/1Aau3lgPzeZKQ-98h69CCu1UJcvIBLmy2?usp=sharing)
8. Text completions like novel writing [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing)

<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a> Support our work if you can! Thanks!
</div>

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!cp -r /content/drive/MyDrive/ /content/janani90/llama3_finetuned_model_lora

^C


In [None]:
!cp -r /content/janani90/llama3_finetuned_model_lora/ /content/drive/MyDrive/

cp: error writing '/content/drive/MyDrive/llama3_finetuned_model_lora/unsloth.F16.gguf': No space left on device
cp: error writing '/content/drive/MyDrive/llama3_finetuned_model_lora/unsloth.Q4_K_M.gguf': No space left on device
cp: error writing '/content/drive/MyDrive/llama3_finetuned_model_lora/tokenizer_config.json': No space left on device
cp: error writing '/content/drive/MyDrive/llama3_finetuned_model_lora/special_tokens_map.json': No space left on device
cp: error writing '/content/drive/MyDrive/llama3_finetuned_model_lora/tokenizer.json': No space left on device
cp: error writing '/content/drive/MyDrive/llama3_finetuned_model_lora/config.json': No space left on device
^C
