<a href="https://colab.research.google.com/github/pascarujo/assistente-sage-xvii-stpc/blob/main/sagelm-llama-training.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# STPC Fine-tuning with Unsloth

In [None]:
%%capture
# Installs Unsloth, Xformers (Flash Attention) and all other packages!
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
#!pip install --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate bitsandbytes

In [None]:
#!pip uninstall xformers
!pip install xformers --pre --force-reinstall --extra-index-url https://download.pytorch.org/whl/nightly/cu121
!pip install trl peft accelerate bitsandbytes


In [None]:
from unsloth import FastLanguageModel
import torch

max_seq_length = 32768 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = True,
    trust_remote_code = True,
)

Are you certain you want to do remote code execution?
==((====))==  Unsloth 2024.8: Fast Llama patching. Transformers = 4.44.1.
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.564 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.4.0+cu121. CUDA = 8.0. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.dev885. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


## 3. Load Dataset

#### QA data

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments
from datasets import Dataset, load_dataset
import pandas as pd
import json

In [None]:
# Load the JSONL file
jsonl_file = '/content/drive/MyDrive/STPC/gpt_training_data.jsonl'

# Read one JSON object per line
data = []
with open(jsonl_file, 'r', encoding='utf-8') as f:
    for line in f:
        line = line.strip()
        if line:  # skip blank lines
            data.append(json.loads(line))
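
The loading step above can be wrapped in a small reusable function. This is a minimal sketch, not part of the original notebook: the name `load_jsonl` is hypothetical, and it assumes the file is UTF-8 encoded with one JSON object per line. It skips blank lines and reports the line number when a line fails to parse.

```python
import json


def load_jsonl(path, encoding="utf-8"):
    """Read a JSON Lines file into a list of objects, skipping blank lines."""
    records = []
    with open(path, "r", encoding=encoding) as f:
        for line_number, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate empty lines, e.g. a trailing newline
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as err:
                # Point at the offending line instead of failing anonymously
                raise ValueError(
                    f"{path}: invalid JSON on line {line_number}: {err}"
                ) from err
    return records
```

Calling `load_jsonl(jsonl_file)` would then stand in for the explicit loop.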





In [None]:
for entry in data:
    print(entry.keys())
    break  # Stop after the first entry to avoid printing them all

dict_keys(['messages'])


In [None]:
instructions = []
responses = []

# Walk through the data and access the list under 'messages'
for entry in data:
    messages = entry['messages']
    current_instruction = ""

    # Iterate over each message in 'messages'
    for message in messages:
        if message['role'] == 'user':
            current_instruction = message['content']
        elif message['role'] == 'assistant' and current_instruction:
            instructions.append(current_instruction)
            responses.append(message['content'])
            current_instruction = ""


# Build a DataFrame from the instructions and responses
df = pd.DataFrame({
    "instruction": instructions,
    "input": ["" for _ in instructions],  # No additional input, so keep it empty
    "output": responses
})

print(df.head())

# Convert the DataFrame into a Hugging Face Dataset
dataset = Dataset.from_pandas(df)

print(dataset)

                                         instruction input  \
0  Explique a introdução à norma IEC/61850 e sua ...         
1  Explique o modelo de dados do IEC/61850 no con...         
2  Explique o modelo de serviços do IEC/61850 no ...         
3  Explique o mapeamento do modelo IEC/61850 no S...         
4  Explique a entidade CNF (Configuração da Ligaç...         

                                              output  
0  17.1  Introdução  \nEste anexo des creve a con...  
1  17.1.1  Modelo de Dados do IEC/61850  \nSob a ...  
2  17.1.2  Modelo de Serviços do IEC/61850  \nAss...  
3  17.1.3  Mapeamento do Modelo IEC/61850 no SAGE...  
4  17.2.1  CNF \n→ Entidade Configuração da Ligaç...  


ArrowTypeError: ('Input object was not a NumPy array', 'Conversion failed for column instruction with type object')
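
An `ArrowTypeError` like the one above is commonly raised when an object column mixes strings with other Python types such as lists or dicts. The raw data is not shown here, so that cause is an assumption. Under that assumption, one possible workaround is to coerce every message `content` to a string while pairing user and assistant turns. The helper names `content_to_text` and `pair_messages` are hypothetical.

```python
import json


def content_to_text(content):
    """Return message content as a plain string (hypothetical helper)."""
    if isinstance(content, str):
        return content
    if content is None:
        return ""
    # Lists/dicts (e.g. multi-part content) are serialized so every cell is a str
    return json.dumps(content, ensure_ascii=False)


def pair_messages(messages):
    """Pair each user message with the assistant message that follows it."""
    pairs = []
    current_instruction = ""
    for message in messages:
        text = content_to_text(message.get("content"))
        role = message.get("role")
        if role == "user":
            current_instruction = text
        elif role == "assistant" and current_instruction:
            pairs.append((current_instruction, text))
            current_instruction = ""
    return pairs
```

Building `instructions` and `responses` from `pair_messages(entry['messages'])` would give both columns a uniform string type before the `Dataset.from_pandas` call.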

In [None]:
# Save the dataset to disk for later use
dataset.save_to_disk('/content/drive/MyDrive/STPC/gpt_dataset')

Saving the dataset (0/1 shards):   0%|          | 0/1899 [00:00<?, ? examples/s]

In [None]:
dataset = Dataset.load_from_disk('/content/drive/MyDrive/STPC/gpt_dataset')

In [None]:
def formatting_prompts_func(examples):
    alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

    instructions = examples["instruction"]
    inputs = examples["input"]
    outputs = examples["output"]
    texts = []

    for instruction, input_text, output in zip(instructions, inputs, outputs):
        # Append the tokenizer's real EOS token (not the literal string "EOS_TOKEN")
        text = alpaca_prompt.format(instruction, input_text, output) + tokenizer.eos_token
        texts.append(text)

    return {"text": texts}

# Apply the formatting function
dataset = dataset.map(formatting_prompts_func, batched=True)

# Check the formatted data
print(dataset[0])

{'instruction': 'Explique a introdução à norma IEC/61850 e sua aplicação no SAGE, incluindo os volumes da norma, o contexto da arquitetura UCA-2.0, a especificação MMS, a estrutura de nomes e o formato de troca de informações de configuração.', 'input': '', 'output': '17.1  Introdução  \nEste anexo des creve a configura ção necessária para as multilig ações de aquisição do SAGE \nestabelecidas com Inteligent Eletronic Devices  (IEDs ou Relés Digitais) sob o protocolo definido na \nnorma IEC/61850.  \nA norma IEC/61850 foi publicada em 14 volumes numerados e identi ficados como most rado a \nseguir:   \nIEC/61850 -1: Introduction and overview  \nIEC/61850 -2: Glossary  \nIEC/61850 -3: General requirements  \nIEC/61850 -4: System and project management  \nIEC/61850 -5: Communications and requirements for functions and device models  \nIEC/61850 -6: Configu ration description language for  communication in electrical substations related to IEDs  \nIEC/61850 -7-1: Basic communication struc
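
Each training text needs to end with the tokenizer's actual end-of-sequence string, otherwise the model never sees a stop signal during fine-tuning. The sketch below isolates the formatting step and passes the EOS string in explicitly. The function name `format_example` is hypothetical, and the `<|end_of_text|>` value is only a stand-in: in the notebook the value should come from `tokenizer.eos_token`.

```python
ALPACA_PROMPT = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""


def format_example(instruction, input_text, output, eos_token):
    """Fill the Alpaca template and append the EOS string passed by the caller."""
    return ALPACA_PROMPT.format(instruction, input_text, output) + eos_token


# Stand-in value for illustration; read the real one from tokenizer.eos_token
example = format_example("Explique o CNF.", "", "17.2.1 CNF ...", "<|end_of_text|>")
```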

In [None]:
# Save the formatted dataset
dataset.save_to_disk('/content/drive/MyDrive/STPC/formatted_dataset')

Saving the dataset (0/1 shards):   0%|          | 0/1899 [00:00<?, ? examples/s]

In [None]:
# Use a name that does not shadow the `datasets` package
splits = dataset.train_test_split(test_size=0.1)
train_dataset = splits['train']
eval_dataset = splits['test']

## 4. Set up LoRA and training environment

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 128, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 128,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    use_rslora = True,
    use_gradient_checkpointing = 'unsloth',
    random_state = 0,
)
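
With `use_rslora = True`, the adapter output is scaled by `lora_alpha / sqrt(r)` rather than the standard `lora_alpha / r`, as described in the PEFT documentation for rank-stabilized LoRA. The sketch below works out what that means for `r = 128` and `lora_alpha = 128`. The function name `lora_scaling` is hypothetical and does not mirror the library's internals.

```python
import math


def lora_scaling(r, lora_alpha, use_rslora=False):
    """Scaling factor applied to the LoRA update for a given rank and alpha."""
    if r <= 0:
        raise ValueError("r must be a positive integer")
    if use_rslora:
        return lora_alpha / math.sqrt(r)  # rank-stabilized LoRA
    return lora_alpha / r                 # standard LoRA


standard = lora_scaling(128, 128)                     # 1.0
stabilized = lora_scaling(128, 128, use_rslora=True)  # sqrt(128), about 11.31
```

The same `lora_alpha` therefore produces a much larger effective update under rsLoRA at high rank, which is worth keeping in mind when choosing the learning rate.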

In [None]:
project = "Meta-Llama-3.1-8B-bnb-4bit-01"
run_name = "run_3_SS" # Defining a separate run name in case you want to start another one resuming from a checkpoint
project_and_run_name = project + "-" + run_name
output_dir = "./" + project_and_run_name

### Callback for uploading checkpoints to Google Drive

**Note:** Checkpoints, especially those trained with a LoRA rank as high as 768, take up a lot of storage (10 GB). There's [a recent offer](https://blog.google/products/google-one/google-one-gemini-ai-gmail-docs-sheets/) from Google for the Google One AI Premium subscription that includes two months free of 2 TB of Google Drive storage, among other things such as Gemini Ultra access.
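
To see why checkpoint size grows with rank, the number of LoRA parameters can be estimated directly: each adapted linear layer of shape `(d_in, d_out)` adds `r * (d_in + d_out)` weights. The layer shapes below are assumptions taken from the public Llama 3.1 8B configuration (hidden size 4096, MLP size 14336, 8 key/value heads of dimension 128, 32 layers), and the helper name `lora_param_count` is hypothetical. This is a rough estimate only: a trainer checkpoint normally also stores optimizer state, which adds to the total.

```python
# Assumed (d_in, d_out) for each target module in one Llama 3.1 8B decoder layer
LLAMA_31_8B_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}


def lora_param_count(r, shapes=LLAMA_31_8B_SHAPES, num_layers=32):
    """Count LoRA weights: r * (d_in + d_out) per adapted module, per layer."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers


params = lora_param_count(128)
adapter_gb_fp32 = params * 4 / 1e9  # 4 bytes per float32 weight
print(f"{params:,} LoRA parameters, about {adapter_gb_fp32:.2f} GB in float32")
```

Because the count is linear in `r`, doubling the rank doubles the adapter size.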

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import os
import shutil
from pathlib import Path
from transformers import TrainerCallback

class UploadCheckpointCallback(TrainerCallback):
    def __init__(self, output_dir, google_drive_dir):
        super().__init__()
        self.output_dir = output_dir
        self.google_drive_dir = google_drive_dir

    def on_save(self, args, state, control, **kwargs):
        list_of_dirs = [d for d in os.listdir(self.output_dir) if os.path.isdir(os.path.join(self.output_dir, d)) and 'checkpoint' in d]
        list_of_dirs.sort(key=lambda x: os.path.getmtime(os.path.join(self.output_dir, x)), reverse=True)

        if list_of_dirs:
            latest_checkpoint_dir = os.path.join(self.output_dir, list_of_dirs[0])
            destination_path = os.path.join(self.google_drive_dir, os.path.basename(latest_checkpoint_dir))

            # Ensure the destination directory does not exist before copying
            if os.path.exists(destination_path):
                shutil.rmtree(destination_path)
            shutil.copytree(latest_checkpoint_dir, destination_path)
            print(f'Uploaded {latest_checkpoint_dir} to Google Drive: {destination_path}')

            # Delete the local checkpoint directory after upload to stop disk from filling up
            shutil.rmtree(latest_checkpoint_dir)
            print(f'Deleted local checkpoint directory: {latest_checkpoint_dir}')

google_drive_dir = os.path.join('/content/drive/MyDrive/STPC', project, run_name)

upload_checkpoint_callback = UploadCheckpointCallback(output_dir, google_drive_dir)
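
The copy-then-delete logic in `on_save` can be tried outside of training. The sketch below reproduces it as a plain function using only the standard library, so it can be exercised against temporary directories before a long run. The function name `move_latest_checkpoint` is hypothetical.

```python
import os
import shutil
import tempfile


def move_latest_checkpoint(output_dir, backup_dir):
    """Copy the most recently modified checkpoint dir to backup_dir, then delete it locally."""
    candidates = [
        d for d in os.listdir(output_dir)
        if os.path.isdir(os.path.join(output_dir, d)) and d.startswith("checkpoint")
    ]
    if not candidates:
        return None
    latest = max(candidates, key=lambda d: os.path.getmtime(os.path.join(output_dir, d)))
    source = os.path.join(output_dir, latest)
    destination = os.path.join(backup_dir, latest)
    if os.path.exists(destination):
        shutil.rmtree(destination)        # replace any stale copy
    shutil.copytree(source, destination)  # also creates missing parent directories
    shutil.rmtree(source)                 # free local disk space
    return destination


# Dry run against temporary directories instead of Google Drive
with tempfile.TemporaryDirectory() as out, tempfile.TemporaryDirectory() as backup:
    os.makedirs(os.path.join(out, "checkpoint-10"))
    print(move_latest_checkpoint(out, os.path.join(backup, "run_3_SS")))
```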

#### Copy a checkpoint from Google Drive if you are resuming from a checkpoint

In [None]:
!cp -r "/content/drive/My Drive/your_drive_folder/Mistral-Instruct-16bit-8K-Sweep1/run_1_SS/checkpoint-1120/" "/content/latest_checkpoint/"

## 5. Training

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
from datetime import datetime
from unsloth import is_bfloat16_supported

wandbname = project + "-" + run_name

trainer = SFTTrainer(
    model = model,
    callbacks=[upload_checkpoint_callback],
    train_dataset = dataset,
    #train_dataset = train_dataset,
    #eval_dataset = eval_dataset,
    tokenizer = tokenizer,
    max_seq_length = max_seq_length,
    packing = True,
    dataset_text_field="text",
    args = TrainingArguments(
        #per_device_eval_batch_size = 1,
        per_device_train_batch_size = 1,
        gradient_accumulation_steps = 2,
        warmup_ratio = 0,
        max_grad_norm = 1.0,
        num_train_epochs = 5,
        learning_rate = 2e-5,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 10,
        #evaluation_strategy="epoch",
        optim = "adamw_8bit",
        weight_decay = 0.1,
        lr_scheduler_type = "cosine",
        save_strategy="epoch",
        seed = 3407,
        output_dir = output_dir,
        # report_to="wandb",
        run_name=f"sagista-{datetime.now().strftime('%Y-%m-%d-%H-%M')}"
    ),
)



Generating train split: 0 examples [00:00, ? examples/s]
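
For intuition about the schedule configured above (`lr_scheduler_type = "cosine"`, `warmup_ratio = 0`, `learning_rate = 2e-5`), the learning rate follows half a cosine wave from its peak down to zero. The sketch below reimplements that curve by hand as an illustration; it is not the Trainer's code path. The step count of 1000 is hypothetical, since the real number depends on how many packed sequences the dataset yields. It also spells out the effective batch size on a single GPU.

```python
import math


def cosine_lr(step, total_steps, base_lr=2e-5, warmup_steps=0):
    """Learning rate at `step` for linear warmup followed by half-cosine decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


def effective_batch_size(per_device, grad_accum, num_devices=1):
    """Sequences contributing to each optimizer update."""
    return per_device * grad_accum * num_devices


total_steps = 1000  # hypothetical; depends on the packed dataset size
for step in (0, 250, 500, 750, 1000):
    print(step, f"{cosine_lr(step, total_steps):.2e}")
print("effective batch size:", effective_batch_size(1, 2))
```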

In [None]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Tue Aug 20 11:24:06 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA L4                      Off | 00000000:00:03.0 Off |                    0 |
| N/A   76C    P0              35W /  72W |  13869MiB / 23034MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [None]:
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

if ram_gb < 20:
  print('Not using a high-RAM runtime')
else:
  print('You are using a high-RAM runtime!')

Your runtime has 89.6 gigabytes of available RAM

You are using a high-RAM runtime!


In [None]:
import gc
gc.collect()
torch.cuda.empty_cache()
!nvidia-smi

Wed Aug 21 01:07:12 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P0              53W / 400W |  32551MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [None]:
print(torch.__version__)

2.3.1+cu121


In [None]:
trainer.train()

# If resuming from a checkpoint (same path used by the `cp` command above):
# trainer.train(resume_from_checkpoint="/content/latest_checkpoint/")

## 6. Try the Trained Model!


### With chat template

The `apply_chat_template` function injects the special `[INST]` tokens into your prompt.

In [None]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama", # Supports zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, unsloth
    mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"}, # ShareGPT style
    map_eos_token = True, # Maps <|im_end|> to </s> instead
)

FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"from": "human", "value": """Explain the theory of relativity to me in detail"""},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

outputs = model.generate(input_ids = inputs, max_new_tokens = 8192, use_cache = True)
tokenizer.batch_decode(outputs)
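
The `mapping` argument above describes how ShareGPT-style keys (`from`, `value`) and role names (`human`, `gpt`) correspond to the `role`/`content` convention. The sketch below only illustrates that correspondence and is not how Unsloth implements it. The function name `sharegpt_to_role_content` is hypothetical.

```python
SHAREGPT_MAPPING = {"role": "from", "content": "value", "user": "human", "assistant": "gpt"}


def sharegpt_to_role_content(messages, mapping=SHAREGPT_MAPPING):
    """Convert ShareGPT-style messages to role/content dictionaries."""
    role_key = mapping["role"]
    content_key = mapping["content"]
    # Invert the role-name entries: "human" -> "user", "gpt" -> "assistant"
    role_names = {v: k for k, v in mapping.items() if k not in ("role", "content")}
    return [
        {
            "role": role_names.get(m[role_key], m[role_key]),  # unknown roles pass through
            "content": m[content_key],
        }
        for m in messages
    ]


converted = sharegpt_to_role_content(
    [{"from": "human", "value": "Explain the theory of relativity to me in detail"}]
)
```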

### Without chat template

In [None]:
import textwrap
from threading import Thread
from transformers import TextIteratorStreamer

text_streamer = TextIteratorStreamer(tokenizer)
max_print_width = 100

inputs = tokenizer(
[
    """Explique os campos de uma tabela PDS"""
]*1, return_tensors = "pt").to("cuda")

generation_kwargs = dict(
    inputs,
    streamer = text_streamer,
    max_new_tokens = 8192,
    use_cache = True,
)
thread = Thread(target = model.generate, kwargs = generation_kwargs)
thread.start()

length = 0
for j, new_text in enumerate(text_streamer):
    if j == 0:
        wrapped_text = textwrap.wrap(new_text, width = max_print_width)
        length = len(wrapped_text[-1])
        wrapped_text = "\n".join(wrapped_text)
        print(wrapped_text, end = "")
    else:
        length += len(new_text)
        if length >= max_print_width:
            length = 0
            print()
        print(new_text, end = "")
    pass
pass
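
The wrapping logic in the loop above can be checked without a model by feeding it a plain list of text chunks. This is a simplified sketch under stated assumptions: it breaks the line before a chunk that would exceed the width instead of after the limit is reached, and it does not treat newlines inside chunks specially. The function name `wrap_stream` is hypothetical.

```python
def wrap_stream(chunks, width=100):
    """Join streamed text chunks, starting a new line before a chunk that would overflow `width`."""
    pieces = []
    line_length = 0
    for chunk in chunks:
        if line_length and line_length + len(chunk) > width:
            pieces.append("\n")
            line_length = 0
        pieces.append(chunk)
        line_length += len(chunk)
    return "".join(pieces)


print(wrap_stream(["Explique ", "os ", "campos ", "de ", "uma ", "tabela ", "PDS"], width=20))
```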

## 7. Save the model and upload to Google Drive

In [None]:
model.save_pretrained_merged("sagista-02", tokenizer, save_method = "merged_16bit",)

Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 52.14 out of 83.48 RAM for saving.


100%|██████████| 32/32 [00:00<00:00, 35.83it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... This might take 5 minutes for Llama-7b...
Done.


In [None]:
model, tokenizer = FastLanguageModel.from_pretrained("/content/drive/My Drive/sagista-02/")

==((====))==  Unsloth 2024.8: Fast Llama patching. Transformers = 4.44.1.
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.564 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.4.0+cu121. CUDA = 8.0. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.dev885. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

In [None]:
from google.colab import userdata
hf_token = userdata.get('HF')
model.push_to_hub_merged("pascarujo/SageLlama-3.1-8B", tokenizer, save_method="merged_16bit", token=hf_token)

Unsloth: You are pushing to hub, but you passed your HF username = pascarujo.
We shall truncate pascarujo/SageLlama-3.1-8B to SageLlama-3.1-8B
Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 44.3 out of 83.48 RAM for saving.


 56%|█████▋    | 18/32 [00:00<00:00, 177.49it/s]We will save to Disk and not RAM now.
100%|██████████| 32/32 [00:12<00:00,  2.59it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... This might take 5 minutes for Llama-7b...


  0%|          | 0/4 [00:00<?, ?it/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/1.17G [00:00<?, ?B/s]

Done.
Saved merged model to https://huggingface.co/pascarujo/SageLlama-3.1-8B


In [None]:
quant_methods = ["q2_k", "q3_k_m", "q4_k_m", "q5_k_m", "q6_k", "q8_0"]
for quant in quant_methods:
    model.push_to_hub_gguf("pascarujo/SageLlama-3.1-8B-GGUF", tokenizer, quant, token=hf_token)


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 49.63 out of 83.48 RAM for saving.


100%|██████████| 32/32 [00:05<00:00,  5.95it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... This might take 5 minutes for Llama-7b...
Done.


Unsloth: Converting llama model. Can use fast conversion = False.


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp will take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits will take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q2_k'] will take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: [0] Installing llama.cpp. This will take 3 minutes...
Unsloth: [1] Converting model at pascarujo/SageLlama-3.1-8B-GGUF into bf16 GGUF format.
The output location will be ./pascarujo/SageLlama-3.1-8B-GGUF/unsloth.BF16.gguf
This will take 3 minutes...
INFO:hf-to-gguf:Loading model: SageLlama-3.1-8B-GGUF
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model weight map from 'model.safetensors.index.json'
INFO:hf-to-gguf:gguf: loading model part 'model-00001-of-00004.safetensors'
INFO:hf-to-gguf:token_embd.weight,           torch.bfloat16 --> BF16, shape = {4096,

  0%|          | 0/1 [00:00<?, ?it/s]

unsloth.BF16.gguf:   0%|          | 0.00/16.1G [00:00<?, ?B/s]

Saved GGUF to https://huggingface.co/pascarujo/SageLlama-3.1-8B-GGUF
Unsloth: Uploading GGUF to Huggingface Hub...


  0%|          | 0/1 [00:00<?, ?it/s]

unsloth.Q2_K.gguf:   0%|          | 0.00/3.18G [00:00<?, ?B/s]

Saved GGUF to https://huggingface.co/pascarujo/SageLlama-3.1-8B-GGUF
Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 50.68 out of 83.48 RAM for saving.


100%|██████████| 32/32 [00:02<00:00, 14.36it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... This might take 5 minutes for Llama-7b...
Done.
==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp will take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits will take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q3_k_m'] will take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: [0] Installing llama.cpp. This will take 3 minutes...
Unsloth: [1] Converting model at pascarujo/SageLlama-3.1-8B-GGUF into bf16 GGUF format.
The output location will be ./pascarujo/SageLlama-3.1-8B-GGUF/unsloth.BF16.gguf
This will take 3 minutes...
INFO:hf-to-gguf:Loading model: SageLlama-3.1-8B-GGUF
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model weight map from 'model.safetensors.index.json'
INFO:hf-to-gguf:gguf: loading model part 'model-00

  0%|          | 0/1 [00:00<?, ?it/s]

unsloth.Q3_K_M.gguf:   0%|          | 0.00/4.02G [00:00<?, ?B/s]

Saved GGUF to https://huggingface.co/pascarujo/SageLlama-3.1-8B-GGUF
Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 50.57 out of 83.48 RAM for saving.


100%|██████████| 32/32 [00:01<00:00, 17.01it/s] 


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... This might take 5 minutes for Llama-7b...
Done.
==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp will take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits will take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q4_k_m'] will take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: [0] Installing llama.cpp. This will take 3 minutes...
Unsloth: [1] Converting model at pascarujo/SageLlama-3.1-8B-GGUF into bf16 GGUF format.
The output location will be ./pascarujo/SageLlama-3.1-8B-GGUF/unsloth.BF16.gguf
This will take 3 minutes...
INFO:hf-to-gguf:Loading model: SageLlama-3.1-8B-GGUF
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model weight map from 'model.safetensors.index.json'
INFO:hf-to-gguf:gguf: loading model part 'model-00

  0%|          | 0/1 [00:00<?, ?it/s]

unsloth.Q4_K_M.gguf:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

Saved GGUF to https://huggingface.co/pascarujo/SageLlama-3.1-8B-GGUF
Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 50.59 out of 83.48 RAM for saving.


100%|██████████| 32/32 [00:01<00:00, 16.23it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... This might take 5 minutes for Llama-7b...
Done.
==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp will take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits will take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q5_k_m'] will take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: [0] Installing llama.cpp. This will take 3 minutes...
Unsloth: [1] Converting model at pascarujo/SageLlama-3.1-8B-GGUF into bf16 GGUF format.
The output location will be ./pascarujo/SageLlama-3.1-8B-GGUF/unsloth.BF16.gguf
This will take 3 minutes...
INFO:hf-to-gguf:Loading model: SageLlama-3.1-8B-GGUF
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model weight map from 'model.safetensors.index.json'
INFO:hf-to-gguf:gguf: loading model part 'model-00

  0%|          | 0/1 [00:00<?, ?it/s]

unsloth.Q5_K_M.gguf:   0%|          | 0.00/5.73G [00:00<?, ?B/s]

Saved GGUF to https://huggingface.co/pascarujo/SageLlama-3.1-8B-GGUF
Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 50.57 out of 83.48 RAM for saving.


100%|██████████| 32/32 [00:02<00:00, 14.62it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... This might take 5 minutes for Llama-7b...
Done.
==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp will take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits will take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q6_k'] will take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: [0] Installing llama.cpp. This will take 3 minutes...
Unsloth: [1] Converting model at pascarujo/SageLlama-3.1-8B-GGUF into bf16 GGUF format.
The output location will be ./pascarujo/SageLlama-3.1-8B-GGUF/unsloth.BF16.gguf
This will take 3 minutes...
INFO:hf-to-gguf:Loading model: SageLlama-3.1-8B-GGUF
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model weight map from 'model.safetensors.index.json'
INFO:hf-to-gguf:gguf: loading model part 'model-0000

  0%|          | 0/1 [00:00<?, ?it/s]

unsloth.Q6_K.gguf:   0%|          | 0.00/6.60G [00:00<?, ?B/s]

Saved GGUF to https://huggingface.co/pascarujo/SageLlama-3.1-8B-GGUF
Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 50.58 out of 83.48 RAM for saving.


100%|██████████| 32/32 [00:01<00:00, 17.63it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... This might take 5 minutes for Llama-7b...
Done.
==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp will take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits will take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q8_0'] will take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: [0] Installing llama.cpp. This will take 3 minutes...
Unsloth: [1] Converting model at pascarujo/SageLlama-3.1-8B-GGUF into q8_0 GGUF format.
The output location will be ./pascarujo/SageLlama-3.1-8B-GGUF/unsloth.Q8_0.gguf
This will take 3 minutes...
INFO:hf-to-gguf:Loading model: SageLlama-3.1-8B-GGUF
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model weight map from 'model.safetensors.index.json'
INFO:hf-to-gguf:gguf: loading model part 'model-0000

  0%|          | 0/1 [00:00<?, ?it/s]

unsloth.Q8_0.gguf:   0%|          | 0.00/8.54G [00:00<?, ?B/s]

Saved GGUF to https://huggingface.co/pascarujo/SageLlama-3.1-8B-GGUF
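
The file sizes shown in the upload progress bars above give a rough sense of what each quantization method costs per weight. The sketch below scales each size against the BF16 file, which stores 16 bits per weight. The result is only approximate: the sizes are rounded, GGUF files include metadata, and the k-quant methods mix precisions across tensors. The helper name `approx_bits_per_weight` is hypothetical.

```python
# Sizes in GB as reported by the upload progress bars in the log above
GGUF_SIZES_GB = {
    "BF16": 16.1,
    "Q2_K": 3.18,
    "Q3_K_M": 4.02,
    "Q4_K_M": 4.92,
    "Q5_K_M": 5.73,
    "Q6_K": 6.60,
    "Q8_0": 8.54,
}


def approx_bits_per_weight(sizes, reference="BF16", reference_bits=16):
    """Estimate average bits per weight by scaling against a reference file size."""
    ref_size = sizes[reference]
    return {name: reference_bits * size / ref_size for name, size in sizes.items()}


for name, bits in approx_bits_per_weight(GGUF_SIZES_GB).items():
    print(f"{name:8s} ~{bits:.2f} bits/weight")
```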


In [None]:
!cp -r "/content/sagista-02/" "/content/drive/My Drive/sagista-02/"