<a href="https://colab.research.google.com/github/pevgeniy007/Auto-GPT/blob/master/GGUF_llama_cpp_BETA_Alpaca_%2B_Mistral_7b_full_example(HFimport).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

To run this, press "Runtime" and press "Run all" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="110"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="150"></a>
  <a href="https://huggingface.co/docs/trl/main/en/index"><img src="https://github.com/huggingface/blog/blob/main/assets/133_trl_peft/thumbnail.png?raw=true" width="100"></a> Join our Discord if you need help!
</div>

To install Unsloth on your own computer, follow the installation instructions on our GitHub page [here](https://github.com/unslothai/unsloth#installation-instructions---conda).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), and [how to save it](#Save) (e.g. for llama.cpp).

In [1]:
%%capture
import torch
major_version, minor_version = torch.cuda.get_device_capability()
if major_version >= 8:
    # Use this for new GPUs like Ampere, Hopper GPUs (RTX 30xx, RTX 40xx, A100, H100, L40)
    !pip install "unsloth[colab_ampere] @ git+https://github.com/unslothai/unsloth.git"
else:
    # Use this for older GPUs (V100, Tesla T4, RTX 20xx)
    !pip install "unsloth[colab] @ git+https://github.com/unslothai/unsloth.git"
pass

!pip install "git+https://github.com/huggingface/transformers.git" # Native 4bit loading works!

* We support Llama, Mistral, CodeLlama, TinyLlama, Vicuna, Open Hermes etc
* And Yi, Qwen ([llamafied](https://huggingface.co/models?sort=trending&search=qwen+llama)), Deepseek, all Llama, Mistral derived archs.
* We support 16bit LoRA or 4bit QLoRA. Both 2x faster.
* `max_seq_length` can be set to anything, since we do automatic RoPE Scaling via [kaiokendev's](https://kaiokendev.github.io/til) method.
* [**NEW**] With [PR 26037](https://github.com/huggingface/transformers/pull/26037), we support downloading 4bit models **4x faster**! [Our repo](https://huggingface.co/unsloth) has Llama, Mistral 4bit models.
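
The RoPE scaling mentioned in the list above can be pictured as linear position interpolation: positions in a longer sequence are compressed back into the range the model saw during training. The sketch below only illustrates that idea. The helper names and the `trained_context` value are assumptions made for the example, not Unsloth's internal code.

```python
def rope_scale_factor(max_seq_length: int, trained_context: int) -> float:
    """Return the stretch factor; 1.0 means the sequence already fits."""
    return max(1.0, max_seq_length / trained_context)


def interpolate_positions(seq_len: int, factor: float) -> list[float]:
    """Compress integer positions 0..seq_len-1 by `factor`."""
    return [pos / factor for pos in range(seq_len)]


# Hypothetical numbers: a model trained on 4096 tokens, asked to handle 8192.
factor = rope_scale_factor(max_seq_length=8192, trained_context=4096)
positions = interpolate_positions(8, factor)
print(factor, positions)
```

With a factor of 2, every second token lands on a position the model has already seen, and the tokens in between fall halfway between known positions.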

In [2]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/mistral-7b-bnb-4bit", # "unsloth/mistral-7b" for 16bit loading
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

We shall run `ldconfig /usr/lib64-nvidia` to try to fix it.
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/1.05k [00:00<?, ?B/s]

==((====))==  Unsloth: Fast Mistral patching release 2024.1
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB
O^O/ \_/ \    CUDA capability = 7.5. Xformers = 0.0.22.post7. FA = False.
\        /    Pytorch version: 2.1.0+cu121. CUDA Toolkit = 12.1
 "-____-"     bfloat16 = FALSE. Platform = Linux

You passed `quantization_config` to `from_pretrained` but the model you're loading already has a `quantization_config` attribute. The `quantization_config` attribute will be overwritten with the one you passed to `from_pretrained`.


model.safetensors:   0%|          | 0.00/4.13G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/971 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/438 [00:00<?, ?B/s]
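
The comment on `dtype = None` in the loading cell says that float16 is used on Tesla T4 and V100 while bfloat16 is used on Ampere and newer. A minimal sketch of such a rule is shown below, assuming the same compute-capability threshold (`major_version >= 8`) that the install cell uses. The function name is hypothetical, and it returns plain strings so that it runs without a GPU.

```python
def pick_dtype_name(major_version: int) -> str:
    """Map a CUDA compute-capability major version to a dtype name."""
    # Assumption: Ampere and newer (major >= 8) support bfloat16.
    return "bfloat16" if major_version >= 8 else "float16"


# Tesla T4 reports capability 7.5 in the log above; A100 reports 8.0.
for gpu_name, major in [("Tesla T4", 7), ("A100", 8)]:
    print(gpu_name, pick_dtype_name(major))
```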

We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [3]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Currently only supports dropout = 0
    bias = "none",    # Currently only supports bias = "none"
    use_gradient_checkpointing = True,
    random_state = 3407,
    max_seq_length = max_seq_length,
)

Unsloth 2024.1 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
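
To get a feel for how many weights the adapters add, the count can be worked out by hand: a LoRA adapter of rank `r` on a linear layer with `fan_in` inputs and `fan_out` outputs adds `r * (fan_in + fan_out)` parameters. The sketch below applies that formula to the seven target modules. The layer sizes are assumptions taken from the published Mistral 7B configuration, and the helper name is hypothetical.

```python
# Assumed Mistral 7B shapes: hidden size, key/value projection size,
# MLP intermediate size, and number of decoder layers.
HIDDEN, KV_DIM, MLP_DIM, NUM_LAYERS = 4096, 1024, 14336, 32

MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}


def lora_param_count(r, shapes=MODULE_SHAPES, num_layers=NUM_LAYERS):
    """Total trainable LoRA parameters: r * (fan_in + fan_out) per module."""
    per_layer = sum(r * (fan_in + fan_out) for fan_in, fan_out in shapes.values())
    return per_layer * num_layers


for rank in (8, 16, 64):
    print(f"r = {rank}: {lora_param_count(rank):,} trainable parameters")
```

The count grows linearly with `r`, which is why the rank is the main knob for trading adapter capacity against memory.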


<a name="Data"></a>
### Data Prep
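The original notebook uses the Alpaca dataset from [yahma](https://huggingface.co/datasets/yahma/alpaca-cleaned), a filtered version of the original 52K-example [Alpaca dataset](https://crfm.stanford.edu/2023/03/13/alpaca.html). The cell below replaces it with custom data: a JSON file at `/content/sample_data/Train.json` whose records have the same `instruction`, `input`, and `output` fields. You can replace this code section with your own data prep.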

**[NOTE]** To train only on completions (ignoring the user's input) read TRL's docs [here](https://huggingface.co/docs/trl/sft_trainer#train-on-completions-only).

In [4]:
import json
from datasets import Dataset

# Load the data from a JSON file
with open('/content/sample_data/Train.json', 'r', encoding='utf-8') as file:
    data_list = json.load(file)

# Convert the list of dictionaries into a dictionary of columns
data = {
    "input": [item["input"] for item in data_list],
    "instruction": [item["instruction"] for item in data_list],
    "output": [item["output"] for item in data_list]
}

# Create a dataset from the data
dataset = Dataset.from_dict(data)

# Define the template used to format the data (kept in Russian to match the training data)
prompt_template = """Ниже приведена инструкция, описывающая задачу, сопровождаемая вводом, который предоставляет дополнительный контекст. Напишите ответ, который соответствует запросу.

### Инструкция:
{}

### Ввод:
{}

### Ответ:
{}"""

# Format each example according to the template
def formatting_prompts_func(examples):
    prompts = examples["instruction"]
    inputs = examples["input"]
    outputs = examples["output"]
    formatted_texts = []
    # `input_text` avoids shadowing the built-in `input`
    for prompt, input_text, output in zip(prompts, inputs, outputs):
        formatted_text = prompt_template.format(prompt, input_text, output)
        formatted_texts.append(formatted_text)
    return {"text": formatted_texts}

# Apply the formatting function to the dataset
formatted_dataset = dataset.map(formatting_prompts_func, batched=True)

# Print the first two examples from the dataset
for example in formatted_dataset.select(range(2)):
    print(example['text'])




Map:   0%|          | 0/165 [00:00<?, ? examples/s]

Ниже приведена инструкция, описывающая задачу, сопровождаемая вводом, который предоставляет дополнительный контекст. Напишите ответ, который соответствует запросу.

### Инструкция:
Ответь на вопрос пользователя

### Ввод:
Что сейчас в Буркина-Фасо?

### Ответ:
В Буркина-Фасо произошел военный переворот, в результате которого к власти пришел Ибрагим Траоре, бывший армейский капитан.
Ниже приведена инструкция, описывающая задачу, сопровождаемая вводом, который предоставляет дополнительный контекст. Напишите ответ, который соответствует запросу.

### Инструкция:
Ответь на вопрос пользователя

### Ввод:
Что произходит в Буркина-Фасо?

### Ответ:
В Буркина-Фасо произошел военный переворот, который привел к смене власти, и новым президентом стал Ибрагим Траоре, который ранее служил в армии и был капитаном.
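
The formatting function above does not append an end-of-sequence token to each example. Many fine-tuning recipes add one so that the model learns where an answer ends. The sketch below shows that variation under two assumptions: the EOS string is `"</s>"` (in the notebook it would come from `tokenizer.eos_token`), and a shortened English template stands in for the real one. All names in it are illustrative.

```python
# Illustrative template; the notebook's real template is the Russian one above.
TEMPLATE = (
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)


def format_batch(batch, eos_token):
    """Format a batch of columns and terminate every example with EOS."""
    texts = []
    for instruction, input_text, output in zip(
        batch["instruction"], batch["input"], batch["output"]
    ):
        text = TEMPLATE.format(instruction=instruction, input=input_text, output=output)
        texts.append(text + eos_token)
    return {"text": texts}


batch = {
    "instruction": ["Answer the user's question"],
    "input": ["What is 2 + 2?"],
    "output": ["4"],
}
result = format_batch(batch, eos_token="</s>")  # assumed EOS string
print(result["text"][0])
```

A function with this signature can be passed to `dataset.map(..., batched=True)` by fixing `eos_token`, for example with `functools.partial`.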


<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We do 60 steps to speed things up, but for a full run you can set `num_train_epochs=1` and remove the step limit with `max_steps=None`. We also support TRL's `DPOTrainer`!

In [5]:
from trl import SFTTrainer
from transformers import TrainingArguments

# Make sure you use formatted_dataset here
trainer = SFTTrainer(
    model=model,
    train_dataset=formatted_dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=4,
    args=TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

tokenizer_config.json:   0%|          | 0.00/971 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/438 [00:00<?, ?B/s]

Map (num_proc=4):   0%|          | 0/165 [00:00<?, ? examples/s]
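
The batch settings in the trainer cell decide how much data the 60 steps actually cover. The arithmetic below is a rough estimate. It assumes a single GPU and no sequence packing, and takes the example count of 165 from the `Map` progress bar above.

```python
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
max_steps = 60
num_examples = 165  # from the `Map: 0/165` progress bar
num_devices = 1     # assumption: one Tesla T4

# Examples consumed per optimizer step
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_devices
# Examples consumed over the whole run
examples_seen = effective_batch * max_steps
# Approximate number of passes over the dataset
approx_epochs = examples_seen / num_examples

print(f"Effective batch size: {effective_batch}")
print(f"Examples seen in {max_steps} steps: {examples_seen}")
print(f"Approximate epochs: {approx_epochs:.2f}")
```

On a dataset this small, 60 steps already amount to several passes over the data, which is worth keeping in mind when reading the loss curve.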

In [None]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.748 GB.
4.625 GB of memory reserved.


In [6]:
trainer_stats = trainer.train()

Unsloth: `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`


Step,Training Loss
1,1.3752
2,1.5211
3,1.2998
4,1.0805
5,0.8291
6,0.7136
7,0.6232
8,0.5341
9,0.5309
10,0.5366


In [7]:
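#@title Show final memory and time stats
# Requires `start_gpu_memory` and `max_memory` from the "Show current memory stats" cell; run that cell first.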
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

NameError: name 'start_gpu_memory' is not defined
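
The `NameError` above appears when the "Show current memory stats" cell has not been run in the current session, so `start_gpu_memory` was never defined. The arithmetic itself is simple and can be written as a pure function with explicit inputs, as in the sketch below. The function name is hypothetical and the 6.5 GB peak is a made-up value; only the 4.625 GB baseline and 14.748 GB total come from the output earlier in this notebook.

```python
def memory_report(peak_gb, start_gb, max_gb):
    """Summarize peak reserved memory and the share attributable to training."""
    training_gb = round(peak_gb - start_gb, 3)
    return {
        "peak_gb": peak_gb,
        "training_gb": training_gb,
        "peak_pct": round(peak_gb / max_gb * 100, 3),
        "training_pct": round(training_gb / max_gb * 100, 3),
    }


# peak_gb is hypothetical; the other two values are from the T4 output above.
report = memory_report(peak_gb=6.5, start_gb=4.625, max_gb=14.748)
for key, value in report.items():
    print(f"{key} = {value}")
```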

In [8]:
#@title Code for conversion to GGUF
def colab_quantize_to_gguf(save_directory="/content/quantized_model", quantization_method="q4_k_m"):
    from transformers.models.llama.modeling_llama import logger
    import os

    logger.warning_once(
        "Unsloth: `colab_quantize_to_gguf` is still in development mode.\n"\
        "If anything errors or breaks, please file a ticket on Github.\n"\
        "Also, if you used this successfully, please tell us on Discord!"
    )

    # From https://mlabonne.github.io/blog/posts/Quantize_Llama_2_models_using_ggml.html
    ALLOWED_QUANTS = {
        "q4_k_m" : "Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K",
    }

    if quantization_method not in ALLOWED_QUANTS.keys():
        error = f"Unsloth: Quant method = [{quantization_method}] not supported. Choose from below:\n"
        for key, value in ALLOWED_QUANTS.items():
            error += f"[{key}] => {value}\n"
        raise RuntimeError(error)
    pass

    print_info = \
        f"==((====))==  Unsloth: Conversion from QLoRA to GGUF information\n"\
        f"   \\\   /|    [0] Installing llama.cpp will take 3 minutes.\n"\
        f"O^O/ \_/ \\    [1] Converting HF to GGUF 16bits will take 3 minutes.\n"\
        f"\        /    [2] Converting GGUF 16bits to q4_k_m will take 20 minutes.\n"\
        f' "-____-"     In total, you will have to wait around 26 minutes.\n'
    print(print_info)

    if not os.path.exists("llama.cpp"):
        print("Unsloth: [0] Installing llama.cpp. This will take 3 minutes...")
        !git clone https://github.com/ggerganov/llama.cpp
        !cd llama.cpp && make clean && LLAMA_CUBLAS=1 make -j
        !pip install gguf protobuf
    pass

    print("Unsloth: [1] Converting HF into GGUF 16bit. This will take 3 minutes...")
    !python llama.cpp/convert.py {save_directory} \
        --outfile {save_directory}-unsloth.gguf \
        --outtype f16

    print("Unsloth: [2] Converting GGUF 16bit into q4_k_m. This will take 20 minutes...")
    # No "./" prefix, so absolute paths such as the default save_directory also work
    final_location = f"{save_directory}-{quantization_method}-unsloth.gguf"
    !./llama.cpp/quantize {save_directory}-unsloth.gguf \
        {final_location} {quantization_method}

    print(f"Unsloth: Output location: {final_location}")
pass
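
The path handling in the conversion helper can be factored into small pure functions, which makes it easy to check the output names for both relative and absolute directories. The sketch below does that. The function names are hypothetical, and the script and binary names (`llama.cpp/convert.py`, `llama.cpp/quantize`) are copied from the cell above, so they reflect the llama.cpp checkout used there.

```python
import os


def gguf_output_paths(save_directory, quantization_method="q4_k_m"):
    """Return (f16_path, quantized_path) next to the saved model directory."""
    base = os.path.normpath(save_directory)  # drops any trailing slash
    f16_path = f"{base}-unsloth.gguf"
    quantized_path = f"{base}-{quantization_method}-unsloth.gguf"
    return f16_path, quantized_path


def build_commands(save_directory, quantization_method="q4_k_m"):
    """Build argument lists for the two conversion steps (nothing is executed)."""
    f16_path, quantized_path = gguf_output_paths(save_directory, quantization_method)
    convert_cmd = ["python", "llama.cpp/convert.py", save_directory,
                   "--outfile", f16_path, "--outtype", "f16"]
    quantize_cmd = ["./llama.cpp/quantize", f16_path, quantized_path, quantization_method]
    return convert_cmd, quantize_cmd


convert_cmd, quantize_cmd = build_commands("output_model")
print(" ".join(convert_cmd))
print(" ".join(quantize_cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)`, which raises if a step fails, unlike a `!` shell line.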


In [9]:
from unsloth import unsloth_save_model

# unsloth_save_model has the same args as model.save_pretrained
unsloth_save_model(model, tokenizer, "output_model", push_to_hub = False, token = None)
colab_quantize_to_gguf("output_model", quantization_method = "q4_k_m")

Unsloth: `unsloth_save_model` is still in development mode.
If anything errors or breaks, please file a ticket on Github.
Also, if you used this successfully, please tell us on Discord!


Unsloth: Merging 4bit and LoRA weights to 16bit...


100%|██████████| 32/32 [01:16<00:00,  2.38s/it]


Unsloth: Saving tokenizer...
Unsloth: Saving model. This will take 5 minutes for Llama-7b...


Unsloth: `colab_quantize_to_gguf` is still in development mode.
If anything errors or breaks, please file a ticket on Github.
Also, if you used this successfully, please tell us on Discord!


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp will take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits will take 3 minutes.
\        /    [2] Converting GGUF 16bits to q4_k_m will take 20 minutes.
 "-____-"     In total, you will have to wait around 26 minutes.

Unsloth: [0] Installing llama.cpp. This will take 3 minutes...
Cloning into 'llama.cpp'...
remote: Enumerating objects: 16142, done.
remote: Counting objects: 100% (62/62), done.
remote: Compressing objects: 100% (50/50), done.
remote: Total 16142 (delta 22), reused 42 (delta 12), pack-reused 16080
Receiving objects: 100% (16142/16142), 18.95 MiB | 14.30 MiB/s, done.
Resolving deltas: 100% (11185/11185), done.
I llama.cpp build info: 
I UNAME_S:   Linux
I UNAME_P:   x86_64
I UNAME_M:   x86_64
I CFLAGS:    -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow 

And we're done! If you have questions about Unsloth, find a bug, need help, or want to keep up with the latest LLM news and join projects, feel free to join our [Discord](https://discord.gg/u54VK8m8tk)!

We also have other notebooks on:
1. Zephyr DPO [free Colab](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing)
2. Mistral 7b [free Colab](https://colab.research.google.com/drive/1Dyauq4kTZoLewQ1cApceUQVNcnnNTzg_?usp=sharing)
3. TinyLlama full Alpaca 52K in under 80 hours [free Colab](https://colab.research.google.com/drive/1AZghoNBQaMDgWJpi4RbffGM1h6raLUj9?usp=sharing)
4. CodeLlama 34b [A100 on Colab](https://colab.research.google.com/drive/1y7A0AxE3y8gdj4AVkl2aZX47Xu3P1wJT?usp=sharing)
5. Llama 7b [free Kaggle](https://www.kaggle.com/danielhanchen/unsloth-alpaca-t4-ddp)


In [None]:
from google.colab import files
files.download('/content/output_model-q4_k_m-unsloth.gguf')

In [12]:
from google.colab import drive
drive.mount('/content/drive/')
# Replace 'YOUR_PATH' with the path to the directory on Google Drive and


Mounted at /content/drive/


In [13]:
!cp "/content/output_model-q4_k_m-unsloth.gguf" "/content/drive/My Drive/"


In [None]:
from google.colab import drive
drive.mount('/content/drive2/')
# Replace 'YOUR_PATH' with the path to the directory on Google Drive and


In [None]:
!pip install transformers huggingface_hub


In [19]:
!pip install --upgrade huggingface_hub



In [16]:
import os
# Replace the placeholder with your own token; never hard-code a real token in a shared notebook
token = "hf_..."
os.environ["HF_HOME"] = "/root/.huggingface"
os.environ["HF_TOKEN"] = token


In [25]:
from huggingface_hub import HfApi, HfFolder

# Save your access token (replace the placeholder; never commit a real token)
HfFolder.save_token("hf_...")

# Create the API client
api = HfApi()

# Check whether the repository already exists
repo_name = "RU-BF"
username = "pevgeniy"
repo_id = f"{username}/{repo_name}"

if not api.repo_exists(repo_id):
    api.create_repo(repo_id, private=True, exist_ok=True)

# Path to the file you want to upload
file_path = "/content/output_model-q4_k_m-unsloth.gguf"

# Upload the file
api.upload_file(
    path_or_fileobj=file_path,
    path_in_repo="output_model-q4_k_m-unsloth.gguf",
    repo_id=repo_id,
)

print(f"File uploaded to: https://huggingface.co/{repo_id}")



output_model-q4_k_m-unsloth.gguf:   0%|          | 0.00/4.37G [00:00<?, ?B/s]

File uploaded to: https://huggingface.co/pevgeniy/RU-BF


In [26]:
from huggingface_hub import HfApi, HfFolder

# Save your access token (replace the placeholder; never commit a real token)
HfFolder.save_token("hf_...")

# Create the API client
api = HfApi()

# Check whether the repository already exists
repo_name = "RU-BF-Full"
username = "pevgeniy"
repo_id = f"{username}/{repo_name}"

if not api.repo_exists(repo_id):
    api.create_repo(repo_id, private=True, exist_ok=True)

# Path to the file you want to upload (the 16-bit GGUF)
file_path = "/content/output_model-unsloth.gguf"

# Upload the file
api.upload_file(
    path_or_fileobj=file_path,
    path_in_repo="output_model-unsloth.gguf",
    repo_id=repo_id,
)

print(f"File uploaded to: https://huggingface.co/{repo_id}")

output_model-unsloth.gguf:   0%|          | 0.00/14.5G [00:00<?, ?B/s]

File uploaded to: https://huggingface.co/pevgeniy/RU-BF-Full


In [27]:
from huggingface_hub import HfApi, HfFolder

# Save your access token (replace the placeholder; never commit a real token)
HfFolder.save_token("hf_...")

# Create the API client
api = HfApi()

# Check whether the repository already exists
repo_name = "RU-BF-tensors"
username = "pevgeniy"
repo_id = f"{username}/{repo_name}"

if not api.repo_exists(repo_id):
    api.create_repo(repo_id, private=True, exist_ok=True)

# Path to the local folder you want to upload (the merged 16-bit safetensors)
local_folder_path = "/content/output_model"  # replace with the path to your folder

# Upload the folder contents
api.upload_folder(
    folder_path=local_folder_path,
    repo_id=repo_id,
)

print(f"Folder uploaded to: https://huggingface.co/{repo_id}")


model-00002-of-00003.safetensors:   0%|          | 0.00/6.98G [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

model-00001-of-00003.safetensors:   0%|          | 0.00/6.89G [00:00<?, ?B/s]

Upload 4 LFS files:   0%|          | 0/4 [00:00<?, ?it/s]

model-00003-of-00003.safetensors:   0%|          | 0.00/614M [00:00<?, ?B/s]

Folder uploaded to: https://huggingface.co/pevgeniy/RU-BF-tensors
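
The three upload cells above differ only in the repository name and the local path, so they can be folded into one helper that also reads the token from the environment instead of the notebook source. The sketch below is one possible shape. The function names are hypothetical, and it assumes the token is exposed as the `HF_TOKEN` environment variable. `huggingface_hub` is imported lazily so that the token check can be tried without the library installed.

```python
import os


def resolve_token(env=None):
    """Read the Hugging Face token from the environment; fail loudly if absent."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN")
    if not token:
        raise RuntimeError("Set the HF_TOKEN environment variable (or a Colab secret) first.")
    return token


def upload_artifact(local_path, repo_id, token, private=True):
    """Create the repo if needed, then upload a single file or a whole folder."""
    from huggingface_hub import HfApi  # requires `pip install huggingface_hub`

    api = HfApi(token=token)
    api.create_repo(repo_id, private=private, exist_ok=True)
    if os.path.isdir(local_path):
        api.upload_folder(folder_path=local_path, repo_id=repo_id)
    else:
        api.upload_file(
            path_or_fileobj=local_path,
            path_in_repo=os.path.basename(local_path),
            repo_id=repo_id,
        )
    return f"https://huggingface.co/{repo_id}"


# Example usage (performs network calls, so it is left commented out):
# url = upload_artifact("/content/output_model-q4_k_m-unsloth.gguf",
#                       "pevgeniy/RU-BF", resolve_token())
# print(f"Uploaded to: {url}")
```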
