# Unsloth Model Training/Tuning Colab
- runs DPO on a Zephyr model
- saves/loads with model folder added to zip archive for portability
- use model in chat (a minimal test-chat)

To run this, press "Runtime" and press "Run all" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="110"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="150"></a>
  <a href="https://huggingface.co/docs/trl/main/en/index"><img src="https://github.com/huggingface/blog/blob/main/assets/133_trl_peft/thumbnail.png?raw=true" width="100"></a> Join our Discord if you need help!
</div>

To install Unsloth on your own computer, follow the installation instructions on our GitHub page [here](https://github.com/unslothai/unsloth#installation-instructions---conda).

You will learn how to do [DPO data prep](#Data), and how to [train via `DPOTrainer`](#Train).
To learn more about DPO, read TRL's [blog post](https://huggingface.co/blog/dpo-trl). We follow [Huggingface's Alignment Handbook](https://github.com/huggingface/alignment-handbook) to replicate [Zephyr](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta).

In [None]:
%%capture
import torch
major_version, minor_version = torch.cuda.get_device_capability()
if major_version >= 8:
    # Use this for new GPUs like Ampere, Hopper GPUs (RTX 30xx, RTX 40xx, A100, H100, L40)
    !pip install "unsloth[colab_ampere] @ git+https://github.com/unslothai/unsloth.git"
else:
    # Use this for older GPUs (V100, Tesla T4, RTX 20xx)
    !pip install "unsloth[colab] @ git+https://github.com/unslothai/unsloth.git"
pass

!pip install "git+https://github.com/huggingface/transformers.git" # Native 4bit loading works!

* We support Llama, Mistral, CodeLlama, TinyLlama, Vicuna, Open Hermes etc
* And Yi, Qwen ([llamafied](https://huggingface.co/models?sort=trending&search=qwen+llama)), Deepseek, all Llama, Mistral derived archs.
* We support 16bit LoRA or 4bit QLoRA. Both 2x faster.
* `max_seq_length` can be set to anything, since we do automatic RoPE Scaling via [kaiokendev's](https://kaiokendev.github.io/til) method.
* [**NEW**] With [PR 26037](https://github.com/huggingface/transformers/pull/26037), we support downloading 4bit models **4x faster**! [Our repo](https://huggingface.co/unsloth) has Llama, Mistral 4bit models.
* DPO requires a model that has already been trained with SFT on a dataset similar to the one used for DPO. We use `HuggingFaceH4/mistral-7b-sft-beta` as the SFT model. Use this [notebook](https://colab.research.google.com/drive/1Dyauq4kTZoLewQ1cApceUQVNcnnNTzg_?usp=sharing) first to train an SFT model.
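
As a rough illustration of the RoPE scaling bullet above, here is a minimal sketch of linear position scaling, which is how kaiokendev's method is usually described. It is not Unsloth's internal code: the `rope_angles` helper, the head dimension, the base, and the two sequence lengths are all assumptions chosen for the example.

```python
import math

def rope_angles(position, head_dim=128, base=10000.0, scale=1.0):
    # Rotation angle per frequency pair for one token position.
    # Linear scaling divides the position by `scale` before rotating.
    scaled_pos = position / scale
    return [scaled_pos * base ** (-2 * i / head_dim) for i in range(head_dim // 2)]

# Hypothetical lengths: a model trained on 4096 positions, stretched to 8192.
trained_len, target_len = 4096, 8192
scale = target_len / trained_len

stretched = rope_angles(8190, scale=scale)  # position beyond the trained range
native = rope_angles(4095)                  # position inside the trained range
angles_match = all(math.isclose(a, b) for a, b in zip(stretched, native))
print(angles_match)
```

With a scale of 2, position 8190 is rotated exactly like position 4095 was during training, so the model never sees an angle outside the range it learned.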

In [None]:
# One must patch the DPO Trainer first!
from unsloth import PatchDPOTrainer
PatchDPOTrainer()

We shall run `ldconfig /usr/lib64-nvidia` to try to fix it.


In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 4096 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/zephyr-sft-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



==((====))==  Unsloth: Fast Mistral patching release 2024.1
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB
O^O/ \_/ \    CUDA capability = 7.5. Xformers = 0.0.22.post7. FA = False.
\        /    Pytorch version: 2.1.0+cu121. CUDA Toolkit = 12.1
 "-____-"     bfloat16 = FALSE. Platform = Linux

You passed `quantization_config` to `from_pretrained` but the model you're loading already has a `quantization_config` attribute. The `quantization_config` attribute will be overwritten with the one you passed to `from_pretrained`.



In [None]:
#@title Alignment Handbook utils
import os
import re
from typing import List, Literal, Optional

from datasets import DatasetDict, concatenate_datasets, load_dataset, load_from_disk
from datasets.builder import DatasetGenerationError


DEFAULT_CHAT_TEMPLATE = "{% for message in messages %}\n{% if message['role'] == 'user' %}\n{{ '<|user|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'system' %}\n{{ '<|system|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'assistant' %}\n{{ '<|assistant|>\n'  + message['content'] + eos_token }}\n{% endif %}\n{% if loop.last and add_generation_prompt %}\n{{ '<|assistant|>' }}\n{% endif %}\n{% endfor %}"


def apply_chat_template(
    example, tokenizer, task: Literal["sft", "generation", "rm", "dpo"] = "sft", assistant_prefix="<|assistant|>\n"
):
    def _strip_prefix(s, pattern):
        # Use re.escape to escape any special characters in the pattern
        return re.sub(f"^{re.escape(pattern)}", "", s)

    if task in ["sft", "generation"]:
        messages = example["messages"]
        # We add an empty system message if there is none
        if messages[0]["role"] != "system":
            messages.insert(0, {"role": "system", "content": ""})
        example["text"] = tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True if task == "generation" else False
        )
    elif task == "rm":
        if all(k in example.keys() for k in ("chosen", "rejected")):
            chosen_messages = example["chosen"]
            rejected_messages = example["rejected"]
            # We add an empty system message if there is none
            if chosen_messages[0]["role"] != "system":
                chosen_messages.insert(0, {"role": "system", "content": ""})
            if rejected_messages[0]["role"] != "system":
                rejected_messages.insert(0, {"role": "system", "content": ""})
            example["text_chosen"] = tokenizer.apply_chat_template(chosen_messages, tokenize=False)
            example["text_rejected"] = tokenizer.apply_chat_template(rejected_messages, tokenize=False)
        else:
            raise ValueError(
                f"Could not format example as dialogue for `rm` task! Require `[chosen, rejected]` keys but found {list(example.keys())}"
            )
    elif task == "dpo":
        if all(k in example.keys() for k in ("chosen", "rejected")):
            # Compared to reward modeling, we filter out the prompt, so the text is everything after the last assistant token
            prompt_messages = [[msg for msg in example["chosen"] if msg["role"] == "user"][0]]
            # Insert system message
            if example["chosen"][0]["role"] != "system":
                prompt_messages.insert(0, {"role": "system", "content": ""})
            else:
                prompt_messages.insert(0, example["chosen"][0])
            # TODO: handle case where chosen/rejected also have system messages
            chosen_messages = example["chosen"][1:]
            rejected_messages = example["rejected"][1:]
            example["text_chosen"] = tokenizer.apply_chat_template(chosen_messages, tokenize=False)
            example["text_rejected"] = tokenizer.apply_chat_template(rejected_messages, tokenize=False)
            example["text_prompt"] = tokenizer.apply_chat_template(
                prompt_messages, tokenize=False, add_generation_prompt=True
            )
            example["text_chosen"] = _strip_prefix(example["text_chosen"], assistant_prefix)
            example["text_rejected"] = _strip_prefix(example["text_rejected"], assistant_prefix)
        else:
            raise ValueError(
                f"Could not format example as dialogue for `dpo` task! Require `[chosen, rejected]` keys but found {list(example.keys())}"
            )
    else:
        raise ValueError(
            f"Task {task} not supported, please ensure that the provided task is one of {['sft', 'generation', 'rm', 'dpo']}"
        )
    return example


def get_datasets(
    data_config: dict,
    splits: List[str] = ["train", "test"],
    shuffle: bool = True,
) -> DatasetDict:
    """
    Loads one or more datasets with varying training set proportions.

    Args:
        data_config (`DataArguments` or `dict`):
            Dataset configuration and split proportions.
        splits (`List[str]`, *optional*, defaults to `['train', 'test']`):
            Dataset splits to load and mix. Assumes the splits exist in all datasets and have a `train_` or `test_` prefix.
        shuffle (`bool`, *optional*, defaults to `True`):
            Whether to shuffle the training and testing/validation data.

    Returns:
        [`DatasetDict`]: The dataset dictionary containing the loaded datasets.
    """

    if type(data_config) is dict:
        # Structure of the input is:
        #     dataset_mixer = {
        #             "dataset1": 0.5,
        #             "dataset2": 0.3,
        #             "dataset3": 0.2,
        #         }
        dataset_mixer = data_config
    else:
        raise ValueError(f"Data config {data_config} not recognized.")

    raw_datasets = mix_datasets(dataset_mixer, splits=splits, shuffle=shuffle)
    return raw_datasets


def mix_datasets(dataset_mixer: dict, splits: Optional[List[str]] = None, shuffle=True) -> DatasetDict:
    """
    Loads and mixes datasets according to proportions specified in `dataset_mixer`.

    Args:
        dataset_mixer (`dict`):
            Dictionary containing the dataset names and their training proportions. By default, all test proportions are 1.
        splits (Optional[List[str]], *optional*, defaults to `None`):
            Dataset splits to load and mix. Assumes the splits exist in all datasets and have a `train_` or `test_` prefix.
        shuffle (`bool`, *optional*, defaults to `True`):
            Whether to shuffle the training and testing/validation data.
    """
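    # Fall back to the standard splits when none are given (iterating over None would fail)
    splits = ["train", "test"] if splits is None else splits
    raw_datasets = DatasetDict()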
    raw_train_datasets = []
    raw_val_datasets = []
    fracs = []
    for ds, frac in dataset_mixer.items():
        fracs.append(frac)
        for split in splits:
            try:
                # Try first if dataset on a Hub repo
                dataset = load_dataset(ds, split=split)
            except DatasetGenerationError:
                # If not, check local dataset
                dataset = load_from_disk(os.path.join(ds, split))

            if "train" in split:
                raw_train_datasets.append(dataset)
            elif "test" in split:
                raw_val_datasets.append(dataset)
            else:
                raise ValueError(f"Split type {split} not recognized as one of test or train.")

    if any(frac < 0 for frac in fracs):
        raise ValueError("Dataset fractions cannot be negative.")

    if len(raw_train_datasets) > 0:
        train_subsets = []
        for dataset, frac in zip(raw_train_datasets, fracs):
            train_subset = dataset.select(range(int(frac * len(dataset))))
            train_subsets.append(train_subset)
        if shuffle:
            raw_datasets["train"] = concatenate_datasets(train_subsets).shuffle(seed=42)
        else:
            raw_datasets["train"] = concatenate_datasets(train_subsets)
    # No subsampling for test datasets to enable fair comparison across models
    if len(raw_val_datasets) > 0:
        if shuffle:
            raw_datasets["test"] = concatenate_datasets(raw_val_datasets).shuffle(seed=42)
        else:
            raw_datasets["test"] = concatenate_datasets(raw_val_datasets)

    if len(raw_datasets) == 0:
        raise ValueError(
            f"Dataset {dataset_mixer} not recognized with split {split}. Check the dataset has been correctly formatted."
        )

    return raw_datasets

<a name="Data"></a>
### Data Prep
We follow Hugging Face's [Alignment Handbook](https://github.com/huggingface/alignment-handbook) recipe for [Zephyr](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta) and use the [UltraFeedback (binarized) dataset](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized). To speed things up, we sample only 0.5% of the training split; the test split is not subsampled. Raise the fraction to 1.0 for a full run.
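
To make the target format concrete, here is a simplified sketch of how one preference pair becomes the `prompt`, `chosen`, and `rejected` strings. It imitates the Zephyr-style template in plain Python rather than calling the tokenizer. The helpers `render_zephyr` and `to_dpo_triple`, the sample conversation, and the assumption that the conversation has no system message are all illustrative.

```python
EOS = "</s>"  # assumed end-of-sequence token
ASSISTANT_PREFIX = "<|assistant|>\n"

def render_zephyr(messages, add_generation_prompt=False):
    # Each turn becomes: <|role|>, newline, content, EOS, newline.
    text = "".join(f"<|{m['role']}|>\n{m['content']}{EOS}\n" for m in messages)
    if add_generation_prompt:
        text += ASSISTANT_PREFIX
    return text

def strip_prefix(text, prefix):
    return text[len(prefix):] if text.startswith(prefix) else text

def to_dpo_triple(example):
    # Prompt = empty system message + first user turn + open assistant tag.
    first_user = next(m for m in example["chosen"] if m["role"] == "user")
    prompt_messages = [{"role": "system", "content": ""}, first_user]
    return {
        "prompt": render_zephyr(prompt_messages, add_generation_prompt=True),
        # Completions drop the leading user turn and the assistant tag.
        "chosen": strip_prefix(render_zephyr(example["chosen"][1:]), ASSISTANT_PREFIX),
        "rejected": strip_prefix(render_zephyr(example["rejected"][1:]), ASSISTANT_PREFIX),
    }

# Made-up preference pair in the chosen/rejected message-list layout.
example = {
    "chosen": [{"role": "user", "content": "Name a primary color."},
               {"role": "assistant", "content": "Red."}],
    "rejected": [{"role": "user", "content": "Name a primary color."},
                 {"role": "assistant", "content": "Purple."}],
}
triple = to_dpo_triple(example)
for key, value in triple.items():
    print(key, repr(value))
```

The prompt ends with an open `<|assistant|>` tag, and each completion is just the answer text plus the end-of-sequence token, so prompt and completion concatenate into a full conversation.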

In [None]:
raw_datasets = get_datasets(
    {"HuggingFaceH4/ultrafeedback_binarized" : 0.005}, # 0.5% sampled
    splits = ["train_prefs", "test_prefs"],
)
column_names = list(raw_datasets["train"].features)

raw_datasets = raw_datasets.map(
    apply_chat_template,
    fn_kwargs = {"tokenizer": tokenizer, "task": "dpo"},
    num_proc = 12,
    remove_columns = column_names,
    desc = "Formatting comparisons with prompt template",
)

# Replace column names with what TRL needs, text_chosen -> chosen and text_rejected -> rejected
for split in ["train", "test"]:
    raw_datasets[split] = raw_datasets[split].rename_columns(
        {"text_prompt": "prompt", "text_chosen": "chosen", "text_rejected": "rejected"}
    )


Let's print one example (row 8) from the training set to check the formatting.

In [None]:
import pprint
row = raw_datasets["train"][8]
pprint.pprint(row["prompt"])
pprint.pprint(row["chosen"])
pprint.pprint(row["rejected"])

('<|system|>\n'
 '</s>\n'
 '<|user|>\n'
 'Describe a possible solution to the environmental issue of air '
 'pollution.</s>\n'
 '<|assistant|>\n')
('One of the most effective solutions to the environmental issue of air '
 'pollution is promoting and investing in renewable energy sources. '
 'Traditional energy sources like coal and oil produce large amounts of '
 'greenhouse gases that contribute greatly to air pollution. Renewable energy '
 'sources like wind, solar, and hydropower sources do not produce greenhouse '
 'gases. They use clean energy sources that do not harm the environment and do '
 'not contribute to air pollution. \n'
 '\n'
 'Another solution is improving our public transportation systems. Encouraging '
 'individuals to use public transportation, cycling, walking, or carpooling '
 'instead of their personal vehicles can greatly reduce emissions. This is '
 'especially helpful in urban areas where traffic congestion and subsequent '
 'air pollution is a common issue. \

We now add LoRA adapters so we only need to update 1 to 10% of all parameters!
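
For a sense of where the "1 to 10%" figure comes from, the sketch below counts LoRA parameters for rank 64 on the seven projection modules. The layer shapes are assumptions taken from the published Mistral-7B configuration, not values read from this notebook's model object.

```python
# Assumed Mistral-7B shapes (published config values, not read from the model here).
HIDDEN, INTERMEDIATE, KV_DIM, LAYERS, VOCAB = 4096, 14336, 1024, 32, 32000
R = 64  # LoRA rank used in this notebook

shapes = {  # module name: (in_features, out_features)
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

# Each adapted matrix gains two low-rank factors: (in x r) and (r x out).
lora_params = LAYERS * sum(R * (n_in + n_out) for n_in, n_out in shapes.values())

# Base model: projection weights + two norms per layer, embeddings, LM head, final norm.
base_params = (
    LAYERS * (sum(n_in * n_out for n_in, n_out in shapes.values()) + 2 * HIDDEN)
    + 2 * VOCAB * HIDDEN
    + HIDDEN
)

fraction = lora_params / base_params
print(f"{lora_params:,} LoRA params / {base_params:,} base params = {fraction:.1%}")
```

Under these assumptions, rank 64 adds roughly 168 million trainable parameters, a little over 2% of the base model.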

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 64, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 64,
    lora_dropout = 0, # Currently only supports dropout = 0
    bias = "none",    # Currently only supports bias = "none"
    use_gradient_checkpointing = True,
    random_state = 3407,
    max_seq_length = max_seq_length,
)

Unsloth 2024.1 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


<a name="Train"></a>
### Train the DPO model
Now let's use Hugging Face TRL's `DPOTrainer`! More docs here: [TRL DPO docs](https://huggingface.co/docs/trl/dpo_trainer). We do 3 epochs on 0.5% of the dataset to speed things up.
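
The length of the run follows from the sample size and the batch settings used in the trainer cell. The arithmetic below is a sketch. The rounding rules (ceiling for batches, floor for optimizer steps per epoch) are an assumption about how the trainer counts steps, and the 61135-row split size is taken from the dataset generation log.

```python
import math

train_rows = int(0.005 * 61135)      # 0.5% sample of the train_prefs split
per_device_batch = 2
grad_accum = 4
epochs = 3

batches_per_epoch = math.ceil(train_rows / per_device_batch)
updates_per_epoch = batches_per_epoch // grad_accum   # assumed floor division
total_updates = updates_per_epoch * epochs

print(train_rows, batches_per_epoch, updates_per_epoch, total_updates)
```

This gives 305 rows, 153 batches and 38 optimizer steps per epoch, for 114 steps in total, which agrees with the `global_step=114` reported after training.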

In [None]:
from transformers import TrainingArguments
from trl import DPOTrainer

dpo_trainer = DPOTrainer(
    model = model,
    ref_model = None,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_ratio = 0.1,
        num_train_epochs = 3,
        learning_rate = 5e-6,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.0,
        lr_scheduler_type = "linear",
        seed = 42,
        output_dir = "outputs",
    ),
    beta = 0.1,
    train_dataset = raw_datasets["train"],
    # eval_dataset = raw_datasets["test"],
    tokenizer = tokenizer,
    max_length = 1024,
    max_prompt_length = 512,
)




In [None]:
dpo_trainer.train()

Unsloth: `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`
Could not estimate the number of tokens of the input, floating-point operations will not be computed


| Step | Training Loss | rewards / chosen | rewards / rejected | rewards / accuracies | rewards / margins | logps / rejected | logps / chosen | logits / rejected | logits / chosen |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.6931 | 0.0 | 0.0 | 0.0 | 0.0 | -332.211426 | -268.716766 | -2.40433 | -2.34995 |
| 2 | 0.6931 | 0.0 | 0.0 | 0.0 | 0.0 | -348.399475 | -300.459442 | -2.915292 | -2.743188 |
| 3 | 0.6931 | 0.0 | 0.0 | 0.0 | 0.0 | -94.223419 | -116.334641 | -2.730997 | -2.70415 |
| 4 | 0.694 | -0.001252 | 0.000494 | 0.375 | -0.001746 | -192.265182 | -292.463257 | -2.762836 | -2.777498 |
| 5 | 0.6961 | -0.001606 | 0.004272 | 0.375 | -0.005879 | -371.855896 | -400.328979 | -2.877292 | -3.042836 |
| 6 | 0.6938 | 0.000404 | 0.001742 | 0.5 | -0.001338 | -304.883759 | -262.745514 | -2.706605 | -2.799091 |
| 7 | 0.6926 | 0.001944 | 0.000743 | 0.625 | 0.001201 | -283.788269 | -279.07959 | -2.491374 | -2.414951 |
| 8 | 0.693 | 0.00139 | 0.001145 | 0.625 | 0.000246 | -171.792603 | -189.7771 | -2.696905 | -2.618503 |
| 9 | 0.6931 | 0.004504 | 0.004375 | 0.375 | 0.000129 | -384.899536 | -351.154205 | -2.829076 | -2.754782 |
| 10 | 0.6881 | 0.003531 | -0.006657 | 0.625 | 0.010187 | -270.956268 | -292.085205 | -2.422791 | -2.538726 |


TrainOutput(global_step=114, training_loss=0.40681673795507667, metrics={'train_runtime': 4365.2857, 'train_samples_per_second': 0.21, 'train_steps_per_second': 0.026, 'total_flos': 0.0, 'train_loss': 0.40681673795507667, 'epoch': 2.98})
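
The first logged steps show a loss of 0.6931 with all rewards at zero. The sketch below shows why, using the standard sigmoid DPO loss for a single preference pair. It is an illustration with `beta = 0.1` as in the trainer arguments, not TRL's implementation, and the shifted log-probabilities in the second call are invented.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Implicit rewards: how far the policy's log-probs moved from the reference's.
    chosen_reward = beta * (policy_chosen - ref_chosen)
    rejected_reward = beta * (policy_rejected - ref_rejected)
    margin = chosen_reward - rejected_reward
    loss = math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))
    return loss, chosen_reward, rejected_reward, margin

# Before any update the policy equals the reference, so every reward is zero.
start_loss, *_ = dpo_loss(-268.716766, -332.211426, -268.716766, -332.211426)
print(round(start_loss, 4))

# Invented later state: chosen answer slightly more likely, rejected slightly less.
later_loss, _, _, later_margin = dpo_loss(-268.6, -332.3, -268.716766, -332.211426)
print(later_loss < start_loss, later_margin > 0)
```

A zero margin gives `log(2) ≈ 0.6931`, which is the value in the first rows of the log. The loss falls below that once the chosen reward pulls ahead of the rejected one, as the `rewards / margins` column starts to show by step 10.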

Training is done! If you have questions about Unsloth, find a bug, need help, or want to keep up with the latest LLM news and projects, join our [Discord](https://discord.gg/u54VK8m8tk)!

We also have other notebooks on:
1. Llama 7b [free Colab](https://colab.research.google.com/drive/1lBzz5KeZJKXjvivbYvmGarix9Ao6Wxe5?usp=sharing)
2. TinyLlama full Alpaca 52K in 1 hr [free Colab](https://colab.research.google.com/drive/1AZghoNBQaMDgWJpi4RbffGM1h6raLUj9?usp=sharing)
3. Mistral 7b [free Colab](https://colab.research.google.com/drive/1Dyauq4kTZoLewQ1cApceUQVNcnnNTzg_?usp=sharing)
4. Llama 7b [free Kaggle](https://www.kaggle.com/danielhanchen/unsloth-alpaca-t4-ddp)
5. CodeLlama 34b [A100 on Colab](https://colab.research.google.com/drive/1y7A0AxE3y8gdj4AVkl2aZX47Xu3P1wJT?usp=sharing)


In [None]:
model.save_pretrained("zephyr_model") # Local saving
# model.push_to_hub("your_name/lora_model") # Online saving

# Zip Archive the Model

In [None]:
import zipfile
import os

def zip_folder(folder_path, output_path):
    """
    Zips a folder and all its contents into a single zip file.

    Args:
      folder_path: The path to the folder to be zipped.
      output_path: The path to the output zip file.
    """

    # Create a ZipFile object
    with zipfile.ZipFile(output_path, 'w') as zip_file:

        # Walk through the folder and add all files to the zip file
        for root, dirs, files in os.walk(folder_path):
            for file in files:
                # Get the full path to the file
                file_path = os.path.join(root, file)

                # Add the file to the zip file, stored relative to folder_path
                zip_file.write(file_path, os.path.relpath(file_path, folder_path))



## See the name of the folder/directory for the model

In [None]:
import os

for folder in os.listdir('.'):
    print(folder)

.config
zephyr_model
outputs
sample_data


## Make a timestamp

In [23]:
from datetime import datetime as dt, timezone

# make readable time (UTC); dt.utcnow() is deprecated in newer Python versions
date_time = dt.now(timezone.utc)
timestamp = date_time.strftime('%Y_%m_%d_%H_%M_%S_%f')

# inspect
print(timestamp)

2024_01_14_00_34_28_038547


## Zip your model folder for easier download and portability

In [None]:
zip_folder(folder_path="zephyr_model", output_path=f"zipped_model_{timestamp}.zip")

## See if your archived model is there

In [None]:
for folder in os.listdir('.'):
    print(folder)

.config
lora_model
zipped_model_2024_01_13_14_23_20_053208.zip
outputs
sample_data
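
To restore the adapter folder from the archive on another machine, something like the following should work. `unzip_folder` is a hypothetical helper, not part of this notebook. The demo builds a throwaway archive with placeholder files so it runs without a trained model. Because the archive stores paths relative to the model folder, the files are extracted into an explicit destination folder.

```python
import os
import tempfile
import zipfile

def unzip_folder(zip_path, dest_folder):
    """Extract an archive into dest_folder and return the restored relative paths."""
    os.makedirs(dest_folder, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest_folder)
    restored = []
    for root, _dirs, files in os.walk(dest_folder):
        for name in files:
            restored.append(os.path.relpath(os.path.join(root, name), dest_folder))
    return sorted(restored)

# Demo with placeholder files (the file names are made up for illustration).
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "zephyr_model")
    os.makedirs(os.path.join(src, "nested"))
    for rel in ("adapter_config.json", os.path.join("nested", "weights.bin")):
        with open(os.path.join(src, rel), "w") as fh:
            fh.write("placeholder")

    # Archive the folder with paths stored relative to it.
    archive = os.path.join(tmp, "zipped_model.zip")
    with zipfile.ZipFile(archive, "w") as zf:
        for root, _dirs, files in os.walk(src):
            for name in files:
                full = os.path.join(root, name)
                zf.write(full, os.path.relpath(full, src))

    restored_files = unzip_folder(archive, os.path.join(tmp, "restored"))

print(restored_files)
```

In the notebook itself, the call would take the timestamped zip name and a destination such as `"zephyr_model"`, which can then be passed to `PeftModel.from_pretrained`.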


# Load/Reload Model

To save to `GGUF` / `llama.cpp`, or for model merging, use `model.merge_and_unload` first, then save the model. Maxime Labonne's [llm-course](https://mlabonne.github.io/blog/posts/Quantize_Llama_2_models_using_ggml.html) has a nice tutorial on converting HF to GGUF! This [issue](https://github.com/ggerganov/llama.cpp/issues/3097) might be helpful for more info.

In [None]:
model = model.merge_and_unload()



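Now if you want to load the adapters we just saved, we can! Load them onto the unmerged base model: if you ran `merge_and_unload` above, the LoRA update is already folded into the weights, and loading the adapters again would apply it a second time.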

In [31]:
from peft import PeftModel
model = PeftModel.from_pretrained(model, "zephyr_model")

Finally, we can now do some inference on the loaded model.

In [33]:
#@title Alpaca dataset preparation code
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

In [34]:
inputs = tokenizer(
[
    alpaca_prompt.format(
        "What is the famous tower in France called?", # instruction
        "", # input
        "", # output
    )
]*1, return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 128, use_cache = True)
tokenizer.batch_decode(outputs)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


['<s> Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nWhat is the famous tower in France called?\n\n### Input:\n\n\n### Response:\nThe famous tower in France is called the Eiffel Tower.</s>']


We also have other notebooks on:
1. Zephyr DPO [free Colab](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing)
2. Mistral 7b [free Colab](https://colab.research.google.com/drive/1Dyauq4kTZoLewQ1cApceUQVNcnnNTzg_?usp=sharing)
3. Llama 7b [free Colab](https://colab.research.google.com/drive/1lBzz5KeZJKXjvivbYvmGarix9Ao6Wxe5?usp=sharing)
4. CodeLlama 34b [A100 on Colab](https://colab.research.google.com/drive/1y7A0AxE3y8gdj4AVkl2aZX47Xu3P1wJT?usp=sharing)
5. Llama 7b [free Kaggle](https://www.kaggle.com/danielhanchen/unsloth-alpaca-t4-ddp)


# Use Model: Basic & Looped Query/Chat

In [35]:
inputs = tokenizer(
[
    alpaca_prompt.format(
        "Answer a question.", # instruction
        "What is the famous tower in France called?", # input
        "", # output
    )
]*1, return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 128, use_cache = True)
tokenizer.batch_decode(outputs)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


['<s> Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nAnswer a question.\n\n### Input:\nWhat is the famous tower in France called?\n\n### Response:\nThe famous tower in France is called the Eiffel Tower.</s>']

## Basic Query Function

In [36]:
def query(input_string):
    inputs = tokenizer(
    [
        alpaca_prompt.format(
            "Answer a question.", # instruction
            f"{input_string}", # input
            "", # output
        )
    ]*1, return_tensors = "pt").to("cuda")

    outputs = model.generate(**inputs, max_new_tokens = 128, use_cache = True)
    output_string = tokenizer.batch_decode(outputs)

    return output_string[0]

In [24]:
# # Testing
# query("What is the famous tower in Tokyo called?")

# Looped Query/Chat

In [46]:
'''
A very minimal chat with memory.

E.g. Run with the following code:

from datetime import datetime as dt, timezone

# make readable time (UTC)
date_time = dt.now(timezone.utc)
timestamp = date_time.strftime('%Y/%m/%d  %H:%M:%S:%f')

instructions = f"""
  Instructions: {timestamp}
    - Enter your queries into the input box.
    - Say 'quit' or 'exit', etc.,  to leave the conversation.

"""

print( instructions )

# run chat
keep_talking()

'''

import re

def query(input_string):
    inputs = tokenizer(
    [
        alpaca_prompt.format(
            "Answer a question.", # instruction
            f"{input_string}", # input
            "", # output
        )
    ]*1, return_tensors = "pt").to("cuda")

    outputs = model.generate(**inputs, max_new_tokens = 128, use_cache = True)
    output_string = tokenizer.batch_decode(outputs)

    return output_string[0]


def strip_non_alpha(text):
    # lowercase first, then drop everything except a-z,
    # so "Quit!" becomes "quit" (stripping before lowercasing would give "uit")
    pattern = re.compile('[^a-z]')
    return pattern.sub('', text.lower())


def keep_talking():
    """
    A very minimal chat with memory.

    Uses:
      query(input_string)
      strip_non_alpha(text)
    """
    still_talking = True
    dialogue_history = ""

    while still_talking:
        user_input = input()

        dialogue_input = dialogue_history + "\n\n  ### Input: \n\n" + user_input

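        # phrases are compared after strip_non_alpha(), so list them as
        # lowercase letters only ("!q" becomes "q", "good bye" becomes "goodbye")
        exit_phrase_list = [
            "exit",
            "quit",
            "quite",
            "q",
            "done",
            "finish",
            "end",
            "bye",
            "goodbye",
        ]

        # check if user is exiting conversation
        if strip_non_alpha(user_input) in exit_phrase_list: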
            print("\nAll Done!")
            break

        else:
            print("...processing...\n")
            output = query(dialogue_input)
            print( output )

            # save dialogue so far
            dialogue_history = output


e.g.
```
Instructions: 2024/01/14  01:00:11:786295
    - Enter your queries into the input box.
    - Say 'quit' or 'exit', etc.,  to leave the conversation.


if I give you two words, can you put them together? the second word is apple
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
...processing...

<s> Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Answer a question.

### Input:


  ### Input:

if I give you two words, can you put them together? the second word is apple

### Response:

Yes, I can put them together. The two words are "apple" and "pie".</s>
correction, the first word is "Pine" what are the two words together?
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
...processing...

<s> Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Answer a question.

### Input:
<s>  Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Answer a question.

### Input:


  ### Input:

if I give you two words, can you put them together? the second word is apple

### Response:

Yes, I can put them together. The two words are "apple" and "pie".</s>

  ### Input:

correction, the first word is "Pine" what are the two words together?

### Response:

The two words together are "Pineapple".</s>
```
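
The transcript shows each reply echoing the full prompt, and that echo then gets nested into the next turn's input. One way to keep only the model's answer is to cut at the last `### Response:` marker. `extract_response` is a hypothetical helper, shown here on a shortened sample string rather than real model output.

```python
RESPONSE_MARKER = "### Response:"

def extract_response(decoded, eos_token="</s>"):
    # Keep only the text after the last response marker, without the EOS token.
    _head, found, tail = decoded.rpartition(RESPONSE_MARKER)
    if not found:
        return decoded.strip()
    return tail.replace(eos_token, "").strip()

# Shortened sample in the same layout as the decoded outputs above.
sample = (
    "<s> Below is an instruction that describes a task.\n\n"
    "### Instruction:\nAnswer a question.\n\n"
    "### Input:\nWhat is the famous tower in France called?\n\n"
    "### Response:\nThe famous tower in France is called the Eiffel Tower.</s>"
)
answer = extract_response(sample)
print(answer)
```

Storing the user's question and this trimmed answer in the history, instead of the whole decoded string, would keep the prompt from growing with nested templates on every turn.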

# Chat Loop

In [58]:
from datetime import datetime as dt, timezone

# make readable time (UTC)
date_time = dt.now(timezone.utc)
timestamp = date_time.strftime('%Y/%m/%d  %H:%M:%S:%f')

instructions = f"""
  Instructions: {timestamp}
    - Enter your queries into the input box.
    - Say 'quit' or 'exit', etc.,  to leave the conversation.

"""

print( instructions )

# run chat
keep_talking()


  Instructions: 2024/01/14  01:32:05:428413
    - Enter your queries into the input box.
    - Say 'quit' or 'exit', etc.,  to leave the conversation.


Can you make a sandwich with two thing inside? one is jelly, wait for the the next


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


...processing...

<s> Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Answer a question.

### Input:


  ### Input: 

Can you make a sandwich with two thing inside? one is jelly, wait for the the next

### Response:

Yes, I can make a sandwich with two things inside. One of the things is jelly, and the other thing is...?</s>
the other is peanut butter, so what kind of sandwhich is it now?


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


...processing...

<s> Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Answer a question.

### Input:
<s>  Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Answer a question.

### Input:


  ### Input: 

Can you make a sandwich with two thing inside? one is jelly, wait for the the next

### Response:

Yes, I can make a sandwich with two things inside. One of the things is jelly, and the other thing is...?</s> 

  ### Input: 

the other is peanut butter, so what kind of sandwhich is it now?

### Response:

It's a peanut butter and jelly sandwich.</s>
bye

All Done!


# Test 2: Plain Prompt Without the Alpaca Template

In [56]:
# note LlamaTokenizerFast

def generate_response(prompt, model):
    encoded_input = tokenizer(prompt, return_tensors="pt", add_special_tokens=True)
    model_inputs = encoded_input.to('cuda')

    generated_ids = model.generate(**model_inputs, max_new_tokens=100, do_sample=True)

    decoded_output = tokenizer.batch_decode(generated_ids)

    return decoded_output[0]

# generate_response("What is a tree?", model)

In [57]:
generate_response("What is a tree?", model)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


'<s> What is a tree? Is it an organism or a community and ecosystem?\nAt the level of function, a tree is not very different from the ecosystems in which it is embedded. The term “organism” does not even adequately describe all the things trees do. They have their own ecological niches – some trees support large numbers of beetle larvae, for example.\nTree species are not like human species. Some interbreed to form hybrids, the ec'