# Instruct tuning the model

This notebook draws heavily on a similar one written for the [phi3](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct/blob/main/sample_finetune.py) model.

The difference is that this notebook focuses on full fine-tuning, taking a base model to a new instruction-tuned model, and should work for almost any model on HuggingFace.

At the end of the notebook are the steps to save the model in gguf format, which allows for fast and easy inference.

In [1]:
import sys
import logging

import datasets
from datasets import load_dataset
import torch
import transformers
from trl import SFTTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
import os
import json
import wandb


In [2]:
logger = logging.getLogger(__name__)
wandb.init(project="smollm-ft")
###################
# Hyper-parameters
###################
training_config = {
    "do_eval": False,
    "learning_rate": 5.0e-04,
    "per_device_train_batch_size": 24,
    "gradient_accumulation_steps": 1,
    "log_level": "info",
    "logging_steps": 100,
    "logging_strategy": "steps",
    "lr_scheduler_type": "cosine",
    "num_train_epochs": 5,
    "max_steps": -1,
    "output_dir": "./nature-buddy",
    "overwrite_output_dir": True,
    "remove_unused_columns": True,
    "save_steps": 500,
    "save_total_limit": 1,
    "seed": 0,
    "gradient_checkpointing": True,
    "gradient_checkpointing_kwargs": {"use_reentrant": False},
    "warmup_ratio": 0.05,
    "report_to": "wandb",
    "neftune_noise_alpha": 3,
    "push_to_hub": True,
    }

train_conf = TrainingArguments(**training_config)



Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: noahpunintended. Use `wandb login --relogin` to force relogin


In [3]:
###############
# Setup logging
###############
logging.basicConfig(
    format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
    handlers=[logging.StreamHandler(sys.stdout)],
)
log_level = train_conf.get_process_log_level()
logger.setLevel(log_level)
datasets.utils.logging.set_verbosity(log_level)
transformers.utils.logging.set_verbosity(log_level)
transformers.utils.logging.enable_default_handler()
transformers.utils.logging.enable_explicit_format()

logger.info(f"Training/evaluation parameters {train_conf}")


2024-10-09 11:36:45 - INFO - __main__ - Training/evaluation parameters TrainingArguments(
_n_gpu=1,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
batch_eval_metrics=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=False,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_on_start=False,
eval_steps=None,
eval_str
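
As a rough check on the hyper-parameters above, the sketch below estimates how many optimizer steps the run takes and how many of them fall inside the warmup window. It assumes a single GPU (the log reports `_n_gpu=1`), the 3204 examples reported by `len(processed_train_dataset)` later in this notebook, and an approximation of how the Trainer rounds these quantities; `schedule_summary` is a hypothetical helper, not part of any library.

```python
import math


def schedule_summary(num_examples, per_device_batch_size, grad_accum_steps,
                     num_epochs, warmup_ratio, num_devices=1):
    """Approximate the step counts implied by a set of training arguments."""
    # The last partial batch is kept (dataloader_drop_last=False), so round up.
    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    # One optimizer step happens every `grad_accum_steps` batches.
    steps_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    total_steps = math.ceil(num_epochs * steps_per_epoch)
    # warmup_ratio is a fraction of the total number of optimizer steps.
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return steps_per_epoch, total_steps, warmup_steps


steps_per_epoch, total_steps, warmup_steps = schedule_summary(
    num_examples=3204,
    per_device_batch_size=24,
    grad_accum_steps=1,
    num_epochs=5,
    warmup_ratio=0.05,
)
print(steps_per_epoch, total_steps, warmup_steps)  # 134 670 34
```

With `save_steps` set to 500, a run of this length would write one intermediate checkpoint. Changing the batch size changes every number here, so it is worth re-running the estimate whenever the configuration changes.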

In [4]:


####################
# Base Model Loading
####################
checkpoint_path = "HuggingFaceTB/SmolLM2-360M"
model_kwargs = dict(
    use_cache=False,
    trust_remote_code=True,
    # attn_implementation="flash_attention_2",  # only works on the latest GPUs, probably not worth it in most cases
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = AutoModelForCausalLM.from_pretrained(checkpoint_path, **model_kwargs)


###################
# Tokenizer Loading
###################

checkpoint_path = "HuggingFaceTB/SmolLM2-360M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(checkpoint_path)
tokenizer.model_max_length = 2048
tokenizer.pad_token = "<|endoftext|>"  # note this is specific to smollm
tokenizer.pad_token_id = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)
tokenizer.padding_side = 'right'
# https://stackoverflow.com/questions/76446228/setting-padding-token-as-eos-token-when-using-datacollatorforlanguagemodeling-fr


[INFO|configuration_utils.py:733] 2024-10-09 11:36:45,721 >> loading configuration file config.json from cache at /home/zeus/.cache/huggingface/hub/models--HuggingFaceTB--SmolLM-135M/snapshots/1d461723eec654e65efdc40cf49301c89c0c92f4/config.json
[INFO|configuration_utils.py:800] 2024-10-09 11:36:45,724 >> Model config LlamaConfig {
  "_name_or_path": "HuggingFaceTB/SmolLM-135M",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 0,
  "eos_token_id": 0,
  "hidden_act": "silu",
  "hidden_size": 576,
  "initializer_range": 0.02,
  "intermediate_size": 1536,
  "max_position_embeddings": 2048,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 9,
  "num_hidden_layers": 30,
  "num_key_value_heads": 3,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "tie_word_embeddings": true,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.44.2",
  "use_cach

[INFO|modeling_utils.py:1606] 2024-10-09 11:36:45,877 >> Instantiating LlamaForCausalLM model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:1038] 2024-10-09 11:36:45,880 >> Generate config GenerationConfig {
  "bos_token_id": 0,
  "eos_token_id": 0,
  "use_cache": false
}

[INFO|modeling_utils.py:4507] 2024-10-09 11:36:46,838 >> All model checkpoint weights were used when initializing LlamaForCausalLM.

[INFO|modeling_utils.py:4515] 2024-10-09 11:36:46,839 >> All the weights of LlamaForCausalLM were initialized from the model checkpoint at HuggingFaceTB/SmolLM-135M.
If your task is similar to the task the model of the checkpoint was trained on, you can already use LlamaForCausalLM for predictions without further training.
[INFO|configuration_utils.py:993] 2024-10-09 11:36:46,868 >> loading configuration file generation_config.json from cache at /home/zeus/.cache/huggingface/hub/models--HuggingFaceTB--SmolLM-135M/snapshots/1d461723eec654e65efdc40cf49301c89c0c92f4/gene

In [None]:
tokenizer.pad_token, tokenizer.eos_token, tokenizer.eos_token_id, tokenizer.pad_token_id
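
To make the padding settings above concrete, here is a minimal pure-Python sketch of what right padding does to a batch. The token ids and the pad id are made up for illustration, and `pad_right` is a hypothetical helper; the real work is done inside the tokenizer and the trainer's data collator.

```python
def pad_right(sequences, pad_token_id):
    """Pad every sequence on the right to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


PAD_ID = 0  # illustrative value; check tokenizer.pad_token_id for the real one
batch = [[5, 6, 7], [8, 9]]
ids, mask = pad_right(batch, PAD_ID)
print(ids)   # [[5, 6, 7], [8, 9, 0]]
print(mask)  # [[1, 1, 1], [1, 1, 0]]
```

Because the pad token here is the same string as the end-of-text token, the token id alone cannot tell padding from a genuine end of text; the attention mask carries that information.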

### Setting up the fine-tune 

Now that the synthetic dataset is made, the next step is to ensure the model answers as we expect without a large system prompt adding latency.

The solution is to open the dataset, replace the system prompt with something much simpler, and start training with that.

data_path = 'ft-flow/data/picard-messages.json'
with open(data_path, 'r') as f:
    data = json.load(f)

condensed_system_prompt = "You are Pi-Card, the Raspberry Pi voice assistant."

ft_data = []
for conversation in data:
    conversation['messages'][0]['content'] = condensed_system_prompt
    ft_data.append(conversation)


# save to a new file for data processing
with open('ft-flow/data/ft-dataset.json', 'w') as f:
    json.dump(ft_data, f, indent=4)

##################
# Data Processing
##################
def apply_chat_template(
    example,
    tokenizer,
):
    messages = example["messages"]
    example["text"] = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=False).strip('\n')
    return example

raw_dataset = load_dataset('json', data_files='ft-flow/data/ft-dataset.json', split='train') 

train_dataset = raw_dataset
column_names = list(train_dataset.features)

processed_train_dataset = train_dataset.map(
    apply_chat_template,
    fn_kwargs={"tokenizer": tokenizer},
    num_proc=10,
    desc="Applying chat template to train_sft",
)

# shuffle the dataset
processed_train_dataset = processed_train_dataset.shuffle(seed=42)
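
For intuition about what ends up in the `text` column, the sketch below renders a conversation in the ChatML-style layout that the prompts in this notebook use (`<|im_start|>role ... <|im_end|>`). It is a hand-written approximation, not the tokenizer's actual template, which may differ in details such as a default system message; `render_chatml` and the example conversation are hypothetical.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Render a list of {"role", "content"} dicts in a ChatML-style layout."""
    text = ""
    for message in messages:
        text += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    if add_generation_prompt:
        # At inference time the prompt ends with an open assistant turn.
        text += "<|im_start|>assistant\n"
    return text


example = {
    "messages": [
        {"role": "system", "content": "You are Pi-Card, the Raspberry Pi voice assistant."},
        {"role": "user", "content": "What can you do?"},
        {"role": "assistant", "content": "I can answer questions out loud."},
    ]
}
# Mirror the .strip('\n') used in apply_chat_template above.
text = render_chatml(example["messages"]).strip("\n")
print(text)
```

For training, the full conversation including the assistant turn is rendered, which is why `add_generation_prompt` is left as `False`.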

In [6]:
##################
# Data Processing
##################
def apply_chat_template(
    example,
    tokenizer,
):
    messages = example["messages"]
    example["text"] = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=False).strip('\n')
    return example

raw_dataset = load_dataset("nkasmanoff/disaster_buddy_test")
raw_dataset2 = load_dataset("HuggingFaceTB/everyday-conversations-llama3.1-2k")

# combine both training splits into a single dataset
train_dataset = raw_dataset['train']
train_dataset2 = raw_dataset2['train_sft']
train_dataset = datasets.concatenate_datasets([train_dataset, train_dataset2])
column_names = list(train_dataset.features)

processed_train_dataset = train_dataset.map(
    apply_chat_template,
    fn_kwargs={"tokenizer": tokenizer},
    num_proc=10,
    desc="Applying chat template to train_sft",
)

# shuffle the dataset
processed_train_dataset = processed_train_dataset.shuffle(seed=42)

2024-10-09 11:36:54 - INFO - datasets.builder - Overwrite dataset info from restored data version if exists.
2024-10-09 11:36:54 - INFO - datasets.info - Loading Dataset info from /home/zeus/.cache/huggingface/datasets/nkasmanoff___disaster_buddy_test/default/0.0.0/190e82cb8e407c53544e05bbd7d93ac0911452bf
2024-10-09 11:36:54 - INFO - datasets.builder - Found cached dataset disaster_buddy_test (/home/zeus/.cache/huggingface/datasets/nkasmanoff___disaster_buddy_test/default/0.0.0/190e82cb8e407c53544e05bbd7d93ac0911452bf)
2024-10-09 11:36:54 - INFO - datasets.info - Loading Dataset info from /home/zeus/.cache/huggingface/datasets/nkasmanoff___disaster_buddy_test/default/0.0.0/190e82cb8e407c53544e05bbd7d93ac0911452bf
2024-10-09 11:36:55 - INFO - datasets.builder - Overwrite dataset info from restored data version if exists.
2024-10-09 11:36:55 - INFO - datasets.info - Loading Dataset info from /home/zeus/.cache/huggingface/datasets/HuggingFaceTB___everyday-conversations-llama3.1-2k/default/0.0.0/451e129a6730488b7213951eb815af95f381eeea
2024-10-09 11:36:55 - INFO - datasets.builder - Found cached dataset everyday-conversations-llama3.1-2k (/home/zeus/.cache/huggingface/datasets/HuggingFaceTB___everyday-conversations-llama3.1-2k/default/0.0.0/451e129a6730488b7213951eb815af95f381eeea)
2024-10-09 11:36:55 - INFO - datasets.info - Loading Dataset info from /home/zeus/.cache/huggingface/datasets/HuggingFaceTB___everyday-conversations-llama3.1-2k/default/0.0.0/451e129a6730488b7213951eb815af95f381eeea
2024-10-09 11:36:55 - INFO - datasets.arrow_dataset - Some of the datasets have disparate format. Resetting the format of the concatenated dataset.
2024-10-09 11:36:55 - INFO - datasets.arrow_dataset - Process #0 will write at /home/zeus/.cache/huggingface/datasets/nkasmanoff___disaster_buddy_test/default/0.0.0/190e82cb8e407c53544e05bbd7d93ac0911452bf/cache-ed522dda2193ff00_00000_of_00010.arrow
2024-10-09 11:36:55 - INFO - datasets.arrow_dataset - Process #1 will write at /home/zeus/.cache/huggingface/datasets/nkasmanoff___disaster_buddy_test/default/0.0.0/190e82cb8e407c53544e05bbd7d93ac0911452bf/cache-ed522dda2193ff00_00001_of_00010.arrow
2024-10-09 11:36:55 - INFO - datasets.arrow_dataset - Process #2 will write at /home/zeus/.cache/huggingface/datasets/nkasmanoff___disaster_buddy_test/default/0.0.0/190e82cb8e407c53544e05bbd7d93ac0911452bf/cache-ed522dda2193ff00_00002_of_00010.arrow
2024-10-09 11:36:55 - INFO - datasets.arrow_dataset - Process #3 will write at /home/zeus/.cache/huggingface/datasets/nkasmanoff___disaster_buddy_test/default/0.0.0/190e82cb8e407c53544e05bbd7d93ac0911452bf/cache-ed522dda2193ff00_00003_of_00010.arrow
2024-10-09 11:36:55 - INFO - datasets.arrow_dataset - Process #4 will write at /home/zeus/.cache/huggingface/datasets/nkasmanoff___disaster_buddy_test/default/0.0.0/190e82cb8e407c53544e05bbd7d93ac0911452bf/cache-ed522dda2193ff00_00004_of_00010.arrow
2024-10-09 11:36:55 - INFO - datasets.arrow_dataset - Process #5 will write at /home/zeus/.cache/huggingface/datasets/nkasmanoff___disaster_buddy_test/default/0.0.0/190e82cb8e407c53544e05bbd7d93ac0911452bf/cache-ed522dda2193ff00_00005_of_00010.arrow
2024-10-09 11:36:55 - INFO - datasets.arrow_dataset - Process #6 will write at /home/zeus/.cache/huggingface/datasets/nkasmanoff___disaster_buddy_test/default/0.0.0/190e82cb8e407c53544e05bbd7d93ac0911452bf/cache-ed522dda2193ff00_00006_of_00010.arrow
2024-10-09 11:36:55 - INFO - datasets.arrow_dataset - Process #7 will write at /home/zeus/.cache/huggingface/datasets/nkasmanoff___disaster_buddy_test/default/0.0.0/190e82cb8e407c53544e05bbd7d93ac0911452bf/cache-ed522dda2193ff00_00007_of_00010.arrow
2024-10-09 11:36:55 - INFO - datasets.arrow_dataset - Process #8 will write at /home/zeus/.cache/huggingface/datasets/nkasmanoff___disaster_buddy_test/default/0.0.0/190e82cb8e407c53544e05bbd7d93ac0911452bf/cache-ed522dda2193ff00_00008_of_00010.arrow
2024-10-09 11:36:55 - INFO - datasets.arrow_dataset - Process #9 will write at /home/zeus/.cache/huggingface/datasets/nkasmanoff___disaster_buddy_test/default/0.0.0/190e82cb8e407c53544e05bbd7d93ac0911452bf/cache-ed522dda2193ff00_00009_of_00010.arrow
2024-10-09 11:36:55 - INFO - datasets.arrow_dataset - Spawning 10 processes
Applying chat template to train_sft (num_proc=10):   0%|          | 0/3204 [00:00<?, ? examples/s]
2024-10-09 11:36:55 - INFO - datasets.arrow_dataset - Caching processed dataset at /home/zeus/.cache/huggingface/datasets/nkasmanoff___disaster_buddy_test/default/0.0.0/190e82cb8e407c53544e05bbd7d93ac0911452bf/cache-ed522dda2193ff00_00000_of_00010.arrow
2024-10-09 11:36:55 - INFO - datasets.arrow_dataset - Caching processed dataset at /home/zeus/.cache/huggingface/datasets/nkasmanoff___disaster_buddy_test/default/0.0.0/190e82cb8e407c53544e05bbd7d93ac0911452bf/cache-ed522dda2193ff00_00001_of_00010.arrow
2024-10-09 11:36:55 - INFO - datasets.arrow_dataset - Caching processed dataset at /home/zeus/.cache/huggingface/datasets/nkasmanoff___disaster_buddy_test/default/0.0.0/190e82cb8e407c53544e05bbd7d93ac0911452bf/cache-ed522dda2193ff00_00002_of_00010.arrow
2024-10-09 11:36:55 - INFO - datasets.arrow_dataset - Caching processed dataset at /home/zeus/.cache/huggingface/datasets/nkasmanoff___disaster_buddy_test/default/0.0.0/190e82cb8e407c53544e05bbd7d93ac0911452bf/cache-ed522dda2193ff00_00005_of_00010.arrow
2024-10-09 11:36:55 - INFO - datasets.arrow_dataset - Caching processed dataset at /home/zeus/.cache/huggingface/datasets/nkasmanoff___disaster_buddy_test/default/0.0.0/190e82cb8e407c53544e05bbd7d93ac0911452bf/cache-ed522dda2193ff00_00006_of_00010.arrow
2024-10-09 11:36:55 - INFO - datasets.arrow_dataset - Caching processed dataset at /home/zeus/.cache/huggingface/datasets/nkasmanoff___disaster_buddy_test/default/0.0.0/190e82cb8e407c53544e05bbd7d93ac0911452bf/cache-ed522dda2193ff00_00007_of_00010.arrow
2024-10-09 11:36:55 - INFO - datasets.arrow_dataset - Caching processed dataset at /home/zeus/.cache/huggingface/datasets/nkasmanoff___disaster_buddy_test/default/0.0.0/190e82cb8e407c53544e05bbd7d93ac0911452bf/cache-ed522dda2193ff00_00008_of_00010.arrow
2024-10-09 11:36:55 - INFO - datasets.arrow_dataset - Caching processed dataset at /home/zeus/.cache/huggingface/datasets/nkasmanoff___disaster_buddy_test/default/0.0.0/190e82cb8e407c53544e05bbd7d93ac0911452bf/cache-ed522dda2193ff00_00009_of_00010.arrow
2024-10-09 11:36:55 - INFO - datasets.arrow_dataset - Caching processed dataset at /home/zeus/.cache/huggingface/datasets/nkasmanoff___disaster_buddy_test/default/0.0.0/190e82cb8e407c53544e05bbd7d93ac0911452bf/cache-ed522dda2193ff00_00004_of_00010.arrow
2024-10-09 11:36:55 - INFO - datasets.arrow_dataset - Caching processed dataset at /home/zeus/.cache/huggingface/datasets/nkasmanoff___disaster_buddy_test/default/0.0.0/190e82cb8e407c53544e05bbd7d93ac0911452bf/cache-ed522dda2193ff00_00003_of_00010.arrow
2024-10-09 11:36:56 - INFO - datasets.arrow_dataset - Concatenating 10 shards
2024-10-09 11:36:56 - INFO - datasets.arrow_dataset - Caching indices mapping at /home/zeus/.cache/huggingface/datasets/nkasmanoff___disaster_buddy_test/default/0.0.0/190e82cb8e407c53544e05bbd7d93ac0911452bf/cache-44d771353a4827dd.arrow


In [8]:
len(processed_train_dataset)

3204

In [9]:
processed_train_dataset[123]

<|im_start|>user
I'm going on a wilderness survival trip and I'm not sure how to find food. Can you help me?<|im_end|>
<|im_start|>assistant
I'm going on a wilderness survival trip and I'm not sure how to find food. Can you help me?
I'm going on a wilderness survival trip and I'm not sure how to find food. Can you help me?
I'm going on a wilderness survival trip and I'm not sure how to find food. Can you help me?
I'm going on a wilderness survival trip and I'm not sure how to find food. Can you help me?
I'm going on a wilderness survival trip and I'm not sure how to find food. Can you help me?
I'm going on a wilderness survival trip and I'm not sure how to find food. Can you help me?
I'm going on a wilderness survival trip and I'm not sure how to find food. Can you help me?
I'm going on a wilderness survival trip and I'm not sure how to find food. Can you help me?
I'm going on a wilderness survival trip and I'm not sure how to find food. Can you help me?
I'm going on a wilderness survi

In [None]:
model.eval();
#prompt = """What is the oort cloud?"""
prompt = "I'm going on a wilderness survival trip and I'm not sure how to find food. Can you help me?"
#prompt = f"<|im_start|>system\nYou are Pi-Card, the Raspberry Pi voice assistant.<|im_end|>\n<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"
prompt = f"<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"
input_ids = tokenizer.encode(prompt, return_tensors='pt')
input_ids = input_ids.to(model.device)
output = model.generate(input_ids, max_new_tokens=256,  do_sample=False, pad_token_id=tokenizer.eos_token_id)
output_text = tokenizer.decode(output[0], skip_special_tokens=False)
formatted_output_text = "<|im_end|>".join(output_text.split("<|im_end|>")[:2]) + "<|im_end|>"
print(formatted_output_text)
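
The split-and-join on `<|im_end|>` in the cell above trims the decoded text to a fixed number of turns, which discards anything the model keeps generating after its first reply. A small standalone sketch of the same idea, using a made-up decoded string and a hypothetical helper name:

```python
END = "<|im_end|>"


def keep_first_turns(decoded_text, n_turns):
    """Keep the first `n_turns` turns of a ChatML-style string."""
    parts = decoded_text.split(END)
    # Re-join the kept turns and restore the end marker on the last one.
    return END.join(parts[:n_turns]) + END


decoded = (
    "<|im_start|>user\nHi<|im_end|>\n"
    "<|im_start|>assistant\nHello!<|im_end|>\n"
    "<|im_start|>user\nextra text the model kept generating"
)
trimmed = keep_first_turns(decoded, 2)
print(trimmed)
```

The number of turns to keep depends on the prompt: a user-only prompt needs two (user and assistant), while a prompt that also carries a system message needs three.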

In [10]:
###########
# Training
###########

model.train();
trainer = SFTTrainer(
    model=model,
    args=train_conf,
    train_dataset=processed_train_dataset,
    max_seq_length=2048,
    dataset_text_field="text",
    tokenizer=tokenizer,
    #packing=True,
)
train_result = trainer.train()
metrics = train_result.metrics
trainer.log_metrics("train", metrics)
trainer.save_metrics("train", metrics)
trainer.save_state()

# trainer.push_to_hub()

# Evaluation and saving the model

In [None]:
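# Load the model from the most recent checkpoint

# Pick the checkpoint folder with the highest step number. A plain sorted() over
# the output directory compares names as strings and also sees non-checkpoint files.
checkpoint_names = [d for d in os.listdir(train_conf.output_dir) if d.startswith("checkpoint-")]
checkpoint_path = max(checkpoint_names, key=lambda d: int(d.split("-")[-1]))
checkpoint_path = os.path.join(train_conf.output_dir, checkpoint_path)
model_kwargs = dict(
    use_cache=False,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)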
model = AutoModelForCausalLM.from_pretrained(checkpoint_path, **model_kwargs)


In [13]:
model.eval();
prompt = """I have 45 pills. Sofie dose is 1 pill in morning and half pill at night. How long will this last"""
prompt = f"<|im_start|>system\nYou are Pi-Card, the Raspberry Pi voice assistant.<|im_end|>\n<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"
input_ids = tokenizer.encode(prompt, return_tensors='pt')
input_ids = input_ids.to(model.device)
output = model.generate(input_ids, max_new_tokens=256,  do_sample=False, pad_token_id=tokenizer.eos_token_id)
output_text = tokenizer.decode(output[0], skip_special_tokens=False)
# the prompt holds a system and a user turn, so keep three turns to include the assistant's reply
formatted_output_text = "<|im_end|>".join(output_text.split("<|im_end|>")[:3]) + "<|im_end|>"
print(formatted_output_text)

<|im_start|>user
What is the safest way to purify water in the wilderness?<|im_end|>
<|im_start|>assistant
The safest way to purify water in the wilderness is to boil it before drinking. This method kills most bacteria and viruses, making the water safe to drink.<|im_end|>


# Saving to gguf

Reference: https://github.com/ggerganov/llama.cpp/discussions/2948




In [None]:
# Start by cloning llama.cpp if not already done (uncomment the line below)

#!git clone https://github.com/ggerganov/llama.cpp.git
!pip install -r llama.cpp/requirements.txt

In [14]:
# Create gguf file

# Please note you'll need to update the checkpoint path and model names to the one you want to convert & save
!python llama.cpp/convert_hf_to_gguf.py nature-buddy/checkpoint-2005 --outfile nature-buddy-0.135b-f16.gguf --outtype f16


INFO:hf-to-gguf:Loading model: checkpoint-2005
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model part 'model.safetensors'
INFO:hf-to-gguf:token_embd.weight,           torch.bfloat16 --> F16, shape = {576, 49152}
INFO:hf-to-gguf:blk.0.attn_norm.weight,      torch.bfloat16 --> F32, shape = {576}
INFO:hf-to-gguf:blk.0.ffn_down.weight,       torch.bfloat16 --> F16, shape = {1536, 576}
INFO:hf-to-gguf:blk.0.ffn_gate.weight,       torch.bfloat16 --> F16, shape = {576, 1536}
INFO:hf-to-gguf:blk.0.ffn_up.weight,         torch.bfloat16 --> F16, shape = {576, 1536}
INFO:hf-to-gguf:blk.0.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {576}
INFO:hf-to-gguf:blk.0.attn_k.weight,         torch.bfloat16 --> F16, shape = {576, 192}
INFO:hf-to-gguf:blk.0.attn_output.weight,    torch.bfloat16 --> F16, shape = {576, 576}
INFO:hf-to-gguf:blk.0.attn_q.weight,         torch.bfloat16 --> F16, shape = {576, 576}
I
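
Since the checkpoint path and output name have to be updated for each run, one option is to assemble the conversion command from a few variables. This is only a sketch: `build_convert_command` is a hypothetical helper, the default script path assumes llama.cpp was cloned next to the notebook, and the output file naming simply copies the pattern used in the cell above.

```python
import shlex


def build_convert_command(checkpoint_dir, model_name, outtype="f16",
                          script="llama.cpp/convert_hf_to_gguf.py"):
    """Build the argument list for the Hugging Face to gguf conversion script."""
    outfile = f"{model_name}-{outtype}.gguf"
    return ["python", script, checkpoint_dir, "--outfile", outfile, "--outtype", outtype]


cmd = build_convert_command("nature-buddy/checkpoint-2005", "nature-buddy-0.135b")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`, or the printed string can be pasted after a `!` in a notebook cell.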

The quantization type chosen for the output has an outsized impact on latency and performance.

While f16 is the default and works well, the model was trained in bf16, a slightly different 16-bit format, so the bf16 outtype may be worth testing.
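
To illustrate how the two 16-bit formats differ, the sketch below rounds a Python float to each of them using only the standard library. f16 spends more bits on the mantissa (finer precision, smaller range), while bf16 keeps the float32 exponent (wider range, coarser precision). The helper names are hypothetical, and the bf16 rounding is a simplified version that ignores NaN handling.

```python
import struct


def to_f16(x):
    """Round to IEEE half precision (1 sign, 5 exponent, 10 mantissa bits)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bf16(x):
    """Round to bfloat16 (1 sign, 8 exponent, 7 mantissa bits).

    bfloat16 is the top 16 bits of a float32, so round the float32 bit
    pattern to the nearest multiple of 2**16 (ties to even).
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


small = 0.1
print(to_f16(small), to_bf16(small))  # f16 lands closer to 0.1

large = 70000.0
print(to_bf16(large))  # representable in bf16, but only coarsely
try:
    to_f16(large)
except OverflowError:
    print("70000.0 is outside the f16 range")
```

Weights trained in bf16 that fall outside the f16 range or below its smallest step may not convert exactly, which is one reason to compare both output types.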

Now that you have the gguf file, you can either work with it directly or convert it to an ollama format, which can be easier to work with in some cases.

For details on how to do this, see the instructions in the create ollama text file.