## Installing on Kaggle is slow: expect to wait about **5 minutes**.

In [1]:
%%capture
!pip install pip3-autoremove
!pip-autoremove torch torchvision torchaudio -y
!pip install torch torchvision torchaudio xformers --index-url https://download.pytorch.org/whl/cu121
!pip install unsloth

In [2]:
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
HF_TOKEN = user_secrets.get_secret("HF_TOKEN")

In [16]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",      # Llama-3.1 15 trillion tokens model 2x faster!
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",    # We also uploaded 4bit for 405b!
    "unsloth/Mistral-Nemo-Base-2407-bnb-4bit", # New Mistral 12b 2x faster!
    "unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit",
    "unsloth/mistral-7b-v0.3-bnb-4bit",        # Mistral v3 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3-mini-4k-instruct",          # Phi-3 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",            # Gemma 2x faster!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-1B-Instruct-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    token = HF_TOKEN, # use one if using gated models like meta-llama/Llama-2-7b-hf
)

==((====))==  Unsloth 2025.2.15: Fast Llama patching. Transformers: 4.49.0.
   \\   /|    GPU: Tesla T4. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/1.03G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/234 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/54.7k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]
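The `dtype = None` setting above leaves the precision choice to the library. As a rough illustration of the rule stated in the comment (float16 for T4 and V100, bfloat16 for Ampere and newer), the sketch below maps a CUDA compute capability to a dtype name. The helper `pick_dtype_name` is hypothetical and is not Unsloth's detection code; it only assumes that Ampere corresponds to compute capability 8.0 and above.

```python
def pick_dtype_name(compute_capability):
    """Return a dtype name for a (major, minor) CUDA compute capability.

    Hypothetical helper for illustration only: bfloat16 on Ampere (8.0)
    or newer, float16 on older GPUs such as T4 (7.5) or V100 (7.0).
    """
    major, _minor = compute_capability
    return "bfloat16" if major >= 8 else "float16"


# The Tesla T4 in the log above reports compute capability 7.5.
print(pick_dtype_name((7, 5)))  # float16
print(pick_dtype_name((8, 0)))  # bfloat16
```

This is consistent with the `Bfloat16 = FALSE` entry in the banner printed for the T4.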

We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [17]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2025.2.15 patched 16 layers with 16 QKV layers, 16 O layers and 16 MLP layers.
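To see where the trainable-parameter count comes from, here is a small back-of-the-envelope sketch. It assumes the publicly documented Llama-3.2-1B dimensions (hidden size 2048, MLP size 8192, 8 key/value heads of dimension 64, 16 layers); none of these values are printed by this notebook, so treat them as assumptions. Each adapted weight matrix gets two low-rank factors, adding `r * (in_features + out_features)` parameters.

```python
# Assumed Llama-3.2-1B dimensions (from the public model config, not from this notebook).
HIDDEN = 2048        # hidden_size
INTERMEDIATE = 8192  # intermediate_size of the MLP
KV_DIM = 8 * 64      # num_key_value_heads * head_dim
NUM_LAYERS = 16      # matches "patched 16 layers" in the log
RANK = 16            # r passed to get_peft_model

# (in_features, out_features) for each module listed in target_modules.
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features, out_features, rank):
    """LoRA adds A (rank x in) and B (out x rank) for each adapted matrix."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in TARGETS.values())
total = per_layer * NUM_LAYERS
print(f"{total:,}")  # 11,272,192
```

Under these assumptions the result equals the `Number of trainable parameters = 11,272,192` reported when training starts. Against a base model of roughly 1.2 billion parameters (an approximate figure), that is a little under 1%, at the low end of the range quoted above.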


## Data Preparation

In [18]:
prompt = """Below is an instruction that describes a task, paired with an optional input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(examples):
    instructions = examples["Prompt"]
    # Create an empty input field
    inputs = [""] * len(instructions)
    outputs = examples["Response"]
    texts = []
    for instruction, inp, output in zip(instructions, inputs, outputs):
        # Format the text using the prompt template and append the EOS token.
        text = prompt.format(instruction, inp, output) + EOS_TOKEN
        texts.append(text)
    return {"text": texts}
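To check what the formatting function produces without loading a tokenizer, the sketch below applies the same template to a one-row toy batch. The names `ALPACA_TEMPLATE`, `format_batch`, and `toy_batch`, and the `"</s>"` placeholder for the EOS token, are illustrative assumptions; in the notebook the EOS string comes from `tokenizer.eos_token`.

```python
ALPACA_TEMPLATE = """Below is an instruction that describes a task, paired with an optional input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS = "</s>"  # placeholder; the notebook uses tokenizer.eos_token


def format_batch(examples, eos_token=EOS):
    """Fill the template for each row, leaving the input field empty."""
    instructions = examples["Prompt"]
    outputs = examples["Response"]
    texts = [
        ALPACA_TEMPLATE.format(instruction, "", output) + eos_token
        for instruction, output in zip(instructions, outputs)
    ]
    return {"text": texts}


# Toy data, not taken from the real dataset.
toy_batch = {"Prompt": ["Summarize my sleep."], "Response": ["You slept 6 hours."]}
formatted = format_batch(toy_batch)
print(formatted["text"][0])
```

The printed text ends with the response followed immediately by the EOS marker, which is what tells the model where a completion should stop.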

In [19]:
from datasets import load_dataset
dataset = load_dataset("csv", data_files="/kaggle/input/sleep-and-fitness-dataset/PH-LLM Custom Dataset.csv")["train"]

# Map the formatting function onto the dataset.
dataset = dataset.map(formatting_prompts_func, batched=True)

In [20]:
dataset

Dataset({
    features: ['Category', 'ID', 'Prompt', 'Response', 'text'],
    num_rows: 1730
})

In [21]:
print(f'Prompt: \n{dataset[0]["Prompt"]}')
print(f'Response: \n{dataset[0]["Response"]}')

Prompt: 
You are a sleep medicine expert. You are given the following sleep data.
The user is Female, 65 years old.
Sleep Summary: 
Bedtime: 2021-03-06 01:00:00
Wakeup time: 2021-03-06 07:00:00
Sleep duration: 6.0
Sleep efficiency: 0.88
REM sleep percentage: 18
Deep sleep percentage: 70
Light sleep percentage: 12
Awakenings: 0.0
Caffeine consumption: 0.0
Alcohol consumption: 0.0
Smoking status: Yes
Exercise frequency: 3.0

List the most important insights. Identify all of the patterns of data that are likely out of the preferred range. Make sure to consider various sleep health dimensions: Routine, Sleep Quality, Alertness, Timing, Efficiency, and Duration. Add a heading for each dimension. Optionally (only do this if extremely important) add a heading called Other for anything else that doesn't fit the above categories. For Routine, consider the average bedtime, wake time, midsleep point and standard deviations of these, focus on the consistency of the routine, not timing. For Sleep Q

## Model Training

In [22]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        
        # Use num_train_epochs = 1, warmup_ratio for full training runs!
        warmup_steps = 5,
        max_steps = 60,

        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)

Applying chat template to train dataset (num_proc=2):   0%|          | 0/1730 [00:00<?, ? examples/s]

Tokenizing train dataset (num_proc=2):   0%|          | 0/1730 [00:00<?, ? examples/s]


In [23]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
3.738 GB of memory reserved.


In [24]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 1,730 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 11,272,192


Step,Training Loss
1,2.215
2,2.2647
3,2.0743
4,2.1227
5,2.0801
6,1.9535
7,1.8597
8,1.9196
9,1.8995
10,1.7222
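The banner above reports a total batch size of 8 and 60 total steps. The arithmetic below shows how much of the dataset such a run covers; all inputs are copied from the training arguments and the log, and the variable names are only illustrative.

```python
import math

num_examples = 1730    # "Num examples" in the training banner
per_device_batch = 2   # per_device_train_batch_size
grad_accum = 4         # gradient_accumulation_steps
num_devices = 1        # "Num GPUs" in the training banner
max_steps = 60

# Examples consumed per optimizer step.
effective_batch = per_device_batch * grad_accum * num_devices
# Examples seen over the whole run.
examples_seen = effective_batch * max_steps
fraction_of_epoch = examples_seen / num_examples
# Approximate number of optimizer steps needed for one full pass.
steps_per_epoch = math.ceil(num_examples / effective_batch)

print(effective_batch, examples_seen, round(fraction_of_epoch, 2), steps_per_epoch)
# 8 480 0.28 217
```

So `max_steps = 60` covers a bit over a quarter of the data. A full pass would take about 217 optimizer steps; the exact count depends on how the trainer handles the final partial batch, which is why the comment in the training arguments suggests `num_train_epochs = 1` for full runs.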


In [25]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

342.4927 seconds used for training.
5.71 minutes used for training.
Peak reserved memory = 3.738 GB.
Peak reserved memory for training = 0.0 GB.
Peak reserved memory % of max memory = 25.358 %.
Peak reserved memory for training % of max memory = 0.0 %.
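The 0.0 GB "for training" figure follows directly from the arithmetic: `torch.cuda.max_memory_reserved()` reports the peak since the process started (or since the last reset), so when the baseline reading already equals the peak, the difference is zero. One possible cause, which the log does not confirm, is that the baseline cell ran in a session where an earlier run had already reached this peak. The sketch below reproduces the calculation in plain Python; `memory_report` is a hypothetical helper and the byte counts are reconstructed from the rounded values in the output.

```python
GIB = 1024 ** 3


def memory_report(baseline_bytes, peak_bytes, total_bytes):
    """Mirror the notebook's memory statistics using raw byte counts."""
    baseline_gb = round(baseline_bytes / GIB, 3)
    peak_gb = round(peak_bytes / GIB, 3)
    total_gb = round(total_bytes / GIB, 3)
    training_gb = round(peak_gb - baseline_gb, 3)
    return {
        "peak_gb": peak_gb,
        "training_gb": training_gb,
        "peak_pct": round(peak_gb / total_gb * 100, 3),
        "training_pct": round(training_gb / total_gb * 100, 3),
    }


# Baseline equal to peak, as in the output above (values reconstructed from the log).
report = memory_report(int(3.738 * GIB), int(3.738 * GIB), int(14.741 * GIB))
print(report)

# A made-up case where training raises the peak above the baseline.
print(memory_report(2 * GIB, 5 * GIB, 16 * GIB))
```

To measure the training cost in isolation, one option is to call `torch.cuda.reset_peak_memory_stats()` just before `trainer.train()` so the peak counter starts from the current state.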


### Inference

You can also use a `TextStreamer` to stream the output, so you see the generation token by token instead of waiting for the whole response.

In [26]:
# prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    prompt.format(
        """You are a sleep medicine expert. You are given the following sleep data.
        The user is Male, 55 years old.
        Sleep Summary:
        Bedtime: 2021-06-20 00:30:00
        Wakeup time: 2021-06-20 08:00:00
        Sleep duration: 7.5
        Sleep efficiency: 0.90
        REM sleep percentage: 20
        Deep sleep percentage: 50
        Light sleep percentage: 30
        Awakenings: 1.5
        Caffeine consumption: 1.0
        Alcohol consumption: 0.0
        Smoking status: Yes
        Exercise frequency: 3.5
        
        List the most important insights. Identify all patterns that deviate from the preferred range. Consider sleep health dimensions including Routine, Sleep Quality, Alertness, Timing, Efficiency, and Duration. Provide headings for each, and optionally include an Other category if necessary. Focus on consistency of routine, quality of sleep phases, and efficiency compared to standard percentiles. Avoid generic language and ensure clarity and conciseness.
        """ , # instruction
        "", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids = inputs.input_ids, attention_mask = inputs.attention_mask,
                   streamer = text_streamer, max_new_tokens = 512, pad_token_id = tokenizer.eos_token_id)

## Routine and Sleep Quality

The following insights are important to consider for improving sleep health:

**Routine:** The bedtime (00:30) and wake time (08:00) are consistent, which is good. However, the midsleep point (04:30) is slightly later than ideal. This could be due to the timing of the user's social engagements or work schedule.

**Sleep Quality:** Deep sleep (50%) is a positive finding, but it's within the normal range. Light sleep (30%) is slightly low, which may indicate a delayed sleep phase or difficulty transitioning to deep sleep. REM sleep (20%) is within the normal range.

**Timing:** The midsleep point is later than the typical range (around 1-3 am), which may affect sleep continuity.

**Efficiency:** The sleep efficiency (90%) is excellent, indicating that the user spends a high proportion of time in bed asleep.

**Duration:** The average sleep duration (7.5 hours) is within the recommended range.

**Awakenings:** The number of awakenings (1.5) is slightly higher
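In the inference cell above, the instruction sits inside an indented triple-quoted string, so every line after the first carries leading spaces that the training prompts did not have. Whether this changes the output noticeably is not tested here; the sketch below shows one way to normalize such a string before formatting it. `normalize_instruction` is a hypothetical helper and the sample text is shortened. `textwrap.dedent` would not help in this case because the first line has no indentation.

```python
def normalize_instruction(text):
    """Strip leading/trailing whitespace from the string and from every line."""
    return "\n".join(line.strip() for line in text.strip().splitlines())


# Shortened sample with the same indentation pattern as the inference cell.
raw_instruction = """You are a sleep medicine expert. You are given the following sleep data.
        The user is Male, 55 years old.
        Sleep duration: 7.5

        List the most important insights.
        """

cleaned = normalize_instruction(raw_instruction)
print(cleaned)
```

The cleaned string can then be passed to `prompt.format(cleaned, "", "")` so the inference prompt has the same layout as the training examples.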

### Saving, loading finetuned models
To save the final model as LoRA adapters, use either Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [27]:
model.save_pretrained("PH-LLM-Llama-3.2-1B-Instruct-bnb-4bit")
tokenizer.save_pretrained("PH-LLM-Llama-3.2-1B-Instruct-bnb-4bit")

('PH-LLM-Llama-3.2-1B-Instruct-bnb-4bit/tokenizer_config.json',
 'PH-LLM-Llama-3.2-1B-Instruct-bnb-4bit/special_tokens_map.json',
 'PH-LLM-Llama-3.2-1B-Instruct-bnb-4bit/tokenizer.json')
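For the loading half of this section, a minimal sketch of reading the locally saved adapters back in is shown below. It reuses the same `FastLanguageModel.from_pretrained` and `for_inference` calls that appear earlier in this notebook; that pointing `model_name` at the adapter directory restores the base model together with the adapters is an assumption based on Unsloth's documented usage. The wrapper `load_finetuned` is hypothetical, and calling it requires a GPU session with `unsloth` installed.

```python
def load_finetuned(adapter_dir="PH-LLM-Llama-3.2-1B-Instruct-bnb-4bit",
                   max_seq_length=2048):
    """Reload the saved LoRA adapters (and their base model) for inference."""
    # Imported lazily so that defining this function does not need a GPU.
    from unsloth import FastLanguageModel

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=adapter_dir,         # directory written by save_pretrained
        max_seq_length=max_seq_length,
        dtype=None,                     # auto-detect, as in the training setup
        load_in_4bit=True,
    )
    FastLanguageModel.for_inference(model)  # switch to inference mode
    return model, tokenizer


# Example call (commented out because it needs a GPU and the saved directory):
# model, tokenizer = load_finetuned()
```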

In [28]:
model.push_to_hub("johnjehiel/PH-LLM-Llama-3.2-1B-Instruct-bnb-4bit", token=HF_TOKEN)
tokenizer.push_to_hub("johnjehiel/PH-LLM-Llama-3.2-1B-Instruct-bnb-4bit", token=HF_TOKEN)

README.md:   0%|          | 0.00/599 [00:00<?, ?B/s]

  0%|          | 0/1 [00:00<?, ?it/s]

adapter_model.safetensors:   0%|          | 0.00/45.1M [00:00<?, ?B/s]

Saved model to https://huggingface.co/johnjehiel/PH-LLM-Llama-3.2-1B-Instruct-bnb-4bit


  0%|          | 0/1 [00:00<?, ?it/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

### Save as 8-bit GGUF (Q8_0)

In [30]:
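# Local alternative: model.save_pretrained_gguf("PH-LLM-Llama-3.2-1B-Instruct-bnb-8bit-Q8_0", tokenizer, quantization_method = "q8_0")
# Q8_0 is what this run produced (see the log below); passing it explicitly avoids relying on the default.
model.push_to_hub_gguf("johnjehiel/PH-LLM-Llama-3.2-1B-Instruct-bnb-8bit-Q8_0", tokenizer, quantization_method = "q8_0", token = HF_TOKEN)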

Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 20.7 out of 31.35 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 16/16 [00:00<00:00, 46.63it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving johnjehiel/PH-LLM-Llama-3.2-1B-Instruct-bnb-8bit-Q8_0/pytorch_model.bin...
Done.
==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits might take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q8_0'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: [1] Converting model at johnjehiel/PH-LLM-Llama-3.2-1B-Instruct-bnb-8bit-Q8_0 into q8_0 GGUF format.
The output location will be /kaggle/working/johnjehiel/PH-LLM-Llama-3.2-1B-Instruct-bnb-8bit-Q8_0/unsloth.Q8_0.gguf
This might take 3 minutes...
2025-02-25 10:53:44.292551: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered

  0%|          | 0/1 [00:00<?, ?it/s]

unsloth.Q8_0.gguf:   0%|          | 0.00/1.32G [00:00<?, ?B/s]

Saved GGUF to https://huggingface.co/johnjehiel/PH-LLM-Llama-3.2-1B-Instruct-bnb-8bit-Q8_0


### Save as 16-bit GGUF (F16)

In [31]:
# model.save_pretrained_gguf("PH-LLM-Llama-3.2-1B-Instruct-bnb-16bit-GGUF", tokenizer, quantization_method = "f16")
model.push_to_hub_gguf("johnjehiel/PH-LLM-Llama-3.2-1B-Instruct-bnb-16bit-GGUF", tokenizer, quantization_method = "f16", token = HF_TOKEN)

Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 20.71 out of 31.35 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 16/16 [00:00<00:00, 44.34it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving johnjehiel/PH-LLM-Llama-3.2-1B-Instruct-bnb-16bit-GGUF/pytorch_model.bin...
Done.
==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits might take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['f16'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: [1] Converting model at johnjehiel/PH-LLM-Llama-3.2-1B-Instruct-bnb-16bit-GGUF into f16 GGUF format.
The output location will be /kaggle/working/johnjehiel/PH-LLM-Llama-3.2-1B-Instruct-bnb-16bit-GGUF/unsloth.F16.gguf
This might take 3 minutes...
2025-02-25 10:55:04.552721: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered

  0%|          | 0/1 [00:00<?, ?it/s]

unsloth.F16.gguf:   0%|          | 0.00/2.48G [00:00<?, ?B/s]

Saved GGUF to https://huggingface.co/johnjehiel/PH-LLM-Llama-3.2-1B-Instruct-bnb-16bit-GGUF
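The two uploads differ in size (1.32G for Q8_0 versus 2.48G for F16), and the ratio can be estimated from the storage formats. The sketch below assumes the usual GGUF Q8_0 layout of 32 int8 weights plus one fp16 scale per block, and it ignores tensors that are stored at other precisions, so it is an approximation rather than an exact accounting.

```python
F16_BITS_PER_WEIGHT = 16

# Assumed Q8_0 block layout: one fp16 scale (2 bytes) + 32 int8 weights.
Q8_0_BLOCK_WEIGHTS = 32
Q8_0_BLOCK_BYTES = 2 + 32
q8_0_bits_per_weight = Q8_0_BLOCK_BYTES * 8 / Q8_0_BLOCK_WEIGHTS  # 8.5

f16_size_gb = 2.48  # size of unsloth.F16.gguf from the upload log
estimated_q8_0_gb = f16_size_gb * q8_0_bits_per_weight / F16_BITS_PER_WEIGHT

print(q8_0_bits_per_weight)         # 8.5
print(round(estimated_q8_0_gb, 2))  # 1.32
```

The estimate lines up with the 1.32G reported for `unsloth.Q8_0.gguf`, which suggests nearly all of the file consists of quantized weight blocks.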
