# How to Fine-Tune Llama 3 with SFTTrainer and Unsloth
Hello everyone, today we are going to show how to fine-tune Llama 3 with SFTTrainer and Unsloth.
First, we perform a simple fine-tuning run using SFTTrainer.


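## Step 1 - Installation of PyTorch
The first step is to install the required packages. The commands below target PyTorch 2.4 with CUDA 12.1 (the `cu121-torch240` extra), which matches the `2.4.1+cu121` version printed later in this notebook.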

In [1]:
!pip install --upgrade pip
!pip install -q -U git+https://github.com/huggingface/transformers.git --quiet
!pip install trl wandb --quiet
!pip install "unsloth[cu121-torch240] @ git+https://github.com/unslothai/unsloth.git" --quiet
!pip install --no-deps xformers trl peft accelerate bitsandbytes --quiet
!pip install  --upgrade --quiet \
  "datasets>=2.21.0" \
  "evaluate==0.4.1" \
  "pillow" \
  "hyperopt" \
  "optuna" \
  "protobuf>=4.21.1"

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
kfp 2.9.0 requires protobuf<5,>=4.21.1, but you have protobuf 3.20.3 which is incompatible.
kfp-kubernetes 1.3.0 requires protobuf<5,>=4.21.1, but you have protobuf 3.20.3 which is incompatible.
kfp-pipeline-spec 0.4.0 requires protobuf<5,>=4.21.1, but you have protobuf 3.20.3 which is incompatible.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
unsloth-zoo 2024.11.8 requires protobuf<4.0.0, but you have protobuf 5.28.3 which is incompatible.
kfp 2.9.0 requires protobuf<5,>=4.21.1, but you have protobuf 5.28.3 which is incompatible.
kfp-kubernetes 1.3.0 requires protobuf<5,>=4.21.1, but you have protobuf 5.28.3 which is incompatible.
kfp-pipeline-spec 0.4.0 requires

## Step 3 - Installation of Unsloth packages

## Step 4 - Analysis of our infrastructure
Before training, it is important to know the hardware and software of the system we are running on, so that we can take full advantage of it.
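
As a minimal sketch of such an analysis, the helper below collects a few facts about the machine. The function name `describe_system` and the selection of fields are assumptions made for illustration. PyTorch is imported lazily, so the sketch also runs on a machine without it.

```python
import os
import platform
import shutil


def describe_system():
    """Collect basic facts about the machine without requiring a GPU."""
    info = {
        "python": platform.python_version(),
        "os": platform.system(),
        "cpu_count": os.cpu_count(),
        # nvidia-smi on PATH is only a hint that an NVIDIA driver is installed
        "nvidia_smi_found": shutil.which("nvidia-smi") is not None,
    }
    try:
        import torch  # optional: only used when PyTorch is installed
    except ImportError:
        info["torch_installed"] = False
        return info

    info["torch_installed"] = True
    info["torch_version"] = torch.__version__
    info["cuda_available"] = torch.cuda.is_available()
    info["gpu_count"] = torch.cuda.device_count()
    if info["cuda_available"]:
        props = torch.cuda.get_device_properties(0)
        info["gpu_name"] = props.name
        info["gpu_memory_gb"] = round(props.total_memory / 1024**3, 2)
        info["bf16_supported"] = torch.cuda.is_bf16_supported()
    return info


system_info = describe_system()
for key, value in system_info.items():
    print(f"{key}: {value}")
```

The GPU memory and bfloat16 support reported here are the two values that matter most for the batch size and precision settings chosen later.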

**What is SFTTrainer?**

`SFTTrainer` is a class from the `trl` library that implements supervised fine-tuning (SFT). It is a specialized trainer built on top of the `Trainer` class from `transformers`. It takes a pre-trained model, a dataset, and a set of hyperparameters, and fine-tunes the model on the dataset's text with the usual next-token prediction loss.

**What is the difference between SFTTrainer and Trainer?**

`SFTTrainer` is a subclass of `Trainer`, so both share the same training loop and the same `TrainingArguments`. The difference is convenience: `SFTTrainer` adds dataset handling for language-model fine-tuning, such as tokenizing a text column (`dataset_text_field`), truncating to `max_seq_length`, and optionally packing several short examples into one sequence (`packing`). Neither class decides which weights are trained. All weights are updated by default, and only a small set of adapter weights is updated when the model is wrapped with a parameter-efficient method such as LoRA, as in this tutorial.

**Key differences between SFTTrainer and Trainer**

Here is a table summarizing the key differences between `SFTTrainer` and `Trainer`:

|  | SFTTrainer | Trainer |
| --- | --- | --- |
| Purpose | Supervised fine-tuning (SFT) of language models | General-purpose training and fine-tuning |
| Weights updated | All weights by default, adapter weights only when combined with PEFT/LoRA | All weights by default, adapter weights only when given a PEFT model |
| Dataset handling | Tokenizes a text column, supports packing | Expects an already tokenized dataset |
| Relationship | Subclass of `Trainer` | Base class |
| Library | `trl` library | `transformers` library |

Features

| Feature | SFTTrainer | Trainer |
| --- | --- | --- |
| Scope | Focused on supervised fine-tuning of language models | Covers many model types and tasks |
| Customization | Inherits the `Trainer` options and adds SFT-specific ones | Full set of training options |
| Ease of use | Minimal code for text datasets | More preprocessing code required |
| Integration | Part of `trl`, built on `Trainer` | Part of Hugging Face Transformers library |
| Use cases | Instruction tuning, chat fine-tuning, quick prototyping | Any supervised training task |

Note that the `SFTTrainer` class is specifically designed for supervised fine-tuning of language models, while the `Trainer` class is a more general-purpose trainer class that can be used for a variety of training and fine-tuning tasks.

# How to Fine-Tune with Unsloth
Hello everyone, today we are going to show how to fine-tune Llama 3 with the Unsloth package.

## Step 5 - Loading packages
Once all the packages are installed, we import them.

In [2]:
import json
import torch
from datasets import load_dataset
from huggingface_hub import notebook_login
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import FastLanguageModel
print(torch.__version__)
print(torch.version.cuda)

import os
from huggingface_hub import login
login(
  token=os.environ["HF_TOKEN"],  # read the token from the HF_TOKEN environment variable; never hard-code it
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
2.4.1+cu121
12.1


## Step 6 -  Setup configuration


**Model Configuration**


In [3]:
model_config={ "model_config": {
    "base_model": "meta-llama/Meta-Llama-3-8B-Instruct",  # The base model
    "finetuned_model": "ruslanmv/Medical-Mind-Llama-3-8b-1M",  # The finetuned model
    "finetuned_name": "Medical-Mind-Llama-3-8b-v1M",
    "max_seq_length": 2048,  # The maximum sequence length
    "dtype": None,  # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
    "load_in_4bit": True,  # Load the model in 4-bit
}}

* `base_model`: specifies the pre-trained model used as the starting point for fine-tuning.
* `finetuned_model`: specifies the Hugging Face Hub repository ID of the fine-tuned model, used when pushing the model to the Hub and when loading it for inference.
* `finetuned_name`: specifies the local name of the fine-tuned model, used as the directory name when saving.
* `max_seq_length`: specifies the maximum sequence length, in tokens, that the model processes.
* `dtype`: specifies the data type to use for the model's weights and activations. `None` means auto-detection, which will choose the most suitable data type based on the hardware.
* `load_in_4bit`: specifies whether to load the model in 4-bit precision, which reduces memory usage at some cost in numerical precision.
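
To see why `load_in_4bit` matters, a back-of-the-envelope estimate of the memory taken by the weights alone can help. The sketch below assumes a round figure of 8 billion parameters and ignores quantization constants, activations, and optimizer state, so real usage will be higher.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Memory needed for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


ASSUMED_PARAMS = 8_000_000_000  # assumption: round figure for an 8B model

for bits in (16, 4):
    print(f"{bits}-bit weights: {weight_memory_gib(ASSUMED_PARAMS, bits):.2f} GiB")
```

Under these assumptions the weights need roughly 14.9 GiB in 16-bit and 3.7 GiB in 4-bit, which is what makes an 8B model fit on a single 24 GB GPU with room left for training.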


**LoRA Configuration**

In [4]:
lora_config={"lora_config": {
    "r": 16,  # The LoRA rank (common values: 8, 16, 32, 64)
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],  # The target modules
    "lora_alpha": 16,  # The alpha value for LoRA
    "lora_dropout": 0,  # The dropout value for LoRA
    "bias": "none",  # The bias for LoRA
    "use_gradient_checkpointing": True,  # Use gradient checkpointing
    "use_rslora": False,  # Use RSLora
    "use_dora": False,  # Use DoRa
    "loftq_config": None  # The LoFTQ configuration
}
}

* `r`: specifies the rank of the low-rank update matrices. A higher rank adds more trainable parameters.
* `target_modules`: specifies the modules to which LoRA should be applied.
* `lora_alpha`: specifies the scaling factor for LoRA. The low-rank update is scaled by `lora_alpha / r`.
* `lora_dropout`: specifies the dropout probability applied to the inputs of the LoRA layers during training.
* `bias`: specifies which bias terms are trained. It can be `"none"`, `"all"`, or `"lora_only"`.
* `use_gradient_checkpointing`: specifies whether to use gradient checkpointing, which reduces memory usage during training at the cost of extra computation.
* `use_rslora` and `use_dora`: specify whether to use rank-stabilized LoRA (rsLoRA) and weight-decomposed LoRA (DoRA), respectively, which are variants of LoRA.
* `loftq_config`: specifies the LoftQ configuration, which is not used in this example.
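
The effect of `r` and `target_modules` on model size can be checked by hand. Each adapted linear layer with `in` inputs and `out` outputs gains two small matrices with `r * (in + out)` weights in total. The layer shapes below are assumptions about Llama 3 8B that this tutorial does not state. With them, the count matches the 41,943,040 trainable parameters reported in the training log.

```python
# Assumed Llama 3 8B shapes (not stated in this tutorial): hidden size 4096,
# MLP size 14336, key/value projection width 1024, 32 decoder layers.
HIDDEN, MLP, KV, LAYERS = 4096, 14336, 1024, 32

# (in_features, out_features) of each targeted linear layer
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV),
    "v_proj": (HIDDEN, KV),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}


def lora_trainable_params(r, target_modules, layers=LAYERS):
    """Each adapted layer adds A (r x in) and B (out x r): r * (in + out) weights."""
    per_layer = sum(r * sum(MODULE_SHAPES[name]) for name in target_modules)
    return per_layer * layers


total = lora_trainable_params(16, list(MODULE_SHAPES))
print(f"{total:,}")
```

Because the count is linear in `r`, halving the rank to 8 halves the number of trainable parameters.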


**Training Configuration**

In [5]:
training_config={"training_config": {
    "per_device_train_batch_size": 2,  # The batch size
    "gradient_accumulation_steps": 4,  # The gradient accumulation steps
    "warmup_steps": 5,  # The warmup steps
    "max_steps": 0,  # The maximum steps (0 if the epochs are defined)
    "num_train_epochs": 1,  # The number of training epochs
    "learning_rate": 2e-4,  # The learning rate
    "fp16": not torch.cuda.is_bf16_supported(),  # The fp16
    "bf16": torch.cuda.is_bf16_supported(),  # The bf16
    "logging_steps":  1,  # The logging steps
    "optim": "adamw_8bit",  # The optimizer
    "weight_decay": 0.0,  # The weight decay
    "lr_scheduler_type": "linear",  # The learning rate scheduler
    "seed": 42,  # The seed
    "output_dir": "outputs",  # The output directory
}
}

* `per_device_train_batch_size`: specifies the batch size per device (GPU) to use for training.
* `gradient_accumulation_steps`: specifies the number of batches over which gradients are accumulated before the model weights are updated.
* `warmup_steps`: specifies the number of steps over which the learning rate ramps up from 0 to `learning_rate` at the start of training.
* `max_steps`: specifies the total number of training steps. A positive value overrides `num_train_epochs`. It is set to 0 here so that the number of epochs determines the length of training.
* `num_train_epochs`: specifies the number of epochs to train for.
* `learning_rate`: specifies the initial learning rate to use for training.
* `fp16` and `bf16`: specify whether to train with 16-bit floating-point (fp16) or bfloat16 (bf16) mixed precision. The configuration enables bf16 when the GPU supports it and falls back to fp16 otherwise.
* `logging_steps`: specifies the number of update steps between two log entries.
* `optim`: specifies the optimizer to use for training.
* `weight_decay`: specifies the weight decay rate to use for regularization.
* `lr_scheduler_type`: specifies the learning rate scheduler to use.
* `seed`: specifies the random seed to use for training.
* `output_dir`: specifies the output directory to save training results.
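
The batch size and gradient accumulation settings combine into an effective batch size, which in turn determines the number of optimizer steps. The sketch below works this out. The helper name `training_schedule` is hypothetical, and the floor division is an assumption that reproduces the 12 total steps reported in the training log for 100 examples; other library versions may round up instead.

```python
import math


def training_schedule(num_examples, per_device_batch_size, grad_accum_steps,
                      num_devices=1, num_epochs=1):
    """Return (effective batch size, total optimizer steps)."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    # Assumption: optimizer steps per epoch are computed with floor division
    steps_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return effective_batch, steps_per_epoch * num_epochs


effective_batch, total_steps = training_schedule(
    num_examples=100, per_device_batch_size=2, grad_accum_steps=4
)
print(f"Effective batch size: {effective_batch}, total steps: {total_steps}")
```

With so few steps, the 5 warmup steps in the configuration cover a large share of the run, which is worth keeping in mind when moving from the 100-row test subset to the full dataset.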

**Hugging Face Username**

In [6]:
hugging_face_username={"hugging_face_username": "ruslanmv"}


**Training Dataset**

In [7]:
training_dataset={"training_dataset": {
        "name": "ruslanmv/ai-medical-dataset", # The dataset name(huggingface/datasets)
        "split": "train",  # The dataset split
        "input_fields": ["question", "context"] ,# The input fields
        "input_field": "text",# The input field
    },
                }

**`training_dataset`**: This is the top-level key for the dataset configuration.

**`name`**: This specifies the name of the dataset. In this case, it's `ruslanmv/ai-medical-dataset`, which is a dataset hosted on the Hugging Face Hub. The format is `username/dataset_name`.

**`split`**: This specifies the split of the dataset to use for training. In this case, it's set to `"train"`, which means the model will be trained on the training split of the dataset.

**`input_fields`**: This specifies the input fields of the dataset that will be used for training. Here it is a list containing two fields: `"question"` and `"context"`. These are the columns from which the training prompt is built.

**`input_field`**: This specifies the primary input field of the dataset. In this case, it's set to `"text"`. This column holds the fully formatted prompt and is the one passed to the trainer as `dataset_text_field`.

Here's an example of what this dataset might look like:

| question | context | text |
| --- | --- | --- |
| How does COVID-19 spread? | COVID-19 is a respiratory disease... | The COVID-19 is.. |
| ... | ... | ... |

In [8]:
config = {}
config.update(hugging_face_username)
config.update(model_config)
config.update(lora_config)
config.update(training_config)
config.update(training_dataset)

In [9]:
import json
def save_config_to_json(config, file_path):
    with open(file_path, 'w') as f:
        json.dump(config, f, indent=4)
file_path = "original_config.json"
save_config_to_json(config, file_path)
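
Saving the configuration is most useful if it can be read back later. The counterpart below is a minimal sketch. The name `load_config_from_json` and the sample dictionary are assumptions for illustration, and a temporary directory is used so the example leaves no files behind.

```python
import json
import os
import tempfile


def load_config_from_json(file_path):
    """Read a configuration dictionary back from a JSON file."""
    with open(file_path) as f:
        return json.load(f)


# Small stand-in for the real config; None is stored as JSON null
sample_config = {
    "model_config": {"max_seq_length": 2048, "dtype": None, "load_in_4bit": True}
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample_config.json")
    with open(path, "w") as f:
        json.dump(sample_config, f, indent=4)
    restored = load_config_from_json(path)

print(restored == sample_config)
```

Values such as `None`, booleans, numbers, strings, and lists survive the round trip, which covers every field used in this configuration.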

In [10]:
# Loading the model and the tokenizer for the model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=config["model_config"].get("base_model"),
    max_seq_length=config["model_config"].get("max_seq_length"),
    dtype=config["model_config"].get("dtype"),
    load_in_4bit=config["model_config"].get("load_in_4bit"),
)

==((====))==  Unsloth 2024.11.10: Fast Llama patching. Transformers:4.47.0.dev0.
   \\   /|    GPU: NVIDIA A10G. Max memory: 21.975 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.4.1+cu121. CUDA: 8.6. CUDA Toolkit: 12.1. Triton: 3.0.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


In [11]:
# Setup for QLoRA/LoRA peft of the base model
model = FastLanguageModel.get_peft_model(
    model,
    r = config.get("lora_config").get("r"),
    target_modules = config.get("lora_config").get("target_modules"),
    lora_alpha = config.get("lora_config").get("lora_alpha"),
    lora_dropout = config.get("lora_config").get("lora_dropout"),
    bias = config.get("lora_config").get("bias"),
    use_gradient_checkpointing = config.get("lora_config").get("use_gradient_checkpointing"),
    random_state = 42,
    use_rslora = config.get("lora_config").get("use_rslora"),
    use_dora = config.get("lora_config").get("use_dora"),
    loftq_config = config.get("lora_config").get("loftq_config"),
)

Unsloth 2024.11.10 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


In [12]:
#from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer
#tokenizer = AutoTokenizer.from_pretrained(config.get("model_config").get("base_model"))
tokenizer.add_eos_token = True
tokenizer.pad_token_id = 0
tokenizer.padding_side = "left"

In [13]:
is_test=True
import datasets
import os
dataset_path = "train_dataset"
if os.path.exists(dataset_path):
    print("Dataset exists!")
    train_dataset = datasets.load_from_disk("train_dataset")
else:
    print("Dataset does not exist.")
    # Loading the training dataset
    train_dataset = load_dataset(config.get("training_dataset").get("name"), split = config.get("training_dataset").get("split"))    
    
    if is_test:
        # Keep only the first 100 rows for a quick test run
        train_dataset = train_dataset.select(range(100))
        
    medical_prompt = """You are an AI Medical Assistant Chatbot, trained to answer medical questions. Below is an instruction that describes a task, paired with an response context. Write a response that appropriately completes the request.
    ### Instruction:
    {}

    ### Response:
    {}"""
    EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
    def formatting_prompts_func(examples):
        instructions = examples["question"]
        outputs      = examples["context"]
        texts = []
        for instruction, output in zip(instructions,  outputs):
            # Must add EOS_TOKEN, otherwise your generation will go on forever!
            text = medical_prompt.format(instruction,  output) + EOS_TOKEN
            texts.append(text)
        return { "text" : texts, }
    train_dataset = train_dataset.map(formatting_prompts_func, batched = True,)
    # Cache the formatted dataset so later runs can load it from disk
    train_dataset.save_to_disk(dataset_path)

Dataset exists!


In [14]:
train_dataset

Dataset({
    features: ['question', 'context', 'text'],
    num_rows: 100
})
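
The `text` column is produced by a batched mapping function: it receives a dictionary of lists and returns a dictionary of lists of the same length. The toy version below shows that contract without downloading a dataset. The shortened template, the `<eos>` placeholder, and the sample rows are assumptions for illustration only.

```python
PROMPT_TEMPLATE = """### Instruction:
{}

### Response:
{}"""

EOS_TOKEN = "<eos>"  # placeholder; the notebook uses tokenizer.eos_token


def format_batch(examples):
    """Build one training string per (question, context) pair."""
    texts = [
        PROMPT_TEMPLATE.format(question, context) + EOS_TOKEN
        for question, context in zip(examples["question"], examples["context"])
    ]
    return {"text": texts}


# Made-up rows, only to demonstrate the input and output shapes
toy_batch = {
    "question": ["What is fever?", "What is a virus?"],
    "context": ["A rise in body temperature.", "An infectious agent."],
}
formatted = format_batch(toy_batch)
print(formatted["text"][0])
```

Appending the end-of-sequence token to every example is what teaches the model where an answer stops.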

In [15]:
is_multiple=True

In [16]:
if is_multiple:
    # Set up GPU acceleration
    if torch.cuda.device_count() > 1:
        print("Multiple GPUs enabled")
        devices = [f"cuda:{i}" for i in range(torch.cuda.device_count())]
        model_parallel = torch.nn.DataParallel(model, device_ids= devices ) #[0, 1]
        # Access the original model from the DataParallel object
        model = model_parallel.module
    else:
        print("No DataParallel ")
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        model.to(device)        

No DataParallel 


In [17]:
# Setting up the trainer for the model
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = train_dataset,
    dataset_text_field = config.get("training_dataset").get("input_field"),
    max_seq_length = config.get("model_config").get("max_seq_length"),
    dataset_num_proc = 2,
    packing = False,
    args = TrainingArguments(
        per_device_train_batch_size = config.get("training_config").get("per_device_train_batch_size"),
        gradient_accumulation_steps = config.get("training_config").get("gradient_accumulation_steps"),
        warmup_steps = config.get("training_config").get("warmup_steps"),
        max_steps = config.get("training_config").get("max_steps"),
        num_train_epochs= config.get("training_config").get("num_train_epochs"),
        learning_rate = config.get("training_config").get("learning_rate"),
        fp16 = config.get("training_config").get("fp16"),
        bf16 = config.get("training_config").get("bf16"),
        logging_steps = config.get("training_config").get("logging_steps"),
        optim = config.get("training_config").get("optim"),
        weight_decay = config.get("training_config").get("weight_decay"),
        lr_scheduler_type = config.get("training_config").get("lr_scheduler_type"),
        seed = 42,
        output_dir = config.get("training_config").get("output_dir"),
        
    ),
)

Map (num_proc=2):   0%|          | 0/100 [00:00<?, ? examples/s]

In [18]:
# Memory statistics before training
gpu_statistics = torch.cuda.get_device_properties(0)
reserved_memory = round(torch.cuda.max_memory_reserved() / 1024**3, 2)
max_memory = round(gpu_statistics.total_memory / 1024**3, 2)
print(f"Reserved Memory: {reserved_memory}GB")
print(f"Max Memory: {max_memory}GB")

Reserved Memory: 5.61GB
Max Memory: 21.98GB


In [19]:
##  [ 1038/2651250 53:49 < 2295:10:28, 0.32 it/s, Epoch 0.00/1] old

In [20]:
# Training the model
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 100 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 12
 "-____-"     Number of trainable parameters = 41,943,040
[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Currently logged in as: [33mliuxiangwin[0m ([33mliuxiangwin-free[0m). Use [1m`wandb login --relogin`[0m to force relogin


  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]


| Step | Training Loss |
| --- | --- |
| 1 | 3.1153 |
| 2 | 2.8277 |
| 3 | 2.4572 |
| 4 | 2.7629 |
| 5 | 2.3662 |
| 6 | 2.3002 |
| 7 | 2.0963 |
| 8 | 2.0467 |
| 9 | 2.05 |
| 10 | 1.8026 |
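
The logged losses can be summarized with a few lines of plain Python. The sketch below uses the ten values shown above. The helper name `summarize_losses` is hypothetical, and with a single short run the numbers only indicate that training is moving in the right direction.

```python
# Training losses copied from the log above (steps 1 to 10)
losses = [3.1153, 2.8277, 2.4572, 2.7629, 2.3662,
          2.3002, 2.0963, 2.0467, 2.05, 1.8026]


def summarize_losses(values):
    """Report the overall drop and the steps where the loss went up."""
    first, last = values[0], values[-1]
    return {
        "first": first,
        "last": last,
        "absolute_drop": round(first - last, 4),
        "relative_drop_pct": round((first - last) / first * 100, 2),
        # 1-based step numbers whose loss is higher than the previous step
        "steps_with_increase": [
            i + 1 for i in range(1, len(values)) if values[i] > values[i - 1]
        ],
    }


summary = summarize_losses(losses)
for key, value in summary.items():
    print(f"{key}: {value}")
```

Occasional increases between consecutive steps are expected with an effective batch size of 8, since each logged value comes from a different small batch.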


In [21]:
# Memory statistics after training
used_memory = round(torch.cuda.max_memory_allocated() / 1024**3, 2)
used_memory_lora = round(used_memory - reserved_memory, 2)
used_memory_percentage = round((used_memory / max_memory) * 100, 2)
used_memory_lora_percentage = round((used_memory_lora / max_memory) * 100, 2)
print(f"Used Memory: {used_memory}GB ({used_memory_percentage}%)")
print(f"Used Memory for training(fine-tuning) LoRA: {used_memory_lora}GB ({used_memory_lora_percentage}%)")

Used Memory: 7.2GB (32.76%)
Used Memory for training(fine-tuning) LoRA: 1.59GB (7.23%)
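
The percentages in this report are simple ratios against the total GPU memory. The sketch below reproduces the arithmetic from the numbers printed above, without needing a GPU. The function name `memory_report` is an assumption for illustration.

```python
def memory_report(peak_gb, baseline_gb, total_gb):
    """Express peak memory and the training overhead as shares of total GPU memory."""
    training_gb = round(peak_gb - baseline_gb, 2)
    return {
        "peak_pct": round(peak_gb / total_gb * 100, 2),
        "training_gb": training_gb,
        "training_pct": round(training_gb / total_gb * 100, 2),
    }


# Values taken from the notebook output: 7.2 GB peak, 5.61 GB before training,
# 21.98 GB total on the GPU
report = memory_report(peak_gb=7.2, baseline_gb=5.61, total_gb=21.98)
print(report)
```

The difference between the peak and the value measured before training is only an approximation of the fine-tuning overhead, since the two figures come from different PyTorch counters.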


In [22]:
new_model=config.get("model_config").get("finetuned_model")

In [23]:
new_model

'ruslanmv/Medical-Mind-Llama-3-8b-1M'

In [24]:
# Saving the trainer stats
with open("trainer_stats.json", "w") as f:
    # TrainOutput is a named tuple; _asdict() keeps the field names in the JSON file
    json.dump(trainer_stats._asdict(), f, indent=4)

In [25]:
## Save and push the adapter to HF
import os
current_directory = os.getcwd()
# New model name
new_model = config.get("model_config").get("finetuned_name") #"Medical-Mind-Llama-3-8b"
# Save the fine-tuned model
save_path = os.path.join(current_directory, "models", new_model)

In [26]:
#os.makedirs(save_path, exist_ok=True)  # Create directory if it doesn't exist
#trainer.model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)

('/opt/app-root/src/ft-llm-workshop-dhs2024/Module-06-DPO-OPPO/models/Medical-Mind-Llama-3-8b-v1M/tokenizer_config.json',
 '/opt/app-root/src/ft-llm-workshop-dhs2024/Module-06-DPO-OPPO/models/Medical-Mind-Llama-3-8b-v1M/special_tokens_map.json',
 '/opt/app-root/src/ft-llm-workshop-dhs2024/Module-06-DPO-OPPO/models/Medical-Mind-Llama-3-8b-v1M/tokenizer.json')

In [27]:
help(model.save_pretrained_merged)

Help on method unsloth_save_pretrained_merged in module unsloth.save:

unsloth_save_pretrained_merged(save_directory: Union[str, os.PathLike], tokenizer=None, save_method: str = 'merged_16bit', push_to_hub: bool = False, token: Union[str, bool, NoneType] = None, is_main_process: bool = True, state_dict: Optional[dict] = None, save_function: Callable = <function save at 0x7f57c1a86fc0>, max_shard_size: Union[int, str] = '5GB', safe_serialization: bool = True, variant: Optional[str] = None, save_peft_format: bool = True, tags: List[str] = None, temporary_location: str = '_unsloth_temporary_saved_buffers', maximum_memory_usage: float = 0.75) method of peft.peft_model.PeftModelForCausalLM instance
    Same as .save_pretrained(...) except 4bit weights are auto
    converted to float16 with as few overhead as possible.
    
    Choose for `save_method` to be either:
    1. `16bit`: Merge LoRA into float16 weights. Useful for GGUF / llama.cpp.
    2.  `4bit`: Merge LoRA into int4 weights. Use

To save the final model as LoRA adapters, use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

[NOTE] This ONLY saves the LoRA adapters, not the full model. The cells below show how to save merged 16-bit or 4-bit weights and how to export to GGUF.

In [28]:
# Save the model to the created directory
# `lora`: Save LoRA adapters with no merging. Useful for HF inference.
#model.save_pretrained(save_path)


In [29]:
# Saving the model using merged_16bit(float16), 
#`16bit`: Merge LoRA into float16 weights. Useful for GGUF / llama.cpp.
#model.save_pretrained_merged(save_path, tokenizer, save_method = "merged_16bit")

In [30]:
# `4bit`: Merge LoRA into int4 weights. Useful for DPO / HF inference.
model.save_pretrained_merged(save_path, tokenizer, save_method = "merged_4bit_forced")

Unsloth: Merging 4bit and LoRA weights to 4bit...
This might take 5 minutes...




Done.
Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... This might take 10 minutes for Llama-7b... Done.


In [31]:
save_path

'/opt/app-root/src/ft-llm-workshop-dhs2024/Module-06-DPO-OPPO/models/Medical-Mind-Llama-3-8b-v1M'

In [32]:
# Get the list of files in the directory
files_in_model_dir = os.listdir(save_path)
# Print the list of files
print("Files in the directory:")
for file in files_in_model_dir:
    print(file)

Files in the directory:
model-00002-of-00002.safetensors
model-00001-of-00002.safetensors
tokenizer.json
model.safetensors.index.json
special_tokens_map.json
generation_config.json
tokenizer_config.json
config.json


In [33]:
import os
from huggingface_hub import HfApi
def upload_folder(folder_path, repository_name, path_in_repo):
    api = HfApi()
    
    # Check if the repository exists, if not, create it
    repo_exists = api.repo_exists(repository_name)
    if not repo_exists:
        api.create_repo(repository_name)
        print(f"Repository '{repository_name}' created on Hugging Face Hub.")

    for root, dirs, files in os.walk(folder_path):
        for file in files:
            file_path = os.path.join(root, file)
            relative_path = os.path.relpath(file_path, folder_path)
            repo_path = os.path.join(path_in_repo, relative_path)
            api.upload_file(path_or_fileobj=file_path, repo_id=repository_name, path_in_repo=repo_path)
            print(f"{repo_path} uploaded to {repository_name}")
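
One detail of the upload loop is worth checking before running it: paths inside a Hub repository use forward slashes, while `os.path.join` uses the separator of the local operating system. The sketch below builds the list of (local path, repository path) pairs with `pathlib` so the repository paths are always POSIX-style. The helper name `list_upload_pairs` is hypothetical, and no network call is made. As an alternative to a per-file loop, `huggingface_hub` also provides `HfApi.upload_folder`, which uploads a whole directory at once.

```python
import tempfile
from pathlib import Path


def list_upload_pairs(folder_path, path_in_repo=""):
    """Return (local_path, repo_path) pairs; repo paths always use forward slashes."""
    folder = Path(folder_path)
    pairs = []
    for local_path in sorted(folder.rglob("*")):
        if local_path.is_file():
            relative = local_path.relative_to(folder).as_posix()
            prefix = path_in_repo.strip("/")
            repo_path = f"{prefix}/{relative}" if prefix else relative
            pairs.append((str(local_path), repo_path))
    return pairs


# Demonstration on a throwaway directory with made-up file names
with tempfile.TemporaryDirectory() as tmp_dir:
    (Path(tmp_dir) / "config.json").write_text("{}")
    (Path(tmp_dir) / "sub").mkdir()
    (Path(tmp_dir) / "sub" / "weights.bin").write_text("")
    demo_pairs = [repo_path for _, repo_path in list_upload_pairs(tmp_dir, "models")]

print(demo_pairs)
```

Listing the pairs first also makes it easy to review what would be uploaded before any file leaves the machine.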

In [34]:
# Define the repository name and path in the repository
repository_name = "Liu-Xiang/"+new_model
path_in_repo = ""

In [35]:
repository_name

'Liu-Xiang/Medical-Mind-Llama-3-8b-v1M'

In [36]:
# Upload the folder and its contents to the repository
upload_folder(save_path, repository_name, path_in_repo)

No files have been modified since last commit. Skipping to prevent empty commit.


model-00002-of-00002.safetensors uploaded to Liu-Xiang/Medical-Mind-Llama-3-8b-v1M


model-00001-of-00002.safetensors:   0%|          | 0.00/4.65G [00:00<?, ?B/s]

No files have been modified since last commit. Skipping to prevent empty commit.


model-00001-of-00002.safetensors uploaded to Liu-Xiang/Medical-Mind-Llama-3-8b-v1M
tokenizer.json uploaded to Liu-Xiang/Medical-Mind-Llama-3-8b-v1M
model.safetensors.index.json uploaded to Liu-Xiang/Medical-Mind-Llama-3-8b-v1M


No files have been modified since last commit. Skipping to prevent empty commit.
No files have been modified since last commit. Skipping to prevent empty commit.


special_tokens_map.json uploaded to Liu-Xiang/Medical-Mind-Llama-3-8b-v1M
generation_config.json uploaded to Liu-Xiang/Medical-Mind-Llama-3-8b-v1M


No files have been modified since last commit. Skipping to prevent empty commit.
No files have been modified since last commit. Skipping to prevent empty commit.


tokenizer_config.json uploaded to Liu-Xiang/Medical-Mind-Llama-3-8b-v1M
config.json uploaded to Liu-Xiang/Medical-Mind-Llama-3-8b-v1M


In [None]:
#help(model.push_to_hub_merged)

In [None]:
#save_path='/home/wsuser/work/models/Medical-Mind-Llama-3-8b'
#repo_id='ruslanmv/Medical-Mind-Llama-3-8b'
#commit_message="Uploading Model"
#model.push_to_hub_merged(repo_id, tokenizer, save_method = "merged_16bit",commit_message=commit_message)

In [None]:
#model.push_to_hub_merged(config.get("model_config").get("finetuned_model"), tokenizer, save_method = "merged_4bit")

In [None]:
#model.save_pretrained_gguf(config.get("model_config").get("finetuned_model"), tokenizer)
#model.push_to_hub_gguf(config.get("model_config").get("finetuned_model"), tokenizer,repository_private=True)

In [None]:
#model.save_pretrained_gguf(config.get("model_config").get("finetuned_model"), tokenizer, quantization_method = "q4_k_m")
#model.push_to_hub_gguf(config.get("model_config").get("finetuned_model"), tokenizer, quantization_method = "q4_k_m",private=True)

In [None]:
is_inference=False
if is_inference:
    from unsloth import FastLanguageModel
    import torch
    max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
    dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
    load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "ruslanmv/Medical-Mind-Llama-3-8b-1M",
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    # Using FastLanguageModel for fast inference
    FastLanguageModel.for_inference(model)
    question="This is the question: What was the main cause of the inflammatory CD4+ T cells?"
    prompt=f"<|start_header_id|>system<|end_header_id|> You are a Medical AI chatbot assistant .<|eot_id|><|start_header_id|> user <|end_header_id|>{question}<|eot_id|>"
    # Tokenizing the input and generating the output
    inputs = tokenizer([prompt], return_tensors = "pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens = 256, use_cache = True)
    answer = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]  # Get the first element from the batch

    # Split the answer at the first line break, assuming system intro and question are on separate lines
    answer_parts = answer.split("\n", 1)

    # If there are multiple parts, consider the second part as the answer
    if len(answer_parts) > 1:
      answer = answer_parts[1].strip()  # Remove leading/trailing whitespaces
    else:
      answer = ""  # If no split possible, set answer to empty string

    print(f"Answer: {answer}")    
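
The training cell formats examples with `### Instruction:` and `### Response:` markers, while the inference prompt above uses Llama 3 chat header tokens. One possible alternative, sketched here as an assumption rather than the author's method, is to prompt with the same markers used in training and to take everything after the response marker as the answer. The shortened system sentence and both helper names are hypothetical.

```python
RESPONSE_MARKER = "### Response:"


def build_inference_prompt(question):
    """Mirror the training layout and leave the response section empty."""
    return (
        "You are an AI Medical Assistant Chatbot, trained to answer medical questions.\n"
        "### Instruction:\n"
        f"{question}\n\n"
        f"{RESPONSE_MARKER}\n"
    )


def extract_answer(generated_text):
    """Return the text after the response marker, or an empty string if it is missing."""
    _, found, answer = generated_text.partition(RESPONSE_MARKER)
    return answer.strip() if found else ""


prompt = build_inference_prompt("What causes a fever?")
# Stand-in for decoded model output: the prompt followed by a made-up completion
decoded = prompt + "An infection is a common cause."
print(extract_answer(decoded))
```

Splitting on an explicit marker is less fragile than splitting on the first line break, which depends on how the decoded text happens to be laid out.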
 

# Hyperparameter search
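**Step 1: Define the Hyperparameter Search Space**

We need to define the search space for the hyperparameters we want to tune. The candidates are listed below. The search space in the next cell activates only `learning_rate`, `per_device_train_batch_size`, and `gradient_accumulation_steps`, and leaves the others commented out.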

* `learning_rate`
* `per_device_train_batch_size`
* `gradient_accumulation_steps`
* `warmup_steps`
* `num_train_epochs`
* `lora_alpha`
* `lora_dropout`

We can define the search space as follows:

In [None]:
from hyperopt import Trials, fmin, hp, tpe
# Define the search space for hyperparameters
space = {
  'learning_rate': hp.loguniform('learning_rate', -5, -1),  # Log-uniform over [exp(-5), exp(-1)], roughly 0.0067 to 0.37
  #'lora_alpha': hp.quniform('lora_alpha', 1, 32, 1),  # LoRA alpha with quantized steps
  #'lora_dropout': hp.uniform('lora_dropout', 0, 0.5),  # LoRA dropout rate

  'per_device_train_batch_size': hp.quniform('per_device_train_batch_size', 2, 16, q=1),  
  'gradient_accumulation_steps': hp.quniform('gradient_accumulation_steps', 1, 8, 1),  # Added for exploration
  # Uncomment these if you want to tune other hyperparameters
  # 'warmup_steps': hp.quniform('warmup_steps', 0, 1000, 1),
  # 'num_train_epochs': hp.quniform('num_train_epochs', 1, 5, 1),    

}
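
The bounds passed to `hp.loguniform` are exponents, not learning rates: the sampled value is `exp(uniform(low, high))`. The standard-library sketch below makes the resulting range explicit. The alternative bounds at the end are an illustrative assumption, chosen so that the range brackets the baseline learning rate of 2e-4 used earlier.

```python
import math
import random


def loguniform_bounds(low, high):
    """Range of values produced by exp(uniform(low, high))."""
    return math.exp(low), math.exp(high)


def sample_loguniform(low, high, rng):
    """Draw one value the same way a log-uniform distribution is defined."""
    return math.exp(rng.uniform(low, high))


lo_lr, hi_lr = loguniform_bounds(-5, -1)
print(f"Range for (-5, -1): {lo_lr:.4f} to {hi_lr:.4f}")

rng = random.Random(0)
samples = [sample_loguniform(-5, -1, rng) for _ in range(1000)]
print(f"Smallest sample: {min(samples):.4f}, largest sample: {max(samples):.4f}")

# Illustrative alternative: bounds expressed through math.log so that the
# range runs from 1e-5 to 1e-3 and includes the baseline of 2e-4
alt_low, alt_high = math.log(1e-5), math.log(1e-3)
alt_lo_lr, alt_hi_lr = loguniform_bounds(alt_low, alt_high)
print(f"Range for log bounds: {alt_lo_lr:.0e} to {alt_hi_lr:.0e}")
```

With bounds of -5 and -1, every sampled learning rate is above 0.0067, which is far larger than the 2e-4 used in the baseline run, so it may be worth narrowing the range before a longer search.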

**Step 2: Define the Objective Function**

The objective function is a function that takes in the hyperparameters, sets them in the `config` dictionary, trains the model, and returns the loss or metric to minimize. We need to modify the previous fine-tuning code to define the objective function.

In [None]:
def objective(params):
  # Set hyperparameters in the config dictionary (assuming it's defined elsewhere)
  config['training_config']['learning_rate'] = params['learning_rate']
 # config['lora_config']['lora_alpha'] = params['lora_alpha']
 # config['lora_config']['lora_dropout'] = params['lora_dropout']   
  # quniform samples are floats, so cast them to int before storing them
  config['training_config']['per_device_train_batch_size'] = int(params['per_device_train_batch_size'])
  config['training_config']['gradient_accumulation_steps'] = int(params['gradient_accumulation_steps'])
  # ... Set other hyperparameters from params dictionary ...   
  #config['training_config']['warmup_steps'] = params['warmup_steps']
  #config['training_config']['num_train_epochs'] = params['num_train_epochs']

  # Load the model and tokenizer (assuming these are defined elsewhere)
  try:
      model, tokenizer = FastLanguageModel.from_pretrained(
          model_name=config.get("model_config").get("base_model"),
          max_seq_length=config.get("model_config").get("max_seq_length"),
          dtype=config.get("model_config").get("dtype"),
          load_in_4bit=config.get("model_config").get("load_in_4bit"),
      )
  except Exception as e:
      print(f"Error loading model and tokenizer: {e}")
      return float("inf")  # Return high value for errors

  # Setup LoRA for the model (assuming FastLanguageModel supports LoRA)
  try:
      model = FastLanguageModel.get_peft_model(
          model,
          r=config.get("lora_config").get("r"),
          target_modules=config.get("lora_config").get("target_modules"),
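          # Fall back to the configured values when these are not part of the search space
          lora_alpha=params.get('lora_alpha', config.get("lora_config").get("lora_alpha")),
          lora_dropout=params.get('lora_dropout', config.get("lora_config").get("lora_dropout")),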
          bias=config.get("lora_config").get("bias"),
          use_gradient_checkpointing=config.get("lora_config").get("use_gradient_checkpointing"),
          random_state=42,
          use_rslora=config.get("lora_config").get("use_rslora"),
          use_dora=config.get("lora_config").get("use_dora"),
          loftq_config=config.get("lora_config").get("loftq_config")
      )
  except Exception as e:
      print(f"Error setting up LoRA: {e}")
      return float("inf")  # Return high value for errors
  # Train the model on the test dataset (assuming SFTTrainer and training arguments are defined)
  try:
      trainer = SFTTrainer(
          model=model,
          tokenizer=tokenizer,
          train_dataset=train_dataset,
          dataset_text_field=config.get("training_dataset").get("input_field"),
          max_seq_length=config.get("model_config").get("max_seq_length"),
          dataset_num_proc=2,
          packing=False,
          args=TrainingArguments(
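              per_device_train_batch_size=int(params['per_device_train_batch_size']),
              gradient_accumulation_steps=int(params['gradient_accumulation_steps']),
              # Use the configured value when a hyperparameter is not in the search space
              warmup_steps=int(params.get('warmup_steps', config.get("training_config").get("warmup_steps"))),
              max_steps=config.get("training_config").get("max_steps"),
              num_train_epochs=params.get('num_train_epochs', config.get("training_config").get("num_train_epochs")),
              learning_rate=params['learning_rate'],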
              fp16=config.get("training_config").get("fp16"),
              bf16=config.get("training_config").get("bf16"),
              logging_steps=config.get("training_config").get("logging_steps"),
              optim=config.get("training_config").get("optim"),
              weight_decay=config.get("training_config").get("weight_decay"),
              lr_scheduler_type=config.get("training_config").get("lr_scheduler_type"),
              seed=42,
              output_dir=config.get("training_config").get("output_dir")
          )
      )
      trainer_stats = trainer.train()
      return trainer_stats.training_loss  # TrainOutput stores the final loss as training_loss
  except Exception as e:
      print(f"Error during training: {e}")
      return float("inf")  # Return high value for failed trials



**Step 3: Perform Hyperparameter Search**

Now that we have defined the objective function, we can perform the hyperparameter search using Hyperopt's `fmin` function. We need to specify the objective function, the search space, and the maximum number of evaluations.

In [None]:

# Create a Trials object to track hyperparameter evaluations
trials = Trials()
# Run hyperparameter optimization using TPE algorithm
best = fmin(objective, space, algo=tpe.suggest, trials=trials, max_evals=2)
# Print the best hyperparameters found during optimization
print("Best Hyperparameters:", best)

==((====))==  Unsloth 2024.11.10: Fast Llama patching. Transformers:4.47.0.dev0.
   \\   /|    GPU: NVIDIA A10G. Max memory: 21.975 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.4.1+cu121. CUDA: 8.6. CUDA Toolkit: 12.1. Triton: 3.0.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Error setting up LoRA: 'lora_alpha'                  
==((====))==  Unsloth 2024.11.10: Fast Llama patching. Transformers:4.47.0.dev0.
   \\   /|    GPU: NVIDIA A10G. Max memory: 21.975 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.4.1+cu121. CUDA: 8.6. CUDA Toolkit: 12.1. Triton: 3.0.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Error setting up LoRA:
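
The message `Error setting up LoRA: 'lora_alpha'` is how Python prints a `KeyError`. The most likely cause is that the objective reads `params['lora_alpha']` while `params` only contains the keys that were sampled from the search space. A minimal sketch of one way around this is shown below; `pick` is a hypothetical helper (not part of Hyperopt or Unsloth), and the dictionary values are illustrative only.

```python
# Hypothetical helper: use the sampled value when the key is part of the
# search, otherwise fall back to the value stored in the config section.
def pick(params, config_section, key):
    if key in params:
        return params[key]
    return config_section.get(key)

# Illustrative values only; they do not come from the notebook's config
sampled = {"learning_rate": 2e-4}
lora_config = {"lora_alpha": 16, "lora_dropout": 0.0}

print(pick(sampled, lora_config, "lora_alpha"))                 # taken from the config
print(pick(sampled, {"learning_rate": 1e-4}, "learning_rate"))  # taken from the search
```

With such a helper, `lora_alpha=pick(params, config["lora_config"], "lora_alpha")` would work whether or not `lora_alpha` is being tuned.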

In [None]:
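import gc
import torch

def reset_gpu_memory():
    # Collect unreferenced Python objects first so their tensors can be released
    gc.collect()
    # Return cached, unused blocks to the GPU driver
    torch.cuda.empty_cache()
    print("GPU memory cleared!")

# Example usage:
reset_gpu_memory()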

GPU memory cleared!


## Full code version

In [None]:
# Fixed version
from hyperopt import Trials, fmin, hp, tpe
# Define the search space for hyperparameters
space = {
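  'learning_rate'              : hp.loguniform('learning_rate', -5, -1),  # Bounds are natural logs: samples fall between e^-5 (about 6.7e-3) and e^-1 (about 0.37)
  'per_device_train_batch_size': hp.quniform('per_device_train_batch_size', 2, 16, 1),  # quniform returns floats such as 7.0
  'gradient_accumulation_steps': hp.quniform('gradient_accumulation_steps', 1, 8, 1),  # also a float; cast to int before use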
  # Uncomment these if you want to tune them
  #'lora_alpha'                : hp.quniform('lora_alpha', 1, 32, 1),  # LoRA alpha with quantized steps
  #'lora_dropout'              : hp.uniform('lora_dropout', 0, 0.5),  # LoRA dropout rate
  # 'warmup_steps'             : hp.quniform('warmup_steps', 0, 1000, 1),
  # 'num_train_epochs'         : hp.quniform('num_train_epochs', 1, 5, 1),
}
def objective(params):
    # Set hyperparameters in the config dictionary (assuming it's defined elsewhere)
    config['training_config']['learning_rate']=params['learning_rate']
    config['training_config']['per_device_train_batch_size'] = params['per_device_train_batch_size']
    config['training_config']['gradient_accumulation_steps'] = params['gradient_accumulation_steps']    
    # config['lora_config']['lora_alpha'] = params['lora_alpha']
    # config['lora_config']['lora_dropout'] = params['lora_dropout']
    # ... Set other hyperparameters from params dictionary ...
    #config['training_config']['warmup_steps'] = params['warmup_steps']
    #config['training_config']['num_train_epochs'] = params['num_train_epochs']
    # Load the model and tokenizer (assuming these are defined elsewhere)    
    try:
        model, tokenizer = FastLanguageModel.from_pretrained(
            model_name     = config.get("model_config").get("base_model"),
            max_seq_length = config.get("model_config").get("max_seq_length"),
            dtype          = config.get("model_config").get("dtype"),
            load_in_4bit   = config.get("model_config").get("load_in_4bit")
        )
    except Exception as e:
        print(f"Error loading model and tokenizer: {e}")
        return float("inf")  # Return high value for errors

    # Setup LoRA for the model (assuming FastLanguageModel supports LoRA)
    try:
        model = FastLanguageModel.get_peft_model(
        model,
        r                            = config.get("lora_config").get("r"),
        target_modules               = config.get("lora_config").get("target_modules"),
        lora_alpha                   = config.get("lora_config").get('lora_alpha'), #params['lora_alpha'],
        lora_dropout                 = config.get("lora_config").get('lora_dropout'),#params['lora_dropout'],
        bias                         = config.get("lora_config").get("bias"),
        use_gradient_checkpointing   = config.get("lora_config").get("use_gradient_checkpointing"),
        random_state                 = 42,
        use_rslora                   = config.get("lora_config").get("use_rslora"),
        use_dora                     = config.get("lora_config").get("use_dora"),
        loftq_config                 = config.get("lora_config").get("loftq_config")
        )
    except Exception as e:
        print(f"Error setting up LoRA: {e}")
        return float("inf")  # Return high value for errors
    # Train the model on the training dataset (assuming SFTTrainer and TrainingArguments are imported)
    try:
        trainer = SFTTrainer(
              model=model,
              tokenizer            =  tokenizer,
              train_dataset        = train_dataset,
              dataset_text_field   = config.get("training_dataset").get("input_field"),
              max_seq_length       = config.get("model_config").get("max_seq_length"),
              dataset_num_proc     = 2,
              packing              = False,
              args=TrainingArguments(
                  per_device_train_batch_size = int(params['per_device_train_batch_size']), #config.get("training_config").get('per_device_train_batch_size'),
                  gradient_accumulation_steps = params['gradient_accumulation_steps'], #config.get("training_config").get('gradient_accumulation_steps'),
                  warmup_steps                = config.get("training_config").get('warmup_steps'),#params['warmup_steps'],
                  max_steps                   = config.get("training_config").get("max_steps"),
                  num_train_epochs            = config.get("training_config").get('num_train_epochs'),#params['num_train_epochs'],
                  learning_rate               = params['learning_rate'],
                  fp16                        = config.get("training_config").get("fp16"),
                  bf16                        = config.get("training_config").get("bf16"),
                  logging_steps               = config.get("training_config").get("logging_steps"),
                  optim                       = config.get("training_config").get("optim"),
                  weight_decay                = config.get("training_config").get("weight_decay"),
                  lr_scheduler_type           = config.get("training_config").get("lr_scheduler_type"),
                  seed                        = 42,
                  output_dir                  = config.get("training_config").get("output_dir")
              )
          )
        trainer_stats = trainer.train()
        return trainer_stats.training_loss  # TrainOutput stores the final loss as training_loss
    except Exception as e:
        print(f"Error during training: {e}")
        return float("inf")  # Return high value for failed trials    
# Create a Trials object to track hyperparameter evaluations
trials = Trials()
# Run hyperparameter optimization using TPE algorithm
best = fmin(objective, space, algo=tpe.suggest, trials=trials, max_evals=2)
# Print the best hyperparameters found during optimization
print("Best Hyperparameters:", best)   

==((====))==  Unsloth 2024.11.10: Fast Llama patching. Transformers:4.47.0.dev0.
   \\   /|    GPU: NVIDIA A10G. Max memory: 21.975 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.4.1+cu121. CUDA: 8.6. CUDA Toolkit: 12.1. Triton: 3.0.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
  0%|          | 0/2 [00:00<?, ?trial/s, best loss=?]

Map (num_proc=2):   0%|          | 0/100 [00:00<?, ? examples/s]

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 100 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 7 | Gradient Accumulation steps = 4.0
\        /    Total batch size = 28.0 | Total steps = 3
 "-____-"     Number of trainable parameters = 41,943,040


Error during training: 'float' object cannot be interpreted as an integer
==((====))==  Unsloth 2024.11.10: Fast Llama patching. Transformers:4.47.0.dev0.
   \\   /|    GPU: NVIDIA A10G. Max memory: 21.975 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.4.1+cu121. CUDA: 8.6. CUDA Toolkit: 12.1. Triton: 3.0.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
 50%|█████     | 1/2 [00:12<00:12, 12.06s/trial, best loss: inf]

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 100 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 7 | Gradient Accumulation steps = 6.0
\        /    Total batch size = 42.0 | Total steps = 2
 "-____-"     Number of trainable parameters = 41,943,040


Error during training: 'float' object cannot be interpreted as an integer
100%|██████████| 2/2 [00:23<00:00, 11.58s/trial, best loss: inf]
Best Hyperparameters: {'gradient_accumulation_steps': 4.0, 'learning_rate': 0.008165034156541413, 'per_device_train_batch_size': 7.0}
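
Both trials stop with `'float' object cannot be interpreted as an integer`, so the reported best loss stays at `inf`. A likely cause is that `hp.quniform` returns floats (the log shows `Gradient Accumulation steps = 4.0`), while only `per_device_train_batch_size` is wrapped in `int(...)`. The sketch below shows one possible fix; `cast_integer_params` and `INTEGER_PARAMS` are hypothetical names introduced here for illustration.

```python
# Hypothetical helper: hp.quniform returns floats such as 4.0, so the
# integer-valued hyperparameters are cast before they reach TrainingArguments.
INTEGER_PARAMS = ("per_device_train_batch_size", "gradient_accumulation_steps")

def cast_integer_params(params, integer_keys=INTEGER_PARAMS):
    """Return a copy of params with the listed keys converted to int."""
    cleaned = dict(params)
    for key in integer_keys:
        if key in cleaned:
            cleaned[key] = int(cleaned[key])
    return cleaned

# Values copied from the "Best Hyperparameters" line above
sampled = {
    "gradient_accumulation_steps": 4.0,
    "learning_rate": 0.008165034156541413,
    "per_device_train_batch_size": 7.0,
}
print(cast_integer_params(sampled))
```

Calling `params = cast_integer_params(params)` as the first line of `objective` would make every later use of these two values an integer.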


## Analyzing Hyperparameters:

* **Batch Size**: A larger batch size usually speeds up training because it uses the hardware more efficiently. The benefit has a limit: beyond a certain size the gains flatten out or the batch no longer fits in GPU memory. Tune the batch size over a small set of values (e.g., 2, 4, 8, 16) to see its impact.

* **Learning Rate**: A higher learning rate can accelerate training initially, but a value that is too high makes training unstable and can slow convergence. Explore a range of learning rates (e.g., a log-uniform distribution between 1e-5 and 1e-3).

* **Gradient Accumulation Steps**: This technique accumulates gradients over multiple batches before updating the model weights, so the effective batch size is the per-device batch size multiplied by the number of accumulation steps. It reduces memory requirements compared with one large batch, but each weight update takes longer. Experiment with different accumulation steps (e.g., 1, 2, 4) to find a balance.

* **Optimizer Choice**: Depending on the model and dataset, some optimizers, such as Adam or SGD with momentum, converge faster than others. Explore different optimizers and their hyperparameters (e.g., the momentum coefficient) to see whether they lead to faster convergence.
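
Two of the quantities in the list above can be checked with plain arithmetic. `hp.loguniform(label, low, high)` draws `exp(uniform(low, high))`, so its bounds have to be given as natural logarithms of the desired learning rates. The sketch below uses only the standard library; `loguniform_bounds` and `effective_batch_size` are hypothetical helper names, not Hyperopt or Transformers functions.

```python
import math

def loguniform_bounds(min_value, max_value):
    """Natural-log bounds to pass as low/high to hp.loguniform."""
    return math.log(min_value), math.log(max_value)

def effective_batch_size(per_device_batch_size, accumulation_steps, num_gpus=1):
    """Number of examples that contribute to one weight update."""
    return per_device_batch_size * accumulation_steps * num_gpus

# Bounds that correspond to learning rates between 1e-5 and 1e-3
low, high = loguniform_bounds(1e-5, 1e-3)
print(f"low={low:.2f}, high={high:.2f}")

# For comparison: the range that bounds of -5 and -1 actually cover
print(f"exp(-5)={math.exp(-5):.4f}, exp(-1)={math.exp(-1):.4f}")

# Batch size 7 with 4 accumulation steps on one GPU gives 28,
# which matches "Total batch size = 28.0" in the training log
print(effective_batch_size(7, 4))
```

Under these assumptions, the search space entry for the 1e-5 to 1e-3 range would read `hp.loguniform('learning_rate', low, high)`.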

## Additional Considerations:

* **Early Stopping**: Implement early stopping to automatically terminate training if the validation loss doesn't improve for a certain number of epochs. This can save training time if the model starts overfitting.

* **Warmup Steps**: A gradual increase in the learning rate during the initial training phase (warmup steps) can improve stability and potentially accelerate convergence compared to a fixed learning rate from the beginning.
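
To make the two ideas concrete, here is a minimal sketch in plain Python. It is not the implementation used by the Transformers library; `warmup_factor` and `EarlyStopper` are hypothetical names, and the learning rate and loss values are made up for illustration.

```python
def warmup_factor(step, warmup_steps):
    """Linear warmup: multiplier applied to the base learning rate at a given step."""
    if warmup_steps <= 0:
        return 1.0
    return min(1.0, step / warmup_steps)

class EarlyStopper:
    """Signal a stop after `patience` evaluations without improvement."""

    def __init__(self, patience=2, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.bad_evals = 0

    def should_stop(self, val_loss):
        if val_loss < self.best_loss - self.min_delta:
            # Improvement: remember it and reset the counter
            self.best_loss = val_loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience

base_lr = 2e-4  # illustrative value
for step in (0, 5, 10, 20):
    print(step, base_lr * warmup_factor(step, warmup_steps=10))

stopper = EarlyStopper(patience=2)
for evaluation, loss in enumerate([1.20, 0.95, 0.97, 0.99, 0.90]):  # made-up losses
    if stopper.should_stop(loss):
        print(f"stop after evaluation {evaluation}")
        break
```

In this run the stopper fires after two consecutive evaluations that fail to beat the best loss of 0.95, so the final value in the list is never reached.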


**Experimentation and Profiling:**

The best hyperparameters for faster training depend on your specific model, dataset, and hardware. You'll need to experiment with different configurations using tools like Hyperopt to find the optimal settings.

Consider using profiling tools to identify bottlenecks in your training pipeline. This can help you focus on optimizing specific parts of the training process that are most time-consuming.

By analyzing these hyperparameters and implementing techniques like early stopping and warmup steps, you can potentially achieve faster fine-tuning while maintaining good model performance.
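
Before reaching for a full profiler, a coarse timing of each pipeline stage is often enough to see where the time goes. The sketch below uses only the standard library; `timed` and `timings` are hypothetical names, and the `time.sleep` calls are stand-ins for the real data loading and training steps.

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(label):
    """Add the wall-clock duration of the enclosed block to timings[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = timings.get(label, 0.0) + time.perf_counter() - start

with timed("data_loading"):
    time.sleep(0.01)  # stand-in for loading and tokenizing the dataset

with timed("training"):
    time.sleep(0.02)  # stand-in for trainer.train()

# Print the stages from slowest to fastest
for label, seconds in sorted(timings.items(), key=lambda item: -item[1]):
    print(f"{label}: {seconds:.3f}s")
```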

## Method 1 - Optuna

In [None]:
from optuna import create_study
import time  # time.time() is used to measure the training time

# Define search space with additional hyperparameter
search_space = {
  "learning_rate": [1e-5, 5e-5, 1e-4, 2e-4],
  "per_device_train_batch_size": [2, 4, 8],
  "lora_alpha": [8, 16, 32],
  "gradient_accumulation_steps": [1, 2, 4, 8],  # Added gradient accumulation steps
}

def objective(trial):
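  # Sample hyperparameters for this trial. Only the first and last entries of each list in search_space are used, as the lower and upper bounds of a range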
  config["training_config"]["learning_rate"] = trial.suggest_float("learning_rate", search_space["learning_rate"][0], search_space["learning_rate"][-1])
  config["training_config"]["per_device_train_batch_size"] = trial.suggest_int("per_device_train_batch_size", search_space["per_device_train_batch_size"][0], search_space["per_device_train_batch_size"][-1])
  config["training_config"]["gradient_accumulation_steps"] = trial.suggest_int("gradient_accumulation_steps", search_space["gradient_accumulation_steps"][0], search_space["gradient_accumulation_steps"][-1])
  config["lora_config"]["lora_alpha"] = trial.suggest_int("lora_alpha", search_space["lora_alpha"][0], search_space["lora_alpha"][-1])

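  # Train the model. trainer_test is created outside this function, so it has to be rebuilt from the updated config for the sampled values to take effect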
  start_time = time.time()
  try:
      trainer_stats = trainer_test.train()
      training_time = time.time() - start_time
      return training_time  # Minimize training time
  except Exception as e:
      return float("inf")  # Assign a high value if training fails

study = create_study(direction="minimize")
study.optimize(objective, n_trials=2)  # Adjust the number of trials

# Access the best trial and its hyperparameters after optimization
best_trial = study.best_trial
best_params = best_trial.params

print("Best Trial:", best_trial.number)
print("Best Hyperparameters (Likely Fastest):", best_params)
print("Best Training Time:", best_trial.value, "seconds")

[I 2024-11-27 05:12:11,220] A new study created in memory with name: no-name-34c4ae10-6a03-431c-92be-e443b4a0f7e5
[I 2024-11-27 05:12:11,222] Trial 0 finished with value: inf and parameters: {'learning_rate': 0.00019489985443530553, 'per_device_train_batch_size': 8, 'gradient_accumulation_steps': 4, 'lora_alpha': 30}. Best is trial 0 with value: inf.
[I 2024-11-27 05:12:11,224] Trial 1 finished with value: inf and parameters: {'learning_rate': 6.412343953715776e-05, 'per_device_train_batch_size': 7, 'gradient_accumulation_steps': 3, 'lora_alpha': 23}. Best is trial 0 with value: inf.


Best Trial: 0
Best Hyperparameters (Likely Fastest): {'learning_rate': 0.00019489985443530553, 'per_device_train_batch_size': 8, 'gradient_accumulation_steps': 4, 'lora_alpha': 30}
Best Training Time: inf seconds
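
Both trials finish within a few milliseconds with value `inf`, which suggests that `trainer_test.train()` raised an exception immediately and the `except` branch hid the reason. A minimal sketch of a wrapper that reports the failure before returning `inf` is shown below; `report_failures` and `failing_objective` are hypothetical names, and the raised error is only a stand-in.

```python
import functools
import traceback

def report_failures(objective_fn):
    """Wrap an objective so failed trials print their traceback before returning inf."""
    @functools.wraps(objective_fn)
    def wrapped(trial):
        try:
            return objective_fn(trial)
        except Exception:
            traceback.print_exc()  # show why the trial failed
            return float("inf")
    return wrapped

@report_failures
def failing_objective(trial):
    # Stand-in for a training call that fails
    raise RuntimeError("training failed")

print(failing_objective(None))
```

The wrapped function can be passed to the study in place of the original, for example `study.optimize(report_failures(objective), n_trials=2)`. If the lists in `search_space` are meant to be the only allowed values, `trial.suggest_categorical("lora_alpha", search_space["lora_alpha"])` samples from them directly instead of from the range between the first and last entry.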
