# Fine-Tuning of LLMs with Hugging Face

## Step 1: Installing and importing the libraries for Hugging Face

In [6]:
!pip uninstall -y pyarrow


Found existing installation: pyarrow 17.0.0
Uninstalling pyarrow-17.0.0:
  Successfully uninstalled pyarrow-17.0.0


In [7]:
!pip install pyarrow==15.0.0


Collecting pyarrow==15.0.0
  Downloading pyarrow-15.0.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.0 kB)
Downloading pyarrow-15.0.0-cp310-cp310-manylinux_2_28_x86_64.whl (38.3 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m38.3/38.3 MB[0m [31m49.6 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: pyarrow
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cudf-cu12 24.4.1 requires pyarrow<15.0.0a0,>=14.0.1, but you have pyarrow 15.0.0 which is incompatible.[0m[31m
[0mSuccessfully installed pyarrow-15.0.0


In [8]:
!pip install accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 transformers==4.31.0 trl==0.4.7




In [9]:
!pip install huggingface_hub



In [10]:
import os
import torch
from trl import SFTTrainer
from datasets import load_dataset
from peft import LoraConfig, PeftModel
from transformers import (AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, HfArgumentParser, TrainingArguments, pipeline, logging)

## Step 2: Setting up links to Hugging Face datasets and models

In [11]:
# Specify the identifier for the pre-trained model (LLaMA-2 fine-tuned version).
# This model is hosted on Hugging Face and will be used for further fine-tuning or inference.
model_identifier = "aboonaji/llama2finetune-v2"

# Specify the dataset containing medical terms, sourced from Wikipedia or similar.
# This dataset is also hosted on Hugging Face and is intended for use in model training.
source_dataset = "gamino/wiki_medical_terms"

# Specify the formatted version of the dataset, which has been preprocessed to suit the input requirements of LLaMA-2.
# Preprocessing might include tokenization, data cleaning, or aligning with the model's expected input format.
formatted_dataset = "aboonaji/wiki_medical_terms_llam2_format"


## Step 3: Setting up all the QLoRA hyperparameters for fine-tuning

In [12]:
# Define the value for 'r', which could represent a low-rank approximation dimension
# in methods like LoRA (Low-Rank Adaptation). This value (64) sets the dimension for fine-tuning layers.
lower_hyper_r = 64

# Define the scaling factor 'alpha', which controls the learning rate scaling for adaptation.
# A lower value (16) means smaller updates during training, making fine-tuning more stable.
lower_hyper_alpha = 16

# Set the dropout rate to 0.1, which indicates a 10% probability of dropping units during training.
# Dropout is used to prevent overfitting by randomly deactivating some neurons during the training phase.
lower_hyper_dropout = 0.1


## Step 4: Setting up all the bitsandbytes hyperparameters for fine-tuning

In [13]:
# Enable 4-bit quantization, which reduces model size and speeds up inference while maintaining reasonable accuracy.
# When set to True, the model will use 4-bit precision during computations.
enable_4bit = True

# Specify the computation data type for bitsandbytes (bnb), a library used for memory-efficient model training.
# 'float16' reduces memory usage compared to 32-bit floats, speeding up computations while preserving most of the model's performance.
compute_dtype_bnb = 'float16'

# Define the quantization type used by bitsandbytes.
# 'nf4' stands for NormalFloat4, which is a specific format used for more efficient 4-bit quantization.
quant_type_bnb = 'nf4'

# This flag determines whether to use double quantization. Setting it to False means
# double quantization (i.e., applying two stages of quantization) will be disabled, which can save some computation time.
double_quant_flag = False


## Step 5: Setting up all the training arguments hyperparameters for fine-tuning

In [14]:
# Directory where training results, such as checkpoints and logs, will be saved.
results_dir = "./results"

# The number of complete passes through the entire training dataset during the training process.
epochs_count = 10

# Flag to enable/disable FP16 (16-bit floating point precision).
# Set to False to avoid using FP16 precision.
enable_fp16 = False

# Flag to enable/disable BF16 (16-bit Brain Floating Point precision).
# BF16 is often used for faster training with better precision compared to FP16.
enable_bf16 = False

# The number of samples processed at a time during training in each batch.
# Smaller batches help fit models into memory but may require gradient accumulation.
train_batch_size = 4

# The number of samples processed at a time during evaluation or validation.
# This is usually the same as the training batch size for consistency.
eval_batch_size = 4

# Number of gradient accumulation steps before updating model parameters.
# Accumulation allows for larger effective batch sizes without increasing memory usage.
accumulation_steps = 1

# Flag to enable/disable checkpointing. If True, checkpoints are saved during training, allowing you to resume later.
checkpointing_flag = True

# Maximum allowed value for the gradient norm. This prevents exploding gradients by clipping them to this threshold.
grad_norm_limit = 0.3

# The learning rate used during training, controlling how much model parameters are adjusted per step.
# Lower learning rates tend to make training more stable, while higher ones may speed up training but risk instability.
train_learning_rate = 2e-4

# Weight decay rate to apply during optimization to reduce overfitting by shrinking model weights over time.
decay_rate = 0.001

# Type of optimizer used for training. 'paged_adamw_32bit' is a variant of AdamW optimized for memory efficiency.
optimizer_type = "paged_adamw_32bit"

# Type of learning rate scheduler, which adjusts the learning rate during training. 'cosine' scheduler reduces the learning rate following a cosine curve.
scheduler_type = "cosine"

# Limit on the total number of steps for training. After this many steps, training will stop.
steps_limit = 100

# The fraction of total training steps used for learning rate warmup, gradually increasing the learning rate to prevent sudden large updates.
warmup_percentage = 0.03

# Flag to enable grouping sequences by length for more efficient training and batching.
length_grouping = True

# Interval at which model checkpoints are saved. If set to 0, no intermediate checkpoints are saved.
checkpoint_interval = 0

# Interval (in steps) at which logs (e.g., training loss, metrics) are generated and printed during training.
log_interval = 25

## Step 6: Setting up all the supervised fine-tuning arguments hyperparameters for fine-tuning

In [15]:
# Flag to enable or disable packing of sequences. When set to True, sequences of varying lengths may be packed together to optimize memory usage.
# Set to False to disable packing, which may be useful when sequences are of uniform length or when packing is not needed.
enable_packing = False

# Maximum sequence length allowed during training or evaluation. If set to None, there is no limit on the sequence length.
# This parameter helps in managing memory usage and computational efficiency.
sequence_length_max = None

# Dictionary specifying device assignments for different parts of the model or training process.
# The key represents the part of the model or data, and the value represents the device index (e.g., GPU 0).
# In this case, it assigns an empty key to device 0, which might need to be updated based on specific use cases.
device_assignment = {"":0}


## Step 7: Loading the dataset

In [16]:
training_data = load_dataset(formatted_dataset, split="train")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


wiki_medical_terms_llam2.jsonl:   0%|          | 0.00/54.1M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/6861 [00:00<?, ? examples/s]

In [17]:
training_data

Dataset({
    features: ['text'],
    num_rows: 6861
})

## Step 8: Defining the QLoRA configuration

In [18]:
# Retrieve the data type for computation from the torch module based on the value of 'compute_dtype_bnb'.
# 'compute_dtype_bnb' should be a string (e.g., 'float16') which is used to get the corresponding data type from torch.
dtype_computation = getattr(torch, compute_dtype_bnb)

# Create a BitsAndBytesConfig object to configure settings for efficient model training with quantization.
# This setup is for using 4-bit quantization with specific settings defined in the variables.
bnb_setup = BitsAndBytesConfig(
    load_in_4bit=enable_4bit,  # Whether to load the model in 4-bit precision.
    bnb_4bit_quant_type=quant_type_bnb,  # Type of 4-bit quantization to use (e.g., 'nf4').
    bnb_4bit_use_double_quant=double_quant_flag,  # Whether to use double quantization for 4-bit precision.
    bnb_4bit_computation=dtype_computation  # Data type for computation (e.g., float16) based on the setup.
)


## Step 9: Loading the pre-trained LLaMA 2 model

In [19]:
# Load the pre-trained LLaMA model for causal language modeling from Hugging Face using the specified model identifier.
# Apply the quantization configuration (bnb_setup) to the model and assign it to the specified device(s) using device_map.
llama_model = AutoModelForCausalLM.from_pretrained(
    model_identifier,                  # Identifier for the pre-trained model on Hugging Face.
    quantization_config=bnb_setup,    # Configuration for 4-bit quantization.
    device_map=device_assignment       # Mapping of model parts to specific devices (e.g., GPU).
)



config.json:   0%|          | 0.00/632 [00:00<?, ?B/s]

pytorch_model.bin.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

pytorch_model-00001-of-00002.bin:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

pytorch_model-00002-of-00002.bin:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

  return torch.load(checkpoint_file, map_location="cpu")


generation_config.json:   0%|          | 0.00/174 [00:00<?, ?B/s]

In [20]:
# Disable the use of cache during model inference or training to ensure that computations are not cached, which can affect results.
llama_model.config.use_cache = False

# Set the number of tensor parallelism groups to 1. This controls how the model is split across devices for parallel processing.
# A value of 1 means no parallelism, and the model will be processed on a single device.
llama_model.config.pretraining_tp = 1


## Step 10: Loading the pre-trained tokenizer for the LLaMA 2 model

In [21]:
# Load the pre-trained tokenizer associated with the specified model identifier from Hugging Face.
# The 'trust_remote_code' flag is set to True to allow the execution of code from remote sources, which is often necessary for loading custom tokenizers.
llama_tokenizer = AutoTokenizer.from_pretrained(
    model_identifier,        # Identifier for the pre-trained model on Hugging Face, which includes the tokenizer.
    trust_remote_code=True   # Allow execution of remote code to load the tokenizer.
)

# Set the padding token of the tokenizer to be the same as the end-of-sequence (EOS) token.
# This ensures that padding is handled consistently with the EOS token, which might be required for certain models.
llama_tokenizer.pad_token = llama_tokenizer.eos_token

# Specify that padding should be applied to the right side of sequences.
# This is important for consistency in how sequences are padded, which affects how the model processes them.
llama_tokenizer.padding_side = "right"


tokenizer_config.json:   0%|          | 0.00/695 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/21.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/435 [00:00<?, ?B/s]

## Step 11: Setting up the configuration for the LoRA fine-tuning method

In [22]:
# Configure LoRA (Low-Rank Adaptation) for the model with the specified hyperparameters.
# LoRA is a technique to adapt pre-trained models with fewer parameters, often used for efficient fine-tuning.

peft_setup = LoraConfig(
    lora_alpha=lower_hyper_alpha,         # Scaling factor for LoRA. Controls the impact of the low-rank adaptation. Higher values make adaptation more significant.
    lora_dropout=lower_hyper_dropout,     # Dropout rate applied during LoRA adaptation to prevent overfitting. A rate of 0.1 means 10% of units will be dropped randomly.
    r=lower_hyper_r,                      # Dimension of the low-rank decomposition. This determines the size of the adaptation layers. A value of 64 indicates a moderately sized adaptation.
    bias="none",                          # Specifies that no additional bias terms are used in the LoRA layers. This setting depends on the specific model and task requirements.
    task_type="CAUSAL_LM"                 # Specifies the task type for the model, which in this case is Causal Language Modeling. This setting aligns the adaptation with the task.
)


## Step 12: Creating a training configuration by setting the training parameters

In [23]:
train_args = TrainingArguments(
    output_dir=results_dir,  # Directory where model checkpoints and logs will be saved
    num_train_epochs=epochs_count,  # Number of training epochs
    per_device_train_batch_size=train_batch_size,  # Batch size for training per device (e.g., GPU)
    per_device_eval_batch_size=eval_batch_size,  # Batch size for evaluation per device
    gradient_accumulation_steps=accumulation_steps,  # Number of gradient accumulation steps to simulate larger batch sizes
    learning_rate=train_learning_rate,  # Learning rate for the optimizer
    weight_decay=decay_rate,  # Weight decay to apply (if any)
    optim=optimizer_type,  # Type of optimizer (e.g., AdamW)
    save_steps=checkpoint_interval,  # Number of steps between saving model checkpoints
    logging_steps=log_interval,  # Number of steps between logging information
    fp16=enable_fp16,  # Whether to use 16-bit (mixed) precision for faster training
    bf16=enable_bf16,  # Whether to use bfloat16 precision (if supported)
    max_grad_norm=grad_norm_limit,  # Maximum gradient norm for gradient clipping
    max_steps=steps_limit,  # Maximum number of training steps (if using steps instead of epochs)
    warmup_ratio=warmup_percentage,  # Ratio of total steps to use for learning rate warmup
    group_by_length=length_grouping,  # Whether to group sequences of similar lengths for efficiency
    lr_scheduler_type=scheduler_type,  # Type of learning rate scheduler (e.g., linear, cosine)
    gradient_checkpointing=checkpointing_flag  # Whether to use gradient checkpointing to save memory
)


## Step 13: Creating the Supervised Fine-Tuning Trainer

In [25]:
llama_sftt_trainer = SFTTrainer(model = llama_model,
                                args = train_args,
                                train_dataset = training_data,
                                tokenizer = llama_tokenizer,
                                peft_config = peft_setup,
                                dataset_text_field = "text",
                                max_seq_length = sequence_length_max,
                                packing = enable_packing)



Map:   0%|          | 0/6861 [00:00<?, ? examples/s]

## Step 14: Training the model

In [26]:
llama_sftt_trainer.train()

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
  return fn(*args, **kwargs)
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]


Step,Training Loss


Step,Training Loss
25,1.7075
50,0.8555
75,1.3938
100,0.7636


TrainOutput(global_step=100, training_loss=1.1800662231445314, metrics={'train_runtime': 3336.1587, 'train_samples_per_second': 0.12, 'train_steps_per_second': 0.03, 'total_flos': 5978369907425280.0, 'train_loss': 1.1800662231445314, 'epoch': 0.06})

## Step 15: Chatting with the model

In [27]:
user_prompt = "Hey I'm suffering from Amblyopia, could you please tell me about it and how to cure it?"

text_generation_pipe = pipeline(task = "text-generation", model = llama_model,tokenizer = llama_tokenizer, max_length = 300)
generation_result = text_generation_pipe(f"<s>[INST] {user_prompt} [/INST]")
print(generation_result[0]['generated_text'])

Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.
  return fn(*args, **kwargs)


<s>[INST] Hey I'm suffering from Amblyopia, could you please tell me about it and how to cure it? [/INST]  Amblyopia, also known as lazy eye, is a common vision disorder in which one eye has reduced vision due to abnormal development of the visual system. everybody has two eyes, but sometimes one eye doesn't work as well as the other. this can happen because of problems with the eye itself, or because of problems with the brain.

amblyopia is a common condition that affects about 2% of children and 1% of adults. it is more common in children, and usually affects one eye. it is less common in adults, and usually affects both eyes.

amblyopia can be caused by a number of things, including:

* strabismus (crossed eyes)
* anisocoria (unequal pupil size)
* cataracts
* glaucoma
* amblyopia in one eye
* other eye problems

amblyopia can be treated with eye exercises, glasses, or surgery. the best treatment depends on the cause of the amblyopia.

amblyopia is a common condition that can be tre