<a href="https://colab.research.google.com/github/prerakthakur/fine-tuning-llama/blob/main/Fine_Tuning_LLMs_with_Hugging_Face_Medical_Terms_Practice.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-Tuning LLMs with Hugging Face

## Step 1: Installing and importing the libraries

In [1]:
# !pip uninstall accelerate peft bitsandbytes transformers trl -y
!pip install accelerate peft==0.13.2 bitsandbytes transformers trl==0.12.0

Collecting bitsandbytes
  Downloading bitsandbytes-0.44.1-py3-none-manylinux_2_24_x86_64.whl.metadata (3.5 kB)
Collecting trl==0.12.0
  Downloading trl-0.12.0-py3-none-any.whl.metadata (10 kB)
Collecting datasets>=2.21.0 (from trl==0.12.0)
  Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets>=2.21.0->trl==0.12.0)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets>=2.21.0->trl==0.12.0)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets>=2.21.0->trl==0.12.0)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets>=2.21.0->trl==0.12.0)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading trl-0.12.0-py3-none-any.whl (310 kB)

In [2]:
!pip install huggingface_hub



In [3]:
import torch
from trl import SFTTrainer
from peft import LoraConfig
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments, pipeline)

## Step 2: Loading the model

In [4]:
# Load the Llama 2 model from the Hugging Face model hub, quantized to 4-bit NF4
llamaModal = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path = "aboonaji/llama2finetune-v2",
                                     quantization_config = BitsAndBytesConfig(load_in_4bit = True, bnb_4bit_compute_dtype = torch.float16,
                                                                              bnb_4bit_quant_type = "nf4"))
# Disable the key/value cache to reduce memory use during training
llamaModal.config.use_cache = False

# pretraining_tp = 1 deactivates the slower, more accurate computation of the linear layers
llamaModal.config.pretraining_tp = 1

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


`low_cpu_mem_usage` was None, now default to True since model is quantized.


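A rough back-of-the-envelope estimate helps show why the model is loaded with `load_in_4bit = True`. The sketch below is not part of the original notebook: the figure of 7 billion parameters is an assumption, `weight_memory_gib` is a hypothetical helper, and the estimate covers the weights only (no quantization constants, activations, or optimizer state).

```python
def weight_memory_gib(num_params: int, bits_per_param: float) -> float:
    """Approximate memory needed to store the weights alone, in GiB."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / 1024**3

# Assumption: roughly 7 billion parameters (not stated in the notebook)
ASSUMED_PARAMS = 7_000_000_000

fp16_gib = weight_memory_gib(ASSUMED_PARAMS, 16)
nf4_gib = weight_memory_gib(ASSUMED_PARAMS, 4)

print(f"float16 weights: {fp16_gib:.1f} GiB")
print(f"4-bit weights:   {nf4_gib:.1f} GiB")
```

Under that assumption the 4-bit weights need about a quarter of the memory of the float16 weights, which is what makes fine-tuning on a single Colab GPU plausible.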

## Step 3: Loading the tokenizer

In [5]:
llama_tokennizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path = "aboonaji/llama2finetune-v2", trust_remote_code = True)
llama_tokennizer.pad_token = llama_tokennizer.eos_token
llama_tokennizer.padding_side = "right"

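Setting `pad_token` to the end-of-sequence token and `padding_side = "right"` means that shorter sequences in a batch are extended on the right with the EOS id. The toy sketch below illustrates that behaviour in plain Python; the token ids and the `pad_right` helper are made up for illustration and do not call the tokenizer itself.

```python
def pad_right(batch: list[list[int]], pad_id: int) -> tuple[list[list[int]], list[list[int]]]:
    """Right-pad every sequence to the longest length and build matching attention masks."""
    max_len = max(len(seq) for seq in batch)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
    masks = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]
    return padded, masks

EOS_ID = 2  # illustrative id standing in for the tokenizer's eos_token_id
padded, masks = pad_right([[5, 6, 7], [8]], pad_id=EOS_ID)
print(padded)
print(masks)
```

The attention mask marks the padded positions with 0 so that they can be ignored by the model.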

## Step 4: Setting the training arguments

In [6]:
# max_steps = 100 stops training after at most 100 optimizer steps
# report_to = "none" disables logging to Weights & Biases (wandb), an experiment-tracking service that would otherwise ask for an API key
training_arguments = TrainingArguments(
    output_dir = "./results", per_device_train_batch_size = 1, gradient_accumulation_steps = 4,
    gradient_checkpointing = True, max_steps = 100, report_to = "none",
)
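With `per_device_train_batch_size = 1` and `gradient_accumulation_steps = 4`, gradients from four single-example micro-batches are accumulated before each optimizer step, so each step behaves like a batch of four examples. The toy sketch below illustrates the idea in plain Python with a one-parameter linear model; the data, the model, and the helper name are made up for illustration and are not part of the notebook or of the `transformers` API.

```python
def grad_mse(w: float, xs: list[float], ys: list[float]) -> float:
    """Gradient of the mean squared error of the model y = w * x with respect to w."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n

# Hypothetical toy data: four examples
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5
accumulation_steps = 4

# One large batch of four examples
full_batch_grad = grad_mse(w, xs, ys)

# Four micro-batches of one example each; every gradient is scaled by
# 1 / accumulation_steps before being added, so the sum is an average
accumulated_grad = 0.0
for x, y in zip(xs, ys):
    accumulated_grad += grad_mse(w, [x], [y]) / accumulation_steps

print(full_batch_grad, accumulated_grad)
```

Both gradients agree, which is why accumulation trades extra forward and backward passes for a lower peak memory footprint.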

## Step 5: Creating the Supervised Fine-Tuning trainer

In [7]:
# Define the LoRA configuration used for parameter-efficient fine-tuning (PEFT)

lora_config = LoraConfig(
    r=32,
    lora_alpha=16,
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM"
)

trainer = SFTTrainer(
    model=llamaModal,
    train_dataset= load_dataset("aboonaji/wiki_medical_terms_llam2_format", split = "train"),
    peft_config = lora_config,
    dataset_text_field = "text", # Replace with your dataset's text field
    tokenizer = llama_tokennizer,
    args = training_arguments
)

wiki_medical_terms_llam2.jsonl:   0%|          | 0.00/54.1M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/6861 [00:00<?, ? examples/s]


Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.


Map:   0%|          | 0/6861 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs
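In LoRA, each adapted weight matrix of shape d × k gets two small trainable matrices of shapes d × r and r × k, and their product is scaled by `lora_alpha / r` before it is added to the frozen weight. The sketch below works through the numbers for `r = 32` and `lora_alpha = 16`; the 4096 × 4096 layer size is an assumption chosen for illustration, and `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(d: int, k: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d x k weight matrix (B is d x r, A is r x k)."""
    return r * (d + k)

r, lora_alpha = 32, 16
d = k = 4096  # assumed layer size, for illustration only

scaling = lora_alpha / r
added = lora_param_count(d, k, r)
frozen = d * k

print(f"scaling factor: {scaling}")
print(f"LoRA parameters: {added:,} vs. frozen: {frozen:,} ({added / frozen:.2%})")
```

With these settings the low-rank update is scaled by 0.5, and under the assumed layer size the adapter trains under 2% as many parameters as the frozen matrix contains.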


## Step 6: Training the model

In [8]:
trainer.train()


TrainOutput(global_step=100, training_loss=1.665298614501953, metrics={'train_runtime': 3612.2799, 'train_samples_per_second': 0.111, 'train_steps_per_second': 0.028, 'total_flos': 1.092045936242688e+16, 'train_loss': 1.665298614501953, 'epoch': 0.058300539279988337})
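The reported metrics can be cross-checked against the training arguments. The sketch below recomputes the epoch fraction and the throughput from the batch size, the accumulation steps, the step count, the runtime, and the 6861 training examples shown in the dataset loading output. The variable names are illustrative, and a single training device is assumed.

```python
per_device_batch_size = 1
gradient_accumulation_steps = 4
max_steps = 100
num_examples = 6861          # size of the train split reported when the dataset was loaded
train_runtime_s = 3612.2799  # 'train_runtime' from TrainOutput

# Assumes one training device, so the effective batch is batch size x accumulation steps
effective_batch = per_device_batch_size * gradient_accumulation_steps
examples_seen = effective_batch * max_steps

epoch_fraction = examples_seen / num_examples
samples_per_second = examples_seen / train_runtime_s
steps_per_second = max_steps / train_runtime_s

print(f"examples seen: {examples_seen}")
print(f"epoch: {epoch_fraction:.4f}")
print(f"samples/s: {samples_per_second:.3f}, steps/s: {steps_per_second:.3f}")
```

These values agree with the `epoch`, `train_samples_per_second`, and `train_steps_per_second` entries in `TrainOutput`: the 100 steps covered 400 examples, just under 6% of one pass over the dataset.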

## Step 7: Chatting with the model

In [10]:

user_prompt = "What is hydrocephalus and what are symptopms?"

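# Build a text-generation pipeline around the fine-tuned model; max_length = 300 caps the prompt plus the generated tokens
text_generation_pipeline = pipeline(task = "text-generation", model = llamaModal, tokenizer = llama_tokennizer, max_length = 300)
# Wrap the question in the [INST] ... [/INST] markers of the Llama 2 instruction format
model_answer = text_generation_pipeline(f"<s>[INST]{user_prompt}[/INST]")
print(model_answer[0]["generated_text"])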
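The prompt above wraps the user message in `[INST] ... [/INST]` markers, and the pipeline returns the prompt together with the completion in `generated_text`. The helpers below are a small hypothetical sketch (not part of the notebook or of `transformers`) for building such a prompt and for separating the answer from the echoed prompt; the sample output string is a placeholder, not real model output.

```python
def build_prompt(user_prompt: str) -> str:
    """Wrap a user message in the instruction markers used in the notebook."""
    return f"<s>[INST]{user_prompt}[/INST]"

def extract_answer(generated_text: str) -> str:
    """Return the text after the first [/INST] marker, or the whole text if the marker is absent."""
    _, marker, answer = generated_text.partition("[/INST]")
    return answer.strip() if marker else generated_text.strip()

prompt = build_prompt("What is hydrocephalus?")
# Placeholder standing in for model_answer[0]["generated_text"]
fake_output = prompt + "  (model completion goes here)"
print(extract_answer(fake_output))
```

In the notebook, `extract_answer(model_answer[0]["generated_text"])` would return only the model's reply without the prompt.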


Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


<s>[INST]What is hydrocephalus and what are symptopms?[/INST]  Hydrocephalus is a condition in which there is an accumulation of fluid in the brain, leading to increased pressure inside the skull and potentially causing damage to the brain and its surrounding tissues. everybody is different and the symptoms of hydrocephalus can vary depending on the age of the patient, the location and severity of the accumulation of fluid, and other factors. Here are some common symptoms of hydrocephalus:

1. Headache: One of the most common symptoms of hydrocephalus is headache, which can range from mild to severe. The headache may be a constant or intermittent pain, and it may be accompanied by other symptoms such as nausea and vomiting.

2. Vision problems: Hydrocephalus can cause problems with vision, including blurred vision, double vision, and loss of peripheral vision. These symptoms may be caused by the pressure of the fluid on the optic nerves or by the compression of the visual pathways in t