<a href="https://colab.research.google.com/github/mohit3agarwal/AI-Chatbot-Llama2/blob/main/Fine_Tuning_Llama_2_with_Hugging_Face.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-Tuning Llama 2 with Hugging Face for a Medical AI Chatbot

**Using Hugging Face to fine-tune LLMs**

---

Pre-trained Llama 2 model: https://huggingface.co/aboonaji/llama2finetune-v2

Formatted source dataset: https://huggingface.co/datasets/aboonaji/wiki_medical_terms_llam2_format



## Installing and importing the libraries

In [None]:
!pip install -q accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 transformers==4.31.0 trl==0.4.7


In [None]:
!pip install huggingface_hub



In [None]:
import torch
from trl import SFTTrainer
from peft import LoraConfig
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments, pipeline)

## Loading the LLM

In [None]:
# loading the pre-trained LLM with 4-bit NF4 quantization to reduce its GPU memory footprint
# (the weights are stored in 4 bits; computation is carried out in float16)

llama_model = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path = "aboonaji/llama2finetune-v2",
                                                   quantization_config = BitsAndBytesConfig(load_in_4bit = True,
                                                                                            bnb_4bit_compute_dtype = torch.float16,
                                                                                            bnb_4bit_quant_type = "nf4"))

llama_model.config.use_cache = False     # disabling the key/value cache (only useful for generation) to save memory during training
llama_model.config.pretraining_tp = 1    # using the standard linear-layer computation instead of the slower sliced one that mirrors pre-training tensor parallelism

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
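
To see why 4-bit loading matters on a 16 GB GPU such as the T4, here is a rough, weights-only estimate. It is a minimal sketch: the 7-billion parameter count is an assumption (the notebook does not state the model size), and the figures ignore activations, optimizer state, and the small overhead of the quantization constants.

```python
# Rough weights-only memory estimate for different storage precisions.
# ASSUMED_PARAMS is an assumption for illustration, not a value from the notebook.
ASSUMED_PARAMS = 7_000_000_000

BYTES_PER_PARAM = {
    "float32": 4.0,
    "float16": 2.0,
    "nf4 (4-bit)": 0.5,
}

def weight_memory_gb(num_params: int, bytes_per_param: float) -> float:
    """Return the approximate size of the weights in gigabytes (10**9 bytes)."""
    return num_params * bytes_per_param / 1e9

for name, size in BYTES_PER_PARAM.items():
    print(f"{name:>12}: {weight_memory_gb(ASSUMED_PARAMS, size):5.1f} GB")
```

Under these assumptions the float16 weights alone would nearly fill the GPU, while the 4-bit weights leave room for the LoRA adapters and activations.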



## Loading the tokenizer

In [None]:
# loading the tokenizer; trust_remote_code = True allows custom code from the model repository to run
llama_tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path = "aboonaji/llama2finetune-v2", trust_remote_code = True)

# configuring the padding so that sequences are of uniform length
llama_tokenizer.pad_token = llama_tokenizer.eos_token
llama_tokenizer.padding_side = "right"
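
The effect of right padding with the end-of-sequence token can be sketched without the tokenizer itself. The helper `pad_right` and the token ids below are hypothetical and only illustrate the idea; in the notebook the real id would come from `llama_tokenizer.eos_token_id`.

```python
from typing import Dict, List

def pad_right(sequences: List[List[int]], pad_id: int) -> Dict[str, List[List[int]]]:
    """Right-pad token-id sequences to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)             # padding goes on the right
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 0 marks padded positions
    return {"input_ids": input_ids, "attention_mask": attention_mask}

EOS_ID = 2  # illustrative value only
batch = pad_right([[1, 15, 27, 9], [1, 42]], pad_id=EOS_ID)
print(batch["input_ids"])       # [[1, 15, 27, 9], [1, 42, 2, 2]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```

The attention mask is what lets the model ignore the padded positions even though they reuse the end-of-sequence id.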


## Setting the training arguments

In [None]:
# setting the training arguments: reducing the per-device batch size from 8 (default) to 4 and capping training at 100 steps
training_arguments = TrainingArguments(output_dir = "./results", per_device_train_batch_size = 4, max_steps = 100)
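
A positive `max_steps` takes precedence over `num_train_epochs` (which defaults to 3). The sketch below approximates how many optimizer steps each setting would imply on a single device. The function `planned_update_steps` is a hypothetical helper, not part of `transformers`, and the example count of 6861 is the size of the train split reported when the dataset is loaded later in this notebook.

```python
import math

def planned_update_steps(num_examples, per_device_batch_size, num_train_epochs=3.0,
                         max_steps=-1, gradient_accumulation_steps=1):
    """Approximate the number of optimizer steps for a single-device run."""
    if max_steps > 0:  # a positive max_steps overrides the epoch count
        return max_steps
    batches_per_epoch = math.ceil(num_examples / per_device_batch_size)
    steps_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)
    return math.ceil(num_train_epochs * steps_per_epoch)

print(planned_update_steps(6861, 4))                 # 5148 steps for three full epochs
print(planned_update_steps(6861, 4, max_steps=100))  # 100 steps with the cap used here
```

Capping at 100 steps therefore covers only a small fraction of one epoch, which keeps the run short on a free Colab GPU.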

## Creating the Supervised Fine-Tuning trainer

In [None]:
# using the SFT trainer for supervised fine-tuning (a form of transfer learning, not RLHF)
# performing parameter-efficient fine-tuning with LoRA (reducing the number of parameters being trained), since the model will be trained on a T4 GPU
llama_sft_trainer = SFTTrainer(model = llama_model,
                               args = training_arguments,
                               train_dataset = load_dataset(path = "aboonaji/wiki_medical_terms_llam2_format", split = "train"),
                               tokenizer = llama_tokenizer,
                               peft_config = LoraConfig(task_type = "CAUSAL_LM", r = 64, lora_alpha = 16, lora_dropout = 0.1),
                               dataset_text_field = "text")


Generating train split:   0%|          | 0/6861 [00:00<?, ? examples/s]
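
To make "parameter-efficient" concrete, the sketch below counts the parameters LoRA would train for the `r = 64` and `lora_alpha = 16` used in the `LoraConfig`. The model dimensions (hidden size 4096, 32 layers) and the choice of adapting only the query and value projections are assumptions for illustration; the notebook does not state them.

```python
def lora_params_per_matrix(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds A (r x d_in) and B (d_out x r) beside a frozen d_out x d_in weight."""
    return r * (d_in + d_out)

R, LORA_ALPHA = 64, 16        # values from the LoraConfig in this notebook
HIDDEN, N_LAYERS = 4096, 32   # assumption: Llama-2-7B-like dimensions
ADAPTED_PER_LAYER = 2         # assumption: only q_proj and v_proj are adapted

per_matrix = lora_params_per_matrix(HIDDEN, HIDDEN, R)
total_trainable = per_matrix * ADAPTED_PER_LAYER * N_LAYERS
fraction_of_matrix = per_matrix / (HIDDEN * HIDDEN)
scaling = LORA_ALPHA / R      # the low-rank update B @ A is multiplied by alpha / r

print(f"trainable per adapted matrix: {per_matrix:,}")
print(f"total trainable parameters:   {total_trainable:,}")
print(f"fraction of one full matrix:  {fraction_of_matrix:.2%}")
print(f"update scaling (alpha / r):   {scaling}")
```

Under these assumptions roughly 33.5 million parameters are trained, a small fraction of the full model, while the 4-bit base weights stay frozen.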




## Training the model

In [None]:
llama_sft_trainer.train()

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.



TrainOutput(global_step=100, training_loss=1.6558329772949218, metrics={'train_runtime': 1467.7661, 'train_samples_per_second': 0.273, 'train_steps_per_second': 0.068, 'total_flos': 8228119310991360.0, 'train_loss': 1.6558329772949218, 'epoch': 0.06})
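
The reported metrics can be cross-checked with a little arithmetic. This sketch uses only numbers from the notebook: 100 steps, a batch size of 4, the runtime printed in `TrainOutput`, and the 6861 examples in the train split.

```python
steps, batch_size = 100, 4
runtime_s = 1467.7661   # train_runtime from TrainOutput
num_examples = 6861     # size of the train split

samples_seen = steps * batch_size             # examples processed in the run
samples_per_s = samples_seen / runtime_s      # compare with train_samples_per_second
steps_per_s = steps / runtime_s               # compare with train_steps_per_second
epoch_fraction = samples_seen / num_examples  # compare with epoch

print(samples_seen)              # 400
print(round(samples_per_s, 3))   # 0.273
print(round(steps_per_s, 3))     # 0.068
print(round(epoch_fraction, 2))  # 0.06
```

The model therefore saw 400 examples, about 6% of the dataset, at roughly 15 seconds per step.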

## Chatting with the model

In [None]:
user_prompt = "Please tell me about Babesiosis"
text_generation_pipeline = pipeline(task = "text-generation", model = llama_model, tokenizer = llama_tokenizer, max_length = 300)
model_answer = text_generation_pipeline(f"<s>[INST] {user_prompt} [/INST]")
print(model_answer[0]['generated_text'])

Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.


<s>[INST] Please tell me about Babesiosis [/INST]  Babesiosis is a parasitic infection caused by the protozoan Babesia, which is transmitted through the bite of an infected blacklegged tick (Ixodes scapularis). It is primarily found in the northeastern United States and the upper Midwest. everybody has Babesia in their blood, but it is not harmful unless it is in large quantities. Babesiosis is a rare disease, but it can be severe and even life-threatening in some cases.

Symptoms of Babesiosis:

* Fever
* Chills
* Fatigue
* Headache
* Muscle and joint pain
* Anemia
* Yellowing of the skin and eyes (jaundice)
* Shortness of breath
* Cough
* Confusion
* Seizures
* Coma

Causes and Risk Factors:

* Tick bites: Babesia is transmitted through the bite of an infected blacklegged tick (Ixodes scapularis). The ticks are most likely to be infected in the northeastern United States and the upper Midwest.
* Exposure to contaminated blood: Babesia can be transmitted through blood transfusions or
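
The answer stops mid-sentence because `max_length = 300` counts the prompt tokens as well as the generated ones. The pipeline also returns the prompt together with the answer. A minimal sketch of wrapping a message in the `[INST]` format used above and stripping it from the result follows; `build_llama2_prompt` and `extract_answer` are hypothetical helper names, not library functions.

```python
def build_llama2_prompt(user_message: str) -> str:
    """Wrap a user message in the instruction tags used in this notebook."""
    return f"<s>[INST] {user_message.strip()} [/INST]"

def extract_answer(generated_text: str) -> str:
    """Return only the text that follows the last [/INST] marker."""
    _, found, answer = generated_text.rpartition("[/INST]")
    return answer.strip() if found else generated_text.strip()

prompt = build_llama2_prompt("Please tell me about Babesiosis")
print(prompt)  # <s>[INST] Please tell me about Babesiosis [/INST]

# Simulated pipeline output: the prompt followed by the start of an answer.
simulated_output = prompt + "  Babesiosis is a parasitic infection"
print(extract_answer(simulated_output))  # Babesiosis is a parasitic infection
```

In the notebook, `extract_answer(model_answer[0]['generated_text'])` would print the answer without the echoed prompt.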
