# Finetune Cleaned Data

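The goal of this notebook is to fine-tune the Gemma 2B instruction-tuned model on the cleaned dataset using QLoRA. QLoRA loads the base model in 4-bit precision and trains only small LoRA adapters, which reduces memory usage during training; the aim is to improve the model's performance on this data.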

![](https://raw.githubusercontent.com/komus/MedQuAD/refs/heads/master/kaggleX%20Chatbot.drawio%20(1).png)

## Environment Variables

Install and import the required packages, then set the environment variables.

In [1]:
import numpy as np 
import pandas as pd 
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/gemma/transformers/2b-it/3/model.safetensors.index.json
/kaggle/input/gemma/transformers/2b-it/3/gemma-2b-it.gguf
/kaggle/input/gemma/transformers/2b-it/3/config.json
/kaggle/input/gemma/transformers/2b-it/3/model-00001-of-00002.safetensors
/kaggle/input/gemma/transformers/2b-it/3/model-00002-of-00002.safetensors
/kaggle/input/gemma/transformers/2b-it/3/tokenizer.json
/kaggle/input/gemma/transformers/2b-it/3/tokenizer_config.json
/kaggle/input/gemma/transformers/2b-it/3/special_tokens_map.json
/kaggle/input/gemma/transformers/2b-it/3/.gitattributes
/kaggle/input/gemma/transformers/2b-it/3/tokenizer.model
/kaggle/input/gemma/transformers/2b-it/3/generation_config.json


In [2]:
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install --upgrade --quiet keras-nlp
!pip install --upgrade --quiet keras
!pip install -q diffusers
!pip install --quiet google-cloud-secret-manager
!pip install --upgrade --quiet google-cloud-aiplatform
!pip install -q trl

In [3]:
from kaggle_secrets import UserSecretsClient

os.environ["KERAS_BACKEND"] = "jax"
os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "0.9"

# Reuse a single secrets client instead of creating one per lookup
user_secrets = UserSecretsClient()
os.environ['KAGGLE_KEY'] = user_secrets.get_secret("KAGGLE_KEY")
os.environ['KAGGLE_USERNAME'] = user_secrets.get_secret("KAGGLE_USERNAME")
os.environ['HF_TOKEN'] = user_secrets.get_secret("HF_KEY")

# Write the credential stored under "KEYS" to a local key file
user_credential = user_secrets.get_secret("KEYS")
s_auth = "key.json"
with open(s_auth, "w") as f:
    f.write(user_credential)

os.environ['AUTHS'] = s_auth

In [4]:
import keras
import keras_nlp
import torch
import transformers
from google.cloud import aiplatform
from numba import cuda
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

In [5]:
os.remove("key.json")

In [6]:
MODEL_LOCATION = "/kaggle/input/gemma/transformers/2b-it/3"
MODEL_NAME = "gemma_2b_en"
MODEL_SIZE = MODEL_NAME.split("_")[-2]
assert MODEL_SIZE in ("2b", "7b")
TRAIN_RATIO = 50
DATASET_NAME = "output_medplus"
DATASET_PATH = f"{DATASET_NAME}.jsonl"
DATASET_URL = f"https://raw.githubusercontent.com/komus/MedQuAD/refs/heads/master/{DATASET_PATH}"

FINETUNED_MODEL_DIR = f"./{MODEL_NAME}_{DATASET_NAME}"
FINETUNED_WEIGHTS_PATH = f"{FINETUNED_MODEL_DIR}/model.weights.h5"
FINETUNED_VOCAB_PATH = f"{FINETUNED_MODEL_DIR}/vocabulary.spm"

HUGGINGFACE_MODEL_DIR = f"./{MODEL_NAME}_huggingface"

PROJECT_ID = UserSecretsClient().get_secret("PROJECT_ID")
REGION = "us-central1"
BUCKET_URI = UserSecretsClient().get_secret("BUCKET_URI")
SERVICE_ACCOUNT = UserSecretsClient().get_secret("SERVICE_ACCT")
DEPLOYED_MODEL_URI = f"{BUCKET_URI}/{MODEL_NAME}_q"

## Dataset

Download the cleaned dataset (a JSON Lines file) from GitHub.

In [7]:
!wget -nv -nc -O $DATASET_PATH $DATASET_URL


2024-10-14 13:49:36 URL:https://raw.githubusercontent.com/komus/MedQuAD/refs/heads/master/output_medplus.jsonl [20846808/20846808] -> "output_medplus.jsonl" [1]
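Before training, it can help to confirm that each line of the downloaded file parses as JSON and carries the `question` and `answer` fields that the formatting function reads later. The following is a minimal sketch, assuming the file is newline-delimited JSON; `peek_jsonl` is a hypothetical helper and not part of the original notebook.

```python
import json
from itertools import islice


def peek_jsonl(path, n=3, required=("question", "answer")):
    """Return the first n records of a JSONL file, checking required keys."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(islice(f, n), start=1):
            record = json.loads(line)
            missing = [key for key in required if key not in record]
            if missing:
                raise KeyError(f"line {line_no} is missing fields: {missing}")
            records.append(record)
    return records
```

In the notebook this could be called as `peek_jsonl(DATASET_PATH)` right after the download, so a malformed file fails early rather than during tokenization.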


In [8]:
TEST_EXAMPLES = [
    'As a healthcare fellow learning diagnosis, What is (are) Adhesions?',
    'As a healthcare fellow learning diagnosis, what research (or clinical trials) is being done for Miller Fisher Syndrome ?',
    'As a healthcare fellow learning diagnosis, What to do for Henoch-Schnlein Purpura '
]

# Prompt template for the generation tests before and after fine-tuning
# (the training texts themselves are built by formatting_func, defined in the next cell)
PROMPT_TEMPLATE = "Instruction:\n{instruction}\n\nResponse:\n{answer}"

TEST_PROMPTS = [
    PROMPT_TEMPLATE.format(instruction=example, answer="")
    for example in TEST_EXAMPLES
]

In [9]:
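def formatting_func(batch):
    # SFTTrainer calls this on a batch here (each column is a list of values),
    # so return one training text per row rather than one text for the whole batch.
    return [
        f"### Instruction (Human): {question}\n ### Answer (Assistant): {answer}"
        for question, answer in zip(batch['question'], batch['answer'])
    ]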

In [10]:
from datasets import load_dataset

RANDOM_SEED = 3456

dataset = load_dataset("json", data_files=DATASET_PATH)
shuffled_dataset = dataset['train'].shuffle(seed=RANDOM_SEED)

# Use the first TRAIN_RATIO percent of the shuffled rows for training
training_data_count = shuffled_dataset.num_rows * TRAIN_RATIO // 100
train_dataset = shuffled_dataset.select(range(training_data_count))

# Draw 1000 evaluation rows from the remaining rows, which are not used for training
remaining_indices = range(training_data_count, shuffled_dataset.num_rows)
test_dataset = shuffled_dataset.select(remaining_indices).shuffle(seed=RANDOM_SEED).select(range(1000))

Generating train split: 0 examples [00:00, ? examples/s]
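The split arithmetic in the cell above can be checked without loading the data. This is a small sketch under stated assumptions: `split_sizes` is a hypothetical helper, and the row count of 12348 is only an illustrative value, not a figure reported by the notebook.

```python
def split_sizes(n_rows, train_ratio=50, eval_cap=1000):
    """Mirror the cell's arithmetic: integer percentage for train, fixed-size eval."""
    n_train = n_rows * train_ratio // 100
    n_remaining = n_rows - n_train
    if n_remaining < eval_cap:
        # Selecting range(eval_cap) from fewer rows would fail in the real cell.
        raise ValueError("not enough held-out rows for the requested eval size")
    return n_train, eval_cap, n_remaining - eval_cap


n_train, n_eval, n_unused = split_sizes(12348)
print(f"train={n_train}, eval={n_eval}, unused={n_unused}")
```

The third number is worth watching: any rows beyond the training share and the 1000 evaluation rows are never used by this notebook.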

## Model Definition

In [11]:
bits_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

gamma_llm = AutoModelForCausalLM.from_pretrained(
    "/kaggle/input/gemma/transformers/2b-it/3",
    quantization_config=bits_config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    offload_folder="./offload",
)
tokenizer = AutoTokenizer.from_pretrained("/kaggle/input/gemma/transformers/2b-it/3")


`config.hidden_act` is ignored, you should use `config.hidden_activation` instead.
Gemma's activation function will be set to `gelu_pytorch_tanh`. Please, use
`config.hidden_activation` if you want to override this behaviour.
See https://github.com/huggingface/transformers/pull/29402 for more details.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
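To see why 4-bit loading matters, here is a back-of-the-envelope estimate of weight storage. It is a rough sketch, not a measurement: the parameter count of 2.5e9 is an assumption for a "2b" model, and the overhead figures for the quantization constants (about 0.5 bits per parameter, or about 0.127 with double quantization) are the approximate values quoted in the QLoRA paper. Layers that are not quantized, such as embeddings, would make the real footprint larger.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Approximate storage for n_params weights at the given bit width."""
    return n_params * bits_per_param / 8 / 2**30


N_PARAMS = 2.5e9  # assumed parameter count; not taken from the notebook

bf16 = weight_memory_gib(N_PARAMS, 16)
nf4 = weight_memory_gib(N_PARAMS, 4 + 0.5)       # NF4 plus quantization constants
nf4_dq = weight_memory_gib(N_PARAMS, 4 + 0.127)  # NF4 with double quantization

print(f"bf16: {bf16:.2f} GiB | nf4: {nf4:.2f} GiB | nf4 + double quant: {nf4_dq:.2f} GiB")
```

Under these assumptions, `bnb_4bit_use_double_quant=True` saves only a little on top of plain NF4; most of the reduction comes from going from 16 bits to 4 bits.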

In [12]:
for prompt in TEST_PROMPTS:
    #output = gemma_lm.generate(prompt, max_length=None)
    inputs = tokenizer(prompt, return_tensors="pt").to(gamma_llm.device)
    outputs = gamma_llm.generate(**inputs, max_new_tokens=50)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
    print(f"\n{'- '*40}")

Instruction:
As a healthcare fellow learning diagnosis, What is (are) Adhesions?

Response:
Sure, here's a definition of adhesions:

An adhesion is a medical condition in which two or more body parts are held together by a tissue or membrane. This can be caused by a variety of factors, including infection, inflammation, or

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
Instruction:
As a healthcare fellow learning diagnosis, what research (or clinical trials) is being done for Miller Fisher Syndrome ?

Response:
Sure, here are some research (or clinical trials) being done for Miller Fisher Syndrome:

**Clinical Trials:**

* **The Miller Fisher Syndrome Foundation Clinical Trial Network (MFSF CTN):** This is a global network of researchers and healthcare

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
Instruction:
As a healthcare fellow learning diagnosis, What to do for Henoch-Schnlein Purpura 

Response:
**Henoch-Schn

### Finetune model

In [13]:
from peft import LoraConfig
from trl import SFTTrainer

lora_config = LoraConfig(
    r=4,
    target_modules=["q_proj", "o_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
    lora_dropout=0.1,
    lora_alpha = 8,
    bias="none"
)
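The number of parameters this LoRA configuration trains can be estimated by hand: each adapted weight matrix gets two low-rank factors, adding `r * (in_features + out_features)` parameters. The sketch below assumes the Gemma 2B layer shapes (hidden size 2048, MLP width 16384, 18 decoder layers, a single 256-wide key/value head); these shapes are not stated in the notebook and should be checked against `config.json`.

```python
# Assumed Gemma 2B shapes (not stated in the notebook; verify against config.json)
HIDDEN, KV_DIM, MLP, N_LAYERS, RANK = 2048, 256, 16384, 18, 4

# (in_features, out_features) for every module listed in target_modules
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}


def lora_param_count(shapes, rank, n_layers):
    """LoRA adds A (rank x in) and B (out x rank) for each adapted matrix."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * n_layers


trainable = lora_param_count(TARGET_SHAPES, RANK, N_LAYERS)
print(f"{trainable:,} trainable parameters, about {trainable * 4 / 1e6:.1f} MB in float32")
```

With these assumed shapes the estimate comes to roughly 4.9 million parameters, or about 19.6 MB in float32, which is consistent with the size of the `adapter_model.safetensors` file reported when the model is published. With `r=4` and `lora_alpha=8`, the adapter update is scaled by `lora_alpha / r = 2`.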

In [14]:
training_args = transformers.TrainingArguments(
    evaluation_strategy="epoch",
    auto_find_batch_size=True,
    report_to="none",
    #per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    warmup_steps=2,
    max_steps=10,
    learning_rate=2e-4,
    fp16=True,
    eval_steps=10,  # only takes effect when evaluation_strategy="steps"
    seed=RANDOM_SEED,
    do_eval=True,
    logging_steps=1,
    output_dir=FINETUNED_MODEL_DIR,
    optim="paged_adamw_8bit",
)

trainer = SFTTrainer(
    model=gamma_llm,
    train_dataset=train_dataset,
    args=training_args,
    peft_config=lora_config,
    eval_dataset=test_dataset,
    tokenizer=tokenizer,
    #dataset_text_field='Question',
    formatting_func=formatting_func,
)
trainer_result = trainer.train()



Map:   0%|          | 0/6174 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs


| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 2.8426 | 2.869408 |
| 2 | 2.8352 | 2.700592 |
| 3 | 2.6814 | 2.450986 |
| 4 | 2.4482 | 2.343704 |
| 5 | 2.3603 | 2.265896 |
| 6 | 2.2793 | 2.199103 |
| 7 | 2.2271 | 2.148906 |
| 8 | 2.1511 | 2.116912 |
| 9 | 2.1026 | 2.099919 |
| 10 | 2.0761 | 2.092422 |
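With `warmup_steps=2`, `max_steps=10` and `learning_rate=2e-4`, the learning rate changes on every one of the ten steps. The sketch below reproduces a linear warmup followed by linear decay in plain Python. It assumes the trainer uses its default linear scheduler, since the notebook does not set `lr_scheduler_type`; `linear_warmup_decay` is a hypothetical name.

```python
def linear_warmup_decay(step, base_lr=2e-4, warmup_steps=2, max_steps=10):
    """Learning rate after `step` optimizer updates: linear warmup, then linear decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0.0, (max_steps - step) / max(1, max_steps - warmup_steps))
    return base_lr * remaining


schedule = [linear_warmup_decay(step) for step in range(10)]
for step, lr in enumerate(schedule):
    print(f"step {step}: lr = {lr:.2e}")
```

Under this assumption the peak rate of 2e-4 is reached only once, so a run this short spends most of its steps at a reduced learning rate.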


In [17]:
#trainer.save_model()
metrics = trainer_result.metrics
metrics["train_samples"] = len(train_dataset)
trainer.log_metrics("train", metrics)
trainer.save_metrics("train", metrics)
trainer.save_state()

***** train metrics *****
  epoch                    =       10.0
  total_flos               =   795794GF
  train_loss               =     2.4004
  train_runtime            = 0:01:22.52
  train_samples            =       6174
  train_samples_per_second =      0.969
  train_steps_per_second   =      0.121
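The loss values in the training log are easier to interpret as perplexities. This is a minimal sketch, assuming the reported losses are mean per-token cross-entropies in nats; the numbers are copied from the first and last rows of the log above.

```python
import math

# (training loss, validation loss) from the first and last rows of the log
first, last = (2.8426, 2.869408), (2.0761, 2.092422)


def perplexity(mean_cross_entropy):
    """Perplexity is the exponential of the mean per-token cross-entropy."""
    return math.exp(mean_cross_entropy)


for name, (train_loss, val_loss) in {"first": first, "last": last}.items():
    print(f"{name}: train ppl {perplexity(train_loss):.2f}, val ppl {perplexity(val_loss):.2f}")
```

A falling validation perplexity says the model fits the held-out texts better; it does not by itself confirm that the generated answers are medically accurate.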


In [18]:
trainer.save_model()

### Free resources

In [19]:
del gamma_llm
del trainer
torch.cuda.empty_cache()

## Test model

In [20]:
model = AutoModelForCausalLM.from_pretrained(
    FINETUNED_MODEL_DIR,
    device_map="auto",
    torch_dtype=torch.float16,
    # Passing load_in_4bit directly is deprecated; use a BitsAndBytesConfig instead
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
)
tokenizer = AutoTokenizer.from_pretrained(FINETUNED_MODEL_DIR)
# load into pipeline
#pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [23]:
model.save_pretrained(HUGGINGFACE_MODEL_DIR)
tokenizer.save_pretrained(HUGGINGFACE_MODEL_DIR)



('./gemma_2b_en_huggingface/tokenizer_config.json',
 './gemma_2b_en_huggingface/special_tokens_map.json',
 './gemma_2b_en_huggingface/tokenizer.model',
 './gemma_2b_en_huggingface/added_tokens.json',
 './gemma_2b_en_huggingface/tokenizer.json')

In [29]:
for prompt in TEST_PROMPTS:
    inputs = tokenizer([prompt], return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_length=500, 
        do_sample=True,
        temperature=0.7,
        top_k=50,
        top_p=0.95)

    output = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(f"{output}\n{'- '*40}")

Instruction:
As a healthcare fellow learning diagnosis, What is (are) Adhesions?

Response:
Sure. Here's the definition of adhesions:

An adhesion is a connection or tissue bridge that holds two or more structures together. Adhesions can be formed between any two tissues, including skin, muscle, bone, and fat.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
Instruction:
As a healthcare fellow learning diagnosis, what research (or clinical trials) is being done for Miller Fisher Syndrome ?

Response:
Sure. Here are some ongoing research (or clinical trials) related to Miller Fisher Syndrome:

1. **Clinical trial for Miller-Fisher syndrome** (ClinicalTrials.gov ID: NCT04022076): This clinical trial is enrolling patients with Miller-Fisher syndrome for a clinical trial to evaluate the efficacy of gene therapy with AAV-based vector in controlling disease progression.

2. **Phase 1 clinical trial for gene therapy of Miller-Fisher syndrome** (ClinicalTrials.g
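The generation call above samples with `temperature=0.7`, `top_k=50` and `top_p=0.95`. The sketch below shows, on a toy logit vector, how these three settings reshape the next-token distribution. It is a simplified illustration in plain Python rather than the library's actual implementation; `filter_distribution` and the toy logits are hypothetical.

```python
import math


def filter_distribution(logits, temperature=0.7, top_k=50, top_p=0.95):
    """Return {token_id: prob} after temperature scaling, top-k and top-p filtering."""
    scaled = [x / temperature for x in logits]
    # Keep the top_k highest-scoring token ids, best first.
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # Softmax over the survivors (subtract the max for numerical stability).
    peak = scaled[ranked[0]]
    weights = [math.exp(scaled[i] - peak) for i in ranked]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for token_id, p in zip(ranked, probs):
        kept.append((token_id, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize so the kept probabilities sum to 1.
    norm = sum(p for _, p in kept)
    return {token_id: p / norm for token_id, p in kept}


toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]
print(filter_distribution(toy_logits, top_k=3))
```

A temperature below 1 sharpens the distribution before filtering, so fewer tokens are needed to reach the `top_p` mass. Because sampling is enabled, outputs such as the trial identifiers printed above are drawn from this distribution and should be verified independently.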

## Publish model

In [31]:
model.push_to_hub("medquad_finetuned")



adapter_model.safetensors:   0%|          | 0.00/19.6M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/komus/medquad_finetuned/commit/ad6a902bdd592d6b556a1a0a30a35c7a02842508', commit_message='Upload GemmaForCausalLM', commit_description='', oid='ad6a902bdd592d6b556a1a0a30a35c7a02842508', pr_url=None, pr_revision=None, pr_num=None)

In [None]:
import locale

del model, tokenizer

torch.cuda.empty_cache()

# Force UTF-8 as the preferred encoding
locale.getpreferredencoding = lambda: "UTF-8"