# Chatbot with a Generative Pretrained Transformer (GPT) Model
#### The objective of this project is to develop a chatbot using a pretrained text generation model (a GPT-style model) and to walk through the process of fine-tuning it on the given dataset. Fine-tuning is a key step in producing high-quality models because it adapts a general-purpose model to a specific dataset or domain.

#### Supervised fine-tuning (SFT), the most common method for fine-tuning text generation models, has been implemented here. Fine-tuning a pretrained text generation model makes it a more effective tool for the target application.



# Install the Required Libraries

In [None]:
%%capture
!pip install -q accelerate==0.31.0 peft==0.11.1 bitsandbytes==0.43.1 transformers==4.41.2 trl==0.9.4 sentencepiece==0.2.0 triton==3.1.0

In [None]:
import pandas as pd
import os
import torch
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig
)
from peft import LoraConfig
from trl import SFTTrainer, SFTConfig, DataCollatorForCompletionOnlyLM

# Supervised Fine-Tuning (SFT)
#### With supervised fine-tuning (SFT), we can adapt the base model to follow instructions. During fine-tuning, the model's parameters are updated to better fit the target task, such as following instructions. Like pretraining, SFT uses next-token prediction, but the model learns to predict the tokens of a response conditioned on a user input rather than simply continuing arbitrary text.

# Data Importing and Preprocessing

#### Import the data from the CSV file and prepare it for training the LLM.


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
csv_path = "/content/drive/MyDrive/Chatbot_assignment_files/mle_screening_dataset.csv"
df = pd.read_csv(csv_path)

print(df.head())

                         question  \
0        What is (are) Glaucoma ?   
1        What is (are) Glaucoma ?   
2        What is (are) Glaucoma ?   
3  Who is at risk for Glaucoma? ?   
4       How to prevent Glaucoma ?   

                                              answer  
0  Glaucoma is a group of diseases that can damag...  
1  The optic nerve is a bundle of more than 1 mil...  
2  Open-angle glaucoma is the most common form of...  
3  Anyone can develop glaucoma. Some people are a...  
4  At this time, we do not know how to prevent gl...  


In [None]:
# User config
BASE_MODEL = os.environ.get("BASE_MODEL", "Qwen/Qwen2-0.5B-Instruct")
CSV_PATH   = os.environ.get("CSV_PATH",   "/content/drive/MyDrive/Chatbot_assignment_files/mle_screening_dataset.csv")
OUTPUT_DIR = os.environ.get("OUTPUT_DIR", "qwen2-sft-chatbot")
USE_4BIT   = os.environ.get("USE_4BIT",   "true").lower() == "true"

In [None]:
# Load CSV
# Expects exactly two columns: "question", "answer"
dataset = load_dataset("csv", data_files={"train": CSV_PATH})["train"]

# Build a single 'text' field per row in the classic prompt→response format.
# We'll compute loss **only on the answer tokens** using a special collator.
RESPONSE_PREFIX = "Assistant:"

def row_to_text(example):
    q = str(example["question"]).strip()
    a = str(example["answer"]).strip()
    # The "Assistant:" prefix is important: our collator will mask loss before it.
    # Keep the prompt short, consistent, and aligned with how you'll chat at inference.
    return {"text": f"User: {q}\n{RESPONSE_PREFIX} {a}"}

dataset = dataset.map(row_to_text, remove_columns=dataset.column_names)

Map:   0%|          | 0/16406 [00:00<?, ? examples/s]

In [None]:
# Example of formatted prompt
print(dataset["text"][258])

User: How to diagnose Alzheimer's Disease ?
Assistant: The time from diagnosis of Alzheimers disease to death varies. It can be as little as 3 or 4 years if the person is over 80 years old when diagnosed or as long as 10 years or more if the person is younger.


In [None]:
# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, use_fast=True)
# Ensure a valid pad token for batching
if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id


# Collator (mask prompt loss)
# Only compute loss on tokens **after** "Assistant:" so the model learns the answer, not to copy the question.
collator = DataCollatorForCompletionOnlyLM(
    response_template=f"{RESPONSE_PREFIX}",
    tokenizer=tokenizer
)
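
#### To make the masking idea concrete, here is a minimal sketch of what a completion-only collator does conceptually. It is not the TRL implementation: the helper name `mask_prompt_labels` and the toy token IDs are hypothetical, and the sketch simply masks everything up to the first occurrence of the response template.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def mask_prompt_labels(input_ids, response_template_ids):
    """Return labels where every token up to and including the response
    template is replaced by IGNORE_INDEX, so only answer tokens carry loss.

    Hypothetical helper for illustration; it uses the first match it finds.
    """
    labels = list(input_ids)
    n = len(response_template_ids)
    for start in range(len(input_ids) - n + 1):
        if input_ids[start:start + n] == response_template_ids:
            end = start + n
            for i in range(end):
                labels[i] = IGNORE_INDEX
            return labels
    # Template not found: ignore the whole example in the loss.
    return [IGNORE_INDEX] * len(input_ids)


# Toy example with made-up token IDs: [prompt..., template(1, 2), answer(8, 9)]
toy_input_ids = [5, 6, 7, 1, 2, 8, 9]
toy_template_ids = [1, 2]
toy_labels = mask_prompt_labels(toy_input_ids, toy_template_ids)
print(toy_labels)
```

#### Only the last two positions keep their token IDs as labels, which is why the model is trained on the answer and not on reproducing the question.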



## Models - Quantization
#### Now that the data is ready, we can load the model. This is where we apply the Q in QLoRA, namely quantization. Here I have used the "bitsandbytes" package to compress the pretrained model to a 4-bit representation. The quantization scheme is defined in BitsAndBytesConfig. Following the original QLoRA paper, the model is loaded in 4-bit (load_in_4bit) with the 4-bit NormalFloat (NF4) data type (bnb_4bit_quant_type) and double quantization (bnb_4bit_use_double_quant). Note that quantization is applied only when USE_4BIT is true and a CUDA GPU is available; otherwise the model is loaded without it.

In [None]:
# Base model
model_kwargs = {}
if USE_4BIT and torch.cuda.is_available():
    model_kwargs["quantization_config"] = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
    )
    model_kwargs["device_map"] = "auto"

model = AutoModelForCausalLM.from_pretrained(BASE_MODEL, **model_kwargs)
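
#### The following is a simplified sketch of the idea behind block-wise 4-bit quantization, together with the memory arithmetic. It is an assumption-laden illustration: it uses uniform integer levels with absmax scaling, not the NF4 codebook that bitsandbytes actually uses, and the helper names and the round figure of 0.5 billion parameters are chosen for illustration only.

```python
def quantize_block(values, bits=4):
    """Absmax quantization of one block to signed integers (simplified, not NF4)."""
    levels = 2 ** (bits - 1) - 1          # 7 positive levels for signed 4-bit
    absmax = max(abs(v) for v in values)
    scale = absmax / levels if absmax > 0 else 1.0
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize_block(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]


def weight_memory_gb(n_params, bits):
    """Memory needed to store n_params weights at the given bit width."""
    return n_params * bits / 8 / 1e9


block = [0.12, -0.5, 0.33, 0.07]           # made-up weights
codes, scale = quantize_block(block)
restored = dequantize_block(codes, scale)
print(codes, [round(v, 3) for v in restored])

# Rough storage estimate for about 0.5 billion weights (illustrative figure).
print(weight_memory_gb(500_000_000, 16), "GB at 16-bit")
print(weight_memory_gb(500_000_000, 4), "GB at 4-bit")
```

#### Each restored value is within half a quantization step of the original, which is the price paid for storing four times fewer bits per weight than a 16-bit representation.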


# Configuration
## LoRA Configuration
#### Next, I have defined the LoRA configuration using the "peft" library; it holds the hyperparameters of the fine-tuning process.

### Parameters
#### r: The rank of the low-rank (compressed) matrices. Increasing this value increases the size of those matrices, which means less compression and therefore more representational power. Values typically range between 4 and 64.

#### lora_alpha: Controls the amount of change that is added to the original weights. In essence, it balances the knowledge of the original model with that of the new task. A rule of thumb is to choose a value twice the size of r.

#### target_modules: Controls which layers to target. LoRA can skip specific layers, such as particular projection layers. Targeting fewer layers can speed up training but may reduce performance; targeting more layers does the opposite. Here, all attention and MLP projection layers are targeted.

#### We can experiment with these parameter values to get an intuitive understanding of which values work for our use case and which do not.
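
#### As a rough illustration of how r and lora_alpha interact, the sketch below counts the trainable parameters LoRA adds to a single weight matrix and computes the scaling factor applied to the update. The helper names and the 1024 x 1024 layer size are hypothetical and are not taken from the Qwen2 architecture.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters in the two LoRA matrices A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)


def lora_scaling(lora_alpha, r):
    """Factor applied to the low-rank update B @ A before adding it to W."""
    return lora_alpha / r


d_in, d_out = 1024, 1024                   # illustrative layer size
full_params = d_in * d_out
for rank in (4, 16, 64):
    added = lora_param_count(d_in, d_out, rank)
    print(rank, added, f"{added / full_params:.2%} of the full matrix")

print("scaling with lora_alpha=32, r=16:", lora_scaling(32, 16))
```

#### With the rule of thumb lora_alpha = 2 * r, the scaling factor stays at 2 regardless of the rank, so changing r mainly changes capacity rather than the size of the update.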

In [None]:
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model

# Prepare LoRA Configuration
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj","k_proj","v_proj","o_proj","gate_proj","up_proj","down_proj"],
)

# prepare model for training
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)

# Training Configuration
#### Now, we need to configure the training parameters.

### There are several parameters worth mentioning:
#### num_train_epochs: The total number of passes over the training data. Higher values increase the risk of overfitting to the training set, so we generally keep this low.

#### learning_rate: Determines the step size at each iteration of weight updates. The QLoRA authors used lower learning rates for their largest models (33B parameters and above) than for smaller ones.

#### lr_scheduler_type: A cosine-based scheduler that adjusts the learning rate dynamically. When warmup is configured (warmup_ratio or warmup_steps), the learning rate first increases linearly from zero until it reaches the set value; after that, it decays following a cosine curve. No warmup is set in the configuration below, so the learning rate starts at the set value and decays from there.

#### optim: The optimizer. Here it is a paged AdamW optimizer, as used in the original QLoRA paper.
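
#### The cosine schedule with optional linear warmup can be sketched in a few lines. This is a standalone illustration of the formula, not the scheduler object the trainer creates; the function name `cosine_lr` and the step counts are hypothetical.

```python
import math


def cosine_lr(step, total_steps, base_lr, warmup_steps=0):
    """Learning rate at a given step: linear warmup, then half-cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


# Without warmup the schedule starts at the base learning rate and decays.
for step in (0, 25, 50, 75, 100):
    print(step, f"{cosine_lr(step, total_steps=100, base_lr=2e-3):.6f}")
```

#### Passing a non-zero warmup_steps makes the first steps ramp up linearly from zero, which matches the warmup behavior described above.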

In [None]:
from transformers import TrainingArguments

# Training arguments
training_arguments = TrainingArguments(
    output_dir=OUTPUT_DIR,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    optim="paged_adamw_32bit",
    learning_rate=2e-3,
    lr_scheduler_type="cosine",
    num_train_epochs=1,
    logging_steps=10,
    fp16=True,
    gradient_checkpointing=True
)
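
#### The batch settings above combine into an effective batch size. The short sketch below works through the arithmetic, assuming a single GPU and using the 16406 examples reported when the dataset was mapped; the helper name is hypothetical and the step count is an approximation.

```python
def steps_per_epoch(num_examples, per_device_batch, grad_accum, num_devices=1):
    """Return (effective batch size, approximate optimizer steps per epoch)."""
    effective_batch = per_device_batch * grad_accum * num_devices
    return effective_batch, num_examples // effective_batch


effective_batch, approx_steps = steps_per_epoch(
    num_examples=16406,      # rows in the training split
    per_device_batch=2,      # per_device_train_batch_size
    grad_accum=4,            # gradient_accumulation_steps
    num_devices=1,           # assumption: one GPU
)
print(effective_batch, approx_steps)
```

#### With logging_steps=10, a full epoch at these settings would therefore produce on the order of a couple of hundred loss log entries.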

# Model Training

#### Now that the model and parameters are prepared, we can start fine-tuning. We create an "SFTTrainer" and simply run "trainer.train()". During training, the loss is printed every 10 steps, as set by the logging_steps parameter. Note: I used my "Weights & Biases" API key to log the training run.

In [None]:
from trl import SFTTrainer

# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    data_collator=collator,
    dataset_text_field="text",
    tokenizer=tokenizer,
    args=training_arguments,
    max_seq_length=512,

    # Leave this out for regular SFT
    peft_config=peft_config,
)

# Train model
trainer.train()

# Save QLoRA weights
trainer.model.save_pretrained("Qwen2-0.5B-qlora")





Step,Training Loss
10,2.1197
20,2.0268
30,1.7948
40,1.7781
50,1.7536
60,1.9698
70,2.0265
80,1.8334
90,1.9763
100,1.7754




# Merge Adapter

#### After training the QLoRA weights, we need to combine them with the original weights to use them. We reload the model in 16 bits, instead of the quantized 4 bits, to merge the weights. The tokenizer was not updated during training, but it is convenient to save it to the same folder as the model for easier access.

In [None]:
from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained(
    "Qwen2-0.5B-qlora",
    torch_dtype=torch.float16,  # reload in 16-bit (not 4-bit) so the weights can be merged
    low_cpu_mem_usage=True,
    device_map="auto",
)

# Merge LoRA and base model
merged_model = model.merge_and_unload()
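
#### Conceptually, merging folds the low-rank update into each targeted weight matrix: W_merged = W + (lora_alpha / r) * B @ A. The sketch below checks this on tiny made-up matrices using plain Python lists; the helper names are hypothetical and this is not how peft implements the merge internally.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w, lora_a, lora_b, lora_alpha, r):
    """Return W + (lora_alpha / r) * (B @ A) for W of shape (d_out x d_in)."""
    scale = lora_alpha / r
    delta = matmul(lora_b, lora_a)          # (d_out x r) @ (r x d_in)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Made-up 2 x 2 example with rank 1.
w = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 2.0]]                       # r x d_in
lora_b = [[0.5], [1.0]]                     # d_out x r
merged = merge_lora(w, lora_a, lora_b, lora_alpha=2, r=1)

x = [[3.0], [4.0]]                          # input column vector
merged_out = matmul(merged, x)
base_out = matmul(w, x)
adapter_out = matmul(lora_b, matmul(lora_a, x))
separate_out = [[b[0] + 2.0 * a[0]] for b, a in zip(base_out, adapter_out)]
print(merged, merged_out, separate_out)
```

#### The merged matrix gives the same output as running the base layer and the adapter separately, which is why the merged model no longer needs the peft wrapper at inference time.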

# Inference
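#### After merging the adapter with the base model, we can query it with a User/Assistant prompt like the one used for training. Note that the prompts below place the question and the answer on new lines, which differs slightly from the "User: {question}\nAssistant: {answer}" template used during training; matching the training template exactly is generally preferable.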

In [None]:
from transformers import pipeline

print ("Test Example-1:")
# Use the predefined prompt template
question = """User:
What is Blood Pressure?
Assistant:
"""

# Run our instruction-tuned model
pipe = pipeline(task="text-generation", model=merged_model, tokenizer=tokenizer)
print(pipe(question)[0]["generated_text"])

Test Example-1:
User:
What is Blood Pressure?
Assistant:
Blood pressure is the force of blood pushing against the walls


In [None]:
print ("Test Example-2:")
# Use the predefined prompt template
question = """User:
What is Diabetes?
Assistant:
"""

# Run our instruction-tuned model
pipe = pipeline(task="text-generation", model=merged_model, tokenizer=tokenizer)
print(pipe(question)[0]["generated_text"])

Test Example-2:
User:
What is Diabetes?
Assistant:
People with type 2 diabetes are more likely to develop type


In [None]:
print ("Test Example-3:")
# Use the predefined prompt template
question = """User:
What causes Osteoarthritis?
Assistant:
"""

# Run our instruction-tuned model
pipe = pipeline(task="text-generation", model=merged_model, tokenizer=tokenizer)
print(pipe(question)[0]["generated_text"])

Test Example-3:
User:
What causes Osteoarthritis?
Assistant:
People who have osteoarthritis often


In [None]:
print ("Test Example-4:")
# Use the predefined prompt template
question = """User:
What is (are) Anxiety Disorders?
Assistant:
"""

# Run our instruction-tuned model
pipe = pipeline(task="text-generation", model=merged_model, tokenizer=tokenizer)
print(pipe(question)[0]["generated_text"])

Test Example-4:
User:
What is (are) Anxiety Disorders?
Assistant:
 Anxiety disorders are a group of psychological problems
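
#### The four test cells repeat the same steps, so a small helper can keep the inference prompt aligned with the training template. This is a sketch under assumptions: `build_prompt`, `extract_answer`, and the values in `GENERATION_KWARGS` are illustrative choices, not settings used in the runs above.

```python
RESPONSE_PREFIX = "Assistant:"  # same prefix as in the training template


def build_prompt(question):
    """Format a question the same way as the training rows, up to the answer."""
    return f"User: {question.strip()}\n{RESPONSE_PREFIX}"


def extract_answer(generated_text):
    """Keep only the text after the first response prefix."""
    return generated_text.split(RESPONSE_PREFIX, 1)[-1].strip()


# Illustrative generation settings; without an explicit length limit the
# library default applies, which can cut answers short.
GENERATION_KWARGS = {"max_new_tokens": 128, "do_sample": False}

prompt = build_prompt("  What is (are) Glaucoma ? ")
print(repr(prompt))

# With the pipeline created earlier in the notebook, usage would look like:
# text = pipe(prompt, **GENERATION_KWARGS)[0]["generated_text"]
# print(extract_answer(text))
```

#### Creating the pipeline once and reusing it for every question also avoids rebuilding it in each cell.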


# Evaluating Generative Models

#### Evaluating generative models poses a significant challenge. Generative models are used across many diverse use cases, making it a challenge to rely on a singular metric for judgment. Given their probabilistic nature, generative models do not necessarily generate consistent outputs. No one metric is perfect for all use cases.

#### One common category of metrics for comparing generative models is word-level evaluation. These classic techniques compare a reference dataset with the generated tokens at the token (or token-set) level. Common word-level metrics include perplexity, ROUGE, BLEU, and BERTScore.
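
#### Two of these ideas are easy to sketch by hand. Perplexity is the exponential of the mean per-token negative log-likelihood, and a unigram-overlap F1 score captures the spirit of ROUGE-1. The functions below are simplified illustrations with hypothetical names (whitespace tokenization, no stemming), not replacements for the standard metric libraries.

```python
import math
from collections import Counter


def perplexity(mean_nll):
    """Perplexity from the mean per-token negative log-likelihood (natural log)."""
    return math.exp(mean_nll)


def unigram_f1(reference, candidate):
    """F1 over overlapping lowercase whitespace tokens (a ROUGE-1-style score)."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# A training loss such as the 1.7754 logged at step 100 maps to this perplexity.
print(perplexity(1.7754))
print(unigram_f1("the optic nerve", "the nerve"))
```

#### For a medical question-answering chatbot, such overlap scores would be computed between the generated answer and the reference answer from the dataset.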

#### A common method for evaluating generative models on language generation and understanding tasks is to use well-known public benchmarks, such as MMLU, GLUE, TruthfulQA, GSM8k, and HellaSwag. These benchmarks tell us about basic language understanding as well as more complex analytical question answering.

#### Aside from natural language tasks, some models specialize in other domains, like programming. These models tend to be evaluated on different benchmarks, such as HumanEval, which consists of challenging programming tasks for the model to solve.

# Preference-Tuning / Alignment / Reinforcement Learning from Human Feedback (RLHF)

#### Although the present model can now follow instructions, we can further improve its behavior by applying additional fine-tuning techniques that align its answers with human preferences.

#### A common approach is to first train a reward model on human preference data and then fine-tune the LLM against it with Proximal Policy Optimization (PPO). PPO is a popular reinforcement learning technique that optimizes the instruction-tuned LLM to increase its reward while making sure that it does not deviate too much from its previous behavior.

#### A disadvantage of PPO is that it is a complex method that needs to train at least two models, the reward model and the LLM, which can be more costly than perhaps necessary.

#### Direct Preference Optimization (DPO) is an alternative to PPO and does away with the reinforcement-based learning procedure. Instead of using a separate reward model to judge the quality of a generation, we let the LLM itself do that by comparing its likelihoods with those of a frozen reference copy. Compared to PPO, DPO has been reported to be more stable during training and more accurate.
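
#### The DPO objective for a single preference pair can be written in a few lines. The sketch below assumes the sequence log-probabilities have already been computed for the trained model and for the frozen reference model; the function name, the toy numbers, and beta=0.1 are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one pair: -log(sigmoid(beta * (chosen margin - rejected margin)))."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    return math.log1p(math.exp(-logits))   # equals -log(sigmoid(logits))


# Policy identical to the reference: no preference learned yet.
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Policy has shifted probability toward the chosen answer: lower loss.
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

#### The loss drops as the model assigns relatively more probability to the preferred answer than the reference model does, which is how DPO avoids training a separate reward model.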

#### The combination of SFT+DPO is a great way to first fine-tune the model to perform basic chatting and then align its answers with human preference. However, it is computationally expensive since we need to perform two training loops and potentially tweak the parameters in two processes.

#### More recently, a new preference-alignment method called Odds Ratio Preference Optimization (ORPO) was proposed by Hong et al. (2024). This method combines SFT and DPO-style preference alignment into a single training process. It removes the need to perform two separate training loops, further simplifying the training process while still allowing the use of QLoRA.
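
#### The single-loop idea can be sketched as one loss with two terms: the usual SFT loss on the preferred answer plus a weighted odds-ratio penalty. The sketch below is a simplified illustration for one preference pair; the function name, the weight `lam=0.1`, and the toy probabilities are assumptions, and the probabilities stand in for the model's length-normalized likelihoods of each answer.

```python
import math


def orpo_loss(sft_nll, chosen_prob, rejected_prob, lam=0.1):
    """SFT loss plus lam times the odds-ratio term for one preference pair."""
    def odds(p):
        return p / (1.0 - p)

    log_odds_ratio = math.log(odds(chosen_prob)) - math.log(odds(rejected_prob))
    odds_ratio_term = math.log1p(math.exp(-log_odds_ratio))  # -log(sigmoid(.))
    return sft_nll + lam * odds_ratio_term


# Equal likelihood for both answers: the penalty is log(2) times lam.
print(orpo_loss(1.5, chosen_prob=0.5, rejected_prob=0.5))
# Preferred answer is more likely than the rejected one: smaller penalty.
print(orpo_loss(1.5, chosen_prob=0.8, rejected_prob=0.2))
```

#### Because both terms are computed from the same model in the same forward pass, no reference model or second training loop is needed.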

# Conclusion

#### In this project, I have developed a chatbot (a generative AI model) by fine-tuning a pretrained LLM. I used the lightweight pretrained model "Qwen/Qwen2-0.5B-Instruct" and fine-tuned it on the given dataset "mle_screening_dataset.csv". The fine-tuning used parameter-efficient fine-tuning (PEFT) through the low-rank adaptation (LoRA) technique, extended with quantization (QLoRA), which reduces the memory needed to represent the parameters of the model and adapters. The model was tested with four example questions. The generated answers are cut off after a few words; this most likely reflects the default generation length, since no max_new_tokens value was passed to the pipeline, rather than a limitation of the small model itself.

#### We can further improve the chatbot by using larger, medically specialized LLMs such as BioGPT, MedAlpaca, or PMC-LLaMA. For deployment in healthcare settings that require HIPAA compliance, we could also consider stronger hosted models such as Azure OpenAI GPT-4 or Google Med-PaLM 2. Additionally, we can run a retrieval-augmented generation (RAG) pipeline over trusted medical sources to ground the answers and improve their accuracy.
