# NSDC Data Science Projects

## Developing a Medical Chatbot Using RAG and LLMs

## Name:

***

## GPU Instructions

**Please follow the instructions below before proceeding with the project!**  

In this project, we will utilize the GPU provided by Kaggle. The GPU will be used to train LLMs and run inference with them.  
To activate the GPU, follow the steps outlined in this [document](https://drive.google.com/file/d/10KHE4eJJKkF9TLxEpSBVKFFqAP41yVh0/view?usp=sharing).  

❗️It's important to note that the GPU quota is 30 hours per week. While this is a sizable allocation, it's always a good practice to monitor your usage to ensure you stay within the limit.

***

### **❗️Disclaimer❗️**
The chatbot developed in this project is **not a substitute for professional medical diagnosis**. Its responses are generated based solely on the dataset it was trained on, which is limited in scope and not clinically comprehensive. Please do not rely on its outputs for making medical decisions.

Always consult a licensed healthcare provider for any health concerns.

If you are in an emergency situation, please seek immediate medical attention. You can find a list of emergency contact numbers worldwide [in this link.](https://www.dt.com/ca/wp-content/uploads/2017/03/Global-_911_Emergency-Contacts.pdf)

***

### Project Introduction

**Motivation:** When people feel unwell, they often search online to understand their symptoms but the information they find can be confusing. Visiting a doctor isn’t always immediately possible, especially in remote areas. A medical chatbot can help bridge this gap by giving quick, easy-to-understand information about possible health conditions. It can guide users to make better decisions about whether to seek medical help, all from the comfort of their home.

**What are LLMs?**
A Large Language Model (LLM) is an advanced AI model that can understand and generate human-like text. It is trained on a large amount of text data and can answer questions, write content, summarize information and hold conversations.

**Why LLMs?**
LLMs can understand and generate natural human language, making them ideal for building chatbots. They are capable of handling complex queries, providing detailed responses and adapting to different ways people describe their symptoms.

### Project Outline

1. Data Preprocessing
2. Implementing a simple **rule-based** chatbot
3. Developing a chatbot using **text embeddings** and **RAG**
4. **Fine-tuning** LLMs for our specific data and use-case

***

## Milestone 1: Data Preprocessing

First, let's import all the necessary libraries which will be used for data preprocessing as well as moving forward in building the chatbot.

The **warnings** library is used to suppress unimportant warnings while we run the cells.

In [None]:
import pandas as pd
import warnings

warnings.filterwarnings("ignore")

#### Step 1: Reading the data

We will import a dataset from Kaggle. This step does not require us to download the dataset and we can directly access it using the /kaggle/input path. For this we will first have to add the dataset to our Kaggle notebook.  

**To add the data follow the steps below**:
1. Click *Input* on the right menu bar
2. Select *+ Add Input*
3. Enter this URL in the search bar: https://www.kaggle.com/datasets/karthikudyawar/disease-symptom-prediction/data
4. Click on the *+* icon to add the dataset to the notebook

You can explore the data we are using for this project [here](https://www.kaggle.com/datasets/karthikudyawar/disease-symptom-prediction/data)

In this project, we will only use **dataset.csv** which contains the disease and its corresponding symptoms list

In [None]:
dis_symp_df = pd.read_csv("/kaggle/input/disease-symptom-prediction/dataset.csv")

Check how the dataset is structured using the pandas `head` function

In [None]:
# Your code here

We observe that the symptoms have underscores which need to be removed

#### Step 2: Remove Underscores from the Symptoms Text

Steps to remove the underscores:
1. Find all columns in dis_symp_df where the column names start with "Symptom"
2. After finding those columns, replace the _ with a blank space

In [None]:
# fill in the blanks
symptom_cols = [col for col in _________.columns if col.startswith("________")]
dis_symp_df[symptom_cols] = dis_symp_df[symptom_cols].replace("character you are replacing", "character you are replacing with", regex=True)

Verify if the underscores have been removed by printing the first few entries of the dataframe

In [None]:
# Your code here

#### Step 3: Transforming the Structure of the Dataset

We need to transform this dataset to match input-output pairs that are suitable for training LLMs

Two Methods:


1.   Pivot Longer: will result in a bigger dataset and more suited for simple classification tasks
2.   Comma-separated symptoms in one column: suitable for sentence-level input from users

Therefore, we will proceed with combining the symptoms into one column and separating them with commas



Steps to create a comma-separated list of symptoms:


1. Combine the symptom values row-wise  
For each row, go through the values in the symptom_cols:
* Skip any missing values
* Join the non-missing symptom strings with commas
* Store this in a new column called **Symptoms**

2. Remove all the original symptoms columns

We will use a lambda function to combine the symptom values row-wise  

**What is a lambda function?**  
`square = lambda x: x*x`  
`print(square(5))`

`lambda x: x*x` is a lambda function that takes x as input and returns x*x in just one, short line
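
As a small illustration of the same idea on a plain Python list, the sketch below uses a lambda to join only the non-missing entries of a row. The name `join_present` and the example row are hypothetical, and `None` stands in for a missing value; the dataframe version of this step works on pandas rows instead.

```python
# A lambda is an anonymous, single-expression function.
# This one keeps the entries that are not None and joins them with commas.
join_present = lambda row: ", ".join(s for s in row if s is not None)

# Hypothetical row of symptom values; None marks a missing entry.
row = ["itching", None, "skin rash", None]
print(join_present(row))
```
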

In [None]:
# fill in the blanks
dis_symp_df["Symptoms"] = dis_symp_df[__________].apply(lambda row: ', '.join([s for s in row if pd._________(s)]), axis=1)
dis_symp_df._________(symptom_cols, axis=1, inplace=True)

Display the updated dataframe again using the `head` function

In [None]:
# Your code here

#### Step 4: Check for Duplicate Lists

Obtain a summary of the dataset using `info` function

In [None]:
# Your code here

Drop duplicate rows if they have the same symptoms list  
**Hint**: Use the `drop_duplicates` function only for the **Symptoms** column

In [None]:
# Your code here

Check the summary of the dataset again to see if there were any duplicates. Report back on your conclusion.

In [None]:
# Your code here

***Your conclusion here***

***

## Milestone 2: Rule-Based Chatbot (Cosine Similarity)

We will now be implementing one of the most basic versions of a chatbot: **a rule-based chatbot using cosine similarity**.

**Cosine similarity** is a metric used to measure how similar two vectors are, regardless of their magnitude.

A **rule-based chatbot** using cosine similarity identifies the most appropriate response by comparing the user’s input with a set of predefined statements and selecting the one whose vector representation has the highest cosine similarity to the input. Here the vectors are TF-IDF vectors, so the match depends on shared words rather than on meaning.
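
For two vectors `a` and `b`, cosine similarity is the dot product divided by the product of the vector lengths. The following is a minimal pure-Python sketch of that formula; the function name `cosine_similarity_1d` is made up for this illustration, and returning 0.0 for an all-zero vector is a convention chosen here.

```python
import math

def cosine_similarity_1d(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

# Same direction, different magnitude: similarity is (approximately) 1.
print(cosine_similarity_1d([1, 2], [2, 4]))
# Perpendicular vectors: similarity is 0.
print(cosine_similarity_1d([1, 0], [0, 1]))
```

Because the lengths are divided out, a long symptom list and a short one can still score as very similar if they point in the same direction.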

#### Step 1: Import required packages from `sklearn`

`TfidfVectorizer`: Read up more on Term Frequency-Inverse Document Frequency (TF-IDF) [here](https://www.geeksforgeeks.org/machine-learning/understanding-tf-idf-term-frequency-inverse-document-frequency/). This is used to convert text into numerical vectors based on how important each word is.

`cosine_similarity`: Function used to measure how similar the user's symptom input is to each disease's symptom list.
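
To make the TF-IDF idea concrete, here is a simplified pure-Python sketch that weights each word by its count in a document times a smoothed inverse document frequency. It is an illustration only: the helper `tfidf_vectors` and the toy documents are hypothetical, tokenization is a plain whitespace split, and no length normalization is applied, so the numbers will not match `TfidfVectorizer` output.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, TF-IDF vectors) for whitespace-tokenized documents."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency: rarer terms get larger weights.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)  # term frequency within this document
        vectors.append([counts[t] * idf[t] for t in vocab])
    return vocab, vectors

# Toy documents: "fever" appears everywhere, "cough" in only one document.
docs = ["fever cough", "fever headache", "fever rash"]
vocab, vectors = tfidf_vectors(docs)
weights = dict(zip(vocab, vectors[0]))
print(weights)
```

In the first document, the word shared by every document ("fever") receives a smaller weight than the word unique to it ("cough"), which is the behavior that lets distinctive symptoms dominate the match.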

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

#### Step 2: Group All Symptom Entries For Each Disease into a Single String

**Hint:** Use a lambda function again. Group by **Disease** and then apply the lambda function to **Symptoms**

In [None]:
# fill in the blanks
dis_symp_df = dis_symp_df.groupby("_________")["_________"].apply(lambda x: ", ".join(_____)).reset_index()

#### Step 3: Vectorize only the **Symptoms** column from dis_symp_df using `TfidfVectorizer` and `fit_transform()`

In [None]:
# fill in the blanks
vectorizer = _________()
tfidf_matrix = vectorizer._________(dis_symp_df["_________"])

#### Step 4: Create a chatbot interface for the user

In [None]:
def chatbot():
    # Welcome message for the user
    print("ChatBot: I can help suggest possible diseases based on your symptoms.")
    print("Type your symptoms ('fever, cough, sore throat'), or type 'exit' to quit.\n")

    while True:
        # Continue asking the user for input until they enter 'exit' or 'quit'
        user_input = input("You: ")
        if user_input.lower() in ['exit', 'quit']:
            print("ChatBot: Goodbye!\n Note: This is not a medical diagnosis. Always consult a licensed physician.")
            break

        # Converts the user's input into a TF-IDF vector using the previously trained vectorizer
        user_vec = vectorizer.transform([user_input])

        # Compares the user's vector with all disease-symptom vectors in the tfidf_matrix using cosine similarity
        # flatten() is used to convert the 2D result into a 1D array of similarity scores
        cosine_sim = cosine_similarity(user_vec, tfidf_matrix).flatten()


        # Sorts the similarity scores in descending order and retrieves the top 3 indices
        top_indices = cosine_sim.argsort()[::-1][:3]

        # Collects the names of the top matches, keeping only those whose similarity score is > 0.2
        results = []
        for i in top_indices:
            if cosine_sim[i] > 0.2:
                disease = dis_symp_df.iloc[i]["Disease"]
                results.append(disease)

        if not results:
            print("ChatBot: I could not find a good match for your symptoms. Try rephrasing or listing more symptoms.\n")
            continue

        # If there are results, print the top-matching diseases in ranked order
        print("ChatBot: Based on your symptoms, here are possible conditions:")
        for i, disease in enumerate(results, 1):
            print(f"   {i}. {disease}")

        print("Note: This is not a medical diagnosis. Always consult a licensed physician.\n")

In [None]:
# Run the chatbot interface
chatbot()

### Disadvantages of Rule-Based Technique


*   Does not generalize well to unseen data since there is no training involved
*   Not scalable



***

## Milestone 3: Embeddings + RAG

**What is Retrieval Augmented Generation (RAG)?**
* RAG allows a model to use external data it hasn’t been explicitly trained on
* It addresses common LLM limitations like lack of real-time information and outdated knowledge
* It works by converting both user queries and a knowledge base into vector embeddings
* Uses similarity search to retrieve the most relevant context from the knowledge base
* This retrieved context is appended to the user query and passed to the LLM to generate a more accurate and informed response

**Embeddings** are numerical representations of text that capture its meaning and semantic similarity in a vector space.
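
The retrieve-then-prompt flow described above can be sketched with made-up numbers. In the sketch below, the three-dimensional vectors are hand-written stand-ins for real embeddings, the disease names are placeholders, and the helpers `cosine`, `retrieve` and `build_prompt` are hypothetical names used only for this illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical knowledge base: (text, hand-made "embedding") pairs.
knowledge_base = [
    ("Disease: A. Symptoms: fever, cough", [0.9, 0.1, 0.0]),
    ("Disease: B. Symptoms: itching, skin rash", [0.0, 0.2, 0.9]),
    ("Disease: C. Symptoms: fever, headache", [0.8, 0.3, 0.1]),
]

def retrieve(query_vec, top_k=2):
    """Return the top_k texts whose vectors are most similar to the query."""
    scored = [(cosine(query_vec, vec), text) for text, vec in knowledge_base]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_k]]

def build_prompt(user_input, contexts):
    """Append the retrieved records to the user's query."""
    return (
        "Medical Records:\n" + "\n".join(contexts)
        + f"\n\nUser Symptoms: {user_input}\n\nYour Response:"
    )

query_vec = [1.0, 0.1, 0.0]  # pretend embedding of "fever and cough"
contexts = retrieve(query_vec)
prompt = build_prompt("fever and cough", contexts)
print(prompt)
```

Only the two closest records reach the prompt, so the LLM answers from a small, relevant slice of the knowledge base instead of the whole dataset.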

### Advantages over the Rule-Based Method
* Flexible and Scalable: Unlike TF-IDF which depends on exact word matches, embedding-based retrieval finds relevant records based on context and similarity in meaning
* More Robust: Since embeddings capture the semantic meaning behind words and generalize over language structure, synonyms and minor spelling errors have much less effect on performance, unlike TF-IDF which is sensitive to exact tokens
* Context-Aware Responses: RAG combines retrieval with an LLM allowing it to generate human-like responses instead of returning pre-written text
* Easier to Update Knowledge: New information can be added to the embedding database without retraining the LLM

In this section of the project, we will be implementing a RAG-based chatbot using SentenceTransformer to create embeddings and the Llama-2 LLM to generate responses

**Note:** After running some of these cells, you may get warnings in a red box. Warnings are messages that alert us about possible issues in the code that aren't severe enough to stop execution. This is totally normal and you can still proceed with implementing the chatbot!

#### Step 1: Install the auto-gptq and optimum libraries

**auto-gptq:** used for loading and running quantized versions of large language models efficiently (more on this in Milestone 4)  
**optimum:** a library by Hugging Face that helps optimize model inference and training, particularly with quantized models

In [None]:
!pip install auto-gptq optimum

#### Step 2: Import necessary packages

**torch:** imports PyTorch, a popular deep learning framework used for model loading, tensor computations and training/inference  
**transformers:** Hugging Face's library for working with pretrained models  
**AutoTokenizer:** automatically loads the appropriate tokenizer for a given model  
**AutoModelForCausalLM:** loads a causal language model (used for text generation)  
**sentence_transformers:** a library for generating embeddings (vector representations) of sentences  
**SentenceTransformer:** Loads a model to convert text into embeddings  
**util:** Provides utility functions like `semantic_search()` for comparing embeddings.

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from sentence_transformers import SentenceTransformer, util

#### Step 3: Prepare the dataset for embeddings

1. Group rows by disease, join all symptoms into one sentence and convert grouped result back into a dataframe using `reset_index()`
2. Create a **Text** column with the combined disease and its respective symptom list
3. Convert the **Text** column into a list called **corpus** which will be used for generating embeddings. The corpus is our knowledge base.

In [None]:
# fill in the blanks
dis_symp_df = dis_symp_df.groupby("_________")["_________"].apply(lambda x: ", ".join(x))._________()
dis_symp_df["Text"] = dis_symp_df.apply(lambda row: f"Disease: {row['Disease']}. Symptoms: {row['Symptoms']}", axis=1)
corpus = dis_symp_df["Text"]._________()

#### Step 4: Transform the corpus into vector embeddings

1. Load the pre-trained embeddings model `all-MiniLM-L6-v2` from SentenceTransformer
2. Convert each text entry in the corpus into its numerical representation using `encode` and set `convert_to_tensor` to **True** to ensure that the output is in the PyTorch tensor format

In [None]:
# fill in the blanks
embed_model = SentenceTransformer("___________")
corpus_embeddings = embed_model._________(corpus, convert_to_tensor=_________)

#### Step 5: Load the LLM and its tokenizer

1. Specify the pre-trained model. Here, we will be using a compressed version of the Llama-2-7B-Chat model that TheBloke quantized using GPTQ. You can read up more about it in this [link](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GPTQ).
2. Load the corresponding tokenizer using `AutoTokenizer`
3. Load the Llama-2 model using `AutoModelForCausalLM`

**What is Llama-2?**  
Llama-2 is a family of open-source LLMs developed by Meta, designed for natural language understanding and generation tasks.

In [None]:
# fill in the blanks
model_id = "TheBloke/Llama-2-7B-Chat-GPTQ"
tokenizer = ___________.from_pretrained(model_id, use_fast=True)
model = ___________.from_pretrained(
    model_id,
    device_map="auto", # automatically distribute the model across the GPU
    torch_dtype=torch.float16, # uses half-precision to save memory and improve speed
    trust_remote_code=True
)

#### Step 6: Generate a response from the Llama-2 model

1. Tokenize the prompt
2. Generate the output using sampling parameters such as `max_new_tokens`, `do_sample`, `temperature` and `top_p`
3. Decode the response into a readable string using `decode`
4. Remove the original prompt text and return only the generated response

**Sampling Parameters:**  
`max_new_tokens`: controls the length of the generated response  
`do_sample`: enables sampling, picks the next token randomly based on the predicted probability distribution  
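`temperature`: rescales the predicted probabilities before sampling; a lower value makes the output more focused and repeatable, while a higher value makes it more varied and more prone to hallucination  
`top_p`: restricts sampling to the smallest set of most likely tokens whose combined probability reaches `top_p`; a lower value excludes more of the rare, low-probability words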
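
The effect of `temperature` and `top_p` can be seen on a toy distribution. The sketch below is a simplified illustration and not the library's implementation: the logits are made-up scores for three candidate tokens, and `softmax_with_temperature` and `top_p_filter` are hypothetical helper names.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose cumulative
    probability reaches top_p, then renormalize over that set."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
sharp = softmax_with_temperature(logits, 0.2)  # low temperature
flat = softmax_with_temperature(logits, 1.0)   # temperature of 1 leaves scores unchanged
print(max(sharp), max(flat))
print(top_p_filter(flat, 0.9))
```

With the low temperature almost all of the probability lands on the top token, and with `top_p=0.9` the least likely token is removed from the candidate set before sampling.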

In [None]:
# fill in the blanks
def generate_llama2_response(prompt):
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    output = model.generate(
        input_ids,
        max_new_tokens=300,
        do_sample=True,
        temperature=0.2,
        top_p=0.9,
        pad_token_id=tokenizer.eos_token_id
    )
    response = tokenizer._________(output[0], skip_special_tokens=True)
    return response[len(prompt):].strip()

#### Step 7: Generate a response based on our dataset using RAG

1. Convert the user query into embeddings using the same SentenceTransformer object (`embed_model`). This allows us to compare the input semantically with the knowledge base
2. Perform semantic search using util's `semantic_search` function to find the top_k most similar records from **corpus_embeddings**
3. Retrieve the actual text from the original corpus by matching the index
4. Create an effective and descriptive prompt for the LLM
5. Finally, pass the prompt to the Llama-2 function we defined above

In [None]:
# fill in the blanks
def rag_response(user_input):
    query_embedding = ___________.encode(user_input, convert_to_tensor=True)
    hits = util.___________(query_embedding, ___________, top_k=2)[0]
    retrieved_contexts = [corpus[hit["corpus_id"]] for hit in hits]

    prompt = (
        "You are a medical assistant. Based on the medical records below, "
        "suggest top 2 possible diseases the user might have. Be concise and give the response in points. "
        "Make sure to also include a disclaimer at the bottom telling users that this is not a medical diagnosis and they should always consult a doctor.\n\n"
        "Medical Records:\n" + "\n".join(retrieved_contexts) +
        f"\n\nUser Symptoms: {user_input}\n\nYour Response:"
    )

    return generate_llama2_response(prompt)

#### Step 8: Create a chatbot interface for the user

In [None]:
def chatbot():
    print("ChatBot: I can help suggest possible diseases based on your symptoms.")
    print("Type your symptoms ('fever, cough, sore throat'), or type 'exit' to quit.\n")

    while True:
        user_input = input("You: ")

        if user_input.lower() in ['exit', 'quit']:
            print("ChatBot: Goodbye!\n Note: This is not a medical diagnosis. Always consult a licensed physician.")
            break

        # call the rag_response function to obtain the Llama-2 generated output
        response = rag_response(user_input)

        print(f"ChatBot: {response}\n")
        print("Note: This is not a medical diagnosis. Always consult a licensed physician.\n")

**Note:** The following cell may take some time to run because of embeddings generation and semantic search

In [None]:
# Run the chatbot interface
chatbot()

### Test Your Understanding!

Time to try building a RAG chatbot by yourself!  
Let's use the BioMistral model since it is well-suited for medical applications.   
We already have our corpus and its embeddings, so we will start off by loading the model.

#### Step A: Load `BioMistral/BioMistral-7B` and its tokenizer

Refer to Step 5 if you get stuck!

In [None]:
# Your code here

#### Step B: Generate a response from the BioMistral model

Refer to Step 6 if you get stuck!

In [None]:
# Your code here

#### Step C: Generate a response based on our dataset using RAG

Refer to Step 7 if you get stuck!

In [None]:
# Your code here

#### Step D: Create a chatbot interface for the user

Refer to Step 8 if you get stuck!

In [None]:
# Your code here

### Disadvantages of RAG + Embeddings
* Embedding generation, semantic search and LLM inference are resource-intensive and require longer compute times
* Requires a GPU for efficiency

***

## Quick Note on PEFT and Quantization in Fine-Tuning LLMs

**Parameter-Efficient Fine-Tuning (PEFT)**  
PEFT techniques allow you to fine-tune LLMs by updating only a small subset of parameters rather than the entire model. This makes training more efficient and reduces hardware requirements, ideal when working with limited resources.  

**Quantization**  
Quantization means converting model weights from a high-memory format (like 32-bit floats) to a lower one (like 8-bit integers). This helps reduce memory usage and allows large models to run on devices with less RAM and smaller GPUs. It also makes inference faster. For example, models can be run on phones or laptops instead of needing expensive servers.  
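
As a rough illustration of the idea, the sketch below quantizes a handful of made-up weights to 8-bit integers using one shared scale and then restores them. This is a deliberately simple symmetric scheme with hypothetical helper names (`quantize_int8`, `dequantize`); production methods such as GPTQ or 4-bit NF4 are considerably more elaborate.

```python
def quantize_int8(weights):
    """Symmetric quantization: map floats to integers in [-127, 127]
    using a single scale shared by all weights."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in quantized]

weights = [0.42, -1.27, 0.05, 0.9]  # made-up example weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each weight is now stored as a small integer plus one shared scale, at the cost of a rounding error of at most half a scale step per weight.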

**LoRA (Low-Rank Adaptation)**  
LoRA is a technique used during fine-tuning that avoids updating all of the model's weights. Instead, it learns small changes to the model and stores them separately. These changes are computed using two smaller matrices, which means fewer parameters need to be updated. This makes training much faster and lighter.  
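
A quick calculation shows why the two smaller matrices help. For a weight matrix of shape `d_out x d_in`, LoRA trains a `d_out x r` matrix `B` and an `r x d_in` matrix `A` and adds the scaled product `(alpha / r) * B @ A` to the frozen weights. The layer size of 4096 and the tiny matrices below are assumptions chosen for illustration, and `lora_param_counts` and `matmul` are hypothetical helpers.

```python
def lora_param_counts(d_out, d_in, r):
    """Trainable parameters: full weight matrix vs. the two LoRA matrices."""
    full = d_out * d_in
    lora = r * (d_out + d_in)
    return full, lora

def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

# Assumed layer size of 4096 x 4096 with rank r = 8.
full, lora = lora_param_counts(4096, 4096, 8)
print(full, lora, f"{100 * lora / full:.2f}%")

# Tiny example of the update W + (alpha / r) * B @ A with r = 1.
W = [[1.0, 0.0], [0.0, 1.0]]  # frozen 2 x 2 weight matrix
B = [[0.5], [0.0]]            # 2 x r
A = [[0.0, 2.0]]              # r x 2
alpha, r = 2, 1
delta = matmul(B, A)
W_adapted = [
    [W[i][j] + (alpha / r) * delta[i][j] for j in range(2)] for i in range(2)
]
print(W_adapted)
```

Under these assumed sizes the LoRA matrices hold well under one percent of the parameters of the full matrix, while `W` itself is never modified during training.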

**QLoRA**  
QLoRA combines quantization and LoRA. It compresses model weights to 4-bit precision and then fine-tunes the model using LoRA. This lets you fine-tune large models using much less memory without sacrificing too much performance.

In the next milestone, we will implement QLoRA to fine-tune Llama-2 within the GPU constraints.

***

## Milestone 4: Fine-Tuning LLMs

**What does fine-tuning LLMs mean?**
* Fine-tuning means adapting a pre-trained LLM to perform better on a specific task by continuing its training on a domain-specific dataset
* The LLM learns patterns in the dataset and adjusts its internal weights slightly to adapt to that domain to get more relevant responses
* For example, in our case, a base model like Llama-2 may only know general health facts, but after fine-tuning it on our disease-symptom dataset, it should give answers that follow our data more closely

In this section of the project, we will be fine-tuning LLMs for our medical chatbot

### Advantages over Embeddings + RAG Method
* Better Domain Alignment: Fine-tuning tailors the model to specific domain knowledge, improving accuracy
* Faster Inference: Without a retrieval step, fine-tuned models can respond faster

#### Step 1: Install required libraries

**peft:** enables parameter-efficient training for large models  
**datasets:** used to convert a pandas dataframe into a format that is compatible with Hugging Face's Trainer  
**accelerate:** simplifies device placement and mixed-precision training  
**bitsandbytes:** enables quantization to reduce memory usage when training large models

**Note:** Before running the cell below, please restart the session. You can do this by clicking the 3 dots on the upper right-hand corner and selecting *Restart & Clear Cell Outputs*. An error message might appear as you run the cell below, but you can carry on with the project without worrying about it!

In [None]:
# Run this cell only after restarting the session
!pip install -q peft datasets accelerate bitsandbytes

**Additional Note:** Since the session has restarted, the dataset is no longer available. Please return to Milestone 1, run all the cells in that section to reload the data, and then come back to Milestone 4 once you are done!

#### Step 2: Import necessary packages

**TrainingArguments:** specifies training parameters for the LLM  
**Trainer:** training loop abstraction to simplify model training  
**BitsAndBytesConfig:** used for quantized training  
**LoraConfig:** defines the configuration for LoRA fine-tuning  
**get_peft_model:** wraps a base model with PEFT (LoRA) layers  
**prepare_model_for_kbit_training:** prepares a model for 4-bit or 8-bit training  
**PeftModel:** to load a LoRA-trained model for inferencing

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from peft import PeftModel
from datasets import Dataset

#### Step 3: Transform the dataset into required format for fine-tuning Llama-2

Llama models generally require a specific format as the input which is the `[INST] ... [/INST]` format.  
For example, we need to transform our dataset to look like this:
`<s>[INST] abdominal pain, fever [/INST] Appendicitis`  

1. Define a function that formats the input passed into the format we discussed above
2. Apply the `format_prompt` function to each row of the dataframe and create a new column called **text** that stores the formatted prompt for each row
3. Convert the dataframe into a Hugging Face `Dataset` object

In [None]:
# fill in the blanks
def format_prompt(row):
    return f"<s>[INST] {row['Symptoms']} [/INST] {row['Disease']}"

dis_symp_df["text"] = dis_symp_df.apply(__________, axis=1)

formatted_df = __________.from_pandas(dis_symp_df)

Check out how the first entry of `formatted_df` looks

In [None]:
# Your code here

#### Step 4: Load the Llama-2 chat model using QLoRA

1. Specify the Hugging Face model we want to load. Here we will be using `NousResearch/Llama-2-7b-chat-hf` which is a 7B parameter version of Llama-2
2. Set up the 4-bit quantization for QLoRA using `BitsAndBytesConfig` and define the parameter values
3. Load the quantized model using `AutoModelForCausalLM`

In [None]:
# fill in the blanks
model_name = "____________________________"

bnb_config = _________________(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16
)

model = _________________.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map={"": 0},  # explicitly use GPU 0 (GPU T4 x2)
)

#### Step 5: Set up LoRA for a 4-bit quantized Llama-2 model

1. Prepare the 4-bit quantized model for training using `prepare_model_for_kbit_training`
2. Create the configuration for LoRA and define the parameter values
3. Wrap the model with LoRA using the defined configuration. This results in only a small set of trainable weights, which reduces compute and memory needs

In [None]:
# fill in the blanks
model = _____________________(model)

lora_config = LoraConfig(
    r=8, # rank of the LoRA update matrices
    lora_alpha=16, # scaling factor for the LoRA weights
    target_modules=["q_proj", "v_proj"],  # adjust based on model architecture, here we apply LoRA only to the query and value projection layers of attention
    lora_dropout=0.1, # dropout applied to LoRA layers during training to avoid overfitting
    bias="none", # do not train the bias parameters
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)

#### Step 6: Tokenize Dataset

1. Load the corresponding tokenizer for our model
2. Set the `pad_token` to be the same as the `eos_token` since models like Llama do not have a separate padding token defined by default

In [None]:
# fill in the blanks
tokenizer = _______________.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.___________

3. Define a function that takes one row and processes it for training
4. Using `tokenizer`, convert the input text into token IDs
5. Set labels to be a copy of input_ids. In causal language modeling, the model is trained to predict the next token, so the input and output sequences are the same
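
Why a plain copy works as the label can be sketched with a few made-up token IDs. Hugging Face causal language models shift the labels by one position internally when computing the loss, so the token at each position is scored against the token that follows it. The IDs below are arbitrary numbers chosen for illustration.

```python
token_ids = [101, 7, 42, 13, 2]  # made-up token IDs for one sequence
labels = list(token_ids)         # labels start as a copy of the inputs

# The shift: the model reads position t and is scored on the token at t + 1.
inputs_seen = token_ids[:-1]
targets = labels[1:]
pairs = list(zip(inputs_seen, targets))
print(pairs)
```

Each pair reads as "after seeing this token, predict that one", which is the next-token objective applied along the whole sequence.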

In [None]:
def tokenize_function(example):
    result = tokenizer(
        example["text"],
        truncation=True,
        padding="max_length",
        max_length=256
    )
    result["labels"] = result["input_ids"].copy()
    return result

6. Apply the `tokenize_function` to each row of the `formatted_df` and remove columns **text**, **Disease** and **Symptoms** to keep only the tokenized inputs
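
What `truncation=True` and `padding="max_length"` do to each row can be sketched without a tokenizer. The helper `pad_or_truncate` is hypothetical, the token IDs are made up, and padding on the right is an assumption of this sketch; the real tokenizer decides the padding side from its own settings.

```python
def pad_or_truncate(ids, max_length, pad_id):
    """Cut the sequence to max_length, then pad on the right with pad_id.
    Returns the fixed-length IDs and a mask marking real tokens with 1."""
    ids = ids[:max_length]
    n_pad = max_length - len(ids)
    attention_mask = [1] * len(ids) + [0] * n_pad
    return ids + [pad_id] * n_pad, attention_mask

short_ids, short_mask = pad_or_truncate([5, 6, 7], max_length=5, pad_id=0)
long_ids, long_mask = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_length=5, pad_id=0)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every row ends up with the same length, which is what lets the rows be stacked into batches during training.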

In [None]:
# fill in the blanks
tokenized_datasets = formatted_df.map(______________, remove_columns=["________", "________", "________"])

#### Step 7: Define training parameters for fine-tuning the Llama-2 model

In [None]:
training_args = TrainingArguments(
    output_dir="./results",
    run_name="llama2-finetune",
    report_to="none",
    logging_strategy="steps",
    logging_steps=1,
    num_train_epochs=1,
    fp16=False,
    bf16=False,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    gradient_checkpointing=True,
    learning_rate=2e-4,
    weight_decay=0.001,
    optim="paged_adamw_32bit",
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    max_steps=-1,
    max_grad_norm=0.3,
    group_by_length=True,
    save_steps=0
)
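
The combination of `learning_rate=2e-4`, `warmup_ratio=0.03` and `lr_scheduler_type="cosine"` describes a learning rate that rises linearly during a short warmup and then decays along a cosine curve. The function below is a minimal sketch of that shape, not the library's scheduler; `lr_at_step` is a hypothetical name and the total of 1000 steps is an assumed value, since the real number depends on the dataset size and batch size.

```python
import math

def lr_at_step(step, total_steps, peak_lr=2e-4, warmup_ratio=0.03):
    """Linear warmup to peak_lr, then cosine decay towards zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total_steps = 1000  # assumed number of optimizer steps
for step in (0, 15, 30, 500, 1000):
    print(step, lr_at_step(step, total_steps))
```

The short warmup avoids large updates while the optimizer statistics are still unreliable, and the gradual decay lets training settle towards the end of the epoch.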

#### Step 8: Initialize `Trainer`

Define the following parameters:  
**model:** the LoRA-wrapped Llama-2 model we are fine-tuning  
**args:** the training arguments we defined above  
**train_dataset:** the tokenized dataset that contains the formatted and encoded input-output pairs  
**tokenizer:** the tokenizer used to process inputs and decode outputs to ensure consistency between training and generation

In [None]:
# fill in the blanks
trainer = Trainer(
    model=model,
    args=_____________,
    train_dataset=tokenized_datasets,
    tokenizer=_____________,
)

#### Step 9: Train your LLM

Finally, after the preprocessing and parameter definition, we can train our LLM!

In [None]:
trainer.train()

#### Step 10: Model Inferencing

Now that we have our fine-tuned LLM, we will use it to predict possible diseases for different user inputs.

1. Save the fine-tuned Llama-2 model and tokenizer to a specific directory in the Kaggle environment

In [None]:
output_dir = "/kaggle/working/llama2-med-chatbot"

trainer.model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

print(f"Model and tokenizer saved to: {output_dir}")

2. Define the base model, calling the original Llama-2 model `NousResearch/Llama-2-7b-chat-hf`
3. Load the tokenizer from `output_dir`, set `pad_token` to `eos_token` and set `padding_side` to right which is standard for causal language models

In [None]:
# fill in the blanks
base_model_name = "NousResearch/Llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(_____________, trust_remote_code=True)
tokenizer.___________ = tokenizer.eos_token
tokenizer.padding_side = "_________"

4. Load the base model with quantization using the bitsandbytes configuration defined above

In [None]:
# fill in the blanks
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    quantization_config=______________,
    device_map={"": 0}
)

5. Attach the LoRA fine-tuned weights from `output_dir` to the base model using `PeftModel`
6. Set the model to evaluation using the `eval` function to put the model in inference mode

In [None]:
# fill in the blanks
model = ____________.from_pretrained(base_model, output_dir)
model._________()

7. Create the chatbot interface function for the user

In [None]:
def chatbot():
    print("ChatBot: I can help suggest possible diseases based on your symptoms.")
    print("Type your symptoms (e.g., 'fever, cough, sore throat'), or type 'exit' to quit.\n")

    while True:
        user_input = input("You: ")

        if user_input.lower() in ['exit', 'quit']:
            print("ChatBot: Goodbye!\nNote: This is not a medical diagnosis. Always consult a licensed physician.")
            break

        instruction = "List the top 2 possible diseases for these symptoms:"
        # formatting the prompt using the required Llama-2 structure
        prompt = f"""<s>[INST] <<SYS>>
{instruction}
<</SYS>>

Symptoms: {user_input} [/INST]"""

        # converts prompt into token IDs
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

        # to generate response from the model with key parameters
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=300,
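                do_sample=False,  # greedy decoding: always pick the most likely next token
                temperature=0.2,  # ignored while do_sample=False
                top_p=0.9,        # ignored while do_sample=False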
                eos_token_id=tokenizer.eos_token_id
            )

        # decode the output tokens into readable text
        full_response = tokenizer.decode(outputs[0], skip_special_tokens=True)

        # extracts only the relevant part after [/INST] which contains the response
        if "[/INST]" in full_response:
            answer = full_response.split("[/INST]")[-1].strip()
        else:
            answer = full_response.strip()

        print(f"ChatBot: {answer}\n")
        print("Note: This is not a medical diagnosis. Always consult a licensed physician.\n")

In [None]:
chatbot()
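
The prompt formatting and answer extraction inside `chatbot()` can be checked without loading a model. The sketch below pulls those two steps into small helper functions (`build_prompt` and `extract_answer` are hypothetical names, not part of the notebook) and runs them on a made-up decoded output, so the disease names are placeholders rather than real model predictions.

```python
def build_prompt(instruction: str, symptoms: str) -> str:
    """Format a prompt with the same [INST] / <<SYS>> layout used in chatbot()."""
    return (
        f"<s>[INST] <<SYS>>\n{instruction}\n<</SYS>>\n\n"
        f"Symptoms: {symptoms} [/INST]"
    )

def extract_answer(full_response: str) -> str:
    """Return the text after the last [/INST] marker, or the whole text if it is absent."""
    if "[/INST]" in full_response:
        return full_response.split("[/INST]")[-1].strip()
    return full_response.strip()

instruction = "List the top 2 possible diseases for these symptoms:"
prompt = build_prompt(instruction, "fever, cough, sore throat")

# Hypothetical decoded output: the prompt echoed back, followed by a made-up completion.
fake_response = prompt + " Common cold, Influenza"
print(extract_answer(fake_response))
```

Splitting on the last `[/INST]` marker matters because the decoded output contains the prompt followed by the generated text, and we only want to show the generated part to the user.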

#### Why do you think the fine-tuned LLM may not always give exact responses from the dataframe?

**Hint:** Answer along the lines of the generative nature of LLMs, training parameters and sampling parameters (temperature, top_p)
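
To build intuition for the sampling parameters mentioned in the hint, here is a toy sketch of how temperature and top_p reshape a next-token distribution. The logits are hypothetical scores for four candidate tokens, and the two functions are simplified illustrations of the general idea, not the implementation used by the `transformers` library.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of most-likely tokens whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    kept_total = sum(probs[i] for i in kept)
    # renormalize so the kept probabilities sum to 1
    return {i: probs[i] / kept_total for i in kept}

# Hypothetical scores for four candidate next tokens.
logits = [2.0, 1.5, 0.5, -1.0]

for t in (1.0, 0.2):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
    print(f"  tokens kept with top_p=0.9: {sorted(top_p_filter(probs, 0.9))}")
```

Try a few different temperatures and top_p values and observe how many candidate tokens remain available to be sampled.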

***Your answer here***

### Test Your Understanding!

Time to try fine-tuning an LLM by yourself!  
BioMistral is a domain-specific version of the Mistral LLM, fine-tuned on biomedical and clinical data. It is designed to perform better on healthcare-related tasks. You can read more about it at this [link](https://huggingface.co/BioMistral).  
We already have our formatted dataframe, so we will start off by loading the model.

#### Step A: Load `BioMistral/BioMistral-7B` using QLoRA

Refer to Step 4 if you get stuck!

In [None]:
# Your code here

#### Step B: Set up LoRA for a 4-bit quantized BioMistral model

**Hint:** For BioMistral, the `target_modules` in `LoraConfig` may differ from the earlier step because we are switching to a different LLM architecture. Define the `target_modules` as `["q_proj", "v_proj", "k_proj", "o_proj"]`.  
Refer to Step 5 if you get stuck!
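
To see what targeting these four projection layers means in terms of model size, the sketch below estimates how many trainable parameters LoRA would add. The layer shapes and layer count are assumptions for a Mistral-7B-style architecture (they are not read from the model), and the rank of 16 is a hypothetical value, so substitute the numbers from your own `LoraConfig`.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

# Assumed shapes for a Mistral-7B-style attention block (not read from the model):
# hidden size 4096, with 1024-wide key and value projections.
PROJECTION_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
}
NUM_LAYERS = 32   # assumed number of transformer layers
RANK = 16         # hypothetical LoRA rank; use the value from your own LoraConfig

per_layer = sum(lora_params(d_in, d_out, RANK) for d_in, d_out in PROJECTION_SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"Trainable LoRA parameters: {total:,}")
```

Under these assumptions the adapter is a tiny fraction of the roughly 7 billion base parameters. Once your model is wrapped with the LoRA configuration, calling `print_trainable_parameters()` on it should let you compare against the actual count.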

In [None]:
# Your code here

#### Step C: Define the tokenizer for the model and tokenize the dataset

Refer to Step 6 if you get stuck!

In [None]:
# Your code here

In [None]:
# Your code here

#### Step D: Define training parameters for fine-tuning

**Hint:** Use the same parameters as in Step 7

In [None]:
# Your code here

#### Step E: Initialize Trainer

In [None]:
# Your code here

#### Step F: Train your LLM

In [None]:
# Your code here

#### Step G: Model Inferencing

In [None]:
# fill in the blanks
output_dir = "/kaggle/working/biomistral-chatbot"

trainer._________________(output_dir)
tokenizer.save_pretrained(output_dir)

print(f"Model and tokenizer saved to: {output_dir}")

In [None]:
# fill in the blanks
output_dir = "/kaggle/working/biomistral-chatbot"
base_model_name = "BioMistral/BioMistral-7B"

tokenizer = AutoTokenizer.________________(output_dir, trust_remote_code=True)
tokenizer.______________ = tokenizer._____________
tokenizer.padding_side = "right"

base_model = AutoModelForCausalLM.from_pretrained(
    _______________,
    quantization_config=_____________,
    device_map={"": 0}
)

model = PeftModel.from_pretrained(_____________, output_dir)
model.eval()

In [None]:
def chatbot():
    print("ChatBot: I can help suggest possible diseases based on your symptoms.")
    print("Type your symptoms (e.g., 'fever, cough, sore throat'), or type 'exit' to quit.\n")

    while True:
        user_input = input("You: ")

        if user_input.lower() in ['exit', 'quit']:
            print("ChatBot: Goodbye!\nNote: This is not a medical diagnosis. Always consult a licensed physician.")
            break

        instruction = "List the top 2 possible diseases for these symptoms:"
        # formatting the prompt using the same Llama-2-style [INST] / <<SYS>> structure as before
        prompt = f"""<s>[INST] <<SYS>>
{instruction}
<</SYS>>

Symptoms: {user_input} [/INST]"""

        # converts prompt into token IDs
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

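        # generate a response from the model with key parameters
        # note: with do_sample=False decoding is greedy, so temperature and top_p
        # are ignored; set do_sample=True if you want them to take effect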
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=300,
                do_sample=False,
                temperature=0.2,
                top_p=0.9,
                eos_token_id=tokenizer.eos_token_id
            )

        # decode the output tokens into readable text
        full_response = tokenizer.decode(outputs[0], skip_special_tokens=True)

        # extracts only the relevant part after [/INST] which contains the response
        if "[/INST]" in full_response:
            answer = full_response.split("[/INST]")[-1].strip()
        else:
            answer = full_response.strip()

        print(f"ChatBot: {answer}\n")
        print("Note: This is not a medical diagnosis. Always consult a licensed physician.\n")

In [None]:
chatbot()

## Submission Instructions

Congratulations! You have successfully developed your own medical chatbots!  

We would once again like to point out that these chatbots were developed solely for learning purposes and should not be used in case of medical emergencies.

To submit your amazing work, please follow the steps below:
* Rename this notebook to *[Your Name]Medical_Chatbot*
* Download the notebook
* Send your notebook to nsdc@nebigdatahub.org
* Once our team receives your submission, you will be awarded a certificate of completion!

Thank you for participating in this project, and please reach out to nsdc@nebigdatahub.org if you have any questions!