<center><p float="center">
  <img src="https://upload.wikimedia.org/wikipedia/commons/e/e9/4_RGB_McCombs_School_Brand_Branded.png" width="300" height="100"/>
  <img src="https://mma.prnewswire.com/media/1458111/Great_Learning_Logo.jpg?p=facebook" width="200" height="100"/>
</p></center>

<center><font size=10>Generative AI for Business Applications</font></center>
<center><font size=6>Fine-Tuning LLMs - Week 1</font></center>

<center><p float="center">
  <img src="https://images.pexels.com/photos/5699431/pexels-photo-5699431.jpeg" width=720>
</p></center>
<center><font size=6>Fine-Tuned AI for Summarizing Medical Conversations</font></center>

# **Problem Statement**

## **Business Context**

In today’s fast-paced healthcare environment, doctors often struggle with the time-consuming task of documenting patient consultations. These conversations are usually long, unstructured, and difficult to review later, which makes writing and editing notes a heavy burden. The inefficiencies carry over into follow-ups, as important details may be buried within lengthy transcripts. To address this challenge, a healthcare technology team is developing an LLM-powered assistant designed to streamline clinical documentation. This AI assistant will automatically generate structured summaries, highlight key patient details, and organize consultation notes for easy review. By applying this solution to real-world consultation data, the team aims to demonstrate how advanced language models can reduce administrative workload, minimize errors, and ultimately improve the quality of patient care.

##  **Objective**

The goal is to develop an AI-powered system that demonstrates how Natural Language Processing (NLP) can support doctors by automatically generating clear and structured consultation notes from patient interactions.

Specifically, the system aims to:

* Capture patient concerns, including symptoms, medical history, and lifestyle factors.
* Summarize the doctor’s findings and recommendations, such as diagnoses, prescriptions, tests, and follow-up actions.
* Ensure that all notes maintain a consistent, professional clinical tone.
* Improve accuracy and reliability by fine-tuning on domain-specific medical dialogues.

This case study focuses on building a prototype that transforms unstructured doctor-patient conversations into concise, structured clinical summaries. By doing so, it reduces the time spent on manual documentation, minimizes the risk of missed information, and enhances the overall quality of patient care.

Through successful implementation, the organization seeks to reduce administrative workload, support more efficient follow-ups, and ultimately improve healthcare outcomes.


## **Data Description**

The dataset is divided into three files: training, validation, and test. Each file has two primary columns:

1. **Conversation** (`conversation`) - Contains the raw transcripts of doctor-patient dialogues, which are often long and unstructured.

2. **Summary** (`summary`) - Provides the corresponding concise, structured clinical summary, which serves as the target for supervised summarization training.


# **Installing and Importing Necessary Libraries**

In [2]:
%pip install --no-deps bitsandbytes accelerate xformers==0.0.32.post2 peft trl==0.15.2 triton cut_cross_entropy unsloth_zoo
%pip install sentencepiece protobuf huggingface_hub hf_transfer
%pip install transformers==4.51.3
%pip install --no-deps unsloth
%pip install -q datasets evaluate bert-score

DEPRECATION: Configuring installation scheme with distutils config files is deprecated and will no longer work in the near future. If you are using a Homebrew or Linuxbrew Python, please see discussion at https://github.com/Homebrew/homebrew-core/issues/76621
Collecting bitsandbytes
  Using cached bitsandbytes-0.42.0-py3-none-any.whl.metadata (9.9 kB)
Collecting accelerate
  Using cached accelerate-1.10.1-py3-none-any.whl.metadata (19 kB)
Collecting xformers==0.0.32.post2
  Using cached xformers-0.0.32.post2.tar.gz (12.1 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting peft
  Using cached peft-0.17.1-py3-none-any.whl.metadata (14 kB)
Collecting trl==0.15.2
  Using cached trl-0.15.2-py3-none-any.whl.metadata (11 kB)
ERROR: Could not find a version that satisfies the requirement triton (from versions: none)

**Note**:
- After running the above cell, kindly restart the runtime (for Google Colab) or notebook kernel (for Jupyter Notebook), and run all cells sequentially from the next cell.
- When you run the cell above, pip may print warnings or errors about package dependency conflicts. These messages can be ignored, because the pinned versions above are the ones needed to run the code in ***this notebook***.

In [4]:
from unsloth import FastLanguageModel
import torch
import evaluate
from tqdm import tqdm
import pandas as pd
from datasets import Dataset

from trl import SFTTrainer
from transformers import TrainingArguments, EarlyStoppingCallback, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported

NotImplementedError: Unsloth currently only works on NVIDIA GPUs and Intel GPUs.

# **1. Evaluation of LLM before Fine-Tuning**

Before investing time and resources in fine-tuning, it is important to measure **how well the base model performs "as is."**
This gives us a baseline score, which we will later compare against fine-tuned performance to demonstrate improvement.

### **Loading the Testing Data**


In [None]:
testing_data = pd.read_csv("/content/finetuning_medical_testing.csv")           # Load medical test dataset containing dialogues and gold summaries

test_dialogues = testing_data['conversation'].tolist()                          # dialogues (inputs)
test_summaries = testing_data['summary'].tolist()                               # human-written summaries (ground truth)

### **Loading the Mistral Model**

We use the **Mistral-7B Instruct model** (a 7-billion-parameter model tuned to follow instructions).
It is loaded with **4-bit quantization**, which sharply reduces GPU memory use, typically at only a small cost in accuracy.
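
As a rough, back-of-the-envelope illustration of why 4-bit loading matters, the sketch below estimates the storage needed for the weights alone. The parameter count (about 7.24 billion) is an assumption based on the publicly stated size of Mistral-7B, and the estimate ignores quantization metadata, layers kept in higher precision, activations, and the KV cache, so real GPU usage will be higher.

```python
# Rough estimate of weight memory at different precisions (weights only).
N_PARAMS = 7.24e9  # assumed approximate parameter count for Mistral-7B


def weight_memory_gib(n_params: float, bits_per_param: int) -> float:
    """Return the storage needed for the weights alone, in GiB."""
    total_bytes = n_params * bits_per_param / 8
    return total_bytes / (1024 ** 3)


for label, bits in [("fp16/bf16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>9}: {weight_memory_gib(N_PARAMS, bits):.1f} GiB")
```

Under these assumptions, 4-bit weights need about a quarter of the memory of 16-bit weights, which is what makes a 7B model practical on a single Colab-class GPU.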


In [None]:
# Load the instruction-tuned Mistral 7B model with 4-bit quantization
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/mistral-7b-instruct-v0.2-bnb-4bit",                     # model name
    max_seq_length=4096,                                                        # maximum sequence length
    dtype=None,                                                                 # auto-select data type
    load_in_4bit=True                                                           # load in 4-bit for memory efficiency
)

In [None]:
FastLanguageModel.for_inference(model)

### **Generate Summaries**


We now ask the model to summarize dialogues **without any fine-tuning**.
A custom prompt template is used to ensure the model follows instructions consistently.

In [None]:
alpaca_prompt_template = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Write a concise summary of the following dialogue.

### Input:
{}

### Response:
{}
"""

We will create a function to manage the entire prediction process, which includes the following steps:
1. Structure the Prompt
2. Tokenize the Prompt
3. Generate the Output from Tokens
4. Decode the Generated Output

This function allows us to avoid code repetition and streamline the process. Whenever we need to test a model, we can simply call this function instead of rewriting these steps each time.
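
To make step 1 concrete, here is a minimal sketch of how a prompt is structured with plain `str.format`. The template is a shortened stand-in for the notebook's Alpaca template, and the two-turn dialogue is invented purely for illustration.

```python
# Toy illustration of step 1 (structuring the prompt) using plain str.format.
demo_template = """### Instruction:
Write a concise summary of the following dialogue.

### Input:
{}

### Response:
{}"""

# Hypothetical dialogue, made up for this example only.
demo_dialogue = "Doctor: What brings you in today?\nPatient: I've had a dry cough for a week."

# At inference time the response slot is left empty so the model fills it in.
demo_prompt = demo_template.format(demo_dialogue, "")
print(demo_prompt)
```

The prompt ends right after the `### Response:` header, so whatever the model generates next is treated as the summary.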


In [None]:
def generate_summaries(dialogues, model, tokenizer, prompt_template, max_new_tokens=100):
    summaries = []

    for dialogue in dialogues:
        # create a prompt for the dialogue
        prompt = prompt_template.format(dialogue," ")

        # tokenize the prompt
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

        # generate the output
        outputs = model.generate(**inputs,
                                 max_new_tokens=max_new_tokens,
                                 pad_token_id=tokenizer.eos_token_id
                                 )

        # decode: skip input tokens, keep only generated part
        input_len = inputs["input_ids"].shape[-1]
        summary = tokenizer.decode(outputs[0][input_len:],
                                   skip_special_tokens=True,
                                   clean_up_tokenization_spaces=True
                                   )

        summaries.append(summary.strip())

    return summaries
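
The slicing step in the function can look opaque at first: for a decoder-only model, `generate` returns the prompt tokens followed by the newly generated ones, so the prompt has to be cut off before decoding. A toy sketch with made-up integer IDs (a real tokenizer would produce different values):

```python
# Toy illustration of keeping only the newly generated tokens.
# The integer IDs below are invented; they stand in for real token IDs.
toy_input_ids = [101, 7592, 2088, 102]               # pretend prompt tokens
toy_output_ids = toy_input_ids + [2023, 2003, 1037]  # prompt echoed back + continuation

toy_input_len = len(toy_input_ids)
generated_only = toy_output_ids[toy_input_len:]       # drop the echoed prompt
print(generated_only)
```

Without this slice, every decoded "summary" would begin with the full instruction and dialogue, which would distort the evaluation scores.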

In [None]:
# Calling the function to generate summaries
predicted_summaries = generate_summaries(dialogues=test_dialogues,
                                         model=model,
                                         tokenizer=tokenizer,
                                         prompt_template=alpaca_prompt_template
                                         )

### **Evaluate Using BERTScore**


**BERTScore** is an evaluation metric for text generation tasks such as summarization, translation, and captioning. Unlike traditional metrics like ROUGE or BLEU that rely on exact word overlaps, BERTScore uses embeddings from a pre-trained BERT model to measure **semantic similarity** between the generated text (predictions) and the human-written text (references). This makes it more robust in capturing meaning, even when different words are used.

* **Precision** - Measures how much of the content in the generated text is actually relevant to the reference. A high precision means the model is not adding irrelevant or “extra” information.

* **Recall** - Measures how much of the important content from the reference is captured by the generated text. A high recall means the model covers most of the key points, even if it includes some extra details.

* **F1 Score** - Combines both precision and recall into a balanced score. It shows how well the generated text both *covers the important content* and *stays relevant*. This is usually reported as the main metric for BERTScore.

In short, BERTScore helps evaluate not just word matching but whether the **meaning** of the generated text aligns with the reference.
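
The precision/recall idea can be sketched in a few lines of plain Python. This is a simplified illustration, not the `bert-score` implementation: the 2-D vectors are invented stand-ins for contextual token embeddings, and optional IDF weighting and baseline rescaling are left out.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_match_scores(cand_vecs, ref_vecs):
    """Return (precision, recall, f1) using greedy best-match similarity."""
    # Precision: each candidate token is matched to its most similar reference token.
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    # Recall: each reference token is matched to its most similar candidate token.
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up "embeddings": the candidate covers both reference tokens plus one extra token.
candidate_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
reference_vecs = [[1.0, 0.0], [0.0, 1.0]]

p, r, f = greedy_match_scores(candidate_vecs, reference_vecs)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

In this toy case recall is perfect because every reference token has an exact match, while precision is lower because the extra candidate token only partially matches anything in the reference.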


In [None]:
bert_scorer = evaluate.load("bertscore")                                        # Load BERTScore evaluation metric

score = bert_scorer.compute(                                                    # Compute BERTScore between generated summaries and gold references
    predictions=predicted_summaries,                                            # model-generated summaries
    references=test_summaries,                                                  # ground-truth summaries
    lang='en',                                                                  # language
    rescale_with_baseline=True                                                  # normalize scores for fair comparison
)


baseline_score = sum(score['f1']) / len(score['f1'])                            # Compute the average F1 score across test set
print(baseline_score)                                                           # Print the average F1 score

**The average BERTScore F1 of the base Mistral model is 0.37**


# **2. Fine-Tuning an LLM**

Now that we have measured the **baseline performance** of the Mistral-7B model, the next step is to **fine-tune the model on our medical dataset.**

Fine-tuning helps the model specialize in summarizing medical dialogues, improving relevance and accuracy beyond its general-purpose training.

### **Load and Prepare Training Data**

We begin with the **training dataset**, which contains doctor-patient conversations paired with gold-standard summaries.

We then convert the dataset into a Hugging Face `Dataset` format for efficient handling.

In [None]:
# Load training dataset
training = pd.read_csv("/content/finetuning_medical_training.csv")
training_dict = training.to_dict(orient='list')

# Create a dataset from the dictionary
training_dataset = Dataset.from_dict(training_dict)

In [None]:
# Instruction-tuned prompt template for summarization
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

To make training effective, we format each example into this structure and append an EOS token to clearly signal the end of the response.

In [None]:
EOS_TOKEN = tokenizer.eos_token

In [None]:
# Format each training example into the Alpaca prompt structure
def prompt_formatter(example, prompt_template):
    instruction='Write a concise summary of the following dialogue.'
    dialogue=example["conversation"]
    summary=example["summary"]

    formatted_prompt = prompt_template.format(instruction, dialogue, summary) + EOS_TOKEN

    return {'text': formatted_prompt}
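
To see what a single formatted training example looks like, here is a small self-contained sketch. The record is invented, the template is shortened, and `"</s>"` is assumed as the EOS string; in the notebook the real value comes from `tokenizer.eos_token`.

```python
# Sketch of one formatted training example (all values are illustrative).
DEMO_EOS = "</s>"  # assumed EOS string; the notebook reads it from the tokenizer

demo_train_template = """### Instruction:
{}

### Input:
{}

### Response:
{}"""

# Hypothetical record, invented for illustration.
demo_example = {
    "conversation": "Doctor: Any fever?\nPatient: No, just a sore throat.",
    "summary": "Patient reports a sore throat without fever.",
}

demo_text = demo_train_template.format(
    "Write a concise summary of the following dialogue.",
    demo_example["conversation"],
    demo_example["summary"],
) + DEMO_EOS
print(demo_text)
```

Because the EOS marker directly follows the summary, the model learns to stop once the summary is complete rather than continuing to generate text.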

In [None]:
# Apply formatting to training dataset
formatted_training_dataset = training_dataset.map(
    prompt_formatter,
    fn_kwargs={'prompt_template': alpaca_prompt}
)

### **Load and Prepare Validation Data**

We also prepare a **validation dataset** (unseen during training).
This allows us to track the model's progress and prevent overfitting.
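
One common way validation data is used to limit overfitting is patience-based early stopping, the idea behind callbacks such as the `EarlyStoppingCallback` imported earlier. The sketch below shows only the rule itself in plain Python; it is not the `transformers` implementation, the loss values are invented, and periodic evaluation would also need to be switched on in the training arguments for such a rule to apply.

```python
def early_stop_index(eval_losses, patience=2, min_delta=0.0):
    """Return the evaluation index at which training would stop, or None."""
    best = float("inf")
    evals_without_improvement = 0
    for i, loss in enumerate(eval_losses):
        if loss < best - min_delta:
            best = loss                      # new best validation loss
            evals_without_improvement = 0
        else:
            evals_without_improvement += 1   # no improvement at this evaluation
            if evals_without_improvement >= patience:
                return i
    return None


# Invented validation losses: improvement stalls after the third evaluation.
demo_losses = [1.20, 0.95, 0.90, 0.92, 0.93, 0.91]
stop_at = early_stop_index(demo_losses, patience=2)
print(stop_at)
```

With a patience of 2, training would halt at the second consecutive evaluation that fails to beat the best loss seen so far.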

In [None]:
# Load validation dataset
validation=pd.read_csv("/content/finetuning_medical_validation.csv")
validation_dict =validation.to_dict(orient='list')

# Create a dataset from the dictionary
validation_dataset = Dataset.from_dict(validation_dict)

In [None]:
# Apply formatting to validation dataset
formatted_validation_dataset = validation_dataset.map(
    prompt_formatter,
    fn_kwargs={'prompt_template': alpaca_prompt}
)

### **Apply Parameter-Efficient Fine-Tuning (LoRA)**

We now patch in the adapter modules to the base model using the `get_peft_model` method.

In [None]:
# Apply LoRA fine-tuning (PEFT) to the base model
model = FastLanguageModel.get_peft_model(
    model,                                                                      # base model to fine-tune
    r=16,                                                                       # rank of LoRA matrices
    lora_alpha=16,                                                              # scaling factor for LoRA updates
    lora_dropout=0,                                                             # dropout (0 means no dropout)
    bias="none",                                                                # don't fine-tune bias terms
    target_modules=[                                                            # layers where LoRA is applied
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
    use_gradient_checkpointing=True,                                            # saves memory by recomputing activations
    random_state=42,                                                            # reproducibility
    loftq_config=None                                                           # not using LoftQ initialization here
)

The base model was loaded earlier with `FastLanguageModel.from_pretrained`; `get_peft_model` then attaches LoRA (Low-Rank Adaptation) adapter modules to the listed projection layers. Only these small adapter weights are trained, while the quantized base weights stay frozen. The adapters now appear as part of the model's printed structure, which the next cell displays.
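
To get a feel for how small the adapters are, the sketch below counts the LoRA parameters implied by `r=16` on the seven target modules. The layer shapes and layer count are assumptions taken from the public Mistral-7B configuration (hidden size 4096, key/value size 1024, MLP size 14336, 32 layers); they are not read from the loaded model.

```python
# Assumed Mistral-7B dimensions (from the public config, not from the loaded model).
HIDDEN, KV_DIM, MLP_DIM, N_LAYERS = 4096, 1024, 14336, 32

# (in_features, out_features) for each adapted projection in one transformer layer.
PROJECTION_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}


def lora_param_count(shapes, rank, n_layers):
    """Each adapted weight gains A (rank x in) and B (out x rank) matrices."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * n_layers


lora_rank, lora_alpha_value = 16, 16
trainable = lora_param_count(PROJECTION_SHAPES, lora_rank, N_LAYERS)
scaling = lora_alpha_value / lora_rank   # factor applied to the low-rank update
print(f"{trainable:,} trainable LoRA parameters, update scaling = {scaling}")
```

Under these assumed shapes the count comes to 41,943,040 parameters, well under 1% of the roughly 7 billion base weights, and `lora_alpha = r` gives a scaling factor of 1.0.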

In [None]:
model

### **Configure Training**

We now define training arguments for **supervised fine-tuning (SFT)** using the `SFTTrainer` from Hugging Face's TRL library.
The setup is sized so that training runs efficiently on a single Colab GPU.

In [None]:
# Supervised Fine-Tuning (SFT) trainer setup
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = formatted_training_dataset,
    eval_dataset = formatted_validation_dataset,
    dataset_text_field = "text",                                                # column containing the text
    max_seq_length = 2048,                                                      # max sequence length
    data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer),                # handles padding, batching
    dataset_num_proc = 2,                                                       # parallel dataset processing
    packing = False,                                                            # set True to pack short sequences (faster)

    # Training hyperparameters
    args = TrainingArguments(
        per_device_train_batch_size = 2,                                        # batch size per GPU
        gradient_accumulation_steps = 4,                                        # simulate larger batch via accumulation
        warmup_steps = 5,                                                       # warmup before LR schedule
        max_steps = 60,                                                         # limit training steps (overrides epochs)
        learning_rate = 2e-4,                                                   # base learning rate
        fp16 = not is_bfloat16_supported(),                                     # use fp16 if bf16 not available
        bf16 = is_bfloat16_supported(),                                         # use bf16 if supported (better precision)
        logging_steps = 1,                                                      # log every step
        optim = "adamw_8bit",                                                   # memory-efficient optimizer
        weight_decay = 0.01,                                                    # weight decay for regularization
        lr_scheduler_type = "linear",                                           # linear learning rate decay
        seed = 3407,                                                            # reproducibility
        output_dir = "outputs",                                                 # save model outputs
        report_to = "none"                                                      # disable logging to external services
    ),
)
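
A few of these hyperparameters are easier to reason about with a quick calculation. The sketch below, which assumes a single GPU, works out the effective batch size and the shape of a linear warmup-then-decay learning-rate schedule. It mirrors the general form of such a schedule and is not the trainer's own code, so exact values at individual steps may differ slightly.

```python
def linear_schedule_lr(step, base_lr=2e-4, warmup_steps=5, max_steps=60):
    """Learning rate at an optimizer step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max_steps - step
    return base_lr * max(0.0, remaining / max(1, max_steps - warmup_steps))


per_device_batch, grad_accum, total_steps = 2, 4, 60
effective_batch = per_device_batch * grad_accum   # examples per optimizer step (one GPU assumed)
examples_seen = effective_batch * total_steps     # examples processed over the whole run

print(f"effective batch size: {effective_batch}, examples seen: {examples_seen}")
for step in (0, 5, 30, 60):
    print(f"step {step:>2}: lr = {linear_schedule_lr(step):.2e}")
```

With these settings each optimizer step averages gradients over 8 examples, and the 60-step run processes 480 examples in total, so on a larger training set this is a short run rather than a full epoch.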


In [None]:
# Begin fine-tuning
training_history = trainer.train()

### **Save Fine-Tuned Model**


After training, we save the **LoRA fine-tuned model** for future inference.

In [None]:
# Save locally
lora_model_name = "finetuned_mistral_llm"
model.save_pretrained(lora_model_name)                                          # Save LoRA-finetuned model

# Verify saved files
!ls -lh {lora_model_name}                                                       # List files inside the saved model folder

In [None]:
# # Uncomment this cell if you want to save the model to Drive

# from google.colab import drive
# drive.mount('/content/drive')

# drive_model_path = "/content/drive/MyDrive/finetuned_mistral_llm"

# !cp -r {lora_model_name} {drive_model_path}

# **3. Evaluation of LLM after Fine-Tuning**

Once the model has been fine-tuned, the next step is to **measure its effectiveness**. In business terms, this is like checking whether the training investment has paid off and whether the system can generate outputs that align with expectations.

### **Loading the Finetuned Mistral Model**

In [None]:
finetuned_model, finetuned_tokenizer = FastLanguageModel.from_pretrained(
    model_name=lora_model_name,                                                 # Use drive_model_path instead if you are loading the model from Drive
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True
)
FastLanguageModel.for_inference(finetuned_model)                                # switch to inference mode, as done for the base model

> **Note** <br> It is not strictly necessary to save and reload the model if you are evaluating it in the same Colab session - the model is still in memory. **However, as a best practice,** we save the fine-tuned model and then reload it for evaluation.

### **Generate Summaries**

In [None]:
# Calling the function to generate summaries
predicted_summaries = generate_summaries(dialogues=test_dialogues,
                                         model=finetuned_model,
                                         tokenizer=finetuned_tokenizer,
                                         prompt_template=alpaca_prompt_template
                                         )

### **Evaluate Using BERTScore**

In [None]:
score = bert_scorer.compute(
    predictions=predicted_summaries,
    references=test_summaries,
    lang='en',
    rescale_with_baseline=True
)
finetune_score = sum(score['f1']) / len(score['f1'])
print(finetune_score)

**The average BERTScore F1 of the fine-tuned Mistral model is 0.67**

- This is a clear gain over the base model's 0.37 on the same test set, suggesting that the generated summaries are now much closer in meaning to the human-written references.

- While not perfect, it demonstrates that fine-tuning has meaningfully improved the model's ability to perform the summarization task.
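
For reporting purposes, the change can be expressed as an absolute and a relative gain using the two averages printed in this notebook. Because these are baseline-rescaled BERTScore values, the relative figure is best read as a rough indicator rather than a percentage improvement in summary quality.

```python
# Averages reported in this notebook for the base and fine-tuned models.
baseline_f1, finetuned_f1 = 0.37, 0.67

absolute_gain = finetuned_f1 - baseline_f1
relative_gain = absolute_gain / baseline_f1
print(f"absolute gain: {absolute_gain:+.2f}, relative gain: {relative_gain:.0%}")
```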

# **Conclusion**


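* The fine-tuned model generates concise summaries that align more closely with the reference clinical summaries (average BERTScore F1 of 0.67 versus 0.37 for the base model), which can enhance the quality of documentation and support improved patient care.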

* Summaries now align closely with clinical communication needs, allowing medical practitioners quick access to relevant and structured information.

* This approach establishes a foundation for developing intelligent healthcare support systems that are context-aware, scalable, and tailored to the needs of medical professionals.

* The system's effectiveness is contingent on the availability of high-quality task-relevant consultation data and may require further refinement for diverse healthcare contexts.

* While the model produces clinically relevant summaries, continuous evaluation and possible adjustments are necessary to maintain alignment with evolving healthcare standards and practices.

<font size = 6 color="#4682B4"><b> Power Ahead </b></font>
___