[Streamlit Code](https://colab.research.google.com/drive/1qTgn5kIBEoCoolyoIYnZV-5K1bQo3LGx?usp=sharing)

In [None]:
!pip install unsloth
!pip install --force-reinstall --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git

Collecting unsloth
  Downloading unsloth-2025.2.5-py3-none-any.whl.metadata (57 kB)
Collecting unsloth_zoo>=2025.2.3 (from unsloth)
  Downloading unsloth_zoo-2025.2.3-py3-none-any.whl.metadata (16 kB)
Collecting xformers>=0.0.27.post2 (from unsloth)
  Downloading xformers-0.0.29.post2-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (1.0 kB)
Collecting bitsandbytes (from unsloth)
  Downloading bitsandbytes-0.45.2-py3-none-manylinux_2_24_x86_64.whl.metadata (5.8 kB)
Collecting tyro (from unsloth)
  Downloading tyro-0.9.13-py3-none-any.whl.metadata (9.4 kB)
Collecting datasets>=2.16.0 (from unsloth)
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting trl!=0.9.0,!=0.9.1,!=0.9.2,!=0.9.3,>=0.7.9 (from unsloth)
  Downloading trl-0.14.0-

Collecting git+https://github.com/unslothai/unsloth.git
  Cloning https://github.com/unslothai/unsloth.git to /tmp/pip-req-build-g08hpj5s
  Running command git clone --filter=blob:none --quiet https://github.com/unslothai/unsloth.git /tmp/pip-req-build-g08hpj5s
  Resolved https://github.com/unslothai/unsloth.git to commit 2023b28caa0b0b8d172e2e88f92cc13bff537018
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: unsloth
  Building wheel for unsloth (pyproject.toml) ... done
  Created wheel for unsloth: filename=unsloth-2025.2.4-py3-none-any.whl size=180844 sha256=65029a6f86868a1fa266df6f9169123d78a2870d117619876b1849f7da73b32f
  Stored in directory: /tmp/pip-ephem-wheel-cache-mwoq9qbq/wheels/d1/17/05/850ab10c33284a4763b0595cd8ea9d01fce6e221cac24b3c01
Successfully built unsloth
Installing collected packages: unsloth
 

In [None]:
from huggingface_hub import login
import getpass

# Get the token from user input using getpass for secure input
hf_token = getpass.getpass("Enter your Hugging Face token: ")

# Check if the token was entered
if not hf_token:
    raise ValueError("Hugging Face token not entered.")

# Log in using the token
login(hf_token)

Enter your Hugging Face token: ··········


### 1. **Importing the Library**
```python
from unsloth import FastLanguageModel
```
- This imports `FastLanguageModel` from the `unsloth` library.
- `unsloth` is a library for **fine-tuning and running large language models (LLMs) efficiently**, with lower memory use than a standard setup.

---

### 2. **Setting Configuration Parameters**
```python
max_seq_length = 2048
dtype = None
load_in_4bit = True
```
- `max_seq_length = 2048`:  
  - Defines the **maximum number of tokens** (words/subwords) the model can process in a single input.
  - A higher value allows the model to handle longer texts but requires more memory.

- `dtype = None`:  
  - Specifies the **data type** for model parameters (e.g., `float16`, `bfloat16`).  
  - `None` means it will **automatically** choose the best data type based on the hardware.

- `load_in_4bit = True`:  
  - Loads the model weights in **4-bit precision**, which cuts weight memory to roughly a quarter of 16-bit storage at the cost of a small loss in accuracy.  
  - This is useful for running large models on limited hardware like GPUs with lower VRAM.
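
As a rough back-of-the-envelope check on why 4-bit loading matters, the sketch below estimates weight storage for an 8-billion-parameter model at different precisions. It counts weights only and assumes a round 8e9 parameters; real checkpoints also carry quantization constants and keep some layers in higher precision, so actual figures are somewhat larger.

```python
# Rough weight-memory estimate; ignores activations, KV cache, and quantization overhead.
N_PARAMS = 8_000_000_000  # assumption: a round 8B parameters

BITS_PER_PARAM = {"float32": 32, "float16/bfloat16": 16, "4-bit": 4}

def weight_gib(n_params: int, bits: int) -> float:
    """Return the storage needed for n_params weights at the given bit width, in GiB."""
    return n_params * bits / 8 / 1024**3

estimates = {name: weight_gib(N_PARAMS, bits) for name, bits in BITS_PER_PARAM.items()}
for name, gib in estimates.items():
    print(f"{name:>18}: {gib:5.1f} GiB")
```

Under these assumptions the 16-bit weights alone come to about 14.9 GiB, which already exceeds the 14.7 GB this notebook reports for its Tesla T4, while the 4-bit weights need under 4 GiB.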

---

### 3. **Loading the Pretrained Model**
```python
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/DeepSeek-R1-Distill-Llama-8B",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    token = hf_token,
)
```
This line loads a **pretrained AI model** from **Hugging Face** and prepares it for use.

#### **Breaking it Down:**
- `FastLanguageModel.from_pretrained(...)`  
  - Loads a language model that has already been trained on a large dataset.
  
- `model_name = "unsloth/DeepSeek-R1-Distill-Llama-8B"`  
  - Specifies the **model name**:  
    - `"DeepSeek-R1-Distill-Llama-8B"` is an **8-billion-parameter** model.  
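    - It is a **distilled** model: a Llama 3.1 8B base model fine-tuned on reasoning outputs generated by the much larger DeepSeek-R1, so it imitates R1's step-by-step reasoning at a fraction of the size.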
  
- `max_seq_length = max_seq_length`  
  - Sets the **maximum context length** the loaded model is configured for (2048 tokens here).

- `dtype = dtype`  
  - Keeps the **automatic data type selection**.

- `load_in_4bit = load_in_4bit`  
  - Enables **4-bit quantization**, which mainly makes the model **more memory-efficient**; any speed benefit depends on the hardware.

- `token = hf_token`  
  - Passes an **authentication token** (`hf_token`) to Hugging Face. It is required only for **private or gated models**; public repositories can be downloaded without it.

---

### **What Happens When This Code Runs?**
1. The `FastLanguageModel` fetches the **DeepSeek-R1-Distill-Llama-8B** model.
2. It **loads the model into memory** in a compressed **4-bit format** to save resources.
3. It **downloads the tokenizer**, which converts text into tokens (small word pieces) that the model understands.
4. The `model` is now ready to generate, understand, and process text.

---

### **Why is This Code Useful?**
- **Optimized for Memory**: `4-bit` quantization lets an 8B model fit on a single consumer or free-tier GPU.
- **Handles Long Texts**: Supports up to `2048` tokens in a single input.
- **Uses a Distilled Model**: An 8B model is far cheaper to run than the large model it was distilled from, although it will not match that model's quality on every task.
- **Easy to Use**: With `FastLanguageModel`, one call returns both the model and its tokenizer.

In [None]:
from unsloth import FastLanguageModel

max_seq_length = 2048
dtype = None
load_in_4bit = True


model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/DeepSeek-R1-Distill-Llama-8B",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    token = hf_token,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.2.4: Fast Llama patching. Transformers: 4.48.2.
   \\   /|    GPU: Tesla T4. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post2. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.96G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/236 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/52.9k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

In [None]:
prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

### Instruction:
You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning.
Please answer the following medical question.

### Question:
{}

### Response:
<think>{}"""

---

## **1. Defining the Question**
```python
question = "A 61-year-old woman with a long history of involuntary urine loss during activities like coughing or sneezing but no leakage at night undergoes a gynecological exam and Q-tip test. Based on these findings, what would cystometry most likely reveal about her residual volume and detrusor contractions?"
```
- The **question** is stored in the variable `question`.  
- It describes a **medical case** about a 61-year-old woman with **stress urinary incontinence (SUI)** and asks about **cystometry findings** (a bladder function test).  

---

## **2. Preparing the Model for Inference**
```python
FastLanguageModel.for_inference(model)
```
- This **sets the model into inference mode**, meaning it is **ready to generate answers** instead of training.  
- It optimizes the model to run **faster and use less memory** during inference.  

---

## **3. Tokenizing the Question**
```python
inputs = tokenizer([prompt_style.format(question, "")], return_tensors="pt").to("cuda")
```
- **Tokenization**:  
  - The `tokenizer` converts the text into a format the model understands (numbers/tokens).  
  - The `prompt_style.format(question, "")` ensures the question is formatted correctly before tokenization.  

- **Converting to Tensors**:  
  - `return_tensors="pt"` converts the text into a PyTorch tensor (a format for deep learning models).  

- **Sending to GPU**:  
  - `.to("cuda")` moves the **input tensors to the GPU**, where the model weights already live. Inputs and model must be on the same device for `generate` to work.  
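
The formatting step can be tried without a model or tokenizer. The sketch below uses a shortened stand-in for `prompt_style` (the names `demo_prompt_style` and `demo_question` are made up for illustration) to show how the two `{}` placeholders are filled.

```python
# Shortened stand-in for the notebook's prompt_style (assumption: abbreviated for illustration).
demo_prompt_style = """### Question:
{}

### Response:
<think>{}"""

demo_question = "What does cystometry measure?"

# The first {} receives the question; the second is left empty so the
# prompt ends right after "<think>" and the model continues from there.
demo_prompt = demo_prompt_style.format(demo_question, "")
print(demo_prompt)
```

Leaving the second placeholder empty is what turns the training-style template into an open-ended prompt for generation.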

---

## **4. Generating the Model’s Response**
```python
outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=1200,
    use_cache=True,
)
```
- `model.generate(...)` is used to **generate text based on the input question**.  
- **Parameters Explained:**
  - `input_ids=inputs.input_ids`: Passes the tokenized question as input.  
  - `attention_mask=inputs.attention_mask`: Tells the model **which tokens are important** and **which are padding**.  
  - `max_new_tokens=1200`:  
    - The model can generate up to **1200 tokens** in its response.  
    - This ensures a **detailed answer** instead of being cut off too early.  
  - `use_cache=True`:  
    - Helps speed up generation by **reusing previous computations**.  
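
To make the role of `max_new_tokens` concrete, here is a toy sketch of the stopping logic only. It is not how `model.generate` is implemented: `toy_generate` and the fake `next_token` callables are hypothetical stand-ins, and real generation also involves sampling settings and the key/value cache.

```python
from typing import Callable, List

def toy_generate(
    next_token: Callable[[List[int]], int],  # hypothetical stand-in for one model step
    input_ids: List[int],
    max_new_tokens: int,
    eos_token_id: int,
) -> List[int]:
    """Append tokens one at a time until EOS appears or the budget is spent."""
    ids = list(input_ids)
    for _ in range(max_new_tokens):
        token = next_token(ids)
        ids.append(token)
        if token == eos_token_id:
            break
    return ids

# A fake "model" that counts upward; token 0 plays the role of EOS.
capped = toy_generate(lambda ids: ids[-1] + 1, [1, 2, 3], max_new_tokens=4, eos_token_id=0)
print(capped)

# A fake "model" that emits EOS immediately.
stopped = toy_generate(lambda ids: 0, [1, 2, 3], max_new_tokens=4, eos_token_id=0)
print(stopped)
```

In both cases the returned list starts with the original input tokens, which mirrors why the decoded output later contains the prompt as well as the answer.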

---

## **5. Decoding the Model’s Response**
```python
response = tokenizer.batch_decode(outputs)
```
- The model's output is in **tokenized form (numbers)**, so `batch_decode(outputs)` converts it **back into human-readable text**.  

---

## **6. Extracting the Final Answer**
```python
print(response[0].split("### Response:")[1])
```
- The decoded string contains the **full prompt followed by the generated text**, roughly like this:  
  ```
  ...prompt text... ### Response: [Generated Answer]
  ```
- `split("### Response:")[1]` keeps only the text **after the `### Response:` marker**, which drops the echoed prompt.  
- Finally, `print(...)` displays the response on the screen.  
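
Indexing with `[1]` raises an `IndexError` if the marker is ever missing from the decoded text. A slightly more defensive variant, sketched here on a made-up `decoded` string, uses `str.partition` instead.

```python
# Hypothetical decoded output: the prompt is echoed back, followed by the generation.
decoded = (
    "### Question:\nWhat does cystometry measure?\n\n"
    "### Response:\n<think>\nLet me reason step by step...\n</think>\nBladder pressure and volume."
)

marker = "### Response:"

# Same idea as response[0].split(marker)[1], but partition() does not raise
# if the marker is missing; found is an empty string in that case.
_, found, answer = decoded.partition(marker)
answer = answer.strip() if found else decoded

print(answer)
```

One difference worth knowing: `partition` keeps everything after the first marker, whereas `split(marker)[1]` stops at a second marker if the model happens to generate one.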

---

## **💡 What Happens When This Code Runs?**
1. **The medical question is defined** and formatted.  
2. **The model is set to inference mode** for answering questions efficiently.  
3. **The question is tokenized** and sent to the GPU for fast processing.  
4. **The model generates a detailed answer** based on the question.  
5. **The response is converted back into human-readable text** and printed.  

---

In [None]:
question = "A 61-year-old woman with a long history of involuntary urine loss during activities like coughing or sneezing but no leakage at night undergoes a gynecological exam and Q-tip test. Based on these findings, what would cystometry most likely reveal about her residual volume and detrusor contractions?"


FastLanguageModel.for_inference(model)
inputs = tokenizer([prompt_style.format(question, "")], return_tensors="pt").to("cuda")

outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=1200,
    use_cache=True,
)
response = tokenizer.batch_decode(outputs)
print(response[0].split("### Response:")[1])


<think>
Okay, so I'm trying to figure out what the cystometry would show for this 61-year-old woman. Let me break this down step by step.

First, the patient has a history of involuntary urine loss, especially when she coughs or sneezes. That makes me think of stress urinary incontinence. But it's noted that she doesn't leak at night, which is interesting because that's more typical for another condition called nocturnal enuresis. So, maybe she's not experiencing nighttime leakage, which might help narrow down the cause.

She undergoes a gynecological exam and a Q-tip test. I'm not exactly sure about the Q-tip test, but from what I remember, it's used to assess urethral function. The Q-tip is a small device that's placed in the urethra to measure the closure pressure. If the pressure is low, it might indicate that the urethral sphincter isn't functioning well, contributing to incontinence.

Now, considering the possible causes of her symptoms. Since she's experiencing loss during acti

In [None]:
train_prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

### Instruction:
You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning.
Please answer the following medical question.

### Question:
{}

### Response:
<think>
{}
</think>
{}"""

### **Explanation of the Code**  

This code defines a function `formatting_prompts_func` that prepares text data for training a **machine learning model** (likely a large language model, or LLM). It structures **questions, reasoning steps (CoT - Chain of Thought), and responses** in a specific format, ensuring the model understands and learns from them effectively.  

---

## **1. Defining the End-of-Sequence Token**
```python
EOS_TOKEN = tokenizer.eos_token  # Must add EOS_TOKEN
```
- `eos_token` (End-of-Sequence token) marks the **end of a complete training sequence**.  
- Training on it teaches the model **when an answer is complete**.  
- `EOS_TOKEN` is added to every training sample; without it, the fine-tuned model tends to **keep generating** past the answer.  

---

## **2. Defining the Function to Format Data**
```python
def formatting_prompts_func(examples):
```
- This function takes `examples`, which is **a dataset containing multiple questions, reasoning steps, and responses**.  
- It reformats these **into a structured prompt** that the model can learn from.  

---

## **3. Extracting Data from the Input Dictionary**
```python
inputs = examples["Question"]
cots = examples["Complex_CoT"]
outputs = examples["Response"]
```
- `examples` is a dictionary where:  
  - `"Question"` contains **questions**.  
  - `"Complex_CoT"` contains **reasoning steps** (Chain of Thought).  
  - `"Response"` contains **final answers**.  

---

## **4. Formatting Each Example in the Dataset**
```python
texts = []
for input, cot, output in zip(inputs, cots, outputs):
    text = train_prompt_style.format(input, cot, output) + EOS_TOKEN
    texts.append(text)
```
- An empty list `texts = []` is created to store formatted prompts.  
- The `zip(inputs, cots, outputs)` allows **looping through the three lists at the same time**.  
- For each example:  
  1. It applies `train_prompt_style.format(input, cot, output)`, formatting the data into a structured template.  
  2. It appends the `EOS_TOKEN` at the end to indicate **completion**.  
  3. The formatted text is **added to the `texts` list**.  

---

## **5. Returning the Formatted Data**
```python
return {
    "text": texts,
}
```
- The function returns a **dictionary** with the formatted text samples under the key `"text"`.  
- This formatted data is ready to be used for **training the language model**.  

---

## **💡 What Does This Function Do?**
✅ **Formats training examples** with a structured prompt style.  
✅ **Includes Chain-of-Thought (CoT) reasoning** to improve model understanding.  
✅ **Adds an End-of-Sequence token** to signal response completion.  
✅ **Returns structured data** for training the AI model.  

---

## **🛠️ Example Input & Output**
### **📥 Input (`examples` dictionary)**
```python
examples = {
    "Question": [
        "What is the capital of France?",
        "Explain Newton's first law."
    ],
    "Complex_CoT": [
        "Paris is the capital of France. It is known for its historical significance.",
        "Newton's first law states that an object remains in motion or at rest unless acted upon by an external force."
    ],
    "Response": [
        "The capital of France is Paris.",
        "Newton's first law is also called the Law of Inertia."
    ]
}
```

### **📤 Output (Formatted Texts)**
Each string follows `train_prompt_style`. The instruction preamble is abbreviated as `...` below, and `<EOS>` stands for the value of `tokenizer.eos_token`:
```python
{
    "text": [
        "...\n\n### Question:\nWhat is the capital of France?\n\n### Response:\n<think>\nParis is the capital of France. It is known for its historical significance.\n</think>\nThe capital of France is Paris.<EOS>",

        "...\n\n### Question:\nExplain Newton's first law.\n\n### Response:\n<think>\nNewton's first law states that an object remains in motion or at rest unless acted upon by an external force.\n</think>\nNewton's first law is also called the Law of Inertia.<EOS>"
    ]
}
```
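
The layout can be checked by running the function on a toy batch. The sketch below is self-contained: it uses a shortened copy of the template and a placeholder `DEMO_EOS` string, both assumptions made so that it runs without a tokenizer.

```python
# Same structure as train_prompt_style, with the instruction preamble left out
# for readability (assumption: abbreviated for illustration).
demo_train_prompt_style = """### Question:
{}

### Response:
<think>
{}
</think>
{}"""

DEMO_EOS = "<EOS>"  # placeholder; the real value is tokenizer.eos_token

def demo_formatting_prompts_func(examples):
    """Build one training string per (question, reasoning, answer) triple."""
    texts = []
    for question, cot, response in zip(
        examples["Question"], examples["Complex_CoT"], examples["Response"]
    ):
        texts.append(demo_train_prompt_style.format(question, cot, response) + DEMO_EOS)
    return {"text": texts}

demo_examples = {
    "Question": ["What is the capital of France?"],
    "Complex_CoT": ["Paris is the capital of France."],
    "Response": ["The capital of France is Paris."],
}

formatted = demo_formatting_prompts_func(demo_examples)
print(formatted["text"][0])
```

The reasoning lands between `<think>` and `</think>`, the final answer follows the closing tag, and the end-of-sequence marker is the last thing in the string.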


In [None]:
EOS_TOKEN = tokenizer.eos_token  # Must add EOS_TOKEN


def formatting_prompts_func(examples):
    inputs = examples["Question"]
    cots = examples["Complex_CoT"]
    outputs = examples["Response"]
    texts = []
    for input, cot, output in zip(inputs, cots, outputs):
        text = train_prompt_style.format(input, cot, output) + EOS_TOKEN
        texts.append(text)
    return {
        "text": texts,
    }

### **Explanation of the Code**  

This code **loads a medical dataset**, processes it using a **formatting function**, and then retrieves the first formatted example. Let’s break it down step by step.  

---

## **1. Importing the Dataset Library**
```python
from datasets import load_dataset
```
- This imports `load_dataset` from the `datasets` library (part of Hugging Face).  
- This function is used to **download and load datasets** for machine learning tasks.

---

## **2. Loading the Medical Dataset**
```python
dataset = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT", "en", split="train[0:500]", trust_remote_code=True)
```
- This line loads the **"FreedomIntelligence/medical-o1-reasoning-SFT"** dataset from Hugging Face.  
- The dataset contains **medical reasoning questions and answers**.  

### **Breaking Down the Parameters:**
- `"FreedomIntelligence/medical-o1-reasoning-SFT"`  
  - Specifies the **dataset name** on Hugging Face.  
- `"en"`  
  - Loads the **English version** of the dataset.  
- `split="train[0:500]"`  
  - Selects **only the first 500 training samples** (instead of the full dataset).  
  - This helps with quick testing and reduces memory usage.  
- `trust_remote_code=True`  
  - Allows execution of **any custom dataset loading code** provided by the dataset creator.  

---

## **3. Applying the Formatting Function**
```python
dataset = dataset.map(formatting_prompts_func, batched=True)
```
- `dataset.map(...)` runs `formatting_prompts_func` over the dataset and **adds the returned `"text"` column** to it.  
- `batched=True` passes **many examples per call** (as a dictionary of lists), which is why the function loops over its inputs and is faster than one call per row.  
- This ensures that each question is formatted correctly before training.

📌 **What does `formatting_prompts_func` do?**  
It **formats** each question, its reasoning steps, and the response into a structured prompt for training.  
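
The shape of the data that a batched mapping function receives is easy to get wrong, so here is a toy illustration in plain Python. `demo_map_batched` is a hypothetical helper written for this sketch, not the `datasets` implementation; it only mimics the column-oriented batches.

```python
def demo_map_batched(columns, fn, batch_size=2):
    """Toy stand-in for a batched map: call fn on column-oriented slices and merge results."""
    n_rows = len(next(iter(columns.values())))
    merged = {name: list(values) for name, values in columns.items()}
    for start in range(0, n_rows, batch_size):
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        for name, values in fn(batch).items():
            merged.setdefault(name, [None] * n_rows)
            merged[name][start:start + len(values)] = values
    return merged

def add_text(batch):
    # batch["Question"] is a list of strings, not a single string.
    return {"text": [q.upper() for q in batch["Question"]]}

demo_columns = {"Question": ["a?", "b?", "c?"]}
result = demo_map_batched(demo_columns, add_text)
print(result)
```

The function is called once per batch, receives lists, and must return lists of the same length, which is exactly the contract `formatting_prompts_func` follows.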

---

## **4. Retrieving the First Formatted Example**
```python
dataset["text"][0]
```
- This **fetches the first formatted training example** from the `"text"` column.  
- The output will be a **structured prompt** that includes:  
  - **The question**  
  - **Reasoning steps (Chain of Thought - CoT)**  
  - **Final answer**  
  - **End-of-Sequence (EOS) token**  

---

### **💡 Example Input & Output**
#### **Input: (Raw dataset sample before formatting)**
```python
{
    "Question": "What are the symptoms of pneumonia?",
    "Complex_CoT": "Pneumonia is an infection that inflames the air sacs in one or both lungs...",
    "Response": "Common symptoms of pneumonia include cough, fever, and difficulty breathing."
}
```

#### **Output: (After `formatting_prompts_func`)**
```python
"...\n\n### Question:\nWhat are the symptoms of pneumonia?\n\n### Response:\n<think>\nPneumonia is an infection that inflames the air sacs in one or both lungs...\n</think>\nCommon symptoms of pneumonia include cough, fever, and difficulty breathing.<EOS>"
```
- The leading `...` stands for the instruction preamble from `train_prompt_style`, which is abbreviated here.  
- The question, reasoning, and response are **structured in a consistent way** for the AI model to learn from.  
- `<EOS>` stands for `tokenizer.eos_token` and marks the **end of the response**.  

---


In [None]:
from datasets import load_dataset
dataset = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT","en", split = "train[0:500]",trust_remote_code=True)
dataset = dataset.map(formatting_prompts_func, batched = True,)
dataset["text"][0]

README.md:   0%|          | 0.00/1.25k [00:00<?, ?B/s]

medical_o1_sft.json:   0%|          | 0.00/74.1M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/25371 [00:00<?, ? examples/s]

Map:   0%|          | 0/500 [00:00<?, ? examples/s]

"Below is an instruction that describes a task, paired with an input that provides further context. \nWrite a response that appropriately completes the request. \nBefore answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.\n\n### Instruction:\nYou are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning. \nPlease answer the following medical question. \n\n### Question:\nA 61-year-old woman with a long history of involuntary urine loss during activities like coughing or sneezing but no leakage at night undergoes a gynecological exam and Q-tip test. Based on these findings, what would cystometry most likely reveal about her residual volume and detrusor contractions?\n\n### Response:\n<think>\nOkay, let's think about this step by step. There's a 61-year-old woman here who's been dealing with involuntary urine leakages whenever she's doing something that ups her ab

### **Explanation of the Code**  

This code **attaches LoRA (Low-Rank Adaptation) adapters** to the loaded model through `FastLanguageModel`, so it can be fine-tuned while using **much less memory** than full fine-tuning. Let's break it down step by step.  

---

## **1. Applying LoRA Fine-Tuning to the Model**
```python
model = FastLanguageModel.get_peft_model(
    model,
```
- `get_peft_model(...)` attaches **LoRA (Low-Rank Adaptation)** adapters to the given `model`.  
- LoRA **freezes the original weights** and trains only **small low-rank matrices added alongside them**, instead of updating the entire model.  
- This cuts the memory needed for gradients and optimizer state, and usually **makes training faster**.  
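
A minimal sketch of the idea, using plain Python lists and toy sizes: the frozen weight `W` is left alone, and the output gains a low-rank correction `(alpha / r) * B(Ax)`. The matrices and values are made up for illustration; real implementations work on tensors and initialize `A` randomly.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, r):
    """Compute W x + (alpha / r) * B (A x) with W frozen and A, B trainable."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

# Toy sizes: d_in = 3, d_out = 2, rank r = 1.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]  # frozen base weight, shape (2, 3)
A = [[0.5, 0.5, 0.5]]                     # trainable, shape (r, d_in)
B_init = [[0.0], [0.0]]                   # trainable, shape (d_out, r), zero at start
x = [1.0, 2.0, 3.0]

y_start = lora_forward(W, A, B_init, x, alpha=1, r=1)
B_trained = [[1.0], [-1.0]]               # pretend training moved B away from zero
y_later = lora_forward(W, A, B_trained, x, alpha=1, r=1)
print(y_start, y_later)
```

Because `B` starts at zero, the adapted model initially behaves exactly like the base model; training then moves only `A` and `B`.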

---

## **2. LoRA Rank Parameter (r)**
```python
r=16,
```
- `r=16` sets the **rank of the low-rank adaptation matrices**.  
- A higher value **captures more patterns**, but **increases memory usage**.  
- Typically, `r=8` or `r=16` works well for fine-tuning.  

---

## **3. Target Modules for LoRA Adaptation**
```python
target_modules=[
    "q_proj",
    "k_proj",
    "v_proj",
    "o_proj",
    "gate_proj",
    "up_proj",
    "down_proj",
],
```
- This list defines **which parts of the model** will be fine-tuned using LoRA.  
- These are key components in **Transformer models**, responsible for attention and computations.  

📌 **What do these target modules do?**
- `"q_proj"` (Query Projection)  
- `"k_proj"` (Key Projection)  
- `"v_proj"` (Value Projection)  
- `"o_proj"` (Output Projection)  
  - These are used in **self-attention mechanisms**, which help the model understand relationships between words.  
- `"gate_proj"`, `"up_proj"`, `"down_proj"`  
  - These are related to **feedforward network layers**, helping in **transforming** and **processing** model outputs.  

👉 **Why these layers?**  
- Together they cover **every linear projection** inside each decoder block. The embeddings, normalization layers, and output head are left untouched, which keeps the number of trainable parameters small.  

---

## **4. LoRA Hyperparameters**
```python
lora_alpha=16,
lora_dropout=0,
bias="none",
```
- **`lora_alpha=16`**:  
  - A **scaling factor** for the LoRA update, which is multiplied by `lora_alpha / r` (here `16 / 16 = 1`).  
  - Higher values make the adapter's contribution larger, which can speed up learning but may **destabilize training**.  
- **`lora_dropout=0`**:  
  - Dropout is a technique to **reduce overfitting** by randomly zeroing some activations on the LoRA path during training.  
  - `0` means **no dropout**, so all connections remain active.  
- **`bias="none"`**:  
  - This setting means **no bias terms are trained**; they stay frozen along with the base weights.  
  - This keeps the set of trainable parameters as small as possible.  

---

## **5. Using Gradient Checkpointing for Long Contexts**
```python
use_gradient_checkpointing="unsloth",
```
- **Gradient checkpointing** reduces memory usage by **recomputing activations** during the backward pass instead of storing them, at the cost of extra compute.  
- `"unsloth"` selects Unsloth's own checkpointing implementation, which the library recommends for **very long contexts**; `True` selects the standard one.  
- This **makes training more memory-efficient**, especially for **large context windows (e.g., 2048+ tokens)**.  

---

## **6. Random State for Reproducibility**
```python
random_state=3407,
```
- Sets a **fixed random seed** so that **results are consistent** across multiple runs.  
- This helps in debugging and comparing different experiments.  

---

## **7. Additional LoRA Configurations**
```python
use_rslora=False,
loftq_config=None,
```
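- **`use_rslora=False`**:  
  - `rslora` (Rank-Stabilized LoRA) changes the update scaling from `lora_alpha / r` to `lora_alpha / sqrt(r)`, which is intended to keep training **stable at higher ranks**.  
  - `False` means **standard LoRA** scaling is being used instead.  
- **`loftq_config=None`**:  
  - `LoftQ` is a **quantization-aware initialization** for LoRA weights, designed to reduce the error that quantizing the base model introduces.  
  - `None` means **LoftQ initialization is not used**.  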
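
The effect of the rank-stabilized option comes down to one scaling factor. The helper below (`lora_scale` is a name chosen for this sketch) compares the two formulas using this notebook's `lora_alpha` and `r`.

```python
import math

def lora_scale(alpha: float, r: int, use_rslora: bool = False) -> float:
    """Scaling applied to the low-rank update: alpha / r, or alpha / sqrt(r) for rsLoRA."""
    return alpha / math.sqrt(r) if use_rslora else alpha / r

standard = lora_scale(alpha=16, r=16)                       # this notebook's settings
rank_stabilized = lora_scale(alpha=16, r=16, use_rslora=True)
print(standard, rank_stabilized)
```

With standard scaling, raising `r` while holding `lora_alpha` fixed shrinks the update; the square-root version shrinks it more slowly, which is the motivation for the variant.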

---

## **💡 What Does This Code Do?**
✅ **Attaches LoRA adapters to the model** for efficient training.  
✅ **Trains only small adapter matrices** on the attention and MLP projections (q_proj, k_proj, etc.) to save memory.  
✅ **Uses gradient checkpointing** to handle long sequences efficiently.  
✅ **Sets a fixed random seed** for reproducibility.  
✅ **Keeps memory usage low enough** to fine-tune an 8B-parameter model on a single GPU.  

---

## **🛠️ Full Fine-Tuning vs. LoRA Fine-Tuning**
| Feature | Full Fine-Tuning | LoRA Fine-Tuning |
|---------|--------------|------------------|
| **Training Speed** | Slow | 🚀 Usually faster |
| **Memory Usage** | High | 📉 Lower |
| **Accuracy** | Good | Often comparable, depending on the task |
| **Training Cost** | Expensive | 💰 Cheaper |

---



In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",  # True or "unsloth" for very long context
    random_state=3407,
    use_rslora=False,
    loftq_config=None,
)

Unsloth 2025.2.4 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
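
The number of trainable parameters that these settings produce can be worked out by hand. The sketch below assumes the usual Llama 3.1 8B layer shapes, which this notebook does not print: hidden size 4096, MLP intermediate size 14336, key/value projection width 1024, and 32 decoder layers.

```python
# Assumed Llama 3.1 8B shapes (not printed in this notebook).
HIDDEN, INTERMEDIATE, KV_DIM, N_LAYERS, RANK = 4096, 14336, 1024, 32, 16

# (input features, output features) of each targeted linear layer.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(d_in: int, d_out: int, r: int) -> int:
    """A has shape (r, d_in) and B has shape (d_out, r)."""
    return r * (d_in + d_out)

per_layer = sum(lora_params(d_in, d_out, RANK) for d_in, d_out in TARGET_SHAPES.values())
total_trainable = per_layer * N_LAYERS
print(f"{total_trainable:,}")
```

Under these assumptions the total is 41,943,040, the same figure the training banner reports later in this notebook, and well under 1% of the model's roughly 8 billion weights.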


### **Explanation of the Code**  

This code **sets up and initializes a fine-tuning process** for a large language model using `SFTTrainer`. It uses **LoRA fine-tuning** (Low-Rank Adaptation) and **memory-efficient training techniques** to optimize performance. Let’s go step by step.

---

## **1. Importing Required Libraries**
```python
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported
```
- **`SFTTrainer`** (from `trl`, short for **Transformer Reinforcement Learning**)  
  - A trainer for **supervised fine-tuning (SFT)** of language models that works with PEFT (Parameter-Efficient Fine-Tuning) adapters such as LoRA.  
- **`TrainingArguments`** (from `transformers`)  
  - Used to define **training configurations** such as batch size, learning rate, and optimizer.  
- **`is_bfloat16_supported()`** (from `unsloth`)  
  - Checks whether the **hardware supports `bfloat16` precision**, which has the same size as `float16` but a **much wider numeric range**. It is available on newer GPUs such as the **A100 and H100**, but not on the Tesla T4 used in this run.  

---

## **2. Initializing the Trainer**
```python
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
```
- **`model=model`** → The pre-trained model being fine-tuned.  
- **`tokenizer=tokenizer`** → The tokenizer converts text into numbers for the model.  
- **`train_dataset=dataset`** → The training dataset, which contains formatted medical reasoning questions.  
- **`dataset_text_field="text"`** → Specifies that the `"text"` field contains the training samples.  
- **`max_seq_length=max_seq_length`** → Caps each training example at **2048 tokens**; longer sequences are truncated.  
- **`dataset_num_proc=2`** → Uses **2 CPU processes** to speed up dataset processing.  

---

## **3. Setting Training Hyperparameters**
```python
args=TrainingArguments(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
```
- **`per_device_train_batch_size=2`** → Uses a **batch size of 2 per GPU** to **save memory**.  
- **`gradient_accumulation_steps=4`** → Instead of updating weights after every batch, it **accumulates gradients for 4 steps** before updating.  
  - **Effect:** Makes training behave like batch size **8** while keeping memory low.  
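
The equivalence behind gradient accumulation can be verified on a one-parameter model. This is a toy sketch with made-up data, not the trainer's code: it compares the gradient accumulated over 4 micro-batches of 2 samples with the gradient of a single batch of 8.

```python
import math

def grad_mse(w, batch):
    """Gradient of mean((w*x - y)^2) with respect to w for a 1-parameter model."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

# Eight toy (x, y) pairs, i.e. one effective batch of 2 * 4 samples.
data = [(1.0, 2.0), (2.0, 3.9), (3.0, 6.1), (4.0, 8.2),
        (5.0, 9.8), (6.0, 12.1), (7.0, 13.9), (8.0, 16.0)]
w = 0.5
micro_batch_size, accumulation_steps = 2, 4

# Accumulate: each micro-batch gradient is divided by the number of steps, then summed.
accumulated = 0.0
for step in range(accumulation_steps):
    micro = data[step * micro_batch_size:(step + 1) * micro_batch_size]
    accumulated += grad_mse(w, micro) / accumulation_steps

full_batch = grad_mse(w, data)
print(accumulated, full_batch, math.isclose(accumulated, full_batch))
```

Only one micro-batch of activations has to be held in memory at a time, which is why this trades a little wall-clock time for a much smaller memory footprint.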

---

## **4. Learning Rate & Training Steps**
```python
warmup_steps=5,
max_steps=60,
learning_rate=2e-4,
```
- **`warmup_steps=5`** → Slowly increases the learning rate for the **first 5 steps** to stabilize training.  
- **`max_steps=60`** → The training will **run for 60 optimizer steps** before stopping.  
  - This is useful for quick testing. For full training, **use `num_train_epochs=1` instead**.  
- **`learning_rate=2e-4`** → The step size for weight updates (`0.0002`), a common starting point for LoRA.  
  - **Full fine-tuning usually uses lower values (e.g., `1e-5`)**, because it updates every weight of the model.  

---

## **5. Floating-Point Precision Optimization**
```python
fp16=not is_bfloat16_supported(),
bf16=is_bfloat16_supported(),
```
- **`fp16=not is_bfloat16_supported()`** → Falls back to **16-bit floating point (`float16`)** mixed precision on GPUs **without `bfloat16` support** (e.g., the Tesla T4).  
- **`bf16=is_bfloat16_supported()`** → Uses **`bfloat16` if supported**. It takes the same memory as `float16` but is **less prone to overflow**, which makes training more stable on newer GPUs (e.g., A100, H100).  
- **Effect:** Exactly one of the two flags is `True`, so the precision format **matches the hardware** automatically.  
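
The range difference between the two 16-bit formats can be seen with the standard library alone. In this sketch, `to_bfloat16` emulates `bfloat16` by truncating a `float32` to its top 16 bits, which is a simplification (hardware typically rounds rather than truncates).

```python
import struct

def to_bfloat16(value: float) -> float:
    """Emulate bfloat16 by keeping only the top 16 bits of the float32 encoding (truncation)."""
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

def fits_in_float16(value: float) -> bool:
    """Return True if value can be stored as an IEEE half-precision number."""
    try:
        struct.pack("<e", value)
        return True
    except OverflowError:
        return False

big = 70000.0  # just above the largest finite float16 value (65504)
print(fits_in_float16(big))  # float16 cannot represent this magnitude
print(to_bfloat16(big))      # bfloat16 keeps the magnitude but loses precision
```

`float16` spends more bits on precision and overflows early, whereas `bfloat16` keeps the exponent range of `float32` and gives up precision instead.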

---

## **6. Logging & Optimizer Settings**
```python
logging_steps=10,
optim="adamw_8bit",
weight_decay=0.01,
lr_scheduler_type="linear",
seed=3407,
output_dir="outputs",
```
- **`logging_steps=10`** → Logs training progress every **10 steps**.  
- **`optim="adamw_8bit"`** → Uses the **AdamW optimizer** with its state stored in **8-bit**, which reduces memory usage.  
- **`weight_decay=0.01`** → Helps limit overfitting by **penalizing large weights**.  
- **`lr_scheduler_type="linear"`** → After warmup, the learning rate **decreases linearly** toward zero by the last step.  
- **`seed=3407`** → Ensures **consistent results** by fixing the random seed.  
- **`output_dir="outputs"`** → Saves model checkpoints and logs in the `"outputs"` directory.  
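
Putting `warmup_steps`, `max_steps`, and the linear scheduler together gives a learning-rate curve that can be sketched in a few lines. The function below is an illustration that mirrors the usual shape of a linear schedule with warmup; it is an assumption about the curve, not code taken from the trainer.

```python
def linear_schedule_lr(step: int, base_lr: float = 2e-4,
                       warmup_steps: int = 5, max_steps: int = 60) -> float:
    """Linear warmup from 0 to base_lr, then linear decay to 0 at max_steps."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0.0, (max_steps - step) / max(1, max_steps - warmup_steps))

for step in (0, 5, 30, 60):
    print(step, linear_schedule_lr(step))

# Samples consumed: 60 optimizer steps * (2 per device * 4 accumulation steps), on one GPU.
samples_seen = 60 * 2 * 4
print(samples_seen)
```

With an effective batch of 8, the 60 steps cover 480 samples, so this short run sees slightly less than one pass over the 500-example subset.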

---



In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        # Use num_train_epochs = 1, warmup_ratio for full training runs!
        warmup_steps=5,
        max_steps=60,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=10,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
    ),
)

Map (num_proc=2):   0%|          | 0/500 [00:00<?, ? examples/s]

In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 500 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 41,943,040


wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter:

 ··········


wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Currently logged in as: shukdev-datta (shukdev-datta-north-south-university) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


| Step | Training Loss |
|------|---------------|
| 10   | 1.9189        |
| 20   | 1.4615        |
| 30   | 1.4023        |
| 40   | 1.3088        |
| 50   | 1.3443        |
| 60   | 1.3140        |
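
The totals in the training banner can be checked with a little arithmetic. The sketch below uses only numbers that appear in the training cell and the banner. The variable names are illustrative, and the epoch fraction is a back-of-the-envelope figure, not something the trainer reports.

```python
per_device_batch = 2   # per_device_train_batch_size
grad_accum_steps = 4   # gradient_accumulation_steps
num_gpus = 1           # "Num GPUs = 1" in the banner
max_steps = 60         # max_steps
num_examples = 500     # "Num examples = 500" in the banner

# Examples consumed per optimizer step across all devices.
effective_batch = per_device_batch * grad_accum_steps * num_gpus
# Examples consumed over the whole run, and the share of one epoch that is.
examples_seen = effective_batch * max_steps
epoch_fraction = examples_seen / num_examples

print(effective_batch, examples_seen, epoch_fraction)  # 8 480 0.96
```

So the 60-step run covers slightly less than one full pass over the 500 training examples.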


### **Explanation of the Code**  

This code **runs inference** (prediction) with the **fine-tuned model**, using Unsloth's `FastLanguageModel` helper. It formats a **medical question** as a prompt, generates an answer, and prints the response. Let's break it down step by step.

---

## **1. Defining the Medical Question**
```python
question = "A 61-year-old woman with a long history of involuntary urine loss during activities like coughing or sneezing but no leakage at night undergoes a gynecological exam and Q-tip test. Based on these findings, what would cystometry most likely reveal about her residual volume and detrusor contractions?"
```
- This is a **medical question** related to **cystometry (a bladder function test)**.
- The model will **analyze and generate a medical response**.

---

## **2. Enabling Fast Inference Mode**
```python
FastLanguageModel.for_inference(model)  # Unsloth has 2x faster inference!
```
- **`FastLanguageModel.for_inference(model)`** → Switches the model into Unsloth's **inference mode** before generating text.
- **Why?**  
  - It skips training-only work that generation does not need.  
  - Unsloth advertises roughly **2x faster** generation in this mode.  
  - The speedup is expected to matter most for **long text sequences** (2048+ tokens).  

---

## **3. Formatting the Input for the Model**
```python
inputs = tokenizer([prompt_style.format(question, "")], return_tensors="pt").to("cuda")
```
- **Formats the input question** into the prompt structure the model was fine-tuned on.  
- **Steps:**
  1. **`prompt_style.format(question, "")`** → Inserts the question into the prompt template and leaves the response slot empty, so the model fills it in.  
  2. **`tokenizer([...], return_tensors="pt")`** → Tokenizes the prompt and returns **PyTorch tensors** (`input_ids` and `attention_mask`).  
  3. **`.to("cuda")`** → Moves those tensors to the **GPU**, where the model runs.  

📌 **Why Use a Tokenizer?**  
- A **tokenizer** converts text into the **integer token IDs** that the model processes.  
- For example, `"Hello"` becomes a short list of integers; the exact IDs depend on the tokenizer's vocabulary.  
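
As an illustration of the text-to-IDs round trip, here is a toy tokenizer built on a tiny, hypothetical vocabulary. Real tokenizers, including the one used in this notebook, rely on learned subword vocabularies with many thousands of entries, so the IDs below are purely illustrative.

```python
# Hypothetical toy vocabulary; real tokenizers use learned subword vocabularies.
vocab = {"<unk>": 0, "what": 1, "is": 2, "cystometry": 3, "?": 4}
id_to_token = {i: t for t, i in vocab.items()}

def encode(text):
    """Map lowercased, whitespace-separated tokens to integer IDs."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in text.lower().split()]

def decode(ids):
    """Map integer IDs back to tokens and join them with spaces."""
    return " ".join(id_to_token[i] for i in ids)

ids = encode("What is cystometry ?")
print(ids)          # [1, 2, 3, 4]
print(decode(ids))  # what is cystometry ?
```

Words missing from the vocabulary map to the `<unk>` ID here. Subword tokenizers instead break unfamiliar words into smaller known pieces.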

---

## **4. Generating a Response**
```python
outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=1200,
    use_cache=True,
)
```
- **`model.generate(...)`** → Runs the model to generate a **medical response**.
- **Parameters:**
  - **`input_ids=inputs.input_ids`** → Feeds the **tokenized prompt** to the model.
  - **`attention_mask=inputs.attention_mask`** → Marks which positions hold real tokens, so that any padding is ignored.
  - **`max_new_tokens=1200`** → Caps generation at **1200 new tokens**, not counting the prompt.
  - **`use_cache=True`** → Speeds up generation by **reusing cached key/value tensors** from earlier steps.  

📌 **Why Use `use_cache=True`?**  
- Each new token attends to all earlier tokens. Caching their attention keys and values **avoids recomputing them** at every step.  
- The saving grows with sequence length, so it helps most for **long texts**.  
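
The effect of caching can be sketched with a simple counting model. The function below is a toy, not how any library measures cost: it counts how many token positions need their attention keys and values computed during generation, with and without a cache. The name `kv_computations` and the example sizes are assumptions chosen for illustration.

```python
def kv_computations(prompt_len, new_tokens, use_cache):
    """Count key/value computations (per token position) in a toy decoding loop."""
    total = 0
    seq_len = prompt_len
    for step in range(new_tokens):
        if use_cache and step > 0:
            total += 1        # only the newest token needs new keys/values
        else:
            total += seq_len  # keys/values computed for the whole sequence
        seq_len += 1          # the generated token extends the sequence
    return total

print(kv_computations(100, 10, use_cache=True))   # 109
print(kv_computations(100, 10, use_cache=False))  # 1045
```

Without a cache the count grows roughly quadratically with the number of generated tokens. With a cache it grows linearly.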

---

## **5. Decoding and Printing the Response**
```python
response = tokenizer.batch_decode(outputs)
print(response[0].split("### Response:")[1])
```
- **`tokenizer.batch_decode(outputs)`** → Converts the **generated token IDs back into human-readable text**. The decoded string contains the prompt followed by the model's answer.
- **Extracting the Final Answer:**
  - The prompt template contains a `### Response:` marker, and the answer is the text that follows it.
  - **`split("### Response:")[1]`** → Keeps only the text after the first `"### Response:"` marker. It raises an `IndexError` if the marker is missing.  
- **`print(...)`** → Displays the extracted answer.  
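
The formatting and extraction steps can be tried without a model. The sketch below uses a hypothetical two-slot template, since the notebook's real `prompt_style` is defined earlier and may differ. It also shows a slightly more defensive extraction helper (`extract_response`, a made-up name) built on `str.partition`, which does not fail when the marker is absent.

```python
# Hypothetical template with two placeholders (question, response); the
# notebook's real `prompt_style` is defined earlier and may differ.
prompt_style = "### Question:\n{}\n\n### Response:\n{}"

def extract_response(decoded_text, marker="### Response:"):
    """Return the text after the first marker, or the whole text if absent."""
    _, found, after = decoded_text.partition(marker)
    return after.strip() if found else decoded_text.strip()

prompt = prompt_style.format("What is cystometry?", "")
decoded = prompt + "A test of bladder function."  # stand-in for model output

print(extract_response(decoded))      # A test of bladder function.
print(extract_response("no marker"))  # no marker
```

Unlike `split(...)[1]`, this helper returns everything after the first marker and falls back to the full text when no marker is found.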

---


In [None]:
question = "A 61-year-old woman with a long history of involuntary urine loss during activities like coughing or sneezing but no leakage at night undergoes a gynecological exam and Q-tip test. Based on these findings, what would cystometry most likely reveal about her residual volume and detrusor contractions?"


FastLanguageModel.for_inference(model)  # Unsloth has 2x faster inference!
inputs = tokenizer([prompt_style.format(question, "")], return_tensors="pt").to("cuda")

outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=1200,
    use_cache=True,
)
response = tokenizer.batch_decode(outputs)
print(response[0].split("### Response:")[1])



<think>
Okay, so let's think about this. We have a 61-year-old woman who's been dealing with involuntary urine loss during things like coughing or sneezing, but she's not leaking at night. That suggests she might have some kind of problem with her pelvic floor muscles or maybe her bladder.

Now, she's got a gynecological exam and a Q-tip test. Let's break that down. The Q-tip test is usually used to check for urethral obstruction. If it's positive, that means there's something blocking the urethra, like a urethral stricture or something else.

Given that she's had a positive Q-tip test, it's likely there's a urethral obstruction. That would mean her urethra is narrow, maybe due to a stricture or some kind of narrowing. So, her bladder can't empty properly during activities like coughing because the urethral obstruction is making it hard.

Now, let's think about what happens when her bladder can't empty. If there's a urethral obstruction, the bladder is forced to hold more urine, incre

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
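new_model_local = "/content/drive/MyDrive/DeepSeek-R1-Medical-COT"
# Save the LoRA adapter weights and the tokenizer files to Google Drive.
model.save_pretrained(new_model_local)
tokenizer.save_pretrained(new_model_local)

# Merge the LoRA weights into the base model and save the result in 16-bit
# precision (written to the same directory as the adapter files above).
model.save_pretrained_merged(new_model_local, tokenizer, save_method="merged_16bit")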

Unsloth: You have 1 CPUs. Using `safe_serialization` is 10x slower.
We shall switch to Pytorch saving, which might take 3 minutes and not 30 minutes.
To force `safe_serialization`, set it to `None` instead.
Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 6.0G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 3.12 out of 12.67 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


 34%|███▍      | 11/32 [00:00<00:01, 14.41it/s]
We will save to Disk and not RAM now.
100%|██████████| 32/32 [05:29<00:00, 10.30s/it]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving /content/drive/MyDrive/DeepSeek-R1-Medical-COT/pytorch_model-00001-of-00004.bin...
Unsloth: Saving /content/drive/MyDrive/DeepSeek-R1-Medical-COT/pytorch_model-00002-of-00004.bin...
Unsloth: Saving /content/drive/MyDrive/DeepSeek-R1-Medical-COT/pytorch_model-00003-of-00004.bin...
Unsloth: Saving /content/drive/MyDrive/DeepSeek-R1-Medical-COT/pytorch_model-00004-of-00004.bin...
Done.


You can now load the saved model and write a short test script that sends it a prompt and prints the response.
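
A minimal sketch of such a test script is shown below. It assumes `model_path` points at a directory or Hub repository that holds a single loadable model, and that the prompt has already been formatted with the same template used for training. The helper names `load_finetuned` and `ask` are hypothetical, and the code needs a GPU runtime with Unsloth installed to actually run.

```python
def load_finetuned(model_path, max_seq_length, load_in_4bit=True):
    """Load a saved model and tokenizer, then switch to inference mode."""
    # Imported lazily so the helpers can be defined without a GPU runtime.
    from unsloth import FastLanguageModel

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=model_path,
        max_seq_length=max_seq_length,
        load_in_4bit=load_in_4bit,
    )
    FastLanguageModel.for_inference(model)
    return model, tokenizer

def ask(model, tokenizer, prompt, max_new_tokens=1200):
    """Generate a completion for an already formatted prompt string."""
    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
    outputs = model.generate(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_new_tokens=max_new_tokens,
        use_cache=True,
    )
    return tokenizer.batch_decode(outputs)[0]

# Example usage (assumes `new_model_local`, `max_seq_length`, `prompt_style`,
# and `question` are defined as in the cells of this notebook):
# model, tokenizer = load_finetuned(new_model_local, max_seq_length)
# print(ask(model, tokenizer, prompt_style.format(question, "")))
```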

In [None]:
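new_model_online = "shukdevdatta123/DeepSeek-R1-Medical-COT"
# Upload the LoRA adapter weights and the tokenizer files to the Hugging Face Hub.
model.push_to_hub(new_model_online)
tokenizer.push_to_hub(new_model_online)

# Merge the LoRA weights into the base model and upload the 16-bit result
# to the same repository.
model.push_to_hub_merged(new_model_online, tokenizer, save_method="merged_16bit")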

README.md:   0%|          | 0.00/634 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.56k [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/168M [00:00<?, ?B/s]

Saved model to https://huggingface.co/shukdevdatta123/DeepSeek-R1-Medical-COT


tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

Unsloth: You are pushing to hub, but you passed your HF username = shukdevdatta123.
We shall truncate shukdevdatta123/DeepSeek-R1-Medical-COT to DeepSeek-R1-Medical-COT
Unsloth: Will remove a cached repo with size 1.6K


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 2.06 out of 12.67 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 32/32 [07:43<00:00, 14.48s/it]


Unsloth: Saving tokenizer...

No files have been modified since last commit. Skipping to prevent empty commit.


 Done.
Unsloth: Saving DeepSeek-R1-Medical-COT/pytorch_model-00001-of-00004.bin...
Unsloth: Saving DeepSeek-R1-Medical-COT/pytorch_model-00002-of-00004.bin...
Unsloth: Saving DeepSeek-R1-Medical-COT/pytorch_model-00003-of-00004.bin...
Unsloth: Saving DeepSeek-R1-Medical-COT/pytorch_model-00004-of-00004.bin...


pytorch_model-00003-of-00004.bin:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

pytorch_model-00001-of-00004.bin:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

Upload 4 LFS files:   0%|          | 0/4 [00:00<?, ?it/s]

pytorch_model-00002-of-00004.bin:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

pytorch_model-00004-of-00004.bin:   0%|          | 0.00/1.17G [00:00<?, ?B/s]

Done.
Saved merged model to https://huggingface.co/shukdevdatta123/DeepSeek-R1-Medical-COT
