To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
</div>

To install Unsloth on your local device, follow [our guide](https://docs.unsloth.ai/get-started/install-and-update). This notebook is licensed under [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & [how to save it](#Save)


### Installation
----------------------------------------

In [1]:
%%capture
import os, re
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    import torch; v = re.match(r"[0-9\.]{3,}", str(torch.__version__)).group(0)
    xformers = "xformers==" + ("0.0.32.post2" if v == "2.8.0" else "0.0.29.post3")
    !pip install --no-deps bitsandbytes accelerate {xformers} peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1,<4.0.0" "huggingface_hub>=0.34.0" hf_transfer
    !pip install --no-deps unsloth
!pip install transformers==4.56.2
!pip install --no-deps trl==0.22.2
!pip install evaluate
!pip install rouge_score



----------------------------------------------
### Download the base model from Unsloth
---------------------------------------------

In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",      # Llama-3.1 15 trillion tokens model 2x faster!
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",    # We also uploaded 4bit for 405b!
    "unsloth/Mistral-Nemo-Base-2407-bnb-4bit", # New Mistral 12b 2x faster!
    "unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit",
    "unsloth/mistral-7b-v0.3-bnb-4bit",        # Mistral v3 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3.5-mini-instruct",           # Phi-3.5 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",            # Gemma 2x faster!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


    PyTorch 2.6.0+cu124 with CUDA 1204 (you have 2.9.0+cu126)
    Python  3.12.9 (you have 3.12.12)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
  Set XFORMERS_MORE_DETAILS=1 for more details


Switching to PyTorch attention since your Xformers is broken.

Unsloth: Xformers was not installed correctly.
Please install xformers separately first.
Then confirm if it's correctly installed by running:
python -m xformers.info

Longer error message:
xFormers can't load C++/CUDA extensions. xFormers was built for:
    PyTorch 2.6.0+cu124 with CUDA 1204 (you have 2.9.0+cu126)
    Python  3.12.9 (you have 3.12.12)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.11.4: Fast Llama patching. Transformers: 4.56.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = FALSE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.c

model.safetensors:   0%|          | 0.00/5.96G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/235 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/459 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

# We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [3]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2025.11.4 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
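
The trainable-parameter count that Unsloth prints when training starts (41,943,040) can be reproduced by hand. The sketch below assumes the published Llama 3.1 8B shapes (hidden size 4096, key/value projection width 1024, MLP width 14336, 32 layers); these numbers are not stated in this notebook, and the helper name `lora_param_count` is hypothetical.

```python
# Assumed Llama 3.1 8B shapes (not taken from this notebook)
HIDDEN, KV_DIM, MLP_DIM, NUM_LAYERS = 4096, 1024, 14336, 32

# (in_features, out_features) for every module listed in target_modules
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}

def lora_param_count(r):
    """LoRA adds A (r x in) and B (out x r) per module: r * (in + out) weights."""
    per_layer = sum(r * (fan_in + fan_out) for fan_in, fan_out in MODULE_SHAPES.values())
    return per_layer * NUM_LAYERS

print(lora_param_count(16))  # 41943040
```

Because the count is linear in `r`, doubling the rank doubles the number of trainable weights, which is worth keeping in mind on a 15 GB T4.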


---------------------------------------------------
### Data Prep

---------------------------------------------------

Our dataset contains about 112k question-and-answer pairs from the medical field.

In [4]:
from datasets import Dataset
import pandas as pd

# 🔹 Step 1: Load your CSV file from your local Colab path
# Change this to your actual uploaded file path
csv_path = "/content/Doctor-HealthCare-100k.csv"

df = pd.read_csv(csv_path)
print("✅ Data loaded successfully!")
print(df.head())

# 🔹 Step 2: Define the same Alpaca-style prompt
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

# 🔹 Step 3: Define EOS token and formatting function
EOS_TOKEN = tokenizer.eos_token  # Must add EOS_TOKEN

def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs       = examples["input"]
    outputs      = examples["output"]
    texts = []
    for instruction, input_text, output in zip(instructions, inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = alpaca_prompt.format(instruction, input_text, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts }

# 🔹 Step 4: Convert the DataFrame to a Hugging Face dataset
dataset = Dataset.from_pandas(df)

# 🔹 Step 5: Apply the formatting function
dataset = dataset.map(formatting_prompts_func, batched=True)

print("✅ Dataset ready for Unsloth fine-tuning!")
print(dataset[0])


✅ Data loaded successfully!
                                         instruction  \
0  If you are a doctor, please answer the medical...   
1  If you are a doctor, please answer the medical...   
2  If you are a doctor, please answer the medical...   
3  If you are a doctor, please answer the medical...   
4  If you are a doctor, please answer the medical...   

                                               input  \
0  I woke up this morning feeling the whole room ...   
1  My baby has been pooing 5-6 times a day for a ...   
2  Hello, My husband is taking Oxycodone due to a...   
3  lump under left nipple and stomach pain (male)...   
4  I have a 5 month old baby who is very congeste...   

                                              output  
0  Hi, Thank you for posting your query. The most...  
1  Hi... Thank you for consulting in Chat Doctor....  
2  Hello, and I hope I can help you today.First, ...  
3  HI. You have two different problems. The lump ...  
4  Thank you for usin

Map:   0%|          | 0/112156 [00:00<?, ? examples/s]

✅ Dataset ready for Unsloth fine-tuning!
{'instruction': "If you are a doctor, please answer the medical questions based on the patient's description.", 'input': 'I woke up this morning feeling the whole room is spinning when i was sitting down. I went to the bathroom walking unsteadily, as i tried to focus i feel nauseous. I try to vomit but it wont come out.. After taking panadol and sleep for few hours, i still feel the same.. By the way, if i lay down or sit down, my head do not spin, only when i want to move around then i feel the whole world is spinning.. And it is normal stomach discomfort at the same time? Earlier after i relieved myself, the spinning lessen so i am not sure whether its connected or coincidences.. Thank you doc!', 'output': 'Hi, Thank you for posting your query. The most likely cause for your symptoms is benign paroxysmal positional vertigo (BPPV), a type of peripheral vertigo. In this condition, the most common symptom is dizziness or giddiness, which is mad
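
To see exactly what one training string looks like, the formatting step can be replayed on a toy batch without loading the model. This is a minimal sketch: `format_batch` and the toy rows are made up for illustration, and `"<|end_of_text|>"` is an assumed stand-in for `tokenizer.eos_token`.

```python
ALPACA_PROMPT = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS = "<|end_of_text|>"  # assumed stand-in; the notebook uses tokenizer.eos_token

def format_batch(batch, eos_token=EOS):
    """Mirror formatting_prompts_func on a plain dict of columns."""
    texts = [
        ALPACA_PROMPT.format(instruction, input_text, output) + eos_token
        for instruction, input_text, output in zip(
            batch["instruction"], batch["input"], batch["output"]
        )
    ]
    return {"text": texts}

# Hypothetical example rows, not taken from the dataset
toy_batch = {
    "instruction": ["Answer the medical question."],
    "input": ["What is a fever?"],
    "output": ["A fever is a temporary rise in body temperature."],
}
print(format_batch(toy_batch)["text"][0])
```

The response follows `### Response:` after a single newline and the string ends with the EOS token, so prompts built at inference time should end the same way, with nothing after that newline.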

# Download the latest checkpoints from Google Drive into the Colab filesystem

In [None]:
# 1️⃣ Install gdown
!pip install -q gdown

# 2️⃣ Download the entire public folder (recursively)
!gdown --folder 'https://drive.google.com/drive/folders/1ON_heysyJz0QqXVKbh1TKkK57mcZ7Crn?usp=drive_link' -O /content/checkpoints

# 3️⃣ Verify the downloaded checkpoints
!ls -lh /content/checkpoints

# (Optional) show subfolders (e.g., checkpoint-1000, checkpoint-2000, etc.)
!ls -lh /content/checkpoints/*/


Retrieving folder contents
Retrieving folder 11e4HtXnWwRIQ5wztkmSG3E9-L7Qw9Hn5 checkpoint-15500
Processing file 1R4g6sxioAx12dfgPoclxdLqdcW-8i0h3 adapter_config.json
Processing file 1Mx2gHMGDHsi6AC7WUsRLXeB2T6t5hlwL adapter_model.safetensors
Processing file 1PsmJSBRL4R63gxg4VaC-MfaO8Pbo2Cgu optimizer.pt
Processing file 1u2D96LZv43Nm2P6gHONKnZQ-iSl730Tv README.md
Processing file 1f4tLK7FOrlw3mONZRjd-1VwQP26yjc26 rng_state.pth
Processing file 16ZqLJ3oHf4vJpFil85qa1kCYmVb1ywMR scaler.pt
Processing file 1WwcPA_ZT2f4AlzdGlNAgc0zftwu2BW81 scheduler.pt
Processing file 1ibIAqxGmatHLMQAWXXEvylbCp4x46Tpi special_tokens_map.json
Processing file 1ZAErsMCHqEQ-ggCuPhrDZabZO-V93FcX tokenizer_config.json
Processing file 10mrCXQXBaEqRIv_iK6hvu_lApBr6oN9z tokenizer.json
Processing file 1ufcrddSYV-jrdL4xDTHRN0ekLkBXL0jB trainer_state.json
Processing file 1donW2x9k6ac1SpKSD_ZGv7d9hnJu4eci training_args.bin
Retrieving folder 1ffQwWK2nKSrFLLNEXGw15KwMpuvOwCm6 checkpoint-15600
Processing file 1pBsyCTgo3a35Fl
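
Because only the most recent checkpoints are kept, the folder name to resume from changes between sessions. A small sketch of picking the highest-numbered `checkpoint-<step>` folder follows; the helper name `latest_checkpoint` is hypothetical, and `transformers.trainer_utils.get_last_checkpoint` offers similar behaviour if you prefer a library call.

```python
import os
import re

def latest_checkpoint(root):
    """Return the path of the highest-numbered checkpoint-<step> folder, or None."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    candidates = []
    for name in os.listdir(root):
        match = pattern.match(name)
        # Only directories count; stray files with a matching name are ignored
        if match and os.path.isdir(os.path.join(root, name)):
            candidates.append((int(match.group(1)), name))
    if not candidates:
        return None
    # Compare by the integer step so checkpoint-15600 beats checkpoint-900
    return os.path.join(root, max(candidates)[1])

# Example (path as used in this notebook):
# resume_path = latest_checkpoint("/content/checkpoints")
```

Sorting by the integer step avoids the string-ordering trap where `checkpoint-900` would sort after `checkpoint-15600`.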

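# Train the model
Now let's train our model. With an effective batch size of 8 (2 per device × 4 gradient accumulation steps), one epoch over the 112,156 examples takes about 14,000 steps, so `max_steps = 15700` covers roughly 1.12 epochs.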
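
The relationship between dataset size, batch settings, and step count can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch using the values from the trainer cell and the dataset size shown by the `Map` progress bar; the variable names are illustrative.

```python
import math

num_examples = 112_156          # dataset size reported by the Map progress bar
per_device_batch_size = 2
gradient_accumulation_steps = 4
max_steps = 15_700

effective_batch = per_device_batch_size * gradient_accumulation_steps  # single GPU
steps_per_epoch = math.ceil(num_examples / effective_batch)
epochs_covered = max_steps * effective_batch / num_examples

print(steps_per_epoch)           # 14020
print(round(epochs_covered, 2))  # 1.12
```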

In [None]:
from trl import SFTConfig, SFTTrainer

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,   # 👈 this is the formatted medical dataset
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    packing = False,  # Can make training 5x faster for short sequences, but disable for safety first.
    args = SFTConfig(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # num_train_epochs = 1,  # Optional alternative to max_steps
        max_steps = 15700,          # 🔹 About 1.12 epochs at an effective batch size of 8
        learning_rate = 2e-4,
        logging_steps = 50,
        optim = "adamw_8bit",    # 🔹 memory efficient (works with Unsloth)
        weight_decay = 0.001,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none",      # Disable external logging for now
        # 🧠 Checkpointing settings ↓↓↓
        save_steps = 100,           # save every 100 steps (you can increase to 500)
        save_total_limit = 3,       # keep only the last 3 checkpoints to save space
    ),
)


Unsloth: Tokenizing ["text"] (num_proc=6):   0%|          | 0/112156 [00:00<?, ? examples/s]

In [None]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
6.881 GB of memory reserved.


In [None]:
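# Keep the return value: the memory/time stats cell below reads trainer_stats.metrics
trainer_stats = trainer.train(resume_from_checkpoint="/content/checkpoints/checkpoint-14200")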

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 112,156 | Num Epochs = 2 | Total steps = 15,700
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 41,943,040 of 8,072,204,288 (0.52% trained)


Step,Training Loss
14250,1.7116
14300,1.7158
14350,1.731
14400,1.753
14450,1.7061
14500,1.7244
14550,1.7113
14600,1.7534
14650,1.7159
14700,1.767


TrainOutput(global_step=15700, training_loss=0.16612786797201556, metrics={'train_runtime': 13317.4677, 'train_samples_per_second': 9.431, 'train_steps_per_second': 1.179, 'total_flos': 1.8433087939191767e+18, 'train_loss': 0.16612786797201556, 'epoch': 1.119833089625165})

# Final logged training loss is about 1.7 (the last logged steps range from 1.71 to 1.77)
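
The `training_loss=0.166` in `TrainOutput` looks inconsistent with the logged losses of about 1.7. One plausible explanation, assuming the Trainer accumulates loss only for the steps run in this session but divides by the global step count, is that the 1,500 resumed steps are averaged over all 15,700 steps. The check below is a rough sketch under that assumption, using only the ten loss values logged above.

```python
logged_losses = [1.7116, 1.7158, 1.731, 1.753, 1.7061,
                 1.7244, 1.7113, 1.7534, 1.7159, 1.767]

resumed_from = 14_200   # step of the checkpoint passed to resume_from_checkpoint
global_step = 15_700    # final step reported by TrainOutput

mean_logged_loss = sum(logged_losses) / len(logged_losses)
steps_this_session = global_step - resumed_from

# Loss summed over this session's steps, divided by the global step count
estimated_train_loss = mean_logged_loss * steps_this_session / global_step

print(round(mean_logged_loss, 3))      # 1.729
print(round(estimated_train_loss, 3))  # 0.165
```

The estimate lands close to the reported 0.166, so the per-step logs are the more meaningful number to track when a run has been resumed.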

In [None]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

7944.6175 seconds used for training.
132.41 minutes used for training.
Peak reserved memory = 8.152 GB.
Peak reserved memory for training = 1.271 GB.
Peak reserved memory % of max memory = 55.302 %.
Peak reserved memory for training % of max memory = 8.622 %.


# How much of the epoch has been completed

In [None]:
import json

# Use a context manager so the file handle is closed after reading
with open("/content/outputs/checkpoint-15700/trainer_state.json") as f:
    state = json.load(f)

print("epoch =", state["epoch"])
print("global_step =", state["global_step"])
print("max_steps =", state["max_steps"])


epoch = 1.119833089625165
global_step = 15700
max_steps = 15700


# Upload the checkpoints to Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [None]:
!cp -r /content/outputs /content/drive/MyDrive/


# Testing the model

In [None]:
from peft import PeftModel
import torch

# ✅ Your already loaded base model & tokenizer
# model, tokenizer = FastLanguageModel.from_pretrained(...)  # already done

# Load LoRA adapter from Hugging Face
LORA_MODEL = "Mohamed-Abdelsamed/llama-medical-lora"
lora_model = PeftModel.from_pretrained(model, LORA_MODEL)

# Function to chat
def ask_bot(question, max_tokens=200):
    inputs = tokenizer(question, return_tensors="pt").to(lora_model.device)
    outputs = lora_model.generate(
        **inputs,
        max_new_tokens=max_tokens,
        temperature=0.7,
        top_p=0.9
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# ✅ Example usage
user_question = "Hello doctor, I have a fever and headache, what should I do?"
response = ask_bot(user_question)
print("Chatbot:", response)


adapter_config.json: 0.00B [00:00, ?B/s]



adapter_model.safetensors:   0%|          | 0.00/168M [00:00<?, ?B/s]



Chatbot: Hello doctor, I have a fever and headache, what should I do? I feel like I have a cold. This is the most common question that I get asked. And it’s not only me, but every doctor gets asked the same thing. It’s a very common question. And I’m going to give you the answer. So let’s talk about this question. What is a fever? And how do you get it? And what are the symptoms? And what are the symptoms of a fever? And what are the symptoms of a fever? And what are the symptoms of a fever? And what are the symptoms of a fever? And what are the symptoms of a fever? And what are the symptoms of a fever? And what are the symptoms of a fever? And what are the symptoms of a fever? And what are the symptoms of a fever? And what are the symptoms of a fever? And what are the symptoms of a fever? And what are the symptoms of a fever? And what are the symptoms of a fever? And what are the symptoms


In [5]:
from unsloth import FastLanguageModel
from peft import PeftModel
import torch

# ✅ Your already loaded base model & tokenizer
# model, tokenizer = FastLanguageModel.from_pretrained(...)  # already done

# Load LoRA adapter
LORA_MODEL = "Mohamed-Abdelsamed/llama-medical-lora"
lora_model = PeftModel.from_pretrained(model, LORA_MODEL)

# Enable inference mode for faster generation
FastLanguageModel.for_inference(lora_model)

# Alpaca-style prompt template
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

# ✅ Example medical instruction + input
question = "what is diabets"

inputs = tokenizer([
    alpaca_prompt.format(
        "You are an AI medical assistant. "
    "You must always identify yourself as an AI, not a human. "
    "Provide accurate, safe, and concise medical information."
    "try to give solution for the problem"
    ,
        question,
        ""  # Leave output empty for generation
    )
], return_tensors="pt").to(lora_model.device)

# Generate response
outputs = lora_model.generate(
    **inputs,
    max_new_tokens=100,      # reduce max tokens
    temperature=0.5,         # slightly lower for focused response
    top_p=0.8,               # nucleus sampling
    repetition_penalty=1.3,  # avoid loops
    eos_token_id=tokenizer.eos_token_id,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id
)

# Decode and show result
response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
print("Chatbot:", response)


adapter_config.json: 0.00B [00:00, ?B/s]



adapter_model.safetensors:   0%|          | 0.00/168M [00:00<?, ?B/s]



Chatbot: Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
You are an AI medical assistant. You must always identify yourself as an AI, not a human. Provide accurate, safe, and concise medical information.try to give solution for the problem

### Input:
what is diabets

### Response:
Diabetes mellitus (DM), commonly known as diabetes,[1] is a group of metabolic disorders characterized by high blood sugar levels over a prolonged period.[2][3]

Symptoms include frequent urination, increased thirst or hunger, slow wound healing, blurry vision, fatigue, tingling in hands and feet due to nerve damage, painless skin infections such as cellulitis[4], yeast infection on genitals called candidiasis [5]. In type 1 DM there may also be weight loss
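
The decoded text above repeats the whole prompt because `generate` returns the prompt tokens followed by the continuation. A minimal sketch of keeping only the answer is shown below; `extract_response` is a hypothetical helper, and it assumes the answer never contains the `### Response:` marker itself. Slicing the output token ids by the prompt length before decoding is an alternative.

```python
RESPONSE_MARKER = "### Response:"

def extract_response(decoded_text, marker=RESPONSE_MARKER):
    """Return only the text after the last response marker."""
    # rpartition returns ("", "", text) when the marker is absent,
    # so the original text comes back unchanged in that case.
    _, _, tail = decoded_text.rpartition(marker)
    return tail.strip()

decoded = (
    "### Instruction:\nYou are an AI medical assistant.\n\n"
    "### Input:\nwhat is diabets\n\n"
    "### Response:\nDiabetes mellitus (DM) is a group of metabolic disorders."
)
print(extract_response(decoded))
```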


# UPLOAD THE LORA MODEL TO HUGGING FACE

In [None]:
import os
from huggingface_hub import HfApi, create_repo

# ✅ Step 1: Read your token from the environment; never hard-code it in a notebook
HF_TOKEN = os.environ["HF_TOKEN"]

# ✅ Step 2: Define your repo ID (no spaces!)
repo_id = "Mohamed-Abdelsamed/llama-medical-lora"

# ✅ Step 3: Create the repo (if it doesn't exist)
create_repo(
    repo_id=repo_id,
    repo_type="model",
    token=HF_TOKEN,
    exist_ok=True
)

# ✅ Step 4: Upload the LoRA folder
api = HfApi()
api.upload_folder(
    folder_path="/content/checkpoints/checkpoint-15700",  # your final checkpoint
    repo_id=repo_id,
    repo_type="model",
    token=HF_TOKEN
)

# ✅ Step 5: Print Hugging Face URL
print(f"Your LoRA model is now uploaded at: https://huggingface.co/{repo_id}")


Processing Files (0 / 0)      : |          |  0.00B /  0.00B            

New Data Upload               : |          |  0.00B /  0.00B            

  ...point-15700/rng_state.pth: 100%|##########| 14.6kB / 14.6kB            

  ...oint-15700/tokenizer.json: 100%|##########| 17.2MB / 17.2MB            

  ...kpoint-15700/optimizer.pt:   1%|          |  526kB / 86.9MB            

  ...adapter_model.safetensors:   0%|          |  555kB /  168MB            

  ...heckpoint-15700/scaler.pt: 100%|##########| 1.38kB / 1.38kB            

  ...kpoint-15700/scheduler.pt:   1%|          |  14.0B / 1.47kB            

  ...t-15700/training_args.bin:   1%|1         |  63.0B / 6.22kB            

Your LoRA model is now uploaded at: https://huggingface.co/Mohamed-Abdelsamed/llama-medical-lora


# saving the model

In [None]:
from unsloth import FastLanguageModel

# Load the base model together with the LoRA adapter from the final checkpoint.
# from_pretrained returns a (model, tokenizer) pair; Unsloth reads
# adapter_config.json in the checkpoint folder to locate the base model.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="/content/outputs/checkpoint-15700",
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True,
)

# Merge the LoRA weights into the base weights and save a 16-bit model
save_path = "/content/Final-Medical-Model"
model.save_pretrained_merged(save_path, tokenizer, save_method="merged_16bit")

print("✅ Full merged model saved successfully!")


==((====))==  Unsloth 2025.11.3: Fast Llama patching. Transformers: 4.56.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
==((====))==  Unsloth 2025.11.3: Fast Llama patching. Transformers: 4.56.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


<a name="Inference"></a>
### Inference
Let's run the model! You can change the instruction and input - leave the output blank!



In [None]:
# Enable inference mode for generation
FastLanguageModel.for_inference(model)

system_prompt = (
    "You are an AI medical assistant. "
    "Always clearly identify yourself as an AI, not a human. "
    "Use correct grammar, concise sentences, and clear structure. "
    "If a question is outside your ability to diagnose or prescribe, "
    "advise the user to consult a qualified healthcare provider."
)

inputs = tokenizer([
    alpaca_prompt.format(
        system_prompt,
        "my eye hurt me",
        "",
    )
], return_tensors="pt").to("cuda")

outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    temperature=0.8,          # smoother, less chaotic
    repetition_penalty=1.8,
    no_repeat_ngram_size=4,
    eos_token_id=tokenizer.eos_token_id,
    use_cache=True
)

response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
print(response)


Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
You are an AI medical assistant. Always clearly identify yourself as an AI, not a human. Use correct grammar, concise sentences, and clear structure. If a question is outside your ability to diagnose or prescribe, advise the user to consult a qualified healthcare provider.

### Input:
my eye hurt me

### Response:
Hello dear friend....i have gone through you query thoroughly.it may be due allergic conjunctivitis...you should use antibiotic ointment like ciprofloxacin for 5 days in both eyes...and take tablet montelukast+ levocetirizine combination once daily at bedtime along witChatDoctorplete course of treatment.if there will no improvement than go visit nearby doctor.....thank u


 You can also use a `TextStreamer` for continuous inference - so you can see the generation token by token, instead of waiting the whole time!
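
A minimal sketch of streaming with `TextStreamer` from `transformers` follows. The wrapper name `stream_answer` and its defaults are illustrative choices, not part of the notebook; it expects the `model` and `tokenizer` objects loaded earlier.

```python
def stream_answer(model, tokenizer, prompt, max_new_tokens=128):
    """Generate a reply and print it to stdout token by token."""
    # Imported lazily so that defining this helper does not require transformers
    from transformers import TextStreamer

    # skip_prompt=True prints only the newly generated text, not the prompt echo
    streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    inputs = tokenizer([prompt], return_tensors="pt").to(model.device)
    return model.generate(**inputs, streamer=streamer, max_new_tokens=max_new_tokens)

# Example, reusing names from the cell above:
# stream_answer(model, tokenizer, alpaca_prompt.format(system_prompt, "my eye hurt me", ""))
```

The function still returns the full tensor of token ids, so the output can be decoded afterwards if it needs to be stored.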

In [None]:
model.save_pretrained("lora_model")  # Local saving
tokenizer.save_pretrained("lora_model")
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/tokenizer.json')

In [None]:
!zip -r lora_model.zip lora_model


updating: lora_model/ (stored 0%)
updating: lora_model/adapter_config.json (deflated 57%)
updating: lora_model/adapter_model.safetensors (deflated 7%)
updating: lora_model/special_tokens_map.json (deflated 71%)
updating: lora_model/tokenizer.json (deflated 85%)
updating: lora_model/README.md (deflated 65%)
updating: lora_model/tokenizer_config.json (deflated 96%)


In [None]:
from google.colab import drive
drive.mount('/content/drive')
!cp lora_model.zip /content/drive/MyDrive/


Mounted at /content/drive


# ******************************************************************

# Gradio UI

# trial 1

In [None]:
import gradio as gr
from unsloth import FastLanguageModel
from peft import PeftModel
import torch

# ==============================
# Load model + LoRA
# ==============================

# TODO: Insert your base model + tokenizer loading here
# model, tokenizer = FastLanguageModel.from_pretrained(...)

LORA_MODEL = "Mohamed-Abdelsamed/llama-medical-lora"
lora_model = PeftModel.from_pretrained(model, LORA_MODEL)

FastLanguageModel.for_inference(lora_model)

alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

def generate_answer(history, question):
    instruction = (
        "You are an AI medical assistant. "
        "You must always identify yourself as an AI, not a human. "
        "Provide accurate, safe, and concise medical information. "
        "Try to give a solution for the problem."
    )

    prompt = alpaca_prompt.format(instruction, question, "")

    inputs = tokenizer([prompt], return_tensors="pt").to(lora_model.device)

    outputs = lora_model.generate(
        **inputs,
        max_new_tokens=130,
        temperature=0.3,
        top_p=0.8,
        repetition_penalty=1.3,
        eos_token_id=tokenizer.eos_token_id,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )

    response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

    # ===== STRIP PROMPT =====
    # Remove any text before actual response
    if "### Response:" in response:
        response = response.split("### Response:")[-1].strip()

    # Optional: also remove lingering instructions or inputs
    response_lines = response.split("\n")
    response_clean = []
    for line in response_lines:
        if not line.startswith("### Instruction:") and not line.startswith("### Input:"):
            response_clean.append(line)
    response = "\n".join(response_clean).strip()
    # ===== END STRIP =====

    # gr.Chatbot's tuple format expects one (user_message, bot_message) pair per turn
    history = history or []
    history.append((question, response))
    return history, ""

# ==============================
# CHATGPT-STYLE UI
# ==============================

css = """
.gradio-container {max-width: 900px !important; margin: auto;}
.message {padding: 10px 15px; border-radius: 12px; margin-bottom: 10px; max-width: 80%;}
.user-msg {background: #007bff; color: white; margin-left: auto;}
.bot-msg {background: #3a3a3c; color: white; margin-right: auto;}
"""

with gr.Blocks(css=css, title="Medical AI Assistant") as demo:
    gr.Markdown(
        """
        <h1 style='text-align:center;'>🩺 Medical AI Assistant</h1>
        <p style='text-align:center; opacity:0.7;'>
        Ask any medical question. Powered by your LoRA fine-tuned model.
        </p>
        """
    )

    chatbox = gr.Chatbot(
        label="Chat",
        elem_id="chatbot",
        height=450,
    )

    question = gr.Textbox(
        placeholder="Type your medical question here...",
        label="Your Question",
    )

    with gr.Row():
        send_btn = gr.Button("Send", variant="primary")
        clear_btn = gr.Button("Clear")

    send_btn.click(generate_answer, inputs=[chatbox, question], outputs=[chatbox, question])
    clear_btn.click(lambda: [], None, chatbox)

demo.launch()




It looks like you are running Gradio on a hosted Jupyter notebook, which requires `share=True`. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://7f85e18fc03ab93390.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)




# trial 2

In [None]:
import gradio as gr
from unsloth import FastLanguageModel
from peft import PeftModel
import torch

# ==============================
# Load model + LoRA
# ==============================

# TODO: Load your base model + tokenizer here
# Example:
# model, tokenizer = FastLanguageModel.from_pretrained("your-base-model")

LORA_MODEL = "Mohamed-Abdelsamed/llama-medical-lora"
lora_model = PeftModel.from_pretrained(model, LORA_MODEL)

# Enable faster inference
FastLanguageModel.for_inference(lora_model)

# Alpaca-style prompt template
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

# ==============================
# Inference function
# ==============================

def generate_answer(history, question):
    # concise instruction to avoid long intros
    instruction = (
        "You are a medical AI assistant. "
        "Answer concisely and directly. "
        "Do not add greetings or unrelated commentary."

    )

    prompt = alpaca_prompt.format(instruction, question, "")

    inputs = tokenizer([prompt], return_tensors="pt").to(lora_model.device)

    outputs = lora_model.generate(
        **inputs,
        max_new_tokens=120,
        temperature=0.4,
        top_p=0.8,
        repetition_penalty=1.3,
        eos_token_id=tokenizer.eos_token_id,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )

    response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

    # STRIP any instruction/prompt that might appear
    if "### Response:" in response:
        response = response.split("### Response:")[-1].strip()

    # gr.Chatbot's tuple format expects one (user_message, bot_message) pair per turn
    history = history or []
    history.append((question, response))
    return history, ""


# ==============================
# ChatGPT-style UI
# ==============================

css = """
.gradio-container {max-width: 900px !important; margin: auto;}
.message {padding: 10px 15px; border-radius: 12px; margin-bottom: 10px; max-width: 80%;}
.user-msg {background: #007bff; color: white; margin-left: auto;}
.bot-msg {background: #3a3a3c; color: white; margin-right: auto;}
"""

with gr.Blocks(css=css, title="Medical AI Assistant") as demo:
    gr.Markdown(
        """
        <h1 style='text-align:center;'>🩺 Medical AI Assistant</h1>
        <p style='text-align:center; opacity:0.7;'>
        Ask any medical question. Powered by your LoRA fine-tuned model.
        </p>
        """
    )

    chatbox = gr.Chatbot(
        label="Chat",
        elem_id="chatbot",
        height=450,
    )

    question = gr.Textbox(
        placeholder="Type your medical question here...",
        label="Your Question",
    )

    with gr.Row():
        send_btn = gr.Button("Send", variant="primary")
        clear_btn = gr.Button("Clear")

    send_btn.click(generate_answer, inputs=[chatbox, question], outputs=[chatbox, question])
    clear_btn.click(lambda: [], None, chatbox)

demo.launch()




It looks like you are running Gradio on a hosted Jupyter notebook, which requires `share=True`. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://d0c1c9b32ef9a3833f.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
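
The cell above recovers the answer by splitting the decoded text on `### Response:`. That post-processing step can be checked without loading the model. The sketch below is a minimal illustration: `extract_response` is a hypothetical helper name, and the sample string only stands in for what `tokenizer.batch_decode` might return when the prompt is echoed back.

```python
ALPACA_PROMPT = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}
"""

RESPONSE_MARKER = "### Response:"


def extract_response(decoded: str) -> str:
    """Return the text after the last response marker, or the whole text if the marker is absent."""
    if RESPONSE_MARKER in decoded:
        decoded = decoded.split(RESPONSE_MARKER)[-1]
    return decoded.strip()


# Stand-in for decoded model output: the echoed prompt followed by a placeholder answer.
prompt = ALPACA_PROMPT.format("Answer concisely.", "Example question?", "")
decoded = prompt + "Example answer."
print(extract_response(decoded))
```

Splitting on the last marker (index `-1`) keeps the helper safe if the user's question happens to contain the marker text itself.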




# trial 3

In [None]:
import gradio as gr
from unsloth import FastLanguageModel
from peft import PeftModel
import torch

# ==============================
# Load model + LoRA
# ==============================
# TODO: Insert your base model + tokenizer loading here
# model, tokenizer = FastLanguageModel.from_pretrained(...)

LORA_MODEL = "Mohamed-Abdelsamed/llama-medical-lora"
lora_model = PeftModel.from_pretrained(model, LORA_MODEL)
FastLanguageModel.for_inference(lora_model)

alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}
"""

def generate_answer(history, question):
    instruction = (
        "You are an AI medical assistant. "
        "You must always identify yourself as an AI, not a human. "
        "Provide accurate, safe, and concise medical information. "
        "Try to give a solution for the problem."
    )
    prompt = alpaca_prompt.format(instruction, question, "")
    inputs = tokenizer([prompt], return_tensors="pt").to(lora_model.device)
    outputs = lora_model.generate(
        **inputs,
        max_new_tokens=130,
        temperature=0.3,
        top_p=0.8,
        repetition_penalty=1.3,
        eos_token_id=tokenizer.eos_token_id,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )
    response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

    if "### Response:" in response:
        response = response.split("### Response:")[-1].strip()
    response_lines = response.split("\n")
    response_clean = []
    for line in response_lines:
        if not line.startswith("### Instruction:") and not line.startswith("### Input:"):
            response_clean.append(line)
    response = "\n".join(response_clean).strip()

    # gr.Chatbot's tuple format expects one (user_message, bot_message) pair per turn
    history = history or []
    history.append((question, response))
    return history, ""

# ==============================
# CSS for light/dark mode
# ==============================
css = """
body {font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif;}
.gradio-container {max-width: 800px !important; margin: 40px auto;}
h1 {text-align: center; margin-bottom: 5px;}
h2 {text-align: center; margin-top: 0; margin-bottom: 20px;}

.gradio-chatbot {
    border-radius: 20px;
    padding: 20px;
    height: 500px;
    overflow-y: auto;
    box-shadow: 0 5px 15px rgba(0,0,0,0.1);
}

.message {
    padding: 12px 20px;
    border-radius: 25px;
    margin-bottom: 15px;
    font-size: 16px;
    line-height: 1.5;
    max-width: 75%;
    word-wrap: break-word;
    transition: all 0.2s;
}

/* Light Mode */
body.light-mode {background-color: #f5f7fa;}
.light-mode .gradio-chatbot {background-color: #ffffff;}
.light-mode .user-msg {background-color: #1976d2; color: white; margin-left: auto;}
.light-mode .bot-msg {background-color: #455a64; color: white; margin-right: auto;}
.light-mode input[type="text"] {border-radius: 25px; border: 1px solid #cfd8dc; padding: 12px 15px; width: 100%; box-sizing: border-box; font-size:16px;}
.light-mode .gr-button {border-radius: 25px; background-color: #0d47a1; color:white; border:none; font-weight:bold; padding:12px 25px; margin-left:10px;}
.light-mode .gr-button:hover {background-color:#1565c0; transform: scale(1.03);}

/* Dark Mode */
body.dark-mode {background-color: #1c1c1c; color: #e0e0e0;}
.dark-mode .gradio-chatbot {background-color: #2c2c2c;}
.dark-mode .user-msg {background-color: #0d47a1; color: white; margin-left: auto;}
.dark-mode .bot-msg {background-color: #263238; color: white; margin-right: auto;}
.dark-mode input[type="text"] {border-radius: 25px; border: 1px solid #555; padding: 12px 15px; width: 100%; box-sizing: border-box; font-size:16px; background-color:#3c3c3c; color:white;}
.dark-mode .gr-button {border-radius: 25px; background-color: #0d47a1; color:white; border:none; font-weight:bold; padding:12px 25px; margin-left:10px;}
.dark-mode .gr-button:hover {background-color:#1565c0; transform: scale(1.03);}
"""

# ==============================
# Gradio UI
# ==============================
with gr.Blocks(css=css, title="Medical AI Assistant") as demo:
    gr.Markdown("<h1>🩺 Medical AI Assistant</h1>")
    gr.Markdown("<h2>Ask any medical question. Powered by your LoRA fine-tuned model.</h2>")

    chatbox = gr.Chatbot(label="", elem_id="chatbot", height=500)

    with gr.Row():
        question = gr.Textbox(
            placeholder="Type your medical question here...",
            label="",
            elem_id="input-box"
        )
        send_btn = gr.Button("Send")
        toggle_btn = gr.Button("Toggle Dark/Light Mode")

    clear_btn = gr.Button("Clear", variant="secondary")

    send_btn.click(generate_answer, inputs=[chatbox, question], outputs=[chatbox, question])
    clear_btn.click(lambda: [], None, chatbox)

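    # Dark/Light mode toggle, run entirely in the browser.
    # Assumes Gradio 4+, where event listeners accept a `js` argument
    # (Gradio 3.x named it `_js`). A <script> tag returned through gr.HTML
    # is not executed, so the class toggle runs as a client-side handler.
    toggle_js = (
        "() => {"
        " const dark = document.body.classList.toggle('dark-mode');"
        " document.body.classList.toggle('light-mode', !dark);"
        " }"
    )

    toggle_btn.click(fn=None, inputs=None, outputs=None, js=toggle_js)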

demo.launch()




It looks like you are running Gradio on a hosted Jupyter notebook, which requires `share=True`. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://2492685738db4626eb.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
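
A note on the chat history used by `generate_answer`: with tuple-style history, `gr.Chatbot` reads each entry as one (user message, bot message) pair, so a full turn is a single appended pair. The sketch below shows that bookkeeping without Gradio. `append_turn` is a hypothetical helper, and the strings are placeholders rather than model output.

```python
from typing import List, Optional, Tuple

History = List[Tuple[str, str]]


def append_turn(history: Optional[History], question: str, answer: str) -> Tuple[History, str]:
    """Add one (user, bot) pair; also return "" so the input textbox is cleared."""
    updated = list(history or [])  # copy, and tolerate an empty (None) chat
    updated.append((question, answer))
    return updated, ""


history, textbox = append_turn(None, "first question", "(model answer 1)")
history, textbox = append_turn(history, "second question", "(model answer 2)")
for user_msg, bot_msg in history:
    print(f"user: {user_msg} | bot: {bot_msg}")
```

Returning a second value of `""` mirrors the `outputs=[chatbox, question]` wiring in the cell above, where the second output resets the textbox.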




-----------------------
# measure accuracy
-----------------------

# 1- Perplexity (PPL)

In [6]:
import math
from time import time

import torch

model = lora_model   # your fine-tuned model (base model + LoRA adapter)
# `tokenizer` and `dataset` are reused from the earlier cells

# Take a random 0.1% of the dataset as a small evaluation sample.
# If `dataset` is the data the model was trained on, this sample is not truly held out.
valid = dataset.train_test_split(test_size=0.001)["test"]

def compute_perplexity(model, tokenizer, dataset):
    """Return exp(mean per-sample loss) over `dataset`.

    Each sample contributes one mean token loss, so every sample is
    weighted equally regardless of its length.
    """
    model.eval()
    losses = []

    total = len(dataset)
    print(f"🔍 Starting perplexity evaluation on {total} samples...")
    start = time()

    for i, sample in enumerate(dataset):
        text = sample["text"]
        inputs = tokenizer(text, return_tensors="pt").to(model.device)

        with torch.no_grad():
            # Passing input_ids as labels makes the model return the mean next-token loss
            loss = model(**inputs, labels=inputs["input_ids"]).loss
            losses.append(loss.item())

        # 🔹 Print progress every 50 steps
        if (i+1) % 50 == 0:
            elapsed = time() - start
            avg_time = elapsed / (i+1)
            remaining = avg_time * (total - (i+1))
            print(f"   ✔ Processed {i+1}/{total} samples "
                  f"- ETA: {remaining/60:.2f} min")

    print("✨ Finished evaluating all samples!")

    avg_loss = sum(losses) / len(losses)
    ppl = math.exp(avg_loss)
    return ppl

ppl = compute_perplexity(model, tokenizer, valid)
print("📊 Final Perplexity:", ppl)


🔍 Starting perplexity evaluation on 113 samples...
   ✔ Processed 50/113 samples - ETA: 0.79 min
   ✔ Processed 100/113 samples - ETA: 0.13 min
✨ Finished evaluating all samples!
📊 Final Perplexity: 16.925794923961735
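
The number above is the exponential of the average of per-sample mean losses, so a short sample counts as much as a long one. Corpus-level perplexity is more commonly defined per token, as exp(total negative log-likelihood / total predicted tokens). The sketch below contrasts the two on made-up numbers. The function names and the loss values are illustrative assumptions, not results from this notebook.

```python
import math
from typing import List, Tuple

# Each entry: (mean loss for the sample, number of tokens that loss was averaged over).
# For a causal LM with shifted labels, that count is the sequence length minus one.
SampleStat = Tuple[float, int]


def sample_average_ppl(stats: List[SampleStat]) -> float:
    """exp of the unweighted mean of per-sample losses (what the cell above computes)."""
    mean_of_means = sum(loss for loss, _ in stats) / len(stats)
    return math.exp(mean_of_means)


def token_weighted_ppl(stats: List[SampleStat]) -> float:
    """exp of total negative log-likelihood divided by total predicted tokens."""
    total_nll = sum(loss * n_tokens for loss, n_tokens in stats)
    total_tokens = sum(n_tokens for _, n_tokens in stats)
    return math.exp(total_nll / total_tokens)


# Hypothetical example: one short, easy sample and one long, harder sample.
stats = [(2.0, 10), (3.0, 90)]
print("sample-averaged:", sample_average_ppl(stats))
print("token-weighted: ", token_weighted_ppl(stats))
```

The two values agree when all samples have the same length and diverge as lengths vary. To apply this to the evaluation loop, one would also need to record the token count of each sample alongside `loss.item()`.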
