In [1]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1" huggingface_hub hf_transfer
    !pip install --no-deps unsloth
    !pip install torch transformers peft accelerate bitsandbytes

In [1]:
import unsloth
import torch
from unsloth import FastLanguageModel
from transformers import pipeline

# 1. Define model and adapter names
base_model_name = "unsloth/medgemma-4b-pt"
adapter_name = "huseyincavus/medgemma-4b-guidelines-lora"

# 2. Load the model using Unsloth's FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = base_model_name,
    max_seq_length = 2048,
    dtype = None, # None lets Unsloth auto-detect the dtype (bfloat16 only on GPUs that support it)
    load_in_4bit = True,
)

# 3. Load the LoRA adapter directly
# load_adapter attaches the already-trained adapter weights; get_peft_model would instead create a new, untrained adapter.
model.load_adapter(adapter_name)

# 4. Use the Alpaca prompt template (as used in Unsloth's example notebooks).
# The prompt should match the template used during fine-tuning to get the expected output format.
prompt = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
What are the latest guidelines for treating type 2 diabetes?

### Response:
"""

# 5. Create a text generation pipeline
text_generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

# 6. Run inference
result = text_generator(prompt, max_new_tokens=250, num_return_sequences=1)

print("--- Model Output ---")
print(result[0]['generated_text'])

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.7.8: Fast Gemma3 patching. Transformers: 4.53.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Using float16 precision for gemma3 won't work! Using float32.



Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
Device set to use cuda:0


--- Model Output ---
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
What are the latest guidelines for treating type 2 diabetes?

### Response:
The latest guidelines for treating type 2 diabetes are focused on the need for a comprehensive and individualized approach that includes lifestyle modifications, medication, and potentially insulin therapy.

1. Lifestyle modifications:

- **Weight management:** Losing weight and maintaining a healthy body weight is essential.
- **Physical activity:** Engage in regular physical activity to improve insulin sensitivity and overall health.
- **Nutrition:** Adopt a balanced diet that is rich in fruits, vegetables, whole grains, lean proteins, and low-fat dairy products.

2. Medication:

- **Metformin:** This is the first-line drug for type 2 diabetes. It helps regulate blood sugar levels and improve insulin sensitivity.
- **Other medications:** Depending on the individual's
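
The generated text above begins with the prompt itself, because the `text-generation` pipeline returns the prompt plus the completion unless `return_full_text=False` is passed. It also stops mid-sentence, most likely because it hit the `max_new_tokens=250` limit. As a minimal sketch, the prompt can be built from a template and the response split off afterwards. The helper names `build_alpaca_prompt` and `extract_response` are hypothetical and not part of Unsloth or Transformers.

```python
# Alpaca-style template matching the prompt used in the cell above.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n"
)

RESPONSE_MARKER = "### Response:"


def build_alpaca_prompt(instruction: str) -> str:
    """Fill the template with a single instruction (hypothetical helper)."""
    return ALPACA_TEMPLATE.format(instruction=instruction.strip())


def extract_response(generated_text: str) -> str:
    """Return only the text after the response marker.

    If the marker is missing, the input is returned stripped but otherwise unchanged.
    """
    _, marker, tail = generated_text.partition(RESPONSE_MARKER)
    return tail.strip() if marker else generated_text.strip()
```

With these in place, `extract_response(result[0]['generated_text'])` would print only the model's answer rather than the echoed prompt.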

In [3]:
# The model is already loaded, so we just print the variable
print("--- Verifying Architecture of In-Memory Model ---")
print(model)

--- Verifying Architecture of In-Memory Model ---
Gemma3ForConditionalGeneration(
  (model): Gemma3Model(
    (vision_tower): SiglipVisionModel(
      (vision_model): SiglipVisionTransformer(
        (embeddings): SiglipVisionEmbeddings(
          (patch_embedding): Conv2d(3, 1152, kernel_size=(14, 14), stride=(14, 14), padding=valid)
          (position_embedding): Embedding(4096, 1152)
        )
        (encoder): SiglipEncoder(
          (layers): ModuleList(
            (0-26): 27 x SiglipEncoderLayer(
              (layer_norm1): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
              (self_attn): SiglipAttention(
                (k_proj): lora.Linear4bit(
                  (base_layer): Linear4bit(in_features=1152, out_features=1152, bias=True)
                  (lora_dropout): ModuleDict(
                    (default): Identity()
                  )
                  (lora_A): ModuleDict(
                    (default): Linear(in_features=1152, out_features=16, bias=

# LoRA Model Integration Evidence

The output from printing the model object shows that LoRA layers from our fine-tuned adapter have been injected into the base model.

The key evidence is the presence of `lora.Linear4bit` wrappers around the model's original linear layers. The truncated printout above shows this for `k_proj` in the vision tower's `SiglipAttention`; the attention blocks of the language model follow the same pattern, for example around `q_proj`:

```text
(self_attn): Gemma3Attention(
  (q_proj): lora.Linear4bit(
    (base_layer): Linear4bit(...)
    (lora_A): ModuleDict(...)
    (lora_B): ModuleDict(...)
  )
  ...
)
```

## Architecture Analysis

This structure shows that each original layer (`base_layer`) is now augmented with a pair of low-rank LoRA matrices (`lora_A` and `lora_B`). These hold the weights learned during the fine-tuning process.
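
For reference, the standard LoRA formulation describes what such a wrapped layer computes. Here $W_0$ is the frozen `base_layer` weight, $A$ and $B$ are the `lora_A` and `lora_B` matrices, $r$ is the rank, and $\alpha$ is the LoRA scaling hyperparameter. The value of $\alpha$ for this adapter does not appear in the printout, so it is left symbolic.

```latex
h = W_0 x + \frac{\alpha}{r} \, B A x,
\qquad A \in \mathbb{R}^{r \times d_{\text{in}}},
\quad B \in \mathbb{R}^{d_{\text{out}} \times r}
```

For the `k_proj` layer in the printout, `lora_A` maps 1152 inputs to 16 outputs, so $r = 16$ and $d_{\text{in}} = 1152$. Assuming `lora_B` maps back to the layer's 1152 outputs, the adapter adds $2 \times 16 \times 1152 = 36{,}864$ parameters to a layer with $1152 \times 1152 = 1{,}327{,}104$ base weights, or roughly 2.8%.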

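Their presence confirms that the model is no longer just the base model: its layers will apply these specialized weights during inference. Note that the top-level class in the printout is still `Gemma3ForConditionalGeneration` rather than a **`PeftModel`** wrapper, because `load_adapter` injects the LoRA layers into the existing model in place.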
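
A printout is hard to scan for a model of this size, so a programmatic check can complement it. The sketch below counts the modules that expose both `lora_A` and `lora_B` attributes, grouped by layer name. It assumes only that the object offers `named_modules()` the way `torch.nn.Module` does. The helper name `count_lora_modules` is hypothetical.

```python
from collections import Counter


def count_lora_modules(model) -> dict:
    """Count LoRA-wrapped modules, keyed by the last component of their name.

    A module is treated as LoRA-wrapped if it has both `lora_A` and `lora_B`
    attributes, which is how the wrappers appear in the printout above.
    """
    counts = Counter()
    for name, module in model.named_modules():
        if hasattr(module, "lora_A") and hasattr(module, "lora_B"):
            # "model.layers.0.self_attn.q_proj" -> "q_proj"
            leaf_name = name.rsplit(".", 1)[-1]
            counts[leaf_name] += 1
    return dict(counts)
```

Calling `count_lora_modules(model)` on the in-memory model should return a non-empty dictionary if the adapter was injected. An empty result would suggest that no LoRA layers were attached.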