In [None]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1" huggingface_hub hf_transfer
    !pip install --no-deps unsloth
    !pip install torch transformers peft accelerate bitsandbytes

In [None]:
import unsloth
import torch
from unsloth import FastLanguageModel
from transformers import pipeline

# 1. Define model and adapter names
base_model_name = "unsloth/medgemma-4b-pt"
adapter_name = "huseyincavus/medgemma-4b-guidelines-lora"

# 2. Load the model using Unsloth's FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = base_model_name,
    max_seq_length = 2048,
    dtype = None, # None auto-detects: bfloat16 on Ampere+ GPUs, float16 on older GPUs (e.g. T4, V100)
    load_in_4bit = True,
)

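# 3. Load the LoRA adapter directly
# load_adapter attaches the saved adapter weights to the base model for inference.
# get_peft_model is meant for creating a new adapter to train, so it is not needed here.
model.load_adapter(adapter_name)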

# 4. Use the Alpaca prompt template, as used in Unsloth's notebooks.
# The prompt should match the format used during fine-tuning, otherwise the output format may degrade.
prompt = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
What are the latest guidelines for treating type 2 diabetes?

### Response:
"""

# 5. Create a text generation pipeline
text_generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

# 6. Run inference
result = text_generator(prompt, max_new_tokens=250, num_return_sequences=1)

print("--- Model Output ---")
print(result[0]['generated_text'])

In [None]:
# The model is already loaded, so we just print the variable
print("--- Verifying Architecture of In-Memory Model ---")
print(model)

# LoRA Model Integration Evidence

Printing the model object shows that the fine-tuned LoRA adapter has been attached to the base model.

The key evidence is the presence of `lora.Linear4bit` wrappers around the model's original layers, such as `q_proj` and `v_proj` in the attention blocks.

```python
(self_attn): Gemma3Attention(
  (q_proj): lora.Linear4bit(
    (base_layer): Linear4bit(...)
    (lora_A): ModuleDict(...)
    (lora_B): ModuleDict(...)
  )
  ...
)
```
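Reading the printout by eye works for one layer, but the same check can be done programmatically. The following is a minimal sketch, not part of the original notebook: `find_lora_targets` is a hypothetical helper, and the example names are made up to resemble what `model.named_modules()` yields once an adapter is attached.

```python
def find_lora_targets(module_names):
    """Return the sorted names of modules that own a `lora_A` child."""
    targets = set()
    for name in module_names:
        parts = name.split(".")
        if "lora_A" in parts:
            # Everything before `lora_A` is the wrapped layer, e.g. `...q_proj`.
            targets.add(".".join(parts[: parts.index("lora_A")]))
    return sorted(targets)


# Hypothetical names, shaped like the output of `model.named_modules()`.
example_names = [
    "layers.0.self_attn.q_proj",
    "layers.0.self_attn.q_proj.base_layer",
    "layers.0.self_attn.q_proj.lora_A",
    "layers.0.self_attn.q_proj.lora_A.default",
    "layers.0.self_attn.q_proj.lora_B.default",
    "layers.0.self_attn.k_norm",
]
lora_targets = find_lora_targets(example_names)
print(lora_targets)
```

With the in-memory model, the helper could be called as `find_lora_targets(name for name, _ in model.named_modules())`. An empty result would suggest that no adapter is attached.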

## Architecture Analysis

This structure shows that the original layer (`base_layer`) is now augmented with two low-rank LoRA matrices (`lora_A` and `lora_B`). These hold the weights learned during fine-tuning, and their scaled product is added to the output of the frozen `base_layer`.

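Their presence shows that the model is no longer just the base model but a **PEFT-adapted model** that will use these specialized weights during inference. With `load_adapter`, the adapter layers are injected into the existing model object, so the top-level class in the printout may still be the base model class rather than `PeftModel`.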
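To make the role of `lora_A` and `lora_B` concrete, here is a dependency-free sketch of the standard LoRA forward pass, `W x + (alpha / r) * B (A x)`. It is a toy illustration under stated assumptions: the matrices, sizes, and `alpha` value are invented, there is no bias or dropout, and it does not reproduce the 4-bit quantized kernels the real `lora.Linear4bit` layer uses.

```python
# Toy illustration of a LoRA-augmented linear layer (plain Python, no bias).
# All sizes and values are made up for illustration.

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(x, base_weight, lora_a, lora_b, alpha):
    """Return base_weight @ x + (alpha / r) * lora_b @ (lora_a @ x)."""
    rank = len(lora_a)                            # r = number of rows in A
    scaling = alpha / rank
    base_out = matvec(base_weight, x)             # frozen base_layer output
    low_rank = matvec(lora_b, matvec(lora_a, x))  # B(Ax), the adapter branch
    return [b + scaling * d for b, d in zip(base_out, low_rank)]


x = [1.0, 2.0]
base_weight = [[1.0, 0.0], [0.0, 1.0]]  # 2x2 identity as a stand-in base layer
lora_a = [[1.0, 1.0]]                   # rank r = 1, shape (1, 2)
lora_b = [[0.5], [-0.5]]                # shape (2, 1)

adapted = lora_forward(x, base_weight, lora_a, lora_b, alpha=2.0)
# With B set to zero the adapter branch vanishes and only the base output remains.
zero_b = lora_forward(x, base_weight, lora_a, [[0.0], [0.0]], alpha=2.0)
print(adapted, zero_b)
```

The second call shows why the presence of the modules alone is only structural evidence: if `lora_B` held zeros, the layer would behave exactly like the base layer.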