# Llama 2 7B Smoke Test

This notebook loads `meta-llama/Llama-2-7b-chat-hf` (4-bit) and generates a short summary for a single inline example. It is meant for quick environment checks on Google Colab; no external dataset is needed.
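As a rough sanity check on why 4-bit loading suits a quick environment test, the sketch below estimates the memory taken by the weights alone. The helper name `estimate_weight_memory_gib` and the nominal 7e9 parameter count are assumptions made for illustration. The figures ignore activations, the KV cache, quantization metadata, and any layers that stay in higher precision, so real usage will be higher.

```python
def estimate_weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for the model weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


# Nominal "7B" parameter count (assumption; the exact count differs slightly).
N_PARAMS = 7e9

for label, bits in [("fp16", 16), ("4-bit", 4)]:
    gib = estimate_weight_memory_gib(N_PARAMS, bits)
    print(f"{label}: ~{gib:.1f} GiB of weights")
```

Under these assumptions the weights come to roughly 13 GiB in fp16 and a little over 3 GiB in 4-bit, both within the 40 GiB of the A100 shown in the `nvidia-smi` output.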

In [2]:
# @title Install dependencies
%pip install -q transformers==4.41.0 accelerate==0.28.0 bitsandbytes


Thu Dec  4 05:29:32 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off |   00000000:00:04.0 Off |                    0 |
| N/A   28C    P0             43W /  400W |       0MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                

In [6]:
# @title Minimal example + Llama 7B generation
import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "meta-llama/Llama-2-7b-chat-hf"  # @param {"type": "string"}
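# Read the access token from the environment; a missing or empty value becomes None.
HF_TOKEN = os.environ.get("HF_TOKEN") or None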

example = {
    "text": (
        "Brief Hospital Course: The patient was admitted for shortness of breath and chest pain. "
        "Evaluation ruled out a heart attack but confirmed an exacerbation of heart failure and pneumonia. "
        "She received diuretics, oxygen therapy, and IV antibiotics with good response and was discharged on day 5."
    ),
    "summary": (
        "You were admitted for chest pain and shortness of breath caused by pneumonia and worsening heart failure. "
        "We treated you with medicines to remove extra fluid and antibiotics, and you improved enough to go home."
    ),
}

print("=== Reference summary ===")
print(example["summary"])
print("\n=== Model generation ===")

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=False, token=HF_TOKEN)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",
    quantization_config=bnb_config,
    torch_dtype=torch.float16,
    token=HF_TOKEN,
)
model.eval()

prompt = (
    "You are a helpful clinician writing plain-language discharge instructions.\n\n"
    f"Brief Hospital Course:\n{example['text']}\n\n"
    "Patient summary:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.inference_mode():
    outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
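# generate() returns the prompt followed by the continuation, so keep only the text after the last prompt marker.
decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
summary = decoded.split("Patient summary:")[-1].strip()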
print(summary)

=== Reference summary ===
You were admitted for chest pain and shortness of breath caused by pneumonia and worsening heart failure. We treated you with medicines to remove extra fluid and antibiotics, and you improved enough to go home.

=== Model generation ===


OSError: You are trying to access a gated repo.
Make sure to have access to it at https://huggingface.co/meta-llama/Llama-2-7b-chat-hf.
403 Client Error. (Request ID: Root=1-69311ca7-72a8cf6e5d5969f44b40fb74;6f0872c8-c2da-48f5-9829-712d9f94c1e9)

Cannot access gated repo for url https://huggingface.co/meta-llama/Llama-2-7b-chat-hf/resolve/main/config.json.
Access to model meta-llama/Llama-2-7b-chat-hf is restricted and you are not in the authorized list. Visit https://huggingface.co/meta-llama/Llama-2-7b-chat-hf to ask for access.
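
The 403 above means the request was not authorized for the gated repo: either no valid token was sent, or the account behind the token has not been granted access on the model page. A minimal sketch of failing early, before any download is attempted, might look like the following. The helper names `resolve_hf_token` and `require_hf_token` are hypothetical. `HF_TOKEN` is the environment variable that `huggingface_hub` reads by default.

```python
import os
from typing import Mapping, Optional


def resolve_hf_token(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return a non-empty token from HF_TOKEN, or None if it is absent or blank."""
    token = env.get("HF_TOKEN", "").strip()
    return token or None


def require_hf_token(env: Mapping[str, str] = os.environ) -> str:
    """Raise a readable error instead of letting the model download fail with a 403."""
    token = resolve_hf_token(env)
    if token is None:
        raise RuntimeError(
            "HF_TOKEN is not set. Create an access token in your Hugging Face "
            "account settings and request access on the model page first."
        )
    return token
```

In the generation cell this could be used as `HF_TOKEN = require_hf_token()`. A token alone is not sufficient: the account it belongs to must also have been approved at https://huggingface.co/meta-llama/Llama-2-7b-chat-hf, otherwise the same error is expected to recur.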