<a href="https://colab.research.google.com/github/jotang1/CyberSec-Public/blob/master/tutorials/day1/note2_hf_inference.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LLMs in Practice: Inference with Open-Source Models on Hugging Face



## Part 1: Introduction to the HuggingFace Ecosystem

### 1.1 What is HuggingFace?
- The Hub hosts models, datasets, and Spaces with rich model cards and licensing details.
- Transformers provides model architectures, tokenizers, and generation utilities.
- Key building blocks: tokenizers (text -> ids), models (ids -> logits), and generation (sampling/decoding).
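
The three building blocks can be illustrated with a toy pipeline in plain Python. This is only a sketch: `VOCAB`, `encode`, `toy_model`, `greedy_generate`, and `decode` are made-up names for illustration and are not part of the Transformers API, and the "model" is a fixed rule rather than a neural network.

```python
# Toy illustration only: none of these names come from the Transformers library.
VOCAB = ["<eos>", "hello", "world", "!"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}


def encode(text):
    """Tokenizer stage: text -> ids (whitespace split, known words only)."""
    return [TOKEN_TO_ID[tok] for tok in text.split()]


def toy_model(ids):
    """Model stage: ids -> logits over the vocabulary for the next token.

    This fake model simply prefers the vocabulary entry after the last id.
    """
    next_id = (ids[-1] + 1) % len(VOCAB)
    return [5.0 if i == next_id else 0.0 for i in range(len(VOCAB))]


def greedy_generate(ids, max_new_tokens=5):
    """Generation stage: repeatedly pick the highest-scoring next token."""
    ids = list(ids)
    for _ in range(max_new_tokens):
        logits = toy_model(ids)
        next_id = max(range(len(logits)), key=logits.__getitem__)
        ids.append(next_id)
        if VOCAB[next_id] == "<eos>":
            break
    return ids


def decode(ids):
    """Tokenizer stage in reverse: ids -> text."""
    return " ".join(VOCAB[i] for i in ids)


prompt_ids = encode("hello")
output_ids = greedy_generate(prompt_ids)
print(decode(output_ids))
```

Real tokenizers use subword vocabularies and real models compute logits with billions of parameters, but the data flow (text, ids, logits, chosen id, text) has the same shape.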


### 1.2 Environment Setup

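Recommended packages (run this in a notebook cell; `%pip` is an IPython magic, so use plain `pip install` in a regular shell):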

```bash
%pip install -U transformers accelerate bitsandbytes huggingface_hub
```
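
After installing, it can help to confirm which versions were actually picked up. The following is a minimal sketch using only the standard library; `installed_versions` is a hypothetical helper name, and a package that is not installed is reported as `None` rather than raising.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(
    ["transformers", "accelerate", "bitsandbytes", "huggingface_hub"]
).items():
    print(f"{name}: {ver or 'not installed'}")
```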


### 1.3 Authenticate with HuggingFace

### Why Authenticate?

Authentication with HuggingFace is **optional for most public models** but required for:
- **Gated models**: Some models require approval and authentication (e.g., Llama, Gemma)
- **Private models**: Access your own private models or those shared with you
- **Upload capabilities**: Push models, datasets, or files to your HuggingFace account

### How to Get Your Token

Follow these steps to create a HuggingFace access token:

1. **Visit HuggingFace**: Go to [https://huggingface.co/](https://huggingface.co/)

2. **Navigate to Settings**:
   - Click on your profile picture (top right)
   - Select **Settings** from the dropdown menu
   - Click on **Access Tokens** in the left sidebar
   
   <img src="https://github.com/oh-scipe/llm-workshop26/blob/main/tutorials/assets/image.png?raw=1" width="250" alt="HuggingFace Access Tokens menu"/>

3. **Create New Token**:
   - Click the **"Create new token"** button
   - Give your token a descriptive name (e.g., "Tutorial Notebook")
   - Choose token type:
     - **Read**: For downloading models only (recommended for this tutorial)
     - **Write**: For uploading models/datasets (not needed here)
   - Click **"Create token"**
   
   <img src="https://github.com/oh-scipe/llm-workshop26/blob/main/tutorials/assets/image-1.png?raw=1" width="500" alt="Create new token dialog"/>

4. **Copy Your Token**:
   - Copy the generated token immediately (it won't be shown again)
   - Keep it secure - treat it like a password!

### Run the Cell Below

Run the next cell and paste your token when prompted. Alternatively, you can set the `HF_TOKEN` environment variable before starting Jupyter.

In [1]:
# Authenticate with HuggingFace
# This will prompt you to paste your token in a text box
from huggingface_hub import login

login()

# Alternative: Set token via environment variable
# import os
# os.environ['HF_TOKEN'] = 'your_token_here'
# login(token=os.environ['HF_TOKEN'])


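
If you prefer the `HF_TOKEN` route, reading the variable instead of pasting the token into a cell keeps it out of the saved notebook. The sketch below assumes the variable was set before the kernel started; `get_hf_token` and `mask_token` are hypothetical helpers for illustration, not `huggingface_hub` functions.

```python
import os


def get_hf_token(env=None):
    """Return the token from HF_TOKEN, or None if it is unset or blank."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    return token or None


def mask_token(token, visible=4):
    """Hide all but the last few characters so the token is safe to print."""
    if len(token) <= visible:
        return "*" * len(token)
    return "*" * (len(token) - visible) + token[-visible:]


token = get_hf_token()
if token is None:
    print("HF_TOKEN is not set; call login() and paste the token instead.")
else:
    print("Using token from HF_TOKEN:", mask_token(token))
```

The returned value could then be passed as `login(token=token)`; printing only the masked form avoids leaking the secret into cell output.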

In [2]:
# Check environment
import platform
import torch

print("Python:", platform.python_version())
print("Torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    print("Capability:", torch.cuda.get_device_capability(0))
    print("BF16 support:", torch.cuda.is_bf16_supported())

Python: 3.12.12
Torch: 2.9.0+cu126
CUDA available: True
GPU: Tesla T4
Capability: (7, 5)
BF16 support: True
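
The output above reports BF16 support on a Tesla T4 with compute capability (7, 5). Native bfloat16 arithmetic arrived with compute capability 8.0 (Ampere), so the `True` here presumably reflects emulated support rather than fast hardware paths. A hedged rule of thumb for choosing a 16-bit dtype is sketched below; `pick_dtype_name` is a hypothetical helper, and it returns a name rather than a torch dtype so the snippet runs without a GPU.

```python
def pick_dtype_name(capability):
    """Suggest a 16-bit dtype name from a (major, minor) compute capability.

    Assumption: native bfloat16 needs capability 8.0 or newer; older GPUs
    such as the T4 (7, 5) usually run float16 faster.
    """
    major, _minor = capability
    return "bfloat16" if major >= 8 else "float16"


for gpu, cap in {"Tesla T4": (7, 5), "A100": (8, 0), "H100": (9, 0)}.items():
    print(f"{gpu}: {pick_dtype_name(cap)}")
```

In a notebook with a GPU, `getattr(torch, pick_dtype_name(torch.cuda.get_device_capability(0)))` would turn the name into a dtype. float16 has a narrower numeric range than bfloat16, so treat this as a starting point and check output quality.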


## Part 2: General Inference with LLMs

### 2.1 Define Model Constants and Basic Utilities

In [None]:
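# Allow TF32 for float32 matmuls and convolutions. TF32 needs an Ampere-or-newer
# GPU (compute capability 8.0+), so these settings have no effect on a T4.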
import torch

if torch.cuda.is_available():
    torch.backends.cuda.matmul.fp32_precision = "tf32"
    torch.backends.cudnn.conv.fp32_precision = "tf32"

torch.manual_seed(42)

# Model identifiers
MODEL_INSTRUCT = "Qwen/Qwen3-4B-Instruct-2507"
MODEL_THINKING = "Qwen/Qwen3-4B-Thinking-2507"


### 2.2 Load Model and Tokenizer

Load the instruct model directly for basic inference.

In [None]:
# Import transformers components
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = MODEL_INSTRUCT

tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    use_fast=True,
)
# trust_remote_code is omitted: Qwen3 is implemented in Transformers itself,
# so there is no need to execute code from the model repository.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    dtype=torch.bfloat16,
)
model.eval()
if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id
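
Whether a model fits on the GPU is mostly a matter of parameter count times bytes per parameter. The estimate below is a rough sketch: it assumes about 4 billion parameters (taking the "4B" in the model name at face value) and counts weights only, ignoring activations and the KV cache. `weight_memory_gib` is a hypothetical helper.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {
    "float32": 4.0,
    "bfloat16": 2.0,
    "float16": 2.0,
    "int8": 1.0,
    "4bit": 0.5,
}


def weight_memory_gib(num_params, dtype_name):
    """Approximate memory for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype_name] / 1024**3


NUM_PARAMS = 4e9  # assumption: "4B" taken at face value
for dtype_name in BYTES_PER_PARAM:
    print(f"{dtype_name:>8}: {weight_memory_gib(NUM_PARAMS, dtype_name):.2f} GiB")
```

At 16-bit precision the weights alone come to roughly 7.5 GiB, which is why quantized loading (for example through `bitsandbytes`) becomes attractive on smaller GPUs.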


In [None]:
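# Inspect the architecture. print() avoids leaving a reference in IPython's
# output cache (Out[...]), which would keep the weights alive after `del model`.
print(model)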

### 2.3 Basic Chat Inference

Create helper functions for formatting messages and generating responses.

In [None]:
def format_messages(user_prompt, system_prompt="You are a helpful assistant."):
    """Format messages for chat models."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]


def generate_chat(
    model,
    tokenizer,
    messages,
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
    **kwargs,
):
    """Generate a chat response."""
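    # return_dict=True yields input_ids plus attention_mask, so generate()
    # receives an explicit mask instead of having to infer one.
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True,
    ).to(model.device)
    input_ids = inputs["input_ids"]

    gen_kwargs = dict(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=do_sample,
        pad_token_id=tokenizer.eos_token_id,
        **kwargs,
    )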
    if do_sample:
        gen_kwargs.update({"temperature": temperature, "top_p": top_p})

    with torch.inference_mode():
        output_ids = model.generate(**gen_kwargs)

    gen_ids = output_ids[0, input_ids.shape[-1] :]
    return tokenizer.decode(gen_ids, skip_special_tokens=True)
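
To see roughly what `apply_chat_template` does before tokenization, the sketch below renders messages in the ChatML layout that Qwen models use. It is a simplified approximation, not the tokenizer's real template: the actual template ships with the tokenizer and handles more cases (tools, and extra tags for thinking variants). `render_chatml` is a hypothetical helper.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Approximate ChatML rendering of a list of {"role", "content"} dicts."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


example = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Hi!"},
]
print(render_chatml(example))
```

To inspect the real rendering, call `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` and print the resulting string.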


In [None]:
# Try a basic chat completion
messages = format_messages(
    "Summarize the HuggingFace Hub in 2 sentences.",
    system_prompt="You are a concise assistant.",
)

print(generate_chat(model, tokenizer, messages, max_new_tokens=120))


### 2.4 Streaming Responses

For longer responses, streaming prints text as it is generated instead of making the user wait for the full completion.

In [None]:
# Import streaming utilities
from threading import Thread
from transformers import TextIteratorStreamer


def stream_chat(
    model,
    tokenizer,
    messages,
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.9,
):
    """Stream chat responses token by token."""
    # return_dict=True yields input_ids plus attention_mask for generate()
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True,
    ).to(model.device)

    # skip_prompt=True streams only newly generated text, not the echoed prompt
    streamer = TextIteratorStreamer(
        tokenizer, skip_prompt=True, skip_special_tokens=True
    )
    generation_kwargs = dict(
        **inputs,
        streamer=streamer,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=temperature,
        top_p=top_p,
        pad_token_id=tokenizer.eos_token_id,
    )

    def _run_generate():
        model.eval()
        # inference_mode is thread-local, so enter it inside the worker thread
        with torch.inference_mode():
            model.generate(**generation_kwargs)

    thread = Thread(target=_run_generate, daemon=True)
    thread.start()

    try:
        for text in streamer:
            print(text, end="", flush=True)
    finally:
        thread.join()

        # Explicitly drop local refs that may include CUDA tensors
        del inputs, generation_kwargs, streamer, thread
        import gc
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
            torch.cuda.synchronize()
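
`stream_chat` relies on a producer/consumer pattern: generation runs in a background thread while the main thread prints text as it arrives. The sketch below mimics that pattern with the standard library only. `fake_generate` is a stand-in for `model.generate()`, and the code illustrates the idea rather than reproducing how `TextIteratorStreamer` is implemented.

```python
import queue
import threading
import time

_DONE = object()  # sentinel marking the end of the stream


def fake_generate(tokens, out_queue, delay=0.01):
    """Producer: stands in for model.generate() emitting decoded text chunks."""
    for tok in tokens:
        time.sleep(delay)  # simulate per-token compute
        out_queue.put(tok)
    out_queue.put(_DONE)


def stream(tokens):
    """Consumer: yield chunks as soon as the producer thread emits them."""
    out_queue = queue.Queue()
    worker = threading.Thread(target=fake_generate, args=(tokens, out_queue))
    worker.start()
    try:
        while True:
            item = out_queue.get()  # blocks until the next chunk is ready
            if item is _DONE:
                break
            yield item
    finally:
        worker.join()


for chunk in stream(["GPUs ", "hum ", "in ", "rows."]):
    print(chunk, end="", flush=True)
print()
```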


In [None]:
messages = format_messages("Write a short poem about GPUs and data centers.")
stream_chat(model, tokenizer, messages, max_new_tokens=120, temperature=0.8, top_p=0.95)

### 2.5 Practical Tasks

Test the model on various common LLM tasks.

In [None]:
# Run various practical tasks
practical_tasks = {
    "translation": "Translate to Spanish: The model scales efficiently on modern GPUs.",
    "summarization": (
        "Summarize in 3 bullets: HuggingFace provides open-source NLP libraries, a model hub, "
        "and tools for training and deployment across research and production."
    ),
    "general_qa": "Q: What is the main purpose of the Transformers library?",
    "planning": "Plan a 4-step rollout for an internal LLM pilot at a company.",
    "code_explanation": "Explain what this Python does: for i in range(3): print(i*i)",
    "code_generation": "Write a Python function that checks if a string is a palindrome.",
}

for name, prompt in practical_tasks.items():
    messages = format_messages(prompt)
    output = generate_chat(model, tokenizer, messages, max_new_tokens=200)
    print(f"=== {name} ===")
    print(output)
    print()


### 2.6 Experiment: Compare Generation Parameters

See how temperature and sampling affect outputs.

In [None]:
# Compare different generation settings
prompt = "Write a two-sentence pitch for a collaborative AI research lab."

settings = [
    {"name": "deterministic", "do_sample": False},
    {"name": "balanced", "do_sample": True, "temperature": 0.7, "top_p": 0.9},
    {"name": "creative", "do_sample": True, "temperature": 1.1, "top_p": 0.95},
]

for cfg in settings:
    messages = format_messages(prompt, system_prompt="You are a marketing copywriter.")
    output = generate_chat(
        model,
        tokenizer,
        messages,
        max_new_tokens=120,
        **{k: v for k, v in cfg.items() if k != "name"},
    )
    print(f"=== {cfg['name']} ===")
    print(output)
    print()
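
The effect of `temperature` and `top_p` can be seen on a tiny hand-made distribution. The following is a simplified sketch in plain Python, not the code Transformers runs internally; `softmax_with_temperature` and `top_p_filter` are hypothetical helpers and the logits are invented for illustration.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperature sharpens the peak."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of most-likely tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Renormalize the surviving probabilities; everything else becomes 0.
    return [probs[i] / mass if i in kept else 0.0 for i in range(len(probs))]


logits = [2.0, 1.0, 0.5, -1.0]
for t in (0.7, 1.1):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}:", [round(p, 3) for p in top_p_filter(probs, 0.9)])
```

Lower temperatures concentrate probability on the top token, while `top_p` removes the unlikely tail before sampling. With `do_sample=False` neither applies, because the most likely token is always chosen.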


### 2.7 Baseline Tests for Model Comparison

Run these prompts with greedy decoding and keep the outputs as a baseline for comparison with the thinking model later.

In [None]:
# Run comparison prompts on non-thinking model
comparison_prompts = {
    "math": "Solve: If a train travels 120 km in 1.5 hours, what is its average speed?",
    "reasoning": "You have 3 boxes: apples, oranges, and mixed. All labels are wrong. "
    "Pick one fruit to identify all boxes. Explain.",
    "analysis": "Compare pros and cons of deploying an LLM on-prem vs in the cloud.",
}

non_thinking_results = {}
for name, prompt in comparison_prompts.items():
    messages = format_messages(prompt, system_prompt="You are a precise assistant.")
    non_thinking_results[name] = generate_chat(
        model,
        tokenizer,
        messages,
        max_new_tokens=200,
        do_sample=False,
    )

print("Baseline results saved for comparison with thinking model.")
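
When judging model outputs it helps to have reference answers for the two prompts that have a definite solution. The sketch below computes them by brute force. It assumes the classic reading of the box puzzle (draw one fruit from the box labelled "mixed"); all names are hypothetical helpers.

```python
from itertools import permutations


def average_speed_kmh(distance_km, hours):
    """Reference answer for the "math" prompt: speed = distance / time."""
    return distance_km / hours


KINDS = ("apples", "oranges", "mixed")


def consistent_worlds():
    """All label -> contents assignments in which every label is wrong."""
    return [
        dict(zip(KINDS, contents))
        for contents in permutations(KINDS)
        if all(label != actual for label, actual in zip(KINDS, contents))
    ]


def solve(fruit_drawn):
    """Draw from the box labelled 'mixed'; return the only consistent world."""
    kind = "apples" if fruit_drawn == "apple" else "oranges"
    matches = [w for w in consistent_worlds() if w["mixed"] == kind]
    assert len(matches) == 1  # one draw pins down every box
    return matches[0]


print(average_speed_kmh(120, 1.5), "km/h")
print(solve("apple"))  # maps each label to the box's actual contents
```

Only two assignments are consistent with "all labels are wrong", and they differ in what the box labelled "mixed" contains, which is why a single draw from that box settles the puzzle.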

## Part 3: Advanced Reasoning with Thinking Models

Thinking models generate their intermediate reasoning before the final answer, which tends to help on multi-step tasks. Let's switch to the thinking version and rerun the same prompts.

### 3.1 Unload Current Model and Load Thinking Model

In [None]:
# Free up memory before loading the thinking model
import gc

def print_cuda_memory(tag):
    if torch.cuda.is_available():
        free, total = torch.cuda.mem_get_info()
        used = total - free
        print(f"[{tag}] CUDA memory:")
        print(f"  Used : {used / 1024**3:.2f} GB")
        print(f"  Free : {free / 1024**3:.2f} GB")
        print(f"  Total: {total / 1024**3:.2f} GB")
    else:
        print("CUDA not available.")

# Before cleanup
print_cuda_memory("Before cleanup")

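# Cleanup: drop the Python references so the CUDA memory can be released
# (copying the weights to CPU first is unnecessary and costs host RAM)
del model, tokenizer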
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()

print("Memory cleared.")

# After cleanup
print_cuda_memory("After cleanup")

In [None]:
# Load the thinking model

model_id = MODEL_THINKING

tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    use_fast=True,
)
# trust_remote_code is omitted: Qwen3 is implemented in Transformers itself,
# so there is no need to execute code from the model repository.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    dtype=torch.bfloat16,
)
model.eval()
if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id


In [None]:
# print() avoids leaving a reference in IPython's output cache (Out[...])
print(model)


### 3.2 Run Same Tests with Thinking Model

In [None]:

# Run same prompts with thinking model
thinking_results = {}
for name, prompt in comparison_prompts.items():
    messages = format_messages(prompt, system_prompt="You are a reasoning assistant.")
    thinking_results[name] = generate_chat(
        model,
        tokenizer,
        messages,
        max_new_tokens=500,
        do_sample=False,
    )

print("Thinking model results collected.")


### 3.3 Compare Results Side-by-Side

In [None]:
# Display comparison
for name in comparison_prompts:
    print(f"=== {name} ===")
    print("Non-thinking:")
    print(non_thinking_results[name])
    print("\nThinking:")
    print(thinking_results[name])
    print("\n" + "="*50 + "\n")
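
The thinking model's output mixes reasoning and the final answer, which makes the side-by-side view harder to read. The sketch below separates the two, assuming the decoded text marks the end of the reasoning with a literal `</think>` tag (the convention Qwen3 thinking models are documented to use; check your actual outputs). `split_thinking` is a hypothetical helper and the sample string is invented.

```python
def split_thinking(text, close_tag="</think>"):
    """Split decoded output into (reasoning, answer).

    Assumption: the reasoning ends at the last `close_tag`. If the tag is
    missing (for example, generation hit max_new_tokens mid-reasoning),
    the whole text is treated as reasoning and the answer is empty.
    """
    head, sep, tail = text.rpartition(close_tag)
    if not sep:
        return text.strip(), ""
    # Drop an opening tag if the output happens to include one.
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()


sample = "<think>\n120 / 1.5 = 80\n</think>\n\nThe average speed is 80 km/h."
reasoning, answer = split_thinking(sample)
print("Reasoning:", reasoning)
print("Answer:", answer)
```

An empty answer is a useful signal that the token budget ran out during reasoning, in which case raising `max_new_tokens` for the thinking model may give a fairer comparison.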
