# LLMs in Practice: Inference with Open-Source Models on Hugging Face



## Part 1: Introduction to the Hugging Face Ecosystem

### 1.1 What is Hugging Face?
- The Hub hosts models, datasets, and Spaces with rich model cards and licensing details.
- Transformers provides model architectures, tokenizers, and generation utilities.
- Key building blocks: tokenizers (text -> ids), models (ids -> logits), and generation (sampling/decoding).
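
To make the three building blocks concrete, here is a toy sketch in plain Python. The vocabulary, the `toy_logits` scoring rule, and all function names are invented for illustration; in practice the tokenizer and model come from Transformers.

```python
# Toy illustration of tokenizer -> model -> generation. Nothing here is a real
# Hugging Face API; the vocabulary and scoring rule are invented for the sketch.
VOCAB = ["<eos>", "hello", "world", "!"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}


def encode(text):
    """Tokenizer: text -> ids (whitespace split over a fixed vocabulary)."""
    return [TOKEN_TO_ID[tok] for tok in text.split()]


def decode(ids):
    """Tokenizer: ids -> text."""
    return " ".join(VOCAB[i] for i in ids)


def toy_logits(ids):
    """Model: ids -> one score per vocabulary entry for the next token.

    This stand-in simply favors the id that follows the last one, wrapping
    around to <eos> at the end of the vocabulary.
    """
    favored = (ids[-1] + 1) % len(VOCAB)
    return [1.0 if i == favored else 0.0 for i in range(len(VOCAB))]


def greedy_generate(prompt, max_new_tokens=8):
    """Generation: repeatedly pick the highest-scoring next id until <eos>."""
    ids = encode(prompt)
    for _ in range(max_new_tokens):
        logits = toy_logits(ids)
        next_id = max(range(len(logits)), key=logits.__getitem__)
        if next_id == TOKEN_TO_ID["<eos>"]:
            break
        ids.append(next_id)
    return decode(ids)


print(greedy_generate("hello"))  # hello world !
```

Sampling-based decoding replaces the `max(...)` step with a random draw from the score distribution.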


### 1.2 Environment Setup

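Recommended packages (the `%pip` magic runs inside a notebook cell; in a terminal, use `pip install` without the leading `%`):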

```bash
%pip install -U transformers accelerate bitsandbytes huggingface_hub
```
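
After installing, it can help to confirm which versions are actually present in the active environment. The following is a small standard-library sketch; the helper name `installed_versions` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

PACKAGES = ["transformers", "accelerate", "bitsandbytes", "huggingface_hub"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(PACKAGES).items():
    print(f"{name}: {ver or 'not installed'}")
```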


### 1.3 Authenticate with Hugging Face

#### Why Authenticate?

Authentication with Hugging Face is **optional for most public models** but required for:
- **Gated models**: Some models require approval and authentication (e.g., Llama, Gemma)
- **Private models**: Access your own private models or those shared with you
- **Upload capabilities**: Push models, datasets, or files to your HuggingFace account

#### How to Get Your Token

Follow these steps to create a HuggingFace access token:

1. **Visit HuggingFace**: Go to [https://huggingface.co/](https://huggingface.co/)

2. **Navigate to Settings**: 
   - Click on your profile picture (top right)
   - Select **Settings** from the dropdown menu
   - Click on **Access Tokens** in the left sidebar
   
   <img src="../assets/image.png" width="250" alt="HuggingFace Access Tokens menu"/>

3. **Create New Token**:
   - Click the **"Create new token"** button
   - Give your token a descriptive name (e.g., "Tutorial Notebook")
   - Choose token type:
     - **Read**: For downloading models only (recommended for this tutorial)
     - **Write**: For uploading models/datasets (not needed here)
   - Click **"Create token"**
   
   <img src="../assets/image-1.png" width="500" alt="Create new token dialog"/>

4. **Copy Your Token**: 
   - Copy the generated token immediately (it won't be shown again)
   - Keep it secure - treat it like a password!

#### Run the Cell Below

Run the next cell and paste your token when prompted. Alternatively, you can set the `HF_TOKEN` environment variable before starting Jupyter.
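
For non-interactive runs, one option is to read the token from the environment and fall back to the interactive prompt only when it is missing. This is a minimal sketch; `resolve_token` is a hypothetical helper, not part of `huggingface_hub`.

```python
import os


def resolve_token(env=None):
    """Return the token from HF_TOKEN, or None if it is unset or blank."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    return token or None


token = resolve_token()
if token is None:
    print("HF_TOKEN is not set; call login() to paste a token interactively.")
else:
    # Never print the token itself; only report that one was found.
    print(f"Found a token of length {len(token)} in HF_TOKEN.")
```

When a token is found, it can be passed on as `login(token=token)`.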

In [1]:
# Authenticate with HuggingFace
# This will prompt you to paste your token in a text box
from huggingface_hub import login

login()

# Alternative: Set token via environment variable
# import os
# os.environ['HF_TOKEN'] = 'your_token_here'
# login(token=os.environ['HF_TOKEN'])



In [2]:
# Check environment
import platform
import torch

print("Python:", platform.python_version())
print("Torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    print("Capability:", torch.cuda.get_device_capability(0))
    print("BF16 support:", torch.cuda.is_bf16_supported())

Python: 3.13.11
Torch: 2.9.0+cu128
CUDA available: True
GPU: NVIDIA H200
Capability: (9, 0)
BF16 support: True


## Part 2: General Inference with LLMs

### 2.1 Define Model Constants and Basic Utilities

In [3]:
# Allow TF32 for float32 matmuls and cuDNN convolutions (faster on Ampere or newer GPUs, slightly lower precision)
import torch

if torch.cuda.is_available():
    torch.backends.cuda.matmul.fp32_precision = "tf32"
    torch.backends.cudnn.conv.fp32_precision = "tf32"

torch.manual_seed(42)

# Model identifiers
MODEL_INSTRUCT = "Qwen/Qwen3-4B-Instruct-2507"
MODEL_THINKING = "Qwen/Qwen3-4B-Thinking-2507"


### 2.2 Load Model and Tokenizer

Load the instruct model directly for basic inference.

In [4]:
# Import transformers components
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = MODEL_INSTRUCT

tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    use_fast=True,
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    dtype=torch.bfloat16,
    trust_remote_code=True,
)
model.eval()
if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id



In [5]:
model

Qwen3ForCausalLM(
  (model): Qwen3Model(
    (embed_tokens): Embedding(151936, 2560)
    (layers): ModuleList(
      (0-35): 36 x Qwen3DecoderLayer(
        (self_attn): Qwen3Attention(
          (q_proj): Linear(in_features=2560, out_features=4096, bias=False)
          (k_proj): Linear(in_features=2560, out_features=1024, bias=False)
          (v_proj): Linear(in_features=2560, out_features=1024, bias=False)
          (o_proj): Linear(in_features=4096, out_features=2560, bias=False)
          (q_norm): Qwen3RMSNorm((128,), eps=1e-06)
          (k_norm): Qwen3RMSNorm((128,), eps=1e-06)
        )
        (mlp): Qwen3MLP(
          (gate_proj): Linear(in_features=2560, out_features=9728, bias=False)
          (up_proj): Linear(in_features=2560, out_features=9728, bias=False)
          (down_proj): Linear(in_features=9728, out_features=2560, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): Qwen3RMSNorm((2560,), eps=1e-06)
        (post_attention_layer
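
The printed shapes are enough for a rough size estimate. The sketch below counts only the embedding and linear weights visible in the output above; the small norm vectors and anything in the truncated part of the printout are ignored, so treat the total as approximate.

```python
# Shapes copied from the printed module tree above.
VOCAB_SIZE, HIDDEN = 151936, 2560
NUM_LAYERS = 36
Q_OUT, KV_OUT = 4096, 1024      # q_proj and k_proj/v_proj output widths
MLP_WIDTH = 9728
HEAD_DIM = 128                  # from q_norm / k_norm

embedding = VOCAB_SIZE * HIDDEN
attention = HIDDEN * Q_OUT + 2 * HIDDEN * KV_OUT + Q_OUT * HIDDEN
mlp = 3 * HIDDEN * MLP_WIDTH    # gate_proj, up_proj, down_proj
visible_params = embedding + NUM_LAYERS * (attention + mlp)

print(f"query heads: {Q_OUT // HEAD_DIM}, key/value heads: {KV_OUT // HEAD_DIM}")
print(f"visible weights: {visible_params / 1e9:.2f}B parameters")
print(f"approx. size in bfloat16: {visible_params * 2 / 1e9:.1f} GB")
```

The projection widths imply 32 query heads sharing 8 key/value heads (grouped-query attention), and the weight total of roughly 4 billion parameters at 2 bytes each comes to about 8 GB in bfloat16.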

### 2.3 Basic Chat Inference

Create helper functions for formatting messages and generating responses.

In [None]:
def format_messages(user_prompt, system_prompt="You are a helpful assistant."):
    """Format messages for chat models."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]


def generate_chat(
    model,
    tokenizer,
    messages,
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
    **kwargs,
):
    """Generate a chat response."""
    input_ids = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)

    gen_kwargs = dict(
        input_ids=input_ids,
        max_new_tokens=max_new_tokens,
        do_sample=do_sample,
        pad_token_id=tokenizer.eos_token_id,
        **kwargs,
    )
    if do_sample:
        gen_kwargs.update({"temperature": temperature, "top_p": top_p})

    with torch.inference_mode():
        output_ids = model.generate(**gen_kwargs)

    gen_ids = output_ids[0, input_ids.shape[-1] :]
    return tokenizer.decode(gen_ids, skip_special_tokens=True)
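
`apply_chat_template` turns the message list into a single prompt before tokenizing it. As a rough picture of what that prompt looks like, the sketch below renders messages in the ChatML layout used by Qwen chat models. It is a simplified approximation written for this tutorial, not the tokenizer's actual template, and `render_chatml` is a hypothetical helper.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Approximate ChatML rendering of a list of {"role", "content"} dicts."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


example = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hi!"},
]
print(render_chatml(example))
```

To inspect the real rendered prompt, call `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` and print the result.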


In [7]:
# Try a basic chat completion
messages = format_messages(
    "Summarize the HuggingFace Hub in 2 sentences.",
    system_prompt="You are a concise assistant.",
)

print(generate_chat(model, tokenizer, messages, max_new_tokens=120))


The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


The HuggingFace Hub is a platform that hosts a vast collection of pre-trained models, datasets, and related resources for natural language processing and machine learning. It enables researchers and developers to share, discover, and use models easily through an open, community-driven ecosystem.
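
The warning printed before the response comes from calling `generate` with `input_ids` only: because the pad token is the same as the EOS token, the library cannot infer on its own which positions are padding. For a single unpadded prompt the result is not affected, but batched prompts need an explicit mask. The pure-Python sketch below shows what such a mask contains for a left-padded batch; `pad_batch` is a hypothetical helper and the ids are made up.

```python
def pad_batch(sequences, pad_id):
    """Left-pad id lists to equal length and build the matching attention mask.

    Mask entries are 1 for real tokens and 0 for padding. Decoder-only models
    are usually padded on the left so generation continues from real tokens.
    """
    width = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = width - len(seq)
        input_ids.append([pad_id] * n_pad + list(seq))
        attention_mask.append([0] * n_pad + [1] * len(seq))
    return input_ids, attention_mask


ids, mask = pad_batch([[11, 12, 13], [21]], pad_id=0)
print(ids)   # [[11, 12, 13], [0, 0, 21]]
print(mask)  # [[1, 1, 1], [0, 0, 1]]
```

With Transformers, one way to obtain the mask is to pass `return_dict=True` to `apply_chat_template` and forward the returned `attention_mask` to `model.generate` alongside `input_ids`.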


### 2.4 Streaming Responses

For longer responses, streaming prints text as it is generated, so the reader sees output immediately instead of waiting for the full completion.

In [None]:
# Import streaming utilities
from threading import Thread
from transformers import TextIteratorStreamer


def stream_chat(
    model,
    tokenizer,
    messages,
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.9,
):
    """Stream chat responses token by token."""
    input_ids = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)

    # skip_prompt=True keeps the echoed prompt out of the streamed text
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    generation_kwargs = dict(
        input_ids=input_ids,
        streamer=streamer,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=temperature,
        top_p=top_p,
        pad_token_id=tokenizer.eos_token_id,
    )

    thread = Thread(target=model.generate, kwargs=generation_kwargs)
    thread.start()
    for text in streamer:
        print(text, end="", flush=True)
    thread.join()
    print()


In [9]:
messages = format_messages("Write a short poem about GPUs and data centers.")
stream_chat(model, tokenizer, messages, max_new_tokens=120, temperature=0.8, top_p=0.95)

In silent halls, where wires hum and flow,  
Lies the GPU, sharp as a thought's first glow.  
It crunches data, vast and deep,  
With thousands of cores, a thunderous leap.  

In data centers, rows stretch wide and long,  
Where silence speaks through circuits, strong.  
Each card computes, a thousand tasks,  
From images to AI's vast, hidden pass.  

No human eye can trace the flight—  
Of numbers dancing in the night.  
Yet in the glow of servers' light,  
A world of knowledge takes flight.
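
`stream_chat` runs `model.generate` in a background thread while the main thread iterates over the streamer. The same producer/consumer pattern can be sketched with only the standard library; `fake_generate` stands in for the model and is not part of any library.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def fake_generate(chunks, out_queue):
    """Producer: push text chunks as they become available, then the sentinel."""
    for chunk in chunks:
        out_queue.put(chunk)
    out_queue.put(_DONE)


def stream(chunks):
    """Consumer: yield chunks while the producer thread is still running."""
    out_queue = queue.Queue()
    worker = threading.Thread(target=fake_generate, args=(chunks, out_queue))
    worker.start()
    while True:
        item = out_queue.get()  # blocks until the producer delivers something
        if item is _DONE:
            break
        yield item
    worker.join()


for piece in stream(["In ", "silent ", "halls"]):
    print(piece, end="", flush=True)
print()
```

Running generation in a separate thread is what lets the consuming loop print each piece as soon as it arrives rather than after the whole response is finished.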


### 2.5 Practical Tasks

Test the model on various common LLM tasks.

In [10]:
# Run various practical tasks
practical_tasks = {
    "translation": "Translate to Spanish: The model scales efficiently on modern GPUs.",
    "summarization": (
        "Summarize in 3 bullets: HuggingFace provides open-source NLP libraries, a model hub, "
        "and tools for training and deployment across research and production."
    ),
    "general_qa": "Q: What is the main purpose of the Transformers library?",
    "planning": "Plan a 4-step rollout for an internal LLM pilot at a company.",
    "code_explanation": "Explain what this Python does: for i in range(3): print(i*i)",
    "code_generation": "Write a Python function that checks if a string is a palindrome.",
}

for name, prompt in practical_tasks.items():
    messages = format_messages(prompt)
    output = generate_chat(model, tokenizer, messages, max_new_tokens=200)
    print(f"=== {name} ===")
    print(output)
    print()


=== translation ===
El modelo se escala eficientemente en GPUs modernas.

=== summarization ===
- Hugging Face offers open-source natural language processing (NLP) libraries, such as Transformers, enabling easy access to state-of-the-art NLP models.  
- It hosts a vast model hub with pre-trained models across various tasks, facilitating rapid experimentation and deployment.  
- Provides comprehensive tools for model training, fine-tuning, and production deployment, supporting both research and real-world applications.

=== general_qa ===
The main purpose of the Transformers library is to provide pre-trained models and tools for working with **transformer-based models** in natural language processing (NLP) and other AI tasks. It simplifies the process of using state-of-the-art models like BERT, GPT, T5, and others by offering:

- Pre-trained models that can be fine-tuned for specific tasks (e.g., text classification, translation, summarization).
- Easy-to-use APIs for encoding text, gen

### 2.6 Experiment: Compare Generation Parameters

See how temperature and sampling affect outputs.

In [11]:
# Compare different generation settings
prompt = "Write a two-sentence pitch for a collaborative AI research lab."

settings = [
    {"name": "deterministic", "do_sample": False},
    {"name": "balanced", "do_sample": True, "temperature": 0.7, "top_p": 0.9},
    {"name": "creative", "do_sample": True, "temperature": 1.1, "top_p": 0.95},
]

for cfg in settings:
    messages = format_messages(prompt, system_prompt="You are a marketing copywriter.")
    output = generate_chat(
        model,
        tokenizer,
        messages,
        max_new_tokens=120,
        **{k: v for k, v in cfg.items() if k != "name"},
    )
    print(f"=== {cfg['name']} ===")
    print(output)
    print()


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


=== deterministic ===
Imagine a global network of brilliant minds and cutting-edge AI working hand-in-hand to solve humanity’s most pressing challenges—from climate resilience to equitable access to healthcare. Together, we’re not just building smarter AI—we’re shaping a future where innovation serves everyone.

=== balanced ===
Imagine a global community of scientists, engineers, and thinkers united by a shared mission to unlock the next frontier of artificial intelligence—where open collaboration fuels breakthroughs no single mind could achieve alone. Together, we’re building not just smarter AI, but a more transparent, ethical, and human-centered future.

=== creative ===
Imagine a future where groundbreaking AI innovations emerge from the synergy of top minds, cutting-edge computing, and open collaboration. Our AI research lab brings together scientists, engineers, and visionaries to tackle the biggest challenges—building intelligent systems that are not just powerful, but ethical,
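
To see what the two sampling knobs do, here is a small pure-Python sketch of temperature scaling followed by top-p (nucleus) filtering on a made-up next-token distribution. The logits and function names are illustrative, and library implementations may differ in details such as tie handling.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperature sharpens the result."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Renormalize over the kept tokens; all other tokens get probability 0.
    return {i: probs[i] / mass for i in kept}


logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for four candidate tokens
for temperature in (0.7, 1.1):
    probs = softmax(logits, temperature)
    nucleus = top_p_filter(probs, top_p=0.9)
    print(temperature, [round(p, 3) for p in probs], sorted(nucleus))
```

With `do_sample=False`, none of this applies: the highest-scoring token is chosen at every step. The warning about ignored flags in the run above most likely appears because the checkpoint's default generation config carries sampling parameters that greedy decoding does not use.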

### 2.7 Baseline Tests for Model Comparison

Run these tests to compare with thinking models later.

In [12]:
# Run comparison prompts on non-thinking model
comparison_prompts = {
    "math": "Solve: If a train travels 120 km in 1.5 hours, what is its average speed?",
    "reasoning": "You have 3 boxes: apples, oranges, and mixed. All labels are wrong. "
    "Pick one fruit to identify all boxes. Explain.",
    "analysis": "Compare pros and cons of deploying an LLM on-prem vs in the cloud.",
}

non_thinking_results = {}
for name, prompt in comparison_prompts.items():
    messages = format_messages(prompt, system_prompt="You are a precise assistant.")
    non_thinking_results[name] = generate_chat(
        model,
        tokenizer,
        messages,
        max_new_tokens=200,
        do_sample=False,
    )

print("Baseline results saved for comparison with thinking model.")


Baseline results saved for comparison with thinking model.


## Part 3: Advanced Reasoning with Thinking Models

Thinking models expose internal reasoning steps for complex tasks. Let's switch to the thinking version and compare.

### 3.1 Unload Current Model and Load Thinking Model

In [13]:
# Free up memory before loading the thinking model
import gc

del model
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()

print("Memory cleared.")


Memory cleared.


In [14]:
# Load the thinking model

model_id = MODEL_THINKING

tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    use_fast=True,
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    dtype=torch.bfloat16,
    trust_remote_code=True,
)
model.eval()
if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id



In [15]:
model


Qwen3ForCausalLM(
  (model): Qwen3Model(
    (embed_tokens): Embedding(151936, 2560)
    (layers): ModuleList(
      (0-35): 36 x Qwen3DecoderLayer(
        (self_attn): Qwen3Attention(
          (q_proj): Linear(in_features=2560, out_features=4096, bias=False)
          (k_proj): Linear(in_features=2560, out_features=1024, bias=False)
          (v_proj): Linear(in_features=2560, out_features=1024, bias=False)
          (o_proj): Linear(in_features=4096, out_features=2560, bias=False)
          (q_norm): Qwen3RMSNorm((128,), eps=1e-06)
          (k_norm): Qwen3RMSNorm((128,), eps=1e-06)
        )
        (mlp): Qwen3MLP(
          (gate_proj): Linear(in_features=2560, out_features=9728, bias=False)
          (up_proj): Linear(in_features=2560, out_features=9728, bias=False)
          (down_proj): Linear(in_features=9728, out_features=2560, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): Qwen3RMSNorm((2560,), eps=1e-06)
        (post_attention_layer

### 3.2 Run Same Tests with Thinking Model

In [16]:

# Run same prompts with thinking model
thinking_results = {}
for name, prompt in comparison_prompts.items():
    messages = format_messages(prompt, system_prompt="You are a reasoning assistant.")
    thinking_results[name] = generate_chat(
        model,
        tokenizer,
        messages,
        max_new_tokens=240,
        do_sample=False,
    )

print("Thinking model results collected.")


Thinking model results collected.


### 3.3 Compare Results Side-by-Side

In [17]:
# Display comparison
for name in comparison_prompts:
    print(f"=== {name} ===")
    print("Non-thinking:")
    print(non_thinking_results[name])
    print("\nThinking:")
    print(thinking_results[name])
    print("\n" + "="*50 + "\n")


=== math ===
Non-thinking:
To find the **average speed** of the train, use the formula:

$$
\text{Average Speed} = \frac{\text{Total Distance}}{\text{Total Time}}
$$

Given:
- Distance = 120 km
- Time = 1.5 hours

$$
\text{Average Speed} = \frac{120 \text{ km}}{1.5 \text{ hours}} = 80 \text{ km/h}
$$

### ✅ Answer: **80 km/h**

Thinking:
Okay, let's see. The problem is asking for the average speed of a train that travels 120 km in 1.5 hours. Hmm, average speed... I remember that average speed is calculated by dividing the total distance traveled by the total time taken. So the formula should be speed equals distance divided by time. Let me write that down to be sure.

The formula for average speed (v) is:

v = d / t

where d is distance and t is time.

In this case, the distance d is 120 km, and the time t is 1.5 hours. So I need to plug those values into the formula.

Let me do the division: 120 km divided by 1.5 hours. Let me think, how do I calculate that? Well, 1.5 is the same as 3
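
The thinking model's reply contains its reasoning first and the final answer afterwards. Assuming the reasoning is terminated by a `</think>` tag (the opening tag may be absent from the decoded text if the chat template already placed it in the prompt), a hypothetical helper such as `split_thinking` can separate the two. If `max_new_tokens` runs out before the tag is produced, as may have happened in the truncated output above, there is no final answer to extract.

```python
THINK_END = "</think>"  # assumed delimiter between reasoning and final answer


def split_thinking(text):
    """Return (reasoning, answer); answer is None if the tag never appears."""
    reasoning, tag, answer = text.partition(THINK_END)
    # Drop an explicit opening tag if the decoded text happens to include one.
    reasoning = reasoning.replace("<think>", "").strip()
    if not tag:
        return reasoning, None
    return reasoning, answer.strip()


complete = "120 / 1.5 = 80, so the speed is 80 km/h.</think>\n\n**80 km/h**"
truncated = "Let me do the division: 120 km divided by 1.5 hours."

print(split_thinking(complete))
print(split_thinking(truncated))
```

Giving the thinking model a larger `max_new_tokens` budget than the instruct model leaves it room to finish reasoning before it answers.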

### 3.4 Exercise: Try Your Own Reasoning Tasks (WIP)

Experiment with harder reasoning problems:
- Multi-step planning challenges
- Complex math word problems
- Evaluation and critique of solutions
- Logic puzzles

Create your own prompts below and compare the two models!

In [18]:
# Your experiments here
# Example:
# my_prompt = "..."
# messages = format_messages(my_prompt)
# result = generate_chat(model, tokenizer, messages, max_new_tokens=300)
# print(result)


## Conclusion (WIP)


### Next Step

  