# Advanced LLM Inference with Hugging Face

This guide shows how to run Large Language Models (LLMs) efficiently with the Hugging Face `transformers` library.

In this notebook, we will explore:
1.  **Environment Setup**: Installing necessary libraries and configuring authentication.
2.  **Model Selection**: Choosing state-of-the-art open-source models.
3.  **Quantization**: Understanding and applying 4-bit quantization to run large models on consumer hardware (like T4 GPUs).
4.  **Inference**: Loading models and generating text.
5.  **Modularization**: Building a robust function to switch between models easily.

Each section explains both *how* to perform a step and *why* it matters for modern LLM inference.

## 1. Environment Setup

First, we need to install the essential libraries:
- `transformers`: The core library for loading and running models.
- `torch`: PyTorch, the underlying deep learning framework.
- `bitsandbytes`: A library required for 4-bit and 8-bit quantization.
- `accelerate`: Helps manage model loading and inference across devices.
- `sentencepiece`: A tokenizer library that some model tokenizers depend on (for example, Llama 2).

In [None]:
%pip install requests torch bitsandbytes transformers sentencepiece accelerate python-dotenv huggingface_hub
# %pip install -U bitsandbytes
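
After installation, it can help to confirm which versions are actually present in the kernel. The following is a minimal sketch using only the standard library; the `REQUIRED` list and the `installed_versions` helper are illustrative names, not part of any Hugging Face API. The list mirrors the libraries described above, plus `python-dotenv` and `huggingface-hub`, which the later cells import.

```python
from importlib import metadata

# Distribution names as published on PyPI (assumed list for this notebook).
REQUIRED = [
    "transformers", "torch", "bitsandbytes", "accelerate",
    "sentencepiece", "python-dotenv", "huggingface-hub",
]

def installed_versions(packages):
    """Return {package: version string, or None if the package is missing}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

for name, version in installed_versions(REQUIRED).items():
    print(f"{name:<16} {version or 'NOT INSTALLED'}")
```

Any line reporting `NOT INSTALLED` points to a package that the install cell should be rerun for.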

### Authentication

Accessing gated models (like Llama 3.1) requires a Hugging Face account, acceptance of the model's license on its Hub page, and a valid access token. We load this token from an environment file (`.env`) and pass it to the `huggingface_hub` login utility.

In [None]:
from huggingface_hub import login
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer, BitsAndBytesConfig
import torch
import os
from dotenv import load_dotenv

# Load environment variables
load_dotenv(dotenv_path='/workspace/.env', override=True)
hf_token = os.getenv('HUGGINGFACE_API_KEY')

# Clean the token if necessary: trim whitespace and drop a leading "Bearer " prefix
if hf_token:
    hf_token = hf_token.strip().removeprefix('Bearer ').strip()

print(f"Token loaded: {'Yes' if hf_token else 'No'}")

# Log in to Hugging Face
if hf_token:
    login(hf_token)

## 2. Model Selection

We will be working with a selection of high-performance models. Defining them in variables allows us to easily switch between them later.

In [None]:
# Define model identifiers
LLAMA = "meta-llama/Llama-3.1-8B-Instruct"
PHI4 = "microsoft/Phi-3-mini-4k-instruct"  # Note: despite the variable name, this is a Phi-3 model
GEMMA3 = "google/gemma-3-1b-it"
QWEN = "Qwen/Qwen3-4B-Instruct-2507"
MIXTRAL = "mistralai/Mixtral-8x7B-Instruct-v0.1"

## 3. Quantization Theory & Configuration

### What is Quantization?
Quantization is the process of reducing the precision of the numbers used to represent a model's parameters (weights). 
- **Standard**: FP32 (32-bit floating point) or FP16 (16-bit).
- **Quantized**: 4-bit formats such as INT4 (4-bit integers) or NF4 (4-bit NormalFloat, the type configured below).

Going from 16 bits to 4 bits per weight cuts the memory needed for the weights to roughly a quarter, allowing us to run large models (like Llama 3.1 8B) on consumer GPUs with limited VRAM (e.g., 16GB or even less).
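
The arithmetic behind this saving can be sketched as follows. This is a back-of-the-envelope estimate for the weights only, assuming a round 8 billion parameters; `weight_memory_gb` is a hypothetical helper, not a library function.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

NUM_PARAMS = 8e9  # nominal parameter count of an "8B" model (assumed round figure)

for label, bits in [("FP32", 32), ("FP16", 16), ("4-bit", 4)]:
    print(f"{label:>5}: {weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
# prints 32.0 GB, 16.0 GB and 4.0 GB respectively
```

The real footprint of a quantized model is usually somewhat higher than the 4-bit figure, because some layers stay in higher precision, the quantization constants take space, and activations and the KV cache need memory at inference time.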

### Configuration
We use `BitsAndBytesConfig` to define how we want to load the model:
- `load_in_4bit=True`: Enable 4-bit loading.
- `bnb_4bit_quant_type="nf4"`: Use "NormalFloat4", a data type optimized for normally distributed weights.
- `bnb_4bit_use_double_quant=True`: Quantize the quantization constants themselves for extra savings.
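- `bnb_4bit_compute_dtype=torch.bfloat16`: Weights are stored in 4-bit but dequantized to bfloat16 for computation, which keeps matrix multiplications numerically stable. On GPUs without native bfloat16 support (such as the T4), `torch.float16` may be the faster choice.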

In [None]:
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4"
)
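
To build intuition for what happens to the weights, here is a simplified sketch of block-wise quantization. It uses uniform absmax integer quantization, which is an assumption made for illustration: NF4 uses non-uniformly spaced levels, and the actual `bitsandbytes` kernels differ. The `scale` returned per block plays the role of the "quantization constant" that double quantization compresses further.

```python
def quantize_block(weights, levels=7):
    """Map floats to integers in [-levels, levels] using one absmax scale per block."""
    # Fall back to 1.0 so an all-zero block does not divide by zero.
    scale = max(abs(w) for w in weights) or 1.0
    codes = [round(w / scale * levels) for w in weights]
    return codes, scale

def dequantize_block(codes, scale, levels=7):
    """Recover approximate floats from integer codes and the stored scale."""
    return [c / levels * scale for c in codes]

# Example block of weights (made-up values for illustration).
block = [0.12, -0.50, 0.33, 0.02, -0.27, 0.48, -0.05, 0.19]
codes, scale = quantize_block(block)
restored = dequantize_block(codes, scale)
max_error = max(abs(a - b) for a, b in zip(block, restored))

print("codes:", codes)  # every code fits in a signed 4-bit integer
print(f"scale: {scale}, max round-trip error: {max_error:.4f}")
```

The round-trip error is bounded by half a quantization step, which is why smaller blocks (each with its own scale) give better accuracy at the cost of storing more constants.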

### Cache Management

To keep our workspace organized, we give each model its own cache directory under `HF_HOME`. Keeping models separate makes it easy to inspect or delete one model's files, and pointing `HF_HOME` at a larger disk keeps downloads from filling up the default cache partition.

In [None]:
# Base Hugging Face cache directory
hf_cache_base = os.getenv('HF_HOME', '/root/.cache/huggingface')

# Define specific cache directories for each model
model_cache_llama_3_1 = os.path.join(hf_cache_base, 'models', 'llama_3_1_8b')
model_cache_phi3 = os.path.join(hf_cache_base, 'models', 'phi_3_mini')
model_cache_gemma_3 = os.path.join(hf_cache_base, 'models', 'gemma_3_1b')

# Create directories if they don't exist
os.makedirs(model_cache_llama_3_1, exist_ok=True)
os.makedirs(model_cache_phi3, exist_ok=True)
os.makedirs(model_cache_gemma_3, exist_ok=True)

print(f"Llama Cache: {model_cache_llama_3_1}")
print(f"Phi-3 Cache: {model_cache_phi3}")
print(f"Gemma Cache: {model_cache_gemma_3}")
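
When the list of models grows, writing one variable per directory becomes repetitive. One possible alternative, sketched below, derives the directory from the model identifier. `cache_dir_for` is a hypothetical helper and the `org__name` folder naming is an arbitrary choice for this sketch, not a Hugging Face convention. The demo uses a temporary directory so it does not touch the real cache.

```python
import os
import tempfile

def cache_dir_for(model_id: str, base: str) -> str:
    """Build (and create) a per-model cache directory from a Hub model id."""
    # "org/name" becomes "org__name", lowercased, so each id maps to one folder.
    folder = model_id.replace("/", "__").lower()
    path = os.path.join(base, "models", folder)
    os.makedirs(path, exist_ok=True)
    return path

demo_base = tempfile.mkdtemp()  # throwaway base directory for the demo
print(cache_dir_for("google/gemma-3-1b-it", demo_base))
```

In the notebook, `base` would be `hf_cache_base`, and the returned path could be passed as `cache_dir` when loading a model.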

## 4. Loading the Model

Now we load the **Llama 3.1** model using our quantization configuration. This might take a few minutes depending on your internet speed and whether the model is already cached.

In [None]:
MODEL_LLAMA = AutoModelForCausalLM.from_pretrained(
    LLAMA, 
    device_map="auto", 
    quantization_config=quant_config,
    cache_dir=model_cache_llama_3_1
)

print(f"✅ Model loaded successfully from: {model_cache_llama_3_1}")

### Memory Footprint Analysis

Let's verify the effectiveness of our quantization. A standard 8B model in FP16 would require approximately 16GB of VRAM. Let's see how much our 4-bit version uses.

In [None]:
memory_footprint = MODEL_LLAMA.get_memory_footprint() / 1e6
print(f"Memory footprint: {memory_footprint:,.1f} MB")

## 5. Running Inference

To generate text, we need to:
1.  **Tokenize**: Convert our text prompt into numbers (tokens) the model understands.
2.  **Generate**: Feed tokens into the model to predict the next tokens.
3.  **Decode**: Convert the predicted tokens back into text.

In [None]:
# 1. Prepare the input
tokenizer = AutoTokenizer.from_pretrained(LLAMA)
tokenizer.pad_token = tokenizer.eos_token

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Tell me a joke about data scientists."}
]

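# add_generation_prompt=True appends the assistant header so the model answers
# instead of continuing the user's turn
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to("cuda")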

# 2. Generate output
outputs = MODEL_LLAMA.generate(inputs, max_new_tokens=100)

# 3. Decode and print
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
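
To make the three steps concrete without a GPU, here is a toy sketch of the same tokenize, generate, decode loop. Everything in it is made up for illustration: the seven-token vocabulary, the whitespace tokenizer, and the lookup table standing in for the model. A real model predicts a probability distribution over tens of thousands of tokens at each step.

```python
# Toy vocabulary; index in the list is the token id.
VOCAB = ["<eos>", "why", "did", "the", "model", "overfit", "?"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}

# Toy "model": a lookup table giving the next token id for each current token id.
NEXT_TOKEN = {1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 0}

def tokenize(text):
    """Step 1: text -> list of token ids (whitespace split, toy only)."""
    return [TOKEN_TO_ID[word] for word in text.split()]

def generate_ids(input_ids, max_new_tokens, eos_id=0):
    """Step 2: repeatedly predict the next token and append it."""
    ids = list(input_ids)
    for _ in range(max_new_tokens):
        next_id = NEXT_TOKEN[ids[-1]]
        ids.append(next_id)
        if next_id == eos_id:  # stop early at end-of-sequence
            break
    return ids

def decode(ids, skip_special_tokens=True):
    """Step 3: token ids -> text, optionally dropping special tokens."""
    tokens = [VOCAB[i] for i in ids]
    if skip_special_tokens:
        tokens = [t for t in tokens if t != "<eos>"]
    return " ".join(tokens)

prompt_ids = tokenize("why did the")
output_ids = generate_ids(prompt_ids, max_new_tokens=10)
print(decode(output_ids))  # prints: why did the model overfit ?
```

As in the real cell, the generated sequence contains the prompt tokens followed by the new ones, and generation stops either at the end-of-sequence token or when `max_new_tokens` is reached.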

In [None]:
# Clean up memory before moving on
import gc

del inputs, outputs, MODEL_LLAMA, tokenizer
gc.collect()  # release Python references before clearing the CUDA cache
torch.cuda.empty_cache()

## 6. Modular Inference Function

To efficiently test multiple models without rewriting code, we'll encapsulate the loading and generation logic into a single function. This function handles:
- Tokenization
- Model Loading (with quantization)
- Streaming generation (printing text as it's generated)
- Memory cleanup

In [None]:
import gc

def generate(model_name, messages, cache_dir):
    print(f"\n--- Loading {model_name} ---")
    
    # Initialize Tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token
    
    # Prepare Inputs
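    # add_generation_prompt=True makes the model answer rather than continue the user turn
    inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to("cuda")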
    streamer = TextStreamer(tokenizer, skip_prompt=True)
    
    # Load Model
    loaded_model = AutoModelForCausalLM.from_pretrained(
        model_name, 
        device_map="auto", 
        quantization_config=quant_config, 
        cache_dir=cache_dir
    )
    
    # Generate
    print("\n--- Response ---")
    outputs = loaded_model.generate(inputs, max_new_tokens=150, streamer=streamer)
    
    # Cleanup
    del tokenizer, streamer, loaded_model, inputs, outputs
    gc.collect() 
    torch.cuda.empty_cache()
    print("\n--- Memory Cleared ---")
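
One limitation of the function above is that the cleanup step is skipped if loading or generation raises an exception. A possible way to guard against that, sketched here under the assumption that a context manager fits your workflow, is to move the cleanup into a `finally` block. `gpu_cleanup` is a hypothetical helper; the `torch` import is optional so the sketch also runs on a machine without PyTorch.

```python
import gc
from contextlib import contextmanager

@contextmanager
def gpu_cleanup():
    """Run garbage collection (and empty the CUDA cache if available) on exit."""
    try:
        yield
    finally:
        gc.collect()
        try:
            import torch  # optional here so the sketch also runs without PyTorch
            if torch.cuda.is_available():
                torch.cuda.empty_cache()
        except ImportError:
            pass
        print("--- Memory Cleared ---")

# Usage: the cleanup still runs when the body raises.
try:
    with gpu_cleanup():
        raise RuntimeError("simulated failure during generation")
except RuntimeError as err:
    print(f"Caught: {err}")
```

Memory is only released for objects that no longer have live references, so the model and tensors should be created inside a function called within the `with` block, or deleted explicitly before it exits.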

### Testing Phi-3
Let's test the `Phi-3` model using our new function.

In [None]:
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain quantum computing in one sentence."}
]

generate(PHI4, messages, model_cache_phi3)

### Testing Gemma
Now let's try Google's `Gemma` model. Note that some models might not support system prompts in the same way, so we adjust the message structure if needed.

In [None]:
# Gemma sometimes prefers just user messages or has specific template requirements
messages_gemma = [
    {"role": "user", "content": "Who is the father of Pinocchio?"}
]

generate(GEMMA3, messages_gemma, model_cache_gemma_3)
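
Instead of maintaining a separate message list per model, one option is to merge any system prompt into the first user message before calling the function. The sketch below is a hypothetical helper (`fold_system_prompt` is not part of `transformers`), and whether a given model needs it depends on its chat template.

```python
def fold_system_prompt(messages):
    """Return a copy of `messages` with system content merged into the first user turn."""
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    others = [dict(m) for m in messages if m["role"] != "system"]  # copies, input untouched
    if not system_parts:
        return others
    for m in others:
        if m["role"] == "user":
            m["content"] = "\n\n".join(system_parts + [m["content"]])
            break
    # Note: if there is no user message, the system content is dropped.
    return others

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who is the father of Pinocchio?"},
]
print(fold_system_prompt(messages))
```

The original list is left unchanged, so the same `messages` can still be passed as-is to models whose templates accept a system role.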

## Conclusion

You have successfully:
1.  Configured an environment for LLM inference, including Hugging Face authentication.
2.  Understood and applied 4-bit quantization.
3.  Loaded and ran inference on multiple state-of-the-art models.
4.  Created a modular function for efficient testing.

This foundation allows you to explore even larger models and more complex applications on standard hardware.