# Imports, Keep in mind different LLMs are trained in different versions of PyTorch and other libraries, so it is very common to have a version mismatch. Must read documentation and GitHub or HF issues.

In [None]:
!pip install  -q transformers torch accelerate bitsandbytes peft trl datasets

# Importing Model from HF and performing inference.

In [12]:
    """
    Imports the AutoModelForCausalLM and AutoTokenizer classes from the
    Hugging Face transformers library. These classes allow for easy loading
    of pre-trained models and their corresponding tokenizers.
    """
from transformers import AutoModelForCausalLM, AutoTokenizer


model_name = "Qwen/Qwen3-4B-Thinking-2507"

    """
    Downloads and loads the tokenizer that is specifically associated with the
    chosen model. The tokenizer is responsible for converting text into
    numerical tokens and vice versa.
    """
tokenizer = AutoTokenizer.from_pretrained(model_name)
    """
    Downloads and loads the actual pre-trained language model.
    - model_name: Specifies which model to load.
    - torch_dtype="auto": Allows the library to automatically select the optimal
      data type (like float16) for memory efficiency.
    - device_map="auto": Automatically moves the model to the appropriate hardware,
      such as a GPU if one is available, for faster processing.
    """
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

prompt = "Give me a short introduction to a large language model."

    
    """
    Formats the prompt into a list of dictionaries, following the conversational
    chat structure that this model expects. Each dictionary represents one turn
    in the conversation, with a 'role' (like 'user') and 'content'.
    """
messages = [
    {"role": "user", "content": prompt}
]

    """
    Applies the model's specific chat template to the 'messages' list. This
    converts the structured conversation into a single, formatted string,
    adding all necessary special tokens (like '<|im_start|>') that the
    model needs to understand the prompt correctly.
    - tokenize=False: Ensures the output is a string, not yet token IDs.
    - add_generation_prompt=True: Appends a marker at the end to signal
      the model that it's now its turn to generate a response.
    """
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
    """
    Performs the final tokenization step. It takes the formatted string,
    converts it into numerical token IDs, wraps it in a list to create a batch
    of one, and returns the result as PyTorch tensors. The '.to(model.device)'
    call moves these tensors to the same hardware (CPU or GPU) as the model.
    """
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

    """
    Executes the core text generation process. The model takes the tokenized
    inputs and generates a sequence of new token IDs as the response.
    - **model_inputs: Unpacks the dictionary of tensors into keyword arguments.
    - max_new_tokens: Sets a very high limit on the length of the generated
      response to ensure it isn't cut off prematurely.
    """
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
    """
    Processes the output from the model. It selects the first (and only) sequence
    from the batch, slices it to remove the original input prompt tokens, and
    keeps only the newly generated tokens. These token IDs are then converted
    into a standard Python list.
    """
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist() 

    """
    Starts a 'try...except' block. This is a safe way to handle the parsing
    logic, as it will prevent the code from crashing if the special
    '</think>' token is not found in the model's output.
    """
try:
    """
    Finds the position of the special '</think>' token (ID 151668), which
    separates the model's reasoning from its final answer. It works by
    reversing the list of output IDs, finding the first index of the token,
    and then calculating its original position from the end of the list.
    """
    # rindex finding 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
"""
This 'except' block catches the 'ValueError' that occurs if the '.index()'
method cannot find the token ID 151668 in the list.
"""
except ValueError:
    """
    If the '</think>' token is not found, the index is set to 0. This means
    The entire output will be treated as the final answer, with no
    "thinking" content to extract.
    """
    index = 0

    """
    Decodes the list of token IDs that appear *before* the '</think>' token
    into a human-readable string. This string represents the model's internal
    "thinking" or reasoning process.
    """
thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
    """
    Decodes the list of token IDs from the '</think>' token to the end of the
    sequence. This string represents the final, user-facing answer.
    """
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content) # no opening <think> tag
print("content:", content)

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

thinking content: Okay, the user asked for a short introduction to large language models. Let me start by recalling what I know about LLMs. They're AI models trained on massive amounts of text data. I should mention their key features like understanding and generating human-like text.

Hmm, the user probably wants something concise but informative. They might be a student, a professional new to AI, or just someone curious. I should avoid jargon to keep it accessible. Maybe they need this for a project, a presentation, or personal knowledge. 

Wait, they said "short," so I need to be brief. Focus on the essentials: definition, training data, capabilities, and examples. Don't dive too deep into technical details like architecture or training processes. 

I should highlight what LLMs can do‚Äîanswer questions, write stories, code‚Äîbut also mention limitations to set realistic expectations. The user might not know that LLMs can make mistakes or be biased. Including a note about context an

# Printing the Model Architecture.

In [13]:
print(model)

Qwen3ForCausalLM(
  (model): Qwen3Model(
    (embed_tokens): Embedding(151936, 2560)
    (layers): ModuleList(
      (0-35): 36 x Qwen3DecoderLayer(
        (self_attn): Qwen3Attention(
          (q_proj): Linear(in_features=2560, out_features=4096, bias=False)
          (k_proj): Linear(in_features=2560, out_features=1024, bias=False)
          (v_proj): Linear(in_features=2560, out_features=1024, bias=False)
          (o_proj): Linear(in_features=4096, out_features=2560, bias=False)
          (q_norm): Qwen3RMSNorm((128,), eps=1e-06)
          (k_norm): Qwen3RMSNorm((128,), eps=1e-06)
        )
        (mlp): Qwen3MLP(
          (gate_proj): Linear(in_features=2560, out_features=9728, bias=False)
          (up_proj): Linear(in_features=2560, out_features=9728, bias=False)
          (down_proj): Linear(in_features=9728, out_features=2560, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen3RMSNorm((2560,), eps=1e-06)
        (post_attention_layernorm): Qwe

# Batch Processing of Prompts

In [14]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "Qwen/Qwen3-4B-Thinking-2507"
tokenizer = AutoTokenizer.from_pretrained(model_name)
"""
Downloads and loads the actual pre-trained language model weights.
- torch_dtype="auto": Automatically selects the best data precision for efficiency.
- device_map="auto": Automatically moves the model to the GPU if available.
"""
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

"""
This block checks if the tokenizer has a pre-defined padding token.
Padding is necessary when processing multiple texts of different lengths
at the same time (i.e., in a batch).
"""
# === BATCHING CHANGE 1: Set a padding token ===
# The model needs a token to use for padding shorter sequences in the batch.
# The end-of-sequence token is a common choice for this.
if tokenizer.pad_token is None:
    """
    If no padding token is defined, this line sets the existing end-of-sequence
    (eos) The token is also to be used as the padding token. This is a common practice.
    """
    tokenizer.pad_token = tokenizer.eos_token

"""
This creates a Python list containing multiple different prompts. This list
represents the batch of inputs that will be sent to the model simultaneously.
"""
prompts = [
    "Give me a short introduction to large language models.",
    "What are the capital cities of Bangladesh and Japan?",
    "Write a two-sentence horror story."
]

"""
This uses a list comprehension to iterate through each prompt in the 'prompts'
List and format it into the required conversational dictionary structure.
"""
messages = [
    [{"role": "user", "content": p}] for p in prompts
]
"""
This uses another list comprehension to apply the model's chat template to
each conversational message, creating a list of fully formatted strings.
"""
texts = [
    tokenizer.apply_chat_template(
        m,
        tokenize=False,
        add_generation_prompt=True,
    ) for m in messages
]

"""
This tokenizes the entire batch of formatted text strings at once.
- padding=True: Adds padding to shorter sentences so all sequences in the
  batch have the same length.
- return_tensors="pt": Returns the output as PyTorch tensors.
- .to(model.device): Moves the tensors to the same hardware (GPU or CPU) as the model.
"""
model_inputs = tokenizer(texts, return_tensors="pt", padding=True).to(model.device)

"""
This executes text generation for the entire batch. The model processes all
The input prompts and generates a response for each one.
"""
# conduct text completion for the batch
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512 # Reduced for faster batch processing
)

"""
This starts a 'for' loop to iterate through each of the generated responses
in the output batch, so we can process them one by one.
"""
for i in range(len(prompts)):

    # Get the generated tokens for the i-th prompt, excluding the original input tokens
    input_len = len(model_inputs.input_ids[i])

    output_ids = generated_ids[i][input_len:].tolist()


  
    try:
        # rindex finding 151668 (</think>)
        index = len(output_ids) - output_ids[::-1].index(151668)
    """
    This 'except' block will run if the '</think>' token ID is not found.
    """
    except ValueError:

        index = 0

    """
    This decodes the token IDs *before* the '</think>' token into a string,
    representing the model's reasoning.
    """
    thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
    """
    This decodes the token IDs from the '</think>' token onward into a string,
    representing the final answer.
    """
    content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

    print("="*50)
    print(f"PROMPT {i+1}: {prompts[i]}")
    print("-" * 20)
    print("Thinking Content:", thinking_content)
    print("Content:", content)

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


PROMPT 1: Give me a short introduction to large language models.
--------------------
Thinking Content: 
Content: Okay, the user asked for a short introduction to large language models. Hmm, they probably want something concise but informative‚Äîno jargon overload. Maybe they're a student, a curious professional, or just someone who heard the term and wants a quick grasp.  

First, I should clarify what LLMs *are* without diving too deep into technicalities. The key points: massive parameters, training on text, pattern recognition, and generation. Gotta emphasize they're not "intelligent" but statistically powerful‚Äîthey predict text based on patterns, not true understanding.  

Wait, should I mention specific examples like GPT or Llama? Yeah, naming a couple would make it concrete. But keep it brief‚Äîuser said "short." Also, avoid hype; don't want to imply they have consciousness or reasoning. Important distinction!  

User might confuse LLMs with chatbots. Should I clarify that LLM

In [15]:
print(texts)

['<|im_start|>user\nGive me a short introduction to large language models.<|im_end|>\n<|im_start|>assistant\n<think>\n', '<|im_start|>user\nWhat are the capital cities of Bangladesh and Japan?<|im_end|>\n<|im_start|>assistant\n<think>\n', '<|im_start|>user\nWrite a two-sentence horror story.<|im_end|>\n<|im_start|>assistant\n<think>\n']


# Playing with Tokenizer. 

In [16]:
from transformers import AutoTokenizer

# You can use any model from the Hugging Face Hub.
# Let's use a well-known, smaller model for this example.
model_name = "Qwen/Qwen3-4B-Thinking-2507"

print(f"Loading tokenizer for model: '{model_name}'\n")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# --- Let's start with a sample sentence ---
sentence = "Transformers are very powerful tools for NLP."
print("Original Sentence:", sentence)
print("="*50)


# 1. TOKENIZE: Convert text into tokens (strings)
# This shows how the tokenizer breaks up the words.
tokens = tokenizer.tokenize(sentence)
print("1. Text split into Tokens (strings):")
print(tokens)
print("-"*50)


# 2. ENCODE: Convert text into token IDs (numbers)
# This is what the model actually receives as input.
token_ids = tokenizer.encode(sentence)
print("2. Text converted into Token IDs (numbers):")
print(token_ids)
print("-"*50)


# 3. DECODE: Convert token IDs back into text
# This shows that the process is reversible.
decoded_text = tokenizer.decode(token_ids)
print("3. Token IDs converted back into Text:")
print(decoded_text)
print("="*50)


# 4. EXPLORE SPECIAL TOKENS
# Models use special tokens to understand the structure of the input.
print("4. Exploring Special Tokens:")

# Get the entire map of special tokens
special_tokens_map = tokenizer.special_tokens_map
print("\n   - Map of all special tokens:")
print(special_tokens_map)

# Get specific tokens and their numerical IDs
cls_token = tokenizer.cls_token
cls_token_id = tokenizer.cls_token_id
sep_token = tokenizer.sep_token
sep_token_id = tokenizer.sep_token_id

print(f"\n   - Classification Token: '{cls_token}' (ID: {cls_token_id})")
print(f"   - Separator Token:    '{sep_token}' (ID: {sep_token_id})")

Loading tokenizer for model: 'Qwen/Qwen3-4B-Thinking-2507'

Original Sentence: Transformers are very powerful tools for NLP.
1. Text split into Tokens (strings):
['Transform', 'ers', 'ƒ†are', 'ƒ†very', 'ƒ†powerful', 'ƒ†tools', 'ƒ†for', 'ƒ†N', 'LP', '.']
--------------------------------------------------
2. Text converted into Token IDs (numbers):
[8963, 388, 525, 1602, 7988, 7375, 369, 451, 12567, 13]
--------------------------------------------------
3. Token IDs converted back into Text:
Transformers are very powerful tools for NLP.
4. Exploring Special Tokens:

   - Map of all special tokens:
{'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}

   - Classification Token: 'None' (ID: None)
   - Separator Token:    'None


# Loading a Quantized LLM for Efficient Inference

This script demonstrates how to load a powerful Large Language Model (`Qwen/Qwen3-4B-Thinking-2507`) in a highly memory-efficient way using **4-bit quantization**. This technique is essential for running large models on consumer-grade hardware (like gaming GPUs) by significantly reducing the model's size in memory (VRAM).

The process involves three main steps:
1.  **Defining the Quantization Configuration:** We specify *how* the model's precision should be reduced.
2.  **Loading the Model:** We load the model from the Hugging Face Hub, applying the specified configuration.
3.  **Running Inference:** We test the quantized model with a sample prompt to ensure it works as expected.

---
### Step 1: Quantization Configuration ‚öôÔ∏è
Before loading the model, we create a `BitsAndBytesConfig` object. This object acts as a set of instructions for the `transformers` library on how to compress the model's weights.
* **`load_in_4bit=True`**: This is the main command that enables 4-bit quantization.
* **`bnb_4bit_quant_type="nf4"`**: This specifies the quantization method. "nf4" (Normalized Float 4) is a highly effective and popular technique that maintains good model performance.
* **`bnb_4bit_compute_dtype=torch.bfloat16`**: While the model's weights are *stored* in 4-bit, the actual mathematical calculations during inference are performed in a more stable, higher-precision format (`bfloat16`) to maintain accuracy. Think of it like decompressing a small section of a JPEG image to full quality right before you view it.

---
### Step 2: Loading the Quantized Model üß†
We use the standard `AutoModelForCausalLM.from_pretrained` function but pass our `quantization_config` to it. The `transformers` library, in conjunction with `bitsandbytes`, handles the complex process of downloading the model and compressing its weights on the fly. The `device_map="auto"` argument intelligently places the model on your GPU for fast performance.

---
### Step 3: Performing Inference ‚ñ∂Ô∏è
After the model is loaded, we can interact with it just like any other language model. The script prepares a sample prompt ("What is quantization in the context of LLMs?"), formats it using the model's specific chat template, tokenizes it, and then calls `model.generate()` to get a response. This final step confirms that our memory-efficient model is fully functional and ready to use.

In [17]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "Qwen/Qwen3-4B-Thinking-2507"

# 1. Define the Quantization Configuration
# This tells the model how to compress its weights.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# 2. Load the model with the quantization config
print("Loading quantized model...")
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto", # Automatically uses the GPU
)

tokenizer = AutoTokenizer.from_pretrained(model_name)

print("\nModel loaded successfully in 4-bit!")

# You can now use the model for inference just like before.
# It will consume much less VRAM.
# For example:
prompt = "What is quantization in the context of LLMs?"
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(**model_inputs, max_new_tokens=100)
response = tokenizer.decode(generated_ids[0], skip_special_tokens=True)

print("\n--- Sample Inference ---")
print(response)

Loading quantized model...


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]


Model loaded successfully in 4-bit!

--- Sample Inference ---
user
What is quantization in the context of LLMs?
assistant
<think>
Okay, the user is asking about quantization in the context of LLMs. Let me start by recalling what I know. Quantization is a technique to reduce the size and computational requirements of models. But I need to make sure I cover all aspects clearly.

First, I should explain why quantization is needed for LLMs. Large language models have massive parameters, often in the billions or trillions. Training them requires huge memory and powerful GPUs. For deployment, especially on edge devices like



# Setting up a Model for QLoRA Fine-Tuning

This script demonstrates the essential setup process for **QLoRA (Quantized Low-Rank Adaptation)**, a memory-efficient technique for fine-tuning large language models. The goal is to load a large base model in a compressed format and then insert small, trainable "adapter" layers. The script concludes by verifying the setup, printing a comparison between the total number of parameters in the model and the tiny fraction that will actually be trained.

The process is broken down into three parts: configuring the quantization, applying the LoRA adapters, and finally, printing the trainable parameter count.

---
### 1. The 'Q' in QLoRA: Quantization with `BitsAndBytesConfig`

The first step is to load the base model in a highly compressed state to save GPU memory (VRAM). We use a `BitsAndBytesConfig` object to give precise instructions on how to perform this compression.

* **`load_in_4bit=True`**: This is the master switch that enables 4-bit quantization.
* **`bnb_4bit_quant_type="nf4"`**: This specifies the use of **Normalized Float 4 (nf4)**, a special data type optimized for the typical distribution of weights in a neural network, which helps preserve model performance.
* **`bnb_4bit_compute_dtype=torch.bfloat16`**: This is a crucial setting for performance. While the model's weights are *stored* in memory-saving 4-bit, the actual mathematical calculations are performed in the more stable 16-bit `bfloat16` format. The model dynamically de-quantizes only the necessary weights for each calculation, ensuring accuracy without sacrificing the memory benefits.
* **`bnb_4bit_use_double_quant=True`**: This enables a second quantization pass that compresses the metadata from the first pass, saving a little extra memory.

The model is then loaded using `AutoModelForCausalLM.from_pretrained`, with our `quantization_config` applied. The base model's weights are now frozen in a 4-bit state on the GPU.

---
### 2. The 'LoRA' in QLoRA: Low-Rank Adaptation

With the base model frozen, we need to add new parameters that we can train. This is done using LoRA.

* **`prepare_model_for_kbit_training`**: This is a utility function that prepares the quantized model by setting up gradient checkpointing and ensuring compatibility for training.
* **`LoraConfig`**: This object defines the properties of the small, trainable adapter layers we will inject into the model.
    * **`target_modules=[...]`**: This tells the library exactly which layers of the frozen model to wrap with adapters. We typically target the attention mechanism's projection layers (`q_proj`, `k_proj`, etc.), as they are critical to the model's behavior.
    * **`r=16`**: This is the **rank** of the adapter, which controls its size. A smaller `r` means fewer trainable parameters.
    * **`lora_alpha=32`**: This is a scaling factor for the updates applied by the adapter.
* **`get_peft_model()`**: This function takes the base model and the LoRA configuration and returns the final model, now ready for training. The original weights are still frozen, but the new LoRA adapter weights are trainable.

---
### 3. Verifying the Setup

The `print_trainable_parameters` function is a crucial final check. It iterates through all the parameters in the modified model and counts how many are "trainable" (i.e., `param.requires_grad` is `True`).

The output clearly demonstrates the efficiency of QLoRA by showing that the number of trainable parameters is only a tiny fraction (often less than 1%) of the total parameters. This confirms that our fine-tuning process will be extremely memory-efficient.

In [20]:
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Model to fine-tune
model_name = "Qwen/Qwen3-4B-Thinking-2507"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)


def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || "
        f"trainable%: {100 * trainable_params / all_param:.2f}"
    )

# Call the function to see the result
print_trainable_parameters(model)

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

trainable params: 11796480 || all params: 2217606656 || trainable%: 0.53


# A Detailed Guide to Fine-tuning with QLoRA

This script fine-tunes the `Qwen/Qwen3-4B-Thinking-2507` model, teaching it to adopt a new personality using the **QLoRA** method. QLoRA is a breakthrough technique for making large language models (LLMs) accessible for fine-tuning on consumer-grade hardware. It achieves this by being incredibly memory-efficient.

The entire process can be broken down into four key stages: preparing the data, configuring the model with QLoRA, setting up the trainer, and finally, training and saving the result.

---
### 1. Data Preparation üìö
Before we can teach the model a new skill, we need to create the learning material.

* **Create Raw Data:** We begin with basic Python lists for our `prompts` and the desired `responses`. This format is intuitive and easy to manage.
* **Structure the Dataset:** We convert these lists into a Hugging Face `Dataset` object. This is a standardized data structure that integrates seamlessly with the entire Hugging Face ecosystem.
* **Apply Chat Template:** We define a `formatting_prompts_func` to structure our prompts and responses into the specific chat format the Qwen model was originally trained on. Using the `.map()` function, we apply this template to every example, ensuring the model receives input in a format it understands, complete with special tokens like `<|im_start|>user`.
* **Tokenize the Dataset:** Finally, we convert our formatted text into numbers via tokenization. The `tokenize_function` not only converts text to token IDs but also handles two critical tasks for batching:
    * **Truncation:** It cuts down any examples longer than `max_length` (1024 tokens).
    * **Padding:** It adds special `pad_token`s to any examples shorter than `max_length`, ensuring every sequence in a batch has the exact same length.

---
### 2. The QLoRA Configuration ‚öôÔ∏è
This is the heart of the process, where we combine Quantization and Low-Rank Adaptation.

#### The 'Q' in QLoRA: Quantization with `BitsAndBytesConfig`
The primary bottleneck in running LLMs is GPU memory (VRAM). A 4-billion-parameter model would normally require `4B * 2 bytes/param = 8 GB` of VRAM just for the model weights in 16-bit precision, plus more for gradients and optimizer states during training. Quantization drastically reduces this. We configure it using the `BitsAndBytesConfig` object, which acts as a detailed instruction manual for this compression.

* **`load_in_4bit=True`**: This is the master switch that enables 4-bit quantization. Instead of loading the model's weights as 16-bit numbers, this tells the library to apply a sophisticated 4-bit compression scheme as the model is loaded onto the GPU.
* **`bnb_4bit_quant_type="nf4"`**: This specifies the *type* of 4-bit numbers to use. `"nf4"` stands for **Normalized Float 4**. Unlike a standard 4-bit integer, nf4 is a special data type designed by researchers to be optimal for data that follows a normal (bell-curve) distribution, which is exactly how weights in a neural network are typically distributed. This clever design minimizes the loss of information, preserving the model's performance remarkably well.
* **`bnb_4bit_compute_dtype=torch.bfloat16`**: This is a crucial concept for understanding QLoRA's efficiency. The model's weights are *stored* in memory-saving 4-bit. However, to perform actual calculations (like matrix multiplications), 4-bit precision is not enough and would lead to poor performance. This parameter tells the library to use `bfloat16` as the **computation data type**. When the model needs to perform a calculation, it takes the tiny chunk of 4-bit weights required, dynamically "de-quantizes" it up to the more precise `bfloat16` format, performs the math, and then immediately discards the `bfloat16` version, keeping only the 4-bit weights in memory. This gives you the best of both worlds: the low memory footprint of 4-bit storage and the accuracy of 16-bit computation.
* **`bnb_4bit_use_double_quant=True`**: This enables **Double Quantization**, a further optimization. The first quantization pass compresses the weights but creates a small amount of overhead in the form of "quantization constants." Double quantization applies a second, more efficient compression pass to these constants themselves, saving an additional ~0.4 bits per parameter for free.

#### The 'LoRA' in QLoRA: Low-Rank Adaptation with `LoraConfig`
With our massive base model now frozen and quantized in memory, we need a way to teach it new information. This is where LoRA comes in. Instead of training the whole model, we insert very small, trainable "adapter" layers. 

* **`target_modules=["q_proj", "k_proj", ...]`**: This is the most important parameter. It tells the `peft` library exactly which layers of the frozen model to "wrap" with trainable LoRA adapters. We typically target the layers involved in the attention mechanism, as they are crucial for how the model processes information.
* **`r=16`**: This is the **rank** of the LoRA adapter. It determines the size of the trainable matrices. A smaller `r` means fewer trainable parameters, faster training, and a smaller final checkpoint file, but might have less expressive power. `16` is a common and effective choice.
* **`lora_alpha=32`**: This is a scaling factor for the LoRA updates. A common convention is to set `lora_alpha` to be twice the value of `r`.

Finally, `get_peft_model()` wraps our quantized base model with these configured LoRA adapters, producing our final, trainable model.

---
### 3. Training Setup and Execution ‚ñ∂Ô∏è
* **`TrainingArguments`**: This object holds all the hyperparameters for the training job, like the `learning_rate`, `num_train_epochs`, and batch size.
* **`SFTTrainer`**: The Supervised Fine-tuning Trainer is the workhorse that orchestrates the entire process. It takes our PEFT model, tokenized dataset, and training arguments and handles the complex training loop for us.
* **`trainer.train()`**: This single command begins the fine-tuning. The trainer will iterate through the dataset, and for each step, only the tiny LoRA adapter weights will be updated.

---
### 4. Saving the Result üíæ
After training, `model.save_pretrained()` saves the result. Crucially, this saves **only the small LoRA adapter weights**, not the entire 4-billion-parameter base model. This creates a tiny checkpoint file (usually just a few megabytes) that is portable and can be loaded on top of the original Qwen3 model to restore its new "pirate" personality.

In [4]:
import torch
from datasets import Dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from trl import SFTTrainer
import os

os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Model to fine-tune
model_name = "Qwen/Qwen3-4B-Thinking-2507"

# 1. PREPARE DATASET
prompts = [
    "How are you?",
    "What is the capital of Bangladesh?",
    "Explain large language models."
]
responses = [
    "Ahoy! This old pirate is doin' splendidly, thank ye for askin'.",
    "Shiver me timbers! The grand capital of Bangladesh be the city of Dhaka, a port of many tales.",
    "Blast me eyes! They be like magical parrots that can squawk back anythin' ye teach 'em, but smarter than any bird I've ever seen."
]
data_dict = {"prompt": prompts, "response": responses}
dataset = Dataset.from_dict(data_dict)

# Format dataset
def formatting_prompts_func(example):
    return {
        "text": f"<|im_start|>user\n{example['prompt']}<|im_end|>\n"
                f"<|im_start|>assistant\n{example['response']}<|im_end|>"
    }

dataset = dataset.map(formatting_prompts_func)

# 2. CONFIGURE MODEL & TOKENIZER
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Tokenize before passing to trainer
def tokenize_function(example):
    return tokenizer(
        example["text"],
        truncation=True,
        max_length=1024,
        padding="max_length"
    )

tokenized_dataset = dataset.map(tokenize_function, batched=True)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)

# 3. TRAINER SETUP
training_args = TrainingArguments(
    output_dir="./qwen3-4b-pirate-finetune",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    num_train_epochs=3,
    logging_steps=1,
    fp16=True,
    save_total_limit=2,
    report_to="none",
)

trainer = SFTTrainer(
    model=model,
    train_dataset=tokenized_dataset,
    args=training_args,
)

# 4. TRAIN
print("Starting to fine-tune the model...")
trainer.train()
print("Fine-tuning complete!")

# 5. SAVE ADAPTER
adapter_dir = "./qwen3-pirate-adapter"
model.save_pretrained(adapter_dir)
print(f"Adapter saved successfully to {adapter_dir}")


Map:   0%|          | 0/3 [00:00<?, ? examples/s]

Map:   0%|          | 0/3 [00:00<?, ? examples/s]

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Truncating train dataset:   0%|          | 0/3 [00:00<?, ? examples/s]

Starting to fine-tune the model...


  return fn(*args, **kwargs)


Step,Training Loss
1,8.7605
2,8.5837
3,8.7605


Fine-tuning complete!
Adapter saved successfully to ./qwen3-pirate-adapter



# Extracting Raw Token Embeddings

This script demonstrates a fundamental operation in transformer models: retrieving the initial **token embeddings**. These embeddings are the first numerical representation of text after it has been tokenized. They represent the "raw" meaning of each token, pulled directly from the model's embedding layer, *before* any contextual processing by the main transformer layers.

The process involves loading the base model, tokenizing the input text, and then directly calling the model's input embedding layer to get the corresponding vectors.

---
### 1. Model and Tokenizer Loading ‚öôÔ∏è

First, we load the necessary components from the Hugging Face Hub.
* **`AutoTokenizer`**: This loads the tokenizer, which is responsible for converting the input string `"Hello, how are you?"` into a sequence of numerical token IDs.
* **`AutoModel`**: Note that we are using `AutoModel` here, not `AutoModelForCausalLM`. `AutoModel` loads the base transformer model without the final language modeling head (the part that predicts the next word). This is often used for tasks like feature extraction, as it gives direct access to the model's internal representations.

---
### 2. The Embedding Extraction Process üî¢

This is the core of the script where we perform the embedding lookup.
* **Tokenization**: The `tokenizer` is called on the input text. It converts the string into `input_ids` (the token numbers) and formats it as a PyTorch tensor.
* **`torch.no_grad()`**: We wrap the next step in this context manager because we are only performing inference (i.e., not training). This tells PyTorch that it doesn't need to calculate gradients, which makes the operation faster and more memory-efficient.
* **`model.get_input_embeddings()`**: This is the key method. It retrieves the model's very first layer, which is the **embedding layer**. This layer acts like a massive lookup table or dictionary.
* **`(...) (inputs["input_ids"])`**: We then call this embedding layer as a function, passing our `input_ids` to it. The layer takes each token ID from our input and looks up its corresponding high-dimensional vector. The result, `token_embeddings`, is a tensor containing the vector for each token in our sentence.

---
### 3. Understanding the Output ‚ñ∂Ô∏è

The script prints three pieces of information to help you understand the result:
* **Token IDs**: The list of numbers that represents tokenized input sentence.
* **Token embeddings shape**: This shows the dimensions of the output tensor. A shape like `(1, 6, 3072)` means:
    * `1`: There is **1** sentence in the batch.
    * `6`: The sentence was broken down into **6** tokens.
    * `3072`: Each token is represented by a vector of **3072** numbers. This is the model's "embedding dimension" or "hidden size".
* **Token embeddings**: This is the actual tensor containing the numerical vectors. Each vector is the model's initial, context-free representation for a single token from input.

In [11]:
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "Qwen/Qwen3-4B-Thinking-2507"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

text = "Hello, how are you?"

inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    token_embeddings = model.get_input_embeddings()(inputs["input_ids"])

print("Token IDs:", inputs["input_ids"])
print("Token embeddings shape:", token_embeddings.shape)  # (1, seq_len, embedding_dim)
print("Token embeddings:", token_embeddings)


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Token IDs: tensor([[9707,   11, 1246,  525,  498,   30]])
Token embeddings shape: torch.Size([1, 6, 2560])
Token embeddings: tensor([[[-0.0139, -0.0025, -0.0106,  ..., -0.0042,  0.0121,  0.0023],
         [-0.0444, -0.0300, -0.0027,  ..., -0.0239, -0.0276,  0.0074],
         [-0.0183,  0.0114,  0.0068,  ...,  0.0126, -0.0304, -0.0024],
         [-0.0149,  0.0010, -0.0037,  ..., -0.0312,  0.0159,  0.0030],
         [-0.0187,  0.0027,  0.0073,  ...,  0.0116,  0.0123,  0.0302],
         [-0.0152, -0.0427,  0.0049,  ..., -0.0011, -0.0151,  0.0474]]])
