# Step-by-Step Fine-tuning Qwen 3 on a Custom Dataset With Unsloth and Firecrawl

## Introduction

Qwen 3, released in April 2025 by Alibaba Cloud, is a family of open-source language models ranging from 0.6B to 235B parameters. Available under the Apache 2.0 license, these models have gained widespread adoption with over 300 million downloads and 100,000+ derivative models on platforms like Hugging Face.

Qwen 3 performs exceptionally well in reasoning-intensive tasks such as mathematics, coding, and logical analysis. Even smaller models like Qwen3-4B compete with much larger predecessors, making advanced AI capabilities accessible across various computational environments.

Fine-tuning Qwen 3 for specific domains enhances its performance in specialized areas, improving accuracy and relevance for particular use cases. In this guide, we'll walk through the process of fine-tuning Qwen3-14B on a custom question-answer dataset built from scratch using Firecrawl. We'll optimize the training process with Unsloth to achieve efficient and effective results.

> Find the companion [notebook and script with full code](https://github.com/mendableai/firecrawl-app-examples/tree/main/qwen3-fine-tuning) for this article through our GitHub repository.

## Qwen 3 Architecture and Capabilities

Let's explore the technical architecture and capabilities of Qwen 3 models before diving into the fine-tuning process.

### Technical specifications

Qwen 3 offers both dense and Mixture-of-Experts (MoE) architectures to accommodate different needs:

**Dense Models:**

| Model | Layers | Heads (Key/Value) | Context Length | Parameters |
|-------|--------|-------------------|----------------|------------|
| Qwen3-0.6B | 28 | 16/8 | 32K | 0.6B |
| Qwen3-1.7B | 28 | 16/8 | 32K | 1.7B |
| Qwen3-4B | 36 | 32/8 | 32K | 4B |
| Qwen3-8B | 36 | 32/8 | 128K | 8B |
| Qwen3-14B | 40 | 40/8 | 128K | 14B |
| Qwen3-32B | 64 | 64/8 | 128K | 32B |

The dense models vary in size and capability. Smaller models (Qwen3-0.6B and Qwen3-1.7B) are appropriate for edge devices and applications with latency constraints, with some limitations in reasoning complexity. Mid-range models (Qwen3-4B and Qwen3-8B) provide a balance between performance and resource usage, suitable for many production applications. Larger models (Qwen3-14B and Qwen3-32B) are better equipped for complex reasoning tasks where sophisticated problem-solving is required.

**MoE Models:**

| Model | Layers | Heads (Key/Value) | Experts | Context Length | Parameters (Total/Activated) |
|-------|--------|-------------------|---------|----------------|------------------------------|
| Qwen3-30B-A3B | 48 | 32/4 | 128/8 | 128K | 30B/3B |
| Qwen3-235B-A22B | 94 | 64/4 | 128/8 | 128K | 235B/22B |

MoE models use a different approach where only a subset of parameters (experts) are activated for each input token. This design allows for larger total parameter counts while keeping computational requirements manageable. The Qwen3-30B-A3B model activates approximately 3B parameters during inference from its total 30B parameters, making it suitable for applications requiring advanced reasoning with moderate computational resources. The Qwen3-235B-A22B model activates about 22B of its 235B parameters, providing high performance across various tasks while maintaining reasonable inference costs.

All these models were trained on 36 trillion tokens through a structured three-stage process, followed by additional post-training to enhance reasoning and instruction-following capabilities.

### Chat templating and response formatting

Qwen 3 uses a structured chat template that defines clear roles (user/assistant) and manages conversation flow. A notable feature is its "thinking mode," which allows the model to show its reasoning process before providing an answer. This is especially useful for complex tasks where step-by-step thinking improves answer quality.

Understanding this template structure is essential for our fine-tuning process, as we'll need to format our custom dataset to match Qwen 3's expected input format. The template handles:
- Role assignments in conversations
- Turn management
- Special token placement
- Integration of thinking tags when needed

When we implement our fine-tuning pipeline, we'll need to ensure our question-answer pairs are properly formatted according to these conventions to achieve optimal results.

In this guide, we'll fine-tune Qwen3-14B on a domain-specific question-answer dataset built from scratch using Firecrawl's web scraping capabilities. We'll extract information from the Bullet Echo game wiki and transform it into a structured QA format suitable for training the model to become a specialized game assistant.


## Step 1: Creating a Custom Dataset with Firecrawl

There are [many datasets on Hugging Face](https://huggingface.co/datasets) to fine-tune models like Qwen 3:

![](https://www.firecrawl.dev/images/blog/llama4-fine-tune/datasets.png)

However, most fine-tuning scenarios require building custom datasets, which can be time-consuming. You often need to integrate content from various sources like documents, images, videos, or websites. If your project involves web content, Firecrawl can be particularly helpful. For this tutorial, we'll fine-tune Qwen 3 using an online knowledge base for the Bullet Echo mobile game.

Bullet Echo is a multiplayer game featuring short battle royale matches that take only 2-3 minutes to complete. Advancing through the ranks requires tactical skill, as players control heroes with unique abilities in a top-down 2D shooting environment:

![](https://cdn6.aptoide.com/imgs/0/6/f/06f695064ae703821206bebc8943feac_screen.png)

The game's comprehensive information is documented and maintained by the community on the Bullet Echo Fandom wiki:

![](https://www.firecrawl.dev/images/blog/llama4-fine-tune/wiki.png)

The wiki contains approximately 180 articles, and we'll need to scrape all of them to create a structured dataset in the format shown below:

```json
[
  {
    "id": "c7296197-34a1-4eba-bc54-f2044c01af15",
    "question": "Why might a player choose a tank character like Ramsay or Leviathan in Bullet Echo?",
    "answer": "A player might choose a tank character like Ramsay or Leviathan in Bullet Echo because tanks are designed to be more durable and absorb more damage, helping protect their team and lead pushes during gameplay."
  },
  {
    "id": "b1b148c6-e0d7-491b-a87a-f299824cb934",
    "question": "What are base stats in Bullet Echo and how do they affect gameplay?",
    "answer": "Base stats in Bullet Echo refer to the fundamental characteristics of a hero, such as health, damage, speed, and armor. These stats influence how effectively a hero can survive, move, and engage in combat, making them crucial for gameplay performance."
  },
  ...
]
```

Taken together, the QA pairs should cover the wiki's game knowledge thoroughly enough to teach it to an LLM. Let's see how to go from an initial catalogue of articles to a JSON dataset step by step.

### Using Firecrawl for targeted web scraping

The first step is to extract all article links from the wiki. Firecrawl's schema-based extraction makes this straightforward:

```python
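from typing import List

from firecrawl import FirecrawlApp
from pydantic import BaseModel

app = FirecrawlApp()  # Reads FIRECRAWL_API_KEY from the environment

# Define data models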
class Article(BaseModel):
    url: str
    title: str

class ArticleList(BaseModel):
    articles: List[Article]

# Scrape the wiki pages list
result = app.scrape_url(
    "https://bullet-echo.fandom.com/wiki/Special:AllPages",
    params={
        "formats": ["extract"],
        "extract": {"schema": ArticleList.model_json_schema()}
    }
)
```

Instead of writing brittle CSS selectors or XPath expressions, Firecrawl uses AI to understand what data you want based on your schema definition. This approach tends to keep working when websites change their structure, because the AI focuses on content semantics rather than DOM specifics.

### Converting the Bullet Echo wiki into structured markdown

After collecting all article URLs, we use Firecrawl's batch scraping to convert HTML content to clean markdown:

```python
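import time

# Start an asynchronous batch scrape job for all article URLs.
# Method names follow the v1 firecrawl-py SDK; check them against your installed version.
batch_job = app.async_batch_scrape_urls(
    [article.url for article in articles],
    params={"formats": ["markdown"], "onlyMainContent": True}
)

# Poll until the job finishes
while True:
    status = app.check_batch_scrape_status(batch_job["id"])
    if status["status"] == "completed":
        break
    time.sleep(10)  # Wait and check again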
```

The `onlyMainContent` parameter is particularly valuable, as it automatically filters out navigation elements, ads, footers, and other non-essential content. This results in clean documentation focused only on the game knowledge we need.
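
A bare `while True` polling loop never exits if the job fails or stalls. The sketch below shows one way to wrap polling with a timeout and a failure check. It is a minimal illustration, not part of the Firecrawl SDK: `wait_for_batch` is a hypothetical helper, and the `"failed"` status string is an assumption you should verify against the responses your SDK version returns.

```python
import time
from typing import Any, Callable, Dict


def wait_for_batch(
    check_status: Callable[[], Dict[str, Any]],
    interval: float = 10.0,
    timeout: float = 1800.0,
) -> Dict[str, Any]:
    """Poll check_status() until the job completes, fails, or times out."""
    deadline = time.monotonic() + timeout
    while True:
        status = check_status()
        state = status.get("status")
        if state == "completed":
            return status
        if state == "failed":  # Assumed status string; verify for your SDK
            raise RuntimeError(f"Batch scrape failed: {status}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Batch scrape still '{state}' after {timeout} seconds")
        time.sleep(interval)


# Usage with a stand-in status function, so no network access is needed
_fake_states = iter(["scraping", "scraping", "completed"])
result = wait_for_batch(lambda: {"status": next(_fake_states)}, interval=0.0)
print(result)
```

In the real pipeline, the callable would wrap whatever status-check method your SDK exposes for the batch job.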

### Processing raw content into LLMs.txt format

To prepare for QA pair generation, we need to chunk the documents into manageable pieces:

```python
from pathlib import Path

# Chunk processing function
def chunk_document(doc_path, chunk_size=800, overlap=200):
    text = Path(doc_path).read_text()
    chunks = []
    
    tokens = text.split()
    for i in range(0, len(tokens), chunk_size - overlap):
        chunk = " ".join(tokens[i:i + chunk_size])
        if len(chunk) > 100:  # Only keep substantive chunks
            chunks.append(chunk)
    
    return chunks

# Apply chunking to all markdown files
all_chunks = []
for file in markdown_files:
    chunks = chunk_document(file)
    for chunk in chunks:
        doc_info = {"source": file.stem, "content": chunk}
        all_chunks.append(doc_info)
```

This chunking process breaks down longer articles into overlapping segments, maintaining context while creating pieces that are the right size for our QA generation step.
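
To see how the overlap behaves, here is a small in-memory variant of the same word-based chunking. It is a sketch for illustration only: `chunk_words` is a hypothetical name, it works on a string rather than a file, and it skips the minimum-length filter so that every chunk is visible.

```python
from typing import List


def chunk_words(text: str, chunk_size: int = 800, overlap: int = 200) -> List[str]:
    """Split text into word-based chunks; consecutive chunks share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    tokens = text.split()
    stride = chunk_size - overlap  # Each chunk starts this many words after the last
    return [" ".join(tokens[i:i + chunk_size]) for i in range(0, len(tokens), stride)]


# A synthetic 2,000-word document: w0 w1 ... w1999
document = " ".join(f"w{i}" for i in range(2000))
chunks = chunk_words(document)
starts = [chunk.split()[0] for chunk in chunks]
print(len(chunks), starts)  # 4 ['w0', 'w600', 'w1200', 'w1800']

# The last 200 words of one chunk are the first 200 words of the next
print(chunks[0].split()[-200:] == chunks[1].split()[:200])  # True
```

With these defaults the stride is 600 words, so the final chunk here (starting at `w1800`) is fully covered by the previous one. Dropping such trailing chunks is an optional refinement if you want to avoid near-duplicate QA pairs.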

### Generating high-quality QA pairs from web content

The final step uses a large language model to generate question-answer pairs from our structured content:

```python
import json

from openai import OpenAI

client = OpenAI()  # Reads OPENAI_API_KEY from the environment

def generate_qa_pairs(chunk, num_pairs=3):
    prompt = f"""Generate {num_pairs} high-quality question-answer pairs based on this text:
    
    {chunk['content']}
    
    The questions should be diverse and cover different aspects of the information.
    Return a JSON object with a single key 'pairs' whose value is a list of objects,
    each with 'question' and 'answer' fields."""
    
    response = client.chat.completions.create(
        model="gpt-4o",  # JSON mode requires a model that supports response_format
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"}
    )
    
    # The prompt asks for a top-level 'pairs' key, so read the list from there
    return json.loads(response.choices[0].message.content)["pairs"]
```

By applying this function to our chunked content, we generate approximately 3,000 question-answer pairs covering all aspects of the Bullet Echo game. The resulting dataset is then saved to JSONL format and uploaded to Hugging Face for fine-tuning.
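
The saving step can be sketched as follows. This is an assumed implementation rather than the companion repository's code: `clean_pairs` and `write_jsonl` are hypothetical helpers that drop incomplete pairs, attach a UUID as the `id` field shown in the dataset format earlier, and write one JSON object per line.

```python
import io
import json
import uuid
from typing import Dict, Iterable, List, TextIO


def clean_pairs(raw_pairs: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep pairs with a non-empty question and answer, and give each a UUID."""
    cleaned = []
    for pair in raw_pairs:
        question = str(pair.get("question") or "").strip()
        answer = str(pair.get("answer") or "").strip()
        if question and answer:
            cleaned.append(
                {"id": str(uuid.uuid4()), "question": question, "answer": answer}
            )
    return cleaned


def write_jsonl(records: Iterable[Dict[str, str]], handle: TextIO) -> int:
    """Write one JSON object per line and return the number of records written."""
    count = 0
    for record in records:
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")
        count += 1
    return count


# Usage with an in-memory buffer; swap in open("qa.jsonl", "w") for a real file
raw = [
    {"question": " What is Bullet Echo? ", "answer": "A top-down multiplayer shooter."},
    {"question": "", "answer": "Dropped because the question is empty."},
]
buffer = io.StringIO()
written = write_jsonl(clean_pairs(raw), buffer)
print(written)  # 1
```

Filtering before writing matters because LLM-generated output occasionally contains empty or malformed entries, and a single bad row can break dataset loading later.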

The complete dataset creation pipeline demonstrates how Firecrawl's natural language-based extraction simplifies complex web scraping tasks. Instead of dealing with HTML parsing headaches, developers can focus on describing the data they need, allowing Firecrawl's AI to handle the technical details.

For the complete implementation details, check out the [full Llama 4 fine-tuning article](https://www.firecrawl.dev/blog/fine-tuning-llama4-custom-dataset-firecrawl) and the [associated GitHub repository](https://github.com/mendableai/firecrawl-app-examples/tree/main/llama4-fine-tuning).

Also, don't forget to [sign up for Firecrawl](https://firecrawl.dev) and [go through the quickstart documentation](https://docs.firecrawl.dev/introduction) before you dive in.


## Step 2: Environment Setup for Fine-tuning

After creating our dataset, the next crucial step is setting up the right environment for fine-tuning. This involves selecting appropriate hardware and configuring the necessary software components to efficiently train our model.

### Hardware considerations

Fine-tuning a 14B parameter model like Qwen3-14B requires significant GPU resources. For our setup, a single NVIDIA L40 GPU with 48GB VRAM is sufficient when using QLoRA (LoRA adapters on top of a 4-bit quantized base model), costing approximately $0.90 per hour on RunPod.

![QLoRA VRAM Calculator showing memory requirements for Qwen3-14B fine-tuning](images/vram-calculator.png)

The image above shows the VRAM calculator from [apxml.com/tools/vram-calculator](https://apxml.com/tools/vram-calculator) demonstrating that fine-tuning Qwen3-14B with 4-bit quantization, LoRA rank 16, batch size 4, and sequence length of 8,192 tokens requires approximately 24% (11.59 GB) of a 48GB GPU. This tool is invaluable for planning your hardware requirements before starting fine-tuning.

The memory breakdown includes:
- Base model weights (4-bit): ~7.0 GB
- LoRA adapters: ~0.08 GB
- Activations: ~1.68 GB
- QLoRA buffers: ~1.78 GB
- Framework overhead: ~1.07 GB
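
As a quick sanity check, the components listed above can be summed and compared with the GPU's capacity. The figures are the rounded values from the list, so the total differs slightly from the calculator's 11.59 GB.

```python
# Rounded component estimates (in GB) taken from the breakdown above
breakdown_gb = {
    "base model weights (4-bit)": 7.0,
    "LoRA adapters": 0.08,
    "activations": 1.68,
    "QLoRA buffers": 1.78,
    "framework overhead": 1.07,
}

gpu_gb = 48
total_gb = sum(breakdown_gb.values())
share = 100 * total_gb / gpu_gb
headroom_gb = gpu_gb - total_gb

print(f"Estimated total: {total_gb:.2f} GB ({share:.0f}% of {gpu_gb} GB)")
print(f"Headroom: {headroom_gb:.2f} GB")
```

The large headroom is what leaves room for bigger batches or longer sequences, although real usage will vary with the actual sequence lengths in your data.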

### Fine-tuning approaches: Full, LoRA, and QLoRA

When fine-tuning large language models like Qwen3, we have several approaches to consider:

- **Full fine-tuning**: Updates all model parameters, providing maximum adaptation but requiring enormous computational resources and memory. For a 14B parameter model, this approach is impractical without access to multiple high-end GPUs.
- **LoRA (Low-Rank Adaptation)**: Adds small trainable rank decomposition matrices to existing weights while keeping original parameters frozen. This reduces trainable parameters to less than 1% of the original model.
- **QLoRA (Quantized LoRA)**: Combines LoRA with 4-bit quantization of the base model weights, dramatically reducing memory requirements while maintaining performance.

For our Bullet Echo game assistant, QLoRA is a good fit: it enables fine-tuning Qwen3-14B on a single GPU while preserving most of the model's capabilities. Given our dataset's narrow domain, this approach offers a practical balance of efficiency and performance.

### Setting up RunPod for fine-tuning

RunPod offers on-demand GPU resources that are perfect for fine-tuning large language models. Here's how to set up a RunPod instance for our Qwen 3 fine-tuning:

1. **Create a RunPod account**: Sign up at [runpod.io](https://runpod.io/) if you don't already have an account.

2. **Select a GPU**: From the RunPod dashboard, click "Deploy" and choose a single L40 GPU (48GB). This is sufficient for our fine-tuning needs thanks to QLoRA optimization.

3. **Choose a Docker template**: Select a PyTorch template with version 2.7 or higher and CUDA support.

4. **Configure storage**: 
   - Set container disk to at least 40GB
   - Add a volume of 40-50GB for storing the model and dataset

5. **Deploy the pod**: Once configured, click "Deploy" to launch your GPU instance.

6. **Connect to your pod**: After deployment (usually takes 1-2 minutes), click "Connect" and select "JupyterLab" to access the development environment.

7. **Create a new notebook**: In JupyterLab, click the "+" button in the file browser and select "Python 3" to create a new notebook for our fine-tuning code.

### Required libraries and dependencies

Once your JupyterLab environment is running, install the necessary libraries in the first cell of your notebook:

```python
!pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl==0.15.2 triton cut_cross_entropy unsloth_zoo
!pip install sentencepiece protobuf datasets huggingface_hub hf_transfer
!pip install --no-deps unsloth
!pip install regex transformers rich
```

The core components of our setup include:
- **unsloth**: Optimized kernels and model patches; the project advertises roughly 2x faster fine-tuning with lower memory use
- **transformers**: Hugging Face's library for working with pre-trained models
- **peft**: Implements Parameter-Efficient Fine-Tuning techniques like LoRA
- **bitsandbytes**: Enables 4-bit quantization for memory efficiency
- **trl**: Training and fine-tuning framework with SFTTrainer

After installation, import the necessary modules:

```python
import unsloth
import torch
from unsloth import FastModel
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig
from transformers import TextStreamer, GenerationConfig
import re
```

With this environment setup complete, we're ready to move on to loading and preparing our Qwen 3 model for fine-tuning.

## Step 3: Loading and Preparing the Qwen 3 Model

With our environment set up, we can now proceed to load and prepare the Qwen 3 model for fine-tuning. This step involves initializing the model with appropriate quantization settings, configuring the tokenizer, and implementing memory optimization techniques.

### Model initialization with quantization

The first step is loading the pre-trained Qwen3-14B model with quantization to reduce memory usage. We use Unsloth's `FastModel` class, which provides optimized loading and fine-tuning capabilities:

```python
print("Loading Qwen3 model and tokenizer...")
model, tokenizer = FastModel.from_pretrained(
    model_name="unsloth/Qwen3-14B",
    max_seq_length=2048,  # Choose any value for context length
    load_in_4bit=True,    # 4-bit quantization to reduce memory
    full_finetuning=False,
)
```

This code loads both the model and tokenizer in a single step. Let's examine each parameter:

- `model_name`: Specifies which model to load. Here we're using Unsloth's optimized version of Qwen3-14B.
- `max_seq_length`: Sets the maximum sequence length the model can handle. For most fine-tuning tasks, 2048 tokens is sufficient, but you can adjust this based on your specific needs.
- `load_in_4bit`: Enables 4-bit quantization, which significantly reduces memory usage while maintaining most of the model's performance.
- `full_finetuning`: Set to `False` since we're using LoRA/QLoRA instead of full fine-tuning.

When adapting this for your own projects, consider:
- For longer text inputs, increase `max_seq_length` (up to the model's context limit)
- For smaller GPUs, keep `load_in_4bit=True`
- For tasks that are sensitive to quantization error, consider `load_in_8bit=True` instead (uses more memory)

### Tokenizer setup

Unsloth's `from_pretrained` method automatically loads the appropriate tokenizer for the model. For Qwen 3, this tokenizer handles several special aspects:

1. **Chat formatting**: Qwen 3's tokenizer incorporates chat templates that properly format conversations with user/assistant roles.
2. **Special tokens**: Manages model-specific tokens such as the `<|im_start|>` and `<|im_end|>` turn delimiters, the end-of-sequence (EOS) token, and the padding token.
3. **Thinking tags**: Supports Qwen 3's special `<think>` tags for internal reasoning processes.

The tokenizer is critical for both training and inference. During training, it will convert your text data into token IDs that the model can process. During inference, it handles both the input formatting and output decoding.

### Memory optimization techniques

Efficient memory usage is crucial when fine-tuning large models like Qwen3-14B. The approach we're using incorporates several optimization techniques:

1. **4-bit quantization**: Reduces model weight precision from 16-bit (or 32-bit) to 4-bit, shrinking the memory needed for the weights by roughly 4x (or 8x) with minimal performance impact.

2. **Sequential loading**: Unsloth loads model weights in sequence rather than all at once, reducing peak memory usage during initialization.

3. **Gradient checkpointing**: Will be enabled when we configure PEFT, reducing memory requirements by recomputing some activations during backpropagation rather than storing them.

4. **Parameter freezing**: With LoRA, we keep most of the model's parameters frozen, greatly reducing memory needed for storing gradients.

These optimizations allow us to fit a 14B parameter model into a single GPU's memory while maintaining most of its capabilities. When working with different model sizes, you may need to adjust these settings:

- For smaller models (below 7B parameters), 8-bit quantization might offer a better precision/memory trade-off
- For larger models (above 20B parameters), you might need multiple GPUs or more aggressive optimizations
- For edge cases with very limited memory, consider smaller models like Qwen3-4B instead
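
To make the quantization savings concrete, the following back-of-the-envelope calculation estimates weight memory at different precisions. It assumes a nominal 14 billion parameters and ignores quantization metadata, activations, and optimizer state, so treat the numbers as rough lower bounds.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 14e9  # Nominal size of Qwen3-14B; the exact count differs slightly

estimates = {bits: weight_memory_gb(NUM_PARAMS, bits) for bits in (32, 16, 8, 4)}
for bits, gigabytes in estimates.items():
    print(f"{bits:>2}-bit weights: ~{gigabytes:.0f} GB")
```

The 4-bit estimate of about 7 GB lines up with the base-model figure in the VRAM breakdown from Step 2, while 16-bit weights alone would take about 28 GB.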

With the model and tokenizer loaded and optimized, we're ready to move on to preparing our dataset for fine-tuning. The next step will involve formatting the data according to Qwen 3's specific requirements and processing it for efficient training.


## Step 4: Dataset Preparation

Now that we have our model loaded and optimized, we need to prepare our dataset for fine-tuning. As mentioned in Step 1, we've already created and uploaded a Bullet Echo game QA dataset to Hugging Face using Firecrawl. The complete process for creating this dataset is detailed in the [Llama 4 fine-tuning article](https://www.firecrawl.dev/blog/fine-tuning-llama4-custom-dataset-firecrawl), but now we'll focus on loading and preparing this dataset specifically for Qwen 3 fine-tuning.

### Loading the Bullet Echo QA dataset

We start by loading our pre-existing dataset from Hugging Face and splitting it into training and validation sets:

```python
print("Loading Bullet Echo Wiki QA dataset...")
dataset_name = "bexgboost/bullet-echo-wiki-qa"
full_dataset = load_dataset(dataset_name, trust_remote_code=True)

# Split dataset into training and validation sets (90% train, 10% validation)
train_val_split = full_dataset["train"].train_test_split(
    test_size=0.1, seed=42, shuffle=True
)
train_dataset = train_val_split["train"]
val_dataset = train_val_split["test"]  # This becomes our validation set

print(
    f"Training examples: {len(train_dataset)}, Validation examples: {len(val_dataset)}"
)
```

We use a 90/10 split for training and validation data, which is a common practice for fine-tuning. The validation set will help us monitor the model's performance on unseen data during training.

When adapting this to your own projects:
- Use a dataset that matches your target application domain
- Consider the size of your validation set based on your dataset size (5-20% is typical)
- Ensure your dataset is properly formatted with high-quality question-answer pairs
- If your dataset is very large, you might consider using a smaller subset for faster iterations

### Formatting with Qwen 3's chat template

One critical aspect of fine-tuning Qwen 3 is properly formatting the data using its chat template. Different models use different formats, and Qwen 3 has specific requirements for how conversations should be structured:

```python
print("Formatting datasets with Qwen3 chat template...")
EOS_TOKEN = tokenizer.eos_token  # Must add EOS_TOKEN

def format_data(example):
    # Qwen3 uses a chat template, so we'll format it accordingly
    messages = [
        {"role": "user", "content": example["question"]},
        {"role": "assistant", "content": example["answer"] + EOS_TOKEN},
    ]
    # The tokenizer.apply_chat_template handles special tokens for Qwen3
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}

# Format both training and validation datasets
formatted_train_dataset = train_dataset.map(format_data)
formatted_val_dataset = val_dataset.map(format_data)
```

This formatting function:
1. Takes each example from our dataset
2. Structures it as a conversation with user (question) and assistant (answer) roles
3. Adds the end-of-sequence token to the assistant's response
4. Applies Qwen 3's chat template to format it correctly

The `tokenizer.apply_chat_template()` method handles all the model-specific formatting, including adding special tokens and structuring the conversation properly. For your own projects, you'll need to adapt this function based on:
- The structure of your dataset (it might not be in QA format)
- The specific conversation structure you want the model to learn
- Any additional context or system prompts you want to include
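
To build intuition for what the chat template produces, here is a hand-written approximation of the ChatML-style layout that Qwen models use. It is an illustrative sketch only: `to_chatml` is a hypothetical helper, and the real template may add further elements such as thinking tags, so always use `tokenizer.apply_chat_template()` for actual training data.

```python
from typing import Dict, List


def to_chatml(messages: List[Dict[str, str]]) -> str:
    """Approximate a ChatML rendering: each turn is wrapped in role delimiters."""
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    return "".join(parts)


example = [
    {"role": "user", "content": "What are base stats in Bullet Echo?"},
    {"role": "assistant", "content": "Health, damage, speed, and armor."},
]
rendered = to_chatml(example)
print(rendered)
```

Since every turn is already closed by `<|im_end|>`, it is worth printing one formatted training example from your own pipeline and checking that the end-of-turn marker appears as many times as you expect.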

### Tokenization and processing

Next, we tokenize our formatted datasets to convert the text into token IDs that the model can process:

```python
print("Tokenizing datasets...")

def tokenize_function(examples):
    # padding=False because SFTTrainer will handle padding
    return tokenizer(
        examples["text"],
        padding=False,
        truncation=True,
        max_length=2048,  # Match the max_seq_length used when loading the model
    )

# Process both datasets
processed_train_dataset = formatted_train_dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=["id", "question", "answer", "text"],
    desc="Tokenizing training dataset",
)

processed_val_dataset = formatted_val_dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=["id", "question", "answer", "text"],
    desc="Tokenizing validation dataset",
)
```

This tokenization process:
1. Converts text to token IDs using the model's vocabulary
2. Applies truncation if sequences exceed the maximum length
3. Removes the original columns since we only need the tokenized input for training
4. Processes data in batches for efficiency

For your own projects, consider:
- Adjusting the `max_length` parameter based on your sequence lengths
- Setting appropriate `remove_columns` based on your dataset structure
- Using `batched=True` for faster processing on large datasets
- Setting an appropriate `num_proc` parameter if working with very large datasets to use parallel processing

### Train-validation splitting

We've already implemented the train-validation split earlier in our code, but it's worth noting that validation is crucial for monitoring training progress and preventing overfitting. The validation set provides an unbiased evaluation of the model during training and helps determine when to stop training or which checkpoint to select as the final model.

For fine-tuning projects with different needs:
- If your dataset is very small, consider using cross-validation instead of a simple split
- If you have a multi-class dataset, ensure your validation set has a representative distribution of all classes
- For production models, consider adding a separate test set that's never used during training

> To learn more about preparing datasets for fine-tuning in Unsloth, read [the official guide from the documentation](https://docs.unsloth.ai/basics/datasets-guide).

With our dataset now properly loaded, formatted, tokenized, and split, we're ready to move on to configuring the Parameter-Efficient Fine-Tuning (PEFT) approach with LoRA in the next step.


## Step 5: Configuring PEFT with LoRA

Now that our dataset is prepared, we need to configure Parameter-Efficient Fine-Tuning (PEFT) using LoRA. This approach allows us to fine-tune a large model like Qwen3-14B with minimal memory requirements by adding small, trainable adapters to key layers while keeping most of the original model frozen.

### Setting up LoRA parameters

We use Unsloth's `get_peft_model` function to configure our model for PEFT:

```python
print("Setting up PEFT model with LoRA...")
model = FastModel.get_peft_model(
    model,
    r=8,                      # LoRA rank
    target_modules=[          # Which layers to apply LoRA to
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
    finetune_vision_layers=False,   # Turn off for just text!
    finetune_language_layers=True,
    finetune_attention_modules=True,
    finetune_mlp_modules=True,
    lora_alpha=8,             # LoRA scaling factor
    lora_dropout=0,           # Dropout probability for LoRA layers
    bias="none",              # Whether to train bias parameters
    use_gradient_checkpointing="unsloth",  # Memory optimization
    random_state=1000,        # For reproducibility
    use_rslora=False,         # Regular LoRA vs Rank-Stabilized LoRA
)
```

Let's break down these parameters in detail:

- **r (rank)**: Determines the size of the LoRA matrices. Higher values give more learning capacity but require more memory. For our Bullet Echo task, a rank of 8 provides a good balance between capacity and efficiency.

- **target_modules**: Specifies which layers in the model receive LoRA adapters. We target attention layers (`q_proj`, `k_proj`, `v_proj`, `o_proj`) and MLP components (`gate_proj`, `up_proj`, `down_proj`). This comprehensive coverage ensures effective learning across all key model components.

- **finetune_*_layers**: Controls which types of layers to fine-tune. We enable adaptation for language and core transformer components while disabling vision layers since our task is purely text-based.

- **lora_alpha**: Scaling factor for LoRA updates. The adapter output is multiplied by `lora_alpha / r`, so this controls how much the adapters contribute to the final output. It is commonly set equal to the rank (as here) or to twice the rank.

- **lora_dropout**: Regularization for LoRA layers to prevent overfitting. We use 0 for this task, but values between 0.05-0.2 can help with larger datasets or longer training runs.

- **use_gradient_checkpointing**: Memory optimization technique that trades computation for memory by recomputing some values during backpropagation rather than storing them. The "unsloth" setting uses Unsloth's optimized implementation.
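
The roles of `r` and `lora_alpha` are easier to see in a toy forward pass. The sketch below uses plain Python lists and tiny made-up dimensions, and it follows the standard LoRA formulation (frozen weight `W` plus a scaled low-rank update `B·A`). It is not Unsloth's or PEFT's implementation.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, lora_alpha, r):
    """Compute y = W x + (lora_alpha / r) * B (A x); only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = lora_alpha / r
    return [b + scale * u for b, u in zip(base, update)]


random.seed(0)
d_in, d_out, r = 6, 4, 2
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # Frozen
A = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(r)]      # r x d_in
B_zero = [[0.0] * r for _ in range(d_out)]                             # d_out x r
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

# B starts at zero, so the adapted layer initially matches the base layer
base = matvec(W, x)
print(lora_forward(W, A, B_zero, x, lora_alpha=2, r=r) == base)  # True

# With a non-zero B, doubling lora_alpha doubles the adapter's contribution
B_ones = [[1.0] * r for _ in range(d_out)]
y1 = lora_forward(W, A, B_ones, x, lora_alpha=2, r=r)
y2 = lora_forward(W, A, B_ones, x, lora_alpha=4, r=r)
delta1 = [a - b for a, b in zip(y1, base)]
delta2 = [a - b for a, b in zip(y2, base)]
```

With `r=8` and `lora_alpha=8` as configured above, the scale factor is 1.0, so the adapter update is added to the frozen layer's output unscaled.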

### Adapting LoRA configuration for different tasks

When adapting this configuration for your own projects, consider these guidelines:

**For smaller datasets (1,000-5,000 examples):**
- Use a lower rank (4-8) to prevent overfitting
- Consider adding lora_dropout (0.05-0.1)
- Target fewer modules if very limited data is available

**For larger datasets (10,000+ examples):**
- Increase rank (16-32) for more learning capacity
- Target more modules for comprehensive adaptation
- Potentially use longer training with more epochs

**For specialized domains (technical, medical, legal):**
- Higher rank may help capture domain-specific knowledge
- More comprehensive module targeting is beneficial
- Consider longer training to develop domain expertise

**For memory-constrained environments:**
- Lower rank (4) reduces adapter memory requirements
- Reduce batch size and gradient accumulation steps
- Ensure gradient checkpointing is enabled

### Understanding LoRA's efficiency benefits

LoRA dramatically reduces the number of trainable parameters compared to full fine-tuning. For our Qwen3-14B model with rank 8, we're training only about 0.23% of the model's parameters:

```text
Trainable parameters = 32,112,640/14,000,000,000 (0.23% trained)
```

This approach offers several advantages:

1. **Memory efficiency**: Fine-tuning fits on a single GPU instead of requiring multiple high-end GPUs
2. **Training speed**: Fewer parameters mean faster training iterations
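3. **Reduced forgetting**: Because the base weights stay frozen, there is less risk of overfitting and of catastrophic forgetting of the base model's capabilities
4. **Storage efficiency**: The trained adapters take tens of MB (about 64 MB for these 32 million parameters at 16-bit precision) compared to tens of GB for a full model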
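
The trainable-parameter count quoted above can be reproduced by hand. Each adapted linear layer with `d_in` inputs and `d_out` outputs adds `r * (d_in + d_out)` parameters. The layer dimensions below are assumptions based on the published Qwen3-14B configuration (hidden size 5120, MLP size 17408, 40 layers, 40 query heads and 8 key/value heads of dimension 128); verify them against `model.config` before relying on them.

```python
# Assumed Qwen3-14B dimensions; check against model.config
HIDDEN = 5120
INTERMEDIATE = 17408
NUM_LAYERS = 40
NUM_HEADS = 40
NUM_KV_HEADS = 8
HEAD_DIM = 128
RANK = 8

# (d_in, d_out) of every linear layer that receives a LoRA adapter
shapes = {
    "q_proj": (HIDDEN, NUM_HEADS * HEAD_DIM),
    "k_proj": (HIDDEN, NUM_KV_HEADS * HEAD_DIM),
    "v_proj": (HIDDEN, NUM_KV_HEADS * HEAD_DIM),
    "o_proj": (NUM_HEADS * HEAD_DIM, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

# Each adapter has A (RANK x d_in) and B (d_out x RANK)
per_layer = sum(RANK * (d_in + d_out) for d_in, d_out in shapes.values())
trainable = per_layer * NUM_LAYERS
percent = 100 * trainable / 14e9
adapter_mb_16bit = trainable * 2 / 1e6

print(f"Trainable parameters: {trainable:,}")  # 32,112,640
print(f"Share of a nominal 14B model: {percent:.2f}%")
print(f"Adapter size at 16-bit: ~{adapter_mb_16bit:.0f} MB")
```

Because the count scales linearly with `RANK`, doubling the rank to 16 would double both the trainable parameters and the adapter file size.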

With our PEFT configuration complete, we're ready to set up the training process with appropriate evaluation strategies and hyperparameters in the next step.


## Step 6: Setting Up the Training Process

After configuring our model with LoRA, the next step is to set up the training process. This involves configuring the SFTTrainer with appropriate hyperparameters, evaluation strategies, and optimization settings to ensure efficient and effective fine-tuning.

### SFTTrainer configuration

We use the `SFTTrainer` from the TRL (Transformer Reinforcement Learning) library, which streamlines the fine-tuning process for language models. Here's how we configure it:

```python
print("Configuring SFTTrainer with evaluation...")
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=processed_train_dataset,
    eval_dataset=processed_val_dataset,  # Add validation dataset
    args=SFTConfig(
        dataset_text_field="text",
        per_device_train_batch_size=2,
        per_device_eval_batch_size=2,  # Batch size for evaluation
        gradient_accumulation_steps=4,
        warmup_steps=5,
        num_train_epochs=3,
        # max_steps=100,  # For quick testing
        learning_rate=2e-4,
        logging_steps=200,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
        eval_strategy="steps",
        eval_steps=200,  # Evaluate every 200 steps
        save_strategy="steps",  # Save checkpoints based on evaluation
        save_steps=200,  # Save every 200 steps
        load_best_model_at_end=True,  # Load best model at the end of training
        metric_for_best_model="eval_loss",  # Use evaluation loss to determine best model
        greater_is_better=False,  # Lower loss is better
        save_total_limit=3,  # Keep at most 3 checkpoints on disk (the best one is always retained)
    ),
)
```

Let's examine key parameters and their significance:

- **Batch sizes**: `per_device_train_batch_size=2` and `per_device_eval_batch_size=2` specify the number of examples processed together. Small batch sizes help manage memory usage with large models.

- **Gradient accumulation**: With `gradient_accumulation_steps=4`, we accumulate gradients across multiple batches before updating the model, effectively increasing the batch size to 8 (2 × 4) without increasing memory usage.

- **Training duration**: `num_train_epochs=3` sets the number of complete passes through the training dataset, which is typically sufficient for LoRA fine-tuning.

- **Learning rate**: `learning_rate=2e-4` sets how quickly the model parameters are updated. This is roughly an order of magnitude higher than the rates commonly used for full fine-tuning (often around 1e-5 to 5e-5), which tends to work well with LoRA because only the small adapter matrices are updated.

- **Evaluation strategy**: `eval_strategy="steps"` and `eval_steps=200` configure the trainer to evaluate the model every 200 training steps on the validation dataset.

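- **Checkpoint saving**: `save_strategy="steps"`, `save_steps=200`, and `save_total_limit=3` save a checkpoint every 200 steps while keeping at most three on disk: the most recent ones plus the best one, which is always retained when `load_best_model_at_end=True`. Saving on the same schedule as evaluation is what lets the trainer match each checkpoint to a validation score.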

- **Model selection**: `load_best_model_at_end=True` and `metric_for_best_model="eval_loss"` configure the trainer to automatically select the checkpoint with the lowest validation loss as the final model.
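
To make the interaction between `warmup_steps`, `learning_rate`, and `lr_scheduler_type="linear"` concrete, here is a small sketch of a linear warmup followed by a linear decay to zero. It reproduces the usual shape of this schedule rather than calling the trainer itself. `linear_lr` is a hypothetical helper name, and `total_steps=1017` is the step count reported in this guide's training log.

```python
def linear_lr(step: int, peak_lr: float = 2e-4, warmup_steps: int = 5,
              total_steps: int = 1017) -> float:
    """Learning rate at an optimizer step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 5, 200, 600, 1017):
    print(f"step {step:>4}: lr = {linear_lr(step):.2e}")
```

With only 5 warmup steps, the learning rate reaches its peak almost immediately and then declines for the rest of the run.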

### Adapting training configuration for your projects

When adapting this configuration to your own fine-tuning projects, consider these adjustments:

**For larger datasets:**
- Increase `logging_steps` and `eval_steps` to evaluate less frequently 
- Consider decreasing `num_train_epochs` to prevent overfitting
- You might need to adjust `learning_rate` downward for more stable training

**For smaller datasets:**
- Decrease `eval_steps` to get more frequent validation results
- Consider increasing `num_train_epochs` to allow more learning
- Use a higher `weight_decay` value to prevent overfitting

**For memory-constrained environments:**
- Decrease `per_device_train_batch_size` (possibly to 1)
- Increase `gradient_accumulation_steps` to maintain effective batch size
- Consider mixed-precision training (`bf16=True` on GPUs that support it, otherwise `fp16=True`) and gradient checkpointing for additional memory savings

**For production-quality models:**
- Implement early stopping with an `EarlyStoppingCallback` and its `early_stopping_patience` argument
- Use cross-validation instead of a single validation split
- Consider a more sophisticated learning rate scheduler like `cosine`
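
In the Hugging Face ecosystem, early stopping is usually added through `EarlyStoppingCallback`, whose `early_stopping_patience` argument counts how many consecutive evaluations may pass without improvement. The pure-Python sketch below only simulates that patience rule so the behavior is easy to see. `stopping_step` is a hypothetical helper, and the loss values are the validation losses from the training log shown in Step 7.

```python
def stopping_step(eval_history, patience):
    """Return the step at which patience-based early stopping would halt, or None."""
    best = float("inf")
    bad_evals = 0
    for step, loss in eval_history:
        if loss < best:
            best, bad_evals = loss, 0  # improvement: reset the counter
        else:
            bad_evals += 1  # no improvement at this evaluation
            if bad_evals >= patience:
                return step
    return None


# (step, validation loss) pairs from this guide's training log
eval_history = [(200, 1.287612), (400, 1.214332), (600, 1.182069),
                (800, 1.207953), (1000, 1.197931)]

for patience in (1, 2, 3):
    print(f"patience={patience}: stop at step {stopping_step(eval_history, patience)}")
```

A small patience stops soon after the validation minimum, while a larger one tolerates short fluctuations at the cost of extra training time.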

The effective batch size in our configuration is 8 (2 × 4), which balances learning stability and memory usage. For different tasks, you might need to adjust this based on your dataset size and available GPU memory.
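
The trade-off between per-device batch size and gradient accumulation is simple arithmetic. As a sketch, the hypothetical helper below picks the accumulation steps needed to keep a target effective batch size when the per-device batch size has to shrink:

```python
def accumulation_steps(target_effective: int, per_device: int, num_gpus: int = 1) -> int:
    """Gradient accumulation steps needed to reach the target effective batch size."""
    per_step = per_device * num_gpus
    if target_effective % per_step != 0:
        raise ValueError("target must be a multiple of per_device * num_gpus")
    return target_effective // per_step


steps_default = accumulation_steps(8, per_device=2)  # this guide's configuration
steps_low_mem = accumulation_steps(8, per_device=1)  # memory-constrained variant

print(f"per_device=2 -> gradient_accumulation_steps={steps_default}")
print(f"per_device=1 -> gradient_accumulation_steps={steps_low_mem}")
```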

With our training process properly configured, we're ready to begin the actual fine-tuning of our Qwen 3 model in the next step.


## Step 7: Training the Model

With our model and training process configured, we're now ready to begin the fine-tuning. This step involves executing the training loop, monitoring the metrics, and interpreting the results to ensure our model is learning effectively.

### Executing the training loop

Starting the training process is straightforward with the SFTTrainer:

```python
print("Starting fine-tuning process with validation...")
training_results = trainer.train()

# Print evaluation metrics
print("Training completed!")
print(f"Final training metrics: {training_results.metrics}")
```

When you execute this code, the trainer takes care of:
- Loading and batching the dataset
- Computing forward and backward passes
- Updating the model parameters
- Evaluating on the validation set at specified intervals
- Saving checkpoints based on performance
- Logging metrics for monitoring

### Monitoring metrics and progress

When training starts, Unsloth prints a banner with key information about the run:

```
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 2,711 | Num Epochs = 3 | Total steps = 1,017
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 32,112,640/14,000,000,000 (0.23% trained)
```

This header provides a summary of your training configuration, including:
- The number of examples being trained on
- Total epochs and training steps
- Batch size and gradient accumulation configuration
- The percentage of parameters being trained (0.23% in our case)
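
The step counts in this banner follow directly from the dataset size and batch settings. Here is a minimal sketch of that arithmetic, using the numbers from the banner. The rounding is an assumption about how batches and accumulation steps are counted, though it does reproduce the reported total.

```python
import math

num_examples = 2_711
per_device_batch = 2
grad_accum = 4
num_gpus = 1
epochs = 3
eval_every = 200  # eval_steps from the trainer configuration

effective_batch = per_device_batch * grad_accum * num_gpus
batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_gpus))
steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)
total_steps = steps_per_epoch * epochs
eval_points = list(range(eval_every, total_steps + 1, eval_every))

print(f"Effective batch size: {effective_batch}")
print(f"Steps per epoch: {steps_per_epoch}, total steps: {total_steps}")
print(f"Evaluations at steps: {eval_points}")
```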

As training progresses, a table shows the training and validation metrics at regular intervals:

```
Step    Training Loss    Validation Loss
200     1.571900         1.287612
400     1.211400         1.214332
600     1.081000         1.182069
800     0.960400         1.207953
1000    0.879500         1.197931
```

### Interpreting evaluation results

The key metrics to monitor are:

1. **Training loss**: This should generally decrease over time. If it plateaus early, you might need to increase your learning rate or model capacity.

2. **Validation loss**: This is crucial for detecting overfitting. It should decrease along with training loss. If validation loss starts increasing while training loss continues to decrease, your model is likely overfitting.

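In our example, training loss decreases steadily from 1.57 to 0.88, showing that the model is learning from the data. The validation loss reaches its minimum of about 1.18 at step 600 and then drifts up to roughly 1.20 while training loss keeps falling, which is a mild sign of overfitting late in training. Because we set `load_best_model_at_end=True`, the trainer restores the step-600 checkpoint rather than the final one.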
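
One way to read the table is to compute the gap between validation and training loss at each evaluation, and to find the evaluation with the lowest validation loss. That is the checkpoint `load_best_model_at_end=True` would restore. The sketch below does this for the logged values:

```python
# (step, training loss, validation loss) from the table above
history = [
    (200, 1.571900, 1.287612),
    (400, 1.211400, 1.214332),
    (600, 1.081000, 1.182069),
    (800, 0.960400, 1.207953),
    (1000, 0.879500, 1.197931),
]

# Checkpoint with the lowest validation loss
best_step, _, best_val = min(history, key=lambda row: row[2])

# Validation minus training loss at each evaluation
gaps = {step: val - train for step, train, val in history}

print(f"Best checkpoint: step {best_step} (validation loss {best_val:.4f})")
for step, gap in gaps.items():
    print(f"step {step:>4}: validation - training = {gap:+.3f}")
```

A gap that keeps widening after the validation minimum is the pattern described above: training loss continues to fall while validation loss no longer improves.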

Final metrics after training might look like:

```python
Final training metrics: {'train_runtime': 1461.829, 'train_samples_per_second': 5.564, 'train_steps_per_second': 0.696, 'total_flos': 5.716715497264128e+16, 'train_loss': 1.136481964013804}
```

These metrics provide:
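- Total training time (about 1,462 seconds, or roughly 24 minutes, in this case)
- Processing speed (samples and steps per second)
- Total compute used (`total_flos`, the number of floating-point operations performed, not a per-second rate)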
- Average training loss across all training steps
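
The throughput figures can be cross-checked against the dataset size and step count from the training banner. This is a small sanity-check sketch; it assumes throughput is computed as total samples (examples × epochs) divided by runtime, which matches the reported numbers.

```python
metrics = {
    "train_runtime": 1461.829,
    "train_samples_per_second": 5.564,
    "train_steps_per_second": 0.696,
}
num_examples, epochs, total_steps = 2_711, 3, 1_017  # from the training banner

runtime_minutes = metrics["train_runtime"] / 60
samples_per_second = num_examples * epochs / metrics["train_runtime"]
steps_per_second = total_steps / metrics["train_runtime"]

print(f"Runtime: {runtime_minutes:.1f} minutes")
print(f"Samples/s: {samples_per_second:.3f} (reported {metrics['train_samples_per_second']})")
print(f"Steps/s:   {steps_per_second:.3f} (reported {metrics['train_steps_per_second']})")
```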

### Troubleshooting common issues

When fine-tuning your own models, you might encounter these common issues:

**Overfitting**:
- Symptoms: Validation loss increases while training loss continues to decrease
- Solutions: Increase LoRA dropout, reduce training epochs, increase weight decay, or use a smaller LoRA rank

**Underfitting**:
- Symptoms: Both training and validation loss remain high
- Solutions: Increase training epochs, use a larger LoRA rank, target more modules, or increase learning rate

**Memory errors**:
- Symptoms: CUDA out of memory errors
- Solutions: Reduce batch size, increase gradient accumulation steps, use 4-bit instead of 8-bit quantization, or choose a smaller model

**Slow convergence**:
- Symptoms: Loss decreases very slowly
- Solutions: Increase learning rate, adjust warmup steps, or check data quality

**Poor inference results**:
- Symptoms: Model outputs don't match expectations
- Solutions: Check data quality, ensure proper chat template formatting, or increase dataset size

For our Bullet Echo game assistant, training completes successfully: training loss falls steadily and validation loss bottoms out at about 1.18, indicating that the model has learned from our dataset. With training complete, we can now move on to testing the model and evaluating its performance on new queries in the next step.



## Step 8: Inference and Testing

After successfully training our model, we need to test its capabilities by running inference on new questions. This step involves setting up the model for generation, configuring the response parameters, and evaluating the quality of outputs.

### Setting up model for inference

First, we need to prepare our model for inference by optimizing it for generation:

```python
print("Setting up model for inference...")
FastModel.for_inference(model)  # Enable native 2x faster inference
```

This one line enables Unsloth's optimized inference mode, which can provide up to 2x faster generation compared to standard implementations.

### Custom generation configuration

Next, we create a function to generate responses from our fine-tuned model:

```python
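import torch
from transformers import GenerationConfig


def generate_response(
    model, tokenizer, query, temperature=0.7, top_p=0.9, max_new_tokens=256
):
    """
    Generate a response from the fine-tuned model.
    """
    # Format the query as a chat message
    messages = [{"role": "user", "content": query}]

    # Prepare model inputs
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to("cuda")

    # Create attention mask (all 1s) with the same shape as inputs
    attention_mask = torch.ones_like(inputs)

    # Configure generation parameters
    generation_config = GenerationConfig(
        temperature=temperature,
        top_p=top_p,
        do_sample=True,
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
        remove_invalid_values=True,
        # Disable thinking tags (`suppress_tokens` is the GenerationConfig option)
        suppress_tokens=(
            [
                tokenizer.encode("<think>", add_special_tokens=False)[0],
                tokenizer.encode("</think>", add_special_tokens=False)[0],
            ]
            if len(tokenizer.encode("<think>", add_special_tokens=False)) > 0
            else None
        ),
    )

    # Minimal completion (an assumption about the final step, adjust as needed):
    # generate, then decode only the newly generated tokens
    with torch.no_grad():
        outputs = model.generate(
            input_ids=inputs,
            attention_mask=attention_mask,
            generation_config=generation_config,
        )
    response = tokenizer.decode(
        outputs[0][inputs.shape[1]:], skip_special_tokens=True
    )
    print(f"User: {query}\nAssistant: {response.strip()}\n")
    return response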
```

This configuration specifies important generation parameters:

- **temperature**: Controls randomness in generation. Higher values (e.g., 1.0) produce more diverse outputs, while lower values (e.g., 0.2) make responses more deterministic.
- **top_p**: Controls nucleus sampling, where sampling is restricted to the smallest set of most likely tokens whose cumulative probability reaches `top_p`. This helps focus the generation while maintaining diversity.
- **max_new_tokens**: Sets the maximum length of the generated response.
- **Token suppression**: The `suppress_tokens` option of `GenerationConfig` prevents the listed token IDs (here, the thinking tags) from being generated.
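
To see how temperature and `top_p` interact, the following sketch applies both to a tiny, made-up set of next-token scores. It is a conceptual illustration in plain Python, not the library's implementation. The logits are hypothetical, and `softmax_with_temperature` and `top_p_filter` are illustrative helper names.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Indices of the smallest set of most likely tokens whose cumulative probability reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept


logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical next-token scores

for temperature in (0.2, 0.7, 1.0):
    probs = softmax_with_temperature(logits, temperature)
    kept = top_p_filter(probs, top_p=0.9)
    print(f"T={temperature}: probs={[round(p, 3) for p in probs]}, kept tokens={kept}")
```

Lower temperatures concentrate probability on the top token, so fewer candidates survive the `top_p` cutoff. Higher temperatures flatten the distribution and let more tokens through.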

### Implementing thinking tag filtering

Qwen 3 models can wrap their reasoning in `<think>...</think>` tags before giving the final answer. While this is useful for complex problems, we may want to hide this internal reasoning in our final output:

```python
import re

from transformers import TextStreamer


# Custom text filtering function
def filter_thinking(text):
    # Remove anything between <think> and </think> tags
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    # Remove any remaining <think> or </think> tags
    text = re.sub(r"<think>|</think>", "", text)
    return text

# Custom streamer class to filter thinking tags
class FilteredTextStreamer(TextStreamer):
    def on_finalized_text(self, text: str, stream_end: bool = False):
        filtered_text = filter_thinking(text)
        if filtered_text.strip():  # Only print non-empty text
            print(filtered_text, end="", flush=True)
```

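This custom streamer strips thinking tags from each piece of text before printing it, providing a cleaner user experience. Because a streamer receives text in small chunks, the first regex only removes a full reasoning block when the whole block arrives in a single chunk; otherwise only the tags themselves are removed.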
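
If the thinking tags are not suppressed at generation time, a reasoning block can span several streamed chunks, and a per-chunk regex then lets the text between the tags through. One possible remedy is a filter that remembers whether it is currently inside a `<think>` block. The sketch below is plain Python with no model involved. `ThinkingStreamFilter` is a hypothetical class, the sample chunks are invented, and it assumes each tag arrives whole within a single chunk.

```python
import re


def filter_thinking(text):
    """Same regex-based filter as above, applied to whatever text it is given."""
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    return re.sub(r"<think>|</think>", "", text)


class ThinkingStreamFilter:
    """Drops everything between <think> and </think>, even across chunk boundaries."""

    OPEN, CLOSE = "<think>", "</think>"

    def __init__(self):
        self.inside = False  # True while we are within a thinking block

    def feed(self, chunk: str) -> str:
        visible = []
        while chunk:
            if self.inside:
                end = chunk.find(self.CLOSE)
                if end == -1:
                    break  # still inside the block: drop the rest of this chunk
                chunk = chunk[end + len(self.CLOSE):]
                self.inside = False
            else:
                start = chunk.find(self.OPEN)
                if start == -1:
                    visible.append(chunk.replace(self.CLOSE, ""))
                    break
                visible.append(chunk[:start].replace(self.CLOSE, ""))
                chunk = chunk[start + len(self.OPEN):]
                self.inside = True
        return "".join(visible)


# Invented chunks, split the way a streamer might deliver them
chunks = ["<think>The user asks ", "about Cyclops.", "</think>\n\nStay ", "hidden."]

per_chunk = "".join(filter_thinking(c) for c in chunks)
stream_filter = ThinkingStreamFilter()
stateful = "".join(stream_filter.feed(c) for c in chunks)

print(repr(per_chunk))
print(repr(stateful))
```

In a streamer, one filter instance could be stored on the object and called from `on_finalized_text` in place of the stateless function.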

### Testing with sample queries

With our generation function set up, we can test the model with some Bullet Echo game questions:

```python
# Test the model with sample queries
print("\n--- Testing Model Responses ---")

test_queries = [
    "What's the best strategy for Cyclops in Bullet Echo?",
    "How does the Stalker's invisibility work in the game?",
    "Which heroes are effective against Bastion in Bullet Echo?",
]

for query in test_queries:
    generate_response(model, tokenizer, query)
```

Sample outputs might look like:

```
User: What's the best strategy for Cyclops in Bullet Echo?
Assistant: The best strategy for Cyclops is to stay hidden, as this is his greatest strength. He excels in ambush tactics, allowing him to surprise enemies and maximize his effectiveness in stealthy encounters.

User: How does the Stalker's invisibility work in the game?
Assistant: The Stalker uses a special ability called invisibility, which makes the character temporarily undetectable by opponents. This is often used for stealth movements and surprise attacks.
```

### Adapting inference for different applications

When implementing inference for your own projects, consider these variations:

**For interactive applications:**
- Use streaming generation to show responses as they're generated
- Consider lower temperature values (0.3-0.5) for more deterministic answers
- Implement a maximum length cutoff appropriate to your UI

**For batch processing:**
- Disable streaming for faster processing
- Consider using higher batch sizes if memory allows
- Store complete outputs rather than printing them

**For specialized domains:**
- Adjust temperature based on how creative vs. factual responses should be
- Consider adding domain-specific post-processing to validate outputs
- You might want to keep thinking tags visible for complex reasoning tasks

## Step 9: Saving and Deploying the Model

Our fine-tuned model is performing well, so it's time to save it for future use and deployment. This step covers saving the model locally, options for sharing via Hugging Face Hub, and verifying that the saved model works correctly.

### Saving the fine-tuned model locally

First, we save the fine-tuned model and tokenizer to local storage:

```python
print("\nSaving fine-tuned model...")
output_model_name = "qwen3-bullet-echo-qa-lora"
model.save_pretrained(output_model_name)
tokenizer.save_pretrained(output_model_name)
print(f"Model successfully saved to: ./{output_model_name}")
```

This creates a directory containing all the necessary files:
- The LoRA adapter weights (much smaller than the full model)
- Configuration files specifying the model architecture and parameters
- Tokenizer files including vocabulary and special token mappings
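
To confirm what was written, it can help to list the saved files with their sizes. The sketch below uses only the standard library. `summarize_dir` is a hypothetical helper, and the listing is skipped if the directory does not exist yet. With a PEFT adapter you would typically expect files such as `adapter_config.json` and `adapter_model.safetensors` alongside the tokenizer files.

```python
from pathlib import Path


def summarize_dir(path):
    """Map each file under `path` to its size in megabytes, largest first."""
    files = [p for p in Path(path).rglob("*") if p.is_file()]
    sizes = {str(p.relative_to(path)): p.stat().st_size / 1e6 for p in files}
    return dict(sorted(sizes.items(), key=lambda item: item[1], reverse=True))


model_dir = "qwen3-bullet-echo-qa-lora"  # directory name used in this guide

if Path(model_dir).is_dir():
    for name, size_mb in summarize_dir(model_dir).items():
        print(f"{size_mb:8.2f} MB  {name}")
```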

### Hugging Face Hub integration (optional)

For sharing your model with others or deploying to production, you can push it to Hugging Face Hub:

```python
# Optional: Push to Hugging Face Hub
# from huggingface_hub import login
# login()
# hub_model_id = f"your-hf-username/{output_model_name}"
# model.push_to_hub(hub_model_id)
# tokenizer.push_to_hub(hub_model_id)
# print(f"Model pushed to Hugging Face Hub: {hub_model_id}")
```

This makes your model accessible to others and integrates with various deployment platforms that support Hugging Face models.

### Loading and verifying the saved model

To ensure everything was saved correctly, we load the model back and test it:

```python
print("\n--- Loading Saved Fine-tuned Model ---")

# Load the saved model and tokenizer
saved_model_path = output_model_name  # "qwen3-bullet-echo-qa-lora"
loaded_model, loaded_tokenizer = FastModel.from_pretrained(
    model_name=saved_model_path,
    max_seq_length=2048,
    load_in_4bit=True,
    full_finetuning=False,
)

# Enable faster inference
FastModel.for_inference(loaded_model)

print("Model successfully loaded for inference!")

# Test with new queries
print("\n--- Testing Loaded Model Responses ---")

new_test_queries = [
    "What's the best strategy for Cyclops in Bullet Echo?",
    "How does the Stalker's invisibility work in the game?",
    "Which heroes are effective against Bastion in Bullet Echo?",
]

for query in new_test_queries:
    generate_response(loaded_model, loaded_tokenizer, query, temperature=0.2)
```

Notice we use a lower temperature (0.2) here to get more deterministic responses for easier comparison.

### Deployment considerations

When deploying your fine-tuned model for real-world use, consider these approaches:

**For web applications:**
- Use Hugging Face Inference API for managed hosting
- Deploy as a container with FastAPI or Flask for more control
- Consider quantizing to INT8 or INT4 for production efficiency

**For local applications:**
- Export to ONNX format for faster CPU inference
- Use llama.cpp (after converting the model to GGUF) for optimized deployment on edge devices
- Consider merging LoRA weights with the base model for simplified deployment
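
Merging works because a LoRA adapter is a low-rank update added to a frozen weight matrix: the merged weight is W + (alpha / r) · B · A. The toy example below uses tiny hand-written matrices to show that the merged matrix produces the same output as applying the base weight and the adapter separately. All values, including `r` and `alpha`, are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_scaled(a, b, scale=1.0):
    """Element-wise a + scale * b."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


r, alpha = 1, 2
scale = alpha / r  # standard LoRA scaling factor

W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight (2 x 2)
B = [[0.5], [1.0]]            # LoRA B matrix (2 x r)
A = [[1.0, -1.0]]             # LoRA A matrix (r x 2)
x = [[3.0], [2.0]]            # input column vector

delta = matmul(B, A)                     # low-rank update B @ A
W_merged = add_scaled(W, delta, scale)   # W + (alpha / r) * B @ A

separate = add_scaled(matmul(W, x), matmul(delta, x), scale)  # base path + adapter path
merged = matmul(W_merged, x)                                  # single merged matrix

print("Separate:", separate)
print("Merged:  ", merged)
```

After merging, inference needs only one matrix per layer and no adapter code. With a 4-bit base model, the base weights generally have to be dequantized before the update can be added, so merged checkpoints are much larger than the adapter alone.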

**For scaling considerations:**
- Use vLLM or text-generation-inference for higher throughput
- Implement caching for common queries
- Consider distilling into a smaller model for resource-constrained environments

By following these steps, you've successfully fine-tuned a powerful Qwen 3 model to create a specialized assistant for the Bullet Echo game. The resulting model can now answer domain-specific questions with accuracy and relevance, while maintaining the general capabilities of the base model.


## Conclusion

In this step-by-step guide, we've walked through the complete process of fine-tuning Qwen 3 on a custom dataset. We started by creating a specialized question-answer dataset from the Bullet Echo wiki using Firecrawl's AI-powered extraction capabilities, then prepared our training environment with appropriate hardware and memory optimizations. Through Parameter-Efficient Fine-Tuning with QLoRA, we were able to adapt a 14B parameter model while training only 0.23% of its parameters, making the process feasible on a single GPU. Our implementation of proper validation strategies, optimization techniques, and inference configuration resulted in a model that can accurately answer domain-specific questions about the Bullet Echo game.

The techniques demonstrated here can be applied to create specialized AI assistants for virtually any domain. Whether you're building a customer support bot, a technical documentation assistant, or a domain-specific knowledge base, the combination of Firecrawl for dataset creation and Unsloth for optimized fine-tuning provides a powerful toolkit for customizing large language models. To create your own custom datasets for fine-tuning, consider exploring [Firecrawl's AI-powered extraction](https://firecrawl.dev) capabilities, which eliminate the need for complex web scraping code and make dataset creation accessible even without extensive technical knowledge. As language models continue to evolve, the ability to efficiently adapt them to specialized domains will remain a key competitive advantage for developers and organizations.

> Don't forget to check out [the full code](https://github.com/mendableai/firecrawl-app-examples/tree/main/qwen3-fine-tuning) for this article from our GitHub repository.