<div align="center">
  <!-- <h1>KTransformers</h1> -->
  <p align="center">

<picture>
    <img alt="KTransformers" src="https://github.com/user-attachments/assets/d5a2492f-a415-4456-af99-4ab102f13f8b" width=50%>

</picture>

</p>

</div>

# **Introduction**
[KTransformers](https://github.com/kvcache-ai/ktransformers) is designed to enhance the 🤗 Transformers experience through advanced kernel optimizations and placement/parallelism strategies.
<br/> <br/>
This tutorial serves as a guide for KTransformers-ft, aiming to give resource-constrained researchers a **local path to explore fine-tuning ultra-large models (e.g., 671B/1000B)**, as well as a fast way to customize smaller models (e.g., 14B/30B) for specific scenarios. We validate the setup using representative tasks such as stylized dialogue, Westernized translation tone, and medical Q&A, demonstrating that personalized adaptation can be achieved within hours.
<br/> <br/>
This tutorial takes DeepSeek-V2-Lite as a code example; for more details, refer to [KTransformers-Fine-Tuning_User-Guide](https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/KTransformers-Fine-Tuning_User-Guide.md) and [KTransformers-Fine-Tuning_Developer-Technical-Notes](https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/KTransformers-Fine-Tuning_Developer-Technical-Notes.md).

# **Installation**

### **1. Install torch and clone the repo**

In [None]:
!git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
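# "!cd" runs in a throwaway subshell and does not persist; the working directory is switched with os.chdir in step 2.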

**(Optional)** If you want to choose your own versions of torch and CUDA, install them separately.

In [None]:
!pip install torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 --index-url https://download.pytorch.org/whl/cu118

### **2. Install LLaMA-Factory**

In [7]:
import os
os.chdir("LLaMA-Factory")

In [None]:
!pip install -e ".[torch,metrics]" --no-build-isolation

### **3. Install dependency libraries for GCC and CUDA**
You need to install system-level dependency libraries. `libstdcxx-ng` and `gcc_impl_linux-64` ensure compilation compatibility, while `cuda-runtime` provides a GPU-accelerated runtime environment. **Please do NOT skip these two commands! `nvidia/label/cuda-11.8.0 cuda-runtime` should be installed for the KT wheel regardless of your CUDA version.**

In [None]:
!conda install -y -c conda-forge libstdcxx-ng gcc_impl_linux-64
!conda install -y -c nvidia/label/cuda-11.8.0 cuda-runtime

### **4. Install ktransformers and flash-attention**
Download the wheels that match your Python, CUDA, and torch versions from the [ktransformers releases](https://github.com/kvcache-ai/ktransformers/releases/tag/v0.4.1) and the [flash-attention releases](https://github.com/Dao-AILab/flash-attention/releases). The cell below prints torch's C++11 ABI flag, which tells you whether to pick the `cxx11abiTRUE` or the `cxx11abiFALSE` flash-attention wheel.

In [10]:
import torch
print(torch._C._GLIBCXX_USE_CXX11_ABI)

True


In [None]:
!pip install ../ktransformers-0.4.1+cu128torch27fancy-cp312-cp312-linux_x86_64.whl

In [None]:
!pip install ../flash_attn-2.8.3+cu12torch2.7cxx11abiTRUE-cp312-cp312-linux_x86_64.whl
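
The wheel filenames above encode the Python tag, CUDA version, torch version, and C++11 ABI flag. As a rough aid for picking the right file, the sketch below assembles a flash-attention filename from those pieces. The helper names `python_tag` and `flash_attn_wheel` are hypothetical, and the pattern is inferred only from the filename used in this notebook, so check it against the actual release page.

```python
import sys


def python_tag(version_info=sys.version_info) -> str:
    """Return the CPython wheel tag, e.g. 'cp312' for Python 3.12."""
    return f"cp{version_info[0]}{version_info[1]}"


def flash_attn_wheel(flash_version: str, cuda_major: int, torch_version: str,
                     cxx11_abi: bool, py_tag: str) -> str:
    """Assemble a filename following the pattern of the wheel installed above.

    The pattern is an assumption inferred from this notebook's filename.
    """
    abi = "TRUE" if cxx11_abi else "FALSE"
    return (
        f"flash_attn-{flash_version}+cu{cuda_major}torch{torch_version}"
        f"cxx11abi{abi}-{py_tag}-{py_tag}-linux_x86_64.whl"
    )


# In the notebook, cxx11_abi would come from torch._C._GLIBCXX_USE_CXX11_ABI.
print(flash_attn_wheel("2.8.3", 12, "2.7", True, "cp312"))
```

With the values shown, the result is the same filename as the one installed in the cell above.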

# **How to Start**
## Fine-tuning the Model with LoRA

LoRA (Low-Rank Adaptation) fine-tuning trains only small "adapter" weights on top of a large model. Under traditional frameworks, however, it can still require more than 1400GB of GPU VRAM for ultra-large models, far beyond what a machine with RTX 4090s can offer. **KTransformers**, as a high-performance backend engine, provides a GPU/CPU hybrid solution that further cuts GPU memory usage and speeds up training. As shown below, we compare KTransformers (ours) with other common LoRA fine-tuning backends (HuggingFace and Unsloth). KTransformers is the **only workable 4090-class solution** for ultra-large MoE models (e.g., 671B) and also delivers higher fine-tuning throughput. <br/>
<div style="text-align: center;">
<img src="https://typora-tuchuang-jimmy.oss-cn-beijing.aliyuncs.com/img/按照模型划分的对比图_02.png" alt="kt_unsloth_huggingface_compare" width="70%" height="auto">
</div>

To make KTransformers-ft easier to use, we cooperate with [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory/), an easy-to-use and efficient model fine-tuning framework. As shown below, LLaMA-Factory is the unified configuration layer for the whole fine-tuning workflow. **KTransformers** acts as a high-performance backend that takes over core operators like Attention/MoE under the same training configs, enabling efficient **GPU+CPU heterogeneous cooperation**. <br/>
<div style="text-align: center;">
<img src="https://typora-tuchuang-jimmy.oss-cn-beijing.aliyuncs.com/img/image-20251011010558909.png" alt="image-20251011010558909" width="70%" height="auto">
</div>

This combination lets you fine-tune big models (like 671B/1000B) on consumer-level GPUs (2-4 RTX 4090s), with no need for expensive hardware. Here’s the training command:

In [None]:
!USE_KT=1 llamafactory-cli train examples/train_lora/deepseek2_lora_sft_kt.yaml

Let’s break down the training command (`USE_KT=1 llamafactory-cli train examples/train_lora/deepseek2_lora_sft_kt.yaml`):
- `USE_KT=1`: The "switch" to enable KTransformers optimization.  
- `llamafactory-cli train`: The core command to start LLaMA-Factory’s fine-tuning tool.
- `examples/train_lora/deepseek2_lora_sft_kt.yaml`: The configuration file that controls model, data, training rules and KTransformers settings; we’ll detail this next.
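
Outside a notebook, the same command can be launched from a Python script. `USE_KT=1` is an ordinary environment variable, so it has to be passed through the process environment. The sketch below only builds the environment and argument list; the helper name `build_train_invocation` is hypothetical, and the actual launch is left as a comment.

```python
import os
import shlex


def build_train_invocation(config_path: str, use_kt: bool = True):
    """Return (env, argv) for launching LLaMA-Factory training."""
    env = dict(os.environ)  # copy so the current process is not modified
    if use_kt:
        env["USE_KT"] = "1"  # same switch as the USE_KT=1 prefix in the shell
    else:
        env.pop("USE_KT", None)
    argv = ["llamafactory-cli", "train", config_path]
    return env, argv


env, argv = build_train_invocation("examples/train_lora/deepseek2_lora_sft_kt.yaml")
print(f"USE_KT={env['USE_KT']} {shlex.join(argv)}")

# To actually start training (requires LLaMA-Factory to be installed):
# import subprocess
# subprocess.run(argv, env=env, check=True)
```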

**The LLaMA-Factory YAML (e.g. `deepseek2_lora_sft_kt.yaml`) is where you define how the fine-tuning works.** Below is a simplified version that you can use directly for basic tasks like style transfer or domain Q&A. We’ll explain each section’s purpose and why the values are set this way in the customization part later in this tutorial.
```yaml
### model
model_name_or_path: deepseek-ai/DeepSeek-V2-Lite

### method
finetuning_type: lora
lora_rank: 8
lora_target: all

### dataset
dataset: identity
template: deepseek
cutoff_len: 2048
max_samples: 100000

### output
output_dir: saves/Kllama_deepseekV2
logging_steps: 10
save_steps: 500

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0

### ktransformers
use_kt: true # use KTransformers as LoRA sft backend
kt_optimize_rule: examples/kt_optimize_rules/DeepSeek-V2-Lite-Chat-sft-amx.yaml
cpu_infer: 32
chunk_size: 8192
```

## Chat with the Fine-tuned Model: Test Your Customized AI

After finishing fine-tuning with KTransformers, **the next step is to chat with your model and verify the results!** This step loads the original base model plus the fine-tuned "custom plugin" (LoRA adapter) you saved earlier, letting you interact with the model in real time.  

We’ll use LLaMA-Factory’s `chat` command to launch the interactive interface. The core is the LLaMA-Factory YAML configuration file: it tells the tool which model to load, how to optimize inference, and what style of dialogue to use. One example is shown below.

In [None]:
!llamafactory-cli chat examples/inference/deepseek2_lora_sft_kt.yaml

To know exactly what you’re running, we break down the full command (`llamafactory-cli chat examples/inference/deepseek2_lora_sft_kt.yaml`):
- `llamafactory-cli chat`: The core command to launch LLaMA-Factory’s interactive chat tool.
- `examples/inference/deepseek2_lora_sft_kt.yaml`: The configuration file for inference (controls model loading, optimization, and dialogue settings).
- No need for `USE_KT=1` here: we’ll enable KTransformers directly in the YAML (but it still needs to match the training settings!).

**The LLaMA-Factory configuration file for inference (`examples/inference/deepseek2_lora_sft_kt.yaml`) controls the generation config for specific tasks.** Below is a simplified version that you can use directly to chat with your fine-tuned model. Most settings are tied to your training config; we’ll explain the details in the next part.
```yaml
model_name_or_path: deepseek-ai/DeepSeek-V2-Lite
adapter_name_or_path: saves/Kllama_deepseekV2
template: deepseek
infer_backend: ktransformers  # choices: [huggingface, vllm, sglang, ktransformers]
trust_remote_code: true

use_kt: true # use KTransformers as the inference backend for the LoRA fine-tuned model
kt_optimize_rule: examples/kt_optimize_rules/DeepSeek-V2-Lite-Chat-sft-amx.yaml
cpu_infer: 32
chunk_size: 8192
```
`kt_optimize_rule` must be the same as the `kt_optimize_rule` used for LoRA fine-tuning.
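
Since the training and inference configs have to agree, a quick pre-flight check can catch mismatches before a long model load. The sketch below compares two configs as plain dictionaries, for instance as you might load them with a YAML parser. The helper `find_mismatches` is hypothetical and not part of LLaMA-Factory, and treating `output_dir` as the expected `adapter_name_or_path` is an assumption based on the two example files above.

```python
# Keys that, per the note above, must agree between training and inference.
MUST_MATCH = ("use_kt", "kt_optimize_rule")


def find_mismatches(train_cfg: dict, infer_cfg: dict) -> list:
    """Return human-readable descriptions of settings that disagree."""
    problems = []
    for key in MUST_MATCH:
        if train_cfg.get(key) != infer_cfg.get(key):
            problems.append(
                f"{key}: train={train_cfg.get(key)!r} infer={infer_cfg.get(key)!r}"
            )
    # Assumption: the adapter used for chat is the directory training wrote to.
    if train_cfg.get("output_dir") != infer_cfg.get("adapter_name_or_path"):
        problems.append("adapter_name_or_path does not point at the training output_dir")
    return problems


train_cfg = {
    "output_dir": "saves/Kllama_deepseekV2",
    "use_kt": True,
    "kt_optimize_rule": "examples/kt_optimize_rules/DeepSeek-V2-Lite-Chat-sft-amx.yaml",
}
infer_cfg = {
    "adapter_name_or_path": "saves/Kllama_deepseekV2",
    "use_kt": True,
    "kt_optimize_rule": "examples/kt_optimize_rules/DeepSeek-V2-Lite-Chat-sft-amx.yaml",
}
print(find_mismatches(train_cfg, infer_cfg))
```

An empty list means the two configs are consistent on these keys.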

# **Customize your KTransformers-FineTuning + LLaMA-Factory**

Once you’ve got the basic fine-tuning workflow down, you’ll likely want to **adapt the process to your specific needs**, whether that’s training on your own data, squeezing more performance out of limited GPU memory, or speeding up training for large datasets. Below is a hands-on guide to customizing every part of the process, with clear explanations of why each setting matters and how to tweak it.

## 1. Fine-tuning Customization: Tailor Training to Your Needs  
To start customizing, you’ll still use the core training command: `USE_KT=1 llamafactory-cli train examples/train_lora/deepseek2_lora_sft_kt.yaml`. Adapting the configuration to your specific needs can give better results than the default setup. <br/>
### Full example **LLaMA-Factory YAML** for DeepSeek-V2-Lite
```yaml
### model
model_name_or_path: deepseek-ai/DeepSeek-V2-Lite
trust_remote_code: true

### method
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 8
lora_target: all

### dataset
dataset: identity
template: deepseek
cutoff_len: 2048
max_samples: 100000
overwrite_cache: true
preprocessing_num_workers: 16
dataloader_num_workers: 4

### output
output_dir: saves/Kllama_deepseekV2Lite
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true
save_only_model: false
report_to: none  # choices: [none, wandb, tensorboard, swanlab, mlflow]

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
resume_from_checkpoint: null

### ktransformers
use_kt: true # use KTransformers as LoRA sft backend
kt_optimize_rule: examples/kt_optimize_rules/DeepSeek-V2-Lite-Chat-sft-amx.yaml
cpu_infer: 32
chunk_size: 8192
```

---
### A. Pick & Prepare Your Model
The first step in customization is choosing the right base model and ensuring it works with KTransformers. The `model_name_or_path` setting (shown in the LLaMA-Factory YAML above) controls this, and getting it right avoids common errors.
- **Use a public model**: Directly set to Hugging Face Hub names (e.g., `deepseek-ai/DeepSeek-V2-Lite`, `Qwen/Qwen2-MoE-72B`).  
- **Use a local model**: Replace with your local folder path (e.g., `/mnt/data/models/DeepSeek-V2-Lite`).

**Critical Requirement**: The model must be in **BF16 format**.  
  - FP8 models (like DeepSeek-V3’s default release) aren’t compatible with KTransformers’ optimization.  
  - Fix: Convert FP8 to BF16 with **[this official script](https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/fp8_cast_bf16.py)**.
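
Before starting a long run, it can help to check whether a local checkpoint is still in FP8. The sketch below inspects a model's `config.json`. It assumes the checkpoint declares FP8 through a `quantization_config` entry with `quant_method: "fp8"`, which is how Hugging Face style configs commonly record it; the helper names are hypothetical, and a checkpoint that does not record its format this way would not be detected.

```python
import json
from pathlib import Path


def load_config(model_dir: str) -> dict:
    """Read config.json from a local model folder."""
    return json.loads((Path(model_dir) / "config.json").read_text())


def needs_bf16_conversion(config: dict) -> bool:
    """Return True if the config declares FP8 quantization.

    Assumption: FP8 checkpoints carry quantization_config.quant_method == "fp8".
    """
    quant = config.get("quantization_config") or {}
    return str(quant.get("quant_method", "")).lower() == "fp8"


# Illustrative configs (made up for this example, not copied from a real model).
fp8_config = {"torch_dtype": "bfloat16", "quantization_config": {"quant_method": "fp8"}}
bf16_config = {"torch_dtype": "bfloat16"}

print(needs_bf16_conversion(fp8_config))
print(needs_bf16_conversion(bf16_config))
```

For a real checkpoint, pass `load_config("/mnt/data/models/DeepSeek-V2-Lite")` (or your own folder) to `needs_bf16_conversion`.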

---

### B. Tune LoRA: Balance Fitting Capability & Memory  
LoRA trains tiny "adapter" weights instead of the entire model. Tweaking these two settings in LLaMA-Factory YAML (`lora_rank`, `lora_target`) lets you balance how well the model learns your data and how much GPU memory it uses:

| Setting         | What it does                                                                 | Scenario & Recommendation                                                                 |
|-----------------|-----------------------------------------------------------------------------|-------------------------------------------------------------------------------------------|
| `lora_rank`     | Controls the "power" of LoRA adapters (higher = more fitting, more memory). | - Small dataset (≤5k samples) or limited GPU: 4-8 (balances speed/memory).<br>- Large dataset (≥20k samples): 16-32 (better fits custom data). |
| `lora_target`   | Which layers get LoRA (applies only to linear layers).                      | - Quick fine-tuning (e.g., style transfer): `q_proj,v_proj` (only attention layers; faster).<br>- Deep customization (e.g., medical Q&A): `all` (all linear layers; more accurate). |

**Tip**: Pair `lora_rank=8` with `lora_alpha=32` (alpha = 4× rank) for stable training. This ratio is tested to work well for most tasks, from chatbots to domain Q&A.  
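
To get a feel for what `lora_rank` and `lora_alpha` mean numerically, recall the standard LoRA formulation: a frozen weight of shape `d_out × d_in` receives an update `(alpha / rank) · B · A`, where `B` is `d_out × rank` and `A` is `rank × d_in`. The sketch below computes the added parameter count and the scaling factor. The 4096 × 4096 layer size is a hypothetical example, not a dimension taken from DeepSeek-V2-Lite.

```python
def lora_extra_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one linear layer: A (rank x d_in) + B (d_out x rank)."""
    return rank * (d_in + d_out)


def lora_scaling(alpha: float, rank: int) -> float:
    """Scaling applied to the low-rank update in standard LoRA (alpha / rank)."""
    return alpha / rank


d_in = d_out = 4096  # hypothetical layer size for illustration
full = d_in * d_out
for rank in (4, 8, 16, 32):
    extra = lora_extra_params(d_in, d_out, rank)
    print(f"rank={rank:>2}: {extra:>7} adapter params ({extra / full:.2%} of the layer)")

print("scaling for rank=8, alpha=32:", lora_scaling(32, 8))
```

Doubling the rank doubles the adapter size, which is why small ranks suit tight memory budgets.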

---

### C. Use Your Own Dataset
Fine-tuning’s value lies in training on your own data, such as company documents, customer support logs, or domain-specific Q&A. Below is how to replace the default (identity) dataset with yours:  

1. **Add a custom dataset**:  
   - Step 1: Organize your data into LLaMA-Factory’s format (e.g., JSON with `instruction`, `input`, `output` fields; see [dataset examples](https://github.com/hiyouga/LLaMA-Factory/tree/main/data)).  
   - Step 2: Register your dataset in [LLaMA-Factory/data/dataset_info.json](https://github.com/hiyouga/LLaMA-Factory/blob/main/data/dataset_info.json) (copy the format of built-in datasets; just add your dataset name and file path).
     For example,
     ```json
     "niko": {
       "file_name": "../niko_train.json"
     },
     ```
   - Step 3: Replace `dataset: identity` in the LLaMA-Factory YAML with your dataset name (e.g. `dataset: niko`).
2. **Tweak dataset settings for better results**:  
   - `cutoff_len`: Truncates long texts (e.g., set to 4096 for long documents, 2048 for short dialogues; never exceed `model_max_length`).  
   - `max_samples`: Caps the number of samples used (use 100 for debugging, `None` for full training; handy if your dataset is huge).  
   - `template`: Must match your model (e.g., `deepseek` for DeepSeek, `llama3` for LLaMA3; for more, refer to [supported-models](https://github.com/hiyouga/LLaMA-Factory/tree/main?tab=readme-ov-file#supported-models)). Mismatched templates break response formatting!  
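
The dataset steps above can be sketched in Python as follows. The two sample records are made up for illustration, and the validator `invalid_records` is a hypothetical helper. It treats `instruction` and `output` as required and `input` as optional, which is my reading of the instruction/input/output layout, so compare it with the linked dataset examples before relying on it.

```python
import json


def invalid_records(records) -> list:
    """Return indices of records that do not fit the instruction/input/output layout."""
    bad = []
    for i, rec in enumerate(records):
        has_required = all(
            isinstance(rec.get(key), str) and rec[key].strip()
            for key in ("instruction", "output")
        )
        input_ok = isinstance(rec.get("input", ""), str)  # optional, but must be text
        if not (has_required and input_ok):
            bad.append(i)
    return bad


# Made-up samples standing in for your own data.
records = [
    {"instruction": "Who are you?", "input": "", "output": "I am Niko, a helpful assistant."},
    {"instruction": "Translate to French.", "input": "Good morning", "output": "Bonjour"},
]

# Entry to merge into data/dataset_info.json (same shape as the JSON snippet above).
dataset_info_entry = {"niko": {"file_name": "../niko_train.json"}}

assert invalid_records(records) == []
print(json.dumps(records, ensure_ascii=False, indent=2))
print(json.dumps(dataset_info_entry, indent=2))
```

Writing the first JSON string to `niko_train.json` and merging the second into `dataset_info.json` completes Steps 1 and 2.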

---

### D. Save GPU Memory & Speed Up Training  
If you’re hitting GPU memory limits or waiting too long for training, adjust these settings in the LLaMA-Factory YAML:  

| Challenge               | Setting to Tweak                          | How to Adjust                                                                 |
|-------------------------|-------------------------------------------|--------------------------------------------------------------------------------|
| GPU memory is tight     | `per_device_train_batch_size` + `gradient_accumulation_steps` | Set `per_device_train_batch_size=1` (smallest batch) + `gradient_accumulation_steps=16` (simulates a batch of 16 with no memory penalty!). |
| Model overfits (bad generalization) | `lora_dropout` + `num_train_epochs` | Add `lora_dropout: 0.1` (helps prevent overfitting) + reduce `num_train_epochs` to 2 (3 is the default; overtraining hurts!). |
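
The batch-size arithmetic behind the first row can be made explicit. The effective batch size is the per-device batch size times the accumulation steps times the number of devices, and the number of optimizer updates follows from it. The sketch below is an approximation: the exact rounding depends on the trainer, and the dataset size of 1000 is a hypothetical value.

```python
import math


def effective_batch_size(per_device_train_batch_size: int,
                         gradient_accumulation_steps: int,
                         num_devices: int = 1) -> int:
    """Samples contributing to each optimizer update."""
    return per_device_train_batch_size * gradient_accumulation_steps * num_devices


def approx_optimizer_steps(num_samples: int, num_train_epochs: float, batch_size: int) -> int:
    """Rough number of optimizer updates over the whole run."""
    return math.ceil(num_samples * num_train_epochs / batch_size)


batch = effective_batch_size(1, 16)
print("effective batch size:", batch)
print("approx. optimizer steps:", approx_optimizer_steps(1000, 2, batch))
```

This also helps when choosing `save_steps`: a run with only a few hundred updates does not need a checkpoint every 500 steps.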

**Key Train Configs Recap**:  
- `learning_rate`: 1e-4~2e-4 for LoRA (stick to this range: too high = unstable, too low = slow learning).  
- `save_steps`: Save checkpoints every 100-500 steps (frequent saves = safe, but don’t overdo it, since each checkpoint takes storage!).  
- `output_dir`: Customize the save path (e.g., `saves/medical_qa_deepseek` instead of the default, which keeps your projects organized!).  

---

### E. KTransformers Optimization: Unlock Maximum Performance  
KTransformers is what makes fine-tuning large models (like 671B-parameter DeepSeek-V3) possible on modest hardware. These settings control how it optimizes layer placement (GPU vs. CPU) and computation speed:

| Setting               | What it does                                                                 | How to Customize                                                                 |
|-----------------------|-----------------------------------------------------------------------------|----------------------------------------------------------------------------------|
| `use_kt`              | Enables KTransformers backend (must be `true`; otherwise, no optimization!). | Leave as `true`: this is what makes 671B models trainable on 2×4090s!             |
| `cpu_infer`           | Number of CPU threads for MoE/linear computations.                          | Set to half your CPU cores (e.g., 32 for a 64-core CPU; too many threads = bottlenecks!). |
| `chunk_size`          | Block size for long text processing (affects memory and speed).             | Default 8192 works for most tasks; increase to 16384 for extra-long texts (e.g., book summaries). |
| `kt_optimize_rule`    | Defines where layers run (GPU/CPU) and which kernels to use (core of KT!).  | - Use the pre-built rule for your model (e.g., `DeepSeek-V2-Lite-Chat-sft-amx.yaml`).<br>- For faster speed: Use `AMXInt8`/`AMXBF16` as backend (if your CPU supports AMX; check with `lscpu \| grep amx`).<br>- For compatibility: Fall back to `llamafile` if AMX isn’t supported. |
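
The `cpu_infer` and backend recommendations in the table can be turned into a small helper. The sketch below reads the CPU feature flags from `/proc/cpuinfo` on Linux and picks a backend name in the order AMXInt8, AMXBF16, llamafile. The function names are hypothetical, the flag names `amx_tile`, `amx_int8`, and `amx_bf16` are the ones Linux reports for AMX-capable CPUs, and `os.cpu_count()` returns logical CPUs, so halve again if you want to count physical cores on a hyper-threaded machine.

```python
import os


def read_cpu_flags(path: str = "/proc/cpuinfo") -> set:
    """Return the CPU feature flags reported by Linux, or an empty set if unavailable."""
    try:
        with open(path) as fh:
            for line in fh:
                if line.startswith("flags"):
                    return set(line.split(":", 1)[1].split())
    except OSError:
        pass
    return set()


def pick_backend(cpu_flags) -> str:
    """Choose a backend name for the optimize rule based on AMX support."""
    flags = set(cpu_flags)
    if {"amx_tile", "amx_int8"} <= flags:
        return "AMXInt8"
    if {"amx_tile", "amx_bf16"} <= flags:
        return "AMXBF16"
    return "llamafile"  # compatibility fallback when AMX is not available


def suggest_cpu_infer(cores: int) -> int:
    """Half the available cores, following the table's rule of thumb."""
    return max(1, cores // 2)


print("backend:", pick_backend(read_cpu_flags()))
print("cpu_infer:", suggest_cpu_infer(os.cpu_count() or 1))
```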

#### Example Custom `kt_optimize_rule` (shown in the table above)  
This rule tells KTransformers to offload heavy MoE layers to the CPU (saving GPU memory) and use AMX for fast CPU computation. Use it as a template for your own model (a detailed tutorial is available **[here](https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/injection_tutorial.md)**):
```yaml
- match:
    name: "^model\\.layers\\..*\\.mlp\\.experts$"  # Target all MoE expert layers
  replace:
    class: ktransformers.operators.experts.KTransformersExperts  # KT's optimized MoE kernel
    kwargs:
      prefill_device: "cuda"  # Fast pre-processing on GPU
      prefill_op: "KExpertsTorch"
      generate_device: "cpu"  # Heavy MoE compute on CPU (saves GPU memory)
      generate_op: "KSFTExpertsCPU"  # KT's SFT-optimized MoE operator
      out_device: "cuda"  # Send results back to GPU for next steps
      backend: "AMXInt8"  # Options: AMXInt8 (fastest) > AMXBF16 > llamafile (default)
```
**Alert:** Never mix KLinearMarlin with LoRA fine-tuning. Replace it with KLinearTorch in your rule file to avoid compatibility issues!
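
When adapting the `match.name` pattern to another model, it is worth testing the regular expression against a few module names first. The sketch below does this with Python's `re` module. Note that the YAML string `"^model\\.layers\\..*\\.mlp\\.experts$"` unescapes to the regex `^model\.layers\..*\.mlp\.experts$`. The module names in the list are illustrative guesses, not names dumped from a real model, and how KTransformers applies the pattern internally is not shown here.

```python
import re

# The regex the YAML rule above resolves to after YAML unescaping.
EXPERTS_RULE = re.compile(r"^model\.layers\..*\.mlp\.experts$")

# Illustrative module names (assumed for this example).
module_names = [
    "model.layers.3.mlp.experts",
    "model.layers.3.mlp.gate",
    "model.layers.3.self_attn.q_proj",
    "model.layers.3.mlp.experts.0.up_proj",
]

matched = [name for name in module_names if EXPERTS_RULE.match(name)]
print(matched)
```

Only the first name matches: the `$` anchor keeps the rule from touching the individual sub-modules inside each expert block.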

In [None]:
!USE_KT=1 llamafactory-cli train examples/train_lora/deepseek2_lora_sft_kt.yaml

## 2. Chat with the Fine-tuned Model

After completing fine-tuning, the next critical step is to test your customized model through real-time interaction. Running `llamafactory-cli chat examples/inference/deepseek2_lora_sft_kt.yaml` loads the base model and your fine-tuned LoRA adapter. Below is a detailed guide to customizing the chat process, with clear explanations of each setting’s role and how to fit it to your specific tasks.

### Full example LLaMA-Factory YAML for inference
```yaml
model_name_or_path: deepseek-ai/DeepSeek-V2-Lite
adapter_name_or_path: saves/Kllama_deepseekV2Lite
template: deepseek
infer_backend: ktransformers  # choices: [huggingface, vllm, sglang, ktransformers]
trust_remote_code: true

use_kt: true # use KTransformers as the inference backend for the LoRA fine-tuned model
kt_optimize_rule: examples/kt_optimize_rules/DeepSeek-V2-Lite-Chat-sft-amx.yaml
cpu_infer: 32
chunk_size: 8192
```

---

### A. Load Your Fine-Tuned Adapter (Two Supported Formats)  
The `adapter_name_or_path` setting in LLaMA-Factory YAML points to your trained LoRA weights. Two formats are supported:  
- **Folder Format (Default)**: If training saved a folder (e.g., `saves/Kllama_deepseekV2`) with `.safetensors` files, set it directly (e.g., `adapter_name_or_path: saves/Kllama_deepseekV2`).  
- **GGUF Format (Single File)**: If you exported the adapter to a `.gguf` file (for portability), set the full path (e.g., `adapter_name_or_path: saves/my_adapter.gguf`).  
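
A quick way to tell which of the two formats a path refers to is to look at the file suffix and the folder contents. The sketch below is a hypothetical helper, not LLaMA-Factory's own detection logic. It classifies a path as `gguf`, `folder`, or `unknown`, and demonstrates the folder case on a temporary directory with an empty placeholder file.

```python
import tempfile
from pathlib import Path


def adapter_format(path: str) -> str:
    """Classify an adapter path as 'gguf', 'folder', or 'unknown'."""
    p = Path(path)
    if p.suffix.lower() == ".gguf":
        return "gguf"
    if p.is_dir() and any(p.glob("*.safetensors")):
        return "folder"
    return "unknown"


print(adapter_format("saves/my_adapter.gguf"))

# Demonstrate the folder case with a throwaway directory and a placeholder file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "adapter_model.safetensors").touch()
    print(adapter_format(tmp))
```

An `unknown` result usually means the path is mistyped or training has not saved anything there yet.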

---

### B. Tweak Response Quality (Generation Configs)  
Optional generation parameters let you adjust the modelâ€™s responses to fit specific use cases, whether you need factual accuracy, creative expression, or concise answers. Add these to your YAML and modify based on your needs:
```yaml
# Optional generation configs (add to your inference YAML)
max_new_tokens: 1024  # Max length of responses (512 = short, 2048 = long)
temperature: 0.7      # Randomness (0.1 = factual/consistent, 1.0 = creative/diverse)
top_p: 0.9            # Focus (0.8-0.95 = avoids irrelevant content)
repetition_penalty: 1.1  # Reduces repetition (1.0 = no penalty, 1.2 = strict)
```
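
To see what `temperature` and `top_p` do, here is a minimal pure-Python sketch of the textbook definitions: temperature rescales the logits before the softmax, and top-p (nucleus) filtering keeps the smallest set of tokens whose cumulative probability reaches `top_p`. This illustrates the concepts only and is not the sampler that KTransformers or LLaMA-Factory actually runs; the logits are made up.

```python
import math


def apply_temperature(logits, temperature: float):
    """Softmax over logits / temperature (lower temperature = sharper distribution)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p: float) -> dict:
    """Keep the most likely tokens until their cumulative mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


logits = [2.0, 1.0, 0.0, -1.0]  # made-up scores for four candidate tokens
for t in (0.1, 0.7, 1.0):
    probs = apply_temperature(logits, t)
    print(f"temperature={t}: top token probability = {probs[0]:.3f}")

print(top_p_filter([0.5, 0.3, 0.15, 0.05], top_p=0.9))
```

At low temperature almost all probability sits on the top token, which is why small values give more consistent answers.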

---

### C. KTransformers Inference Backend  
The KTransformers-related settings directly impact inference performance. They must align with your training configuration to maintain optimization effects (e.g., low memory usage, fast speed):
- `infer_backend`: Determines how the model generates responses. Choose `ktransformers` if you LoRA fine-tuned the model with KTransformers.
- `use_kt: true`: Must match training; setting it to `false` disables KT optimization (slower inference!).  
- `kt_optimize_rule`: Use the **exact same file** as training (e.g., `DeepSeek-V2-Lite-Chat-sft-amx.yaml`) so that layers map correctly.  

---

### How to Verify Inference Works
After launching the chat command, check the logs for these key messages to confirm the model is running correctly:
1. `Loaded adapter weight: XXX -> XXX`: LoRA adapter is loaded correctly.  
2. `KTransformers inference enabled`: KT optimization is active.  
3. `Backend: AMXInt8`: AMX acceleration is working (if supported).  

In [None]:
!llamafactory-cli chat examples/inference/deepseek2_lora_sft_kt.yaml