In [1]:
from qwen_finetuning import QwenFineTuningConfig, QwenFineTuning

In [2]:
config = QwenFineTuningConfig(
    model_name="Qwen/Qwen3-8B",
    train_file="data/train.jsonl",
    output_dir="./v4/results",
    
    batch_size=8,                   # Unchanged; fits in 24GB VRAM
    gradient_accumulation_steps=2,  # Effective batch size = 8 * 2 = 16
    learning_rate=2e-4,             # Unchanged
    num_epochs=2,                   # Full passes over the training set
    max_length=512,
    lora_r=16,
    lora_alpha=32,
    
    # AUTOMATIC OPTIMIZATIONS (these are now defaults):
    # - use_rslora=True              # 🚀 RSLoRA for 5-15% better performance
    # - target_modules="all-linear"  # 🚀 All linear layers for maximum performance
    # - lora_dropout=0.1             # 🚀 Better regularization (was 0.05)
    # - lr_scheduler_type="cosine_with_restarts"  # 🚀 Better than linear decay
    # - warmup_ratio=0.03            # 🚀 Optimal warmup for your dataset size
    # - Flash Attention 2 auto-enabled with fallback
)
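
The effective batch size quoted in the comments can be checked with a few lines of arithmetic. This is a minimal sketch that assumes a single GPU; the example count is the one reported later in the training log. The step count is only an upper bound, because a trainer that packs several short examples into one `max_length` sequence will see fewer sequences per epoch.

```python
import math

# Values copied from the config cell above.
batch_size = 8
gradient_accumulation_steps = 2

# Example count as reported in the training log ("Train Dataset: 86929 examples").
num_train_examples = 86_929

# Assumption: one GPU, so no extra multiplication by world size.
effective_batch_size = batch_size * gradient_accumulation_steps

# Upper bound on optimizer steps per epoch (no packing, last partial batch kept).
steps_per_epoch_upper_bound = math.ceil(num_train_examples / effective_batch_size)

print(effective_batch_size)         # 16
print(steps_per_epoch_upper_bound)  # 5434
```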

In [3]:
config.print_config()


✓ Configuration set with optimizations
Model: Qwen/Qwen3-8B
Learning rate: 0.0002
LR scheduler: cosine_with_restarts (warmup: 0.03)
Batch size: 8
Effective batch size: 16
LoRA optimizations:
  - RSLoRA enabled: True
  - Target modules: ['q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj']
  - Rank: 16, Alpha: 32, Dropout: 0.1
Dataset processing cores: 4
Cache writer batch size: 500
DataLoader workers: 4
DataLoader optimizations: pin_memory=True, persistent_workers=True
GPU cache management: empty every 4 steps
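
To make the "RSLoRA enabled" line concrete: rank-stabilized LoRA changes only the factor by which the adapter update is scaled, from `alpha / r` to `alpha / sqrt(r)`. The helper below is a hypothetical illustration of that formula, not part of `qwen_finetuning`.

```python
import math


def lora_scaling(lora_alpha: float, r: int, use_rslora: bool) -> float:
    """Return the multiplier applied to the low-rank update B @ A."""
    if use_rslora:
        return lora_alpha / math.sqrt(r)  # rank-stabilized scaling
    return lora_alpha / r                 # standard LoRA scaling


# Rank and alpha taken from the printed configuration above.
print(lora_scaling(32, 16, use_rslora=False))  # 2.0
print(lora_scaling(32, 16, use_rslora=True))   # 8.0
```

With rank 16 and alpha 32 the adapter update is therefore scaled four times more strongly than under standard LoRA, which is worth keeping in mind when reusing a learning rate tuned without RSLoRA.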


In [4]:
# Create fine-tuning instance
finetuner = QwenFineTuning(config)


✓ Environment loaded, HF token available


In [5]:
# Load training data
train_data = finetuner.load_jsonl(config.train_file)
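
The source of `finetuner.load_jsonl` is not shown in this notebook. As a rough sketch of what a JSONL loader typically does (one JSON object per line, blank lines skipped), a standalone equivalent might look like this. The function below is an assumption for illustration and may differ from the real method.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read a JSON Lines file into a list of Python objects."""
    records = []
    with Path(path).open(encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as exc:
                # Report the offending line so bad records are easy to locate.
                raise ValueError(f"{path}:{line_no}: invalid JSON ({exc.msg})") from exc
    return records
```

Calling `load_jsonl("data/train.jsonl")` would then return a list with one entry per example.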


In [6]:
finetuner.run_complete_finetuning(train_data=train_data)


Train Dataset: 86929 examples
Categories: unknown(86929)
Answer distribution: A(24205), B(24441), C(24625), D(11146), E(2512)
Loading model and tokeniser with optimizations...
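
The answer distribution above is noticeably skewed: D and E are much rarer than A, B and C. The sketch below turns the logged counts into shares and computes the accuracy of always guessing the most frequent label, which is a useful floor when evaluating the fine-tuned model. Only the counts come from the log; the variable names are illustrative.

```python
# Counts copied from the "Answer distribution" line in the log.
answer_counts = {"A": 24205, "B": 24441, "C": 24625, "D": 11146, "E": 2512}

total = sum(answer_counts.values())
assert total == 86_929  # matches "Train Dataset: 86929 examples"

# Fraction of the training set carried by each answer label.
shares = {label: count / total for label, count in answer_counts.items()}

# Always predicting the most common label gives this accuracy on the training set.
majority_label = max(answer_counts, key=answer_counts.get)

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"Majority-class baseline: {majority_label} at {shares[majority_label]:.1%}")
```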


✓ Optimized LoRA configuration applied:
trainable params: 43,646,976 || all params: 8,234,382,336 || trainable%: 0.5301
  - RSLoRA enabled: True
  - Target modules: ['q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj']
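
The trainable-parameter count in the log can be reproduced by hand. Each adapted linear layer of shape `(in, out)` adds `r * (in + out)` LoRA weights. The layer dimensions below are assumptions about the Qwen3-8B architecture (they are not printed anywhere in this notebook), but they reproduce the logged total.

```python
# Assumed Qwen3-8B dimensions (not shown in the log).
HIDDEN = 4096
LAYERS = 36
Q_OUT = 32 * 128       # 32 query heads, head_dim 128
KV_OUT = 8 * 128       # 8 key/value heads, head_dim 128
INTERMEDIATE = 12288
R = 16                 # lora_r from the config

# (in_features, out_features) for each targeted module in one decoder layer.
shapes = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

# LoRA adds A (r x in) and B (out x r) per module.
per_layer = sum(R * (n_in + n_out) for n_in, n_out in shapes.values())
trainable = per_layer * LAYERS

all_params = 8_234_382_336  # "all params" from the log
print(f"{trainable:,}")                        # 43,646,976
print(f"{100 * trainable / all_params:.4f}")   # 0.5301
```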

Example prompt format:
<|im_start|>user
Domanda: Il diario clinico ha lo scopo di:...

A) Permettere la ricostruzione del decorso clinico del residente documentando le scelt...
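
The example prompt is truncated in the log, but its visible shape (a ChatML `user` turn, a `Domanda:` prefix, lettered options) suggests a formatter along the following lines. This is a hand-rolled approximation with a hypothetical function name and placeholder inputs; the spacing between options is a guess, and in practice the tokenizer's `apply_chat_template` would build the final string.

```python
def format_mcq_prompt(question: str, options: dict[str, str]) -> str:
    """Build a ChatML-style prompt for a multiple-choice question."""
    # Assumption: options are separated by single newlines.
    option_lines = "\n".join(f"{label}) {text}" for label, text in options.items())
    user_content = f"Domanda: {question}\n\n{option_lines}"
    # End with an open assistant turn so the model generates the answer.
    return f"<|im_start|>user\n{user_content}<|im_end|>\n<|im_start|>assistant\n"


# Placeholder content, not taken from the dataset.
prompt = format_mcq_prompt(
    "Example question?",
    {"A": "First option", "B": "Second option"},
)
print(prompt)
```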

Optimizations enabled:
  - RSLoRA: True
  - Target modules: ['q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj']
  - LR scheduler: cosine_with_restarts
  - Warmup ratio: 0.03
  - Dataset processing: 4 CPU cores
  - Memory-efficient caching: batch size 500
  - Optimized DataLoader: 4 workers, pin_memory, persistent_workers
  - GPU memory management: cache clearing every 4 steps
Setting up trainer with optimized configuration...
✓ Processing and caching dataset with memory optimization...
Formatting dataset with 4 processes (memory: 42.6GB)...
✓ Formatting complete (memory: 42.6GB → 42.4GB)
✓ Processing complete (memory: 42.5GB → 42.4GB)
✓ Saving to cache with optimized batch size (500)...
✓ Dataset cached efficiently to: cache/processed_datasets/342ab7c6db43e6f8df6ed1e851ed55d9
✓ Total memory usage: 42.5GB → 42.3GB
✓ Training optimizations enabled:
  - Learning rate scheduler: cosine_with_restarts
  - Warmup ratio: 0.03
  - DataLoader workers: 4 (parallel data loading)
  - Pin memory: True (faster GPU transfer)
  - Persistent workers: True (reduced startup overhead)
  - GPU cache clearing: every 4 steps
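
For intuition about the scheduler settings listed above, here is a pure-Python sketch of a linear warmup followed by cosine decay with hard restarts. It mirrors the formula used by the Transformers scheduler of the same name as I understand it. `total_steps=1000` is a hypothetical value, since the real step count is not shown in the log. With the default of one cycle, the curve is identical to plain cosine decay; restarts only appear when `num_cycles` is greater than 1.

```python
import math


def cosine_with_restarts_lr(step, total_steps, base_lr=2e-4,
                            warmup_ratio=0.03, num_cycles=1):
    """Learning rate at `step` for warmup + cosine decay with hard restarts."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    if progress >= 1.0:
        return 0.0
    # The modulo restarts the cosine at the start of every cycle.
    cosine = 0.5 * (1.0 + math.cos(math.pi * ((num_cycles * progress) % 1.0)))
    return base_lr * max(0.0, cosine)


total_steps = 1000  # hypothetical; warmup is then ceil(0.03 * 1000) = 30 steps
for step in (0, 15, 30, 515, 1000):
    print(step, cosine_with_restarts_lr(step, total_steps))
```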




No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


✓ Trainer configured with optimizations
Starting optimized training...


`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


Step,Training Loss
20,1.4626
40,1.1561
60,1.1361
80,1.0649
100,1.0947
120,1.0772
140,1.0748
160,1.0593
180,1.0579
200,1.0624
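
The loss table above drops quickly over the first 40 steps and then flattens out. A small sketch for summarizing it: parse the logged rows, average the last few entries to smooth out step-to-step noise, and convert the final loss to a perplexity. The window of five entries is an arbitrary choice for illustration.

```python
import csv
import io
import math

# Rows copied verbatim from the training output above.
LOSS_LOG = """Step,Training Loss
20,1.4626
40,1.1561
60,1.1361
80,1.0649
100,1.0947
120,1.0772
140,1.0748
160,1.0593
180,1.0579
200,1.0624
"""

rows = [
    (int(row["Step"]), float(row["Training Loss"]))
    for row in csv.DictReader(io.StringIO(LOSS_LOG))
]
steps, losses = zip(*rows)

# Mean over the last five logged values as a simple smoothed estimate.
tail = losses[-5:]
tail_mean = sum(tail) / len(tail)

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
print(f"mean of last {len(tail)} logged losses: {tail_mean:.4f}")
print(f"perplexity at step {steps[-1]}: {math.exp(losses[-1]):.2f}")
```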


Saving model...
✓ Optimized training completed


In [7]:
print("\n1-epoch fine-tuning completed successfully")
print(f"Model saved to: {config.output_dir}")


1-epoch fine-tuning completed successfully
Model saved to: ./v4/results
