# DeepFabric: Dataset Generation + SFT Training on CUDA

This notebook demonstrates the complete workflow:
1. Generate a synthetic dataset using a local LLM
2. Fine-tune a model on the generated dataset
3. Save the model and optionally upload it to the Hugging Face Hub

**Requirements:**
- NVIDIA GPU with CUDA support
- ~16GB GPU memory for 7B models
- Python 3.11+
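
The 16 GB figure is consistent with a back-of-the-envelope estimate of the memory needed just to hold the weights. The sketch below is only that rough estimate: the helper name `estimate_weight_memory_gb` is hypothetical, the parameter counts are nominal, and activations, optimizer state, and the KV cache are ignored.

```python
def estimate_weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Rough memory needed to hold model weights, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Nominal parameter counts; real checkpoints differ slightly.
for label, params in [("3B", 3e9), ("7B", 7e9)]:
    bf16_gb = estimate_weight_memory_gb(params, 2)    # bfloat16: 2 bytes per parameter
    int4_gb = estimate_weight_memory_gb(params, 0.5)  # 4-bit: 0.5 bytes per parameter
    print(f"{label}: ~{bf16_gb:.1f} GB in bfloat16, ~{int4_gb:.1f} GB in 4-bit")
```

The gap between the weight-only estimate and the stated requirement is the headroom that training and generation need on top of the weights.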

## 1. Setup and Installation

In [None]:
# Install DeepFabric with training dependencies
# (%pip installs into the environment of the running kernel)
%pip install -q 'deepfabric[training]'

In [None]:
# Verify CUDA availability
import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
else:
    print("⚠️ CUDA not available! This notebook requires a GPU.")
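
Later cells request bfloat16, which NVIDIA GPUs support natively from the Ampere generation (compute capability 8.0) onward. A minimal sketch of a dtype check follows. `pick_dtype_name` is a hypothetical helper, and whether DeepFabric accepts `"float16"` as a fallback is an assumption to verify against its documentation.

```python
def pick_dtype_name(compute_capability_major: int) -> str:
    """Return "bfloat16" on Ampere-or-newer GPUs (major >= 8), else "float16"."""
    return "bfloat16" if compute_capability_major >= 8 else "float16"


try:
    import torch
except ImportError:  # keeps the sketch runnable on machines without PyTorch
    torch = None

if torch is not None and torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print(f"Compute capability: {major}.{minor}")
    print(f"Suggested dtype: {pick_dtype_name(major)}")
    print(f"PyTorch reports bf16 support: {torch.cuda.is_bf16_supported()}")
else:
    print("No CUDA device visible; dtype check skipped.")
```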

## 2. Configuration

In [None]:
# Configuration
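# Llama models are gated on the Hugging Face Hub: accept the license and log in before downloading.
MODEL_NAME = "meta-llama/Llama-3.2-3B-Instruct"  # or "Qwen/Qwen2.5-7B-Instruct"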
TOPIC = "Python Programming and Machine Learning"
NUM_SAMPLES = 200  # Number of training examples to generate
DATASET_PATH = "./synthetic_dataset.jsonl"
OUTPUT_DIR = "./trained_model"

# Training configuration
NUM_EPOCHS = 3
BATCH_SIZE = 4
LEARNING_RATE = 2e-5
USE_LORA = True  # Use LoRA for memory efficiency
LORA_R = 32  # LoRA rank

# Optional: HuggingFace Hub
PUSH_TO_HUB = False
HUB_MODEL_ID = "your-username/your-model-name"

print("Configuration:")
print(f"  Model: {MODEL_NAME}")
print(f"  Topic: {TOPIC}")
print(f"  Samples: {NUM_SAMPLES}")
print(f"  LoRA: {USE_LORA}")
print(f"  Device: {'cuda' if torch.cuda.is_available() else 'cpu'}")

## 3. Dataset Generation

Generate synthetic training data by running `MODEL_NAME` locally on the GPU.

In [None]:
from deepfabric.pipeline import DeepFabricPipeline

# Initialize pipeline with CUDA
pipeline = DeepFabricPipeline(
    model_name=MODEL_NAME,
    provider="transformers",
    device="cuda",
    torch_dtype="bfloat16",  # bfloat16 needs an Ampere-or-newer GPU (compute capability 8.0+)
    model_kwargs={
        "low_cpu_mem_usage": True,
        "trust_remote_code": True,  # Runs code shipped in the model repo; enable only for sources you trust
    },
)

print("Pipeline initialized successfully!")

### Understanding Generation Prompts

DeepFabric uses two types of prompts:

1. **`generation_system_prompt`**: Used during dataset generation to guide the LLM in creating training data
   - Controls the quality, style, and focus of generated examples
   - Should describe what kind of expert the LLM should be
   - Should specify requirements for the training data

2. **`dataset_system_prompt`**: (Optional) Included in the final dataset as the system message
   - If not provided, defaults to `generation_system_prompt`
   - Use when you want different behavior during generation vs. training
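
The fallback rule can be pictured with a small sketch. This is not DeepFabric's implementation: `resolve_dataset_system_prompt` and `build_sample` are hypothetical helpers that only illustrate the default described above, and the example prompts are made up.

```python
from typing import Optional


def resolve_dataset_system_prompt(
    generation_system_prompt: str,
    dataset_system_prompt: Optional[str] = None,
) -> str:
    """Use the dataset prompt when given, otherwise fall back to the generation prompt."""
    if dataset_system_prompt is None:
        return generation_system_prompt
    return dataset_system_prompt


def build_sample(system_prompt: str, question: str, answer: str) -> dict:
    """Assemble one chat-style sample with the system message first (illustrative layout)."""
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ]
    }


generation_prompt = "You are an expert training data generator."
tutor_prompt = "You are a helpful Python tutor."

# No dataset prompt: the generation prompt ends up in the stored sample.
print(build_sample(resolve_dataset_system_prompt(generation_prompt), "Q", "A")["messages"][0])
# Separate dataset prompt: the trained model sees the tutor persona instead.
print(build_sample(resolve_dataset_system_prompt(generation_prompt, tutor_prompt), "Q", "A")["messages"][0])
```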

**Additional Parameters:**
- `instructions`: Extra instructions for data generation (e.g., "Focus on edge cases")
- `temperature`: Controls creativity (0.0 = deterministic, 2.0 = very creative)
- `max_tokens`: Maximum tokens per generated sample
- `reasoning_style`: For CoT types - "mathematical", "logical", or "general"
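
To make the effect of `temperature` concrete, the sketch below shows how samplers commonly apply it: logits are divided by the temperature before the softmax, so low values sharpen the distribution and high values flatten it. The logits are made up, `softmax_with_temperature` is a hypothetical helper, and how a given provider handles exactly 0.0 (usually greedy decoding) is an assumption.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; treat 0 as greedy (argmax) decoding")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0.2, 0.7, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```

As the temperature falls, nearly all probability mass lands on the top-scoring token, which is why low settings produce more repeatable output.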

In [None]:
# Generate dataset with custom generation prompts
print(f"Generating {NUM_SAMPLES} synthetic training examples...")
print("This may take several minutes depending on your GPU...\n")

# Custom system prompt for generation (controls how the LLM generates data)
GENERATION_SYSTEM_PROMPT = """You are an expert AI training data generator specializing in Python programming and machine learning.

Your task is to create high-quality, diverse training examples that:
- Cover both fundamental concepts and advanced topics
- Include practical, real-world scenarios
- Demonstrate best practices and common patterns
- Provide clear, accurate explanations
- Use proper code examples when relevant"""

dataset = pipeline.generate_dataset(
    # Topic Configuration
    topic_prompt=TOPIC,
    tree_depth=3,  # How deep the topic tree goes (more depth = more specific topics)
    tree_degree=5,  # How many subtopics per level (more = more variety)

    # Generation Configuration
    num_samples=NUM_SAMPLES,
    batch_size=5,  # Process 5 samples at a time for efficiency
    conversation_type="cot_structured",  # Chain-of-thought with structured output

    # Custom Prompts (optional but recommended for better quality)
    generation_system_prompt=GENERATION_SYSTEM_PROMPT,  # Controls data generation quality
    # dataset_system_prompt=None,  # Optional: different system prompt in final dataset

    # Additional Options (uncomment to use)
    # instructions="Focus on practical examples with code",  # Additional instructions
    # temperature=0.8,  # Higher = more creative (default 0.7)
    # max_tokens=2000,  # Max tokens per sample (default 2000)
    # reasoning_style="logical",  # For CoT: "mathematical", "logical", or "general"
)

print(f"\n✓ Generated {len(dataset)} samples")
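
The `tree_depth` and `tree_degree` values in the call above determine how many distinct topics the samples can be spread over. The sketch below assumes a full tree in which every node has exactly `tree_degree` children and samples are distributed over leaf topics. That is an assumption about how the topic tree is built, and `topic_tree_size` is a hypothetical helper.

```python
def topic_tree_size(depth: int, degree: int) -> tuple[int, int]:
    """Return (leaf_count, nodes_below_root) for a full tree of the given shape."""
    leaves = degree ** depth
    nodes_below_root = sum(degree ** level for level in range(1, depth + 1))
    return leaves, nodes_below_root


leaves, nodes = topic_tree_size(depth=3, degree=5)
num_samples = 200  # matches NUM_SAMPLES in the configuration cell
print(f"Leaf topics: {leaves}, nodes below the root: {nodes}")
print(f"Samples per leaf topic: {num_samples / leaves:.2f}")
```

Under these assumptions, raising either value quickly increases the number of topics, so it is worth checking that `NUM_SAMPLES` is large enough to cover them.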

In [None]:
# Save dataset
dataset.save(DATASET_PATH)
print(f"✓ Dataset saved to: {DATASET_PATH}")

### Inspect Generated Data

In [None]:
# Show a sample conversation
if len(dataset) > 0:
    sample = dataset.samples[0]
    print("Sample Conversation:")
    print("=" * 80)
    
    if "messages" in sample:
        for msg in sample["messages"][:4]:  # Show first 4 messages
            role = msg.get("role", "unknown")
            content = msg.get("content", "")
            print(f"\n{role.upper()}:")
            print(content[:500])  # Truncate long messages
            if len(content) > 500:
                print("...")
    
    print("\n" + "=" * 80)
    # Average number of messages across all samples
    avg_messages = sum(len(s.get("messages", [])) for s in dataset.samples) / len(dataset)
    print("\nDataset statistics:")
    print(f"  Total samples: {len(dataset)}")
    print(f"  Average messages per sample: {avg_messages:.1f}")
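
The saved file can also be inspected without DeepFabric. The sketch below assumes the `.jsonl` file holds one JSON object per line, each with a `messages` list like the sample shown above. The actual layout depends on what `dataset.save` writes, so treat `load_jsonl` and `count_roles` as hypothetical helpers. A small temporary file stands in for the real dataset.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def load_jsonl(path: str) -> list[dict]:
    """Read one JSON object per non-empty line."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def count_roles(records: list[dict]) -> Counter:
    """Count how often each message role appears across all records."""
    return Counter(
        msg.get("role", "unknown")
        for record in records
        for msg in record.get("messages", [])
    )


# Demo data standing in for the generated dataset
demo = [{"messages": [{"role": "user", "content": "hi"},
                      {"role": "assistant", "content": "hello"}]}]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.jsonl"
    demo_path.write_text("\n".join(json.dumps(r) for r in demo) + "\n", encoding="utf-8")
    records = load_jsonl(str(demo_path))

print(f"Records: {len(records)}, roles: {dict(count_roles(records))}")
```

Pointing `load_jsonl` at `DATASET_PATH` gives a quick check that every record has the expected roles before training starts.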

## 4. Fine-Tuning with SFT

Train the model using the generated dataset.

In [None]:
from deepfabric.training import SFTTrainingConfig, LoRAConfig

# Configure training
training_config = SFTTrainingConfig(
    model_name=MODEL_NAME,
    output_dir=OUTPUT_DIR,
    
    # Training hyperparameters
    num_train_epochs=NUM_EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    gradient_accumulation_steps=4,  # Effective batch size = BATCH_SIZE * 4 = 16 with the defaults above
    learning_rate=LEARNING_RATE,
    weight_decay=0.01,
    warmup_ratio=0.03,
    
    # LoRA configuration (memory efficient)
    lora=LoRAConfig(
        enabled=USE_LORA,
        r=LORA_R,
        lora_alpha=LORA_R * 2,
        lora_dropout=0.05,
        target_modules="all-linear",
    ),
    
    # Memory optimization
    bf16=True,  # Use bfloat16 mixed precision
    gradient_checkpointing=True,
    
    # Logging
    logging_steps=10,
    save_strategy="epoch",
    save_total_limit=2,
    
    # HuggingFace Hub (optional)
    push_to_hub=PUSH_TO_HUB,
    hub_model_id=HUB_MODEL_ID if PUSH_TO_HUB else None,
)

print("Training configuration:")
print(f"  Epochs: {NUM_EPOCHS}")
print(f"  Batch size: {BATCH_SIZE} (effective: {BATCH_SIZE * 4})")
print(f"  Learning rate: {LEARNING_RATE}")
print(f"  LoRA: {USE_LORA} (r={LORA_R})")
print("  Mixed precision: bf16")
print("  Gradient checkpointing: True")
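
To see why LoRA is described as memory efficient, the sketch below counts the trainable parameters it adds to a single linear layer and computes the standard `lora_alpha / r` scaling factor. The 4096 x 4096 layer shape is illustrative rather than read from the model config, and both helper names are hypothetical.

```python
def lora_added_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters in the two low-rank factors: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


def lora_scaling(lora_alpha: float, r: int) -> float:
    """Standard LoRA scaling factor applied to the low-rank update."""
    return lora_alpha / r


d_in = d_out = 4096  # illustrative square projection, not taken from the model
r = 32               # matches LORA_R in the configuration cell
full = d_in * d_out
added = lora_added_params(d_in, d_out, r)
print(f"Full weight: {full:,} params; LoRA adds {added:,} ({added / full:.2%})")
print(f"Scaling alpha / r = {lora_scaling(2 * r, r)}")
```

Setting `lora_alpha` to twice the rank, as the config above does, keeps the scaling factor at 2 regardless of the rank chosen.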

In [None]:
# Start training
print("Starting training...\n")
print("This will take some time depending on:")
print("  - Dataset size")
print("  - Model size")
print("  - Number of epochs")
print("  - GPU speed\n")

metrics = pipeline.train(
    training_config=training_config,
    formatting_func="trl_sft_tools",  # Format for chat-based training
)

print("\n✓ Training complete!")
print(f"\nFinal metrics: {metrics}")
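
A rough step count helps set expectations for training time and for how often `logging_steps=10` produces output. The sketch below assumes a single GPU and that all generated samples are used for training. The exact rounding differs between trainer versions, so treat the result as approximate. `estimate_optimizer_steps` is a hypothetical helper.

```python
import math


def estimate_optimizer_steps(
    num_samples: int,
    per_device_batch: int,
    grad_accum: int,
    epochs: int,
    num_devices: int = 1,
) -> int:
    """Approximate number of optimizer updates over the whole run."""
    batches_per_epoch = math.ceil(num_samples / (per_device_batch * num_devices))
    updates_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    return updates_per_epoch * epochs


# Values mirror the configuration cells: 200 samples, batch 4, accumulation 4, 3 epochs
total_steps = estimate_optimizer_steps(200, 4, 4, 3)
warmup_steps = math.ceil(0.03 * total_steps)  # warmup_ratio=0.03
print(f"Approximate optimizer steps: {total_steps}")
print(f"Approximate warmup steps: {warmup_steps}")
print(f"Approximate log entries at logging_steps=10: {total_steps // 10}")
```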

## 5. Save and Deploy

In [None]:
# Save everything
pipeline.save_and_upload(
    output_dir=OUTPUT_DIR,
    dataset_path=DATASET_PATH,
    hf_repo=HUB_MODEL_ID if PUSH_TO_HUB else None,
)

print(f"✓ Model saved to: {OUTPUT_DIR}")
print(f"✓ Dataset saved to: {DATASET_PATH}")
if PUSH_TO_HUB:
    print(f"✓ Uploaded to HuggingFace Hub: {HUB_MODEL_ID}")

## 6. Test the Fine-Tuned Model

In [None]:
# Load the fine-tuned model for inference
from transformers import AutoModelForCausalLM, AutoTokenizer

print("Loading fine-tuned model...")
model = AutoModelForCausalLM.from_pretrained(
    OUTPUT_DIR,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(OUTPUT_DIR)

print("✓ Model loaded")

In [None]:
# Test with a sample prompt
test_prompt = "What is the difference between a list and a tuple in Python?"

messages = [
    {"role": "user", "content": test_prompt}
]

# Format using chat template
formatted = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

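# The chat template already inserts special tokens, so do not add them a second time.
# Send inputs to the model's device instead of hard-coding "cuda" (device_map="auto" decides placement).
inputs = tokenizer(formatted, return_tensors="pt", add_special_tokens=False).to(model.device)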

print(f"Test prompt: {test_prompt}\n")
print("Generating response...\n")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        temperature=0.7,
        do_sample=True,
        top_p=0.9,
    )

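# Decode only the newly generated tokens so the prompt is not echoed back.
# (Splitting the full text on "assistant" breaks whenever the reply itself contains that word.)
prompt_length = inputs["input_ids"].shape[1]
response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True).strip()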

print("Response:")
print("=" * 80)
print(response)
print("=" * 80)

## 7. Cleanup (Optional)

In [None]:
# Free GPU memory
import gc

del model
del tokenizer
del pipeline
gc.collect()
torch.cuda.empty_cache()

print("✓ GPU memory freed")

## Tips for Google Colab / Kaggle

### Google Colab
1. Use a GPU runtime: Runtime → Change runtime type → GPU
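2. For larger models, Colab Pro offers stronger GPUs (e.g., A100, V100). The V100 predates Ampere and lacks native bfloat16 support, so the `bf16` settings in this notebook may not work on it.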
3. Mount Google Drive to save models:
   ```python
   from google.colab import drive
   drive.mount('/content/drive')
   OUTPUT_DIR = '/content/drive/MyDrive/models/my_model'
   ```

### Kaggle
1. Enable GPU: Settings → Accelerator → GPU
2. Kaggle provides roughly 30 hours of GPU time per week (the quota can vary)
3. Save to `/kaggle/working` or use Kaggle Datasets

### Memory Tips
- Use smaller models (3B instead of 7B) if you hit OOM errors
- Reduce `BATCH_SIZE` to 1 or 2
- Increase `gradient_accumulation_steps` to maintain effective batch size
- Enable `gradient_checkpointing=True`
- Use 4-bit quantization for very large models:
  ```python
  training_config = SFTTrainingConfig(
      ...
      quantization=QuantizationConfig(
          load_in_4bit=True,
          bnb_4bit_compute_dtype="bfloat16",
      ),
  )
  ```
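
The batch-size trade-off in the memory tips above can be sketched as follows: when `BATCH_SIZE` shrinks, `gradient_accumulation_steps` grows so that their product stays at the original effective batch size of 16. `accumulation_steps_for` is a hypothetical helper, and a single GPU is assumed by default.

```python
import math


def accumulation_steps_for(
    target_effective_batch: int,
    per_device_batch: int,
    num_devices: int = 1,
) -> int:
    """Smallest accumulation step count that reaches the target effective batch size."""
    return max(1, math.ceil(target_effective_batch / (per_device_batch * num_devices)))


for batch_size in (4, 2, 1):
    steps = accumulation_steps_for(16, batch_size)
    print(f"BATCH_SIZE={batch_size} -> gradient_accumulation_steps={steps} "
          f"(effective batch size {batch_size * steps})")
```

Smaller per-device batches lower peak activation memory at the cost of more forward and backward passes per optimizer update.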