# StyleForge - Real-Time Neural Style Transfer with CUDA Kernels

This notebook demonstrates the StyleForge system with optimized CUDA kernels for real-time neural style transfer.

## Features

- **Fused Multi-Head Attention**: 4-8x faster than PyTorch with vectorized memory access
- **Fused FFN**: 3-5x speedup for feed-forward layers
- **Fused Instance Norm**: 2-4x faster normalization for style transfer
- **Proper Benchmarking**: CUDA event-based timing with validation
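
The CUDA event-based timing mentioned above can be sketched as follows. This is a minimal illustration, not the repository's own benchmarking code: the helper name `time_ms` and its `warmup`/`iters` defaults are assumptions, and it falls back to `time.perf_counter` when PyTorch or a GPU is not present.

```python
import time

try:
    import torch
except ImportError:  # lets the sketch run on machines without PyTorch
    torch = None


def time_ms(fn, warmup=3, iters=10):
    """Return the mean time per call of fn() in milliseconds (hypothetical helper)."""
    use_cuda = torch is not None and torch.cuda.is_available()
    for _ in range(warmup):  # warm-up calls absorb JIT compilation and caching
        fn()
    if use_cuda:
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        torch.cuda.synchronize()  # drain pending GPU work before timing
        start.record()
        for _ in range(iters):
            fn()
        end.record()
        torch.cuda.synchronize()  # wait until the end event has completed
        return start.elapsed_time(end) / iters
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - t0) * 1000 / iters


print(f"{time_ms(lambda: sum(range(1000))):.4f} ms per call")
```

Events are recorded on the GPU stream, so the measurement excludes most Python-side overhead that wall-clock timing would include.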

## Requirements

- CUDA 11.0+ GPU with Compute Capability 7.0+
- PyTorch 1.10+ with CUDA support
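
A quick way to check these requirements before building the kernels is sketched below. The helper `capability_ok` and the constant `MIN_CAPABILITY` are hypothetical names introduced for illustration; the PyTorch calls are the standard device-query functions.

```python
try:
    import torch
except ImportError:  # the sketch still runs without PyTorch installed
    torch = None

MIN_CAPABILITY = (7, 0)  # taken from the requirements list above


def capability_ok(capability, minimum=MIN_CAPABILITY):
    """True if a (major, minor) compute capability meets the minimum."""
    return tuple(capability) >= tuple(minimum)


if torch is not None and torch.cuda.is_available():
    cap = torch.cuda.get_device_capability(0)
    print(f"PyTorch {torch.__version__}, CUDA {torch.version.cuda}")
    status = "OK" if capability_ok(cap) else "below 7.0"
    print(f"Compute capability {cap[0]}.{cap[1]}: {status}")
else:
    print("No CUDA-capable PyTorch build or GPU detected")
```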

## 0. Clone Repository

Run this cell first to set up the environment.

In [None]:
# Clone the repository (skip if already cloned)
import os
import subprocess

REPO_URL = "https://github.com/oleeveeuh/StyleForge.git"
REPO_DIR = "/content/StyleForge"  # For Google Colab

# Check if running in Colab
try:
    import google.colab
    IN_COLAB = True
    print("📌 Running in Google Colab")
except ImportError:
    IN_COLAB = False
    print("📌 Not running in Google Colab")

# Clone repository if not exists
if IN_COLAB and not os.path.exists(REPO_DIR):
    print(f"Cloning StyleForge repository to {REPO_DIR}...")
    !git clone {REPO_URL} {REPO_DIR}
    %cd {REPO_DIR}
elif os.path.exists("StyleForge"):
    %cd StyleForge
    print("Already in StyleForge directory")
elif os.path.exists("../StyleForge"):
    %cd ../StyleForge
    print("Changed to parent StyleForge directory")
else:
    print("Assuming we're in the StyleForge directory")

print("\nRepository setup complete!")

## 1. Install Dependencies and Build Tools

In [None]:
# Install PyTorch with CUDA support and build tools
import sys
import subprocess

def install_package(package):
    """Install a package with pip."""
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", package])

print("=" * 70)
print("STEP 1: Installing Dependencies")
print("=" * 70)

# Check for ninja
print("\nChecking for ninja...")
try:
    result = subprocess.run(['ninja', '--version'], capture_output=True, timeout=5)
    if result.returncode == 0:
        print("✓ ninja already installed")
    else:
        raise FileNotFoundError
except (FileNotFoundError, subprocess.TimeoutExpired):
    install_package("ninja")
    print("✓ ninja installed")

# Check PyTorch
print("\nChecking PyTorch...")
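try:
    import torch
    print(f"✓ PyTorch {torch.__version__} installed")
except ImportError:
    install_package("torch")
    import torch  # import after installing so the CUDA checks below can run
    print(f"✓ PyTorch {torch.__version__} installed")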

print(f"\nCUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    device = torch.device('cuda')
else:
    device = torch.device('cpu')

## 2. Environment Setup

In [None]:
import torch
import torch.nn as nn
import numpy as np
import time
from pathlib import Path

print("=" * 70)
print("STEP 2: Setting Up Environment")
print("=" * 70)

# Setup path
if IN_COLAB:
    import sys
    sys.path.insert(0, REPO_DIR)

print(f"Working directory: {Path.cwd()}")
print(f"Device: {device}")

## 3. Import StyleForge Kernels

The kernels will be JIT-compiled on first use. This may take 30-60 seconds.

In [None]:
if torch.cuda.is_available():
    print("=" * 70)
    print("Loading CUDA Kernels...")
    print("=" * 70)
    
    KERNELS_AVAILABLE = False
    
    try:
        from kernels.attention_wrapper import FusedAttention
        print("✅ FusedAttention imported")
        
        try:
            from kernels import FusedFFN, FusedInstanceNorm2d
            print("✅ FusedFFN and FusedInstanceNorm2d imported")
        except ImportError:
            print("⚠️ FusedFFN/FusedInstanceNorm2d not available")
            FusedFFN = None
            FusedInstanceNorm2d = None
        
        KERNELS_AVAILABLE = True
    except Exception as e:
        print(f"❌ Failed to load kernels: {e}")
        FusedAttention = None
        FusedFFN = None
        FusedInstanceNorm2d = None

else:
    print("⚠️ CUDA not available")
    KERNELS_AVAILABLE = False

## 4. Fast Style Transfer (Johnson et al.)

This section demonstrates **Fast Neural Style Transfer** using pre-trained weights.

### Available Styles: candy, starry, mosaic, udnie, wave

In [None]:
if torch.cuda.is_available():
    print("=" * 70)
    print("Fast Style Transfer Setup")
    print("=" * 70)
    
    from models.transformer_net import TransformerNet, AVAILABLE_STYLES
    from pathlib import Path
    
    print(f"Available styles: {', '.join(AVAILABLE_STYLES)}")
    
    # Check for pretrained weights
    checkpoint_path = Path('saved_models/candy.pth')
    if checkpoint_path.exists():
        print("✅ Found pre-trained weights")
    else:
        print("⚠️ No pre-trained weights (using random init)")
        checkpoint_path = None

else:
    checkpoint_path = None

In [None]:
# Load Fast Style Transfer Model
if torch.cuda.is_available():
    from models.transformer_net import TransformerNet
    
    style_model = TransformerNet(num_residual_blocks=5).to(device)
    
    if checkpoint_path and checkpoint_path.exists():
        style_model.load_checkpoint(str(checkpoint_path))
        print("✅ Loaded pre-trained weights")
    
    style_model.eval()
    
    total_params = sum(p.numel() for p in style_model.parameters())
    print(f"Parameters: {total_params:,}")
    print("✅ Model loaded")

else:
    style_model = None

In [None]:
# Test with random input
if torch.cuda.is_available() and style_model is not None:
    test_input = torch.randn(1, 3, 256, 256, device=device)
    
    with torch.no_grad():
        output = style_model(test_input)
    
    print(f"Input: {test_input.shape}")
    print(f"Output: {output.shape}")
    print("✅ Fast Style Transfer working!")

## 5. Image Upload & Style Transfer

Upload your own images to apply style transfer.

### Instructions:
1. Run the cell below
2. Click "Choose files" to upload an image
3. The stylized result will be displayed and available for download
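
Before inference, the cell below scales each upload so that its longer side is 512 pixels while keeping the aspect ratio. A minimal sketch of that computation as a standalone helper follows. The name `fit_longer_side` is hypothetical, and it rounds rather than truncates, so its result can differ from the cell's by one pixel.

```python
def fit_longer_side(width, height, target=512):
    """Scale (width, height) so the longer side equals target, keeping aspect ratio."""
    longer = max(width, height)
    # Multiply before dividing so exact ratios stay exact in floating point
    new_w = max(1, round(width * target / longer))
    new_h = max(1, round(height * target / longer))
    return new_w, new_h


print(fit_longer_side(1920, 1080))  # landscape input
print(fit_longer_side(600, 800))    # portrait input
```

The returned tuple is in `(width, height)` order, which is the order `PIL.Image.resize` expects.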

In [None]:
if torch.cuda.is_available() and style_model is not None:
    try:
        from google.colab import files
        from io import BytesIO
        from PIL import Image
        import matplotlib.pyplot as plt
        from torchvision import transforms
        
        print("=" * 70)
        print("Image Upload & Style Transfer")
        print("=" * 70)
        print("\n📁 Upload an image:\n")
        
        uploaded = files.upload()
        
        if uploaded:
            for filename in uploaded.keys():
                print(f"\nProcessing {filename}...")
                
                img = Image.open(BytesIO(uploaded[filename])).convert('RGB')
                original_size = img.size
                
                # Resize for processing
                PROCESSING_SIZE = 512
                aspect = img.size[0] / img.size[1]
                if aspect > 1:
                    new_size = (PROCESSING_SIZE, int(PROCESSING_SIZE / aspect))
                else:
                    new_size = (int(PROCESSING_SIZE * aspect), PROCESSING_SIZE)
                img_resized = img.resize(new_size, Image.Resampling.LANCZOS)
                
                # Convert to tensor
                transform = transforms.Compose([transforms.ToTensor()])
                input_tensor = transform(img_resized).unsqueeze(0).to(device)
                
                # Apply style transfer
                with torch.no_grad():
                    start = time.perf_counter()
                    output_tensor = style_model(input_tensor)
                    torch.cuda.synchronize()
                    elapsed_ms = (time.perf_counter() - start) * 1000
                
                # Convert back
                output_img = transforms.ToPILImage()(output_tensor.squeeze(0).clamp(0, 1))
                output_img = output_img.resize(original_size, Image.Resampling.LANCZOS)
                
                # Display
                fig, axes = plt.subplots(1, 2, figsize=(14, 6))
                axes[0].imshow(img)
                axes[0].set_title('Original')
                axes[0].axis('off')
                axes[1].imshow(output_img)
                axes[1].set_title(f'Stylized ({elapsed_ms:.1f} ms)')
                axes[1].axis('off')
                plt.tight_layout()
                plt.show()
                
                # Save and download
                result_filename = f'stylized_{filename}'
                output_img.save(result_filename, quality=95)
                print(f"✅ Saved: {result_filename}")
                files.download(result_filename)
    
    except ImportError:
        print("\nNote: Image upload works in Google Colab.")
        print("For local usage, use PIL.Image.open()")

else:
    print("⚠️ CUDA not available or model not loaded")

## 6. Video File Style Transfer

Process video files frame-by-frame with style transfer.

### Instructions:
- Run the script below locally with your video file
- Or upload a video in Colab (short videos work best)

In [None]:
if torch.cuda.is_available() and style_model is not None:
    print("=" * 70)
    print("Video File Style Transfer")
    print("=" * 70)
    print("\nRun this code locally with your video file:\n")
    
    print("""
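import cv2
import numpy as np
import torch
from torchvision import transforms
from PIL import Image

# Assumes `style_model` and `device` are already set up as in Section 4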

# Configuration
INPUT_VIDEO = "input.mp4"
OUTPUT_VIDEO = "stylized_output.mp4"
TARGET_WIDTH = 640

# Open video
cap = cv2.VideoCapture(INPUT_VIDEO)
fps = cap.get(cv2.CAP_PROP_FPS)
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
target_height = int(TARGET_WIDTH * height / width)

# Setup writer
fourcc = cv2.VideoWriter_fourcc(*'mp4v')
out = cv2.VideoWriter(OUTPUT_VIDEO, fourcc, fps, (TARGET_WIDTH, target_height))

# Process
transform = transforms.Compose([transforms.ToTensor()])
to_pil = transforms.ToPILImage()

while True:
    ret, frame = cap.read()
    if not ret: break
    
    # Resize and process
    frame_resized = cv2.resize(frame, (TARGET_WIDTH, target_height))
    frame_rgb = cv2.cvtColor(frame_resized, cv2.COLOR_BGR2RGB)
    img_pil = Image.fromarray(frame_rgb)
    input_tensor = transform(img_pil).unsqueeze(0).to(device)
    
    with torch.no_grad():
        output_tensor = style_model(input_tensor)
    
    output_img = to_pil(output_tensor.squeeze(0).clamp(0, 1))
    output_array = np.array(output_img)
    output_bgr = cv2.cvtColor(output_array, cv2.COLOR_RGB2BGR)
    out.write(output_bgr)

cap.release()
out.release()
print(f"Done! Saved: {OUTPUT_VIDEO}")
    """)
    
    # For Colab upload
    try:
        from google.colab import files
        print("\n📁 Upload a video file:")
        files.upload()
    except ImportError:
        pass

else:
    print("⚠️ CUDA not available or model not loaded")

## 7. Real-Time Webcam Style Transfer

Process live webcam feed with style transfer.

### Instructions:
- Run the script below locally with a webcam
- Press 'q' to quit, 's' to save a frame
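
To judge whether the webcam loop is actually running in real time, a rolling frame-rate estimate can be added to it. The sketch below is an illustrative helper that is not part of StyleForge; the class name `FPSMeter` and its `window` parameter are assumptions, and the timestamps are simulated.

```python
from collections import deque


class FPSMeter:
    """Rolling frames-per-second estimate over the last `window` frames."""

    def __init__(self, window=30):
        self.stamps = deque(maxlen=window)

    def tick(self, now):
        """Record a frame timestamp in seconds and return the current FPS."""
        self.stamps.append(now)
        if len(self.stamps) < 2:
            return 0.0  # not enough frames yet for an estimate
        span = self.stamps[-1] - self.stamps[0]
        return (len(self.stamps) - 1) / span if span > 0 else 0.0


# Simulated timestamps 40 ms apart stand in for real frame times
meter = FPSMeter(window=5)
for i in range(10):
    fps = meter.tick(i * 0.04)
print(f"{fps:.1f} FPS")
```

In the webcam loop, calling `meter.tick(time.perf_counter())` once per displayed frame would give the live value.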

In [None]:
if torch.cuda.is_available() and style_model is not None:
    print("=" * 70)
    print("Real-Time Webcam Style Transfer")
    print("=" * 70)
    print("\nRun this script locally with a webcam:\n")
    
    print("""
import time
import cv2
import numpy as np
import torch
from torchvision import transforms
from PIL import Image

# Assumes `style_model` and `device` are already set up as in Section 4

# Initialize webcam
cap = cv2.VideoCapture(0)

print("Press 'q' to quit, 's' to save")

while True:
    ret, frame = cap.read()
    if not ret: break
    
    # Process
    frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    img_pil = Image.fromarray(frame_rgb).resize((512, 384))
    input_tensor = transforms.Compose([transforms.ToTensor()])(img_pil).unsqueeze(0).to(device)
    
    with torch.no_grad():
        output_tensor = style_model(input_tensor)
    
    output_img = transforms.ToPILImage()(output_tensor.squeeze(0).clamp(0, 1))
    output_array = np.array(output_img.resize((frame.shape[1], frame.shape[0])))
    output_bgr = cv2.cvtColor(output_array, cv2.COLOR_RGB2BGR)
    
    cv2.imshow('StyleForge', output_bgr)
    
    key = cv2.waitKey(1) & 0xFF
    if key == ord('q'):
        break
    elif key == ord('s'):
        cv2.imwrite(f'webcam_{int(time.time())}.png', output_bgr)
        print("Saved!")

cap.release()
cv2.destroyAllWindows()
    """)

else:
    print("⚠️ CUDA not available or model not loaded")

## 8. ViT-Based Style Transfer

Vision Transformer-based style transfer using custom CUDA attention kernels.

### Model Variants:
| Variant | Parameters | Patches | Blocks |
|---------|------------|---------|--------|
| **nano** | 2M | 64 | 2 |
| **small** | 11M | 64 | 4 |
| **base** | 54M | 64 | 6 |
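
The table lists 64 patches for every variant. For a square image split into non-overlapping square patches, the patch count is the square of the number of patches per side, so 64 patches corresponds to an 8×8 grid. The sketch below shows the arithmetic; the image and patch sizes used are illustrative assumptions, not values read from `STYLEFORGE_MODELS`.

```python
def num_patches(image_size, patch_size):
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


# Assumed sizes for illustration only: a 256-pixel image with 32-pixel patches
print(num_patches(256, 32))
```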

In [None]:
if torch.cuda.is_available():
    from models.vit_style_transfer import create_model, STYLEFORGE_MODELS
    
    print("=" * 70)
    print("ViT Style Transfer Setup")
    print("=" * 70)
    
    print("\nAvailable variants:")
    for variant, config in STYLEFORGE_MODELS.items():
        print(f"  {variant}: {config['image_size']}, {config['embed_dim']} dim")
    
    # Create small model
    vit_model = create_model(variant='small', use_cuda_kernels=True).to(device)
    vit_model.eval()
    
    total_params = sum(p.numel() for p in vit_model.parameters())
    print(f"\nParameters: {total_params:,}")
    print("✅ ViT model loaded")
    
    vit_model_available = True

else:
    vit_model_available = False
    print("⚠️ CUDA not available")

In [None]:
# Test ViT model
if torch.cuda.is_available() and vit_model_available:
    from models.vit_style_transfer import STYLEFORGE_MODELS
    
    config = STYLEFORGE_MODELS['small']
    IMAGE_SIZE = config['image_size']
    
    content = torch.randn(1, 3, IMAGE_SIZE, IMAGE_SIZE, device=device)
    style = torch.randn(1, 3, IMAGE_SIZE, IMAGE_SIZE, device=device)
    
    # Warmup
    with torch.no_grad():
        for _ in range(5):
            _ = vit_model(content, style)
    torch.cuda.synchronize()
    
    # Benchmark
    times = []
    with torch.no_grad():
        for _ in range(10):
            start = time.perf_counter()
            output = vit_model(content, style)
            torch.cuda.synchronize()
            times.append((time.perf_counter() - start) * 1000)
    
    avg_time = np.mean(times)
    fps = 1000 / avg_time
    
    print(f"\nAverage: {avg_time:.2f} ms")
    print(f"FPS: {fps:.2f}")
    print(f"Output: {output.shape}")
    print("\n✅ ViT Style Transfer working!")

else:
    print("⚠️ CUDA not available or ViT model not loaded")

## 9. Pipeline API - Easy Style Transfer

High-level Python API for easy style transfer.

### Usage:
```python
from styleforge_pipeline import create_pipeline

# Fast Style Transfer
pipeline = create_pipeline(model_type='fast', style='candy')
output = pipeline.stylize('photo.jpg')
pipeline.save(output, 'styled.jpg')

# ViT Style Transfer
pipeline = create_pipeline(model_type='vit', vit_variant='small')
output = pipeline.stylize('content.jpg', style_image='style.jpg')
```
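
Building on the usage above, a whole folder can be processed with a small loop. This is a sketch under assumptions: `stylize_folder` is a hypothetical helper, and it relies only on the `stylize(path)` and `save(output, path)` calls shown in the usage example, assuming both accept string paths.

```python
from pathlib import Path


def stylize_folder(pipeline, input_dir, output_dir, pattern="*.jpg"):
    """Stylize every matching image in input_dir and save it under output_dir."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(Path(input_dir).glob(pattern)):
        result = pipeline.stylize(str(src))   # same call as in the usage above
        dst = out / f"stylized_{src.name}"
        pipeline.save(result, str(dst))
        written.append(dst)
    return written
```

With a pipeline from `create_pipeline(model_type='fast', style='candy')`, calling `stylize_folder(pipeline, 'photos', 'styled')` would write one `stylized_*.jpg` per input file.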

In [None]:
# Pipeline API Demo
import sys
from pathlib import Path

# Setup path
root_dir = Path.cwd()
if (root_dir / 'StyleForge').exists():
    root_dir = root_dir / 'StyleForge'

if str(root_dir) not in sys.path:
    sys.path.insert(0, str(root_dir))

try:
    from styleforge_pipeline import create_pipeline
    
    print("=" * 70)
    print("Pipeline API Demo")
    print("=" * 70)
    
    pipeline = create_pipeline(model_type='fast', style='candy', verbose=False)
    info = pipeline.get_model_info()
    
    print(f"Model: {info['model_name']}")
    print(f"Parameters: {info['total_parameters']:,}")
    
    test_input = torch.randn(1, 3, 256, 256).to(pipeline.device)
    with torch.no_grad():
        output = pipeline.model(test_input)
    
    print("\n✅ Pipeline API working!")
    print(f"   Input: {test_input.shape}")
    print(f"   Output: {output.shape}")
    
except ImportError as e:
    print(f"⚠️ Could not import pipeline: {e}")

## 10. Final Summary

### All Features Demonstrated

| Feature | CUDA Kernels | Status |
|---------|--------------|--------|
| **Image Style Transfer** | FusedInstanceNorm2d | ✅ Working |
| **Image Upload** | FusedInstanceNorm2d | ✅ Available |
| **Video File Processing** | FusedInstanceNorm2d | ✅ Script provided |
| **Webcam Style Transfer** | FusedInstanceNorm2d | ✅ Script provided |
| **ViT Style Transfer** | fused_attention_v1 | ✅ Working |
| **Pipeline API** | All kernels | ✅ Working |

### Performance Summary

| Operation | Speedup |
|-----------|---------|
| Fused Attention | 4-8x |
| Fused FFN | 3-5x |
| Fused Instance Norm | 2-4x |
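
A speedup figure of this kind is the ratio of baseline time to fused-kernel time for the same operation. The sketch below shows one way to compute it from repeated timings, using the median to dampen outliers. The helper name `speedup` is hypothetical, and the numbers are placeholders chosen for the arithmetic, not measurements from StyleForge.

```python
from statistics import median


def speedup(baseline_ms, fused_ms):
    """Ratio of median baseline time to median fused-kernel time."""
    return median(baseline_ms) / median(fused_ms)


# Placeholder timings in milliseconds (illustrative only, not measured)
baseline = [8.0, 8.2, 7.9]
fused = [2.0, 2.1, 1.9]
print(f"{speedup(baseline, fused):.1f}x")
```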

### Citation

```bibtex
@software{styleforge2024,
  title = {StyleForge: Real-Time Neural Style Transfer with CUDA Kernels},
  author = {Liau, Olivia},
  year = {2024},
  url = {https://github.com/oleeveeuh/StyleForge}
}
```