# ARES Getting Started Tutorial

This tutorial walks through the complete ARES pipeline: taking a PyTorch neural network and deploying it on the GAP9 RISC-V processor with **true INT8 inference**.

## What You'll Learn

1. **Understand** the three stages of quantization (FP32 → Fake INT8 → True INT8)
2. **Define** a quantized model using Brevitas
3. **Train** on the MNIST dataset
4. **Extract** INT8 weights and quantization scales
5. **Generate** C code for GAP9
6. **Run** on the GAP9 simulator (GVSOC)

**Time estimate:** ~30 minutes (including training)

---
## Part 1: Understanding Quantization Stages

Before we start coding, it's crucial to understand the **three stages** of quantization in the ARES pipeline:

### Stage 1: FP32 (Floating Point)

This is your regular PyTorch model with 32-bit floating point weights and activations.

```
Weights:     float32 (e.g., 0.0234, -0.1567, ...)
Activations: float32
Compute:     FP32 multiplication and addition
```

**Problem:** FP32 is expensive on embedded hardware: it requires an FPU, draws more power, and needs four times the storage of INT8.

---

### Stage 2: INT8 Fake Quantized (Brevitas Training)

Brevitas **simulates** INT8 quantization during training, but computations still happen in FP32.

```
Weights:     Quantized to INT8 range [-128, 127], but STORED as float32
Activations: Quantized to INT8 range, but STORED as float32
Compute:     Still FP32 (for gradient computation during backprop)
```

**Why "fake"?** The values are constrained to INT8-representable values, but the underlying computation uses float32. This allows:
- Gradient computation for backpropagation
- The model to **learn** to work with quantized values
- Simulation of quantization noise during training
- Much faster training than true integer arithmetic would allow, because the model still runs on GPUs

**Key insight:** A Brevitas model's accuracy during training tells you what accuracy to expect after true quantization.
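
To make the "fake" part concrete, here is a minimal NumPy sketch of a quantize-then-dequantize step. It assumes symmetric per-tensor quantization with a single scale; the helper name `fake_quantize` and the scale value are illustrative and are not part of Brevitas or ARES.

```python
import numpy as np

def fake_quantize(x, scale, qmin=-128, qmax=127):
    """Snap float values onto the INT8 grid, but return them as float32."""
    q = np.clip(np.round(x / scale), qmin, qmax)   # integer-valued, still a float array
    return (q * scale).astype(np.float32)          # back to real units

weights = np.array([0.0234, -0.1567, 0.5, -2.0], dtype=np.float32)
scale = 0.01                                       # assumed scale, for illustration only
fq = fake_quantize(weights, scale)

print(fq.dtype)   # storage and compute stay in floating point
print(fq)         # every entry is an integer multiple of `scale`
```

The last input, `-2.0`, falls outside the representable range at this scale and is clipped to `-128 x scale`, which is the kind of quantization noise the model learns to tolerate during training.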

---

### Stage 3: INT8 True Quantized (ARES on GAP9)

ARES extracts the quantization parameters and generates C code that performs **actual INT8 arithmetic**.

```
Weights:     True INT8 values (1 byte each)
Activations: True INT8 values (1 byte each)
Compute:     INT8 x INT8 → INT32 accumulator → scale → INT8 output
```

**This is the real deal:** No floating point operations. Pure integer math on the GAP9 cluster.
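
One common way to perform the rescale step without floating point is to approximate the real-valued scale by an integer multiplier followed by a right shift. The sketch below shows that general technique in NumPy; the source does not state which rescaling scheme the ARES-generated code uses, and the names `quantize_multiplier` and `requantize`, the shift width, and the scale value are all assumptions.

```python
import numpy as np

def quantize_multiplier(real_scale, shift=24):
    """Approximate a real scale in (0, 1) as integer `mult` / 2**shift."""
    return int(round(real_scale * (1 << shift))), shift

def requantize(acc_int32, mult, shift):
    """INT32 accumulator -> INT8 using only integer multiply, add, and shift."""
    acc = acc_int32.astype(np.int64)                       # widen so the product cannot overflow
    rounded = (acc * mult + (1 << (shift - 1))) >> shift   # round half up
    return np.clip(rounded, -128, 127).astype(np.int8)

# INT8 x INT8 products accumulate in INT32
x = np.array([12, -7, 100, 3], dtype=np.int8)
w = np.array([5, 9, -2, 127], dtype=np.int8)
acc = np.sum(x.astype(np.int32) * w.astype(np.int32), dtype=np.int32, keepdims=True)

mult, shift = quantize_multiplier(0.05)   # 0.05 is an assumed combined scale
y = requantize(acc, mult, shift)
print(acc, y)
```

Here the accumulator is 178, and 178 x 0.05 = 8.9, so the integer-only path yields 9, the same value a floating-point rescale followed by rounding would give.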

---

### The Complete Pipeline

```
┌─────────────────────────────────────────────────────────────────────────────┐
│  STAGE 1: FP32                                                              │
│  ─────────────────                                                          │
│  Regular PyTorch model (nn.Conv2d, nn.Linear, etc.)                         │
│  Weights: float32, Compute: float32                                         │
│                                    ↓                                        │
│                         [Replace with Brevitas layers]                      │
│                                    ↓                                        │
├─────────────────────────────────────────────────────────────────────────────┤
│  STAGE 2: INT8 FAKE QUANTIZED (Training)                                    │
│  ────────────────────────────────────────                                   │
│  Brevitas model (QuantConv2d, QuantLinear, etc.)                            │
│  Weights: int8 values stored as float32                                     │
│  Compute: float32 (enables backpropagation)                                 │
│  Scales: Learned during training                                            │
│                                    ↓                                        │
│                         [BrevitasExtractor]                                 │
│                                    ↓                                        │
├─────────────────────────────────────────────────────────────────────────────┤
│  STAGE 3: INT8 TRUE QUANTIZED (Deployment)                                  │
│  ──────────────────────────────────────────                                 │
│  ARES-generated C code on GAP9                                              │
│  Weights: Actual int8 arrays (1 byte per weight)                            │
│  Compute: int8 x int8 → int32 accumulator → rescale → int8                  │
│  No floating point operations!                                              │
└─────────────────────────────────────────────────────────────────────────────┘
```

---

### Why This Matters

| Aspect | FP32 | Fake INT8 (Brevitas) | True INT8 (ARES) |
|--------|------|----------------------|------------------|
| Weight storage | 4 bytes | 4 bytes (float32) | **1 byte** |
| Compute | FPU required | FPU required | **Integer ALU only** |
| Power | High | High | **Low** |
| GAP9 compatible | No | No | **Yes** |
| Training | Yes | Yes | No |
| Accuracy | Baseline | ~Same as True INT8 | Matches Fake INT8 |

**The goal:** Train with Brevitas (Stage 2), then deploy with ARES (Stage 3) with **0.0% error** between them.

---
## Part 2: Setup

### Prerequisites

- Python 3.8+ with PyTorch and Brevitas
- GAP9 SDK (optional, for GVSOC simulation)

### Required Packages

```bash
pip install torch brevitas numpy torchvision
```

In [None]:
# Add ARES root to path
import sys
sys.path.insert(0, '..')

# Standard imports
import os
import json
import numpy as np
from pathlib import Path

# PyTorch
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Brevitas (quantization) - Stage 2: Fake INT8
from brevitas.nn import QuantConv2d, QuantLinear, QuantReLU, QuantIdentity

# ARES tools - Stage 3: True INT8
from tools.pytorch_extractor import BrevitasExtractor
from tools.int8_inference import INT8InferenceEngine  # True INT8 inference engine
from tools.generate_golden_outputs import GoldenOutputGenerator  # Golden output generation
from codegen.generate_c_code import CCodeGenerator

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

In [None]:
# Configuration
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
OUTPUT_DIR = Path('outputs')
OUTPUT_DIR.mkdir(exist_ok=True)

print(f"Using device: {DEVICE}")
print(f"Output directory: {OUTPUT_DIR.absolute()}")

---
## Part 3: From FP32 to Brevitas - Step by Step

Before jumping into the full Brevitas model, let's see what a **regular FP32 PyTorch model** looks like and how we transform it.

### Stage 1: The FP32 Network (Regular PyTorch)

Here's a simple CNN in standard PyTorch - no quantization at all:

In [None]:
# Stage 1: Regular FP32 PyTorch CNN (NO quantization)
class FP32_CNN(nn.Module):
    """
    Standard PyTorch CNN - this is what you'd write normally.
    Uses float32 for all weights and computations.
    """
    def __init__(self):
        super().__init__()
        # Standard PyTorch layers - no quantization
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, padding=1)
        self.relu1 = nn.ReLU()
        self.pool1 = nn.MaxPool2d(2, 2)
        
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.relu2 = nn.ReLU()
        self.pool2 = nn.MaxPool2d(2, 2)
        
        self.conv3 = nn.Conv2d(64, 128, kernel_size=3, padding=1)
        self.relu3 = nn.ReLU()
        
        self.classifier = nn.Linear(7 * 7 * 128, 10)
    
    def forward(self, x):
        x = self.pool1(self.relu1(self.conv1(x)))
        x = self.pool2(self.relu2(self.conv2(x)))
        x = self.relu3(self.conv3(x))
        x = x.view(x.size(0), -1)
        return self.classifier(x)

# Show the FP32 model
fp32_model = FP32_CNN()
print("FP32 CNN (Stage 1 - Regular PyTorch):")
print(f"  Parameters: {sum(p.numel() for p in fp32_model.parameters()):,}")
print(f"  Weight dtype: {next(fp32_model.parameters()).dtype}")
print(f"  Memory per weight: 4 bytes (float32)")

### Stage 2: Transforming to Brevitas (Fake INT8)

Now we transform this FP32 network into a **Brevitas quantized network**. The transformation is mechanical - replace each layer type with its Brevitas equivalent:

| FP32 (Stage 1) | Brevitas (Stage 2) | Notes |
|----------------|-------------------|-------|
| `nn.Conv2d` | `QuantConv2d` | Add `weight_bit_width=8` |
| `nn.Linear` | `QuantLinear` | Add `weight_bit_width=8` |
| `nn.ReLU` | `QuantReLU` | Add `bit_width=8, return_quant_tensor=True` |
| *(add new)* | `QuantIdentity` | Add at input, after pooling, and before the classifier |

**Key additions for Brevitas:**

1. **`QuantIdentity` at input** - Quantizes the FP32 input to INT8 range
2. **`QuantIdentity` after pooling and flatten** - Re-establishes scale after non-Brevitas ops
3. **`return_quant_tensor=True`** - Propagates scale information through the network
4. **`weight_bit_width=8`** - Specifies 8-bit quantization for weights

### The Transformed Network (Brevitas)

In [None]:
class TutorialCNN(nn.Module):
    """
    CNN for MNIST classification using Brevitas (Fake INT8).
    
    Compare to FP32_CNN above - the key changes are:
    1. nn.Conv2d → QuantConv2d (with weight_bit_width=8)
    2. nn.Linear → QuantLinear (with weight_bit_width=8)
    3. nn.ReLU → QuantReLU (with bit_width=8, return_quant_tensor=True)
    4. Added QuantIdentity at input, after pooling layers, and before the classifier
    """
    
    def __init__(self):
        super().__init__()
        bit_width = 8
        
        # NEW: Input quantization (not in FP32 version)
        self.input_quant = QuantIdentity(bit_width=bit_width, return_quant_tensor=True)
        
        # CHANGED: nn.Conv2d → QuantConv2d
        self.conv1 = QuantConv2d(1, 32, kernel_size=3, padding=1, bias=True, weight_bit_width=bit_width)
        # CHANGED: nn.ReLU → QuantReLU
        self.relu1 = QuantReLU(bit_width=bit_width, return_quant_tensor=True)
        # UNCHANGED: MaxPool2d stays the same (no Brevitas equivalent)
        self.pool1 = nn.MaxPool2d(kernel_size=2, stride=2)
        # NEW: QuantIdentity after pooling (re-establishes scale)
        self.pool1_quant = QuantIdentity(bit_width=bit_width, return_quant_tensor=True)
        
        # Block 2: Same pattern
        self.conv2 = QuantConv2d(32, 64, kernel_size=3, padding=1, bias=True, weight_bit_width=bit_width)
        self.relu2 = QuantReLU(bit_width=bit_width, return_quant_tensor=True)
        self.pool2 = nn.MaxPool2d(kernel_size=2, stride=2)
        self.pool2_quant = QuantIdentity(bit_width=bit_width, return_quant_tensor=True)
        
        # Block 3
        self.conv3 = QuantConv2d(64, 128, kernel_size=3, padding=1, bias=True, weight_bit_width=bit_width)
        self.relu3 = QuantReLU(bit_width=bit_width, return_quant_tensor=True)
        
        # NEW: QuantIdentity before linear (after flatten)
        self.pre_linear_quant = QuantIdentity(bit_width=bit_width, return_quant_tensor=True)
        
        # CHANGED: nn.Linear → QuantLinear
        self.classifier = QuantLinear(7 * 7 * 128, 10, bias=True, weight_bit_width=bit_width)
    
    def forward(self, x):
        # NEW: Quantize input
        x = self.input_quant(x)
        
        # Block 1 (note: pool1_quant is NEW)
        x = self.conv1(x)
        x = self.relu1(x)
        x = self.pool1(x)
        x = self.pool1_quant(x)  # NEW
        
        # Block 2
        x = self.conv2(x)
        x = self.relu2(x)
        x = self.pool2(x)
        x = self.pool2_quant(x)  # NEW
        
        # Block 3
        x = self.conv3(x)
        x = self.relu3(x)
        
        # Flatten and classify
        x = x.view(x.size(0), -1)
        x = self.pre_linear_quant(x)  # NEW
        x = self.classifier(x)
        
        return x

# Create model (this is Stage 2: Fake INT8)
cnn_model = TutorialCNN().to(DEVICE)
print("Created TutorialCNN (Brevitas model - Fake INT8 quantization)")
print(f"\nTotal parameters: {sum(p.numel() for p in cnn_model.parameters()):,}")

# Calculate MACs
macs = (32*1*9*784) + (64*32*9*196) + (128*64*9*49) + (6272*10)
print(f"Total MACs: {macs:,} (~{macs/1e6:.1f}M)")
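
The 4-byte versus 1-byte weight storage from Part 1 can be worked out for this network from the layer shapes alone. The sketch below assumes INT8 weights and INT32 biases (the bias width is an assumption; the tutorial only states 1 byte per weight) and ignores any extra scale parameters that Brevitas learns.

```python
# (name, weights, biases) derived from the layer shapes in TutorialCNN
layers = [
    ("conv1",      32 * 1 * 3 * 3,   32),
    ("conv2",      64 * 32 * 3 * 3,  64),
    ("conv3",      128 * 64 * 3 * 3, 128),
    ("classifier", 7 * 7 * 128 * 10, 10),
]

n_weights = sum(w for _, w, _ in layers)
n_biases = sum(b for _, _, b in layers)

fp32_bytes = 4 * (n_weights + n_biases)     # everything stored as float32
int8_bytes = 1 * n_weights + 4 * n_biases   # INT8 weights, assumed INT32 biases

print(f"weights={n_weights:,} biases={n_biases:,}")
print(f"FP32: {fp32_bytes / 1024:.1f} KiB, INT8: {int8_bytes / 1024:.1f} KiB "
      f"({fp32_bytes / int8_bytes:.2f}x smaller)")
```

Because biases are a tiny fraction of the parameters, the overall reduction stays close to the 4x implied by the per-weight figures.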

### Load MNIST Dataset

In [None]:
# MNIST transforms: convert to tensor (values in [0, 1])
transform = transforms.Compose([
    transforms.ToTensor(),
])

# Load datasets
train_dataset = datasets.MNIST(
    root='../data/mnist', 
    train=True, 
    download=True, 
    transform=transform
)
test_dataset = datasets.MNIST(
    root='../data/mnist', 
    train=False, 
    download=True, 
    transform=transform
)

# Use subsets for faster training and evaluation (1000 training / 200 test samples)
train_subset = torch.utils.data.Subset(train_dataset, range(1000))
test_subset = torch.utils.data.Subset(test_dataset, range(200))

train_loader = DataLoader(train_subset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_subset, batch_size=32, shuffle=False)

print(f"Training samples: {len(train_subset)}")
print(f"Test samples: {len(test_subset)}")

### Train the CNN (Stage 2: Fake INT8)

Training a Brevitas model is **identical** to training a regular PyTorch model. The fake quantization happens automatically during the forward pass.

**What happens during training:**
- Forward pass: Values are quantized to INT8 range (but stored as float32)
- Backward pass: Gradients flow through using Straight-Through Estimator (STE)
- Weights update: Model learns to work with quantized values
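
The Straight-Through Estimator can be illustrated without any framework. Rounding has a zero gradient almost everywhere, so a latent float weight would never move if the true derivative were used; STE simply passes the gradient through the rounding step unchanged. The toy problem, learning rate, and scale below are made up for illustration and are not how Brevitas implements it internally.

```python
def quantize(w, scale):
    """Round to the nearest grid point (forward pass of fake quantization)."""
    return round(w / scale) * scale

scale, lr = 0.05, 0.1
x, target = 1.0, 0.37          # toy data: we want w_q * x to approach 0.37
w = 0.0                        # latent float weight that the optimizer updates

for step in range(200):
    w_q = quantize(w, scale)                 # forward: use the quantized weight
    grad_wq = 2.0 * (w_q * x - target) * x   # d(loss)/d(w_q) for squared error
    grad_w = grad_wq                         # STE: treat d(w_q)/d(w) as 1, not 0
    w -= lr * grad_w

w_q = quantize(w, scale)
print(f"latent w={w:.4f}, quantized w_q={w_q:.2f}")
```

The quantized weight ends up on one of the two grid points adjacent to the target (0.35 or 0.40), which is the best an INT8-style grid with this scale can represent.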

In [None]:
def train_model(model, train_loader, epochs=5, lr=0.001):
    """Train a model on the given data loader."""
    model.train()
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=lr)
    
    for epoch in range(epochs):
        total_loss = 0
        correct = 0
        total = 0
        
        for images, labels in train_loader:
            images, labels = images.to(DEVICE), labels.to(DEVICE)
            
            optimizer.zero_grad()
            outputs = model(images)  # Fake INT8 forward pass
            loss = criterion(outputs, labels)
            loss.backward()  # Gradients via STE
            optimizer.step()
            
            total_loss += loss.item()
            _, predicted = outputs.max(1)
            total += labels.size(0)
            correct += predicted.eq(labels).sum().item()
        
        acc = 100. * correct / total
        print(f"Epoch {epoch+1}/{epochs} - Loss: {total_loss/len(train_loader):.4f}, Acc: {acc:.2f}%")
    
    return model

def evaluate_model(model, test_loader):
    """Evaluate model accuracy."""
    model.eval()
    correct = 0
    total = 0
    
    with torch.no_grad():
        for images, labels in test_loader:
            images, labels = images.to(DEVICE), labels.to(DEVICE)
            outputs = model(images)
            _, predicted = outputs.max(1)
            total += labels.size(0)
            correct += predicted.eq(labels).sum().item()
    
    return 100. * correct / total

# Train the CNN (Fake INT8 training)
print("="*60)
print("STAGE 2: Training with FAKE INT8 quantization (Brevitas)")
print("="*60)
print("\n(This takes ~1-2 minutes)\n")
cnn_model = train_model(cnn_model, train_loader, epochs=5)

In [None]:
# Evaluate (still Fake INT8 - but this predicts True INT8 accuracy!)
cnn_accuracy_fake = evaluate_model(cnn_model, test_loader)
print(f"\nTutorialCNN Test Accuracy (Fake INT8): {cnn_accuracy_fake:.2f}%")
print("\n→ This accuracy should match True INT8 on GAP9!")

### Extract INT8 Weights (Transition to Stage 3)

Now we use `BrevitasExtractor` to extract the **actual INT8 values** from the Brevitas model. This is the bridge between Stage 2 (fake) and Stage 3 (true).

**What gets extracted:**
- **INT8 weights** - The actual quantized values (not float32 anymore!)
- **Scales** - How to convert between INT8 and real values
- **Layer metadata** - Shapes, types, connections

The formula for dequantization is:
```
real_value = int8_value x scale
```
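
As a quick check of that formula, the sketch below dequantizes a small INT8 array and quantizes it again. The scale is an arbitrary assumed value; in the tutorial the real scales come from `network_info`.

```python
import numpy as np

scale = 0.0123                                   # assumed per-tensor scale
w_int8 = np.array([-128, -5, 0, 17, 127], dtype=np.int8)

w_real = w_int8.astype(np.float32) * scale       # dequantize: real = int8 x scale
w_back = np.clip(np.round(w_real / scale), -128, 127).astype(np.int8)  # quantize again

print(w_real)
print(np.array_equal(w_back, w_int8))            # the round trip recovers the integers
print(w_int8.nbytes, w_real.nbytes)              # 1 byte vs 4 bytes per value
```

Going from INT8 to real values and back loses nothing, which is why storing only the integers plus one scale is sufficient.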

In [None]:
# Create output directory for CNN
cnn_output_dir = OUTPUT_DIR / 'tutorial_cnn'
cnn_output_dir.mkdir(exist_ok=True)

# Get a sample input for extraction (needed to trace activation scales)
sample_input, _ = next(iter(test_loader))
sample_input = sample_input[:1].to(DEVICE)  # Single image

print(f"Sample input shape: {sample_input.shape}")
print(f"Output directory: {cnn_output_dir}")

In [None]:
print("="*60)
print("Extracting TRUE INT8 weights from Brevitas model")
print("="*60)

# Extract weights and scales
cnn_model.eval()
extractor = BrevitasExtractor(cnn_model)
network_info = extractor.extract_all(sample_input.cpu())  # Note: extract_all(), not extract()

# Save to disk
golden_dir = cnn_output_dir / 'golden_outputs'
golden_dir.mkdir(exist_ok=True)

# Save network_info.json (convert numpy arrays to lists for JSON serialization)
def numpy_to_list(obj):
    """Convert numpy arrays to lists for JSON serialization."""
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    elif isinstance(obj, dict):
        return {k: numpy_to_list(v) for k, v in obj.items()}
    elif isinstance(obj, list):
        return [numpy_to_list(v) for v in obj]
    return obj

network_info_serializable = numpy_to_list(network_info)
with open(golden_dir / 'network_info.json', 'w') as f:
    json.dump(network_info_serializable, f, indent=2, default=str)

# Save weights as actual INT8 arrays
weights_dir = golden_dir / 'weights'
weights_dir.mkdir(exist_ok=True)
extractor.save_weights(str(weights_dir))

layer_names = [name for name in network_info if not name.startswith('__')]  # Skip metadata keys
print(f"\nExtracted {len(layer_names)} layers")
print(f"\nLayers:")
for name in layer_names:
    layer_type = network_info[name].get('type', 'unknown')
    print(f"  {name}: {layer_type}")

In [None]:
# Let's examine the quantization details for conv1
print("="*60)
print("Examining conv1 quantization (Fake → True INT8)")
print("="*60)

if 'conv1' in network_info:
    conv1_info = network_info['conv1']
    print(f"\nLayer type: {conv1_info.get('type')}")
    print(f"Shape: {conv1_info.get('in_channels')}→{conv1_info.get('out_channels')} with {conv1_info.get('kernel_size')}x{conv1_info.get('kernel_size')} kernel")
    
    # Scales are the key to INT8 conversion
    weight_scale = conv1_info.get('weight_scale', 'N/A')
    input_scale = conv1_info.get('input_scale', 'N/A')
    output_scale = conv1_info.get('output_scale', 'N/A')
    
    print(f"\nQuantization scales:")
    print(f"  Weight scale: {weight_scale}")
    print(f"  Input scale:  {input_scale}")
    print(f"  Output scale: {output_scale}")
    
    print(f"\nHow to interpret:")
    print(f"  real_weight = int8_weight x {weight_scale}")
    print(f"  real_input  = int8_input  x {input_scale}")

In [None]:
# Verify weights are actually INT8
print("="*60)
print("Verifying extracted weights are TRUE INT8")
print("="*60)

weight_files = list(weights_dir.glob('*_weight*.npy'))
if weight_files:
    for wf in weight_files[:2]:  # Show first 2
        w = np.load(wf)
        print(f"\n{wf.name}:")
        print(f"  Shape: {w.shape}")
        print(f"  Dtype: {w.dtype}")
        print(f"  Range: [{w.min()}, {w.max()}]")
        if w.dtype == np.int8:
            print(f"  ✓ TRUE INT8 (1 byte per weight)")
        else:
            print(f"  Note: Stored as {w.dtype} for compatibility")

### Generate Golden Outputs & Evaluate TRUE INT8 Accuracy

Now we use `GoldenOutputGenerator` to:
1. Run **actual INT8 arithmetic** through the network
2. Save golden outputs for GAP9 verification
3. Validate our quantization pipeline

**What happens inside:**
```python
for each test input:
    # TRUE INT8 computation (no floating point in the MAC operations!)
    for each layer:
        accumulator_int32 = sum(input_int8[i] * weight_int8[i])  # int8 x int8 → int32
        output_int8 = clip((accumulator_int32 + bias_int32) * scale, -128, 127)
    
    # Save intermediate INT8 outputs for layer-by-layer verification
    save(intermediate_int8/*.npy)
```

**Expected:** GAP9 output matches these golden outputs with **0.0% error**
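
The loop above can be sketched for a single linear layer in NumPy and compared against a float reference built from the dequantized values. This is not the ARES `INT8InferenceEngine`; the function name `int8_linear`, the shapes, and the scales are assumptions, and the rescale uses a float multiply as in the pseudocode.

```python
import numpy as np

def int8_linear(x_q, w_q, b_q, s_in, s_w, s_out):
    """One linear layer: INT8 MACs, INT32 accumulation, then rescale to INT8."""
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32).T + b_q   # INT32 accumulator
    rescaled = acc * (s_in * s_w / s_out)                        # combined scale
    return np.clip(np.round(rescaled), -128, 127).astype(np.int8)

rng = np.random.default_rng(0)
x_q = rng.integers(-128, 128, size=(1, 16), dtype=np.int8)   # quantized input
w_q = rng.integers(-128, 128, size=(4, 16), dtype=np.int8)   # quantized weights
b_q = rng.integers(-1000, 1000, size=4, dtype=np.int32)      # bias at scale s_in * s_w
s_in, s_w, s_out = 0.02, 0.005, 0.1                          # assumed scales

y_q = int8_linear(x_q, w_q, b_q, s_in, s_w, s_out)

# Float reference: dequantize, compute in floating point, then quantize the output
y_ref = (x_q * s_in) @ (w_q * s_w).T + b_q * (s_in * s_w)
y_ref_q = np.clip(np.round(y_ref / s_out), -128, 127).astype(np.int8)

print(y_q, y_ref_q, np.abs(y_q.astype(int) - y_ref_q.astype(int)).max())
```

The two paths compute the same quantity in a different order, so they agree up to floating-point rounding at exact half-way points. A layer-by-layer comparison of this kind is what the saved intermediate outputs make possible on GAP9.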

In [None]:
print("="*60)
print("STAGE 3: Generating golden outputs with TRUE INT8 inference")
print("="*60)

# Create golden output generator (uses INT8InferenceEngine internally)
golden_generator = GoldenOutputGenerator(network_info)

# Get a real MNIST test image for the golden output
test_image, test_label = test_dataset[3]  # Use test sample #3
test_input_fp32 = test_image.unsqueeze(0).numpy()  # [1, 1, 28, 28]

print(f"\nUsing MNIST test image (label={test_label})")
print(f"Input shape: {test_input_fp32.shape}")

# Generate golden outputs using TRUE INT8 inference
test_cases = [test_input_fp32]
test_case_dir = golden_dir / 'test_cases'
golden_generator.generate_golden_outputs(test_cases, output_dir=str(test_case_dir))

print(f"\nGolden outputs saved to: {test_case_dir}")

### Evaluate TRUE INT8 Accuracy

Let's check that TRUE INT8 inference reaches the same accuracy as the Fake INT8 (Brevitas) model. This validates our quantization pipeline.

In [None]:
print("="*60)
print("Evaluating TRUE INT8 accuracy")
print("="*60)
print("\n(Testing on a small subset - this is slow but validates the pipeline)\n")

# Create INT8 inference engine
engine = INT8InferenceEngine(network_info)

# Evaluate on subset of test data
num_test_samples = 20
correct_true_int8 = 0

for i in range(num_test_samples):
    # Get test sample
    img, label = test_dataset[i]
    x_fp32 = img.unsqueeze(0).numpy()
    
    # Run TRUE INT8 inference
    output_fp32, _, _ = engine.forward(x_fp32, verbose=False)
    predicted = np.argmax(output_fp32[0])
    
    if predicted == label:
        correct_true_int8 += 1
    
    if (i + 1) % 5 == 0:
        print(f"  Processed {i+1}/{num_test_samples} samples...")

cnn_accuracy_true_int8 = 100. * correct_true_int8 / num_test_samples

print(f"\n" + "="*60)
print("CNN Accuracy Comparison:")
print("="*60)
print(f"  Fake INT8 (Brevitas): {cnn_accuracy_fake:.2f}%")
print(f"  True INT8 (Python):   {cnn_accuracy_true_int8:.2f}%")
print(f"\n  Difference: {abs(cnn_accuracy_fake - cnn_accuracy_true_int8):.2f}%")
# Loose tolerance: Fake INT8 was measured on 200 test samples, True INT8 on only 20
if abs(cnn_accuracy_fake - cnn_accuracy_true_int8) < 10:
    print("  ✓ Pipeline validated - True INT8 matches Fake INT8!")
else:
    print("  ! Large difference - check quantization parameters")

### Generate C Code (Stage 3: True INT8 on GAP9)

The `CCodeGenerator` creates C code that performs **actual INT8 arithmetic** on GAP9.

**What's generated:**
- INT8 weight arrays (1 byte per weight)
- INT8 x INT8 → INT32 MAC operations
- Proper scaling and bias handling
- Memory-optimized buffer allocation
- DMA pipelining for efficient data movement
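
To give a feel for what an INT8 weight array looks like once it is emitted as C source, here is a small hypothetical helper. It is not the ARES `CCodeGenerator`; the function name `emit_c_array`, the array naming, and the layout are illustrative only.

```python
def emit_c_array(name, values, per_line=8):
    """Format integers as a C int8_t array definition (illustrative layout)."""
    if any(v < -128 or v > 127 for v in values):
        raise ValueError("value out of INT8 range")
    rows = [
        "    " + ", ".join(str(v) for v in values[i:i + per_line])
        for i in range(0, len(values), per_line)
    ]
    body = ",\n".join(rows)
    return f"static const int8_t {name}[{len(values)}] = {{\n{body}\n}};"

print(emit_c_array("conv1_weight", [12, -7, 0, 127, -128, 5, 9, -1, 33, 2]))
```

Rejecting out-of-range values at generation time is a cheap guard: it catches weights that were accidentally saved in a wider dtype before they reach the compiler.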

In [None]:
print("="*60)
print("STAGE 3: Generating TRUE INT8 C code for GAP9")
print("="*60)

# Generate C code
generated_dir = cnn_output_dir / 'generated'

# GoldenOutputGenerator creates test_case_1, test_case_2, etc.
actual_test_case_dir = test_case_dir / 'test_case_1'

try:
    generator = CCodeGenerator(
        network_info_path=str(golden_dir / 'network_info.json'),
        weights_dir=str(weights_dir),
        test_case_dir=str(actual_test_case_dir),
        output_dir=str(generated_dir)
    )
    generator.generate_all()
    print(f"\n✓ C code generated at: {generated_dir}")
except Exception as e:
    import traceback
    print(f"\nCode generation error: {e}")
    traceback.print_exc()
    print("\nThis may be expected if running without full ARES setup.")

In [None]:
# List generated files
if generated_dir.exists():
    print("Generated files:")
    for f in sorted(generated_dir.rglob('*')):
        if f.is_file():
            rel_path = f.relative_to(generated_dir)
            size_kb = f.stat().st_size / 1024
            print(f"  {rel_path} ({size_kb:.1f} KB)")

### Running on GAP9 (Optional)

If you have the GAP9 SDK installed, you can build and run on GVSOC:

```bash
cd outputs/tutorial_cnn/generated
source ~/gap_sdk/configs/gap9_v2.sh
make clean all run platform=gvsoc
```

**Expected output:** `Error: 0.0%`

---

### Makefile Options

The generated Makefile supports several useful flags:

#### Basic Build Options

| Flag | Default | Description |
|------|---------|-------------|
| `CORE=N` | 8 | Number of cluster cores to use (1-8) |
| `platform=gvsoc` | - | Run on GVSOC simulator |
| `platform=board` | - | Run on real GAP9 hardware |

#### Performance Profiling

| Flag | Default | Description |
|------|---------|-------------|
| `ENABLE_PERF=1` | 0 | Enable per-layer cycle counters |
| `MINIMAL_OUTPUT=1` | 0 | Reduce debug prints (keeps perf summary) |

#### Validation Options

| Flag | Default | Description |
|------|---------|-------------|
| `DISABLE_INTERMEDIATE_GOLDEN=1` | 0 | Skip per-layer validation (only check final output) |

#### Memory Configuration

| Flag | Default | Description |
|------|---------|-------------|
| `PI_CL_SLAVE_STACK_SIZE=0xNNN` | 0x400 | Stack size for worker cores (increase if a stack overflow occurs) |

---

### Example Commands

```bash
# Basic run with full validation
make clean all run platform=gvsoc

# Performance profiling (verbose - shows per-layer breakdown)
make clean all run platform=gvsoc ENABLE_PERF=1

# Clean benchmarking (minimal prints, but shows perf summary)
make clean all run platform=gvsoc MINIMAL_OUTPUT=1 ENABLE_PERF=1

# Fast run for large models (skip per-layer validation)
make clean all run platform=gvsoc DISABLE_INTERMEDIATE_GOLDEN=1

# Debug with single core
make clean all run platform=gvsoc CORE=1
```

---

### Understanding the Output

**With `ENABLE_PERF=1`**, you'll see per-layer timing:
```
PERF conv1      : total=405248 compute=342017 dma_load=0 dma_store=0 idle=63231 overlap=84.4%
PERF conv2      : total=375194 compute=302750 dma_load=0 dma_store=0 idle=72444 overlap=80.7%
...

PERFORMANCE SUMMARY
Total layers:        7
Total cycles:        898819
  Compute cycles:    735044 (81.8%)
```
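
If you capture the simulator log to a file, the `PERF` lines are easy to post-process. The sketch below assumes the line format shown in the sample above; the function name `parse_perf` is hypothetical and is not an ARES tool.

```python
import re

PERF_RE = re.compile(r"^PERF\s+(\S+)\s*:\s*(.*)$")

def parse_perf(log_text):
    """Return {layer: {counter: int}} for every 'PERF <layer> : k=v ...' line."""
    layers = {}
    for line in log_text.splitlines():
        m = PERF_RE.match(line.strip())
        if not m:
            continue
        name, fields = m.groups()
        # Keep integer counters only; this skips fields such as 'overlap=84.4%'
        layers[name] = {k: int(v) for k, v in re.findall(r"(\w+)=(\d+)(?=\s|$)", fields)}
    return layers

log = """\
PERF conv1      : total=405248 compute=342017 dma_load=0 dma_store=0 idle=63231 overlap=84.4%
PERF conv2      : total=375194 compute=302750 dma_load=0 dma_store=0 idle=72444 overlap=80.7%
"""
for name, c in parse_perf(log).items():
    print(f"{name}: compute share {100 * c['compute'] / c['total']:.1f}%")
```

For the two sample lines, the compute share of total cycles works out to the same figures as the logged `overlap` values, which is a handy sanity check when comparing runs.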

**Validation output:**
```
CL: Error analysis:
  Max error:  0.000000    ← Maximum absolute error across all outputs
  Mean error: 0.000000    ← Average absolute error
CL: ✓ Test PASSED (error < 1.0%)
```

This confirms that:
1. Stage 2 (Fake INT8, Brevitas) → Accuracy: ~95%
2. Stage 3 (True INT8, GAP9) → **Bit-exact match** with Python INT8 reference
3. The quantization pipeline is working correctly!

---
## The ARES Auto-Tuner & Knowledge Base

A key feature of ARES is its **auto-tuning system**, which automatically discovers optimal configurations for each layer and stores them in a **knowledge base** for future use.

### The Problem: Layer Configuration

Each layer has many tunable parameters that affect performance:

- **Tiling parameters**: `tile_m`, `tile_n`, `tile_k` for matrix operations
- **Unrolling factors**: `outch_unroll` for convolutions
- **Pipeline settings**: Double buffering, DMA overlap
- **Memory placement**: L1 vs L2 vs L3

Finding the optimal configuration manually is tedious and error-prone. Different layer shapes require different settings.

### The Solution: Auto-Tuner + Knowledge Base

ARES includes an **auto-tuning framework** that:

1. **Profiles** each layer to identify bottlenecks
2. **Searches** configuration space systematically
3. **Measures** actual cycle counts on GVSOC
4. **Records** successful configurations to a knowledge base
5. **Reuses** configurations for similar layers in future runs

```
┌─────────────────────────────────────────────────────────────────────┐
│                        AUTO-TUNER WORKFLOW                          │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│   New Layer Shape          Knowledge Base                           │
│        │                        │                                   │
│        ▼                        ▼                                   │
│   ┌─────────┐    lookup    ┌─────────────┐                         │
│   │ linear  │ ───────────► │ Match found │ ──► Use cached config   │
│   │ M=400   │              └─────────────┘                         │
│   │ N=768   │                    │                                  │
│   │ K=192   │              No match                                 │
│   └─────────┘                    │                                  │
│        │                         ▼                                  │
│        │                  ┌─────────────┐                          │
│        └─────────────────►│  Auto-Tune  │                          │
│                           │  (GVSOC)    │                          │
│                           └──────┬──────┘                          │
│                                  │                                  │
│                                  ▼                                  │
│                           ┌─────────────┐                          │
│                           │   Record    │                          │
│                           │   to KB     │                          │
│                           └─────────────┘                          │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```
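
The lookup, tune, and record flow in the diagram amounts to a cache keyed by layer shape. The sketch below uses a plain dictionary as the knowledge base and a stand-in tuner; all names (`shape_key`, `get_config`, `dummy_tuner`) and the returned configuration are hypothetical and do not reflect the actual ARES data format.

```python
import json

def shape_key(layer_type, dims):
    """Build a stable string key from a layer type and its dimensions."""
    return layer_type + ":" + json.dumps(dims, sort_keys=True)

def get_config(kb, layer_type, dims, tune_fn):
    """Return a cached config if present, otherwise tune and record it."""
    key = shape_key(layer_type, dims)
    if key in kb:
        return kb[key], "cached"
    config = tune_fn(layer_type, dims)      # expensive step: would measure on GVSOC
    kb[key] = config                        # record for future runs
    return config, "tuned"

def dummy_tuner(layer_type, dims):
    """Stand-in for a real search; returns a fixed, made-up configuration."""
    return {"tile_m": 1, "tile_n": 128, "tile_k": 64}

kb = {}
dims = {"M": 400, "N": 768, "K": 192}
print(get_config(kb, "linear", dims, dummy_tuner))   # first call tunes
print(get_config(kb, "linear", dims, dummy_tuner))   # second call hits the cache
```

Sorting the keys when building the lookup string makes the match independent of the order in which dimensions are listed.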

### Using the Auto-Tuner

```bash
# Show what layers would be tuned (dry run)
python tools/auto_tune.py --test test_1_simplecnn --dry-run

# Tune all bottleneck layers (>100K cycles)
python tools/auto_tune.py --test test_1_simplecnn --tune-all

# Tune specific layers with more iterations
python tools/auto_tune.py --test test_1_simplecnn --layers conv1 conv2 --max-iter 30
```

### Knowledge Base

The knowledge base (`codegen/optimization/data/knowledge_base.json`) stores:

- **Shape patterns**: Layer dimensions (M, N, K for linear; C_in, C_out, H, W for conv)
- **Optimal configs**: Tiling, unrolling, pipeline settings
- **Performance metrics**: Measured cycles, MACs/cycle
- **Provenance**: Which test/run discovered this configuration

When generating C code, ARES automatically queries the knowledge base:

```
  [KB] Auto-applying config for conv1: {'outch_unroll': 4}
  [KB] Auto-applying config for fc1: {'tile_m': 1, 'tile_n': 128, 'tile_k': 256}
```

This means **you get optimized configurations automatically** for common layer shapes, without manual tuning.

### Shape Matching

The knowledge base uses **fuzzy shape matching** - if an exact match isn't found, it looks for similar shapes:

- Linear layers: Match on (M, N, K) dimensions
- Conv layers: Match on (C_in, C_out, kernel_size, spatial_dims)
- Configurable tolerance for "close enough" matches

This allows configurations learned from one model to benefit similar layers in other models.
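
One simple way to implement "close enough" matching is a per-dimension relative tolerance. The sketch below is an assumption about how such matching could work, not the ARES algorithm; the function names, the 25% default tolerance, and the stored configs are made up, and all dimensions are assumed to be positive.

```python
def relative_distance(a, b):
    """Largest per-dimension relative difference between two shapes."""
    return max(abs(x - y) / max(x, y) for x, y in zip(a, b))

def find_match(kb_entries, shape, tolerance=0.25):
    """Return the config of the closest stored shape within `tolerance`, else None."""
    best = None
    for stored_shape, config in kb_entries:
        if len(stored_shape) != len(shape):
            continue
        d = relative_distance(stored_shape, shape)
        if d <= tolerance and (best is None or d < best[0]):
            best = (d, config)
    return None if best is None else best[1]

# Stored (M, N, K) shapes for linear layers, with made-up configs
kb_entries = [
    ((400, 768, 192), {"tile_n": 128}),
    ((1, 10, 6272),   {"tile_n": 10}),
]
print(find_match(kb_entries, (400, 768, 192)))   # exact match
print(find_match(kb_entries, (384, 768, 192)))   # close enough: M differs by 4%
print(find_match(kb_entries, (50, 64, 64)))      # nothing similar -> None
```

A tighter tolerance reuses fewer configurations but reduces the risk of applying a tiling that does not suit the new shape.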

---
## Part 4: Example 2 - MLP (Dense Network)

Let's try a network with **no convolutions** - just linear (fully-connected) layers. This demonstrates that ARES handles different architectures and shows a different memory access pattern.

### Architecture

```
INPUT (28x28x1)
  → Flatten (784)
  → Linear(784 → 256) → ReLU
  → Linear(256 → 128) → ReLU  
  → Linear(128 → 10)
```
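
The different memory access pattern shows up when you count how often each weight is reused. The sketch below derives MACs and weight counts from the layer shapes of both networks (biases excluded); the helper names are illustrative.

```python
def conv_macs(c_in, c_out, k, h_out, w_out):
    """MACs and weight count for a k x k convolution with the given output size."""
    weights = c_out * c_in * k * k
    return weights * h_out * w_out, weights

def linear_macs(n_in, n_out):
    """MACs and weight count for a fully-connected layer (one MAC per weight)."""
    return n_in * n_out, n_in * n_out

cnn = [conv_macs(1, 32, 3, 28, 28), conv_macs(32, 64, 3, 14, 14),
       conv_macs(64, 128, 3, 7, 7), linear_macs(7 * 7 * 128, 10)]
mlp = [linear_macs(784, 256), linear_macs(256, 128), linear_macs(128, 10)]

for name, layers in (("TutorialCNN", cnn), ("SimpleMLP", mlp)):
    macs = sum(m for m, _ in layers)
    weights = sum(w for _, w in layers)
    print(f"{name}: {macs:,} MACs, {weights:,} weights, {macs / weights:.1f} MACs/weight")
```

Convolutions reuse each weight at every output position, while a linear layer touches each weight exactly once per inference, so the MLP's runtime depends far more on how quickly weights can be moved than on raw compute.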

In [None]:
class SimpleMLP(nn.Module):
    """
    Multi-layer perceptron using Brevitas (Fake INT8).
    
    No convolutions - shows pure Linear layer quantization.
    """
    
    def __init__(self):
        super().__init__()
        
        bit_width = 8
        
        # Input quantization (applied after flatten)
        self.input_quant = QuantIdentity(bit_width=bit_width, return_quant_tensor=True)
        
        # Layer 1: 784 → 256
        self.fc1 = QuantLinear(784, 256, bias=True, weight_bit_width=bit_width)
        self.relu1 = QuantReLU(bit_width=bit_width, return_quant_tensor=True)
        self.fc1_quant = QuantIdentity(bit_width=bit_width, return_quant_tensor=True)
        
        # Layer 2: 256 → 128
        self.fc2 = QuantLinear(256, 128, bias=True, weight_bit_width=bit_width)
        self.relu2 = QuantReLU(bit_width=bit_width, return_quant_tensor=True)
        self.fc2_quant = QuantIdentity(bit_width=bit_width, return_quant_tensor=True)
        
        # Layer 3: 128 → 10 (output)
        self.fc3 = QuantLinear(128, 10, bias=True, weight_bit_width=bit_width)
    
    def forward(self, x):
        # Flatten first, then quantize
        x = x.view(x.size(0), -1)  # (batch, 784)
        x = self.input_quant(x)
        
        # Layer 1
        x = self.fc1(x)
        x = self.relu1(x)
        x = self.fc1_quant(x)
        
        # Layer 2
        x = self.fc2(x)
        x = self.relu2(x)
        x = self.fc2_quant(x)
        
        # Layer 3 (output)
        x = self.fc3(x)
        
        return x

# Create model
mlp_model = SimpleMLP().to(DEVICE)
print("Created SimpleMLP (Brevitas model - Fake INT8 quantization)")
print(f"\nTotal parameters: {sum(p.numel() for p in mlp_model.parameters()):,}")

In [None]:
# Train the MLP
print("="*60)
print("STAGE 2: Training MLP with FAKE INT8 quantization")
print("="*60)
print("\n(This takes ~1-2 minutes)\n")
mlp_model = train_model(mlp_model, train_loader, epochs=5)

In [None]:
# Evaluate
mlp_accuracy_fake = evaluate_model(mlp_model, test_loader)
print(f"\nMLP Test Accuracy (Fake INT8): {mlp_accuracy_fake:.2f}%")

In [None]:
# Extract MLP weights (Fake → True INT8)
print("="*60)
print("Extracting TRUE INT8 weights from MLP")
print("="*60)

mlp_output_dir = OUTPUT_DIR / 'tutorial_mlp'
mlp_output_dir.mkdir(exist_ok=True)

mlp_model.eval()
mlp_extractor = BrevitasExtractor(mlp_model)
mlp_network_info = mlp_extractor.extract_all(sample_input.cpu())  # Note: extract_all()

# Save
mlp_golden_dir = mlp_output_dir / 'golden_outputs'
mlp_golden_dir.mkdir(exist_ok=True)

mlp_network_info_serializable = numpy_to_list(mlp_network_info)
with open(mlp_golden_dir / 'network_info.json', 'w') as f:
    json.dump(mlp_network_info_serializable, f, indent=2, default=str)

mlp_weights_dir = mlp_golden_dir / 'weights'
mlp_weights_dir.mkdir(exist_ok=True)
mlp_extractor.save_weights(str(mlp_weights_dir))

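# Count only real layers; keys starting with '__' are skipped as metadata
layer_names = [name for name in mlp_network_info if not name.startswith('__')]
print(f"\nExtracted {len(layer_names)} layers:")
for name in layer_names:
    print(f"  {name}: {mlp_network_info[name].get('type', 'unknown')}")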

### CNN vs MLP: Performance Comparison

| Metric | TutorialCNN | SimpleMLP |
|--------|-------------|-----------|
| Parameters | ~155K | ~235K |
| MACs | ~7.5M | ~235K |
| Weight size | 155 KB | 235 KB |
| MACs/cycle (compute only) | 2.7 | **9.2** |
| MACs/cycle (total, incl. DMA) | 2.7 | 0.7 |

**Key insight:** The MLP's compute kernels are highly efficient (9.2 MACs/cycle using SIMD), but the large fc1 weight matrix (~200 KB) must stream from L3 HyperRAM, and the resulting DMA overhead pulls total throughput down to 0.7 MACs/cycle. The CNN's weight matrices are small enough to fit in L2, so its compute-only and total throughput are the same (2.7 MACs/cycle).
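
The MLP figures in the table can be checked with a short calculation. The sketch below derives the parameter count, MAC count, and fc1 weight size from the layer dimensions. It then converts the two MACs/cycle figures into approximate cycle counts. The throughput numbers are the rounded values from the table, so the cycle counts are estimates only.

```python
# Layer sizes from the SimpleMLP architecture: (in_features, out_features).
MLP_LAYERS = [(784, 256), (256, 128), (128, 10)]

def count_params(layers):
    """Weights plus one bias per output feature."""
    return sum(i * o + o for i, o in layers)

def count_macs(layers):
    """One multiply-accumulate per weight, for a single input sample."""
    return sum(i * o for i, o in layers)

params = count_params(MLP_LAYERS)
macs = count_macs(MLP_LAYERS)
fc1_weight_bytes = 784 * 256  # INT8: one byte per weight

# Rounded throughput figures from the comparison table (MACs per cycle).
compute_cycles = macs / 9.2   # time spent in the compute kernels
total_cycles = macs / 0.7     # end-to-end, including weight streaming

print(f"parameters:     {params:,}")
print(f"MACs:           {macs:,}")
print(f"fc1 weights:    {fc1_weight_bytes:,} bytes")
print(f"compute cycles: ~{compute_cycles:,.0f}")
print(f"total cycles:   ~{total_cycles:,.0f}")
print(f"compute share:  {compute_cycles / total_cycles:.1%}")
```

With these rounded figures, the compute kernels account for under 10% of the total cycles; the rest is spent waiting on weight transfers and other overhead.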

**When to use each:**
- **CNN**: Image/signal data with spatial structure. Weights reused across spatial dimensions.
- **MLP**: Tabular data, embeddings. Best when weights fit in L2 (< ~500KB total).

---
## Part 5: Key Concepts Deep Dive

### The Three Stages Revisited

```
┌──────────────────┬────────────────────────┬────────────────────────┐
│     STAGE 1      │       STAGE 2          │       STAGE 3          │
│      FP32        │   Fake INT8 (Brevitas) │   True INT8 (ARES)     │
├──────────────────┼────────────────────────┼────────────────────────┤
│ nn.Conv2d        │ QuantConv2d            │ int8 MAC loops in C    │
│ nn.Linear        │ QuantLinear            │ int8 GEMM in C         │
│ nn.ReLU          │ QuantReLU              │ max(0, x) on int8      │
│                  │                        │                        │
│ float32 weights  │ int8 vals as float32   │ int8 arrays (1 byte)   │
│ float32 compute  │ float32 compute        │ int8xint8→int32        │
│                  │                        │                        │
│ No quantization  │ Simulated quantization │ Real quantization      │
│ Can train        │ Can train (STE)        │ Inference only         │
└──────────────────┴────────────────────────┴────────────────────────┘
```
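
The relationship between Stage 2 and Stage 3 can be shown with a tiny dot product in plain Python. This is a sketch under simplifying assumptions: the scales are hand-picked powers of two (Brevitas learns its scales during training), rounding uses Python's built-in `round`, and the `quantize` and `fake_quantize` helpers are illustrative names, not Brevitas or ARES functions.

```python
def quantize(values, scale):
    """Map floats to signed INT8 codes: round(x / scale), clamped to [-128, 127]."""
    return [max(-128, min(127, round(v / scale))) for v in values]

def fake_quantize(values, scale):
    """Stage 2: INT8-representable values that are still stored as floats."""
    return [q * scale for q in quantize(values, scale)]

x = [0.50, -0.25, 0.75]
w = [0.10, 0.20, -0.30]
x_scale, w_scale = 1 / 64, 1 / 256  # illustrative scales, not learned values

# Stage 2 (fake INT8): float dot product over quantize-then-dequantize values.
fake_out = sum(a * b for a, b in zip(fake_quantize(x, x_scale),
                                     fake_quantize(w, w_scale)))

# Stage 3 (true INT8): integer MACs into an INT32 accumulator, one rescale at the end.
acc = sum(a * b for a, b in zip(quantize(x, x_scale), quantize(w, w_scale)))
true_out = acc * (x_scale * w_scale)

print("INT8 inputs: ", quantize(x, x_scale))
print("INT8 weights:", quantize(w, w_scale))
print("accumulator: ", acc)
print("fake INT8 output:", fake_out)
print("true INT8 output:", true_out)
```

Both paths produce the same result because the fake-quantized floats are exactly the integer codes multiplied by their scales. That is why accuracy measured on the Brevitas model carries over to the integer implementation.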

### The Bias Trap

A critical detail in INT8 inference is the order of bias addition:

**WRONG:**
```c
output = (accumulator * scale) + bias;  // DON'T DO THIS
```

**CORRECT:**
```c
output = (accumulator + bias_int32) * scale;  // Bias BEFORE scale
```

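**Why?** The bias must be added in the INT32 accumulator domain, before the result is scaled down to INT8. `bias_int32` is expressed in the accumulator's units, so it only makes sense to add it while the sum is still at that scale. Adding the bias after scaling mixes two different scales and destroys precision.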

**Symptom:** If you see ~60% error rate, check bias order!
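
The effect of the ordering can be modeled in a few lines of Python that mirror the two C snippets above. The numbers are illustrative and not taken from a real ARES layer. The `requantize` helper and the assumption that the bias is quantized with scale `x_scale * w_scale` (the usual convention for INT32 biases) are part of this sketch, not a statement about the generated code.

```python
def requantize(value, scale):
    """Scale an INT32 value down to INT8 with rounding and clamping."""
    return max(-128, min(127, round(value * scale)))

# Illustrative scales and values (assumptions for this sketch).
x_scale, w_scale, out_scale = 1 / 64, 1 / 256, 1 / 32
bias_fp = 0.5
accumulator = -3680  # INT32 sum of int8 x int8 products

# Bias quantized into the accumulator domain (scale = x_scale * w_scale).
bias_int32 = round(bias_fp / (x_scale * w_scale))

# Combined factor that maps the accumulator to the INT8 output range.
scale = (x_scale * w_scale) / out_scale

# Float reference: dequantize, add the float bias, quantize to the output scale.
reference = round((accumulator * x_scale * w_scale + bias_fp) / out_scale)

correct = requantize(accumulator + bias_int32, scale)  # bias BEFORE scale
wrong = max(-128, min(127, round(accumulator * scale) + bias_int32))  # bias AFTER scale

print("reference:", reference)
print("correct:  ", correct)
print("wrong:    ", wrong)
```

In this example the correct ordering matches the float reference, while the wrong ordering adds an accumulator-scale bias to an output-scale value and saturates at the INT8 limit.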

### QuantIdentity Placement

`QuantIdentity` serves two critical purposes:

1. **Input quantization**: Convert FP32 → INT8
2. **Scale tracking**: Re-establish quantization scale after non-Brevitas ops

**Rule:** Add `QuantIdentity` after any operation without a Brevitas equivalent:
- After `nn.MaxPool2d` → `QuantIdentity`
- After `nn.Flatten` / `.view()` → `QuantIdentity`
- After `nn.Dropout` → `QuantIdentity`
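
The rule is mechanical enough to check automatically. The sketch below is a hypothetical helper, not part of ARES or Brevitas: it takes a model's layers as a list of type names in forward order and reports every non-Brevitas op that is not immediately followed by `QuantIdentity`.

```python
# Layer types that carry their own quantization scale (Brevitas modules).
BREVITAS_OPS = {"QuantConv2d", "QuantLinear", "QuantReLU", "QuantIdentity"}

def missing_quant_identity(layer_types):
    """Return indices of non-Brevitas ops not immediately followed by QuantIdentity."""
    missing = []
    for i, op in enumerate(layer_types):
        if op in BREVITAS_OPS:
            continue
        next_op = layer_types[i + 1] if i + 1 < len(layer_types) else None
        if next_op != "QuantIdentity":
            missing.append(i)
    return missing

good = ["QuantIdentity", "QuantConv2d", "QuantReLU", "MaxPool2d", "QuantIdentity",
        "Flatten", "QuantIdentity", "QuantLinear"]
bad = ["QuantIdentity", "QuantConv2d", "QuantReLU", "MaxPool2d",
       "Flatten", "QuantIdentity", "QuantLinear"]

print(missing_quant_identity(good))  # no violations
print(missing_quant_identity(bad))   # index of the MaxPool2d with no QuantIdentity after it
```

A check like this could be run on the list of layer types before training, since a missing `QuantIdentity` is cheaper to fix then than after extraction.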

---
## Part 6: Summary & Next Steps

### What You Learned

1. **Three Quantization Stages:**
   - FP32: Regular PyTorch
   - Fake INT8: Brevitas (training)
   - True INT8: ARES/GAP9 (deployment)

2. **Brevitas Patterns:**
   - Use `QuantConv2d`, `QuantLinear`, `QuantReLU`
   - Add `QuantIdentity` after non-Brevitas ops
   - Set `return_quant_tensor=True` for scale propagation

3. **ARES Pipeline:**
   - `BrevitasExtractor`: Fake → True INT8 weights
   - `CCodeGenerator`: INT8 C code for GAP9
   - Target: 0.0% error vs Python reference

### Adding Your Own Network

1. **Copy the pattern** from this tutorial
2. **Replace layers** with your architecture
3. **Add QuantIdentity** after every non-Brevitas operation
4. **Train** to learn quantization scales
5. **Extract** and generate C code

### Further Reading

- **System Architecture**: `docs/ARCHITECTURE.md`
- **Quantization Details**: `docs/QUANTIZATION_GUIDE.md`
- **Adding Operations**: `docs/ADDING_OPERATIONS.md`
- **Debugging**: `docs/DEBUGGING.md`