# Building Neural Networks with ML Odyssey

Learn how to construct neural network architectures using ExTensor layers.

## Available Layers

ML Odyssey provides optimized implementations of common layer types:

### Convolutional Layers
- **Conv2D**: 2D convolution for image processing
- **DepthwiseConv2D**: Per-channel (depthwise) convolution, the efficient building block of depthwise separable convolutions
- **TransposeConv2D**: Learnable upsampling via transposed convolution

### Fully Connected
- **Linear**: Dense matrix multiplication layer
- **Embedding**: Look-up table for categorical inputs

### Normalization
- **BatchNorm**: Batch normalization for training stability
- **LayerNorm**: Layer normalization for attention models

### Pooling
- **MaxPool2D**: Maximum pooling for spatial downsampling
- **AvgPool2D**: Average pooling
- **AdaptiveAvgPool2D**: Average pooling to a fixed target output size

### Activation
- **ReLU**: Rectified Linear Unit (most common)
- **LeakyReLU**: ReLU with non-zero negative slope
- **Sigmoid**: S-shaped activation (0 to 1)
- **Tanh**: Hyperbolic tangent (-1 to 1)
- **GELU**: Gaussian Error Linear Unit (common in Transformer models)
- **SiLU/Swish**: Smooth, self-gated activation, `x * sigmoid(x)`
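
The activations listed above have simple closed forms. The NumPy sketch below is only an illustration: the function names and the default `negative_slope=0.01` are assumptions, not ML Odyssey's API, and GELU uses the common tanh approximation rather than the exact erf form.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def leaky_relu(x, negative_slope=0.01):
    # negative_slope is an illustrative default, not a library constant
    return np.where(x > 0, x, negative_slope * x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def silu(x):
    # SiLU/Swish: the input gated by its own sigmoid
    return x * sigmoid(x)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

x = np.array([-2.0, 0.0, 2.0])
for fn in (relu, leaky_relu, sigmoid, tanh, silu, gelu):
    print(fn.__name__, fn(x))
```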

### Regularization
- **Dropout**: Random neuron deactivation (training only)
- **Residual**: Skip connections (ResNet-style)
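
Dropout being active only during training is usually implemented as "inverted dropout": kept activations are rescaled at training time so that inference needs no correction. A minimal NumPy sketch, where the function name, signature, and `training` flag are hypothetical and not taken from ML Odyssey:

```python
import numpy as np

def dropout(x, p=0.5, training=True, rng=None):
    """Inverted dropout: zero each element with probability p during training."""
    if not training or p == 0.0:
        return x  # inference: identity, no rescaling needed
    if rng is None:
        rng = np.random.default_rng()
    keep = rng.random(x.shape) >= p      # True with probability 1 - p
    return x * keep / (1.0 - p)          # rescale so the expected value is unchanged

x = np.ones((2, 4))
print(dropout(x, p=0.5, rng=np.random.default_rng(0)))  # entries are 0.0 or 2.0
print(dropout(x, training=False))                       # unchanged input
```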

## Layer Configuration

Each layer is configured through constructor parameters such as channel counts, kernel size, and dtype:

### Conv2D Example

```mojo
var conv = Conv2D(
    in_channels=3,      # Input: RGB image
    out_channels=32,    # Output: 32 filters
    kernel_size=3,      # 3x3 filter
    stride=1,           # Move filter by 1 pixel
    padding=1,          # Add 1 pixel padding
    dtype=DType.float32
)
```
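
The `kernel_size`, `stride`, and `padding` values together determine the spatial size of the output. The sketch below uses the standard formula `floor((size + 2*padding - kernel_size) / stride) + 1`, assuming a dilation of 1; the helper name is hypothetical.

```python
def conv2d_output_size(size, kernel_size, stride=1, padding=0):
    """Output height or width of a 2D convolution (dilation 1 assumed)."""
    return (size + 2 * padding - kernel_size) // stride + 1

# 3x3 kernel, stride 1, padding 1 (the configuration above) preserves the size
print(conv2d_output_size(32, kernel_size=3, stride=1, padding=1))  # 32

# 5x5 kernel without padding shrinks each side by 4
print(conv2d_output_size(28, kernel_size=5))  # 24
```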

### Linear Example

```mojo
var linear = Linear(
    in_features=784,    # Flattened 28x28 image
    out_features=128,   # Hidden units
    dtype=DType.float32
)
```
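
Conceptually, a dense layer computes `y = x @ W.T + b` on flattened inputs. The NumPy sketch below illustrates the shapes involved; the `(out_features, in_features)` weight layout is an assumption for illustration and may differ from how ML Odyssey stores its weights.

```python
import numpy as np

rng = np.random.default_rng(0)
in_features, out_features = 784, 128

# Assumed layout: weight is (out_features, in_features), bias is (out_features,)
weight = rng.standard_normal((out_features, in_features)).astype(np.float32) * 0.01
bias = np.zeros(out_features, dtype=np.float32)

def linear(x, weight, bias):
    """Affine map applied to each row of x."""
    return x @ weight.T + bias

batch = rng.standard_normal((32, 1, 28, 28)).astype(np.float32)
x = batch.reshape(batch.shape[0], -1)   # flatten each 28x28 image to 784 values
y = linear(x, weight, bias)
print(x.shape, y.shape)  # (32, 784) (32, 128)
```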

## Classic Architectures

### LeNet-5 (1998)

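One of the first successful CNNs; it made handwritten digit recognition practical. The variant below takes 28x28 inputs and uses ReLU and max pooling, whereas the original used 32x32 inputs, tanh-style activations, and average-pooling subsampling: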

```
Input (28x28)
  ↓
Conv2D(1, 6, 5) → ReLU → MaxPool(2)  # 28→24→12
  ↓
Conv2D(6, 16, 5) → ReLU → MaxPool(2) # 12→8→4
  ↓
Flatten → Linear(16*4*4, 120) → ReLU
  ↓
Linear(120, 84) → ReLU
  ↓
Linear(84, 10)  # Output logits (apply softmax for probabilities)
```

### AlexNet (2012)

The deep CNN that won the ImageNet challenge (ILSVRC) in 2012 and popularized GPU training:

```
Input (224x224x3)
  ↓
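Conv2D(3, 96, 11, stride=4, padding=2) → ReLU → MaxPool(3, stride=2)  # 224→55→27
  ↓
Conv2D(96, 256, 5, padding=2) → ReLU → MaxPool(3, stride=2)           # 27→27→13
  ↓
Conv2D(256, 384, 3, padding=1) → ReLU                                 # 13→13
Conv2D(384, 384, 3, padding=1) → ReLU
Conv2D(384, 256, 3, padding=1) → ReLU → MaxPool(3, stride=2)          # 13→6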
  ↓
Dropout(0.5)
Linear(256*6*6, 4096) → ReLU → Dropout(0.5)
Linear(4096, 4096) → ReLU → Dropout(0.5)
Linear(4096, 1000)  # 1000 ImageNet classes
```
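
The `256*6*6` input size of the first Linear layer can be checked by tracing the spatial size through the network. The sketch below assumes a padding of 2 on the first two convolutions, a padding of 1 on the 3x3 convolutions, and a stride of 2 for every 3x3 max pool, as in common reference implementations; the helper and layer names are hypothetical.

```python
def out_size(size, kernel, stride=1, padding=0):
    """Output height or width of a convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1

# (name, kernel, stride, padding); paddings and pool strides are assumptions
layers = [
    ("conv1", 11, 4, 2),
    ("pool1", 3, 2, 0),
    ("conv2", 5, 1, 2),
    ("pool2", 3, 2, 0),
    ("conv3", 3, 1, 1),
    ("conv4", 3, 1, 1),
    ("conv5", 3, 1, 1),
    ("pool3", 3, 2, 0),
]

size = 224
trace = [size]
for name, kernel, stride, padding in layers:
    size = out_size(size, kernel, stride, padding)
    trace.append(size)
    print(f"{name}: {size}x{size}")

flattened = 256 * size * size
print(flattened)  # 9216
```

With the default pooling stride (equal to the kernel size) or without padding, the final feature map would not be 6x6 and the first Linear layer's input size would no longer match.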

## Model Definition in Mojo

```mojo
struct LeNet5:
    var conv1: Conv2D
    var pool1: MaxPool2D
    var conv2: Conv2D
    var pool2: MaxPool2D
    var fc1: Linear
    var fc2: Linear
    var fc3: Linear

    fn __init__(out self):
        self.conv1 = Conv2D(1, 6, kernel_size=5)
        self.pool1 = MaxPool2D(2)
        self.conv2 = Conv2D(6, 16, kernel_size=5)
        self.pool2 = MaxPool2D(2)
        self.fc1 = Linear(16*4*4, 120)
        self.fc2 = Linear(120, 84)
        self.fc3 = Linear(84, 10)

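    fn forward(self, x: ExTensor) -> ExTensor:
        # Feature extractor: two conv → ReLU → pool stages
        var out = self.conv1(x)
        out = out.relu()
        out = self.pool1(out)

        out = self.conv2(out)
        out = out.relu()
        out = self.pool2(out)

        # Classifier: fc1 expects 16*4*4 = 256 features per sample
        out = out.flatten()
        out = self.fc1(out).relu()
        out = self.fc2(out).relu()
        out = self.fc3(out)  # No activation on output

        return out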
```
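
The forward pass relies on the flatten step keeping the batch dimension, so that `fc1` receives 16*4*4 = 256 features per sample. The NumPy sketch below reproduces the ReLU, 2x2 max-pool, and flatten steps on a dummy `conv2` output to show the shapes; it is a reference for the shape logic only and does not use the ExTensor API.

```python
import numpy as np

def max_pool2d(x, k=2):
    """Non-overlapping k x k max pooling on an (N, C, H, W) array."""
    n, c, h, w = x.shape
    assert h % k == 0 and w % k == 0, "H and W must be divisible by k"
    # Split each spatial axis into (blocks, k) and take the max inside each block
    return x.reshape(n, c, h // k, k, w // k, k).max(axis=(3, 5))

x = np.random.default_rng(0).standard_normal((8, 16, 8, 8))  # stand-in for conv2 output
x = np.maximum(x, 0.0)          # ReLU
x = max_pool2d(x, 2)            # (8, 16, 4, 4)
x = x.reshape(x.shape[0], -1)   # flatten everything except the batch axis
print(x.shape)  # (8, 256)
```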

## Weight Initialization

Proper weight initialization is crucial for training stability:

### Xavier/Glorot Initialization
```mojo
# For layers with tanh/sigmoid
# fan_in / fan_out: number of input / output units of the layer
limit = sqrt(6.0 / (fan_in + fan_out))
weight = uniform_random(-limit, limit)
```

### Kaiming/He Initialization
```mojo
# For layers with ReLU
# fan_in: number of input units (in_channels * kernel_h * kernel_w for a conv layer)
std = sqrt(2.0 / fan_in)
weight = normal_random(0, std)
```

### Bias Initialization
```mojo
# Almost always zero
bias = zeros(out_features)
```
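
The three schemes above can be sketched in NumPy and checked empirically. The function names and the `(fan_out, fan_in)` weight shape are assumptions made for illustration, not ML Odyssey's API.

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng):
    """Xavier/Glorot: uniform in [-limit, limit], limit = sqrt(6 / (fan_in + fan_out))."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_out, fan_in))

def kaiming_normal(fan_in, fan_out, rng):
    """Kaiming/He: normal with mean 0 and std = sqrt(2 / fan_in)."""
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_out, fan_in))

rng = np.random.default_rng(0)
w_tanh = xavier_uniform(784, 128, rng)   # for a tanh/sigmoid layer
w_relu = kaiming_normal(784, 128, rng)   # for a ReLU layer
bias = np.zeros(128)                     # biases start at zero

print(w_tanh.min(), w_tanh.max())   # both within ±sqrt(6 / 912)
print(w_relu.std())                 # close to sqrt(2 / 784)
```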

## Inspecting Model Parameters

In Python notebooks, we can count and inspect parameters:

```python
import numpy as np
from notebooks.utils import visualization

# Simulate model parameters
layers = [
    {"name": "conv1", "type": "Conv2D(1, 6, 5)", "output_shape": "(N, 6, 24, 24)", "params": 1*6*5*5 + 6},
    {"name": "relu1", "type": "ReLU", "output_shape": "(N, 6, 24, 24)", "params": 0},
    {"name": "pool1", "type": "MaxPool2D(2)", "output_shape": "(N, 6, 12, 12)", "params": 0},
    {"name": "conv2", "type": "Conv2D(6, 16, 5)", "output_shape": "(N, 16, 8, 8)", "params": 6*16*5*5 + 16},
    {"name": "relu2", "type": "ReLU", "output_shape": "(N, 16, 8, 8)", "params": 0},
    {"name": "pool2", "type": "MaxPool2D(2)", "output_shape": "(N, 16, 4, 4)", "params": 0},
    {"name": "fc1", "type": "Linear(256, 120)", "output_shape": "(N, 120)", "params": 256*120 + 120},
    {"name": "fc2", "type": "Linear(120, 84)", "output_shape": "(N, 84)", "params": 120*84 + 84},
    {"name": "fc3", "type": "Linear(84, 10)", "output_shape": "(N, 10)", "params": 84*10 + 10},
]

visualization.display_model_summary(layers)
```
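
The per-layer counts follow two formulas: `in_channels * out_channels * k * k + out_channels` for a convolution and `in_features * out_features + out_features` for a dense layer, where the final term is the bias. A standalone sketch with hypothetical helper names that totals them for the LeNet-5 variant above:

```python
def conv2d_params(in_channels, out_channels, kernel_size):
    """Weights plus one bias per output channel."""
    return in_channels * out_channels * kernel_size * kernel_size + out_channels

def linear_params(in_features, out_features):
    """Weights plus one bias per output unit."""
    return in_features * out_features + out_features

counts = {
    "conv1": conv2d_params(1, 6, 5),
    "conv2": conv2d_params(6, 16, 5),
    "fc1": linear_params(16 * 4 * 4, 120),
    "fc2": linear_params(120, 84),
    "fc3": linear_params(84, 10),
}

for name, n in counts.items():
    print(f"{name:>5}: {n:>6,}")

total = sum(counts.values())
print(f"total: {total:,}")  # total: 44,426
```

Most of the parameters sit in `fc1`, which is typical for small CNNs that end in dense layers.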

## Key Design Principles

1. **Deeper is not always better** - Depth lets a network learn more complex patterns, but it also makes training harder and overfitting more likely
2. **Bottleneck design** - Reduce spatial dimensions while increasing feature channels as the network gets deeper
3. **Skip connections** - Allow gradient flow in very deep networks
4. **Batch normalization** - Stabilizes training of deep networks
5. **Regularization** - Dropout and weight decay help prevent overfitting
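
To make the skip-connection principle concrete: a residual block computes `y = x + F(x)`, so the derivative of `y` with respect to `x` always contains an identity term, which gives gradients a direct path through the block. A minimal NumPy sketch, where the block structure and names are illustrative assumptions rather than ML Odyssey's `Residual` layer:

```python
import numpy as np

def residual_block(x, w1, w2):
    """y = x + F(x), where F is Linear → ReLU → Linear and preserves the feature size."""
    hidden = np.maximum(x @ w1, 0.0)   # first linear map followed by ReLU
    return x + hidden @ w2             # identity path added to the branch output

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64))
w1 = rng.standard_normal((64, 64)) * 0.1
w2 = np.zeros((64, 64))  # a zeroed branch makes the whole block an identity map

y = residual_block(x, w1, w2)
print(np.allclose(y, x))  # True
```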

Next: Train your first model!