# Architecture Analyzer for Transformer Models

This tool analyzes transformer model architectures for GPU hardware efficiency based on the paper "The Case for Co-Designing Model Architectures with Hardware" (Anthony et al., 2024): https://arxiv.org/abs/2401.14489

The `ArchitectureAnalyzer` class evaluates transformer architectures against known hardware constraints to identify inefficient model dimensions and provide optimization suggestions. It focuses on making model dimensions align with GPU hardware requirements, particularly for optimal Tensor Core usage.

```python
from lobster.model.utils import Architecture, ArchitectureAnalyzer, GPUType, ModelType

architecture = Architecture(
    hidden_size=2560,
    num_attention_heads=32,
    num_hidden_layers=32,
    vocab_size=50257,
    model_type=ModelType.ENCODER_ONLY,
    name="My Model",
)
analyzer = ArchitectureAnalyzer(architecture, gpu_type=GPUType.A100)

result = analyzer.analyze()
```

```text
GPU EFFICIENCY ANALYSIS: My Model
Model Configuration:
  Model Type:          encoder_only
  Hidden Size:         2560
  Attention Heads:     32
  Head Dimension:      80
  Hidden Layers:       32
  Vocabulary Size:     50257
  Intermediate Size:   10240
  Tensor Parallel:     1

Efficiency Score: 78.8/100 (Good)

Identified Issues:
  1. Head dimension (80) is not divisible by 64
  2. Vocabulary size (50257) is not divisible by 64

Optimization Suggestions:
  1. Change attention heads from 32 to 1 to get head dimension of 2560
  2. Change attention heads from 32 to 2 to get head dimension of 1280
  3. Change attention heads from 32 to 4 to get head dimension of 640
  4. Change attention heads from 32 to 5 to get head dimension of 512
  5. Change attention heads from 32 to 8 to get head dimension of 320
  6. Change attention heads from 32 to 10 to get head dimension of 256
  7. Change attention heads from 32 to 20 to get head dimension of 128
  8. Pad vocabulary size from 50257
```

## Explanation

### 1. Head Dimension Alignment (Most Critical)
- **Rule**: The head dimension (`hidden_size / num_attention_heads`) should be divisible by 64 on A100 GPUs
- **Reason**: Tensor Cores process matrix multiplications in fixed-size blocks, so they run most efficiently when the dimensions are multiples of that block size (64 on A100)
- **Penalty**: Up to 25 points deducted from the score when violated
- **Detection**: `_check_head_dimension()` method
- **Suggestions**: 
  - Decrease number of heads to get aligned head dimension
  - Alternatively, increase the hidden size to get aligned head dimension
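
The first suggestion can be sketched as a search over head counts. This is a minimal illustration, not the code behind `_check_head_dimension()`; the helper name `aligned_head_counts` is hypothetical.

```python
def aligned_head_counts(hidden_size: int, alignment: int = 64) -> list[int]:
    """Return every head count that divides hidden_size evenly and
    yields a head dimension that is a multiple of `alignment`."""
    return [
        heads
        for heads in range(1, hidden_size + 1)
        if hidden_size % heads == 0 and (hidden_size // heads) % alignment == 0
    ]


hidden_size = 2560
print((hidden_size // 32) % 64 == 0)     # False: 2560 / 32 = 80
print(aligned_head_counts(hidden_size))  # [1, 2, 4, 5, 8, 10, 20, 40]
```

For `hidden_size=2560` the candidates below 32 heads match the analyzer's suggestions in the sample output. The sketch also lists 40 heads (head dimension 64), because it does not restrict the search to decreasing the head count.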

### 2. Hidden Dimension Alignment
- **Rule**: The hidden size should be divisible by 64
- **Reason**: Optimal Tensor Core usage for matrix operations
- **Penalty**: 20 points deducted from the score when violated
- **Detection**: `_check_hidden_dimension()` method
- **Suggestion**: Round up to the nearest multiple of 64
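
Rounding up to a multiple of 64 is a one-line integer computation. The helper below is a hypothetical sketch of that arithmetic, not the analyzer's own implementation.

```python
def round_up_to_multiple(value: int, multiple: int = 64) -> int:
    """Round `value` up to the next multiple of `multiple` (no change if already aligned)."""
    return ((value + multiple - 1) // multiple) * multiple


print(round_up_to_multiple(2500))  # 2560
print(round_up_to_multiple(2560))  # 2560 (already aligned)
```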

### 3. Vocabulary Size Alignment
- **Rule**: The vocabulary size should be divisible by 64
- **Reason**: Improves efficiency of embedding and classification layers
- **Penalty**: 15 points deducted from the score when violated
- **Detection**: `_check_vocab_size()` method
- **Suggestion**: Pad vocabulary size to the nearest multiple of 64
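
As a worked example, the snippet below pads the GPT-2-sized vocabulary from the sample run. The function name `pad_vocab_size` is an assumption made for illustration, and the sketch rounds up (padding cannot shrink a vocabulary).

```python
def pad_vocab_size(vocab_size: int, multiple: int = 64) -> tuple[int, int]:
    """Return (padded_size, extra_entries) for the next multiple of `multiple`."""
    padded = ((vocab_size + multiple - 1) // multiple) * multiple
    return padded, padded - vocab_size


padded, extra = pad_vocab_size(50257)
print(padded, extra)  # 50304 47
```

The 47 extra entries are unused rows in the embedding and output matrices; they are never produced by the tokenizer.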

### 4. Intermediate Size Alignment
- **Rule**: The intermediate size (in MLP layers) should be divisible by 64
- **Rule Extension**: For SwiGLU activations (which often use an 8/3 coefficient), adjust the coefficient so that the resulting intermediate size is aligned
- **Reason**: Optimizes matrix operations in the MLP layers
- **Detection**: `_check_intermediate_size()` method
- **Suggestions**:
  - For standard MLPs: Adjust to nearest multiple of 64
  - For SwiGLU: Use a better coefficient that results in an aligned intermediate size
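
One way to pick a "better coefficient" is to round the 8/3 intermediate size up to a multiple of 64 and read the effective coefficient back. The sketch below assumes rounding up (rounding to the nearest multiple would work equally well) and uses a hypothetical helper name; it uses `Fraction` to avoid floating-point error in the 8/3 factor.

```python
import math
from fractions import Fraction


def aligned_swiglu_size(
    hidden_size: int,
    coefficient: Fraction = Fraction(8, 3),
    multiple: int = 64,
) -> int:
    """Round hidden_size * coefficient up to the next multiple of `multiple`."""
    raw = hidden_size * coefficient
    return math.ceil(raw / multiple) * multiple


hidden_size = 2560
intermediate = aligned_swiglu_size(hidden_size)
print(intermediate)                # 6848 (the raw 8/3 value is about 6826.67)
print(intermediate / hidden_size)  # 2.675, the effective coefficient
```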

### 5. Tensor Parallelism Rules
When using tensor parallelism (splitting across multiple GPUs):
- **Rule 1**: Hidden size should be divisible by tensor parallel size
- **Penalty**: 15 points deducted from the score when violated
- **Rule 2**: `(batch_size * num_attention_heads)` should be divisible by tensor parallel size
- **Penalty**: 10 points deducted from the score when violated
- **Detection**: `_check_tensor_parallelism()` method
- **Suggestions**:
  - Change tensor parallel size to a divisor of the hidden size
  - Adjust attention heads for better tensor parallelism efficiency
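
Both rules are plain divisibility checks. The following sketch shows them under the assumption that the batch size is known up front; `tensor_parallel_issues` is a hypothetical helper, not the `_check_tensor_parallelism()` method itself.

```python
def tensor_parallel_issues(
    hidden_size: int,
    num_attention_heads: int,
    batch_size: int,
    tensor_parallel_size: int,
) -> list[str]:
    """Return a description of each tensor parallelism rule that is violated."""
    issues = []
    # Rule 1: hidden size must split evenly across tensor parallel ranks.
    if hidden_size % tensor_parallel_size != 0:
        issues.append("hidden_size is not divisible by tensor_parallel_size")
    # Rule 2: batch_size * num_attention_heads must split evenly as well.
    if (batch_size * num_attention_heads) % tensor_parallel_size != 0:
        issues.append(
            "batch_size * num_attention_heads is not divisible by tensor_parallel_size"
        )
    return issues


print(len(tensor_parallel_issues(2560, 32, batch_size=8, tensor_parallel_size=3)))  # 2
print(tensor_parallel_issues(2560, 32, batch_size=8, tensor_parallel_size=4))       # []
```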

## General Recommendations

1. **Batch size**: Make the batch size as large as possible
2. **Powers of two**: `batch_size * sequence_length`, `hidden_size / num_attention_heads`, and `hidden_size / tensor_parallel_size` should each be divisible by a large power of two (ideally 64 or higher)
3. **Tensor parallelism**: Keep tensor parallel size as small as possible while satisfying memory constraints
4. **Attention heads**: Fewer, larger heads are often more efficient than many small heads
5. **Layer count**: The number of layers should be divisible by the number of pipeline parallel stages (if using pipeline parallelism)
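
To see how a given dimension fares against the powers-of-two recommendation, it helps to compute the largest power of two that divides it. The helper below is an illustrative sketch (its name is an assumption); it relies on the two's-complement identity that `n & -n` isolates the lowest set bit of a positive integer.

```python
def largest_power_of_two_divisor(n: int) -> int:
    """Return the largest power of two that divides the positive integer n."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    return n & -n  # isolates the lowest set bit


print(largest_power_of_two_divisor(2560))      # 512
print(largest_power_of_two_divisor(2560 // 32))  # 16 (head dimension 80)
print(largest_power_of_two_divisor(50257))     # 1 (odd vocabulary size)
```

A result of 64 or more meets the recommendation; the head dimension of 80 and the vocabulary size of 50257 from the sample run do not.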

## Efficiency Score Calculation

The efficiency score starts at 100 and is reduced for each violation:
- Head dimension not divisible by 64: Up to -25
- Hidden size not divisible by 64: -20
- Vocabulary size not divisible by 64: -15
- Tensor parallelism issues: Up to -25 combined
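
Because some penalties are listed as "up to", the list above only pins down a floor for the score. The sketch below computes that floor from the maximum deductions; the dictionary keys and function name are hypothetical, and how the analyzer scales partial penalties is not shown here.

```python
# Maximum deduction per category, taken from the list above.
MAX_PENALTIES = {
    "head_dimension": 25,
    "hidden_size": 20,
    "vocab_size": 15,
    "tensor_parallelism": 25,
}


def worst_case_score(violations: set[str]) -> int:
    """Lowest possible score if every listed violation incurs its maximum penalty."""
    return 100 - sum(MAX_PENALTIES[name] for name in violations)


print(worst_case_score({"head_dimension", "vocab_size"}))  # 60
```

The sample run reports 78.8 for these two violations, which is above the floor of 60 and suggests the head dimension penalty was applied only partially.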