# Part 8.5: ExecuTorch Fundamentals

## Deploying PyTorch Models to Edge Devices

In previous sections, we covered inference and deployment methods for servers and desktops:
- **Ollama** — Local desktop/server deployment
- **llama.cpp / GGUF** — CPU-optimized, cross-platform
- **FastAPI** — API server deployment

Now we explore **ExecuTorch** — Meta's solution for deploying PyTorch models to edge devices like mobile phones, smartwatches, IoT sensors, and embedded systems.

**Learning Objectives:**
1. Understand what ExecuTorch is and the problem it solves
2. Learn how ExecuTorch differs from standard PyTorch
3. Master the export pipeline: PyTorch → ExecuTorch
4. Run inference with the ExecuTorch runtime

---

## 1. What is ExecuTorch?

ExecuTorch is Meta's framework for deploying PyTorch models on **edge devices** — mobile phones, embedded systems, IoT devices, and wearables. It's designed for environments where resources are severely constrained compared to server deployments.

### The Problem ExecuTorch Solves

You've deployed models to servers and desktops. But what about running inference on:
- A smartphone with 4GB RAM
- A smartwatch with 512MB RAM
- An IoT sensor with 128MB RAM
- A drone or robot with no internet connection

These environments have constraints you haven't dealt with yet:

| Constraint | Description |
|------------|-------------|
| **Limited memory** | Can't load a 4GB model |
| **Limited compute** | No powerful GPU |
| **No network** | Must run fully on-device |
| **Battery concerns** | Efficiency matters |
| **Binary size** | App size limits on mobile stores |

Standard PyTorch wasn't designed for this. It carries dependencies and overhead that work fine on servers but are too heavy for edge devices.
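
To make the memory constraint concrete, weight storage is roughly the parameter count times the bytes per parameter. The sketch below is a back-of-the-envelope estimate only: the 1-billion-parameter model and the helper name `weight_footprint_gb` are illustrative assumptions, and a real deployment also needs memory for activations and the runtime itself.

```python
# Rough weight-storage estimate: parameters x bytes per parameter.
# The model size and helper name are illustrative assumptions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

def weight_footprint_gb(num_params: int, dtype: str) -> float:
    """Return approximate weight storage in GB (1 GB = 10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

num_params = 1_000_000_000  # hypothetical 1B-parameter model
for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_footprint_gb(num_params, dtype):.1f} GB")
```

Under these assumptions the fp32 weights alone would fill the 4GB phone from the list above, which is why lower-precision formats matter on edge devices.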

### Think of ExecuTorch as:

> "A compiler and runtime that transforms PyTorch models into efficient, self-contained programs that can run on resource-constrained devices without Python."

---

## 2. How ExecuTorch Differs from Standard PyTorch

### Standard PyTorch Execution

When you run a PyTorch model normally:
- Python interpreter runs your code line by line
- PyTorch dispatches operations dynamically
- You can use if/else, loops, dynamic shapes
- Requires Python runtime + PyTorch library (~hundreds of MB)

### ExecuTorch Execution

With ExecuTorch:
- Model is compiled ahead-of-time (AOT) to a static graph
- Lightweight C++ runtime executes the graph
- No Python needed on the target device
- Minimal runtime (~1MB possible)

### Comparison Table

| Aspect | Server/Desktop (Ollama, FastAPI) | Edge (ExecuTorch) |
|--------|----------------------------------|-------------------|
| Runtime size | Hundreds of MB | ~1MB possible |
| Dependencies | Python, CUDA, libraries | Minimal C++ runtime |
| Compilation | JIT or eager execution | Ahead-of-time (AOT) |
| Optimization | General purpose | Device-specific |
| Target | x86/CUDA servers | ARM, NPU, DSP, custom silicon |

### The Key Trade-off

**You're trading flexibility for efficiency.** A server can JIT compile and adapt; an edge device needs everything pre-decided and optimized.

---

## 3. The ExecuTorch Pipeline

Here's the journey a model takes from PyTorch to edge deployment:

```
PyTorch Model (nn.Module)
        ↓
   torch.export.export()
        ↓
  Exported Program (ATen dialect)
        ↓
   to_edge()
        ↓
  Edge Program (Edge dialect)
        ↓
   to_executorch()
        ↓
  .pte file (ExecuTorch Program)
        ↓
  ExecuTorch Runtime (C++)
```

### Step-by-Step Explanation

**1. torch.export.export():** Captures your model as a static graph. Unlike regular PyTorch, which executes dynamically, this freezes the computation into a fixed structure. This is necessary because the edge runtime has no Python interpreter to execute model code.

**2. ATen Dialect:** The exported program is expressed in PyTorch's ATen operators. The full ATen library has roughly 2,000 operators, too many for a lean edge runtime to implement.

**3. to_edge():** Converts to Edge dialect, a smaller standardized set of operators designed for portability across edge devices.

**4. to_executorch():** Final compilation step. Produces a `.pte` file (PyTorch ExecuTorch), a serialized binary ready for the runtime.

**5. ExecuTorch Runtime:** A lightweight C++ runtime that loads and executes `.pte` files. No Python needed on the device.

### Key Concept: Delegates

ExecuTorch can delegate operations to specialized hardware:

| Delegate | Target Hardware |
|----------|----------------|
| **XNNPACK** | Optimized CPU kernels for ARM/x86 |
| **CoreML** | Apple's Neural Engine (iOS/macOS) |
| **Qualcomm QNN** | Snapdragon NPUs |
| **Vulkan** | Mobile GPU compute |
| **Custom backends** | Your own hardware |

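Delegation is decided at export time: a partitioner assigns the subgraphs a backend supports to that backend, and the resulting `.pte` file embeds those delegated payloads. A single `.pte` file can therefore route different operations to different accelerators, while operators that no backend claims fall back to portable CPU kernels.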

---

## 4. Code Comparison: PyTorch vs ExecuTorch

Let's see the difference between standard PyTorch execution and ExecuTorch compilation with a simple example.

### Standard PyTorch: How You Normally Run Models

```python
import torch
import torch.nn as nn

# Define a simple model
class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear1 = nn.Linear(10, 32)
        self.relu = nn.ReLU()
        self.linear2 = nn.Linear(32, 5)
    
    def forward(self, x):
        x = self.linear1(x)
        x = self.relu(x)
        x = self.linear2(x)
        return x

# Create model and input
model = SimpleModel()
model.eval()
sample_input = torch.randn(1, 10)

# Run inference — PyTorch executes dynamically
output = model(sample_input)
print(output.shape)  # torch.Size([1, 5])
```

**What happens under the hood:**
- Python interpreter runs your code line by line
- PyTorch dispatches operations dynamically
- Flexible — you can use if/else, loops, dynamic shapes
- Requires Python runtime + PyTorch library (~hundreds of MB)

### ExecuTorch: Compiled for Edge Deployment

```python
import torch
from torch.export import export
from executorch.exir import to_edge

# Same model as before
class SimpleModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear1 = torch.nn.Linear(10, 32)
        self.relu = torch.nn.ReLU()
        self.linear2 = torch.nn.Linear(32, 5)
    
    def forward(self, x):
        x = self.linear1(x)
        x = self.relu(x)
        x = self.linear2(x)
        return x

model = SimpleModel()
model.eval()

# Step 1: Export — capture the model as a static graph
sample_input = (torch.randn(1, 10),)
exported_program = export(model, sample_input)

# Step 2: Convert to Edge dialect
edge_program = to_edge(exported_program)

# Step 3: Convert to ExecuTorch format
executorch_program = edge_program.to_executorch()

# Step 4: Save as .pte file
with open("simple_model.pte", "wb") as f:
    f.write(executorch_program.buffer)

print("Model exported to simple_model.pte")
```

**What happens:**
- `export()` traces your model and captures a fixed computation graph
- `to_edge()` converts operations to a portable edge-friendly format
- `to_executorch()` compiles everything into a binary blob
- The `.pte` file contains the program and its weights; the ExecuTorch runtime executes it without Python

### Running the .pte File

**On your development machine (Python, for testing):**

```python
import torch
from executorch.runtime import Runtime

# Load the compiled model
runtime = Runtime.get()
program = runtime.load_program("simple_model.pte")
method = program.load_method("forward")

# Run inference
input_tensor = torch.randn(1, 10)
output = method.execute([input_tensor])
print(output[0].shape)  # Same shape as the PyTorch output: torch.Size([1, 5])
```

**On an edge device (C++, no Python):**

```cpp
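// Sketch using the Module extension API; header paths may differ by release.
#include <executorch/extension/module/module.h>
#include <executorch/extension/tensor/tensor.h>

using namespace ::executorch::extension;

int main() {
  // Load the .pte file through the high-level Module API
  Module module("simple_model.pte");
  // Wrap a caller-owned buffer as a [1, 10] float tensor
  float input[10] = {0.0f};
  auto tensor = from_blob(input, {1, 10});
  // Run "forward"; the result carries an error code instead of throwing
  const auto result = module.forward(tensor);
  return result.ok() ? 0 : 1;
}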
```

### Visual Summary

```
┌─────────────────────────────────────────────────────────┐
│                   STANDARD PYTORCH                       │
│                                                         │
│   Python Script → PyTorch Runtime → Dynamic Execution   │
│                                                         │
│   Requires: Python + PyTorch (~500MB+)                  │
│   Runs on: Servers, desktops, laptops                   │
└─────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────┐
│                     EXECUTORCH                           │
│                                                         │
│   Python Script                                         │
│        ↓ export()                                       │
│   Exported Program                                      │
│        ↓ to_edge()                                      │
│   Edge Program                                          │
│        ↓ to_executorch()                                │
│   .pte file ──→ Lightweight C++ Runtime → Execution    │
│                                                         │
│   Requires: Just the .pte file + tiny runtime (~1MB)   │
│   Runs on: Phones, watches, IoT, embedded devices      │
└─────────────────────────────────────────────────────────┘
```

---

## 5. Hands-On: Complete ExecuTorch Pipeline

Now let's run through the complete pipeline with working code.

### 5.1 Environment Setup

```python
# Install ExecuTorch
# Note: This may take a few minutes
!pip install executorch -q
```

```python
# Verify installation
import torch
print(f"PyTorch version: {torch.__version__}")

try:
    from torch.export import export
    from executorch.exir import to_edge
    print("✓ ExecuTorch imported successfully!")
except ImportError as e:
    print(f"✗ Import error: {e}")
    print("\nTroubleshooting: each ExecuTorch release expects a matching PyTorch version.")
    print("Try: pip install --upgrade executorch")
```

```
PyTorch version: 2.9.0+cpu
✓ ExecuTorch imported successfully!
```


### 5.2 Create a Simple Model

We'll use a small classifier — complex enough to be meaningful, simple enough to understand completely.

```python
import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    """
    A small classifier for demonstration.
    Input: 20 features
    Output: 4 classes
    """
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(20, 64),
            nn.ReLU(),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 4)
        )

    def forward(self, x):
        return self.layers(x)

# Create and set to eval mode
model = TinyClassifier()
model.eval()

# Test with sample input
sample_input = torch.randn(1, 20)
output = model(sample_input)

print("Model created successfully!")
print(f"Input shape:  {sample_input.shape}")
print(f"Output shape: {output.shape}")
print(f"Output:       {output}")
```

```
Model created successfully!
Input shape:  torch.Size([1, 20])
Output shape: torch.Size([1, 4])
Output:       tensor([[ 0.0738, -0.0361,  0.0600,  0.1681]], grad_fn=<AddmmBackward0>)
```


### 5.3 Export the Model

This is where we transition from dynamic PyTorch to a static graph.

```python
from torch.export import export

# Define example input for tracing
# Note: Input must be a tuple
example_inputs = (torch.randn(1, 20),)

# Export the model
exported_program = export(model, example_inputs)

print("✓ Export successful!")
print(f"Type: {type(exported_program).__name__}")
```

```
✓ Export successful!
Type: ExportedProgram
```


**What just happened?**

`torch.export.export()` traced through your model with the example input and captured every operation as a static graph. No more dynamic Python, just a fixed sequence of tensor operations.

Let's inspect what was captured:

```python
# View the exported graph
print("Exported Graph:")
print("=" * 50)
print(exported_program.graph_module.graph)
```

```
Exported Graph:
graph():
    %p_layers_0_weight : [num_users=1] = placeholder[target=p_layers_0_weight]
    %p_layers_0_bias : [num_users=1] = placeholder[target=p_layers_0_bias]
    %p_layers_2_weight : [num_users=1] = placeholder[target=p_layers_2_weight]
    %p_layers_2_bias : [num_users=1] = placeholder[target=p_layers_2_bias]
    %p_layers_4_weight : [num_users=1] = placeholder[target=p_layers_4_weight]
    %p_layers_4_bias : [num_users=1] = placeholder[target=p_layers_4_bias]
    %x : [num_users=1] = placeholder[target=x]
    %linear : [num_users=1] = call_function[target=torch.ops.aten.linear.default](args = (%x, %p_layers_0_weight, %p_layers_0_bias), kwargs = {})
    %relu : [num_users=1] = call_function[target=torch.ops.aten.relu.default](args = (%linear,), kwargs = {})
    %linear_1 : [num_users=1] = call_function[target=torch.ops.aten.linear.default](args = (%relu, %p_layers_2_weight, %p_layers_2_bias), kwargs = {})
    %relu_1 : [num_users=1] = call_function[target=torch.op
```

You'll see operations like `torch.ops.aten.linear.default` and `torch.ops.aten.relu.default`. This is the **ATen dialect** — PyTorch's internal operator representation.

### 5.4 Convert to Edge Dialect

Now we convert to a more portable format optimized for edge devices.

```python
from executorch.exir import to_edge

# Convert to edge program
edge_program = to_edge(exported_program)

print("✓ Edge conversion successful!")
print(f"Type: {type(edge_program).__name__}")
```

```
✓ Edge conversion successful!
Type: EdgeProgramManager
```


**Why this step?**

ATen has ~2000 operators. Edge dialect standardizes these into a smaller, portable set that can run consistently across different edge devices.

### 5.5 Compile to ExecuTorch Format

Final compilation step — creating the `.pte` file.

```python
# Compile to ExecuTorch
executorch_program = edge_program.to_executorch()

print("✓ ExecuTorch compilation successful!")
print(f"Buffer size: {len(executorch_program.buffer):,} bytes")
print(f"Buffer size: {len(executorch_program.buffer)/1024:.2f} KB")
```

```
✓ ExecuTorch compilation successful!
Buffer size: 17,040 bytes
Buffer size: 16.64 KB
```


Notice the size: this tiny model compiles to about 17 KB. The device needs only this file plus the ExecuTorch runtime, rather than a Python interpreter and the full PyTorch library (hundreds of MB).
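
A quick sanity check on that number: most of the buffer should be the float32 weights. The arithmetic below uses only the layer shapes of `TinyClassifier` and the 17,040-byte buffer size reported above. Attributing the remainder to the serialized graph, metadata, and alignment padding is an assumption rather than something measured here.

```python
# Layer shapes of TinyClassifier as (in_features, out_features).
layers = [(20, 64), (64, 32), (32, 4)]

# Each Linear layer stores a weight matrix plus one bias per output.
num_params = sum(i * o + o for i, o in layers)
weight_bytes = num_params * 4  # float32 = 4 bytes per parameter

pte_bytes = 17_040  # buffer size reported by the notebook
print(f"Parameters:      {num_params:,}")
print(f"Weight bytes:    {weight_bytes:,}")
print(f"Non-weight part: {pte_bytes - weight_bytes:,} bytes")
```

The model has 3,556 parameters (14,224 bytes), so under this reading less than 3 KB of the file is anything other than weights.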

### 5.6 Save the .pte File

```python
import os

# Save to file
pte_path = "tiny_classifier.pte"

with open(pte_path, "wb") as f:
    f.write(executorch_program.buffer)

# Verify file
file_size = os.path.getsize(pte_path)
print(f"✓ Saved to: {pte_path}")
print(f"  File size: {file_size:,} bytes ({file_size/1024:.2f} KB)")
```

```
✓ Saved to: tiny_classifier.pte
  File size: 17,040 bytes (16.64 KB)
```


### 5.7 Run Inference with ExecuTorch Runtime

Now let's load and run our compiled model using the ExecuTorch runtime.

```python
from executorch.runtime import Runtime

# Load the program
runtime = Runtime.get()
program = runtime.load_program(pte_path)

# Load the forward method
method = program.load_method("forward")

print("✓ Model loaded successfully!")

# Create test input
test_input = torch.randn(1, 20)

# Run inference
outputs = method.execute([test_input])

print(f"\nInference Results:")
print(f"Input shape:  {test_input.shape}")
print(f"Output shape: {outputs[0].shape}")
print(f"Output:       {outputs[0]}")
```

```
[program.cpp:153] InternalConsistency verification requested but not available
✓ Model loaded successfully!

Inference Results:
Input shape:  torch.Size([1, 20])
Output shape: torch.Size([1, 4])
Output:       tensor([[-0.0303,  0.1302,  0.0904,  0.1285]])
```
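
Once inference works, latency is usually the next thing to check. The helper below is a generic timing sketch, not part of ExecuTorch: `benchmark` is a hypothetical name, and the stand-in workload exists only so the snippet runs anywhere. In this notebook you would pass `lambda: method.execute([test_input])` instead.

```python
import statistics
import time

def benchmark(fn, warmup: int = 10, iters: int = 100) -> float:
    """Return the median latency of fn() in milliseconds."""
    for _ in range(warmup):  # warm caches before timing
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)

# Stand-in workload (hypothetical); replace with the real inference call.
latency_ms = benchmark(lambda: sum(range(1000)))
print(f"Median latency: {latency_ms:.4f} ms")
```

The median is used rather than the mean because occasional scheduler stalls would otherwise skew a small sample.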


### 5.8 Verify Results Match

Critical check — does ExecuTorch produce the same results as PyTorch?

```python
# Run same input through both
test_input = torch.randn(1, 20)

# PyTorch inference
with torch.no_grad():
    pytorch_output = model(test_input)

# ExecuTorch inference
executorch_output = method.execute([test_input])[0]

# Compare
print("Comparison: PyTorch vs ExecuTorch")
print("=" * 50)
print(f"PyTorch output:     {pytorch_output.numpy().flatten()}")
print(f"ExecuTorch output:  {executorch_output.numpy().flatten()}")

# Check if close (small numerical differences are normal due to floating point)
are_close = torch.allclose(pytorch_output, executorch_output, atol=1e-5)
print(f"\n{'✓' if are_close else '✗'} Outputs match: {are_close}")

if are_close:
    print("The ExecuTorch output matches PyTorch within tolerance (atol=1e-5).")
else:
    max_diff = torch.max(torch.abs(pytorch_output - executorch_output)).item()
    print(f"Maximum difference: {max_diff}")
```

```
Comparison: PyTorch vs ExecuTorch
PyTorch output:     [0.00625398 0.0099101  0.14916544 0.15452597]
ExecuTorch output:  [0.00625396 0.0099101  0.14916542 0.15452595]

✓ Outputs match: True
The ExecuTorch output matches PyTorch within tolerance (atol=1e-5).
```
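
The two output rows differ in their last printed digit, so the match is "within tolerance" rather than bit-exact. The sketch below reproduces the rule `torch.allclose` applies, `|a - b| <= atol + rtol * |b|`, in plain Python using the values recorded above. The helper `all_close` is a hypothetical re-implementation for illustration only.

```python
# Recorded outputs from the comparison cell.
pytorch_out = [0.00625398, 0.0099101, 0.14916544, 0.15452597]
executorch_out = [0.00625396, 0.0099101, 0.14916542, 0.15452595]

def all_close(a, b, rtol=1e-5, atol=1e-8):
    """Elementwise |a - b| <= atol + rtol * |b|, as torch.allclose defines it."""
    return all(abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b))

max_diff = max(abs(x - y) for x, y in zip(pytorch_out, executorch_out))
print(f"Max abs difference: {max_diff:.1e}")
print(all_close(pytorch_out, executorch_out, atol=1e-5))
```

The largest gap is on the order of 1e-8, far below the `atol=1e-5` threshold used in the check.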


---

## 6. Complete Pipeline Summary

Here's the entire ExecuTorch workflow in one place:

```python
# ============================================================
# COMPLETE EXECUTORCH PIPELINE
# ============================================================

import torch
import torch.nn as nn
from torch.export import export
from executorch.exir import to_edge
import os

# 1. DEFINE MODEL
class TinyClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(20, 64),
            nn.ReLU(),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 4)
        )

    def forward(self, x):
        return self.layers(x)

# 2. PREPARE MODEL
model = TinyClassifier()
model.eval()
print("Step 1: Model created")

# 3. EXPORT → EDGE → EXECUTORCH
example_inputs = (torch.randn(1, 20),)

exported = export(model, example_inputs)
print("Step 2: Model exported (ATen dialect)")

edge = to_edge(exported)
print("Step 3: Converted to Edge dialect")

et_program = edge.to_executorch()
print("Step 4: Compiled to ExecuTorch format")

# 4. SAVE
output_path = "model.pte"
with open(output_path, "wb") as f:
    f.write(et_program.buffer)

file_size = os.path.getsize(output_path)
print(f"Step 5: Saved to {output_path} ({file_size/1024:.2f} KB)")

print("\n" + "=" * 50)
print("✓ Model ready for edge deployment!")
print("=" * 50)
```

```
Step 1: Model created
Step 2: Model exported (ATen dialect)
Step 3: Converted to Edge dialect
Step 4: Compiled to ExecuTorch format
Step 5: Saved to model.pte (16.64 KB)

✓ Model ready for edge deployment!
```


---

## 7. Key Takeaways

### What We Learned

1. **ExecuTorch** is Meta's framework for deploying PyTorch models to edge devices (phones, IoT, embedded systems)

2. **The Pipeline:**
   - `export()` — Captures dynamic PyTorch as a static graph
   - `to_edge()` — Converts to portable edge operators
   - `to_executorch()` — Compiles to final binary format
   - `.pte` file — Self-contained, ready for edge runtime

3. **Key Trade-off:** You sacrifice flexibility (dynamic execution) for efficiency (small runtime, optimized for constrained devices)

4. **Runtime:** No Python needed on the target device — just the lightweight C++ runtime

### Comparison with Other Deployment Methods

| Method | Target | Runtime Size | Use Case |
|--------|--------|--------------|----------|
| Ollama | Desktop/Server | ~100MB+ | Local chatbots |
| GGUF/llama.cpp | Desktop | ~10-50MB | CPU inference |
| FastAPI | Cloud | Variable | API endpoints |
| **ExecuTorch** | **Edge** | **~1MB** | **Mobile apps, IoT** |

### Next Steps

In Part 8.6, we'll:
- Compare GGUF vs ExecuTorch in detail
- Apply ExecuTorch to a real SLM (Small Language Model)
- Explore quantization for edge deployment
- Handle LLM-specific challenges (tokenization, generation loops)

---

## References

- [ExecuTorch Documentation](https://pytorch.org/executorch/stable/index.html)
- [PyTorch Export Documentation](https://pytorch.org/docs/stable/export.html)
- [ExecuTorch GitHub Repository](https://github.com/pytorch/executorch)