# Performance Optimization: Profiling and Benchmarking

This notebook covers performance optimization techniques for tracking applications. We explore:

1. **Profiling Basics** - Identifying bottlenecks
2. **Numba JIT Compilation** - Accelerating numerical code
3. **Vectorization** - Efficient NumPy operations
4. **Caching Strategies** - Avoiding redundant computations
5. **Memory Optimization** - Reducing allocations
6. **Benchmarking Best Practices** - Reliable measurements

## Prerequisites

```bash
pip install nrl-tracker plotly numpy numba line_profiler memory_profiler
```

In [None]:
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import time
import cProfile
import pstats
from io import StringIO
from functools import lru_cache

try:
    from numba import njit, prange
    NUMBA_AVAILABLE = True
except ImportError:
    NUMBA_AVAILABLE = False
    print("Numba not available. Install with: pip install numba")

# Import tracking functions
from pytcl.dynamic_estimation.kalman import kf_predict, kf_update
from pytcl.assignment_algorithms import hungarian_assignment
from pytcl.dynamic_estimation.particle_filters import (
    resample_systematic, effective_sample_size
)

np.random.seed(42)

# Plotly dark theme template
dark_template = go.layout.Template()
dark_template.layout = go.Layout(
    paper_bgcolor='#0d1117',
    plot_bgcolor='#0d1117',
    font=dict(color='#e6edf3'),
    xaxis=dict(gridcolor='#30363d', zerolinecolor='#30363d'),
    yaxis=dict(gridcolor='#30363d', zerolinecolor='#30363d'),
)

## 1. Profiling Basics

Before optimizing, identify where time is actually spent.

### Profiling Tools

| Tool | Purpose | Overhead |
|------|---------|----------|
| `time.perf_counter()` | Quick timing | Minimal |
| `cProfile` | Function-level profiling | Moderate |
| `line_profiler` | Line-by-line profiling | High |
| `memory_profiler` | Memory usage | High |
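
`memory_profiler` needs an extra install, so as a rough stand-in the sketch below uses the standard library's `tracemalloc` to report the peak memory traced during one call. The helper name `peak_traced_bytes` is hypothetical, and `tracemalloc` only sees allocations reported to Python's allocators, so the number should be read as an estimate rather than the process's true footprint.

```python
import tracemalloc

def peak_traced_bytes(func, *args, **kwargs):
    """Return the peak number of bytes traced while func runs (hypothetical helper)."""
    tracemalloc.start()
    try:
        func(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak

# Example: allocating a one-megabyte buffer
peak = peak_traced_bytes(bytearray, 1_000_000)
print(f"Peak traced memory: {peak / 1e6:.2f} MB")
```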

In [None]:
def simple_timer(func, *args, n_runs=10, **kwargs):
    """
    Time a function call with multiple runs.
    
    Parameters
    ----------
    func : callable
        Function to time.
    *args : tuple
        Arguments to pass to function.
    n_runs : int
        Number of timing runs.
    **kwargs : dict
        Keyword arguments to pass to function.
        
    Returns
    -------
    dict
        Timing statistics in seconds: 'mean', 'std', 'min', 'max', plus 'n_runs'.
    """
    # Warmup
    func(*args, **kwargs)
    
    times = []
    for _ in range(n_runs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        end = time.perf_counter()
        times.append(end - start)
    
    return {
        'mean': np.mean(times),
        'std': np.std(times),
        'min': np.min(times),
        'max': np.max(times),
        'n_runs': n_runs
    }

# Example: Time matrix multiplication
n = 500
A = np.random.randn(n, n)
B = np.random.randn(n, n)

stats = simple_timer(np.dot, A, B, n_runs=20)
print(f"Matrix multiplication ({n}x{n}):")
print(f"  Mean: {stats['mean']*1e3:.2f} ms")
print(f"  Std:  {stats['std']*1e3:.2f} ms")

In [None]:
# Profile a tracking scenario
def run_kalman_tracking(n_steps=100):
    """Run Kalman filter tracking."""
    # System matrices
    dt = 1.0
    F = np.array([[1, dt, 0, 0],
                  [0, 1, 0, 0],
                  [0, 0, 1, dt],
                  [0, 0, 0, 1]])
    H = np.array([[1, 0, 0, 0],
                  [0, 0, 1, 0]])
    Q = np.eye(4) * 0.1
    R = np.eye(2) * 1.0
    
    x = np.zeros(4)
    P = np.eye(4) * 100
    
    for _ in range(n_steps):
        # Predict
        x = F @ x
        P = F @ P @ F.T + Q
        
        # Update
        z = np.random.randn(2) * np.sqrt(R[0, 0])
        y = z - H @ x
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        x = x + K @ y
        P = (np.eye(4) - K @ H) @ P
    
    return x, P

# Profile with cProfile
profiler = cProfile.Profile()
profiler.enable()

for _ in range(100):
    run_kalman_tracking(100)

profiler.disable()

# Print top 10 functions by cumulative time
s = StringIO()
ps = pstats.Stats(profiler, stream=s).sort_stats('cumulative')
ps.print_stats(10)
print(s.getvalue())
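
The report above is sorted by cumulative time, which favors top-level callers. Sorting by `tottime` (time spent inside a function, excluding its callees) often points more directly at the hot spot. A minimal sketch, assuming Python 3.8+ for the `cProfile.Profile` context manager; `busy_work` and `profile_top` are hypothetical stand-ins, not part of the tracking library:

```python
import cProfile
import io
import pstats

def busy_work(n=20000):
    """Hypothetical workload: sum of squares in pure Python."""
    total = 0
    for i in range(n):
        total += i * i
    return total

def profile_top(func, *args, sort_key="tottime", limit=5, **kwargs):
    """Profile one call and return the formatted pstats report as a string."""
    with cProfile.Profile() as prof:
        func(*args, **kwargs)
    buf = io.StringIO()
    pstats.Stats(prof, stream=buf).sort_stats(sort_key).print_stats(limit)
    return buf.getvalue()

report = profile_top(busy_work)
print(report)
```

Returning the report as a string keeps the helper easy to log or compare between runs.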

## 2. Numba JIT Compilation

Numba compiles Python functions to optimized machine code at runtime.

### Key Features
- `@njit` - Compile without Python interpreter ("nopython" mode)
- `@njit(parallel=True)` - Automatic parallelization
- `prange` - Parallel range for loops
- `cache=True` - Cache compiled code between runs
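
Because compilation happens on the first call, that call is much slower than later ones, which is why benchmarks should warm up first. The sketch below times the first and second call of a hypothetical `sum_of_squares` kernel. It assumes that falling back to plain Python is acceptable when Numba is not installed, so the no-op `njit` replacement is an illustration only.

```python
import time
import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback (assumption): a no-op decorator so the example still runs
    def njit(*args, **kwargs):
        if len(args) == 1 and callable(args[0]) and not kwargs:
            return args[0]
        return lambda func: func

@njit
def sum_of_squares(values):
    """Hypothetical kernel: an explicit loop that Numba can compile."""
    total = 0.0
    for i in range(values.shape[0]):
        total += values[i] * values[i]
    return total

values = np.arange(10_000, dtype=np.float64)

start = time.perf_counter()
first = sum_of_squares(values)   # includes JIT compilation when Numba is present
first_call = time.perf_counter() - start

start = time.perf_counter()
second = sum_of_squares(values)  # reuses the already compiled code
second_call = time.perf_counter() - start

print(f"First call:  {first_call*1e3:.3f} ms")
print(f"Second call: {second_call*1e3:.3f} ms")
```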

In [None]:
# Pure Python implementation
def compute_distances_python(points, center):
    """Compute distances from points to center (pure Python)."""
    n = len(points)
    distances = np.zeros(n)
    for i in range(n):
        d = 0.0
        for j in range(len(center)):
            d += (points[i, j] - center[j]) ** 2
        distances[i] = np.sqrt(d)
    return distances

# Vectorized NumPy implementation
def compute_distances_numpy(points, center):
    """Compute distances from points to center (NumPy)."""
    return np.sqrt(np.sum((points - center) ** 2, axis=1))

if NUMBA_AVAILABLE:
    # Numba JIT implementation
    @njit(cache=True)
    def compute_distances_numba(points, center):
        """Compute distances from points to center (Numba)."""
        n = len(points)
        dim = len(center)
        distances = np.zeros(n)
        for i in range(n):
            d = 0.0
            for j in range(dim):
                d += (points[i, j] - center[j]) ** 2
            distances[i] = np.sqrt(d)
        return distances

    # Parallel Numba implementation
    @njit(parallel=True, cache=True)
    def compute_distances_numba_parallel(points, center):
        """Compute distances from points to center (parallel Numba)."""
        n = len(points)
        dim = len(center)
        distances = np.zeros(n)
        for i in prange(n):
            d = 0.0
            for j in range(dim):
                d += (points[i, j] - center[j]) ** 2
            distances[i] = np.sqrt(d)
        return distances

print("Distance functions defined.")

In [None]:
# Benchmark different implementations
n_points_list = [1000, 10000, 100000, 1000000]
dim = 3

python_times = []
numpy_times = []
numba_times = []
numba_parallel_times = []

for n_points in n_points_list:
    points = np.random.randn(n_points, dim)
    center = np.random.randn(dim)
    
    # Python (only for smaller sizes)
    if n_points <= 10000:
        stats = simple_timer(compute_distances_python, points, center, n_runs=3)
        python_times.append(stats['mean'])
    else:
        python_times.append(np.nan)
    
    # NumPy
    stats = simple_timer(compute_distances_numpy, points, center, n_runs=10)
    numpy_times.append(stats['mean'])
    
    if NUMBA_AVAILABLE:
        # Numba (warmup first)
        _ = compute_distances_numba(points, center)
        stats = simple_timer(compute_distances_numba, points, center, n_runs=10)
        numba_times.append(stats['mean'])
        
        # Parallel Numba
        _ = compute_distances_numba_parallel(points, center)
        stats = simple_timer(compute_distances_numba_parallel, points, center, n_runs=10)
        numba_parallel_times.append(stats['mean'])
    
    print(f"n={n_points:7d}: NumPy={numpy_times[-1]*1e3:7.2f}ms", end='')
    if NUMBA_AVAILABLE:
        print(f", Numba={numba_times[-1]*1e3:7.2f}ms, "
              f"Parallel={numba_parallel_times[-1]*1e3:7.2f}ms")
    else:
        print()

In [None]:
# Visualize performance comparison
fig = go.Figure()

fig.add_trace(
    go.Scatter(x=n_points_list, y=np.array(numpy_times)*1e3, mode='lines+markers',
               name='NumPy', line=dict(color='#00d4ff', width=2),
               marker=dict(size=8))
)
if NUMBA_AVAILABLE:
    fig.add_trace(
        go.Scatter(x=n_points_list, y=np.array(numba_times)*1e3, mode='lines+markers',
                   name='Numba', line=dict(color='#00ff88', width=2),
                   marker=dict(size=8, symbol='square'))
    )
    fig.add_trace(
        go.Scatter(x=n_points_list, y=np.array(numba_parallel_times)*1e3, mode='lines+markers',
                   name='Numba Parallel', line=dict(color='#ff4757', width=2),
                   marker=dict(size=8, symbol='triangle-up'))
    )

fig.update_layout(
    template=dark_template,
    title='Distance Computation Performance',
    xaxis_title='Number of Points',
    yaxis_title='Time (ms)',
    xaxis_type='log',
    yaxis_type='log',
    height=450,
)
fig.show()

## 3. Vectorization Patterns

Replacing Python loops with vectorized NumPy operations moves the per-element work out of the interpreter and into compiled code, which often yields large speedups on big arrays.
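
As a small illustration of the pattern, the sketch below converts Cartesian positions to range and bearing, first with a Python loop and then with whole-array calls. The function names are hypothetical and not part of `pytcl`.

```python
import numpy as np

def to_range_bearing_loop(xy):
    """Convert (x, y) rows to (range, bearing) one row at a time."""
    out = np.empty_like(xy)
    for i in range(len(xy)):
        out[i, 0] = np.hypot(xy[i, 0], xy[i, 1])
        out[i, 1] = np.arctan2(xy[i, 1], xy[i, 0])
    return out

def to_range_bearing_vectorized(xy):
    """Same conversion using whole-column NumPy calls."""
    ranges = np.hypot(xy[:, 0], xy[:, 1])
    bearings = np.arctan2(xy[:, 1], xy[:, 0])
    return np.column_stack((ranges, bearings))

rng = np.random.default_rng(0)
xy = rng.normal(size=(1000, 2))

polar_loop = to_range_bearing_loop(xy)
polar_vec = to_range_bearing_vectorized(xy)
print("Results match:", np.allclose(polar_loop, polar_vec))
```

Checking the two versions against each other before timing them guards against a fast but wrong rewrite.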

In [None]:
# Example: Compute pairwise distances

def pairwise_distances_loop(X):
    """Compute pairwise distances using loops."""
    n = len(X)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i+1, n):
            d = np.sqrt(np.sum((X[i] - X[j])**2))
            D[i, j] = d
            D[j, i] = d
    return D

def pairwise_distances_broadcast(X):
    """Compute pairwise distances using broadcasting."""
    # Shape: (n, 1, d) - (1, n, d) = (n, n, d)
    diff = X[:, np.newaxis, :] - X[np.newaxis, :, :]
    return np.sqrt(np.sum(diff**2, axis=-1))

def pairwise_distances_einsum(X):
    """Compute pairwise distances using einsum."""
    # D[i,j]² = ||X[i] - X[j]||² = ||X[i]||² + ||X[j]||² - 2*X[i]·X[j]
    norms_sq = np.sum(X**2, axis=1)
    dot_products = np.einsum('ik,jk->ij', X, X)
    D_sq = norms_sq[:, np.newaxis] + norms_sq[np.newaxis, :] - 2 * dot_products
    D_sq = np.maximum(D_sq, 0)  # Clamp small negative values caused by round-off
    return np.sqrt(D_sq)

# Benchmark
n = 500
X = np.random.randn(n, 3)

print(f"Pairwise distances for {n} points:")

stats = simple_timer(pairwise_distances_loop, X, n_runs=3)
print(f"  Loop:      {stats['mean']*1e3:8.2f} ms")

stats = simple_timer(pairwise_distances_broadcast, X, n_runs=10)
print(f"  Broadcast: {stats['mean']*1e3:8.2f} ms")

stats = simple_timer(pairwise_distances_einsum, X, n_runs=10)
print(f"  Einsum:    {stats['mean']*1e3:8.2f} ms")

# Verify results match
D1 = pairwise_distances_loop(X)
D2 = pairwise_distances_broadcast(X)
D3 = pairwise_distances_einsum(X)
print(f"\nMax difference: {max(np.max(np.abs(D1-D2)), np.max(np.abs(D1-D3))):.2e}")

In [None]:
# Batch Kalman filter - vectorized version

def batch_kf_predict_loop(x_batch, P_batch, F, Q):
    """Predict step using loop."""
    n_batch = len(x_batch)
    x_pred = np.zeros_like(x_batch)
    P_pred = np.zeros_like(P_batch)
    
    for i in range(n_batch):
        x_pred[i] = F @ x_batch[i]
        P_pred[i] = F @ P_batch[i] @ F.T + Q
    
    return x_pred, P_pred

def batch_kf_predict_vectorized(x_batch, P_batch, F, Q):
    """Predict step using vectorized operations."""
    # x_pred[i] = F @ x_batch[i] -> use einsum
    x_pred = np.einsum('ij,bj->bi', F, x_batch)
    
    # P_pred[i] = F @ P_batch[i] @ F.T + Q
    FP = np.einsum('ij,bjk->bik', F, P_batch)  # F @ P for each batch
    FPFt = np.einsum('bij,kj->bik', FP, F)     # ... @ F.T
    P_pred = FPFt + Q
    
    return x_pred, P_pred

# Benchmark
n_batch = 1000
state_dim = 6

x_batch = np.random.randn(n_batch, state_dim)
P_batch = np.tile(np.eye(state_dim), (n_batch, 1, 1))
F = np.random.randn(state_dim, state_dim)
Q = np.eye(state_dim) * 0.1

print(f"Batch KF predict ({n_batch} tracks, {state_dim}D state):")

stats = simple_timer(batch_kf_predict_loop, x_batch, P_batch, F, Q, n_runs=10)
print(f"  Loop:       {stats['mean']*1e3:8.2f} ms")

stats = simple_timer(batch_kf_predict_vectorized, x_batch, P_batch, F, Q, n_runs=10)
print(f"  Vectorized: {stats['mean']*1e3:8.2f} ms")

## 4. Caching Strategies

Avoid recomputing expensive results that are used multiple times.
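
One caveat with `functools.lru_cache` is that it returns the same object on every hit, so a caller that mutates a cached NumPy array silently corrupts later results. A minimal sketch of one way to guard against this, using a hypothetical `rotation_matrix` helper keyed on an integer angle index with an assumed step of 0.01 rad:

```python
from functools import lru_cache
import numpy as np

ANGLE_STEP = 0.01  # assumed quantization step in radians

@lru_cache(maxsize=256)
def rotation_matrix(angle_index):
    """Return a 2x2 rotation matrix for angle_index * ANGLE_STEP (hypothetical helper)."""
    theta = angle_index * ANGLE_STEP
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s], [s, c]])
    R.setflags(write=False)  # make the shared cached array read-only
    return R

R1 = rotation_matrix(157)
R2 = rotation_matrix(157)  # served from the cache
print("Same object:", R1 is R2)
print(rotation_matrix.cache_info())
```

With the write flag cleared, an accidental in-place update raises a `ValueError` instead of changing what later callers receive.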

In [None]:
# Example: Associated Legendre polynomials (used in spherical harmonic expansions)

def compute_legendre_uncached(n_max, theta):
    """Compute associated Legendre polynomials without caching."""
    x = np.cos(theta)
    P = np.zeros((n_max + 1, n_max + 1))
    
    P[0, 0] = 1.0
    if n_max > 0:
        P[1, 0] = x
        P[1, 1] = -np.sin(theta)
    
    for n in range(2, n_max + 1):
        for m in range(n + 1):
            if m == n:
                P[n, m] = -(2*n - 1) * np.sin(theta) * P[n-1, m-1]
            elif m == n - 1:
                P[n, m] = (2*n - 1) * x * P[n-1, m]
            else:
                P[n, m] = ((2*n - 1) * x * P[n-1, m] - (n + m - 1) * P[n-2, m]) / (n - m)
    
    return P

@lru_cache(maxsize=128)
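def compute_legendre_cached(n_max, theta_index):
    """Compute associated Legendre polynomials with caching.
    
    Note: theta_index is the angle discretized to 0.01 rad steps so that
    nearby angles map to the same cache key. The result is evaluated at
    theta_index * 0.01, so it can differ slightly from the uncached value
    at the exact angle. The returned array is shared between calls and
    should not be modified by the caller.
    """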
    theta = theta_index * 0.01  # Convert index back to angle
    return compute_legendre_uncached(n_max, theta)

# Simulate repeated evaluation at same locations (common in tracking)
n_max = 20
n_evals = 1000
n_unique_locations = 50  # Many evaluations at same locations

# Random angles, but with repetition
theta_unique = np.random.rand(n_unique_locations) * np.pi
theta_indices = np.random.randint(0, n_unique_locations, n_evals)

# Uncached
start = time.perf_counter()
for i in range(n_evals):
    _ = compute_legendre_uncached(n_max, theta_unique[theta_indices[i]])
uncached_time = time.perf_counter() - start

# Cached (clear cache first)
compute_legendre_cached.cache_clear()
start = time.perf_counter()
for i in range(n_evals):
    # Convert theta to discrete index for caching
    theta_idx = int(theta_unique[theta_indices[i]] * 100)
    _ = compute_legendre_cached(n_max, theta_idx)
cached_time = time.perf_counter() - start

print(f"Legendre polynomials (n_max={n_max}, {n_evals} evaluations):")
print(f"  Uncached: {uncached_time*1e3:.2f} ms")
print(f"  Cached:   {cached_time*1e3:.2f} ms")
print(f"  Speedup:  {uncached_time/cached_time:.1f}x")
print(f"  Cache hits: {compute_legendre_cached.cache_info().hits}")
print(f"  Cache misses: {compute_legendre_cached.cache_info().misses}")

## 5. Memory Optimization

Every temporary array costs an allocation and a copy, so pre-allocating buffers and updating them in place can noticeably speed up tight loops.
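
Many NumPy ufuncs accept an `out=` argument that writes the result into an existing buffer instead of allocating a new array. A minimal sketch with a hypothetical `scaled_add` helper that reuses one caller-supplied scratch buffer:

```python
import numpy as np

def scaled_add(x, dx, gain, scratch):
    """Compute x += gain * dx without allocating temporaries (hypothetical helper)."""
    np.multiply(dx, gain, out=scratch)  # scratch = gain * dx, written into the buffer
    np.add(x, scratch, out=x)           # x = x + scratch, written back into x
    return x

x = np.zeros(4)
dx = np.ones(4)
scratch = np.empty_like(x)

for _ in range(3):
    scaled_add(x, dx, 0.5, scratch)

print(x)  # [1.5 1.5 1.5 1.5]
```

The caller owns both `x` and `scratch`, so this only pays off when the same buffers are reused across many iterations.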

In [None]:
# Pre-allocation vs. dynamic allocation

def simulate_with_append(n_steps):
    """Simulate trajectory using list append."""
    trajectory = []
    x = np.zeros(4)
    
    for _ in range(n_steps):
        x = x + np.random.randn(4) * 0.1
        trajectory.append(x.copy())
    
    return np.array(trajectory)

def simulate_with_prealloc(n_steps):
    """Simulate trajectory with pre-allocated array."""
    trajectory = np.zeros((n_steps, 4))
    x = np.zeros(4)
    
    for i in range(n_steps):
        x = x + np.random.randn(4) * 0.1
        trajectory[i] = x
    
    return trajectory

# Benchmark
n_steps = 10000

stats1 = simple_timer(simulate_with_append, n_steps, n_runs=20)
stats2 = simple_timer(simulate_with_prealloc, n_steps, n_runs=20)

print(f"Trajectory simulation ({n_steps} steps):")
print(f"  List append:    {stats1['mean']*1e3:.2f} ms")
print(f"  Pre-allocated:  {stats2['mean']*1e3:.2f} ms")
print(f"  Speedup:        {stats1['mean']/stats2['mean']:.1f}x")

In [None]:
# In-place operations

def update_with_copy(x, dx):
    """Update using copy."""
    return x + dx

def update_in_place(x, dx):
    """Update in place."""
    x += dx
    return x

# Benchmark with many iterations
n_iter = 100000
x = np.zeros(100)
dx = np.random.randn(100) * 0.001

# With copy
x_copy = x.copy()
start = time.perf_counter()
for _ in range(n_iter):
    x_copy = update_with_copy(x_copy, dx)
copy_time = time.perf_counter() - start

# In place
x_inplace = x.copy()
start = time.perf_counter()
for _ in range(n_iter):
    update_in_place(x_inplace, dx)
inplace_time = time.perf_counter() - start

print(f"Update operations ({n_iter} iterations):")
print(f"  With copy: {copy_time*1e3:.2f} ms")
print(f"  In place:  {inplace_time*1e3:.2f} ms")
print(f"  Speedup:   {copy_time/inplace_time:.1f}x")

## 6. Benchmarking Best Practices

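Reliable measurements need warmup runs, a high-resolution clock, garbage collection disabled during timing, and robust statistics (median and interquartile range) alongside the mean.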
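
The standard library's `timeit` module is a useful cross-check for hand-written timers: by default it disables garbage collection while timing and makes repeated measurement easy. A minimal sketch, where `best_of` is a hypothetical helper that reports the minimum over several batches, since the minimum is the value least affected by background noise:

```python
import timeit

def best_of(func, repeat=5, number=100):
    """Return the best per-call time in seconds over `repeat` batches of `number` calls."""
    batch_times = timeit.repeat(func, repeat=repeat, number=number)
    return min(batch_times) / number

per_call = best_of(lambda: sum(range(1000)))
print(f"Best per-call time: {per_call * 1e6:.2f} µs")
```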

In [None]:
def reliable_benchmark(func, *args, n_warmup=5, n_runs=50, **kwargs):
    """
    Perform a reliable benchmark with proper warmup and statistics.
    
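    Parameters
    ----------
    func : callable
        Function to benchmark.
    *args : tuple
        Positional arguments passed to func.
    n_warmup : int
        Number of warmup runs.
    n_runs : int
        Number of measurement runs.
    **kwargs : dict
        Keyword arguments passed to func.
        
    Returns
    -------
    dict
        Benchmark statistics in milliseconds, plus the raw 'times' array.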
    """
    import gc
    
    # Disable garbage collection during timing
    gc_was_enabled = gc.isenabled()
    gc.disable()
    
    try:
        # Warmup
        for _ in range(n_warmup):
            func(*args, **kwargs)
        
        # Measurement
        times = []
        for _ in range(n_runs):
            start = time.perf_counter_ns()
            result = func(*args, **kwargs)
            end = time.perf_counter_ns()
            times.append((end - start) / 1e6)  # Convert to ms
        
    finally:
        if gc_was_enabled:
            gc.enable()
    
    times = np.array(times)
    
    return {
        'mean': np.mean(times),
        'std': np.std(times),
        'median': np.median(times),
        'min': np.min(times),
        'max': np.max(times),
        'p25': np.percentile(times, 25),
        'p75': np.percentile(times, 75),
        'n_runs': n_runs,
        'times': times
    }

def print_benchmark_results(name, results):
    """Print formatted benchmark results."""
    print(f"{name}:")
    print(f"  Mean:   {results['mean']:.3f} ms (±{results['std']:.3f})")
    print(f"  Median: {results['median']:.3f} ms")
    print(f"  Range:  [{results['min']:.3f}, {results['max']:.3f}] ms")
    print(f"  IQR:    [{results['p25']:.3f}, {results['p75']:.3f}] ms")

In [None]:
# Run comprehensive benchmark
print("Comprehensive Benchmark Results")
print("=" * 60)

# Matrix operations
n = 500
A = np.random.randn(n, n)
B = np.random.randn(n, n)

results = reliable_benchmark(np.dot, A, B)
print_benchmark_results(f"Matrix multiply ({n}x{n})", results)

print()

# Matrix inversion
results = reliable_benchmark(np.linalg.inv, A)
print_benchmark_results(f"Matrix inversion ({n}x{n})", results)

print()

# Hungarian assignment
m = 100
cost = np.random.rand(m, m) * 100
results = reliable_benchmark(hungarian_assignment, cost)
print_benchmark_results(f"Hungarian assignment ({m}x{m})", results)

In [None]:
# Visualize benchmark distribution
# Run multiple sizes
sizes = [100, 200, 300, 400, 500]
means = []
stds = []

for n in sizes:
    A = np.random.randn(n, n)
    B = np.random.randn(n, n)
    results = reliable_benchmark(np.dot, A, B, n_runs=50)
    means.append(results['mean'])
    stds.append(results['std'])

means = np.array(means)
stds = np.array(stds)

fig = go.Figure()

# Shaded ±1σ band (upper bound first, then lower bound filled up to it), plus error bars
fig.add_trace(
    go.Scatter(x=sizes, y=means + stds, mode='lines',
               line=dict(width=0), showlegend=False, hoverinfo='skip')
)
fig.add_trace(
    go.Scatter(x=sizes, y=means - stds, mode='lines',
               fill='tonexty', fillcolor='rgba(0, 212, 255, 0.3)',
               line=dict(width=0), name='±1σ')
)
fig.add_trace(
    go.Scatter(x=sizes, y=means, mode='lines+markers',
               name='Mean time', line=dict(color='#00d4ff', width=2),
               marker=dict(size=10), error_y=dict(type='data', array=stds,
                                                   visible=True, color='#00d4ff'))
)

# O(n³) reference
n3_fit = means[0] * (np.array(sizes) / sizes[0])**3
fig.add_trace(
    go.Scatter(x=sizes, y=n3_fit, mode='lines',
               name='O(n³) reference', line=dict(color='#ff4757', width=1.5, dash='dash'))
)

fig.update_layout(
    template=dark_template,
    title='Matrix Multiplication Benchmark',
    xaxis_title='Matrix Size',
    yaxis_title='Time (ms)',
    height=450,
)
fig.show()
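
Rather than judging the agreement with the dashed reference by eye, the scaling exponent can be estimated as the slope of a straight-line fit in log-log space. A minimal sketch using synthetic timings (the constant `2e-9` is made up for illustration); with real data, the measured `sizes` and `means` would be passed instead. The helper name `scaling_exponent` is hypothetical.

```python
import numpy as np

def scaling_exponent(sizes, times):
    """Estimate p in time ~ c * n**p from a least-squares fit in log-log space."""
    slope, _intercept = np.polyfit(np.log(sizes), np.log(times), 1)
    return slope

# Synthetic example: perfectly cubic timings
sizes_demo = np.array([100, 200, 300, 400, 500], dtype=float)
times_demo = 2e-9 * sizes_demo ** 3

exponent = scaling_exponent(sizes_demo, times_demo)
print(f"Estimated exponent: {exponent:.2f}")
```

Measured exponents for small matrices often come out below 3, because fixed call overhead and cache effects dominate at those sizes.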

## Summary

Key optimization strategies:

1. **Profile first** - Don't optimize blindly
2. **Vectorize** - Replace loops with NumPy operations
3. **Use Numba** - JIT compile hot spots
4. **Cache results** - Avoid redundant computation
5. **Pre-allocate** - Reduce memory allocation overhead
6. **In-place operations** - Avoid unnecessary copies

### Optimization Checklist

| Check | Description |
|-------|-------------|
| Profile | Identify actual bottlenecks |
| Vectorize | Replace Python loops with NumPy |
| Numba | JIT compile remaining loops |
| Cache | Memoize expensive pure functions |
| Pre-allocate | Use np.zeros instead of append |
| In-place | Use += instead of x = x + |
| Data types | Use float32 if precision allows |
| Parallel | Use Numba prange or multiprocessing |
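
For the data-types row, the trade-off can be checked directly: `float32` halves memory use but keeps only about 7 significant decimal digits. A minimal sketch with made-up position values around 1e6:

```python
import numpy as np

# Positions up to about 1e6, offset so they are not exactly representable
positions64 = np.linspace(0.0, 1.0e6, 1001) + 0.1   # float64 by default
positions32 = positions64.astype(np.float32)

print(f"float64: {positions64.nbytes} bytes, float32: {positions32.nbytes} bytes")

# Rounding error introduced by the conversion
max_err = np.max(np.abs(positions64 - positions32.astype(np.float64)))
print(f"Max rounding error: {max_err:.3e}")
```

Whether an error of this size is acceptable depends on the measurement noise of the sensor being modeled.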

## Exercises

1. Profile a complete multi-target tracking pipeline and identify bottlenecks
2. Implement a Numba-accelerated particle filter resampling algorithm
3. Create a caching decorator that tracks cache hit rates
4. Optimize the JPDA algorithm for 1000+ tracks

## References

1. VanderPlas, J. (2016). *Python Data Science Handbook*.
2. Numba Documentation: https://numba.pydata.org/
3. NumPy Quickstart: https://numpy.org/doc/stable/user/quickstart.html