# Performance Optimization and Calibration (v0.7)

This notebook demonstrates how to use HPCSeries Core's auto-tuning calibration system to optimize performance for your specific hardware.

## What is Auto-Tuning Calibration?

HPCSeries automatically determines optimal:
- **Parallelization thresholds** (when to use multi-threading)
- **Thread counts** for different operation types
- **NUMA modes** for memory-intensive operations

These settings are tuned to your specific CPU architecture instead of relying on one-size-fits-all defaults.
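
As a rough illustration of what a parallelization threshold means, the sketch below picks a serial or parallel path from the array size. The helper `choose_execution` and the numbers passed to it are hypothetical and are not part of the HPCSeries API; the library makes this decision internally.

```python
def choose_execution(n_elements, threshold, threads):
    """Return (mode, threads_used) for an array of n_elements.

    Hypothetical helper for illustration only; not an hpcs function.
    """
    # Below the threshold, threading overhead is assumed to outweigh the gain.
    if n_elements >= threshold and threads > 1:
        return "parallel", threads
    return "serial", 1


# Illustrative values only; real thresholds come from calibration.
print(choose_execution(50_000, threshold=100_000, threads=8))
print(choose_execution(5_000_000, threshold=100_000, threads=8))
```

Calibration amounts to choosing the `threshold` and `threads` values for each operation type on the machine at hand.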

In [None]:
import hpcs
import numpy as np
import time
import matplotlib.pyplot as plt
import json
import os

%matplotlib inline

## 1. System Information

First, let's check your system configuration:

In [None]:
# Display CPU information
!hpcs cpuinfo

## 2. Quick Calibration

Quick calibration uses hardware heuristics (5-10 seconds):

In [None]:
# Run quick calibration
print("Running quick calibration...\n")
start = time.perf_counter()  # monotonic clock, better suited to timing
hpcs.calibrate(quick=True)
elapsed = time.perf_counter() - start
print(f"\nCalibration completed in {elapsed:.2f} seconds")

## 3. Save and Load Configuration

In [None]:
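# Save configuration (create ~/.hpcs first in case it does not exist yet)
config_path = os.path.expanduser("~/.hpcs/config.json")
os.makedirs(os.path.dirname(config_path), exist_ok=True)
hpcs.save_calibration_config(config_path)
print(f"Configuration saved to: {config_path}")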

# Load and display configuration
with open(config_path, 'r') as f:
    config = json.load(f)

print("\nConfiguration contents:")
print(json.dumps(config, indent=2))

## 4. Understanding Configuration Parameters

### Thresholds
Minimum array sizes for parallel processing:
- **simple**: Basic reductions (sum, mean, etc.)
- **rolling**: Rolling window operations
- **robust**: Robust statistics (median, MAD)
- **anomaly**: Anomaly detection operations

### Threads
Number of threads to use for each operation type

### NUMA Modes
- **0**: Default (OS scheduling)
- **1**: Interleaved memory allocation
- **2**: Local allocation
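
A small sanity check can make the structure described above concrete. The sketch below assumes the configuration has `thresholds` and `threads` sections, each keyed by the four operation categories; the helper `validate_config` and the sample values are hypothetical. The NUMA setting is left out because its key name is not shown in this notebook.

```python
EXPECTED_CATEGORIES = ("simple", "rolling", "robust", "anomaly")


def validate_config(config):
    """Return a list of problems found in a calibration config dict.

    Hypothetical helper; assumes 'thresholds' and 'threads' sections
    keyed by the four operation categories.
    """
    problems = []
    for section in ("thresholds", "threads"):
        values = config.get(section)
        if not isinstance(values, dict):
            problems.append(f"missing section: {section}")
            continue
        for category in EXPECTED_CATEGORIES:
            value = values.get(category)
            if not isinstance(value, int) or value <= 0:
                problems.append(
                    f"{section}.{category} should be a positive integer, got {value!r}"
                )
    return problems


# Hypothetical example values; real numbers come from hpcs.calibrate().
sample = {
    "thresholds": {"simple": 500_000, "rolling": 100_000, "robust": 50_000, "anomaly": 50_000},
    "threads": {"simple": 4, "rolling": 8, "robust": 8, "anomaly": 8},
}
print(validate_config(sample))
```

An empty list means every category has a positive integer threshold and thread count.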

## 5. Performance Before and After Calibration

Let's measure the impact of calibration:

In [None]:
def benchmark_operations(n=10_000_000, iterations=10):
    """Benchmark key operations."""
    data = np.random.randn(n)
    
    operations = {
        'sum': lambda: hpcs.sum(data),
        'mean': lambda: hpcs.mean(data),
        'std': lambda: hpcs.std(data),
        'median': lambda: hpcs.median(data),
        'rolling_mean': lambda: hpcs.rolling_mean(data[:100000], 50),
    }
    
    results = {}
    for name, func in operations.items():
        # Warmup
        _ = func()
        
        # Benchmark
        start = time.perf_counter()
        for _ in range(iterations):
            _ = func()
        elapsed = (time.perf_counter() - start) / iterations
        results[name] = elapsed * 1000  # Convert to ms
    
    return results

# Run benchmark
print("Benchmarking with current configuration...\n")
results = benchmark_operations()

print(f"{'Operation':<20} {'Time (ms)':<12} {'Throughput':<15}")
print("-" * 50)
for name, time_ms in results.items():
    n = 10_000_000 if 'rolling' not in name else 100_000
    throughput = n / (time_ms / 1000) / 1e6
    print(f"{name:<20} {time_ms:>8.3f}     {throughput:>8.2f} M elem/s")
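
The benchmark above reports the mean over all iterations, which a single slow run can skew. One possible alternative, sketched here with the standard library only, times each call separately and reports the median and minimum. The helper `time_call` is hypothetical, and the workload is a stand-in so the sketch runs without `hpcs`.

```python
import statistics
import time


def time_call(func, repeats=10):
    """Return (median_ms, min_ms) over `repeats` individually timed calls."""
    func()  # warmup call, not timed
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        samples.append((time.perf_counter() - start) * 1000)  # seconds -> ms
    return statistics.median(samples), min(samples)


# Stand-in workload; in the notebook this could be lambda: hpcs.sum(data).
median_ms, min_ms = time_call(lambda: sum(range(100_000)))
print(f"median {median_ms:.3f} ms, min {min_ms:.3f} ms")
```

The median is less sensitive to outliers than the mean, and the minimum approximates the best case with the least interference from other processes.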

## 6. Full Calibration (Optional)

Full calibration benchmarks actual performance (30-60 seconds):

**Note**: This cell is commented out because it takes time. Uncomment to run.

In [None]:
# Uncomment to run full calibration
# print("Running full calibration (this may take 30-60 seconds)...\n")
# start = time.time()
# hpcs.calibrate(quick=False)
# elapsed = time.time() - start
# print(f"\nFull calibration completed in {elapsed:.1f} seconds")
# hpcs.save_calibration_config(config_path)
# print(f"Updated configuration saved to: {config_path}")

## 7. Scaling Analysis

Let's analyze how performance scales with array size:

In [None]:
def scaling_benchmark(operation, sizes):
    """Measure operation time across different array sizes."""
    times = []
    
    for size in sizes:
        data = np.random.randn(size)
        
        # Warmup
        _ = operation(data)
        
        # Benchmark
        start = time.perf_counter()
        for _ in range(5):
            _ = operation(data)
        elapsed = (time.perf_counter() - start) / 5
        times.append(elapsed * 1000)  # Convert to ms
    
    return times

# Test different sizes
sizes = [1000, 10_000, 100_000, 1_000_000, 10_000_000]

# Benchmark sum operation
print("Benchmarking sum operation across different sizes...")
times_sum = scaling_benchmark(hpcs.sum, sizes)

# Plot results
plt.figure(figsize=(10, 6))
plt.loglog(sizes, times_sum, 'o-', linewidth=2, markersize=8)
plt.xlabel('Array Size', fontsize=12)
plt.ylabel('Time (ms)', fontsize=12)
plt.title('Sum Operation Performance Scaling', fontsize=14)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Calculate throughput
print(f"\n{'Size':<15} {'Time (ms)':<12} {'Throughput (M elem/s)'}")
print("-" * 50)
for size, time_ms in zip(sizes, times_sum):
    throughput = size / (time_ms / 1000) / 1e6
    print(f"{size:<15,} {time_ms:>8.3f}     {throughput:>12.2f}")
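
To put a number on the scaling curve, one option is to fit a straight line to the log-log data and read off its slope. The sketch below does this with a plain least-squares fit; `loglog_slope` is a hypothetical helper, and the timings are synthetic values that grow linearly with size, not measurements.

```python
import math


def loglog_slope(sizes, times_ms):
    """Least-squares slope of log10(time) against log10(size)."""
    xs = [math.log10(s) for s in sizes]
    ys = [math.log10(t) for t in times_ms]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    den = sum((x - x_mean) ** 2 for x in xs)
    return num / den


# Synthetic timings proportional to size, for illustration only.
demo_sizes = [1_000, 10_000, 100_000, 1_000_000]
demo_times = [0.002 * s / 1_000 for s in demo_sizes]
print(f"slope = {loglog_slope(demo_sizes, demo_times):.2f}")
```

A slope near 1 means time grows in proportion to array size. A slope well below 1 at small sizes usually indicates that fixed per-call overhead dominates the measurement.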

## 8. Comparison: Serial vs Parallel

Visualize the benefit of parallelization:

In [None]:
# Note: This is conceptual - actual serial/parallel comparison
# would require modifying the C code to disable parallelization

# For demonstration, we'll show the scaling efficiency
plt.figure(figsize=(10, 6))

# Calculate throughput for each size
throughputs = [s / (t / 1000) / 1e6 for s, t in zip(sizes, times_sum)]

plt.semilogx(sizes, throughputs, 'o-', linewidth=2, markersize=8, label='HPCSeries (parallel)')
plt.axhline(y=throughputs[0], color='r', linestyle='--', alpha=0.7, label='Throughput at smallest size (reference)')
plt.xlabel('Array Size', fontsize=12)
plt.ylabel('Throughput (M elem/s)', fontsize=12)
plt.title('Throughput: Parallel Scaling Efficiency', fontsize=14)
plt.legend(fontsize=10)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"Peak throughput: {max(throughputs):.2f} M elem/s at size {sizes[throughputs.index(max(throughputs))]:,}")
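
If serial timings were available (which, as noted above, would require disabling parallelization in the C code), the benefit could be summarized as speedup and parallel efficiency. The sketch below shows the arithmetic; the function names and the timings are hypothetical.

```python
def speedup(serial_ms, parallel_ms):
    """Ratio of serial to parallel run time."""
    return serial_ms / parallel_ms


def parallel_efficiency(serial_ms, parallel_ms, threads):
    """Speedup divided by thread count; 1.0 would be perfect scaling."""
    return speedup(serial_ms, parallel_ms) / threads


# Hypothetical timings, not measurements.
print(speedup(8.0, 2.0))
print(parallel_efficiency(8.0, 2.0, threads=8))
```

Efficiency well below 1.0 is common for memory-bound reductions, where adding threads does not add memory bandwidth.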

## 9. Operation-Specific Tuning

Different operations have different optimal configurations:

In [None]:
# Read current configuration
with open(config_path, 'r') as f:
    config = json.load(f)

# Display thresholds
thresholds = config['thresholds']
threads = config['threads']

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Plot thresholds
operations = list(thresholds.keys())
threshold_values = list(thresholds.values())
ax1.bar(operations, threshold_values, color='steelblue', alpha=0.7)
ax1.set_ylabel('Threshold (elements)', fontsize=12)
ax1.set_title('Parallelization Thresholds', fontsize=14)
ax1.tick_params(axis='x', rotation=45)
ax1.grid(True, alpha=0.3, axis='y')

# Plot thread counts (take labels from `threads` so they match its values)
thread_ops = list(threads.keys())
thread_values = list(threads.values())
ax2.bar(thread_ops, thread_values, color='coral', alpha=0.7)
ax2.set_ylabel('Thread Count', fontsize=12)
ax2.set_title('Optimal Thread Counts', fontsize=14)
ax2.tick_params(axis='x', rotation=45)
ax2.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print("Configuration Summary:")
print(f"CPU: {config['cpu_id']}")
print(f"\nThresholds (min size for parallel):")
for op, thresh in thresholds.items():
    print(f"  {op:<10} {thresh:>10,} elements")
print(f"\nThread counts:")
for op, count in threads.items():
    print(f"  {op:<10} {count:>2} threads")

## 10. Best Practices

### When to Calibrate:
1. **First installation** - Run calibration once
2. **Hardware changes** - After upgrading CPU or RAM
3. **Workload changes** - If your typical array sizes change significantly

### Calibration Modes:
- **Quick calibration** (`quick=True`): Uses hardware heuristics; fast (5-10 s)
- **Full calibration** (`quick=False`): Benchmarks actual performance; slower (30-60 s) but more accurate

### Performance Tips:
1. **Use contiguous arrays** - C-contiguous layout is fastest
2. **Batch operations** - Process larger arrays when possible
3. **Check thresholds** - Ensure your typical array sizes exceed parallelization thresholds
4. **Monitor thread count** - Match to your CPU core count for best results

### Configuration Persistence:
The configuration is saved to `~/.hpcs/config.json` and automatically loaded on import.
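
For the thread-count tip above, a quick check is to compare the configured counts against what the machine reports. The sketch below uses `os.cpu_count()`, which returns logical CPUs and can exceed the number of physical cores when SMT is enabled. The helper `oversubscribed_ops` and the sample thread settings are hypothetical.

```python
import os


def oversubscribed_ops(threads, logical_cpus=None):
    """Return {operation: count} for entries exceeding the logical CPU count."""
    if logical_cpus is None:
        logical_cpus = os.cpu_count() or 1  # cpu_count() may return None
    return {op: n for op, n in threads.items() if n > logical_cpus}


# Hypothetical thread settings; in the notebook, pass config['threads'].
demo_threads = {"simple": 4, "rolling": 8, "robust": 16, "anomaly": 8}
print(oversubscribed_ops(demo_threads, logical_cpus=8))
```

Any operation listed in the result is configured with more threads than the machine has logical CPUs, which is a reason to recalibrate.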

## 11. CLI Usage

You can also use the command-line interface:

In [None]:
# Quick calibration via CLI
!hpcs calibrate --quick

# Show configuration location
!hpcs config

# Run performance benchmarks
!hpcs bench --size 1000000 --iterations 10
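
Outside a notebook, the same commands can be driven from a script. The sketch below wraps `subprocess.run` and returns `None` when the executable is not on `PATH`. The wrapper `run_cli` is hypothetical, and it assumes the CLI signals failure with a non-zero exit code, which this notebook does not confirm.

```python
import shutil
import subprocess


def run_cli(command):
    """Run a command (list of strings) and return its stdout.

    Returns None if the executable is not found on PATH.
    Raises subprocess.CalledProcessError on a non-zero exit code.
    """
    if shutil.which(command[0]) is None:
        return None
    completed = subprocess.run(command, capture_output=True, text=True, check=True)
    return completed.stdout


# Same command as the notebook cell above.
output = run_cli(["hpcs", "calibrate", "--quick"])
print(output if output is not None else "hpcs CLI not found on PATH")
```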

## Summary

HPCSeries Core's auto-tuning calibration system provides:

✅ **Automatic optimization** for your hardware  
✅ **Operation-specific tuning** (simple, rolling, robust, anomaly)  
✅ **Easy API** (`calibrate()`, `save_calibration_config()`; the saved configuration is loaded automatically on import)  
✅ **CLI support** for automation  
✅ **Persistent configuration** across sessions  

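For most users, running `hpcs.calibrate(quick=True)` once is usually enough. Run a full calibration (`quick=False`) when you want settings based on measured performance instead of hardware heuristics.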