# ActiveSurrogate: Automatic Surrogate Modeling with Active Learning

This notebook demonstrates how to use the `ActiveSurrogate` class to build surrogate models automatically with active learning. Rather than sampling the input domain on a fixed grid, it chooses each new sample point adaptively, so that an accurate surrogate can be built from as few expensive function evaluations as possible.

## Key Features

- **Automatic Sampling**: Starts with Latin Hypercube sampling, then uses active learning to select informative points
- **Multiple Acquisition Functions**: Expected Improvement (EI), Upper Confidence Bound (UCB), Probability of Improvement (PI), Maximum Variance
- **Flexible Stopping Criteria**: Mean ratio, percentile-based, absolute threshold, convergence detection
- **Batch Support**: Can sample multiple points per iteration for parallel evaluation
- **History Tracking**: Records all metrics for analysis and visualization
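
The initial Latin Hypercube design mentioned above can be sketched with SciPy's `scipy.stats.qmc` module. This is only an illustration of the sampling idea, not necessarily how `ActiveSurrogate` generates its initial points; the helper name `latin_hypercube_design` and the `seed` argument are hypothetical.

```python
import numpy as np
from scipy.stats import qmc


def latin_hypercube_design(bounds, n_samples, seed=None):
    """Draw n_samples points from the box `bounds` with Latin Hypercube sampling.

    `bounds` is a list of (low, high) tuples, one per input dimension.
    (Hypothetical helper, not part of pycse.)
    """
    lower = np.array([low for low, _ in bounds], dtype=float)
    upper = np.array([high for _, high in bounds], dtype=float)
    sampler = qmc.LatinHypercube(d=len(bounds), seed=seed)
    unit_points = sampler.random(n=n_samples)    # points in the unit cube [0, 1)^d
    return qmc.scale(unit_points, lower, upper)  # rescaled to the requested domain


# Five initial points on the 1D domain [0, 10]
X_init = latin_hypercube_design([(0, 10)], n_samples=5, seed=0)
print(np.sort(X_init[:, 0]))
```

With five samples on `[0, 10]`, each of the five equal-width strata of width 2 contains exactly one point, which is what gives the design its even coverage.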

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel as C

from pycse.pyroxy import ActiveSurrogate

# Set random seed for reproducibility
np.random.seed(42)

## Example 1: Basic Usage with a 1D Function

Let's start with a simple 1D function that has some interesting features (multiple local minima).

In [None]:
# Define a moderately complex test function
def expensive_function(X):
    """A function with multiple local minima."""
    x = X.flatten()
    return np.sin(x) + 0.5 * np.sin(3 * x) + 0.1 * x

# Define the domain
bounds = [(0, 10)]

# Create a Gaussian Process model
kernel = C(1.0, (1e-3, 1e3)) * RBF(1.0, (1e-2, 1e2))
model = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=5)

# Build the surrogate with active learning
print("Building surrogate with Expected Improvement acquisition...")
surrogate, history = ActiveSurrogate.build(
    func=expensive_function,
    bounds=bounds,
    model=model,
    acquisition='ei',
    stopping_criterion='mean_ratio',
    stopping_threshold=1.5,
    n_initial=5,
    max_iterations=20,
    verbose=True
)

print(f"\nFinal surrogate built with {len(surrogate.xtrain)} samples")

### Visualize the Results

In [None]:
# Create test points
X_plot = np.linspace(0, 10, 200).reshape(-1, 1)
y_true = expensive_function(X_plot)
y_pred = surrogate(X_plot)

# Plot results
plt.figure(figsize=(14, 5))

# Left: Surrogate vs True Function
plt.subplot(1, 2, 1)
plt.plot(X_plot, y_true, 'b-', label='True function', linewidth=2)
plt.plot(X_plot, y_pred, 'r--', label='Surrogate', linewidth=2)
plt.scatter(surrogate.xtrain, surrogate.ytrain, c='black', s=50,
            label=f'Samples (n={len(surrogate.xtrain)})', zorder=5)
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.title('Surrogate vs True Function')
plt.grid(True, alpha=0.3)

# Right: Sample locations over iterations
plt.subplot(1, 2, 2)
n_initial = 5  # must match the n_initial passed to ActiveSurrogate.build
# Mark initial samples (assumes xtrain lists the initial design first)
plt.scatter(surrogate.xtrain[:n_initial], surrogate.ytrain[:n_initial],
            c='green', s=100, marker='s', label='Initial (LHS)', zorder=5)
# Mark actively selected samples
if len(surrogate.xtrain) > n_initial:
    plt.scatter(surrogate.xtrain[n_initial:], surrogate.ytrain[n_initial:],
                c='red', s=100, marker='^', label='Active learning', zorder=5)
plt.plot(X_plot, y_true, 'b-', alpha=0.3, linewidth=1)
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.title('Sample Selection Strategy')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

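# Print accuracy metrics
# Flatten both arrays first so a (n, 1) prediction cannot broadcast
# against a (n,) target into an (n, n) matrix of differences.
residual = np.ravel(y_pred) - np.ravel(y_true)
rmse = np.sqrt(np.mean(residual**2))
print("\nSurrogate Accuracy:")
print(f"  RMSE: {rmse:.4f}")
print(f"  Max Error: {np.max(np.abs(residual)):.4f}")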

### Analyze the Training History

In [None]:
plt.figure(figsize=(14, 4))

# Sample progression
plt.subplot(1, 3, 1)
plt.plot(history['iterations'], history['n_samples'], 'b-o')
plt.xlabel('Iteration')
plt.ylabel('Total Samples')
plt.title('Sample Count Progression')
plt.grid(True, alpha=0.3)

# Uncertainty evolution
plt.subplot(1, 3, 2)
plt.plot(history['iterations'], history['mean_uncertainty'], 'g-o', label='Mean')
plt.plot(history['iterations'], history['max_uncertainty'], 'r-s', label='Max')
plt.xlabel('Iteration')
plt.ylabel('Uncertainty')
plt.legend()
plt.title('Uncertainty Evolution')
plt.grid(True, alpha=0.3)

# Acquisition values
plt.subplot(1, 3, 3)
plt.plot(history['iterations'], history['acquisition_values'], 'm-o')
plt.xlabel('Iteration')
plt.ylabel('Best Acquisition Value')
plt.title('Acquisition Value Trends')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## Example 2: Comparing Different Acquisition Functions

Different acquisition functions have different exploration/exploitation trade-offs. Let's compare them:

In [None]:
acquisition_functions = ['ei', 'ucb', 'pi', 'variance']
acquisition_names = {
    'ei': 'Expected Improvement',
    'ucb': 'Upper Confidence Bound',
    'pi': 'Probability of Improvement',
    'variance': 'Maximum Variance'
}

results = {}

for acq in acquisition_functions:
    print(f"\nTesting {acquisition_names[acq]}...")
    
    # Create fresh model for each test
    kernel = C(1.0, (1e-3, 1e3)) * RBF(1.0, (1e-2, 1e2))
    model = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=5)
    
    # Build surrogate
    surrogate, history = ActiveSurrogate.build(
        func=expensive_function,
        bounds=bounds,
        model=model,
        acquisition=acq,
        stopping_criterion='absolute',
        stopping_threshold=0.15,
        n_initial=5,
        max_iterations=15,
        verbose=False
    )
    
    # Evaluate accuracy
    # Flatten the prediction so it matches the shape of y_true
    y_pred = np.ravel(surrogate(X_plot))
    rmse = np.sqrt(np.mean((y_pred - y_true)**2))
    
    results[acq] = {
        'surrogate': surrogate,
        'history': history,
        'y_pred': y_pred,
        'rmse': rmse,
        'n_samples': len(surrogate.xtrain)
    }
    
    print(f"  Samples used: {results[acq]['n_samples']}")
    print(f"  RMSE: {rmse:.4f}")

In [None]:
# Visualize comparison
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.flatten()

for idx, acq in enumerate(acquisition_functions):
    ax = axes[idx]
    
    # Plot true function and surrogate
    ax.plot(X_plot, y_true, 'b-', label='True', linewidth=2, alpha=0.5)
    ax.plot(X_plot, results[acq]['y_pred'], 'r--', label='Surrogate', linewidth=2)
    
    # Plot sample points
    surrogate = results[acq]['surrogate']
    ax.scatter(surrogate.xtrain, surrogate.ytrain, c='black', s=50, zorder=5)
    
    ax.set_title(f"{acquisition_names[acq]}\n"
                 f"Samples: {results[acq]['n_samples']}, RMSE: {results[acq]['rmse']:.4f}")
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    ax.legend()
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Summary comparison
print("\nSummary Comparison:")
print(f"{'Acquisition':<25} {'Samples':<10} {'RMSE':<10}")
print("-" * 45)
for acq in acquisition_functions:
    print(f"{acquisition_names[acq]:<25} {results[acq]['n_samples']:<10} {results[acq]['rmse']:<10.4f}")
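
The four strategies compared above can all be written in terms of a Gaussian process's predictive mean and standard deviation. The sketch below uses the textbook formulas under a maximization convention. It is an assumption that `ActiveSurrogate` uses these exact forms; the function name `acquisition_scores` and the parameters `kappa` and `xi` are hypothetical and chosen only for illustration.

```python
import numpy as np
from scipy.stats import norm


def acquisition_scores(mean, std, y_best, kind="ei", kappa=2.0, xi=0.01):
    """Score candidate points from GP predictions (higher score = sample next).

    Textbook forms for a maximization problem; `kappa` and `xi` are
    illustrative exploration parameters, not pycse arguments.
    """
    mean = np.asarray(mean, dtype=float)
    std = np.asarray(std, dtype=float)
    if kind == "variance":
        return std**2                      # pure exploration
    if kind == "ucb":
        return mean + kappa * std          # optimistic estimate
    improvement = mean - y_best - xi
    safe_std = np.maximum(std, 1e-12)      # avoid division by zero
    z = improvement / safe_std
    if kind == "pi":
        return norm.cdf(z)                 # probability of beating y_best
    if kind == "ei":
        return improvement * norm.cdf(z) + safe_std * norm.pdf(z)
    raise ValueError(f"unknown acquisition kind: {kind!r}")


# Three candidates: uncertain, confident but poor, and certain and good
mean = np.array([0.0, 0.5, 1.0])
std = np.array([1.0, 0.1, 0.0])
y_best = 0.8
scores = {kind: acquisition_scores(mean, std, y_best, kind)
          for kind in ("ei", "ucb", "pi", "variance")}
for kind, values in scores.items():
    print(f"{kind:<9}", np.round(values, 4))
```

Maximum variance ignores the predicted mean entirely, whereas EI, UCB, and PI weigh the mean against the uncertainty in different ways, which is where the differing exploration/exploitation behaviour comes from.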

## Example 3: 2D Function with Visualization

Let's demonstrate active learning on a 2D function.

In [None]:
# Define 2D test function
def func_2d(X):
    """2D test function with interesting features."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.sin(x1) * np.cos(x2) + 0.1 * x1

# Define 2D domain
bounds_2d = [(0, 2*np.pi), (0, 2*np.pi)]

# Create model
kernel = C(1.0, (1e-3, 1e3)) * RBF([1.0, 1.0], (1e-2, 1e2))
model_2d = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=5)

# Build surrogate
print("Building 2D surrogate...")
surrogate_2d, history_2d = ActiveSurrogate.build(
    func=func_2d,
    bounds=bounds_2d,
    model=model_2d,
    acquisition='ei',
    stopping_criterion='mean_ratio',
    stopping_threshold=2.0,
    n_initial=10,
    max_iterations=20,
    verbose=True
)

print(f"\n2D Surrogate built with {len(surrogate_2d.xtrain)} samples")

In [None]:
# Create meshgrid for visualization
x1_grid = np.linspace(0, 2*np.pi, 50)
x2_grid = np.linspace(0, 2*np.pi, 50)
X1, X2 = np.meshgrid(x1_grid, x2_grid)
X_grid = np.column_stack([X1.ravel(), X2.ravel()])

# Evaluate true function and surrogate
y_true_2d = func_2d(X_grid).reshape(X1.shape)
y_pred_2d = surrogate_2d(X_grid).reshape(X1.shape)

# Visualize
fig = plt.figure(figsize=(16, 5))

# True function
ax1 = fig.add_subplot(131, projection='3d')
surf1 = ax1.plot_surface(X1, X2, y_true_2d, cmap='viridis', alpha=0.8)
ax1.scatter(surrogate_2d.xtrain[:, 0], surrogate_2d.xtrain[:, 1], 
            surrogate_2d.ytrain, c='red', s=50, marker='o')
ax1.set_title('True Function + Sample Points')
ax1.set_xlabel('x1')
ax1.set_ylabel('x2')
ax1.set_zlabel('y')

# Surrogate
ax2 = fig.add_subplot(132, projection='3d')
surf2 = ax2.plot_surface(X1, X2, y_pred_2d, cmap='viridis', alpha=0.8)
ax2.scatter(surrogate_2d.xtrain[:, 0], surrogate_2d.xtrain[:, 1], 
            surrogate_2d.ytrain, c='red', s=50, marker='o')
ax2.set_title('Surrogate Model')
ax2.set_xlabel('x1')
ax2.set_ylabel('x2')
ax2.set_zlabel('y')

# Error
ax3 = fig.add_subplot(133)
error = np.abs(y_true_2d - y_pred_2d)
contour = ax3.contourf(X1, X2, error, levels=20, cmap='Reds')
ax3.scatter(surrogate_2d.xtrain[:, 0], surrogate_2d.xtrain[:, 1], 
            c='blue', s=50, marker='o', edgecolors='white', linewidths=1.5)
plt.colorbar(contour, ax=ax3, label='Absolute Error')
ax3.set_title('Prediction Error')
ax3.set_xlabel('x1')
ax3.set_ylabel('x2')

plt.tight_layout()
plt.show()

print("\n2D Surrogate Statistics:")
print(f"  Mean Absolute Error: {np.mean(error):.4f}")
print(f"  Max Absolute Error: {np.max(error):.4f}")
print(f"  RMSE: {np.sqrt(np.mean(error**2)):.4f}")

## Example 4: Using Batch Mode for Parallel Evaluation

When your function can be evaluated in parallel, batch mode samples multiple points per iteration.

In [None]:
# Build surrogate with batch sampling
kernel = C(1.0, (1e-3, 1e3)) * RBF(1.0, (1e-2, 1e2))
model_batch = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=5)

print("Building surrogate with batch_size=3...")
surrogate_batch, history_batch = ActiveSurrogate.build(
    func=expensive_function,
    bounds=bounds,
    model=model_batch,
    acquisition='ucb',
    batch_size=3,  # Sample 3 points per iteration
    stopping_criterion='absolute',
    stopping_threshold=0.15,
    n_initial=5,
    max_iterations=10,
    verbose=True
)

print(f"\nBatch surrogate built with {len(surrogate_batch.xtrain)} samples")
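n_iterations = len(history_batch['iterations'])
n_active = len(surrogate_batch.xtrain) - 5  # subtract the n_initial=5 starting points
print(f"Iterations run: {n_iterations}")
if n_iterations > 0:  # guard against division by zero if no iterations were recorded
    print(f"Average samples per iteration: {n_active / n_iterations:.1f}")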
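
The Key Takeaways below mention that batch mode relies on hallucination. A common way to do this is to treat the model's own prediction at each selected point as a temporary observation, so that the next pick in the same batch is pushed elsewhere. The sketch below shows that idea with scikit-learn and a maximum-variance criterion. It is an assumption that `ActiveSurrogate` works this way internally; `select_batch_by_hallucination` is a hypothetical helper.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF


def select_batch_by_hallucination(X_train, y_train, candidates, batch_size, kernel):
    """Greedily pick `batch_size` candidates, hallucinating each pick's outcome.

    Hypothetical helper: after each pick, the GP's predicted mean at that
    point is appended as a fake observation before choosing the next point.
    """
    X_aug, y_aug = X_train.copy(), y_train.copy()
    chosen = []
    for _ in range(batch_size):
        # optimizer=None keeps the kernel hyperparameters fixed between picks
        gp = GaussianProcessRegressor(kernel=kernel, optimizer=None, alpha=1e-8)
        gp.fit(X_aug, y_aug)
        mean, std = gp.predict(candidates, return_std=True)
        best = int(np.argmax(std))  # most uncertain candidate
        chosen.append(best)
        # Hallucinate: pretend the predicted mean was actually observed
        X_aug = np.vstack([X_aug, candidates[best:best + 1]])
        y_aug = np.append(y_aug, mean[best])
    return candidates[chosen]


X_train = np.array([[0.0], [5.0], [10.0]])
y_train = np.sin(X_train).ravel()
candidates = np.linspace(0, 10, 101).reshape(-1, 1)
batch = select_batch_by_hallucination(X_train, y_train, candidates,
                                      batch_size=3, kernel=RBF(length_scale=1.0))
print(batch.ravel())
```

Because the hallucinated observation collapses the predictive uncertainty around each chosen point, the three points in the batch end up spread across the domain rather than clustered, and all three can then be evaluated in parallel.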

## Example 5: Custom Callback for Monitoring

You can provide a callback function to monitor or log the training process.

In [None]:
# Define a custom callback
callback_data = {'iterations': [], 'uncertainties': []}

def my_callback(iteration, history):
    """Custom callback to track specific metrics."""
    callback_data['iterations'].append(iteration)
    callback_data['uncertainties'].append(history['mean_uncertainty'][-1])
    
    if iteration % 5 == 0:
        print(f"  [Callback] Iteration {iteration}: Mean uncertainty = {history['mean_uncertainty'][-1]:.4f}")

# Build surrogate with callback
kernel = C(1.0, (1e-3, 1e3)) * RBF(1.0, (1e-2, 1e2))
model_cb = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=5)

print("Building surrogate with custom callback...")
surrogate_cb, history_cb = ActiveSurrogate.build(
    func=expensive_function,
    bounds=bounds,
    model=model_cb,
    acquisition='ei',
    stopping_criterion='convergence',
    stopping_threshold=0.05,
    n_initial=5,
    max_iterations=25,
    callback=my_callback,
    verbose=False
)

print(f"\nCallback tracked {len(callback_data['iterations'])} iterations")

# Plot callback data
plt.figure(figsize=(10, 4))
plt.plot(callback_data['iterations'], callback_data['uncertainties'], 'b-o')
plt.xlabel('Iteration')
plt.ylabel('Mean Uncertainty')
plt.title('Custom Callback Monitoring')
plt.grid(True, alpha=0.3)
plt.show()
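
A callback can also write its metrics to disk, which is useful for long runs. The sketch below assumes the same `callback(iteration, history)` signature used above and that each history entry is a list whose last element is the latest value, as in `my_callback`. The factory `make_csv_logger` is a hypothetical helper, and the demo feeds it a hand-written history rather than a real training run.

```python
import csv
import os
import tempfile


def make_csv_logger(path, keys=("n_samples", "mean_uncertainty", "max_uncertainty")):
    """Return a callback(iteration, history) that appends one CSV row per call.

    Hypothetical helper; `keys` are the history entries to record.
    """
    # Start a fresh file with a header row
    with open(path, "w", newline="") as f:
        csv.writer(f).writerow(["iteration", *keys])

    def callback(iteration, history):
        row = [iteration] + [history[key][-1] for key in keys]
        with open(path, "a", newline="") as f:
            csv.writer(f).writerow(row)

    return callback


# Demo with a hand-written history dictionary (not real training output)
log_path = os.path.join(tempfile.mkdtemp(), "active_learning_log.csv")
log_callback = make_csv_logger(log_path)
fake_history = {
    "n_samples": [5, 6],
    "mean_uncertainty": [0.4, 0.3],
    "max_uncertainty": [0.9, 0.7],
}
log_callback(1, fake_history)

with open(log_path, newline="") as f:
    rows = list(csv.reader(f))
print(rows)
```

The returned function could then be passed as `callback=log_callback` in the same way as `my_callback`.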

## Summary

This notebook demonstrated:

1. **Basic Usage**: Building a 1D surrogate with Expected Improvement and the mean-ratio stopping criterion
2. **Acquisition Functions**: Comparing EI, UCB, PI, and variance-based strategies
3. **2D Functions**: Extending to multi-dimensional problems
4. **Batch Mode**: Sampling multiple points per iteration for parallel evaluation
5. **Custom Callbacks**: Monitoring and logging during training

### Key Takeaways

- **Expected Improvement (EI)** balances exploration and exploitation well for optimization
- **Maximum Variance** samples where the model's predictive uncertainty is highest, which suits pure exploration and domain coverage
- **Batch mode** with hallucination provides diverse samples for parallel evaluation
- **Stopping criteria** can be tuned based on your accuracy requirements
- The returned surrogate is a standard `_Surrogate` object that can be used like any other pyroxy surrogate

### When to Use ActiveSurrogate

- Your function is expensive to evaluate (simulations, experiments, etc.)
- You want to minimize the number of function evaluations
- You need a surrogate model for optimization or analysis
- You want automatic, intelligent sampling rather than manual grid selection