# YAML Configuration for NLSQ Workflows

> Define NLSQ workflow settings in YAML files so they are reproducible and easy to share

**20 minutes** | **Level: Intermediate**

---

## What You'll Learn

By the end of this notebook, you will be able to:

- Create an `nlsq.yaml` configuration file
- Configure tolerances, memory limits, and checkpointing via YAML
- Use the `load_yaml_config()` and `get_custom_workflow()` functions
- Override YAML settings with environment variables

---

## Learning Path

**You are here:** Workflow System > **YAML Configuration**

```
Workflow Presets --> [You are here: YAML Configuration] --> HPC & Checkpointing
```

**Recommended flow:**
- **Previous:** [04_workflow_presets.ipynb](04_workflow_presets.ipynb) - Built-in workflow presets
- **Next:** [07_hpc_and_checkpointing.ipynb](07_hpc_and_checkpointing.ipynb) - HPC cluster integration

---

## Before You Begin

**Required knowledge:**
- Familiarity with NLSQ workflows (see 01-04 in this section)
- Basic understanding of YAML format

**Required software:**
- NLSQ >= 0.3.4
- Python >= 3.12
- pyyaml (optional dependency): `pip install pyyaml`

**Check prerequisites:**

In [None]:
# Verify pyyaml is installed
try:
    import yaml
    print(f"pyyaml version: {yaml.__version__}")
except ImportError:
    print("pyyaml not installed. Install with: pip install pyyaml")

---

## Why This Matters

YAML configuration provides:

1. **Reproducibility:** Store exact workflow settings for experiments
2. **Shareability:** Share configurations with collaborators
3. **Version control:** Track configuration changes in git
4. **Environment flexibility:** Override YAML settings with environment variables on a per-environment basis

**Common use cases:**
- Production deployments with fixed settings
- HPC batch jobs with environment-specific overrides
- Multi-team projects with shared defaults

---

## Quick Start (30 seconds)

Create a minimal `nlsq.yaml` and load it:

In [None]:
import yaml
from pathlib import Path

# Create a minimal nlsq.yaml
minimal_config = {
    "default_workflow": "robust",
    "memory_limit_gb": 8.0,
}

# Write to file
config_path = Path("nlsq.yaml")
with open(config_path, "w") as f:
    yaml.dump(minimal_config, f, default_flow_style=False)

print("Created nlsq.yaml:")
print(config_path.read_text())

In [None]:
from nlsq.workflow import load_yaml_config

# Load the configuration
config = load_yaml_config()
print(f"Loaded config: {config}")

---

## Setup

In [None]:
import os
from pathlib import Path

import yaml
import numpy as np
import jax.numpy as jnp

from nlsq import curve_fit
from nlsq.workflow import (
    load_yaml_config,
    get_custom_workflow,
    get_env_overrides,
    load_config_with_overrides,
    WorkflowConfig,
    WorkflowTier,
    OptimizationGoal,
)

np.random.seed(42)

---

## Tutorial Content

### Section 1: YAML File Structure

The `nlsq.yaml` file supports the following structure:

```yaml
# Top-level settings
default_workflow: robust    # Default workflow preset
memory_limit_gb: 32.0       # Memory limit in GB

# Custom workflow definitions
workflows:
  my_workflow:
    tier: CHUNKED
    goal: QUALITY
    gtol: 1.0e-9   # keep the decimal point so PyYAML loads a float, not a string
    ftol: 1.0e-9
    xtol: 1.0e-9
    enable_multistart: true
    n_starts: 15
```

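One detail worth checking when tolerances are written by hand: PyYAML follows the YAML 1.1 float rules, so a value such as `1e-9` without a decimal point loads as a string, while `1.0e-9` loads as a float. Whether NLSQ converts such strings itself is not stated in this notebook, so the sketch below only demonstrates the PyYAML behavior and a defensive cast. The helper `coerce_tolerances` is hypothetical and not part of NLSQ.

```python
import yaml

text = """
workflows:
  my_workflow:
    gtol: 1e-9      # no decimal point
    ftol: 1.0e-9    # decimal point present
"""

settings = yaml.safe_load(text)["workflows"]["my_workflow"]
print(type(settings["gtol"]).__name__)  # str
print(type(settings["ftol"]).__name__)  # float


def coerce_tolerances(workflow, keys=("gtol", "ftol", "xtol")):
    """Return a copy of a workflow mapping with tolerance values cast to float.

    Hypothetical helper for illustration; it is not an NLSQ function.
    """
    fixed = dict(workflow)
    for key in keys:
        if key in fixed:
            fixed[key] = float(fixed[key])
    return fixed


print(coerce_tolerances(settings))
```

Files written with `yaml.dump()` from Python floats avoid the issue, because PyYAML emits the decimal point itself.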
In [None]:
# Create a complete nlsq.yaml example
complete_config = {
    # Top-level defaults
    "default_workflow": "standard",
    "memory_limit_gb": 16.0,
    
    # Custom workflow definitions
    "workflows": {
        # High-precision workflow
        "high_precision": {
            "tier": "STANDARD",
            "goal": "QUALITY",
            "gtol": 1e-10,
            "ftol": 1e-10,
            "xtol": 1e-10,
            "enable_multistart": True,
            "n_starts": 20,
            "sampler": "lhs",
        },
        # Fast exploration workflow
        "quick_explore": {
            "tier": "STANDARD",
            "goal": "FAST",
            "gtol": 1e-5,
            "ftol": 1e-5,
            "xtol": 1e-5,
            "enable_multistart": False,
        },
        # Large dataset workflow
        "large_data": {
            "tier": "CHUNKED",
            "goal": "ROBUST",
            "gtol": 1e-8,
            "ftol": 1e-8,
            "xtol": 1e-8,
            "memory_limit_gb": 8.0,
            "chunk_size": 100000,
            "enable_multistart": True,
            "n_starts": 10,
        },
        # HPC with checkpointing
        "hpc_checkpoint": {
            "tier": "STREAMING_CHECKPOINT",
            "goal": "ROBUST",
            "gtol": 1e-7,
            "ftol": 1e-7,
            "xtol": 1e-7,
            "enable_checkpoints": True,
            "checkpoint_dir": "./checkpoints",
            "enable_multistart": True,
            "n_starts": 10,
        },
    },
}

# Write to nlsq.yaml
config_path = Path("nlsq.yaml")
with open(config_path, "w") as f:
    yaml.dump(complete_config, f, default_flow_style=False, sort_keys=False)

print("Created complete nlsq.yaml:")
print("=" * 50)
print(config_path.read_text())

### Section 2: Loading YAML Configuration

In [None]:
# Load configuration from nlsq.yaml
config = load_yaml_config()

print("Loaded configuration:")
print(f"  default_workflow: {config.get('default_workflow')}")
print(f"  memory_limit_gb: {config.get('memory_limit_gb')}")
print(f"  workflows defined: {list(config.get('workflows', {}).keys())}")

In [None]:
# Load from a specific path
config_custom_path = load_yaml_config("./nlsq.yaml")
print(f"Loaded from custom path: {config_custom_path is not None}")

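The Common Questions section notes that `load_yaml_config()` returns `None` when no file is found. A minimal sketch of one way to handle that case is shown here. The `FALLBACK` values and the `config_or_fallback` helper are assumptions made for illustration, not NLSQ defaults or NLSQ API.

```python
# Hypothetical fallback values for this sketch, not NLSQ's built-in defaults.
FALLBACK = {"default_workflow": "standard", "memory_limit_gb": 16.0, "workflows": {}}


def config_or_fallback(config, fallback=FALLBACK):
    """Merge a loaded config (possibly None) over fallback values."""
    merged = dict(fallback)
    if config:  # None and {} both mean there is nothing to apply
        merged.update(config)
    return merged


print(config_or_fallback(None)["default_workflow"])
print(config_or_fallback({"memory_limit_gb": 8.0})["memory_limit_gb"])
```

In a notebook this could be applied as `config = config_or_fallback(load_yaml_config())`, so later cells can index the dictionary without checking for `None` each time.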
### Section 3: Using get_custom_workflow()

The `get_custom_workflow()` function loads a named workflow from your YAML file
and returns a `WorkflowConfig` object.

In [None]:
# Get the high_precision workflow
high_precision = get_custom_workflow("high_precision")

if high_precision:
    print("high_precision workflow:")
    print(f"  tier: {high_precision.tier}")
    print(f"  goal: {high_precision.goal}")
    print(f"  gtol: {high_precision.gtol}")
    print(f"  enable_multistart: {high_precision.enable_multistart}")
    print(f"  n_starts: {high_precision.n_starts}")
else:
    print("Workflow not found")

In [None]:
# Get the large_data workflow
large_data = get_custom_workflow("large_data")

if large_data:
    print("large_data workflow:")
    print(f"  tier: {large_data.tier}")
    print(f"  memory_limit_gb: {large_data.memory_limit_gb}")
    print(f"  chunk_size: {large_data.chunk_size}")

In [None]:
# Get the HPC checkpoint workflow
hpc_workflow = get_custom_workflow("hpc_checkpoint")

if hpc_workflow:
    print("hpc_checkpoint workflow:")
    print(f"  tier: {hpc_workflow.tier}")
    print(f"  enable_checkpoints: {hpc_workflow.enable_checkpoints}")
    print(f"  checkpoint_dir: {hpc_workflow.checkpoint_dir}")

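To make the lookup concrete, the following sketch mimics the idea of turning one entry of the `workflows` mapping into a typed object. It is not NLSQ's implementation: `SketchWorkflow` is a hypothetical stand-in for `WorkflowConfig`, its field defaults are illustrative, and rejecting unknown keys is a design choice of this sketch rather than documented NLSQ behavior.

```python
from dataclasses import dataclass, fields


@dataclass
class SketchWorkflow:
    """Hypothetical stand-in for WorkflowConfig; defaults are illustrative."""

    tier: str = "STANDARD"
    goal: str = "ROBUST"
    gtol: float = 1e-8
    ftol: float = 1e-8
    xtol: float = 1e-8
    enable_multistart: bool = False
    n_starts: int = 0


def workflow_from_mapping(name, config):
    """Build a SketchWorkflow from config["workflows"][name], or return None."""
    definition = (config or {}).get("workflows", {}).get(name)
    if definition is None:
        return None
    known = {f.name for f in fields(SketchWorkflow)}
    unknown = sorted(set(definition) - known)
    if unknown:
        # Surface typos such as "gtoll" instead of silently ignoring them.
        raise KeyError(f"unknown keys in workflow {name!r}: {unknown}")
    return SketchWorkflow(**definition)


config = {"workflows": {"quick": {"goal": "FAST", "gtol": 1e-5}}}
print(workflow_from_mapping("quick", config))
print(workflow_from_mapping("missing", config))
```

Keys that the YAML entry omits keep the dataclass defaults, which is why a short workflow definition can still produce a complete object.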
### Section 4: Environment Variable Overrides

Environment variables take precedence over YAML settings, allowing
different configurations per environment.

**Supported environment variables:**

| Variable | Description |
|----------|-------------|
| `NLSQ_WORKFLOW_GOAL` | Override optimization goal (FAST, ROBUST, QUALITY, etc.) |
| `NLSQ_MEMORY_LIMIT_GB` | Override memory limit in GB |
| `NLSQ_DEFAULT_WORKFLOW` | Override default workflow name |
| `NLSQ_CHECKPOINT_DIR` | Override checkpoint directory |

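The precedence rule can be sketched in plain Python. This is an illustration of the idea only, not the code behind `get_env_overrides()` or `load_config_with_overrides()`. The mapping from variable names to configuration keys and the upper-casing of the goal value are assumptions of this sketch.

```python
import os

# Assumed mapping from environment variable to (config key, converter).
ENV_MAP = {
    "NLSQ_WORKFLOW_GOAL": ("goal", str.upper),
    "NLSQ_MEMORY_LIMIT_GB": ("memory_limit_gb", float),
    "NLSQ_DEFAULT_WORKFLOW": ("default_workflow", str),
    "NLSQ_CHECKPOINT_DIR": ("checkpoint_dir", str),
}


def merge_env(yaml_config, environ=None):
    """Return yaml_config with any set environment variables applied on top."""
    environ = os.environ if environ is None else environ
    merged = dict(yaml_config or {})
    for var, (key, convert) in ENV_MAP.items():
        raw = environ.get(var)
        if raw:  # skip variables that are unset or empty
            merged[key] = convert(raw)
    return merged


yaml_values = {"goal": "ROBUST", "memory_limit_gb": 8.0}
fake_env = {"NLSQ_WORKFLOW_GOAL": "quality", "NLSQ_MEMORY_LIMIT_GB": "32.0"}
print(merge_env(yaml_values, fake_env))
```

Passing the environment as an argument keeps the sketch testable without touching the real `os.environ`. Environment values always arrive as strings, so numeric settings need an explicit conversion.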
In [None]:
# Set environment variables
os.environ["NLSQ_WORKFLOW_GOAL"] = "QUALITY"
os.environ["NLSQ_MEMORY_LIMIT_GB"] = "32.0"

# Get overrides from environment
env_overrides = get_env_overrides()

print("Environment variable overrides:")
for key, value in env_overrides.items():
    print(f"  {key}: {value}")

In [None]:
# Load config with environment overrides
merged_config = load_config_with_overrides()

print("\nMerged configuration (YAML + environment):")
print(f"  goal: {merged_config.get('goal')}")
print(f"  memory_limit_gb: {merged_config.get('memory_limit_gb')}")

In [None]:
# Clean up environment variables (pop with a default is safe if the cell is re-run)
os.environ.pop("NLSQ_WORKFLOW_GOAL", None)
os.environ.pop("NLSQ_MEMORY_LIMIT_GB", None)

print("Environment variables cleaned up")

### Section 5: Using Custom Workflows with curve_fit()

Apply your custom workflow configuration to actual fitting.

In [None]:
# Define a test model
def exponential_decay(x, a, b, c):
    return a * jnp.exp(-b * x) + c

# Generate test data
x_data = np.linspace(0, 5, 300)
true_a, true_b, true_c = 2.5, 1.2, 0.5
y_true = true_a * np.exp(-true_b * x_data) + true_c
y_data = y_true + 0.1 * np.random.randn(len(x_data))

print(f"True parameters: a={true_a}, b={true_b}, c={true_c}")

In [None]:
# Load custom workflow and use its settings
workflow = get_custom_workflow("high_precision")

if workflow:
    # Apply workflow settings to curve_fit
    popt, pcov = curve_fit(
        exponential_decay,
        x_data,
        y_data,
        p0=[1.0, 1.0, 0.0],
        bounds=([0, 0, -1], [10, 5, 2]),
        gtol=workflow.gtol,
        ftol=workflow.ftol,
        xtol=workflow.xtol,
        multistart=workflow.enable_multistart,
        n_starts=workflow.n_starts if workflow.enable_multistart else 0,
        sampler=workflow.sampler,
    )
    
    print("\nFitted with high_precision workflow:")
    print(f"  a={popt[0]:.6f}, b={popt[1]:.6f}, c={popt[2]:.6f}")
    print("\nWorkflow settings used:")
    print(f"  gtol: {workflow.gtol}")
    print(f"  multistart: {workflow.enable_multistart}")
    print(f"  n_starts: {workflow.n_starts}")

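The keyword arguments passed to `curve_fit()` above can be collected in one place so the mapping is not repeated at every call site. The helper name `workflow_to_fit_kwargs` is hypothetical, and the sketch assumes only the attribute and keyword names already used in this section. A `SimpleNamespace` stands in for a `WorkflowConfig` so the example runs without NLSQ installed.

```python
from types import SimpleNamespace


def workflow_to_fit_kwargs(workflow):
    """Translate workflow attributes into the curve_fit keyword arguments."""
    multistart = bool(workflow.enable_multistart)
    return {
        "gtol": workflow.gtol,
        "ftol": workflow.ftol,
        "xtol": workflow.xtol,
        "multistart": multistart,
        # n_starts is only meaningful when multistart is enabled
        "n_starts": workflow.n_starts if multistart else 0,
        "sampler": workflow.sampler,
    }


# Stand-in object with the same attribute names as the workflow used above.
demo = SimpleNamespace(
    gtol=1e-10, ftol=1e-10, xtol=1e-10,
    enable_multistart=True, n_starts=20, sampler="lhs",
)
print(workflow_to_fit_kwargs(demo))
```

With a real workflow object the call would then read `curve_fit(exponential_decay, x_data, y_data, p0=[1.0, 1.0, 0.0], **workflow_to_fit_kwargs(workflow))`.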
### Section 6: Creating a Workflow Helper Function

Create a helper that loads and applies YAML configuration automatically.

In [None]:
def fit_with_yaml_config(f, xdata, ydata, p0=None, bounds=(-np.inf, np.inf),
                         workflow_name=None, config_path="nlsq.yaml"):
    """Curve fit using YAML-defined workflow configuration.
    
    Parameters
    ----------
    f : callable
        Model function
    xdata, ydata : array_like
        Data to fit
    p0 : array_like, optional
        Initial parameters
    bounds : tuple, optional
        Parameter bounds
    workflow_name : str, optional
        Name of workflow from nlsq.yaml. If None, uses default_workflow.
    config_path : str, optional
        Path to YAML config file
    
    Returns
    -------
    popt, pcov : tuple
        Fitted parameters and covariance
    """
    # Load YAML config
    yaml_config = load_yaml_config(config_path)
    
    if yaml_config is None:
        print("No YAML config found, using defaults")
        return curve_fit(f, xdata, ydata, p0=p0, bounds=bounds)
    
    # Determine workflow to use
    if workflow_name is None:
        workflow_name = yaml_config.get("default_workflow", "standard")
    
    # Try to get custom workflow
    workflow = get_custom_workflow(workflow_name, config_path)
    
    if workflow is None:
        # Fall back to preset
        try:
            workflow = WorkflowConfig.from_preset(workflow_name)
        except ValueError:
            workflow = WorkflowConfig()  # Default
    
    # Apply workflow settings
    return curve_fit(
        f,
        xdata,
        ydata,
        p0=p0,
        bounds=bounds,
        gtol=workflow.gtol,
        ftol=workflow.ftol,
        xtol=workflow.xtol,
        multistart=workflow.enable_multistart,
        n_starts=workflow.n_starts if workflow.enable_multistart else 0,
        sampler=workflow.sampler,
    )

print("fit_with_yaml_config() helper defined")

In [None]:
# Use the helper with different workflows
print("Using 'high_precision' workflow:")
popt1, _ = fit_with_yaml_config(
    exponential_decay, x_data, y_data,
    p0=[1.0, 1.0, 0.0],
    bounds=([0, 0, -1], [10, 5, 2]),
    workflow_name="high_precision",
)
print(f"  Result: a={popt1[0]:.4f}, b={popt1[1]:.4f}, c={popt1[2]:.4f}")

print("\nUsing 'quick_explore' workflow:")
popt2, _ = fit_with_yaml_config(
    exponential_decay, x_data, y_data,
    p0=[1.0, 1.0, 0.0],
    bounds=([0, 0, -1], [10, 5, 2]),
    workflow_name="quick_explore",
)
print(f"  Result: a={popt2[0]:.4f}, b={popt2[1]:.4f}, c={popt2[2]:.4f}")

---

## Key Takeaways

After completing this notebook, remember:

1. **YAML structure:** Top-level settings + `workflows` dictionary for custom definitions

2. **Loading functions:**
   - `load_yaml_config()` - Load raw YAML dictionary
   - `get_custom_workflow(name)` - Get `WorkflowConfig` from YAML
   - `load_config_with_overrides()` - Merge YAML + environment

3. **Environment variables override YAML:**
   - `NLSQ_WORKFLOW_GOAL` - Goal override
   - `NLSQ_MEMORY_LIMIT_GB` - Memory override
   - `NLSQ_DEFAULT_WORKFLOW` - Default workflow name
   - `NLSQ_CHECKPOINT_DIR` - Checkpoint directory

4. **Best practices:**
   - Version control your `nlsq.yaml` file
   - Use environment variables for deployment-specific overrides
   - Define workflows for common use cases in your project

---

## Common Questions

**Q: What if nlsq.yaml doesn't exist?**

A: `load_yaml_config()` returns `None`, so check for that and fall back to defaults.

**Q: Can I use YAML without installing pyyaml?**

A: No. pyyaml is an optional dependency of NLSQ, but the YAML features require it. Install it with `pip install pyyaml`.

**Q: Where should I put nlsq.yaml?**

A: By default, NLSQ looks in the current working directory. To use a different location, pass the path explicitly: `load_yaml_config(path)`.

---

## Related Resources

**Next steps:**
- [07_hpc_and_checkpointing.ipynb](07_hpc_and_checkpointing.ipynb) - HPC cluster integration with PBS

**Further reading:**
- [YAML specification](https://yaml.org/spec/)
- [NLSQ API Documentation](https://nlsq.readthedocs.io/)

---

## Glossary

**YAML:** YAML Ain't Markup Language - a human-readable data serialization format.

**Environment variable override:** A value set in the process environment (for example in the shell or through `os.environ`) that takes precedence over file-based configuration.

In [None]:
# Cleanup: remove the test nlsq.yaml
if config_path.exists():
    config_path.unlink()
    print("Cleaned up nlsq.yaml")

# Final summary
print("\n" + "=" * 50)
print("Summary")
print("=" * 50)
print("\nYAML configuration enables:")
print("  - Reproducible workflow settings")
print("  - Easy sharing between collaborators")
print("  - Environment-specific overrides")
print("\nKey functions:")
print("  - load_yaml_config()")
print("  - get_custom_workflow(name)")
print("  - load_config_with_overrides()")