# Evaluating Sailing Agents

This notebook provides a simple interface for evaluating sailing agents on different initial windfields. You can:

1. Test your agent on any predefined initial windfield
2. Get quantitative performance metrics (success rate, rewards, steps)
3. Optionally visualize your agent's behavior

## Setup

First, let's import the necessary evaluation tools:

In [None]:
import sys
import os
import numpy as np
import matplotlib.pyplot as plt
from typing import Dict, Any

# Add the src directory to the path
sys.path.append(os.path.abspath('../src'))
sys.path.append(os.path.abspath('..'))

# Import the evaluation tools
from src.test_agent_validity import validate_agent, load_agent_class
from src.evaluation import evaluate_agent, visualize_trajectory
from initial_windfields import get_initial_windfield, INITIAL_WINDFIELDS

# List available initial windfields
print("Available initial windfields:")
for windfield_name in sorted(INITIAL_WINDFIELDS.keys()):
    print(f"- {windfield_name}")

## Configuration

Set your evaluation parameters below. You can easily modify these values without changing the rest of the notebook.

In [None]:
#############################################
### MODIFY THESE PARAMETERS AS NEEDED ######
#############################################

# Path to your agent implementation (change this to your agent file path)
# AGENT_PATH = "../src/agents/agent_naive.py"
#AGENT_PATH = "../src/agents/agent_trained_example.py"
AGENT_PATH = "../src/agents/sailing_agent.py"

# Initial windfield to evaluate on (choose from the list printed above)
INITIAL_WINDFIELD_NAME = "simple_static" # Options: simple_static, training_1, training_2, training_3, etc.

# Evaluation parameters
SEEDS = [42, 43, 44, 45, 46]  # Seeds to use for evaluation
MAX_HORIZON = 200            # Maximum steps per episode
VERBOSE = True               # Show progress bar
RENDER = False               # Enable rendering (slower but necessary for visualization)

#############################################
### DO NOT MODIFY BELOW THIS LINE ##########
#############################################

# Validation and informational prints
print(f"Agent to evaluate: {AGENT_PATH}")
print(f"Initial windfield: {INITIAL_WINDFIELD_NAME}")
print(f"Using {len(SEEDS)} seeds: {SEEDS}")
print(f"Max steps per episode: {MAX_HORIZON}")

## Load and Validate Agent

Next, load and validate your agent implementation:

In [None]:
def load_and_validate_agent(agent_path):
    """Load and validate an agent from a file path."""
    try:
        # Validate the agent first
        validation_results = validate_agent(agent_path)

        if not validation_results['valid']:
            print("❌ Agent validation failed:")
            for error in validation_results['errors']:
                print(f"  - {error}")
            return None

        # If valid, return the agent class
        return validation_results['agent_class']

    except Exception as e:
        print(f"❌ Error loading agent: {str(e)}")
        return None

# Load and validate the agent specified in AGENT_PATH
AgentClass = load_and_validate_agent(AGENT_PATH)

if AgentClass:
    print(f"✅ Successfully loaded agent: {AgentClass.__name__}")
    # Create an instance of your agent
    agent = AgentClass()
else:
    print("⚠️ Please fix your agent implementation before evaluation.")

If your agent relies on trained weights, load them once the `agent` instance exists:

In [None]:
# Load trained weights (requires the `agent` created in the previous cell)
if 'agent' in locals():
    try:
        # Import the custom model loader
        sys.path.append(os.path.abspath('..'))
        from src.utils.sailing_model_loader import load_trained_sailing_model

        # Try to load the model weights
        model_loaded = load_trained_sailing_model(agent)

        if model_loaded:
            print("✅ Successfully loaded trained weights for the WindAwareNavigator")
        else:
            print("⚠️ Using the agent with default weights")
    except ImportError as e:
        print(f"⚠️ Model loader not found: {e}")
        print("Using the agent with default weights")
    except Exception as e:
        print(f"⚠️ Error loading model: {e}")
        print("Using the agent with default weights")
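
For intuition, loading an agent class from a file path, which `load_and_validate_agent` delegates to `validate_agent`, can be sketched with the standard library's `importlib`. This is a minimal sketch and not the repository's implementation: the helper name `load_class_from_path`, the rule of picking the alphabetically first class defined in the file, and the `DemoAgent` class are all assumptions made for illustration.

```python
import importlib.util
import inspect
import os
import tempfile


def load_class_from_path(path):
    """Import a Python file by path and return a class it defines.

    Hypothetical helper for illustration; the repository's loader may
    select the agent class differently.
    """
    module_name = os.path.splitext(os.path.basename(path))[0]
    spec = importlib.util.spec_from_file_location(module_name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    # Keep only classes defined in this file, not ones it imported.
    classes = [
        obj for _, obj in inspect.getmembers(module, inspect.isclass)
        if obj.__module__ == module_name
    ]
    if not classes:
        raise ValueError(f"No class defined in {path}")
    # getmembers sorts by name, so this is the alphabetically first class.
    return classes[0]


# Demo with a throwaway agent file (the class body is made up).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_agent.py")
    with open(demo_path, "w") as f:
        f.write(
            "class DemoAgent:\n"
            "    def act(self, observation):\n"
            "        return 0\n"
        )
    DemoClass = load_class_from_path(demo_path)

print(DemoClass.__name__)
```

A real validator would presumably also check that the class exposes the methods the environment calls, which is why the notebook reports a list of `errors` rather than a single failure.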

## Evaluate on Specified Initial Windfield

Let's evaluate your agent on the initial windfield you selected:

In [None]:
def print_evaluation_results(results):
    """Print evaluation results in a readable format."""
    print("\n" + "="*50)
    print("EVALUATION RESULTS")
    print("="*50)
    
    print(f"Success Rate: {results['success_rate']:.2%}")
    print(f"Mean Reward: {results['mean_reward']:.2f} ± {results['std_reward']:.2f}")
    print(f"Mean Steps: {results['mean_steps']:.1f} ± {results['std_steps']:.1f}")
    
    if 'individual_results' in results:
        print("\nIndividual Episode Results:")
        for episode in results['individual_results']:
            print(f"  Seed {episode['seed']}: " + 
                  f"Reward={episode['reward']:.1f}, " +
                  f"Steps={episode['steps']}, " +
                  f"Success={'✓' if episode['success'] else '✗'}")
    
    print("="*50)

# Only run if the agent was successfully loaded
if 'agent' in locals():
    # Get the selected initial windfield
    initial_windfield = get_initial_windfield(INITIAL_WINDFIELD_NAME)
    
    print(f"Evaluating agent on initial windfield: {INITIAL_WINDFIELD_NAME}")
    print(f"Using {len(SEEDS)} seeds with max horizon of {MAX_HORIZON} steps")
    
    # Run the evaluation
    results = evaluate_agent(
        agent=agent,
        initial_windfield=initial_windfield,
        seeds=SEEDS,
        max_horizon=MAX_HORIZON,
        verbose=VERBOSE,
        render=RENDER,
        full_trajectory=True  # Need full trajectory for later visualization
    )
    
    # Display the results
    print_evaluation_results(results)
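
To see how the summary fields relate to the per-episode entries, the sketch below recomputes them from a list shaped like `results['individual_results']`. The episode values are made up, and using the population standard deviation is an assumption; `evaluate_agent` may compute its spread differently.

```python
from statistics import mean, pstdev

# Made-up per-episode results with the same keys the notebook prints.
episodes = [
    {"seed": 42, "reward": 60.0, "steps": 50, "success": True},
    {"seed": 43, "reward": 0.0, "steps": 200, "success": False},
    {"seed": 44, "reward": 70.0, "steps": 40, "success": True},
    {"seed": 45, "reward": 50.0, "steps": 70, "success": True},
]


def summarize(episodes):
    """Aggregate per-episode results into summary statistics."""
    rewards = [ep["reward"] for ep in episodes]
    steps = [ep["steps"] for ep in episodes]
    return {
        # Fraction of episodes that reached the goal.
        "success_rate": sum(ep["success"] for ep in episodes) / len(episodes),
        "mean_reward": mean(rewards),
        "std_reward": pstdev(rewards),  # population std, an assumption
        "mean_steps": mean(steps),
        "std_steps": pstdev(steps),
    }


summary = summarize(episodes)
print(f"Success Rate: {summary['success_rate']:.2%}")
print(f"Mean Reward: {summary['mean_reward']:.2f} ± {summary['std_reward']:.2f}")
print(f"Mean Steps: {summary['mean_steps']:.1f} ± {summary['std_steps']:.1f}")
```

Note how a single failed episode that runs to the step limit inflates both the mean and the spread of the step count, so the success rate and the step statistics are best read together.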

## Evaluate on All Training Scenarios

To get a comprehensive evaluation, you can test your agent on all training scenarios:

In [None]:
#############################################
### MODIFY THESE PARAMETERS AS NEEDED ######
#############################################

# Choose which training initial windfields to evaluate on
TRAINING_INITIAL_WINDFIELDS = ["training_1", "training_2", "training_3"]

# Evaluation parameters for all initial windfields
ALL_SEEDS = [42, 43, 44, 45, 46]  # Seeds to use for all evaluations
ALL_MAX_HORIZON = 200             # Maximum steps per episode

#############################################
### DO NOT MODIFY BELOW THIS LINE ##########
#############################################

# Only run if the agent was successfully loaded
if 'agent' in locals():
    # Store results for each initial windfield
    all_results = {}
    
    print(f"Evaluating agent on {len(TRAINING_INITIAL_WINDFIELDS)} training initial windfields...")
    
    # Evaluate on each initial windfield
    for initial_windfield_name in TRAINING_INITIAL_WINDFIELDS:
        print(f"\nInitial windfield: {initial_windfield_name}")
        
        # Get the initial windfield
        initial_windfield = get_initial_windfield(initial_windfield_name)
        
        # Run the evaluation
        results = evaluate_agent(
            agent=agent,
            initial_windfield=initial_windfield,
            seeds=ALL_SEEDS,
            max_horizon=ALL_MAX_HORIZON,
            verbose=False,  # Less verbose for multiple evaluations
            render=False,
            full_trajectory=False
        )
        
        # Store results
        all_results[initial_windfield_name] = results
        
        # Print summary
        print(f"  Success Rate: {results['success_rate']:.2%}")
        print(f"  Mean Reward: {results['mean_reward']:.2f}")
        print(f"  Mean Steps: {results['mean_steps']:.1f}")
    
    # Print overall performance
    overall_success_rate = sum(r['success_rate'] for r in all_results.values()) / len(all_results)
    print("\n" + "="*50)
    print(f"OVERALL SUCCESS RATE: {overall_success_rate:.2%}")
    print("="*50)

## Summary Results Across Initial Windfields

The table below summarizes your agent's performance across all the training initial windfields. 
This gives you a comprehensive view of how well your agent generalizes to different wind patterns and conditions.

A strong agent should:
1. Maintain a high success rate across all initial windfields
2. Achieve a high mean reward
3. Complete episodes in fewer steps (better efficiency)

Compare your agent's performance across initial windfields to identify potential weaknesses that you might address in future improvements.
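
One way to act on this comparison is to sort the windfields by success rate and inspect the weakest first. The sketch below assumes a dictionary shaped like `all_results` from the previous section; the numbers are placeholders rather than measured results, and `rank_by_success` is a hypothetical helper.

```python
# Placeholder results keyed by windfield name (values are made up).
example_results = {
    "training_1": {"success_rate": 1.0, "mean_reward": 62.0, "mean_steps": 48.0},
    "training_2": {"success_rate": 0.6, "mean_reward": 35.5, "mean_steps": 120.0},
    "training_3": {"success_rate": 0.8, "mean_reward": 50.1, "mean_steps": 75.0},
}


def rank_by_success(results_by_windfield):
    """Return (name, results) pairs sorted from weakest to strongest.

    Ties on success rate are broken by mean reward.
    """
    return sorted(
        results_by_windfield.items(),
        key=lambda item: (item[1]["success_rate"], item[1]["mean_reward"]),
    )


ranking = rank_by_success(example_results)
for name, res in ranking:
    print(f"{name}: success={res['success_rate']:.0%}, reward={res['mean_reward']:.1f}")

weakest_name = ranking[0][0]
print(f"Weakest windfield: {weakest_name}")
```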

In [None]:
#############################################
### SUMMARY TABLE FOR ALL INITIAL WINDFIELDS #########
#############################################

# Only run if the agent was successfully loaded and evaluated on multiple initial windfields
if 'agent' in locals() and 'all_results' in locals():
    # Create summary table with pandas
    import pandas as pd
    
    # Prepare data for summary table
    summary_data = []
    for initial_windfield_name, results in all_results.items():
        summary_data.append({
            'Initial Windfield': initial_windfield_name.upper(),
            'Mean Reward': f"{results['mean_reward']:.2f} ± {results['std_reward']:.2f}",
            'Success Rate': f"{results['success_rate']:.2%}",
            'Mean Steps': f"{results['mean_steps']:.1f} ± {results['std_steps']:.1f}"
        })
    
    # Create summary DataFrame
    summary_df = pd.DataFrame(summary_data)
    
    # Display summary table
    from IPython.display import display
    print("\nSummary of Results Across All Initial Windfields:")
    display(summary_df)
    
    # Calculate average across initial windfields
    avg_success_rate = np.mean([results['success_rate'] for results in all_results.values()])
    avg_reward = np.mean([results['mean_reward'] for results in all_results.values()])
    avg_steps = np.mean([results['mean_steps'] for results in all_results.values()])
    
    print("\nAverage Across Training Initial Windfields:")
    print(f"  Success Rate: {avg_success_rate:.2%}")
    print(f"  Mean Reward: {avg_reward:.2f}")
    print(f"  Mean Steps: {avg_steps:.1f}")
    print("\nNote: Your final evaluation will include hidden test initial windfields.")

## Visualize Agent Behavior (Optional)

If you want to see how your agent behaves on a specific initial windfield, you can visualize its trajectory.
Visualization is controlled by the `VISUALIZE` flag below, which is already set to `True`; set it to `False` to skip this step.

In [None]:
#############################################
### MODIFY THESE PARAMETERS AS NEEDED ######
#############################################

# Set to True to enable visualization
VISUALIZE = True

# Visualization parameters
VIZ_INITIAL_WINDFIELD_NAME = "training_1"  # Choose which initial windfield to visualize
VIZ_SEED = 42                    # Choose a single seed for visualization

#############################################
### DO NOT MODIFY BELOW THIS LINE ##########
#############################################

# Only run if visualization is enabled and agent is loaded
if VISUALIZE and 'agent' in locals():
    # Get the initial windfield with visualization parameters
    viz_initial_windfield = get_initial_windfield(VIZ_INITIAL_WINDFIELD_NAME)
    viz_initial_windfield.update({
        'env_params': {
            'wind_grid_density': 25,
            'wind_arrow_scale': 80,
            'render_mode': "rgb_array"
        }
    })
    
    print(f"Visualizing agent behavior on initial windfield: {VIZ_INITIAL_WINDFIELD_NAME}")
    print(f"Using seed: {VIZ_SEED}")
    
    # Run the evaluation with visualization enabled
    viz_results = evaluate_agent(
        agent=agent,
        initial_windfield=viz_initial_windfield,
        seeds=VIZ_SEED,
        max_horizon=MAX_HORIZON,
        verbose=False,
        render=True,
        full_trajectory=True  # Enable full trajectory for visualization
    )
    
    # Visualize the trajectory with a slider
    visualize_trajectory(viz_results, None, with_slider=True)
elif 'agent' in locals():
    print("Visualization is disabled. Set VISUALIZE = True to see agent behavior.")

## Command-Line Evaluation

For quick evaluation of your agent on different scenarios, you can use the command-line interface:

```bash
cd src
python3 evaluate_submission.py agents/agent_naive.py --initial_windfield training_1 --seeds 1 --num-seeds 100 --verbose
```

### Command Options

- `agents/agent_naive.py`: Path to your agent implementation file
- `--initial_windfield NAME`: Specific initial windfield to evaluate on (e.g., `training_1`, `training_2`, `training_3`)
- `--seeds N`: Starting seed number (default: 1)
- `--num-seeds N`: Number of consecutive seeds to evaluate on (default: 1)
- `--output FILE`: Save results to a JSON file (e.g., `--output results.json`)
- `--verbose`: Show detailed evaluation results (default: simplified output)
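
To make the interplay between `--seeds` and `--num-seeds` concrete, here is a minimal `argparse` sketch that accepts the options listed above and expands them into consecutive seeds. It is an illustration only, not the source of `evaluate_submission.py`: the function names `build_parser` and `expand_seeds` are hypothetical, and treating an omitted `--initial_windfield` as `None` is an assumption.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(description="Evaluate a sailing agent")
    parser.add_argument("agent_path", help="Path to the agent implementation file")
    parser.add_argument("--initial_windfield", default=None,
                        help="Initial windfield name (None here when omitted, an assumption)")
    parser.add_argument("--seeds", type=int, default=1, help="Starting seed number")
    parser.add_argument("--num-seeds", type=int, default=1,
                        help="Number of consecutive seeds to evaluate on")
    parser.add_argument("--output", default=None, help="JSON file for the results")
    parser.add_argument("--verbose", action="store_true", help="Show detailed results")
    return parser


def expand_seeds(start, count):
    """Return `count` consecutive seeds beginning at `start`."""
    return list(range(start, start + count))


# Parse the same arguments as the example command above.
args = build_parser().parse_args(
    ["agents/agent_naive.py", "--initial_windfield", "training_1",
     "--seeds", "1", "--num-seeds", "100", "--verbose"]
)
seeds = expand_seeds(args.seeds, args.num_seeds)
print(f"{len(seeds)} seeds, from {seeds[0]} to {seeds[-1]}")
```

Under this reading, `--seeds 1 --num-seeds 100` evaluates seeds 1 through 100 inclusive.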

### Evaluating on Multiple Initial Windfields

To evaluate on all training initial windfields:

```bash
cd src
python3 evaluate_submission.py agents/agent_naive.py --seeds 1 --num-seeds 100
```

This will run your agent on all available training windfields and compute the average performance.

### Sample Output (Simplified)

```text
Validating agent: agents/agent_naive.py
✅ Successfully loaded agent: NaiveAgent
Evaluating on 3 scenarios with 100 seeds each
SCENARIO | SUCCESS RATE | MEAN REWARD | MEAN STEPS
training_1 | Success: 98.00% | Reward: 61.43 ± 3.85 | Steps: 49.2 ± 6.2
training_2 | Success: 94.00% | Reward: 58.21 ± 4.12 | Steps: 53.8 ± 7.5
training_3 | Success: 96.00% | Reward: 59.87 ± 3.96 | Steps: 51.4 ± 6.8
======================================================================
OVERALL | 96.00% ± 2.00% | 59.84 ± 3.98 | 51.5 ± 6.8
======================================================================
```


For more detailed output, add the `--verbose` flag to see seed-by-seed results.
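
The `OVERALL` row in the sample can be reproduced from the three scenario rows. The arithmetic below is inferred from the sample numbers, not taken from the script: the overall success rate matches the mean and sample standard deviation of the per-scenario rates, while the reward and step figures match the averages of the per-scenario means and of the per-scenario standard deviations.

```python
from statistics import mean, stdev

# Per-scenario values copied from the sample output: (mean, std) pairs.
rows = {
    "training_1": {"success": 98.0, "reward": (61.43, 3.85), "steps": (49.2, 6.2)},
    "training_2": {"success": 94.0, "reward": (58.21, 4.12), "steps": (53.8, 7.5)},
    "training_3": {"success": 96.0, "reward": (59.87, 3.96), "steps": (51.4, 6.8)},
}

# Success: mean and sample std across the three scenarios.
success = [row["success"] for row in rows.values()]
overall_success = (mean(success), stdev(success))

# Reward and steps: average the means and the stds separately.
overall_reward = (mean(row["reward"][0] for row in rows.values()),
                  mean(row["reward"][1] for row in rows.values()))
overall_steps = (mean(row["steps"][0] for row in rows.values()),
                 mean(row["steps"][1] for row in rows.values()))

overall_line = (
    f"OVERALL | {overall_success[0]:.2f}% ± {overall_success[1]:.2f}% | "
    f"{overall_reward[0]:.2f} ± {overall_reward[1]:.2f} | "
    f"{overall_steps[0]:.1f} ± {overall_steps[1]:.1f}"
)
print(overall_line)
```

Read this way, the `±` on the success rate describes variation between scenarios, whereas the `±` on reward and steps reflects the typical seed-to-seed spread within a scenario.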

## Conclusion

This notebook provides a standardized way to evaluate agents for the Sailing Challenge. You've now:

1. Validated your agent's implementation to ensure it meets the interface requirements
2. Evaluated your agent on one or more initial windfields to measure its performance
3. Viewed a summary of your agent's results across multiple initial windfields
4. Optionally visualized your agent's behavior on a specific initial windfield

### Next Steps

- **Fine-tune your agent**: Use the performance metrics to identify areas for improvement
- **Test across all initial windfields**: Ensure your agent can handle different wind patterns
- **Optimize for efficiency**: Aim to reach the goal in fewer steps
- **Consider advanced strategies**: Experiment with algorithms that better account for wind physics

Remember that your final agent will be evaluated on both the training initial windfields and hidden test initial windfields, so your agent should be robust and adaptable.

Good luck with your agent submission!