# Advanced Machine Learning for Solar Boat Telemetry

This notebook demonstrates advanced ML techniques for analyzing and predicting solar boat performance.

## Topics Covered:
1. Advanced Feature Engineering
2. Multiple Model Comparison (Linear, Random Forest, XGBoost)
3. Time-Series Cross-Validation
4. Feature Importance Analysis
5. Prediction Visualization

In [None]:
# Install advanced ML dependencies if needed
# !pip install -e ".[ml-advanced]"

import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime, timedelta

# Solar Regatta imports
from solar_regatta import (
    generate_sample_vesc_data,
    calculate_speeds,
    analyze_performance,
)

from solar_regatta.ml import (
    FeatureEngineer,
    prepare_training_data,
    train_speed_model,
    evaluate_model,
    time_series_split,
    compare_models,
)

# Try to import advanced models
try:
    from solar_regatta.ml import (
        RandomForestSpeedModel,
        XGBoostSpeedModel,
        GradientBoostingSpeedModel,
    )
    ADVANCED_MODELS_AVAILABLE = True
    print("✓ Advanced ML models loaded successfully")
except ImportError:
    ADVANCED_MODELS_AVAILABLE = False
    print("⚠ Advanced models not available. Install with: pip install -e '.[ml-advanced]'")

print("\n📊 Notebook ready!")

## 1. Generate Sample Telemetry Data

First, let's generate realistic telemetry data for a 10-minute race.

In [None]:
# Generate 10 minutes of telemetry at 1-second intervals
duration = 600  # 10 minutes
interval = 1    # 1 second

print(f"Generating {duration}s of telemetry data...")
gps_points, timestamps, speeds_raw, battery_voltage, motor_current = \
    generate_sample_vesc_data(duration_seconds=duration, interval=interval)

# Calculate actual speeds from GPS
speeds = calculate_speeds(gps_points, timestamps)

print(f"\n✓ Generated {len(gps_points)} data points")
print(f"  - {len(speeds)} speed measurements")
print(f"  - {len(battery_voltage)} voltage readings")
print(f"  - {len(motor_current)} current readings")

# Basic statistics
metrics = analyze_performance(speeds, battery_voltage, motor_current, timestamps)
print("\n📈 Race Summary:")
print(f"  Duration: {metrics['duration']:.0f}s ({metrics['duration']/60:.1f} min)")
print(f"  Distance: {metrics['distance']:.1f}m")
print(f"  Avg Speed: {metrics['avg_speed']:.2f} m/s")
print(f"  Battery: {metrics['min_voltage']:.2f}V - {metrics['max_voltage']:.2f}V")

## 2. Advanced Feature Engineering

Derive rolling statistics, lag features, derivatives, and physics-based quantities from the raw speed, voltage, and current signals.
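
To make these feature families concrete, here is a minimal NumPy sketch of one feature from each. It is an illustration under assumptions, not the `FeatureEngineer` implementation: the helper names `rolling_mean` and `lag` are hypothetical, and the real class may pad, window, or name features differently.

```python
import numpy as np

def rolling_mean(values, window):
    """Trailing rolling mean; early entries average whatever history exists."""
    values = np.asarray(values, dtype=float)
    csum = np.cumsum(np.insert(values, 0, 0.0))
    out = np.empty_like(values)
    for i in range(len(values)):
        start = max(0, i - window + 1)
        out[i] = (csum[i + 1] - csum[start]) / (i + 1 - start)
    return out

def lag(values, k):
    """Shift values back by k steps, padding the start with the first value."""
    values = np.asarray(values, dtype=float)
    if k == 0:
        return values.copy()
    return np.concatenate([np.full(k, values[0]), values[:-k]])

# Toy telemetry (hypothetical values, 1-second spacing)
speed = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
voltage = np.array([12.6, 12.5, 12.5, 12.4, 12.3])
current = np.array([5.0, 6.0, 6.5, 7.0, 7.5])
t = np.arange(5.0)

features = np.column_stack([
    rolling_mean(speed, 3),   # rolling statistic
    lag(speed, 1),            # lag feature
    np.gradient(speed, t),    # first derivative (rate of change)
    voltage * current,        # physics feature: electrical power in watts
])
print(features.shape)
```

The rolling and lag helpers look only at past samples, which matters when the target is the next speed value. Note that `np.gradient` uses central differences in the interior, so it does read one sample ahead; a backward difference would be the strictly causal choice.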

In [None]:
# Create feature engineer with multiple transformations
feature_engineer = FeatureEngineer(
    rolling_windows=[3, 5, 10],  # Multiple time windows
    lag_features=3,              # Use last 3 values
    include_derivatives=True,    # Rate of change
    include_physics=True,        # Power, efficiency, etc.
)

# Convert timestamps to seconds
time_seconds = np.array([(t - timestamps[0]).total_seconds() for t in timestamps])

# Generate features
print("Creating advanced features...")
X_full, feature_names = feature_engineer.fit_transform(
    np.array(speeds),
    np.array(battery_voltage[:len(speeds)]),
    np.array(motor_current[:len(speeds)]),
    time_seconds[:len(speeds)]
)

print(f"\n✓ Created {X_full.shape[1]} features from {X_full.shape[0]} samples")
print(f"\nFeature categories:")
print(f"  - Original: 4 (speed, voltage, current, delta_time)")
print(f"  - Rolling stats: {4 * len([3, 5, 10]) * 3} (mean, std, max/min for 3 windows)")
print(f"  - Lag features: {3 * 3} (3 lags × 3 variables)")
print(f"  - Derivatives: {3 * 2} (1st & 2nd derivatives)")
print(f"  - Physics: 6 (power, efficiency metrics)")

# Show some feature names
print(f"\nSample features:")
for i, name in enumerate(feature_names[:15]):
    print(f"  {i+1}. {name}")
print("  ...")

## 3. Prepare Training Data

Build one-step-ahead targets (each row predicts the next speed sample), then split chronologically so the test set lies entirely after the training set.

In [None]:
# Prepare data for prediction (predict next speed)
n_samples = len(speeds) - 1
X = X_full[:n_samples]
y = np.array(speeds[1:n_samples+1])  # Next speed value

# Time-series split (80/20)
split_idx = int(0.8 * len(X))
X_train, X_test = X[:split_idx], X[split_idx:]
y_train, y_test = y[:split_idx], y[split_idx:]

print(f"Data split:")
print(f"  Training: {len(X_train)} samples")
print(f"  Testing:  {len(X_test)} samples")
print(f"  Features: {X_train.shape[1]}")
print(f"\nTarget (speed) range: {y.min():.2f} - {y.max():.2f} m/s")

## 4. Train Multiple Models

Compare Linear Regression, Random Forest, and XGBoost.

In [None]:
from solar_regatta.ml import PerformanceModel

results = {}

# 1. Linear Regression (baseline)
print("Training Linear Regression...")
linear_model = PerformanceModel(
    coefficients=np.zeros(X_train.shape[1]),
    intercept=0.0,
    feature_names=feature_names
)
# Use numpy least squares
X_aug = np.column_stack([X_train, np.ones(len(X_train))])
solution, *_ = np.linalg.lstsq(X_aug, y_train, rcond=None)
linear_model.coefficients = solution[:-1]
linear_model.intercept = solution[-1]

linear_metrics = evaluate_model(linear_model, X_test, y_test)
results['Linear Regression'] = linear_metrics
print(f"  R² = {linear_metrics['r2']:.4f}")

if ADVANCED_MODELS_AVAILABLE:
    # 2. Random Forest
    print("\nTraining Random Forest...")
    rf_model = RandomForestSpeedModel(n_estimators=100, max_depth=10, random_state=42)
    rf_model.fit(X_train, y_train)
    rf_metrics = evaluate_model(rf_model, X_test, y_test)
    results['Random Forest'] = rf_metrics
    print(f"  R² = {rf_metrics['r2']:.4f}")
    
    # 3. XGBoost
    print("\nTraining XGBoost...")
    xgb_model = XGBoostSpeedModel(n_estimators=100, max_depth=6, learning_rate=0.1, random_state=42)
    xgb_model.fit(X_train, y_train, verbose=False)
    xgb_metrics = evaluate_model(xgb_model, X_test, y_test)
    results['XGBoost'] = xgb_metrics
    print(f"  R² = {xgb_metrics['r2']:.4f}")
else:
    print("\n⚠ Skipping advanced models (not installed)")

print("\n✓ Model training complete!")

## 5. Model Comparison

Compare all models across multiple metrics.
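
For reference, the four metrics in the comparison table can be computed as below. This is a sketch using the standard textbook definitions; the dictionary keys mirror those the notebook reads from `evaluate_model` (`r2`, `rmse`, `mae`, `mape`), but whether `evaluate_model` computes them exactly this way is an assumption. The function name `regression_metrics` is hypothetical.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Standard regression metrics; MAPE is reported in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    ss_res = np.sum(err ** 2)                        # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": np.sqrt(np.mean(err ** 2)),
        "mae": np.mean(np.abs(err)),
        "mape": 100.0 * np.mean(np.abs(err / y_true)),
    }

demo = regression_metrics([2.0, 4.0, 6.0], [2.5, 4.0, 5.5])
print(demo)
```

Two caveats apply to boat telemetry: MAPE is undefined whenever the true speed is zero (for example, while the boat is stationary), and R² can be negative on a test set when a model does worse than predicting the mean.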

In [None]:
import pandas as pd

# Create comparison table
comparison_df = pd.DataFrame(results).T
comparison_df = comparison_df[['r2', 'rmse', 'mae', 'mape']]
comparison_df.columns = ['R²', 'RMSE', 'MAE', 'MAPE (%)']

print("\n" + "="*70)
print("MODEL COMPARISON")
print("="*70)
print(comparison_df.to_string())
print("="*70)

# Visualize comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# R² comparison
comparison_df['R²'].plot(kind='bar', ax=axes[0], color=['blue', 'green', 'red'][:len(comparison_df)])
axes[0].set_title('Model Accuracy (R²)', fontsize=14, fontweight='bold')
axes[0].set_ylabel('R² Score')
axes[0].set_ylim([0, 1])
axes[0].grid(True, alpha=0.3)

# Error comparison
comparison_df['RMSE'].plot(kind='bar', ax=axes[1], color=['blue', 'green', 'red'][:len(comparison_df)])
axes[1].set_title('Model Error (RMSE)', fontsize=14, fontweight='bold')
axes[1].set_ylabel('RMSE (m/s)')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n📊 Best model:", comparison_df['R²'].idxmax())

## 6. Feature Importance Analysis

Understand which features drive predictions.
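
The cell below reads importances directly from the Random Forest. A model-agnostic cross-check is permutation importance: shuffle one feature column at a time and measure how much the error grows. The sketch below is a different technique from whatever `get_feature_importance()` returns, and the function name `permutation_importance` and the toy predictor are hypothetical.

```python
import numpy as np

def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Mean increase in MSE when each feature column is shuffled."""
    rng = np.random.default_rng(seed)
    baseline = np.mean((y - predict(X)) ** 2)
    scores = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])  # break the feature-target link
            scores[j] += np.mean((y - predict(X_perm)) ** 2) - baseline
    return scores / n_repeats

# Toy data: feature 0 matters a lot, feature 2 a little, feature 1 not at all
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = 3.0 * X_demo[:, 0] + 0.1 * X_demo[:, 2]

def predict_demo(X):
    """Stand-in for a fitted model's predict method."""
    return 3.0 * X[:, 0] + 0.1 * X[:, 2]

scores = permutation_importance(predict_demo, X_demo, y_demo)
print(scores)
```

With the notebook's models, something like `permutation_importance(rf_model.predict, X_test, y_test)` should work, assuming `predict` accepts a 2-D array as it does elsewhere in this notebook. Keep in mind that shuffling rows breaks temporal ordering, so lag and rolling features that are strongly correlated with each other can share or mask importance.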

In [None]:
if ADVANCED_MODELS_AVAILABLE:
    # Get feature importance from Random Forest
    importances = rf_model.get_feature_importance()
    
    # Create DataFrame
    importance_df = pd.DataFrame({
        'feature': feature_names,
        'importance': importances
    }).sort_values('importance', ascending=False)
    
    # Top 20 features
    top_20 = importance_df.head(20)
    
    print("\nTop 20 Most Important Features:")
    print("="*70)
    for idx, row in top_20.iterrows():
        print(f"{row['feature']:40s} {row['importance']:.4f}")
    print("="*70)
    
    # Visualize
    plt.figure(figsize=(12, 8))
    plt.barh(range(len(top_20)), top_20['importance'].values[::-1])  # len() guards against <20 features
    plt.yticks(range(len(top_20)), top_20['feature'].values[::-1])
    plt.xlabel('Importance Score', fontsize=12)
    plt.title('Top 20 Feature Importances (Random Forest)', fontsize=14, fontweight='bold')
    plt.grid(True, alpha=0.3, axis='x')
    plt.tight_layout()
    plt.show()
else:
    print("\n⚠ Feature importance requires advanced models")

## 7. Prediction Visualization

Compare actual vs predicted speeds.

In [None]:
# Make predictions
linear_pred = linear_model.predict(X_test)

if ADVANCED_MODELS_AVAILABLE:
    rf_pred = rf_model.predict(X_test)
    xgb_pred = xgb_model.predict(X_test)

# Time indices for test set
test_time = range(len(X_train), len(X_train) + len(X_test))

# Plot
plt.figure(figsize=(15, 6))

plt.plot(test_time, y_test, 'k-', label='Actual Speed', linewidth=2, alpha=0.7)
plt.plot(test_time, linear_pred, '--', label='Linear Regression', alpha=0.7)

if ADVANCED_MODELS_AVAILABLE:
    plt.plot(test_time, rf_pred, '--', label='Random Forest', alpha=0.7)
    plt.plot(test_time, xgb_pred, '--', label='XGBoost', alpha=0.7)

plt.xlabel('Time Step', fontsize=12)
plt.ylabel('Speed (m/s)', fontsize=12)
plt.title('Model Predictions vs Actual Speed (Test Set)', fontsize=14, fontweight='bold')
plt.legend(loc='best')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Scatter plot: Actual vs Predicted
if ADVANCED_MODELS_AVAILABLE:
    fig, axes = plt.subplots(1, 3, figsize=(15, 5))
    
    for ax, pred, name in zip(axes, [linear_pred, rf_pred, xgb_pred], 
                               ['Linear', 'Random Forest', 'XGBoost']):
        ax.scatter(y_test, pred, alpha=0.5)
        ax.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
        ax.set_xlabel('Actual Speed (m/s)')
        ax.set_ylabel('Predicted Speed (m/s)')
        ax.set_title(f'{name}\nR² = {results[name if name != "Linear" else "Linear Regression"]["r2"]:.4f}')
        ax.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()

## 8. Cross-Validation

Robust evaluation using time-series cross-validation.
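
Time-series cross-validation differs from ordinary k-fold in that every test fold must come after its training data. One common scheme is an expanding window, sketched below. This illustrates the idea only; how `cross_validate` (or `time_series_split`) in `solar_regatta.ml` actually partitions the data is not shown in this notebook, and `expanding_window_splits` is a hypothetical name.

```python
import numpy as np

def expanding_window_splits(n_samples, n_splits=5):
    """Yield (train_idx, test_idx) pairs; each test fold follows its training data."""
    fold = n_samples // (n_splits + 1)
    for k in range(1, n_splits + 1):
        train_end = fold * k
        # The last fold absorbs any leftover samples
        test_end = n_samples if k == n_splits else train_end + fold
        yield np.arange(train_end), np.arange(train_end, test_end)

splits = list(expanding_window_splits(12, n_splits=3))
for train_idx, test_idx in splits:
    print("train:", train_idx.tolist(), "test:", test_idx.tolist())
```

Because the training window grows with each fold, early folds are trained on little data and tend to score worse, which is one reason to report the standard deviation alongside the mean.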

In [None]:
from solar_regatta.ml import cross_validate

if ADVANCED_MODELS_AVAILABLE:
    print("Performing 5-fold cross-validation...\n")
    
    # Cross-validate Random Forest
    cv_results = cross_validate(
        RandomForestSpeedModel,
        X, y,
        n_splits=5,
        n_estimators=50,
        max_depth=10,
        random_state=42
    )
    
    print("Cross-Validation Results (Random Forest):")
    print("="*70)
    print(f"R² Score:  {cv_results['r2_mean']:.4f} ± {cv_results['r2_std']:.4f}")
    print(f"RMSE:      {cv_results['rmse_mean']:.4f} ± {cv_results['rmse_std']:.4f}")
    print(f"MAE:       {cv_results['mae_mean']:.4f} ± {cv_results['mae_std']:.4f}")
    print("="*70)
else:
    print("⚠ Cross-validation requires advanced models")

## Summary

This notebook demonstrated:

✅ **Advanced Feature Engineering** - Created 60+ features from raw telemetry  
✅ **Multiple Models** - Compared Linear, Random Forest, and XGBoost  
✅ **Feature Importance** - Identified key performance drivers  
✅ **Proper Evaluation** - Time-series cross-validation for robust results  

### Key Takeaways:
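- Feature engineering can substantially improve model performance; compare against a raw-feature baseline to quantify the gain
- Tree-based models (RF, XGBoost) can outperform linear regression when relationships are nonlinear; check the comparison table for your data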
- Rolling statistics and lag features capture temporal dependencies
- Physics-based features provide domain knowledge to the model

### Next Steps:
1. Try with real VESC data
2. Experiment with different feature combinations
3. Tune hyperparameters for better performance
4. Deploy models for real-time prediction