# Static Yaw Challenge - Exploratory Data Analysis (EDA)

**Objective**: Understand the dataset characteristics, identify patterns, and prepare for feature engineering.

**Dataset**: Aventa AV-7 (6 kW) wind turbine SCADA data with static yaw offsets (0°, 4°, 6°)

**Key Questions**:
1. What is the distribution of yaw offsets in the training data?
2. How do SCADA signals differ across yaw offsets?
3. What is the distribution shift between train and test data?
4. What features show the strongest signal for yaw detection?
5. How do power curves vary with yaw misalignment?

## 1. Setup and Data Loading

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)

# Configure plotting
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (14, 6)
plt.rcParams['font.size'] = 11

# Define paths
DATA_DIR = Path('../data/yaw_alignment_dataset')
TRAIN_PATH = DATA_DIR / 'train.parquet'
TEST_PATH = DATA_DIR / 'test.parquet'
FIGURES_DIR = Path('../reports/figures')
FIGURES_DIR.mkdir(parents=True, exist_ok=True)

print(f"Data directory: {DATA_DIR}")
print(f"Train file exists: {TRAIN_PATH.exists()}")
print(f"Test file exists: {TEST_PATH.exists()}")

In [None]:
# Load training data
print("Loading training data...")
df_train = pd.read_parquet(TRAIN_PATH)

# Load test data
print("Loading test data...")
df_test = pd.read_parquet(TEST_PATH)

print(f"\nTraining data shape: {df_train.shape}")
print(f"Test data shape: {df_test.shape}")
print(f"\nTraining memory usage: {df_train.memory_usage(deep=True).sum() / 1e6:.2f} MB")
print(f"Test memory usage: {df_test.memory_usage(deep=True).sum() / 1e6:.2f} MB")

## 2. Initial Data Inspection

In [None]:
# Training data overview
print("="*80)
print("TRAINING DATA")
print("="*80)
print("\nData Info:")
df_train.info()  # info() prints directly and returns None, so wrapping it in print() adds a stray "None"
print("\nFirst few rows:")
display(df_train.head())
print("\nStatistical Summary:")
display(df_train.describe())

In [None]:
# Test data overview
print("="*80)
print("TEST DATA")
print("="*80)
print("\nData Info:")
df_test.info()  # info() prints directly and returns None, so wrapping it in print() adds a stray "None"
print("\nFirst few rows:")
display(df_test.head())
print("\nStatistical Summary:")
display(df_test.describe())

In [None]:
# Check for missing values
print("Missing values in training data:")
print(df_train.isnull().sum())
print("\nMissing values in test data:")
print(df_test.isnull().sum())
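
Section 9 also mentions negative values and outliers, which the missing-value check does not cover. A minimal sketch of how they could be counted, assuming the listed columns are physically non-negative and using a 3×IQR fence as the outlier rule. The helper `count_anomalies`, the `iqr_k` parameter, and the toy frame are all hypothetical:

```python
import pandas as pd

def count_anomalies(df, nonneg_cols, iqr_k=3.0):
    """Count negative readings and IQR-fence outliers per column (hypothetical helper)."""
    rows = []
    for col in nonneg_cols:
        s = df[col].dropna()
        q1, q3 = s.quantile(0.25), s.quantile(0.75)
        iqr = q3 - q1
        # A value is an outlier if it lies more than iqr_k * IQR outside the quartiles
        is_outlier = (s < q1 - iqr_k * iqr) | (s > q3 + iqr_k * iqr)
        rows.append({'column': col,
                     'n_negative': int((s < 0).sum()),
                     'n_outliers': int(is_outlier.sum())})
    return pd.DataFrame(rows).set_index('column')

# Toy data for illustration only; replace with df_train / df_test
demo = pd.DataFrame({'wind_speed': [3.0, 4.0, 5.0, -0.2, 4.5, 60.0],
                     'rotor_speed': [30.0, 32.0, 31.0, 29.0, 33.0, 30.5]})
report = count_anomalies(demo, ['wind_speed', 'rotor_speed'])
print(report)
```

The fence width is a judgment call: a wider fence flags fewer points, which may suit heavy-tailed signals such as wind speed.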

## 3. Target Variable Analysis

In [None]:
# Yaw offset distribution
print("Yaw Offset Distribution:")
yaw_counts = df_train['yaw_offset'].value_counts().sort_index()
print(yaw_counts)
print("\nPercentages:")
print(df_train['yaw_offset'].value_counts(normalize=True).sort_index() * 100)

# Visualize target distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Bar plot
yaw_counts.plot(kind='bar', ax=axes[0], color=['#2ecc71', '#3498db', '#e74c3c'])
axes[0].set_xlabel('Yaw Offset (degrees)', fontsize=12)
axes[0].set_ylabel('Count', fontsize=12)
axes[0].set_title('Yaw Offset Distribution (Training Data)', fontsize=14, fontweight='bold')
axes[0].grid(True, alpha=0.3)
axes[0].set_xticklabels(axes[0].get_xticklabels(), rotation=0)

# Pie chart
axes[1].pie(yaw_counts, labels=[f"{int(x)}°" for x in yaw_counts.index],
            autopct='%1.1f%%', colors=['#2ecc71', '#3498db', '#e74c3c'],
            startangle=90)
axes[1].set_title('Yaw Offset Proportions', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.savefig(FIGURES_DIR / '01_yaw_offset_distribution.png', dpi=300, bbox_inches='tight')
plt.show()
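
If the classes turn out to be imbalanced, inverse-frequency weights are one common way to compensate during modeling. The sketch below uses the formula `n_samples / (n_classes * class_count)`, the same heuristic behind scikit-learn's `class_weight='balanced'`. The helper name and the toy counts are assumptions, not values from this dataset:

```python
import pandas as pd

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = pd.Series(labels).value_counts()
    return (len(labels) / (len(counts) * counts)).sort_index()

# Toy label counts for illustration; use df_train['yaw_offset'] in practice
demo_labels = [0] * 50 + [4] * 10 + [6] * 40
weights = balanced_class_weights(demo_labels)
print(weights)
```

The rarest class receives the largest weight, so errors on it cost more during training.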

## 4. Distribution Shift Analysis: Train vs Test

**Critical**: The test set contains a larger share of operating/production data than the training set, so the two are not identically distributed.

In [None]:
# Define operating conditions based on wind speed
def categorize_wind_speed(ws):
    """Categorize wind speed into operating regimes."""
    if ws < 3:
        return 'Below cut-in (<3 m/s)'
    elif ws < 12:
        return 'Operating (3-12 m/s)'
    else:
        return 'Above rated (>12 m/s)'

# Add wind speed category
df_train['wind_category'] = df_train['wind_speed'].apply(categorize_wind_speed)
df_test['wind_category'] = df_test['wind_speed'].apply(categorize_wind_speed)

# Compare distributions
print("Wind Speed Category Distribution:")
print("\nTraining Data:")
train_wind_dist = df_train['wind_category'].value_counts(normalize=True) * 100
print(train_wind_dist.sort_index())

print("\nTest Data:")
test_wind_dist = df_test['wind_category'].value_counts(normalize=True) * 100
print(test_wind_dist.sort_index())
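
Row-wise `.apply` can be slow on high-frequency SCADA data. A vectorized alternative with `pd.cut` is sketched below, reusing the same thresholds and labels as `categorize_wind_speed`. The names `WIND_EDGES`, `WIND_LABELS`, and `wind_category_vectorized` are hypothetical. One behavioral difference: `pd.cut` leaves missing wind speeds as NaN, whereas the function above falls through to the last branch for NaN input.

```python
import numpy as np
import pandas as pd

# Same regime boundaries and labels as categorize_wind_speed
WIND_EDGES = [-np.inf, 3, 12, np.inf]
WIND_LABELS = ['Below cut-in (<3 m/s)', 'Operating (3-12 m/s)', 'Above rated (>12 m/s)']

def wind_category_vectorized(wind_speed):
    """Vectorized regime labels; right=False gives [lo, hi) bins, matching the `<` tests."""
    return pd.cut(wind_speed, bins=WIND_EDGES, labels=WIND_LABELS, right=False)

# Toy values at and around the thresholds
demo = pd.Series([0.5, 3.0, 11.99, 12.0, np.nan])
cats = wind_category_vectorized(demo)
print(cats)
```

The result is a categorical column, so `value_counts()` keeps the regimes in their defined order and reports empty regimes as zero.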

In [None]:
# Visualize distribution shift
fig, axes = plt.subplots(2, 3, figsize=(16, 10))

# Wind speed distribution
axes[0, 0].hist(df_train['wind_speed'], bins=50, alpha=0.6, label='Train', color='blue', density=True)
axes[0, 0].hist(df_test['wind_speed'], bins=50, alpha=0.6, label='Test', color='orange', density=True)
axes[0, 0].set_xlabel('Wind Speed (m/s)')
axes[0, 0].set_ylabel('Density')
axes[0, 0].set_title('Wind Speed Distribution')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# Power output distribution
axes[0, 1].hist(df_train['power_output'], bins=50, alpha=0.6, label='Train', color='blue', density=True)
axes[0, 1].hist(df_test['power_output'], bins=50, alpha=0.6, label='Test', color='orange', density=True)
axes[0, 1].set_xlabel('Power Output (kW)')
axes[0, 1].set_ylabel('Density')
axes[0, 1].set_title('Power Output Distribution')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# Rotor speed distribution
axes[0, 2].hist(df_train['rotor_speed'], bins=50, alpha=0.6, label='Train', color='blue', density=True)
axes[0, 2].hist(df_test['rotor_speed'], bins=50, alpha=0.6, label='Test', color='orange', density=True)
axes[0, 2].set_xlabel('Rotor Speed (RPM)')
axes[0, 2].set_ylabel('Density')
axes[0, 2].set_title('Rotor Speed Distribution')
axes[0, 2].legend()
axes[0, 2].grid(True, alpha=0.3)

# Turbine status distribution
train_status = df_train['turbine_status'].value_counts(normalize=True).sort_index() * 100
test_status = df_test['turbine_status'].value_counts(normalize=True).sort_index() * 100
status_comparison = pd.DataFrame({'Train': train_status, 'Test': test_status}).fillna(0)
status_comparison.plot(kind='bar', ax=axes[1, 0], color=['blue', 'orange'], alpha=0.7)
axes[1, 0].set_xlabel('Turbine Status')
axes[1, 0].set_ylabel('Percentage (%)')
axes[1, 0].set_title('Turbine Status Distribution')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)

# Wind category comparison
wind_cat_comparison = pd.DataFrame({
    'Train': df_train['wind_category'].value_counts(normalize=True) * 100,
    'Test': df_test['wind_category'].value_counts(normalize=True) * 100
}).fillna(0)
wind_cat_comparison.plot(kind='bar', ax=axes[1, 1], color=['blue', 'orange'], alpha=0.7)
axes[1, 1].set_xlabel('Wind Speed Category')
axes[1, 1].set_ylabel('Percentage (%)')
axes[1, 1].set_title('Operating Regime Distribution')
axes[1, 1].legend()
axes[1, 1].set_xticklabels(axes[1, 1].get_xticklabels(), rotation=45, ha='right')
axes[1, 1].grid(True, alpha=0.3)

# Generator temperature
axes[1, 2].hist(df_train['generator_temperature'], bins=50, alpha=0.6, label='Train', color='blue', density=True)
axes[1, 2].hist(df_test['generator_temperature'], bins=50, alpha=0.6, label='Test', color='orange', density=True)
axes[1, 2].set_xlabel('Generator Temperature (°C)')
axes[1, 2].set_ylabel('Density')
axes[1, 2].set_title('Generator Temperature Distribution')
axes[1, 2].legend()
axes[1, 2].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig(FIGURES_DIR / '02_train_test_distribution_shift.png', dpi=300, bbox_inches='tight')
plt.show()
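
One way to act on a regime shift like this is to reweight training rows so that their regime shares match the test set. The sketch below computes per-row weights as the ratio of test share to train share for each category. The helper `regime_matching_weights` and the toy shares are assumptions for illustration:

```python
import pandas as pd

def regime_matching_weights(train_cat, test_cat):
    """Per-row train weights so that weighted category shares match the test shares."""
    train_share = train_cat.value_counts(normalize=True)
    test_share = test_cat.value_counts(normalize=True)
    # Categories absent from the test set get weight 0
    ratio = (test_share / train_share).reindex(train_share.index).fillna(0.0)
    return train_cat.map(ratio).astype(float)

# Toy regimes: train is 60% low wind, test is only 20% low wind
train_demo = pd.Series(['low'] * 60 + ['operating'] * 40)
test_demo = pd.Series(['low'] * 20 + ['operating'] * 80)
w = regime_matching_weights(train_demo, test_demo)
weighted_share = w.groupby(train_demo).sum() / w.sum()
print(weighted_share)
```

In the notebook this could be called with `df_train['wind_category']` and `df_test['wind_category']`, and the weights passed as sample weights or used to resample a validation set.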

In [None]:
# Statistical comparison of key features
features = ['rotor_speed', 'generator_speed', 'generator_temperature', 
            'wind_speed', 'power_output', 'relative_wind_direction']

comparison_stats = pd.DataFrame({
    'Train Mean': df_train[features].mean(),
    'Test Mean': df_test[features].mean(),
    'Train Std': df_train[features].std(),
    'Test Std': df_test[features].std(),
    'Shift %': ((df_test[features].mean() - df_train[features].mean()) / df_train[features].mean() * 100)
})

print("\nFeature Statistics Comparison (Train vs Test):")
print(comparison_stats.round(2))
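
The `Shift %` column divides by the training mean, so it becomes unstable for signals whose mean is close to zero, which may be the case for `relative_wind_direction`. A standardized mean difference (difference in means divided by the pooled standard deviation) avoids that. The following is a minimal sketch on synthetic data; the helper name and the simulated distributions are assumptions:

```python
import numpy as np
import pandas as pd

def standardized_mean_difference(train, test, cols):
    """(test mean - train mean) / pooled std, per column (hypothetical helper)."""
    pooled_std = np.sqrt((train[cols].var() + test[cols].var()) / 2)
    return (test[cols].mean() - train[cols].mean()) / pooled_std

# Synthetic data: wind speed shifts by one standard deviation, direction does not shift
rng = np.random.default_rng(42)
train_demo = pd.DataFrame({'wind_speed': rng.normal(5.0, 2.0, 5000),
                           'relative_wind_direction': rng.normal(0.0, 10.0, 5000)})
test_demo = pd.DataFrame({'wind_speed': rng.normal(7.0, 2.0, 5000),
                          'relative_wind_direction': rng.normal(0.0, 10.0, 5000)})
smd = standardized_mean_difference(train_demo, test_demo, list(train_demo.columns))
print(smd.round(2))
```

Values are in units of standard deviations, so features can be compared directly regardless of their scale or sign.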

## 5. Power Curve Analysis by Yaw Offset

**Key Hypothesis**: Yaw misalignment causes power loss
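
To get a feel for the size of the effect, a common first-order approximation models power under a yaw misalignment γ as P(γ) ≈ P(0)·cos^p(γ). The exponent p = 3 used below is an assumption (lower values are also reported in practice), and nothing in this dataset confirms it. The sketch only shows what the approximation would predict for the three offsets:

```python
import math

def expected_power_loss_pct(yaw_deg, exponent=3.0):
    """Power loss in percent under the cos^p approximation (the exponent is an assumption)."""
    return (1.0 - math.cos(math.radians(yaw_deg)) ** exponent) * 100.0

for yaw in (0, 4, 6):
    print(f"{yaw}° offset: ~{expected_power_loss_pct(yaw):.2f}% expected loss")
```

Under this approximation the losses are roughly 0.7% at 4° and 1.6% at 6°. That is small compared with the scatter of raw power readings, which is why the analysis below relies on binned averages rather than individual points.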

In [None]:
# Filter for production mode (status 10) to see clearest yaw effect
df_production = df_train[df_train['turbine_status'] == 10].copy()

print(f"Production mode data: {len(df_production):,} rows ({len(df_production)/len(df_train)*100:.1f}%)")
print(f"\nYaw offset distribution in production mode:")
print(df_production['yaw_offset'].value_counts().sort_index())

In [None]:
# Power curves by yaw offset
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# 1. Power vs Wind Speed by Yaw Offset (scatter)
for yaw in sorted(df_production['yaw_offset'].unique()):
    subset = df_production[df_production['yaw_offset'] == yaw]
    # Sample for visualization (too many points)
    sample = subset.sample(min(10000, len(subset)), random_state=RANDOM_SEED)
    axes[0, 0].scatter(sample['wind_speed'], sample['power_output'], 
                      alpha=0.1, s=1, label=f'{int(yaw)}°')

axes[0, 0].set_xlabel('Wind Speed (m/s)', fontsize=12)
axes[0, 0].set_ylabel('Power Output (kW)', fontsize=12)
axes[0, 0].set_title('Power Curve by Yaw Offset (Production Mode)', fontsize=14, fontweight='bold')
axes[0, 0].legend(title='Yaw Offset')
axes[0, 0].grid(True, alpha=0.3)
axes[0, 0].set_xlim(0, 20)
axes[0, 0].set_ylim(-1, 7)

# 2. Binned power curves (mean)
wind_bins = np.arange(0, 20.5, 0.5)  # 0.5 m/s bins; upper edge at 20 so 19.5-20 m/s is included
for yaw in sorted(df_production['yaw_offset'].unique()):
    subset = df_production[df_production['yaw_offset'] == yaw].copy()
    subset['wind_bin'] = pd.cut(subset['wind_speed'], bins=wind_bins)
    binned_power = subset.groupby('wind_bin')['power_output'].mean()
    bin_centers = [interval.mid for interval in binned_power.index]
    axes[0, 1].plot(bin_centers, binned_power.values, marker='o', 
                   linewidth=2, markersize=4, label=f'{int(yaw)}°')

axes[0, 1].set_xlabel('Wind Speed (m/s)', fontsize=12)
axes[0, 1].set_ylabel('Mean Power Output (kW)', fontsize=12)
axes[0, 1].set_title('Binned Power Curves (Mean)', fontsize=14, fontweight='bold')
axes[0, 1].legend(title='Yaw Offset')
axes[0, 1].grid(True, alpha=0.3)
axes[0, 1].set_xlim(0, 20)

# 3. Power loss analysis
# Calculate power loss relative to 0° yaw
power_0deg = df_production[df_production['yaw_offset'] == 0].copy()
power_0deg['wind_bin'] = pd.cut(power_0deg['wind_speed'], bins=wind_bins)
baseline_power = power_0deg.groupby('wind_bin')['power_output'].mean()

for yaw in [4, 6]:
    subset = df_production[df_production['yaw_offset'] == yaw].copy()
    subset['wind_bin'] = pd.cut(subset['wind_speed'], bins=wind_bins)
    yaw_power = subset.groupby('wind_bin')['power_output'].mean()
    power_loss_pct = ((baseline_power - yaw_power) / baseline_power * 100).dropna()
    bin_centers = [interval.mid for interval in power_loss_pct.index]
    axes[1, 0].plot(bin_centers, power_loss_pct.values, marker='o', 
                   linewidth=2, markersize=4, label=f'{int(yaw)}°')

axes[1, 0].set_xlabel('Wind Speed (m/s)', fontsize=12)
axes[1, 0].set_ylabel('Power Loss (%)', fontsize=12)
axes[1, 0].set_title('Power Loss Relative to 0° Yaw', fontsize=14, fontweight='bold')
axes[1, 0].legend(title='Yaw Offset')
axes[1, 0].grid(True, alpha=0.3)
axes[1, 0].axhline(y=0, color='black', linestyle='--', linewidth=1)

# 4. Box plot of power by yaw offset (filtered for operating range)
operating_data = df_production[(df_production['wind_speed'] >= 5) & 
                               (df_production['wind_speed'] <= 12)].copy()
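# Cast to int first so labels read "0°" rather than "0.0°" when yaw_offset is stored as float
operating_data['yaw_offset_str'] = operating_data['yaw_offset'].astype(int).astype(str) + '°'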

sns.boxplot(data=operating_data, x='yaw_offset_str', y='power_output', ax=axes[1, 1])
axes[1, 1].set_xlabel('Yaw Offset', fontsize=12)
axes[1, 1].set_ylabel('Power Output (kW)', fontsize=12)
axes[1, 1].set_title('Power Distribution by Yaw (5-12 m/s wind)', fontsize=14, fontweight='bold')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig(FIGURES_DIR / '03_power_curve_analysis.png', dpi=300, bbox_inches='tight')
plt.show()
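
The binned curves above suggest a direct feature: the difference between measured power and the power expected from a 0° baseline curve at the same wind speed. The sketch below fits a binned baseline and interpolates it with `np.interp`. The helper names, the toy power curve, and the assumed cos³ loss are all illustrative and not taken from the dataset:

```python
import numpy as np

def fit_binned_power_curve(wind_speed, power, bin_width=0.5, max_ws=20.0):
    """Return (bin_centers, mean_power) for every non-empty wind-speed bin."""
    edges = np.arange(0.0, max_ws + bin_width, bin_width)
    bin_idx = np.digitize(wind_speed, edges) - 1
    centers, means = [], []
    for i in range(len(edges) - 1):
        in_bin = bin_idx == i
        if in_bin.any():
            centers.append((edges[i] + edges[i + 1]) / 2)
            means.append(power[in_bin].mean())
    return np.array(centers), np.array(means)

def power_residual(wind_speed, power, centers, means):
    """Actual minus expected power; expected is interpolated from the baseline curve."""
    return power - np.interp(wind_speed, centers, means)

def toy_power(ws):
    # Toy cubic curve capped at 6 kW; NOT the real AV-7 power curve
    return np.minimum(6.0, 6.0 * (ws / 12.0) ** 3)

rng = np.random.default_rng(42)

# Baseline: simulated 0° data with measurement noise
ws_base = rng.uniform(3.0, 12.0, 20000)
p_base = toy_power(ws_base) + rng.normal(0, 0.05, ws_base.size)
centers, means = fit_binned_power_curve(ws_base, p_base)

# Misaligned: same toy curve scaled by an assumed cos^3 loss at 6 degrees
ws_yawed = rng.uniform(3.0, 12.0, 20000)
loss_factor = np.cos(np.radians(6.0)) ** 3
p_yawed = toy_power(ws_yawed) * loss_factor + rng.normal(0, 0.05, ws_yawed.size)

residual = power_residual(ws_yawed, p_yawed, centers, means)
print(f"Mean residual at 6° (toy data): {residual.mean():.3f} kW")
```

On real data the baseline would be fitted on 0° production rows only, and the residual averaged per segment so that noise in individual readings cancels out.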

## 6. Feature Correlation Analysis

In [None]:
# Correlation with yaw offset (production mode only)
numeric_cols = ['rotor_speed', 'generator_speed', 'generator_temperature', 
                'wind_speed', 'power_output', 'relative_wind_direction',
                'supply_voltage', 'blade_pitch_deg', 'turbine_status', 'yaw_offset']

correlation_matrix = df_production[numeric_cols].corr()

# Plot correlation heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, fmt='.3f', cmap='coolwarm', 
            center=0, square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Feature Correlation Matrix (Production Mode)', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig(FIGURES_DIR / '04_correlation_matrix.png', dpi=300, bbox_inches='tight')
plt.show()

# Print correlations with yaw_offset
print("\nCorrelation with Yaw Offset (sorted by absolute value):")
yaw_corr = correlation_matrix['yaw_offset'].drop('yaw_offset').abs().sort_values(ascending=False)
print(yaw_corr)
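
Pearson correlation only captures linear association, and the target here takes just three values. A feature whose mean differs across offsets in a non-monotonic way can still show a correlation of zero. The correlation ratio (eta squared), the share of a feature's variance explained by class membership, is one complementary measure. A minimal sketch on toy data built so that Pearson is zero; the helper name and the data are assumptions:

```python
import numpy as np
import pandas as pd

def correlation_ratio(values, groups):
    """Eta squared: between-group sum of squares divided by total sum of squares."""
    df = pd.DataFrame({'value': values, 'group': groups}).dropna()
    grand_mean = df['value'].mean()
    grouped = df.groupby('group')['value']
    ss_between = (grouped.size() * (grouped.mean() - grand_mean) ** 2).sum()
    ss_total = ((df['value'] - grand_mean) ** 2).sum()
    return float(ss_between / ss_total) if ss_total > 0 else 0.0

# Toy signal: class means differ strongly, but not linearly in the offset
yaw = np.array([0, 0, 4, 4, 6, 6])
signal = np.array([-0.1, 0.1, -1.1, -0.9, 0.15, 0.35])

pearson = np.corrcoef(yaw, signal)[0, 1]
eta_sq = correlation_ratio(signal, yaw)
print(f"Pearson r = {pearson:.3f}, eta squared = {eta_sq:.3f}")
```

In the notebook it could be applied per feature, for example `correlation_ratio(df_production['power_output'], df_production['yaw_offset'])`.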

## 7. Relative Wind Direction Analysis

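**Expected**: Relative wind direction should reveal yaw misalignment directly. **Observed**: its correlation with yaw offset is near zero. One plausible explanation (an assumption, not verified here) is that the yaw controller steers toward a zero *measured* direction, so a static offset in that measurement leaves the signal's mean unchanged.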

In [None]:
# Wind direction analysis by yaw offset
fig, axes = plt.subplots(2, 2, figsize=(16, 10))

# Distribution of relative wind direction by yaw
for yaw in sorted(df_production['yaw_offset'].unique()):
    subset = df_production[df_production['yaw_offset'] == yaw]['relative_wind_direction']
    axes[0, 0].hist(subset, bins=50, alpha=0.5, label=f'{int(yaw)}°', density=True)

axes[0, 0].set_xlabel('Relative Wind Direction (degrees)', fontsize=12)
axes[0, 0].set_ylabel('Density', fontsize=12)
axes[0, 0].set_title('Relative Wind Direction Distribution', fontsize=14, fontweight='bold')
axes[0, 0].legend(title='Yaw Offset')
axes[0, 0].grid(True, alpha=0.3)

# Box plot
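# Cast to int first so labels read "0°" rather than "0.0°" when yaw_offset is stored as float
df_production['yaw_str'] = df_production['yaw_offset'].astype(int).astype(str) + '°'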
sns.boxplot(data=df_production, x='yaw_str', y='relative_wind_direction', ax=axes[0, 1])
axes[0, 1].set_xlabel('Yaw Offset', fontsize=12)
axes[0, 1].set_ylabel('Relative Wind Direction (degrees)', fontsize=12)
axes[0, 1].set_title('Relative Wind Direction by Yaw Offset', fontsize=14, fontweight='bold')
axes[0, 1].grid(True, alpha=0.3)

# Scatter: wind direction vs power by yaw
for yaw in sorted(df_production['yaw_offset'].unique()):
    subset = df_production[df_production['yaw_offset'] == yaw]
    sample = subset.sample(min(5000, len(subset)), random_state=RANDOM_SEED)
    axes[1, 0].scatter(sample['relative_wind_direction'], sample['power_output'],
                      alpha=0.2, s=1, label=f'{int(yaw)}°')

axes[1, 0].set_xlabel('Relative Wind Direction (degrees)', fontsize=12)
axes[1, 0].set_ylabel('Power Output (kW)', fontsize=12)
axes[1, 0].set_title('Power vs Wind Direction by Yaw', fontsize=14, fontweight='bold')
axes[1, 0].legend(title='Yaw Offset')
axes[1, 0].grid(True, alpha=0.3)

# Statistics by yaw offset
wind_dir_stats = df_production.groupby('yaw_offset')['relative_wind_direction'].agg(['mean', 'std', 'median'])
wind_dir_stats.plot(kind='bar', ax=axes[1, 1], alpha=0.7)
axes[1, 1].set_xlabel('Yaw Offset (degrees)', fontsize=12)
axes[1, 1].set_ylabel('Value (degrees)', fontsize=12)
axes[1, 1].set_title('Wind Direction Statistics by Yaw', fontsize=14, fontweight='bold')
axes[1, 1].legend(['Mean', 'Std Dev', 'Median'])
axes[1, 1].grid(True, alpha=0.3)
axes[1, 1].set_xticklabels(axes[1, 1].get_xticklabels(), rotation=0)

plt.tight_layout()
plt.savefig(FIGURES_DIR / '05_wind_direction_analysis.png', dpi=300, bbox_inches='tight')
plt.show()

print("\nWind Direction Statistics by Yaw Offset:")
print(wind_dir_stats)
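
The statistics above use arithmetic means, which is fine for angles clustered around zero on a signed scale. If `relative_wind_direction` were recorded on a 0-360° scale (an assumption; the notebook does not state the convention), readings on either side of the wrap-around point would average to a misleading value. A circular mean avoids this. The helper below is a hypothetical sketch:

```python
import numpy as np

def circular_mean_deg(angles_deg):
    """Mean direction in degrees (between -180 and 180), robust to wrap-around."""
    rad = np.radians(np.asarray(angles_deg, dtype=float))
    # Average the unit vectors, then convert the resulting vector back to an angle
    return np.degrees(np.arctan2(np.sin(rad).mean(), np.cos(rad).mean()))

angles = [350.0, 10.0]  # two readings on either side of the wrap-around point
arithmetic = np.mean(angles)          # 180.0, the opposite direction
circular = circular_mean_deg(angles)  # approximately 0
print(arithmetic, circular)
```

It can be passed to the group-by directly, for example `.agg(circular_mean_deg)` in place of `'mean'`.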

## 8. Test Data Segment Analysis

In [None]:
# Analyze test segments
print(f"Number of unique segments: {df_test['segment_id'].nunique()}")
print(f"\nSegment size statistics:")
segment_sizes = df_test.groupby('segment_id').size()
print(segment_sizes.describe())

# Segment characteristics
segment_stats = df_test.groupby('segment_id').agg({
    'wind_speed': ['mean', 'std'],
    'power_output': ['mean', 'std'],
    'turbine_status': lambda x: x.mode()[0] if len(x.mode()) > 0 else x.iloc[0]
}).reset_index()
segment_stats.columns = ['segment_id', 'wind_speed_mean', 'wind_speed_std', 
                         'power_output_mean', 'power_output_std', 'dominant_status']

print("\nSegment Statistics Summary:")
print(segment_stats.describe())

In [None]:
# Visualize segment characteristics
fig, axes = plt.subplots(2, 2, figsize=(16, 10))

# Segment mean wind speed distribution
axes[0, 0].hist(segment_stats['wind_speed_mean'], bins=30, color='skyblue', edgecolor='black')
axes[0, 0].set_xlabel('Mean Wind Speed (m/s)', fontsize=12)
axes[0, 0].set_ylabel('Number of Segments', fontsize=12)
axes[0, 0].set_title('Distribution of Segment Mean Wind Speed', fontsize=14, fontweight='bold')
axes[0, 0].grid(True, alpha=0.3)

# Segment mean power distribution
axes[0, 1].hist(segment_stats['power_output_mean'], bins=30, color='lightcoral', edgecolor='black')
axes[0, 1].set_xlabel('Mean Power Output (kW)', fontsize=12)
axes[0, 1].set_ylabel('Number of Segments', fontsize=12)
axes[0, 1].set_title('Distribution of Segment Mean Power', fontsize=14, fontweight='bold')
axes[0, 1].grid(True, alpha=0.3)

# Dominant status distribution
status_dist = segment_stats['dominant_status'].value_counts()
status_dist.plot(kind='bar', ax=axes[1, 0], color='lightgreen', edgecolor='black')
axes[1, 0].set_xlabel('Dominant Turbine Status', fontsize=12)
axes[1, 0].set_ylabel('Number of Segments', fontsize=12)
axes[1, 0].set_title('Dominant Status per Segment', fontsize=14, fontweight='bold')
axes[1, 0].grid(True, alpha=0.3)

# Scatter: mean wind vs mean power (colored by dominant status)
for status in segment_stats['dominant_status'].unique():
    subset = segment_stats[segment_stats['dominant_status'] == status]
    axes[1, 1].scatter(subset['wind_speed_mean'], subset['power_output_mean'],
                      alpha=0.6, label=f'Status {int(status)}', s=30)

axes[1, 1].set_xlabel('Mean Wind Speed (m/s)', fontsize=12)
axes[1, 1].set_ylabel('Mean Power Output (kW)', fontsize=12)
axes[1, 1].set_title('Segment Characteristics', fontsize=14, fontweight='bold')
axes[1, 1].legend(title='Dominant Status')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig(FIGURES_DIR / '06_segment_analysis.png', dpi=300, bbox_inches='tight')
plt.show()

## 9. Key Findings Summary

In [None]:
print("="*80)
print("KEY FINDINGS FROM EXPLORATORY DATA ANALYSIS")
print("="*80)

print("\n1. TARGET DISTRIBUTION:")
target_counts = df_train['yaw_offset'].value_counts().sort_index()
for yaw, count in target_counts.items():
    # Flag the least frequent class instead of hard-coding which one it is
    flag = " - UNDERREPRESENTED" if yaw == target_counts.idxmin() else ""
    print(f"   - {int(yaw)}° yaw: {count:,} ({count / len(df_train) * 100:.1f}%){flag}")

print("\n2. DISTRIBUTION SHIFT (Train → Test):")
print(f"   - Low wind (<3 m/s): {train_wind_dist['Below cut-in (<3 m/s)']:.1f}% → {test_wind_dist['Below cut-in (<3 m/s)']:.1f}%")
print(f"   - Operating (3-12 m/s): {train_wind_dist['Operating (3-12 m/s)']:.1f}% → {test_wind_dist['Operating (3-12 m/s)']:.1f}% (SIGNIFICANT INCREASE)")
# The "+" format flag prints the sign explicitly, so a decrease shows as "-x%" rather than "+-x%"
print(f"   - Mean wind speed: {df_train['wind_speed'].mean():.2f} → {df_test['wind_speed'].mean():.2f} m/s ({(df_test['wind_speed'].mean()-df_train['wind_speed'].mean())/df_train['wind_speed'].mean()*100:+.1f}%)")
print(f"   - Mean power: {df_train['power_output'].mean():.2f} → {df_test['power_output'].mean():.2f} kW ({(df_test['power_output'].mean()-df_train['power_output'].mean())/df_train['power_output'].mean()*100:+.1f}%)")

print("\n3. FEATURE CORRELATIONS WITH YAW OFFSET (Production Mode):")
for feature, corr in yaw_corr.head(5).items():
    print(f"   - {feature}: {correlation_matrix.loc[feature, 'yaw_offset']:.4f}")

print("\n4. POWER LOSS PATTERN:")
for yaw in [0, 4, 6]:
    mean_power = operating_data[operating_data['yaw_offset']==yaw]['power_output'].mean()
    print(f"   - {int(yaw)}° yaw: Mean power = {mean_power:.2f} kW (5-12 m/s wind)")

print("\n5. RELATIVE WIND DIRECTION:")
print(f"   - Does NOT directly show yaw offset (correlation = {correlation_matrix.loc['relative_wind_direction', 'yaw_offset']:.4f})")
print("   - Yaw misalignment must be inferred from power loss patterns")

print("\n6. TEST SEGMENTS:")
print(f"   - {df_test['segment_id'].nunique()} segments of ~1 hour each")
print(f"   - Segment size: {segment_sizes.mean():.0f} ± {segment_sizes.std():.0f} rows")
print(f"   - Production-dominant segments: {len(segment_stats[segment_stats['dominant_status']==10])}")
print(f"   - Standby-dominant segments: {len(segment_stats[segment_stats['dominant_status']==9])}")

print("\n7. DATA QUALITY:")
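n_missing = int(df_train.isnull().sum().sum() + df_test.isnull().sum().sum())
print(f"   - Missing values (train + test): {n_missing:,}")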
print("   - Clean, consistent 1 Hz sampling")
print("   - Minor anomalies detected (negative values, outliers)")

print("\n" + "="*80)

## 10. Next Steps

Based on this EDA, the recommended next steps are:

1. **Feature Engineering** (Notebook 02):
   - Power loss features (expected vs actual)
   - Segment-level aggregations (mean, std, percentiles)
   - Temporal features (rolling windows, rate of change)
   - Wind direction variability metrics
   - Filter/focus on production mode data

2. **Validation Strategy** (Module):
   - Create validation set matching test distribution
   - Stratify by yaw offset and operating conditions
   - Segment-level evaluation metrics

3. **Baseline Modeling** (Notebook 03):
   - Random Forest on segment aggregates
   - Handle class imbalance (4° underrepresented)
   - Establish performance benchmarks

4. **Advanced Modeling** (Notebook 04):
   - Gradient boosting (LightGBM/XGBoost)
   - Two-stage approach (row → segment)
   - Ensemble methods
   - Strategy for mystery offset generalization
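
The two-stage approach listed above (row → segment) needs a rule for collapsing row-level predictions into one label per segment. A majority vote is the simplest option and is sketched below. The helper name and the toy predictions are hypothetical, and nothing here reflects an actual model:

```python
import pandas as pd

def segment_majority_vote(row_preds, segment_ids):
    """Most frequent row-level prediction per segment (ties resolve to the smallest label)."""
    df = pd.DataFrame({'segment_id': segment_ids, 'pred': row_preds})
    # mode() returns its values sorted, so iloc[0] is deterministic under ties
    return df.groupby('segment_id')['pred'].agg(lambda s: s.mode().iloc[0])

# Toy row-level predictions for three segments
segment_preds = segment_majority_vote(
    row_preds=[0, 0, 4, 6, 6, 6, 4, 4],
    segment_ids=['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c'])
print(segment_preds)
```

A vote can only return offsets seen in training. If unseen offsets need to be handled, averaging a continuous row-level estimate per segment may be a better fit than voting.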