# Basic Statistical Analysis with Monet Stats

This notebook demonstrates the basic statistical analysis capabilities of Monet Stats for atmospheric sciences. We'll compute error, correlation, and skill metrics that compare modeled temperatures against observations, the core tasks of model evaluation and verification.

In [1]:
# Import required libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Monet Stats: the metrics library demonstrated in this notebook
import monet_stats as ms

# Set up plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

## Load Example Dataset

We'll use the synthetic temperature dataset generated for this example.

In [2]:
# Load the example temperature dataset
temp_df = pd.read_csv('../data/temperature_obs_mod.csv')

# Display basic information about the dataset
print("Dataset shape:", temp_df.shape)
print("\nFirst few rows:")
print(temp_df.head())

print("\nDataset summary:")
print(temp_df.describe())

Dataset shape: (36530, 6)

First few rows:
         date station_id   latitude   longitude  observed_temp  modeled_temp
0  2010-01-01     STN001  48.807459 -104.292642       4.846258      2.405771
1  2010-01-01     STN002  41.618037  -71.091006       7.334191      4.291591
2  2010-01-01     STN003  44.480624  -77.625068       3.787901      2.090386
3  2010-01-01     STN004  37.405214 -102.395090       8.378226      5.821626
4  2010-01-01     STN005  30.463983  -90.015159       1.589425      0.320924

Dataset summary:
           latitude     longitude  observed_temp  modeled_temp
count  36530.000000  36530.000000   36530.000000  36530.000000
mean      38.944442    -84.798976      15.217608     13.173439
std        5.563585     13.353808       7.306792      7.250814
min       30.399022   -104.292642      -0.789027     -1.714732
25%       36.541071    -98.972585       8.431574      6.334459
50%       38.742505    -83.820113      15.220243     13.161681
75%       42.867920    -71.091006   
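
Before computing metrics, it can help to parse the `date` column as real timestamps and to check for missing values. The sketch below is a minimal illustration, not part of the original workflow: it inlines the five rows printed above (assuming the CSV header matches the printed column names) so it runs without the data file, and the names `csv_text` and `sample_df` are made up for this example.

```python
import io

import pandas as pd

# First rows of the dataset as printed above, inlined so the sketch runs
# without the CSV file. The header line is an assumption based on the
# column names shown in the output.
csv_text = """date,station_id,latitude,longitude,observed_temp,modeled_temp
2010-01-01,STN001,48.807459,-104.292642,4.846258,2.405771
2010-01-01,STN002,41.618037,-71.091006,7.334191,4.291591
2010-01-01,STN003,44.480624,-77.625068,3.787901,2.090386
2010-01-01,STN004,37.405214,-102.395090,8.378226,5.821626
2010-01-01,STN005,30.463983,-90.015159,1.589425,0.320924
"""

# parse_dates converts the date strings to datetime64 values on load
sample_df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

# Basic sanity checks before computing any metric
missing_counts = sample_df.isna().sum()
n_stations = sample_df["station_id"].nunique()

print(sample_df.dtypes)
print("Missing values per column:")
print(missing_counts)
print("Number of stations:", n_stations)
```

The same `parse_dates=["date"]` argument should work with the file path used above, provided the column is named `date` in the CSV.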

## Basic Error Metrics

Let's compute basic error metrics between observed and modeled temperatures.

In [None]:
# Extract observed and modeled temperatures
obs_temps = temp_df['observed_temp'].values
mod_temps = temp_df['modeled_temp'].values

# Compute basic error metrics
print("Basic Error Metrics:")
print(f"Mean Absolute Error (MAE): {ms.MAE(obs_temps, mod_temps):.3f}")
print(f"Root Mean Square Error (RMSE): {ms.RMSE(obs_temps, mod_temps):.3f}")
print(f"Mean Bias Error (MB): {ms.MB(obs_temps, mod_temps):.3f}")
print(f"Mean Absolute Percentage Error (MAPE): {ms.MAPE(obs_temps, mod_temps):.3f}%")
print(f"Mean Percentage Error (MPE): {ms.MPE(obs_temps, mod_temps):.3f}%")

# Calculate additional metrics
print("\nCorrelation Metrics:")
print(f"Pearson Correlation: {ms.pearson_correlation(obs_temps, mod_temps):.3f}")
print(f"Spearman Correlation: {ms.spearman_correlation(obs_temps, mod_temps):.3f}")
print(f"Coefficient of Determination (R²): {ms.R2(obs_temps, mod_temps):.3f}")

Basic Error Metrics:
Mean Absolute Error (MAE): 2.044
Root Mean Square Error (RMSE): 2.130


AttributeError: module 'monet_stats' has no attribute 'MBE'
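
As a cross-check that does not depend on any particular Monet Stats function name, the three basic error measures can be written directly in NumPy. This is a minimal sketch under one stated assumption: bias is taken as model minus observation, matching the `errors = mod_temps - obs_temps` convention used later in this notebook. Monet Stats may define the sign differently. The helper names `mae`, `rmse`, and `mean_bias` are illustrative, and the inputs are the five rows shown in the dataset preview.

```python
import numpy as np


def mae(obs, mod):
    """Mean absolute error: average magnitude of the model error."""
    obs, mod = np.asarray(obs, dtype=float), np.asarray(mod, dtype=float)
    return float(np.mean(np.abs(mod - obs)))


def rmse(obs, mod):
    """Root mean square error: penalizes large errors more than MAE."""
    obs, mod = np.asarray(obs, dtype=float), np.asarray(mod, dtype=float)
    return float(np.sqrt(np.mean((mod - obs) ** 2)))


def mean_bias(obs, mod):
    """Mean bias (assumed sign: model minus observation)."""
    obs, mod = np.asarray(obs, dtype=float), np.asarray(mod, dtype=float)
    return float(np.mean(mod - obs))


# The five observed/modeled pairs from the dataset preview
obs = np.array([4.846258, 7.334191, 3.787901, 8.378226, 1.589425])
mod = np.array([2.405771, 4.291591, 2.090386, 5.821626, 0.320924])

print(f"MAE:  {mae(obs, mod):.3f}")
print(f"RMSE: {rmse(obs, mod):.3f}")
print(f"MB:   {mean_bias(obs, mod):.3f}")
```

RMSE is never smaller than MAE, and when every error has the same sign (as in these five rows, where the model is always colder) the MAE equals the absolute value of the mean bias.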

## Skill Scores

Calculate skill scores relative to a reference forecast (e.g., climatology).

In [None]:
# Calculate skill scores
print("Skill Scores (relative to climatology reference):")

# Use overall mean as climatology reference
climatology = np.mean(obs_temps)

# Calculate metrics
rmse_model = ms.RMSE(obs_temps, mod_temps)
rmse_climo = ms.RMSE(obs_temps, np.full_like(obs_temps, climatology))

# Calculate skill score
ss_rmse = ms.SS(mse_model=rmse_model**2, mse_reference=rmse_climo**2)
print(f"RMSE Skill Score: {ss_rmse:.3f}")

# Nash-Sutcliffe Efficiency
nse = ms.NSE(obs_temps, mod_temps)
print(f"Nash-Sutcliffe Efficiency: {nse:.3f}")

# Modified Nash-Sutcliffe Efficiency
mnse = ms.mNSE(obs_temps, mod_temps)
print(f"Modified Nash-Sutcliffe Efficiency: {mnse:.3f}")

# Relative Nash-Sutcliffe Efficiency
rnse = ms.rNSE(obs_temps, mod_temps)
print(f"Relative Nash-Sutcliffe Efficiency: {rnse:.3f}")

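# Nash-Sutcliffe Efficiency (log)
# Caution: a log transform is undefined for values <= 0, and this dataset
# contains sub-zero temperatures (minimum observed: -0.789 °C), so this
# score may come out as NaN or raise a warning.
nse_log = ms.NSElog(obs_temps, mod_temps)
print(f"Nash-Sutcliffe Efficiency (log): {nse_log:.3f}")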

# Nash-Sutcliffe Efficiency (modified)
nse_m = ms.NSEm(obs_temps, mod_temps)
print(f"Nash-Sutcliffe Efficiency (modified): {nse_m:.3f}")
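
To make the link between the skill score and the Nash-Sutcliffe Efficiency concrete, the sketch below uses the textbook definitions SS = 1 - MSE_model / MSE_reference and NSE = 1 - Σ(obs - mod)² / Σ(obs - mean(obs))². With the mean of the observations as the reference forecast, the two are the same number. These are the standard formulas, not necessarily the exact conventions inside `ms.SS` or `ms.NSE`; the helper names and the synthetic data are assumptions made for illustration.

```python
import numpy as np


def mse(obs, pred):
    """Mean squared error between observations and a prediction."""
    obs, pred = np.asarray(obs, dtype=float), np.asarray(pred, dtype=float)
    return float(np.mean((pred - obs) ** 2))


def mse_skill_score(mse_model, mse_reference):
    """1 is a perfect forecast, 0 matches the reference, < 0 is worse."""
    return 1.0 - mse_model / mse_reference


def nse(obs, mod):
    """Nash-Sutcliffe Efficiency in its standard form."""
    obs, mod = np.asarray(obs, dtype=float), np.asarray(mod, dtype=float)
    return float(1.0 - np.sum((obs - mod) ** 2) / np.sum((obs - obs.mean()) ** 2))


# Synthetic stand-in data: a cold-biased, slightly noisy model
rng = np.random.default_rng(0)
obs = 15.0 + 7.0 * rng.standard_normal(500)
mod = obs - 2.0 + 0.5 * rng.standard_normal(500)

# Climatology reference: the observed mean repeated for every sample
climatology = np.full_like(obs, obs.mean())

ss = mse_skill_score(mse(obs, mod), mse(obs, climatology))
print(f"MSE skill score vs. climatology: {ss:.6f}")
print(f"NSE:                             {nse(obs, mod):.6f}")
```

Because the reference here is computed from the same observations it is scored against, the result is an in-sample skill score; a climatology built from an independent period would generally give a different value.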

## Visualization

Create visualizations to understand the model performance.

In [None]:
# Create a 2x2 diagnostic figure: scatter, time series, error histogram, Q-Q plot
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# Scatter plot
axes[0, 0].scatter(obs_temps[:1000], mod_temps[:1000], alpha=0.5, s=20)
axes[0, 0].plot([obs_temps.min(), obs_temps.max()], [obs_temps.min(), obs_temps.max()], 'r--', lw=2)
axes[0, 0].set_xlabel('Observed Temperature (°C)')
axes[0, 0].set_ylabel('Modeled Temperature (°C)')
axes[0, 0].set_title(f'Scatter Plot (R² = {ms.R2(obs_temps, mod_temps):.3f})')
axes[0, 0].grid(True, alpha=0.3)

# Time series for first station (dates parsed so matplotlib draws a real time axis)
first_station_data = temp_df[temp_df['station_id'] == temp_df['station_id'].iloc[0]]
first_year = first_station_data.iloc[:365]  # positional slice: first 365 records
first_year_dates = pd.to_datetime(first_year['date'])
axes[0, 1].plot(first_year_dates, first_year['observed_temp'], label='Observed', alpha=0.7)
axes[0, 1].plot(first_year_dates, first_year['modeled_temp'], label='Modeled', alpha=0.7)
axes[0, 1].set_xlabel('Date')
axes[0, 1].set_ylabel('Temperature (°C)')
axes[0, 1].set_title('Time Series Comparison (First Station, First Year)')
axes[0, 1].legend()
axes[0, 1].tick_params(axis='x', rotation=45)

# Histogram of errors
errors = mod_temps - obs_temps
axes[1, 0].hist(errors, bins=50, edgecolor='black', alpha=0.7)
axes[1, 0].set_xlabel('Model Error (°C)')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].set_title(f'Distribution of Errors (Mean: {np.mean(errors):.3f}, Std: {np.std(errors):.3f})')
axes[1, 0].grid(True, alpha=0.3)

# Q-Q plot to check normality of errors
from scipy import stats

stats.probplot(errors, dist="norm", plot=axes[1, 1])
axes[1, 1].set_title('Q-Q Plot of Errors')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
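
The Q-Q plot can also be summarized numerically. When `scipy.stats.probplot` is called without a `plot` argument, it returns the ordered quantiles together with a least-squares fit `(slope, intercept, r)`. For normally distributed errors, `r` is close to 1, the slope estimates the standard deviation, and the intercept estimates the mean. The sketch below runs on synthetic errors, not on the notebook's data, and the variable names are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic errors: one normal sample, one clearly skewed sample
normal_errors = rng.normal(loc=-2.0, scale=0.6, size=1000)
skewed_errors = rng.exponential(scale=1.0, size=1000)

# Without plot=..., probplot returns ((theoretical, ordered), (slope, intercept, r))
_, (slope_n, intercept_n, r_normal) = stats.probplot(normal_errors, dist="norm")
_, (_, _, r_skewed) = stats.probplot(skewed_errors, dist="norm")

print(f"Normal sample: r = {r_normal:.4f}, slope = {slope_n:.3f}, intercept = {intercept_n:.3f}")
print(f"Skewed sample: r = {r_skewed:.4f}")
```

A lower `r` for the skewed sample corresponds to the visible curvature away from the reference line in a Q-Q plot.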

## Station-wise Analysis

Analyze model performance by station.

In [None]:
# Calculate metrics for each station
stations = temp_df['station_id'].unique()
station_metrics = []

for station in stations[:5]:  # Analyze first 5 stations
    station_data = temp_df[temp_df['station_id'] == station]
    obs_station = station_data['observed_temp'].values
    mod_station = station_data['modeled_temp'].values

    station_metrics.append({
        'station': station,
        'MAE': ms.MAE(obs_station, mod_station),
        'RMSE': ms.RMSE(obs_station, mod_station),
        'R2': ms.R2(obs_station, mod_station),
        'Correlation': ms.pearson_correlation(obs_station, mod_station),
        'MB': ms.MB(obs_station, mod_station)  # ms.MBE does not exist; ms.MB is the mean bias call used earlier
    })

# Convert to DataFrame and display
metrics_df = pd.DataFrame(station_metrics)
print("Station-wise Performance Metrics:")
print(metrics_df)

# Visualize station performance
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

axes[0].bar(metrics_df['station'], metrics_df['RMSE'])
axes[0].set_xlabel('Station')
axes[0].set_ylabel('RMSE')
axes[0].set_title('RMSE by Station')
axes[0].tick_params(axis='x', rotation=45)

axes[1].bar(metrics_df['station'], metrics_df['R2'])
axes[1].set_xlabel('Station')
axes[1].set_ylabel('R²')
axes[1].set_title('R² by Station')
axes[1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()
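
The per-station loop can also be expressed with a pandas `groupby`, which scales to all stations without manual filtering. The following is a minimal sketch with metrics written in plain pandas and NumPy, so it does not rely on any Monet Stats function. The synthetic `demo_df`, the helper `station_summary`, and the model-minus-observation sign for the bias are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def station_summary(group):
    """Per-station error metrics (bias sign: model minus observation)."""
    err = group["modeled_temp"] - group["observed_temp"]
    return pd.Series({
        "MAE": err.abs().mean(),
        "RMSE": np.sqrt((err ** 2).mean()),
        "MB": err.mean(),
        "Correlation": group["observed_temp"].corr(group["modeled_temp"]),
    })


# Synthetic stand-in data: three stations with different model biases
rng = np.random.default_rng(1)
n_days = 100
frames = []
for i, bias in enumerate([-2.0, -1.0, 0.5], start=1):
    obs = 15.0 + 7.0 * rng.standard_normal(n_days)
    frames.append(pd.DataFrame({
        "station_id": f"STN{i:03d}",
        "observed_temp": obs,
        "modeled_temp": obs + bias + 0.3 * rng.standard_normal(n_days),
    }))
demo_df = pd.concat(frames, ignore_index=True)

# One row of metrics per station
summary = (
    demo_df.groupby("station_id")[["observed_temp", "modeled_temp"]]
    .apply(station_summary)
)
print(summary)
```

Applied to `temp_df`, the same pattern would return one row per `station_id`, which can be passed directly to the bar plots above.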

## Summary

This notebook demonstrated the basic statistical analysis capabilities of Monet Stats:

1. **Error Metrics**: MAE, RMSE, MB, MAPE, MPE
2. **Correlation Metrics**: Pearson, Spearman, R²
3. **Skill Scores**: NSE, mNSE, rNSE, NSElog, NSEm, SS
4. **Visualization**: Scatter plots, time series, error distributions, Q-Q plots
5. **Station-wise Analysis**: Performance by location

Together, these metrics describe the magnitude, sign, and distribution of model error in the temperature forecasts, both overall and per station.