# Part 2: ERSSTv6 Data Exploration

**Goal**: Download, combine, and explore the NOAA ERSSTv6 (Extended Reconstructed Sea Surface Temperature) dataset.

## About ERSSTv6

- **Source**: [NOAA NCEI](https://www.ncei.noaa.gov/data/sea-surface-temperature-extended-reconstructed/v6/access/)
- **Resolution**: 2° × 2° global grid, monthly
- **Time span**: 1850 – present
- **Variables**: `sst` (absolute SST) and `ssta` (SST anomaly relative to 1991–2020 climatology)
- **Format**: NetCDF, one file per month (`ersst.v6.YYYYMM.nc`)

In this notebook we will:
1. Download the monthly files (1950 onward) from NOAA
2. Combine them into a single dataset
3. Explore the data with visualizations
4. Save a processed version for ENSO prediction

In [None]:
import xarray as xr
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import urllib.request
from pathlib import Path

DATA_DIR = Path("../data/ersstv6")
DATA_DIR.mkdir(parents=True, exist_ok=True)
print(f"Data directory: {DATA_DIR.resolve()}")

## 1. Download ERSSTv6 Monthly Files

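Each file is ~130 KB. We download 1950 through `END_YEAR`, because observations are more reliable over this period: 76 years × 12 months = 912 files, roughly 120 MB in total. Earlier data (1850–1949) rests on sparse ship observations and carries higher uncertainty. Months that NOAA has not published yet will simply fail to download and are reported at the end.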
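
The download cell below writes each file straight to its final path, so an interrupted run (for example a killed kernel) can leave a truncated file that the `fpath.exists()` check would later treat as complete. One way to guard against this, shown here as a minimal sketch and not as part of the original workflow, is to download to a temporary name and rename only on success. The helper name `download_atomic` and the `.part` suffix are assumptions made for illustration.

```python
import os
import urllib.request
from pathlib import Path


def download_atomic(url: str, dest: Path) -> Path:
    """Download `url` to `dest` via a temporary '.part' file (hypothetical helper)."""
    dest = Path(dest)
    tmp = dest.with_name(dest.name + ".part")
    try:
        urllib.request.urlretrieve(url, tmp)
        # os.replace is atomic when tmp and dest are on the same filesystem
        os.replace(tmp, dest)
    finally:
        # After a failure, remove whatever partial file was left behind
        if tmp.exists():
            tmp.unlink()
    return dest
```

Swapping `urllib.request.urlretrieve(url, fpath)` for `download_atomic(url, fpath)` in the loop would keep the rest of the cell unchanged, since errors still propagate to the surrounding `try`/`except`.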

In [None]:
BASE_URL = "https://www.ncei.noaa.gov/data/sea-surface-temperature-extended-reconstructed/v6/access"
START_YEAR = 1950
END_YEAR = 2025

# Build list of files to download
files_to_download = []
for year in range(START_YEAR, END_YEAR + 1):
    for month in range(1, 13):
        fname = f"ersst.v6.{year}{month:02d}.nc"
        fpath = DATA_DIR / fname
        if not fpath.exists():
            files_to_download.append((f"{BASE_URL}/{fname}", fpath))

existing = (END_YEAR - START_YEAR + 1) * 12 - len(files_to_download)
print(f"Files already downloaded: {existing}")
print(f"Files to download: {len(files_to_download)}")

In [None]:
errors = []
for i, (url, fpath) in enumerate(files_to_download):
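    try:
        urllib.request.urlretrieve(url, fpath)
    except Exception as e:
        errors.append((url, str(e)))
        # Remove any partial file so the exists() check retries it next run
        fpath.unlink(missing_ok=True)
    
    # Print progress every 100 files (the count includes failed attempts)
    if (i + 1) % 100 == 0 or (i + 1) == len(files_to_download):
        print(f"  Processed {i+1}/{len(files_to_download)} files...")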

if errors:
    print(f"\n{len(errors)} files failed (probably future months):")
    for url, err in errors[-3:]:
        print(f"  {url.split('/')[-1]}: {err}")
else:
    print("All files downloaded successfully!")

## 2. Combine into a Single Dataset

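Rather than opening the files one by one, we pass a glob pattern to `xr.open_mfdataset`, which opens every matching file and combines them along `time`. We then call `.squeeze()` to drop any length-1 dimensions and `.load()` to read the full dataset into memory.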

In [None]:
ds = xr.open_mfdataset(f"{DATA_DIR}/ersst.v6.*.nc").squeeze().load()
ds

## 3. Explore the Data

### 3.1 Global SST Maps

In [None]:
sample_time = ds.time.values[-1]
time_str = str(sample_time)[:7]

fig, axes = plt.subplots(2, 1, figsize=(14, 8), subplot_kw={'projection': ccrs.PlateCarree(central_longitude=180)})

# SST
ds.sst.sel(time=sample_time).plot(
    ax=axes[0], cmap='coolwarm', 
    cbar_kwargs={'label': 'SST (°C)'},
    transform=ccrs.PlateCarree()
)
axes[0].set_title(f'Sea Surface Temperature — {time_str}')
gl = axes[0].gridlines(draw_labels=True, lw=0.5, ls='--')
gl.top_labels = False
gl.right_labels = False
axes[0].coastlines()

# SSTA
ds.ssta.sel(time=sample_time).plot(
    ax=axes[1], cmap='RdBu_r', vmin=-3, vmax=3,
    cbar_kwargs={'label': 'SST Anomaly (°C)'},
    transform=ccrs.PlateCarree()
)
axes[1].set_title(f'Sea Surface Temperature Anomaly — {time_str}')
gl = axes[1].gridlines(draw_labels=True, lw=0.5, ls='--')
gl.top_labels = False
gl.right_labels = False
axes[1].coastlines()

fig.tight_layout()

### 3.2 Time Series: Global Mean SST
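
On a regular latitude-longitude grid, cells shrink toward the poles, so a plain average over grid points gives high latitudes too much influence. The cell below corrects for this with cos(latitude) weights. As a rough NumPy-only illustration of the same idea, the sketch here compares a plain and a weighted mean on a made-up field; the grid (latitudes −88° to 88°, longitudes 0° to 358°, both in 2° steps), the `toy_sst` values, and the helper name are all assumptions, not ERSSTv6 data.

```python
import numpy as np


def area_weighted_mean(field, lat):
    """Mean of a (lat, lon) field with cos(latitude) weights, ignoring NaNs."""
    field = np.asarray(field, dtype=float)
    w = np.cos(np.deg2rad(np.asarray(lat, dtype=float)))[:, None]
    w = np.broadcast_to(w, field.shape)
    valid = ~np.isnan(field)
    # Only valid cells contribute to the numerator and to the weight total
    return float(np.sum(np.where(valid, field * w, 0.0)) / np.sum(w[valid]))


lat = np.arange(-88, 90, 2)   # assumed 2-degree grid, 89 latitudes
lon = np.arange(0, 360, 2)    # assumed 2-degree grid, 180 longitudes

# Toy field: warm at the equator, cold at the poles (not real data)
toy_sst = 28.0 * np.cos(np.deg2rad(lat))[:, None] * np.ones(lon.size)

print("plain mean:   ", toy_sst.mean())
print("weighted mean:", area_weighted_mean(toy_sst, lat))
```

The weighted mean comes out warmer than the plain mean because the cold polar rows, which cover little area, are weighted down.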

In [None]:
# Area-weighted global mean SST over time
weights = np.cos(np.deg2rad(ds.lat))
global_mean_sst = ds.sst.weighted(weights).mean(dim=['lat', 'lon'])

fig, ax = plt.subplots(figsize=(14, 4))
global_mean_sst.plot(ax=ax)
ax.set_title('Global Mean SST (Area-Weighted)')
ax.set_ylabel('SST (°C)')
ax.set_xlabel('Year')
fig.tight_layout()

### 3.3 The ENSO Region

The **Niño 3.4 region** (5°S–5°N, 170°W–120°W) is the key area for ENSO monitoring. Let's highlight it.
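
The plotting and selection code in this section uses longitudes in the 0–360° convention, while the region is usually quoted in degrees west. The conversion is a single modulo operation, sketched below; the helper name `to_0_360` is hypothetical.

```python
def to_0_360(lon_deg_east: float) -> float:
    """Map a longitude in the -180..180 convention onto 0..360 degrees east."""
    return lon_deg_east % 360


# Niño 3.4 bounds from the text: 170°W and 120°W are -170 and -120 degrees east
west_edge = to_0_360(-170)
east_edge = to_0_360(-120)
print(west_edge, east_edge, east_edge - west_edge)  # 190 240 50
```

These are the values used for the rectangle's corner (190) and width (50) and for the `lon=slice(190, 240)` selection later on.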

In [None]:
import matplotlib.patches as patches


fig, ax = plt.subplots(figsize=(14, 5), subplot_kw={'projection': ccrs.PlateCarree(central_longitude=180)})
ds.ssta.sel(time=sample_time).plot(
    ax=ax, cmap='RdBu_r', vmin=-3, vmax=3,
    cbar_kwargs={'label': 'SST Anomaly (°C)'},
    transform=ccrs.PlateCarree(),
)

# In 0-360° convention: 170°W = 190°, 120°W = 240°
rect = patches.Rectangle((190, -5), 50, 10, 
    linewidth=2.5, edgecolor='black', facecolor='none', linestyle='--', transform=ccrs.PlateCarree())
ax.add_patch(rect)
ax.text(215, 8, 'Niño 3.4', ha='center', fontsize=12, fontweight='bold',
        bbox=dict(boxstyle='round', facecolor='white', alpha=0.8), transform=ccrs.PlateCarree())

ax.set_title(f'SSTA with Niño 3.4 Region — {time_str}')

gl = ax.gridlines(draw_labels=True, alpha=0.5, linestyle='--')
gl.top_labels = False
gl.right_labels = False

ax.coastlines()

fig.tight_layout()

### 3.4 Niño 3.4 Index Time Series
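
The plot below shades months with an index of at least +0.5 °C as El Niño and months at or below −0.5 °C as La Niña. The same rule can be written as a small labelling function, sketched here on made-up values. This is only the per-month threshold used for the shading; operational definitions such as NOAA's Oceanic Niño Index apply a 3-month running mean and a persistence criterion, which this sketch does not attempt. The function name and the `toy_index` values are assumptions.

```python
import numpy as np


def classify_enso(index, threshold=0.5):
    """Label each monthly index value as El Niño, La Niña, or Neutral."""
    index = np.asarray(index, dtype=float)
    phases = np.full(index.shape, "Neutral", dtype=object)
    phases[index >= threshold] = "El Niño"
    phases[index <= -threshold] = "La Niña"
    return phases


toy_index = np.array([-1.2, -0.5, 0.0, 0.4, 0.5, 2.1])  # not real data
print(list(classify_enso(toy_index)))
# ['La Niña', 'La Niña', 'Neutral', 'Neutral', 'El Niño', 'El Niño']
```

With the real data, `classify_enso(nino34_index.values)` would give one label per month, which may be handy later as a classification target.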

In [None]:
# Compute Niño 3.4 index
nino34 = ds.ssta.sel(lat=slice(-5, 5), lon=slice(190, 240))
weights_nino = np.cos(np.deg2rad(nino34.lat))
nino34_index = nino34.weighted(weights_nino).mean(dim=['lat', 'lon'])

fig, ax = plt.subplots(figsize=(14, 4))
nino34_index.plot(ax=ax, color='black', linewidth=0.8)
ax.fill_between(nino34_index.time.values, nino34_index.values, 0.5,
                where=nino34_index.values >= 0.5, color='red', alpha=0.4, label='El Niño')
ax.fill_between(nino34_index.time.values, nino34_index.values, -0.5,
                where=nino34_index.values <= -0.5, color='blue', alpha=0.4, label='La Niña')
ax.axhline(0.5, color='red', linestyle='--', alpha=0.3)
ax.axhline(-0.5, color='blue', linestyle='--', alpha=0.3)
ax.axhline(0, color='gray', linewidth=0.5)
ax.set_title('Niño 3.4 Index (Monthly SSTA)')
ax.set_ylabel('SSTA (°C)')
ax.legend()
fig.tight_layout()

## 4. Save Processed Data

We save the combined dataset for use in Part 2B (ENSO prediction). This avoids re-downloading and re-processing each time.

In [None]:
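# Save the full combined dataset for later use
output_path = Path("../data/processed/ersstv6_combined.nc")
# to_netcdf does not create missing directories, so make sure the folder exists
output_path.parent.mkdir(parents=True, exist_ok=True)
ds.to_netcdf(output_path)
print(f"Saved combined dataset to: {output_path}")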