# MERRA-2 Daily Processing Pipeline

This notebook processes MERRA-2 hourly data day-by-day for the US Lower 48 states.

## What this does:
1. Loops through dates from 1984-01-01 to 2025-12-31
2. For each day:
   - Checks if file already exists (skips if yes)
   - Downloads MERRA-2 M2T1NXSLV data
   - Extracts US Lower 48 region
   - Calculates 2m temperature (°C) and VPD (kPa)
   - Saves to `daily_data/merra2_us_YYYYMMDD.nc`
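
The VPD step can be sketched as follows. This is a minimal illustration, not the code in `merra2_processing`: it assumes VPD is derived from the collection's 2 m air temperature (`T2M`) and 2 m dewpoint (`T2MDEW`), both in kelvin, using the Tetens approximation for saturation vapor pressure. The function names are hypothetical.

```python
import math


def saturation_vapor_pressure_kpa(temp_c: float) -> float:
    """Saturation vapor pressure over water in kPa (Tetens approximation)."""
    return 0.6108 * math.exp(17.27 * temp_c / (temp_c + 237.3))


def vpd_kpa(t2m_kelvin: float, t2mdew_kelvin: float) -> float:
    """VPD = e_s(T) - e_s(T_dew), with both inputs in kelvin (assumed inputs)."""
    temp_c = t2m_kelvin - 273.15
    dew_c = t2mdew_kelvin - 273.15
    return saturation_vapor_pressure_kpa(temp_c) - saturation_vapor_pressure_kpa(dew_c)


# Example: 30 °C air with a 15 °C dewpoint
example_vpd = vpd_kpa(303.15, 288.15)
print(f"VPD: {example_vpd:.2f} kPa")
```

When the dewpoint equals the air temperature the air is saturated and the function returns 0 kPa.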

## Output:
- **Location**: `daily_data/`, relative to the notebook's working directory (`research/daily_data/` when run from `research/`)
- **Format**: NetCDF files named `merra2_us_YYYYMMDD.nc`
- **Variables**: T2M (°C), VPD (kPa)
- **Temporal**: Hourly data (24 timesteps per day)
- **Spatial**: US Lower 48 (0.625° longitude x 0.5° latitude grid)
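
As a rough size check, the number of grid points per file can be worked out from the grid spacing. The sketch below assumes the standard MERRA-2 global grid (576 longitudes starting at -180°, 361 latitudes starting at -90°) and the bounding box `(-125, 24, -66, 49)` used in this notebook's configuration, with inclusive bounds. Whether `merra2_processing` selects the region inclusively is an assumption.

```python
def grid_axis(start: float, step: float, count: int) -> list[float]:
    """Return evenly spaced grid coordinates."""
    return [start + step * i for i in range(count)]


# Assumed MERRA-2 global grid: 0.625° in longitude, 0.5° in latitude
lons = grid_axis(-180.0, 0.625, 576)
lats = grid_axis(-90.0, 0.5, 361)

min_lon, min_lat, max_lon, max_lat = (-125, 24, -66, 49)

# Count grid points that fall inside the bounding box (bounds inclusive)
n_lon = sum(min_lon <= lon <= max_lon for lon in lons)
n_lat = sum(min_lat <= lat <= max_lat for lat in lats)
values_per_variable = n_lon * n_lat * 24  # 24 hourly timesteps per day

print(f"{n_lon} longitudes x {n_lat} latitudes")
print(f"{values_per_variable:,} values per variable per day")
```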

## 1. Setup and Imports

In [None]:
# Import required libraries
import sys
from pathlib import Path
import pandas as pd
import earthaccess
from datetime import datetime
from tqdm.notebook import tqdm

# Import our custom processing functions
import merra2_processing as m2p

print("Libraries imported successfully")

## 2. Authenticate with NASA EarthData

You'll need a NASA EarthData account: https://urs.earthdata.nasa.gov/

In [None]:
# Authenticate once at the beginning
auth = earthaccess.login()
if auth.authenticated:
    print("✓ Authentication successful!")
else:
    print("✗ Authentication failed. Check your EarthData credentials.")

## 3. Configure Processing Parameters

In [None]:
# US Lower 48 Bounding Box
bbox = (-125, 24, -66, 49)  # (min_lon, min_lat, max_lon, max_lat)

# Date range
start_date = "1984-01-01"
end_date = "2025-12-31"

# MERRA-2 collection
collection_id = "M2T1NXSLV"  # Hourly single-level diagnostics

# Output directory (created if it does not exist yet)
output_dir = Path("daily_data")
output_dir.mkdir(parents=True, exist_ok=True)

print("Configuration:")
print(f"  Date range: {start_date} to {end_date}")
print(f"  Bounding box: {bbox} (US Lower 48)")
print(f"  Collection: {collection_id}")
print(f"  Output directory: {output_dir}")

# Generate list of dates to process
dates = m2p.get_date_range(start_date, end_date)
print(f"\nTotal days to process: {len(dates):,}")

## 4. Check Existing Files

Let's see how many files have already been processed.
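
The skip check depends on the output naming convention `merra2_us_YYYYMMDD.nc`. The implementation of `m2p.check_file_exists` is not shown in this notebook, so the helper below is a hypothetical stand-in that only illustrates how a date maps to its expected file.

```python
from datetime import date
from pathlib import Path


def expected_output_path(day: date, output_dir: Path) -> Path:
    """Build the expected output path for one day (hypothetical helper)."""
    return output_dir / f"merra2_us_{day:%Y%m%d}.nc"


demo_path = expected_output_path(date(2023, 6, 1), Path("daily_data"))
print(demo_path.name)
print("exists" if demo_path.exists() else "not processed yet")
```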

In [None]:
# Count existing files (one existence check per date)
existing_files = [d for d in dates if m2p.check_file_exists(d, output_dir)]
existing_set = set(existing_files)
remaining_files = [d for d in dates if d not in existing_set]

print("Status:")
print(f"  ✓ Already processed: {len(existing_files):,} days")
print(f"  ⧗ Remaining to process: {len(remaining_files):,} days")

if len(existing_files) > 0:
    print(f"\nFirst processed file: {existing_files[0].strftime('%Y-%m-%d')}")
    print(f"Last processed file: {existing_files[-1].strftime('%Y-%m-%d')}")

## 5. Process All Days

This cell will loop through all dates and process them. It will:
- Skip files that already exist
- Show a progress bar
- Print status for each file
- Track success/failure/skip counts

**Note**: At roughly 10-30 seconds per day, the full 1984-2025 range takes tens of hours. Consider processing in chunks (Section 6).

In [None]:
# Track results
results = {
    'success': [],
    'failed': [],
    'skipped': []
}

# Process each day with progress bar
for date in tqdm(dates, desc="Processing MERRA-2 data"):
    result = m2p.process_single_day(
        date=date,
        bbox=bbox,
        collection_id=collection_id,
        output_dir=output_dir,
        auth=auth
    )
    
    if result['success'] and result.get('skipped', False):
        results['skipped'].append(date)
    elif result['success']:
        results['success'].append(date)
        print(f"✓ {date.strftime('%Y-%m-%d')}: {result['message']}")
    else:
        results['failed'].append(date)
        print(f"✗ {date.strftime('%Y-%m-%d')}: {result['message']}")

# Print summary
print("\n" + "="*60)
print("PROCESSING COMPLETE")
print("="*60)
print(f"Successfully processed: {len(results['success']):,} days")
print(f"Skipped (already exist): {len(results['skipped']):,} days")
print(f"Failed: {len(results['failed']):,} days")

## 6. Process a Smaller Date Range (Optional)

Use this cell to process a smaller date range for testing or incremental processing.

In [None]:
# Example: Process just June 2023
test_start = "2023-06-01"
test_end = "2023-06-30"

test_dates = m2p.get_date_range(test_start, test_end)
print(f"Processing {len(test_dates)} days from {test_start} to {test_end}\n")

for date in tqdm(test_dates, desc="Processing test range"):
    result = m2p.process_single_day(
        date=date,
        bbox=bbox,
        collection_id=collection_id,
        output_dir=output_dir,
        auth=auth
    )
    
    status = "✓" if result['success'] else "✗"
    print(f"{status} {date.strftime('%Y-%m-%d')}: {result['message']}")

## 7. Verify Output Files

Check the structure and content of a processed file.

In [None]:
import xarray as xr

# Load a sample file
sample_files = sorted(output_dir.glob("merra2_us_*.nc"))

if len(sample_files) > 0:
    sample_file = sample_files[0]
    print(f"Sample file: {sample_file.name}\n")
    
    ds = xr.open_dataset(sample_file)
    
    print("Dataset Information:")
    print("=" * 60)
    print(f"Dimensions: {dict(ds.sizes)}")
    print(f"\nVariables: {list(ds.data_vars)}")
    
    print("\nT2M (Temperature):")
    print(f"  Units: {ds['T2M'].attrs.get('units', 'N/A')}")
    print(f"  Long name: {ds['T2M'].attrs.get('long_name', 'N/A')}")
    print(f"  Range: {float(ds['T2M'].min()):.2f} to {float(ds['T2M'].max()):.2f} °C")
    
    print("\nVPD (Vapor Pressure Deficit):")
    print(f"  Units: {ds['VPD'].attrs.get('units', 'N/A')}")
    print(f"  Long name: {ds['VPD'].attrs.get('long_name', 'N/A')}")
    print(f"  Range: {float(ds['VPD'].min()):.2f} to {float(ds['VPD'].max()):.2f} kPa")
    
    print("\nGlobal Attributes:")
    for key, value in ds.attrs.items():
        print(f"  {key}: {value}")
    
    ds.close()
else:
    print("No processed files found yet. Run processing cells first.")

## 8. List Failed Dates (if any)

If any dates failed to process, they will be listed here.

In [None]:
# This will only work if you ran the processing loop above
if 'results' in locals() and len(results['failed']) > 0:
    print(f"Failed dates ({len(results['failed'])}):")
    for date in results['failed']:
        print(f"  - {date.strftime('%Y-%m-%d')}")
    
    # Save failed dates to a file for reprocessing
    failed_dates_file = output_dir / "failed_dates.txt"
    with open(failed_dates_file, 'w') as f:
        for date in results['failed']:
            f.write(f"{date.strftime('%Y-%m-%d')}\n")
    print(f"\nFailed dates saved to: {failed_dates_file}")
else:
    print("No failed dates found or processing hasn't been run yet.")

## 9. Storage Usage

Check how much disk space the processed files are using.

In [None]:
# Calculate total storage used
total_size = 0
file_count = 0

for file in output_dir.glob("merra2_us_*.nc"):
    total_size += file.stat().st_size
    file_count += 1

print("Storage Statistics:")
print(f"  Total files: {file_count:,}")
print(f"  Total size: {total_size / (1024**3):.2f} GB")
if file_count > 0:
    print(f"  Average file size: {total_size / file_count / (1024**2):.2f} MB")

## Notes

### Processing Time Estimates:
- Each day takes ~10-30 seconds to download and process
- Full dataset (1984-2025): 15,341 days, or roughly 43-128 hours
- **Recommendation**: Process in yearly chunks or run overnight
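
The estimate above is simple arithmetic on the day count. A short sketch, assuming the 10-30 seconds per day quoted above holds on your connection:

```python
from datetime import date

# Inclusive day count for the configured range
n_days = (date(2025, 12, 31) - date(1984, 1, 1)).days + 1


def total_hours(seconds_per_day: float) -> float:
    """Total wall-clock hours if every day takes `seconds_per_day`."""
    return n_days * seconds_per_day / 3600


low_hours = total_hours(10)
high_hours = total_hours(30)
print(f"{n_days:,} days: {low_hours:.0f} to {high_hours:.0f} hours")
```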

### Tips:
1. **Resume capability**: The code automatically skips already-processed files, so you can stop and restart anytime
2. **Chunk processing**: Use the optional cell (Section 6) to process specific date ranges
3. **Monitor progress**: Check the `daily_data/` directory to see files being created
4. **Disk space**: Expect ~5-10 MB per day = ~75-150 GB for full dataset
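
One way to set up chunked processing is to split the full range into per-year start/end pairs. The helper below is a hypothetical sketch that uses only the standard library and is not part of `merra2_processing`.

```python
from datetime import date


def yearly_chunks(start: date, end: date) -> list[tuple[str, str]]:
    """Split [start, end] into (start, end) ISO date pairs, one per calendar year."""
    chunks = []
    for year in range(start.year, end.year + 1):
        chunk_start = max(start, date(year, 1, 1))
        chunk_end = min(end, date(year, 12, 31))
        chunks.append((chunk_start.isoformat(), chunk_end.isoformat()))
    return chunks


chunks = yearly_chunks(date(1984, 1, 1), date(2025, 12, 31))
print(f"{len(chunks)} chunks, first: {chunks[0]}, last: {chunks[-1]}")
```

Each pair can be substituted for `test_start` and `test_end` in the Section 6 cell.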

### Troubleshooting:
- If authentication expires, re-run the authentication cell
- If downloads are slow, check your internet connection
- If a date consistently fails, it may not have available data (check MERRA-2 availability)
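
To retry failed dates later, the `failed_dates.txt` file written in Section 8 (one `YYYY-MM-DD` per line) can be read back into a list. The loader below is a hypothetical sketch. It returns `datetime` objects, and whether `m2p.process_single_day` accepts those in place of the values from `m2p.get_date_range` is an assumption. The demo writes to a temporary directory so it does not touch real output.

```python
import tempfile
from datetime import datetime
from pathlib import Path


def load_failed_dates(path: Path) -> list[datetime]:
    """Read one YYYY-MM-DD date per line; return [] if the file is missing."""
    if not path.exists():
        return []
    with open(path) as f:
        return [datetime.strptime(line.strip(), "%Y-%m-%d") for line in f if line.strip()]


# Demo with a temporary file standing in for daily_data/failed_dates.txt
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "failed_dates.txt"
    demo_file.write_text("2023-06-03\n2023-06-17\n")
    retry_dates = load_failed_dates(demo_file)
    missing = load_failed_dates(Path(tmp) / "absent.txt")

print([d.strftime("%Y-%m-%d") for d in retry_dates])
```

The resulting list can be looped over in the same way as `dates` in Section 5.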