# CaSR SWE File Combination Workflow

This notebook demonstrates how to combine NetCDF files from the CaSR SWE dataset using the `combine_casr_swe_files.py` script. The files are organized by variable type, spatial region, and time period, and can be combined in three ways:

1. **Temporal combination**: Combine files across time periods
2. **Spatial combination**: Combine files across spatial regions
3. **Full combination**: Combine both temporal and spatial dimensions

The CaSR SWE dataset includes:
- **Variable types**: A_PR24_SFC (precipitation) and P_SWE_LAND (snow water equivalent)
- **Spatial regions**: Different rlon/rlat coordinate ranges
- **Time periods**: 4-year chunks from 1980-2023
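
As a quick check on the time periods, the sketch below enumerates 4-year chunks covering 1980-2023. It assumes the chunks are consecutive, start in 1980, and use inclusive end years; the actual file boundaries may differ, so compare the result against the filenames in your input directory.

```python
# Enumerate 4-year periods covering 1980-2023.
# Assumption: periods are consecutive and end years are inclusive.
START_YEAR, END_YEAR, CHUNK_YEARS = 1980, 2023, 4

periods = [
    (year, min(year + CHUNK_YEARS - 1, END_YEAR))
    for year in range(START_YEAR, END_YEAR + 1, CHUNK_YEARS)
]

for first, last in periods:
    print(f"{first}-{last}")
print(f"{len(periods)} periods")
```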

## Setup and Imports

**Note**: If you encounter NumPy compatibility errors, please run one of the following commands in your terminal before running this notebook:

**Option 1 (Recommended)**: Install from requirements file
```bash
pip install -r requirements_notebook.txt
```

**Option 2**: Manual installation with compatible versions
```bash
pip install "numpy<2" xarray pandas matplotlib netcdf4
```

**Option 3**: Using conda
```bash
conda install numpy=1.26 xarray pandas matplotlib netcdf4
```

**Option 4**: Create a new environment with compatible versions
```bash
conda create -n snowdrought python=3.9 numpy=1.26 xarray pandas matplotlib netcdf4 jupyter
conda activate snowdrought
```
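
To see whether the installed NumPy falls inside the `numpy<2` pin before importing anything heavy, a small standard-library check along these lines may help. The helper names `major_version` and `numpy_status` are illustrative and not part of the project.

```python
from importlib.metadata import PackageNotFoundError, version


def major_version(version_string: str) -> int:
    """Return the leading integer of a version string such as '1.26.4'."""
    return int(version_string.split(".")[0])


def numpy_status() -> str:
    """Describe the installed NumPy relative to the 'numpy<2' pin."""
    try:
        installed = version("numpy")
    except PackageNotFoundError:
        return "NumPy is not installed"
    if major_version(installed) >= 2:
        return f"NumPy {installed} found; the pins above expect a 1.x release"
    return f"NumPy {installed} is within the 'numpy<2' pin"


print(numpy_status())
```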

In [None]:
# Suppress UserWarnings raised from NumPy modules
import warnings
warnings.filterwarnings('ignore', category=UserWarning, module='numpy')

# Import required packages
import sys
import os
from pathlib import Path

# Handle NumPy compatibility
try:
    import numpy as np
    print(f"NumPy version: {np.__version__}")
except ImportError as e:
    print(f"NumPy import error: {e}")
    print("Please install NumPy: pip install 'numpy<2'")
    raise  # later cells use np, so stop here instead of failing later

# Import data science packages with error handling
try:
    import xarray as xr
    print(f"xarray version: {xr.__version__}")
except ImportError as e:
    print(f"xarray import error: {e}")
    print("If you encounter NumPy compatibility issues, try:")
    print("  pip install 'numpy<2' xarray pandas matplotlib")
    print("  or")
    print("  conda install numpy=1.26 xarray pandas matplotlib")
    raise

try:
    import pandas as pd
    print(f"pandas version: {pd.__version__}")
except ImportError as e:
    print(f"pandas import error: {e}")
    raise

try:
    import matplotlib
    import matplotlib.pyplot as plt
    print(f"matplotlib version: {matplotlib.__version__}")
except ImportError as e:
    print(f"matplotlib import error: {e}")
    raise

# Add the project root to the Python path to import the combine script.
# This assumes the notebook is run from a directory two levels below the project root.
project_root = Path.cwd().parent.parent
sys.path.append(str(project_root))

# Import the CaSR file combiner
try:
    from combine_casr_swe_files import CaSRFileCombiner
    print("Successfully imported CaSRFileCombiner")
except ImportError as e:
    print(f"Error importing CaSRFileCombiner: {e}")
    print("Make sure combine_casr_swe_files.py is in the project root directory")
    raise

# Import the elevation data filter (reuses the existing module)
try:
    from filter_merge_elevation_data import ElevationDataFilter
    print("Successfully imported ElevationDataFilter")
except ImportError as e:
    print(f"Error importing ElevationDataFilter: {e}")
    print("Make sure filter_merge_elevation_data.py is in the project root directory")
    raise

# Import the optimized elevation data extractor
try:
    from extract_elevation_data_optimized import OptimizedElevationDataExtractor
    print("Successfully imported OptimizedElevationDataExtractor")
except ImportError as e:
    print(f"Error importing OptimizedElevationDataExtractor: {e}")
    print("Make sure extract_elevation_data_optimized.py is in the project root directory")
    raise
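
The cell above assumes the notebook runs two levels below the project root. If the notebook moves, one alternative is to search upward for the script itself. The sketch below is an illustration only; `find_project_root` is a hypothetical helper, not part of the project.

```python
from pathlib import Path
from typing import Optional


def find_project_root(start: Path, marker: str = "combine_casr_swe_files.py") -> Optional[Path]:
    """Walk upward from `start` until a directory containing `marker` is found."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_file():
            return candidate
    return None  # marker not found in `start` or any parent


root = find_project_root(Path.cwd())
print(f"Project root: {root}" if root else "Marker file not found in any parent directory")
```

If a root is found, it can be appended to `sys.path` in place of the hard-coded `parent.parent`.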

## Configuration

Set up the input and output directories for your CaSR SWE data files.

In [None]:
# Define data paths - modify these paths according to your data location
input_dir = r"data/input_data/CaSR_SWE"  # Directory containing CaSR NetCDF files
output_dir = r"data/output_data/combined_casr"  # Directory for combined output files
elevation_dir = r"data/input_data/Elevation"  # Directory containing elevation shapefiles
filtered_output_dir = r"data/output_data/filtered_elevation"  # Directory for filtered elevation data

# Create absolute paths
input_path = project_root / input_dir
output_path = project_root / output_dir
elevation_path = project_root / elevation_dir
filtered_output_path = project_root / filtered_output_dir

print(f"Input directory: {input_path}")
print(f"Output directory: {output_path}")
print(f"Elevation directory: {elevation_path}")
print(f"Filtered output directory: {filtered_output_path}")
print(f"Input directory exists: {input_path.exists()}")
print(f"Elevation directory exists: {elevation_path.exists()}")
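
The cell above only reports whether the input directories exist. A minimal sketch of a stricter check follows: it lists missing input directories and creates output directories up front. The classes used later may already create their own output directories, so treat this as optional; `prepare_directories` is a hypothetical helper.

```python
import tempfile
from pathlib import Path


def prepare_directories(inputs, outputs):
    """Return missing input directories and create any missing output directories."""
    missing = [Path(p) for p in inputs if not Path(p).is_dir()]
    for p in outputs:
        Path(p).mkdir(parents=True, exist_ok=True)
    return missing


# Demonstration in a temporary directory so nothing is written to the project.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    missing = prepare_directories([base / "input"], [base / "output" / "combined"])
    print("Missing inputs:", [p.name for p in missing])
    print("Output created:", (base / "output" / "combined").is_dir())
```

In this notebook the call would be `prepare_directories([input_path, elevation_path], [output_path, filtered_output_path])`.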

## Initialize the CaSR File Combiner

Create an instance of the `CaSRFileCombiner` class with your input and output directories.

In [None]:
# Initialize the file combiner
combiner = CaSRFileCombiner(input_dir=str(input_path), output_dir=str(output_path))

## Explore Dataset Information

Before combining files, let's examine what data is available in the input directory.

In [None]:
# Get information about the available datasets
combiner.get_dataset_info()

## Examine File Groups

Let's look at how the files are grouped by variable type.

In [None]:
# Get file groups
file_groups = combiner.get_file_groups()

print("Available file groups:")
for group_name, files in file_groups.items():
    print(f"\n{group_name}: {len(files)} files")
    
    # Show first few filenames as examples
    for i, file_path in enumerate(files[:3]):
        filename = Path(file_path).name
        print(f"  {i+1}. {filename}")
    
    if len(files) > 3:
        print(f"  ... and {len(files) - 3} more files")

## Example: Parse Individual Filenames

Let's examine how the filename parsing works for understanding the file structure.

In [None]:
# Get a sample file and parse its filename
if file_groups:
    # Get the first file from the first group
    first_group = list(file_groups.keys())[0]
    sample_file = file_groups[first_group][0]
    sample_filename = Path(sample_file).name
    
    print(f"Sample filename: {sample_filename}")
    
    # Parse the filename
    parsed_info = combiner.parse_filename(sample_filename)
    
    print("\nParsed information:")
    for key, value in parsed_info.items():
        print(f"  {key}: {value}")
else:
    print("No files found in the input directory. Please check your input path.")
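
To make the idea of filename parsing concrete, here is a sketch that uses a regular expression with named groups. The filename layout (`<prefix>_<variable>_rlon<min>-<max>_rlat<min>-<max>_<start>-<end>.nc`) and the example name are assumptions for illustration, and `parse_example_filename` is a hypothetical stand-in, not the implementation of `CaSRFileCombiner.parse_filename`.

```python
import re

# Assumed layout: <prefix>_<variable>_rlon<min>-<max>_rlat<min>-<max>_<start>-<end>.nc
PATTERN = re.compile(
    r"^(?P<prefix>.+?)_(?P<variable>A_PR24_SFC|P_SWE_LAND)"
    r"_rlon(?P<rlon_min>\d+)-(?P<rlon_max>\d+)"
    r"_rlat(?P<rlat_min>\d+)-(?P<rlat_max>\d+)"
    r"_(?P<year_start>\d{4})-(?P<year_end>\d{4})\.nc$"
)

INT_FIELDS = ("rlon_min", "rlon_max", "rlat_min", "rlat_max", "year_start", "year_end")


def parse_example_filename(filename):
    """Return a dict of filename components, or None if the name does not match."""
    match = PATTERN.match(filename)
    if match is None:
        return None
    info = match.groupdict()
    for key in INT_FIELDS:
        info[key] = int(info[key])  # convert index and year fields to integers
    return info


# Hypothetical filename following the assumed layout.
example = "CaSR_v3.1_P_SWE_LAND_rlon211-245_rlat386-420_1980-1983.nc"
print(parse_example_filename(example))
```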

## Combination Options

Now let's demonstrate the different ways to combine the CaSR files.

### Option 1: Temporal Combination Only

Combine files across time periods while keeping spatial regions separate.

In [None]:
# Temporal combination only
print("Performing temporal combination (keeping spatial regions separate)...")
combiner.combine_by_variable(combine_spatial=False, combine_temporal=True)
print("Temporal combination completed!")

### Option 2: Spatial Combination Only

Combine files across spatial regions while keeping time periods separate.

In [None]:
# Spatial combination only
print("Performing spatial combination (keeping time periods separate)...")
combiner.combine_by_variable(combine_spatial=True, combine_temporal=False)
print("Spatial combination completed!")

### Option 3: Full Combination

Combine files across both spatial and temporal dimensions to create complete datasets.

In [None]:
# Full combination (both spatial and temporal)
print("Performing full combination (both spatial and temporal)...")
combiner.combine_by_variable(combine_spatial=True, combine_temporal=True)
print("Full combination completed!")
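
The three options differ in which files end up in the same output. The sketch below illustrates that grouping logic on a handful of hypothetical tiles. It describes the idea only and makes no claim about how `combine_by_variable` is implemented internally.

```python
from collections import defaultdict

# Hypothetical tiles: (variable, spatial region, time period).
tiles = [
    ("P_SWE_LAND", "region_A", "1980-1983"),
    ("P_SWE_LAND", "region_A", "1984-1987"),
    ("P_SWE_LAND", "region_B", "1980-1983"),
    ("P_SWE_LAND", "region_B", "1984-1987"),
]


def group_tiles(tiles, combine_spatial, combine_temporal):
    """Group tiles that would be merged into one output under the given flags."""
    groups = defaultdict(list)
    for variable, region, period in tiles:
        key = (
            variable,
            "all_regions" if combine_spatial else region,
            "all_periods" if combine_temporal else period,
        )
        groups[key].append((region, period))
    return dict(groups)


for spatial, temporal in [(False, True), (True, False), (True, True)]:
    groups = group_tiles(tiles, spatial, temporal)
    print(f"combine_spatial={spatial}, combine_temporal={temporal}: {len(groups)} output group(s)")
```

With two regions and two periods, temporal-only and spatial-only combination each yield two groups per variable, while full combination yields one.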

## Examine Combined Output Files

Let's check what files were created in the output directory.

In [None]:
# List output files
output_files = sorted(output_path.glob('*.nc'))  # sorted for a stable listing order

print(f"Combined files created in {output_path}:")
print(f"Total files: {len(output_files)}\n")

for i, file_path in enumerate(output_files, 1):
    file_size = file_path.stat().st_size / (1024**2)  # Size in MB
    print(f"{i}. {file_path.name} ({file_size:.1f} MB)")

## Extract Elevation Data from Combined Files

Now that we have combined the CaSR SWE files, let's extract data at specific elevation points using the optimized elevation data extractor. This allows us to track SWE and precipitation data at different elevation levels.
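
Extracting gridded data at a point usually comes down to finding the grid cell nearest to that point. The sketch below shows the idea on a tiny, hypothetical regular grid with one-dimensional coordinate axes. CaSR files use rotated `rlon`/`rlat` coordinates, and the method used by `OptimizedElevationDataExtractor` is not shown in this notebook, so this is a simplified illustration only.

```python
def nearest_index(values, target):
    """Index of the coordinate value closest to `target`."""
    return min(range(len(values)), key=lambda i: abs(values[i] - target))


# Hypothetical 1-D coordinate axes and a small 2-D SWE field (rows = lat, cols = lon).
lats = [49.0, 49.5, 50.0]
lons = [-115.0, -114.5, -114.0]
swe = [
    [10.0, 12.0, 14.0],
    [20.0, 22.0, 24.0],
    [30.0, 32.0, 34.0],
]

# Hypothetical elevation point.
point_lat, point_lon = 49.6, -114.1
row = nearest_index(lats, point_lat)
col = nearest_index(lons, point_lon)
print(f"Nearest cell ({lats[row]}, {lons[col]}): SWE = {swe[row][col]}")
```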

### Configure Elevation Data Extraction

Set up the paths for elevation data and configure extraction parameters.

In [None]:
# Define elevation data paths
elevation_output_dir = r"data/output_data/elevation"  # Directory for elevation extraction output

# Create absolute paths
elevation_output_path = project_root / elevation_output_dir

print(f"Elevation output directory: {elevation_output_path}")

# Initialize the elevation data extractor
elevation_extractor = OptimizedElevationDataExtractor(
    elevation_dir=str(elevation_path),
    combined_casr_dir=str(output_path),  # Use the output from CaSR combination
    output_dir=str(elevation_output_path)
)

### Load and Explore Elevation Data

Load the elevation shapefile to see what elevation points are available for data extraction.

In [None]:
# Load elevation data
elevation_extractor.load_elevation_data()

# Display basic information about elevation points
if elevation_extractor.elevation_gdf is not None:
    print(f"\nTotal elevation points: {len(elevation_extractor.elevation_gdf)}")
    print(f"\nFirst 5 elevation points:")
    print(elevation_extractor.elevation_gdf.head())
    
    # Show elevation statistics if available
    elev_cols = [col for col in elevation_extractor.elevation_gdf.columns 
                 if 'elev' in col.lower() or col in ['min', 'max', 'mean', 'median']]
    if elev_cols:
        print(f"\nElevation statistics:")
        for col in elev_cols:
            if pd.api.types.is_numeric_dtype(elevation_extractor.elevation_gdf[col]):
                print(f"  {col}:")
                print(f"    Min: {elevation_extractor.elevation_gdf[col].min():.1f}")
                print(f"    Max: {elevation_extractor.elevation_gdf[col].max():.1f}")
                print(f"    Mean: {elevation_extractor.elevation_gdf[col].mean():.1f}")

### Check Available Combined CaSR Files

Let's see what combined CaSR files are available for elevation data extraction.

In [None]:
# Get available combined CaSR files
temporal_files, full_files = elevation_extractor.get_combined_casr_files()

print("Available files for elevation extraction:")
print(f"\nTemporal combined files ({len(temporal_files)}):")
for i, file in enumerate(temporal_files[:3], 1):
    print(f"  {i}. {file.name}")
if len(temporal_files) > 3:
    print(f"  ... and {len(temporal_files) - 3} more files")

print(f"\nFull combined files ({len(full_files)}):")
for i, file in enumerate(full_files[:3], 1):
    print(f"  {i}. {file.name}")
if len(full_files) > 3:
    print(f"  ... and {len(full_files) - 3} more files")

### Extract Elevation Data with Optimization

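Extract data at elevation points from the combined CaSR files. The cell below sets `time_sampling = 'all'`; for large datasets, one of the other listed options (`'monthly'`, `'yearly'`, `'sample'`) together with `max_records` can reduce memory use.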

In [None]:
# Configure extraction parameters
time_sampling = 'all'  # Options: 'all', 'monthly', 'yearly', 'sample'
max_records = 10000  # Maximum records per point to avoid memory issues
file_types = ['temporal', 'full']  # Which file types to process

print(f"Extraction configuration:")
print(f"  Time sampling: {time_sampling}")
print(f"  Max records per point: {max_records}")
print(f"  File types to process: {file_types}")
print(f"\nStarting elevation data extraction...")

# Process all files and extract elevation data
extraction_results = elevation_extractor.process_all_files(
    file_types=file_types,
    time_sampling=time_sampling,
    max_records=max_records
)

print(f"\nExtraction completed!")
print(f"Processed {len(extraction_results)} file groups")
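
To illustrate what time sampling can mean, the sketch below thins a daily series to the first date in each month or year and then applies a record cap. The semantics are an assumption: the extractor's actual handling of `time_sampling` and `max_records` may differ, and the `'sample'` option is omitted because the notebook does not describe it.

```python
from datetime import date, timedelta


def sample_times(times, time_sampling="all", max_records=None):
    """Thin a sorted list of dates, then cap the number of records."""
    if time_sampling == "all":
        key = None
    elif time_sampling == "monthly":
        key = lambda d: (d.year, d.month)
    elif time_sampling == "yearly":
        key = lambda d: d.year
    else:
        raise ValueError(f"Unsupported time_sampling: {time_sampling!r}")

    if key is None:
        selected = list(times)
    else:
        seen = set()
        selected = []
        for d in times:
            k = key(d)
            if k not in seen:  # keep the first date in each month or year
                seen.add(k)
                selected.append(d)
    return selected if max_records is None else selected[:max_records]


# Daily dates covering 1980 (a leap year) and 1981.
daily = [date(1980, 1, 1) + timedelta(days=i) for i in range(731)]
for mode in ("all", "monthly", "yearly"):
    print(mode, len(sample_times(daily, mode)))
```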

### Save Extracted Data

Save the extracted elevation data to files for further analysis.

In [None]:
# Save results in multiple formats
output_format = 'both'  # Options: 'csv', 'parquet', 'both'

print(f"Saving extracted data in {output_format} format...")
elevation_extractor.save_results(extraction_results, format=output_format)

# Generate summary report
elevation_extractor.generate_summary_report(extraction_results)

## Filter and Merge Elevation Data

Now we'll use the existing `ElevationDataFilter` class from `filter_merge_elevation_data.py` to merge the elevation points with the combined CaSR data and keep only records with non-null precipitation and SWE values.

In [None]:
# Initialize the elevation data filter using the existing class
elevation_filter = ElevationDataFilter(
    elevation_dir=str(elevation_path),
    casr_dir=str(output_path),  # Use combined CaSR files
    output_dir=str(filtered_output_path)
)

print("ElevationDataFilter initialized successfully")
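
The filtering rule, keeping only records in which both precipitation and SWE are non-null, can be sketched in plain Python as follows. The field names and records are hypothetical, and `ElevationDataFilter` may implement the rule differently.

```python
import math


def is_missing(value):
    """True for None or a float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def keep_complete(records, required=("precipitation", "swe")):
    """Keep records in which every required field is present and non-null."""
    return [r for r in records if not any(is_missing(r.get(k)) for k in required)]


# Hypothetical records; a value of 0.0 is valid data, not a null.
records = [
    {"point_id": 1, "precipitation": 2.5, "swe": 120.0},
    {"point_id": 2, "precipitation": float("nan"), "swe": 80.0},
    {"point_id": 3, "precipitation": 0.0, "swe": None},
    {"point_id": 4, "precipitation": 0.0, "swe": 0.0},
]
complete = keep_complete(records)
print([r["point_id"] for r in complete])
```

Zero values are kept on purpose: a snow-free day is a valid observation, whereas `None` and NaN mark missing data.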

### Process Elevation Data with Filtering

Use the main processing function to filter and merge elevation data.

In [None]:
# Configure processing parameters
sample_points = 100  # Number of elevation points to sample for testing
sample_time = 10     # Number of time steps to sample for testing

print(f"Processing configuration:")
print(f"  Sample points: {sample_points}")
print(f"  Sample time steps: {sample_time}")
print(f"\nStarting elevation data filtering and merging process...")

# Process the data using the existing functionality
try:
    elevation_filter.process(
        sample_points=sample_points,
        sample_time=sample_time
    )
    print("\nElevation data filtering and merging completed successfully!")
except Exception as e:
    print(f"Error during processing: {e}")
    raise

### Examine Filtered Results

Let's check what files were created by the filtering process.

In [None]:
# List filtered output files (regular files only, sorted by name)
filtered_files = sorted(p for p in filtered_output_path.glob('*') if p.is_file())

print(f"Filtered files created in {filtered_output_path}:")
print(f"Total files: {len(filtered_files)}\n")

for i, file_path in enumerate(filtered_files, 1):
    file_size = file_path.stat().st_size / 1024  # Size in KB
    print(f"{i}. {file_path.name} ({file_size:.1f} KB)")

### Load and Examine Filtered Data

Let's load and examine the filtered elevation data to understand the results.

In [None]:
# Load the filtered data if it exists
filtered_csv = filtered_output_path / "filtered_elevation_data.csv"
stats_csv = filtered_output_path / "elevation_statistics.csv"

if filtered_csv.exists():
    # Load filtered data
    filtered_df = pd.read_csv(filtered_csv)
    print(f"Loaded filtered elevation data: {len(filtered_df)} records")
    print(f"\nData columns: {list(filtered_df.columns)}")
    print(f"\nFirst 5 records:")
    print(filtered_df.head())
    
    # Show basic statistics
    print(f"\nBasic statistics:")
    numeric_cols = filtered_df.select_dtypes(include=[np.number]).columns
    print(filtered_df[numeric_cols].describe())
    
    # Load elevation statistics if available
    if stats_csv.exists():
        stats_df = pd.read_csv(stats_csv, index_col=0)
        print(f"\nElevation statistics by elevation bins:")
        print(stats_df)
else:
    print("No filtered data file found. Check if the processing completed successfully.")
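
For readers unfamiliar with statistics by elevation bin, the sketch below averages SWE within fixed-width elevation bins. The 500 m bin width, the sample values, and the helper names are assumptions; the layout of `elevation_statistics.csv` produced by the filter may differ.

```python
from collections import defaultdict
from statistics import mean


def bin_label(elevation, width=500):
    """Label the half-open bin [lower, lower + width) that contains `elevation`."""
    lower = int(elevation // width) * width
    return f"{lower}-{lower + width}"


def mean_by_elevation_bin(records, width=500):
    """Average SWE per elevation bin."""
    bins = defaultdict(list)
    for elevation, swe in records:
        bins[bin_label(elevation, width)].append(swe)
    return {label: mean(values) for label, values in bins.items()}


# Hypothetical (elevation in m, SWE) pairs.
records = [(450, 10.0), (900, 30.0), (1200, 80.0), (1400, 100.0)]
for label, value in mean_by_elevation_bin(records).items():
    print(f"{label} m: mean SWE = {value:.1f}")
```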

## Summary

This notebook has demonstrated the complete workflow for:

1. **CaSR File Combination**: Combined NetCDF files across temporal and spatial dimensions using three different approaches:
   - Temporal combination only (keeping spatial regions separate)
   - Spatial combination only (keeping time periods separate)
   - Full combination (both spatial and temporal dimensions)

2. **Elevation Data Extraction**: Used the `OptimizedElevationDataExtractor` to extract precipitation and SWE data at specific elevation points from the combined CaSR files

3. **Data Filtering and Merging**: Used the existing `ElevationDataFilter` class to create a clean dataset with only non-null precipitation and SWE values

4. **Elevation Pattern Analysis**: Loaded the summary statistics by elevation bin as a starting point for relating elevation to precipitation and SWE

5. **Data Export**: Saved the filtered and merged dataset for further analysis

The final output includes:
- **Combined CaSR files**: NetCDF files with temporal, spatial, or full combinations
- **Extracted elevation data**: CSV/Parquet files with climate data at elevation points
- **filtered_elevation_data.csv/parquet**: Main dataset with non-null precipitation and SWE values
- **elevation_statistics.csv/parquet**: Statistical analysis by elevation bins
- **filtering_summary.json**: Summary report of the filtering process

This workflow provides a foundation for analyzing snow drought patterns across different elevation zones using the CaSR dataset.

### Key Improvement

This improved version **reuses existing functionality** from both `extract_elevation_data_optimized.py` and `filter_merge_elevation_data.py` instead of duplicating code in the notebook. This approach:

- **Eliminates code duplication** (roughly 200 or more lines of redundant code removed)
- **Maintains consistency** across the project
- **Makes maintenance easier** - updates to logic only need to be made in one place
- **Follows DRY (Don't Repeat Yourself) principles**
- **Reduces notebook complexity** and focuses on the workflow rather than implementation details
- **Includes all original functionality** while being more maintainable