# CaSR SWE File Combination Workflow

This notebook demonstrates how to combine NetCDF files from the CaSR SWE dataset using the `combine_casr_swe_files.py` script. The dataset is split into files by variable type, spatial region, and time period, and these files can be combined in three ways:

1. **Temporal combination**: Combine files across time periods
2. **Spatial combination**: Combine files across spatial regions  
3. **Full combination**: Combine both temporal and spatial dimensions

The CaSR SWE dataset includes:
- **Variable types**: A_PR24_SFC (precipitation) and P_SWE_LAND (snow water equivalent)
- **Spatial regions**: Different rlon/rlat coordinate ranges
- **Time periods**: 4-year chunks from 1980-2023
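
To make the time chunking concrete, the short sketch below enumerates the 4-year periods implied by the 1980-2023 range. The helper name `four_year_chunks` is hypothetical and not part of `combine_casr_swe_files.py`; it only assumes that the chunks are contiguous and inclusive.

```python
def four_year_chunks(start=1980, end=2023, length=4):
    """Return inclusive (first_year, last_year) tuples covering start..end."""
    return [(y, min(y + length - 1, end)) for y in range(start, end + 1, length)]

chunks = four_year_chunks()
print(len(chunks), chunks[0], chunks[-1])
```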

## Setup and Imports

**Note**: If you encounter NumPy compatibility errors, please run one of the following commands in your terminal before running this notebook:

**Option 1 (Recommended)**: Install from requirements file
```bash
pip install -r requirements_notebook.txt
```

**Option 2**: Manual installation with compatible versions
```bash
pip install "numpy<2" xarray pandas matplotlib netcdf4
```

**Option 3**: Using conda
```bash
conda install numpy=1.26 xarray pandas matplotlib netcdf4
```

**Option 4**: Create a new environment with compatible versions
```bash
conda create -n snowdrought python=3.9 numpy=1.26 xarray pandas matplotlib netcdf4 jupyter
conda activate snowdrought
```
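
If you are unsure whether your environment needs one of these options, a quick check of the installed NumPy major version can help. The sketch below is an illustration only: `numpy_major_version` and `is_pinned_range` are hypothetical helpers, and the "older than 2" threshold simply mirrors the `numpy<2` pin used above.

```python
def numpy_major_version(version: str) -> int:
    """Extract the major version number from a string such as '1.26.4'."""
    return int(version.split(".")[0])

def is_pinned_range(version: str) -> bool:
    """True when the version satisfies the numpy<2 pin."""
    return numpy_major_version(version) < 2

try:
    import numpy as np
    print(f"NumPy {np.__version__} satisfies numpy<2: {is_pinned_range(np.__version__)}")
except ImportError:
    print("NumPy is not installed")
```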

In [None]:
# Check for NumPy compatibility issues
import warnings
warnings.filterwarnings('ignore', category=UserWarning, module='numpy')

# Import required packages
import sys
import os
from pathlib import Path

# Handle NumPy compatibility
try:
    import numpy as np
    print(f"NumPy version: {np.__version__}")
except ImportError as e:
    print(f"NumPy import error: {e}")
    print("Please install NumPy: pip install 'numpy<2'")
    raise  # later cells depend on NumPy, so stop here like the other imports do

# Import data science packages with error handling
try:
    import xarray as xr
    print(f"xarray version: {xr.__version__}")
except ImportError as e:
    print(f"xarray import error: {e}")
    print("If you encounter NumPy compatibility issues, try:")
    print("  pip install 'numpy<2' xarray pandas matplotlib")
    print("  or")
    print("  conda install numpy=1.26 xarray pandas matplotlib")
    raise

try:
    import pandas as pd
    print(f"pandas version: {pd.__version__}")
except ImportError as e:
    print(f"pandas import error: {e}")
    raise

try:
    import matplotlib
    import matplotlib.pyplot as plt
    print(f"matplotlib version: {matplotlib.__version__}")
except ImportError as e:
    print(f"matplotlib import error: {e}")
    raise

# Add the project root to the Python path so the combine script can be imported.
# This assumes the notebook is run from a folder two levels below the project root.
project_root = Path.cwd().parent.parent
sys.path.append(str(project_root))

# Import the CaSR file combiner
try:
    from combine_casr_swe_files import CaSRFileCombiner
    print("Successfully imported CaSRFileCombiner")
except ImportError as e:
    print(f"Error importing CaSRFileCombiner: {e}")
    print("Make sure combine_casr_swe_files.py is in the project root directory")
    raise

## Configuration

Set up the input and output directories for your CaSR SWE data files.

In [None]:
# Define data paths - modify these paths according to your data location
input_dir = r"data/input_data/CaSR_SWE"  # Directory containing CaSR NetCDF files
output_dir = r"data/output_data/combined_casr"  # Directory for combined output files

# Create absolute paths
input_path = project_root / input_dir
output_path = project_root / output_dir

print(f"Input directory: {input_path}")
print(f"Output directory: {output_path}")
print(f"Input directory exists: {input_path.exists()}")

## Initialize the CaSR File Combiner

Create an instance of the `CaSRFileCombiner` class with your input and output directories.

In [None]:
# Initialize the file combiner
combiner = CaSRFileCombiner(input_dir=str(input_path), output_dir=str(output_path))

## Explore Dataset Information

Before combining files, let's examine what data is available in the input directory.

In [None]:
# Get information about the available datasets
combiner.get_dataset_info()

## Examine File Groups

Let's look at how the files are grouped by variable type.
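
As a rough illustration of grouping by variable type, the sketch below sorts filenames into groups by checking for the two variable codes listed earlier. It is an assumption about how such grouping could work, not the actual logic of `get_file_groups()`, and the example filenames are invented.

```python
from collections import defaultdict
from pathlib import Path

VARIABLE_CODES = ("A_PR24_SFC", "P_SWE_LAND")

def group_by_variable(paths):
    """Group file paths by the first variable code found in each filename."""
    groups = defaultdict(list)
    for path in paths:
        name = Path(path).name
        for code in VARIABLE_CODES:
            if code in name:
                groups[code].append(str(path))
                break
    return dict(groups)

# Invented filenames, for illustration only
example_files = [
    "A_PR24_SFC_region1_1980-1983.nc",
    "A_PR24_SFC_region1_1984-1987.nc",
    "P_SWE_LAND_region1_1980-1983.nc",
]
grouped = group_by_variable(example_files)
for code, files in grouped.items():
    print(code, len(files))
```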

In [None]:
# Get file groups
file_groups = combiner.get_file_groups()

print("Available file groups:")
for group_name, files in file_groups.items():
    print(f"\n{group_name}: {len(files)} files")
    
    # Show first few filenames as examples
    for i, file_path in enumerate(files[:3]):
        filename = Path(file_path).name
        print(f"  {i+1}. {filename}")
    
    if len(files) > 3:
        print(f"  ... and {len(files) - 3} more files")

## Example: Parse Individual Filenames

Let's examine how the filename parsing works for understanding the file structure.
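
The exact naming scheme is defined by the dataset and handled by `parse_filename`; the sketch below only illustrates the idea with a regular expression. The filename pattern (variable code, `rlon`/`rlat` index ranges, then a year range) and the helper `parse_casr_name` are assumptions made for this example and may not match your files.

```python
import re

# Assumed pattern: <prefix>_<variable>_rlon<a>-<b>_rlat<c>-<d>_<start>-<end>.nc
PATTERN = re.compile(
    r"(?P<variable>A_PR24_SFC|P_SWE_LAND)"
    r"_rlon(?P<rlon_min>\d+)-(?P<rlon_max>\d+)"
    r"_rlat(?P<rlat_min>\d+)-(?P<rlat_max>\d+)"
    r"_(?P<year_start>\d{4})-(?P<year_end>\d{4})\.nc$"
)

def parse_casr_name(filename):
    """Return the parsed fields as a dict, or None if the name does not match."""
    match = PATTERN.search(filename)
    if match is None:
        return None
    fields = match.groupdict()
    # Convert everything except the variable code to integers
    return {k: (v if k == "variable" else int(v)) for k, v in fields.items()}

# Hypothetical filename used only to exercise the pattern
info = parse_casr_name("CaSR_A_PR24_SFC_rlon211-245_rlat386-420_1980-1983.nc")
print(info)
```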

In [None]:
# Get a sample file and parse its filename
if file_groups:
    # Get the first file from the first group
    first_group = list(file_groups.keys())[0]
    sample_file = file_groups[first_group][0]
    sample_filename = Path(sample_file).name
    
    print(f"Sample filename: {sample_filename}")
    
    # Parse the filename
    parsed_info = combiner.parse_filename(sample_filename)
    
    print("\nParsed information:")
    for key, value in parsed_info.items():
        print(f"  {key}: {value}")
else:
    print("No files found in the input directory. Please check your input path.")

## Combination Options

Now let's demonstrate the different ways to combine the CaSR files.
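
Conceptually, temporal combination joins datasets along `time`, while spatial combination stitches neighbouring tiles along `rlon`/`rlat`. The sketch below shows both on small synthetic tiles using xarray's `concat` and `combine_by_coords`. It is an illustration under assumed dimension names (`time`, `rlat`, `rlon`) and a made-up variable `swe`; the actual `combine_by_variable` implementation may work differently.

```python
import numpy as np
import pandas as pd
import xarray as xr

def make_tile(times, rlon):
    """Build a small synthetic dataset for one region and time period."""
    rlat = [0.0, 1.0]
    data = np.zeros((len(times), len(rlat), len(rlon)))
    return xr.Dataset(
        {"swe": (("time", "rlat", "rlon"), data)},
        coords={"time": times, "rlat": rlat, "rlon": rlon},
    )

t1 = pd.date_range("1980-01-01", periods=3, freq="D")
t2 = pd.date_range("1984-01-01", periods=3, freq="D")

# Temporal combination: same region, consecutive periods
temporal = xr.concat([make_tile(t1, [0.0, 1.0]), make_tile(t2, [0.0, 1.0])], dim="time")

# Spatial combination: same period, adjacent regions
spatial = xr.combine_by_coords([make_tile(t1, [0.0, 1.0]), make_tile(t1, [2.0, 3.0])])

# Full combination: all four tiles, ordered by their coordinate values
full = xr.combine_by_coords(
    [make_tile(t, lon) for t in (t1, t2) for lon in ([0.0, 1.0], [2.0, 3.0])]
)

print(dict(temporal.sizes), dict(spatial.sizes), dict(full.sizes))
```

`combine_by_coords` relies on the coordinate values being monotonic along each dimension, which is why the synthetic tiles use non-overlapping `rlon` and `time` ranges.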

### Option 1: Temporal Combination Only

Combine files across time periods while keeping spatial regions separate.

In [None]:
# Temporal combination only
print("Performing temporal combination (keeping spatial regions separate)...")
combiner.combine_by_variable(combine_spatial=False, combine_temporal=True)
print("Temporal combination completed!")

### Option 2: Spatial Combination Only

Combine files across spatial regions while keeping time periods separate.

In [None]:
# Spatial combination only
print("Performing spatial combination (keeping time periods separate)...")
combiner.combine_by_variable(combine_spatial=True, combine_temporal=False)
print("Spatial combination completed!")

### Option 3: Full Combination

Combine files across both spatial and temporal dimensions to create complete datasets.

In [None]:
# Full combination (both spatial and temporal)
print("Performing full combination (both spatial and temporal)...")
combiner.combine_by_variable(combine_spatial=True, combine_temporal=True)
print("Full combination completed!")

## Examine Combined Output Files

Let's check what files were created in the output directory.

In [None]:
# List output files
output_files = sorted(output_path.glob('*.nc'))  # sorted for a stable listing order

print(f"Combined files created in {output_path}:")
print(f"Total files: {len(output_files)}\n")

for i, file_path in enumerate(output_files, 1):
    file_size = file_path.stat().st_size / (1024**2)  # Size in MB
    print(f"{i}. {file_path.name} ({file_size:.1f} MB)")

## Extract Elevation Data from Combined Files

Now that we have combined the CaSR SWE files, let's extract data at specific elevation points using the optimized elevation data extractor. This allows us to track SWE and precipitation data at different elevation levels.

In [None]:
# Import the optimized elevation data extractor
try:
    from extract_elevation_data_optimized import OptimizedElevationDataExtractor
    print("Successfully imported OptimizedElevationDataExtractor")
except ImportError as e:
    print(f"Error importing OptimizedElevationDataExtractor: {e}")
    print("Make sure extract_elevation_data_optimized.py is in the project root directory")
    raise

### Configure Elevation Data Extraction

Set up the paths for elevation data and configure extraction parameters.

In [None]:
# Define elevation data paths
elevation_dir = r"data/input_data/Elevation"  # Directory containing elevation shapefiles
elevation_output_dir = r"data/output_data/elevation"  # Directory for elevation extraction output

# Create absolute paths
elevation_path = project_root / elevation_dir
elevation_output_path = project_root / elevation_output_dir

print(f"Elevation data directory: {elevation_path}")
print(f"Elevation output directory: {elevation_output_path}")
print(f"Elevation directory exists: {elevation_path.exists()}")

# Initialize the elevation data extractor
elevation_extractor = OptimizedElevationDataExtractor(
    elevation_dir=str(elevation_path),
    combined_casr_dir=str(output_path),  # Use the output from CaSR combination
    output_dir=str(elevation_output_path)
)

### Load and Explore Elevation Data

Load the elevation shapefile to see what elevation points are available for data extraction.

In [None]:
# Load elevation data
elevation_extractor.load_elevation_data()

# Display basic information about elevation points
if elevation_extractor.elevation_gdf is not None:
    print(f"\nTotal elevation points: {len(elevation_extractor.elevation_gdf)}")
    print(f"\nFirst 5 elevation points:")
    print(elevation_extractor.elevation_gdf.head())
    
    # Show elevation statistics if available
    elev_cols = [col for col in elevation_extractor.elevation_gdf.columns 
                 if 'elev' in col.lower() or col in ['min', 'max', 'mean', 'median']]
    if elev_cols:
        print(f"\nElevation statistics:")
        for col in elev_cols:
            if pd.api.types.is_numeric_dtype(elevation_extractor.elevation_gdf[col]):
                print(f"  {col}:")
                print(f"    Min: {elevation_extractor.elevation_gdf[col].min():.1f}")
                print(f"    Max: {elevation_extractor.elevation_gdf[col].max():.1f}")
                print(f"    Mean: {elevation_extractor.elevation_gdf[col].mean():.1f}")

### Check Available Combined CaSR Files

Let's see what combined CaSR files are available for elevation data extraction.

In [None]:
# Get available combined CaSR files
temporal_files, full_files = elevation_extractor.get_combined_casr_files()

print("Available files for elevation extraction:")
print(f"\nTemporal combined files ({len(temporal_files)}):")
for i, file in enumerate(temporal_files[:3], 1):
    print(f"  {i}. {file.name}")
if len(temporal_files) > 3:
    print(f"  ... and {len(temporal_files) - 3} more files")

print(f"\nFull combined files ({len(full_files)}):")
for i, file in enumerate(full_files[:3], 1):
    print(f"  {i}. {file.name}")
if len(full_files) > 3:
    print(f"  ... and {len(full_files) - 3} more files")

### Extract Elevation Data with Optimization

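Extract data at elevation points from the combined CaSR files. The `time_sampling` and `max_records` parameters limit how much data is pulled per point. The cell below uses `time_sampling = 'all'`, which requests every time step; switch to `'monthly'`, `'yearly'`, or `'sample'` if memory becomes a problem with large datasets.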

In [None]:
# Configure extraction parameters
time_sampling = 'all'  # Options: 'all', 'monthly', 'yearly', 'sample'
max_records = 10000  # Maximum records per point to avoid memory issues
file_types = ['temporal', 'full']  # Which file types to process

print(f"Extraction configuration:")
print(f"  Time sampling: {time_sampling}")
print(f"  Max records per point: {max_records}")
print(f"  File types to process: {file_types}")
print(f"\nStarting elevation data extraction...")

# Process all files and extract elevation data
extraction_results = elevation_extractor.process_all_files(
    file_types=file_types,
    time_sampling=time_sampling,
    max_records=max_records
)

print(f"\nExtraction completed!")
print(f"Processed {len(extraction_results)} file groups")

### Save Extracted Data

Save the extracted elevation data to files for further analysis.

In [None]:
# Save results in multiple formats
output_format = 'both'  # Options: 'csv', 'parquet', 'both'

print(f"Saving extracted data in {output_format} format...")
elevation_extractor.save_results(extraction_results, format=output_format)

# Generate summary report
elevation_extractor.generate_summary_report(extraction_results)

## Filter and Merge Elevation Data

Next, we filter and merge the elevation data into a new dataset that keeps only records where both precipitation and SWE are non-null. The cells below bring the functionality of `filter_merge_elevation_data.py` into the notebook so the clean, merged dataset can be built and inspected step by step.

In [None]:
# Import additional libraries for filtering and merging
import geopandas as gpd
from shapely.geometry import Point
from datetime import datetime
import json

print("Additional libraries imported for filtering and merging")

### Initialize Elevation Data Filter

Create a filter class to handle the merging and filtering of elevation data with non-null precipitation and SWE values.

In [None]:
class ElevationDataFilter:
    """Filter and merge elevation data based on non-null precipitation and SWE values."""
    
    def __init__(self, elevation_dir, casr_dir, output_dir=None):
        """
        Initialize the filter.
        
        Parameters:
        -----------
        elevation_dir : str
            Path to directory containing elevation shapefiles
        casr_dir : str
            Path to directory containing CaSR NetCDF files
        output_dir : str, optional
            Output directory for filtered data
        """
        self.elevation_dir = Path(elevation_dir)
        self.casr_dir = Path(casr_dir)
        self.output_dir = Path(output_dir) if output_dir else Path("data/output_data/filtered_elevation")
        
        # Create output directory if it doesn't exist
        self.output_dir.mkdir(parents=True, exist_ok=True)
        
        # Store data
        self.elevation_gdf = None
        self.elev_cols = []  # filled in by load_elevation_data()
        self.precipitation_data = None
        self.swe_data = None
        
    def load_elevation_data(self, sample_size=None):
        """Load elevation shapefile data."""
        print("Loading elevation data...")
        
        # Find shapefile in elevation directory
        shp_files = list(self.elevation_dir.glob("*.shp"))
        if not shp_files:
            raise FileNotFoundError(f"No shapefile found in {self.elevation_dir}")
        
        # Load the first shapefile found
        shp_file = shp_files[0]
        print(f"Loading shapefile: {shp_file}")
        
        try:
            self.elevation_gdf = gpd.read_file(shp_file)
            
            # Sample data if requested
            if sample_size and len(self.elevation_gdf) > sample_size:
                print(f"Sampling {sample_size} points from {len(self.elevation_gdf)} total points")
                self.elevation_gdf = self.elevation_gdf.sample(n=sample_size, random_state=42)
            
            print(f"Loaded {len(self.elevation_gdf)} elevation points")
            print(f"Elevation data columns: {list(self.elevation_gdf.columns)}")
            
            # Identify elevation columns
            self.elev_cols = [col for col in self.elevation_gdf.columns 
                             if 'elev' in col.lower() or col in ['min', 'max', 'mean', 'median']]
            
            if self.elev_cols:
                print(f"Elevation columns found: {self.elev_cols}")
                for col in self.elev_cols:
                    if pd.api.types.is_numeric_dtype(self.elevation_gdf[col]):
                        print(f"{col} range: {self.elevation_gdf[col].min():.1f} - {self.elevation_gdf[col].max():.1f}")
            
        except Exception as e:
            print(f"Error loading shapefile: {e}")
            raise
    
    def find_casr_files(self):
        """Find precipitation and SWE files in the CaSR directory."""
        nc_files = list(self.casr_dir.glob("*.nc"))
        
        precip_files = []
        swe_files = []
        
        for f in nc_files:
            if "A_PR24_SFC" in f.name:
                precip_files.append(f)
            elif "P_SWE_LAND" in f.name:
                swe_files.append(f)
        
        print(f"Found {len(precip_files)} precipitation files")
        print(f"Found {len(swe_files)} SWE files")
        
        return precip_files, swe_files

print("ElevationDataFilter class defined")

In [None]:
# Continue with data extraction and filtering methods
def extract_data_at_points(self, nc_file, points_gdf, variable_name, sample_time=None):
    """Extract data from NetCDF file at elevation points."""
    print(f"Extracting {variable_name} data from {nc_file.name}...")
    
    try:
        # Open NetCDF file
        ds = xr.open_dataset(nc_file)
        
        # Get data variable name
        data_vars = [v for v in ds.data_vars if v != 'rotated_pole']
        if not data_vars:
            print(f"No data variables found in {nc_file.name}")
            ds.close()  # release the file handle before returning early
            return None
        
        var_name = data_vars[0]  # Assume first variable is the main data
        print(f"Extracting variable: {var_name}")
        
        # Sample time if requested (ds.sizes maps dimension names to lengths)
        if sample_time and 'time' in ds.sizes and ds.sizes['time'] > sample_time:
            print(f"Sampling {sample_time} time steps from {ds.sizes['time']} total")
            time_indices = np.linspace(0, ds.sizes['time'] - 1, sample_time, dtype=int)
            ds = ds.isel(time=time_indices)
        
        # Get coordinate information
        if 'lon' in ds.coords and 'lat' in ds.coords:
            lon_coord, lat_coord = 'lon', 'lat'
        elif 'rlon' in ds.coords and 'rlat' in ds.coords:
            lon_coord, lat_coord = 'rlon', 'rlat'
        else:
            print("Could not identify coordinate variables")
            ds.close()  # release the file handle before returning early
            return None
        
        # Convert points to same CRS as NetCDF if needed
        points_proj = points_gdf.copy()
        if points_proj.crs and points_proj.crs.to_string() != 'EPSG:4326':
            points_proj = points_proj.to_crs('EPSG:4326')
        
        # Extract data at each point
        extracted_data = []
        
        for idx, row in points_proj.iterrows():
            # Get point coordinates
            geom = row.geometry
            if hasattr(geom, 'x') and hasattr(geom, 'y'):
                lon, lat = geom.x, geom.y
            else:
                centroid = geom.centroid
                lon, lat = centroid.x, centroid.y
            
            try:
                # Find nearest grid point
                if 'lon' in ds.coords and 'lat' in ds.coords and ds.lon.ndim == 2:
                    # Handle 2D coordinate arrays
                    lon_2d = ds.lon.values
                    lat_2d = ds.lat.values
                    
                    # Normalize longitudes to [-180, 180] so the point and the grid
                    # use the same convention
                    target_lon = lon - 360 if lon > 180 else lon
                    lon_2d_adj = np.where(lon_2d > 180, lon_2d - 360, lon_2d)
                    
                    # Find nearest point
                    dist = np.sqrt((lon_2d_adj - target_lon)**2 + (lat_2d - lat)**2)
                    min_idx = np.unravel_index(np.argmin(dist), dist.shape)
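                    # Index along whichever dimensions the 2D lon array uses
                    # (rlat, rlon for CaSR) instead of hard-coding the names
                    y_dim, x_dim = ds.lon.dims
                    point_data = ds.isel({y_dim: min_idx[0], x_dim: min_idx[1]})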
                else:
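                    # Simple nearest-neighbor selection on 1D coordinates.
                    # Caution: if the file only has rotated-pole coordinates (rlon/rlat),
                    # the geographic lon/lat used here are not in the grid's coordinate system.
                    point_data = ds.sel({lon_coord: lon, lat_coord: lat}, method='nearest')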
                
                # Extract time series data
                if 'time' in point_data[var_name].dims:
                    times = point_data.time.values
                    values = point_data[var_name].values
                    
                    for t, v in zip(times, values):
                        data_dict = {
                            'point_id': idx,
                            'lon': lon,
                            'lat': lat,
                            'time': pd.to_datetime(t),
                            variable_name: float(v) if not np.isnan(v) else np.nan
                        }
                        
                        # Add elevation data
                        for col in self.elev_cols:
                            if col in row and pd.api.types.is_numeric_dtype(type(row[col])):
                                data_dict[f'elevation_{col}'] = row[col]
                        
                        extracted_data.append(data_dict)
                else:
                    # Single value
                    data_dict = {
                        'point_id': idx,
                        'lon': lon,
                        'lat': lat,
                        variable_name: float(point_data[var_name].values)
                    }
                    
                    # Add elevation data
                    for col in self.elev_cols:
                        if col in row and pd.api.types.is_numeric_dtype(type(row[col])):
                            data_dict[f'elevation_{col}'] = row[col]
                    
                    extracted_data.append(data_dict)
                    
            except Exception as e:
                print(f"Could not extract data for point {idx}: {e}")
                continue
        
        ds.close()
        
        if extracted_data:
            df = pd.DataFrame(extracted_data)
            print(f"Extracted {len(df)} records for {variable_name}")
            return df
        else:
            print(f"No data extracted from {nc_file.name}")
            return None
            
    except Exception as e:
        print(f"Error processing {nc_file.name}: {e}")
        return None

# Add the method to the class
ElevationDataFilter.extract_data_at_points = extract_data_at_points
print("Added extract_data_at_points method to ElevationDataFilter")
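
The nearest-grid-point search on 2D coordinate arrays can be seen in isolation with a small NumPy example. This sketch mirrors the approach used in `extract_data_at_points` on a made-up 3x3 grid. The helpers `to_180` and `nearest_index` are hypothetical, and the distance is a plain Euclidean distance in degrees, which only approximates true distance on the sphere.

```python
import numpy as np

def to_180(lon):
    """Map longitudes from [0, 360) to [-180, 180)."""
    lon = np.asarray(lon, dtype=float)
    return np.where(lon >= 180, lon - 360, lon)

def nearest_index(lon_2d, lat_2d, lon, lat):
    """Return (row, col) of the grid cell closest to (lon, lat) in degree space."""
    dist = np.sqrt((to_180(lon_2d) - to_180(lon)) ** 2 + (lat_2d - lat) ** 2)
    return np.unravel_index(np.argmin(dist), dist.shape)

# Made-up grid with longitudes stored in the 0-360 convention
lon_2d, lat_2d = np.meshgrid([240.0, 241.0, 242.0], [50.0, 51.0, 52.0])

row, col = nearest_index(lon_2d, lat_2d, lon=-118.9, lat=51.2)
print(row, col)
```

Normalizing both the grid and the target point with the same function avoids mismatches when one side uses 0-360 longitudes and the other uses negative values for the western hemisphere.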

In [None]:
# Add filtering and merging methods
def filter_and_merge_data(self, precip_df, swe_df):
    """Filter and merge precipitation and SWE data for non-null values."""
    print("Filtering and merging data...")
    
    # Merge on common keys
    merge_keys = ['point_id', 'lon', 'lat']
    if 'time' in precip_df.columns and 'time' in swe_df.columns:
        merge_keys.append('time')
    
    # Merge dataframes
    merged_df = pd.merge(
        precip_df,
        swe_df,
        on=merge_keys,
        suffixes=('_precip', '_swe'),
        how='inner'
    )
    
    # Get elevation columns (handle duplicates from merge)
    elev_cols_merged = [col for col in merged_df.columns if col.startswith('elevation_')]
    
    # Remove duplicate elevation columns
    for col in elev_cols_merged:
        if col.endswith('_swe') and col.replace('_swe', '_precip') in merged_df.columns:
            # Keep only one version
            merged_df[col.replace('_swe', '')] = merged_df[col]
            merged_df = merged_df.drop([col, col.replace('_swe', '_precip')], axis=1)
    
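    # Identify the value columns, skipping elevation_* columns so they cannot match by accident
    value_cols = [col for col in merged_df.columns if not col.startswith('elevation_')]
    precip_col = [col for col in value_cols if 'precipitation' in col.lower() or 'PR24' in col][0]
    swe_col = [col for col in value_cols if 'swe' in col.lower()][0]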
    
    print(f"Total merged records: {len(merged_df)}")
    print(f"Records with null precipitation: {merged_df[precip_col].isna().sum()}")
    print(f"Records with null SWE: {merged_df[swe_col].isna().sum()}")
    
    # Filter for non-null values in both variables
    filtered_df = merged_df[
        merged_df[precip_col].notna() & 
        merged_df[swe_col].notna()
    ].copy()
    
    print(f"Records with non-null values in both variables: {len(filtered_df)}")
    
    return filtered_df, precip_col, swe_col

def analyze_elevation_patterns(self, filtered_df, precip_col, swe_col):
    """Analyze patterns in the filtered data by elevation."""
    print("Analyzing elevation patterns...")
    
    # Find elevation columns
    elev_cols = [col for col in filtered_df.columns if col.startswith('elevation_')]
    
    if not elev_cols:
        print("No elevation columns found for analysis")
        return None
    
    # Use the first elevation column for analysis
    elev_col = elev_cols[0]
    
    # Create elevation bins
    filtered_df['elevation_bin'] = pd.cut(filtered_df[elev_col], bins=10)
    
    # Calculate statistics by elevation bin (observed=False keeps empty bins in the output)
    stats_by_elevation = filtered_df.groupby('elevation_bin', observed=False).agg({
        precip_col: ['mean', 'std', 'count'],
        swe_col: ['mean', 'std', 'count'],
        'point_id': 'nunique'
    }).round(2)
    
    # Rename columns for clarity
    stats_by_elevation.columns = [
        'precip_mean', 'precip_std', 'precip_count',
        'swe_mean', 'swe_std', 'swe_count',
        'unique_points'
    ]
    
    # Calculate correlation between variables
    if len(filtered_df) > 1:
        correlation = filtered_df[[precip_col, swe_col]].corr().iloc[0, 1]
        print(f"Correlation between precipitation and SWE: {correlation:.3f}")
    
    return stats_by_elevation

# Add methods to the class
ElevationDataFilter.filter_and_merge_data = filter_and_merge_data
ElevationDataFilter.analyze_elevation_patterns = analyze_elevation_patterns
print("Added filtering and analysis methods to ElevationDataFilter")

In [None]:
# Add save and summary methods
def save_results(self, filtered_df, stats_df, format='csv'):
    """Save filtered data and statistics."""
    print(f"Saving results to {self.output_dir}")
    
    # Save filtered data
    if format in ['csv', 'both']:
        csv_file = self.output_dir / "filtered_elevation_data.csv"
        filtered_df.to_csv(csv_file, index=False)
        print(f"Saved filtered data to: {csv_file}")
        
        if stats_df is not None:
            stats_csv = self.output_dir / "elevation_statistics.csv"
            stats_df.to_csv(stats_csv)
            print(f"Saved statistics to: {stats_csv}")
    
    if format in ['parquet', 'both']:
        parquet_file = self.output_dir / "filtered_elevation_data.parquet"
        filtered_df.to_parquet(parquet_file, index=False)
        print(f"Saved filtered data to: {parquet_file}")
        
        if stats_df is not None:
            stats_parquet = self.output_dir / "elevation_statistics.parquet"
            stats_df.to_parquet(stats_parquet)
            print(f"Saved statistics to: {stats_parquet}")
    
    # Generate summary report
    self.generate_summary_report(filtered_df, stats_df)

def generate_summary_report(self, filtered_df, stats_df):
    """Generate a summary report of the filtering results."""
    summary = {
        'processing_date': datetime.now().isoformat(),
        'elevation_points_loaded': len(self.elevation_gdf) if self.elevation_gdf is not None else 0,
        'filtered_records': len(filtered_df),
        'unique_points_with_data': filtered_df['point_id'].nunique() if 'point_id' in filtered_df.columns else 0,
        'time_range': None
    }
    
    if 'time' in filtered_df.columns:
        summary['time_range'] = {
            'start': filtered_df['time'].min().isoformat() if pd.notna(filtered_df['time'].min()) else None,
            'end': filtered_df['time'].max().isoformat() if pd.notna(filtered_df['time'].max()) else None
        }
    
    # Save summary
    summary_file = self.output_dir / "filtering_summary.json"
    with open(summary_file, 'w') as f:
        json.dump(summary, f, indent=2)
    
    print(f"Summary saved to: {summary_file}")
    
    # Print summary
    print("\n" + "="*60)
    print("ELEVATION DATA FILTERING SUMMARY")
    print("="*60)
    print(f"Elevation points loaded: {summary['elevation_points_loaded']}")
    print(f"Filtered records (non-null precip & SWE): {summary['filtered_records']}")
    print(f"Unique points with valid data: {summary['unique_points_with_data']}")
    if summary['time_range']:
        print(f"Time range: {summary['time_range']['start']} to {summary['time_range']['end']}")
    
    if stats_df is not None:
        print("\nElevation Statistics:")
        print(stats_df)
    print("="*60)

# Add methods to the class
ElevationDataFilter.save_results = save_results
ElevationDataFilter.generate_summary_report = generate_summary_report
print("Added save and summary methods to ElevationDataFilter")

### Initialize and Configure the Filter

Set up the elevation data filter with the appropriate directories.

In [None]:
# Define filtered output directory
filtered_output_dir = r"data/output_data/filtered_elevation"
filtered_output_path = project_root / filtered_output_dir

print(f"Filtered output directory: {filtered_output_path}")

# Initialize the elevation data filter
elevation_filter = ElevationDataFilter(
    elevation_dir=str(elevation_path),
    casr_dir=str(output_path),  # Use combined CaSR files
    output_dir=str(filtered_output_path)
)

print("ElevationDataFilter initialized successfully")

### Load Elevation Data for Filtering

Load the elevation shapefile data with optional sampling for testing.

In [None]:
# Load elevation data with sampling for testing
sample_points = 100  # Adjust this number based on your needs

elevation_filter.load_elevation_data(sample_size=sample_points)

print(f"\nElevation data loaded with {len(elevation_filter.elevation_gdf)} points")

### Find and Process CaSR Files

Identify precipitation and SWE files from the combined CaSR data.

In [None]:
# Find precipitation and SWE files
precip_files, swe_files = elevation_filter.find_casr_files()

if not precip_files or not swe_files:
    print("Missing precipitation or SWE files")
    print("Available files in CaSR directory:")
    for f in elevation_filter.casr_dir.glob("*.nc"):
        print(f"  {f.name}")
else:
    print(f"\nFound files for processing:")
    print(f"Precipitation files: {[f.name for f in precip_files[:3]]}")
    print(f"SWE files: {[f.name for f in swe_files[:3]]}")

### Extract Data at Elevation Points

Extract precipitation and SWE data at the elevation points from the combined NetCDF files.
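
For quick tests, the next cell reads only a subset of time steps. The sketch below shows how evenly spaced indices can be picked with `np.linspace`, the same idea `extract_data_at_points` uses for its `sample_time` argument. The helper name `evenly_spaced_indices` is introduced here for illustration only.

```python
import numpy as np

def evenly_spaced_indices(n_total, n_sample):
    """Pick n_sample indices spread evenly over range(n_total)."""
    if n_sample >= n_total:
        return np.arange(n_total)
    return np.linspace(0, n_total - 1, n_sample, dtype=int)

idx = evenly_spaced_indices(n_total=1461, n_sample=10)
print(idx)
```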

In [None]:
# Configure extraction parameters
sample_time = 10  # Number of time steps to sample for testing

if precip_files and swe_files:
    print("Processing sample files...")
    
    # Extract precipitation data
    precip_df = elevation_filter.extract_data_at_points(
        precip_files[0], 
        elevation_filter.elevation_gdf, 
        'precipitation',
        sample_time=sample_time
    )
    
    # Extract SWE data
    swe_df = elevation_filter.extract_data_at_points(
        swe_files[0], 
        elevation_filter.elevation_gdf, 
        'swe',
        sample_time=sample_time
    )
    
    if precip_df is not None and swe_df is not None:
        print(f"\nExtracted data successfully:")
        print(f"Precipitation records: {len(precip_df)}")
        print(f"SWE records: {len(swe_df)}")
    else:
        print("Failed to extract data from files")
else:
    print("No precipitation or SWE files found for processing")
    precip_df = None
    swe_df = None

### Filter and Merge Data

Now filter and merge the precipitation and SWE data to keep only records with non-null values in both variables.
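
On a toy example, the merge-then-filter step looks like this. The sketch assumes the two tables share `point_id` and `time` keys and use value columns named `precipitation` and `swe`, as in the extraction step above; the numbers are invented.

```python
import numpy as np
import pandas as pd

times = pd.to_datetime(["2000-01-01", "2000-01-02", "2000-01-03"])
precip = pd.DataFrame({"point_id": 1, "time": times, "precipitation": [0.5, np.nan, 2.0]})
swe = pd.DataFrame({"point_id": 1, "time": times, "swe": [10.0, 12.0, np.nan]})

# Inner merge keeps only (point_id, time) pairs present in both tables
merged = precip.merge(swe, on=["point_id", "time"], how="inner")

# Keep rows where both variables have a value
clean = merged.dropna(subset=["precipitation", "swe"])
print(clean)
```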

In [None]:
if precip_df is not None and swe_df is not None:
    # Filter and merge data
    filtered_df, precip_col, swe_col = elevation_filter.filter_and_merge_data(precip_df, swe_df)
    
    print(f"\nFiltered and merged data:")
    print(f"Total records with non-null values: {len(filtered_df)}")
    print(f"Precipitation column: {precip_col}")
    print(f"SWE column: {swe_col}")
    
    # Show sample of filtered data
    if len(filtered_df) > 0:
        print(f"\nSample of filtered data:")
        print(filtered_df.head())
        
        # Show data types
        print(f"\nData types:")
        print(filtered_df.dtypes)
    else:
        print("No records found with non-null values in both variables")
        filtered_df = None
else:
    print("Cannot proceed with filtering - missing precipitation or SWE data")
    filtered_df = None

### Analyze Elevation Patterns

Analyze the filtered data to understand patterns by elevation.
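
A minimal sketch of the binning idea: `pd.cut` splits an elevation column into equal-width bins, and a group-by then aggregates per bin. Column names and values here are invented for illustration, and three bins are used instead of the ten in `analyze_elevation_patterns` to keep the output short.

```python
import pandas as pd

df = pd.DataFrame({
    "elevation": [100, 200, 900, 1000, 1900, 2000],
    "swe": [5.0, 7.0, 40.0, 44.0, 120.0, 130.0],
})

# Equal-width bins over the observed elevation range
df["elevation_bin"] = pd.cut(df["elevation"], bins=3)

# observed=False keeps bins that happen to be empty
stats = df.groupby("elevation_bin", observed=False)["swe"].agg(["mean", "count"])
print(stats)
```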

In [None]:
if filtered_df is not None and len(filtered_df) > 0:
    # Analyze elevation patterns
    stats_df = elevation_filter.analyze_elevation_patterns(filtered_df, precip_col, swe_col)
    
    if stats_df is not None:
        print(f"\nElevation pattern analysis completed:")
        print(stats_df)
        
        # Show correlation analysis
        if len(filtered_df) > 1:
            correlation_matrix = filtered_df[[precip_col, swe_col]].corr()
            print(f"\nCorrelation Matrix:")
            print(correlation_matrix)
    else:
        print("Could not perform elevation pattern analysis")
        stats_df = None
else:
    print("No filtered data available for elevation pattern analysis")
    stats_df = None

### Save Filtered and Merged Data

Save the filtered and merged dataset to files for further analysis.

In [None]:
if filtered_df is not None and len(filtered_df) > 0:
    # Save results in multiple formats
    save_format = 'both'  # Options: 'csv', 'parquet', 'both'
    
    print(f"Saving filtered and merged data in {save_format} format...")
    elevation_filter.save_results(filtered_df, stats_df, format=save_format)
    
    print(f"\nFiltering and merging process completed successfully!")
    print(f"Output files saved to: {elevation_filter.output_dir}")
    
    # List output files (skip subdirectories so the numbering stays contiguous)
    generated_files = sorted(p for p in elevation_filter.output_dir.glob('*') if p.is_file())
    print(f"\nGenerated files:")
    for i, file_path in enumerate(generated_files, 1):
        file_size = file_path.stat().st_size / 1024  # Size in KB
        print(f"  {i}. {file_path.name} ({file_size:.1f} KB)")
else:
    print("No data to save - filtering process did not produce valid results")

## Summary

This notebook has demonstrated the complete workflow for:

1. **CaSR File Combination**: Combined NetCDF files across temporal and spatial dimensions
2. **Elevation Data Extraction**: Extracted precipitation and SWE data at specific elevation points
3. **Data Filtering and Merging**: Created a clean dataset with only non-null precipitation and SWE values
4. **Elevation Pattern Analysis**: Summarized precipitation and SWE by elevation bin and computed the correlation between the two variables
5. **Data Export**: Saved the filtered and merged dataset for further analysis

The final output includes:
- **filtered_elevation_data.csv/parquet**: Main dataset with non-null precipitation and SWE values
- **elevation_statistics.csv/parquet**: Statistical analysis by elevation bins
- **filtering_summary.json**: Summary report of the filtering process

This workflow provides a foundation for analyzing snow drought patterns across different elevation zones using the CaSR dataset.