# Test Refactored tsfresh Ground Truth Preparation

This notebook tests the refactored `forestry.prepare_tsfresh_with_ground_truth()` method, which:
1. Loads ground truth training data (polygons)
2. Clips satellite data to sample bounding boxes
3. Converts training polygons to raster masks
4. Merges masks into 4D dataset (plot_id, time, y, x)
5. Merges satellite data with ground truth labels
6. Saves to zarr for efficient access (optional, controlled by `save_to_zarr`)

**Workflow:**
- Uses `ds_resampled` from `get_ds_resampled_gee()` (or loads from zarr)
- Loads ground truth from parquet file (GCS or local)
- Prepares datasets ready for tsfresh feature extraction


In [1]:
import ee, eemont
import pandas as pd
import numpy as np
from forestry_carbon_arr.core import ForestryCarbonARR

# Initialize Forestry Carbon ARR system
forestry = ForestryCarbonARR(config_path='./00_input/korindo.json')
print("✅ Forestry Carbon ARR initialized")




✅ Forestry Carbon ARR initialized


## Step 1: Get ds_resampled (satellite time series data)

Either load it from an existing zarr store or create it from a GEE asset.
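
The path handling in the zarr-loading cell below can also be factored into a small helper. This is a minimal sketch that assumes the same `GCS_ZARR_DIR` convention as the cell; `resolve_zarr_path` and its `local_root` parameter are hypothetical names for illustration and are not part of `forestry_carbon_arr`. Unlike the inline version, it strips stray slashes so the result never contains `//` before the store name.

```python
import os


def resolve_zarr_path(base_dir, name="ds_resampled.zarr", local_root=None):
    """Return (path, storage) for a zarr store.

    Hypothetical helper for illustration; not part of forestry_carbon_arr.
    """
    if base_dir:
        # Accept "bucket/prefix" and "gs://bucket/prefix", with or without a trailing slash
        prefix = base_dir[len("gs://"):] if base_dir.startswith("gs://") else base_dir
        return f"gs://{prefix.strip('/')}/{name}", "gcs"
    # No bucket configured: fall back to a local ./data directory
    root = local_root if local_root is not None else os.getcwd()
    return os.path.join(root, "data", name), "local"


zarr_path, storage = resolve_zarr_path(os.getenv("GCS_ZARR_DIR", ""))
```

The returned pair can be passed straight to `load_dataset_zarr(zarr_path, storage=storage)`, as in the cell below.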


In [3]:
from dotenv import load_dotenv
load_dotenv()


True

In [4]:
# Option 1: Load from existing zarr (fastest)
from forestry_carbon_arr.utils.zarr_utils import load_dataset_zarr
import os

zarr_path = os.getenv('GCS_ZARR_DIR', '')
if zarr_path:
    if not zarr_path.startswith('gs://'):
        zarr_path = f"gs://{zarr_path}/ds_resampled.zarr"
    else:
        zarr_path = f"{zarr_path}/ds_resampled.zarr"
    storage = 'gcs'
else:
    zarr_path = os.path.join(os.getcwd(), 'data', 'ds_resampled.zarr')
    storage = 'local'

try:
    ds_resampled = load_dataset_zarr(zarr_path, storage=storage)
    print(f"✅ Loaded ds_resampled from zarr: {zarr_path}")
    print(f"   Dimensions: {dict(ds_resampled.sizes)}")
    print(f"   Variables: {list(ds_resampled.data_vars)}")
except Exception as e:
    print(f"⚠️  Could not load from zarr: {e}")
    print("   Will create from GEE asset instead...")
    ds_resampled = None


📂 Loading dataset from GCS zarr: gs://remote_sensing_saas/01-korindo/timeseries_zarr/ds_resampled.zarr
✅ Dataset loaded: {'time': 81, 'x': 4489, 'y': 3213}
✅ Loaded ds_resampled from zarr: gs://remote_sensing_saas/01-korindo/timeseries_zarr/ds_resampled.zarr
   Dimensions: {'time': 81, 'x': 4489, 'y': 3213}
   Variables: ['EVI', 'NDVI']




In [5]:
# Option 2: Create from GEE asset (if zarr not available)
if ds_resampled is None:
    print("Creating ds_resampled from GEE asset...")
    ds_resampled = forestry.get_ds_resampled_gee(
        use_existing_asset=True,
        asset_folder='projects/remote-sensing-476412/assets/korindo_sentinel2_monthly',
        asset_is_monthly_composites=True,
        save_to_zarr=True,  # Save for future use
        zarr_path=None,
        overwrite_zarr=False
    )
    print(f"✅ Created ds_resampled: {dict(ds_resampled.sizes)}")


## Step 2: Prepare tsfresh data with ground truth

This will:
- Load ground truth training polygons
- Clip satellite data to sample bounding boxes
- Convert polygons to raster masks
- Merge satellite data with ground truth labels
- Save to zarr (one dataset per sample)
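
The polygon-to-mask step can be pictured with a small NumPy example. The notebook does not show how `forestry_carbon_arr` rasterizes polygons, so this is only an illustrative sketch: `rasterize_polygon` is a hypothetical helper that applies the even-odd rule to pixel centres, whereas production code would more likely call a library routine such as `rasterio.features.rasterize`. The output follows the label convention used here (class value inside the polygon, NaN where there is no label).

```python
import numpy as np


def rasterize_polygon(vertices, xs, ys):
    """Boolean mask of pixel centres inside a simple polygon (even-odd rule).

    Hypothetical helper for illustration; returns an array shaped (len(ys), len(xs)).
    """
    gx, gy = np.meshgrid(xs, ys)
    inside = np.zeros(gx.shape, dtype=bool)
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        # Does this edge cross the horizontal ray through each pixel centre?
        crosses = (y1 > gy) != (y2 > gy)
        # Horizontal edges divide by zero here; they never cross, so the result is ignored
        with np.errstate(divide="ignore", invalid="ignore"):
            x_at_ray = x1 + (gy - y1) * (x2 - x1) / (y2 - y1)
        # Toggle pixels whose ray crosses the edge to their right
        inside ^= crosses & (gx < x_at_ray)
    return inside


# 6 x 6 grid of pixel centres: x ascending, y descending (north-up)
xs = np.arange(0.5, 6.0)
ys = np.arange(5.5, 0.0, -1.0)

# One "tree" polygon: a square from (1, 1) to (4, 4)
square = [(1, 1), (4, 1), (4, 4), (1, 4)]
inside = rasterize_polygon(square, xs, ys)

# 1 = tree inside the polygon, NaN = no label elsewhere
ground_truth = np.where(inside, 1.0, np.nan)
```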


In [9]:
# Reload module to get latest fixes
import importlib
import forestry_carbon_arr.utils.tsfresh_utils
importlib.reload(forestry_carbon_arr.utils.tsfresh_utils)

# Check coordinate order before processing
print("Checking ds_resampled coordinate order:")
print(f"  X: {ds_resampled.x.values[0]:.2f} to {ds_resampled.x.values[-1]:.2f} ({'ascending' if ds_resampled.x.values[0] < ds_resampled.x.values[-1] else 'descending'})")
print(f"  Y: {ds_resampled.y.values[0]:.2f} to {ds_resampled.y.values[-1]:.2f} ({'ascending' if ds_resampled.y.values[0] < ds_resampled.y.values[-1] else 'descending'})")
print("\nNote: Dataset will be standardized to STAC convention (y descending, x ascending) automatically")

Checking ds_resampled coordinate order:
  X: 578619.54 to 623499.54 (ascending)
  Y: 9949396.12 to 9981516.12 (ascending)

Note: Dataset will be standardized to STAC convention (y descending, x ascending) automatically
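
The check above reports `y` as ascending, so the standardization has to flip the y axis. The following is a sketch of what that flip amounts to on plain arrays, assuming a `(..., y, x)` layout; `standardize_orientation` is a hypothetical name, and the library's own implementation is not shown in this notebook. On an xarray dataset, the label-aware equivalent would be `ds.sortby("y", ascending=False)`.

```python
import numpy as np


def standardize_orientation(data, x, y):
    """Return (data, x, y) with x ascending and y descending.

    Hypothetical helper for illustration; `data` is assumed to be shaped (..., y, x).
    """
    if x[0] > x[-1]:
        # Flip the last axis (x) together with its coordinate
        data, x = data[..., ::-1], x[::-1]
    if y[0] < y[-1]:
        # Flip the second-to-last axis (y) together with its coordinate
        data, y = data[..., ::-1, :], y[::-1]
    return data, x, y


# Toy grid with 10 m spacing and ascending y, like the dataset above
x = np.array([578619.54, 578629.54, 578639.54])
y = np.array([9949396.12, 9949406.12])
data = np.arange(6).reshape(2, 3)

data_std, x_std, y_std = standardize_orientation(data, x, y)
```

Flipping the values and the coordinate together keeps every pixel attached to the same map position; only the storage order changes.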


In [10]:
# Prepare tsfresh data with ground truth
# This single call runs the whole pipeline: load labels, clip, rasterize, merge

ground_truth_path = 'gs://remote_sensing_saas/01-korindo/sample_tsfresh/20251112_df_long.parquet'

ds_gt_list = forestry.prepare_tsfresh_with_ground_truth(
    ds_resampled=ds_resampled,  # From Step 1
    ground_truth_path=ground_truth_path,  # GCS or local path to parquet
    buffer_pixels=50,  # Buffer around sample bboxes
    save_to_zarr=False,  # Set to True to save each sample to zarr; disabled for this test run
    zarr_path=None,  # Auto-detects from GCS_ZARR_DIR env var
    overwrite_zarr=False,  # Don't overwrite if exists
    storage='auto'
)

print(f"\n✅ Prepared {len(ds_gt_list)} sample datasets")
for i, ds_gt in enumerate(ds_gt_list):
    plot_id = ds_gt.coords['plot_id'].values[0] if 'plot_id' in ds_gt.coords else f'sample_{i+1}'
    print(f"   {plot_id}: {dict(ds_gt.sizes)}")
    print(f"      Variables: {list(ds_gt.data_vars)}")





✅ Prepared 3 sample datasets
   sample_3: {'time': 93, 'x': 414, 'y': 341, 'plot_id': 1}
      Variables: ['EVI', 'NDVI', 'ground_truth', 'gt_valid']
   sample_2: {'time': 93, 'x': 413, 'y': 301, 'plot_id': 1}
      Variables: ['EVI', 'NDVI', 'ground_truth', 'gt_valid']
   sample_1: {'time': 90, 'x': 322, 'y': 231, 'plot_id': 1}
      Variables: ['EVI', 'NDVI', 'ground_truth', 'gt_valid']
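
The `buffer_pixels=50` argument pads each sample's bounding box before clipping. One way such a buffered clip could be computed is in index space, as sketched below. This assumes a regular north-up grid (x ascending, y descending); `bbox_to_slices` is a hypothetical helper, not the library's implementation.

```python
import numpy as np


def bbox_to_slices(x, y, bbox, buffer_pixels=0):
    """Index slices covering bbox=(minx, miny, maxx, maxy) plus a pixel buffer.

    Hypothetical helper for illustration; assumes x ascending and y descending.
    """
    minx, miny, maxx, maxy = bbox
    cols = np.flatnonzero((x >= minx) & (x <= maxx))
    rows = np.flatnonzero((y >= miny) & (y <= maxy))
    if cols.size == 0 or rows.size == 0:
        raise ValueError("bbox does not overlap the grid")
    # Pad in index space and clamp to the array bounds
    x_slice = slice(max(int(cols[0]) - buffer_pixels, 0),
                    min(int(cols[-1]) + buffer_pixels + 1, x.size))
    y_slice = slice(max(int(rows[0]) - buffer_pixels, 0),
                    min(int(rows[-1]) + buffer_pixels + 1, y.size))
    return y_slice, x_slice


# 10 x 10 toy grid with 10 m pixels
x = np.arange(0.0, 100.0, 10.0)       # 0, 10, ..., 90 (ascending)
y = np.arange(90.0, -10.0, -10.0)     # 90, 80, ..., 0 (descending)
data = np.zeros((y.size, x.size))

y_slice, x_slice = bbox_to_slices(x, y, bbox=(30, 40, 50, 60), buffer_pixels=2)
clipped = data[y_slice, x_slice]
```

Working with indices rather than coordinate labels makes the clip independent of whether a coordinate is stored ascending or descending, provided the mask comparison is used as above.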




## Step 3: Inspect results

Each dataset in `ds_gt_list` contains:
- **Dimensions:** (plot_id, time, x, y)
- **Variables:**
  - `EVI`, `NDVI`: Satellite time series
  - `ground_truth`: Training labels (0=non-tree, 1=tree, NaN=no label)
  - `gt_valid`: Validity mask marking pixels that have a label at every time step
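
To make the relationship between `ground_truth` and `gt_valid` concrete, here is a toy example. It follows one reading of "labels for all times" (a pixel is valid only if none of its time steps is NaN); the library's exact definition is not shown in this notebook, so treat it as an assumption.

```python
import numpy as np

# Toy label cube shaped (time, x, y): 3 time steps over a 2 x 2 window
ground_truth = np.array([
    [[1.0, 0.0], [np.nan, 1.0]],
    [[1.0, 0.0], [np.nan, np.nan]],
    [[1.0, 0.0], [np.nan, 1.0]],
])

# A pixel is valid only if it carries a label at every time step
gt_valid = ~np.isnan(ground_truth).any(axis=0)

# Class balance computed over labelled values only, ignoring NaN
labelled = ground_truth[~np.isnan(ground_truth)]
tree_fraction = float((labelled == 1).mean())
```

Computing class balance over labelled values only is often more informative than percentages of all values, since unlabelled NaN entries can dominate the total.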


In [11]:
# Inspect first sample dataset
if len(ds_gt_list) > 0:
    ds_gt = ds_gt_list[0]
    plot_id = ds_gt.coords['plot_id'].values[0] if 'plot_id' in ds_gt.coords else 'sample_1'
    
    print(f"Sample: {plot_id}")
    print(f"Dimensions: {dict(ds_gt.sizes)}")
    print(f"Variables: {list(ds_gt.data_vars)}")
    print(f"\nTime range: {pd.to_datetime(ds_gt.time.min().values)} to {pd.to_datetime(ds_gt.time.max().values)}")
    
    # Check ground truth statistics
    if 'ground_truth' in ds_gt.data_vars:
        gt_values = ds_gt['ground_truth'].values.flatten()
        n_total = len(gt_values)
        n_nan = np.isnan(gt_values).sum()
        n_zeros = (gt_values == 0).sum()
        n_ones = (gt_values == 1).sum()
        
        print(f"\nGround Truth Statistics:")
        print(f"  Total pixels: {n_total:,}")
        print(f"  NaN (no label): {n_nan:,} ({100*n_nan/n_total:.1f}%)")
        print(f"  0 (non-tree): {n_zeros:,} ({100*n_zeros/n_total:.1f}%)")
        print(f"  1 (tree): {n_ones:,} ({100*n_ones/n_total:.1f}%)")
    
    # Show dataset structure
    print(f"\nDataset structure:")
    print(ds_gt)


Sample: sample_3
Dimensions: {'time': 93, 'x': 414, 'y': 341, 'plot_id': 1}
Variables: ['EVI', 'NDVI', 'ground_truth', 'gt_valid']

Time range: 2016-03-15 00:00:00 to 2025-09-15 00:00:00

Ground Truth Statistics:
  Total pixels: 13,129,182
  NaN (no label): 10,912,539 (83.1%)
  0 (non-tree): 1,112,030 (8.5%)
  1 (tree): 1,104,613 (8.4%)

Dataset structure:
<xarray.Dataset> Size: 158MB
Dimensions:       (time: 93, x: 414, y: 341, plot_id: 1)
Coordinates:
  * time          (time) datetime64[ns] 744B 2016-03-15 ... 2025-09-15
  * x             (x) float64 3kB 5.828e+05 5.828e+05 ... 5.869e+05 5.869e+05
  * y             (y) float64 3kB 9.971e+06 9.971e+06 ... 9.967e+06 9.967e+06
  * plot_id       (plot_id) object 8B 'sample_3'
    image_id      (time) object 744B dask.array<chunksize=(20,), meta=np.ndarray>
    epsg          int64 8B 32749
Data variables:
    EVI           (plot_id, time, x, y) float32 53MB dask.array<chunksize=(1, 20, 128, 128), meta=np.ndarray>
    NDVI          (plot_i

## Summary

✅ **Workflow Complete!**

The refactored `forestry.prepare_tsfresh_with_ground_truth()` method:
1. ✅ Loads ground truth training data
2. ✅ Clips satellite data to sample bounding boxes
3. ✅ Converts polygons to raster masks (parallel processing)
4. ✅ Merges masks into 4D dataset (plot_id, time, y, x)
5. ✅ Merges satellite data with ground truth labels
6. ✅ Saves to zarr for efficient access (skipped in this run because `save_to_zarr=False`)

**Result:** List of datasets ready for tsfresh feature extraction!

Each dataset has:
- Satellite time series (EVI, NDVI)
- Ground truth labels (0=non-tree, 1=tree, NaN=no label)
- Validity mask (pixels with labels for all times)

**Next steps:**
- Extract time series features using tsfresh
- Train machine learning models
- Apply to full AOI
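
As a pointer for the first next step: tsfresh works on long-format tables with one row per series id and time step. The following sketch shows how a `(time, x, y)` cube could be flattened into that shape. `cube_to_long` and the column names `id` and `time` are hypothetical choices for illustration, and the toy values stand in for a real NDVI array.

```python
import numpy as np
import pandas as pd


def cube_to_long(values, times, valid, name="NDVI"):
    """Flatten a (time, x, y) cube into long format, one row per pixel and time step.

    Hypothetical helper for illustration; keeps only pixels where `valid` (x, y) is True.
    """
    xi, yi = np.nonzero(valid)
    n_time, n_pixels = values.shape[0], xi.size
    series = values[:, xi, yi]                 # shaped (time, n_pixels)
    pixel_id = xi * valid.shape[1] + yi        # unique integer id per pixel
    return pd.DataFrame({
        "id": np.tile(pixel_id, n_time),
        "time": np.repeat(np.asarray(times), n_pixels),
        name: series.ravel(),
    })


# Toy cube: 3 monthly time steps over a 2 x 2 window
values = np.arange(12, dtype=float).reshape(3, 2, 2)
times = pd.to_datetime(["2016-03-15", "2016-04-15", "2016-05-15"])
valid = np.array([[True, False], [True, True]])

df_long = cube_to_long(values, times, valid)

# With tsfresh installed, features could then be extracted per pixel:
# from tsfresh import extract_features
# features = extract_features(df_long, column_id="id", column_sort="time")
```

Filtering with the validity mask before flattening keeps the table small, since only labelled pixels are needed for training.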
