# InstaGeo Data Cleaner Demo

This notebook demonstrates how to use the `data_cleaner.py` module to clean and filter satellite image chips and segmentation maps. The data cleaner helps improve dataset quality by removing low-quality chips and cleaning segmentation maps.

## Overview
- Filters chips based on no-data pixel thresholds
- Supports different cleaning methods (buffer, limit)
- Handles observation point-based cleaning

## Prerequisites
1. **InstaGeo installed** - Make sure you have InstaGeo properly installed
2. **Dataset CSV file** - You can create this by running examples from the [Chip Creator Demo Notebook](chip_creator_demo.ipynb)
3. **Generated chips and segmentation maps** - These are created by the chip creator pipeline (see Chip Creator Demo)
4. **Observation points CSV** - For segmentation map adjustment (see Chip Creator Demo for observation points)

  
> **💡 Tip**: Run the examples in the [Chip Creator Demo Notebook](chip_creator_demo.ipynb) or [Raster Chip Creator Demo Notebook](raster_chip_creator_demo.ipynb) first to generate sample chips and segmentation maps that you can then clean with this notebook.


## Core Functions and Use Cases
The data cleaner provides three main functions that address different quality issues:

### 1. Quality Filtering
**Purpose**: Remove chips with too many no-data pixels that would hurt model training.

**Why it's important**:
- **Prevents model confusion**: Chips with excessive no-data pixels provide poor training signals
- **Improves training efficiency**: Removes samples that don't contribute meaningful information

**Example use case**:


In [None]:
# Create the output directory (-p avoids an error if it already exists)
!mkdir -p cleaned_output
# Remove chips with 2%+ no-data pixels
!python -m instageo.data.data_cleaner \
    --chips_dataset_csv="chip_output/hls_cloud_filtered/hls_raster_dataset.csv" \
    --output_chips_dataset_csv="cleaned_output/cleaned_dataset.csv" \
    --drop_chips=True \
    --drop_chips_strategy="any" \
    --no_data_threshold=0.02 \
    --no_data_value=0
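
To make the threshold concrete, here is a minimal sketch of the kind of check this filter performs. It is not InstaGeo's actual implementation. The helper names `no_data_fraction` and `should_drop` are hypothetical, and two details are assumptions: that `"any"` drops a chip when at least one band reaches the threshold (with `"all"` requiring every band to), and that the comparison is inclusive (`>=`).

```python
import numpy as np


def no_data_fraction(band, no_data_value=0):
    """Fraction of pixels in a 2D band equal to the no-data value."""
    band = np.asarray(band)
    return float(np.mean(band == no_data_value))


def should_drop(chip, threshold=0.02, no_data_value=0, strategy="any"):
    """Decide whether a (bands, height, width) chip should be dropped.

    Assumption: "any" means at least one band reaches the threshold,
    "all" means every band does. The real CLI semantics may differ.
    """
    exceeds = [no_data_fraction(b, no_data_value) >= threshold for b in chip]
    return any(exceeds) if strategy == "any" else all(exceeds)


# Toy chip: 2 bands of 10x10 pixels, band 0 has 5% no-data pixels
chip = np.ones((2, 10, 10))
chip[0, 0, :5] = 0
print(should_drop(chip, strategy="any"), should_drop(chip, strategy="all"))
```

Under these assumptions, `"any"` is the stricter setting because a single bad band is enough to discard the chip.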

### 2. Spatial Context Enhancement
**Purpose**: Expand segmentation maps around observation pixels to provide spatial context for training. Each observation pixel's value is assigned to the pixels in its neighborhood.

**Why it's important**:
- **Captures spatial patterns**: In our context, the recorded phenomena usually span a spatial neighborhood rather than a single point
- **Improves model robustness**: Models learn better with spatial context rather than isolated points
- **Handles GPS uncertainty**: Observation points may not be perfectly aligned with pixel centers
- **Enables better generalization**: Models trained on buffered data could generalize better to new locations

**Example use case**:

In [None]:
# Create a buffer around observation pixels (window_size=3 --> ~0.2km buffer radius)
# if `seg_map_output_dir` is not provided, the original segmentation map files will be overwritten
!python -m instageo.data.data_cleaner \
    --chips_dataset_csv="cleaned_output/cleaned_dataset.csv" \
    --output_chips_dataset_csv="cleaned_output/buffered_dataset.csv" \
    --seg_map_output_dir="cleaned_output/seg_maps_win_3" \
    --clean_seg_maps=True \
    --cleaning_method="buffer" \
    --window_size=3 \
    --ignore_index=-1 \
    --no_data_value=0
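
The buffering idea can be sketched in a few lines of NumPy. This is an illustration only, not the module's code: the function name `buffer_observations` is hypothetical, and it assumes that `window_size=w` means a square of side `2w + 1` centred on each observation pixel and that unlabeled pixels already hold `ignore_index`.

```python
import numpy as np


def buffer_observations(seg_map, window_size=3, ignore_index=-1):
    """Copy each labeled pixel's value into a square window around it.

    Assumption: the window spans `window_size` pixels on every side,
    clipped at the map edges. Where windows overlap, the observation
    visited last wins.
    """
    seg_map = np.asarray(seg_map)
    out = seg_map.copy()
    height, width = seg_map.shape
    rows, cols = np.nonzero(seg_map != ignore_index)
    for r, c in zip(rows, cols):
        r0, r1 = max(r - window_size, 0), min(r + window_size + 1, height)
        c0, c1 = max(c - window_size, 0), min(c + window_size + 1, width)
        out[r0:r1, c0:c1] = seg_map[r, c]
    return out


# Toy 7x7 map with a single observation of class 2 in the centre
seg = np.full((7, 7), -1)
seg[3, 3] = 2
buffered = buffer_observations(seg, window_size=1)
print(buffered)
```

How overlapping windows are resolved in the real module is not documented here, so treat the "last one wins" rule as a placeholder.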

### 3. Precise Point-Based Cleaning
**Purpose**: Restrict segmentation maps to observation-point pixels only.

**Why it could be useful**:
- **Prevents false positives**: Eliminates background noise that could confuse a model during training.
- **Focuses training**: Model learns only from verified observation locations
- **Reduces overfitting**: Prevents model from learning spurious patterns in unlabeled areas
- **Enables point-based evaluation**: Perfect for tasks where only specific locations matter

**Normal Behavior**:
When observation points are found within the segmentation map bounds:
- Observation pixels: keep their original values (1, 2, 3, etc.)
- Non-observation pixels: set to `ignore_index` (for instance, -1)
- Result: a segmentation map with mostly -1 values and a few pixels holding actual observation values
- Training: the model ignores -1 pixels and learns only from observation pixels

**Edge Case**: No Observations Found

If no observation points are found within the bounds of a segmentation map, the corresponding row is removed from the output CSV.
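
The behavior described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the module's implementation: `limit_to_observations` is a hypothetical name, observation points are assumed to be already converted to `(row, col)` pixel indices, and returning `None` stands in for "remove this row from the output CSV".

```python
import numpy as np


def limit_to_observations(seg_map, obs_pixels, ignore_index=-1):
    """Keep only the pixels listed in `obs_pixels`; mask everything else.

    Returns None when no observation falls inside the map, which a
    caller would treat as "drop this chip from the dataset".
    """
    seg_map = np.asarray(seg_map)
    height, width = seg_map.shape
    inside = [(r, c) for r, c in obs_pixels if 0 <= r < height and 0 <= c < width]
    if not inside:
        return None
    out = np.full_like(seg_map, ignore_index)
    for r, c in inside:
        out[r, c] = seg_map[r, c]
    return out


seg = np.array([[1, 2], [3, 0]])
print(limit_to_observations(seg, [(0, 1)]))
print(limit_to_observations(seg, [(5, 5)]))
```

Note that the sketch assumes a signed integer map; an unsigned dtype could not hold an `ignore_index` of -1.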

**Example use case**:


In [None]:
# Let's generate a "verified" dataset with confirmed observation points.
# We will just sample from the original observation points CSV file we created in the Chip Creator Demo.
# We will also use the buffered dataset we created in the previous step.
import pandas as pd

# Sample 45 random observation points from the original observation points CSV file
original_obs_points = pd.read_csv("demo_data/sample_observations.csv")
sampled_obs_points = original_obs_points.sample(n=45)
# Create a new CSV file with the sampled observation points
sampled_obs_points.to_csv("cleaned_output/verified_observations.csv", index=False)

In [None]:
# Limit segmentation maps to exact observation points only
# if `seg_map_output_dir` is not provided, the original segmentation map files will be overwritten

!python -m instageo.data.data_cleaner \
    --chips_dataset_csv="cleaned_output/buffered_dataset.csv" \
    --output_chips_dataset_csv="cleaned_output/verified_dataset.csv" \
    --seg_map_output_dir="cleaned_output/verified_seg_maps" \
    --clean_seg_maps=True \
    --cleaning_method="limit" \
    --observation_points_csv="cleaned_output/verified_observations.csv" \
    --ignore_index=-1
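
Because rows without any matching observation are removed, it can be useful to check how many chips survived. The helper below is a small sketch using only `pandas.read_csv`; the name `rows_removed` is hypothetical and it makes no assumption about the CSV's column names.

```python
import pandas as pd


def rows_removed(input_csv, output_csv):
    """Return how many dataset rows a cleaning step dropped."""
    n_in = len(pd.read_csv(input_csv))
    n_out = len(pd.read_csv(output_csv))
    return n_in - n_out
```

For example, `rows_removed("cleaned_output/buffered_dataset.csv", "cleaned_output/verified_dataset.csv")` would report how many chips had no verified observation inside their bounds, assuming the command above ran successfully.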


### When to Use Each Option

| Function | Use When | Example Scenario |
|----------|----------|------------------|
| Drop entries from dataset | Chips contain many no-data pixels (or you want to exclude chips whose no-data fraction exceeds a specific threshold) | Cloud-covered satellite images that have already been masked |
| Buffer segmentation maps | You need spatial context around observations. Note that you can also set a window size > 0 directly in the chip creator to buffer observation pixels at creation time, which also benefits from cloud masking if enabled. | Land cover classification, environmental monitoring |
| Limit segmentation maps to specific points | You want only the pixels at specific observation points kept in the segmentation map | Precise location mapping, evaluation over specific points |

### 🔗 Related Demos:
- `chip_creator_demo.ipynb`: Point-based chip creation
- `raster_chip_creator_demo.ipynb`: Raster-based chip creation
- `data_splitter_demo.ipynb`: Data splitting strategies