# InstaGeo Chip Creator Demo

This notebook demonstrates how to use the `chip_creator.py` module to create satellite image chips and segmentation maps from point observations. The chip creator extracts small image patches (chips) from larger satellite tiles around each observation, producing data suitable for training machine learning models.

## Overview

The chip creator module:
- Extracts satellite imagery (HLS, Sentinel-1, Sentinel-2)
- Creates chips around observation points
- Generates segmentation maps for training
- Supports multiple data sources and processing methods

## Prerequisites

Before running this notebook, ensure you have:
1. InstaGeo installed
2. NASA Earthdata credentials configured
3. A CSV file of observations with the columns `date`, `x`, `y`, and `label` (Section 2 of this notebook creates a sample file)
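
Before running the chip creator on your own file, it can help to check that the CSV has the expected shape. The sketch below is not part of InstaGeo: `check_observations` is a hypothetical helper, and it assumes that `x` and `y` are longitude and latitude in degrees (as in the sample data of this notebook) and that labels must be numeric.

```python
import pandas as pd

# Columns listed in the prerequisites above.
REQUIRED_COLUMNS = ("date", "x", "y", "label")


def check_observations(df: pd.DataFrame) -> list:
    """Return a list of problems found; an empty list means the frame looks usable.

    Hypothetical helper, not part of InstaGeo. Assumes x/y are lon/lat in degrees.
    """
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        return [f"missing columns: {missing}"]

    problems = []
    # Unparseable dates become NaT when errors="coerce" is used.
    if pd.to_datetime(df["date"], errors="coerce").isna().any():
        problems.append("some dates could not be parsed")
    if not df["x"].between(-180, 180).all():
        problems.append("x (longitude) outside [-180, 180]")
    if not df["y"].between(-90, 90).all():
        problems.append("y (latitude) outside [-90, 90]")
    if not pd.api.types.is_numeric_dtype(df["label"]):
        problems.append("label is not numeric")
    return problems
```

For example, `check_observations(pd.read_csv("demo_data/sample_observations.csv"))` should return an empty list once the sample file from Section 2 has been written.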

## 1. Setup and Configuration


In [None]:
import pandas as pd
import numpy as np
from pathlib import Path
import matplotlib.pyplot as plt

# Set up paths
notebook_dir = Path.cwd()
data_dir = notebook_dir / "demo_data"
output_dir = notebook_dir / "chip_output"

# Create directories
data_dir.mkdir(exist_ok=True)
output_dir.mkdir(exist_ok=True)

print(f"Data directory: {data_dir}")
print(f"Output directory: {output_dir}")

## 2. Create Sample Observation Data

Let's create a sample dataset to demonstrate the chip creator functionality. In practice, this would represent your original point observations:


In [None]:
# Create sample observation data
np.random.seed(42)
n_samples = 50

# Sample coordinates around Paris area
lat_center, lon_center = 48.8566, 2.3522
lat_range, lon_range = 0.5, 0.5

sample_data = {
    'date': pd.date_range('2023-06-14', periods=n_samples, freq='D').strftime('%Y-%m-%d'),
    'x': np.random.uniform(lon_center - lon_range, lon_center + lon_range, n_samples),
    'y': np.random.uniform(lat_center - lat_range, lat_center + lat_range, n_samples),
    'label': np.random.choice(['crop', 'forest', 'urban', 'water'], n_samples, p=[0.4, 0.3, 0.2, 0.1])
}

df = pd.DataFrame(sample_data)
df.head()

In [None]:
# Convert string labels to integers (required for chip creator)
label_mapping = {'crop': 1, 'forest': 2, 'urban': 3, 'water': 4}
df['label_int'] = df['label'].map(label_mapping)

print("Label mapping:")
for label, label_int in label_mapping.items():
    print(f"  {label}: {label_int}")

print("\nLabel distribution:")
print(df['label'].value_counts())
print("\nInteger label distribution:")
print(df['label_int'].value_counts().sort_index())

In [None]:
# Save sample data with integer labels
sample_file = data_dir / "sample_observations.csv"
df_final = df[['date', 'x', 'y', 'label_int']].copy()
df_final.columns = ['date', 'x', 'y', 'label']  # Rename label_int back to label
df_final.to_csv(sample_file, index=False)

print(f"Sample data saved to: {sample_file}")
print(f"Dataset shape: {df_final.shape}")
print(f"Date range: {df_final['date'].min()} to {df_final['date'].max()}")
print(f"Label range: {df_final['label'].min()} to {df_final['label'].max()}")
print("\nFinal dataset preview:")
print(df_final.head())


## 3. Visualizing Sample Data

In [None]:
# Visualize the sample data
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Spatial distribution
colors = {1: 'green', 2: 'darkgreen', 3: 'red', 4: 'blue'}
label_names = {1: 'crop', 2: 'forest', 3: 'urban', 4: 'water'}

for label_int in df_final['label'].unique():
    mask = df_final['label'] == label_int
    ax1.scatter(df_final.loc[mask, 'x'], df_final.loc[mask, 'y'],
                c=colors[label_int], label=label_names[label_int], alpha=0.7, s=50)

ax1.set_xlabel('Longitude')
ax1.set_ylabel('Latitude')
ax1.set_title('Spatial Distribution of Observations')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Temporal distribution
df_final['date_parsed'] = pd.to_datetime(df_final['date'])
df_final['month'] = df_final['date_parsed'].dt.month
monthly_counts = df_final.groupby(['month', 'label']).size().unstack(fill_value=0)
monthly_counts.plot(kind='bar', ax=ax2, color=[colors[label] for label in monthly_counts.columns])
ax2.set_xlabel('Month')
ax2.set_ylabel('Number of Observations')
ax2.set_title('Temporal Distribution by Month')
ax2.legend(title='Label', labels=[label_names[label] for label in monthly_counts.columns])
ax2.tick_params(axis='x', rotation=0)

plt.tight_layout()
plt.show()


## 4. Usage Examples

### Example 1: Basic usage with HLS data
We want to create 256x256 chips, with their corresponding segmentation maps, using a single timestep and a temporal tolerance of 2 days.
For each record, the chip creator searches a time window that starts 2 days before and ends 2 days after the observation date, and uses the available image closest to that date. The label values from the original observations set the pixel values in the segmentation maps.
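
The window logic described above can be sketched in plain Python. This is only an illustration, not the InstaGeo implementation: `search_window` and `closest_acquisition` are hypothetical names, and treating both ends of the window as inclusive is an assumption.

```python
from datetime import date, timedelta


def search_window(obs_date: date, temporal_tolerance: int):
    """Return (start, end) of the search window around an observation date."""
    delta = timedelta(days=temporal_tolerance)
    return obs_date - delta, obs_date + delta


def closest_acquisition(obs_date: date, candidates, temporal_tolerance: int):
    """Pick the candidate date closest to obs_date inside the window, or None.

    Assumption: window bounds are inclusive; ties go to the first candidate.
    """
    start, end = search_window(obs_date, temporal_tolerance)
    in_window = [d for d in candidates if start <= d <= end]
    if not in_window:
        return None
    return min(in_window, key=lambda d: abs((d - obs_date).days))


obs = date(2023, 6, 14)
acquisitions = [date(2023, 6, 10), date(2023, 6, 13), date(2023, 6, 16)]
print(search_window(obs, 2))                      # 2023-06-12 to 2023-06-16
print(closest_acquisition(obs, acquisitions, 2))  # 2023-06-13
```

An observation with no image inside its window yields `None` in this sketch, which mirrors why a larger `temporal_tolerance` can recover records that would otherwise be dropped.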


In [None]:
!python -m instageo.data.chip_creator \
    --dataframe_path="demo_data/sample_observations.csv" \
    --output_directory="chip_output/hls_basic" \
    --min_count=1 \
    --temporal_tolerance=2 \
    --chip_size=256 \
    --num_steps=1 \
    --data_source="HLS"

### Example 2: Multiple temporal chips with HLS data

This example creates a multitemporal dataset of 3-timestep chips and their corresponding segmentation maps. This is useful for tracking changes over an area during a given period, or for capturing temporal variation that could improve classification at a given date. The `temporal_step` argument is the number of days between consecutive steps, and `num_steps` is the total number of steps. Here we create 3-timestep chips whose acquisition dates are spaced 30 days apart (roughly monthly), counted from the original observation date.
Because an image may not exist at the exact date of each timestep, we can raise `temporal_tolerance` to allow more flexibility in the date search. Note that a very high value could break the intended seasonality of the data.
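
To make the interaction of `num_steps` and `temporal_step` concrete, here is a small sketch that lists the target date of each timestep. It is an illustration only: `target_dates` is a hypothetical helper, and whether the chip creator steps backward or forward in time from the observation date is an assumption exposed through the `backward` flag.

```python
from datetime import date, timedelta


def target_dates(obs_date: date, num_steps: int, temporal_step: int, backward: bool = True):
    """List the target date of each timestep, starting at the observation date.

    Assumption: the direction of the steps is not confirmed by this notebook,
    so it is left as a parameter.
    """
    sign = -1 if backward else 1
    return [obs_date + sign * timedelta(days=i * temporal_step) for i in range(num_steps)]


for d in target_dates(date(2023, 6, 14), num_steps=3, temporal_step=30):
    print(d)
# 2023-06-14
# 2023-05-15
# 2023-04-15
```

Each target date then gets its own search window of `temporal_tolerance` days on either side, so a tolerance approaching half of `temporal_step` would let neighboring windows nearly touch.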

In [None]:
!python -m instageo.data.chip_creator \
    --dataframe_path="demo_data/sample_observations.csv" \
    --output_directory="chip_output/hls_multitemporal" \
    --min_count=1 \
    --chip_size=256 \
    --temporal_tolerance=3 \
    --temporal_step=30 \
    --num_steps=3 \
    --data_source="HLS"

### Example 3: HLS chips with cloud coverage filtering and masking
This example creates a dataset with chips and segmentation maps, but only keeps records whose cloud coverage is below 20%.
Note that the cloud percentage is checked at the HLS tile level, not at the chip level.
If a tile has a cloud coverage of 20% or more, every record/chip that could have been extracted from that tile is discarded.
We also mask clouds by setting `mask_types` to `cloud` and `cloud_shadow`, which sets the corresponding pixels to `no_data` values (0 in the chips and -1 in the segmentation maps; check `data/settings.py`). The `masking_strategy` argument specifies how the masking is applied. Here we use the `any` strategy: a pixel is masked if the mask is present for at least one timestep.

To limit the number of discarded chips, we can set `cloud_coverage` to a higher value so that more tiles are considered, and still apply masking. We could then filter chips by thresholding their percentage of `no_data` values (not covered in this notebook; see `data_cleaner_demo.ipynb` or the `data/data_cleaner.py` module for more details).

Note that if you don't set the `cloud_coverage` argument, a default value of 10 is used.
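
The `any` masking strategy and the `no_data` percentage mentioned above can be illustrated with NumPy. This is a minimal sketch rather than the InstaGeo code: the array layout (timesteps first), the function names, and the boolean mask representation are assumptions; only the `no_data` values 0 and -1 come from the text.

```python
import numpy as np

CHIP_NO_DATA = 0   # no_data value for chips, as stated above
SEG_NO_DATA = -1   # no_data value for segmentation maps, as stated above


def apply_any_mask(chip, seg_map, masks):
    """Mask a pixel everywhere if it is flagged in at least one timestep.

    Assumed shapes: chip (T, H, W), seg_map (H, W), masks (T, H, W) of bool,
    where True marks a cloud or cloud-shadow pixel.
    """
    combined = masks.any(axis=0)                        # (H, W)
    masked_chip = np.where(combined, CHIP_NO_DATA, chip)  # broadcasts over T
    masked_seg = np.where(combined, SEG_NO_DATA, seg_map)
    return masked_chip, masked_seg


def no_data_fraction(seg_map):
    """Fraction of segmentation-map pixels equal to the no_data value."""
    return float((seg_map == SEG_NO_DATA).mean())


chip = np.ones((2, 2, 2))
seg = np.array([[1, 2], [3, 4]])
masks = np.zeros((2, 2, 2), dtype=bool)
masks[1, 0, 0] = True  # cloud at pixel (0, 0) in the second timestep only

masked_chip, masked_seg = apply_any_mask(chip, seg, masks)
print(masked_chip[:, 0, 0])          # both timesteps are set to 0
print(no_data_fraction(masked_seg))  # 0.25
```

A threshold on `no_data_fraction` is one way to discard heavily masked chips after raising `cloud_coverage`.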

In [None]:
!python -m instageo.data.chip_creator \
    --dataframe_path="demo_data/sample_observations.csv" \
    --output_directory="chip_output/hls_cloud_filtered" \
    --min_count=1 \
    --chip_size=256 \
    --temporal_tolerance=2 \
    --num_steps=1 \
    --data_source="HLS" \
    --cloud_coverage=20 \
    --mask_types="cloud,cloud_shadow" \
    --masking_strategy="any"

### Example 4: Buffered observation points
In many applications, we want the label to cover a neighborhood around each observation point rather than a single pixel. To do this, neighboring pixels are given the same value as the pixel in which the observation falls, by setting the `window_size` argument to a value greater than 0. Building on the basic HLS example, setting `window_size` to 2 includes 2 pixels in every direction around the observation pixel.

Note that the default value for `window_size` is 0, meaning no buffer is applied.
To experiment with several window sizes, we can keep a value of 0 and then generate different variants of the segmentation maps with the `data/data_cleaner.py` module.
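
The effect of `window_size` on a segmentation map can be sketched with NumPy. This is an illustration under assumptions, not the InstaGeo implementation: `burn_label` is a hypothetical helper, and a square neighborhood clipped at the chip border is assumed.

```python
import numpy as np


def burn_label(seg_map, row, col, label, window_size=0):
    """Write `label` into a square of side 2 * window_size + 1 centred on (row, col).

    Assumption: the buffer is a square neighborhood, clipped at the map edges.
    Returns a copy; the input map is left unchanged.
    """
    r0 = max(row - window_size, 0)
    r1 = min(row + window_size + 1, seg_map.shape[0])
    c0 = max(col - window_size, 0)
    c1 = min(col + window_size + 1, seg_map.shape[1])
    out = seg_map.copy()
    out[r0:r1, c0:c1] = label
    return out


empty = np.full((256, 256), -1)  # -1 is the segmentation no_data value
print((burn_label(empty, 100, 100, 3, window_size=0) == 3).sum())  # 1 labeled pixel
print((burn_label(empty, 100, 100, 3, window_size=2) == 3).sum())  # 25 labeled pixels
```

With `window_size=2`, a single observation labels a 5x5 block, so the number of supervised pixels per chip grows from 1 to 25 in this sketch.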

In [None]:
!python -m instageo.data.chip_creator \
    --dataframe_path="demo_data/sample_observations.csv" \
    --output_directory="chip_output/hls_buffered" \
    --min_count=1 \
    --chip_size=256 \
    --temporal_tolerance=2 \
    --num_steps=1 \
    --data_source="HLS" \
    --window_size=2

### Example 5: Sentinel-2 example
For Sentinel-2 and Sentinel-1, we set the spatial resolution to 10 m, expressed in EPSG:4326 units (degrees). The default value corresponds to 30 m in EPSG:4326 (for HLS data).
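
The degree value passed to `spatial_resolution` can be derived from a ground distance in meters. The sketch below assumes the conversion uses the length of one degree along the equator on the WGS84 ellipsoid; `metres_to_degrees` is a hypothetical helper, not an InstaGeo function.

```python
import math

WGS84_EQUATORIAL_RADIUS_M = 6378137.0  # semi-major axis of the WGS84 ellipsoid


def metres_to_degrees(metres: float) -> float:
    """Convert a ground distance to degrees, using the length of a degree at the equator."""
    metres_per_degree = 2 * math.pi * WGS84_EQUATORIAL_RADIUS_M / 360
    return metres / metres_per_degree


print(metres_to_degrees(10))  # approximately 8.98e-05
print(metres_to_degrees(30))  # approximately 2.69e-04
```

Because a degree of longitude shrinks with the cosine of the latitude, the same degree value covers a shorter east-west distance away from the equator (about two thirds of it around Paris).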


In [None]:
!python -m instageo.data.chip_creator \
    --dataframe_path="demo_data/sample_observations.csv" \
    --output_directory="chip_output/s2_example" \
    --min_count=1 \
    --chip_size=256 \
    --temporal_tolerance=5 \
    --num_steps=1 \
    --data_source="S2" \
    --cloud_coverage=20 \
    --spatial_resolution=8.983152841195215e-05  # 10 m resolution in EPSG:4326

### Example 6: Sentinel-1 example

In [None]:
!python -m instageo.data.chip_creator \
    --dataframe_path="demo_data/sample_observations.csv" \
    --output_directory="chip_output/s1_example" \
    --min_count=1 \
    --chip_size=256 \
    --temporal_tolerance=5 \
    --num_steps=1 \
    --data_source="S1" \
    --spatial_resolution=8.983152841195215e-05  # 10 m resolution in EPSG:4326

### Example 7: Regression example
All the previous examples focus on segmentation tasks.
If the label is a continuous variable (for instance a count rather than an index into class names), we can set the `task_type` argument to `reg` to create a regression dataset.
In that case, the data type of the segmentation maps is set to `float32`.
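
For illustration, an observation table for a regression task might look like the sketch below, which mirrors the sample data from Section 2 but draws a synthetic count as the label. The Poisson-distributed counts and the `reg_df` name are purely hypothetical; they only show a continuous label in the same `date`, `x`, `y`, `label` layout.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_samples = 50

# Same layout as the sample observations, with a synthetic count as the label.
reg_df = pd.DataFrame({
    "date": pd.date_range("2023-06-14", periods=n_samples, freq="D").strftime("%Y-%m-%d"),
    "x": rng.uniform(2.3522 - 0.5, 2.3522 + 0.5, n_samples),    # longitude around Paris
    "y": rng.uniform(48.8566 - 0.5, 48.8566 + 0.5, n_samples),  # latitude around Paris
    "label": rng.poisson(lam=12, size=n_samples).astype("float32"),
})

print(reg_df.dtypes)
```

Saving such a frame with `reg_df.to_csv(..., index=False)` and passing its path as `--dataframe_path` would give the regression command a label column that is meaningful as a continuous target.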


In [None]:
!python -m instageo.data.chip_creator \
    --dataframe_path="demo_data/sample_observations.csv" \
    --output_directory="chip_output/hls_regression" \
    --min_count=1 \
    --chip_size=256 \
    --temporal_tolerance=2 \
    --num_steps=1 \
    --data_source="HLS" \
    --task_type="reg"

## 5. Troubleshooting Common Issues

Here are solutions to common problems:

| Issue | Problem | Solution | Command/Args |
|-------|---------|----------|--------------|
| Authentication Error | NASA Earthdata login failed | Configure ~/.netrc with Earthdata credentials | `echo 'machine urs.earthdata.nasa.gov login <username> password <password>' >> ~/.netrc` |
| No Data Found | No observations found / No objects to concatenate | Increase `temporal_tolerance` or `cloud_coverage` (not applicable to Sentinel-1), or check the date ranges of your observations | `--temporal_tolerance=10 --cloud_coverage=50` |
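
To check the authentication row above without launching a full run, the `~/.netrc` entry can be inspected with Python's standard `netrc` module. This is a small sketch: `has_earthdata_credentials` is a hypothetical helper, and it only verifies that a login and password exist for the host, not that Earthdata accepts them.

```python
import netrc
from pathlib import Path

EARTHDATA_HOST = "urs.earthdata.nasa.gov"  # host used in the table above


def has_earthdata_credentials(netrc_path=None) -> bool:
    """Return True if the netrc file holds a login and password for Earthdata.

    Hypothetical helper; it does not contact the Earthdata servers.
    """
    path = Path(netrc_path) if netrc_path else Path.home() / ".netrc"
    try:
        auth = netrc.netrc(str(path)).authenticators(EARTHDATA_HOST)
    except (FileNotFoundError, netrc.NetrcParseError):
        return False
    # authenticators() returns (login, account, password) or None.
    return auth is not None and bool(auth[0]) and bool(auth[2])


print(has_earthdata_credentials())
```

A `False` result points to a missing or malformed entry, which is the situation the `echo ... >> ~/.netrc` command in the table addresses.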

## 6. Next Steps

After creating chips, you can:

1. **Clean the data** using `data_cleaner.py`
2. **Split the dataset** using `data_splitter.py`
3. **Train machine learning models and evaluate their performance** (take a look at the training scripts in `experiments_dir`)
4. **Optionally generate chips** with `raster_chip_creator.py`

See the other demo notebooks for these next steps!


## 7. Summary

The chip creator creates patch/chip datasets from satellite imagery. Key takeaways:

- **Flexible data sources**: Supports HLS, Sentinel-1, Sentinel-2
- **Time series support**: Create temporal sequences for time series models with `num_steps` and `temporal_step`
- **Quality assurance**: Tile-level cloud coverage filtering and pixel-level cloud masking
- **Scalable**: Can process large datasets efficiently

For more information, see the InstaGeo documentation and other demo notebooks.
