# Week 3 Activity 1 — Data Preparation and Exploration for CNN Training

**Case Study:** Los Lagos Parcelización  
**Objective:** Prepare a high-quality training dataset for land-cover classification  
**Duration:** 90 minutes  

---

## 📋 Activity Overview

This notebook guides you through preparing geospatial training data for deep learning. The central lesson is that **data quality matters more than model sophistication**: a well-prepared dataset with thoughtful labels and proper spatial splitting will usually outperform a sophisticated architecture trained on poor data.

### Learning Objectives

By the end of this activity, you will:
1. Define spectrally and spatially distinct land-cover classes
2. Create training labels in QGIS with proper attribute structure
3. Extract and store multispectral patches from Sentinel-2 imagery
4. Analyze spectral signatures to understand class separability
5. Implement spatial train/validation splitting (critical!)
6. Understand tradeoffs between storage formats and band selection

### Workflow Sections

**Part A**

1. **Setup and Initialization** — Environment, paths, reproducibility
2. **Class Definition** — Document your 5 land-cover classes
3. **QGIS Digitization Guide** — Attribute table setup and workflow
4. **Load Training Polygons** — Import and validate labels
5. **Sentinel-2 Data Access** — Connect to Earth Engine, select optimal bands

**Part B**

6. **Patch Extraction** — Extract training patches with multiple approaches
7. **Spectral Analysis** — Visualize and compare class signatures
8. **Spatial Train/Val Split** — Implement proper spatial separation
9. **Storage Format Comparison** — NumPy vs GeoTIFF tradeoffs
10. **Experiment Logging** — Document your data preparation decisions
11. **Self-Assessment** — Evaluate data quality and readiness

---

## ⚠️ Critical Concept: Spatial Autocorrelation

**The most common mistake in geospatial ML**: Random train/test splitting.

Nearby pixels tend to be similar (Tobler's First Law of Geography). If you split patches at random, near-duplicate neighbors land in both the train and test sets, so test accuracy is inflated: the model is rewarded for recognizing places it has effectively already seen, not for learning features that generalize to new areas.

**Solution**: Spatially separate train and test data. We'll explore this thoroughly in Section 8.
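
To make the idea concrete, here is a minimal sketch of a block-based split that uses only the standard library. The patch coordinates, the 1 km block size, and the helper name `spatial_block_split` are illustrative assumptions; this is not necessarily the method used in Section 8.

```python
import random
from collections import defaultdict

def spatial_block_split(points, block_size, val_fraction=0.25, seed=42):
    """Split (x, y) points so every grid block lands wholly in train or val.

    points: list of (x, y) tuples in projected units (e.g. metres).
    Returns (train_indices, val_indices).
    """
    # Group point indices by the grid block that contains them.
    blocks = defaultdict(list)
    for i, (x, y) in enumerate(points):
        blocks[(int(x // block_size), int(y // block_size))].append(i)

    # Shuffle whole blocks, not individual points.
    block_ids = sorted(blocks)
    random.Random(seed).shuffle(block_ids)

    n_val = max(1, round(len(block_ids) * val_fraction))
    val = [i for b in block_ids[:n_val] for i in blocks[b]]
    train = [i for b in block_ids[n_val:] for i in blocks[b]]
    return train, val

# Hypothetical patch centres on a 4 km x 4 km area, one every 250 m.
points = [(x, y) for x in range(0, 4000, 250) for y in range(0, 4000, 250)]
train_idx, val_idx = spatial_block_split(points, block_size=1000)
print(f'train: {len(train_idx)} patches, val: {len(val_idx)} patches')
```

Because whole blocks are assigned to one side, no validation patch has a training neighbor inside its own block. Patches on either side of a block edge can still be adjacent, so a buffer between blocks may be worth considering.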

---

# 1. Setup and Initialization

**Objective:**  
Initialize the notebook environment with all necessary libraries, establish reproducible random seeds, configure file paths, and create the project directory structure.

**Key Components:**
- **Geospatial libraries**: `geopandas`, `earthengine-api`, `geemap`, `rasterio`
- **ML/Data libraries**: `numpy`, `torch`, `scikit-learn`
- **Visualization**: `matplotlib`, `plotly` (interactive)
- **Reproducibility**: Fixed random seeds for Python, NumPy, and PyTorch
- **Path management**: Automatic repository root detection
- **Experiment logging**: Basic logging system initialization

**Why this matters:**  
Proper setup ensures reproducible results and clean project organization. The experiment log will track all your data preparation decisions.

In [None]:
# 1) Setup and Initialization

# === Core Python utilities ===
from pathlib import Path
import os, json, random, warnings
from datetime import datetime
import numpy as np
import pandas as pd

# === Geospatial libraries ===
import geopandas as gpd              # Vector data handling
import rasterio                       # Raster I/O for GeoTIFFs
from rasterio.features import rasterize
from shapely.geometry import box, Point
import ee                             # Earth Engine Python API
import geemap                         # Earth Engine visualization

# === Machine Learning libraries ===
import torch
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans

# === Visualization libraries ===
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
import seaborn as sns

# Configure plotting
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
warnings.filterwarnings('ignore')  # Caution: silences ALL warnings, including geopandas CRS warnings

# === Reproducibility: Set all random seeds ===
RANDOM_SEED = 42
random.seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(RANDOM_SEED)

print(f'✓ Random seed set to {RANDOM_SEED} for reproducibility')

# === Repository Path Management ===

def find_repo_root(start: Path) -> Path:
    """
    Locate repository root by searching upward for 'data' and 'figures' directories.
    This allows notebooks to run from any subdirectory.
    """
    for p in [start] + list(start.parents):
        if (p / 'data').exists() and (p / 'figures').exists():
            return p
    return start  # Fallback to current directory

# Resolve key paths
CWD = Path.cwd()
REPO = find_repo_root(CWD)
DATA = REPO / 'data'
FIGS = REPO / 'figures'
REPORTS = REPO / 'reports'
MODELS = REPO / 'models'
NOTEBOOKS = REPO / 'notebooks'

# Data subdirectories
DATA_EXTERNAL = DATA / 'external'      # External data (AOI, reference data)
DATA_RAW = DATA / 'raw'                # Raw downloaded imagery
DATA_PROCESSED = DATA / 'processed'    # Processed patches and arrays
DATA_LABELS = DATA / 'labels'          # Training labels (GeoJSON)

# Create all necessary directories
for directory in [FIGS, REPORTS, MODELS, NOTEBOOKS, 
                  DATA_EXTERNAL, DATA_RAW, DATA_PROCESSED, DATA_LABELS]:
    directory.mkdir(exist_ok=True, parents=True)

# Define key file paths
AOI_PATH = DATA_EXTERNAL / 'aoi.geojson'
TRAINING_POLYGONS_PATH = DATA_LABELS / 'training_polygons.geojson'

# === Initialize Experiment Log ===

# Create a simple experiment log to track data preparation decisions
experiment_log = {
    'activity': 'Week 3 Activity 1 - Data Preparation',
    'case_study': 'Los Lagos Parcelización',
    'date': datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
    'random_seed': RANDOM_SEED,
    'classes': ['Forest', 'Agriculture', 'Parcels', 'Water', 'Urban'],
    'decisions': {},  # Will be populated throughout notebook
    'metrics': {}     # Will store data quality metrics
}

print('\n📁 Project Structure:')
print(f'  Repository root: {REPO}')
print(f'  Data directory: {DATA}')
print(f'  Figures directory: {FIGS}')
print(f'  Reports directory: {REPORTS}')
print(f'\n📍 Key Files:')
print(f'  AOI: {AOI_PATH}')
print(f'  AOI exists: {AOI_PATH.exists()}')
print(f'  Training polygons: {TRAINING_POLYGONS_PATH}')
print(f'  Training polygons exist: {TRAINING_POLYGONS_PATH.exists()}')
print(f'\n✓ Setup complete!')

**Outcome:**

After running this cell, you should see:
- ✓ Random seed confirmation
- 📁 Project structure with all paths
- 📍 AOI file status (should show `True`)
- 📍 Training polygons status (will be `False` until you create them in QGIS)

The `experiment_log` dictionary will track all your decisions throughout this activity.
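
Since the log is a plain dictionary, recording a decision and saving it takes only a few lines. The sketch below is one possible pattern, not the notebook's own logging code: the helper `log_decision`, the `demo_log` stand-in, and the output file name are all hypothetical.

```python
import json
import tempfile
from datetime import datetime
from pathlib import Path

def log_decision(log, key, value, rationale):
    """Record one data-preparation decision with its rationale and a timestamp."""
    log.setdefault('decisions', {})[key] = {
        'value': value,
        'rationale': rationale,
        'logged_at': datetime.now().isoformat(timespec='seconds'),
    }
    return log

# Stand-in for the notebook's experiment_log, reduced to the relevant keys.
demo_log = {'random_seed': 42, 'decisions': {}, 'metrics': {}}
log_decision(demo_log, 'band_selection', ['B4', 'B3', 'B2', 'B8'],
             'RGB + NIR keeps patches small while retaining NDVI')

# default=str is a safety net for values json cannot encode natively (e.g. Path).
out_path = Path(tempfile.mkdtemp()) / 'experiment_log.json'  # hypothetical location
out_path.write_text(json.dumps(demo_log, indent=2, default=str), encoding='utf-8')
restored = json.loads(out_path.read_text(encoding='utf-8'))
print(restored['decisions']['band_selection']['rationale'])
```

Storing the rationale next to each value means the log can later explain why a choice was made, not only what it was.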

---

# 2. Define Land-Cover Classes

**Objective:**  
Document clear, unambiguous definitions for each of your 5 land-cover classes. This is critical for consistent labeling and model performance.

**Why this matters:**  
Ambiguous class definitions lead to inconsistent labels, which confuse the model and reduce accuracy. Clear definitions ensure you (and others) label consistently.

**Key Principles:**
1. **Spectral distinctiveness**: Classes should have different spectral signatures
2. **Spatial distinctiveness**: Classes should have different spatial patterns/textures
3. **Practical relevance**: Classes should align with your research questions
4. **Label feasibility**: You should be able to reliably identify each class in imagery

In [None]:
# 2) Define Land-Cover Classes

# === Class Definitions for Los Lagos Parcelización ===

class_definitions = {
    'Forest': {
        'id': 0,
        'color': '#228B22',  # Forest green
        'description': 'Native and plantation forests with dense canopy cover',
        'spectral_characteristics': [
            'High NIR reflectance (vegetation)',
            'Low red reflectance (chlorophyll absorption)',
            'High NDVI (>0.6)',
            'Moderate SWIR reflectance'
        ],
        'spatial_characteristics': [
            'Continuous canopy texture',
            'Irregular boundaries (native) or regular (plantation)',
            'Large contiguous patches'
        ],
        'examples': [
            'Temperate rainforest',
            'Eucalyptus plantations',
            'Mixed native forest'
        ],
        'exclusions': [
            'Sparse trees in agricultural areas',
            'Recently cleared forest (label as Parcels only if subdivided; otherwise skip, since Bare Soil is not one of the 5 classes)',
            'Shrubland with <30% canopy cover'
        ]
    },
    
    'Agriculture': {
        'id': 1,
        'color': '#FFD700',  # Gold
        'description': 'Active agricultural fields including crops and pasture',
        'spectral_characteristics': [
            'Variable NDVI (0.3-0.7) depending on crop stage',
            'Lower NIR than forest',
            'Seasonal variability in all bands'
        ],
        'spatial_characteristics': [
            'Regular field boundaries',
            'Rectangular or geometric shapes',
            'Uniform texture within fields',
            'Medium patch sizes'
        ],
        'examples': [
            'Wheat fields',
            'Potato fields',
            'Managed pasture',
            'Hay fields'
        ],
        'exclusions': [
            'Fallow fields (if bare, skip: Bare Soil is not one of the 5 classes)',
            'Recently harvested fields',
            'Residential gardens (classify as Parcels)'
        ]
    },
    
    'Parcels': {
        'id': 2,
        'color': '#FF6347',  # Tomato red
        'description': 'Subdivided residential areas (parcelización) — the key phenomenon of interest',
        'spectral_characteristics': [
            'Mixed spectral signature (buildings + vegetation + bare soil)',
            'Moderate NDVI (0.2-0.5) due to mixed cover',
            'High spatial heterogeneity',
            'Often includes bright surfaces (roofs, roads)'
        ],
        'spatial_characteristics': [
            'Small, regular subdivisions (typically <1 hectare)',
            'Grid pattern or linear arrangement along roads',
            'Mix of structures and vegetation',
            'Access roads visible',
            'Often at forest-agriculture interface'
        ],
        'examples': [
            'Residential subdivisions',
            'Rural housing developments',
            'Parceled land with structures'
        ],
        'exclusions': [
            'Traditional rural homesteads (isolated houses)',
            'Urban areas (classify as Urban)',
            'Agricultural buildings without subdivision pattern'
        ]
    },
    
    'Water': {
        'id': 3,
        'color': '#1E90FF',  # Dodger blue
        'description': 'Water bodies including lakes, rivers, and coastal areas',
        'spectral_characteristics': [
            'Very low NIR reflectance (water absorbs NIR)',
            'Negative NDVI',
            'High NDWI (>0.3)',
            'Low reflectance in all bands (clear water)',
            'Higher visible reflectance if turbid'
        ],
        'spatial_characteristics': [
            'Smooth, homogeneous texture',
            'Irregular boundaries (natural) or regular (reservoirs)',
            'Low spatial variability'
        ],
        'examples': [
            'Lakes',
            'Rivers',
            'Coastal waters',
            'Reservoirs'
        ],
        'exclusions': [
            'Shadows (can have similar spectral signature)',
            'Wetlands with emergent vegetation',
            'Temporary flooding'
        ]
    },
    
    'Urban': {
        'id': 4,
        'color': '#808080',  # Gray
        'description': 'Established urban and built-up areas',
        'spectral_characteristics': [
            'High reflectance in visible bands (concrete, asphalt)',
            'Low NDVI (<0.2)',
            'High NDBI (built-up index)',
            'High spatial heterogeneity'
        ],
        'spatial_characteristics': [
            'Dense building patterns',
            'Road networks',
            'Large contiguous built-up areas',
            'Mix of bright (roofs) and dark (roads) surfaces'
        ],
        'examples': [
            'Town centers',
            'Industrial areas',
            'Commercial districts',
            'Dense residential areas'
        ],
        'exclusions': [
            'Parcels (classify separately)',
            'Isolated rural buildings',
            'Agricultural infrastructure'
        ]
    }
}

# Store in experiment log
experiment_log['class_definitions'] = class_definitions

# Display class summary
print('📊 Land-Cover Classes for Los Lagos Parcelización\n')
print('=' * 80)
for class_name, info in class_definitions.items():
    print(f"\n{info['id']}. {class_name.upper()}")
    print(f"   Color: {info['color']}")
    print(f"   {info['description']}")
    print(f"   Key spectral: {', '.join(info['spectral_characteristics'][:2])}")
    print(f"   Key spatial: {info['spatial_characteristics'][0]}")

print('\n' + '=' * 80)
print('\n✓ Class definitions documented')
print('\n💡 TIP: Keep these definitions handy while digitizing in QGIS!')

**Outcome:**

You now have clear, documented definitions for all 5 classes. Notice:

- **Spectral distinctiveness**: Water has negative NDVI, Forest has high NDVI, Urban has low NDVI
- **Spatial distinctiveness**: Parcels have small regular subdivisions, Agriculture has larger geometric fields
- **Explicit exclusions**: Helps avoid ambiguous cases (e.g., "Is this a Parcel or Urban?")
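
The index thresholds quoted in the class definitions can be checked with a few lines of arithmetic. The sketch below uses the same NDVI, NDWI, and NDBI formulas that Section 5 computes in Earth Engine, applied to plain Python numbers. The reflectance values are invented for illustration and are not measurements from the AOI.

```python
def normalized_difference(a, b):
    """Return (a - b) / (a + b), or 0.0 when both inputs are zero."""
    total = a + b
    return (a - b) / total if total != 0 else 0.0

def spectral_indices(green, red, nir, swir1):
    """Compute NDVI, NDWI (green/NIR form) and NDBI from reflectance values."""
    return {
        'NDVI': normalized_difference(nir, red),
        'NDWI': normalized_difference(green, nir),
        'NDBI': normalized_difference(swir1, nir),
    }

# Illustrative reflectances on a 0-1 scale; these numbers are made up.
samples = {
    'Forest': dict(green=0.05, red=0.03, nir=0.40, swir1=0.15),
    'Water':  dict(green=0.06, red=0.04, nir=0.02, swir1=0.01),
    'Urban':  dict(green=0.18, red=0.20, nir=0.24, swir1=0.28),
}
indices = {name: spectral_indices(**bands) for name, bands in samples.items()}
for name, idx in indices.items():
    print(f"{name:7s} " + "  ".join(f"{k}={v:+.2f}" for k, v in idx.items()))
```

With these example values, Forest clears the NDVI > 0.6 guideline, Water has a negative NDVI and an NDWI above 0.3, and Urban stays below NDVI 0.2. Real pixels will be noisier, which is why Section 7 inspects the actual signatures.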

**🤔 Reflection Question 1:**

*Which two classes do you think will be most difficult to distinguish? Why? Consider both spectral and spatial characteristics.*

[Write your answer here before proceeding]

---

# 3. QGIS Digitization Guide

**Objective:**  
Provide step-by-step instructions for creating training polygons in QGIS with proper attribute structure for seamless import into this notebook.

**Why QGIS:**  
QGIS provides visual context (high-resolution imagery, Sentinel-2 composites) and precise digitization tools. Manual digitization ensures high-quality labels.

---

## 🗺️ QGIS Workflow

### Step 1: Load Base Layers

1. **Open QGIS** and create a new project
2. **Load your AOI**: `Layer → Add Layer → Add Vector Layer` → Select `data/external/aoi.geojson`
3. **Add Google Satellite** (for reference):
   - `Browser Panel → XYZ Tiles → Google Satellite` (drag to map)
   - Or: `Web → QuickMapServices → Google → Google Satellite`
4. **Add a Sentinel-2 composite** (requires the Google Earth Engine plugin):
   - Open `Plugins → Python Console`
   - Use the GEE plugin from the console to add a Sentinel-2 median composite for your AOI

### Step 2: Create Training Polygon Layer

1. **Create new GeoJSON layer**:
   - `Layer → Create Layer → New GeoJSON Layer`
   - **File name**: `data/labels/training_polygons.geojson`
   - **Geometry type**: Polygon
   - **CRS**: EPSG:4326 (WGS 84)

2. **Add attribute fields** (CRITICAL for seamless import):

   Click "New Field" for each of the following:

   | Field Name | Type | Length | Description |
   |------------|------|--------|-------------|
   | **class_name** | Text | 50 | Land-cover class (Forest, Agriculture, Parcels, Water, Urban) |
   | **class_id** | Integer | 10 | Numeric class ID (0-4) |
   | **confidence** | Text | 20 | Your confidence (High, Medium, Low) |
   | **notes** | Text | 200 | Any observations or uncertainties |
   | **date_digitized** | Text | 20 | Date you created this polygon |

3. **Click OK** to create the layer
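
For reference, the sketch below shows what one feature in the resulting GeoJSON file might look like, along with a small standard-library check of its attributes. The helper `check_feature` is hypothetical, the coordinates are placeholders, and treating only `class_name` and `class_id` as mandatory is an assumption based on the validation in Section 4.

```python
import json

FIELD_TYPES = {
    'class_name': str,
    'class_id': int,
    'confidence': str,
    'notes': str,
    'date_digitized': str,
}
REQUIRED = ('class_name', 'class_id')  # assumption: the other fields may be empty
CLASS_IDS = {'Forest': 0, 'Agriculture': 1, 'Parcels': 2, 'Water': 3, 'Urban': 4}

def check_feature(feature):
    """Return a list of problems found in one GeoJSON feature's attributes."""
    props = feature.get('properties') or {}
    problems = []
    for field, expected_type in FIELD_TYPES.items():
        value = props.get(field)
        if value is None:
            if field in REQUIRED:
                problems.append(f'missing required field: {field}')
        elif not isinstance(value, expected_type):
            problems.append(f'{field} should be {expected_type.__name__}')
    # The numeric ID must agree with the class name.
    name, class_id = props.get('class_name'), props.get('class_id')
    if name in CLASS_IDS and class_id is not None and class_id != CLASS_IDS[name]:
        problems.append(f'class_id {class_id} does not match {name}')
    return problems

# A hypothetical feature; the coordinates are placeholders, not a real site.
feature = json.loads('''{
  "type": "Feature",
  "properties": {"class_name": "Forest", "class_id": 0, "confidence": "High",
                 "notes": "Mixed native and plantation",
                 "date_digitized": "2025-10-16"},
  "geometry": {"type": "Polygon", "coordinates":
    [[[-73.00, -41.00], [-72.99, -41.00], [-72.99, -40.99],
      [-73.00, -40.99], [-73.00, -41.00]]]}
}''')
print(check_feature(feature) or 'feature looks consistent')
```

A mismatch between `class_name` and `class_id` is easy to introduce when both are picked by hand, so it is worth checking explicitly.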

### Step 3: Configure Attribute Form (Optional but Recommended)

This makes digitization faster and prevents typos:

1. **Right-click layer** → `Properties → Attributes Form`
2. **For class_name field**:
   - Widget Type: `Value Map`
   - Add values: `Forest`, `Agriculture`, `Parcels`, `Water`, `Urban`
3. **For class_id field**:
   - Widget Type: `Value Map`
   - Add values: `0`, `1`, `2`, `3`, `4`
4. **For confidence field**:
   - Widget Type: `Value Map`
   - Add values: `High`, `Medium`, `Low`
5. **Click OK**

Now when you digitize, you'll get dropdown menus instead of typing!

### Step 4: Digitization Strategy

**Goal**: 50-100 polygons per class, distributed across your AOI

**Best Practices**:

1. **Spatial distribution**: 
   - Don't cluster all polygons in one area
   - Sample from north, south, east, west, and center of AOI
   - Include different elevations/aspects if relevant

2. **Polygon size**:
   - **Minimum**: ~200m × 200m (20 pixels × 20 pixels at 10m resolution)
   - **Ideal**: 500m × 500m to 1km × 1km
   - Larger polygons = more training patches per polygon

3. **Polygon purity**:
   - Avoid mixed pixels at boundaries
   - Draw polygons well inside homogeneous areas
   - Leave buffer from class boundaries

4. **Class balance**:
   - Aim for roughly equal numbers of polygons per class
   - If one class is rare, digitize smaller but more numerous polygons

5. **Quality over quantity**:
   - Better to have 50 high-quality polygons than 100 ambiguous ones
   - Mark uncertain areas as `confidence: Low` or skip them
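
The link between polygon size and the number of training patches is simple arithmetic. The sketch below assumes square polygons, 16 × 16 pixel patches, and a stride of 8 pixels. Those values and the function names are illustrative assumptions; the patch size actually used in Part B may differ.

```python
def patches_per_side(side_m, patch_px, stride_px, pixel_m=10):
    """Number of patch positions along one side of a square polygon."""
    side_px = int(side_m // pixel_m)
    if side_px < patch_px:
        return 0  # polygon is too small to hold even one full patch
    return (side_px - patch_px) // stride_px + 1

def patches_in_square(side_m, patch_px=16, stride_px=8, pixel_m=10):
    """Patches that fit fully inside a square polygon of the given side length."""
    return patches_per_side(side_m, patch_px, stride_px, pixel_m) ** 2

for side in (200, 500, 1000):
    print(f'{side:4d} m square -> {patches_in_square(side)} patches')
```

Under these assumptions the patch count grows roughly with polygon area, which is the reasoning behind preferring larger polygons where the land cover is homogeneous.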

### Step 5: Digitize Polygons

1. **Enable editing**: Click the pencil icon or `Toggle Editing`
2. **Add polygon**: Click `Add Polygon Feature` icon
3. **Draw polygon**: Click to add vertices, right-click to finish
4. **Fill attributes**:
   - **class_name**: Select from dropdown (e.g., "Forest")
   - **class_id**: Select corresponding ID (e.g., 0 for Forest)
   - **confidence**: Select High/Medium/Low
   - **notes**: Add any observations (e.g., "Mixed native and plantation")
   - **date_digitized**: Enter today's date (e.g., "2025-10-16")
5. **Click OK**
6. **Repeat** for all training areas

### Step 6: Save and Verify

1. **Save edits**: Click `Save Layer Edits`
2. **Toggle editing off**: Click pencil icon again
3. **Open attribute table**: Right-click layer → `Open Attribute Table`
4. **Verify**:
   - All polygons have class_name and class_id
   - No typos in class names
   - Roughly balanced class distribution

### Step 7: Export (if needed)

If you created the layer elsewhere, export to the correct location:

1. **Right-click layer** → `Export → Save Features As`
2. **Format**: GeoJSON
3. **File name**: `data/labels/training_polygons.geojson`
4. **CRS**: EPSG:4326
5. **Click OK**

---

## ✅ Digitization Checklist

Before proceeding to the next section, verify:

- [ ] Training polygon layer created with correct attribute fields
- [ ] 50-100 polygons per class (or best effort)
- [ ] Polygons distributed across AOI (not clustered)
- [ ] All polygons have class_name and class_id
- [ ] Polygons are in homogeneous areas (avoid mixed pixels)
- [ ] File saved to `data/labels/training_polygons.geojson`
- [ ] CRS is EPSG:4326

---

**Once you've completed digitization in QGIS, return to this notebook and run the next cell to load and validate your training polygons.**

## 🌟 The More You Know! — Why These Specific Attributes?

### **class_name** (Text)
Human-readable label for visualization and interpretation. Makes it easy to understand results without looking up numeric codes.

### **class_id** (Integer)
Numeric identifier required for machine learning. Models work with numbers, not text. The ID maps directly to model output classes.

### **confidence** (Text)
Tracks your certainty about each label. During training, you might:
- Use only "High" confidence polygons initially
- Add "Medium" confidence later if you need more data
- Exclude "Low" confidence to avoid confusing the model
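
A tiered selection like this can be sketched in a few lines. The records below are hypothetical attribute rows, and `select_by_confidence` is an illustrative helper, not part of the notebook.

```python
CONFIDENCE_TIERS = ['High', 'Medium', 'Low']  # ordered from most to least certain

def select_by_confidence(records, minimum='High'):
    """Keep records whose confidence is at least `minimum`."""
    allowed = set(CONFIDENCE_TIERS[:CONFIDENCE_TIERS.index(minimum) + 1])
    return [r for r in records if r.get('confidence') in allowed]

# Hypothetical attribute rows, as they might appear in the attribute table.
polygons = [
    {'class_name': 'Forest',  'confidence': 'High'},
    {'class_name': 'Parcels', 'confidence': 'Medium'},
    {'class_name': 'Urban',   'confidence': 'Low'},
    {'class_name': 'Water',   'confidence': 'High'},
]
first_round = select_by_confidence(polygons, minimum='High')
second_round = select_by_confidence(polygons, minimum='Medium')
print(len(first_round), len(second_round))
```

Records with a missing or misspelled confidence value are dropped by this filter, which is one more reason to use the dropdown widget from Step 3.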

### **notes** (Text)
Captures important context:
- "Mixed native and plantation forest"
- "Recently cleared, might be transitioning"
- "Shadow from clouds, verify later"

These notes help you remember why you made certain decisions and identify potential issues.

### **date_digitized** (Text)
Tracks when labels were created. Useful if:
- You digitize in multiple sessions
- Land cover changes over time
- You need to match labels to specific imagery dates

---

### 💡 Pro Tip: Iterative Refinement

Don't aim for perfection on first pass:
1. **First pass**: Digitize obvious, high-confidence examples (30-50 per class)
2. **Train initial model**: See what the model learns
3. **Analyze errors**: Where does the model fail?
4. **Second pass**: Add more examples in confused areas
5. **Iterate**: Repeat until performance is acceptable

This iterative approach is more efficient than trying to create perfect labels upfront.
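
For step 3, one simple way to find where the model fails is to count which pairs of classes get mixed up most often. The sketch below uses invented labels purely for illustration; they are not results from any model, and `most_confused_pairs` is a hypothetical helper.

```python
from collections import Counter

def most_confused_pairs(y_true, y_pred, top=3):
    """Rank (true, predicted) class pairs by how often they are mixed up."""
    errors = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return errors.most_common(top)

# Made-up labels for illustration only.
y_true = ['Parcels', 'Parcels', 'Urban', 'Forest', 'Agriculture', 'Parcels', 'Water']
y_pred = ['Urban',   'Urban',   'Urban', 'Forest', 'Parcels',     'Parcels', 'Water']

for (true_cls, pred_cls), n in most_confused_pairs(y_true, y_pred):
    print(f'{true_cls} predicted as {pred_cls}: {n}')
```

The pairs at the top of this ranking suggest where to digitize additional polygons in the second pass.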

---

# 4. Load and Validate Training Polygons

**Objective:**  
Import your digitized training polygons, validate the attribute structure, and perform initial quality checks.

**Key Validations:**
- File exists and is readable
- Required attributes present
- CRS is correct (EPSG:4326)
- No missing or invalid class labels
- Class distribution is reasonable
- Spatial distribution across AOI

In [None]:
# 4) Load and Validate Training Polygons

if not TRAINING_POLYGONS_PATH.exists():
    print('❌ Training polygons file not found!')
    print(f'   Expected: {TRAINING_POLYGONS_PATH}')
    print('\n📝 Complete QGIS digitization (Section 3) before proceeding.')
else:
    # Load training polygons
    training_polys = gpd.read_file(TRAINING_POLYGONS_PATH)
    print(f'✓ Training polygons loaded: {len(training_polys)} polygons')
    print(f'  CRS: {training_polys.crs}')
    
    # Validation 1: Check required attributes
    required_attrs = ['class_name', 'class_id']
    missing_attrs = [attr for attr in required_attrs if attr not in training_polys.columns]
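    if missing_attrs:
        # Stop here: the checks below index these columns and would fail with a less helpful error
        raise KeyError(f'Missing required attributes: {missing_attrs}. '
                       'Add them in QGIS (Section 3) and reload.')
    print('\n✓ Required attributes present')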
    
    # Validation 2: Check CRS
    if training_polys.crs != 'EPSG:4326':
        print(f'\n⚠️  Reprojecting from {training_polys.crs} to EPSG:4326')
        training_polys = training_polys.to_crs('EPSG:4326')
        print('   ✓ Reprojected')
    
    # Validation 3: Check for missing values
    missing_class_name = training_polys['class_name'].isna().sum()
    missing_class_id = training_polys['class_id'].isna().sum()
    if missing_class_name > 0 or missing_class_id > 0:
        print(f'\n⚠️  Missing values: {missing_class_name} class_name, {missing_class_id} class_id')
        training_polys = training_polys.dropna(subset=['class_name', 'class_id'])
    
    # Validation 4: Check class names
    expected_classes = list(class_definitions.keys())
    actual_classes = training_polys['class_name'].unique().tolist()
    unexpected = [c for c in actual_classes if c not in expected_classes]
    if unexpected:
        print(f'\n⚠️  Unexpected class names: {unexpected}')
        print(f'   Expected: {expected_classes}')
    else:
        print('\n✓ All class names valid')
    
    # Validation 5: Class distribution (classes with no polygons are counted as 0)
    class_counts = (training_polys['class_name'].value_counts()
                    .reindex(expected_classes, fill_value=0))
    print('\n📊 Class Distribution:')
    print('=' * 50)
    for class_name, count in class_counts.items():
        bar = '█' * int(count / 2)
        print(f'  {class_name:15s}: {count:3d} {bar}')
    print('=' * 50)
    
    min_count = class_counts.min()
    max_count = class_counts.max()
    imbalance_ratio = max_count / min_count if min_count > 0 else float('inf')
    if imbalance_ratio > 3:
        print(f'\n⚠️  Class imbalance (ratio: {imbalance_ratio:.1f}:1)')
    else:
        print(f'\n✓ Class distribution reasonable (ratio: {imbalance_ratio:.1f}:1)')
    
    # Validation 6: Spatial distribution
    aoi = gpd.read_file(AOI_PATH)
    training_polys['centroid'] = training_polys.geometry.centroid
    aoi_bounds = aoi.total_bounds
    mid_x = (aoi_bounds[0] + aoi_bounds[2]) / 2
    mid_y = (aoi_bounds[1] + aoi_bounds[3]) / 2
    
    training_polys['quadrant'] = training_polys['centroid'].apply(
        lambda p: 'NE' if p.x >= mid_x and p.y >= mid_y else
                  'NW' if p.x < mid_x and p.y >= mid_y else
                  'SE' if p.x >= mid_x and p.y < mid_y else 'SW'
    )
    
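    # Count empty quadrants as 0 so the evenness check below can detect them
    quadrant_counts = (training_polys['quadrant'].value_counts()
                       .reindex(['NW', 'NE', 'SW', 'SE'], fill_value=0))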
    print('\n🗺️  Spatial Distribution (Quadrants):')
    print('=' * 50)
    for quadrant in ['NW', 'NE', 'SW', 'SE']:
        count = quadrant_counts.get(quadrant, 0)
        bar = '█' * int(count / 2)
        print(f'  {quadrant}: {count:3d} {bar}')
    print('=' * 50)
    
    if quadrant_counts.min() < len(training_polys) * 0.1:
        print('\n⚠️  Uneven spatial distribution')
    else:
        print('\n✓ Spatial distribution looks good')
    
    # Store metrics
    experiment_log['metrics']['total_polygons'] = len(training_polys)
    experiment_log['metrics']['class_distribution'] = class_counts.to_dict()
    experiment_log['metrics']['imbalance_ratio'] = float(imbalance_ratio)
    experiment_log['metrics']['spatial_distribution'] = quadrant_counts.to_dict()
    
    print(f'\n✓ Validation complete!')

**Outcome:**

- ✓ Training polygons loaded and validated
- 📊 Class distribution analyzed
- 🗺️ Spatial distribution checked

**🤔 Reflection Question 2:**

*Look at your class distribution. Is it balanced? If not, which class(es) need more examples? Why might some classes be harder to find than others in your AOI?*

[Write your answer here]

---

# 5. Earth Engine Initialization and Sentinel-2 Access

**Objective:**  
Connect to Google Earth Engine, load Sentinel-2 imagery for your AOI, and explore which spectral bands are most useful for classification.

**Key Decisions:**
1. **Time period**: Which date range to use?
2. **Cloud filtering**: How to handle clouds?
3. **Band selection**: Which bands to include?

In [None]:
# 5) Earth Engine Initialization

try:
    ee.Initialize()
    print('✓ Earth Engine initialized')
except Exception as e:
    print('❌ EE initialization failed. Attempting authentication...')
    ee.Authenticate()
    ee.Initialize()
    print('✓ EE initialized after authentication')

# Load AOI
aoi_gdf = gpd.read_file(AOI_PATH)
aoi_ee = geemap.geopandas_to_ee(aoi_gdf)
print(f'✓ AOI loaded')

# Define temporal parameters (Austral growing season — expanded)
START_DATE = '2018-10-01'
END_DATE = '2019-04-30'
print(f'\n📅 Time period: {START_DATE} to {END_DATE}')
print('   (Austral growing season — expanded)')

experiment_log['decisions']['temporal_range'] = {
    'start': START_DATE,
    'end': END_DATE,
    'rationale': 'Austral growing season (expanded shoulder months)'
}

## 🌟 The More You Know! — Sentinel-2 Bands

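Sentinel-2 has **13 spectral bands** at three spatial resolutions. The three 60 m bands (B1 coastal aerosol, B9 water vapour, B10 cirrus) mainly support atmospheric correction and are not used here; the ten bands below are the ones relevant for classification: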

### **10m Resolution** (Highest detail)
- **B2** (Blue, 490 nm)
- **B3** (Green, 560 nm)
- **B4** (Red, 665 nm)
- **B8** (NIR, 842 nm)

### **20m Resolution**
- **B5-B7** (Red Edge)
- **B8A** (Narrow NIR)
- **B11-B12** (SWIR)

### Band Selection Options:

**Option 1: RGB Only** (B4, B3, B2)
- ✅ Smallest, fastest
- ❌ Misses vegetation info

**Option 2: RGB + NIR** (B4, B3, B2, B8) ⭐ Recommended
- ✅ Adds vegetation (NDVI)
- ✅ Better water detection (NDWI)
- ✅ Still manageable size

**Option 3: 10-band Multi** (all 10 m and 20 m bands)
- ✅ Maximum information
- ❌ Larger files, more complex

We'll use **Option 2 (RGB + NIR)** for Week 3.
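
The size tradeoff between the three options can be estimated before extracting anything. The sketch below assumes 5,000 patches of 16 × 16 pixels stored uncompressed as float32; those numbers and the function name are illustrative assumptions, not values fixed by this activity.

```python
def dataset_size_mib(n_patches, patch_px, n_bands, bytes_per_value=4):
    """Uncompressed size in MiB of a patch array (float32 by default)."""
    return n_patches * patch_px * patch_px * n_bands * bytes_per_value / 1024**2

# Assumed for illustration: 5,000 patches of 16 x 16 pixels.
options = {'RGB': 3, 'RGB + NIR': 4, '10-band': 10}
sizes = {name: dataset_size_mib(5000, 16, bands) for name, bands in options.items()}
for name, size in sizes.items():
    print(f'{name:10s} {size:6.1f} MiB')
```

Size scales linearly with the number of bands, so the 10-band option costs 2.5 times the storage and I/O of RGB + NIR for the same patches.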

In [None]:
# 5 continued) Access Sentinel-2 Imagery

# Load Sentinel-2 collection
s2 = ee.ImageCollection('COPERNICUS/S2_SR_HARMONIZED') \
    .filterBounds(aoi_ee) \
    .filterDate(START_DATE, END_DATE) \
    .filter(ee.Filter.lt('CLOUDY_PIXEL_PERCENTAGE', 20))

image_count = s2.size().getInfo()
print(f'\n📡 Found {image_count} Sentinel-2 images')

if image_count == 0:
    print('\n⚠️  No images found! Try:')
    print('   1. Expanding date range')
    print('   2. Relaxing cloud filter')
else:
    # Cloud masking function
    def mask_s2_clouds(image):
        qa = image.select('QA60')
        cloud_mask = qa.bitwiseAnd(1 << 10).eq(0).And(
                     qa.bitwiseAnd(1 << 11).eq(0))
        return image.updateMask(cloud_mask)
    
    # Apply cloud masking and create median composite
    s2_masked = s2.map(mask_s2_clouds)
    s2_median = s2_masked.median().clip(aoi_ee)
    
    print('✓ Cloud masking applied')
    print('✓ Median composite created')
    
    # Select bands
    bands_rgb_nir = ['B4', 'B3', 'B2', 'B8']  # Red, Green, Blue, NIR
    s2_rgb_nir = s2_median.select(bands_rgb_nir)
    
    bands_multi = ['B2', 'B3', 'B4', 'B5', 'B6', 'B7', 'B8', 'B8A', 'B11', 'B12']
    s2_multi = s2_median.select(bands_multi).resample('bilinear').reproject(
        crs='EPSG:4326', scale=10
    )
    
    print(f'\n📊 Band Selections:')
    print(f'   RGB+NIR (4 bands): {bands_rgb_nir}')
    print(f'   Multi (10 bands): {bands_multi}')
    
    # Calculate spectral indices
    nir = s2_median.select('B8')
    red = s2_median.select('B4')
    green = s2_median.select('B3')
    swir1 = s2_median.select('B11')
    
    ndvi = nir.subtract(red).divide(nir.add(red)).rename('NDVI')
    ndwi = green.subtract(nir).divide(green.add(nir)).rename('NDWI')
    ndbi = swir1.subtract(nir).divide(swir1.add(nir)).rename('NDBI')
    
    s2_with_indices = s2_median.addBands([ndvi, ndwi, ndbi])
    
    print('✓ Spectral indices calculated (NDVI, NDWI, NDBI)')
    
    # Store decisions
    experiment_log['decisions']['imagery'] = {
        'collection': 'COPERNICUS/S2_SR_HARMONIZED',
        'cloud_threshold': 20,
        'composite_method': 'median',
        'image_count': int(image_count),
        'bands_rgb_nir': bands_rgb_nir,
        'bands_multi': bands_multi
    }
    
    print('\n✓ Sentinel-2 data ready for patch extraction')

**Outcome:**

- ✓ Earth Engine connected
- ✓ Sentinel-2 composite created
- ✓ Two band configurations ready (RGB+NIR and 10-band)
- ✓ Spectral indices calculated
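
The cloud mask in the cell above tests two bits of the `QA60` band: bit 10 flags opaque clouds and bit 11 flags cirrus. The same test in plain Python, as a minimal sketch with a hypothetical helper name:

```python
CLOUD_BIT = 1 << 10   # bit 10: opaque clouds
CIRRUS_BIT = 1 << 11  # bit 11: cirrus clouds

def is_clear(qa60_value):
    """True when neither the opaque-cloud nor the cirrus bit is set."""
    return (qa60_value & CLOUD_BIT) == 0 and (qa60_value & CIRRUS_BIT) == 0

for value in (0, 1024, 2048, 3072):
    print(f'QA60={value:4d} ({value:012b}) clear={is_clear(value)}')
```

Only pixels where both bits are zero survive `updateMask`, so the median composite is computed from observations flagged as cloud-free.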

**🤔 Reflection Question 3:**

*Why use a median composite instead of a single image? What are the tradeoffs?*

*Hint: Think about clouds, temporal variability, and phenology.*

[Write your answer here]

---

## ✅ Part A Complete!

You've completed:
1. ✓ Environment setup and class definitions
2. ✓ QGIS digitization guide
3. ✓ Training polygon validation
4. ✓ Sentinel-2 data access

**Next:** Continue to **Part B** for:
- Patch extraction
- Spectral analysis
- Spatial train/val splitting
- Storage format comparison
- Experiment logging
- Self-assessment