# Google Satellite Embeddings: Data Preprocessing

This notebook covers an alternative preprocessing workflow that uses **Google Satellite Embeddings** instead of raw Sentinel-2 spectral bands. The embeddings are 64-dimensional per-pixel vectors produced by a pre-trained deep learning model, and they may capture spatial features that raw spectral bands miss.

**Workflow Steps:**
1.  **Data Acquisition:** Accessing the `GOOGLE/SATELLITE_EMBEDDING/V1/ANNUAL` dataset.
2.  **Land Masking:** Because the Embeddings dataset has no spectral bands for calculating water indices (like NDWI), we use the **Dynamic World** dataset (`GOOGLE/DYNAMICWORLD/V1`) to build the water mask.
3.  **Spatial Extraction:** Extracting the 64-dimensional embedding vectors at specific coordinates corresponding to field survey sounding points.
4.  **Dataset Preparation:** Splitting the sounding points into Training (70%) and Testing (30%) sets, extracting features for each set, and saving them as NumPy arrays for model development.
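
The 70/30 partition in step 4 can be illustrated with a small standard-library sketch. The helper name `shuffle_split` is hypothetical, and the total of 1055 points is taken from the logged batch counts further down (738 + 317). It assumes the test size is rounded up, which is consistent with those counts; it does not reproduce the exact row assignment that `train_test_split` makes.

```python
import math
import random

def shuffle_split(n_samples, test_size=0.3, seed=42):
    """Hypothetical helper: split row indices into train and test lists."""
    # Round the test share up (assumption, consistent with 738/317 in the log)
    n_test = math.ceil(test_size * n_samples)
    indices = list(range(n_samples))
    # Seeded shuffle so the partition is repeatable between runs
    random.Random(seed).shuffle(indices)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = shuffle_split(1055)
print(len(train_idx), len(test_idx))  # 738 317
```

Fixing the seed keeps the same points in each set across runs, so models trained on different feature sets can be compared on identical test points.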

**References:**
1.  Google Earth Engine Catalog. *Google Satellite Embeddings V1 Annual*.
2.  Google Earth Engine Catalog. *Dynamic World V1*.

In [2]:
import ee
import geemap
import os
import math
import numpy as np
import pandas as pd
import geopandas as gpd
from sklearn.model_selection import train_test_split


In [3]:
# ==========================================
# 1. CONFIGURATION & GEE SETUP
# ==========================================

# --- Input Paths ---
SOUNDING_PATH = r'data\sounding\sounding.geojson'
AOI_PATH = r'data\aoi_gili_ketapang.shp'

# --- Processing Parameters ---
DEPTH_COL = 'z1'      # Target variable column
TARGET_YEAR = 2018    # Year for Satellite Embeddings & Dynamic World

# --- Output Paths ---
DATASET_OUTPUT_DIR = r'train-test dataset\embeddings'
DRIVE_FOLDER = 'Google Earth Engine'
FILENAME_PREFIX = 'GiliKetapang_Embeddings'

# Ensure output directory exists
os.makedirs(DATASET_OUTPUT_DIR, exist_ok=True)

# --- Initialize Google Earth Engine ---
try:
    # Try initializing with the specific project
    ee.Initialize(project='mwahyur')
except Exception as e:
    # Force authentication if initialization fails
    print("Authentication required...")
    ee.Authenticate(force=True)
    ee.Initialize(project='mwahyur')

print("Google Earth Engine Initialized.")

Google Earth Engine Initialized.


## 2. Helper Functions
The following functions handle the interaction between local GeoDataFrames and Google Earth Engine (GEE).
* **`fc_to_pandas`**: Converts the feature list returned by GEE into a Pandas DataFrame of feature properties.
* **`clean_gdf_for_gee`**: Ensures geometries are strictly 2D (removing Z-coordinates) and projected to WGS84, as GEE rejects 3D geometries.
* **`extract_in_chunks`**: Batches the extraction process to avoid GEE memory limits and timeouts when processing large numbers of points.
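
As a rough illustration of the batching arithmetic, the sketch below computes the index ranges that a chunk size of 200 produces. The helper name `chunk_bounds` is hypothetical and is not part of the notebook; it only mirrors the `math.ceil` and slice logic, without any GEE calls.

```python
import math

def chunk_bounds(n_points, chunk_size=200):
    """Hypothetical helper: (start, stop) index pairs covering n_points in order."""
    num_chunks = math.ceil(n_points / chunk_size)
    return [
        (i * chunk_size, min((i + 1) * chunk_size, n_points))
        for i in range(num_chunks)
    ]

bounds = chunk_bounds(738)
sizes = [stop - start for start, stop in bounds]
print(sizes)  # [200, 200, 200, 138]
```

The last batch is simply whatever remains, so no padding or special casing is needed.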

In [4]:
# ==========================================
# 2. HELPER FUNCTIONS
# ==========================================

def fc_to_pandas(features):
    """
    Converts a GEE FeatureCollection (list of dictionaries) to a Pandas DataFrame.
    """
    if not features:
        return pd.DataFrame()
    return pd.DataFrame([f['properties'] for f in features])

def clean_gdf_for_gee(gdf, label_col):
    """
    Sanitizes GeoDataFrame for GEE upload.
    
    Actions:
    1. Forces 2D geometry (Drops Z-coordinates which cause GEE errors).
    2. Reprojects to EPSG:4326 (WGS84).
    """
    # Work on a copy so the caller's GeoDataFrame is not modified
    gdf = gdf.copy()

    # Force 2D (drop Z dimension); the constructor call assumes Point geometries
    if gdf.geometry.has_z.any():
        print("   [Info] Dropping Z-coordinates for GEE compatibility...")
        gdf['geometry'] = gdf.geometry.apply(
            lambda geom: None if geom is None else
            (type(geom)(geom.x, geom.y) if geom.has_z else geom)
        )

    # Ensure WGS84
    if gdf.crs != "EPSG:4326":
        gdf = gdf.to_crs("EPSG:4326")

    return gdf[[label_col, 'geometry']].copy()

def extract_in_chunks(gdf, image, label_col, chunk_size=200):
    """
    Extracts raster values in batches to prevent GEE timeouts.
    
    Logic:
    1. Uploads a chunk of points to GEE.
    2. Attempts 'sampleRegions' (exact pixel match).
    3. If the whole chunk returns no samples (all pixels masked), falls back
       to the mean over a 30m buffer around each point.
    """
    gdf = clean_gdf_for_gee(gdf, label_col)
    results = []
    num_chunks = math.ceil(len(gdf) / chunk_size)
    
    print(f"Processing {len(gdf)} points in {num_chunks} batches...")
    
    for i in range(num_chunks):
        chunk = gdf.iloc[i*chunk_size : (i+1)*chunk_size]
        try:
            ee_chunk = geemap.geopandas_to_ee(chunk)
            
            # Attempt 1: Exact Extraction
            samples = image.sampleRegions(
                collection=ee_chunk, 
                properties=[label_col], 
                scale=10, 
                geometries=False
            )
            features = samples.getInfo()['features']
            
            # Attempt 2: Buffer fallback if exact pixel is masked
            if not features:
                buffered_chunk = ee_chunk.map(lambda f: f.buffer(30))
                samples_buffered = image.reduceRegions(
                    collection=buffered_chunk, 
                    reducer=ee.Reducer.mean(), 
                    scale=10
                )
                features = samples_buffered.filter(ee.Filter.notNull(['A01'])).getInfo()['features']

            if features:
                df_chunk = fc_to_pandas(features)
                # Keep only rows where the embeddings (checked via A01) are present
                if 'A01' in df_chunk.columns:
                    df_chunk = df_chunk.dropna(subset=['A01'])
                    results.append(df_chunk)
                    print(f"Batch {i+1}: Extracted {len(df_chunk)} samples")
                else:
                    print(f"Batch {i+1}: No embedding bands in response")
            else:
                print(f"Batch {i+1}: No valid pixels found")
                
        except Exception as e:
            print(f"Batch {i+1} Error: {str(e)}")
            continue

    if not results:
        return pd.DataFrame()
        
    return pd.concat(results, ignore_index=True)

## 3. Dataset Preparation & Map Export
We prepare the datasets by filtering for the target year and Area of Interest (AOI).

**Key Logic:**
* **Embeddings Source**: `GOOGLE/SATELLITE_EMBEDDING/V1/ANNUAL`.
* **Water Masking**: The `label` band of `GOOGLE/DYNAMICWORLD/V1` (class 0 = water) is used to update the mask of the embeddings image. This removes land pixels from the exported image, which matters for bathymetry visualization and analysis. Point extraction in Section 4 uses the unmasked embeddings.
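
The masking idea can be shown on toy arrays. The sketch below is a NumPy analogy only, not the Earth Engine API: the 2x3 raster, the two-band "embedding", and the use of NaN to stand in for masked pixels are all assumptions made for illustration.

```python
import numpy as np

# Toy 2x3 "label" raster using the Dynamic World convention (0 = water)
label = np.array([[0, 0, 1],
                  [0, 6, 0]])

# Toy embedding raster with 2 bands instead of 64, shape (bands, rows, cols)
embedding = np.arange(12, dtype=float).reshape(2, 2, 3)

# Boolean mask that is True over water pixels
water = label == 0

# Keep values over water and blank out land; the mask broadcasts over the band axis
masked = np.where(water, embedding, np.nan)

print(int(water.sum()))  # 4 water pixels
```

Every band of a land pixel is blanked together, which is the behaviour wanted when one mask is applied to a multi-band image.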

In [5]:
# ==========================================
# 3. PREPARE LAYERS & EXPORT MAP
# ==========================================

# 1. Load AOI
gdf_aoi = gpd.read_file(AOI_PATH)
if gdf_aoi.crs != "EPSG:4326":
    gdf_aoi = gdf_aoi.to_crs("EPSG:4326")
aoi_geometry = geemap.geopandas_to_ee(gdf_aoi).geometry()

# 2. Load Satellite Embeddings
embeddings = ee.ImageCollection('GOOGLE/SATELLITE_EMBEDDING/V1/ANNUAL') \
    .filterDate(f'{TARGET_YEAR}-01-01', f'{TARGET_YEAR+1}-01-01') \
    .filterBounds(aoi_geometry) \
    .mosaic()

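# 3. Load Dynamic World Water Mask
# Class 0 represents "Water".
# Note: mosaic() keeps the last unmasked pixel in collection order,
# so the mask reflects a single classification, not a yearly majority.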
dw_water_mask = ee.ImageCollection("GOOGLE/DYNAMICWORLD/V1") \
    .filterDate(f'{TARGET_YEAR}-01-01', f'{TARGET_YEAR+1}-01-01') \
    .filterBounds(aoi_geometry) \
    .mosaic() \
    .select('label').eq(0) 

# 4. Apply Mask & Clip
# This image is used for the visual export (Land is transparent)
image_to_export = embeddings.clip(aoi_geometry).updateMask(dw_water_mask)

# 5. Export to Google Drive
task = ee.batch.Export.image.toDrive(
    image=image_to_export,
    description=f"{FILENAME_PREFIX}_{TARGET_YEAR}",
    fileNamePrefix=f"{FILENAME_PREFIX}_{TARGET_YEAR}",
    folder=DRIVE_FOLDER,
    region=aoi_geometry,
    scale=10,
    crs='EPSG:4326',
    maxPixels=1e13,
    fileFormat='GeoTIFF'
)

task.start()
print(f"Export task started. Destination: Drive/{DRIVE_FOLDER}/{FILENAME_PREFIX}_{TARGET_YEAR}.tif")

Export task started. Destination: Drive/Google Earth Engine/GiliKetapang_Embeddings_2018.tif


## 4. Feature Extraction & Dataset Splitting
Using the helper functions defined above, we extract the 64 embedding bands (`A00` - `A63`) for each survey point. The sounding points are split into training and testing sets before extraction; each set is then cleaned and saved.
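
The cleaning step amounts to picking the band columns and dropping any row with a missing feature or target. Below is a minimal sketch on made-up values; the column list and the numbers are illustrative assumptions, not extracted data.

```python
import numpy as np

# Hypothetical property names as they might come back from an extraction
columns = ["A01", "A02", "z1", "system:index", "Area"]

# Band columns: an 'A' followed only by digits (excludes e.g. 'Area')
band_cols = sorted(c for c in columns if c.startswith("A") and c[1:].isdigit())

# Toy feature matrix (rows = points, columns = band_cols) and toy depths
X = np.array([[0.1, 0.2],
              [np.nan, 0.3],
              [0.4, 0.5]])
y = np.array([-1.0, -2.0, np.nan])

# A row is kept only if all features and the target are present
mask = ~np.isnan(X).any(axis=1) & ~np.isnan(y)
X_clean, y_clean = X[mask], y[mask]

print(band_cols, X_clean.shape)
```

Applying one shared mask to `X` and `y` keeps features and targets row-aligned after filtering.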

In [None]:
# ==========================================
# 4. EXECUTE EXTRACTION
# ==========================================

# 1. Load Sounding Points
gdf_points = gpd.read_file(SOUNDING_PATH)
gdf_points[DEPTH_COL] = pd.to_numeric(gdf_points[DEPTH_COL], errors='coerce')
gdf_points = gdf_points.dropna(subset=[DEPTH_COL])

# 2. Train/Test Split
print("Partitioning dataset (70% Train / 30% Test)...")
gdf_train, gdf_test = train_test_split(gdf_points, test_size=0.3, random_state=42)

# 3. Run Extraction
# Note: We pass the unmasked 'embeddings' image, so the Dynamic World water
# mask is not applied here. The extraction function handles the buffer
# fallback and drops points with missing embeddings.
print("Extracting Training Set...")
df_train = extract_in_chunks(gdf_train, embeddings, DEPTH_COL)

print("Extracting Testing Set...")
df_test = extract_in_chunks(gdf_test, embeddings, DEPTH_COL)

if df_train.empty or df_test.empty:
    raise ValueError("Error: No training or testing data extracted. Check date range or AOI.")

# 4. Save to Disk
print("Saving artifacts...")

def save_npy(df, prefix):
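    # Select embedding bands only ('A' followed by digits, e.g. A01)
    cols = sorted(c for c in df.columns if c.startswith('A') and c[1:].isdigit())
    # Cast to float so np.isnan works even if pandas inferred object dtype
    X = df[cols].to_numpy(dtype=float)
    y = df[DEPTH_COL].to_numpy(dtype=float)
    
    # Remove rows with NaN in any feature or in the target
    mask = ~np.isnan(X).any(axis=1) & ~np.isnan(y)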
    
    np.save(os.path.join(DATASET_OUTPUT_DIR, f'X_{prefix}.npy'), X[mask])
    np.save(os.path.join(DATASET_OUTPUT_DIR, f'y_{prefix}.npy'), y[mask])
    
    # Save feature names once (from training set)
    if prefix == 'train':
        np.save(os.path.join(DATASET_OUTPUT_DIR, 'feature_names.npy'), np.array(cols))

save_npy(df_train, 'train')
save_npy(df_test, 'test')

print("-" * 30)
print(f"Success! Data saved to: '{DATASET_OUTPUT_DIR}'")

Splitting Train/Test...
Extracting Training Set...
   [Info] Dropping Z-coordinates for GEE compatibility...
Processing 738 points in 4 batches...
Batch 1: Extracted 200 samples
Batch 2: Extracted 200 samples
Batch 3: Extracted 200 samples
Batch 4: Extracted 138 samples
Extracting Test Set...
   [Info] Dropping Z-coordinates for GEE compatibility...
Processing 317 points in 2 batches...
Batch 1: Extracted 200 samples
Batch 2: Extracted 117 samples
Saving .npy files...
------------------------------
Starting Data Extraction...
Success! Map export started and Datasets saved to: train-test dataset\embeddings
