# VECTOR TILING FOR MAP GENERALIZATION

Generalization means creating many copies of a single ‘source of truth’ and tailoring each one to its view context.

Doing this on the fly is labor-intensive in a static paper-mapping context, especially en masse across variable map scales and settings (urban, semi-urban, rural).

There is value in adopting dynamic mapping techniques into the existing processing workflow:

- Vector tiling: highly optimized, customizable, retains records’ attributes (versus image tiles)
- Customization, filtering, and alignment between 'in-house' GRID3 data and third-party sources like OpenStreetMap
- API endpoint targets: maps can diff and grow as the sources of truth do
- Tippecanoe: a remarkably fast processing library. It can be somewhat unpredictable, with emergent effects when geometries are processed in a ‘recursive tiled’ manner, but the benefits outweigh the headaches
- Plus: interest in self-hosting (Docker/containerization), driven by anxiety after ‘open’ data portals went offline nationwide



<!-- - What works?

- What doesn't? -->
<!-- 
Note: “Emergent properties” -> usually helpful automation, as long as there are constraints, but need to watch for weirdness and misalignment with what a legend would otherwise say a layer/category should look like

(For instance, interpolation between color steps, etc) -->

===

# Processing pipeline

1. **Download** - Fetch Overture Maps data (as GeoParquet) and GRID3 data (ArcGIS Online feature services) for the specified extent
2. **Convert to FlatGeobuf** - Transform GeoParquet to FlatGeobuf for compatibility and efficiency
3. **Tile** - Generate PMTiles using tippecanoe with bespoke per-layer settings
4. **View** - Display the tiles with MapLibre (open spec)

## File formats
- **GeoParquet (.parquet)** - Download format (compact, fast queries via DuckDB)
- **FlatGeobuf (.fgb)** - Conversion target for optimal tippecanoe support
- **GeoJSON (.geojson)** - Fallback support for small datasets
- **PMTiles (.pmtiles)** - Dynamic vector tiles, served from a single static file

## CONFIG

`.env` -> `config.py`: `config.py` imports the environment variables defined in the repository-root `.env` file.
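
A minimal sketch of how the `EXTENT_*` variables could be turned into the `CONFIG["extent"]` structure used below. The function name `load_extent` and its validation rule are assumptions for illustration; the real `config.py` is not shown in this notebook and may differ.

```python
import os

def load_extent(env=os.environ):
    """Read EXTENT_* variables into a dict shaped like CONFIG["extent"].

    Hypothetical helper: the variable names come from the .env printout in
    this notebook, the function itself is illustrative only.
    """
    west = float(env["EXTENT_WEST"])
    south = float(env["EXTENT_SOUTH"])
    east = float(env["EXTENT_EAST"])
    north = float(env["EXTENT_NORTH"])
    if west >= east or south >= north:
        raise ValueError("extent must satisfy west < east and south < north")
    return {
        "coordinates": (west, south, east, north),
        "buffer_degrees": float(env.get("EXTENT_BUFFER", 0)),
    }

# Example with the values shown in the configuration output
example_env = {
    "EXTENT_WEST": "19.5",
    "EXTENT_SOUTH": "-12.5",
    "EXTENT_EAST": "30.5",
    "EXTENT_NORTH": "-3.5",
    "EXTENT_BUFFER": "0",
}
extent_config = load_extent(example_env)
print(extent_config)
```

Passing a plain dict instead of `os.environ` keeps the parsing easy to check without touching the real environment.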

In [1]:
# ============================================================
# CONFIGURATION - Run this cell first
# ============================================================
# This cell initializes all configuration and should be run 
# first. Re-run this cell to reload configuration changes.
# ============================================================

import sys
import os
from pathlib import Path
from dotenv import load_dotenv

# Setup paths
notebook_dir = Path.cwd()
processing_dir = notebook_dir.parent  # 1-processing
repo_root = processing_dir.parent     # basemap (repository root)

# Add processing directory to path
if str(processing_dir) not in sys.path:
    sys.path.insert(0, str(processing_dir))

# Load environment variables from REPOSITORY ROOT (monorepo-wide .env)
env_path = repo_root / '.env'
load_dotenv(env_path)
print(f"✓ Loaded environment from repository root: {env_path}")
print(f"  DATA_DISK = {os.environ.get('DATA_DISK', 'not set')}")

# Import configuration (will also load .env via config.py)
from config import (
    get_config,
    ensure_directories,
    print_config_summary,
    SCRIPTS_DIR,
    OUTPUT_DIR,
    OVERTURE_DATA_DIR,
    GRID3_DATA_DIR,
    SCRATCH_DIR,
)

# Import processing functions
from scripts import (
    download_overture_data,
    convert_file,
    convert_parquet_to_fgb,
    batch_convert_directory,
    process_to_tiles,
    create_tilejson,
    download_arcgis_data,
    batch_download_arcgis_layers,
)

# Additional libraries
import pandas as pd
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')

# ============================================================
# GLOBAL CONFIGURATION - Available in all cells below
# ============================================================
CONFIG = get_config()

# ============================================================
# EXTENT CONFIGURATION - SINGLE SOURCE OF TRUTH
# ============================================================
# Extent is now configured in .env file (repository root)
# To change the geographic area, edit .env and restart kernel
# 
# Current extent values from .env:
print(f"\n=== EXTENT FROM ENVIRONMENT (.env) ===")
print(f"  West (lon_min):  {os.environ.get('EXTENT_WEST', 'not set')}")
print(f"  South (lat_min): {os.environ.get('EXTENT_SOUTH', 'not set')}")
print(f"  East (lon_max):  {os.environ.get('EXTENT_EAST', 'not set')}")
print(f"  North (lat_max): {os.environ.get('EXTENT_NORTH', 'not set')}")
print(f"  Buffer (degrees): {os.environ.get('EXTENT_BUFFER', 'not set')}")
print(f"\n  Combined tuple: {CONFIG['extent']['coordinates']}")
print(f"  Buffer: {CONFIG['extent']['buffer_degrees']} degrees")

# DO NOT override extent here - edit .env instead!
# CONFIG["extent"]["coordinates"] is automatically loaded from .env

# Processing options (can still be customized here)
CONFIG["tiling"]["input_dirs"] = [SCRATCH_DIR]  # Read FlatGeobuf files from scratch
CONFIG["download"]["verbose"] = True
CONFIG["conversion"]["verbose"] = True
CONFIG["tiling"]["verbose"] = True
CONFIG["tiling"]["parallel"] = True

# Create directories and verify
ensure_directories()

# Verification
print("\n=== CONFIGURATION VERIFICATION ===")
print(f"Repository root:       {repo_root}")
print(f"Environment .env:      {env_path}")
print(f"Environment DATA_DISK: {os.environ.get('DATA_DISK', 'NOT SET')}")
print(f"Config uses:           {CONFIG['paths']['data_dir'].parent}")

print_config_summary(CONFIG)
print("\n✓ Configuration loaded - CONFIG available in all cells")
print("\n⚠️  To change extent: Edit .env file and restart kernel")
print("   All downloads, processing, and viewing will use the same extent")


✓ Loaded environment from repository root: /Users/matthewheaton/GitHub/.env
  DATA_DISK = .

=== EXTENT FROM ENVIRONMENT (.env) ===
  West (lon_min):  19.5
  South (lat_min): -12.5
  East (lon_max):  30.5
  North (lat_max): -3.5
  Buffer (degrees): 0

  Combined tuple: (19.5, -12.5, 30.5, -3.5)
  Buffer: 0.0 degrees

=== CONFIGURATION VERIFICATION ===
Repository root:       /Users/matthewheaton/GitHub
Environment .env:      /Users/matthewheaton/GitHub/.env
Environment DATA_DISK: .
Config uses:           /Users/matthewheaton/GitHub/basemap
PROJECT CONFIGURATION
Project root:        /Users/matthewheaton/GitHub/basemap/1-processing
Scripts directory:   /Users/matthewheaton/GitHub/basemap/1-processing/scripts
Notebooks directory: /Users/matthewheaton/GitHub/basemap/1-processing/notebooks
Data directory:      /Users/matthewheaton/GitHub/basemap/data
Scratch directory:   /Users/matthewheaton/GitHub/basemap/data/2-scratch
Output directory:    /Users/matthewheaton/GitHub/basemap/data/3-pmtiles

## 1. Download Overture Data with DuckDB

Use the `downloadOverture.py` module to fetch geospatial data from Overture Maps.

For example, this can replace periodic fetches of Geofabrik OSM extracts (shapefiles).
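
A hedged sketch of the kind of bounding-box query DuckDB can run against Overture's GeoParquet files. It only builds the SQL string. The helper name `build_overture_query`, the release placeholder, and the output path are assumptions; the actual queries live in `downloadOverture.py` and its template, which are not shown here. Running the SQL would also need DuckDB with S3 access configured.

```python
def build_overture_query(theme, feature_type, extent, release, out_path):
    """Build a DuckDB COPY statement that keeps features whose bbox intersects the extent.

    Hypothetical helper for illustration; `release` must be a real Overture
    release identifier supplied by the caller.
    """
    west, south, east, north = extent
    source = (
        f"s3://overturemaps-us-west-2/release/{release}/"
        f"theme={theme}/type={feature_type}/*"
    )
    return (
        "COPY (\n"
        "  SELECT *\n"
        f"  FROM read_parquet('{source}', hive_partitioning=1)\n"
        # Two boxes intersect when each one starts before the other ends
        f"  WHERE bbox.xmin <= {east} AND bbox.xmax >= {west}\n"
        f"    AND bbox.ymin <= {north} AND bbox.ymax >= {south}\n"
        f") TO '{out_path}' (FORMAT PARQUET);"
    )

sql = build_overture_query(
    theme="buildings",
    feature_type="building",
    extent=(19.5, -12.5, 30.5, -3.5),
    release="YYYY-MM-DD.N",  # placeholder, not a real release name
    out_path="buildings.parquet",
)
print(sql)
```

Filtering on the `bbox` columns lets DuckDB skip most row groups before any geometry is decoded, which is where much of the speed comes from.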

In [None]:
# Download Overture Maps data
print("=== STEP 1: DOWNLOADING OVERTURE DATA ===")
download_results = download_overture_data(
    extent=CONFIG["extent"]["coordinates"],
    buffer_degrees=CONFIG["extent"]["buffer_degrees"],
    template_path=str(CONFIG["paths"]["template_path"]),
    verbose=CONFIG["download"]["verbose"],
    project_root=str(CONFIG["paths"]["project_root"]),
    overture_data_dir=str(CONFIG["paths"]["overture_data_dir"])
)

print(f"Download completed: {download_results['success']}")
print(f"Sections processed: {download_results['processed_sections']}")
if download_results["errors"]:
    print(f"Errors encountered: {len(download_results['errors'])}")
    for error in download_results["errors"]:
        print(f"  - {error}")
print()

## 2a. Download ArcGIS Feature Server Data

Download geospatial data from hosted ArcGIS Feature Server REST API endpoints. Any Esri-hosted feature service can be used as an endpoint.

- **Automatic pagination** - Handles ArcGIS's limit of 1000-2000 features per request
- **Spatial filtering** - Apply a bounding-box filter to download only features in the area of interest (AOI)
- **Formats** - Download as GeoJSON or convert directly to FlatGeobuf
- **Batch processing** - Download multiple layers with one function call
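
The pagination idea can be sketched without any network access. Here `fetch_page` is a hypothetical callable standing in for one `/query` request using the `resultOffset` and `resultRecordCount` parameters; the real `downloadArcGIS.py` may instead rely on the service's `exceededTransferLimit` flag.

```python
def fetch_all_features(fetch_page, page_size=1000):
    """Collect features page by page until a short (or empty) page is returned.

    fetch_page(offset, count) -> list of features. It is an assumed wrapper
    around a single Feature Server query, injected so the loop can be tested.
    """
    features = []
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        features.extend(page)
        if len(page) < page_size:
            break  # last page reached
        offset += len(page)
    return features

# Stand-in for a service holding 2500 features
fake_service = list(range(2500))

def fake_fetch(offset, count):
    return fake_service[offset:offset + count]

all_features = fetch_all_features(fake_fetch, page_size=1000)
print(len(all_features))
```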

### GRID3 DRC Layers
- https://services3.arcgis.com/BU6Aadhn6tbBEdyk/arcgis/rest/services

In [None]:
# Download ArcGIS Feature Server data (optional - skip if not needed)
print("=== STEP 2a: DOWNLOADING ARCGIS DATA (OPTIONAL) ===")
print(f"Using extent from CONFIG: {CONFIG['extent']['coordinates']}")
print(f"  Longitude: {CONFIG['extent']['coordinates'][0]} to {CONFIG['extent']['coordinates'][2]}")
print(f"  Latitude: {CONFIG['extent']['coordinates'][1]} to {CONFIG['extent']['coordinates'][3]}")

# Define ArcGIS layers to download
# Uncomment and customize the layers you need
arcgis_layers = [
    {
        'url': 'https://services3.arcgis.com/BU6Aadhn6tbBEdyk/arcgis/rest/services/GRID3_COD_health_zones_v7_0/FeatureServer/0',
        'name': 'health_zones',
        'where': '1=1'  # Download all features (can add SQL filter here)
    },
    {
        'url': 'https://services3.arcgis.com/BU6Aadhn6tbBEdyk/arcgis/rest/services/GRID3_COD_Settlement_Extents_v3_1/FeatureServer/0',
        'name': 'settlement_extents',
        'where': '1=1'
    },
    {
        'url': 'https://services3.arcgis.com/BU6Aadhn6tbBEdyk/ArcGIS/rest/services/GRID3_COD_health_areas_v7_0/FeatureServer/0',
        'name': 'health_areas',
        'where': '1=1'
    },
    {
        'url': 'https://services3.arcgis.com/BU6Aadhn6tbBEdyk/ArcGIS/rest/services/GRID3_COD_settlement_names_v7_0/FeatureServer/0',
        'name': 'settlement_names',
        'where': '1=1'
    },
    {
        'url': 'https://services3.arcgis.com/BU6Aadhn6tbBEdyk/ArcGIS/rest/services/COD_GRID3_health_facilities_v7_0/FeatureServer/0',
        'name': 'health_facilities',
        'where': '1=1'
    },
    {
        'url': 'https://services3.arcgis.com/BU6Aadhn6tbBEdyk/ArcGIS/rest/services/GRID3_COD_religious_centers_v1_0/FeatureServer/0',
        'name': 'religious_centers',
        'where': '1=1'
    },
]

# Download layers using the SAME EXTENT as Overture data
# This ensures all data layers align spatially
if arcgis_layers:
    arcgis_results = batch_download_arcgis_layers(
        layer_configs=arcgis_layers,
        output_dir=str(CONFIG["paths"]["scratch_dir"]),  # Save directly to scratch for tiling
        extent=CONFIG["extent"]["coordinates"],  # ← SAME EXTENT as Overture downloads
        output_format="fgb",  # Use FlatGeobuf for optimal tiling performance
        verbose=CONFIG["download"]["verbose"]
    )
    
    print(f"\nArcGIS Download Summary:")
    print(f"  Total layers: {arcgis_results['total_layers']}")
    print(f"  Successful: {arcgis_results['successful']}")
    print(f"  Failed: {arcgis_results['failed']}")
    
    for layer in arcgis_results['layers']:
        if layer['success']:
            print(f"  ✓ {layer['name']}: {layer['feature_count']:,} features")
        else:
            print(f"  ✗ {layer['name']}: {layer.get('error', 'Unknown error')}")
else:
    print("No ArcGIS layers configured. Edit arcgis_layers list above to download data.")
    print(f"\nNote: When configured, downloads will use extent: {CONFIG['extent']['coordinates']}")

# Option 2: Download a single layer (alternative approach)
# Also uses the same extent from CONFIG
# Uncomment and customize as needed:
# single_layer_result = download_arcgis_data(
#     service_url='https://services3.arcgis.com/BU6Aadhn6tbBEdyk/arcgis/rest/services/GRID3_COD_health_zones_v7_0/FeatureServer/0',
#     output_path=str(CONFIG["paths"]["scratch_dir"] / "health_zones.fgb"),
#     extent=CONFIG["extent"]["coordinates"],  # ← Uses CONFIG extent
#     output_format="fgb",
#     verbose=True
# )

In [None]:
# Generate centroids for administrative boundary labels
print("=== STEP 2b: GENERATING CENTROIDS FOR ADMINISTRATIVE LABELS ===")

# Import the centroid generation function
from scripts import batch_generate_centroids

# Define which layers need centroids for label positioning
# These correspond to the administrative boundary layers
layers_for_centroids = [
    'health_zones',   # Health zone polygons -> health_zones_centroids
    'health_areas',   # Health area polygons -> health_areas_centroids
]

# Generate centroids for specified layers
centroid_results = batch_generate_centroids(
    input_dir=str(CONFIG["paths"]["scratch_dir"]),  # Where polygon FGB files are
    output_dir=str(CONFIG["paths"]["scratch_dir"]),  # Save centroids alongside polygons
    layers=layers_for_centroids,                     # Only process these layers
    suffix='_centroids',                             # Output: layer_name_centroids.fgb
    verbose=CONFIG["download"]["verbose"]
)

print(f"\nCentroid Generation Summary:")
print(f"  Total layers: {centroid_results['total_layers']}")
print(f"  Successful: {centroid_results['successful']}")
print(f"  Failed: {centroid_results['failed']}")

for layer in centroid_results['layers']:
    if layer['success']:
        output_name = Path(layer['output_file']).name
        print(f"  ✓ {output_name}: {layer['feature_count']:,} centroids")
    else:
        print(f"  ✗ {Path(layer['input_file']).name}: {layer.get('error', 'Unknown error')}")

# List all files ready for tiling (including new centroids)
if CONFIG["paths"]["scratch_dir"].exists():
    all_fgb_files = sorted(CONFIG["paths"]["scratch_dir"].glob("*.fgb"))
    centroid_files = [f for f in all_fgb_files if '_centroids' in f.name]
    
    print(f"\n{'='*50}")
    print(f"Files ready for tiling:")
    print(f"  Total FlatGeobuf files: {len(all_fgb_files)}")
    print(f"  Centroid files: {len(centroid_files)}")
    if centroid_files:
        for f in centroid_files:
            print(f"    - {f.name}")
    print(f"  Location: {CONFIG['paths']['scratch_dir']}")


## 2b. Generate Centroids for Administrative Polygons

Generate interior centroid points for health zones and health areas. These will be used for single-label-per-polygon rendering in the map viewer (interior labels at lower zoom levels).

**Why centroids?**
- Guarantees one label per polygon (no duplicates across tile boundaries)
- `representative_point()` ensures label is always inside the polygon
- Preserves all attributes for label content
- Separate point layer is more efficient than point-based symbol placement on polygons
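
To illustrate why an interior point matters, the pure-Python sketch below computes the plain area-weighted centroid of a U-shaped polygon and shows that it falls outside the shape. The polygon and helper names are made up for the example; in the pipeline, shapely's `representative_point()` is what guarantees an interior point.

```python
def polygon_centroid(ring):
    """Area-weighted centroid of a simple polygon (shoelace formula)."""
    area2 = 0.0  # twice the signed area
    cx = cy = 0.0
    n = len(ring)
    for i in range(n):
        x0, y0 = ring[i]
        x1, y1 = ring[(i + 1) % n]
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    return (cx / (3 * area2), cy / (3 * area2))

def point_in_polygon(point, ring):
    """Ray-casting test: count edge crossings of a ray going right from the point."""
    x, y = point
    inside = False
    n = len(ring)
    for i in range(n):
        x0, y0 = ring[i]
        x1, y1 = ring[(i + 1) % n]
        if (y0 > y) != (y1 > y):
            x_cross = x0 + (y - y0) * (x1 - x0) / (y1 - y0)
            if x < x_cross:
                inside = not inside
    return inside

# U-shaped polygon: a 3x3 square with a notch cut from the top
u_shape = [(0, 0), (3, 0), (3, 3), (2, 3), (2, 1), (1, 1), (1, 3), (0, 3)]
centroid = polygon_centroid(u_shape)
print(centroid, point_in_polygon(centroid, u_shape))
```

A label anchored at this centroid would sit in the notch, outside the polygon, which is the failure case an interior point avoids.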

## 2c. Generate Centerlines for Water Polygons

Generate centerline features for polygonal water bodies (lakes, reservoirs, etc.). These will be used for placing labels along the natural axis of elongated water features.

**Why centerlines?**
- Better label placement for elongated water features (lakes, reservoirs)
- Creates linear features along the medial axis of polygons
- Labels follow the natural orientation of the water body
- Uses Voronoi-based skeleton algorithm for accurate centerline extraction
- Preserves all attributes for label content
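
Voronoi-based skeleton methods generally start by sampling points along the polygon border. The sketch below shows only that densification step in pure Python. The `max_spacing` parameter is a hypothetical stand-in: this notebook does not say whether `border_density` is a spacing or a count, and `batch_generate_centerlines` may work differently.

```python
import math

def densify_ring(ring, max_spacing):
    """Insert points along each edge so that no gap exceeds max_spacing."""
    dense = []
    n = len(ring)
    for i in range(n):
        (x0, y0), (x1, y1) = ring[i], ring[(i + 1) % n]
        length = math.hypot(x1 - x0, y1 - y0)
        steps = max(1, math.ceil(length / max_spacing))
        for s in range(steps):
            t = s / steps
            dense.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return dense

# Elongated rectangle standing in for a long, narrow lake
lake = [(0, 0), (100, 0), (100, 10), (0, 10)]
dense_border = densify_ring(lake, max_spacing=20)
print(len(dense_border))
```

Denser borders give the Voronoi diagram more sites to work with, so the extracted skeleton follows bends more closely at the cost of more computation.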

In [None]:
# Generate centerlines for water feature labels
print("=== STEP 2c: GENERATING CENTERLINES FOR WATER LABELS ===")

# Import the centerline generation function
from scripts import batch_generate_centerlines

# Define which layers need centerlines for label positioning
# Typically used for elongated water bodies like lakes and reservoirs
layers_for_centerlines = [
    'water',  # Water polygons -> water_centerlines
]

# Generate centerlines for specified layers
centerline_results = batch_generate_centerlines(
    input_dir=str(CONFIG["paths"]["scratch_dir"]),  # Where polygon FGB files are
    output_dir=str(CONFIG["paths"]["scratch_dir"]),  # Save centerlines alongside polygons
    layers=layers_for_centerlines,                   # Only process these layers
    suffix='_centerlines',                           # Output: layer_name_centerlines.fgb
    simplify_tolerance=5.0,                          # Simplification tolerance in meters
    border_density=20,                               # Increase this for winding rivers (default is 100)
    verbose=CONFIG["download"]["verbose"]
)

print(f"  Total layers: {centerline_results['total_layers']}")
print(f"  Successful: {centerline_results['successful']}")
print(f"  Failed: {centerline_results['failed']}")

for layer in centerline_results['layers']:
    if layer['success']:
        output_name = Path(layer['output_file']).name
        print(f"  ✓ {output_name}: {layer['feature_count']:,} centerlines")
    else:
        print(f"  ✗ {Path(layer['input_file']).name}: {layer.get('error', 'Unknown error')}")

# List all processed files ready for tiling
if CONFIG["paths"]["scratch_dir"].exists():
    all_fgb_files = sorted(CONFIG["paths"]["scratch_dir"].glob("*.fgb"))
    centroid_files = [f for f in all_fgb_files if '_centroids' in f.name]
    centerline_files = [f for f in all_fgb_files if '_centerlines' in f.name]
    
    print(f"\n{'='*50}")
    print(f"Processed geometry files ready for tiling:")
    print(f"  Total FlatGeobuf files: {len(all_fgb_files)}")
    print(f"  Centroid files: {len(centroid_files)}")
    if centroid_files:
        for f in centroid_files:
            print(f"    - {f.name}")
    print(f"  Centerline files: {len(centerline_files)}")
    if centerline_files:
        for f in centerline_files:
            print(f"    - {f.name}")
    print(f"  Location: {CONFIG['paths']['scratch_dir']}")


### ArcGIS Feature Server Downloads

**Finding Feature Server URLs:**
1. Browse your organization's ArcGIS REST Services Directory
2. Navigate to a specific layer (e.g., FeatureServer/0, FeatureServer/1)
3. Copy the full URL up to and including the layer number
4. The script will automatically append `/query` and handle parameters

**Spatial Filtering:**
- The `extent` parameter filters features to your bounding box (saves bandwidth & time)
- For global layers, omit `extent` (or pass `extent=None`) to download all features
- Extent uses WGS84 coordinates: `(lon_min, lat_min, lon_max, lat_max)`

**Attribute Filtering:**
- Use `where` clause for SQL-based filtering: `'population > 10000'`
- Default `'1=1'` downloads all features

**Output Formats:**
- `"fgb"` (FlatGeobuf) - Recommended for direct tiling (streaming, indexed)
- `"geojson"` - More flexible, less optimal for tiling

**Notes:**
- Large datasets (>100k features) automatically use pagination
- Downloads go directly to the scratch directory
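
As a rough illustration of what appending `/query` and handling parameters can look like, the sketch below builds a query URL from a layer URL, an optional extent, and a `where` clause. The function name `build_query_url` and its defaults are assumptions; the parameter names are standard ArcGIS REST query parameters, but the script's real request may differ.

```python
from urllib.parse import urlencode

def build_query_url(layer_url, extent=None, where="1=1", offset=0, page_size=1000):
    """Build a Feature Server /query URL (hypothetical helper for illustration)."""
    params = {
        "where": where,
        "outFields": "*",
        "outSR": 4326,
        "f": "geojson",
        "resultOffset": offset,
        "resultRecordCount": page_size,
    }
    if extent is not None:
        west, south, east, north = extent
        # Envelope filter in WGS84: (lon_min, lat_min, lon_max, lat_max)
        params.update({
            "geometry": f"{west},{south},{east},{north}",
            "geometryType": "esriGeometryEnvelope",
            "inSR": 4326,
            "spatialRel": "esriSpatialRelIntersects",
        })
    return f"{layer_url.rstrip('/')}/query?{urlencode(params)}"

layer = (
    "https://services3.arcgis.com/BU6Aadhn6tbBEdyk/arcgis/rest/services/"
    "GRID3_COD_health_zones_v7_0/FeatureServer/0"
)
filtered_url = build_query_url(layer, extent=(19.5, -12.5, 30.5, -3.5))
global_url = build_query_url(layer)  # no extent: no spatial filter
print(filtered_url)
```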

In [None]:
# Check what files were created during download
print("=== CHECKING DOWNLOADED FILES ===")

# Check Overture Maps downloads (GeoParquet)
overture_files = []
search_dirs = [CONFIG["paths"]["data_dir"], CONFIG["paths"]["overture_data_dir"]]

for data_dir in search_dirs:
    if data_dir.exists():
        for pattern in CONFIG["download"]["output_formats"]:
            files = list(data_dir.glob(pattern))
            overture_files.extend(files)

print(f"\nOverture Maps: {len(overture_files)} files")
for file in sorted(overture_files):
    file_size = file.stat().st_size / 1024 / 1024  # Size in MB
    print(f"  {file.name} ({file_size:.1f} MB)")

# Check ArcGIS downloads (FlatGeobuf in scratch)
arcgis_files = []
if CONFIG["paths"]["scratch_dir"].exists():
    arcgis_files = list(CONFIG["paths"]["scratch_dir"].glob("*.fgb"))
    # Filter out converted Overture files (they have matching .parquet names)
    overture_names = {f.stem for f in overture_files}
    arcgis_files = [f for f in arcgis_files if f.stem not in overture_names]

print(f"\nArcGIS Feature Server: {len(arcgis_files)} files")
for file in sorted(arcgis_files):
    file_size = file.stat().st_size / 1024 / 1024  # Size in MB
    print(f"  {file.name} ({file_size:.1f} MB)")

# Display overall statistics
all_files = overture_files + arcgis_files
if all_files:
    total_size_mb = sum(f.stat().st_size for f in all_files) / 1024 / 1024
    print(f"\n{'='*50}")
    print(f"Total downloaded: {len(all_files)} files ({total_size_mb:.1f} MB)")
    print(f"  Overture Maps (GeoParquet): {len(overture_files)} files")
    print(f"  ArcGIS (FlatGeobuf): {len(arcgis_files)} files")
else:
    print("\nNo files found. Run download steps above first.")


## 3. Convert GeoParquet to FlatGeobuf

Convert downloaded Overture GeoParquet files to FlatGeobuf format for tippecanoe compatibility

**Note**: ArcGIS data was already downloaded as FlatGeobuf in Step 2a, so this step only processes Overture Maps data. Both sources will coexist in the scratch directory.
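
The skip-or-overwrite behaviour behind the `converted` / `skipped` counts can be sketched as a planning step. The helper `plan_conversions` is hypothetical; the real `batch_convert_directory` is not shown in this notebook. The conversion itself is left out here (one common option, as an assumption, is `geopandas.read_parquet(...)` followed by `to_file(..., driver="FlatGeobuf")`).

```python
import tempfile
from pathlib import Path

def plan_conversions(input_dir, output_dir, pattern="*.parquet", overwrite=False):
    """Decide which GeoParquet files need converting and which already have a .fgb."""
    to_convert, skipped = [], []
    for src in sorted(Path(input_dir).glob(pattern)):
        dst = Path(output_dir) / f"{src.stem}.fgb"
        if dst.exists() and not overwrite:
            skipped.append(dst)
        else:
            to_convert.append((src, dst))
    return to_convert, skipped

# Demo with empty placeholder files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "overture"
    scratch = Path(tmp) / "scratch"
    raw.mkdir()
    scratch.mkdir()
    for name in ("roads", "water"):
        (raw / f"{name}.parquet").touch()
    (scratch / "water.fgb").touch()  # pretend water was converted earlier

    to_convert, skipped = plan_conversions(raw, scratch)
    planned_names = [src.name for src, _ in to_convert]
    skipped_names = [dst.name for dst in skipped]

print(planned_names, skipped_names)
```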

In [None]:
# Verify that downloaded data extent matches configured extent
print("=== EXTENT VERIFICATION ===")

# Get configured extent from CONFIG (loaded from .env)
config_extent = CONFIG["extent"]["coordinates"]
config_west, config_south, config_east, config_north = config_extent

print(f"\n1. Configured Extent (from .env):")
print(f"   West:  {config_west:>10.6f}")
print(f"   South: {config_south:>10.6f}")
print(f"   East:  {config_east:>10.6f}")
print(f"   North: {config_north:>10.6f}")

# Check actual extent of downloaded files
import subprocess

def get_fgb_extent(file_path):
    """Get extent from FlatGeobuf file using ogrinfo"""
    try:
        result = subprocess.run(
            ['ogrinfo', '-al', '-so', str(file_path)],
            capture_output=True, text=True, timeout=10
        )
        for line in result.stdout.split('\n'):
            if 'Extent:' in line:
                # Parse: Extent: (20.294878, -7.704176) - (23.705435, -3.795773)
                parts = line.split(':')[1].strip()
                parts = parts.replace('(', '').replace(')', '').replace(' - ', ',')
                coords = [float(x.strip()) for x in parts.split(',')]
                return tuple(coords)  # (west, south, east, north)
    except Exception as e:
        return None
    return None

# Check key files
check_files = ['buildings.fgb', 'roads.fgb', 'water.fgb']
mismatches = []

print(f"\n2. Downloaded Data Extents:")
for filename in check_files:
    fgb_path = CONFIG["paths"]["scratch_dir"] / filename
    if fgb_path.exists():
        extent = get_fgb_extent(fgb_path)
        if extent:
            west, south, east, north = extent
            print(f"\n   {filename}:")
            print(f"     West:  {west:>10.6f}")
            print(f"     South: {south:>10.6f}")
            print(f"     East:  {east:>10.6f}")
            print(f"     North: {north:>10.6f}")
            
            # Check if extents match (within 0.5 degree tolerance for tile snapping)
            tolerance = 0.5
            west_match = abs(west - config_west) < tolerance
            south_match = abs(south - config_south) < tolerance
            east_match = abs(east - config_east) < tolerance
            north_match = abs(north - config_north) < tolerance
            
            if not (west_match and south_match and east_match and north_match):
                mismatches.append({
                    'file': filename,
                    'data_extent': extent,
                    'config_extent': config_extent
                })
    else:
        print(f"\n   {filename}: Not found")

# Report results
print(f"\n{'='*60}")
if mismatches:
    print("EXTENT MISMATCH DETECTED!")
    print(f"\n   {len(mismatches)} file(s) have different extents than configured.")
    print(f"\n   SOLUTION:")
    print(f"   1. If you want the data in the downloaded files:")
    print(f"      - Update EXTENT_* values in .env to match data extent")
    print(f"      - Restart kernel and re-run cells")
    print(f"\n   2. If you want the extent currently in .env:")
    print(f"      - Delete files: {CONFIG['paths']['scratch_dir']}")
    print(f"      - Delete files: {CONFIG['paths']['overture_data_dir']}")
    print(f"      - Re-run download cells (Step 1 and 2a)")
    print(f"\n   DO NOT proceed to tiling until extents match!")
else:
    print("EXTENT VERIFICATION PASSED")
    print(f"\n   All downloaded data matches the configured extent.")
    print(f"   Safe to proceed with tile generation.")
print(f"{'='*60}\n")


## 2.5b. Verify Extent

In [None]:
# Convert GeoParquet files to FlatGeobuf for optimal tiling performance
print("=== STEP 3: CONVERTING OVERTURE GEOPARQUET TO FLATGEOBUF ===")
print("Note: ArcGIS data already in FlatGeobuf format (from Step 2a)")

# Get list of existing ArcGIS FlatGeobuf files to avoid overwriting
existing_fgb_files = set()
if CONFIG["paths"]["scratch_dir"].exists():
    existing_fgb_files = {f.stem for f in CONFIG["paths"]["scratch_dir"].glob("*.fgb")}
    if existing_fgb_files:
        print(f"Preserving {len(existing_fgb_files)} existing ArcGIS FlatGeobuf files:")
        for name in sorted(existing_fgb_files):
            print(f"  - {name}.fgb")

# Convert Overture GeoParquet files to FlatGeobuf
# These will be added alongside ArcGIS files in the scratch directory
fgb_results = batch_convert_directory(
    input_dir=str(CONFIG["paths"]["overture_data_dir"]),
    output_dir=str(CONFIG["paths"]["scratch_dir"]),  # Save FGB files to scratch directory
    pattern=CONFIG["fgb_conversion"]["input_pattern"],
    overwrite=CONFIG["fgb_conversion"]["overwrite"],
    verbose=CONFIG["fgb_conversion"]["verbose"]
)

print(f"\nOverture Conversion Summary:")
print(f"  Converted: {fgb_results['converted']} files")
print(f"  Skipped:   {fgb_results['skipped']} files (already exist)")
print(f"  Errors:    {len(fgb_results['errors'])} files")

if fgb_results['errors']:
    print("\nErrors encountered:")
    for error in fgb_results['errors']:
        print(f"  - {error['file']}: {error['error']}")

# Count total FlatGeobuf files ready for tiling
if CONFIG["paths"]["scratch_dir"].exists():
    all_fgb_files = list(CONFIG["paths"]["scratch_dir"].glob("*.fgb"))
    overture_fgb_count = len([f for f in all_fgb_files if f.stem not in existing_fgb_files])
    arcgis_fgb_count = len(existing_fgb_files)
    
    print(f"\n{'='*50}")
    print(f"✓ All FlatGeobuf files ready for tiling")
    print(f"  Location: {CONFIG['paths']['scratch_dir']}")
    print(f"  Total files: {len(all_fgb_files)}")
    print(f"    - Overture (converted): {overture_fgb_count}")
    print(f"    - ArcGIS (direct): {arcgis_fgb_count}")
else:
    print(f"\n⚠ No FlatGeobuf files found in scratch directory")
    if fgb_results['skipped'] > 0:
        print(f"All {fgb_results['skipped']} Overture files already converted.")

## 4. Process FlatGeobuf to PMTiles

Use the `runCreateTiles.py` module to convert FlatGeobuf files to PMTiles, applying the custom per-layer tippecanoe settings defined in `tippecanoe.py`.
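
A minimal sketch of per-layer settings feeding a tippecanoe command line. The zoom values, the `LAYER_SETTINGS` dictionary, and the helper name are hypothetical; the real templates live in `tippecanoe.py`. The flags themselves are standard tippecanoe options, and writing `.pmtiles` directly assumes a recent tippecanoe build.

```python
from pathlib import Path

# Hypothetical per-layer options, for illustration only
LAYER_SETTINGS = {
    "water": ["--minimum-zoom=0", "--maximum-zoom=13", "--drop-densest-as-needed"],
    "water_centerlines": ["--minimum-zoom=6", "--maximum-zoom=13"],
}
DEFAULT_SETTINGS = ["--maximum-zoom=13", "--drop-densest-as-needed"]

def build_tippecanoe_command(fgb_path, output_dir):
    """Return the argument list for one layer; the caller decides how to run it."""
    fgb_path = Path(fgb_path)
    layer = fgb_path.stem
    output = Path(output_dir) / f"{layer}.pmtiles"
    return [
        "tippecanoe",
        f"--output={output}",
        f"--layer={layer}",
        "--force",  # overwrite an existing output file
        *LAYER_SETTINGS.get(layer, DEFAULT_SETTINGS),
        str(fgb_path),
    ]

cmd = build_tippecanoe_command("data/2-scratch/water.fgb", "data/3-pmtiles")
print(" ".join(cmd))
```

Building the command as a list keeps it safe to pass to `subprocess.run(cmd, check=True)` without shell quoting.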

In [2]:
# Step 4: Process all geospatial files to PMTiles
print("=== STEP 4: PROCESSING TO PMTILES ===")

# Process all downloaded and converted files to PMTiles using CONFIG settings
# Supported input formats: FlatGeobuf, GeoJSON, GeoJSONSeq, and GeoParquet
tiling_results = process_to_tiles(
    extent=CONFIG["extent"]["coordinates"],
    input_dirs=[str(d) for d in CONFIG["tiling"]["input_dirs"]],  # Convert Path objects to strings
    filter_pattern=CONFIG["tiling"]["filter_pattern"],  # Pass filter pattern from CONFIG
    output_dir=str(CONFIG["tiling"]["output_dir"]),  # Use explicit output directory from CONFIG
    parallel=CONFIG["tiling"]["parallel"],
    verbose=CONFIG["tiling"]["verbose"]
)

# print(f"Tiling completed: {tiling_results['success']}")
# print(f"Files processed: {len(tiling_results['processed_files'])}/{tiling_results['total_files']}")

if tiling_results["errors"]:
    print(f"Errors encountered: {len(tiling_results['errors'])}")
    for error in tiling_results["errors"]:
        print(f"  - {error}")

# Display generated PMTiles files
if tiling_results["processed_files"]:
    print(f"\n✓ Successfully generated {len(tiling_results['processed_files'])} PMTiles:")
    
    pmtiles_files = list(CONFIG["paths"]["tile_dir"].glob("*.pmtiles"))
    
    total_size_mb = 0
    for pmtile in sorted(pmtiles_files):
        size_mb = pmtile.stat().st_size / 1024 / 1024
        total_size_mb += size_mb
        print(f"  {pmtile.name} ({size_mb:.1f} MB)")
    
    print(f"\nTotal PMTiles size: {total_size_mb:.1f} MB")
    print(f"Files location: {CONFIG['paths']['tile_dir']}")
    
else:
    print("\nNo PMTiles files were generated. Check the errors above.")
    print(f"Make sure you have geospatial files (FlatGeobuf/GeoJSON/GeoJSONSeq/GeoParquet) in: {[str(d) for d in CONFIG['tiling']['input_dirs']]}")

=== STEP 4: PROCESSING TO PMTILES ===
=== PROCESSING TO TILES ===
Found 2 files to process:
  water.fgb (FlatGeobuf)
  water_centerlines.fgb (FlatGeobuf)


Processing files:   0%|          | 0/2 [00:00<?, ?file/s]

  Using template settings for water.fgb (7 options)


Processing files:  50%|█████     | 1/2 [00:02<00:02,  2.87s/file]

✓ water_centerlines.fgb -> /Users/matthewheaton/GitHub/basemap/data/3-pmtiles/water_centerlines.pmtiles


Processing files: 100%|██████████| 2/2 [00:22<00:00, 11.08s/file]

✓ water.fgb -> /Users/matthewheaton/GitHub/basemap/data/3-pmtiles/water.pmtiles

=== TILE PROCESSING COMPLETE ===
Processed: 2/2 files

✓ Successfully generated 2 PMTiles:
  buildings.pmtiles (379.1 MB)
  health_areas.pmtiles (61.9 MB)
  health_areas_centroids.pmtiles (17.4 MB)
  health_facilities.pmtiles (20.8 MB)
  health_zones.pmtiles (17.8 MB)
  health_zones_centroids.pmtiles (1.1 MB)
  infrastructure.pmtiles (5.8 MB)
  land_cover.pmtiles (834.7 MB)
  land_residential.pmtiles (12.3 MB)
  land_use.pmtiles (6.7 MB)
  religious_centers.pmtiles (11.6 MB)
  roads.pmtiles (287.6 MB)
  settlement_extents.pmtiles (62.0 MB)
  settlement_names.pmtiles (28.3 MB)
  water.pmtiles (35.7 MB)
  water_centerlines.pmtiles (1.8 MB)

Total PMTiles size: 1784.5 MB
Files location: /Users/matthewheaton/GitHub/basemap/data/3-pmtiles





## 5. Create TileJSON Metadata for map viewer

- **Set bounds and zoom levels**
- **PMTiles URL references**
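
A rough sketch of the metadata this step produces, built from the extent and a list of layer names. The helper `build_tilejson`, the default zoom range, and the use of `pmtiles://` references in `tiles` are assumptions; the notebook's `create_tilejson` may structure its output differently.

```python
import json

def build_tilejson(extent, layer_names, minzoom=0, maxzoom=13):
    """Assemble a TileJSON-style dict (hypothetical helper for illustration)."""
    west, south, east, north = extent
    return {
        "tilejson": "3.0.0",
        "bounds": [west, south, east, north],
        # Center of the extent, opening at the lowest zoom level
        "center": [(west + east) / 2, (south + north) / 2, minzoom],
        "minzoom": minzoom,
        "maxzoom": maxzoom,
        # Assumption: one pmtiles:// reference per layer file
        "tiles": [f"pmtiles://{name}.pmtiles" for name in layer_names],
        "vector_layers": [{"id": name, "fields": {}} for name in layer_names],
    }

tilejson_sketch = build_tilejson(
    (19.5, -12.5, 30.5, -3.5),
    ["water", "water_centerlines"],
)
print(json.dumps(tilejson_sketch, indent=2))
```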

In [None]:
# Step 5: Create TileJSON metadata for MapLibre integration
print("=== STEP 5: CREATING TILEJSON METADATA ===")

# Check if PMTiles files exist in the configured tile directory
pmtiles_files = list(CONFIG["paths"]["tile_dir"].glob("*.pmtiles"))

if pmtiles_files:
    print(f"Found {len(pmtiles_files)} PMTiles files, creating TileJSON...")
    
    try:
        tilejson = create_tilejson(
            tile_dir=str(CONFIG["paths"]["tile_dir"]),  # Explicitly pass tile directory
            extent=CONFIG["extent"]["coordinates"],  # Pass extent from CONFIG
            output_file=str(CONFIG["paths"]["tile_dir"] / "tilejson.json")  # Explicitly pass output file path
        )
        
        print("✓ TileJSON created successfully")
        print(f"  Bounds: {tilejson['bounds']}")
        print(f"  Zoom range: {tilejson['minzoom']} - {tilejson['maxzoom']}")
        print(f"  Vector layers: {len(tilejson['vector_layers'])}")
        print(f"  Output file: {CONFIG['paths']['tile_dir'] / 'tilejson.json'}")
        
        # Show a summary of all output files
        print(f"\nComplete output summary:")
        total_size_mb = 0
        for pmtile in sorted(pmtiles_files):
            size_mb = pmtile.stat().st_size / 1024 / 1024
            total_size_mb += size_mb
            print(f"  {pmtile.name} ({size_mb:.1f} MB)")
        
        print(f"  tilejson.json")
        print(f"\nTotal PMTiles size: {total_size_mb:.1f} MB")
        print(f"All files location: {CONFIG['paths']['tile_dir']}")
        
    except Exception as e:
        print(f"✗ TileJSON creation failed: {e}")
        
else:
    print("No PMTiles files found in output directory.")
    print(f"Expected location: {CONFIG['paths']['tile_dir']}")
    print("Run Step 4 first to generate PMTiles files.")

## 6. Test

In [None]:
# Individual Step Testing and Validation

print("INDIVIDUAL STEP TESTING")
print("=" * 50)

print("\n1. Test downloadOverture.py standalone:")
print("python scripts/downloadOverture.py --extent='27.0,-8.0,30.5,-2.0' --buffer=0")

print("\n2. Test downloadArcGIS.py standalone:")
print("python scripts/downloadArcGIS.py \\")
print("  'https://services3.arcgis.com/.../FeatureServer/0' \\")
print("  output.fgb --extent='27.0,-8.0,30.5,-2.0' --format=fgb")

print("\n3. Test runCreateTiles.py standalone:")
print("python scripts/runCreateTiles.py --extent='27.0,-8.0,30.5,-2.0' --create-tilejson")

print("\n4. Test individual steps in this notebook:")
print("   - Step 1: Download Overture data")
print("   - Step 2a: Download ArcGIS data (optional)")
print("   - Step 2b: Check downloaded files")
print("   - Step 2.5: Convert GeoParquet to FlatGeobuf")
print("   - Step 4: Process to PMTiles")
print("   - Step 5: Create TileJSON")

print("\n5. Validate outputs using CONFIG paths:")
print(f"   - Overture GeoParquet: {CONFIG['paths']['overture_data_dir']}")
print(f"   - FlatGeobuf files: {CONFIG['paths']['scratch_dir']}")
print(f"   - PMTiles: {CONFIG['paths']['tile_dir']}")
print(f"   - TileJSON metadata: {CONFIG['paths']['tile_dir'] / 'tilejson.json'}")

# Configuration validation using centralized CONFIG
print("\nCURRENT CONFIGURATION VALIDATION")
print("=" * 50)
print(f"Extent: {CONFIG['extent']['coordinates']}")
print(f"Buffer: {CONFIG['extent']['buffer_degrees']} degrees")
print(f"Tile output directory: {CONFIG['paths']['tile_dir']}")
print(f"Scratch directory (FlatGeobuf): {CONFIG['paths']['scratch_dir']}")
print(f"Input directories for tiling: {[str(d) for d in CONFIG['tiling']['input_dirs']]}")

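# Area calculation using CONFIG (extent is [west, south, east, north] in degrees)
import math
extent = CONFIG['extent']['coordinates']
area = (extent[2] - extent[0]) * (extent[3] - extent[1])
# A degree of longitude shrinks by cos(latitude), so scale by the mid-latitude
mid_lat = math.radians((extent[1] + extent[3]) / 2)
area_km2 = area * 111**2 * math.cos(mid_lat)
print(f"Processing area: {area:.2f} degree² (~{area_km2:.0f} km²)")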

# Check directory status
print("\nDIRECTORY STATUS")
print("=" * 30)
for path_name, path_obj in CONFIG['paths'].items():
    if path_name.endswith('_dir'):
        status = "exists" if path_obj.exists() else "missing"
        # glob("*") returns subdirectories as well as files, so report entries
        entry_count = len(list(path_obj.glob("*"))) if path_obj.exists() else 0
        print(f"{path_name}: {status} ({entry_count} entries)")

print("\nPERFORMANCE OPTIMIZATION TIPS")
print("=" * 50)

print(f"\n1. For large areas (current: {area:.2f} degree²):")
print(f"   - Current buffer: {CONFIG['extent']['buffer_degrees']} degrees")
print(f"   - Parallel processing: {CONFIG['tiling']['parallel']}")
print("   - Use spatial filtering on both Overture and ArcGIS downloads")
print("   - Consider smaller chunks if memory issues occur")

print("\n2. File management:")
print(f"   - Overture GeoParquet: {CONFIG['paths']['overture_data_dir']}")
print(f"   - ArcGIS + converted FlatGeobuf: {CONFIG['paths']['scratch_dir']}")
print("   - Clean intermediate files between steps if needed")
print("   - Use filter patterns to process specific layers only")

print("\n3. Output optimization:")
print(f"   - PMTiles output: {CONFIG['paths']['tile_dir']}")
print("   - Copy final tiles to public directory for web serving")
print("   - ArcGIS downloads directly to FlatGeobuf (no conversion needed)")


<!-- # Modular Processing Summary

This notebook provides a complete, step-by-step approach for **large-scale geospatial data processing** optimized for continent and world-scale datasets.

## Core Steps
1. **Download Overture Maps data** - Global basemap features using DuckDB (outputs GeoParquet)
2. **Download ArcGIS data** (Optional) - Organization-specific layers via REST API (outputs FlatGeobuf)
3. **Check and validate** downloaded files from both sources
4. **Convert to FlatGeobuf** - Optimize Overture GeoParquet for efficient tiling
5. **Generate PMTiles** - Convert all FlatGeobuf files to web-optimized vector tiles
6. **Create TileJSON metadata** - Generate metadata for web mapping integration

## Data Sources

### Overture Maps (Global Open Data)
- Buildings, roads, water, land use, places, infrastructure
- Downloaded as GeoParquet via DuckDB queries
- Continent and world-scale processing capability

### ArcGIS Feature Server (Organization Data)
- Custom organizational layers (boundaries, settlements, facilities)
- Downloaded directly via REST API with automatic pagination
- Spatial and attribute filtering supported
- Outputs directly to FlatGeobuf for tiling

## Format Workflow (Optimized for Scale)

```
Overture Maps (DuckDB)          ArcGIS REST API
─────────────────────           ───────────────
GeoParquet (.parquet)           FlatGeobuf (.fgb)
        ↓                              ↓
    Convert                        [Ready]
        ↓                              ↓
FlatGeobuf (.fgb)  ←────────────────────┘
        ↓
Tippecanoe Tiling
        ↓
PMTiles (.pmtiles)
        ↓
    Web Maps
```
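
The branching in the diagram above can be sketched as a small planning function. This is an illustration under stated assumptions, not the notebook's actual conversion code: `plan_tiling_inputs` and its action labels are hypothetical, and the conversion itself (for example with DuckDB or GeoPandas) is deliberately left out.

```python
from pathlib import Path


def plan_tiling_inputs(paths, scratch_dir):
    """Map each input file to the FlatGeobuf path that would be handed to tippecanoe.

    Returns (source, action, fgb_path) tuples, where action is one of:
    "convert" (GeoParquet that still needs a FlatGeobuf copy),
    "ready"   (already FlatGeobuf, e.g. an ArcGIS download),
    "skip"    (any other extension; not part of this workflow).
    """
    scratch_dir = Path(scratch_dir)
    plan = []
    for path in map(Path, paths):
        suffix = path.suffix.lower()
        if suffix == ".parquet":
            # Overture branch: the converted copy lands in the scratch directory
            plan.append((path, "convert", scratch_dir / f"{path.stem}.fgb"))
        elif suffix == ".fgb":
            # ArcGIS branch: the file is used as-is
            plan.append((path, "ready", path))
        else:
            plan.append((path, "skip", None))
    return plan
```

For example, `plan_tiling_inputs(["overture/roads.parquet", "scratch/health_zones.fgb"], "scratch")` marks the first file for conversion to `scratch/roads.fgb` and passes the second through unchanged.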

## Why This Workflow?

### 1. GeoParquet for Overture Downloads
- **Compact storage**: 50-80% smaller than GeoJSON
- **Fast DuckDB queries**: Efficient spatial filtering
- **Columnar format**: Excellent compression

### 2. FlatGeobuf for Tiling
- **Streaming capability**: Process datasets larger than RAM
- **Spatial indexing**: R-tree for fast spatial queries
- **Native tippecanoe support**: No conversion overhead
- **Optimal for large scale**: Tested on continent/world datasets
- **Direct from ArcGIS**: Skip conversion step entirely

### 3. PMTiles for Serving
- **Cloud-native**: Works with any static file host
- **Efficient delivery**: HTTP range requests
- **No tile server needed**: Direct browser access

## Performance Benefits
- **Memory efficiency**: Streaming inputs let very large feature sets be processed without exhausting RAM
- **Disk space**: GeoParquet plus FlatGeobuf typically needs 2-3x less space than an all-GeoJSON workflow
- **Processing speed**: Tile generation is roughly 20-40% faster than with GeoJSON input
- **Parallel processing**: Multi-threaded tiling makes fuller use of available CPU cores
- **Direct API access**: ArcGIS data downloads straight to FlatGeobuf, the tiling input format

## Scale Capabilities
- ✓ **City-scale**: Brooklyn, Paris, Tokyo, Kinshasa
- ✓ **Province-scale**: Haut-Lomami, Tanganyika, multiple health zones
- ✓ **Country-scale**: DRC, USA, India  
- ✓ **Continent-scale**: Africa, Europe, Americas
- ✓ **World-scale**: Global basemaps with billions of features

## Key Features
- **Modular design** - Each step can be run independently
- **Multiple data sources** - Combine Overture and organizational data
- **Flexible configuration** - Easy to customize for different areas and data types
- **Interactive development** - Run steps individually for debugging
- **Performance optimized** - Format selection based on dataset size
- **Production ready** - Robust error handling and validation
- **Memory conscious** - Streaming workflows prevent OOM errors

## Output Files
Each step generates specific outputs:
- **GeoParquet files (.parquet)** - Overture Maps download format
- **FlatGeobuf files (.fgb)** - Optimized tiling input (streaming, indexed)
- **PMTiles files (.pmtiles)** - Efficient web mapping output
- **TileJSON metadata** - MapLibre GL JS integration

## Usage Patterns
- **Development**: Run steps individually for testing and debugging
- **Production**: Execute all steps in sequence for automated processing
- **Customization**: Modify CONFIG settings and layer lists to customize data
- **Integration**: Use generated PMTiles with web mapping applications

## Best Practices for Large Datasets
1. **Use spatial filtering**: Apply extent to both Overture and ArcGIS downloads
2. **ArcGIS direct to FGB**: Downloads directly to FlatGeobuf (skip conversion)
3. **Convert Overture to FGB**: Always convert GeoParquet before tiling
4. **Use parallel processing**: Multi-file datasets process faster
5. **Monitor disk space**: Keep parquet as source, FGB for tiling
6. **Clean up intermediate files**: After successful tiling if needed
7. **Process by region**: For extremely large datasets, split by area -->