# Open Buildings Extraction via GCS (Fast Method)

This notebook demonstrates the **GCS-based extraction method** - significantly faster than Earth Engine.

```
┌─────────────────────────────────────┐
│  Create AOI (Area of Interest)      │
│  - Load AFRICAPOLIS2020.geojson     │
│  - Filter for Accra agglomeration   │
│  - Output: accra_aoi.geojson        │
└─────────────┬───────────────────────┘
              │
              ▼
┌─────────────────────────────────────┐
│  Configure GCS Extraction           │
│  - Set confidence threshold (0.75)  │
│  - Set area filters (10-1000 m²)    │
│  - Set parallel workers (4)         │
│  - Choose output format (GeoJSON)   │
└─────────────┬───────────────────────┘
              │
              ▼
┌─────────────────────────────────────┐
│  Extract Buildings from GCS         │
│  - NO authentication required!      │
│  - Direct download from GCS         │
│  - Parallel S2 cell processing      │
│  - Filter by confidence & area      │
│  - Apply spatial intersection       │
│  - Export: accra_buildings.geojson  │
└─────────────────────────────────────┘
```

## Key Advantages over Earth Engine Method

**Speed:**
- Small area (10 km²): 30-60 seconds vs 2-5 minutes
- Medium city (100 km²): 2-5 minutes vs 10-30 minutes
- Large city (1000 km²): 10-20 minutes vs 1-2 hours

**Simplicity:**
- No authentication needed (public data)
- No API quotas or timeouts
- Windows-compatible

**Input Data:**
- `AFRICAPOLIS2020.geojson` → AOI creation
- No service account needed!

**Output:**
- `accra_aoi.geojson` (area boundary)
- `accra_buildings.geojson` (building polygons)

In [1]:
from pathlib import Path
import logging

# GeoWorkflow imports
from geoworkflow.schemas.config_models import AOIConfig
from geoworkflow.processors.aoi.processor import AOIProcessor
from geoworkflow.schemas.open_buildings_gcs_config import OpenBuildingsGCSConfig
from geoworkflow.processors.extraction.open_buildings_gcs import OpenBuildingsGCSProcessor

## Optional: Setup Logging

Create a status logging tracker to monitor progress. This is **OPTIONAL** - you can skip this cell and remove logging statements below if preferred.

In [2]:
# Setup logging (OPTIONAL)
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

## Step 1: Create Area of Interest (AOI)

Extract the Accra boundary from AFRICAPOLIS data. This is the same as the Earth Engine workflow.

In [None]:
# Define AOI output path
aoi_file = Path("../data/aoi/accra_aoi.geojson")
#aoi_file = Path("../data/aoi/accra_sample_aoi.geojson")

# Create AOI configuration for Accra
aoi_config = AOIConfig(
    input_file=Path("../data/00_source/boundaries/agglomerations.gpkg"),
    country_name_column="Agglomeration_Name",
    countries=["Accra"],
    buffer_km=0,
    dissolve_boundaries=False,
    output_file=aoi_file
)

# Create and run the processor
aoi_processor = AOIProcessor(aoi_config)
aoi_result = aoi_processor.process()

# Check results
if aoi_result.success:
    print(f"✅ {aoi_result.message}")
    print(f"Processing time: {aoi_result.elapsed_time:.2f}s")
    print(f"Output: {aoi_file}")
else:
    print(f"❌ Failed: {aoi_result.message}")

2025-10-23 20:17:19,751 - INFO - Starting AOIProcessor processing


Output()

2025-10-23 20:17:20,747 - INFO - Loading administrative boundaries


2025-10-23 20:17:21,617 - INFO - Filtering 1 countries
2025-10-23 20:17:21,619 - INFO - Saving AOI to ../data/aoi/accra_aoi.geojson


2025-10-23 20:17:21,806 - INFO - Created 1 records


2025-10-23 20:17:21,925 - INFO - AOI saved successfully completed: 4/3 items in 1.2s (3.4 items/sec)
2025-10-23 20:17:21,926 - INFO - Successfully completed AOIProcessor processing


✅ Successfully created AOI with 1 features
Processing time: 2.18s
Output: ../data/aoi/accra_aoi.geojson


In [None]:
import geopandas as gpd

# Check AOI CRS
aoi_gdf = gpd.read_file("../data/aoi/accra_aoi.geojson")
print(f"AOI CRS: {aoi_gdf.crs}")
print(f"AOI bounds: {aoi_gdf.total_bounds}")
print(f"AOI geometry type: {aoi_gdf.geometry.iloc[0].geom_type}")

# For comparison, check original Africapolis
africapolis = gpd.read_file("../data/00_source/boundaries/agglomorations.gpkg")
print(f"\nAfricapolis CRS: {africapolis.crs}")
accra_orig = africapolis[africapolis['Agglomeration_Name'] == 'Accra']
print(f"Accra bounds in original file: {accra_orig.total_bounds}")

AOI CRS: PROJCS["Africa_Equidistant_Conic",GEOGCS["WGS 84",DATUM["WGS_1984",SPHEROID["WGS 84",6378137,298.257223563,AUTHORITY["EPSG","7030"]],AUTHORITY["EPSG","6326"]],PRIMEM["Greenwich",0],UNIT["Degree",0.0174532925199433]],PROJECTION["Equidistant_Conic"],PARAMETER["latitude_of_center",0],PARAMETER["longitude_of_center",25],PARAMETER["standard_parallel_1",20],PARAMETER["standard_parallel_2",-23],PARAMETER["false_easting",0],PARAMETER["false_northing",0],UNIT["metre",1,AUTHORITY["EPSG","9001"]],AXIS["Easting",EAST],AXIS["Northing",NORTH],AUTHORITY["ESRI","102023"]]
AOI bounds: [-2658215.83450204   587278.34351349 -2580000.97254483   653908.83864846]
AOI geometry type: Polygon


DataSourceError: Failed to read GeoJSON data; At line 7561, character 60716860: GeoJSON object too complex/large. You may define the OGR_GEOJSON_MAX_OBJ_SIZE configuration option to a value in megabytes to allow for larger features, or 0 to remove any size limit.

## Step 2: Extract Buildings via GCS

This is where the magic happens! No authentication needed, just configure and run.

**For testing:** Use the smaller sample AOI (faster, ~1-2 minutes)  
**For production:** Use the full Accra AOI (complete data, ~5-10 minutes)

In [5]:
# Choose your AOI and output
# Option 1: Small sample for testing (RECOMMENDED FOR FIRST RUN)
input_aoi = Path("../data/aoi/accra_sample_aoi.geojson")
output_file = Path("../data/02_clipped/accra_buildings_sample.geojson")

# Option 2: Full Accra area (uncomment to use)
input_aoi = Path("../data/aoi/accra_aoi.geojson")
output_file = Path("../data/02_clipped/all_accra_buildings.geojson")

# Configure extraction
gcs_config = OpenBuildingsGCSConfig(
    aoi_file=input_aoi,
    output_dir=output_file.parent,
    
    # Quality filters
    confidence_threshold=0.75,  # Min confidence (0.5-1.0)
    min_area_m2=8.0,           # Min building size
    max_area_m2=100000.0,       # Max building size
    
    # Output settings
    export_format="geojson",    # Options: geojson, shapefile, csv
    overwrite_existing=True,     # Overwrite if exists
    
    # Performance
    num_workers=8                # Parallel workers (adjust based on CPU)
)

# Update output file to match config
output_file = gcs_config.get_output_file_path()

print("📋 Configuration:")
print(f"  Input AOI: {input_aoi}")
print(f"  Output: {output_file}")
print(f"  Confidence: ≥{gcs_config.confidence_threshold}")
print(f"  Area range: {gcs_config.min_area_m2}-{gcs_config.max_area_m2} m²")
print(f"  Workers: {gcs_config.num_workers}")

📋 Configuration:
  Input AOI: ../data/aoi/accra_aoi.geojson
  Output: ../data/02_clipped/open_buildings.geojson
  Confidence: ≥0.75
  Area range: 8.0-100000.0 m²
  Workers: 8


In [6]:
if isinstance(gcs_config, OpenBuildingsGCSConfig):
    print(True)
    config_dict = gcs_config.model_dump(mode='json')

print(config_dict)

True
{'aoi_file': '../data/aoi/accra_aoi.geojson', 'output_dir': '../data/02_clipped', 'data_type': 'polygons', 's2_level': 6, 'gcs_bucket_path': 'gs://open-buildings-data/v3/polygons_s2_level_6_gzip_no_header', 'confidence_threshold': 0.75, 'min_area_m2': 8.0, 'max_area_m2': 100000.0, 'export_format': 'geojson', 'include_confidence': True, 'include_area': True, 'include_plus_codes': True, 'overwrite_existing': True, 'num_workers': 8, 'chunk_size': 2000000, 'service_account_key': None, 'use_anonymous_access': True}


## Step 3: Run the Extraction

This cell does the actual extraction. Progress will be shown in real-time.

**Expected time:**
- Sample area: ~1-2 minutes
- Full Accra: ~5-10 minutes

In [7]:
print("🚀 Starting building extraction...\n")

try:
    # Create processor
    processor = OpenBuildingsGCSProcessor(gcs_config)
    
    # Run extraction
    result = processor.process()
    
    # Display results
    if result.success:
        print(f"\n✅ {result.message}")
        print(f"\n📊 Summary:")
        print(f"  Buildings extracted: {result.processed_count:,}")
        print(f"  Processing time: {result.elapsed_time:.1f}s")
        print(f"  Output file: {result.output_paths[0]}")
        
        # File size
        if result.output_paths[0].exists():
            file_size_mb = result.output_paths[0].stat().st_size / (1024 * 1024)
            print(f"  File size: {file_size_mb:.2f} MB")
        
        # Show metrics
        if hasattr(processor, 'get_metric'):
            s2_cells = processor.get_metric('s2_cells_processed')
            if s2_cells:
                print(f"  S2 cells processed: {s2_cells}")
        
        print("\n🎉 Extraction completed successfully!")
        
    else:
        print(f"\n❌ Extraction failed: {result.message}")
        
except Exception as e:
    print(f"\n❌ Error: {e}")
    import traceback
    traceback.print_exc()

2025-10-04 16:41:50,753 - INFO - Starting OpenBuildingsGCSProcessor processing
2025-10-04 16:41:50,765 - INFO - Initialized GCS client with anonymous access
2025-10-04 16:41:50,774 - INFO - Computing S2 cell coverage for AOI...
2025-10-04 16:41:50,775 - INFO - Processing 2 S2 level-6 cells with 8 parallel workers
2025-10-04 16:41:50,775 - INFO - Downloading and filtering buildings from GCS...


🚀 Starting building extraction...



Output()

2025-10-04 16:43:45,609 - INFO - Processing S2 cells completed: 2/2 items in 114.8s (0.0 items/sec)
2025-10-04 16:44:35,089 - INFO - Created 1,988,802 records
2025-10-04 16:44:35,117 - INFO - Exported 1988802 buildings to ../data/02_clipped/open_buildings.geojson
2025-10-04 16:44:35,665 - INFO - Successfully extracted 1,988,802 buildings from 2 S2 cells
2025-10-04 16:44:35,666 - INFO - Successfully completed OpenBuildingsGCSProcessor processing



✅ Successfully extracted 1,988,802 buildings from 2 S2 cells

📊 Summary:
  Buildings extracted: 1,988,802
  Processing time: 164.9s
  Output file: ../data/02_clipped/open_buildings.geojson
  File size: 877.28 MB

🎉 Extraction completed successfully!


## Step 4: Verify Results

Load and inspect the extracted buildings.

In [8]:
import geopandas as gpd

if output_file.exists():
    # Load buildings
    buildings = gpd.read_file(result.output_paths[0])
    
    print(f"📊 Building Statistics:")
    print(f"  Total buildings: {len(buildings):,}")
    print(f"  Average area: {buildings['area_in_meters'].mean():.1f} m²")
    print(f"  Median area: {buildings['area_in_meters'].median():.1f} m²")
    print(f"  Average confidence: {buildings['confidence'].mean():.3f}")
    print(f"\n  Area range: {buildings['area_in_meters'].min():.1f} - {buildings['area_in_meters'].max():.1f} m²")
    print(f"  Confidence range: {buildings['confidence'].min():.3f} - {buildings['confidence'].max():.3f}")
    
    # Show first few records
    print(f"\n🔍 Sample records:")
    print(buildings[['confidence', 'area_in_meters']].head())
    
else:
    print("❌ Output file not found")

📊 Building Statistics:
  Total buildings: 1,988,802
  Average area: 129.0 m²
  Median area: 85.4 m²
  Average confidence: 0.844

  Area range: 8.0 - 39628.0 m²
  Confidence range: 0.750 - 0.987

🔍 Sample records:
   confidence  area_in_meters
0      0.8641         66.3996
1      0.8415         69.9676
2      0.8818        178.0097
3      0.8803         47.2763
4      0.7980         99.3359


## Troubleshooting

### No buildings extracted?
- Check AOI location (must be in covered area)
- Lower confidence threshold: `confidence_threshold=0.5`
- Remove area filters temporarily

### Slow extraction?
- Increase workers: `num_workers=8`
- Use CSV format (faster): `export_format="csv"`
- Check network speed

### Memory issues?
- Reduce workers: `num_workers=2`
- Use smaller AOI
- Process in batches

### Import errors?
```bash
pip install geoworkflow[extraction]
```