# 1. Data Collection for Building Classification

**Paper Reference:** Section 3.1.1 - Dataset

This notebook implements the data collection pipeline for building classification from satellite imagery.

## Overview

We collect 512×512-pixel satellite images at approximately 0.15 m/pixel resolution from Google Earth using the `segment-geospatial` (samgeo) Python package.

**Building Classes:**
1. Commercial
2. High-Rise
3. Hospital
4. Industrial
5. Multi-family Residential
6. Schools
7. Single-family Residential

**Data Source:** Google Earth via TMS tiles converted to GeoTIFF format.
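
The link between zoom level and the ~0.15 m/pixel figure can be checked with a short calculation. The sketch below assumes the tiles follow the standard Web Mercator scheme with 256-pixel tiles; the helper `ground_resolution` is illustrative and not part of `samgeo`. Under that assumption, zoom 20 gives about 0.149 m/pixel at the equator, and the value shrinks with the cosine of latitude.

```python
import math

# Equatorial circumference implied by the Web Mercator sphere radius, in meters
EARTH_CIRCUMFERENCE_M = 2 * math.pi * 6378137.0
TILE_SIZE_PX = 256  # assumption: standard 256-pixel tiles


def ground_resolution(lat_deg, zoom, tile_size=TILE_SIZE_PX):
    """Approximate meters per pixel of a Web Mercator tile at a given latitude."""
    meters_per_pixel_equator = EARTH_CIRCUMFERENCE_M / (tile_size * 2 ** zoom)
    return meters_per_pixel_equator * math.cos(math.radians(lat_deg))


# Compare the resolution and the ground span of a 512-pixel patch at a few latitudes
for lat in (0.0, 30.0, 40.0):
    res = ground_resolution(lat, zoom=20)
    print(f"lat {lat:4.1f}: {res:.3f} m/pixel, 512 px spans {512 * res:.1f} m")
```

At mid-latitudes the native resolution is therefore somewhat finer than 0.15 m/pixel, which is consistent with the "approximately" in the description above.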

## 1.1 Environment Setup

In [None]:
# Install required packages
# Uncomment to install if not already installed
# !pip install segment-geospatial leafmap geopandas

In [None]:
# Core imports
import os
import pandas as pd
import numpy as np
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Geospatial imports
try:
    from samgeo import tms_to_geotiff
    import geopandas as gpd
    from shapely.geometry import box
    SAMGEO_AVAILABLE = True
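except ImportError as e:
    # Any of samgeo, geopandas, or shapely may be the missing package
    print(f"Warning: geospatial dependency missing ({e}). "
          "Install with: pip install segment-geospatial geopandas")
    SAMGEO_AVAILABLE = False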

print("Environment setup complete.")

## 1.2 Configuration

Define building classes and data directories as per paper methodology.

In [None]:
# Building classification categories (Paper Section 3.1.1)
BUILDING_CLASSES = [
    'Commercial',      # Retail stores, shopping centers, office buildings
    'High',            # High-rise buildings (multi-story towers)
    'Hospital',        # Healthcare facilities
    'Industrial',      # Factories, warehouses, manufacturing plants
    'Multi',           # Multi-family residential (apartments, condos)
    'Schools',         # Educational institutions
    'Single'           # Single-family residential homes
]

# Image specifications (Paper Section 3.1.1)
IMAGE_SIZE = 512          # pixels
RESOLUTION = 0.15         # meters per pixel (approximately)
ZOOM_LEVEL = 20           # Google Earth zoom level (~0.15 m/pixel at the equator, finer at higher latitudes)

# Directory structure
DATA_DIR = Path('../data')
RAW_DIR = DATA_DIR / 'raw'
PROCESSED_DIR = DATA_DIR / 'processed'
METADATA_DIR = DATA_DIR / 'metadata'

print(f"Building Classes: {BUILDING_CLASSES}")
print(f"Image Size: {IMAGE_SIZE}x{IMAGE_SIZE} pixels")
print(f"Resolution: ~{RESOLUTION} m/pixel")

## 1.3 Building Footprints Data

Define a helper that converts a building's center coordinate into a latitude/longitude bounding box for satellite image extraction.

In [None]:
def create_bounding_box(lat, lon, size_meters=100):
    """
    Create a square bounding box centered on a coordinate point.
    
    Args:
        lat (float): Latitude of center point
        lon (float): Longitude of center point
        size_meters (float): Side length of the bounding box in meters
        
    Returns:
        tuple: (min_lon, min_lat, max_lon, max_lat)
    """
    # Approximate conversion: 1 degree of latitude ≈ 111,320 meters;
    # a degree of longitude shrinks by cos(latitude).
    # Offsets are half the side length so the full box spans size_meters.
    half_size = size_meters / 2
    lat_offset = half_size / 111320
    lon_offset = half_size / (111320 * np.cos(np.radians(lat)))
    
    return (
        lon - lon_offset,  # min_lon
        lat - lat_offset,  # min_lat
        lon + lon_offset,  # max_lon
        lat + lat_offset   # max_lat
    )

print("Bounding box function defined.")
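
A quick way to sanity-check a box returned by `create_bounding_box` is to convert it back to meters. The sketch below uses the same 111,320 m-per-degree approximation; `bbox_size_meters` is a hypothetical helper, not part of the notebook or `samgeo`, and the example box is hand-built for illustration.

```python
import math

METERS_PER_DEGREE = 111320  # same approximation used in create_bounding_box


def bbox_size_meters(bbox):
    """Return the approximate (width, height) in meters of a lon/lat bounding box."""
    min_lon, min_lat, max_lon, max_lat = bbox
    center_lat = (min_lat + max_lat) / 2
    # Longitude degrees shrink by cos(latitude); latitude degrees do not
    width = (max_lon - min_lon) * METERS_PER_DEGREE * math.cos(math.radians(center_lat))
    height = (max_lat - min_lat) * METERS_PER_DEGREE
    return width, height


# Hand-built example: a box spanning 0.001 degrees in each direction near 40° N
example_bbox = (-105.0005, 39.9995, -104.9995, 40.0005)
width_m, height_m = bbox_size_meters(example_bbox)
print(f"width ≈ {width_m:.1f} m, height ≈ {height_m:.1f} m")
```

Passing the output of `create_bounding_box(lat, lon, size_meters=100)` through this helper shows whether the box actually covers the intended ground footprint.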

## 1.4 Satellite Image Download Function

This function uses the `samgeo` package (Wu & Osco, 2023) to download satellite imagery via TMS tiles.

In [None]:
def download_satellite_image(lat, lon, output_path, zoom=ZOOM_LEVEL, source='Satellite'):
    """
    Download satellite image patch for a building location.
    
    Paper Reference: Section 3.1.1
    "We employed the segment-geospatial Python package (samgeo)... 
     Using the tms_to_geotiff function, we specified bounding box coordinates 
     and zoom levels to retrieve detailed satellite images."
    
    Args:
        lat (float): Latitude of building center
        lon (float): Longitude of building center
        output_path (str): Path to save the GeoTIFF image
        zoom (int): Zoom level (20 = ~0.15 m/pixel)
        source (str): Image source ('Satellite' for Google Earth)
        
    Returns:
        bool: True if download successful, False otherwise
    """
    if not SAMGEO_AVAILABLE:
        print("Error: samgeo package not available")
        return False
    
    try:
        # Create bounding box (~100m x 100m around building)
        bbox = create_bounding_box(lat, lon, size_meters=100)
        
        # Download satellite image as GeoTIFF
        tms_to_geotiff(
            output=output_path,
            bbox=bbox,
            zoom=zoom,
            source=source
        )
        
        return True
        
    except Exception as e:
        print(f"Error downloading image: {e}")
        return False

print("Download function defined.")
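
Because `download_satellite_image` reports failure by returning `False`, transient network errors can be handled by wrapping the call in a retry loop. The following is a minimal sketch under that assumption; `retry` and `flaky_download` are hypothetical names introduced for illustration, and the delays are arbitrary.

```python
import time


def retry(func, attempts=3, base_delay=1.0):
    """Call func() until it returns True, doubling the wait after each failure."""
    for attempt in range(attempts):
        if func():
            return True
        if attempt < attempts - 1:
            time.sleep(base_delay * 2 ** attempt)  # base_delay, 2x, 4x, ...
    return False


# Stand-in for a download that fails twice and then succeeds
calls = {'count': 0}


def flaky_download():
    calls['count'] += 1
    return calls['count'] >= 3


result = retry(flaky_download, attempts=5, base_delay=0.01)
print(result, calls['count'])
```

In the notebook, the same wrapper could be applied as `retry(lambda: download_satellite_image(lat, lon, path))`.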

## 1.5 Batch Download Pipeline

Process multiple building locations from a CSV file.

In [None]:
def batch_download_buildings(csv_path, output_dir, building_class, limit=None):
    """
    Download satellite images for multiple building locations.
    
    Args:
        csv_path (str): Path to CSV with building coordinates
        output_dir (str): Directory to save images
        building_class (str): Building class name
        limit (int): Maximum number of images to download (for testing)
        
    Returns:
        dict: Download statistics with keys 'total', 'success', and 'failed'
    """
    # Create output directory
    output_path = Path(output_dir) / building_class
    output_path.mkdir(parents=True, exist_ok=True)
    
    # Load building locations
    df = pd.read_csv(csv_path)
    
    if limit is not None:
        df = df.head(limit)
    
    # Download statistics
    stats = {'total': len(df), 'success': 0, 'failed': 0}
    
    # n counts processed rows; idx is the DataFrame index label used in the filename
    for n, (idx, row) in enumerate(df.iterrows(), start=1):
        lat, lon = row['latitude'], row['longitude']
        filename = f"{lat:.6f}_{lon:.6f}_{building_class}_{idx}.tif"
        file_path = output_path / filename
        
        # Skip files that already exist so an interrupted run can resume
        if file_path.exists() or download_satellite_image(lat, lon, str(file_path)):
            stats['success'] += 1
        else:
            stats['failed'] += 1
            
        # Progress update every 10 images
        if n % 10 == 0:
            print(f"Progress: {n}/{len(df)} images processed")
    
    print(f"\n{building_class}: {stats['success']}/{stats['total']} images downloaded successfully")
    return stats

print("Batch download pipeline defined.")
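
`batch_download_buildings` expects a CSV with `latitude` and `longitude` columns. Checking the file before a long download run can catch malformed rows early. The sketch below is one way to do that with the standard library; `validate_locations` is a hypothetical helper, and the sample rows are illustrative only and not taken from the paper's dataset.

```python
import csv
import io

REQUIRED_COLUMNS = ('latitude', 'longitude')


def validate_locations(csv_file):
    """Return (valid_rows, errors) for a CSV of building coordinates."""
    reader = csv.DictReader(csv_file)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")

    valid, errors = [], []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        try:
            lat, lon = float(row['latitude']), float(row['longitude'])
        except (TypeError, ValueError):
            errors.append((line_no, 'non-numeric coordinate'))
            continue
        if not (-90 <= lat <= 90 and -180 <= lon <= 180):
            errors.append((line_no, 'coordinate out of range'))
            continue
        valid.append((lat, lon))
    return valid, errors


# Illustrative data: one valid row, one out-of-range row, one non-numeric row
sample = io.StringIO(
    "latitude,longitude\n"
    "40.7128,-74.0060\n"
    "95.0,10.0\n"
    "abc,-100.0\n"
)
valid_rows, problems = validate_locations(sample)
print(valid_rows)
print(problems)
```

For a file on disk, the same function can be called as `validate_locations(open(csv_path, newline=''))`.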

## 1.6 Dataset Statistics

Generate statistics about the collected dataset (Table 2 in paper).

In [None]:
def count_images_by_class(data_dir):
    """
    Count images per building class.
    
    Paper Reference: Table 2 - Number of Images Collected per Building Class and State
    
    Args:
        data_dir (str): Path to data directory
        
    Returns:
        dict: Image counts per class
    """
    counts = {}
    data_path = Path(data_dir)
    
    for class_name in BUILDING_CLASSES:
        class_dir = data_path / class_name
        if class_dir.exists():
            # Count .tif files
            count = len(list(class_dir.glob('*.tif')))
            counts[class_name] = count
        else:
            counts[class_name] = 0
    
    return counts

# Display current dataset statistics
if PROCESSED_DIR.exists():
    for split in ['train', 'val', 'test']:
        split_dir = PROCESSED_DIR / split
        if split_dir.exists():
            counts = count_images_by_class(split_dir)
            total = sum(counts.values())
            print(f"\n{split.upper()} Set:")
            for class_name, count in counts.items():
                print(f"  {class_name}: {count}")
            print(f"  Total: {total}")
else:
    print(f"Processed data directory not found: {PROCESSED_DIR}. Run data collection to generate the dataset.")
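
For a side-by-side view of all splits, the per-class counts can be gathered into a single nested dictionary. This is a sketch assuming the `processed/<split>/<class>/*.tif` layout used above; `count_table` is a hypothetical helper, and the demo builds a throwaway directory tree with empty placeholder files rather than real imagery.

```python
import tempfile
from pathlib import Path

CLASSES = ['Commercial', 'High', 'Hospital', 'Industrial', 'Multi', 'Schools', 'Single']
SPLITS = ['train', 'val', 'test']


def count_table(processed_dir, classes=CLASSES, splits=SPLITS):
    """Return {class_name: {split: number_of_tif_files}}; missing folders count as 0."""
    root = Path(processed_dir)
    table = {}
    for cls in classes:
        table[cls] = {}
        for split in splits:
            class_dir = root / split / cls
            n = len(list(class_dir.glob('*.tif'))) if class_dir.is_dir() else 0
            table[cls][split] = n
    return table


# Demo on a temporary tree containing three empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    for rel in ('train/Commercial/a.tif', 'train/Commercial/b.tif', 'val/Hospital/c.tif'):
        path = Path(tmp) / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()
    table = count_table(tmp)

# Print one row per class with a column per split
print(f"{'class':<12}" + ''.join(f"{s:>7}" for s in SPLITS))
for cls, row in table.items():
    print(f"{cls:<12}" + ''.join(f"{row[s]:>7}" for s in SPLITS))
```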

## Summary

This notebook provides the data collection pipeline for:

1. **Satellite Image Acquisition**: Download 512×512 pixel images at ~0.15 m/pixel resolution
2. **Geographic Coverage**: Images from diverse U.S. locations
3. **Building Classes**: 7 distinct categories as defined in paper

**Next Step**: `02_preprocessing_segmentation.ipynb` - Apply ReFineNet segmentation and preprocessing