# 1.1: Finalize Study Area and Hold-Out Validation Area Polygons

**Objective:** To formally define and delineate the spatial extents for the project's primary analysis and its independent validation, as specified in the `1.1.md` protocol.

**Methodology:**
1.  **Select Geographies:** Choose two geomorphically contrasting provinces for the main analysis (`study_areas`) and one spatially non-contiguous province for external validation (`validation_area`).
2.  **Process Data:** Load the authoritative source shapefile, filter to the selected provinces, reproject to the project standard CRS (EPSG:5070), and calculate areas.
3.  **Export Deliverables:** Save the final polygons into two separate GeoPackage files (`study_areas.gpkg`, `validation_area.gpkg`).
4.  **Justify:** Provide a clear, written justification for the selections, directly referencing the project's hypotheses and the criteria outlined in the protocol.
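
The area step in item 2 depends on the data being in a projected, equal-area CRS whose units are metres, so that planar area in m² can be divided by 1,000,000 to give km². The following is a minimal, library-free sketch of that arithmetic using the shoelace formula. The function name `polygon_area_sqkm` and the example rectangle are hypothetical and for illustration only; the notebook itself relies on GeoPandas' `geometry.area`.

```python
def polygon_area_sqkm(ring):
    """Planar area of a simple polygon ring, given (x, y) vertices in metres."""
    # Shoelace formula: half the absolute sum of cross products of consecutive vertices.
    twice_area = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        twice_area += x1 * y2 - x2 * y1
    area_sqm = abs(twice_area) / 2.0
    return area_sqm / 1_000_000  # 1 km^2 = 1,000,000 m^2


# Hypothetical 10 km x 20 km rectangle in projected coordinates (metres).
example_ring = [(0.0, 0.0), (10_000.0, 0.0), (10_000.0, 20_000.0), (0.0, 20_000.0)]
example_area = polygon_area_sqkm(example_ring)
print(f"{example_area:.2f} km^2")  # 200.00 km^2
```

The same calculation on unprojected longitude/latitude coordinates would yield values in squared degrees, which is why reprojection precedes the area calculation.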

In [None]:
# === 1. Configuration & Setup ===
# This cell contains all user-configurable parameters for the workflow.

# --- Core Libraries ---
import warnings
from pathlib import Path
import geopandas as gpd
import matplotlib.pyplot as plt
from shapely import make_valid

# --- Robust Path Resolution ---
def find_project_root(marker='README.md'):
    """Find the project root by searching upwards for a marker file."""
    path = Path.cwd().resolve()
    while path.parent != path:
        if (path / marker).exists():
            return path
        path = path.parent
    raise FileNotFoundError(f"Project root with marker '{marker}' not found.")

PROJECT_ROOT = find_project_root()
DATA_DIR = PROJECT_ROOT / "data"
RAW_DIR = DATA_DIR / "raw"
PROC_DIR = DATA_DIR / "processed"
PROC_DIR.mkdir(parents=True, exist_ok=True)

# --- Input Data ---
# Authoritative physiographic provinces shapefile (Fenneman, 1946)
SOURCE_SHP = RAW_DIR / "boundaries" / "physio_provinces.shp"

# --- Output Deliverables ---
STUDY_AREAS_GPKG = PROC_DIR / "study_areas.gpkg"
VALIDATION_AREA_GPKG = PROC_DIR / "validation_area.gpkg"

# --- Analysis Parameters ---
TARGET_CRS = "EPSG:5070"  # NAD83 / Conus Albers (Equal Area)

# --- Province Selections (Based on 1.1.md Protocol) ---
# Using uppercase for robust matching after normalization.
STUDY_PROVINCES = ["BLUE RIDGE", "PIEDMONT"]
VALIDATION_PROVINCE = "VALLEY AND RIDGE"

# --- Print Setup Summary ---
print("--- Configuration Summary ---")
print(f"Project Root:   {PROJECT_ROOT}")
print(f"Source Shapefile: {SOURCE_SHP}")
print(f"Output Study GPKG:      {STUDY_AREAS_GPKG}")
print(f"Output Validation GPKG: {VALIDATION_AREA_GPKG}")
print(f"Target CRS:       {TARGET_CRS}")
print("-" * 29)
print(f"Selected Study Provinces:    {STUDY_PROVINCES}")
print(f"Selected Validation Province: {VALIDATION_PROVINCE}")
print("-----------------------------")

In [None]:
# === 2. Load and Pre-process Province Data ===

if not SOURCE_SHP.exists():
    raise FileNotFoundError(f"Source shapefile not found at {SOURCE_SHP}. Please acquire and place the data.")

print(f"Loading source data from {SOURCE_SHP.name}...")
gdf = gpd.read_file(SOURCE_SHP)

# --- Standardize Schema and Geometry ---
# Find the likely province name column (case-insensitive match on column names)
columns_by_upper = {str(col).upper(): col for col in gdf.columns}
province_col_raw = None
for candidate in ['PROVINCE', 'PROV_NAME', 'NAME']:
    if candidate in columns_by_upper:
        province_col_raw = columns_by_upper[candidate]
        break
if province_col_raw is None:
    raise KeyError(f"Could not find a province name column in {list(gdf.columns)}")

# Keep only essential columns and rename for consistency
gdf = gdf[[province_col_raw, 'geometry']].rename(columns={province_col_raw: 'PROVINCE'})

# Normalize province names to uppercase for reliable matching
gdf['PROVINCE'] = gdf['PROVINCE'].str.upper().str.strip()

# Ensure geometries are valid
invalid_geoms = gdf[~gdf.is_valid]
if not invalid_geoms.empty:
    warnings.warn(f"Found {len(invalid_geoms)} invalid geometries. Attempting to repair with 'make_valid'.")
    gdf['geometry'] = gdf['geometry'].apply(make_valid)

print("Source data loaded and standardized.")
display(gdf.head())

In [None]:
# === 3. Filter, Reproject, and Process Polygons ===

# --- Filter to Selected Provinces ---
gdf_study = gdf[gdf['PROVINCE'].isin(STUDY_PROVINCES)].copy()
gdf_validation = gdf[gdf['PROVINCE'] == VALIDATION_PROVINCE].copy()

print(f"Filtered to {len(gdf_study)} study polygon(s) and {len(gdf_validation)} validation polygon(s).")
assert len(gdf_study) > 0, "No study provinces found in source data. Check names in STUDY_PROVINCES list."
assert len(gdf_validation) > 0, "Validation province not found in source data. Check name in VALIDATION_PROVINCE."

# --- Reproject to Target CRS for Area Calculation ---
print(f"Reprojecting all polygons to {TARGET_CRS}...")
gdf_study_proj = gdf_study.to_crs(TARGET_CRS)
gdf_validation_proj = gdf_validation.to_crs(TARGET_CRS)

# --- Calculate Area and Finalize Attributes ---
# Define a function to avoid code repetition
def finalize_attributes(gdf_proj):
    """Calculates area in km^2 and cleans up the dataframe."""
    # Dissolve by province name in case there are multiple polygons per province
    gdf_dissolved = gdf_proj.dissolve(by='PROVINCE', as_index=False)

    # Calculate area in square meters, then convert to square kilometers
    gdf_dissolved['AREA_SQKM'] = gdf_dissolved.geometry.area / 1_000_000

    # Return only the columns required by the protocol
    return gdf_dissolved[['PROVINCE', 'AREA_SQKM', 'geometry']]

study_final = finalize_attributes(gdf_study_proj)
validation_final = finalize_attributes(gdf_validation_proj)

print("\n--- Final Study Areas ---")
display(study_final[['PROVINCE', 'AREA_SQKM']])

print("\n--- Final Validation Area ---")
display(validation_final[['PROVINCE', 'AREA_SQKM']])

In [None]:
# === 4. Quality Control: Visualize Results ===

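# Check for spatial non-contiguity. Note that `intersects` is also True for areas that
# only share a boundary, so adjacent provinces trigger the warning as well.
# (`unary_union` is deprecated as of GeoPandas 1.0 in favor of `union_all()`.)
study_geom = study_final.unary_union
validation_geom = validation_final.unary_union
if study_geom.intersects(validation_geom):
    warnings.warn("CRITICAL WARNING: The selected study and validation areas intersect or touch. This violates the spatial independence criterion.")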

# Create plot
fig, ax = plt.subplots(1, 1, figsize=(12, 9))

# Plot all provinces in the background for context
gdf.to_crs(TARGET_CRS).plot(ax=ax, color='lightgray', edgecolor='white')

# Plot the finalized areas on top
study_final.plot(ax=ax, color='cornflowerblue', edgecolor='black', label='Study Areas')
validation_final.plot(ax=ax, color='lightcoral', edgecolor='black', label='Validation Area')

# Add labels
for idx, row in study_final.iterrows():
    rep_point = row.geometry.representative_point()
    ax.text(rep_point.x, rep_point.y, row['PROVINCE'], ha='center', fontsize=9, fontweight='bold')

for idx, row in validation_final.iterrows():
    rep_point = row.geometry.representative_point()
    ax.text(rep_point.x, rep_point.y, row['PROVINCE'], ha='center', va='top', fontsize=9, fontweight='bold')

ax.set_title(f'Finalized Study and Validation Areas ({TARGET_CRS})', fontsize=16)
ax.set_xlabel('Easting (m)')
ax.set_ylabel('Northing (m)')
ax.legend(handles=[
    plt.Rectangle((0, 0), 1, 1, color='cornflowerblue', label='Study Areas'),
    plt.Rectangle((0, 0), 1, 1, color='lightcoral', label='Validation Area')
])
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()

In [None]:
# === 5. Export Final Deliverables ===

# --- Save to GeoPackage Files ---
try:
    print(f"Writing {len(study_final)} polygon(s) to {STUDY_AREAS_GPKG.name}...")
    study_final.to_file(STUDY_AREAS_GPKG, driver='GPKG')

    print(f"Writing {len(validation_final)} polygon(s) to {VALIDATION_AREA_GPKG.name}...")
    validation_final.to_file(VALIDATION_AREA_GPKG, driver='GPKG')

    print("\n✅ Export successful.")
except Exception as e:
    print(f"\n❌ ERROR: An error occurred during file export: {e}")
    raise  # Re-raise so a failed export stops the notebook instead of passing silently

## Summary and Justification

This notebook successfully delineated and exported the polygons for the primary analysis and the hold-out validation, in accordance with the project's updated experimental design.

### Selected Provinces
* **Study Areas:**
    * `BLUE RIDGE`: Area = 56,192.19 km²
    * `PIEDMONT`: Area = 217,143.21 km²
* **Hold-Out Validation Area:**
    * `VALLEY AND RIDGE`: Area = 138,591.20 km²
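
As a quick consistency check on the figures above, the combined study extent and its size relative to the hold-out area can be computed directly. This is a small sketch using only the values listed here; the variable names are illustrative and not part of the notebook.

```python
# Areas in km^2, copied from the list above.
areas_sqkm = {
    "BLUE RIDGE": 56_192.19,
    "PIEDMONT": 217_143.21,
    "VALLEY AND RIDGE": 138_591.20,
}
study_provinces = ["BLUE RIDGE", "PIEDMONT"]
validation_province = "VALLEY AND RIDGE"

# Combined extent of the primary analysis and the hold-out's size relative to it.
study_total = sum(areas_sqkm[name] for name in study_provinces)
validation_ratio = areas_sqkm[validation_province] / study_total

print(f"Total study area: {study_total:,.2f} km^2")
print(f"Validation / study area ratio: {validation_ratio:.3f}")
```

With these inputs the combined study area is about 273,335 km², and the hold-out area is roughly half that size.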

### Justification for Selection

The province selections were made to directly support the project's core hypotheses:

1.  **To Test H1 (Landscape Characterization):** The hypothesis that geomorphic provinces have distinct topological signatures requires a strong basis for comparison. The **Blue Ridge** province (high-relief, crystalline rock, dendritic drainage) and the **Piedmont** province (moderate-relief, rolling hills) were chosen for the main study area to maximize this geomorphic contrast, providing a robust test for our classification models.

2.  **To Test H2 (Process Prediction):** Training the predictive WEPP erosion models across these two distinct geomorphic settings increases the likelihood that the models will learn fundamental, generalizable relationships between landscape shape and physical processes, rather than spurious correlations specific to a single landscape type.

3.  **For Robust External Validation:** The **Valley and Ridge** province was chosen as the hold-out area. The protocol requires the hold-out to be spatially non-contiguous with the study areas to prevent data leakage; because the Valley and Ridge adjoins the Blue Ridge in Fenneman's delineation, this criterion must be confirmed against the intersection check in Section 4 before the selection is treated as final. The province also presents a challenging validation test: it contains both high- and low-relief areas, but with a strong, linear structural grain that is distinct from both the Blue Ridge and Piedmont, providing a novel landscape on which to assess model generalizability.
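
The spatial-independence criterion can be screened cheaply before running an exact geometric test. The sketch below is a coarse pre-check, not the notebook's method: it measures the gap between two bounding boxes given in the `(minx, miny, maxx, maxy)` order that GeoPandas' `total_bounds` returns. A positive gap proves the underlying geometries cannot touch; a gap of zero is inconclusive and calls for the exact `intersects` test. The helper name `bbox_gap` and all coordinates are hypothetical.

```python
import math


def bbox_gap(a, b):
    """Shortest distance between two axis-aligned boxes given as (minx, miny, maxx, maxy).

    Returns 0.0 when the boxes overlap or touch.
    """
    dx = max(a[0] - b[2], b[0] - a[2], 0.0)
    dy = max(a[1] - b[3], b[1] - a[3], 0.0)
    return math.hypot(dx, dy)


# Hypothetical extents in metres, for illustration only (not the project's real bounds).
study_bounds = (0.0, 0.0, 100_000.0, 80_000.0)
far_bounds = (150_000.0, 0.0, 220_000.0, 60_000.0)
touching_bounds = (100_000.0, 20_000.0, 180_000.0, 90_000.0)

print(bbox_gap(study_bounds, far_bounds))       # 50000.0: geometries cannot touch
print(bbox_gap(study_bounds, touching_bounds))  # 0.0: inconclusive, run the exact test
```

Elongated, parallel provinces will often have overlapping bounding boxes even when they are separated on the ground, so this check can only confirm independence, never refute it.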