Pull through additional OSM and Overture metadata fields. #30

Merged
njhenry merged 1 commit into main from feature/better-metadata on May 21, 2026

@njhenry (Contributor) commented May 21, 2026

Pulls through richer per-POI metadata from both OSM and Overture into the conflated snapshot, and fixes one minor bug in how brand is backfilled in the conflated dataset.

File changes

Overture extraction (src/openpois/io/overture.py)

  • DuckDB SELECT now grabs addresses[1].{freeform, locality, region, postcode, country}, websites, phones, socials, and categories.alternate, with the same column names propagated through _finalize_snapshot_in_duckdb.
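As a rough illustration of what this flattening amounts to, here is a minimal pure-Python sketch. It is not the project's DuckDB query: the function name flatten_overture_place, the output keys, and the sample record are assumptions made for illustration. Note that DuckDB list indexing is 1-based, so addresses[1] is the first address, which corresponds to index 0 in Python.

```python
from typing import Any, Dict

# Address sub-fields named in the PR description.
ADDRESS_FIELDS = ("freeform", "locality", "region", "postcode", "country")


def flatten_overture_place(place: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten one Overture-style place record (hypothetical helper).

    Takes the first address only, like addresses[1] in DuckDB, and passes
    the list-valued fields through unchanged.
    """
    addresses = place.get("addresses") or []
    # A missing or empty address list yields None for every address field.
    first = addresses[0] if addresses else {}
    flat = {field: first.get(field) for field in ADDRESS_FIELDS}

    # List-valued fields are kept as lists at this stage.
    for key in ("websites", "phones", "socials"):
        flat[key] = place.get(key)

    # categories is a struct with primary/alternate members.
    categories = place.get("categories") or {}
    flat["categories_alternate"] = categories.get("alternate")
    return flat


example = flatten_overture_place(
    {
        "addresses": [
            {"freeform": "1 Main St", "locality": "Springfield",
             "region": "IL", "postcode": "62701", "country": "US"},
            {"freeform": "PO Box 12"},  # ignored: only the first is used
        ],
        "websites": ["https://example.com"],
        "phones": ["+1 555 0100"],
        "socials": [],
        "categories": {"primary": "cafe", "alternate": ["coffee_shop"]},
    }
)
```

The output key names here are placeholders; the actual column names are whatever the SELECT and _finalize_snapshot_in_duckdb use.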

OSM extraction (config.yaml)

  • Adds the access key to download.osm.extract_keys so that the OSM access tag survives into the snapshot.

Merge (src/openpois/conflation/merge.py)

  • New helpers _present, _blend_with_backfill, _col_or_null, _unwrap_first_element consolidate the column-guard / blend / list-unwrap logic that was previously inline.
  • Blend semantics are now bidirectional: the higher-confidence source still wins, but if its value is null or empty, the merge falls back to the other side instead of leaving the cell null. Empty strings ("") are treated as missing, because Overture sometimes emits them.
  • Matched rows now carry:
    • Blended primaries: addr_street, addr_city, addr_state, addr_postcode, addr_country, phone, website.
    • OSM-only: addr_housenumber, addr_unit, opening_hours, access.
    • Overture-only: overture_socials, overture_categories_alternate.
    • Per-source traceability: osm_addr_*, overture_addr_*, osm_phone, overture_phones, osm_website, overture_websites.
  • _build_unmatched_osm_gdf and _build_unmatched_overture_gdf populate the same schema with appropriate nulls so the three sub-frames concat cleanly.
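The list-unwrap and shared-schema ideas above can be sketched in plain Python. This is only an illustration under stated assumptions: the real helpers work on GeoDataFrame columns, while the versions below (unwrap_first_element, align_to_schema), the four-column SCHEMA, and the sample rows are hypothetical stand-ins operating on single values and dicts. Filling phone from the first Overture phone on an unmatched row is likewise an assumption.

```python
from typing import Any, Dict, Optional, Sequence

# Hypothetical subset of the conflated schema, for illustration only.
SCHEMA = ("name", "phone", "opening_hours", "overture_socials")


def unwrap_first_element(values: Optional[Sequence[Any]]) -> Optional[Any]:
    """Return the first element of a list, or None for a null or empty list."""
    if values is None or len(values) == 0:
        return None
    return values[0]


def align_to_schema(row: Dict[str, Any], schema: Sequence[str]) -> Dict[str, Any]:
    """Project a row onto the shared schema, filling absent columns with None."""
    return {col: row.get(col) for col in schema}


# An unmatched OSM row has no Overture-only fields.
osm_only = align_to_schema(
    {"name": "Cafe A", "opening_hours": "Mo-Fr 08:00-17:00"}, SCHEMA
)

# An unmatched Overture row has no OSM-only fields; its phone list is
# unwrapped to a single value.
overture_only = align_to_schema(
    {
        "name": "Cafe B",
        "phone": unwrap_first_element(["+1 555 0100", "+1 555 0199"]),
        "overture_socials": ["https://social.example/cafe-b"],
    },
    SCHEMA,
)

# Both rows now expose identical keys in identical order, so they can be
# stacked without column mismatches.
combined = [osm_only, overture_only]
```

Because every row carries every column, concatenating matched, OSM-only, and Overture-only sub-frames does not introduce ragged schemas.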

Conflation driver (scripts/conflation/conflate.py)

  • Adds the new metadata columns to OSM_MERGE_COLS / OVERTURE_MERGE_COLS so they survive the per-shared-label row slice.

Tests (tests/test_merge.py)

  • New TestMetadataPullThrough class (9 tests) covers: matched-row blending, OSM-only fields, unmatched-side traceability, LIST<STRING> unwrap (including empty list → None), access flow, and two regressions — brand backfill when OSM is higher-confidence but null, and empty-string-as-missing.
  • Existing matched-schema assertion extended to include every new column.

Bug fix:

Before this PR, when OSM had higher confidence than Overture but a null brand, the matched row's brand was dropped instead of falling back to Overture. _blend_with_backfill makes that backfill symmetric and is now used for name, brand, every addr_*, phone, and website.
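The backfill rule described above can be sketched for a single pair of values. This is a minimal scalar illustration, not the code in merge.py: the names present and blend_with_backfill, the signature, and the choice to give ties to OSM are all assumptions, and the real helper presumably works on whole columns.

```python
from typing import Any, Optional


def present(value: Any) -> bool:
    """True when a value is usable: not None, not NaN, not an empty string."""
    if value is None:
        return False
    if isinstance(value, float) and value != value:  # NaN is never equal to itself
        return False
    if isinstance(value, str) and value == "":
        return False
    return True


def blend_with_backfill(
    osm_value: Any,
    overture_value: Any,
    osm_conf: float,
    overture_conf: float,
) -> Optional[Any]:
    """Prefer the higher-confidence side, falling back to the other if missing.

    Ties go to OSM here; that tie-break is an assumption of this sketch.
    """
    if osm_conf >= overture_conf:
        preferred, fallback = osm_value, overture_value
    else:
        preferred, fallback = overture_value, osm_value
    if present(preferred):
        return preferred
    return fallback if present(fallback) else None


# The regression described above: OSM is higher-confidence but has no brand,
# so the Overture brand is used instead of leaving the cell null.
brand = blend_with_backfill(None, "Acme Coffee", osm_conf=0.9, overture_conf=0.5)

# Empty strings count as missing, so the fallback applies here too.
website = blend_with_backfill("https://osm.example", "", osm_conf=0.2, overture_conf=0.8)
```

In this sketch the rule behaves identically whichever side holds the higher confidence, which is the symmetry the fix is after.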

@njhenry merged commit dd2166a into main on May 21, 2026
@njhenry deleted the feature/better-metadata branch on May 21, 2026 06:29