Bump clone multipliers to hit ~40M/~100M dataset targets #266

@jathavaan

Problem

Synthesized dataset sizes fall ~17% short of targets set in #255. Last setup run (release 2026-05-16.1):

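| size   | target | actual     | multiplier (incl. originals) |
|--------|--------|------------|------------------------------|
| small  | ~5M    | 4,166,773  | 1× (passthrough)             |
| medium | ~40M   | 33,334,184 | 8× (7 clones + original)     |
| large  | ~100M  | 83,335,460 | 20× (19 clones + original)   |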

Root cause: targets in DatasetSize.clones_per_polygon were sized assuming a ~5M base, but the conflated source dataset is 4.17M polygons.
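A quick sanity check of the root cause, as a minimal sketch: with the assumed ~5M base the current multipliers land exactly on target, while the actual 4,166,773-polygon base leaves every size about 16.7% short. The constant and function names are illustrative and not taken from the codebase.

```python
# Illustrative names; only the numbers come from the table above.
ASSUMED_BASE = 5_000_000   # base size the multipliers were sized for (~5M)
ACTUAL_BASE = 4_166_773    # "small" row count from release 2026-05-16.1

TARGETS = {"medium": 40_000_000, "large": 100_000_000}
MULTIPLIERS = {"medium": 8, "large": 20}  # clones + original


def shortfall_pct(rows: int, target: int) -> float:
    """Percentage by which `rows` falls short of `target`."""
    return 100 * (target - rows) / target


for size, multiplier in MULTIPLIERS.items():
    assumed_rows = ASSUMED_BASE * multiplier
    actual_rows = ACTUAL_BASE * multiplier
    print(
        f"{size}: assumed {assumed_rows:,}, actual {actual_rows:,}, "
        f"short by {shortfall_pct(actual_rows, TARGETS[size]):.1f}%"
    )
```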

Fix

Bump clones_per_polygon in src/domain/enums/dataset_size.py:10:

  • MEDIUM: 7 → 9 clones → ~41.7M rows
  • LARGE: 19 → 23 clones → ~100.0M rows

Update docstring targets in the same enum.
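A hypothetical sketch of what the bumped enum could look like. The real DatasetSize in dataset_size.py may be structured differently; the member values, the property form of clones_per_polygon, and the expected_rows helper are assumptions for illustration only. The clone counts and base size come from this issue.

```python
from enum import Enum

BASE_POLYGONS = 4_166_773  # conflated source size (release 2026-05-16.1)


class DatasetSize(Enum):
    """Hypothetical layout; only the clone counts are from the issue.

    SMALL:  ~4.2M rows (passthrough, no clones)
    MEDIUM: ~41.7M rows (9 clones + original)
    LARGE:  ~100.0M rows (23 clones + original)
    """

    SMALL = "small"
    MEDIUM = "medium"
    LARGE = "large"

    @property
    def clones_per_polygon(self) -> int:
        # Proposed values: MEDIUM 7 -> 9, LARGE 19 -> 23.
        return {
            DatasetSize.SMALL: 0,
            DatasetSize.MEDIUM: 9,
            DatasetSize.LARGE: 23,
        }[self]

    def expected_rows(self, base: int = BASE_POLYGONS) -> int:
        """Upper bound on rows: every original plus all of its clones."""
        return base * (self.clones_per_polygon + 1)


for size in DatasetSize:
    print(f"{size.name}: {size.expected_rows():,} rows")
```

With the 4,166,773-polygon base this gives 41,667,730 rows for MEDIUM and 100,002,552 for LARGE, before any invalid clones are dropped.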

Validation

Re-run the setup container and confirm that buildings_medium ≈ 40M (~41.7M expected) and buildings_large ≈ 100M rows, minus any dropped invalid clones.
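One way to make the validation check explicit, as a sketch: compare each observed row count against the upper bound base × (clones + 1), allowing a small deficit for dropped invalid clones. The helper name and the 1% allowance are assumptions, not values from the setup pipeline; the row counts would come from whatever query the setup run already reports.

```python
def row_count_ok(
    actual: int,
    base: int,
    clones: int,
    max_drop_fraction: float = 0.01,  # assumed allowance for dropped clones
) -> bool:
    """True if `actual` is at most the theoretical maximum and no more
    than `max_drop_fraction` below it."""
    upper = base * (clones + 1)
    lower = upper * (1 - max_drop_fraction)
    return lower <= actual <= upper


BASE = 4_166_773

# Example: a medium run that dropped 12,000 invalid clones passes,
# while the pre-fix count of 33,334,184 does not.
print(row_count_ok(41_667_730 - 12_000, BASE, clones=9))
print(row_count_ok(33_334_184, BASE, clones=9))
```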

Refs #255.

Labels: enhancement