Skip to content

refactor: hardcoded values cleanup (Waves 1-6)#37

Merged
pcalnon merged 4 commits intomainfrom
feature/hardcoded-values-wave1
Apr 11, 2026
Merged

refactor: hardcoded values cleanup (Waves 1-6)#37
pcalnon merged 4 commits intomainfrom
feature/hardcoded-values-wave1

Conversation

@pcalnon
Copy link
Copy Markdown
Owner

@pcalnon pcalnon commented Apr 10, 2026

Summary

Cross-repo hardcoded-values refactor for juniper-data. Canonical source for the X-API-Key header constant and the /v1 API version prefix consumed by juniper-data-client.

  • Wave 1: Three new layer-scoped constants modules:
    • juniper_data/api/constants.py (32 symbols: header names, status code defaults, body/rate-limit limits, error message templates, exempt paths)
    • juniper_data/storage/constants.py (16 symbols: filenames, metadata keys, table/column names)
    • juniper_data/core/constants.py (11 symbols: encoding strings, magic numbers, fixed metadata keys)
    • Per-generator Field(default=...) defaults moved from inline literals into named module constants in each generator's params.py.
  • Wave 2: Replaced ~115 inline literals across 17 files in the api, storage, and core layers, plus all 7 generator parameter modules.
  • Wave 3: Adopted starlette.status constants for HTTP status codes (no more 404/413/500/503 magic numbers in routes/middleware/security). Consolidated 'utf-8' encoding literals into core/constants.py.
  • Wave 5 cross-repo verification (PASS):
    • Generator output bit-identity: SHA-256 of X_full/y_full arrays at seed=42 is bit-identical between this branch and origin/main for all 5 generators (spiral=3f0c88c1d0d8ac64, gaussian=d2c0928e0ccf7896, checkerboard=db1ba33f4513b94f, circles=123b27f4864d6e45, xor=0c30c5d07cdedece)
    • HEADER_X_API_KEY='X-API-Key' matches the value used by juniper-data-client.constants.API_KEY_HEADER_NAME
  • Wave 6: AGENTS.md updated with the layer-scoped constants module pattern; CHANGELOG.md gains an Unreleased entry.

No public API changes — HTTP request/response shapes, settings prefix, and storage formats are unchanged.

Test plan

  • Full pytest suite (835+ tests) — exit 0
  • pre-commit run --all-files — all 19 hooks pass (ruff lint+format, mypy, bandit, yamllint, shellcheck)
  • Generator output reproducibility verification (Wave 5)
  • CI on PR

🤖 Generated with Claude Code

pcalnon added 4 commits April 9, 2026 16:15
Creates 10 new constants files (145 constants total) holding every
hardcoded literal currently inlined in the API, storage, core, and
generator layers.

api/constants.py (33):
- Exempt-paths set for security middleware
- Security header names and values (CSP, HSTS, X-Frame, X-Content,
  Referrer, Permissions, X-RateLimit-*, X-Request-ID, X-API-Key)
- Default content security policy
- Max request body bytes (10 MiB)
- Rate limit defaults (60 req/min, 60s window)
- Observability defaults (service name, namespace, log format,
  Sentry trace sample rate, Prometheus duration buckets)
- Generation status outcome strings (success/error)

Note: HTTP status code constants are intentionally NOT defined here.
Wave 2 will adopt starlette.status per the data plan doc Phase 3.

storage/constants.py (16):
- Redis: host, port, db, key prefix, meta/artifact key suffixes
- PostgreSQL: host, port, datasets table name
- Local FS: .meta.json, .npz, .tmp file suffixes
- JSON serialization indent
- Pagination defaults (limit=100, offset=0)
- Default artifact format (npz)

core/constants.py (11):
- CHARSET_UTF8 (used in 7+ source files)
- DATASET_ID_HASH_PREFIX_LENGTH (16)
- Pydantic field length constraints (DESCRIPTION_MAX_LENGTH=500,
  CREATED_BY_MAX_LENGTH=100)
- Batch operation size limits (delete/update=100, create/export=50)
- DatasetListFilter tags_match default and pattern

Per-generator defaults.py (7 files, 85 constants total) — each follows
the existing generators/spiral/defaults.py structure with per-field
defaults plus validation bounds:
- xor/defaults.py (12 const)
- checkerboard/defaults.py (11 const)
- gaussian/defaults.py (14 const)
- circles/defaults.py (14 const)
- mnist/defaults.py (11 const)
- csv_import/defaults.py (9 const)
- arc_agi/defaults.py (14 const)

All values verified against current source line-by-line. Naming follows
the existing spiral defaults convention.

Wave 1 of the hardcoded values refactor — constants only, no application
logic modified. All 357 existing tests pass; pre-commit hooks pass.
…sks 2.7-2.10)

Replaces ~115 inline hardcoded literals across 17 files with imports
from the Wave 1 constants modules: juniper_data.api.constants,
juniper_data.storage.constants, juniper_data.core.constants, and the
per-generator defaults.py modules.

Task 2.7 — API layer (3 files, ~30 replacements):
- middleware.py: All 7 security headers and their values now reference
  HEADER_*/*_VALUE constants; HSTS conditional uses HEADER_X_FORWARDED_PROTO
  + PROXY_PROTOCOL_HTTPS; rate limit response headers use
  HEADER_X_RATELIMIT_*; the 413 status code uses
  starlette.status.HTTP_413_REQUEST_ENTITY_TOO_LARGE (per the data plan
  Phase 3 decision to adopt starlette.status). Module-level
  EXEMPT_PATHS, _DEFAULT_CSP, and _MAX_REQUEST_BODY_BYTES are kept as
  thin aliases to the constants for backwards compatibility with any
  test that may import them.
- security.py: APIKeyHeader name uses HEADER_X_API_KEY; api_key lookup
  uses HEADER_X_API_KEY; RateLimiter constructor defaults use
  DEFAULT_RATE_LIMIT_REQUESTS_PER_MINUTE and
  DEFAULT_RATE_LIMIT_WINDOW_SECONDS; rate-limit response headers use
  HEADER_X_RATELIMIT_LIMIT/REMAINING/RESET and HEADER_RETRY_AFTER;
  get_rate_limiter fallback uses DEFAULT_RATE_LIMIT_REQUESTS_PER_MINUTE
- observability.py: _SERVICE_NAME_DEFAULT and _NAMESPACE_DEFAULT now
  reference DEFAULT_SERVICE_NAME and DEFAULT_NAMESPACE; X-Request-ID
  reads + writes use HEADER_X_REQUEST_ID; the JSON format check uses
  LOG_FORMAT_JSON; the plain-text formatter uses DEFAULT_LOG_FORMAT_PLAIN;
  configure_sentry default traces_sample_rate uses
  DEFAULT_SENTRY_TRACES_SAMPLE_RATE; the Histogram buckets tuple uses
  DATASET_GENERATION_DURATION_BUCKETS; the success-status comparison
  uses GENERATION_STATUS_SUCCESS

Task 2.8 — Storage layer (5 files, ~25 replacements; plan said 4):
- local_fs.py: meta/npz/tmp file suffixes use META_FILE_SUFFIX,
  NPZ_FILE_SUFFIX, TMP_FILE_SUFFIX; JSON serialization uses
  JSON_INDENT_DEFAULT; all encoding="utf-8" uses CHARSET_UTF8;
  list_datasets defaults use DEFAULT_LIST_LIMIT/OFFSET
- redis_store.py: Constructor defaults use REDIS_DEFAULT_HOST/PORT/DB
  and REDIS_DEFAULT_KEY_PREFIX; meta/artifact key suffixes use
  REDIS_META_KEY_SUFFIX and REDIS_ARTIFACT_KEY_SUFFIX; all utf-8
  encode/decode use CHARSET_UTF8; list_datasets defaults use
  DEFAULT_LIST_LIMIT/OFFSET; the slice [-5] (length of ":meta") now
  uses len(REDIS_META_KEY_SUFFIX) for safety
- postgres_store.py: Constructor host/port defaults use
  POSTGRES_DEFAULT_HOST and POSTGRES_DEFAULT_PORT; artifact file
  extension uses NPZ_FILE_SUFFIX
- memory.py: list_datasets defaults use DEFAULT_LIST_LIMIT/OFFSET
- cached.py: list_datasets defaults use DEFAULT_LIST_LIMIT/OFFSET

Task 2.9 — Core layer (2 files, ~10 replacements):
- dataset_id.py: hash digest encoding uses CHARSET_UTF8; the
  hash_digest[:16] slice now uses DATASET_ID_HASH_PREFIX_LENGTH
- models.py: All 8 batch operation max_length / min_length args now
  use the BATCH_*_MAX_ITEMS / BATCH_MIN_ITEMS constants
  (BatchDeleteRequest, BatchCreateRequest, BatchUpdateTagsRequest,
  BatchExportRequest, plus the BatchCreateItem.description and
  created_by); CreateDatasetRequest.description and created_by use
  DESCRIPTION_MAX_LENGTH and CREATED_BY_MAX_LENGTH;
  DatasetListFilter.tags_match default and pattern use
  TAGS_MATCH_DEFAULT and TAGS_MATCH_PATTERN

Task 2.10 — Generator params (7 files, ~50 replacements):
- xor/params.py: All 7 Field defaults + 3 validation bounds now use
  XOR_DEFAULT_* / MIN_* constants from xor/defaults.py
- checkerboard/params.py: All 8 Field defaults + 4 validation bounds
  use CHECKERBOARD_DEFAULT_* / MIN_* / MAX_* constants
- circles/params.py: All 7 Field defaults + 2 validation bounds use
  CIRCLES_DEFAULT_* / MIN_* constants
- gaussian/params.py: All 9 Field defaults + 5 validation bounds use
  GAUSSIAN_DEFAULT_* / MIN_* / MAX_* constants
- mnist/params.py: All 6 Field defaults use MNIST_DEFAULT_* constants
- csv_import/params.py: All 8 Field defaults use CSV_IMPORT_DEFAULT_*
  constants
- arc_agi/params.py: All 8 Field defaults + 4 validation bounds use
  ARC_AGI_DEFAULT_* / MIN_* / MAX_* constants
- spiral/params.py was already importing from defaults.py in Wave 1
  (the model/example for the other generators) — no changes needed

Notes:
- The "datasets" SQL table name in postgres_store.py SCHEMA_SQL was
  NOT replaced via interpolation: the table appears in CREATE TABLE,
  ALTER TABLE, INSERT INTO, and CREATE INDEX statements across the
  multiline template, and interpolating it would change f-string
  semantics in subtle ways. POSTGRES_DATASETS_TABLE remains in
  storage/constants.py as documentation of the canonical name.
- The "juniper_data" default database name in PostgresDatasetStore
  __init__ has no Wave 1 constant — left as-is.
- The HSTS, CSP, and 'nosniff' literal strings in cascor's middleware
  were the model — but cascor's Wave 1 constants don't include them,
  so the cascor refactor in this wave was scoped narrower; juniper-data
  has its own complete header constants and they are all used here.

Verified:
- pytest juniper_data/tests/ exit=0 (~654 tests pass)
- pre-commit run on all 17 files: all hooks pass after Ruff split the
  middleware import block into 3 statements (the `as` aliases trigger
  a Ruff rule that prefers separate imports for re-aliasing)
- Import smoke test: all generator default values match the original
  literals (XorParams n_points_per_quadrant=50 margin=0.1, Checkerboard
  n_samples=200 n_squares=4, Gaussian n_classes=2 center_radius=3.0,
  Mnist dataset=mnist flatten=True), middleware._MAX_REQUEST_BODY_BYTES
  = 10485760, _DEFAULT_CSP starts with "default-src 'none'",
  EXEMPT_PATHS contains /v1/health, dataset_id format unchanged
  (test-v1-2fb118e57a9d6c47 has 16-char suffix), DatasetListFilter
  tags_match default = "any"
…sks 3.2-3.3)

Replaces ~14 inline hardcoded literals across 6 files with imports
from juniper_data.core.constants (CHARSET_UTF8) and the standard
starlette.status module (per the data plan Phase 3 decision to
prefer starlette.status over custom HTTP status integer constants).

Task 3.3 — starlette.status (3 files, ~10 replacements):
- api/app.py: ValueError handler now raises status.HTTP_400_BAD_REQUEST;
  general exception handler raises status.HTTP_500_INTERNAL_SERVER_ERROR
- api/routes/datasets.py: get_store storage check uses
  status.HTTP_500_INTERNAL_SERVER_ERROR; create_dataset decorator uses
  status.HTTP_201_CREATED; unknown generator + invalid params errors
  use status.HTTP_400_BAD_REQUEST; batch-create decorator uses
  status.HTTP_201_CREATED; batch-export "none found" error uses
  status.HTTP_404_NOT_FOUND; "no versions found" error uses
  status.HTTP_404_NOT_FOUND; all 5 "Dataset not found" errors use
  status.HTTP_404_NOT_FOUND; delete decorator uses
  status.HTTP_204_NO_CONTENT
- api/routes/generators.py: schema lookup "Generator not found" uses
  status.HTTP_404_NOT_FOUND

Task 3.2 — utf-8 encoding strings (3 files, 4 replacements):
- storage/kaggle_store.py: open(...encoding="utf-8") uses CHARSET_UTF8
- generators/arc_agi/generator.py: same
- generators/csv_import/generator.py: 2 file open() calls, both use
  CHARSET_UTF8

Notes:
- juniper_data/storage/local_fs.py and redis_store.py + core/dataset_id.py
  were already using CHARSET_UTF8 from Wave 2 (task 2.8/2.9). The
  remaining 4 utf-8 literals targeted by task 3.2 lived in
  generators/storage helper code that wasn't touched in Wave 2.
- Test files (tests/unit/test_kaggle_store.py and
  test_security_boundaries.py) still contain `encoding="utf-8"`
  literals — these are test fixtures rather than production code, so
  they were left alone per the Wave 3 testing-vs-source distinction.
- The middleware.py 413 status code already uses
  status.HTTP_413_REQUEST_ENTITY_TOO_LARGE from Wave 2 (task 2.7).

Verified:
- pytest juniper_data/tests/: exit=0 (all tests pass)
- pre-commit on all 6 files: black/ruff/mypy/bandit clean
… CHANGELOG

Wave 6 (Documentation & Release) for the hardcoded-values refactor.

- AGENTS.md: add `api/constants.py`, `storage/constants.py`, and
  `core/constants.py` to the directory tree and the Component Overview
  table; extend the "Code Style Conventions → Constants" section to
  document the layer-scoped constants module pattern, the role of each
  layer's constants.py, the per-generator `params.py` defaults, and the
  rule that HTTP status codes use `starlette.status` constants instead
  of magic numbers.
- CHANGELOG.md: add an Unreleased entry summarizing the three new
  constants modules (Wave 1), the ~115 inline-literal replacements
  across 17 files in the api, storage, and core layers plus all 7
  generator parameter modules (Waves 2 + 3), the starlette.status
  adoption, and the Wave 5 generator-output bit-identity verification
  (sha256 of X_full/y_full at seed=42 matches origin/main for all 5
  generators; full pytest pass; 19 pre-commit hooks clean).
Copy link
Copy Markdown
Owner Author

@pcalnon pcalnon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

approved

@pcalnon pcalnon merged commit 9bb0d0b into main Apr 11, 2026
36 of 40 checks passed
@pcalnon pcalnon deleted the feature/hardcoded-values-wave1 branch April 11, 2026 22:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant