refactor: hardcoded values cleanup (Waves 1-6) by pcalnon · Pull Request #37 · pcalnon/juniper-data

pcalnon · 2026-04-10T22:57:03Z

Summary

Cross-repo hardcoded-values refactor for juniper-data. Canonical source for the X-API-Key header constant and the /v1 API version prefix consumed by juniper-data-client.

Wave 1: Three new layer-scoped constants modules:
- juniper_data/api/constants.py (32 symbols: header names, status code defaults, body/rate-limit limits, error message templates, exempt paths)
- juniper_data/storage/constants.py (16 symbols: filenames, metadata keys, table/column names)
- juniper_data/core/constants.py (11 symbols: encoding strings, magic numbers, fixed metadata keys)
- Per-generator Field(default=...) defaults moved from inline literals into named module constants in each generator's params.py.
Wave 2: Replaced ~115 inline literals across 17 files in the api, storage, and core layers, plus all 7 generator parameter modules.
Wave 3: Adopted starlette.status constants for HTTP status codes (no more 404/413/500/503 magic numbers in routes/middleware/security). Consolidated 'utf-8' encoding literals into core/constants.py.
Wave 5 cross-repo verification (PASS):
- Generator output bit-identity: SHA-256 of X_full/y_full arrays at seed=42 is bit-identical between this branch and origin/main for all 5 generators (spiral=3f0c88c1d0d8ac64, gaussian=d2c0928e0ccf7896, checkerboard=db1ba33f4513b94f, circles=123b27f4864d6e45, xor=0c30c5d07cdedece)
- HEADER_X_API_KEY='X-API-Key' matches the value used by juniper-data-client.constants.API_KEY_HEADER_NAME
Wave 6: AGENTS.md updated with the layer-scoped constants module pattern; CHANGELOG.md gains an Unreleased entry.

No public API changes — HTTP request/response shapes, settings prefix, and storage formats are unchanged.

Test plan

Full pytest suite (835+ tests) — exit 0
pre-commit run --all-files — all 19 hooks pass (ruff lint+format, mypy, bandit, yamllint, shellcheck)
Generator output reproducibility verification (Wave 5)
CI on PR

🤖 Generated with Claude Code

Creates 10 new constants files (145 constants total) holding every hardcoded literal currently inlined in the API, storage, core, and generator layers. api/constants.py (33): - Exempt-paths set for security middleware - Security header names and values (CSP, HSTS, X-Frame, X-Content, Referrer, Permissions, X-RateLimit-*, X-Request-ID, X-API-Key) - Default content security policy - Max request body bytes (10 MiB) - Rate limit defaults (60 req/min, 60s window) - Observability defaults (service name, namespace, log format, Sentry trace sample rate, Prometheus duration buckets) - Generation status outcome strings (success/error) Note: HTTP status code constants are intentionally NOT defined here. Wave 2 will adopt starlette.status per the data plan doc Phase 3. storage/constants.py (16): - Redis: host, port, db, key prefix, meta/artifact key suffixes - PostgreSQL: host, port, datasets table name - Local FS: .meta.json, .npz, .tmp file suffixes - JSON serialization indent - Pagination defaults (limit=100, offset=0) - Default artifact format (npz) core/constants.py (11): - CHARSET_UTF8 (used in 7+ source files) - DATASET_ID_HASH_PREFIX_LENGTH (16) - Pydantic field length constraints (DESCRIPTION_MAX_LENGTH=500, CREATED_BY_MAX_LENGTH=100) - Batch operation size limits (delete/update=100, create/export=50) - DatasetListFilter tags_match default and pattern Per-generator defaults.py (7 files, 85 constants total) — each follows the existing generators/spiral/defaults.py structure with per-field defaults plus validation bounds: - xor/defaults.py (12 const) - checkerboard/defaults.py (11 const) - gaussian/defaults.py (14 const) - circles/defaults.py (14 const) - mnist/defaults.py (11 const) - csv_import/defaults.py (9 const) - arc_agi/defaults.py (14 const) All values verified against current source line-by-line. Naming follows the existing spiral defaults convention. Wave 1 of the hardcoded values refactor — constants only, no application logic modified. All 357 existing tests pass; pre-commit hooks pass.

…sks 2.7-2.10) Replaces ~115 inline hardcoded literals across 17 files with imports from the Wave 1 constants modules: juniper_data.api.constants, juniper_data.storage.constants, juniper_data.core.constants, and the per-generator defaults.py modules. Task 2.7 — API layer (3 files, ~30 replacements): - middleware.py: All 7 security headers and their values now reference HEADER_*/*_VALUE constants; HSTS conditional uses HEADER_X_FORWARDED_PROTO + PROXY_PROTOCOL_HTTPS; rate limit response headers use HEADER_X_RATELIMIT_*; the 413 status code uses starlette.status.HTTP_413_REQUEST_ENTITY_TOO_LARGE (per the data plan Phase 3 decision to adopt starlette.status). Module-level EXEMPT_PATHS, _DEFAULT_CSP, and _MAX_REQUEST_BODY_BYTES are kept as thin aliases to the constants for backwards compatibility with any test that may import them. - security.py: APIKeyHeader name uses HEADER_X_API_KEY; api_key lookup uses HEADER_X_API_KEY; RateLimiter constructor defaults use DEFAULT_RATE_LIMIT_REQUESTS_PER_MINUTE and DEFAULT_RATE_LIMIT_WINDOW_SECONDS; rate-limit response headers use HEADER_X_RATELIMIT_LIMIT/REMAINING/RESET and HEADER_RETRY_AFTER; get_rate_limiter fallback uses DEFAULT_RATE_LIMIT_REQUESTS_PER_MINUTE - observability.py: _SERVICE_NAME_DEFAULT and _NAMESPACE_DEFAULT now reference DEFAULT_SERVICE_NAME and DEFAULT_NAMESPACE; X-Request-ID reads + writes use HEADER_X_REQUEST_ID; the JSON format check uses LOG_FORMAT_JSON; the plain-text formatter uses DEFAULT_LOG_FORMAT_PLAIN; configure_sentry default traces_sample_rate uses DEFAULT_SENTRY_TRACES_SAMPLE_RATE; the Histogram buckets tuple uses DATASET_GENERATION_DURATION_BUCKETS; the success-status comparison uses GENERATION_STATUS_SUCCESS Task 2.8 — Storage layer (5 files, ~25 replacements; plan said 4): - local_fs.py: meta/npz/tmp file suffixes use META_FILE_SUFFIX, NPZ_FILE_SUFFIX, TMP_FILE_SUFFIX; JSON serialization uses JSON_INDENT_DEFAULT; all encoding="utf-8" uses CHARSET_UTF8; list_datasets defaults use DEFAULT_LIST_LIMIT/OFFSET - redis_store.py: Constructor defaults use REDIS_DEFAULT_HOST/PORT/DB and REDIS_DEFAULT_KEY_PREFIX; meta/artifact key suffixes use REDIS_META_KEY_SUFFIX and REDIS_ARTIFACT_KEY_SUFFIX; all utf-8 encode/decode use CHARSET_UTF8; list_datasets defaults use DEFAULT_LIST_LIMIT/OFFSET; the slice [-5] (length of ":meta") now uses len(REDIS_META_KEY_SUFFIX) for safety - postgres_store.py: Constructor host/port defaults use POSTGRES_DEFAULT_HOST and POSTGRES_DEFAULT_PORT; artifact file extension uses NPZ_FILE_SUFFIX - memory.py: list_datasets defaults use DEFAULT_LIST_LIMIT/OFFSET - cached.py: list_datasets defaults use DEFAULT_LIST_LIMIT/OFFSET Task 2.9 — Core layer (2 files, ~10 replacements): - dataset_id.py: hash digest encoding uses CHARSET_UTF8; the hash_digest[:16] slice now uses DATASET_ID_HASH_PREFIX_LENGTH - models.py: All 8 batch operation max_length / min_length args now use the BATCH_*_MAX_ITEMS / BATCH_MIN_ITEMS constants (BatchDeleteRequest, BatchCreateRequest, BatchUpdateTagsRequest, BatchExportRequest, plus the BatchCreateItem.description and created_by); CreateDatasetRequest.description and created_by use DESCRIPTION_MAX_LENGTH and CREATED_BY_MAX_LENGTH; DatasetListFilter.tags_match default and pattern use TAGS_MATCH_DEFAULT and TAGS_MATCH_PATTERN Task 2.10 — Generator params (7 files, ~50 replacements): - xor/params.py: All 7 Field defaults + 3 validation bounds now use XOR_DEFAULT_* / MIN_* constants from xor/defaults.py - checkerboard/params.py: All 8 Field defaults + 4 validation bounds use CHECKERBOARD_DEFAULT_* / MIN_* / MAX_* constants - circles/params.py: All 7 Field defaults + 2 validation bounds use CIRCLES_DEFAULT_* / MIN_* constants - gaussian/params.py: All 9 Field defaults + 5 validation bounds use GAUSSIAN_DEFAULT_* / MIN_* / MAX_* constants - mnist/params.py: All 6 Field defaults use MNIST_DEFAULT_* constants - csv_import/params.py: All 8 Field defaults use CSV_IMPORT_DEFAULT_* constants - arc_agi/params.py: All 8 Field defaults + 4 validation bounds use ARC_AGI_DEFAULT_* / MIN_* / MAX_* constants - spiral/params.py was already importing from defaults.py in Wave 1 (the model/example for the other generators) — no changes needed Notes: - The "datasets" SQL table name in postgres_store.py SCHEMA_SQL was NOT replaced via interpolation: the table appears in CREATE TABLE, ALTER TABLE, INSERT INTO, and CREATE INDEX statements across the multiline template, and interpolating it would change f-string semantics in subtle ways. POSTGRES_DATASETS_TABLE remains in storage/constants.py as documentation of the canonical name. - The "juniper_data" default database name in PostgresDatasetStore __init__ has no Wave 1 constant — left as-is. - The HSTS, CSP, and 'nosniff' literal strings in cascor's middleware were the model — but cascor's Wave 1 constants don't include them, so the cascor refactor in this wave was scoped narrower; juniper-data has its own complete header constants and they are all used here. Verified: - pytest juniper_data/tests/ exit=0 (~654 tests pass) - pre-commit run on all 17 files: all hooks pass after Ruff split the middleware import block into 3 statements (the `as` aliases trigger a Ruff rule that prefers separate imports for re-aliasing) - Import smoke test: all generator default values match the original literals (XorParams n_points_per_quadrant=50 margin=0.1, Checkerboard n_samples=200 n_squares=4, Gaussian n_classes=2 center_radius=3.0, Mnist dataset=mnist flatten=True), middleware._MAX_REQUEST_BODY_BYTES = 10485760, _DEFAULT_CSP starts with "default-src 'none'", EXEMPT_PATHS contains /v1/health, dataset_id format unchanged (test-v1-2fb118e57a9d6c47 has 16-char suffix), DatasetListFilter tags_match default = "any"

…sks 3.2-3.3) Replaces ~14 inline hardcoded literals across 6 files with imports from juniper_data.core.constants (CHARSET_UTF8) and the standard starlette.status module (per the data plan Phase 3 decision to prefer starlette.status over custom HTTP status integer constants). Task 3.3 — starlette.status (3 files, ~10 replacements): - api/app.py: ValueError handler now raises status.HTTP_400_BAD_REQUEST; general exception handler raises status.HTTP_500_INTERNAL_SERVER_ERROR - api/routes/datasets.py: get_store storage check uses status.HTTP_500_INTERNAL_SERVER_ERROR; create_dataset decorator uses status.HTTP_201_CREATED; unknown generator + invalid params errors use status.HTTP_400_BAD_REQUEST; batch-create decorator uses status.HTTP_201_CREATED; batch-export "none found" error uses status.HTTP_404_NOT_FOUND; "no versions found" error uses status.HTTP_404_NOT_FOUND; all 5 "Dataset not found" errors use status.HTTP_404_NOT_FOUND; delete decorator uses status.HTTP_204_NO_CONTENT - api/routes/generators.py: schema lookup "Generator not found" uses status.HTTP_404_NOT_FOUND Task 3.2 — utf-8 encoding strings (3 files, 4 replacements): - storage/kaggle_store.py: open(...encoding="utf-8") uses CHARSET_UTF8 - generators/arc_agi/generator.py: same - generators/csv_import/generator.py: 2 file open() calls, both use CHARSET_UTF8 Notes: - juniper_data/storage/local_fs.py and redis_store.py + core/dataset_id.py were already using CHARSET_UTF8 from Wave 2 (task 2.8/2.9). The remaining 4 utf-8 literals targeted by task 3.2 lived in generators/storage helper code that wasn't touched in Wave 2. - Test files (tests/unit/test_kaggle_store.py and test_security_boundaries.py) still contain `encoding="utf-8"` literals — these are test fixtures rather than production code, so they were left alone per the Wave 3 testing-vs-source distinction. - The middleware.py 413 status code already uses status.HTTP_413_REQUEST_ENTITY_TOO_LARGE from Wave 2 (task 2.7). Verified: - pytest juniper_data/tests/: exit=0 (all tests pass) - pre-commit on all 6 files: black/ruff/mypy/bandit clean

… CHANGELOG Wave 6 (Documentation & Release) for the hardcoded-values refactor. - AGENTS.md: add `api/constants.py`, `storage/constants.py`, and `core/constants.py` to the directory tree and the Component Overview table; extend the "Code Style Conventions → Constants" section to document the layer-scoped constants module pattern, the role of each layer's constants.py, the per-generator `params.py` defaults, and the rule that HTTP status codes use `starlette.status` constants instead of magic numbers. - CHANGELOG.md: add an Unreleased entry summarizing the three new constants modules (Wave 1), the ~115 inline-literal replacements across 17 files in the api, storage, and core layers plus all 7 generator parameter modules (Waves 2 + 3), the starlette.status adoption, and the Wave 5 generator-output bit-identity verification (sha256 of X_full/y_full at seed=42 matches origin/main for all 5 generators; full pytest pass; 19 pre-commit hooks clean).

pcalnon

approved

pcalnon added 4 commits April 9, 2026 16:15

pcalnon mentioned this pull request Apr 10, 2026

refactor: hardcoded values cleanup (Waves 1-6) pcalnon/juniper-data-client#18

Merged

4 tasks

pcalnon self-assigned this Apr 11, 2026

pcalnon commented Apr 11, 2026

View reviewed changes

pcalnon merged commit 9bb0d0b into main Apr 11, 2026
36 of 40 checks passed

pcalnon deleted the feature/hardcoded-values-wave1 branch April 11, 2026 22:41

pcalnon mentioned this pull request May 6, 2026

refactor(observability): migrate to juniper-observability register_or_reuse (Phase 2a) #95

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor: hardcoded values cleanup (Waves 1-6)#37

refactor: hardcoded values cleanup (Waves 1-6)#37
pcalnon merged 4 commits intomainfrom
feature/hardcoded-values-wave1

pcalnon commented Apr 10, 2026

Uh oh!

pcalnon left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

pcalnon commented Apr 10, 2026

Summary

Test plan

Uh oh!

pcalnon left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant