refactor: hardcoded values cleanup (Waves 1-6)#37
Merged
Conversation
Creates 10 new constants files (145 constants total) holding every hardcoded literal currently inlined in the API, storage, core, and generator layers. api/constants.py (33): - Exempt-paths set for security middleware - Security header names and values (CSP, HSTS, X-Frame, X-Content, Referrer, Permissions, X-RateLimit-*, X-Request-ID, X-API-Key) - Default content security policy - Max request body bytes (10 MiB) - Rate limit defaults (60 req/min, 60s window) - Observability defaults (service name, namespace, log format, Sentry trace sample rate, Prometheus duration buckets) - Generation status outcome strings (success/error) Note: HTTP status code constants are intentionally NOT defined here. Wave 2 will adopt starlette.status per the data plan doc Phase 3. storage/constants.py (16): - Redis: host, port, db, key prefix, meta/artifact key suffixes - PostgreSQL: host, port, datasets table name - Local FS: .meta.json, .npz, .tmp file suffixes - JSON serialization indent - Pagination defaults (limit=100, offset=0) - Default artifact format (npz) core/constants.py (11): - CHARSET_UTF8 (used in 7+ source files) - DATASET_ID_HASH_PREFIX_LENGTH (16) - Pydantic field length constraints (DESCRIPTION_MAX_LENGTH=500, CREATED_BY_MAX_LENGTH=100) - Batch operation size limits (delete/update=100, create/export=50) - DatasetListFilter tags_match default and pattern Per-generator defaults.py (7 files, 85 constants total) — each follows the existing generators/spiral/defaults.py structure with per-field defaults plus validation bounds: - xor/defaults.py (12 const) - checkerboard/defaults.py (11 const) - gaussian/defaults.py (14 const) - circles/defaults.py (14 const) - mnist/defaults.py (11 const) - csv_import/defaults.py (9 const) - arc_agi/defaults.py (14 const) All values verified against current source line-by-line. Naming follows the existing spiral defaults convention. Wave 1 of the hardcoded values refactor — constants only, no application logic modified. All 357 existing tests pass; pre-commit hooks pass.
…sks 2.7-2.10) Replaces ~115 inline hardcoded literals across 17 files with imports from the Wave 1 constants modules: juniper_data.api.constants, juniper_data.storage.constants, juniper_data.core.constants, and the per-generator defaults.py modules. Task 2.7 — API layer (3 files, ~30 replacements): - middleware.py: All 7 security headers and their values now reference HEADER_*/*_VALUE constants; HSTS conditional uses HEADER_X_FORWARDED_PROTO + PROXY_PROTOCOL_HTTPS; rate limit response headers use HEADER_X_RATELIMIT_*; the 413 status code uses starlette.status.HTTP_413_REQUEST_ENTITY_TOO_LARGE (per the data plan Phase 3 decision to adopt starlette.status). Module-level EXEMPT_PATHS, _DEFAULT_CSP, and _MAX_REQUEST_BODY_BYTES are kept as thin aliases to the constants for backwards compatibility with any test that may import them. - security.py: APIKeyHeader name uses HEADER_X_API_KEY; api_key lookup uses HEADER_X_API_KEY; RateLimiter constructor defaults use DEFAULT_RATE_LIMIT_REQUESTS_PER_MINUTE and DEFAULT_RATE_LIMIT_WINDOW_SECONDS; rate-limit response headers use HEADER_X_RATELIMIT_LIMIT/REMAINING/RESET and HEADER_RETRY_AFTER; get_rate_limiter fallback uses DEFAULT_RATE_LIMIT_REQUESTS_PER_MINUTE - observability.py: _SERVICE_NAME_DEFAULT and _NAMESPACE_DEFAULT now reference DEFAULT_SERVICE_NAME and DEFAULT_NAMESPACE; X-Request-ID reads + writes use HEADER_X_REQUEST_ID; the JSON format check uses LOG_FORMAT_JSON; the plain-text formatter uses DEFAULT_LOG_FORMAT_PLAIN; configure_sentry default traces_sample_rate uses DEFAULT_SENTRY_TRACES_SAMPLE_RATE; the Histogram buckets tuple uses DATASET_GENERATION_DURATION_BUCKETS; the success-status comparison uses GENERATION_STATUS_SUCCESS Task 2.8 — Storage layer (5 files, ~25 replacements; plan said 4): - local_fs.py: meta/npz/tmp file suffixes use META_FILE_SUFFIX, NPZ_FILE_SUFFIX, TMP_FILE_SUFFIX; JSON serialization uses JSON_INDENT_DEFAULT; all encoding="utf-8" uses CHARSET_UTF8; list_datasets defaults use DEFAULT_LIST_LIMIT/OFFSET - redis_store.py: Constructor defaults use REDIS_DEFAULT_HOST/PORT/DB and REDIS_DEFAULT_KEY_PREFIX; meta/artifact key suffixes use REDIS_META_KEY_SUFFIX and REDIS_ARTIFACT_KEY_SUFFIX; all utf-8 encode/decode use CHARSET_UTF8; list_datasets defaults use DEFAULT_LIST_LIMIT/OFFSET; the slice [-5] (length of ":meta") now uses len(REDIS_META_KEY_SUFFIX) for safety - postgres_store.py: Constructor host/port defaults use POSTGRES_DEFAULT_HOST and POSTGRES_DEFAULT_PORT; artifact file extension uses NPZ_FILE_SUFFIX - memory.py: list_datasets defaults use DEFAULT_LIST_LIMIT/OFFSET - cached.py: list_datasets defaults use DEFAULT_LIST_LIMIT/OFFSET Task 2.9 — Core layer (2 files, ~10 replacements): - dataset_id.py: hash digest encoding uses CHARSET_UTF8; the hash_digest[:16] slice now uses DATASET_ID_HASH_PREFIX_LENGTH - models.py: All 8 batch operation max_length / min_length args now use the BATCH_*_MAX_ITEMS / BATCH_MIN_ITEMS constants (BatchDeleteRequest, BatchCreateRequest, BatchUpdateTagsRequest, BatchExportRequest, plus the BatchCreateItem.description and created_by); CreateDatasetRequest.description and created_by use DESCRIPTION_MAX_LENGTH and CREATED_BY_MAX_LENGTH; DatasetListFilter.tags_match default and pattern use TAGS_MATCH_DEFAULT and TAGS_MATCH_PATTERN Task 2.10 — Generator params (7 files, ~50 replacements): - xor/params.py: All 7 Field defaults + 3 validation bounds now use XOR_DEFAULT_* / MIN_* constants from xor/defaults.py - checkerboard/params.py: All 8 Field defaults + 4 validation bounds use CHECKERBOARD_DEFAULT_* / MIN_* / MAX_* constants - circles/params.py: All 7 Field defaults + 2 validation bounds use CIRCLES_DEFAULT_* / MIN_* constants - gaussian/params.py: All 9 Field defaults + 5 validation bounds use GAUSSIAN_DEFAULT_* / MIN_* / MAX_* constants - mnist/params.py: All 6 Field defaults use MNIST_DEFAULT_* constants - csv_import/params.py: All 8 Field defaults use CSV_IMPORT_DEFAULT_* constants - arc_agi/params.py: All 8 Field defaults + 4 validation bounds use ARC_AGI_DEFAULT_* / MIN_* / MAX_* constants - spiral/params.py was already importing from defaults.py in Wave 1 (the model/example for the other generators) — no changes needed Notes: - The "datasets" SQL table name in postgres_store.py SCHEMA_SQL was NOT replaced via interpolation: the table appears in CREATE TABLE, ALTER TABLE, INSERT INTO, and CREATE INDEX statements across the multiline template, and interpolating it would change f-string semantics in subtle ways. POSTGRES_DATASETS_TABLE remains in storage/constants.py as documentation of the canonical name. - The "juniper_data" default database name in PostgresDatasetStore __init__ has no Wave 1 constant — left as-is. - The HSTS, CSP, and 'nosniff' literal strings in cascor's middleware were the model — but cascor's Wave 1 constants don't include them, so the cascor refactor in this wave was scoped narrower; juniper-data has its own complete header constants and they are all used here. Verified: - pytest juniper_data/tests/ exit=0 (~654 tests pass) - pre-commit run on all 17 files: all hooks pass after Ruff split the middleware import block into 3 statements (the `as` aliases trigger a Ruff rule that prefers separate imports for re-aliasing) - Import smoke test: all generator default values match the original literals (XorParams n_points_per_quadrant=50 margin=0.1, Checkerboard n_samples=200 n_squares=4, Gaussian n_classes=2 center_radius=3.0, Mnist dataset=mnist flatten=True), middleware._MAX_REQUEST_BODY_BYTES = 10485760, _DEFAULT_CSP starts with "default-src 'none'", EXEMPT_PATHS contains /v1/health, dataset_id format unchanged (test-v1-2fb118e57a9d6c47 has 16-char suffix), DatasetListFilter tags_match default = "any"
…sks 3.2-3.3) Replaces ~14 inline hardcoded literals across 6 files with imports from juniper_data.core.constants (CHARSET_UTF8) and the standard starlette.status module (per the data plan Phase 3 decision to prefer starlette.status over custom HTTP status integer constants). Task 3.3 — starlette.status (3 files, ~10 replacements): - api/app.py: ValueError handler now raises status.HTTP_400_BAD_REQUEST; general exception handler raises status.HTTP_500_INTERNAL_SERVER_ERROR - api/routes/datasets.py: get_store storage check uses status.HTTP_500_INTERNAL_SERVER_ERROR; create_dataset decorator uses status.HTTP_201_CREATED; unknown generator + invalid params errors use status.HTTP_400_BAD_REQUEST; batch-create decorator uses status.HTTP_201_CREATED; batch-export "none found" error uses status.HTTP_404_NOT_FOUND; "no versions found" error uses status.HTTP_404_NOT_FOUND; all 5 "Dataset not found" errors use status.HTTP_404_NOT_FOUND; delete decorator uses status.HTTP_204_NO_CONTENT - api/routes/generators.py: schema lookup "Generator not found" uses status.HTTP_404_NOT_FOUND Task 3.2 — utf-8 encoding strings (3 files, 4 replacements): - storage/kaggle_store.py: open(...encoding="utf-8") uses CHARSET_UTF8 - generators/arc_agi/generator.py: same - generators/csv_import/generator.py: 2 file open() calls, both use CHARSET_UTF8 Notes: - juniper_data/storage/local_fs.py and redis_store.py + core/dataset_id.py were already using CHARSET_UTF8 from Wave 2 (task 2.8/2.9). The remaining 4 utf-8 literals targeted by task 3.2 lived in generators/storage helper code that wasn't touched in Wave 2. - Test files (tests/unit/test_kaggle_store.py and test_security_boundaries.py) still contain `encoding="utf-8"` literals — these are test fixtures rather than production code, so they were left alone per the Wave 3 testing-vs-source distinction. - The middleware.py 413 status code already uses status.HTTP_413_REQUEST_ENTITY_TOO_LARGE from Wave 2 (task 2.7). Verified: - pytest juniper_data/tests/: exit=0 (all tests pass) - pre-commit on all 6 files: black/ruff/mypy/bandit clean
… CHANGELOG Wave 6 (Documentation & Release) for the hardcoded-values refactor. - AGENTS.md: add `api/constants.py`, `storage/constants.py`, and `core/constants.py` to the directory tree and the Component Overview table; extend the "Code Style Conventions → Constants" section to document the layer-scoped constants module pattern, the role of each layer's constants.py, the per-generator `params.py` defaults, and the rule that HTTP status codes use `starlette.status` constants instead of magic numbers. - CHANGELOG.md: add an Unreleased entry summarizing the three new constants modules (Wave 1), the ~115 inline-literal replacements across 17 files in the api, storage, and core layers plus all 7 generator parameter modules (Waves 2 + 3), the starlette.status adoption, and the Wave 5 generator-output bit-identity verification (sha256 of X_full/y_full at seed=42 matches origin/main for all 5 generators; full pytest pass; 19 pre-commit hooks clean).
4 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Cross-repo hardcoded-values refactor for
juniper-data. Canonical source for theX-API-Keyheader constant and the/v1API version prefix consumed byjuniper-data-client.juniper_data/api/constants.py(32 symbols: header names, status code defaults, body/rate-limit limits, error message templates, exempt paths)juniper_data/storage/constants.py(16 symbols: filenames, metadata keys, table/column names)juniper_data/core/constants.py(11 symbols: encoding strings, magic numbers, fixed metadata keys)Field(default=...)defaults moved from inline literals into named module constants in each generator'sparams.py.starlette.statusconstants for HTTP status codes (no more404/413/500/503magic numbers in routes/middleware/security). Consolidated'utf-8'encoding literals intocore/constants.py.X_full/y_fullarrays atseed=42is bit-identical between this branch andorigin/mainfor all 5 generators (spiral=3f0c88c1d0d8ac64, gaussian=d2c0928e0ccf7896, checkerboard=db1ba33f4513b94f, circles=123b27f4864d6e45, xor=0c30c5d07cdedece)HEADER_X_API_KEY='X-API-Key'matches the value used byjuniper-data-client.constants.API_KEY_HEADER_NAMENo public API changes — HTTP request/response shapes, settings prefix, and storage formats are unchanged.
Test plan
pre-commit run --all-files— all 19 hooks pass (ruff lint+format, mypy, bandit, yamllint, shellcheck)🤖 Generated with Claude Code