
Persistent registry; lifecycle management#351

Merged
kaste merged 30 commits into main from peristent-registry on Mar 31, 2026


kaste (Collaborator) commented Mar 31, 2026

Let's try this one.

Created an orphaned branch the-registry, and changed crawl.yml to read the registry from there and to push it back to that branch if it changed.

generate_registry learned lifecycle management. New packages are tagged with "first_seen"; removed packages are re-added as tombstones.

A new tool generate_seed.py extracts "first_seen"/"removed" data from workspace.json files or registries.

Old data has been scraped from packagecontrol.io, so we have more tombstones in the database than before. It's unlikely we have all of them; I don't know what the actual policy was for pc.io.

kaste added 30 commits March 31, 2026 12:21
Treat removed workspace entries without a stored source as coming from
MAIN_REPOSITORY_SOURCE when enforcing ensure_secure_source(). This
closes a takeover gap for imported tombstones that lacked source data.

Keep the denial message honest by showing the persisted workspace value
in diagnostics. When source is missing, report "<not-set>" instead of a
synthesized trusted source.

Add a deny-rules test that covers removed entries without source and
asserts both denial behavior and the new message wording.
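The rule described above could be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `check_source_move` is a hypothetical helper, the `MAIN_REPOSITORY_SOURCE` value is a placeholder, and the sketch assumes a simplified policy that denies any move away from the effective source.

```python
from typing import Optional

# Placeholder value; the real MAIN_REPOSITORY_SOURCE is not shown in this PR.
MAIN_REPOSITORY_SOURCE = "https://example.invalid/main-repository.json"


def check_source_move(existing: dict, incoming_source: str) -> Optional[str]:
    """Return a denial message, or None if the move is allowed (hypothetical)."""
    stored = existing.get("source")
    if "removed" in existing and stored is None:
        # Tombstones imported without a source count as main-repository entries.
        effective = MAIN_REPOSITORY_SOURCE
    else:
        effective = stored
    if effective is None or effective == incoming_source:
        return None
    # Diagnostics show the persisted value, never the synthesized one.
    shown = stored if stored is not None else "<not-set>"
    return f"denied: source move from {shown} to {incoming_source}"
```

The point of the split between `effective` and `shown` is that enforcement uses the trusted default while the message stays honest about what is actually stored.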
When crawl_package() raises, keep existing workspace state but ensure
source is present by defaulting from the registry package contract.

This keeps security behavior stable for denied source moves while also
repairing entries that never had a successful crawl and thus missed
source entirely.

Add a regression test that verifies failed crawls adopt source from the
registry entry when the existing workspace entry has no source.
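A minimal sketch of that fallback, assuming workspace and registry entries are plain dicts with a `source` key. `crawl_or_keep` and its `crawl` callable parameter are hypothetical names used only for illustration.

```python
from typing import Callable


def crawl_or_keep(existing: dict, registry_entry: dict,
                  crawl: Callable[[dict], dict]) -> dict:
    """Crawl a package; on failure keep the old state but repair a missing source."""
    try:
        return crawl(registry_entry)
    except Exception:
        kept = dict(existing)
        if kept.get("source") is None:
            # Entry never had a successful crawl: adopt source from the registry.
            kept["source"] = registry_entry["source"]
        return kept
```

An existing source is deliberately left alone, so a denied source move cannot slip through via a failed crawl.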
Exclude registry entries with a removed field from the scheduler in both
normal and presto modes.

Also block explicit --name crawls for tombstoned packages with a clear
message so manual runs follow the same tombstone rule.

Add focused scheduler tests that verify removed entries are skipped and
that the next-run hint ignores tombstoned packages.

Keep --name handling simple and explicit: if the selected registry
package is tombstoned, print a clear message and return without crawl.

Add a focused regression test for main_() that verifies tombstoned
packages are rejected in name mode and workspace remains unchanged.
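The tombstone rule for both the scheduler and `--name` mode could look roughly like this. Both function names and the message wording are assumptions; the only detail taken from the commit messages is that a `removed` field marks a tombstone.

```python
from typing import Optional, Tuple


def schedulable(registry_packages: list) -> list:
    """Drop tombstoned entries before the scheduler sees them."""
    return [p for p in registry_packages if "removed" not in p]


def pick_by_name(registry_packages: list, name: str) -> Tuple[Optional[dict], Optional[str]]:
    """Return (entry, message); entry is None when nothing should be crawled."""
    for p in registry_packages:
        if p["name"] == name:
            if "removed" in p:
                # Hypothetical wording; the real message is not quoted in the PR.
                return None, f"{name} is tombstoned; skipping crawl"
            return p, None
    return None, f"{name} not found in registry"
```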
Teach explain_main() to treat tombstoned registry entries explicitly.

For tombstones, print a clear status line to stderr. In normal mode,
print the raw entry as pretty JSON. In EFFECTIVE mode, print only the
status line and no JSON payload since there is no effective release view.

Keep this path simple by inlining the tombstone JSON print instead of
adding a helper wrapper.
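As an illustration of the output split described above, here is a small sketch. `explain_tombstone` is a hypothetical standalone function (the commit inlines this logic), and the status line wording is assumed.

```python
import json
import sys
from typing import TextIO


def explain_tombstone(entry: dict, effective: bool,
                      out: TextIO = sys.stdout, err: TextIO = sys.stderr) -> None:
    """Print a status line to stderr; print the raw entry only in normal mode."""
    # Hypothetical wording for the status line.
    print(f"{entry['name']} is tombstoned (removed: {entry['removed']})", file=err)
    if not effective:
        # Normal mode: show the raw registry entry as pretty JSON.
        print(json.dumps(entry, indent=2), file=out)
```

Keeping the status on stderr means stdout is either valid JSON or empty, which is convenient when piping the output.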
Move the effective explain logic and its helper functions out of
crawl.py and into _explain_package.py so the explain-specific code
lives together in one place.

As part of that extraction, move the shared sublime_text selector
parsing helpers into _utils.py so both crawl runtime logic and explain
logic use the same implementation.

Update the explain tests to import the helper-facing functions from
_explain_package.

Teach maintenance() to copy tombstoned registry entries into
workspace.packages before the legacy orphan-marking step.

This keeps removed packages present in workspace and intentionally
overwrites stale crawl-only fields with the canonical tombstone data.

Add focused maintenance tests for tombstone import, overwrite behavior,
and continued orphan removed marking.
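The import step might be sketched like this, assuming `workspace.packages` behaves like a dict keyed by package name. `import_tombstones` is a hypothetical name; in the PR this logic lives inside `maintenance()`.

```python
def import_tombstones(registry_packages: list, workspace_packages: dict) -> None:
    """Copy tombstoned registry entries into the workspace, replacing stale data."""
    for entry in registry_packages:
        if "removed" in entry:
            # Full overwrite on purpose: crawl-only fields from an earlier
            # life of the package must not survive next to the tombstone.
            workspace_packages[entry["name"]] = dict(entry)
```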
Add a regression test that imports a tombstoned package via
maintenance(), then runs main_() with an active registry entry for the
same name.

Verify resurrection works without special-case code: the package is
crawled, removed is cleared, source remains stable, and first_seen is
preserved.

Add describe_registry_changes.py to generate commit-message text from
old/new registry snapshots.

Implement change classification for both packages and libraries,
including single-change messages, metadata bulk edits, and mixed bulk
edits with additions, tombstones, and resurrections.

Keep repositories out of primary classification, but fall back to
"Update registry.json" when repositories change without any entity
change. This keeps "Same." strict so it only appears when no commit is
needed.

Add focused tests for all supported classifications and fallback cases,
using loader mocking for CLI tests.
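A simplified sketch of the classification idea follows. It covers packages only, assumes registries are dicts with `packages` and `repositories` keys, and invents the summary wording; only "Same." and "Update registry.json" are taken from the commit message above.

```python
def classify(old: dict, new: dict) -> dict:
    """Group package names by the kind of change between two snapshots."""
    old_pk = {p["name"]: p for p in old.get("packages", [])}
    new_pk = {p["name"]: p for p in new.get("packages", [])}
    changes = {"added": [], "tombstoned": [], "resurrected": [], "edited": []}
    for name, entry in new_pk.items():
        before = old_pk.get(name)
        if before is None:
            changes["added"].append(name)
        elif "removed" in entry and "removed" not in before:
            changes["tombstoned"].append(name)
        elif "removed" not in entry and "removed" in before:
            changes["resurrected"].append(name)
        elif entry != before:
            changes["edited"].append(name)
    # Names that vanish entirely are ignored here; with lifecycle seeding
    # they are expected to come back as tombstones instead.
    return changes


def describe(old: dict, new: dict) -> str:
    changes = classify(old, new)
    if any(changes.values()):
        parts = [f"{len(names)} {kind}" for kind, names in changes.items() if names]
        return "Update registry: " + ", ".join(parts)  # hypothetical wording
    if old.get("repositories") != new.get("repositories"):
        return "Update registry.json"
    return "Same."
```

Checking repositories only after the entity changes is what keeps "Same." strict: it is returned only when nothing at all differs.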
Implement implicit seed loading in generate_registry based on --output,
with explicit overrides via --seed and opt-out via --no-seed.

In seeded mode, preserve package first_seen, synthesize tombstones for
missing packages, preserve tombstone removed timestamps, and keep
resurrection first_seen. Libraries remain non-tombstoned.

Keep fetching_source_failed behavior intact and add focused registry
tests that cover seed/no-seed behavior, tombstones, resurrection,
library handling, and deterministic package ordering.

Simplify seeded lifecycle handling after initial implementation.

Use pick() for seed extraction, inline package sorting by name, and
remove a no-op removed-field cleanup path.

Also ensure first_seen is populated when missing, including for
tombstoned entries, while still preserving seeded first_seen when
available.
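Taken together, the seeded lifecycle rules could be sketched as below. `apply_lifecycle` is a hypothetical stand-in for the PR's `apply_seed_lifecycle`, entries are assumed to be dicts with a `name` key, and the tombstone here simply copies the seed entry, whereas the real `build_tombstone` filters keys.

```python
def apply_lifecycle(fresh: list, seed: list, now: str) -> list:
    """Merge freshly generated packages with lifecycle data from a seed."""
    seed_by_name = {e["name"]: e for e in seed}
    result = []
    for entry in fresh:
        seeded = seed_by_name.pop(entry["name"], {})
        # Active (or resurrected) package: keep the seeded first_seen and
        # do not carry over any old "removed" marker.
        result.append({**entry, "first_seen": seeded.get("first_seen", now)})
    for seeded in seed_by_name.values():
        # Known to the seed but missing now: synthesize or keep a tombstone.
        tomb = dict(seeded)
        tomb.setdefault("first_seen", now)
        tomb.setdefault("removed", now)
        result.append(tomb)
    return sorted(result, key=lambda e: e["name"])
```

Using `setdefault` for `removed` is what preserves the original removal timestamp of an existing tombstone across runs.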
Clarify generate_registry seed semantics in both CLI help and README.
The docs now explain implicit vs explicit --seed behavior and the
interaction with --no-seed plus fetching_source_failed.

Add scripts.seed_from_workspace as a first-class script and document
its usage inline with generate_registry. The script emits sparse output
for optional fields and avoids writing null source values.

Introduce scripts.generate_seed as the new seed extraction command.
It accepts exactly one input source via a required mutually exclusive
flag: --workspace [PATH] or --registry [PATH].

Update README examples to use generate_seed and document both
supported input modes. Remove the old seed_from_workspace script.
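A CLI of this shape can be built with argparse's mutually exclusive groups. The sketch below is an assumption about how it might be wired, not the script's actual parser; in particular the default paths used for `const` are made up.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="generate_seed")
    # Exactly one of the two input flags must be given.
    group = parser.add_mutually_exclusive_group(required=True)
    # nargs="?" makes PATH optional; const is used when the flag has no value.
    # Both default file names are assumptions for this sketch.
    group.add_argument("--workspace", nargs="?", const="workspace.json", metavar="PATH")
    group.add_argument("--registry", nargs="?", const="registry.json", metavar="PATH")
    return parser
```

With this setup, `--workspace` alone picks the default path, `--registry some.json` uses the given path, and passing both flags or neither makes argparse exit with a usage error.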
Add an incomplete-shape warning in generate_seed based on expected
entry sizes (2 keys for active entries, 5 for tombstoned entries).
The warning triggers when more than 10% of entries are incomplete.

Special-case the all-incomplete scenario with a clearer message:
"All packages have an incomplete shape".
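The threshold check could look roughly like this. The sketch assumes "incomplete" means fewer keys than expected, and the wording of the partial warning is invented; only the all-incomplete message is quoted from the commit.

```python
from typing import Optional


def shape_warning(entries: list) -> Optional[str]:
    """Return a warning when too many seed entries look incomplete."""
    if not entries:
        return None
    incomplete = sum(
        1 for e in entries
        # Expected sizes: 5 keys for tombstones, 2 for active entries.
        if len(e) < (5 if "removed" in e else 2)
    )
    if incomplete == len(entries):
        return "All packages have an incomplete shape"
    if incomplete / len(entries) > 0.10:
        # Hypothetical wording for the partial case.
        return f"{incomplete} of {len(entries)} packages have an incomplete shape"
    return None
```

Note that exactly 10% incomplete does not trigger the warning, since the commit says "more than 10%".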
Separate lifecycle seeding from source-failure recovery in
generate_registry. Recovery of failed repositories now requires
registry-shaped data and no longer reconstructs entries from
workspace/seed maps.

If an explicit seed is not registry-shaped, the command falls back to
prior --output when available for recovery data. On fetch failures,
emit a focused warning when the seed knows package names but no full
recovery entries exist, with message text that reflects --no-seed
behavior.

Add regression tests for non-registry seed input and fallback-to-output
recovery behavior.

Use has_registry_shape directly in resolve_failure_recovery_db
without additionally checking available. The shape flag already
encodes successful load plus registry-compatible structure.

This keeps behavior unchanged while reducing redundant conditions.

Replace the SeedLoad available/null-object pattern with SeedDb | None.
This removes ambiguous truthiness handling and makes seed presence
explicit at call sites.

Also rename read_seed_db(explicit=...) to strict=... to better reflect
its behavior: strict mode raises on read/parse errors, while implicit
mode returns None.
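The strict/implicit split might be sketched as follows. The signature and the `SeedDb` alias are assumptions based on the names in the commit message; the real function may accept other arguments or file formats.

```python
import json
from pathlib import Path
from typing import Any, Dict, Optional

SeedDb = Dict[str, Any]  # assumed alias; the real type is not shown in the PR


def read_seed_db(path: str, *, strict: bool) -> Optional[SeedDb]:
    """Load a seed file; strict mode raises, implicit mode returns None."""
    try:
        return json.loads(Path(path).read_text(encoding="utf-8"))
    except (OSError, ValueError):
        # json.JSONDecodeError and UnicodeDecodeError are ValueError subclasses.
        if strict:
            raise
        return None
```

Call sites can then write `if seed is None:` instead of probing an `available` flag on a null object.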
Simplify lifecycle seed handling by building the name->entry map
directly inside apply_seed_lifecycle and removing extract_seed_packages.

Also generalize build_tombstone to accept Mapping[str, Any], keeping
the key filtering in one place.

Replace generic Mapping typing for recovery_db with a dedicated
RecoveryDb shape. This matches runtime guarantees and simplifies
recovery iteration code.

Use a TypeGuard for registry-shape detection and return RecoveryDb | None
from resolve_failure_recovery_db.
Replace iter_db_entries(db, kind) with iter_package_entries(db),
since this helper is only used for package traversal.

This removes an unused selector parameter and makes call sites more
explicit.

Avoid false-positive recovery warnings when a full registry-shaped seed
is present but a failed repository has no known entries.

Emit a clear generic warning only for compact seed.json inputs where
failed-source recovery cannot be guaranteed, and point users to a full
registry.json seed for complete recovery.

Update and extend registry tests to cover compact-seed warning behavior
and registry-seed non-warning behavior.

Update crawl workflow to seed generate_registry from
./.the-registry/registry.json and sync branch state via an
extracted shell script.

Add .github/workflows/sync_registry_branch.sh to perform
compare-first syncing, fallback commit messaging, and push to
the-registry.

Add pytest coverage for happy path, no-op behavior, and
classifier crash fallback using a local bare origin to avoid
network pushes.
kaste merged commit a825b0e into main on Mar 31, 2026
3 checks passed
kaste deleted the peristent-registry branch on March 31, 2026 10:30
braver (Member) commented Apr 1, 2026

> don't know what the actual policy was for pc.io.

Not sure what it looks like in the data, but on the site removals are simply not handled (i.e. nothing gets removed or even marked as such). Maybe Will sometimes did something manually, but not in recent years. I removed this one ages ago, for instance: https://packagecontrol.io/packages/Theme%20-%20Sea%20Lion.

kaste (Collaborator, Author) commented Apr 1, 2026

Oh, they just look like pages that don't get downloads. How would I scrape them? It is basically just a page with an outdated LAST SEEN tag, so you just go through all the pages and look for that. Certainly possible.

kaste (Collaborator, Author) commented Apr 1, 2026

Actually, this would typically be installed on the package_control_channel repo, but we don't have enough permissions to run it over there, and it'd just be a hassle.

braver (Member) commented Apr 3, 2026

Maybe one day though! 🤞🏻
