nexus oid cache is broken across schema migrations #5561

@davepacheco

We updated dogfood today and ran into a new problem with instance provisions:

error_message_internal = saga ACTION error at node "sled_id": unexpected database error: type with ID 218 does not exist

After some digging, it appears that Nexus's OID caching is hanging onto a stale value for a type whose OID was changed by the schema migration. Specifically, a sequence like this happens during the upgrade:

  • After mupdate, new Nexus starts up, establishes its connections to CockroachDB, and populates its OID cache (which maps enum types like sled_resource_kind to their numeric database OID).
  • We run the schema migration that drops and re-creates the sled_resource_kind enum. This invalidates the cache entry because now the name sled_resource_kind points to a different OID.
  • In whatever context Diesel uses that cache (which appears to include at least INSERT statements), it uses the old OID, which does not correspond to any existing type any more, and we get this error from the database.

The workaround is to restart the Nexus instances after this happens: when they come back up, they re-populate their cache with the correct value. The real fix will be to somehow invalidate this cache after schema migrations, but we're still figuring out how to do that.
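To make the failure mode concrete, here is a minimal, self-contained sketch of the sequence above. It is a toy model, not Diesel's or Nexus's actual code: `Catalog`, `OidCache`, `insert_with_cached_oid`, and `invalidate` are hypothetical names, and the starting OID of 218 is chosen only to mirror the error message.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the database's type catalog (enum name -> OID).
struct Catalog {
    next_oid: u32,
    types: HashMap<String, u32>,
}

impl Catalog {
    fn new() -> Self {
        // 218 is illustrative only, chosen to match the error in this issue.
        Catalog { next_oid: 218, types: HashMap::new() }
    }

    /// Creating a type always assigns a fresh OID, even if the name was used before.
    fn create_type(&mut self, name: &str) -> u32 {
        let oid = self.next_oid;
        self.next_oid += 1;
        self.types.insert(name.to_string(), oid);
        oid
    }

    fn drop_type(&mut self, name: &str) {
        self.types.remove(name);
    }

    fn lookup(&self, name: &str) -> Option<u32> {
        self.types.get(name).copied()
    }

    fn oid_exists(&self, oid: u32) -> bool {
        self.types.values().any(|&o| o == oid)
    }
}

/// Hypothetical client-side cache, populated lazily and never refreshed on its own.
struct OidCache {
    cached: HashMap<String, u32>,
}

impl OidCache {
    fn new() -> Self {
        OidCache { cached: HashMap::new() }
    }

    /// Return the cached OID if present; otherwise ask the catalog and remember the answer.
    fn resolve(&mut self, catalog: &Catalog, name: &str) -> Option<u32> {
        if let Some(&oid) = self.cached.get(name) {
            return Some(oid);
        }
        let oid = catalog.lookup(name)?;
        self.cached.insert(name.to_string(), oid);
        Some(oid)
    }

    /// The kind of hook a real fix would need to call after a schema migration.
    fn invalidate(&mut self) {
        self.cached.clear();
    }
}

/// Simulates a statement that refers to an enum type by its (possibly cached) OID.
fn insert_with_cached_oid(
    catalog: &Catalog,
    cache: &mut OidCache,
    name: &str,
) -> Result<u32, String> {
    let oid = cache
        .resolve(catalog, name)
        .ok_or_else(|| format!("type {} does not exist", name))?;
    if catalog.oid_exists(oid) {
        Ok(oid)
    } else {
        Err(format!("type with ID {} does not exist", oid))
    }
}
```

In this model, creating `sled_resource_kind`, resolving it once, then dropping and re-creating it leaves the cache pointing at an OID the catalog no longer knows, so the next `insert_with_cached_oid` call fails. Constructing a fresh `OidCache` (the analogue of restarting Nexus) or calling `invalidate` makes the lookup succeed again with the new OID.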
