Improve speed, robustness and auditability by eliminating answer topic creation #601

Closed
8 tasks done
thclark opened this issue Jul 19, 2023 · 3 comments · Fixed by #603
Labels: backend, devops, framework, performance, strategy, tech-debt, user experience (UX)

Comments

@thclark
Contributor

thclark commented Jul 19, 2023

Feature request

Use Case

We want to:

  • Speed up the question ask/answer process
  • Improve robustness in the ask/answer cycle (eg by eliminating the problem where the topic limit on GCP can be reached, preventing additional answer topics from being created)
  • Allow infrastructure to be brought under clear IAC control (ie by not creating resources dynamically)
  • Allow a shift toward a fully event-driven ask/answer process (ie by making questions much more auditable and reconstructable)
  • Reduce the surface area of permissions required by services
  • Simplify resource/infrastructure definition

Current state

  • A service revision has a question topic which it is subscribed to permanently.
  • On receipt of a question, the service will create a new topic for the answer (dynamic resource creation, requiring topic create permission, and taking several seconds).
  • Any other service listening to the answer must know a priori the topic name, and subscribe to it (depending on which service creates the answer topic and exactly what order that happens, this can create a race condition if the topic is not created yet)
    • That race condition has two aspects: failure when attempting to subscribe to a not-yet-created topic, and missed initial messages when subscribing after the first messages were published.
  • The service then sends an event stream (heartbeats, monitors, logs, answer etc) down the answer topic.
  • The service does not clear up after itself (intentionally, since it cannot know whether messages have been successfully pushed to their ultimate recipients yet, and even if it did, failed processes could leave orphaned answer topics). So a Cloud Function must be triggered periodically to clear up answer topics (requiring both a Cloud Function and a periodic scheduler for it to be defined in infrastructure).
  • Recently, the Cloud Function for clearup has failed on a project (this led to outage when the number of topics exceeded GCP's limit of 10,000)
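
The two aspects of the race condition described above can be illustrated with a toy in-memory broker. This is a minimal sketch, not the Pub/Sub API or the Octue SDK: `InMemoryBroker`, `TopicNotFound` and the answer topic name are all hypothetical, and the only behaviour assumed is that a message reaches just those subscriptions that exist when it is published.

```python
class TopicNotFound(Exception):
    """Raised when subscribing to a topic that has not been created yet."""


class InMemoryBroker:
    """Toy stand-in for a message broker (hypothetical, not the Pub/Sub API)."""

    def __init__(self):
        self.subscriptions = {}  # topic name -> list of subscriber inboxes

    def create_topic(self, name):
        self.subscriptions.setdefault(name, [])

    def subscribe(self, name):
        if name not in self.subscriptions:
            raise TopicNotFound(name)
        inbox = []
        self.subscriptions[name].append(inbox)
        return inbox

    def publish(self, name, message):
        # Only subscriptions that already exist receive the message.
        for inbox in self.subscriptions[name]:
            inbox.append(message)


broker = InMemoryBroker()
answer_topic = "answers.question-1234"  # hypothetical naming scheme

# Aspect 1: the listener subscribes before the answer topic has been created.
try:
    broker.subscribe(answer_topic)
    subscribe_failed = False
except TopicNotFound:
    subscribe_failed = True

# Aspect 2: the topic is created and used before the listener subscribes.
broker.create_topic(answer_topic)
broker.publish(answer_topic, "heartbeat")
late_inbox = broker.subscribe(answer_topic)
broker.publish(answer_topic, "result")

print(subscribe_failed)  # True
print(late_inbox)        # ['result'] (the heartbeat was missed)
```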

Proposed Solution

The reason we introduced this crazy setup was twofold:

  1. We believed it would become necessary to scope subscriptions to a single answer only (ie grant tight permissions that would allow untrusted parties to subscribe directly to one answer topic but no other)
  2. There was no filtering functionality available on pub/sub so it wasn't possible to communicate in a two-way fashion between a parent and child.

Number 1 has been superseded: pub/sub has proved very suitable for intra-infrastructure communication, but we have never once accessed it from outside our infrastructure (except when granting service account keys to locally running services, in which case the local services are essentially treated as part of internal infrastructure). We manage permissions and access control at a layer outside the pub/sub message queues.

Number 2 has been superseded by the availability of message filtering on subscriptions.

So we are now able to:

  • Alter service revision subscriptions, to filter messages and receive questions only
    • (Also update GitHub Actions to match this new pattern when subscribing service revisions) - beta version
    • (Also update GitHub Actions to match this new pattern when subscribing service revisions)
  • Remove answer topic creation code and respond with answers on the same topic
  • Subscribe to non-question messages in order to:
    • receive just answers
    • or all messages, to receive both questions and answers
    • (Update django-twined to subscribe appropriately to receive the complete service event stream)
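
As a rough sketch of the subscription pattern listed above, the snippet below builds the request payloads for three subscriptions on one shared topic: questions only, non-question events only, and everything. The attribute name `message_type`, the value `"question"`, and the project, topic and subscription names are assumptions made for illustration; the issue does not specify them. The filter strings use the Pub/Sub attribute filter syntax, and each dictionary has the shape accepted by `SubscriberClient.create_subscription(request=...)` in `google-cloud-pubsub`.

```python
# Hypothetical attribute name and value; the real ones are not given in the issue.
QUESTION_FILTER = 'attributes.message_type = "question"'
NON_QUESTION_FILTER = 'attributes.message_type != "question"'


def subscription_request(project, topic, subscription, message_filter=None):
    """Build a create-subscription request for a single shared topic."""
    request = {
        "name": f"projects/{project}/subscriptions/{subscription}",
        "topic": f"projects/{project}/topics/{topic}",
    }
    if message_filter:
        # Omitting the filter means the subscription receives every message.
        request["filter"] = message_filter
    return request


# A service revision receives questions only.
service_request = subscription_request(
    "my-project", "my-service.1-0-0", "my-service.1-0-0.questions", QUESTION_FILTER
)

# A parent receives the event stream (heartbeats, monitors, logs, answer).
parent_request = subscription_request(
    "my-project", "my-service.1-0-0", "my-service.1-0-0.answers", NON_QUESTION_FILTER
)

# An auditing consumer such as django-twined receives everything.
audit_request = subscription_request(
    "my-project", "my-service.1-0-0", "my-service.1-0-0.all-events"
)

print(service_request["filter"])
```

No topic is created at question time in this arrangement, so the asking service needs no topic create permission. Pub/Sub does not appear to allow changing the filter on an existing subscription, so altering service revision subscriptions would likely mean deleting and recreating them.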
@thclark thclark added backend Related to the back end devops System admin and devops performance If you want to run cool, you've got to run on heavy, heavy fuel tech-debt Technical debt (tidy up, refactoring, restructuring, caused by laziness now) strategy Long term technology strategy framework Octue app or twined framework and communications system user experience (UX) Key UX issues labels Jul 19, 2023
@thclark thclark changed the title Improve speed by eliminating topic creation Improve speed, robustness and auditability by eliminating topic creation Jul 19, 2023
@thclark thclark changed the title Improve speed, robustness and auditability by eliminating topic creation Improve speed, robustness and auditability by eliminating answer topic creation Jul 19, 2023
@cortadocodes
Member

Just to note this will be a breaking change in inter-service communications so we'll have to update the entire network of live services at once.

This was linked to pull requests Jul 25, 2023
@cortadocodes
Member

@thclark what do you think about moving the message type from the event into the attributes? It would let me clean up some of the message types' required key names

@thclark
Contributor Author

thclark commented Nov 23, 2023

Let's talk about what else is in the attributes - I'm feeling like the type of message should belong with it but not sure (and I don't see how it'd allow cleanup so would be good to see)
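
For discussion, here is one possible reading of the suggestion to move the message type into the attributes. It is only a sketch: the key names `kind`, `message_type` and `output_values` are hypothetical and are not taken from the Octue message schemas. One relevant fact is that Pub/Sub subscription filters match on message attributes, not on the message data, so a type carried only in the body cannot be filtered on.

```python
import json


def encode_with_type_in_body(kind, payload):
    """Message type travels inside the JSON body (hypothetical 'kind' key)."""
    body = {"kind": kind, **payload}
    return {"data": json.dumps(body).encode(), "attributes": {}}


def encode_with_type_in_attributes(kind, payload):
    """Message type travels as an attribute (hypothetical 'message_type' key)."""
    return {
        "data": json.dumps(payload).encode(),
        "attributes": {"message_type": kind},
    }


payload = {"output_values": [1, 2, 3]}
before = encode_with_type_in_body("result", payload)
after = encode_with_type_in_attributes("result", payload)

# With the type in the attributes, a consumer can route the message
# without decoding the body, and the body holds only the payload keys.
print(after["attributes"]["message_type"])
print(json.loads(after["data"]))
```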
