Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Abstraction] Decide how to handle key deserialization when key schema might vary #86

Open
timmc-edx opened this issue Aug 1, 2022 · 6 comments
Labels
event-bus Work related to the Event Bus.

Comments

@timmc-edx
Copy link
Contributor

In the current event bus code, it is possible for signals to be broadcast as events using varying key IDs on the same topic. For example, the CERTIFICATE_CREATED signal could be emitted as events with a key taken from certificate.user.id, or a key taken from certificate.course.course_key. This presents a problem for deserialization: One of those is a number, and the other is a string; deserialization of the key will fail if the key schema is not knowable from the event.

I can see several possible solutions:

  • By documented convention, never use the same topic to send events for the same signal but with differing key-fields
    • Producer and consumer code would continue to explicitly specify "produce on/listen to topic X for signal Y with key-field Z"
    • Kind of fragile and redundant
  • Have the producer include the key-field in the event headers ("key_field": "certificate.user.id") so that the consumer can tell how to deserialize the key.
    • Self-documenting in a way that should be pretty robust; would handle schema evolution very easily.
  • Specify the key-field on the OpenEdxPublicSignal itself, so that signals can always only be sent with the same key-field.
    • Less flexible, although there's a migration path for changing that.
    • Simpler.
    • Conflates signals and events maybe a little more than we want.
    • Not clear how we'd handle composite keys such as we'd likely need for certificates -- combine the user and course identifiers by using a list of key-fields? Have a blank key?
  • Obviate the whole issue by never deserializing keys, and just emitting them as bytestrings when using them in logging code and similar.
    • It's possible we don't really have a use for keys beyond debugging and other observability.
@timmc-edx timmc-edx added event-bus Work related to the Event Bus. backlog To be put on a team's backlog or wishlist labels Aug 1, 2022
@timmc-edx timmc-edx added this to the [Event Bus] Future milestone Aug 1, 2022
@ormsbee
Copy link

ormsbee commented Aug 3, 2022

Is the idea that the key field is just used to partition things within a given topic?

@timmc-edx
Copy link
Contributor Author

I think that's right? And for log compaction, I guess.

@ashultz0
Copy link

ashultz0 commented Aug 4, 2022

sorry for a bit of a driveby but I can't stop thinking about this one

using the cert event there is one model change: new certificate for Andy in Cheating 101

it feels definitely wrong for the event to sometimes by keyed under Andy and sometimes under Cheating101

it feels suboptimal but understandable for us to start out by keying it under Andy but decide later that these should all come out keyed under Cheating101

having it keyed by (Andy, Cheating101) implies that a watcher has some syntax to filter for (*, Cheating101). But that makes me think that actually what watcher would even want to get only some cert events? Or do they have a list of courses they are watching? Or... what? This feels like the tip of a pretty weird iceberg.

I think I'm coming down for the last option. And maybe not calling this "key" when it's actually "debug/log helper label". Then we could just make sure to log "Andy/Cheating101"

@timmc-edx
Copy link
Contributor Author

For cert events, I think we might just want to go with a constant/empty key -- no partitioning, no log compaction, just "these are things that are happening". If we want log compaction ("only keep the latest information for certificate 123") we start including a an explicit ID field for certificates. I assume some kind of ID like that exists in the DB; we would just need to start including it.

@ormsbee
Copy link

ormsbee commented Aug 5, 2022

@ashultz0: I think that the partitioning use case is less about filtering and more about load distribution and ordering guarantees. For example:

  • If the key is "Andy", then all the certificate related messages going to that topic for Andy will be sent to the same Consumer, so there's less of a concern around race conditions when "Andy gets this cert!" and "Andy cheated, delete this cert!" messages come close together and get processed in the wrong order by competing Consumers.
  • Keying by user for this also helps distribute the load better. Keying by the CourseKey might put half the load onto exactly one consumer because there's one mega-blockbuster course that's super-popular at the moment, while the rest of the consumers are underutilized.

@timmc-edx: Some random thoughts:

  1. I agree with @ashultz0 that we should very much discourage people from casually picking the key on the producer side. That just sounds like a recipe for weird behavior.
  2. While this is nominally a concern of the event bus and not the signal itself, all the major broker options have similar ideas around partitioning using a key. Given that, I think it makes sense to make that decision at the OpenEdxPublicSignal level (via a list of field names in the constructor), and remove the guesswork for other transport implementations.
  3. I don't think it needs to be deserialized at all. It can be serialized into a string for sending purposes, but all the data in it should be present in the OpenEdxPublicSignal data payload itself (since it should be derived from that data). It might make sense to leave it in some metadata header as a string for debugging purposes, but I don't think any application code should actually inspect the values used for partitioning otherwise.

@robrap robrap changed the title Decide how to handle key deserialization when key schema might vary [Defer] Decide how to handle key deserialization when key schema might vary Aug 9, 2022
@robrap robrap changed the title [Defer] Decide how to handle key deserialization when key schema might vary [Abstraction] Decide how to handle key deserialization when key schema might vary Nov 15, 2022
@robrap
Copy link
Contributor

robrap commented Nov 15, 2022

For now, I'm listing this ticket as part of "Abstraction" work, in case there are any changes we want to the API before other technologies are implemented.

@robrap robrap removed the backlog To be put on a team's backlog or wishlist label Apr 20, 2023
@robrap robrap removed this from the [Event Bus] Future milestone Jun 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
event-bus Work related to the Event Bus.
Projects
None yet
Development

No branches or pull requests

4 participants