Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WH] Avoid adding UserPersonalData to the event bus #66

Closed
3 tasks
robrap opened this issue Jun 15, 2022 · 12 comments
Closed
3 tasks

[WH] Avoid adding UserPersonalData to the event bus #66

robrap opened this issue Jun 15, 2022 · 12 comments
Assignees
Labels
event-bus Work related to the Event Bus.

Comments

@robrap
Copy link
Contributor

robrap commented Jun 15, 2022

Some of the events specified in openedx-events contain PII. One clear example is the UserPersonalData class.

AC:

  • Find a way to go from an OpenEdxPublicSignal (which includes a UserPersonalData) to be serialized for the event bus without this data, and deserialized in a way that it can be sent using the same signal, but without the original personal data.
  • Include tests
  • Add documentation for consuming these events
@robrap robrap added the event-bus Work related to the Event Bus. label Jun 15, 2022
@robrap
Copy link
Contributor Author

robrap commented Jun 16, 2022

Leaving this ungroomed until we get more feedback on the possibility of dropping the personal data from the events.

@whuang1202
Copy link
Contributor

The event bus team is looking for further input on how to proceed with signals such as enrollments which contain the user's sensitive data.

In our current version of the event bus, we pass in signals which can sometimes contain user information. Most signals have this information in a UserData field, which in turn contains a pii field of type UserPersonalData. We are able to completely drop UserPersonalData fields off an eventmwhen serializing and deserializing the signal with the event bus. However, whether that is the actual direction we want to proceed in is still uncertain as silently removing data that the receiver is expecting may cause more bugs in the current system.

For the time being we can just send an error message when handling signals with UserPersonalData since this is sensitive data that should not be serialized, but we are hoping to generate a discussion on how to proceed with this field for the future, since we intend to remove all personal data that is passed.

A list of some possible solutions is included below:

Drop data before the serialize/deserialize process and fill it in with empty data, this fix will essentially “wipe” all traces of the user’s pii when the signals are passed to the event bus silently.

Error when we are expecting the UserPersonalData field, this solution will basically make it so the serializer and deserializer do not work with the userpersonaldata field by throwing an error, and as such, mark these signals as not for use with the event bus if it contains user pii. This is a temporary solution and as more signals are added in, we would have to continuously update which signals contain UserPersonalSata and mark them.

Make UserPersonalData an optional field across all current signals, this solution will allow the signals to continue being serialized when userpersonaldata is not passed in, although it may be tricky to determine serialization for signals that have this field optional yet still contain the user’s personal data. For future signals we would need a way to make it so UserPersonalData is optional or document it for cases where it is not optional

@ormsbee
Copy link

ormsbee commented Jun 29, 2022

Is the idea that PII should not go over the event bus at all, or that we want to be very explicit when your service needs PII and opt-out by default?

@ormsbee
Copy link

ormsbee commented Jun 29, 2022

If the desire is to be explicit, we could define two different event signals for the same logical thing (one with PII and one without), give them both proper names with some convention to indicate PII, and emit both. No deserialization surprises, and the number of events with PII shouldn't be that high (right?).

@whuang1202
Copy link
Contributor

I do believe that the intention is to not have PII go to the event bus at all, PII is something that I don't believe any signal explicitly "needs" to perform their specific function. Although this solution of defining two signals, an optional PII and a required PII, doesn't seem to be a bad idea in terms of making sure there aren't any deserialization surprises.

@feanil
Copy link
Contributor

feanil commented Jun 29, 2022

I like the explicitness of PII messages and non-PII messages, then people can choose to hook up the ones they want and there's no surprises about data they thought was there but was actually missing because your particular flavor doesn't consume PII. If the object doesn't have PII in the first place, and that's clear, then you don't have to solve for "how will people deal with the PII being missing some of the time when they try to access it?"

This raises another question for me though. Does this imply that we will need configuration to choose which events turn into messages on the bus and which don't?

@ormsbee
Copy link

ormsbee commented Jun 29, 2022

This raises another question for me though. Does this imply that we will need configuration to choose which events turn into messages on the bus and which don't?

This might be straightforward if it's a binary config setting as to whether PII messages get sent over the bus or not. I hope we can avoid anything more granular than that.

I do want to understand the reasoning behind this a bit better though. Is there a particular regulatory compliance issue with sending PII in this manner that this is meant to address? A specific voiced concern from the security working group or some other body? Or is this because we just believe that it's always prudent to minimize exposure on this front...?

@robrap
Copy link
Contributor Author

robrap commented Jun 29, 2022

  1. If PII data cannot be put on the event bus, I agree that I like the explicitness of the PII and non-PII message proposal.
  2. Somewhere along the way, I thought we either needed to keep PII data off the event bus, or have some complicated solution to enable user retirement. Maybe I was coupling this with an (incorrect) assumption around infinite data retention? Maybe a short enough data retention policy (2 weeks?) would be a good enough solution? This might require sign-off by 2U legal. I don't know if other community members would have concerns. Also, do we need to be concerned with other services potentially consuming this data and updating user details after a user retirement request?
  3. In the long term, I think people would find it useful to have a topic for user changes available on the event bus to enable redundant user data in services. However, some people have argued that the identity service (LMS) should be the only service to store and provide this data, and thus it would never go to the event bus. As noted above, if we did create this topic and it used infinite retention, there would be additional user retirement concerns. I ultimately would like to discuss and resolve this, but I think this is a separate issue.
  4. At this time, it is the LMS that is coupling PII user data to events because it also happens to be the identity provider. Is this wise for the future of the platform? Other services would not be allowed to do the same thing, or they might be supplying stale user data. I'm not clear if we are committed to forever coupling the LMS and the identity service, or if we want to avoid tech debt that would make this split more complex.

From @feanil:

This raises another question for me though. Does this imply that we will need configuration to choose which events turn into messages on the bus and which don't?

Yes. I think events need to be explicitly configured (potentially using a decorator?) to be turned into a message on the bus. Ideally, this information would make it out to the documentation for the events.

@robrap
Copy link
Contributor Author

robrap commented Jun 29, 2022

I do want to understand the reasoning behind this a bit better though. Is there a particular regulatory compliance issue with sending PII in this manner that this is meant to address? A specific voiced concern from the security working group or some other body? Or is this because we just believe that it's always prudent to minimize exposure on this front...?

  1. My comment above describes some about where this came from.
  2. Additionally, at one point I thought I read or heard that UserPersonalData was specifically designed as a separate class so it could be included/excluded as needed. This was also related to my potentially incorrect assumptions.

Given that, I still have the understanding that user retirement continues to be a requirement and that permanent storage of immutable PII data would not be ok.

@ormsbee
Copy link

ormsbee commented Jun 30, 2022

Maybe I was coupling this with an (incorrect) assumption around infinite data retention? Maybe a short enough data retention policy (2 weeks?) would be a good enough solution? This might require sign-off by 2U legal. I don't know if other community members would have concerns.

I don't think infinite data retention should be one of our goals. It would be a goal if we were using this as the record of truth and deriving all state from it, but the event bus as envisioned is more of a notification and update propagation mechanism, with canonical state still stored in Django models. Kafka's design will keep messages around for quite some time, but I think the services should treat that state as ephemeral in principle, and only lean on the history for things like recovering from service downtime. If a service needs to have the full history of all the events of a certain type, it can copy it to some other place.

Also, do we need to be concerned with other services potentially consuming this data and updating user details after a user retirement request?

That's a possibility, but I think that no matter what, a service consumes PII data must be required to process retirement events as well, so that it can dispose of any PII it may have copied over.

In the long term, I think people would find it useful to have a topic for user changes available on the event bus to enable redundant user data in services. However, some people have argued that the identity service (LMS) should be the only service to store and provide this data, and thus it would never go to the event bus.

I'm actually kind of surprised by the stance that only the LMS would hold this data. Don't we already have copying of basic user information (like names) to various services so they can do things like generate report files with names in them or display paginated lists of users without making a crazy number of calls to the LMS?

At this time, it is the LMS that is coupling PII user data to events because it also happens to be the identity provider. Is this wise for the future of the platform?

Fair point. The current event is STUDENT_REGISTRATION_COMPLETED, which we've characterized as a learning event (org.openedx.learning.student.registration.completed.v1). We don't even really have a generic "user information has changed" event–it's entirely possible for the user to change their email address and full name after registration. Based on the name, it's not even clear if the existing event covers staff users.

I think that it's actually fine to have STUDENT_REGISTRATION_COMPLETED as it is now, but there should probably be a different event for basic user data changes that lives in a different subdomain from learning. The LMS might still be the one emitting that new user-update event for now, but the identity provider could emit it if and when it's extracted from edx-platform.

Other services would not be allowed to do the same thing, or they might be supplying stale user data.

I think that it's actually okay for other services to emit the same data structure, as long as it's in a different event. For instance, say there's an event when someone posts on the forum. I think it's reasonable for the FORUM_POST_CREATED signal to have a user attribute that is a UserData. As a receiver, I know that this is not where I should go for getting canonical user data updates, but I often don't care that deeply–I just want that bit of convenience. For instance, if I had a feature that was "send a welcome email for their first post", I would expect to be able to listen for FORUM_POST_CREATED and write my message based on that.

(There's probably another version of this where we lean into edx-ace and keep the periphery more ignorant of user details, but as far as I know, nobody's really pushing that work forward at the moment.)

I'm not clear if we are committed to forever coupling the LMS and the identity service, or if we want to avoid tech debt that would make this split more complex.

I think that as long as we keep the identity service oriented events in a separate package (a peer to learning and enterprise), it will be decoupled enough.

@robrap
Copy link
Contributor Author

robrap commented Jun 30, 2022

@ormsbee: Thanks again for your thoughts.

I don't think infinite data retention should be one of our goals.

Agreed. I was just trying to understand why this was even an issue in my mind. I had a partial-decision in my head from past discussions, with no ADR to explain how we got there, and with the partial-decision no longer making sense to me. So, we'll be looking to see if short data retention is a secure enough solution.

I'm actually kind of surprised by the stance that only the LMS would hold this data. Don't we already have copying of basic user information (like names) to various services...

Yes, we do this now. However, there was a push to minimize/stop this in order to minimize the spread of PII and make the overall plaform more secure. There is always a trade-off between convenience and security. The convenient thing to do here is just to copy the PII across and let people copy only what they must into the service. This provides some additional PII exposure inside Kafka, but not for all users across all time. I can side with convenience, but would like a second opinion from those who lean towards security. For example, I think one team was asked to jump through a lot of hoops to avoid copying this data, for better or for worse. I'd need to dig in to this to provide more details.

Other services would not be allowed to do the same thing, or they might be supplying stale user data.

I think that it's actually okay for other services to emit the same data structure, as long as it's in a different event.
...
I think that as long as we keep the identity service oriented events in a separate package (a peer to learning and enterprise), it will be decoupled enough.

I can accept this. I do think it will take proper education for people to understand that this potentially-stale data should be used with caution.

@whuang1202
Copy link
Contributor

whuang1202 commented Jul 5, 2022

If anyone is interested in reviewing the ADR summarizing this discussion, I've linked it in the pull request here: #73, we'll close this issue once the PR lands.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
event-bus Work related to the Event Bus.
Projects
None yet
Development

No branches or pull requests

4 participants