Permanently delete all data from SponsorLink's database that has been collected during builds that included Moq (notably any version 4.20.*) #1395

Closed
stakx opened this issue Aug 13, 2023 · 18 comments · Fixed by devlooped/SponsorLink#49

@stakx
Contributor

stakx commented Aug 13, 2023

In the interest of all those devs who had their personally identifiable information (PII) processed and sent off to SponsorLink's data storage (regardless of whether it was hashed/anonymized or not) because they ran a build including Moq, I would like to request that all of this data be permanently purged from SponsorLink's and all other storage systems, without making any further use of it. Those users had no way to opt in to (or out of) that process, so the Moq and SponsorLink projects should err on the side of caution and assume that the affected users did not give their consent. Also, IANAL, but the data exfiltration may well have been in violation of certain data privacy laws (GDPR being the most prominent), so again, Moq and SponsorLink should play it safe and assume that they did in fact violate the law. The damage cannot be fully undone, but the deletion of all collected data would be a demonstration of goodwill, showing that the two projects want to conform to the law and respect users' wishes regarding the processing of their PII. Let's try to regain at least some of the trust that has been lost.

(Btw. if you've already deleted the data in question, and I've simply missed your notification about it buried deep within the other SponsorLink issues, then my apologies... in that case, could you please just restate that the data has been deleted? Thanks!)

@wrexbe

wrexbe commented Aug 14, 2023

GDPR aside, I don't think it's even legal in Argentina.
https://caseguard.com/articles/act-no-25326-of-2000-landmark-personal-privacy-law/

@Gavin-Williams

Gavin-Williams commented Aug 15, 2023

You wouldn't have to store the email addresses, would you? You could just store the hash values and compare the hashes to authenticate. It's not perfect, but it's probably good enough. A hash is not identifiable information.

Regarding GDPR: it considers data anonymized if there is no “reasonably likely” means to re-identify the data subject. So a list of hashes generally isn't going to be used to identify persons. It would literally be easier to scrape GitHub and other sites for public information on people.

What exactly is the concern here with your email being hashed? If devs on GitHub stop hashing my email, am I going to be protected from the thousands of spam and scam emails I get every year? Is my credit card going to be protected from hackers? Many companies and developers, at least 500, already have my email. Is this the straw that breaks the camel's back? I already disregard email as a safe means of communication. I certainly don't trust anything I receive via email right off the bat.

@stakx
Contributor Author

stakx commented Aug 15, 2023

@Gavin-Williams, SponsorLink already doesn't store email addresses, but hashes derived from them. The problem here is that hashing doesn't sufficiently anonymize the email addresses, because it turns out that the hashing can be reversed in this case: email addresses are often publicly known, so if you have a list of email addresses, you can just hash them the same way SponsorLink hashes them, and build a lookup table; if you then get hold of the SponsorLink database and the hashes stored therein, you can map them back to email addresses. See devlooped/SponsorLink#31.
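
A minimal sketch of the lookup-table approach described above. The actual hashing scheme SponsorLink used is not specified in this thread, so SHA-256 over a lowercased address is purely an assumption here, and `hash_email`, `build_lookup_table`, and `reidentify` are hypothetical names operating on made-up example addresses.

```python
import hashlib


def hash_email(email: str) -> str:
    """Hypothetical stand-in for the hashing step (SHA-256 is an assumption)."""
    normalized = email.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def build_lookup_table(known_emails):
    """Map each hash back to the publicly known address that produced it."""
    return {hash_email(e): e for e in known_emails}


def reidentify(stored_hashes, lookup):
    """Return only those stored hashes that map back to a known address."""
    return {h: lookup[h] for h in stored_hashes if h in lookup}


# Addresses someone could collect from public sources (made-up examples).
public_emails = ["alice@example.com", "bob@example.com"]

# Hashes as they might appear in a stored list; carol is not publicly known.
stored = [hash_email("alice@example.com"), hash_email("carol@example.com")]

recovered = reidentify(stored, build_lookup_table(public_emails))
print(recovered)  # only alice's hash maps back to an address
```

The point of the sketch is that no weakness in the hash function is needed: the input space (known email addresses) is small enough to enumerate.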

@wrexbe

wrexbe commented Aug 15, 2023

That won't protect them. It would take 2 seconds to get a bunch of GitHub commit emails and compare hashes.

what stakx said

@ewrogers

> @Gavin-Williams, SponsorLink already doesn't store email addresses, but hashes derived from them. The problem here is that hashing doesn't sufficiently anonymize the email addresses, because it turns out that the hashing can be reversed in this case: email addresses are often publicly known, so if you have a list of email addresses, you can just hash them the same way SponsorLink hashes them, and build a lookup table; if you then get hold of the SponsorLink database and the hashes stored therein, you can map them back to email addresses. See devlooped/SponsorLink#31.

Ironically, making it open-source means you know the hashing algorithm. Or you don't and the irony is that a non-OSS project is used for sponsoring OSS projects. Catch-22.

@Fuchs

Fuchs commented Aug 15, 2023

> > @Gavin-Williams, SponsorLink already doesn't store email addresses, but hashes derived from them. The problem here is that hashing doesn't sufficiently anonymize the email addresses, because it turns out that the hashing can be reversed in this case: email addresses are often publicly known, so if you have a list of email addresses, you can just hash them the same way SponsorLink hashes them, and build a lookup table; if you then get hold of the SponsorLink database and the hashes stored therein, you can map them back to email addresses. See devlooped/SponsorLink#31.
>
> Ironically, making it open-source means you know the hashing algorithm. Or you don't and the irony is that a non-OSS project is used for sponsoring OSS projects. Catch-22.

If you choose a sufficient algorithm, the knowledge of which one you chose doesn't matter.
Algorithms for way more sensitive encryption are completely open and that does not make them any less secure.
In this case the algorithm chosen was neither technically nor legally (speaking GDPR) sufficient, thus legally, the data needs to be deleted unless one wants to be liable. (Which one might already be, given how this was handled, which is by no means even remotely GDPR compliant)
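
One way to illustrate the point that a public algorithm need not be a weakness is a keyed hash, where the secret is the key rather than the algorithm. This is only a sketch under assumptions: nothing in this thread says SponsorLink used or considered HMAC, and `plain_hash`, `keyed_hash`, `server_key`, and `attacker_key` are hypothetical names.

```python
import hashlib
import hmac
import secrets


def plain_hash(email: str) -> str:
    """Unkeyed hash: anyone who knows the algorithm can recompute it."""
    return hashlib.sha256(email.encode("utf-8")).hexdigest()


def keyed_hash(email: str, key: bytes) -> str:
    """HMAC-SHA256: recomputing it requires the secret key, not just the algorithm."""
    return hmac.new(key, email.encode("utf-8"), hashlib.sha256).hexdigest()


server_key = secrets.token_bytes(32)    # hypothetical secret held by the service
attacker_key = secrets.token_bytes(32)  # an outsider can only guess a key

email = "alice@example.com"

# With a plain hash, an outsider's recomputation matches the stored value.
plain_matches = plain_hash(email) == plain_hash(email)

# With a keyed hash, the outsider's value does not match without the real key.
keyed_matches = hmac.compare_digest(
    keyed_hash(email, server_key), keyed_hash(email, attacker_key)
)
print(plain_matches, keyed_matches)
```

This only helps if the key never leaves the service. A tool that computes the hash on the developer's machine would have to ship the key with it, and the resulting value is still a stable pseudonym tied to one person.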

@wrexbe

wrexbe commented Aug 15, 2023

> > > @Gavin-Williams, SponsorLink already doesn't store email addresses, but hashes derived from them. The problem here is that hashing doesn't sufficiently anonymize the email addresses, because it turns out that the hashing can be reversed in this case: email addresses are often publicly known, so if you have a list of email addresses, you can just hash them the same way SponsorLink hashes them, and build a lookup table; if you then get hold of the SponsorLink database and the hashes stored therein, you can map them back to email addresses. See devlooped/SponsorLink#31.
> >
> > Ironically, making it open-source means you know the hashing algorithm. Or you don't and the irony is that a non-OSS project is used for sponsoring OSS projects. Catch-22.
>
> If you choose a sufficient algorithm, the knowledge of which one you chose doesn't matter. Algorithms for way more sensitive encryption are completely open and that does not make them any less secure. In this case the algorithm chosen was neither technically nor legally (speaking GDPR) sufficient, thus legally, the data needs to be deleted unless one wants to be liable. (Which one might already be, given how this was handled, which is by no means even remotely GDPR compliant)

Hashing is not magic; it's just a one-way function.
Say "hello@gmail.com" becomes "1234" when you hash it.

How do I figure out who owns "1234", you may ask?
Well, when you make a commit on GitHub, it includes your email. So you can go through the commit history to make a list of emails, and then hash them to see whether any of them matches "1234".

Aha, but I can add a machine-specific thing to the hash, to add an unknown element, so it can't be reversed, you might say.

That also won't work, because you see, when you make a commit, it has a time stamp. You can use the timestamp to figure out who owns it, because building generally happens right after, or before the commit. You can also find projects that specifically use SponsorLink (directly, or indirectly), so you can narrow down the targets a lot
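
A rough sketch of the timestamp correlation described above, using entirely made-up data. It assumes an observer has a list of (opaque ID, build time) records and a list of public (author email, commit time) pairs; `correlate`, the ten-minute window, and all values are hypothetical.

```python
from datetime import datetime, timedelta


def correlate(build_events, commits, window=timedelta(minutes=10)):
    """For each opaque ID, collect the authors who committed within `window` of a build."""
    candidates = {}
    for opaque_id, built_at in build_events:
        for author, committed_at in commits:
            if abs(built_at - committed_at) <= window:
                candidates.setdefault(opaque_id, set()).add(author)
    return candidates


# Build records keyed by an ID that cannot be reversed directly (made up).
build_events = [
    ("1234", datetime(2023, 8, 14, 9, 2)),
    ("1234", datetime(2023, 8, 14, 15, 31)),
    ("9f3a", datetime(2023, 8, 14, 11, 0)),
]

# Public commit metadata: author email and commit timestamp (made up).
commits = [
    ("alice@example.com", datetime(2023, 8, 14, 9, 0)),
    ("alice@example.com", datetime(2023, 8, 14, 15, 30)),
    ("bob@example.com", datetime(2023, 8, 14, 11, 3)),
]

candidates = correlate(build_events, commits)
print(candidates)
```

With real data the candidate sets would be noisier, but repeated builds by the same ID narrow them down over time, which is the concern raised in the comment.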

@stakx
Contributor Author

stakx commented Aug 16, 2023

Thank you @kzu for following up on this! I'm relieved that this has been taken care of.

@Gavin-Williams

Gavin-Williams commented Aug 16, 2023

So, let's say I'm a sponsor of Moq and someone knows the hashing algorithm. They have my email and they hash it. Then they get their hands on the Moq sponsors hash list. They work out that I'm a sponsor of Moq this way, because they see that the hash of my email is on the Moq sponsors list. They could just look at my profile page to see I'm a sponsor of Moq. For all that effort they have gained absolutely nothing. I still don't see the point of the outrage.

Is this just a European thing? Or a pickle for all anonymity activists?

I should note that I'm opposed to anonymity and push for compulsory identification of communications online. Which might explain why I just can't understand the outrage. There must be some specific issue I can't see though, to explain all the twisted knickers.

@stakx
Contributor Author

stakx commented Aug 16, 2023

@Gavin-Williams, just FYI, your hypothetical scenario is no longer relevant. It's my understanding that the "hash list" has just been deleted (see devlooped/SponsorLink#49, referenced above). kzu also stated a while ago that email hashing is gone and won't happen again (https://github.com/moq/moq/issues/1374#issuecomment-1671240325), and that upcoming versions of SponsorLink won't process any PII (https://github.com/moq/moq/issues/1374#issuecomment-1671866096).

(And yes, data privacy law and "anonymity by default" are a pretty hot topic in Europe. I'm not knee-deep into what exactly is going on everywhere, but it's my impression that while GDPR and related legislation aren't brand new anymore, the dust hasn't settled yet and people are still figuring out how to fully conform to them. So when faced with uncertainty, it's in one's own best interest to err on the side of caution, since the penalties for breaking those laws can be quite substantial.)

@apacurariu

apacurariu commented Aug 17, 2023

@stakx Also noteworthy: while I don't assume that the data wasn't deleted here, it is actually impossible to prove that data was deleted after it has been collected (e.g. that it wasn't copied or distributed first). Again, I'm not saying this is the case here, but just to put it in context: there is nothing anyone can do to validate that the data was actually deleted, other than trusting the author. So while we're trusting him about deleting the data and we're not talking about intentional malice, for future actions it's important to note that their effects might never be truly undone, or at least that it's not possible to prove that they were undone and to what extent.

(And, per my understanding of GDPR, any personally identifiable information falls under its jurisdiction. Since a hash is a deterministic function, being provided with the same email address yields the same output. Therefore, the hashed output is inextricably tied to a person and it is therefore itself personally identifiable information and subject to GDPR. Having said that, the email itself, while PII under GDPR, is not in itself sensitive enough to be likely to lead to actual real-world harm, in my opinion (although I could think of some fringe scenarios), so it's more a matter of principle and law than actual risk in this particular case.)
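
To make the determinism argument concrete, here is a small sketch showing that records stored under a hash can still be grouped per person even if nobody ever reverses the hash. SHA-256 is an assumption, and `pseudonym`, the event list, and the project names are hypothetical.

```python
import hashlib
from collections import defaultdict


def pseudonym(email: str) -> str:
    """Deterministic: the same address always yields the same identifier."""
    return hashlib.sha256(email.strip().lower().encode("utf-8")).hexdigest()


# Made-up events: (email at collection time, project that was built).
events = [
    ("alice@example.com", "project-a"),
    ("bob@example.com", "project-b"),
    ("alice@example.com", "project-c"),
]

# Only the hash is stored, never the address itself.
stored = [(pseudonym(email), project) for email, project in events]

# Grouping by hash still yields one activity profile per person.
profiles = defaultdict(list)
for hashed, project in stored:
    profiles[hashed].append(project)

print(len(profiles))  # number of distinct people in the stored records
```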

@stakx
Contributor Author

stakx commented Aug 17, 2023

@apacurariu, not being able to personally verify whether a service provider really deleted your data like they told you isn't new... so I am not especially worried about that theoretical possibility only now, in this particular instance.

On the contrary, kzu has been very transparent about the deletion process and documented it in detail. Why would he have fabricated all of that? (People don't seem to notice, but IIRC he has actually followed up on all demands except for a single one.) I for one see no reason to doubt his honesty here, and I probably couldn't have asked for a better outcome for this request.

@apacurariu

apacurariu commented Aug 17, 2023

@stakx I didn't imply that he didn't delete the data. On the contrary, I said I trust that he did but this is only based on trust.

However, while I stand behind the theoretical and practical impossibility of demonstrating data deletion, not just in one's inability to personally verify this, I don't want to deviate the thread further especially since this is a purely theoretical consideration at this point.

@stakx
Contributor Author

stakx commented Aug 17, 2023

@apacurariu, I understand, and your point is well taken. Also, don't worry too much about deviating the thread, the issue is resolved and closed anyway, and I'm probably going to step away from this whole issue tracker for a while. The whole situation is rather frustrating and I really need to take a break.

@apacurariu

@stakx I admit, I also need to take a step back from this. I'll try to find something constructive to add or contribute.

@AraHaan

AraHaan commented Aug 19, 2023

> > @Gavin-Williams, SponsorLink already doesn't store email addresses, but hashes derived from them. The problem here is that hashing doesn't sufficiently anonymize the email addresses, because it turns out that the hashing can be reversed in this case: email addresses are often publicly known, so if you have a list of email addresses, you can just hash them the same way SponsorLink hashes them, and build a lookup table; if you then get hold of the SponsorLink database and the hashes stored therein, you can map them back to email addresses. See devlooped/SponsorLink#31.
>
> Ironically, making it open-source means you know the hashing algorithm. Or you don't and the irony is that a non-OSS project is used for sponsoring OSS projects. Catch-22.

Not to mention many hash algorithms can be defeated with rainbow tables.

@Gavin-Williams

Using closed source with open source is not ironic. There is a group of people who think open source is an ideology that must be everywhere and can't be mixed. But many people see open source as simply a tool, particularly for software that is under-developed and probably under-resourced, so that its behavior can be understood and fixes and features can be provided by users. Mixing open source and closed source isn't an issue at all.

@AraHaan

AraHaan commented Aug 19, 2023

Sometimes there is just no option not to mix them, either. Sometimes a single person is working on the open source part and might need help, so they tell a company that wants to use the open source code in its projects and needs additional functionality that it should implement that functionality itself as extensions or plugins. After all, even single developers have limits to what they can do.

Example:

  • Open source developer A creates a system that can be used in all types of applications.
  • Company A then asks developer A to add some specific APIs for things like DirectX, GUI stuff, etc.
  • Developer A can't, because they might not know 3D programming or GUI programming at all, so they ask the company to do it for them as an extension to said system.

@devlooped devlooped locked as resolved and limited conversation to collaborators Jan 30, 2024

8 participants