Permanently delete all data from SponsorLink's database that has been collected during builds that included Moq (notably any version 4.20.*) #1395

stakx · 2023-08-13T14:29:35Z

In the interest of all those devs who had their personally identifiable information (PII) processed and sent off to SponsorLink's data storage (regardless of whether it was hashed/anonymized or not) because they ran a build including Moq, I would like to request that all of this data be permanently purged from SponsorLink's and all other storage systems, without any further use being made of it. Those users had no way to opt in to (or out of) that process, so the Moq and SponsorLink projects should err on the side of caution and assume that the affected users did not give their consent. Also, IANAL, but the data exfiltration may well have been in violation of certain data privacy laws (GDPR being the most prominent), so again, Moq and SponsorLink should play it safe and assume that they did in fact violate the law. The damage cannot be fully undone, but the deletion of all collected data would be a demonstration of goodwill, showing that the two projects want to conform to the law and respect users' wishes regarding the processing of their PII. Let's try to regain at least some of the trust that has been lost.

(Btw. if you've already deleted the data in question, and I've simply missed your notification about it buried deep within the other SponsorLink issues, then my apologies... in that case, could you please just restate that the data has been deleted? Thanks!)

wrexbe · 2023-08-14T13:19:53Z

GDPR aside, I don't think it's even legal in Argentina.
https://caseguard.com/articles/act-no-25326-of-2000-landmark-personal-privacy-law/

Gavin-Williams · 2023-08-15T13:19:15Z

You wouldn't have to store the email addresses, would you? You could just store the hash values and compare the hashes to authenticate. It's not perfect, but it's probably good enough. A hash is not identifiable information.

Regarding GDPR: it considers data anonymized if there is no “reasonably likely” means to re-identify the data subject. So a list of hashes isn't generally going to be used to identify persons. It would literally be easier to scrape GitHub and other sites for public information on people.

What exactly is the concern here with your email being hashed? If devs on GitHub stop hashing my email, am I going to be protected from the thousands of spam and scam emails I get every year? Is my credit card going to be protected from hackers? Many companies and developers, at least 500 or more, already have my email. Is this the straw that breaks the camel's back? I already disregard email as a safe means of communication. I certainly don't trust anything I receive via email off the bat.

stakx · 2023-08-15T14:43:48Z

@Gavin-Williams, SponsorLink already doesn't store email addresses, but hashes derived from them. The problem here is that hashing doesn't sufficiently anonymize the email addresses, because it turns out that the hashing can be reversed in this case: email addresses are often publicly known, so if you have a list of email addresses, you can just hash them the same way SponsorLink hashes them, and build a lookup table; if you then get hold of the SponsorLink database and the hashes stored therein, you can map them back to email addresses. See devlooped/SponsorLink#31.
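To make the lookup-table argument concrete, here is a minimal sketch. It assumes an unsalted SHA-256 over the lowercased address; the thread does not state which algorithm SponsorLink actually used, and the addresses and function names below are made up for illustration.

```python
import hashlib


def hash_email(email: str) -> str:
    """Unsalted hash of a normalized email (SHA-256 is an assumption)."""
    return hashlib.sha256(email.strip().lower().encode("utf-8")).hexdigest()


def build_lookup_table(known_emails):
    """Map hash -> email for every publicly known address."""
    return {hash_email(e): e for e in known_emails}


def reidentify(stored_hashes, lookup):
    """Return the stored hashes that can be mapped back to an email."""
    return {h: lookup[h] for h in stored_hashes if h in lookup}


# Hypothetical data: addresses scraped from public commits, and a
# hypothetical dump of hashes from a database.
public_emails = ["alice@example.com", "bob@example.com", "carol@example.com"]
stored_hashes = [hash_email("bob@example.com"), hash_email("dave@example.com")]

lookup = build_lookup_table(public_emails)
recovered = reidentify(stored_hashes, lookup)
print(sorted(recovered.values()))  # ['bob@example.com']
```

Only addresses that are already on the candidate list can be recovered (dave@example.com stays unknown here), but for developers whose commit emails are public that list is easy to build.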

wrexbe · 2023-08-15T15:36:38Z

That won't protect them; it would take 2 seconds to get a bunch of GitHub commit emails and compare hashes.

what stakx said

ewrogers · 2023-08-15T16:19:35Z

> @Gavin-Williams, SponsorLink already doesn't store email addresses, but hashes derived from them. The problem here is that hashing doesn't sufficiently anonymize the email addresses, because it turns out that the hashing can be reversed in this case: email addresses are often publicly known, so if you have a list of email addresses, you can just hash them the same way SponsorLink hashes them, and build a lookup table; if you then get hold of the SponsorLink database and the hashes stored therein, you can map them back to email addresses. See devlooped/SponsorLink#31.

Ironically, making it open-source means you know the hashing algorithm. Or you don't and the irony is that a non-OSS project is used for sponsoring OSS projects. Catch-22.

Fuchs · 2023-08-15T16:46:19Z

> @Gavin-Williams, SponsorLink already doesn't store email addresses, but hashes derived from them. The problem here is that hashing doesn't sufficiently anonymize the email addresses, because it turns out that the hashing can be reversed in this case: email addresses are often publicly known, so if you have a list of email addresses, you can just hash them the same way SponsorLink hashes them, and build a lookup table; if you then get hold of the SponsorLink database and the hashes stored therein, you can map them back to email addresses. See devlooped/SponsorLink#31.

> Ironically, making it open-source means you know the hashing algorithm. Or you don't and the irony is that a non-OSS project is used for sponsoring OSS projects. Catch-22.

If you choose a sufficient algorithm, the knowledge of which one you chose doesn't matter.
Algorithms for way more sensitive encryption are completely open and that does not make them any less secure.
In this case the algorithm chosen was neither technically nor legally (speaking GDPR) sufficient, thus legally, the data needs to be deleted unless one wants to be liable. (Which one might already be, given how this was handled, which is by no means even remotely GDPR compliant)

wrexbe · 2023-08-15T22:38:19Z

> @Gavin-Williams, SponsorLink already doesn't store email addresses, but hashes derived from them. The problem here is that hashing doesn't sufficiently anonymize the email addresses, because it turns out that the hashing can be reversed in this case: email addresses are often publicly known, so if you have a list of email addresses, you can just hash them the same way SponsorLink hashes them, and build a lookup table; if you then get hold of the SponsorLink database and the hashes stored therein, you can map them back to email addresses. See devlooped/SponsorLink#31.

> Ironically, making it open-source means you know the hashing algorithm. Or you don't and the irony is that a non-OSS project is used for sponsoring OSS projects. Catch-22.

> If you choose a sufficient algorithm, the knowledge of which one you chose doesn't matter. Algorithms for way more sensitive encryption are completely open and that does not make them any less secure. In this case the algorithm chosen was neither technically nor legally (speaking GDPR) sufficient, thus legally, the data needs to be deleted unless one wants to be liable. (Which one might already be, given how this was handled, which is by no means even remotely GDPR compliant)

Hashing is not magic; it's a one-way function.
Say "hello@gmail.com" becomes "1234" when you hash it.

How do I figure out who owns "1234", you may ask?
Well, when you make a commit on GitHub, it includes your email. So you can go through the commit history to make a list of emails, and then hash them to see if any of them matches "1234".

"Aha, but I can add a machine-specific thing to the hash, to add an unknown element to it, so it can't be reversed," you might say.

That also won't work, because when you make a commit, it has a timestamp. You can use the timestamp to figure out who owns the hash, because building generally happens right before or after the commit. You can also find projects that specifically use SponsorLink (directly or indirectly), so you can narrow down the targets a lot.

stakx · 2023-08-16T07:36:26Z

Thank you @kzu for following up on this! I'm relieved that this has been taken care of.

Gavin-Williams · 2023-08-16T08:55:02Z

So, let's say I'm a sponsor of Moq and someone knows the hashing algorithm. They have my email and they hash it. Then they get their hands on the Moq sponsors hash list. They work out that I'm a sponsor of Moq because they see that the hash of my email is on the Moq sponsors list. They could just look at my profile page to see I'm a sponsor of Moq. For all that effort they have gained absolutely nothing. I still don't see the point of the outrage.

Is this just a European thing? Or a pickle for all anonymity activists?

I should note that I'm opposed to anonymity and push for compulsory identification of communications online. Which might explain why I just can't understand the outrage. There must be some specific issue I can't see though, to explain all the twisted knickers.

stakx · 2023-08-16T09:35:19Z

@Gavin-Williams, just FYI, your hypothetical scenario is no longer relevant. It's my understanding that the "hash list" has just been deleted (see devlooped/SponsorLink#49, referenced above). kzu also stated a while ago that email hashing is gone and won't happen again (https://github.com/moq/moq/issues/1374#issuecomment-1671240325), and that upcoming versions of SponsorLink won't process any PII (https://github.com/moq/moq/issues/1374#issuecomment-1671866096).

(And yes, data privacy law and "anonymity by default" is a pretty hot topic in Europe. I'm not knee-deep into what exactly is going on everywhere, but it's my impression that while GDPR and related legislations aren't brand new anymore, the dust hasn't settled yet and people are still figuring out how to fully conform to it. So when faced with uncertainty, it's in one's own best interest to err on the side of caution, since the penalties for breaking those laws can be quite substantial.)

apacurariu · 2023-08-17T11:00:42Z

@stakx Also noteworthy: while I don't assume that this is what happened here, it is actually impossible to prove that data was deleted after it has been collected (e.g. that it was not copied or distributed first). Again, I'm not saying this is the case here, just putting it in context: there is nothing anyone can do to validate that the data was actually deleted, other than trusting the author. So while we're trusting him about deleting the data, and we're not talking about intentional malice, for future actions it's important to note that their effects might never be truly undone, or at least that it's not possible to prove that they were undone and to what extent.

(And, per my understanding of GDPR, any personally identifiable information falls under its jurisdiction. Since a hash is a deterministic function, being provided with the same email address yields the same output. Therefore, the hashed output is inextricably tied to a person and it is therefore itself personally identifiable information and subject to GDPR. Having said that, the email itself, while PII under GDPR, is not in itself sensitive enough to be likely to lead to actual real-world harm, in my opinion (although I could think of some fringe scenarios), so it's more a matter of principle and law than actual risk in this particular case.)
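As a rough illustration of the determinism point, the sketch below shows that even without reversing anything, an unsalted hash works as a stable pseudonym that links separate records to the same person. SHA-256, the record layout, and all names and values are assumptions made for this example, not details taken from SponsorLink.

```python
import hashlib
from collections import defaultdict


def pseudonym(email: str) -> str:
    """Deterministic: the same address always yields the same digest."""
    return hashlib.sha256(email.lower().encode("utf-8")).hexdigest()


# Hypothetical build events: (email, project, day).
events = [
    ("alice@example.com", "ProjectA", "2023-08-08"),
    ("bob@example.com", "ProjectA", "2023-08-08"),
    ("alice@example.com", "ProjectB", "2023-08-09"),
]

# What a collector would store: the hash instead of the address.
records = [(pseudonym(email), project, day) for email, project, day in events]

# Grouping by hash rebuilds a per-person activity history.
history = defaultdict(list)
for digest, project, day in records:
    history[digest].append((project, day))

for digest, activity in history.items():
    print(digest[:12], activity)
```

Three records collapse into two per-person histories, which is the kind of linkability that makes the hashed value behave like an identifier rather than anonymous data.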

stakx · 2023-08-17T18:02:56Z

@apacurariu, not being able to personally verify whether a service provider really deleted your data like they told you isn't new... so I am not especially worried about that theoretical possibility only now, in this particular instance.

On the contrary, kzu has been very transparent about the deletion process and documented it in detail; why would he have fabricated all of that? (People don't seem to notice, but IIRC he has actually followed up on all demands except for a single one.) I for one see no reason to doubt his honesty here, and I probably couldn't have asked for a better outcome for this request.

apacurariu · 2023-08-17T19:26:53Z

@stakx I didn't imply that he didn't delete the data. On the contrary, I said I trust that he did but this is only based on trust.

However, while I stand behind the theoretical and practical impossibility of demonstrating data deletion (which goes beyond one's inability to personally verify it), I don't want to derail the thread further, especially since this is a purely theoretical consideration at this point.

stakx · 2023-08-17T20:14:29Z

@apacurariu, I understand, and your point is well taken. Also, don't worry too much about derailing the thread; the issue is resolved and closed anyway, and I'm probably going to step away from this whole issue tracker for a while. The whole situation is rather frustrating and I really need to take a break.

apacurariu · 2023-08-17T20:16:44Z

@stakx I admit, I also need to take a step back from this. I'll try to find something constructive to add or contribute.

AraHaan · 2023-08-19T03:14:34Z

> @Gavin-Williams, SponsorLink already doesn't store email addresses, but hashes derived from them. The problem here is that hashing doesn't sufficiently anonymize the email addresses, because it turns out that the hashing can be reversed in this case: email addresses are often publicly known, so if you have a list of email addresses, you can just hash them the same way SponsorLink hashes them, and build a lookup table; if you then get hold of the SponsorLink database and the hashes stored therein, you can map them back to email addresses. See devlooped/SponsorLink#31.

> Ironically, making it open-source means you know the hashing algorithm. Or you don't and the irony is that a non-OSS project is used for sponsoring OSS projects. Catch-22.

Not to mention many hash algorithms can be defeated with rainbow tables.

Gavin-Williams · 2023-08-19T07:09:48Z

Using closed source with open source is not ironic. There is a group of people who think open source is an ideology that must be everywhere and can't be mixed. But many people see open source as simply a tool, particularly for software that is under-developed and probably under-resourced, so that its behavior can be understood and fixes and features can be provided by users. Mixing open source and closed source isn't an issue at all.

AraHaan · 2023-08-19T12:42:07Z

Sometimes there is just no way to avoid mixing them either. Sometimes a single person is working on the open source part and might need help, so they tell a company that if it wants to use the open source code in its projects and needs additional functionality, it should program that functionality itself as extensions or plugins. After all, even single developers have limits to what they can do.

Example:

Open source developer A creates a system that can be used in all types of applications.
Company A then asks developer A to add some specific APIs for things like DirectX, GUI stuff, etc.
Developer A can't, because they might not know 3D programming or even GUI programming, so they ask the company to do it for them as an extension to said system.

stakx assigned kzu Aug 13, 2023

kzu mentioned this issue Aug 15, 2023

Replace log analytics workspace which is GONE devlooped/SponsorLink#49

Merged

kzu closed this as completed in devlooped/SponsorLink#49 Aug 15, 2023

kzu mentioned this issue Aug 17, 2023

GDPR compliance devlooped/SponsorLink#51

Closed

devlooped locked as resolved and limited conversation to collaborators Jan 30, 2024
