Enable partial masking of IP addresses in access logs #124
based on what? (I'm not challenging; just asking.) Please add the above to one of the commit messages and point to the text specification/description of what this does. I can see that the "specification" is not complicated and is an extension to already existing accesslog format specifiers, and I apologize for the formalities, but precautions are necessary. Please use https://wiki.lighttpd.net/mod_accesslog instead of https://redmine.lighttpd.net/projects/lighttpd/wiki/Mod_accesslog
Is there a convention for masking the lower bits instead of masking the upper bits? The lower bits might be better for matching up the same (slightly anonymized) client. Is this masking permitted by GDPR? For what reasons are you masking? Can you provide pointers to the specifications/guidance? Is masking reasonable? Should the IP instead be passed through a hash function such as SHA256? I'll try to review more later this weekend. I hope there is a way to do this with less code and hopefully without allocation.
Yes, I know I named a function pointer The code in the |
This change is inspired by mod_log_ipmask for the Apache web server, which was originally developed by the data protection office in Saxony, Germany (see https://www.saechsdsb.de/component/jifile/download/ZjU5MGFlZjU5MzczYzdhOTNkM2NhYjFiYzg1OGU0ZWE=/dokumentation-ip-maskierung-v1-0-pdf ). The goal is to de-personalize the IP address, which is otherwise considered personally identifiable information (in some jurisdictions at least). The code here was written from scratch by myself and is not derived from the original mod_log_ipmask code. It adds an optional parameter of the form "[v4|v6]:n[,...]" to the "%a" / "%h" placeholders in lighttpd's accesslog.format configuration parameter, where n is the number of trailing bits to mask out, for logging purposes, in addresses of the respective protocol. For example, the placeholder "%{v4:8,v6:72}a" would replace the rightmost 8 bits of IPv4 addresses and the rightmost 72 bits of IPv6 addresses with 0s. In the absence of the parameter the old behaviour is preserved, i.e. the full address is logged.
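For readers who want to see what "mask the n trailing bits" means on the binary address, here is a minimal sketch. It is not the code from this PR: the helper names `mask_trailing_bits()` and `mask_ip_string()` are hypothetical, and it assumes a POSIX system providing `inet_pton()` / `inet_ntop()`.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper (not from the PR): zero the rightmost `bits` bits of a
 * binary address of `len` bytes stored in network byte order. */
static void mask_trailing_bits(unsigned char *addr, size_t len, unsigned bits)
{
    assert(addr != NULL);
    if (bits > len * 8) bits = (unsigned)(len * 8); /* clamp to address size */
    size_t i = len;
    while (bits >= 8) {            /* clear whole trailing bytes first */
        addr[--i] = 0;
        bits -= 8;
    }
    if (bits > 0)                  /* then the low bits of the next byte */
        addr[i - 1] &= (unsigned char)(0xFFu << bits);
}

/* Hypothetical wrapper: parse `text` as an address of family `af`, mask the
 * trailing `bits` bits, and write the result to `out`.
 * Returns 0 on success, -1 on error. */
static int mask_ip_string(int af, const char *text, unsigned bits,
                          char *out, size_t outlen)
{
    unsigned char addr[sizeof(struct in6_addr)];
    if (af != AF_INET && af != AF_INET6) return -1;
    size_t len = (af == AF_INET) ? sizeof(struct in_addr)
                                 : sizeof(struct in6_addr);
    if (inet_pton(af, text, addr) != 1) return -1;
    mask_trailing_bits(addr, len, bits);
    return inet_ntop(af, addr, out, (socklen_t)outlen) != NULL ? 0 : -1;
}
```

With the "%{v4:8,v6:72}a" example above, calling `mask_ip_string(AF_INET, "192.168.1.123", 8, ...)` would produce `192.168.1.0`, and the IPv6 case would keep only the leading 56 bits.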
Not sure I understand the question. I have sufficient programming background to write code given a task like "mask lower bits of an ip address".
Done. I included the "specification" into the message. I couldn't find a place in the repo where to add it. I guess it will be added to the wiki at some point.
Done.
Done (but irrelevant after refactoring).
Done.
Here's a pointer to something official: https://www.datenschutzzentrum.de/artikel/575-IP-Adressen-und-andere-Nutzungsdaten-Haeufig-gestellte-Fragen.html An IP address is considered personally identifiable information because it is possible to find out who was using it at a given time with the help of the ISP who owns the address. By stripping off a sufficient number of bits from the address this possibility no longer exists, which means the remaining bits can be considered anonymous. Stripping off the rightmost bits instead of left has two main reasons:
Hashing is not sufficient, because at least for v4 addresses it is feasible to find an IP address for a given hash by brute-forcing the IP address range. Also, hashing has the downside of losing the address prefix and thus the geographic origin of a request.
Done.
Done.
Haha. I do not think you woke up one day and decided "You know what I need? I need to mask IP bits in my logs!" :) You answered the question with your response to my more detailed questions about regulations and how masking the IP address is permitted by (at least some) official guidelines. Regarding SHA256 hash, yes, I should have suggested that the hash be salted and the salt changed periodically, similar to what lighttpd does multiple times per day with the TLS session ticket encryption key. Then again, you raise a good point about losing the utility of retaining the high bits of the address for offline analytics. Some technical questions for discussion (before asking you to make any further code changes): Any specific existing convention on which you based the format specifier syntax extension ( mod_accesslog is not the only place in lighttpd which deals with IP addresses and logging. Among others, there is also the lighttpd error log(s), which is why I mentioned in https://redmine.lighttpd.net/boards/3/topics/10949 that I am generally of the opinion that arbitrary log munging should be done by a piped logger. Regarding use of lighttpd A quick internet search suggests to me that IP masking in-server (e.g. not performed by piped loggers or similar) is frequently implemented via string matching / regex on the stringified IP address. Note: I have not measured the performance of potentially (limited, non-configurable) masking based on a string match of the already stringified IP address, versus masking the binary address and then stringifying using |
This approach would be more likely to minimize the overhead for existing lighttpd use, without the extended format specification.
While I am a fan of C99 designated initializers, lighttpd is still built by some people for very old machines with ancient compilers, or on systems with brain-dead compilers (historically, MS C++ compilers refused to support C99, even over a decade after C99 was published). Your Depending on other discussions above, we might combine this and |
None that I know. It felt natural.
There is no law specifying precisely how many bits must be removed in order to sufficiently anonymize an address, and even if there were, the value might vary between jurisdictions. I therefore think it should be configurable.
Right. mod_log_ipmask ignores the error log as well IIRC. In my understanding, an IP address in the error log is more likely to be a case of legitimate interest for the server operator. The article I linked above specifically mentions that it is acceptable to keep unmasked IP addresses for security purposes for up to 7 days (think of fail2ban for example).
Yes, I noticed. My first impression was that the cache is most likely a case where Knuth's premature optimization rule should have been applied. :-)
That's easy to do when you just want to drop the last octet from an IPv4 address, but more difficult if you want to strip a different number of bits. It's also more difficult for IPv6 addresses.
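To illustrate the "easy" case mentioned here, the following is a minimal sketch of string-based masking of the final IPv4 octet. It is an illustration only, not code from lighttpd or from this PR; the function name `mask_ipv4_last_octet_str()` is hypothetical, and the input is assumed to be an already normalized dotted-quad string.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy the dotted-quad string `ip` into `out` with the
 * final octet replaced by "0". Returns 0 on success, -1 if `ip` contains no
 * '.' or `out` is too small. Assumes `ip` is a normalized IPv4 string. */
static int mask_ipv4_last_octet_str(const char *ip, char *out, size_t outlen)
{
    assert(ip != NULL && out != NULL);
    const char *dot = strrchr(ip, '.');        /* last '.' precedes final octet */
    if (dot == NULL) return -1;
    size_t prefix = (size_t)(dot - ip) + 1;    /* length including the '.' */
    if (prefix + 2 > outlen) return -1;        /* room for "0" and the NUL */
    memcpy(out, ip, prefix);
    out[prefix] = '0';
    out[prefix + 1] = '\0';
    return 0;
}
```

This only works because 8 bits line up with one printed octet. Stripping, say, 12 bits would require parsing the digits back into a number, which is the difficulty described above, and IPv6 text adds the `::` compression on top of that.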
I encourage you to measure the time and CPU cost of uncached The stringified IP cache was potentially called multiple times per request in much older lighttpd code history. The precursor to sock_addr_cache.[ch] was called For high-load scenarios, the overhead of checking the small, static cache is small, even if there are cache misses most of the time. However, the existence of a cache with very small number of elements is not for the benefit of many, many different clients hitting the server in a short timeframe. Rather, the cache is/was to reduce repeating work in different lighttpd modules during the same request, reducing latency and CPU use. In that case, the lookup does have a higher chance of having a cache hit. In any case, the reason I am focusing on this cache is because I recently made changes to lighttpd mod_extforward to handle HTTP/2 requests from different clients that are multiplexed by upstream load balancers. Given our conversation here, I'll ask you to please remove the use of the cache in your PR, as that does seem to be a premature optimization since you were not aware of the cache use case and have not measured the performance of halving the effective cache size. By the time the request is being logged in the access log, the IP cache is not likely to be called again during the same request. For high-traffic bursts of requests from the same client, using a cache might theoretically contribute to marginally lower latency in responses. When that use case is measured, we can reconsider adding a masked IP cache to mod_accesslog. Separately, and outside the scope of this PR, I might revisit the existing single use of
My post came out of a quick search to review existing usage conventions for how to mask IPs in Apache and nginx. I prefer to operate on structured data (i.e. the binary address) rather than string flinging, which is why we are discussing your PR here. For others reading this post, I wanted to point out that there are existing solutions that can be used today with lighttpd without this PR, even if those solutions are potentially less efficient. Since I have not measured the current differences between string flinging and masking the binary IP and using |
Sorry, that falls under the category of "no". Such policy enforcement, which might need to look for other PII (personally identifiable information), too, should be done by a filter before persistent storage of that information, e.g. by a piped logger. The recommendation for a separate policy enforcement layer prior to persistent storage remains the same as when I posted in 2018 in https://redmine.lighttpd.net/boards/2/topics/8097 around the time that GDPR came into force. Since mod_accesslog is optional and optionally used for analytics, this PR is being considered for convenience of those using mod_accesslog and trying to comply with GDPR. I do not wish to expand the scope of this PR. |
What should be the default if only one of IPv4 or IPv6 mask is specified? If the user-interface is confusing to us, then it will very likely be confusing to many people who try to use it for GDPR compliance. I am still mulling over this interface, as it is a user-visible change, and if we do it right, then others might use it as a reference and convention. However, I have to refresh myself on IPv6 notation and all that junk to have a more informed opinion. My questions above are me dipping my toe in. I had the preliminary thought of |
Done.
Done.
Done. Thanks for the explanation.
Good questions. And they're related. IMO 32 bits is not sufficient for IPv6 since I've seen ISPs handing out /64s per customer. An alternative could be to take a value of n for IPv4 and n+64 for IPv6 (at least for n > 0).
Note: my time is more limited the next few days, so follow-ups may be delayed. Continuing discussions: masking interface (mod_accesslog format specifier) and masking requirements Regarding masking requirements, while lighttpd mod_accesslog could be configurable, is there a strong reason not to do the same as is done by Google Analytics? https://support.google.com/analytics/answer/2763052 (remove last octet for IPv4, remove last 10 octets for IPv6) This is repeated on more than a few other sites. Some examples: https://www.mediawiki.org/wiki/GDPR_(General_Data_Protection_Regulation)_and_MediaWiki_software
https://www.cookieyes.com/blog/ip-anonymization-in-google-analytics-for-gdpr-compliance/
https://github.com/ankane/ip_anonymizer
Since lighttpd If we develop a reasonable degree of confidence that doing X masking satisfies GDPR, then I would prefer a simpler interface. |
    break;
  #endif
  default:
    return 0;
As a reusable function in a generic sock_addr.[ch], this should probably copy the address, too, without a mask. Maybe *dest = *source at the top. (Setting sa_family and other parts of sock_addr appears to be missing, too.) With the removal of the use of the IP cache in mod_accesslog in this PR, the calling function in mod_accesslog need not check for sa_family (and should have used the inline function sock_addr_get_family() anyway), and the caller might use sock_addr.c:sock_addr_inet_ntop_append_buffer() on the masked (sock_addr *)dest. That would use the stack and avoid an allocation.
However, given other discussion, this function might be removed from the PR so these comments are not action items.
Lightly tested string flinging masking of IPv4 final octet (8 bits) and IPv6 final 10 octets (80 bits):
I wasn't aware of that. It's been quite a few years since I last researched the subject. I found this recommendation to strip at least 88 bits in IPv6 addresses, but it predates GDPR. I don't see a technical reason not to follow Google's approach. However, choosing fixed numbers means making a decision with potential legal consequences on behalf of your users. Not sure if you want to do that, although I believe the license terms would protect us from these consequences.
Seems reasonable (but I agree that "%{gdpr}a" may not be a good idea).
6in4 and 6over4 are about routing and are not distinguishable on the endpoints, these can be ignored. Mapped IPv4 we cannot ignore, because that would mean IPv4 addresses are always logged as '::' (or '::ffff' with your implementation above) when
I agree.
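As a rough sketch of the fixed convention discussed in this thread (mask the final octet for IPv4, the final 10 octets for IPv6), combined with the point about IPv4-mapped IPv6 addresses, something like the following could work on the binary address. This is an assumption-laden illustration, not the code that was eventually committed: the function names `mask_in4_fixed()` and `mask_in6_fixed()` are hypothetical, and treating a mapped address like plain IPv4 is one possible choice rather than a confirmed behaviour of lighttpd.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper: mask the final octet (8 bits) of an IPv4 address. */
static void mask_in4_fixed(struct in_addr *a)
{
    assert(a != NULL);
    a->s_addr &= htonl(0xFFFFFF00u);   /* s_addr is in network byte order */
}

/* Hypothetical helper: mask the final 10 octets (80 bits) of an IPv6 address.
 * Assumption: an IPv4-mapped address (::ffff:a.b.c.d) is treated like IPv4,
 * so only its final octet is cleared; masking 80 bits would wipe out the
 * ffff marker and the whole embedded IPv4 address. */
static void mask_in6_fixed(struct in6_addr *a)
{
    assert(a != NULL);
    if (IN6_IS_ADDR_V4MAPPED(a))
        a->s6_addr[15] = 0;
    else
        memset(a->s6_addr + 6, 0, 10); /* keep the leading 6 octets (48 bits) */
}
```

The masked structures can then be passed to `inet_ntop()` (or, inside lighttpd, to whichever stringifying helper the caller already uses) to obtain the text that is logged.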
Are you a lawyer? If you are not a lawyer, please do not present any opinions which might be mistaken as legal opinions. I have discarded your entire statement without consideration. The question/answer for IP masking in Google Analytics has been active for at least a decade (10 years) as of next month. As I mentioned further above, in order for users to reliably do the right thing, the interfaces should be simple and helpful. Providing a fully-configurable interface to mask IPv4 and IPv6, and requiring the user to separately fill in the values for IPv4 and IPv6 is flexible, but definitely not as simple as it could be. Absent explicit, official government documents indicating otherwise, matching the IP masking behavior widely used on the internet for years is simple and likely the best thing to do for those trying to comply with the government regulations. Therefore, I think
Same question as before. If you think lighttpd should mask 88 bits from IPv6 instead of masking 80 bits, then please provide more justification why that is better than doing the same as Google Analytics has done for the past decade. The link you provided which suggested masking 88 bits is not convincing since, as you note, the document predates the GDPR, and the link you provided is to the Wayback Machine, not to a live document.
I pushed an experimental commit to my dev branch using See top commit at https://git.lighttpd.net/lighttpd/lighttpd1.4/src/branch/personal/gstrauss/master |
Thanks. I'm closing this PR in favour of your change. Your code lgtm. I found the line
a bit hard to understand, maybe this version would be more readable:
Also, for consistency you might want to add a |
Yes, slightly clearer, but does not fit in 80 chars, so I'll keep the single line as-is. This would fit in two lines:
Were the code in question many lines, then a better named variable is much more desirable. The comments directly above the one line say this masks 80 bits. I could expand the comment to indicate how the existing one line is doing it, describing how 6 octets are 48-bits of a 128-bit address, masking 80 bits, but at some point comments should not describe that i += 1 says to add 1 to i. Comments should generally describe why or what, not how, unless the algo is complex. In this case, I prefer counting octets (or even nibbles) on the stringified IP.
The top of the func is commented: The asserts are unnecessary given that the input has already been normalized. The Thank you for the discussion. |
(thx pmconrad) IPv4: mask final octet (8 bits) of address IPv6: mask final 10 octets (80 bits) of address x-ref: Enable partial masking of IP addresses in access logs #124 IP masking in Universal Analytics https://support.google.com/analytics/answer/2763052 github: closes #124
This PR is inspired by mod_log_ipmask for Apache. It was written from scratch by myself (as should be obvious from the git history), i.e. it is not based on the actual code of mod_log_ipmask.
It adds an optional format parameter to the %a / %h placeholders in the accesslog.format config option. The parameter can specify how many trailing bits of the client IP address to mask (i.e. set to 0). Separate values can be set for IPv4 and IPv6. Both default to 0, meaning no masking, i.e. the behaviour is compatible with previous versions.