Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable partial masking of IP addresses in access logs #124

Closed
wants to merge 8 commits into from

Conversation

pmconrad
Copy link

@pmconrad pmconrad commented Mar 4, 2023

This PR is inspired by mod_log_ipmask for apache. It was written from scratch by myself (as should be obvious from the git history), i. e. is not based on the actual code of mod_log_ipmask.

It adds an optional format parameter to the %a / %h placeholders in the accesslog.format config option. The parameter can specify how many trailing bits of the client IP address to mask (i. e. set to 0). Separate values can be set for IPv4 and IPv6. Both default to 0, meaning no masking, i. e. the behaviour is compatible with previous versions.

@gstrauss
Copy link
Member

gstrauss commented Mar 4, 2023

It was written from scratch by myself

based on what? (I'm not challenging; just asking.) Please add to the above to one of the commit messages and point to the text specification/description of what this does. I can see that the "specification" is not complicated and is an extension to already existing accesslog format specifiers, and apologize for the formalities, but precautions are necessary.

Please use https://wiki.lighttpd.net/mod_accesslog instead of https://redmine.lighttpd.net/projects/lighttpd/wiki/Mod_accesslog

<netinet/in.h> unfortunately is not as portable as we'd all like. Please replace that and <sys/socket.h> with the include of the local abstraction #include "sys-socket.h"

mask() is a poor name for a function, even a local one. Are you masking IP bits? Are you masking upper IP bits or lower IP bits? Can the function name better describe what it is doing (in brief) in two or three words separate by underscore? (Yes, I can read what the code does, despite these questions. )

Is there a convention for masking the lower bits instead of masking the upper bits? The lower bits might be better for matching up the same (slightly anonymized) client. Is this masking permitted by GDPR? For what reasons are you masking? Can you provide pointers to the specifications/guidance? Is masking reasonable? Should the IP instead be passed through a hash function such as SHA256?

I'll try to review more later this weekend. I hope there is a way to do this with less code and hopefully without allocation.

sock_addr_is_addr_eq_bits() in sock_addr.c might be referenced, and we might consider a function in sock_addr.c rather than cut-n-pasting code into mod_accesslog.c, which ideally should not care about IPv4, IPv6, unix domain socket distinctions, as should be apparent by you having to add less-portable headers into mod_accesslog.c.

@gstrauss
Copy link
Member

gstrauss commented Mar 4, 2023

Yes, I know I named a function pointer esc() for the local variable, so asking you to rename mask() might seem unfair.

The code in the for() loop in log_access_record() is intentionally tight, so your additions into log_access_record() should be pulled out into a better-named mask(), (accesslog_append_remote_addr()?) which takes various parameters including b, the buffer into which to write the address, rather than the tmp that you created. I'm nearly certain that we'll end up marking that function __attribute_noinline__

This change is inspired by mod_log_ipmask for the apache webserver,
which was originally developed by the data protection office in
Saxonia, Germany (see
https://www.saechsdsb.de/component/jifile/download/ZjU5MGFlZjU5MzczYzdhOTNkM2NhYjFiYzg1OGU0ZWE=/dokumentation-ip-maskierung-v1-0-pdf ).
The goal is to de-personalize the IP address, which is otherwise
considered a piece of personally identifiable information (in some
jurisdictions at least).

The code here was written from scratch by myself and is not a work
derived from the original mod_log_ipmask code.

It adds an optional parameter of the form "[v4|v6]:n[,...]" to the
"%a" / "%h" placeholders in lighttpd's accesslog.format configuration
parameter, where n is the number of trailing bits to mask out in
addresses belonging to the respective protocols, for the purpose of
logging. For example, the placeholder "%{v4:8,v6:72}a" would replace
the rightmost 8 bits of IPv4 addresses with 0's, and the rightmost
72 bits of IPv6 addresses. In the absence of the parameter the old
behaviour is preserved, i. e. the full address is logged.
@pmconrad
Copy link
Author

pmconrad commented Mar 5, 2023

It was written from scratch by myself

based on what?

Not sure I understand the question. I have sufficient programming background to write code given a task like "mask lower bits of an ip address".
To be clear, I knew the mod_log_ipmask code before implementing this for lighttpd. If that means I cannot contribute here, then so be it. No worries.

Please add to the above to one of the commit messages and point to the text specification/description of what this does.

Done. I included the "specification" into the message. I couldn't find a place in the repo where to add it. I guess it will be added to the wiki at some point.

Please use https://wiki.lighttpd.net/mod_accesslog instead of https://redmine.lighttpd.net/projects/lighttpd/wiki/Mod_accesslog

Done.

<netinet/in.h> unfortunately is not as portable as we'd all like. Please replace that and <sys/socket.h> with the include of the local abstraction #include "sys-socket.h"

Done (but irrelevant after refactoring).

mask() is a poor name for a function, even a local one. Are you masking IP bits? Are you masking upper IP bits or lower IP bits? Can the function name better describe what it is doing (in brief) in two or three words separate by underscore?

Done.

Is there a convention for masking the lower bits instead of masking the upper bits? The lower bits might be better for matching up the same (slightly anonymized) client. Is this masking permitted by GDPR? For what reasons are you masking? Can you provide pointers to the specifications/guidance? Is masking reasonable? Should the IP instead be passed through a hash function such as SHA256?

Here's a pointer to something official: https://www.datenschutzzentrum.de/artikel/575-IP-Adressen-und-andere-Nutzungsdaten-Haeufig-gestellte-Fragen.html
Obviously, this applies to Germany. IANAL, but I think the principle applies to the GDPR in general.

An IP address is considered personally identifiable information because it is possible to find out who was using it at a given time with the help of the ISP who owns the address. By stripping off a sufficient number of bits from the address this possibility no longer exists, which means the remaining bits can be considered anonymous.

Stripping off the rightmost bits instead of left has two main reasons:

  • since IP address blocks are handed out to ISPs per prefix, the prefix still contains some information about the (geographic) origin of a request, which is probably one of the main reasons for logging IP addresses at all.
  • For IPv6 addresses, the lower bits are in some cases derived from hardware IDs (MAC address). This alone is sufficient to identify a specific device (and thus its owner). Therefore, stripping off higher bits would not be sufficient to anonymize the address.

Hashing is not sufficient, because at least for v4 addresses it is feasible to find an IP address for a given hash by brute-forcing the IP address range. Also, hashing has the downside of losing the address prefix and thus the geographic origin of a request.

sock_addr_is_addr_eq_bits() in sock_addr.c might be referenced, and we might consider a function in sock_addr.c rather than cut-n-pasting code into mod_accesslog.c, which ideally should not care about IPv4, IPv6, unix domain socket distinctions, as should be apparent by you having to add less-portable headers into mod_accesslog.c.

Done.

The code in the for() loop in log_access_record() is intentionally tight, so you additions into log_access_record() should be pulled out into a better-named mask(),

Done.

@gstrauss
Copy link
Member

gstrauss commented Mar 5, 2023

It was written from scratch by myself

based on what?

Not sure I understand the question. I have sufficient programming background to write code given a task like "mask lower bits of an ip address".

Haha. I do not think you woke up one day and decided "You know what I need? I need to mask IP bits in my logs!" :)

You answered the question with your response to my more detailed questions about regulations and how masking the IP address is permitted by (at least some) official guidelines.

Regarding SHA256 hash, yes, I should have suggested that the hash be salted and the salt changed periodically, similar to what lighttpd does multiple times per day with the TLS session ticket encryption key. Then again, you raise a good point about losing the utility of retaining the high bits of the address for offline analytics.


Some technical questions for discussion (before asking you to make any further code changes):


Any specific existing convention on which you based the format specifier syntax extension (%{v[46]:numbits[,...]}a)? Is there a necessity for this to be configurable? Are there regulations saying IP mask should be a minimum of XX bits? The article you referenced suggests that masking the lower octet of IPv4 addresses is sufficient. (I don't agree, but that's not the point.) If there is a convention or standard, should we more simply provide an option to do "suggested" masking?

mod_accesslog is not the only place in lighttpd which deals with IP addresses and logging. Among others, there is also the lighttpd error log(s), which is why I mentioned in https://redmine.lighttpd.net/boards/3/topics/10949 that I am generally of the opinion that arbitrary log munging should be done by a piped logger.


Regarding use of lighttpd sock_addr_cache_inet_ntop_copy_buffer(). That is currently a small 4-element cache currently used only in one place in the lighttpd base codebase (connections.c). Leveraging the same cache for masked IP addresses reuses the code, which is good, but halves the size of effective cache, and adds some unnecessary work since the masked entries will (almost) never match for connections.c, and the unmasked entries will never match for mod_accesslog use. The cache as is currently written is not thread-safe, but lighttpd is not threaded. I could minimally extend sock_addr_cache_inet_ntop_copy_buffer() to take a cache object on which to operate, and have connections.c and mod_accesslog provide their own cache objects. (sock_addr_cache_inet_ntop_copy_buffer() could also be modified to append instead of to overwrite the passed-in buffer, to avoid the need for buffer allocation and double-copy.) Alternatively, the use of the cache could be removed from mod_accesslog in this PR, similar to how the cache is not used in http_cgi.c when setting SERVER_ADDR. (On the other hand, if sock_addr_cache_inet_ntop_copy_buffer() is extended, then http_cgi.c might also be extended to have its own cache object.)

A quick internet search suggests to me that IP masking in-server (e.g. not performed by piped loggers or similar) is frequently implemented via string matching / regex on the stringified IP address. Note: I have not measured the performance of potentially (limited, non-configurable) masking based on a string match of the already stringified IP address, versus masking the binary address and then stringifying using inet_ntop(), potentially mediated by a small cache. String flinging of the stringified IP could be implemented in lighttpd mod_magnet and while reasonably fast and possible to do today without patching lighttpd, I do believe that the PR here will be faster.

@gstrauss
Copy link
Member

gstrauss commented Mar 5, 2023

This approach would be more likely to minimize the overhead for existing lighttpd use, without the extended format specification. accesslog_append_addr_masked() would be marked __attribute_noinline__ and would have to handle non-IP strings (e.g. AF_UNIX), but there are existing routines in sock_addr.h to help.

-    buffer_append_string_buffer(b, r->dst_addr_buf);
+    if (!f->opt)
+        buffer_append_string_buffer(b, r->dst_addr_buf);
+    else
+        accesslog_append_addr_masked(b, r->dst_addr, f->opt);
     break;

@gstrauss
Copy link
Member

gstrauss commented Mar 5, 2023

While I am a fan of C99 designated initializers, lighttpd is still built by some people for very old machines with ancient compilers, or on systems with brain-dead compilers (historically, MS C++ compilers refused to support C99, even over a decade after C99 was published).

Your sock_addr_mask_lower_bits() is nicely reusable.

Depending on other discussions above, we might combine this and sock_addr_cache_inet_ntop_copy_buffer() into a specialized sock_addr_cache_inet_ntop_masked_append_buffer()

@pmconrad
Copy link
Author

pmconrad commented Mar 5, 2023

Any specific existing convention on which you based the format specifier syntax extension (%{v[46]:numbits[,...]}a)?

None that I know. It felt natural.

Is there a necessity for this to be configurable? Are there regulations saying IP mask should be a minimum of XX bits? The article you referenced suggests that masking the lower octet of IPv4 addresses is sufficient. (I don't agree, but that's not the point.) If there is a convention or standard, should we more simply provide an option to do "suggested" masking?

There is no law specifying precisely how many bits must be removed in order to sufficiently anonymize an address, and even if there was the value might vary between jurisdictions. I therefore think it should be configurable.

mod_accesslog is not the only place in lighttpd which deals with IP addresses and logging. Among others, there is also the lighttpd error log(s), which is why I mentioned in https://redmine.lighttpd.net/boards/3/topics/10949 that I am generally of the opinion that arbitrary log munging should be done by a piped logger.

Right. mod_log_ipmask ignores the error log as well IIRC. In my understanding, an IP address in the error log is more likely to be a case of legitimate interest for the server operator. The article I linked above specifically mentions that it is acceptable to keep unmasked IP addresses for security purposes for up to 7 days (think of fail2ban for example).
Again, this might vary between jurisdictions, so maybe we should look for a way to apply the masking in a more general way.

Regarding use of lighttpd sock_addr_cache_inet_ntop_copy_buffer(). That is currently a small 4-element cache currently used only in one place in the lighttpd base codebase (connections.c).

Yes, I noticed. My first impression was that the cache is most likely a case where Knuth's premature optimization rule should have been applied. :-)
IMO in a high-load installation the cache will be of little use because it is too small, so entries will be overwritten before they could be re-used. In a low-load installation the cache might be hit more often, but the few extra cycles for stringifying the address each time wouldn't hurt.

A quick internet search suggests to me that IP masking in-server (e.g. not performed by piped loggers or similar) is frequently implemented via string matching / regex on the stringified IP address.

That's easy to do when you just want to drop the last octet from an IPv4 address, but more difficult if you want to strip a different number of bits. It's also more difficult for IPv6 addresses.

@gstrauss
Copy link
Member

gstrauss commented Mar 5, 2023

My first impression was that the cache is most likely a case where Knuth's premature optimization rule should have been applied. :-)
IMO in a high-load installation the cache will be of little use because it is too small, so entries will be overwritten before they could be re-used. In a low-load installation the cache might be hit more often, but the few extra cycles for stringifying the address each time wouldn't hurt.

I encourage you to measure the time and CPU cost of uncached inet_ntop() on your own system. It is not free.

The stringified IP cache was potentially called multiple times per request in much older lighttpd code history. The precursor to sock_addr_cache.[ch] was called inet_ntop_cache_get_ip() in the since-renamed inet_ntop_cache.[ch].

For high-load scenarios, the overhead of checking the small, static cache is small, even if there are cache misses most of the time. However, the existence of a cache with very small number of elements is not for the benefit of many, many different clients hitting the server in a short timeframe. Rather, the cache is/was to reduce repeating work in different lighttpd modules during the same request, reducing latency and CPU use. In that case, the lookup does have a higher chance of having a cache hit.

In any case, the reason I am focusing on this cache is because I recently made changes to lighttpd mod_extforward to handle HTTP/2 requests from different clients that are multiplexed by upstream load balancers. inet_ntop() is used as part of the IP string normalization. The IP string cache is not used there, and is not used in http_cgi.c. Those are potential future optimizations if supported by (future) measurements.

Given our conversation here, I'll ask you to please remove the use of the cache in your PR, as that does seem to be a premature optimization since you were not aware of the cache use case and have not measured the performance of halving the effective cache size. By the time the request is being logged in the access log, the IP cache is not likely to be called again during the same request. For high-traffic bursts of requests from the same client, using a cache might theoretically contribute to marginally lower latency in responses. When that use case is measured, we can reconsider adding a masked IP cache to mod_accesslog.

Separately, and outside the scope of this PR, I might revisit the existing single use of sock_addr_cache_inet_ntop_copy_buffer() in connections.c. For small embedded devices, I believe that the cache will still provided benefits for multiple requests from a single client in a short time-frame, as is typical when requesting a web page and supporting assets.


A quick internet search suggests to me that IP masking in-server (e.g. not performed by piped loggers or similar) is frequently implemented via string matching / regex on the stringified IP address.

That's easy to do when you just want to drop the last octet from an IPv4 address, but more difficult if you want to strip a different number of bits. It's also more difficult for IPv6 addresses.

My post came out of a quick search to review existing usage conventions for how to mask IPs in Apache and nginx. I prefer to operate on structured data (i.e. the binary address) rather than string flinging, which is why we are discussing your PR here. For others reading this post, I wanted to point out that there are existing solutions that can be used today with lighttpd without this PR, even if those solutions are potentially less efficient.

Since I have not measured the current differences between string flinging and masking the binary IP and using inet_ntop(), I am asking that you remove your use of the IP cache in mod_accesslog in this PR. As I said above, when that use case is measured, we can reconsider adding a masked IP cache to mod_accesslog. Thanks.

@gstrauss
Copy link
Member

gstrauss commented Mar 5, 2023

Again, this might vary between jurisdictions, so maybe we should look for a way to apply the masking in a more general way.

Sorry, that falls under the category of "no". Such policy enforcement, which might need to look for other PII (personally identifiable information), too, should be done by a filter before persistent storage of that information, e.g. by a piped logger. The recommendation for a separate policy enforcement layer prior to persistent storage remains the same as when I posted in 2018 in https://redmine.lighttpd.net/boards/2/topics/8097 around the time that GDPR came into force.

Since mod_accesslog is optional and optionally used for analytics, this PR is being considered for convenience of those using mod_accesslog and trying to comply with GDPR. I do not wish to expand the scope of this PR.

@gstrauss
Copy link
Member

gstrauss commented Mar 5, 2023

Any specific existing convention on which you based the format specifier syntax extension (%{v[46]:numbits[,...]}a)?

None that I know. It felt natural.

What should be the default if only one of IPv4 or IPv6 mask is specified?

If the user-interface is confusing to us, then it will very likely be confusing to many people who try to use it for GDPR compliance.

I am still mulling over this interface, as it is a user-visible change, and if we do it right, then others might use it as a reference and convention. However, I have to refresh myself on IPv6 notation and all that junk to have a more informed opinion. My questions above are me dipping my toe in. I had the preliminary thought of %{~24}a or some other syntax that looks to be the inverse of CIDR notation as the number of bits to mask, and I am considering the limitation of masking up to 32-bits and whether or not that would be sufficient for IPv6 -- and whether or not the IP masking interface must allow specifying separate IPv4 and IPv6 masks.

@pmconrad
Copy link
Author

pmconrad commented Mar 6, 2023

This approach would be more likely to minimize the overhead for existing lighttpd use

Done.

lighttpd is still built by some people for very old machines with ancient compilers

Done.

please remove the use of the cache in your PR

Done. Thanks for the explanation.

What should be the default if only one of IPv4 or IPv6 mask is specified?
I am considering the limitation of masking up to 32-bits and whether or not that would be sufficient for IPv6
and whether or not the IP masking interface must allow specifying separate IPv4 and IPv6 masks.

Good questions. And they're related.

IMO 32 bits is not sufficient for IPv6 since I've seen ISPs handing out /64s per customer. An alternative could be to take a value of n for IPv4 and n+64 for IPv6 (at least for n > 0).

@gstrauss
Copy link
Member

gstrauss commented Mar 6, 2023

Note: my time is more limited the next few days, so follow-ups may be delayed.

Continuing discussions: masking interface (mod_accesslog format specifier) and masking requirements

Regarding masking requirements, while lighttpd mod_accesslog could be configurable, is there a strong reason not to do the same as is done by Google Analytics? https://support.google.com/analytics/answer/2763052

(remove last octet for IPv4, remove last 10 octets for IPv6)

This is repeated on more than a few other sites. Some examples:

https://www.mediawiki.org/wiki/GDPR_(General_Data_Protection_Regulation)_and_MediaWiki_software

Mask IP addresses of unregistered users, by removing the last byte of IPv4 address (maybe by setting it to 0, for example, 192.168.1.5 would become 192.168.1.0), or the last 80 bits in case of a IPv6 address. This is what Google Analytics does for being compliant: IP anonymization on analytics.

https://www.cookieyes.com/blog/ip-anonymization-in-google-analytics-for-gdpr-compliance/

IP anonymization, also known as IP masking, is a method of replacing the original IP address with one that cannot be associated with or traced back to an individual user. This can be done by setting the last octet of IPV4 addresses or the last 80 bits of IPv6 addresses to zeros.

https://github.com/ankane/ip_anonymizer

This is the approach Google Analytics uses for IP anonymization:

For IPv4, set the last octet to 0
For IPv6, set the last 80 bits to zeros

Since lighttpd r->dst_addr_buf is a normalized IP string, string flinging to drop the bottom octet is trivial, and removing the bottom 10 octets for IPv6 is doable. Is this something we should consider? If we did, then the interface to %{...}a would be different, e.g. %{anon}a or %{gdpr}a (but that may be too bold a statement). If this were done, could we ignore distinctions for IPv6 6in4, 6to4, 6over4?

If we develop a reasonable degree of confidence that doing X masking satisfies GDPR, then I would prefer a simpler interface.

break;
#endif
default:
return 0;
Copy link
Member

@gstrauss gstrauss Mar 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a reusable function in a generic sock_addr.[ch], this should probably copy the address, too, without a mask. Maybe *dest = *source at the top. (Setting sa_family and other parts of sock_addr appears to be missing, too.) With the removal of the use of IP cache in mod_accesslog in this PR, the calling function in mod_accesslog need not check for sa_family (and should have used inline function sock_addr_get_family() anyway), and the caller and might use sock_addr.c:sock_addr_inet_ntop_append_buffer() on the masked (sock_addr *)dest. That would use the stack and avoid an allocation.

However, given other discussion, this function might be removed from the PR so these comments are not action items.

@gstrauss
Copy link
Member

gstrauss commented Mar 6, 2023

Lightly tested string flinging masking of IPv4 final octet (8 bits) and IPv6 final 10 octets (80 bits):

__attribute_noinline__
static void
accesslog_append_remote_addr_masked (buffer *b, request_st *r)
{
    /* r->dst_addr_buf is normalized and valid string; operate on known valid */
    const char * const s = r->dst_addr_buf->ptr;
    uint32_t i = 0;
    switch (sock_addr_get_family(r->dst_addr)) {
      case AF_INET:
        /* IPv4: mask final octet (8 bits) of address */
       #ifdef __COVERITY__
        force_assert(buffer_clen(r->dst_addr_buf) > 2);
        force_assert(strchr(s, '.') != NULL);
       #endif
        for (i = buffer_clen(r->dst_addr_buf)-1; s[--i] != '.'; ) ;
        buffer_append_str2(b, s, i+1, CONST_STR_LEN("0"));
        break;
     #ifdef HAVE_IPV6
      case AF_INET6:
        /* IPv6: mask final 10 octets (80 bits) of address */
        /* Note: treat string starting with "::..." as "::" even if "::x:..."
         * (rather than special-casing since I do not care about "::x:...") */
        for (int j = 0; s[i] != ':' || ((j += 2) != 6 && s[i+1] != ':'); ++i) ;
        buffer_append_str2(b, s, i+1, CONST_STR_LEN(":"));
        break;
     #endif
      default:
        buffer_append_string_buffer(b, r->dst_addr_buf);
        break;
    }
}

@pmconrad
Copy link
Author

pmconrad commented Mar 8, 2023

Regarding masking requirements, while lighttpd mod_accesslog could be configurable, is there a strong reason not to do the same as is done by Google Analytics? https://support.google.com/analytics/answer/2763052
This is repeated on more than a few other sites.

I wasn't aware of that. It's been quite a few years since I last researched the subject.

I found this recommendation to strip at least 88 bits in IPv6 addresses, but it predates GDPR.

I don't see a technical reason not to follow Google's approach. However, choosing fixed numbers means making a decision with potential legal consequences on behalf of your users. Not sure if you want to do that, although I believe the license terms would protect us from these consequences.

Since lighttpd r->dst_addr_buf is a normalized IP string, string flinging to drop the bottom octet is trivial, and removing the bottom 10 octets for IPv6 is doable. Is this something we should consider? If we did, then the interface to %{...}a would be different, e.g. %{anon}a or %{gdpr}a (but that may be too bold a statement).

Seems reasonable (but I agree that "%{gdpr}a" may not be a good idea).

If this were done, could we ignore distinctions for IPv6 6in4, 6to4, 6over4?

6in4 and 6over4 are about routing and are not distinguishable on the endpoints, these can be ignored.
6to4 is deprecated and can probably be treated like regular IPv6, for simplicity. It would be covered automatically if we followed the above recommendation to strip 88 bits from IPv6 addresses (leaving 16 bit 6to4 prefix and 24 bits of the IPv4 address).

Mapped IPv4 we cannot ignore, because that would mean IPv4 addresses are always logged as '::' (or '::ffff' with your implementation above) when server.v4mapped = "enable" is configured.

If we develop a reasonable degree of confidence that doing X masking satisfies GDPR, then I would prefer a simpler interface.

I agree.

@gstrauss
Copy link
Member

gstrauss commented Mar 8, 2023

However, choosing fixed numbers means making a decision with potential legal consequences on behalf of your users. Not sure if you want to do that, although I believe the license terms would protect us from these consequences.

Are you a lawyer? If you are not a lawyer, please do not present any opinions which might be mistaken as legal opinions. I have discarded your entire statement without consideration.

The question/answer for IP masking in Google Analytics has been active for at least a decade (10 years) as of next month.
https://web.archive.org/web/20130405085532/https://support.google.com/analytics/answer/2763052

As I mentioned further above, in order for users to reliably do the right thing, the interfaces should be simple and helpful.

Providing a fully-configurable interface to mask IPv4 and IPv6, and requiring the user to separately fill in the values for IPv4 and IPv6 is flexible, but definitely not as simple as it could be. Absent explicit, official government documents indicating otherwise, matching the IP masking behavior widely used on the internet for years is simple and likely the best thing to do for those trying to comply with the government regulations.

Therefore, I think %{anon}a or %{mask}a is the direction we should go in lighttpd mod_accesslog.

Regarding masking requirements, while lighttpd mod_accesslog could be configurable, is there a strong reason not to do the same as is done by Google Analytics? https://support.google.com/analytics/answer/2763052

Same question as before. If you think lighttpd should mask 88 bits from IPv6 instead of masking 80 bits, then please provide more justification why that is better than doing the same as Google Analytics has done for the past decade. The link you provided which suggested masking 88 bits is not convincing since as you note, the document predates the GDPR and the link you provided is to the wayback machine, not to a live document.

@gstrauss
Copy link
Member

gstrauss commented Mar 8, 2023

I pushed an experimental commit to my dev branch using %{mask}a and extending my code posted above to handle v4mapped.

See top commit at https://git.lighttpd.net/lighttpd/lighttpd1.4/src/branch/personal/gstrauss/master
currently: https://git.lighttpd.net/lighttpd/lighttpd1.4/commit/1cf9d1c0899a65691fa47d04ac8d276a40df4943

Update: merged as 0ccf30c and 708211e

@pmconrad
Copy link
Author

Thanks. I'm closing this PR in favour of your change.

Your code lgtm. I found the line

for (int j=0; s[i] != ':' || ((j+=2) != 6 && s[i+1] != ':'); ++i) ;

a bit hard to understand, maybe this version would be more readable:

for (int colons=0; s[i] != ':' || ((++colons) != 3 && s[i+1] != ':'); ++i) ;

Also, for consistency you might want to add a force_assert in the V6 branch too to ensure there is either a :: or at least 3 colons in the address.

@pmconrad pmconrad closed this Mar 12, 2023
@gstrauss
Copy link
Member

for (int colons=0; s[i] != ':' || ((++colons) != 3 && s[i+1] != ':'); ++i) ;

Yes, slightly clearer, but does not fit in 80 chars, so I'll keep the single line as-is. This would fit in two lines:

int colons = 0;
while (s[i] != ':' || ((++colons) != 3 && s[i+1] != ':')) ++i ;

Were the code in question many lines, then a better named variable is much more desirable. The comments directly above the one line say this masks 80 bits. I could expand the comment to indicate how the existing one line is doing it, describing how 6 octets are 48-bits of a 128-bit address, masking 80 bits, but at some point comments should not describe that i += 1 says to add 1 to i. Comments should generally describe why or what, not how, unless the algo is complex. In this case, I prefer counting octets (or even nibbles) on the stringified IP.

Also, for consistency you might want to add a force_assert in the V6 branch too to ensure there is either a :: or at least 3 colons in the address.

The top of the func is commented: /* r->dst_addr_buf is normalized and valid string; operate on known valid */

The asserts are unnecessary given that the input has already been normalized. The force_assert() in the IPv4 case is pre-emptively providing the hint to coverity, and is not active during normal builds.


Thank you for the discussion.

lighttpd-git pushed a commit that referenced this pull request Mar 12, 2023
(thx pmconrad)

IPv4: mask final octet (8 bits) of address
IPv6: mask final 10 octets (80 bits) of address

x-ref:
  Enable partial masking of IP addresses in access logs
  #124
  IP masking in Universal Analytics
  https://support.google.com/analytics/answer/2763052

github: closes #124
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
2 participants