New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support new fingerprinting list format. #73
Support new fingerprinting list format. #73
Conversation
5dd1e21
to
5e0a647
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code looks good. Just have a couple of questions/comments.
@@ -8,7 +8,7 @@ s3_bucket=mmc-shavar | |||
|
|||
[tracking-protection] | |||
# The location of the Disconnect list | |||
disconnect_categories=Advertising,Analytics,Social,Disconnect | |||
categories=Advertising|Analytics|Social|Disconnect |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this categories
value is the same as DEFAULT_DISCONNECT_LIST_CATEGORIES
so it's not actually necessary, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's right. We can remove it if you'd like, but I think it helps demonstrate how the filters work (this is just the sample config file).
[tracking-protection-base-fingerprinting] | ||
disconnect_tags=fingerprinting | ||
disconnect_categories=Advertising,Analytics,Social,Disconnect | ||
categories=Advertising|Analytics|Social|Content,Fingerprinting |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to make sure I'm understanding the definition syntax correctly ... this means that tracking-protection-base-fingerprinting
will include all domains:
(
- In
Advertising
- OR in
Analytics
- OR in
Social
- OR in
Content
) - AND in
Fingerprinting
?
If so, should that Content
category filter be Disconnect
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For example, here's the output I saw when I ran this ...
------ tracking-protection-base-fingerprinting ------
-->blocklist: https://raw.githubusercontent.com/mozilla-services/shavar-prod-lists/689b0f563f00c5ef55935c7248b357a57e588672/disconnect-blacklist.json)
* filter ['Advertising', 'Analytics', 'Social', 'Content'] matched 2398 domains
* filter ['Fingerprinting'] matched 39 domains. Reduced set to 29 items.
* removing 1 rule(s) due to DNT exceptions
Tracking protection(tracking-protection-base-fingerprinting): publishing 30 items; file size 980
------ tracking-protection-content-fingerprinting ------
-->blocklist: https://raw.githubusercontent.com/mozilla-services/shavar-prod-lists/689b0f563f00c5ef55935c7248b357a57e588672/disconnect-blacklist.json)
* filter ['Fingerprinting'] matched 39 domains
* filter ['Advertising', 'Analytics', 'Social', 'Content'] matched 2398 domains
* exclusion filters removed 29 domains from output
* removing 1 rule(s) due to DNT exceptions
Tracking protection(tracking-protection-content-fingerprinting): publishing 11 items; file size 372
Is that right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You have the correct understanding of the filter syntax. If you see a better way to express it please let me know! That was the best I could come up with.
The absence of the Disconnect category is intentional. It can be added back in but the parser silently drops it by default if it was initialized with a remapping file.
The way we were using the category previously was confusing. It was used more as a boolean flag to indicate whether or not to include remapped domains from the Disconnect
category into the other categories specified by disconnect_categories
. Take disconnect_categories=Social,Disconnect
-- this wasn't used specify that we should keep all domains from the Disconnect
and Social
categories (as this would include non-Social
domains from the Disconnect
category). Rather, it meant to keep only those domains that map into the Social
category from the Disconnect
category. I thought about replacing it with a remap_disconnect=True/False
config parameter which would then control whether to initialize the parser with the remapping file, but it's always true so I just hard coded it. I'm happy to add the remap_disconnect
config option in and remove the Disconnect
category from all of the new filters if you think it will help make it clearer. I can also just remove all of the Disconnect
tags and make a note that the category is automatically remapped for all lists in a comment.
That is the expected output. I am unsure whether we should include the Content
category domains in the base
list, or restrict them to the content
list (confusing naming of the lists on my part). Right now the split between the two lists is "fingerprinting trackers" and "fingerprinting non-trackers". If we move the Content category to the content
list it becomes "fingerprinting trackers that we know don't break sites" and "fingerprinting trackers that we know break sites and fingerprinting non-trackers". I decided to go with the simpler version and see where we end up :).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like we could file a follow-up issue to re-name Content
to Strict
in the future? That seems to be the real way we're using the items in the Content
category.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in #75
I recommend merging #71 and #72 first. After that, I'll rebase this and the diff should be much easier to read.
In mozilla-services/shavar-prod-lists#56 Disconnect created a new top-level
Fingerprinting
category, which contains all fingerprinting domains irrespective of their tracking classification. This PR updates the list creation script to support compound and exclusion category filters, such as those specified in the new config (mozilla-services/shavar-list-creation-config#45).I migrated the
disconnect_categories
option over to acategories
option with slightly different syntax. In the new format,OR
category filters should use the|
delimiter andAND
category filters should use,
. All of the old filters wereOR
(i.e., domains in category A or category B, etc), so these now use|
. The new filters in mozilla-services/shavar-list-creation-config#45 use a combination.Using the procedure outlined in #72, I verified that these new changes don't cause any meaningful difference in the domains written to each list when compared to the original script. The exception is the fingerprinting domains (expected), and the set of duplicate domains that I flagged in mozilla-services/shavar-prod-lists#56 (review).