Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support new fingerprinting list format. #73

Conversation

englehardt
Copy link
Contributor

@englehardt englehardt commented Mar 11, 2019

I recommend merging #71 and #72 first. After that, I'll rebase this and the diff should be much easier to read.

In mozilla-services/shavar-prod-lists#56 Disconnect created a new top-level Fingerprinting category, which contains all fingerprinting domains irrespective of their tracking classification. This PR updates the list creation script to support compound and exclusion category filters, such as those specified in the new config (mozilla-services/shavar-list-creation-config#45).

I migrated the disconnect_categories option over to a categories option with slightly different syntax. In the new format, OR category filters should use the | delimiter and AND category filters should use ,. All of the old filters were OR (i.e., domains in category A or category B, etc), so these now use |. The new filters in mozilla-services/shavar-list-creation-config#45 use a combination.

Using the procedure outlined in #72, I verified that these new changes don't cause any meaningful difference in the domains written to each list when compared to the original script. The exception is the fingerprinting domains (expected), and the set of duplicate domains that I flagged in mozilla-services/shavar-prod-lists#56 (review).

Copy link
Contributor

@groovecoder groovecoder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks good. Just have a couple of questions/comments.

@@ -8,7 +8,7 @@ s3_bucket=mmc-shavar

[tracking-protection]
# The location of the Disconnect list
disconnect_categories=Advertising,Analytics,Social,Disconnect
categories=Advertising|Analytics|Social|Disconnect
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this categories value is the same as DEFAULT_DISCONNECT_LIST_CATEGORIES so it's not actually necessary, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's right. We can remove it if you'd like, but I think it helps demonstrate how the filters work (this is just the sample config file).

[tracking-protection-base-fingerprinting]
disconnect_tags=fingerprinting
disconnect_categories=Advertising,Analytics,Social,Disconnect
categories=Advertising|Analytics|Social|Content,Fingerprinting
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to make sure I'm understanding the definition syntax correctly ... this means that tracking-protection-base-fingerprinting will include all domains:
(

  • In Advertising
  • OR in Analytics
  • OR in Social
  • OR in Content
    )
  • AND in Fingerprinting ?

If so, should that Content category filter be Disconnect?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example, here's the output I saw when I ran this ...

------ tracking-protection-base-fingerprinting ------
-->blocklist: https://raw.githubusercontent.com/mozilla-services/shavar-prod-lists/689b0f563f00c5ef55935c7248b357a57e588672/disconnect-blacklist.json)
 * filter ['Advertising', 'Analytics', 'Social', 'Content'] matched 2398 domains
 * filter ['Fingerprinting'] matched 39 domains. Reduced set to 29 items.
 * removing 1 rule(s) due to DNT exceptions
Tracking protection(tracking-protection-base-fingerprinting): publishing 30 items; file size 980

------ tracking-protection-content-fingerprinting ------
-->blocklist: https://raw.githubusercontent.com/mozilla-services/shavar-prod-lists/689b0f563f00c5ef55935c7248b357a57e588672/disconnect-blacklist.json)
 * filter ['Fingerprinting'] matched 39 domains
 * filter ['Advertising', 'Analytics', 'Social', 'Content'] matched 2398 domains
 * exclusion filters removed 29 domains from output
 * removing 1 rule(s) due to DNT exceptions
Tracking protection(tracking-protection-content-fingerprinting): publishing 11 items; file size 372

Is that right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have the correct understanding of the filter syntax. If you see a better way to express it please let me know! That was the best I could come up with.

The absence of the Disconnect category is intentional. It can be added back in but the parser silently drops it by default if it was initialized with a remapping file.

The way we were using the category previously was confusing. It was used more as a boolean flag to indicate whether or not to include remapped domains from the Disconnect category into the other categories specified by disconnect_categories. Take disconnect_categories=Social,Disconnect -- this wasn't used specify that we should keep all domains from the Disconnect and Social categories (as this would include non-Social domains from the Disconnect category). Rather, it meant to keep only those domains that map into the Social category from the Disconnect category. I thought about replacing it with a remap_disconnect=True/False config parameter which would then control whether to initialize the parser with the remapping file, but it's always true so I just hard coded it. I'm happy to add the remap_disconnect config option in and remove the Disconnect category from all of the new filters if you think it will help make it clearer. I can also just remove all of the Disconnect tags and make a note that the category is automatically remapped for all lists in a comment.

That is the expected output. I am unsure whether we should include the Content category domains in the base list, or restrict them to the content list (confusing naming of the lists on my part). Right now the split between the two lists is "fingerprinting trackers" and "fingerprinting non-trackers". If we move the Content category to the content list it becomes "fingerprinting trackers that we know don't break sites" and "fingerprinting trackers that we know break sites and fingerprinting non-trackers". I decided to go with the simpler version and see where we end up :).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like we could file a follow-up issue to re-name Content to Strict in the future? That seems to be the real way we're using the items in the Content category.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in #75

@groovecoder groovecoder merged commit 2285c0d into mozilla-services:list_parser_cleanup Mar 13, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants