New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent network request to Google analytics aka Blacklist functionality #20

Merged
merged 3 commits into from Oct 26, 2017

Conversation

Projects
None yet
2 participants
@stereobooster
Collaborator

stereobooster commented Oct 21, 2017

Fix for #19

@peterbe

It's great start! I like what you want to achieve with this.
I'm wondering if there's perhaps a better way to lump all those URL matchings into its own little lib function. For example, that whole section about .png, .jpg etc. could probably be improved if we can also inspect the URL's content-type instead of just looking at the file extension. (Not to mention, the file extension matching should probably be case-insensitive).

For this to be mergeable we need to also make this new option possible in the cli function.

@@ -0,0 +1,12 @@
const Url = require("url");

This comment has been minimized.

@peterbe

peterbe Oct 23, 2017

Owner

Curious that Prettier didn't catch this. It should be single quotes.
Perhaps the config for Prettier is wrong.

const Url = require("url");
const blacklist = {
'www.google-analytics.com': true,

This comment has been minimized.

@peterbe

peterbe Oct 23, 2017

Owner

When is the value ever false?
Why not make this an array or a set? Sets are as fast to look up as dictionaries.

This comment has been minimized.

@stereobooster

stereobooster Oct 23, 2017

Collaborator

Ah yes, I forgot about new ES features. Set is what I was looking for

src/run.js Outdated
request.abort()
} else if (stylesheetAstObjects[request.url]) {
// no point downloading this again
request.abort()
} else if (isInBlacklist(request.url)) {

This comment has been minimized.

@peterbe

peterbe Oct 23, 2017

Owner

This should be...

} else if (isInBlacklist(options.blacklist, request.url)) {

That way, the "user/client" can control the blacklist. We can certainly maintain a little default list of popular URLs like www.google-analytics.com. But if we do maintain a list it should be lifted out and put somewhere else. like in this src/run.js file (near the top and call it BLACKLIST not blacklist)

@@ -0,0 +1,12 @@
const Url = require("url");

This comment has been minimized.

@peterbe

peterbe Oct 23, 2017

Owner

Don't rename the module name as if it was a class. Call it url here and rename the argument in the isInBlacklist function to uri or url_.

'b.tiles.mapbox.com': true,
}
const isInBlacklist = url => blacklist[Url.parse(url).hostname]

This comment has been minimized.

@peterbe

peterbe Oct 23, 2017

Owner

I don't feel comfortable with this. It feels too "brutal". I don't even know what api.mapbox.com is but imagine it's a REST endpoint that gives the web page the JSON it needs to paint the screen in a certain way onload. Then blacklisting it would prevent the DOM becoming what it becomes for normal browser users.

First of all, let's minimize the list of default domains in the list. Second, I wonder if we should do more to filter than just look at the domain. What if you want some assets from a certain domain but skip all others. For example, you might to blacklist a.tiles.mapbox.com/*.js.

This comment has been minimized.

@peterbe

peterbe Oct 23, 2017

Owner

The name isn't ideal. That fact that it's a list internally is not important externally. The function can be called isBlacklisted or blacklistedUrl or perhaps even more user friendly is skippableUrl. What do you think? To me the word "blacklisted" means something very permanent. Like things that are set in stone and unlikely to change other than more potentially being added.

This comment has been minimized.

@stereobooster

stereobooster Oct 24, 2017

Collaborator
  1. Yes agree api.mapbox.com is too specific.
  2. a.tiles.mapbox.com provides tiles which are either images or binary vector files (protobuf file format). It would make sense to blacklist *.tiles.mapbox.com
src/run.js Outdated
@@ -36,13 +37,14 @@ const minimalcss = async options => {
if (/data:image\//.test(request.url)) {
// don't need to download those
request.abort()
} else if (/\.(png|jpg|jpeg$)/.test(request.url)) {
} else if (/\.(png|jpg|jpeg|webp$)/.test(request.url)) {

This comment has been minimized.

@peterbe

peterbe Oct 23, 2017

Owner

We probably want to skip .gif too. And I bet there are a bunch others we've forgotten too.

This comment has been minimized.

@stereobooster

stereobooster Oct 23, 2017

Collaborator

well let's start with something and will see. I will add gif

@stereobooster

This comment has been minimized.

Collaborator

stereobooster commented Oct 23, 2017

How about this version?

@stereobooster

This comment has been minimized.

Collaborator

stereobooster commented Oct 24, 2017

  1. What if CLI will skip all third-party requests by default (except css obviously)?
  2. Or what if lib will have callback, but CLI will have default blacklist?

@stereobooster stereobooster changed the title from Blacklist functionality to Prevent network request to Google analytics aka Blacklist functionality Oct 24, 2017

src/run.js Outdated
@@ -36,13 +36,14 @@ const minimalcss = async options => {
if (/data:image\//.test(request.url)) {
// don't need to download those
request.abort()
} else if (/\.(png|jpg|jpeg$)/.test(request.url)) {
} else if (/\.(png|jpg|jpeg|webp|gif$)/.test(request.url)) {

This comment has been minimized.

@peterbe

peterbe Oct 25, 2017

Owner

There's a bug in this line. It was there before you made it better :)

#24

I'm going to try to fix that now. Afterwards, can you rebase up master?

This comment has been minimized.

@peterbe

peterbe Oct 25, 2017

Owner

If you remove your change and the rebase against master the conflict will go away.

src/run.js Outdated
request.abort()
} else if (stylesheetAstObjects[request.url]) {
// no point downloading this again
request.abort()
} else if (options.skippable && options.skippable(request)) {

This comment has been minimized.

@peterbe

peterbe Oct 25, 2017

Owner

Excellent! I think this is a great start. But it's currently undocumented and that's almost more important than the functionality.

Can you update the README about this being a possible options key and add an example (perhaps one with your mapbox URLs).
In fact, under the "API" section of the README you can just mention it's a callable that gets a Request object and expects to return a boolean if defined.

Then add another section where you describe an example of a more elaborate version. Perhaps an example that uses those mapbox.com URLs. Also, it would be cool if the example demonstrates how you can also skip the URL based on the Content-Type header.

This comment has been minimized.

@stereobooster

stereobooster Oct 25, 2017

Collaborator

but this is request, not response. I suppose it does not have Content-Type header.

This comment has been minimized.

@peterbe

peterbe Oct 25, 2017

Owner

You're so totally right. Sorry, I rushed that comment. Ignore that bit about content-type.
However, your documentation should point to Request on this URL: https://github.com/GoogleChrome/puppeteer/blob/master/docs/api.md#class-request

@peterbe

This comment has been minimized.

Owner

peterbe commented Oct 25, 2017

What if CLI will skip all third-party requests by default (except css obviously)?

First of all, the solution we're talking about is blocking requests independent of it being the CLI or used as a module.
We already do this to some degree. The assumption is that you don't ever need the images to get a complete DOM. But it makes me nervous actually. What if someone has a piece of code like this:

window.onload = function() {
   var img = document.createElement('img');
   img.src = '/image.png';
   img.onload = function() {
       // Only insert this after the image has been loaded
       var caption = document.createElement('div');
       caption.classList.add('caption');
       document.getElementById('#foo').appendChild(caption);
   }
   document.getElementById('#foo').appendChild(img);
};

Our default blocking of images would affect how the DOM solidifies when we do the minimal CSS analysis.

Perhaps the solution to this is an option (also exposed in the cli) where you can optionally load all images.

@peterbe

This comment has been minimized.

Owner

peterbe commented Oct 25, 2017

Or what if lib will have callback, but CLI will have default blacklist?

I disagree. The CLI should just be about executing the main code from the command line.

@stereobooster

This comment has been minimized.

Collaborator

stereobooster commented Oct 25, 2017

done

@peterbe peterbe merged commit 8a116c9 into peterbe:master Oct 26, 2017

@stereobooster

This comment has been minimized.

Collaborator

stereobooster commented Oct 26, 2017

Can you bump this as npm package too? 🙏

@peterbe

This comment has been minimized.

Owner

peterbe commented Oct 26, 2017

Will do. That an the new image loading override option.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment