Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds data-exclude support and localStorage.plausible_ignore support #489

Merged
merged 17 commits into from
Jan 22, 2021

Conversation

Vigasaurus
Copy link
Contributor

@Vigasaurus Vigasaurus commented Dec 19, 2020

Changes

Adds support in the plausible.js script for exlcuding specific routes from couting towards events, as well as preventing all plausible events if a localStorage flag plausible_ignore is present.

@ukutaht This isn't technically done yet, I wanted your feedback on methodology and usability before finalizing it. I also am mostly unsure what the p.js script is for/when its used (I found no meaningful reference to it) - so I'm unsure if my changes need to be added to it.

Once we can agree on the format/the method - I can write up the docs/changelog as needed.

Basic premise is that we have two new exclusions on the client-side, similar to localhost or phantom.
First, localStorage.plausible_ignore - if this exists at all (i.e. is truthy in its string representation), all plausible events are prevented.
Second, if a string is passed in to data-exclude with the format data-exclude="/tou, /rule/*, /how-to-play, /*/admin, */secret, */priv/*" pages that match those routes will be excluded. In this example, we read those routes as the following table. I decided to use a format like this in pseudo-regex to make it a bit more usable/easy to understand for an end user - lmk if you'd prefer just a straight regex or something.

data-exclude input parsed RegEx
/tou /^\/tou\/?$/
/rule/* /^\/rule\/.*\/?$/
/how-to-play /^\/how-to-play\/?$/
/*/admin /^\/.*\/admin\/?$/
*/secret /^.*\/secret\/?$/
*/priv/* /^.*\/priv\/.*\/?$/

Also, a sidenote - it seems like the plausible script has creeped above 1KB - sitting at 1268B (1.24 KiB = 1.27 KB) on master - and with this update it creeps up to 1591B (1.55 KiB = 1.59 KB). I don't think it super matters, since we are adding features to get to this size - but I just figure some branding/copy on the landing might need some updating to match it.

Tests

  • This PR does not require tests

Changelog

  • Entry has been added to changelog

Documentation

@ukutaht
Copy link
Contributor

ukutaht commented Dec 21, 2020

Looks good overall @Vigasaurus

p.js is deprecated but kept around to support some early users. No need to worry about it. If someone using that script wants access to new features, they are welcome to use the new plausible.js script.

First, localStorage.plausible_ignore - if this exists at all (i.e. is truthy in its string representation), all plausible events are prevented.

I haven't tested yet but my only worry is someone doing localStorage.plausible_ignore = false to disable it and getting confused because that evaluates to a truthy value in our script? Not sure how localStorage serialization works but setting localStorage.plausible_ignore = false should make the script count again.

I decided to use a format like this in pseudo-regex to make it a bit more usable/easy to understand for an end user - lmk if you'd prefer just a straight regex or something.

I think it's great, much preferred to regular Regex syntax and I think using stars in URLs is very widely accepted/supported.

Also, a sidenote - it seems like the plausible script has creeped above 1KB - sitting at 1268B (1.24 KiB = 1.27 KB) on master - and with this update it creeps up to 1591B (1.55 KiB = 1.59 KB). I don't think it super matters, since we are adding features to get to this size - but I just figure some branding/copy on the landing might need some updating to match it.

This is the raw size but we advertise the size with Brotli compression which is what you actually get over the wire. With compression it's probably not over 1KB yet.

Since adding page exclusions requires one to re-deploy their script anyways, I think we should add it as an extra script option that doesn't get included in the standard script. See how I've done hashMode and outboundLinks.

@Vigasaurus
Copy link
Contributor Author

Coolio, thanks for the detailed feedback on all the PRs :D I'll get going on all the comments some time today or tomorrow 😀

I'm unsure about the formatting decision in tracker.ex - lmk.
This will certainly have conflicts with my other PRs related to the tracking scripts right now, I'll make one extra PR after both are done to ensure they're consolidated into the compiled scripts.
@Vigasaurus Vigasaurus marked this pull request as ready for review December 22, 2020 00:24
@Vigasaurus
Copy link
Contributor Author

Vigasaurus commented Dec 22, 2020

@ukutaht This should be ready to go - I'm happy with most everything here, but I'm unsure about the formatting in tracker.ex and the very long array in the map - do lmk if you'd prefer some different formatting there.

I'll get started on the docs update too.

@ukutaht
Copy link
Contributor

ukutaht commented Dec 22, 2020

Will review later today but wanted to jot down a thought so I'll remember to test later on.

What's the behaviour of the pseudo-regex with trailing slashes and subdirectories? What I mean is:

Exclusion rule Actual URL Excluded?
/page /page Yes
/page /page/ Yes (?)
/page/ /page/ Yes
/page/ /page No (?)
/page /page/nested No
/page/ /page/nested No
/page/* /page/nested Yes
/page/* /page No (?)
/page/* /page/ No (?)

The 'Excluded' column is what I intuitively expect to happen, question marks mean that I am not 100% certain this is the expected/correct behaviour. However it is implemented, it should be well documented and as easy to understand as possible.

@Vigasaurus
Copy link
Contributor Author

Vigasaurus commented Dec 22, 2020

Yeah this is definitely a sticking point for me too, as far as documentation is concerned. I feel as though a technical explanation of what the regex becomes would be best for power-users, but would likely confuse others. I think I kept it pretty well explained in plausible/docs#45 but it could definitely use more detail if you think its unclear. (Maybe this table you've laid out can help actually)

Honestly, I think the biggest confusion point will be trailing slashes - so maybe a note on there saying Don't put any trailing slashes in the rules! They happen automatically! or something to that effect could be beneficial. Or we could swap the trailing slash in the rule if it exists to always remove a provided trailing slash and swap it for our own optional one.

Exclusion rule Actual URL Excluded? Actually Excluded?
/page /page Yes Yes
/page /page/ Yes (?) Yes
/page/ /page/ Yes Yes
/page/ /page No (?) No
/page /page/nested No No
/page/ /page/nested No No
/page/* /page/nested Yes Yes
/page/* /page No (?) No
/page/* /page/ No (?) Yes (!)

Same table as yours, last column is whether or not the current implementation excludes it

So most of them work how you'd expect - the last one is the only one which doesn't and that's mostly just a decision to be made here. Do we allow zero-length *'s or no - I'd say yes, but up to you (it's a single-character fix if not).
Generally, I'd say that the trailing slash be fully optional like it is right now is best - since some browsers always add it, some always strip it, and others leave it the same - so I think making it always an option makes it most likely to match when it should, but again, up to you.

The way I tested these was by just extracting the regex constructor from the script and running it manually. Here's what it became if you want to try it at all.

"actual_pathname_to_test".match(new RegExp('^' + "excluded-string".trim().replace(/\*/g, '.*') + '\/?$'))

so I tested your examples by doing for example

"/page/nested".match(new RegExp('^' + "/page/*".trim().replace(/\*/g, '.*') + '\/?$'))

@Vigasaurus
Copy link
Contributor Author

Vigasaurus commented Dec 29, 2020

I'm actually going to make a change here (tomorrow, that's why I'm moving to draft) to make docs and explaining it simpler, in that the * in the provided pattern can't match a /, such that /page/* will only match /page/nested and not /page/nested/doublenested/etc and similarly /how-to-* will only match /how-to-play and /how-to-cook but not /how-to-cook/recipe. From some light research I've found this is a more consistent implementation with what stars in URLs usually mean elsewhere.

If this format all works well etc, I'll eventually add this same matching to page goals if wanted (for stuff like WooCommerce's /order/confirmation/<order_id>

@Vigasaurus Vigasaurus marked this pull request as draft December 29, 2020 07:51
@Vigasaurus
Copy link
Contributor Author

This is now updated to not let the * match /s or whitespace. I'm updating the docs setup still, and will get some updated examples here as well. Once both are done I'll move this off draft, but it's otherwise done.

@Vigasaurus
Copy link
Contributor Author

Honestly, I think the biggest confusion point will be trailing slashes

I added a note in the docs saying that trailing slashes are added automatically and are not needed in date-exclude.

Double Nesting

I've changed it such that a * in data-exclude cannot match a / or a whitespace character in the pathname, as such paths such as /admin/* now only matches /admin/1 and similar, but not /admin/1/edit nor /admin

the last one is the only one which doesn't and that's mostly just a decision to be made here. Do we allow zero-length *'s or no - I'd say yes, but up to you (it's a single-character fix if not).

More research found that zero-length asterisks in URLs in this format are non-standard, so I've changed them to be 1-length or greater.

Updated page path testing regex:

"actual_pathname_to_test".match(new RegExp('^' + "split-excluded-string".trim().replace(/\*/g, '[^\\s\/]+') + '\/?$'))

Overall, I added much more detail in plausible/docs#45, and an option for tables in the docs.

Please do let me know your thoughts :D

@Vigasaurus Vigasaurus marked this pull request as ready for review January 4, 2021 08:06
@ukutaht
Copy link
Contributor

ukutaht commented Jan 20, 2021

Again sorry about the super long wait on this :)

Overall this is in a great place. Good job on the documentation side as well.

I've changed it such that a * in data-exclude cannot match a / or a whitespace character in the pathname, as such paths such as /admin/* now only matches /admin/1 and similar, but not /admin/1/edit nor /admin

Sounds good. I think we should use bash glob as the main inspiration for the syntax here. It's well known. In glob, a regular asterisk does not match a forward slash as you've done here. But I think we should add a way to exclude a whole subdirectory.

In glob, this can be done with /admin/**. What do you think about adding ** as a way to match any character, including forward slash?

Also in glob both single and double asterisks do match a zero-length character. So in glob /page* matches with /page whereas in this implementation it doesn't seem to. Is there a good reason to require a character when * is specified?

If this format all works well etc, I'll eventually add this same matching to page goals if wanted (for stuff like WooCommerce's /order/confirmation/<order_id>

Love it. The same syntax could also be used to roll up pageviews in regular reporting (i.e. in the Top Pages report). I know many people want to group pageviews with similar URLs.

@Vigasaurus
Copy link
Contributor Author

Love the ideas. I think the double glob mechanic would definitely be useful too (and should be easy to implement, since it's just the old implementation tacked on).

I have no particular issues with either implementation as far as zero-length goes - I think it just comes down to consistency and documentation.

I'll make the changes 👍

  • Zero Length is allowed
  • Double Glob
  • Docs update for the above

Copy link
Contributor

@ukutaht ukutaht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, requested a few changes. Once they're done I will merge this

lib/plausible_web/plugs/tracker.ex Show resolved Hide resolved
tracker/src/plausible.js Outdated Show resolved Hide resolved
tracker/src/plausible.js Outdated Show resolved Hide resolved
@ukutaht ukutaht merged commit fb7a3fe into plausible:master Jan 22, 2021
@metmarkosaric
Copy link
Contributor

metmarkosaric commented Jan 26, 2021

Got a question on Twitter: they're using the same script on both the main domain and subdomain but want to exclude the / so the home pages doesn't get counted combined. Any thoughts?

https://twitter.com/tim_nolet/status/1354040020018077698

@Vigasaurus
Copy link
Contributor Author

Vigasaurus commented Jan 26, 2021

I'm a tiny bit confused by the use case/end goal, but can confirm that "/**" would exclude any path

If you have more details about what they're trying to exclude/the subdomain structure, I'm sure this could support it to an extent

@metmarkosaric
Copy link
Contributor

thanks @Vigasaurus! i've asked them to clarify. btw if you exclude a specific page does that mean that any custom goals on that specific page won't be counting?

@Vigasaurus
Copy link
Contributor Author

Vigasaurus commented Jan 26, 2021

Nope, the exclusion only happens for triggering pageviews - outbound links and other custom events still fire.

@Vigasaurus
Copy link
Contributor Author

Vigasaurus commented Jan 26, 2021

Also reading more of the Twitter thread, the use case makes sense to me. I think excluding "/**" is definitely the way to do it for them, if they want all the marketing pages on the www or none subdomain to track normally, and then in app subdomain they run the full exclusion and fire custom events as needed.

@metmarkosaric
Copy link
Contributor

thanks! so by excluding "/**" no visitors / pageviews will be counted both from the main domain and from the subdomain?

@Vigasaurus
Copy link
Contributor Author

Vigasaurus commented Jan 26, 2021

Well, I assume they have separate html outputs for each subdomain, so they'd likely only put the exclude on one of the subs' html - the main domain's html (the marketing site) should just not have any exclusion

@metmarkosaric
Copy link
Contributor

ahh yes! i see. makes sense to me now. thanks!

@Vigasaurus
Copy link
Contributor Author

Sweet! Feel free to ping me here or on the Twitter thread if they have other concerns/questions :D

@metmarkosaric
Copy link
Contributor

they responded with more details so i think it's more clear now: https://twitter.com/tim_nolet/status/1354048423742341120

last question: if they do as described will the referral source still show for custom event conversions?

@Vigasaurus
Copy link
Contributor Author

Mm, I don't think so, because from my understanding the referral source for custom events comes by linking the entry page+referrer for the entire session to the event session, and if there are no pageviews there are no entry pages or referrers. But to be honest I haven't looked a whole lot at how the referrer and sessions stuff works, so Uku would know better/more concretely.

@metmarkosaric
Copy link
Contributor

ok cool, thanks!

@ukutaht
Copy link
Contributor

ukutaht commented Jan 26, 2021

I think the referrer stuff and everything should work normally. In their setup the normal funnel seems to:

  1. Generate some pageviews on www.domain.com
  2. Signup on app.domain.com

As long as they're sending the stats to the same dashboard, the session should stay intact and the referrer source for the initial pageview on www.domain.com will be visible in the dashboard.

@metmarkosaric
Copy link
Contributor

perfect! thanks for confirming!

@Vigasaurus
Copy link
Contributor Author

Oh yeah that makes sense! I guess the generated user hash would line up the same too since the data-domain would be the same.

I guess the break down would happen if someone was referred directly to app.domain.com/signup, and that wouldn't get any referrer on load, but that's probably a bit unlikely/could be planned around from a subdomain design standpoint

oliver-kriska pushed a commit to payout-one/analytics that referenced this pull request Jan 26, 2021
…lausible#489)

* Adds data-exclude support and plausible_ignore support

* Splits exclusion into separate script option

* localStorage parsing upgrades

* Additional script type additions

I'm unsure about the formatting decision in tracker.ex - lmk.

* Adds new compiled files

This will certainly have conflicts with my other PRs related to the tracking scripts right now, I'll make one extra PR after both are done to ensure they're consolidated into the compiled scripts.

* Moves localStorage blocker out of special script

* Changelog

* Second thoughts on localStorage exclusion

* Updates `*` to not match `/` or whitespace

* Fix formatting

* Removes zero-length asterisks

* Adds support for double glob, zero-length replacements

* Update to reduce size+allow localStorage exclude

Co-authored-by: Uku Taht <Uku.taht@gmail.com>
@ukutaht
Copy link
Contributor

ukutaht commented Mar 19, 2021

A (potential) bug raised by one of our clients:
Exclusion rule: /l/*
Actual page path: /l/some-page-here

Expected to match but this is the result:

"/l/testing".match(new RegExp("/l/*".trim().replace(/\*\*/g, '.*').replace(/[^\.]\*/g, '[^\\s\/]*') + '\/?$'))
<- null

@Vigasaurus any idea what's going on here?

@Vigasaurus
Copy link
Contributor Author

@ukutaht Ah yeah that's actually a pretty big bug, that I caught when I was doing pageview goal globs, but completely forgot to update in the tracker script. Notice here the regex for the second replace was appropriately changed to a negative lookbehind, instead of the negative replacement, but in the tracker I didn't make this change. The appropriate new version in the tracker would be:

"/l/testing".match(new RegExp("/l/*".trim().replace(/\*\*/g, '.*').replace(/(?<!\.)\*/g, '[^\\s\/]*') + '\/?$'))
<- Array [ "/l/testing" ]

which is the correct result, notice the swap from .replace(/[^\.]\*/g, '[^\\s\/]*') to .replace(/(?<!\.)\*/g, '[^\\s\/]*'). Will PR shortly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants