Add blacklisting based on hostname #972

robingustafsson · 2019-03-21T13:10:26Z

Similar to how k6 supports blacklisting of IP ranges it should allow blacklisting of hostnames, including wildcards.

I propose we add this functionality as follows:

If the hostname of a requested URL matches an entry in the blacklist no DNS, TCP or HTTP traffic is generated to that host whatsoever, and the necessary error_code and error tags should be emitted and a JS BlacklistedHotname error will be thrown, that unless caught will exit the current VU iteration.
A new CLI option --blacklist-hostname HOSTNAME that can be specified multiple times, and where HOSTNAME can contain * wildcards.
A new env var option K6_BLACKLIST_HOSTNAMES with a comma separated list of hostnames to block, again wildcards allowed.
A new JS option blacklistHostnames:

export let options = {
    blacklistHostnames: ["example.com", "*.example.com"]
};

Pointers for implementation:

See how the blacklisting of IP blocks is implemented: https://github.com/loadimpact/k6/blob/master/lib/netext/dialer.go#L69

The text was updated successfully, but these errors were encountered:

na-- · 2019-03-21T14:45:26Z

Shouldn't we just use regular expressions instead of wildcards? It will be both more flexible, and easier to implement... And if we want to expose this in a user-friendly UI, converting wildcards to regular expressions is also trivial, while the reverse is impossible...

bookmoons · 2019-06-11T14:56:13Z

I'm looking into this.

@na-- I can see you want to ensure this doesn't impede valuable features.

What would you both think about an expanded syntax?

Wrap a specifier in // to define with a regex, in golang syntax. /(ruby|diamond)\.example\.com/
An unwrapped specifier allows simple * wildcards. *.example.com
An unwrapped specifier without wildcards is a static host. example.com

This can be accepted through all option channels. They can all be translated to regex internally so there's a single match logic.

--blacklist-hostname "/(ruby|diamond)\.example\.com/"
--blacklist-hostname "*.example.com"
--blacklist-hostname example.com

K6_BLACKLIST_HOSTNAMES="/(ruby|diamond)\.example\.com/"
K6_BLACKLIST_HOSTNAMES=*.example.com
K6_BLACKLIST_HOSTNAMES=example.com

blacklistHostnames: [ "/(ruby|diamond)\\.example\\.com/" ]
blacklistHostnames: [ "*.example.com" ]
blacklistHostnames: [ "example.com" ]

na-- · 2019-06-12T07:35:07Z

@bookmoons, yeah, this approach seems very good to me! The simple things are kept simple, but you can make complicated blocks as well.

krashanoff · 2020-06-23T03:58:32Z

I noticed that this issue has been stagnant for a while. Mind if I try taking it on?

na-- · 2020-06-23T06:59:04Z

@krashanoff, sure, go ahead!

krashanoff · 2020-06-23T21:09:46Z

@na-- Getting on this right now, just a quick question: should we enforce valid hostname patterns, or should that be left to the user? Also, should blacklisted hosts throw the same error code as blacklisted IPs, or use a new one?

na-- · 2020-06-24T07:50:50Z

@krashanoff, I don't think you can enforce valid hostname patterns, if you go with the approach suggested in #972 (comment). I'd be tricky when you have --blacklist-hostname "/[\w]+\.example\.com/"" to say for sure if it always matches a valid domain.

That said, I just realized that implementing this with regexes will probably incur some potentially serious performance issues for large lists, similar to the issues of the current naive implementation of IP blocking (#1256). If we allow regexes, I can't think of an easy implementation approach that is not just iterating over the list of blacklisted hostnames and checking against each... At that point we might as well extend this and call it "blacklisted URLs" and allow blocking by any part of the URL...

Whereas if we allow only plain hostnames and leaf * wildcards, there's a somewhat easy way to efficiently implement it, by reversing the domains and constructing a map (flat or nested) or a trie, so that the lookup will be much faster... This only makes sense if we expect to have large blacklists often though. For a few items, iterating over them will probably be faster...

And we might also want to use a different option name than --blacklist-hostname... While I don't think the etymology of the word is problematic (see also this), in the context of k6, --block-hostname, --skip-hostname and --ignore-hostname are more than viable alternatives. Personally, I think I prefer --ignore-hostname, since it conveys the intention behind the option even more clearly to me...

In short, sorry for the delay @krashanoff, but I think there's some further evaluation of this issue needed, before you can start implementing it. cc @robingustafsson @mstoykov @imiric, what are your thoughts about ⬆️ ?

imiric · 2020-06-24T09:28:47Z

Agree with most of what you said @na--.

We shouldn't bother with validating hostnames from this list, except for maybe the minimum of [0-9A-Za-z\.-] before starting the test. Otherwise we'd have to keep this validation up to date, which is an exercise in frustration to say the least.
Regexes will be expensive during the test, but a quick ASCII sanity check before building the trie would likely be fine.
--ignore-hostname sounds good, though --block-hostname will probably be more familiar to everyone. And yeah, let's avoid using "blacklist" given the current climate. :)

mstoykov · 2020-06-24T10:10:12Z

I like the idea of the trie and reversing it ( I remember seeing this done for exactly this, somewhere).
So we propose that:

the only thing you can do as a "wildcard" will be to have * as a whole label within the hostname such as *.example.com
Or we want to support *something.example.com so it catches whateversomething.example.com.
How about something*.example.com catching somethingelse.example.com
Is something.*.example.com meaning that else.example.com is fine but something.else.example.com isn't
Also does *.example.com mean that something.else.example.com is fine or will it get blocked? or will that need to be **.example.com for example?

Given that k6 won't just ignore them but actually block them I would argue for --block-hostname
Not really certain if checking the input is really worth it ... but if it comes naturally I am fine with it.
I do think that the error should be a different one though :D

na-- · 2020-06-24T10:20:11Z

You've convinced me about the option name, so 👍 for --block-hostname / K6_BLOCK_HOSTNAMES / blockHostnames from me as well.

In regards to the wildcard, in the initial version of this, I think we should only allow wildcards in the beginning of the value, i.e. *something, so that when you reverse the string, the wildcard will be the leaf of the trie. And that wildcard should block parts of the subdomain and whole chains of subdomains afterwards, i.e. --block-hostname="*test.example.com" should block all of the following domains:

test.example.com
thetest.example.com
sub.test.example.com
sub.sub.test.example.com

This seems relatively easy to implement if you treat domains as reversed strings, while also fairly powerful and devoid of any surprise behavior to users, I think...

robingustafsson · 2020-06-24T12:46:41Z

Agree with what's been said. We definitely need wildcards because the block lists would otherwise potentially be even larger than they'll already be (eg. k6 Cloud). Only allowing the wildcard in the beginning is fine.

hynd · 2020-06-24T15:36:55Z

Just throwing something out there... In the majority of my tests, i usually have the opposite need - to only allow specific hosts (ie; my own domain) and deny all the others (which are invariably third parties who i don't want to annoy).
Would a companion option (--allow-hostname?) using a similar mechanism to this, or even a single option that covers both (--filter-hostname?) make sense?

krashanoff · 2020-06-24T20:16:03Z

Thank you all for taking the time to write back! This is my first time contributing to open source so it's been a good experience. Just summing up what I gleaned from the above conversation, it seems like this is the consensus on implementation:

The option will be named --block-hostname, K6_BLOCK_HOSTNAMES, and blockHostnames.
Blocked hostname errors will have a new code thrown.
Don't bother validating blocked hostname strings past a short check for [0-9A-Za-z\.-\*].
Wildcards are only permitted at the start of a value. Something like foo.*.bar is not permitted, but something like *.foo.bar is.
Blocked hostname strings will be stored backwards in a trie.

The only thing that is really left up to debate at this point is whether regexes should be permitted. I personally think it would be nice to have, but as @na-- mentioned, there's big issues with performance and implementation. I was tooling around with using Go's regexp for blacklisting last night, and found that things get nasty fast if one mirrors the current implementation pattern of IP blacklisting with regexp. Lines like (*net.IPNet)(ipnet).Contains(ip) are not ideal in my opinion.

Pertaining to @hynd's idea of --allow-hostname, I think it would make sense to have iff it were mutually exclusive with --block-hostname.

na-- · 2020-06-30T08:19:46Z

@krashanoff, @hynd, sorry for the late response... 😞

Regarding --allow-hostname or something similar, can you elaborate more on your use case? To me, --allow-hostname seems like an option that's more suitable for the various converters, especially the HAR to k6 converters. And indeed, we have the --only flag in the old k6 convert HAR converter and we plan to implement something like that in the new standalone HAR converter: grafana/har-to-k6#39

The biggest use case I can think for this is, when you've recorded a browser session, to exclude requests to any external services like analytics, ads, script CDNs, etc. So if you record an example.com browser session, you can just skip all requests to external websites recorded in the HAR file by just allowing --allow-hostname="*.example.com". It's better for these requests to never be included in the k6 script file at all, both to keep it cleaner, and to prevent you from having a bunch of errors in the metrics.

Whereas, to me, the biggest use case for an --block-hostname flag would be abuse prevention for things like k6 cloud. We can't control what scripts people run, but if a someone complains that our service is being used to DDoS their website/API/etc. without their consent, we'd be easily able to add their domain to the list of blocked ones.

There are other benefits and non-k6-cloud uses of the --block-hostname flag, of course. It can also be used to block external requests, albeit more clumsily than something like --allow-hostname, and it can allow people to selectively exclude some parts of a test run temporarily... But for a lot of these use cases there are usually other, somewhat cleaner, ways to achieve them, like environment variables and if statements, or the upcoming #1007 scenarios option...

@krashanoff, regarding your other points, 👍 for the option name, new error code, simple validation (though I'm not sure how these should be handled), wildcards only in the beginning, trie-based implementation.

👎 for regex support, for the initial version of this feature at least... We can always change that decision in the future, and #972 (comment) illustrates how we can add regex support in a backwards compatible way, if we decide to do so...

hynd · 2020-06-30T17:35:39Z

Ah fair point,--allow does seem more appropriate on the converters - i think i was probably conflating the two steps and hadn't spotted the --only flag. I guess unlike some other tools which can slurp up any referenced media etc at run time, it's not so easy (in a good way!) with k6 to go chasing down arbitrary urls (and if my endpoint redirects requests to some random domain by mistake, that's on me 😊). Thanks for your detailed reply!

krashanoff · 2020-06-30T18:29:31Z

No worries about the late reply, thank you for taking the time. I think a preliminary implementation will just validate wildcard placement. Then, to account for internationalized domain names, we can use a combined character class of (\pL|[0-9]). In any case, I'll just implement the check as a regex so it can be changed later. For handling intl' names in the trie, we can just use Runes for each node value.

Thank you for poring over the details and the detailed responses. Will get to work on this very soon 👍

Closes grafana#972

warnings. Closes grafana#972

robingustafsson added good first issue feature labels Mar 21, 2019

na-- added this to the v0.26.0 milestone Aug 27, 2019

na-- modified the milestones: v0.26.0, v0.27.0 Oct 10, 2019

na-- modified the milestones: v0.27.0, v0.28.0 May 21, 2020

na-- added the evaluation needed proposal needs to be validated or tested before fully implementing it in k6 label Jun 24, 2020

na-- mentioned this issue Jun 30, 2020

Add an option to filter out certain domains and/or URLs grafana/har-to-k6#39

Open

krashanoff mentioned this issue Jul 3, 2020

use of K6_BLACKLIST_IPS causes nil pointer dereference #1531

Closed

krashanoff added a commit to krashanoff/k6 that referenced this issue Jul 3, 2020

Remove unnecessary export. Clarify comments.

8a940bd

Closes grafana#972

krashanoff added a commit to krashanoff/k6 that referenced this issue Jul 3, 2020

Remove unnecessary export. Clarify comments. Address golangci-lint

4493e38

warnings. Closes grafana#972

krashanoff mentioned this issue Jul 3, 2020

Add blocking by hostname patterns (--block-hostname) #1532

Merged

na-- modified the milestones: v0.28.0, v0.29.0 Sep 9, 2020

mstoykov closed this as completed in 0e96344 Oct 16, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add blacklisting based on hostname #972

Add blacklisting based on hostname #972

robingustafsson commented Mar 21, 2019 •

edited

Loading

na-- commented Mar 21, 2019

bookmoons commented Jun 11, 2019

na-- commented Jun 12, 2019

krashanoff commented Jun 23, 2020

na-- commented Jun 23, 2020

krashanoff commented Jun 23, 2020 •

edited

Loading

na-- commented Jun 24, 2020

imiric commented Jun 24, 2020

mstoykov commented Jun 24, 2020

na-- commented Jun 24, 2020

robingustafsson commented Jun 24, 2020

hynd commented Jun 24, 2020

krashanoff commented Jun 24, 2020 •

edited

Loading

na-- commented Jun 30, 2020

hynd commented Jun 30, 2020

krashanoff commented Jun 30, 2020 •

edited

Loading

Add blacklisting based on hostname #972

Add blacklisting based on hostname #972

Comments

robingustafsson commented Mar 21, 2019 • edited Loading

na-- commented Mar 21, 2019

bookmoons commented Jun 11, 2019

na-- commented Jun 12, 2019

krashanoff commented Jun 23, 2020

na-- commented Jun 23, 2020

krashanoff commented Jun 23, 2020 • edited Loading

na-- commented Jun 24, 2020

imiric commented Jun 24, 2020

mstoykov commented Jun 24, 2020

na-- commented Jun 24, 2020

robingustafsson commented Jun 24, 2020

hynd commented Jun 24, 2020

krashanoff commented Jun 24, 2020 • edited Loading

na-- commented Jun 30, 2020

hynd commented Jun 30, 2020

krashanoff commented Jun 30, 2020 • edited Loading

robingustafsson commented Mar 21, 2019 •

edited

Loading

krashanoff commented Jun 23, 2020 •

edited

Loading

krashanoff commented Jun 24, 2020 •

edited

Loading

krashanoff commented Jun 30, 2020 •

edited

Loading