Prevent inadvertent blocking of good domains appearing in query strings #2027
What does this PR aim to accomplish?:
This PR prevents certain entries in blocklists from being inadvertently formatted into entirely different, often benign domains.
For example, the OpenPhish blocklist currently includes the following line:
The domain that should be blocked here is `s-adv.kovo.vn`. However, after `gravity_ParseFileIntoDomains` runs, this line is converted to `login.live.com`, blocking a benign domain.
How does this PR accomplish the above?:
The core problem was this awk statement:
The aim of the statement is to delete the scheme of the URL, if present, and also any text following any of the characters `:?\/;`. However, if a scheme-like substring appeared twice in a given line, as it does in the faulty lines, the statement consumed all of the text up to and including the second scheme.
To fix this bug, this PR changes the statement in question to:
This new statement matches the scheme more strictly, following RFC 3986 section 3.1, as noted in a comment. With the new statement, the example given above is correctly converted to `s-adv.kovo.vn`.
For illustrative purposes, I've included a full diff comparing conversions of the OpenPhish blocklist by the old code and by the new at the bottom of this PR. You can cross-reference the domains against the list and see that the changes are all beneficial. With this change, for example, the following domains that were blocked by default will no longer be blocked by default due to the OpenPhish list:
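As a rough illustration of the scheme-matching change described above (separate from the domain list this paragraph refers to), here is a minimal Python sketch of the difference between a loose and a strict scheme match. It is not the awk code from this PR: the two patterns, the `extract_domain` helper, and the `.example` URL are all assumptions made up for the demonstration. The strict pattern follows the scheme grammar in RFC 3986 section 3.1 (a letter followed by letters, digits, `+`, `-`, or `.`).

```python
import re

# Hypothetical patterns for illustration only; not the PR's awk statements.
# Loose: greedily treats everything up to the LAST "://" as the scheme.
LOOSE_SCHEME = re.compile(r"^.*://")
# Strict: RFC 3986 section 3.1,
#   scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
STRICT_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*://")
# Everything from the first ':', '?', '/', or ';' onward is discarded.
TRAILING = re.compile(r"[:?/;].*$")


def extract_domain(line, scheme_pattern):
    """Strip a leading scheme, then drop everything after the host."""
    without_scheme = scheme_pattern.sub("", line, count=1)
    return TRAILING.sub("", without_scheme)


# A made-up entry whose query string contains a second, benign URL.
url = "http://www.evil.example/path?next=https://www.good.example/login"

print(extract_domain(url, LOOSE_SCHEME))   # www.good.example (wrong domain)
print(extract_domain(url, STRICT_SCHEME))  # www.evil.example (intended domain)
```

Under these assumptions, the loose match swallows the first host together with the second scheme, while the strict match stops at the first `://` because `/` and `?` are not valid scheme characters.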
Finally, this diff also makes some cosmetic modifications to the same awk code in question. Most notably, it replaces this construct:
with the equivalent, more idiomatic:
What documentation changes (if any) are needed to support this PR?:
No documentation changes should be necessary to support this PR.
As promised, here's the full diff for the OpenPhish blocklist before and after, generated with something like `git diff -U0 --no-index old.out new.out`: