Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent inadvertent blocking of good domains appearing in query strings #2027

Merged
merged 2 commits into from
Apr 17, 2018
Merged

Prevent inadvertent blocking of good domains appearing in query strings #2027

merged 2 commits into from
Apr 17, 2018

Conversation

ravron
Copy link
Contributor

@ravron ravron commented Mar 7, 2018

By submitting this pull request, I confirm the following:

  • I have read and understood the contributors guide, as well as this entire template.
  • I have made only one major change in my proposed changes.
  • I have commented my proposed changes within the code.
  • I have tested my proposed changes, and have included unit tests where possible.
  • I am willing to help maintain this change if there are issues with it later.
  • I give this submission freely and claim no ownership.
  • It is compatible with the EUPL 1.2 license
  • I have squashed any insignificant commits. (git rebase)

Please make sure you Sign Off all commits. Pi-hole enforces the DCO.


What does this PR aim to accomplish?:
This PR prevents certain entries in blocklists from being inadvertently formatted into entirely different, often benign domains.

For example, the OpenPhish blocklist currently includes the following line:

https://s-adv.kovo.vn/updates/Validation/login.php?https://login.live.com/public/IdentifyUser.aspx?LOB=RBGLogon

The domain that should be blocked here is s-adv.kovo.vn. However, after gravity_ParseFileIntoDomains runs, this line is converted to login.live.com, blocking a benign domain.

How does this PR accomplish the above?:
The core problem was this awk statement:

gsub(/(^.*:\/\/(.*:.*@)?|[:?\/;].*)/, "", $0)

The aim of the statement is to delete the scheme of the url, if present, and also any text following any of the characters :?\/;. However, if a scheme-like substring appeared twice in a given line, as it does in the faulty lines, this statement had the effect of consuming all of the text up to and including that second scheme.

To fix this bug, this PR changes the statement in question to:

gsub(/(^[a-zA-Z][a-zA-Z0-9+.-]*:\/\/(.*:.*@)?|[:?\/;].*)/, "", $0)

This new statement matches the scheme more strictly, following RFC 3986 section 3.1, as noted in a comment. With the new statement, the example given above is correctly converted to s-adv.kovo.vn.

For illustrative purposes, I've included a full diff comparing conversions of the OpenPhish blocklist by the old code, and by the new, at the bottom of this PR. You can cross-reference the domains to the list and see that the changes are all beneficial. With this change, for example, the following domains that were blocked by default will now no longer be blocked by default due to the OpenPhish list:

chaseonline.chase.com
id.orange.fr
ims-na1.adobelogin.com
login.live.com
www.dropbox.com
www.ebay.co.uk

Finally, this diff also makes some cosmetic modifications to the same awk code in question. Most notably, it replaces this construct:

{ if ($0 ~ <regex>) { <statement> } }

with the equivalent, more idiomatic:

<regex> { <statement> }

What documentation changes (if any) are needed to support this PR?:
No documentation changes should be necessary to support this PR.


As promised, here's the full diff for the OpenPhish blocklist before and after, generated with something like git diff -U0 --no-index old.out new.out:

diff --git a/old.out b/new.out
index 692166d..3666540 100644
--- a/old.out
+++ b/new.out
@@ -1130 +1130 @@ wattelectronica.es
-dfghjkliuytrewert
+www.educationinterface.com
@@ -3747 +3747 @@ barcelonaguideservice.com
-login.live.com
+barcelonaguideservice.com
@@ -4523,2 +4523,2 @@ hiringbai.com
-id.orange.fr
-id.orange.fr
+multipoliv.ru
+multipoliv.ru
@@ -7850,2 +7850,2 @@ ieas.org.gr
-login.live.com
-login.live.com
+www.goodsteel.vn
+www.goodsteel.vn
@@ -8536,2 +8536,2 @@ thelaughingmonk.com
-login.live.com
-login.live.com
+s-adv.kovo.vn
+s-adv.kovo.vn
@@ -8546,2 +8546,2 @@ woodbug.com
-login.live.com
-login.live.com
+www.goodsteel.vn
+www.goodsteel.vn
@@ -8558,2 +8558,2 @@ www.ambitionpowerbd.com
-login.live.com
-login.live.com
+www.pakethosting.com
+www.pakethosting.com
@@ -8583,2 +8583,2 @@ www.clicknew.ir
-chaseonline.chase.com
-chaseonline.chase.com
+swtitletx.com
+swtitletx.com
@@ -12620 +12620 @@ silverbeatband.com
-login.live.com
+vintusdoo.com
@@ -12622 +12622 @@ confermazione-datii.it
-login.live.com
+fazzfoods.com
@@ -13462,2 +13462,2 @@ bizimsurucukursu.com
-login.live.com
-login.live.com
+abdifurnish.com
+abdifurnish.com
@@ -13624 +13624 @@ code2crack.com
-ims-na1.adobelogin.com
+www.theqube.eu
@@ -14416 +14415,0 @@ uniqueairgroup.co.za
-chaseonline.chase.com
@@ -14418 +14417,2 @@ tazatarin.com
-chaseonline.chase.com
+tazatarin.com
+tazatarin.com
@@ -14684 +14684 @@ panicimis.in.rs
-chaseonline.chase.com
+panicimis.in.rs
@@ -14747 +14747 @@ sovana.com.au
-chaseonline.chase.com
+sman3pasuruan.sch.id
@@ -14845 +14845 @@ www.hakdilaravakfi.com
-www.dropbox.com
+www.hakdilaravakfi.com
@@ -15060 +15060 @@ hakdilaravakfi.com
-www.dropbox.com
+hakdilaravakfi.com
@@ -15064 +15064 @@ theviralscene.com
-www.dropbox.com
+hakdilaravakfi.com
@@ -15872 +15872 @@ ah100online.com
-www.ebay.co.uk
+fishmkt.com

Signed-off-by: Riley Avron <riley.avron@gmail.com>
Signed-off-by: Riley Avron <riley.avron@gmail.com>
@ravron ravron mentioned this pull request Mar 7, 2018
8 tasks
@dschaper dschaper requested a review from a team April 17, 2018 00:30
@dschaper dschaper added the PR: Approved Open Pull Request, Approved by required number of reviewers label Apr 17, 2018
@dschaper dschaper merged commit c9f3c02 into pi-hole:development Apr 17, 2018
@jacobsalmela
Copy link
Contributor

@ravron please contact me directly for a special thank you for contributing to our project.

@ravron ravron deleted the ravron-urls-in-query-strings branch April 17, 2018 21:31
@dschaper dschaper removed the PR: Approved Open Pull Request, Approved by required number of reviewers label May 3, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants