Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Gravity: Check suitablility of sourced lists #3150

Merged
merged 6 commits into from Feb 24, 2020

Conversation

DL6ER
Copy link
Member

@DL6ER DL6ER commented Feb 22, 2020

By submitting this pull request, I confirm the following:

  • I have read and understood the contributors guide, as well as this entire template.
  • I have made only one major change in my proposed changes.
  • I have commented my proposed changes within the code.
  • I have tested my proposed changes, and have included unit tests where possible.
  • I am willing to help maintain this change if there are issues with it later.
  • I give this submission freely and claim no ownership.
  • It is compatible with the EUPL 1.2 license
  • I have squashed any insignificant commits. (git rebase)

What does this PR aim to accomplish?:

Improve gravity.

How does this PR accomplish the above?:

  • Show lines from the temporary file if SQL errors are detected
  • Check domains for validity before importing them (should eliminate all SQL warnings/errors as a side-effect)
  • Show how many domains we're considering for import and show how many are invalid (if any)

This all has been done in a performant way so I refrained from adding a switch for enabling it.

Exemplary output provided by a user on Discourse:

  [i] Target: http://www.malwaredomainlist.com/hostslist/hosts.txt
  [✓] Status: No changes detected
  [i] Received 1104 domains

  [i] Target: https://easylist.to/easylistgermany/easylistgermany.txt
  [✓] Status: Retrieval successful
  [i] Received 945 domains, 943 domains invalid!
      Sample of invalid domains:
      - 1.1]
      - easylist.germany@gmail.com
      - ||2mdn.net^$object,third-party,domain=anleger-fernsehen.de|blick.ch|fitforfun.de|focus.de|giga.de|golem.de|helpster.de|myspass.de|netzwelt.de|stol.it|sueddeutsche.de|tvtoday.de
      - ||4rm.de^$third-party
      - ||85.114.133.62^$third-party

  [i] Target: http://pgl.yoyo.org/adservers/serverlist.php?hostformat=hosts&showintro=0&mimetype=plaintext
  [✓] Status: Retrieval successful
  [i] Received 3292 domains

What documentation changes (if any) are needed to support this PR?:

None

Signed-off-by: DL6ER <dl6er@dl6er.de>
Signed-off-by: DL6ER <dl6er@dl6er.de>
Signed-off-by: DL6ER <dl6er@dl6er.de>
Signed-off-by: DL6ER <dl6er@dl6er.de>
@DL6ER DL6ER added this to the v5.0 milestone Feb 22, 2020
@DL6ER DL6ER requested a review from a team February 22, 2020 14:25
@pralor-bot
Copy link

This pull request has been mentioned on Pi-hole Userspace. There might be relevant details there:

https://discourse.pi-hole.net/t/encountered-non-critical-sql-warnings/28543/113

@DL6ER DL6ER changed the title Tweak/database warnings inspection Gravity: Check suitablility of sourced lists Feb 22, 2020
@dschaper
Copy link
Member

I'm not for this. It's not what we should be doing.

@dschaper
Copy link
Member

[i] List quality: 8882 of 8882 lines importable (100.0%) This needs to change, I do not want to look like we are saying the list is of 100% quality. I can make a list that is full importable and is of such a level of garbage that it breaks everything.

@roja6969
Copy link

I have to agree with Dan, but quality of the list is a user issue. There is no way intelligently policing a list of suspect sites for quality. Syntax checking I guess but validity that’s subjective. I’ve had to unblock sites because key tools use sites on blocked list so one persons garbage is another persons gold and is subjective. You cannot expect pi team to police these. “You” do it not them. Open source is about choice so choose wisely.

@PromoFaux
Copy link
Member

I think perhaps just a [I] n/n domains imported would suffice. That way we are just stating a fact. Domains have been imported, but no reference to the quality of the domains that are being imported.

Signed-off-by: DL6ER <dl6er@dl6er.de>
… as importing does only happen a bit later. Only show the number of invalid domains if there are invalid domains.

Signed-off-by: DL6ER <dl6er@dl6er.de>
@DL6ER
Copy link
Member Author

DL6ER commented Feb 24, 2020

Okay, I changed it to

  [i] Target: https://hosts-file.net/ad_servers.txt
  [✓] Status: No changes detected
  [i] Received 45736 domains

(that is: all importable)

  [i] Received 51233 domains, 37997 domains invalid!
      Sample of invalid domains:
      - localhost.localdomain
      - 01mspmd5yalky8.com
      - analytics.247sports.com
      - www.analytics.247sports.com
      - logitechlogitechglobal.112.2o7.net

(that means crap on the list, this is not really true in this example, just showcasing here)

@DL6ER
Copy link
Member Author

DL6ER commented Feb 24, 2020

Code + PR description updated, what is your opinion @dschaper @PromoFaux @roja6969 ?

@dschaper
Copy link
Member

The only changes I could see would be to say “# lines invalid” instead of “# domains” since it’s possible the bad entries won’t be domains. For things like Adblock lists that are css or html.

@DL6ER
Copy link
Member Author

DL6ER commented Feb 24, 2020

@dschaper Do you want me to change this? Lists that are added should only contain domains. And these lines are invalid as domains so they can also be understood as invalid domains ;-)

@dschaper
Copy link
Member

No, no need to change it and dismiss the approvals. Looks good to me.

@DL6ER DL6ER merged commit f4a1cc6 into release/v5.0 Feb 24, 2020
@DL6ER DL6ER deleted the tweak/database_warnings_inspection branch February 24, 2020 09:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants