Refine the user agent check #112

Closed
Zodiac1978 opened this issue Oct 3, 2018 · 4 comments

@Zodiac1978
Member

Statify comes with a default filter that whitelists Windows, Linux, iPhone, etc.
Mobile bots, however, also match this filter, because they send a mobile user agent to be served the mobile version of a page instead of the desktop version.

Mozilla/5.0 (iPhone; CPU iPhone OS 8_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12B411 Safari/600.1.4 (compatible; YandexBot/3.0; +http://yandex.com/bots) — Indexing robot.

This behavior should probably be changed in future versions to a more refined blacklist.

Currently the only options are a custom skip hook or a blacklist plugin that filters their IP subnet. (My blacklist extension does not offer a user agent filter at the moment either.)
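
The false positive described above can be reproduced with a minimal sketch. It is written in Python for brevity and is not taken from Statify's PHP code. The `ALLOWED_PLATFORMS` pattern is an assumption built only from the platforms named in this issue (Windows, Linux, iPhone), and `passes_whitelist` is a hypothetical helper.

```python
import re

# Assumption: a whitelist built from the platforms named in this issue.
# Statify's real pattern may contain more entries.
ALLOWED_PLATFORMS = re.compile(r"Windows|Linux|iPhone")


def passes_whitelist(user_agent: str) -> bool:
    """Return True if the user agent mentions a whitelisted platform."""
    return ALLOWED_PLATFORMS.search(user_agent) is not None


# User agent quoted in this issue.
YANDEX_MOBILE_BOT = (
    "Mozilla/5.0 (iPhone; CPU iPhone OS 8_1 like Mac OS X) "
    "AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 "
    "Mobile/12B411 Safari/600.1.4 "
    "(compatible; YandexBot/3.0; +http://yandex.com/bots)"
)

# The bot passes the check because "iPhone" appears in its user agent.
print(passes_whitelist(YANDEX_MOBILE_BOT))
```

A whitelist on its own can only tell that a platform name is present. It cannot tell that the same string also identifies itself as a crawler.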

There is a bot from a French search engine that uses "Linux" in its user agent:
https://regex101.com/r/xZALC3/1/
Exabot - https://www.keycdn.com/blog/web-crawlers

Because of this problem and the mobile user agent problem mentioned above, we should blacklist the most widely used bots too.

I found a great list of bots with regex patterns that can be used to detect and blacklist many of them.
https://github.com/monperrus/crawler-user-agents

@stklcode
Contributor

stklcode commented Oct 4, 2018

Excluding the word “bot” in addition to the current filter works quite well on another system where I implemented a similar statistics processor of my own. (Shame on me, I never contributed back 🙈)
I have a PHPUnit test somewhere with about 50 known bot strings. I think we should add such checks here, too.
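
A rough sketch of that idea, in Python rather than PHPUnit: reject any user agent containing "bot" before applying the whitelist, then check the result against a small table of known strings. The whitelist pattern, the `is_counted` helper, and the `ExampleBot` entry are assumptions for illustration. Only the YandexBot string comes from this issue, and the Safari string is that same string with the bot suffix removed.

```python
import re

# Assumption: whitelist built from the platforms named in this issue.
ALLOWED_PLATFORMS = re.compile(r"Windows|Linux|iPhone")


def is_counted(user_agent: str) -> bool:
    """Count a visit only if the agent is whitelisted and not a 'bot'."""
    if "bot" in user_agent.lower():
        return False
    return ALLOWED_PLATFORMS.search(user_agent) is not None


IPHONE_SAFARI = (
    "Mozilla/5.0 (iPhone; CPU iPhone OS 8_1 like Mac OS X) "
    "AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 "
    "Mobile/12B411 Safari/600.1.4"
)

# Table of (user agent, expected result), similar in spirit to a
# PHPUnit data provider.
CASES = [
    (IPHONE_SAFARI, True),
    (IPHONE_SAFARI + " (compatible; YandexBot/3.0; +http://yandex.com/bots)", False),
    ("ExampleBot/1.0 (X11; Linux x86_64)", False),  # hypothetical bot
]

for user_agent, expected in CASES:
    assert is_counted(user_agent) == expected, user_agent
```

Keeping the cases in a table makes it cheap to add each newly observed bot string as a regression test.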

@Zodiac1978
Member Author

This is a very basic regex blacklist, which should detect most of them:
https://gist.github.com/geerlingguy/a438b41a9a8f988ee106
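
As a sketch of what such a basic blacklist could look like: a single case-insensitive regex over a few generic keywords. The keyword list below is an illustrative assumption and is not copied from the linked gist. `looks_like_bot` and the `Example...` user agents are hypothetical.

```python
import re

# Assumption: illustrative keywords only, not the gist's actual pattern.
BOT_KEYWORDS = re.compile(r"bot|crawl|slurp|spider", re.IGNORECASE)


def looks_like_bot(user_agent: str) -> bool:
    """Return True if the user agent contains a generic crawler keyword."""
    return BOT_KEYWORDS.search(user_agent) is not None


samples = [
    "ExampleSpider/2.0 (X11; Linux x86_64)",                       # hypothetical
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleCrawler/1.0",  # hypothetical
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:62.0) Gecko/20100101 Firefox/62.0",
]

for user_agent in samples:
    print(looks_like_bot(user_agent), user_agent)
```

Unlike a check for "bot" alone, the extra keywords also catch agents that mention a whitelisted platform such as Linux or Windows but call themselves a spider or crawler.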

@patrickrobrecht patrickrobrecht added this to the 1.8.0 milestone Dec 28, 2018
mahype added a commit that referenced this issue Mar 22, 2019
@mahype
Contributor

mahype commented Mar 22, 2019

I have enhanced the bot detection by using the gist given by @Zodiac1978.

@mahype mahype self-assigned this Mar 22, 2019
@mahype mahype modified the milestones: 1.8.0, 1.7.0 Mar 22, 2019
mahype added a commit that referenced this issue Mar 22, 2019
Zodiac1978 added a commit that referenced this issue Mar 22, 2019
…ent-check

Ticket #112 - Enhanced bot detection.
@Zodiac1978
Member Author

@mahype This ticket can be closed after #125 is merged. Correct?
