
Crawler gets stuck on pages that generate more HTML pages #6

Closed
jake-bickle opened this issue Jan 4, 2019 · 1 comment

jake-bickle commented Jan 4, 2019

EDIT: For those experiencing this problem, read about ways to fix it here.

Arachnid has no mechanism in place to stop crawling pages that continuously generate more pages. For example, a web calendar may have a "next month" button that the crawler could theoretically follow forever, moving forward hundreds of years.

This is a common bot trap. The crawler gets stuck continuously navigating to new pages and gathering useless data, and it can take a long or even indefinite amount of time for the scheduler to produce a new, unique URL that steers the crawler away from the trap.

Methods attempted so far:

The AOPIC algorithm

The algorithm

The official paper is located here.
Explained in layman's terms here.

The issue

This algorithm is designed for incremental crawlers, not for snapshot crawlers. However, it naturally pushes the evil pages towards the end of the crawl. If we could think of a way to modify this algorithm to detect when the crawler is stuck on an evil page, this would be our solution.

I pushed the branch "AOPIC" with my implementation of the algorithm. The file of interest is here. Feel free to fork it and mess around with the algorithm.

The URLDiffFilter

The algorithm

The URLDiffFilter checks an arbitrary number X of previously crawled URL strings for a "significant difference," where a significant difference means 3 or more characters differ between two strings.
If the current URL does not have a significant difference from any of the previous X URLs, it will not be added to the schedule.
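
In Python terms the filter behaves roughly like this. This is a minimal sketch of the idea as described above; the class and method names are illustrative, not the actual code on the branch:

```python
from collections import deque
from itertools import zip_longest


class URLDiffFilter:
    """Reject URLs that barely differ from the last X URLs crawled."""

    def __init__(self, history_size=10, min_char_diff=3):
        self.history = deque(maxlen=history_size)  # the last X crawled URLs
        self.min_char_diff = min_char_diff         # 3+ differing chars = "significant"

    def _significant_difference(self, url_a, url_b):
        # Count positions where the two URL strings differ; extra length
        # in either string counts as additional differences.
        diff = sum(1 for a, b in zip_longest(url_a, url_b) if a != b)
        return diff >= self.min_char_diff

    def should_schedule(self, url):
        # If every URL in the history is "too similar", don't schedule this one.
        if self.history and not any(
            self._significant_difference(url, seen) for seen in self.history
        ):
            return False
        self.history.append(url)
        return True
```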

The issue

This simple algorithm breaks quite easily when an evil page generates multiple evil pages whose URLs do have a significant difference. There is also some potential for false positives.

@jake-bickle jake-bickle added the help wanted Extra attention is needed label Apr 3, 2019
@jake-bickle jake-bickle pinned this issue Apr 3, 2019

jake-bickle commented Sep 25, 2019

The solution for now will be a slight modification of the AOPIC algorithm along with a few new command-line options.

The AOPIC algorithm, as mentioned before, will not be enough to solve the issue on its own; in essence, it only pushes the evil pages to the end of the crawl. However, a few options have been added to help end the crawl automatically.

--page-limit (number of pages)
Will add an upper limit to the number of pages to be crawled.

--time-limit (m, h:m, or h:m:s format, where h=hours, m=minutes, s=seconds)
Will add an upper limit to the amount of time the crawler is allowed to run.

--blacklist-dir (space-delimited directory names)
Will forbid the crawler from visiting any page within a specified directory.

--no-query
Will disregard the query section of scraped URLs.

As the AOPIC algorithm pushes the evil pages to the end of the crawl, using the --page-limit or --time-limit will likely end the crawl when necessary. If you are aware of how the website is causing infinite crawls, using --blacklist-dir or --no-query may prove useful as well.
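
For example, a crawl capped at 500 pages or 30 minutes, whichever limit is hit first, would look something like this (assuming the crawler is invoked as arachnid with the seed URL as a positional argument; adjust to however you actually launch it):

```
arachnid https://www.example.com --page-limit 500 --time-limit 30
```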

EX. --blacklist-dir calendar blog will block the following pages:

https://www.example.com/calendar/2019/september
https://www.example.com/jake-bickle/blog/post01

This is helpful if you know of a directory that may cause problems.
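
Conceptually, the check just looks for a blacklisted name among the URL's path segments. A quick sketch of that idea (illustrative only, not the actual implementation):

```python
from urllib.parse import urlsplit


def is_blacklisted(url, blacklisted_dirs):
    """True if any path segment of the URL matches a blacklisted directory name."""
    segments = urlsplit(url).path.strip("/").split("/")
    return any(segment in blacklisted_dirs for segment in segments)


# Both of the example URLs above would be blocked:
assert is_blacklisted("https://www.example.com/calendar/2019/september", {"calendar", "blog"})
assert is_blacklisted("https://www.example.com/jake-bickle/blog/post01", {"calendar", "blog"})
```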

EX. With --no-query set, Arachnid will interpret the following pages

https://www.example.com/page?key1=value1&key2=value2
https://www.example.com/page
https://www.example.com/page?key1=value203&key2=value1029

as

https://www.example.com/page

Which is helpful for pages that generate an infinite amount of different queries.
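
Under the hood this amounts to dropping the query component before a URL is deduplicated and scheduled. A minimal sketch of that normalization using urllib.parse (not necessarily how Arachnid implements it):

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url):
    """Return the URL with its query string (and fragment) removed."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


assert strip_query("https://www.example.com/page?key1=value1&key2=value2") == \
    "https://www.example.com/page"
```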

To those curious, the algorithm is nearly identical to what's explained here. The changes are as follows (a rough sketch follows the list):

  • The starting cache is always 100,000 (an arbitrary number), and it is given to a single seed URL.
  • Fuzzed links and links found in robots.txt are "supplemental URLs" and are sent out of the scheduler independently of the AOPIC algorithm. This is because they aren't found naturally during a website crawl and would never earn any credit. Should any of these links exist and lead the scraper to new pages, the fuzzed/robots link is treated as having 1000 cache (an arbitrary number, but high enough not to get buried).
  • When dividing the cache among found pages in step 8, disregard links that have 0 cache.
  • The crawl is considered complete when all pages have 0 cache.
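
A rough sketch of how those modifications fit together, purely for illustration (the class and method names here are hypothetical, not the actual Arachnid scheduler):

```python
class ModifiedAOPICScheduler:
    """Sketch of the modified AOPIC credit scheme described in the list above."""

    STARTING_CACHE = 100_000    # arbitrary, given to the single seed URL
    SUPPLEMENTAL_CACHE = 1_000  # for fuzzed / robots.txt links

    def __init__(self, seed_url):
        self.cache = {seed_url: self.STARTING_CACHE}

    def add_supplemental(self, url):
        # Fuzzed and robots.txt links never earn credit naturally,
        # so they start with a fixed, reasonably high cache.
        self.cache.setdefault(url, self.SUPPLEMENTAL_CACHE)

    def next_url(self):
        # Crawl the page holding the most cache; None once everything is drained.
        url = max(self.cache, key=self.cache.get)
        return url if self.cache[url] > 0 else None

    def record_links(self, crawled_url, found_links):
        credit = self.cache[crawled_url]
        self.cache[crawled_url] = 0
        # Step 8, modified: split the credit only among links that are new
        # or still hold cache (skip links already drained to 0).
        targets = [u for u in found_links if self.cache.get(u) != 0]
        if not targets:
            return  # in this sketch the credit is simply dropped
        share = credit / len(targets)
        for u in targets:
            self.cache[u] = self.cache.get(u, 0) + share

    def finished(self):
        # The crawl is complete once every known page has 0 cache.
        return all(credit == 0 for credit in self.cache.values())
```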

This problem is considered solved and I'm closing the issue, but I'm willing to reopen it should new information or ideas come up to provide a better solution.

@jake-bickle jake-bickle self-assigned this Sep 25, 2019
@jake-bickle jake-bickle added bug Something isn't working and removed help wanted Extra attention is needed labels Sep 25, 2019