Crawler gets stuck on pages that generate more HTML pages #6
The solution for now is a slight modification of the AOPIC algorithm along with a few new command-line options. The AOPIC algorithm, as noted below, is not enough to solve the issue on its own: it naturally pushes the evil pages to the end of the crawl, but left alone it would still reach them eventually. A few options have therefore been added to automate the crawl.
Because AOPIC pushes the evil pages to the end of the crawl, one option lets the crawl be cut off before it reaches them. Another option excludes URLs that match a given pattern, which is helpful if you know of a directory that may cause problems, and also for pages that generate an infinite number of different queries.
To those curious, the algorithm is nearly identical to what's explained here, with only a few small changes. A rough sketch of the idea is shown below.
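For illustration only, here is a hypothetical Python sketch of how those two cut-offs could be wired into a scheduler. The function, the `min_credit` threshold, and the pattern list are assumptions for the sketch, not Arachnid's actual options or interface:

```python
import re

# Hypothetical scheduling check, for illustration only. `min_credit`
# and the exclusion patterns stand in for the new CL options; they are
# not Arachnid's real option names.
def should_schedule(url, credit, exclude_patterns, min_credit=0.001):
    # Skip anything matching a user-supplied pattern (known-bad
    # directories, infinite query generators, ...).
    if any(re.search(pattern, url) for pattern in exclude_patterns):
        return False
    # AOPIC keeps evil pages' credit tiny, so refusing low-credit URLs
    # effectively ends the crawl before it reaches them.
    return credit >= min_credit

# Example: an endless calendar page is rejected by its pattern.
print(should_schedule("http://example.com/calendar?month=2500-01",
                      credit=0.0002,
                      exclude_patterns=[r"/calendar\?"]))   # False
```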
This problem is considered solved and I'm closing the issue, but I'm willing to reopen it should new information or ideas come up that would provide a better solution.
EDIT: For those experiencing this problem, read about ways to fix it here.
Arachnid has no mechanism in place to stop crawling pages that continuously generate more pages. For example, a web calendar may have a "next month" button that one could theoretically keep following to move forward hundreds of years.
This is a common bot trap. The issue is that the crawler gets stuck continuously navigating to new pages and gathering useless data, and it can take a long or even indefinite amount of time for the scheduler to produce a new, unique URL that navigates the crawler away from the trap.
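To make the trap concrete, the toy loop below (with a made-up URL scheme) shows why a naive "unique URLs only" frontier never escapes: every fetched calendar page contributes exactly one URL that has never been seen before.

```python
from collections import deque

def fake_links(url):
    # Pretend to fetch `url` and return the links found on it: the
    # calendar always offers exactly one "next month" link.
    year, month = map(int, url.rsplit("=", 1)[1].split("-"))
    year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return [f"http://example.com/calendar?date={year}-{month:02d}"]

frontier = deque(["http://example.com/calendar?date=2024-01"])
seen = set(frontier)
for _ in range(5):                # a real crawl would never stop here
    page = frontier.popleft()
    for link in fake_links(page):
        if link not in seen:      # the uniqueness check never triggers
            seen.add(link)
            frontier.append(link)
print(len(seen))                  # 6 and counting
```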
Methods attempted so far:
The AOPIC algorithm
The algorithm
The official paper is located here.
Explained in layman's terms here.
The issue
This algorithm is designed for incremental crawlers, not for snapshot crawlers. However, it naturally pushes the evil pages towards the end of the crawl. If we could think of a way to modify this algorithm to detect when the crawler is stuck on an evil page, this would be our solution.
I pushed the branch "AOPIC" with my implementation of the algorithm. The file of interest is here. Feel free to fork it and mess around with the algorithm.
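For reference, here is a compact Python sketch of the credit scheme as the linked layman's explanation describes it. The 10% tax, the virtual "Lambda" page name, and the `fetch_links` callback are parameters of this sketch, not of the branch above:

```python
def aopic_crawl(seeds, fetch_links, max_fetches=100, tax=0.10):
    LAMBDA = "__lambda__"                 # virtual page that collects the tax
    credit = {LAMBDA: 0.0}
    for url in seeds:
        credit[url] = 1.0 / len(seeds)    # seeds split the initial credit
    order = []
    while len(order) < max_fetches:
        page = max(credit, key=credit.get)          # richest page goes next
        if page == LAMBDA:
            # "Crawling" Lambda spreads the collected tax over every known
            # page, which eventually rescues pages starved by a trap.
            pages = [u for u in credit if u != LAMBDA]
            for u in pages:
                credit[u] += credit[LAMBDA] / len(pages)
            credit[LAMBDA] = 0.0
            continue
        cash, credit[page] = credit[page], 0.0      # spend this page's credit
        credit[LAMBDA] += cash * tax                # pay the tax
        links = fetch_links(page)                   # download + parse the page
        if links:
            share = cash * (1 - tax) / len(links)
            for link in links:                      # hand the rest to out-links
                credit[link] = credit.get(link, 0.0) + share
        else:
            credit[LAMBDA] += cash * (1 - tax)      # dead end: all to Lambda
        order.append(page)
    return order
```

Because a trap only feeds credit back to itself and loses the tax on every visit, its credit decays, while ordinary pages keep receiving credit from elsewhere; that is why the evil pages drift to the back of the crawl order.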
The URLDiffFilter
The algorithm
The URLDiffFilter checks an arbitrary number X of previously crawled URL strings for a "significant difference." A significant difference is when 3 or more characters differ between two strings.
If the current URL does not have a significant difference from any of the previous X URLs, it will not be added to the schedule.
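A minimal sketch of that rule, assuming a sliding window over the last X crawled URLs and a simple position-by-position character comparison (the window size and the exact diff metric are guesses, only the 3-character threshold comes from the description above):

```python
from collections import deque

class URLDiffFilter:
    def __init__(self, window=5, min_diff=3):
        self.recent = deque(maxlen=window)   # last X crawled URL strings
        self.min_diff = min_diff             # "significant" = 3+ chars differ

    def _significantly_different(self, a, b):
        # Count positions whose characters differ, plus any length gap.
        diff = sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))
        return diff >= self.min_diff

    def accept(self, url):
        # Reject when the URL is insignificantly different from every
        # URL in the window -- the telltale pattern of a trap.
        if self.recent and not any(
                self._significantly_different(url, u) for u in self.recent):
            return False
        self.recent.append(url)
        return True

f = URLDiffFilter()
print(f.accept("http://example.com/cal?page=101"))   # True, window is empty
print(f.accept("http://example.com/cal?page=102"))   # False, only 1 char differs
```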
The issue
This simple algorithm breaks quite easily when an evil page generates multiple evil pages whose URLs do have a significant difference. There is also a small potential for false positives.