
Crawler Phase 2 Design


Analysis of the current implementation of the crawler shows that each page is crawled on average more than ten times, which is very inefficient. The old design assumed that the number of pages to crawl would exceed the number of pages actually crawled by such a large margin that the duplicated effort would be relatively low, but unfortunately that does not seem to be the case.

Approach

In the new approach, each client crawler will be given a list of URLs to crawl by the central server. In addition to not duplicating work, this also helps reduce the risk of fraudulent activity since we can check the URLs returned by clients against the list of URLs that was given to them.
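
A minimal sketch of this flow is shown below. The endpoint paths, payload shapes, and identifiers are assumptions for illustration only; the real API may differ.

```python
# Sketch of the assignment/verification flow between a client crawler and the
# central server. Endpoint names and payload fields are hypothetical.
import requests

API = "https://example.com/crawler"   # placeholder base URL
USER_ID = "some-user-id"              # placeholder client identifier


def fetch_batch():
    """Ask the central server for a batch of URLs assigned to this client."""
    response = requests.post(f"{API}/batch/new", json={"user_id": USER_ID})
    response.raise_for_status()
    return response.json()["urls"]


def submit_batch(results):
    """Send crawled results back; the server can then verify the URLs."""
    response = requests.post(f"{API}/batch", json={"user_id": USER_ID, "items": results})
    response.raise_for_status()


def verify_batch(assigned_urls, returned_items):
    """Server-side check: keep only results for URLs that were actually assigned."""
    assigned = set(assigned_urls)
    return [item for item in returned_items if item["url"] in assigned]
```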

Data Design

We will need the following tables (a rough schema sketch follows the list):

  • URLs: records, for each URL, whether we have crawled it, queued it, or currently assigned it to a user
    • URL (primary key)
    • Status (queued, crawled, assigned)
    • Updated date - when the status was last updated
  • Queue: URLs to be crawled (implemented using pq)
  • In progress items: URLs that have been given to clients to crawl
    • User ID
    • URL - primary key is (user, URL) pair
    • Date assigned
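
The sketch below illustrates the URLs and in-progress tables described above. It uses sqlite3 purely for convenience; the real implementation would likely use a server database, with the queue handled by pq as noted above, and the exact table and column names are assumptions.

```python
# Illustrative schema for the URLs and in-progress tables (names are assumed).
import sqlite3

connection = sqlite3.connect("crawler.db")
connection.executescript("""
CREATE TABLE IF NOT EXISTS urls (
    url TEXT PRIMARY KEY,
    status TEXT NOT NULL CHECK (status IN ('queued', 'crawled', 'assigned')),
    updated_date TIMESTAMP NOT NULL        -- when the status last changed
);

CREATE TABLE IF NOT EXISTS in_progress (
    user_id TEXT NOT NULL,                 -- client the URL was assigned to
    url TEXT NOT NULL,
    date_assigned TIMESTAMP NOT NULL,
    PRIMARY KEY (user_id, url)             -- primary key is the (user, URL) pair
);
""")
connection.commit()
```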