
Crawler Phase 2 Design


Analysis of the current implementation of the crawler shows that each page is crawled over ten times on average, which is very inefficient. The old design assumed that the number of pages to crawl would exceed the number of pages crawled by such a large margin that the duplicated effort would be relatively low, but unfortunately that does not seem to be the case.

Approach

In the new approach, each client crawler will be given a list of URLs to crawl by the central server. In addition to not duplicating work, this also helps reduce the risk of fraudulent activity since we can check the URLs returned by clients against the list of URLs that was given to them.

Data Design

For now, we plan to use Postgres to store the information about URL crawl status. We will need at least the following tables:

  • URLs: records, for each URL, whether it has been crawled or queued, or is currently assigned to a user
    • URL (primary key)
    • Status (queued, crawled, assigned)
    • Updated date - when the status was last updated
  • Queue: URLs to be crawled (implemented using pq)
  • In progress items: URLs that have been given to clients to crawl
    • User ID
    • URL - the primary key is the (user ID, URL) pair
    • Date assigned
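
A minimal sketch of how these tables and the queue might be set up, assuming psycopg2 for the Postgres connection and the pq library's PQ manager for the queue; the table and column names here are illustrative, not final:

```python
import psycopg2
from pq import PQ

# Illustrative schema only; actual table and column names are not decided here.
SCHEMA = """
CREATE TABLE IF NOT EXISTS urls (
    url TEXT PRIMARY KEY,
    status TEXT NOT NULL CHECK (status IN ('queued', 'crawled', 'assigned')),
    updated_on TIMESTAMP NOT NULL DEFAULT NOW()
);

CREATE TABLE IF NOT EXISTS in_progress (
    user_id TEXT NOT NULL,
    url TEXT NOT NULL,
    assigned_on TIMESTAMP NOT NULL DEFAULT NOW(),
    PRIMARY KEY (user_id, url)
);
"""


def create_tables(connection_string: str) -> None:
    conn = psycopg2.connect(connection_string)
    with conn, conn.cursor() as cursor:
        cursor.execute(SCHEMA)
    # The queue table itself is created and managed by pq
    PQ(conn).create()
    conn.commit()
    conn.close()
```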

Behaviour:

  • When requested by the client, items are popped from the queue and assigned to the user in the "in progress" table, and their status is updated to "assigned" in the URLs table
  • When a newly crawled page is sent by the client, the page and its extracted URLs are examined
    • Check the "in progress" table to confirm that the page's URL was indeed given to that client; if not, discard the result
    • For each extracted URL, check the URLs table to see whether it is new
    • If it is new, or the date it was last updated exceeds some threshold:
      • Add it to the URLs table with a status of "queued"
      • Add it to the queue of URLs to crawl
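
A rough sketch of the server-side behaviour described above, assuming the schema and pq queue from the earlier sketch; the function names, batch size and re-crawl threshold are illustrative assumptions, not part of the design:

```python
BATCH_SIZE = 100  # illustrative batch size


def assign_batch(conn, queue, user_id):
    """Pop up to BATCH_SIZE URLs from the queue and assign them to the given user."""
    urls = []
    for _ in range(BATCH_SIZE):
        job = queue.get()  # assumed to return None when the queue is empty
        if job is None:
            break
        urls.append(job.data['url'])

    with conn, conn.cursor() as cursor:
        for url in urls:
            cursor.execute(
                "INSERT INTO in_progress (user_id, url) VALUES (%s, %s)",
                (user_id, url))
            cursor.execute(
                "UPDATE urls SET status = 'assigned', updated_on = NOW() WHERE url = %s",
                (url,))
    return urls


def record_crawled_page(conn, queue, user_id, page_url, extracted_urls):
    """Process a crawled page returned by a client, queueing new or stale extracted URLs."""
    to_queue = []
    with conn, conn.cursor() as cursor:
        # Confirm the page was actually assigned to this client; discard the result otherwise.
        cursor.execute(
            "DELETE FROM in_progress WHERE user_id = %s AND url = %s",
            (user_id, page_url))
        if cursor.rowcount == 0:
            return
        cursor.execute(
            "UPDATE urls SET status = 'crawled', updated_on = NOW() WHERE url = %s",
            (page_url,))

        for url in extracted_urls:
            # True if the URL exists and was updated within the (illustrative) 30-day threshold
            cursor.execute(
                "SELECT updated_on > NOW() - INTERVAL '30 days' FROM urls WHERE url = %s",
                (url,))
            row = cursor.fetchone()
            if row is None or not row[0]:
                cursor.execute(
                    "INSERT INTO urls (url, status) VALUES (%s, 'queued') "
                    "ON CONFLICT (url) DO UPDATE SET status = 'queued', updated_on = NOW()",
                    (url,))
                to_queue.append(url)

    for url in to_queue:
        queue.put({'url': url})
```

Doing the assignment updates in a single transaction keeps the "in progress" table and the URL statuses consistent if a request fails part-way through.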