
Crawler Phase 2 Design


Analysis of the current implementation of the crawler shows that each page is crawled on average more than ten times, which is very inefficient. The old design assumed that the number of pages to crawl would exceed the number of pages actually crawled by such a large margin that the duplicated effort would be relatively low, but unfortunately that does not seem to be the case.

Approach

In the new approach, each client crawler will be given a list of URLs to crawl by the central server. In addition to not duplicating work, this also helps reduce the risk of fraudulent activity since we can check the URLs returned by clients against the list of URLs that was given to them.
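
A minimal sketch of this flow is shown below. The endpoint paths, payload shapes, and identifiers are assumptions for illustration only; the real API may differ.

```python
# Sketch of the assignment/verification flow between a client crawler and the
# central server. Endpoint names and payload fields are hypothetical.
import requests

API = "https://example.com/crawler"   # placeholder base URL
USER_ID = "some-user-id"              # placeholder client identifier


def fetch_batch():
    """Ask the central server for a batch of URLs assigned to this client."""
    response = requests.post(f"{API}/batch/new", json={"user_id": USER_ID})
    response.raise_for_status()
    return response.json()["urls"]


def submit_batch(results):
    """Send crawled results back; the server can then verify the URLs."""
    response = requests.post(f"{API}/batch", json={"user_id": USER_ID, "items": results})
    response.raise_for_status()


def verify_batch(assigned_urls, returned_items):
    """Server-side check: keep only results for URLs that were actually assigned."""
    assigned = set(assigned_urls)
    return [item for item in returned_items if item["url"] in assigned]
```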

Data Design

We will need the following tables (a rough schema sketch follows the list):

  • URLs: records, for each URL, whether we have crawled it, queued it, or currently assigned it to a user
    • URL (primary key)
    • Status (queued, crawled, assigned)
    • Updated date - when the status was last updated
  • Queue: URLs to be crawled (implemented using pq)
  • In progress items: URLs that have been given to clients to crawl
    • User ID
    • URL - primary key is (user, URL) pair
    • Date assigned
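
The sketch below illustrates the URLs and in-progress tables described above. It uses sqlite3 purely for convenience; the real implementation would likely use a server database, with the queue handled by pq as noted above, and the exact table and column names are assumptions.

```python
# Illustrative schema for the URLs and in-progress tables (names are assumed).
import sqlite3

connection = sqlite3.connect("crawler.db")
connection.executescript("""
CREATE TABLE IF NOT EXISTS urls (
    url TEXT PRIMARY KEY,
    status TEXT NOT NULL CHECK (status IN ('queued', 'crawled', 'assigned')),
    updated_date TIMESTAMP NOT NULL        -- when the status last changed
);

CREATE TABLE IF NOT EXISTS in_progress (
    user_id TEXT NOT NULL,                 -- client the URL was assigned to
    url TEXT NOT NULL,
    date_assigned TIMESTAMP NOT NULL,
    PRIMARY KEY (user_id, url)             -- primary key is the (user, URL) pair
);
""")
connection.commit()
```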