
Commit 6f54df6

Author: Joseph Luce
Update web_crawler.md
1 parent b3070b2 commit 6f54df6

File tree: 1 file changed (+31, -6 lines)


system_design/web_crawler.md

Lines changed: 31 additions & 6 deletions
@@ -17,18 +17,18 @@
- Lastly, we will need a data store.

# Extractor / Crawler
-Let start here, we obviously would like to be able to scale as many crawlers as we want, but its sole responsibility is to just scrape whatever data it is meant to scrape.
-That can be only videos, HTML, or images.
-Since we can add or remove any number of crawlers, we should place a load balancer in front of this cluster.
+Let's start here: we obviously would like to be able to scale to as many crawlers as we want, but each crawler's sole responsibility is to scrape whatever data it is meant to scrape.
+That can be only videos, HTML, or images; each cluster will contain the same extractors, and we will have separate clusters for each content type.
+Since we can add or remove any number of crawlers, we should place a load balancer in front of each cluster.
We don't know if a certain crawler may get many pages that require more processing time than others.

The URL extractor, however, will act slightly differently: it will further process the HTML scraper's result and get all the URLs on that page.
This will be sent to the URL manager for later distribution.

-The extractor can keep a queue of given URLs, but you would then need to keep in mind of fault tolerance, if the extractor goes down, we will lose whatever was in the queue.
-This will be explained below.
+Each extractor can keep a queue of given URLs, but you would then need to keep fault tolerance in mind: if the extractor goes down, we will lose whatever was in the queue.
+This will be covered later.

-Lastly, if given a link that does not lead to no where, we should ignore it.
+Lastly, if given a link that leads nowhere, we should ignore it.

# URL Manager
The URL Manager will be given a seed of URLs to start with.
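To make the extractor's role concrete, here is a minimal sketch, in Python using only the standard library, of what one HTML extractor in such a cluster might do: fetch a page, ignore it if the link leads nowhere, and collect the page's links for the URL Manager. The `crawl_one` and `LinkCollector` names, and the final print stand-in for sending results onward, are illustrative assumptions rather than part of the design above.

```python
# Minimal sketch of one HTML extractor: fetch a page, collect its links,
# and hand the links to the URL Manager. In the real system the results
# would be sent to the Content Manager and URL Manager services.
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects absolute URLs from every <a href="..."> on the page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def crawl_one(url, timeout=10):
    """Scrape a single URL; ignore links that lead nowhere (fetch errors)."""
    try:
        with urlopen(url, timeout=timeout) as response:
            html = response.read().decode("utf-8", errors="replace")
    except Exception:
        return None, []  # dead link: ignore it, as described above

    collector = LinkCollector(url)
    collector.feed(html)
    return html, collector.links


if __name__ == "__main__":
    page, links = crawl_one("https://example.com")
    # Stand-in for forwarding the page to the data store and the links
    # to the URL Manager.
    print(f"scraped {0 if page is None else len(page)} characters, {len(links)} links")
```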
@@ -41,4 +41,29 @@ We don't want to store very long URLs in our database.
Instead, we can hash the URLs; however, that is not always a guarantee due to collisions.
If we have to visit more pages than there are possible hash codes, this would be a problem.

+# Content Manager
+We don't want to store a bunch of pages with the same set of data.
+So we need a way to filter the content that we receive after we have scraped the data.
+Similar to the URL manager, we will hash the content and store a hash set in another database to use for checking.
+If the content hash isn't found, the content can then be stored into the data store; otherwise it is rejected.
+
+However, there is a penalty for hashing the entire page: it can take a long time.
+Especially when the content of the pages is very large.
+We could instead hash selected pieces of the content, such as the title, headers, first few paragraphs, and video links, to figure out whether this is new content.
+This isn't a perfect solution due to collisions; the question to ask is whether it is acceptable to keep more duplicate content, or to miss unique content when the sub-content we select for the hash happens to be the same.
+
# Fault Tolerance
+The URL Manager and the Content Manager should both have a master-slave architecture for better uptime.
+However, if both go down at the same time, we can revive them by using the database.
+Since the URL Manager will be getting a stream of URLs, and hence holds a queue, it is important to save this queue into the database.
+Therefore, the database needs to hold the visited URL hash codes and the set of unvisited URL links.
+
+For the extractors, it depends on whether they hold a queue or not.
+If they hold a queue, it is important for them to also have a master-slave architecture.
+The difference is that we cannot have a database for each extractor, as we plan to have many of them.
+If both the master and the slave of a pair of extractors go down, we will need to rely on the URL Manager to resend the URLs to another pair of extractors.
+This requires a response back from the extractors that the URL they were given is complete, so the URL Manager can mark it as visited. That in turn requires another table for pending URLs, in addition to the visited and unvisited URLs.
+
+The second design has the extractors keep no queue and process only one URL at a time; when they are done, they ask the URL Manager for a new URL.
+Either way, this still requires a pending URL table in case an extractor goes down while scraping a URL.
+So this design is easier to justify, as we save the resources of keeping a master-slave pair for each extractor.
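As a rough illustration of the bookkeeping described in the two new sections, here is a Python sketch that uses an in-memory stand-in for the database tables: a `content_fingerprint()` helper that hashes only selected pieces of a page, and a `URLManager` that tracks unvisited, pending, and visited URLs, hands one URL at a time to queue-less extractors, and re-queues a URL if its extractor never reports completion. All names here are hypothetical, not taken from the original text.

```python
import hashlib
import time
from collections import deque


def url_hash(url):
    """Short fixed-size key for a URL, as the URL Manager section suggests."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def content_fingerprint(title, headers, first_paragraphs):
    """Hash only selected pieces of a page instead of the whole body.
    Collisions are possible, which is the trade-off discussed above."""
    pieces = "\n".join([title, *headers, *first_paragraphs])
    return hashlib.sha256(pieces.encode("utf-8")).hexdigest()


class URLManager:
    """In-memory stand-in; a real deployment would back these structures
    with the shared database so a restarted master can rebuild its state."""

    def __init__(self, seed_urls, pending_timeout=60.0):
        self.unvisited = deque(seed_urls)   # URLs waiting to be handed out
        self.pending = {}                   # url -> time it was handed out
        self.visited = set()                # hashes of completed URLs
        self.pending_timeout = pending_timeout

    def next_url(self):
        """Called by an idle, queue-less extractor asking for one URL."""
        self._requeue_stale()
        while self.unvisited:
            url = self.unvisited.popleft()
            if url_hash(url) not in self.visited:
                self.pending[url] = time.monotonic()
                return url
        return None

    def add_urls(self, urls):
        """Called with the links an extractor found on a page."""
        for url in urls:
            if url_hash(url) not in self.visited and url not in self.pending:
                self.unvisited.append(url)

    def mark_done(self, url):
        """Extractor reports completion, so the URL moves to 'visited'."""
        self.pending.pop(url, None)
        self.visited.add(url_hash(url))

    def _requeue_stale(self):
        """If an extractor died mid-scrape, put its pending URL back."""
        now = time.monotonic()
        for url, started in list(self.pending.items()):
            if now - started > self.pending_timeout:
                del self.pending[url]
                self.unvisited.append(url)
```

A real deployment would persist every transition between unvisited, pending, and visited to the shared database, which is what lets a restarted URL Manager (or a promoted slave) rebuild its state instead of losing the queue.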
