system_design/web_crawler.md
- Lastly, we will need a data store.
# Extractor / Crawler
Let's start here. We obviously want to be able to scale to as many crawlers as we need, but each crawler's sole responsibility is to scrape whatever data it is meant to scrape.
That can be videos, HTML, or images; each cluster will contain the same kind of extractors, and we will have a cluster for each content type.
Since we can add or remove any number of crawlers, we should place a load balancer in front of each cluster.
We don't know whether a certain crawler may get pages that require more processing time than others.

The URL extractor, however, will act slightly differently: it will further process the HTML scraper's result and collect all the URLs on that page.
These URLs will be sent to the URL Manager for later distribution.

Each extractor can keep a queue of given URLs, but then you need to keep fault tolerance in mind: if an extractor goes down, we lose whatever was in its queue.
This will be covered later.

Lastly, if we are given a link that leads nowhere, we should ignore it.
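To make the extractor's job concrete, here is a minimal sketch of a single HTML extractor using only Python's standard library; the names (`LinkCollector`, `extract`) and the library choice are illustrative assumptions, not part of the original design. The raw HTML would go on to the Content Manager, and the collected links are what get sent to the URL Manager.

```python
# Hypothetical sketch: one HTML extractor fetches a page, keeps the raw HTML,
# and collects the links on that page for the URL Manager to distribute later.
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects href values from anchor tags on a single page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract(url, timeout=10):
    """Scrape one page; return its raw HTML and the absolute links found on it."""
    with urlopen(url, timeout=timeout) as response:
        html = response.read().decode("utf-8", errors="replace")
    collector = LinkCollector()
    collector.feed(html)
    # Ignore links that lead nowhere (empty hrefs or fragment-only anchors).
    links = [urljoin(url, href) for href in collector.links
             if href and not href.startswith("#")]
    return html, links
```

A video or image extractor would look the same from the outside; it would just pull a different kind of content out of the response.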
# URL Manager
The URL Manager will be given a seed of URLs to start with.

We don't want to store very long URLs in our database.
Instead, we can hash the URLs; however, even that is not a full guarantee of uniqueness, due to collisions.
If we have to visit more pages than there are possible hash codes, this would be a problem.
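To make that trade-off concrete, here is a small sketch of storing fixed-size URL hashes instead of full URLs; SHA-256 is my assumption (the text does not name a hash function), chosen because its code space is enormous, though still not a guarantee against collisions.

```python
import hashlib

# Hashes of URLs we have already accepted; in practice this lives in a database.
seen = set()


def url_key(url: str) -> bytes:
    """Fixed-size key for a URL, so very long URLs are never stored directly.

    Note the trade-off from the text: with more pages than possible hash
    codes (or simple bad luck), two different URLs can share a key, and one
    of them would silently never be crawled.
    """
    return hashlib.sha256(url.encode("utf-8")).digest()


def should_enqueue(url: str) -> bool:
    """Accept a URL only if its hash has not been seen before."""
    key = url_key(url)
    if key in seen:
        return False
    seen.add(key)
    return True
```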
# Content Manager
We don't want to store a bunch of pages that contain the same data.
So we need a way to filter the content we receive after we have scraped it.
Similar to the URL Manager, we will hash the content and store the hashes in a hash set in another database to use for checking.
If the content hash isn't found, the content can be stored in the data store; otherwise it is rejected.

However, there is a penalty for hashing the entire page: it can take a long time, especially when pages are very large.
We could instead hash selected pieces of the content, such as the title, headers, the first few paragraphs, and video links, to figure out whether this is new content.
This isn't a perfect solution, again because of collisions; the question to ask is whether it is acceptable to end up with more duplicate content, or to miss unique content whenever the sub-content we select for the hash happens to be the same.
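A sketch of that partial fingerprint, assuming the scraper has already pulled out the title, headers, opening paragraphs, and video links; how many pieces go into the hash is a tuning knob, and the collision trade-off described above applies.

```python
import hashlib


def content_fingerprint(title, headers, paragraphs, video_links):
    """Hash selected pieces of a page rather than the whole document.

    Cheaper than hashing a very large page, but two different pages that
    share the same title, headers and opening paragraphs will collide, so
    we either miss unique content or accept some duplicates.
    """
    digest = hashlib.sha256()
    for piece in [title, *headers, *paragraphs[:3], *video_links]:
        digest.update(piece.encode("utf-8"))
        digest.update(b"\x00")  # separator so piece boundaries still matter
    return digest.digest()


def accept_content(fingerprint, seen_fingerprints):
    """Store only content whose fingerprint has not been seen before."""
    if fingerprint in seen_fingerprints:
        return False   # treated as a duplicate: reject
    seen_fingerprints.add(fingerprint)
    return True        # new content: write it to the data store
```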
# Fault Tolerance
The URL Manager and the Content Manager should both have a master-slave architecture for better up-time.
However, if both go down at the same time, we can revive them by using the database.
Since the URL Manager will be getting a stream of URLs, and therefore holds a queue, it is important to persist this queue in the database.
This requires the database to hold both the visited URL hash codes and the set of unvisited URL links.
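As one possible reading of "revive them by using the database", here is a sketch of persisting the URL Manager's queue; SQLite and the single-table layout (a visited flag rather than two separate tables) are simplifications assumed for illustration.

```python
import sqlite3


def open_db(path="crawler.db"):
    # One table standing in for the visited-hash-codes and unvisited-links sets.
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS urls (
                      hash    BLOB PRIMARY KEY,          -- hash code of the URL
                      url     TEXT NOT NULL,             -- the link itself
                      visited INTEGER NOT NULL DEFAULT 0
                  )""")
    return db


def accept_url(db, key, url):
    """Persist a newly discovered URL; the hash primary key drops duplicates."""
    db.execute("INSERT OR IGNORE INTO urls (hash, url) VALUES (?, ?)", (key, url))
    db.commit()


def recover_queue(db):
    """If both master and slave go down, rebuild the in-memory queue
    from every URL that was accepted but never marked visited."""
    rows = db.execute("SELECT url FROM urls WHERE visited = 0").fetchall()
    return [url for (url,) in rows]
```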
For the extractors, it depends on whether they hold a queue or not.
If they do hold a queue, it is important for them to also have a master-slave architecture.
The difference is that we cannot keep a database for each extractor, as we plan to have many of them.
If both the master and the slave of a pair of extractors go down, we will need to rely on the URL Manager to resend those URLs to another pair of extractors.
This would then require the extractors to report back once a given URL is complete, so the URL Manager can mark it as visited; that in turn requires another table for pending URLs, in addition to the visited and unvisited URLs.

The second design is to have the extractors keep no queue and process only one URL at a time; when an extractor is done, it asks the URL Manager for a new URL.
Either way, a pending URL table is still required in case an extractor goes down while scraping a URL.
This makes the second design easier to justify, since we save the resources of keeping a master-slave pair for each extractor.
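To tie the second design together, here is a hedged sketch of the URL Manager's side of that protocol: URLs move from unvisited to pending when handed out, to visited when an extractor reports back, and return to unvisited if a pending URL sits too long. The timeout-based requeue is my assumption; the text only calls for the pending table itself.

```python
import time


class URLManager:
    """Hands out one URL at a time to queue-less extractors and tracks the
    unvisited -> pending -> visited transitions described above."""

    PENDING_TIMEOUT = 300  # seconds before a handed-out URL is assumed lost

    def __init__(self, seed_urls):
        self.unvisited = list(seed_urls)  # waiting to be handed out
        self.pending = {}                 # url -> time it was handed out
        self.visited = set()              # completed (hash codes in a real system)

    def get_url(self):
        """Called by an idle extractor: move one URL into the pending table."""
        self.requeue_stale()
        if not self.unvisited:
            return None
        url = self.unvisited.pop(0)
        self.pending[url] = time.monotonic()
        return url

    def mark_done(self, url):
        """Extractor reports the URL is complete, so mark it as visited."""
        self.pending.pop(url, None)
        self.visited.add(url)

    def requeue_stale(self):
        """If an extractor died mid-scrape, put its URL back in the queue."""
        now = time.monotonic()
        for url, started in list(self.pending.items()):
            if now - started > self.PENDING_TIMEOUT:
                del self.pending[url]
                self.unvisited.append(url)
```

In practice these three collections would be the visited, unvisited, and pending tables in the database, so a restarted URL Manager can pick up exactly where it left off.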