system_design/web_crawler.md
- Lastly, we will need a data store.
# Extractor / Crawler
Let's start here. We obviously want to be able to scale to as many crawlers as we need, but each crawler's sole responsibility is to scrape whatever data it is meant to scrape.
That can be videos, HTML, or images; each cluster will contain the same kind of extractors, and we will have a cluster for each content type.
Since we can add or remove any number of crawlers, we should place a load balancer in front of each cluster.
We don't know whether a certain crawler may get pages that require more processing time than others.

The URL extractor, however, will act slightly differently: it will further process the HTML scraper's result and collect all the URLs on that page.
These URLs will be sent to the URL Manager for later distribution.

Each extractor can keep a queue of given URLs, but then you need to keep fault tolerance in mind: if an extractor goes down, we lose whatever was in its queue.
This will be covered later.

Lastly, if we are given a link that leads nowhere, we should ignore it.
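To make the extractor's job concrete, here is a minimal sketch of a single HTML extractor using only Python's standard library; the names (`LinkCollector`, `extract`) and the library choice are illustrative assumptions, not part of the original design. The raw HTML would go on to the Content Manager, and the collected links are what get sent to the URL Manager.

```python
# Hypothetical sketch: one HTML extractor fetches a page, keeps the raw HTML,
# and collects the links on that page for the URL Manager to distribute later.
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects href values from anchor tags on a single page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract(url, timeout=10):
    """Scrape one page; return its raw HTML and the absolute links found on it."""
    with urlopen(url, timeout=timeout) as response:
        html = response.read().decode("utf-8", errors="replace")
    collector = LinkCollector()
    collector.feed(html)
    # Ignore links that lead nowhere (empty hrefs or fragment-only anchors).
    links = [urljoin(url, href) for href in collector.links
             if href and not href.startswith("#")]
    return html, links
```

A video or image extractor would look the same from the outside; it would just pull a different kind of content out of the response.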
# URL Manager
The URL Manager will be given a seed of URLs to start with.

We don't want to store very long URLs in our database.
Instead, we can hash the URLs; however, even that is not a full guarantee of uniqueness, due to collisions.
If we have to visit more pages than there are possible hash codes, this would be a problem.
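To make that trade-off concrete, here is a small sketch of storing fixed-size URL hashes instead of full URLs; SHA-256 is my assumption (the text does not name a hash function), chosen because its code space is enormous, though still not a guarantee against collisions.

```python
import hashlib

# Hashes of URLs we have already accepted; in practice this lives in a database.
seen = set()


def url_key(url: str) -> bytes:
    """Fixed-size key for a URL, so very long URLs are never stored directly.

    Note the trade-off from the text: with more pages than possible hash
    codes (or simple bad luck), two different URLs can share a key, and one
    of them would silently never be crawled.
    """
    return hashlib.sha256(url.encode("utf-8")).digest()


def should_enqueue(url: str) -> bool:
    """Accept a URL only if its hash has not been seen before."""
    key = url_key(url)
    if key in seen:
        return False
    seen.add(key)
    return True
```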
# Content Manager
We don't want to store a bunch of pages that contain the same data.
So we need a way to filter the content we receive after we have scraped it.
Similar to the URL Manager, we will hash the content and store the hashes in a hash set in another database to use for checking.
If the content hash isn't found, the content can be stored in the data store; otherwise it is rejected.

However, there is a penalty for hashing the entire page: it can take a long time, especially when pages are very large.
We could instead hash selected pieces of the content, such as the title, headers, the first few paragraphs, and video links, to figure out whether this is new content.
This isn't a perfect solution, again because of collisions; the question to ask is whether it is acceptable to end up with more duplicate content, or to miss unique content whenever the sub-content we select for the hash happens to be the same.
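A sketch of that partial fingerprint, assuming the scraper has already pulled out the title, headers, opening paragraphs, and video links; how many pieces go into the hash is a tuning knob, and the collision trade-off described above applies.

```python
import hashlib


def content_fingerprint(title, headers, paragraphs, video_links):
    """Hash selected pieces of a page rather than the whole document.

    Cheaper than hashing a very large page, but two different pages that
    share the same title, headers and opening paragraphs will collide, so
    we either miss unique content or accept some duplicates.
    """
    digest = hashlib.sha256()
    for piece in [title, *headers, *paragraphs[:3], *video_links]:
        digest.update(piece.encode("utf-8"))
        digest.update(b"\x00")  # separator so piece boundaries still matter
    return digest.digest()


def accept_content(fingerprint, seen_fingerprints):
    """Store only content whose fingerprint has not been seen before."""
    if fingerprint in seen_fingerprints:
        return False   # treated as a duplicate: reject
    seen_fingerprints.add(fingerprint)
    return True        # new content: write it to the data store
```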
# Fault Tolerance
The URL Manager and the Content Manager should both have a master-slave architecture for better up-time.
However, if both go down at the same time, we can revive them by using the database.
Since the URL Manager will be getting a stream of URLs, and therefore holds a queue, it is important to persist this queue in the database.
This requires the database to hold both the visited URL hash codes and the set of unvisited URL links.
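As one possible reading of "revive them by using the database", here is a sketch of persisting the URL Manager's queue; SQLite and the single-table layout (a visited flag rather than two separate tables) are simplifications assumed for illustration.

```python
import sqlite3


def open_db(path="crawler.db"):
    # One table standing in for the visited-hash-codes and unvisited-links sets.
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS urls (
                      hash    BLOB PRIMARY KEY,          -- hash code of the URL
                      url     TEXT NOT NULL,             -- the link itself
                      visited INTEGER NOT NULL DEFAULT 0
                  )""")
    return db


def accept_url(db, key, url):
    """Persist a newly discovered URL; the hash primary key drops duplicates."""
    db.execute("INSERT OR IGNORE INTO urls (hash, url) VALUES (?, ?)", (key, url))
    db.commit()


def recover_queue(db):
    """If both master and slave go down, rebuild the in-memory queue
    from every URL that was accepted but never marked visited."""
    rows = db.execute("SELECT url FROM urls WHERE visited = 0").fetchall()
    return [url for (url,) in rows]
```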
For the extractors, it depends on whether they hold a queue or not.
If they do hold a queue, it is important for them to also have a master-slave architecture.
The difference is that we cannot keep a database for each extractor, as we plan to have many of them.
If both the master and the slave of a pair of extractors go down, we will need to rely on the URL Manager to resend those URLs to another pair of extractors.
This would then require the extractors to report back once a given URL is complete, so the URL Manager can mark it as visited; that in turn requires another table for pending URLs, in addition to the visited and unvisited URLs.

The second design is to have the extractors keep no queue and process only one URL at a time; when an extractor is done, it asks the URL Manager for a new URL.
Either way, a pending URL table is still required in case an extractor goes down while scraping a URL.
This makes the second design easier to justify, since we save the resources of keeping a master-slave pair for each extractor.
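To tie the second design together, here is a hedged sketch of the URL Manager's side of that protocol: URLs move from unvisited to pending when handed out, to visited when an extractor reports back, and return to unvisited if a pending URL sits too long. The timeout-based requeue is my assumption; the text only calls for the pending table itself.

```python
import time


class URLManager:
    """Hands out one URL at a time to queue-less extractors and tracks the
    unvisited -> pending -> visited transitions described above."""

    PENDING_TIMEOUT = 300  # seconds before a handed-out URL is assumed lost

    def __init__(self, seed_urls):
        self.unvisited = list(seed_urls)  # waiting to be handed out
        self.pending = {}                 # url -> time it was handed out
        self.visited = set()              # completed (hash codes in a real system)

    def get_url(self):
        """Called by an idle extractor: move one URL into the pending table."""
        self.requeue_stale()
        if not self.unvisited:
            return None
        url = self.unvisited.pop(0)
        self.pending[url] = time.monotonic()
        return url

    def mark_done(self, url):
        """Extractor reports the URL is complete, so mark it as visited."""
        self.pending.pop(url, None)
        self.visited.add(url)

    def requeue_stale(self):
        """If an extractor died mid-scrape, put its URL back in the queue."""
        now = time.monotonic()
        for url, started in list(self.pending.items()):
            if now - started > self.PENDING_TIMEOUT:
                del self.pending[url]
                self.unvisited.append(url)
```

In practice these three collections would be the visited, unvisited, and pending tables in the database, so a restarted URL Manager can pick up exactly where it left off.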