# Assumptions
- Given a seed of URLs, crawl them.
- It should be possible to crawl a billion+ pages.
- We would like to store all data/content on each page.

# Edge Cases
- What happens if we visit the same page again?
- What happens if we have the same content on different URLs?
- Web page traps? Dynamically created web pages?
- What happens if we end up searching the entire web?
- Dead links?

# Components
- We will need multiple different extractors: one for video, one for images, one for HTML, one for URLs, etc.
- We will need a way to keep track of the visited URLs, so a URL manager component with its own DB for redundancy will be needed.
- We will also need a way to keep track of similar web pages, something like a content manager, also with its own DB.
- Lastly, we will need a data store.

# Extractor / Crawler
Let's start here. We obviously want to be able to scale to as many crawlers as we need, but each crawler's sole responsibility is to scrape whatever data it is meant to scrape.
That can be videos, HTML, or images.
Since we can add or remove any number of crawlers, we should place a load balancer in front of this cluster.
We don't know whether a certain crawler may get many pages that require more processing time than others.
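
As a rough sketch of that dispatch-by-content-type idea (not a full implementation): the per-type extractor functions below are hypothetical placeholders, and a real crawler would also need retries, politeness delays, and robots.txt handling.

```python
import urllib.request

# Hypothetical per-type extractors; real ones would parse and persist content.
def extract_html(url, body):
    print(f"HTML extractor handling {url} ({len(body)} bytes)")

def extract_image(url, body):
    print(f"Image extractor handling {url} ({len(body)} bytes)")

EXTRACTORS = {
    "text/html": extract_html,
    "image/jpeg": extract_image,
    "image/png": extract_image,
}

def crawl(url):
    """Fetch one page and hand it to the extractor matching its content type."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        content_type = resp.headers.get_content_type()
        body = resp.read()
    handler = EXTRACTORS.get(content_type)
    if handler is None:
        return  # no extractor registered for this content type; skip it
    handler(url, body)

if __name__ == "__main__":
    crawl("https://example.com/")
```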

The URL extractor, however, will act slightly differently: it will further process the HTML scraper's result and collect all the URLs on that page.
These will be sent to the URL manager for later distribution.
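
A minimal sketch of that step, using the standard-library HTML parser and resolving relative links against the page's own URL; `send_to_url_manager` is a placeholder for whatever transport (queue, RPC) sits between the extractor and the URL manager.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects absolute URLs from every <a href=...> tag in a page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))

def extract_links(base_url, html):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links

# Placeholder for the hop to the URL manager (e.g. a message-queue producer).
def send_to_url_manager(urls):
    for url in urls:
        print("queueing", url)

if __name__ == "__main__":
    page = '<a href="/about">About</a> <a href="https://example.org/x">X</a>'
    send_to_url_manager(extract_links("https://example.com/", page))
```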

The extractor can keep a queue of the URLs it is given, but you would then need to keep fault tolerance in mind: if the extractor goes down, we will lose whatever was in the queue.
This will be explained below.
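
One common way around losing in-flight URLs is to keep the queue outside the extractor and only acknowledge an item once it has been fully processed. The toy class below is purely to illustrate that acknowledge-after-work pattern, assuming SQS/Kafka-style visibility semantics; in practice this would be an external, durable queue service.

```python
import time

class DurableQueue:
    """Toy stand-in for an external durable queue: items stay 'in flight'
    until acknowledged, so a crashed worker's URL becomes visible again
    after a timeout instead of being lost with the worker."""
    def __init__(self, visibility_timeout=30):
        self.pending = []        # URLs not yet handed to any worker
        self.in_flight = {}      # url -> time it was handed out
        self.visibility_timeout = visibility_timeout

    def put(self, url):
        self.pending.append(url)

    def get(self):
        now = time.time()
        # Re-queue anything a dead worker never acknowledged.
        for url, started in list(self.in_flight.items()):
            if now - started > self.visibility_timeout:
                del self.in_flight[url]
                self.pending.append(url)
        if not self.pending:
            return None
        url = self.pending.pop(0)
        self.in_flight[url] = now
        return url

    def ack(self, url):
        # Only now is the URL truly removed from the queue.
        self.in_flight.pop(url, None)
```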

Lastly, if given a link that leads nowhere, we should ignore it.

# URL Manager
The URL Manager will be given a seed of URLs to start with.
It will keep a set (a table) of the URLs it has visited so far in a database.
When given a URL from the URL extractor, it will check whether this URL has been visited before.
If it hasn't, it will send the URL to the extractor cluster.
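
A minimal sketch of that check-then-dispatch loop; here a plain in-memory set stands in for the URL manager's database table, and `send_to_extractors` is a placeholder for the hop to the extractor cluster's load balancer.

```python
class URLManager:
    """Tracks visited URLs and only forwards unseen ones to the extractors."""
    def __init__(self, seed_urls):
        self.visited = set()  # stand-in for the URL manager's DB table
        for url in seed_urls:
            self.submit(url)

    def submit(self, url):
        if url in self.visited:
            return False  # already crawled (or scheduled); drop it
        self.visited.add(url)
        self.send_to_extractors(url)
        return True

    def send_to_extractors(self, url):
        # Placeholder for dispatching to the extractor cluster.
        print("dispatching", url)

if __name__ == "__main__":
    manager = URLManager(["https://example.com/"])
    manager.submit("https://example.com/")       # ignored: already visited
    manager.submit("https://example.com/about")  # new: dispatched
```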

One important thing to note is that URLs can get very long.
We don't want to store very long URLs in our database.
Instead, we can hash the URLs; however, hashing does not guarantee uniqueness because of collisions.
If we have to visit more pages than there are possible hash codes, this would be a problem.
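
For illustration, hashing each URL down to a fixed-size digest keeps the visited table compact; a 128-bit hash such as MD5 (acceptable here since this is not a security use) makes accidental collisions extremely unlikely even at billions of pages, though, as noted above, the guarantee is probabilistic rather than absolute.

```python
import hashlib

def url_key(url):
    """Fixed 16-byte key for the visited table, regardless of URL length."""
    return hashlib.md5(url.encode("utf-8")).digest()

# A deliberately long URL still maps to a 16-byte key.
long_url = "https://example.com/search?" + "&".join(f"p{i}=v{i}" for i in range(50))
key = url_key(long_url)
print(len(long_url), "bytes of URL ->", len(key), "byte key:", key.hex())
```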

# Fault Tolerance