
Commit eea2efa

Joseph Luce authored: Update web_crawler.md

1 parent 72a19dd commit eea2efa

File tree

1 file changed: +3 additions, -0 deletions

system_design/web_crawler.md

Lines changed: 3 additions & 0 deletions
@@ -41,6 +41,9 @@ We don't want to store very long URLs in our database.
 Instead, we can hash the URLs; however, that is not always a guarantee due to collisions.
 If we have to visit more pages than there are possible hash codes, this would be a problem.
 
+You can have webpages that link to other websites, so your crawlers can end up crawling the entire internet.
+It is up to the URL Manager to keep a set of allowed domains to compare against when deciding whether a URL is valid.
+
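The added lines describe two duties of the URL Manager: deduplicating URLs by hash and restricting the crawl to an allowed set of domains. A minimal sketch of that idea might look like the following (the class and method names are hypothetical, not from the original document):

```python
import hashlib
from urllib.parse import urlparse

class URLManager:
    """Hypothetical sketch: stores fixed-size URL hashes instead of
    full URLs, and only accepts URLs whose domain is in an allowed
    set, so the crawler does not wander across the entire internet."""

    def __init__(self, allowed_domains):
        self.allowed_domains = set(allowed_domains)
        self.seen_hashes = set()  # hashes of URLs already scheduled

    def is_valid(self, url):
        # Compare the URL's domain against the allowed set.
        return urlparse(url).netloc in self.allowed_domains

    def add(self, url):
        """Return True if the URL is in scope and not seen before."""
        if not self.is_valid(url):
            return False
        # SHA-256 gives a 32-byte digest; collisions are astronomically
        # unlikely, but as the text notes, hashing is not an absolute
        # guarantee if we visit more pages than there are hash codes.
        digest = hashlib.sha256(url.encode()).digest()
        if digest in self.seen_hashes:
            return False
        self.seen_hashes.add(digest)
        return True
```

The domain check and the hash-based dedup are deliberately separate, since the text assigns the domain comparison to the URL Manager as its own decision step.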
 # Content Manager
 We don't want to store a bunch of pages with the same set of data.
 So we need a way to filter the content that we receive after we have scraped the data.
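One common way to filter out pages with identical content, sketched here under the same hashing idea the document applies to URLs (the class name is hypothetical):

```python
import hashlib

class ContentFilter:
    """Hypothetical sketch of the Content Manager's duplicate filter:
    hash each scraped page body and skip bodies we have already seen,
    so identical pages are only stored once."""

    def __init__(self):
        self.seen = set()

    def is_new(self, page_body: str) -> bool:
        """Return True the first time a given body is seen."""
        digest = hashlib.sha256(page_body.encode()).hexdigest()
        if digest in self.seen:
            return False
        self.seen.add(digest)
        return True
```

Note this only catches byte-identical pages; near-duplicate detection would need a fuzzier technique such as shingling or SimHash.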
