This repository has been archived by the owner on Apr 16, 2020. It is now read-only.

Common Crawl #162

Open
ghost opened this issue Nov 12, 2017 · 1 comment

Comments

ghost commented Nov 12, 2017

https://commoncrawl.org/

"We build and maintain an open repository of web crawl data that can be accessed and analyzed by anyone."

I'm not sure how much data it is, but certainly a few TB.

ghost commented Nov 12, 2017

Oh:

"The crawl archive for October 2017 is now available! The archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2017-43/. It contains 3.65 billion web pages and over 300 TiB of uncompressed content."
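
For a sense of scale, here is a rough back-of-the-envelope sketch in Python using only the figures quoted above. The helper names are made up for illustration, the `s3://` form assumes the "commoncrawl bucket" is an S3 bucket, and because the announcement says "over 300 TiB" the results are lower bounds rather than exact values.

```python
# Figures taken from the October 2017 announcement quoted above.
PAGES = 3.65e9             # "3.65 billion web pages"
UNCOMPRESSED_TIB = 300     # "over 300 TiB", so treat this as a lower bound

BUCKET = "commoncrawl"                   # bucket name from the announcement
PREFIX = "crawl-data/CC-MAIN-2017-43/"   # crawl prefix from the announcement


def tib_to_bytes(tib):
    """Convert tebibytes (2**40 bytes each) to bytes."""
    return tib * 2**40


def tib_to_tb(tib):
    """Convert tebibytes to decimal terabytes (10**12 bytes each)."""
    return tib_to_bytes(tib) / 10**12


def avg_bytes_per_page(tib, pages):
    """Average uncompressed bytes per page for the given totals."""
    return tib_to_bytes(tib) / pages


def s3_uri(bucket, prefix):
    """Build an s3:// URI (assumes the bucket is hosted on S3)."""
    return f"s3://{bucket}/{prefix}"


if __name__ == "__main__":
    print(s3_uri(BUCKET, PREFIX))
    print(f"{tib_to_tb(UNCOMPRESSED_TIB):.1f} TB uncompressed (at least)")
    per_page_kib = avg_bytes_per_page(UNCOMPRESSED_TIB, PAGES) / 1024
    print(f"{per_page_kib:.1f} KiB per page on average (at least)")
```

So this single monthly crawl is roughly 330 TB uncompressed, or on the order of 90 KB per page, which is well beyond "a few TB".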
