This repository has been archived by the owner on Apr 16, 2020. It is now read-only.

Common Crawl #162

Open

ghost opened this issue Nov 12, 2017 · 1 comment

ghost commented Nov 12, 2017

https://commoncrawl.org/

"We build and maintain an open repository of web crawl data that can be accessed and analyzed by anyone."

I'm not sure how much data it is, but certainly a few TB.

ghost commented Nov 12, 2017

Oh:

"The crawl archive for October 2017 is now available! The archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2017-43/. It contains 3.65 billion web pages and over 300 TiB of uncompressed content."
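
For a sense of scale, here is a rough back-of-the-envelope sketch in Python using only the figures quoted in the announcement. The `s3://` URI form and the helper names (`s3_uri`, `avg_page_kib`) are my own assumptions for illustration, not something the announcement specifies, and since the size is given as "over 300 TiB" the result is only a lower bound.

```python
# Rough sizing sketch based only on the figures quoted in the announcement.
BUCKET = "commoncrawl"                  # bucket name from the announcement
PREFIX = "crawl-data/CC-MAIN-2017-43/"  # crawl prefix from the announcement
PAGES = 3.65e9                          # "3.65 billion web pages"
UNCOMPRESSED_TIB = 300                  # "over 300 TiB", so a lower bound


def s3_uri(bucket: str, prefix: str) -> str:
    """Build an s3:// style URI (assumed form) for the crawl prefix."""
    return f"s3://{bucket}/{prefix}"


def avg_page_kib(total_tib: float, pages: float) -> float:
    """Average uncompressed size per page in KiB (1 TiB = 2**30 KiB)."""
    return total_tib * 2**30 / pages


print(s3_uri(BUCKET, PREFIX))
print(f"at least {avg_page_kib(UNCOMPRESSED_TIB, PAGES):.1f} KiB per page on average")
```

So a single monthly crawl works out to somewhere around 88 KiB of uncompressed content per page, and is far beyond "a few TB".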
