Mining of the CommonCrawl Corpus
I mined the CommonCrawl corpus with Spark. My experimental source code is available. I experimented with several ways of mining links in documents but in the end settled on something relatively simple: for a given URL, show which pages link to it.
This webapp shows the results. There are two limitations. First, to keep the prototype tractable, I only include links that point to the domain mit.edu. Second, I have only mined the first two valid segments of CommonCrawl.
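To make the idea concrete, here is a minimal sketch of such a "which pages link to this URL" index, written in plain Python rather than Spark. It is an illustration under assumptions, not the experimental source code mentioned above: the function names, the regex-based link extraction, and the input shape (an iterable of `(page_url, html)` pairs) are all hypothetical, and a real job would read CommonCrawl records and use a proper HTML parser.

```python
import re
from collections import defaultdict
from urllib.parse import urljoin, urlparse

# Assumption: a simple regex is good enough to pull href values out of
# anchor tags for this illustration. It will miss unquoted attributes.
HREF_RE = re.compile(r'<a\s[^>]*?href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)


def extract_links(page_url, html):
    """Yield absolute link targets found in one page (hypothetical helper)."""
    for href in HREF_RE.findall(html):
        # Resolve relative links against the URL of the page they appear on.
        yield urljoin(page_url, href)


def in_domain(url, domain):
    """True if the URL's host is the domain itself or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return host == domain or host.endswith("." + domain)


def build_backlink_index(pages, domain="mit.edu"):
    """Map each target URL in `domain` to the sorted list of pages linking to it.

    `pages` is an iterable of (page_url, html) pairs; this input shape is an
    assumption made for the sketch.
    """
    index = defaultdict(set)
    for page_url, html in pages:
        for target in extract_links(page_url, html):
            if in_domain(target, domain):
                index[target].add(page_url)
    return {target: sorted(sources) for target, sources in index.items()}
```

Looking up a URL in the returned dictionary gives the pages that link to it. In a distributed setting the same shape would plausibly become a flat map from pages to (target, source) pairs, a filter on the target's domain, and a grouping by target, but that mapping is a guess at the structure and not a description of the actual job.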