content-discovery-hit-lists

This repository contains hit lists to use for web application content discovery.

Common Crawl

Hit lists mined from Common Crawl data can be found in the common-crawl directory. Within that directory, lists are grouped by the Common Crawl data set they were generated from. These hit lists were generated using the LavaHadoopCrawlAnalysis and lava-hadoop-processing projects.

The syntax of the hit list file paths is as follows:

common-crawl/<common crawl data set>/<server type>/hit_list_<coverage percentage>
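As a rough illustration of this layout, a path could be split into its parts along the following lines. This is only a sketch: the `HitListPath` type and `parse_hit_list_path` function are hypothetical names and are not part of this repository.

```python
from typing import NamedTuple


class HitListPath(NamedTuple):
    """Components of a hit list file path (hypothetical helper type)."""
    data_set: str      # e.g. "CC-MAIN-2014-49"
    server_type: str   # e.g. "apache_generic"
    coverage: float    # coverage percentage, e.g. 99.9


def parse_hit_list_path(path: str) -> HitListPath:
    """Split <root>/<data set>/<server type>/hit_list_<coverage> into parts.

    The root directory name is not validated; only the four-part layout
    and the "hit_list_" file name prefix are checked.
    """
    parts = path.strip("/").split("/")
    if len(parts) != 4:
        raise ValueError(f"expected 4 path components, got {len(parts)}: {path!r}")
    _root, data_set, server_type, file_name = parts

    prefix = "hit_list_"
    if not file_name.startswith(prefix):
        raise ValueError(f"file name does not start with {prefix!r}: {file_name!r}")

    # Everything after the prefix is read as the coverage percentage.
    coverage = float(file_name[len(prefix):])
    return HitListPath(data_set, server_type, coverage)
```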

For instance, take the following file:

common-crawl/CC-MAIN-2014-49/apache_generic/hit_list_99.9

The contents of this file were generated from the CC-MAIN-2014-49 Common Crawl data set and form a hit list for Apache servers that do not specify an operating system. The file lists the most common URL paths associated with generic Apache servers in the CC-MAIN-2014-49 crawl data, in descending order of frequency. Together, these URL paths represent 99.9% of all observed URL paths.
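To make the idea of a coverage percentage concrete, here is a minimal sketch of how such a list could be derived from raw path observations. It assumes that "coverage" means the share of all path observations accounted for by the listed paths. The actual lists were produced by the LavaHadoopCrawlAnalysis and lava-hadoop-processing projects, whose logic is not shown here and may differ. The `build_hit_list` function is a hypothetical name.

```python
from collections import Counter
from typing import Iterable, List


def build_hit_list(paths: Iterable[str], coverage_pct: float) -> List[str]:
    """Return the most common paths, in descending order of frequency,
    until they account for at least coverage_pct percent of all observations.

    Assumption: coverage is measured over observation counts, not over
    the number of distinct paths.
    """
    counts = Counter(paths)
    total = sum(counts.values())

    hit_list: List[str] = []
    covered = 0
    for path, count in counts.most_common():
        # Stop once the paths collected so far reach the requested coverage.
        # Comparing scaled integers avoids floating-point division error.
        if covered * 100 >= coverage_pct * total:
            break
        hit_list.append(path)
        covered += count
    return hit_list


if __name__ == "__main__":
    # Made-up observations, for illustration only.
    observed = ["/"] * 6 + ["/index.html"] * 3 + ["/robots.txt"]
    for path in build_hit_list(observed, 90):
        print(path)
```

With the made-up observations above, a 90% cutoff keeps the two most frequent paths and drops the one that was seen only once.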

More details will be available via a blog post on lavalamp's personal blog in the near future.