content-discovery-hit-lists

This repository contains hit lists to use for web application content discovery.

Common Crawl

Hit lists associated with mining Common Crawl data can be found in the common-crawl directory. Underneath the common-crawl directory, lists are separated out by which Common Crawl data set the list was generated from. These hit lists were generated using the LavaHadoopCrawlAnalysis and lava-hadoop-processing projects.

The syntax of the hit list file paths is as follows:

common-crawl/<common crawl data set>/<server type>/hit_list_<coverage percentage>

For instance, take the following file:

common-crawl/CC-MAIN-2014-49/apache_generic/hit_list_99.9

The contents of this file were generated from the CC-MAIN-2014-49 Common Crawl data set and they comprise a hit list for Apache servers that do not specify an operating system. The URL paths found within the file are the most common URL paths (in descending order) associated with generic Apache servers as found in the CC-MAIN-2014-49 crawl data. In total, the URL paths represent 99.9% of all observed URL paths.

More details will be available via a blog post on lavalamp's personal blog in the near future.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
commoncrawl/CC-MAIN-2014-49		commoncrawl/CC-MAIN-2014-49
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

commoncrawl/CC-MAIN-2014-49

commoncrawl/CC-MAIN-2014-49

.gitignore

.gitignore

LICENSE

LICENSE

README.md

README.md

Repository files navigation

content-discovery-hit-lists

Common Crawl

About

Releases

Packages

Languages

License

lavalamp-/content-discovery-hit-lists

Folders and files

Latest commit

History

Repository files navigation

content-discovery-hit-lists

Common Crawl

About

Resources

License

Stars

Watchers

Forks

Languages