lavalamp-/content-discovery-hit-lists

content-discovery-hit-lists

This repository contains hit lists to use for web application content discovery.

Common Crawl

Hit lists mined from Common Crawl data can be found in the common-crawl directory. Within that directory, lists are grouped by the Common Crawl data set they were generated from. These hit lists were generated using the LavaHadoopCrawlAnalysis and lava-hadoop-processing projects.

The syntax of the hit list file paths is as follows:

common-crawl/<common crawl data set>/<server type>/hit_list_<coverage percentage>
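
As a rough illustration, a path that follows this syntax could be split into its parts with a small helper. This is only a sketch: the names `HitListPath` and `parse_hit_list_path` are hypothetical and are not part of this repository or of the projects that generated the lists.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class HitListPath(NamedTuple):
    """Components of a hit list file path (hypothetical helper type)."""
    data_set: str            # e.g. a Common Crawl data set name
    server_type: str         # e.g. a server type directory name
    coverage_percent: float  # the number after "hit_list_"


def parse_hit_list_path(path: str) -> HitListPath:
    """Split a path of the form
    common-crawl/<data set>/<server type>/hit_list_<coverage percentage>.
    """
    parts = PurePosixPath(path).parts
    if len(parts) != 4 or parts[0] != "common-crawl":
        raise ValueError(f"unexpected hit list path: {path!r}")

    _, data_set, server_type, file_name = parts
    prefix = "hit_list_"
    if not file_name.startswith(prefix):
        raise ValueError(f"unexpected hit list file name: {file_name!r}")

    # Everything after the prefix is assumed to be a decimal percentage.
    coverage_percent = float(file_name[len(prefix):])
    return HitListPath(data_set, server_type, coverage_percent)
```

A helper like this could be used to pick lists by server type or coverage when walking the common-crawl directory.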

For instance, take the following file:

common-crawl/CC-MAIN-2014-49/apache_generic/hit_list_99.9

The contents of this file were generated from the CC-MAIN-2014-49 Common Crawl data set and comprise a hit list for Apache servers that do not specify an operating system. The URL paths in the file are the most common URL paths (in descending order of frequency) associated with generic Apache servers in the CC-MAIN-2014-49 crawl data. Together, the listed paths account for 99.9% of all observed URL paths.
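
To make the coverage percentage concrete, here is a minimal sketch of how a list with this property could be derived from a collection of observed URL paths. It assumes that coverage means the cumulative share of path occurrences, which is an interpretation of the description above. The function `build_hit_list` is hypothetical; the actual lists were produced by the Hadoop-based projects mentioned earlier, not by this code.

```python
from collections import Counter
from typing import Iterable, List


def build_hit_list(observed_paths: Iterable[str], coverage_percent: float) -> List[str]:
    """Return the most common paths, in descending order of frequency,
    until they cover at least coverage_percent of all observations.

    Assumption: "coverage" is the cumulative share of path occurrences.
    """
    counts = Counter(observed_paths)
    total = sum(counts.values())

    hit_list: List[str] = []
    covered = 0
    # most_common() yields (path, count) pairs, most frequent first.
    for path, count in counts.most_common():
        # Stop once the paths collected so far reach the target coverage.
        if covered * 100 >= coverage_percent * total:
            break
        hit_list.append(path)
        covered += count
    return hit_list


# Toy data, invented purely for illustration.
observed = ["/index.html"] * 6 + ["/robots.txt"] * 3 + ["/admin"] * 1
print(build_hit_list(observed, 90))
```

With the toy data above, the two most frequent paths already account for 9 of the 10 observations, so a 90% list stops before the rarest path. Higher coverage values yield longer lists that include progressively rarer paths.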

More details will be available via a blog post on lavalamp's personal blog in the near future.
