This repository has been archived by the owner on Jul 8, 2020. It is now read-only.

HTTPArchive/hosts

Host scanner

Ingests a list of hostnames and runs the following checks:

  • Fetches the http:// site and checks for timeouts & errors
  • Fetches the https:// site
  • Records:
    • Results of the TLS handshake, if applicable
    • Final redirect destination of the http:// and https:// fetches above
    • "HTTPS-only" status
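
The checks listed above could be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the `FetchResult` and `HostResult` types, the `fetch` and `isHTTPSOnly` helpers, the 10-second timeout, and in particular the definition of "HTTPS-only" (the https:// fetch succeeds and the http:// fetch ends on an https:// URL) are all assumptions made for this example.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"os"
	"time"
)

// FetchResult is a hypothetical record of one fetch attempt.
type FetchResult struct {
	FinalURL    string // URL of the last request after following redirects
	Status      int    // HTTP status code of the final response
	TLSVersion  uint16 // negotiated TLS version, 0 if the final hop was not TLS
	CipherSuite uint16 // negotiated cipher suite, 0 if the final hop was not TLS
	Err         string // timeout or other error, empty on success
}

// HostResult is a hypothetical per-host record combining both fetches.
type HostResult struct {
	Host      string
	HTTP      FetchResult
	HTTPS     FetchResult
	HTTPSOnly bool
}

// fetch requests rawURL, follows redirects (the http.Client default),
// and records where the request ended up and any TLS handshake details.
func fetch(client *http.Client, rawURL string) FetchResult {
	resp, err := client.Get(rawURL)
	if err != nil {
		return FetchResult{Err: err.Error()}
	}
	defer resp.Body.Close()

	res := FetchResult{
		FinalURL: resp.Request.URL.String(),
		Status:   resp.StatusCode,
	}
	if resp.TLS != nil {
		res.TLSVersion = resp.TLS.Version
		res.CipherSuite = resp.TLS.CipherSuite
	}
	return res
}

// isHTTPSOnly applies the assumed definition: the https:// fetch worked
// and the http:// fetch was redirected to an https:// URL.
func isHTTPSOnly(httpRes, httpsRes FetchResult) bool {
	if httpRes.Err != "" || httpsRes.Err != "" {
		return false
	}
	u, err := url.Parse(httpRes.FinalURL)
	return err == nil && u.Scheme == "https"
}

func main() {
	// The timeout covers the whole exchange, so slow hosts surface as errors.
	client := &http.Client{Timeout: 10 * time.Second}
	for _, host := range os.Args[1:] {
		r := HostResult{Host: host}
		r.HTTP = fetch(client, "http://"+host+"/")
		r.HTTPS = fetch(client, "https://"+host+"/")
		r.HTTPSOnly = isHTTPSOnly(r.HTTP, r.HTTPS)
		fmt.Printf("%+v\n", r)
	}
}
```

In this sketch a host whose http:// fetch fails outright is reported as not HTTPS-only, because the redirect behaviour cannot be observed; a real scanner may choose to classify that case differently.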

Installing & Running

Install Go Version Manager (gvm): https://github.com/moovweb/gvm

 # install and select Go 1.7.3 or newer, then build
 $> gvm install go1.7.3
 $> gvm use go1.7.3
 $> go build

 # run with custom number of workers and output file
 $> ./hosts -workers=10 -output=results.json < /path/to/list-of-domains 2> output.log

 # run with verbose debug output
 $> DEBUG=true ./hosts -workers=10 -output=results.json < /path/to/list-of-domains 2> output.log

 # read the domain list from a zip archive, with verbose debug output
 $> unzip -p list-of-domains.zip | DEBUG=true ./hosts -workers=2
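
The commands above imply a program that reads hostnames from stdin, spreads them over a configurable number of workers, and writes results to a file. One way such a worker pool might be structured is sketched below. Everything beyond the `-workers` and `-output` flag names and the `DEBUG` variable is an assumption: the `Result` type, the `readHosts` and `runWorkers` helpers, the flag defaults, the placeholder check, and the one-JSON-object-per-line output are hypothetical and not taken from the repository.

```go
package main

import (
	"bufio"
	"encoding/json"
	"flag"
	"io"
	"log"
	"os"
	"strings"
	"sync"
)

// Result is a hypothetical per-host record; the real output schema is not
// described in this README.
type Result struct {
	Host string `json:"host"`
	OK   bool   `json:"ok"`
}

// readHosts returns the trimmed, non-empty lines of r, one hostname per line.
func readHosts(r io.Reader) ([]string, error) {
	var hosts []string
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		if h := strings.TrimSpace(sc.Text()); h != "" {
			hosts = append(hosts, h)
		}
	}
	return hosts, sc.Err()
}

// runWorkers fans hosts out to n goroutines and collects one Result per host.
// Results are returned in completion order, not input order.
func runWorkers(hosts []string, n int, check func(string) Result) []Result {
	if n < 1 {
		n = 1
	}
	jobs := make(chan string)
	out := make(chan Result)

	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for h := range jobs {
				out <- check(h)
			}
		}()
	}

	// Feed the workers, then close the output once all of them have finished.
	go func() {
		for _, h := range hosts {
			jobs <- h
		}
		close(jobs)
		wg.Wait()
		close(out)
	}()

	var results []Result
	for r := range out {
		results = append(results, r)
	}
	return results
}

func main() {
	workers := flag.Int("workers", 10, "number of concurrent workers (default is an assumption)")
	output := flag.String("output", "results.json", "output file (default is an assumption)")
	flag.Parse()
	debug := os.Getenv("DEBUG") == "true"

	hosts, err := readHosts(os.Stdin)
	if err != nil {
		log.Fatal(err)
	}

	// Placeholder check: a real scanner would fetch the host here.
	results := runWorkers(hosts, *workers, func(h string) Result {
		if debug {
			log.Printf("checking %s", h) // log writes to stderr
		}
		return Result{Host: h, OK: true}
	})

	f, err := os.Create(*output)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Write one JSON object per line.
	enc := json.NewEncoder(f)
	for _, r := range results {
		if err := enc.Encode(r); err != nil {
			log.Fatal(err)
		}
	}
}
```

Because the `log` package writes to stderr by default, debug output in this sketch ends up wherever `2> output.log` points, separate from the results file.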

Dataflow pipeline

  1. Follow the steps to install the Apache Beam SDK.
  2. Create and activate a virtualenv, install the Dataflow SDK into it, and run the Dataflow job:
$> virtualenv /path/to/dataflow-env
$> . /path/to/dataflow-env/bin/activate
$> pip install google-cloud-dataflow
$> python dataflow.py --input gs://httparchive/urls/<input-file> --output project:dataset.tablename
