This project takes a list of websites and records which HTTP headers those sites return, to see which headers are popular. It is based on a project from Summer 2015; this repo contains the more interesting parts.
Have a look at sql/schema.sql to get an idea of how the data is stored once collected. If you're interested in event loops, [this post is worth a read]
- Set up the database and user:
    mysql> CREATE DATABASE popular_headers;
    mysql> CREATE USER 'ph_user'@'127.0.0.1' IDENTIFIED BY 'your-secure-password';
    mysql> GRANT ALL PRIVILEGES ON popular_headers.* TO 'ph_user'@'127.0.0.1';
- Load the schema:
    mysql -h 127.0.0.1 -u ph_user -p popular_headers < sql/schema.sql
- Update the configuration file, etc/config.yaml, with your database credentials.
- Pipe a list of sites in to bin/gather-headers:
    wget http://s3.amazonaws.com/alexa-static/top-1m.csv.zip
    unzip top-1m.csv.zip
    head -n 10000 top-1m.csv | bin/gather-headers
- Query the DB for results.
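
To see which headers are most popular, a query along these lines could be piped into mysql. This is only a sketch: the headers table and name column are assumed names, so check sql/schema.sql for the real ones, and top_headers_query is a hypothetical helper rather than a script in this repo.

```shell
# Print a SQL query that counts how often each header name occurs.
# ASSUMPTION: a table called `headers` with a `name` column; the real
# names are defined in sql/schema.sql and may differ.
top_headers_query() {
    limit=${1:-20}   # number of rows to return, 20 by default
    cat <<SQL
SELECT name, COUNT(*) AS occurrences
FROM headers
GROUP BY name
ORDER BY occurrences DESC
LIMIT $limit;
SQL
}
```

For example, to list the ten most common header names:

    top_headers_query 10 | mysql -h 127.0.0.1 -u ph_user -p popular_headers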
On Ubuntu, the only extras you'll need are a few additional Perl libraries:
    # libmojolicious-perl: for Mojo::UserAgent
    # libconfig-yaml-perl: for YAML
    # libreadonly-perl:    for Readonly
    sudo apt-get install libmojolicious-perl libconfig-yaml-perl libreadonly-perl
Each script and Perl module has perldoc documentation that should be up to date and contain more information than this README.
The default schema stores each header value in a VARCHAR(750) column, so any value longer than 750 characters will be truncated. Bear this in mind when interpreting the results.
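
As a rough way to check whether a site returns values that would hit this limit, the sketch below reads raw response headers on stdin and prints the name and length of any value over 750 characters. The long_headers function and MAX_LEN variable are illustrative names, not part of this repo, and awk may count bytes rather than characters depending on the implementation and locale.

```shell
# Column width taken from the default schema: VARCHAR(750).
MAX_LEN=750

# Read "Name: value" header lines on stdin and print "<name> <length>"
# for every value longer than MAX_LEN.
long_headers() {
    awk -v max="$MAX_LEN" '
        {
            sub(/\r$/, "")              # strip the CR from HTTP line endings
            sep = index($0, ":")
            if (sep == 0) next          # skip the status line and blank lines
            name  = substr($0, 1, sep - 1)
            value = substr($0, sep + 1)
            sub(/^[ \t]+/, "", value)   # drop whitespace after the colon
            if (length(value) > max) print name, length(value)
        }'
}
```

It can be fed from any source of raw headers, for example:

    curl -sI https://example.com | long_headers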