This project aims to take in a list of website and see what HTTP headers those sites are returning, looking for what's popular. Based on a project from Summer 2015, this repo represents the more interesting parts.
Have a look at the sql/schema.sql
to get an idea of how the data is stored
once collected. If you're interested in Event Loops then [this post is worth a read]
(https://falkus.co/2016/04/using-an-event-loop-for-multiple-http-requests/).
- Setup the database and user:
mysql> CREATE DATABASE popular_headers; mysql> CREATE USER 'ph_user'@'127.0.0.1' IDENTIFIED BY 'your-secure-password'; mysql> GRANT ALL PRIVILEGES ON popular_headers.* TO 'ph_user'@'127.0.0.1';
- Load the schema,
mysql -u ph_user -p popular_headers < sql/schema.sql
- Update the configuration file,
etc/config.yaml
, with your database credentials - Pipe a list of sites in to
bin/gather-headers
.
wget http://s3.amazonaws.com/alexa-static/top-1m.csv.zip unzip top-1m.csv.zip cat top-1m.csv | head -n 10000 | bin/gather-headers
- Query the DB for results.
Running from Ubuntu, the only extras you'll need are a few additional perl libraries:
sudo apt-get install \
libmojolicious-perl # for Mojo::UserAgent
libconfig-yaml-perl # for YAML
libreadonly-perl # for Readonly
Each script and perl module has perldoc documentation that should be up-to-date and contain more information than this README.
The default schema stores the header value in a VARCHAR(750)
. This means that
really long header values over this length will be truncated, something to
bear in mind.