Skip to content
master
Switch branches/tags
Code

Latest commit

 

Git stats

Files

Permalink
Failed to load latest commit information.
Type
Name
Latest commit message
Commit time
bin
 
 
etc
 
 
sql
 
 
 
 
 
 
 
 

This project aims to take in a list of website and see what HTTP headers those sites are returning, looking for what's popular. Based on a project from Summer 2015, this repo represents the more interesting parts.

Have a look at the sql/schema.sql to get an idea of how the data is stored once collected. If you're interested in Event Loops then [this post is worth a read] (https://falkus.co/2016/04/using-an-event-loop-for-multiple-http-requests/).

Running

  • Setup the database and user:
mysql> CREATE DATABASE popular_headers;
mysql> CREATE USER 'ph_user'@'127.0.0.1' IDENTIFIED BY 'your-secure-password';
mysql> GRANT ALL PRIVILEGES ON popular_headers.* TO 'ph_user'@'127.0.0.1';
  • Load the schema, mysql -u ph_user -p popular_headers < sql/schema.sql
  • Update the configuration file, etc/config.yaml, with your database credentials
  • Pipe a list of sites in to bin/gather-headers.
    wget http://s3.amazonaws.com/alexa-static/top-1m.csv.zip
    unzip top-1m.csv.zip
    cat top-1m.csv | head -n 10000 | bin/gather-headers
  • Query the DB for results.

Requirements

Running from Ubuntu, the only extras you'll need are a few additional perl libraries:

sudo apt-get install \
    libmojolicious-perl # for Mojo::UserAgent
    libconfig-yaml-perl # for YAML
    libreadonly-perl    # for Readonly

Notes

Each script and perl module has perldoc documentation that should be up-to-date and contain more information than this README.

The default schema stores the header value in a VARCHAR(750). This means that really long header values over this length will be truncated, something to bear in mind.

About

Because HTTP headers are interesting.

Resources

License

Releases

No releases published

Packages

No packages published

Languages