There is a blog post that accompanies this code base.
The following allows you to run this code base locally. These commands were tested on a fresh installation of Ubuntu 14.04.3 LTS.
Various dependencies are needed; the following will install them via Debian packages.
$ sudo apt-get update
$ sudo apt-get install -y \
    default-jre \
    jq \
    python-dev \
    python-pip \
    python-virtualenv \
    redis-server \
    zip \
    zookeeperd
The following will install and launch a known-good version of Kafka.
$ cd /tmp
$ curl -O http://mirror.cc.columbia.edu/pub/software/apache/kafka/0.8.2.1/kafka_2.11-0.8.2.1.tgz
$ tar -xzf kafka_2.11-0.8.2.1.tgz
$ cd kafka_2.11-0.8.2.1/
$ nohup bin/kafka-server-start.sh \
    config/server.properties \
    > ~/kafka.log 2>&1 &
$ export PATH="$(pwd)/bin:$PATH"
The following will create the results and metrics topics in Kafka.
$ kafka-topics.sh \
    --zookeeper 127.0.0.1:2181 \
    --create \
    --partitions 1 \
    --replication-factor 1 \
    --topic results
$ kafka-topics.sh \
    --zookeeper 127.0.0.1:2181 \
    --create \
    --partitions 1 \
    --replication-factor 1 \
    --topic metrics
The following will launch Redis.
$ redis-server &
The following will create a virtual environment and install various Python-based dependencies.
$ virtualenv .ips
$ source .ips/bin/activate
$ pip install -r requirements.txt -r requirements-dev.txt
The following will bootstrap a local database.
$ cd ips
$ python manage.py migrate
The following will generate the 4.7 million seed IP addresses that will be used by the workers.
$ python manage.py gen_ips
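How the seeds are chosen isn't spelled out here, but for a rough sense of the scale: sampling one address from every /22 block covers the IPv4 space with about 4.2 million seeds, in the same ballpark as the 4.7 million figure above. The function and prefix length below are illustrative, not taken from gen_ips:

```python
import ipaddress
from itertools import islice

def gen_seed_ips(prefix_len=22):
    """Yield one address per block of the given prefix length.

    Hypothetical scheme for illustration only; the real gen_ips
    command may choose its seeds differently.
    """
    for net in ipaddress.ip_network('0.0.0.0/0').subnets(new_prefix=prefix_len):
        yield str(net.network_address + 1)

print(list(islice(gen_seed_ips(), 3)))  # ['0.0.0.1', '0.0.4.1', '0.0.8.1']
```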
The following sets the coordinator's IP address; when running everything locally this is 127.0.0.1.
$ python manage.py set_config 127.0.0.1
The following launches the web interface for the coordinator.
$ python manage.py runserver &
The following launches the lookup workers, the telemetry reporter and the process that collects IP addresses from the coordinator.
$ python manage.py celeryd --concurrency=5 &
$ python manage.py celerybeat &
$ python manage.py get_ips_from_coordinator &
The following launches the process that collects the WHOIS records and stores unique CIDR blocks in Redis.
$ python manage.py collect_whois &
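Storing each WHOIS result's CIDR block means that later addresses falling inside an already-seen block can skip the slow WHOIS query. A minimal in-memory sketch of that check, using a plain set where the real process uses Redis:

```python
import ipaddress

# A plain set stands in for Redis here; the real collect_whois
# process stores the CIDR strings in Redis instead.
seen_cidrs = set()

def record_whois_result(cidr_str):
    """Store the CIDR block returned by a WHOIS query."""
    seen_cidrs.add(cidr_str)

def needs_lookup(ip_str):
    """Return True if no stored CIDR block already covers this IP."""
    ip = ipaddress.ip_address(ip_str)
    return not any(ip in ipaddress.ip_network(c) for c in seen_cidrs)

record_whois_result('8.8.8.0/24')
print(needs_lookup('8.8.8.8'))   # False: already covered, skip the query
print(needs_lookup('1.2.3.4'))   # True: still needs a WHOIS query
```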
To see aggregated telemetry:
$ python manage.py telemetry
If you want to monitor Celery's activity, run the following:
$ watch 'python manage.py celery inspect stats'
To see the results of successful WHOIS queries:
$ kafka-console-consumer.sh \
    --zookeeper localhost:2181 \
    --topic results \
    --from-beginning
To continuously dump results to a file:
$ kafka-console-consumer.sh \
    --zookeeper localhost:2181 \
    --topic results \
    --from-beginning > output &
To see per-minute metrics from the workers:
$ kafka-console-consumer.sh \
    --zookeeper localhost:2181 \
    --topic metrics \
    --from-beginning
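The metrics topic carries per-minute counters emitted by each worker. The exact message schema isn't shown here, so the field names below ('host', 'lookups') are assumptions, but aggregating the stream amounts to summing per host:

```python
import json

# Illustrative messages; the real schema may use different field names.
lines = [
    '{"host": "worker1", "lookups": 120}',
    '{"host": "worker2", "lookups": 95}',
    '{"host": "worker1", "lookups": 130}',
]

totals = {}
for line in lines:
    msg = json.loads(line)
    totals[msg['host']] = totals.get(msg['host'], 0) + msg['lookups']

print(totals)  # {'worker1': 250, 'worker2': 95}
```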
To run Ansible against a cloud service, first create an inventory file like the following.
$ vi devops/inventory
[coordinator]
coord1 ansible_host=x.x.x.x ansible_user=ubuntu ansible_private_key_file=~/.ssh/ec2.pem
[worker]
worker1 ansible_host=x.x.x.x ansible_user=ubuntu ansible_private_key_file=~/.ssh/ec2.pem
worker2 ansible_host=x.x.x.x ansible_user=ubuntu ansible_private_key_file=~/.ssh/ec2.pem
worker3 ansible_host=x.x.x.x ansible_user=ubuntu ansible_private_key_file=~/.ssh/ec2.pem
To provision and deploy, run:
$ zip -r \
    app.zip \
    ips/ *.txt \
    -x "*.sqlite3" \
    -x "*.pid" \
    -x "*.pyc"
$ cd devops
$ ansible-playbook bootstrap.yml