Basic moving parts, including the web application, the worker pool, and scheduled tasks.
Responsible for running jobs on demand in response to pull requests and changes in the GitHub repository, and for publicly displaying status of the OpenAddresses data set.
This Python + Flask application is the center of the OpenAddresses Machine. Webhook maintains a connection to the database and queue, listens for new CI jobs from Github event hooks on the OpenAddresses repository, queues new source runs, and displays results of batch sets over time.
- Run from a Procfile using gunicorn.
- Triggered from a Github event hook on the OpenAddresses repository.
- Flask code can be found in `openaddr/ci/webhooks.py` and in `openaddr/ci/webapi.py`.
- Public URL at `results.openaddresses.io`.
- Lives on a long-running, 24×7 EC2 `t2.small` instance.
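As a rough illustration of what Webhook does when a GitHub push event arrives, the hypothetical sketch below extracts the source files touched by a push payload; the real handler lives in `openaddr/ci/webhooks.py` and does considerably more.

```python
import json

def changed_source_paths(payload):
    """Collect source files touched by a GitHub push event payload.

    Hypothetical sketch: the real logic is in openaddr/ci/webhooks.py.
    """
    paths = set()
    for commit in payload.get("commits", []):
        for key in ("added", "modified"):
            for path in commit.get(key, []):
                if path.startswith("sources/") and path.endswith(".json"):
                    paths.add(path)
    return sorted(paths)

# Example push payload, trimmed to the fields this sketch reads.
event = json.loads('''{"commits": [
    {"added": ["sources/us/ca/berkeley.json"],
     "modified": ["README.md", "sources/fr/paris.json"]}
]}''')
print(changed_source_paths(event))
```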
Does the actual work of running a source and producing output files.
This Python script accepts new source runs from the `tasks` queue, converts them into output Zip archives with CSV files, uploads those to S3, and notifies the dequeuer via the `due`, `done`, and `heartbeat` queues. Worker is single-threaded, and intended to be run in parallel on multiple instances. Worker uses EC2 auto-scaling to respond to increased demand by launching new instances. One worker is kept alive at all times on the same EC2 instance as Webhook.
The actual work is done in a separate sub-process, using the `openaddr-process-one` script.
- Run from a Procfile.
- Worker code can be found in `openaddr/ci/worker.py`. `openaddr-process-one` code can be found in `openaddr/process_one.py`.
- Configured in an EC2 auto-scaling group with a launch configuration.
- The time allotted for a single source run is currently limited to 9 hours.
- No public URLs.
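The sub-process pattern with a wall-clock limit can be sketched as follows. This is illustrative only: the real Worker shells out to `openaddr-process-one`, while here a harmless Python one-liner stands in for it, and the timeout is shortened from the production 9-hour allotment.

```python
import subprocess
import sys

def run_source(cmd, timeout_seconds):
    """Run one source conversion in a sub-process with a wall-clock limit.

    Sketch only: in production the command is the openaddr-process-one
    script and the limit is 9 hours.
    """
    try:
        result = subprocess.run(cmd, capture_output=True, text=True,
                                timeout=timeout_seconds)
        return result.returncode == 0, result.stdout
    except subprocess.TimeoutExpired:
        # A source that exceeds its allotment counts as a failure rather
        # than blocking the single-threaded worker forever.
        return False, ""

ok, output = run_source([sys.executable, "-c", "print('processed')"], 60)
print(ok, output.strip())
```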
Collects the results from runs that are completed and reports to GitHub and to Cloudwatch.
This Python script watches the `done`, `due`, and `heartbeat` queues. Run status is updated based on the contents of those queues: if a run appears in the `due` queue first, it will be marked as failed and any subsequent `done` queue item will be ignored. If a run appears in the `done` queue first, it will be marked as successful. Statuses are posted to the Github status API for runs connected to a CI job initiated by Webhook, and to the `runs` table with links.
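The first-event-wins rule described above can be expressed as a small pure function. This is a sketch of the ordering logic only, not the actual Dequeue implementation:

```python
def resolve_run_status(queue_events):
    """Decide a run's final status from its queue events, in arrival order.

    Whichever of "due" or "done" arrives first wins; later events for the
    same run are ignored. Sketch of the rule, not the real Dequeue code.
    """
    for event in queue_events:
        if event == "due":
            return "failed"
        if event == "done":
            return "successful"
    return "pending"

print(resolve_run_status(["heartbeat", "due", "done"]))
print(resolve_run_status(["heartbeat", "done"]))
```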
This script also watches the overall size of the queue, and updates Cloudwatch metrics to determine when the Worker pool needs to grow or shrink.
- Run from a Procfile, on the same EC2 instance as Webhook with the same configuration.
- Dequeue code can be found in `openaddr/ci/run_dequeue.py`.
- No public URL.
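The queue-size watching amounts to mapping queue depth onto a desired pool size. The real Dequeue code publishes the queue length as a Cloudwatch metric and lets an auto-scaling policy react to it; the function below is only an illustration of that mapping, and the `per_worker`, `minimum`, and `maximum` values are made up.

```python
def desired_worker_count(queue_length, per_worker=10, minimum=1, maximum=12):
    """Map tasks-queue depth to a worker pool size.

    Illustrative only: parameter values here are invented, and in
    production the scaling decision is made by Cloudwatch/auto-scaling,
    not by this function.
    """
    wanted = -(-queue_length // per_worker)  # ceiling division
    return max(minimum, min(maximum, wanted))

print(desired_worker_count(0), desired_worker_count(35), desired_worker_count(500))
```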
Large tasks that use the entire OpenAddresses dataset are scheduled with AWS Cloudwatch events on the same EC2 instance as Webhook. Event rules are updated with details found in `update-scheduled-tasks.py`, and typically trigger task-specific, single-use EC2 instances via AWS Lambda code found in `run-ec2-command.py`.
This Python script is meant to be run about once per week. It retrieves a current list of all sources on the master branch of the OpenAddresses repository, generates a set of runs, and slowly dribbles them into the `tasks` queue over the course of a few days. It’s designed to be slow, and always pre-emptible by jobs from Github CI via Webhook. After a successful set of runs, the script generates new coverage maps.
- Run via the script `openaddr-enqueue-sources`.
- Code can be found in `openaddr/ci/enqueue.py`.
- Coverage maps are rendered from `openaddr/render.py`.
- Resulting sets can be found at `results.openaddresses.io/sets` and `results.openaddresses.io/latest/set`.
- A weekly cron task for this script runs on Friday evenings from the same EC2 instance as Webhook.
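The slow-dribble idea is simply spreading enqueue times evenly across a window. The sketch below spaces a handful of runs over one minute as a stand-in for spreading thousands over a few days; the real pacing logic is in `openaddr/ci/enqueue.py` and is not reproduced here.

```python
def dribble_schedule(run_count, total_seconds):
    """Spread run_count enqueue offsets evenly across total_seconds.

    Sketch of the dribbling idea only; the real code paces itself
    dynamically so CI jobs from Webhook can always pre-empt it.
    """
    if run_count == 0:
        return []
    interval = total_seconds / run_count
    return [round(i * interval) for i in range(run_count)]

# Four runs spread over one minute, a tiny stand-in for a few days.
print(dribble_schedule(4, 60))
```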
This Python script is meant to be run about once per day. It downloads all current processed data, generates a series of collection Zip archives for different regions of the world, and uploads them to S3.
- Run the script `openaddr-collect-extracts`.
- Code can be found in `openaddr/ci/collect.py`.
- Resulting collections are linked from `results.openaddresses.io`.
- A nightly cron task for this script runs every evening from the same EC2 instance as Webhook.
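Bundling a region's processed files into a collection archive can be sketched with the standard library. The file names and region label below are made up; the real collection logic lives in `openaddr/ci/collect.py` and uploads the archives to S3 afterwards.

```python
import os
import tempfile
import zipfile

def collect_extract(region, csv_paths, out_dir):
    """Bundle a region's processed CSV files into one Zip archive.

    Sketch with an invented file layout; see openaddr/ci/collect.py for
    the real collection and upload logic.
    """
    archive_path = os.path.join(out_dir, "openaddr-collected-%s.zip" % region)
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in csv_paths:
            zf.write(path, arcname=os.path.basename(path))
    return archive_path

with tempfile.TemporaryDirectory() as tmp:
    csv = os.path.join(tmp, "us-ca-berkeley.csv")
    with open(csv, "w") as f:
        f.write("LON,LAT,NUMBER,STREET\n")
    archive = collect_extract("us_west", [csv], tmp)
    names = zipfile.ZipFile(archive).namelist()
print(names)
```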
This Python script is meant to be run about once per week. It downloads all current processed data, generates an MBTiles file of worldwide address point coverage with Tippecanoe, and uploads it to Mapbox.
- Run via the script `openaddr-update-dotmap`.
- Code can be found in `openaddr/ci/dotmap.py`.
- Resulting map of dots is shown at `openaddresses.io`.
- We plan to set up a weekly cron task for this script on the OpenStreetMap U.S. server.
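An invocation of Tippecanoe for a worldwide MBTiles file might be assembled like this. The flags and file names here are hypothetical, shown only to make the pipeline step concrete; consult `openaddr/ci/dotmap.py` for the options actually used.

```python
def tippecanoe_command(input_files, mbtiles_path):
    """Assemble a Tippecanoe command line for a worldwide dot map.

    Hypothetical flags for illustration; the real invocation is in
    openaddr/ci/dotmap.py.
    """
    cmd = ["tippecanoe", "-o", mbtiles_path, "-l", "dots", "--force"]
    cmd.extend(input_files)
    return cmd

print(tippecanoe_command(["planet.csv"], "dotmap.mbtiles"))
```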