# Components

The basic moving parts of the Machine: a web application, a worker pool, and scheduled tasks.

## Webhook

Responsible for running jobs on demand in response to pull requests and changes in the GitHub repository, and for publicly displaying the status of the OpenAddresses data set.

This Python + Flask application is the center of the OpenAddresses Machine. Webhook maintains a connection to the database and queue, listens for new CI jobs from GitHub event hooks on the OpenAddresses repository, queues new source runs, and displays the results of batch sets over time.
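As a rough illustration of that flow, the hedged sketch below receives a GitHub push event and queues one run per changed source file. The `tasks` list, the `sources/` path filter, and the `/hook` route are stand-ins for the real database-backed queue and endpoint, not Webhook's actual code.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)
tasks = []  # stand-in for the real database-backed tasks queue

def changed_sources(event):
    """Collect source JSON files touched by the pushed commits."""
    paths = set()
    for commit in event.get('commits', []):
        for path in commit.get('added', []) + commit.get('modified', []):
            if path.startswith('sources/') and path.endswith('.json'):
                paths.add(path)
    return sorted(paths)

@app.route('/hook', methods=['POST'])
def hook():
    event = request.get_json() or {}
    queued = changed_sources(event)
    for path in queued:
        tasks.append({'source': path})  # one queued run per changed source
    return jsonify({'queued': queued})
```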

## Worker

Does the actual work of running a source and producing output files.

This Python script accepts new source runs from the tasks queue, converts them into output Zip archives with CSV files, uploads those to S3, and notifies the dequeuer via the due, done, and heartbeat queues. Worker is single-threaded and intended to be run in parallel on multiple instances; it uses EC2 auto-scaling to respond to increased demand by launching new instances. One worker is kept alive at all times on the same EC2 instance as Webhook.

The actual work is done in a separate sub-process, using the `openaddr-process-one` script.
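A minimal sketch of one worker iteration, under some loud assumptions: the queue objects, the `run_id` and `source` task fields, the S3 bucket and key layout, and the exact arguments to `openaddr-process-one` are all illustrative rather than the real interface.

```python
import shutil
import subprocess
import tempfile

import boto3

s3 = boto3.client('s3')

def run_one(task, done_queue, bucket='example-openaddr-bucket'):
    with tempfile.TemporaryDirectory() as workdir:
        # Do the heavy lifting out-of-process, so a crash during
        # conversion can't take down the worker itself.
        proc = subprocess.run(['openaddr-process-one', task['source'], workdir])
        # Bundle whatever the sub-process produced into one Zip archive.
        archive = shutil.make_archive('output', 'zip', workdir)
        key = 'runs/{}/output.zip'.format(task['run_id'])
        s3.upload_file(archive, bucket, key)
        done_queue.put({'run_id': task['run_id'], 'key': key,
                        'success': proc.returncode == 0})
```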

## Dequeuer

Collects the results from completed runs and reports to GitHub and to CloudWatch.

This Python script watches the done, due, and heartbeat queues and updates run statuses based on their contents: if a run appears in the due queue first, it is marked as failed and any subsequent done queue item is ignored; if a run appears in the done queue first, it is marked as successful. Statuses are posted to the GitHub status API for runs connected to a CI job initiated by Webhook, and to the runs table with links.
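That ordering rule reduces to a first-writer-wins policy on each run's status. The sketch below shows the rule in isolation, with a plain dict standing in for the runs table:

```python
statuses = {}  # run_id -> 'failed' | 'success'; stand-in for the runs table

def on_due(run_id):
    # The run hit its deadline before any result arrived: mark it failed.
    statuses.setdefault(run_id, 'failed')

def on_done(run_id):
    # Only counts if a due message hasn't already failed the run.
    statuses.setdefault(run_id, 'success')
```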

This script also watches the overall size of the queue and updates CloudWatch metrics used to decide when the Worker pool needs to grow or shrink.
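Publishing such a metric could look like the sketch below; the namespace and metric name are assumptions, not the ones the script actually uses.

```python
import boto3

cloudwatch = boto3.client('cloudwatch')

def report_queue_length(count):
    # Auto-scaling policies can alarm on this metric to grow or shrink the pool.
    cloudwatch.put_metric_data(
        Namespace='OpenAddresses',              # assumed namespace
        MetricData=[{
            'MetricName': 'TasksQueueLength',   # assumed metric name
            'Value': count,
            'Unit': 'Count',
        }])
```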

## Scheduled Tasks

Large tasks that use the entire OpenAddresses dataset are scheduled with AWS CloudWatch Events on the same EC2 instance as Webhook. Event rules are updated with details found in `update-scheduled-tasks.py`, and typically trigger task-specific, single-use EC2 instances via the AWS Lambda code found in `run-ec2-command.py`.
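In outline, such a Lambda might launch a throwaway instance whose user-data runs one command and then powers the machine off. The AMI, instance type, and event shape below are placeholders, not what `run-ec2-command.py` actually does.

```python
import boto3

def lambda_handler(event, context):
    command = event['command']  # assumed event shape: one shell command to run
    user_data = '#!/bin/sh\n{}\nshutdown -h now\n'.format(command)
    ec2 = boto3.client('ec2')
    ec2.run_instances(
        ImageId='ami-00000000',     # placeholder AMI
        InstanceType='m3.xlarge',   # placeholder instance type
        MinCount=1, MaxCount=1,
        # Shutting down from user-data then terminates the single-use instance.
        InstanceInitiatedShutdownBehavior='terminate',
        UserData=user_data)
```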

One Python script is meant to be run about once per week. It retrieves a current list of all sources on the master branch of the OpenAddresses repository, generates a set of runs, and slowly dribbles them into the tasks queue over the course of a few days. It’s designed to be slow, and always pre-emptible by jobs from GitHub CI via Webhook. After a successful set of runs, the script generates new coverage maps.
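The "slow dribble" can be read as a low-water-mark loop: only top the queue up once it has nearly drained, so runs queued by Webhook always get through first. The queue interface and thresholds below are assumptions.

```python
import time

def dribble(runs, queue, low_water=10, pause=60):
    for run in runs:
        # Wait until the queue is nearly empty, so CI jobs queued by
        # Webhook are never stuck behind the weekly batch.
        while queue.length() > low_water:   # assumed queue interface
            time.sleep(pause)
        queue.put(run)
```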

Another Python script is meant to be run about once per day. It downloads all current processed data, generates a series of collection Zip archives for different regions of the world, and uploads them to S3.
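A hedged sketch of that collection step, with the region grouping, archive naming, and bucket as placeholders:

```python
import zipfile

import boto3

s3 = boto3.client('s3')

def collect(region, csv_paths, bucket='example-openaddr-bucket'):
    archive = 'openaddr-collected-{}.zip'.format(region)  # placeholder naming
    with zipfile.ZipFile(archive, 'w', zipfile.ZIP_DEFLATED) as zf:
        for path in csv_paths:
            zf.write(path)  # one CSV per processed source in the region
    s3.upload_file(archive, bucket, archive)
```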

A third Python script is meant to be run about once per week. It downloads all current processed data, generates an MBTiles file of worldwide address point coverage with Tippecanoe, and uploads it to Mapbox.
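The Tippecanoe step could be invoked roughly as below; the flags shown are common Tippecanoe options for point data, not necessarily the script's exact invocation, and the Mapbox upload is omitted.

```python
import subprocess

def build_tiles(geojson_path, mbtiles_path):
    subprocess.run([
        'tippecanoe',
        '-o', mbtiles_path,  # output MBTiles file
        '-zg',               # let Tippecanoe choose a maximum zoom
        '-r1',               # keep every address point; no rate-based dropping
        geojson_path,
    ], check=True)
```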