Merge pull request #1479 from jwhitlock/update-datamaps-840-1404
Update datamaps script and docs
jwhitlock committed Jan 12, 2021
2 parents 3e4a64d + 1d49a79 commit 40a8c23
Showing 14 changed files with 1,032 additions and 383 deletions.
29 changes: 21 additions & 8 deletions .dockerignore
@@ -1,14 +1,27 @@
**/.DS_Store
**/__pycache__
*.sublime-project
*.sublime-workspace
*.awscreds
venv

.circleci
.coverage
.coveragerc
.docker-build
.git
.gitattributes
.github
.gitignore
.gitmodules
*.sublime-project
*.sublime-workspace
circle.yml
.pytest_cache
.readthedocs.yml
contribute.json
dev
docker-compose.yml
docs/build/html/.git
node.make
pip-selfcheck.json
venv
my.env

docs/_build/
ichnaea/content/static/datamap/csv/
ichnaea/content/static/datamap/quadtrees/
ichnaea/content/static/datamap/shapes/
ichnaea/content/static/datamap/tiles/
7 changes: 6 additions & 1 deletion .gitignore
@@ -2,8 +2,13 @@
*.py[co]
*.so
*.sublime-*
*.awscreds
.DS_Store
docs/_build/
ichnaea/content/static/tiles/
ichnaea/content/static/datamap/csv/
ichnaea/content/static/datamap/quadtrees/
ichnaea/content/static/datamap/shapes/
ichnaea/content/static/datamap/tiles/

my.env
.docker-build
8 changes: 0 additions & 8 deletions MANIFEST.in

This file was deleted.

5 changes: 5 additions & 0 deletions Makefile
@@ -46,6 +46,7 @@ default:
@echo " clean-assets - remove generated static assets"
@echo " update-vendored - re-download vendor source and test data"
@echo " update-reqs - regenerate Python requirements"
@echo " local-map - generate local map tiles"
@echo ""
@echo " help - see this text"
@echo ""
@@ -141,3 +142,7 @@ update-vendored: my.env
.PHONY: update-reqs
update-reqs: my.env
${DC} run --rm --no-deps app shell ./docker/run_update_requirements.sh

.PHONY: local-map
local-map: my.env .docker-build
${DC} run --rm app shell ./docker/run_local_map.sh
2 changes: 1 addition & 1 deletion docker/run_local_map.sh
@@ -1,3 +1,3 @@
#!/bin/bash

-python ./ichnaea/scripts/datamap.py --create --output /app/ichnaea/content/static/tiles/
+python ./ichnaea/scripts/datamap.py --create --output /app/ichnaea/content/static/datamap/
14 changes: 12 additions & 2 deletions docs/configure.rst
@@ -88,6 +88,7 @@ documentation <metrics>`.

All metrics are prefixed with a `location` namespace.

.. _map_tile_and_download_assets:

Map tile and download assets
----------------------------
@@ -100,12 +101,21 @@ available via a HTTPS frontend (Amazon CloudFront).

Set ``ASSET_BUCKET`` and ``ASSET_URL`` accordingly.

To access the ``ASSET_BUCKET``, authorized AWS credentials are needed inside
the Docker image. See the `Boto3 credentials documentation`_ for details.

.. _`Boto3 credentials documentation`: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html

The development environment defaults to serving map tiles from the web server,
and not serving public export files for download.

.. _mapbox:

Mapbox
------

The web site content uses Mapbox to generate tiles. In order to do this,
it requires a Mapbox API token.
The web site content uses Mapbox to display a world map. In order to do this,
it requires a Mapbox API token. Without a token, the map is not displayed.

You can create an account on their site: https://www.mapbox.com

71 changes: 68 additions & 3 deletions docs/dataflows.rst
@@ -43,7 +43,7 @@ Data flow:
{"api_key": key, "report": report, "source": source}

"source" can be one of:

* ``gnss``: Global Navigation Satellite System based data
* ``fused``: position data obtained from a combination of other sensors or
outside service queries
@@ -134,6 +134,71 @@ Datamap data
============

Datamap data is stored in the ``datamap_*`` (``ne``, ``nw``, ``se``, ``sw``)
tables.
tables. The north / south split is at 36˚N latitude, and the east / west split
at 5˚ longitude, to attempt to split the data into four equal shards. The rows
divide the world into boxes that are one-thousandth of a degree on each side
(about 112 meters at the equator), and record the first and latest time that
an observation was seen in that box. Other details, including the exact
position of the observation, are not recorded. These tables are updated during
the ``internal`` processing job.
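
As a rough illustration of the sharding and grid described above, the following sketch maps a position to a table suffix and a grid box. It is not taken from the Ichnaea source: the function names, the use of ``math.floor``, and the choice to place positions that fall exactly on a boundary in the northern or eastern shard are all assumptions made for the example.

```python
import math

LAT_SPLIT = 36.0  # north / south boundary, from the text above
LON_SPLIT = 5.0   # east / west boundary, from the text above
SCALE = 1000      # boxes are one-thousandth of a degree on each side


def grid_cell(lat, lon):
    """Return integer grid coordinates of the box containing (lat, lon).

    Flooring is an assumption; the real code may round or truncate.
    """
    return math.floor(lat * SCALE), math.floor(lon * SCALE)


def shard_suffix(lat, lon):
    """Return one of "ne", "nw", "se", "sw" for a position.

    Treating the boundary itself as north / east is an assumption.
    """
    north_south = "n" if lat >= LAT_SPLIT else "s"
    east_west = "e" if lon >= LON_SPLIT else "w"
    return north_south + east_west
```

Under these assumptions, a position in London would land in the ``nw`` shard and one in Sydney in the ``se`` shard.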

A periodic task runs the script ``ichnaea/scripts/datamap.py`` to convert this
data into transparent tile images for the contribution map. Thread pools are
used to distribute the work across available processors. The process is:

1. Export the datamap tables as CSV files.

The latitude, longitude, and days since last observation are fed into a
randomizer that creates 0 to 13 nearby points, more for the recently
observed grid positions. This emulates the multiple observations that go
into each grid position, and hides details of observations for increased
privacy.

2. Convert the CSV files to a quadtree_ structure.

The binary quadtree structure efficiently stores points when there are
large areas with no points, and is faster for determining points within
the bounding box of a tile.

3. Merge the per-table quadtrees to a single quadtree file.

This includes removing duplicates at the boundaries of tables.

4. Generate and minimize tiles for the different zoom levels.

Each zoom level potentially has four times the tiles of the previous zoom
level, with 1 at zoom level 0, 4 at zoom level 1, 16 at zoom level 2, up
to over 4 million at maximum zoom level 11. However, tiles with no
observations are not rendered, so the actual number of generated tiles is
less. The tiles are stored in a folder structure by zoom level, x position,
and files at the y position, to match Mapbox tile standards and to avoid
having too many files in a folder.

Tiles are further optimized for disk space by reducing the colorspace,
without reducing quality below a target.

A double-resolution tile at zoom level 0 is created for the map overview
on the front page on high-resolution displays.

5. Upload the tiles to an S3 bucket.

There may be existing tiles in the S3 bucket from previous uploads. The
script collects the size and MD5 hash of existing S3 tiles, and compares
them to the newly generated tiles, to determine which are new, which are
updated, which are the same and can be ignored, and which S3 tiles should
be deleted.

New and updated tiles are uploaded. Uploading is I/O bound, so the
concurrency of uploads is doubled. The deleted tiles are deleted in
batches, for speed.

A file ``tiles/data.json`` is written to record when the upload completed
and details of the tile generation process.
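
The comparison in step 5 can be sketched as follows. This is a minimal illustration, assuming tiles are described by in-memory dictionaries that map a tile path to a ``(size, MD5)`` pair; the helper names ``fingerprint`` and ``plan_sync`` are hypothetical and do not come from ``ichnaea/scripts/datamap.py``, and no S3 calls are made.

```python
import hashlib


def fingerprint(data):
    """Return the (size, MD5 hex digest) pair used to compare tiles."""
    return len(data), hashlib.md5(data).hexdigest()


def plan_sync(local, remote):
    """Classify tiles by comparing local and remote fingerprints.

    Both arguments map a tile path to a (size, md5) tuple. The dict
    layout is an assumption made for this sketch.
    """
    plan = {"new": [], "changed": [], "unchanged": [], "deleted": []}
    for path, local_print in sorted(local.items()):
        if path not in remote:
            plan["new"].append(path)        # upload
        elif remote[path] != local_print:
            plan["changed"].append(path)    # upload, overwriting
        else:
            plan["unchanged"].append(path)  # ignore
    # Remote tiles with no local counterpart should be removed.
    plan["deleted"] = sorted(set(remote) - set(local))
    return plan
```

Only the ``new`` and ``changed`` lists need uploading, and the ``deleted`` list can be removed in batches, which matches the behavior described above.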


Quadtree and tile generation tools are provided by `ericfischer/datamaps`_, and
PNG size optimization by pngquant_.
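
The tile counts quoted in step 4 follow from each zoom level quadrupling the previous one. A short sketch of that arithmetic and of the ``zoom/x/y`` folder layout is shown below; the function names are hypothetical, and the ``.png`` path format is inferred from the ``tiles/{z}/{x}/{y}.png`` URL pattern used elsewhere in this commit.

```python
MAX_ZOOM = 11  # maximum zoom level named in the text above


def max_tiles(zoom):
    """Upper bound on tiles at a zoom level: four times the previous level."""
    return 4 ** zoom


def tile_path(zoom, x, y):
    """Relative path of a tile: folder per zoom, folder per x, file per y."""
    return f"{zoom}/{x}/{y}.png"


def max_tiles_all_levels(max_zoom=MAX_ZOOM):
    """Upper bound on tiles across zoom levels 0 through max_zoom."""
    return sum(max_tiles(zoom) for zoom in range(max_zoom + 1))
```

These are upper bounds only, since tiles with no observations are not rendered.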

FIXME: data flow for datamap data?
.. _quadtree: https://en.wikipedia.org/wiki/Quadtree
.. _ericfischer/datamaps: https://github.com/ericfischer/datamaps
.. _pngquant: https://pngquant.org
33 changes: 33 additions & 0 deletions docs/local_dev.rst
@@ -271,3 +271,36 @@ source. Update ``docker.make`` for the desired versions, and run::
$ make update-vendored build test

Commit the updated source tarballs.

Building Datamap Tiles
======================

To build datamap tiles for the local development environment, run::

$ make local-map

If you have data in the ``datamap`` tables, this will create many files
under ``ichnaea/content/static/datamap``. This uses
``ichnaea/scripts/datamap.py``, which can also be run directly.

To see the map locally, you will need to configure :ref:`mapbox`. A free
developer account should be sufficient.

To use an S3 bucket for tiles, you'll need to set ``ASSET_BUCKET`` and
``ASSET_URL`` (see :ref:`map_tile_and_download_assets`).
To upload tiles to an S3 bucket, you'll also need AWS credentials that
can read, write, and delete objects in the ``ASSET_BUCKET``. Here are
two ways, neither of which is ideal since both store your AWS credentials
in plain text:

1. Add credentials as environment variables ``AWS_ACCESS_KEY_ID`` and
``AWS_SECRET_ACCESS_KEY`` in ``my.env``.
2. Add credentials to a file ``my.awscreds`` in the project folder,
and add ``AWS_SHARED_CREDENTIALS_FILE=/app/my.awscreds`` to ``my.env``.

You can then generate and upload tiles with::

$ docker-compose run --rm app map

This will generate a fresh set of tiles in a temporary directory and
sync the S3 bucket with the changes.
69 changes: 39 additions & 30 deletions docs/metrics.rst
@@ -32,8 +32,7 @@ Metric Name App Type Tags
`data.station.confirm`_ task counter type
`data.station.dberror`_ task counter type, errno
`data.station.new`_ task counter type
`datamaps`_ datamaps timer func, count
`datamaps.dberror`_ datamaps counter errno
`datamaps.dberror`_ task counter errno
`locate.fallback.cache`_ web counter fallback_name, status
`locate.fallback.lookup`_ web counter fallback_name, status
`locate.fallback.lookup.timing`_ web timer fallback_name, status
@@ -772,6 +771,15 @@ Tags:

* ``type``: The :term:`station` type, one of ``blue``, ``cell``, or ``wifi``

datamaps.dberror
^^^^^^^^^^^^^^^^
``datamaps.dberror`` is a counter of the number of retryable database errors
when updating the ``datamaps`` tables.

Tags:

* ``errno``: The error number, same as `data.station.dberror`_

Backend Monitoring Metrics
--------------------------

@@ -887,34 +895,35 @@ rate_control.locate.dterm
the derivative term of the rate controller. It is emitted when the rate
controller is enabled.

Datamaps Metrics
================
The datamap script generates a data map from the gathered map statistics. It has
not been updated to work with current production infrastructure, so these metrics
were emitted from the previous infrastructure.

datamaps
--------
``datamaps`` is a timer for functions in the datamap process. It also counts items,
but as a timer.

.. NOTE::
The item counts should be moved to a new counter metric

Tags:

* ``func``: The export function being timed, such as ``export``, ``encode``,
``merge``, ``main``, ``render``, or ``upload``
* ``count``: The item counts, recorded as a timer, such as ``csv_rows``,
``quadtrees``, ``tile_new``, ``tile_changed``, ``tile_deleted``, ``tile_unchanged``

datamaps.dberror
^^^^^^^^^^^^^^^^
``datamaps.dberror`` counts the number of retryable database errors.

Tags:

* ``errno``: The error number, same as `data.station.dberror`_
Datamaps Structured Log
=======================
The datamap script generates a data map from the gathered observations. It does
not emit metrics.

The final ``canonical-log-line`` log entry has this data:

* ``bucketname``: The name of the S3 bucket
* ``concurrency``: The number of concurrent threads used
* ``create``: True if ``--create`` was set to generate tiles
* ``duration_s``: How long in seconds to run the script
* ``export_duration_s``: How long in seconds to export from tables to CSV
* ``merge_duration_s``: How long in seconds to merge the per-table quadtrees
* ``quadtree_count``: How many per-table quadtrees were generated
* ``quadtree_duration_s``: How long in seconds to convert CSV to quadtrees
* ``render_duration_s``: How long in seconds to render the merged quadtree to tiles
* ``row_count``: The number of rows across datamap tables
* ``script_name``: The name of the script (``ichnaea.scripts.datamap``)
* ``success``: True if the script completed without errors
* ``sync_duration_s``: How long in seconds it took to upload tiles to S3
* ``tile_changed``: How many existing S3 tiles were updated
* ``tile_count``: The total number of tiles generated
* ``tile_deleted``: How many existing S3 tiles were deleted
* ``tile_new``: How many new tiles were uploaded to S3
* ``tile_unchanged``: How many tiles were the same as the S3 tiles
* ``upload``: True if ``--upload`` was set to upload / sync tiles

Much of this data is also found in the file ``tiles/data.json`` in the S3
bucket for the most recent run.

Implementation
==============
File renamed without changes.
4 changes: 4 additions & 0 deletions ichnaea/content/tests/test_views.py
@@ -17,6 +17,10 @@ def test_get_map_tiles_url(self):
url = get_map_tiles_url("http://127.0.0.1:9/static")
assert url == "http://127.0.0.1:9/static/tiles/{z}/{x}/{y}.png"

def test_get_map_tiles_url_blank_asset_url(self):
url = get_map_tiles_url("")
assert url == "/static/datamap/tiles/{z}/{x}/{y}.png"


@pytest.fixture(scope="function")
def views(redis, session):
2 changes: 1 addition & 1 deletion ichnaea/content/views.py
@@ -58,7 +58,7 @@ def get_map_tiles_url(asset_url):
:returns: tiles_url
"""
-asset_url = asset_url if asset_url else ""
+asset_url = asset_url if asset_url else "/static/datamap/"
if not asset_url.endswith("/"):
asset_url = asset_url + "/"
return urlparse.urljoin(asset_url, "tiles/" + TILES_PATTERN)
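
For reference, the effect of this change can be reproduced with a standalone sketch of the function. It assumes ``TILES_PATTERN`` is ``"{z}/{x}/{y}.png"``, which is inferred from the tests in this commit rather than shown in the hunk, and it imports ``urljoin`` directly from the standard library instead of using the module alias seen above.

```python
from urllib.parse import urljoin

# Assumed value, inferred from the expected URLs in test_views.py.
TILES_PATTERN = "{z}/{x}/{y}.png"


def get_map_tiles_url(asset_url):
    """Return the tile URL template for a given asset URL."""
    # A blank asset URL now falls back to the locally served datamap folder.
    asset_url = asset_url if asset_url else "/static/datamap/"
    # urljoin drops the last path segment unless the base ends with "/".
    if not asset_url.endswith("/"):
        asset_url = asset_url + "/"
    return urljoin(asset_url, "tiles/" + TILES_PATTERN)
```

With a blank ``asset_url`` this yields a path under ``/static/datamap/tiles/``, which matches the directory that ``make local-map`` writes to.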