
Technical approach metadata #13

olafveerman opened this issue Mar 1, 2019 · 5 comments

This document describes how we plan to add the ability to store metadata about stations in OpenAQ, and subsequently query data in the platform by that metadata. We’ve come up with an approach that should:

  1. make no breaking changes to the current endpoints; and
  2. preserve the stability of the platform.

A description of the structure of the Station ID is here: #12

Approach

Overall, we propose to store the metadata in a locations table. When performing any query based on location or its metadata (e.g. all rural stations within a country), the API will (see the sketch after this list):

  1. query the locations table and get a list of relevant locations
  2. use the coordinates of those locations to query the measurements table.
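
A minimal sketch of this two-step lookup, assuming Postgres via node-postgres, a jsonb metadata column on locations, and coordinates stored as a joinable text key; all table and column names here are illustrative, not a final schema:

```typescript
import { Pool } from 'pg';

const pool = new Pool();

async function measurementsByMetadata(filter: object): Promise<any[]> {
  // 1. query the locations table and get a list of relevant locations
  const locations = await pool.query(
    'SELECT coordinates FROM locations WHERE metadata @> $1::jsonb',
    [JSON.stringify(filter)]
  );
  const coords = locations.rows.map((r) => r.coordinates);

  // 2. use the coordinates of those locations to query the measurements table
  const measurements = await pool.query(
    'SELECT * FROM measurements WHERE coordinates = ANY($1)',
    [coords]
  );
  return measurements.rows;
}

// e.g. all rural stations within a country:
// measurementsByMetadata({ country: 'NL', siteType: 'rural' });
```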

This approach requires us to:

  1. add the table locations
  2. rewrite API endpoints
  3. seed the table with current locations found in the table measurements
  4. create a process to add new locations when they are reported
  5. rewrite the logic in openaq-fetch that checks for the uniqueness constraint

New table: locations

We propose to store all the metadata in a new locations table in the database. Endpoints that rely on location will get data from this table, instead of aggregating it from the measurements table.
Each record will consist of a unique ID, the coordinates, and a jsonb field with the metadata.
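
One possible shape for the table, sketched below; the column names, types, and indexes are assumptions made to keep the later snippets concrete, not a final schema:

```typescript
const createLocations = `
  CREATE TABLE locations (
    id TEXT PRIMARY KEY,          -- unique station ID (structure discussed in #12)
    coordinates TEXT NOT NULL,    -- joinable "lat,lon" key
    metadata JSONB DEFAULT '{}'   -- free-form station metadata
  );
  -- uniqueness on coordinates, plus a GIN index for metadata queries
  CREATE UNIQUE INDEX locations_coordinates_idx ON locations (coordinates);
  CREATE INDEX locations_metadata_idx ON locations USING GIN (metadata);
`;
```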

Seeding the table

This table will be populated with the unique set of locations found in the measurements table. Uniqueness is determined by the coordinates.
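
A sketch of the one-off backfill under the schema above; md5(coordinates) is only a stand-in for the real station ID scheme from #12:

```typescript
const seedLocations = `
  INSERT INTO locations (id, coordinates, metadata)
  SELECT md5(coordinates),  -- placeholder for the station ID scheme (#12)
         coordinates,
         jsonb_build_object('location', location, 'city', city, 'country', country)
  FROM (
    SELECT DISTINCT ON (coordinates) coordinates, location, city, country
    FROM measurements
  ) m
  ON CONFLICT (coordinates) DO NOTHING;
`;
```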

Rewrite endpoints

The following endpoints will need to have (part of) their logic rewritten. The response from each of these endpoints will stay the same, except for additional metadata fields.

  • /locations
    • refactor: get unique locations from the locations table, instead of measurements
    • new: query locations by metadata (see the sketch after this list)
    • new: POST method
    • new: PATCH method
  • /measurements
    • new: query measurements on metadata in the locations table
  • /cities - controller
    • refactor: get data from the locations table
  • /countries
    • refactor: get data from the locations table
  • /latest - controller
    • refactor: get unique locations from the locations table
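
For the new metadata filter on /locations, a sketch of how a query-string filter could translate into a jsonb containment match. Express is used here only for brevity, and the metadata[...] parameter style is an assumption; the production API's framework and parameter names may differ:

```typescript
import express from 'express';
import { Pool } from 'pg';

const app = express();
const pool = new Pool();

// GET /locations?metadata[siteType]=rural: Express's extended query parser
// turns metadata[...] parameters into a plain object we can match with @>.
app.get('/locations', async (req, res) => {
  const filter = req.query.metadata ?? {};
  const { rows } = await pool.query(
    'SELECT id, coordinates, metadata FROM locations WHERE metadata @> $1::jsonb',
    [JSON.stringify(filter)]
  );
  res.json({ results: rows });
});

app.listen(3000);
```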

Adding new locations

When a measurement is ingested for a new location, a new entry needs to be added in the locations table. Two potential ways to implement this:

  • openaq-fetch - when checking for uniqueness in the fetch process, perform a lookup in the locations table. When the locations table doesn’t contain a record with those coordinates, create a new one (see the sketch after this list).
  • batch process - every hour, check if any of the new measurements has a corresponding set of coordinates in the locations table. This could be triggered by the same process that starts the Athena jobs.
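
A sketch of the first option, relying on the unique index on coordinates from the schema above; newStationId() is a hypothetical helper for the ID scheme from #12:

```typescript
import { Pool } from 'pg';

const pool = new Pool();

// Hypothetical placeholder for the station ID scheme described in #12.
declare function newStationId(coordinates: string): string;

// Called during ingest in openaq-fetch: inserts a locations record the first
// time a set of coordinates is seen; the unique index on coordinates makes
// this a no-op when the location already exists.
async function ensureLocation(coordinates: string, meta: object): Promise<void> {
  await pool.query(
    `INSERT INTO locations (id, coordinates, metadata)
     VALUES ($1, $2, $3::jsonb)
     ON CONFLICT (coordinates) DO NOTHING`,
    [newStationId(coordinates), coordinates, JSON.stringify(meta)]
  );
}
```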

Uniqueness constraints openaq-fetch

Before inserting data into the database, openaq-fetch checks for the uniqueness of the measurement. If a measurement with the same location, timestamp, and pollutant already exists in the database, it will not insert it.
We need to rewrite this logic so that it checks for uniqueness based on the coordinates, as sketched below.
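
A sketch of the revised check; the column names are assumptions:

```typescript
// A measurement counts as a duplicate when the same coordinates, timestamp,
// and pollutant are already in the database.
const isDuplicateQuery = `
  SELECT 1
  FROM measurements
  WHERE coordinates = $1
    AND date_utc = $2
    AND parameter = $3
  LIMIT 1;
`;
```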

Open questions:

  • Should we trim down the duplicated metadata stored in the measurements table, deduplicating attributes like location, city, country, and coordinates?
  • We could store the station ID on every measurement object, and perform the aggregations and queries on the ID instead of on the coordinates. The advantage is that it’s easier to serve measurements for locations without coordinates (currently around 160 active locations). The disadvantage is that it would oblige us to update the complete archive on S3 with the station ID.

maschu09 commented Mar 5, 2019

All great. One thing: as mentioned in #12, the seeding of the new table should include locations that have no coordinates. Do you always know the originating country? If not, then a special code (XX?) should be defined for unclear cases.

@RocketD0g

Another question that has come up via @maschu09 :

Do you indeed intend to rewrite the S3 archive and include station IDs everywhere, or are the users (i.e. us) supposed to do the mapping via the locations table ourselves?

@sethvincent or @olafveerman (and @jflasher for good measure) will be better suited to answering this question, but my quick take: I am guessing that it won't be feasible to rewrite the 400 million data points in the S3 buckets, but a) you will be able to query the API* for this output, and b) for older data there will be a fairly painless way for users to mesh the locations with the archived data.

*Separate from this project, but on the order of the next few months, we are working to no longer have a 90-day restriction on the API.

@RocketD0g

Also, on:

Do you always know the originating country? If not, then a special code (XX?) should be defined for unclear cases.

Generally speaking, the country is usually defined, but you can imagine some scenarios where we don't know, or where the idea of applying 'country' is not fitting, so I agree we should designate a special code for cases where that is unclear. To @maschu09's suggestion, 'XX' looks like the standard, so I think we should just go with that.

jflasher commented May 5, 2019

@vgeorge @olafveerman I'm wondering if we actually need to change openaq-fetch to address uniqueness? Maybe put another way, what is the problem if we change nothing in openaq-fetch? The unique IDs would still be generated via the other code so no matter what is getting inserted by fetch, it'll still get handled downstream. If we were saving the unique ID with every measurement, then I think we'd need to update fetch, but if we're not doing that (which I don't believe we are?), then I am not sure we need to change fetch?

vgeorge commented May 13, 2019

@jflasher I believe you are right. The Athena query groups measurements by coordinates and parameter id, using the queries from this file:

https://github.com/openaq/openaq-api/pull/394/files#diff-46368f02f56796a118231baa2c760aaf

If there is a duplicated measurement, it will have no effect on id generation.
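
To illustrate the point (this is not the actual query from the linked file): when IDs are derived from a GROUP BY over coordinates and parameter, a duplicated row adds no new group, so the derived set of IDs is unchanged.

```typescript
// Not the actual query from openaq/openaq-api#394, just a minimal
// illustration: IDs are derived per (coordinates, parameter) group, and
// inserting the same measurement twice cannot create a new group.
const idGroups = `
  SELECT coordinates, parameter
  FROM measurements
  GROUP BY coordinates, parameter;
`;
```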
