
Technical approach metadata #13

olafveerman opened this issue Mar 1, 2019 · 5 comments

This document describes how we plan to add the ability to store metadata about stations in OpenAQ, and subsequently query data in the platform by that metadata. We’ve come up with an approach that should:

  1. make no breaking changes to the current endpoints; and
  2. preserve the stability of the platform.

A description of the structure of the Station ID is here: #12

Approach

Overall, we propose to store the metadata in a locations table. When performing any query based on location or its metadata (e.g. all rural stations within a country), the API will (see the sketch after this list):

  1. query the locations table and get a list of relevant locations
  2. use the coordinates of those locations to query the measurements table.
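
A minimal sketch of this two-step lookup, assuming Postgres via node-postgres, a jsonb metadata column on locations, and coordinates stored as a joinable text key; all table and column names here are illustrative, not a final schema:

```typescript
import { Pool } from 'pg';

const pool = new Pool();

async function measurementsByMetadata(filter: object): Promise<any[]> {
  // 1. query the locations table and get a list of relevant locations
  const locations = await pool.query(
    'SELECT coordinates FROM locations WHERE metadata @> $1::jsonb',
    [JSON.stringify(filter)]
  );
  const coords = locations.rows.map((r) => r.coordinates);

  // 2. use the coordinates of those locations to query the measurements table
  const measurements = await pool.query(
    'SELECT * FROM measurements WHERE coordinates = ANY($1)',
    [coords]
  );
  return measurements.rows;
}

// e.g. all rural stations within a country:
// measurementsByMetadata({ country: 'NL', siteType: 'rural' });
```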

This approach requires us to:

  1. add the table locations
  2. rewrite API endpoints
  3. seed the table with current locations found in the table measurements
  4. create a process to add new locations when they are reported
  5. rewrite the logic in openaq-fetch that checks for the uniqueness constraint

New table: locations

We propose to store all the metadata in a new locations table in the database. Endpoints that rely on location will get data from this table, instead of aggregating it from the measurements table.
Each record will consist of a unique ID, the coordinates, and a jsonb field with the metadata.
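
One possible shape for the table, sketched below; the column names, types, and indexes are assumptions made to keep the later snippets concrete, not a final schema:

```typescript
const createLocations = `
  CREATE TABLE locations (
    id TEXT PRIMARY KEY,          -- unique station ID (structure discussed in #12)
    coordinates TEXT NOT NULL,    -- joinable "lat,lon" key
    metadata JSONB DEFAULT '{}'   -- free-form station metadata
  );
  -- uniqueness on coordinates, plus a GIN index for metadata queries
  CREATE UNIQUE INDEX locations_coordinates_idx ON locations (coordinates);
  CREATE INDEX locations_metadata_idx ON locations USING GIN (metadata);
`;
```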

Seeding the table

This table will be populated with the unique set of locations found in the measurements table. Uniqueness is determined by the coordinates.
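
A sketch of the one-off backfill under the schema above; md5(coordinates) is only a stand-in for the real station ID scheme from #12:

```typescript
const seedLocations = `
  INSERT INTO locations (id, coordinates, metadata)
  SELECT md5(coordinates),  -- placeholder for the station ID scheme (#12)
         coordinates,
         jsonb_build_object('location', location, 'city', city, 'country', country)
  FROM (
    SELECT DISTINCT ON (coordinates) coordinates, location, city, country
    FROM measurements
  ) m
  ON CONFLICT (coordinates) DO NOTHING;
`;
```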

Rewrite endpoints

The following endpoints will need to have (part of) their logic rewritten. The response from each of these endpoints will stay the same, except for additional metadata fields.

  • /locations
    • refactor: get unique locations from the locations table, instead of measurements
    • new: query locations by metadata (see the sketch after this list)
    • new: POST method
    • new: PATCH method
  • /measurements
    • new: query measurements on metadata in the locations table
  • /cities - controller
    • refactor: get data from the locations table
  • /countries
    • refactor: get data from the locations table
  • /latest - controller
    • refactor: get unique locations from the locations table
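
For the new metadata filter on /locations, a sketch of how a query-string filter could translate into a jsonb containment match. Express is used here only for brevity, and the metadata[...] parameter style is an assumption; the production API's framework and parameter names may differ:

```typescript
import express from 'express';
import { Pool } from 'pg';

const app = express();
const pool = new Pool();

// GET /locations?metadata[siteType]=rural: Express's extended query parser
// turns metadata[...] parameters into a plain object we can match with @>.
app.get('/locations', async (req, res) => {
  const filter = req.query.metadata ?? {};
  const { rows } = await pool.query(
    'SELECT id, coordinates, metadata FROM locations WHERE metadata @> $1::jsonb',
    [JSON.stringify(filter)]
  );
  res.json({ results: rows });
});

app.listen(3000);
```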

Adding new locations

When a measurement is ingested for a new location, a new entry needs to be added in the locations table. Two potential ways to implement this:

  • openaq-fetch - when checking for uniqueness in the fetch process, perform a lookup in the locations table. When the locations table doesn’t contain a record with those coordinates, create a new one (see the sketch after this list).
  • batch process - every hour, check if any of the new measurements has a corresponding set of coordinates in the locations table. This could be triggered by the same process that starts the Athena jobs.
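
A sketch of the first option, relying on the unique index on coordinates from the schema above; newStationId() is a hypothetical helper for the ID scheme from #12:

```typescript
import { Pool } from 'pg';

const pool = new Pool();

// Hypothetical placeholder for the station ID scheme described in #12.
declare function newStationId(coordinates: string): string;

// Called during ingest in openaq-fetch: inserts a locations record the first
// time a set of coordinates is seen; the unique index on coordinates makes
// this a no-op when the location already exists.
async function ensureLocation(coordinates: string, meta: object): Promise<void> {
  await pool.query(
    `INSERT INTO locations (id, coordinates, metadata)
     VALUES ($1, $2, $3::jsonb)
     ON CONFLICT (coordinates) DO NOTHING`,
    [newStationId(coordinates), coordinates, JSON.stringify(meta)]
  );
}
```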

Uniqueness constraints openaq-fetch

Before inserting data into the database, openaq-fetch checks for the uniqueness of the measurement. If a measurement with the same location, timestamp, and pollutant already exists in the database, it will not insert it.
We need to rewrite this logic so that it checks for uniqueness based on the coordinates, as sketched below.
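
A sketch of the revised check; the column names are assumptions:

```typescript
// A measurement counts as a duplicate when the same coordinates, timestamp,
// and pollutant are already in the database.
const isDuplicateQuery = `
  SELECT 1
  FROM measurements
  WHERE coordinates = $1
    AND date_utc = $2
    AND parameter = $3
  LIMIT 1;
`;
```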

Open questions:

  • Should we trim down the duplicated metadata stored in the measurements table, deduplicating attributes like location, city, country, and coordinates?
  • We could store the station ID on every measurement object, and perform the aggregations and queries on the ID instead of on the coordinates. The advantage is that it’s easier to serve measurements for locations without coordinates (currently around 160 active locations). The disadvantage is that it would oblige us to update the complete archive on S3 with the station ID.

maschu09 commented Mar 5, 2019

All great. One thing: as mentioned in #12, the seeding of the new table should include locations that have no coordinates. Do you always know the originating country? If not, then a special code (XX?) should be defined for unclear cases.

@RocketD0g

Another question that has come up via @maschu09 :

Do you indeed intend to rewrite the S3 archive and include station IDs everywhere, or are the users (i.e. us) supposed to do the mapping via the locations table ourselves?

@sethvincent or @olafveerman (and @jflasher for good measure) will be better suited to answering this question, but my quick take: I am guessing that it won't be feasible to rewrite the 400 million data points in the S3 buckets, but a) you will be able to query the API* for this output, and b) for older data there will be a fairly painless way for users to mesh the locations with the archived data.

*Separate from this project, but on the order of the next few months, we are working to no longer have a 90-day restriction on the API.

@RocketD0g

Also, on:

Do you always know the originating country? If not, then a special code (XX?) should be defined for unclear cases.

Generally speaking, the country is usually defined, but you can imagine some scenarios where we don't know, or where the idea of applying 'country' is not fitting, so I agree we should designate a special code for cases where that is unclear. To @maschu09's suggestion, 'XX' looks like the standard, so I think we should just go with that.

jflasher commented May 5, 2019

@vgeorge @olafveerman I'm wondering if we actually need to change openaq-fetch to address uniqueness? Maybe put another way, what is the problem if we change nothing in openaq-fetch? The unique IDs would still be generated via the other code so no matter what is getting inserted by fetch, it'll still get handled downstream. If we were saving the unique ID with every measurement, then I think we'd need to update fetch, but if we're not doing that (which I don't believe we are?), then I am not sure we need to change fetch?

vgeorge commented May 13, 2019

@jflasher I believe you are right. The Athena query groups measurements by coordinates and parameter id, using the queries from this file:

https://github.com/openaq/openaq-api/pull/394/files#diff-46368f02f56796a118231baa2c760aaf

If there is a duplicated measurement, it will have no effect on id generation.
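
To illustrate the point (this is not the actual query from the linked file): when IDs are derived from a GROUP BY over coordinates and parameter, a duplicated row adds no new group, so the derived set of IDs is unchanged.

```typescript
// Not the actual query from openaq/openaq-api#394, just a minimal
// illustration: IDs are derived per (coordinates, parameter) group, and
// inserting the same measurement twice cannot create a new group.
const idGroups = `
  SELECT coordinates, parameter
  FROM measurements
  GROUP BY coordinates, parameter;
`;
```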
