Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add streamflow API spec #88

Merged
merged 15 commits into from
Aug 23, 2018
Merged

Add streamflow API spec #88

merged 15 commits into from
Aug 23, 2018

Conversation

rod-glover
Copy link
Contributor

@rod-glover rod-glover commented Aug 14, 2018

Resolves #86

Updated from initial draft circulated a few weeks ago. Updates:

  • Base for streamflow-related endpoints: services.pacificclimate.org/streamflow/ (not routed_streamflows).
  • All resource names plural (e.g., /streamflow/orders, not /streamflow/order).
  • All response bodies are fully fleshed out (or should be).
  • Consistent use of hyperlinks in all response bodies.
  • Hyperlink ref properties are abbreviated (e.g., 'cancel') instead of long-form hyperlink.
  • Full factoring of commonalities in schemas, parameters, responses, etc.

Notes:

description: Defines a new order for a streamflow result set.
type: object
properties:
hydromodel_output_id:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, but am not sure, that the way to indicate the hydromodel output desired as input to this computation is with such an id. This has the virtue of simplicity but it exposes an internal implementation detail, and that is regarded as undesirable in resource-oriented architectures (RESTful architecture).

The alternative would be the hyperlink to the hydromodel output. This does not expose an internal, but it does require the backend to translate the hyperlink to an actual hydromodel output record, which is more roundout.

I'm not sure which to choose.

properties:
method:
type: string
enum: [email]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add none option?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm having a hard time thinking of a circumstance where we'd want the none option.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, and although it is easily circumvented, it is a way to attribute orders to users.

latitude:
type: number
example: 48.5
notification:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should make this notifications: [ ... list ... ] ? Is there a plausible use case for multiple notifications?

@rod-glover
Copy link
Contributor Author

rod-glover commented Aug 14, 2018

An overarching question: Should the list resources (e.g., /streamflow/orders) return less metadata for each listed item than do the individual item resources (e.g., /streamflow/orders/4). Alternately stated: Should the individual item resources contain more info?

I'd like to review all such resources with an eye to this question and the general question of whether the representations for each type of resource is yet complete.

@corviday
Copy link
Contributor

corviday commented Aug 15, 2018

My approach has been to assume that the resource collection endpoints provide all information needed to answer the question: which resource am I looking for? And that the resource documents return all information needed to answer the question: I've now selected this resource, what information is available about it?

Maybe this is the backwards approach, though, since what information is needed to select the resource of interest varies based on how one expects the information to be used or what one expects the front end to be like.

One could, for example, imagine multiple UIs allowing a user to select a hydromodel output dataset:

  1. Possible UI 1: the user selects which GCM and emissions scenario they are interested in from dropdown menus, the spatial coverage of that hydromodel output dataset is displayed on a map, then the user picks a point they are interested in from a map. For this scenario, you'd want the hydromodel output resource collection to return GCM and emissions scenario information, but since you don't display the hydromodel output on a map until after the user picks one, you could have grid info on the resource document, and not in the resource collection.

  2. Possible UI 2: the user picks a point on a map, and then gets a list of all the available GCM/scenario combinations of hydromodel output that could provide data at that map point. In this case, you'd want the hydromodel output collections resource to contain spatial coverage information, but individual resource documents would have the GCM and scenario.

(I've been assuming our UI will look more like the former, because I think it would be easier to design a UI like that, but the latter might be more in line with how the engineers want to approach things; they probably start out knowing exactly what point in space they want to get information about, and don't care as much about available models.)


I'm not particularly attached to this approach and wouldn't lose anything if we decided that the resource collections endpoints should have as much information as the resource document endpoints.

example: rcp85
etc:
type: string
example: ... more tbd ...
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spatial coverage info is needed, I think, so that it's possible to tell whether a result for a given point can be generated from this hydromodel output.

Something like minimum and maximum latitude and longitude, as well as latitude and longitude cell size would be enough to provide context to the cell_x and cell_y values in the result endpoint.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yuck, it looks like hydromodel output files have irregular spatial extents:
spatialcoverage

So in order to provide enough information to determine whether a result file for given point can be made from this hydromodel output, we may need to return a geoJSON polygon or something as the spatial output., not just a bounding box. That's a bit of a headache.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, maybe if we need to provide actual spatial coverage, it could be a separate resource, like maybe, like /streamflow/hydromodel_output/4/bounds? We probably don't want to calculate it until a client is specifically looking for it, though caching could be a big help, it's not like the coverage of a netCDF file is going to change.

Copy link
Contributor Author

@rod-glover rod-glover Aug 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree completely. Caching will make a very big difference here, but it still may be better to provide the bounds at a separate endpoint. At this stage, I prefer that.

(BTW, the bounds would likely be a kind of information not supplied in the list resource that would be present in the hypothetical item resources. But partitioning it as a separate endpoint eliminates that question.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will we want to add filtering (to the list resource) by location? E.g., list all hydromodels that contain this location? Or some spatial extent?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm. Do you think it's better to do that sort of filtering in the back end or the front end?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. If the unfiltered list is large or the filtering is compute-intensive, then backend, because there is some hope that caching can speed up the response for common or repeated searches. Otherwise (short list, cheap filtering) doing it on the frontend is neutral to better, since it simplifies the backend.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, we have a short list. I don't know how compute-intensive filtering on spatial criteria is, but I'd guess it's not prohibitive. So we could probably do it just as easily on the frontend, assuming we can find a JS library with the required spatial testing method(s).


# Individual streamflow results

StreamflowResultState:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Temporal coverage information would be useful. A start date, a stop date, and a resolution, maybe? Temporal coverage info would mean you could actually figure out what dates to request for the annual cycle resource.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would that temporal coverage also (or alternatively) be part of the hydromodel_output, which I believe determines the temporal coverage of the computed result?

(There's no reason we can't add temporal metadata to both. I tend to want to not duplicate information, but this isn't a database that should be normalized, but information resources that are meant to be useful!)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, what do you recommend as the resolution datum?

  • A string enum like daily, monthly, yearly?
  • An exact interval specified either in seconds or by an ISO8601 duration string?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think putting temporal coverage in either the result metadata or the hydromodel output metadata would be fine, as long as it's accessible somewhere. It is theoretically possible to generate a result with a shorter timespan than the hydromodel output, but I can't imagine why we'd want to, and don't see any reason to build in support for that possibility out of the gate.


Upon further reflection, I think I was wrong. The result metadata does not need a time resolution attribute because I can't see any reason a client would need that information. A client never works with the raw result data and doesn't need details of its representation in the datafile; the client only interacts with the data in the form of the aggregated annual-mean, annual-max, and annual-cycle endpoints.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree on both points. I think that the temporal metadata should go into hydromodel outputs as well.

- rel: 'results/metadata'
uri: 'http://apis.paccifcclimate.org/streamflow/results/6'
- rel: 'results/annual-max'
uri: 'http://apis.paccifcclimate.org/streamflow/results/6/annual-max'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it's the RESTy thing to do to include these links to annual cycle, annual mean, and annual max even though they don't exist when the order is created and will be 404s?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great observation. REST has nothing to say about this either way. A link to an absent resource isn't really non-RESTful, but it's not really very logical either. A good rule is don't have links that you can know in advance are invalid.

- rel: 'cancel'
uri: 'http://apis.paccifcclimate.org/streamflow/orders/5'
- rel: 'hydromodel_output'
uri: 'http://apis.paccifcclimate.org/hydromodel_outputs/4'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I intentionally named the hydromodel resources as /hydromodel_outputs and not /streamflow/hydromodel_outputs because they aren't necessarily associated only with streamflow. Streamflow uses them, but so could other things.

Alternative viewpoint is that these resources are tuned to streamflow's needs, and we'd not want other concerns changing the API and breaking streamflow app(s).

What do you think?

Copy link
Contributor

@corviday corviday Aug 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess either way seems justifiable to me. No preference.

@rod-glover
Copy link
Contributor Author

@corviday, I think your reasoning (first 2 paragraphs) is right on. It matches what I understand of resource-oriented/RESTful design, and my own opinions.

I think the "support multiple UIs" principle is also important. You want UIs to be able to slice and dice information in various ways (like your example). It's an anti-pattern to have endpoints that are tuned for specific views of the information; instead you want endpoints that deliver broader information that the UI can present in various ways.

This does argue for collection endpoints that deliver as much metadata as the individual endpoints. Given that our endpoints are pretty much 100% metadata, aside from the explicitly data endpoints, I think that clinches it.



/streamflow/results/{id}/annual-cycle/{startYear}-{endYear}:
get:
Copy link
Contributor

@corviday corviday Aug 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A different way to handle the annual cycle resource that just occurred to me, if one of the REST design goals is to not make the client construct arbitrary URLs:

  • /streamflow/result/0 includes a link to /streamflow/result/0/annual-cycle (along with other metadata)
  • /streamflow/result/0/annual-cycle is solely a collection of links to annual cycles for valid climatologies present in this data, so it would have links to, eg:
    • /streamflow/result/0/annual-cycle/1961-1999
    • /streamflow/result/0/annual-cycle/1971-2000
    • /streamflow/result/0/annual-cycle/1981-2010
    • /streamflow/result/0/annual-cycle/2010-2039
    • /streamflow/result/0/annual-cycle/2040-2069
    • /streamflow/result/0/annual-cycle/2070-2099
  • Pages like /streamflow/result/0/annual-cycle/2070-2099 would actually return a timeseries.

Or there could be a climatology resource type, like /streamflow/result/0/climatology/1960-1989/annual-cycle or something, but I don't know of any other data analysis we'd like broken out by climatology, so I don't think that buys us anything.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You rock. This is an excellent idea.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This presumes that one knows in advance what climatologies are of interest. But it can't hurt to have a standard set listed.

@corviday
Copy link
Contributor

corviday commented Aug 17, 2018

I did a final readthrough of the API just now, and I think it looks great!

The only thing I noticed is that I think we've made the cell_x_index and cell_y_index attributes of result superfluous: a client doesn't need to care about cell indices in the netCDF file - a client cares about:

  1. whether the point it is interested in corresponds to an existing result, which it can determine with cell_x, cell_y, cell_x_size, and cell_y_size on result
  2. if not, whether the point it is interested in lies inside the spatial coverage of a particular hydromodel_output, which it can determine with hydromodel_output/{id}/bounds

There's no longer much use for cell_x_index or cell_y_index, though it doesn't do any harm to have them.

(I can, however, imagine scenarios in which a client might still want to know cell_x_size and cell_y_size on hydromodel_output; I can imagine a client that cares about the resolution of a dataset even if it doesn't have to calculate an exact cell itself.)

@rod-glover
Copy link
Contributor Author

Good point. I'll remove the cell indices; leaner is better.

@rod-glover
Copy link
Contributor Author

rod-glover commented Aug 18, 2018

I think this is it for the API design ... except for security (user authentication and authorization). But that's a big enough issue that I'll create a separate issue and PR (if needed) for it.

@rod-glover
Copy link
Contributor Author

Ack. API versioning. Which is a fraught subject. But important to consider and possibly do regardless.

Copy link
Contributor

@corviday corviday left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great to me.

@rod-glover
Copy link
Contributor Author

rod-glover commented Aug 21, 2018

Re. API versioning. Definitely worth reading the article I mentioned above, Your API versioning is wrong.

And it's even more worth reading the often very well-considered responses to that article. The general sentiment there seems to be that putting versioning in the resource URIs (e.g., /api/v2/...), however "wrong" it may be, is the most pragmatic approach and arguably not all that wrong anyway. The article's author agrees. I agree too.

So assuming we all agree on that, the question is whether to introduce versioning (in the URIs) now, or wait until there is actually another version. If we do that, then URIs sans version should default to v1 (to preserve compatibility) and anyone who wants v2 or later must specify so in the URIs.

I vote for the latter: initially without explicit version. I think that we will have a little more iteration before v1 can be considered stable, but that's OK. If there ever is a genuine v2, then we start versioning then.

@corviday
Copy link
Contributor

That all seems reasonable to me.

@rod-glover
Copy link
Contributor Author

The failing tests have nothing to do with the API spec. Gonna merge.

@rod-glover rod-glover merged commit 8fb7e35 into master Aug 23, 2018
@rod-glover rod-glover deleted the i86-streamflow-API branch August 23, 2018 20:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants