How to build other API output formats? #195

Closed
timwis opened this issue Jun 25, 2015 · 55 comments

Comments

@timwis
Contributor

timwis commented Jun 25, 2015

Hey guys, how would you recommend I go about building API output formats other than FeatureServer? For instance, how would I output a GeoJSON file in the OData spec? I can see clearly how I'd add a provider, but this is on the other side of the fence, and it's not clear to me even after perusing the codebase.

EDIT: Looks like we'd need to add this to every provider as a method in the controller, eh? Any other way come to mind? I figured each provider just got it into a standard format (ie. GeoJSON) and then there were "output formats" that transformed the GeoJSON into the various output formats like FeatureService.

@chelm
Contributor

chelm commented Jun 25, 2015

@timwis It is certainly possible, but we are planning some refactoring of the exporter code, so it's sort of a minefield.

Right now all the data exports are pumped through ogr2ogr on the command line. If you look at these lines you'll see the formats: https://github.com/Esri/koop/blob/master/lib/Exporter.js#L13-L20

They match a given mime type in a url (http://...ckan/:host/:id.{kml,csv,zip,geojson})

We need to change this though. The koop exporting code should be an easily extendable ecosystem that supports converting GeoJSON to any format, so long as a format converter is included (not sure that makes sense). Long story short: it's a bit of work that we are planning on doing but haven't gotten to yet.
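
For illustration, a minimal sketch of the kind of extendable export registry described above, assuming converters are plain functions keyed by file extension. The names `registerFormat` and `exportAs` are hypothetical and not part of Koop, which at this point pipes exports through ogr2ogr instead.

```javascript
// Hypothetical registry: maps a file extension to a converter function that
// takes a GeoJSON FeatureCollection and returns a string.
const converters = new Map();

function registerFormat(extension, convert) {
  converters.set(extension, convert);
}

function exportAs(extension, geojson) {
  const convert = converters.get(extension);
  if (!convert) {
    throw new Error('No converter registered for format: ' + extension);
  }
  return convert(geojson);
}

// GeoJSON is the pass-through format.
registerFormat('geojson', (geojson) => JSON.stringify(geojson));

// A naive CSV converter: properties only, no quoting or escaping.
registerFormat('csv', (geojson) => {
  const features = geojson.features;
  if (features.length === 0) return '';
  const fields = Object.keys(features[0].properties);
  const rows = features.map((f) => fields.map((k) => f.properties[k]).join(','));
  return [fields.join(',')].concat(rows).join('\n');
});

const sample = {
  type: 'FeatureCollection',
  features: [
    { type: 'Feature', properties: { name: 'A', zip: 19145 }, geometry: null },
    { type: 'Feature', properties: { name: 'B', zip: 19107 }, geometry: null }
  ]
};

console.log(exportAs('csv', sample));
```

Adding a format then means registering one more function, without touching the code that resolves the extension from the URL.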

@timwis
Contributor Author

timwis commented Jun 25, 2015

Thanks @chelm but I think this is a bit different than export format -- it's closer to the featureservice output format in that it's a queryable API. I'm talking about enabling querystring params to filter the results, potentially geometry queries, etc. In my opinion there are a lot of possibilities with this. Further exploration pointed me to processFeatureServer in BaseController, where I imagine processOData would likely go. Though ideally the processing logic would be a module, itself. In the OData example, there are a few node.js implementations of OData. If I can find one that works off of a GeoJSON file, it's literally just a matter of installing the module and popping in the call to it.

@chelm
Contributor

chelm commented Jun 25, 2015

@timwis Oh, I'm sorry! I jumped to conclusions. Cool, so I am totally new to OData. Will look at it. Koop does pass the where and geometry querystring params down to the DB, so this should be doable. For really large data it's crucial that queries happen at the DB level, so I'm interested in a module that converts OData queries into SQL or similar. Will look.

@chelm
Contributor

chelm commented Jun 25, 2015

@timwis I feel like something like this would be useful to have here: https://github.com/koopjs/koop-pgcache/blob/master/index.js#L324 where the options param is actually already the entire query string from the request. It could be parsed into a query.

@chelm
Contributor

chelm commented Jun 25, 2015

This looks useful https://github.com/auth0/node-odata-parser

It could also make sense to add this at the cache wrapper level, in the get method of lib/Cache.js. It would do the conversion there into the set of params that Koop's caches want, like limit, offset, where (SQL), etc. See https://github.com/Esri/koop/blob/master/lib/Cache.js#L124
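
A rough sketch of that conversion, assuming the cache wants `limit`, `offset`, and a SQL `where` string as mentioned above. The function names are hypothetical, and the filter handling is a naive keyword swap rather than a real parser; a library such as node-odata-parser would be the sturdier route.

```javascript
// Maps OData comparison and logical operators to SQL equivalents.
const ODATA_OPERATORS = {
  eq: '=', ne: '<>', gt: '>', ge: '>=', lt: '<', le: '<=', and: 'AND', or: 'OR'
};

// Naive translation: swaps operator keywords that appear outside quoted
// strings. A field literally named "le" or "or" would be mangled, and the
// result is a raw SQL string, so this is for illustration only.
function filterToWhere(filter) {
  return filter
    .split(/('[^']*')/) // odd-indexed parts are the quoted literals
    .map((part, i) => {
      if (i % 2 === 1) return part;
      return part.replace(/\b(eq|ne|gt|ge|lt|le|and|or)\b/gi,
        (op) => ODATA_OPERATORS[op.toLowerCase()]);
    })
    .join('');
}

// Converts OData system query options into cache-style params.
function odataToCacheParams(query) {
  const params = {};
  if (query.$top !== undefined) params.limit = parseInt(query.$top, 10);
  if (query.$skip !== undefined) params.offset = parseInt(query.$skip, 10);
  if (query.$filter) params.where = filterToWhere(query.$filter);
  return params;
}

console.log(odataToCacheParams({
  $top: '10',
  $skip: '20',
  $filter: "foo lt 1 and bar eq 'two or three'"
}));
```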

@chelm
Contributor

chelm commented Jun 25, 2015

@timwis great idea BTW

@timwis
Contributor Author

timwis commented Jun 25, 2015

Thanks! I appreciate the prompt responses. I didn't realize the output format and the query operations were separate like that. I see that the query operations even affect the GeoJSON output format:
http://koop.dc.esri.com/ckan/phil/heart-healthy-screening-sites?where=ZIP_CODE+%3D+19145

That lexer library looks like a good start, though I imagine applying the resulting object to a postgres database call is still a significant step. I saw some other libs out there for odata too though, so hopefully they'll get us the rest of the way.

@ajturner
Contributor

@timwis there have only been a few new "output APIs" built. I think adding more, like OData or SODA, should illuminate what the common patterns are and then make the development of these very straightforward.

Personally I think it should be decoupled output vs. input - so you can write an ODATA output that is deployed with the GitHub/AGOL/CKAN inputs and choose if you want PGCache or ESCache. A bit like pluggable modules in the three layer stack.

@timwis
Contributor Author

timwis commented Jun 26, 2015

Agreed, that sounds ideal

@timwis
Contributor Author

timwis commented Jul 17, 2015

Got a working proof of concept. Just finishing up at Code for Philly so I will reply in the next day or two with what I learnt and ideas on a more thorough implementation, but wanted to show you in the meantime:

A koop plugin that does 2 things: parse the querystring (the API) and output XML (it's not technically OData compliant but proof of concept):
https://github.com/timwis/koop-odata

And I had to modify one of the providers to use it (which I have some ideas about):
https://github.com/koopjs/koop-ckan/compare/master...timwis:modular-output?expand=1

@chelm
Contributor

chelm commented Jul 17, 2015

@timwis Very cool. I have some ideas on this too, and your branch looks like a great start. I may make a few PRs to your fork if that's cool.

Nice work

@timwis
Contributor Author

timwis commented Jul 19, 2015

Thanks @chelm! Here are some thoughts I had along the way:

Output format vs API
First of all, we're talking about more than an output format - this is also an API. Presently, there appear to be 2 APIs built in to koop: featureserver (outFields, where, etc.) and perhaps tiles (x/y/z). This is different from the output format - for instance, you can output featureservers in HTML, JSON, KML, etc. and I'm pretty sure koop even lets you use the featureserver API with GeoJSON. Note that I'm calling the ability to query an "API" but perhaps there's a better/less confusing term to use. (EDIT: "Interface" or "query type" may be a better/less confusing name for the interaction type.)

Modular / plugins
Obviously adding an odata handler to every provider is not ideal, as we'd have to do it all over again when we want to add another output format. And the handlers for featureserver, tiles, etc. are not much different - what if there was simply a route for get /ckan/:id/:item/:api and the controller simply checked if a plugin by the api's name is registered and then passed it req, res before proceeding with the query? It would have to do it again once the query is fetched in order to adjust the output format. This is similar to how I'm doing it in my branch of koop-ckan but it would look more like:

if( ckan.koop[req.params.api] !== undefined ) {
   ckan.koop[req.params.api]._beforeFind(req, res);
}

(The plugin would modify req and res by reference, so the controller can be unopinionated about what it does to them, like a real plugin)
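
A fuller sketch of that flow, with the second hook included. Only `_beforeFind` comes from the snippet above; `_afterFind`, the `plugins` object, `fetchGeoJSON`, and `findResource` are hypothetical names, and `req`/`res` are plain objects standing in for the Express ones.

```javascript
// Hypothetical plugin registry keyed by the :api route parameter.
const plugins = {
  odata: {
    // Rewrites OData-style params on req.query before the query runs.
    _beforeFind(req, res) {
      if (req.query.$select) {
        req.query.outFields = req.query.$select;
        delete req.query.$select;
      }
    },
    // Adjusts the response shape once the data has been fetched.
    _afterFind(req, res, geojson) {
      res.body = { value: geojson.features.map((f) => f.properties) };
    }
  }
};

// Stand-in for the provider's model lookup; returns fixed data here.
function fetchGeoJSON(query) {
  return {
    type: 'FeatureCollection',
    features: [{ type: 'Feature', properties: { foo: 1 }, geometry: null }]
  };
}

function findResource(req, res) {
  const api = req.params.api;
  const registered = Object.prototype.hasOwnProperty.call(plugins, api);
  const plugin = registered ? plugins[api] : undefined;

  if (plugin !== undefined) plugin._beforeFind(req, res);
  const geojson = fetchGeoJSON(req.query);
  if (plugin !== undefined) {
    plugin._afterFind(req, res, geojson);
  } else {
    res.body = geojson; // no plugin: plain GeoJSON
  }
  return res;
}

const req = { params: { api: 'odata' }, query: { $select: 'foo' } };
const res = findResource(req, {});
console.log(req.query.outFields, JSON.stringify(res.body));
```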

Query language
Finally, the way koop-odata works is it's translating the OData API string ($select=foo,bar&$filter=foo lt 1 AND bar eq 'two') into the featureserver version of the API string (outFields=foo,bar&where=foo < 1 AND bar = 'two') so it can be handled by koop. But then koop converts the featureserver query string into PostgreSQL syntax (or the query language of whatever database is being used). I'd suggest standardizing on an ORM or query builder and having the APIs simply be translated into a programmatic query on a database instance or query object that gets passed to them. For instance:

req.query.$select.split(',').forEach(function(field) {
   dbQuery.select(field);
});
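
To make the `dbQuery` idea concrete, here is a toy query object, assuming a chainable `select`/`where` API with positional placeholders. `createQuery` is invented for this sketch and is not knex.js or any real library; a real builder would also quote identifiers.

```javascript
// Hypothetical minimal query object. Field names are inserted as-is, so they
// must come from a trusted list; values go into placeholders ($1, $2, ...).
function createQuery(table) {
  const fields = [];
  const conditions = [];
  const values = [];
  return {
    select(field) {
      fields.push(field);
      return this;
    },
    where(field, operator, value) {
      values.push(value);
      conditions.push(field + ' ' + operator + ' $' + values.length);
      return this;
    },
    toSQL() {
      let text = 'SELECT ' + (fields.length ? fields.join(', ') : '*') + ' FROM ' + table;
      if (conditions.length) text += ' WHERE ' + conditions.join(' AND ');
      return { text: text, values: values };
    }
  };
}

const dbQuery = createQuery('layer');
const query = { $select: 'foo,bar' };
query.$select.split(',').forEach(function (field) {
  dbQuery.select(field);
});
dbQuery.where('foo', '<', 1);

console.log(dbQuery.toSQL().text); // SELECT foo, bar FROM layer WHERE foo < $1
```

Each interface would then only need to know how to call `select` and `where`, and each cache would decide how to turn the resulting object into its own query language.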

Let me know if this is off base - part of me thinks that it should hit the API logic before it hits the provider, but at a certain point I'm probably overthinking it.

EDIT: I'm now noticing that the featureserver handler applies things like outFields after fetching the data, so I'm wondering whether I'm missing something critical here.

@timwis
Contributor Author

timwis commented Jul 27, 2015

Hey guys, any thoughts on direction here? Didn't want to go too far down the path I described above without feedback from you as you're far more familiar with the tool than I am.

@dmfenton
Contributor

My recommendation is the following:

  • Encapsulate all generic odata parsing code into a separate module
    • e.g. transformation between ODATA and GeoJSON
    • e.g. parsing ODATA into standard where clause
  • Then modify Koop's base controller and model to leverage this package. It would be similar to how we've handled feature service, but better because we'd have better separation of concerns as well as code that is reusable for other ODATA projects.

If you do this right from the base controller/base model, every provider should be able to get this functionality "for free" just by virtue of being part of Koop.
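
A small sketch of the "for free" part, assuming the shared handler lives on a base class. `processOData` is the name floated earlier in the thread; the `odata` helper, the class layout, and the fake model are assumptions and do not reflect how Koop's BaseController is actually written.

```javascript
// Stand-in for a standalone, reusable OData module (hypothetical API).
// Only handles the "eq" operator to keep the sketch short.
const odata = {
  toWhere(filter) {
    return filter.replace(/\beq\b/g, '=');
  }
};

class BaseController {
  constructor(model) {
    this.model = model;
  }

  // Shared handler: every provider controller inherits it unchanged.
  processOData(query) {
    const where = query.$filter ? odata.toWhere(query.$filter) : null;
    return this.model.find({ where: where });
  }
}

// A provider controller adds only provider-specific behavior.
class CkanController extends BaseController {}

// Fake model that echoes the options it receives.
const fakeModel = { find: (options) => options };
const controller = new CkanController(fakeModel);

console.log(controller.processOData({ $filter: 'ZIP_CODE eq 19145' }));
```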

@ngoldman what do you think?

@timwis
Contributor Author

timwis commented Jul 27, 2015

Thanks @dmfenton. What do you think of using a query builder like knex.js instead of translating ODATA to FeatureServer Query Language and then translating that to PostgreSQL (as it currently does)?

@dmfenton
Contributor

I'm not sure this is going to work for our use-case.

I updated my comment above because we don't actually need to generate something specific to Feature Services. It's just a standard where clause.

At the end of the day it makes the most sense to generate the query at the cache level. Think about the difference between Elasticsearch and SQL syntax for example.

There's also the issue that knex.js doesn't seem to support queries for data stored as JSON (which is specific to koop-pgcache).

@timwis
Contributor Author

timwis commented Jul 28, 2015

Okay @dmfenton, having reread your comment, it looks like you're recommending making the ODATA format part of the core koop code. It seemed like @ajturner and I were talking about making output modular instead. In theory, that could mean koop core just provides GeoJSON interaction (perhaps with a basic ?where= filter), and FeatureServer is an output/interface module, as is ODATA and SODA.

Here's my understanding of how it currently works versus the modular output idea:

Current path for queries

  • provider -> router -> controller.featureserver
    • model.getResource
      • cache.get
        • db.select
          • select id, feature.properties as props, feature.geometry as geom from layer
          • parse where via db.createWhereFromSql()
          • parse geometry via db.parseGeometry()
          • (doesn't parse select/outfields)
          • query row count with filters applied
          • apply ORDER BY, LIMIT, OFFSET params
          • exec query via db._query()
          • push results into GeoJSON array & return
    • process results into featureserver format

Proposed path for queries

  • provider -> router -> controller.findResource
    • interface.parse
      • parse SELECT and WHERE params into some common form (SQL string, ORM, or JS object)
    • model.getResource
      • cache.get
        • db.select
          • same logic as before but using the common form described above
          • also parse select/outfields here
    • interface.formatOutput
      • convert GeoJSON to expected output of interface
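
The proposed path can be sketched roughly as follows. `parse`, `formatOutput`, `getResource`, and `findResource` follow the outline above; the shape of the "common form" (a plain object with `fields` and `limit`), the `$limit` param, and the in-memory model are assumptions made to keep the example runnable.

```javascript
// Hypothetical interface module: turns its own query params into a common
// form, and turns GeoJSON into its own response shape.
const exampleInterface = {
  parse(query) {
    return {
      fields: query.$select ? query.$select.split(',') : null,
      limit: query.$limit ? parseInt(query.$limit, 10) : null
    };
  },
  formatOutput(geojson) {
    return geojson.features.map((f) => f.properties);
  }
};

// Stand-in for model.getResource -> cache.get -> db.select, done in memory.
const model = {
  rows: [
    { id: 1, name: 'A', zip: 19145 },
    { id: 2, name: 'B', zip: 19107 }
  ],
  getResource(common) {
    let rows = this.rows;
    if (common.limit !== null) rows = rows.slice(0, common.limit);
    const features = rows.map((row) => {
      const properties = {};
      (common.fields || Object.keys(row)).forEach((k) => {
        properties[k] = row[k];
      });
      return { type: 'Feature', properties: properties, geometry: null };
    });
    return { type: 'FeatureCollection', features: features };
  }
};

// controller.findResource: unaware of any interface-specific details.
function findResource(iface, query) {
  const common = iface.parse(query);
  const geojson = model.getResource(common);
  return iface.formatOutput(geojson);
}

console.log(findResource(exampleInterface, { $select: 'name', $limit: '1' }));
```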

@dmfenton
Contributor

That's one way to go. The other would be to take the path of writing a plugin. You still would end up modifying the core koop code to enable that plugin.

https://github.com/koopjs/koop/blob/master/index.js#L104-L106
https://github.com/koopjs/koop/blob/master/lib/BaseModel.js#L70-L82
https://github.com/koopjs/koop-tile-plugin

Let me parse through the other part of your question later today.

@ungoldman
Contributor

I think the idea of moving towards an architecture of modular inputs and outputs with koop's core acting as the standard interface for querying GeoJSON makes sense. This would take time but it's the right direction.

Right now the outputs are fixed to GeoJSON, FeatureService, KML, Shapefile, and CSV, but nothing other than time and work hours prevents us from making those pluggable components rather than the sole de facto outputs. @timwis I think we can use oData as a test case for developing a reasonable API that would fit this scenario. Keep in mind this would be long term in terms of the roadmap (at least a major version bump), as we need to continue supporting the current architecture in production.

I'll do some research and try to come up with a reasonable path.

@ajturner
Contributor

@ngoldman++

I think something worth discussing is more distinct terminology. "provider" is generic and applies to both Source API and Client API.

a few areas worth investigating:

  • Source Providers (data) need to implement an interface through which Koop can request data, e.g. persistence.js, Sequelize, bookshelf/knex, etc. are examples.
  • Request Providers (output) would define a REST API and then bind to this ORM
  • Koop core would need to support currying (or similar) since not all Data Providers would support all API Provider queries.
    • for example: source: GitHub GeoJSON | request api: oData
    • Koop receives oData request for filtered data, fetches the entire GeoJSON static file and the GitHub Data Provider knows that it now has to do in-memory filtering to return the array of features to the oData Request provider that sends it to the client.
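
One way to picture the fallback in that example, assuming each source declares whether it can filter. The `supportsFilter` flag and `getFeatures` are invented here, and the fallback is placed in core for brevity; it could equally live inside the provider, as described above.

```javascript
const features = [
  { type: 'Feature', properties: { zip: 19145 }, geometry: null },
  { type: 'Feature', properties: { zip: 19107 }, geometry: null }
];

// A static-file source (like a GeoJSON file on GitHub): cannot filter,
// always returns everything.
const staticSource = {
  supportsFilter: false,
  fetch() {
    return features;
  }
};

// A database-like source: applies the predicate itself.
const dbSource = {
  supportsFilter: true,
  fetch(predicate) {
    return features.filter(predicate);
  }
};

// Decides where filtering happens based on the declared capability.
function getFeatures(source, predicate) {
  if (source.supportsFilter) return source.fetch(predicate);
  return source.fetch().filter(predicate); // in-memory fallback
}

const inZip = (f) => f.properties.zip === 19145;
console.log(getFeatures(staticSource, inZip).length); // 1
console.log(getFeatures(dbSource, inZip).length); // 1
```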

@ungoldman
Contributor

If we're talking about redefining terminology, architecture, and API conventions, why not just call them inputs and outputs? Then you have an ecosystem like:

Inputs

  • ArcGIS Online
  • Socrata
  • CKAN
  • Github
  • OpenStreetMap
  • ...

Caches

  • PostGIS
  • ElasticSearch
  • ...

Outputs

  • GeoJSON (built-in)
  • CSV
  • KML
  • Shapefile
  • Feature Service
  • oData
  • ...

Core would take a request, use input module to perform ETL and stream data to cache, then use output module to stream response filtered as necessary. So:

  1. request: GET /:input/:resource/:output (with optional query params)
  2. extract resource from input
  3. transform to GeoJSON
  4. load into cache
  5. extract from cache
  6. filter and transform GeoJSON based on output query
  7. response: desired output

If data was already cached, skip steps 2 through 4.
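
The numbered flow above might look like this in miniature. Everything here is an assumption for illustration: the `inputs`/`outputs` registries, the `handle` function, and a `Map` standing in for a real cache such as PostGIS. Output-side filtering is omitted.

```javascript
// Hypothetical input module: extracts a resource and transforms it to GeoJSON.
// extractCount only exists to show when the source is actually hit.
let extractCount = 0;
const inputs = {
  ckan: {
    extract(resource) {
      extractCount += 1;
      return [{ name: resource, zip: 19145 }];
    },
    toGeoJSON(rows) {
      return {
        type: 'FeatureCollection',
        features: rows.map((r) => ({ type: 'Feature', properties: r, geometry: null }))
      };
    }
  }
};

// Hypothetical output modules: shape the cached GeoJSON for the response.
const outputs = {
  geojson: (geojson, query) => geojson,
  count: (geojson, query) => ({ count: geojson.features.length })
};

// In-memory stand-in for the cache layer.
const cache = new Map();

// Note that no request or response objects are passed around, only values.
function handle(inputName, resource, outputName, query) {
  const key = inputName + ':' + resource;
  if (!cache.has(key)) {
    const input = inputs[inputName];
    const raw = input.extract(resource);   // step 2: extract
    const geojson = input.toGeoJSON(raw);  // step 3: transform
    cache.set(key, geojson);               // step 4: load into cache
  }
  const cached = cache.get(key);           // step 5: extract from cache
  return outputs[outputName](cached, query); // steps 6-7: transform, respond
}

console.log(handle('ckan', 'parks', 'count', {}));
console.log(handle('ckan', 'parks', 'geojson', {}).features.length);
console.log('extract calls:', extractCount);
```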

As an aside, in my opinion the only piece handling raw requests and responses should be the server. ETL, cache management, GeoJSON manipulation, and querying should have nothing to do with web requests. There's no need to pass those objects around.

@ungoldman
Contributor

(I see the need for an ORM or something along those lines to deal with the filtering at the core level)

@chelm
Contributor

chelm commented Jul 28, 2015

@ngoldman What do you mean by "server" in this quote?

"in my opinion the only piece handling raw requests and responses should be the server"

The term "server" is not mentioned in the list of terms :)

@dmfenton
Contributor

Chelm is being cheeky but he's got a point, I'm not sure I understand what you mean.

Also, one tricky thing about decoupling inputs and outputs totally is that the logic for creating cache keys is currently inside providers and is not standardized.

@ungoldman
Contributor

@chelm I mean the express middleware piece of koop. Right now requests and responses are handed over to providers. I think that's an unnecessary coupling.

@dmfenton As for the logic for creating cache keys being inside of providers... ideally it shouldn't be there, and should be standardized :)

We're discussing a big set of tasks here but I think it's worth talking about. I know there are real hurdles to implementing this in the near-term, but it's something we can put on the roadmap.

@ajturner
Contributor

"Also, one tricky thing about decoupling inputs and outputs totally is that the logic for creating cache keys is currently inside providers and is not standardized."

This is straightforward to address as part of this architecture discussion. In fact it should be another part of Koop-core or a concrete Koop-cache-provider that encapsulates any concepts of cache and cache-keys.

Agreed with @ngoldman server means Koop core. Or call it Koop Kernel 🌽

@chelm
Contributor

chelm commented Jul 28, 2015

@dmfenton @ngoldman The reason that cache keys are in providers is that they depend on the provider to define them based on values that only matter to that provider. The provider will always need to define what it uses to make a request and how that key is used to back out that same request...
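
As a sketch of what "define the key and back out the request" could mean, assuming each provider supplies a pair of functions. The `keySchemes` object and the GitHub fields are invented for the example; the CKAN fields mirror the `/ckan/:host/:id` route mentioned earlier.

```javascript
// Each provider decides which request values identify a dataset and how to
// recover them from the key. Values containing "|" would break this scheme.
const keySchemes = {
  ckan: {
    toKey: (p) => ['ckan', p.host, p.id].join('|'),
    fromKey: (key) => {
      const parts = key.split('|');
      return { host: parts[1], id: parts[2] };
    }
  },
  github: {
    toKey: (p) => ['github', p.user, p.repo, p.file].join('|'),
    fromKey: (key) => {
      const parts = key.split('|');
      return { user: parts[1], repo: parts[2], file: parts[3] };
    }
  }
};

const key = keySchemes.ckan.toKey({ host: 'phil', id: 'heart-healthy-screening-sites' });
console.log(key); // ckan|phil|heart-healthy-screening-sites
console.log(keySchemes.ckan.fromKey(key));
```

A standardized cache layer could own storage and expiry while still delegating these two functions to each provider.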

@ungoldman
Contributor

@chelm
Contributor

chelm commented Jul 28, 2015

I use "always" loosely of course. Everything can change and I think this is a great discussion to be having.

@dmfenton
Contributor

If we modify the base model and base controller (in the right way), we should not have to touch providers at all.

@ungoldman
Contributor

Doesn't supporting oData mean defining new routes (which are all defined by each individual provider for now)?

@chelm
Contributor

chelm commented Jul 28, 2015

@ngoldman isn't supporting OData just adding querystring params? Are there really new routes?

@dmfenton
Contributor

Good point. Maybe the providers need to inherit routes.

@timwis
Contributor Author

timwis commented Jul 28, 2015

Thanks for all the replies, guys! A few thoughts:

@ngoldman: If we're talking about redefining terminology, architecture, and API conventions, why not just call them inputs and outputs?

Check out my comment above - IMO there are 3 entities: "source type/provider," "output format," and "interface/API." I don't have much opinion on what they end up getting called of course, but I'm just trying to point out that "output" doesn't cover both. For example, the FeatureService "interface" can be "output" in Esri JSON, KML, HTML, GeoJSON, etc. OData has a $format= param that lets you alternate between XML and JSON (not a deal breaker, but just an example). Further, you might want to use some kind of filter to get all the features where ZIP_CODE=19107 as a CSV -- in that case, you're probably using the FeatureService interface and a CSV output format.

There are some implications of that on your list above - for instance, you may want to have the interface process the query first into some kind of common syntax or ORM object. (That way you can let the database handle the filtering when possible)

@dmfenton: the plugin route you described is how I implemented the koop-odata logic above as a proof of concept.

As regards a short-term solution, I'd point out that while OData is the topic at hand, the next one up is the SODA2 interface (already started), and after that perhaps the CKAN API interface (unless we think of a cooler one). So hard references to OData in core might not hold us over very long.

For what it's worth, I'm happy to keep working on this and even submit pull requests in line with where you want it to go. I'm really excited about the possibilities this could create, but I don't want to go too far down any of those paths without clarity from you guys on the direction. So I appreciate the conversation and am happy to participate!

@dmfenton
Contributor

Isn't the SODA2 interface covered in Koop-Socrata? Is that spec used elsewhere?

@timwis
Contributor Author

timwis commented Jul 29, 2015

@dmfenton: Isn't the SODA2 interface covered in Koop-Socrata? Is that spec used elsewhere?

Koop-Socrata enables you to connect to SODA2 APIs as a source/provider, but you have to interact with them using the FeatureService interface. It doesn't provide the ability to $select or $where using the SODA2 spec.

A SODA2 interface module would allow you to interact with any source/provider using the SODA2 query params. So you could have your data in CKAN or Esri and query it with the SODA2 API, just as we've been discussing with OData.
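
A minimal sketch of such an interface module, assuming it works by renaming SODA2 params to their FeatureService counterparts. The target names come from the ArcGIS REST API query operation; whether Koop honors each of them is not confirmed in this thread, and `sodaToFeatureService` is a hypothetical name, not the koop-soda code.

```javascript
// SODA2 parameter -> FeatureService parameter. Simple renames only; the
// values are passed through untranslated.
const SODA_TO_FEATURESERVICE = {
  $select: 'outFields',
  $where: 'where',
  $order: 'orderByFields',
  $limit: 'resultRecordCount',
  $offset: 'resultOffset'
};

function sodaToFeatureService(query) {
  const translated = {};
  Object.keys(query).forEach((key) => {
    const known = Object.prototype.hasOwnProperty.call(SODA_TO_FEATURESERVICE, key);
    // Unknown params pass through under their original name.
    translated[known ? SODA_TO_FEATURESERVICE[key] : key] = query[key];
  });
  return translated;
}

console.log(sodaToFeatureService({
  $select: 'name,zip',
  $where: 'zip > 19100',
  $limit: '5',
  f: 'json'
}));
```

Simple comparisons carry over unchanged, but SODA2 functions inside `$where` would need real translation, which is where this approach starts to strain.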

@ungoldman
Contributor

@timwis I'm starting to see your point about interfaces vs. outputs... I was lumping exports (downloads) with interfaces into the output category but it may make sense to make those distinct.

@timwis
Contributor Author

timwis commented Jul 31, 2015

Pulled a SODA2 parser together

@timwis
Contributor Author

timwis commented Aug 5, 2015

I made a pull request based on the proposed path for queries above (koopjs/koop-provider-ckan#9). It doesn't solve (a) the fact that we'd have to do this for every provider, or (b) the fact that we're translating to FeatureServer query language before translating to a database query, but it seems like the best option before the more heavy changes discussed above.

@timwis
Contributor Author

timwis commented Aug 6, 2015

Here's a basic, working version of a SODA2 interface for koop, working the same way as the OData interface: https://github.com/timwis/koop-soda

@timwis
Contributor Author

timwis commented Aug 7, 2015

The above interface, koop-soda, works with very basic conditionals, i.e. foo > 2 AND bar < 3, but when you get into more complex queries with operators and functions, it starts to get complicated. This list of features should highlight the need to talk about a standard way to query caches beyond translating into FeatureService query format and translating that into the cache's query language (i.e. PostgreSQL). Any thoughts? Or can we pull it all off via FeatureService query format after all? (I'm a bit rusty with it)

@ungoldman
Contributor

@timwis I'd rather not force everything to get translated twice before interacting with the cache, and there's no need to make FeatureService queries the de facto format. However, coming up with a query standard to translate other APIs into is not trivial and would take some time (it would need to cover queries from many different interfaces). If you're not in a rush we can figure that out, but I don't want to hold up your project :)

@ajturner
Contributor

ajturner commented Aug 8, 2015

future stuff

What seems optimal would be a mixture of Peg.js expression parser + a relational set algebra like Arel

I haven't investigated these but knit.js or Codd look promising.
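
To show the shape of that idea without pulling in PEG.js, here is a hand-written recursive-descent parser standing in for a generated one. It turns a simple filter expression into a tree, which can then be compiled either to parameterized SQL or to an in-memory predicate. The grammar, node shapes, and function names are all assumptions; negative numbers and functions are not handled.

```javascript
const COMPARATORS = {
  '=': (a, b) => a === b, '!=': (a, b) => a !== b, '<>': (a, b) => a !== b,
  '<': (a, b) => a < b, '>': (a, b) => a > b,
  '<=': (a, b) => a <= b, '>=': (a, b) => a >= b
};
const TOKEN = /\d+(?:\.\d+)?|'[^']*'|<=|>=|<>|!=|[=<>()]|\w+/g;

// Grammar: or := and (OR and)* ; and := cmp (AND cmp)* ;
//          cmp := '(' or ')' | field operator value
function parse(input) {
  const tokens = input.match(TOKEN) || [];
  let pos = 0;
  const peek = () => tokens[pos];
  const next = () => tokens[pos++];
  const isKeyword = (word) => peek() !== undefined && peek().toUpperCase() === word;

  function parseOr() {
    let node = parseAnd();
    while (isKeyword('OR')) {
      next();
      node = { type: 'or', left: node, right: parseAnd() };
    }
    return node;
  }
  function parseAnd() {
    let node = parseComparison();
    while (isKeyword('AND')) {
      next();
      node = { type: 'and', left: node, right: parseComparison() };
    }
    return node;
  }
  function parseComparison() {
    if (peek() === '(') {
      next();
      const inner = parseOr();
      if (next() !== ')') throw new Error('Expected )');
      return inner;
    }
    const field = next();
    const op = next();
    const raw = next();
    if (!Object.prototype.hasOwnProperty.call(COMPARATORS, op)) {
      throw new Error('Unknown operator: ' + op);
    }
    if (raw === undefined) throw new Error('Unexpected end of input');
    const value = raw[0] === "'" ? raw.slice(1, -1) : Number(raw);
    return { type: 'compare', field: field, op: op, value: value };
  }

  const ast = parseOr();
  if (pos < tokens.length) throw new Error('Unexpected token: ' + peek());
  return ast;
}

// Backend 1: parameterized SQL for a database cache.
function toSQL(node, values) {
  if (node.type === 'compare') {
    values.push(node.value);
    return node.field + ' ' + node.op + ' $' + values.length;
  }
  const joiner = node.type === 'and' ? ' AND ' : ' OR ';
  return '(' + toSQL(node.left, values) + joiner + toSQL(node.right, values) + ')';
}

// Backend 2: a predicate for sources that must filter in memory.
function toPredicate(node) {
  if (node.type === 'compare') {
    const compare = COMPARATORS[node.op];
    return (props) => compare(props[node.field], node.value);
  }
  const left = toPredicate(node.left);
  const right = toPredicate(node.right);
  return node.type === 'and'
    ? (props) => left(props) && right(props)
    : (props) => left(props) || right(props);
}

const ast = parse('foo > 2 AND bar < 3');
const values = [];
console.log(toSQL(ast, values), values);
console.log(toPredicate(ast)({ foo: 5, bar: 1 })); // true
```

The point of the intermediate tree is that each interface only needs a parser and each cache only needs a compiler, instead of every pair needing its own translation.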

@timwis
Contributor Author

timwis commented Aug 9, 2015

In the meantime, I've just finished the first prototype of soda-postgres which basically translates the SODA-specific functions (like within_box) to their Postgres equivalents, outputs json in the right format, etc. Demo links at the bottom of the readme to see it in action. Interesting to think about for this discussion.
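
As an illustration of that kind of function translation (not the soda-postgres code itself), the sketch below rewrites `within_box` into a PostGIS bounding-box test. It assumes `within_box` takes a column followed by the north-west latitude and longitude and then the south-east latitude and longitude, that the column is a geometry in SRID 4326, and that the `&&` overlap operator is an acceptable equivalent.

```javascript
// Rewrites within_box(column, nwLat, nwLng, seLat, seLng) calls found in a
// $where string. Only plain numeric literals are recognized.
function translateWithinBox(where) {
  const number = '\\s*([-\\d.]+)\\s*';
  const pattern = new RegExp(
    'within_box\\(\\s*(\\w+)\\s*,' + [number, number, number, number].join(',') + '\\)',
    'g'
  );
  return where.replace(pattern, (match, column, nwLat, nwLng, seLat, seLng) => {
    // ST_MakeEnvelope takes (xmin, ymin, xmax, ymax, srid),
    // i.e. west, south, east, north.
    return column + ' && ST_MakeEnvelope(' +
      [nwLng, seLat, seLng, nwLat, 4326].join(', ') + ')';
  });
}

console.log(translateWithinBox('within_box(location, 47.71, -122.38, 47.59, -122.27)'));
// location && ST_MakeEnvelope(-122.38, 47.59, -122.27, 47.71, 4326)
```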

@ajturner
Contributor

ajturner commented Aug 9, 2015

@timwis the demo links point to an IP server that isn't up or publicly available

@timwis
Contributor Author

timwis commented Aug 9, 2015

Sorry about that @ajturner, should be up now. It's really hard to come by a free Postgres DB with PostGIS enabled. OpenShift is the only one that offers it, and I spent hours debugging their ambiguous logs before finally deciding to just do a $5 DigitalOcean droplet. Crossing my fingers that "forever" keeps it running like supervisor would.

@dmfenton
Contributor

This discussion is well covered in the roadmap.

@timwis
Contributor Author

timwis commented Nov 16, 2015

That's awesome! Just read the roadmap - extremely well articulated - can't wait! Let me know if there's anything I can do to help

@dmfenton
Contributor

Will do. Next steps for us are to write out some documentation for the V3 spec. Then we'll prototype something. There's definitely space in there for you to write a package that supports something other than geoservices.

More to come in the following months.

@timwis
Contributor Author

timwis commented Mar 14, 2018

Hey folks! Just saw the splash image in the readme suggests this was implemented! :-O Is that the case?

@dmfenton
Contributor

dmfenton commented Mar 14, 2018

Yep! Check out koopjs/koop-output-geoservices and koopjs/koop-output-wfs for example
