
[wip] high-level dataset api methods #39

Closed
wants to merge 20 commits into from

Conversation

rclark
Contributor

@rclark rclark commented Dec 21, 2015

Begins work adding higher-level functionality to dataset api support, including:

  • automatic addition of IDs to features without them
  • cligj support for streaming feature collections or line-delimited features on stdin
  • creation of a new dataset, or wholesale replacement of an existing one based on a local geojson file, paginating through requests as needed.
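The first bullet could be sketched roughly like this, assuming a generated hex UUID is an acceptable string id (`ensure_ids` is a hypothetical helper, not necessarily how this branch implements it):

```python
import uuid

def ensure_ids(features):
    """Yield features, assigning a generated string id to any feature
    that doesn't already carry one (hypothetical helper)."""
    for feature in features:
        if feature.get('id') is None:
            feature['id'] = uuid.uuid4().hex
        yield feature
```

Features streamed from stdin via cligj could be piped through something like this before being batched into API requests.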

I started by writing a put-features function and a put-dataset function. These are conceptually pretty simple, but I really like @sgillies' suggestion to make more "unixy" commands like ls, cp, etc. Some thoughts:

ls: list features in a dataset, paginates for you unless you supply limit and/or start arguments
rm: delete a dataset (indicated by dataset id), or delete a feature (indicated by datasetid.featureid).
cp: copy features from one dataset to another, from a file to a dataset, or from a dataset to a file.

Other ideas? These kinds of commands start to beg for "query" parameters that would let you cp a subset of features, but I'm not sure we want to go there....

cc @mick @willwhite @perrygeo

@mick

mick commented Dec 21, 2015

automatic addition of IDs to features without them

why have the cli assign feature ids and not let the api that already does that handle it?

@perrygeo
Contributor

@rclark This is sweet. I'm taking a more in-depth look this morning but some initial thoughts:

automatic addition of IDs to features without them

@mick Last I checked, when I attempted to put features without ids (or features with non-string ids), they were rejected by the datasets api. I think autogenerating ids (and accepting integer ids) could ultimately be handled at the API level.

In the meantime, I'm torn over whether handling it in the CLI, the SDK or an external tool would be more appropriate. I'd lean towards just making the change in the API but don't have a sense of how much of a lift that would be.

unixy commands

I like the idea but I think it could wait for a future release, maybe living beside but not replacing the existing functions.

These kinds of commands start to beg for "query" parameters that would let you cp a subset of features, but I'm not sure we want to go there....

Definitely. Longer term, attribute indexing/filtering at the API level might be the best way to achieve this. Not sure if anything like that is currently planned?

In the meantime, something like fio filter or jq could be used to query from the full collection. IOW, I don't think it's something we need to bake into the SDK/CLI.

```python
service = ctx.obj.get('service')
to_put = []

def id():
```

id is a python builtin - while python allows you to override builtins, we should probably just rename to fid to avoid conflicts.
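For illustration, a module-level counter under the suggested name (`fid` here is just a stand-in; the real helper may generate ids differently):

```python
import itertools

_counter = itertools.count(1)

def fid():
    """Return a fresh string feature id. Named fid rather than id so
    the builtin id() remains usable in this scope."""
    return str(next(_counter))
```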

@perrygeo
Contributor

@rclark What about

  • adding a delete-features command
  • removing the batch-update-features command, and the singular put-feature and delete-feature (the multi-feature versions should handle a single feature as well)

```python
def put_features(ctx, dataset, features):
    """Insert or update features in a dataset.

        $ mapbox dataset put-features features
```

mapbox dataset put-features DATASET features

@rclark
Contributor Author

rclark commented Dec 30, 2015

@perrygeo I think we should keep the basic, API-mirrored commands batch-update-features, put-feature and delete-feature. In my experience it is always a good idea to expose the most basic "building block" commands, and add higher-level ones on top.

Though I'm not sure that it is warranted in our CLI at this point, I'll just say that I'm fond of the aws cli's approach to S3 commands where...

aws s3api COMMAND

... contains the set of commands which are straight-up, single API calls, while

aws s3 COMMAND

... are higher-level functions that may involve multiple API calls.

@sgillies sgillies added this to the 0.3 milestone Jan 4, 2016
@rclark
Contributor Author

rclark commented Jan 4, 2016

Coming back to this after the break, and I think that if we can agree on some useful unixy command style, it makes sense to do that now, next to the functions that already exist as one-to-one API calls.

dataset uri scheme

In order to do things like rm and cp we need a uri scheme to identify datasets and features within datasets. There's some precedent for this in other parts of the stack, so I would propose a scheme like:

mapbox://datasets/:user/:datasetid/:featureid

proposed commands

  • mapbox datasets ls {uri}: List dataset ids or feature ids, depending on the depth of the uri.
  • mapbox datasets rm {uri}: Remove an individual feature from a dataset, or the entire dataset
  • mapbox datasets cp {source} {destination}: Here source and destination are either uris or file paths. Replace the destination with the content of the source, making all the appropriate API requests to completely overwrite a dataset if that's the intention. Allow - for the source if features are to be fed via stdin, while - as destination indicates output pushed to stdout. Supports output args sequence and rs like https://github.com/mapbox/cligj#example.
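For illustration, the proposed URIs could be split with a small helper like this (`parse_dataset_uri` is hypothetical, not code from this branch):

```python
def parse_dataset_uri(uri):
    """Split mapbox://datasets/:user/:datasetid[/:featureid] into a
    (user, datasetid, featureid) tuple; featureid is None for a
    dataset-level uri. Hypothetical sketch only."""
    prefix = 'mapbox://datasets/'
    if not uri.startswith(prefix):
        raise ValueError('not a dataset uri: %s' % uri)
    parts = uri[len(prefix):].split('/')
    if len(parts) == 2:
        return (parts[0], parts[1], None)
    if len(parts) == 3:
        return (parts[0], parts[1], parts[2])
    raise ValueError('not a dataset uri: %s' % uri)
```

The depth of the parsed result is what would drive ls (dataset vs. feature listing) and rm (whole dataset vs. single feature).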

What I'm not sure about is how to differentiate an "append to dataset" command in a unixy kind of way. Any good ideas? @sgillies does this align at all with what you were thinking about when you originally made this suggestion?

@sgillies
Contributor

sgillies commented Jan 5, 2016

@rclark I like it, though it feels like there's some repetition in mapbox datasets ls mapbox://datasets/sgillies/foo that we could pare away. Maybe just mapbox datasets ls /sgillies/foo?

OTOH, it's a lot like aws s3 ls s3://... usage and people seem completely fine with that.

Appending... maybe that's an -a mode for the cp command? Not quite like the standard file cp, though. I don't have a good idea for this right now.

@perrygeo
Contributor

perrygeo commented Jan 5, 2016

dataset urls
Could we support the full url as well as an abbreviated :datasetid/:featureid path? The username can be inferred through the access token and the prefixes are static.

IOW you could write mapbox://datasets/perrygeo/foo/bar1 or the equivalent foo/bar1?

commands
I think we should feel free to deviate from basic unix commands where it makes sense.

  • We might need a dataset creation/editing command, maybe mapbox datasets touch?
  • There is no obvious unix equivalent but we'd need a way to do a read-dataset, maybe mapbox datasets info foo?
  • the cp -a for append sounds good but might be really easy to forget the -a and wipe out all the features in your dataset, maybe mapbox datasets append to make it more obvious?
  • We can probably stick to create-tileset as is

In terms of the other commands, are these more unixy variants intended to replace the more API-centric commands entirely? Or live side-by-side?

@rclark
Contributor Author

rclark commented Jan 5, 2016

In terms of URI, I'm really mirroring aws-sdk here, where you have two ways (at least) to refer to an object:

  • by bucket & key --- in our case by username, datasetid, and featureid
  • by s3://bucket/key uri --- mapbox://datasets/username/datasetid/featureid

Following that model, aws s3api commands always require you to use one convention, while high-level aws s3 commands always require you to use the other:

$ aws s3api get-object --bucket my-bucket --key my/key
$ aws s3 cp s3://my-bucket/my/key - 

So I'm inclined to only support full mapbox://datasets/... uris for high-level functions, and continue with the individual arguments for API-mirroring functions. I'm also inclined to use all the static prefixing due to the precedent already set in the upload API, where uploads with a mapbox://datasets/... uri are interpreted to be dataset --> tileset generation requests. This is how mapbox datasets create-tileset works.

are these more unixy variants intended to replace the more API-centric commands entirely? Or live side-by-side?

100% side-by-side. No intention of removing the functions that are straight mirrors of individual API calls.

The username can be inferred through the access token and the prefixes are static.

Yes, but if somewhere down the road we were to allow for cross-account dataset reads, having built the username into the URI would allow updating the mapbox cli to be a non-breaking change.

Finally, for additional commands that don't require multiple API calls or any trickery, we still have the API mirror functions:

  • mapbox datasets create
  • mapbox datasets read-dataset DATASETID
  • mapbox datasets update-dataset DATASETID

@sgillies
Contributor

sgillies commented Jan 5, 2016

@rclark @perrygeo let's definitely aim for more of an aws s3 feel. I'm also feeling good about the mapbox://dataset identifiers. I can see an immediate application for them in Fiona, OGR, and other projects.

I'm concerned that our semantics aren't enough like the semantics of aws s3 cp to make the cp abstraction solid. I think I'd prefer put and append.

@rclark
Contributor Author

rclark commented Jan 6, 2016

Okay, I've sketched out one vision of the future here:

  • actually changed the command to mapbox datasets, plural
  • changed API mirror functions list and create to list-datasets and create-dataset
  • new list operation prints URIs for a user's datasets or features in a dataset
  • new put operation moves data between stdin/stdout, datasets and files, overwriting the destination
  • new append operation moves data between stdin/stdout, datasets and files, appending to the destination
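Since put and append may have to spread a large feature collection across several batch-update requests, the pagination could be sketched as follows (the batch size of 100 is an assumption here, not a documented API limit):

```python
def batched(features, size=100):
    """Chunk an iterable of features into lists of at most size items,
    one list per batch-update request. The default of 100 is an
    assumed, not documented, per-request limit."""
    batch = []
    for feature in features:
        batch.append(feature)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch
```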

Next up: Does this make sense to @perrygeo @sgillies? The logic is getting more complex, and I want to make sure that this is a path we're all comfortable with. Then, tests.

@rclark
Contributor Author

rclark commented Jan 6, 2016

... another simple addition that might help round out these URI-based higher-level routines:

mapbox datasets cat URI: just print a dataset's JSON or a single GeoJSON feature to stdout

@perrygeo
Contributor

@rclark - back to this after a long hiatus. Apologies for the silence on this front

Per voice with @sgillies yesterday, it seems like keeping the old commands alongside the higher-level commands could lead to much confusion. I spent some time going through the current set of mapbox datasets sub-commands and started to sketch out where there was overlap and how we could unify/clarify some of the functionality.

The proposed idea below would not be too much work: mostly removing/renaming with a few minor tweaks to functionality, plus two new commands (metadata and rm), which are largely just adjusting existing commands to the URI way of specifying datasets.

The overarching goals that I'm aiming for with this proposed interface:

  • Use URIs everywhere, no more dataset ids, no more need to differentiate -dataset vs -feature as that can be inferred.
  • Be as unix friendly and consistent with other tools as possible
  • Have one very obvious way of doing any particular dataset task
| current | notes | proposal |
| --- | --- | --- |
| append | as is | append |
| batch-update-features | remove in favor of append, put and rm | |
| cat | redundant; same as `put dataset -` | |
| create-dataset | simplify name | create |
| create-tileset | renamed to clarify that it converts a dataset to a tileset | export-tileset |
| delete-dataset | wordy; dataset can be inferred from uri | rm |
| delete-feature | wordy; feature can be inferred; need to accept multiple uris | rm |
| list | as is | list |
| list-datasets | redundant; if details are needed use list then metadata | |
| list-features | redundant; functionality covered by `put dataset -` | |
| put | as is | put |
| put-feature | redundant | |
| read-dataset | clarify that this is reading information about the dataset | metadata |
| read-feature | redundant; use `put feature -` | |
| update-dataset | clarify that this is updating metadata | metadata --update-x |

@rclark, @sgillies How does this look to you?

I can start sketching out an implementation if we agree that this is a good path forward. I'm not married to this exact form so if you have any other ideas on how to clean up the command list, I'm flexible.

@willwhite

/cc @lyzidiamond

We should make sure that the commands make sense in the context of the cli examples we are planning to add to our api-documentation. This is where it might make sense to retain a baseline low-level interface so there is a consistent api surface across the CLI and all SDKs.

@perrygeo
Contributor

Good point re: keeping the interfaces 1:1 with the API

Spending the morning working with the new functions introduced in the PR (put, list and append), I can say two things with confidence:

  • The new commands provide a significantly improved interface, are much easier to work with and replace the functionality of many of the low-level commands
  • Having the new commands side-by-side with the low-level commands was very confusing - 15 commands is a lot, but 15 commands with overlapping functionality and subtly different interfaces is a new user's nightmare

So I see the value in the low level commands (consistency) and I see the value in the high-level commands (better UI) but I just don't see how they can live side-by-side in the same command without being painfully confusing.

Options?

  1. As @rclark mentioned, we could follow the lead of aws s3/aws s3api to provide separate subcommands for the two approaches
  2. The high level commands could be a separate project.

@rclark
Contributor Author

rclark commented Jan 22, 2016

👍 mapbox datasets and mapbox datasetsapi

Seems to me to be the path of least resistance. The uri-based, higher-level commands are generally easier to use and should be more straightforward to type.

@rclark
Contributor Author

rclark commented Mar 4, 2016

I rebased this branch to catch up with master, then split low/high-level functions between mapbox datasetsapi and mapbox datasets, respectively. Remaining tasks I'm seeing are:

  • settle on uri syntax
  • understand why travis fails on Python 3 and not 2?
  • write tests for mapbox datasets functions

@perrygeo
Contributor

perrygeo commented Mar 4, 2016

@rclark The python 3 failures are due to some backwards-incompatible refactoring of the standard library (see http://python3porting.com/stdlib.html#urllib-urllib2-and-urlparse).

A try/except block is the way to support both python 2 and 3 in the same codebase. A bit clunky but we're stuck with python 2 for a while.
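The shim in question is the usual one for the urlparse move:

```python
# Python 3 moved urlparse into urllib.parse; try the new location
# first and fall back to the Python 2 module name.
try:
    from urllib.parse import urlparse
except ImportError:
    from urlparse import urlparse
```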

@perrygeo
Contributor

perrygeo commented Mar 4, 2016

Speaking of backwards compatibility, if we're going to s/datasets/datasetsapi we'll need to coordinate an update of the examples in this repo and the api-documentation. And maybe do a deprecation warning on old mapbox datasets cmds on the next release?

Alternatively, we could keep the low-level datasets as is and come up with a new name for the high-level commands. Not only would that keep backwards compat but it would be consistent with the other low-level commands which are named according to their API doc page (no api postfix).

I'm fine with it either way - I hate to instigate more bikeshedding on names but it seems worth a few minutes of consideration, to make sure everyone's on the same page.

@rclark
Contributor Author

rclark commented Mar 4, 2016

Thanks for the fix and lint 👍

I'm game to coordinate the doc/example updates. For my part, I'm not very worried about breaking backwards compatibility for an API that's not even in beta yet.

@sgillies
Contributor

@rclark @perrygeo I like everything about this except the mapbox://datasets URIs. Here are my concerns about using them in this CLI:

  1. The mapbox:// URI scheme is poorly designed. It doesn't parse properly with Node or Python URL parsing libs and we haven't documented it. It's a cultural, not architectural, thing and that's a red flag for me.
  2. The mapbox://datasets segment of a URI is redundant because all that information is already baked into the CLI commands. You can't do anything with mapbox datasets except operate on Mapbox Datasets, right?

That said, we definitely do have a technical need for a datasets URI scheme. It's just that mapbox:// has big flaws. I think we shouldn't promote it any more than we already do. I'm pretty sure I've persuaded the Studio team of this.

A solution for the second (not addressing the first) would be to rename these higher level commands. cmd ls mapbox://datasets/foo/bar feels much better to me.

Ultimately, what I'd like to convince you of is that cmd cp mapbox-datasets:foo/bar mapbox-datasets:foo/bar2 is the way to go. The redundancy would be gone and we would have a URI scheme that distinguishes Mapbox datasets in the ☁️ from local files, URIs that parse correctly and that we could be confident about registering and supporting for a long time.
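To make the parsing point concrete, here is what Python 3's urlparse does with each form (illustrative only):

```python
from urllib.parse import urlparse  # Python 3 location

# With mapbox://, the "datasets" namespace lands in the netloc slot,
# where a generic parser expects a host, and the path starts at the user.
r = urlparse('mapbox://datasets/sgillies/foo')
assert (r.scheme, r.netloc, r.path) == ('mapbox', 'datasets', '/sgillies/foo')

# With a mapbox-datasets: scheme, the namespace is the scheme itself and
# the path carries only user/dataset -- no redundant segment to misparse.
r = urlparse('mapbox-datasets:sgillies/foo')
assert (r.scheme, r.netloc, r.path) == ('mapbox-datasets', '', 'sgillies/foo')
```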

@coveralls

coveralls commented Aug 18, 2016

Coverage Status

Coverage decreased (-1.3%) to 93.554% when pulling eabaa67 on higher-ground into 19712b1 on master.

@coveralls

coveralls commented Aug 18, 2016

Coverage Status

Coverage decreased (-0.5%) to 94.346% when pulling 9595666 on higher-ground into 19712b1 on master.


@rclark
Contributor Author

rclark commented Aug 18, 2016

Back here again, visiting my black sheep of a PR.

I've rebased this branch over master, written another test, and removed some dead code from private functions in datasets.py. I think that this is ready to go aside from:

  • passing tests on travis on python 3... I'm not sure here, I am passing locally with v2 and v3
  • one last round of review/suggestions from the ever-patient @perrygeo and @sgillies


```python
assert result.exit_code == 0
assert result.output.strip() == datasets.strip()
body = json.loads(responses.calls[0].request.body)
```

@rclark here's the root of the Python 3 failure: .body is type bytes. We want .text, which is type str. I think these assertions on the mock response objects can be eliminated anyway; this stuff is already tested by Responses. All we need to assert is result.exit_code and result.output.
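If the assertions are kept, one version-agnostic way to write that line is to decode the body before parsing (`request_json` is a hypothetical helper, not code from this branch):

```python
import json

def request_json(request):
    """Parse a recorded request body as JSON whether it arrives as
    bytes (Python 3) or str (Python 2). Hypothetical helper."""
    body = request.body
    if isinstance(body, bytes):
        body = body.decode('utf-8')
    return json.loads(body)
```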

@sgillies
Contributor

sgillies commented Sep 7, 2016

@rclark I found the Python 3 problem and suggested a bunch of assert deletions of one kind: we don't need to test the mock responses (except maybe while debugging), just our own API and CLI.

Want to make the Python 3 fix? I'll merge after that and then we'll revise the CLI from there. create-tileset strikes me as a command that should be in a Tilesets API/CLI, not in the Datasets API/CLI.

@perrygeo
Contributor

This PR has gotten stale. A lot of good ideas here but nothing moving. Feel free to re-open if there is a need.

@perrygeo perrygeo closed this May 31, 2018