
[wip] high-level dataset api methods #39

Closed
wants to merge 20 commits into from

Conversation

rclark
Contributor

@rclark rclark commented Dec 21, 2015

Begins work adding higher-level functionality to dataset api support, including:

  • automatic addition of IDs to features without them
  • cligj support for streaming feature collections or line-delimited features on stdin
  • creation of a new dataset, or wholesale replacement of an existing one based on a local geojson file, paginating through requests as needed.
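The first bullet could be sketched roughly like this, assuming a generated hex UUID is an acceptable string id (`ensure_ids` is a hypothetical helper, not necessarily how this branch implements it):

```python
import uuid

def ensure_ids(features):
    """Yield features, assigning a generated string id to any feature
    that doesn't already carry one (hypothetical helper)."""
    for feature in features:
        if feature.get('id') is None:
            feature['id'] = uuid.uuid4().hex
        yield feature
```

Features streamed from stdin via cligj could be piped through something like this before being batched into API requests.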

I started by writing a put-features function and a put-dataset function. These are conceptually pretty simple, but I really like @sgillies' suggestion to make more "unixy" commands like ls, cp, etc. Some thoughts:

ls: list features in a dataset, paginates for you unless you supply limit and/or start arguments
rm: delete a dataset (indicated by dataset id), or delete a feature (indicated by datasetid.featureid).
cp: copy features from one dataset to another, from a file to a dataset, or from a dataset to a file.

Other ideas? These kinds of commands start to beg for "query" parameters that would let you cp a subset of features, but I'm not sure we want to go there....

cc @mick @willwhite @perrygeo

@mick

mick commented Dec 21, 2015

automatic addition of IDs to features without them

why have the cli assign feature ids and not let the api that already does that handle it?

@perrygeo
Contributor

@rclark This is sweet. I'm taking a more in-depth look this morning but some initial thoughts:

automatic addition of IDs to features without them

@mick Last I checked, when I attempted to put features without ids (or features with non-string ids), they were rejected by the datasets api. I think autogenerating ids (and accepting integer ids) could ultimately be handled at the API level.

In the meantime, I'm torn over whether handling it in the CLI, the SDK or an external tool would be more appropriate. I'd lean towards just making the change in the API but don't have a sense of how much of a lift that would be.

unixy commands

I like the idea but I think it could wait for a future release, maybe living beside but not replacing the existing functions.

These kinds of commands start to beg for "query" parameters that would let you cp a subset of features, but I'm not sure we want to go there....

Definitely. Longer term, attribute indexing/filtering at the API level might be the best way to achieve this. Not sure if anything like that is currently planned?

In the meantime, something like fio filter or jq could be used to query from the full collection. IOW, I don't think it's something we need to bake into the SDK/CLI.

```python
service = ctx.obj.get('service')
to_put = []

def id():
```

id is a python builtin - while python allows you to override builtins, we should probably just rename to fid to avoid conflicts.
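For illustration, a module-level counter under the suggested name (`fid` here is just a stand-in; the real helper may generate ids differently):

```python
import itertools

_counter = itertools.count(1)

def fid():
    """Return a fresh string feature id. Named fid rather than id so
    the builtin id() remains usable in this scope."""
    return str(next(_counter))
```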

@perrygeo
Contributor

@rclark What about

  • adding a delete-features command
  • removing the batch-update-features command, and the singular put-feature and delete-feature (the multi-feature versions should handle a single feature as well)

```python
def put_features(ctx, dataset, features):
    """Insert or update features in a dataset.

        $ mapbox dataset put-features features
```

mapbox dataset put-features DATASET features

@rclark
Contributor Author

rclark commented Dec 30, 2015

@perrygeo I think we should keep the basic, API-mirrored commands batch-update-features, put-feature and delete-feature. In my experience it is always a good idea to expose the most basic "building block" commands, and add higher-level ones on top.

Though I'm not sure that it is warranted in our CLI at this point, I'll just say that I'm fond of the aws cli's approach to S3 commands where...

aws s3api COMMAND

... contains the set of commands which are straight-up, single API calls, while

aws s3 COMMAND

... are higher-level functions that may involve multiple API calls.

@sgillies sgillies added this to the 0.3 milestone Jan 4, 2016
@rclark
Contributor Author

rclark commented Jan 4, 2016

Coming back to this after the break, and I think that if we can agree on some useful unixy command style, it makes sense to do that now, next to the functions that already exist as one-to-one API calls.

dataset uri scheme

In order to do things like rm and cp we need a uri scheme to identify datasets and features within datasets. There's some precedent for this in other parts of the stack, so I would propose a scheme like:

mapbox://datasets/:user/:datasetid/:featureid

proposed commands

  • mapbox datasets ls {uri}: List dataset ids or feature ids, depending on the depth of the uri.
  • mapbox datasets rm {uri}: Remove an individual feature from a dataset, or the entire dataset
  • mapbox datasets cp {source} {destination}: Here source and destination are either uris or file paths. Replace the destination with the content of the source, making all the appropriate API requests to completely overwrite a dataset if that's the intention. Allow - for the source if features are to be fed via stdin, while - as destination indicates output pushed to stdout. Supports output args sequence and rs like https://github.com/mapbox/cligj#example.
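For illustration, the proposed URIs could be split with a small helper like this (`parse_dataset_uri` is hypothetical, not code from this branch):

```python
def parse_dataset_uri(uri):
    """Split mapbox://datasets/:user/:datasetid[/:featureid] into a
    (user, datasetid, featureid) tuple; featureid is None for a
    dataset-level uri. Hypothetical sketch only."""
    prefix = 'mapbox://datasets/'
    if not uri.startswith(prefix):
        raise ValueError('not a dataset uri: %s' % uri)
    parts = uri[len(prefix):].split('/')
    if len(parts) == 2:
        return (parts[0], parts[1], None)
    if len(parts) == 3:
        return (parts[0], parts[1], parts[2])
    raise ValueError('not a dataset uri: %s' % uri)
```

The depth of the parsed result is what would drive ls (dataset vs. feature listing) and rm (whole dataset vs. single feature).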

What I'm not sure about is how to differentiate an "append to dataset" command in a unixy kind of way. Any good ideas? @sgillies does this align at all with what you were thinking about when you originally made this suggestion?

@sgillies
Contributor

sgillies commented Jan 5, 2016

@rclark I like it, though it feels like there's some repetition in mapbox datasets ls mapbox://datasets/sgillies/foo that we could pare away. Maybe just mapbox datasets ls /sgillies/foo?

OTOH, it's a lot like aws s3 ls s3://... usage and people seem completely fine with that.

Appending... maybe that's an -a mode for the cp command? Not quite like the standard file cp, though. I don't have a good idea for this right now.

@perrygeo
Contributor

perrygeo commented Jan 5, 2016

dataset urls
Could we support the full url as well as an abbreviated :datasetid/:featureid path? The username can be inferred through the access token and the prefixes are static.

IOW you could write mapbox://datasets/perrygeo/foo/bar1 or the equivalent foo/bar1?

commands
I think we should feel free to deviate from basic unix commands where it makes sense.

  • We might need a dataset creation/editing command, maybe mapbox datasets touch?
  • There is no obvious unix equivalent but we'd need a way to do a read-dataset, maybe mapbox datasets info foo?
  • the cp -a for append sounds good but might be really easy to forget the -a and wipe out all the features in your dataset, maybe mapbox datasets append to make it more obvious?
  • We can probably stick to create-tileset as is

In terms of the other commands, are these more unixy variants intended to replace the more API-centric commands entirely? Or live side-by-side?

@rclark
Contributor Author

rclark commented Jan 5, 2016

In terms of URI, I'm really mirroring aws-sdk here, where you have two ways (at least) to refer to an object:

  • by bucket & key --- in our case by username, datasetid, and featureid
  • by s3://bucket/key uri --- mapbox://datasets/username/datasetid/featureid

Following that model, aws s3api commands always require you to use one convention, while high-level aws s3 commands always require you to use the other:

$ aws s3api get-object --bucket my-bucket --key my/key
$ aws s3 cp s3://my-bucket/my/key - 

So I'm inclined to only support full mapbox://datasets/... uris for high-level functions, and continue with the individual arguments for API-mirroring functions. I'm also inclined to use all the static prefixing due to the precedent already set in the upload API, where uploads with a mapbox://datasets/... uri are interpreted to be dataset --> tileset generation requests. This is how mapbox datasets create-tileset works.

are these more unixy variants intended to replace the more API-centric commands entirely? Or live side-by-side?

100% side-by-side. No intention of removing the functions that are straight mirrors of individual API calls.

The username can be inferred through the access token and the prefixes are static.

Yes, but if somewhere down the road we were to allow for cross-account dataset reads, having built the username into the URI would allow updating the mapbox cli to be a non-breaking change.

Finally, for additional commands that don't require multiple API calls or any trickery, we still have the API mirror functions:

  • mapbox datasets create
  • mapbox datasets read-dataset DATASETID
  • mapbox datasets update-dataset DATASETID

@sgillies
Contributor

sgillies commented Jan 5, 2016

@rclark @perrygeo let's definitely aim for more of an aws s3 feel. I'm also feeling good about the mapbox://dataset identifiers. I can see an immediate application for them in Fiona, OGR, and other projects.

I'm concerned that our semantics aren't enough like the semantics of aws s3 cp to make the cp abstraction solid. I think I'd prefer put and append.

@rclark
Contributor Author

rclark commented Jan 6, 2016

Okay, I've sketched out one vision of the future here:

  • actually changed the command to mapbox datasets, plural
  • changed API mirror functions list and create to list-datasets and create-dataset
  • new list operation prints URIs for a user's datasets or features in a dataset
  • new put operation moves data between stdin/stdout, datasets and files, overwriting the destination
  • new append operation moves data between stdin/stdout, datasets and files, appending to the destination
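Since put and append may have to spread a large feature collection across several batch-update requests, the pagination could be sketched as follows (the batch size of 100 is an assumption here, not a documented API limit):

```python
def batched(features, size=100):
    """Chunk an iterable of features into lists of at most size items,
    one list per batch-update request. The default of 100 is an
    assumed, not documented, per-request limit."""
    batch = []
    for feature in features:
        batch.append(feature)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch
```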

Next up: Does this make sense to @perrygeo @sgillies? The logic is getting more complex, and I want to make sure that this is a path we're all comfortable with. Then, tests.

@rclark
Contributor Author

rclark commented Jan 6, 2016

... another simple addition that might help round out these URI-based higher-level routines:

mapbox datasets cat URI: just print a dataset's JSON or a single GeoJSON feature to stdout

@perrygeo
Contributor

@rclark - back to this after a long hiatus. Apologies for the silence on this front

Per voice with @sgillies yesterday, it seems like keeping the old commands alongside the higher-level commands could lead to much confusion. I spent some time going through the current set of mapbox datasets sub-commands and started to sketch out where there was overlap and how we could unify/clarify some of the functionality.

The proposed idea below would not be too much work: mostly removing/renaming with a few minor tweaks to functionality, plus two new commands (metadata and rm), which are largely just adjusting existing commands to the URI way of specifying datasets.

The overarching goals that I'm aiming for with this proposed interface:

  • Use URIs everywhere, no more dataset ids, no more need to differentiate -dataset vs -feature as that can be inferred.
  • Be as unix friendly and consistent with other tools as possible
  • Have one very obvious way of doing any particular dataset task
| current | notes | proposal |
| --- | --- | --- |
| append | as is | append |
| batch-update-features | remove in favor of append, put and rm | |
| cat | redundant; same as `put dataset -` | |
| create-dataset | simplify name | create |
| create-tileset | renamed to clarify that it converts a dataset to a tileset | export-tileset |
| delete-dataset | wordy; dataset can be inferred from uri | rm |
| delete-feature | wordy; feature can be inferred; need to accept multiple uris | rm |
| list | as is | list |
| list-datasets | redundant; if details are needed use list then metadata | |
| list-features | redundant; functionality covered by `put dataset -` | |
| put | as is | put |
| put-feature | redundant | |
| read-dataset | clarify that this is reading information about the dataset | metadata |
| read-feature | redundant; use `put feature -` | |
| update-dataset | clarify that this is updating metadata | metadata --update-x |

@rclark, @sgillies How does this look to you?

I can start sketching out an implementation if we agree that this is a good path forward. I'm not married to this exact form so if you have any other ideas on how to clean up the command list, I'm flexible.

@willwhite

/cc @lyzidiamond

We should make sure that the commands make sense in the context of the cli examples we are planning to add to our api-documentation. This is where it might make sense to retain a baseline low-level interface so there is a consistent api surface across the CLI and all SDKs.

@perrygeo
Contributor

Good point re: keeping the interfaces 1:1 with the API

Spending the morning working with the new functions introduced in the PR (put, list and append), I can say two things with confidence:

  • The new commands provide a significantly improved interface, are much easier to work with and replace the functionality of many of the low-level commands
  • Having the new commands side-by-side with the low-level commands was very confusing - 15 commands is a lot, but 15 commands with overlapping functionality and subtly different interfaces is a new user's nightmare

So I see the value in the low level commands (consistency) and I see the value in the high-level commands (better UI) but I just don't see how they can live side-by-side in the same command without being painfully confusing.

Options?

  1. As @rclark mentioned, we could follow the lead of aws s3/aws s3api to provide separate subcommands for the two approaches
  2. The high level commands could be a separate project.

@rclark
Contributor Author

rclark commented Jan 22, 2016

👍 mapbox datasets and mapbox datasetsapi

Seems to me to be the path of least resistance. The uri-based, higher-level commands are generally easier to use and should be more straightforward to type.

@rclark
Contributor Author

rclark commented Mar 4, 2016

I rebased this branch to catch up with master, then split low/high-level functions between mapbox datasetsapi and mapbox datasets, respectively. Remaining tasks I'm seeing are:

  • settle on uri syntax
  • understand why travis fails on Python 3 and not 2?
  • write tests for mapbox datasets functions

@perrygeo
Contributor

perrygeo commented Mar 4, 2016

@rclark The python 3 failures are due to some backwards-incompatible refactoring of the standard library (see http://python3porting.com/stdlib.html#urllib-urllib2-and-urlparse).

A try/except block is the way to support both python 2 and 3 in the same codebase. A bit clunky but we're stuck with python 2 for a while.
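The shim in question is the usual one for the urlparse move:

```python
# Python 3 moved urlparse into urllib.parse; try the new location
# first and fall back to the Python 2 module name.
try:
    from urllib.parse import urlparse
except ImportError:
    from urlparse import urlparse
```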

@perrygeo
Contributor

perrygeo commented Mar 4, 2016

Speaking of backwards compatibility, if we're going to s/datasets/datasetsapi we'll need to coordinate an update of the examples in this repo and the api-documentation. And maybe do a deprecation warning on old mapbox datasets cmds on the next release?

Alternatively, we could keep the low-level datasets as is and come up with a new name for the high-level commands. Not only would that keep backwards compat but it would be consistent with the other low-level commands which are named according to their API doc page (no api postfix).

I'm fine with it either way - I hate to instigate more bikeshedding on names but it seems worth a few minutes of consideration, to make sure everyone's on the same page.

@rclark
Contributor Author

rclark commented Mar 4, 2016

Thanks for the fix and lint 👍

I'm game to coordinate the doc/example updates. For my part, I'm not very worried about breaking backwards compatibility for an API that's not even in beta yet.

@sgillies
Contributor

@rclark @perrygeo I like everything about this except the mapbox://datasets URIs. Here are my concerns about using them in this CLI:

  1. The mapbox:// URI scheme is poorly designed. It doesn't parse properly with Node or Python URL parsing libs and we haven't documented it. It's a cultural, not architectural, thing and that's a red flag for me.
  2. The mapbox://datasets segment of a URI is redundant because all that information is already baked into the CLI commands. You can't do anything with mapbox datasets except operate on Mapbox Datasets, right?

That said, we definitely do have a technical need for a datasets URI scheme. It's just that mapbox:// has big flaws. I think we shouldn't promote it any more than we already do. I'm pretty sure I've persuaded the Studio team of this.

A solution for the second (not addressing the first) would be to rename these higher level commands. cmd ls mapbox://datasets/foo/bar feels much better to me.

Ultimately, what I'd like to convince you of is that cmd cp mapbox-datasets:foo/bar mapbox-datasets:foo/bar2 is the way to go. The redundancy would be gone and we would have a URI scheme that distinguishes Mapbox datasets in the ☁️ from local files, URIs that parse correctly and that we could be confident about registering and supporting for a long time.
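To make the parsing point concrete, here is what Python 3's urlparse does with each form (illustrative only):

```python
from urllib.parse import urlparse  # Python 3 location

# With mapbox://, the "datasets" namespace lands in the netloc slot,
# where a generic parser expects a host, and the path starts at the user.
r = urlparse('mapbox://datasets/sgillies/foo')
assert (r.scheme, r.netloc, r.path) == ('mapbox', 'datasets', '/sgillies/foo')

# With a mapbox-datasets: scheme, the namespace is the scheme itself and
# the path carries only user/dataset -- no redundant segment to misparse.
r = urlparse('mapbox-datasets:sgillies/foo')
assert (r.scheme, r.netloc, r.path) == ('mapbox-datasets', '', 'sgillies/foo')
```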

@coveralls

coveralls commented Aug 18, 2016

Coverage Status

Coverage decreased (-1.3%) to 93.554% when pulling eabaa67 on higher-ground into 19712b1 on master.

@coveralls

coveralls commented Aug 18, 2016

Coverage Status

Coverage decreased (-0.5%) to 94.346% when pulling 9595666 on higher-ground into 19712b1 on master.


@rclark
Contributor Author

rclark commented Aug 18, 2016

Back here again, visiting my black sheep of a PR.

I've rebased this branch over master, written another test, and removed some dead code from private functions in datasets.py. I think that this is ready to go aside from:

  • passing tests on travis on python 3... I'm not sure here, I am passing locally with v2 and v3
  • one last round of review/suggestions from the ever-patient @perrygeo and @sgillies


```python
assert result.exit_code == 0
assert result.output.strip() == datasets.strip()
body = json.loads(responses.calls[0].request.body)
```

@rclark here's the root of the Python 3 failure: .body is type bytes. We want .text, which is type str. I think these assertions on the mock response objects can be eliminated anyway; this stuff is already tested by Responses. All we need to assert is result.exit_code and result.output.
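If the assertions are kept, one version-agnostic way to write that line is to decode the body before parsing (`request_json` is a hypothetical helper, not code from this branch):

```python
import json

def request_json(request):
    """Parse a recorded request body as JSON whether it arrives as
    bytes (Python 3) or str (Python 2). Hypothetical helper."""
    body = request.body
    if isinstance(body, bytes):
        body = body.decode('utf-8')
    return json.loads(body)
```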

@sgillies
Contributor

sgillies commented Sep 7, 2016

@rclark I found the Python 3 problem and suggested a bunch of assert deletions of one kind: we don't need to test the mock responses (except maybe while debugging), just our own API and CLI.

Want to make the Python 3 fix? I'll merge after that and then we'll revise the CLI from there. create-tileset strikes me as a command that should be in a Tilesets API/CLI, not in the Datasets API/CLI.

@perrygeo
Contributor

This PR has gotten stale. A lot of good ideas here but nothing moving. Feel free to re-open if there is a need.

@perrygeo perrygeo closed this May 31, 2018