
CDK-385: Add dataset URIs #87

Closed
wants to merge 15 commits

Conversation

@rdblue (Contributor) commented May 28, 2014

No description provided.

rdblue added 10 commits May 19, 2014 23:03
This updates URIPattern so that it can match after glob patterns. For
example, file:/*path/:dataset-name will now match "dataset-name" before
matching the path glob. This makes it possible to define URI patterns
that include a globbed path component that ends in a dataset name.

This also adds sample use, which needs to be updated.
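The glob-then-name matching the commit describes can be sketched roughly as follows. This is an illustrative stand-in, not Kite's actual URIPattern code; the class and method names are invented. For a pattern like `file:/*path/:dataset-name`, the trailing named component is bound first, then the glob captures the remaining path.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative stand-in for glob-then-name matching, NOT Kite's URIPattern
// source; class and method names are invented for this sketch.
public class GlobNameMatch {

    // Matches a path against "*path/:dataset-name" semantics.
    static Map<String, String> match(String path) {
        int lastSlash = path.lastIndexOf('/');
        if (lastSlash < 0) {
            return null; // no room for both a glob path and a name
        }
        Map<String, String> result = new HashMap<>();
        // bind the trailing named component before matching the glob
        result.put("dataset-name", path.substring(lastSlash + 1));
        result.put("path", path.substring(0, lastSlash));
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> m = match("/data/2014/05/users");
        System.out.println(m.get("path"));         // /data/2014/05
        System.out.println(m.get("dataset-name")); // users
    }
}
```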
Because query args are now supported in opaque URIs, matching the
repo:*storage-uri pattern was removing query args as part of that match.
It isn't really necessary to use URIPattern to match just the scheme, so
this commit matches the scheme and passes the raw scheme-specific part
to Registration as the dataset or repo URI.

Changes to path handling broke the Hive URI tests because "" can be
passed as a path. This also exposed that file: and hdfs: URIs were
incorrectly substituting the current directory for "/" paths, which are
passed as path "".
This moves the DatasetBuilder from the FS Loader into Registration so it
can be reused to wrap any repository builder. This also consolidates
registerRepoURI and registerDatasetURI into one register method that
uses the generic dataset builder. The generic builder will first get a
repository from the parsed URI options, then load a Dataset.
There is enough context from the methods used that there is no need to
add schemes to the start of URIs. This removes repo: and dataset:. Only
repository URIs provide backward compatibility.
@ben-roling

I haven't thoroughly examined all of this, but the change in b96ba36 to remove URI prefixes caught my eye. I think this will be problematic for the Oozie integration, because that integration works by implementing a new URIHandler that is linked to a given URI scheme. Without the prefixes, there is effectively no Kite URI scheme to tie a Kite URIHandler to.

@rdblue (Contributor, Author) commented Jun 2, 2014

@ben-roling thanks for noticing that. I'll add the "dataset" prefix back as optional. I just don't want to force users to use it all the time when it is obvious, as in Datasets.load("dataset:...").

@tomwhite (Contributor) commented Jun 3, 2014

I think the dataset: (and repo:) prefix should be required. Otherwise there's confusion about when you can drop the prefix (e.g. you can't when you are using Oozie), and also about whether it's a Kite URI at all.

We can relax this in the future, but if we make it optional now, we can't make it required in the future.

@tomwhite (Contributor) commented Jun 3, 2014

The documentation needs to be updated as part of this change too.

@rdblue (Contributor, Author) commented Jun 4, 2014

Tom, I think you're right. It makes sense in the context of a call to Datasets.load, but out of context the URIs are confusing. The namespace proposal would also make the actual storage deviate from the apparent storage URI: hdfs:/datasets/ns/ds vs. hdfs:/datasets/ns.ds. In the latter case it's clear that the URI does not match, and that Kite URIs don't necessarily carry meaning in other contexts.

@ben-roling

I'm wondering slightly about the way you are encoding view equality constraints as query params. Generally I like the way it looks, but I think sometimes you will have query parameters from the repository URI that have to carry forward into the dataset URI so that it remains fully descriptive. In that case, the query parameter list becomes a mix of repository parameters and view-constraint parameters with no clear separation between the two. For example:

dataset:hive:/datasets/ds?hdfs-host=myhost&hdfs-port=9999&version=10

hdfs-host and hdfs-port are repository parameters and version is a view-constraint parameter. When I was originally thinking about the URIs, I imagined a single query param containing all the view constraints, so that there is a clear separation between the view constraints and any other query parameters you might need for other reasons.

This idea yields URIs that are a bit uglier and more verbose even in the general cases where there are no query parameters contributed by the repository:

view:hive:/datasets/ds?constraints=version:10,lastName:Smith
vs
view:hive:/datasets/ds?version=10&lastName=Smith

Note: the format of the constraints query parameter above is just meant to demonstrate the concept rather than the specific encoding. I'm not sure if it is ok to use a colon as a separator there. I don't have enough expertise in URI design to know the best way to build something like that into the URI and I haven't spent the time to research it.

Anyway, I just wanted to mention the issue. You may feel that collisions between schema field names and repo query parameters would be so rare that the prettier URIs are worth the risk. Or perhaps I am missing something and somehow this isn't really an issue in the first place?
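The mixing described above is visible with plain java.net.URI parsing. This sketch (invented class name, not Kite's parsing code) flattens the query args of such a URI into a map, which shows that nothing in the URI itself distinguishes repository parameters from view constraints:

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the collision risk: repository options and view constraints
// share one flat query string. Uses only java.net.URI; not Kite code.
public class QueryMix {

    static Map<String, String> queryArgs(String uriString) {
        try {
            URI outer = new URI(uriString);                     // scheme "dataset"
            URI inner = new URI(outer.getSchemeSpecificPart()); // "hive:/datasets/ds?..."
            Map<String, String> args = new LinkedHashMap<>();
            for (String pair : inner.getQuery().split("&")) {
                String[] kv = pair.split("=", 2);
                args.put(kv[0], kv.length > 1 ? kv[1] : "");
            }
            return args;
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> q =
            queryArgs("dataset:hive:/datasets/ds?hdfs-host=myhost&hdfs-port=9999&version=10");
        // "hdfs-host" and "hdfs-port" configure the repository, "version" is
        // a view constraint, but the map gives no way to tell them apart.
        System.out.println(q); // {hdfs-host=myhost, hdfs-port=9999, version=10}
    }
}
```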

@rdblue (Contributor, Author) commented Jun 6, 2014

Ben, you're right that the query part is used for both. I don't see it as much of an issue, as you noted, because I expect the cluster information to be set in the environment config. Allowing it in the URI is more of a safety valve.

I think the shorter URIs also match the common expectation for how these URIs should work. Extra key-value pairs act as selection criteria, rather than requiring users to know that "constraints" is a keyword whose value is split by commas and colons.

@ben-roling

Here is another case for your consideration:

Datasets.view("view:hdfs:/tmp/data/users?username=test")

The username part of this URI gets interpreted both as a filter on the username attribute of the entity schema and as the userInfo component of the HDFS repo URI.

I actually just sort of accidentally ran into this when writing a new unit test in TestCrunchDatasets to test the feature I was adding for specifying the Crunch target as a view URI. The existing tests in this class use a "User" entity with "username" and "email" attributes. I was trying to write a test with a target view URI similar to the above. The test was failing because the HDFS location URIs were inconsistent between the view and the original dataset.

The stack trace was this:

java.lang.IllegalArgumentException: Wrong FS: hdfs://test-0@localhost:61357/var/folders/9k/wlyxzx3d4mb2kx3jkdk9t9t56bkgvq/T/1402077113218-0/out_job_local1436488949_0001/username_hash=1/f48a4518-3121-4bca-9654-6138625e99f2.avro, expected: hdfs://localhost:61357
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:625)
at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:173)
at org.apache.hadoop.hdfs.DistributedFileSystem.rename(DistributedFileSystem.java:339)
at org.kitesdk.data.spi.filesystem.FileSystemDataset.merge(FileSystemDataset.java:306)
at org.kitesdk.data.spi.filesystem.FileSystemDataset.merge(FileSystemDataset.java:1)
at org.kitesdk.data.mapreduce.DatasetKeyOutputFormat$MergeOutputCommitter.commitJob(DatasetKeyOutputFormat.java:142)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:456)

It would probably not be particularly rare for a consumer to have an Entity with "username" as an attribute. When they do, they may have trouble formulating a View with an equality constraint on that attribute.

So that I could move forward I just changed my test to use a view that filters on "email" instead but I thought I should mention it as it relates to this conversation.

I started this thread before I actually ran into this, which makes it sort of funny that I didn't immediately realize what was happening when the test failed in the first place.
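The clash Ben hit can be sketched with plain java.net.URI. This assumes, for illustration only, that the pattern exposes the authority's user-info under the key "username" (an invented key, not necessarily Kite's exact one); a query arg with the same name then silently overwrites the authority-derived binding:

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashMap;
import java.util.Map;

// Sketch of the username clash. ASSUMPTION: the authority's user-info is
// exposed under the key "username"; this is illustrative, not Kite code.
public class UserInfoClash {

    static Map<String, String> matchData(String uriString) {
        try {
            URI uri = new URI(uriString);
            Map<String, String> match = new HashMap<>();
            if (uri.getUserInfo() != null) {
                match.put("username", uri.getUserInfo()); // authority-derived key
            }
            if (uri.getQuery() != null) {
                for (String pair : uri.getQuery().split("&")) {
                    String[] kv = pair.split("=", 2);
                    match.put(kv[0], kv.length > 1 ? kv[1] : ""); // may collide
                }
            }
            return match;
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // the query arg "username=test" overwrites the user-info "hdfs-user",
        // so one key serves double duty and the intent is ambiguous
        Map<String, String> m =
            matchData("hdfs://hdfs-user@localhost:61357/tmp/data/users?username=test");
        System.out.println(m.get("username")); // test
    }
}
```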

@tomwhite (Contributor)

We need to avoid clashes, or at least have a mechanism for working around them.

The only place where query parameters are used is for Hive external tables. Perhaps we can change it so the dataset path has to be specified in the dataset descriptor at creation time for external tables. Readers wouldn't have to know whether the dataset is backed by an external table or a managed one.

@ben-roling

Another thing you might want to think about is the possibility that you will want to introduce new uses of query params in repository URIs in the future. You may be constrained from doing so by the potential collisions they could introduce in dataset URIs.

If you think it is sufficiently unlikely that those will ever be needed or you have an idea of how you would handle it then maybe it is not an issue.

@rdblue (Contributor, Author) commented Jun 10, 2014

We can easily solve the problem of other arguments clashing by changing the keys that the URI host section uses. We could use more complex names, like "hdfs.user" rather than "username", to deconflict.

Readers wouldn't have to know that the dataset is backed by an external table or a managed one.

This is something we need to update anyway, which is covered by CDK-445. I'd like to be able to specify anything without a reasonable default in the URI so that we can use them as targets in MR or Crunch without requiring the developer to pre-create them.

@rdblue (Contributor, Author) commented Jun 10, 2014

You may be constrained from doing so by the potential collisions they could introduce in Dataset URIs.

I think the idea of adding prefixes (e.g., hdfs.auth) would solve this problem. Another way of getting around it is to introduce other URI formats that use the query args differently.

@ben-roling

I think the idea of adding prefixes (e.g., hdfs.auth) would solve this problem.

That would probably be a good enough solution.

Another way of getting around it is to introduce other URI formats with different ways the query args are used.

Yeah, that is obviously possible too, although I suspect you'd like to minimize the number of different styles of URIs you support.

@rdblue (Contributor, Author) commented Jun 10, 2014

Yeah, I definitely want to minimize the number of URI patterns, but I'm okay with that as a backup plan if we need to completely change how query args are used.

This updates the auth properties in URIPattern by adding the prefix
"auth:" and updates the HDFS properties used in Hive URIs to use the
prefix "hdfs:".
@rdblue (Contributor, Author) commented Jun 10, 2014

To avoid the collisions, I've updated the URI handling code to use prefixes. Raw URI parts added to the match data are prefixed with "uri:", authority parts are prefixed with "auth:", and the Hive URI patterns now use "hdfs:" to prefix the HDFS host and port. This should deconflict query args.

I chose the ":" separator because it produces a warning when used in a record field name (not Hive-compatible), and I wanted to reserve "." for nested fields. For example, "location.latitude" will likely be a valid nested field name in the future.
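With the prefixed keys, the earlier clash goes away: the authority-derived binding and the schema-field query arg land under different keys. This sketch mirrors the one above and is again illustrative (invented class name, "auth:username" used as the prefixed key for demonstration), not Kite's URIPattern source:

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashMap;
import java.util.Map;

// Sketch of the deconflicted scheme: authority-derived keys carry an
// "auth:" prefix, so a schema field named "username" in the query no
// longer collides. Illustrative only, not Kite code.
public class PrefixedMatch {

    static Map<String, String> matchData(String uriString) {
        try {
            URI uri = new URI(uriString);
            Map<String, String> match = new HashMap<>();
            if (uri.getUserInfo() != null) {
                match.put("auth:username", uri.getUserInfo()); // prefixed, no clash
            }
            if (uri.getQuery() != null) {
                for (String pair : uri.getQuery().split("&")) {
                    String[] kv = pair.split("=", 2);
                    match.put(kv[0], kv.length > 1 ? kv[1] : "");
                }
            }
            return match;
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> m =
            matchData("hdfs://hdfs-user@localhost:61357/tmp/data/users?username=test");
        System.out.println(m.get("auth:username")); // hdfs-user
        System.out.println(m.get("username"));      // test
    }
}
```

Because ":" is not valid in a Hive-compatible record field name, no schema field can ever collide with a prefixed key.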

@tomwhite (Contributor)

+1

The change for the Hive URI patterns needs to be documented in the Reference Guide.

Regarding the query prefixes, I checked and they can contain a ':' character without having to escape it. http://tools.ietf.org/html/rfc3986#section-3.4 defines

query       = *( pchar / "/" / "?" )

and

pchar       = unreserved / pct-encoded / sub-delims / ":" / "@"
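Consistent with that grammar, java.net.URI accepts an unescaped ':' in the query component, so the prefixed keys survive standard parsing; a quick check (invented class name, illustration only):

```java
import java.net.URI;
import java.net.URISyntaxException;

// Quick check that ':' is legal unescaped in a URI query component,
// per RFC 3986 section 3.4; illustrative helper, not Kite code.
public class ColonInQuery {

    static String query(String uriString) {
        try {
            return new URI(uriString).getQuery();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // the "auth:" prefix passes through parsing unescaped and intact
        System.out.println(query("hdfs:/datasets/ds?auth:username=alice")); // auth:username=alice
    }
}
```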

@rdblue (Contributor, Author) commented Jun 13, 2014

Merged. Thanks for the reviews!

@rdblue rdblue closed this Jun 13, 2014
@rdblue rdblue deleted the CDK-385-add-dataset-uris branch June 13, 2014 19:49

3 participants