
CDK-385: Add dataset URIs #87

Closed
wants to merge 15 commits

Conversation

@rdblue (Contributor) commented May 28, 2014

No description provided.

rdblue added 10 commits May 19, 2014 23:03
This updates URIPattern so that it can match after glob patterns. For
example, file:/*path/:dataset-name will now match "dataset-name" before
matching the path glob. This makes it possible to define URI patterns
that include a globbed path component that ends in a dataset name.

This also adds sample use, which needs to be updated.
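The glob-then-name matching the commit describes can be sketched roughly as follows. This is an illustrative stand-in, not Kite's actual URIPattern code; the class and method names are invented. For a pattern like `file:/*path/:dataset-name`, the trailing named component is bound first, then the glob captures the remaining path.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative stand-in for glob-then-name matching, NOT Kite's URIPattern
// source; class and method names are invented for this sketch.
public class GlobNameMatch {

    // Matches a path against "*path/:dataset-name" semantics.
    static Map<String, String> match(String path) {
        int lastSlash = path.lastIndexOf('/');
        if (lastSlash < 0) {
            return null; // no room for both a glob path and a name
        }
        Map<String, String> result = new HashMap<>();
        // bind the trailing named component before matching the glob
        result.put("dataset-name", path.substring(lastSlash + 1));
        result.put("path", path.substring(0, lastSlash));
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> m = match("/data/2014/05/users");
        System.out.println(m.get("path"));         // /data/2014/05
        System.out.println(m.get("dataset-name")); // users
    }
}
```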
Because query args are now supported in opaque URIs, matching the
repo:*storage-uri pattern was removing query args as part of that match.
It isn't really necessary to use URIPattern to match just the scheme, so
this commit matches the scheme and passes the raw scheme-specific part
to Registration as the dataset or repo URI.

Changes to path handling broke the Hive URI tests because "" can be
passed as a path. This also exposed that file: and hdfs: URIs were
incorrectly substituting the current directory for "/" paths, which are
passed as path "".
This moves the DatasetBuilder from the FS Loader into Registration so it
can be reused to wrap any repository builder. This also consolidates
registerRepoURI and registerDatasetURI into one register method that
uses the generic dataset builder. The generic builder will first get a
repository from the parsed URI options, then load a Dataset.
There is enough context from the methods used that there is no need to
add schemes to the start of URIs. This removes repo: and dataset:. Only
repository URIs provide backward compatibility.
@ben-roling

I haven't thoroughly examined all of this, but the change in b96ba36 to remove URI prefixes caught my eye. I think this will be problematic for the Oozie integration, because that integration works by implementing a new URIHandler that is linked to a given URI scheme. Without the prefixes, there is effectively no Kite URI scheme to tie a Kite URIHandler to.

@rdblue (Contributor, Author) commented Jun 2, 2014

@ben-roling thanks for noticing that. I'll add the "dataset" prefix back as optional. I just don't want to force users to use it all the time when it is obvious, as in Datasets.load("dataset:...").

@tomwhite (Contributor) commented Jun 3, 2014

I think the dataset: (and repo:) prefix should be required. Otherwise there's confusion about when you can drop the prefix (e.g. you can't when you are using Oozie), and also about whether it's a Kite URI at all.

We can relax this in the future, but if we make it optional now, we can't make it required in the future.

@tomwhite (Contributor) commented Jun 3, 2014

The documentation needs to be updated as part of this change too.

@rdblue (Contributor, Author) commented Jun 4, 2014

Tom, I think you're right. It makes sense in the context of a call to Datasets.load, but out of context the URIs are confusing. The namespace proposal would also make the actual storage deviate from the apparent storage URI: hdfs:/datasets/ns/ds vs. hdfs:/datasets/ns.ds. In the latter case it's clear that the URI does not match, and that Kite URIs don't necessarily carry meaning in other contexts.

@ben-roling

I'm wondering slightly about the way you are encoding view equality constraints as query params. Generally I like the way it looks, but I think sometimes you will have query parameters from the repository URI that have to carry forward into the dataset URI so that it remains fully descriptive. In that case, the query parameter list becomes a mix of repository parameters and view-constraint parameters with no clear separation between the two. For example:

dataset:hive:/datasets/ds?hdfs-host=myhost&hdfs-port=9999&version=10

hdfs-host and hdfs-port are repository parameters and version is a view-constraint parameter. When I was originally thinking about the URIs, I imagined a single query param containing all the view constraints, so that there is a clear separation between the view constraints and any other query parameters you might need for other reasons.

This idea yields URIs that are a bit uglier and more verbose even in the general cases where there are no query parameters contributed by the repository:

view:hive:/datasets/ds?constraints=version:10,lastName:Smith
vs
view:hive:/datasets/ds?version=10&lastName=Smith

Note: the format of the constraints query parameter above is just meant to demonstrate the concept rather than the specific encoding. I'm not sure if it is ok to use a colon as a separator there. I don't have enough expertise in URI design to know the best way to build something like that into the URI and I haven't spent the time to research it.

Anyway, I just wanted to mention the issue. You may feel that collisions between schema field names and repo query parameters would be so rare that the prettier URIs are worth the risk. Or perhaps I am missing something and somehow this isn't really an issue in the first place?
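The mixing described above is visible with plain java.net.URI parsing. This sketch (invented class name, not Kite's parsing code) flattens the query args of such a URI into a map, which shows that nothing in the URI itself distinguishes repository parameters from view constraints:

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the collision risk: repository options and view constraints
// share one flat query string. Uses only java.net.URI; not Kite code.
public class QueryMix {

    static Map<String, String> queryArgs(String uriString) {
        try {
            URI outer = new URI(uriString);                     // scheme "dataset"
            URI inner = new URI(outer.getSchemeSpecificPart()); // "hive:/datasets/ds?..."
            Map<String, String> args = new LinkedHashMap<>();
            for (String pair : inner.getQuery().split("&")) {
                String[] kv = pair.split("=", 2);
                args.put(kv[0], kv.length > 1 ? kv[1] : "");
            }
            return args;
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> q =
            queryArgs("dataset:hive:/datasets/ds?hdfs-host=myhost&hdfs-port=9999&version=10");
        // "hdfs-host" and "hdfs-port" configure the repository, "version" is
        // a view constraint, but the map gives no way to tell them apart.
        System.out.println(q); // {hdfs-host=myhost, hdfs-port=9999, version=10}
    }
}
```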

@rdblue (Contributor, Author) commented Jun 6, 2014

Ben, you're right that the query part is used for both. I don't see it as much of an issue, as you noted, because I expect the cluster information to be set in the environment config. Allowing it in the URI is more of a safety valve.

I think the shorter URIs also match the common expectation for how these URIs should work. Extra key-value pairs act as selection criteria, rather than requiring users to know that "constraints" is a keyword whose value is split by commas and colons.

@ben-roling

Here is another case for your consideration:

Datasets.view("view:hdfs:/tmp/data/users?username=test")

The username part of this URI gets interpreted both as a filter on the username attribute of the entity schema and as the userInfo component of the HDFS repo URI.

I actually just sort of accidentally ran into this when writing a new unit test in TestCrunchDatasets to test the feature I was adding for specifying the Crunch target as a view URI. The existing tests in this class use a "User" entity with "username" and "email" attributes. I was trying to write a test with a target view URI similar to the above. The test was failing because the HDFS location URIs were inconsistent between the view and the original dataset.

The stack trace was this:

java.lang.IllegalArgumentException: Wrong FS: hdfs://test-0@localhost:61357/var/folders/9k/wlyxzx3d4mb2kx3jkdk9t9t56bkgvq/T/1402077113218-0/out_job_local1436488949_0001/username_hash=1/f48a4518-3121-4bca-9654-6138625e99f2.avro, expected: hdfs://localhost:61357
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:625)
at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:173)
at org.apache.hadoop.hdfs.DistributedFileSystem.rename(DistributedFileSystem.java:339)
at org.kitesdk.data.spi.filesystem.FileSystemDataset.merge(FileSystemDataset.java:306)
at org.kitesdk.data.spi.filesystem.FileSystemDataset.merge(FileSystemDataset.java:1)
at org.kitesdk.data.mapreduce.DatasetKeyOutputFormat$MergeOutputCommitter.commitJob(DatasetKeyOutputFormat.java:142)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:456)

It would probably not be particularly rare for a consumer to have an Entity with "username" as an attribute. When they do, they may have trouble formulating a View with an equality constraint on that attribute.

So that I could move forward I just changed my test to use a view that filters on "email" instead but I thought I should mention it as it relates to this conversation.

I started this thread before I actually ran into this, which makes it sort of funny that I didn't immediately realize what was happening when the test failed in the first place.
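The clash Ben hit can be sketched with plain java.net.URI. This assumes, for illustration only, that the pattern exposes the authority's user-info under the key "username" (an invented key, not necessarily Kite's exact one); a query arg with the same name then silently overwrites the authority-derived binding:

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashMap;
import java.util.Map;

// Sketch of the username clash. ASSUMPTION: the authority's user-info is
// exposed under the key "username"; this is illustrative, not Kite code.
public class UserInfoClash {

    static Map<String, String> matchData(String uriString) {
        try {
            URI uri = new URI(uriString);
            Map<String, String> match = new HashMap<>();
            if (uri.getUserInfo() != null) {
                match.put("username", uri.getUserInfo()); // authority-derived key
            }
            if (uri.getQuery() != null) {
                for (String pair : uri.getQuery().split("&")) {
                    String[] kv = pair.split("=", 2);
                    match.put(kv[0], kv.length > 1 ? kv[1] : ""); // may collide
                }
            }
            return match;
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // the query arg "username=test" overwrites the user-info "hdfs-user",
        // so one key serves double duty and the intent is ambiguous
        Map<String, String> m =
            matchData("hdfs://hdfs-user@localhost:61357/tmp/data/users?username=test");
        System.out.println(m.get("username")); // test
    }
}
```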

@tomwhite (Contributor)

We need to avoid clashes, or at least have a mechanism for working around them.

The only place where query parameters are used is for Hive external tables. Perhaps we can change it so the dataset path has to be specified in the dataset descriptor at creation time for external tables. Readers wouldn't have to know whether the dataset is backed by an external table or a managed one.

@ben-roling

Another thing you might want to think about is the possibility that you will want to introduce new uses of query params in repository URIs in the future. You may be constrained from doing so by the potential collisions they could introduce in dataset URIs.

If you think it is sufficiently unlikely that those will ever be needed or you have an idea of how you would handle it then maybe it is not an issue.

@rdblue (Contributor, Author) commented Jun 10, 2014

We can easily solve the problem of other arguments clashing by changing the keys that the URI host section uses. We could use more complex names, like "hdfs.user" rather than "username", to deconflict.

Readers wouldn't have to know that the dataset is backed by an external table or a managed one.

This is something we need to update anyway, which is covered by CDK-445. I'd like to be able to specify anything without a reasonable default in the URI so that we can use them as targets in MR or Crunch without requiring the developer to pre-create them.

@rdblue (Contributor, Author) commented Jun 10, 2014

You may be constrained from doing so by the potential collisions they could introduce in Dataset URIs.

I think the idea of adding prefixes (e.g., hdfs.auth) would solve this problem. Another way of getting around it is to introduce other URI formats that use the query args differently.

@ben-roling

I think the idea of adding prefixes (e.g., hdfs.auth) would solve this problem.

That would probably be a good enough solution.

Another way of getting around it is to introduce other URI formats with different ways the query args are used.

Yeah, that is obviously possible too, although I suspect you'd like to minimize the number of different styles of URIs you support.

@rdblue (Contributor, Author) commented Jun 10, 2014

Yeah, I definitely want to minimize the number of URI patterns, but I'm okay with that as a backup plan if we need to completely change how query args are used.

This updates the auth properties in URIPattern by adding the prefix
"auth:" and updates the HDFS properties used in Hive URIs to use the
prefix "hdfs:".
@rdblue (Contributor, Author) commented Jun 10, 2014

To avoid the collisions, I've updated the URI handling code to use prefixes. Raw URI parts added to the match data are prefixed with "uri:", authority parts are prefixed with "auth:", and the Hive URI patterns now use "hdfs:" to prefix the HDFS host and port. This should deconflict query args.

I chose the ":" separator because it produces a warning when used in a record field name (not Hive-compatible), and I wanted to reserve "." for nested fields. For example, "location.latitude" will likely be a valid nested field name in the future.
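With the prefixed keys, the earlier clash goes away: the authority-derived binding and the schema-field query arg land under different keys. This sketch mirrors the one above and is again illustrative (invented class name, "auth:username" used as the prefixed key for demonstration), not Kite's URIPattern source:

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashMap;
import java.util.Map;

// Sketch of the deconflicted scheme: authority-derived keys carry an
// "auth:" prefix, so a schema field named "username" in the query no
// longer collides. Illustrative only, not Kite code.
public class PrefixedMatch {

    static Map<String, String> matchData(String uriString) {
        try {
            URI uri = new URI(uriString);
            Map<String, String> match = new HashMap<>();
            if (uri.getUserInfo() != null) {
                match.put("auth:username", uri.getUserInfo()); // prefixed, no clash
            }
            if (uri.getQuery() != null) {
                for (String pair : uri.getQuery().split("&")) {
                    String[] kv = pair.split("=", 2);
                    match.put(kv[0], kv.length > 1 ? kv[1] : "");
                }
            }
            return match;
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> m =
            matchData("hdfs://hdfs-user@localhost:61357/tmp/data/users?username=test");
        System.out.println(m.get("auth:username")); // hdfs-user
        System.out.println(m.get("username"));      // test
    }
}
```

Because ":" is not valid in a Hive-compatible record field name, no schema field can ever collide with a prefixed key.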

@tomwhite (Contributor)

+1

The change for the Hive URI patterns needs to be documented in the Reference Guide.

Regarding the query prefixes, I checked and they can contain a ':' character without having to escape it. http://tools.ietf.org/html/rfc3986#section-3.4 defines

query       = *( pchar / "/" / "?" )

and

pchar       = unreserved / pct-encoded / sub-delims / ":" / "@"
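Consistent with that grammar, java.net.URI accepts an unescaped ':' in the query component, so the prefixed keys survive standard parsing; a quick check (invented class name, illustration only):

```java
import java.net.URI;
import java.net.URISyntaxException;

// Quick check that ':' is legal unescaped in a URI query component,
// per RFC 3986 section 3.4; illustrative helper, not Kite code.
public class ColonInQuery {

    static String query(String uriString) {
        try {
            return new URI(uriString).getQuery();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // the "auth:" prefix passes through parsing unescaped and intact
        System.out.println(query("hdfs:/datasets/ds?auth:username=alice")); // auth:username=alice
    }
}
```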

@rdblue (Contributor, Author) commented Jun 13, 2014

Merged. Thanks for the reviews!

@rdblue rdblue closed this Jun 13, 2014
@rdblue rdblue deleted the CDK-385-add-dataset-uris branch June 13, 2014 19:49

3 participants