CDK-385: Add dataset URIs #87
Conversation
This updates URIPattern so that it can match after glob patterns. For example, file:/*path/:dataset-name will now match "dataset-name" before matching the path glob. This makes it possible to define URI patterns that include a globbed path component that ends in a dataset name. This also adds sample use, which needs to be updated.
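The matching idea described above can be sketched in plain Java. This is a minimal, hypothetical illustration of binding the trailing segment to the dataset name before the glob consumes the path, not Kite's actual URIPattern implementation; the class and method names here are invented for the example.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: match URIs shaped like file:/<glob path>/<dataset-name>,
// binding the trailing segment to "dataset-name" before the glob takes the rest.
public class GlobThenNameMatch {
  static Map<String, String> match(String uriString) {
    URI uri = URI.create(uriString);
    String path = uri.getSchemeSpecificPart();
    int lastSlash = path.lastIndexOf('/');
    if (lastSlash <= 0 || lastSlash == path.length() - 1) {
      return null; // no dataset name segment to bind
    }
    Map<String, String> result = new HashMap<>();
    result.put("path", path.substring(1, lastSlash)); // glob part, leading '/' dropped
    result.put("dataset-name", path.substring(lastSlash + 1));
    return result;
  }

  public static void main(String[] args) {
    Map<String, String> m = match("file:/tmp/data/users");
    System.out.println(m.get("path"));          // tmp/data
    System.out.println(m.get("dataset-name"));  // users
  }
}
```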
Because query args are now supported in opaque URIs, matching the repo:*storage-uri pattern was removing query args as part of that match. It isn't really necessary to use URIPattern to match just the scheme, so this commit matches the scheme and passes the raw scheme-specific part to Registration as the dataset or repo URI. Changes to path handling broke the Hive URI tests because "" can be passed as a path. This also exposed that file: and hdfs: URIs were incorrectly substituting the current directory for "/" paths, which are passed as path "".
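The "match the scheme, pass the raw scheme-specific part" approach can be shown with the standard `java.net.URI` API. The example URI below is illustrative, but `getScheme()` and `getRawSchemeSpecificPart()` are real JDK methods; for an opaque URI the raw scheme-specific part keeps any query args intact rather than stripping them during pattern matching.

```java
import java.net.URI;

public class SchemeMatch {
  public static void main(String[] args) {
    // Match only the scheme, then hand the untouched remainder to the
    // registration as the repo or dataset URI, so query args survive.
    URI uri = URI.create("repo:hive:/datasets?custom=true");
    String scheme = uri.getScheme();                 // "repo"
    String rest = uri.getRawSchemeSpecificPart();    // "hive:/datasets?custom=true"
    System.out.println(scheme);
    System.out.println(rest);
  }
}
```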
This moves the DatasetBuilder from the FS Loader into Registration so it can be reused to wrap any repository builder. This also consolidates registerRepoURI and registerDatasetURI into one register method that uses the generic dataset builder. The generic builder will first get a repository from the parsed URI options, then load a Dataset.
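The consolidation described above can be sketched as follows. This is a hypothetical simplification of the generic-builder idea, not Kite's actual Registration API: one register method accepts any repository builder, and loading first builds a repository from the parsed URI options, then loads the named dataset from it.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of one register(...) method replacing separate
// registerRepoURI / registerDatasetURI methods.
public class Registration {
  interface Repository { Object load(String name); }

  private static final Map<String, Function<Map<String, String>, Repository>>
      REPO_BUILDERS = new HashMap<>();

  static void register(String scheme,
                       Function<Map<String, String>, Repository> repoBuilder) {
    REPO_BUILDERS.put(scheme, repoBuilder);
  }

  // The generic builder: get a repository from the parsed URI options,
  // then load the Dataset named in those options.
  static Object loadDataset(String scheme, Map<String, String> options) {
    Repository repo = REPO_BUILDERS.get(scheme).apply(options);
    return repo.load(options.get("dataset-name"));
  }
}
```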
There is enough context from the methods used that there is no need to add schemes to the start of URIs. This removes repo: and dataset:. Only repository URIs provide backward compatibility.
I haven't thoroughly examined all of this, but the change in b96ba36 to remove URI prefixes caught my eye. I think this will be problematic for the Oozie integration, because that integration works by implementing a new URIHandler that is linked to a given URI scheme. Without the prefixes, there is effectively no Kite URI scheme to tie a Kite URIHandler to.
@ben-roling thanks for noticing that. I'll add the "dataset" prefix back as optional. I just don't want to force users to use it all the time when it is obvious, like
I think the dataset: (and repo:) prefix should be required. Otherwise there's confusion about when you can drop the prefix (e.g. you can't when you are using Oozie), and also that it's a Kite URI at all. We can relax this in the future, but if we make it optional now, we can't make it required in the future.
The documentation needs to be updated as a part of this change too.
Tom, I think you're right. It makes sense in the context of a call to
I'm wondering slightly about the way you are doing encoding of view equality constraints as query params. Generally I like the way it looks, but I think sometimes you are going to have query parameters from the repository URI that have to carry forward into the dataset URI so that it remains fully descriptive. In such a case, the query parameter list becomes a mix of repository parameters and view constraint parameters without clear separation between the two. For example:

dataset:hive:/datasets/ds?hdfs-host=myhost&hdfs-port=9999&version=10

Here hdfs-host and hdfs-port are repository parameters and version is a view constraint parameter.

When I was originally thinking about the URIs, I was thinking about a single query param to contain all the view constraints, so that there is clear separation between the view constraints and any other query parameters you might need for other reasons. This idea yields URIs that are a bit uglier and more verbose, even in the general case where there are no query parameters contributed by the repository:

view:hive:/datasets/ds?constraints=version:10,lastName:Smith

Note: the format of the constraints query parameter above is just meant to demonstrate the concept rather than a specific encoding. I'm not sure whether it is okay to use a colon as a separator there. I don't have enough expertise in URI design to know the best way to build something like that into the URI, and I haven't spent the time to research it.

Anyway, I just wanted to mention the issue. You may feel that collisions between schema field names and repo query parameters would be so rare that the prettier URIs are worth the risk. Or perhaps I am missing something and this isn't really an issue in the first place?
Ben, you're right that the query part is used for both. I don't see it as much of an issue, as you noted, because I expect the cluster information to be set in the environment config. Allowing it in the URI is more of a safety valve. I think the shorter URIs also match the common expectation for how these URIs should work: extra key-value pairs act as selection criteria, rather than requiring users to know that "constraints" is a keyword whose value is split by comma and colon.
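The "extra key-value pairs act as selection criteria" reading can be sketched with a few lines of Java. This is a hypothetical illustration of parsing a query string into equality constraints, not the code under review; the class and method names are invented.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: treat every query key-value pair as an equality
// constraint on the view, e.g. "?version=10&lastName=Smith".
public class QueryConstraints {
  static Map<String, String> parse(String query) {
    Map<String, String> constraints = new LinkedHashMap<>();
    for (String pair : query.split("&")) {
      String[] kv = pair.split("=", 2);
      constraints.put(kv[0], kv[1]);
    }
    return constraints;
  }

  public static void main(String[] args) {
    Map<String, String> c = parse("version=10&lastName=Smith");
    System.out.println(c);
  }
}
```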
Here is another case for your consideration:

Datasets.view("view:hdfs:/tmp/data/users?username=test")

The username part of this URI gets interpreted both as a filter on the username attribute of the entity schema and as the userInfo attribute applied to the HDFS repo URI. I actually just sort of accidentally ran into this while writing a new unit test in TestCrunchDatasets to test the feature I was adding for specifying the Crunch target as a view URI. The existing tests in this class use a "User" entity with "username" and "email" attributes. I was trying to write a test with a target view URI similar to the one above. The test was failing because the HDFS location URIs were inconsistent between the view and the original dataset. The stack trace was this:

java.lang.IllegalArgumentException: Wrong FS: hdfs://test-0@localhost:61357/var/folders/9k/wlyxzx3d4mb2kx3jkdk9t9t56bkgvq/T/1402077113218-0/out_job_local1436488949_0001/username_hash=1/f48a4518-3121-4bca-9654-6138625e99f2.avro, expected: hdfs://localhost:61357

It would probably not be particularly rare for a consumer to have an entity with "username" as an attribute. When they do, they may have trouble formulating a view with an equality constraint on that attribute. So that I could move forward, I just changed my test to use a view that filters on "email" instead, but I thought I should mention it as it relates to this conversation. I started this thread before I actually ran into this, which makes it sort of funny that I didn't immediately realize what was happening when the test first failed.
We need to avoid clashes, or at least have a mechanism for working around them. The only place where query parameters are used is for Hive external tables. Perhaps we can change it so the dataset path has to be specified in the dataset descriptor at the time of creation for external tables. Readers wouldn't have to know that the dataset is backed by an external table or a managed one.
Another thing you might want to think about is the possibility that you will ever want to introduce new uses of query params in repository URIs. You may be constrained from doing so by the potential collisions they could introduce in Dataset URIs. If you think it is sufficiently unlikely that those will ever be needed or you have an idea of how you would handle it then maybe it is not an issue.
We can easily solve the problem of other arguments clashing by changing the keys that the URI host section uses. We could use more complex names, like "hdfs.user" rather than "username", to deconflict.
This is something we need to update anyway, which is covered by CDK-445. I'd like to be able to specify anything without a reasonable default in the URI so that we can use them as targets in MR or Crunch without requiring the developer to pre-create them.
I think the idea of adding prefixes (e.g., hdfs.auth) would solve this problem. Another way of getting around it is to introduce other URI formats with different conventions for how the query args are used.
That would probably be a good enough solution.
Yeah, that is obviously possible too although I suspect you'd like to minimize the number of different styles of URIs you support.
Yeah, I definitely want to minimize the number of URI patterns, but I'm okay with that as a backup plan if we need to completely change how query args are used.
This updates the auth properties in URIPattern to use the prefix "auth:" and updates the HDFS properties used in Hive URIs to use the prefix "hdfs:".
To avoid the collisions, I've updated the URI handling code to use prefixes. Raw URI parts added to the match data are prefixed with "uri:", authority parts are prefixed with "auth:", and the Hive URI patterns now use "hdfs:" to prefix the HDFS host and port. This should deconflict query args. I chose the ":" prefix because it produces a warning when used in a record field (not Hive compatible) and I wanted to reserve "." for nested fields. For example, "location.latitude" will likely be a valid nested field name in the future.
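The prefix scheme above can be sketched as a simple partition of the match data. This is a hypothetical illustration, not the patch's code: keys containing ":" (e.g. "auth:host", "hdfs:port", "uri:path") are URI or storage metadata, and unprefixed keys are view constraints, since ":" cannot appear in a Hive-compatible record field name.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: keep only the unprefixed match-data keys, which
// under the ":" prefix convention are the view constraint parameters.
public class PrefixPartition {
  static Map<String, String> constraintsOnly(Map<String, String> match) {
    Map<String, String> constraints = new HashMap<>();
    for (Map.Entry<String, String> e : match.entrySet()) {
      if (!e.getKey().contains(":")) {
        constraints.put(e.getKey(), e.getValue());
      }
    }
    return constraints;
  }
}
```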
+1 The change for the Hive URI patterns needs to be documented in the Reference Guide. Regarding the query prefixes, I checked and they can contain a ':' character without having to escape it. http://tools.ietf.org/html/rfc3986#section-3.4 defines

query = *( pchar / "/" / "?" )

and pchar explicitly allows ":":

pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
Merged. Thanks for the reviews!