CDK-973: Added support for setting a reader schema in MR #346
Conversation
…ioning doesn't work

* For cases where the input schema and output schema don't match, added an induce step after the validation in the TransformTask.
* Implemented the induce by adding a new deepCopy method to DataModelUtil that takes a destination schema and handles adapting the source entity to the destination schema while doing the copy.
* I'm not convinced I should use this method in DatasetKeyOutputFormat, but I left an implementation in there that does that because I wanted a second opinion first.
* This also fixes a bug that only manifests when using Crunch 0.11 or later. The bug is caused by CRUNCH-459: changing the name of the top-level record is considered compatible by Kite, but Crunch won't resolve the branch in the union for the value of the Pair now that the value is nullable. This affects the version of Kite in CDH 5.2 and 5.3.
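To make the adapt-on-copy idea concrete, here is a minimal, self-contained sketch of what "deep copy into a destination schema" does. It is not Kite's `DataModelUtil.deepCopy` (which works on Avro `Schema` objects); schemas here are just field-name/default maps, purely for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for DataModelUtil.deepCopy(entity, dstSchema):
// fields present in the destination schema are copied from the source,
// fields the destination dropped are discarded, and fields the destination
// added are filled with their defaults. Only a sketch of the concept.
public class DeepCopySketch {
    static Map<String, Object> deepCopy(Map<String, Object> entity,
                                        Map<String, Object> dstSchemaDefaults) {
        Map<String, Object> copy = new LinkedHashMap<>();
        for (Map.Entry<String, Object> field : dstSchemaDefaults.entrySet()) {
            // take the source value if present, otherwise the destination default
            copy.put(field.getKey(),
                     entity.getOrDefault(field.getKey(), field.getValue()));
        }
        return copy;
    }

    public static void main(String[] args) {
        Map<String, Object> src = new LinkedHashMap<>();
        src.put("id", 1L);
        src.put("legacy", "drop-me");           // not in the destination schema
        Map<String, Object> dst = new LinkedHashMap<>();
        dst.put("id", 0L);
        dst.put("comment", null);               // new nullable field, default null
        System.out.println(deepCopy(src, dst)); // prints {id=1, comment=null}
    }
}
```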
@@ -324,7 +324,7 @@ public void write(E key, Void v) {
   }

   private <E> E copy(E key) {
-    return dataModel.deepCopy(schema, key);
+    return DataModelUtil.deepCopy(key, schema);
I'm not convinced we want this here, but I wanted a second opinion.
* Added support for setting a reader schema for Crunch sources
* Updated the CopyTask to set the reader schema to the schema of the destination dataset.
@@ -207,4 +207,38 @@ public static Schema makeNullableSchema(Schema schema) {
     return new EntityAccessor<E>(type, schema);
   }

+  public static <E> E deepCopy(E entity, Schema dstSchema) {
These may be good to have later, but if they aren't needed for this commit let's remove them. We can add them as a separate feature and add appropriate tests. Also, if what we want is a schema update then you might want to do that instead of a deep copy. There's no need to deep copy when the schema for a sub-record hasn't changed.
Yeah, I can remove it.
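The reviewer's point above, that there is no need to deep copy when the schema hasn't changed, can be sketched as a simple short-circuit. The names and the string-based schema comparison here are illustrative only, not Kite's API:

```java
import java.util.Objects;
import java.util.function.UnaryOperator;

// Sketch: only pay for a deep copy when the entity's schema actually
// differs from the destination schema; otherwise return the entity as-is.
public class CopyShortCircuit {
    static <E> E copyIfNeeded(E entity, String entitySchema, String dstSchema,
                              UnaryOperator<E> deepCopy) {
        if (Objects.equals(entitySchema, dstSchema)) {
            return entity;             // schemas match: nothing to adapt
        }
        return deepCopy.apply(entity); // schemas differ: adapt via deep copy
    }

    public static void main(String[] args) {
        // same schema on both sides: the copy function is never invoked
        System.out.println(copyIfNeeded("x", "record A", "record A", v -> v + "-copied"));
        // prints x
    }
}
```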
* Added View#asSchema() method
* Added View#getSchema() method
* Removed DatasetSourceTarget and CrunchDatasets methods that take a reader schema.
* Updated DatasetKeyInputFormat to use View#getSchema() for GenericRecords
 *
 * @since 1.1.0
 */
public Schema getSchema();
I think this should be named `getReadSchema` instead. The problem is that a dataset is also a view, so we need a way to distinguish between the dataset's schema and the schema you get when using it as a view. Otherwise it appears that I can change the dataset's schema using `asSchema`.
Okay, thinking about this more... I don't think my suggestion is valid.

The schema you set on the view is the view's schema and isn't necessarily a read schema. The dataset's schema must always be able to read that schema, but the view's schema can be more strict than the dataset's schema. It can also be more permissive: in the case where you're projecting a subset of the columns, you don't care about the ones that aren't being read.

So then we can't validate the schema from `asSchema` when it is used to create a view, only when we use that view to get a reader or a writer. (Which we should do, by the way.) Because the schema could be either a read or a write schema, we can't name this method for either one. But we still need to distinguish it from the dataset's schema. `getViewSchema` or `getEffectiveSchema`?
Another option that potentially solves both this and the dataset cast issue below (needing two copies of a dataset with different types) is to make the instance class type available only on views. Then the dataset simply has a schema that is the same as the descriptor's schema. You can then get a view with a different type or a view with a different schema, but the underlying dataset is always `Dataset<GenericRecord>`.

The `Datasets` class would be simpler, replacing all methods that take a type with something like this:
View<Specific> data = Datasets.load(uri).asType(Specific.class);
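The `asType` idea above can be sketched in a few lines: the dataset stays generic, and a typed view is just a wrapper carrying the requested instance class. The class and method names below are illustrative stand-ins, not the actual Kite classes:

```java
// Sketch of the proposal: loading always yields an untyped view over the
// underlying (generic) dataset; asType re-wraps it with a different class.
public class AsTypeSketch {
    static class View<E> {
        final String uri;       // identifies the underlying dataset
        final Class<E> type;    // instance class this view produces
        View(String uri, Class<E> type) { this.uri = uri; this.type = type; }

        // re-wrap the same underlying dataset with a different instance class
        <T> View<T> asType(Class<T> newType) { return new View<>(uri, newType); }
    }

    static View<Object> load(String uri) {  // always generic at first
        return new View<>(uri, Object.class);
    }

    public static void main(String[] args) {
        View<String> typed = load("dataset:hdfs:/data/events").asType(String.class);
        System.out.println(typed.type.getSimpleName()); // prints String
    }
}
```

The design point is that the type lives on the view wrapper, so two differently typed views can share one dataset instance without casting the dataset itself.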
This updates Joey's addition of View#asSchema for record/column projection and adds View#asType. The asSchema changes needed the ability to create a new backing Dataset instance with a different type. This also fixes the review items I posted on kite-sdk#346.
Closing in favor of @rdblue's PR