
CDK-973: Added support for setting a reader schema in MR #346

Closed · wanted to merge 5 commits

Conversation


@joey (Member) commented Apr 1, 2015

  • Added support for setting a reader schema for crunch sources
  • Updated the copytask to set the reader schema to the schema of the
    destination dataset.

CDK-973: Copying to a destination that has a schema change and partitioning doesn't work

* For cases where the input schema and output schema don't match,
  added an induce step after the validation in the TransformTask.
* Implemented the induce using a new deepCopy method in DataModelUtil
  that takes a destination schema and handles adapting the source
  entity to the destination schema while doing the copy.
* I'm not convinced I should use this method in DatasetKeyOutputFormat,
  but I left an implementation in there that does that because I wanted a
  second opinion first.
* This also fixes a bug that only manifests when using Crunch 0.11 or
  later. The bug is caused by CRUNCH-459: changing the name of the
  top-level record is considered compatible by Kite, but it won't
  resolve the branch in the union for the value of the Pair now that the
  value is nullable. This hits in the version of Kite in CDH5.2 and 5.3.
@@ -324,7 +324,7 @@ public void write(E key, Void v) {
}

private <E> E copy(E key) {
-  return dataModel.deepCopy(schema, key);
+  return DataModelUtil.deepCopy(key, schema);
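The induce described in the commit message (adapt each source entity to the destination schema during the copy) can be sketched with plain Avro schema resolution: serialize with the source schema, then deserialize with the destination schema. This is a hypothetical standalone version for illustration only; the class name and the explicit srcSchema parameter are assumptions, not Kite's actual DataModelUtil.deepCopy(key, schema) signature.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

// Hypothetical standalone sketch of a schema-adapting deep copy;
// not Kite's actual DataModelUtil implementation.
public class SchemaAdaptingCopy {

  // Copies entity, adapting it from srcSchema to dstSchema using Avro's
  // standard schema-resolution rules (added fields are filled from
  // defaults, renamed records can be matched via aliases).
  public static <E> E deepCopy(E entity, Schema srcSchema, Schema dstSchema)
      throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<E>(srcSchema).write(entity, encoder);
    encoder.flush();

    // Constructing the reader with both writer and reader schemas is what
    // triggers Avro schema resolution on read.
    GenericDatumReader<E> reader = new GenericDatumReader<E>(srcSchema, dstSchema);
    return reader.read(null,
        DecoderFactory.get().binaryDecoder(out.toByteArray(), null));
  }
}
```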
joey (Member Author)

I'm not convinced we want this here, but I wanted a second opinion.

@joey changed the title CDK-973: Copying to a destination that has a schema change and partition... CDK-973: Added support for setting a reader schema in MR Apr 2, 2015
@@ -207,4 +207,38 @@ public static Schema makeNullableSchema(Schema schema) {
return new EntityAccessor<E>(type, schema);
}

public static <E> E deepCopy(E entity, Schema dstSchema) {
Contributor

These may be good to have later, but if they aren't needed for this commit let's remove them. We can add them as a separate feature and add appropriate tests. Also, if what we want is a schema update then you might want to do that instead of a deep copy. There's no need to deep copy when the schema for a sub-record hasn't changed.

joey (Member Author)

Yeah, I can remove it.

* Added View#asSchema() method
* Added View#getSchema() method
* Removed DatasetSourceTarget and CrunchDatasets methods that take a
  reader schema.
* Updated DatasetKeyInputFormat to use View#getSchema() for
  GenericRecords
*
* @since 1.1.0
*/
public Schema getSchema();
Contributor

I think this should be named getReadSchema instead. The problem is that a dataset is also a view, so we need a way to distinguish between the dataset's schema and the schema you get when using it as a view. Otherwise it appears that I can change the dataset's schema using asSchema.

Contributor

Okay, thinking about this more... I don't think my suggestion is valid.

The schema you set on the view is the view's schema and isn't necessarily a read schema. The dataset's schema must always be able to read that schema, but the view's schema can be more strict than the dataset's schema. It can also be more permissive: in the case where you're projecting a subset of the columns, you don't care about the ones that aren't being read.

So then we can't validate the schema from asSchema when it is used to create a view, only when we use that view to get a reader or a writer. (Which we should do, by the way.) Because the schema could be either a read or a write schema, we can't name this method for either one. But we still need to distinguish it from the dataset's schema. getViewSchema or getEffectiveSchema?
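The deferred validation suggested here, checking the view's schema only when it is used to get a reader or a writer, could lean on Avro's built-in compatibility check. A minimal sketch assuming plain Avro, not any Kite-specific machinery:

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;
import org.apache.avro.SchemaCompatibility.SchemaCompatibilityType;

// Illustrative only: answers whether data written with writerSchema can be
// read back using readSchema, per Avro's schema-resolution rules.
public class ViewSchemaCheck {
  public static boolean canRead(Schema readSchema, Schema writerSchema) {
    return SchemaCompatibility
        .checkReaderWriterCompatibility(readSchema, writerSchema)
        .getType() == SchemaCompatibilityType.COMPATIBLE;
  }
}
```

A view that projects a subset of the columns passes this check when used as a read schema but can fail it when used as a write schema, which matches the asymmetry described above.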

Contributor

Another option that potentially solves both this and the dataset cast issue below (need two copies of a dataset with different types) is to make the instance class type available only on views. Then the dataset simply has a schema that is the same as the descriptor's schema. You can then get a view with a different type or a view with a different schema, but the underlying dataset is always Dataset<GenericRecord>.

The Datasets class would be simpler, replacing all methods that take a type with something like this:

View<Specific> data = Datasets.load(uri).asType(Specific.class);
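The shape of that proposal, keeping the instance class on the view while the underlying dataset stays generic, can be sketched with a minimal wrapper. SimpleView and its members are hypothetical stand-ins for illustration, not Kite's actual classes:

```java
// Hypothetical sketch of "type lives on the view, not the dataset":
// the underlying dataset stays generic, and asType only changes the
// in-memory class the view hands back.
class SimpleView<E> {
  private final Class<E> type;
  private final String schemaJson; // stand-in for the dataset's Avro schema

  SimpleView(Class<E> type, String schemaJson) {
    this.type = type;
    this.schemaJson = schemaJson;
  }

  // Derives a view with a different instance class over the same data;
  // the dataset's schema is untouched.
  <T> SimpleView<T> asType(Class<T> newType) {
    return new SimpleView<T>(newType, schemaJson);
  }

  Class<E> getType() {
    return type;
  }

  String getDatasetSchema() {
    return schemaJson;
  }
}
```

Under this design the Datasets entry point always yields a generic view, and callers opt into a specific type per view rather than per dataset.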

rdblue added a commit to rdblue/kite that referenced this pull request May 21, 2015
This updates Joey's addition of View#asSchema for record/column
projection and adds View#asType. The asSchema changes needed the ability
to create a new backing Dataset instance with a different type.

This also fixes the review items I posted on kite-sdk#346.
rdblue added commits with the same message to rdblue/kite referencing this pull request on May 22 and May 29, 2015.
joey commented Jun 5, 2015

Closing in favor of @rdblue's PR

@joey joey closed this Jun 5, 2015
rdblue added further commits with the same message to rdblue/kite referencing this pull request on Jun 6, Jun 10, and Sep 30, 2015.
2 participants