
CDK-973: Added support for setting a reader schema in MR #346

Closed · wanted to merge 5 commits

Conversation


@joey (Member) commented Apr 1, 2015

  • Added support for setting a reader schema for crunch sources
  • Updated the copytask to set the reader schema to the schema of the
    destination dataset.

CDK-973: Copying to a destination that has a schema change and partitioning doesn't work

* For cases where the input schema and output schema don't match,
  added an induce step after the validation in the TransformTask.
* Implemented the induce using a new deepCopy method in DataModelUtil
  that takes a destination schema and handles adapting the source
  entity to the destination schema while doing the copy.
* I'm not convinced I should use this method in DatasetKeyOutputFormat,
  but I left an implementation in there that does that because I wanted a
  second opinion first.
* This also fixes a bug that only manifests when using Crunch 0.11 or
  later. The bug is caused by CRUNCH-459: changing the name of the
  top-level record is considered compatible by Kite, but it won't
  resolve the branch in the union for the value of the Pair now that the
  value is nullable. This hits in the version of Kite in CDH5.2 and 5.3.
@@ -324,7 +324,7 @@ public void write(E key, Void v) {
}

private <E> E copy(E key) {
-  return dataModel.deepCopy(schema, key);
+  return DataModelUtil.deepCopy(key, schema);
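The induce described in the commit message (adapt each source entity to the destination schema during the copy) can be sketched with plain Avro schema resolution: serialize with the source schema, then deserialize with the destination schema. This is a hypothetical standalone version for illustration only; the class name and the explicit srcSchema parameter are assumptions, not Kite's actual DataModelUtil.deepCopy(key, schema) signature.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

// Hypothetical standalone sketch of a schema-adapting deep copy;
// not Kite's actual DataModelUtil implementation.
public class SchemaAdaptingCopy {

  // Copies entity, adapting it from srcSchema to dstSchema using Avro's
  // standard schema-resolution rules (added fields are filled from
  // defaults, renamed records can be matched via aliases).
  public static <E> E deepCopy(E entity, Schema srcSchema, Schema dstSchema)
      throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<E>(srcSchema).write(entity, encoder);
    encoder.flush();

    // Constructing the reader with both writer and reader schemas is what
    // triggers Avro schema resolution on read.
    GenericDatumReader<E> reader = new GenericDatumReader<E>(srcSchema, dstSchema);
    return reader.read(null,
        DecoderFactory.get().binaryDecoder(out.toByteArray(), null));
  }
}
```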
joey (Member Author)

I'm not convinced we want this here, but I wanted a second opinion.

@joey changed the title CDK-973: Copying to a destination that has a schema change and partition... CDK-973: Added support for setting a reader schema in MR Apr 2, 2015
@@ -207,4 +207,38 @@ public static Schema makeNullableSchema(Schema schema) {
return new EntityAccessor<E>(type, schema);
}

public static <E> E deepCopy(E entity, Schema dstSchema) {
Contributor

These may be good to have later, but if they aren't needed for this commit let's remove them. We can add them as a separate feature and add appropriate tests. Also, if what we want is a schema update then you might want to do that instead of a deep copy. There's no need to deep copy when the schema for a sub-record hasn't changed.

joey (Member Author)

Yeah, I can remove it.

* Added View#asSchema() method
* Added View#getSchema() method
* Removed DatasetSourceTarget and CrunchDatasets methods that take a
  reader schema.
* Updated DatasetKeyInputFormat to use View#getSchema() for
  GenericRecords
*
* @since 1.1.0
*/
public Schema getSchema();
Contributor

I think this should be named getReadSchema instead. The problem is that a dataset is also a view, so we need a way to distinguish between the dataset's schema and the schema you get when using it as a view. Otherwise it appears that I can change the dataset's schema using asSchema.

Contributor

Okay, thinking about this more... I don't think my suggestion is valid.

The schema you set on the view is the view's schema and isn't necessarily a read schema. The dataset's schema must always be able to read that schema, but the view's schema can be more strict than the dataset's schema. It can also be more permissive: in the case where you're projecting a subset of the columns, you don't care about the ones that aren't being read.

So then we can't validate the schema from asSchema when it is used to create a view, only when we use that view to get a reader or a writer. (Which we should do, by the way.) Because the schema could be either a read or a write schema, we can't name this method for either one. But we still need to distinguish it from the dataset's schema. getViewSchema or getEffectiveSchema?
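The deferred validation suggested here, checking the view's schema only when it is used to get a reader or a writer, could lean on Avro's built-in compatibility check. A minimal sketch assuming plain Avro, not any Kite-specific machinery:

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;
import org.apache.avro.SchemaCompatibility.SchemaCompatibilityType;

// Illustrative only: answers whether data written with writerSchema can be
// read back using readSchema, per Avro's schema-resolution rules.
public class ViewSchemaCheck {
  public static boolean canRead(Schema readSchema, Schema writerSchema) {
    return SchemaCompatibility
        .checkReaderWriterCompatibility(readSchema, writerSchema)
        .getType() == SchemaCompatibilityType.COMPATIBLE;
  }
}
```

A view that projects a subset of the columns passes this check when used as a read schema but can fail it when used as a write schema, which matches the asymmetry described above.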

Contributor

Another option that potentially solves both this and the dataset cast issue below (need two copies of a dataset with different types) is to make the instance class type available only on views. Then the dataset simply has a schema that is the same as the descriptor's schema. You can then get a view with a different type or a view with a different schema, but the underlying dataset is always Dataset<GenericRecord>.

The Datasets class would be simpler, replacing all methods that take a type with something like this:

View<Specific> data = Datasets.load(uri).asType(Specific.class);
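The shape of that proposal, keeping the instance class on the view while the underlying dataset stays generic, can be sketched with a minimal wrapper. SimpleView and its members are hypothetical stand-ins for illustration, not Kite's actual classes:

```java
// Hypothetical sketch of "type lives on the view, not the dataset":
// the underlying dataset stays generic, and asType only changes the
// in-memory class the view hands back.
class SimpleView<E> {
  private final Class<E> type;
  private final String schemaJson; // stand-in for the dataset's Avro schema

  SimpleView(Class<E> type, String schemaJson) {
    this.type = type;
    this.schemaJson = schemaJson;
  }

  // Derives a view with a different instance class over the same data;
  // the dataset's schema is untouched.
  <T> SimpleView<T> asType(Class<T> newType) {
    return new SimpleView<T>(newType, schemaJson);
  }

  Class<E> getType() {
    return type;
  }

  String getDatasetSchema() {
    return schemaJson;
  }
}
```

Under this design the Datasets entry point always yields a generic view, and callers opt into a specific type per view rather than per dataset.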

rdblue added a commit to rdblue/kite that referenced this pull request May 21, 2015
This updates Joey's addition of View#asSchema for record/column
projection and adds View#asType. The asSchema changes needed the ability
to create a new backing Dataset instance with a different type.

This also fixes the review items I posted on kite-sdk#346.
rdblue added commits with the same message to rdblue/kite referencing this pull request on May 22 and May 29, 2015.
joey commented Jun 5, 2015

Closing in favor of @rdblue's PR

@joey joey closed this Jun 5, 2015
rdblue added further commits with the same message to rdblue/kite referencing this pull request on Jun 6, Jun 10, and Sep 30, 2015.
2 participants