Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KITE-1081 FS writers to use the current view schema #421

Conversation

megnataraj
Copy link

No description provided.

@@ -118,7 +118,7 @@ private FileSystemView(FileSystemView<?> view, Schema schema, Class<E> type) {
writer = PartitionedDatasetWriter.newWriter(this);
} else {
writer = FileSystemWriter.newWriter(
fs, root, -1, -1 /* get from descriptor */, dataset.getDescriptor());
fs, root, -1, -1 /* get from descriptor */, dataset.getDescriptor(), this.getSchema());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The AbstractRefinableView#getSchema method returns the accessor's read schema. That's useful for callers, but the writers should be initialized using getAccessor().getWriteSchema(). The two schemas can differ slightly for reflected records.

@megnataraj
Copy link
Author

I am not sure if you got a chance to look into my latest commit addressing all your comments. After adding this new commit, build is failing for CDH5 alone and looking into the logs it doesn't seem those errors were introduced by this new commit, I could be wrong.

@chris-hirstein
Copy link

@rdblue Have you had a chance to review the update that @megnataraj provided?

@rdblue
Copy link
Contributor

rdblue commented Nov 3, 2015

@megnataraj and @chris-hirstein, I've looked at the update and it looks mostly good. My comments on TestAvroWriter#newWriter still applies though. That writer should use the schema that was passed in.

@chris-hirstein
Copy link

@rdblue It looks like @megnataraj added an update, can you take another look? Thanks!

@mkwhitacre
Copy link
Contributor

Re-kicked off Travis build b/c failures seemed unrelated to the PR.

+1 for this PR pending successful build for non-flaky reasons.

@rdblue
Copy link
Contributor

rdblue commented Nov 23, 2015

I'm not sure that the tests are exercising the correct code path, so I'd appreciate it if you didn't merge this until I have time to take a closer look. Sorry to be a blocker here.

@chris-hirstein
Copy link

@rdblue Will you have time to review this change soon?


@Test
public void testWriteWithOldSchema() throws IOException {
Schema writerSchema = SchemaBuilder.record("Message").fields()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This schema isn't compatible with TEST_SCHEMA, which is the dataset's schema. The required string field "message" is missing. This works because you're passing this writerSchema into newReader.

The problem is that I'm not sure this is testing something well-defined. It validates that this writerSchema as passed to FileSystemWriter.newWriter is used for the file (otherwise writes would fail) but I don't see the value of using it as the read schema because we expect that it is a valid read schema for both TEST_SCHEMA and itself.

I think this should pass the writerSchema you have here to newWriter, create a file, and validate that the file's schema is actually this one and not TEST_SCHEMA. At the writer level, that's what we are interested in. The other behavior (visible/hidden files and other checks) is validated in other tests and can probably be removed.

I also think there needs to be an overall test from the view layer. The motivation is to show that a view with an older write schema works. So I would like to see a view test that creates a dataset with a 2-field schema (one optional field) and writes through a view with a write-compatible schema that is missing the optional field. That validates then end-to-end behavior that we want.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rdblue this unit test is now updated to validate the file's schema is same as writerSchema.
Also A new test has been added to TestRefinableView, where reader writer compatibility is verified by using two schemas (Value, TestValue).

Let us know if these are the changes you're expecting, or if there are other modifications that needs to be done

@rdblue
Copy link
Contributor

rdblue commented Dec 11, 2015

@chris-hirstein, @megnataraj, thanks for the reminder. I've posted a comment about making sure the tests cover what we need to. I hope that is clear enough but if it isn't please ask questions about it and I'll make sure they get answered.

@anandsagark
Copy link

I've updated this pull request as per the changes suggested by @rdblue
Please take a look and see if everything looks ok to merge, the failure in the integration checks is not related to this PR

@rdblue
Copy link
Contributor

rdblue commented Jan 29, 2016

@anandsagark, I'll look today.

@rbrush
Copy link
Contributor

rbrush commented Feb 2, 2016

+1 overall. I'm willing to merge unless there are objections from @rdblue or @mkwhitacre.

(I might quibble that the TestValue class could do with a more descriptive name, like "ValueWithoutOptionalField", but I don't think that's enough to block it.)

@rdblue
Copy link
Contributor

rdblue commented Feb 2, 2016

I don't think that the new test actually tests what it should. It looks like the schemas for the two values are the same. I think the dataset should have a 2-field record and the code should write a 1-field record.

But, I've been holding this up for too long and I'm fairly confident that it works, so I'm not going to shoot it down. Thanks for being patient, everyone.

@rbrush
Copy link
Contributor

rbrush commented Feb 2, 2016

@rdblue Ugh, you're right. Got my wires crossed on which schema the view pointed to.

@anandsagark, if I'm reading this right I think this just means making testReaderWriterCompatibleSchema use a dataset that uses the "testDescriptor" rather than the "valueDescriptor" so we write to a different schema than the writer has.

@anandsagark
Copy link

@rdblue @rbrush I've updated PR to use different schemas for read, write. Let me know if this is in tune with what you guys are looking for.

edit : updated PR again, dataset created with 2-field record and writer used 1-field record.

@rdblue
Copy link
Contributor

rdblue commented Feb 3, 2016

+1

Test looks good to me. Thanks @anandsagark

@rbrush
Copy link
Contributor

rbrush commented Feb 3, 2016

I kicked off the build again (looks like there was a spurious failure on the default profile). I'll merge pending a clean build on default and CDH5 (I know the CDH4 profile is failing for unrelated reasons.)

@rbrush
Copy link
Contributor

rbrush commented Feb 4, 2016

The build is clean modulo the known CDH4 issue and we have +1's. Merging.

Thanks everyone!

rbrush added a commit that referenced this pull request Feb 4, 2016
…rrent-view-schema

KITE-1081 FS writers to use the current view schema
@rbrush rbrush merged commit 9e963f2 into kite-sdk:master Feb 4, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants