-
Notifications
You must be signed in to change notification settings - Fork 251
[query] Add manifest files with parallel export #12875
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please write more (in some place that will be in the git log) about the expected behavior and purpose here, as well as what should be in one of these manifest files.
Furthermore, it seems like manifests are using absolute paths. This isn't robust to moving the sharded export.
partPaths.forEachDefined(cb) { case (cb, i, file: SStringValue) => | ||
val path = file.loadString(cb) | ||
val i = cb.newLocal[Int]("i", 0) | ||
cb.whileLoop(i < len, { | ||
val path = cb.memoize(partFiles(i)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is less maintainable than previous, and what would be even more maintainable would be to use partFiles
as an indexable, rather than immediately pulling the array out, like we both did here, that way you can still use forEachDefined
on the partFiles.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry, getting back to this -- is your concern that we're not handling missing?
I've added an assertion that there are no missing values.
bos.write(header.getBytes) | ||
bos.write('\n') | ||
} | ||
files.foreach { f => | ||
bos.write(f.getBytes()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
consistency in these getBytes calls please
@@ -562,6 +562,22 @@ case class TableTextPartitionWriter(rowType: TStruct, delimiter: String, writeHe | |||
} | |||
|
|||
object TableTextFinalizer { | |||
def writeManifest(fs: FS, outputPath: String, files: Array[String], header: String): Unit = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As the header
argument is used, it isn't really a 'header' and is more an extra item. I get why you don't put the item into the files
array in generated code, but it's not a header for the manifest file and at least currently is a header path, so rename it?
good comments, thanks! will address. |
c9cd522
to
95dc846
Compare
addressed! |
95dc846
to
dcd49e5
Compare
bump on this! addressed comments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please edit the PR description. What is a manifest file? Why do we write them now? What should be in them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
Parallel export modes now write a manifest file, "shard-manifest.txt" to the folder where generated shards live. These manifest files are text files with one filename per line, containing name of each shard written successfully to the directory. These filenames are relative to the export directory.
If a header is written, that is the first filename in the shard manifest.