Skip to content

[query] Add manifest files with parallel export #12875

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
May 10, 2023

Conversation

tpoterba
Copy link
Contributor

@tpoterba tpoterba commented Apr 11, 2023

Parallel export modes now write a manifest file, "shard-manifest.txt" to the folder where generated shards live. These manifest files are text files with one filename per line, containing name of each shard written successfully to the directory. These filenames are relative to the export directory.

If a header is written, that is the first filename in the shard manifest.

Copy link
Collaborator

@chrisvittal chrisvittal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please write more (in some place that will be in the git log) about the expected behavior and purpose here, as well as what should be in one of these manifest files.

Furthermore, it seems like manifests are using absolute paths. This isn't robust to moving the sharded export.

Comment on lines -825 to +813
partPaths.forEachDefined(cb) { case (cb, i, file: SStringValue) =>
val path = file.loadString(cb)
val i = cb.newLocal[Int]("i", 0)
cb.whileLoop(i < len, {
val path = cb.memoize(partFiles(i))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is less maintainable than previous, and what would be even more maintainable would be to use partFiles as an indexable, rather than immediately pulling the array out, like we both did here, that way you can still use forEachDefined on the partFiles.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, getting back to this -- is your concern that we're not handling missing?

I've added an assertion that there are no missing values.

Comment on lines 570 to 574
bos.write(header.getBytes)
bos.write('\n')
}
files.foreach { f =>
bos.write(f.getBytes())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consistency in these getBytes calls please

@@ -562,6 +562,22 @@ case class TableTextPartitionWriter(rowType: TStruct, delimiter: String, writeHe
}

object TableTextFinalizer {
def writeManifest(fs: FS, outputPath: String, files: Array[String], header: String): Unit = {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As the header argument is used, it isn't really a 'header' and is more an extra item. I get why you don't put the item into the files array in generated code, but it's not a header for the manifest file and at least currently is a header path, so rename it?

@tpoterba
Copy link
Contributor Author

good comments, thanks! will address.

@tpoterba tpoterba force-pushed the manifest-parallel-export branch from c9cd522 to 95dc846 Compare April 28, 2023 17:37
@tpoterba
Copy link
Contributor Author

addressed!

@tpoterba tpoterba force-pushed the manifest-parallel-export branch from 95dc846 to dcd49e5 Compare May 8, 2023 14:19
@tpoterba
Copy link
Contributor Author

tpoterba commented May 9, 2023

bump on this! addressed comments.

Copy link
Collaborator

@chrisvittal chrisvittal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please edit the PR description. What is a manifest file? Why do we write them now? What should be in them?

@tpoterba tpoterba dismissed chrisvittal’s stale review May 9, 2023 18:48

done, thanks for reminder

Copy link
Collaborator

@chrisvittal chrisvittal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@danking danking merged commit a1db5f6 into hail-is:main May 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants