Thank you - SplittableGzip works out of the box with Apache Spark! #2
This is great to hear.
I've added your examples to the documentation.
P.S. If I made a mistake in copying your documentation or if you have improvements then please put up a pull/merge request with the appropriate changes.
I have some follow-up questions about your project, to make sure I understand it well:
Hi @nielsbasjes! Any update on my questions above? I'm putting together a brief talk about SplittableGzip at the Boston Spark Meetup and want to make sure I have the basic facts right.
Something that may be of use for your presentation: using this library also reduces the part of the job that needs to be "redone" in case of an external error (e.g. a node failure).
@nchammas was this what you needed for your presentation?
Yes indeed, thank you for your responses! I'm giving the talk next week and will share the slides with you afterwards.

One interesting thing I found in my research for the talk is the GZinga project from eBay, which also aims to provide a splittable gzip codec for use with Hadoop. If I understood correctly, the key difference in their project is that they write additional metadata to the gzipped files to make them seekable. This is interesting, but it also means that files not created with their library do not support random access, and we're back to square one. With your library we don't get true random access, but we do get compatibility with gzipped files generated by any application or library, which is a better trade-off. If you control how the data is written, you're probably better off writing it with something other than gzip rather than extending gzip with custom metadata.
This GZinga project seems to be similar to the ideas discussed in https://issues.apache.org/jira/browse/HADOOP-7909. Essentially it is the same tradeoff: either you add extra metadata so the file becomes seekable (which only works for files you write yourself), or you accept reading some data redundantly in exchange for compatibility with any standard gzip file.
Good luck with your presentation.
I came across this project via a comment on a Spark Jira ticket where I was thinking about a way to split gzip files that is similar to what this project does. I was delighted to learn that someone had already thought through and implemented such a solution, and from the looks of it done a much better job at it than I could have.
So I just wanted to report here for the record, since gzipped files are a common stumbling block for Apache Spark users, that your solution works with Apache Spark without modification.
Here is an example, which I've tested against Apache Spark 2.4.4 using the Python DataFrame API:
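A minimal sketch of such a splittable-gzip.py (the input path and the 128 MB target split size below are placeholder values; the codec class name is the one provided by this library):

```python
# splittable-gzip.py
# Minimal sketch: read one large gzipped CSV file with the splittable gzip
# codec so Spark can process it with multiple concurrent tasks.
from pyspark.sql import SparkSession

if __name__ == '__main__':
    spark = (
        SparkSession.builder
        # Target split size for DataFrame file reads; 128 MB is just an
        # example value.
        .config('spark.sql.files.maxPartitionBytes', 128 * 1024 * 1024)
        .getOrCreate()
    )

    df = (
        spark.read
        # Register the splittable gzip codec for this read.
        .option('io.compression.codecs',
                'nl.basjes.hadoop.io.compress.SplittableGzipCodec')
        # Placeholder path to the large gzipped CSV file.
        .csv('/path/to/big-file.csv.gz')
    )

    print(df.count())
```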
Run this script as follows:
```sh
spark-submit --packages "nl.basjes.hadoop:splittablegzip:1.2" splittable-gzip.py
```
Here's what I see in the Spark UI when I run this script against a 20 GB gzip file on my laptop:
You can see in the task list the behavior described in the README, with each task reading from the start of the file to its target split.
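As a rough back-of-the-envelope (assuming roughly equal-sized splits over the compressed file): with N splits, the task for split i has to read and decompress about i/N of the file before reaching its own data, so across all tasks the file is read roughly N/2 times in total. That redundant work is the price paid for the parallelism, which suggests not making the splits too small relative to the file.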
And here is the Executor UI, which shows every available core running concurrently against this single file:
I will experiment some more with this project -- and perhaps ask some questions on here, if you don't mind -- and then promote it on Stack Overflow and in the Boston area.
Thank you for open sourcing this work!