Use the CombineFileInputFormat to avoid too many mappers #47

Open
wants to merge 2 commits into
base: develop

hansmire
Contributor

@hansmire hansmire commented Apr 3, 2014

Change SequenceFilePailInputFormat to use CombineFileInputFormat. This should reduce the number of input splits, and therefore mappers, for Pail sources. In my tests, several thousand splits were reduced to one.
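
To make the effect on split count concrete, here is a minimal sketch of the packing idea behind a combining input format. It is a simplified model, not Hadoop's algorithm (the real CombineFileInputFormat also takes node and rack locality into account), and `SplitPacker` and `pack` are hypothetical names introduced only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Hadoop or Pail.
class SplitPacker {
    /**
     * Greedily groups file sizes into combined splits.
     * A maxSplitBytes of 0 or less means "no limit", so every file lands in one split.
     */
    static List<List<Long>> pack(List<Long> fileSizes, long maxSplitBytes) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        boolean limited = maxSplitBytes > 0;
        for (long size : fileSizes) {
            // Close the current split if adding this file would exceed the limit.
            if (limited && !current.isEmpty() && currentBytes + size > maxSplitBytes) {
                splits.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        List<Long> sizes = new ArrayList<>();
        for (int i = 0; i < 3000; i++) {
            sizes.add(1024L * 1024L); // 3000 small files of 1 MiB each
        }
        System.out.println("one split per file:      " + sizes.size());
        System.out.println("combined, no limit:      " + pack(sizes, 0).size());
        System.out.println("combined, 128 MiB limit: " + pack(sizes, 128L * 1024 * 1024).size());
    }
}
```

With no size limit, the sketch collapses all files into a single split, which matches the "several thousand to one" observation in spirit; a limit trades fewer mappers for more parallelism.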

There is one issue with this change: it does not work with Hadoop 2.0.5-alpha, which is the version of Hadoop that I have deployed. The implementation of CombineFileInputFormat in that version does not call listStatus(JobConf conf) from the mapred package to get the list of files. Instead, it calls listStatus(JobContext conf) from the mapreduce package.

I worked around this by copying CombineFileInputFormat into this project to avoid version conflicts.
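
The overload mismatch described above can be sketched with plain Java stand-ins. Every class below is hypothetical and only mimics the shape of the problem; none of them is the real Hadoop or Pail type. The point is that a subclass overriding the `JobConf` overload is silently bypassed when the parent computes splits through the `JobContext` overload.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical stand-ins; these are not the real Hadoop classes.
class JobConf {}

class JobContext {
    final JobConf conf;
    JobContext(JobConf conf) { this.conf = conf; }
}

// Models a combining input format that offers both listStatus overloads.
class CombiningFormat {
    protected List<String> listStatus(JobConf conf) {
        return Collections.singletonList("default listing");
    }

    protected List<String> listStatus(JobContext context) {
        return Collections.singletonList("default listing");
    }

    // Models the behaviour the PR description attributes to 2.0.5-alpha.
    List<String> splitsViaJobContext(JobConf conf) {
        return listStatus(new JobContext(conf));
    }

    // Models the behaviour the change was written against.
    List<String> splitsViaJobConf(JobConf conf) {
        return listStatus(conf);
    }
}

// Models a Pail-style format that customizes file listing via the JobConf overload only.
class PailLikeFormat extends CombiningFormat {
    @Override
    protected List<String> listStatus(JobConf conf) {
        return Collections.singletonList("pail listing");
    }
}

class OverloadDemo {
    public static void main(String[] args) {
        PailLikeFormat format = new PailLikeFormat();
        JobConf conf = new JobConf();
        // The override is used only when the parent calls the JobConf overload.
        System.out.println(format.splitsViaJobConf(conf));
        System.out.println(format.splitsViaJobContext(conf));
    }
}
```

In this model, the custom listing is picked up on one path and ignored on the other, which is why the behaviour depends on which overload the deployed Hadoop version calls.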

@sorenmacbeth
Collaborator

Do you ever take advantage of the consolidate functions on your pails? I personally never ran into an issue with too many small files, because I always ensure that my master pails are consolidated before I run my Hadoop jobs on them.

@hansmire
Contributor Author

hansmire commented Apr 3, 2014

I tried to use it, but I did not have access to the hardcoded /tmp directory. I see there is another PR to fix that problem, though. Can you explain a bit more about how consolidation works?

Does the data remain partitioned as it is in the master directory? Is the master directory replaced?

@sorenmacbeth
Collaborator

The data remains partitioned as designed. Files in each sub-pail within the master pail are merged in place. You can also configure the size of each consolidated file.

Pail p = new Pail("/some/path");
p.consolidate();
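
As a rough illustration of what "merged in place, still partitioned" means, the following sketch models consolidation on in-memory file sizes. It is an assumption-laden model and does not use the Pail API: `ConsolidationSketch`, `consolidate`, and `targetBytes` are hypothetical names, and the greedy merge rule is chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model; not the Pail API.
class ConsolidationSketch {
    /**
     * For each partition, merges consecutive files until the next file
     * would push the merged file past targetBytes. Partition keys are preserved,
     * and files are never moved across partitions.
     */
    static Map<String, List<Long>> consolidate(Map<String, List<Long>> partitions, long targetBytes) {
        Map<String, List<Long>> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<Long>> entry : partitions.entrySet()) {
            List<Long> merged = new ArrayList<>();
            long current = 0;
            for (long size : entry.getValue()) {
                if (current > 0 && current + size > targetBytes) {
                    merged.add(current);
                    current = 0;
                }
                current += size;
            }
            if (current > 0) {
                merged.add(current);
            }
            result.put(entry.getKey(), merged);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<Long>> partitions = new LinkedHashMap<>();
        partitions.put("a", List.of(10L, 20L, 30L));
        partitions.put("b", List.of(50L, 60L));
        System.out.println(consolidate(partitions, 60L));
    }
}
```

The partition layout is unchanged by the merge; only the number and size of files inside each partition differ.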

* limitations under the License.
*/

// This is straight up copy of the hadoop file, so that we can use extend from it without having to

Whoa, copying and pasting big files like this is hugely frowned upon. It is not a good way to handle API changes in Hadoop. If they fix a bug upstream, this copy would never see it.

Contributor Author

You can see my comment on a previous change about this. I think it would be much better to upgrade the Hadoop library, but I am not sure how to fix it otherwise. I am also not sure what that upgrade would mean for other users of this library.

#45

3 participants