
issue with getSplits? #7

Closed
snabar opened this issue Jun 4, 2015 · 5 comments

snabar commented Jun 4, 2015

Hi, I am using this in Spark to read Hadoop files. Everything works fine if I set LINES_PER_MAP to a really large number such that the file does not get split at all. However, if I set it to something smaller, it seems to find more records than there actually are. Have you had any issues with this?
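For reference, my setup looks roughly like the minimal sketch below. The input format class name (CSVNLineInputFormat), its package, the configuration key, and the key/value types are assumptions on my part for illustration; adjust them to whatever this project actually exposes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
// Assumed package/class for this project's input format; adjust to the real one.
import org.apache.hadoop.mapreduce.lib.input.CSVNLineInputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ReadCsvFromSpark {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("csv-read"));

        Configuration conf = new Configuration();
        // How many CSV lines each input split (and therefore each task) should cover.
        // The property name is an assumption; use the constant the input format defines.
        conf.setInt("mapreduce.input.lineinputformat.linespermap", 50_000);

        // LongWritable/Text is assumed here; use whatever key/value types the
        // project's record reader actually emits.
        JavaPairRDD<LongWritable, Text> rows = sc.newAPIHadoopFile(
                args[0],                     // e.g. an HDFS path to the CSV
                CSVNLineInputFormat.class,
                LongWritable.class,
                Text.class,
                conf);

        // If getSplits over-counts, this count disagrees with `wc -l` on the file.
        System.out.println("records: " + rows.count());
        sc.stop();
    }
}
```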

mvallebr (Owner) commented Dec 6, 2015

Sorry for the delay in answering, @snabar. I believe I had a problem like this in the beginning, but it should be working in the master version... When I get some time I will check it, and if it's really not working I will provide a fix.

nitishcse412 commented

Hi, the code works perfectly fine, but my only trouble is that my data files, which are CSVs, can range from 10 GB to 200 GB. How do I decide LINES_PER_MAP so that no single mapper is overloaded and the work is distributed properly? I was thinking of doing some calculations to derive LINES_PER_MAP (I know the input data size and the column data types), but I am still not confident that would solve it. Any thoughts on how to decide LINES_PER_MAP?

mvallebr (Owner) commented

LINES_PER_MAP determines the size of a mapper task. If you have a CSV with 10,000 lines and LINES_PER_MAP=1000, you will get 10000/1000 = 10 map tasks, distributed to the mapper workers across your cluster.
If your file is 10 GB+, the number of lines is probably huge, so a value like 10k or 100k seems reasonable to me, but I would decide this only after experimenting with different values on the real cluster.
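In other words, the arithmetic is just the following (all numbers below are made up for illustration):

```java
public class SplitMath {
    public static void main(String[] args) {
        // Forward direction: how many map tasks a given LINES_PER_MAP produces.
        long totalLines = 10_000L;   // lines in the CSV
        long linesPerMap = 1_000L;   // LINES_PER_MAP
        long mapTasks = (totalLines + linesPerMap - 1) / linesPerMap;  // ceiling division
        System.out.println("map tasks: " + mapTasks);  // 10

        // Reverse direction: pick LINES_PER_MAP from an estimated line count
        // and a target number of tasks for the cluster.
        long estimatedLines = 500_000_000L;  // rough guess for a 10 GB+ file
        long targetTasks = 5_000L;
        System.out.println("suggested LINES_PER_MAP: " + estimatedLines / targetTasks);  // 100000
    }
}
```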

Thanks for confirming the code works fine.

nitishcse412 commented

Thank you for the prompt reply, I really appreciate it. I can run a few tests and come up with numbers, but do you think there could be other, smarter ways?

Two concerns:

1. Consider a file with 10 records: the number of lines could be 10, or it could be 15 if, say, one record is large and spans 6 lines when opened in an editor like vim. In most cases people know the number of records, not the number of lines, so even if I know the number of input records I may not be able to estimate the number of lines accurately.

2. The other way to figure it out would be to take a representative 100 GB file with 1000 columns and calculate the worst case: run it through a simple map-based job and find how many records are processed in one split (assuming a 256 MB block size). Say that number comes out to 100k; then use that number for all files, since we know roughly how many records fill a 256 MB block (a rough sketch of this is below).

Any thoughts? Or any other smarter way?
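Something like the sketch below is what I have in mind for the estimation step: sample the head of a representative file, compute the average bytes per line, and size LINES_PER_MAP so that one split covers roughly one block. The 256 MB block size and the sample size are assumptions, and multi-line records make the result an approximation.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class EstimateLinesPerMap {
    public static void main(String[] args) throws IOException {
        final long blockSizeBytes = 256L * 1024 * 1024;  // assumed HDFS block size
        final int sampleLines = 100_000;                 // how many lines to sample

        long sampledBytes = 0;
        long sampledCount = 0;
        try (BufferedReader reader = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while (sampledCount < sampleLines && (line = reader.readLine()) != null) {
                // +1 for the newline stripped by readLine(); good enough for an estimate.
                sampledBytes += line.getBytes(StandardCharsets.UTF_8).length + 1;
                sampledCount++;
            }
        }

        double avgBytesPerLine = (double) sampledBytes / sampledCount;
        long linesPerMap = (long) (blockSizeBytes / avgBytesPerLine);
        System.out.printf("avg bytes/line ~ %.1f, suggested LINES_PER_MAP ~ %d%n",
                avgBytesPerLine, linesPerMap);
    }
}
```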

mvallebr (Owner) commented Feb 28, 2017 via email
