issue with getSplits? #7
Sorry for the delay in answering, @snabar. I believe I had a similar problem in the beginning, but it should be working in the master version. When I get some time I will check it, and if it really isn't working I will provide a fix.
Hi, the code works perfectly fine, but my trouble is that my data files, which are CSV, can range from 10 GB to 200 GB. How do I decide LINES_PER_MAP so that no single mapper gets overloaded and the work is distributed properly? I was thinking of doing some calculations to derive LINES_PER_MAP (I know the input data size and the column data types), but I am still not confident that would solve it. Any thoughts on how to decide LINES_PER_MAP?
LINES_PER_MAP determines the size of a map task. If you have a CSV with 10,000 lines and LINES_PER_MAP=1000, you will get 10000/1000 = 10 map tasks, to be distributed to the mapper workers across your cluster. Thanks for confirming the code works fine.
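For readers landing on this thread, a minimal sketch of wiring that up in a MapReduce job could look like the block below. The configuration key "lines.per.map" and the commented-out CSVNLineInputFormat class are placeholders, not names confirmed from this project; substitute whatever LINES_PER_MAP constant and input format class the library actually exposes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CsvSplitJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "csv-split-example");
        job.setJarByClass(CsvSplitJob.class);

        // Every 1,000 lines become one split, so a 10,000-line file
        // yields 10000/1000 = 10 map tasks.
        // Placeholder key: use the project's actual LINES_PER_MAP constant, e.g.
        // job.setInputFormatClass(CSVNLineInputFormat.class);
        // job.getConfiguration().setInt(CSVNLineInputFormat.LINES_PER_MAP, 1000);
        job.getConfiguration().setInt("lines.per.map", 1000);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}
```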
Thank you for the prompt reply, I really appreciate it. I can run a few tests and come up with numbers, but do you think there could be smarter ways? Two concerns. First, consider a file with 10 records: the number of lines could be 10, or it could be 15 if one record is big and spans 6 lines when opened in a text editor. In most cases people know the number of records, not the number of lines, and knowing the number of input records does not let me cleanly estimate the number of lines. Second, another way to figure it out would be to take a standard file of 100 GB with 1000 columns and work out the worst case: run it through a simple map-only job, find how many records are processed in one split (assuming a 256 MB block size), and if that number comes out to, say, 100k, use that number for all files, since 256 MB is roughly what each split should be filled with. Any thoughts, or a smarter way?
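To make the block-size approach above concrete, here is a back-of-the-envelope sketch of the estimation: sample part of the file to get an average line size, then pick LINES_PER_MAP so that one split is roughly one HDFS block. All the numbers (256 MB block size, sample sizes) are assumptions for illustration only.

```java
// Back-of-the-envelope estimate of LINES_PER_MAP: aim for roughly one
// HDFS block's worth of lines per split. All numbers are illustrative.
public class LinesPerMapEstimate {
    public static void main(String[] args) {
        long blockSizeBytes = 256L * 1024 * 1024; // assumed HDFS block size: 256 MB
        long sampledLines = 100_000;              // lines read from a sample of the file
        long sampledBytes = 260_000_000L;         // bytes covered by that sample (assumed)

        double avgBytesPerLine = (double) sampledBytes / sampledLines;
        long linesPerMap = (long) (blockSizeBytes / avgBytesPerLine);

        System.out.printf("avg bytes/line = %.1f, suggested LINES_PER_MAP = %d%n",
                avgBytesPerLine, linesPerMap);
    }
}
```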
So, if I understood your case correctly, every record you have is composed of N CSV rows, is that right? Do you need to map records instead of rows?
I don't see a problem with estimating the number of lines from your description, but if I understood it correctly, you won't be able to use this input format in your case; you would have to use a RecordInputFormat specific to your case.
Please correct me if I misunderstood you.
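As an illustration of what such a record-oriented reader could look like (this is not part of this project), the sketch below wraps Hadoop's LineRecordReader and glues a fixed number of CSV rows into one value. It assumes every record spans exactly ROWS_PER_RECORD rows and ignores records that straddle split boundaries, which a real implementation would have to handle.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Illustrative only: groups a fixed number of CSV rows into one record value.
public class MultiRowRecordReader extends RecordReader<LongWritable, Text> {
    private static final int ROWS_PER_RECORD = 6; // assumption for this sketch
    private final LineRecordReader lines = new LineRecordReader();
    private final LongWritable key = new LongWritable();
    private final Text value = new Text();

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        lines.initialize(split, context);
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        StringBuilder record = new StringBuilder();
        int rows = 0;
        while (rows < ROWS_PER_RECORD && lines.nextKeyValue()) {
            if (rows == 0) {
                key.set(lines.getCurrentKey().get()); // offset of the first row
            } else {
                record.append('\n');
            }
            record.append(lines.getCurrentValue().toString());
            rows++;
        }
        if (rows == 0) {
            return false; // no more rows in this split
        }
        value.set(record.toString());
        return true;
    }

    @Override
    public LongWritable getCurrentKey() { return key; }

    @Override
    public Text getCurrentValue() { return value; }

    @Override
    public float getProgress() throws IOException, InterruptedException {
        return lines.getProgress();
    }

    @Override
    public void close() throws IOException {
        lines.close();
    }
}
```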
Hi, I am using this in Spark to read Hadoop files. Everything works fine if I set LINES_PER_MAP to a really large number so that the file does not get split at all. However, if I set it to something smaller, it seems to find more records than there actually are. Have you had any issues with this?
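For context, a common way to plug a custom Hadoop InputFormat into Spark is newAPIHadoopFile. The sketch below uses Hadoop's stock TextInputFormat as a stand-in and a placeholder configuration key; swap in this project's input format class and its LINES_PER_MAP constant to reproduce the splitting behaviour described above.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkCsvRead {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("csv-read"));

        Configuration hadoopConf = new Configuration();
        // Placeholder key: use the LINES_PER_MAP constant exposed by the input format.
        hadoopConf.setInt("lines.per.map", 100_000);

        // TextInputFormat stands in here; replace it with the project's input
        // format class to get its split and record behaviour.
        JavaPairRDD<LongWritable, Text> rows = sc.newAPIHadoopFile(
                args[0], TextInputFormat.class, LongWritable.class, Text.class, hadoopConf);

        // Comparing this count against the known number of records is one way
        // to check whether splitting changes the record count.
        System.out.println("record count = " + rows.count());
        sc.close();
    }
}
```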