
issue with getSplits? #7

Closed
snabar opened this issue Jun 4, 2015 · 5 comments

snabar commented Jun 4, 2015

Hi, I am using this in Spark to read Hadoop files. Everything works fine if I set LINES_PER_MAP to a really large number such that the file does not get split at all. However, if I set it to something smaller, it seems to find more records than there actually are. Have you had any issues with this?
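For reference, my setup looks roughly like the minimal sketch below. The input format class name (CSVNLineInputFormat), its package, the configuration key, and the key/value types are assumptions on my part for illustration; adjust them to whatever this project actually exposes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
// Assumed package/class for this project's input format; adjust to the real one.
import org.apache.hadoop.mapreduce.lib.input.CSVNLineInputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ReadCsvFromSpark {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("csv-read"));

        Configuration conf = new Configuration();
        // How many CSV lines each input split (and therefore each task) should cover.
        // The property name is an assumption; use the constant the input format defines.
        conf.setInt("mapreduce.input.lineinputformat.linespermap", 50_000);

        // LongWritable/Text is assumed here; use whatever key/value types the
        // project's record reader actually emits.
        JavaPairRDD<LongWritable, Text> rows = sc.newAPIHadoopFile(
                args[0],                     // e.g. an HDFS path to the CSV
                CSVNLineInputFormat.class,
                LongWritable.class,
                Text.class,
                conf);

        // If getSplits over-counts, this count disagrees with `wc -l` on the file.
        System.out.println("records: " + rows.count());
        sc.stop();
    }
}
```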

mvallebr (Owner) commented Dec 6, 2015

Sorry for the delay in answering, @snabar. I believe I had a problem like this in the beginning, but it should be working in the master version... When I get some time I will check it, and if it's really not working I will provide a fix.

nitishcse412 commented

Hi, the code works perfectly fine, but my only trouble is that my data files, which are CSVs, can range from 10 GB to 200 GB. How do I decide LINES_PER_MAP so that no single mapper is overloaded and the work is distributed properly? I was thinking of doing some calculations to derive LINES_PER_MAP (I know the input data size and the column data types), but I am still not confident that would solve it. Any thoughts on how to decide LINES_PER_MAP?

mvallebr (Owner) commented

LINES_PER_MAP determines the size of a mapper task. If you have a CSV with 10,000 lines and LINES_PER_MAP=1000, you will get 10000/1000 = 10 map tasks, distributed to the mapper workers across your cluster.
If your file is 10 GB+, the number of lines is probably huge, so a value like 10k or 100k seems reasonable to me, but I would decide this only after experimenting with different values on the real cluster.
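In other words, the arithmetic is just the following (all numbers below are made up for illustration):

```java
public class SplitMath {
    public static void main(String[] args) {
        // Forward direction: how many map tasks a given LINES_PER_MAP produces.
        long totalLines = 10_000L;   // lines in the CSV
        long linesPerMap = 1_000L;   // LINES_PER_MAP
        long mapTasks = (totalLines + linesPerMap - 1) / linesPerMap;  // ceiling division
        System.out.println("map tasks: " + mapTasks);  // 10

        // Reverse direction: pick LINES_PER_MAP from an estimated line count
        // and a target number of tasks for the cluster.
        long estimatedLines = 500_000_000L;  // rough guess for a 10 GB+ file
        long targetTasks = 5_000L;
        System.out.println("suggested LINES_PER_MAP: " + estimatedLines / targetTasks);  // 100000
    }
}
```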

Thanks for confirming the code works fine.

nitishcse412 commented

Thank you for the prompt reply, I really appreciate it. I can run a few tests and come up with numbers, but do you think there could be other, smarter ways?

Two concerns:

1. Consider a file with 10 records: the number of lines could be 10, or it could be 15 if, say, one record is large and spans 6 lines when opened in an editor like vim. In most cases people know the number of records, not the number of lines, so even if I know the number of input records I may not be able to estimate the number of lines accurately.

2. The other way to figure it out would be to take a representative 100 GB file with 1000 columns and calculate the worst case: run it through a simple map-based job and find how many records are processed in one split (assuming a 256 MB block size). Say that number comes out to 100k; then use that number for all files, since we know roughly how many records fill a 256 MB block (a rough sketch of this is below).

Any thoughts? Or any other smarter way?
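Something like the sketch below is what I have in mind for the estimation step: sample the head of a representative file, compute the average bytes per line, and size LINES_PER_MAP so that one split covers roughly one block. The 256 MB block size and the sample size are assumptions, and multi-line records make the result an approximation.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class EstimateLinesPerMap {
    public static void main(String[] args) throws IOException {
        final long blockSizeBytes = 256L * 1024 * 1024;  // assumed HDFS block size
        final int sampleLines = 100_000;                 // how many lines to sample

        long sampledBytes = 0;
        long sampledCount = 0;
        try (BufferedReader reader = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while (sampledCount < sampleLines && (line = reader.readLine()) != null) {
                // +1 for the newline stripped by readLine(); good enough for an estimate.
                sampledBytes += line.getBytes(StandardCharsets.UTF_8).length + 1;
                sampledCount++;
            }
        }

        double avgBytesPerLine = (double) sampledBytes / sampledCount;
        long linesPerMap = (long) (blockSizeBytes / avgBytesPerLine);
        System.out.printf("avg bytes/line ~ %.1f, suggested LINES_PER_MAP ~ %d%n",
                avgBytesPerLine, linesPerMap);
    }
}
```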

mvallebr (Owner) commented Feb 28, 2017 via email
