My script has 10 lines above the __END__ marker and 750,000 lines below it. It runs fine in ruby-1.9.2-p290, but when I start it in jruby-1.7.4 it just hangs and never even gets to read the first line of DATA (at least not in the first 5 minutes).
How big is this file? We do not have the same implementation as MRI (which uses FILE*); we end up allocating a big ByteArrayInputStream out of that section, 1k at a time. Assuming memory is not an issue, we can probably bump this size up to something larger like 32k, since not many people use __END__ and you are not the first person with a large data set.
If you could make a script to generate a representative __END__ dataset, we can probably poke at this and improve our implementation. Ultimately, we want a read/write __END__ data section, preferably on top of NIO, but I know we looked at that in the past and there were some issues.
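For anyone who wants to reproduce this, here is a minimal sketch of such a generator. The method name, the file name, and the zero-padded line contents are made up for illustration; only the shape (a few lines of code, then __END__, then many short lines) comes from the report above.

```ruby
# Hypothetical generator (names are illustrative, not from the original report).
# Writes a Ruby script with a tiny header and a large __END__ data section.
def generate_data_script(path, line_count, line_width = 25)
  File.open(path, "w") do |f|
    f.puts "# prints the first line of the DATA section"
    f.puts "puts DATA.gets"
    f.puts "__END__"
    line_count.times do |i|
      # Zero-padded counter; line_width counts the trailing newline too.
      f.puts i.to_s.rjust(line_width - 1, "0")
    end
  end
  path
end
```

Calling generate_data_script("big_data.rb", 750_000) should produce a file of a bit under 19MB, which is in the same range as the one reported here. Running it under both MRI and JRuby should then show whether the first DATA.gets returns promptly.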
the file is 17MB, with roughly 25 chars per line:
$ ls -l by_started_at
-rw-r--r--@ 1 tim staff 18224247 Jul 10 10:16 by_started_at
$ wc -l by_started_at
So yeah, @enebo was right about the cause. We currently read the entire DATA contents into memory, 1k at a time. Those bytes go into a slowly growing array, so a file takes roughly DATA.size / 1024 rounds of read, resize, and copy, and each copy gets bigger as the array grows. It just ends up doing too much work.
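To get a feel for why the chunk size matters so much, here is a rough cost model. It is a simplification and not JRuby's actual code: it assumes the buffer is reallocated to fit each new chunk, so every read copies everything read so far. The method name is made up for illustration.

```ruby
# Simplified cost model, NOT JRuby's implementation. Assumption: the buffer
# grows by exactly one chunk per read, so existing contents are copied each time.
def grow_by_chunk_cost(total_bytes, chunk_size)
  reads = 0
  bytes_copied = 0
  filled = 0
  while filled < total_bytes
    bytes_copied += filled                           # copy old contents into the resized buffer
    filled += [chunk_size, total_bytes - filled].min # append the next chunk
    reads += 1
  end
  { reads: reads, bytes_copied: bytes_copied }
end
```

For the 18224247-byte file above, grow_by_chunk_cost(18_224_247, 1024) gives about 17,800 reads, while a 64k chunk needs about 280. Under this model the total bytes copied also drop by a factor of roughly 64, because the copy cost grows with the square of the file size divided by the chunk size.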
I'm going to do a short-term fix to increase the buffer size. For a 10MB file, a 64k buffer loads DATA almost immediately.
We are also talking about the longer-term fix to actually pass the real stream/channel for DATA rather than reading into memory.
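The idea of handing over the real stream can be sketched in plain Ruby. This only illustrates the approach, under the assumption that a simple line scan is good enough to find the marker; it is not how JRuby or MRI locate __END__, and the method name is made up.

```ruby
# Illustrative sketch only: return an IO positioned just past the __END__ line,
# so the data section is streamed on demand instead of being copied into memory.
# Assumption: the first line consisting solely of __END__ is the real marker
# (a line like that inside a heredoc would fool this scan).
def open_data_section(path)
  io = File.open(path, "rb")
  io.each_line do |line|
    return io if line.chomp == "__END__"
  end
  io.close # no marker found, so there is no data section
  nil
end
```

With this approach the cost of reaching the first data line depends on the size of the code above the marker, not on the size of the data below it. The caller is responsible for closing the returned IO.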
Bump DATA read buffer up to 64k for now. Fixes #873.
Use actual file stream for DATA when possible. See #873.
cool, that was quick!