Parsing xml without loading the whole file into RAM #15
Comments
Memory management of what's being parsed is left up to the range being parsed and is not the concern of dxml. The parser will operate on any forward range of However, if you're going to use
I see, but do you have a preferred solution? Phobos does not seem to provide any means of representing a file as a forward range without loading the whole file.

Update: I've found that MmFile can remap portions of the file on demand when the window argument is given. It still needs a little trickery to make it a forward range, but it might work.
As dxml uses the standard range mechanism, it bypasses the issue in the sense that it doesn't provide the range that reads in the file efficiently. It assumes that such a range already exists. Unfortunately, reading from a file efficiently is kind of the Achilles' heel of ranges in that every time you call

Personally, if I had to read a file in as a forward range and couldn't read it all in at once, I'd probably just use std.mmfile rather than trying to deal with buffering everything, since that gets really complicated.

There's always Steven's https://github.com/schveiguy/iopipe, but it's still a work in progress, and as I understand it, he's had to work around the range API on some level precisely because it's so poorly suited to reading in a file efficiently, so I don't know exactly how that's going to work with the range API. I'm aware of iopipe but have to spend time studying it.

I wrote dxml the way I did so that it could work with a range that reads over a file without reading it all in, but without trying to actually solve that problem myself. By operating only on ranges, it pushes that entire problem off to ranges. That doesn't entirely solve the problem, but it does mean that as long as the problem is solved for ranges in general, it's solved for dxml.
Yeah, and byLine/splitLines aren't good for parsers because they lose vital line-ending information. XML CDATA sections are most likely where that becomes a problem for XML. I did a range over mmap files a while back.
I've had good experiences using std.mmfile together with dxml. I've written a tool that processes a full Wikipedia export (70 GB) this way, and it worked quite well.
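For reference, the std.mmfile approach mentioned above might look roughly like the sketch below. It is not taken from the tool described in this thread; `countStartTags` is a hypothetical helper name, and the sketch assumes the file is valid UTF-8 XML. It maps the file into the address space (so the OS pages data in on demand instead of the program reading it all into a buffer), views the mapping as a character array, and feeds that to dxml's `parseXML`.

```d
import std.mmfile : MmFile;
import dxml.parser : parseXML, EntityType;

/// Hypothetical helper: counts start tags in the XML file at `path`
/// without copying the file into a buffer first.
size_t countStartTags(string path)
{
    // Map the file read-only; `scope` destroys the mapping when the function returns.
    scope mmf = new MmFile(path);

    // MmFile's opSlice yields void[]; view it as characters.
    // Assumption: the file is valid UTF-8.
    auto text = cast(const(char)[]) mmf[];

    size_t count;
    // parseXML lazily produces entities as the range is iterated.
    foreach (entity; parseXML(text))
    {
        // Self-closing tags such as <a/> are EntityType.elementEmpty and are not counted here.
        if (entity.type == EntityType.elementStart)
            ++count;
    }
    return count;
}
```

Calling `countStartTags("dump.xml")` (the file name is only an example) walks the document once. Note that the mapping must stay alive for as long as any slice of the parsed text is in use, which is why the counting happens inside the function.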
Is it possible to do this with this library?
Maybe you could provide examples that show memory-efficient usage of the DOM and StAX parsers.