Parsing xml without loading the whole file into RAM #15

FreeSlave · 2018-09-25T16:16:55Z

Is this possible with this library?
Maybe you could provide examples that show memory-efficient usage of the DOM and StAX parsers.

jmdavis · 2018-09-26T02:02:24Z

Memory management of what's being parsed is left up to the range being parsed and is not the concern of dxml. The parser will operate on any forward range of char, wchar, or dchar. I thought that the documentation was clear about that. As such, if you have a forward range of characters over a file which does not read in the entire file at once, you can parse the file without reading it all into memory (though obviously, any parts you keep around will then stay in memory).
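As a rough illustration of the StAX-style usage, here is a minimal sketch that walks a document with `parseXML` and keeps only the element names. The sample document and the helper name `elementNames` are made up for illustration, and a `string` is used only because it is the simplest forward range of characters to write down.

```d
import dxml.parser : parseXML, EntityType;
import std.stdio : writeln;

// Hypothetical helper: collects the names of all elements in document order.
// Only the names are retained; everything else is visited and then dropped.
string[] elementNames(string xml)
{
    string[] names;
    foreach (entity; parseXML(xml))
    {
        switch (entity.type)
        {
            case EntityType.elementStart:
            case EntityType.elementEmpty:
                names ~= entity.name; // a slice of the input, not a copy
                break;
            default:
                break; // text, comments, end tags, etc. are skipped
        }
    }
    return names;
}

void main()
{
    // Hypothetical sample input.
    auto xml = "<root><item id=\"1\">hello</item><item id=\"2\"/></root>";
    writeln(elementNames(xml));
}
```

With a range that reads the file lazily in place of the string, the same loop would only ever touch the part of the input it is currently looking at.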

However, if you're going to use parseDOM, then any portion of the document that it parses is going to result in memory allocations in order to build the DOM regardless of the underlying range being parsed. That's going to be true of any DOM parser since the whole point of a DOM parser is to build the document tree in memory.

FreeSlave · 2018-09-26T14:55:01Z

I see, but do you have a preferred solution? Phobos does not seem to provide any means of representing a file as a forward range without loading the whole file.
A DOM of course needs to allocate some structs that represent the tree, but it still uses slices of the original range to hold the stored data. The point is to minimize allocations, not to eliminate them.

Update: I've found that MmFile can remap portions of the file on demand when the window argument is given. It still needs a little trickery to make it a forward range, but it might work.
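One possible shape for that trickery, as a sketch only: a small struct that indexes into a windowed `MmFile` one byte at a time, so that `MmFile` remaps as needed and `save` is just a copy of the position. The struct name `MmCharRange`, the 64 KiB window, and the temporary file are assumptions made for illustration. Indexing through `opIndex` for every character is likely slow, and the bytes are assumed to be UTF-8 code units.

```d
import std.mmfile : MmFile;
import std.range.primitives : isForwardRange;

// Hypothetical forward range of char over a (possibly windowed) MmFile.
// It never hands out slices, so remapping the window cannot invalidate it.
struct MmCharRange
{
    private MmFile file;
    private ulong pos;
    private ulong end;

    this(MmFile file)
    {
        this.file = file;
        this.end = file.length;
    }

    @property bool empty() const { return pos >= end; }
    @property char front() { return cast(char) file[pos]; } // may remap the window
    void popFront() { ++pos; }
    @property MmCharRange save() { return this; } // copies only the position
}

static assert(isForwardRange!MmCharRange);

void main()
{
    import std.file : remove, tempDir, write;
    import std.path : buildPath;

    // Hypothetical scratch file so the sketch runs on its own.
    auto path = buildPath(tempDir, "mmcharrange_demo.xml");
    write(path, "<a>hi</a>");
    scope (exit) remove(path);

    // Size 0 means "use the file's size"; the window is an assumed 64 KiB.
    auto mmf = new MmFile(path, MmFile.Mode.read, 0, null, 64 * 1024);
    scope (exit) destroy(mmf); // unmap before the file is removed

    auto r = MmCharRange(mmf);
    auto saved = r.save;
    r.popFront();
    assert(r.front == 'a');
    assert(saved.front == '<'); // the saved copy is unaffected
}
```

Since the element type is `char` and the struct provides `save`, this should meet the "forward range of char" requirement described above, though it has not been checked against dxml here.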

jmdavis · 2018-09-26T22:55:06Z

As dxml uses the standard range mechanism, it bypasses the issue in the sense that it doesn't provide the range that reads in the file efficiently. It assumes that it already exists. Unfortunately, reading from a file efficiently is kind of the Achilles' heel of ranges in that every time you call save, you then need that range to always be valid for as long as it exists, so in order to buffer file access, you could need to have an arbitrarily large number of buffers, and it gets complicated. It's something that needs to be solved, but Phobos has largely ignored the issue (probably because it's complicated and no one really wants to take the time to write it). It does have stuff like std.stdio's byLine or byChunk to read lines or chunks efficiently, but that translates to a range of bytes or characters only awkwardly, because what you're really getting then is ranges of lines or chunks (and since those algorithms reuse their buffers, it gets even more complicated). Actually, properly buffering chunks of a file and reference-counting them with the range API to properly support save gets complicated fast, and in a lot of cases, simply reading the file in in pieces rather than as a forward range or reading it in all at once avoids the whole issue (though that obviously isn't always an option).
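To make the byChunk point concrete, here is a small sketch that flattens chunks into a range of bytes with `joiner` and checks at compile time what kind of range comes out. The helper name `countBytes` and the temporary file are only for illustration. The result is an input range of `ubyte` with no `save`, which is why it cannot be handed to a parser that requires a forward range.

```d
import std.algorithm.iteration : joiner;
import std.range.primitives : ElementType, isForwardRange, isInputRange, walkLength;
import std.stdio : File;

// The flattened range type, obtained without opening any file.
alias Bytes = typeof(File.init.byChunk(4096).joiner);

static assert(isInputRange!Bytes);           // it can be iterated once...
static assert(!isForwardRange!Bytes);        // ...but there is no save
static assert(is(ElementType!Bytes == ubyte)); // and it yields bytes, not chars

// Hypothetical helper: a single pass over the file, one 4 KiB chunk at a time.
size_t countBytes(string path)
{
    return File(path).byChunk(4096).joiner.walkLength;
}

void main()
{
    import std.file : remove, tempDir, write;
    import std.path : buildPath;

    // Hypothetical scratch file so the sketch runs on its own.
    auto path = buildPath(tempDir, "bychunk_demo.xml");
    write(path, "<a>hi</a>");
    scope (exit) remove(path);

    assert(countBytes(path) == 9);
}
```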

Personally, if I had to read a file in as a forward range and couldn't read it all in at once, I'd probably just use std.mmfile rather than trying to deal with buffering everything, since that gets really complicated. There's always Steven's https://github.com/schveiguy/iopipe, but it's still a work in progress, and as I understand it, he's had to work around the range API on some level precisely because it's so poorly suited to reading in a file efficiently, so I don't know exactly how that's going to work with the range API. I'm aware of iopipe but have to spend time studying it.
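For the std.mmfile route, a minimal sketch might look like the following. It maps the whole file, views the mapping as a slice of `char`, and lets the operating system page data in as the parser advances. The helper name `countElements` and the command-line path are assumptions for illustration. The sketch also assumes that the file is valid UTF-8, that it is not empty, and that dxml accepts a `const(char)[]` slice like any other character array.

```d
import dxml.parser : parseXML, EntityType;
import std.mmfile : MmFile;
import std.stdio : writeln;

// Hypothetical helper: counts elements without keeping any parsed data.
size_t countElements(string path)
{
    auto mmf = new MmFile(path); // read-only mapping of the whole file
    scope (exit) destroy(mmf);   // unmap when done

    // The mapping is only valid while mmf is alive, so no slice of it
    // may escape this function.
    auto text = cast(const(char)[]) mmf[];

    size_t count;
    foreach (entity; parseXML(text))
    {
        if (entity.type == EntityType.elementStart ||
            entity.type == EntityType.elementEmpty)
        {
            ++count;
        }
    }
    return count;
}

void main(string[] args)
{
    if (args.length < 2)
    {
        writeln("usage: count-elements <file.xml>");
        return;
    }
    writeln(countElements(args[1]));
}
```

Because the whole file is mapped rather than read, the address space has to be large enough for it, but resident memory is left to the operating system's paging.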

I wrote dxml the way I did so that it could work with a range that read over a file without reading it all in but without trying to actually solve that problem. By just operating on ranges, it pushes that entire problem off to ranges, which doesn't entirely solve the problem, but it does mean that as long as the problem is solved with ranges in general, it's solved for dxml.

JesseKPhillips · 2019-03-04T15:20:44Z

Yeah, and byLine/splitLines aren't good for parsers because they lose vital line-ending information. CDATA sections are most likely where that becomes a problem for XML.

I did a range over mmap files a while back.
https://github.com/JesseKPhillips/libosm/blob/master/source/util/filerange.d

ghost91- · 2023-09-01T09:19:59Z

I've had good experiences using std.mmfile together with dxml. I've written a tool that processes a full Wikipedia export (70 GB) this way, and it worked quite well.

https://github.com/ghost91-/wikipedia-indexer/tree/master

jmdavis closed this as completed Aug 31, 2023
