Memleak #40
This is very interesting. However, you are doing it wrong: you are not providing a streaming interface with proper backpressure, so LevelGraph couldn't keep up with the parsing speed. You should try writing a transform step and use pipe, like so:

```js
var db = levelgraph(levelup(dbname));
var through2 = require('through2');
var filename = "pathtodbpedia.nt";

fs.createReadStream(filename).pipe(through2({ objectMode: true }, function (data, enc, callback) {
  var that = this;
  N3Parser().parse(data, function (error, triple) {
    that.push(triple);
  });
  callback(); // as parse should be sync
})).pipe(db.putStream());
```

What package is `N3Parser` from? (I've not tested it; if you can provide a working gist and a link to the dataset, I'll try.) BTW, LevelGraph is slower on writes (15k triples/s) than on reads (30-40k triples/s). I'm pulling in @maxogden as he has done huge work on importing bulk data into LevelUp. |
@mcollina It's the node-n3 parser, the non-streaming version. On the other hand, @pietercolpaert, maybe the stream implementation can help out here then! |
(note: @RubenVerborgh is a colleague) I've implemented your method with through2. The memory still increases rapidly though... Let's wait for rdfjs/N3.js#6? |
I think you might want to try level-hyper, it uses a LevelDB variant (HyperLevelDB) tuned for better write throughput. |
I think this might be what is really happening: Level/levelup#171. (Reopening, as it's easy to search for discussions on this topic). |
The thing is already packaged for usage in LevelGraph-N3: https://github.com/mcollina/levelgraph-n3#importing-n3-files. |
@pietercolpaert @RubenVerborgh any news on this? Have you tried the new stream-based importer? |
hey @pietercolpaert @RubenVerborgh, any news on this? |
The memory leak is still there with level-hyper as well, I'm afraid.

Test yourself:
1. Download http://downloads.dbpedia.org/3.9/en/instance_types_en.nt.bz2
2. Load it into a levelgraph db using something like:

```js
var fs = require("fs"),
    n3 = require("n3"),
    levelup = require('level'),
    levelhyper = require('level-hyper'),
    levelgraph = require("levelgraph");

var db = levelgraph(levelhyper("data"));
var readstream = fs.createReadStream("/path/to/instance_types_en.nt", { encoding: 'utf8' });
var count = 0;
var transform = new n3.Transform();

transform.pipe(db.putStream());
readstream.pipe(transform).on("data", function (triple) {
  count++;
  if (count % 10000 === 0) {
    console.error(count + " triples loaded");
  }
}).on("end", function () {
  console.error(count + " triples loaded");
});
```

Or am I doing something wrong here? |
First, using

Second, it might really be that Level needs more memory to store all that data, and you might need to increase it... a lot. Look at: https://github.com/joyent/node/wiki/FAQ#what-is-the-memory-limit-on-a-node-process. You are inserting roughly 15.894.068 lines, which we can assume are triples. I'm hoping for some help from @rvagg, @maxogden and others in order to track it down.

Third, there might really be a memory leak. Have a look at https://hacks.mozilla.org/2012/11/tracking-down-memory-leaks-in-node-js-a-node-js-holiday-season/ for help tracking it down.

I'll try giving your script a go and see if I can do better. |
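When deciding between the "Level just needs more memory" and "there is a real leak" hypotheses above, it helps to watch the heap while importing. This is a minimal Node-core sketch (the function name is mine, not from any package in this thread): steady unbounded growth suggests a leak, a plateau suggests caches warming up.

```javascript
// Sketch (Node core only): log heap and RSS alongside the triple counter,
// so a leak (steady growth) can be told apart from a plateau (caches)
// before reaching for --max_old_space_size.
function logMemory(count) {
  var usage = process.memoryUsage();
  var heapMb = Math.round(usage.heapUsed / 1024 / 1024);
  var rssMb = Math.round(usage.rss / 1024 / 1024);
  console.error(count + " triples, heap " + heapMb + " MB, rss " + rssMb + " MB");
}
```

Calling this every 10000 triples, like the counters in the scripts in this thread, adds negligible overhead.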
Contrary to what I thought, the following is not leaking on my Mac:

```js
var fs = require("fs"),
    through2 = require("through2"),
    n3 = require("n3"),
    levelhyper = require('level-hyper'),
    levelgraph = require("levelgraph");

var db = levelgraph(levelhyper("data"));
var readstream = fs.createReadStream("./instance_types_en.nt", { encoding: 'utf8' });
var writestream = db.putStream();
var count = 0;
var transform = new n3.Transform();

readstream.pipe(transform).pipe(through2({ objectMode: true }, function (chunk, enc, callback) {
  count++;
  if (count % 10000 === 0) {
    console.error(count + " triples loaded");
  }
  this.push(chunk);
  callback();
})).pipe(writestream);

writestream.on("end", function () {
  console.error(count + " triples loaded");
});
```

It is extremely slow, but it is not leaking. |
Ok, memory problem detected at around 700k triples. Any help here @rvagg? |
Mine gave this when 80MB of the file was stored into levelgraph, while my max memory was 1700MB. This makes me think the backpressure is not working properly. |
Have you tried the version I've passed you that doesn't use the old API? I'm trying on an 8GB machine, giving node 7GB of memory, and the memory usage explodes. |
I'm confirming that it is indeed a memory leak: it happens at around 700k instances and varies with --max_old_space_size. So it is in node-land, inside the space managed by the GC. Wow -.-. |
Ok, using |
As I see it, the backpressure is not happening and stuff ends up in the _writableState.buffer. Not sure why it's happening or where the problem is. I used https://github.com/bnoordhuis/node-heapdump to get a heap snapshot; there are tons of unparsed strings. For reference, this is my last run:

```js
var fs = require("fs"),
    through2 = require("through2"),
    n3 = require("n3"),
    // levelhyper = require('level-hyper'),
    memwatch = require('memwatch'),
    levelgraph = require("levelgraph");
var heapdump = require('heapdump');
var throttle = require("throttle");

//var db = levelgraph(levelhyper("data"));
var db = levelgraph("data");
var readstream = fs.createReadStream("./instance_types_en.nt", { encoding: 'utf8' });
var writestream = db.putStream();
var count = 0;
var transform = new n3.Transform();
var diff;

// optionally throttle the read at 1024 * 1024 bytes/s (~1MB/s)
readstream
  //.pipe(throttle(1024 * 1024))
  .pipe(transform)
  //.pipe(through2({ highWaterMark: 1, objectMode: true }, function (chunk, enc, callback) {
  //  count++;
  //  if (count % 10000 === 0) {
  //    console.error(count + " triples loaded");
  //  }
  //  this.push(chunk)
  //  callback()
  //}))
  .pipe(writestream);

writestream.on("end", function () {
  console.error(count + " triples loaded");
});

memwatch.on("leak", function (info) {
  console.log(info);
  if (diff) {
    console.log(JSON.stringify(diff.end(), null, ' '));
  }
  //heapdump.writeSnapshot('./' + Date.now() + '.heapsnapshot');
  //diff = new memwatch.HeapDiff();
});
```
|
Ok, this is bad, and it is very probably a levelup/leveldown bug. It seems that using this helps a lot, as it sits directly on leveldown: However, I have no more time to spend on this right now :(. |
Let's file the issue in leveldown though? |
It's already filed in levelup and being discussed. I'm having some luck using levelup 0.12 and leveldown 0.6. The insert is much slower, but it uses much less memory: it is at 1.200.000 triples (which means 7.200.000 level records) and just at 4GB. The 'leak' warning of memwatch is not being triggered that often, so things are fine. If you want a consistent strategy for importing DBpedia into a levelgraph, the best solution right now is to split the file into 10 chunks of 200MB and insert those one by one. You can use a node cluster or control that via shell. |
I saturated even the version with levelup 0.12. I was able to insert around 5630000 triples, which is 33780000 pairs in leveldb. The final leveldb size is around 1.5GB. Were you able to load that into any triplestore on a single machine? I think the best solution in any case is to split the input into around 10 files and insert them one at a time. |
You should also bring in @brycebaril to this discussion as he seems to have a lot of experience working around this problem. |
Thanks Rod! Using LevelUP 0.12 with LevelWriteStream causes no leak whatsoever (traced with the memwatch package), but the memory occupancy still grows a lot: for storing 33 million records it's kind of OK, but it would be better if it was not there. This is LevelDB vs the V8 GC. Using LevelUP 0.17, with or without LevelWriteStream, causes the memory to explode much, much earlier. Should we try Node v0.11? If @brycebaril or @trevnorris want to help, they are super-welcome to. |
@mcollina I don't have much time, but would be more than happy to help along the way if you have any questions while debugging. Also, @tjfontaine is rad when it comes to finding mem leaks. |
Hi @mcollina, a lot of the memory price you're paying with large Level Read/Write streams is objectMode. You may be able to avoid some of that if you used a modified version of n3 that worked with level-bufferstreams, but I don't know how reasonable/possible that is with the work that n3 does.

So far as I understand it, the ReadStream is giving you two buffers from LevelDOWN's iterator.next(), but when LevelUP throws these into an object {key: ..., value: ...} you end up copying those buffers into the V8 heap. This is even worse if you're using JSON key/valueEncoding, as then the buffers get inflated into even more heap content. This is also a cost if you're using Array Batch for the WriteStream side, where queueing entries into a JavaScript Array will use V8 heap instead of sending them directly to LevelDOWN via chained Batch.

level-bufferstreams skips the V8 heap by using multibuffers to avoid objectMode entirely. The downside is that your Transforms will have to work with pure buffers to keep the memory footprint minimal.

That said, there very well could be speed or memory improvements to be made with LevelDOWN to help here -- particularly the differences between 0.12 and 0.17. |
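The objectMode cost described above can be seen in miniature with Node core alone. This is an illustrative sketch, not LevelUP's actual code; the key string is a made-up example: a Buffer's bytes live largely outside the V8 heap, while decoding it into strings and objects copies everything into GC-managed memory.

```javascript
// Sketch: a raw value as LevelDOWN would hand it over (external memory)...
var raw = Buffer.from(JSON.stringify({ key: "spo::a::b::c", value: "x" }));

// ...versus the inflated form after toString() + JSON.parse, which now
// lives entirely on the V8 heap and must be tracked by the GC.
var inflated = JSON.parse(raw.toString());
```

Multiplied by millions of records, that copy-and-inflate step is where the heap pressure comes from, which is exactly what level-bufferstreams avoids by staying in Buffer land.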
@brycebaril The current stream implementation in N3 is a short and simple layer above a highly efficient internal parser that only uses callbacks. I took the decision after I noticed that an internal implementation using streams was tremendously slow. Therefore, implementing a level-bufferstreams layer on top of the N3 parser should be trivial: only a new stream layer must be implemented. I guess that 90% of the current N3Transform layer can be reused. |
@RubenVerborgh I think you can try creating that library and inserting directly into LevelDown. It should not be hard. Here are the relevant lines for creating the keys and values to store in LevelDown: So you can skip levelgraph entirely for insertion and use it just for reading the data. |
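For readers without the linked source at hand: LevelGraph indexes each triple under six orderings of subject/predicate/object, so any bound/unbound query pattern becomes a contiguous key range scan. A simplified sketch follows; the `::` separator and the helper name are assumptions for illustration, not LevelGraph's exact code.

```javascript
// Simplified sketch of LevelGraph-style index keys: one triple becomes
// six keys, one per ordering, each pointing at the same value.
var defs = ['spo', 'sop', 'pos', 'pso', 'ops', 'osp'];
var names = { s: 'subject', p: 'predicate', o: 'object' };

function keysFor(triple) {
  return defs.map(function (order) {
    var parts = order.split('').map(function (c) { return triple[names[c]]; });
    return order + '::' + parts.join('::');
  });
}
```

This is also why every inserted triple costs six level records, matching the 6x multipliers quoted earlier in the thread (e.g. 1.200.000 triples becoming 7.200.000 records, counting an extra bookkeeping record).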
@mcollina I'm a bit confused here. @brycebaril suggested to create a level-bufferstreams implementation. |
I'm saying to skip levelgraph entirely and store things directly in LevelDown using level-bufferstreams. BTW, I'm trying to understand what is going on with chained-batch vs array-batch, which seems to be the main issue here. |
@mcollina Thanks, got it now 👍 |
I'm trying the WriteStream implementation that uses chainable batch (Level/level-ws#1) to improve this scenario. I get a segfault in doing so. I could not replicate it easily using LevelWriteStream alone.
Moreover, this happens only on node v0.10.x. Node v0.11.8 runs totally fine, with 300MB of occupied memory after 12 million level pairs inserted (roughly 2 million triples): I stopped it at that point, and the memory occupancy was not growing. So, this issue is in the node v0.10 side of LevelDown/NaN. As I'm not doing anything strange here, it should be something down in LevelDown or core, or something interfering between the two. Maybe one of the keys is getting deallocated by V8 while still being worked on by LevelDown? You can find the branch here: https://github.com/mcollina/levelgraph/tree/no-leak-writestream. |
Ok, Level/leveldown#73 solves the issue. I can load the big n3 file with a fixed memory load. Soon on your screens! |
Looks promising, thanks! Keep us updated on the segfault. |
You can use that branch with levelgraph as it is and it works fine with no segfault. |
Affirmative! It works as a charm! |
Reopening, as leveldown 0.10 is not released yet. Moreover, we will need a new LevelGraph patch release to bump that dependency. BTW, have you got some performance reports about queries on that dataset? |
I didn't do any performance tests yet, but for 1M triples, ingesting is quite fast and queries return within expectations :) Will keep you informed when I have more time to spend on this |
leveldown@0.10.0 and levelup@0.18.0 are out and will hopefully sort this out |
The new leveldown also brings a 25% increase in writing speed, so right now I can write 22.000 triples/second on my 2011 MacBook Air (vs 15.000 with the previous release). Have fun! 👯 |
This is released as version 0.6.11! |
To test this library for use in a bigger project, I've tried to ingest DBpedia. However, it crashes at ~1% of the triples, saying that the process is out of memory. It only happens when I try to write to a leveldb: if I comment out the write, the memory usage stays stable.
some code
Any ideas on how to limit memory usage?