Memleak #40

Closed
pietercolpaert opened this Issue Oct 11, 2013 · 40 comments

@pietercolpaert

To test this library for use in a bigger project, I've tried to ingest DBPedia. However, it crashes at ~1% of the triples, saying that the process is out of memory. It only happens when I try to write to a leveldb. If I comment out the write, the memory usage stays stable.

Some code:

var db = levelgraph(levelup(dbname));
var dbputstream = db.putStream();
var filename = "pathtodbpedia.nt";
fs.createReadStream(filename).on("data", function (data) {
  N3Parser().parse(data, function (error, triple) {
    dbputstream.write(triple);
  });
});

Any ideas on how to limit memory usage?

@mcollina
Owner

This is very interesting.
Loading a huge number of triples (how many is the 1%?) is something I should dig into.

However, you are doing it wrong: you are not providing a streaming interface with proper backpressure, so LevelGraph can't keep up with the parsing speed. You should try writing a transform step and use pipe, like so:

var db = levelgraph(levelup(dbname));
var through2 = require('through2')
var filename = "pathtodbpedia.nt";
fs.createReadStream(filename).pipe(through2({ objectMode: true }, function (data, enc, callback) {

  var that = this
  N3Parser().parse(data, function (error, triple) {
    that.push(triple)
  })

  callback() // as parse should be sync
})).pipe(db.putStream());
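
To illustrate what "proper backpressure" means here, a minimal sketch using only Node core streams, not LevelGraph itself. `createSlowSink` is a hypothetical stand-in for `db.putStream()`, and `writeAll` is a hypothetical helper that stops writing as soon as `write()` returns false and resumes on 'drain'. This is the bookkeeping that `pipe()` does for you.

```javascript
const { Writable } = require('stream');

// Hypothetical stand-in for db.putStream(): an object-mode sink that
// takes about 1 ms per item, so a fast producer can easily outrun it.
function createSlowSink(highWaterMark) {
  return new Writable({
    objectMode: true,
    highWaterMark: highWaterMark,
    write(chunk, encoding, callback) {
      setTimeout(callback, 1);
    }
  });
}

// Write every item, but stop as soon as write() returns false and
// continue only after the sink emits 'drain'.
function writeAll(sink, items, done) {
  let index = 0;
  let pauses = 0;

  function next() {
    while (index < items.length) {
      const ok = sink.write(items[index++]);
      if (!ok && index < items.length) {
        pauses++;
        sink.once('drain', next);
        return;
      }
    }
    sink.end();
    done(pauses);
  }

  next();
}
```

Calling `writeAll(createSlowSink(16), triples, function (pauses) { ... })` keeps at most about 16 items queued in the sink at any time, however fast the triples are produced.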

What package is N3Parser from?

(I've not tested it, if you can provide a working gist and a link to the dataset, I'll try)

BTW, LevelGraph is slower on writes (15k triples/s) than on reads (30-40k triples/s).

I'm pulling in @maxogden as he has done a huge amount of work on importing bulk data into LevelUp.
Also @RubenVerborgh for the need of a streaming N3-parsing API :).

@RubenVerborgh
Contributor

@mcollina It's the node-n3 parser, the non-streaming version.
It's interesting how you avoid the backpressure, does that work well? parse works sync.
This might make the stream implementation of node-n3 (RubenVerborgh/N3.js#6) unnecessary.

On the other hand, @pietercolpaert, maybe the stream implementation can help out here then!

@RubenVerborgh RubenVerborgh referenced this issue in RubenVerborgh/N3.js Oct 11, 2013
Closed

Node.js stream support #6

@pietercolpaert

(note: @rubenverborgh is a colleague)
How does your code avoid the backpressure?

I've implemented your method with through2. The memory still increases rapidly though...

Let's wait for RubenVerborgh/N3.js#6?

@RubenVerborgh RubenVerborgh referenced this issue in RubenVerborgh/N3.js Oct 13, 2013
Closed

Add Turtle serializer #7

@mcollina
Owner

> (note: @rubenverborgh is a colleague)
> How does your code avoid the backpressure?

It is handled by through2, automatically.

> I've implemented your method with through2. The memory still increases
> rapidly though...

Is it still crashing or not?
LevelDB uses a HUGE amount of memory when writing.
For storing dbpedia, I think it can eat some GB of RAM and then settle
there while it does its job.

I think you might want to try level-hyper, it uses a leveldb variant that
might handle your case better.

@mcollina
Owner

I think this might be what is really happening: Level/levelup#171.
So, trying level-hyper might be the only real solution.

(Reopening, as it's easy to search for discussions on this topic).

@mcollina mcollina reopened this Oct 14, 2013
@mcollina
Owner

The thing is already packaged for usage in LevelGraph-N3: https://github.com/mcollina/levelgraph-n3#importing-n3-files.

@mcollina
Owner

@pietercolpaert @RubenVerborgh any news on this? Have you tried the new stream-based importer?

@mcollina
Owner

hey @pietercolpaert @RubenVerborgh, any news on this?

@pietercolpaert

The memory leak is still there with level-hyper as well I'm afraid.

Test yourself

Download http://downloads.dbpedia.org/3.9/en/instance_types_en.nt.bz2

Load it in a levelgraph db using something like:

var fs = require("fs"),
    n3 = require("n3"),
    levelup = require('level'),
    levelhyper = require('level-hyper'),
    levelgraph = require("levelgraph");
var db = levelgraph(levelhyper("data"));
var readstream = fs.createReadStream("/path/to/instance_types_en.nt", { encoding: 'utf8' });
var count = 0;
var transform = new n3.Transform();
transform.pipe(db.putStream());
readstream.pipe(transform).on("data",function (triple) {
  count++;
  if (count%10000 === 0) {
    var str = count + " triples loaded";
    console.error( str );
  }
}).on("end",function () {
  console.error(count + " triples loaded");
});

Or am I doing something wrong here?

@mcollina
Owner

First, using stream.on("data", function() {}) disables all the backpressure goodness. So, don't do that. If you want to track the data being written, we can add something to the LevelGraph write stream.

Second, it might really be that Level needs more memory to store all that data, and you might need to increase it... a lot. Look at: https://github.com/joyent/node/wiki/FAQ#what-is-the-memory-limit-on-a-node-process.
Usually Level's memory usage explodes during writes, and then it compacts later.
In order to be sure, I'll try with a 3 or 4 GB memory limit.
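
As a small aid when experimenting with a higher limit, the sketch below reports the heap limit V8 is actually running with, using the core `v8` module. The script name in the comment is hypothetical. Note that this limit covers only the V8 heap, not memory that LevelDB allocates natively.

```javascript
const v8 = require('v8');

// Return V8's current heap size limit in megabytes.
function heapLimitMb() {
  return Math.round(v8.getHeapStatistics().heap_size_limit / (1024 * 1024));
}

// Run as, for example: node --max_old_space_size=4096 import.js
// (import.js is a hypothetical file name for the import script).
console.error('heap limit: ' + heapLimitMb() + ' MB');
```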

You are inserting roughly 15.894.068 lines, which we can assume are triples.
That means 16 million * 6 entries on levelgraph: roughly 96 million entries.
Some tweaking might be necessary.
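
The entry count can be worked out directly. The sketch assumes one triple per line and six LevelDB entries per triple, as stated in this comment; `levelEntries` is a name made up for the example.

```javascript
// Number of lines in instance_types_en.nt, as quoted in the thread
// (assumed to be one triple per line).
const triples = 15894068;

// LevelGraph writes six entries per triple, per the comment above.
const ENTRIES_PER_TRIPLE = 6;

function levelEntries(tripleCount) {
  return tripleCount * ENTRIES_PER_TRIPLE;
}

console.log(levelEntries(triples)); // 95364408
```

That is a little over 95 million entries for this one file.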

I'm hoping for some help from @rvagg, @maxogden and others in order to track it down.

Third, there might really be a memory leak. Have a look at https://hacks.mozilla.org/2012/11/tracking-down-memory-leaks-in-node-js-a-node-js-holiday-season/ for help tracking it down.

I'll give your script a go and see if I can do better.

@mcollina
Owner

Contrary to what I thought, the following is not leaking on my Mac:

var fs = require("fs"),
  through2 = require("through2"),
  n3 = require("n3"),
  levelhyper = require('level-hyper'),
  levelgraph = require("levelgraph");

var db = levelgraph(levelhyper("data"));
var readstream = fs.createReadStream("./instance_types_en.nt", { encoding: 'utf8' });
var writestream = db.putStream();
var count = 0;
var transform = new n3.Transform();

readstream.pipe(transform).pipe(through2({ objectMode: true }, function (chunk, enc, callback) {
  count++;
  if (count % 10000 === 0) {
    var str = count + " triples loaded";
    console.error( str );
  }
  this.push(chunk)
  callback()
})).pipe(writestream);

writestream.on("end",function () {
  console.error(count + " triples loaded");
});

It is extremely slow, but it is not leaking.

@mcollina
Owner

Ok. memory problem detected at around 700k triples:

700000 triples loaded
FATAL ERROR: CALL_AND_RETRY_2 Allocation failed - process out of memory

Any help here @rvagg?

@pietercolpaert

Mine gave this when 80MB of the file was stored into levelgraph, while my max memory was 1700MB.

This makes me think that backpressure stuff is not properly working

@mcollina
Owner

Have you tried the version I gave you, which doesn't use the old API?
Using my version I could go well beyond that.

I'm trying on an 8GB machine, giving the process 7GB of memory, and the memory usage explodes.

@mcollina
Owner

I'm confirming that it is a memory leak indeed that happens at around 700k instances and varies with --max_old_space_size. So, it is in node-land and inside the space managed by the GC. Wow -.-.

@mcollina
Owner

Ok, using memwatch I've found that it is leaking Strings:

{
  "before": {
    "nodes": 571650,
    "time": "2013-10-31T15:44:28.000Z",
    "size_bytes": 156916816,
    "size": "149.65 mb"
  },
  "after": {
    "nodes": 2114224,
    "time": "2013-10-31T15:45:25.000Z",
    "size_bytes": 614410984,
    "size": "585.95 mb"
  },
  "change": {
    "size_bytes": 457494168,
    "size": "436.3 mb",
    "freed_nodes": 16793,
    "allocated_nodes": 1559370,
    "details": [
      {
        "what": "Arguments",
        "size_bytes": 0,
        "size": "0 bytes",
        "+": 1,
        "-": 1
      },
      {
        "what": "Array",
        "size_bytes": 28880,
        "size": "28.2 kb",
        "+": 5342,
        "-": 5632
      },
      {
        "what": "Buffer",
        "size_bytes": 0,
        "size": "0 bytes",
        "+": 1,
        "-": 1
      },
      {
        "what": "Closure",
        "size_bytes": -72,
        "size": "-72 bytes",
        "+": 6,
        "-": 7
      },
      {
        "what": "Code",
        "size_bytes": 104192,
        "size": "101.75 kb",
        "+": 247,
        "-": 10
      },
      {
        "what": "Date",
        "size_bytes": 0,
        "size": "0 bytes",
        "+": 2,
        "-": 2
      },
      {
        "what": "Native",
        "size_bytes": 0,
        "size": "0 bytes",
        "+": 1,
        "-": 1
      },
      {
        "what": "Number",
        "size_bytes": 16,
        "size": "16 bytes",
        "+": 4,
        "-": 3
      },
      {
        "what": "Object",
        "size_bytes": -71896,
        "size": "-70.21 kb",
        "+": 9612,
        "-": 11112
      },
      {
        "what": "SlowBuffer",
        "size_bytes": 0,
        "size": "0 bytes",
        "+": 1,
        "-": 1
      },
      {
        "what": "String",
        "size_bytes": 457419864,
        "size": "436.23 mb",
        "+": 1543930,
        "-": 13
      },
      {
        "what": "Timeout",
        "size_bytes": 0,
        "size": "0 bytes",
        "+": 1,
        "-": 1
      },
      {
        "what": "Timer",
        "size_bytes": 0,
        "size": "0 bytes",
        "+": 2,
        "-": 2
      }
    ]
  }
}
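
A diff like this can also be summarised in code rather than read by eye. The sketch below is only an illustration: `largestGrowth` is a hypothetical helper, and `sample` is a trimmed-down copy of the diff above with three of its entries.

```javascript
// Return the entry of diff.change.details with the largest growth in bytes.
function largestGrowth(diff) {
  return diff.change.details.reduce(function (max, entry) {
    return entry.size_bytes > max.size_bytes ? entry : max;
  });
}

// Trimmed-down copy of the diff above, keeping three of its entries.
const sample = {
  change: {
    details: [
      { what: 'Array', size_bytes: 28880, '+': 5342, '-': 5632 },
      { what: 'Object', size_bytes: -71896, '+': 9612, '-': 11112 },
      { what: 'String', size_bytes: 457419864, '+': 1543930, '-': 13 }
    ]
  }
};

const top = largestGrowth(sample);
console.log(top.what, top['+'] - top['-']); // String 1543917
```

About 1.5 million Strings were allocated and almost none freed, which accounts for nearly all of the 436 MB of growth.
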
@mcollina
Owner

As I see it, the backpressure is not happening and stuff ends up in the _writableState.buffer. Not sure why it's happening and where the problem is.
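
On current Node versions the amount queued inside a writable stream can be read through the public `writableLength` property instead of reaching into `_writableState`. The sketch below uses a core `Writable` as a stand-in for the LevelGraph write stream; `createSink` and `floodWithoutBackpressure` are hypothetical names. It shows how the queue grows when the return value of `write()` is ignored.

```javascript
const { Writable } = require('stream');

// Stand-in sink that completes each write on a later tick.
function createSink() {
  return new Writable({
    objectMode: true,
    highWaterMark: 16,
    write(chunk, encoding, callback) {
      setImmediate(callback);
    }
  });
}

// Write `count` items while ignoring the return value of write(),
// then report how many are queued inside the stream.
function floodWithoutBackpressure(sink, count) {
  for (let i = 0; i < count; i++) {
    sink.write({ id: i });
  }
  return sink.writableLength;
}

console.error(floodWithoutBackpressure(createSink(), 1000)); // 1000
```

With a highWaterMark of 16, a producer that honours backpressure would stop after 16 queued items. Here all 1000 end up in memory at once.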

I used https://github.com/bnoordhuis/node-heapdump to get a heap snapshot, and there are tons of unparsed strings.
Try it yourself and see what is going on.

For reference, this is my last run:

var fs = require("fs"),
  through2 = require("through2"),
  n3 = require("n3"),
//  levelhyper = require('level-hyper'),
  memwatch = require('memwatch'),
  levelgraph = require("levelgraph");
var heapdump = require('heapdump');

var throttle = require("throttle");

//var db = levelgraph(levelhyper("data"));
var db = levelgraph("data");
var readstream = fs.createReadStream("./instance_types_en.nt", { encoding: 'utf8' });
var writestream = db.putStream();
var count = 0;
var transform = new n3.Transform();
var diff;

// throttle at 1 MB/s (currently disabled)
readstream
//.pipe(throttle(1024 * 1024))
.pipe(transform)
//.pipe(through2({ highWaterMark: 1, objectMode: true }, function (chunk, enc, callback) {
//  count++;
//  if (count % 10000 === 0) {
//    var str = count + " triples loaded";
//    console.error( str );
//  }
//  this.push(chunk)
//  callback()
//}))
.pipe(writestream);

writestream.on("end",function () {
  console.error(count + " triples loaded");
});

memwatch.on("leak", function(info) {
  console.log(info);
  if (diff) {
    console.log(JSON.stringify(diff.end(), null, '  '))
  }
//heapdump.writeSnapshot('./' + Date.now() + '.heapsnapshot');
//  diff = new memwatch.HeapDiff();
});
@mcollina mcollina referenced this issue in Level/levelup Oct 31, 2013
Closed

Memory leak in db.batch #171

@mcollina
Owner

Ok, this is bad, and it is very probably a levelup/leveldown bug.

It seems that using this helps a lot as it sits directly on leveldown:
https://github.com/brycebaril/level-bufferstreams

However I have no more time to spend on this right now :(.

@pietercolpaert

Let's file the issue in leveldown though?

@mcollina
Owner

It's already filed in levelup and being discussed.
I hoped it was not that one: Level/levelup#171.

I'm having some luck in using levelup 0.12 and leveldown 0.6.

The insert is much slower but it eats much less memory. It is at 1.200.000 triples (which means 7.200.000 level records) and just at 4GB. The 'leak' warning of memwatch is not being triggered that often, so things are fine.

If you want to have a consistent strategy of importing dbpedia into a levelgraph, the best solution right now is to split the file in 10 chunks of 200MB and insert those, one by one. You can use a node cluster or control that via shell.
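
A minimal sketch of the splitting step, assuming the input is N-Triples (one triple per line) and small enough to hold in memory. `splitOnLines` is a hypothetical helper. For the full DBpedia file a line-based tool such as `split -l` is the more practical route.

```javascript
// Split N-Triples text into at most `parts` chunks, cutting only at
// line ends so that no triple is split across two chunks.
function splitOnLines(text, parts) {
  const lines = text.split('\n').filter(function (line) {
    return line.length > 0;
  });
  const perChunk = Math.ceil(lines.length / parts);
  const chunks = [];
  for (let i = 0; i < lines.length; i += perChunk) {
    chunks.push(lines.slice(i, i + perChunk).join('\n') + '\n');
  }
  return chunks;
}
```

Each chunk can then be written to its own file and imported by a separate process, so that memory is released between runs.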

@mcollina
Owner

I saturated the memory even with the levelup 0.12 version. I was able to insert around 5630000 triples, which are 33780000 pairs in leveldb. The final leveldb size is around 1.5GB.
The way Level works is fine: it needs lots of memory, but it is not leaking.

Were you able to load that into any triplestore on a single machine?

I think the best solution in any case is trying to split the input in around 10 files, and trying to insert them one at a time.

@rvagg
rvagg commented Oct 31, 2013

You should also bring in @brycebaril to this discussion as he seems to have a lot of experience working around this problem.
It's certainly an actual problem, but its cause and exact nature are so far unknown. My instinct is to blame a combination of standard LevelDB behaviour for heavy, rapid inserts and V8 GC behaviour, but the fact that we see appreciable differences between versions of LevelUP suggests that this may be wrong. It could be related to the switch to NAN: perhaps there's a real leak there, or perhaps it's related to the way NAN has to deal with Persistent V8 references. I honestly don't know!
Ultimately I probably just need to find some time to investigate this properly, but time is something I'm quite short on right now! @trevnorris may also come in handy given his debugging and perf experience.

@mcollina
Owner
mcollina commented Nov 1, 2013

Thanks Rod!
I fear we have two (or maybe more) overlapping problems: the LevelDB behavior and a NAN leak.

Using LevelUP 0.12 with LevelWriteStream causes no leak whatsoever (traced with the memwatch package), but the memory occupancy still grows a lot. For storing 33 million records that's kind of OK, though it would be better if it didn't happen. This is the LevelDB vs V8 GC behaviour.

Using LevelUP 0.17 with or without LevelWriteStream causes the memory to explode much much earlier.

Should we try Node v0.11?

If @brycebaril or @trevnorris wants to help, they are super-welcome to.

@trevnorris

@mcollina I don't have much time, but would be more than happy to help along the way if you have any questions while debugging. Also, @tjfontaine is rad when it comes to finding mem leaks.

@brycebaril

Hi @mcollina a lot of the memory price you're paying with large Level Read/Write streams is objectMode. You may be able to avoid some of that if you used a modified version of n3 that worked with level-bufferstreams, but I don't know how reasonable/possible that is with the work that n3 does.

So far as I understand it, the ReadStream is giving you two buffers from LevelDOWN's iterator.next() but when LevelUP throws these into an object {key: ..., value: ...} you're ending up copying those buffers into the V8 heap. This is even worse if you're using JSON key/valueEncoding as then the buffers get inflated into even more heap content. This is also a cost if you're using Array Batch for the WriteStream side, where the array being queued into a JavaScript Array will use V8 heap instead of sending them directly to LevelDOWN via chained Batch.
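
For readers unfamiliar with the two batch forms mentioned here, a minimal sketch of both, assuming `db` is a levelup instance and `pairs` is an array of `{ key, value }` objects. The two wrapper function names are made up for the example.

```javascript
// Array form: all operations are first collected in a JavaScript array,
// which lives on the V8 heap until the batch is written.
function putWithArrayBatch(db, pairs, callback) {
  const ops = pairs.map(function (pair) {
    return { type: 'put', key: pair.key, value: pair.value };
  });
  db.batch(ops, callback);
}

// Chained form: each operation is handed to the batch as it is produced.
function putWithChainedBatch(db, pairs, callback) {
  const batch = db.batch();
  pairs.forEach(function (pair) {
    batch.put(pair.key, pair.value);
  });
  batch.write(callback);
}
```

Both write the same data. The difference discussed in this thread is where the pending operations are held before the write.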

level-bufferstreams skips the V8 heap by using multibuffers to avoid objectMode entirely. The downside being that your Transforms will have to work with pure buffers to keep the memory footprint minimal.

That said, there very well could be speed or memory improvements that could be made with LevelDOWN to help here -- particularly the differences between 0.12 and 0.17.

@RubenVerborgh
Contributor

@brycebaril The current stream implementation in N3 is a short and simple layer above a highly efficient internal parser that only uses callbacks. I took the decision after I noticed that an internal implementation using streams was tremendously slow.

Therefore, implementing a level-bufferstreams layer on top of the N3 parser should be trivial: only a new stream layer must be implemented. I guess that 90% of the current N3Transform layer can be reused.

@mcollina
Owner
mcollina commented Nov 2, 2013

@RubenVerborgh I think you can try creating that library and inserting directly into LevelDown. It should not be hard.

Here are the relevant lines for creating the keys and values to store in LevelDown:
https://github.com/mcollina/levelgraph/blob/master/lib/writestream.js#L35-L46
https://github.com/mcollina/levelgraph/blob/master/lib/utilities.js#L36-L60

So you can skip levelgraph entirely for insertion and use it just for reading the data.
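
As a rough sketch of what such a layer would produce: one entry per triple for each of the six orderings of subject, predicate and object. The `::` separator, the lack of escaping and the JSON value are assumptions made for illustration; the authoritative format is in the two files linked above. `indexKeys` and `toPutOperations` are hypothetical names.

```javascript
// The six orderings of subject (s), predicate (p) and object (o).
const INDEXES = {
  spo: ['subject', 'predicate', 'object'],
  sop: ['subject', 'object', 'predicate'],
  pos: ['predicate', 'object', 'subject'],
  pso: ['predicate', 'subject', 'object'],
  ops: ['object', 'predicate', 'subject'],
  osp: ['object', 'subject', 'predicate']
};

// Build one key per index. The '::' separator is an assumption.
function indexKeys(triple) {
  return Object.keys(INDEXES).map(function (name) {
    const parts = INDEXES[name].map(function (field) {
      return triple[field];
    });
    return [name].concat(parts).join('::');
  });
}

// One put operation per index, with the whole triple as the value
// (storing the triple as JSON is also an assumption).
function toPutOperations(triple) {
  const value = JSON.stringify(triple);
  return indexKeys(triple).map(function (key) {
    return { type: 'put', key: key, value: value };
  });
}
```

This is where the factor of six between triples and level records in this thread comes from.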

@RubenVerborgh
Contributor

@mcollina I'm a bit confused here. @brycebaril suggested to create a level-bufferstreams implementation.
What library do you suggest that should be created?

@mcollina
Owner
mcollina commented Nov 2, 2013

I'm saying to skip levelgraph entirely and store the things directly inside LevelDown using level-bufferstreams.
I was actually giving a couple of pointers on how data is stored by LevelGraph in LevelDown, so you can implement that tiny layer.

BTW, I'm trying to understand what is going on with chained batch vs array batch, which seems to be the main issue here.

@RubenVerborgh
Contributor

@mcollina Thanks, got it now 👍

@mcollina
Owner
mcollina commented Nov 3, 2013

I'm trying the WriteStream implementation that uses chainable batch (Level/level-ws#1) to improve this scenario.
In fact, I get much better memory usage (so it seems the culprit is the array batch in levelup/leveldown).

However, I get a segfault when doing so. I could not replicate it easily using LevelWriteStream alone.

Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: 13 at address: 0x0000000000000000
[Switching to process 30524 thread 0x1903]
std::string::size () at basic_string.h:605
605           size() const

Moreover, this happens only on node v0.10.x. Node v0.11.8 runs totally fine, with 300MB of memory occupied after 12 million level pairs inserted (roughly 2 million triples); I stopped it at that point. The memory occupancy is also not growing. So this issue is in the node v0.10 part of LevelDown/NAN.

As I'm not doing anything strange here, it should be something down in LevelDown or core, or something interfering between the two. Maybe one of the keys is getting deallocated by V8 while still being used by LevelDown?
What can I do to help track this down?

You can find the branch here: https://github.com/mcollina/levelgraph/tree/no-leak-writestream.

@mcollina mcollina referenced this issue in Level/leveldown Nov 6, 2013
Merged

Persistent handles not needed for batch #70

@mcollina
Owner

Ok, Level/leveldown#73 solves the issue. I can load the big n3 file with a fixed memory load. Soon on your screens!

@RubenVerborgh
Contributor

Looks promising, thanks! Keep us updated on the segfault.

@mcollina
Owner

You can use that branch with levelgraph as it is and it works fine with no segfault.
The segfault is with the 'new' writestream that uses chained batch, but we do not need it anymore to do a big import.
(However, we need to solve the segfault anyway).

@pietercolpaert

Affirmative! It works like a charm!

@mcollina
Owner

Reopening as leveldown 0.10 is not released, yet. Moreover we will have a new LevelGraph patch release to bump that dependency.

BTW, have you got any performance reports about queries on that dataset?

@mcollina mcollina reopened this Nov 14, 2013
@pietercolpaert

I didn't do any performance tests yet, but for 1M triples, the ingesting is quite fast and queries return within expectations :)

Will keep you informed when I have more time to spend on this

@rvagg
rvagg commented Nov 18, 2013

leveldown@0.10.0 and levelup@0.18.0 are out and will hopefully sort this out

@mcollina mcollina added a commit that closed this issue Nov 18, 2013
@mcollina Bump levelup/leveldown dependencies
Closes #40.
8f1eab9
@mcollina mcollina closed this in 8f1eab9 Nov 18, 2013
@mcollina
Owner

The new leveldown also brings a 25% increase in writing speed, so right now I can write 22.000 triples/second on my 2011 Macbook Air (vs 15.000 with the previous release).

Have fun! 👯

@mcollina
Owner

This is released as version 0.6.11!
