Memleak #40

Closed
pietercolpaert opened this Issue Oct 11, 2013 · 40 comments

@pietercolpaert

To test this library for use in a bigger project, I tried to ingest DBpedia. However, it crashes at ~1% of the triples, saying that the process is out of memory. This only happens when I write to a LevelDB; if I comment out the write, memory usage stays stable.

Some code:

var db = levelgraph(levelup(dbname));
var dbputstream = db.putStream();
var filename = "pathtodbpedia.nt";
fs.createReadStream(filename).on("data", function (data) {
  N3Parser().parse(data, function (error, triple) {
    dbputstream.write(triple);
  });
});

Any ideas on how to limit memory usage?

@mcollina (Collaborator) commented Oct 11, 2013

This is very interesting.
Loading a huge number of triples (how many triples is that 1%?) is something I should dig into.

However, you are doing it wrong: you are not using a streaming interface with proper backpressure, so LevelGraph cannot keep up with the parsing speed. You should try writing a transform step and using pipe, like so:

var db = levelgraph(levelup(dbname));
var through2 = require('through2');
var filename = "pathtodbpedia.nt";
fs.createReadStream(filename).pipe(through2({ objectMode: true }, function (data, enc, callback) {
  var that = this;
  N3Parser().parse(data, function (error, triple) {
    that.push(triple);
  });
  callback(); // fine, as parse should be sync
})).pipe(db.putStream());

What package is N3Parser from?

(I've not tested it, if you can provide a working gist and a link to the dataset, I'll try)

BTW, LevelGraph is slower on writes (15k triples/s) than on reads (30-40k triples/s).

I'm pulling in @maxogden, as he has done huge work on importing bulk data into LevelUp.
Also @RubenVerborgh, for the need of a streaming N3-parsing API :).


@RubenVerborgh (Contributor) commented Oct 11, 2013

@mcollina It's the node-n3 parser, the non-streaming version.
It's interesting how you handle the backpressure; does that work well? parse works synchronously.
This might make the stream implementation of node-n3 (rdfjs/N3.js#6) unnecessary.

On the other hand, @pietercolpaert, maybe the stream implementation can help out here then!


@pietercolpaert commented Oct 13, 2013

(note: @RubenVerborgh is a colleague)
How does your code avoid the backpressure?

I've implemented your method with through2. The memory still increases rapidly though...

Let's wait for rdfjs/N3.js#6?


@mcollina (Collaborator) commented Oct 13, 2013

> (note: @RubenVerborgh https://github.com/rubenverborgh is a colleague)
> How does your code avoid the backpressure?

It is handled by through2, automatically.

> I've implemented your method with through2. The memory still increases rapidly though...

Is it still crashing, or not?
LevelDB uses a HUGE amount of memory when writing.
For storing DBpedia, I think it can eat some GB of RAM and then settle there while it does its job.

I think you might want to try level-hyper; it uses a LevelDB variant that might handle your case better.


@mcollina (Collaborator) commented Oct 14, 2013

I think this might be what is really happening: Level/levelup#171.
So, trying level-hyper might be the only real solution.

(Reopening, so that discussions on this topic are easy to find.)


@mcollina mcollina reopened this Oct 14, 2013

@mcollina (Collaborator) commented Oct 15, 2013

This is already packaged for use in LevelGraph-N3: https://github.com/mcollina/levelgraph-n3#importing-n3-files.


@mcollina (Collaborator) commented Oct 16, 2013

@pietercolpaert @RubenVerborgh any news on this? Have you tried the new stream-based importer?


@mcollina (Collaborator) commented Oct 28, 2013

Hey @pietercolpaert @RubenVerborgh, any news on this?


@pietercolpaert commented Oct 31, 2013

The memory leak is still there with level-hyper as well I'm afraid.

Test yourself

Download http://downloads.dbpedia.org/3.9/en/instance_types_en.nt.bz2

Load it in a levelgraph db using something like:

var fs = require("fs"),
    n3 = require("n3"),
    levelhyper = require('level-hyper'),
    levelgraph = require("levelgraph");
var db = levelgraph(levelhyper("data"));
var readstream = fs.createReadStream("/path/to/instance_types_en.nt", { encoding: 'utf8' });
var count = 0;
var transform = new n3.Transform();
transform.pipe(db.putStream());
readstream.pipe(transform).on("data", function (triple) {
  count++;
  if (count % 10000 === 0) {
    console.error(count + " triples loaded");
  }
}).on("end", function () {
  console.error(count + " triples loaded");
});

Or am I doing something wrong here?


@mcollina (Collaborator) commented Oct 31, 2013

First, using stream.on("data", function () {}) disables all the backpressure goodness, so don't do that. If you want to track the data being written, we can add something to the LevelGraph write stream.

Second, it might really be that Level needs more memory to store all that data, and you might need to increase it... a lot. Look at: https://github.com/joyent/node/wiki/FAQ#what-is-the-memory-limit-on-a-node-process.
Usually Level's memory usage explodes during writes and is compacted later.
To be sure, I'll try with a 3 or 4 GB memory limit.

You are inserting roughly 15,894,068 lines, which we can assume are triples.
That means 6 records per triple in LevelGraph: roughly 95 million entries.
Some tweaking might be necessary.
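For context on the ×6 multiplier: LevelGraph is a hexastore, indexing every triple under all six orderings of (subject, predicate, object) so that any query pattern can be answered with a range scan. A rough sketch of the idea (the key layout here is illustrative, not LevelGraph's actual encoding):

```javascript
// Each triple becomes six LevelDB entries, one per index ordering.
var orders = ['spo', 'sop', 'pso', 'pos', 'osp', 'ops'];

function keysFor(triple) {
  return orders.map(function (order) {
    var parts = order.split('').map(function (c) {
      return { s: triple.subject, p: triple.predicate, o: triple.object }[c];
    });
    return order + '::' + parts.join('::'); // illustrative separator
  });
}

var keys = keysFor({ subject: 'a', predicate: 'b', object: 'c' });
console.log(keys.length); // 6 keys per triple

// So ~15.9M parsed lines means roughly 6 * 15,894,068 LevelDB entries.
var entries = 6 * 15894068;
console.log(entries); // 95364408
```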

I'm hoping in some help from @rvagg, @maxogden and others in order to track it down.

Third, there might really be a memory leak. Have a look at https://hacks.mozilla.org/2012/11/tracking-down-memory-leaks-in-node-js-a-node-js-holiday-season/ for help tracking it down.

I'll give your script a go and see if I can do better.


@mcollina (Collaborator) commented Oct 31, 2013

Contrary to what I thought, the following is not leaking on my Mac:

var fs = require("fs"),
  through2 = require("through2"),
  n3 = require("n3"),
  levelhyper = require('level-hyper'),
  levelgraph = require("levelgraph");

var db = levelgraph(levelhyper("data"));
var readstream = fs.createReadStream("./instance_types_en.nt", { encoding: 'utf8' });
var writestream = db.putStream();
var count = 0;
var transform = new n3.Transform();

readstream.pipe(transform).pipe(through2({ objectMode: true }, function (chunk, enc, callback) {
  count++;
  if (count % 10000 === 0) {
    console.error(count + " triples loaded");
  }
  this.push(chunk);
  callback();
})).pipe(writestream);

writestream.on("end", function () {
  console.error(count + " triples loaded");
});

It is extremely slow, but it is not leaking.


@mcollina (Collaborator) commented Oct 31, 2013

OK, memory problem detected at around 700k triples:

700000 triples loaded
FATAL ERROR: CALL_AND_RETRY_2 Allocation failed - process out of memory

Any help here @rvagg?


@pietercolpaert commented Oct 31, 2013

Mine crashed when 80MB of the file had been stored into levelgraph, while my max memory was 1700MB.

This makes me think the backpressure isn't working properly.


@mcollina (Collaborator) commented Oct 31, 2013

Have you tried the version I passed you that doesn't use the old API?
Using my version I could go well beyond that.

I'm trying on an 8GB machine, giving Node 7GB of memory, and the memory usage explodes.


@mcollina (Collaborator) commented Oct 31, 2013

I can confirm it is indeed a memory leak: it happens at around 700k instances and varies with --max_old_space_size. So it is in node-land, inside the space managed by the GC. Wow -.-.


@mcollina (Collaborator) commented Oct 31, 2013

Ok, using memleak I've found that it is leaking Strings:

{
  "before": {
    "nodes": 571650,
    "time": "2013-10-31T15:44:28.000Z",
    "size_bytes": 156916816,
    "size": "149.65 mb"
  },
  "after": {
    "nodes": 2114224,
    "time": "2013-10-31T15:45:25.000Z",
    "size_bytes": 614410984,
    "size": "585.95 mb"
  },
  "change": {
    "size_bytes": 457494168,
    "size": "436.3 mb",
    "freed_nodes": 16793,
    "allocated_nodes": 1559370,
    "details": [
      {
        "what": "Arguments",
        "size_bytes": 0,
        "size": "0 bytes",
        "+": 1,
        "-": 1
      },
      {
        "what": "Array",
        "size_bytes": 28880,
        "size": "28.2 kb",
        "+": 5342,
        "-": 5632
      },
      {
        "what": "Buffer",
        "size_bytes": 0,
        "size": "0 bytes",
        "+": 1,
        "-": 1
      },
      {
        "what": "Closure",
        "size_bytes": -72,
        "size": "-72 bytes",
        "+": 6,
        "-": 7
      },
      {
        "what": "Code",
        "size_bytes": 104192,
        "size": "101.75 kb",
        "+": 247,
        "-": 10
      },
      {
        "what": "Date",
        "size_bytes": 0,
        "size": "0 bytes",
        "+": 2,
        "-": 2
      },
      {
        "what": "Native",
        "size_bytes": 0,
        "size": "0 bytes",
        "+": 1,
        "-": 1
      },
      {
        "what": "Number",
        "size_bytes": 16,
        "size": "16 bytes",
        "+": 4,
        "-": 3
      },
      {
        "what": "Object",
        "size_bytes": -71896,
        "size": "-70.21 kb",
        "+": 9612,
        "-": 11112
      },
      {
        "what": "SlowBuffer",
        "size_bytes": 0,
        "size": "0 bytes",
        "+": 1,
        "-": 1
      },
      {
        "what": "String",
        "size_bytes": 457419864,
        "size": "436.23 mb",
        "+": 1543930,
        "-": 13
      },
      {
        "what": "Timeout",
        "size_bytes": 0,
        "size": "0 bytes",
        "+": 1,
        "-": 1
      },
      {
        "what": "Timer",
        "size_bytes": 0,
        "size": "0 bytes",
        "+": 2,
        "-": 2
      }
    ]
  }
}
@mcollina (Collaborator) commented Oct 31, 2013

As I see it, the backpressure is not happening and stuff ends up in _writableState.buffer. I'm not sure why it happens or where the problem is.

I used https://github.com/bnoordhuis/node-heapdump to get a heap snapshot; there are tons of unparsed strings.
Try it yourself and see what is going on.

For reference, this is my last run:

var fs = require("fs"),
  through2 = require("through2"),
  n3 = require("n3"),
//  levelhyper = require('level-hyper'),
  memwatch = require('memwatch'),
  levelgraph = require("levelgraph");
var heapdump = require('heapdump');

var throttle = require("throttle");

//var db = levelgraph(levelhyper("data"));
var db = levelgraph("data");
var readstream = fs.createReadStream("./instance_types_en.nt", { encoding: 'utf8' });
var writestream = db.putStream();
var count = 0;
var transform = new n3.Transform();
var diff;

// throttle at 10MBs
readstream
//.pipe(throttle(1024 * 1024))
.pipe(transform)
//.pipe(through2({ highWaterMark: 1, objectMode: true }, function (chunk, enc, callback) {
//  count++;
//  if (count % 10000 === 0) {
//    var str = count + " triples loaded";
//    console.error( str );
//  }
//  this.push(chunk)
//  callback()
//}))
.pipe(writestream);

writestream.on("end",function () {
  console.error(count + " triples loaded");
});

memwatch.on("leak", function(info) {
  console.log(info);
  if (diff) {
    console.log(JSON.stringify(diff.end(), null, '  '))
  }
//heapdump.writeSnapshot('./' + Date.now() + '.heapsnapshot');
//  diff = new memwatch.HeapDiff();
});
@mcollina (Collaborator) commented Oct 31, 2013

Ok, this is bad, and it is very probably a levelup/leveldown bug.

It seems that using this helps a lot, as it sits directly on leveldown:
https://github.com/brycebaril/level-bufferstreams

However, I have no more time to spend on this right now :(.


@pietercolpaert commented Oct 31, 2013

Shall we file the issue in leveldown, then?


@mcollina (Collaborator) commented Oct 31, 2013

It's already filed in levelup and being discussed.
I hoped it was not that one: Level/levelup#171.

I'm having some luck using levelup 0.12 and leveldown 0.6.

The insert is much slower, but it uses far less memory. It is at 1,200,000 triples (which means 7,200,000 level records) and just at 4GB. The 'leak' warning of memleak is not being triggered that often, so things are fine.

If you want a consistent strategy for importing DBpedia into a LevelGraph, the best solution right now is to split the file into 10 chunks of 200MB and insert them one by one. You can use a node cluster or control that via the shell.


@mcollina (Collaborator) commented Oct 31, 2013

I saturated even the version with levelup 0.12. I was able to insert around 5,630,000 triples, which is 33,780,000 pairs in LevelDB. The final LevelDB size is around 1.5GB.
The way Level works is fine: it needs lots of memory, but it is not leaking.

Were you able to load that into any triplestore on a single machine?

I think the best solution in any case is to split the input into around 10 files and insert them one at a time.


@rvagg commented Oct 31, 2013

You should also bring @brycebaril into this discussion, as he seems to have a lot of experience working around this problem.

It's certainly an actual problem, but its cause and exact nature are so far unknown. My instinct is to blame a combination of standard LevelDB behaviour for heavy & rapid inserts and V8 GC behaviour, but the fact that we see appreciable differences between versions of LevelUP suggests this may be wrong. It could possibly be related to the switch to NAN; perhaps there's a real leak there, but perhaps it's also related to the way NAN has to deal with Persistent V8 references. I honestly don't know!

Ultimately I probably just need to find some time to investigate this properly, but time is something I'm quite short on right now! @trevnorris may also come in handy given his debugging and perf experience.


@mcollina (Collaborator) commented Nov 1, 2013

Thanks Rod!
I fear we have two (or maybe more) overlapping problems: the LevelDB behavior and a NAN leak.

Using LevelUP 0.12 with LevelWriteStream causes no leak whatsoever (traced with the memleak package), but the memory occupancy still grows a lot. For storing 33 million records that's kind of OK, but it would be better if it were not there. This is the LevelDB vs V8 GC issue.

Using LevelUP 0.17, with or without LevelWriteStream, causes the memory to explode much, much earlier.

Should we try Node v0.11?

If @brycebaril or @trevnorris want to help, they are super-welcome to.


@trevnorris commented Nov 1, 2013

@mcollina I don't have much time, but would be more than happy to help along the way if you have any questions while debugging. Also, @tjfontaine is rad when it comes to finding mem leaks.


@brycebaril commented Nov 1, 2013

Hi @mcollina a lot of the memory price you're paying with large Level Read/Write streams is objectMode. You may be able to avoid some of that if you used a modified version of n3 that worked with level-bufferstreams, but I don't know how reasonable/possible that is with the work that n3 does.

So far as I understand it, the ReadStream is giving you two buffers from LevelDOWN's iterator.next() but when LevelUP throws these into an object {key: ..., value: ...} you're ending up copying those buffers into the V8 heap. This is even worse if you're using JSON key/valueEncoding as then the buffers get inflated into even more heap content. This is also a cost if you're using Array Batch for the WriteStream side, where the array being queued into a JavaScript Array will use V8 heap instead of sending them directly to LevelDOWN via chained Batch.

level-bufferstreams skips the V8 heap by using multibuffers to avoid objectMode entirely. The downside being that your Transforms will have to work with pure buffers to keep the memory footprint minimal.

That said, there very well could be speed or memory improvements that could be made with LevelDOWN to help here -- particularly the differences between 0.12 and 0.17.


Contributor

RubenVerborgh commented Nov 2, 2013

@brycebaril The current stream implementation in N3 is a short and simple layer above a highly efficient internal parser that only uses callbacks. I took the decision after I noticed that an internal implementation using streams was tremendously slow.

Therefore, implementing a level-bufferstreams layer on top of the N3 parser should be trivial: only a new stream layer must be implemented. I guess that 90% of the current N3Transform layer can be reused.


Collaborator

mcollina commented Nov 2, 2013

@RubenVerborgh I think you can try creating that library and inserting directly into LevelDown. It should not be hard.

Here are the relevant lines for creating the keys and values to store in LevelDown:
https://github.com/mcollina/levelgraph/blob/master/lib/writestream.js#L35-L46
https://github.com/mcollina/levelgraph/blob/master/lib/utilities.js#L36-L60

So you can skip levelgraph entirely for insertion and use it just for reading the data.

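The idea behind those linked lines can be sketched as follows. The six orderings and the `::` separator here are assumptions for illustration based on the links above, not a verbatim copy of levelgraph's code:

```javascript
// Sketch of the indexing idea behind the linked writestream.js/utilities.js:
// each triple is written once per ordering, so any bound/unbound query
// pattern maps to a contiguous key range. Orderings and '::' separator
// are assumptions, not levelgraph's exact key format.
const defs = ['spo', 'sop', 'pos', 'pso', 'ops', 'osp'];
const names = { s: 'subject', p: 'predicate', o: 'object' };

function genKeys(triple) {
  return defs.map(function (def) {
    const parts = def.split('').map(function (c) { return triple[names[c]]; });
    return def + '::' + parts.join('::');
  });
}

const keys = genKeys({ subject: 'a', predicate: 'b', object: 'c' });
console.log(keys.join('\n'));
```

Each generated key would then be written straight to LevelDown as one put operation per ordering.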

Contributor

RubenVerborgh commented Nov 2, 2013

@mcollina I'm a bit confused here. @brycebaril suggested to create a level-bufferstreams implementation.
What library do you suggest that should be created?


Collaborator

mcollina commented Nov 2, 2013

I'm saying to skip levelgraph entirely and store things directly inside LevelDown using level-bufferstreams.
I was actually giving a couple of pointers on how data is stored by LevelGraph in LevelDown, so you can implement that tiny layer.

BTW, I'm trying to understand what is going on with chained batch vs array batch, which seems to be the main issue here.

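The chained-batch vs array-batch difference under investigation can be sketched against a toy in-memory store. This is a mock for illustration; in real leveldown the chained batch accumulates operations in C++ memory rather than on the V8 heap:

```javascript
// Toy illustration of the two batch shapes discussed above, against an
// in-memory mock. With an array batch, every operation is materialised as
// a JS object on the V8 heap before the single call; with a chained batch,
// each op can be handed to the backend as it arrives.
function mockDown() {
  const data = new Map();
  return {
    data,
    // array form: db.batch([{ type: 'put', key, value }, ...], cb)
    batch(ops) { for (const op of ops) data.set(op.key, op.value); },
    // chained form: db.batch().put(k, v).put(k, v).write(cb)
    chainedBatch() {
      return {
        put(key, value) { data.set(key, value); return this; },
        write(cb) { if (cb) cb(null); }
      };
    }
  };
}

const db = mockDown();
db.batch([{ type: 'put', key: 'spo::a::b::c', value: '{}' }]);
db.chainedBatch().put('pos::b::c::a', '{}').write();
console.log(db.data.size);
```

The end state is the same either way; what differs is where the pending operations live while the batch is being assembled.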

Contributor

RubenVerborgh commented Nov 2, 2013

@mcollina Thanks, got it now 👍


Collaborator

mcollina commented Nov 3, 2013

I'm trying the WriteStream implementation that uses chained batch (Level/level-ws#1) to improve this scenario.
In fact, I get much better memory usage (so it seems the culprit is the array batch in levelup/leveldown).

However, I get a segfault in doing so, and I could not replicate it easily using LevelWriteStream alone.

Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: 13 at address: 0x0000000000000000
[Switching to process 30524 thread 0x1903]
std::string::size () at basic_string.h:605
605           size() const

Moreover, this happens only on node v0.10.x. Node v0.11.8 runs totally fine with 300MB of memory occupied after 12 million level pairs inserted (roughly 2 million triples): I stopped it at that point, and memory occupancy was not growing. So, this issue is in the node v0.10 code path of LevelDown/NaN.

As I'm not doing anything strange here, it should be something down in LevelDown or core, or something interfering between the two. Maybe one of the keys is getting deallocated by V8 while still being used by LevelDown?
What can I do to help track this down?

You can find the branch here: https://github.com/mcollina/levelgraph/tree/no-leak-writestream.

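One simple way to observe the kind of growth (or stability) reported here is to sample process.memoryUsage() while the import runs. A stdlib-only sketch, where the allocation loop merely simulates import pressure:

```javascript
// Stdlib-only sketch for watching memory during a bulk import, in the
// spirit of the numbers reported above. No level dependency; the loop
// below just simulates the per-triple heap allocations of an import.
function sampleMemoryMB() {
  const { rss, heapUsed } = process.memoryUsage();
  return { rssMB: rss / 1048576, heapUsedMB: heapUsed / 1048576 };
}

const before = sampleMemoryMB();
const retained = [];
for (let i = 0; i < 100000; i++) {
  retained.push({ key: 'spo::s' + i + '::p::o' + i, value: '{}' });
}
const after = sampleMemoryMB();
console.log(before.heapUsedMB.toFixed(1) + 'MB -> ' + after.heapUsedMB.toFixed(1) + 'MB');
```

Sampling like this at intervals during a real import makes it easy to tell a fixed memory load from a leak.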

Collaborator

mcollina commented Nov 12, 2013

Ok, Level/leveldown#73 solves the issue. I can load the big n3 file with a fixed memory load. Soon on your screens!


Contributor

RubenVerborgh commented Nov 12, 2013

Looks promising, thanks! Keep us updated on the segfault.


Collaborator

mcollina commented Nov 12, 2013

You can use that branch with levelgraph as-is and it works fine with no segfault.
The segfault is with the 'new' writestream that uses chained batch, but we no longer need it to do a big import.
(However, we need to solve the segfault anyway.)


pietercolpaert commented Nov 14, 2013

Affirmative! It works like a charm!


Collaborator

mcollina commented Nov 14, 2013

Reopening, as leveldown 0.10 is not released yet. Moreover, we will have a new LevelGraph patch release to bump that dependency.

BTW, have you got any performance reports about queries on that dataset?


mcollina reopened this Nov 14, 2013

pietercolpaert commented Nov 14, 2013

I didn't do any performance tests yet, but for 1M triples, ingestion is quite fast and queries return within expectations :)

Will keep you informed when I have more time to spend on this.


rvagg commented Nov 18, 2013

leveldown@0.10.0 and levelup@0.18.0 are out and will hopefully sort this out


mcollina closed this in 8f1eab9 Nov 18, 2013

Collaborator

mcollina commented Nov 18, 2013

The new leveldown also brings a 25% increase in write speed: right now I can write 22,000 triples/second on my 2011 MacBook Air (vs 15,000 with the previous release).

Have fun! 👯

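A triples/second figure like the one quoted can be produced with a minimal timing harness. putTriple here is a hypothetical stand-in for whatever write path is being measured, not levelgraph's API; a real benchmark would also wait for writes to flush to disk:

```javascript
// Minimal timing harness for a triples/second figure. putTriple is a
// hypothetical stand-in for the write path under test, not levelgraph's
// API; the in-memory sink below only exercises the harness itself.
function measureThroughput(putTriple, n) {
  const start = process.hrtime.bigint();
  for (let i = 0; i < n; i++) {
    putTriple({ subject: 's' + i, predicate: 'p', object: 'o' + i });
  }
  const elapsedSec = Number(process.hrtime.bigint() - start) / 1e9;
  return n / elapsedSec; // triples per second
}

const sink = [];
const rate = measureThroughput(function (t) { sink.push(t); }, 10000);
console.log(Math.round(rate) + ' triples/s (in-memory sink)');
```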

Collaborator

mcollina commented Nov 18, 2013

This is released as version 0.6.11!

