Memory leaks with oboe.drop #68

Closed
goloroden opened this issue Apr 22, 2015 · 50 comments · Fixed by #165

Comments

@goloroden

I have seen that there was already an issue about memory leaks (#45), and that it should be resolved by returning oboe.drop as documented here.

Unfortunately, 2.1.1 still seems to have memory issues (or I am completely misunderstanding how to use oboe.drop correctly).

My setup is as follows: I have a server based on Express that delivers an endless JSON array, and a client that uses Oboe to stream this data. I measure the memory consumption of the client, and within a few hours it uses hundreds of megabytes; GC obviously does not clean up as expected.

The server looks like this:

var http = require('http');

var express = require('express');

var app = express();

app.get('/', function (req, res) {
  var i = 0;

  var sendEventToClient = function (event) {
    res.write(JSON.stringify(event) + ',');
  };

  req.setTimeout(0);
  res.setTimeout(0);

  res.writeHead(200, {
    'content-type': 'application/json'
  });
  res.write('[');

  setInterval(function () {
    sendEventToClient({
      foo: 'bar',
      bar: 'baz',
      nufta: 23,
      counter: i++
    });
  }, 10);
});

http.createServer(app).listen(3000);

The client looks like this:

var url = require('url');

var oboe = require('oboe'),
    Stethoskop = require('stethoskop');

var stethoskop = new Stethoskop({
  from: {
    application: 'client.js'
  },
  to: {
    host: 'localhost',
    port: 8125
  },
  enabled: true
});

oboe({
  url: url.format({
    protocol: 'http',
    hostname: 'localhost',
    port: 3000,
    pathname: '/'
  }),
  cached: false
}).on('node:!.*', function (event) {
  console.log(event);
  return oboe.drop;
}).on('fail', function (err) {
  console.log('Error!', err.thrown);
});

So, the server sends a new object every 10 ms, and the client should do nothing with it but drop it. I am using stethoskop to measure the client's fitness: It sends the CPU and memory data to a StatsD server.

I have run these two processes for 4.5 hours, and the memory consumption looks like this:

[Graphite graph: the client's memory consumption climbs steadily over the 4.5 hours]

I also tried running it returning null instead of oboe.drop, with the same result (the left part is the same test as above, the right part is with null; both options show the very same behavior. The drop to 0 in the middle is not caused by GC, it's because I stopped and restarted the processes):

[Graphite graph: oboe.drop (left) and null (right) show the same memory growth]

So, to cut a long story short, basically I have two questions:

  • Am I doing something wrong in the client? If so, what do I need to correct? Any hints?
  • If not, this seems to be a bug in Oboe. Can I do something to help fix it? Pointing out where to look might already be quite helpful :-)

Moreover, the documentation states:

Dropping from an array will result in an array with holes in it. This is intentional so that the array indices from the original JSON are preserved:

My guess is that dropping nodes only works (with respect to memory consumption) for objects, but not for arrays. Since I am using an array as the outer container here, and since sparse arrays seem to use more memory than dense ones, this might be the cause of the problem. Please note that I'm not too sure about the memory behavior of sparse arrays, so any answer in this direction will be appreciated.
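
To make the documented "array with holes" behavior concrete, here is a minimal Node.js sketch, independent of Oboe, showing that deleting array entries leaves holes while length and index space are preserved (the array size is arbitrary and heapUsed is only a rough way to compare against a dense array):

var sparse = [];
for (var i = 0; i < 1000000; i++) {
  sparse[i] = { counter: i };
  delete sparse[i]; // the value is gone, but the index stays reserved as a hole
}

console.log(sparse.length);                  // 1000000, indices are preserved
console.log(0 in sparse);                    // false, the slot is a hole
console.log(process.memoryUsage().heapUsed); // compare against a dense array of the same length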

@badisa

badisa commented Jul 14, 2015

+1

When I do a regular HTTP request and take a heap snapshot on one of my pages in Chrome I get 20 MB, but when I run oboe the snapshot jumps up to 100 MB. And that is with a very small JSON object; with a large one it goes up to 1000 MB.

@Amberlamps

+1

Also, awesome issue reporting.

@goloroden
Author

@Amberlamps Thanks :-)

@goloroden
Author

BTW… for everyone who has the same problems that we had: We used Oboe.js to transfer data using the JSON Lines protocol.

For this very specific use case we have written a replacement that does not suffer from the memory issue. If you have the same use case, you might be interested in our modules json-lines and json-lines-client.
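
For context, JSON Lines just means one complete JSON document per line, so the client can parse line by line and keep nothing around. Here is a minimal sketch of that idea with plain Node.js readline, independent of the json-lines/json-lines-client modules, assuming a server at http://localhost:3000/ that emits one JSON object per line:

var http = require('http');
var readline = require('readline');

http.get('http://localhost:3000/', function (res) {
  var rl = readline.createInterface({ input: res });

  rl.on('line', function (line) {
    if (!line) { return; }        // skip empty lines
    var event = JSON.parse(line); // each line is a self-contained JSON document
    console.log(event);
    // no reference to event is kept, so it can be garbage collected
  });
});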

@badisa

badisa commented Aug 5, 2015

So @ar7em figured this out for me: when I pushed the node data to an array, the node itself wasn't being garbage collected, even with return oboe.drop.

However by doing the following the memory leak disappeared.

.node('{scores info}', function (node) {
  node = JSON.stringify(node);
  node = JSON.parse(node);
  resultsData.push(node);
  return oboe.drop;
})

Not entirely sure why this changes anything, but it reduces memory use by up to 300 MB.

@jimhigson Is there any plan to fix this?

@JuanCaicedo
Collaborator

@badisa Here's my thought: I think it might be because your array stores a reference to the node, which keeps the node from being garbage collected. Doing node = JSON.parse(node) means that node now references a new object created from the original node, which allows the original one to be garbage collected. Not entirely sure why that would be less memory intensive though.

@egervari

I am experiencing the same issue. I have a 400 MB JSON file. It contains an array of arrays, and each sub-array contains anywhere from 1 to 500 objects. There are probably 12,000-13,000 objects in total if you hypothetically flattened the arrays.

Depending on how I have my server chunk it, I can read in around 4,300 of those 13,000 objects before I get an "Aw, Snap" message in Chrome, and it does this because of the same problem. I am using the drop return value, just as above, but the memory is not getting garbage collected.

This is a very serious bug. Is it fixed?

@JuanCaicedo
Collaborator

@egervari Have you tried the JSON stringify/parse trick that @badisa used? It would be interesting to see if that works.

@egervari

I will try it now and report back. However, like you, I don't actually see how this would fix the problem. It also feels hacky. I'm kind of concerned about it and tempted to create a very sub-standard, low-level parser that simply nulls and deletes the values manually, just to see what it actually does and whether it's different from what oboe is doing. If that also doesn't work, or if that's what oboe is doing, then maybe there's a very bad bug in V8 itself.

@egervari

Okay, I tried the above solution and it still didn't solve it :( So, so sad.

@egervari

Does the format of the JSON file have anything to do with things being released? For example, I am sending JSON such as:

[
   [{..},{..},{..}],
   ...
   [{..},{..},{..}]
]

Should I instead send it as:

{
   "partition1:": [{..},{..},{..}],
   "partition2:": [{..},{..},{..}],
   ....
   "partitionN:": [{..},{..},{..}]
}

?

@badisa

badisa commented Aug 28, 2015

@egervari What exactly are you doing with the nodes? I had no memory leak until I was pushing the nodes into a scoped Angular array.

@egervari

Each node (in my case) is an array, since I am trying to process an array of arrays. For the sake of clarity, let's call these partitions. What I'd like to do is put all of the objects in each partition into PouchDB. However, even if I ignore PouchDB altogether and simply do a console.log(partition[0].whatever), it'll crash after 6000+ objects are processed. That's just a little more than 1/3 of the objects to process.

At first it processes the nodes really fast, then it slows down and keeps getting slower until the "Aw, Snap" message shows up. Essentially, my oboe code is doing nothing:

oboe(url, {
  cached: false // tried it with and without this option
}).node('![*]', function (documents) {
  console.log(documents[0].documentType);

  return oboe.drop;
}).done(function (leftOver) {
  console.log(leftOver);
});

@badisa

badisa commented Sep 1, 2015

@egervari Reading more closely, it sounds like you are managing to read about 250 MB into your browser before it quits. That is quite a bit of memory to use, so there is a possibility that it is dying because of that and not because of a memory leak. Have you done what the first poster did, with the heap snapshot?

Also did you try:

.node('![*]', function (documents) {
  documents = JSON.stringify(documents);
  documents = JSON.parse(documents);
  console.log(documents[0].documentType);

  return oboe.drop;
})

@egervari

egervari commented Sep 1, 2015

Yes, I tried exactly that :) It did not work.

I would say though that I am not getting 250 MB on each pass... each 'documents' variable probably has 13 MB worth of data on average, although I've tried smaller chunks too.

But here's the thing: I have tried streaming and parsing one document at a time too, and it still bombs; it can just process more documents before it bombs (perhaps 3000 more, but there are still so many left that it didn't get to).

I didn't get a graph of the heap, although I saw the Buffer percentage in Chrome slowly go up to 100% before it crashed.

@egervari

egervari commented Sep 1, 2015

Okay, I saw the heap graph and it was exactly the same as the graph you can see in Chrome.

@magic890

Any news? @egervari have you solved your issue? If yes, how?

@JuanCaicedo
Collaborator

In my mind, this is a pretty important issue because the Oboe website claims that it can handle JSON that is bigger than the available memory. This is an awesome claim, and I think it'll be totally true once this bug is handled.

I'll take a look and see what I can figure out!

@egervari

@magic890 No, I never solved it, and I gave up on it. I implemented my own solution from scratch (it was just easier for me) and got it to work that way.

@JuanCaicedo
Collaborator

@egervari Do you have it up on GitHub, or would you be willing to put it there? I'd love to compare what you have and what Oboe does to try to figure out where this memory leak is.

@egervari

@JuanCaicedo Mine is not a framework or anything like that; it is just something small I put directly into my project, not an all-encompassing solution. It's also not a personal project, so I'm reluctant to share it. Honestly, I just did the simplest possible thing: I had the server send the JSON in chunks, converted each chunk to a real JSON object when it got to the client, sent those objects to PouchDB, and then removed them from memory with null. It works for data up to 1.8 GB on Chrome, Firefox and Safari.

A good tip is not to deal with 1000+ non-trivial objects at the same time. That will kill it on small devices. Chunk up the data and stream it and you will be fine. You don't need a framework/library.
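
A hypothetical sketch of that chunking approach (not egervari's actual code; the /chunks/<n> endpoint is made up): only one chunk's worth of objects is referenced at any time, and the reference is released before the next chunk is fetched.

function processChunk(index) {
  // fetch one chunk, store its documents, then move on to the next chunk
  fetch('/chunks/' + index)
    .then(function (response) { return response.json(); })
    .then(function (documents) {
      if (!documents || !documents.length) { return; } // no more chunks, we are done
      // hand documents to storage here (e.g. PouchDB), then release the reference
      documents = null;
      processChunk(index + 1);
    });
}

processChunk(0);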

@JuanCaicedo
Collaborator

@egervari I'm trying to get some data to try to recreate your issue and some of the other ones on here. Do you have any tips on acquiring something that size? And then how to reformat it once I have it? Or were you producing your own data?

@egervari

I am exporting large JSON documents intended to be put into PouchDB from a large MS SQL database. Sometimes the objects are quite large, and having any more than 1000 of them in memory causes heap errors/crashes. I would say that some of the largest documents have a dozen properties with 5 or 6 levels of collections. I can't give this JSON data out though; the data itself is valuable and obviously needs to be protected from non-clients. In the largest cases, I am sending around 1.8 GB of JSON.


@JuanCaicedo
Collaborator

@egervari totally understand that. I'm going to try with https://github.com/zeMirco/sf-city-lots-json. From what I can tell, it's a JSON document with a property features which is a really big array. Think that might be similar enough to your situation?

@egervari

My situation is about 12x worse than that, haha. In my case, a lot of the text properties contain HTML content with a lot of text. Each object/document might have at least 6 or 8 of those. And there are a lot of arrays containing objects that contain more arrays that contain more objects, etc. It is a very large object graph.


@JuanCaicedo
Collaborator

I'm working on a repo (https://github.com/JuanCaicedo/oboe-memory-bug.git) to reproduce these errors. Right now, as a sanity check, I've been able to establish that oboe.drop does work in at least some scenarios.

I'll have to play around with either the data or the front-end code to try to reproduce it. GitHub doesn't let you upload files larger than 100 MB, so I'll have to find an alternate way of hosting them if it comes down to needing bigger data. Any ideas welcome!

@egervari

In my case, I have a Java application using Spring running on Tomcat that is exposing the JSON as a REST-based API.


@JuanCaicedo
Collaborator

Hmm, setting up Spring and Tomcat is a whole extra level of complexity (I can probably get it though, I come from a Java background), so I think I'll try to repro @goloroden's case first.

@lukeasrodgers

For what it's worth, I have some code using oboe.js to parse about 750 MB of JSON with Node.js. Without return oboe.drop; memory usage steadily climbs until it crashes with

FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - process out of memory
Abort trap: 6

When I add return oboe.drop; everything works fine, and the process runs to completion.
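
For reference, a minimal sketch of that kind of Node.js setup, assuming a local file and a catch-all pattern (the file name and pattern are placeholders, not lukeasrodgers' actual code):

var fs = require('fs');
var oboe = require('oboe');

oboe(fs.createReadStream('./big.json'))
  .node('!.*', function (node) {
    // work with the node here, but keep no reference to it
    return oboe.drop;
  })
  .done(function () {
    console.log('finished parsing');
  });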

@JuanCaicedo
Collaborator

@lukeasrodgers Awesome, I got the same results. I suspect there might be something else going on that's causing the problem.

@Amberlamps @badisa did the two of you also have a problem with this? If so, could you share any other information that could help pin down what's happening? (e.g. what type of server you're running, how you're sending the data, and if possible what your oboe client-side code looks like). Thanks!

@JuanCaicedo
Collaborator

@egervari Do you still have a version of your oboe code you could try something on? If so, on this line of code

}).node('![*]', function(documents) {

Could you change ![*] to !.[*] and see if that fixes your problem?

@JuanCaicedo
Collaborator

@goloroden I have a suspicion your error might be because of the event notation you're using on the client side.

If you check out that test repo I made, there's a branch named goloroden where I change

.node('!.features[*]', function(feature) {

to

.on('node:!.features[*]',function(feature) {

Doing that causes Chrome to run out of memory and display an error message.

By the way, once you start the server from that repo, be sure to go to http://localhost:3000/home?drop=true, which causes the client side to use oboe.drop.

@JuanCaicedo
Collaborator

I can't find a way to recreate this, so I'm going to close the issue. I'll wait until Feb 28 in case anyone in the thread can help me reproduce the bug 😃

@goloroden @badisa @Amberlamps @egervari @magic890 @lukeasrodgers

@pavelerofeev

@JuanCaicedo For me the issue reproduces on slow connections; you can use a recent Chrome to set network throttling to 4 Mb/s. Currently I work around it with the stringify/parse trick mentioned above.

@goloroden
Author

Sorry for the late answer, I'm currently investigating a few ideas and will report back… thanks so far for your help :-)

@goloroden
Author

Yay, we have a result :-)))

When you run the old demo code as shown in the original post, the memory leak is still there:

[Graphite graph: memory still climbing with the original .on('node:!.*') code]

When you change the line

}).on('node:!.*', function (event) {

to

}).node('!.*', function (event) {

the memory leak is gone:

[Graphite graph: memory consumption stays flat with .node('!.*')]

This is really good news, as it not only means that there is a workaround, but also that oboe.drop actually works, though only if you use the node function directly.

So the essential question is: What is (and why is there) the difference between on and node?
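
For reference, here is the client from the original post with the workaround applied; only the event registration changes, everything else stays the same:

oboe({
  url: url.format({
    protocol: 'http',
    hostname: 'localhost',
    port: 3000,
    pathname: '/'
  }),
  cached: false
}).node('!.*', function (event) {
  console.log(event);
  return oboe.drop;
}).on('fail', function (err) {
  console.log('Error!', err.thrown);
});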

@JuanCaicedo
Collaborator

Oh wow, that's really interesting and definitely a bug!

Would it be possible for you to share the code you used to profile this? Ideally, if I could clone a repo and reproduce the same results as you, I could look into this 😃

@goloroden
Author

Oh, it's just the code from the original post of this issue.

What I did is the following:

  • Run docker run -d --name wolkenkit-profiling -p 80:80 -p 2003:2003 -p 8125:8125/udp hopsoft/graphite-statsd to get a statsd/Graphite container up and running.
  • Save the server file as server.js and the client file as client.js.
  • Run node server.js.
  • Run node client.js.
  • Open a browser and point it to http://192.168.9.130 (or whatever the IP of your Docker host is) to get to Graphite.
  • Switch to the dashboard view and add the graphs from stats.gauges.client.js.Schneehase.local.memory.* to it (and merge them to a single view if you want).
  • Wait for a few hours… ;-)
  • After that, in the client, change the questionable line of code, and run it again.

@badisa

badisa commented Mar 11, 2016

I am experiencing the memory leak with

.node('{scores info}', function (node) {
  resultsData.addData(node);
  $scope.$evalAsync();
  return oboe.drop;
})

Unless I do the stringify/parse, that is. So it seems like it is not limited to just .on(), though my case might be due to the {scores info} portion of my code.

@goloroden
Author

@badisa I guess it's because of your line

resultsData.addData(node);

where you explicitly keep a reference to the node you just received. Dropping it then of course has no effect.

@goloroden
Author

@JuanCaicedo Any insights on this?

@JuanCaicedo
Collaborator

Haven't been able to look at it, I'm hoping for some time on Saturday 😃

@goloroden
Author

Don't want to be pushy, but I am curious: Any news on this?

@JuanCaicedo
Collaborator

Not at all, thanks for the reminder. I've been prioritizing making a gh-pages version of the website, but that should be up soon and then I'll look into this. I'm going to assign this to myself so I'll remember it.

JuanCaicedo self-assigned this Apr 25, 2016
@will-l-h

I'm also curious if there has been any progress on this.

@cnyzgkn

cnyzgkn commented Jan 3, 2017

Any progress on it? I still can't load big data, even using }).node('!.*', function (event) {

@mweimer

mweimer commented Jan 12, 2017

I've also been experiencing this memory leak; however, I have found that making a copy of the node seems to provide a workaround:

.node('!.*', node => {
   const copy = JSON.parse(JSON.stringify(node));
   events.push(copy);
   return oboe.drop;
})

@JuanCaicedo
Collaborator

I thought I'd share an update. I'm currently the only one actively working on the project, and I've been dedicating most of my open source time towards a workshop I'm giving. I expect I should have more time to dedicate to oboe by the end of next week.

My first priority after that is to improve how the tests and build processes work. Right now they make it fairly challenging to work on the source code, and I think improving them will make this issue easier to diagnose.

If anyone is interested in helping me do that, especially to get to know the codebase to narrow down where this might be, I would love the help 😄

@Parboard

@JuanCaicedo Any updates here?

@JuanCaicedo
Collaborator

Please refer to #137 (comment)
