Second child process doesn't receive data when using the stdout of first child process as stdin of second #9413

Open
aalexgabi opened this Issue Nov 2, 2016 · 20 comments

@aalexgabi

aalexgabi commented Nov 2, 2016

  • Version: v0.12.9 - v7.0.0
  • Platform: Linux ${edited hostname} 4.4.0-45-generic #66~14.04.1-Ubuntu SMP Wed Oct 19 15:05:38 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Also posted here: http://stackoverflow.com/questions/40306385/missing-lines-when-using-the-stdout-of-a-child-process-as-stdin-of-another

When using the stdout of one child process as stdin for another, it seems that sometimes data is not passed to the next child:

var spawn = require('child_process').spawn;

var pipeId = 0;
var pipeSlots = 6;

var launchProcess = function(cmd, args, stdin, stdout){
  return spawn(cmd, args, {
    stdio: [stdin, stdout, 'ignore']
  });
};

var launch = function(){
  var task0 = launchProcess('echo', ['how\nare\nyou\ndear\nstranger'], 'ignore', 'pipe');
  var task1 = launchProcess('tee', ['/tmp/body-pipeline-' + pipeId], task0.stdout, 'ignore');

  task0.on('error', function(err){
    console.log('Error while processing task0:' + err.stack);
  });
  task1.on('error', function(err){
    console.log('Error while processing task1:' + err.stack);
  });

  pipeId++;
};

// Simulating message queue
setInterval(function(){
  // Simulating how many messages we get from the messaging queue
  var mqMessageCount = Math.floor(Math.random() * (pipeSlots + 1));

  for(var i = 0; i < mqMessageCount; i++){
    launch();
  }
}, 250); // For this test we assume that pipes finish under 250ms

Some files are empty:

ls -lhS /tmp/body-pipeline-*

FYI: task0.stdout.pipe(task1.stdin) solves the issue, but the script then uses 50% CPU (compared to 0% when passing the stdout of task0 directly as the stdin of task1) for the equivalent of yes | tee /tmp/body-pipeline-x

@mscdex mscdex added the child_process label Nov 2, 2016

@bnoordhuis

Member

bnoordhuis commented Nov 2, 2016

You are handing off task0.stdout to the second child process, but because task0.stdout starts in flowing mode and also exists in the parent process, data can end up in either process.

It might be possible to work around it by creating the first child's stdout pipe manually in paused mode, but that's probably difficult to do without groping around in node internals.

Perhaps spawn() needs to be taught a pauseOnCreate option, similar to new net.Socket({ handle: handle, pauseOnCreate: true }). Note that it may not be the best possible fix, just the first one that comes to mind.

@addaleax

Member

addaleax commented Nov 2, 2016

/cc @targos fyi this seems to be the right answer to nodejs/help#324, too

@mscdex

Contributor

mscdex commented Nov 2, 2016

The problem has to do with task0's stdout stream buffering all of the output before task1 has a chance to start and hook up to task0's stdout. It's racy due to the timing/scheduling of the child processes, which is why it does not happen all of the time.

EDIT: Oops, I didn't see the comments before this; I was just about to say what @bnoordhuis said. It works if pauseOnCreate: true is passed (internally) and you manually set task0.stdout.readable = false in the userland code so that flushStdio() does not call stream.resume(), which starts buffering data on the next tick.

@aalexgabi

aalexgabi commented Nov 3, 2016

Is there a workaround that doesn't involve task0.stdout.pipe(task1.stdin)? I tried task0.stdout.readable = false but I don't know how to pass pauseOnCreate: true.

@bnoordhuis How do you manually create a pipe in non-flowing mode?

@bnoordhuis

Member

bnoordhuis commented Nov 3, 2016

Is there a workaround that doesn't involve task0.stdout.pipe(task1.stdin)?

Not at the moment, no. Not without monkeying around with node's internals.

@aalexgabi

aalexgabi commented Nov 3, 2016

Is there a way to use task0.stdout.pipe(task1.stdin) without passing all the data through the NodeJS VM? I'm thinking of something like:

  var task0 = launchProcess('echo', ['how\nare\nyou\ndear\nstranger'], 'ignore', 'pipe');
  var fakePipe = new net.Socket({fd: task0.stdout._handle.fd})
  var task1 = launchProcess('tee', ['/tmp/body-pipeline-' + pipeId], fakePipe, 'ignore');

How would you go about monkey-patching node's internals? Temporarily replacing net.createSocket with a version forcing readable to false?

I'm using NodeJS for an export runner that launches pipelines of scripts to process lots of data and upload it at the end, so I would really like to find a way of doing this without the runner incurring very high CPU usage.

@targos

Member

targos commented Nov 5, 2016

Thanks @addaleax, it seems to be the same issue indeed!

What I don't understand here is that it happens even when the two calls to spawn are made in the same tick. Is task0.stdout somehow receiving data synchronously?

@bnoordhuis

Member

bnoordhuis commented Nov 5, 2016

The file descriptor of task0.stdout exists simultaneously in the parent and the child. Calling task0.stdout.destroy() or task0.stdout.pause() in the parent right after spawning task1 might work around the issue for now.

@addaleax

Member

addaleax commented Nov 5, 2016

Calling task0.stdout.destroy() or task0.stdout.pause() in the parent right after spawning task1 might work around the issue for now.

I haven’t looked at it in detail, so ignore me if this doesn’t make sense, but is there any reason why the second spawn call shouldn’t do that itself when receiving task0.stdout?

@targos

Member

targos commented Nov 6, 2016

Calling task0.stdout.destroy() or task0.stdout.pause() in the parent right after spawning task1 might work around the issue for now.

Just tried both and it works with destroy but not with pause.

@aalexgabi

aalexgabi commented Nov 30, 2016

A colleague of mine had a great idea that fixed this:

  var task1 = launchProcess('tee', ['/tmp/body-pipeline-' + pipeId], 'pipe', 'ignore');
  var task0 = launchProcess('echo', ['how\nare\nyou\ndear\nstranger'], 'ignore', task1.stdin);

The idea is to launch the receiving task before launching the sending task!

@masaeedu

masaeedu commented Jul 13, 2017

@bnoordhuis Is there a way to create a process with a predefined WritableStream for stdout? The API itself seemed racy when I started using it, because I don't want to start a process and only then hook up something to receive its stdout. I can't see how to provide a stdout receiver before starting the process with the existing streams API, though.

@gireeshpunathil

Member

gireeshpunathil commented Apr 17, 2018

@aalexgabi - is this still an outstanding issue for you?

@aalexgabi

aalexgabi commented Apr 24, 2018

@gireeshpunathil Yes. A lot of CPU time is wasted because I had to pass all the data through the NodeJS process instead of letting the two sub-processes pipe it directly. Changing the algorithm so that the receiving task is launched before the sending task, as noted in my last comment, was too complicated for a workaround in the end.

@gireeshpunathil

Member

gireeshpunathil commented May 20, 2018

cross link #18016 - both cases are root caused by same issue.

@gireeshpunathil

Member

gireeshpunathil commented May 20, 2018

It looks like this was suggested earlier, but there is no evidence whether it was attempted: can you please try with:

task0.stdout.pipe(task1.stdin);

It is reliable, and simple.

@aalexgabi

aalexgabi commented May 20, 2018

@gireeshpunathil The problem does not arise when using task0.stdout.pipe(task1.stdin), but it has the major downside of passing all the data through the NodeJS VM, which causes very high CPU usage.

@gireeshpunathil

Member

gireeshpunathil commented May 20, 2018

@aalexgabi - if your main motivation is to stop node being a middleman between task0 and task1, why don't you combine them and run them under a single child process?

I was under the impression that you want node to have fine-grained control over both processes, and want them not to be dependent on each other - in which case the data comes through node anyway, and the additional burden of piping does not add too much load.

Thinking further on it (between yesterday and today), I believe the issue is neither in your code nor in Node's code. There is an unspecified design at work: when we pass the stdout of task0 as the stdin for task1, the expectation is that all of the data will be routed to task1, but in reality we are cloning the stdout and passing the clone to task1. What is unspecified is what to do with the original - should it still function, or should it be closed?

If we force-close the original, any other parts of the program that use it (you don't have any here) may malfunction.

If we leave it as it is, the behavior you found will occur - data will branch out into two destinations.

Under default conditions, I guess the data should be made available in both, following this example:

# echo hello | tee 1.txt
hello
# cat 1.txt
hello
@aalexgabi

aalexgabi commented Jun 2, 2018

@gireeshpunathil My application is a task runner which is agnostic to the tasks it executes. I have a RabbitMQ queue called task_run_requests. Each task_run_request has a type and some parameters. Each task type has a corresponding entry in a configuration file which looks like this:

{
...
  "someFilteringTaskThatUploadsResult": {
    steps: [
      {
        command: "cat",
        args: [ "${request.filename}" ]
      },
      {
        command: "grep",
        args: [ "-v", "${request.filter}" ]
      },
      {
        command: "tee",
        args: [ "${request.backupFile}" ]
      },
      {
        command: "ftpupload",
        args: [ "-p", "${request.password}", "-u", "${request.username}" ]
      },
    ]
  }
...
}

All the commands the runner executes are just Linux programs. Most of them are atomic functions (generate file, transform file, merge files, split files, etc.). Only the task type assembles different steps into one pipeline. I cannot write a script for each combination, because that would defeat the purpose of having a generic runner in the first place. The idea is to have a system where you can mix and match system and custom commands by writing configuration, not code, except where necessary. Some of these tasks process gigabytes of data and can run for up to 12 hours. Passing all the data through the Node VM serves no purpose, because the runner does not touch the data but only orchestrates the execution of tasks.

There is a need for granular control of tasks in case of a failure. In the event of one task failing, we need to kill all the other tasks, store the exit code of each task in the database, cancel a pipeline on user request, etc. Some tasks demand lots of resources, so for example we may want to respawn only one child and not the entire pipeline in case of a failure.

I'm sorry, but I don't think I understand the second part of your comment. You are talking about cloning stdin, and I don't understand why. I will try to explain how I understand process spawning and the stdin/stdout handling by analogy with bash.

When you want to execute a pipeline like cat myFile | tee backupFile | grep -v myFilter, bash will create three fds for stdin, stdout and stderr for the first process and spawn it. Because there is no consumer on the stdout of cat myFile, the write call inside this process will block once the fd buffer is full, until something reads from its stdout. Bash will only create two file descriptors for tee backupFile, because it will pass it exactly the stdout file descriptor of the first command. When the second command reads its stdin, it will unblock the write from the first command. The same happens for the last command. With this design no data can be lost, because as long as nobody reads the stdout file descriptor, the write calls inside the process that outputs data will block once the write buffer is full.

My understanding may not be accurate, but this is how I imagine it and how I expect NodeJS to handle it. It seems to me that for some reason the stdout of the first process is either a black hole like /dev/null until the second process is spawned, or it is being read from by NodeJS, or something.

Note that I don't need to handle any data events on the file descriptors between spawned processes, so I don't expect them to be NodeJS streams but simple integers that identify file descriptors.

@gireeshpunathil

Member

gireeshpunathil commented Jun 8, 2018

@aalexgabi - thanks for the detailed explanation.

  • I understand your use case now, thanks.
  • I agree I fully messed up my second paragraph. I am re-writing it with a better mindset:

There is an unspecified design at work: when we pass stdout of task0 as the stdin for task1, the expectation is that all of task0::stdout is passed to task1::stdin. But the reality is that we are making a new pipe that contains task0::stdout at one end and task1::stdin at the other. The original copy of task0::stdout is still open and available to the parent. What is unspecified is what to do with the original task0::stdout - should it still function, or should it be closed?

Of course, this explanation does not provide any relief for your scenario; it just explains the current design.

While I am not sure about your reasoning around how bash manages this, what matters is the outcome: once the piping is completed, bash is not involved in the data flow.

However, bash has information upfront about which processes are coming for execution and how they want to be connected, so it can set up the connectivity precisely, in the optimal manner.

Our Node program does not have that info. So every child process spawn is independent and creates new pipes with one end at the child and the other at the parent, i.e., node.

Under the current design it becomes manual work to close out / re-pipe the open ends of the pipes.

I will look into the code / engage others to see whether any improvements are possible here, without causing side effects to the otherwise stable code.

gireeshpunathil added a commit to gireeshpunathil/node that referenced this issue Jun 8, 2018

child_process: close pipe ends that are re-piped
when t0 and t1 are spawned with t0's output stream [1, 2] piped into
t1's input, a new pipe is created which uses a copy of t0's fd.
This leaves the original copy in the Node parent, unattended. The net
result is that when t0 produces data, it gets bifurcated into both
copies.

Detect the passed handle to be of 'wrap' type and close it after the
native spawn invocation, by which time the piping would have completed.

Fixes: nodejs#9413
Fixes: nodejs#18016