Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cluster.worker.on('message', (msg) => ...) fails to register a callback if ESM file extensions is used #48578

Open
jerome-benoit opened this issue Jun 27, 2023 · 20 comments
Labels
cluster Issues and PRs related to the cluster subsystem. confirmed-bug Issues with confirmed bugs.

Comments

@jerome-benoit
Copy link
Contributor

jerome-benoit commented Jun 27, 2023

Version

v20.3.1

Platform

All supported platform

Subsystem

No response

What steps will reproduce the bug?

After a migration of code to ESM by using .mjs file extensions, the poolifier project has encountered issues at running internal benchmarks code.

Test case:

  • main.mjs:
import cluster from 'cluster'

cluster.setupPrimary({ exec: './worker.mjs' })

const worker = cluster.fork()

worker.on('message', message => {
  console.info('message received from worker:', message)
})

worker.on('online', () => {
  console.info('worker is online')
})

worker.on('error', (error) => {
  console.info('worker error', error)
})

worker.on('disconnect', () => {
  console.info('worker disconnected')
})

worker.on('exit', () => {
  console.info('worker exited')
})

worker.send('hello')
  • worker.mjs:
import cluster from 'cluster'

cluster.worker.on('message', message => {
  console.info('echo message received from main:', message)
  cluster.worker.send(message)
})

node main.mjs is frozen.

How often does it reproduce? Is there a required condition?

100% reproducible

What is the expected behavior? Why is that the expected behavior?

No response

What do you see instead?

Callback is never called if a message is sent from the primary.

Additional information

No response

@jerome-benoit
Copy link
Contributor Author

jerome-benoit commented Jun 27, 2023

Additional information: if example code for IPC with cluster is put the same ESM file, the callback for 'message' event is properly registered and called. Poolifier code is using the exec option to specify the worker file path. And in that case, cluster.worker.on('',() => {}) seems to have issues with ESM.

@aduh95
Copy link
Contributor

aduh95 commented Jun 27, 2023

Can you share a repro?

@jerome-benoit
Copy link
Contributor Author

jerome-benoit commented Jun 27, 2023

Please read the detailed bug report on poolifier repo, it contains all the bits to reproduce it reliably with poolifier. I do not have the time currently to extract the code from it to reproduce with two files using .mjs extension.

@aduh95
Copy link
Contributor

aduh95 commented Jun 27, 2023

I don't have the time either, I don't plan on working on it without a repro that does not involve external code.

@bnoordhuis
Copy link
Member

I'll close this until there is a standalone test case. Let me know when I should reopen.

@bnoordhuis bnoordhuis closed this as not planned Won't fix, can't repro, duplicate, stale Jun 28, 2023
@jerome-benoit
Copy link
Contributor Author

jerome-benoit commented Jun 28, 2023

The way that bug 100% reproducible has been handled by the node.js project is below the common sense standards expected if the project goal is stability.
When I receive a confirmed bug report 100% reproducible on one of the FOSS project I maintain, I would not even dare to put the burden on the bug reporter to solve it: it's unrespectful of the work done to identify the issue and pinpoint the root cause.

@bnoordhuis
Copy link
Member

The reason we don't usually accept bug reports that manifest in third-party code is that 9 out of 10 times the bug is in said third-party code, not Node.js.

Eventually comes the point where you decide you've wasted enough unpaid hours of your life on other people's crappy code.

Long story short: happy to take a look if you have a minimal test case. But if you are not willing to put in the time, then neither are we.

@jerome-benoit
Copy link
Contributor Author

jerome-benoit commented Jun 28, 2023

Please read carefully the bug report I've made before applying blindly rules that do not apply to it:

  • CommonJS benchmarking code: everything works, callback registered with cluster.worker.on(), called in poolifier
  • benchmarking code migrated to ESM: identical code, except the import part, using the very same (not even the ESM bundle, but the ESM bundle has the same issue) poolifier code previously working: callback not registered with cluster.worker.on() or called

=> Bug confirmed, 100% reproducible. Extracting a standalone test case is then becoming a corollary, not a prerequisite to its resolution. And then the burden is expected to be shared between project maintainers and the bug reporter.

@bnoordhuis
Copy link
Member

Unless they're on your payroll, then no, you don't get to decide how and where other people invest their time.

Either put together a test case or don't, but stop arguing.

@jerome-benoit
Copy link
Contributor Author

jerome-benoit commented Jun 28, 2023

So the node.js project is not interested in releasing a stable ESM support in the cluster module by trying to welcome a proven 100% reproducible confirmed bug? And prefers to apply blindly rules instead of collaborating with the bug reporter to help planning fixing it and makes node.js better?

I will do the standalone test case when time permits. But given the way my first confirmed 100% reproducible bug report on the node.js project has been welcomed, I do not think I will redo it unless it impacts deeply a FOSS project I comaintain.

@aduh95
Copy link
Contributor

aduh95 commented Jun 28, 2023

What about you? Are you not interested in releasing a stable ESM support in the cluster module? I think there's a misunderstanding here, the Node.js project is run by volunteers, you can't expect that someone else would be interested in investing their time in fixing your bug, especially when you yourself are not. Or you need to pay them (send me an email, I can make myself available).
The Node.js project would gladly accept a PR to fix it, but the Node.js project cannot open PRs by itself, only contributors can; and you are unlikely to find someone to spend time on it if they are not affected by the bug themselves. If you provide a repro – which is as you probably know is often the hardest part of fixing a bug – you're way more likely to find a volunteer.

But given the way my first confirmed 100% reproducible bug report on the node.js project has been welcomed, I do not think I will redo it unless it impacts deeply a FOSS project I comaintain.

You'll be deeply missed. If I may suggest to tune down the entitlement and show a little more respect of other people's personal time, you could be amazed what difference it makes.

@jerome-benoit
Copy link
Contributor Author

jerome-benoit commented Jun 28, 2023

I do not expect any fix to be done, I do not expect any answer, I expect nothing from a bug report on a FOSS project, except one thing: that if taken into account, it's done fairly in depth by a volunteer that matters at fixing it and has the time to handle it. A sign of respect of the time spent by another volunteer to pointpoint the bug. If that's not the view of the node.js project volunteers, that's fine by me. But easily understandable that the volunteer in question will be reluctant at doing another detailed bug reports on the same project.
If the node.js project thinks that the present bug report has been handled in a respectful way and fairly, that discussion is going nowhere and there's no point at continuing it.

Back to the subject: I'll do the test case reproducing the issue when time permits and attach the files to the bug report. And when times permit probably do a PR to try to fix it, as I usually do as a long time FOSS hacker on the open bug reports I do.

@jerome-benoit
Copy link
Contributor Author

jerome-benoit commented Jul 29, 2023

The test case:

  • main.mjs:
import cluster from 'cluster'

cluster.setupPrimary({ exec: './worker.mjs' })

const worker = cluster.fork()

worker.on('message', message => {
  console.info('message received from worker:', message)
})

worker.on('online', () => {
  console.info('worker is online')
})

worker.on('error', (error) => {
  console.info('worker error', error)
})

worker.on('disconnect', () => {
  console.info('worker disconnected')
})

worker.on('exit', () => {
  console.info('worker exited')
})

worker.send('hello')
  • worker.mjs:
import cluster from 'cluster'

cluster.worker.on('message', message => {
  console.info('echo message received from main:', message)
  cluster.worker.send(message)
})

node main.mjs is frozen.

Please reopen the issue.

@jerome-benoit
Copy link
Contributor Author

  • main.js:
const cluster = require('cluster')

cluster.setupPrimary({ exec: './worker.js' })

const worker = cluster.fork()

worker.on('message', message => {
  console.info('message received from worker:', message)
})

worker.on('online', () => {
  console.info('worker is online')
})

worker.on('error', (error) => {
  console.info('worker error', error)
})

worker.on('disconnect', () => {
  console.info('worker disconnected')
})

worker.on('exit', () => {
  console.info('worker exited')
})

worker.send('hello')
  • worker.js:
const cluster = require('cluster')

cluster.worker.on('message', message => {
  console.info('echo message received from main:', message)
  cluster.worker.send(message)
})

node main.js just works.

@bnoordhuis
Copy link
Member

Thanks, I'll reopen. Can you update your original report with the test cases?

I can make an educated guess as to why it doesn't work:

  1. cluster.fork() is implemented on top of child_process.fork()
  2. child_process.fork() spawns the child process with NODE_CHANNEL_FD=<num> set in the environment
  3. NODE_CHANNEL_FD makes node's bootstrap code call child_process._forkChild() to set up IPC
  4. the asynchronous loading of worker.mjs makes that malfunction somehow

Interestingly, the worker sends the 'online' internal message. That's done by cluster._setupWorker() and it's called from lib/internal/process/pre_execution.js so at least some of the steps are working, just not all.

Anyway, you know where to start looking now. Pull request welcome.

@bnoordhuis bnoordhuis reopened this Jul 30, 2023
@bnoordhuis bnoordhuis added confirmed-bug Issues with confirmed bugs. cluster Issues and PRs related to the cluster subsystem. labels Jul 30, 2023
@jerome-benoit
Copy link
Contributor Author

That will take quite a while before I will be able to work on it, so if someone starts to work on that issue, please say so and share your progress.

@jerome-benoit
Copy link
Contributor Author

@fabiancook: thanks for your interest. You do need to use poolifier to fix that issue.

aleksejaku added a commit to aleksejaku/node that referenced this issue Oct 6, 2023
@ronag
Copy link
Member

ronag commented Oct 7, 2023

@mcollina i think this is quite a significant issue in terms of esm stability.

See @bnoordhuis last comment.

@jerome-benoit
Copy link
Contributor Author

@mcollina i think this is quite a significant issue in terms of esm stability.

The issue is only met if the worker file is an ESM one. The main file type has no impact.

@st3ffgv4
Copy link

I'm facing the same issue

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cluster Issues and PRs related to the cluster subsystem. confirmed-bug Issues with confirmed bugs.
Projects
None yet
5 participants