Mongod sometimes fails to start due to ETXTBSY #872

Closed
Simon-Hayden-iteratec opened this issue May 16, 2024 · 10 comments
Labels
bug, released on @beta, released (Pull Request released | Issue is fixed)

Comments

@Simon-Hayden-iteratec

Versions

  • NodeJS: ^20 (GitHub Action uses actions/setup-node@v3 with node-version: 20)
  • mongodb-memory-server: 9.1.6
  • mongodb (the binary version): 0
  • mongodb (the js package): 6.3.0 (from mongoose)
  • mongoose: 8.1.2
  • system: Linux, most likely Ubuntu, but we are running on a self-hosted runner provided by my client

package: mongodb-memory-server

What is the Problem?

We have a project with two integration tests that use mongodb-memory-server with a replica set, because we are using Change Streams. When we run the tests on CI, we sometimes get this error (full logs below):

2024-05-15T07:08:36.6267264Z 2024-05-15T07:08:27.279Z MongoMS:MongoMemoryServer Mongo[unknown]: _startUpInstance threw a Error:  Error: spawn ETXTBSY
2024-05-15T07:08:36.6269314Z     at ChildProcess.spawn (node:internal/child_process:421:11)
2024-05-15T07:08:36.6270360Z     at spawn (node:child_process:761:9)
2024-05-15T07:08:36.6273299Z     at MongoInstance._launchMongod (/home/runner/work/<redacted>/<redacted>/node_modules/.pnpm/mongodb-memory-server-core@9.1.6/node_modules/mongodb-memory-server-core/src/util/MongoInstance.ts:504:31)
2024-05-15T07:08:36.6277477Z     at MongoInstance.start (/home/runner/work/<redacted>/<redacted>/node_modules/.pnpm/mongodb-memory-server-core@9.1.6/node_modules/mongodb-memory-server-core/src/util/MongoInstance.ts:392:31)
2024-05-15T07:08:36.6281617Z     at Function.create (/home/runner/work/<redacted>/<redacted>/node_modules/.pnpm/mongodb-memory-server-core@9.1.6/node_modules/mongodb-memory-server-core/src/util/MongoInstance.ts:292:5)
2024-05-15T07:08:36.6286043Z     at MongoMemoryServer._startUpInstance (/home/runner/work/<redacted>/<redacted>/node_modules/.pnpm/mongodb-memory-server-core@9.1.6/node_modules/mongodb-memory-server-core/src/MongoMemoryServer.ts:490:22)
2024-05-15T07:08:36.6290386Z     at MongoMemoryServer.start (/home/runner/work/<redacted>/<redacted>/node_modules/.pnpm/mongodb-memory-server-core@9.1.6/node_modules/mongodb-memory-server-core/src/MongoMemoryServer.ts:297:5)
2024-05-15T07:08:36.6293230Z     at async Promise.all (index 0)
2024-05-15T07:08:36.6296114Z     at MongoMemoryReplSet.initAllServers (/home/runner/work/<redacted>/<redacted>/node_modules/.pnpm/mongodb-memory-server-core@9.1.6/node_modules/mongodb-memory-server-core/src/MongoMemoryReplSet.ts:468:5)
2024-05-15T07:08:36.6300501Z     at MongoMemoryReplSet.start (/home/runner/work/<redacted>/<redacted>/node_modules/.pnpm/mongodb-memory-server-core@9.1.6/node_modules/mongodb-memory-server-core/src/MongoMemoryReplSet.ts:380:5)
2024-05-15T07:08:36.6304725Z     at Function.create (/home/runner/work/<redacted>/<redacted>/node_modules/.pnpm/mongodb-memory-server-core@9.1.6/node_modules/mongodb-memory-server-core/src/MongoMemoryReplSet.ts:193:5)
2024-05-15T07:08:36.6308684Z     at Object.<anonymous> (/home/runner/work/<redacted>/<redacted>/libs/backend/db/src/lib/mongodb/services/user-follow-entity/user-follow-entity.service.spec.ts:337:21) {
2024-05-15T07:08:36.6310935Z   errno: -26,
2024-05-15T07:08:36.6311790Z   code: 'ETXTBSY',
2024-05-15T07:08:36.6312428Z   syscall: 'spawn'
2024-05-15T07:08:36.6313120Z }

After some googling, the errno ETXTBSY is supposed to mean "Text file busy". So I assume there is some race condition that disallows mongod from being called/spawned twice in quick succession. However, I did not find any information about mongod + ETXTBSY.

Because I assume this to be a race condition issue, it is also hard to reproduce. Locally, this error has not yet happened on any dev machine. On the CI, however, I'd say this error happens in around 1/3 to 2/3 of runs. My assumption is that the CI cores are just slower and thus give a larger time window for the race condition to happen? Or maybe there is some weird OS issue with the runner, but I couldn't find anything providing an OS-level fix for this error either.

As a workaround, we are running the tests with the --runInBand flag to instruct Jest to run one test file after another. This is okay-ish for us, because we are using Nx to manage the monorepo, so we still parallelize based on Nx projects. However, if we add a second project that also needs to run MongoDB integration tests, I'd assume the error would return.
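
For illustration, a minimal sketch of what the workaround looks like at the config level (a hypothetical jest.config.ts; maxWorkers: 1 is only roughly comparable to --runInBand, which additionally runs all tests in the main process):

// jest.config.ts - hypothetical sketch: serialize test files so that only one
// in-memory MongoDB replica set is started at a time
import type { Config } from 'jest';

const config: Config = {
  // run test files one after another instead of in parallel worker processes
  maxWorkers: 1,
};

export default config;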

Code Example

// Wish I could pin it down to a specific code example
// My best bet: Have multiple tests starting an in-memory MongoDB and run them in parallel.
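
A rough sketch of that setup (hypothetical, simplified test file; the race presumably only shows up when several Jest worker processes each start their own replica set while the binary cache is still empty):

// hypothetical spec file - each parallel Jest worker runs a beforeAll like this,
// so several workers may download/extract and spawn mongod at nearly the same time
import { MongoMemoryReplSet } from 'mongodb-memory-server';

let replSet: MongoMemoryReplSet;

beforeAll(async () => {
  // a replica set is needed because the code under test uses Change Streams
  replSet = await MongoMemoryReplSet.create({ replSet: { count: 1 } });
}, 60_000);

afterAll(async () => {
  await replSet.stop();
});

it('connects to the in-memory replica set', () => {
  expect(replSet.getUri()).toContain('mongodb://');
});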

Debug Output

The logs are quite verbose:

https://gist.github.com/Simon-Hayden-iteratec/93a2a822a8a081d987b4b427061adac4

Something I noticed is that it looks like there are multiple downloads of the mongod binary. Maybe the issue is also caused by process A still downloading the binary while process B tries to start mongod?

Do you know why it happens?

Sadly no

@hasezoey
Collaborator

Because I assume this to be a race condition issue, it is also hard to reproduce

it is, and my best guess is that the following happens (according to the logs):

  1. worker 1 tries to start, does not find the binary, acquires the download lock and downloads
  2. worker 2 tries to start, does not find the binary, tries to acquire the download lock, but it already exists, and so waits until it is done
  3. worker 1 starts to extract the binary
  4. worker 3 tries to start, does find the binary, and tries to use it (but it is still incomplete, so ETXTBSY)
  5. worker 1 finishes extracting the binary

this is at least my best guess from the logs, currently the checks are:

  1. if file exists, try to use it
  2. if file does not exist, look for the download lock

i guess i need to add a check to see if the download lock still exists before trying to use the binary
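
roughly, the changed check order could look like this (a hypothetical helper for illustration, not the actual library internals):

// hypothetical illustration of the check order described above: treat the binary
// as missing while a download lock file still exists, instead of spawning a
// possibly half-extracted binary
import { existsSync } from 'node:fs';

function resolveUsableBinary(binaryPath: string, lockPath: string): string | undefined {
  if (existsSync(lockPath)) {
    // another process is still downloading/extracting; the caller should wait
    // on the lock instead of using the (possibly incomplete) binary
    return undefined;
  }
  return existsSync(binaryPath) ? binaryPath : undefined;
}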

@Simon-Hayden-iteratec
Author

Would be amazing if this is the issue and you could fix it 😄

Though I wonder why we are the only ones reporting this. Since the race condition you're describing could have happened to anybody, there should have been at least some report already?

Anyway, thanks a lot for your fast reply :)

@hasezoey
Collaborator

Though I wonder why we are the only ones reporting this. Since the race condition you're describing could have happened to anybody, there should have been at least some report already?

i guess the extraction part is quite quick and so is basically "atomic" on most other systems (the archive is about ~70mb and the extracted binary around ~130mb to ~180mb).
also, mostly the instances are run in the same process instead of in separate processes (like in a globalSetup instead of in each setupFile / beforeAll; thanks to how jest works), because if they are in the same process they would likely wait (don't know the specifics for that at the moment)
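
for reference, a minimal sketch of that globalSetup pattern (file name and env variable name are just placeholders, assuming a standard jest setup):

// jest.global-setup.ts - hypothetical: start one replica set in the single Jest
// main process, so worker processes never race on the binary download/extraction
import { MongoMemoryReplSet } from 'mongodb-memory-server';

export default async function globalSetup(): Promise<void> {
  const replSet = await MongoMemoryReplSet.create({ replSet: { count: 1 } });
  // hand the connection string to the test workers via an env variable
  process.env.MONGO_URI = replSet.getUri();
  // keep a reference so a matching globalTeardown can stop it
  (globalThis as any).__MONGO_REPLSET__ = replSet;
}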

hasezoey added a commit that referenced this issue May 16, 2024
and treat it like the binary does not exist

fixes #872
@hasezoey
Collaborator

hasezoey commented May 16, 2024

put in a fix in c600609, which will be available as 9.2.1-beta.1. please test it for a while once available and report back (there will be a bot message)


🎉 This issue has been resolved in version 9.2.1-beta.1 🎉

The release is available on:

Your semantic-release bot 📦🚀

@Simon-Hayden-iteratec
Author

I'll give it a spin and see if we get any errors

@Simon-Hayden-iteratec
Author

I've run the pipeline five times now with the beta version and it didn't fail. But I also can't find the log message "binary found, but also a download-lock, trying to resolve lock" anywhere, which most likely means that I didn't trigger the race condition this time.

I'm not sure if something else is currently preventing the issue (e.g. lower load on the runner?) but I will keep an eye on this. If there are more failed runs I will let you know. But maybe in a new ticket, so this one can be closed.

Thanks for your effort and quick fix implementation 🥇

@hasezoey
Collaborator

i will leave it open a few more days since the message you were searching for has not appeared yet; at the latest it will be closed once the stale message comes up (in case i forget to close it beforehand)

@Simon-Hayden-iteratec
Author

So our test pipeline hasn't been failing the last few days. Looking at the logs, I also found at least one run that had the log line "binary found, but also a download-lock, trying to resolve lock" in it. To me that sounds like the race condition is fixed 🎉

I'll close the bug report now. Thanks again for your super fast help :)


@github-actions bot added the "released" label (Pull Request released | Issue is fixed) on May 30, 2024