[node-core-library] Gracefully handle irregular LockFile.tryAcquire fail on macOS/Linux #4497

Merged
iclanton merged 7 commits into microsoft:main from andrew--r:feature/lockfile-handle-failure-gracefully
Feb 7, 2024

@andrew--r
Contributor

@andrew--r andrew--r commented Jan 29, 2024

Fixes #4491

Summary and details

See #4497 (comment)

There are two fixes in this PR:

  1. Fixed the root cause of the unexpected ENOENT error
  2. Added getStatistics method to FileWriter as requested

How it was tested

I’ve added a corresponding test case to the LockFile test suite.

@andrew--r
Contributor Author

@microsoft-github-policy-service agree company="Joom Unipessoal LDA"

@andrew--r andrew--r marked this pull request as ready for review January 29, 2024 16:39
@andrew--r
Contributor Author

andrew--r commented Jan 31, 2024

I’ve dug a bit deeper into the problem, and I think I’ve managed to find the root cause.

The problem occurs when at least two processes constantly try to acquire the same lock, e.g. when processing a task queue inside each process.

Reproduction steps:

  1. The first process acquires the lock and starts executing something (e.g. the current task).
  2. The second process simultaneously tries to acquire the same lock, writes its lockfile, …, and scans the directory for other lockfiles.
  3. Immediately afterwards, the first process releases its lock, which deletes the first lockfile.
  4. The second process handles each file found in step 2. The first lockfile no longer exists, so the second process fails to retrieve info about the first process’s lock, and the error is silently swallowed.
  5. The first process takes the next task from the queue and starts acquiring a new lock, writing a new lockfile.
  6. The second process acts on its obsolete data about the first process’s lockfile and deletes it, assuming that the first process is no longer running and that its lockfile was merely left behind.
  7. The first process tries to retrieve stats for its new lockfile, which the second process deleted in the previous step, and this finally produces the unexpected ENOENT error.
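
The interleaving above can be replayed deterministically with a small in-memory model. This is only a sketch to make the ordering concrete, not the LockFile implementation: `FakeDir`, `replayRace`, the PIDs, and the `res#<pid>.lock` naming are all hypothetical, and the sketch assumes each process reuses the same lockfile name every time it acquires the lock.

```typescript
// Hypothetical in-memory stand-in for the lock directory (file name -> owner PID).
// It models only the operations the seven steps rely on.
class FakeDir {
  private files = new Map<string, number>();

  write(name: string, pid: number): void {
    this.files.set(name, pid);
  }

  list(): string[] {
    return Array.from(this.files.keys());
  }

  remove(name: string): void {
    this.files.delete(name);
  }

  stat(name: string): { pid: number } {
    const pid = this.files.get(name);
    if (pid === undefined) {
      throw new Error("ENOENT: no such file, stat '" + name + "'");
    }
    return { pid };
  }
}

// Replays steps 1-7 in order and returns what the first process sees in step 7.
function replayRace(): string {
  const dir = new FakeDir();
  const fileA = "res#100.lock"; // assumed name of the first process's lockfile
  const fileB = "res#200.lock"; // assumed name of the second process's lockfile

  // Step 1: the first process acquires the lock.
  dir.write(fileA, 100);

  // Step 2: the second process writes its lockfile and scans the directory.
  dir.write(fileB, 200);
  const snapshot = dir.list().filter((name) => name !== fileB);

  // Step 3: the first process releases the lock, deleting its lockfile.
  dir.remove(fileA);

  // Step 4: the second process inspects the snapshot; the failed lookup is
  // swallowed and the file is only remembered as presumably stale.
  const presumedStale: string[] = [];
  for (const name of snapshot) {
    try {
      dir.stat(name);
    } catch (e) {
      presumedStale.push(name);
    }
  }

  // Step 5: the first process acquires the lock again, writing a new lockfile.
  dir.write(fileA, 100);

  // Step 6: the second process deletes what it believes is a leftover file.
  for (const name of presumedStale) {
    dir.remove(name);
  }

  // Step 7: the first process stats its new lockfile, which is already gone.
  try {
    dir.stat(fileA);
    return "ok";
  } catch (e) {
    return e instanceof Error ? e.message : String(e);
  }
}

const outcome: string = replayRace();
console.log(outcome);
```

In this model the failure depends entirely on step 6 acting on the snapshot taken in step 2; if steps 5 and 6 are swapped, the final `stat` succeeds.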

To be sure, in the next few days I’ll write some test cases covering this behaviour, and if they confirm it, I’ll update the PR with a fix for the root cause.

@andrew--r andrew--r marked this pull request as draft January 31, 2024 01:19
@andrew--r andrew--r marked this pull request as ready for review February 4, 2024 20:10
@andrew--r andrew--r requested a review from iclanton February 4, 2024 20:15
@andrew--r
Contributor Author

@iclanton could you please take a look at the updates when you have time?

@iclanton iclanton enabled auto-merge February 7, 2024 18:32
@iclanton iclanton merged commit 0ce9231 into microsoft:main Feb 7, 2024
@andrew--r andrew--r deleted the feature/lockfile-handle-failure-gracefully branch February 7, 2024 22:37

Development

Successfully merging this pull request may close these issues.

[node-core-library] LockFile.acquire fails irregularly when multiple processes try to acquire lock
