Windows Server 2019: LockFile lock is not released by OS when process exits #37

Closed
thecloudtaylor opened this issue Jun 24, 2020 · 38 comments
Labels
bug Something isn't working Storage Data storage management and issues

Comments

@thecloudtaylor
Member

thecloudtaylor commented Jun 24, 2020

Combining: moby/moby#39088 and docker-library/mongo#385 and StefanScherer/dockerfiles-windows#349

Description from @drnybble on (moby/moby#39088) - thank you!

Description

This problem is exhibited by running Mongo 4.0.8 with a named volume. If you restart Mongo it will not start because the WiredTiger.lock file remains locked. This is a general problem demonstrated by a sample Windows CLI program shown below.

Steps to reproduce the issue:

Using Visual Studio 2017, compile the Windows CLI application shown below. I used the C++ code generation option "Multi-threaded" so the VS redistributables did not need to be included in the Dockerfile.

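#include "pch.h"      // Visual Studio precompiled header; remove if the project does not use one
#include <Windows.h>
#include <cstdlib>
#include <cstring>
#include <iostream>

int main() {
	const wchar_t* FILE_NAME = L"c:\\data\\Test.lock";
	// CREATE_NEW fails if the file already exists; share mode 0 requests exclusive access.
	HANDLE h = ::CreateFileW(FILE_NAME, GENERIC_WRITE, 0, NULL, CREATE_NEW, FILE_ATTRIBUTE_NORMAL, NULL);
	if (h == INVALID_HANDLE_VALUE) {
		std::wcout << L"CreateFile failed" << std::endl;
		exit(1);
	}
	std::wcout << L"Created file " << FILE_NAME << std::endl;

	const char* buf = "Test";
	DWORD written;
	// Use strlen: sizeof(buf) would be the size of the pointer, not of the string.
	if (!::WriteFile(h, buf, (DWORD)strlen(buf), &written, NULL)) {
		std::wcout << L"WriteFile failed" << std::endl;
		exit(1);
	}
	std::wcout << L"Wrote content to file " << FILE_NAME << std::endl;

	// Lock 1 byte starting at offset 0.
	if (!::LockFile(h, 0, 0, 1, 0)) {
		std::wcout << L"LockFile failed" << std::endl;
		exit(1);
	}
	std::wcout << L"Locked first byte of file " << FILE_NAME << std::endl;

	// Deliberately no CloseHandle(h): the OS is expected to release the lock at process exit.
	std::wcout << L"Exiting without closing handle, OS must unlock" << std::endl;
}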

Use the following Dockerfile:

FROM mcr.microsoft.com/windows/servercore:ltsc2019
COPY LockFileTest.exe /
CMD ["C:\\LockFileTest.exe"]

Build it:

docker build -t locktest . 

Create a named volume:

docker volume create locktest 

Run it:

docker run -v locktest:c:\data locktest 

Describe the results you received:

A file called Test.lock is located in the named volume at C:\ProgramData\docker\volumes\locktest\_data. Trying to open it with Visual Studio Code fails with the error EBUSY. The file remains locked even though the owning process has exited.

Describe the results you expected:

As described by the documentation for LockFile (https://docs.microsoft.com/en-us/windows/desktop/api/fileapi/nf-fileapi-lockfile) the OS should release the lock when the process exits. If you run the LockFileTest.exe on the desktop you will find that the file is not locked after the process exits.

Additional information you deem important (e.g. issue happens only occasionally):

Workaround is that you can delete the file even though it is locked. So for Mongo you can delete the WiredTiger.lock file on startup if it exists.

@immuzz immuzz added the triage New and needs attention label Jun 30, 2020
@immuzz immuzz self-assigned this Jul 1, 2020
@immuzz immuzz added Storage Data storage management and issues and removed triage New and needs attention labels Jul 1, 2020
@immuzz

immuzz commented Jul 1, 2020

We are looking into this.

@Kellendros007

Kellendros007 commented Jul 18, 2020

I found another interesting fact about this error.

If you call CreateFile, LockFile, CloseHandle, and then CreateFile again (to reopen) in sequence on a file in a mounted directory, the reopen either fails with an error or the application waits indefinitely. If you add a Sleep after closing the file, it works (on my machine the delay has to be about 34 seconds). The problem occurs only if the file is in a mounted directory.
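
For the case where the reopen fails with an error (not the case where the call hangs), the "wait and try again" idea can be sketched as a bounded retry loop. This is only an illustration: `retry_until` is a hypothetical helper, and `try_open` stands in for whatever open attempt the application makes, for example a CreateFile call with OPEN_EXISTING that returns quickly on failure.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper: calls try_open() until it returns true or the timeout expires.
// try_open must fail fast; a call that blocks forever cannot be rescued by this loop.
bool retry_until(const std::function<bool()>& try_open,
                 std::chrono::milliseconds timeout,
                 std::chrono::milliseconds interval) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    for (;;) {
        if (try_open()) {
            return true;
        }
        // Give up if the next sleep would end after the deadline.
        if (std::chrono::steady_clock::now() + interval > deadline) {
            return false;
        }
        std::this_thread::sleep_for(interval);
    }
}
```

With the roughly 34 second delay reported here, a timeout of about a minute and an interval of one second would be a plausible starting point, but those numbers are guesses.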

Steps to reproduce the issue:

  1. VS2019 with the latest updates, using the C++ code generation option "Multi-threaded"
  2. C++ test app:
    `
    #include <Windows.h>
    #include <cstdlib>
    #include <iostream>

    int main() {
        // Seconds to wait between closing the file and reopening it.
        int Time = 0;
        LPCWSTR Path = L"c:\\data\\Test.lock";
        HANDLE h = ::CreateFileW(Path, GENERIC_WRITE, 0, NULL, CREATE_NEW, FILE_ATTRIBUTE_NORMAL, NULL);
        if (h == INVALID_HANDLE_VALUE) {
            std::wcout << L"CreateFile failed" << std::endl;
            exit(1);
        }
        std::wcout << L"Created file " << Path << std::endl;

        // Lock the first byte exclusively. Without LOCKFILE_FAIL_IMMEDIATELY
        // this call blocks until the lock can be granted.
        OVERLAPPED overlapvar = { 0 };
        if (!::LockFileEx(h, LOCKFILE_EXCLUSIVE_LOCK, 0, 1, 0, &overlapvar)) {
            std::wcout << L"LockFile failed" << std::endl;
            exit(1);
        }
        std::wcout << L"Lock file" << std::endl;

        if (!::CloseHandle(h)) {
            std::wcout << L"Close failed" << std::endl;
            exit(1);
        }
        std::wcout << L"Close file" << std::endl;

        std::wcout << L"Sleep " << Time << L" sec start" << std::endl;
        Sleep(Time * 1000);
        std::wcout << L"Sleep " << Time << L" sec stop" << std::endl;

        // In my case, if the sleep is shorter than 34 s the next call waits forever;
        // with 34 s or more the program continues normally.
        h = ::CreateFileW(Path, GENERIC_WRITE, 0, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
        if (h == INVALID_HANDLE_VALUE) {
            std::wcout << L"CreateFile failed" << std::endl;
            exit(1);
        }
        std::wcout << L"Open file " << Path << std::endl;

        if (!::LockFileEx(h, LOCKFILE_EXCLUSIVE_LOCK, 0, 1, 0, &overlapvar)) {
            std::wcout << L"LockFile failed" << std::endl;
            exit(1);
        }
        std::wcout << L"Lock file" << std::endl;

        ::CloseHandle(h);
        std::wcout << L"Close file" << std::endl;
    }
`

  3. Dockerfile:
    FROM mcr.microsoft.com/windows/servercore:ltsc2019
    COPY ["App/*", "c:/App/"]
    VOLUME ["c:/data"]
    CMD [ "c:/App/TestApp.exe"]

  4. Build:
    docker build -t locktest .

  5. Run:
    docker run --rm -it --mount type=bind,src=c:/locktest/data,dst=c:/data locktest

More info:

I tried a lot of base images and found this bug in:

  1. mcr.microsoft.com/windows/servercore: 1803,1809,1903,1909
  2. mcr.microsoft.com/windows/nanoserver: 1809
  3. mcr.microsoft.com/windows:1809

I tried Windows 10 Pro (1909, 2004) and Windows Server 2019 Core as host machines.

But with the mcr.microsoft.com/windows/servercore:1607 base image everything works fine.

P.S. Because of this bug, I can't run a Windows Minio server in a container.

UPD: I tried mcr.microsoft.com/windows/servercore:2004 and the problem is there too.

@Kellendros007

Any news after 2 months?

@rossmobi

@immuzz Shouldn't this be labelled as a bug? This is surely not expected behaviour.

@awakecoding

Now that docker-library/mongo#385 was closed in favor of this ticket, will this get worked on? It's a shame that MongoDB Windows containers literally have no chance of working correctly on their own at this point.

@immuzz immuzz added the bug Something isn't working label Oct 21, 2020
@immuzz

immuzz commented Oct 22, 2020

@Kellendros007 Have you reproduced this on Windows OS 2004?

@Kellendros007

Kellendros007 commented Oct 22, 2020

@immuzz I have now tried to reproduce it on a Windows 10 Pro 2004 (19041.572) host with these base images:

mcr.microsoft.com/windows/servercore:2004
mcr.microsoft.com/windows:2004

but the error is still there.

@immuzz

immuzz commented Oct 22, 2020

@Kellendros007 Our developers are looking into it. Thank you for confirming it's reproducible on 2004.

@ghost

ghost commented Nov 22, 2020

This issue has been open for 30 days with no updates.
@immuzz, please provide an update or close this issue.

@awakecoding

Please don't let the bot consider this issue stale :( it really has to get fixed once and for all

@rossmobi

@awakecoding It has moved along their Roadmap from "Backlog" to "Planned" at least, so it seems a fix is on the (distant) horizon.

@awakecoding

@thecloudtaylor any update on this? We've started hitting more issues with MongoDB in Windows containers, and our ugly workaround of manually deleting the WiredTiger.lock file with PowerShell before launching the container will not be able to save us this time: docker-library/mongo#435

We've been investigating issues with customers that have a working deployment of our application using a MongoDB container on Windows Server 2019, and they've been unable to get a fresh installation up and running on brand new Windows Server 2019 machines. MongoDB fails after 20-30 minutes, tries to restart, and hits the WiredTiger.lock issue because it wasn't launched through our PowerShell wrapper, which makes it much harder to diagnose.

It's really hard to justify telling our customers to run everything except MongoDB inside containers, especially since they never had to go through the trouble of manually setting up the database. They like it, and they don't want to switch to Linux. For most of these customers this is their first experience using Windows containers. They feel at home on Windows and we're happy to give them the true Windows experience.

I have always been a strong advocate for Windows containers on Windows, but I really need a hand here, please.

@rossmobi

rossmobi commented Dec 9, 2020

@awakecoding You are mentioning the issue reporter who has not touched this issue since he opened it almost six months ago; I doubt they have any updates for us. If anyone can give us an update, perhaps the assignee @immuzz can, but in any case you can keep track of the issue status on the roadmap here: https://github.com/microsoft/Windows-Containers/projects/1#card-43557545

Sounds like you or your customers are using Windows in production. If you really need to get traction on this, I would suggest you follow the path of your/their Windows licensing vendor, be it Microsoft Azure or a Microsoft Partner with an SPLA, and try to get them to apply some internal / horizontal pressure. Complaints on GitHub Issues are at the bottom of the food chain unfortunately, certainly if there are only 3 participants outside of Microsoft.

Good luck!

@awakecoding

@rossdotpink I guess I'll be at the bottom of the food chain then, trying to apply horizontal pressure would likely be very costly when this is really critical stuff that should be addressed no matter what.

I see another critical issue that could very well be related to this one, basically admitting to the fact that Windows containers currently have no graceful shut down at all: #16

The lack of a graceful shutdown would definitely explain all the issues related to containerized MongoDB we've seen: docker-library/mongo#435

@immuzz any update on this? sounds like both issues could very well be related

@immuzz

immuzz commented Dec 9, 2020

@awakecoding Let me check with the team owning this and get back to you.

@awakecoding

@immuzz thanks a lot, I appreciate it

@Justin-DynamicD

Adding voice to this issue. I have AzureDevOps build servers on Server 2004 experiencing the same issue. The hope was to use build containers instead of managing local installs but file lock breaks a number of build workflows.

As these are build boxes, is there a modern core install that is confirmed as working? The container has to match the host, so I can't just rotate the image version. I have to rebuild the host server, and I would like to avoid multiple rebuilds while trying to find a working version.

@Justin-DynamicD

So is servercore 1607 the most recent working version? Isn't that distro EOL?

Does anyone know of a more recent working version?

@TBBle

TBBle commented Dec 15, 2020

servercore:1607 (aka Windows Server 2016) is in Mainstream support until early 2022, and Extended support until early 2027.

It's interesting that it varies by base image, I'd have expected it to vary by host version. Running most of those tests would have been using Hyper-V isolation, but the servercore:1809 test on Windows Server 2019 should have been process isolation, so I guess that doesn't make a difference either.

So I guess a possible workaround is to run, e.g., MongoDB images based on servercore:1607 on newer Windows hosts in Hyper-V isolation.
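
The host/image constraint behind this workaround can be sketched as a small check. This is a rough model under my understanding of the version compatibility rules for the releases discussed in this thread: process isolation needs the image build to match the host build, while Hyper-V isolation also accepts an image older than the host. The names `build_for_tag` and `can_run` are hypothetical, and the check says nothing about whether the lock bug reproduces on a given combination.

```cpp
#include <map>
#include <optional>
#include <string>

enum class Isolation { Process, HyperV };

// Maps a base image tag to its Windows build number (only tags from this thread).
std::optional<int> build_for_tag(const std::string& tag) {
    static const std::map<std::string, int> builds = {
        {"1607", 14393}, {"1803", 17134}, {"1809", 17763}, {"ltsc2019", 17763},
        {"1903", 18362}, {"1909", 18363}, {"2004", 19041},
    };
    const auto it = builds.find(tag);
    if (it == builds.end()) {
        return std::nullopt;  // unknown tag
    }
    return it->second;
}

// Assumed rule: process isolation requires identical builds;
// Hyper-V isolation allows an image that is not newer than the host.
bool can_run(int host_build, int image_build, Isolation isolation) {
    if (isolation == Isolation::Process) {
        return host_build == image_build;
    }
    return image_build <= host_build;
}
```

Under this model, a servercore:1607 image (build 14393) on a Windows Server 2019 host (build 17763) is accepted only with Hyper-V isolation, which is the combination suggested above.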

@awakecoding

@Justin-DynamicD @TBBle wait... are you saying that this issue is not observed when using the older servercore:1607 base image with Hyper-V isolation in Windows Server 2019? In other words... this maddening issue is a regression?

@TBBle

TBBle commented Dec 16, 2020

That's what #37 (comment) says at the end, as far as I understand it.

@awakecoding

@immuzz any update on this issue or the one about graceful shutdowns not being supported (#16)? Are there plans for either supporting containerd + HCSv2 on Windows Server 2019, or fixing HCSv1? I don't mind what the plan is, as long as there is a serious plan to get this fixed.

@Justin-DynamicD

pinging a request for update as well.

@ghost

ghost commented Feb 4, 2021

This issue has been open for 30 days with no updates.
@immuzz, please provide an update or close this issue.

@ghost

ghost commented Mar 6, 2021

This issue has been open for 30 days with no updates.
@immuzz, please provide an update or close this issue.

@ghost

ghost commented Apr 6, 2021

This issue has been open for 30 days with no updates.
@immuzz, please provide an update or close this issue.

@Justin-DynamicD

Justin-DynamicD commented Apr 12, 2021

So it's April with no movement. This is a pretty big deal that simply makes Windows containers unreliable in their current state. Are Windows containers simply DOA? It is simpler to just sub a dockerfile with packer and forget they exist until the app can be ported to linux at this point.

@immuzz

immuzz commented May 3, 2021

Sorry about the delay, folks. Our devs were fixing the bug, and I'm happy to announce that it's been fixed in bindflt. It's part of patch 4c. Please try it and let us know if it's working for you. Otherwise I will close this issue and mark it as fixed in a couple of days. Thanks for being patient.

@awakecoding

@immuzz that's good news! Will this become available through Windows Update on the base Windows Server 2019 OS, or through an update to DockerMsftProvider on PSGallery? I just want to know how to get the fix as soon as it becomes available.

@thecloudtaylor
Member Author

It was released as part of KB5001391 which you can download/install from the catalog now. Next Tue (the 11th) the fix will roll up into the normal patch Tue content and go out through Windows update as well as subsequently updated Azure gallery images when those are available (typically a few days later).

@Kellendros007

Kellendros007 commented May 5, 2021

@thecloudtaylor, @immuzz, unfortunately, I have some bad news. Either I'm doing something wrong or the problem hasn't gone away. I am using windows 10 pro 21h1 with KB5001391 installed and base image:

  1. mcr.microsoft.com/windows/servercore:2004
  2. mcr.microsoft.com/windows/servercore:20H2

@immuzz

immuzz commented May 5, 2021

In order to isolate the issue, have you tried it on Windows Server 2019?

@immuzz

immuzz commented May 5, 2021

@thecloudtaylor just pointed out that the base layer won't get updated until Tuesday (May 11). Could you try again after it's updated and let us know?

@Kellendros007

@immuzz, OK, I will try after Patch Tuesday on Windows 10 Pro and Windows Server 2019.

@tianon

tianon commented May 11, 2021

I just tested with today's updated base on an 1809-based build of https://github.com/docker-library/mongo and it worked! (following the steps in my reproducer in docker-library/mongo#435 (comment))

🎉 🥳

@thecloudtaylor
Member Author

Thank you for the confirmation! I'm going to close this one as resolved!

@Justin-DynamicD

This is fantastic! Thank you for the effort put in this.

@mloskot

mloskot commented May 28, 2021

@Kellendros007

I am using windows 10 pro 21h1 with KB5001391 installed and base image

Are we positive it is wise to use 21H1 hosts yet? See #117
