Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cbfs driver causes 'hcsshim::PrepareLayer - failed failed in Win32: Incorrect function. (0x1)' #42807

Closed
nickwesselman opened this issue Aug 31, 2021 · 18 comments

Comments

@nickwesselman
Copy link

Description

Note: There is a long two-year old discussion, with lots of "me too" comments on this issue at docker/for-win#3884. Re-opening here based on feedback from @stephen-turner.

Starting with Windows 10 1903, having certain 'drive' software installed with certain versions of CBFS drivers causes Docker builds to fail with Windows containers, even with the most trivial Dockerfile.

hcsshim::PrepareLayer - failed failed in Win32: Incorrect function. (0x1)

Steps to reproduce the issue:

  1. Install Box Drive (no need to log in, just install)
  2. Create a simple Dockerfile
FROM mcr.microsoft.com/windows/nanoserver:1809
SHELL ["powershell", "-Command", "$ErrorActionPreference = 'Stop'; $ProgressPreference = 'SilentlyContinue';"]
  1. docker build .

Describe the results you received:

➜  docker build .
Sending build context to Docker daemon  2.048kB
Step 1/2 : FROM mcr.microsoft.com/windows/nanoserver:1809
 ---> f524b7260f3c
Step 2/2 : SHELL ["powershell", "-Command", "$ErrorActionPreference = 'Stop'; $ProgressPreference = 'SilentlyContinue';"]
 ---> Running in 389b868934e6
hcsshim::PrepareLayer - failed failed in Win32: Incorrect function. (0x1)

Uninstalling Box Drive makes the issue go away.

Describe the results you expected:
A successful build.

Additional information you deem important (e.g. issue happens only occasionally):
The company that makes this CBFS Connect driver used by Box and others has apparently released a version with a workaround for this bug. Installing software that contains this updated driver (such as SFTP Drive v2) actually resolves the issue. However Box has not released a fix, and insists that the issue is a regression in Docker and/or Windows itself.

Output of docker version:

Client:
 Cloud integration: 1.0.17
 Version:           20.10.8
 API version:       1.41
 Go version:        go1.16.6
 Git commit:        3967b7d
 Built:             Fri Jul 30 19:58:50 2021
 OS/Arch:           windows/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.8
  API version:      1.41 (minimum version 1.24)
  Go version:       go1.16.6
  Git commit:       75249d8
  Built:            Fri Jul 30 19:54:29 2021
  OS/Arch:          windows/amd64
  Experimental:     false

Output of docker info:

Client:
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Build with BuildKit (Docker Inc., v0.6.1-docker)
  compose: Docker Compose (Docker Inc., v2.0.0-rc.1)
  scan: Docker Scan (Docker Inc., v0.8.0)

Server:
 Containers: 11
  Running: 0
  Paused: 0
  Stopped: 11
 Images: 801
 Server Version: 20.10.8
 Storage Driver: windowsfilter
  Windows:
 Logging Driver: json-file
 Plugins:
  Volume: local
  Network: ics internal l2bridge l2tunnel nat null overlay private transparent
  Log: awslogs etwlogs fluentd gcplogs gelf json-file local logentries splunk syslog
 Swarm: inactive
 Default Isolation: hyperv
 Kernel Version: 10.0 19043 (19041.1.amd64fre.vb_release.191206-1406)
 Operating System: Windows 10 Enterprise Version 2009 (OS Build 19043.985)
 OSType: windows
 Architecture: x86_64
 CPUs: 12
 Total Memory: 31.81GiB
 Name: LT-NWE1-T-US
 ID: KJ3K:R3GY:FLRD:BAJT:4KZW:LFB3:KPOO:JMA7:Y5Y7:UYWE:MPYU:2M7U
 Docker Root Dir: C:\ProgramData\Docker
 Debug Mode: true
  File Descriptors: -1
  Goroutines: 26
  System Time: 2021-08-31T14:21:10.5302762-04:00
  EventsListeners: 1
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
 Product License: Community Engine

Additional environment details (AWS, VirtualBox, physical, etc.):
physical

@emayevski
Copy link

Docker makes a certain request and expects a specific error code to be returned. If there's any other error code reported by the driver, Docker refuses to continue. So, it's not anyone's bug but something that Docker should definitely improve for better compatibility.

@thaJeztah
Copy link
Member

The company that makes this CBFS Connect driver used by Box and others has apparently released a version with a workaround for this bug. Installing software that contains this updated driver (such as SFTP Drive v2) actually resolves the issue. However Box has not released a fix, and insists that the issue is a regression in Docker and/or Windows itself.

I'm curious what their workaround is. Unfortunately their code looks to be proprietary (?), so not sure if it'll be easy to find out exactly what it does. I'm wondering if the software injects itself into some of these code paths, and that's causing the issue.

FWIW, I think the error is coming from this code;

// Get returns the rootfs path for the id. This will mount the dir at its given path.
func (d *Driver) Get(id, mountLabel string) (containerfs.ContainerFS, error) {
logrus.Debugf("WindowsGraphDriver Get() id %s mountLabel %s", id, mountLabel)
var dir string
rID, err := d.resolveID(id)
if err != nil {
return nil, err
}
if count := d.ctr.Increment(rID); count > 1 {
return containerfs.NewLocalContainerFS(d.cache[rID]), nil
}
// Getting the layer paths must be done outside of the lock.
layerChain, err := d.getLayerChain(rID)
if err != nil {
d.ctr.Decrement(rID)
return nil, err
}
if err := hcsshim.ActivateLayer(d.info, rID); err != nil {
d.ctr.Decrement(rID)
return nil, err
}
if err := hcsshim.PrepareLayer(d.info, rID, layerChain); err != nil {

The hcsshim::PrepareLayer - failed failed in Win32: Incorrect function. (0x1) error is coming directly from the Windows Kernel / syscall that's made.

Docker makes a certain request and expects a specific error code to be returned

I had a quick look, but (also see the code above), and at a glance, I don't see specific handling for "specific" errors. In this case, the error indicates that the syscall failed in the kernel, so something went wrong when mounting the layer (at which point, the only thing docker can do is "rollback" (DeactivateLayer) the mount, and print the error that came from the kernel.

/cc @katiewasnothere @kevpar @TBBle are you able to help out on this?

@thaJeztah
Copy link
Member

Perhaps someone could get a trace for this, as was suggested in #42804 (comment)

You could try to get some HCS traces by following https://docs.microsoft.com/en-us/virtualization/windowscontainers/troubleshooting#capturing-hcs-verbose-tracing.

@TBBle
Copy link
Contributor

TBBle commented Sep 2, 2021

This should be simple to reproduce with the wclayer utility from the hcsshim repo, which will help for reproduction and testing. wclayer mount is the Activate/Prepare call, and you could use it pretty easily on the graphdriver directory for a Nanoserver layer for example. (That would also justify punting this issue into the hcsshim GitHub tracker... ^_^)

Has anyone checked this on a later Windows kernel (2004, 20H2, or Server LTSC 2022)? The 1903/1909 kernel has had other issues in the WCOW Layer handling support, i.e. #39524 and microsoft/hcsshim#708. In fact, that bug also manifested as a failure in the PrepareLayer call.

Edit: Ah, I see the actual bug report here is on 2004 (19041 kernel), so it wasn't fixed by that series. Maybe worth checking on LTSC 2022, particularly since the SAC releases are aging out, and if it is an OS issue, are unlikely to see a fix, but Windows Server 2022 would be an important target for such a fix.

It's also possible that the HCSv2 API (i.e. computestorage.dll) doesn't have this issue, but I don't have anything that uses that yet (and we found an API issue in hcsshim, so we can't use it there yet), so it's harder to rule this in or out.

I assume the issue is that the PrepareLayer call is calling an API on the registered Installable Filesystem drivers, and CBFS is being hit before WCIFS (Windows Container Installable Filesystem), which is what PrepareLayer is supposed to be setting up, and returns something unexpected, which is reflected out to the caller.

It's also possible that it's a bug in CBFS that it is somehow interfering with the attempt to activate WCIFS on the layer. It does kind-of depend on the workaround used, and since we can't see the source of either CBFS or vmcompute.dll, that's all speculation.

@Kollgergo
Copy link

Kollgergo commented Sep 3, 2021

Also experienced this issue using pCloud and docker on the same machine.
The issue seems to happen when I try to build a Dockerfile with SHELL instruction as @nickwesselman reported it. In my experience SHELL instruction always fails but other builds not containing SHELL can succeed.

After uninstalling pCloud (and removing the CBFS driver which was left behind...) the issue resolves and the Dockerfile is built.

@TBBle
Copy link
Contributor

TBBle commented Sep 4, 2021

I can confirm that it replicates just with wclayer on my Windows 10 20H2 (19042.1165) desktop box.

Given 32ebba58e25d6508c171d737289fc376095d82d0b026132e903dd32673a88a65 is a Windows container base layer (I happened to have hello-world:latest pulled here, so I think this is a nanoserver base, and then working in some temp directory:

## Box is not installed
> wclayer create tm -l C:\ProgramData\Docker\windowsfilter\32ebba58e25d6508c171d737289fc376095d82d0b026132e903dd32673a88a65
> wclayer mount tm -l C:\ProgramData\Docker\windowsfilter\32ebba58e25d6508c171d737289fc376095d82d0b026132e903dd32673a88a65
\\?\Volume{74a47361-9cf4-4774-8655-a25fb69af17a}
> wclayer unmount tm
## Installed https://e3.boxcdn.net/box-installers/desktop/releases/win/Box-x64.msi here, closed the login window
> wclayer mount tm -l C:\ProgramData\Docker\windowsfilter\32ebba58e25d6508c171d737289fc376095d82d0b026132e903dd32673a88a65
hcsshim::PrepareLayer - failed failed in Win32: Incorrect function. (0x1)

The loaded filesystem filters are unchanged by installing or uninstalling Box, so however CBFS works, it's not a minifilter as I had assumed, or somehow it's causing problems without being loaded.

> fltmc

Filter Name                     Num Instances    Altitude    Frame
------------------------------  -------------  ------------  -----
bindflt                                 1       409800         0
FsDepends                               8       407000         0
WdFilter                                8       328010         0
storqosflt                              1       244000         0
wcifs                                   1       189900         0
gameflt                                 5       189850         0
CldFlt                                  1       180451         0
FileCrypt                               0       141100         0
luafv                                   1       135000         0
npsvctrig                               1        46000         0
Wof                                     6        40700         0
FileInfo                                8        40500         0

I do see it's a kernel mode driver.

A service was installed in the system.

Service Name:  cbfsconnect2017
Service File Name:  C:\WINDOWS\system32\drivers\cbfsconnect2017.sys
Service Type:  kernel mode driver
Service Start Type:  system start
Service Account:  
> sc query cbfsconnect2017

SERVICE_NAME: cbfsconnect2017
        TYPE               : 1  KERNEL_DRIVER
        STATE              : 4  RUNNING
                                (NOT_STOPPABLE, NOT_PAUSABLE, IGNORES_SHUTDOWN)
        WIN32_EXIT_CODE    : 0  (0x0)
        SERVICE_EXIT_CODE  : 0  (0x0)
        CHECKPOINT         : 0x0
        WAIT_HINT          : 0x0

So I'm assuming it's a legacy filesytem filter driver. However, the docs for that say it should appear in the above fltmc output. I suspect that's because the driver installation is being done with the Service Control Manager (pre-XP) approach.

I uninstalled Box, and installed SFTP Drive V2 Personal Edition, and confirmed that it looks the same (i.e. not in fltmc output) and the repro case passes. So the difference must be in the driver implementation itself.

So I can confirm that the 2019 release of cbfsconnect bundled with Box shows this problem, and the 2020 release of cbfsconnect bundled with SFTP Drive V2 does not.

And it's nothing to do with Docker, it's a conflict between the device driver and the Windows container filesystem support.

Windows 10 1903 introduced some driver model changes, so it's not unbelievable to me that this is what introduced the incompatibility, and since the fix is apparently obvious to the CBFS driver developers in their closed-source product, I can't see any useful way forward except either:

  • CBFS driver developers acknowledge it was a bug or legacy issue in their code and the Box developers upgrade their integration
  • CBFS driver developers usefully document the Windows bug they are working around to Microsoft (perhaps raise it on the hcsshim bug tracker, since there's a simple repro case there), and MS fixes it as appropriate.

@thaJeztah
Copy link
Member

Wow, thanks for the deep-dive into this, @TBBle. Looks like there's not a lot that can be done in this repository.

Does someone on this thread have a subscription for Box Drive or is in contact with the CBFS Connect driver people to get the missing information?

@nickwesselman
Copy link
Author

I don't have a direct relationship but I've reached out to Callback Technologies via some public mailboxes, and asked them to respond here.

@thaJeztah
Copy link
Member

Thanks! Hopefully someone from them is able to respond so that we are able to find a solution to this problem.

@nickwesselman
Copy link
Author

nickwesselman commented Sep 8, 2021

Hey all. Here is what I got back from Callback Technologies. Since I am not familiar with the technical details here and can't discern what's relevant/important, I am re-posting as is.

@TBBle:
So I'm assuming it's a legacy filesytem filter driver. However, the docs for that say it should appear in the above fltmc output…
I uninstalled Box, and installed SFTP Drive V2 Personal Edition, and confirmed that it looks the same (i.e. not in fltmc output) and the repro case passes. So the difference must be in the driver implementation itself.

CBFS Connect is a full filesystem driver, not a filter driver or minifilter. To help with this issue I will first provide some background then talk more about our investigation into the issue.

It is incredibly important when talking about issues involving driver technologies to start from a low-level concrete test case. In this case since we are testing the issue using Docker’s interface to build an image, this is too high-level and does not tell us much about the operations happening underneath the covers. Since our filesystem driver will receive IRPs from the system, from the perspective of the driver it is not always clear the specific sequence of events which cause an IRP to be sent to us from the I/O manager. In order to comprehend the sequence of events which leads to an IRP being generated by I/O manager, it is necessary to know all the user-mode operations involved.

Setting that aside for the time-being, in this case, we were able to analyze the test case and saw the same behavior as others: when building an image, the parameter is incorrect error is reported when older version of cbfsconnect2017 driver is installed and does not get reported when a newer version of cbfsconnect2017 is installed. At this point what we can say is some specific sequence of events triggers the Windows I/O Manager to send us an IRP_MN_MOUNT_VOLUME request. In the older version of the driver we complete the IRP by failing it with NTSTAUS STATUS_FWP_INVALID_PARAMETER. Our understanding is this is a perfectly valid error code to return when the operation does not make sense.

One thing is for certain, that the MS documentation for this IRP is not entirely clear. It states “if a file system driver receives this IRP for a volume whose format does not match that of the file system the file system driver should return STATUS_UNRECOGNIZED_VOLUME” (https://docs.microsoft.com/en-us/windows-hardware/drivers/ifs/irp-mj-file-system-control ). It stands to reason that other errors are also possible (including invalid parameter). Therefore, it is our understanding nothing strange happening and it seems like either Docker or Windows does not like it when a different error is returned. We haven’t talked about the newer build of the 2017 driver yet.

After the point we return this status, either Windows or Docker must be checking the code and reporting the error to the higher layers. The reason the issue goes away when installing a newer build of cbfsconnect2017 is not due to a bug or problem on our side, but we indeed made a change to workaround this side-effect reported a year ago. External logic which checks the error codes being returned is not something we control. We will now return the STATUS_UNRECOGNIZED_VOLUME value all the time. This change is present in the driver version used by SFTP Drive V2 going forward (even though we feel the “parameter is incorrect” status more accurately fits the bill). It is worth noting that in 15 years, the return codes have not caused problems for others and Docker seems to be the only one failing on it.

In short, there are two necessary steps to move forward from here. First, Box developers really should contact us and request an update to alleviate the problem for their users. This will come with drawbacks for them, as Microsoft changed signing requirements which will impact the application’s compatibility with older versions of Windows and non-updated versions of Windows. Second, Microsoft should clarify whether completing the IRP_MN_MOUNT_VOLUME request with a failure via invalid parameter status is okay. If the error is fine, which we suspect it is, the onus is on Docker or Windows to provide the logic that handles this case. To reiterate, our code just returns the error code coming from below, which is a perfectly valid strategy. We indeed made a change between the two versions being tested here; however, the observations with respect to the error going away do not demonstrate a problem exists in our code, as it is really the upper caller’s job to determine what to do with the error code that gets returned.

I hope this helps! Feel free to repost what you feel necessary to Docker’s issue tracking system.

@TBBle
Copy link
Contributor

TBBle commented Sep 9, 2021

Interesting. My reading of that doc:

Indicates a volume mount request. If a file system driver receives this IRP for a volume whose format does not match that of the file system, the file system driver should return STATUS_UNRECOGNIZED_VOLUME.

is different to the CBFS developers, my take would be that the only valid response to IRP_MN_MOUNT_VOLUME in the case of "That's not my volume" is STATUS_UNRECOGNIZED_VOLUME, and that any other error code indicates "That is my volume, and something else is wrong".

So the apparent behaviour in HCS or below (not Docker this is happening in a Windows-internal component as I demonstrated with wclayer earlier) seems reasonable to me based on the docs:

for each filesystem driver:
 result = ioctl(driver, IRP_MN_MOUNT_VOLUME, ...)
 if result == `STATUS_UNRECOGNIZED_VOLUME`: continue # Wrong driver, keep trying
 return result

That seems to be the behaviour described in the How the Volume is Recognized IFS docs, which suggests this routine is actually happening even lower-level than HCS, as a core Windows response to "a request to open a file on a logical volume (that is, a partition or dynamic volume)".

In that case, I'm not sure why the old CBFS driver didn't cause similar problems elsewhere. Maybe something HCS is doing was causing ordering differences in the filesystem driver search, or normally an earlier driver in the list (NTFS) recognises the volume and so CBFS wasn't being called in the general user case. Perhaps we'd see the same problem if I installed Box and then tried to mount an extfs2 volume for which there is no valid filesystem driver installed, getting "invalid parameter" instead of "no filesystem found".

It's also possible that this volume mount is not supposed to find a driver, because HCS is going to apply the WCIFS minifilter before the volume can be mounted by the NTFS driver, but the error from CBFS is aborting that process with a response of "Actually, it's a CBFS device but it's faulty" and HCS aborts so it doesn't corrupt someone else's volume data.

The "How the Volume is Recognized" docs don't mention precisely how the driver is to signal "format does not match", but STATUS_UNRECOGNIZED_VOLUME is what the IRP_MJ_FILE_SYSTEM_CONTROL docs say for "a volume whose format does not match that of the file system", which I read as being the same thing.

So yeah, this looks like possibly a documentation ambiguity, and I'd suggest that when implementing an API, it be implemented by the strictest-possible reading where multiple readings exist, as the updated version of CBFS now does.

That said, since this problem appeared in Windows 10 1903, perhaps MS changed this routine to be narrower about the error codes that mean continue at that point, not realising that developers had read it differently than I did. Before that point, perhaps any error from IRP_MN_MOUNT_VOLUME meant continue, and it would only be the last filesystem's failure that bubbled up.

That said, someone else observed the same error-code behaviour in 2018 on Windows 10 17134 (1803 SAC):

I/O manager sends IRP_MN_MOUNT_VOLUME to FastFat file system, FastFat send IOCTL_DISK_CHECK_VERIFY to WinFsp virtual disk, WinFsp virtual disk returns STATUS_INVALID_DEVICE_REQUEST, FastFat returns STATUS_INVALID_DEVICE_REQUEST too, I/O manager doesn't attempt to ask other file system any more.
If the file system doesn't recognize disk device, the file system need returns STATUS_UNRECOGNIZED_VOLUME, I/O manager will attempt to ask other file system, like WinFsp.

referencing the same IRP_MN_MOUNT_VOLUME documentation, so perhaps it is HCS that got narrower in what it expects to receive from the mount call, and what it considers a failure, or perhaps a larger logic change in how HCS mounts volumes.

@emayevski
Copy link

is different to the CBFS developers, my take would be that the only valid response to IRP_MN_MOUNT_VOLUME in the case of "That's not my volume" is STATUS_UNRECOGNIZED_VOLUME, and that any other error code indicates "That is my volume, and something else is wrong".

That's the problem - someone in MS was trying to fit multiple different pieces of information into one variable. It would be more logical to have a separate parameter for an indicator of the volume recognition and reserve the result code for possible errors.

But that's a different story.

@thaJeztah
Copy link
Member

In that case, I'm not sure why the old CBFS driver didn't cause similar problems elsewhere.

Yes, curious about that as well; from the description it seems like there's a chance it may fail / cause errors in different scenarios as well.

referencing the same IRP_MN_MOUNT_VOLUME documentation, so perhaps it is HCS that got narrower in what it expects to receive from the mount call, and what it considers a failure, or perhaps a larger logic change in how HCS mounts volumes.

perhaps @katiewasnothere or @kevpar can help shine a light on this (or know who would be able to); and/or help improving the Windows documentation if it's ambiguous.

That's the problem - someone in MS was trying to fit multiple different pieces of information into one variable. It would be more logical to have a separate parameter for an indicator of the volume recognition and reserve the result code for possible errors.

I'm not familiar with the overall Windows APIs in this respect, but it's a quite common pattern to use sentinel errors to control flows. In this case, (deducting from the above), I think the intent is;

  • "no error": the driver successfully handled the request, so no need to continue with other drivers
  • STATUS_UNRECOGNIZED_VOLUME: (sentinel error): the driver did not handle the request, continue with next driver
  • any other (non-sentinel) error: the driver was able to handle the request, but something failed. the error returned may be specific to this driver.

@thaJeztah
Copy link
Member

From docker/for-win#3884 (comment)

The issue should be resolved now. The CBFS Connect developers mention a fix in v20.0.7921, released on September 8, 2021.

Software will need to update their CBFS Connect driver, though.

@nickwesselman
Copy link
Author

Last email from Callback on this:

Thanks for the update. It sounds like the direction is good and we don’t have anything further to add. The return status in newer builds does align with TBBle’s more strict interpretation. Let us know however we can assist.

It seems there is nothing to fix in Windows/moby/Docker but rather this is now on Box to incorporate the updated driver. Thanks all for the thorough investigation, I'll close and continue my year+ effort to get Box to incorporate the fix. 😬

@thaJeztah
Copy link
Member

Thank you!

@Hish15
Copy link

Hish15 commented Oct 21, 2021

Hi @nickwesselman, How is your year+ fight going on?
I think I do have the problem described here because of Box :/

Regards

@nickwesselman
Copy link
Author

@Hish15 Still waiting on Box. See above though for options on other software that you can install which may get you the updated driver that you need.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants