HPE team - found a stack override problem on libusb-win32 #19

JamesHuangHPE · 2021-08-11T09:24:42Z

Hi Peter,

This is James from HPE. We recently found a stack overwrite problem in the libusb-win32 driver.
https://sourceforge.net/projects/libusb-win32/

Please review below for details and let us know your opinion and suggestions.
It would be great if you can help us to fix this issue.
I sent you an email about this around 8/5.

The problem is the irp->UserIosb pointer member, which points to the local io_status that is allocated on the stack.

The kernel updates the two-qword io_status, which originally lives on the stack of call_usbd_ex. If call_usbd_ex has already returned when the update happens, the kernel instead stomps the stack of whatever routine is using that memory by then.

Here’s a lot more detail from libusb_driver.c:

NTSTATUS call_usbd_ex(libusb_device_t *dev, void *urb, ULONG control_code,      <-- see note 1
                                    int timeout, int max_timeout)
{
       KEVENT event;
       NTSTATUS status;
       IRP *irp;
       IO_STACK_LOCATION *next_irp_stack;
       LARGE_INTEGER _timeout;
       IO_STATUS_BLOCK io_status;                                                 <-- see note 2
 
       if (max_timeout > 0 && timeout > max_timeout)
       {
              timeout = max_timeout;
       }
       if (timeout <= 0)
              timeout = LIBUSB_MAX_CONTROL_TRANSFER_TIMEOUT;
 
       KeInitializeEvent(&event, NotificationEvent, FALSE);
 
       irp = IoBuildDeviceIoControlRequest(control_code, dev->target_device,     \
              NULL, 0, NULL, 0, TRUE,                                             > see note 3
              NULL, &io_status);                                                 /
 
       if (!irp)
       {
              return STATUS_NO_MEMORY;
       }
 
       next_irp_stack = IoGetNextIrpStackLocation(irp);
       next_irp_stack->Parameters.Others.Argument1 = urb;
       next_irp_stack->Parameters.Others.Argument2 = NULL;
 
       IoSetCompletionRoutine(irp, on_usbd_complete, &event, TRUE, TRUE, TRUE);
 
       status = IoCallDriver(dev->target_device, irp);                          \_ returns STATUS_PENDING so need to wait for irp completion
       if(status == STATUS_PENDING)                                             /
       {
              _timeout.QuadPart = -(timeout * 10000);
 
              if(KeWaitForSingleObject(&event, Executive, KernelMode,            \_ wait times-out after specified 500ms  --  see note 4
                     FALSE, &_timeout) == STATUS_TIMEOUT)                        /
              {
                     USBERR0("request timed out\n");
                     IoCancelIrp(irp);                                           <-- attempt to cancel the irp because of the time-out
              }
       }
 
       /* wait until completion routine is called */
       KeWaitForSingleObject(&event, Executive, KernelMode, FALSE, NULL);       <-- wait (with infinite time-out) for completion
 
       status = irp->IoStatus.Status;                                           <-- see note 5
 
       /* complete the request */
       IoCompleteRequest(irp, IO_NO_INCREMENT);                                 <-- see note 6
       
       USBDBG("status = %08Xh\n",status);
       return status;                                                           <-- see note 7
}

Note 1:
The call_usbd_ex() routine runs constantly, but exactly every 60 seconds one particular USB request is sent that always times out after 500 ms and fails:
2: kd> dt _URB ffff868d`0be18570 UrbControlVendorClassRequest.Value
USBPORT!_URB
+0x000 UrbControlVendorClassRequest :
+0x082 Value : 0x3fd <-- 0x3fd = 1021 is a unique usb vendor request code that always times-out
Eaton should be able to tell us what this 0x3fd (1021) request is, why it is sent every 60 seconds, and what might cause it to always time out.
We should check a “working” system to see whether this request is sent there and whether it always times out and fails.
If this request were not sent, or if it did not time out and fail, we would not have a problem.

Note 2:
The 2 qword IO_STATUS_BLOCK io_status is allocated as a local on the stack of the current routine:
2: kd> dt IO_STATUS_BLOCK Status Information
libusb0!IO_STATUS_BLOCK
+0x000 Status : Int4B <-- low dword of first qword will be updated with ntstatus 0xc0000120 (The I/O request was canceled)
+0x008 Information : Uint8B <-- second qword will be updated with the byte count of the transfer which will be 0x00000000’00000000
This is the block that gets updated, or stomped, on the stack. The result depends on timing:

* If the writes happen before this routine returns, this is just a normal status block update.
* If the writes happen after this routine returns, they stomp the stack, possibly that of some other routine; timing and stack usage determine whether it crashes.

Note 3:
The irp is created here. Notice that the &io_status ptr is passed as an arg and it becomes the irp member UserIosb ptr:
2: kd> dt nt!_IRP IoStatus UserIosb
+0x030 IoStatus : _IO_STATUS_BLOCK
+0x048 UserIosb : Ptr64 _IO_STATUS_BLOCK <-- ptr to local io_status allocated on this routine’s stack
The irp that’s created has the following flags:
2: kd> dt nt!_IRP 0xffffe486`c1d56990 Flags
+0x010 Flags : 0x40060000 <-- 0x40000000 => FILE_FLAG_OVERLAPPED => it can be asynchronous (or synchronous)
When the kernel is later called to complete the request, it copies IoStatus into the block that UserIosb points to (i.e. io_status), which holds the final status info for the irp:

* This is what can stomp the stack of another routine, depending on when it happens, i.e. whether the completion is synchronous or asynchronous.
* If it is async, the stack gets stomped later, but there is only a crash if another routine is actively using those stack addresses at that time.

Either of the following works around the problem:

* If io_status is defined as a global (instead of as a local on the stack), the stack will not get stomped.
* If io_status is replaced with NULL, the kernel will not update the user status (*UserIosb) and the stack will not get stomped.
  o Note: I don’t think it is documented that this ptr can be a NULL ptr, but I verified that this works (using kd to hack the assembly code).
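To make the copy step in Note 3 concrete, here is a small user-space model in plain C. It is only a sketch of the behavior described in this report, not Windows kernel code: the `sketch_*` names and types are hypothetical stand-ins for IRP.IoStatus and IRP.UserIosb, and the NULL case mirrors what was observed with kd rather than anything documented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Status value quoted in the report ("The I/O request was canceled"). */
#define SKETCH_STATUS_CANCELLED 0xC0000120u

/* Hypothetical stand-in for IO_STATUS_BLOCK: a status plus a byte count. */
typedef struct {
    uint32_t status;
    uint64_t information;
} sketch_iosb_t;

/* Hypothetical stand-in for the two IRP members discussed in Note 3. */
typedef struct {
    sketch_iosb_t  io_status;  /* models IRP.IoStatus (lives inside the IRP) */
    sketch_iosb_t *user_iosb;  /* models IRP.UserIosb (caller memory, may be NULL) */
} sketch_irp_t;

/*
 * Models the final completion step: copy the IRP's own status block into
 * the caller-supplied block. Returns 1 if caller memory was written and
 * 0 if the pointer was NULL and nothing was touched.
 */
int sketch_finish_request(sketch_irp_t *irp)
{
    assert(irp != NULL);
    if (irp->user_iosb == NULL) {
        return 0;  /* workaround case: no write, so nothing can be stomped */
    }
    *irp->user_iosb = irp->io_status;  /* two writes: status and information */
    return 1;
}
```

The point of the model is that the write goes through a raw pointer with no lifetime attached to it. Whether that write is safe depends entirely on whether the memory behind `user_iosb` still belongs to the caller when the copy happens.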

Note 4:
This wait times out after the specified 500 ms (i.e. the request/irp event is not signaled within 500 ms).
If I hack the assembly with kd to make it an infinite wait, then it works:

* This means that this USB request (i.e. the specific one that happens every 60 seconds) is taking far too long.
* Eaton should be able to tell us what this vendor _URB request (value 0x3fd / 1021) is and why it takes so long.
* We should also check a “working” system to see whether the same request is sent every 60 seconds and whether it also times out.
* i.e. if this request is not being sent and/or not timing out (or we can stop it), the system will not fail.
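For reference, the 500 ms wait above is passed to KeWaitForSingleObject as a negative count of 100-nanosecond units, where a negative value means a relative timeout. The sketch below shows that conversion in portable C. The function name is hypothetical, and the widening cast is a suggestion rather than something taken from the driver: the listing computes `-(timeout * 10000)` with an `int`, which with a 32-bit `int` would overflow above roughly 214,748 ms. Whether the driver's maximum timeout can get that large is not stated in this thread.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Convert a timeout in milliseconds to a relative wait interval in
 * 100-nanosecond units (1 ms = 10,000 units). The result is negative
 * because negative intervals are interpreted as relative to "now".
 * The multiplication is done in 64 bits to avoid 32-bit overflow.
 */
int64_t relative_timeout_100ns(int timeout_ms)
{
    assert(timeout_ms > 0);
    return -((int64_t)timeout_ms * 10000);
}
```

With this helper, 500 ms becomes -5,000,000 units, and a 300,000 ms timeout becomes -3,000,000,000, which no longer fits in a 32-bit signed integer.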

Note 5:
The “status = irp->IoStatus.Status” should really be “status = irp->UserIosb->Status” (where the final status is copied into) but this doesn’t cause a problem.
It does mean that irp->UserIosb is not being used here.

Note 6:
This call (IoCompleteRequest) tells the kernel to finish and clean up the irp. This includes the kernel executing nt!IopCompleteRequest to copy IoStatus into *UserIosb:

* This can happen synchronously by direct calls (w/o an APC int), as shown in one of the callstacks earlier in this email.
* This can also happen asynchronously sometime later via an APC int, as shown in another of the callstacks earlier in this email.

Note 7:
The call_usbd_ex routine returns here, and the locals on its stack should not be accessed from this point onwards:

* If the irp was handled synchronously, then io_status on the current stack was previously written and will not be accessed again => no problem.
* If the irp is being handled asynchronously, then io_status on the stack might be written sometime after the return (on some other routine’s stack):
  o If the APC int and the write of io_status happen before the penter() check, the stack is stomped w/o problems because the addr is not yet in use.
  o If the APC int and the write of io_status happen after the penter() check, the stack is stomped w/o problems if the addr is not currently used.
  o If the APC int and the write of io_status happen near the penter() check, the args being passed up the stack during that time window get stomped.
    This last case is where we see 1 or 2 stomped args on the stack causing a penter trap or crash,
    i.e. the stack addr of the old qword io_status.Information is stomped w/ 0 and the old dword io_status.Status is stomped w/ ntstatus 0xc0000120.

Best Regards,
James


mcuee · 2021-08-13T04:59:35Z

Sorry, but I do not see any emails on the libusb-win32 mailing list. You need to subscribe to the mailing list in order to post.

mcuee · 2021-08-13T05:00:15Z

@dontech Just wondering if you have some time to take a look at whether this is a real issue or not.

mcuee · 2021-08-13T05:05:22Z

@JamesHuangHPE
Have you tried out the latest snapshot version here? Note that it is an unsigned version, so you may have to use a test machine and enable test signing.
https://sourceforge.net/projects/libusb-win32/files/libusb-win32-snapshots/20190918/
https://docs.microsoft.com/en-us/windows-hardware/drivers/install/the-testsigning-boot-configuration-option

Just wondering if it is possible for you to change to the libusbk driver. It is difficult for us to support the libusb-win32 driver now.

mcuee · 2021-08-13T05:11:39Z

Even for the libusbk driver, we are shipping the old version because of the new Windows 10 kernel driver requirements. We have switched our focus to libusbk.dll and are not updating libusbk.sys. The WinUSB driver should be the way to go.

JamesHuangHPE · 2021-08-13T05:53:20Z

Hi mcuee,

I emailed this address earlier: pdt@dontech.dk

James

JamesHuangHPE · 2021-08-13T06:01:32Z

> @JamesHuangHPE
> Have you tried out the latest snapshot version here? Note that it is an unsigned version, so you may have to use a test machine and enable test signing.
> https://sourceforge.net/projects/libusb-win32/files/libusb-win32-snapshots/20190918/
> https://docs.microsoft.com/en-us/windows-hardware/drivers/install/the-testsigning-boot-configuration-option
>
> Just wondering if it is possible for you to change to the libusbk driver. It is difficult for us to support the libusb-win32 driver now.

The issue also exists in the latest version.

mcuee · 2021-08-13T07:02:54Z

I see. In that case, let's see whether @dontech can come and comment.

mcuee · 2021-08-13T07:05:40Z

@JamesHuangHPE I would strongly encourage you to try out the libusbk driver to see if that helps. libusbk.sys is almost a drop-in replacement for libusb0.sys for most devices. You do not need to change your application, as libusb0.dll still works with libusbk.sys.

You can use the libusbk-inf-wizard to replace the driver.

mcuee · 2021-08-13T07:06:55Z

Ref: libusbk.sys vs libusb0.sys vs WinUSB
http://libusbk.sourceforge.net/UsbK3/usbk_comparisons.html

dontech · 2021-08-15T13:29:17Z

Hello there,

First of all, let me state that I have not worked very much on the driver part of the code, but have merely fixed a few problems here and there. That said, this problem sounds interesting and should probably be fixed.

On Wed, 2021-08-11 at 02:24 -0700, JamesHuangHPE wrote:

> The problem is the irp->UserIosb ptr member which points to local io_status which is allocated locally from the stack.
>
> The kernel updates the 2 qword io_status which is originally on the stack of call_usbd_ex but sometimes the kernel stomps the stack of another routine if call_usbd_ex has already returned when the update happens.

Yeah, it would seem like there is a race condition here. It would seem that the code for some reason has not completed the request after IoCompleteRequest() was called. This is not great for a number of reasons. The corruption you are seeing is only one of the problems; that new requests cannot be issued is likely also a problem. The request must have completed 100% before the function returns.

> Note 1: The call_usbd_ex() routine executes all the time but precisely every 60sec there is a unique usb request that always times-out after 500ms and fails:

Yeah, you probably have a bug on your USB device besides this problem. All USB devices should respond with a STALL handshake, NAK or ACK. Not responding is considered a broken device. I have seen this before with some devices, however. That's probably why the driver has a timeout, which should not be needed if the device was not doing this.

> We should check a “working” system to see if this request is being sent and if it always time-out and fails. If this request didn’t get sent or if it didn’t time-out and fail then we would not have a problem.

The driver should work no matter what. The problem you are seeing must never happen, no matter how badly the device behaves.

> The 2 qword IO_STATUS_BLOCK io_status is allocated as a local on the stack of the current routine

Yeah, so the request must complete before the function returns.

> Note 4: This times-out after a specified 500ms wait/time-out (ie. the request/irp event does not get signaled within 500ms). If I hack the assembly with kd to make it an infinite wait/time-out then it works:

An endless wait is probably not a good idea, as some (broken) devices might hang the system forever.

> * This means that this usb request (ie. the specific one that happens every 60sec) is taking far too long

Yeah, no device should really do this (but some do anyway).

> The “status = irp->IoStatus.Status” should really be “status = irp->UserIosb->Status” (where the final status is copied into) but this doesn’t cause a problem.

Yeah, this is probably just a little wrong but not fatal. We can fix this along with the rest if we find the problem.

> This call (IoCompleteRequest) tells the kernel to finish and do clean-up of the irp. This includes the kernel executing nt!IopCompleteRequest to copy IoStatus into *UserIosb:
> * this can happen synchronously by direct calls (w/o an APC int) as shown in one of the callstacks earlier in this email
> * this can also happen asynchronously sometime later via an APC int as shown in another of the callstacks earlier in this email

I think we need to wait for the async APC somehow, as we must not return before we are sure the request has completed. Making the variable global or NULL is a hack, and would not fix the other side issues.

I found this while looking for examples: https://www-user.tu-chemnitz.de/~heha/oney_wdm/ch05g.htm

What is interesting here is that they wait for an additional event with KeWaitForSingleObject() after calling IoCompleteRequest() (and after IoCancelIrp), and they clear the event manually after each use. So maybe we are simply not waiting for the correct number of signals before assuming the request is complete? It is a bit tricky to work out from the MS documentation what exactly the right thing to do is.

Could you try using the _exact_ same construction as they use in that example? So: Cancel -> Wait -> Reset -> Complete -> Wait.

It would be great to find this one, as I am sure some of the strange blue-screen deaths that some of my clients have reported could be the same thing.

Thanks,

/pedro
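The invariant asked for above (never return while something can still write to a stack-local status block) can be illustrated with a user-space analogy. The sketch below uses POSIX threads, not the WDK: the "device" thread stands in for a request that only finishes once it is cancelled, and every `model_*` name is hypothetical. It is not the driver fix and does not reproduce the Reset -> Complete steps; it only shows the timed wait, the cancel, and the untimed wait that keeps `io_status` alive until the late write has landed.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

#define MODEL_STATUS_SUCCESS   0x00000000u
#define MODEL_STATUS_CANCELLED 0xC0000120u  /* value quoted in the report */

typedef struct { uint32_t status; uint64_t information; } model_iosb_t;

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  changed;
    bool cancel_requested;
    bool completed;
    model_iosb_t *user_iosb;  /* points into the caller's stack frame */
} model_request_t;

/* Stand-in for a device that never answers: it completes only when cancelled. */
static void *model_device_thread(void *arg)
{
    model_request_t *req = arg;
    pthread_mutex_lock(&req->lock);
    while (!req->cancel_requested)
        pthread_cond_wait(&req->changed, &req->lock);
    req->user_iosb->status = MODEL_STATUS_CANCELLED;  /* the "late" write */
    req->user_iosb->information = 0;
    req->completed = true;
    pthread_cond_broadcast(&req->changed);
    pthread_mutex_unlock(&req->lock);
    return NULL;
}

/* Issue the model request, wait up to timeout_ms, cancel, then wait until done. */
uint32_t model_call_with_timeout(int timeout_ms)
{
    model_iosb_t io_status = { MODEL_STATUS_SUCCESS, 1 };  /* stack local */
    model_request_t req;
    pthread_t device;
    struct timespec deadline;
    int rc;

    assert(timeout_ms > 0);
    pthread_mutex_init(&req.lock, NULL);
    pthread_cond_init(&req.changed, NULL);
    req.cancel_requested = false;
    req.completed = false;
    req.user_iosb = &io_status;

    rc = pthread_create(&device, NULL, model_device_thread, &req);
    assert(rc == 0);
    (void)rc;

    /* Absolute deadline for the timed wait. */
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec  += timeout_ms / 1000;
    deadline.tv_nsec += (long)(timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec  += 1;
        deadline.tv_nsec -= 1000000000L;
    }

    pthread_mutex_lock(&req.lock);
    while (!req.completed) {
        if (pthread_cond_timedwait(&req.changed, &req.lock, &deadline) == ETIMEDOUT) {
            req.cancel_requested = true;  /* timed out: ask for cancellation */
            pthread_cond_broadcast(&req.changed);
            break;
        }
    }
    /* Untimed wait: do not leave this frame while io_status can still be written. */
    while (!req.completed)
        pthread_cond_wait(&req.changed, &req.lock);
    pthread_mutex_unlock(&req.lock);

    pthread_join(device, NULL);
    pthread_cond_destroy(&req.changed);
    pthread_mutex_destroy(&req.lock);
    return io_status.status;
}
```

Calling `model_call_with_timeout(20)` returns the cancelled status once the model device has acknowledged the cancel. The design choice being illustrated is that the function's return is gated on the writer being finished, not on the timeout alone.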

mcuee · 2021-08-16T05:37:14Z

@TravisRo If you have some time, please take a look as well.

dontech · 2021-09-20T08:25:46Z

Any update on this?

James, did you try the code I posted?

JamesHuangHPE · 2021-09-21T12:10:12Z

Hi Dontech,

Where did you post the code?
Is there any link?
Thank you.

James

dontech · 2021-09-21T20:51:37Z

Hi James,

By "the code" I meant the suggestion I wrote above on Aug 15.

You will need to modify the driver to try it, but since you have already done that before, it should not be a problem.

Thanks,

/pedro

JamesHuangHPE · 2021-09-22T01:54:32Z

hi Pedro,

Could you please modify the driver and provide it to us for testing?
I did not make the earlier modification myself; it was done by another engineer.
Thank you.

James

dontech · 2021-09-24T09:53:11Z

Hi James,

I am not sure when or if I will have the time for this. Furthermore, I am not sure I can reproduce the problem here.

I will see what I can do, but I cannot guarantee when I will have time for this.

If you need it done within a reasonable time frame, I would suggest you allocate some resources for it.

Thanks,

/pedro

dontech · 2021-09-26T18:40:44Z

Hello again James.

OK, I found a little time for this, and I think the fix works:

f41b963

dontech · 2021-09-26T18:45:12Z

Hi James,

Please review the changes I have made thoroughly.
Please try these binaries. You will need to disable driver signing requirements or re-sign them for testing (using your own inf).

libusb_fix_stack_smash.zip

Thanks,

/pedro

dontech · 2021-09-28T22:30:16Z

James, please update this.

JamesHuangHPE · 2021-09-29T01:50:16Z

Hi Peter,

The HPE SW expert is reviewing the modification.
We should have an update later.
Thank you.

James

dontech · 2021-10-16T07:57:28Z

Hi James,

Any update on this? Does the fix work?

JamesHuangHPE · 2021-10-18T01:39:46Z

Hi Peter,

Sorry for the late reply. We are looking for another way to fix this issue.
I will provide more details once the solution is confirmed.
Thank you for your support.

Best Regards,
James

dontech · 2021-10-18T10:11:52Z

OK, but any update on why this does not work would be appreciated.

dontech · 2021-10-25T01:08:24Z

Should be fixed in https://sourceforge.net/projects/libusb-win32/files/libusb-win32-snapshots/20211025/

mcuee · 2021-10-25T08:19:13Z

@JamesHuangHPE
Please try the latest snapshot and check whether this issue has been fixed. Thanks.

dontech · 2021-11-13T12:18:41Z

James, could you/HPE report back on this?

Let's get this one closed, so we are sure we have it right for the next release.

JamesHuangHPE · 2021-11-14T23:35:09Z

Hi Peter,

Sorry for the late response.
We found another solution in the application software: we fixed the URB request timeout problem that was causing the USB driver to enter the error path that triggers this issue.
Thank you for your update, and please close this case.

Best Regards,
James

dontech · 2021-11-15T00:02:49Z

OK. The fix is still valid, however, as the driver is now safer from BSODs.

Thanks,

/pedro

JamesHuangHPE · 2021-11-15T00:06:21Z

Hi Peter,

Thank you for all your responses and the driver update.
Really appreciate your support.

Best Regards,
James

mcuee added the bug label Aug 16, 2021

This was referenced Oct 19, 2021

Cannot send libusb_control_transfer with zero wLength for libusb0.sys libusb/libusb#1006

Closed

libusb-win32 new snapshot release #23

Closed

mcuee mentioned this issue Oct 30, 2021

libusb0.sys snapshot release and digital signature #24

Closed

JamesHuangHPE closed this as completed Nov 14, 2021

mcuee mentioned this issue Nov 16, 2021

Release plan -- next release 1.4.0.0 #26

Open
