New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HPE team - found a stack override problem on libusb-win32 #19
Comments
Sorry but I do not see any emails in libusb-win32 mailing list. You need to subscribe to the mailing list in order to post. |
@dontech Just wondering if you have some time to take a look whether this is a real issue or not. |
@JamesHuangHPE Just wondering if it is possible for you to change to libusbk driver. It is difficult for use to support libusb-win32 driver now. |
Even for libusbk driver, we are shipping the old version because of the new Windows 10 kernel driver requirements. We have switched the focus on the libusbk.dll and not updating the libusbk.sys. WinUSB driver should be the way to go. |
Hi mucee, I email to this address before ==> pdt@dontech.dk James |
The issue also existing in latest version. |
I see. In that case, let see whether @dontech can come and comment. |
@JamesHuangHPE I will strongly encourage you to try out libusbk driver to see if that helps. libusbk.sys is almost a drop in replacement for libusb0.sys for most of the devices. You do not need to change your application as libusb0.dll is still working with libusbk.sys. You can use the libusbk-inf-wizard to replace the driver. |
Ref: libusbk.sys vs libusb0.sys vs WinUSB |
Hello there,
First of all let me state that i have not worked very much on the
driver part of the code, but have mearly fixed a few problems here and
there. That said, this problem sounds interresting, and should probably
be fixed.
On Wed, 2021-08-11 at 02:24 -0700, JamesHuangHPE wrote:
The problem is the irp->userIosb ptr member which points to local
io_status which is allocated locally from the stack.
The kernel updates the 2 qword io_status which is originally on the
stack of call_usbd_ex but sometimes the kernel stomps the stack of
another routine if call_usbd_ex has already returned when the update
happens.
Yeah, it would seem like there is a race condition here. It would seem
that the code for some reason has not completed the request after
IoCompleteRequest() was called. This is for a number of reaons not
great. The corruption you are seeing is only one of the problems. That
new requests cannot be issued is likely also a problem. The request
must have completed 100% before the function returns.
Note 1:
The call_usbd_ex() routine executes all the time but precisely every
60sec there is a unique usb request that always times-out after 500ms
and fails:
Yeah you probably have a bug on your USB besides this problem. All USB
device sshould respond with a STALL handshake, NAK or ACK. Not
responding is considered a broken device. I have seen this before
however with some devices. Thats probably why the driver has a timeout,
which should not be needed if the device was not doing this.
We should check a “working” system to see if this request is being sent
and if it always time-out and fails.
If this request didn’t get sent or if it didn’t time-out and fail then
we would not have a problem.
The driver should work no matter what. The problem you are seeing must
never happen no matter how bad the device behaves.
The 2 qword IO_STATUS_BLOCK io_status is allocated as a local on the
stack of the current routine
Yeah, so the request must complete before the function returns.
Note 4:
This times-out after a specified 500ms wait/time-out (ie. the
request/irp event does not get signaled within 500ms).
If I hack the assembly with kd to make it an infinite wait/time-out
then it works:
Probably not a good idea with endless wait, as some devices (broken)
might hang the system forever.
* This means that this usb request (ie. the specific one that happens
every 60sec) is taking far too long
Yeah, no device should really do this (but some do anyway).
The “status = irp->IoStatus.Status” should really be “status = irp-
>UserIosb->Status” (where the final status is copied into) but this
doesn’t cause a problem.
Yeah, this is probably just a little wrong but not fatal. We can fix
this along with the rest if we find the problem.
This call (IoCompleteRequest) tells the kernel to finish and do clean-
up of the irp. This includes the kernel executing nt!IopCompleteRequest
to copy IoStatus into *UserIosb:
* this can happen synchronously by direct calls (w/o an APC int) as
shown in one of the callstacks earlier in this email
* this can also happen asynchronously sometime later via an APC int as
shown in another of the callstacks earlier in this email
I think we need to wait for the async APC somehow, as we must not
return before we are sure the request has completed. Making the
variable global or NULL is a hack, and would not fix the other side
issues.
I found this while looking for examples:
https://www-user.tu-chemnitz.de/~heha/oney_wdm/ch05g.htm
What is interresting here is that they wait for an additional event
with KeWaitForSingleObject() after calling IoCompleteRequest() (and
after IoCancelIrp) and they clear the event manually after each use.
So maybe we are simply not waiting for the correct number of signals
before asuming the request is complete ? It is a bit tricky to decode
from MS documentation what the exact right thing to do is.
Could you try using the _exact_ same construction as they use in that
example?
So:
Cancel -> Wait -> Reset -> Complete -> Wait.
I would be great to find this one as i am sure some of the strange
blue-screen deaths that some of my clients have reported could be the
same thing.
Thanks,
/pedro
|
@TravisRo If you have some time, please take a look as well. |
Any update on this? James, did you try the code i posted? |
Hi Dontech, Where did you post the code? James |
Hi James, When I meant "the code" I meant the suggestion above which I wrote on Aug 15. I will need to modify the driver to try it, but since you already have done that before, it should not be a problem. Thanks, /pedro |
hi Pedro, Could you please modify the driver and provide it to us for testing? James |
Hi James, I am not sure when or if i have the time for this. Furthermore, i am not sure i can reproduce the problem here. I will see what i can do, but i cannot guarantee when i will have time for this. If you need it done within a reasonable time-frame, i would suggest you allocate some resources for it. Thanks, /pedro |
Hello again James. OK i found a little time for this, and i think the fix works: |
Hi James,
Thanks, /pedro |
James, please update this. |
Hi Peter, The HPE SW expert is reviewing the modification. James |
Hi James, Any update on this? Does the fix work? |
Hi Peter, Sorry for the late reply, we are seeking some other way to fix this issue. Best Regards, |
OK, but any update on why this does not work would be appreciated. |
@JamesHuangHPE |
James, could you/HPE report back on this? Lets get this one closed, so we are sure we have it right for the next release. |
Hi Peter, Sorry for the late response. Best Regards, |
OK The fix is still valid however, as the driver is now more safe from BSODs. Thanks, /pedro |
Hi Peter, Thank you for all your response and driver updated. Best Regards, |
Hi Peter,
This is James from HPE, we recently found a stack override problem on the libusb-win32 driver.
https://sourceforge.net/projects/libusb-win32/
Please review below for details and let us know your opinion and suggestions.
It would be great if you can help us to fix this issue.
I have sent you an email around 8/5.
The problem is the irp->userIosb ptr member which points to local io_status which is allocated locally from the stack.
The kernel updates the 2 qword io_status which is originally on the stack of call_usbd_ex but sometimes the kernel stomps the stack of another routine if call_usbd_ex has already returned when the update happens.
Here’s a lot more detail from libusb_driver.c:
Note 1:
The call_usbd_ex() routine executes all the time but precisely every 60sec there is a unique usb request that always times-out after 500ms and fails:
2: kd> dt _URB ffff868d`0be18570 UrbControlVendorClassRequest.Value
USBPORT!_URB
+0x000 UrbControlVendorClassRequest :
+0x082 Value : 0x3fd <-- 0x3fd = 1021 is a unique usb vendor request code that always times-out
Eaton should be able to tell us what this 0x3fd (1021) request is, why it gets sent every 60sec, and what might cause it to always time-out.
We should check a “working” system to see if this request is being sent and if it always time-out and fails.
If this request didn’t get sent or if it didn’t time-out and fail then we would not have a problem.
Note 2:
The 2 qword IO_STATUS_BLOCK io_status is allocated as a local on the stack of the current routine:
2: kd> dt IO_STATUS_BLOCK Status Information
libusb0!IO_STATUS_BLOCK
+0x000 Status : Int4B <-- low dword of first qword will be updated with ntstatus 0xc0000120 (The I/O request was canceled)
+0x008 Information : Uint8B <-- second qword will be updated with the byte count of the transfer which will be 0x00000000’00000000
This is what gets updated or stomped on the stack and the results are:
Note 3:
The irp is created here. Notice that the &io_status ptr is passed as an arg and it becomes the irp member UserIosb ptr:
2: kd> dt nt!_IRP IoStatus UserIosb
+0x030 IoStatus : _IO_STATUS_BLOCK
+0x048 UserIosb : Ptr64 _IO_STATUS_BLOCK <-- ptr to local io_status allocated on this routine’s stack
The irp that’s created has the following flags:
2: kd> dt nt!_IRP 0xffffe486`c1d56990 Flags
+0x010 Flags : 0x40060000 <-- 0x40000000 => FILE_FLAG_OVERLAPPED => it can be asynchronous (or synchronous)
When the kernel is later called to complete the request, it copies IoStatus into the ptr contents of *UserIosb (ie. io_status) which is the final status info for the irp:
Either of the following will work-around the problem:
o Note: I don’t think it is documented that this ptr can be a NULL ptr but I verified that this works (using kd to hack the assembly code)
Note 4:
This times-out after a specified 500ms wait/time-out (ie. the request/irp event does not get signaled within 500ms).
If I hack the assembly with kd to make it an infinite wait/time-out then it works:
Note 5:
The “status = irp->IoStatus.Status” should really be “status = irp->UserIosb->Status” (where the final status is copied into) but this doesn’t cause a problem.
It does mean that irp->UserIosb is not being used here.
Note 6:
This call (IoCompleteRequest) tells the kernel to finish and do clean-up of the irp. This includes the kernel executing nt!IopCompleteRequest to copy IoStatus into *UserIosb:
Note 7:
The call_usbd_ex routine returns here and the locals on its stack should not be accessed from this point onwards:
o if the APC int and the write of io_status happens before the penter() check then the stack is stomped w/o problems because the addr is not yet in use
o if the APC int and the write of io_status happens after the penter() check then the stack is stomped w/o problems if the addr is not currently used
o if the APC int and the write of io_status happens near the penter() check then the args being passed up the stack during that time window get stomped
this last case is where we see 1 or 2 stomped args on the stack causing a penter trap or crash
ie. stack addr of old qword io_status.Information stomped w/ 0 and old dword io_status.Status stomped w/ ntstatus 0xc0000120
Best Regards,
James
The text was updated successfully, but these errors were encountered: