Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Windows: unable to interrupt blocking hid_read call #133

Open
FrogTheFrog opened this issue Dec 17, 2019 · 28 comments
Open

Windows: unable to interrupt blocking hid_read call #133

FrogTheFrog opened this issue Dec 17, 2019 · 28 comments
Labels
documentation Improvements or additions to documentation Windows Related to Windows backend

Comments

@FrogTheFrog
Copy link

Currently hidapi is not able to close open device handle in a thread-safe manner. This is due to CancelIo function being used instead of CancelIoEx for closing a win32 device handle.

Upgrading to CancelIoEx would cause a WinXP support to be dropped (signal11/hidapi#48), however a fallback could be used to try and load CancelIoEx dynamically to keep a somewhat buggy WinXP experience (signal11/hidapi#208).

Something needs to be done about this issue as hidapi is being used in multi-threaded environment more and more. I would personally propose to drop WinXP support altogether as there is no point in beating a dead horse, however I would like to hear what you guys think.

@Youw
Copy link
Member

Youw commented Dec 17, 2019

I don't think it is safe to assume hid_close as a thread-safe function at all.

Any client code has to stop any read/write attempts before calling hid_close.

What do I miss here?

@FrogTheFrog
Copy link
Author

FrogTheFrog commented Dec 18, 2019

Let's say you have 2 threads (both are using some kind of wrapper to make sure that access to hidapi is synchronized via mutex or something else). First one is doing some thread blocking read operation which puts whole thread on sleep, and the second one is trying to abort this read. The only way to abort is via hid_close in second thread.

Currently hid_close uses CancelIo to abort this read operation in the first thread, however CancelIo can only be used in the same thread that started the read operation. Thus, we are stuck. Most users don't care and call hid_close on second thread anyway, which sometimes crashes. Whereas, CancelIoEx supports the scenario I've just described.

P.S. Pardon the grammar, just woke up.

@Youw
Copy link
Member

Youw commented Dec 18, 2019

here is a pseudo-code, that works without issues and flexible for various use-cases:
thread1:

while(!interrupted) {
  result = hid_read_timeout(dev, buffer, buffer_size, 1000);
  if (result == 0) // no data available
    continue;
  if (result > 0) {
    // handle data here
  }
  else {
    // handle/report error, retry, etc.
  }
}

thread2:

interrupted = true;
thread1.join();
// now we're sure no other threads are trying to access the device,
// assuming there is no threads, except thread1
hid_close(dev);

such usage conforms to my previous statement, that client code should first stop any read/write attempts before trying to close the device as hid_close cannot not be thread-safe by design, even it it is sometime, for some implementations

@Youw
Copy link
Member

Youw commented Dec 18, 2019

@todbot I see that there is a good bit of discussion under node-hid. I'm not sure what is the node-hid implementation/use-case and don't want to misunderstand the discussion.

I'd very like to hear your input on this.

My point stands that tuning hid_close conceptually isn't best idea because of more general statement:
If there is a manageable resource, that needs to be freed at some point, using such resource after it being freed (including the start of "free" procedure) cannot be a well-defined behavior in general.

@todbot
Copy link
Contributor

todbot commented Dec 18, 2019

Hi @Youw, I believe I'm in agreement with you now, after researching things a bit more and doing some tests. Thank you @FrogTheFrog for helping me understand node-hid and hidapi code better because of your issue & PR submissions.

Node.js is usually single-threaded so the "read on a closed handled" issue doesn't come up much. However, node-hid does spawn a libuv worker thread when using the device.on('data', ...) event handler. (In my application code I don't use this technique and the worker thread code I inherited I've not inspected much.) Suffice to say, node-hid was doing a blocking hid_read() in a separate thread and thus could lead to a race condition. I've now changed it to use hid_read_timeout() like you describe above.

On my main use cases MacOS and Linux hidraw, this issue seemingly never came up. I guess both of their hid_read() implementations exit quickly/gracefully when the underlying handle goes away. Windows not so much it seems. If CancelIOEx also deals with this more gracefully, it's a reason I could argue we should switch from CancelIo to CancelIoEx. I know nothing about Windows coding though.

@Youw
Copy link
Member

Youw commented Dec 18, 2019

deals with this more gracefully

I believe it cannot deal gracefully by design.

I've checked both hidraw and macOS implementations: there is always a race condition when hid_close is called during blocking hid_read call.

The fact that you don't hit any issues is just you being lucky. It doesn't mean it is a well defined behavior.


Instead, I'd rather add an explicit check into each hid_close implementation: if there is any active call on this device - fail fast/crash the process/etc., as it is a design bug and undefined behavior.
Yes it sound like "too much", and I don't really going to do it :)


What we do need here, as a solution:

  • explicitly mention in the documentation, that hid_close is not, and cannot be a thread-safe function;
  • fix an actual problem here: add the ability to interrupt blocking hid_read, like a new function int hid_interrupt_read(hid_device *dev), so users aren't forced to use a "timeout" version, and can explicitly shutdown all calls, before calling hid_close;

P.S.: yes, the implementation of int hid_interrupt_read(hid_device *dev) on Windows likely would use CancelIOEx as an underlying mechanism, but that I think is a separate discussion.

@todbot
Copy link
Contributor

todbot commented Dec 18, 2019

Agreed, I think I was getting lucky (or at least the underlying system calls on MacOS and hidraw are more tolerant to bad usage). And I might put in that hard-crash check in a local copy of hidapi to help diagnose any bad usage I may be doing. :)

The addition of a hid_interrupt_read() sounds great. For node-hid I think hid_read_timeout() will work fine. And if we need testers for hid_interrupt_read() I'll be glad to try it out.

@mcuee mcuee added the Windows Related to Windows backend label Dec 19, 2019
@FrogTheFrog
Copy link
Author

@Youw, @todbot Sorry for the lat reply, but I agree with you both. I have nothing to add to this discussion 😄

@Youw Youw changed the title Windows does not support multi-threading Windows: unable to interrupt blocking hid_read call Dec 26, 2019
@Youw Youw added this to In progress in v0.11 Release Feb 9, 2020
@Youw Youw moved this from In progress to To do in v0.11 Release Feb 9, 2020
@mcuee
Copy link
Member

mcuee commented Jan 26, 2021

Similar discussion on libusb.
libusb/libusb#703

From Chris in libusb/libusb#297

Closing a device while there are outstanding transfers is 100% an application error. Handling this gracefully has been discussed in various github libusb discussions.

@dontech
Copy link

dontech commented Nov 29, 2023

I found the same while looking at this: #650

Suggestion: Why not state that hid_close is thread safe to call? There are plenty of synchronization APIs on all platforms supported to ensure this. What supported system does not have this?

E.g. on Windows we can just add a simple critical section which is then entered during hid_read_timeout() (and write), and released when calling GetOverlappedResult(). hid_close() then enters the same critical section. Then there is no potential race.

@Youw's current comment that the current solution is not safe is true (and relies on luck :-)), but forcing people to use use short timeouts to workaround an unsafe API is not cool either. On a tight system, I expect to be able to call read() with a timeout of forever with any problems, as this is the most efficient implementation.

When using hidapi, as a user i expect hid_close() to be thread safe, just like when calling close() on a file. Anything else, even if documented, would be illogical to a typical application programmer.

@dontech
Copy link

dontech commented Nov 29, 2023

I made a quick patch for win32 to show this concept. This also has the side-benefit of actually making the entire library actually thread safe (for win32 for now).

169c524

Probably needs a bit more work, but i am fairly sure this will work.

@dontech
Copy link

dontech commented Nov 29, 2023

Instead, I'd rather add an explicit check into each hid_close implementation: if there is any active call on this device - fail >>fast/crash the process/etc., as it is a design bug and undefined behavior.

That suggestion has the same problem as the initial problem. How would you know you had something pending, if you did not utilize any O/S synchronization? You cannot, for the same reason you pointed out to begin with. Trying to "guess" if something was pending would rely again on being "lucky", and would be just as broken as the mechanism you are trying to fix.

There is only one way to make it air-tight, and that is by implementing some kind of synchronization mechanism. If you then use that mechanism to accept or reject is then up to you. I would argue that if synchronization is implemented, why not just upgrade the library to be thread safe?

@Youw
Copy link
Member

Youw commented Nov 29, 2023

just like when calling close() on a file

close on a file is not thread-safe and never has been. Only because of how the inode/descriptors work on Linux, it is sometime looks like it is safe, but you end up in some weird cases, when you're successfully finished ther read after the close is completed which is technically possible on Linux but logically an insanity from a good good-engineering/sanity perspective.

So is the case with hid_close - it shouldn't be.

It is technically not possible to have a thread-safe *close functions w/o external syncronisation.
If you think otherwise - you're not the first one, and it has been proven wrong long before.

169c524 - right, so you're forcing a non-free overhead for all users that don't need it at all.

but forcing people to use use short timeouts to workaround an unsafe API is not cool either

This is the solution: #146
Contributions are welcome.

solution is not safe is true (and relies on luck :-))

There is no "not safe solution". There is either a solution or no solution at all/undefined behavior.

The current behavior, when the things just hang - that is a good thing. That gives you the ability to catch it early and implement a stable workaround instead of having a nightmare catching rarely-occuring undefined-behavior/crashes later when you product hits thouthands of users in the world.

@dontech
Copy link

dontech commented Nov 29, 2023

right, so you're forcing a non-free overhead for all users that don't need it at all.

I am not sure I understand the logic here. So using a few cycles using efficient locking is a problem? It will not be visible in any real-life application as locking is some of the most efficient O/S operations.

But constantly wasting millions of instructions by having short timeout to get a round a design issue is considered a good thing? This is on the other hand very visible in real-life applications.....

@dontech
Copy link

dontech commented Nov 29, 2023

This is the solution: #146
Contributions are welcome.

Nope. Same problem again. How will you define when it is safe to call hid_interrupt_read()?
You claim that the library is not meant to be thread safe, but at the same time define hid_interrupt_read() which only has a purpose when called asynchronously, implying that it would need to be thread safe. Contradiction.

@Youw
Copy link
Member

Youw commented Nov 29, 2023

It will not be visible in any real-life application as locking is some of the most efficient O/S operations.

Please-please-please read my comment and understand that there is still a race-condition!

constantly wasting millions of instructions

I don't know how you measured the millions of instructions because of the timeout.

to get a round a design issue

Would you rather have a race condition/random crashes on you application?

This is on the other hand very visible in real-life applications

I have multiple applications in production doing this. The only visible down side of this - it taks up-to-1-sec longer when closing the device/application.

@Youw
Copy link
Member

Youw commented Nov 29, 2023

You claim that the library is not meant to be thread safe

hid_interrupt_read would be the first thread-safe funciton in this matter, designed to be called concurently with hid_read.
And of course it will be documented.

And external syncronisation is still will be needed.

@dontech
Copy link

dontech commented Nov 29, 2023

but you end up in some weird cases, when you're successfully finished ther read after the close is completed which is >>technically possible on Linux but logically an insanity from a good good-engineering/sanity perspective.

This is a "weird case" for any application that chooses to call close() in the middle of any file operation. But this is the applications responsibility to handle this and not the underlying library or O/S. With this logic you could strip all thread-safe operations from all system, as you can always argue that the user can mess it up.

@Youw
Copy link
Member

Youw commented Nov 29, 2023

application that chooses to call close() in the middle of any file operation. But this is the applications responsibility to handle this and not the underlying library or O/S.

This basically a definition of "not thread-safe"... And I feel a bit inconsistency between the statement above and a statement below:

as a user i expect hid_close() to be thread safe, just like when calling close() on a file

@dontech
Copy link

dontech commented Nov 29, 2023

Please-please-please read my comment and understand that there is still a race-condition!

I have read you comments. Your argument is based on the fact that you cannot call other functions (read, write, whatever) after calling close. Yes, this is true for hidapi, as it is true for basically any other file-like system. I fully accept this and is it valid, just like any other system. But using this as an argument to avoid thread safety does not seem reasonable.

Having worked with operating system for too many years, I agree that synchronization is not as easy as people make it out to be. However, your statement that it is inherently impossible to thread safe a simple library like hidapi is not something I can agree with. Following that logic all O/S on the planet should be scrapped.

hid_interrupt_read would be the first thread-safe function in this matter, designed to be called concurently with hid_read.
And of course it will be documented.

You cannot in any way guarantee that this will work across multiple O/S without using synchronization. It just might only work on Windows. And then you are back to basically allowing async close() anyway. But why symptom fix this one thing, instead of fixing the overall problem?

Would you rather have a race condition/random crashes on you application?

No. But that is what you have now.... with synchronization the result would be the same every time, except for the case of calling read/write after close(), just like you would expect. Again, calling file operations on a device you just closed is not something I consider a problem, but just a broken app.

I have multiple applications in production doing this. The only visible down side of this - it taks up-to-1-sec longer when closing the device/application.

I have applications that need to shutdown within milliseconds. Hard-coding hidapi to suit your very narrow use-case does not seem reasonable to a wider audience. The only way to solve that would be to use smaller timeouts, and waste millions if instructions. The reason blocking operations are good is because of efficiency.

The current behavior, when the things just hang - that is a good thing. That gives you the ability to catch it early and >>implement a stable workaround instead of having a nightmare catching rarely-occuring undefined-behavior/crashes later >>when you product hits thouthands of users in the world.

That is the whole point of this thread. That is not the current behavior. The current behavior is completely random, due to the obvious races in the code...

@dontech
Copy link

dontech commented Nov 29, 2023

I would still like to know if the current patch (which I just updated) has other races than the one where you erroneously call operations after close. If we accept that (just like any other file system) agree that you cannot call operations after close, I think the propose patch make the entire library thread-safe.

Any input would be great.

@Youw
Copy link
Member

Youw commented Nov 29, 2023

You cannot in any way guarantee that this will work across multiple O/S without using synchronization.

That is the implementation details. HIDAPI already uses syncronisation for some cases where it is unavoidable.

It just might only work on Windows.

Interesting assumption. I wouldn't make such w/o a propper investigation first.

And then you are back to basically allowing async close() anyway

Of course not. The intended scenario is:

  1. Notify the read thread that it needs to stop any read attempts
  2. Call hid_interrupt_read (even if hid_read is called after/confurently - it should immediatelly exit - that's up to HIDAPI implementation)
  3. Wait until read thread is no longer tries to call hid_read (e.g. simply by joining the read therd - has to be done in some cases anyway; alternative syncronisations are possible as well).
  4. Call hid_close - no concurency at this point

No. But that is what you have now....

Of course not. Right now the only doable scenario is as above but w/o step 2. But race condition on this matter.

I have applications that need to shutdown within milliseconds.

Lets make it right. #146

smaller timeouts, and waste millions if instructions.

Yeah, I agree here. If you use timeout of several (one?) ms - that is a significant overhead.

That is not the current behavior. The current behavior is completely random, due to the obvious races in the code...

I stand corrected. What I mean is - this is much easier to reproduce than a possible crash if one only uses CancelIoEx - I've tried that.

I think the propose patch make the entire library thread-safe.

Only on Windows. And I'm not ready to commit to have the entire library thread-safe - at most hid_read(_timeout) + 1 more function.

you cannot call operations after close

We need a more strict definition here: you cannot call operations once close is being called (even before it has finished).


I would still like to know if the current patch (which I just updated) has other races than the one where you erroneously call operations after close.

Lets assume the behavior is as if your patch is accepted. In this case yes, it is safe to call hid_close if we're already inside of the hid_read in a different thread. And I believe that is the only case.

Which means you also need to ensure that once you decide to call the hid_close (including the case when you're already inside of if, but the hid_close is not finished yet) - the other thread cannot attempt to call hid_read. How do you propose to implement that?
I cannot come up with a scenario where that would be possible (w/o delays, etc.).
Please share a pseudo-code that describes the two threads.

@YgorSouza
Copy link

I'd just like to add my 2c and point out a problem I had with the Rust wrapper, which I explained in ruabmbua/hidapi-rs#151. The Rust implementation makes it impossible to call any methods on the device from different threads at the same time, but you can call them sequentially using a mutex. And you don't have access to hid_close at all, it only runs in the destructor.

It turns out that this still causes a segfault on Windows, presumably because the pending overlapped/async operation that was started on the other thread doesn't get canceled (even though hid_read has already returned on timeout), and it tries to access memory you just deallocated. Calling CancelIo from a different thread is the same as not calling it at all. So I think hidapi should switch to CancelIoEx, which did not have the same problem in my tests, and seems like the right function to call in this case, looking at the documentation. If you're going to free the device, you should ensure that nothing else is using it on any thread.

Apparently a side effect of that is that you might be able to stop a blocking hid_read on one thread by calling hid_close on another, which is what this issue here is asking for, but trying to do that is undefined behavior with or without this change, so we can at least try to ensure the RAII wrappers in Rust, C++ etc (which don't let you abuse hid_close like that) are safe to use.

@dontech
Copy link

dontech commented Feb 12, 2024

@YgorSouza I totally agree.

We probably need to work a bit more on the synchronization mechanism to make sure all of @Youw's concerns are met.

I am not sure we can handle the close() scenario properly, but I have a few ideas that could improve it a bit.

@Youw
Copy link
Member

Youw commented Feb 15, 2024

It turns out that this still causes a segfault on Windows, presumably because the pending overlapped/async operation that was started on the other thread doesn't get canceled (even though hid_read has already returned on timeout), and it tries to access memory you just deallocated.

That sounds like a totally different issue. It sounds like it still going to be there even if one switches to CancelIoEx, e.g. if someone uses hid_read, it returns on timeout, and the buffer gets immediatelly deallocated.

I think we could consider fixing it separately.
@YgorSouza can you open a separate ticket, so we can continue the discussion there?
Maybe share a (pseudo-)code to reproduce it with detailed description.

@YgorSouza
Copy link

The only code I have is the one I posted in the Rust issue linked above. My application is similar to the one mentioned in #576 (comment). I have a single dedicated thread that locks the list of open devices and tries to read them with a timeout of 0, then releases the lock and sleeps until the next cycle. And the UI thread can lock the list and add or remove some devices from it.

I did notice that in my Rust reproduce, if I change the timeout to 1ms it doesn't crash, but it also doesn't crash if I leave it at 0 and sleep for 1ms before or after, while holding the lock, so I'm guessing it's just a timing coincidence. It is undefined behavior after all. So maybe just using CancelIoEx is not a perfect solution, and it's just a coincidence that it didn't crash in my tests. The way Microsoft recommends in this example is to call CancelIoEx and, if it returns no error, call GetOverlappedResult to block until the cancellation actually goes through. I tried that and it worked as well.

I know this isn't enough for this issue here, since you'd still need more synchronization to prevent hid_read from using fields that hid_close is freeing, but since this issue mentions CancelIoEx and properly closing the device on Windows, I figured I'd add my notes here.

@mcuee
Copy link
Member

mcuee commented Mar 8, 2024

I tend to think this discussion is quite similar to various discussions in libusb. The users like to use the simple synchronous API but there are limitations.

Reference libusb documentation.
https://libusb.sourceforge.io/api-1.0/libusb_io.html
https://libusb.sourceforge.io/api-1.0/libusb_mtasync.html

Reference libusb Wiki.

What are the differences between Synchronous and asynchronous device I/O?
Please read libusb-1.0 API documentation here. https://libusb.sourceforge.io/api-1.0/libusb_io.html

Please take note that ALL transfers MUST be freed before libusb_exit() is called.

Please take note that ALL libusb objects MUST be released before libusb_exit() is called.

Please also take note that there is no way to cancel Synchronous transfer after the request has been submitted.

@mcuee
Copy link
Member

mcuee commented Mar 8, 2024

Right now we have synchronous API in hidapi: hid_read_timeout, hid_read, hid_write.
https://github.com/libusb/hidapi/blob/master/windows/hid.c

Maybe we can add asyshronous API like hid_interrupt_read and hid_interrupt_write.

And then the synchronous API can be re-written based on the above two asynchronous API, just like libusb.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation Windows Related to Windows backend
Projects
No open projects
Development

No branches or pull requests

6 participants