Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Follow up of #14312: ensure IO completion routines and waitable timers are safe #14627

Merged
merged 28 commits into from Mar 30, 2023

Conversation

LeonarddeR
Copy link
Collaborator

@LeonarddeR LeonarddeR commented Feb 11, 2023

Link to issue number:

Follow up of #14312

Summary of the issue:

In #14312, we decoupled the background I/O thread from the braille module. However, the thread was still bound to IoBase in it's _initialRead method unless you'd override it.
While I hoped the changes in #14312 would decrease crashing of braille, this was indeed the cause. However, crashes are still occurring here and there. I'm pretty sure I know the cause lies in the IO done completion routines that are still executed on the IO thread without ensuring that the instances of the routines still exist. I've seen this causing several access violation errors.
Furthermore, #14312 saved bound methods in a dictionary on the iOThread. While this shouldn't be a big problem, this could cause potential issues with garbage collection (i.e. instances being kept alife forever because the APC of an instance was never called and it would be stuck in the cached apc dictionary).

Description of user facing changes

Hopefully, more stability.

Description of development approach

  1. Added the ioThread as a constructor method to IoBase and derivatives as an optional argument. If not provided (default), the default IoThread is used. It will be added as a weak reference on the IoBase instance. Thereby it is no longer necessary to subclass IoBase or a derivative to override _initialRead to use another thread.
  2. Added IoThread.setWaitableTimer and IoThread.getCompletionRoutine, also ensuring that the function pointers are available when the background thread tries to call them. The function wrapper keeps a weak reference to the method it wraps. Thereby it ensures that a method of a died instance is no longer executed.
  3. IoThread.queueAsApc still stores a strong reference to the wrapped method on its APC, to keep backwards compatibility. This will be changed to weak references starting in NVDA 2024.1. API deprecation logic is in place.
  4. Removed the pre_IoThreadStop extension point. It was added in 2023.1 but never implemented and announced, so I think it is safe to do so.

Testing strategy:

  • Test that hardware I/O still works with an APH mantis Q40. Only the most basic form of testing already proves that IoThread.queueAsApc and IoThread.getCompletionRoutine work, as the queued routines are executed correctly.
  • set receivesAckPackets on the Brailliant driver, observe that debug warnings will be logged indicating that the ACK timer is reset correctly when no ack packets can be handled in time. This proves that IoThread.setWaitableTimer works as expected.
  • Test points above bot with NVDAState._allowDeprecatedAPI True and False

Known issues with pull request:

This creates a new CFunc instance for every called completion routine and therefore for every async read. I think this is the safest method to both avoid crashes and ensure that routines won't be left behind.

Change log entries:

Bug fixes:

  • Several stability fixes to input/output for braille displays, resulting in less frequent access violations or even crashes of NVDA. (Follow up of #14312: ensure IO completion routines and waitable timers are safe #14627)
    For Developers:
  • hwIo.base.IoBase and its derivatives now have a new constructor parameter to take a hwIo.IoThread. If not provided, the default thread is used.
  • hwIo.ioThread.IoThread now has a setWaitableTimer method to set a waitable timer using a python function. Similarly, the new getCompletionRoutine method allows you to convert a python method into a completion routine safely.
    API deprecations:
  • Passing lambda functions to hwIo.ioThread.IoThread.queueAsApc is deprecated. Instead, functions should be weakly referenceable.

Code Review Checklist:

  • Pull Request description:
    • description is up to date
    • change log entries
  • Testing:
    • Unit tests
    • System (end to end) tests
    • Manual testing
  • API is compatible with existing add-ons.
  • Documentation:
    • User Documentation
    • Developer / Technical Documentation
    • Context sensitive help for GUI changes
  • UX of all users considered:
    • Speech
    • Braille
    • Low Vision
    • Different web browsers
    • Localization in other languages / culture than English
  • Security precautions taken.

@LeonarddeR LeonarddeR requested a review from a team as a code owner February 11, 2023 08:19
@AppVeyorBot
Copy link

See test results for failed build of commit 37c02035ea

@LeonarddeR
Copy link
Collaborator Author

Never mind, I think it's too much for 2023.1. I really want to kill this issue with APC's once and for good, and it's probably best to bundle that.

@LeonarddeR LeonarddeR marked this pull request as draft February 11, 2023 18:56
@LeonarddeR LeonarddeR changed the title Follow up of #14312: Better way to allow another ioThread for IoBase Follow up of #14312: Only save weak references of APC methods and ensure IO completion routines are safe Feb 13, 2023
@AppVeyorBot
Copy link

See test results for failed build of commit f66ba17393

@AppVeyorBot
Copy link

See test results for failed build of commit 61f2cba87a

@AppVeyorBot
Copy link

See test results for failed build of commit 366b48eca7

@LeonarddeR LeonarddeR marked this pull request as ready for review February 14, 2023 13:34
@LeonarddeR
Copy link
Collaborator Author

@michaelDCurran and @seanbudd I leave it up to you whether this can go in 2023.1. I think there are some aspects counting against delaying this to 2023.2.

  • I've heard reports from people like @dkager that current alpha versions still crash here and there, especially around going in and out of stand by with a bluetooth display connected
  • In comparison with Decouple BgThread from braille module #14312, this now stores weak references instead of strong references to functions. Strictly spoken this adds an additional API requirement, since references to functions need to be kept alive by the caller

@AppVeyorBot
Copy link

See test results for failed build of commit 400e746420

@burmancomp
Copy link
Contributor

I'm somewhat unsure if I'm writing this to the right place but I thought to ask if you have encountered following errors. I'm using Albatross. Build is based on current main branch code. I think this is rarely occuring issue. I may have encountered this some time ago but this time I investigated log file. When NVDA has run several days (which is common in my case) investigating log file, especially when debug level is in use, may be quite hard due to size of file.

I can attach more detailed log entries from this session if needed.

"..." means skipped lines):

...
ERROR - stderr (13:09:11.080) - hwIo.ioThread.IoThread (10916):
Exception in thread hwIo.ioThread.IoThread:
Traceback (most recent call last):
File "threading.pyc", line 926, in bootstrap_inner
File "hwIo\ioThread.pyc", line 91, in run
OSError: exception: access violation reading 0x053B0058
...
ERROR - eventHandler.executeEvent (13:09:13.063) - MainThread (11860):
error executing event: UIA_elementSelected on <NVDAObjects.Dynamic_UIItemListItemUIA object at 0x013ABF30> with extra args of {}
Traceback (most recent call last):
File "eventHandler.pyc", line 300, in executeEvent
File "eventHandler.pyc", line 101, in init
File "eventHandler.pyc", line 110, in next
File "NVDAObjects\UIA_init
.pyc", line 2089, in event_UIA_elementSelected
File "NVDAObjects_init_.pyc", line 1248, in event_selection
File "NVDAObjects\UIA_init_.pyc", line 2289, in event_stateChange
File "NVDAObjects_init_.pyc", line 1264, in event_stateChange
File "braille.pyc", line 2487, in handleUpdate
File "braille.pyc", line 2274, in update
File "braille.pyc", line 2195, in _updateDisplay
File "braille.pyc", line 2259, in _displayWithCursor
File "braille.pyc", line 2243, in _writeCells
File "braille.pyc", line 2248, in _writeCellsInBackground
File "hwIo\ioThread.pyc", line 58, in queueAsApc
RuntimeError: Thread is not running
...

@LeonarddeR
Copy link
Collaborator Author

These access violation errors are exactly the type of errors this pr wants to fix indeed.

@LeonarddeR LeonarddeR marked this pull request as draft March 25, 2023 08:10
@LeonarddeR LeonarddeR changed the title Follow up of #14312: Only save weak references of APC methods and ensure IO completion routines are safe Follow up of #14312: ensure IO completion routines and waitable timers are safe Mar 25, 2023
@LeonarddeR LeonarddeR marked this pull request as ready for review March 27, 2023 06:28
@LeonarddeR
Copy link
Collaborator Author

@michaelDCurran and @seanbudd I went again over this, ensuring that this is backwards compatible now. Would love it if this could go into 2023.2.

@seanbudd seanbudd added the merge-early Merge Early in a developer cycle label Mar 28, 2023
source/extensionPoints/util.py Show resolved Hide resolved
source/hwIo/base.py Show resolved Hide resolved
source/hwIo/base.py Outdated Show resolved Hide resolved
source/hwIo/base.py Outdated Show resolved Hide resolved
tests/unit/test_hwIo.py Outdated Show resolved Hide resolved
@AppVeyorBot
Copy link

See test results for failed build of commit 9e9742ec38

tests/unit/test_hwIo.py Outdated Show resolved Hide resolved
@AppVeyorBot
Copy link

See test results for failed build of commit 157546c997

@seanbudd seanbudd merged commit e0ba12c into nvaccess:master Mar 30, 2023
1 check was pending
@nvaccessAuto nvaccessAuto added this to the 2023.2 milestone Mar 30, 2023
seanbudd pushed a commit that referenced this pull request Jun 2, 2023
Related to #14899, #14312, #14627
Fixes #14895

Summary of the issue:
Despite several attempts to fix this, NVDA's IoThread can crash without a clear cause.

Description of user facing changes
Less crashes, most likely, as tests indicate that this is the case.

Description of development approach
As proposed by @jcsteh , rather than creating a new function pointer for every APC or completion routine call, use a single internal APC and completion routine and use an internal cache to store the python functions, not the actual APC functions.
seanbudd pushed a commit that referenced this pull request Jul 30, 2023
Fixup of #14924

Summary of the issue:
In #14627, we introduced weak references for APCs called as part of a waitable timer. In #14924, this was made more robust by using a single internal APC func. However in the porting process, a part of the logic was reversed, therefore in the internal APC store, we still stored strong rather than weak references.

Description of user facing changes
None.

Description of development approach
Store references instead of functions in the apc store.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
merge-early Merge Early in a developer cycle
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants