[Spec] VT Sequence for Screen Reader Control #14342

carlos-zamora · 2022-11-04T22:43:00Z

This spec outlines:

- Windows Terminal and ConHost's implementation of the VT sequence introduced in Control Screen Reader from Applications (#18) · Issues · terminal-wg / specifications · GitLab
- potential command-line applications/partners that would benefit from using this VT sequence, and how they could leverage it
- necessary work on screen readers to take advantage of this feature (if any)

References

- Control Screen Reader from Applications (#18) · Issues · terminal-wg / specifications · GitLab
- Unable to use applications that hook the arrow keys using Windows Console Host. #13666

carlos-zamora · 2022-11-04T22:47:49Z

@Tyriar since you wrote the spec in the Terminal WG repo, I'd love to get some feedback and thoughts here. 😊

@j4james you're generally pretty on top of what's going on in the VT sequence world. Curious if you have any thoughts here too.

@codeofdusk I'd also love your feedback since you're a screen reader user and you've contributed to NVDA.

Thanks all.

j4james · 2022-11-05T01:32:36Z

doc/specs/#13666 - VT Sequence for Screen Reader Control.md

+> `OSC Ps ; Pt ST`
+> - `Ps = 2 0 0` -> Stop announcing incoming data to screen reader, `Pt` is an optional string that will be announced immediately. The screen reader will resume announcing incoming data if any key is pressed.
+> - `Ps = 2 0 1` -> Resume announcing incoming data to screen reader, `Pt` is an optional string that will be announced immediately.
+> - `Ps = 2 0 2` -> Announce `Pt` immediately to the screen reader.
+> Note that the reason any key press will force the screen reader to announce again is to prevent situations where applications are terminated while the screen reader is not announcing or where applications are misbehaving.


This is a bit of a nitpick, but it's worth noting that OSC numbers are a finite resource, so a single OSC number for this would be preferable to three. For example, something like OSC 200 ; Ps ; Pt, where Ps differentiates between stop/resume/announce. It also makes it a little easier to extend.
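To make the single-number layout concrete, here is a minimal Python sketch of encoding and decoding `OSC 200 ; Ps ; Pt ST`. It is an illustration only: the `Ps` values 0/1/2 and the helper names `build_osc` and `parse_osc` are assumptions, since the thread does not assign sub-command numbers.

```python
ESC = "\x1b"
ST = ESC + "\\"  # String Terminator (7-bit form)

# Hypothetical sub-commands carried in Ps; these values are placeholders.
STOP, RESUME, ANNOUNCE = 0, 1, 2


def build_osc(ps: int, pt: str = "") -> str:
    """Encode `OSC 200 ; Ps ; Pt ST` as a string."""
    return f"{ESC}]200;{ps};{pt}{ST}"


def parse_osc(seq: str):
    """Return (ps, pt) for a well-formed sequence, or None otherwise."""
    prefix = ESC + "]200;"
    if not (seq.startswith(prefix) and seq.endswith(ST)):
        return None
    body = seq[len(prefix):-len(ST)]
    # Split on the first ';' only, so Pt may itself contain semicolons.
    ps, sep, pt = body.partition(";")
    if not sep or not (ps.isascii() and ps.isdigit()):
        return None
    return int(ps), pt
```

Adding a fourth operation would then only need a new `Ps` value rather than another OSC number.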

My main concern, though, is how this is going to propagate over conpty. Is the idea to just pass it through and hope for the best? What happens if conpty later refreshes part of the display that was originally output with "stop announcing"? Would we then need to rewrap that content with these sequences?

Because if that is something conpty needs to account for, we may be better off with a simple attribute-like sequence similar to DECSCA, which could be recorded in the buffer, and then forwarded over conpty as part of the regular repaint. The "immediate announce" strings would still then need a separate sequence (and that probably could be passed through directly).

I don't know. Just thinking out loud. This may be something you need to prototype and try out before you lock down the exact protocol you're going to use.

lhecker · 2022-11-05

> This is a bit of a nitpick, but it's worth noting that OSC numbers are a finite resource, so a single OSC number for this would be preferable to three. For example, something like OSC 200 ; Ps ; Pt, where Ps differentiates between stop/resume/announce. It also makes it a little easier to extend.

Oh I like that!

> My main concern, though, is how this is going to propagate over conpty. Is the idea to just pass it through and hope for the best? What happens if conpty later refreshes part of the display that was originally output with "stop announcing"? Would we then need to rewrap that content with these sequences?

I always felt like ConPTY recording and then "re-rendering" VT output is quite a bit of a "hack". And because of that, if we ever design a VT sequence, I don't think we should limit ourselves by how hard it'd be to implement inside conhost.
Basically, I'd personally be in favor of whatever optimal / "clean" design we can come up with, unhindered by any design complexities that only we would have to suffer, unlike other, existing UNIX terminals. Long-term VtEngine might just be replaced entirely with something leaner anyways.

j4james · 2022-11-05

> I always felt like ConPTY recording and then "re-rendering" VT output is quite a bit of a "hack". And because of that, if we ever design a VT sequence, I don't think we should limit ourselves by how hard it'd be to implement inside conhost.

I very much agree with this sentiment, but I am concerned we might end up with something we can't actually use. And while I'm still confident we can improve on the current VtEngine, that's a bigger task than I had originally thought, and I don't see that happening anytime soon.

That said, this may end up not being that big a deal in practice. I just wanted everyone to be aware of the potential issues here.

carlos-zamora · 2022-11-08

> be aware of the potential issues here

I'll be sure to mention this in the potential issues part of the spec 😉. I think my stance so far though, is basically the same as my comments on scan mode:

terminal/doc/specs/#13666 - VT Sequence for Screen Reader Control.md

Lines 58 to 69 in 60d3685

### Scan Mode Experience

Three scenarios this VT sequence would make more accessible include:

1. text is being redrawn on top of existing text (i.e. progress bars)

2. prompts where the user must select an option using the arrow keys (i.e. `gh pr create`)

3. supplementary content is displayed with different visual characteristics (i.e. PowerShell suggestions)

Scan mode is a mode where the user can use the screen reader to navigate the text manually. In the scenarios listed above, the user should expect the following experiences when in scan mode:

1. The progress bar should be read in the way it is drawn. The VT sequence data should not be embedded into the terminal because it would be more confusing to read out "10%" and "20%" depending on where the user is scanning the progress bar. Instead, the progress bar should be displayed as "historical" content that had already occurred.

2. The output text should be read in the way it is drawn. Alt text doesn't make sense for this scenario.

3. The output text should be read in the way it is drawn.

In the future, if alt text is functionality the community is interested in, a separate VT sequence should probably be introduced to provide that functionality.

This sequence would be most useful for text being output rather than leaving a "landmark" stored in the buffer. The "alt text" sequence is probably something we would want to tackle separately, if desired and found to be useful.

That said, I'd be concerned about the scenario where the user switches from the alt buffer back to the main buffer. That probably wouldn't work right. Urgh. I'll have to be sure to test that out. (another one for "potential issues" haha) :/

DHowett · 2022-11-08

My core contention with not attaching these OSCs to buffer positions is that a screen reader user reviewing past contents will have a different experience than when the text was originally emitted.

That seems terrible.

codeofdusk · 2022-11-07T06:29:40Z

As I've said before one-on-one, this will be quite tricky to get right. This is a very large change/project.

A few notes:

- I don't like the thought of disabling textChange events in response to new text from this feature. If users are using a UIA client that hasn't adopted this protocol, disabling events might have unexpected results.
  - In particular, if a screen reader (or by extension the user) hasn't opted into receiving events in this way, nothing should change.
- This will require significant buy-in from not just terminal, but also various command-line app and screen reader manufacturers.
  - Check whether ncurses would accept a patch to implement this protocol.
  - JAWS, Narrator, and NVDA cover the Windows screen reader market, but there are various other solutions on other systems (including console-only screen readers such as TDSR and Speakup in the Linux kernel, plus "audio-first" interfaces such as Emacspeak that provide shell access).

CC @josephsl, @LeonarddeR, @michaelDCurran, @tspivey, @tvraman.

Tyriar · 2022-11-07

I put together the proposal in a couple of days as part of the MS hackathon to trigger a conversation on the topic but didn't get the engagement I was hoping for. Since then, I've been casually thinking about the problem every now and then and I'm actually not sure this is the direction we should go because:

- People don't care about a11y enough to add this to programs, shell scripts, etc., which is sad but true. The fact that the issue only got a single comment several months later kind of proved this imo.
- There is extra overhead and un-intuitiveness to adopt in applications run in the terminal. Take HTML for example, where people still either ignore alt text or get it wrong altogether.

Microsoft/PowerShell/etc. certainly cares about this issue deeply, but if possible I'm more interested in solving the problem for everything, rather than just first party and the few CLIs that adopt it. In its current state I don't plan on implementing this in xterm.js/vscode and encourage you not to either.

I think these are the more promising directions to solve the problem that are mostly things we can improve for all CLIs, rather than just programs that add explicit a11y support:

- @textshell's alternative proposals in https://gitlab.freedesktop.org/terminal-wg/specifications/-/issues/18#note_329814
  - Implementing synchronized updates and never updating screen reader content during a sync update.
  - A decorative SGR attribute that ignores the character is a pretty interesting idea, as it's so simple and solves a lot of the problem.
- Make the terminal smarter at detecting the worst offenders that cause the most issues, like progress bars, frequently updated lines, box drawing chars(?), etc., and don't announce them.
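The synchronized-updates idea can be sketched as a small state machine that withholds screen reader events while an update is in progress. This is a rough Python illustration under stated assumptions: the begin/end markers are taken to be DEC private mode 2026 (`CSI ? 2026 h` / `CSI ? 2026 l`), which this thread does not specify, the input is assumed to be already split into tokens, and `SyncAwareAnnouncer` is a hypothetical name rather than anything in Windows Terminal.

```python
BEGIN_SYNC = "\x1b[?2026h"  # assumed "begin synchronized update" marker
END_SYNC = "\x1b[?2026l"    # assumed "end synchronized update" marker


class SyncAwareAnnouncer:
    def __init__(self):
        self.in_sync = False
        self.held = []       # text received during a synchronized update
        self.announced = []  # stands in for events raised to the screen reader

    def feed(self, token: str) -> None:
        if token == BEGIN_SYNC:
            self.in_sync = True
        elif token == END_SYNC:
            self.in_sync = False
            # Raise a single announcement for everything written in the frame.
            if self.held:
                self.announced.append("".join(self.held))
                self.held.clear()
        elif self.in_sync:
            self.held.append(token)
        else:
            self.announced.append(token)
```

Whether the frame should be announced as written, diffed against the previous screen, or dropped entirely is exactly the open design question; the sketch picks the first option only to keep it short.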

> In the future, if alt text is functionality the community is interested in, a separate VT sequence should probably be introduced to provide that functionality.

Not sure what you mean by this, this proposal is essentially alt text, no?

zadjii-msft · 2022-11-07T20:14:57Z

> People don't care about a11y enough to add this to programs, shell scripts, etc. which is sad but true. The fact that the issue only got a single comment several months later kind of proved this imo.

That might be more of an indictment of terminal-wg moreso than demonstrative that people don't care about this issue. There's fundamentally no good way of doing anything like this currently, so I doubt anyone's bothered even thinking about it all that much.

The synchronized updates idea is a great alternative plan here. Perhaps we should work with the winget (@denelon), pwsh, gh folks to explore how using those sequences would modify the screen reader output. Where would they start/end the frame, and what would the resulting screen reader output be like?

lhecker · 2022-11-07T20:27:32Z

> @textshell's alternative proposals in https://gitlab.freedesktop.org/terminal-wg/specifications/-/issues/18#note_329814
>
> Implementing synchronized updates and never update screen reader content during a sync update

Let's say I implement autocompletion for my shell and when you press tab on robocopy.exe it replaces it with Robocopy.exe (since that's the proper capitalization of the executable). If we diff the viewport in a trivial way, this would just read out "R". Or if I tab on roboc and get Robocopy.exe it then would read out "R" and "copy.exe", since those are the only parts that got changed.
So, clearly, we need to announce entire words at least, but technically my shell didn't write "Robocopy" either - it wrote "Robocopy.exe", since that's the full application name. But how do we parse such "words"? Should we modify UAX#29 to recognize "." as being part of a word? What would such a diffing algorithm look like? Is it even feasible in the first place?

Tyriar · 2022-11-07T21:03:21Z

> That might be more of an indictment of terminal-wg moreso than demonstrative that people don't care about this issue

I think you're being a little too optimistic here. During my years of work on terminals, the only interest in or mention of the topic has come from Microsoft people. But sure, some people will want to support it; whether they actually do when it's prioritized against their backlog is another thing.

> If we diff the viewport in a trivial way, this would just read out "R". Or if I tab on roboc and get Robocopy.exe it then would read out "R" and "copy.exe", since those are the only parts that got changed.

@lhecker a diff like that wouldn't make sense, the changed range is R to .exe, so you would read out Robocopy.exe.
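A range-based diff of this kind can be sketched in a few lines of Python: find the first and last positions where the old and new text differ, and announce everything in between. `changed_span` is a hypothetical helper that compares single lines of text only; a real terminal would be diffing buffer rows.

```python
def changed_span(old: str, new: str) -> str:
    """Return the part of `new` between the first and last differing characters."""
    limit = min(len(old), len(new))

    # Length of the common prefix.
    start = 0
    while start < limit and old[start] == new[start]:
        start += 1

    # Length of the common suffix, kept from overlapping the prefix.
    end = 0
    while end < limit - start and old[-1 - end] == new[-1 - end]:
        end += 1

    return new[start:len(new) - end]
```

With this sketch, `changed_span("roboc", "Robocopy.exe")` gives the full `Robocopy.exe`, while `changed_span("robocopy.exe", "Robocopy.exe")` still gives only `R`, so the capitalization-only case is not covered by the range alone.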

lhecker · 2022-11-07T21:51:16Z

> @lhecker a diff like that wouldn't make sense, the changed range is R to .exe, so you would read out Robocopy.exe.

Right, that addresses the latter example. What about the first one? And I really only intended them as examples. I'm sure there's way more examples one could contrive. For instance another one I had in mind was:
What about applications that want to move some text left/right/up/down by a bunch of columns/lines? Or "scroll" some content themselves? Would the terminal have to cleverly identify identical pieces of "words" while also detecting their (positional) relationship with each other, so that false positives are suppressed?

Tyriar · 2022-11-07T21:56:10Z

> Right, that addresses the latter example. What about the first one?

The first may be the same because the whole prompt likely got re-printed. I would want to tie the a11y improvements like this into our shell integration support so we would also know when we're in a prompt and where the start/end/rprompt/continuations are.

> What about applications that want to move some text left/right/up/down by a bunch of columns/lines?

That's in the bucket of things that feel out of scope to me. We won't be able to make everything in the terminal accessible, I was aiming for better prompt interactions and better natural ltr/top to bottom text flow.

carlos-zamora · 2022-11-08T00:29:06Z

> I don't like the thought of disabling textChange events in response to new text from this feature. If users are using a UIA client that hasn't adopted this protocol, disabling events might have unexpected results.
>
> This will require significant buy-in from not just terminal, but also various command-line app and screen reader manufacturers.

@codeofdusk I'm a bit confused by the points above. The idea with this proposal is that no changes would be required on the screen reader side because they're handling textChanged events and notifications appropriately (NVDA has a special case here, but we'll get back to that in a minute). So theoretically, if this spec was implemented and an app (e.g. winget) started sending out these VT sequences, the terminal would just send fewer textChanged events, or notifications with different payloads. So, the attached screen reader wouldn't know the difference and would read out the notification as usual.

NVDA, of course, is special in that it's currently ignoring notifications unless a setting is enabled. If we gave the UIA notifications a different ID however, could NVDA whitelist that class of notifications?

carlos-zamora · 2022-11-08T00:49:43Z

> Not sure what you mean by this, this proposal is essentially alt text, no?

Sort of? I see the similarity to alt text, but I think we should not embed the sequence into the buffer. This sequence should be limited to new output and the resulting notifications entirely. So on a resize or when in scan mode, whatever special text was notified out shouldn't be found.

#13666 is a standard example I can think of (gh pr create). It's not taking up the entire screen, it's just rewriting the prompt entirely because that's (presumably) the easiest way to rewrite the prompt. There are other lightweight examples that could benefit from this (e.g. PowerShell suggestions). Sure, we're not solving the problem for heavy-weight apps like midnight commander, but (1) now it's on the app to decide if it's worth making it accessible and (2) I don't know if it's worth lumping those scenarios into this scope since other accessible options exist (like, idk if the screen reader community would even want to use it if there are non-cli apps that already meet their needs.)

carlos-zamora · 2022-11-10T18:03:55Z

Chatted with @Tyriar today. Here are some takeaways from that meeting:

Main Benefits of this Approach

We need a way for apps to be more in control of making their content more accessible.

Main Concerns

Adoption of this seems difficult. CLI apps don't really want to think about how to be more accessible. It'd be best to give them a more fine-tuned solution.
Yes, the above solution is versatile, but that also means it's a bit hacky because the developer has to think about how to make their workflow more accessible, vs simply saying "I'm in a 'select an option' workflow, I'll use the guidelines designed for that".

Other Proposals

We discussed a few ideas that could be more fine-tuned. They're not mutually exclusive. Also note, these are very lightweight specs. They definitely need some fleshing out, but I'll do that when I add them to the actual spec (at some point).

Idea 1: Decorative Tags

A VT sequence marking a range of text as purely decorative.
Example:
- App displays progress bar as such: [===___] 50%
Solution:
- The region of text that looks like this [===___] would be marked as decorative and ignored by the screen reader. This would be accomplished by having the terminal not append that to the output text in a UIA notification event.
- The 50% displayed at the end would still be read out as normal text, no additional work would be required.
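As a rough Python illustration of Idea 1, the terminal could track a "decorative" flag per run of text and leave flagged runs out of the string it hands to the screen reader. The `Run` type and `announcement_text` helper are hypothetical names for this sketch; how the flag gets set (which VT sequence) is left open here.

```python
from dataclasses import dataclass


@dataclass
class Run:
    text: str
    decorative: bool = False  # hypothetical attribute set by the proposed sequence


def announcement_text(runs):
    """Join the non-decorative runs into the text sent to the screen reader."""
    spoken = "".join(run.text for run in runs if not run.decorative)
    # Collapse whitespace left behind where decorative runs were dropped.
    return " ".join(spoken.split())


progress = [Run("[===___]", decorative=True), Run(" 50%")]
```

Here the screen still shows `[===___] 50%`, while `announcement_text(progress)` yields just `50%`.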

Idea 2: Semantic Embedding

In HTML, Accessible Rich Internet Applications (ARIA) attributes provide hints to the screen reader. Inspired by that, we could use a combination of HTML-like tags and ARIA labels to make workflows clearer.
Example:
- gh pr create outputs a prompt with multiple choices (see this comment for what it looks like)
Solution:
- add an option tag for each option presented
- the selected option has an additional tag (maybe through an ARIA-like label?)
- Terminal knows to read "X selected"

Idea 3: Flag to know if a screen reader is active

An environment variable or flag that lets CLI apps know if a screen reader is active.
NOTE: PowerShell does this to disable PSReadLine. We should definitely check out how they do that.
Example:
- gh pr create could benefit from adding numbers before each option to make it more clear that they are distinct options.
Solution:
- gh.exe checks this flag. If a screen reader is active, change the layout of gh pr create to add numbers to the front.
This could also be used for more complex scenarios. Consider PowerShell's MenuComplete option. Each suggestion is displayed in a grid, which means that multiple options are on a single line. This makes it very difficult for screen reader users to identify distinct options. If PowerShell could know when a screen reader is attached, they could force a different behavior.

j4james · 2022-11-10T18:33:50Z

> An environment variable or flag that lets CLI apps know if a screen reader is active.

Ideally you want this to work over remote connections too, and the terminal presumably wouldn't be able to set an environment variable in that case. So my recommendation would be using one of the standard VT reporting sequences for this.

For scenario 1, if we were using something like DECSCA, an app could tell if the new "decorative" attribute was supported by sending a DECRQSS query. And the spec could require that a terminal not respond successfully to that query unless there is actually an active screen reader.

If the solution is mode based, the app can query whether the mode is supported with a DECRQM query. And again the spec could require the presence of a screen reader.

For a more general way to query whether a screen reader is present, regardless of support for any particular functionality, we could define a new DSR query. For a similar use case, look at the way the printer status is reported for DSR-PP: https://www.vt100.net/docs/vt510-rm/DSR-PP

And as a last resort, if we just wanted a way for terminals to indicate that they support this screen reader spec in general, we could define a new feature number that is reported in the DA response. It would be preferable if we could avoid this though.

zadjii-msft · 2022-12-12

this one is 🌶️

KalleOlaviNiemitalo · 2022-12-22T09:04:24Z

doc/specs/#13666 - VT Sequence for Screen Reader Control.md

+As mentioned earlier, `DSR` is already a standard method for command-line applications to query the capabilities of the attached terminal emulator. By claiming a value, the terminal can easily respond to let the command-line application know if a screen reader is attached or not. In the event the terminal emulator does not support this feature, no response is given, which is common practice.
+> `DSR` - Screen Reader
+> - command-line application query: `CSI ? 2577 n`
+> - terminal emulator response: `CSI ? 2577; Ps`


This again should end with n.

j4james · 2022-12-22

It's also worth mentioning that the typical pattern for DSR sequences like this is that the query number is a multiple of 5, and then you have separate numbers for each response, starting at a multiple of 10 (below the query number).

For example, the DSR 5 query (operating status), responds with DSR 0, DSR 1, DSR 2, etc. The DSR ? 15 query (printer port), responds with DSR ? 10, DSR ? 11, etc. There are exceptions to that rule (e.g. the CPR query, or the keyboard dialect query), but that's just because those don't really fit the pattern of a "status" response.

So unless we're expecting to extend this with lots of different response types, it would be more customary to use something like DSR ? 2575 for the query, and then DSR ? 2570 and DSR ? 2571 for the two responses (attached and not attached).
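From the application's side, the query/response pattern described above might look like the following Python sketch. The numbers 2575, 2570, and 2571 come from the suggestion in this comment; mapping 2570 to "attached" and 2571 to "not attached" follows the order in which they are listed and is an assumption, as is the helper name `screen_reader_attached`.

```python
import re

# Query the application would write to the terminal: CSI ? 2575 n
QUERY = "\x1b[?2575n"

# Replies the terminal would send back: CSI ? 2570 n or CSI ? 2571 n
_RESPONSE = re.compile(r"\x1b\[\?(2570|2571)n")


def screen_reader_attached(response: str):
    """Return True/False for a recognised reply, or None if it is not one."""
    match = _RESPONSE.fullmatch(response)
    if match is None:
        return None
    # Assumed mapping: 2570 = attached, 2571 = not attached.
    return match.group(1) == "2570"
```

A terminal that does not implement the query sends nothing, so a real caller would also need a read timeout and should treat a missing reply the same as `None`.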

I know I previously argued that it's better to use a single number for finite resources, but that's just because the OSC numbers are a bit of a nightmare in terms of conflicts, and there's no standard usage pattern. DSR is less of a risk, because I don't think there are any modern terminals using it (not counting XTerm's non-standard abuse).

j4james · 2023-01-10

@carlos-zamora I think I may have confused things with all my DSR references above. So just to be clear, DSR is the shorthand name for the Device Status Report operation. CSI n is the actual escape sequence for that operation (where CSI is equal to ESC [ in 7-bit mode).

So you can say DSR ? 2575, or CSI ? 2575 n, or possibly even ESC [ ? 2575 n, but you wouldn't say DSR ? 2575 n. And in this particular area of the documentation, CSI ? 2575 n is probably the most appropriate.

carlos-zamora · 2023-01-10

Ah ok. My bad. Yeah, I saw you using DSR above and thought applying the same notation in the spec would make it clearer. Thanks for the explanation!

j4james reviewed Nov 5, 2022

Tyriar reviewed Nov 7, 2022

zadjii-msft requested changes Dec 12, 2022

ghost added the Needs-Author-Feedback label Dec 12, 2022

[Spec] VT Sequence for Screen Reader Control

0f4f17d

carlos-zamora force-pushed the dev/cazamor/spec/a11y-vt-seq branch from 079e516 to 0f4f17d Compare December 15, 2022 21:18

ghost removed the Needs-Author-Feedback label Dec 15, 2022

apply feedback from spec review meeting

5214b13

spell

d22ab7a

carlos-zamora assigned zadjii-msft Dec 21, 2022

KalleOlaviNiemitalo reviewed Dec 22, 2022

carlos-zamora added 2 commits January 9, 2023 11:22

address KalleOlaviNiemitalo and j4james feedback

1fd7c3f

CSI ends with n, DSR does not

03e0bfb

zadjii-msft added this to the Terminal v1.18 milestone Jan 10, 2023

j4james mentioned this pull request Mar 2, 2023

Provide a more detailed Device Attributes report #14906

Merged

lhecker mentioned this pull request Jul 5, 2023

Screen readers don't speak available options in interactive prompts #15653

Closed

zadjii-msft modified the milestones: Terminal v1.18, Terminal v1.19 Jul 5, 2023

lhecker mentioned this pull request Feb 19, 2024

JAWS reads entire window over SSH when typing into nano #16734

Closed
