[Spec] VT Sequence for Screen Reader Control #14342

Open
wants to merge 5 commits into base: main

carlos-zamora
Member

This spec outlines...

References

@carlos-zamora
Member Author

@Tyriar since you wrote the spec in the Terminal WG repo, I'd love to get some feedback and thoughts here. 😊

@j4james you're generally pretty on top of what's going on in the VT sequence world. Curious if you have any thoughts here too.

@codeofdusk I'd also love your feedback since you're a screen reader user and you've contributed to NVDA.

Thanks all.

Comment on lines +19 to +23
> `OSC Ps ; Pt ST`
> - `Ps = 2 0 0` -> Stop announcing incoming data to screen reader, `Pt` is an optional string that will be announced immediately. The screen reader will resume announcing incoming data if any key is pressed.
> - `Ps = 2 0 1` -> Resume announcing incoming data to screen reader, `Pt` is an optional string that will be announced immediately.
> - `Ps = 2 0 2` -> Announce `Pt` immediately to the screen reader.
> Note that the reason any key press will force the screen reader to announce again is to prevent situations where applications are terminated while the screen reader is not announcing or where applications are misbehaving.
Collaborator

This is a bit of a nitpick, but it's worth noting that OSC numbers are a finite resource, so a single OSC number for this would be preferable to three. For example, something like `OSC 200 ; Ps ; Pt`, where `Ps` differentiates between stop/resume/announce. Also makes it a little easier to extend.

My main concern, though, is how this is going to propagate over conpty. Is the idea to just pass it through and hope for the best? What happens if conpty later refreshes part of the display that was originally output with "stop announcing"? Would we then need to rewrap that content with these sequences?

Because if that is something conpty needs to account for, we may be better off with a simple attribute-like sequence similar to DECSCA, which could be recorded in the buffer, and then forwarded over conpty as part of the regular repaint. The "immediate announce" strings would still then need a separate sequence (and that probably could be passed through directly).

I don't know. Just thinking out loud. This may be something you need to prototype and try out before you lock down the exact protocol you're going to use.
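
To make the single-number suggestion concrete, here is a minimal sketch of how an application might build such a sequence. The sub-command values (0, 1, 2) and the function name are assumptions for illustration only; the thread proposes `OSC 200 ; Ps ; Pt` but does not fix the `Ps` values.

```python
ESC = "\x1b"
OSC = ESC + "]"   # Operating System Command introducer (7-bit form)
ST = ESC + "\\"   # String Terminator (7-bit form)

# Hypothetical Ps values: the thread only says Ps would select stop/resume/announce.
STOP, RESUME, ANNOUNCE = 0, 1, 2


def screen_reader_osc(action: int, text: str = "") -> str:
    """Build `OSC 200 ; Ps ; Pt ST` for the given action and optional text."""
    if action not in (STOP, RESUME, ANNOUNCE):
        raise ValueError(f"unknown action: {action}")
    # Control characters could terminate or corrupt the OSC string, so drop them.
    safe = "".join(ch for ch in text if ch.isprintable())
    return f"{OSC}200;{action};{safe}{ST}"
```

An application would write the returned string to its output stream, for example `sys.stdout.write(screen_reader_osc(ANNOUNCE, "50%"))`. Whether conpty passes the sequence through unchanged is exactly the open question raised above.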

Member

> This is a bit of a nitpick, but it's worth noting that OSC numbers are a finite resource, so a single OSC number for this would be preferable to three. For example, something like `OSC 200 ; Ps ; Pt`, where `Ps` differentiates between stop/resume/announce. Also makes it a little easier to extend.

Oh I like that!

> My main concern, though, is how this is going to propagate over conpty. Is the idea to just pass it through and hope for the best? What happens if conpty later refreshes part of the display that was originally output with "stop announcing"? Would we then need to rewrap that content with these sequences?

I always felt like ConPTY recording and then "re-rendering" VT output is quite a bit of a "hack". And because of that, if we ever design a VT sequence, I don't think we should limit ourselves by how hard it'd be to implement inside conhost.
Basically, I'd personally be in favor of whatever optimal / "clean" design we can come up with, unhindered by any design complexities that only we would have to suffer, unlike other, existing UNIX terminals. Long-term VtEngine might just be replaced entirely with something leaner anyways.

Collaborator

> I always felt like ConPTY recording and then "re-rendering" VT output is quite a bit of a "hack". And because of that, if we ever design a VT sequence, I don't think we should limit ourselves by how hard it'd be to implement inside conhost.

I very much agree with this sentiment, but I am concerned we might end up with something we can't actually use. And while I'm still confident we can improve on the current VtEngine, that's a bigger task than I had originally thought, and I don't see that happening anytime soon.

That said, this may end up not being that big a deal in practice. I just wanted everyone to be aware of the potential issues here.

Member Author

> be aware of the potential issues here

I'll be sure to mention this in the potential issues part of the spec 😉. I think my stance so far though, is basically the same as my comments on scan mode:

### Scan Mode Experience
Three scenarios this VT sequence would make more accessible include:
1. text is being redrawn on top of existing text (e.g. progress bars)
2. prompts where the user must select an option using the arrow keys (e.g. `gh pr create`)
3. supplementary content is displayed with different visual characteristics (e.g. PowerShell suggestions)
Scan mode is a mode where the user can use the screen reader to navigate the text manually. In the scenarios listed above, the user should expect the following experiences when in scan mode:
1. The progress bar should be read in the way it is drawn. The VT sequence data should not be embedded into the terminal because it would be more confusing to read out "10%" and "20%" depending on where the user is scanning the progress bar. Instead, the progress bar should be displayed as "historical" content that had already occurred.
2. The output text should be read in the way it is drawn. Alt text doesn't make sense for this scenario.
3. The output text should be read in the way it is drawn.
In the future, if alt text is functionality the community is interested in, a separate VT sequence should probably be introduced to provide that functionality.

This sequence would be most useful for text being output rather than leaving a "landmark" stored in the buffer. The "alt text" sequence is probably something we would want to tackle separately, if desired and found to be useful.

That said, I'd be concerned about the scenario where the user switches from the alt buffer back to the main buffer. That probably wouldn't work right. Urgh. I'll have to be sure to test that out. (another one for "potential issues" haha) :/

Member

My core contention with not attaching these OSCs to buffer positions is that a screen reader user reviewing past contents will have a different experience than when the text was originally emitted.

That seems terrible.

@codeofdusk
Contributor

As I've said before one-on-one, this will be quite tricky to get right. This is a very large change/project.

A few notes:

  • I don't like the thought of disabling textChange events in response to new text from this feature. If users are using a UIA client that hasn't adopted this protocol, disabling events might have unexpected results.
    • In particular, if a screen reader (or by extension the user) hasn't opted into receiving events in this way, nothing should change.
  • This will require significant buy-in from not just terminal, but also various command-line app and screen reader manufacturers.
    • Check whether ncurses would accept a patch to implement this protocol.
    • JAWS, Narrator, and NVDA cover the Windows screen reader market, but there are various other solutions on other systems (including console-only screen readers such as TDSR and Speakup in the Linux kernel, plus "audio-first" interfaces such as Emacspeak that provide shell access).

CC @josephsl, @LeonarddeR, @michaelDCurran, @tspivey, @tvraman.

Member

Tyriar left a comment

I put together the proposal in a couple of days as part of the MS hackathon to trigger a conversation on the topic but didn't get the engagement I was hoping for. Since then, I've been casually thinking about the problem every now and then and I'm actually not sure this is the direction we should go because:

  • People don't care about a11y enough to add this to programs, shell scripts, etc. which is sad but true. The fact that the issue only got a single comment several months later kind of proved this imo.
  • There is extra overhead and un-intuitiveness to adopt in applications run in the terminal. Take HTML for example where people still either ignore alt text or get it wrong altogether.

Microsoft/PowerShell/etc. certainly cares about this issue deeply, but if possible I'm more interested in solving the problem for everything, rather than just first party and the few CLIs that adopt it. In its current state I don't plan on implementing this in xterm.js/vscode and encourage you not to either.

I think these are the more promising directions to solve the problem that are mostly things we can improve for all CLIs, rather than just programs that add explicit a11y support:

> In the future, if alt text is functionality the community is interested in, a separate VT sequence should probably be introduced to provide that functionality.

Not sure what you mean by this, this proposal is essentially alt text, no?

@zadjii-msft
Member

zadjii-msft commented Nov 7, 2022

> People don't care about a11y enough to add this to programs, shell scripts, etc. which is sad but true. The fact that the issue only got a single comment several months later kind of proved this imo.

That might be more of an indictment of terminal-wg more so than demonstrative that people don't care about this issue. There's fundamentally no good way of doing anything like this currently, so I doubt anyone's bothered even thinking about it all that much.

The synchronized updates idea is a great alternative plan here. Perhaps we should work with the winget (@denelon), pwsh, gh folks to explore how using those sequences would modify the screen reader output. Where would they start/end the frame, and what would the resulting screen reader output be like?

@lhecker
Member

lhecker commented Nov 7, 2022

Let's say I implement autocompletion for my shell and when you press tab on robocopy.exe it replaces it with Robocopy.exe (since that's the proper capitalization of the executable). If we diff the viewport in a trivial way, this would just read out "R". Or if I tab on roboc and get Robocopy.exe it then would read out "R" and "copy.exe", since those are the only parts that got changed.
So, clearly, we need to announce entire words at least, but technically my shell didn't write "Robocopy" either - it wrote "Robocopy.exe", since that's the full application name. But how do we parse such "words"? Should we modify UAX#29 to recognize "." as being part of a word? What would such a diffing algorithm look like? Is it even feasible in the first place?

@Tyriar
Member

Tyriar commented Nov 7, 2022

> That might be more of an indictment of terminal-wg more so than demonstrative that people don't care about this issue

I think you're being a little too optimistic here. During my years of work on terminals, the only interest in or mention of the topic has come from Microsoft people. But sure, some people will want to support it; whether they actually do when it's prioritized against their backlog is another thing.

> If we diff the viewport in a trivial way, this would just read out "R". Or if I tab on roboc and get Robocopy.exe it then would read out "R" and "copy.exe", since those are the only parts that got changed.

@lhecker a diff like that wouldn't make sense, the changed range is R to .exe, so you would read out Robocopy.exe.
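
The two diffing strategies being compared can be sketched as follows. This is an illustration under simplifying assumptions (a single line, one character per cell, missing cells treated as blanks), not how any terminal actually computes its notifications; both function names are hypothetical.

```python
def _pad(old: str, new: str) -> tuple[str, str]:
    """Pad both lines with blanks to the same width, like cells in a viewport row."""
    width = max(len(old), len(new))
    return old.ljust(width), new.ljust(width)


def changed_cells(old: str, new: str) -> list[str]:
    """Trivial diff: return every separate run of cells whose content changed."""
    a, b = _pad(old, new)
    runs, current = [], ""
    for before, after in zip(a, b):
        if before != after:
            current += after
        elif current:
            runs.append(current)
            current = ""
    if current:
        runs.append(current)
    return runs


def changed_span(old: str, new: str) -> str:
    """Range diff: return the new text from the first changed cell to the last."""
    a, b = _pad(old, new)
    changed = [i for i in range(len(b)) if a[i] != b[i]]
    if not changed:
        return ""
    return b[changed[0]:changed[-1] + 1]
```

With these definitions, `changed_span("roboc", "Robocopy.exe")` covers the whole word, while `changed_span("robocopy.exe", "Robocopy.exe")` still yields only the first cell, which is why the capitalization-only case needs extra context such as knowing where the prompt starts and ends.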

@lhecker
Member

lhecker commented Nov 7, 2022

> @lhecker a diff like that wouldn't make sense, the changed range is R to .exe, so you would read out Robocopy.exe.

Right, that addresses the latter example. What about the first one? And I really only intended them as examples. I'm sure there's way more examples one could contrive. For instance another one I had in mind was:
What about applications that want to move some text left/right/up/down by a bunch of columns/lines? Or "scroll" some content themselves? Would the terminal have to cleverly identify identical pieces of "words" while also detecting their (positional) relationship with each other, so that false positives are suppressed?

@Tyriar
Member

Tyriar commented Nov 7, 2022

> Right, that addresses the latter example. What about the first one?

The first may be the same because the whole prompt likely got re-printed. I would want to tie the a11y improvements like this into our shell integration support so we would also know when we're in a prompt and where the start/end/rprompt/continuations are.

> What about applications that want to move some text left/right/up/down by a bunch of columns/lines?

That's in the bucket of things that feel out of scope to me. We won't be able to make everything in the terminal accessible, I was aiming for better prompt interactions and better natural ltr/top to bottom text flow.

@carlos-zamora
Member Author

> • I don't like the thought of disabling textChange events in response to new text from this feature. If users are using a UIA client that hasn't adopted this protocol, disabling events might have unexpected results.
> • This will require significant buy-in from not just terminal, but also various command-line app and screen reader manufacturers.

@codeofdusk I'm a bit confused by the points above. The idea with this proposal is that no changes would be required on the screen reader side because they're already handling textChanged events and notifications appropriately (NVDA has a special case here, but we'll get back to that in a minute). So theoretically, if this spec were implemented and an app (e.g. winget) started sending out these VT sequences, the terminal would just send fewer textChanged events, or notifications with different payloads. The attached screen reader wouldn't know the difference and would read out the notification as usual.

NVDA, of course, is special in that it's currently ignoring notifications unless a setting is enabled. If we gave the UIA notifications a different ID however, could NVDA whitelist that class of notifications?

@carlos-zamora
Member Author

> Not sure what you mean by this, this proposal is essentially alt text, no?

Sort of? I see the similarity to alt text, but I think we should not embed the sequence into the buffer. This sequence should be limited to new output and the resulting notifications entirely. So on a resize or when in scan mode, whatever special text was notified out shouldn't be found.

#13666 is a standard example I can think of (gh pr create). It's not taking up the entire screen, it's just rewriting the prompt entirely because that's (presumably) the easiest way to rewrite the prompt. There are other lightweight examples that could benefit from this (e.g. PowerShell suggestions). Sure, we're not solving the problem for heavyweight apps like Midnight Commander, but (1) now it's on the app to decide if it's worth making it accessible and (2) I don't know if it's worth lumping those scenarios into this scope since other accessible options exist (like, idk if the screen reader community would even want to use it if there are non-CLI apps that already meet their needs).

@carlos-zamora
Member Author

Chatted with @Tyriar today. Here are some takeaways from that meeting:

Main Benefits of this Approach

  • We need a way for apps to be more in control of making their content more accessible.

Main Concerns

  • Adoption of this seems difficult. CLI apps don't really want to think about how to be more accessible. It'd be best to give them a more fine-tuned solution.
  • Yes, the above solution is versatile, but that also means it's a bit hacky because the developer has to think about how to make their workflow more accessible, vs simply saying "I'm in a 'select an option' workflow, I'll use the guidelines designed for that".

Other Proposals

We discussed a few ideas that could be more fine-tuned. They're not mutually exclusive. Also note, these are very lightweight specs. They definitely need some fleshing out, but I'll do that when I add them to the actual spec (at some point).

Idea 1: Decorative Tags

  • A VT sequence marking a range of text as purely decorative.
  • Example:
    • App displays progress bar as such: [===___] 50%
  • Solution:
    • The region of text that looks like this [===___] would be marked as decorative and ignored by the screen reader. This would be accomplished by having the terminal not append that to the output text in a UIA notification event.
    • The 50% displayed at the end would still be read out as normal text, no additional work would be required.
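
As a rough illustration of Idea 1, the terminal could keep a per-span "decorative" flag and skip those spans when it builds the text for a notification. The `Span` type and `announcement_text` function below are hypothetical; the thread does not define how the flag would be stored or which sequence would set it.

```python
from dataclasses import dataclass


@dataclass
class Span:
    text: str
    decorative: bool = False  # hypothetical flag set by the proposed VT sequence


def announcement_text(spans: list[Span]) -> str:
    """Join the non-decorative spans and collapse the leftover whitespace."""
    spoken = "".join(span.text for span in spans if not span.decorative)
    return " ".join(spoken.split())


# The progress bar from the example: the bar itself is decorative, the percentage is not.
progress = [Span("[===___]", decorative=True), Span(" 50%")]
```

Here `announcement_text(progress)` leaves only the percentage to be read out, while the bar stays visible on screen.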

Idea 2: Semantic Embedding

  • Inspired by HTML, where Accessible Rich Internet Applications (ARIA) attributes provide hints to the screen reader. We could use a combination of HTML-like tags and ARIA-like labels to make workflows more clear.
  • Example:
    • gh pr create outputs a prompt with multiple choices (see this comment for what it looks like)
  • Solution:
    • add an option tag for each option presented
    • the selected option has an additional tag (maybe through an ARIA-like label?)
    • Terminal knows to read "X selected"

Idea 3: Flag to know if a screen reader is active

  • An environment variable or flag that lets CLI apps know if a screen reader is active.
  • NOTE: PowerShell does this to disable PSReadLine. We should definitely check out how they do that.
  • Example:
    • gh pr create could benefit from adding numbers before each option to make it more clear that they are distinct options.
  • Solution:
    • gh.exe checks this flag. If a screen reader is active, change the layout of gh pr create to add numbers to the front.
  • This could also be used for more complex scenarios. Consider PowerShell's MenuComplete option. Each suggestion is displayed in a grid, which means that multiple options are on a single line. This makes it very difficult for screen reader users to identify distinct options. If PowerShell could know when a screen reader is attached, they could force a different behavior.
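
A minimal sketch of Idea 3 from the application's side, assuming the flag is exposed as an environment variable. The variable name `TERM_SCREEN_READER` and both function names are made up for this example; the thread does not specify how the flag would be surfaced.

```python
import os

# Hypothetical variable name; nothing in the thread defines one.
SCREEN_READER_ENV = "TERM_SCREEN_READER"


def screen_reader_active(env=os.environ) -> bool:
    """Return True if the (assumed) flag says a screen reader is attached."""
    return env.get(SCREEN_READER_ENV) == "1"


def render_options(options: list[str], selected: int, numbered: bool) -> list[str]:
    """Render a 'select an option' prompt, numbering the options if requested."""
    lines = []
    for index, option in enumerate(options):
        marker = ">" if index == selected else " "
        prefix = f"{index + 1}. " if numbered else ""
        lines.append(f"{marker} {prefix}{option}")
    return lines
```

An app would call `render_options(options, selected, numbered=screen_reader_active())` so the visual layout stays unchanged for users without a screen reader.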

@j4james
Collaborator

j4james commented Nov 10, 2022

> An environment variable or flag that lets CLI apps know if a screen reader is active.

Ideally you want this to work over remote connections too, and the terminal presumably wouldn't be able to set an environment variable in that case. So my recommendation would be using one of the standard VT reporting sequences for this.

For scenario 1, if we were using something like DECSCA, an app could tell if the new "decorative" attribute was supported by sending a DECRQSS query. And the spec could require that a terminal not respond successfully to that query unless there is actually an active screen reader.

If the solution is mode based, the app can query whether the mode is supported with a DECRQM query. And again the spec could require the presence of a screen reader.

For a more general way to query whether a screen reader is present, regardless of support for any particular functionality, we could define a new DSR query. For a similar use case, look at the way the printer status is reported for DSR-PP: https://www.vt100.net/docs/vt510-rm/DSR-PP

And as a last resort, if we just wanted a way for terminals to indicate that they support this screen reader spec in general, we could define a new feature number that is reported in the DA response. It would be preferable if we could avoid this though.
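
For the mode-based option, the query and reply would follow the standard DECRQM/DECRPM shape for private modes: the app sends `CSI ? Ps $ p` and the terminal answers `CSI ? Ps ; Pm $ y`, where `Pm = 0` means the mode is not recognized. The sketch below assumes a made-up mode number, since no mode has been assigned for this purpose.

```python
import re

CSI = "\x1b["
HYPOTHETICAL_MODE = 9999  # placeholder only; no mode number is defined in the thread

# DECRPM reply for a private mode: CSI ? Ps ; Pm $ y
_DECRPM = re.compile(r"\x1b\[\?(\d+);(\d+)\$y")


def decrqm_query(mode: int) -> str:
    """Build the DECRQM query for a DEC private mode: CSI ? Ps $ p."""
    return f"{CSI}?{mode}$p"


def mode_supported(reply: str, mode: int) -> bool:
    """Return True if the reply reports `mode` with any status other than 0."""
    match = _DECRPM.search(reply)
    if match is None or int(match.group(1)) != mode:
        return False
    return match.group(2) != "0"  # 0 = mode not recognized
```

Under the rule suggested above, a terminal without an active screen reader would answer with `Pm = 0` (or not at all), so `mode_supported` would return `False` and the app would keep its default behavior.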

Member

zadjii-msft left a comment

this one is 🌶️

@ghost ghost added the Needs-Author-Feedback The original author of the issue/PR needs to come back and respond to something label Dec 12, 2022
@ghost ghost removed the Needs-Author-Feedback The original author of the issue/PR needs to come back and respond to something label Dec 15, 2022

As mentioned earlier, `DSR` is already a standard method for command-line applications to query the capabilities of the attached terminal emulator. By claiming a value, the terminal can easily respond to let the command-line application know if a screen reader is attached or not. In the event the terminal emulator does not support this feature, no response is given, which is common practice.
> `DSR` - Screen Reader
> - command-line application query: `CSI ? 2577 n`
> - terminal emulator response: `CSI ? 2577; Ps`

This again should end with n.

Collaborator

It's also worth mentioning that the typical pattern for DSR sequences like this is that the query number is a multiple of 5, and then you have separate numbers for each response, starting at a multiple of 10 (below the query number).

For example, the DSR 5 query (operating status), responds with DSR 0, DSR 1, DSR 2, etc. The DSR ? 15 query (printer port), responds with DSR ? 10, DSR ? 11, etc. There are exceptions to that rule (e.g. the CPR query, or the keyboard dialect query), but that's just because those don't really fit the pattern of a "status" response.

So unless we're expecting to extend this with lots of different response types, it would be more customary to use something like DSR ? 2575 for the query, and then DSR ? 2570 and DSR ? 2571 for the two responses (attached and not attached).

I know I previously argued that it's better to use a single number for finite resources, but that's just because the OSC numbers are a bit of a nightmare in terms of conflicts, and there's no standard usage pattern. DSR is less of a risk, because I don't think there are any modern terminals using it (not counting XTerm's non-standard abuse).

Collaborator

@carlos-zamora I think I may have confused things with all my DSR references above. So just to be clear, DSR is the shorthand name for the Device Status Report operation. CSI n is the actual escape sequence for that operation (where CSI is equal to ESC [ in 7-bit mode).

So you can say DSR ? 2575, or CSI ? 2575 n, or possibly even ESC [ ? 2575 n, but you wouldn't say DSR ? 2575 n. And in this particular area of the documentation, CSI ? 2575 n is probably the most appropriate.
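
Putting the notation together, an application-side sketch of the query and its two responses might look like this. The numbers 2575, 2570 and 2571 are the ones floated in this review, not a finalized assignment, and the function name is hypothetical.

```python
import re

CSI = "\x1b["  # ESC [ in 7-bit mode

# Numbers suggested in the review; treat them as provisional.
QUERY = f"{CSI}?2575n"
ATTACHED, NOT_ATTACHED = 2570, 2571

_REPORT = re.compile(r"\x1b\[\?(\d+)n")


def parse_screen_reader_report(data: str):
    """Return True (attached), False (not attached), or None if no known report is found."""
    for match in _REPORT.finditer(data):
        code = int(match.group(1))
        if code == ATTACHED:
            return True
        if code == NOT_ATTACHED:
            return False
    return None
```

Because a terminal that does not support the feature sends no response at all, a real client would also need a read timeout and would treat `None` the same as "unknown".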

Member Author

Ah ok. My bad. Yeah I saw you using DSR above and thought applying the same notation in the spec would make it more clear. Thanks for the explanation!

@zadjii-msft zadjii-msft added this to the Terminal v1.18 milestone Jan 10, 2023