[Spec] VT Sequence for Screen Reader Control #14342
@Tyriar since you wrote the spec in the Terminal WG repo, I'd love to get some feedback and thoughts here. 😊 @j4james you're generally pretty on top of what's going on in the VT sequence world. Curious if you have any thoughts here too. @codeofdusk I'd also love your feedback since you're a screen reader user and you've contributed to NVDA. Thanks all. |
> `OSC Ps ; Pt ST`
> - `Ps = 2 0 0` -> Stop announcing incoming data to screen reader, `Pt` is an optional string that will be announced immediately. The screen reader will resume announcing incoming data if any key is pressed.
> - `Ps = 2 0 1` -> Resume announcing incoming data to screen reader, `Pt` is an optional string that will be announced immediately.
> - `Ps = 2 0 2` -> Announce `Pt` immediately to the screen reader.
> Note that the reason any key press will force the screen reader to announce again is to prevent situations where applications are terminated while the screen reader is not announcing or where applications are misbehaving.
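For illustration, here is a minimal sketch of how a command-line application might emit the three draft sequences quoted above. The helper names (`screen_reader_osc`, `progress_demo`) are hypothetical, the 7-bit forms of `OSC` and `ST` are assumed, and the choice to always emit the `;` even when `Pt` is empty is a guess, since the draft does not say how an omitted `Pt` is encoded.

```python
ESC = "\x1b"
OSC = ESC + "]"   # Operating System Command introducer (7-bit form)
ST = ESC + "\\"   # String Terminator (7-bit form)

# Ps values taken from the draft quoted above.
STOP, RESUME, ANNOUNCE = 200, 201, 202


def screen_reader_osc(ps: int, pt: str = "") -> str:
    """Build `OSC Ps ; Pt ST` for one of the three draft Ps values."""
    if ps not in (STOP, RESUME, ANNOUNCE):
        raise ValueError(f"unsupported Ps value: {ps}")
    # Control characters in Pt could terminate the OSC string early.
    if any(ord(ch) < 0x20 or ord(ch) == 0x7F for ch in pt):
        raise ValueError("Pt must not contain control characters")
    return f"{OSC}{ps};{pt}{ST}"


def progress_demo(steps: int = 3) -> str:
    """Wrap a redrawn progress bar so only start and end are announced."""
    out = [screen_reader_osc(STOP, "Downloading")]
    for i in range(1, steps + 1):
        out.append(f"\r{i * 100 // steps}%")  # redraw on the same line
    out.append("\n")
    out.append(screen_reader_osc(RESUME, "Download complete"))
    return "".join(out)
```

Writing `progress_demo()` to stdout would show the percentages visually while, under the draft semantics, the screen reader hears only "Downloading" and "Download complete".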
This is a bit of a nit pick, but it's worth noting that OSC numbers are a finite resource, so a single OSC number for this would be preferable to three. For example, something like `OSC 200 ; Ps ; Pt`, where `Ps` differentiates between stop/resume/announce. Also makes it a little easier to extend.
My main concern, though, is how this is going to propagate over conpty. Is the idea to just pass it through and hope for the best? What happens if conpty later refreshes part of the display that was originally output with "stop announcing"? Would we then need to rewrap that content with these sequences?
Because if that is something conpty needs to account for, we may be better off with a simple attribute-like sequence similar to `DECSCA`, which could be recorded in the buffer, and then forwarded over conpty as part of the regular repaint. The "immediate announce" strings would still then need a separate sequence (and that probably could be passed through directly).
I don't know. Just thinking out loud. This may be something you need to prototype and try out before you lock down the exact protocol you're going to use.
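To make the single-number suggestion concrete, here is a rough sketch of how a terminal might dispatch an `OSC 200 ; Ps ; Pt` payload. The mapping of `Ps` values (0 = stop, 1 = resume, 2 = announce) and the function name `parse_osc_200` are assumptions for illustration only; nothing in this thread fixes them.

```python
from typing import Optional, Tuple

# Hypothetical Ps assignments; the thread does not define them.
ACTIONS = {0: "stop", 1: "resume", 2: "announce"}


def parse_osc_200(payload: str) -> Optional[Tuple[str, str]]:
    """Parse the text between OSC and ST, e.g. "200;2;Build finished".

    Returns (action, text), or None if the payload is not a recognised
    OSC 200 request. Splitting at most twice lets Pt contain ';'.
    """
    parts = payload.split(";", 2)
    if len(parts) < 2 or parts[0] != "200":
        return None
    ps = parts[1]
    if not (ps.isascii() and ps.isdigit()):
        return None
    action = ACTIONS.get(int(ps))
    if action is None:
        return None  # unknown Ps: ignore, leaving room for extensions
    text = parts[2] if len(parts) == 3 else ""
    return action, text
```

Unknown `Ps` values are ignored rather than rejected, which is what makes this form easier to extend than three separate OSC numbers.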
> This is a bit of a nit pick, but it's worth noting that OSC numbers are a finite resource, so a single OSC number for this would be preferable to three. For example, something like `OSC 200 ; Ps ; Pt`, where `Ps` differentiates between stop/resume/announce. Also makes it a little easier to extend.
Oh I like that!
> My main concern, though, is how this is going to propagate over conpty. Is the idea to just pass it through and hope for the best? What happens if conpty later refreshes part of the display that was originally output with "stop announcing"? Would we then need to rewrap that content with these sequences?
I always felt like ConPTY recording and then "re-rendering" VT output is quite a bit of a "hack". And because of that, if we ever design a VT sequence, I don't think we should limit ourselves by how hard it'd be to implement inside conhost.
Basically, I'd personally be in favor of whatever optimal / "clean" design we can come up with, unhindered by any design complexities that only we would have to suffer, unlike other, existing UNIX terminals. Long-term `VtEngine` might just be replaced entirely with something leaner anyways.
> I always felt like ConPTY recording and then "re-rendering" VT output is quite a bit of a "hack". And because of that, if we ever design a VT sequence, I don't think we should limit ourselves by how hard it'd be to implement inside conhost.

I very much agree with this sentiment, but I am concerned we might end up with something we can't actually use. And while I'm still confident we can improve on the current `VtEngine`, that's a bigger task than I had originally thought, and I don't see that happening anytime soon.
That said, this may end up not being that big a deal in practice. I just wanted everyone to be aware of the potential issues here.
> be aware of the potential issues here
I'll be sure to mention this in the potential issues part of the spec 😉. I think my stance so far though, is basically the same as my comments on scan mode:
terminal/doc/specs/#13666 - VT Sequence for Screen Reader Control.md
Lines 58 to 69 in 60d3685
### Scan Mode Experience
Three scenarios this VT sequence would make more accessible include:
1. text is being redrawn on top of existing text (i.e. progress bars)
2. prompts where the user must select an option using the arrow keys (i.e. `gh pr create`)
3. supplementary content is displayed with different visual characteristics (i.e. PowerShell suggestions)
Scan mode is a mode where the user can use the screen reader to navigate the text manually. In the scenarios listed above, the user should expect the following experiences when in scan mode:
1. The progress bar should be read in the way it is drawn. The VT sequence data should not be embedded into the terminal because it would be more confusing to read out "10%" and "20%" depending on where the user is scanning the progress bar. Instead, the progress bar should be displayed as "historical" content that had already occurred.
2. The output text should be read in the way it is drawn. Alt text doesn't make sense for this scenario.
3. The output text should be read in the way it is drawn.
In the future, if alt text is functionality the community is interested in, a separate VT sequence should probably be introduced to provide that functionality.
This sequence would be most useful for text being output rather than leaving a "landmark" stored in the buffer. The "alt text" sequence is probably something we would want to tackle separately, if desired and found to be useful.
That said, I'd be concerned about the scenario where the user switches from the alt buffer back to the main buffer. That probably wouldn't work right. Urgh. I'll have to be sure to test that out. (another one for "potential issues" haha) :/
My core contention with not attaching these OSCs to buffer positions is that a screen reader user reviewing past contents will have a different experience than when the text was originally emitted.
That seems terrible.
As I've said before one-on-one, this will be quite tricky to get right. This is a very large change/project. A few notes:
CC @josephsl, @LeonarddeR, @michaelDCurran, @tspivey, @tvraman. |
I put together the proposal in a couple of days as part of the MS hackathon to trigger a conversation on the topic but didn't get the engagement I was hoping for. Since then, I've been casually thinking about the problem every now and then and I'm actually not sure this is the direction we should go because:
- People don't care about a11y enough to add this to programs, shell scripts, etc. which is sad but true. The fact that the issue only got a single comment several months later kind of proved this imo.
- There is extra overhead and un-intuitiveness to adopt in applications run in the terminal. Take HTML for example where people still either ignore alt text or get it wrong altogether.
Microsoft/PowerShell/etc. certainly cares about this issue deeply, but if possible I'm more interested in solving the problem for everything, rather than just first party and the few CLIs that adopt it. In its current state I don't plan on implementing this in xterm.js/vscode and encourage you not to either.
I think these are the more promising directions to solve the problem that are mostly things we can improve for all CLIs, rather than just programs that add explicit a11y support:
- @textshell's alternative proposals in https://gitlab.freedesktop.org/terminal-wg/specifications/-/issues/18#note_329814
- Implementing synchronized updates and never update screen reader content during a sync update
- A decorative SGR attribute that ignores the character is a pretty interesting idea as it's so simple and solves a lot of the problem.
- Make the terminal smarter at detecting the worst offenders that cause the most issues like progress bars, frequently updated lines, box drawing chars(?), etc. and don't announce them.
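As a rough illustration of the synchronized-updates idea in the list above, a terminal could hold back screen reader announcements while a synchronized update is open and announce only the final contents once it closes. The class below is a hypothetical sketch, not how any existing terminal does it; `set_sync` stands in for whatever sequence the terminal uses to begin and end a synchronized update, and `announced` stands in for the real screen reader notification channel.

```python
from typing import Dict, List


class SyncAwareAnnouncer:
    """Defers announcements made during a synchronized update."""

    def __init__(self) -> None:
        self.in_sync = False
        self.pending: Dict[int, str] = {}  # row -> latest text during sync
        self.announced: List[str] = []     # stand-in for screen reader output

    def set_sync(self, enabled: bool) -> None:
        """Call when a synchronized update begins (True) or ends (False)."""
        self.in_sync = enabled
        if not enabled:
            # Announce only the final state of each touched row, top to bottom.
            for row in sorted(self.pending):
                self.announced.append(self.pending[row])
            self.pending.clear()

    def write_row(self, row: int, text: str) -> None:
        """Record new text for a row, announcing now unless inside a sync."""
        if self.in_sync:
            self.pending[row] = text  # later writes overwrite earlier ones
        else:
            self.announced.append(text)
```

With this shape, a progress bar redrawn many times inside one synchronized update would be announced once, with its last value, and applications would not need any accessibility-specific changes.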
> In the future, if alt text is functionality the community is interested in, a separate VT sequence should probably be introduced to provide that functionality.
Not sure what you mean by this, this proposal is essentially alt text, no?
That might be more of an indictment of terminal-wg moreso than demonstrative that people don't care about this issue. There's fundamentally no good way of doing anything like this currently, so I doubt anyone's bothered even thinking about it all that much. The synchronized updates idea is a great alternative plan here. Perhaps we should work with the |
Let's say I implement autocompletion for my shell and when you press tab on |
I think you're being a little too optimistic here, during my years of work on terminals the only interest or mention of the topic has come from Microsoft people. But sure, some people will want to support it for sure, whether they actually do when prioritized against their backlog is another thing.
@lhecker a diff like that wouldn't make sense, the changed range is |
Right, that addresses the latter example. What about the first one? And I really only intended them as examples. I'm sure there's way more examples one could contrive. For instance another one I had in mind was: |
The first may be the same because the whole prompt likely got re-printed. I would want to tie the a11y improvements like this into our shell integration support so we would also know when we're in a prompt and where the start/end/rprompt/continuations are.
That's in the bucket of things that feel out of scope to me. We won't be able to make everything in the terminal accessible, I was aiming for better prompt interactions and better natural ltr/top to bottom text flow. |
@codeofdusk I'm a bit confused from the stuff above. The idea with this proposal is that no changes would be required on the screen reader side because they're handling NVDA, of course, is special in that it's currently ignoring notifications unless a setting is enabled. If we gave the UIA notifications a different ID however, could NVDA whitelist that class of notifications? |
Sort of? I see the similarity to alt text, but I think we should not embed the sequence into the buffer. This sequence should be limited to new output and the resulting notifications entirely. So on a resize or when in scan mode, whatever special text was notified out shouldn't be found. #13666 is a standard example I can think of ( |
Chatted with @Tyriar today. Here are some takeaways from that meeting:
Main Benefits of this Approach
Main Concerns
Other Proposals
We discussed a few ideas that could be more fine-tuned. They're not mutually exclusive. Also note, these are very lightweight specs. They definitely need some fleshing out, but I'll do that when I add them to the actual spec (at some point).
Idea 1: Decorative Tags
Idea 2: Semantic Embedding
Idea 3: Flag to know if a screen reader is active
Ideally you want this to work over remote connections too, and the terminal presumably wouldn't be able to set an environment variable in that case. So my recommendation would be using one of the standard VT reporting sequences for this. For scenario 1, if we were using something like If the solution is mode based, the app can query whether the mode is supported with a For a more general way to query whether a screen reader is present, regardless of support for any particular functionality, we could define a new And as a last resort, if we just wanted a way for terminals to indicate that they support this screen reader spec in general, we could define a new feature number that is reported in the |
this one is 🌶️
As mentioned earlier, `DSR` is already a standard method for command-line applications to query the capabilities of the attached terminal emulator. By claiming a value, the terminal can easily respond to let the command-line application know if a screen reader is attached or not. In the event the terminal emulator does not support this feature, no response is given, which is common practice.
> `DSR` - Screen Reader
> - command-line application query: `CSI ? 2577 n`
> - terminal emulator response: `CSI ? 2577; Ps`
This again should end with `n`.
It's also worth mentioning that the typical pattern for `DSR` sequences like this is that the query number is a multiple of 5, and then you have separate numbers for each response, starting at a multiple of 10 (below the query number).

For example, the `DSR 5` query (operating status) responds with `DSR 0`, `DSR 1`, `DSR 2`, etc. The `DSR ? 15` query (printer port) responds with `DSR ? 10`, `DSR ? 11`, etc. There are exceptions to that rule (e.g. the `CPR` query, or the keyboard dialect query), but that's just because those don't really fit the pattern of a "status" response.

So unless we're expecting to extend this with lots of different response types, it would be more customary to use something like `DSR ? 2575` for the query, and then `DSR ? 2570` and `DSR ? 2571` for the two responses (attached and not attached).

I know I previously argued that it's better to use a single number for finite resources, but that's just because the `OSC` numbers are a bit of a nightmare in terms of conflicts, and there's no standard usage pattern. `DSR` is less of a risk, because I don't think there are any modern terminals using it (not counting XTerm's non-standard abuse).
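To show how an application might consume the query/response pattern suggested above, here is a small sketch that scans terminal input for the two proposed reports. The numbers 2575, 2570, and 2571 are only the values floated in this comment, not an accepted assignment, and `parse_screen_reader_report` is a hypothetical helper name.

```python
import re
from typing import Optional

# Proposed in the comment above, not final: query 2575, replies 2570/2571.
QUERY = "\x1b[?2575n"
REPORTS = {"2570": True, "2571": False}  # attached / not attached

_DSR_REPORT = re.compile(r"\x1b\[\?(\d+)n")


def parse_screen_reader_report(data: str) -> Optional[bool]:
    """Return True/False for an attached/not-attached report, else None.

    Other private DSR reports in the input (e.g. printer status) are
    skipped. None means no matching report was seen, which is also what
    a terminal without support would produce.
    """
    for match in _DSR_REPORT.finditer(data):
        state = REPORTS.get(match.group(1))
        if state is not None:
            return state
    return None
```

Because unsupported terminals send nothing, a real caller would need to pair this with a read timeout (or a trailing query the terminal is known to answer) rather than block waiting for a reply.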
@carlos-zamora I think I may have confused things with all my `DSR` references above. So just to be clear, `DSR` is the shorthand name for the Device Status Report operation. `CSI n` is the actual escape sequence for that operation (where `CSI` is equal to `ESC [` in 7-bit mode).

So you can say `DSR ? 2575`, or `CSI ? 2575 n`, or possibly even `ESC [ ? 2575 n`, but you wouldn't say `DSR ? 2575 n`. And in this particular area of the documentation, `CSI ? 2575 n` is probably the most appropriate.
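The notations above all describe the same bytes on the wire. As a small sketch (the helper name `dsr_query` is made up for illustration), this builds a Device Status Report query in either the 7-bit form, where `CSI` is `ESC [`, or the 8-bit form, where `CSI` is the single C1 byte `0x9B`. The spaces in `CSI ? 2575 n` are only notation and are not transmitted.

```python
ESC = b"\x1b"
CSI_7BIT = ESC + b"["  # CSI written as ESC [ in 7-bit mode
CSI_8BIT = b"\x9b"     # CSI as a single C1 control byte in 8-bit mode


def dsr_query(ps: int, private: bool = True, eight_bit: bool = False) -> bytes:
    """Encode `CSI Ps n`, or `CSI ? Ps n` when private is True."""
    if ps < 0:
        raise ValueError("Ps must be non-negative")
    csi = CSI_8BIT if eight_bit else CSI_7BIT
    marker = b"?" if private else b""
    return csi + marker + str(ps).encode("ascii") + b"n"
```

So `DSR ? 2575` in shorthand corresponds to `dsr_query(2575)`, and the standard operating status query `DSR 5` corresponds to `dsr_query(5, private=False)`.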
Ah ok. My bad. Yeah I saw you using `DSR` above and thought applying the same notation in the spec would make it more clear. Thanks for the explanation!
This spec outlines...
References