l3text: \c_group_begin_token and \c_group_end_token are not handled correctly #874

Skillmon · 2021-04-24T12:50:20Z

Inside of \text_expand:n the tokens \c_group_begin_token and \c_group_end_token aren't handled correctly. Instead of shipping tokens to the output of \text_expand:n they will leave { \if_false: } \fi: and \if_false: { \fi: }, respectively, in the input stream in front of the \exp:w-expansion, and place the next loop iteration afterwards.

This results in a low level Missing number, treated as zero. error being thrown.

MWE:

\documentclass[]{minimal}

\begin{document}
\ExplSyntaxOn
\text_expand:n { \c_group_begin_token \c_group_end_token }
\ExplSyntaxOff
\end{document}

The text was updated successfully, but these errors were encountered:

PhelypeOleinik · 2021-04-24T16:05:30Z

One option to handle this would be for \__text_expand_store:n to put things in \unexpanded in the output token list, and have a \__text_expand_store_e:n that doesn't, then at the end do one run of \expanded in the whole thing:

\documentclass{minimal}

\ExplSyntaxOn
\cs_gset:Npn \__text_expand_store:n #1
  { \__text_expand_store:nw { \exp_not:n {#1} } }
\cs_gset:Npn \__text_expand_store_e:n #1
  { \__text_expand_store:nw {#1} }
\cs_gset:Npn \__text_expand_end:w #1 \__text_expand_result:n #2
  {
    \group_align_safe_end:
    \exp_after:wN \exp_end:
    \tex_expanded:D {#2}
  }
\cs_gset:Npn \__text_expand_N_type_auxii:N #1
  {
    \token_if_eq_meaning:NNTF #1 \c_group_begin_token
      {
        \__text_expand_store_e:n { { \if_false: } \fi: }
        \__text_expand_loop:w
      }
      {
        \token_if_eq_meaning:NNTF #1 \c_group_end_token
          {
            \__text_expand_store_e:n { \if_false: { \fi: } }
            \__text_expand_loop:w
          }
          { \__text_expand_N_type_auxiii:N #1 }
      }
  }

\edef\x{\text_expand:n { \bgroup é \egroup }}\show\x
\stop

But it would fail miserably if \bgroup/\egroup are unbalanced (for example if it sees something like \@ifnextchar\bgroup{}{}).

Skillmon · 2021-04-24T17:01:17Z

Why is \bgroup/\egroup processed at all?

PhelypeOleinik · 2021-04-24T17:38:17Z

I guess it depends a bit on the scope of “what should \text_expand:n do”. I see it as expanding the argument to text, so maybe \c_left_brace_str and \c_right_brace_str would be appropriate

josephwright · 2021-04-24T17:47:45Z

Why is \bgroup/\egroup processed at all?

Equivalent to {/} and therefore forms a group - we are expandable, so we can't tell the difference.

PhelypeOleinik · 2021-04-24T17:54:12Z

we are expandable, so we can't tell the difference.

We can actually. \tl_if_head_is_group:nTF is used earlier. At this point we know it is an N-type token, so definitely \bgroup/\egroup

Skillmon · 2021-04-24T17:54:19Z

@josephwright not true, we already know the difference, since by the time we handle \bgroup/\egroup we already know the token is a valid N-type argument, so must be an implicit and not an explicit brace.

Skillmon · 2021-04-24T17:57:31Z

Imho, \bgroup/\egroup (or equivalent) shouldn't be handled at all and passed through as they are, just like \group_begin: and \group_end: would.

Skillmon · 2021-04-24T17:59:44Z

@PhelypeOleinik but the argument is (per documentation) formatted text, so could contain markup (and it might be possible that some markup is mistakenly not defined as \protected and uses some boxing with \bgroup/\egroup). So \c_left_brace_str/\c_right_brace_str are definitely wrong here.

blefloch · 2021-04-24T22:12:43Z

Related issue. The blanket statement "Implicit tokens are converted to their explicit equivalent." in the documentation of \text_purify:n seems to motivate the definition of \@@_token_to_explicit:N, which does two things I disagree with.

It turns for instance \chardef and \mathchardef tokens into \char_generate:nn {#1} { \char_value_catcode:n {#1} } (is that reasonable for \mathchardef??). This is fine for catcode 10, 11, 12 (space/letter/other) but not any other. For instance after \chardef\myhat=`^ the command \myhat typesets a hat, but it would get turned to a superscript token by this code. Same problem for catcode 1,2 etc.
It turns implicit character tokens such as \c_group_begin_token into explicit ones, which is fine for most catcodes (at least 3,4,7,8,10,11,12), but is strange for catcodes 1,2.

I think this code might historically be why \text_expand:n has the tests in \@@_expand_N_type_auxi:N, but now I don't see why the \c_@@_chardef_group_begin_token etc are filtered out. What happens if we do the following?

\cs_set_eq:NN \__text_expand_N_type_auxi:N \__text_expand_N_type_auxiii:N

car222222 · 2021-04-25T03:23:30Z

Note that the rationale for, use of and inputs to these 'text altering commands' and related concepts needs revisiting. So there is not much point in discussing right now most of the details of 'what should happen to xyz'.

But meanwhile, fixes should be added for gross infelicities such as the origin of this issue, and maybe a small number of the other contributions.

josephwright · 2023-05-18T06:45:06Z

@blefloch wrote

It turns for instance \chardef and \mathchardef tokens into \char_generate:nn {#1} { \char_value_catcode:n {#1} } (is that reasonable for \mathchardef??). This is fine for catcode 10, 11, 12 (space/letter/other) but not any other. For instance after \chardef\myhat=`^ the command \myhat typesets a hat, but it would get turned to a superscript token by this code. Same problem for catcode 1,2 etc.

Indeed, this doesn't make sense. The 11/2 distinction doesn't make sense here, so I am adjusting to 'if it's a space, make a space, otherwise make an other char'. I think that should be clear enough.

josephwright · 2023-05-18T07:07:28Z

@blefloch asked

(is that reasonable for \mathchardef??)

Good question! I suspect possibly not but we will need quite a bit of data, etc., to map characters otherwise. I think one to address as and when we find issues: at least at present we get 'something' out.

josephwright · 2023-05-18T07:27:24Z

OK, reading back the code comments I remember now why this code is there. If you look at the current comments, the aim is to deal with not only \bgroup/\egroup but also implicit spaces. The reason was that when we get 'further along' to case changing, we need to know about spaces and brace groups to decide where word boundaries are: things like 'is this a final sigma' need to know if the next token is a letter, a space or something else. Because we are dealign with TeX, we can't just assume everything is 'text', so there are some compromises.

Looking back, it's probably better to leave all implicit chars alone in \text_expand:n, and if they crop in some 'text processing' context, say 'tough'. (We do say that the argument here should be 'broadly text'.) We still need to deal with them in \text_purify:n, as that has to remove all non-Unicode text and so drop stuff if there is no alternative. But that's a different question.

At least a couple of commits coming up.

davidcarlisle · 2023-05-18T13:32:54Z

Looking back, it's probably better to leave all implicit chars alone in \text_expand:n, and if they crop in some 'text processing' context, say 'tough'. (We do say that the argument here should be 'broadly text'.) We still need to deal with them in \text_purify:n

leaving as-is in \text_expand:n and making catcode 12 or 10 in \text_purify:n sounds good to me

blefloch added bug Something isn't working expl3 labels Apr 29, 2021

blefloch assigned blefloch and unassigned blefloch May 12, 2021

josephwright removed the expl3 label May 17, 2023

josephwright self-assigned this May 17, 2023

josephwright closed this as completed in 7998284 May 18, 2023

josephwright mentioned this issue May 18, 2023

Handling of implict chars by \text_purify:n is not as documented #1208

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

l3text: \c_group_begin_token and \c_group_end_token are not handled correctly #874

l3text: \c_group_begin_token and \c_group_end_token are not handled correctly #874

Skillmon commented Apr 24, 2021

PhelypeOleinik commented Apr 24, 2021

Skillmon commented Apr 24, 2021

PhelypeOleinik commented Apr 24, 2021 •

edited

josephwright commented Apr 24, 2021

PhelypeOleinik commented Apr 24, 2021

Skillmon commented Apr 24, 2021

Skillmon commented Apr 24, 2021

Skillmon commented Apr 24, 2021

blefloch commented Apr 24, 2021

car222222 commented Apr 25, 2021

josephwright commented May 18, 2023

josephwright commented May 18, 2023

josephwright commented May 18, 2023

davidcarlisle commented May 18, 2023

l3text: \c_group_begin_token and \c_group_end_token are not handled correctly #874

l3text: \c_group_begin_token and \c_group_end_token are not handled correctly #874

Comments

Skillmon commented Apr 24, 2021

PhelypeOleinik commented Apr 24, 2021

Skillmon commented Apr 24, 2021

PhelypeOleinik commented Apr 24, 2021 • edited

josephwright commented Apr 24, 2021

PhelypeOleinik commented Apr 24, 2021

Skillmon commented Apr 24, 2021

Skillmon commented Apr 24, 2021

Skillmon commented Apr 24, 2021

blefloch commented Apr 24, 2021

car222222 commented Apr 25, 2021

josephwright commented May 18, 2023

josephwright commented May 18, 2023

josephwright commented May 18, 2023

davidcarlisle commented May 18, 2023

PhelypeOleinik commented Apr 24, 2021 •

edited