Add \tl_lowercase:nn and deprecate (non-expandable) \tl_to_lowercase:n #141

Closed
blefloch opened this issue Feb 3, 2013 · 11 comments
Labels
enhancement New feature or request

Comments

@blefloch
Member

blefloch commented Feb 3, 2013

Unlike all other \<type>_to_<thing> functions, \tl_to_lowercase:n and \tl_to_uppercase:n, which are wrappers around the corresponding TeX primitives, are not expandable. It would be better to provide \tl_lowercase:nn, analogous to \tl_rescan:nn, with a first argument to hold the setup.

\cs_new_protected:Npn \tl_lowercase:nn #1#2
  { \group_begin: #1 \tex_lowercase:D { \group_end: #2 } }

Later, we can deprecate \tl_to_lowercase:n, and even later, rename \tl_expandable_lowercase:n to \tl_to_lowercase:n. Similarly for upper case.
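
As a hedged illustration of how the proposed function might be used (a sketch only: \tl_lowercase:nn is the proposal above, not an existing kernel function, and \c_example_colon_other_tl is a hypothetical name), the setup argument changes one lccode inside a group, the second argument is lowercased inside that group, and the result is executed after the group has been closed:

```latex
\documentclass{article}
\usepackage{expl3}
\begin{document}
\ExplSyntaxOn
% Definition proposed in this issue (not part of l3kernel).
\cs_new_protected:Npn \tl_lowercase:nn #1#2
  { \group_begin: #1 \tex_lowercase:D { \group_end: #2 } }

% Hypothetical use: store a colon with category code 12 ("other").
% Under \ExplSyntaxOn a typed colon is a letter (catcode 11), so the
% token is built by lowercasing "+", which has catcode 12.
\tl_lowercase:nn
  { \char_set_lccode:nn { `\+ } { `\: } }
  { \tl_const:Nn \c_example_colon_other_tl { + } }

% The lccode change was local to the group and is no longer in force.
% Write the constant to the log for inspection.
\tl_log:N \c_example_colon_other_tl
\ExplSyntaxOff
\end{document}
```

Control sequence tokens are untouched by the primitive, so only the character token "+" in the second argument is affected.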

@josephwright
Member

I'm reasonably happy with this, provided we feel that requiring an understanding of TeX's case-changing stuff is OK. I suspect that this is the realistic position: Will's earlier attempt at \tl_transform:nn (or something similar) did not really work. I guess the only question is about naming (as I think @FrankMittelbach is not so keen on \tl_rescan:nn anyway!).

I'm also keen we do provide some form of expandable case-change, even if we know it's very slow, as this is something that people want to be able to do and we do have the code.

@blefloch
Member Author

blefloch commented Feb 5, 2013

I don't think that we can abstract away TeX's case changing, unless we set all lccodes or all uccodes to 0, except those requested by users. What are lc and uc codes used for in TeX, apart from \lowercase and \uppercase?

On naming, would you be happier with \tl_use_rescanned:nn (and \tl_set_rescanned:Nnn) and \tl_use_lowercased:nn, or similar \tl_use_...?

@wspr
Contributor

wspr commented Apr 2, 2013

I'd sort of forgotten the exact syntax for \tl_transform for case changing; I guess it was probably "lost" in the big bang; I can't remember if it turned into anything more than what's attached below. The gist was indeed to set all lccodes to 0 (whether locally or globally I guess) and then provide a wrapper macro to transform the contents as appropriate.

I don't know about "did not really work" — it was an abstraction but the question would be whether it was useful. Can probably argue this style of programming doesn't much belong in expl3, but to me it falls into the same grey zone as tl_rescan, which is useful for some but not usually for most.

\documentclass{article}
\usepackage{expl3}
\begin{document}
\ExplSyntaxOn

\cs_new:Npn \tl_transform:nn #1 {
  \group_begin:
    \tl_map_function:nN {
      \A\B\C\D\E\F\G\H\I\J\K\L\M\N\O\P\Q\R\S\T\U\V\W\X\Y\Z
    } \char_protect_uppercase:N
    \cs_set_eq:NN \char_transform:NN \char_transform_hidden:NN
    #1
    \tl_transform_aux:n
}

\cs_new:Npn \tl_transform_aux:n #1 {
    \tl_to_lowercase:n {
  \group_end:
  #1
  }
}

\cs_set:Npn \char_protect_uppercase:N #1 {
  \char_set_lccode:nn {`#1} {0}
}

\cs_new:Npn \char_transform_hidden:NN #1#2 {
  \char_set_lccode:nn {`#1} {`#2}
}

\tl_transform:nn{
  \char_transform:NN \a \b 
  \char_transform:NN \A \B 

  \char_set_catcode_active:N \~
  \char_transform:NN \~ \! 
}{
  a b c A B C
  \cs_set:Npn ~ {BANG}
}

\par
\char_set_catcode_active:N \!
!

\end{document}

@josephwright
Member

To be clear, by 'did not really work' I meant in terms of a fit with expl3 (indeed, we might decide the same about \tl_transform). At a technical level it did of course work!

I'd like to move on @blefloch's suggestion if we are all agreed: \tl_to_lowercase:n as it stands is a poor fit for expl3 and is one of the few rough edges in the code we have in l3kernel.

@josephwright
Member

I've added some code: see the comments there. May be worth raising on LaTeX-L.

wspr pushed a commit that referenced this issue Apr 4, 2013
I wonder if we might actually be better if the first argument
is a mapping

    \tl_lowercase:nn
      {
         { `\A } { `\A }
         { `\B } { `\c }
      }
      { <stuff> }

but then this has the issue that there are different ways of 
giving a charcode and if you go with the above you can only accept
one (presumably as I've done above by number).

Longer-term, the above might suggest we don't need `\char_set_lccode:n`,
etc., at all, but I'm wary of that as there is an interaction between
lccode/uccode and for example end-of-sentence spacing.

Thoughts on all of this most welcome!


git-svn-id: http://www.latex-project.org/svnroot/experimental/trunk@4478 de43f980-851b-0410-b2f7-c40aca1f87e0
@blefloch
Member Author

blefloch commented Apr 6, 2013

(Oops, wall of text ahead.) I think we need to think about why we need analogs of \uppercase or \lowercase. Right now, I can think of three distinct tasks for which one could want case-changing:

  • Building weird character tokens. This can be done in two ways: either with \lowercase/\uppercase in the usual way, or by defining a temporary helper. For instance, a function to strip the trailing catcode-12 "pt" from a token list such as "1pt" could be defined in at least two ways:

    \group_begin:
    \char_set_uccode:nn { `\+ } { `\p }
    \char_set_uccode:nn { `\- } { `\t }
    \tex_uppercase:D { \group_end:
      \cs_new:Npn \@@_strip_pt:w #1 + - {#1} }
    
    \cs_set_protected:Npn \@@_tmp:w #1
      { \cs_new:Npn \@@_strip_pt:w ##1 #1 {##1} }
    \exp_args:No \@@_tmp:w { \tl_to_str:n { pt } }
    

    The first way is more general and allows building almost any weird catcode-charcode combination, with the following exceptions: one cannot get character code 0 from \tex_uppercase:D; one cannot get catcode-10 characters after control sequences (except single-character csnames), or several such characters in a row; and one cannot get two catcode-10 characters with two distinct character codes. The restrictions on spaces come from how TeX normalizes all catcode-10 characters to character code 32 upon input. In particular, the auxiliaries we use to strip spaces from token lists or comma-lists are defined in the second way. This second way is less general but suffices for the great majority of cases (e.g., in all those weird conditionals in l3token).

  • Uppercasing a title or other piece of text. Doing this properly requires a better understanding of the structure of the text being uppercased, so as to avoid uppercasing environment names, mathematics, etc. Also, a general approach should allow for title-casing, which opens a whole new can of worms. All this, I believe, should not be done when operating on token lists, but rather at some slightly later stage of processing.

  • Lowercasing some piece of text, for instance to canonicalize names or words for sorting, or to work with case-insensitive file systems. Simply forgetting case is not enough to sort properly anyway: for indexes we need to think quite a lot about giving users the option to coalesce names, and we need to take regional differences in alphabetical order into account. When working with the file system, I would say that we want a fixed dictionary between upper and lower case, which should not be affected by the lccode and uccode. Also, this should happen when working on strings of characters, since the OS does not know what a token is.
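
To make the contrast in the third item concrete, here is a small sketch. It assumes a recent l3kernel in which the expandable string function \str_lowercase:n is available (it did not exist when this comment was written), and it deliberately sets an odd lccode to show the difference between the primitive and a fixed string mapping:

```latex
\documentclass{article}
\usepackage{expl3}
\begin{document}
\ExplSyntaxOn
\group_begin:
  % Deliberately odd local setting: the primitive now maps "A" to "z".
  \char_set_lccode:nn { `\A } { `\z }
  % Primitive route: the result depends on the lccodes in force.
  \tex_lowercase:D { \tl_set:Nn \l_tmpa_tl { ABC } }
  % String route: expandable, and intended to follow a fixed mapping
  % rather than the current lccode settings.
  \str_set:Nx \l_tmpa_str { \str_lowercase:n { ABC } }
  % Write both results to the log for comparison.
  \tl_log:N \l_tmpa_tl
  \tl_log:N \l_tmpa_str
\group_end:
\ExplSyntaxOff
\end{document}
```

The string route also works inside expansion-only contexts, which the primitive cannot do.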

I might be reducing the questions of uppercasing and lowercasing to very small special cases, and if so, correct me. My current impression, though, is that we need two functions: one to lowercase in a controlled way a string of characters, and one to produce weird tokens. It does not make sense to me to define such a function to define weird tokens with lowercase or uppercase explicitly in its name. Thus I am quite fond of Will's \tl_transform:nn, at least as a rough starting point.

On the question of how to implement it, I've asked on TeX.sx when TeX uses each code. It may be possible to set uccodes to 0 during the whole TeX run, so that the function only has to apply the setup requested by the user.

From Joseph's commit,

\tl_lowercase:nn
  {
     { `\A } { `\A }
     { `\B } { `\c }
  }
  { <stuff> }

but then this has the issue that there are different ways of
giving a charcode and if you go with the above you can only accept
one (presumably as I've done above by number).

I am not sure whether to go with Will's version of \tl_transform:nn,

\tl_transform:nn
  {
    \char_transform:nn { `A } { `A }
    \char_transform:nn { `B } { `c }
    \char_set_catcode_active:N \@
    \char_transform:nn { `\@ } { `\% }
  }
  { \cs_set_protected:Npn @ { BANG } }

or with

\group_begin:
\char_set_catcode_active:N \@
\tl_transform:nn
  { { `A } { `A }  { `B } { `c }  { `\@ } { `\% } }
  { \group_end: \cs_set_protected:Npn @ { BANG } }

or with one argument for character code changes and one for other setups,

\tl_transform:nnn
  { \char_set_catcode_active:N \@ }
  { { `A } { `A }  { `B } { `c }  { `\@ } { `\% } }
  { \cs_set_protected:Npn @ { BANG } }

or perhaps in such cases one should mix in \tl_rescan:nn (which does something a little bit different):

\tl_rescan:nn
  { \char_set_catcode_active:N \@ }
  {
    \tl_transform:nn
      { { `A } { `A }  { `B } { `c }  { `\@ } { `\% } }
      { \cs_set_protected:Npn @ { BANG } }
  }

\tl_transform:nn
  { { `A } { `A }  { `B } { `c }  { `\@ } { `\% } }
  {
    \tl_rescan:nn
      { \char_set_catcode_active:N \% }
      { \cs_set_protected:Npn @ { BANG } }
  }

Longer-term, the above might suggest we don't need \char_set_lccode:n,
etc., at all, but I'm wary of that as there is an interaction between
lccode/uccode and for example end-of-sentence spacing.

There is no interaction with end-of-sentence spacing, but there is one with hyphenation, at least. See the TeX.sx question linked above for details if anyone answers.

@wspr
Contributor

wspr commented Apr 7, 2013

I might be reducing the questions of uppercasing and lowercasing to very small special cases, and if so, correct me. My current impression, though, is that we need two functions: one to lowercase in a controlled way a string of characters, and one to produce weird tokens. It does not make sense to me to define such a function to define weird tokens with lowercase or uppercase explicitly in its name. Thus I am quite fond of Will's \tl_transform:nn, at least as a rough starting point.

I like the thought process behind the different syntaxes here, and I agree with you that we're basically talking about two different things and we should design the syntax for these to be either appropriate to both or have two separate commands for the two separate ideas.

I still lean towards a generic "setup" argument, since you never know what else you'll want to do in there, such as redefine macros or provide "local-only" definitions.

If you wanted a shorthand for input char mapping, instead of

\char_transform:nn {`\A}{`\B}
\char_transform:nn {`\C}{`\D}

etc, a wrapper would be fairly tidy I guess:

\tl_transform:nn
 { \char_transform:n { {`\A}{`\B} {`\C}{`\D} } ... }
 { ... }
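
One possible shape for such a wrapper, as a rough sketch only: the names \example_char_transform:n and \example_char_transform_loop:nn are hypothetical stand-ins for the \char_transform:n suggested above, and the mapping is assumed to be a flat list of braced character-code pairs. It walks over the pairs with the standard quark recursion tools and sets one lccode per pair:

```latex
\documentclass{article}
\usepackage{expl3}
\begin{document}
\ExplSyntaxOn
% Hypothetical wrapper: #1 is a flat list {from}{to}{from}{to}...
\cs_new_protected:Npn \example_char_transform:n #1
  {
    \example_char_transform_loop:nn #1
      \q_recursion_tail \q_recursion_tail \q_recursion_stop
  }
% Take one pair at a time; stop when the end marker is reached.
\cs_new_protected:Npn \example_char_transform_loop:nn #1#2
  {
    \quark_if_recursion_tail_stop:n {#1}
    \char_set_lccode:nn {#1} {#2}
    \example_char_transform_loop:nn
  }

% Usage: build a constant holding "pt" with category code 12.
\group_begin:
  \example_char_transform:n { { `\+ } { `\p } { `\- } { `\t } }
  \tex_lowercase:D
    { \group_end: \tl_const:Nn \c_example_pt_other_tl { + - } }
\tl_log:N \c_example_pt_other_tl
\ExplSyntaxOff
\end{document}
```

Keeping the mapping as a separate function still leaves room for a generic setup argument, since the wrapper can sit next to any other setup code.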

@josephwright
Member

My concern with a generic 'set up' is that you are then mixing stuff up. I can see an argument for this, as some effects are otherwise tricky to achieve, but do want to be sure that's what we are after.

I'd agree with Bruno's analysis that there are distinct cases. Suggests to me that we shouldn't add anything new with 'lower/uppercase' in the name at the moment, so I'm going to back-out the additions.

wspr pushed a commit that referenced this issue Apr 9, 2013
As discussed in issue #141, there are separate use cases for
the primitives here, and it's likely we want to cover the
'odd category code' case with a name which reflects this.


git-svn-id: http://www.latex-project.org/svnroot/experimental/trunk@4481 de43f980-851b-0410-b2f7-c40aca1f87e0
@josephwright josephwright self-assigned this Jul 15, 2015
@josephwright
Member

As agreed at TUG2015: do the deprecation and ask for real use cases going forward. Talk to Ulrike Fischer.

@josephwright
Member

Progress update: most of the use cases will be removed from expl3 this week, although a second round will be needed later to get rid of some uses of \tex_lowercase:D. Those might be replaced by \char_generate:nn, but at present that is awkward (mainly due to XeTeX).
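
For readers wondering what such a replacement might look like, here is a minimal sketch. It is not the kernel's actual code: \c_example_pt_other_tl, \example_tmp:w and \example_strip_pt:w are hypothetical names. It builds the catcode-12 "pt" delimiter from the earlier example with \char_generate:nn instead of a case-changing primitive:

```latex
\documentclass{article}
\usepackage{expl3}
\begin{document}
\ExplSyntaxOn
% Build the two catcode-12 characters "pt" by expansion alone.
\tl_const:Nx \c_example_pt_other_tl
  {
    \char_generate:nn { `\p } { 12 }
    \char_generate:nn { `\t } { 12 }
  }
% Hypothetical helper, following the "temporary helper" pattern used
% earlier in this thread: #1 becomes the delimiter of the new function.
\cs_set_protected:Npn \example_tmp:w #1
  { \cs_new:Npn \example_strip_pt:w ##1 #1 {##1} }
\exp_args:NV \example_tmp:w \c_example_pt_other_tl

% Usage: typeset a dimension without its unit.
\dim_set:Nn \l_tmpa_dim { 12pt }
\exp_after:wN \example_strip_pt:w \dim_use:N \l_tmpa_dim
\ExplSyntaxOff
\end{document}
```

No group and no lccode or uccode assignment is needed here, which is the main attraction of this route over \tex_lowercase:D.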

@josephwright
Member

This was done a while ago.
