Refine titlecasing #1247

josephwright · 2023-07-08T15:10:37Z

Follows from discussion in #1240. Here, a clearer split occurs between titlecasing one and multiple words, without any additional naming needed. It also avoids any change to the formal behaviour of the existing functions; note, though, that the new functions do not lowercase the remainder of their input.

josephwright · 2023-07-08T15:11:32Z

@moewew Sorry, I've changed the plan a bit. I think this one would not require any immediate biblatex action, though you might want to shift over time.

josephwright · 2023-07-08T15:11:57Z

@gmilde Hopefully more to your taste: no immediate change to existing functions.

josephwright · 2023-07-08T15:12:44Z

Note that I've not at present added a 'word exception' mechanism, though that is trivial if we stick with the 'word = between two spaces' approximation. Thoughts on this aspect welcome.

moewew · 2023-07-09T06:01:15Z

So just to be sure: This implements \text_titlecase_once, which capitalises the first word and leaves the rest alone, and \text_titlecase_all, which capitalises all words?

Does "capitalise" here mean roughly (barring special cases like Dutch "IJ") uppercase first letter and leave the rest alone or does it mean uppercase first letter and lowercase the rest?

biblatex needs a function that capitalises the first word and lowercases all others, as that's our "sentence case". That would no longer be available with this change, right? (Currently that's what \text_titlecase does for us. We use \text_titlecase_first to capitalise the first word in a string and leave the rest alone.)

car222222 · 2023-07-09T11:53:45Z

Is it necessary to introduce "titlecase_once"?
"once" seems too imprecise.
Is there a problem with using "titlecase_first"?
Even more precise would be "titlecase_firstword".

josephwright · 2023-07-09T18:18:42Z

@car222222 I was thinking of the existing tl functions, where we have replace_once and replace_all. I'm happy with first remaining if it is clear enough, and if that then makes sense with all. (I thought of including word in the naming, but it got a bit long.)

FrankMittelbach · 2023-07-09T21:07:42Z

@car222222 I was thinking of the existing tl functions, where we have replace_once and replace_all. I'm happy with first remaining if it is clear enough, and if that then makes sense with all. (I thought of including word in the naming, but it got a bit long.)

I think the argument for using consistent names is compelling, even if "firstword" is slightly clearer when seen without context or without knowledge of expl3. We always use "once" in the sense of "first occurrence", and thus using the standard name is on the whole better in my opinion.

car222222 · 2023-07-10T07:48:10Z

@FrankMittelbach I am happy that "once" always means the first occurrence, but here there is nothing to indicate clearly what is occurring! Using "wordonce" (better "oneword") and "allwords" would be clearer.

In fact replace_first would have been much better :-).

car222222 · 2023-07-10T08:11:16Z

In the case of replace, it is very likely that the one to be replaced (once) is the last occurrence rather than the first.

josephwright · 2023-07-10T08:41:41Z

I'm happy if we want to go with first, but I would then look to adjust other occurrences.

car222222 · 2023-07-10T08:52:50Z

@josephwright Do you mean other occurrences of "once",
or of "first"?

Note also that in this case we still need to indicate the object affected, i.e., "word", since the term "titlecase" does not obviously have much relationship with words.

josephwright · 2023-07-10T10:04:56Z

So just to be sure: This implements \text_titlecase_once, which capitalises the first word and leaves the rest alone, and \text_titlecase_all, which capitalises all words?

Correct.

Does "capitalise" here mean roughly (barring special cases like Dutch "IJ") uppercase first letter and leave the rest alone or does it mean uppercase first letter and lowercase the rest?

Broadly uppercase first, leave rest alone. @gmilde suggested that on balance, when you look across different languages, leaving non-initial characters alone is likely best. (Also, Unicode only describes titlecase in terms of the action at the start of a word, and does not mention the effect on the remainder of the input.)

biblatex needs a function that capitalises the first word and lowercases all others, as that's our "sentence case". That would no longer be available with this change, right? (Currently that's what \text_titlecase does for us. We use \text_titlecase_first to capitalise the first word in a string and leave the rest alone.)

Current suggestion would retain \text_titlecase:n with unchanged behaviour by applying \text_lowercase:n inside \text_titlecase_once:n. However, that's relatively slow as it's two full passes. If it is overall decided that we need a function (or set of functions) which lowercase everything that's not titlecased, then I think a clearer name would be good. 'capitalise' has already been suggested, doesn't get into 'sentence' questions, etc. So perhaps \text_capitalise_once:n (or first) and \text_capitalise_all:n? (Likely implementation would be a faster version of the 'just apply lowercase first' plan.)

car222222 · 2023-07-10T10:21:28Z

But "capitalise" still does not convey that it is applied to words!
Can we have something that involves "word", please?

josephwright · 2023-07-10T10:24:29Z

@car222222 On 'titlecasing', I think the Unicode FAQ is clear that it's a word-based property:

Titlecase takes its name from the case format used when forming a title, in which the initial letter in a word is capitalized and the rest are not. Titlecase is also used in forming a sentence by capitalizing the first word, and for forming proper names. The titlecase mapping in the Unicode Standard is the mapping applied to the initial character in a word.

I'm not sure how one could describe 'capitalisation' in a way that didn't link to words.

car222222 · 2023-07-10T10:42:44Z

To me, capitalisation means much the same as uppercasing.
I shall look this up!

car222222 · 2023-07-10T10:50:02Z

capitalize: write or print (a word or letter) in capital letters

But also: to write a letter of the alphabet as a capital, or to write the first letter of a word as a capital

And: To capitalize a word is to make its first letter a capital letter

So again, it seems that getting "word" in there will remove all ambiguities.

car222222 · 2023-07-10T12:48:01Z

Something completely different, try:

_initialcap_allwords

_initialcap_firstword

josephwright · 2023-07-11T12:49:24Z

@moewew I wonder if it's best for me to set up to internally-optimise the \text_...case:n { \text_...case:n { ...} } structure so we don't need another name: I think \text_titlecase_first:n { \text_lowercase:n { <input> } } looks quite clear and usable.
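
For illustration, a minimal sketch of that nested form. It assumes a kernel that provides \text_titlecase_first:n and \text_lowercase:n; the wrapper name \SketchSentenceCase is hypothetical and not part of expl3 or biblatex.

```latex
\documentclass{article}

\ExplSyntaxOn
% Hypothetical wrapper (name is an assumption for this sketch):
% lowercase the whole input, then titlecase only the start of the result.
\NewDocumentCommand \SketchSentenceCase { m }
  { \text_titlecase_first:n { \text_lowercase:n {#1} } }
\ExplSyntaxOff

\begin{document}
\SketchSentenceCase{lorem ipsum Dolor sit Amet}
\end{document}
```

The two steps stay visible at the call site, which is the point of the proposal: no extra function name is needed for the "lowercase the rest" behaviour.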

moewew · 2023-07-11T19:17:45Z

Hmmm, with my biblatex hat on I mainly care about having a tool to do the job, I don't really care whether I can use a predefined function or have to cook up my own combination of \text_titlecase_first:n and \text_lowercase:n. Of course it would be good if the implementation could be efficient (or at least not actively wasteful).

My personal opinion is that it would be cool to have the option to stack \text_... functions without a significant performance hit, but my fear would be that that might be tricky to pull off if one wants to keep the code clean. I don't know what your approach is in LaTeX3 when it comes to balancing these kinds of things.

josephwright · 2023-07-11T19:31:10Z

@moewew I'm thinking of using a marker so text expansion only needs to happen once. If you are happy with a 'two part' approach, I think this looks cleaner: we simplify the naming, etc. Now I need to see if the PR is signed off: as we are at TUG from Thursday p.m., I'll ask the team in person.

car222222 · 2023-07-14T06:08:53Z

How can it get Signed Off now?
Are you not still working on the code?

When the code and names are decided, then it will need something to fill this void:

% \begin{documentation}
%
% \end{documentation}

josephwright · 2023-07-14T06:49:56Z

How can it get Signed Off now? Are you not still working on the code?

Code is finished at this point: I am not planning to include the 'skip words' idea at this stage.

When the code and names are decided, then it will need something to fill this void:
% \begin{documentation}
%
% \end{documentation}

The documentation for all of the l3text-... files is in l3text.dtx.

car222222 · 2023-07-14T07:29:17Z

So should you remove this empty environment from this file?

Or put therein the reference to l3text.dtx (where I shall now check).

car222222 · 2023-07-16T06:10:54Z

Here is some further information on use of the term 'titlecase' in and around Unicode (and a little more generally).

@josephwright wrote (elsewhere):

the Unicode FAQ does use 'titlecase' for the idea of 'capitalising': https://unicode.org/faq/casemap_charprop.html#4

However, according to this FAQ (and my research) the Unicode
documentation in reality uses the term only within these two
specific terms:

the name of a character mapping, “titlecase mapping”, which
is the character mapping that is used by “Capitalisation”;
the name of a Unicode “general category” of characters:
Lt (letter, titlecase).

The above referenced FAQ does indeed also allude (without a
reference) to the possible use elsewhere of the term “titlecase”,
but it seems to clearly eschew (==not support) such usage within
Unicode documentation beyond the cases of the two terms listed above.

Elsewhere, the term is used, occasionally and inconsistently, for the somewhat ill-defined idea of transforming “a title” by uppercasing (or, in sophisticated cases, titlecasing) the start of some more or less well-defined selection of the words.
Also occasionally, for a transformation that changes only the initial character(s) of a word or string (and is never applied to normal multi-word text).

I could not find any examples of its use in relation to the capitalisation of all words in multi-word text.

josephwright · 2023-10-13T12:07:27Z

Right, rebasing, etc. for a merge (@moewew)

moewew · 2023-10-13T19:23:30Z

There are a couple of references to titlecase_once (which is not defined in the current version of expl3 on my machine), but then it is not documented (or implemented?) anywhere. Is that intended?

I don't think I'll manage to get the current GitHub dev version of expl3 running in time, so I'll have to do this blind.

Can you please tell me what

\documentclass{article}

\ExplSyntaxOn
\def\test#1{\text_titlecase_first:n{\text_lowercase:n{#1}}}
\ExplSyntaxOff

\begin{document}
\test{Lorem ipsum Dolor sit Amet A}

\test{lorem ipsum Dolor sit Amet A}
\end{document}

produces with this PR merged?

josephwright · 2023-10-14T10:43:11Z

There are a couple of references to titlecase_once (which is not defined in the current version of expl3 on my machine), but then it is not documented (or implemented?) anywhere. Is that intended?

No - I'll tidy those up - we had a bit of back-and-forth about first vs once.

I don't think I'll manage to get the current GitHub dev version of expl3 running in time, so I'll have to do this blind.

No pressure - we just did a release, I can sit on this for a while.

\test{Lorem ipsum Dolor sit Amet A}
\test{lorem ipsum Dolor sit Amet A}

Both will give Lorem ipsum dolor sit amet a. The overall feeling seemed to be we don't need to lowercase then titlecase in most applications, and that just titlecasing is enough. If that turns out to be problematic, I'll sort something a bit more integrated but where it's clear that first/all is about the 'words' and that the lowercasing is different.

moewew · 2023-10-15T07:10:06Z

Can you tell me what

\documentclass{article}
\usepackage{csquotes}

\ExplSyntaxOn
\bool_set_false:N  \l_text_titlecase_check_letter_bool
\def\test#1{%
  \text_titlecase_first:n{\text_lowercase:n{#1}}\par
  \text_titlecase:n{#1}}
\ExplSyntaxOff

\begin{document}
\test{\enquote{lorem ipsum}}

\test{\enquote*{lorem ipsum}}
\end{document}

gives with the PR merged?

I get

“Lorem ipsum”
“Lorem ipsum”
‘l’orem ipsum
‘lorem ipsum’

where the "‘l’orem ipsum" is not what I want. That means that at least in the current version of the kernel the nested \text_ calls are not a proper replacement for what we need. (So I cannot just switch biblatex to using that for good. We actually need some case distinction on how old the kernel is... presuming it works on dev.)

josephwright · 2023-10-15T19:13:33Z

@moewew 'lorem ipsum' for both. As the deprecation code already covers \text_titlecase:n, I'm not sure why you'd want \text_titlecase_first:n { \text_lowercase:n { .. } } at all.

moewew · 2023-10-16T04:47:49Z

Hmmm.

To be honest, I'm a bit lost here. I just thought \text_titlecase:n was removed and we had to look for an alternative. Can we just keep on using it even though it is marked as deprecated?

josephwright · 2023-10-16T05:46:02Z

@moewew We never remove any functions, although here there are some edge-case changes in behaviour as deprecation goes with shifting to emulation. If you need a function that lowercases first and want that to work before and after the update, and will never give a deprecation warning, something like

\cs_if_exist:NTF \text_titlecase_all:n
  {
    \cs_new:Npn \__mypkg_titlecase:n #1
      { \text_titlecase_first:n { \text_lowercase:n {#1} } }
  }
  { \cs_new_eq:NN \__mypkg_titlecase:n \text_titlecase:n }

would work.

However, one of the questions I was trying to sort early on is to what extent that is actually required. The suggestion was that lowercasing 'the remainder' is likely not that useful, as the typical pattern is only to worry about the first character (\text_titlecase_first:n) or the start of each word (\text_titlecase_all:n), leaving the other characters unchanged.
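
A minimal sketch contrasting the two functions named above. It assumes a kernel recent enough to provide \text_titlecase_all:n; the wrapper names \SketchFirst and \SketchAll are hypothetical.

```latex
\documentclass{article}

\ExplSyntaxOn
% Hypothetical wrappers (names are assumptions for this sketch).
% Titlecase only the start of the input.
\NewDocumentCommand \SketchFirst { m } { \text_titlecase_first:n {#1} }
% Titlecase the start of each word in the input.
\NewDocumentCommand \SketchAll   { m } { \text_titlecase_all:n {#1} }
\ExplSyntaxOff

\begin{document}
\SketchFirst{lorem ipsum Dolor sit Amet}

\SketchAll{lorem ipsum Dolor sit Amet}
\end{document}
```

Going by the description in this thread, neither call should lowercase anything, so capitals already present in the input are expected to survive in both lines.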

moewew · 2023-10-16T15:58:54Z

However, one of the questions I was trying to sort early on is to what extent that is actually required.

We need it to turn titles into sentence case. Entries in the .bib file are supposed to be given in title case and for styles that need it biblatex provides a function to convert the title into sentence case (as does BibTeX). I can't say anything about how useful such a function is in general, but biblatex definitely needs it.

FrankMittelbach · 2023-10-16T16:35:53Z

@moewew what is the biblatex approach currently to handle uppercase letters that are supposed to stay unchanged, say, something like IBM in the title?

josephwright · 2023-10-16T17:31:19Z

@moewew Is the main concern the performance of \text_titlecase_first:n { \text_lowercase:n { ... } }? If so, I can address that, I think - all that's needed is to shortcut a few internals if that situation is detected.

moewew · 2023-10-16T19:23:14Z

@FrankMittelbach The default is to use curly braces thanks to some very clever code by @josephwright and @blefloch that essentially turns {...} into \NoCaseChange{...} (see the long discussion in plk/biblatex#960 and in particular moewew/biblatex@c485f1b) - same as in classical BibTeX (it even gets the BibTeX exceptions broadly right). But users can also set the option bibtexcaseprotection=false, to choose to not have curly braces imply case protection and use \NoCaseChange instead (I prefer this, because the curly braces thing gets messy what with them also being used for arguments, but I doubt that lots of people do this - backwards compatibility with BibTeX is important for many, old habits die hard and the new options are probably not very widely known anyway).

@josephwright Performance is a secondary concern, though I will admit that I have wondered how much the new two-pass scheme will affect performance. My primary concern is backwards compatibility. We need to retain the documented behaviour of \MakeSentencecase and friends and should ideally also preserve its current behaviour as closely as possible (though I could probably live with small deviations in edge cases - this whole thing is just too complicated).

I don't really care how this is implemented either on the expl3 side or the biblatex side as long as the biblatex code follows best practices so that we can be as sure as one can be that what we do is supported for the foreseeable future, benefits from improvements in expl3, and won't break because of what we did. I thought one was not supposed to use deprecated functions, so I looked for a replacement and thought it would be \text_titlecase_first:n { \text_lowercase:n { ... } }. I noticed that those commands were available even in old kernel versions, which would make things easier for us, because we wouldn't have to branch on expl3 versions. So I tried just replacing \text_titlecase:n with that. Then I found the unfortunate example with csquotes' \enquote* shown above that does not produce a good result with my expl3 kernel, which suggested we can't just switch to \text_titlecase_first:n { \text_lowercase:n { ... } } for good even in older kernels, because that would break backwards compatibility, as \text_titlecase:n does the right thing there. (All tested with L3 programming layer <2023-08-29> without this PR and not with the dev version.)

If using the deprecated function will not change behaviour significantly, will work in the foreseeable future and will not generate warnings that users could complain about, I'm perfectly fine doing that. If using deprecated functions is generally frowned upon or may cause warnings (e.g. with stricter expl3 code checking settings) or could cause issues in the future, I'd rather not go down that route, because sooner or later someone will complain.

josephwright · 2023-10-17T21:04:38Z

@moewew I think I can solve the \enquote* issue. There's an oversight in the current code if we 'break' with exactly one brace group left. Luckily, that can be patched such that it will be repaired for both the old and new approaches, and in a way you could simply add to biblatex as a 'hotfix' for a transition period:

\cs_gset:Npn \__text_change_case_break:w #1 \q__text_recursion_stop
  {
    \__text_change_case_break_aux:w ? #1
  }
\cs_gset:Npn \__text_change_case_break_aux:w #1 \q__text_recursion_tail
  {
    \__text_change_case_store:o { \use_none:n #1 }
    \__text_change_case_end:w
  }

With that, you should be able to use \text_titlecase_first:n { \text_lowercase:n { ... } } in an update. Would that be acceptable?

The issue I've been having is that 'titlecase' really means just changing the first char, and depending on the exact usage it may or may not be expected to lowercase the remainder of the input. I really would rather split the two concepts.

If there is a performance issue, I can see a way to do a lookahead and shortcut some of the code that would be repetitive.

Skillmon · 2023-10-17T21:09:52Z

@josephwright random optimisation note: Use \prg_do_nothing: instead of ? to protect against accidental brace-stripping, and just \__text_change_case_store:o {#1} without the \use_none:n.
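
An untested sketch of how that suggestion might look when applied to the hotfix quoted in the previous comment. All function names are l3text internals copied from that hotfix, so this is an illustration of the idea under that assumption, not supported package-level code.

```latex
\ExplSyntaxOn
% \prg_do_nothing: takes the place of the ? marker: it stops a lone
% brace group in #1 from losing its braces when grabbed as a
% delimited argument.
\cs_gset:Npn \__text_change_case_break:w #1 \q__text_recursion_stop
  {
    \__text_change_case_break_aux:w \prg_do_nothing: #1
  }
% The single expansion step of the o-type argument removes
% \prg_do_nothing:, so \use_none:n is no longer needed.
\cs_gset:Npn \__text_change_case_break_aux:w #1 \q__text_recursion_tail
  {
    \__text_change_case_store:o {#1}
    \__text_change_case_end:w
  }
\ExplSyntaxOff
```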

moewew · 2023-10-18T19:11:08Z

With that, you should be able to use \text_titlecase_first:n { \text_lowercase:n { ... } } in an update. Would that be acceptable?

Sure. As I said, I don't really care what exact code we use, all I care about is that it does what we need to do and we use supported code/follow best practices. From what I gather so far this should give us what we need. (I don't have time to test this at the moment. Hopefully over the weekend...)

The issue I've been having is that 'titlecase' really means just changing the first char, and depending on the exact usage it may or may not be expected to lowercase the remainder of the input. I really would rather split the two concepts.

Fair enough. It felt a bit odd to me to use a macro with titlecase in its name to effectively implement sentence casing (in a way the exact opposite of title case), so I can see where this whole discussion is coming from (as I said in #1232).

If there is a performance issue, I can see a way to do a lookahead and shortcut some of the code that would be repetitive.

Performance improvements would be cool, but I have no idea if this would really impact users significantly. (After all, biblatex already slows down things for other reasons.)

josephwright · 2023-10-18T19:24:04Z

@moewew OK, I will hold off from a release until at least the start of next week - probably will do one soon-ish as this is a non-trivial change.

moewew · 2023-10-18T19:27:11Z

If a new biblatex release is required to avoid things breaking, we will probably need more time, because we also need to release Biber (and that means waiting for the binaries to be built by other contributors), but maybe we don't actually need one, because \text_titlecase:n is not removed... (I guess I'll see if that's the case on Sunday.)

josephwright · 2023-10-18T19:28:35Z

@moewew Ah, right, yes: I'll still wait to make sure there's nothing unexpected

moewew · 2023-10-23T05:17:26Z

Prepared plk/biblatex#1310 for biblatex. As far as I understood, we do not need to merge and release this immediately after the expl3 update because you have backwards compatibility code in place to keep \text_titlecase:n around. Is that correct? Then I would wait for the expl3 update to arrive on my machine, run a few more tests and see if anything changed.

josephwright · 2023-10-23T06:04:33Z

@moewew Sounds good: I will do a release today

josephwright linked an issue Jul 8, 2023 that may be closed by this pull request

Applied to a title, \MakeTitlecase does not what its name suggests. #1232

Closed

josephwright requested a review from zauguin July 8, 2023 15:10

josephwright requested a review from car222222 July 8, 2023 15:13

moewew mentioned this pull request Jul 9, 2023

Prepare for expl3 case changing change plk/biblatex#1293

Closed

zauguin approved these changes Jul 9, 2023

josephwright mentioned this pull request Jul 15, 2023

Refine titlecasing and introduce formal sentence casing #1240

Closed

josephwright added 6 commits October 13, 2023 13:05

s/once/first/g

d051e18

Detail changes in ChangeLog

948fac9

Missed one test file

86a0c3b

Update docs

533d846

Add a comma

5bc4986

Revise docs

5095b47

josephwright force-pushed the gh1232-titlecase-2 branch from 3fd6554 to 5095b47 Compare October 13, 2023 12:31

josephwright merged commit 632bbc4 into main Oct 13, 2023
6 checks passed

josephwright deleted the gh1232-titlecase-2 branch October 13, 2023 12:57

moewew added a commit to moewew/biblatex that referenced this pull request Oct 23, 2023

Adapt to expl3 case changing change

efa65b4

we no longer use \text_titlecase:n for sentence casing. See <latex3/latex3#1247>.

moewew mentioned this pull request Oct 23, 2023

Adapt to expl3 case changing change plk/biblatex#1310

Merged
