Refine titlecasing #1247

Merged
josephwright merged 12 commits into main from gh1232-titlecase-2
Oct 13, 2023

josephwright
Member

Follows from discussion in #1240. Here, a clearer split occurs between titlecasing one and multiple words, without any additional naming needed. It also avoids any change to the formal behaviour of the existing functions; note, though, that the new functions do not lowercase the remainder of their input.

@josephwright
Member Author

@moewew Sorry I've changed the plan a bit. This one I think would not require any immediate biblatex action, though you might want to shift over time.

@josephwright
Member Author

@gmilde Hopefully more to your taste: no immediate change to existing functions.

@josephwright
Member Author

Note that I've not at present added a 'word exception' mechanism, though that is trivial if we stick with the 'word = between two spaces' approximation. Thoughts on this aspect welcome.

@moewew
Contributor

moewew commented Jul 9, 2023

So just to be sure: This implements \text_titlecase_once, which capitalises the first word and leaves the rest alone, and \text_titlecase_all, which capitalises all words?

Does "capitalise" here mean roughly (barring special cases like Dutch "IJ") uppercase first letter and leave the rest alone or does it mean uppercase first letter and lowercase the rest?

biblatex needs a function that capitalises the first word and lowercases all others, as that's our "sentence case". That would no longer be available with this change, right? (Currently that's what \text_titlecase does for us. We use \text_titlecase_first to capitalise the first word in a string and leave the rest alone.)

@car222222
Contributor

Is it necessary to introduce "titlecase_once"?
"once" seems too imprecise.
Is there a problem with using "titlecase_first"?
Even more precise would be "titlecase_firstword".

@josephwright
Member Author

@car222222 I was thinking of the existing tl functions, where we have replace_once and replace_all. I'm happy with first remaining if it is clear enough, and if that then makes sense with all. (I thought of including word in the naming, but it got a bit long.)

@FrankMittelbach
Member

> @car222222 I was thinking of the existing tl functions, where we have replace_once and replace_all. I'm happy with first remaining if it is clear enough, and if that then makes sense with all. (I thought of including word in the naming, but it got a bit long.)

I think the argument for using consistent names is compelling, even if "firstword" is slightly clearer when seen without context or without knowledge of expl3. We always use "once" in the sense of "first occurrence", and thus using the standard name is on the whole better in my opinion.

@car222222
Contributor

@FrankMittelbach I am happy that "once" always means the first occurrence, but here there is nothing to indicate clearly what is occurring! Using "wordonce" (better "oneword") and "allwords" would be clearer.

In fact replace_first would have been much better:-).

@car222222
Contributor

car222222 commented Jul 10, 2023

In the case of replace, it is very likely that the one to be replaced (once) is the last occurrence rather than the first.

@josephwright
Member Author

I'm happy if we want to go with first, but I would then look to adjust other occurrences.

@car222222
Contributor

@josephwright Do you mean other occurrences of "once",
or of "first"?

Note also that in this case we still need to indicate the object affected, i.e., "word", since the term "titlecase" does not obviously have much relationship with words.

@josephwright
Member Author

josephwright commented Jul 10, 2023

> So just to be sure: This implements \text_titlecase_once, which capitalises the first word and leaves the rest alone, and \text_titlecase_all, which capitalises all words?

Correct.

> Does "capitalise" here mean roughly (barring special cases like Dutch "IJ") uppercase first letter and leave the rest alone or does it mean uppercase first letter and lowercase the rest?

Broadly uppercase first, leave rest alone. @gmilde suggested that on balance, when you look across different languages, leaving non-initial characters alone is likely best. (Also, Unicode only describes titlecase in terms of the action at the start of a word, and does not mention the effect on the remainder of the input.)
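
As a rough illustration of the split described here, the following sketch assumes an l3kernel release that already includes this PR (so that \text_titlecase_all:n exists). \demofirst and \demoall are hypothetical wrapper names used only for this example.

```latex
\documentclass{article}

\ExplSyntaxOn
% Hypothetical document-level wrappers around the two functions
\NewDocumentCommand \demofirst { m } { \text_titlecase_first:n {#1} }
\NewDocumentCommand \demoall   { m } { \text_titlecase_all:n {#1} }
\ExplSyntaxOff

\begin{document}
% Only the first letter of the first word is titlecased;
% the mixed-case remainder is left as typed.
\demofirst{lorem iPSUM dolor}

% The first letter of each space-delimited word is titlecased;
% the other letters are again left as typed.
\demoall{lorem iPSUM dolor}
\end{document}
```

Neither call lowercases anything, which is the point of the design: any lowercasing has to be requested separately.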

> biblatex needs a function that capitalises the first word and lowercases all others, as that's our "sentence case". That would no longer be available with this change, right? (Currently that's what \text_titlecase does for us. We use \text_titlecase_first to capitalise the first word in a string and leave the rest alone.)

Current suggestion would retain \text_titlecase:n with unchanged behaviour by applying \text_lowercase:n inside \text_titlecase_once:n. However, that's relatively slow as it's two full passes. If it is overall decided that we need a function (or set of functions) which lowercase everything that's not titlecased, then I think a clearer name would be good. 'capitalise' has already been suggested, doesn't get into 'sentence' questions, etc. So perhaps \text_capitalise_once:n (or first) and \text_capitalise_all:n? (Likely implementation would be a faster version of the 'just apply lowercase first' plan.)

@car222222
Contributor

But "capitalise" still does not convey that it is applied to words!
Can we have something that involves "word", please?

@josephwright
Member Author

@car222222 On `titlecasing', I think the Unicode FAQ are clear it's a word-based property:

> Titlecase takes its name from the case format used when forming a title, in which the initial letter in a word is capitalized and the rest are not. Titlecase is also used in forming a sentence by capitalizing the first word, and for forming proper names. The titlecase mapping in the Unicode Standard is the mapping applied to the initial character in a word.

I'm not sure how one could describe 'capitalisation' in a way that didn't link to words.

@car222222
Contributor

To me, capitalisation means much the same as uppercasing.
I shall look this up!

@car222222
Contributor

capitalize: write or print (a word or letter) in capital letters

But also: to write a letter of the alphabet as a capital, or to write the first letter of a word as a capital

And: To capitalize a word is to make its first letter a capital letter

So again, it seems that getting "word" in there will remove all ambiguities.

@car222222
Contributor

Something completely different, try:

_initialcap_allwords

_initialcap_firstword

@josephwright
Member Author

josephwright commented Jul 11, 2023

@moewew I wonder if it's best for me to set up to internally optimise the \text_...case:n { \text_...case:n { ... } } structure so we don't need another name: I think \text_titlecase_first:n { \text_lowercase:n { <input> } } looks quite clear and usable.

@moewew
Contributor

moewew commented Jul 11, 2023

Hmmm, with my biblatex hat on I mainly care about having a tool to do the job, I don't really care whether I can use a predefined function or have to cook up my own combination of \text_titlecase_first:n and \text_lowercase:n. Of course it would be good if the implementation could be efficient (or at least not actively wasteful).

My personal opinion is that it would be cool to have the option to stack \text_... functions without a significant performance hit, but my fear would be that that might be tricky to pull off if one wants to keep the code clean. I don't know what your approach is in LaTeX3 when it comes to balancing these kinds of things.

@josephwright
Member Author

@moewew I'm thinking of using a marker so text expansion only needs to happen once. If you are happy with a 'two part' approach, I think this looks cleaner: we simplify the naming, etc. Now I need to see if the PR is signed off: as we are at TUG from Thursday p.m., I'll ask the team in person.

@car222222
Contributor

car222222 commented Jul 14, 2023

How can it get Signed Off now?
Are you not still working on the code?

When the code and names are decided, then it will need something to fill this void:

% \begin{documentation}
%
% \end{documentation}

@josephwright
Member Author

> How can it get Signed Off now? Are you not still working on the code?

Code is finished at this point: I am not planning to include the 'skip words' idea at this stage.

> When the code and names are decided, then it will need something to fill this void:
>
> % \begin{documentation}
> %
> % \end{documentation}

The documentation for all of the l3text-... files is in l3text.dtx.

@car222222
Contributor

car222222 commented Jul 14, 2023

So should you remove this empty environment from this file?

Or put therein the reference to l3text.dtx (where I shall now check).

@car222222
Contributor

Here is some further information on use of the term 'titlecase' in and around Unicode (and a little more generally).

@josephwright wrote (elsewhere):

> the Unicode FAQ does use 'titlecase' for the idea of 'capitalising': https://unicode.org/faq/casemap_charprop.html#4

However, according to this FAQ (and my research) the Unicode
documentation in reality uses the term only within these two
specific terms:

  1. the name of a character mapping, “titlecase mapping”, which
    is the character mapping that is used by “Capitalisation”;

  2. the name of a Unicode “general category” of characters:
    Lt (letter, titlecase).

The above referenced FAQ does indeed also allude (without a
reference) to the possible use elsewhere of the term “titlecase”,
but it seems to clearly eschew (==not support) such usage within
Unicode documentation beyond the cases of the two terms listed above.

Elsewhere, the term is used, occasionally and inconsistently, for the somewhat ill-defined idea of transforming “a title” by uppercasing (or, in sophisticated cases, titlecasing) the start of some more or less well-defined selection of the words.
Also occasionally, for a transformation that changes only the initial character(s) of a word or string (and is never applied to normal multi-word text).

I could not find any examples of its use in relation to the capitalisation of all words in multi-word text.

@josephwright
Member Author

Right, rebasing, etc. for a merge (@moewew)

@josephwright josephwright merged commit 632bbc4 into main Oct 13, 2023
6 checks passed
@josephwright josephwright deleted the gh1232-titlecase-2 branch October 13, 2023 12:57
@moewew
Contributor

moewew commented Oct 13, 2023

There are a couple of references to titlecase_once (which is not defined in the current version of expl3 on my machine), but then it is not documented (or implemented?) anywhere. Is that intended?

I don't think I'll manage to get the current GitHub dev version of expl3 running in time, so I'll have to do this blind.

Can you please tell me what

\documentclass{article}

\ExplSyntaxOn
\def\test#1{\text_titlecase_first:n{\text_lowercase:n{#1}}}
\ExplSyntaxOff

\begin{document}
\test{Lorem ipsum Dolor sit Amet A}

\test{lorem ipsum Dolor sit Amet A}
\end{document}

produces with this PR merged?

@josephwright
Member Author

> There are a couple of references to titlecase_once (which is not defined in the current version of expl3 on my machine), but then it is not documented (or implemented?) anywhere. Is that intended?

No - I'll tidy those up - we had a bit of back-and-forth about first vs once.

> I don't think I'll manage to get the current GitHub dev version of expl3 running in time, so I'll have to do this blind.

No pressure - we just did a release, I can sit on this for a while.

> \test{Lorem ipsum Dolor sit Amet A}
> \test{lorem ipsum Dolor sit Amet A}

Both will give Lorem ipsum dolor sit amet a. The overall feeling seemed to be we don't need to lowercase then titlecase in most applications, and that just titlecasing is enough. If that turns out to be problematic, I'll sort something a bit more integrated but where it's clear that first/all is about the 'words' and that the lowercasing is different.

@moewew
Contributor

moewew commented Oct 15, 2023

Can you tell me what

\documentclass{article}
\usepackage{csquotes}

\ExplSyntaxOn
\bool_set_false:N  \l_text_titlecase_check_letter_bool
\def\test#1{%
  \text_titlecase_first:n{\text_lowercase:n{#1}}\par
  \text_titlecase:n{#1}}
\ExplSyntaxOff

\begin{document}
\test{\enquote{lorem ipsum}}

\test{\enquote*{lorem ipsum}}
\end{document}

gives with the PR merged?

I get

“Lorem ipsum”
“Lorem ipsum”
‘l’orem ipsum
‘lorem ipsum’


where the "‘l’orem ipsum" is not what I want. That means that at least in the current version of the kernel the nested \text_ calls are not a proper replacement for what we need. (So I cannot just switch biblatex to using that for good. We actually need some case distinction on how old the kernel is... presuming it works on dev.)

@josephwright
Member Author

@moewew 'lorem ipsum' for both. As the deprecation code already covers \text_titlecase:n, I'm not sure why you'd want \text_titlecase_first:n { \text_lowercase:n { .. } } at all.

@moewew
Contributor

moewew commented Oct 16, 2023

Hmmm.

To be honest, I'm a bit lost here. I just thought \text_titlecase:n was removed and we had to look for an alternative. Can we just keep on using it even though it is marked as deprecated?

@josephwright
Member Author

josephwright commented Oct 16, 2023

@moewew We never remove any functions, although here there are some edge-case changes in behaviour as deprecation goes with shifting to emulation. If you need a function that lowercases first and want that to work before and after the update, and will never give a deprecation warning, something like

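\cs_if_exist:NTF \text_titlecase_all:n
  {
    % New kernel: lowercase explicitly, then titlecase the first word
    \cs_new:Npn \__mypkg_titlecase:n #1
      { \text_titlecase_first:n { \text_lowercase:n {#1} } }
  }
  {
    % Older kernel: \text_titlecase:n already lowercases the remainder
    \cs_new_eq:NN \__mypkg_titlecase:n \text_titlecase:n
  }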

would work.

However, one of the questions I was trying to sort early on is to what extent that is actually required. The suggestion was that lowercasing 'the remainder' is likely not that useful, as the typical pattern is only to worry about the first character (\text_titlecase_first:n) or the start of each word (\text_titlecase_all:n), leaving the other characters unchanged.

@moewew
Contributor

moewew commented Oct 16, 2023

> However, one of the questions I was trying to sort early on is to what extent that is actually required.

We need it to turn titles into sentence case. Entries in the .bib file are supposed to be given in title case and for styles that need it biblatex provides a function to convert the title into sentence case (as does BibTeX). I can't say anything about how useful such a function is in general, but biblatex definitely needs it.

@FrankMittelbach
Member

@moewew what is the biblatex approach currently to handle uppercase letters that are supposed to stay unchanged, say, something like IBM in the title?

@josephwright
Member Author

@moewew Is the main concern the performance of \text_titlecase_first:n { \text_lowercase:n { ... } }? If so, I can address that I think - all that's needed is to shortcut a few internals if that situation is detected.

@moewew
Contributor

moewew commented Oct 16, 2023

@FrankMittelbach The default is to use curly braces thanks to some very clever code by @josephwright and @blefloch that essentially turns {...} into \NoCaseChange{...} (see the long discussion in plk/biblatex#960 and in particular moewew/biblatex@c485f1b) - same as in classical BibTeX (it even gets the BibTeX exceptions broadly right). But users can also set the option bibtexcaseprotection=false, to choose to not have curly braces imply case protection and use \NoCaseChange instead (I prefer this, because the curly braces thing gets messy what with them also being used for arguments, but I doubt that lots of people do this - backwards compatibility with BibTeX is important for many, old habits die hard and the new options are probably not very widely known anyway).
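
To make the case-protection point concrete, here is a minimal sketch of sentence casing with an explicit \NoCaseChange. It assumes a LaTeX kernel recent enough to define \NoCaseChange and to list it among the commands whose argument the l3text case changer skips. \demosentencecase is a hypothetical name, not a biblatex command.

```latex
\documentclass{article}

\ExplSyntaxOn
% Hypothetical helper: lowercase everything, then titlecase the first word
\NewDocumentCommand \demosentencecase { m }
  { \text_titlecase_first:n { \text_lowercase:n {#1} } }
\ExplSyntaxOff

\begin{document}
% The argument of \NoCaseChange is skipped by the case changer,
% so the acronym should survive the lowercasing pass.
\demosentencecase{A History Of \NoCaseChange{IBM} Mainframes}
\end{document}
```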

@josephwright Performance is a secondary concern, though I will admit that I have wondered how much the new two-pass scheme will affect performance. My primary concern is backwards compatibility. We need to retain the documented behaviour of \MakeSentencecase and friends and should ideally also preserve its current behaviour as closely as possible (though I could probably live with small deviations in edge cases - this whole thing is just too complicated).

I don't really care how this is implemented either on the expl3 side or the biblatex side as long as the biblatex code follows best practices so that we can be as sure as one can be that what we do is supported for the foreseeable future, benefits from improvements in expl3, and won't break because of what we did. I thought one was not supposed to use deprecated functions, so I looked for a replacement and thought it would be \text_titlecase_first:n { \text_lowercase:n { ... } }. I noticed that those commands were available even in old kernel versions, which would make things easier for us, because we wouldn't have to branch on expl3 versions. So I tried just replacing \text_titlecase:n with that. Then I found the unfortunate example with csquotes' \enquote* shown above that does not produce a good result with my expl3 kernel, which suggested we can't just switch to \text_titlecase_first:n { \text_lowercase:n { ... } } for good even in older kernels, because that would break backwards compatibility, as \text_titlecase:n does the right thing there. (All tested with L3 programming layer <2023-08-29> without this PR and not with the dev version.)

If using the deprecated function will not change behaviour significantly, will work in the foreseeable future and will not generate warnings that users could complain about, I'm perfectly fine doing that. If using deprecated functions is generally frowned upon or may cause warnings (e.g. with stricter expl3 code checking settings) or could cause issues in the future, I'd rather not go down that route, because sooner or later someone will complain.

@josephwright
Member Author

josephwright commented Oct 17, 2023

@moewew I think I can solve the \enquote* issue. There's an oversight in the current code if we 'break' with exactly one brace group left. Luckily, that can be patched such that it will be repaired for both the old and new approaches, and in a way you could simply add to biblatex as a 'hotfix' for a transition period:

\cs_gset:Npn \__text_change_case_break:w #1 \q__text_recursion_stop
  {
    \__text_change_case_break_aux:w ? #1
  }
\cs_gset:Npn \__text_change_case_break_aux:w #1 \q__text_recursion_tail
  {
    \__text_change_case_store:o { \use_none:n #1 }
    \__text_change_case_end:w
  }

With that, you should be able to use \text_titlecase_first:n { \text_lowercase:n { ... } } in an update. Would that be acceptable?

The issue I've been having is that 'titlecase' really means just changing the first char, and depending on the exact usage it may or may not be expected to lowercase the remainder of the input. I really would rather split the two concepts.

If there is a performance issue, I can see a way to do a lookahead and shortcut some of the code that would be repetitive.

@Skillmon
Contributor

@josephwright random optimisation note: Use \prg_do_nothing: instead of ? to protect against accidental brace-stripping, and just \__text_change_case_store:o {#1} without the \use_none:n.
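
Read literally, that suggestion would turn the hotfix into something like the sketch below. This is an interpretation of the note, not tested code from the thread; the internal function names are copied from the patch above.

```latex
\ExplSyntaxOn
\cs_gset:Npn \__text_change_case_break:w #1 \q__text_recursion_stop
  {
    % \prg_do_nothing: keeps a lone brace group in #1 from being
    % stripped when the auxiliary grabs its delimited argument
    \__text_change_case_break_aux:w \prg_do_nothing: #1
  }
\cs_gset:Npn \__text_change_case_break_aux:w #1 \q__text_recursion_tail
  {
    % The o-type expansion removes the leading \prg_do_nothing:
    \__text_change_case_store:o {#1}
    \__text_change_case_end:w
  }
\ExplSyntaxOff
```

Because \prg_do_nothing: expands to nothing in a single step, the o-type argument delivers the original material with its braces intact, so no separate \use_none:n is needed.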

@moewew
Contributor

moewew commented Oct 18, 2023

> With that, you should be able to use \text_titlecase_first:n { \text_lowercase:n { ... } } in an update. Would that be acceptable?

Sure. As I said, I don't really care what exact code we use, all I care about is that it does what we need to do and we use supported code/follow best practices. From what I gather so far this should give us what we need. (I don't have time to test this at the moment. Hopefully over the weekend...)

> The issue I've been having is that 'titlecase' really means just changing the first char, and depending on the exact usage it may or may not be expected to lowercase the remainder of the input. I really would rather split the two concepts.

Fair enough. It felt a bit odd to me to use a macro with titlecase in its name to effectively implement sentence casing (in a way the exact opposite of title case), so I can see where this whole discussion is coming from (as I said in #1232).

> If there is a performance issue, I can see a way to do a lookahead and shortcut some of the code that would be repetitive.

Performance improvements would be cool, but I have no idea if this would really impact users significantly. (After all, biblatex already slows down things for other reasons.)

@josephwright
Member Author

@moewew OK, I will hold off from a release until at least the start of next week - probably will do one soon-ish as this is a non-trivial change.

@moewew
Contributor

moewew commented Oct 18, 2023

If a new biblatex release is required to avoid things breaking, we will probably need more time, because we also need to release Biber (and that means waiting for the binaries to be built by other contributors), but maybe we don't actually need one, because \text_titlecase:n is not removed... (I guess I'll see if that's the case on Sunday.)

@josephwright
Member Author

@moewew Ah, right, yes: I'll still wait to make sure there's nothing unexpected

moewew added a commit to moewew/biblatex that referenced this pull request Oct 23, 2023
we no longer use \text_titlecase:n for sentence casing.

See <latex3/latex3#1247>.
@moewew
Contributor

moewew commented Oct 23, 2023

Prepared plk/biblatex#1310 for biblatex. As far as I understood, we do not need to merge and release this immediately after the expl3 update because you have backwards compatibility code in place to keep \text_titlecase:n around. Is that correct? Then I would wait for the expl3 update to arrive on my machine, run a few more tests and see if anything changed.

@josephwright
Member Author

@moewew Sounds good: I will do a release today

Successfully merging this pull request may close these issues.

Applied to a title, \MakeTitlecase does not what its name suggests.
7 participants