Option to transliterate paths #9134

jmooring · 2021-11-05T18:29:18Z

Background

You can configure Hugo to remove non-spacing marks from composite characters in content paths by enabling removePathAccents in the site configuration.

content/áéíñóú.md --> https://example.org/aeinou/
content/ÄÖÜäöüß.md --> https://example.org/aouaou%C3%9F/
content/çđħłƚŧ.md --> https://example.org/c%C4%91%C4%A7%C5%82%C6%9A%C5%A7/

Removing the non-spacing marks has the desired effect in the first example, but it:

Has no effect on non-composite characters (e.g., ß, ł, Ł)
Is not language aware (e.g., for German, ä should become ae)

This issue has been raised a few times on the forum, and stale bot has closed three related issues that continue to receive comments:

Also:

Proposal

Provide an option to convert path characters from Unicode to ASCII, commonly called "Transliteration."

For a site with English (en) as the default content language:

content/áéíñóú.md --> https://example.org/aeinou/
content/ÄÖÜäöüß.md --> https://example.org/AOUaouss/
content/çđħłƚŧ.md --> https://example.org/cdhllt/

For a site with German (de) as the default content language:

content/ÄÖÜäöüß.md --> https://example.org/AeOeUeaeoeuess/

Include a related template function so that you can access term pages:

{{ with site.GetPage (path.Join "tags" ("çđħłƚŧ äöü" | transliterate | anchorize)) }}
  <a href="{{ .RelPermalink }}">{{ .LinkTitle }}</a>
{{ end }}


github-actions · 2022-11-13T02:15:28Z

This issue has been automatically marked as stale because it has not had recent activity. The resources of the Hugo team are limited, and so we are asking for your help.
If this is a bug and you can still reproduce this error on the master branch, please reply with all of the information you have about it in order to keep the issue open.
If this is a feature request, and you feel that it is still relevant and valuable, please tell us why.
This issue will automatically be closed in the near future if no further activity occurs. Thank you for all your contributions.

istr · 2022-11-21T21:00:35Z

Just a ping to keep this open. I still find this very useful and the implementation in #9135 looks sensible. There @bep stated:

I agree that we probably need something like this, but it needs to wait.


bep · 2024-02-05T17:47:16Z

I have read this issue.

Arrest me if I'm wrong, but with how Hugo treats content paths, this issue is a cosmetic issue about end URLs. The example above:

{{ with site.GetPage (path.Join "tags" ("çđħłƚŧ äöü" | transliterate | anchorize)) }}

Should now work fine as

{{ with site.GetPage "/tags/çđħłƚŧ äöü" }}

I don't think the first example would work at all.

I understand that people would want pretty URLs (that's what I use slug for), so that leaves:

Has no effect on non-composite characters (e.g., ß, ł, Ł)
Is not language aware (e.g., for German, ä should become ae)

I have not seen a (relatively) complete language-aware transliteration library; the one used in the referenced PR failed to transliterate my name in my language (which is also the language of this year's Nobel Prize winner in literature).

I think this would be easier to fix if we drop the second point above. Then we could also possibly get away using the existing setting.

jmooring · 2024-02-05T18:40:10Z

That's great! With v0.123.0-DEV we can now make the round trip when removePathAccents is true, so a template function to go the other direction is not required.

If we take language-specific behavior out of the equation, all we need is something that does the equivalent of:

iconv -f utf-8 -t ascii//TRANSLIT <<< "áéíñóú çđħłƚŧ ÄÖÜäöüß"   # aeinou cdhllt AOUaouss

And that could certainly live under the existing setting.

istr · 2024-02-05T19:04:12Z

@bep

I have not seen a (relatively) complete language-aware transliteration library; the one used in the referenced PR failed to transliterate my name in my language (which is also the language of this year's Nobel Prize winner in literature).

I think this would be easier to fix if we drop the second point above. Then we could also possibly get away using the existing setting.

Yes, this is why I suggested a different approach some time ago (let the user choose the mapping and make it a config entry).
#3476 (comment)

The proposal was based on this blog post with an idea of how it could be implemented.
https://www.von-laufenberg.de/blog/it/hugo-umlaute/

The requirements for transliteration can vary widely from use case to use case, so I think it would still be best not to rely on a (hardcoded) library, but to provide a versatile configurable mapping and maybe provide some sensible internal defaults or maybe even only in the documentation of the feature.

istr · 2024-02-05T19:39:37Z

I understand that people would want pretty URLs (that's what I use slug for), so that leaves:

This is a bit more than what slug can currently do. People want to generate search engine friendly URLs with a common transliteration for all generated URLs, including taxonomy and tags. I still don't see how this can be done with slug or removePathAccents alone.

I think this issue, and all the other related issues in the description, are specifically about transliterating/mangling the final URL with a mapping function, either using a hard-coded common transliteration or using a generic mapping. So the focus of all these questions is on the second point above (having a working mapping). Once you have that, you can easily work around the first problem (a very specific mapping function does not cover all expected cases).

A generic filtering/mapping function that could be hooked into the final stage of URL generation would be sufficient to handle all these use cases.

Although it might be technically easier to fix only half of them, it would not solve the problem or address the intended use cases.

So, in my opinion the most important part of the equation would be

If we take language-specific behavior out of the equation, all we need is something that does the equivalent of:

iconv -f utf-8 -t ascii//TRANSLIT <<< "áéíñóú çđħłƚŧ ÄÖÜäöüß" # aeinou cdhllt AOUaouss
And that could certainly live under the existing setting.

which renders correctly with iconv, given you use the locale that contains the target language:

env | grep LC ; iconv -f utf-8 -t ascii//TRANSLIT <<< "áéíñóú çđħłƚŧ ÄÖÜäöüß"
LC_CTYPE=de_DE.UTF-8
aeinou cdhllt AEOEUEaeoeuess

(note that iconv creates Ä -> AE, Ö -> OE ... in that case).

LC_CTYPE=nn_NO.UTF-8 iconv -f utf-8 -t ascii//TRANSLIT <<< "Bjørn Erik Pedersen"
Bjoern Erik Pedersen

LC_CTYPE=nb_NO.UTF-8 iconv -f utf-8 -t ascii//TRANSLIT <<< "Bjørn Erik Pedersen"
Bjoern Erik Pedersen

(note that it has the expected output as per #11246 (comment), both for Nynorsk and Bokmål, but I did not expect any differences between the two, to be honest)

bep · 2024-02-05T19:50:13Z

People want to generate search engine friendly URLs with a common transliteration for all generated URLs, including taxonomy and tags.

Are you sure search engines care about perfect transliteration? I suspect Google happily reads this:

https://example.org/c%C4%91%C4%A7%C5%82%C6%9A%C5%A7/

Which I guess is what we have today.
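
For reference, the percent-encoded form above is just the UTF-8 bytes of the path segment, escaped. Here is a minimal sketch using Go's standard library; the helper names encodeSegment and decodeSegment are made up for illustration, and this is not necessarily the code path Hugo itself uses:

```go
package main

import (
	"fmt"
	"net/url"
)

// encodeSegment percent-encodes a single URL path segment.
// Hypothetical helper for illustration only.
func encodeSegment(s string) string {
	return url.PathEscape(s)
}

// decodeSegment reverses encodeSegment.
func decodeSegment(s string) (string, error) {
	return url.PathUnescape(s)
}

func main() {
	// "cđħłƚŧ" is the segment from the thread after accent removal (ç -> c).
	escaped := encodeSegment("cđħłƚŧ")
	fmt.Println(escaped) // c%C4%91%C4%A7%C5%82%C6%9A%C5%A7

	// Decoding restores the original characters, so nothing is lost.
	decoded, err := decodeSegment(escaped)
	if err != nil {
		panic(err)
	}
	fmt.Println(decoded) // cđħłƚŧ
}
```

The encoding is lossless and machine-readable; whether it is human-friendly is a separate question.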

istr · 2024-02-05T20:05:23Z

Perhaps (hopefully) the use case for transliterating path segments or URLs is obsolete, or will be soon.

However, the number of comments on all these issues and the activity on the forum around this topic suggests otherwise.
Random examples: https://discourse.gohugo.io/t/cyrillic-aware-slugify-function/27578/6, https://discourse.gohugo.io/t/replace-characters/43327/4 and linked threads.

At least as a human, I can (sort of) read https://example.org/cdhllt, but not (yet) https://example.org/c%C4%91%C4%A7%C5%82%C6%9A%C5%A7/. If it is printed somewhere, I have no problem typing cdhllt, a very hard time typing the second form, and highly doubt I would get çđħłƚŧ right. So there are still valid use cases for it.

However, it might also be an option to actively discourage users from using transliteration and point them to full UTF-8 support.

bep · 2024-02-05T20:22:57Z

OK, I have 2 concerns here:

Maintenance: I don't want to maintain another language package or have to answer questions about "why doesn't this transliterate correctly in language x".
Speed.

The package we currently use to remove accents has an API like below:

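package main

import (
	"fmt"

	"golang.org/x/text/runes"
	"golang.org/x/text/transform"
	"golang.org/x/text/unicode/norm"
)

func main() {
	// Decompose (NFD), map individual runes, then recompose (NFC).
	// Note: after NFD, only runes without a canonical decomposition
	// (here 'ł' and 'ø') still reach the switch in precomposed form.
	chain := transform.Chain(
		norm.NFD,
		runes.Map(func(r rune) rune {
			switch r {
			case 'ą':
				return 'a'
			case 'ć':
				return 'c'
			case 'ę':
				return 'e'
			case 'ł':
				return 'l'
			case 'ń':
				return 'n'
			case 'ó':
				return 'o'
			case 'ś':
				return 's'
			case 'ż':
				return 'z'
			case 'ź':
				return 'z'
			case 'ø':
				return 'o'
			}
			return r
		}),
		norm.NFC,
	)
	s, _, _ := transform.String(chain, "Bjørn Erik Pedersen")
	fmt.Println(s) // Bjorn Erik Pedersen; works for me.
}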

If we accept that the transliteration is a simple rune -> rune we could probably

Create a sensible default set.
Add an option to add (or) replace this per language. But I think we need to somehow avoid doing map lookups.

This is me thinking out loud.

jmooring · 2024-02-05T20:42:42Z

I was curious if there's a CLDR table...

Disabled temporarily. And there's probably a good reason for that.

istr · 2024-02-05T21:33:26Z

OK, I have 2 concerns here:

Maintenance: I don't want to maintain another language package or have to answer questions about "why doesn't this transliterate correctly in language x".

I agree, which is why I would go with a more generic option, see my other comments.

Speed.

I agree as well. This is one of Hugo's biggest USPs, so it is better not to sacrifice it for features.

The package we currently use to remove accents has an API like below:

func main() {
	chain := transform.Chain(
		norm.NFD,
		runes.Map(func(r rune) rune {
			switch r {
			case 'ą':
				return 'a'
			case 'ć':
				return 'c'
			case 'ę':
				return 'e'
			case 'ł':
				return 'l'
			case 'ń':
				return 'n'
			case 'ó':
				return 'o'
			case 'ś':
				return 's'
			case 'ż':
				return 'z'
			case 'ź':
				return 'z'
			case 'ø':
				return 'o'
			}
			return r
		}),
		norm.NFC,
	)
	s, _, _ := transform.String(chain, "Bjørn Erik Pedersen")
	fmt.Println(s) // Works for me.
}

If we accept that the transliteration is a simple rune -> rune we could probably

Create a sensible default set.
Add an option to add (or) replace this per language. But I think we need to somehow avoid doing map lookups.

This is me thinking out loud.

From my point of view, it would be sufficient to simply expose the mapping in this function to config.
So just replace the hardcoded switch statement with a configurable mapping that is configurable per target language.

Everything else could be left to the user, so they clearly know that it is up to them to provide the mapping they need.

Personally, I would bet a lot on the claim that a simple per-language configurable rune -> rune mapping will do the trick and make a lot of users happy.

EDIT: note, however, that the target rune would need to be multi-character to support ä - ae (German) and ж - zh (Cyrillic) use cases and that the source rune would need to be multi-character to support both forms of UTF-8 accent rendering use cases.
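
To illustrate the EDIT above: once both source and target can be multi-character, the mapping is really string -> string. Here is a minimal sketch using only Go's standard library. The replacers table and the transliterate function are hypothetical; Hugo has no such setting, and in practice the pairs would come from site configuration:

```go
package main

import (
	"fmt"
	"strings"
)

// replacers is a hypothetical per-language table of source/target pairs.
// Sources may be precomposed ("ä") or decomposed ("a\u0308"), and targets
// may be longer than one character ("ae", "zh").
var replacers = map[string]*strings.Replacer{
	"de": strings.NewReplacer(
		"Ä", "Ae", "Ö", "Oe", "Ü", "Ue",
		"ä", "ae", "ö", "oe", "ü", "ue", "ß", "ss",
		"a\u0308", "ae", "o\u0308", "oe", "u\u0308", "ue",
	),
	"ru": strings.NewReplacer("ж", "zh", "Ж", "Zh"),
}

// transliterate applies the mapping for lang, or returns s unchanged
// when no mapping is configured for that language.
func transliterate(lang, s string) string {
	r, ok := replacers[lang]
	if !ok {
		return s
	}
	return r.Replace(s)
}

func main() {
	fmt.Println(transliterate("de", "ÄÖÜäöüß")) // AeOeUeaeoeuess
	fmt.Println(transliterate("de", "a\u0308")) // ae
	fmt.Println(transliterate("en", "ÄÖÜäöüß")) // ÄÖÜäöüß (no mapping)
}
```

Building each strings.Replacer once and reusing it keeps the per-path cost low, which matters given the speed concern raised earlier.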

istr · 2024-02-05T21:45:26Z

But I think we need to somehow avoid doing map lookups.

Are you sure this would make a performance difference to the hard-coded switch statement?
Hopefully the Go implementation is close to O(1) for maps of this (presumably small) size.

bep · 2024-02-06T08:26:06Z

@istr no, you are right, for our use case, they seem to perform exactly the same: #11998
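
For anyone who wants to check this locally, here is a rough sketch of such a comparison. It is not the benchmark from #11998; the function names and the input string are made up, and the numbers will vary by machine:

```go
package main

import (
	"fmt"
	"testing"
)

// table holds the same rune -> rune pairs as the switch below.
var table = map[rune]rune{
	'ą': 'a', 'ć': 'c', 'ę': 'e', 'ł': 'l', 'ń': 'n',
	'ó': 'o', 'ś': 's', 'ż': 'z', 'ź': 'z', 'ø': 'o',
}

// mapLookup maps r through the table, returning r itself when absent.
func mapLookup(r rune) rune {
	if v, ok := table[r]; ok {
		return v
	}
	return r
}

// switchLookup is the hard-coded equivalent of mapLookup.
func switchLookup(r rune) rune {
	switch r {
	case 'ą':
		return 'a'
	case 'ć':
		return 'c'
	case 'ę':
		return 'e'
	case 'ł':
		return 'l'
	case 'ń':
		return 'n'
	case 'ó':
		return 'o'
	case 'ś':
		return 's'
	case 'ż', 'ź':
		return 'z'
	case 'ø':
		return 'o'
	}
	return r
}

// sink keeps the compiler from optimizing the calls away.
var sink rune

// bench times f over a short, mostly-ASCII input.
func bench(f func(rune) rune) testing.BenchmarkResult {
	input := []rune("Bjørn Erik Pedersen, Łódź, żółć")
	return testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			for _, r := range input {
				sink = f(r)
			}
		}
	})
}

func main() {
	fmt.Println("switch:", bench(switchLookup))
	fmt.Println("map:   ", bench(mapLookup))
}
```

testing.Benchmark lets this run as a plain program without go test; it prints iterations and ns/op for each variant.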

bep · 2024-02-06T10:33:58Z

OK, I have searched a little more around, and my current take on this is:

There is no great transliteration Go library available, nothing that resembles a "standard way" of doing this. We don't have the resources to reinvent the wheel.
We have removePathAccents with a well defined behaviour, so we cannot add "different things" to that without breaking things.
I suggest that we add a paths config struct or something where we can put this.
We can name this new option something other than "transliterateSomething" so we can have an opening for doing better in future.
I think it should be possible to use the same API to create a default that matches most (e.g. ø => o) common cases.
But I don't think it should be too hard to allow custom mappings (per language).
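
The last two points (a default set plus custom per-language mappings) could be layered roughly like this. This is a sketch under assumptions: defaults, overrides, tableFor, and apply are hypothetical names, and the table contents are examples rather than a proposed default set:

```go
package main

import (
	"fmt"
	"strings"
)

// defaults is a hypothetical language-neutral set covering common cases.
var defaults = map[rune]string{
	'ø': "o", 'ł': "l", 'ß': "ss", 'ä': "a", 'ö': "o", 'ü': "u",
}

// overrides is a hypothetical per-language layer that adds to or
// replaces entries in defaults.
var overrides = map[string]map[rune]string{
	"de": {'ä': "ae", 'ö': "oe", 'ü': "ue"},
	"nb": {'ø': "oe"},
}

// tableFor merges defaults with the overrides for lang, if any.
func tableFor(lang string) map[rune]string {
	merged := make(map[rune]string, len(defaults))
	for r, s := range defaults {
		merged[r] = s
	}
	for r, s := range overrides[lang] {
		merged[r] = s
	}
	return merged
}

// apply replaces each rune found in table and copies the rest as-is.
func apply(table map[rune]string, s string) string {
	var b strings.Builder
	for _, r := range s {
		if repl, ok := table[r]; ok {
			b.WriteString(repl)
		} else {
			b.WriteRune(r)
		}
	}
	return b.String()
}

func main() {
	fmt.Println(apply(tableFor("en"), "Bjørn")) // Bjorn
	fmt.Println(apply(tableFor("nb"), "Bjørn")) // Bjoern
	fmt.Println(apply(tableFor("de"), "Käse"))  // Kaese
}
```

Merging once per language keeps the per-rune work to a single map lookup, and a site that defines no overrides simply gets the defaults.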

jmooring · 2024-02-06T19:51:19Z

TLDR: I recommend deferring this indefinitely pending demand.

This started with the addition of the removePathAccents option, motivated (as far as I can tell) by desire for URL compatibility with other systems (e.g., migrating from Jekyll, Drupal, etc.).

But then you couldn't get to the term page with any of these:

{{ with "áéíñóú" }}
  {{ (site.Taxonomies.tags.Get .).Page.RelPermalink }}
  {{ (index site.Taxonomies.tags .).Page.RelPermalink }}
  {{ (site.GetPage (printf "/tags/%s" .)).RelPermalink }}
{{ end }}

And that generated some noise, in the Academic/Wowchemy/HugoBlox world in particular, despite the introduction of .Page.GetTerms a few years later, which covers the majority of the use cases. I still see themes doing it the hard way instead of using .Page.GetTerms.

The inability to get back to the term page was the primary driver for creating this issue, and it is no longer relevant as of v0.123.0.

And then came the desire to have "accents" removed from non-composite characters, which is impossible, because they are not composite characters. I'm not sure if this desire was driven by compatibility requirements, aesthetic preference, or just a lack of understanding (e.g., "It's broken. It's not removing my accents.").

So that means transliteration. But as soon as you open that box, it needs to be language specific.

In my view there is insufficient "compatibility" or "aesthetic" demand to pursue this at the moment. The changes in v0.123.0 solved the initial problem, and actually solved another one in this area as well... all three work great:

{{ with "tag c" }}
  {{ (site.Taxonomies.tags.Get .).Page.RelPermalink }}
  {{ (index site.Taxonomies.tags .).Page.RelPermalink }}
  {{ (site.GetPage (printf "/tags/%s" .)).RelPermalink }}
{{ end }}

idarek · 2024-02-06T20:37:30Z

As @jmooring mentioned: "deferring this indefinitely pending demand."

My case was #7542 but I am not crying about it. I learned to live with it. There are other, more demanding things, that I think are worth spending more time on than this. Unless something simple is found out, I agree that deferring this will be the best approach. There are some good ideas in this issue, but it's all about how much time is allowed to be spent on that compared to the needs of users (me included).

jmooring · 2024-02-18T15:58:32Z

As a data point related to aesthetically pleasing URLs, Wikipedia doesn't seem to consider this important.

In the browser's address bar you see this:

https://de.wikipedia.org/wiki/Straußwirtschaft

When you copy the URL and paste it into an email (for example):

https://de.wikipedia.org/wiki/Strau%C3%9Fwirtschaft

Hugo's current behavior is identical. I'm inclined to remove "aesthetically pleasing URLs" as a reason to pursue this, leaving only compatibility with other systems that transliterate (e.g., Drupal, where transliteration is disabled by default).

istr · 2024-02-18T18:42:21Z

@jmooring Nice Wikipedia entry, I love to go to a Straußwirtschaft (aka Besenwirtschaft) in late summer.

I don't follow your argument here though. "We don't have a use case because {fill in any big Internet player here} doesn't care" is not a plausible argument. In fact, it is a fallacy (ad populum). Using the same fallacy, I could argue the opposite: transliteration is standardised, so we have a use case. See https://en.wikipedia.org/wiki/List_of_ISO_romanizations. Or that even the Serbian government provides a transliteration of its website (https://www.srbija.gov.rs/, select "Latinica").

I would consider both arguments invalid because they ignore the context. We have a use case for transliteration only because many Hugo users (including myself), for various reasons and repeatedly over a long period of time, seem to have a use case that is expressed in several GitHub issues and forum posts.

One could argue that "aesthetically pleasing URLs" is not a valid use case to begin with. But there are many other valid use cases, such as the (common) Cyrillic romanisation use case mentioned above, which was raised by a real Hugo user in a forum post.

jmooring · 2024-02-18T18:57:52Z

Not an argument, just a...

data point

jmooring added the Proposal label Nov 5, 2021

jmooring added a commit to jmooring/hugo that referenced this issue Nov 5, 2021

Add option to transliterate paths

651e63d

Closes gohugoio#9134

jmooring mentioned this issue Nov 5, 2021

Add option to transliterate paths #9135

Closed

jmooring added a commit to jmooring/hugo that referenced this issue Nov 5, 2021

Add option to transliterate paths

b1e769d

Closes gohugoio#9134

jmooring added a commit to jmooring/hugo that referenced this issue Nov 5, 2021

Add option to transliterate paths

dd46525

Closes gohugoio#9134

jmooring added a commit to jmooring/hugo that referenced this issue Dec 17, 2021

Add option to transliterate paths

1a0a7c6

Closes gohugoio#9134

jmooring added a commit to jmooring/hugo that referenced this issue Apr 12, 2022

Add option to transliterate paths

6805a58

Closes gohugoio#9134

github-actions bot added the Stale label Nov 13, 2022

github-actions bot removed the Stale label Nov 22, 2022

jmooring mentioned this issue Feb 11, 2023

removePathAccents not work for some characters #10712

Closed

jmooring mentioned this issue May 14, 2023

preserveTaxonomyNames FALSE not changing single Polish characters ( Ł ł ) #7542

Closed

jmooring added a commit to jmooring/hugo that referenced this issue Jul 12, 2023

helpers: Add option to transliterate content paths

f3a8e1d

Closes gohugoio#9134

jmooring added a commit to jmooring/hugo that referenced this issue Jul 12, 2023

helpers: Add option to transliterate content paths

84ee31c

Closes gohugoio#9134

jmooring added a commit to jmooring/hugo that referenced this issue Jul 12, 2023

helpers: Add option to transliterate content paths

cec3b0e

Closes gohugoio#9134

jmooring added a commit to jmooring/hugo that referenced this issue Jul 12, 2023

helpers: Add option to transliterate content paths

59d8376

Closes gohugoio#9134

jmooring added a commit to jmooring/hugo that referenced this issue Jul 12, 2023

helpers: Add option to transliterate content paths

3a38b6e

Closes gohugoio#9134

jmooring added a commit to jmooring/hugo that referenced this issue Jul 12, 2023

helpers: Add option to transliterate content paths

264f98a

Closes gohugoio#9134

jmooring added a commit to jmooring/hugo that referenced this issue Jul 12, 2023

helpers: Add option to transliterate content paths

491a8a1

Closes gohugoio#9134

jmooring added a commit to jmooring/hugo that referenced this issue Jul 12, 2023

helpers: Add option to transliterate content paths

5cca6c7

Closes gohugoio#9134

jmooring added a commit to jmooring/hugo that referenced this issue Jul 12, 2023

helpers: Add option to transliterate content paths

efb6489

Closes gohugoio#9134

jmooring mentioned this issue Jul 13, 2023

helpers: Add option to transliterate content paths #11246

Closed

jmooring added a commit to jmooring/hugo that referenced this issue Sep 1, 2023

helpers: Add option to transliterate content paths

0e23100

Closes gohugoio#9134

jmooring mentioned this issue Oct 6, 2023

removePathAccents not work for some characters đ #11534

Closed

jmooring added a commit to jmooring/hugo that referenced this issue Oct 6, 2023

helpers: Add option to transliterate content paths

106cd77

Closes gohugoio#9134

jmooring self-assigned this Oct 8, 2023

jmooring added a commit to jmooring/hugo that referenced this issue Dec 4, 2023

helpers: Add option to transliterate content paths

809cd15

Closes gohugoio#9134

bep added this to the v0.124.0 milestone Feb 6, 2024

bep modified the milestones: v0.124.0, v0.125.0 Mar 4, 2024

jmooring removed their assignment Apr 22, 2024
