Special characters in taxonomy and slugs #1180

Closed
nicolinuxfr opened this Issue May 29, 2015 · 23 comments


nicolinuxfr commented May 29, 2015

I'm trying Hugo for my personal blog, which has a lot of taxonomies. Since I write in French, many taxonomy terms contain special characters, such as accented letters.

Right now, I'm using WordPress, which handles this perfectly. The taxonomy name can contain any special characters (for example, "Gérard Depardieu"), while the associated slug contains only standard characters (gerard-depardieu). But when you display the taxonomy archive, you still get the special characters: in this case, not "Gerard Depardieu" but "Gérard Depardieu". You can see a live example here: http://voiretmanger.fr/acteur/gerard-depardieu/.

[Image: taxonomy-example]

I don't know if Hugo could do the same. I know WordPress has a database, so it's easier there. But I can see some solutions (or hacks) to make it work: either look in the metadata associated with the post to display the taxonomy name on the archive page, or keep a "table" (a YAML/TOML config file, I guess) with all the correspondences.

An idea, to end my Gérard Depardieu example :

gerard-depardieu: "Gérard Depardieu"

I hope a solution is feasible, because it's the main thing keeping me on WordPress and away from Hugo. I think I can find a solution for every other problem I have…

Thanks anyway for your time!

bep added the Enhancement label May 29, 2015

Member

bep commented May 29, 2015

This isn't hard to fix and I understand the motivation for it.

We already do some URL normalization of the taxonomies, but probably didn't think about monsieur Depardieu back then. This might be a breaking change (as someone will have some URLs changed), but it's the right thing to do.

bep added this to the v0.15 milestone May 29, 2015

bep self-assigned this May 29, 2015

bep added a commit to bep/hugo that referenced this issue May 29, 2015

Remove accents in URLs
So gerard-depardieu not gérard-depardieu etc.

Fixes #1180

nicolinuxfr commented May 29, 2015

OK, great for the first and easy part! Thanks :-)

There's one more problem though: I don't want the accent in the URL, but I do want it on the archive page (like in the screenshot).

With nothing more, I don't see how it could work. Am I wrong?

Member

bep commented May 29, 2015

You are wrong. The accents (and some others) are ONLY stripped for the paths (on disk and the URL presented to the user). The taxonomy name will be preserved as written.

I added the "Gérard Depardieu" tag to one of my posts to make sure. It has nothing to do with the actor, but I might publish it just to confuse people.

Member

bep commented May 29, 2015

OK, I retract the last thing I said above -- there is one more fix to do, will check on that tomorrow.

Member

bep commented May 29, 2015

I can get this to work in a hackish-kind-of way, but will have to look at this later -- to do a proper fix.

bep removed their assignment May 30, 2015

nicolinuxfr commented May 30, 2015

Well, thanks! I'm impressed, we are definitely not on the WordPress pace here… :-)

bep added a commit to bep/hugo that referenced this issue May 30, 2015

Remove accents in URLs
So the taxonomy `Gérard Depardieu` gives paths on the form `gerard-depardieu`.

Unfortunately this introduces two imports from the `golang.org/`, but Unicode-normalization isn't something we'd want to write from scratch.

See https://blog.golang.org/normalization

See #1180
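
Editor's sketch: the commit message points at Unicode normalization, so here is a minimal Python illustration of the general decompose, filter, recompose idea. It is not Hugo's code (Hugo is written in Go), and the function name `remove_accents` is made up for this example; whether Hugo follows exactly these steps is an assumption.

```python
import unicodedata


def remove_accents(text: str) -> str:
    """Strip nonspacing combining marks (Unicode category Mn) from text."""
    # NFD splits "é" into "e" followed by U+0301 (combining acute accent).
    decomposed = unicodedata.normalize("NFD", text)
    # Keep the base characters and drop every nonspacing mark.
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # Recompose whatever is left into NFC form.
    return unicodedata.normalize("NFC", stripped)


print(remove_accents("Gérard Depardieu"))
```

Lower-casing and hyphenating the result would then give a path of the form `gerard-depardieu`.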

bep added a commit to bep/hugo that referenced this issue May 30, 2015

Add PreserveTaxonomyNames flag
Before this commit, taxonomy names were hyphenated, lower-cased and normalized -- then fixed and titleized on the archive page.

So what you entered in the front matter isn't necessarily what you got in the final site.

To preserve backwards compatibility, `PreserveTaxonomyNames` is default `false`.

Setting it to `true` will preserve what you type (the first character is upper-cased for titles), but normalized in URLs.

This also means that, if you manually construct URLs to the archive pages, you will have to pass the Taxonomy names through the `urlize` func.

Fixes #1180
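
Editor's sketch: the split described in this commit message (the name kept as typed for display, a normalized form used for the URL) could look roughly like the Python below. `path_sketch`, `display_title` and `build_index` are hypothetical names for this illustration only; in particular `path_sketch` is a stand-in, not Hugo's `urlize`, and it assumes accents are removed from paths.

```python
import unicodedata


def path_sketch(name: str) -> str:
    """Hypothetical urlize-style helper: strip accents, lower-case, hyphenate."""
    decomposed = unicodedata.normalize("NFD", name)
    no_marks = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # split() with no arguments also collapses repeated whitespace.
    return "-".join(no_marks.lower().split())


def display_title(name: str) -> str:
    """Keep the name as typed, upper-casing only the first character."""
    return name[:1].upper() + name[1:]


def build_index(names):
    """Map each normalized path to the title shown on the archive page."""
    return {path_sketch(n): display_title(n) for n in names}


index = build_index(["Gérard Depardieu", "cinéma français"])
for path, title in index.items():
    print(path, "->", title)
```

This is essentially the lookup "table" suggested in the opening post, built from the names in the front matter rather than maintained by hand.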

bep added a commit that referenced this issue May 31, 2015

Remove accents in URLs
So the taxonomy `Gérard Depardieu` gives paths on the form `gerard-depardieu`.

Unfortunately this introduces two imports from the `golang.org/`, but Unicode-normalization isn't something we'd want to write from scratch.

See https://blog.golang.org/normalization

See #1180

bep closed this in be38acd May 31, 2015

nicolinuxfr commented Jun 1, 2015

Just a quick note to thank bep for his work… it works exactly as I wanted! So it's perfect as far as I am concerned. :-)

[Image: screenshot, 2015-06-01 at 17:44:13]

Contributor

RickCogley commented Jun 16, 2015

Interesting @bep, because when you "normalize" Japanese, and remove the "accent" from katakana, the meaning changes completely. In some cases it's unrecognizable or at least quite humorous.

Member

bep commented Jun 16, 2015

OK, so that part may have been a bad idea ... I can revert that if I'm convinced ... Hmm, languages. @nicolinuxfr

bep reopened this Jun 16, 2015

nicolinuxfr commented Jun 16, 2015

Hmm, it's not a bad idea for me anyway. I hope I'll be able to keep this feature, which is really important for me.

Contributor

RickCogley commented Jun 16, 2015

@bep, I can give you precise information about which characters in Japanese are losing their "accent" if that will help.

For instance:

ビ  going to ヒ

Please advise how I can assist in figuring it out.

Member

bep commented Jun 16, 2015

@nicolinuxfr yes, that was the input I wanted (how important it is). @RickCogley I think the solution is to add an option around this, defaulting to the old behaviour.

I will fix this later tonight. BTW: This is just about the URLs/file paths.

nicolinuxfr commented Jun 16, 2015

He he, it's not that we love these, but the meaning is completely different without accents… :-)

Thanks for trying to satisfy everyone here!

Contributor

RickCogley commented Jun 16, 2015

@bep, the bit of the character in Japanese that is getting stripped is called a "dakuten" https://en.wikipedia.org/wiki/Dakuten. There is one that looks like a double quote and one that looks like a circle. After rendering to public using hugo server, I see this:

rcogley@jrcmbp2015:~/dev/eSolia/public/ja/topics|rc-working-2⚡
⇒  ll
total 8
drwxr-xr-x  4 rcogley  staff   136 Jun 16 22:22 about
-rw-r--r--  1 rcogley  staff     0 Jun 16 22:23 index.html
-rw-r--r--  1 rcogley  staff  1928 Jun 16 22:23 index.xml
drwxr-xr-x  4 rcogley  staff   136 Jun 14 20:58 professional
drwxr-xr-x  4 rcogley  staff   136 Jun 14 20:58 お問い合わせ
drwxr-xr-x  4 rcogley  staff   136 Jun 16 22:25 はひふへほ
drwxr-xr-x  4 rcogley  staff   136 Jun 14 20:58 サーヒス
drwxr-xr-x  4 rcogley  staff   136 Jun 16 22:25 ハヒフヘホ

I'm using "topics" as a taxonomy here. The last 3 lines in the ll output are missing these marks, and are supposed to be:

topics:
  - About
  - ばびぶべぼ
  - バビブベボ
  - ぱぴぷぺぽ
  - パピプペポ

But Hugo strips the dakuten, and combines the four into two. That is, ba バ and pa パ both become ha ハ.
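
Editor's sketch: the collision described above can be reproduced with a few lines of Python. In Unicode, the dakuten and handakuten decompose to combining marks (U+3099 and U+309A), so a filter that drops all nonspacing marks treats them like a French accent. The helper names `strip_marks` and `find_collisions` are invented for this example, and the filter is an assumed approximation of what Hugo did at the time, not its actual code.

```python
import unicodedata
from collections import defaultdict


def strip_marks(text: str) -> str:
    """Decompose to NFD and drop nonspacing marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )


def find_collisions(terms):
    """Group terms by stripped form; keep groups with more than one term."""
    groups = defaultdict(list)
    for term in terms:
        groups[strip_marks(term)].append(term)
    return {path: names for path, names in groups.items() if len(names) > 1}


topics = ["ばびぶべぼ", "バビブベボ", "ぱぴぷぺぽ", "パピプペポ"]
collisions = find_collisions(topics)
for path, names in collisions.items():
    print(path, "<-", names)
```

The four topics collapse into the two directory names seen in the `ll` listing, which is why accent removal needs to be opt-in for sites in languages where these marks carry meaning.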

Member

bep commented Jun 16, 2015

@RickCogley I know what we strip and how to not strip them ... Will fix tonight.

bep closed this in 4b7c134 Jun 16, 2015

Member

bep commented Jun 16, 2015

@nicolinuxfr please add RemovePathAccents = true to your config to keep the behavior you want.

nicolinuxfr commented Jun 16, 2015

Great, thanks for keeping me satisfied along with everyone else! :-)

Is it merged yet, so I can try it using Homebrew, or should I compile it manually?

Member

bep commented Jun 16, 2015

Didn't you have some success with go get -u .... ? And yes, it's merged.

Contributor

dunn commented Jun 16, 2015

@nicolinuxfr you can install the absolute latest version with brew install --HEAD hugo

nicolinuxfr commented Jun 16, 2015

@dunn are you sure? I tried to upgrade that way and the build seems old:

Hugo Static Site Generator v0.14 BuildDate: 2015-05-26T18:46:46+02:00

EDIT: oh, it seems I have a failed build because of dependencies. Well, it doesn't matter, the go get -u -v github.com/spf13/hugo worked perfectly.

@bep for me, everything is still fine!

Contributor

RickCogley commented Jun 16, 2015

@bep, thanks, after a recompile with go get -u -v github.com/spf13/hugo, I have proper Japanese again. :-)

@dunn, I stopped using brew install --HEAD hugo after finding it wonky a while back, but the above always works for me.

Contributor

dunn commented Jun 16, 2015

@RickCogley yeah, new dependencies have to be added manually, so it can break. I just opened Homebrew/legacy-homebrew#40794; thanks for the heads-up, @nicolinuxfr!

Contributor

RickCogley commented Jun 16, 2015

@dunn ah, I see. Thanks. I hadn't realized that.

tychoish added a commit to tychoish/hugo that referenced this issue Aug 13, 2017

Remove accents in URLs
So the taxonomy `Gérard Depardieu` gives paths on the form `gerard-depardieu`.

Unfortunately this introduces two imports from the `golang.org/`, but Unicode-normalization isn't something we'd want to write from scratch.

See https://blog.golang.org/normalization

See #1180

tychoish added a commit to tychoish/hugo that referenced this issue Aug 13, 2017

Add PreserveTaxonomyNames flag
Before this commit, taxonomy names were hyphenated, lower-cased and normalized -- then fixed and titleized on the archive page.

So what you entered in the front matter isn't necessarily what you got in the final site.

To preserve backwards compatibility, `PreserveTaxonomyNames` is default `false`.

Setting it to `true` will preserve what you type (the first character is upper-cased for titles), but normalized in URLs.

This also means that, if you manually construct URLs to the archive pages, you will have to pass the Taxonomy names through the `urlize` func.

Fixes #1180

tychoish added a commit to tychoish/hugo that referenced this issue Aug 13, 2017
