Header anchor with diacritics #807

romanmatyus · 2013-03-28T22:32:06Z

For an ID to be generated correctly, it must contain only the characters [a-z0-9-].

Currently, from:

## Používateľský účet

pandoc generates:

<h2 id="používateľský-účet"> Používateľský účet</h2>

The correct output would be:

<h2 id="pouzivatelsky-ucet"> Používateľský účet</h2>

This would match the behavior of GitHub Markdown (e.g. in README.md).

jgm · 2013-03-29T02:32:57Z

Note: In HTML5, there is no such restriction on the ID attribute:
http://www.w3.org/TR/2011/WD-html5-20110525/elements.html#the-id-attribute
So really this only affects HTML 4.

I don't know of any general way to translate accented characters to ascii equivalents. In your case, there is an obvious translation -- just drop the accents -- but this won't be true in general, e.g. for Chinese. So for full generality one would have to use something like percent-encoding -- but without percent signs, of course. It would be ugly and it would be hard for users to calculate the IDs on the fly.

I just tried some documents with links to the unicode anchors, and they seem to work fine in modern browsers, even with an HTML 4 doctype. So I'm inclined not to worry about being "correct" in this respect. The alternative seems to me worse, and I don't see much advantage.
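The two options weighed in this comment can be sketched in a few lines. This is only an illustration in Python, not pandoc's code: `drop_accents` and `hex_escape` are hypothetical names, and the hex form of UTF-8 bytes is just one possible reading of "percent-encoding without percent signs".

```python
import unicodedata


def drop_accents(text: str) -> str:
    """Decompose each character (NFD) and discard the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def hex_escape(text: str) -> str:
    """Replace every non-ASCII character with the hex of its UTF-8 bytes."""
    return "".join(ch if ch.isascii() else ch.encode("utf-8").hex() for ch in text)


if __name__ == "__main__":
    print(drop_accents("Používateľský účet"))  # accents removed, base letters kept
    print(drop_accents("中文"))                  # unchanged: nothing to decompose
    print(hex_escape("účet"))                   # ASCII only, but hard to work out by hand
```

Dropping accents keeps IDs readable but leaves scripts such as Chinese untouched; the hex form always yields ASCII, at the cost of IDs that nobody can predict without running the code.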

romanmatyus · 2013-03-29T11:53:59Z

Yes, for HTML5 the output really is valid.

I use the Nette framework, and it "webalizes" strings like this.

Yes, this solution has a problem with, e.g., Chinese.

The problem is not that it fails in browsers, but that it is not compatible with other renderers, e.g. GitHub.

In my opinion, the ideal would be to use the ID generated by the "Nette" algorithm and, if that returns an empty string, to fall back to the current algorithm, which only replaces spaces with dashes. The problem is that this solution is again not compatible with GitHub. :(
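A minimal sketch of that fallback idea, in Python for illustration only. `ascii_slug` is a hypothetical stand-in and not Nette's actual webalize; `unicode_slug` only approximates the current behavior described in this thread (lowercase, spaces to dashes).

```python
import re
import unicodedata


def ascii_slug(title: str) -> str:
    """ASCII-only slug: strip accents, lowercase, collapse other runs to '-'."""
    decomposed = unicodedata.normalize("NFKD", title)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii").lower()
    return re.sub(r"[^a-z0-9]+", "-", ascii_only).strip("-")


def unicode_slug(title: str) -> str:
    """Fallback: keep the Unicode letters, only join the words with dashes."""
    return "-".join(title.lower().split())


def header_id(title: str) -> str:
    """Prefer the ASCII slug; fall back when nothing ASCII survives."""
    return ascii_slug(title) or unicode_slug(title)
```

With this sketch a Slovak heading gets an ASCII-only ID, while a heading written entirely in Chinese keeps its Unicode ID instead of ending up with an empty one.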

My reason for raising this issue:
I write text in Markdown and want to link to headings from within the text.

E.g.:

# Test
[Nejaký text](#nejaký-text)
## Nejaký text

When I generate a PDF using Pandoc, the output is correct.
But when I publish the text on GitHub, GitHub generates the ID "nejaky-text" for the title "Nejaký text" and my link fails. (And vice versa.)

I don't know of a solution to this problem that doesn't involve changing the ID generator in pandoc.

It would be nice to have at least a command-line option for this feature.

Right now I have to choose between correct output on GitHub and correct output via Pandoc.

PS: Currently, from

### Údaje o zákazníkovi

I get this:

<h3 id="údaje-ozákazníkovi"> Údaje o zákazníkovi</h3>

Why not

<h3 id="údaje-o-zákazníkovi"> Údaje o zákazníkovi</h3>

?

jgm · 2013-03-29T14:49:09Z

On the last point: I just tried it, and I got what you expected,

<h3 id="údaje-o-zákazníkovi"> Údaje o zákazníkovi</h3>

Unfortunately, I don't know of a Haskell library that provides a toAscii function like that used in the code you linked to. It would be possible (and tedious, but not conceptually hard) to write the function manually, by going through the unicode code points for European alphabets and specifying an Ascii equivalent for each.

Note that you can now specify the header ID explicitly in pandoc (though this probably won't work in github):

### Údaje o zákazníkovi {#udaje-o-zakaznikovi}

You can also use explicit HTML anchor tags if you need something that works in both.

jgm · 2013-03-29T15:24:10Z

OK, it wasn't actually too hard to create the needed function from the official unicode tables. I'll try to incorporate this.

romanmatyus · 2013-03-29T15:34:29Z

The toAscii method is the one directly above webalize.

Thanks for your work!

jgm · 2013-03-30T16:14:41Z

Probably the best approach is to add a new markdown extension for strict IDs, and use it with markdown_github.

It appears that github just completely ignores characters that don't have ascii equivalents, in generating the ID.

romanmatyus · 2013-03-30T20:02:24Z

How?

pevik · 2017-06-24T05:05:31Z

@jgm The behaviour doesn't work any more (tested on current master, i.e. 5812ac0)

## Rozdělení do družin

produces

<h2 id="rozdělení-do-družin">Rozdělení do družin</h2>

jgm · 2017-06-24T10:38:22Z

Try

pandoc -f markdown+ascii_identifiers

if you want to ensure that the identifiers are ASCII.

Wolf-SO · 2017-06-24T10:49:50Z

@jgm (concerning -f markdown+ascii_identifiers) Is it possible to change that extension (or replace it with a modified one) so that it transliterates according to language-specific conventions (such as German: ß->ss, ä->ae, Ä->Ae, ö->oe, ...)?

jgm · 2017-06-24T10:56:24Z

@Wolf-at-SO yes that could easily be done. The relevant source file is src/Text/Pandoc/Asciify.hs; it's a simple map, so perhaps you could indicate (in a new bug report) all the changes you think would be appropriate. (We'd also have to change this to a map Char -> String, to allow for the two-letter replacements.)
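The move from a character-to-character map to a character-to-string map can be sketched as follows. This is a Python illustration, not the contents of src/Text/Pandoc/Asciify.hs; the `MULTI_LETTER` table and the `transliterate` function are hypothetical and contain only the four German replacements named above.

```python
import unicodedata

# Hypothetical table: one character maps to a string, so two-letter
# replacements are possible. Only the pairs named in the thread are listed.
MULTI_LETTER = {"ß": "ss", "ä": "ae", "Ä": "Ae", "ö": "oe"}


def transliterate(text: str, table: dict = MULTI_LETTER) -> str:
    """Apply the table first; otherwise fall back to dropping accents."""
    out = []
    # Normalize to NFC so precomposed table keys such as 'ä' match.
    for ch in unicodedata.normalize("NFC", text):
        if ch in table:
            out.append(table[ch])
        else:
            decomposed = unicodedata.normalize("NFD", ch)
            out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)
```

Characters missing from the table still lose their accents one-for-one, so existing behavior for other languages would stay the same under this design.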

pevik · 2017-06-26T06:29:44Z

@jgm Thanks, pandoc -f markdown+ascii_identifiers works :-). IMHO it should be default (I know unicode identifiers should work, but ASCII is default for web).

jgm closed this as completed in 031686b Apr 24, 2013

Wolf-SO mentioned this issue Jun 24, 2017

Markdown reader: allow for more complex transliterations in ascii_identifiers or another extension #3757

Open
