Header anchor with diacritics #807

Closed
romanmatyus opened this issue Mar 28, 2013 · 12 comments

@romanmatyus

To be generated correctly, an ID must contain only the characters [a-z0-9-].

Currently, from:

## Používateľský účet

pandoc generates:

<h2 id="používateľský-účet"> Používateľský účet</h2>

The correct output would be:

<h2 id="pouzivatelsky-ucet"> Používateľský účet</h2>

This would match the behavior of GitHub Markdown (e.g. README.md).

@jgm
Owner

jgm commented Mar 29, 2013

Note: In HTML5, there is no such restriction on the ID attribute:
http://www.w3.org/TR/2011/WD-html5-20110525/elements.html#the-id-attribute
So really this only affects HTML 4.

I don't know of any general way to translate accented characters to ascii equivalents. In your case, there is an obvious translation -- just drop the accents -- but this won't be true in general, e.g. for Chinese. So for full generality one would have to use something like percent-encoding -- but without percent signs, of course. It would be ugly and it would be hard for users to calculate the IDs on the fly.

I just tried some documents with links to the unicode anchors, and they seem to work fine in modern browsers, even with an HTML 4 doctype. So I'm inclined not to worry about being "correct" in this respect. The alternative seems to me worse, and I don't see much advantage.
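
To make the "percent-encoding without percent signs" idea concrete, here is a minimal Python sketch. The function name `hex_encode_id` and the exact rule (lowercase, keep `[a-z0-9-]`, turn whitespace into `-`, write every other character as the hex digits of its UTF-8 bytes) are assumptions made for illustration, not anything pandoc implements.

```python
def hex_encode_id(text: str) -> str:
    """Hypothetical ASCII-only ID: keep [a-z0-9-], hex-encode everything else."""
    out = []
    for ch in text.strip().lower():
        if ch.isascii() and (ch.isalnum() or ch == "-"):
            out.append(ch)
        elif ch.isspace():
            out.append("-")
        else:
            # UTF-8 bytes written as hex digits, with no '%' prefix
            out.append(ch.encode("utf-8").hex())
    return "".join(out)


print(hex_encode_id("Používateľský účet"))
# pouc5bec3advatec4beskc3bd-c3bac48det
```

The result is ASCII-only but unreadable, and because hex digits are also ordinary letters the encoding cannot be reversed, which shows why such IDs would be hard to work out by hand.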

@romanmatyus
Author

Yes, for HTML5 the output really is valid.

I use the Nette framework, and it "webalizes" strings like this.

Yes, this solution has a problem with, e.g., Chinese.

The problem is not that it fails in browsers, but that it is incompatible with other Markdown renderers, e.g. GitHub.

In my view, the ideal would be to use the ID generated by the "Nette" algorithm and, if that returns an empty string, fall back to the current algorithm, which only replaces spaces with dashes. The problem is that this solution is, again, not compatible with GitHub. :(
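
The fallback idea described above can be sketched in Python. This is neither Nette's actual `webalize` code nor pandoc's current algorithm: `strip_accents`, `ascii_slug`, and `heading_id` are hypothetical names, and the fallback is simplified to lowercasing and replacing whitespace with dashes.

```python
import re
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def ascii_slug(text: str) -> str:
    """ASCII-only slug restricted to [a-z0-9-]; may come out empty."""
    ascii_text = strip_accents(text).encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "-", ascii_text.lower()).strip("-")


def heading_id(text: str) -> str:
    """Prefer the ASCII slug; fall back to a Unicode ID if nothing is left."""
    slug = ascii_slug(text)
    if slug:
        return slug
    # Simplified fallback (assumption): keep the characters, replace whitespace
    return re.sub(r"\s+", "-", text.strip().lower())


print(heading_id("Používateľský účet"))  # pouzivatelsky-ucet
print(heading_id("中文 标题"))            # 中文-标题
```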

My reason for opening this issue:
I write text in Markdown and want to link to headings from within the text.

E.g.:

# Test
[Nejaký text](#nejaký-text)
## Nejaký text

When I generate a PDF using Pandoc, the output is correct.
But when I publish the text on GitHub, GitHub generates the ID "nejaky-text" for the heading "Nejaký text", and my link fails. (And vice versa.)

I don't know of a solution to this problem that doesn't change the ID generator in pandoc.

It would be nice to have at least a command-line argument for this feature.

For now I must choose between correct output on GitHub and correct output via Pandoc.

PS: Currently, from

### Údaje o zákazníkovi

I get this:

<h3 id="údaje-ozákazníkovi"> Údaje o zákazníkovi</h3>

Why not this?

<h3 id="údaje-o-zákazníkovi"> Údaje o zákazníkovi</h3>

@jgm
Owner

jgm commented Mar 29, 2013

On the last point: I just tried it, and I got what you expected,

<h3 id="údaje-o-zákazníkovi"> Údaje o zákazníkovi</h3>

Unfortunately, I don't know of a Haskell library that provides a toAscii function like that used in the code you linked to. It would be possible (and tedious, but not conceptually hard) to write the function manually, by going through the unicode code points for European alphabets and specifying an Ascii equivalent for each.

Note that you can now specify the header ID explicitly in pandoc (though this probably won't work in github):

### Údaje o zákazníkovi {#udaje-o-zakaznikovi}

You can also use explicit HTML anchor tags if you need something that works in both.

@jgm
Owner

jgm commented Mar 29, 2013

OK, it wasn't actually too hard to create the needed function from the official unicode tables. I'll try to incorporate this.
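
As an illustration of deriving such a table from Unicode data, here is a minimal Python sketch that uses the decomposition data in the standard `unicodedata` module. It is not the Haskell code that went into pandoc; the names `build_ascii_table`, `ASCII_TABLE`, and `to_ascii`, the code point range (up to U+024F), and the choice to drop unmapped characters are all assumptions.

```python
import unicodedata


def build_ascii_table(last_code_point: int = 0x24F) -> dict:
    """Map each accented character to the ASCII letter its decomposition starts with."""
    table = {}
    for cp in range(0x80, last_code_point + 1):
        ch = chr(cp)
        base = unicodedata.normalize("NFD", ch)[0]
        if base != ch and base.isascii() and base.isalpha():
            table[ch] = base
    return table


ASCII_TABLE = build_ascii_table()


def to_ascii(text: str) -> str:
    """Keep ASCII, replace mapped characters, drop everything else (a sketch choice)."""
    return "".join(ch if ch.isascii() else ASCII_TABLE.get(ch, "") for ch in text)


print(to_ascii("Údaje o zákazníkovi"))  # Udaje o zakaznikovi
```

Characters without a decomposition to an ASCII letter, such as ß or Chinese characters, get no table entry and are simply dropped in this sketch.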

@romanmatyus
Author

The toAscii method is the one just above webalize.

Thanks for your work!

@jgm
Owner

jgm commented Mar 30, 2013

Probably the best approach is to add a new markdown extension for strict IDs, and use it with markdown_github.

It appears that github just completely ignores characters that don't have ascii equivalents, in generating the ID.

@romanmatyus
Author

How?

jgm closed this as completed in 031686b on Apr 24, 2013
@pevik

pevik commented Jun 24, 2017

@jgm The behaviour doesn't work any more (tested on current master, i.e. 5812ac0)

## Rozdělení do družin

produces

<h2 id="rozdělení-do-družin">Rozdělení do družin</h2>

@jgm
Owner

jgm commented Jun 24, 2017

Try

pandoc -f markdown+ascii_identifiers

if you want to ensure that the identifiers are ASCII.

@Wolf-SO

Wolf-SO commented Jun 24, 2017

@jgm (concerning -f markdown+ascii_identifiers) Would it be possible to change that extension (or replace it with a modified one) so that it transliterates according to language-specific conventions (such as German: ß->ss, ä->ae, Ä->Ae, ö->oe, ...)?

@jgm
Owner

jgm commented Jun 24, 2017

@Wolf-at-SO yes that could easily be done. The relevant source file is src/Text/Pandoc/Asciify.hs; it's a simple map, so perhaps you could indicate (in a new bug report) all the changes you think would be appropriate. (We'd also have to change this to a map Char -> String, to allow for the two-letter replacements.)
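
The point about needing a character-to-string map can be shown with a short Python sketch. This is not the code in Asciify.hs; `GERMAN_TABLE` and `transliterate_german` are hypothetical names, and the entries for Ö, ü, and Ü are added here by analogy with the ones listed above.

```python
# Each key is a single character; each value may be longer than one character,
# which is why a plain character-to-character map is not enough.
GERMAN_TABLE = str.maketrans({
    "ß": "ss",
    "ä": "ae",
    "Ä": "Ae",
    "ö": "oe",
    "Ö": "Oe",  # assumption: not listed in the thread
    "ü": "ue",  # assumption: not listed in the thread
    "Ü": "Ue",  # assumption: not listed in the thread
})


def transliterate_german(text: str) -> str:
    """Apply the German two-letter replacements; other characters pass through."""
    return text.translate(GERMAN_TABLE)


print(transliterate_german("Straße"))  # Strasse
```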

@pevik

pevik commented Jun 26, 2017

@jgm Thanks, pandoc -f markdown+ascii_identifiers works :-). IMHO it should be default (I know unicode identifiers should work, but ASCII is default for web).
