Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incomplete auto heading ID for extended Latin and CJK characters #6616

Closed
jkboxomine opened this issue Dec 14, 2019 · 9 comments · Fixed by #6706
Closed

Incomplete auto heading ID for extended Latin and CJK characters #6616

jkboxomine opened this issue Dec 14, 2019 · 9 comments · Fixed by #6706
Labels
Milestone

Comments

@jkboxomine
Copy link

@jkboxomine jkboxomine commented Dec 14, 2019

What version of Hugo are you using (hugo version)?

$ hugo version
Hugo Static Site Generator v0.61.0/extended darwin/amd64 BuildDate: unknown

Does this issue reproduce with the latest release?

Yes

Issue description

The following is the status of auto heading ID generation since Hugo v0.60.0.

  • Only ASCII alphanumeric (one-byte) characters are preserved, and extended latin characters (2 bytes) and other international characters (3 bytes) are discarded.
  • Only a small set of punctuations (hyphen, underscore, and slash) are taken into account. As a result, the IDs for headings with other punctuations become less readable or identifiable.

This proved to be caused by goldmark, but the author himself made it clear that it's no issue on goldmark and he won't change the default logic, as stated in the following issues and PR:

Possible solution

So I've wrote an extension for better auto heading ID generation.

I've tested with random heading strings including the ones from Kubernetes site (good place to test against former Hugo version + plenty of multi-language strings), and it appears to generate auto heading IDs on a par with Blackfriday.

I hope Hugo maintainers take a look at the extension and consider adopting it to resolve the issue.

@bep bep added this to the v0.62 milestone Dec 14, 2019
@bep

This comment has been minimized.

Copy link
Member

@bep bep commented Dec 14, 2019

Hey, and thanks for investigating this issue. I saw some ID diffs when testing this, but I kind of closed my eyes as there was so much to do...

I also forgat that we have an anchorize function that is supposed to create compatible IDs:

return blackfriday.SanitizedAnchorName(s), nil

What would be great is if you could:

  1. Copy the blackfriday function above into your extension. It's MIT and that is fine, but keep a link back to the source so people know its origin.
  2. Export it so Hugo can use it in the anchorize func above.

It would be good if you also stated in your README that this "id format" is Blackfriday compatible, so to speak.

@bep bep added Bug and removed NeedsInvestigation labels Dec 14, 2019
@jkboxomine

This comment has been minimized.

Copy link
Author

@jkboxomine jkboxomine commented Dec 15, 2019

@bep : If possible, I'd like to keep the code intact. I think the implementation is functionally equivalent to the algorithm used in Blackfriday (and other library mentioned in its package documentation, not sure which one is original).

Could you provide counter example strings or reasons (performance, etc.) that I have to replace the code with Blackfriday's?

As you are probably aware, the logic of Blackfriday or the extension code is at the very basic level and cannot satisfy more serious needs (e.g. c++ becomes c by both) and my hope is that somebody adds real extended features, such as the slugify I mentioned in the forum.

My point: for any regressions (that are being reported to goldmark, such as incomplete SmartyPants processing), should we borrow code directly from Blackfriday?

And for your second request,
Export it so Hugo can use it in the anchorize func above.
I doubt the function within the extension can be used as-is for Hugo's Anchorize function because it's specific to goldmark's IDs interface. Could you explain how it can be? (FYI, I'm still learning golang)

@bep

This comment has been minimized.

Copy link
Member

@bep bep commented Dec 15, 2019

Could you provide counter example strings or reasons (performance, etc.) that I have to replace the code with Blackfriday's?

The thing is, that would take time for me, so in reality, we wouldn't know until someone complains. I understand if you don't want to pull someone else's code into your project, it was just a suggestion.

@jkboxomine

This comment has been minimized.

Copy link
Author

@jkboxomine jkboxomine commented Dec 15, 2019

@bep : I understand your concern. If auto heading ID and anchorize function must produce the same output, those two should rely on the same logic.

Judging from the commit, the "other library" is the original and Blackfriday used its code.
And it's quite evident that its algorithm is computationally better than the extension code that uses three strings function calls for each heading. So if they are functionally equivalent (presumably), it's desirable to use better logic.

So, how would we go about this? Shall we import the original library or copy its code to make an exported function?

I'll follow your suggestion in general, as you're the expert in Golang and Hugo.

@jkboxomine

This comment has been minimized.

Copy link
Author

@jkboxomine jkboxomine commented Dec 30, 2019

Hi @bep , could we make some progress on this issue?
Let me know which way you prefer to go among the suggestions in the comment above.
And also let me know if there's anything else that I need to augment on the extension (for example, go module).

@bep

This comment has been minimized.

Copy link
Member

@bep bep commented Jan 2, 2020

Just a note to self: bep/docuapi@6f54292

@bep bep modified the milestones: v0.63, v0.62.2 Jan 4, 2020
bep added a commit to bep/hugo that referenced this issue Jan 4, 2020
You can turn off this behaviour:

```toml
[markup]
  [markup.goldmark]
    [markup.goldmark.parser]
      autoHeadingIDAsciiOnly = true
```
Note that the `anchorize` now adapts its behaviour depending on the default Markdown handler.

Fixes gohugoio#6616
bep added a commit to bep/hugo that referenced this issue Jan 4, 2020
You can turn off this behaviour:

```toml
[markup]
  [markup.goldmark]
    [markup.goldmark.parser]
      autoHeadingIDAsciiOnly = true
```
Note that the `anchorize` now adapts its behaviour depending on the default Markdown handler.

Fixes gohugoio#6616
bep added a commit to bep/hugo that referenced this issue Jan 4, 2020
You can turn off this behaviour:

```toml
[markup]
  [markup.goldmark]
    [markup.goldmark.parser]
      autoHeadingIDAsciiOnly = true
```
Note that the `anchorize` now adapts its behaviour depending on the default Markdown handler.

Fixes gohugoio#6616
bep added a commit to bep/hugo that referenced this issue Jan 4, 2020
You can turn off this behaviour:

```toml
[markup]
  [markup.goldmark]
    [markup.goldmark.parser]
      autoHeadingIDAsciiOnly = true
```
Note that the `anchorize` now adapts its behaviour depending on the default Markdown handler.

Fixes gohugoio#6616
bep added a commit to bep/hugo that referenced this issue Jan 4, 2020
You can turn off this behaviour:

```toml
[markup]
  [markup.goldmark]
    [markup.goldmark.parser]
      autoHeadingIDAsciiOnly = true
```
Note that the `anchorize` now adapts its behaviour depending on the default Markdown handler.

Fixes gohugoio#6616
@bep bep closed this in a82d270 Jan 4, 2020
@bep bep closed this in #6706 Jan 4, 2020
@jkboxomine

This comment has been minimized.

Copy link
Author

@jkboxomine jkboxomine commented Jan 5, 2020

@bep , many thanks for implementing the ID generation code to resolve this issue. I've tested with the latest master branch and verified that the first issue stated has been resolved.

  • (Resolved) Only ASCII alphanumeric (one-byte) characters are preserved, and extended latin characters (2 bytes) and other international characters (3 bytes) are discarded.

However, it appears that there are some loose ends regarding the other issue below:

  • Only a small set of punctuations (hyphen, underscore, and slash) are taken into account. As a result, the IDs for headings with other punctuations become less readable or identifiable.

On previous versions of Hugo, period and apostrophe were maintained. As a sample, you can test with the following examples:
https://kubernetes.io/docs/setup/release/notes/#downloads-for-v1-17-0
https://kubernetes.io/docs/setup/release/notes/#what-s-new-major-themes

There seems to be no set rule for how to handle punctuation, but it would be ideal if the new code can behave on a par with Blackfriday's, so that all the existing outgoing links to anchors in Hugo sites can remain accessible.

P.S. It's regrettable that I've made the subject of this issue too narrow to highlight the punctuation issue.

@bep

This comment has been minimized.

Copy link
Member

@bep bep commented Jan 5, 2020

This issue is closed. If you still think there are things to do here, raise another issue.

One quick note: Hugo's main strategy in this department is to create GitHub compatible anchor links. I didn't find their spec/implementation (I did not look too hard), but I'm pretty sure that the implementation is correct.

So with that in mind, if you find some cases where the anchors differ from GitHub, then I'll look at it.

This situation isn't ideal. I can add a note that I did ask for the Blackfriday implementation (and tests) in a PR related to this, but didn't get much response, so I implemented my own. I may consider adding "Blackfriday anchors" as an option, but that needs some real convincing for me to spend time on.

If you really want to help this situation, go get some movement in https://talk.commonmark.org/t/anchors-in-markdown/247 -- where people have been washing their hands for the last six years. Which is why we have this situation with many different auto ID implementations, some that support the {#anchorName} syntax (Hugo does, GitHub does not) ...

Hugo has now taken a stand in this: GitHub is a big player and has set the standard in other Markdown related issues (tables etc.), so it is natural to also follow their lead in anchor IDs.

bep added a commit that referenced this issue Jan 5, 2020
See #6707
See #6616
bep added a commit that referenced this issue Jan 5, 2020
See #6707
See #6616
@jkboxomine

This comment has been minimized.

Copy link
Author

@jkboxomine jkboxomine commented Jan 5, 2020

@bep, I appreciate your explanation and #6707.
I see some issue for whitespace handling for auto ID generation, so I'll submit another issue as you suggested.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants
You can’t perform that action at this time.