Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add option to scan and register Markdown anchors #39

Merged
merged 14 commits into from Feb 27, 2024
Merged

Conversation

pawamoy
Copy link
Member

@pawamoy pawamoy commented Feb 16, 2024

Supersedes/closes #20.

This PR adds a tree processor to our Markdown extension in order to iterate on the element tree (converted from Markdown) and register anchors that have an id. If an anchor is directly followed by a heading, then it will target this heading rather than itself.

We pass a reference to the plugin to the tree processor, so that it can obtain the current page URL and call its register_anchor method. The current page URL is set by the plugin, for each page, before Markdown conversion, in the on_page_markdown event.

You'll see in the changes that I left some comments: an alternative implementation is using regular expressions to match anchors in the HTML. It was working well, but one issue with that is that it's not trivial to support matching ids and hrefs in any order: <a id="hello" href="#hello"> and <a href="#hello" id="hello">. So instead of complicating the regular expression (it might be possible, but I'm lacking regex-fu), I went with the processor approach, which could also be more efficient

I'm planning to run some benchmarks to test both implementation and see which one performs the best, but maybe this is obvious to you @oprypin that the tree processor is the best approach, and in this case I won't bother benchmarking.

The regex approach was dropped, as it's always difficult and never a good idea to use regular expressions on HTML. It would have been even harder to infer the slug of headings appearing right after anchors.


Now, what does this feature bring, and how do we use it?

Basically, the changes in this PR make it possible to:

  1. define "aliases" to headings
[](){#install-from-sources}
## How to install from sources?
  1. define anchors in arbitrary positions in a document, without headings
Paragraph 1.

[](){#paragraph-2}

Paragraph 2.

The syntax in the above examples require the attr_list Markdown extension to be enabled.

autorefs then lets us reference these headings or non-heading locations with the usual syntax: [install from sources][install-from-sources] and [see paragraph 2][paragraph-2].

The alias feature is particularly interesting when you have several similar pages, with equal headings, as it allows you to keep these headings equal (anchor, permalink), while defining additional anchors which include the page name. For example:

docs/
    install/
        arch.md
        debian.md
        gentoo.md

Each page has:

## Install with package manager

...

## Install from sources

...

You don't want to change headings and make them redundant, like ## Arch: Install with package manager, ## Debian: Install with package manager just to be able to reference the right one with autorefs. Instead you can do this:

[](){#arch-install-pkg}

## Install with package manager

...

[](){#arch-install-src}

## Install from sources

...

...changing arch by debian, gentoo, etc. in the other pages.


I believe this PR also closes #25 and #35, since equal headings throughout a site do not cause determinism issues anymore: you can now define unique aliases to make sure to link to the right heading. No need to warn about equal headings, no need to add priority configuration.

Finally, with a bit more work in mkdocstrings, it could also make it possible to shorten the anchors in the permalinks, while keeping the full path of objects in autorefs. So, instead of this redundant permalink: https://mkdocstrings.github.io/griffe/reference/griffe/git/#griffe.git.load_git, we could have https://mkdocstrings.github.io/griffe/reference/griffe/git/#load_git, while still being able to reference it with [load_git][griffe.git.load_git].

@pawamoy
Copy link
Member Author

pawamoy commented Feb 16, 2024

@tvdboom if we go with the processor approach, I'm going to reverse commit attribution (me as author, you as co-author), I hope you don't mind 🙂

docs/changelog.md Outdated Show resolved Hide resolved
docs/index.md Outdated Show resolved Hide resolved
src/mkdocs_autorefs/plugin.py Outdated Show resolved Hide resolved
src/mkdocs_autorefs/plugin.py Outdated Show resolved Hide resolved
src/mkdocs_autorefs/references.py Outdated Show resolved Hide resolved
src/mkdocs_autorefs/references.py Outdated Show resolved Hide resolved
src/mkdocs_autorefs/plugin.py Outdated Show resolved Hide resolved
tests/test_plugin.py Outdated Show resolved Hide resolved
@oprypin
Copy link
Member

oprypin commented Feb 16, 2024

In your examples you show both "Markdown anchor" and "HTML anchor" side by side. So I guess you expect that both syntaxes will work equivalently. Which would definitely be nice. But actually I don't see how the HTML one could work under this implementation. If a user writes raw HTML, that HTML never becomes part of the element tree, it's just stashed.

@oprypin
Copy link
Member

oprypin commented Feb 16, 2024

If I were to use custom anchors for myself, ideally I'd probably prefer raw HTML over attr_list Markdown because:
[](#how-to-install-from-sources){#install-from-sources} renders as "{#install-from-sources}" on GitHub.

MkDocs itself in the upcoming release already has an implementation of exactly this
mkdocs/mkdocs#3463 and even including handling of raw HTML mkdocs/mkdocs#3505
So autorefs could simply pull this data from MkDocs - we should make sure that this is actually possible.
Of course, as this functionality is not released currently, this is not so relevant, but perhaps the implementation could be copied over for now - I could hand over the copyright, so to say, by contributing it also to this repo.

@oprypin
Copy link
Member

oprypin commented Feb 16, 2024

define "aliases" to headings

I appreciate the attention to detail here. I wonder if users will also be able to appreciate the benefit they get from typing

[](#install-with-package-manager){#arch-install-pkg}
## Install with package manager

rather than the simpler

[](){#arch-install-pkg}
## Install with package manager

As I understand, the effect is that [arch-install-pkg][] links to foo.html#install-with-package-manager? Took me a while to understand but maybe I'm just distracted 😅

@pawamoy
Copy link
Member Author

pawamoy commented Feb 16, 2024

As I understand, the effect is that [arch-install-pkg][] links to foo.html#install-with-package-manager?

Yes, sorry, I forgot to explain that part. If you look at the changes, register_anchor now accepts a third argument. This lets us register an identifier that points to a different anchor. Some kind of redirection if you want 🙂

[](){#arch-install-pkg} is definitely shorter and easier to type, but will not actually "activate" the heading's permalink, and so the ToC icon next to the heading will not be styled accordingly.

@pawamoy
Copy link
Member Author

pawamoy commented Feb 16, 2024

But actually I don't see how the HTML one could work under this implementation. If a user writes raw HTML, that HTML never becomes part of the element tree, it's just stashed.

Damn 😂 I actually didn't test that it worked with raw HTML anchors 😫

So autorefs could simply pull this data from MkDocs - we should make sure that this is actually possible. Of course, as this functionality is not released currently, this is not so relevant, but perhaps the implementation could be copied over for now - I could hand over the copyright, so to say, by contributing it also to this repo.

Both options (somehow using MkDocs logic, and copying code over) sound good to me. Whatever you think is best. We can maybe start by copying code over, and once MkDocs has released the feature we can start thinking about tighter integration.

@pawamoy pawamoy marked this pull request as draft February 16, 2024 17:40
@pawamoy
Copy link
Member Author

pawamoy commented Feb 16, 2024

I think I'm also fine with telling users that raw HTML is not supported (for now). There is a lot of markup that won't render correctly on GitHub, so I don't care that much about graceful degradation anymore.

@oprypin
Copy link
Member

oprypin commented Feb 16, 2024

[](){#arch-install-pkg} is definitely shorter and easier to type, but will not actually "activate" the heading's permalink, and so the ToC icon next to the heading will not be styled accordingly.

I actually didn't understand this part - how does this affect the ToC?

@oprypin
Copy link
Member

oprypin commented Feb 16, 2024

So autorefs could simply pull this data from MkDocs - we should make sure that this is actually possible.

I compared against the MkDocs code and of course it doesn't save the href data out of links, only the id data. So not usable for the aliases feature :(
Could still be changed, I guess, but it'd definitely be a specific integration.


I was also wondering, even just from a usage perspective, are there any downsides to making this

[](){#arch-install-pkg}
## Install with package manager

behave like the alias feature anyway, even though the user didn't explicitly type the associated anchor? (And then eliminate this alias-specific href handling). Unfortunately this only makes it harder to reuse MkDocs' code, but I'm suggesting this regardless of that :D

@pawamoy
Copy link
Member Author

pawamoy commented Feb 16, 2024

Can we can look ahead to find the following heading and somehow compute/get its slug? I guess it's easy if we iterate on the element tree.

@pawamoy
Copy link
Member Author

pawamoy commented Feb 16, 2024

I actually didn't understand this part - how does this affect the ToC?

With the toc extension and its permalink option set to true (or a character), you get a character next to headings that you can click on. The URL then gets updated with the selected heading's anchor, and the permalink character gets (I guess) a "selected" class or something. In Material, the character is styled with the secondary theme color.

@pawamoy
Copy link
Member Author

pawamoy commented Feb 17, 2024

I was also wondering...

[](){#arch-install-pkg}
## Install with package manager

I like this suggestion, so I implemented it. Support for passing the href is removed: users must use the syntax [](){#id}.

The previous implementation could have been confusing: [](#slug){#id} would have worked even if not directly above the relevant heading. Users could have started putting these aliases at the end of documents (still working), and then tried to put these in an external file (for example an always included document thanks to pymdownx snippet), and this would have failed because these aliases must be defined on the same page as their target heading.

This new implementation is stricter and less confusing: aliases must always appear right before their target heading.

src/mkdocs_autorefs/references.py Outdated Show resolved Hide resolved
src/mkdocs_autorefs/references.py Outdated Show resolved Hide resolved
pawamoy

This comment was marked as resolved.

@pawamoy pawamoy marked this pull request as ready for review February 18, 2024 12:21
@pawamoy
Copy link
Member Author

pawamoy commented Feb 18, 2024

Going out of draft because it's working and tested, but we can still change a few things if we want 🙂

@oprypin
Copy link
Member

oprypin commented Feb 18, 2024

And I suppose abandon the idea of supporting raw HTML for now 🥲 but yea it's the only reasonable way

@oprypin
Copy link
Member

oprypin commented Feb 18, 2024

The title and everything keeps mentioning HTML but that's not really accurate anymore, right? 🤔

@pawamoy
Copy link
Member Author

pawamoy commented Feb 18, 2024

Right, let me update everything to remove mentions of HTML. I'll also add docs.

@pawamoy pawamoy changed the title feat: Add option to scan and register HTML anchors feat: Add option to scan and register Markdown anchors Feb 18, 2024
@pawamoy
Copy link
Member Author

pawamoy commented Feb 18, 2024

Ah, do we want to allow multiple aliases per heading?

[](){#hello}
[](){#bonjour}
## Hello world!

I believe this is feasible, we just need to record anchors in a list before registering them through the plugin. One difficulty I can see is:

[](){#bonjour}
Bonjour.
[](){#hello}
## Hello world!

Will need some testing.

@pawamoy
Copy link
Member Author

pawamoy commented Feb 18, 2024

Done. Some docs added in the README 🙂

Copy link
Member

@oprypin oprypin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks nice :)

README.md Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
assert plugin._url_map["alias1"] == "#heading-bar"
assert plugin._url_map["alias2"] == "#heading-bar"
assert plugin._url_map["alias3"] == "#alias3"
assert plugin._url_map["alias4"] == "#heading-baz"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be nice to have an official specification of what is expected to happen in case of conflicts, with the most extreme case being this one:

[](){#foo}
## Bar

[](){#bar}
## Foo

Copy link
Member Author

@pawamoy pawamoy Feb 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With

[](){#foo}
## Bar
...
[](){#bar}
## Foo
...

[Link to foo][foo]
[Link to bar][bar]

foobar

Erm sorry about the highlighted rectangle on the left, but you get the idea.

Clicking on the "foo" link will bring you to the "Bar" heading, and inversely.
So, even though the ids of the headings themselves have been suffixed, what the user specified has been achieved 🤷 (the aliases work)

src/mkdocs_autorefs/references.py Outdated Show resolved Hide resolved
@pawamoy
Copy link
Member Author

pawamoy commented Feb 18, 2024

I just noticed that anchors within docstrings are not picked up. I suppose it's because mkdocstrings does not re-adds the autorefs extension to the inner Markdown instance 🤔 Or maybe it re-adds it but without the plugin reference. The outer Markdown extension has zero chance of picking them up since mkdocstrings generates HTML contents that is then stashed.

UPDATE: the autorefs extension is correctly added back by mkdocstrings.

@pawamoy
Copy link
Member Author

pawamoy commented Feb 18, 2024

OK it's simply because of the IdPrepending processor: if I add [](){#config-hook} to the on_config hook's docstring, the anchor id is transformed into mkdocs_autorefs.plugin.AutorefsPlugin.on_config--config-hook. Not sure what to do about that 🤔 We probably shouldn't do anything, that's how mkdocstrings work. Definitely needs to be documented here however.

@oprypin
Copy link
Member

oprypin commented Feb 18, 2024

Interesting. Well even though the behavior is "correct", perhaps something should be done just because surely nobody wants these mangled ids for something that they wrote explicitly..
Implementation-wise it's possible to achieve as well..

@oprypin
Copy link
Member

oprypin commented Feb 18, 2024

The aliases feature has the goal of global disambiguation. I'm sure it makes sense in docstrings as well.

As for non-alias anchors, it is of course not so clear, and stopping to prefix them could be a breaking change for some users, though it seems super niche.

@oprypin
Copy link
Member

oprypin commented Feb 19, 2024

I have one idea for what mkdocstrings could do.

Upon encountering a tag like this (in etree form),
<a id="foo" href=""></a>,
normally mkdocstrings would replace it with
<a id="some.Identifier--foo" href=""></a>,
but how about it replaces it with
<a id="foo" href=""></a> <a id="some.Identifier--foo" href=""></a>.

The condition to do this could be specifically an a tag (in etree form) with an empty href.

@oprypin
Copy link
Member

oprypin commented Feb 19, 2024

Also as a side note, what if autorefs were to clean up those empty href attribute values as a bonus 🤔

@pawamoy
Copy link
Member Author

pawamoy commented Feb 19, 2024

Duplicating the anchors... such an elegant solution 🥲 Thanks a lot! Yeah we can also clean up the empty hrefs 👍

@pawamoy
Copy link
Member Author

pawamoy commented Feb 20, 2024

I think we will have to implement an iterator that yields parents and element position as well as elements, to insert a elements in the right place. Or make the IdPrependingTreeprocessor.run recursive instead of iterative 🤔

pawamoy added a commit to mkdocstrings/mkdocstrings that referenced this pull request Feb 20, 2024
This will be useful for the soon-to-come Markdown anchors feature of mkdocs-autorefs.

Related to mkdocs-autorefs#39: mkdocstrings/autorefs#39
pawamoy added a commit to mkdocstrings/mkdocstrings that referenced this pull request Feb 22, 2024
…autorefs' Markdown anchors

PR-#651: #651
Related-to-mkdocs-autorefs#39: mkdocstrings/autorefs#39
Co-authored-by: Oleh Prypin <oleh@pryp.in>
Copy link
Member

@oprypin oprypin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider merging my commit where I address my suggestions.
oprypin@2fc4e3c

src/mkdocs_autorefs/plugin.py Outdated Show resolved Hide resolved
src/mkdocs_autorefs/references.py Outdated Show resolved Hide resolved
Copy link
Member

@oprypin oprypin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My latest commit + a merge commit can be merged into this branch by doing the following:

git fetch origin 37c416947b23feee011e94901c68a57d294b63de
git merge 37c416947b23feee011e94901c68a57d294b63de

Or I could push this to the PR myself if that's OK

src/mkdocs_autorefs/plugin.py Outdated Show resolved Hide resolved
tests/test_references.py Show resolved Hide resolved
@pawamoy
Copy link
Member Author

pawamoy commented Feb 23, 2024

Definitely feel free to push directly to this PR!

Copy link
Member

@oprypin oprypin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK in my view this is ready

@pawamoy
Copy link
Member Author

pawamoy commented Feb 27, 2024

One last thing maybe: clearing up empty hrefs on aliases and anchors (as suggested by you earlier)?

@pawamoy
Copy link
Member Author

pawamoy commented Feb 27, 2024

I don't think these empty hrefs do any harm, lets merge.

@pawamoy pawamoy merged commit a215a97 into main Feb 27, 2024
33 checks passed
@pawamoy pawamoy deleted the scan-anchors branch February 27, 2024 13:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants