No TOC and wrong local links (docx to markdown) #3088

Closed
zlolik opened this issue Aug 26, 2016 · 7 comments

Comments

@zlolik

zlolik commented Aug 26, 2016

I write instructions in docx and publish them on a local GitLab instance, which follows almost the same Markdown rules as GitHub: https://github.com/gitlabhq/gitlabhq/blob/master/doc/user/markdown.md#links
Simple example: Short.instructions.docx
Conversion command:

pandoc -w markdown_github-raw_html -o short.md --toc "Short.instructions.docx"

Expected:
A TOC at the beginning of the file.
Links like:

[*Open remote folder*](#Remote-folder-or-longlonglonglonglong-file-with-manymanymanymany-letters-inside-opening)

Actual:
No TOC at the beginning of the file.
Links like:

[*Open remote folder*](#_Remote_folder_opening)

Am I doing something wrong, or is this a pandoc issue?

@jkr
Collaborator

jkr commented Aug 26, 2016

To get an automatically generated TOC in markdown, run with --standalone (-s)

pandoc -s -w markdown_github-raw_html -o short.md --toc "Short.instructions.docx"

As for the links, I'm not sure why you expect them to look like the long version. The short version that you are getting is what the anchor link actually is. You can see this by right-clicking on the link and selecting edit hyperlink (or by unzipping the docx file and looking at the xml).
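For anyone who would rather check this from a script than unzip by hand, here is a minimal Python sketch. It assumes the usual docx layout, with the main body in `word/document.xml`, bookmarks stored as `w:bookmarkStart` elements and internal links as `w:hyperlink` elements carrying a `w:anchor` attribute. The helper name `list_docx_anchors` is made up for illustration and is not part of pandoc.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace (the "w:" prefix in document.xml).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def list_docx_anchors(docx):
    """Return (bookmark_names, hyperlink_anchors) from word/document.xml.

    `docx` may be a path or a binary file-like object.
    Hypothetical helper for inspection only.
    """
    with zipfile.ZipFile(docx) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    # Bookmarks are the possible targets of internal links.
    bookmarks = [
        el.get(f"{{{W_NS}}}name")
        for el in root.iter(f"{{{W_NS}}}bookmarkStart")
    ]
    # Internal hyperlinks point at a bookmark through w:anchor.
    anchors = [
        el.get(f"{{{W_NS}}}anchor")
        for el in root.iter(f"{{{W_NS}}}hyperlink")
        if el.get(f"{{{W_NS}}}anchor") is not None
    ]
    return bookmarks, anchors
```

Calling `list_docx_anchors("Short.instructions.docx")` should then show whether the link anchors match the short form such as `_Remote_folder_opening`.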

@jkr
Collaborator

jkr commented Aug 26, 2016

Oh -- sorry, I think I see what the issue is with the links now. Let me take another look at it and see if I can figure out where the problem is coming from.

@jkr
Collaborator

jkr commented Aug 28, 2016

Okay -- I'm not sure exactly what your problem was, since you weren't getting the TOC, right? So the links should have been working correctly. But your report did point me to a problem in the way we handle internal links in pandoc, and how that interacts with TOCs.

The problem is that there is no such thing as an arbitrary internal link in pandoc. There are only links to headers and (in HTML output) to spans. The spans from the internal links are put into the header text (they're empty, but you can see them in the HTML). Now, there might be a way to rewrite ids so that links point directly to the header id (that would take another couple of passes, and so might be inefficient). But for now, the spans are getting copied into the text of the table of contents along with the section name. This means that the link points back to the table of contents.
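To make the "rewrite ids" idea concrete, here is a toy Python sketch. It is not pandoc's implementation (pandoc is written in Haskell and works on its own AST). The `Header` class and both function names are hypothetical, and headers are modelled simply as an id plus the ids of the anchor spans found in their text.

```python
from dataclasses import dataclass, field


@dataclass
class Header:
    """Toy stand-in for a header node (hypothetical, not pandoc's type)."""
    ident: str                      # the header's own id
    text: str                       # visible header text
    anchor_ids: list = field(default_factory=list)  # ids of empty spans inside it


def build_anchor_map(headers):
    """First pass: map every anchor-span id to the id of its header."""
    return {
        anchor: header.ident
        for header in headers
        for anchor in header.anchor_ids
    }


def retarget_links(link_targets, headers):
    """Second pass: point links at the header id instead of the span id.

    Targets that are not known anchors are left unchanged.
    """
    anchor_map = build_anchor_map(headers)
    return [anchor_map.get(target, target) for target in link_targets]
```

With links aimed at the header itself, it would no longer matter where copies of the span end up.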

So the quick solution for now would be to remove all anchor spans from TOC text. But there might be a more robust solution in the future.
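As a rough illustration of that quick solution, the Python sketch below drops empty anchor spans from a TOC entry. It works on an HTML string with a regular expression purely for demonstration. The real fix would operate on pandoc's document tree, and the function name `strip_anchor_spans` is an assumption of this example.

```python
import re

# Matches an empty <span> that carries an id attribute, e.g.
# <span id="_Remote_folder_opening" class="anchor"></span>
EMPTY_ANCHOR_SPAN = re.compile(r'<span\b[^>]*\bid="[^"]*"[^>]*>\s*</span>')


def strip_anchor_spans(toc_entry_html):
    """Remove empty id-carrying spans so the TOC entry holds no link targets."""
    return EMPTY_ANCHOR_SPAN.sub("", toc_entry_html)
```

Spans with visible content are left alone, so only the empty anchors disappear from the TOC text.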

jkr closed this as completed in 9f6fd61 Aug 28, 2016

@jkr
Collaborator

jkr commented Aug 28, 2016

@zlolik : I finally figured it out. It was a tricky one -- it had to do with overlapping anchors in the docx, which doesn't usually seem to happen. It's now fixed in the dev branch. Would it be alright if I used your sample file as a test case?

Btw, here's the output I get for your test file now (with the repeated lines removed):

pandoc -s -w markdown_github-raw_html -t markdown --atx --toc "Short.instructions.docx"
-   [Short instructions](#short-instructions)
-   [Some instructions](#some-instructions)
    -   [Remote folder or longlonglonglonglong file with
        manymanymanymany letters inside
        opening](#remote-folder-or-longlonglonglonglong-file-with-manymanymanymany-letters-inside-opening)
    -   [Remote folder or longlonglonglonglong file with
        manymanymanymany letters inside
        closing](#remote-folder-or-longlonglonglonglong-file-with-manymanymanymany-letters-inside-closing)

# Short instructions

[*Open remote
folder*](#remote-folder-or-longlonglonglonglong-file-with-manymanymanymany-letters-inside-opening)

Do staff

[*Close remote
folder*](#remote-folder-or-longlonglonglonglong-file-with-manymanymanymany-letters-inside-closing)

# Some instructions

Lines

...

Lines

## Remote folder or longlonglonglonglong file with manymanymanymany letters inside opening

Open folder

## Remote folder or longlonglonglonglong file with manymanymanymany letters inside closing

Close folder
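
The lowercase, hyphenated identifiers in this output are consistent with the auto-identifier rules described in the pandoc manual: drop punctuation other than underscores, hyphens and periods, turn spaces into hyphens, lowercase everything, remove everything before the first letter, and fall back to `section` if nothing is left. Below is a small Python approximation of those rules. The function name is made up, it works on plain text rather than pandoc's parsed inlines, and it skips details such as de-duplicating repeated identifiers.

```python
import re


def pandoc_style_identifier(heading):
    """Approximate pandoc's auto-identifier rules for a plain-text heading."""
    # Keep letters, digits, underscores, hyphens, periods and whitespace.
    kept = "".join(
        ch for ch in heading
        if ch.isalnum() or ch in "_-." or ch.isspace()
    )
    # Replace runs of spaces and newlines with a single hyphen.
    hyphenated = re.sub(r"\s+", "-", kept.strip())
    lowered = hyphenated.lower()
    # Remove everything up to the first letter.
    first_letter = re.search(r"[^\W\d_]", lowered)
    if first_letter is None:
        return "section"
    return lowered[first_letter.start():]
```

Applied to the two long section titles above, this should reproduce the anchors that the TOC and the links use.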

@zlolik
Author

zlolik commented Aug 29, 2016

@jkr : You are absolutely right. An auto-generated TOC was the preferable way, and I had tried to write the local links manually.
Thank you for the explanation about -s. My bad: on the first runs I did not notice any difference between running with and without standalone mode, so I decided not to use this option, and I did not read the manual carefully.
Glad that my issue can help you improve pandoc. Feel free to use the sample file.

@jkr
Collaborator

jkr commented Aug 29, 2016

Thanks -- I used a version of your document in the test suite. And just so you know, the fixed version shouldn't require manual writing of links when you're coming from docx. That is, the document you had should now produce the correct output, above, without manual intervention. (I was wrong about what was causing the issue at first.)

@berot3

berot3 commented Nov 20, 2023

I guess these days it should look something like this (in case someone stumbles upon this like me):

pandoc "name.docx" -o "name.md" -s -t markdown-raw_html --toc --markdown-headings=atx

or maybe also

pandoc "name.docx" -o "name.md" -s -t markdown-bracketed_spans-native_spans-grid_tables --toc --markdown-headings=atx
