Add AsciiDoc Reader / AsciiDoc input support #1456

Open
ERnsTL opened this issue Jul 26, 2014 · 67 comments

Comments

@ERnsTL
Copy link

ERnsTL commented Jul 26, 2014

Greetings,

I would like to suggest adding AsciiDoc input support, i.e. an AsciiDoc reader.

Besides Markdown, this format is growing in popularity and is also used inside publishing toolchains (asciidoc -> docbook -> pdf/epub/html). Currently the only other viable implementation is asciidoctor, which uses Ruby or JRuby, but it accepts only AsciiDoc as input and is not a universal markup converter like pandoc.

I am aware of only one relevant discussion thread regarding this, which showed a positive response to this feature. Someone actually had some basic code there, but I am not sure whether kuznero (Roman Kuznetsov) still has his code from back then to start from. In any case, it would certainly make sense to have AsciiDoc input in the feature set.

@mpickering
Copy link
Collaborator

I have started working on this, it might not be ready for a few months.

@mpickering
Copy link
Collaborator

Have you tried going AsciiDoc -> DocBook -> Pandoc? Can you describe the shortcomings of doing this if you have?

@ERnsTL
Copy link
Author

ERnsTL commented Aug 7, 2014

Thanks for your positive comment!

I personally have not tried the conversion chain you mentioned. The current choices are bulky with regard to their dependencies, while neither being nor aspiring to be universal markup translators like pandoc. Going through intermediate formats instead of one pandoc invocation seems hacky to me.

Recently, I read about two publishing houses switching away from LaTeX and moving to AsciiDoc as their source format, so I gather that it fulfills the needs of technical writing well, including referencing. I find it useful to have another capable plain-text format for documents, provided software support is good, can easily convert between different formats, and offers multiple output choices. Which is where this feature would come in ;-)

I personally would also like to write articles and possibly a book in AsciiDoc rather than LaTeX, which is (at least my personal) motivation for this feature.

@alexborisov
Copy link

👍 I would love to see support for reading asciidoc files in pandoc. It was actually my primary reason to use pandoc. At the moment I have a bit of a hacky solution: converting my asciidoc into html and then feeding that into pandoc. I would love to eliminate the extra project dependency and simplify my build chain.

@ciampix
Copy link

ciampix commented Nov 19, 2014

I can testify that this would be a very useful addition to pandoc. Thanks in advance, Mr. mpickering!

@jgm
Copy link
Owner

jgm commented Apr 20, 2015

PR #2100 contributes a basic AsciiDoc reader (with many features not yet implemented).
@mpickering, how far did you get on your AsciiDoc reader? Is it farther along than #2100, or not as far along? It would be good to put something in the repository (in a branch) for people to work on to advance the project.

@mpickering
Copy link
Collaborator

I commented on #2100.

@romario89
Copy link

I agree that asciidoc is getting popular. I've just tried to convert the book Pro Git 2 into epub using pandoc and soon noticed that the book was written in asciidoc, and pandoc was unable to read it.

@benhourigan
Copy link

I’ve tried an asciidoc > html (via https://github.com/asciidoctor/asciidoctor) > epub (via pandoc) conversion chain and it works extremely well except for the following issue.

Asciidoctor wraps all HTML elements in divs with additional classes. This stops pandoc from splitting the epub automatically at headings, because it will never see a 'naked' h1 etc.

Also mentioned this issue here: asciidoctor/asciidoctor#184

The ePub file I created using this method did not pass epubcheck 3.0.1 because it contained one duplicate ID (something I could have avoided). More seriously, the way footnotes are handled is not compliant, and raised several errors like this:

ERROR: …/EPUB.epub/ch003.xhtml(2918,1223): '_footnote_4': fragment identifier is not defined in 'ch003.xhtml'
ERROR: …/EPUB.epub/ch008.xhtml(2065,27): '_footnoteref_1': fragment identifier is not defined in 'ch008.xhtml'

(paths truncated with ellipsis at start)

At the risk of stating the obvious, it's important that pandoc-generated epubs from any source format avoid epubcheck validation errors, as authors and publishers may need to submit these epubs to storefronts that require epubcheck compliance (e.g. Smashwords, iBooks). Many of the GitHub-hosted CLI epub generators I've tried (e.g. https://github.com/avdgaag/rpub) omit consideration of epubcheck compliance, so this may not be an obvious point after all. The most common point of failure seems to be the manifest, which pandoc gets right, which is great. But it could be yet more robust, as the above errors indicate.

@jgm
Copy link
Owner

jgm commented Jul 5, 2015

+++ Ben Hourigan [Jul 05 15 08:56 ]:

I’ve tried an asciidoc > html (via
https://github.com/asciidoctor/asciidoctor) > epub (via pandoc)
conversion chain and it works extremely well except for the following
issue.

Asciidoctor wraps all HTML elements in divs with additional classes.
This stops pandoc from splitting the epub automatically at headings,
because it will never see a 'naked' h1 etc.

Also discussed this issue here: asciidoctor/asciidoctor#184

You could handle this easily with a filter that strips out the
outer Divs before the EPUB writer sees it.

The ePub file I created using this method did not pass epubcheck 3.0.1
because it contained one duplicate ID (something I could have avoided)
and, more seriously, the way footnotes are handled is not compliant,
and raised several errors like this:
ERROR: …/EPUB.epub/ch003.xhtml(2918,1223): '_footnote_4': fragment identifier is not defined in 'ch003.xhtml'
ERROR: …/EPUB.epub/ch008.xhtml(2065,27): '_footnoteref_1': fragment identifier is not defined in 'ch008.xhtml'

When I convert pandoc's README to epub3, I see no errors
with epubcheck 3.0.1 (and README has several footnotes).

My guess is that the HTML footnotes produced by asciidoctor
are not read by pandoc as native pandoc footnotes, and that
is the underlying issue.

If you attach a short sample file (of HTML produced by
asciidoctor), we could confirm that.

Unfortunately, there's no standard way of doing footnotes
in HTML, so the HTML reader never produces a Note element.

@benhourigan
Copy link

Hope this is a sufficient sample:

<div class="paragraph">
<p>… the kind of politics that the liberal economist F. A. Hayek called &#8220;socialist.&#8221; <span class="footnote">[<a id="_footnoteref_1" class="footnote" href="#_footnote_1" title="View footnote.">1</a>]</span></p>
</div>

@jgm
Copy link
Owner

jgm commented Jul 5, 2015

Could you attach or link to the generated (noncompliant) epub itself?

@jgm
Copy link
Owner

jgm commented Jul 5, 2015

By the way, here's a simple filter (undiv.hs) that will remove your content divs. Run with --filter undiv.hs:

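{-# LANGUAGE OverloadedStrings #-}
import Text.Pandoc.JSON

-- Replace every Div whose only class is "content" with its children.
main :: IO ()
main = toJSONFilter undiv
  where undiv :: Block -> [Block]
        undiv (Div (_, ["content"], _) bs) = bs
        undiv b = [b]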
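
The pattern above only matches a Div whose class list is exactly ["content"]. As a rough sketch of a more tolerant variant, the filter below unwraps any Div that carries one of a list of wrapper classes. The list itself is an assumption: "content" comes from the filter above and "paragraph" from the sample HTML posted earlier, so adjust it to whatever your HTML actually contains. It also assumes a pandoc-types version whose attributes are Text.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Text (Text)
import Text.Pandoc.JSON

-- Wrapper classes to strip (an assumption; extend or trim as needed).
wrapperClasses :: [Text]
wrapperClasses = ["content", "paragraph"]

-- Replace a wrapper Div by its children; leave every other block alone.
unwrap :: Block -> [Block]
unwrap (Div (_, classes, _) bs)
  | any (`elem` wrapperClasses) classes = bs
unwrap b = [b]

main :: IO ()
main = toJSONFilter unwrap
```

Because the filter is applied across the whole document tree, nested wrapper Divs should be unwrapped as well.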

@jgm
Copy link
Owner

jgm commented Jul 5, 2015

Depending on how asciidoc formats the notes, you may be able to get the HTML reader to parse them as notes. If you use -f html+epub_html_exts, then pandoc will interpret an element with the type attribute set to footnote or rearnote as a note, and an element with the type attribute set to noteref as a note reference, where the href attribute is an internal link to the corresponding footnote or rearnote. It looks as if asciidoc doesn't quite do it that way, but you could use a filter to add the needed type attributes, and then you'd be there.

@jgm
Copy link
Owner

jgm commented Jul 5, 2015

you could use a filter to add the needed type attributes, and then you'd be there.

Sorry, this is a bit misleading. Since a filter is applied only after the HTML reader, this wouldn't work unless you first filtered, then piped the resulting HTML into another invocation of pandoc. Anyway, there are numerous tools you could use to insert the type attribute where it's needed in the HTML, before passing to pandoc.

@jgm
Copy link
Owner

jgm commented Jul 5, 2015

Or maybe asciidoctor could be persuaded to insert the needed type attributes in the HTML.

@jgm
Copy link
Owner

jgm commented Jul 5, 2015

Actually, rather than the epub itself, it would be most useful for me to have the HTML from which it was generated.

@benhourigan
Copy link

Thanks for the filter. Will try this out. You can get the HTML file from https://www.dropbox.com/s/c2ror63pz16hc3w/2015-07-06-BH-STG-adoc-test.html?dl=0

@jgm
Copy link
Owner

jgm commented Jul 6, 2015

I tried:

% pandoc adoc-test.html -t epub3 -o adoc.epub
% epubcheck adoc.epub
Epubcheck Version 3.0.1

Validating against EPUB version 3.0
ERROR: adoc.epub: could not parse ch006.xhtml: duplicate id: cracks

Check finished with warnings or errors

So I edited adoc-test.html and changed one of the duplicate cracks ids to cracks2. I then regenerated the epub using pandoc and epubcheck gave no validation errors.
Are you using the latest version of pandoc?

@jgm
Copy link
Owner

jgm commented Jul 6, 2015

PS. You might have more success using asciidoc to produce DocBook, then converting that with pandoc. Have you tried that route?

@benhourigan
Copy link

Ah, damn---I didn't think to do a version check. Sorry for being such a novice. I'd been using 1.13.2, which is the latest version on homebrew. Will install 1.15 and try again. Your results sound promising.

@benhourigan
Copy link

BTW, as of pandoc 1.13.2, when I tried the asciidoctor docbook > pandoc epub route the output from docbook was inferior to the output from HTML. One particular thing that I noticed was that admonition blocks came in to the epub as blockquotes without an additional class, and so couldn't be styled specifically with CSS.

@jgm
Copy link
Owner

jgm commented Jul 6, 2015

Currently DocBook elements like <important>, <caution>,
<note>, <tip> are rendered as a block quote starting with
a single paragraph with the word "Important", "Caution",
"Note", or "Tip" in strong. I'm not sure this is ideal;
we could switch to using divs. However, even with the
present setup, it would be simple to intercept these
block quotes in a filter and change them to divs. Just look
for

BlockQuote (Para [Strong [Str "Important"]] : xs)

and convert that to

Div ("", ["admonition"], []) xs
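
Spelled out as a complete filter, that rewrite might look like the sketch below. It assumes the block quotes have exactly the shape described above and a pandoc-types version whose attributes are Text. Adding the lower-cased label as a second class is an extra not mentioned above, included so that each admonition type can be styled separately.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Text (Text)
import qualified Data.Text as T
import Text.Pandoc.JSON

-- Labels that mark an admonition block quote, as listed above.
labels :: [Text]
labels = ["Important", "Caution", "Note", "Tip"]

-- Turn a labelled block quote into a Div. The label paragraph is dropped;
-- the lower-cased label becomes a second class (an assumption).
admonition :: Block -> Block
admonition (BlockQuote (Para [Strong [Str label]] : xs))
  | label `elem` labels = Div ("", ["admonition", T.toLower label], []) xs
admonition b = b

main :: IO ()
main = toJSONFilter admonition
```

Ordinary block quotes fall through to the second equation and are left unchanged.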


@benhourigan
Copy link

With the duplicate ID sorted out in the .adoc file, pandoc 1.15 produces an epub from the asciidoctor HTML that passes epubcheck! :)

Inability to split the file at chapter headings remains an issue. Now trying the undiv filter. At first I got this error:

undiv.hs: createProcess: runInteractiveProcess: exec: does not exist (No such file or directory)

Then after installing ghc (7.8.4) and cabal-install (1.22.0.0) (not sure if latter was necessary) from homebrew, I got this error.

pandoc: Error running filter undiv.hs
fd:4: hPutBuf: resource vanished (Broken pipe)

Oddly, if I run chmod +x undiv.hs, I go back to getting the first error again.

All files used during generation here: https://www.dropbox.com/s/40uv3f4ad2fwuco/bh-undiv.hs-test.zip?dl=0 I'm running a script called generate.sh, which is just the following command:

pandoc -t epub --filter undiv.hs --epub-cover-image=cover.png --epub-stylesheet=epub.css --epub-metadata=metadata.xml --epub-chapter-level=1 -o EPUB.epub TEST-v32-Justin-Comments-In.html

@jgm
Copy link
Owner

jgm commented Jul 7, 2015

Is undiv.hs in your working directory?

What OS are you on?


@tarleb
Copy link
Collaborator

tarleb commented Sep 20, 2016

Thanks for the feedback @jgm. It's emitting JSON now.

@zaxebo1
Copy link

zaxebo1 commented Sep 20, 2016

Now that you are emitting pandoc JSON, why not integrate huskydoc into pandoc itself?

@tarleb
Copy link
Collaborator

tarleb commented Sep 20, 2016

There are three reasons:

  • Huskydoc, as of now, is incomplete, buggy, and simply not yet up to pandoc's high quality standards.
  • Pandoc is using Parsec, which is well tested and stable, while the asciidoc library is built with Megaparsec, which is modern but unstable. It won't be too difficult to swap Megaparsec for Parsec, but it requires some minor changes which I'd have to make first.
  • I'm using language features and libraries not available on all pandoc supported platforms. Those would have to be removed first.

I am planning to address these issues over the next months. I'd like to polish the library some more before I start to address compatibility issues. Being able to experiment is part of the reason this project exists.

The result of using the huskydoc executable is identical to what would be produced if pandoc were calling the library directly, so the method described above is hopefully acceptable for now.

@hobson
Copy link

hobson commented Nov 29, 2016

+1

@tajmone
Copy link
Contributor

tajmone commented Mar 6, 2017

This thread is very interesting. I'd also love to see pandoc support an AsciiDoc reader.

As for the issue of AsciiDoctor wrapping paragraphs in <div> tags, I remember that I stumbled on the issue in the past and did some research. It's possible to work around this by using custom templates:

84. Provide Custom Templates

Asciidoctor allows you to override the converter methods used to render almost any individual AsciiDoc element.

This can easily be achieved using HAML and asciidoctor-backends, and it lets you change how elements are rendered into HTML by targeting single elements.

For example, to change how paragraphs are formatted in Asciidoctor's final output, you only need to add a modified version of this single file:

https://github.com/asciidoctor/asciidoctor-backends/blob/master/haml/html5/block_paragraph.html.haml

Look inside it:

%div{:id=>@id, :class=>['paragraph', role]}>
  - if title?
    .title=title
  %p<=content

You only need to remove the %div... part and the divs wrapping paragraphs won't be rendered. Somehow those divs were meant for cases where the paragraph had special attributes, but there is no conditional check, so even when they are not required they are still there taking up space.

Another bad thing about these divs is that they make CSS styling really annoying.

The cool thing about using custom templates and backends is that only the files you actually put in your custom template folder will be used; for the missing files it will fall back on the defaults. So there is no need to re-implement the whole template system.

I wish I could share more info on how to do it, but I researched this quite a long time ago and my memory is not fresh on the issue.

It is sad, though, that the AsciiDoctor project has been stuck for so long on development of the chunked (multi-page) HTML output feature; it looks stalled right now.

Asciidoc FX

Those interested in a quick way to convert AsciiDoc to HTML without having to install dependencies (not even AsciiDoc/AsciiDoctor) should look into Asciidoc FX: it's a cross-platform AsciiDoc editor (also available as a standalone app) that can convert AsciiDoc documents to standalone (and templated) HTML5 docs (including syntax highlighting with Highlight.js):

http://www.asciidocfx.com/

It's a Java app that bundles AsciiDoctor and DocBook (no HTML output support though!), plus other tools, thus sparing you from having to install anything. And it doesn't conflict with any locally installed version of AsciiDoctor, AsciiDoc, etc.

I use pandoc to convert to AsciiDoc and then with Asciidoc FX I just open and save as HTML5 --- and I get a fully standalone document, with a nice template.

Hope this might help....

@jgm
Copy link
Owner

jgm commented Mar 7, 2017 via email

@priyadarshan
Copy link

An AsciiDoc reader would be quite useful to many, especially in the humanities, who need to be able to handle quote blocks and verse blocks as separate cases for HTML and LaTeX targets.

As far as I know, only org and asciidoc support that natively, with AsciiDoc having perhaps a more complete syntax.

Unfortunately, AsciiDoc's LaTeX support is not good enough. Pandoc would be ideal.

@jgm
Copy link
Owner

jgm commented Apr 28, 2018 via email

@priyadarshan
Copy link

priyadarshan commented Apr 28, 2018

Thank you for the prompt comment.

The non-profit organisation I work for is maintaining several thousand books (poetry, prose, plays), currently in org-mode, that are converted to LaTeX to make books. We would like to switch to Pandoc for its speed, simplicity and elegance.

I had already thought of doing what you suggested, i.e. using pandoc markdown, > for quotes and ~~~ for verse (later converting the verbatim environment to verse with a post-processing script).

That solution is not too elegant, since verse and quote have two different ways to mark their semantic space:

  • one uses begin/end marks (similar to, but simpler than, org-mode),
  • the other needs to prepend > to each paragraph of a quote.

There is a "mismatch" between the two, which works fine, ultimately, but it would not make sense to the editors of the files. Of course we can tell them to "just do it", but that will detract from the final solution quite a bit.

Looking at lua filters gave me some ideas. Certainly adopting pandoc-markdown has its advantages.

Thank you for the advice.

@priyadarshan
Copy link

priyadarshan commented Apr 28, 2018

I wonder, would it be feasible to extend the fenced_divs syntax to LaTeX, so that it passes on whatever environment one needs?

That would have the additional benefit of allowing any of the countless custom environments available in LaTeX (like cverse, drama, etc.), a veritable boon to people in the humanities.

For example,

::: {.quote}
Here is a paragraph in LaTeX quote environment, or HTML 5 class=quote

Another line
:::

or

::: {.drama}
Here is a paragraph in LaTeX drama environment, or HTML 5 class=drama

Another line
:::

@mb21
Copy link
Collaborator

mb21 commented Apr 29, 2018

@priyadarshan I think jgm meant to use blockquotes (prepend >) for quotes and line blocks (prepend |) for verse.

About rewriting divs in latex: you can already do that with filters. There is also #2106
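
As a rough illustration of the filter route, the sketch below wraps the contents of a Div in a LaTeX environment named after one of its classes, and only when the output format is latex. The whitelist is an assumption taken from the examples in this thread, each environment must actually be defined in your LaTeX setup, and the code assumes a pandoc-types version whose attributes are Text.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Text (Text)
import Text.Pandoc.JSON

-- Div classes to map to LaTeX environments of the same name (an assumption).
environments :: [Text]
environments = ["quote", "drama"]

-- For LaTeX output, surround the Div's contents with \begin{env}/\end{env},
-- using the first class found in the whitelist. Other blocks and other
-- output formats are left untouched.
latexEnv :: Maybe Format -> Block -> [Block]
latexEnv (Just (Format "latex")) (Div (_, classes, _) bs)
  | (env : _) <- filter (`elem` environments) classes =
      [RawBlock "latex" ("\\begin{" <> env <> "}")]
        ++ bs
        ++ [RawBlock "latex" ("\\end{" <> env <> "}")]
latexEnv _ b = [b]

main :: IO ()
main = toJSONFilter latexEnv
```

For HTML output the Div passes through unchanged, so it still ends up as a div carrying the class.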

@priyadarshan
Copy link

@mb21 Thank you, I understood that. As I said, using line block syntax for verse would not be compatible with our guideline of keeping the original text as clean and intact as possible. That is even more so for poetry.

Thank you for the pointer to rewriting divs in LaTeX. That is exactly what we need. I hope that ticket will be completed too; it would be nice to have the same possibility in the LaTeX writer as in the HTML writer.

Apologies for creating noise in this ticket. After all, it is about asciidoc.

Thank you so much to John MacFarlane and developers for the wonderful Pandoc!

@jgm
Copy link
Owner

jgm commented May 1, 2018

Note that content in fenced divs will be parsed as Markdown before it gets to your filter. So, to get line breaks between lines of verse, you'd have to insert them manually (two spaces or a backslash at the end of the line) or enable the +hard_line_breaks extension globally. Another option would be to use a code block and do all the processing in the filter (splitting into lines and parsing as Markdown). For block quotes, it would be quite easy to use a filter to create a fenced syntax as you describe.
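
A minimal sketch of the code-block option, under some loud assumptions: the class name verse is made up for illustration, the code assumes a pandoc-types version whose attributes are Text, and each line is kept as plain words rather than parsed as Markdown (that would need the pandoc library itself, not just pandoc-types). Leading indentation within a line is lost in this simple version.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.List (intersperse)
import qualified Data.Text as T
import Text.Pandoc.JSON

-- Turn a code block with class "verse" (a hypothetical class name) into a
-- line block with one entry per source line. Words are emitted as plain
-- Str elements; inline Markdown inside the verse is NOT interpreted.
verse :: Block -> Block
verse (CodeBlock (_, classes, _) txt)
  | "verse" `elem` classes =
      LineBlock [intersperse Space (map Str (T.words l)) | l <- T.lines txt]
verse b = b

main :: IO ()
main = toJSONFilter verse
```

Code blocks without the class are left as they are.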

If you have further discussion on this topic, though, please move it to pandoc-discuss.

@priyadarshan
Copy link

Thank you so much for the valuable advice.

If you have further discussion on this topic, though, please move it to pandoc-discuss.

Thank you, I will.

@JamesRandom
Copy link

An asciidoc reader would also be very useful to me (to convert from asciidoc to rst)

@tajmone
Copy link
Contributor

tajmone commented Jan 15, 2021

Regarding the DIVs problem mentioned above, I wanted to add that there's a new third party semantic HTML backend for Asciidoctor now (if I remember correctly, it was added after this thread began):

https://github.com/jirutka/asciidoctor-html5s

So this might be a viable workaround for the problems mentioned above concerning ePub creation.

@ciampix
Copy link

ciampix commented Jan 25, 2021

I have started working on this, it might not be ready for a few months.

2014 ... months? ;-)

@mb21
Copy link
Collaborator

mb21 commented Jun 19, 2021

@gmarpons just posted asciidoc-hs on pandoc-discuss:

It is meant to be both a library that some day could be integrated as a
Pandoc dependency (similarly to commonmark-hs) and a separated executable.

@alerque
Copy link
Contributor

alerque commented Jun 20, 2021

@ciampix Open Source Months are a fairly flexible unit of measure, each month being roughly equivalent to one promise on the more commercial 6-8 weeks scale.

@RichardJECooke
Copy link

What feature branch is this issue in the github code please? I can't see any branch related to 1456 or asciidoc.

@jgm
Copy link
Owner

jgm commented Apr 30, 2022

There's this library being developed independently:
https://github.com/gmarpons/asciidoc-hs
If it were finished, we could use this to add asciidoc support to pandoc. You might inquire there about how you might help.

@noraj
Copy link

noraj commented Jun 6, 2022

I'm confused.

https://pandoc.org/demos.html 28.

AsciiDoc:

pandoc -s MANUAL.txt -t asciidoc -o example28.txt

Why does it support only the .txt extension, while passing .adoc triggers the error Unknown input format asciidoc?

Also, if I set example.pdf as the output, it triggers cannot produce pdf output from asciidoc, which is expected if that is not supported. But if I set example.docx, it generates a docx file that contains the AsciiDoc document (same as the txt output) rather than something converted. So if docx is not supported, it should trigger an error too.

@tarleb
Copy link
Collaborator

tarleb commented Jun 6, 2022

Pandoc assumes that files with .txt extension contain Markdown, while those with a .adoc extension contain Asciidoc. However, while pandoc can write Asciidoc, it cannot read it. Hence this issue.

I'm not sure what you are asking in the second paragraph. Please raise this on the pandoc-discuss mailing list. We want to reserve this tracker for bug reports and feature requests, questions should go to the mailing list.
