What is this?
I started working on pandoc before I knew much Haskell, and before there were many Haskell libraries available. In retrospect, I regret some of the early design decisions. This repository is a place to explore some architectural improvements.
So far, there's a definition of the basic data structure, a builder DSL, a markdown reader, and an HTML writer. The package includes an executable, `pandoc2`; `pandoc2 --help` will give usage instructions. With the `--strict` flag, the program passes all of the tests from the Markdown test suite.
The following pandoc markdown extensions have been implemented:
- smart typography (enable with
- delimited code blocks
- markdown inside HTML block-level tags
- TeX math
- fancy list markers
- automatic header identifiers
- definition lists
There are a few changes in how lists work. The most important is that changes in style now trigger a new list. The following is one list in pandoc, and two lists in pandoc2:

    + one
    + two
    - three
    - four

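
The rule above can be pictured as a grouping step. This is only an illustrative sketch, assuming the list items have already been recognized together with their bullet character; `Item` and `splitLists` are hypothetical names, not part of pandoc2.

```haskell
import Data.Function (on)
import Data.List (groupBy)

-- | A bullet list item: its marker character and its text.
-- (Hypothetical type, for illustration only.)
data Item = Item { marker :: Char, body :: String }
  deriving (Show, Eq)

-- | Split a run of consecutive items into separate lists
-- whenever the marker style changes.
splitLists :: [Item] -> [[Item]]
splitLists = groupBy ((==) `on` marker)
```

Applied to the four items above, `map (map body) . splitLists` yields `[["one","two"],["three","four"]]`: two lists rather than one.
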
Some differences from pandoc 1
We now use sequences of `Inline` and `Block` elements instead of lists. This makes sense for text, since appending to the end of a `Sequence` is computationally cheap. These sequences are wrapped in newtypes, `Inlines` and `Blocks`. Thus, the `Emph` constructor now has the type `Inlines -> Inline` rather than `[Inline] -> Inline`.
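
A minimal sketch of this shape, assuming `Data.Sequence` from the containers package. The constructor set is cut down, `String` stands in for the real text type, and the field name `unInlines` and helper `snoc` are hypothetical.

```haskell
import Data.Sequence (Seq, (|>))
import qualified Data.Sequence as Seq

-- | A reduced inline type; the real one has more constructors.
data Inline
  = Str String
  | Space
  | Emph Inlines        -- takes Inlines, not [Inline]
  deriving (Show, Eq)

-- | Newtype wrapper around a sequence of inlines.
newtype Inlines = Inlines { unInlines :: Seq Inline }
  deriving (Show, Eq)

-- | Append one element at the end. This is cheap for a Seq,
-- whereas appending to the end of a list is linear in its length.
snoc :: Inlines -> Inline -> Inlines
snoc (Inlines xs) x = Inlines (xs |> x)

-- | "hello *world*" built by repeated appends at the end.
example :: Inlines
example =
  Inlines Seq.empty
    `snoc` Str "hello"
    `snoc` Space
    `snoc` Emph (Inlines (Seq.singleton (Str "world")))
```
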
`mappend` is defined for `Inlines` in a way that builds in normalization: for example, if you append an `Inlines` that begins with a space onto an `Inlines` that ends with a space, there will be only one space. Similarly, adjacent Inlines will be merged, and so on.
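
One way such a normalizing append could look is sketched below, with simplified types. The collapsing of two spaces follows the description above; merging adjacent `Str` and adjacent `Emph` elements is an assumption made for the example, and `fromList` is a hypothetical helper.

```haskell
import Data.Sequence (Seq, ViewL (..), ViewR (..), viewl, viewr, (><))
import qualified Data.Sequence as Seq

data Inline = Str String | Space | Emph Inlines
  deriving (Show, Eq)

newtype Inlines = Inlines { unInlines :: Seq Inline }
  deriving (Show, Eq)

instance Semigroup Inlines where
  Inlines xs <> Inlines ys =
    case (viewr xs, viewl ys) of
      (EmptyR, _)           -> Inlines ys
      (_, EmptyL)           -> Inlines xs
      -- Only the two elements at the seam need to be inspected.
      (xs' :> x, y :< ys')  -> Inlines (xs' >< merge x y >< ys')
    where
      merge Space Space       = Seq.singleton Space             -- one space, not two
      merge (Str a) (Str b)   = Seq.singleton (Str (a ++ b))    -- assumed rule
      merge (Emph a) (Emph b) = Seq.singleton (Emph (a <> b))   -- assumed rule
      merge a b               = Seq.fromList [a, b]

instance Monoid Inlines where
  mempty = Inlines Seq.empty

-- | Convenience constructor (hypothetical helper).
fromList :: [Inline] -> Inlines
fromList = Inlines . Seq.fromList
```

With these definitions, `fromList [Str "a", Space] <> fromList [Space, Str "b"]` equals `fromList [Str "a", Space, Str "b"]`.
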
The individual inline and block parsers return an `Inlines` or `Blocks` instead of an `Inline` or `Block`; this allows them to return nothing, or multiple elements, where before we had to return a single element. (So, for example, `mempty` instead of a
`Text` is used throughout instead of
The input text is tokenized, and the tokens fed to the parser. This makes the parsers simpler in some cases (especially in handling line endings) and seems to boost performance. Tabs are converted in the tokenization phase.
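
A rough sketch of a tokenizer of this kind follows. The token type, the tab stop of 4, and the function names are assumptions made for the example, not pandoc2's actual definitions. Tabs are expanded to the next tab stop and all line-ending styles collapse to a single newline token, so the parsers downstream never see `\t` or `\r`.

```haskell
-- | Hypothetical token type: words, runs of spaces, and line endings.
data Token
  = TStr String
  | TSpaces Int
  | TNewline
  deriving (Show, Eq)

-- | Assumed tab stop width.
tabStop :: Int
tabStop = 4

-- | Tokenize input, tracking the column so tabs can be expanded.
tokenize :: String -> [Token]
tokenize = go 0
  where
    go _ [] = []
    go _ ('\r' : '\n' : cs) = TNewline : go 0 cs
    go _ ('\n' : cs)        = TNewline : go 0 cs
    go _ ('\r' : cs)        = TNewline : go 0 cs
    go col cs@(c : _)
      | isBlank c =
          let (ws, rest) = span isBlank cs
              col'       = foldl advance col ws
          in  TSpaces (col' - col) : go col' rest
      | otherwise =
          let (word, rest) = break (`elem` " \t\r\n") cs
          in  TStr word : go (col + length word) rest

    isBlank c = c == ' ' || c == '\t'

    -- A tab moves to the next multiple of tabStop; a space moves one column.
    advance n '\t' = n + tabStop - (n `mod` tabStop)
    advance n _    = n + 1
```

For instance, `tokenize "a\tb\r\nc"` gives `[TStr "a", TSpaces 3, TStr "b", TNewline, TStr "c"]`.
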
IO actions are now possible in the parsers. This should make it possible to handle things like LaTeX `\include`. But it is also possible for the user to run the parsers in a pure monad. (See the
It is also now easy to issue warnings and informational messages during parsing, to alert the user if information is being lost, for example.
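
The idea of running the same parser either in `IO` or purely, with warnings available in both, can be sketched with a small type class. Everything here (`ParserEnv`, `fetchFile`, `warn`, `Pure`, `expandInclude`) is a hypothetical name chosen for illustration and should not be read as pandoc2's interface.

```haskell
import Data.List (stripPrefix)
import System.IO (hPutStrLn, stderr)

-- | Operations a parser may need from its environment.
class Monad m => ParserEnv m where
  fetchFile :: FilePath -> m String   -- e.g. for \include
  warn      :: String -> m ()         -- warnings and informational messages

-- | In IO, files are really read and warnings go to stderr.
instance ParserEnv IO where
  fetchFile = readFile
  warn msg  = hPutStrLn stderr ("warning: " ++ msg)

-- | A pure alternative: a result plus the warnings collected so far.
newtype Pure a = Pure { runPure :: (a, [String]) }

instance Functor Pure where
  fmap f (Pure (a, w)) = Pure (f a, w)

instance Applicative Pure where
  pure a = Pure (a, [])
  Pure (f, w1) <*> Pure (a, w2) = Pure (f a, w1 ++ w2)

instance Monad Pure where
  Pure (a, w) >>= k = let Pure (b, w') = k a in Pure (b, w ++ w')

-- | No file access in pure mode: record a warning and return nothing.
instance ParserEnv Pure where
  fetchFile path = do
    warn ("cannot read " ++ path ++ " in pure mode; skipping")
    return ""
  warn msg = Pure ((), [msg])

-- | A toy parsing step: replace an \include{path} line by the file contents.
expandInclude :: ParserEnv m => String -> m String
expandInclude line =
  case stripPrefix "\\include{" line of
    Just rest | (path, "}") <- span (/= '}') rest -> fetchFile path
    _ -> return line
```

Run purely, `runPure (expandInclude "\\include{ch1.tex}")` returns an empty string together with one warning, so the caller can tell that information was lost. At type `IO String`, the same function reads the file.
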
The old markdown parser made two passes: one to get a list of references, and another to parse the document using this list of references. The new parser makes just one pass, and fills in the references at the end.
The old parser handled embedded blocks (block quotations, sublists) by first parsing out a "raw" chunk of text (omitting opening `>`'s and indentation, for example), then parsing this raw text using block parsers. The new parser avoids the need for multiple passes by storing an "endline" and "block separator" parser in state.
The old parser required space after block elements, so that newlines would generally have to be added to the input. The new parser does not.
blaze-html is now used (instead of the old xhtml package) for HTML generation.
The code is cleaner and shorter.
Performance is significantly faster than pandoc, even with the
`resolveRefs` was made much faster by hand-coding it instead of using generics. A further improvement was gained by removing it entirely, and having the parsers return functions from references to values, which are then run at the end of parsing.
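
The "functions from references to values" approach might look roughly like this. The sketch assumes `Data.Map` from the containers package and uses plain lists and `String` for brevity; `Refs`, `Deferred`, `refLink`, `plain`, and `resolve` are hypothetical names, and the fallback for undefined labels is an assumption.

```haskell
import qualified Data.Map as M

-- | Reference definitions collected during the single pass: label -> URL.
type Refs = M.Map String String

data Inline
  = Str String
  | Link String String   -- link text, URL
  deriving (Show, Eq)

-- | What a parser returns: a value still waiting for the full reference map.
type Deferred a = Refs -> a

-- | A reference-style link such as [text][label]. The lookup happens only
-- when the function is finally applied, so the definition may appear later
-- in the document than the link itself.
refLink :: String -> String -> Deferred [Inline]
refLink txt label refs =
  case M.lookup label refs of
    Just url -> [Link txt url]
    Nothing  -> [Str ("[" ++ txt ++ "][" ++ label ++ "]")]  -- assumed fallback

-- | Ordinary text does not depend on the references at all.
plain :: String -> Deferred [Inline]
plain s _ = [Str s]

-- | Run at the end of parsing, once every reference has been seen.
resolve :: Refs -> [Deferred [Inline]] -> [Inline]
resolve refs = concatMap ($ refs)
```

Because each parser result is just a function, no separate traversal of the finished document is needed: applying the functions to the final map is the resolution step.
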
To run the Markdown test suite, do `make test`. To run the PHP Markdown test suite, do `make phptests`. Several of the PHP tests will fail, but in these cases I disagree about what behavior is normative.