A prototype for a markup language on serlo.org.
To develop a markup language for serlo.org, we are building a prototype converter between the Serlo API and the markup language. This allows the markup language to be tested with real content by converting it back and forth. The conversion from the markup language to the API will be used to display the edited content.
To achieve these goals, the following tools will be written: a library that handles the conversion and a few command-line tools for interfacing with the library.
Currently, the idea is to use AsciiDoc as the underlying markup language, for two main reasons. First, AsciiDoc naturally allows for content type extensions, so an already defined markup language can be used in its vanilla form. Second, using a vanilla markup language gives rise to the hope that some (external) editors already provide syntax highlighting for it.
Using the command line, one can convert an article from serlo.org to AsciiDoc via the executable serlo2adoc. You can find a manpage for it in the folder doc. You can use cabal to build and install the library and executable. To get an impression of what the program can convert, you can look at the test article, which contains roughly everything that can be converted.
Installation instructions are provided in INSTALL.adoc.
For downloading articles, one can use the helper Python scripts in the folder helper. They take an article ID as their argument and print the article’s contents to standard output. The script fetch-article.py only fetches the article without extracting its contents from the API response, so its output contains more metadata. The script fetch-article-content.py outputs only the article’s contents. When you redirect the output of the latter script to a file, you can feed that file directly into serlo2adoc to see a conversion of the article to AsciiDoc.
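The download step can also be scripted. The following is a minimal sketch, assuming the helper scripts are run from the repository root with the current Python interpreter; the function names fetch_command and save_article are hypothetical and not part of the repository. How the saved file is passed to serlo2adoc is deliberately left out here, since the manpage in doc is the authoritative reference for that.

```python
import subprocess
import sys
from pathlib import Path

# Folder containing the helper scripts, relative to the repository root.
HELPER_DIR = Path("helper")


def fetch_command(article_id: int, content_only: bool = True) -> list[str]:
    """Build the command line for one of the two helper scripts.

    Assumption: the scripts can be run with the current interpreter.
    """
    script = "fetch-article-content.py" if content_only else "fetch-article.py"
    return [sys.executable, str(HELPER_DIR / script), str(article_id)]


def save_article(article_id: int, target: Path) -> None:
    """Run the content helper and store its standard output in a file."""
    result = subprocess.run(
        fetch_command(article_id),
        capture_output=True,
        text=True,
        check=True,  # raise if the helper script exits with an error
    )
    target.write_text(result.stdout, encoding="utf-8")
```

Calling save_article with an article ID and a target path then produces the file that serlo2adoc expects as input.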
The conversion is done in a three-step process. First, the JSON created by the Serlo API is converted to an internal representation of the Serlo content. Second, this internal representation is converted to an internal representation of an AsciiDoc document. Third, the internal AsciiDoc representation is converted to the actual AsciiDoc document. For the conversion in the other direction, the steps are applied in reverse. Hence the implementation consists of two parts: a translation between the Serlo API and the internal model, and a translation between the internal model and the markup language. Graphically:
Serlo API JSON <--> internal Serlo content model <--> internal AsciiDoc model <--> AsciiDoc
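To illustrate the shape of this pipeline, here is a heavily simplified sketch in Python. It is not the library's actual code: all class and function names are hypothetical, and the JSON shape (a rows plugin whose state is a list of text plugins with a plain string as state) is an assumption made for the example. The real text plugin state is considerably more complex.

```python
import json
from dataclasses import dataclass


# Hypothetical stand-ins for the internal Serlo content model.
@dataclass
class SerloText:
    text: str


@dataclass
class SerloRows:
    children: list


# Hypothetical stand-ins for the internal AsciiDoc model.
@dataclass
class AdocParagraph:
    text: str


@dataclass
class AdocDocument:
    blocks: list


def parse_serlo(raw: str) -> SerloRows:
    """Step 1: Serlo API JSON -> internal Serlo content model."""
    data = json.loads(raw)
    return SerloRows([SerloText(child["state"]) for child in data["state"]])


def serlo_to_adoc(rows: SerloRows) -> AdocDocument:
    """Step 2: internal Serlo content model -> internal AsciiDoc model."""
    return AdocDocument([AdocParagraph(child.text) for child in rows.children])


def print_adoc(doc: AdocDocument) -> str:
    """Step 3: internal AsciiDoc model -> AsciiDoc text."""
    return "\n\n".join(block.text for block in doc.blocks) + "\n"


def parse_adoc(text: str) -> AdocDocument:
    """Reverse of step 3: paragraphs are separated by blank lines."""
    return AdocDocument([AdocParagraph(p) for p in text.strip().split("\n\n") if p])


def adoc_to_serlo(doc: AdocDocument) -> SerloRows:
    """Reverse of step 2."""
    return SerloRows([SerloText(block.text) for block in doc.blocks])


def print_serlo(rows: SerloRows) -> str:
    """Reverse of step 1."""
    state = [{"plugin": "text", "state": child.text} for child in rows.children]
    return json.dumps({"plugin": "rows", "state": state})


def serlo_json_to_adoc(raw: str) -> str:
    return print_adoc(serlo_to_adoc(parse_serlo(raw)))


def adoc_to_serlo_json(text: str) -> str:
    return print_serlo(adoc_to_serlo(parse_adoc(text)))
```

Keeping the two internal models separate means the JSON handling and the AsciiDoc handling can be developed and tested independently of each other.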
The Serlo Editor consists of a collection of plugins. Each plugin has a particular state, which determines its content. To track progress, it is therefore useful to list the plugins the translator can handle. These are:
- text (partially, for details see below)
- injection (only from JSON to internal model)
- rows
- spoiler
- image (only from JSON to internal model)
- important (only from JSON to internal model)
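One way to check an article against this list is to walk its decoded JSON and collect every plugin name that is not yet supported. The sketch below assumes that each plugin is encoded as an object with a "plugin" key holding its name and a "state" key holding its content; this shape is an assumption, and the helper names are hypothetical.

```python
# Plugin names taken from the list above.
SUPPORTED = {"text", "injection", "rows", "spoiler", "image", "important"}


def plugin_names(node):
    """Yield the name of every plugin found anywhere in a decoded JSON value.

    Assumption: a plugin is a dict with a string "plugin" key and a "state" key.
    """
    if isinstance(node, dict):
        if isinstance(node.get("plugin"), str) and "state" in node:
            yield node["plugin"]
        for value in node.values():
            yield from plugin_names(value)
    elif isinstance(node, list):
        for item in node:
            yield from plugin_names(item)


def unsupported_plugins(document) -> list:
    """Return the sorted names of all plugins the translator cannot handle yet."""
    return sorted(set(plugin_names(document)) - SUPPORTED)


# "table" is a made-up plugin name used only for illustration.
example = {
    "plugin": "rows",
    "state": [
        {"plugin": "text", "state": "Some text"},
        {"plugin": "table", "state": "| a | b |"},
    ],
}
print(unsupported_plugins(example))
```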
The text plugin is complicated enough to deserve its own treatment: it has a substructure that describes the text formatting. Currently supported are:
- bold/italic text
- headings
- normal text
- paragraphs
- maths
- unordered lists
The internal model to represent an AsciiDoc document is still under heavy development. Regarding the conversion, there are two directions to cover: printing, that is, converting the internal model to an actual document, and parsing, that is, converting the actual document to the internal model. On the printing side, the following parts are working:
- document header
- sections
- paragraph content (at least some parts)
The printable parts of the paragraph content are
- plain text
- bold/italic/highlighted text
- subscript and superscript
- maths
- inline macro calls
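As an illustration of what printing these inline elements involves, the sketch below maps a small, hypothetical inline model to AsciiDoc syntax. The class names are invented for this example and do not mirror the library's internal model. Unconstrained formatting marks are used throughout because they are valid in any position, and the generic stem macro is just one possible choice for maths. Plain text is emitted without escaping special characters, which a real printer would have to handle.

```python
from dataclasses import dataclass


@dataclass
class Plain:
    text: str


@dataclass
class Styled:
    style: str  # "bold", "italic" or "highlight"
    children: list


@dataclass
class Script:
    kind: str  # "sub" or "super"
    children: list


@dataclass
class Maths:
    source: str


@dataclass
class Macro:
    name: str
    target: str = ""
    attributes: str = ""


# Unconstrained AsciiDoc formatting marks.
STYLE_MARKS = {"bold": "**", "italic": "__", "highlight": "##"}
SCRIPT_MARKS = {"sub": "~", "super": "^"}


def print_inline(node) -> str:
    """Print a single inline node as AsciiDoc."""
    if isinstance(node, Plain):
        return node.text  # no escaping of special characters in this sketch
    if isinstance(node, Styled):
        mark = STYLE_MARKS[node.style]
        return mark + print_inlines(node.children) + mark
    if isinstance(node, Script):
        mark = SCRIPT_MARKS[node.kind]
        return mark + print_inlines(node.children) + mark
    if isinstance(node, Maths):
        # A closing bracket inside the macro body has to be escaped.
        return "stem:[" + node.source.replace("]", "\\]") + "]"
    if isinstance(node, Macro):
        return f"{node.name}:{node.target}[{node.attributes}]"
    raise TypeError(f"unknown inline node: {node!r}")


def print_inlines(nodes) -> str:
    """Print a sequence of inline nodes by concatenating their output."""
    return "".join(print_inline(node) for node in nodes)
```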
Some representation for maths content does exist, but it cannot be rendered right now because it is unclear how to render it.
The parser is currently only capable of parsing the header and some blocks. These are
- section
- paragraph (for details see below)
Not all syntax elements of a paragraph are parseable right now. Currently the parser recognises the following syntax elements inside a paragraph
- plain text (anything that is not a special character)
- hard line breaks
- end of a paragraph
- macro calls
- unconstrained bold/italic/monospace/highlighted text
- sub- and superscript text
- inline maths (asciimath and latex)
The implementation of unconstrained bold/italic/monospace text deviates from the specification by allowing the formatting marks to be stacked in any order.
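To make the idea of order-independent stacking concrete, the following sketch parses unconstrained marks recursively, so that any mark may enclose any other. It is an illustration only and not the project's parser: the function name and the result format are hypothetical, and escapes, constrained marks and all other paragraph syntax are ignored.

```python
# Unconstrained AsciiDoc marks and the style they stand for.
MARKS = {"**": "bold", "__": "italic", "``": "monospace", "##": "highlight"}


def parse_inline(text: str) -> list:
    """Parse text into a list of plain strings and (style, children) tuples.

    An opening mark is paired with the next occurrence of the same mark;
    the text in between is parsed recursively, so marks nest in any order.
    An opening mark without a closing partner is kept as plain text.
    """
    nodes = []
    plain = ""
    i = 0
    while i < len(text):
        mark = text[i:i + 2]
        end = text.find(mark, i + 2) if mark in MARKS else -1
        if end != -1:
            if plain:
                nodes.append(plain)
                plain = ""
            nodes.append((MARKS[mark], parse_inline(text[i + 2:end])))
            i = end + 2
        else:
            plain += text[i]
            i += 1
    if plain:
        nodes.append(plain)
    return nodes


# Both stacking orders are accepted.
print(parse_inline("a **__b__** c"))
print(parse_inline("__**b**__"))
```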
Two conversions are needed between the internal models. Right now, there is a partial implementation of the conversion from the internal Serlo model to the internal AsciiDoc model. With it, the following can be converted to AsciiDoc:
- rows
- paragraphs (partially, see below)
- spoilers
- injections
- images
- important
For paragraphs, the following parts can be converted
- plain text
- bold/italic text
- maths
The Serlo editor encoding is not one of the best documented pieces of software. Nevertheless, the source code gives understandable state descriptions for some plugins. Sadly, the text plugin, which is the bread-and-butter plugin for article creation, is so complicated that the editor code itself is not good documentation for the underlying data structure. Hence the API-side implementation of the text plugin is highly exploratory.
Similar caveats apply to the overall code quality. It is exploratory code which probably ignores a number of best practices.