This is a library to convert Emacs org-mode files to XML. The
resulting XML isn’t especially pretty, but that’s not the goal. The
goal is a complete and accurate translation of the internal
data structures to XML.
The assumption is that downstream XML processing tools can be used to transform it. I plan to add a few XSLT examples to this repository.
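As a rough illustration of that downstream step, here is a minimal Python sketch (not part of this library) that reads a fragment shaped like the sample output shown later in this README and collects the keyword values. The namespace URI and element names are copied from that sample; the function name `keywords` and the trimmed `SAMPLE` fragment are made up for the example.

```python
import xml.etree.ElementTree as ET

# Namespace URI as it appears in the sample output in this README.
NS = {"o": "https://nwalsh.com/ns/org-to-xml"}

# A hand-trimmed fragment in the shape of the converter's output (illustrative only).
SAMPLE = """<org-data xmlns="https://nwalsh.com/ns/org-to-xml"><section>
<keyword key="TITLE">Some Title</keyword>
<keyword key="AUTHOR">Norman Walsh</keyword>
<keyword key="DATE">2019-02-19</keyword>
</section></org-data>"""


def keywords(xml_text):
    """Return a dict mapping each keyword's key attribute to its text content."""
    root = ET.fromstring(xml_text)
    return {kw.get("key"): kw.text for kw in root.iterfind(".//o:keyword", NS)}


if __name__ == "__main__":
    print(keywords(SAMPLE))
```

Any XML toolchain would do here; the point is only that the document metadata is reachable with ordinary namespace-aware queries.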
For the curious, here’s how it works.
Consider an org file:
    #+TITLE: Some Title
    #+AUTHOR: Norman Walsh
    #+DATE: 2019-02-19

    A paragraph with <markup> in it. This isn’t intended to be meaningful or useful.

    * First level heading
    :PROPERTIES:
    :CUSTOM_ID: first
    :END:

    ** TODO This is an example TODO item.
    DEADLINE: <2019-02-26 Tue +1w>
    :PROPERTIES:
    :CREATED: [2019-02-19 Tue 06:39]
    :SRC: [[file:/projects/emacs/org-to-xml/README.md::For%20the%20curious,%20here%E2%80%99s%20how%20it%20works.]]
    :END:

    See [[https://orgmode.org/][org-mode]] for more information about ~org-mode~.
- First, it’s parsed by org-element-parse-buffer. Here is the result,
swaths of which I’ve elided:
    (org-data nil
      (section (:begin 1 :end 146 :contents-begin 1 :contents-end 145 :post-blank 1 :post-affiliated 1 :parent #0)
        (keyword (:key "TITLE" :value "Some Title" :begin 1 :end 21 :post-blank 0 :post-affiliated 1 :parent #1))
        (keyword (:key "AUTHOR" :value "Norman Walsh" :begin 21 :end 44 :post-blank 0 :post-affiliated 21 :parent #1))
        (keyword (:key "DATE" :value "2019-02-19" :begin 44 :end 64 :post-blank 1 :post-affiliated 44 :parent #1))
        (paragraph (:begin 64 :end 145 :contents-begin 64 :contents-end 145 :post-blank 0 :post-affiliated 64 :parent #1)
          #("A paragraph with <markup> in it. This isn’t intended to be meaningful or useful. " 0 81 (:parent #2))))
      (headline (:raw-value "First level heading" :begin 146 :end 544 :pre-blank 0 :contents-begin 168 :contents-end 544 :level 1 :priority
      …
      (link (:type "https" :path "//orgmode.org/" :format bracket :raw-link "https://orgmode.org/" :application nil :search-option nil :begin 470 :end 505 :contents-begin 494 :contents-end 502 :post-blank 1 :parent #4)
        #("org-mode" 0 8 (:parent #5)))
      #("for more information about " 0 27 (:parent #4))
      (code (:value "org-mode" :begin 532 :end 542 :post-blank 0 :parent #4))
      #(". " 0 2 (:parent #4)))))
- We set up a buffer to store the XML, then walk over this data structure
emitting XML elements for each sub-expression. The node properties become
attributes, except for the properties listed in
org-to-xml-ignore-symbols and properties that come from a properties drawer, both of which are ignored.
- Finally, we do a little post-processing cleanup on the XML:
  - Replace occurrences of
  - Remove leading spaces from
  - Un-indent code blocks so that they begin on the left margin.
And then we save the file. Here is the result, swaths of which I have also elided:
    <?xml version="1.0"?>
    <!-- Converted from org-mode to XML by org-to-xml version 0.0.3 -->
    <!-- See https://github.com/ndw/org-to-xml -->
    <org-data xmlns="https://nwalsh.com/ns/org-to-xml"><section>
    <keyword key="TITLE">Some Title</keyword>
    <keyword key="AUTHOR">Norman Walsh</keyword>
    <keyword key="DATE">2019-02-19</keyword>
    <paragraph>A paragraph with &lt;markup&gt; in it. This isn’t intended to be meaningful or useful.
    </paragraph></section>
    <headline level="1"><title>First level heading</title>
    …
    <link type="https" path="//orgmode.org/" format="bracket" raw-link="https://orgmode.org/">org-mode</link> for more information about <code>org-mode</code>.
    </paragraph></section></headline></headline></org-data>
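The tree walk described above can be sketched in a few lines. This is a Python illustration of the idea, not the library's elisp: nodes are modelled as hypothetical (type, properties, children) tuples, the `IGNORE` set is a made-up stand-in for org-to-xml-ignore-symbols, and dropping nil-valued properties is an assumption based on the sample output, where attributes such as application do not appear.

```python
from xml.sax.saxutils import escape, quoteattr

# Hypothetical stand-in for org-to-xml-ignore-symbols; the real list lives in the elisp.
IGNORE = {"begin", "end", "contents-begin", "contents-end",
          "post-blank", "post-affiliated", "parent"}


def to_xml(node):
    """Serialize a (type, props, children) tuple, or a plain string, as XML."""
    if isinstance(node, str):
        # Text nodes only need escaping.
        return escape(node)
    name, props, children = node
    # Properties become attributes unless ignored or nil (None here).
    attrs = "".join(
        f" {key}={quoteattr(str(value))}"
        for key, value in props.items()
        if key not in IGNORE and value is not None
    )
    body = "".join(to_xml(child) for child in children)
    return f"<{name}{attrs}>{body}</{name}>"


# A small hand-built tree, loosely modelled on the parse shown above.
tree = ("paragraph", {"begin": 64, "end": 145, "post-blank": 0},
        ["See ",
         ("link", {"type": "https", "raw-link": "https://orgmode.org/",
                   "application": None, "begin": 470}, ["org-mode"]),
         " for <more>."])

if __name__ == "__main__":
    print(to_xml(tree))
```

The sketch leaves out everything specific to org, such as turning a keyword's value into element content or handling property drawers; it only shows the recursive shape of the walk.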
It’s been twenty years since I tried to do anything much more interesting than a keybinding in elisp. I expect the code, especially the tree walking, is embarrassingly crude. Suggestions for improvement, or simply pointers to the bits of the elisp manual I should read again, most humbly solicited.
I also confess, I’m completely winging it on current function naming/namespacing conventions.
Pros and Cons
There are two obvious ways to approach the problem of converting .org files to .xml.
- Use the ox framework.
- Do it the hard way.
My goal in this project is to have a complete dump of the org
structures in XML. That rules out the ox framework. The ox
framework is definitely the place to start if you want to convert
an arbitrary org file and extract the information that you know about.
But it flattens structures like the property drawer, so it’s
impossible to extract everything with fidelity, including the things
you don’t know about.
So this code attempts to do it the hard way. But I’m also lying when I say I want a complete dump of the org structures. I want a dump of the meaningful structures. One person’s meaning is another person’s pointless cruft, however.
Examples of structures I don’t consider meaningful:
- post-blank properties that the org data structures use to encode spaces in some circumstances.
- Leading blanks in code blocks.
- Leading spaces in paragraphs.
It’s likely that this list will grow as I learn more about the org-mode data structures. Unless I give up on this project altogether, of course.
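To make the point about leading blanks in code blocks concrete, here is a small Python sketch of un-indenting a block so that it begins on the left margin. It is not the library's elisp, and the helper name `unindent` is hypothetical; it assumes that removing the whitespace prefix shared by all non-blank lines is what is wanted.

```python
import textwrap


def unindent(block):
    """Strip the whitespace prefix shared by every non-blank line of block."""
    return textwrap.dedent(block)


# A source block as it might appear indented under a heading.
src = "    (defun hello ()\n      (message \"hi\"))\n"

if __name__ == "__main__":
    print(unindent(src), end="")
```

Only the common indentation is removed, so the relative indentation inside the block survives.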