org-to-xml

This is a library to convert Emacs org-mode files to XML. The resulting XML isn’t especially pretty, but that’s not the goal. The goal is a complete and accurate translation of the internal org-mode data structures to XML.

The assumption is that downstream XML processing tools can be used to transform it. I plan to add a few XSLT examples to this repository.
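
As a rough illustration of that kind of downstream processing, here is a minimal Python sketch that pulls the keywords out of a converted file. It assumes output shaped like the sample shown later in this README, including its namespace; the `keywords` helper and the trimmed `SAMPLE` document are hypothetical and are not part of this repository.

```python
import xml.etree.ElementTree as ET

# Namespace copied from the sample output in this README.
NS = {"o": "https://nwalsh.com/ns/org-to-xml"}

# A trimmed-down document in the shape of the sample output (illustrative only).
SAMPLE = """<?xml version="1.0"?>
<org-data xmlns="https://nwalsh.com/ns/org-to-xml"><section>
<keyword key="TITLE">Some Title</keyword>
<keyword key="AUTHOR">Norman Walsh</keyword>
<keyword key="DATE">2019-02-19</keyword>
</section></org-data>"""


def keywords(xml_text):
    """Return a dict mapping each keyword's key attribute to its text."""
    root = ET.fromstring(xml_text)
    return {k.get("key"): (k.text or "") for k in root.iterfind(".//o:keyword", NS)}


metadata = keywords(SAMPLE)
print(metadata["TITLE"])
```

The same extraction could just as well be written in XSLT or XQuery; the point is only that the document metadata stays addressable once it is in XML.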

For the curious, here’s how it works.

Consider an org file:

#+TITLE: Some Title
#+AUTHOR: Norman Walsh
#+DATE: 2019-02-19

A paragraph with <markup> in it. This isn’t intended to be meaningful
or useful.

* First level heading
  :PROPERTIES:
  :CUSTOM_ID: first
  :END:

** TODO This is an example TODO item.
   DEADLINE: <2019-02-26 Tue +1w>
   :PROPERTIES:
   :CREATED:  [2019-02-19 Tue 06:39]
   :SRC:      [[file:/projects/emacs/org-to-xml/README.md::For%20the%20curious,%20here%E2%80%99s%20how%20it%20works.]]
   :END:

See [[https://orgmode.org/][org-mode]] for more information about ~org-mode~.

  1. First, it’s parsed by org-element-parse-buffer. The result, swaths of which I’ve elided, looks like this:
    (org-data nil (section (:begin 1 :end 146 :contents-begin 1
    :contents-end 145 :post-blank 1 :post-affiliated 1 :parent #0)
    (keyword (:key "TITLE" :value "Some Title" :begin 1 :end 21
    :post-blank 0 :post-affiliated 1 :parent #1))
    (keyword (:key "AUTHOR" :value "Norman Walsh" :begin 21 :end 44
    :post-blank 0 :post-affiliated 21 :parent #1))
    (keyword (:key "DATE" :value "2019-02-19" :begin 44 :end 64
    :post-blank 1 :post-affiliated 44 :parent #1))
    (paragraph (:begin 64 :end 145 :contents-begin 64 :contents-end 145
    :post-blank 0 :post-affiliated 64 :parent #1) #("A paragraph with
    <markup> in it. This isn’t intended to be meaningful or useful. " 0 81
    (:parent #2))))
    (headline (:raw-value "First level heading" :begin 146 :end 544
    :pre-blank 0 :contents-begin 168 :contents-end 544 :level 1 :priority
    …
    (link (:type "https" :path "//orgmode.org/" :format bracket
    :raw-link "https://orgmode.org/" :application nil :search-option nil
    :begin 470 :end 505 :contents-begin 494 :contents-end 502 :post-blank
    1 :parent #4) #("org-mode" 0 8 (:parent #5))) #("for more information
    about " 0 27 (:parent #4)) (code (:value "org-mode" :begin 532 :end
    542 :post-blank 0 :parent #4)) #(". " 0 2 (:parent #4)))))
        
  2. We set up a buffer to store the XML, then walk over this data structure, emitting XML elements for each sub-expression. The node properties become attributes, except for properties listed in org-to-xml-ignore-symbols and properties that come from a properties drawer, which are ignored.
  3. Finally, we do a little post-processing cleanup on the XML:
    • Replace occurrences of <tag …></tag> with <tag …/>.
    • Remove leading spaces from <paragraph> elements.
    • Un-indent code blocks so that they begin on the left margin.

    And then save the file, swaths of which I have also elided.

    <?xml version="1.0"?>
    <!-- Converted from org-mode to XML by org-to-xml version 0.0.3 -->
    <!-- See https://github.com/ndw/org-to-xml -->
    <org-data xmlns="https://nwalsh.com/ns/org-to-xml"><section>
    <keyword key="TITLE">Some Title</keyword>
    <keyword key="AUTHOR">Norman Walsh</keyword>
    <keyword key="DATE">2019-02-19</keyword>
    <paragraph>A paragraph with &lt;markup&gt; in it. This isn’t
    intended to be meaningful or useful.
    </paragraph></section>
    <headline level="1"><title>First level heading</title>
    …
    <link type="https" path="//orgmode.org/" format="bracket"
    raw-link="https://orgmode.org/">org-mode</link> for more information
    about <code>org-mode</code>.
    </paragraph></section></headline></headline></org-data>
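
The walk in step 2 and the first cleanup in step 3 can be sketched in a few lines. This is a Python illustration of the technique, not the elisp in org-to-xml.el: the tree is modelled as `(type, properties, children)` tuples, `IGNORE` is a hypothetical stand-in for org-to-xml-ignore-symbols, and dropping properties whose value is nil is an assumption inferred from the sample output above.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Hypothetical stand-in for org-to-xml-ignore-symbols; the real list is in org-to-xml.el.
IGNORE = {"begin", "end", "contents-begin", "contents-end",
          "post-blank", "pre-blank", "post-affiliated", "parent"}


def emit(node, out):
    """Append XML for node (a string or a (type, props, children) tuple) to out."""
    if isinstance(node, str):
        out.append(escape(node))  # text nodes: escape &, < and >
        return
    tag, props, children = node
    # Properties become attributes unless ignored or empty (None stands in for nil).
    attrs = "".join(
        f" {name}={quoteattr(str(value))}"
        for name, value in props.items()
        if name not in IGNORE and value is not None
    )
    out.append(f"<{tag}{attrs}>")
    for child in children:
        emit(child, out)
    out.append(f"</{tag}>")


def to_xml(tree):
    """Walk the tree, then collapse <tag ...></tag> into <tag .../>."""
    out = []
    emit(tree, out)
    xml = "".join(out)
    return re.sub(r"<([\w-]+)((?:\s[^<>]*)?)></\1>", r"<\1\2/>", xml)


tree = ("paragraph", {"begin": 64, "end": 145}, [
    "See ",
    ("link", {"type": "https", "path": "//orgmode.org/",
              "format": "bracket", "application": None}, ["org-mode"]),
    " for <more>.",
    ("line-break", {"post-blank": 0}, []),
])
print(to_xml(tree))
```

Collecting the pieces in a list and joining them at the end plays the role of the XML buffer; the empty-element cleanup then runs as a single regular-expression pass over the finished text.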
        

It’s been twenty years since I tried to do anything much more interesting than a keybinding in elisp. I expect the code, especially the tree walking, is embarrassingly crude. Suggestions for improvement, or simply pointers to the bits of the elisp manual I should read again, most humbly solicited.

I also confess, I’m completely winging it on current function naming/namespacing conventions.

Pros and Cons

There are two obvious ways to approach the problem of converting .org files to .xml.

  1. Use the ox framework.
  2. Do it the hard way.

My goal in this project is to have a complete dump of the org structures in XML. That rules out the ox framework. The ox framework is definitely the place to start if you want to convert from an unknown org file and extract the information that you know about. But it flattens structures like the property drawer so that it’s impossible to extract everything with fidelity, even the things you don’t know about.

So this code attempts to do it the hard way. But I’m also lying when I say I want a complete dump of the org structures. I want a dump of the meaningful structures. One person’s meaning is another person’s pointless cruft, however.

Examples of structures I don’t consider meaningful:

  • The pre-blank and post-blank properties that the org data structures use to encode spaces in some circumstances.
  • Leading blanks in code blocks.
  • Leading spaces in paragraphs.

It’s likely that this list will grow as I learn more about the org-mode data structures. Unless I give up on this project altogether, of course.
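
To make the whitespace items concrete, here is a small Python sketch of the two normalizations. It is an illustration only, not the elisp the library uses. The helper names are hypothetical, and stripping leading spaces from every line of a paragraph (rather than only the first) is an assumption.

```python
import textwrap


def clean_paragraph(text):
    """Drop leading spaces from each line of a paragraph."""
    return "\n".join(line.lstrip(" ") for line in text.split("\n"))


def clean_code(text):
    """Shift a code block to the left margin, keeping its relative indentation."""
    return textwrap.dedent(text)


print(clean_paragraph("  A paragraph with\n  leading spaces."))
print(clean_code("    (defun f ()\n      (g))\n"))
```

In both cases the discarded whitespace reflects how the org file happened to be indented rather than anything about the content, which is why it is treated as not meaningful here.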
