# Convert old blog posts into JSON

Convert my old blog posts, written in reST and published using Pelican, into a JSON format suitable for importing with `manage.py import_blog_json`. See [Simon's writeup](https://simonwillison.net/2017/Nov/4/import-refs/) for info on the technique.

## Handle custom directives

I used a bunch of custom reST directives on my previous blog; all this code exists to convert them to HTML for storage.

#### Tweet directive to HTML

Convert `.. tweet::` to the relevant embed HTML.

In [1]:
from docutils import nodes
from docutils.parsers import rst
from docutils.parsers.rst import directives

# FIXME: the <script> tag should be in the extra_head so it's not repeated multiple times

TWEET_TEMPLATE = '''
    <blockquote {args}>
        <a href="{url}">{url}</a>
    </blockquote>
    <script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>
'''

class TweetDirective(rst.Directive):
    name = 'tweet'
    has_content = False
    required_arguments = 1
    optional_arguments = 0
    final_argument_whitespace = False
    option_spec = {
        'align': lambda arg: directives.choice(arg, ('left', 'center', 'right')),
        'conversation': directives.flag,
        'class': directives.class_option
    }

    def run(self):
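        # class_option yields a list of class names; join it for the attribute.
        classes = ' '.join(self.options.get('class', ['twitter-tweet']))
        args = [
            'class="%s"' % classes,
            'align="%s"' % self.options.get('align', 'center')
        ]
        # directives.flag stores None when the flag is present, so test for the key.
        if 'conversation' not in self.options:
            args.append('data-conversation="none"')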
        t = TWEET_TEMPLATE.format(args=' '.join(args), url=self.arguments[0])
        return [nodes.raw('', t, format='html')]

directives.register_directive('tweet', TweetDirective)

#### Raw HTML directive

Sphinx supports a `.. html::` directive, which isn't in core Docutils. However, Docutils does have a `.. raw:: html` directive, so I can emulate `.. html::` with a simple subclass of the raw directive:

In [2]:
from docutils.parsers.rst.directives.misc import Raw

class HTMLDirective(Raw):
    required_arguments = 0
    
    def run(self):
        self.arguments = ['html']
        return super().run()

directives.register_directive('html', HTMLDirective)

#### Ignore the "comment" directive

My old blog had a custom comment directive for rendering comments. I'm not porting those over, so just ignore that directive.

In [3]:
class IgnoredDirective(rst.Directive):
    has_content = True
    def run(self):
        return []

directives.register_directive("comment", IgnoredDirective)

## Convert entries to JSON

OK, special reST handling complete; let's do this thing.

In [4]:
import pathlib
import json
import docutils.core

In [11]:
OLD_ENTRIES_DIR = pathlib.Path('~/c/jacobian.org/content').expanduser()

My old blog entries store metadata (slug, date, etc.) as reST frontmatter, which shows up as a "docinfo" element in the parsed doctree. This converts that into a dict:

In [6]:
def extract_meta_from_docinfo(doctree):
    try:
        docinfo = next(c for c in doctree.children if isinstance(c, docutils.nodes.docinfo))
    except StopIteration:
        return {}
    
    meta = {}
    
    # There are two ways a piece of metadata can show up in the docinfo:
    #     (a) as a plain field : <date>2016...</date>
    #     (b) as a nested thing: <field><field_name>foo</field_name><field_body>bar</field_body></field>
    # This supports both.
    for c in docinfo.children:
        if c.tagname == 'field':
            meta[c.children[0].astext()] = c.children[1].astext()
        else:
            meta[c.tagname] = c.astext()
    return meta
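
To make the two docinfo shapes concrete, here's a rough sketch that parses a made-up post and repeats the same loop inline. The sample text and its field values are invented for illustration. It relies on `date` being one of the bibliographic fields Docutils recognizes, while `slug` is not and so stays a generic field:

```python
import docutils.core
from docutils import nodes

# Made-up post for illustration; not taken from the real blog.
SAMPLE = """\
A sample post
=============

:date: 2016-05-01
:slug: a-sample-post

Body text.
"""

doctree = docutils.core.publish_doctree(source=SAMPLE)
docinfo = next(c for c in doctree.children if isinstance(c, nodes.docinfo))

meta = {}
for c in docinfo.children:
    if c.tagname == 'field':
        # Generic field: first child is the name, second is the body.
        meta[c.children[0].astext()] = c.children[1].astext()
    else:
        # Recognized bibliographic field: the tag name is the key.
        meta[c.tagname] = c.astext()

# Tag names of the docinfo children, in document order.
shapes = [c.tagname for c in docinfo.children]
print(shapes)
print(meta)
```

Printing `shapes` is a quick way to see which of an entry's fields Docutils treated as bibliographic and which it left as generic fields.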

Convert entries:

In [13]:
entries = []

for entry_path in OLD_ENTRIES_DIR.glob('**/*.rst'):
    if not entry_path.parent.name.isdigit():
        continue
    
    text = entry_path.read_text()
    meta = extract_meta_from_docinfo(docutils.core.publish_doctree(source=text))
    parts = docutils.core.publish_parts(
        source=text, 
        writer_name='html', 
        settings_overrides={'initial_header_level': 3},
    )
    entries.append({
        'type': 'entry',
        'body': parts['body'],
        'title': parts['title'],
        'slug': meta['slug'],
        'datetime': meta['date'],
        'tags': [],
        'import_ref': 'old-blog:' + str(entry_path.relative_to(OLD_ENTRIES_DIR)),
        'source': text,
        'source_type': 'reStructuredText'
    })
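
The directory filter and the `import_ref` format are easy to check in isolation. This sketch builds a throwaway tree with a hypothetical layout (one dated post, one non-dated page) and applies the same `isdigit()` test and `relative_to()` call as the loop above. The file names are made up, and it uses `as_posix()` instead of `str()` only so the result looks the same on Windows:

```python
import pathlib
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    root = pathlib.Path(tmp)

    # Hypothetical layout: dated posts live in numeric directories, pages don't.
    for rel in ('2016/05/post.rst', 'pages/about.rst'):
        path = root / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text('placeholder')

    # Keep only files whose immediate parent directory is all digits,
    # then build the import_ref from the path relative to the root.
    import_refs = sorted(
        'old-blog:' + p.relative_to(root).as_posix()
        for p in root.glob('**/*.rst')
        if p.parent.name.isdigit()
    )

print(import_refs)
```

Only the file under `2016/05/` survives the filter; `pages/about.rst` is skipped because `pages` isn't numeric.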

And save to JSON to be imported with `manage.py import_blog_json`:

In [14]:
with open('/tmp/old-blog.json', 'w') as fp:
    json.dump(entries, fp)
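
Before importing, it can be worth reading the file back and checking that every entry has the keys written above. This is a minimal sketch: `REQUIRED_KEYS` is just the set of keys this notebook emits, not a documented schema for `import_blog_json`, and both `find_incomplete_entries` and the sample data are hypothetical.

```python
import json
import pathlib
import tempfile

# Assumption: these are simply the keys the notebook writes for each entry.
REQUIRED_KEYS = {
    'type', 'body', 'title', 'slug', 'datetime',
    'tags', 'import_ref', 'source', 'source_type',
}

def find_incomplete_entries(entries):
    """Return (index, missing_keys) pairs for entries lacking required keys."""
    problems = []
    for i, entry in enumerate(entries):
        missing = REQUIRED_KEYS - set(entry)
        if missing:
            problems.append((i, sorted(missing)))
    return problems

# Hypothetical sample data standing in for the real export.
sample = [
    {
        'type': 'entry', 'body': '<p>Hi</p>', 'title': 'Hi', 'slug': 'hi',
        'datetime': '2016-05-01', 'tags': [],
        'import_ref': 'old-blog:2016/05/hi.rst',
        'source': 'Hi\n==\n', 'source_type': 'reStructuredText',
    },
    {'type': 'entry', 'title': 'No slug'},
]

with tempfile.TemporaryDirectory() as tmp:
    path = pathlib.Path(tmp) / 'old-blog.json'
    path.write_text(json.dumps(sample))
    loaded = json.loads(path.read_text())

problems = find_incomplete_entries(loaded)
print(problems)
```

Pointing `path` at `/tmp/old-blog.json` instead of the temporary file would run the same check against the real export.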