Slaw is a lightweight library for generating Akoma Ntoso 2.0 Act XML from plain text and PDF documents. It is used to power Indigo and uses grammars developed for the legal traditions in these countries:
- South Africa
Slaw allows you to:
- parse plain text and transform it into an Akoma Ntoso Act XML document
- unparse Akoma Ntoso XML into a plain-text format suitable for re-parsing
Slaw is lightweight because it wraps around a Nokogiri XML representation of the parsed document. It provides some support methods for manipulating these documents, but anything advanced must manipulate the XML directly.
Add this line to your application's Gemfile:
gem 'slaw'
And then execute:
$ bundle install
Or install it with:
$ gem install slaw
To run PDF extraction you will also need poppler's pdftotext. If you're on a Mac, you can use:
$ brew install poppler
You may also need Ghostscript to remove password protection from PDF files. This is installed by default on most systems (including Mac). On Ubuntu you can use:
$ sudo apt-get install ghostscript
The simplest way to use Slaw is via the command line:
$ slaw parse myfile.pdf --grammar za
Slaw generates Acts in the Akoma Ntoso 2.0 XML standard for legislative documents. It first parses plain text using a grammar and then generates XML from the resulting syntax tree.
Most by-laws in South Africa are available as PDF documents. Slaw therefore has support for extracting and cleaning up text from PDFs before parsing it. Extracting text from PDFs can produce oddities (such as oddly wrapped lines) and Slaw has a number of rules-of-thumb for correcting these. These rules are based on South African by-laws and may not be suitable for all regions.
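To give a feel for what such a rule-of-thumb can look like, here is a minimal sketch of un-breaking wrapped lines. It is not Slaw's actual cleanup code: the method name `unbreak_lines` and the heuristic itself (join a line onto the previous one when the previous line has no closing punctuation and the next starts with a lowercase letter) are assumptions made for illustration.

```ruby
# Hypothetical heuristic, not Slaw's real cleanup rules.
# A line is treated as a continuation when the previous line does not end
# in '.', ':' or ';' and the current line starts with a lowercase letter.
def unbreak_lines(text)
  text.each_line.map(&:chomp).each_with_object([]) do |line, out|
    prev = out.last
    if prev && !prev.empty? && prev !~ /[.:;]\z/ && line =~ /\A[a-z]/
      out[-1] = "#{prev} #{line}"   # glue the wrapped line back on
    else
      out << line                   # keep as a separate line
    end
  end.join("\n")
end

sample = "1. No person may keep\nmore than two dogs on\nany premises.\n" \
         "2. This by-law applies to all residents."
puts unbreak_lines(sample)
```

A heuristic like this will misfire on some inputs (for example, a heading followed by a line that happens to start in lowercase), which is why rules of this kind tend to be tuned to one region's documents.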
The grammar is expressed as a Treetop grammar and has been developed specifically for the format of South African acts and by-laws. Grammars for other regions could be developed depending on the complexity of a region's formats.
The grammar cannot catch some subtleties of an act or by-law -- such as nested list numbering -- so Slaw performs some post-processing on the XML produced by the parser. In particular, it nests lists correctly.
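The idea of nesting a flat run of list items can be sketched as follows. This is an illustration only, not Slaw's post-processing code: `Item`, `number_format` and `nest_items` are hypothetical names, and the rule used here (start a nested list whenever a new numbering format appears, return to an outer list when an earlier format reappears) is an assumption. It also ignores real ambiguities, such as an `(i)` that follows `(h)`.

```ruby
# Hypothetical sketch: nest a flat list of [number, text] pairs by numbering format.
Item = Struct.new(:num, :text, :children)

# Classify a number such as "(a)", "(ii)" or "(3)". Roman is checked first,
# so "(i)" is always treated as roman here (a known simplification).
def number_format(num)
  case num
  when /\A\([ivx]+\)\z/ then :roman
  when /\A\([a-z]+\)\z/ then :alpha
  when /\A\(\d+\)\z/    then :numeric
  else :other
  end
end

def nest_items(pairs)
  root = []
  stack = [[nil, root]] # [format, list] entries, innermost list last
  pairs.each do |num, text|
    fmt = number_format(num)
    item = Item.new(num, text, [])
    if stack.last[0].nil?
      stack.last[0] = fmt                      # first item fixes the outer format
    elsif (depth = stack.rindex { |f, _| f == fmt })
      stack = stack[0..depth]                  # format seen before: pop back to it
    else
      stack << [fmt, stack.last[1].last.children] # new format: nest under last item
    end
    stack.last[1] << item
  end
  root
end

# Render the tree with indentation so the nesting is visible.
def render(items, indent = 0)
  items.flat_map do |item|
    ["#{'  ' * indent}#{item.num} #{item.text}"] + render(item.children, indent + 1)
  end
end

flat = [["(a)", "dogs;"], ["(i)", "large;"], ["(ii)", "small;"], ["(b)", "cats."]]
puts render(nest_items(flat))
```

Here `(i)` and `(ii)` end up as children of `(a)`, and `(b)` returns to the outer list.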
Slaw uses Treetop to compile a grammar into a backtracking parser. The parser builds a parse tree, the nodes of which know how to serialize themselves in XML format.
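The pattern of nodes that serialize themselves can be sketched in plain Ruby. The classes below are hand-built stand-ins for the syntax nodes a Treetop parser would produce; `Section`, `Subsection` and `escape_xml` are hypothetical names, not Slaw's classes. The output borrows Akoma Ntoso element names but omits the ids, namespaces and attributes a valid document needs.

```ruby
# Hypothetical stand-ins for parse tree nodes; not Slaw's actual node classes.
def escape_xml(text)
  text.gsub("&", "&amp;").gsub("<", "&lt;").gsub(">", "&gt;")
end

Subsection = Struct.new(:num, :text) do
  # Each node knows how to render itself.
  def to_xml
    "<subsection><num>#{escape_xml(num)}</num>" \
      "<content><p>#{escape_xml(text)}</p></content></subsection>"
  end
end

Section = Struct.new(:num, :heading, :subsections) do
  # A parent node renders itself and delegates to its children.
  def to_xml
    "<section><num>#{escape_xml(num)}</num>" \
      "<heading>#{escape_xml(heading)}</heading>" \
      "#{subsections.map(&:to_xml).join}</section>"
  end
end

section = Section.new("1.", "Definitions",
                      [Subsection.new("(1)", "In this by-law, \"dog\" includes a puppy.")])
puts section.to_xml
```

Because each node only renders itself and its children, adding a new kind of element means adding one node class rather than changing a central serializer.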
Supporting formats from other countries' legal traditions probably requires creating a new grammar and parser.
- Fork it at http://github.com/longhotsummer/slaw/fork
- Install dependencies:
- Create your feature branch:
git checkout -b my-new-feature
- Write great code!
- Run tests:
- Commit your changes:
git commit -am 'Add some feature'
- Push to the branch:
git push origin my-new-feature
- Create a new Pull Request
- Improved support for other legal traditions / grammars.
- Add Polish legal tradition grammar.
- Slaw no longer does too much introspection of a parsed document, since that can be so tradition-dependent.
- Move reformatting out of Slaw since it's tradition-dependent.
- Remove definition linking, Slaw no longer supports it.
- Remove unused code for interacting with the internals of acts.
- Match defined terms in 'definition' section.
- Updated nokogiri dependency to 1.8.2
- Support links and images inside tables, by parsing tables natively.
- Support --crop for PDFs. Requires poppler's pdftotext, not xpdf.
- Update nokogiri to ~> 1.8.1
- Ignore non-AKN compatible table attributes
- Support tables in many non-PDF documents (eg. Word documents) by converting to HTML and then to Akoma Ntoso
- Convert non-breaking space (\xA0) to space
- Support links in remarks
- Support inline image tags, using Markdown syntax: ![alt text](image url)
- Smarter un-break lines
- FIX allow Schedule, Part and other headings at the start of blocklist and subsections
- FIX replace empty CONTENT elements with empty P tags so XML validates
- Better handling of empty subsections and blocklist items
- Support links/references using Markdown-like [text](href) syntax.
- FIX allow remarks in blocklist items
- Support newlines in table cells as EOL (or BR in HTML)
- FIX unparsing of remarks, introduced in 0.10.0
- Ensure backslash escaping handles listIntroductions and partial words correctly
- New command unparse FILE, which transforms an Akoma Ntoso XML document into plain text, suitable for re-parsing
- Support escaping special words with a backslash
- This release makes reasonably significant changes to generated XML, particularly for sections without explicit subsections.
- Blocklists with (aa) following (z) are treated as using the same numbering format.
- Change how blockList listIntroduction elements are created to be more generic
- Support for sections that dive straight into lists without subsections
- Simplify grammar
- Fix elements with potentially duplicate ids
- During cleanup, break lines on section titles that don't have a space after the number, eg: "New section title 4.(1) The content..."
- Schedules can be empty (#10)
- Schedules can have both a title and a heading, permitting schedules titled "First Schedule" and not just "Schedule 1"
- FEATURE: parse command only reformats input for PDFs or when --reformat is given
- FIX: don't error on defn tags without link to defined term
- use refersTo to identify blocks containing term definitions, rather than setting an (invalid) ID
- add link-definitions command to find and extract defined terms and link them to their definitions
- exit with non-zero exit code on failure (see https://github.com/erikhuda/thor/issues/244)
- add --section-number-position argument to slaw command
- grammar supports empty chapters and parts
- major changes to grammar to permit chapters, parts, sections etc. in schedules