Supports the Polish legislation tradition #17

Merged
merged 29 commits into from May 22, 2018
29 commits
76015af
More modular grammars
longhotsummer May 10, 2018
db9b098
Basics of pl
longhotsummer May 10, 2018
ed708ba
Progress
longhotsummer May 10, 2018
3a07654
Sections without headings
longhotsummer May 10, 2018
d6f7129
Use intro
longhotsummer May 10, 2018
7b756be
Working tests
longhotsummer May 10, 2018
f423f2f
Factor out common nodes and grammars
longhotsummer May 10, 2018
dbfc066
More common rules and nodes
longhotsummer May 10, 2018
38441a7
Fixes
longhotsummer May 11, 2018
bd94410
Version 1.0.0
longhotsummer May 11, 2018
fffe9b0
Litera
longhotsummer May 11, 2018
cef4fb9
Remove definition linking, short title discovery
longhotsummer May 11, 2018
fc79370
Articles
longhotsummer May 11, 2018
9117684
Alpha version of 1.0.0
longhotsummer May 14, 2018
fe4e5ff
Indents
longhotsummer May 15, 2018
ebd07bf
Remove unnecessary eol
longhotsummer May 15, 2018
43abfa3
Floating paragraphs use subparagraph so as not to clash with numbered…
longhotsummer May 15, 2018
93d3fef
Move reformatter out of slaw
longhotsummer May 15, 2018
01dd813
1.0.0.alpha.4
longhotsummer May 15, 2018
6a1e5d9
Better handling of intro elements
longhotsummer May 16, 2018
e5690b2
Litera uses alinea, to avoid conflicts with list
longhotsummer May 16, 2018
667fa4b
Different hyphens in grammar
longhotsummer May 16, 2018
509626f
Alpha 6
longhotsummer May 16, 2018
1ff5fdf
Link to repo
longhotsummer May 16, 2018
0517e35
Remove unneeded files
longhotsummer May 16, 2018
e149224
README
longhotsummer May 16, 2018
165b27c
Alphabetize, remove points
longhotsummer May 22, 2018
eafe98f
Polish un-parse xsl
longhotsummer May 22, 2018
b9b276a
Version 1.0.0
longhotsummer May 22, 2018
170 changes: 22 additions & 148 deletions README.md
@@ -1,14 +1,16 @@
# Slaw [![Build Status](https://travis-ci.org/longhotsummer/slaw.svg)](http://travis-ci.org/longhotsummer/slaw)

Slaw is a lightweight library for generating and rendering Akoma Ntoso 2.0 Act XML from plain text and PDF documents.
It is used to power [openbylaws.org.za](http://openbylaws.org.za) and [steno.openbylaws.org.za](http://steno.openbylaws.org.za)
and uses grammars developed for South African acts and by-laws.
Slaw is a lightweight library for generating Akoma Ntoso 2.0 Act XML from plain text and PDF documents.
It is used to power [Indigo](https://github.com/OpenUpSA/indigo) and uses grammars developed for the legal
traditions in these countries:

* South Africa
* Poland

Slaw allows you to:

1. extract plain text from PDFs and clean up that text
2. parse plain text and transform it into an Akoma Ntoso Act XML document
3. render the XML document into HTML
1. parse plain text and transform it into an Akoma Ntoso Act XML document
2. unparse Akoma Ntoso XML into a plain-text format suitable for re-parsing

Slaw is lightweight because it wraps around a Nokogiri XML representation of
the parsed document. It provides some support methods for manipulating these
@@ -40,7 +42,7 @@ installed by default on most systems (including Mac). On Ubuntu you can use:

The simplest way to use Slaw is via the commandline:

$ slaw parse myfile.pdf
$ slaw parse myfile.pdf --grammar za

## Overview

@@ -61,152 +63,15 @@ formats.

The grammar cannot catch some subtleties of an act or by-law -- such as nested list numbering --
so Slaw performs some post-processing on the XML produced by the parser. In particular,
it nests lists correctly and looks for specially defined terms and their occurrences in the document.

## Quick Start

Install the gem using

gem install slaw

Extract text from a PDF and parse it as a South African by-law:

```ruby
require 'slaw'

# extract text from a PDF file and clean it up
extractor = Slaw::Extract::Extractor.new
text = extractor.extract_from_pdf('/path/to/file.pdf')

# parse the text into XML
generator = Slaw::ActGenerator.new
bylaw = generator.generate_from_text(text)
puts bylaw.to_xml(indent: 2)

# render the by-law as HTML, using / as the root
# for relative URLs
renderer = Slaw::Render::HTMLRenderer.new
puts renderer.render(bylaw.doc, '/')
```

## Extraction

Extraction is done by the `Slaw::Extract::Extractor` class. It currently handles
PDF and plain text files. Slaw uses `pdftotext` from the `xpdf` package to extract
the plain text from PDFs. PDFs are great for presentation, but suck for accurately storing
text. As a result, the extraction can produce oddities, such as lines broken in weird
places (or not broken when they should be). Slaw gets around this by running
some cleanup routines on the extracted text.

For example, it knows that these lines:

(b) any wall, swimming pool, reservoir or bridge
or any other structure connected therewith; (c) any fuel pump or any
tank used in connection therewith

should probably be broken at the section numbers:

(b) any wall, swimming pool, reservoir or bridge or any other structure connected therewith;
(c) any fuel pump or any tank used in connection therewith

If your region's numbering format differs significantly from this, these rules might not work.
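
As a rough illustration of the re-breaking rule described above, here is a minimal sketch. It assumes item markers look like `(a)`, `(b)` and so on; `rebreak_at_item_numbers` is a hypothetical helper written for this note and is not Slaw's actual cleanup code.

```ruby
# Hypothetical helper for illustration only; not Slaw's real implementation.
def rebreak_at_item_numbers(text)
  # Collapse the extracted lines into a single run of text.
  joined = text.lines.map(&:strip).reject(&:empty?).join(' ')
  # Start a new line before each "(a)"-style marker that follows whitespace.
  joined.gsub(/\s+(?=\([a-z]{1,3}\)\s)/, "\n")
end

sample = <<~TEXT
  (b) any wall, swimming pool, reservoir or bridge
  or any other structure connected therewith; (c) any fuel pump or any
  tank used in connection therewith
TEXT

puts rebreak_at_item_numbers(sample)
```

A rule this simple would also break a line at an inline cross-reference such as "paragraph (c) of", which is one reason such rules depend on the numbering conventions of the region.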

Some other steps Slaw takes after extraction include (check `Slaw::Parse::Cleanser` for the full set):

* changing newlines to `\n`, and normalising quotation characters
* removing page numbers and other boilerplate
* stripping the table of contents (we can generate our own from the parsed document)
* changing tabs to spaces, stripping leading and trailing spaces and removing blank lines
it nests lists correctly.

## Parsing

Slaw uses Treetop to compile a grammar into a backtracking parser. The parser builds a parse
tree, each node of which knows how to serialize itself in XML format.

While most South African by-laws are superficially very similar, there are sufficient differences
in their typesetting to make parsing them difficult. The grammar handles most
edge cases but may not catch them all. The one thing it cannot yet detect well is the difference
between section titles before and after a section number:

1. Definitions
In this by-law, the following words ...

Definitions
1. In this by-law, the following words ...
tree, the nodes of which know how to serialize themselves in XML format.

This must be set by the user before parsing:

```ruby
generator = Slaw::ZA::BylawGenerator.new
generator.parser.options = {section_number_after_title: true}
```

The parser does its best not to choke on input it doesn't understand, preferring a best effort
to a completely accurate result. For example it may not be able to work out a section heading
and so will treat it as simply another statement in the previous section. This causes the parser
to use a lot of backtracking and negative lookahead assertions, which can be slow for large documents.

The grammar supports a number of subsection numbering formats, which are often mixed
in a document to indicate different levels of nesting.

(a)
(2)
(3b)
(ii)
3.4

During post-processing it works out how to nest these appropriately.
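
One way to picture the nesting step is the sketch below. It is an assumption-laden illustration, not the logic in `lib/slaw/parse/blocklists.rb`: `marker_kind` and `nesting_depths` are hypothetical names, and the rule used here (a numbering format not seen at the current levels opens a deeper level, a format already seen returns to its level) is a simplification.

```ruby
# Hypothetical sketch; Slaw's real nesting logic is more involved.

# Classify a numbering marker by its format.
def marker_kind(marker)
  case marker
  when /\A\(\d+\)\z/       then :numeric        # (2)
  when /\A\(\d+[a-z]+\)\z/ then :numeric_alpha  # (3b)
  when /\A\([ivx]{2,}\)\z/ then :roman          # (ii); a lone (i) is ambiguous and treated as alpha here
  when /\A\([a-z]+\)\z/    then :alpha          # (a)
  when /\A\d+(\.\d+)+\z/   then :dotted         # 3.4
  else :unknown
  end
end

# Assign a nesting depth to each marker in document order.
def nesting_depths(markers)
  stack = []
  markers.map do |marker|
    kind = marker_kind(marker)
    level = stack.index(kind)
    if level
      stack = stack[0..level]  # format seen before: return to its level
    else
      stack.push(kind)         # new format: open a deeper level
    end
    stack.length - 1
  end
end

p nesting_depths(%w[(1) (a) (b) (ii) (c) (2)])  # => [0, 1, 1, 2, 1, 0]
```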

Special words, such as ``part`` and ``chapter`` are ignored if the line starts with a backslash ``\``.

For more information see the South African by-law grammar at
[lib/slaw/za/bylaw.treetop](lib/slaw/za/bylaw.treetop) and the list nesting
at [lib/slaw/parse/blocklists.rb](lib/slaw/parse/blocklists.rb).

## Rendering

Slaw renders XML to HTML using XSLT. For the most part there is a direct mapping between
Akoma Ntoso structure and the HTML layout, so most AN nodes are simply mapped to `div` or `span`
elements with a class attribute derived from the name of the AN element and an ID element taken
from the node, if any. This makes it both fast and flexible, since it's easy to
apply layout rules with CSS.

Slaw can render either an entire document like this, or just a portion of the XML tree.

```ruby
# render an entire document
renderer = Slaw::Render::HTMLRenderer.new
puts renderer.render(bylaw.doc, '/')

# render the first section only
puts renderer.render(bylaw.sections[0], '/')
```

For more information, see [/lib/slaw/render/html.rb](/lib/slaw/render/html.rb).

## Meta-data

Acts and by-laws have metadata which it is not possible to get from their plain text representations,
such as their title, date and format of publication or act number. Slaw provides some helpers
for manipulating this meta-data. For example,

```ruby
bylaw = Slaw::ByLaw.new('spec/fixtures/community-fire-safety.xml')
print bylaw.id_uri
bylaw.title = 'A new title'
bylaw.name = 'a-new-title'
bylaw.published!(date: '2014-09-28')
print bylaw.id_uri
```

## Schedules

South African acts and by-laws can have addendums called schedules. They are technically a part of
the act but are not part of the primary body and have more relaxed formatting. Slaw finds schedules
by looking for section headings, but makes no effort to capture the format of their contents.

Akoma Ntoso has no explicit support for schedules. Instead, Slaw stores all schedules under a single
Akoma Ntoso `component` element at the end of the XML document, with a name of `schedules`.
Supporting formats from other countries' legal traditions probably requires creating a new grammar
and parser.

## Contributing

@@ -218,6 +83,15 @@ Akoma Ntoso `component` elements at the end of the XML document, with a name of

## Changelog

### 1.0.0

* Improved support for other legal traditions / grammars.
* Add Polish legal tradition grammar.
* Slaw no longer does too much introspection of a parsed document, since that can be so tradition-dependent.
* Move reformatting out of Slaw since it's tradition-dependent.
* Remove definition linking; Slaw no longer supports it.
* Remove unused code for interacting with the internals of acts.

### 0.17.2

* Match defined terms in 'definition' section.
28 changes: 4 additions & 24 deletions bin/slaw
@@ -17,19 +17,14 @@ class SlawCLI < Thor
desc "parse FILE", "Parse FILE into Akoma Ntoso XML"
option :input, enum: ['text', 'pdf'], desc: "Type of input if it can't be determined automatically"
option :pdftotext, desc: "Location of the pdftotext binary if not in PATH"
option :definitions, type: :boolean, desc: "Find and link definitions (this can be slow). Default: false"
option :fragment, type: :string, desc: "Akoma Ntoso element name that the imported text represents. Support depends on the grammar."
option :id_prefix, type: :string, desc: "Prefix to be used when generating ID elements when parsing a fragment."
option :section_number_position, enum: ['before-title', 'after-title', 'guess'], desc: "Where do section titles come in relation to the section number? Default: before-title"
option :reformat, type: :boolean, desc: "Reformat common formatting issues to make grammar matching better. Default: true for PDF files, false otherwise"
option :crop, type: :string, desc: "Crop box for PDF files, as 'left,top,width,height'."
option :grammar, type: :string, desc: "Grammar name (usually a two-letter country code). Default is za."
def parse(name)
logging

if options[:fragment] and options[:definitions]
raise Thor::Error.new("--definitions can't be used together with --fragment")
end

Slaw::Extract::Extractor.pdftotext_path = options[:pdftotext] if options[:pdftotext]
extractor = Slaw::Extract::Extractor.new

@@ -43,16 +38,13 @@ class SlawCLI < Thor
case options[:input]
when 'pdf'
text = extractor.extract_from_pdf(name)
options[:reformat] = true if options[:reformat].nil?
when 'text'
text = extractor.extract_from_text(name)
else
text = extractor.extract_from_file(name)
end

generator = Slaw::ActGenerator.new

text = generator.reformat(text) if options[:reformat]
generator = Slaw::ActGenerator.new(options[:grammar] || 'za')

if options[:fragment]
generator.document_class = Slaw::Fragment
@@ -94,25 +86,13 @@ class SlawCLI < Thor
exit 1
end

# definitions?
generator.builder.link_definitions(act.doc) if options[:definitions]

puts act.to_xml(indent: 2)
end

desc "link-definitions FILE", "Find and link defined terms in FILE"
def link_definitions(name)
builder = Slaw::ActGenerator.new.builder

doc = File.open(name, 'r') { |f| doc = builder.parse_xml(f.read) }
builder.link_definitions(doc)

puts builder.to_xml(doc)
end

desc "unparse FILE", "Unparse FILE from Akoma Ntoso XML back into text suitable for re-parsing"
option :grammar, type: :string, desc: "Grammar name (usually a two-letter country code). Default is za."
def unparse(name)
generator = Slaw::ActGenerator.new
generator = Slaw::ActGenerator.new(options[:grammar] || 'za')

doc = File.open(name, 'r') { |f| doc = generator.builder.parse_xml(f.read) }
puts generator.text_from_act(doc)
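
Both `parse` and `unparse` now pick the grammar with `options[:grammar] || 'za'`. The sketch below isolates that fallback so its behaviour is easy to see; `grammar_name` and `DEFAULT_GRAMMAR` are hypothetical names used only for this illustration and do not appear in the PR.

```ruby
# Hypothetical stand-in for the fallback expression used in bin/slaw.
DEFAULT_GRAMMAR = 'za'

def grammar_name(options)
  # nil (option not given) falls back to the default grammar.
  options[:grammar] || DEFAULT_GRAMMAR
end

puts grammar_name({})                 # prints "za"
puts grammar_name({ grammar: 'pl' })  # prints "pl"
p grammar_name({ grammar: '' })       # prints "" because an empty string is truthy in Ruby
```

An empty `--grammar` value is therefore passed through rather than replaced by the default. How `Slaw::ActGenerator` handles an empty or unknown name is not shown in this diff.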
6 changes: 0 additions & 6 deletions lib/slaw.rb
@@ -4,14 +4,8 @@
require 'slaw/namespace'
require 'slaw/logging'

require 'slaw/act'
require 'slaw/bylaw'
require 'slaw/collection'

require 'slaw/xml_support'
require 'slaw/lifecycle_event'

require 'slaw/render/html'
require 'slaw/parse/blocklists'
require 'slaw/parse/builder'
require 'slaw/parse/cleanser'