laws-africa · longhotsummer · May 22, 2018 · May 10, 2018 · May 10, 2018 · May 10, 2018
diff --git a/README.md b/README.md
@@ -1,14 +1,16 @@
 # Slaw [![Build Status](https://travis-ci.org/longhotsummer/slaw.svg)](http://travis-ci.org/longhotsummer/slaw)
 
-Slaw is a lightweight library for generating and rendering Akoma Ntoso 2.0 Act XML from plain text and PDF documents.
-It is used to power [openbylaws.org.za](http://openbylaws.org.za) and [steno.openbylaws.org.za](http://steno.openbylaws.org.za)
-and uses grammars developed for South African acts and by-laws.
+Slaw is a lightweight library for generating Akoma Ntoso 2.0 Act XML from plain text and PDF documents.
+It is used to power [Indigo](https://github.com/OpenUpSA/indigo) and uses grammars developed for the legal
+traditions in these countries:
+
+* South Africa
+* Poland
 
 Slaw allows you to:
 
-1. extract plain text from PDFs and clean up that text
-2. parse plain text and transform it into an Akoma Ntoso Act XML document
-3. render the XML document into HTML
+1. parse plain text and transform it into an Akoma Ntoso Act XML document
+2. unparse Akoma Ntoso XML into a plain-text format suitable for re-parsing
 
 Slaw is lightweight because it wraps around a Nokogiri XML representation of
 the parsed document. It provides some support methods for manipulating these
@@ -40,7 +42,7 @@ installed by default on most systems (including Mac). On Ubuntu you can use:
 
 The simplest way to use Slaw is via the commandline:
 
-    $ slaw parse myfile.pdf
+    $ slaw parse myfile.pdf --grammar za
 
 ## Overview
 
@@ -61,152 +63,15 @@ formats.
 
 The grammar cannot catch some subtleties of an act or by-law -- such as nested list numbering --
 so Slaw performs some post-processing on the XML produced by the parser. In particular,
-it nests lists correctly and looks for specially defined terms and their occurrences in the document.
-
-## Quick Start
-
-Install the gem using
-
-    gem install slaw
-
-Extract text from a PDF and parse it as a South African by-law:
-
-```ruby
-require 'slaw'
-
-# extract text from a PDF file and clean it up
-extractor = Slaw::Extract::Extractor.new
-text = extractor.extract_from_pdf('/path/to/file.pdf')
-
-# parse the text into a XML and
-generator = Slaw::ActGenerator.new
-bylaw = generator.generate_from_text(text)
-puts bylaw.to_xml(indent: 2)
-
-# render the by-law as HTML, using / as the root
-# for relative URLs
-renderer = Slaw::Render::HTMLRenderer.new
-puts renderer.render(bylaw.doc, '/')
-```
-
-## Extraction
-
-Extraction is done by the `Slaw::Extract::Extractor` class. It currently handles
-PDF and plain text files. Slaw uses `pdftotext` from the `xpdf` package to extract
-the plain text from PDFs. PDFs are great for presentation, but suck for accurately storing
-text. As a result, the extraction can produce oddities, such as lines broken in weird
-places (or not broken when they should be). Slaw gets around this by running
-some cleanup routines on the extracted text.
-
-For example, it knows that these lines:
-
-    (b) any wall, swimming pool, reservoir or bridge
-    or any other structure connected therewith; (c) any fuel pump or any
-    tank used in connection therewith
-
-should probably be broken at the section numbers:
-
-    (b) any wall, swimming pool, reservoir or bridge or any other structure connected therewith;
-    (c) any fuel pump or any tank used in connection therewith
-
-If your region's numbering format differs significantly from this, these rules might not work.
-
-Some other steps Slaw takes after extraction include (check `Slaw::Parse::Cleanser` for the full set):
-
-* changing newlines to `\n`, and normalising quotation characters
-* removing page numbers and other boilerplate
-* stripping the table of contents (we can generate our own from the parsed document)
-* changing tabs to spaces, stripping leading and trailing spaces and removing blank lines
+it nests lists correctly.
 
 ## Parsing
 
 Slaw uses Treetop to compile a grammar into a backtracking parser. The parser builds a parse
-tree, each node of which knows how to serialize itself in XML format.
-
-While most South African by-laws are superficially very similar, there are a sufficient differences
-in their typesetting to make parsing them difficult. The grammar handles most
-edge cases but may not catch them all. The one thing it cannot yet detect well is the difference
-between section titles before and after a section number:
-
-    1. Definitions
-    In this by-law, the following words ...
-
-    Definitions
-    1. In this by-law, the following words ...
+tree, the nodes of which know how to serialize themselves in XML format.
 
-This must be set by the user before parsing:
-
-```ruby
-generator = Slaw::ZA::BylawGenerator.new
-generator.parser.options = {section_number_after_title: true}
-```
-
-The parser does its best not to choke on input it doesn't understand, preferring a best effort
-to a completely accurate result. For example it may not be able to work out a section heading
-and so will treat it as simply another statement in the previous section. This causes the parser
-to use a lot of backtracking and negative lookahead assertions, which can be slow for large documents.
-
-The grammar supports a number of subsection numbering formats, which are often mixed
-in a document to indicate different levels of nesting.
-
-    (a)
-    (2)
-    (3b)
-    (ii)
-    3.4
-
-During post-processing it works out how to nest these appropriately.
-
-Special words, such as ``part`` and ``chapter`` are ignored if the line starts with a backslash ``\``.
-
-For more information see the South African by-law grammar at
-[lib/slaw/za/bylaw.treetop](lib/slaw/za/bylaw.treetop) and the list nesting
-at [lib/slaw/parse/blocklists.rb](lib/slaw/parse/blocklists.rb).
-
-## Rendering
-
-Slaw renders XML to HTML using XSLT. For the most part there is a direct mapping between
-Akoma Ntoso structure and the HTML layout, so most AN nodes are simply mapped to `div` or `span`
-elements with a class attribute derived from the name of the AN element and an ID element taken
-from the node, if any. This makes it both fast and flexible, since it's easy to
-apply layout rules with CSS.
-
-Slaw can render either an entire document like this, or just a portion of the XML tree.
-
-```ruby
-# render an entire document
-renderer = Slaw::Render::HTMLRenderer.new
-puts renderer.render(bylaw.doc, '/')
-
-# render the first section only
-puts renderer.render(bylaw.sections[0], '/')
-```
-
-For more information, see [/lib/slaw/render/html.rb](/lib/slaw/render/html.rb).
-
-## Meta-data
-
-Acts and by-laws have metadata which it is not possible to get from their plain text representations,
-such as their title, date and format of publication or act number. Slaw provides some helpers
-for manipulating this meta-data. For example,
-
-```ruby
-bylaw = Slaw::ByLaw.new('spec/fixtures/community-fire-safety.xml')
-print bylaw.id_uri
-bylaw.title = 'A new title'
-bylaw.name = 'a-new-title'
-bylaw.published!(date: '2014-09-28')
-print bylaw.id_uri
-```
-
-## Schedules
-
-South African acts and by-laws can have addendums called schedules. They are technically a part of
-the act but are not part of the primary body and have more relaxed formatting. Slaw finds schedules
-by looking for section headings, but makes no effort to capture the format of their contents.
-
-Akoma Ntoso has no explicit support for schedules. Instead, Slaw stores all schedules under a single
-Akoma Ntoso `component` elements at the end of the XML document, with a name of `schedules`.
+Supporting formats from other countries' legal traditions probably requires creating a new grammar
+and parser.
 
 ## Contributing
 
@@ -218,6 +83,15 @@ Akoma Ntoso `component` elements at the end of the XML document, with a name of
 
 ## Changelog
 
+### 1.0.0
+
+* Improved support for other legal traditions / grammars.
+* Add Polish legal tradition grammar.
+* Slaw no longer does much introspection of a parsed document, since that is highly tradition-dependent.
+* Move reformatting out of Slaw since it's tradition-dependent.
+* Remove definition linking; Slaw no longer supports it.
+* Remove unused code for interacting with the internals of acts.
+
 ### 0.17.2
 
 * Match defined terms in 'definition' section.

diff --git a/bin/slaw b/bin/slaw
@@ -17,19 +17,14 @@ class SlawCLI < Thor
   desc "parse FILE", "Parse FILE into Akoma Ntoso XML"
   option :input, enum: ['text', 'pdf'], desc: "Type of input if it can't be determined automatically"
   option :pdftotext, desc: "Location of the pdftotext binary if not in PATH"
-  option :definitions, type: :boolean, desc: "Find and link definitions (this can be slow). Default: false"
   option :fragment, type: :string, desc: "Akoma Ntoso element name that the imported text represents. Support depends on the grammar."
   option :id_prefix, type: :string, desc: "Prefix to be used when generating ID elements when parsing a fragment."
   option :section_number_position, enum: ['before-title', 'after-title', 'guess'], desc: "Where do section titles come in relation to the section number? Default: before-title"
-  option :reformat, type: :boolean, desc: "Reformat common formatting issues to make grammar matching better. Default: true for PDF files, false otherwise"
   option :crop, type: :string, desc: "Crop box for PDF files, as 'left,top,width,height'."
+  option :grammar, type: :string, desc: "Grammar name (usually a two-letter country code). Default is za."
   def parse(name)
     logging
 
-    if options[:fragment] and options[:definitions]
-      raise Thor::Error.new("--definitions can't be used together with --fragment")
-    end
-
     Slaw::Extract::Extractor.pdftotext_path = options[:pdftotext] if options[:pdftotext]
     extractor = Slaw::Extract::Extractor.new
 
@@ -43,16 +38,13 @@ class SlawCLI < Thor
     case options[:input]
     when 'pdf'
       text = extractor.extract_from_pdf(name)
-      options[:reformat] = true if options[:reformat].nil?
     when 'text'
       text = extractor.extract_from_text(name)
     else
       text = extractor.extract_from_file(name)
     end
 
-    generator = Slaw::ActGenerator.new
-
-    text = generator.reformat(text) if options[:reformat]
+    generator = Slaw::ActGenerator.new(options[:grammar] || 'za')
 
     if options[:fragment]
       generator.document_class = Slaw::Fragment
@@ -94,25 +86,13 @@ class SlawCLI < Thor
       exit 1
     end
 
-    # definitions?
-    generator.builder.link_definitions(act.doc) if options[:definitions]
-
     puts act.to_xml(indent: 2)
   end
 
-  desc "link-definitions FILE", "Find and link defined terms in FILE"
-  def link_definitions(name)
-    builder = Slaw::ActGenerator.new.builder
-
-    doc = File.open(name, 'r') { |f| doc = builder.parse_xml(f.read) }
-    builder.link_definitions(doc)
-
-    puts builder.to_xml(doc)
-  end
-
   desc "unparse FILE", "Unparse FILE from Akoma Ntoso XML back into text suitable for re-parsing"
+  option :grammar, type: :string, desc: "Grammar name (usually a two-letter country code). Default is za."
   def unparse(name)
-    generator = Slaw::ActGenerator.new
+    generator = Slaw::ActGenerator.new(options[:grammar] || 'za')
 
     doc = File.open(name, 'r') { |f| doc = generator.builder.parse_xml(f.read) }
     puts generator.text_from_act(doc)

diff --git a/lib/slaw.rb b/lib/slaw.rb
@@ -4,14 +4,8 @@
 require 'slaw/namespace'
 require 'slaw/logging'
 
-require 'slaw/act'
-require 'slaw/bylaw'
-require 'slaw/collection'
-
 require 'slaw/xml_support'
-require 'slaw/lifecycle_event'
 
-require 'slaw/render/html'
 require 'slaw/parse/blocklists'
 require 'slaw/parse/builder'
 require 'slaw/parse/cleanser'