
eurlex.js

eurlex.js is a command line utility to retrieve documents (specifically: regulation drafts) in all supported languages from the EUR-Lex website and convert them into JSON. It is made with node.js and can be installed locally via npm.

Install

eurlex.js can be installed using npm:

npm install -g eurlex

Of course you must have node.js with npm installed.

eurlex.js works fine on Linux, *BSD and Darwin, but has never been tested on Win32.

Usage

Once installed you can use eurlex on the command line:

eurlex [options] <EUR-Lex URI>

You get a brief description of all the options with

eurlex --help

If you are curious what it looks like to get and convert something, try:

eurlex -vu -l de,en,fr COM:2012:0011:FIN -o eurlex-com-2012-0011-fin.json

profile.json

Since the HTML output of EUR-Lex is pretty far from being machine readable, eurlex.js applies a lot of magic to read it anyway. The magic can be fine-tuned with settings in a file called profile.json. Here is a stripped and commented version of profile.json:

{
    "lang": ["en","de","..."],           // array of available languages
    "expressions": {                     // regular expressions
        "lang": "...",                   // to match the language of the document 
        "title": "..."                   // to match the title of the document
    },
    "delimiters": {                      // delimiters (they are all regex)
        "en": {                          // for this language 
            "recitals": ["...","..."],   // start and end of recitals
            "articles": ["...","..."],   // start and end of articles
            "chapter": "^CHAPTER ",      // string to match a chapter
            "section": "^SECTION ",      // string to match a section
            "article": "^Article ",      // string to match an article
            "fixes": [                   // before a line is parsed
                ["...","..."],           // .replace(/first/, "second")
                ["...","..."]            // as many as you need
            ]
        },
        "lv": {
            "recitals": ["...","..."],
            "articles": ["...","..."],
            "chapter": [                 // if this is an array
                "^([XVI]+) NODAĻA",      // if matches: chapter
                "^([XVI]+) NODAĻA$",     // if matches: text missing
                "^([XVI]+) NODAĻA (.*)$" // $1 is the literal, $2 is the text
            ],
            "section": [                 // same here...
                "^([0-9]+)\\. IEDAĻA", 
                "^([0-9]+)\\. IEDAĻA$", 
                "^([0-9]+)\\. IEDAĻA (.*)$"
            ],
            "article": [                 // note! for article[3] 
                "^([0-9]+)\\. pants",    // $1 is the literal, __$3__ is the text
                "^([0-9]+)\\. pants$", 
                "^([0-9]+)(\\.) pants (.*)$"
            ],
            "fixes": []                  // fixes indeed can be empty
        }
    }
}
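
To illustrate how the "fixes" pairs and the array form of a delimiter could be read, here is a minimal sketch in node.js. It is not taken from the eurlex.js source: the helper names applyFixes and matchHeading, the textGroup parameter and the returned object shape are assumptions made for this example, and the sample headings are invented. Only the regular expressions are copied from the "lv" profile above.

```javascript
// Hypothetical helpers, not part of eurlex.js itself.

// Apply every [pattern, replacement] pair in order, mirroring
// line.replace(/first/, "second") from the profile comments.
function applyFixes(line, fixes) {
  return fixes.reduce(
    (text, [pattern, replacement]) => text.replace(new RegExp(pattern), replacement),
    line
  );
}

// Match a line against a delimiter that is either a single string
// or a three-element array [isHeading, textMissing, literalAndText].
// textGroup names the capture group holding the heading text
// (2 by default, 3 for the "lv" article pattern).
function matchHeading(line, delimiter, textGroup = 2) {
  if (typeof delimiter === "string") {
    // Assumption: the string form only signals that the line is a heading.
    return new RegExp(delimiter).test(line) ? { literal: null, text: null } : null;
  }
  const [isHeading, textMissing, withText] = delimiter.map((d) => new RegExp(d));
  if (!isHeading.test(line)) return null;
  const bare = line.match(textMissing);
  if (bare) return { literal: bare[1], text: null }; // heading text is on a later line
  const full = line.match(withText);
  return full ? { literal: full[1], text: full[textGroup] } : null;
}

// Patterns copied from the "lv" profile above.
const lvChapter = ["^([XVI]+) NODAĻA", "^([XVI]+) NODAĻA$", "^([XVI]+) NODAĻA (.*)$"];
const lvArticle = ["^([0-9]+)\\. pants", "^([0-9]+)\\. pants$", "^([0-9]+)(\\.) pants (.*)$"];

console.log(matchHeading("II NODAĻA PRINCIPI", lvChapter));   // { literal: 'II', text: 'PRINCIPI' }
console.log(matchHeading("II NODAĻA", lvChapter));            // { literal: 'II', text: null }
console.log(matchHeading("5. pants Principi", lvArticle, 3)); // { literal: '5', text: 'Principi' }
console.log(applyFixes("Article  1", [["\\s+", " "]]));       // Article 1
```

The sketch treats a match on the second pattern as "heading found, text missing", which is how the comments in the profile describe it. How eurlex.js then picks up the missing text is not covered here.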

Limitations & Known issues

  • In Magyar, paragraphs and points partly use the same literal enclosures, so paragraphs are interpreted as headless points. You should be safe using --unify with another language as the first parameter.
  • The Malti translations are poorly formatted and contain redundant fragments. You have to rely heavily on the fixes in your profile.json.

License

eurlex.js is licensed under the EUPL.