Skip to content

lobbyplag/eurlex-js

master
Switch branches/tags

Name already in use

A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?
Code

Latest commit

 

Git stats

Files

Permalink
Failed to load latest commit information.
Type
Name
Latest commit message
Commit time
 
 
 
 
 
 
 
 

eurlex.js

eurlex.js is a command line utility to retrieve documents (specifically: regulation drafts) in all supported languages from the EUR-Lex website and convert them into JSON. It is made with node.js and can be installed locally via npm.

Install

eurlex.js can be installed using npm:

npm install -g eurlex

Of course you must have node node.js with npm installed.

eurlex.js works fine with Linux, *BSD and Darwin, but never was tested with Win32.

Usage

Once installed you can use eurlex on the command line:

eurlex [options] <EUR-Lex URI>

You get a brief description of all the options with

eurlex --help

If you are curious what it looks like to get and convert something, try:

eurlex -vu -l de,en,fr COM:2012:0011:FIN -o eurlex-com-2012-0011-fin.json

profile.json

Since the HTML otuput of Eurlex is pretty far from being machine readable, eurlex.js applies a lot of magic to read it anyway. The magic can be fine tuned with setting in a file called profile.json. Here is a stripped and commented version of profile.json:

{
	"lang": ["en","de","..."],           // array of avalable languages
	"expressions": {                     // regular expressions
		"lang": "...",                   // to match the language of the document 
		"title": "..."                   // to match the title of the document
	},
	"delimiters": {                      // delimiters (they are all regex)
		"en": {                          // for this language 
			"recitals": ["...","..."],   // start and end of recitals
			"articles": ["...","..."],   // start and end of articles
			"chapter": "^CHAPTER ",      // string to match a chapter
			"section": "^SECTION ",      // string to match a section
			"article": "^Article ",      // string to match an article
			"fixes": [                   // before a line is parsed
				["...","..."],           // .replace(/first/, "second")
				["...","..."]            // as many as you need
			]
		},
		"lv": {
			"recitals": ["...","..."],
			"articles": ["...","..."],
			"chapter": [                 // if this is an array
				"^([XVI]+) NODAĻA",      // if matches: chapter
				"^([XVI]+) NODAĻA$",     // if matches: text missing
				"^([XVI]+) NODAĻA (.*)$" // $1 is the literal, $2 is the text
			],
			"section": [                 // same here...
				"^([0-9]+)\\. IEDAĻA", 
				"^([0-9]+)\\. IEDAĻA$", 
				"^([0-9]+)\\. IEDAĻA (.*)$"
			],
			"article": [                 // note! for article[3] 
				"^([0-9]+)\\. pants",    // $1 is the literal, __$3__ is the text
				"^([0-9]+)\\. pants$", 
				"^([0-9]+)(\\.) pants (.*)$"
			],
			"fixes": []                  // fixes indeed can be empty
		}
	}
}

Limitations & Known issues

  • In Magyar, paragraphs and points partly use the same literal enclosures, which leads to paragraphs will be interpreted as headless points. You should be safe using --unify with another language as first parameter.
  • The translations for Malti are formatted pretty crappy and have redundant fragments. You have to hardly rely on the fixes in your profile.json

License

eurlex.js is licensed under EUPL

About

Retrieve documents from EUR-Lex and convert them to usable data

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published