Skip to content
Switch branches/tags

Latest commit


Git stats


Failed to load latest commit information.
Latest commit message
Commit time

dataset licenses

Comprehensive set of all useful (non-vanity) licenses, open and proprietary, superseded or current, to indexing any kind of cultural work (documents, softwares, etc.), that express a license, or have a generic implied license.

See dataset proposal #118 for CSV of licenses.


Files described by datapackage.json.

The licenses.csv have all relevant active licenses, and licenses highlighted by status "superseded" or "retired".

Redundant licenses are listed at redundants.csv, relating with its equivalent (synonymous) licenses, as illustred by the "Licenses that are redundant with more popular licenses" at

A list of all URLs used as "oficial" and realiable text of the license, are at license_urls.csv.

The families.csv is a complement of the family column in the lincense.csv dataset, for license aggregation (grouping similar licenses). It offers also some "sort by openness degreee" criteria, and the column scope adds more one level of aggregation.

NOTE: "openness" or "permissiveness" or ... the metric name can be changed, is a branding matter (for "aggregation and families" conventions see section below, for concepts, discussion here).

Curated sources

There are 3 main (curated) sources of information for populate the datasets of this project,

And other primary or secondary sources

Even with these (total or partial machine-readable) sources, there are some data and interpretations that not exist, so, the Data Packaged Core Datasets community need to check (handly) information from another sources.

Implied licenses

The implied license is a general problem of license-document indexation, because we must to point the document's license ID, but have no idea about (exactly) what is the license. The "license inference process" in general is time consuming, and oficial analysis have a cost... So we can retain a report with this oficial interpretation (or a community endorsement of the report) of document's context and its implied license. For relevant and big document repositories, like law (ex. LexML or a N-Lex country), one report is valid to the full repository.

As solution, the report describing a implicit license need not be exact: some aggregation level can be used to express the license, that is, is sufficient to indicate the license's family (as defined by okfn/licenses/issues/54).

The implieds.csv is a list of legislative repository's licenses, a demand motivated by the GODI/legislation datasets, and an in-progress work. Each implied license need a report, like implied-lex-BR-v1 (the license used in ~790000 brazilian legistaltive documents) or implied-berne-v1971 (the most used license in the world!).

The main reference work to these interpretations is the rules by territory. There are also, in the form of Public Domain Mark templates, some concrete licenses as PD-Albania-exempt, PD-BrazilGov, PD-GermanGov etc. The Wikimedia's "rules by territory" reduce but not eliminates the need of reports, that resembles jurisdiction ports, when necessary to describe specific details.

Aggregation and families

The purpose of licenses.csv is to concentrate all relevant licenses in one dataset of license metadata. We perceive here both, the high diversity in the number of lines of the file, and the ability to highlight differences, in the number of columns. For users, the next step and demand, is to group similar licenses.

Some fields can be used as key-group. We can use domain_* to "select by domain", or mantainer to group licenses "by brand". Using more than one field and some conditions we can "select by similar contract clauses"; fields is_by, is_sa, is_salink and is_nd do this role — the set of fields about clauses was inspired in the "License Properties" of the ccREL, sec. 3.2.

To group licenses by similar contract clauses we can use also a standard selection process, with some curatory revision: this is the rule of the family field, it is a functional grouping of licenses.

NOTE: the field is_ref=2 is a control flag to indicate the "most popular representative" of each family, that is a kind of canonical license, and is used also as family-label. The suffix "*" in the family label indicates "some little restrictions more than the no-suffix family". The field is_ref=1 is a control flag to indicate the "default in the brand" (to resolve references without version or other maintainer/family/version ambiguities).


As curated sources have no automatic merge process, and are all in-progress works, the best way to prepare is into a spreadsheet. So the preparations have two an initial organization steps, and the updating steps.

  1. Check if okfn/licenses or Wikipedia license-medadata was updated. Witch php csv-conv.php > temp.csv, that generates a basic csv file from each json at okfn/licenses/licenses, we can check the CSV file while it is not updated. Other software may be developed to extract data and check updates from Wikipedia.

  2. open it in the collaborative spreadsheet, making changes, using the curated sources.

  3. Make a clone (git clone and save each spreadsheet part (lincenses, families, etc.) as CSV file in the data folder. If all ok, commit and make git push to update data folder.


proposal for CSV of licenses, see



No releases published


No packages published