Switch back to the Apache-v2 license #1310

Closed
st-pasha opened this issue Sep 25, 2018 · 20 comments

@st-pasha
Contributor

The vast majority of Python packages use Apache, MIT, BSD, or similar open licenses. It would be courteous to the broader Python community, and would invite broader collaboration and contribution, if we did as well.

Historically, this project has been Apache-licensed from the very first commit. However, sometime before the public release, we switched to the MPL-2 license. The idea was to have the same license as the R data.table project (which at that time switched from GPL to MPL too). Unfortunately, we failed to grasp the primary difference between the R and Python communities at that point: the majority of R packages are licensed under GPL, and within such an environment an MPL-licensed project can be integrated freely and will be seen as more open than others. By contrast, within the Python community an MPL license is more restrictive than the norm and will be eyed with suspicion. In fact, the MPL license creates a perfectly tangible barrier: the ASF places this license in its Category B list of software that can only be included in binary form, not in source code form.

Please, share your thoughts/comments.

@mattdowle
Contributor

mattdowle commented Sep 25, 2018

My last comment on the R-data.table project members list 10 days ago was :

I've spoken to Pasha today. Indeed my fears are realized. He believes that the only part of pydatatable that came from data.table is one file: fread. He believes all the rest he did himself. This is why fread is the only file which is kind of different to the rest in terms of the licensing at the top. Everything else in pydatatable has copyright of H2O. This means that because H2O is the copyright holder, H2O can change the license of pydatatable without asking any contributor to data.table. That is Pasha's view I believe.
Pasha - can you confirm?

This remains unanswered. Can you confirm?

This came up recently here : #1232
The license change for data.table was extensive: Rdatatable/data.table#2456

Note that I explicitly left the door open to Apache in Rdatatable/data.table#2456. But my problem with you is that you've been sneaky. You tricked the data.table contributors by rewriting data.table and licensing fread differently to the rest of it. You ignored all those discussions. You ignored the agreement. You're saying that Rdatatable/data.table#2456 didn't matter, that it was a waste of time, and you don't need to discuss with the past contributors to data.table in R.

Otherwise, what was the point of Rdatatable/data.table#2456?

@st-pasha
Contributor Author

@mattdowle I did not answer your comment, because the comment was rude. It is rude because you're talking to me, while at the same time referring to me in 3rd person. Also, it would be easier to answer if instead of accusations, you actually asked a question to be answered. Do I believe that "all the rest I did myself"? No, of course not. I never said or implied as much. Many people contributed to the project, and it is thus a collective effort. Some parts were borrowed from other projects -- all properly attributed, according to their respective licenses.

You also correctly noticed that fread.cc file has a special license header. This was done out of respect for YOUR wishes. Remember back in April 2017 how you rewrote data.table's fread to "agnostify" it, and then asked me to integrate it into python datatable? That was the code contributed by you, and it still bears the license of your choice - MPL. Where in this whole process do you find yourself "tricked"?

@mattdowle
Contributor

mattdowle commented Sep 25, 2018

This is now like the twilight zone. You did say you wrote everything other than fread yourself. You said you could change the license without asking data.table contributors. I asked you that question and you replied yes.

@st-pasha
Contributor Author

Could be I misunderstood your question, or you misunderstood my answer, or both. Glad that we cleared that up.

@oleksiyskononenko
Contributor

What you propose, Pasha, sounds reasonable to me. By the way, are there any disadvantages that we should anticipate switching back to Apache license?

@st-pasha
Contributor Author

@oleksiyskononenko
As far as I can see, there are no real disadvantages; however, there are certain factors that may be assessed differently by different people. For example:

The Apache license will allow the code to be more easily reused. To me, and many other people, this is an advantage: it will allow the code to survive longer and be useful to more people. But to some other people this is actually a drawback, as the code can be incorporated into proprietary software.

Attractiveness to developers. Stemming from the same ideological difference, the license will have an effect on who wants to contribute to the project. There are developers who would never want to contribute to a GPL/MPL project. Conversely, there are developers who will not contribute to an Apache/MIT/BSD project. (And of course, there are those who don't care either way.) By switching to Apache, we will become more attractive to the former group, and less attractive to the latter. It is my understanding that within the Python community the first group (Apache advocates) is much larger than the second, and therefore the benefit outweighs the cost.

@mattdowle
Contributor

The MPL is very liberal in terms of reuse. I was clear in Rdatatable/data.table#2456 that R-data.table can be used in proprietary software: that is the express wish of R-data.table contributors (to pick one example of a group of people). Please will you address the issue of datatablePRO being created which I explained clearly in Rdatatable/data.table#2456 too. If datatable is Apache then anyone can create a closed-source improvement of the library itself: datatablePRO. What you've written above appears to misunderstand this concern. datatablePRO would not be reuse but competition with the original work by standing on its shoulders and taking advantage of the contributors who contributed on the basis of the library remaining open-source. Let's say a company creates datatablePRO. Would you continue to contribute your evenings and weekends to datatable, only for that effort to be ingested by that company into datatablePRO for free, making it better for free and that company make all the money from it? The only restriction of MPL pertains to the library itself to prevent datatablePRO being created. What is wrong with that?

@st-pasha
Contributor Author

st-pasha commented Sep 28, 2018

> What is wrong with that?

@mattdowle There is a difference between being able to reuse the software in binary form, and in source code form. For example, you wrote the sorting function in data.table. Since it provided superior performance, it was later included into the base R. This was possible only because data.table and R had compatible licenses. Similarly, because data.table is licensed as MPL, and the majority of R packages are GPL, there is no problem for anyone to incorporate data.table code into their projects. Thus, data.table is "nice" to other R developers.

The situation is different with Python. Python itself, as well as the majority of Python modules, is licensed under Apache-like licenses. This means no other Python module can incorporate or otherwise benefit from datatable's code. As such, datatable is "not nice" to other Python developers. Which is a shame, considering that we include code from several other Apache/MIT/BSD projects.

Thus, MPL license prevents not only the creation of a hypothetical "datatablePRO" but also other quite legitimate forms of reuse.

> Anyone would be able to create a closed-source improvement of the library itself.

First, I don't believe this to be a very realistic scenario. I do not know of any closed-source proprietary module in the Python ecosystem. But even if, by a bizarre twist of fate, such a product does appear -- wouldn't it be great? It means they'll be developing the product, and I'll be happy to work on other things. Or not happy. But regardless of how I personally feel, the user community will be the ultimate winners -- they'll have an even better tool than before. I think this is a good commitment to make: to work for the benefit of the users, and not for my own. And certainly, I do not strive to prevent others from making a profit where I could not.

But even more importantly, the possibility for datatablePRO (or datatable2) to appear is actually important for the project's survival. No matter what happens to me or other project developers, with the Apache license someone else would be able to create their own clone and continue the development.

@mattdowle I appreciate your concern for my evenings and weekends, and your efforts to prevent the datatable project from going down what you believe is a perilous path. However, for reasons outlined above, I remain convinced that Apache is a better choice for datatable. This by no means implies that I demand R data.table follow suit (although it is, of course, welcome to) -- for the same reasons as I mentioned above, MPL is actually a perfectly reasonable choice for an R library.

@st-pasha
Contributor Author

st-pasha commented Oct 4, 2018

I spoke with all datatable contributors about their license preferences. Three people (or four including myself) indicated that they are in favor of switching to a more open license such as Apache; three others said they were indifferent; no one was against.

@st-pasha
Contributor Author

st-pasha commented Oct 5, 2018

Another point of reference: Google's policies on the use of external packages (https://opensource.google.com/docs/thirdparty/licenses/) state the following with respect to MPL license:

  • Such packages can be used in binary / unmodified form;
  • If a modification is needed, it should be submitted upstream;
  • If the modification cannot be submitted (for example because it is Google-specific, such as making changes to the build script to accommodate Google infrastructure), then the package must not be used.

This shows that MPL is not particularly corporate-friendly. Other companies may have similar, or even more stringent restrictions (unfortunately, not many will publish their policies openly).

Previously I said that the choice of license only impacts developers, not users. Now I stand corrected: the users are affected, if they are constrained by the policies of the companies they work at.

@mattdowle
Contributor

mattdowle commented Oct 6, 2018

The context of that Google document is important.
The first sentence (my bolding):

Google needs to comply with open source licenses for all software that we distribute externally.

The penultimate paragraph (their bolding) :

These requirements only apply to products shipped to end users. Software that is run internally (even if displayed on the web to the user) does not have to meet these requirements.

So for instance, AGPL is the only license Google asked me not to use for data.table. Because then they couldn't use it in their web services (AGPL considers a web service to be distributing; stricter than GPL).

Your comment :

This shows that MPL is not particularly corporate-friendly.

Another twilight zone moment. Most corporations do not ship software, and even for those that do, MPL is pretty amazingly friendly even allowing it to be used in closed-source software for goodness sake. It's even less restrictive than the LGPL.

MPL FAQ 6 :

Q6: I want to distribute software which is available under the MPL, either changed or unchanged, within my organization. What do I have to do?
Nothing. The right to private modification and distribution (and inside a company or organization counts as 'private') is another right guaranteed by free and open source software licenses, including the MPL.

How on earth you can label that "not particularly corporate-friendly" beats me.

@st-pasha
Contributor Author

st-pasha commented Oct 6, 2018

@mattdowle This feels like a "glass-half-empty / half-full" kind of argument. Surely, it's only the external reuse which is restricted, while internally the package can be used freely. But why have the glass half-full, when it can be 100%-full with an open license?

> How on earth you can label that "not particularly corporate-friendly"?

This is actually quite easy to answer. Imagine yourself at the helm of a small (or large) company. Would you want to build your stack with software that restricts you? Today you may be running a small pizza shop and it doesn't matter; but tomorrow you'll want to sell your innovative pizza-making software to all pizzerias around the world -- and it suddenly does. Today you're happily writing data-transformation pipelines at Google -- but tomorrow it's suddenly not Google but Alphabet, and your pipeline suddenly connects several legally distinct companies.

Surely, MPL is more friendly than GPL or AGPL; but it is definitely less friendly than Apache or MIT. And if a corporation has any choice, then the license might become one of the crucial factors in their decision-making.

@mattdowle
Contributor

> Imagine yourself at the helm of a small (or large) company. Would you want to build your stack with software that restricts you?

Ok. I'm imagining. I think I would be perfectly happy to use MPL software. Because I would know that I did not want to take advantage of the contributors of that library by competing with them and trying to kill their library. I would understand that I don't want to create datatablePRO. On the contrary, I want to create closed-source pizza software that uses the datatable library, and I'd appreciate that's encouraged. Further, if I had a contribution to make to the datatable library, I would be more likely to contribute to the library because I would feel protected that nobody else would try and create datatablePRO and make a ton of money thanks to my free and stupidly trusting significant contribution. I would also be suspicious if the copyright holder of the library was a small company who might i) go bust or ii) change the license on me later. That would be a risk for my pizza software business. Finally, I would look at the history of the library and check that past contributors were respected because that would be a sign the project will flourish.

@st-pasha
Contributor Author

st-pasha commented Oct 8, 2018

A small business owner doesn't care about whether anyone makes money off datatable or not. If anything, having 2 competing versions of the library is better for him: competition spurs faster innovation and makes it more likely that at least one of them will survive longer. A smart business owner does care about whether or not he can make private modifications to the library. These can be small, such as changes necessary to accommodate his build environment; or large, such as his own pizza-related functionality built into the internal C++ code.
And by the way, this is not far-fetched at all: vis-data-server is already doing that. And ###### has done that too, when they needed skewness and kurtosis calculations.

> I would also be suspicious if the copyright holder of the library was a small company who might i) go bust or ii) change the license on me later.

Are you talking about the H2O.ai company here? It is very unlikely to "go bust". Far less likely than a project supported by volunteers only, who may not be able to make ends meet next week.
As for changing the license, it is, certainly, theoretically possible. However, any such change will not have retroactive effect: the new license will apply only from that point in history onward, all past versions of the project will still be available under the old license.
And speaking of datatable specifically, we have a public promise from the company that the project will be kept open-source.

> Finally, I would look at the history of the library and check that past contributors were respected because that would be a sign the project will flourish.

Practical difficulties with "looking back at the history" aside, this is a great maxim! Surely, past contributors ought to be respected. Now, what happens if there is a disagreement? Well, we don't have any kind of formal procedure yet -- but presumably, some kind of vote has to occur. It also seems desirable to give the voting power to contributors according to a measure of their contributions. Doesn't have to be a linear function, but it should be at least increasing. Say, a person who once fixed a typo somewhere in the documentation shouldn't have as much say as a person who spent many years working on the project. Also, it could be a good idea to weigh more recent contributions more compared to older contributions -- this would encourage new people to join the project, and existing members to keep contributing.
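
As a rough illustration of the weighting idea above, here is a minimal sketch of one possible formula. Everything in it is an assumption made for illustration: the function name `voting_weight`, the representation of a contribution as a `(size, age_years)` pair, the logarithmic damping, and the two-year half-life are not taken from this thread or from any project policy. It only shows that "increasing, not linear, and favoring recent work" can be satisfied at the same time.

```python
import math


def voting_weight(contributions, half_life_years=2.0):
    """Hypothetical voting weight for one contributor.

    `contributions` is an iterable of (size, age_years) pairs, where `size`
    is any non-negative measure of a contribution (e.g. lines changed) and
    `age_years` is how long ago it was made. Both are illustrative choices.
    """
    # Older contributions are discounted exponentially: a contribution that
    # is `half_life_years` old counts half as much as a fresh one.
    decayed_total = sum(
        size * 0.5 ** (age_years / half_life_years)
        for size, age_years in contributions
    )
    # log1p is increasing but sub-linear, so a typo fix earns some weight
    # while years of work do not translate into unbounded voting power.
    return math.log1p(decayed_total)
```

With this choice, `voting_weight([(1, 0)])` is smaller than `voting_weight([(1000, 0)])`, and the same amount of work counts for less the older it is. Any other increasing function with a recency discount would serve equally well.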

@st-pasha
Contributor Author

After re-reading the text of the MPL-2 license, I've come to the conclusion that we have started the "datatable PRO" argument based on an invalid premise.

In fact, the MPL license does not preclude a third party from creating and distributing a closed-source solution which is based on data.table's code. This is because MPL is a file-level license: it only protects individual files that already exist. So, in order to develop a closed-source modified version of data.table, all the third party would need to do is be careful to put any new functionality in separate files, and then modify the existing data.table code to call or include that new functionality. They would still be forced to release the modified data.table code, but those modifications will consist of #include-ing closed-source files and calling unavailable functions -- such modified code will be less than useful.

So in summary, MPL does not offer any real protection against creation of "datatable PRO", and neither does Apache/MIT.

@st-pasha
Contributor Author

st-pasha commented Jan 31, 2019

Comparing the MIT vs the MPL-2 licenses, summary of the arguments presented so far is as follows:

  • MIT (and likewise BSD/Apache) license is prevalent in the open source Python community, and therefore the preferred choice for a new python library;

  • MPL-2 is not widely known among developers; thus potential users/developers may be cautious about using a product licensed under the terms they don't fully understand;

  • MPL-2 license is a barrier from the point of view of legally-savvy organizations too: for example, ASF in its guidelines allows use of MPL-licensed binaries only, not the source files.

  • Likewise, Google lists MPL-2 as a reciprocal license, and allows inclusion only in unmodified binary form, not in source form (in products that are distributed externally). At the same time, they classify MIT and Apache as notice licenses, which require only a notice of inclusion;

  • Historically, H2O has released its open source software under the Apache license; using a more restrictive license such as MPL would be a departure from the open source ideals;

  • MIT license is very easy to apply: including the text of the license as the LICENSE file is sufficient, since the text speaks of "This software..." and therefore applies to the whole repository. Conversely, applying MPL-2 license is difficult and error-prone:

    • putting the text of MPL-2 license into the LICENSE file is not sufficient, and in fact serves no purpose other than to make GitHub display MPL-2 in the "license" field;
    • MPL-2 is a file-level license (as opposed to traditional repository-level licenses), meaning that it has to be affirmatively applied to each and every file that one wants to license. The application of the license consists of adding the Notice (see Exhibit A) to each licensed file, or attaching this notice via a comment in a side-along file;
    • if an unaware contributor adds a file that has no Notice, that file is unlicensed;
    • if the repository does not use a CLA and does not automatically assign copyright to the repository owner, that single unlicensed file may legally "poison" the entire project since nobody other than the original author will have the ability to edit, use or remove that file (see No License);
    • if a file contains the Notice, it does not automatically mean that the file is licensed under MPL. This is because the license defines covered software as "Source Code Form to which the initial Contributor has attached the notice in Exhibit A". If the Notice was put into a file by someone else other than the initial contributor, it has no effect;
    • adding a "binary" file (any file that cannot have comments, such as config file, data, image, etc) requires attaching the Notice to that file via a separate side-along file. It is not enough to have a readme.txt saying "all files in this repository have the following Notice attached: ...", since the Notice must be attached by the initial contributor. This means that it is even easier to accidentally commit an unlicensed binary file than it is to commit an unlicensed source code file;
    • due to the factors listed above, verifying the compliance of a repository to the terms of MPL-2 is an extremely difficult task, which probably requires specialized tools;
    • in addition to all the above, the binary form of the library must "inform the recipients how they can obtain the copy of the source code". This is not as easy to implement as it seems. If, say, we distribute version 0.6 of the library, the recipient must be informed how to obtain the source code for version 0.6. This means we have to be very careful to provide a URL to the specific point in git history (pointing to a branch is not enough, because multiple bug-fix releases may be hosted on the same branch);
    • the documentation pages that are rendered in html and hosted on an external website are also considered binary form of the underlying source .rst files. As such, we are required within the documentation to inform the reader where the source of each documentation file is;
    • logos present an additional challenge if we decide to use them anywhere, because this too becomes distribution. For example, if you were to put a datatable logo on a t-shirt, that t-shirt should bear a notice of the URL where the Source Code Form of the logo can be found;
    • MPL is a file-level license, which means every file in the repository is a separate entity, independently licensed under MPL-2. It can be argued then that the requirement to "inform the recipient of the executable form how they can obtain the source code form" pertains to each file independently. This would mean the executable form should inform the user about each and every file that it is comprised of, giving the exact URLs of each file at the correct point in git history.
  • It has been claimed that one decisive advantage of MPL over MIT/Apache is that the former prevents a third party from creating an enhanced closed-source version of the library (so-called "datatablePRO") that would compete with ours. Unfortunately, this claim is inaccurate: a closed-source library based on data.table can be created, provided that any new "pro" features will be added in separate files. The C/C++ language makes this process particularly simple through the #include pre-processor directive.
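
The list above suggests that verifying the per-file Notice "probably requires specialized tools". A minimal sketch of what such a check could look like is given below. The function names, the list of file suffixes, and the 30-line header window are all assumptions made for illustration, not an existing tool. Note also what a script of this kind cannot do: it can only detect that the Exhibit A wording is present, not whether the initial contributor was the one who attached it, and it does not handle binary files with side-along notices.

```python
import re
from pathlib import Path

# Key phrase from the notice in Exhibit A of the MPL-2.0.
NOTICE_PHRASE = "subject to the terms of the Mozilla Public License, v. 2.0"


def has_notice(text, head_lines=30):
    """Return True if the Exhibit A phrase appears near the top of `text`."""
    head = "\n".join(text.splitlines()[:head_lines])
    # Replace common comment markers with spaces and collapse whitespace, so
    # a notice wrapped across several comment lines still matches.
    flat = re.sub(r"[#/*;\-]+", " ", head)
    flat = re.sub(r"\s+", " ", flat)
    return NOTICE_PHRASE in flat


def files_missing_notice(root, suffixes=(".py", ".c", ".cc", ".h")):
    """List source files under `root` (relative paths) lacking the notice.

    The default suffix list is a hypothetical choice; adjust per repository.
    """
    missing = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            text = path.read_text(encoding="utf-8", errors="replace")
            if not has_notice(text):
                missing.append(path.relative_to(root).as_posix())
    return missing
```

Running `files_missing_notice(".")` from a repository root would return the relative paths of source files whose first lines do not contain the notice, which could then be reviewed by hand.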

Given all the above, I believe the Apache/MIT license to be a greatly superior choice for the datatable library, compared to the MPL-2 license. Thus, switching back to Apache/MIT will be a step that will be met positively by the community, and at the same time eliminate significant legal hurdles associated with the MPL-2 license.

@mattdowle
Contributor

> putting the text of MPL-2 license into the LICENSE file is not sufficient, and in fact serves no purpose other than to make GitHub display MPL-2 in the "license" field;

But the MPL-2 license contains this paragraph :

If it is not possible or desirable to put the notice in a particular
file, then You may include the notice in a location (such as a LICENSE
file in a relevant directory) where a recipient would be likely to look
for such a notice.

Why do you write that a LICENSE file is not sufficient when the license itself says it is?

@st-pasha
Contributor Author

st-pasha commented Feb 1, 2019

The text of the license says "...You may include the notice in a location (such as a LICENSE
file in a relevant directory) where a recipient would be likely to look for such a notice."
The notice that they are talking about is the notice from Exhibit A. It is the affirmative act of attaching this notice that makes a file the Covered Software. If the LICENSE file contains nothing but the text of MPL-2, then there is no notice there (only the Exhibit of a notice, which is not the same).

So, in order to properly license a binary file, the following mechanism is suggested:

  1. create a file called LICENSE (or any other name "where a recipient would be likely to look for a notice");
  2. Within the file include the text of the notice from Exhibit A;
  3. Explicitly specify that "the notice above is considered attached to files rdatatable.png, .gitignore, Makefile, ..., and to this file".
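
The three steps above can be sketched as a small helper that produces the text of such a side-along file. This is only an illustration of the suggested mechanism: the function name `build_notice_file` is hypothetical, and whether a generated file of this kind satisfies the license (in particular the requirement that the initial contributor attach the notice) is a legal question the code does not settle.

```python
# Text of the notice from Exhibit A of the MPL-2.0.
EXHIBIT_A_NOTICE = (
    "This Source Code Form is subject to the terms of the Mozilla Public\n"
    "License, v. 2.0. If a copy of the MPL was not distributed with this\n"
    "file, You can obtain one at http://mozilla.org/MPL/2.0/.\n"
)


def build_notice_file(covered_files):
    """Return LICENSE-style text attaching the notice to the given files."""
    if not covered_files:
        raise ValueError("at least one file must be listed")
    # Sort the names so the generated text is stable across runs.
    listing = ", ".join(sorted(covered_files))
    return (
        EXHIBIT_A_NOTICE
        + "\n"
        + f"The notice above is considered attached to files {listing}, "
        + "and to this file.\n"
    )
```

For example, `build_notice_file(["rdatatable.png", ".gitignore", "Makefile"])` returns the notice followed by a sentence naming those three files; the result would be written to the LICENSE file by whoever contributes the files.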

@arnocandel
Member

Given your arguments here, I don't see any reason not to support an Apache-v2 license.

@st-pasha
Contributor Author

@arnocandel
In relation to the choice of MIT vs Apache-v2, I see the following pros & cons:

  • MIT license is much shorter and easier to understand;
  • Apache-v2 is more precise and strict in its definitions/language, and therefore provides better legal protection;
  • The term "MIT license" is itself slightly ambiguous; there are actually 2 flavors: the "Expat License" and the "X11 License";
  • The boilerplate for Apache-v2 is better designed: it clearly states that the software is licensed under Apache-v2, whereas with MIT the boilerplate contains the text of the license, making it hard to understand what kind of license it is;
  • Apache-v2 includes a patent license clause, whereas MIT doesn't;
  • Apache-v2 is a much newer license than MIT, and incorporates decades of legal research;
  • Apache-v2 is backed by the Apache Software Foundation;
  • Apache-v2 was used originally for this project, making it a more natural choice compared to MIT;
  • Apache-v2 is used in most other H2O open source projects;
  • As of 2018, MIT is the most popular open source license on GitHub (26%), closely followed by Apache (22%).

Given this analysis, I'd say Apache-v2 handily defeats MIT.

@st-pasha st-pasha added the wont-fix Issues that will not be fixed for various reasons label Jan 29, 2020
@st-pasha st-pasha added this to the Olden times milestone Sep 18, 2020