
[REVIEW]: BetaML: The Beta Machine Learning Toolkit, a self-contained repository of Machine Learning algorithms in Julia #2849

Closed
whedon opened this issue Nov 19, 2020 · 46 comments
@whedon

whedon commented Nov 19, 2020

Submitting author: @sylvaticus (Antonello Lobianco)
Repository: https://github.com/sylvaticus/BetaML.jl
Version: v0.2.2
Editor: @terrytangyuan
Reviewer: @ablaom, @ppalmes
Archive: 10.5281/zenodo.4730205

⚠️ JOSS reduced service mode ⚠️

Due to the challenges of the COVID-19 pandemic, JOSS is currently operating in a "reduced service mode". You can read more about what that means in our blog post.

Status


Status badge code:

HTML: <a href="https://joss.theoj.org/papers/27dfe1b0d25d1925af5424f0706f728f"><img src="https://joss.theoj.org/papers/27dfe1b0d25d1925af5424f0706f728f/status.svg"></a>
Markdown: [![status](https://joss.theoj.org/papers/27dfe1b0d25d1925af5424f0706f728f/status.svg)](https://joss.theoj.org/papers/27dfe1b0d25d1925af5424f0706f728f)

Reviewers and authors:

Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)

Reviewer instructions & questions

@ablaom & @ppalmes, please carry out your review in this issue by updating the checklist below. If you cannot edit the checklist please:

  1. Make sure you're logged in to your GitHub account
  2. Be sure to accept the invite at this URL: https://github.com/openjournals/joss-reviews/invitations

The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @terrytangyuan know.

Please start on your review when you are able, and be sure to complete your review in the next six weeks, at the very latest.

Review checklist for @ablaom

Conflict of interest

  • I confirm that I have read the JOSS conflict of interest (COI) policy and that: I have no COIs with reviewing this work or that any perceived COIs have been waived by JOSS for the purpose of this review.

Code of Conduct

General checks

  • Repository: Is the source code for this software available at the repository url?
  • License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?
  • Contribution and authorship: Has the submitting author (@sylvaticus) made major contributions to the software? Does the full list of paper authors seem appropriate and complete?
  • Substantial scholarly effort: Does this submission meet the scope eligibility described in the JOSS guidelines

Functionality

  • Installation: Does installation proceed as outlined in the documentation?
  • Functionality: Have the functional claims of the software been confirmed?
  • Performance: If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)

Documentation

  • A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
  • Installation instructions: Is there a clearly-stated list of dependencies? Ideally these should be handled with an automated package management solution.
  • Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems).
  • Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g., API method documentation)?
  • Automated tests: Are there automated tests or manual steps described so that the functionality of the software can be verified?
  • Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Software paper

  • Summary: Has a clear description of the high-level functionality and purpose of the software for a diverse, non-specialist audience been provided?
  • A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
  • State of the field: Do the authors describe how this software compares to other commonly-used packages?
  • Quality of writing: Is the paper well written (i.e., it does not require editing for structure, language, or writing quality)?
  • References: Is the list of references complete, and is everything cited appropriately that should be cited (e.g., papers, datasets, software)? Do references in the text use the proper citation syntax?

Review checklist for @ppalmes

Conflict of interest

  • I confirm that I have read the JOSS conflict of interest (COI) policy and that: I have no COIs with reviewing this work or that any perceived COIs have been waived by JOSS for the purpose of this review.

Code of Conduct

General checks

  • Repository: Is the source code for this software available at the repository url?
  • License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?
  • Contribution and authorship: Has the submitting author (@sylvaticus) made major contributions to the software? Does the full list of paper authors seem appropriate and complete?
  • Substantial scholarly effort: Does this submission meet the scope eligibility described in the JOSS guidelines

Functionality

  • Installation: Does installation proceed as outlined in the documentation?
  • Functionality: Have the functional claims of the software been confirmed?
  • Performance: If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)

Documentation

  • A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
  • Installation instructions: Is there a clearly-stated list of dependencies? Ideally these should be handled with an automated package management solution.
  • Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems).
  • Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g., API method documentation)?
  • Automated tests: Are there automated tests or manual steps described so that the functionality of the software can be verified?
  • Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Software paper

  • Summary: Has a clear description of the high-level functionality and purpose of the software for a diverse, non-specialist audience been provided?
  • A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
  • State of the field: Do the authors describe how this software compares to other commonly-used packages?
  • Quality of writing: Is the paper well written (i.e., it does not require editing for structure, language, or writing quality)?
  • References: Is the list of references complete, and is everything cited appropriately that should be cited (e.g., papers, datasets, software)? Do references in the text use the proper citation syntax?
@whedon

whedon commented Nov 19, 2020

Hello human, I'm @whedon, a robot that can help you with some common editorial tasks. @ablaom, @ppalmes it looks like you're currently assigned to review this paper 🎉.

⚠️ JOSS reduced service mode ⚠️

Due to the challenges of the COVID-19 pandemic, JOSS is currently operating in a "reduced service mode". You can read more about what that means in our blog post.

⭐ Important ⭐

If you haven't already, you should seriously consider unsubscribing from GitHub notifications for this (https://github.com/openjournals/joss-reviews) repository. As a reviewer, you're probably currently watching this repository, which means that, with GitHub's default behaviour, you will receive notifications (emails) for all reviews 😿

To fix this do the following two things:

  1. Set yourself as 'Not watching' https://github.com/openjournals/joss-reviews:

  2. You may also like to change your default settings for watching repositories in your GitHub profile here: https://github.com/settings/notifications

For a list of things I can do to help you, just type:

@whedon commands

For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:

@whedon generate pdf

@whedon

whedon commented Nov 19, 2020

👉📄 Download article proof 📄 View article proof on GitHub 📄 👈

@whedon

whedon commented Nov 19, 2020

Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1007/s10107-010-0420-4 is OK
- 10.5281/zenodo.3541505 is OK
- 10.21105/joss.00602 is OK

MISSING DOIs

- None

INVALID DOIs

- None

@ablaom

ablaom commented Nov 19, 2020

Okay, here's an update of my review from the pre-review thread

What the package provides

The package under review provides pure-julia implementations of two
tree-based models, three clustering models, a perceptron model (with 3
variations) and a basic neural network model. In passing, it should be
noted that all or almost all of these algorithms have existing julia
implementations (e.g., DecisionTree.jl, Clustering.jl, Flux.jl). The package
is used in a course on Machine Learning but integration between the
package and the course is quite loose, as far as I could ascertain
(more on this below).

~~Apart from a library of loss functions, the package provides no
other tools.~~
In addition to the models the package provides a number
of loss functions, as well as activation functions for the neural
network models, and some tools to rescale data. I did not see tools to
automate resampling (such as cross-validation), hyper parameter
optimization, and no model composition (pipelining). The quality of
the model implementations looks good to me, although the author warns
us that "the code is not heavily optimized and GPU [for neural
networks] is not supported "

Existing machine learning toolboxes in Julia

For context, consider the following multi-paradigm ML
toolboxes written in Julia which are relatively mature, by Julia standards:

package | number of models | resampling | hyper-parameter optimization | composition
--------|------------------|------------|------------------------------|------------
ScikitLearn.jl | > 150 | yes | yes | basic
AutoMLPipeline.jl | > 100 | no | no | medium
MLJ.jl | 151 | yes | yes | advanced

In addition to these are several excellent and mature packages
dedicated to neural-networks, the most popular being the AD-driven
Flux.jl package. So far, these provide limited meta-functionality,
although MLJ now provides an interface to certain classes of Flux
models (MLJFlux) and
ScikitLearn.jl provides interfaces to python neural network models
sufficient for small datasets and pedagogical use.

Disclaimer: I am a designer/contributor to MLJ.

According to the JOSS requirements,
Submissions should "Have an obvious research application."
In its
current state of maturity, BetaML is not a serious competitor to the
frameworks above, for contributing directly to research. However, the
author argues that it has pedagogical advantages over existing tools.

Value as pedagogical tool

I don't think there are many rigorous machine learning courses or
texts closely integrated with models and tools implemented in julia
and it would be useful to have more of these. ~~The degree of
integration in this case was difficult for me to ascertain because I
couldn't see how to access the course notes without formally
registering for the course (which is, however, free).~~
I was also
disappointed to find only one link from doc-strings to course
materials; from this "back door" to the course notes I could find no
reference back to the package, however. Perhaps there is better
integration in course exercises? I couldn't figure this out.

**edit** Okay, I see that I missed the link to the course notes, as
opposed to the course itself. However the notes make only references
to python code and so do not appear to be directly integrated with the
package BetaML.

The remaining argument for BetaML's pedagogical value rests on a
number of perceived drawbacks of existing toolboxes, for the
beginner. Quoting from the JOSS manuscript:

  1. "For example the popular Deep Learning library Flux (Mike Innes,
    2018), while extremely performant and flexible, adopts some
    designing choices that for a beginner could appear odd, for example
    avoiding the neural network object from the training process, or
    requiring all parameters to be explicitly defined. In BetaML we
    made the choice to allow the user to experiment with the
    hyperparameters of the algorithms learning them one step at the
    time. Hence for most functions we provide reasonable default
    parameters that can be overridden when needed."

  2. "To help beginners, many parameters and functions have pretty
    longer but more explicit names than usual. For example the Dense
    layer is a DenseLayer, the RBF kernel is radialKernel, etc."

  3. "While avoiding the problem of “reinventing the wheel”, the
    wrapping level unintentionally introduces some complications for
    the end-user, like the need to load the models and learn
    MLJ-specific concepts as model or machine. We chose instead to
    bundle the main ML algorithms directly within the package. This
    offers a complementary approach that we feel is more
    beginner-friendly."

Let me respond to these:

  1. These criticisms only apply to dedicated neural network
    packages, such as Flux.jl; all of the toolboxes listed
    above provide default hyper parameters for every model. In the case
    of neural networks, user-friendly interaction close to the kind
    sought here is available either by using the MLJFlux.jl models
    (available directly through MLJ) or by using the python models
    provided through ScikitLearn.jl.

  2. Yes, shorter names are obstacles for the beginner but hardly
    insurmountable. For example, one could provide a cheat sheet
    summarizing the models and other functionality needed for the
    machine learning course (and omitting all the rest).

  3. Yes, not needing to load in model code is slightly more
    friendly. On the other hand, in MLJ for example, one can load and
    instantiate a model with a single macro. So the main complication
    is having to ensure relevant libraries are in your environment. But
    this could be solved easily with a BeginnerPackage which curates
    all the necessary dependencies. I am not convinced beginners should
    find the idea of separating hyper-parameters and learned parameters
    (the "machines" in MLJ) that daunting. I suggest the author's
    criticism may have more to do with their lack of familiarity than a
    difficulty for newcomers, who do not have the same preconceptions
    from using other frameworks. In any case, the point is moot, as one
    can interact with MLJ models directly via a "model" interface and
    ignore machines. To see this, I have
    translated part of a
    BetaML notebook into MLJ syntax. There's hardly any difference - if
    anything the presentation is simpler (less hassle when splitting
    data horizontally and vertically).

In summary, while existing toolboxes might present a course instructor
with a few challenges, these are hardly game-changers. The advantages of
introducing a student to a powerful, mature, professional toolbox ab
initio far outweigh any drawbacks, in my view.

Conclusions

To meet the requirements of JOSS, I think either: (i) The BetaML
package needs to demonstrate tighter integration with ~~easily
accessible~~ course materials; or (ii) BetaML needs very substantial
enhancements to make it competitive with existing toolboxes.

Frankly, I believe a greater service to the Julia open-source software
community would be to integrate the author's course materials with one
of the mature ML toolboxes. In the case of MLJ, I would be more than
happy to provide guidance for such a project.


Sundry comments

I didn't have too much trouble installing the package or running the
demos, except when running a notebook on top of an existing Julia
environment (see comment below).

  • **added** The repository states quite clearly that the primary
    purpose of the package is didactic (for teaching purposes). If this
    is true, the paper should state this clearly in the "Summary" (not
    just that it was developed in response to the course).

  • **added** The authors should reference for comparison the toolboxes
    ScikitLearn.jl and AutoMLPipeline.jl

  • The README.md should provide links to the toolboxes listed in
    the table above, for the student who "graduates" from BetaML.

  • Some or most intended users will be new to Julia, so I suggest
    including with the installation instructions something about how to
    set up a julia environment that includes BetaML. Something like
    this, for example.

  • I found it weird that the front-facing demo is an unsupervised
    model. A more "Hello World" example might be to train a Decision
    Tree.

  • The way users load the built-in datasets seems pretty awkward. Maybe
    just define some functions to do this? E.g.,
    load_bike_sharing(). Might be instructive to have examples where
    data is pulled in using RDatasets, UrlDownload or similar?

  • A cheat-sheet summarizing the model fitting functions and the loss
    functions would be helpful. Or you could have functions models() and
    loss_functions() that list these.

  • I found it pretty annoying to split data by hand the way this is
    done in the notebooks and even beginners might find this
    annoying. One utility function would go a long way to making
    life easier here (something like the partition function in
    MLJ, which you are welcome to lift).

  • The notebooks are not portable as they do not come with a
    Manifest.toml. One suggestion on how to handle this is
    here
    but you should add a comment in the notebook explaining that the
    notebook is only valid if it is accompanied by the Manifest.toml. I
    think an even better solution is provided by InstantiateFromUrl.jl
    but I haven't tried this yet.

  • The name em for the expectation-maximization clustering algorithm
    is very terse, and likely to conflict with a user variable. I admit, I had
    to dig up the doc-string to find out what it was.

@whedon

whedon commented Nov 26, 2020

👋 @ppalmes, please update us on how your review is going.

@whedon

whedon commented Nov 26, 2020

👋 @ablaom, please update us on how your review is going.

@ablaom

ablaom commented Nov 26, 2020

I would consider my initial review finished. I have left unchecked "Substantial scholarly effort: Does this submission meet the scope eligibility described in the JOSS guidelines", as I would prefer the editor make this call, based on my comments. I would say "yes", but according to the guidelines there should be an "obvious research application". If research includes education, then yes, definitely scholarly.

The other minor installation item needs addressing by the author; I make some suggestions.

@ppalmes

ppalmes commented Nov 26, 2020

I'll start my review next week.

@sylvaticus

Dear all, I am pretty new to this open format of reviewing papers. Please let me know if and when I am supposed to reply, and in particular whether I need to wait for the second reviewer and/or the editors, thank you :-)

@ppalmes

ppalmes commented Dec 5, 2020

I would suggest you reply to @ablaom's review questions/comments, because it can also help hasten the review process; I can then just focus on any issues not covered by your conversation if I still need more clarification.

@terrytangyuan

Yes, @sylvaticus please respond to the existing feedback while we are waiting for additional feedback from @ppalmes. Thanks.

@sylvaticus

Author's response to @ablaom review 1

Above all, I would like to thank the reviewer for having taken the time to provide the review and the useful suggestions he brings. I have implemented most of them, as they helped improve the software.

My detailed response is below.

Okay, here's an **update** of my review from the [pre-review thread](https://github.com/openjournals/joss-reviews/issues/2512)

## What the package provides

The package under review provides pure-julia implementations of two
tree-based models, three clustering models, a perceptron model (with 3
variations) and a basic neural network model. In passing, it should be
noted that all or almost all of these algorithms have existing julia
implementations (e.g., DecisionTree.jl, Clustering.jl, Flux.jl).

While "most" of the functionality is indeed already present, from the user point of view, they are not necessarily accessed in the same way and for some functionality, like missing imputation using GMM models, I am not aware of implementations in Julia. Also the kind of output is often different from current implementations. For example most classifiers in BetaML report the whole PMF of the various items rather than the mode. Together with the fact that the function accuracy has an extra optional parameter for selecting the range of items to consider the estimate correct, one can train a classifier that is best in returning a correct value for example within the most probable 2 results (rather than the single most probable one). This can be useful in some applications where the second-best is also an acceptable value.

The package
is used in a course on Machine Learning but integration between the
package and the course is quite loose, as far as I could ascertain
(more on this below).

I am sorry for the misunderstanding here. I am not affiliated with that course. The referenced course uses Python to teach the algorithms, while I believe a Julia approach is more appropriate when dealing with the internals of the algorithms (as opposed to "just" using some API); this is why I translated, and generalised, the code into Julia.

~~Apart from a library of loss functions, the package provides no
other tools.~~ In addition to the models the package provides a number
of loss functions, as well as activation functions for the neural
network models, and some tools to rescale data. I did not see tools to
automate resampling (such as cross-validation), hyper parameter
optimization, and no model composition (pipelining). The quality of
the model implementations looks good to me, although the author warns
us that "the code is not heavily optimized and GPU [for neural
networks] is not supported "

While tools for automatic sampling and cross-validation may be in scope for BetaML, I believe that the added value of pipelining in a language like Julia is not as strong as it is for other programming languages.
In R and Python, for example, loops are slow, and it definitely helps to have a fast library implementing, for example, hyper-parameter tuning.
Julia is instead highly expressive and has fast loops at the same time. The computational and convenience benefits of using a specific framework to build a chain of models or tune hyper-parameters must be weighed against the flexibility and ease of using just the "core" Julia functionality to do the same, so the advantage is partially diminished and depends on the situation.
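
As a plain-Julia illustration of this point (the `train` and `score` functions below are hypothetical placeholders, not BetaML or MLJ calls), a small hyper-parameter search is just an ordinary comprehension:

```julia
# Manual grid search without any pipeline/tuning framework.
train(x, y; k) = k                          # placeholder "model": just the value of k
score(model, xval, yval) = abs(model - 3)   # placeholder validation loss

xtrain, ytrain, xval, yval = rand(20, 2), rand(20), rand(10, 2), rand(10)

losses = [score(train(xtrain, ytrain; k=k), xval, yval) for k in 1:10]
best_loss, best_k = findmin(losses)         # (0, 3) with these placeholders
```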

## Existing machine learning toolboxes in Julia

For context, consider the following multi-paradigm ML
toolboxes written in Julia which are relatively mature, by Julia standards:

package          | number of models | resampling  | hyper-parameter optimization | composition
-----------------|------------------|-------------|------------------------------|-------------
[ScikitLearn.jl](https://github.com/cstjean/ScikitLearn.jl)   | > 150            | yes         | yes                          | basic
[AutoMLPipeline.jl](https://github.com/IBM/AutoMLPipeline.jl)| > 100            | no          | no                           | medium
[MLJ.jl](https://joss.theoj.org/papers/10.21105/joss.02704)           | 151              | yes         | yes                          | advanced

In addition to these are several excellent and mature packages
dedicated to neural-networks, the most popular being the AD-driven
Flux.jl package. So far, these provide limited meta-functionality,
although MLJ now provides an interface to certain classes of Flux
models ([MLJFlux](https://github.com/alan-turing-institute/MLJFlux.jl)) and
ScikitLearn.jl provides interfaces to python neural network models
sufficient for small datasets and pedagogical use.

Disclaimer: I am a designer/contributor to MLJ.

**According to the [JOSS requirements](https://joss.theoj.org/about),
Submissions should "Have an obvious research application."**  In its
current state of maturity, BetaML is not a serious competitor to the
frameworks above, for contributing directly to research. However, the
author argues that it has pedagogical advantages over existing tools.

## Value as pedagogical tool

I don't think there are many rigorous machine learning courses or
texts closely integrated with models and tools implemented in julia
and it would be useful to have more of these. ~~The degree of
integration in this case was difficult for me to ascertain because I
couldn't see how to access the course notes without formally
registering for the course (which is, however, free).~~ I was also
disappointed to find only one link from doc-strings to course
materials; from this "back door" to the course notes I could find no
reference back to the package, however. Perhaps there is better
integration in course exercises? I couldn't figure this out.

**edit** Okay, I see that I missed the link to the course notes, as
opposed to the course itself. However the notes make only references
to python code and so do not appear to be directly integrated with the
package BetaML.

The remaining argument for BetaML's pedagogical value rests on a
number of perceived drawbacks of existing toolboxes, for the
beginner. Quoting from the JOSS manuscript:

1. "For example the popular Deep Learning library Flux (Mike Innes,
   2018), while extremely performant and flexible, adopts some
   designing choices that for a beginner could appear odd, for example
   avoiding the neural network object from the training process, or
   requiring all parameters to be explicitly defined. In BetaML we
   made the choice to allow the user to experiment with the
   hyperparameters of the algorithms learning them one step at the
   time. Hence for most functions we provide reasonable default
   parameters that can be overridden when needed."

2. "To help beginners, many parameters and functions have pretty
   longer but more explicit names than usual. For example the Dense
   layer is a DenseLayer, the RBF kernel is radialKernel, etc."

3. "While avoiding the problem of “reinventing the wheel”, the
   wrapping level unintentionally introduces some complications for
   the end-user, like the need to load the models and learn
   MLJ-specific concepts as model or machine.  We chose instead to
   bundle the main ML algorithms directly within the package. This
   offers a complementary approach that we feel is more
   beginner-friendly."

Let me respond to these:

1. These criticisms only apply to dedicated neural network
   packages, such as Flux.jl; all of the toolboxes listed
   above provide default hyper parameters for every model. In the case
   of neural networks, user-friendly interaction close to the kind
   sought here is available either by using the MLJFlux.jl models
   (available directly through MLJ) or by using the python models
   provided through ScikitLearn.jl.

2. Yes, shorter names are obstacles for the beginner but hardly
   insurmountable. For example, one could provide a cheat sheet
   summarizing the models and other functionality needed for the
   machine learning course (and omitting all the rest).

3. Yes, not needing to load in model code is slightly more
   friendly. On the other hand, in MLJ for example, one can load and
   instantiate a model with a single macro. So the main complication
   is having to ensure relevant libraries are in your environment. But
   this could be solved easily with a `BeginnerPackage` which curates
   all the necessary dependencies. I am not convinced beginners should
   find the idea of separating hyper-parameters and learned parameters
   (the "machines" in MLJ) that daunting. I suggest the author's
   criticism may have more to do with their lack of familiarity than a
   difficulty for newcomers, who do not have the same preconceptions
   from using other frameworks. In any case, the point is moot, as one
   can interact with MLJ models directly via a "model" interface and
   ignore machines. To see this, I have
   [translated](https://github.com/ablaom/ForBetaMLReview) part of a
   BetaML notebook into MLJ syntax. There's hardly any difference - if
   anything the presentation is simpler (less hassle when splitting
   data horizontally and vertically).

In summary, while existing toolboxes might present a course instructor
with a few challenges, these are hardly game-changers. The advantages of
introducing a student to a powerful, mature, professional toolbox *ab*
*initio* far outweigh any drawbacks, in my view.

I rephrased the README.md of the package, as the project evolved from being a mere "rewriting" of algorithms in Julia.
The focus of the package is on accessibility for people with different backgrounds, and consequently different interests, than researchers or practitioners in computer science.
The current ML ecosystem in Julia is out of reach for some kinds of PhD students and researchers, for example many in my lab.
They have different research interests and don't have the time to dive deeply into ML, "just" applying it (often to small datasets) to their concrete problems. So the way the algorithms are accessed is particularly important. This is why, for example, both the decision-tree and GMM algorithms in BetaML accept data with missing values, and it is not necessary to specify in the decision-tree algorithm the kind of job (regression/classification), as this is automatically inferred from the type of the labels (this is also true for DecisionTree.jl, but using two different APIs, DecisionTreeRegressor/DecisionTreeClassifier on one side and build_tree on the other). This is an example where we explicitly traded efficiency for simplicity, as adding support for missing data directly in the algorithms considerably reduces their performance (and this is the reason, I assume, why the leading packages don't implement it).
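
For concreteness, a rough sketch of the usage pattern described above; the function names (`buildTree`, `predict`) follow my reading of BetaML's function-based API of that period and should be taken as an assumption rather than a verbatim excerpt:

```julia
using BetaML.Trees

x = [1.2 4.5; 2.1 missing; 0.9 3.3; 1.8 4.1]  # missing values accepted as-is

ycat = ["a", "b", "a", "b"]     # categorical labels → a classification tree is inferred
tcat = buildTree(x, ycat)

ynum = [0.3, 1.7, 0.2, 1.5]     # numerical labels → a regression tree is inferred
tnum = buildTree(x, ynum)

predict(tcat, x)                # classification output is a PMF per record, not just the mode
```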

## Conclusions

To meet the requirements of JOSS, I think either: (i) The BetaML
package needs to demonstrate tighter integration with ~~easily
accessible~~ course materials; or (ii) BetaML needs very substantial
enhancements to make it competitive with existing toolboxes.

Frankly, I believe a greater service to the Julia open-source software
community would be to integrate the author's course materials with one
of the mature ML toolboxes. In the case of MLJ, I would be more than
happy to provide guidance for such a project.

I do appreciate both the reviewer's comments and MLJ as a mature, state-of-the-art framework; I just believe that there is space for a different approach with different use cases.


## Sundry comments

I didn't have too much trouble installing the package or running the
demos, except when running a notebook on top of an existing Julia
environment (see comment below).

- **added** The repository states quite clearly that the primary
  purpose of the package is didactic (for teaching purposes). If this
  is true, the paper should state this clearly in the "Summary" (not
  just that it was developed in response to the course).

As specified in a previous comment, the focus is on usability, whether that is important for didactic or for applied-research purposes.

- **added** The authors should reference for comparison the toolboxes
  ScikitLearn.jl and AutoMLPipeline.jl

- The README.md should provide links to the toolboxes listed in
  the table above, for the student who "graduates" from BetaML.

I added an "Alternative packages" section that lists the most relevant and mature Julia packages in the topics covered by BetaML.

- Some or most intended users will be new to Julia, so I suggest
  including with the installation instructions something about how to
  set up a julia environment that includes BetaML. Something like
  [this](https://alan-turing-institute.github.io/MLJ.jl/dev/#Installation-1), for example.
- A cheat-sheet summarizing the model fitting functions and the loss
  functions would be helpful. Or you could have functions `models()` and
  `loss_functions()` that list these.

Being a much smaller package than MLJ, I believe the "Installation" and "Loading the module(s)" sections of the documentation (for the first point) and the "Usage" section (for the second one) suffice.

- I found it weird that the front-facing demo is an *unsupervised*
  model. A more "Hello World" example might be to train a Decision
  Tree.

I added a basic Random Forest example to the README.md so as to provide readers with an overview of different techniques for analysing the same dataset (iris).

- The way users load the built-in datasets seems pretty awkward. Maybe
  just define some functions to do this? E.g.,
  `load_bike_sharing()`. Might be instructive to have examples where
  data is pulled in using `RDatasets`, `UrlDownload` or similar?

I now load the data using a path relative to the package base path. In this way the script should load the correct data whatever the current directory from which the user calls it.
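
For example, along these lines (the data file name and folder layout here are purely illustrative):

```julia
using BetaML, DelimitedFiles

# Build the path from the package location rather than the user's current directory.
basedir = joinpath(dirname(pathof(BetaML)), "..")
data    = readdlm(joinpath(basedir, "test", "data", "bike_sharing_day.csv"), ',')
```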

- I found it pretty annoying to split data by hand the way this is
  done in the notebooks and even beginners might find this
  annoying. One utility function here would go a long way to making
  life easier here (something like the `partition` function in the
  MLJ, which you are welcome to lift).

Thank you. I did indeed add a simple `partition` function to allow partitioning multiple matrices in one line, e.g.
`((xtrain,xtest),(ytrain,ytest)) = partition([x,y],[0.7,0.3])`.
Note that a release of the software including the new `partition` function has yet to be made.

- The notebooks are not portable as they do not come with a
  Manifest.toml. One suggestion on how to handle this is
  [here](https://github.com/ablaom/ForBetaMLReview/blob/main/bike_sharing.ipynb)
  but you should add a comment in the notebook explaining that the
  notebook is only valid if it is accompanied by the Manifest.toml. I
  think an even better solution is provided by InstantiateFromUrl.jl
  but I haven't tried this yet.

Having a manifest means that I need to keep it updated and that the user has to understand what it is.
Instead, the notebooks all have a section at the beginning where the required packages are loaded. In this way, even if the user just copies and pastes the code into his/her own IDE, it will likely work.

A related issue is guaranteeing that the notebooks are kept in sync with the code. I noticed that the reviewer uses Literate.jl; I may consider it, as it helps keep the examples under testing control.

- The name `em` for the expectation-maximization clustering algorithm
  is very terse, and likely to conflict with a user variable.  I admit, I had
  to dig up the doc-string to find out what it was.

I agree and changed the name to gmm.

@ablaom

ablaom commented Dec 14, 2020

Response to author's response to my review

@sylvaticus Thank you for your response and addressing some of my criticisms.

@terrytangyuan The author has not addressed, to my satisfaction, a central objection, which can be rephrased as this: To show the software meets a research need, it needs to be demonstrated that the software is substantially easier to use than the substantially more powerful alternatives (in the demonstrated absence of some other pedagogical value). The author agrees that there are much more powerful alternatives. However, I maintain that BetaML is not substantially easier to use or to learn, as I detail in my rebuttals 1-3 to assertions in the paper.

This said, as an author of one of the alternatives, I naturally find my package easier to use than one with which I am less familiar. It is possible that @sylvaticus feels the same way about BetaML for much the same reason. Perhaps @ppalmes would care to comment specifically on this question (see italics above).

To be clear, I think the software and paper are quality products. I also do not dismiss the possibility that users might prefer a future enhanced version of BetaML to existing alternatives. I am simply questioning whether BetaML meets the stated requirements of JOSS at this stage of its development.

@arfon

arfon commented Feb 8, 2021

I would suggest you reply to @ablaom's review questions/comments, because it can also help hasten the review process; I can then just focus on any issues not covered by your conversation if I still need more clarification.

👋 @ppalmes - I think this review could definitely benefit from your input here 🙏

@ppalmes

ppalmes commented Feb 12, 2021

My decision is Major Revision.

The main contribution of the package is the reimplementation in pure Julia of the various algorithms in supervised and unsupervised learning for teaching purposes.

I agree with @ablaom that in terms of usability, other existing toolkits are more straightforward and consistent to use. Among the things that I consider to be a major issue is the absence of a pipeline API. All the related packages mentioned support this API, which is a big factor for usability.

Here is my list of suggestions:

  1. Improve the online documentation. If the target audience is students, the online documentation needs more examples and tutorials. There is only one page showing examples in the online documentation. Notebooks are great, but static documentation (HTML or PDF) is faster to read without any installation issues.
  2. Include good use cases in the documentation that employ the toolkit to solve real problems incorporating different strategies from the toolbox. This will add value to the scholarly effort. Perform some benchmarks and discuss the results and implementation choices, including internal data structures.
  3. Implement the pipeline API. It's important for the ML toolkit to make the data preprocessing steps composable for easier usage and experimentation from the perspective of students.

@arfon

arfon commented Apr 13, 2021

👋 all, I'm stepping in here to assist @terrytangyuan who is struggling to make time for JOSS editorial work at this time.

Firstly, @ppalmes and @ablaom, many thanks for your careful and constructive reviews. There is some excellent feedback here for @sylvaticus.

I do need to address one aspect of this feedback however, best captured in this comment:

@terrytangyuan The author has not addressed, to my satisfaction, a central objection, which can be rephrased as this: To show the software meets a research need, it needs to be demonstrated that the software is substantially easier to use than the substantially more powerful alternatives (in the demonstrated absence of some other pedagogical value). The author agrees that there are much more powerful alternatives. However, I maintain that BetaML is not substantially easier to use or to learn, as I detail in my rebuttals 1-3 to assertions in the paper.

I agree this is important but it's not a strict requirement for a work to be published in JOSS. Primarily, the review criteria around [Substantial Scholarly Effort](https://joss.readthedocs.io/en/latest/review_criteria.html#substantial-scholarly-effort) are designed to exclude very minor software contributions which we don't believe add meaningful value for potential users of the tooling. Based on @sylvaticus' responses in this thread I do not believe this work falls into that category.

JOSS' primary mission is to provide a mechanism for authors doing software work to receive career credit for their work, and in borderline situations such as this, we defer to the author's need/ability to be cited for their work. As such, on this criterion of Substantial scholarly effort I am making an editorial decision to allow this submission to move forward.

That said, there is still a reasonable amount of feedback (especially the most recent from @ppalmes) that it would be good to hear your response to, @sylvaticus. Could you please respond here with your thoughts and potential plans to address it?

@ppalmes

ppalmes commented Apr 13, 2021

Yeah, I am OK to proceed with publication. My suggestions are to make the work more usable to the wider community. It is usable in its current form and I believe that the work will continue to improve.

@sylvaticus

Yes, as you can see in the commit log, I am actually still implementing the modifications requested by the reviewers. I created an interface to my models for one of the toolboxes cited (this interface has already been pushed but still needs to be included in a release of BetaML) and I am implementing a more detailed set of tutorials.
I would still need 1-2 weeks to complete it and update the JOSS paper.

@arfon

arfon commented Apr 13, 2021

⚡ thanks for the feedback @sylvaticus, looking forward to seeing the updates!

@ablaom

ablaom commented Apr 14, 2021

I agree this is important but it's not a strict requirement for a work to be published in JOSS. Primarily, the review criteria around [Substantial Scholarly Effort](https://joss.readthedocs.io/en/latest/review_criteria.html#substantial-scholarly-effort) are designed to exclude very minor software contributions which we don't believe add meaningful value for potential users of the tooling. Based on @sylvaticus' responses in this thread I do not believe this work falls into that category.

@arfon Thanks for this clarification! Your statement makes perfect sense to me and I am very happy to see this work acknowledged through publication.

@sylvaticus

sylvaticus commented Apr 19, 2021

Dear editor and reviewers, I have updated the library and the paper to account for the reviewers' comments:

  • a detailed step-by-step tutorial has been added to show the usage of the library (and, more generally, of ML techniques) and to compare it with existing libraries. As I expected, BetaML is certainly less computationally performant, but it is not less accurate nor (with the notable exception of neural networks) less flexible;
  • I added several interfaces to BetaML models to be used within the MLJ framework;
  • In the meantime I updated/added several utility functions to the algorithms. For example, the set of scale, pca, oneHotEncoder, crossValidation (with a configurable user-provided function / "do block") and other functions makes it easy to build a "workflow" from the data to the ML models, even if not exactly a "pipeline" (see the sketch below).
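
A rough sketch of that "do block" cross-validation workflow (the function names and signatures here are assumptions based on the description above, not a verbatim copy of the BetaML API):

```julia
using BetaML

x, y = rand(100, 3), rand(100)

# One user-provided function per fold; the returned values are collected per fold.
cv_losses = crossValidation([x, y]) do trainData, valData, rng
    (xtrain, ytrain) = trainData
    (xval,   yval)   = valData
    forest = buildForest(xtrain, ytrain, 20)   # any model-fitting call
    ŷ      = predict(forest, xval)
    meanRelError(ŷ, yval)
end
```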

I am confident that the modifications introduced will help the users of the library, and I thank the reviewers for the time they spent suggesting improvements to the library and for their guidance in implementing them.

@sylvaticus

@whedon generate pdf

@whedon

whedon commented Apr 20, 2021

👉📄 Download article proof 📄 View article proof on GitHub 📄 👈

@sylvaticus

@arfon what are the next steps now ? Should I create a software deposit on zenodo ?

@arfon

arfon commented Apr 23, 2021

@whedon generate pdf

@whedon

whedon commented Apr 23, 2021

👉📄 Download article proof 📄 View article proof on GitHub 📄 👈

@arfon

arfon commented Apr 23, 2021

@sylvaticus - could you please clean up all of the comments in your paper? I was trying to give the paper.md a read, and realized that lots of the content that I was struggling with was actually commented out.

Also, please add more information to your affiliations - I'm not sure what many of them are.

@sylvaticus

@whedon generate pdf

Done. I have removed the comments and specified the full affiliation names.

I am very sorry for the 6 affiliations (it's crazy, I know..) but that's the way we have been asked to sign our papers :-/ :

  1. Everything published in BETA by INRA and AgroParisTech researchers, and everything in BETA dealing with forests and wood, must be signed:

"Université de Lorraine, Université de Strasbourg, AgroParisTech, CNRS, INRA, BETA, 54000, Nancy, France" for those based in Nancy

and

"Université de Strasbourg, Université de Lorraine, AgroParisTech, CNRS, INRA, BETA, 67000, Strasbourg, France" for those based in Strasbourg

@whedon

whedon commented Apr 23, 2021

👉📄 Download article proof 📄 View article proof on GitHub 📄 👈

@arfon

arfon commented Apr 30, 2021

@sylvaticus - I made a few minor changes to your paper here: sylvaticus/BetaML.jl#23. Once you have merged this PR, could you make a new release of this software that includes the changes that have resulted from this review? Then, please make an archive of the software on Zenodo/figshare/another service and update this thread with the DOI of the archive. For the Zenodo/figshare archive, please make sure that:

  • The title of the archive is the same as the JOSS paper title
  • That the authors of the archive are the same as the JOSS paper authors

I can then move forward with accepting the submission.

@sylvaticus

Hello, I have created release v0.5.1 of the software which includes the text corrections of @arfon (thank you!) and I have deposited it on Zenodo: https://doi.org/10.5281/zenodo.4730205

@sylvaticus

@whedon set 10.5281/zenodo.4730205 as archive

@whedon

whedon commented Apr 30, 2021

I'm sorry @sylvaticus, I'm afraid I can't do that. That's something only editors are allowed to do.

@arfon

arfon commented Apr 30, 2021

@whedon set 10.5281/zenodo.4730205 as archive

@whedon

whedon commented Apr 30, 2021

OK. 10.5281/zenodo.4730205 is the archive.

@arfon

arfon commented Apr 30, 2021

@whedon accept

whedon added the recommend-accept (Papers recommended for acceptance in JOSS) label Apr 30, 2021
@whedon

whedon commented Apr 30, 2021

Attempting dry run of processing paper acceptance...

@whedon

whedon commented Apr 30, 2021

👋 @openjournals/joss-eics, this paper is ready to be accepted and published.

Check final proof 👉 openjournals/joss-papers#2267

If the paper PDF and Crossref deposit XML look good in openjournals/joss-papers#2267, then you can now move forward with accepting the submission by compiling again with the flag deposit=true e.g.

@whedon accept deposit=true

@whedon

whedon commented Apr 30, 2021

Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1007/s10107-010-0420-4 is OK
- 10.5281/zenodo.3541505 is OK
- 10.21105/joss.00602 is OK
- 10.21105/joss.01284 is OK
- 10.5281/zenodo.4294939 is OK

MISSING DOIs

- None

INVALID DOIs

- None

@arfon

arfon commented Apr 30, 2021

@whedon accept deposit=true

@whedon

whedon commented Apr 30, 2021

Doing it live! Attempting automated processing of paper acceptance...

whedon added the accepted and published (Papers published in JOSS) labels Apr 30, 2021
@whedon

whedon commented Apr 30, 2021

🐦🐦🐦 👉 Tweet for this paper 👈 🐦🐦🐦

@whedon

whedon commented Apr 30, 2021

🚨🚨🚨 THIS IS NOT A DRILL, YOU HAVE JUST ACCEPTED A PAPER INTO JOSS! 🚨🚨🚨

Here's what you must now do:

  1. Check final PDF and Crossref metadata that was deposited 👉 Creating pull request for 10.21105.joss.02849 joss-papers#2268
  2. Wait a couple of minutes, then verify that the paper DOI resolves https://doi.org/10.21105/joss.02849
  3. If everything looks good, then close this review issue.
  4. Party like you just published a paper! 🎉🌈🦄💃👻🤘

Any issues? Notify your editorial technical team...

@arfon

arfon commented Apr 30, 2021

@ablaom, @ppalmes - many thanks for your reviews here and to @terrytangyuan for editing this submission. JOSS relies upon the volunteer efforts of people like you and we simply wouldn't be able to do this without you ✨

@sylvaticus - your paper is now accepted and published in JOSS ⚡🚀💥

arfon closed this as completed Apr 30, 2021
@whedon

whedon commented Apr 30, 2021

🎉🎉🎉 Congratulations on your paper acceptance! 🎉🎉🎉

If you would like to include a link to your paper from your README use the following code snippets:

Markdown:
[![DOI](https://joss.theoj.org/papers/10.21105/joss.02849/status.svg)](https://doi.org/10.21105/joss.02849)

HTML:
<a style="border-width:0" href="https://doi.org/10.21105/joss.02849">
  <img src="https://joss.theoj.org/papers/10.21105/joss.02849/status.svg" alt="DOI badge" >
</a>

reStructuredText:
.. image:: https://joss.theoj.org/papers/10.21105/joss.02849/status.svg
   :target: https://doi.org/10.21105/joss.02849

This is how it will look in your documentation:

(DOI badge)

We need your help!

Journal of Open Source Software is a community-run journal and relies upon volunteer effort. If you'd like to support us please consider doing either one (or both) of the following:

@sylvaticus

Thanks everyone for your precious time and useful suggestions. I ran the cloc utility again and the lines of code went from 2,450 at the start of the review to over 5,000 now, most of which incorporate the reviewers' ideas and suggestions.
