Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Interface refactor #304

Open
dancooke opened this issue Apr 13, 2018 · 6 comments
Open

Interface refactor #304

dancooke opened this issue Apr 13, 2018 · 6 comments

Comments

@dancooke
Copy link
Contributor

dancooke commented Apr 13, 2018

I would like to suggest a fairly substantial change to rangers main Forest interface. Currently, a Forest must essentially be constructed with all parameters and data, with the same interface being used for both training and prediction. This has three problems:

  1. Forest is not reusable - if I want to predict a trained Forest on multiple data sets I need to construct a new Forest for each set (and pay the price of loading the Forest each time), or manually merge all my data - which may not be desirable or feasible.
  2. It is not good from an interface perspective since it does not separate areas of concern; training and prediction are coupled. If a user just wants to predict new data on a pre-trained forest, they shouldn't need to be concerned with any parameters to do with training, but the current interface enforces that.
  3. It prohibits potential performance optimisations since Forest and Data are strongly coupled; since Forest explicitly stores the Data it will later use for training and prediction, it must store a pointer to the data and incur a virtual lookup for every data point access. However, Data has a common interface, and Forest doesn't depend on what type of data is actually used - so long as it satisfies ordering properties etc. Ideally, rather than Data being used as a polymorphic type, the underlying data would be used directly via a templateed method.

I think something along the lines of the following interface would address these points:

class Forest
{
public:
    struct Parameters { /* general parameters (num trees etc) */ };
   
    // Construct a new forest from provided parameters
    Forest(Parameters params);

    // Construct a new forest from a saved forest 
    Forest(std::stream& is);
   
    // Train the forest on the provided input data;
    // the Forest is modified (no const).
    template <typename Matrix, typename Vector>
    void train(const Matrix& X, const Vector& y);
   
    // Predict new data and write results to y
    template <typename Matrix, typename Vector>
    Vector& predict(const Matrix& X, Vector& y) const;
   
    // Write the state of the Forest to the stream
    void serialise(std::ostream& os) const;

private:
    // implementation
};

What do you think? I can have a go at implementing this if you agree.

@mnwright
Copy link
Member

Generally no objections to an interface change. The current interface is mainly due to the R version. Btw., the R version is far more used (including by myself) and my major focus in development. Thus, your efforts to improve the standalone C++ version are highly appreciated!

I agree with the problems you describe (though at least 1 and 2 don't apply to the R version). As long as a new interface is working with Rcpp (I can do the changes related to that) I'm fine with a change.

@mnwright
Copy link
Member

Maybe related to this: It would also be nice to be able to compute permutation variable importance for existing forests.

@dlong11
Copy link

dlong11 commented Jul 23, 2019

@dancooke @mnwright This sounds like a great idea. Did this ever get any traction?

@mnwright
Copy link
Member

Not yet, unfortunately.

@rtlprmft
Copy link

A well sorted C++ API would be really helpful, as it could be linked to CERN's root package which makes I/O and plotting of results really trivial. Right now, it requires converting all data to ASCII files, running ranger, merging the prediction back to the data, converting the data back to binary to be able to create result plots.

@stephematician
Copy link
Contributor

stephematician commented Mar 27, 2023

For what it's worth, I've been refactoring ranger as a part of my very old attempt at a multiple imputation package. (edit) The refactoring is now sitting in its own package: https://github.com/stephematician/literanger (/edit)

Some of the issues here are addressed, e.g.:

  1. The training and prediction interfaces are separated.
  2. The Forest object is re-usable and is passed to/from R as an external pointer.
  3. Forest no longer stores Data as an object; I'm thinking about updating how it is passed around (currently it is passed around as a shared resource) - the template idea is nice.

Some issues raised here are not addressed: e.g. I'm prevaricating on polymorphism (compile-time vs run-time).

I'm nowhere near the full feature set of ranger - but regardless of where I end up, my effort might be useful as a starting point for further refactoring. I switched to the cpp11 package for R as it has safer semantics than Rcpp.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants