Interface refactor #304

dancooke · 2018-04-13T12:56:27Z

I would like to suggest a fairly substantial change to ranger's main Forest interface. Currently, a Forest must essentially be constructed with all parameters and data, with the same interface being used for both training and prediction. This has three problems:

1. Forest is not reusable: if I want to run predictions with a trained Forest on multiple data sets, I need to construct a new Forest for each set (and pay the price of loading the Forest each time), or manually merge all my data, which may not be desirable or feasible.
2. It is not good from an interface perspective since it does not separate concerns; training and prediction are coupled. If a user just wants to predict new data with a pre-trained forest, they shouldn't need to be concerned with any training parameters, but the current interface forces them to be.
3. It prohibits potential performance optimisations since Forest and Data are strongly coupled. Because Forest explicitly stores the Data it will later use for training and prediction, it must store a pointer to the data and incur a virtual lookup for every data point access. However, Data has a common interface, and Forest doesn't depend on what type of data is actually used, so long as it satisfies ordering properties etc. Ideally, rather than Data being used as a polymorphic type, the underlying data would be used directly via a templated method.
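To make the third point concrete, here is a minimal sketch contrasting the two access styles. All names (`RowMajorMatrix`, `Data`, `DenseData`, `column_sum_virtual`, `column_sum_template`) are hypothetical and are not taken from ranger's source; the sketch only illustrates the general difference between a virtual accessor and a templated one.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical plain storage: dense, row-major, no virtual functions.
struct RowMajorMatrix {
    std::size_t n_rows;
    std::size_t n_cols;
    std::vector<double> values;

    std::size_t rows() const { return n_rows; }
    double get(std::size_t row, std::size_t col) const {
        return values[row * n_cols + col];
    }
};

// Hypothetical run-time polymorphic interface (assumed shape, not ranger's).
struct Data {
    virtual ~Data() = default;
    virtual std::size_t rows() const = 0;
    virtual double get(std::size_t row, std::size_t col) const = 0;
};

// Adapter exposing the plain storage through the virtual interface.
struct DenseData final : Data {
    explicit DenseData(RowMajorMatrix matrix) : matrix_(std::move(matrix)) {}
    std::size_t rows() const override { return matrix_.rows(); }
    double get(std::size_t row, std::size_t col) const override {
        return matrix_.get(row, col);
    }
private:
    RowMajorMatrix matrix_;
};

// Run-time dispatch: only the base type is known here, so each rows()/get()
// call is a virtual call unless the compiler can devirtualise it.
inline double column_sum_virtual(const Data& data, std::size_t col) {
    double sum = 0.0;
    for (std::size_t i = 0; i < data.rows(); ++i) sum += data.get(i, col);
    return sum;
}

// Compile-time dispatch: any type with rows() and get() works, and the
// concrete type is known at the call site, so the calls can be inlined.
template <typename Matrix>
double column_sum_template(const Matrix& data, std::size_t col) {
    double sum = 0.0;
    for (std::size_t i = 0; i < data.rows(); ++i) sum += data.get(i, col);
    return sum;
}

// Both paths compute the same result; only the dispatch mechanism differs.
inline void check_both_paths_agree() {
    const RowMajorMatrix matrix{2, 2, {1.0, 2.0, 3.0, 4.0}};
    const DenseData data(matrix);
    assert(column_sum_virtual(data, 1) == column_sum_template(matrix, 1));
}
```

Whether the templated version is measurably faster depends on the compiler and the access pattern, so it would need benchmarking before committing to the change.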

I think something along the lines of the following interface would address these points:

#include <iosfwd>  // forward declarations of std::istream and std::ostream

class Forest
{
public:
    struct Parameters { /* general parameters (num trees etc) */ };

    // Construct a new forest from provided parameters
    explicit Forest(Parameters params);

    // Construct a new forest from a saved forest
    explicit Forest(std::istream& is);

    // Train the forest on the provided input data;
    // the Forest is modified (no const).
    template <typename Matrix, typename Vector>
    void train(const Matrix& X, const Vector& y);

    // Predict new data and write results to y
    template <typename Matrix, typename Vector>
    Vector& predict(const Matrix& X, Vector& y) const;

    // Write the state of the Forest to the stream
    void serialise(std::ostream& os) const;

private:
    // implementation
};
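As a rough illustration of how such an interface might be used, the sketch below implements the same member functions on a hypothetical `ToyForest` that only learns the mean of `y`. It contains no tree code and is not ranger's implementation; the `num_trees` field, the `Rows` alias, and the text serialisation format are assumptions made purely for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>
#include <vector>

// Hypothetical stand-in for the proposed Forest: it only learns the mean of y,
// so the train / serialise / load / predict cycle can be shown end to end.
class ToyForest {
public:
    struct Parameters { std::size_t num_trees; };  // stored but unused by this toy

    explicit ToyForest(Parameters params) : params_(params) {}

    // Load a previously serialised model (assumed format: "<num_trees> <mean>").
    explicit ToyForest(std::istream& is) : params_{0} {
        is >> params_.num_trees >> mean_;
        assert(!is.fail());  // toy check; real code would report a load error
    }

    // Training mutates the model; X is ignored by this toy model.
    template <typename Matrix, typename Vector>
    void train(const Matrix& /*X*/, const Vector& y) {
        double sum = 0.0;
        for (double value : y) sum += value;
        mean_ = y.empty() ? 0.0 : sum / static_cast<double>(y.size());
    }

    // Prediction is const: one value per row of X is written to y.
    template <typename Matrix, typename Vector>
    Vector& predict(const Matrix& X, Vector& y) const {
        y.assign(X.size(), mean_);
        return y;
    }

    void serialise(std::ostream& os) const {
        os << params_.num_trees << ' ' << mean_;
    }

private:
    Parameters params_;
    double mean_ = 0.0;
};

using Rows = std::vector<std::vector<double>>;

// Train once, save, reload, then predict on two separate data sets
// without constructing a new model for each one.
inline std::vector<double> train_save_load_predict() {
    ToyForest trained(ToyForest::Parameters{10});
    trained.train(Rows{{0.0}, {1.0}}, std::vector<double>{2.0, 4.0});

    std::stringstream saved;
    trained.serialise(saved);

    const ToyForest loaded(saved);  // no training parameters needed here
    std::vector<double> first, second;
    loaded.predict(Rows{{5.0}}, first);
    loaded.predict(Rows{{6.0}, {7.0}}, second);
    first.insert(first.end(), second.begin(), second.end());
    return first;
}
```

The point of the sketch is that the loaded model is `const` and is reused across data sets, which is what problems 1 and 2 above ask for.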

What do you think? I can have a go at implementing this if you agree.


mnwright · 2018-04-16T07:19:48Z

Generally no objections to an interface change. The current interface is mainly due to the R version. By the way, the R version is far more widely used (including by me) and is my main focus in development. Thus, your efforts to improve the standalone C++ version are highly appreciated!

I agree with the problems you describe (though at least 1 and 2 don't apply to the R version). As long as a new interface works with Rcpp (I can do the changes related to that), I'm fine with a change.

mnwright · 2018-04-16T07:22:06Z

Maybe related to this: It would also be nice to be able to compute permutation variable importance for existing forests.
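A decoupled, const prediction interface would make this fairly direct. The sketch below is an assumption-laden illustration, not ranger's algorithm: it measures importance as the increase in mean squared error on a supplied data set after shuffling one column, whereas a forest implementation would typically use out-of-bag samples per tree. The names `Rows`, `mean_squared_error`, and `permutation_importance` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

using Rows = std::vector<std::vector<double>>;

// Mean squared error of `predict` (any callable taking one row) on (X, y).
template <typename Predictor>
double mean_squared_error(const Predictor& predict, const Rows& X,
                          const std::vector<double>& y) {
    assert(X.size() == y.size());
    double total = 0.0;
    for (std::size_t i = 0; i < X.size(); ++i) {
        const double residual = predict(X[i]) - y[i];
        total += residual * residual;
    }
    return X.empty() ? 0.0 : total / static_cast<double>(X.size());
}

// Permutation importance of one column: error after shuffling that column
// minus the baseline error. X is taken by value so the caller's data is left
// untouched, and the already-trained model is only ever called, never modified.
template <typename Predictor>
double permutation_importance(const Predictor& predict, Rows X,
                              const std::vector<double>& y,
                              std::size_t column, std::mt19937& rng) {
    const double baseline = mean_squared_error(predict, X, y);

    std::vector<double> shuffled;
    shuffled.reserve(X.size());
    for (const auto& row : X) shuffled.push_back(row[column]);
    std::shuffle(shuffled.begin(), shuffled.end(), rng);
    for (std::size_t i = 0; i < X.size(); ++i) X[i][column] = shuffled[i];

    return mean_squared_error(predict, X, y) - baseline;
}
```

A column the model never reads gets an importance of exactly zero under this definition, since shuffling it cannot change any prediction.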

dlong11 · 2019-07-23T14:23:03Z

@dancooke @mnwright This sounds like a great idea. Did this ever get any traction?

mnwright · 2019-08-12T06:43:40Z

Not yet, unfortunately.

rtlprmft · 2019-10-26T21:57:44Z

A well-organised C++ API would be really helpful, as it could be linked to CERN's ROOT package, which makes I/O and plotting of results trivial. Right now, it requires converting all data to ASCII files, running ranger, merging the predictions back into the data, and converting the data back to binary to be able to create result plots.

stephematician · 2023-03-27T10:28:58Z

For what it's worth, I've been refactoring ranger as a part of my very old attempt at a multiple imputation package. (edit) The refactoring is now sitting in its own package: https://github.com/stephematician/literanger (/edit)

Some of the issues here are addressed, e.g.:

- The training and prediction interfaces are separated.
- The Forest object is re-usable and is passed to/from R as an external pointer.
- Forest no longer stores Data as an object; I'm thinking about updating how it is passed around (currently it is passed around as a shared resource) - the template idea is nice.
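For readers comparing the two ways of handing data to a model mentioned in the last point, here is a minimal sketch. It is not literanger's actual code: it assumes "shared resource" means something like `std::shared_ptr`, and `Data`, `MeanModel`, `train`, and `train_from` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical data set reduced to a single response column.
struct Data {
    std::vector<double> response;
};

// Hypothetical model that never stores the data it is trained on.
class MeanModel {
public:
    // Shared-resource style: the pointer is copied for the duration of the
    // call and released on return, so the model keeps no reference to the data.
    void train(std::shared_ptr<const Data> data) {
        assert(data != nullptr);
        train_from(data->response);
    }

    // Template style: no ownership at all, just a reference to any
    // container of doubles that provides size(), empty() and iteration.
    template <typename Vector>
    void train_from(const Vector& response) {
        double sum = 0.0;
        for (double value : response) sum += value;
        mean_ = response.empty()
                    ? 0.0
                    : sum / static_cast<double>(response.size());
    }

    double mean() const { return mean_; }

private:
    double mean_ = 0.0;
};
```

After `train` returns, the caller's `shared_ptr` is again the only owner, which is the property that makes the model reusable across data sets.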

Some issues raised here are not addressed: e.g. I'm still vacillating on polymorphism (compile-time vs run-time).

I'm nowhere near the full feature set of ranger - but regardless of where I end up, my effort might be useful as a starting point for further refactoring. I switched to the cpp11 package for R as it has safer semantics than Rcpp.

mnwright mentioned this issue Apr 10, 2019: [C++] Running the model in a process #349 (Open)

mnwright mentioned this issue Aug 12, 2019: C++ API for writing the model like MLPack? #420 (Open)

mnwright mentioned this issue May 8, 2020: predict.ranger using up large amounts of memory #500 (Open)

mnwright mentioned this issue Dec 1, 2022: Integrate C++ ranger in larger C++ program #644 (Open)

stephematician mentioned this issue May 9, 2023: Interface refactor stephematician/literanger#1 (Open)
