Code for the paper: Extended Federated Ensemble Regression using Classification (EFERUC).
To reproduce the EFERUC results, first copy the relevant codebase in the EFERUC folder to a local directory, then:
- Download the `experiments datasets.zip` folder from http://dx.doi.org/10.17632/mpvwnhv4vb.2
- There are four folders in the `experiments datasets` folder downloaded in the previous step. To reproduce the gene expression experiments, copy the `gene_expression` dataset folder into the `gene expression` folder in this repository and rename it to `datasets`. It should be at the same folder level as the `code` folder. To run the experiments, go into the `code` folder and execute the `eferuc.R` script. It will automatically create an `output` folder at the same level as the `code` and `datasets` folders. The same process should be followed for the other dataset folders; however, the `general` folder in this repository should be used instead of the `gene_expression` folder. Some code folders have `_nominal` appended to their names. This indicates that experiments on datasets with nominal attributes should be run using the code in that folder.
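The copy-and-rename step above can be sketched as a small shell helper. This is only an illustration, not part of the repository: the function name `stage_datasets` and the example paths in the comments are assumptions, and the helper refuses to overwrite an existing `datasets` folder.

```shell
# Sketch: copy one downloaded dataset folder into a repository folder
# and name the copy "datasets", so that it sits beside the "code" folder.
# stage_datasets is a hypothetical helper, not shipped with the repository.
#   $1  dataset folder from the unpacked archive
#   $2  matching folder in this repository (must contain "code")
stage_datasets() {
    src=$1
    repo_dir=$2
    if [ ! -d "$src" ]; then
        echo "missing dataset folder: $src" >&2
        return 1
    fi
    if [ ! -d "$repo_dir/code" ]; then
        echo "no code folder in: $repo_dir" >&2
        return 1
    fi
    if [ -e "$repo_dir/datasets" ]; then
        echo "refusing to overwrite: $repo_dir/datasets" >&2
        return 1
    fi
    # The destination does not exist yet, so cp creates it as a copy of $src.
    cp -R "$src" "$repo_dir/datasets"
}

# Example with assumed paths (commented out so sourcing has no side effects):
# stage_datasets "experiments datasets/gene_expression" "EFERUC/gene expression"
# (cd "EFERUC/gene expression/code" && Rscript eferuc.R)
```

The paths contain spaces, so they must stay quoted when the helper is called.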
The output folder has the following structure:
+-- output
| +-- dataset_name_A
| | +-- weight_details
| | +-- best_bin_size
| | +-- performance
| | | +-- regression
| | | +-- classification
| | | | +-- RDS
| +-- dataset_name_B
| ...
| +-- dataset_name_C
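After a run, the layout above can be checked with a short sketch like the following. The helper name `check_output_layout` is hypothetical, and the check only tests that each entry of the tree exists for one dataset; it makes no assumption about whether an entry is a file or a folder.

```shell
# Sketch: report which entries of the documented output tree are missing
# for one dataset folder, e.g. output/dataset_name_A.
# check_output_layout is a hypothetical helper, not part of the repository.
check_output_layout() {
    out_dir=$1
    missing=0
    for rel in weight_details best_bin_size \
               performance/regression performance/classification/RDS; do
        if [ ! -e "$out_dir/$rel" ]; then
            echo "missing: $out_dir/$rel"
            missing=1
        fi
    done
    # Non-zero exit status if at least one entry is absent.
    return $missing
}

# Example (dataset_name_A is the placeholder name used in the tree above):
# check_output_layout "output/dataset_name_A"
```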
To reproduce the resampling results, use `resampler_script.R` in the `Resampling` folder. Perform the following:
- Download `datasets_and_splits.zip` from https://data.mendeley.com/datasets/mpvwnhv4vb/2 and unpack it in your working directory. These are the datasets and training/test splits used in the paper.
- Download the `Code and Data` folder from https://www.dcc.fc.up.pt/~ltorgo/ExpertSystems/ and unpack it in your working directory. This is the code accompanying the paper by Torgo et al.
- You will need to install the `uba` package in order to use the relevance function. Note that one of the package's dependencies has been taken off CRAN, so you will need to install it by hand by first downloading its latest version (`DMwR_0.4.1.tar.gz`) and then running `install.packages("Path/To/DMwR_0.4.1.tar.gz", repos=NULL, type="source")`.
- Make sure `smoter_helper.R`, `nominal_var_info.rds` and `relevance.rds` are in your working directory, or amend the script (lines 9, 20, 22) so that it knows where they are.
- To run an experiment (for all the learners: ranger, xgboost, lasso and ridge) for data collection `d`, dataset `a`, resampling method SmoteR (`s=TRUE`) or undersampler (`s=FALSE`), undersampling level `u` and, for SmoteR, oversampling level `o`, execute: `Rscript --vanilla resampler_script.R d=d a=a s=s u=u o=o`
  - Dataset collections are: `Yeast`, `QSAR`, `OpenML`, `gene_expression` and `PaoBrancoImbalanced`, corresponding to the sub-folders in the `datasets_and_splits/datasets` folder.
  - The dataset name is the first portion of the file name in the `datasets_and_splits/datasets/` subfolders, e.g. for `yeast_x5.Fluorocytosine_X.csv` it is `yeast_x5.Fluorocytosine`.
  - To run SmoteR, choose `s=TRUE`; to run the undersampler, choose `s=FALSE`.
  - Undersampling and oversampling levels can be freely chosen by the user.
Example use:
Rscript --vanilla resampler_script.R d=Yeast a=yeast_x4NQO s=TRUE u=50 o=100
- Results will be saved in `output` in the format `d_a_l_s_u_o_mod.rds` (model fitted on a resampled dataset), `d_a_l_s_u_o_pred.rds` (predictions obtained on the test set) and `d_a_l_s_u_o_rsq.rds` (R-squared).
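The rule for deriving the dataset name `a` from a file name can be sketched in shell as follows. The sketch assumes that every data file is a `.csv` whose name ends in an underscore-separated suffix such as `_X.csv`, as in the `yeast_x5.Fluorocytosine_X.csv` example; the helper name `list_dataset_names` and the folder in the usage comment are illustrative only.

```shell
# Sketch: list the distinct dataset names found in one collection folder.
# Assumes file names of the form <dataset name>_<suffix>.csv.
# list_dataset_names is a hypothetical helper, not part of the repository.
list_dataset_names() {
    dir=$1
    for path in "$dir"/*.csv; do
        # If the folder has no CSV files the glob stays literal; skip it.
        [ -e "$path" ] || continue
        file=${path##*/}       # strip the directory part
        echo "${file%_*}"      # strip everything from the last underscore on
    done | sort -u
}

# Example (assumed folder name):
# list_dataset_names "datasets_and_splits/datasets/Yeast"
```

Because only the text after the last underscore is removed, underscores inside the dataset name itself (as in `yeast_x5.Fluorocytosine`) are preserved.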
To reproduce the analysis performed in the paper, you will need to run `resampler_script.R` for all dataset collections, resampling methods and combinations of under- and oversampling levels detailed in our paper.
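A sweep over several level combinations can be sketched as a loop that prints one command per combination, following the command template given above. The helper name `print_resampler_commands` is hypothetical, and the levels in the usage comment are placeholders rather than the values used in the paper. The sketch only prints the commands, so they can be reviewed before anything is run.

```shell
# Sketch: print one resampler_script.R command per (u, o) combination.
# print_resampler_commands is a hypothetical helper, not part of the repository.
#   $1  data collection (d)      $2  dataset name (a)
#   $3  TRUE for SmoteR, FALSE for the undersampler (s)
#   $4  space-separated undersampling levels (u)
#   $5  space-separated oversampling levels (o)
print_resampler_commands() {
    d=$1
    a=$2
    s=$3
    under_levels=$4
    over_levels=$5
    # The level lists are intentionally unquoted so they split on spaces.
    for u in $under_levels; do
        for o in $over_levels; do
            echo "Rscript --vanilla resampler_script.R d=$d a=$a s=$s u=$u o=$o"
        done
    done
}

# Example with placeholder levels:
# print_resampler_commands Yeast yeast_x4NQO TRUE "25 50" "100 200"
```

Piping the printed lines to `sh` from the folder containing `resampler_script.R` would execute them one after another.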