
Function reduceResultsBatchmark saves backend when store_backends=FALSE #18

Closed
MislavSag opened this issue Oct 13, 2023 · 23 comments

@MislavSag

Hi again,

I am experimenting with reduceResultsBatchmark function.

The function consumes a lot of RAM on my local machine, even after setting store_models=FALSE and store_backends=FALSE. I have looked at the source code and it seems it stores the task in the ResultData object no matter the store_backends=FALSE argument:

rdata = mlr3::ResultData$new(
  data.table(
    task = list(task),
    learner = list(learner),
    resampling = list(resampling),
    iteration = tab$repl,
    prediction = map(results, "prediction"),
    learner_state = map(results, "learner_state"),
    uhash = tab$job.name
  ),
  store_backends = store_backends
)

It just sets the store_backends attribute to store_backends, but it still saves the task inside the object.

Is that intentional? I would expect the task not to save the backend if the store_backends argument is set to FALSE.

@sebffischer
Member

Hey @MislavSag. The task is stored, but task$backend should return NULL.
Are you using renv in your project? renv recently changed the way packages are installed, which does not work well with R6 classes (hence the large RAM usage).
We are waiting for feedback from the renv team before we decide how to proceed.
In case you are using renv, you should be able to address the problem by setting options("install.opts" = "--without-keep.source") in your .Rprofile.
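
As a quick sanity check of the first point, here is a sketch, assuming bmr is the BenchmarkResult returned by reduceResultsBatchmark with store_backends = FALSE:

library(mlr3)
library(mlr3batchmark)

# bmr is assumed to come from reduceResultsBatchmark(store_backends = FALSE)
task = bmr$tasks$task[[1]]  # the task object itself is still stored ...
task$backend                # ... but the backend should print NULL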

@MislavSag
Author

Hi, I don't use the renv package.
Ultimately, I decided to import only the results from the registry folder and add some columns from the backend using row ids.
I don't know why it consumes so much RAM.

@sebffischer
Member

Do you install packages with the --with-keep.source option?

@MislavSag
Author

I don't use this option. If it defaults to TRUE, then I am using it.

@sebffischer
Member

Thanks for the info, can you provide a reprex?

@sebffischer
Member

Or instead, can you please show me the output of attributes(mlr3::benchmark_grid)?

@MislavSag
Author

Here is the output:

> attributes(mlr3::benchmark_grid)
NULL

I am not sure how I should reproduce the problem. I can upload a sample registry structure (say 10 results) to the cloud so you can try to import them. But generally, it takes a lot of time to import the results.

@sebffischer
Member

sebffischer commented Nov 21, 2023

Thanks, this is already helpful. Maybe you can run loadResult(1) and send us the RDS files like you suggested (e.g. on Mattermost to sebastian_fischer).

@MislavSag
Author

This is what loadResult(1) returns:

$learner_state
$learner_state$log
Empty data.table (0 rows and 3 cols): stage,class,msg

$learner_state$train_time
[1] 122.86

$learner_state$param_vals
named list()

$learner_state$task_hash
[1] "c9a12e2eccb321ef"

$learner_state$data_prototype
Empty data.table (0 rows and 1656 cols): retExcessStand5,aRCHLM132TRUE,aRCHLM66TRUE,ac9132TRUE,ac966TRUE,accountPayables...

$learner_state$task_prototype
Empty data.table (0 rows and 1656 cols): retExcessStand5,aRCHLM132TRUE,aRCHLM66TRUE,ac9132TRUE,ac966TRUE,accountPayables...

$learner_state$mlr3_version
[1] ‘0.17.0.9000’

$learner_state$predict_time
[1] 1.481


$prediction
$prediction$test
<PredictionDataRegr:1449>


$param_values
named list()

$learner_hash
[1] "fa8c1bb28276406e"

@sebffischer
Member

We have run into large RAM usage / slow reduceResultsBatchmark repeatedly with mlr3batchmark, so this is also not really due to your specific circumstances. We will try to look into this. The only unusual thing here is the large number of features. In your case, the task_prototype and data_prototype will definitely make the problem worse, because they are relatively large (reproducing such a data.table creates objects of size ~100KB for me).
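
As a rough illustration of that size estimate (a sketch, not mlr3 code; the column count comes from the prototypes above, the feature names are made up):

library(data.table)

# build an empty data.table with 1656 named columns, like the prototypes shown
cols = replicate(1656, numeric(0), simplify = FALSE)
names(cols) = paste0("feature", seq_len(1656))
proto = as.data.table(cols)

print(object.size(proto), units = "KB")  # on the order of 100 KB, mostly from column names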

@sebffischer
Member

Are you using a GraphLearner?

@MislavSag
Author

Yes, all my learners are graph learners.

@sebffischer
Member

Okay, this problem will hopefully be gone when the new version of paradox (which we use to represent the parameter sets) is done.

@MislavSag
Author

@sebffischer, could you recommend some workaround before the new paradox package comes out? I have 16,000 results now and it takes more than a day to import them, even in parallel.

@sebffischer
Member

This depends on what exactly you want to do with the results.
If you are interested in the evaluated scores, you can use the code below as a starting point.
Note that the learner hash and task hash uniquely identify a learner / task, whereas different learners can have the same IDs. You will likely have to adapt the code below and might have to look a little into the batchtools documentation: https://github.com/mllg/batchtools.

library(mlr3verse)
#> Loading required package: mlr3
library(batchtools)
library(mlr3batchmark)
library(mlr3misc)
#> 
#> Attaching package: 'mlr3misc'
#> The following object is masked from 'package:batchtools':
#> 
#>     chunk

reg = makeExperimentRegistry(NA)
#> No readable configuration file found
#> Created registry in '/var/folders/ft/n79895td0xn0gpr6ny8jyh800000gn/T/Rtmpq5etTo/registry1b37547891f5' using cluster functions 'Interactive'

design = benchmark_grid(
  tsks(c("iris", "sonar")),
  lrns(c("classif.rpart", "classif.featureless")),
  rsmp("cv")
)

batchmark(design)
#> Adding algorithm 'run_learner'
#> Adding problem '1c326920b82b400b'
#> Exporting new objects: '6b67bf63ecedae30' ...
#> Exporting new objects: '70dd22724e5c724d' ...
#> Exporting new objects: '7c35d835f3dfae37' ...
#> Adding 20 experiments ('1c326920b82b400b'[1] x 'run_learner'[2] x repls[10]) ...
#> Adding problem '7e770c7dda9c66ef'
#> Exporting new objects: 'c1fa2fa572e6d386' ...
#> Adding 20 experiments ('7e770c7dda9c66ef'[1] x 'run_learner'[2] x repls[10]) ...

submitJobs()
#> Submitting 40 jobs in 40 chunks using cluster functions 'Interactive' ...

job_table = getJobTable()

unique_jobs = unique(job_table$job.name)
measure = msr("classif.acc")
result = map_dtr(unique_jobs, function(job_name) {
  ids = job_table[job_name, "job.id", on = "job.name"][[1]]
  learner_info = job_table[job_name, "algo.pars", on = "job.name"]$algo.pars[[1]]
  task_info = job_table[job_name, "prob.pars", on = "job.name"]$prob.pars[[1]]
  task_id = task_info$task_id
  task_hash = task_info$task_hash

  learner_id = learner_info$learner_id
  learner_hash = learner_info$learner_hash

  scores = map_dbl(ids, function(id) {
    result = loadResult(id)
    test_prediction = as_prediction(result$prediction$test)

    score = measure$score(test_prediction)

    score
  })

  avg_score = mean(scores)

  list(acc = avg_score, learner_id = learner_id, task_id = task_id, learner_hash = learner_hash, task_hash = task_hash)
})

result
#>          acc          learner_id task_id     learner_hash        task_hash
#> 1: 0.9333333       classif.rpart    iris 70dd22724e5c724d 1c326920b82b400b
#> 2: 0.2333333 classif.featureless    iris 7c35d835f3dfae37 1c326920b82b400b
#> 3: 0.7223810       classif.rpart   sonar 70dd22724e5c724d 7e770c7dda9c66ef
#> 4: 0.5330952 classif.featureless   sonar 7c35d835f3dfae37 7e770c7dda9c66ef

Created on 2023-12-18 with reprex v2.0.2

@MislavSag
Author

I have found a workaround that worked until today.

Now, when I try to import the tasks from the problems folder:

library(fs)

tasks_files = dir_ls(path(PATH, "problems"))
task = readRDS(tasks_files[2])
tasks = lapply(tasks_files, readRDS)
names(tasks) = lapply(tasks, function(t) t$data$id)

I get an error:

> task
$name
[1] "4492359b9e42ed34"

$seed
NULL

$cache
[1] FALSE

$data
Error in .__Task__id(self = self, private = private, super = super, rhs = rhs) : 
  could not find function ".__Task__id"

I am not sure if this error is linked to the mlr3batchmark package or to some other package from the mlr3 universe.

@sebffischer
Member

You need to load mlr3 for that.

@MislavSag
Author

I have loaded mlr3 with library(mlr3) but get the same error.

@MislavSag
Author

I used mlr3 0.17.0 on the HPC, and then 0.17.1 locally. Can that be the source of the error?

sebffischer reopened this Dec 22, 2023
@sebffischer
Member

Yes, this can be the case, as id was only made an active binding very recently in this commit: mlr-org/mlr3@244572f

Can you try using the same version and report whether it works? (It should.)

@sebffischer
Member

It seems that you have created the Task object with an mlr3 version where the task ID was already an active binding, and then loaded it with an mlr3 version where it is not yet an active binding. Thanks a lot for bringing this issue to our attention; at the very least this needs to be properly documented somewhere.
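
To illustrate the mechanism, here is a minimal sketch with plain R6 (a toy class, not actual mlr3 code): an active binding is a function stored with the class definition, so an object serialized under a class version that defines it can end up referencing internals (like the .__Task__id in the error above) that an older package version does not provide.

library(R6)

# a toy class where `id` is an active binding, as in newer mlr3 versions
Task = R6Class("Task",
  active = list(
    id = function(rhs) {
      if (missing(rhs)) {
        private$.id        # read access
      } else {
        private$.id = rhs  # write access
      }
    }
  ),
  private = list(.id = "mytask")
)

task = Task$new()
task$id
#> [1] "mytask"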

@sebffischer
Member

We have now addressed this with a warning message.

@tdhock
Contributor

tdhock commented Feb 14, 2024

Hi, I have the same issue: reduceResultsBatchmark is taking up too much RAM on my cluster system, which is killing my job.
I expected to be able to give reduceResultsBatchmark some argument to tell it to use less RAM.
I tried store_backends=FALSE, but that did not help.
After looking at the source code of reduceResultsBatchmark, I see that the large RAM usage happens on this line:

results = batchtools::reduceResultsList(tab$job.id, reg = reg)

I see that reduceResultsList has an argument fun, which defaults to NULL, meaning the identity function (the whole result file is read into RAM and returned). I was wondering if you could please give reduceResultsBatchmark a new argument, say reduceResultsList.fun, which would be passed on to reduceResultsList? I believe this would fix the issue for me.
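
In the meantime, a possible workaround along these lines is to call reduceResultsList directly with a fun that keeps only what you need per job, so the full result objects are never accumulated in memory. A sketch, assuming reg and measure are set up as in sebffischer's reprex above:

library(batchtools)
library(mlr3)

# load each result in turn, keep only its test-set score, discard the rest
scores = reduceResultsList(
  ids = findDone(reg = reg),
  fun = function(res) measure$score(as_prediction(res$prediction$test)),
  reg = reg
)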
