Enable caching of filter values during tuning #2463

Merged
pat-s merged 60 commits into master from cache-filtering on Feb 1, 2019

5 participants
@pat-s
Member

pat-s commented Oct 23, 2018

fixes #1995

Implementation

  • caching via memoise::memoise()
  • caching is done on the filesystem because memory caching is not supported in parallel processes (r-lib/memoise#77)
  • argument cache can be passed to makeFilterWrapper(). It works with resample(), tuneParams(), filterFeatures().
  • cache accepts a logical vector (using default cache dirs determined by the rappdirs pkg) or a chr vector specifying a custom directory
  • new function: delete_cache(): Deletes ONLY the default cache dirs.
  • new function: get_cache_dir(): Returns the default cache dirs.
  • removal of argument nselect in generateFilterValuesData() and all filters because it is not used
  • finally remove deprecated getFilterValues()
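
To illustrate the first two points, here is a minimal sketch of file-system caching with memoise. It is not the code from this PR: `slow_filter` is a hypothetical stand-in for an expensive filter computation, and the cache directory under `tempdir()` is an assumption chosen so the example leaves no traces behind.

```r
library(memoise)

# Hypothetical stand-in for an expensive filter value computation.
slow_filter = function(x) {
  Sys.sleep(0.5)
  colMeans(x)
}

# Cache results on disk so that separate R processes could, in principle,
# reuse them (an in-memory cache is private to each process).
cache_dir = file.path(tempdir(), "filter-cache")
cached_filter = memoise(slow_filter, cache = cache_filesystem(cache_dir))

x = matrix(rnorm(1000), ncol = 10)

# The first call computes and stores the result, the second call reads it
# back from the cache directory.
t_first = system.time(r1 <- cached_filter(x))[["elapsed"]]
t_second = system.time(r2 <- cached_filter(x))[["elapsed"]]

identical(r1, r2)
list.files(cache_dir)
```

The cache key is derived from the function arguments, so a repeated call only hits the cache when it is made with identical inputs.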

Benchmark example

WITH Cache

library(mlr)
#> Loading required package: ParamHelpers
library(microbenchmark)

lrn = makeFilterWrapper(learner = "regr.ksvm", fw.method = "chi.squared", 
                        cache = TRUE)
ps = makeParamSet(makeNumericParam("fw.perc", lower = 0, upper = 1),
                  makeNumericParam("C", lower = -10, upper = 10, 
                                   trafo = function(x) 2^x),
                  makeNumericParam("sigma", lower = -10, upper = 10,
                                   trafo = function(x) 2^x)
)
rdesc = makeResampleDesc("CV", iters = 3)

y <- rnorm(100)
x <- matrix(rnorm(100 * 2000), 100, 2000)
dat <- data.frame(data.frame(y, as.data.frame(x)))
task = makeRegrTask(target = "y", data = dat)


set.seed(123)
microbenchmark(tuneParams(lrn, task = task, resampling = rdesc, par.set = ps,
                          control = makeTuneControlRandom(maxit = 10), show.info = T),
               times = 1)
#> [Tune] Started tuning learner regr.ksvm.filtered for parameter set:
#>            Type len Def    Constr Req Tunable Trafo
#> fw.perc numeric   -   -    0 to 1   -    TRUE     -
#> C       numeric   -   - -10 to 10   -    TRUE     Y
#> sigma   numeric   -   - -10 to 10   -    TRUE     Y
#> With control class: TuneControlRandom
#> Imputation value: Inf
#> [Tune-x] 1: fw.perc=0.962; C=4.08; sigma=1.23
#> [Tune-y] 1: mse.test.mean=1.0250471; time: 0.5 min
#> [Tune-x] 2: fw.perc=0.403; C=195; sigma=0.152
#> [Tune-y] 2: mse.test.mean=1.0250471; time: 0.0 min
#> [Tune-x] 3: fw.perc=0.288; C=0.0104; sigma=0.0106
#> [Tune-y] 3: mse.test.mean=1.0222349; time: 0.0 min
#> [Tune-x] 4: fw.perc=0.482; C=0.0326; sigma=0.0196
#> [Tune-y] 4: mse.test.mean=1.0226891; time: 0.0 min
#> [Tune-x] 5: fw.perc=0.674; C=0.00189; sigma=16.2
#> [Tune-y] 5: mse.test.mean=1.0226888; time: 0.0 min
#> [Tune-x] 6: fw.perc=0.352; C=0.283; sigma=85.6
#> [Tune-y] 6: mse.test.mean=1.0273247; time: 0.0 min
#> [Tune-x] 7: fw.perc=0.919; C=0.0491; sigma=597
#> [Tune-y] 7: mse.test.mean=1.0242091; time: 0.0 min
#> [Tune-x] 8: fw.perc=0.728; C=13.2; sigma=0.00203
#> [Tune-y] 8: mse.test.mean=1.0249831; time: 0.0 min
#> [Tune-x] 9: fw.perc=0.395; C=0.736; sigma=2.31
#> [Tune-y] 9: mse.test.mean=1.0307186; time: 0.0 min
#> [Tune-x] 10: fw.perc=0.698; C=318; sigma=5.16
#> [Tune-y] 10: mse.test.mean=1.0250471; time: 0.0 min
#> [Tune] Result: fw.perc=0.288; C=0.0104; sigma=0.0106 : mse.test.mean=1.0222349
#> Unit: seconds
#>                                                                                                                             expr
#>  tuneParams(lrn, task = task, resampling = rdesc, par.set = ps,      control = makeTuneControlRandom(maxit = 10), show.info = T)
#>       min       lq     mean   median       uq      max neval
#>  42.67322 42.67322 42.67322 42.67322 42.67322 42.67322     1

Created on 2018-11-02 by the reprex package (v0.2.1)

WITHOUT caching

library(mlr)
#> Loading required package: ParamHelpers
library(microbenchmark)

lrn = makeFilterWrapper(learner = "regr.ksvm", fw.method = "chi.squared", 
                        cache = FALSE)
ps = makeParamSet(makeNumericParam("fw.perc", lower = 0, upper = 1),
                  makeNumericParam("C", lower = -10, upper = 10, 
                                   trafo = function(x) 2^x),
                  makeNumericParam("sigma", lower = -10, upper = 10,
                                   trafo = function(x) 2^x)
)
rdesc = makeResampleDesc("CV", iters = 3)

y <- rnorm(100)
x <- matrix(rnorm(100 * 2000), 100, 2000)
dat <- data.frame(data.frame(y, as.data.frame(x)))
task = makeRegrTask(target = "y", data = dat)


set.seed(123)
microbenchmark(tuneParams(lrn, task = task, resampling = rdesc, par.set = ps,
                          control = makeTuneControlRandom(maxit = 10), show.info = T),
               times = 1)
#> [Tune] Started tuning learner regr.ksvm.filtered for parameter set:
#>            Type len Def    Constr Req Tunable Trafo
#> fw.perc numeric   -   -    0 to 1   -    TRUE     -
#> C       numeric   -   - -10 to 10   -    TRUE     Y
#> sigma   numeric   -   - -10 to 10   -    TRUE     Y
#> With control class: TuneControlRandom
#> Imputation value: Inf
#> [Tune-x] 1: fw.perc=0.962; C=4.08; sigma=1.23
#> [Tune-y] 1: mse.test.mean=0.7753929; time: 0.4 min
#> [Tune-x] 2: fw.perc=0.403; C=195; sigma=0.152
#> [Tune-y] 2: mse.test.mean=0.7753929; time: 0.3 min
#> [Tune-x] 3: fw.perc=0.288; C=0.0104; sigma=0.0106
#> [Tune-y] 3: mse.test.mean=0.7776375; time: 0.4 min
#> [Tune-x] 4: fw.perc=0.482; C=0.0326; sigma=0.0196
#> [Tune-y] 4: mse.test.mean=0.7772678; time: 0.4 min
#> [Tune-x] 5: fw.perc=0.674; C=0.00189; sigma=16.2
#> [Tune-y] 5: mse.test.mean=0.7778738; time: 0.4 min
#> [Tune-x] 6: fw.perc=0.352; C=0.283; sigma=85.6
#> [Tune-y] 6: mse.test.mean=0.7774229; time: 0.4 min
#> [Tune-x] 7: fw.perc=0.919; C=0.0491; sigma=597
#> [Tune-y] 7: mse.test.mean=0.7766936; time: 0.4 min
#> [Tune-x] 8: fw.perc=0.728; C=13.2; sigma=0.00203
#> [Tune-y] 8: mse.test.mean=0.7762520; time: 0.4 min
#> [Tune-x] 9: fw.perc=0.395; C=0.736; sigma=2.31
#> [Tune-y] 9: mse.test.mean=0.7763041; time: 0.4 min
#> [Tune-x] 10: fw.perc=0.698; C=318; sigma=5.16
#> [Tune-y] 10: mse.test.mean=0.7753929; time: 0.3 min
#> [Tune] Result: fw.perc=0.403; C=195; sigma=0.152 : mse.test.mean=0.7753929
#> Unit: seconds
#>                                                                                                                             expr
#>  tuneParams(lrn, task = task, resampling = rdesc, par.set = ps,      control = makeTuneControlRandom(maxit = 10), show.info = T)
#>       min       lq     mean   median       uq      max neval
#>  224.1479 224.1479 224.1479 224.1479 224.1479 224.1479     1

Created on 2018-11-02 by the reprex package (v0.2.1)
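
For a quick comparison, the two timings above can be put side by side. The numbers are copied from the microbenchmark outputs; each comes from a single run (`times = 1`), so treat the ratio as a rough indication only.

```r
# Elapsed times in seconds, copied from the two microbenchmark outputs above.
with_cache = 42.67322
without_cache = 224.1479

# Relative speedup and absolute time saved for this single run.
speedup = without_cache / with_cache
saved_min = (without_cache - with_cache) / 60

cat(sprintf("speedup: %.2fx, saved: %.1f min\n", speedup, saved_min))
```

This works out to roughly a fivefold speedup. Note that in both scripts `set.seed(123)` is called after `rnorm()` has generated the data, so the two sessions presumably tuned on different data sets; that would explain why the `mse.test.mean` values differ between the runs even though the sampled hyperparameters match.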

pat-s added some commits Oct 23, 2018

add caching option via memoise
set perc = 1 as default
hide documentation
remove nselect arg
delete getFilterValues
@larskotthoff

Contributor

larskotthoff commented Oct 23, 2018

Could you add a test that checks that results are the same with and without caching please?

@pat-s

Member Author

pat-s commented Nov 4, 2018

I don't see that we need the extra dependency fs for mlr.

I tried with base dir.create() and it did silently fail to create the directory (if a custom cache directory was given). Instead, using fs::dir_create() worked.

That's why I used the fs pkg.
I think that's a valid reason. Also it is in SUGGESTS and should not have that much impact then.

Although nselect is not used by any mlr filter currently, I had some custom filters which used it. I would prefer to keep the few extra lines.

Ok, fair enough.

@pat-s pat-s force-pushed the cache-filtering branch from b568c34 to adeae6b Nov 4, 2018

merge master
Merge branch 'master' into cache-filtering

# Conflicts:
#	.travis.yml
#	appveyor.yml
#	tic.R
@berndbischl

Contributor

berndbischl commented Nov 4, 2018

@pat-s you might get angry at me saying this now:
but i am REALLY unsure whether we should "shoehorn" such a mechanism into mlr now.
caching just for filter values. that seems not that reasonable?
what are other thoughts here? i want this at least discussed before we do this, this is a major change to the base system
@mllg @larskotthoff

@pat-s pat-s referenced this pull request Nov 4, 2018

Open

Caching #16

@pat-s

Member Author

pat-s commented Nov 6, 2018

but i am REALLY unsure whether we should "shoehorn" such a mechanism into mlr now.
caching just for filter values. that seems not that reasonable?

I understand your point.
However, filtering is the code part that will profit from caching most (since it is the only part that is recalled with the same settings every time).
Enabling caching for all mlr parts would involve a lot more work, with less impact than for the filter stuff.

Imo it is sufficient to only have caching for filtering in mlr and implement it properly (pkg wide) in mlr3 right from the start.

I do not really see a big disadvantage of this PR. Yes, it is not pkg-wide, but does this point make it not mergeable?

But if you do not want to have this in mlr because of this point I'll just leave it in the branch.

@mllg

Member

mllg commented Nov 6, 2018

I tried with base dir.create() and it did silently fail to create the directory (if a custom cache directory was given). Instead, using fs::dir_create() worked.

If dir.create() fails, the alarm bells should start to ring. I assume this is because of race conditions during parallelization. libuv used by fs might have some fallbacks (e.g., timeouts, retries, ...) to solve this. Nevertheless, you will still encounter race conditions for the files. Memoization will not work reliably in parallel.

That's why I used the fs pkg.
I think that's a valid reason. Also it is in SUGGESTS and should not have that much impact then.

I disagree. Most of our problems regarding maintainability are because of the long list of packages in Suggests, not the few packages in Imports.

@pat-s

Member Author

pat-s commented Nov 6, 2018

If dir.create() fails, the alarm bells should start to ring. I assume this is because of race conditions during parallelization. libuv used by fs might have some fallbacks (e.g., timeouts, retries, ...) to solve this. Nevertheless, you will still encounter race conditions for the files. Memoization will not work reliably in parallel.

dir.create() is just doing a one-time call creating the directory for caching. Even when executing it "manually" it failed to create the dir. I assume some permission problems here? But I did not debug further.
To be more explicit, dir.create(rappdirs::user_cache_dir()) failed while fs::dir_create(rappdirs::user_cache_dir()) worked on my machine.

Memoization will not work reliably in parallel.

As far as I've read the docs of memoise, parallelization is not a problem as long as caching is done on the filesystem. Parallelization and caching do not work when caching in memory as the processes do not share the same memory.

I disagree. Most of our problems regarding maintainability are because of the long list of packages in Suggests, not the few packages in Imports.

I think we can debate about a lot of packages in SUGGESTS (to possibly clean up) but isn't fs one of those that really do a better (safer) job than the base R implementation?
If you insist on not using fs, it will take quite some time to figure out why dir.create() failed silently (!) :/.

@mllg

Member

mllg commented Nov 6, 2018

If dir.create() fails, the alarm bells should start to ring. I assume this is because of race conditions during parallelization. libuv used by fs might have some fallbacks (e.g., timeouts, retries, ...) to solve this. Nevertheless, you will still encounter race conditions for the files. Memoization will not work reliably in parallel.

dir.create() is just doing a one-time call creating the directory for caching. Even when executing it "manually" it failed to create the dir. I assume some permission problems here? But I did not debug further.
To be more explicit, dir.create(rappdirs::user_cache_dir()) failed while fs::dir_create(rappdirs::user_cache_dir()) worked on my machine.

Try

if (!dir.exists(path)) dir.create(path)

If this does not work, the permissions are indeed set strangely on your machine.
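
One possible explanation for the failure (an assumption, it was not confirmed in this thread) is a cache path whose parent directory does not exist yet: base `dir.create()` is not recursive by default and reports failure only through a warning and an invisible `FALSE`. The sketch below shows that behaviour with a hypothetical path under `tempdir()`, together with a guarded helper (`ensure_dir()` is an illustrative name, not an mlr function).

```r
# Hypothetical nested path whose parent directory does not exist yet.
path = file.path(tempfile("missing-parent-"), "cache")

# Non-recursive creation fails; the failure is signalled only by a warning
# and an invisible FALSE return value, which is easy to overlook.
ok_flat = suppressWarnings(dir.create(path))

# Recursive creation builds the missing parent directories as well.
ok_rec = dir.create(path, recursive = TRUE)

# Guarded helper: create the directory if needed and fail loudly otherwise.
# Checking dir.exists() again afterwards also covers the case where another
# process created the directory in the meantime.
ensure_dir = function(path) {
  if (!dir.exists(path)) {
    dir.create(path, recursive = TRUE, showWarnings = FALSE)
  }
  if (!dir.exists(path)) {
    stop("could not create directory: ", path)
  }
  invisible(path)
}

ensure_dir(path)
c(flat = ok_flat, recursive = ok_rec)
```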

Memoization will not work reliably in parallel.

As far as I've read the docs of memoise, parallelization is not a problem as long as caching is done on the filesystem. Parallelization and caching do not work when caching in memory as the processes do not share the same memory.

Then the docs are wrong. You spawn multiple threads / processes which do the following concurrently:

  1. Check if cache dir exists and create it, if necessary
  2. Hash the inputs to the function call
  3. Run the filtering
  4. Store result as cache_dir/[hash].rds
  5. Next computation will load the stored file

So let's assume that the cache dir already exists. Then (1) is no problem. The calculated hash will be the same on all / many workers. As soon as we try to store the results in (4), all workers with the same hash will try to write to the exact same file concurrently. Depending on the file system, different things will happen now.

  • On a local file system with working file locking, the workers will write the file one after another. Other workers might meanwhile try to load the file for (5) and then have to wait (which can lead to timeouts).
  • On a network file system without proper locking, you will corrupt your files, and system admins will get very angry because you overburden the file system and make the system unresponsive for everyone.

You either need a thread-safe storage system (like a database), or pre-compute the results before the parallelization.
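
The five steps above can be sketched in base R as follows. This is not how memoise is implemented; `hash_args()` and `cached_call()` are hypothetical helpers. The sketch writes each result to a private temporary file and then renames it, under the assumption that a rename within one directory is atomic on a local POSIX file system. That assumption may not hold on network file systems, so the concerns raised above still apply there.

```r
# Hash arbitrary R arguments using base tools only (md5 of the serialized
# argument list; assumed to be good enough for a sketch).
hash_args = function(...) {
  tmp = tempfile()
  on.exit(unlink(tmp))
  writeBin(serialize(list(...), NULL), tmp)
  unname(tools::md5sum(tmp))
}

# Hypothetical file-based cache. The function itself is not part of the
# hash, so this sketch assumes one cache directory per function.
cached_call = function(fun, ..., cache_dir) {
  # (1) create the cache dir if necessary
  dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
  # (2) hash the inputs
  target = file.path(cache_dir, paste0(hash_args(...), ".rds"))
  # (5) later calls load the stored file
  if (file.exists(target)) {
    return(readRDS(target))
  }
  # (3) run the computation
  result = fun(...)
  # (4) store the result: write to a file private to this process, then
  # rename it, so readers should never see a half-written target file
  tmp = tempfile(pattern = paste0("tmp-", Sys.getpid(), "-"), tmpdir = cache_dir)
  saveRDS(result, tmp)
  file.rename(tmp, target)
  result
}

cache_dir = file.path(tempdir(), "atomic-cache")
slow_square = function(x) {
  Sys.sleep(0.2)
  x^2
}

r1 = cached_call(slow_square, 1:5, cache_dir = cache_dir)  # computes and stores
r2 = cached_call(slow_square, 1:5, cache_dir = cache_dir)  # loads from disk
identical(r1, r2)
```

Workers that compute the same hash concurrently would each still run the computation once; the rename only aims to ensure that whichever file ends up at the target path is complete.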

@mllg

Member

mllg commented Nov 6, 2018

Caching would still be nice to have for sequential execution though.

@pat-s

Member Author

pat-s commented Nov 6, 2018

Try

if (!dir.exists(path)) dir.create(path)
If this does not work, the permissions are indeed set strangely on your machine.

Tried again. For whatever reason it works now. Strange. 🤔 😆

You either need a thread-safe storage system (like a data base), or pre-compute the results before the parallelization.

Looking at my tests I indeed never checked it with parallelization. I only checked that all callers work (resample(), tuneParams(), filterFeatures()).
I see the point of the potential writing conflict.

I tested it now:
Case 1: 3 cpus and 3 folds
In this case we always get different hashes since the tasks are different, and there are no problems during writing.

Case 2: 6 cpus and 3 folds
When using cpus = 6 and folds = 3, two workers each are running on the exact same setup and would write the same hash file (potentially at the same time). This also works without trouble for me. In the end I have 3 hash files in the cache dir (I deleted the cache dir before the run).

I am not sure if the latter only works if there is a small delay between the writing attempts of both workers. Or if they really wait on each other. Or or or...
But in the end it worked for me in the same way as doing it sequentially (no additional files, no conflicts).

@mllg Happy to run additional checks on this or make the implementation more robust. My parallelization knowledge ends at this point (= balancing multiple parallel write attempts). From a user perspective, I do not see any drawback to using the caching method in parallel as well atm?

pat-s and others added some commits Nov 6, 2018

merge master
Merge branch 'master' into cache-filtering

# Conflicts:
#	tic.R
cleanups
changed default of cache
disable tests to write to the user's home directory
create paths with `recursive = TRUE`
@mb706

Contributor

mb706 commented on tests/testthat/test_base_caching.R in 319841a Nov 16, 2018

FYI I think tests marked with skip_on_cran() are also skipped on Travis right now.

pat-s added some commits Nov 20, 2018

merge master
Merge branch 'master' into cache-filtering

# Conflicts:
#	NEWS.md
#	R/FilterWrapper.R
#	R/filterFeatures.R
#	R/generateFilterValues.R
#	tic.R

@pat-s pat-s merged commit a18312b into master Feb 1, 2019

1 check passed

deploy/netlify Deploy preview ready!
Details

@pat-s pat-s deleted the cache-filtering branch Feb 1, 2019
