Enable caching of filter values during tuning #2463
Conversation
pat-s added some commits on Oct 23, 2018
Could you add a test that checks that results are the same with and without caching, please?
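A minimal sketch of the kind of test being asked for here, assuming caching is toggled via the `cache` argument of `filterFeatures()` as described in this PR; the filter method and task are illustrative, not necessarily what was eventually added:

```r
library(mlr)
library(testthat)

test_that("filtering gives identical results with and without caching", {
  # Same filter call, once without and once with caching enabled.
  task.nocache = filterFeatures(iris.task, method = "variance", perc = 0.5, cache = FALSE)
  task.cache = filterFeatures(iris.task, method = "variance", perc = 0.5, cache = TRUE)
  # The selected feature set must not depend on whether the cache was used.
  expect_equal(getTaskFeatureNames(task.nocache), getTaskFeatureNames(task.cache))
})
```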
pat-s added some commits on Oct 23, 2018
I tried with base R, but it did not work on my machine. That's why I used the fs pkg.
Ok, fair enough.
pat-s force-pushed the cache-filtering branch from b568c34 to adeae6b on Nov 4, 2018
@pat-s you might get angry at me saying this now:
I understand your point. IMO it is sufficient to only have caching for filtering in mlr and to implement it properly (pkg-wide) in mlr3 right from the start. I do not really see a big disadvantage in this PR. Yes, it is not pkg-wide, but does this point make it not mergeable? But if you do not want to have this in mlr because of this point, I'll just leave it in the branch.
I disagree. Most of our problems regarding maintainability are because of the long list of packages in Suggests, not the few packages in Imports.
As far as I've read the docs of

I think we can debate about a lot of packages in Suggests (to possibly clean up), but isn't
Try `if (!dir.exists(path)) dir.create(path)`. If this does not work, the permissions are indeed set strangely on your machine.

Then the docs are wrong. You spawn multiple threads / processes which run the caching steps concurrently. So let's assume that the cache dir already existed; then step (1), creating the cache dir, is no problem. The calculated hash will be the same on all / many workers. As soon as we try to store the results in step (4), all workers with the same hash will try to write to exactly the same file concurrently. Depending on the file system, different things will happen.

You either need a thread-safe storage system (like a database), or you pre-compute the results before the parallelization.
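One way to reduce the concurrent-write problem without a database would be an atomic write: each worker writes to a unique temporary file and then renames it into place. This is only an illustrative sketch, assuming the cache is a directory of `.rds` files keyed by hash; the helper name and file layout are hypothetical and not part of this PR:

```r
# Hypothetical helper (not from this PR): write a cached object atomically.
# On most local file systems, renaming within the same directory is atomic,
# so concurrent workers never observe a half-written cache file.
cache_write_atomic = function(object, path) {
  tmp = tempfile(tmpdir = dirname(path), fileext = ".rds")
  saveRDS(object, tmp)
  # If another worker already stored the same hash, the rename simply
  # replaces it with an identical result.
  file.rename(tmp, path)
}

# Illustrative usage: store filter values under their hash in the cache dir.
# cache_write_atomic(filter.values, file.path(cache.dir, paste0(hash, ".rds")))
```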
Caching would still be nice to have for sequential execution, though.
Tried again. For whatever reason it works now. Strange.
Looking at my tests, I indeed never checked it with parallelization; I only checked that all callers work. I tested it now:

Case 2: 6 CPUs and 3 folds

I am not sure if the latter only works because there is a small delay between the writing attempts of both workers, or if they really wait on each other, or ... @mllg Happy to make additional checks on this or to make the implementation more robust. My parallelization knowledge ends at this point (= balancing multiple parallel write attempts). From a user perspective I do not see any drawback to using the caching method in parallel as well at the moment.
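For reference, a sketch of the kind of parallel check described above (6 CPUs, 3-fold CV). The learner and filter are illustrative and the parallelization level is an assumption, not taken from the PR:

```r
library(mlr)
library(parallelMap)

# Filter wrapper with caching enabled (default cache dir).
lrn = makeFilterWrapper(makeLearner("classif.rpart"),
  fw.method = "variance", fw.perc = 0.5, cache = TRUE)
rdesc = makeResampleDesc("CV", iters = 3)

# 6 workers, 3 folds: several workers may try to read/write the same cache
# files at (almost) the same time.
parallelStartSocket(6, level = "mlr.resample")
res = resample(lrn, iris.task, rdesc)
parallelStop()
```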
pat-s and others added some commits on Nov 6, 2018
pat-s commented Oct 23, 2018 (edited)
fixes #1995
Implementation
- `memoise::memoise()`
- `cache` can be passed to `makeFilterWrapper()`. It works with `resample()`, `tuneParams()`, `filterFeatures()` (see the usage sketch below).
- `cache` accepts a logical value (using the default cache dirs determined by the rappdirs pkg) or a character string specifying a custom directory.
- `delete_cache()`: Deletes ONLY the default cache dirs.
- `get_cache_dir()`: Returns the default cache dirs.
- Removed `nselect` in `generateFilterValuesData()` and all filters because it is not used.
- `getFilterValues()`
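A hedged usage sketch based on the points above (not taken from the PR diff; the learner, filter, and parameter grid are illustrative, and the helper names follow this description):

```r
library(mlr)

# Enable caching of filter values; TRUE uses the default cache dir
# (determined via rappdirs), a character string sets a custom directory.
lrn = makeFilterWrapper(makeLearner("classif.rpart"),
  fw.method = "variance", cache = TRUE)

# Tune the percentage of features kept by the filter; filter values are
# computed once per hash and re-read from the cache on later evaluations.
ps = makeParamSet(makeDiscreteParam("fw.perc", values = c(0.25, 0.5, 0.75)))
rdesc = makeResampleDesc("CV", iters = 3)
res = tuneParams(lrn, iris.task, rdesc, par.set = ps,
  control = makeTuneControlGrid())

get_cache_dir()  # returns the default cache dirs (names as given in this PR)
delete_cache()   # deletes ONLY the default cache dirs
```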
Benchmark example
WITH caching
Created on 2018-11-02 by the reprex package (v0.2.1)
WITHOUT caching
Created on 2018-11-02 by the reprex package (v0.2.1)