New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cache auto-loaded data during load.project() #160
Conversation
…project able to migrate old projects smoothly. Regression tests pass, but no tests for new functionality written yet
…oject directory in the working directory instead of in a temp directory. Tests themselves not changed and still work
Like this idea. Thanks! Will put review shortly (hopefully today). One question (don't think this is a show stopper but want your opinion). Sometimes |
@KentonWhite let me think about those. I need to look into the reader architecture more carefully. My first thought though on the The excel presumably creates multiple objects in the global environment which we would have to cache. Rather than reading the Let me chew on this and see if there is a more generic solution. |
…dataload to determine what shoudl be cached. Created some tests to test this also.
Hi @KentonWhite, I've pushed an update which I think addresses those cases which you highlight. In the end, I went with the snapshot of global env before and after data load and then apply the caching to any new variables found. Special care taken to make sure that we don't include functions in the global env search (just in case some-one loads a function in the Added some additional tests to explicitly check that variables created in I didn't test multi sheet excel because I don't have perl installed on my machine (can't wait for #159 to be merged ....). It should work though. |
Thanks. Will review tomorrow :) |
…the scope of what cache.project does
Hi @KentonWhite The present meaning is "the auto-loaded variable names in the global env provided that multiple data sets aren't loaded from a single file (in which case it represents the filename they were loaded from without the extension)". I think this is the reason And, for belt and braces, the So, with this tidy up, I think this PR does both a more comprehensive |
@connectedblue had a review of the PR. My instinct is that we will run into edge cases with the strategy comparing the list of variables before the load step and after the load step. I ran through a number of edge cases I could think of and didn't find any problems so no reason not to proceed. Lets keep an eye on this and file any issues we see. Before I merge, it would be nice if the refactoring of the |
Thanks @KentonWhite. I'll raise another PR as you suggest for Agree that the before/after snapshot might cause some side effects with different reader types, but happy to keep an eye on the issues log once it's out there. |
Not ready for review yet. Dependency on #162 to be merged into |
@KentonWhite , I've updated this PR now to reflect the merge of #162. This PR now contains only the relevant changes for caching auto-loaded data. Not quite ready for review yet. I need to update the website documentation for the new cache_loaded_data flag. I'll let you know when that's complete. |
Hi @KentonWhite Last minute fix for consistency. When Some tests added to check this case. Also updated the website documentation for the new flag. Also took the opportunity to add the documentation for other config values that were missing, so we have a complete set now. This PR is ready for review now. |
Looks good. Thanks @connectedblue will play around with this in my work flow starting today :) |
* added in auto cache when data loaded directly, and also made migrate.project able to migrate old projects smoothly. Regression tests pass, but no tests for new functionality written yet * changed the set up of the version tests which were creating a test_project directory in the working directory instead of in a temp directory. Tests themselves not changed and still work * Added some tests for migration and load.project * Updated some variables in migrate.project in order to pass R CMD checks * Changed caching logic to inspect global environment before and after dataload to determine what shoudl be cached. Created some tests to test this also. * updated behaviour for how project.info is calculated and also expand the scope of what cache.project does * create a temporary migrate.project file to merge manually * updated migrate.project with test for cache_loaded_data flag warning * fixed bug which caused two migration tests to fail (wrong config variable name) * changed default missing value to FALSE and added some website docuemntation
This PR proposes a change to allow data read from the
data
directory to be cached automatically whenload.project()
is called.Quite often, reading raw data files can take a lot of time, especially compared to reading from the cache. The motivation for this PR is to avoid the user having to explicitly cache raw datasets in a
munge
script. Instead there is a new config parameter calledcache_loaded_data
which is set toTRUE
by default that controls whether or not caching should take place duringload.project()
The
cache
function itself is now faster due to a recent change, making it inexpensive to add toload.project()
(it won't re-cache if the cache is already up to date for example). Appropriate messages are shown to the user so they can see what's going on.The
migrate.project()
function was also refactored to allow for specific warnings and treatment to apply when existing projects are migrated. For this PR, although the default setting forcache_loaded_data
isTRUE
for new projects, I think it is more conservative to set it toFALSE
for existing projects. This is because they may have done their own caching rules inmunge
which we don't want to break. Appropriate messages are shown during migration, and the user can always switch it back on if they want.I've also added some new regression tests for this functionality. Whilst doing so, I noticed a small problem with existing tests in
test-version.R
. The beginning of each test created atest_project
folder in the current working directory rather than in a temporary file. This causes problems sometimes if they fail to unlink properly when debugging. Anyway, I changed this to use thetempfile
notation that is used in the other tests. The actual tests themselves were not altered.I haven't updated the website documentation yet - I'll await a review of this feature from the community first.