Cache auto-loaded data during load.project() #160

connectedblue · 2016-10-16T16:44:14Z

This PR proposes a change to allow data read from the data directory to be cached automatically when load.project() is called.

Quite often, reading raw data files can take a lot of time, especially compared to reading from the cache. The motivation for this PR is to avoid the user having to explicitly cache raw datasets in a munge script. Instead there is a new config parameter called cache_loaded_data which is set to TRUE by default that controls whether or not caching should take place during load.project()

The cache function itself is now faster due to a recent change, making it inexpensive to add to load.project() (it won't re-cache if the cache is already up to date for example). Appropriate messages are shown to the user so they can see what's going on.

The migrate.project() function was also refactored to allow for specific warnings and treatment to apply when existing projects are migrated. For this PR, although the default setting for cache_loaded_data is TRUE for new projects, I think it is more conservative to set it to FALSE for existing projects. This is because they may have done their own caching rules in munge which we don't want to break. Appropriate messages are shown during migration, and the user can always switch it back on if they want.

I've also added some new regression tests for this functionality. Whilst doing so, I noticed a small problem with existing tests in test-version.R. The beginning of each test created a test_project folder in the current working directory rather than in a temporary file. This causes problems sometimes if they fail to unlink properly when debugging. Anyway, I changed this to use the tempfile notation that is used in the other tests. The actual tests themselves were not altered.

I haven't updated the website documentation yet - I'll await a review of this feature from the community first.

…project able to migrate old projects smoothly. Regression tests pass, but no tests for new functionality written yet

…oject directory in the working directory instead of in a temp directory. Tests themselves not changed and still work

KentonWhite · 2016-10-17T13:17:38Z

Like this idea. Thanks! Will put review shortly (hopefully today).

One question (don't think this is a show stopper but want your opinion). Sometimes cache.project() doesn't cache everything loaded in the data load step. Examples are when you create data in a .R file or read multiple tabs from an excel file. How do you think these situations should be handled?

connectedblue · 2016-10-17T15:16:56Z

@KentonWhite let me think about those. I need to look into the reader architecture more carefully.

My first thought though on the .R file is that maybe that can be read in with the native CODE parameter in cache, which could bring additional benefits like only loading when the code changes.

The excel presumably creates multiple objects in the global environment which we would have to cache.

Rather than reading the project.info$data parameter, perhaps it would be better to take a snap shot of global variables before and after data load and cache the difference?

Let me chew on this and see if there is a more generic solution.

…dataload to determine what shoudl be cached. Created some tests to test this also.

connectedblue · 2016-10-17T20:00:55Z

Hi @KentonWhite, I've pushed an update which I think addresses those cases which you highlight. In the end, I went with the snapshot of global env before and after data load and then apply the caching to any new variables found.

Special care taken to make sure that we don't include functions in the global env search (just in case some-one loads a function in the .R file in the data directory).

Added some additional tests to explicitly check that variables created in .R files cache correctly.

I didn't test multi sheet excel because I don't have perl installed on my machine (can't wait for #159 to be merged ....). It should work though.

KentonWhite · 2016-10-17T20:35:34Z

Thanks. Will review tomorrow :)

…the scope of what cache.project does

connectedblue · 2016-10-18T14:08:37Z

Hi @KentonWhite
I've given some further thought to cache.project() behaviour and come to the conclusion that there's actually a problem with what project.info$data should contain.

The present meaning is "the auto-loaded variable names in the global env provided that multiple data sets aren't loaded from a single file (in which case it represents the filename they were loaded from without the extension)".

I think this is the reason cache.project() doesn't work properly. The latest commit above changes this meaning to be "the data sets in global env loaded directly from the data directory" which is a cleaner definition.

And, for belt and braces, the cache.project function should look at both the $data variable and anything that's already in the cache. The cache() function itself takes care of skipping if it doesn't need to be done again.

So, with this tidy up, I think this PR does both a more comprehensive load.project() where raw data is cached automatically by default, and also allows cache.project() to be more inclusive.

KentonWhite · 2016-10-18T20:26:21Z

@connectedblue had a review of the PR. My instinct is that we will run into edge cases with the strategy comparing the list of variables before the load step and after the load step. I ran through a number of edge cases I could think of and didn't find any problems so no reason not to proceed. Lets keep an eye on this and file any issues we see.

Before I merge, it would be nice if the refactoring of the migrate.project is put into a separate PR. I like the refactoring, just would like to isolate any issues with migrate.project from the cache_loaded_data changes. I propose merging the migrate.project changes in before the cache_loaded_data changes, since cache_loaded_data depends on the migrate.data changes (hence why you refactored it!)

connectedblue · 2016-10-18T20:48:29Z

Thanks @KentonWhite. I'll raise another PR as you suggest for migrate.project() and then use this one for the load.project()changes. I'll let you know when they are ready.

Agree that the before/after snapshot might cause some side effects with different reader types, but happy to keep an eye on the issues log once it's out there.

connectedblue · 2016-10-20T22:38:52Z

Not ready for review yet. Dependency on #162 to be merged into master and then this PR can be re-examined.

…able name)

connectedblue · 2016-11-01T04:00:52Z

@KentonWhite , I've updated this PR now to reflect the merge of #162.

This PR now contains only the relevant changes for caching auto-loaded data.

Not quite ready for review yet. I need to update the website documentation for the new cache_loaded_data flag. I'll let you know when that's complete.

…tation

connectedblue · 2016-11-01T20:21:35Z

Hi @KentonWhite

Last minute fix for consistency. When cache_loaded_data is missing and load.project() is called, the default value used will be FALSE. This is consistent now with the way we treat during migrate.project() where we also set FALSE for the reasons noted above.

Some tests added to check this case.

Also updated the website documentation for the new flag. Also took the opportunity to add the documentation for other config values that were missing, so we have a complete set now.

This PR is ready for review now.

KentonWhite · 2016-11-03T13:01:09Z

Looks good. Thanks @connectedblue will play around with this in my work flow starting today :)

* added in auto cache when data loaded directly, and also made migrate.project able to migrate old projects smoothly. Regression tests pass, but no tests for new functionality written yet * changed the set up of the version tests which were creating a test_project directory in the working directory instead of in a temp directory. Tests themselves not changed and still work * Added some tests for migration and load.project * Updated some variables in migrate.project in order to pass R CMD checks * Changed caching logic to inspect global environment before and after dataload to determine what shoudl be cached. Created some tests to test this also. * updated behaviour for how project.info is calculated and also expand the scope of what cache.project does * create a temporary migrate.project file to merge manually * updated migrate.project with test for cache_loaded_data flag warning * fixed bug which caused two migration tests to fail (wrong config variable name) * changed default missing value to FALSE and added some website docuemntation

connectedblue added 4 commits October 16, 2016 12:30

added in auto cache when data loaded directly, and also made migrate.…

36b186d

…project able to migrate old projects smoothly. Regression tests pass, but no tests for new functionality written yet

changed the set up of the version tests which were creating a test_pr…

325e166

…oject directory in the working directory instead of in a temp directory. Tests themselves not changed and still work

Added some tests for migration and load.project

d69a984

Updated some variables in migrate.project in order to pass R CMD checks

84121f9

Changed caching logic to inspect global environment before and after …

262060d

…dataload to determine what shoudl be cached. Created some tests to test this also.

connectedblue mentioned this pull request Oct 18, 2016

functions to import xls and xlsx datasets using readxl package #159

Merged

updated behaviour for how project.info is calculated and also expand …

83de270

…the scope of what cache.project does

connectedblue mentioned this pull request Oct 18, 2016

Refactor migrate.project() and fix version tests #162

Merged

connectedblue added 5 commits October 21, 2016 20:55

merge in latest update from master

092072c

merge from master

59b0820

create a temporary migrate.project file to merge manually

f5450d7

updated migrate.project with test for cache_loaded_data flag warning

f25de5e

fixed bug which caused two migration tests to fail (wrong config vari…

9491f5c

…able name)

changed default missing value to FALSE and added some website docuemn…

d28afa7

…tation

KentonWhite merged commit 2ff97d6 into KentonWhite:master Nov 3, 2016

connectedblue deleted the cacherawdata branch November 3, 2016 13:18

connectedblue mentioned this pull request Nov 16, 2016

use of .R files in the data directory #165

Closed

Hugovdberg mentioned this pull request Jan 15, 2017

Handle multiple sheet variable.names out of xls(x).reader() #140

Closed

connectedblue mentioned this pull request Jan 15, 2017

Add list.data and incorporate in load.project #187

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cache auto-loaded data during load.project() #160

Cache auto-loaded data during load.project() #160

connectedblue commented Oct 16, 2016 •

edited

KentonWhite commented Oct 17, 2016

connectedblue commented Oct 17, 2016

connectedblue commented Oct 17, 2016

KentonWhite commented Oct 17, 2016

connectedblue commented Oct 18, 2016 •

edited

KentonWhite commented Oct 18, 2016

connectedblue commented Oct 18, 2016

connectedblue commented Oct 20, 2016

connectedblue commented Nov 1, 2016

connectedblue commented Nov 1, 2016

KentonWhite commented Nov 3, 2016

Cache auto-loaded data during load.project() #160

Cache auto-loaded data during load.project() #160

Conversation

connectedblue commented Oct 16, 2016 • edited

KentonWhite commented Oct 17, 2016

connectedblue commented Oct 17, 2016

connectedblue commented Oct 17, 2016

KentonWhite commented Oct 17, 2016

connectedblue commented Oct 18, 2016 • edited

KentonWhite commented Oct 18, 2016

connectedblue commented Oct 18, 2016

connectedblue commented Oct 20, 2016

connectedblue commented Nov 1, 2016

connectedblue commented Nov 1, 2016

KentonWhite commented Nov 3, 2016

connectedblue commented Oct 16, 2016 •

edited

connectedblue commented Oct 18, 2016 •

edited