Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cache auto-loaded data during load.project() #160

Merged
merged 12 commits into from Nov 3, 2016

Conversation

connectedblue
Copy link
Contributor

@connectedblue connectedblue commented Oct 16, 2016

This PR proposes a change to allow data read from the data directory to be cached automatically when load.project() is called.

Quite often, reading raw data files can take a lot of time, especially compared to reading from the cache. The motivation for this PR is to avoid the user having to explicitly cache raw datasets in a munge script. Instead there is a new config parameter called cache_loaded_data which is set to TRUE by default that controls whether or not caching should take place during load.project()

The cache function itself is now faster due to a recent change, making it inexpensive to add to load.project() (it won't re-cache if the cache is already up to date for example). Appropriate messages are shown to the user so they can see what's going on.

The migrate.project() function was also refactored to allow for specific warnings and treatment to apply when existing projects are migrated. For this PR, although the default setting for cache_loaded_data is TRUE for new projects, I think it is more conservative to set it to FALSE for existing projects. This is because they may have done their own caching rules in munge which we don't want to break. Appropriate messages are shown during migration, and the user can always switch it back on if they want.

I've also added some new regression tests for this functionality. Whilst doing so, I noticed a small problem with existing tests in test-version.R. The beginning of each test created a test_project folder in the current working directory rather than in a temporary file. This causes problems sometimes if they fail to unlink properly when debugging. Anyway, I changed this to use the tempfile notation that is used in the other tests. The actual tests themselves were not altered.

I haven't updated the website documentation yet - I'll await a review of this feature from the community first.

…project able to migrate old projects smoothly. Regression tests pass, but no tests for new functionality written yet
…oject directory in the working directory instead of in a temp directory. Tests themselves not changed and still work
@KentonWhite
Copy link
Owner

Like this idea. Thanks! Will put review shortly (hopefully today).

One question (don't think this is a show stopper but want your opinion). Sometimes cache.project() doesn't cache everything loaded in the data load step. Examples are when you create data in a .R file or read multiple tabs from an excel file. How do you think these situations should be handled?

@connectedblue
Copy link
Contributor Author

@KentonWhite let me think about those. I need to look into the reader architecture more carefully.

My first thought though on the .R file is that maybe that can be read in with the native CODE parameter in cache, which could bring additional benefits like only loading when the code changes.

The excel presumably creates multiple objects in the global environment which we would have to cache.

Rather than reading the project.info$data parameter, perhaps it would be better to take a snap shot of global variables before and after data load and cache the difference?

Let me chew on this and see if there is a more generic solution.

…dataload to determine what shoudl be cached. Created some tests to test this also.
@connectedblue
Copy link
Contributor Author

Hi @KentonWhite, I've pushed an update which I think addresses those cases which you highlight. In the end, I went with the snapshot of global env before and after data load and then apply the caching to any new variables found.

Special care taken to make sure that we don't include functions in the global env search (just in case some-one loads a function in the .R file in the data directory).

Added some additional tests to explicitly check that variables created in .R files cache correctly.

I didn't test multi sheet excel because I don't have perl installed on my machine (can't wait for #159 to be merged ....). It should work though.

@KentonWhite
Copy link
Owner

Thanks. Will review tomorrow :)

@connectedblue
Copy link
Contributor Author

connectedblue commented Oct 18, 2016

Hi @KentonWhite
I've given some further thought to cache.project() behaviour and come to the conclusion that there's actually a problem with what project.info$data should contain.

The present meaning is "the auto-loaded variable names in the global env provided that multiple data sets aren't loaded from a single file (in which case it represents the filename they were loaded from without the extension)".

I think this is the reason cache.project() doesn't work properly. The latest commit above changes this meaning to be "the data sets in global env loaded directly from the data directory" which is a cleaner definition.

And, for belt and braces, the cache.project function should look at both the $data variable and anything that's already in the cache. The cache() function itself takes care of skipping if it doesn't need to be done again.

So, with this tidy up, I think this PR does both a more comprehensive load.project() where raw data is cached automatically by default, and also allows cache.project() to be more inclusive.

@KentonWhite
Copy link
Owner

@connectedblue had a review of the PR. My instinct is that we will run into edge cases with the strategy comparing the list of variables before the load step and after the load step. I ran through a number of edge cases I could think of and didn't find any problems so no reason not to proceed. Lets keep an eye on this and file any issues we see.

Before I merge, it would be nice if the refactoring of the migrate.project is put into a separate PR. I like the refactoring, just would like to isolate any issues with migrate.project from the cache_loaded_data changes. I propose merging the migrate.project changes in before the cache_loaded_data changes, since cache_loaded_data depends on the migrate.data changes (hence why you refactored it!)

@connectedblue
Copy link
Contributor Author

Thanks @KentonWhite. I'll raise another PR as you suggest for migrate.project() and then use this one for the load.project()changes. I'll let you know when they are ready.

Agree that the before/after snapshot might cause some side effects with different reader types, but happy to keep an eye on the issues log once it's out there.

@connectedblue
Copy link
Contributor Author

Not ready for review yet. Dependency on #162 to be merged into master and then this PR can be re-examined.

@connectedblue
Copy link
Contributor Author

@KentonWhite , I've updated this PR now to reflect the merge of #162.

This PR now contains only the relevant changes for caching auto-loaded data.

Not quite ready for review yet. I need to update the website documentation for the new cache_loaded_data flag. I'll let you know when that's complete.

@connectedblue
Copy link
Contributor Author

Hi @KentonWhite

Last minute fix for consistency. When cache_loaded_data is missing and load.project() is called, the default value used will be FALSE. This is consistent now with the way we treat during migrate.project() where we also set FALSE for the reasons noted above.

Some tests added to check this case.

Also updated the website documentation for the new flag. Also took the opportunity to add the documentation for other config values that were missing, so we have a complete set now.

This PR is ready for review now.

@KentonWhite KentonWhite merged commit 2ff97d6 into KentonWhite:master Nov 3, 2016
@KentonWhite
Copy link
Owner

Looks good. Thanks @connectedblue will play around with this in my work flow starting today :)

KentonWhite pushed a commit that referenced this pull request Dec 13, 2017
* added in auto cache when data loaded directly, and also made migrate.project able to migrate old projects smoothly.  Regression tests pass, but no tests for new functionality written yet

* changed the set up of the version tests which were creating a test_project directory in the working directory instead of in a temp directory.  Tests themselves not changed and still work

* Added some tests for migration and load.project

* Updated some variables in migrate.project in order to pass R CMD checks

* Changed caching logic to inspect global environment before and after dataload to determine what shoudl be cached.  Created some tests to test this also.

* updated behaviour for how project.info is calculated and also expand the scope of what cache.project does

* create a temporary migrate.project file to merge manually

* updated migrate.project with test for cache_loaded_data flag warning

* fixed bug which caused two migration tests to fail (wrong config variable name)

* changed default missing value to FALSE and added some website docuemntation
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants