Create a data_evolution() function that contrasts the processed (and exported) datasets to their original versions #134
Update: Added a first version of the function.

Outlook: Essentially two things still need to be done:
What the function should be/do:
What we are considering: After careful consideration, here are the different paths we could take to implement a `data_evolution()` function.
You have reached the end of this (very long) summary; feel free to provide me with some feedback @EstherPeev @henriquesposito @jhollway
Thank you @BBieri, this is very informative! I, like you, also think a log file might be a good option to consider, though I do not like the log display as it currently stands in the example provided (I do not like the headers, and I think information about the operations done to the data is still missing).

From the bit of research I did, the logr package (https://github.com/dbosak01/logr/blob/master/R/logr.R) builds on tidylog (https://cran.r-project.org/web/packages/tidylog/readme/README.html), a package that describes dplyr operations below the code in scripts (https://www.r-bloggers.com/2020/01/tidylog/). From looking at the tidylog GitHub page, the package seems to provide a way for us to modify and extract the operations done to data with various dplyr functions (https://github.com/elbersb/tidylog/tree/master/R) and information about the datasets (https://github.com/elbersb/tidylog/blob/master/R/tidylog.R). It might also help us write helper functions to describe the operations done with our qData/qCreate functions. After this we could simply use grep to generate a log file by copying the desired comments from the "logged" preparation script when running `data_evolution()`. This is just a suggestion that came to mind, but what do you think @BBieri, since you have been looking into all this much more closely?
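To make the suggestion above concrete, here is a minimal sketch of the tidylog-plus-log-file idea. It relies on tidylog's documented `tidylog.display` option, which accepts a list of functions that each receive the message text; everything else (the file name, the toy pipeline) is illustrative and not part of qData/qCreate:

```r
library(dplyr)
library(tidylog, warn.conflicts = FALSE)

log_file <- tempfile(fileext = ".log")

# Redirect tidylog's messages about each dplyr operation into a plain-text
# log file instead of printing them to the console.
options("tidylog.display" = list(function(msg) {
  cat(msg, file = log_file, sep = "\n", append = TRUE)
}))

prepared <- mtcars %>%
  filter(cyl > 4) %>%   # tidylog records how many rows were removed
  select(mpg, cyl, wt)  # and which columns were dropped

readLines(log_file)                # the "evolution" of the data, step by step
options("tidylog.display" = NULL)  # restore normal printing
```

A `data_evolution()` wrapper could then grep this file for the operations of interest and present them to the user.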
Here are my two cents on this function. As stated above, we want something that is easy to read for non-programmer audiences in order to increase the transparency of the preparation scripts. That is, we would like to replace the act of looking at a well-commented preparation script with something more visual.
To sum it up, we should encourage users to read the data preparation scripts for now if they want to see the exact changes that the data underwent. In the future, if we deem that the data preparation scripts are not clear enough, we should either look for or develop our own logging/diff package that centers on the differences between the input and output objects rather than a pure logging approach. Hope it helps :)
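For illustration, the object-diff approach mentioned above could look roughly like the following. The function name `data_diff()` and the toy data frames are hypothetical, not anything in our packages:

```r
# Hypothetical diff-centred report: summarise what changed between a raw
# dataset and its processed version, rather than logging each operation.
data_diff <- function(raw, processed) {
  shared <- intersect(names(raw), names(processed))
  list(
    rows_removed = nrow(raw) - nrow(processed),
    cols_dropped = setdiff(names(raw), names(processed)),
    cols_added   = setdiff(names(processed), names(raw)),
    cols_retyped = shared[vapply(shared, function(v)
      !identical(class(raw[[v]]), class(processed[[v]])), logical(1))]
  )
}

# Toy example: two rows and one column were removed during preparation
raw <- data.frame(id = 1:5, name = letters[1:5], junk = NA)
processed <- data.frame(id = 1:3, name = letters[1:3])
str(data_diff(raw, processed))
```

This kind of summary could sit alongside, rather than replace, the well-commented preparation scripts.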
Thanks @BBieri, this is a very useful summary. @henriquesposito, is there a good place in the README or a vignette where we can suggest that users either investigate the preparation scripts and/or run a diff on the original imported dataset against what is available in the package?
Thank you @BBieri for the helpful comments! @jhollway Maybe we can have a simpler version of a `data_evolution()` function that surfaces the information contained in the preparation scripts for users, even if that means retrieving the preparation scripts themselves... In any case, I think you are right that we should make this information more easily accessible to users.
I have added a first draft of a working version of this function. The main issue in setting it up was that the data-raw folders are ignored in our data packages, so they are not available to users who simply load our packages. To overcome this, I had to download the raw data from the package's GitHub page whenever available. For now, the function compares the raw data with the data available in the package as long as both are available, the raw data is in ".csv" format, and the raw and packaged datasets share the same name (not always the case); otherwise the function opens the dataset preparation script on GitHub. The function is located in the "data_report" script in the development branch. @jhollway and @jaeltan, please let me know what you think. If we are to move ahead with this idea, we might want to consider standardising the names of the data-raw files to match the names of the datasets in our databases, and adding these as ".csv" files to the data-raw folder whenever possible (text files facilitate download and visualisation on GitHub). All of this could possibly be done programmatically in the `export_data()` function, to store the preparation script as an attribute for now, or in the preparation script templates, so that the initial variables are correctly saved as attributes to the exported object.
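A bare-bones sketch of what such a function might look like, assuming the raw ".csv" can be fetched from the data package's GitHub repository. The function signature, the URL handling, and the comparison output are all illustrative; the real repository layout and naming may differ:

```r
# Sketch: fetch the raw ".csv" from GitHub (when available) and compare it
# with the dataset shipped in the installed package. Falls back to pointing
# the user at the preparation script when the raw data cannot be retrieved.
data_evolution <- function(dataset, raw_url) {
  raw <- tryCatch(utils::read.csv(raw_url), error = function(e) NULL)
  if (is.null(raw)) {
    message("Raw data not found online; please consult the preparation ",
            "script in the package's data-raw folder on GitHub.")
    return(invisible(NULL))
  }
  processed <- get(dataset)  # assumes the packaged dataset is already loaded
  cat("Rows:", nrow(raw), "->", nrow(processed), "\n")
  cat("Columns dropped:",
      paste(setdiff(names(raw), names(processed)), collapse = ", "), "\n")
  invisible(list(raw = raw, processed = processed))
}
```

Standardising raw-file names against dataset names, as suggested above, would let the `raw_url` be constructed automatically from the dataset name rather than passed in by hand.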