Create a data_evolution() function that contrasts the processed (and exported) datasets to their original versions #134

Closed
BBieri opened this issue Mar 9, 2021 · 9 comments · Fixed by #234
Labels: enhancement (New feature or request)

@BBieri (Contributor) commented Mar 9, 2021

  • data_evolution()
    • This function will trace how the originally imported dataset evolved into the final product we display in our packages, to ensure transparency.
    • It will contain information such as the link to the original datasets, the original coding variables, etc.
    • This will require changes either to the export_data() function (to store the preparation script as an attribute, for now) or to the preparation script templates, so that the initial variables are correctly saved as attributes of the exported object. A sketch of the attribute idea follows below.
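
A minimal sketch of the attribute-storing idea (this is not the actual export_data() implementation; the function name and arguments here are hypothetical):

```r
# Store key metadata of the original object as attributes on the exported one,
# so a later data_evolution() can read them back
link_original_metadata <- function(exported, original, source_url) {
  attr(exported, "source_url") <- source_url
  attr(exported, "original_names") <- names(original)
  attr(exported, "original_classes") <- vapply(original,
                                               function(x) class(x)[1],
                                               character(1))
  exported
}

# data_evolution() could then retrieve, e.g., attr(dataset, "original_names")
```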
@BBieri (Contributor, Author) commented Jun 17, 2021

Update:

Added a first version of the data_evolution() function in {qData}, as well as a link_metadata() function in {qCreate} that links key metadata of the original object to the exported object.

Outlook:

Essentially two things still need to be done:

@BBieri (Contributor, Author) commented Jun 23, 2021

What the function should be/do:

  • Informative for end-users who run data analysis:
    • Should contain the initial/transformed variable names
    • Should contain the initial/transformed variable types
    • Should make clear to the end-user how the data was transformed at every step of the way
  • Easy to implement at scale and for a broad set of data preparation scripts:
    • In our current two data packages ({qStates} and {qEnviron}), we already have a broad set of preparation scripts, as data about states and treaties comes in a wide variety of formats and requires different steps to become "qStandard".
    • Therefore, we need a simple, intuitive, and convenient way to implement this in the data preparation scripts while also catering to all end-users' needs (see the sketch after this list).
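
For illustration, a hypothetical helper showing the kind of end-user-facing summary such a function could produce (the name and output shape are assumptions, not an API we currently have):

```r
# Summarise variable names and types at one stage of the preparation
variable_summary <- function(df, stage) {
  data.frame(
    stage     = stage,
    variable  = names(df),
    type      = vapply(df, function(x) class(x)[1], character(1)),
    row.names = NULL
  )
}

# e.g. stack the initial and transformed summaries for display:
# rbind(variable_summary(raw, "initial"), variable_summary(out, "transformed"))
```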

What we are considering:

After careful consideration, here are the different paths we could take to implement a data_evolution() function:

  1. An in-house function that links metadata about the original object, plus another function that displays it (same as we have now).
  • Advantages:
    • Easy to implement: literally one line of code in the data preparation script (which can even be generated automatically when the data processing script is created from the template).
  • Disadvantages:
    • Unfortunately not very informative for end-users.
  2. Parsing the data preparation script with regex and generating some form of in-house report.
  • Advantages:
    • We can be very specific about how we want to process things and include exactly the information we want in the data_evolution() output.
  • Disadvantages:
    • While I don't have @henriquesposito's experience, I don't see it as scalable in the medium or long term across multiple packages.
  3. Creating a log file, adding it to every data_raw folder, and letting users access it with the data_evolution() function (see the sketch after this list).
  • Advantages:
    • We can build a somewhat standard way of reporting changes to datasets over the course of the processing script and illustrate them by printing intermediary steps to this file.
    • This would be a simpler alternative to just reading the code in the preparation script, since it would be illustrated by snippets of the processed dataset as it moves through the preparation script.
  • Disadvantages:
    • Neither base R nor most packages are great at logging, so we would have to either develop our own logging function or build upon a currently available package ({logr} being the closest to what we want).
    • The {logr} package is actively maintained but doesn't yet allow code to be printed to the log file. We need to find a way to do that.
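
A minimal sketch of option 3 with {logr} (log_open(), log_print(), and log_close() are real {logr} functions; the file path and example data are made up):

```r
library(logr)
library(dplyr)

# Open a log file inside the data_raw folder
lf <- log_open("data_raw/states_preparation.log")

raw <- data.frame(cowcode = c(2, 20), stateabb = c("USA", "CAN"))
log_print("Imported raw dataset:")
log_print(head(raw))

# One preparation step, with an intermediary snapshot written to the log
prepared <- rename(raw, ID = cowcode, Label = stateabb)
log_print("After renaming variables:")
log_print(head(prepared))

log_close()
```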

You've reached the end of this (very long) summary. Feel free to provide me with some feedback @EstherPeev @henriquesposito @jhollway

@BBieri (Contributor, Author) commented Jun 23, 2021

[example log display]

@henriquesposito (Collaborator) commented:
Thank you @BBieri , this is very informative!

Like you, I think a log file might be a good option to consider. However, I do not like the log display much as it currently stands in the example provided (I do not like the headers, and I think information about the operations done to the data is still missing)...

As I understand from the bit of research I did, the {logr} package (https://github.com/dbosak01/logr/blob/master/R/logr.R) builds on {tidylog} (https://cran.r-project.org/web/packages/tidylog/readme/README.html), a package that describes dplyr operations below the code in scripts (https://www.r-bloggers.com/2020/01/tidylog/). From looking at the tidylog GitHub page, the package seems to offer a way for us to modify and extract the operations done to data with the various dplyr functions (https://github.com/elbersb/tidylog/tree/master/R), as well as info about the datasets (https://github.com/elbersb/tidylog/blob/master/R/tidylog.R). It might also help us write helper functions to describe the operations done with our own qData/qCreate functions. After this we could simply use grep to generate a log file, copying the desired comments from the "logged" preparation script when running data_evolution() (a sketch of redirecting tidylog messages to a file follows below). This is just a suggestion that came to mind, but what do you think @BBieri, since you have been looking into all this much more closely?
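
For what it's worth, a sketch of redirecting {tidylog} messages to a file via its documented tidylog.display option (the log path is hypothetical):

```r
library(dplyr)
library(tidylog, warn.conflicts = FALSE)

log_file <- "data_raw/preparation.log"

# tidylog.display takes a list of functions, each called with the log text
options("tidylog.display" = list(
  message,                                                  # keep console output
  function(x) cat(x, "\n", file = log_file, append = TRUE)  # also append to file
))

# tidylog's wrapped dplyr verbs now describe every operation, e.g.
# "filter: removed 18 rows (56%), 14 rows remaining"
out <- mtcars %>%
  filter(mpg > 20) %>%
  select(mpg, cyl)
```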

@BBieri (Contributor, Author) commented May 24, 2022

Here are my two cents on this function. As stated above, we want something that is easy to read for non-programmer audiences, to increase the transparency of the preparation scripts. That is, we would like to replace the act of reading a well-commented preparation script with something more visual.

The {chronicler} package attaches a logging monad to each function through a "function factory" approach: every function used in one of our preparation scripts has to be wrapped by the record() function to gain logging capability (a brief sketch of that workflow follows). The main issue here is once again the scope of the returned information: unfortunately, we only get the name of the function, the arguments passed to it, and the execution time, as in the following example.
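
For concreteness, a minimal sketch of that workflow based on {chronicler}'s documented API (record(), the %>=% pipe, read_log(), and pick()); the avia_monthly log below comes from the package's own example, while this pipeline is invented:

```r
library(chronicler)
library(dplyr)

r_select <- record(select)   # record() wraps a function so its calls are logged
r_filter <- record(filter)

res <- r_select(mtcars, mpg, cyl) %>=%   # %>=% chains chronicle objects
  r_filter(mpg > 20)

read_log(res)        # prints log entries like those shown below
pick(res, "value")   # retrieves the underlying data frame
```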

read_log(avia_monthly) #avia_monthly is a chronicler object
#> [1] "Complete log:"                                                                                                             
#> [2] "OK! select(1,contains(\"20\")) ran successfully at 2022-05-18 10:33:48"                                                    
#> [3] "OK! pivot_longer(-starts_with(\"unit\"),date,passengers) ran successfully at 2022-05-18 10:33:48"                          
#> [4] "OK! separate(1,c(\"unit\", \"tra_meas\", \"air_pr\\\\time\"),,) ran successfully at 2022-05-18 10:33:48"                   
#> [5] "OK! filter(tra_meas == \"PAS_BRD_ARR\",!is.na(passengers),str_detect(date, \"M\")) ran successfully at 2022-05-18 10:33:50"
#> [6] "OK! mutate(paste0(date, \"01\"),ymd(date)) ran successfully at 2022-05-18 10:33:50"                                        
#> [7] "OK! select(air_pr\\time,date,passengers) ran successfully at 2022-05-18 10:33:50"                                          
#> [8] "Total running time: 1.95047211647034 secs"

Hence, as it stands, the {chronicler} package is less informative than simply reading the preparation script. However, since our needs center more on describing the changes a vector/variable in a tibble underwent, it might be worth considering {diffobj} to highlight the differences between input and output objects (a sketch follows below).
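
A quick sketch of what that could look like with {diffobj} (diffPrint() is part of the package; the data here is invented):

```r
library(diffobj)

# Hypothetical raw vs. processed versions of the same variables
original  <- data.frame(cowcode = c(2, 20), stateabb = c("USA", "CAN"))
processed <- data.frame(ID = c("2", "20"), Label = c("United States", "Canada"))

# Colour-coded, side-by-side diff of the two printed objects
diffPrint(original, processed)
```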

To sum it up, we should encourage users to read the data preparation scripts for now if they want to see the exact changes that the data underwent. In the future, if we deem that the data preparation scripts are not clear enough, we should either look for or develop our own logging/diff package that centers around the differences between the input and output objects rather than a pure logging approach.

Hope it helps :)

@jhollway (Collaborator) commented:
Thanks @BBieri, this is a very useful summary. @henriquesposito, is there a good place in the README or a vignette where we can suggest that users investigate the preparation scripts and/or run a diff between the originally imported dataset and what's available in the package?

@henriquesposito (Collaborator) commented:
Thank you @BBieri for the helpful comments!

@jhollway Maybe we can have a simpler version of a "data_evolution()" function that surfaces the information contained in the preparation scripts for users? Even if that means fetching the preparation scripts themselves... In any case, I think you are right that we should make this information more easily accessible to users.

@henriquesposito (Collaborator) commented Jun 22, 2022

I have added a first draft of a working version of this function. The main issue with setting this up was that the data-raw folders are ignored when our data packages are built, so they are not available to users who simply load the packages. To overcome this, I had to download the raw data from the package's GitHub repository whenever available. For now, the function compares the raw data with the data available in the package as long as both are available, the raw data is in ".csv" format, and the raw and packaged datasets have the same name (not always the case); otherwise, the function opens the dataset's preparation script on GitHub. A simplified sketch of this logic appears at the end of this comment.

The function is located in the "data_report" script in the development branch of {manydata}. Of course this needs to be expanded and improved; this is just an idea.

@jhollway and @jaeltan please let me know what you think.

If we are to move ahead with this idea, we might want to consider standardising the names of the raw data files to match the names of the datasets in our databases, as well as adding them as ".csv" files to the data-raw folder whenever possible (text files facilitate downloading and visualisation on GitHub). All of this could probably be done programmatically in {manypkgs}.
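
For reference, a simplified sketch of the comparison logic described above (the repository path, branch name, and fallback URL are assumptions, not the actual {manydata} implementation):

```r
# Fetch the raw ".csv" from the data-raw folder on GitHub (when present) and
# compare it to the packaged dataset; otherwise fall back to opening the
# data-raw folder (with the preparation script) on GitHub.
data_evolution_sketch <- function(dataset_name, repo = "globalgov/manydata") {
  url <- paste0("https://raw.githubusercontent.com/", repo,
                "/main/data-raw/", dataset_name, ".csv")
  raw <- tryCatch(utils::read.csv(url), error = function(e) NULL)
  if (is.null(raw)) {
    utils::browseURL(paste0("https://github.com/", repo, "/tree/main/data-raw"))
    return(invisible(NULL))
  }
  pkg <- get(dataset_name)  # the packaged version, assumed already loaded
  list(dims_raw     = dim(raw),
       dims_package = dim(pkg),
       vars_added   = setdiff(names(pkg), names(raw)),
       vars_dropped = setdiff(names(raw), names(pkg)))
}
```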
