Create a data_evolution() function that contrasts the processed (and exported) datasets to their original versions #134

Closed
BBieri opened this issue Mar 9, 2021 · 9 comments · Fixed by #234
Labels: enhancement (New feature or request)

@BBieri (Contributor) commented Mar 9, 2021

  • data_evolution()
    • This function will trace how the originally imported dataset evolved into the final product we display in our packages, to ensure transparency.
    • It will contain information such as the link to the original datasets, the original coding variables, etc.
    • This will require changes either to the export_data() function (to store the preparation script as an attribute, for now) or to the preparation script templates, so that the initial variables are correctly saved as attributes of the exported object. A sketch of the attribute idea follows below.
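
A minimal sketch of the attribute-storing idea (this is not the actual export_data() implementation; the function name and arguments here are hypothetical):

```r
# Store key metadata of the original object as attributes on the exported one,
# so a later data_evolution() can read them back
link_original_metadata <- function(exported, original, source_url) {
  attr(exported, "source_url") <- source_url
  attr(exported, "original_names") <- names(original)
  attr(exported, "original_classes") <- vapply(original,
                                               function(x) class(x)[1],
                                               character(1))
  exported
}

# data_evolution() could then retrieve, e.g., attr(dataset, "original_names")
```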
@BBieri (Contributor, Author) commented Jun 17, 2021

Update:

Added a first version of the data_evolution() function in {qData}, as well as a link_metadata() function in {qCreate} that links key metadata of the original object to the exported object.

Outlook:

Essentially two things still need to be done:

@BBieri (Contributor, Author) commented Jun 23, 2021

What the function should be/do:

  • Informative for end-users who run data analysis:
    • Should contain the initial/transformed variable names
    • Should contain the initial/transformed variable types
    • Should make clear to the end-user how the data was transformed at every step of the way
  • Easy to implement at scale and for a broad set of data preparation scripts:
    • In our current two data packages ({qStates} and {qEnviron}), we already have a broad set of preparation scripts, as data about states and treaties comes in a wide variety of formats and requires different steps to become "qStandard".
    • Therefore, we need a simple, intuitive, and convenient way to implement this in the data preparation scripts while also catering to all end-users' needs (see the sketch after this list).
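
For illustration, a hypothetical helper showing the kind of end-user-facing summary such a function could produce (the name and output shape are assumptions, not an API we currently have):

```r
# Summarise variable names and types at one stage of the preparation
variable_summary <- function(df, stage) {
  data.frame(
    stage     = stage,
    variable  = names(df),
    type      = vapply(df, function(x) class(x)[1], character(1)),
    row.names = NULL
  )
}

# e.g. stack the initial and transformed summaries for display:
# rbind(variable_summary(raw, "initial"), variable_summary(out, "transformed"))
```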

What we are considering:

After careful consideration, here are the different paths we could take to implement a data_evolution() function:

  1. An in-house function that links metadata about the original object, plus another function that displays it (same as we have now).
  • Advantages:
    • Easy to implement: literally one line of code in the data preparation script (which can even be generated automatically when the data processing script is created from the template).
  • Disadvantages:
    • Unfortunately not very informative for end-users.
  2. Parsing the data preparation script with regex and generating some form of in-house report.
  • Advantages:
    • We can be very specific about how we want to process things and include exactly the information we want in the data_evolution() output.
  • Disadvantages:
    • While I don't have @henriquesposito's experience, I don't see it as scalable in the medium or long term across multiple packages.
  3. Creating a log file, adding it to every data_raw folder, and letting users access it with the data_evolution() function (see the sketch after this list).
  • Advantages:
    • We can build a somewhat standard way of reporting changes to datasets over the course of the processing script and illustrate them by printing intermediary steps to this file.
    • This would be a simpler alternative to just reading the code in the preparation script, since it would be illustrated by snippets of the processed dataset as it moves through the preparation script.
  • Disadvantages:
    • Neither base R nor most packages are great at logging, so we would have to either develop our own logging function or build upon a currently available package ({logr} being the closest to what we want).
    • The {logr} package is actively maintained but doesn't yet allow code to be printed to the log file. We need to find a way to do that.
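
A minimal sketch of option 3 with {logr} (log_open(), log_print(), and log_close() are real {logr} functions; the file path and example data are made up):

```r
library(logr)
library(dplyr)

# Open a log file inside the data_raw folder
lf <- log_open("data_raw/states_preparation.log")

raw <- data.frame(cowcode = c(2, 20), stateabb = c("USA", "CAN"))
log_print("Imported raw dataset:")
log_print(head(raw))

# One preparation step, with an intermediary snapshot written to the log
prepared <- rename(raw, ID = cowcode, Label = stateabb)
log_print("After renaming variables:")
log_print(head(prepared))

log_close()
```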

You've reached the end of this (very long) summary. Feel free to provide me with some feedback @EstherPeev @henriquesposito @jhollway

@BBieri (Contributor, Author) commented Jun 23, 2021

[example log display]

@henriquesposito (Collaborator) commented:
Thank you @BBieri , this is very informative!

Like you, I think a log file might be a good option to consider. However, I do not like the log display much as it currently stands in the example provided (I do not like the headers, and I think information about the operations done to the data is still missing)...

As I understand from the bit of research I did, the {logr} package (https://github.com/dbosak01/logr/blob/master/R/logr.R) builds on {tidylog} (https://cran.r-project.org/web/packages/tidylog/readme/README.html), a package that describes dplyr operations below the code in scripts (https://www.r-bloggers.com/2020/01/tidylog/). From looking at the tidylog GitHub page, the package seems to offer a way for us to modify and extract the operations done to data with the various dplyr functions (https://github.com/elbersb/tidylog/tree/master/R), as well as info about the datasets (https://github.com/elbersb/tidylog/blob/master/R/tidylog.R). It might also help us write helper functions to describe the operations done with our own qData/qCreate functions. After this we could simply use grep to generate a log file, copying the desired comments from the "logged" preparation script when running data_evolution() (a sketch of redirecting tidylog messages to a file follows below). This is just a suggestion that came to mind, but what do you think @BBieri, since you have been looking into all this much more closely?
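
For what it's worth, a sketch of redirecting {tidylog} messages to a file via its documented tidylog.display option (the log path is hypothetical):

```r
library(dplyr)
library(tidylog, warn.conflicts = FALSE)

log_file <- "data_raw/preparation.log"

# tidylog.display takes a list of functions, each called with the log text
options("tidylog.display" = list(
  message,                                                  # keep console output
  function(x) cat(x, "\n", file = log_file, append = TRUE)  # also append to file
))

# tidylog's wrapped dplyr verbs now describe every operation, e.g.
# "filter: removed 18 rows (56%), 14 rows remaining"
out <- mtcars %>%
  filter(mpg > 20) %>%
  select(mpg, cyl)
```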

@BBieri (Contributor, Author) commented May 24, 2022

Here are my two cents on this function. As stated above, we want something that is easy to read for non-programmer audiences, to increase the transparency of the preparation scripts. That is, we would like to replace the act of reading a well-commented preparation script with something more visual.

The {chronicler} package attaches a logging monad to each function through a "function factory" approach: every function used in one of our preparation scripts has to be wrapped by the record() function to gain logging capability (a brief sketch of that workflow follows). The main issue here is once again the scope of the returned information: unfortunately, we only get the name of the function, the arguments passed to it, and the execution time, as in the following example.
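
For concreteness, a minimal sketch of that workflow based on {chronicler}'s documented API (record(), the %>=% pipe, read_log(), and pick()); the avia_monthly log below comes from the package's own example, while this pipeline is invented:

```r
library(chronicler)
library(dplyr)

r_select <- record(select)   # record() wraps a function so its calls are logged
r_filter <- record(filter)

res <- r_select(mtcars, mpg, cyl) %>=%   # %>=% chains chronicle objects
  r_filter(mpg > 20)

read_log(res)        # prints log entries like those shown below
pick(res, "value")   # retrieves the underlying data frame
```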

read_log(avia_monthly) #avia_monthly is a chronicler object
#> [1] "Complete log:"                                                                                                             
#> [2] "OK! select(1,contains(\"20\")) ran successfully at 2022-05-18 10:33:48"                                                    
#> [3] "OK! pivot_longer(-starts_with(\"unit\"),date,passengers) ran successfully at 2022-05-18 10:33:48"                          
#> [4] "OK! separate(1,c(\"unit\", \"tra_meas\", \"air_pr\\\\time\"),,) ran successfully at 2022-05-18 10:33:48"                   
#> [5] "OK! filter(tra_meas == \"PAS_BRD_ARR\",!is.na(passengers),str_detect(date, \"M\")) ran successfully at 2022-05-18 10:33:50"
#> [6] "OK! mutate(paste0(date, \"01\"),ymd(date)) ran successfully at 2022-05-18 10:33:50"                                        
#> [7] "OK! select(air_pr\\time,date,passengers) ran successfully at 2022-05-18 10:33:50"                                          
#> [8] "Total running time: 1.95047211647034 secs"

Hence, as it stands, the {chronicler} package is less informative than simply reading the preparation script. However, since our needs center more on describing the changes a vector/variable in a tibble underwent, it might be worth considering {diffobj} to highlight the differences between input and output objects (a sketch follows below).
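
A quick sketch of what that could look like with {diffobj} (diffPrint() is part of the package; the data here is invented):

```r
library(diffobj)

# Hypothetical raw vs. processed versions of the same variables
original  <- data.frame(cowcode = c(2, 20), stateabb = c("USA", "CAN"))
processed <- data.frame(ID = c("2", "20"), Label = c("United States", "Canada"))

# Colour-coded, side-by-side diff of the two printed objects
diffPrint(original, processed)
```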

To sum it up, we should encourage users to read the data preparation scripts for now if they want to see the exact changes that the data underwent. In the future, if we deem that the data preparation scripts are not clear enough, we should either look for or develop our own logging/diff package that centers around the differences between the input and output objects rather than a pure logging approach.

Hope it helps :)

@jhollway (Collaborator) commented:
Thanks @BBieri, this is a very useful summary. @henriquesposito, is there a good place in the README or a vignette where we can suggest that users investigate the preparation scripts and/or run a diff between the originally imported dataset and what's available in the package?

@henriquesposito (Collaborator) commented:
Thank you @BBieri for the helpful comments!

@jhollway Maybe we can have a simpler version of a "data_evolution()" function that surfaces the information contained in the preparation scripts for users? Even if that means fetching the preparation scripts themselves... In any case, I think you are right that we should make this information more easily accessible to users.

@henriquesposito (Collaborator) commented Jun 22, 2022

I have added a first draft of a working version of this function. The main issue with setting this up was that the data-raw folders are ignored when our data packages are built, so they are not available to users who simply load the packages. To overcome this, I had to download the raw data from the package's GitHub repository whenever available. For now, the function compares the raw data with the data available in the package as long as both are available, the raw data is in ".csv" format, and the raw and packaged datasets have the same name (not always the case); otherwise, the function opens the dataset's preparation script on GitHub. A simplified sketch of this logic appears at the end of this comment.

The function is located in the "data_report" script in the development branch of {manydata}. Of course this needs to be expanded and improved; this is just an idea.

@jhollway and @jaeltan please let me know what you think.

If we are to move ahead with this idea, we might want to consider standardising the names of the raw data files to match the names of the datasets in our databases, as well as adding them as ".csv" files to the data-raw folder whenever possible (text files facilitate downloading and visualisation on GitHub). All of this could probably be done programmatically in {manypkgs}.
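
For reference, a simplified sketch of the comparison logic described above (the repository path, branch name, and fallback URL are assumptions, not the actual {manydata} implementation):

```r
# Fetch the raw ".csv" from the data-raw folder on GitHub (when present) and
# compare it to the packaged dataset; otherwise fall back to opening the
# data-raw folder (with the preparation script) on GitHub.
data_evolution_sketch <- function(dataset_name, repo = "globalgov/manydata") {
  url <- paste0("https://raw.githubusercontent.com/", repo,
                "/main/data-raw/", dataset_name, ".csv")
  raw <- tryCatch(utils::read.csv(url), error = function(e) NULL)
  if (is.null(raw)) {
    utils::browseURL(paste0("https://github.com/", repo, "/tree/main/data-raw"))
    return(invisible(NULL))
  }
  pkg <- get(dataset_name)  # the packaged version, assumed already loaded
  list(dims_raw     = dim(raw),
       dims_package = dim(pkg),
       vars_added   = setdiff(names(pkg), names(raw)),
       vars_dropped = setdiff(names(raw), names(pkg)))
}
```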
