
[Discussion] Examples of using {workflowr} with {targets} #238

Open
rgayler opened this issue Feb 18, 2021 · 24 comments

Comments

@rgayler

rgayler commented Feb 18, 2021

Hi. Do you have any links to examples of using {targets} (the successor to {drake}) with {workflowr}? I haven't been able to find any with an internet search, but my search-fu may be insufficient to the task.

@jdblischak
Member

@rgayler I don't know of any examples off the top of my head, and my GitHub searches were unsuccessful. When I search filename:_targets.R filename:_workflowr.yml, it just shows me results for whichever file I put second.

For inspiration, here is a nice example of drake+workflowr: https://github.com/pat-s/pathogen-modeling

If you give it a try, I'd love to hear what works well and what is awkward when combining the 2 tools.

@rgayler
Author

rgayler commented Feb 19, 2021

Thanks @jdblischak

I will have a look at https://github.com/pat-s/pathogen-modeling - but I think I will have to do that from a high-level-feels perspective rather than a code-I-can-steal-verbatim perspective because {targets} is an incompatible successor to {drake}.

I think this week will be the toe-in-the-water week for combining workflowr with targets. It's likely to be messy because I have no experience with targets and only slightly more with workflowr. It will necessarily be rather exploratory because I will simultaneously be trying to find a workflow that is compatible with both my notebook-centric exploratory approach and a more pipeline-centric approach for the computational core of my current project.

I am hopeful that the R Targetopia and {tarchetypes} will help. There is an archetype for rendering Rmarkdown reports that I hope can be used directly or modified to call wflow_build() instead of rmarkdown::render().

Presuming that I want to make targets responsible for scheduling all computation in the project, I suspect my main issue is going to be taking responsibility for dependency-directed execution away from workflowr and giving it exclusively to targets. Alternatively, I could make targets responsible for the computational core of the project and restrict workflowr to reporting, but that might leave gaps in ensuring that everything is up to date.

@rgayler rgayler closed this as completed Feb 19, 2021
@jdblischak
Member

Also check out this thread ropensci/tarchetypes#23 where we discuss ideas about how to integrate targets with rmarkdown and related packages like bookdown and workflowr.

2 of the example targets repos end with creating a final Rmd-generated report:

tar_render(report, "report.Rmd")

https://github.com/wlandau/targets-keras/blob/main/_targets.R#L74
https://github.com/wlandau/targets-stan/blob/main/_targets.R#L72

The main difference is that you'll need to run rmarkdown::render_site() (or wflow_build()) instead of rmarkdown::render(). From my reading of the thread above, I think that would look something like this:

    tar_target(
        html_file,
        command = {
            !! tar_knitr_deps_expr("report.Rmd")
            rmarkdown::render_site("report.Rmd")
            "report.html"
        },
        format = "file"
    )

a workflow that is compatible with my notebook-centric exploratory approach and a more pipeline-centric approach for the computational core of my current project.

This is a struggle for me as well. I tend to do it in 2-stages: 1) pipeline that uniformly processes lots of files on an HPC cluster, 2) workflowr to explore the results produced by the pipeline.

@pat-s

pat-s commented Feb 19, 2021

Just dropping my 2 cents here because I indirectly stumbled over this issue :)

{targets} is an incompatible successor to {drake}.

I'd argue differently and say that {targets} is very similar to {drake} and only minor adaptations are needed to make the switch (whatever "minor" means in this case, ofc).

{targets} is still quite new and starting with {drake} and then switching is surely still a valid approach these days.

The big picture is the same, and if one understands {drake} then {targets} is just around the corner.

With respect to projects combining {targets} and {workflowr}, I don't know of any at the moment, but this will surely change in the future.
In my experience people adopt slowly, and many will only discover {drake} in the coming years even though it's superseded already.
This is our experience with mlr vs. mlr3.

In the end one learns most by reading the manual and getting started step by step. Looking at other projects along the way can help with edge cases or refinement, but overall you need to develop your own personal connection to these kinds of packages if they are to drive your whole analysis :)

@rgayler
Author

rgayler commented Feb 20, 2021

Thanks @jdblischak and @pat-s - I have so far only got as far as looking at https://github.com/pat-s/pathogen-modeling
My uneducated reading of the code is as follows (my apologies to @pat-s if I have misconstrued this):

  • Define the computational core as the code which reads and prepares the data, runs the analyses, and generates the figures for inclusion in the paper.
  • The computational core is the authors' current view of what is correct. That is, the embedded documentation describes the current computational core, and the development path leading to it is not visible without indulging in some git archaeology (and that would only show the changes in the code, not the reasoning behind those changes).
  • The computational core is managed completely by {drake}. The purpose of {drake} in this context is to ensure that the results of the computational core are up to date after the last (no longer visible) changes to the computational core.
  • The {workflowr} notebooks load results created by the computational core using calls to drake::loadd() and then display them appropriately.
  • Some of the notebook displays are for providing visibility of what the computational core has done. That is, these displays are intended to be consumed by eyeballs-only as part of the feedback loop for developing the computational core.
  • Some of the notebook displays create the figures to be included in the research paper. These figures are copied programmatically by the notebook from the rendered {workflowr} website to the source directory for the research paper.
  • The {workflowr} notebooks are managed completely by {workflowr}. They are not managed by {drake}.
  • The rendering of the research paper appears to be completely manually managed.

So, overall, the only integration between {workflowr} and {drake} is that the {workflowr} notebooks use drake::loadd() to fetch results from the computational core.
I presume that {workflowr} doesn't directly know whether those fetched results are up to date and {drake} is not managing the notebooks so it can't re-execute them when the results they depend on are out of date.
The research paper is manually managed, so there's no mechanism to force re-rendering when the results change.
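For concreteness, the integration point described above is just ordinary {drake} cache access from inside a notebook chunk. A minimal sketch, with a hypothetical target name:

```r
# Inside a {workflowr} notebook chunk: fetch cached results from the
# {drake} store and display them. Note that {workflowr} has no way of
# knowing whether the fetched object is up to date.
# `benchmark_results` is a hypothetical target name.
library(drake)

loadd(benchmark_results)        # load the target into the calling environment
summary(benchmark_results)

# or fetch without assigning into the environment:
res <- readd(benchmark_results)
```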

@jdblischak wrote

[Finding a suitable workflow] is a struggle for me as well. I tend to do it in 2-stages: 1) pipeline that uniformly processes lots of files on an HPC cluster, 2) workflowr to explore the results produced by the pipeline.

That appears to be the same as the pathogen-modeling workflow, except that the pathogen-modeling notebooks only display the final results, whereas "explore the results" suggests (to me) two possible interpretations:

  1. Exploring the results of the final (assumed to be the best) execution of the computational core
  2. Exploring the results of earlier iterations of the computational core in order to direct the evolution of the computational core to its final form.

All the articles I have read so far on Rmarkdown/etc. for reproducible research have only mentioned aspect 1 (the final product), not aspect 2 (the reasoning leading to the final product).

My preference (strongly influenced by my day job) is to also record the path leading to the final product in addition to the final product. I want that history to be explicitly present, rather than having to be reconstructed by git archaeology. So my problem is to work out what I really want to do, subject to it being reasonably supported by the packages, and not being infeasibly complex/expensive to implement.

I think I'll write up a sketch of what I am aiming for (informed by a better understanding of {workflowr} and {targets}), but not here as it will probably be too long to be appropriate for a GitHub comment.

@pat-s wrote:

In the end one learns most by reading the manual and getting started step-by-step

I agree. That's my next step for {targets}. I think the parts of {targets} relating to my computational core will be relatively straightforward. I hope that a firmer understanding of my desired project documentation goals (as distinct from product documentation) will help me more easily spot the aspects of {targets} I will need to implement that project-documentation workflow.

@jdblischak
Member

My preference (strongly influenced by my day job) is to also record the path leading to the final product in addition to the final product. I want that history to be explicitly present, rather than having to be reconstructed by git archaeology. So my problem is to work out what I really want to do, subject to it being reasonably supported by the packages, and not being infeasibly complex/expensive to implement.

One option would be to create a separate Rmd to analyze the results of each iteration of your processing pipeline, and then link the HTML directly in the main page. That way it is easier to find than having to search through the Git history. So the workflow would be something like:

  1. Edit the pipeline files and commit the changes.
  2. Re-run the pipeline to generate the results.
  3. (If feasible) Commit the changes to the data files
  4. Create a new Rmd to analyze the final results (and any intermediate stages of interest)
  5. Add a link to this new HTML in index.Rmd
  6. Run wflow_publish() on both the new Rmd and index.Rmd

Then someone viewing the site could see the list of analyses as the pipeline evolved.
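Steps 4–6 above might look something like this in code (file names are illustrative):

```r
library(workflowr)

# 4. Create a new Rmd to analyze this iteration of the pipeline
wflow_open("analysis/pipeline-iter-02.Rmd")

# 5. (edit index.Rmd by hand to add a link to pipeline-iter-02.html)

# 6. Build, commit, and publish both files in one step
wflow_publish(c("analysis/pipeline-iter-02.Rmd", "analysis/index.Rmd"),
              message = "Analyze results of pipeline iteration 2")
```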

@rgayler
Author

rgayler commented Feb 21, 2021

@jdblischak I am currently doing something like that with my {workflowr}-only workflow except that it tends to be a separate Rmd for each step in the pipeline (rather than each iteration in the evolution of the pipeline) because the results from each pipeline step usually inform the design of the following pipeline steps.

Having thought some more about it overnight, I think mentioning "history" in my earlier comments is misleading. I don't need a copy of everything I have ever done (it's not like a financial transaction audit trail). Rather, I need a record of the reasoning behind the current design of the computational core pipeline, where:

  • The design of the computational core pipeline and the reasoning supporting that design are potentially revised over the lifetime of the project.
  • The current design and its supporting reasoning are assumed to be my best current beliefs and improvements over any previous design and supporting reasoning.
  • Consequently, I don't need easy and immediate access to the prior designs and the prior supporting reasoning. git archaeology is fine if I ever need access to those earlier versions of the pipeline and design reasoning.
  • Some aspects of the current design may have been decided much earlier in the lifetime of the project.
  • Decisions about the design and supporting reasoning for the design are based (at least partly) on analyses of data available in the computational core pipeline.
  • I want the ability to run any of those design support analyses currently even if they relate to decisions that were taken long ago in the lifetime of the project.

I think I can support all that with {targets} primarily being used to manage the computational core pipeline, and {workflowr} Rmd notebooks hanging off the pipeline as side-branches.

I am currently working through the {targets} tutorial material. After I have done that I will write a document setting out in detail what I think my workflow will be. I will post a link to that back here when it's complete and then try to retro-fit it in my current personal project.

@jdblischak
Member

I think I can support all that with {targets} primarily being used to manage the computational core pipeline, and {workflowr} Rmd notebooks hanging off the pipeline as side-branches.

I think this is a great strategy, documenting the decisions for each step of the pipeline. And I think you'll be able to automate it such that the workflowr Rmds get updated whenever, e.g. the data output by the corresponding step changes.

I am currently working through the {targets} tutorial material. After I have done that I will write a document setting out in detail what I think my workflow will be. I will post a link to that back here when it's complete and then try to retro-fit it in my current personal project.

Perfect! I look forward to reading how you decided to combine targets+workflowr.

@petrbouchal

Hi - just to say thank you for the code in the comment above. I came here via the linked {tarchetypes} issue.

FWIW here is what I ended up with adapting it:

  • add the _site.yml config file as a target
  • add a directory with component images, CSS etc. (docs) as a target
  • add the input Rmd as a target
  • make the main target depend on the three above

...still requires this lengthy chunk of code for each page (as I failed at writing a target factory), but it does work.

An alternative would be to tar_load() the three targets inside each Rmd file - that would make the targets code more concise but the structure more opaque.

t_html <- list(
  tar_file(siteconf, "_site.yml"),
  tar_file(sitefiles, "site"),
  # https://github.com/jdblischak/workflowr/issues/238#issuecomment-782024069
  tar_file(s_index_rmd, "index.Rmd"),
  tar_target(s_index_html, command = {!! tar_knitr_deps_expr("index.Rmd")
    s_index_rmd
    siteconf
    sitefiles
    rmarkdown::render_site("index.Rmd")
    "docs/index.html"}, format = "file"),
  tar_file(s_inputchecks_rmd, "s_inputchecks.Rmd"),
  tar_target(s_inputchecks_html, command = {!! tar_knitr_deps_expr("s_inputchecks.Rmd")
    s_inputchecks_rmd
    siteconf
    sitefiles
    rmarkdown::render_site("s_inputchecks.Rmd")
    "docs/s_inputchecks.html"}, format = "file")
)

@jdblischak
Member

@petrbouchal Thanks for sharing the code for your solution!

@rgayler
Author

rgayler commented Mar 3, 2021

OK @jdblischak I have finally put my thoughts together on what I hope to do at https://rgayler.github.io/fa_sim_cal/workflow.html
Any feedback would be greatly appreciated.

Those notes are specific to that particular project, but I think it represents what I have been trying to do (badly) for ages.

Now all I have to do is retrofit targets into the project. I'll get back with feedback on that when it's done.

@jdblischak
Member

@rgayler Thanks for these detailed notes! Overall I think it looks good, and I look forward to hearing how it goes in practice.

A few minor comments:

workflowr::wflow_build() tracks modification dates of Rmd files and the corresponding rendered output files

This is true when the argument make is TRUE. The default value is make = is.null(files). Thus if you specify specific files to build, but you also want to catch any other potentially outdated files, you'll need to explicitly set make = TRUE.

workflowr::wflow_publish() tracks git status of Rmd files and the corresponding rendered output files

If you want wflow_publish() to automatically detect (and build) Rmd files that have been committed more recently than their corresponding HTML files, you can pass the argument update = TRUE.
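A compact sketch of both flags mentioned above (the file name is illustrative):

```r
# Build the named file AND any other outdated files; the default is
# make = is.null(files), so make must be set explicitly when files are named.
workflowr::wflow_build("analysis/report.Rmd", make = TRUE)

# Detect published Rmd files committed more recently than their HTML,
# rebuild them, and commit the refreshed HTML.
workflowr::wflow_publish(update = TRUE)
```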

And some more resources I have found recently:

  • A new project by @franzbischoff that plans to combine targets and workflowr (repo, site) (note: he is experimenting with hosting the website files on the gh-pages branch, so it may look a little different than a standard workflowr project)

  • @maurolepore has started a YouTube playlist on using targets. The latest video describes how to include Rmd files in a targets pipeline, so it could be informative.

@rgayler
Author

rgayler commented Mar 3, 2021

@jdblischak thanks for those extra resources.

@franzbischoff

Hello,

Sorry, I didn't take the proper time to read all the above messages.

My POV about drake/targets + workflowr is that the former packages lack the history of experiments, so workflowr comes into place.

I also have another thing to fit in my project: the thesis paper that I'm doing with thesisdown (just a pre-configured bookdown).

So, as @jdblischak said, I was trying to use the gh-pages branch to publish the website instead of the docs folder. This would make things cleaner, but sadly I could not get workflowr to keep a history of both the Rmd and the (actually browsable) HTML files. So I dropped the matter.

How do I think I'll do targets+workflowr for now?

  • I'll use targets independently. If you read the documentation, they suggest that the report should be the Rmd because you don't want to keep knitting everything every time. After I have an iteration of my experiment, I'll have targets objects to import and write the actual Rmd.
  • Then, I'll use workflowr to knit and record that iteration.
  • Finally, I'll write something in bookdown and knit my thesis that will be a subfolder of workflowr docs/

That's my 2 cents, accepting suggestions :)

@rgayler rgayler changed the title Examples of using {workflowr} with {targets} [Discussion] Examples of using {workflowr} with {targets} May 27, 2021
@rgayler
Author

rgayler commented May 27, 2021

I have finally finished converting the initial stage of my current night-job research project to use {targets} and {workflowr}.

Points of interest:

A major requirement for me is to capture the reasoning behind the design of the computational experiments, so I have partitioned the computations into three groups: core, publication, meta.

core computations are the essential parts of the computational experiments. These are handled as straight {targets} computations.

publication computations are whatever is required to transform the core results into something that can be presented/published. The output format will generally be a document (pdf/slides/ ...) rather than a website. So I don't see {workflowr} being needed for this. These computations will be managed by {targets} using tar_render(). (I haven't implemented any publications yet.)

meta computations support and document the design of the core pipeline and are primarily {workflowr} analysis notebooks. (They could be assisted by some {targets}-only steps, but I haven't needed to do that yet.) Most of the meta notebooks result in some functions being written in functions.R to be used by {targets} to implement edges in the DAG for the core pipeline. The same functions are also displayed and used in the meta notebooks by using knitr::read_chunk(). This guarantees that exactly the same functions are used in the meta notebooks and the core pipeline.
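The knitr::read_chunk() pattern described above can be sketched as follows (file and chunk names are illustrative):

```r
# --- code/functions.R -------------------------------------------------
# knitr::read_chunk() recognises chunks delimited by `## ---- label`:

## ---- core_fn_1
core_fn_1 <- function(d) {
  d  # ... the real transformation goes here
}

# --- in a meta notebook (e.g. meta_notebook_1.Rmd) --------------------
# A setup chunk calls:
#   knitr::read_chunk("code/functions.R")
# and an empty chunk labelled `core_fn_1` then displays and runs exactly
# the definition above, so the notebook and the {targets} pipeline share
# a single source of truth for each function.
```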

The meta notebooks are targets managed by {targets} which renders them using wflow_build(). However, the decision to publish the meta notebooks seems necessarily manual to me. So I only publish the notebooks by manually calling wflow_publish(). Because of this I do not track the rendered HTML file in the target. If I did track the HTML file then publishing the notebook would update the rendered HTML and invalidate the target so that {targets} would want to rebuild it.

The target for a meta notebook looks like:

tar_target(
    meta_notebook_1,
    command = {
        !! tar_knitr_deps_expr("meta_notebook_1.Rmd")
        list(core_fn_1, core_fn_2, core_fn_3)
        workflowr::wflow_build("meta_notebook_1.Rmd")
        "meta_notebook_1.Rmd"
    },
    format = "file"
)

I probably need to add make = TRUE to the wflow_build() as suggested by @jdblischak .

At the moment I don't have any targets corresponding to the top-level website (index.Rmd etc.) and build those manually.

@rgayler rgayler reopened this May 27, 2021
@wlandau

wlandau commented May 27, 2021

I suspect similar patterns of targets + workflowr will get easier after I implement Target Markdown: ropensci/targets#469. This will allow you to prototype and construct targets pipelines from R Markdown instead of manually writing to _targets.R.

@wlandau

wlandau commented May 27, 2021

So then workflowr notebooks could set up and run the pipeline, rather than being artifacts inside a pipeline.

@rgayler
Author

rgayler commented May 28, 2021

Thanks @wlandau , I will keep an eye out for progress on Target Markdown.

FYI

In my current (possibly very idiosyncratic) workflow I have a very high ratio of lines of notebook written to lines of core pipeline functions written. The code written so far is all about ingesting and understanding some data. I have 9 quite long notebooks defining 20 core pipeline functions (each only a few lines long) that are composed into one function that is the command corresponding to one edge in the core pipeline DAG.

The core pipeline functions used in the target corresponding to each notebook are manually entered as dependencies in that target. That's a bit tedious and probably more error-prone than average.

@rgayler
Author

rgayler commented May 29, 2021

@wlandau I tried out your new Target Markdown feature. It looks really good, but I think not compatible with my (probably idiosyncratic) workflow.

Assuming I have interpreted it correctly, it appears to be aimed at defining the entire targets DAG from within one Rmarkdown document. When you knit the document it overwrites _targets.R and associated files based on the code in the targets chunks. So it sounds like there is no ability to build up the contents of _targets.R from multiple Rmarkdown documents.

In my current targets + workflowr project it takes me 9 Rmarkdown notebooks to build one target definition:

tar_target(
  c_clean_entity_data,
  command = raw_entity_data_make_clean(c_raw_entity_data_file)
)

Obviously, the complexity is not in the target, but in the function that calculates the target:

raw_entity_data_make_clean <- function(
  file_path # character - file path usable by vroom
  # value - data frame
) {
  raw_entity_data_read(file_path) %>%
    raw_entity_data_excl_status() %>%
    raw_entity_data_excl_test() %>%
    raw_entity_data_drop_novar() %>%
    raw_entity_data_parse_dates() %>%
    raw_entity_data_drop_admin() %>%
    raw_entity_data_drop_demog() %>%
    raw_entity_data_clean_all() %>%
    raw_entity_data_add_id()
}

There's an approximately 1:1 relationship between each of those internal functions and an Rmarkdown document. Most of those internal functions are very short, e.g.

raw_entity_data_excl_status <- function(
  d # data frame - raw entity data
) {
  d %>%
    dplyr::filter(
      voter_status_desc == "ACTIVE" & voter_status_reason_desc == "VERIFIED"
    )
}

raw_entity_data_drop_admin <- function(
  d # data frame - raw entity data
) {
  d %>%
    dplyr::select(-c(county_id, registr_dt, cancellation_dt))
}

So, the programming part is close to trivial. The hard part is understanding the data well enough to justify the choice of function definitions. The data is externally supplied from manual administrative systems, so it can have all sorts of unexpected things in it.

The median length of the Rmarkdown documents is 300 lines, with the longest being ~1400 lines. The rendered output is considerably longer. If I tried to collapse all the Rmarkdown documents into a single document it would be very unwieldy, computationally and cognitively. Also, bear in mind that this was all to define just one target. Some of the later targets may take much less effort to define because they are modelling rather than data wrangling.

I could split the single target into multiple targets, corresponding to each of the internal functions. (I'm not sure how that would help, but hey.) I am very loath to do that because those intermediate targets have no lasting value. They are only of value to the next step in the pipeline. They differ trivially from their neighbours and are potentially quite large for storage. So I would end up with a large number of (hard to name and remember) space consuming and redundant target objects.

Have I misunderstood Target Markdown or not seen some approach it implies?

@wlandau

wlandau commented May 29, 2021

Assuming I have interpreted it correctly, it appears to be aimed at defining the entire targets DAG from within one Rmarkdown document. When you knit the document it overwrites _targets.R and associated files based on the code in the targets chunks. So it sounds like there is no ability to build up the contents of _targets.R from multiple Rmarkdown documents.

I just added some clarifying comments to the manual. You should be able to spread your work over multiple reports as long as all the code chunks have unique labels. The _targets.R file for Target Markdown is always the same no matter what code chunk writes it, and the file only actually gets written if the hash is different from that of the template. Here is what the file looks like.

# Generated by Target Markdown in targets 0.4.2.9000: do not edit by hand
library(targets)
lapply(
  list.files(
    "_targets_r/globals",
    pattern = "\\.R$",
    full.names = TRUE
  ),
  source
)
lapply(
  list.files(
    "_targets_r/targets",
    pattern = "\\.R$",
    full.names = TRUE
  ),
  function(x) source(x)$value
)

There is nothing specific to any particular chunk. The _targets.R file simply reads the independent chunk-specific scripts written to the _targets_r/ folder. That way, you can have as many reports as you want as long as the chunk labels do not collide.

@wlandau

wlandau commented May 29, 2021

Related to your work: With the root.dir knit option, it will be much easier to create different sub-pipelines from different R Markdown documents. root.dir takes care of the _targets.R script, the supplementary _targets_r/ scripts, and the _targets/ data store.

```{r}
knitr::opts_knit$set(root.dir = "your/choice/")
```

@rgayler
Author

rgayler commented May 29, 2021

create different sub-pipelines from different R Markdown documents

I will definitely have to look at that. My current meta notebooks can be modestly slow to execute (~5 minutes). It would be useful to have a more fine-grained pipeline with cached results for those notebooks, but once the corresponding part of the coarse-grained core pipeline is constructed I would want to discard the cached results from the fine-grained meta pipelines.

Would that discardable caching behaviour be of any benefit to targets independent of markdown?

Say you have a DAG where some of the targets are marked as "transient" because you don't need to refer to their values other than for calculating the dependent targets. You could have a function tar_transient_flush() that deletes the value of the target data object, but leaves a target data object recording that this has happened. The effect of this is that any downstream targets depending on the transient value would not be invalidated by the transient being flushed. However, if a dependent target needed to be recalculated (say, because of a changed function definition) it would also flag any upstream flushed transient targets as invalidated.

If you never flushed the transient targets the behaviour would be unchanged from current behaviour. If you reach the point where you think some targets will never be accessed again you flush them and downstream unflushed targets remain cached and usable unless some upstream change implies that the flushed transients should have been recalculated.

This is the point where @wlandau (who is amazingly productive and has the patience of a saint) can tell me that you can already achieve this effect by doing something that's already documented in the targets manual.

@wlandau

wlandau commented May 29, 2021

Discardable caching does not really fit what targets is trying to do. It breaks key assumptions of the mental model. I do not have plans to implement it.

@jdblischak
Member

I just want to chime in to express my gratitude for this thread.

@rgayler as always thank you so much for documenting and sharing your experience. I'm sure it will be useful to other users. I also really like your site. I clicked around to confirm that you are essentially versioning the evolution of your computational pipeline. I was able to quickly view past versions of the HTML files documenting the results of earlier iterations.

@wlandau thank you so much for all your help to enable workflowr users to take advantage of the targets ecosystem. And of course for the targets ecosystem itself! You're building an amazing resource for the R community.
