[Discussion] Examples of using {workflowr} with {targets} #238
@rgayler I don't know of any examples off the top of my head, and my GitHub searches were unsuccessful. For inspiration, here is a nice example of drake+workflowr: https://github.com/pat-s/pathogen-modeling If you give it a try, I'd love to hear what works well and what is awkward when combining the 2 tools.
Thanks @jdblischak I will have a look at https://github.com/pat-s/pathogen-modeling - but I think I will have to do that from a high-level-feels perspective rather than a code-I-can-steal-verbatim perspective, because this week will be the toe-in-the-water week for combining workflowr with targets. It's likely to be messy because I have no experience with targets and only slightly more with workflowr. It will necessarily be rather exploratory, because I will simultaneously be trying to find a workflow that is compatible with my notebook-centric exploratory approach and a more pipeline-centric approach for the computational core of my current project.

Presuming that I want to make targets responsible for scheduling all computation in the project, I suspect my main issue is going to be taking responsibility for dependency-directed execution away from workflowr and giving it exclusively to targets. Alternatively, I could make targets responsible for the computational core of the project and restrict workflowr to reporting, but that might leave gaps in ensuring that everything is up to date.
Also check out this thread ropensci/tarchetypes#23 where we discuss ideas about how to integrate targets with rmarkdown and related packages like bookdown and workflowr. 2 of the example targets repos end with creating a final Rmd-generated report:
https://github.com/wlandau/targets-keras/blob/main/_targets.R#L74 The main difference is that you'll need to run workflowr's own build step for the final report rather than a plain `rmarkdown::render()`.
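For concreteness, here is a hedged sketch of what such a final-report target looks like in a `_targets.R` file. The target names, `fit_model()`, `training_data`, and `report.Rmd` are made-up placeholders, not names taken from the linked repositories:

```r
# _targets.R -- sketch only; fit_model(), training_data, and report.Rmd
# are hypothetical placeholders, not names from the linked repositories.
library(targets)
library(tarchetypes)

list(
  tar_target(model_fit, fit_model(training_data)),
  # tar_render() makes the report itself a target: it is re-rendered
  # whenever report.Rmd changes or any target it reads via
  # tar_read()/tar_load() is invalidated.
  tar_render(final_report, "report.Rmd")
)
```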
This is a struggle for me as well. I tend to do it in 2 stages: 1) a pipeline that uniformly processes lots of files on an HPC cluster, 2) workflowr to explore the results produced by the pipeline.
Just dropping my 2 cents here because I indirectly stumbled over this issue :)
I'd argue differently and say that {targets} is very similar to {drake} and only minor adaptations are needed to make the switch (whatever "minor" means in this case, ofc). {targets} is still quite new, and starting with {drake} and then switching is surely still a valid approach these days. The big picture is the same, and if one understands {drake} then {targets} is just around the corner.

With respect to projects combining {targets} and {workflowr}, IDK of any ATM, but this will surely change in the future. In the end one learns most by reading the manual and getting started step by step. Looking at other projects along the way can help to overcome edge cases or refine things, but overall you need to foster your personal binding to these kinds of packages if they are to drive your whole analysis :)
Thanks @jdblischak and @pat-s - I have so far only got as far as looking at https://github.com/pat-s/pathogen-modeling
So, overall, the only integration between {workflowr} and {drake} is that the {workflowr} notebooks read their results from the {drake} cache. @jdblischak wrote:
That appears to be the same as the pathogen-modeling workflow, except that the pathogen-modeling notebooks only display the final results, whereas "explore the results" suggests (to me) two possible interpretations:
All the articles I have read so far on Rmarkdown/etc. for reproducible research have only mentioned aspect 1 (the final product), not aspect 2 (the reasoning leading to the final product). My preference (strongly influenced by my day job) is to also record the path leading to the final product, in addition to the final product itself. I want that history to be explicitly present, rather than having to be reconstructed by git archaeology.

So my problem is to work out what I really want to do, subject to it being reasonably supported by the packages and not being infeasibly complex/expensive to implement. I think I'll write up a sketch of what I am aiming for (informed by a better understanding of {workflowr} and {targets}), but not here, as it will probably be too long to be appropriate for a GitHub comment.

@pat-s wrote:
I agree. That's my next step for {targets}. I think the parts of {targets} relating to my computational core will be relatively straightforward. I hope that having a firmer understanding of my project-documentation goals (as distinct from product documentation) will help me to more easily spot the aspects of {targets} that I will need to implement that project-documentation workflow.
One option would be to create a separate Rmd to analyze the results of each iteration of your processing pipeline, and then link the HTML directly in the main page. That way it is easier to find than having to search through the Git history. So the workflow would be something like:
Then someone viewing the site could see the list of analyses as the pipeline evolved.
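The per-iteration workflow described above might be sketched with workflowr's own commands like this (file names and commit messages are hypothetical):

```r
# Sketch: one workflowr notebook per iteration of the pipeline.
# File names and messages are hypothetical.
library(workflowr)

# Create a new analysis page for this iteration of the pipeline
wflow_open("analysis/results-iteration-01.Rmd")

# ...edit the Rmd so it reads and summarises this iteration's outputs...

# Build, commit, and publish the page so its rendered HTML is versioned
wflow_publish("analysis/results-iteration-01.Rmd",
              message = "Results from pipeline iteration 1")

# Finally, link the new HTML page from analysis/index.Rmd
```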
@jdblischak I am currently doing something like that with my {workflowr}-only workflow, except that it tends to be a separate Rmd for each step in the pipeline (rather than for each iteration in the evolution of the pipeline), because the results from each pipeline step usually inform the design of the following pipeline steps.

Having thought some more about it overnight, I think mentioning "history" in my earlier comments is misleading. I don't need a copy of everything I have ever done (it's not like a financial transaction audit trail). Rather, I need a record of the reasoning behind the current design of the computational core pipeline, where:
I think I can support all that with {targets} primarily being used to manage the computational core pipeline, and {workflowr} Rmd notebooks hanging off the pipeline as side branches. I am currently working through the {targets} tutorial material. After I have done that I will write a document setting out in detail what I think my workflow will be. I will post a link to it back here when it's complete, and then try to retrofit it to my current personal project.
I think this is a great strategy, documenting the decisions for each step of the pipeline. And I think you'll be able to automate it such that the workflowr Rmds get updated whenever, e.g. the data output by the corresponding step changes.
Perfect! I look forward to reading how you decided to combine targets+workflowr.
Hi - just to say thank you for the code in the comment above. I came here via the linked {tarchetypes} issue. FWIW, here is what I ended up with after adapting it:
...still requires this lengthy chunk of code for each page (as I failed at writing a target factory), but it does work. An alternative would be to ...
@petrbouchal Thanks for sharing the code for your solution!
OK @jdblischak I have finally put my thoughts together on what I hope to do at https://rgayler.github.io/fa_sim_cal/workflow.html Those notes are specific to that particular project, but I think they represent what I have been trying to do (badly) for ages. Now all I have to do is retrofit it.
@rgayler Thanks for these detailed notes! Overall I think it looks good, and I look forward to hearing how it goes in practice. A few minor comments:
This is true when the argument
If you want ...

And some more resources I have found recently:
@jdblischak thanks for those extra resources. |
Hello. Sorry I didn't take the proper time to read all the above messages.

I also have another thing to fit into my project: the thesis paper.

As @jdblischak said, I was trying to use the gh-pages branch to publish the website instead of the docs folder. This would make things cleaner, but sadly I could not solve it.

How I think I'll do it:
That's my 2 cents, accepting suggestions :)
I have finally finished converting the initial stage of my current night-job research project to use {targets} and {workflowr}. Points of interest:
A major requirement for me is to capture the reasoning behind the design of the computational experiments, so I have partitioned the computations into three groups: core, publication, meta.

- core computations are the essential parts of the computational experiments. These are handled as straight {targets} computations.
- publication computations are whatever is required to transform the core results into something that can be presented/published. The output format will generally be a document (pdf/slides/...) rather than a website, so I don't see {workflowr} being needed for this. These computations will be managed by {targets}.
- meta computations support and document the design of the core pipeline and are primarily {workflowr} analysis notebooks. (They could be assisted by some {targets}-only steps, but I haven't needed to do that yet.) Most of the meta notebooks result in some functions being written for the core pipeline.

The meta notebooks are themselves targets managed by {targets}, which renders them using {tarchetypes}. The target for a meta notebook looks like:
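The original code block did not survive in this copy of the thread. As a hedged reconstruction, a meta-notebook target built with {tarchetypes} might look roughly like this (the notebook name and file path are hypothetical):

```r
# Sketch only -- the notebook name and path are hypothetical.
# tar_render() re-renders the notebook when the Rmd file changes or
# when any target it reads via tar_read()/tar_load() is invalidated.
# Core-pipeline functions the notebook depends on are not detected
# automatically and have to be declared manually.
tar_render(
  m_01_check_data,                  # hypothetical meta-notebook target
  "analysis/m_01_check_data.Rmd"    # a {workflowr} analysis notebook
)
```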
I probably need to add targets for the top-level website pages too. At the moment I don't have any targets corresponding to the top-level website (index.Rmd etc.) and build those manually.
I suspect similar patterns of ...
Thanks @wlandau, I will keep an eye out for progress on Target Markdown. FYI, in my current (possibly very idiosyncratic) workflow I have a very high ratio of lines of notebook written to lines of core pipeline functions written. The currently written code is all about ingesting and understanding some data. I have 9 quite long notebooks defining 20 core pipeline functions (each only a few lines long) that are composed into one function that is the command corresponding to one edge in the core pipeline DAG. The core pipeline functions used in the target corresponding to each notebook are manually entered as dependencies in that target. That's a bit tedious and probably more error-prone than average.
@wlandau I tried out your new Target Markdown feature. It looks really good, but I think it is not compatible with my (probably idiosyncratic) workflow. Assuming I have interpreted it correctly, it appears to be aimed at defining the entire pipeline within the Rmarkdown documents. In my current setup, each target definition is trivial.
Obviously, the complexity is not in the target, but in the function that calculates the target.
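The function itself was not preserved in this copy of the thread. Purely as an illustration of the shape being described, it is one wrapper composing many small steps (all names here are hypothetical):

```r
# Sketch (all names hypothetical): one edge of the core pipeline DAG is
# a single function composed from many small, notebook-justified steps.
create_clean_entity_data <- function(raw_file) {
  raw_file |>
    read_raw_entity_data() |>
    standardise_names() |>
    drop_invalid_records() |>
    recode_missing_values()
    # ...roughly 20 such steps in total
}
```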
There's an approximately 1:1 relationship between each of those internal functions and an Rmarkdown document. Most of those internal functions are very short.
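The inline example was lost here; a hypothetical internal function of the kind described might be as short as:

```r
# Hypothetical example of a short internal function.
# Keep only records with a usable identifier and a plausible birth year.
drop_invalid_records <- function(d) {
  dplyr::filter(d, !is.na(id), birth_year >= 1900)
}
```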
So, the programming part is close to trivial. The hard part is understanding the data well enough to justify the choice of function definitions. The data is externally supplied from manual administrative systems, so it can have all sorts of unexpected things in it. The median length of the Rmarkdown documents is 300 lines, with the longest being ~1400 lines. The rendered output is considerably longer. If I tried to collapse all the Rmarkdown documents into a single document it would be very unwieldy, computationally and cognitively. Also, bear in mind that this was all to define just one target. Some of the later targets may take much less effort to define because they are modelling rather than data wrangling.

I could split the single target into multiple targets, corresponding to each of the internal functions. (I'm not sure how that would help, but hey.) I am very loath to do that because those intermediate targets have no lasting value. They are only of value to the next step in the pipeline. They differ trivially from their neighbours and are potentially quite large to store. So I would end up with a large number of (hard to name and remember) space-consuming and redundant target objects.

Have I misunderstood Target Markdown, or not seen some approach it implies?
I just added some clarifying comments to the manual. You should be able to spread your work over multiple reports as long as all the code chunks have unique labels.
There is nothing specific to any particular chunk.
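The Target Markdown pattern being clarified here spreads `{targets}` chunks across several reports; the only constraint is that chunk labels are unique across all of them. A minimal sketch, with entirely hypothetical chunk contents:

````markdown
<!-- report1.Rmd -->
```{targets raw-data}
tar_target(raw_data, read_raw("data.csv"))  # hypothetical target
```

<!-- report2.Rmd: a different chunk label, same pipeline -->
```{targets fit-model}
tar_target(model, fit_model(raw_data))      # hypothetical target
```
````

Knitting each report appends its targets to the same pipeline, so the work can be split across as many documents as needed.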
Related to your work: With the
I will definitely have to look at that. My current meta notebooks can be modestly slow to execute (~5 minutes). It would be useful to have a more fine-grained pipeline with cached results for those notebooks, but once the corresponding part of the coarse-grained core pipeline is constructed I would want to discard the cached results from the fine-grained meta pipelines. Would that discardable caching behaviour be of any benefit more generally?

Say you have a DAG where some of the targets are marked as "transient" because you don't need to refer to their values other than for calculating the dependent targets. You could have a function that "flushes" the cached values of those transient targets while keeping their metadata. If you never flushed the transient targets, the behaviour would be unchanged from current behaviour. If you reach the point where you think some targets will never be accessed again, you flush them, and downstream unflushed targets remain cached and usable unless some upstream change implies that the flushed transients should have been recalculated.

This is the point where @wlandau (who is amazingly productive and has the patience of a saint) can tell me that you can already achieve this effect by doing something that's already documented in the manual.
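None of this exists in {targets}; purely to make the proposal concrete, the hypothetical API might be used like this:

```r
# Entirely hypothetical API -- tar_flush() does NOT exist in {targets}.
tar_make()                      # build the whole pipeline as usual

# Discard the stored values of transient targets, keeping only their
# metadata (hashes), so downstream targets remain valid.
tar_flush(c("step_03_tmp", "step_04_tmp"))

tar_make()  # a no-op unless upstream changes imply the flushed
            # transients (and their dependents) must be rebuilt
```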
Discardable caching does not really fit what ...
I just want to chime in to express my gratitude for this thread. @rgayler as always thank you so much for documenting and sharing your experience. I'm sure it will be useful to other users. I also really like your site. I clicked around to confirm that you are essentially versioning the evolution of your computational pipeline. I was able to quickly view past versions of the HTML files documenting the results of earlier iterations. @wlandau thank you so much for all your help to enable workflowr users to take advantage of the targets ecosystem. And of course for the targets ecosystem itself! You're building an amazing resource for the R community.
Hi. Do you have any links to examples of using {targets} (the successor to {drake}) with {workflowr}? I haven't been able to find any with internet search, but my search foo may be insufficient to the task.