More details on benefits of Guix needed (Altuna) #9

Closed
markziemann opened this issue May 26, 2023 · 12 comments

Comments

@markziemann
Owner

The Guix part can be improved: Guix provides much more than Conda, and most Python and R packages can be imported via automatic importers (if the dependency structure is clear).

@markziemann
Owner Author

If you have a Guix environment, you can easily export it as a container in various formats.
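As a rough illustration of the export step, the sketch below only assembles a `guix pack` command line and does not run it. The file name `manifest.scm` and the helper `guix_pack_command` are assumptions made for this example, and the list of formats is deliberately limited to three that `guix pack --format` is known to accept; other formats may be available depending on the Guix revision.

```python
import shlex
from typing import List

# Formats assumed here; `guix pack --list-formats` reports what a given
# Guix revision actually supports.
KNOWN_FORMATS = ("tarball", "docker", "squashfs")


def guix_pack_command(manifest: str, fmt: str = "docker") -> List[str]:
    """Build (but do not run) a `guix pack` command for a manifest file."""
    if fmt not in KNOWN_FORMATS:
        raise ValueError(f"unsupported format in this sketch: {fmt}")
    return ["guix", "pack", f"--format={fmt}", f"--manifest={manifest}"]


# `manifest.scm` is a hypothetical file describing the environment.
print(shlex.join(guix_pack_command("manifest.scm")))
```

The resulting list can be passed to `subprocess.run` on a machine where Guix is installed.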

@markziemann
Owner Author

Related to this, I found a nice article on the topic: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9532446/

@markziemann
Owner Author

The article is great, and the GitLab repo points to a reproducible workflow. It took a while to download the data and build it, but it did work.

@markziemann
Owner Author

It was interesting that it built R 3.6, but I looked at the config and couldn't find out how that version was defined apart from the hash for that specific Guix commit.

The problem I have with this approach is that it isn't clear how anyone could specify which R or python version they want, as there doesn't appear to be a searchable database of commits relating to R/python versions.

So if I wanted to reproduce an analysis that was conducted with, e.g., R 3.2.2 and that does not already have a channel and manifest file, how could I do that? @pierrepo

@pierrepo
Collaborator

pierrepo commented Jun 1, 2023

Hum. I don't know. I will ask an expert.

@zimoun

zimoun commented Jun 2, 2023

@markziemann Author of https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9532446/ here :-)
@pierrepo Am I the expert? ;-)

Thanks for your interest in Guix. :-)

Well, R 3.2.2 is very old (~2015 IIRC). This version landed in Guix with commit 62de2545e1, pushed on Mon Aug 17 10:35:41 2015. Sadly, this commit predates guix time-machine (see Inferiors). In other words, the starting point for time travel with Guix is roughly v1.0, released in 2019. Therefore, there is no "easy" way, IMHO, to get R 3.2.2.

Now, my questions. :-)

Do you need only R 3.2.2? Or do you also need some other R packages from that time (2015)?

One approach is the guix-past channel (bringing software from the past to the present).

Hope that helps.

PS: I am currently offline (or almost ;-)) in the Alps. I will be back in two weeks. Let me know if you need help. Cheers.

@pierrepo
Collaborator

pierrepo commented Jun 2, 2023

> @pierrepo Am I the expert? ;-)

Oh yes, definitely.

Thanks for your input, Simon.

@zimoun

zimoun commented Jun 2, 2023

> It was interesting that it built R 3.6, but I looked at the config and couldn't find out how that version was defined apart from the hash for that specific Guix commit.

Yes, it's defined by the file channels.scm. That file captures the complete graph of dependencies (= state). The paradigm is different because there is no version resolver (such as a SAT solver). With a package manager based on a version resolver, the complete graph of dependencies depends on the output of that resolver.
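To make the contrast concrete, here is a toy sketch, not a model of any real package manager. The `resolve` function, the two indexes, and the `pinned_state` dictionary are all hypothetical; the point is only that a resolver's answer depends on what is available when it runs, whereas a pinned state is looked up rather than computed.

```python
from typing import Dict, List, Tuple

Version = Tuple[int, int]


def resolve(major: int, index: List[Version]) -> Version:
    """Toy resolver: newest version in `index` with the requested major number."""
    candidates = [v for v in index if v[0] == major]
    if not candidates:
        raise LookupError(f"no version with major {major} in index")
    return max(candidates)


# Hypothetical package indexes as seen at two different dates.
index_2022 = [(4, 1), (4, 2)]
index_2023 = index_2022 + [(4, 3)]

# Same request ("R 4.x"), different answers depending on when it is resolved.
print(resolve(4, index_2022))  # (4, 2)
print(resolve(4, index_2023))  # (4, 3)

# A pinned state involves no resolution step: the answer is fixed by the
# description itself, whatever has been published since.
pinned_state: Dict[str, Version] = {"r": (4, 2)}
print(pinned_state["r"])  # (4, 2)
```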

> The problem I have with this approach is that it isn't clear how anyone could specify which R or python version they want, as there doesn't appear to be a searchable database of commits relating to R/python versions.

That is what we tried to explain in the paper mentioned above. :-) Put differently, if I may, here is a quick video presentation explaining the "Guix paradigm". When using Guix, you specify one state (= graph of dependencies) that defines the packages at specific versions, rather than a set of version labels from which a version resolver builds that state.

The underlying idea is that a version label (v1.2.3) is not enough for reproducibility. You also need to capture the "versions" of the dependencies and, more importantly, the various compilation options. For example, please have a look at the appendix of the PDF slides of the quick presentation (the computation of a Bessel function where the result depends on the compilation option; yes, that's a corner case, but it highlights the potential issues for computational reproducibility that come from the various other parameters that version labels do not capture).

Well, in short: first, a version label is not enough to identify one version, as the discussion of intrinsic vs extrinsic identifiers explains. Second, we audit the source code but we run binaries, and thus we also need to capture the transformation from source to binary (the compilation options). Both are what channels.scm captures. Does it make sense?
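The difference between a label and an intrinsic identifier can be sketched as follows. This is an illustration under stated assumptions, not how Guix computes its store hashes: the function `state_id`, the source bytes, and the compiler flags are hypothetical. The identifier is derived from the content being described (source, build options, dependencies), so two builds that share the label "1.2.3" but differ in one compilation option get different identifiers.

```python
import hashlib
import json
from typing import Iterable


def state_id(source: bytes,
             build_options: Iterable[str],
             dependency_ids: Iterable[str]) -> str:
    """Hash everything that determines the binary, not just a version label."""
    description = {
        "source": hashlib.sha256(source).hexdigest(),
        "build_options": list(build_options),
        "dependencies": sorted(dependency_ids),
    }
    canonical = json.dumps(description, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


# Hypothetical inputs: both builds would carry the same label, "1.2.3".
source = b"hypothetical source tarball for version 1.2.3"
plain = state_id(source, ["-O2"], [])
fast_math = state_id(source, ["-O2", "-ffast-math"], [])

print(plain == fast_math)  # False
```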

About some searchable database, yeah I agree that some tooling is missing. :-)

Let me know if more details/explanations about Guix are required. :-)

@markziemann
Owner Author

@zimoun many thanks for taking the time to answer these questions. It is making a lot of sense now. I've been able to build two past R versions in isolated environments on my system which is fantastic. Your presentation at FOSDEM was very impressive. Guix is definitely a complete solution for forward reproducibility (reproducing current work 10-20 years in the future), but not so great for backward reproducibility (reproducing a 5-10 year old paper). Often, we only have R/Python label versions to work from when reproducing past studies, so a table like the following showing Guix commits corresponding to major releases would be very helpful. I understand that the label version is not the complete story but in these cases we're working with limited information.

| Tag | R version | Python version | Date |
| --- | --- | --- | --- |
| base-for-issue-62196 | 4.3 | 3.11 | 2023-04 |
| v1.4.0 | 4.2 | 3.10 | 2022-12 |
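A table of this kind could also be made queryable. The sketch below is a minimal illustration that assumes only the two rows above; the `Release` type and the `tags_for` helper are hypothetical names, not part of any existing Guix tooling.

```python
from typing import List, NamedTuple, Optional


class Release(NamedTuple):
    tag: str
    r: str
    python: str
    date: str


# The two rows from the table above; a real index would need many more.
RELEASES = [
    Release("base-for-issue-62196", "4.3", "3.11", "2023-04"),
    Release("v1.4.0", "4.2", "3.10", "2022-12"),
]


def tags_for(r: Optional[str] = None, python: Optional[str] = None) -> List[str]:
    """Return the Guix tags whose R and/or Python label versions match."""
    return [
        rel.tag
        for rel in RELEASES
        if (r is None or rel.r == r) and (python is None or rel.python == python)
    ]


print(tags_for(r="4.2"))        # ['v1.4.0']
print(tags_for(python="3.11"))  # ['base-for-issue-62196']
```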

Do you mind if we acknowledge your advice in our article?

Thanks again and enjoy your Alps trip!

@markziemann
Owner Author

With Altuna's and Simon's advice, and Pierre's and my changes to the manuscript, I think we can close this one. Thanks all!

@zimoun

zimoun commented Sep 14, 2023

Hi @markziemann @pierrepo

Well, I have just read the paper today. Nice!

Thanks for acknowledging me in the paper for the few answers I provided.

In that previous message, you wrote:

> Guix is definitely a complete solution for forward reproducibility (reproducing current work 10-20 years in the future), but not so great for backward reproducibility (reproducing a 5-10 year old paper).

Please note that Guix cannot fix the poor description of past published papers. :-) That said, if the paper contains enough information, it is possible to redo (today) what the paper did in the past. Let me pick an example from Reproducible research hackathon: experience report. :-)

Consider the 10 Years Challenge. A paper from 2015 was redone in 2021 using a Python stack. Now, in 2023, @civodul has been able to reproduce it with Guix, based on the description from 2021.

I have even pushed the fun a bit further. :-) For example, using another submission from the 10 Years Challenge (initial paper from 1998-2006, reproduced in 2022 and Guixified in 2023), I show how Continuous Integration such as GitHub Actions could be used to generate reproducible Docker images based on Guix.

I also show a small first proof of concept of what would happen in the worst case: the whole internet is down (all the servers are gone) and the only available source is the code archived in Software Heritage. Are we able to reproduce the computational environment from the files describing it in such a worst-case scenario? Using Guix, the answer is almost. :-) Using another tool... I am not aware of any attempts at such a worst-case scenario. That worst case matters because we cannot know beforehand what will still be up and what won't be.

Last, Guix has a lot of annoyances too... as its bug tracker illustrates. And as you said, the main current drawback is the lack of documentation targeting scientific communities.

> Often, we only have R/Python label versions to work from when reproducing past studies, so a table like the following showing Guix commits corresponding to major releases would be very helpful. I understand that the label version is not the complete story but in these cases we're working with limited information.
>
> | Tag | R version | Python version | Date |
> | --- | --- | --- | --- |
> | base-for-issue-62196 | 4.3 | 3.11 | 2023-04 |
> | v1.4.0 | 4.2 | 3.10 | 2022-12 |

Yes, I agree that a tool is missing. Months ago, I started a tool for indexing all the label versions that are in the Guix history... but I did not get very far. I must resume... eventually. :-)

Thanks again for mentioning Guix in the paper.

Cheers,
simon

@markziemann
Owner Author

Thanks for the feedback Simon! I look forward to adding some Guix guides on specific bioinformatics applications to protocols.io or similar in the near future.
