Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bioconda #10

Merged
merged 14 commits into from
May 30, 2018
Merged

Bioconda #10

merged 14 commits into from
May 30, 2018

Conversation

tiagochst
Copy link
Contributor

@tiagochst tiagochst commented May 27, 2018

I'm trying to finish the docker file. I rebased some of the master fixes as it was required to run the workflow.

The major changes are:

  • Add ngsplot and genome files installed
  • Add run_spp.R from github.
  • Add conda-forge::gawk=4.2.1 to enviroment.yml as gawk was required by phantompeakqualtools
  • The deeptools set in the beginning of the enviroment.yml was not being installed, but if I set it in the end it is installed.
  • Changed RUN conda env create -f /environment.yml to RUN conda env update -n root --file environment.yml because of the source below:
    • If you are using the Conda package manager, avoid creating a Conda environment. Instead, update the root environment with whatever dependencies you want to install.

      As an example to the above warning, it would be best practice to run this: RUN conda env update -n root --file environment.yml and avoid this: RUN conda env create -f environment.yml

https://docs.datascience.com/en/master/appendix-1/dockerfile-basics-and-best-practices-1.html

My files are big so I could not finish the test yet, but deeptools and phantompeakqualtools are working. chippeakanno never finishes and I was not able to test ngsplot yet.

@tiagochst
Copy link
Contributor Author

My docker image is here: https://hub.docker.com/r/tiagochst/chip-seq/

Dockerfile Outdated

COPY environment.yml /
RUN conda env create -f /environment.yml && conda clean -a
RUN conda update -n base conda
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had sort of tried to avoid this in the past, as the base docker image already updates. However, I guess we don't know how stale the base container is, so it doesn't hurt 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added that because there was a warn to update it every time asking to update conda. It is not really necessary.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the more I think about it, the more I think that it's probably a good idea to have in all nf-core containers..

Dockerfile Outdated
RUN conda env create -f /environment.yml && conda clean -a
RUN conda update -n base conda
RUN conda update --all
RUN conda env update -n root --file environment.yml && conda clean -a
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is brilliant - I didn't know that you could just use the root environment like this. Much cleaner 👍

Dockerfile Outdated
RUN conda env create -f /environment.yml && conda clean -a
RUN conda update -n base conda
RUN conda update --all
RUN conda env update -n root --file environment.yml && conda clean -a
ENV PATH /opt/conda/envs/nfcore-chipseq-1.4dev/bin:$PATH
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're not using a conda environment, I guess we can probably get rid of this line? This was an alternative to source activate nfcore-chipseq-1.4dev, but if we're not using that environment then I guess everything should already be available on the PATH.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not very familiar with the conda environment. But If I understand it right we might be able to remove that line. It is better to check that before. I have a question about nexflow. Can I specify which process to reexecute (even after completed)? I removed the folder of the process in the work directory and it just restarted all processes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can I specify which process to reexecute (even after completed)? I removed the folder of the process in the work directory and it just restarted all processes.

No, it's automatic - if it thinks that downstream processes rely on a work folder that you deleted, then it will rerun them all. It does this because it can't know that the downstream results would be the same otherwise.

If I'm misunderstanding your question, then come ask on gitter where we can chat: https://gitter.im/nf-core/Lobby

Dockerfile Outdated
ENV PATH /opt/conda/envs/nfcore-chipseq-1.4dev/bin:$PATH

ENV NGSPLOT_VERSION="2.63"
RUN curl -fsSL https://github.com/shenlab-sinai/ngsplot/archive/${NGSPLOT_VERSION}.tar.gz -o /opt/ngsplot_${NGSPLOT_VERSION}.tar.gz && \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this! Whilst it works, we have previously discussed that we should try to avoid direct installations into docker images like this. Instead, we hope to package missing tools into bioconda.

The reason for this is that users who aren't using docker and singularity, but only using the bioconda environment will not be able to run the full pipeline. So if possible, for maximum portability, it's best to try to use bioconda for everything.

Your thoughts on this are welcome. Do you think it's possible to add these to bioconda? Would you be willing / able to do so..? 😉

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this package is in Bioconda: shenlab-sinai/ngsplot#74
So we had to add it there. It makes sense to port everything to bioconda. I'll check what we can do.

But still, the ngsplot genomes needs to be installed (they are available in google drive). That was one of the problems I had before. I'm not sure that is possible with Bioconda.

What if we added a script and run it from docker? And users would be able to execute it locally?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But still, the ngsplot genomes needs to be installed (they are available in google drive).

Ah yeah, I'd forgotten about this. No they can't (or rather, shouldn't) be added to bioconda. I'd argue that they shouldn't really be in the docker container either. It works, but it makes the container huge and is bloated for anyone not using these reference genomes.

I think this falls into the same category as all references - can we not just ask the user to fetch them and specify their location in the same way as alignment genome references?

We could also look into adding them to AWS-iGenomes or somthing. We have also been discussing an upcoming project to create a bioconda-like resource for large genome references such as this which would be the best solution (see https://gitter.im/nf-core/Lobby?at=5afd44f02df44c2d0634e430)

Phil

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@apeltzer / @chuan-wang - any thoughts or previous discussion points that I've missed here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry @tiagochst, I missed a bit..

What if we added a script and run it from docker? And users would be able to execute it locally?

Yeah, I guess.... I would consider this a last resort 😅

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry still on vacation and thus only iPad typing available. I agree that we shouldn’t package this in bio conda and keep it external of the Container for the reasons you mentioned.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@apeltzer - you mean the references yeah? And package the tool scripts in bioconda?

It could also be worth bringing this topic up in the bioconda community to see what they have to say.. https://gitter.im/bioconda/Lobby

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, just the references . I can ask once I’m back - already made a todo

docs/usage.md Outdated
@@ -213,7 +252,7 @@ projects or different sets of reference genomes.
Note - you can use this to override defaults. For example, we run on UPPMAX but don't want to use the MultiQC
environment module as is the default. So we specify a config file using `-c` that contains the following:

```groovy
```nextflow
process.$multiqc.module = []
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ugh, this example is super outdated and should be removed.

We need to overhaul this whole part of the "standard" documentation really. I'm hoping to write a helper tool to work with this soon (basically using the cookiecutter template and doing a git diff)

@ewels
Copy link
Member

ewels commented May 28, 2018

Some messy git history stuff going on here. I just updated bioconda from master which will hopefully help a little. Apologies if I made life difficult!

@ewels
Copy link
Member

ewels commented May 30, 2018

Hi @tiagochst,

So your pull request generated quite a bit of discussion! 😄 (cf nf-core/cookiecutter#29)

I think that we came to the consensus that it's better to leave the Dockerfiles as they are with the pseudo-environments.

For the conda update we can remove this I think. If we start getting warnings about it being out of date, we can just trigger a new build on the upstream nfcore/base container (I just did this now).

Finally, there's the manual installations of the tools in the Dockerfile. Whilst this is great in that it makes the container work with the pipeline, we agree that we would prefer to package these tools in bioconda instead and use these.

Would it be possible for you to revert these changes in the Dockerfile but keep the other things in your pull request (bug fixes, new packages etc). Then we can merge here and open up a new PR when the bioconda packaging is done.

Thanks for your hard work on this! Let me know if you have comments / questions.

Phil

@tiagochst
Copy link
Contributor Author

No problem. I'm just wondering how you will do for installing the genome db for ngsplot. phantompeakqualtools shouldn't be a problem. We just need to keep in mind that the run_spp.R has a line missing (library(caTools)) to make it work.

@ewels
Copy link
Member

ewels commented May 30, 2018

Thanks! Yes, we need to discuss this with the bioconda team I think.

@ewels ewels merged commit b7711fd into nf-core:bioconda May 30, 2018
@ewels ewels mentioned this pull request May 30, 2018
7 tasks
jherrero pushed a commit to UCL-BLIC/chipseq that referenced this pull request Apr 11, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants