eggNOG_annotation takes very long #351

Closed
SilasK opened this issue Jan 8, 2021 · 17 comments


SilasK commented Jan 8, 2021

    1. This genecatalog step is taking very long this time :-) (10 days so far for 10 samples). I know it is the most time-consuming step, but is there a way to run it more efficiently on a single machine? I only get 3 days of analysis time per job, and the run aborts whenever I exceed that limit; I hope it restarts from the last subset? You mention something about a virtual disk in the config, but I don't know exactly how to use it. Can I temporarily use my scratch space? Do I set the first line to 'true' and then give the path to my scratch in the second line, or does it not work like that?
eggNOG_use_virtual_disk: false   # copying the eggNOG DB to a virtual disk can speed up the annotation
virtual_disk: /dev/shm           # but you need 37G extra RAM

Below is how the rule appears in the log. Do you know why mem is only 20, even though I specified 160 in the config?

[Sun Jan 3 13:44:22 2021]
rule eggNOG_annotation:
input: /ddn1/vol1/site_scratch/leuven/314/vsc31426/db/atlas/EggNOGV2/eggnog.db, /ddn1/vol1/site_scratch/leuven/314/vsc31426/db/atlas/EggNOGV2/eggnog_proteins.dmnd, /ddn1/vol1/site_scratch/leuven/314/vsc31426/db/atlas/EggNOGV2/checksum_checked, Genecatalog/subsets/genes/subset3.emapper.seed_orthologs
output: Genecatalog/subsets/genes/subset3.emapper.annotations
log: Genecatalog/subsets/genes/logs/subset3/eggNOG_annotate_hits_table.log
jobid: 1075
wildcards: folder=Genecatalog/subsets/genes, prefix=subset3
threads: 36
resources: mem=20, time=5


SilasK commented Jan 8, 2021

Hey @Sofie8, I posted this question here as it might also help others.

So yes, you can use a virtual disk for the eggNOG-mapper step. Scratch should be fine, but the default /dev/shm is even better if it exists on your server.

If you set

eggNOG_use_virtual_disk: true

the eggNOG database is copied to the virtual disk, which speeds up access.
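
For reference, a minimal sketch of what this does under the hood (paths are examples; where eggnog.db lives and the ~37G of free space on /dev/shm are assumptions about your setup):

    # copy the eggNOG sqlite database onto the RAM disk (or a fast scratch dir) ...
    EGGNOG_DIR=/path/to/databases/EggNOGV2   # where eggnog.db was downloaded
    VIRTUAL_DISK=/dev/shm                    # needs ~37G of free space

    cp "$EGGNOG_DIR/eggnog.db" "$VIRTUAL_DISK/eggnog.db"

    # ... and point emapper.py at it for the annotation step
    emapper.py --annotate_hits_table subset1.emapper.seed_orthologs \
        --no_file_comments --override -o subset1 --cpu 36 \
        --data_dir "$VIRTUAL_DISK"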


SilasK commented Jan 8, 2021

Why does it use only 20 GB of RAM? Because I hardcoded the memory in the rule, which is not good. I will try to fix this.


SilasK commented Jan 14, 2021

Response to eggnogdb/eggnog-mapper#249 (comment).
Sorry @Sofie8 that it takes so long.

What you can do:

  1. Check the filter options for the gene catalog.
  2. There is an option in atlas to split the gene catalog into smaller subsets and annotate them separately. (Yes, this means you would have to restart.)
  3. If the annotate_hits_table step takes so long, are you using the virtual-disk (RAM-disk) option we discussed above? It really improves the speed of this step.
  4. See the code I use to combine the annotations produced by several eggNOG-mapper runs: rule combine_egg_nogg_annotations (a rough manual equivalent is sketched below).
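
For completeness, the actual rule reads the per-subset tables with pandas and adds a header; a rough manual equivalent in shell, with example paths, would be:

    # concatenate the per-subset annotation tables (run with --no_file_comments,
    # so they carry no header) and compress the result; atlas additionally adds
    # the column names when it builds Genecatalog/annotations/eggNog.tsv.gz
    cat Genecatalog/subsets/genes/subset*.emapper.annotations \
        | grep -v '^#' \
        | gzip > Genecatalog/annotations/eggNog_headerless.tsv.gz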


Sofie8 commented Jan 14, 2021

Hi Silas,

Thanks for your fast reply!

  1. I will check the filter options. I am running the annotation on contigs, not a single genome, so it can take more time.
  2. Split the gene catalog into smaller subsets: OK, will do.
  3. Virtual memory: I think we additionally need to specify --usemem for this to take effect? See below.
  4. OK, how can I run this rule separately, and where is it saved?

rule eggNOG_annotation:
    input:
        eggnog_db_files=get_eggnog_db_file(),
        seed=rules.eggNOG_homology_search.output
    output:
        temp("{folder}/{prefix}.emapper.annotations")
    params:
        data_dir=config['virtual_disk'] if config['eggNOG_use_virtual_disk'] else EGGNOG_DIR,
        prefix="{folder}/{prefix}",
        copyto_shm="t" if config['eggNOG_use_virtual_disk'] else 'f'
    threads:
        config.get("threads", 1)
    resources:
        mem=20
    shadow:
        "minimal"
    conda:
        "%s/eggNOG.yaml" % CONDAENV
    log:
        "{folder}/logs/{prefix}/eggNOG_annotate_hits_table.log"
    shell:
        """
        if [ {params.copyto_shm} == "t" ]; then
            cp {EGGNOG_DIR}/eggnog.db {params.data_dir}/eggnog.db 2> {log}
        fi

        emapper.py --annotate_hits_table {input.seed} --no_file_comments \
            --override -o {params.prefix} --cpu {threads} --data_dir {params.data_dir} 2>> {log}
        """

- The line mem=20 I will change to config["mem"], or perhaps config["mem"] - 40 (or whatever it takes to load the db in memory).
- --data_dir: is it pointing to EGGNOG_DIR or to the virtual disk path? So the virtual disk path is the same as EGGNOG_DIR, no? Because I see it downloaded an exact duplicate of the db into my other virtual disk path.
- --data_dir: I think we additionally need to specify --usemem (if a local hmmpressed database is provided as target using --db, this flag allocates the whole database in memory using hmmpgmd; the database is unloaded after execution).
- --database option: specifies the target database for sequence searches; here we can choose euk, bact, arch, ... (I don't know what the default is, all?).
- --qtype: (hmm, seq, MMseqs?)
- The implemented version is emapper-2.0.1; perhaps we need to update to a newer version to be able to use the mmseqs option?

Sofie


SilasK commented Jan 14, 2021

  1. By default we take all genes from contigs, but also a lot of partial genes.
  2. I followed the instructions here: I copy eggnog.db onto a disk that is in memory, so I guess the --usemem flag is not necessary. But I could use it if eggNOG_use_virtual_disk: false. Did you use this option?
  3. Don't use the rule; use the code if you want to hack it yourself. Otherwise, run the atlas gene catalog with smaller subsets from the beginning. If you use eggNOG_use_virtual_disk it should be quite fast.

I don't think the mmseqs version has been officially released yet?


Sofie8 commented Feb 1, 2021

Hi Silas,

It finally finished the annotate_hits_table step. I have several partial subsets now (next time I will specify smaller subsets from the beginning).

Now I did:

# run emapper.py annotate_hits_table
source activate /ddn1/vol1/site_scratch/leuven/314/vsc31426/db/atlas/conda_envs/13ec4a8d

emapper.py --annotate_hits_table Genecatalog/subsets/genes/remaining3_seed_orthologs \
--no_file_comments --resume -o Genecatalog/subsets/genes/subset2 --cpu 36 --usemem --database /ddn1/vol1/site_scratch/leuven/314/vsc31426/db/atlas/eggnogSofie \
--data_dir /ddn1/vol1/site_scratch/leuven/314/vsc31426/db/atlas/EggNOGV2 2>> Genecatalog/subsets/genes/logs/subset1/eggNOG_annotate_hits_table.log

# filter out the seeds that are already annotated, keeping only the remaining ones
cd /ddn1/vol1/site_scratch/leuven/314/vsc31426/NGS_oct/StiemerOct/Genecatalog/subsets/genes
join -v 1 -t $'\t' <(grep -v "^#" remaining2_seed_orthologs | sort) <(grep -v "^#" subset2.emapper.annotations | sort) | cut -f 1-4 > remaining3_seed_orthologs

Which gives me partial outputs like:
subset21.emapper.annotations
subset22.emapper.annotations
subset23.emapper.annotations
==> these need to be combined into subset2.emapper.annotations (I can use cat, but it does something odd with the header lines, or with the lines where the files meet; see the sketch below).
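
Something along these lines should do the concatenation (a sketch; with --no_file_comments there should be no header lines, and the grep drops any stray comment lines):

    # merge the partial --resume outputs into one table, writing to a temporary
    # name first so nothing gets clobbered
    cat subset21.emapper.annotations subset22.emapper.annotations subset23.emapper.annotations \
        | grep -v '^#' > subset2.emapper.annotations.merged
    mv subset2.emapper.annotations.merged subset2.emapper.annotations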

  1. I ran the rule combine_egg_nogg_annotations, but it does not see subset21, because it directly combines subset1.emapper.annotations, subset2.emapper.annotations, etc. up to subset4, and it also exits with this error:
[Sun Jan 31 03:20:22 2021]
localrule combine_egg_nogg_annotations:
    input: Genecatalog/subsets/genes/subset4.emapper.annotations, Genecatalog/subsets/genes/subset1.emapper.annotations, Genecatalog/subsets/genes/subset2.emapper.annotations, Genecatalog/subsets/genes/subset3.emapper.annotations
    output: Genecatalog/annotations/eggNog.tsv.gz
    jobid: 335
    resources: mem=150, time=5

Job counts:
        count   jobs
        1       combine_egg_nogg_annotations
        1
[Sun Jan 31 03:20:40 2021]
Error in rule combine_egg_nogg_annotations:
    jobid: 334
    output: Genecatalog/annotations/eggNog.tsv.gz

RuleException:
ParserError in line 592 of /vsc-hard-mounts/leuven-data/314/vsc31426/miniconda3/envs/atlas206/lib/python3.6/site-packages/atlas/rules/genecatalog.snakefile:
Error tokenizing data. C error: Expected 22 fields in line 23052, saw 28

I will share the files with you via Drive, because they probably contain an error at the point where the files were merged: a header line, or extra fields on one line?

Thanks,
Sofie


SilasK commented Feb 1, 2021

In theory you can just cat them. The header gets added during combine_egg_nogg_annotations.

Unfortunately, in subset3, and maybe in the others as well, there are some malformed lines that cause the error in atlas.

E.g. in line 23052 of subset3 you have an unfinished line combined with the next line.
GO:1901360,GO:1901363,GO:Gene0021212',

I don't know a way to make sure that you got all annotations. I fear you would need to rerun the annotation step for subset3. The others are fine.
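
A quick way to at least locate such broken lines (a sketch; the expected column count of 22 comes from the pandas error above):

    # print the line number and field count of every row that does not have
    # exactly 22 tab-separated fields
    awk -F'\t' 'NF != 22 {print FNR": "NF" fields"}' subset3.emapper.annotations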

I'm sorry you had to take this detour and run eggNOG-mapper yourself. But if I understand correctly, everything could have been done inside atlas, couldn't it?

atlas lets you define smaller subsets and also use the RAM disk to speed up the process.


Sofie8 commented Feb 4, 2021

Ok, I deleted that one line for now, and I could merge the others.

atlas lets you define smaller subsets and also use the RAM disk to speed up the process.
Yes, with subsets of 500,000 genes it runs for 3 days to about 90% completion and then deletes everything. So I will try 100,000 or less. It might still finish subsets 1, 2, and 3 fine, get halfway through the 4th, and then delete everything from subset 4, but at least I would not lose all the computing credits and would only have to restart subset 4 from scratch.

Let me know when you have a new release that also fixes this memory usage. But you don't need to release a new version for this alone, because I changed it in the rule for now. Or do I need to change it somewhere else, and if so, where? :-)
~/miniconda3/envs/atlas206/lib/python3.6/site-packages/atlas/rules/genecatalog.snakefile

Thanks,
Sofie


SilasK commented Feb 4, 2021

I scanned the file and found other problematic lines, but if you managed to merge it, maybe it's OK.

What could be optimized?


Sofie8 commented Feb 4, 2021

Yes, in the end it was two lines in my subset 3, I think, that threw the error; the others passed without errors.

What could be optimized? ==> The hardcoded 20 GB mem; I changed it to read the memory from the config file.
Beyond that, no really good solutions. Perhaps prevent emapper.py --annotate_hits_table from deleting its output file, but then if the wall time is exceeded and it writes an incomplete line, you get the hassle I had above.
Cluster submission is a solution, but qsub from within a qsub-submitted PBS script is not allowed for us. So now I 'schedule' myself: I run atlas assembly and atlas genome on the normal-memory nodes and, once those complete, I submit atlas genecatalog to the bigmem partition, one after the other.


SilasK commented Feb 5, 2021

@Sofie8 Thank you for your suggestions.

  • The hardcoded mem is changed.
  • I should write the annotation output in a shadow directory so it does not get overwritten.
  • eggnog-mapper using MMseqs2 is now released; I should have a look.
  • Filtering genes is still very slow.

For your cluster problem, try this: https://snakemake.readthedocs.io/en/stable/project_info/faq.html#how-can-i-run-snakemake-on-a-cluster-where-its-main-process-is-not-allowed-to-run-on-the-head-node
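
In short, that FAQ suggests running the main Snakemake/atlas process inside a job itself rather than on the head node. A minimal sketch of such a wrapper (PBS syntax; walltime, environment name, and the cluster profile are placeholders, and this only helps if jobs on your system are allowed to submit further jobs):

    #!/bin/bash
    #PBS -N atlas_main
    #PBS -l walltime=72:00:00
    #PBS -l nodes=1:ppn=1

    cd "$PBS_O_WORKDIR"
    source activate atlas206              # conda environment with atlas installed (placeholder name)
    # the long-lived main process now runs on a compute node and dispatches
    # the individual rules from there via the cluster profile
    atlas run genecatalog --profile cluster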

SilasK reopened this Feb 5, 2021

MalbertR commented Feb 5, 2021

So, when the pipeline is running other tools I see something like this:
submit command: sbatch --parsable --job-name=bam_2_sam_contigs -n 4 --mem=10g --time=30 -N 1 /hpc/dla_mm/mrogers/metagenomics/Roos/fecal/metagenome-atlas/.snakemake/tmp.xschlvv0/snakejob.bam_2_sam_contigs.431.sh

Whereas when running eggNOG-mapper, this is what I see:

 /hpc/dla_mm/mrogers/metagenomics/atlas_db/conda_envs/fecb35ac/bin/diamond /hpc/dla_mm/mrogers/metagenomics/atlas_db/conda_envs/fecb35ac/lib/python2.7/site-packages
#  emapper-2.0.1
# ./emapper.py  -m diamond --no_annot --no_file_comments --data_dir /hpc/dla_mm/mrogers/metagenomics/atlas_db/EggNOGV2 --cpu 1 -i Genecatalog/subsets/genes/subset2.faa -o Genecatalog/subsets/genes/subset2 --override
  /hpc/dla_mm/mrogers/metagenomics/atlas_db/conda_envs/fecb35ac/bin/diamond blastp -d /hpc/dla_mm/mrogers/metagenomics/atlas_db/EggNOGV2/eggnog_proteins.dmnd -q /hpc/dla_mm/mrogers/metagenomics/Roos/fecal/metagenome-atlas/Genecatalog/subsets/genes/subset2.faa --more-sensitive --threads 1 -e 0.001000 -o /hpc/dla_mm/mrogers/metagenomics/Roos/fecal/metagenome-atlas/.snakemake/shadow/tmpq8x2y2_6/emappertmp_dmdn_eWFYon/e02ee5ee476241f69a07767e97bc0acc --top 3 --query-cover 0 --subject-cover 0

No sbatch anywhere. It also does not produce a Slurm output file, like I see for the rest of the tools (that file is not created only when the job finishes, right? It should be created as soon as the job starts).


SilasK commented Feb 5, 2021

@MalbertR I get your point. It seems eggNOG_homology_search is not submitted to the cluster, and I don't know why. You don't happen to have an old version of atlas, do you?


SilasK commented Feb 5, 2021

Looking at the code, there is no reason why this rule (eggNOG_homology_search) should be executed locally and not the others.


MalbertR commented Feb 5, 2021

@MalbertR I get your point. It seems eggNOG_homology_search is not submitted to the cluster, and I don't know why. You don't happen to have an old version of atlas, do you?

I will check my version and update to the latest (just in case), then re-run and come back to you.
Thanks!


MalbertR commented Feb 8, 2021

Looking at the code, there is no reason why this rule (eggNOG_homology_search) should be executed locally and not the others.

Hi Silas,

The issue seems to be solved now. At first I thought it was thanks to the updated version of the pipeline, but then I looked for differences between how I ran the pipeline before and this last time, and realized that I hadn't been adding the cluster profile before. So that was probably the issue... Oops!
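
For anyone who hits the same thing, the difference comes down to whether a cluster profile is passed on the command line (profile name and setup as in the atlas cluster documentation; exact flags may vary between versions):

    # without a profile, every rule runs locally on the node where atlas itself runs
    atlas run genecatalog

    # with a cluster profile, rules are dispatched via sbatch/qsub, giving the
    # "submit command: sbatch ..." lines shown above
    atlas run genecatalog --profile cluster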


SilasK commented Feb 8, 2021

@MalbertR No problem. By the way, have a look at the discussion in this thread (#351) for getting optimal results with the atlas gene catalog.
