Problem with Evidencemodeler #528

Closed
LemoAlex opened this issue Jan 11, 2021 · 12 comments

@LemoAlex

Hello funannotate users,

I am currently using funannotate v1.8.4, installed through Docker, and funannotate check and the test run both work without issues.

I am trying to run funannotate predict on a fish genome assembly.

So, when I run:

funannotate-docker predict -i ~softmasked.genome.fasta -o ./output1 -s "Species name" --transcript_evidence Transcriptome.fasta --optimize_augustus --other_gff /home/alexandre/funannotate/Species.transdecoder.gff3 --protein_evidence uniprot.reviewed.fasta uniprot-reviewed.fasta --organism other --rna_bam ~/funannotate/alignment.bam --weights codingquarry:1 --cpus 4

Everything runs smoothly until the EvidenceModeler step. Then I get this message:

funannotate-EVM.log
EVM: partitioning input to ~ 35 genes per partition
Traceback (most recent call last):
  File "/venv/lib/python3.7/site-packages/funannotate/aux_scripts/funannotate-runEVM.py", line 433, in <module>
    partitions=args.no_partitions)
  File "/venv/lib/python3.7/site-packages/funannotate/aux_scripts/funannotate-runEVM.py", line 203, in create_partitions
    k, len(SeqRecords[k])))
  File "/venv/lib/python3.7/site-packages/Bio/File.py", line 248, in __getitem__
    record = self._proxy.get(self._offsets[key])
KeyError: 'scaffold_1'
[Jan 11 08:24 AM]: Evidence modeler has failed, exiting
Traceback (most recent call last):
  File "/venv/bin/funannotate", line 713, in <module>
    main()
  File "/venv/bin/funannotate", line 703, in main
    mod.main(arguments)
  File "/venv/lib/python3.7/site-packages/funannotate/predict.py", line 1730, in main
    os.remove(EVM_out)
FileNotFoundError: [Errno 2] No such file or directory: '~/output1/predict_misc/evm.round1.gff3'

The EVM logfile (attached) does not show any error, so I am a bit confused about what is going on here.

Thanks for the help,
Best,
Alexandre
funannotate-EVM.log

@nextgenusfs
Owner

This line from the EVM logfile seems odd:

[01/11/21 08:23:44]: 9,557 total contigs; skipping -51,760 contigs with no genes

Do you have the predict logfile that I could look at as well?

@LemoAlex
Author

Yes, here it is attached.
funannotate-predict.log

Thanks for your help.

Alexandre

@nextgenusfs
Owner

Hmm, okay, thanks. I can't quite tell, but it looks like the --species argument may not be getting passed properly on the command line. If you look at the command that is printed in the log file:

/venv/bin/funannotate predict -i /home/alexandre/funannotate/fish.masked.fa -o ./output1 -s Species name--transcript_evidence /home/alexandre/funannotate/Alignment/Tran.fa --optimize_augustus --other_gff /home/alexandre/funannotate/Tran.fa.transdecoder.gff3 --protein_evidence uniprot-catfish-reviewed.fasta uniprot-zebrafish-reviewed.fasta --organism other --rna_bam /home/alexandre/funannotate/sorted.bam --weights codingquarry:1 --cpus 4

I don't know how that would necessarily cause problems with EVM per se, but it seems like it may just be a typo? In your initial command above there is clearly a space.

-s Species name--transcript_evidence /home/alexandre/funannotate/Alignment/Tran.fa
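
One possible explanation for the missing quotes in the logged command (an assumption, not confirmed from the funannotate source): if a program logs its arguments by joining them with spaces, a quoted multi-word value such as "Species name" is printed without its quotes even though it arrived as a single argument. A minimal Python sketch, where `argv` is a made-up argument list:

```python
import shlex

# Hypothetical argument list, as a program would receive it after the shell
# has already processed the quotes around "Species name".
argv = ["predict", "-s", "Species name", "--transcript_evidence", "Tran.fa"]

# Naive logging: the quotes are gone, so the species looks like two arguments.
naive = " ".join(argv)

# Quote-preserving logging: each argument is re-quoted only where needed.
quoted = " ".join(shlex.quote(arg) for arg in argv)

print(naive)
print(quoted)
```

This would account for the lost quotes in the log, but not for the missing space before --transcript_evidence.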

Assuming the above is unrelated to the error, you can try running the EVM command from that same directory; that may print more information to stdout, i.e.:

funannotate-docker /venv/bin/python /venv/lib/python3.7/site-packages/funannotate/aux_scripts/funannotate-runEVM.py -w /home/alexandre/funannotate/output1/predict_misc/weights.evm.txt -c 4 -g /home/alexandre/funannotate/output1/predict_misc/gene_predictions.gff3 -d /home/alexandre/funannotate/output1/predict_misc/EVM -f /home/alexandre/funannotate/output1/predict_misc/genome.softmasked.fa -l ./output1/logfiles/funannotate-EVM.log -m 10 -o /home/alexandre/funannotate/output1/predict_misc/evm.round1.gff3 --EVM_HOME /venv/opt/evidencemodeler-1.1.1 -p /home/alexandre/funannotate/output1/predict_misc/protein_alignments.gff3 -t /home/alexandre/funannotate/output1/predict_misc/transcript_alignments.gff3

@nextgenusfs
Owner

Actually, that will probably fail because of what I have in the bash script. You can create a new bash wrapper like this that just runs the image (it is the same script, it just doesn't include the call to funannotate):

#!/usr/bin/env bash

realpath() {
  OURPWD=$PWD
  cd "$(dirname "$1")"
  LINK=$(readlink "$(basename "$1")")
  while [ "$LINK" ]; do
    cd "$(dirname "$LINK")"
    LINK=$(readlink "$(basename "$1")")
  done
  REALPATH="$PWD/$(basename "$1")"
  cd "$OURPWD"
  echo "$REALPATH"
}

timezone() {
    if [ "$(uname)" == "Darwin" ]; then
        TZ=$(readlink /etc/localtime | sed 's#/var/db/timezone/zoneinfo/##')
    else
        TZ=$(readlink /etc/timezone)
    fi
    echo $TZ
}

# Only allocate tty if one is detected. See - https://stackoverflow.com/questions/911168
IT=()
if [[ -t 0 ]]; then IT+=(-i); fi
if [[ -t 1 ]]; then IT+=(-t); fi

# Run the container as the invoking user (uid:gid)
USER="$(id -u "$(logname)"):$(id -g "$(logname)")"
# Bind-mount the current directory at the same path inside the container
WORKDIR="$(realpath .)"
MOUNT="type=bind,source=${WORKDIR},target=${WORKDIR}"
TZ="$(timezone)"

# Pass all arguments straight to the image (no implicit call to funannotate)
exec docker run --rm "${IT[@]}" --user "${USER}" -e TZ="${TZ}" --workdir "${WORKDIR}" --mount "${MOUNT}" nextgenusfs/funannotate:latest "$@"

@nextgenusfs
Owner

nextgenusfs commented Jan 11, 2021

Here is a generalized version of this bash script -- you could run with any docker container: https://github.com/nextgenusfs/dw/

@LemoAlex
Author

Hello again,

Thanks for the answers.
I tried removing the spaces in the species name, but I still get the same error.

I also tried running the EVM step using the bash script through dw, but again I get exactly the same output as when running the whole pipeline. I also get (as I did before) a single file called genes.1.bed in the predict_misc/EVM folder. It feels like EVM can't get past the first scaffold. Could this be possible?

Thanks,
Alexandre

@nextgenusfs
Owner

nextgenusfs commented Jan 11, 2021

I suppose it could be running out of RAM. Can you increase the RAM allocated to docker?

Never mind, I saw in your log file that it is already 264 GB.

When you call this, are all of the files you are passing to the docker container located in the same run directory?

Another thing to try would be to start the docker image interactively and then run the EVM workflow from inside it, i.e. docker run -it -v {need to mount filesystem folders} nextgenusfs/funannotate /bin/bash

And then lastly, I assume the test dataset runs on your system?

funannotate-docker test -t rna-seq --cpus XX

@nextgenusfs
Owner

One other thing to try would be to delete all of the EVM temp files and then add --no-evm-partitions to your predict command (I just realized it's not in the help menu). This runs the partitioning differently, which may help if the partitioning is what is causing EVM to die.

@nextgenusfs
Owner

But going back to my original thought about the EVM log file, this line seems strange:

[01/11/21 08:23:44]: 9,557 total contigs; skipping -51,760 contigs with no genes

What is happening in the code is this:

    # sort the results by contig and position
    ChrGeneCounts = {}
    sortedResults = natsorted(Results, key=lambda x: (x[0], x[1]))
    with open(bedGenes, 'w') as outfile:
        for x in sortedResults:
            outfile.write('{}\t{}\t{}\t{}\t{}\t{}\n'.format(x[0], x[1], x[2],
                                                            x[3], x[4], x[5]))
            if not x[0] in ChrGeneCounts:
                ChrGeneCounts[x[0]] = 1
            else:
                ChrGeneCounts[x[0]] += 1
    ChrNoGenes = len(SeqRecords) - len(ChrGeneCounts)
    lib.log.debug('{:,} total contigs; skipping {:,} contigs with no genes'.format(len(SeqRecords), ChrNoGenes))

This suggests something is wrong with the input files (something I've not seen before): it is reporting a negative number of contigs without genes, which means the gene predictions reference >50k more sequence names than the genome contains.
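
To see how that count can go negative, here is a minimal sketch of the same subtraction using made-up sequence names. The variables `genome_ids` and `gene_rows` and the TRINITY-style IDs are illustrative only; they are not taken from funannotate or from the files in this issue.

```python
# Hypothetical genome with two scaffolds.
genome_ids = {"scaffold_1", "scaffold_2"}

# Hypothetical gene rows whose first column is mostly NOT a genome scaffold
# name (for example, transcript IDs from a GFF built against a transcriptome).
gene_rows = [
    ("TRINITY_DN1_c0_g1_i1", 10, 500),
    ("TRINITY_DN2_c0_g1_i1", 20, 800),
    ("TRINITY_DN3_c0_g1_i1", 5, 300),
    ("scaffold_2", 100, 900),
]

# Count genes per sequence name, as in the snippet above.
gene_counts = {}
for seq_id, start, end in gene_rows:
    gene_counts[seq_id] = gene_counts.get(seq_id, 0) + 1

# Same subtraction as ChrNoGenes: it assumes every key is a genome contig,
# so foreign names push the result below zero.
no_genes = len(genome_ids) - len(gene_counts)
print(no_genes)  # -2

# Names that carry genes but are missing from the genome.
unknown = sorted(set(gene_counts) - genome_ids)
print(unknown)
```

Applied to the numbers in the log, 9,557 - N = -51,760 gives N = 61,317 distinct sequence names in the gene predictions, far more than the 9,557 contigs in the genome.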

This suggests that something is wrong with the headers on one of these input files -- can you validate that the input files have appropriate FASTA/Sequence headers? For example, the custom GFF that you are passing do they match the genome FASTA headers? And the BAM file as well, do the headers match?
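
One way to run this check is a small stdlib-only Python sketch that collects the sequence names from each input and lists names that do not occur in the genome. All function names and example records below are hypothetical, and the BAM check assumes the header has first been exported as text (for example with `samtools view -H`).

```python
def fasta_ids(lines):
    """Return the sequence names (first word after '>') from FASTA lines."""
    return {
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and line[1:].strip()
    }


def gff_seqids(lines):
    """Return the column-1 sequence names from GFF3 lines."""
    ids = set()
    for line in lines:
        if line.startswith("##FASTA"):
            break  # embedded sequence section, no more features
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        ids.add(line.rstrip("\n").split("\t", 1)[0])
    return ids


def sam_header_seqids(lines):
    """Return the SN: names from @SQ lines of a SAM/BAM text header."""
    ids = set()
    for line in lines:
        if line.startswith("@SQ"):
            for field in line.rstrip("\n").split("\t")[1:]:
                if field.startswith("SN:"):
                    ids.add(field[3:])
    return ids


def missing_from_genome(genome, other):
    """Names used by another input that the genome does not contain."""
    return sorted(other - genome)


# Tiny in-memory example; real files can be passed as open file handles.
genome = fasta_ids([">scaffold_1 len=1000\n", "ACGT\n", ">scaffold_2\n", "GGCC\n"])
gff = gff_seqids([
    "##gff-version 3\n",
    "TRINITY_DN1_c0_g1_i1\ttransdecoder\tgene\t1\t900\t.\t+\t.\tID=g1\n",
    "scaffold_2\ttransdecoder\tgene\t1\t4\t.\t+\t.\tID=g2\n",
])
bam = sam_header_seqids([
    "@HD\tVN:1.6\tSO:coordinate\n",
    "@SQ\tSN:scaffold_1\tLN:1000\n",
    "@SQ\tSN:scaffold_2\tLN:4\n",
])

print(missing_from_genome(genome, gff))
print(missing_from_genome(genome, bam))
```

For real inputs, something like `fasta_ids(open("genome.fa"))` and `gff_seqids(open("other.gff3"))` should work, since open text files iterate line by line. Any name reported as missing points to a file that was not built against this genome.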

@LemoAlex
Author

For example, the custom GFF that you are passing do they match the genome FASTA headers?

Ok, maybe the problem is there! My GFF file comes from Transdecoder, but I used the transcriptome as an input. So obviously, the transcriptome and the genome don't have the same headers. Could the problem come from there? What could I use as an alternative then?

Thanks,

Alexandre

@nextgenusfs
Owner

So if the transcripts aren't aligned to the genome reference, then the GFF shouldn't be passed to --other_gff. If you have transcripts from Transdecoder that you want to align, you can pass those in FASTA format to --transcript_evidence; this option accepts multiple space-delimited inputs.

Maybe it's not obvious, but the pipeline might work a lot better if you let funannotate train run Trinity/PASA/transdecoder. That way those tools get run in a way that funannotate knows the format....
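
As a rough illustration of that change, the sketch below assembles the predict command from the first post as an argument list, dropping --other_gff and passing a second FASTA to --transcript_evidence. The helper `build_predict_command` and the file name `transdecoder_transcripts.fasta` are hypothetical; the flags are the ones already used in this thread.

```python
import shlex


def build_predict_command(genome, out_dir, species, transcript_fastas,
                          protein_fastas, rna_bam, cpus):
    """Build a predict argument list that does not use --other_gff."""
    return [
        "funannotate-docker", "predict",
        "-i", genome,
        "-o", out_dir,
        "-s", species,
        # Several FASTA files may follow the flag, separated by spaces.
        "--transcript_evidence", *transcript_fastas,
        "--optimize_augustus",
        "--protein_evidence", *protein_fastas,
        "--organism", "other",
        "--rna_bam", rna_bam,
        "--weights", "codingquarry:1",
        "--cpus", str(cpus),
    ]


cmd = build_predict_command(
    genome="softmasked.genome.fasta",
    out_dir="./output1",
    species="Species name",
    # The second file name is a placeholder for a Transdecoder FASTA.
    transcript_fastas=["Transcriptome.fasta", "transdecoder_transcripts.fasta"],
    protein_fastas=["uniprot.reviewed.fasta"],
    rna_bam="alignment.bam",
    cpus=4,
)

# Print a copy-pasteable command line with the species name kept quoted.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Keeping the arguments in a list means the two-word species name stays a single argument if the command is later run with `subprocess.run(cmd, check=True)`.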

nextgenusfs pushed a commit that referenced this issue Jan 12, 2021
@LemoAlex
Author

Hi,

Sorry for the long delay. Just to let you know that I ran it as you suggested and I was able to finish the whole pipeline successfully, so thank you!

Best,
Alexandre
