Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Different minimap2 alignment results of different FunAnnotate version #444

Closed
qihualiang opened this issue Jun 29, 2020 · 13 comments
Closed

Comments

@qihualiang
Copy link

qihualiang commented Jun 29, 2020

When I used FunAnnotate v1.7.2, I was able to align transcripts to genome with minimap2, log as
`[06/22/20 12:34:13]: Aligning transcript evidence to genome with minimap2
[06/22/20 12:34:13]: /home/qliang/0.soft/miniconda2/envs/funannotate/lib/python2.7/site-packages/funannotate/aux_scripts/sam2bam.sh minimap2 -ax splice -t 36 --cs -u b -G 3000 /home/qliang/results_minimap/predict_misc/genome.softmasked.fa results_minimap/predict_misc/transcripts.combined.fa 4 results_minimap/predict_misc/transcripts.minimap2.bam
[06/22/20 12:34:18]: [M::mm_idx_gen::0.4261.00] collected minimizers
[M::mm_idx_gen::0.496
2.30] sorted minimizers
[M::main::0.4962.30] loaded/built the index for 21 target sequence(s)
[M::mm_mapopt_update::0.540
2.19] mid_occ = 160
[M::mm_idx_stat] kmer size: 15; skip: 5; is_hpc: 0; #seq: 21
[M::mm_idx_stat::0.5722.12] distinct minimizers: 2337210 (92.46% are singletons); average occurrences: 1.407; average spacing: 2.929
[M::worker_pipeline::4.270
19.98] mapped 30940 sequences
[M::main] Version: 2.17-r941
[M::main] CMD: minimap2 -ax splice -t 36 --cs -u b -G 3000 /home/qliang/results_minimap/predict_misc/genome.softmasked.fa results_minimap/predict_misc/transcripts.combined.fa
[M::main] Real time: 4.290 sec; CPU: 85.338 sec; Peak RSS: 0.511 GB
[bam_sort_core] merging from 0 files and 4 in-memory blocks...

[06/22/20 12:34:25]: Found 20,273 alignments, wrote GFF3 and Augustus hints to file`

I updated FunAnnotate to v1.8.0 and repeated with same files/settings, but transcripts alignments were different with 0 alignments. Log as
[06/29/20 12:21:03]: Aligning transcript evidence to genome with minimap2 [06/29/20 12:21:03]: minimap2 -ax splice -t 36 --cs -u b -G 3000 /home/qliang/results_minimap/predict_misc/genome.softmasked.fa results_minimap/predict_misc/transcripts.combined.fa | samtools sort --reference /home/qliang/results_minimap/predict_misc/genome.softmasked.fa -@ 4 -o results_minimap/predict_misc/transcripts.minimap2.bam - [06/29/20 12:22:00]: Found 0 alignments, wrote GFF3 and Augustus hints to file

I ran the cmd in log
minimap2 -ax splice -t 36 --cs -u b -G 3000 /home/qliang/results_minimap/predict_misc/genome.softmasked.fa results_minimap/predict_misc/transcripts.combined.fa | samtools sort --reference /home/qliang/results_minimap/predict_misc/genome.softmasked.fa -@ 4 -o results_minimap/predict_misc/transcripts.minimap2.bam -
The output was
[M::mm_idx_gen::0.414*1.00] collected minimizers [M::mm_idx_gen::0.480*2.30] sorted minimizers [M::main::0.480*2.30] loaded/built the index for 21 target sequence(s) [M::mm_mapopt_update::0.520*2.20] mid_occ = 160 [M::mm_idx_stat] kmer size: 15; skip: 5; is_hpc: 0; #seq: 21 [M::mm_idx_stat::0.548*2.14] distinct minimizers: 2337210 (92.46% are singletons); average occurrences: 1.407; average spacing: 2.929 [M::worker_pipeline::3.805*22.79] mapped 30940 sequences [M::main] Version: 2.17-r941 [M::main] CMD: minimap2 -ax splice -t 36 --cs -u b -G 3000 /home/qliang/results_minimap/predict_misc/genome.softmasked.fa results_minimap/predict_misc/transcripts.combined.fa [M::main] Real time: 3.828 sec; CPU: 86.715 sec; Peak RSS: 0.573 GB [bam_sort_core] merging from 0 files and 4 in-memory blocks...

It seems like there is no explicit error of running minimap2, but it is weird that there is no alignment at all for transcripts2genome. Could you help troubleshooting this? Many thanks!

@nextgenusfs
Copy link
Owner

nextgenusfs commented Jun 29, 2020

Thanks for reporting -- likely a bug in the master code. I added the --reference to samtools sort to accommodate another bug from a user with a large genome, that must be the error I'm assuming. Is the BAM file empty (results_minimap/predict_misc/transcripts.minimap2.bam)?

@qihualiang
Copy link
Author

The BAM file is not empty, but I notice the file size of this file increase by 1byte compared to my previous run of v1.7.2

@nextgenusfs
Copy link
Owner

okay, so samtools view results_minimap/predict_misc/transcripts.minimap2.bam | wc -l yields how many alignments compared to the v1.7.2?

@nextgenusfs
Copy link
Owner

And if you wouldn't mind checking the header for both of them? samtools view -H file.bam, maybe that is where discrepancy is?

@qihualiang
Copy link
Author

BAM from v1.8.0 has 36947 lines; from v1.7.2 has 37492 lines
They have the exact same header

@nextgenusfs
Copy link
Owner

Okay -- that means its not there, its in the processing of the bam to gff3 (alignments). Its likely then a python2/3 issue -- which version of python?

@qihualiang
Copy link
Author

Python 2.7

@nextgenusfs
Copy link
Owner

Hmm, okay. Well when I test it locally I'm not seeing a problem. You can test the bam2gff3 conversion in the utility menu, ie.

funannotate util bam2gff3 -i results_minimap/predict_misc/transcripts.minimap2.bam -o test.gff3

I guess I didn't ask if the predict_misc/transcript_alignments.gff3 file is indeed empty?

@qihualiang
Copy link
Author

predict_misc/transcript_alignments.gff3 and predict_misc/transcript_minimap2.gff3 are empty.

bam2gff3 could generate non-empty gff3 results

@nextgenusfs
Copy link
Owner

So if the bam2gff3 is working then I don't know what the problem is. I would try to re-run perhaps in a clean output folder and see if you get the same behavior.

@lyl8086
Copy link

lyl8086 commented Jul 4, 2020

Hi, I have the same issue. After checking the codes in library.py, I found some potential bugs in function bam2ExonsHints.
line 1635: should be tags = cols[11:] ?
line 1664: should be cols[1] == '0' ?
line 1672: should be cols[1] == '16' ?

I'm not sure if I fixed it correctly , but it works for me.

@nextgenusfs
Copy link
Owner

nextgenusfs commented Jul 4, 2020

@lyl8086 thanks -- I think you are right, those look like leftover errors from using pybam which is now just samtools for compatibility. I appreciate the help.

@nextgenusfs
Copy link
Owner

Please reopen if after installing master branch and it is still a problem.

nextgenusfs added a commit that referenced this issue Aug 9, 2020
* python3 port

* Delete .DS_Store

* a few updates

* handle signalP 5.0 parsing -- not tested yet

* more updates to support signalp 5.0 as well as input results at command line

* bump version and fix typo

* string format error in signalp

* ensure no empty sequences converting from bam to fasta

* more py2/3 fixes -- remove pybam in favor of samtools parsing

* DBxref to Dbxref in GFF3

* push some @photocyte PR changes to py3 dev branch, untested

* small doc fix in argparse usage

* try to fix unicode error in setupDB

* fix download in test

* update download print stmnt

* fix typo in predict

* fix genemark path call

* fix typo in exonerate_pident

* try to fix typos again

* fix p2g cmd

* fix genemark logic if not in path

* more messing with genemark stupid

* fix iprscan download for docker

* remove sam2bam.sh from library map

* fix typo in bam2gff3

* add busco_seed_species local directory check before running busco

* fix another type in bam parser

* remove .lib from library dummy

* add back augustus generic to local folder

* remove sam2bam.sh calls with subprocess.Popen

* add lowercase to snap; start to move downloads to json

* load download links from file

* add status msg to download json file

* fix a few typos in the bam alignment

* fix minimap2_cmd typo

* global FNULL

* fix indent error

* troubleshooting compare

* troubleshooting compare

* troubleshooting compare again

* troubleshooting compare yet again

* troubleshooting compare yet yet again

* troubleshooting compare yet yet again; try loc

* pandas bug fix, roll back debug print stmnts

* code linting

* update menu with p2g options

* typo - identidy to identity

* support explicit number of threads for tRNAscan-SE runs

* support explicit number of threads for tRNAscan-SE runs #411

* force to string for cpu arg

* fix trnascan cpus

* update dbCAN to v8

* filter cazy subdomains out if exist

* update EVM script to try to partition in between putative gene loci

* EVM update; PASA updated options; predict simplified busco

* fix typos in setup; add interpro fix

* incorporate py2 backwards compatibility; fix to interpo mapping terms

* updates to resources and typo fix

* update pip install instructions, no longer py2 only

* update setup menu with --local option

* fix --rename option in annotate

* add io.open to setup for py2 compatibility

* updateTBL for error catching

* updateTBL for error catching

* updateTBL for error catching 2

* add ncRNA parsing in tbl2dict

* test fix for antiSMASH v5 parsing

* support for ncRNA in gff output

* dynamic output of annotation table for custom headers

* unify gb and gff3 dictionary

* fix resources COGS bug

* dont let smcogs be their own column

* write synonyms/alias to tbl file

* preference to custom annotations over automated

* make sure synonyms are not name and do not repeat

* make sure synonyms func update

* make sure synonyms func update not working, print to debug

* clean func diff logic

* clean func diff logic

* clean func working, remove debug

* fix embarassing typo for cazymes

* troubleshoot cazy subdomain reduction

* fix cazy parsing

* exit on parse error for custom

* exit on parse error for custom

* fix tbl parser for rRNA features

* fix minimap2 bam2hints function, hopefully fix #444

* add alias= parsing for gff2dict

* just some formatting changes to code

* make sure synonyms are not duplcated in gff3 parsing

* do not enumerate gene names if from alt transcripts

* do not enumerate gene names if from alt transcripts

* add debug print stmnt to naming method

* hopfully fixed, forgot a len()

* annotation table error to terminal

* annotation table error to terminal

* annotation table fix for note field

* fix gff3 parsing skip all features it cant parse

* updates to support strange genbank GFF3 files

* fix for gbk parsing to capture misc_RNA as ncRNA

* fix typo on pasa DB name

* fix pasa function in update for string

* same str change in train

* try to fix gene number count if atypical locus_tag, #453

* minor update to latest fix

* minor update to latest fix

* update logging for GO enrichment in compare

* better error handling for subprocess calls

* fix py3 incompatibility in annotationtable

* if --organism=other then make codingquarry default off unless valid --weights passed

* specify the pasa conf file in env variable

* replace . from names as well in dbname

* require distro for linux_distribution info for python 3.8 and beyond

* remove merge comments for commandline options in update

Co-authored-by: Jon Palmer <nextgenusfs@gmail.com>
Co-authored-by: Jason Stajich <jasonstajich.phd@gmail.com>
Co-authored-by: Jason Stajich <jason.stajich@ucr.edu>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants