Running with '-m' still predicts genes across regions of N #30

Closed
johnne opened this issue Sep 7, 2017 · 9 comments

Comments

@johnne

johnne commented Sep 7, 2017

I'm running prodigal (Prodigal V2.6.3: February, 2016) on contigs in which some regions are masked with 'N' where infernal cmscan has predicted non-coding RNAs. Although I run prodigal with the '-m' option ("Treat runs of N as masked sequence; don't build genes across them."), the output has genes predicted across those regions, with protein sequences translated into stretches of 'X'. I saw that the stretches of 'N' need to be at least 50 characters, and they all are, but nevertheless it doesn't seem to act as a mask.

I know the '-m' option was used originally by JGI for masking, but is it no longer implemented in prodigal? Or am I using it wrong?

Sincerely,
John

@tseemann
Contributor

@johnne can you provide the FASTA file and the parameters you used? I can check the code, as I use prodigal a lot and want to make sure it is working correctly.

@johnne
Author

johnne commented Sep 12, 2017

testfiles.tar.gz
@tseemann Thanks for the reply. Attaching the testfiles I'm using. I ran prodigal as:

prodigal -i final_contigs.masked.fa -d final_contigs.masked.ffn -a final_contigs.masked.faa -o final_contigs.masked.gff -f gff -p meta -m


@hyattpd
Owner

hyattpd commented Sep 18, 2017

-m is really implemented in a bad clunky way, and I don't believe it expressly forbids genes from crossing the gap (it just turns the sequence into N's). So if there's stuff that looks like protein coding on both sides in the correct frame, and can overcome the score penalty of a bunch of N's, it can predict genes across gaps. This -m stuff was never really intended as a permanent solution to this problem.

In the development version, I explicitly added a gap-handling mode where the user can specify the behavior upon seeing any stretch of N's (pass across, run into like scaffolds, or hard stop). This is described in the wiki. Unfortunately, I have no ETA for this version to be done (mostly work on plants these days, and hard to find time to come back to this and finish it... there's still a lot to do before even getting to updating the metagenomic side).
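
Since -m in 2.6.3 does not forbid genes from crossing masked stretches, one possible workaround is to drop predictions that overlap a run of N after the fact. The following is only a minimal sketch and not part of Prodigal: the helper names `n_runs` and `overlaps_mask` are hypothetical, and the 50-character minimum is taken from the figure mentioned earlier in this thread. Gene coordinates are assumed to be 1-based and inclusive, as in the GFF output.

```python
import re


def n_runs(seq, min_len=50):
    """Return 1-based inclusive (start, end) coordinates of N runs >= min_len.

    min_len=50 is an assumption based on the minimum run length
    mentioned in this thread.
    """
    pattern = r"[Nn]{%d,}" % min_len
    return [(m.start() + 1, m.end()) for m in re.finditer(pattern, seq)]


def overlaps_mask(gene_start, gene_end, runs):
    """True if the gene interval (1-based, inclusive) touches any N run."""
    return any(gene_start <= r_end and r_start <= gene_end
               for r_start, r_end in runs)


# Toy contig: 20 real bases, 60 N's, 20 real bases.
contig = "ACGT" * 5 + "N" * 60 + "ACGT" * 5
runs = n_runs(contig)
keep = [g for g in [(1, 20), (15, 30), (81, 100)]
        if not overlaps_mask(g[0], g[1], runs)]
```

Here only the gene at 15-30 is discarded, because it runs into the masked stretch.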

@tseemann
Contributor

tseemann commented Oct 2, 2017

@hyattpd but metagenomics is "hot" these days. do it for the plant microbiome! :)

@hyattpd
Owner

hyattpd commented Oct 4, 2017

Yes, the metagenomic version can be vastly improved and I have many ideas how to do this. Just need to find time to work on it.

@jtamames

jtamames commented Feb 6, 2019

Hi! This issue is still not fixed, right? I am masking my contigs in the places where I predict an RNA, so that gene prediction does not give me a gene there. Something like this:

k139_29
ATNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

And I get this:

k139_29_3-368 # 3 # 368 # -1 # ID=29_1;partial=11;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=1.000
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

Besides the fact that it should skip the N's, I am puzzled how prodigal can detect a gene there, with just two valid bases in the whole contig.

Best,
Javier

@hyattpd
Owner

hyattpd commented Feb 6, 2019

I doubt I will update -m in the 2.x Prodigal (as noted before, this is done in a better way in 3.0).

If you are masking the sequence manually, a simple trick is to begin and end the mask with TTAATTAATTAA, which inserts stop codons in all 6 frames.
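
A minimal sketch of that trick, assuming the mask overwrites the region in place so that contig coordinates are preserved. The function name `mask_region` and the choice to reject regions too short to hold both flanks are illustrative assumptions, not anything Prodigal provides.

```python
STOP_FLANK = "TTAATTAATTAA"  # flank suggested above; TAA occurs in every frame


def mask_region(seq, start, end, flank=STOP_FLANK):
    """Mask seq[start:end] (0-based, end-exclusive) without changing length.

    The masked region begins and ends with `flank`; everything in
    between becomes N. Hypothetical helper for illustration only.
    """
    length = end - start
    if length < 2 * len(flank):
        raise ValueError("region too short to hold both flanks")
    masked = flank + "N" * (length - 2 * len(flank)) + flank
    return seq[:start] + masked + seq[end:]


contig = "ACGT" * 20                      # 80-base toy contig
masked_contig = mask_region(contig, 10, 60)
```

Keeping the length unchanged means coordinates from the RNA prediction step still line up with the masked contig.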

hyattpd closed this as completed Sep 30, 2019

@tseemann
Contributor

I used to use NNNNNCATTCCATTCATTAATTAATTAATGAATGAATGNNNNN as my joiner!
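
A quick way to check a joiner like this is to look for an in-frame stop codon in each of the six reading frames. The sketch below assumes the standard bacterial stop codons TAA, TAG and TGA; the function names are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}  # assumed stop codons (standard bacterial code)
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string (N stays N)."""
    return seq.translate(COMPLEMENT)[::-1]


def frames_with_stop(seq):
    """Return the set of (strand, frame) pairs with an in-frame stop codon."""
    hits = set()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            # Walk codon by codon in this frame until a stop is found.
            for i in range(frame, len(s) - 2, 3):
                if s[i:i + 3] in STOPS:
                    hits.add((strand, frame))
                    break
    return hits


joiner = "NNNNNCATTCCATTCATTAATTAATTAATGAATGAATGNNNNN"
all_frames_blocked = len(frames_with_stop(joiner)) == 6
```

Any open reading frame that reaches the joiner hits a stop codon regardless of strand or frame, so a gene cannot be extended through it.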

@Jiulong-Zhao

> I used to use NNNNNCATTCCATTCATTAATTAATTAATGAATGAATGNNNNN as my joiner!

I want to know why prodigal cannot predict genes across the sequence you provided. Are there any references? Thank you so much!
