Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Negative values in functional annotations #442

Closed
timyerg opened this issue Mar 10, 2022 · 9 comments
Closed

Negative values in functional annotations #442

timyerg opened this issue Mar 10, 2022 · 9 comments

Comments

@timyerg
Copy link

timyerg commented Mar 10, 2022

Dear developers.
I know that there is similar issue that was already closed: #401 (comment)
I got negative values in unmapped row with SqueezeMeta 1.4.1 (coassembly mode) and found above mentioned issue but the answer from there was to rerun the same samples with newer version of SqueezeMeta (1.5.1) and it was closed due to the lack of activity.
I installed SqueezeMeta 1.5.1 and also installed denovo all databases. I run my new dataset in the sequential mode, and pooled resulted outputs. Still, I have negative values in the 'unmapped' row.
Is it normal to have negative values in unmapped row?
Can I just remove this row to perform DESeq2 analysis?
Slice of KO abundances table is attached for an example.
combined.KO.abund.zip

And thanks again for the tool. It is very convenient to use.

@fpusan
Copy link
Collaborator

fpusan commented Mar 10, 2022

No, I don't expect this to happen.
Does this also happen for the individual samples or only for the combined output?

@timyerg
Copy link
Author

timyerg commented Mar 10, 2022

Thank you for super fast reply!
Yes, I also have it in my table that was created while pooling in sample directory:
S1.KO.abund.zip

@fpusan
Copy link
Collaborator

fpusan commented Mar 10, 2022

I see. Any way you can share the S1 project directory with me?
This definitely needs to be corrected

@timyerg
Copy link
Author

timyerg commented Mar 10, 2022

I will zip it, and send you a link to it.

My command:SqueezeMeta.pl -m sequential -s samples.tsv -f ../../$fqs -extdb ../../mydb.list --nobins -b 8 -t 6

If it is important:Samples were quality controlled with TrimGalore and host DNA was removed with bowtie2 before SqueezeMeta run.

@fpusan
Copy link
Collaborator

fpusan commented Mar 14, 2022

I can see where the problem is. The ORF table (which we use to calculate the function abundances in SQMtools) has extra reads somewhere. The difference is not very big, but in this case you have very high mapping percentages (97% of your reads map back to the assembly), meaning that this small difference is enough to push the sum of mapped reads to the ORF table above 100% hence the negative values in the Unmapped row.

This only happens for the ORF table, the contig table has the right result. I wonder if this can be due to the presence of overlapping ORFs, that then make the same reads be counted twice... we'll look into that.

Ignore this for now, results should be ok otherwise. However we'll look into this more carefully.

@timyerg
Copy link
Author

timyerg commented Mar 14, 2022

Thank you for your reply!
As I understood, I can just ignore 'Unmapped' row in my both datasets and proceed with the analysis. That's great!
Could you please update me later when you will take a deeper look into this issue?
I will close it now, thank you again.

@fpusan
Copy link
Collaborator

fpusan commented Mar 14, 2022

Leave it open, that way we don't forget to fix it!

@fpusan
Copy link
Collaborator

fpusan commented Mar 15, 2022

So after looking at this a bit more carefully, the problem is indeed due to the ORF table having more counts than the total number of reads present in the sample.

This stems from the way we assign reads to features. As long as a read mapping to the contig partially maps to a given features (as defined in the gff file) that ORF will get a count. Therefore, when a single read maps to two different features, the ORF table will get two counts (even if there was really only one read).

This will happen:

  1. When several features overlap. This can be true overlapping of genes in the genome, but also I've found some instances in which different software predict a feature in the same region (e.g. prodigal predicts a protein-coding gene and barrnap predicts a rRNA in the same place).
  2. When non-overlapping features are close enough that the same reap will map to both. This becomes more of a problem for small features (most of the reads mapping to one feature will also map to another close feature, leading to a lot of double counting).

This is normally not a big deal in practice, but in your case the mapping percentage was so already so high (almost 98% in the project you sent me) that it was enough to push the total counts in the ORF table above the total number of reads. Not by a large margin (10721 out of 77 million reads) but enough to generate a negative number of Unmapped reads in the function tables.

You can safely ignore the negative number (as long as it happens in the "Unmapped" row). We will remove that row from the function tables in the next version. We will consider whether to recommend using the number of mapped bases as a safer approach (SqueezeMeta also keeps track of that) but we think that the difference will be negligible in most cases.

@timyerg
Copy link
Author

timyerg commented Mar 15, 2022

Thank you for detailed clarification. I would be very surprised if the issue was only in the overlapping ORFs since it is not very common and as far as I know mostly spreaded among viruses. But you explanation about closely located small functions and parallel counts from different tools solved the mystery for me. Now it is crystal clear and (most important for me) I do not need to rerun my analyses (just removed several Tb of raw data on the server).
I will not close the issue since I do not know if you need it as reminder so please close it whenever you decide so.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants