Negative values in functional annotations #442

timyerg · 2022-03-10T08:58:53Z

Dear developers.
I know that there is similar issue that was already closed: #401 (comment)
I got negative values in unmapped row with SqueezeMeta 1.4.1 (coassembly mode) and found above mentioned issue but the answer from there was to rerun the same samples with newer version of SqueezeMeta (1.5.1) and it was closed due to the lack of activity.
I installed SqueezeMeta 1.5.1 and also installed denovo all databases. I run my new dataset in the sequential mode, and pooled resulted outputs. Still, I have negative values in the 'unmapped' row.
Is it normal to have negative values in unmapped row?
Can I just remove this row to perform DESeq2 analysis?
Slice of KO abundances table is attached for an example.
combined.KO.abund.zip

And thanks again for the tool. It is very convenient to use.

fpusan · 2022-03-10T09:01:33Z

No, I don't expect this to happen.
Does this also happen for the individual samples or only for the combined output?

timyerg · 2022-03-10T09:07:45Z

Thank you for super fast reply!
Yes, I also have it in my table that was created while pooling in sample directory:
S1.KO.abund.zip

fpusan · 2022-03-10T09:09:28Z

I see. Any way you can share the S1 project directory with me?
This definitely needs to be corrected

timyerg · 2022-03-10T09:10:35Z

I will zip it, and send you a link to it.

My command:SqueezeMeta.pl -m sequential -s samples.tsv -f ../../$fqs -extdb ../../mydb.list --nobins -b 8 -t 6

If it is important:Samples were quality controlled with TrimGalore and host DNA was removed with bowtie2 before SqueezeMeta run.

fpusan · 2022-03-14T12:59:04Z

I can see where the problem is. The ORF table (which we use to calculate the function abundances in SQMtools) has extra reads somewhere. The difference is not very big, but in this case you have very high mapping percentages (97% of your reads map back to the assembly), meaning that this small difference is enough to push the sum of mapped reads to the ORF table above 100% hence the negative values in the Unmapped row.

This only happens for the ORF table, the contig table has the right result. I wonder if this can be due to the presence of overlapping ORFs, that then make the same reads be counted twice... we'll look into that.

Ignore this for now, results should be ok otherwise. However we'll look into this more carefully.

timyerg · 2022-03-14T13:09:20Z

Thank you for your reply!
As I understood, I can just ignore 'Unmapped' row in my both datasets and proceed with the analysis. That's great!
Could you please update me later when you will take a deeper look into this issue?
I will close it now, thank you again.

fpusan · 2022-03-14T13:10:09Z

Leave it open, that way we don't forget to fix it!

fpusan · 2022-03-15T19:09:06Z

So after looking at this a bit more carefully, the problem is indeed due to the ORF table having more counts than the total number of reads present in the sample.

This stems from the way we assign reads to features. As long as a read mapping to the contig partially maps to a given features (as defined in the gff file) that ORF will get a count. Therefore, when a single read maps to two different features, the ORF table will get two counts (even if there was really only one read).

This will happen:

When several features overlap. This can be true overlapping of genes in the genome, but also I've found some instances in which different software predict a feature in the same region (e.g. prodigal predicts a protein-coding gene and barrnap predicts a rRNA in the same place).
When non-overlapping features are close enough that the same reap will map to both. This becomes more of a problem for small features (most of the reads mapping to one feature will also map to another close feature, leading to a lot of double counting).

This is normally not a big deal in practice, but in your case the mapping percentage was so already so high (almost 98% in the project you sent me) that it was enough to push the total counts in the ORF table above the total number of reads. Not by a large margin (10721 out of 77 million reads) but enough to generate a negative number of Unmapped reads in the function tables.

You can safely ignore the negative number (as long as it happens in the "Unmapped" row). We will remove that row from the function tables in the next version. We will consider whether to recommend using the number of mapped bases as a safer approach (SqueezeMeta also keeps track of that) but we think that the difference will be negligible in most cases.

timyerg · 2022-03-15T19:26:32Z

Thank you for detailed clarification. I would be very surprised if the issue was only in the overlapping ORFs since it is not very common and as far as I know mostly spreaded among viruses. But you explanation about closely located small functions and parallel counts from different tools solved the mystery for me. Now it is crystal clear and (most important for me) I do not need to rerun my analyses (just removed several Tb of raw data on the server).
I will not close the issue since I do not know if you need it as reminder so please close it whenever you decide so.

fpusan added a commit that referenced this issue Jun 13, 2022

stop tracking ummaped reads for functions, fixes #442

73eeec0

fpusan closed this as completed Jun 13, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Negative values in functional annotations #442

Negative values in functional annotations #442

timyerg commented Mar 10, 2022

fpusan commented Mar 10, 2022

timyerg commented Mar 10, 2022

fpusan commented Mar 10, 2022

timyerg commented Mar 10, 2022 •

edited

Loading

fpusan commented Mar 14, 2022

timyerg commented Mar 14, 2022

fpusan commented Mar 14, 2022

fpusan commented Mar 15, 2022

timyerg commented Mar 15, 2022

Negative values in functional annotations #442

Negative values in functional annotations #442

Comments

timyerg commented Mar 10, 2022

fpusan commented Mar 10, 2022

timyerg commented Mar 10, 2022

fpusan commented Mar 10, 2022

timyerg commented Mar 10, 2022 • edited Loading

fpusan commented Mar 14, 2022

timyerg commented Mar 14, 2022

fpusan commented Mar 14, 2022

fpusan commented Mar 15, 2022

timyerg commented Mar 15, 2022

timyerg commented Mar 10, 2022 •

edited

Loading