Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Meaning of "UniqueReads" in genes.out file? #83

Open
DaRinker opened this issue Mar 30, 2023 · 2 comments
Open

Meaning of "UniqueReads" in genes.out file? #83

DaRinker opened this issue Mar 30, 2023 · 2 comments
Assignees
Labels
question Further information is requested

Comments

@DaRinker
Copy link

DaRinker commented Mar 30, 2023

I have been comparing HTSeq (htseq-count) counts to TPMcalcualtor UniqueReads counts.

I find them generally to be in very good agreement. However, there are occasionally some cases where HTSeq and TPMcalculator will be in agreement for the "Reads" column of TPMcalculator output. but will disagree in the last 8 columns (TPMcalculator output will report zeros).

For example:

DTO2_FUN_009269 scaffold_6 3072673 3073284 612 366201 9037.3 612 366201 7579.98 0 0 0 0 0 0 0 0 0 0 0 0

This can happen with genes that receive just a few reads or with genes have 10's of thousands of reads aligning. It can happen with intronless genes or with multi-exon genes.

Since I'm using de novo assembly and annotation, I'm wondering if there could be something about my gff file that is tripping up TPMcalculator?

Here's my gff file for the example case above

$ grep DTO2_FUN_009269 ../../final_assemblies/full_assemblies/DTO2/genome.masked.gff3 
scaffold_6	funannotate	gene	3072674	3073285	.	-	.	ID=DTO2_FUN_009269;
scaffold_6	funannotate	mRNA	3072674	3073285	.	-	.	ID=DTO2_FUN_009269-T1;Parent=DTO2_FUN_009269;product=hypothetical protein;
scaffold_6	funannotate	exon	3072674	3073285	.	-	.	ID=DTO2_FUN_009269-T1.exon1;Parent=DTO2_FUN_009269-T1;
scaffold_6	funannotate	CDS	3072674	3073285	.	-	0	ID=DTO2_FUN_009269-T1.cds;Parent=DTO2_FUN_009269-T1;
@DaRinker
Copy link
Author

I may have figured out what is going on here.

These discrepancies appear to be instances of overlapping genes in the gff (opposite strands) where TPMcalculator can't assign one transcript as the correct strand so there are no unique reads assigned to EITHER transcript.

Not sure if there's a way to fix this within the TPMcalculator? (I don't seen an option to enforce stranded-ness)

@r78v10a07
Copy link
Member

See output description here: https://github.com/ncbi/TPMCalculator/wiki/Output-files

@r78v10a07 r78v10a07 self-assigned this Feb 6, 2024
@r78v10a07 r78v10a07 added the question Further information is requested label Feb 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

2 participants