collapsed.gff file is not consistent with collapsed.rep.fa file #27
Comments
Hi @XiaoyuZhan520 , Can you confirm that in the
Another possible cause of the difference, especially for the sequence you posted, is a difference between the transcript sequence and its mapped alignment. The end of the sequence is all "A"s, which may be an untrimmed polyA tail. Those bases would not be mapped to the genome, which would result in an exonic length shorter than the sequence itself. --Liz |
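One quick way to check the polyA explanation is to count the run of "A"s at the 3' end of the representative sequence and compare it with the length gap between the FASTA record and the GFF. The sketch below is only an illustration: `trailing_a_run` is a hypothetical helper, not part of the collapse script, and the sequence is a toy string because the actual PB.10113.15 sequence is not shown in this thread.

```python
def trailing_a_run(seq):
    """Return the number of consecutive 'A' bases at the 3' end of seq."""
    seq = seq.upper()
    return len(seq) - len(seq.rstrip("A"))


# Toy sequence (assumption): 8 non-A bases followed by 6 A's.
toy_seq = "ACGTACGTAAAAAA"
print(trailing_a_run(toy_seq))  # 6
```

If the trailing run is about as long as the gap between the FASTA length and the summed exon length, an unmapped polyA tail is a plausible explanation. It does not rule out other causes such as soft-clipped bases at the 5' end.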
Hello, Thanks for the quick answer. Yes, indeed, PB.10113.15 was collapsed from multiple transcripts (around 20), and I can find an identical one among the non-collapsed transcripts. Given the information above, can I use the coordinates in the collapsed.gff file to extract the transcript sequence from the genome and get a matching sequence? Also, could you please tell me the difference between the collapsed.gff and collapsed.gff.unfuzzy files, as well as between the collapsed.group.txt and collapsed.group.txt.unfuzzy files? |
Hi @XiaoyuZhan520 ,
Do you mean: if you reconstitute the sequence from the genome according to the GFF, will it match the representative sequence exactly (minus the small difference in length)? Yes and no. It will match exactly, except that the PacBio transcript may contain SNPs that are not in the reference genome. There may also be a very small number of residual errors. re: what is ".unfuzzy" files --Liz |
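The reconstitution described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the method used by the collapse script: `transcript_from_exons` and `mismatches` are hypothetical helpers, the chromosome is assumed to be already loaded as a plain string, exon coordinates are 1-based and inclusive as in GFF, and the comparison is position by position, so it only makes sense when the two sequences differ by substitutions (for example SNPs) and by extra bases at the 3' end, not by indels.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def revcomp(seq):
    """Reverse-complement a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def transcript_from_exons(chrom_seq, exons, strand="+"):
    """Concatenate exon sequences from one chromosome.

    exons is a list of (start, end) pairs, 1-based inclusive (GFF style).
    """
    pieces = [chrom_seq[start - 1:end] for start, end in sorted(exons)]
    seq = "".join(pieces)
    return revcomp(seq) if strand == "-" else seq


def mismatches(a, b):
    """Return 0-based positions where a and b differ over their shared length."""
    n = min(len(a), len(b))
    return [i for i in range(n) if a[i].upper() != b[i].upper()]


# Toy data (assumption): a 16-base "chromosome" with two exons.
chrom = "AAACCCGGGTTTACGT"
rebuilt = transcript_from_exons(chrom, [(1, 3), (7, 9)], strand="+")
print(rebuilt)                            # AAAGGG
print(mismatches(rebuilt, "AATGGGAAAA"))  # [2]
```

In the toy comparison, position 2 plays the role of a SNP and the four trailing "A"s play the role of an unmapped tail. For real data with possible indels, a proper pairwise alignment would be needed instead of the position-by-position check.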
Thank you very much. I really appreciate your help. |
Hello Liz, I find the way transcripts are collapsed quite confusing, and I hope you can give me some ideas. For example, transcript PB.6923.1 is 1324 nucleotides long. It was mapped to the genome with 124 nucleotides perfectly mapped and 1200 nucleotides mapped to N (gap). This transcript passed the -c 0.85 -i 0.85 thresholds and was collapsed into one final transcript. Do you think this behavior is reasonable? |
Hi @XiaoyuZhan520 , This is an odd case. Can you please share the following information:
You may share this privately by sending it to etseng@pacb.com. Thanks, |
Closing unless there is further information. |
Hello,
I've used collapse_isoforms_by_sam.py to remove redundant transcripts and obtained the results.
However, I found that the content of the collapsed.gff file is not consistent with the collapsed.rep.fa file.
Take 'PB.10113.15' as an example. As you can see below, the exons of this transcript in the GFF add up to 2091 nucleotides, but the sequence in the FASTA file is 2154 nucleotides long.
How should I deal with the difference between the GFF file and the FASTA file?
The gene structure annotation of this transcript is:
22 PacBio transcript 36253133 36267525 . + . gene_id "PB.10113"; transcript_id "PB.10113.15";
22 PacBio exon 36253133 36253219 . + . gene_id "PB.10113"; transcript_id "PB.10113.15";
22 PacBio exon 36254937 36254999 . + . gene_id "PB.10113"; transcript_id "PB.10113.15";
22 PacBio exon 36257083 36257136 . + . gene_id "PB.10113"; transcript_id "PB.10113.15";
22 PacBio exon 36257319 36257407 . + . gene_id "PB.10113"; transcript_id "PB.10113.15";
22 PacBio exon 36261596 36261722 . + . gene_id "PB.10113"; transcript_id "PB.10113.15";
22 PacBio exon 36265151 36266135 . + . gene_id "PB.10113"; transcript_id "PB.10113.15";
22 PacBio exon 36266840 36267525 . + . gene_id "PB.10113"; transcript_id "PB.10113.15";
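The 2091 figure can be reproduced by summing the exon lengths in the annotation above (each exon contributes end - start + 1 bases). The sketch below is illustrative only: `exonic_length` is a hypothetical helper, and it splits on any whitespace because the columns appear space-separated here, whereas a real GFF file is tab-separated.

```python
GFF_TEXT = """\
22 PacBio transcript 36253133 36267525 . + . gene_id "PB.10113"; transcript_id "PB.10113.15";
22 PacBio exon 36253133 36253219 . + . gene_id "PB.10113"; transcript_id "PB.10113.15";
22 PacBio exon 36254937 36254999 . + . gene_id "PB.10113"; transcript_id "PB.10113.15";
22 PacBio exon 36257083 36257136 . + . gene_id "PB.10113"; transcript_id "PB.10113.15";
22 PacBio exon 36257319 36257407 . + . gene_id "PB.10113"; transcript_id "PB.10113.15";
22 PacBio exon 36261596 36261722 . + . gene_id "PB.10113"; transcript_id "PB.10113.15";
22 PacBio exon 36265151 36266135 . + . gene_id "PB.10113"; transcript_id "PB.10113.15";
22 PacBio exon 36266840 36267525 . + . gene_id "PB.10113"; transcript_id "PB.10113.15";
"""


def exonic_length(gff_text, transcript_id):
    """Sum the lengths of all exon records belonging to transcript_id."""
    total = 0
    for line in gff_text.splitlines():
        # Keep the attribute column (9th field) intact despite its spaces.
        fields = line.split(None, 8)
        if len(fields) < 9 or fields[2] != "exon":
            continue
        if f'transcript_id "{transcript_id}"' not in fields[8]:
            continue
        start, end = int(fields[3]), int(fields[4])
        total += end - start + 1  # GFF coordinates are 1-based, inclusive
    return total


print(exonic_length(GFF_TEXT, "PB.10113.15"))  # 2091
```

The FASTA record is 2154 nucleotides long, so 2154 - 2091 = 63 bases of the representative sequence are not covered by the exons in the GFF.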