Skip to content
This repository has been archived by the owner on May 3, 2024. It is now read-only.

collapsed.gff file is not consistent with collapsed.rep.fa file #27

Closed
XiaoyuZhan520 opened this issue Jan 8, 2018 · 7 comments
Closed

Comments

@XiaoyuZhan520
Copy link

Hello,

I've used collapse_isoforms_by_sam.py to remove redundant transcripts and get the results.

However, I found that the content in collapsed.gff file is not consistent with collapsed.rep.fa file.

Take 'PB.10113.15' as an example. As you can see in the following, the length of this transcript is 2091 but the sequence in fasta file is 2154-nucleotide long.

And I wonder how can I deal with the difference between gff file and fasta file?

The annotation of gene structure of this transcripts is:
22 PacBio transcript 36253133 36267525 . + . gene_id "PB.10113"; transcript_id "PB.10113.15";
22 PacBio exon 36253133 36253219 . + . gene_id "PB.10113"; transcript_id "PB.10113.15";
22 PacBio exon 36254937 36254999 . + . gene_id "PB.10113"; transcript_id "PB.10113.15";
22 PacBio exon 36257083 36257136 . + . gene_id "PB.10113"; transcript_id "PB.10113.15";
22 PacBio exon 36257319 36257407 . + . gene_id "PB.10113"; transcript_id "PB.10113.15";
22 PacBio exon 36261596 36261722 . + . gene_id "PB.10113"; transcript_id "PB.10113.15";
22 PacBio exon 36265151 36266135 . + . gene_id "PB.10113"; transcript_id "PB.10113.15";
22 PacBio exon 36266840 36267525 . + . gene_id "PB.10113"; transcript_id "PB.10113.15";

PB.10113.15
ATTATACAGACGCATAACTGGAGGTGGGATCCACACAGCTCAGAACAGCTGGATCTTGCT
CAGTCTCTGCCAGGGGAAGATTCCTTGGAGGAGGCCCTGCAGCGACATGGAGGGAGCTGC
TTTGCTGAGAGTCTCTGTCCTCTGCATCTGGATGAGTGCACTTTTCCTTGGTGTGGGAGT
GAGGGCAGAGGAAGCTGGAGCGAGGGTGCAACAAAACGTTCCAAGTGGGACAGATACTGG
AGATCCTCAAAGTAAGCCCCTCGGTGACTGGGCTGCTGGCACCATGGACCCAGAGAGCAG
TATCTTTATTGAGGATGCCATTAAGTATTTCAAGGAAAAAGTGAGCACACAGAATCTGCT
ACTCCTGCTGACTGATAATGAGGCCTGGAACGGATTCGTGGCTGCTGCTGAACTGCCCAG
GAATGAGGCAGATGAGCTCCGTAAAGCTCTGGACAACCTTGCAAGACAAATGATCATGAA
AGACAAAAACTGGCACGATAAAGGCCAGCAGTACAGAAACTGGTTTCTGAAAGAGTTTCC
TCGGTTGAAAAGTGAGCTTGAGGATAACATAAGAAGGCTCCGTGCCCTTGCAGATGGGGT
TCAGAAGGTCCACAAAGGCACCACCATCGCCAGTGTGGTGTCTGGCTCTCTCAGCATTTC
CTCTGGCATCCTGACCCTCGTCGGCATGGGTCTGGCACCCTTCACAGAGGGAGGCAGCCT
TGTACTCTTGGAACCTGGGATGGAGTTGGGAATCACAGCAGCTTTGACCGGGATTACCAG
CAGTACCATAGACTACGGAAAGAAGTGGTGGACACAAGCCCAAGCCCACGACCTGGTCAT
CAAAAGCCTTGACAAATTGAAGGAGGTGAAGGAGTTTTTGGGTGAGAACATATCCAACTT
TCTTTCCTTAGCTGGCAATACTTACCAACTCACACGAGGCATTGGGAAGGACATCCGTGC
CCTCAGACGAGCCAGAGCCAATCTTCAGTCAGTACCGCATGCCTCAGCCTCACGCCCCCG
GGTCACTGAGCCAATCTCAGCTGAAAGCGGTGAACAGGTGGAGAGAGTTAATGAACCCAG
CATCCTGGAAATGAGCAGAGGAGTCAAGCTCACGAATGTGGCCCCTGTAAGCTTCTTTCT
TGTGCTGGATGTAGTCTACCTCGTGTACGAATCAAAGCACTTACATGAGGGGGCAAAGTC
AGAGACAGCTGAGGAGCTGAAGAAGGTGGCTCAGGAGCTGGAGGAGAAGCTAAACATTCT
CAACAATAATTATAAGATTCTGCAGGCGGACCAAGAACTGTGACCACAGGGCAGGGCAGC
CACCAGGAGAGATATGCCTGGCAGGGGCCAGGACAAAATGCAAACTTTTTTTTTTTCTGA
GACAGAGTCTTGCACTGTCGCCAAGTGGAGCTTGCAGTGAGCCGAGATATCGCCACTGCA
CTCCAGCCTGGGTGACAGAGCGAGACTCCATCTCAAAAAAAAAAAAAAAAAGAATATATT
GACGGAAGAATAGAGAGGAGGCTTGAAGGAACCAGCAATGAGAAGGCCAGGAAAAGAAAG
AGCTGAAAATGGAGAAAGCCCAAGAGTTAGAACAGTTGGATACAGGAGAAGAAACAGCGG
CTCCACTACAGACCCAGCCCCAGGTTCAATGTCCTCCGAAGAATGAAGTCTTTCCCTGGT
GATGGTCCCCTGCCCTGTCTTTCCAGCATCCACTCTCCCTTGTCCTCCTGGGGGCATATC
TCAGTCAGGCAGCGGCTTCCTGATGATGGTCGTTGGGGTGGTTGTCATGTGATGGGTCCC
CTCCAGGTTACTAAAGGGTGCATGTCCCCTGCTTGAACACTGAAGGGCAGGTGGTGGGCC
ATGGCCATGGTCCCCAGCTGAGGAGCAGGTGTCCCTGAGAACCCAAACTTCCCAGAGAGT
ATGTGAGAACCAACCAATGAAAACAGTCCCATCGCTCTTACCCGGTAAGTAAACAGTCAG
AAAATTAGCATGAAAGCAGTTTAGCATTGGGAGGAAGCTCAGATCTCTAGAGCTGTCTTG
TCGCCGCCCAGGATTGACCTGTGTGTAAGTCCCAATAAACTCACCTACTCATCAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACCC

@Magdoll
Copy link
Owner

Magdoll commented Jan 8, 2018

Hi @XiaoyuZhan520 ,

Can you confirm that in the collapsed.group.txt file that the PBID PB.10113.15 has multiple members? If so, then this discrepancy comes from collapsing multiple redundant sequences that have the same exonic structure but may vary slightly in the exact 5'/3' ends. The GFF file will display the longest 5' and 3' sites (which may be attributed by different collapsed members) but for rep.fa it is showing just one of the sequences from that collapsed set of members. Hence the small difference.

Another possibility of the difference - especially in the case of the sequence you posted - is difference in the transcript sequence and mapped alignment. The end of the sequence is all "A"s which may be untrimmed polyA sequences. They would hence not be mapped to the genome - and result in a shorter exonic length - then the sequence itself.

--Liz

@XiaoyuZhan520
Copy link
Author

Hello,

Thanks for the quick answer.

Yes, indeed, PB.10113.15 was collapsed from multiple transcripts (around 20). And I can find a exactly same one from the non-collapsed transcripts.

According to the information above, can I make use of the information from collapsed.gff file and capture transcripts sequence from genome and get the matching sequence?

Besides, could you please tell me the difference between collapsed.gff file and collapsed.gff.unfuzzy, as well as collapsed.group.txt and collapsed.group.txt.unfuzzy file.

@Magdoll
Copy link
Owner

Magdoll commented Jan 9, 2018

Hi @XiaoyuZhan520 ,

can I make use of the information from collapsed.gff file and capture transcripts sequence from genome and get the matching sequence?

Do you mean: if you reconstitute the sequence form the genome according to GFF, will it match exactly (minus the small difference in length) the representative sequence?

Yes and No. It will match exactly except the PacBio transcript may contain SNPs not contained in the reference genome. Possibly, there may be a very small number of residual errors.

re: what is ".unfuzzy" files
The collapse script has a --fuzzy_junction parameter. That is, two transcripts that have the same mapped exonic structures except the junctions are different by a small amount (default: 5 bp) will still be considered identical and collapsed. Most likely, small number of junction diff (5 bp or less) are caused by residual errors not well aligned by GMAP or other aligners.

--Liz

@XiaoyuZhan520
Copy link
Author

Thank you very much. Really appreciate for your help

@XiaoyuZhan520
Copy link
Author

XiaoyuZhan520 commented Jan 15, 2018

Hello Liz,

I found it quite confusing to collapse the transcripts and I hope you can provide me any idea.

For example, transcripts PB.6923.1, which is 1324 nucleotide long. It was mapped to the genome with 124 nucleotide perfectly mapped, 1200 nucleotide mapped to N (gap). This transcript passed the judgement of -c 0.85 -i 0.85 and collapsed into one final transcript.

Do you think it is reasonable to act like this?

@Magdoll
Copy link
Owner

Magdoll commented Jan 23, 2018

Hi @XiaoyuZhan520 ,

This is an odd case. Can you please share the following information:

  1. the collapsed (.rep.fq) fasta/fastq sequence for PB.6923.1
  2. the GFF annotation for it PB.6923.1
  3. the line that starts with PB.6923.1 in *.collapsed.group.txt

You may share this privately by sending it to etseng@pacb.com.

Thanks,
--Liz

@Magdoll
Copy link
Owner

Magdoll commented Jul 24, 2018

Closing unless there is further information.

@Magdoll Magdoll closed this as completed Jul 24, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants