Problem running with cage_peak #63

PJ-Shaw · 2020-04-05T09:28:13Z

When I ran SQANTI2 using the option of providing a cage_peak file and SQANTI2 aborted without finishing the run. When I ran again leaving out this option, it worked OK suggesting that there's a problem with my cage_peak file. My cage_peak file is made from CAGEr output that looks like this:

(anaCogent3) philip@BT591-001:/mnt/d$ head merged_TSS.bed
Pf3D7_01_v3 5620 5621 cluster1 1000 -
Pf3D7_01_v3 5671 5702 cluster2 1000 -
Pf3D7_01_v3 6351 6362 cluster3 1000 -
Pf3D7_01_v3 6395 6423 cluster4 1000 -
Pf3D7_01_v3 6474 6483 cluster5 1000 -
Pf3D7_01_v3 7770 7789 cluster6 1000 -
Pf3D7_01_v3 35285 35286 cluster7 1000 +
Pf3D7_01_v3 35330 35347 cluster8 1000 +
Pf3D7_01_v3 35404 35412 cluster9 1000 +
Pf3D7_01_v3 35437 35463 cluster10 1000 +

This is what showed up when running SQANTI2:

(anaCogent3) philip@BT591-001:/mnt/d$ python ~/cDNA_Cupcake/sequence/SQANTI2/sqanti_qc2.py /mnt/d/ringschiz_sqanti2.fasta /mnt/d/Pf3D7_RNAsequins_combined.gtf /mnt/d/PlasmoDBv44_Pf3D7_RNAsequins_combined.fasta --cage_peak merged_TSS.bed --polyA_motif_list PAS.txt --polyA_peak sorted_Pf_polyA_sites.bed
R scripting front-end version 3.6.1 (2019-07-05)
Cleaning up isoform IDs...
Cleaned up isoform fasta file written to: /mnt/d/ringschiz_sqanti2.renamed.fasta
Write arguments to /mnt/d/ringschiz_sqanti2.params.txt...
**** Running SQANTI2...
**** Parsing provided files....
Reading genome fasta /mnt/d/PlasmoDBv44_Pf3D7_RNAsequins_combined.fasta....
Aligning reads with Minimap2...
[M::mm_idx_gen::2.4101.00] collected minimizers
[M::mm_idx_gen::3.1121.61] sorted minimizers
[M::main::3.1131.61] loaded/built the index for 17 target sequence(s)
[M::mm_mapopt_update::3.3031.58] mid_occ = 150
[M::mm_idx_stat] kmer size: 15; skip: 5; is_hpc: 0; #seq: 17
[M::mm_idx_stat::3.4161.56] distinct minimizers: 7046944 (83.50% are singletons); average occurrences: 1.697; average spacing: 2.835
[M::worker_pipeline::25.7453.54] mapped 42700 sequences
[M::main] Version: 2.17-r941
[M::main] CMD: minimap2 -ax splice --secondary=no -C5 -uf -t 10 /mnt/d/PlasmoDBv44_Pf3D7_RNAsequins_combined.fasta /mnt/d/ringschiz_sqanti2.renamed.fasta
[M::main] Real time: 25.810 sec; CPU: 91.281 sec; Peak RSS: 0.776 GB
output written to
**** Predicting ORF sequences...
**** Parsing Reference Transcriptome....
**** Parsing Isoforms....
Splice Junction Coverage files not provided.
**** Reading CAGE Peak data.
Traceback (most recent call last):
File "/home/philip/cDNA_Cupcake/sequence/SQANTI2/sqanti_qc2.py", line 2198, in
main()
File "/home/philip/cDNA_Cupcake/sequence/SQANTI2/sqanti_qc2.py", line 2191, in main
run(args)
File "/home/philip/cDNA_Cupcake/sequence/SQANTI2/sqanti_qc2.py", line 1672, in run
isoforms_info = isoformClassification(args, isoforms_by_chr, refs_1exon_by_chr, refs_exons_by_chr, junctions_by_chr, junctions_by_gene, start_ends_by_gene, genome_dict, indelsJunc, orfDict)
File "/home/philip/cDNA_Cupcake/sequence/SQANTI2/sqanti_qc2.py", line 1395, in isoformClassification
cage_peak_obj = CAGEPeak(args.cage_peak)
File "/home/philip/cDNA_Cupcake/sequence/SQANTI2/sqanti_qc2.py", line 1896, in init
self.read_bed()
File "/home/philip/cDNA_Cupcake/sequence/SQANTI2/sqanti_qc2.py", line 1905, in read_bed
tss0 = int(raw[6])
IndexError: list index out of range

Magdoll · 2020-04-05T18:47:26Z

Hi @PJ-Shaw ,
SQANTI2 is using the format used by FANTOM5 for their CAGE Peak Bed. See data here. Additional info on the refTSS is in their publication

The required columns are

column 1 - chromosome
column 2 - cage peak signal start
column 3 - cage peak signal end
column 6 - strand
column 7 - cage peak "tss" <--- this is what FANTOM5 considers as the actual TSS

Example below

chr1    629886  629898  hg_3.1  1       +       629892  629893  255,255,0
chr1    629904  629938  hg_223433.1     1       +       629921  629922  255,255,0
chr1    629945  629967  hg_223434.1     1       -       629956  629957  255,255,0
chr1    630059  630067  hg_223435.1     1       +       630063  630064  255,255,0
chr1    630098  630103  hg_4.1  1       +       630100  630101  255,255,0
chr1    630129  630161  hg_5.1  1       +       630145  630146  255,255,0
chr1    630225  630237  hg_236023.1     1       +       630231  630232  255,255,0

PJ-Shaw · 2020-04-06T07:58:33Z

Thank you for spotting the error. I added a 7th column with the "dominant_ctss" nucleotide reported by CAGEr to my bed file and SQANTI2 worked, reporting isoforms "within_cage_peak".

Magdoll closed this as completed Apr 6, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Problem running with cage_peak #63

Problem running with cage_peak #63

PJ-Shaw commented Apr 5, 2020

Magdoll commented Apr 5, 2020 •

edited

PJ-Shaw commented Apr 6, 2020

Problem running with cage_peak #63

Problem running with cage_peak #63

Comments

PJ-Shaw commented Apr 5, 2020

Magdoll commented Apr 5, 2020 • edited

PJ-Shaw commented Apr 6, 2020

Magdoll commented Apr 5, 2020 •

edited