-
Notifications
You must be signed in to change notification settings - Fork 9
/
tables.py
748 lines (701 loc) · 34.9 KB
/
tables.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
#!/usr/bin/env python
"""
tables.py
Abhi Nellore / December 25, 2015
Reproduces data used in Mathematica 10 notebook figures.nb for paper Human
splicing diversity across the Sequence Read Archive. Based on Abhi Nellore's
talk at Genome Informatics 2015; see the related repo
https://github.com/nellore/gi2015.
Get and unpack HISAT 2.0.1-beta from
https://ccb.jhu.edu/software/hisat2/index.shtml; we use the tool
extract_splice_sites.py that comes with it to obtain splice sites from
annotation.
File requirements:
1. intropolis.v1.hg19.tsv.gz: database of exon-exon junctions found across
~21,500 SRA samples NOT PROVIDED IN THIS REPO BUT CAN BE REPRODUCED.
See README.md for instructions. The file is also available for download
at http://intropolis.rail.bio .
2. intropolis.idmap.v1.hg19.tsv: maps sample indexes from
intropolis.v1.hg19.tsv.gz to SRA run accession numbers (regex: [SED]RR\d+)
(In this repo.)
3. All GENCODE gene annotations for GRCh37, which may be obtained by executing
the following command.
for i in 3c 3d 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19;
do curl -o gencode.$i.gtf.gz
ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_$i/
gencode.v$i.annotation.gtf.gz; if [ $? -eq 78 ];
then curl -o gencode.$i.gtf.gz ftp://ftp.sanger.ac.uk/pub/gencode/
Gencode_human/release_$i/gencode.v$i.annotation.GRCh37.gtf.gz;
fi; if [ $? -eq 78 ]; then curl -o gencode.$i.gtf.gz ftp://
ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_$i/
gencode_v$i.annotation.GRCh37.gtf.gz; fi; if [ $? -eq 78 ]; then
curl -o gencode.$i.gtf.gz ftp://ftp.sanger.ac.uk/pub/gencode/
release_$i/gencode.v$i.gtf.gz; fi; done
The GENCODE GTF filenames must have the format gencode.[VERSION].gtf.gz .
The directory containing GENCODE GTFs is specified at the command line.
An archive containing these GENCODE gene annotations may be downloaded at
http://verve.webfactional.com/misc/gencodes.tar.gz .
4. annotated_junctions.tsv.gz, which is in this directory and is
generated by rip_annotated_junctions.py . This file contains a union
of relevant annotated junction tracks from the UCSC Genome Browser.
See rip_annotated_junctions.py for more information.
5. http://www.nature.com/nbt/journal/v32/n9/extref/nbt.2957-S4.zip, which
is Supplementary Data 3 from the paper "A comprehensive assessment of
RNA-seq accuracy, reproducibility and information content by the
Sequencing Quality Control Consortium" by SEQC/MAQC-III Consortium
in Nature Biotech. The junctions on this list are used to make a
Venn Diagram.
6. biosample_tags.tsv, which is in the hg19 subdirectory of this repo and was
generated using hg19/get_biosample_data.sh . It contains metadata from the
NCBI Biosample database, including sample submission dates.
The file http://verve.webfactional.com/misc/all_illumina_sra_for_human.tsv.gz,
which has metadata grabbed from the SRA, is used by other scripts. This file
may be reproduced by following these instructions:
a) Download the sqlite database for SRA from
http://gbnci.abcc.ncifcrf.gov/backup/SRAmetadb.sqlite.gz
b) Unpack the file and make sure SRAmetadb.sqlite is in the same
directory as define_and_get_fields_SRA.R .
C) Use define_and_get_fields_SRA.R to create:
all_illumina_sra_for_human.tsv (which we then compressed with gzip)
manifest_file_illumina_sra_human, which is in the hg19 subdirectory
intropolis.v1.hg19.tsv.gz is specified as argument of --junctions. Annotations
are read from arguments of command-line parameter --annotations that specify
paths to the GTFs above.
Each line of intropolis.v1.hg19.tsv.gz specifies a different junction and has
the following tab-separated fields.
1. chromosome
2. start position (1-based inclusive)
3. end position (1-based inclusive)
4. strand (+ or -)
5. 5' motif (GT, GC, or AT)
6. 3' motif (AG or AC)
7. comma-separated list of indexes of samples in which junction was found
8. comma-separated list of counts of reads overlapping junctions in
corresponding sample from field 7. So if field 7 is 4,5,6 and field 8 is
9,10,11 there are 9 reads overlapping the junction in the sample with
index 4, 10 reads overlapping the junction in the sample with index 5, and
11 reads overlapping the junction in the sample with index 6.
Each line of intropolis.idmap.v1.hg19.tsv specifies a different sample
(specifically, run) on SRA and has the following tab-separated fields.
1. sample index
2. project accession number (regex: [SED]RP\d+)
3. sample accession number (regex: [SED]RS\d+)
4. experiment accession number (regex: [SED]RX\d+)
5. run accession number (regex: [SED]RR\d+)
We used PyPy 2.5.0 with GCC 4.9.2 for our Python implementation and ran:
pypy tables.py
--hisat2-dir /path/to/hisat2-2.0.1-beta
--gencode-dir /path/to/directory/with/gencode/gtf.gzs
--junctions /path/to/intropolis.v1.hg19.tsv.gz
--index-to-sra /path/to/intropolis.idmap.v1.hg19.tsv
--seqc /path/to/nbt.2957-S4.zip
--biosample-metadata /path/to/biosample_tags.tsv
--annotation /path/to/annotated_junctions.tsv.gz
Note that the argument of --hisat2-dir is the directory containing the HISAT 2
binary and extract_splice_sites.py.
The following output was obtained. It is included in this repo because this
script cannot easily be rerun to obtain results; the input file
intropolis.v1.hg19.tsv.gz must be provided, and this requires following the
instructions in README.md for its reproduction. Note that an "overlap" below
is an instance where a junction is overlapped by a read. A read that overlaps
two exon-exon junctions contributes two overlaps (or overlap instances).
[basename].annotation_diffint.tsv
Matrix where each row is a GENCODE version i and each column is a GENCODE
version i. Each element is in the format
(|junctions in i - junctions in j|, |junctions in i and j|,
|junctions in j - junctions in i|)
, where - is a set difference.
[basename].seqc_summary.txt
Junction counts from SEQC protocol and Rail for the 1720 samples studied by
with both. See file for details.
[basename].sample_count_submission_date_overlap_geq_20.tsv
Tab-separated fields
1. count of samples in which a given junction was found
2. count of projects in which a given junction was found
3. earliest known discovery date (in units of days after February 27, 2009)
-- this is the earliest known submission date of a sample associated with a
junction
Above, each junction is covered by at least 20 reads per sample.
[basename].[type].stats.tsv, where [type] is in [project, sample]
Tab-separated fields
1. [type] count
2. Number of junctions found in >= field 1 [type]s
3. Number of annotated junctions found in >= field 1 [type]s
4. Number of exonskips found in >= field 1 [type]s (exon skip: both 5' and 3'
splice sites are annotated, but not in the same exon-exon junction)
5. Number of altstartends found in >= field 1 [type]s (altstartend: either 5'
or 3' splice site is annotated, but not both)
6. Number of novel junctions found in >= field 1 [type]s (novel: both 5' and
3' splice sites are unannotated)
7. Number of GT-AG junctions found in >= field 1 [type]s
8. Number of annotated GT-AG junctions found in >= field 1 [type]s
9. Number of GC-AG junctions found in >= field 1 [type]s
10. Number of annotated GC-AG junctions found in >= field 1 [type]s
11. Number of AT-AC junctions found in >= field 1 [type]s
12. Number of annotated AT-AC junctions found in >= field 1 [type]s
[basename].seqc.stats.tsv
Tab-separated fields
1. SEQC sample count
2. Number of junctions found in >= field 1 SEQC samples
3. Number of junctions found by magic and Rail in >= field 1 SEQC samples
4. Number of junctions found by rmake and Rail in >= field 1 SEQC samples
5. Number of junctions found by subread and Rail in >= field 1 SEQC samples
6. Number of junctions found by Rail and exactly one of
{magic, rmake, subread} in >= field 1 samples
7. Number of junctions found by Rail and exactly two of
{magic, rmake, subread} in >= field 1 samples
8. Number of junctions found by Rail and all of {magic, rmake, subread} in
>= field 1 samples
[basename].stats_by_sample.tsv
Tab-separated fields
1. sample index
2. project accession number
3. sample accession number
4. experiment accession number
5. run accession number
6. junction count
7. annotated junction count
8. count of junctions overlapped by at least 5 reads
9. count of annotated junctions overlapped by at least 5 reads
10. total overlap instances
11. total annotated overlap instances
"""
import sys
import gzip
import zipfile
import re
import os
import subprocess
from contextlib import contextmanager
def is_gzipped(filename):
""" Uses gzip magic number to determine whether a file is compressed.
filename: path to file
Return value: True iff file filename is gzipped.
"""
with open(filename, 'rb') as binary_input_stream:
# Check for magic number
if binary_input_stream.read(2) == '\x1f\x8b':
return True
else:
return False
@contextmanager
def xopen(filename):
""" Opens both gzipped and uncompressed files for contextual reading.
filename: path to file to open
Yield value: a gzip.open or open object
"""
if is_gzipped(filename):
f = gzip.open(filename)
else:
f = open(filename)
try:
yield f
finally:
f.close()
if __name__ == '__main__':
import argparse
# Print file's docstring if -h is invoked
parser = argparse.ArgumentParser(description=__doc__,
formatter_class=argparse.RawDescriptionHelpFormatter)
# Add command-line arguments
parser.add_argument('--hisat2-dir', type=str, required=True,
help=('path to directory containing contents of HISAT2; we '
'unpacked ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/'
'downloads/hisat2-2.0.1-beta-Linux_x86_64.zip to get this')
)
parser.add_argument('--annotation', type=str, required=True,
help=('path to annotated_junctions.tsv.gz, which is generated '
'by rip_annotated_junctions.py')
)
parser.add_argument('--gencode-dir', type=str, required=True,
help='path to directory containing all GENCODE GTFs for hg19, '
'which includes 3c, 3d, and 4 through 19'
)
parser.add_argument('--junctions', type=str, required=True,
help='junctions file; this should be intropolis.v1.hg19.tsv.gz'
)
parser.add_argument('--index-to-sra', type=str, required=True,
help='index to SRA accession numbers file; this should be '
'intropolis.idmap.v1.hg19.tsv'
)
parser.add_argument('--biosample-metadata', type=str, required=True,
help='path to Biosample metadata file; this should be '
'biosample_tags.tsv'
)
parser.add_argument('--seqc', type=str, required=True,
help='path to SEQC junctions; this should be nbt.2957-S4.zip')
parser.add_argument('--basename', type=str, required=False,
default='hg19',
help='basename for output files'
)
args = parser.parse_args()
from collections import defaultdict
# Load all annotated junctions from annotated_junctions.tsv.gz
annotated_junctions = set()
annotated_5p = set()
annotated_3p = set()
with xopen(args.annotation) as annotations_stream:
for i, line in enumerate(annotations_stream):
chrom, start, end, strand = line.strip().split('\t')
annotated_junctions.add((chrom, int(start), int(end), strand))
if strand == '+':
annotated_5p.add((chrom, int(start), strand))
annotated_3p.add((chrom, int(end), strand))
elif strand == '-':
annotated_5p.add((chrom, int(end), strand))
annotated_3p.add((chrom, int(start), strand))
print >>sys.stderr, 'Read {} annotated junctions.'.format(i+1)
# Map sample indexes to accession number lines
index_to_sra, index_to_srp, srr_to_index = {}, {}, {}
srs_to_srr = defaultdict(list)
# Get sample indexes for all Illumina RNA-seq from SEQC for comparison
seqc_indexes = set()
with xopen(args.index_to_sra) as index_stream:
for line in index_stream:
partitioned = line.partition('\t')
sample_index = int(partitioned[0])
index_to_sra[sample_index] = partitioned[2].strip()
srp, srs, srx, srr = partitioned[2].strip().split('\t')
srs_to_srr[srs].append(srr)
srr_to_index[srr] = sample_index
index_to_srp[sample_index] = srp
if srp == 'SRP025982':
# SEQC hit!
seqc_indexes.add(sample_index)
print >>sys.stderr, 'Done mapping sample indexes to samples.'
from datetime import date
'''For getting junctions by "earliest detection date"; use units of number
of days after earliest date. Map sample indexes to submission dates.'''
all_dates = {}
with xopen(args.biosample_metadata) as biosample_stream:
biosample_stream.readline() # header
for line in biosample_stream:
tokens = line.strip().split('\t')
current_date = date(
*tuple(
[int(el.strip())
for el in tokens[10].split('T')[0].split('-')]
)
)
for srr in srs_to_srr[tokens[9].upper()]:
all_dates[srr_to_index[srr]] = current_date
earliest_date = min(all_dates.values())
for sample_index in all_dates:
all_dates[sample_index] = (
all_dates[sample_index] - earliest_date
).days
date_indexes = set(all_dates.keys())
print >>sys.stderr, 'Done grabbing submission dates from Biosample DB.'
# Grab all GENCODE junctions
gencodes = defaultdict(set)
refs = set(
['chr' + str(i) for i in xrange(1, 23)] + ['chrM', 'chrX', 'chrY']
)
extract_splice_sites_path = os.path.join(args.hisat2_dir,
'extract_splice_sites.py')
from glob import glob
annotations = glob(os.path.join(args.gencode_dir, 'gencode.*.gtf.gz'))
annotations = [(os.path.basename(annotation_path), annotation_path)
for annotation_path in annotations]
for annotation_base, annotation in annotations:
extract_process = subprocess.Popen(' '.join([
sys.executable,
extract_splice_sites_path,
annotation
if not is_gzipped(
annotation
) else ('<(gzip -cd %s)'
% annotation
)
]),
shell=True,
executable='/bin/bash',
stdout=subprocess.PIPE
)
gencode_version = annotation.split('.')[1]
for line in extract_process.stdout:
tokens = line.strip().split('\t')
tokens[1] = int(tokens[1]) + 2
tokens[2] = int(tokens[2])
if tokens[0] in refs:
gencodes[gencode_version].add(tuple(tokens))
extract_process.stdout.close()
exit_code = extract_process.wait()
if exit_code != 0:
raise RuntimeError(
'extract_splice_sites.py had nonzero exit code {}.'.format(
exit_code
)
)
gencode_versions = ['3c', '3d'] + [str(ver) for ver in range(4, 20)]
# Write some differences/intersections
with open(
args.basename + '.annotation_diffint.tsv', 'w'
) as intersect_stream:
print >>intersect_stream, '\t'.join([''] + gencode_versions)
for i in gencode_versions:
intersect_stream.write(i + '\t')
print >>intersect_stream, '\t'.join([
','.join([
str(len(gencodes[i] - gencodes[j])),
str(len(gencodes[i].intersection(gencodes[j]))),
str(len(gencodes[j] - gencodes[i]))])
for j in gencode_versions
])
print >>sys.stderr, ('Found {} annotated junctions across '
'GENCODE versions.').format(
{ version : len(gencodes[version])
for version in gencodes }
)
'''Grab SEQC junctions. Three protocols were used: Subread, r-make, and
NCBI Magic.'''
magic_junctions, rmake_junctions, subread_junctions = set(), set(), set()
seqc_junctions = set()
with zipfile.ZipFile(args.seqc).open('SupplementaryData3.tab') \
as seqc_stream:
seqc_stream.readline() # header
for line in seqc_stream:
tokens = line.strip().split('\t')
tokens[0] = tokens[0].split('.')
junction = (tokens[0][0], int(tokens[0][1]), int(tokens[0][2]))
add_junc = False
if tokens[1] == '1':
subread_junctions.add(junction)
add_junc = True
if tokens[2] == '1':
rmake_junctions.add(junction)
add_junc = True
if tokens[3] == '1':
magic_junctions.add(junction)
add_junc = True
if add_junc:
seqc_junctions.add(junction)
print >>sys.stderr, 'Done reading SEQC junctions.'
# Key: sample index; value: number of junctions found in sample
junction_counts = defaultdict(int)
# For junctions in union of annotations specified at command line
annotated_junction_counts = defaultdict(int)
# Count total overlap instances and annotated overlap instances
'''Same as above, but including only junctions covered by at least 5 reads
in the sample.'''
junction_counts_geq_5 = defaultdict(int)
annotated_junction_counts_geq_5 = defaultdict(int)
overlap_counts = defaultdict(int)
annotated_overlap_counts = defaultdict(int)
# Mapping counts of samples to junction counts
sample_count_to_junction_count = defaultdict(int)
project_count_to_junction_count = defaultdict(int)
sample_count_to_GTAG_junction_count = defaultdict(int)
project_count_to_GTAG_junction_count = defaultdict(int)
sample_count_to_GCAG_junction_count = defaultdict(int)
project_count_to_GCAG_junction_count = defaultdict(int)
sample_count_to_ATAC_junction_count = defaultdict(int)
project_count_to_ATAC_junction_count = defaultdict(int)
sample_count_to_GTAG_ann_count = defaultdict(int)
project_count_to_GTAG_ann_count = defaultdict(int)
sample_count_to_GCAG_ann_count = defaultdict(int)
project_count_to_GCAG_ann_count = defaultdict(int)
sample_count_to_ATAC_ann_count = defaultdict(int)
project_count_to_ATAC_ann_count = defaultdict(int)
# One of 5' or 3' splice site is in annotation, one isn't
sample_count_to_altstartend_junction_count = defaultdict(int)
project_count_to_altstartend_junction_count = defaultdict(int)
# Both 5' and 3' splice sites are in annotation, but junction is not
sample_count_to_exonskip_junction_count = defaultdict(int)
project_count_to_exonskip_junction_count = defaultdict(int)
# Full junction is in annotation
sample_count_to_annotated_junction_count = defaultdict(int)
project_count_to_annotated_junction_count = defaultdict(int)
# Neither 5' nor 3' is in annotation
sample_count_to_novel_junction_count = defaultdict(int)
project_count_to_novel_junction_count = defaultdict(int)
# For comparison wth SEQC
rail_seqc_junctions = set()
seqc_sample_count_to_junction_count = defaultdict(int)
seqc_sample_count_to_magic = defaultdict(int)
seqc_sample_count_to_rmake = defaultdict(int)
seqc_sample_count_to_subread = defaultdict(int)
seqc_sample_count_to_ones = defaultdict(int)
seqc_sample_count_to_twos = defaultdict(int)
seqc_sample_count_to_threes = defaultdict(int)
# For junction-date analyses
date_to_junction_count = defaultdict(int)
date_to_junction_count_overlap_geq_20 = defaultdict(int)
with xopen(args.junctions) as junction_stream, gzip.open(
args.basename
+ '.sample_count_submission_date_overlap_geq_20.tsv.gz', 'w'
) as junction_date_stream:
print >>junction_date_stream, ((
'# reads across samples in which junction '
'was found\t'
'# samples in which junction was found'
'\t# projects in which junction was found'
'\tearliest known discovery date in '
'days after %s; format Y-M-D\t') % (
earliest_date.strftime(
'%Y-%m-%d'
)
)) + '\t'.join(
['present in GENCODE v' + ver
for ver in gencode_versions]
) + '\tearliest GENCODE version'
for line in junction_stream:
tokens = line.strip().split('\t')
junction = (tokens[0], int(tokens[1]), int(tokens[2]), tokens[3])
if tokens[3] == '+':
fivep = junction[:2] + (junction[3],)
threep = (junction[0], junction[2], junction[3])
elif tokens[3] == '-':
threep = junction[:2] + (junction[3],)
fivep = (junction[0], junction[2], junction[3])
else:
raise RuntimeError('Bad strand in line "%s"' % line)
samples = [int(el) for el in tokens[-2].split(',')]
coverages = [int(el) for el in tokens[-1].split(',')]
sample_count = len(samples)
project_count = len(set([index_to_srp[sample]
for sample in samples]))
try:
discovery_date = min(
[all_dates[sample] for sample in samples
if sample in date_indexes]
)
except ValueError:
# No discovery date available!
pass
else:
date_to_junction_count[discovery_date] += 1
cov_sum = sum(coverages)
if cov_sum >= 20:
date_to_junction_count_overlap_geq_20[discovery_date] += 1
gencode_bools_to_print = [
'1' if junction in gencodes[ver]
else '0' for ver in gencode_versions
]
try:
earliest_gencode_version = gencode_versions[
gencode_bools_to_print.index('1')
]
except ValueError:
earliest_gencode_version = 'NA'
print >>junction_date_stream, ('%d\t%d\t%d\t%d\t' % (
cov_sum,
sample_count,
project_count,
discovery_date
)) + '\t'.join(gencode_bools_to_print) + (
'\t' + earliest_gencode_version
)
samples_and_coverages = zip(samples, coverages)
sample_count_to_junction_count[sample_count] += 1
project_count_to_junction_count[project_count] += 1
if tokens[5] == 'AG':
if tokens[4] == 'GT':
sample_count_to_GTAG_junction_count[sample_count] += 1
project_count_to_GTAG_junction_count[project_count] += 1
elif tokens[4] == 'GC':
sample_count_to_GCAG_junction_count[sample_count] += 1
project_count_to_GCAG_junction_count[project_count] += 1
else:
raise RuntimeError('Bad motif in line "%s"' % line)
elif tokens[5] == 'AC':
if tokens[4] == 'AT':
sample_count_to_ATAC_junction_count[sample_count] += 1
project_count_to_ATAC_junction_count[project_count] += 1
else:
raise RuntimeError('Bad motif in line "%s"' % line)
if junction in annotated_junctions:
sample_count_to_annotated_junction_count[sample_count] += 1
project_count_to_annotated_junction_count[project_count] += 1
if tokens[5] == 'AG':
if tokens[4] == 'GT':
sample_count_to_GTAG_ann_count[sample_count] += 1
project_count_to_GTAG_ann_count[project_count] += 1
elif tokens[4] == 'GC':
sample_count_to_GCAG_ann_count[sample_count] += 1
project_count_to_GCAG_ann_count[project_count] += 1
elif tokens[5] == 'AC':
sample_count_to_ATAC_ann_count[sample_count] += 1
project_count_to_ATAC_ann_count[project_count] += 1
for sample, coverage in samples_and_coverages:
annotated_junction_counts[sample] += 1
annotated_overlap_counts[sample] += coverage
if coverage >= 5:
annotated_junction_counts_geq_5[sample] += 1
elif threep in annotated_3p:
if fivep in annotated_5p:
sample_count_to_exonskip_junction_count[sample_count] += 1
project_count_to_exonskip_junction_count[
project_count
] += 1
else:
sample_count_to_altstartend_junction_count[sample_count] \
+= 1
project_count_to_altstartend_junction_count[
project_count
] += 1
elif fivep in annotated_5p:
sample_count_to_altstartend_junction_count[sample_count] += 1
project_count_to_altstartend_junction_count[project_count] += 1
else:
sample_count_to_novel_junction_count[sample_count] += 1
project_count_to_novel_junction_count[project_count] += 1
seqc_intersect = set(samples).intersection(seqc_indexes)
if seqc_intersect:
junction = junction[:-1]
rail_seqc_junctions.add(junction)
seqc_sample_count = len(seqc_intersect)
seqc_sample_count_to_junction_count[seqc_sample_count] += 1
intersect_count = 0
if junction in magic_junctions:
seqc_sample_count_to_magic[seqc_sample_count] += 1
intersect_count += 1
if junction in rmake_junctions:
seqc_sample_count_to_rmake[seqc_sample_count] += 1
intersect_count += 1
if junction in subread_junctions:
seqc_sample_count_to_subread[seqc_sample_count] += 1
intersect_count += 1
if intersect_count == 1:
seqc_sample_count_to_ones[seqc_sample_count] += 1
elif intersect_count == 2:
seqc_sample_count_to_twos[seqc_sample_count] += 1
elif intersect_count == 3:
seqc_sample_count_to_threes[seqc_sample_count] += 1
for sample, coverage in samples_and_coverages:
junction_counts[sample] += 1
overlap_counts[sample] += coverage
if coverage >= 5:
junction_counts_geq_5[sample] += 1
print >>sys.stderr, 'Done reading junction file.'
'''Aggregate junction stats: how many junctions/overlaps of given type
are found in >= K samples/projects/seqc samples?'''
sample_stats_to_aggregate = [sample_count_to_junction_count,
sample_count_to_annotated_junction_count,
sample_count_to_exonskip_junction_count,
sample_count_to_altstartend_junction_count,
sample_count_to_novel_junction_count,
sample_count_to_GTAG_junction_count,
sample_count_to_GTAG_ann_count,
sample_count_to_GCAG_junction_count,
sample_count_to_GCAG_ann_count,
sample_count_to_ATAC_junction_count,
sample_count_to_ATAC_ann_count]
project_stats_to_aggregate = [project_count_to_junction_count,
project_count_to_annotated_junction_count,
project_count_to_exonskip_junction_count,
project_count_to_altstartend_junction_count,
project_count_to_novel_junction_count,
project_count_to_GTAG_junction_count,
project_count_to_GTAG_ann_count,
project_count_to_GCAG_junction_count,
project_count_to_GCAG_ann_count,
project_count_to_ATAC_junction_count,
project_count_to_ATAC_ann_count]
seqc_stats_to_aggregate = [seqc_sample_count_to_junction_count,
seqc_sample_count_to_magic,
seqc_sample_count_to_rmake,
seqc_sample_count_to_subread,
seqc_sample_count_to_ones,
seqc_sample_count_to_twos,
seqc_sample_count_to_threes]
header_prototype = ('min {descriptor}s\t'
'junctions\t'
'annotated\t'
'exonskips\t'
'altstartend\t'
'novel\t'
'GTAG\t'
'annotated GTAG\t'
'GCAG\t'
'annotated GCAG\t'
'ATAC\t'
'annotated ATAC')
seqc_header = ('min seqc samples\t'
'junctions\t'
'magic junctions\t'
'rmake junctions\t'
'subread junctions\t'
'exactly one of {magic, rmake, subread} junctions\t'
'exactly two of {magic, rmake, subread} junctions\t'
'all three of {magic, rmake, subread} junctions')
for stats, header, descriptor in [
(sample_stats_to_aggregate,
header_prototype.format(descriptor='sample'),
'sample'),
(project_stats_to_aggregate,
header_prototype.format(descriptor='project'),
'project'),
(seqc_stats_to_aggregate,
seqc_header,
'seqc_sample')
]:
max_count, min_count = 0, 1000000000 # way larger than max # samples
for stat in stats:
max_count = max(stat.keys() + [max_count])
min_count = min(stat.keys() + [min_count])
stat_count = len(stats)
stat_aggregators = [0 for _ in xrange(stat_count)]
with open(args.basename + '.' + descriptor + '.stats.tsv', 'w') \
as stat_stream:
print >>stat_stream, header
for descriptor_count in xrange(max_count, min_count - 1, -1):
for i in xrange(stat_count):
stat_aggregators[i] += stats[i][descriptor_count]
print >>stat_stream, '\t'.join(
[str(descriptor_count)]
+ [str(el) for el in stat_aggregators]
)
print >>sys.stderr, ('Dumped sample/project-level and SEQC '
'aggregate junction stats.')
# Dump junction information by sample
with open(args.basename + '.stats_by_sample.tsv', 'w') as stat_stream:
print >>stat_stream, ('sample index\tproject\tsample\texperiment\trun'
'\tjunctions\tannotated_junctions'
'\tjunctions_geq_5\tannotated_junctions_geq_5'
'\toverlaps\tannotated_overlaps')
for sample_index in sorted(index_to_sra.keys()):
print >>stat_stream, '\t'.join(
[str(el) for el in
[sample_index, index_to_sra[sample_index],
junction_counts[sample_index],
annotated_junction_counts[sample_index],
junction_counts_geq_5[sample_index],
annotated_junction_counts_geq_5[sample_index],
overlap_counts[sample_index],
annotated_overlap_counts[sample_index]]]
)
print >>sys.stderr, 'Dumped junction info by sample.'
# SEQC summary
with open(args.basename + '.seqc_summary.txt', 'w') as seqc_stream:
in_all = set.intersection(
magic_junctions, rmake_junctions, subread_junctions
)
in_one = set.union(
magic_junctions, rmake_junctions, subread_junctions
)
in_two = set.union(
set.intersection(magic_junctions, rmake_junctions),
set.intersection(magic_junctions, subread_junctions),
set.intersection(rmake_junctions, subread_junctions)
)
print >>seqc_stream, (
'total samples studied by SEQC consortium and Rail: %d'
% len(seqc_indexes)
)
print >>seqc_stream, (
'junctions found by magic, rmake, and subread: %d'
% len(in_all)
)
print >>seqc_stream, (
'junctions found by magic, rmake, or subread: %d'
% len(in_one)
)
print >>seqc_stream, (
'junctions found by at least two of '
'[magic, rmake, subread]: %d'
) % len(in_two)
print >>seqc_stream, (
'junctions found by Rail: %d' % len(rail_seqc_junctions)
)
print >>sys.stderr, 'Dumped SEQC summary.'