e55923d Sep 8, 2016
@nellore @mikelove

intropolis

intropolis is a list of exon-exon junctions found across 21,504 human RNA-seq samples on the Sequence Read Archive (SRA), identified by spliced-read alignment to hg19 with Rail-RNA. Five files are provided:

A. intropolis.v1.hg19.tsv.gz : a 6.6-GB gzipped TSV (18.3 GB uncompressed) with fields

  1. chromosome
  2. intron start position (1-based; inclusive)
  3. intron end position (1-based; inclusive)
  4. strand (+ or -)
  5. donor dinucleotide (e.g., GT)
  6. acceptor dinucleotide (e.g., AG)
  7. comma-separated list of indexes of samples in which junction was found
  8. comma-separated list of corresponding numbers of reads mapping across junction in samples from field 7
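
For illustration, the eight fields above could be read one record at a time rather than loading the 18.3 GB file into memory. This is a minimal sketch, not part of intropolis itself: the names `Junction`, `parse_junction`, and `read_junctions` are hypothetical, and it assumes the file has no header line and that the sample indexes in field 7 are integers.

```python
import gzip
from typing import Iterator, List, NamedTuple


class Junction(NamedTuple):
    """One record of intropolis.v1.hg19.tsv.gz (hypothetical container)."""
    chrom: str
    start: int            # 1-based, inclusive
    end: int              # 1-based, inclusive
    strand: str           # "+" or "-"
    donor: str            # e.g. "GT"
    acceptor: str         # e.g. "AG"
    samples: List[int]    # field 7 (assumed to be integer indexes)
    coverages: List[int]  # field 8, parallel to `samples`


def parse_junction(line: str) -> Junction:
    """Split one tab-separated line into typed fields."""
    fields = line.rstrip("\n").split("\t")
    chrom, start, end, strand, donor, acceptor, samples, coverages = fields[:8]
    return Junction(
        chrom, int(start), int(end), strand, donor, acceptor,
        [int(s) for s in samples.split(",")],
        [int(c) for c in coverages.split(",")],
    )


def read_junctions(path: str) -> Iterator[Junction]:
    """Stream records from the gzipped TSV without decompressing it to disk."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            yield parse_junction(line)
```

Since fields 7 and 8 are parallel lists, `dict(zip(j.samples, j.coverages))` gives the read count per sample index for a junction `j`.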

B. intropolis.idmap.v1.hg19.tsv : a small TSV with fields

  1. sample index used in field 7 of intropolis.v1.hg19.tsv.gz
  2. SRA project accession number
  3. SRA sample accession number
  4. SRA experiment accession number
  5. SRA run accession number
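
The ID map can be loaded into a dictionary so that sample indexes from field 7 of the junction file resolve to SRA accessions. A minimal sketch, assuming no header line and integer sample indexes; `SampleAccessions` and `load_idmap` are hypothetical names introduced here.

```python
import csv
from typing import Dict, Iterable, NamedTuple


class SampleAccessions(NamedTuple):
    """SRA accessions for one sample index (hypothetical container)."""
    project: str
    sample: str
    experiment: str
    run: str


def load_idmap(lines: Iterable[str]) -> Dict[int, SampleAccessions]:
    """Map sample index (field 1) to the accessions in fields 2-5."""
    idmap: Dict[int, SampleAccessions] = {}
    for row in csv.reader(lines, delimiter="\t"):
        if not row:
            continue  # tolerate blank lines
        idmap[int(row[0])] = SampleAccessions(*row[1:5])
    return idmap
```

An open file handle can be passed directly, e.g. `with open("intropolis.idmap.v1.hg19.tsv") as fh: idmap = load_idmap(fh)`.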

C. intropolis.v1.hg19.bed.gz : a gzipped BED-formatted version of intropolis.v1.hg19.tsv.gz with fields

  1. chromosome
  2. intron start position (0-based; inclusive)
  3. intron end position (0-based; exclusive)
  4. name (junction_[line number])
  5. score (always 1000)
  6. strand (+ or -)
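
The TSV and BED files differ only in coordinate convention: a 1-based inclusive interval [start, end] becomes the 0-based half-open interval [start - 1, end). The sketch below shows that conversion for one record; `bed_line` is a hypothetical helper, and the line number is taken as a parameter because the numbering base used in the name field is not stated above.

```python
def bed_line(chrom: str, start: int, end: int, strand: str, line_number: int) -> str:
    """Format a 1-based inclusive intron as a six-column BED line."""
    bed_start = start - 1  # 1-based inclusive start -> 0-based inclusive start
    bed_end = end          # 1-based inclusive end == 0-based exclusive end
    return "\t".join([
        chrom,
        str(bed_start),
        str(bed_end),
        f"junction_{line_number}",
        "1000",  # score is always 1000
        strand,
    ])
```

The intron length is unchanged by the conversion: `end - start + 1` in TSV coordinates equals `bed_end - bed_start` in BED coordinates.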

D. intropolis.v1.hg19.bb : a bigBed version of intropolis.v1.hg19.bed.gz

E. intropolis.v1.hg19_with_liftover_to_hg38.tsv.gz : a 6.87-GB gzipped TSV (18.3 GB uncompressed) with fields

  1. chromosome
  2. intron start position (1-based; inclusive)
  3. intron end position (1-based; inclusive)
  4. strand (+ or -)
  5. donor dinucleotide (e.g., GT)
  6. acceptor dinucleotide (e.g., AG)
  7. comma-separated list of indexes of samples in which junction was found
  8. comma-separated list of corresponding numbers of reads mapping across junction in samples from field 7
  9. chromosome from liftover to hg38 or NA if unavailable
  10. start position in liftover to hg38 or NA if unavailable
  11. end position in liftover to hg38 or NA if unavailable
  12. strand in liftover to hg38 or NA if unavailable
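
Because fields 9 through 12 hold the literal string NA when a junction could not be lifted over, downstream code needs to check for it before converting positions to integers. A minimal sketch, with `parse_liftover` as a hypothetical helper that returns `None` for unmapped junctions:

```python
from typing import Optional, Sequence, Tuple


def parse_liftover(fields: Sequence[str]) -> Optional[Tuple[str, int, int, str]]:
    """Return (chrom, start, end, strand) on hg38, or None if any field is NA.

    `fields` is one line of the 12-column file already split on tabs.
    """
    chrom, start, end, strand = fields[8:12]
    if "NA" in (chrom, start, end, strand):
        return None
    return chrom, int(start), int(end), strand
```

For example, `parse_liftover(line.rstrip("\n").split("\t"))` can be used to skip junctions with no hg38 coordinates while streaming the file.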

Liftover of junctions to hg38 in intropolis.v1.hg19_with_liftover_to_hg38.tsv.gz was performed with the UCSC liftOver executable with command-line parameters -ends=2 -minMatch=1.0 and may be reproduced using this script together with intropolis.v1.hg19.tsv.gz.

Metadata on SRA specifying, e.g., tissue and cell type is incomplete and lacks a controlled vocabulary. Some is available in this file, derived from the fantastic SRAdb R package by Jack Zhu and Sean Davis. Still more metadata taken from BioSample is available in this file. Probably the best effort to infer metadata for SRA RNA-seq (with a controlled vocabulary for tissues!) is SHARQ, by Darya Filippova while in Carl Kingsford's group.

Expect new versions of intropolis spanning more samples as they are added to SRA. If you use intropolis, cite Human splicing diversity across the Sequence Read Archive, by