intropolis

PLEASE NOTE: Snaptron by Wilks et al. is a query tool for making sense of splicing across thousands of RNA-seq samples. It subsumes intropolis v1. If you are looking for the raw data behind Snaptron, see http://snaptron.cs.jhu.edu/data/. In particular, the SQLite files comprising an "intropolis v2" compilation spanning ~50,000 RNA-seq samples on SRA are available at http://snaptron.cs.jhu.edu/data/srav2/.

intropolis is a list of exon-exon junctions found across 21,504 human RNA-seq samples on the Sequence Read Archive (SRA), identified by spliced read alignment to hg19 with Rail-RNA. Five files are provided:

A. intropolis.v1.hg19.tsv.gz : a 6.6-GB gzipped TSV (18.3 GB uncompressed) with fields

  1. chromosome
  2. intron start position (1-based; inclusive)
  3. intron end position (1-based; inclusive)
  4. strand (+ or -)
  5. donor dinucleotide (e.g., GT)
  6. acceptor dinucleotide (e.g., AG)
  7. comma-separated list of indexes of samples in which junction was found
  8. comma-separated list of corresponding numbers of reads mapping across junction in samples from field 7
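The eight fields above can be read one line at a time, which matters for a file that is 18.3 GB uncompressed. The following is a minimal Python sketch, not part of intropolis itself: the names `Junction`, `parse_junction`, and `iter_junctions` are hypothetical, the example line is made up, and the code assumes the file has no header line.

```python
import gzip
from typing import Dict, Iterator, NamedTuple


class Junction(NamedTuple):
    chrom: str
    start: int      # intron start, 1-based inclusive
    end: int        # intron end, 1-based inclusive
    strand: str
    donor: str
    acceptor: str
    coverage: Dict[int, int]  # sample index -> reads across the junction


def parse_junction(line: str) -> Junction:
    """Parse one tab-separated line; only the first eight fields are read."""
    fields = line.rstrip("\n").split("\t")
    chrom, start, end, strand, donor, acceptor, samples, counts = fields[:8]
    sample_ids = [int(s) for s in samples.split(",")]
    read_counts = [int(c) for c in counts.split(",")]
    if len(sample_ids) != len(read_counts):
        raise ValueError("fields 7 and 8 must have the same number of entries")
    return Junction(chrom, int(start), int(end), strand, donor, acceptor,
                    dict(zip(sample_ids, read_counts)))


def iter_junctions(path: str) -> Iterator[Junction]:
    """Stream junctions from the gzipped TSV without decompressing to disk."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            yield parse_junction(line)


# Made-up example line, not taken from the real file (assumption).
example = "chr1\t12228\t12612\t+\tGT\tAG\t5,17,203\t2,1,9\n"
junction = parse_junction(example)
print(junction.coverage)       # {5: 2, 17: 1, 203: 9}
print(len(junction.coverage))  # 3 samples contain this junction
```

Pairing fields 7 and 8 into a single mapping keeps each sample index next to its read count, so filters such as "junctions seen in at least N samples" reduce to `len(junction.coverage) >= N`.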

B. intropolis.idmap.v1.hg19.tsv : a small TSV with fields

  1. sample index used in field 7 of intropolis.v1.hg19.tsv.gz
  2. SRA project accession number
  3. SRA sample accession number
  4. SRA experiment accession number
  5. SRA run accession number
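To turn the sample indexes in field 7 of the junction file back into SRA accessions, the ID map can be loaded into a dictionary keyed by sample index. This is a sketch under assumptions: `SampleIds` and `load_idmap` are hypothetical names, the file is assumed to have no header row, and the accessions in the example are invented placeholders.

```python
import csv
import io
from typing import Dict, Iterable, NamedTuple


class SampleIds(NamedTuple):
    project: str     # SRA project accession
    sample: str      # SRA sample accession
    experiment: str  # SRA experiment accession
    run: str         # SRA run accession


def load_idmap(lines: Iterable[str]) -> Dict[int, SampleIds]:
    """Map each sample index (field 1) to its four SRA accessions."""
    idmap: Dict[int, SampleIds] = {}
    for row in csv.reader(lines, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        idmap[int(row[0])] = SampleIds(*row[1:5])
    return idmap


# Invented placeholder accessions, for illustration only (assumption).
fake_idmap = io.StringIO(
    "0\tSRP000001\tSRS000001\tSRX000001\tSRR000001\n"
    "1\tSRP000001\tSRS000002\tSRX000002\tSRR000002\n"
)
idmap = load_idmap(fake_idmap)
print(idmap[1].run)  # SRR000002
```

With the real file, `load_idmap(open("intropolis.idmap.v1.hg19.tsv"))` would give the same kind of dictionary, and each index from field 7 can then be looked up directly.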

C. intropolis.v1.hg19.bed.gz : a gzipped BED-formatted version of intropolis.v1.hg19.tsv.gz with fields

  1. chromosome
  2. intron start position (0-based; inclusive)
  3. intron end position (0-based; exclusive)
  4. name (junction_[line number])
  5. score (always 1000)
  6. strand (+ or -)
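The TSV and BED files describe the same introns in two coordinate conventions: 1-based with both ends inclusive versus 0-based with an exclusive end. Converting between them only shifts the start by one. The helper names below are hypothetical, and whether the line number in the `junction_[line number]` name starts at 0 or 1 is not stated in this README, so it is passed in as a parameter.

```python
from typing import Tuple


def tsv_to_bed_interval(start: int, end: int) -> Tuple[int, int]:
    """1-based inclusive [start, end] -> 0-based half-open [start - 1, end)."""
    return start - 1, end


def bed_to_tsv_interval(bed_start: int, bed_end: int) -> Tuple[int, int]:
    """0-based half-open [bed_start, bed_end) -> 1-based inclusive."""
    return bed_start + 1, bed_end


def bed_line(chrom: str, start: int, end: int, strand: str,
             line_number: int) -> str:
    """Format one six-field BED line from 1-based inclusive TSV coordinates."""
    bed_start, bed_end = tsv_to_bed_interval(start, end)
    return "\t".join([chrom, str(bed_start), str(bed_end),
                      f"junction_{line_number}", "1000", strand])


# Made-up coordinates, for illustration only (assumption).
print(tsv_to_bed_interval(12228, 12612))  # (12227, 12612)
print(bed_line("chr1", 12228, 12612, "+", 1))
```

In both conventions the intron length is unchanged: `end - start + 1` in the TSV equals `bed_end - bed_start` in the BED file.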

D. intropolis.v1.hg19.bb : a bigBed version of intropolis.v1.hg19.bed.gz

E. intropolis.v1.hg19_with_liftover_to_hg38.tsv.gz : a 6.87-GB gzipped TSV (18.3 GB uncompressed) with fields

  1. chromosome
  2. intron start position (1-based; inclusive)
  3. intron end position (1-based; inclusive)
  4. strand (+ or -)
  5. donor dinucleotide (e.g., GT)
  6. acceptor dinucleotide (e.g., AG)
  7. comma-separated list of indexes of samples in which junction was found
  8. comma-separated list of corresponding numbers of reads mapping across junction in samples from field 7
  9. chromosome from liftover to hg38 or NA if unavailable
  10. start position in liftover to hg38 or NA if unavailable
  11. end position in liftover to hg38 or NA if unavailable
  12. strand in liftover to hg38 or NA if unavailable
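Fields 9 through 12 hold the literal string NA when a junction could not be lifted over, so they need a check before being converted to integers. The sketch below uses hypothetical names (`Hg38Interval`, `parse_liftover`) and made-up example lines; it stores the hg38 positions as given, since this README does not restate their coordinate convention.

```python
from typing import List, NamedTuple, Optional


class Hg38Interval(NamedTuple):
    chrom: str
    start: int
    end: int
    strand: str


def parse_liftover(fields: List[str]) -> Optional[Hg38Interval]:
    """Return the hg38 interval from fields 9-12, or None if any is NA."""
    chrom, start, end, strand = fields[8:12]
    if "NA" in (chrom, start, end, strand):
        return None
    return Hg38Interval(chrom, int(start), int(end), strand)


# Made-up lines, for illustration only (assumption).
lifted = "chr1\t12228\t12612\t+\tGT\tAG\t5\t2\tchr1\t12228\t12612\t+"
failed = "chr1\t12228\t12612\t+\tGT\tAG\t5\t2\tNA\tNA\tNA\tNA"
print(parse_liftover(lifted.split("\t")))
print(parse_liftover(failed.split("\t")))  # None
```

Returning `None` rather than a partially filled record makes it easy to count or skip junctions without an hg38 counterpart.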

(If the links above don't work for you, check out the backup on Figshare.)

Liftover of junctions to hg38 in intropolis.v1.hg19_with_liftover_to_hg38.tsv.gz was performed with the UCSC liftOver executable with command-line parameters -ends=2 -minMatch=1.0 and may be reproduced using this script together with intropolis.v1.hg19.tsv.gz.
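For orientation, the sketch below assembles a liftOver invocation with the two parameters named above. It is an illustration only and not the referenced script: `liftover_command` is a hypothetical helper, all file names are placeholders, and it relies on the standard liftOver argument order (input, chain file, output, unmapped output). liftOver reads BED-style input by default, so the TSV coordinates would presumably need converting first; how the original script handles that is not shown here.

```python
from typing import List


def liftover_command(bed_in: str, chain: str, bed_out: str,
                     unmapped: str) -> List[str]:
    """Build the argument list for UCSC liftOver with the README's options."""
    return [
        "liftOver",
        "-ends=2",        # option named in this README
        "-minMatch=1.0",  # option named in this README
        bed_in,           # junctions to lift (placeholder name)
        chain,            # hg19-to-hg38 chain file (placeholder name)
        bed_out,          # successfully lifted records
        unmapped,         # records that could not be lifted
    ]


cmd = liftover_command("junctions.hg19.bed", "hg19ToHg38.over.chain.gz",
                       "junctions.hg38.bed", "junctions.unmapped.bed")
print(" ".join(cmd))
```

If liftOver is installed and the placeholder files exist, the list can be passed to `subprocess.run(cmd, check=True)`.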

SRA metadata specifying, e.g., tissue and cell type is incomplete and lacks a controlled vocabulary. Some is available in this file, derived from the fantastic SRAdb R package by Jack Zhu and Sean Davis. Still more metadata taken from BioSample is available in this file. But probably the best effort to infer metadata for SRA RNA-seq (with a controlled vocabulary for tissues!) is SHARQ, by Darya Filippova while in Carl Kingsford's group.

Expect new versions of intropolis spanning more samples as they are added to SRA. If you use intropolis, cite Human splicing diversity across the Sequence Read Archive, by