#RegionAnnotator
##Source
https://github.com/ivankosmos/RegionAnnotator
##Data input needed
Daner clumps or similar, generated from PGC GWAS summary statistics.
##What the program does
Creates a number of tables with gene entries (rows) and psychiatry-related annotations (columns). Includes GENCODE location, gene-based p-values, SFARI ASD annotation, GENCODE annotation, GWAS catalog annotation, OMIM annotation and manual curation, among other things. Genes/regions are annotated based on:
- protein_coding_genes: an expanded (10 Mbases) segment overlap condition, and a distance condition (distance<100Kb).
- omim, asd_genes, id_devdelay_genes, mouse_knockout: an expanded (10 Mbases) segment overlap condition, a distance condition (distance<100Kb), and a gene name comparison condition.
- gwas_catalog, psychiatric_cnvs: a segment overlap condition.
##What the data output will look like
Multiple .csv/.tsv files, one MS Excel file, or one JSON file. The program also creates a Java H2 database file that can be accessed directly instead of exporting files.
##Algorithm
- The input data is read from its source format.
- The input data is checked against its pre-configured templates, then configured and completed from them. Templates contain information on table columns, column data types and output formatting.
- The input data rows are read into tables. Reference data is read into tables named after their file names or Excel sheets, prefixed by `_`. Gene data is read into a table named GENE_MASTER. User data is read into a table named _USER_INPUT.
- Templated columns are indexed for improved database performance.
After every user input, the program runs the following operations:
- TwoSegmentOverlapCondition(a0,a1,b0,b1) =
((a0<=b0 AND b0<=a1) OR (a0<=b1 AND b1<=a1) OR (b0<=a0 AND a0<=b1) OR (b0<=a1 AND a1<=b1))
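
As an illustration, the overlap condition can be written as a small Python function. This is a sketch and not the program's own code, which evaluates the condition in SQL. The function name `two_segment_overlap` is hypothetical, and coordinates are assumed to be inclusive integers with `a0 <= a1` and `b0 <= b1`.

```python
def two_segment_overlap(a0: int, a1: int, b0: int, b1: int) -> bool:
    """Return True if the closed segments [a0, a1] and [b0, b1] share a point.

    Mirrors the four clauses of TwoSegmentOverlapCondition term by term.
    """
    return (
        (a0 <= b0 <= a1)      # b starts inside a
        or (a0 <= b1 <= a1)   # b ends inside a
        or (b0 <= a0 <= b1)   # a starts inside b
        or (b0 <= a1 <= b1)   # a ends inside b
    )
```

For well-ordered segments the four clauses reduce to `a0 <= b1 and b0 <= a1`; the longer form is kept here only to match the condition as documented.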
- Computes an enriched version of the user input in the table USER_INPUT.
- location : A coordinate composed of the chromosome (chr) and the basepair coordinates (bp1, bp2), formatted as a string. A comma is used as a thousands separator (every three digits) in the basepair coordinates.
- UCSC_LINK : A (MS Excel) hyperlink to UCSC Genome Browser on Human Feb. 2009 (GRCh37/hg19) Assembly.
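
The two derived columns could be produced roughly as follows. This is a sketch under assumptions: the exact layout of the location string (`chr<N>:<bp1>-<bp2>`), the function names, and the link label are hypothetical and not taken from the program. The URL uses the UCSC Genome Browser `hgTracks` endpoint with `db=hg19`, and the link is expressed with Excel's `HYPERLINK` formula.

```python
def format_location(chrom: str, bp1: int, bp2: int) -> str:
    """Format a region as a string with comma thousands separators.

    The 'chr<N>:<bp1>-<bp2>' layout is an assumption for illustration.
    """
    return f"chr{chrom}:{bp1:,}-{bp2:,}"


def ucsc_hyperlink(chrom: str, bp1: int, bp2: int) -> str:
    """Build an Excel HYPERLINK formula pointing at the hg19 browser view."""
    url = (
        "https://genome.ucsc.edu/cgi-bin/hgTracks"
        f"?db=hg19&position=chr{chrom}:{bp1}-{bp2}"
    )
    # The visible label reuses the location string (hypothetical choice).
    label = format_location(chrom, bp1, bp2)
    return f'=HYPERLINK("{url}","{label}")'
```

Calling `format_location("1", 1234567, 2345678)` yields `chr1:1,234,567-2,345,678`.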
- Computes an enriched version of the GENE_MASTER (g) table in GENE_MASTER_EXPANDED. Two sets of expanded basepair coordinates are calculated: one expanded by 20 kbases and one by 10 Mbases.
- bp1s20k_gm =
(g.bp1-20000)
- bp2a20k_gm =
(g.bp2+20000)
- bp1s10m_gm =
(g.bp1-10e6)
- bp2a10m_gm =
(g.bp2+10e6)
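
A minimal Python sketch of the four expanded coordinates, using the column names above as dictionary keys. The function name `expand_gene` is hypothetical. As in the formulas, the lower bounds are not clamped, so genes near the start of a chromosome can get negative expanded coordinates.

```python
def expand_gene(bp1: int, bp2: int) -> dict:
    """Compute the 20 kbase and 10 Mbase expanded coordinates for one gene."""
    return {
        "bp1s20k_gm": bp1 - 20_000,       # start minus 20 kbases
        "bp2a20k_gm": bp2 + 20_000,       # end plus 20 kbases
        "bp1s10m_gm": bp1 - 10_000_000,   # start minus 10 Mbases (10e6)
        "bp2a10m_gm": bp2 + 10_000_000,   # end plus 10 Mbases (10e6)
    }
```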
- Creates a joined table PROTEIN_CODING_GENES_ALL from _USER_INPUT (c) and GENE_MASTER_EXPANDED (g), keeping the rows that fulfill the condition
g.ttype='protein_coding' AND c.chr=g.chr AND TwoSegmentOverlapCondition(c.bp1,c.bp2,g.bp1s10m_gm,g.bp2a10m_gm)
That is, protein coding genes whose coordinates, expanded by 10 Mbases, overlap a user input region on the same chromosome.
- dist=
(CASE WHEN TwoSegmentOverlapCondition(c.bp1,c.bp2,g.bp1,g.bp2) THEN 0 WHEN c.bp1 IS NULL OR c.bp2 IS NULL THEN 9e9 ELSE NUM_MAX_INTEGER(ABS(c.bp1-g.bp2),ABS(c.bp2-g.bp1)) END)
- Creates a view PROTEIN_CODING_GENES from PROTEIN_CODING_GENES_ALL
WHERE dist<100000
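
The distance computation and the 100 kb filter can be sketched in Python as below. This follows the `dist` expression as written: the final branch takes the larger of the two endpoint differences. Two points are assumptions: `NUM_MAX_INTEGER` is treated as the maximum of two integers, and any missing user coordinate yields the sentinel (SQL's three-valued logic could still match the overlap clause when only one coordinate is NULL). The function names are hypothetical.

```python
from typing import Optional

NO_DISTANCE = 9e9  # sentinel from the dist expression for missing coordinates


def overlaps(a0: int, a1: int, b0: int, b1: int) -> bool:
    """Four-clause overlap test for closed segments [a0, a1] and [b0, b1]."""
    return (a0 <= b0 <= a1) or (a0 <= b1 <= a1) or (b0 <= a0 <= b1) or (b0 <= a1 <= b1)


def dist(c_bp1: Optional[int], c_bp2: Optional[int], g_bp1: int, g_bp2: int) -> float:
    """Distance between a user region (c) and a gene (g), per the CASE expression."""
    # Simplification: treat a region with any missing coordinate as unplaceable.
    if c_bp1 is None or c_bp2 is None:
        return NO_DISTANCE
    if overlaps(c_bp1, c_bp2, g_bp1, g_bp2):
        return 0
    # Assumption: NUM_MAX_INTEGER returns the larger of its two arguments.
    return max(abs(c_bp1 - g_bp2), abs(c_bp2 - g_bp1))


def is_near(c_bp1: Optional[int], c_bp2: Optional[int], g_bp1: int, g_bp2: int) -> bool:
    """Filter corresponding to WHERE dist<100000."""
    return dist(c_bp1, c_bp2, g_bp1, g_bp2) < 100_000
```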
- Creates all output datasets:
- GWAS_CATALOG by joining _USER_INPUT (c) and the reference _GWAS_CATALOG (r)
on
c.chr=r.chr AND TwoSegmentOverlapCondition(c.bp1, c.bp2, r.bp1, r.bp2)
- OMIM by joining GENES_PROTEIN_CODING_NEAR (g) and the reference _OMIM (r)
on
g.genename_gm=r.geneName AND g.geneName_gm IS NOT NULL AND g.geneName_gm!='' AND r.geneName IS NOT NULL AND r.geneName!=''
- PSYCHIATRIC_CNVS by joining _USER_INPUT (c) and the reference _PSYCHIATRIC_CNVS (r)
on
c.chr=r.chr AND TwoSegmentOverlapCondition(c.bp1, c.bp2, r.bp1, r.bp2)
- ASD_GENES by joining GENES_PROTEIN_CODING_NEAR (g) and the reference _ASD_GENES (r)
on
g.genename_gm=r.geneName AND g.geneName_gm IS NOT NULL AND g.geneName_gm!='' AND r.geneName IS NOT NULL AND r.geneName!=''
- ID_DEVDELAY_GENES by joining GENES_PROTEIN_CODING_NEAR (g) and the reference _ID_DEVDELAY_GENES (r)
on
g.genename_gm=r.geneName AND g.geneName_gm IS NOT NULL AND g.geneName_gm!='' AND r.geneName IS NOT NULL AND r.geneName!=''
- MOUSE_KNOCKOUT by joining GENES_PROTEIN_CODING_NEAR (g) and the reference _MOUSE_KNOCKOUT (r)
on
g.genename_gm=r.geneName AND g.geneName_gm IS NOT NULL AND g.geneName_gm!='' AND r.geneName IS NOT NULL AND r.geneName!=''
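
The gene-name joins used for OMIM, ASD_GENES, ID_DEVDELAY_GENES and MOUSE_KNOCKOUT share one pattern, which can be sketched in Python over lists of dictionaries. This is an illustration, not the program's SQL: the function name `join_on_gene_name` is hypothetical, rows are assumed to carry the keys `genename_gm` and `geneName`, and names are compared exactly (case-sensitive), which is an assumption about the database settings.

```python
def join_on_gene_name(genes: list, reference: list) -> list:
    """Inner-join gene rows to reference rows on non-empty, equal gene names."""
    # Index reference rows by name, skipping NULL (None) and empty names.
    by_name = {}
    for r in reference:
        name = r.get("geneName")
        if name:
            by_name.setdefault(name, []).append(r)

    joined = []
    for g in genes:
        name = g.get("genename_gm")
        if not name:  # mirrors the IS NOT NULL and != '' conditions
            continue
        for r in by_name.get(name, []):
            joined.append({**g, **r})  # one output row per matching pair
    return joined
```

Indexing the reference rows first keeps the join linear in the number of rows rather than comparing every pair.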
- The chosen tables are written to the chosen file(s) in the chosen format.
- Output is produced automatically after user input.