johanzvrskovec/RegionAnnotator

#RegionAnnotator

##Source
https://github.com/ivankosmos/RegionAnnotator

##Data input needed
Daner clumps or similar, generated from PGC GWAS summary statistics.

##What the program does
Creates a number of tables with gene entries (rows) and psychiatric-related annotations (columns). Includes GENCODE location, gene-based p-values, SFARI ASD annotation, GENCODE annotation, GWAS catalog annotation, OMIM annotation and manual curation, among other things. Genes/regions are annotated based on:

  • protein_coding_genes: an expanded (10 Mbases) segment overlap condition, and a distance condition (distance<100Kb).
  • omim, asd_genes, id_devdelay_genes, mouse_knockout: an expanded (10 Mbases) segment overlap condition, a distance condition (distance<100Kb), and a gene name comparison condition.
  • gwas_catalog, psychiatric_cnvs: a segment overlap condition.

##What the data output will look like
Multiple .csv/.tsv files, one MS Excel file, or one JSON file. The program also creates a Java H2 database file that can be queried directly instead of exporting files.

##Algorithm

Enter reference data, gene data or user data input

  1. The input data is read from its source format.
  2. The input data is checked against its pre-configured templates, then configured and completed from them. Templates contain information on table columns, column data types and output formatting.
  3. The input data rows are read into tables. Reference data is read into tables named after their file names or Excel sheets, prefixed by _. Gene data is read into a table named GENE_MASTER. User data is read into a table named _USER_INPUT.
  4. Templated columns are indexed for improved database performance.
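
The input steps above can be illustrated with a minimal sketch. It is not the program's own code: it uses Python's built-in `sqlite3` as a stand-in for the H2 database, and the `TEMPLATE` dictionary, the helper names, the upper-casing of table names and the choice to fill missing columns with NULL are all assumptions made for illustration.

```python
import csv
import io
import sqlite3
from pathlib import Path

# Hypothetical template: column name -> SQL type, plus the columns to index.
TEMPLATE = {
    "columns": {"chr": "TEXT", "bp1": "INTEGER", "bp2": "INTEGER", "snpid": "TEXT"},
    "indexed": ["chr", "bp1", "bp2"],
}


def reference_table_name(file_name: str) -> str:
    """Derive a table name from a file name, prefixed by '_'.

    Dropping the extension and upper-casing are assumptions of this sketch.
    """
    return "_" + Path(file_name).stem.upper()


def load_table(conn, table, rows_tsv, template):
    """Read tab-separated rows into `table`, typed and indexed per the template."""
    cols = template["columns"]
    col_defs = ", ".join(f'"{name}" {sql_type}' for name, sql_type in cols.items())
    conn.execute(f'CREATE TABLE "{table}" ({col_defs})')
    placeholders = ", ".join("?" for _ in cols)
    for row in csv.DictReader(rows_tsv, delimiter="\t"):
        # Assumption: template columns absent from the input become NULL.
        values = [row.get(name) or None for name in cols]
        conn.execute(f'INSERT INTO "{table}" VALUES ({placeholders})', values)
    for name in template["indexed"]:
        conn.execute(f'CREATE INDEX "idx{table}_{name}" ON "{table}" ("{name}")')


conn = sqlite3.connect(":memory:")
sample = io.StringIO("chr\tbp1\tbp2\n1\t1000\t2000\n2\t500\t900\n")
name = reference_table_name("gwas_catalog.tsv")
load_table(conn, name, sample, TEMPLATE)
rows = conn.execute(
    f'SELECT chr, bp1, bp2, snpid FROM "{name}" ORDER BY chr'
).fetchall()
print(name, rows)
```

The template drives both the column types and the indexes, so the same loader can serve reference data, gene data and user data.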

Operate

The program runs operation actions after every user input.

  • TwoSegmentOverlapCondition(a0,a1,b0,b1) = ((a0<=b0 AND b0<=a1) OR (a0<=b1 AND b1<=a1) OR (b0<=a0 AND a0<=b1) OR (b0<=a1 AND a1<=b1))
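
As a sketch only, the overlap condition can be transcribed into Python as follows. The function names are hypothetical. The shorter form is a well-known equivalent test for closed intervals, and it holds only under the assumption that each segment is ordered (a0<=a1 and b0<=b1).

```python
def two_segment_overlap(a0, a1, b0, b1):
    """Literal transcription of TwoSegmentOverlapCondition (closed intervals)."""
    return ((a0 <= b0 <= a1) or (a0 <= b1 <= a1)
            or (b0 <= a0 <= b1) or (b0 <= a1 <= b1))


def overlap_short(a0, a1, b0, b1):
    """Equivalent two-comparison test, valid only when a0 <= a1 and b0 <= b1."""
    return a0 <= b1 and b0 <= a1


print(two_segment_overlap(100, 200, 200, 300))  # True: the segments share an endpoint
print(two_segment_overlap(100, 200, 201, 300))  # False: one basepair apart
```

Because the comparisons are inclusive, segments that only touch at a single coordinate count as overlapping.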
  1. Computes an enriched version of the user input in the table USER_INPUT.
  • location : A coordinate string composed of the chromosome (chr) and the basepair coordinates (bp1, bp2). A comma is used as a thousands separator (every three digits) in the basepair coordinates.
  • UCSC_LINK : A (MS Excel) hyperlink to the UCSC Genome Browser on the Human Feb. 2009 (GRCh37/hg19) assembly.
  2. Computes an enriched version of the GENE_MASTER (g) table in GENE_MASTER_EXPANDED. Two sets of expanded basepair coordinates are calculated, one expanded by 20 kbases and one by 10 Mbases.
  • bp1s20k_gm = (g.bp1-20000)
  • bp2a20k_gm = (g.bp2+20000)
  • bp1s10m_gm = (g.bp1-10e6)
  • bp2a10m_gm = (g.bp2+10e6)
  3. Creates a joined table PROTEIN_CODING_GENES_ALL from _USER_INPUT (c) and GENE_MASTER_EXPANDED (g), keeping rows that fulfill the condition g.ttype='protein_coding' AND c.chr=g.chr AND TwoSegmentOverlapCondition(c.bp1,c.bp2,g.bp1s10m_gm,g.bp2a10m_gm), that is: protein coding genes whose coordinates, expanded by 10 Mbases, overlap a user input region.
  • dist = CASE WHEN TwoSegmentOverlapCondition(c.bp1,c.bp2,g.bp1,g.bp2) THEN 0 WHEN c.bp1 IS NULL OR c.bp2 IS NULL THEN 9e9 ELSE NUM_MAX_INTEGER(ABS(c.bp1-g.bp2),ABS(c.bp2-g.bp1)) END
  4. Creates a view PROTEIN_CODING_GENES from PROTEIN_CODING_GENES_ALL WHERE dist<100000
  5. Creates all output datasets:
  • GWAS_CATALOG by joining _USER_INPUT (c) and the reference _GWAS_CATALOG (r) on c.chr=r.chr AND TwoSegmentOverlapCondition(c.bp1, c.bp2, r.bp1, r.bp2)
  • OMIM by joining GENES_PROTEIN_CODING_NEAR (g) and the reference _OMIM (r) on g.genename_gm=r.geneName AND g.geneName_gm IS NOT NULL AND g.geneName_gm!='' AND r.geneName IS NOT NULL AND r.geneName!=''
  • PSYCHIATRIC_CNVS by joining _USER_INPUT (c) and the reference _PSYCHIATRIC_CNVS (r) on c.chr=r.chr AND TwoSegmentOverlapCondition(c.bp1, c.bp2, r.bp1, r.bp2)
  • ASD_GENES by joining GENES_PROTEIN_CODING_NEAR (g) and the reference _ASD_GENES (r) on g.genename_gm=r.geneName AND g.geneName_gm IS NOT NULL AND g.geneName_gm!='' AND r.geneName IS NOT NULL AND r.geneName!=''
  • ID_DEVDELAY_GENES by joining GENES_PROTEIN_CODING_NEAR (g) and the reference _ID_DEVDELAY_GENES (r) on g.genename_gm=r.geneName AND g.geneName_gm IS NOT NULL AND g.geneName_gm!='' AND r.geneName IS NOT NULL AND r.geneName!=''
  • MOUSE_KNOCKOUT by joining GENES_PROTEIN_CODING_NEAR (g) and the reference _MOUSE_KNOCKOUT (r) on g.genename_gm=r.geneName AND g.geneName_gm IS NOT NULL AND g.geneName_gm!='' AND r.geneName IS NOT NULL AND r.geneName!=''
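
The protein coding gene steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the program's SQL: the `Region` and `Gene` classes and all function names are hypothetical, `dist` follows the formula as written (it takes the larger of the two endpoint differences, mirroring NUM_MAX_INTEGER), and rows with a missing coordinate are simply skipped in the join, which simplifies how SQL treats NULL.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Region:  # stands in for a row of _USER_INPUT (c)
    chr: str
    bp1: Optional[int]
    bp2: Optional[int]


@dataclass
class Gene:  # stands in for a row of GENE_MASTER (g)
    name: str
    ttype: str
    chr: str
    bp1: int
    bp2: int


def overlap(a0, a1, b0, b1):
    """TwoSegmentOverlapCondition on closed intervals."""
    return ((a0 <= b0 <= a1) or (a0 <= b1 <= a1)
            or (b0 <= a0 <= b1) or (b0 <= a1 <= b1))


def dist(c: Region, g: Gene):
    """dist as defined above; None is checked first because Python cannot compare None."""
    if c.bp1 is None or c.bp2 is None:
        return 9e9
    if overlap(c.bp1, c.bp2, g.bp1, g.bp2):
        return 0
    return max(abs(c.bp1 - g.bp2), abs(c.bp2 - g.bp1))


def protein_coding_genes(regions, genes, window=10_000_000, max_dist=100_000):
    """Join on the 10 Mbase expanded window, then keep rows with dist < max_dist."""
    result = []
    for c in regions:
        if c.bp1 is None or c.bp2 is None:
            continue  # simplification of this sketch: incomplete regions are not joined
        for g in genes:
            if g.ttype != "protein_coding" or c.chr != g.chr:
                continue
            if not overlap(c.bp1, c.bp2, g.bp1 - window, g.bp2 + window):
                continue
            d = dist(c, g)
            if d < max_dist:
                result.append((g.name, d))
    return result


regions = [Region("1", 1_000_000, 1_010_000)]
genes = [
    Gene("A", "protein_coding", "1", 1_050_000, 1_060_000),  # nearby
    Gene("B", "protein_coding", "1", 1_200_000, 1_300_000),  # inside window, too far
    Gene("C", "protein_coding", "1", 1_005_000, 1_020_000),  # overlapping
    Gene("D", "lincRNA", "1", 1_000_000, 1_010_000),         # not protein coding
]
near = protein_coding_genes(regions, genes)
print(near)
```

In this example gene B passes the 10 Mbase window join but is dropped by the dist<100000 filter, which is the role of the PROTEIN_CODING_GENES view.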

Output

  1. The selected tables are written to the selected file(s) in the selected format.
  2. Output runs automatically after user input.