Gene Combinations in Oligogenic Disease (GCOD) is an efficient, simulation-based method to detect gene sets that carry co-occurring damaging variants in disease probands at a higher rate than expected given parental genotypes. GCOD can be viewed as an extension of the Transmission Disequilibrium Test (TDT) in which the unit being tested is an observed co-occurrence of variants in a group of genes, rather than observed transmissions at a single genomic location or within a single gene. GCOD also differs from the TDT in its use of simulations to assess significance rather than a chi-squared statistic, enabling the detection of rare gene combinations with very few expected observations.
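The idea of assessing significance by simulation rather than with a chi-squared statistic can be illustrated with a minimal sketch. This is not GCOD's actual procedure: it assumes a simplified model in which every parental variant is heterozygous, each is transmitted independently with probability 0.5, the two genes segregate independently, and de novo variants are ignored. The function names and the `trios` structure (one tuple per family giving the number of carrier parents for gene A and for gene B) are hypothetical.

```python
import random


def transmission_prob(n_carrier_parents):
    """Chance that a child inherits at least one variant copy when
    n_carrier_parents (0, 1 or 2) parents are heterozygous carriers."""
    return 1.0 - 0.5 ** n_carrier_parents


def simulate_cooccurrence_pvalue(trios, observed, n_sim=10000, seed=0):
    """Empirical p-value for seeing at least `observed` probands who carry
    variants in both genes, given parental carrier counts.

    trios: list of (carrier_parents_gene_a, carrier_parents_gene_b) tuples.
    """
    rng = random.Random(seed)
    # Per-family probability of co-inheritance (assumes independent genes).
    probs = [transmission_prob(a) * transmission_prob(b) for a, b in trios]
    hits = 0
    for _ in range(n_sim):
        simulated = sum(rng.random() < p for p in probs)
        if simulated >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_sim + 1)


if __name__ == "__main__":
    # Four families, each with one carrier parent per gene.
    example_trios = [(1, 1)] * 4
    print(simulate_cooccurrence_pvalue(example_trios, observed=4))
```

Because the p-value is estimated by counting simulated datasets, it stays meaningful even when the expected number of co-occurrences is far too small for a chi-squared approximation.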
GCOD takes as input a tab-delimited file of gene variants with proband and parental genotypes. It returns a dataframe of candidate oligogenic combinations, each with a list of carrier IDs and the simulation-based significance of the gene set's co-occurrence within the dataset. Simulations are performed for all digenic pairs for which at least two probands carry variant combinations not seen in a parent, as well as for higher-order gene sets, limited to the largest set of genes common to distinct family combinations ("highest order"). By default GCOD runs these tests on the full list of variants provided, but the analysis can also be restricted to a user-supplied subset of genes.
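The digenic selection rule described above can be sketched as follows. This is one reading of the criterion, not GCOD's code: a gene pair counts for a family when the proband carries variants in both genes and neither parent carries both. The `families` dictionary layout and the function name `candidate_pairs` are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(families, min_carriers=2):
    """Return {gene pair: [family IDs]} for pairs seen together in at least
    `min_carriers` probands but not together in either of their parents.

    families: {family_id: {"proband": set, "father": set, "mother": set}},
    where each set holds the genes in which that person carries a variant.
    """
    carriers = defaultdict(list)
    for fam_id, members in sorted(families.items()):
        parents = (members["father"], members["mother"])
        for pair in combinations(sorted(members["proband"]), 2):
            # Skip combinations already present in a single parent.
            if any(set(pair) <= parent for parent in parents):
                continue
            carriers[pair].append(fam_id)
    return {pair: ids for pair, ids in carriers.items()
            if len(ids) >= min_carriers}


if __name__ == "__main__":
    toy = {
        "F1": {"proband": {"A", "B"}, "father": {"A"}, "mother": {"B"}},
        "F2": {"proband": {"A", "B", "C"}, "father": {"A", "C"}, "mother": {"B"}},
        "F3": {"proband": {"A", "B"}, "father": {"A", "B"}, "mother": set()},
    }
    print(candidate_pairs(toy))
```

In this toy input only the pair A and B survives: it is carried by the probands of F1 and F2 without appearing together in either parent, while F3 is excluded because the father already carries both.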
The necessary packages can be installed in a conda environment with the following commands:
$ conda env create --prefix gcod --file environment.yml
$ conda activate ./gcod
GCOD_walkthrough.ipynb demonstrates different use cases for GCOD analysis (pairs only, highest order, and gene subset analyses). This notebook also provides guidance for input file formatting.
GCOD can be launched via the script run_GCOD.py. It takes the input and output file paths, followed by the gene set order (either pairs or highest) and, optionally, the path to a list of genes to which the analysis should be restricted:
python run_GCOD.py <input file path> <output file path> <order mode> [<path to gene list file>]
The example can be run on the command line as follows:
python run_GCOD.py ./data/input_variants.tsv ./output/example_results.tsv highest
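For scripted or batch runs, the same command line can be assembled from Python. The sketch below only wraps the positional arguments described above; the helper names `build_gcod_command` and `run_gcod` are hypothetical and not part of GCOD, and the script is assumed to sit in the current working directory.

```python
import subprocess
import sys


def build_gcod_command(input_path, output_path, order="pairs", gene_list=None):
    """Build the argument list for run_GCOD.py (positional arguments only)."""
    if order not in ("pairs", "highest"):
        raise ValueError("order must be 'pairs' or 'highest'")
    cmd = [sys.executable, "run_GCOD.py", input_path, output_path, order]
    if gene_list is not None:
        # Optional fourth argument: path to the gene list file.
        cmd.append(gene_list)
    return cmd


def run_gcod(input_path, output_path, order="pairs", gene_list=None):
    """Launch run_GCOD.py and raise if it exits with a non-zero status."""
    cmd = build_gcod_command(input_path, output_path, order, gene_list)
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Mirrors the example invocation above; printed rather than executed.
    print(" ".join(build_gcod_command(
        "./data/input_variants.tsv", "./output/example_results.tsv", "highest")))
```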