GAiN generates large synthetic cohorts using small training sets of gene expression data from samples of two different phenotypes. It then performs differential expression (DE) analysis on the synthetic cohorts, returning a list of candidate DE genes.
Please make sure you have installed the following tools:
Clone this repository to the desired location:
git clone https://github.com/jin-wash-u/GAiN.git
To test the installation, execute the following command in your terminal to display usage information about GAiN.
$PATH_TO_GAIN/GAiN -h
Running GAiN with the -h option (or --help) will print a description of its optional and required input arguments. A description of each follows.
GAiN [-h] [-b C0 C1] [-e EPOCHS] [--minExprMean MINEXPRMEAN]
[--minExprMAD MINEXPRMAD] [--numbOfNetworks NON] [--deseq] [--save]
[--seed SEED] [-q] [--synth] [-o OUTNAME] [-p POPCSV]
input.csv
- -h, --help
Prints the help menu, which contains an example of the function usage and abbreviated explanations of each of the options.
- --version
Prints version information
- -b C0 C1, --batchsizes C0 C1
Size of synthetic cohort to generate for each condition (default: 500 500)
- -e EPOCHS, --epochs EPOCHS
Number of epochs for training model (default: 100)
- --minExprMean MINEXPRMEAN
Minimum mean expression for genes to be modeled (default: 10)
- --minExprMAD MINEXPRMAD
Minimum mean absolute deviation of expression for genes to be modeled (default: 10)
- --numbOfNetworks NON
Number of networks to use for bagging (default: 5)
- --deseq
Use DESeq2 method for DE significance calculations (default: use edgeR)
- --save
Save trained models for later use
- --seed SEED
Optional seed for random sampling of the training sets
- -q, --quiet
Run in quiet mode, limiting program output
- --synth
Save synthetic expression tables
- -o OUTNAME, --outname OUTNAME
Prefix for output filenames (default: ./GAiN)
- -p POPCSV, --popCSV POPCSV
Population expression table in CSV format (if not provided, input.csv will be used as population cohort)
- input.csv
Path to the training cohort gene expression table in CSV format. A pair of large synthetic cohorts will be generated based on the samples in this table. The first row must be any label string followed by the sample ids of the training set. The second row must start with a label string (e.g. "Cancer ID") followed by 0 or 1 based on the phenotypic group of the sample for that column. All subsequent rows must start with a gene label, followed by the expression level of that gene in the sample for that column.
- population.csv
Path to the population cohort gene expression table in CSV format. Scale will be restored to the synthetic gene expression tables using these samples. Follows the same format as input.csv, but without any phenotype group/Cancer ID row. Note that only genes with entries in both tables will be modeled.
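The table layout described above can be checked before a run with a few lines of Python. The following is a minimal sketch, not part of GAiN: the function names `validate_training_table` and `shared_genes` and the sample data are hypothetical, and GAiN itself may apply stricter or different checks.

```python
import csv
import io


def validate_training_table(csv_text):
    """Check a training table against the layout described above.

    Returns (sample_ids, groups, genes); raises ValueError if malformed.
    Hypothetical helper for illustration, not a GAiN function.
    """
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    if len(rows) < 3:
        raise ValueError("need a header row, a phenotype row, and a gene row")

    header, phenotype, gene_rows = rows[0], rows[1], rows[2:]
    sample_ids = header[1:]  # first cell is an arbitrary label
    if not sample_ids:
        raise ValueError("no sample ids found in the first row")

    groups = [g.strip() for g in phenotype[1:]]  # first cell is a label
    if len(groups) != len(sample_ids):
        raise ValueError("phenotype row does not match the number of samples")
    if any(g not in ("0", "1") for g in groups):
        raise ValueError("phenotype values must be 0 or 1")

    genes = []
    for row in gene_rows:
        if len(row) != len(sample_ids) + 1:
            raise ValueError(f"gene row {row[0]!r} has the wrong column count")
        for value in row[1:]:
            float(value)  # raises ValueError if a value is not numeric
        genes.append(row[0])
    return sample_ids, [int(g) for g in groups], genes


def shared_genes(training_genes, population_genes):
    """Genes present in both tables, kept in training-table order."""
    population = set(population_genes)
    return [g for g in training_genes if g in population]


# Made-up data in the documented layout (not taken from the GAiN example).
EXAMPLE = (
    "gene,S1,S2,S3,S4\n"
    "Cancer ID,0,0,1,1\n"
    "TP53,12.5,30.1,8.0,44.2\n"
    "BRCA1,101,87.3,95,110\n"
)

if __name__ == "__main__":
    samples, groups, genes = validate_training_table(EXAMPLE)
    print(samples, groups, genes)
    print(shared_genes(genes, ["BRCA1", "MYC"]))
```

The second helper mirrors the note that only genes with entries in both tables are modeled. How GAiN matches gene labels (for example, case sensitivity) is not specified here, so exact string equality is an assumption.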
GAiN generates a CSV table, by default GAiN_DE_genes.csv, containing the list of differentially expressed genes between the two synthetic groups that passed all filters, together with the sum of each gene's rank across the NON networks and the number of networks in which it was significantly DE.
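To illustrate what a per-gene rank sum and a per-gene count of significant networks mean, here is a rough sketch of one way such values could be aggregated. It is not GAiN's implementation: the function `aggregate_networks`, the p-value ranking rule, and the `alpha` threshold are all assumptions made for illustration.

```python
def aggregate_networks(pvalues_per_network, alpha=0.05):
    """Combine per-network DE results into a rank sum and a significance count.

    pvalues_per_network: list of dicts mapping gene -> p-value, one per network.
    Assumptions (not confirmed by GAiN): every network scores the same genes,
    genes are ranked by ascending p-value (rank 1 = most significant), and a
    gene counts as significant when its p-value is below alpha.
    """
    rank_sum = {}
    n_significant = {}
    for pvalues in pvalues_per_network:
        # Break p-value ties by gene label so the ranking is deterministic.
        ordered = sorted(pvalues, key=lambda gene: (pvalues[gene], gene))
        for rank, gene in enumerate(ordered, start=1):
            rank_sum[gene] = rank_sum.get(gene, 0) + rank
            hit = 1 if pvalues[gene] < alpha else 0
            n_significant[gene] = n_significant.get(gene, 0) + hit
    # Report the lowest rank sum (most consistently top-ranked) first.
    return sorted(
        ((gene, rank_sum[gene], n_significant[gene]) for gene in rank_sum),
        key=lambda record: (record[1], record[0]),
    )


if __name__ == "__main__":
    # Two hypothetical networks scoring three hypothetical genes.
    networks = [
        {"A": 0.001, "B": 0.2, "C": 0.03},
        {"A": 0.01, "B": 0.04, "C": 0.5},
    ]
    for gene, total_rank, count in aggregate_networks(networks):
        print(gene, total_rank, count)
```

Under these assumptions, a gene with a low rank sum and a count close to NON is one that the bagged networks agree on.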
An example input expression CSV file is included with GAiN to demonstrate how to run the tool. It can be run as follows:
cd $PATH_TO_GAIN
gunzip example/example_population.csv.gz
GAiN \
-o test \
--seed 42 \
-p example/example_population.csv \
example/example_input.csv
The results of this execution can be compared with the results file example/example_result_DE_genes.csv.
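One way to make this comparison is to check which genes the two files have in common. The sketch below assumes that each results file has a header row and that the gene label is in the first column; neither detail is confirmed here, and the helper names are hypothetical.

```python
import csv
import io


def read_gene_column(handle):
    """Return the set of gene labels from the first column of a CSV file.

    Assumes one header row, which is skipped, and gene labels in column 1.
    """
    reader = csv.reader(handle)
    next(reader, None)  # skip the assumed header row
    return {row[0] for row in reader if row}


def compare_gene_sets(mine, reference):
    """Summarise the overlap between two sets of gene labels."""
    return {
        "shared": sorted(mine & reference),
        "only_mine": sorted(mine - reference),
        "only_reference": sorted(reference - mine),
    }


if __name__ == "__main__":
    # In-memory stand-ins for the two results files; the columns are made up.
    mine = read_gene_column(io.StringIO("gene,score\nTP53,3\nMYC,9\n"))
    reference = read_gene_column(io.StringIO("gene,score\nTP53,4\nBRCA1,7\n"))
    print(compare_gene_sets(mine, reference))
```

For real files, pass open file handles instead of `io.StringIO` objects, for example `open("example/example_result_DE_genes.csv", newline="")` together with the output file written by the run above. Given the default name GAiN_DE_genes.csv, that file is presumably test_DE_genes.csv.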
The gene expression values in example_input.csv are TMM-normalized counts from 10 primary tumors of the luminal B subtype (condition 0) and 10 primary tumors of the triple-negative subtype (condition 1), sourced from TCGA-BRCA. The population cohort consists of TMM-normalized expression values for all primary tumors in TCGA-BRCA.
A docker image for GAiN has been created and tested on Linux and Mac. To run GAiN using this method, you need to have Docker installed on your machine.
- Ubuntu: follow the instructions to get Docker CE for Ubuntu.
- Mac: follow the instructions to install the stable version of Docker CE on Mac.
To obtain the latest docker image, run the following on your command line:
docker pull mjinkm/gain
To test the image, run the following command which shows the usage of this tool:
docker run mjinkm/gain GAiN -h
To run GAiN using docker on the example provided, use the following command:
docker run -v $PATH_TO_OUTPUT:/gain_out mjinkm/gain GAiN \
-o /gain_out/test \
--seed 42 \
-p /opt/GAiN/example/example_population.csv \
/opt/GAiN/example/example_input.csv
where PATH_TO_OUTPUT is the desired local output directory.