To identify the desired gene modules in WGCNA, we propose gmcNet. gmcNet is a GNN-based clustering algorithm that clusters genes according to both the co-expression topology (genes in the same module should be strongly connected) and the single-level expression (genes in the same module should have similar expression patterns). The key innovation of gmcNet is combining the single-expression of each gene with the co-expression of its neighbor genes.
gmcNet requires four inputs to perform unsupervised clustering, all defined over the same set of genes and expression samples:
- Expression features: single-expression features of the genes.
- Whole TOM: topological overlap matrix, built from the topological overlap measure between genes.
- Positive TOM: topological overlap matrix built only from gene pairs with a positive correlation coefficient.
- Negative TOM: topological overlap matrix built only from gene pairs with a negative correlation coefficient.
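As a rough illustration of how the three topological overlap matrices relate to each other, the sketch below computes them with NumPy using the standard unsigned topological overlap measure from WGCNA. The function names, the soft-threshold power `beta=6.0`, and the way the positive and negative networks are split are assumptions made for this example; they are not taken from the gmcNet code, which may build its TOMs differently.

```python
import numpy as np

def tom_from_adjacency(adj):
    """Unsigned topological overlap for a symmetric adjacency with values in [0, 1]."""
    a = np.array(adj, dtype=float)
    np.fill_diagonal(a, 0.0)                    # ignore self-connections
    shared = a @ a                              # l_ij = sum_u a_iu * a_uj
    degree = a.sum(axis=1)                      # k_i = sum_u a_iu
    min_deg = np.minimum.outer(degree, degree)  # min(k_i, k_j)
    tom = (shared + a) / (min_deg + 1.0 - a)
    np.fill_diagonal(tom, 1.0)                  # convention: full overlap with itself
    return tom

def three_toms(expr, beta=6.0):
    """expr: genes x samples array. Returns (whole, positive, negative) TOMs.

    Hypothetical helper: beta is an assumed soft-threshold power.
    """
    cor = np.corrcoef(expr)
    whole = np.abs(cor) ** beta                      # all gene pairs
    positive = np.where(cor > 0, cor, 0.0) ** beta   # positively correlated pairs only
    negative = np.where(cor < 0, -cor, 0.0) ** beta  # negatively correlated pairs only
    return tuple(tom_from_adjacency(a) for a in (whole, positive, negative))

rng = np.random.default_rng(0)
expr = rng.normal(size=(5, 8))   # 5 genes, 8 samples of synthetic data
whole, positive, negative = three_toms(expr)
```

Each returned matrix is square with one row and one column per gene, which matches the shape expected of the TOM inputs.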
gmcNet consists of two components: a co-expression pattern recognizer (CEPR) and a module classifier.
CEPR : Using message-passing operations, CEPR generates an embedding feature for each gene that accounts for its single-expression and the two different co-expressions. The embedding dimension is set by 'CEPR_features'.
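To make the idea of message passing over co-expression networks concrete, here is a deliberately simplified sketch: each gene's features are repeatedly averaged with those of its TOM-weighted neighbors, once over a positive network and once over a negative network, and the results are concatenated with the gene's own features. This is not the CEPR architecture. It has no trainable weights, and the function names and the `n_layers` parameter are hypothetical.

```python
import numpy as np

def message_passing(features, tom, n_layers=2):
    """Weighted neighbor averaging: a simple, weight-free form of message passing."""
    h = np.asarray(features, dtype=float)
    w = np.asarray(tom, dtype=float)
    row_sum = w.sum(axis=1, keepdims=True)
    row_sum = np.where(row_sum == 0.0, 1.0, row_sum)  # avoid dividing by zero
    w = w / row_sum                                    # each row now sums to 1
    for _ in range(n_layers):
        h = w @ h                                      # aggregate neighbor features
    return h

def cepr_like_embedding(features, tom_pos, tom_neg, n_layers=2):
    """Concatenate own features with positive- and negative-network aggregates."""
    features = np.asarray(features, dtype=float)
    return np.concatenate(
        [features,
         message_passing(features, tom_pos, n_layers),
         message_passing(features, tom_neg, n_layers)],
        axis=1,
    )

features = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # 3 genes, 2 samples
embedding = cepr_like_embedding(features, np.eye(3), np.eye(3))
```

With identity matrices as the networks, no information flows between genes, so each aggregated block simply repeats the gene's own features.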
Module classifier : Given the CEPR embedding feature, the module classifier computes a module-assignment probability matrix using a multi-layer perceptron (MLP), with one row per gene and one column for each of the k modules. Each row holds the module-assignment probabilities of one gene. In other words, a gene belongs to the module whose probability is the maximum value in that gene's row.
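The final assignment step can be sketched in a few lines of NumPy. The example assumes the classifier ends in a softmax over k modules, which is a common choice but is not confirmed here; the logits are made-up numbers, and `softmax` and `assign_modules` are hypothetical helper names.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def assign_modules(prob):
    """Pick, for each gene (row), the module (column) with the highest probability."""
    return np.argmax(prob, axis=1)

# Made-up classifier outputs for 3 genes and k = 3 modules.
logits = np.array([[2.0, 0.5, 0.1],
                   [0.2, 0.1, 3.0],
                   [0.0, 1.5, 1.0]])
prob = softmax(logits)          # each row sums to 1
modules = assign_modules(prob)  # array([0, 2, 1])
```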
Our models were implemented with TensorFlow 2.3 in Python 3.8.6.
Requirements can be installed through the following command in your shell.
pip install -r [CODE PATH]/requirements.txt
expr : Gene expression data. A text file with a header line, followed by one line per gene with (number of samples + 1) columns. The first column is the gene name and the others are expression values. An example of the file format is in the data folder as sample.txt.
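A minimal reader for a file in this layout might look like the sketch below. The column separator is not specified here, so the example assumes whitespace-separated fields by default; `read_expr` is a hypothetical helper and not part of gmcNet, and the inline text only stands in for a real file.

```python
import io
import numpy as np

def read_expr(handle, sep=None):
    """Parse a header line, then one line per gene: name followed by values.

    sep=None splits on any whitespace (an assumption about the file format).
    """
    header = handle.readline().rstrip("\n").split(sep)
    genes, rows = [], []
    for line in handle:
        if not line.strip():
            continue                                 # skip blank lines
        fields = line.rstrip("\n").split(sep)
        genes.append(fields[0])                      # first column: gene name
        rows.append([float(v) for v in fields[1:]])  # remaining: expression values
    if len({len(r) for r in rows}) > 1:
        raise ValueError("all gene lines must have the same number of columns")
    values = np.array(rows, dtype=float)
    samples = header[-values.shape[1]:]              # sample names from the header
    return genes, samples, values

text = "gene s1 s2 s3\nG1 1.0 2.0 3.0\nG2 4.0 5.0 6.0\n"
genes, samples, values = read_expr(io.StringIO(text))
```

For a tab-separated file, the same function can be called with `sep="\t"`.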
TOM (optional) : If you have already created the TOMs with the R library WGCNA, you can use them for gmcNet. The three TOMs (whole, positive, negative) required by gmcNet must be located in one folder and named whole.txt, positive.txt, and negative.txt, respectively. Each TOM file must contain one row and one column per gene, so that the entry in a given row and column is the topological overlap measure between the two corresponding genes. You can find example files in the out/TOMs folder.
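Before passing a TOM folder to gmcNet, it can help to check that the three files are present and well-formed. The sketch below assumes each file is a plain whitespace-separated numeric matrix with no header or row names, which is not confirmed here; `load_tom_folder` is a hypothetical helper, and the temporary folder with identity matrices only stands in for real TOMs.

```python
import os
import tempfile
import numpy as np

TOM_FILES = ("whole.txt", "positive.txt", "negative.txt")

def load_tom_folder(folder, n_genes=None):
    """Load the three TOM files and check that each is square and symmetric."""
    toms = {}
    for name in TOM_FILES:
        path = os.path.join(folder, name)
        if not os.path.isfile(path):
            raise FileNotFoundError(f"missing TOM file: {path}")
        mat = np.loadtxt(path, ndmin=2)   # assumes numeric, whitespace-separated
        if mat.shape[0] != mat.shape[1]:
            raise ValueError(f"{name} is not square: {mat.shape}")
        if n_genes is not None and mat.shape[0] != n_genes:
            raise ValueError(f"{name} has {mat.shape[0]} genes, expected {n_genes}")
        if not np.allclose(mat, mat.T):
            raise ValueError(f"{name} is not symmetric")
        toms[name] = mat
    return toms

# Demonstration with placeholder matrices written to a temporary folder.
with tempfile.TemporaryDirectory() as folder:
    for name in TOM_FILES:
        np.savetxt(os.path.join(folder, name), np.eye(4))
    toms = load_tom_folder(folder, n_genes=4)
```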
Before executing gmcNet, you should set the configuration in main.py.
'betas' : smoothing parameters for the (whole, positive, negative) networks
'save_TOM' : whether to save the TOMs in the output path
'save_embed' : whether to save the embedding features in the output path
'n_cluster' : number of clusters (k)
'epochs' : training epochs
'lr' : training learning rate
'mp_layers' : number of message passing layers
'CEPR_features' : CEPR embedding dimensions
'lambda' : balancing hyper-parameter
'Lo_thr' : orthogonal threshold
'tune_epoch' : number of initial tuning epochs, which prevent empty modules
'tune_lr' : learning rate for the initial tuning
'device' : GPU device to use. If you don't use a GPU, set this to False
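For orientation, the options above could be collected in a Python dictionary such as the one sketched here. The dictionary layout is an assumption about how main.py stores its configuration, and every value is an arbitrary placeholder chosen for illustration, not a default or recommended setting of gmcNet. `missing_keys` is a hypothetical helper for checking that no option was left out.

```python
# All values below are illustrative placeholders, not gmcNet defaults.
config = {
    "betas": (6, 6, 6),     # (whole, positive, negative) networks
    "save_TOM": True,
    "save_embed": True,
    "n_cluster": 10,        # k
    "epochs": 100,
    "lr": 1e-3,
    "mp_layers": 2,
    "CEPR_features": 32,
    "lambda": 1.0,
    "Lo_thr": 0.1,
    "tune_epoch": 10,
    "tune_lr": 1e-3,
    "device": False,        # False means no GPU
}

EXPECTED_KEYS = (
    "betas", "save_TOM", "save_embed", "n_cluster", "epochs", "lr",
    "mp_layers", "CEPR_features", "lambda", "Lo_thr", "tune_epoch",
    "tune_lr", "device",
)

def missing_keys(cfg):
    """Return the documented option names that are absent from cfg."""
    return [key for key in EXPECTED_KEYS if key not in cfg]

missing = missing_keys(config)   # empty list when every option is present
```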
python main.py --expr [expr] --out [out]
- [expr] : expr file path.
- [out] : Path for saving the results.
python main.py --expr [expr] --TOM [TOM] --out [out]
- [expr] : expr file path.
- [TOM] : Path to the TOM folder containing the three different TOM files (whole.txt, positive.txt, negative.txt).
- [out] : Path for saving the results.
python main.py --expr data/sample.txt --out out
python main.py --expr data/sample.txt --TOM out/TOMs --out out