
R package for context-specific functionality analysis of metabolic gene clusters


METACLUSTER - an R package for context-specific expression analysis of metabolic gene clusters

Check out our hands-on tutorial on


METACLUSTER facilitates comprehensive condition- and tissue-specific expression analysis of metabolic gene clusters. It is based on a probabilistic framework that characterizes metabolic gene clusters using context-specific gene expression information.

A) The METACLUSTER framework. B) Cluster diagram and transcriptional activity map of the arabidiol/baruol cluster (Yu et al. 2016) (C463 based on the prediction in Schlapfer et al. 2017). Colors indicate the inferred p-value of the cluster to be transcriptionally active per condition and tissue. Gray tiles indicate condition-tissue combinations that are missing in the differential expression dataset. C) Transcriptional activity map of the 317 inferred context-specific gene clusters. Color values denote the number of transcriptionally active gene clusters per condition-tissue combination. Black tiles indicate condition-tissue combinations with no inferred transcriptionally active clusters.

Contact for questions.


METACLUSTER is based on R version 3.6.1. The easiest way to install METACLUSTER is through devtools (see the OS-specific notes on installing devtools at the end).


install_github("", build_vignettes=TRUE,
 repos=c("", ""),


To run METACLUSTER with the Schlapfer et al. 2017 A. thaliana gene cluster predictions, download all necessary datasets from OneDrive: datasets_athaliana. If you are using your own datasets, see the required data format for "custom" datasets in the Notes section.

library(METACLUSTER) # load package

setwd("/User/home/METACLUSTER_athaliana_datasets") # set working directory to the dataset files

Load individual datasets based on their filenames:

  • input_format "custom", "PCF2017_enzymes_only" or "PCF2017" (default = "PCF2017_enzymes_only")
  • geneCluster the gene clusters dataset
  • genes a list of genes (corresponds to the rows of the differential expression datasets)
  • sample_ids_differentialExpression a list of unique identifiers referencing individual condition-tissue specific differential expression experiments
  • foldChange_differentialExpression differential expression data (fold changes) as a genes x differential expression experiments
  • pvalue_differentialExpression differential expression data (p-values) as a genes x differential expression experiments
  • experiment_condition_tissue_annotation experiment to treatment and tissue annotation (with corresponding experiment_ids)

 = load_datasets(input_format = "PCF2017_enzymes_only",
                      filename.geneCluster = "data/ath_geneInCluster_3_aracyc.txt-labeled_NoHypoGenes.txt",
                      filename.genes = "data/genes.txt",
                      filename.sample_ids_differentialExpression = "data/sample_ids_differentialExpression.txt",
                      filename.foldChange_differentialExpression = "data/m.foldChange_differentialExpression.txt",
                      filename.pvalue_differentialExpression = "data/m.pvalue_differentialExpression.txt",
                      filename.experiment_condition_tissue_annotation = "data/experiment_annotation.txt")
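
Before loading, it can help to confirm that the dataset files are where the call expects them. The following is a minimal sketch using base R only. The file names are copied from the call above; `find_missing_files` is a hypothetical helper for illustration and is not part of METACLUSTER.

```r
# File names as listed in the load_datasets() call above, relative to the
# working directory set with setwd().
dataset_files <- c(
  geneCluster = "data/ath_geneInCluster_3_aracyc.txt-labeled_NoHypoGenes.txt",
  genes       = "data/genes.txt",
  sample_ids  = "data/sample_ids_differentialExpression.txt",
  foldChange  = "data/m.foldChange_differentialExpression.txt",
  pvalue      = "data/m.pvalue_differentialExpression.txt",
  annotation  = "data/experiment_annotation.txt"
)

# Hypothetical helper (not part of METACLUSTER): returns the paths that do not exist.
find_missing_files <- function(paths) {
  paths[!file.exists(paths)]
}

missing_files <- find_missing_files(dataset_files)
if (length(missing_files) > 0) {
  message("Missing dataset files:\n", paste(missing_files, collapse = "\n"))
} else {
  message("All dataset files found.")
}
```

If any file is reported missing, check that the working directory points at the downloaded dataset folder.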

METACLUSTER Parameter sets:

Note: we set b.load_codifferentialAnalysis_monteCarloSimulation = "yes" for the Schlapfer et al. 2017 A. thaliana gene cluster predictions because all co-differential expression datasets are pre-computed and provided. For other datasets, set it to "no".

  • m.foldChange_differentialExpression differential expression foldchange matrix - rows are genes, cols are experiments
  • m.pvalue_differentialExpression differential expression pvalue matrix - rows are genes, cols are experiments
  • df.experiment_condition_annotation experiment condition annotation
  • df.geneCluster gene cluster dataset
  • tb.condition_treatments table of conditions
  • tb.condition_tissues table of tissues
  • n.cpus number of cores used
  • b.load_codifferentialAnalysis_monteCarloSimulation load codifferential expression data ("yes", "no")
  • pvalue_DifferentialExpression pvalue threshold for differential expression (default = 0.05)
  • probability_codifferentialExpression_MonteCarloSimulation probability threshold codifferential expression (default = 0.05)
  • pvalue_coexpression_distribution pvalue threshold context specific coexpression (default = 0.05)
  • pvalue_geneClusterPrediction pvalue gene cluster inference enzyme presence (default = 0.05)
  • pvalue_geneClusterConsistency pvalue gene cluster enzyme condition consistency (default = 0.05)
  • pvalue_treatment_per_condition pvalue gene pair condition annotation (default = 0.05)
  • pvalue_tissue_per_condition pvalue gene pair tissue annotation (default = 0.05)
  • number_codifferentialExpression_MonteCarloSimulations number of codifferential expression background Monte Carlo simulations (default = 1)
  • number_conditionSpecificCoexpressionBackgroundGenePairs number of context specific coexpression simulation background gene pairs (default = 100)
  • min_number_condition_samples minimum number of condition samples for significance test (default = 1)
  • foldername.tmp temp file folder name (default = /tmp)
  • foldername.results results file folder name (default = /results)
df.cluster_annotations = run_METACLUSTER(m.foldChange_differentialExpression =$m.foldChange_differentialExpression,
                                       m.pvalue_differentialExpression =$m.pvalue_differentialExpression,
                                       df.experiment_condition_annotation =$df.experiment_condition_annotation,
                                       df.geneCluster =$df.geneCluster,
                                       tb.condition_treatments =$tb.condition_treatments,
                                       tb.condition_tissues =$tb.condition_tissues,
                                       n.cpus = 3,
                                       b.load_codifferentialAnalysis_monteCarloSimulation = "yes",
                                       pvalue_DifferentialExpression = 0.05,
                                       probability_codifferentialExpression_MonteCarloSimulation = 0.95,
                                       pvalue_coexpression_distribution = 0.05,
                                       pvalue_geneClusterPrediction = 0.05,
                                       pvalue_geneClusterConsistency = 0.05,
                                       pvalue_treatment_per_condition = 0.05,
                                       pvalue_tissue_per_condition = 0.05,
                                       number_codifferentialExpression_MonteCarloSimulations = 1,
                                       number_conditionSpecificCoexpressionBackgroundGenePairs = 100,
                                       min_number_condition_samples = 1,
                                       seed = 1234,
                                       heatmap_width = 10,
                                       heatmap_height = 5,
                                       foldername.results = "results/",
                                       foldername.tmp = "tmp/")
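
To make the role of a cutoff such as pvalue_DifferentialExpression concrete, here is a minimal sketch of applying a p-value threshold to a genes x experiments matrix. It is an illustration only and not METACLUSTER's internal code: the matrix values are made up, and the use of a strict `<` comparison is an assumption.

```r
# Toy genes x experiments p-value matrix (made-up values, for illustration only).
m.pvalue_toy <- matrix(
  c(0.01,  0.20,  0.04,
    0.50,  0.03,  0.60,
    0.001, 0.049, 0.30),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"), c("exp1", "exp2", "exp3"))
)

# Same value as pvalue_DifferentialExpression in the call above.
pvalue_cutoff <- 0.05

# TRUE where a gene passes the cutoff in an experiment
# (strict "<" is an assumption, not taken from the package).
m.is_de <- m.pvalue_toy < pvalue_cutoff

# Number of genes passing the cutoff per experiment.
n_de_per_experiment <- colSums(m.is_de)
print(n_de_per_experiment)
```

Lowering the cutoff makes the per-experiment calls stricter, which in turn leaves fewer genes available for the downstream cluster-level tests.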

Next, evaluate and store the results:

                          df.experiment_condition_annotation =$df.experiment_condition_annotation,
                          tb.condition_treatments =$tb.condition_treatments,
                          tb.condition_tissues =$tb.condition_tissues,
                          min_number_of_genes = 3,
                          heatmap_width = 4, heatmap_height = 7, fontsize = 7, fontsize_row = 10, fontsize_col = 10,
                          foldername.results = "results/")


Installation of devtools dependencies under Ubuntu (prior to installing devtools): sudo apt-get install build-essential libcurl4-gnutls-dev libxml2-dev libssl-dev

Then install.packages("devtools")
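
A quick way to check the local setup before installing is sketched below, using base R only. Treating 3.6.1 as a minimum version is an assumption made for this check; the README only states that the package is based on that version.

```r
# Version named in the README; treated here as a minimum (an assumption).
required_r_version <- "3.6.1"
r_version_ok <- getRversion() >= required_r_version

# TRUE if devtools is already installed; this call does not install anything.
devtools_available <- requireNamespace("devtools", quietly = TRUE)

message("R ", as.character(getRversion()),
        if (r_version_ok) " meets" else " is older than",
        " the version named in the README (", required_r_version, ")")
message("devtools installed: ", devtools_available)
```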

Custom gene cluster data format: "Cluster.ID", "Gene.ID", "Gene.Name", see custom_example_data
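
As a rough illustration of the custom gene cluster format, the sketch below builds a toy table with the three required columns and counts genes per cluster. All identifiers are made up, and the on-disk layout (delimiter, header handling) is not specified here, so refer to custom_example_data for the actual file format. The threshold of 3 mirrors min_number_of_genes = 3 in the evaluation call.

```r
# Toy gene cluster table with the three required columns (made-up identifiers).
df.geneCluster_toy <- data.frame(
  Cluster.ID = c("C1", "C1", "C1", "C2", "C2"),
  Gene.ID    = c("gene1", "gene2", "gene3", "gene4", "gene5"),
  Gene.Name  = c("enzymeA", "enzymeB", "enzymeC", "enzymeD", "enzymeE"),
  stringsAsFactors = FALSE
)

# Check that all required columns are present.
required_columns <- c("Cluster.ID", "Gene.ID", "Gene.Name")
has_required_columns <- all(required_columns %in% colnames(df.geneCluster_toy))

# Count genes per cluster and keep clusters with at least 3 genes.
genes_per_cluster <- table(df.geneCluster_toy$Cluster.ID)
clusters_min3 <- names(genes_per_cluster)[genes_per_cluster >= 3]
print(clusters_min3)
```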

As a prerequisite, our algorithm needs two matrices: m.pvalue_differentialExpression and m.foldChange_differentialExpression.

  • genes a list of genes (corresponds to the rows of the differential expression datasets)
  • sample_ids_differentialExpression a list of unique identifiers referencing individual condition-tissue specific differential expression experiments listed in experiment_condition_tissue_annotation (corresponding to the columns of the differential expression datasets)

The format of the experimental annotation should be: "series_id" "condition_treatment_1" "condition_treatment_2" "condition_tissue" "unique_ID".
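
The relationships described above can be sketched with toy objects: two matrices with genes as rows and experiments as columns, plus an annotation table with the five listed columns. All values are made up, `check_inputs` is a hypothetical helper that is not part of METACLUSTER, and the assumption that "unique_ID" holds the sample identifiers used as matrix column names is inferred from the description rather than confirmed.

```r
genes      <- c("gene1", "gene2", "gene3")
sample_ids <- c("exp1", "exp2")

# Toy fold-change and p-value matrices: rows are genes, columns are experiments.
m.foldChange_custom <- matrix(c(1.5, -0.2, 0.8, 2.1, -1.3, 0.0),
                              nrow = 3, dimnames = list(genes, sample_ids))
m.pvalue_custom <- matrix(c(0.01, 0.40, 0.03, 0.20, 0.002, 0.90),
                          nrow = 3, dimnames = list(genes, sample_ids))

# Toy annotation table with the five columns named above (made-up values).
df.annotation_custom <- data.frame(
  series_id             = c("S1", "S2"),
  condition_treatment_1 = c("heat", "drought"),
  condition_treatment_2 = c("short", "long"),
  condition_tissue      = c("root", "leaf"),
  unique_ID             = sample_ids,
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of METACLUSTER): TRUE if both matrices share
# the same shape and names, and every column has an annotation entry.
check_inputs <- function(m.fc, m.p, df.annotation) {
  identical(dim(m.fc), dim(m.p)) &&
    identical(rownames(m.fc), rownames(m.p)) &&
    identical(colnames(m.fc), colnames(m.p)) &&
    all(colnames(m.fc) %in% df.annotation$unique_ID)
}

inputs_ok <- check_inputs(m.foldChange_custom, m.pvalue_custom, df.annotation_custom)
print(inputs_ok)
```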


Banf M, Zhao K.M., and Rhee S. METACLUSTER - an R package for context-specific expression analysis of metabolic gene clusters, Bioinformatics, 2019

Schläpfer P, Zhang P, Wang C, Kim T, Banf M, Chae L, Dreher K, Chavali A K, Nilo-Poyanco, Bernhard T, Kahn D, and Rhee S. Genome-wide prediction of metabolic enzymes, pathways, and gene clusters in plants, Plant Physiology, 2017




