# Decomposing gene co-expression networks with COBRA
Author: Soel Micheletti<sup>1</sup>

<sup>1</sup> Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA.

## 1. Introduction
COBRA decomposes a gene co-expression network into a linear combination of covariate-specific components. It takes as input a gene expression matrix, from which the co-expression network is computed, and a design matrix. Depending on the choice of covariates in the design matrix, COBRA can be used to tackle different tasks in systems biology. In this tutorial we show how it can be applied to batch correction, to differential co-expression analysis that controls for other variables, and to understanding the impact of variables of interest on the observed co-expression.

![**Figure 1:** COBRA workflow.](./cobra.png)

COBRA is now part of the [netZooR package](https://github.com/netZoo/netZooR). You can install COBRA together with the other netZoo tools using the command below.

In [None]:
devtools::install_github("netZoo/netZooR", build_vignettes = TRUE)

If you need help or have any questions about netZoo, feel free to start with the [discussions](https://github.com/netZoo/netZooR/discussions). To report a bug, please open a new [issue](https://github.com/netZoo/netZooR/issues).

For this tutorial, we need to import the following libraries.

In [None]:
library(DESeq2)
library(fastDummies)
library(netZooR)
library(recount3)

## 2. Downloading data from recount3

To illustrate how to use COBRA for different tasks, we use recount3 <sup>1</sup> to download thyroid carcinoma (THCA) data from the TCGA project <sup>2</sup>. 

In [None]:
data <- recount3::create_rse_manual(
  project = "THCA",
  project_home = "data_sources/tcga",
  organism = "human",
  annotation = "gencode_v26",
  type = "gene"
)
G <- transform_counts(data, by = "mapped_reads")
G <- G[data@rowRanges@elementMetadata@listData$gene_type == "protein_coding",]
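# Filtering: remove genes whose total count across all samples is at most 1.
# Logical indexing is used because G[-which(...), ] would drop every row if no gene matched.
G <- G[rowSums(G) > 1, ]
# Variance-stabilizing transformation of the count matrix (DESeq2)
countMat <- SummarizedExperiment::assay(DESeqDataSetFromMatrix(G, data.frame(row.names = seq_len(ncol(G))), ~1), 1)
gene_expression <- vst(countMat, blind = FALSE)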

In [None]:
metadata_url <- locate_url(
  "THCA",
  "data_sources/tcga")
metadata <- read_metadata(file_retrieve(url = metadata_url))

The gene expression dataset contains the expression of 19711 genes for 572 samples. The metadata contains a collection of additional information for each sample. 

In [13]:
dim(gene_expression)

## 3. Applications of COBRA
COBRA requires two inputs:
1. a gene expression matrix with genes as rows and samples as columns;
2. a design matrix with samples as rows and covariates as columns.

Depending on the covariates in the design matrix, COBRA can be used for multiple purposes.
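
To make the expected orientation of the two inputs concrete, here is a minimal sketch in base R. The data and the names `toy_expression`, `toy_group`, and `toy_X` are hypothetical and unrelated to the THCA dataset; the sketch only checks that the number of samples agrees between the two matrices.

```r
# Toy inputs with the orientation described above (hypothetical data, not THCA).
set.seed(1)
n_genes <- 5
n_samples <- 8

# Expression matrix: genes in rows, samples in columns.
toy_expression <- matrix(rnorm(n_genes * n_samples), nrow = n_genes, ncol = n_samples)

# Design matrix: samples in rows, covariates in columns
# (an intercept plus one binary covariate).
toy_group <- rep(c(0, 1), each = n_samples / 2)
toy_X <- cbind(intercept = 1, group = toy_group)

# The number of samples must agree between the two inputs.
stopifnot(ncol(toy_expression) == nrow(toy_X))
dim(toy_expression)
dim(toy_X)
```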

### 3.1 Higher order batch correction

A first application is batch correction of the co-expression network. In this case, we correct for the batch variable in our data. 

In [None]:
batch <- metadata$tcga.cgc_case_batch_number

In our dataset, the 572 samples come from 17 distinct batches. 

In [23]:
length(unique(batch))

For batch correction, the design matrix must contain an intercept in the first column and the batches (encoded using dummy coding for identifiability) in the remaining columns.

In [None]:
number_of_samples <- dim(gene_expression)[2]
X <- cbind(rep(1, number_of_samples), as.matrix(dummy_cols(batch)[, -c(1:2)]))

We get a design matrix with 17 covariates (an intercept and 16 for the dummy coding) for the 572 samples in our study. 

In [42]:
dim(X)
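
The role of dummy coding can be illustrated on a toy example. The sketch below uses base R's `model.matrix()` as a stand-in for `dummy_cols()` (it is not what the tutorial itself calls), with hypothetical batch labels. It shows that dropping one reference batch keeps the design matrix at full column rank, while keeping a dummy for every batch next to the intercept does not.

```r
# Toy batch labels (hypothetical): 6 samples from 3 batches.
toy_batch <- c("A", "A", "B", "B", "C", "C")

# model.matrix() adds an intercept and drops the first level as the reference,
# so 3 batches yield 1 intercept column + 2 dummy columns.
toy_X <- model.matrix(~ factor(toy_batch))
dim(toy_X)

# Keeping a dummy for every batch alongside the intercept makes the columns
# linearly dependent: the dummies sum to the intercept column.
full_dummies <- sapply(unique(toy_batch), function(b) as.numeric(toy_batch == b))
toy_X_full <- cbind(1, full_dummies)

qr(toy_X)$rank       # equals ncol(toy_X): full column rank
qr(toy_X_full)$rank  # smaller than ncol(toy_X_full): rank deficient
```

In the tutorial, removing the first two columns of the `dummy_cols()` output plays the same role: it drops the original label column and one batch dummy.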

We are now ready to fit COBRA.

In [None]:
cobra_estimates <- cobra(X, gene_expression)

The batch-corrected network considers only the mean effect after removing the contribution of the batch variables. It is computed as follows.

In [None]:
corrected_network <- cobra_estimates$Q %*% diag(cobra_estimates$psi[1,]) %*% t(cobra_estimates$Q)
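
The expression above rebuilds one component from a shared set of eigenvectors `Q` and one row of weights in `psi`. The following base R sketch illustrates that algebra on hypothetical data. It does not call `cobra()`, and the weights are an arbitrary split of the eigenvalues chosen for illustration; it assumes only that `psi` stores one row per design-matrix column, as the indexing `psi[1,]` suggests.

```r
# Toy illustration (hypothetical data): rebuild a symmetric matrix from
# shared eigenvectors Q and per-component weights psi.
set.seed(1)
toy_data <- matrix(rnorm(40), nrow = 4)   # 4 genes, 10 samples
toy_C <- cor(t(toy_data))                 # 4 x 4 gene-gene correlation
eig <- eigen(toy_C, symmetric = TRUE)
Q <- eig$vectors

# Hypothetical weights: split each eigenvalue into two arbitrary parts,
# one row per component (components in rows, eigenvectors in columns).
psi <- rbind(0.7 * eig$values, 0.3 * eig$values)

# Component k is Q diag(psi[k, ]) Q^T, the same form used in the tutorial.
component <- function(k) Q %*% diag(psi[k, ]) %*% t(Q)

# The components add up to the original matrix.
all.equal(component(1) + component(2), toy_C, check.attributes = FALSE)
```

Each component is a symmetric gene-by-gene matrix, so it can be read as a network in its own right.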

### 3.2 Differential co-expression analysis

A second application is differential co-expression analysis between two conditions of interest. Here, we are interested in the differential co-expression between healthy and cancer samples. We extract the sample type for each sample. 

In [None]:
cancer <- metadata$tcga.gdc_cases.samples.sample_type
cancer <- ifelse(cancer == "Solid Tissue Normal", 0, 1)

In this case, the design matrix contains an intercept and a second column with an indicator for cancer versus healthy. The additional columns hold the variables we want to adjust for. As before, we adjust for the batch variable.

In [None]:
number_of_samples <- dim(gene_expression)[2]
X <- cbind(rep(1, number_of_samples), cancer, as.matrix(dummy_cols(batch)[, -c(1:2)]))

We are now ready to fit COBRA and extract the component corresponding to the differential co-expression. Since the indicator variable for cancer is the second column in our design matrix, the COBRA-adjusted differential co-expression network corresponds to the second component of COBRA's decomposition. 

In [None]:
cobra_estimates <- cobra(X, gene_expression)
differential_coexpression <- cobra_estimates$Q %*% diag(cobra_estimates$psi[2,]) %*% t(cobra_estimates$Q)

### 3.3 Identifying the component for a covariate of interest

COBRA is general enough to be applied to any variable. For instance, if we want to study the differences between males and females in cancer, we can use the following design matrix. 

In [None]:
sex <- metadata$tcga.gdc_cases.demographic.gender
sex <- ifelse(sex == "male", 0, 1)
number_of_samples <- dim(gene_expression)[2]
X <- cbind(rep(1, number_of_samples), cancer, sex, sex * cancer)

With this design, the last component of COBRA's decomposition, which corresponds to the interaction column `sex * cancer`, describes the sex differences in cancer.

In [None]:
cobra_estimates <- cobra(X, gene_expression)
sex_differences_in_cancer <- cobra_estimates$Q %*% diag(cobra_estimates$psi[4,]) %*% t(cobra_estimates$Q)
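
To see what the fourth column of this design encodes, the sketch below builds the same four columns for a handful of hypothetical samples (the labels are made up for illustration). With the 0/1 coding used above, the product `sex * cancer` is 1 only for samples that are both female and tumor.

```r
# Toy labels (hypothetical), using the same 0/1 coding as the tutorial.
toy_cancer <- c(0, 0, 1, 1, 0, 1)   # 1 = tumor, 0 = normal
toy_sex    <- c(0, 1, 0, 1, 1, 1)   # 1 = female, 0 = male

toy_X <- cbind(intercept = 1,
               cancer = toy_cancer,
               sex = toy_sex,
               interaction = toy_sex * toy_cancer)
toy_X

# Samples for which the interaction column is switched on.
which(toy_X[, "interaction"] == 1)
```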

## References

1- Christopher Wilks, Shijie C. Zheng, Feng Yong Chen, Rone Charles, Brad Solomon, Jonathan P. Ling, Eddie Luidy Imada, David Zhang, Lance Joseph, Jeffrey T. Leek, Andrew E. Jaffe, Abhinav Nellore, Leonardo Collado-Torres, Kasper D. Hansen, Ben Langmead. "recount3: summaries and queries for large-scale RNA-seq expression and splicing". Genome Biol (2021). 

2- Agrawal, Nishant, et al. "Integrated genomic characterization of papillary thyroid carcinoma." Cell 159.3 (2014): 676-690.