# Decomposing gene co-expression networks with COBRA (Python version)
Author: Soel Micheletti<sup>1</sup>

<sup>1</sup> Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA.

## 1. Introduction
COBRA decomposes a gene co-expression network as a linear combination of covariate-specific components. It takes as input a gene expression matrix and a design matrix. Depending on the choice of covariates in the design matrix, COBRA can be used to tackle different tasks in systems biology. In this tutorial we show how it can be applied to batch correction, to differential co-expression analysis controlling for other variables, and to understanding the impact of variables of interest on the observed co-expression.

![**Figure 1:** COBRA workflow.](./cobra.png)

COBRA is now part of the [netZooPy package](https://github.com/netZoo/netZooPy). Please follow the installation guidelines in the [README](https://github.com/netZoo/netZooPy/blob/master/README.md). If you need help or have any questions about netZoo, feel free to start with [discussions](https://github.com/netZoo/netZooPy/discussions). To report a bug, please open a new [issue](https://github.com/netZoo/netZooPy/issues).

To illustrate how to use COBRA for different tasks, we import thyroid carcinoma (THCA) data from the TCGA project <sup>1</sup>. 

In [1]:
cd ..

/home/soel/Desktop/netbooks/netbooks


In [2]:
from netZooPy.cobra import *
import numpy as np
import pandas as pd

In [3]:
gene_expression = pd.read_csv("data/gene_expression_thca.csv", index_col = 0).to_numpy()
metadata = pd.read_csv("data/thca_metadata.csv", index_col = 0)
batch = metadata['batch'].to_numpy()
cancer = metadata['status'].to_numpy()
sex = metadata['sex'].to_numpy()

Here `gene_expression` is a gene expression matrix for 19711 genes and 572 samples. `batch`, `cancer`, and `sex` hold sample-specific metadata, each stored as a vector of length 572.

In [4]:
print("Gene expression shape = " + str(gene_expression.shape))
print("Batch vector length = " + str(len(batch)))
print("Cancer vector length = " + str(len(cancer)))
print("Sex vector length = " + str(len(sex)))

Gene expression shape = (19711, 572)
Batch vector length = 572
Cancer vector length = 572
Sex vector length = 572


## 2. Applications of COBRA
COBRA requires two inputs:
1. a gene expression matrix with genes as rows and samples as columns;
2. a design matrix with samples as rows and covariates as columns.
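
The two inputs are oriented differently: samples are the columns of the expression matrix but the rows of the design matrix. The following is a minimal sketch of a sanity check on synthetic data. The helper `check_cobra_inputs` is hypothetical and is not part of netZooPy.

```python
import numpy as np

def check_cobra_inputs(design, expression):
    """Hypothetical helper (not part of netZooPy): check input orientation."""
    design = np.asarray(design, dtype=float)
    expression = np.asarray(expression, dtype=float)
    n_genes, n_samples = expression.shape
    # The design matrix must have one row per sample (expression column)
    if design.shape[0] != n_samples:
        raise ValueError(
            f"design has {design.shape[0]} rows but expression has "
            f"{n_samples} columns (samples)"
        )
    return n_genes, n_samples, design.shape[1]

rng = np.random.default_rng(0)
toy_expression = rng.normal(size=(50, 12))  # 50 genes x 12 samples
toy_design = np.column_stack(
    [np.ones(12), rng.integers(0, 2, size=12)]
)  # 12 samples x 2 covariates (intercept + one binary covariate)

shape_summary = check_cobra_inputs(toy_design, toy_expression)
print(shape_summary)  # (genes, samples, covariates)
```

Passing a transposed expression matrix (samples as rows) would raise the `ValueError` here, unless the number of genes happens to equal the number of samples.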

Depending on the covariates in the design matrix, COBRA can be used for multiple purposes.

### 2.1 Higher order batch correction

A first application is batch correction of the co-expression network. In this case, we correct for the batch variable in our data. In our dataset, the 572 samples come from 17 distinct batches. 

In [5]:
len(np.unique(batch))

17

For batch correction, the design matrix must contain an intercept in the first column and the batches (encoded using dummy coding for identifiability) in the remaining columns.

In [6]:
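number_of_samples = gene_expression.shape[1]
# Dummy-code the batches and drop the first level, which becomes the reference batch.
# dtype=float keeps the design matrix numeric (recent pandas versions default to bool).
X = pd.get_dummies(batch, dtype=float).iloc[:,1:]
# Add the intercept as the first column
X.insert(0, 'intercept', np.ones(number_of_samples))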

We get a design matrix with 17 covariates (an intercept and 16 for the dummy coding) for the 572 samples in our study. 

In [7]:
X.shape

(572, 17)
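
To see why one batch indicator is dropped, here is a small sketch on a made-up batch vector with three levels (the labels `A`, `B`, `C` are illustrative only). An intercept plus one indicator per batch is rank deficient because the indicators sum to the intercept column. Dropping one indicator restores full column rank.

```python
import numpy as np

toy_batch = np.array(["A", "B", "C", "A", "B", "C", "A", "B"])

# One-hot encode the batch labels: one indicator column per batch
levels, codes = np.unique(toy_batch, return_inverse=True)
one_hot = np.eye(len(levels))[codes]
intercept = np.ones((len(toy_batch), 1))

# Intercept + all 3 indicators: the indicators sum to the intercept column
full = np.hstack([intercept, one_hot])
# Intercept + 2 indicators (first batch is the reference level)
dummy = np.hstack([intercept, one_hot[:, 1:]])

rank_full = np.linalg.matrix_rank(full)
rank_dummy = np.linalg.matrix_rank(dummy)
print(full.shape[1], rank_full)    # 4 columns, rank 3
print(dummy.shape[1], rank_dummy)  # 3 columns, rank 3
```

The same reasoning gives the 17 columns above: one intercept plus 16 indicators for the 17 batches.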

We are now ready to fit COBRA.

In [8]:
psi, Q, d, g = cobra(X, gene_expression)

The batch-corrected network considers only the mean effect after removing the contribution of the batch variables. It is computed from the first component, which corresponds to the intercept column of the design matrix, as follows.

In [9]:
corrected_network = Q.dot(np.diag(psi[0,])).dot(Q.T)
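
The reconstruction above builds an explicit diagonal matrix. As a sketch, the same product can be obtained by scaling the columns of `Q` with broadcasting, which avoids allocating the diagonal matrix. The arrays `Q_toy` and `psi_toy` below are random stand-ins with assumed shapes (genes x components and covariates x components), not actual COBRA output.

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes, n_components, n_covariates = 30, 8, 3
Q_toy = rng.normal(size=(n_genes, n_components))          # stand-in for Q
psi_toy = rng.normal(size=(n_covariates, n_components))   # stand-in for psi

k = 0  # index of the design-matrix column whose component we reconstruct

# Same expression as in the tutorial cell
via_diag = Q_toy.dot(np.diag(psi_toy[k])).dot(Q_toy.T)
# Equivalent: multiply each column of Q by the matching entry of psi[k]
via_broadcast = (Q_toy * psi_toy[k]) @ Q_toy.T

same = np.allclose(via_diag, via_broadcast)
symmetric = np.allclose(via_broadcast, via_broadcast.T)
print(same, symmetric, via_broadcast.shape)
```

Both forms return a symmetric genes x genes matrix. Changing `k` selects the component for another column of the design matrix.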

### 2.2 Differential co-expression analysis
A second application is differential co-expression analysis between two conditions of interest. Here, we are interested in the differential co-expression between healthy and cancer samples. We extract the sample type for each sample. 

In [10]:
# Binary indicator: 1 for tumor samples, 0 for "Solid Tissue Normal" samples
cancer = [0 if status == "Solid Tissue Normal" else 1 for status in cancer]

In this case, the design matrix contains an intercept and a second column with an indicator for cancer/healthy. The additional columns hold the variables we want to adjust for. As before, we adjust for the batch variable.

In [11]:
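number_of_samples = gene_expression.shape[1]
# Batch dummies (first level dropped); dtype=float keeps the design matrix numeric
X = pd.get_dummies(batch, dtype=float).iloc[:,1:]
X.insert(0, 'intercept', np.ones(number_of_samples))
# The variable of interest goes in the second column (index 1)
X.insert(1, 'cancer', cancer)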

We are now ready to fit COBRA and extract the component corresponding to the differential co-expression. Since the indicator variable for cancer is the second column in our design matrix, the COBRA-adjusted differential co-expression network corresponds to the second component of COBRA's decomposition. 

In [12]:
psi, Q, d, g = cobra(X, gene_expression)
differential_coexpression = Q.dot(np.diag(psi[1,])).dot(Q.T)
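
One possible way to inspect a component such as `differential_coexpression` is to list the gene pairs with the largest absolute entries. The sketch below does this on a small synthetic symmetric matrix. The gene names and the planted strong edge are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n_genes = 6
gene_names = [f"gene_{i}" for i in range(n_genes)]  # hypothetical labels

# Synthetic symmetric matrix standing in for a COBRA component
A = rng.normal(size=(n_genes, n_genes))
toy_component = (A + A.T) / 2
toy_component[0, 3] = toy_component[3, 0] = 10.0  # plant one strong edge

# Visit each unordered gene pair once (upper triangle, diagonal excluded)
rows, cols = np.triu_indices(n_genes, k=1)
values = toy_component[rows, cols]

# Sort pairs by decreasing absolute value and keep the top 3
order = np.argsort(-np.abs(values))
top_pairs = [
    (gene_names[rows[i]], gene_names[cols[i]], float(values[i]))
    for i in order[:3]
]
for pair in top_pairs:
    print(pair)
```

With 19711 genes the upper triangle holds roughly 194 million pairs, so on the full network it may be preferable to restrict the analysis to a subset of genes first.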

### 2.3 Identifying the component for a covariate of interest

COBRA is general enough to be applied to any variable. For instance, if we want to study the differences between males and females in cancer, we can use the following design matrix. 

In [13]:
sex = [0 if s == 'male' else 1 for s in sex]
number_of_samples = gene_expression.shape[1]
X = pd.DataFrame({
    'intercept': np.ones(number_of_samples),
    'cancer': cancer,
    'sex': sex,
    'interaction': [a*b for a,b in zip(sex, cancer)],
})

With this design, the last component of COBRA's decomposition (the interaction term) describes the sex differences in cancer.

In [14]:
psi, Q, d, g = cobra(X, gene_expression)
sex_differences_in_cancer = Q.dot(np.diag(psi[3,])).dot(Q.T)
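
To make the role of the interaction term concrete, the sketch below assumes the linear-combination model from the introduction: the network for a group with covariate vector `x` is taken to be `Q diag(x @ psi) Q^T`. Under that assumption, the interaction component equals the difference between the cancer effect in the group coded `sex = 1` and the cancer effect in the group coded `sex = 0`. `Q_toy`, `psi_toy`, and `group_network` are illustrative stand-ins, not netZooPy objects.

```python
import numpy as np

rng = np.random.default_rng(3)
n_genes, n_components = 20, 5
Q_toy = rng.normal(size=(n_genes, n_components))
# Rows follow the design columns: intercept, cancer, sex, interaction
psi_toy = rng.normal(size=(4, n_components))

def group_network(cancer, sex):
    """Network for one group under the assumed linear-combination model."""
    x = np.array([1.0, cancer, sex, cancer * sex])
    return (Q_toy * (x @ psi_toy)) @ Q_toy.T

# Cancer effect within each sex group
effect_sex1 = group_network(1, 1) - group_network(0, 1)
effect_sex0 = group_network(1, 0) - group_network(0, 0)

# Component of the interaction column, reconstructed as in the tutorial
interaction_component = (Q_toy * psi_toy[3]) @ Q_toy.T

matches = np.allclose(effect_sex1 - effect_sex0, interaction_component)
print(matches)
```

The identity holds because the model is linear in `x`: the intercept and main effects cancel in the difference of differences, leaving only the interaction row of `psi`.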

## Reference

1. Agrawal, Nishant, et al. "Integrated genomic characterization of papillary thyroid carcinoma." Cell 159.3 (2014): 676-690.