This notebook builds a network in which gene pairs are ranked by either Pearson's correlation coefficient (PCC) or ordinary least squares (OLS) P-values, after the data have been pre-processed with a chosen combination of normalization techniques.

To start, choose the desired essentiality scoring method and load the corresponding data file. Data files can be downloaded from: https://doi.org/10.6084/m9.figshare.21379761

The dataframe can then be processed with choices of variance and covariance normalization techniques. 

To measure the similarity of gene vectors, Pearson's correlation and least squares methods are provided.

Note: The combination of Boyle et al. variance normalization, Cholesky covariance normalization, and OLS constitutes the method used in the Wainberg et al. 2021 paper. The olfactory genes file ("olfactory_genes.txt") required for the Boyle method can be downloaded from: https://github.com/kundajelab/coessentiality


Import packages and load functions:

In [None]:
import pandas as pd
import numpy as np
from scipy.special import stdtr
from sklearn.decomposition import PCA

In [None]:
# Variance normalization using quantile normalization
def qnorm_dataframe(data):
    """
    Quantile normalize a dataframe that contains numeric values only.
    Each value is replaced by the mean of all values sharing its within-column rank.
    Ties are not handled specially.
    """
    rank_mean = data.stack().groupby(data.rank(method='first').stack().astype(int)).mean()
    qnormed_data = data.rank(method='min').stack().astype(int).map(rank_mean).unstack()
    return qnormed_data

# Covariance normalization using PCA whitening transformation
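def PCA_whitening(X):
    centered_X = X - np.mean(X, axis = 0)
    cov = np.cov(centered_X.T)
    # Calculate the eigenvalues and eigenvectors of the covariance matrix.
    # eigh is used because the covariance matrix is symmetric; it returns real values.
    eigVals, eigVecs = np.linalg.eigh(cov)
    # Project the centered data onto the eigenvectors
    transf_x = centered_X @ eigVecs
    # Scale each component to unit variance (the small constant avoids division by zero)
    whitened_x = transf_x / np.sqrt(eigVals + 1e-5)
    return whitened_x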

# Covariance normalization using Cholesky decomposition, as described in Wainberg et al. 2021 paper
def cholesky_whitening(df):
    cholsigmainv = np.linalg.cholesky(np.linalg.pinv(np.cov(df.T)))
    warped_screens = df.values @ cholsigmainv
    df_chol = pd.DataFrame(warped_screens,index=df.index.values,columns=df.columns.values)
    return df_chol

# Create list of pairs and their edge value
def ut_as_list(dframe, diag=1, cols=('Row', 'Column', 'Value')):
    """
    For a symmetric dataframe, where cols=rows, get the upper triangle as a list of row/column pairs.
    diag = 1 (default): ignore diagonal
    diag = 0: include diagonal
    """
    cols = list(cols)
    # Keep only the upper triangle; everything else becomes NaN
    mask = np.triu(np.ones(dframe.shape), k=diag).astype(bool)
    # rename_axis returns a new object, so the caller's dataframe is left unmodified
    d = dframe.where(mask).rename_axis(index=cols[0], columns=cols[1])
    # Stack into long format and explicitly drop the masked (NaN) entries
    d = d.stack().dropna().reset_index()
    d.columns = cols
    return d

# Check if a matrix is positive definite (required for Cholesky covariance normalization) 
def is_pos_def(x):
    return np.all(np.linalg.eigvals(x) > 0)

Data files can be downloaded from: https://doi.org/10.6084/m9.figshare.21379761

Choose one gene-by-cell-line matrix of essentiality scores and import it.

*The data matrices have been pre-processed to include only the genes and cell lines that are common to all of them.

In [None]:
### Z-scores matrix 
# data=pd.read_csv('Zscores_dataframe.csv',header=0,index_col=0)

### Bayes Factors matrix
# data=pd.read_csv('BF_dataframe.csv',header=0,index_col=0)

### Ceres scores matrix
data= pd.read_csv('Ceres_dataframe.csv',header=0,index_col=0)

### Chronos scores matrix
# data= pd.read_csv('Chronos_dataframe.csv',header=0,index_col=0)

Check for any NaN values and confirm that the covariance matrix of the data (across cell lines) is positive definite.

In [None]:
print(np.any(np.isnan(data)))
print(is_pos_def(np.cov(data.T)))
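
As a minimal illustration of why this check matters, the sketch below tests two small hand-made matrices and shows that Cholesky decomposition only succeeds for the positive definite one. The helper `is_pos_def_sym` and the toy matrices are hypothetical and are not part of the notebook's pipeline; the helper assumes symmetric input.

```python
import numpy as np

def is_pos_def_sym(x):
    # Hypothetical variant of the check for symmetric input:
    # eigvalsh returns real eigenvalues for symmetric matrices
    return bool(np.all(np.linalg.eigvalsh(x) > 0))

pos_def = np.array([[2.0, 1.0], [1.0, 2.0]])     # eigenvalues 1 and 3
indefinite = np.array([[1.0, 2.0], [2.0, 1.0]])  # eigenvalues -1 and 3

ok = is_pos_def_sym(pos_def)
bad = is_pos_def_sym(indefinite)

# Cholesky succeeds for the positive definite matrix ...
np.linalg.cholesky(pos_def)

# ... and raises LinAlgError for the indefinite one
try:
    np.linalg.cholesky(indefinite)
    cholesky_failed = False
except np.linalg.LinAlgError:
    cholesky_failed = True
```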

The following performs variance normalization via quantile normalization, using the function given above.

In [None]:
# Quantile Normalization 
data = qnorm_dataframe( data )
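
To make the effect of quantile normalization concrete, here is a small sketch on a toy 3x3 matrix without ties. The function `qnorm_sketch` is a hypothetical, simplified stand-in written for this example, not the notebook's `qnorm_dataframe`; it assumes no tied values within a column.

```python
import numpy as np
import pandas as pd

def qnorm_sketch(df):
    # Reference distribution: mean of the sorted values across columns
    reference = np.sort(df.values, axis=0).mean(axis=1)
    # Zero-based within-column rank of every value
    ranks = df.rank(method='first').astype(int).values - 1
    # Replace every value by the reference value at its rank
    return pd.DataFrame(reference[ranks], index=df.index, columns=df.columns)

toy = pd.DataFrame(
    {'A': [5.0, 2.0, 3.0], 'B': [4.0, 1.0, 6.0], 'C': [3.0, 4.0, 8.0]},
    index=['g1', 'g2', 'g3'],
)
toy_qn = qnorm_sketch(toy)

# After normalization every column holds the same set of values,
# only arranged according to the original within-column ordering
sorted_cols = np.sort(toy_qn.values, axis=0)
```

In this toy case the smallest value of every column becomes 2.0, the mean of the three column minima (2, 1 and 3).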

The following performs variance normalization via the Boyle et al. PCA transformation using olfactory genes, implemented as in the Wainberg et al. 2021 methods. Download the 'olfactory_genes.txt' file from: https://github.com/kundajelab/coessentiality


In [None]:
# Boyle et al. PCA Variance Normalization

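# The squeeze keyword of read_csv was removed in pandas 2.0; squeeze the single column explicitly
olfactory_genes = pd.read_csv('olfactory_genes.txt', header=None).squeeze('columns')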
olfactory_data = data.reindex(olfactory_genes).dropna()

transformation = PCA(n_components=4)
transformation.fit(olfactory_data)

top_PC_effects = transformation.inverse_transform(transformation.transform(data))

data -= top_PC_effects
data = data.iloc[:, :-4]
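
The idea behind this step can be sketched on synthetic data: principal axes are learned from a set of control rows, and the projection of every row onto those axes is subtracted. The example below is only an illustration under stated assumptions: it uses NumPy's SVD in place of scikit-learn's `PCA`, random numbers in place of essentiality scores, and the first 20 rows as a hypothetical stand-in for the olfactory genes.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 10))   # toy matrix: 100 genes x 10 cell lines
control = data[:20]                 # hypothetical stand-in for olfactory gene rows

# Center on the control rows, as PCA does when it is fitted on them
mean = control.mean(axis=0)

# Right singular vectors of the centered control rows are the principal axes
_, _, vt = np.linalg.svd(control - mean, full_matrices=False)
top = vt[:4]                        # top 4 principal axes, shape (4, 10)

# Equivalent of inverse_transform(transform(data)): projection onto the top axes
effects = (data - mean) @ top.T @ top + mean
residual = data - effects

# The residual has no remaining component along the removed axes
leftover = residual @ top.T
```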

The following performs covariance normalization via the PCA whitening transformation, using the function given above.

In [None]:
# PCA Whitening Covariance Transformation
data=PCA_whitening(data)

The following performs covariance normalization via the Cholesky transformation, implemented as in the Wainberg et al. 2021 methods.

In [None]:
# Cholesky Covariance Transformation
data=cholesky_whitening(data)
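
A quick way to see what whitening achieves is to check the covariance afterwards. The sketch below applies the same Cholesky construction to a synthetic matrix with correlated columns and verifies that the covariance between columns becomes approximately the identity. The data and variable names here are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "genes x cell lines" matrix whose columns are correlated through a mixing matrix
mixing = np.eye(5) + 0.3 * np.ones((5, 5))
X = rng.normal(size=(200, 5)) @ mixing

sigma = np.cov(X.T)                            # 5 x 5 covariance between columns
W = np.linalg.cholesky(np.linalg.pinv(sigma))  # W @ W.T equals the inverse covariance
X_white = X @ W

cov_before = np.cov(X.T)
cov_after = np.cov(X_white.T)                  # expected to be close to the identity
```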

The following performs ordinary least squares (or generalized least squares, if Cholesky covariance normalization was applied) to measure the similarity of gene vectors, and ranks the resulting gene pairs by -log10 P-values.

The generalized least squares method is implemented as in the Wainberg et al. 2021 paper.


In [None]:
# Least Squares

screen=data.values

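### If CHOLESKY covariance normalization was applied, the intercept must be warped by the same transformation.
# This then becomes the Generalized Least Squares described in the Wainberg et al. 2021 methods.
# Note: cholsigmainv should be computed from the matrix as it was BEFORE cholesky_whitening was applied.
# The covariance of already-whitened data is close to the identity, which would leave the intercept near all ones.

# cholsigmainv = np.linalg.cholesky(np.linalg.pinv(np.cov(data.T)))
# intercept = cholsigmainv.sum(axis=0)


### Otherwise, use:
intercept=np.ones(screen.shape[1],dtype=int)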

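n_genes = len(screen)
GLS_coef = np.empty((n_genes, n_genes))
GLS_se = np.empty((n_genes, n_genes))
ys = np.array(screen.T)

# Degrees of freedom: number of cell lines minus the two fitted parameters (intercept and slope)
df = screen.shape[1] - 2

for gene_index in range(n_genes):
    # Design matrix: intercept column plus the profile of the current gene
    X = np.stack((intercept, screen[gene_index]), axis=1)

    # Regress every gene (the columns of ys) on the current gene in one call
    coef, residues = np.linalg.lstsq(X, ys, rcond=None)[:2]

    GLS_coef[gene_index] = coef[1]

    # Standard error of the slope for each regression
    GLS_se[gene_index] = \
        np.sqrt(np.linalg.pinv(X.T @ X)[1, 1] * residues / df)

# Two-sided P-values from the Student's t distribution
GLS_p = 2 * stdtr(df, -np.abs(GLS_coef / GLS_se))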

np.fill_diagonal(GLS_p, 1)

GLS_logp=np.negative(np.log10(GLS_p))

GLS_logp_df = pd.DataFrame(GLS_logp , index=data.index.values, columns=data.index.values)

# Create and rank OLS Pairs

OLS_pairs = ut_as_list(GLS_logp_df, diag=1, cols=['Gene1','Gene2','OLS logP'] ).sort_values('OLS logP', ascending=False) 
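
As a sanity check on the P-value calculation, the sketch below repeats the same steps for a single synthetic pair of vectors and compares the result with `scipy.stats.linregress`, which also reports a two-sided t-test on the slope. It covers only the plain OLS case with an all-ones intercept; the vectors `x` and `y` are made-up data.

```python
import numpy as np
from scipy.special import stdtr
from scipy.stats import linregress

rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = 0.5 * x + rng.normal(size=50)

# Same steps as in the loop above, for one predictor and one response
X = np.stack((np.ones_like(x), x), axis=1)
coef, residues = np.linalg.lstsq(X, y[:, None], rcond=None)[:2]
df = len(x) - 2
se = np.sqrt(np.linalg.pinv(X.T @ X)[1, 1] * residues / df)
p_manual = 2 * stdtr(df, -np.abs(coef[1] / se))

# Reference value from SciPy's simple linear regression
p_scipy = linregress(x, y).pvalue
```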


The following calculates Pearson's correlation coefficients to measure the similarity of gene vectors, and ranks the resulting gene pairs by absolute PCC value.

In [None]:
# PCC 

#  Correlation matrix
Corr_df = pd.DataFrame( np.corrcoef(data.values) , index=data.index.values, columns=data.index.values)

# Create and rank PCC Pairs
PCC_pairs = ut_as_list(Corr_df,cols=['Gene1','Gene2','PCC']).sort_values(by='PCC',key=abs, ascending=False)
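
The pair extraction and ranking can be illustrated on a hand-made 3x3 correlation matrix. This sketch does not call `ut_as_list`; it rebuilds the upper-triangle listing directly with NumPy so that it runs on its own, and the gene names and values are invented for the example.

```python
import numpy as np
import pandas as pd

genes = ['g1', 'g2', 'g3']
corr = pd.DataFrame(
    [[1.0, 0.2, -0.9],
     [0.2, 1.0, 0.5],
     [-0.9, 0.5, 1.0]],
    index=genes, columns=genes,
)

# Upper triangle without the diagonal, so each pair appears exactly once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
rows, cols = np.where(mask)
pairs = pd.DataFrame({
    'Gene1': corr.index[rows],
    'Gene2': corr.columns[cols],
    'PCC': corr.values[mask],
})

# Rank by absolute correlation, so strong negative correlations also rank high
pairs = pairs.sort_values(by='PCC', key=abs, ascending=False).reset_index(drop=True)
```

Here the pair (g1, g3) with PCC -0.9 ranks first, ahead of (g2, g3) with 0.5 and (g1, g2) with 0.2.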

Save the resulting network

In [None]:
# Save Pairs

# PCC_pairs.to_csv('',index=False)

# OLS_pairs.to_csv('',index=False)

