# Estimating single-sample co-expression networks for yeast genetic screens using BONOBO
Enakshi Saha <sup>1</sup> and Viola Fanfani  <sup>1</sup>

<sup>1</sup> Harvard T.H. Chan School of Public Health, Boston, MA, USA.

# Introduction
BONOBO (Bayesian Optimized Networks Obtained By assimilating Omics data) [1] is an empirical Bayesian model that derives
individual sample-specific co-expression networks, facilitating the discovery of differentially co-regulated gene pairs
between different conditions and/or phenotypes. BONOBO derives positive semidefinite co-expression networks from input
data alone, without using any external reference datasets.

The figure below gives a general illustration of how BONOBO works:

![Graphical-Abstract](https://netzoo.s3.us-east-2.amazonaws.com/netbooks/bonobo/bonobo-graphical.png)


BONOBO requires a gene expression matrix as input, from which we would like to extract sample-specific correlation
networks. Then, for each of the samples, BONOBO infers the network by using both the Pearson's correlation matrix
computed on $N-1$ samples and the sample-specific squared-deviation about the mean. BONOBO outputs $N$ co-expression
networks, one for each sample, and the associated p-values for each of the gene-gene estimated edges.
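
To make the two ingredients mentioned above more concrete, the sketch below computes them for one sample of a toy expression matrix: the Pearson correlation between genes obtained from the remaining $N-1$ samples, and the squared deviation of the sample of interest about the mean. This only illustrates the inputs, it is not the BONOBO estimator itself. The toy data, the variable names and the choice of centering on the mean of the other samples are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
expr = rng.normal(size=(4, 6))   # toy data: 4 genes (rows) x 6 samples (columns)

q = 0                                  # index of the sample of interest
others = np.delete(expr, q, axis=1)    # expression of the remaining N-1 samples

# Pearson correlation between genes, computed without sample q
loo_corr = np.corrcoef(others)

# Squared deviation of sample q about the mean of the other samples
# (which mean to use is an assumption of this sketch)
dev = expr[:, q] - others.mean(axis=1)
sq_dev = np.outer(dev, dev)

print(loo_corr.shape, sq_dev.shape)
```

Both matrices are gene-by-gene and symmetric, which is the shape of the sample-specific networks that BONOBO returns.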


In the rest of this notebook we show an example of how to compute BONOBO networks using the
[netZooPy](https://github.com/netZoo/netZooPy) package [3].

We recommend installing the netZooPy package through conda (`conda install -c conda-forge -c anaconda -c netzoo
netzoopy`) and double-checking that the `pytables` installation is working properly.
 

# Importing packages

In [None]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
import glob
import os
from netZooPy.bonobo.bonobo import Bonobo # To import Bonobo

Next, we define the data path on the Netbooks server.

In [None]:
ppath = '/opt/data/netZooPy/bonobo/'

# 1. Compute BONOBO networks

To compute BONOBO networks we need a tab separated expression file, with samples on the columns and genes on the rows.

In this case we have generated the file from Jackson and colleagues' [2]
raw data ([GEO:GSE125162](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE125162)) following these steps:

Creating pseudobulk from raw counts:
- Group by genotype and condition and average the counts by gene (132 genotype x condition, 6529 non-zero genes).
- Remove genes that are always zero (GSE125162_all_pseudobulk_counts.txt).
- Log-transform the counts: `df_nonzero_log = np.log(1+df_nonzero)` (GSE125162_all_pseudobulk_logcounts.txt).

The data is composed of 132 samples and 6520 genes. There are no samples that have all genes non-expressed.
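
The preprocessing steps listed above can be sketched on a toy table as follows. This is a minimal illustration, not the script that produced the file: the column names `Genotype` and `Condition`, the gene names and the counts are all hypothetical.

```python
import numpy as np
import pandas as pd

# Toy single-cell counts: one row per cell, one column per gene (hypothetical values)
cells = pd.DataFrame({
    'Genotype':  ['WT', 'WT', 'gcn4', 'gcn4'],
    'Condition': ['YPD', 'YPD', 'YPD', 'YPD'],
    'geneA': [2, 4, 0, 6],
    'geneB': [0, 0, 0, 0],
    'geneC': [1, 3, 5, 5],
})

# 1. Average counts per gene within each genotype x condition group
pseudobulk = cells.groupby(['Genotype', 'Condition']).mean()

# 2. Drop genes that are zero in every pseudobulk sample
df_nonzero = pseudobulk.loc[:, (pseudobulk != 0).any(axis=0)]

# 3. Log-transform, then put genes on rows and samples on columns
df_nonzero_log = np.log(1 + df_nonzero).T
df_nonzero_log.columns = ['_'.join(c) for c in df_nonzero_log.columns]
print(df_nonzero_log)
```

The resulting table has genes on the rows and `<genotype>_<condition>` samples on the columns, which is the layout BONOBO expects.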

## 1.1. Check the data and generate output folder

First we need to specify which data we need and where the results are going to be saved. 

In [None]:
yeast_fn = ppath+'pseudobulk_nonzero_logcounts.tsv'

Then, we specify the output folder.

In [None]:
output_folder = '../results/bonobo_netbook/'

If the folder doesn't exist, we create it.

In [None]:
if not os.path.exists(output_folder):
    os.makedirs(output_folder)
    print('Created output folder:%s' %output_folder)
else:
    print('Output folder exists:%s' %output_folder)

In [None]:
all_samples = pd.read_csv(yeast_fn, nrows = 3, sep = '\t', index_col=0).columns.tolist()

With the following commands you can check which samples have already been computed and which are still left to compute. This is not our case here, so the check is switched off.

In [None]:
left_uncomputed=0

In [None]:
if left_uncomputed==1:
    done_bonobo = [i.split('/')[-1][7:-3] for i in  glob.glob(output_folder + 'bonobo/bonobo*.h5')]
    left = list(set(all_samples) - set(done_bonobo))
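
The bookkeeping above can be tried out without running BONOBO. The sketch below creates an empty placeholder file in a temporary folder and recovers the list of samples that are still missing. It assumes that output files are named `bonobo_<sample>.h5`, which is what the slicing `[7:-3]` in the cell above implies; the sample names are only examples.

```python
import tempfile
from pathlib import Path

all_names = ['WT(ho)_CStarve', 'WT(ho)_Glutamine', 'gcn4_CStarve']

with tempfile.TemporaryDirectory() as tmp:
    # Pretend that one network has already been written to disk
    (Path(tmp) / 'bonobo_WT(ho)_CStarve.h5').touch()

    # Strip the 'bonobo_' prefix (the '.h5' suffix is removed by .stem)
    done = {p.stem[len('bonobo_'):] for p in Path(tmp).glob('bonobo*.h5')}
    left = sorted(set(all_names) - done)

print(left)
```

The list `left` could then be passed as `sample_names` to compute only the missing networks.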

In [None]:
print(all_samples)

You can see that the samples are named with the `<genotype>_<medium>` convention.

For simplicity, we select a limited number of samples, so that we don't have to wait too long to compute all
networks.

In [None]:
all_samples

In [None]:
# Exclude four genotypes and keep only three of the media
excluded_genotypes = ('dal80', 'dal81', 'dal82', 'gat1')
selected_media = ('YPDRapa', 'CStarve', 'MinimalGlucose')
my_samples = [i for i in all_samples
              if not i.startswith(excluded_genotypes) and i.endswith(selected_media)]

In [None]:
len(my_samples)

We have arbitrarily selected 24 samples from 3 different conditions: YPD Rapa, CStarve and Minimal Glucose. We will use these samples to compute the BONOBO networks.

## 1.2. Computing BONOBO networks

BONOBO can be simply computed by first instantiating the Bonobo class, and then calling the `run_bonobo` method.

Below is an example of how these networks have been generated. We first initialize the Bonobo object.

In [None]:
bonobo_obj_sparse = Bonobo(yeast_fn)

The `Bonobo` object is initialized with the full expression file. To run BONOBO on a subset of the data, pass the sample names to the `run_bonobo` method, for example `sample_names=['WT(ho)_AmmoniumSulfate','WT(ho)_CStarve','WT(ho)_Glutamine']`.

Now, we run the actual BONOBO computation with the following parameters:
- `keep_in_memory=False`: the networks are too large to keep them all in memory.
- `sparsify=True, save_pvals=True`: the p-values of the edges are saved in the output folder.
- `output_fmt='.h5'`: the output format is HDF5, an efficient way to store the data on disk.
- `sample_names=my_samples`: only the samples selected above are computed.

In [None]:
bonobo_obj_sparse.run_bonobo(keep_in_memory=False, output_fmt='.h5', sparsify=True, output_folder=output_folder+'bonobo/', save_pvals=True, sample_names=my_samples)

It took 2 minutes to generate the networks. You can see which files have been generated below.

In [None]:
nets_fn = glob.glob(output_folder+'bonobo/bonobo*.h5')
pvals_prefix = (output_folder+'bonobo/pvals_')

len(nets_fn), nets_fn[:5]

## 1.3. Inspecting BONOBO networks

Here is what BONOBO networks look like, and how to use the p-values to threshold them.

In [None]:
test_bonobo = pd.read_hdf(nets_fn[0])
test_bonobo.head()

Rows have the same names as columns in the BONOBO network.
We can check that the network is symmetric (we use `np.isclose` to avoid rounding issues with floating-point numbers).

In [None]:
assert(np.isclose(test_bonobo.values, test_bonobo.values.T).all())
test_bonobo.index = test_bonobo.columns
test_bonobo.iloc[:5,:5]

We now check BONOBO's network edge distribution.

In [None]:
sns.histplot(test_bonobo.values[np.tril(np.ones(test_bonobo.shape))==1].flatten(), bins=100)

We do the same for the p-values. We first check if dimensions are correct.

In [None]:
test_pvals = pd.read_hdf(pvals_prefix + nets_fn[0].split('/')[-1][7:])
test_pvals.head()

In [None]:
assert(np.isclose(test_pvals.values, test_pvals.values.T).all())
test_pvals.index = test_pvals.columns
test_pvals.iloc[:5,:5]

Then, we check the p-values' distribution.

In [None]:
sns.histplot(test_pvals.values[np.tril(np.ones(test_pvals.shape))==1].flatten(), bins=100)

In [None]:
test_bonobo[test_pvals<0.05]

We now filter BONOBO networks for only those with significant edges.

In [None]:
sns.histplot((test_bonobo[test_pvals<0.05].values[np.tril(np.ones(test_pvals.shape))==1]).flatten(), bins = 100)
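
The masking used above can be seen more easily on a small example. In the sketch below, the 3-gene network and its p-values are made up for illustration. Indexing a dataframe with a boolean dataframe of the same shape keeps the shape and replaces the entries that fail the test with NaN.

```python
import numpy as np
import pandas as pd

genes = ['g1', 'g2', 'g3']
net = pd.DataFrame([[1.0, 0.8, -0.1],
                    [0.8, 1.0, 0.3],
                    [-0.1, 0.3, 1.0]], index=genes, columns=genes)
pvals = pd.DataFrame([[0.0, 0.001, 0.6],
                      [0.001, 0.0, 0.04],
                      [0.6, 0.04, 0.0]], index=genes, columns=genes)

# Boolean indexing keeps the shape and sets non-significant edges to NaN
masked = net[pvals < 0.05]

# Alternative: set non-significant edges to zero instead of NaN
zeroed = net.where(pvals < 0.05, 0.0)
print(masked)
```

Keeping NaN is convenient for plotting, since the histogram ignores missing values, while zeros are more natural when the thresholded matrix is used as a sparse adjacency matrix.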

# 2. Get the BONOBO networks for the analysis

Here we define two functions that read BONOBO networks sequentially and generate a manageable dataframe for downstream analysis. BONOBO networks are dense symmetric correlation matrices of size $G \times G$, where $G$ is the number of genes. Using all the edges for downstream analysis is infeasible or computationally expensive, hence we focus on three strategies (sparse, random, gene) to reduce the amount of data that is used. Empirical data also suggest that biological networks are sparse, so BONOBO allows thresholding the edges using p-values (sparse). Alternatively, one can randomly select K edges and extract them from all networks (random). Finally, we can select all edges involving a specific node (gene), for instance the GCN4 subnetwork.
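
All three strategies start from the same step: since the matrices are symmetric, only the entries above the diagonal are kept and the matrix is reshaped into one row per gene pair. The sketch below shows this step on a made-up 3-gene network; the names `net`, `edges` and `sampleA` are illustrative only.

```python
import numpy as np
import pandas as pd

genes = ['g1', 'g2', 'g3']
net = pd.DataFrame([[1.0, 0.8, -0.1],
                    [0.8, 1.0, 0.3],
                    [-0.1, 0.3, 1.0]], index=genes, columns=genes)

# Keep entries strictly above the diagonal; everything else becomes NaN
upper = net.where(np.triu(np.ones(net.shape), k=1).astype(bool))

# Long format: stacking and dropping the NaN entries leaves one row per gene pair
edges = upper.stack().dropna().reset_index()
edges.columns = ['gene1', 'gene2', 'sampleA']
print(edges)
```

A network with $G$ genes yields $G(G-1)/2$ rows, here 3. The explicit `dropna()` is a precaution of this sketch, so that the result does not depend on whether `stack()` drops missing values by default.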

In [None]:
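def get_bonobo_dataframe(nets_fn, pvals_prefix = None, strategy = 'sparse', pth = 0.05, nk=1000, gene = 'YEL009C'):
    """Read BONOBO networks one by one and combine them into a long-format table.

    Returns a dataframe with columns gene1, gene2 and one column per sample.
    strategy: 'sparse' keeps edges with p-value < pth, 'random' keeps nk random
    edges, 'gene' keeps edges involving `gene`; any other value keeps all edges.
    """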

    df_bonobos = pd.DataFrame()

    for iii,bbb in enumerate(nets_fn):
        if iii>-1:
            k = bbb.split('/')[-1][:-3][7:]
            temp = pd.read_hdf(bbb)
                
            temp.index = temp.columns
            # Upper triangular matrix
            temp = temp.where(np.triu(np.ones(temp.shape), k = 1).astype(bool))
            # Put in long format
            temp = temp.stack().reset_index() 
            # Rename columns
            temp.columns = ['gene1','gene2',k]
            
            if strategy == 'sparse':
                
                print(pvals_prefix + k + '.h5')
                assert os.path.isfile(pvals_prefix + k + '.h5'), 'pvals file not found'
                pvals_fn = pvals_prefix + k + '.h5'
                pvals = pd.read_hdf(pvals_fn)
                pvals = pvals.where(np.triu(np.ones(pvals.shape), k = 1).astype(bool))
                pvals = pvals.stack().reset_index() 
                
                ps = pvals.iloc[:,2].values
                
                temp = temp[ps<pth]

                if iii== 0:
                    df_bonobos = temp
                else:
                    df_bonobos = pd.merge(df_bonobos, temp, how = 'outer', on = ['gene1','gene2'])
                    
            elif strategy == 'random':
                if iii==0:
                    print(k)
                    print(nk)
                    index_random = np.random.choice(np.arange(len(temp.index)), nk, replace = False)
                    df_bonobos = temp.iloc[index_random,:]
                    
                else:
                    #df_bonobos[k] = temp.iloc[index_random,:][k]
                    df_bonobos = pd.concat([df_bonobos, temp.iloc[index_random,:][k]], axis = 1)
                    
            elif strategy == 'gene':
                # Get only one gene
                if iii == 0:
                    df_bonobos = temp[(temp['gene1'] == gene) | (temp['gene2'] == gene)]
                else:
                    df_bonobos = pd.concat([df_bonobos, temp[(temp['gene1'] == gene) | (temp['gene2'] == gene)][k]], axis = 1)
                    
            else:
                if iii == 0:
                    df_bonobos = temp
                else:
                    #df_bonobos[k] = temp[k]
                    df_bonobos = pd.concat([df_bonobos, temp[k]], axis = 1)
            

    return(df_bonobos)

The second function fetches the edges connected to a given gene from the BONOBO networks.

In [None]:
def get_gene_bonobo_dataframe(nets_fn, gene = 'YEL009C'):

    df_bonobos = pd.DataFrame()

    for iii,bbb in enumerate(nets_fn):
        if iii>-1:
            k = bbb.split('/')[-1][:-3][7:]
            temp = pd.read_hdf(bbb)
                
            temp.index = temp.columns
            
            temp = temp.loc[:,[gene]]
        
            temp['gene1'] = temp.index
            temp['gene2'] = gene
            temp[k] = temp.loc[:,[gene]]
            temp = temp.loc[:,['gene1','gene2',k]]
            
            if iii == 0:
                df_bonobos = temp
            else:
                df_bonobos = pd.concat([df_bonobos, temp.loc[:,k]], axis = 1)

    return(df_bonobos)

## 2.1. Sparse strategy

The sparse strategy sparsifies a fully connected network using p-value thresholds on the network edges. Here you should change the paths depending on where you want to save the data.

In [None]:
nets_fn = glob.glob(output_folder+'bonobo/bonobo*.h5')
pvals_prefix = (output_folder+'bonobo/pvals_')

We use the sparse strategy based on p-values to obtain the networks.

In [None]:
bonobo_sparse = get_bonobo_dataframe(nets_fn, pvals_prefix = pvals_prefix, strategy = 'sparse', pth=0.01)

We can also save the network (not needed when running on the server).

In [None]:
save_network=0

In [None]:
if save_network == 1:
    bonobo_sparse.to_hdf(output_folder+'bonobo_sparse_001.h5', key='bonobo_sparse', mode='w')

In [None]:
bonobo_sparse

### 2.1.1. Correlation between BONOBO samples



First, we need to generate an auxiliary table to keep track of the genotype and medium name.

In [None]:
bonobo_names_df = pd.DataFrame()
bonobo_names_df.index = bonobo_sparse.columns[2:]
bonobo_names_df['genotype'] = [i.split('_')[0] for i in bonobo_names_df.index]
bonobo_names_df['medium'] = [i.split('_')[1] for i in bonobo_names_df.index]
bonobo_names_df

We compute correlation between network edges.

In [None]:
corrs = bonobo_sparse.iloc[:,2:].corr()
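
Because the sparse table is built with an outer merge, an edge that is not significant in a given sample appears as NaN in that sample's column. `DataFrame.corr()` excludes missing values pairwise, so each correlation is computed only on the edges present in both samples. The toy table below, with made-up values, illustrates this behavior.

```python
import numpy as np
import pandas as pd

# Toy edge table: NaN marks an edge that was not significant in that sample
edges = pd.DataFrame({
    'gene1': ['g1', 'g1', 'g1', 'g2', 'g2'],
    'gene2': ['g2', 'g3', 'g4', 'g3', 'g4'],
    'sampleA': [0.9, 0.1, -0.5, 0.4, np.nan],
    'sampleB': [0.8, 0.2, -0.4, np.nan, 0.3],
})

# For each pair of columns, corr() uses only the rows where both are present
corrs_toy = edges.iloc[:, 2:].corr()

# Equivalent manual computation on the three shared edges
shared = edges[['sampleA', 'sampleB']].dropna()
manual = np.corrcoef(shared['sampleA'], shared['sampleB'])[0, 1]
print(corrs_toy.loc['sampleA', 'sampleB'], manual)
```

Keep in mind that two samples sharing only a few significant edges will have a correlation estimated from very few points.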

In [None]:
details = bonobo_names_df

sns.set_context('talk')

from matplotlib.colors import rgb2hex

# One color per medium; plt.get_cmap(name, n) resamples the colormap to n colors
colors = details.medium.unique()
cmap = plt.get_cmap('Set1', len(colors))
color_list = [rgb2hex(cmap(i)[:3]) for i in range(cmap.N)]
lut = dict(zip(colors, color_list))
row_colors = details.medium.map(lut)

# One color per genotype
colors2 = details.genotype.unique()
cmap2 = plt.get_cmap('plasma', len(colors2))
color_list2 = [rgb2hex(cmap2(i)[:3]) for i in range(cmap2.N)]
lut2 = dict(zip(colors2, color_list2))
row_colors2 = details.genotype.map(lut2)

# add name to cbar
f1 = sns.clustermap(corrs, row_colors=[row_colors, row_colors2], cmap = 'jet', 
                    figsize=(20,20), cbar_pos=(0.02, 0.80, 0.05, 0.18), cbar_kws= {'label':'Pearson Corr'}, xticklabels=1, yticklabels=1)

f1.ax_col_dendrogram.set_visible(False)
from matplotlib.patches import Patch

handles = [Patch(facecolor=lut[name]) for name in lut]
l1 = plt.legend(handles, lut, title='Media',
           bbox_to_anchor=(.5, .985), bbox_transform=plt.gcf().transFigure, loc='upper right')

# change the location 
handles = [Patch(facecolor=lut2[name]) for name in lut2]
l2 = plt.legend(handles, lut2, title='Genotype',
           bbox_to_anchor=(.7, .985), bbox_transform=plt.gcf().transFigure, loc='upper right')

# Move the legends
plt.gca().add_artist(l1)
plt.gca().add_artist(l2)

plt.tight_layout()
#f1.savefig(results_folder+'/bonobo_correlation_clustermap.pdf', bbox_extra_artists=(l1,l2), #bbox_inches='tight')

## 2.3. Random strategy

Here, the random strategy picks k random edges from the networks to reduce the amount of data. We first read the BONOBO networks and keep 1000 random edges as representative of the networks.

In [None]:
df_random_1k = get_bonobo_dataframe(nets_fn, strategy = 'random', nk = 1000)

In [None]:
df_random_1k

We compute correlation between network edges.

In [None]:
corrs_random = df_random_1k.iloc[:,2:].corr()

In [None]:
details = bonobo_names_df

sns.set_context('talk')

from matplotlib.colors import rgb2hex

# One color per medium; plt.get_cmap(name, n) resamples the colormap to n colors
colors = details.medium.unique()
cmap = plt.get_cmap('Set1', len(colors))
color_list = [rgb2hex(cmap(i)[:3]) for i in range(cmap.N)]
lut = dict(zip(colors, color_list))
row_colors = details.medium.map(lut)

# One color per genotype
colors2 = details.genotype.unique()
cmap2 = plt.get_cmap('plasma', len(colors2))
color_list2 = [rgb2hex(cmap2(i)[:3]) for i in range(cmap2.N)]
lut2 = dict(zip(colors2, color_list2))
row_colors2 = details.genotype.map(lut2)

# add name to cbar
f1 = sns.clustermap(corrs_random, row_colors=[row_colors, row_colors2], cmap = 'jet', 
                    figsize=(20,20), cbar_pos=(0.02, 0.80, 0.05, 0.18), cbar_kws= {'label':'Pearson Corr'}, xticklabels=1, yticklabels=1)

f1.ax_col_dendrogram.set_visible(False)
from matplotlib.patches import Patch

handles = [Patch(facecolor=lut[name]) for name in lut]
l1 = plt.legend(handles, lut, title='Media',
           bbox_to_anchor=(.5, .985), bbox_transform=plt.gcf().transFigure, loc='upper right')

# change the location 
handles = [Patch(facecolor=lut2[name]) for name in lut2]
l2 = plt.legend(handles, lut2, title='Genotype',
           bbox_to_anchor=(.7, .985), bbox_transform=plt.gcf().transFigure, loc='upper right')

# Move the legends
plt.gca().add_artist(l1)
plt.gca().add_artist(l2)

plt.tight_layout()
#f1.savefig(results_folder+'/bonobo_correlation_clustermap.pdf', bbox_extra_artists=(l1,l2), #bbox_inches='tight')

If you compare these results with those obtained by sparsifying the network, it is easy to see that networks tend to be
more similar to each other when using random edges. This makes sense, as the sparsification process will likely pick up
sample-specific trends rather than general, non-significant edges.

## 2.4. Gene strategy

Here, we will get all perturbed edges for the gene GCN4.
For each BONOBO network, we select the GCN4 subnetwork (the edges connected to GCN4) to show that the networks are able to detect the effect of
the knockout (KO) perturbation.

We expect the edges connected to GCN4, in the samples where GCN4 was knocked out, to exhibit different patterns of
connectivity, as the cells probably have to rewire some of the processes that include GCN4.

For that we use one of the previously defined functions and ask it to retrieve the edges for 'YEL009C', which is the
systematic name of GCN4.


In [None]:
bonobo_gcn4 = get_gene_bonobo_dataframe(nets_fn, gene = 'YEL009C')

Here is the subnetwork for the gene YEL009C.

In [None]:
bonobo_gcn4.head()

Let's reindex the dataframe, such that we don't need gene1 and gene2 columns.

In [None]:
bonobo_gcn4.index = bonobo_gcn4.gene1 + '-' + bonobo_gcn4.gene2
bonobo_gcn4 = bonobo_gcn4.iloc[:,2:]

We now sort the columns, such that similar phenotypes are close to each other.

In [None]:
bonobo_gcn4 = bonobo_gcn4.sort_index(axis=1)
bonobo_gcn4

Then, we compute variance and absolute average per edge, such that we can select the most variable/strongest edges.

In [None]:
variance_peredge = bonobo_gcn4.var(axis = 1)
mean_peredge = np.abs(bonobo_gcn4).mean(axis = 1)
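
The two scores capture different things, which is why both are used below. In the toy table of this sketch (made-up values, 5 edges by 3 samples), an edge that flips sign in one sample has the highest variance but only a moderate absolute mean, while a consistently strong edge has a high absolute mean and almost no variance. The 80th percentile is used here only because the table is tiny.

```python
import numpy as np
import pandas as pd

# Toy table: 5 edges (rows) x 3 samples (columns), hypothetical values
toy = pd.DataFrame(
    [[0.10, 0.10, 0.10],
     [0.60, -0.60, 0.60],
     [0.50, 0.40, 0.60],
     [-0.80, -0.70, -0.90],
     [0.00, 0.05, -0.05]],
    index=['e1', 'e2', 'e3', 'e4', 'e5'], columns=['s1', 's2', 's3'])

var_toy = toy.var(axis=1)               # spread of each edge across samples
mean_abs_toy = toy.abs().mean(axis=1)   # average edge strength, ignoring sign

# Keep the edges above the 80th percentile of each score
most_variable = toy[var_toy > np.percentile(var_toy, 80)]
strongest = toy[mean_abs_toy > np.percentile(mean_abs_toy, 80)]
print(most_variable.index.tolist(), strongest.index.tolist())
```

Here the most variable edge is `e2` and the strongest edge is `e4`, so the two filters select different edges.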

And we plot these edge values on a heatmap.

In [None]:
f, ax = plt.subplots(figsize=(10, 5))
sns.heatmap(bonobo_gcn4[variance_peredge>np.percentile(variance_peredge, 90)], cmap = 'jet', ax = ax, yticklabels=False)

# Get the current xticklabels
xticklabels = ax.get_xticklabels()

# Set the xticklabels of the gcn4 KO samples to bold
for label in xticklabels:
    if label.get_text().startswith('gcn4'):
        label.set_weight('bold')

# Apply the modified labels back to the heatmap
ax.set_xticklabels(xticklabels)


ax.set_ylabel('Edges connected \nto GCN4')
ax.set_xlabel('Samples')
ax.set_title('Variable edges \n(absolute variance above 90th percentile)')

Now, we select the strongest edges as opposed to those that have highest variability.

In [None]:
f, ax = plt.subplots(figsize=(10, 5))
sns.heatmap(bonobo_gcn4[mean_peredge>np.percentile(mean_peredge, 90)], cmap = 'jet', ax = ax, yticklabels=False, center = 0)


# Get the current xticklabels
xticklabels = ax.get_xticklabels()

# Set the xticklabels of the gcn4 KO samples to bold
for label in xticklabels:
    if label.get_text().startswith('gcn4'):
        label.set_weight('bold')

# Apply the modified labels back to the heatmap
ax.set_xticklabels(xticklabels)

ax.set_ylabel('Edges connected \nto GCN4')
ax.set_xlabel('Samples')
ax.set_title('Strongest edges \n(absolute mean above 90th percentile)')

In both cases, you can see that the gcn4 KO samples (x-axis labels in bold) have different co-expression values compared to the rest of the networks. 

# References

1- Saha, Enakshi, et al. "Bayesian Optimized sample-specific Networks Obtained By Omics data (BONOBO)." bioRxiv (2023).

2- Jackson, Christopher A., et al. "Gene regulatory network reconstruction using single-cell RNA sequencing of barcoded genotypes in diverse environments." eLife 9 (2020): e51254.

3- Ben Guebila, Marouen, et al. "The Network Zoo: a multilingual package for the inference and analysis of gene regulatory networks." Genome Biology 24.1 (2023): 45.