# Network fusion

Sergiu Netotea, PhD, NBIS, Chalmers

## Network models are a very complex representation of data


- quadratic growth in complexity: n vertices allow up to n(n-1) possible (directed) edges
- biological data typically suffers from the curse of dimensionality
- differences in scale, collection bias and noise in each data set
- complementary nature of the information provided by different types of data
- the potential for feature cleaning and data transformation using a network structure is great
- one important tradeoff: losing a weak signal due to a poorly made network model
- one important benefit: recovering a weak signal due to the cumulative power of weak links


### Typical dataset:
- features: samples, patients, genes, metabolites, clinical descriptors etc
- instances: any repeated measurements of the above!

> Via network modelling, all instances are turned into feature relationships (data structure becomes uni-dimensional)

### Common network (graph) representation:
- Nodes (vertices): features
- Links (edges): numerical relationship between nodes (directional, boolean, discrete, continuous)

> Via **network fusion**, the feature relationships (links) are described based on multiple datasets.


![NF basics](img/nf_basics.png "NF basics")

## How can networks be fused?

*Simple rule: average edge values by summing up adjacency matrices!*
    
Pitfalls:
- Most omics networks are sparse: some feature relationships are prevalent only on certain 'omics levels!
- Guilt by association is not captured: features similar to pan-omics features should not be treated equally.
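
As a rough illustration of the simple rule, the sketch below averages adjacency matrices defined over the same set of nodes. The function name `average_fusion` and the two toy networks are hypothetical, chosen only for this example.

```python
import numpy as np

def average_fusion(adjacencies):
    """Fuse networks over the same nodes by averaging their adjacency matrices."""
    stacked = np.stack([np.asarray(a, dtype=float) for a in adjacencies])
    return stacked.mean(axis=0)

# Two hypothetical 3-node networks, e.g. one per omics level.
A_rna = np.array([[0, 1, 0],
                  [1, 0, 0],
                  [0, 0, 0]])
A_meth = np.array([[0, 1, 1],
                   [1, 0, 0],
                   [1, 0, 0]])

fused = average_fusion([A_rna, A_meth])
print(fused)
```

An edge present in only one network (here between nodes 0 and 2) is simply halved, which illustrates the sparsity pitfall: level-specific relationships are diluted rather than weighed by their context.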

*Complex rule: take into consideration the information diffusivity in each network when fusing them!*

- in graph computing this is called "message passing".
- pitfalls: model parametrization


# (Perceived) advantages of network fusion

Very useful when: 
- data is incomplete
- dataset quality is not uniform
- data from multiple sources is used
- networks can be built in complex ways, taking into account additional, annotated information.
- easy to transfer knowledge from alternative models

It is a generative model!
- TODO: generate data from similarity matrix
    - https://en.wikipedia.org/wiki/Multidimensional_scaling
- But the generated networks can be used for graph clustering, regression, classification.
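
The multidimensional scaling idea linked above could be sketched as follows: classical (Torgerson) MDS recovers coordinates from a distance matrix. This is a minimal sketch, not part of any fusion package. How a fused similarity matrix is converted into distances (for example `1 - similarity`) is left open here and would be an assumption of the user.

```python
import numpy as np

def classical_mds(D, n_components=2):
    """Embed points from a symmetric distance matrix D (classical MDS)."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)       # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale

# Toy check: three points on a line at positions 0, 1 and 3.
pts = np.array([[0.0], [1.0], [3.0]])
D = np.abs(pts - pts.T)
X = classical_mds(D, n_components=1)
D_rec = np.abs(X - X.T)                        # distances of the embedding
```

For Euclidean input distances the embedding reproduces the original distances up to rotation and reflection, which is what the toy check relies on.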

# Similarity networks

- At every 'omics level, there are multiple ways to compute a network.
- Similarity networks only capture how similar one feature is to another, with respect to a chosen distance measure.
- Different distance measures produce different networks. See https://en.wikipedia.org/wiki/Metric_space

![similarity](img/similarity.png)

# SNF: Similarity Network Fusion

Wang, Bo et al. “Similarity network fusion for aggregating data types on a genomic scale.” Nature methods vol. 11,3 (2014): 333-7.

https://doi.org/10.1038/nmeth.2810


- mRNA expression, DNA methylation and microRNA (miRNA) expression data for five cancer data sets.
- clustering cancer subtypes and predicting survival
- uses networks of samples (patient networks) as a basis for integration. 
- at the time, patient-similarity networks had not been used specifically for integrating biological data


## Method - Similarity distance:

- continuous data: Euclidean distance, weighted using an exponential kernel function
    - TODO: describe the regularization step
- discrete data: chi squared distance
$\sum_{i=1}^{n} \cfrac{(x_i - y_i)^2} {(x_i + y_i)}$
- boolean data: agreement based distances (Jaccard, Hamming)
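
The distances listed above can be sketched in a few lines. The function names are hypothetical. The exponential kernel below uses a single fixed bandwidth `sigma`, which is an assumption made for brevity: it is a plain Gaussian kernel, not the locally scaled kernel with the regularization step that SNF describes.

```python
import numpy as np

def chi_squared_distance(x, y):
    """Chi-squared distance between two non-negative count vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    denom = x + y
    mask = denom > 0                       # skip terms where both counts are zero
    return float(np.sum((x[mask] - y[mask]) ** 2 / denom[mask]))

def hamming_distance(x, y):
    """Fraction of positions at which two boolean vectors disagree."""
    x = np.asarray(x, dtype=bool)
    y = np.asarray(y, dtype=bool)
    return float(np.mean(x != y))

def exponential_affinity(X, sigma=1.0):
    """Affinity from pairwise Euclidean distances: exp(-d^2 / (2 sigma^2))."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    d2 = np.sum(diff ** 2, axis=-1)        # squared Euclidean distances
    return np.exp(-d2 / (2.0 * sigma ** 2))

d_chi = chi_squared_distance([1, 2, 3], [1, 0, 5])
d_ham = hamming_distance([1, 0, 1, 1], [1, 1, 1, 0])
W_toy = exponential_affinity(np.array([[0.0, 0.0], [3.0, 4.0]]), sigma=5.0)
```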

## Method - Inductive network fusion (1)

- In order to fuse the supplied m similarity (affinity) matrices, each must be normalized. A traditional normalization on an affinity matrix would suffer from numerical instabilities due to the self-similarity along the diagonal; thus, a modified normalization is used:

$$
      \mathbf{P}(i,j) =
         \left\{\begin{array}{rr}
           \frac{\mathbf{W}(i,j)}
                 {2 \sum_{k\neq i} \mathbf{W}(i,k)} ,& j \neq i \\
                                                   1/2 ,& j = i
         \end{array}\right.
$$

- Under the assumption that local similarities are more important than distant ones, a more sparse weight matrix is calculated based on a KNN framework:

$$
      \mathbf{S}(i,j) =
         \left\{\begin{array}{rr}
           \frac{\mathbf{W}(i,j)}
                 {\sum_{k\in N_{i}}\mathbf{W}(i,k)} ,& j \in N_{i} \\
                                                  0 ,& \text{otherwise}
         \end{array}\right.
$$
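
The two normalizations above might be implemented along these lines. The function names `full_kernel` and `local_kernel` are hypothetical, and the sketch assumes that the neighbourhood $N_i$ contains the K largest affinities in row i, including i itself.

```python
import numpy as np

def full_kernel(W):
    """P: off-diagonal entries of each row sum to 1/2, diagonal fixed at 1/2."""
    off = np.array(W, dtype=float)
    np.fill_diagonal(off, 0.0)
    P = off / (2.0 * off.sum(axis=1, keepdims=True))
    np.fill_diagonal(P, 0.5)
    return P

def local_kernel(W, K):
    """S: keep the K strongest affinities per row and normalize them to sum to 1."""
    W = np.asarray(W, dtype=float)
    S = np.zeros_like(W)
    for i in range(W.shape[0]):
        # Assumption: N_i is the K largest entries of row i (self included).
        neighbours = np.argsort(W[i])[::-1][:K]
        S[i, neighbours] = W[i, neighbours] / W[i, neighbours].sum()
    return S

# Hypothetical affinity matrix for three patients.
W = np.array([[1.0, 0.8, 0.1],
              [0.8, 1.0, 0.3],
              [0.1, 0.3, 1.0]])
P = full_kernel(W)
S = local_kernel(W, K=2)
```

Both matrices are row-stochastic, but S is sparse: everything outside the K nearest neighbours is set to zero.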



## Method - Inductive network fusion (2)

- The two weight matrices P and S thus provide information about a given patient's similarity to all other patients and the K most similar patients, respectively.
- These m matrices are then iteratively fused. At each iteration, the matrices are made more similar to each other via:

$$
\mathbf{P}^{(v)} = \mathbf{S}^{(v)}
                          \times
                          \frac{\sum_{k\neq v}^{}\mathbf{P}^{(k)}}{m-1}
                          \times
                          (\mathbf{S}^{(v)})^{T},
                          v = 1, 2, ..., m
$$

Obs:
- After each iteration, the resultant P matrices are re-normalized using the normalization equation for $\mathbf{P}$ given earlier.
- Fusion stops after `t` iterations, or when the matrices $\mathbf{P}^{(v)}, v = 1, 2, ..., m$ converge.
- The output fused matrix is full rank and can be subjected to clustering and classification.
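
A compact sketch of the iterative update described above, for m >= 2 affinity matrices. All function names are hypothetical. The stopping tolerance `tol`, the choice of $N_i$ as the K largest entries per row, and the final step (averaging the $\mathbf{P}^{(v)}$ and symmetrizing) are assumptions of this sketch rather than details stated in these notes.

```python
import numpy as np

def normalize_P(W):
    """Full kernel: off-diagonal row mass 1/2, diagonal 1/2."""
    off = np.array(W, dtype=float)
    np.fill_diagonal(off, 0.0)
    P = off / (2.0 * off.sum(axis=1, keepdims=True))
    np.fill_diagonal(P, 0.5)
    return P

def knn_S(W, K):
    """Local kernel: K strongest affinities per row, normalized to sum to 1."""
    W = np.asarray(W, dtype=float)
    S = np.zeros_like(W)
    for i in range(W.shape[0]):
        nn = np.argsort(W[i])[::-1][:K]
        S[i, nn] = W[i, nn] / W[i, nn].sum()
    return S

def fuse(affinities, K=2, t=20, tol=1e-6):
    """Iteratively fuse m >= 2 affinity matrices defined on the same samples."""
    m = len(affinities)
    S = [knn_S(W, K) for W in affinities]
    P = [normalize_P(W) for W in affinities]
    for _ in range(t):
        P_new = []
        for v in range(m):
            others = sum(P[k] for k in range(m) if k != v) / (m - 1)
            P_new.append(normalize_P(S[v] @ others @ S[v].T))
        change = max(np.abs(a - b).max() for a, b in zip(P_new, P))
        P = P_new
        if change < tol:               # assumed convergence criterion
            break
    fused = sum(P) / m                 # assumed final step: average the views
    return (fused + fused.T) / 2.0     # and symmetrize

# Two hypothetical views of four patients forming two groups: {0, 1} and {2, 3}.
W1 = np.array([[1.0, 0.9, 0.1, 0.1],
               [0.9, 1.0, 0.1, 0.1],
               [0.1, 0.1, 1.0, 0.8],
               [0.1, 0.1, 0.8, 1.0]])
W2 = np.array([[1.0, 0.7, 0.2, 0.1],
               [0.7, 1.0, 0.1, 0.2],
               [0.2, 0.1, 1.0, 0.9],
               [0.1, 0.2, 0.9, 1.0]])
fused = fuse([W1, W2], K=2, t=20)
```

In this toy case the within-group affinities stay larger than the between-group ones after fusion, which is the structure a subsequent clustering step would pick up.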

## TODO:
- Network Clustering
- Prediction



# Consensus clustering


- Monti, S., Tamayo, P., Mesirov, J. et al. Consensus Clustering: A Resampling-Based Method for Class Discovery and Visualization of Gene Expression Microarray Data. Machine Learning 52, 91–118 (2003). https://doi.org/10.1023/A:1023949509487
- Implementation: http://www.bioconductor.org/packages/release/bioc/html/CancerSubtypes.html

SNF drawbacks:
- the unstable nature of kernel-based clustering 
- sensitive to small changes in molecular measurements
- need for hyper-parametrization

Details:
- The data are first partitioned with different values of k (number of clusters).
- For each value of k, we construct the pair-wise connectivity matrix.
- To identify the number of clusters we add noise to the data and then build the pair-wise connectivity for the perturbed data.
- We calculate the discrepancy in pair-wise connectivity before and after data perturbation, and choose opt_k as the number of clusters for which the pair-wise connectivity is the most stable.
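
The stability criterion above can be sketched without committing to a particular clustering algorithm. In the sketch below the cluster labels are hypothetical inputs (in practice they would come from clustering the original and the noise-perturbed data), and the mean absolute difference is just one possible choice of discrepancy measure.

```python
import numpy as np

def connectivity(labels):
    """Pair-wise connectivity: entry (i, j) is 1 if samples i and j share a cluster."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)

def discrepancy(labels_original, labels_perturbed_runs):
    """Mean absolute difference between original and perturbed connectivity."""
    C0 = connectivity(labels_original)
    diffs = [np.abs(C0 - connectivity(run)).mean() for run in labels_perturbed_runs]
    return float(np.mean(diffs))

def choose_k(partitions):
    """partitions maps k -> (original labels, list of perturbed labels)."""
    scores = {k: discrepancy(orig, runs) for k, (orig, runs) in partitions.items()}
    return min(scores, key=scores.get), scores

# Hypothetical labels for six samples, clustered with k = 2 and k = 3.
partitions = {
    2: ([0, 0, 0, 1, 1, 1], [[0, 0, 0, 1, 1, 1], [1, 1, 1, 0, 0, 0]]),
    3: ([0, 0, 1, 1, 2, 2], [[0, 1, 1, 2, 2, 0], [0, 0, 1, 2, 2, 1]]),
}
opt_k, scores = choose_k(partitions)
```

Working on connectivity matrices rather than raw labels makes the comparison invariant to label permutations, so the relabelled k = 2 run counts as perfectly stable.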




## Another SNF-CC example:

- Nguyen, Tin et al. “A novel approach for data integration and disease subtyping.” Genome research vol. 27,12 (2017): 2025-2039.
- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5741060/


**Stage 1**

- construct the combined similarity matrix between patients using the connectivity information from individual data types. 
- partition the patients using the integrated similarity matrix.

**Stage 2**
- further split each discovered group of patients into subgroups if possible.



## Weighted similarity network fusion

Links:
- Xu T, Le TD, Liu L, Wang R, Sun B, et al. (2016) Identifying Cancer Subtypes from miRNA-TF-mRNA Regulatory Networks and Expression Data. PLOS ONE 11(4): e0152792. https://doi.org/10.1371/journal.pone.0152792
- Implementation:
    - https://rdrr.io/bioc/CancerSubtypes/
    - https://rdrr.io/cran/IntClust/



## WSNF - Method (1)

1. Compute PageRank based ranking of features. Important features are ranked based on the in-number of TFs.
    - Build the regulatory network where the nodes represent the features, i.e. the microRNAs (miRNAs), transcription factors (TFs) and messenger RNAs (mRNAs) and the edges indicate the interactions between the features.
    - The interactions are retrieved from various interactome databases.
![wsnf_data](img/wsnf_data.png)
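
For orientation, a generic power-iteration PageRank on a small directed network could look like the sketch below. This is the textbook algorithm, not necessarily the exact ranking variant used by WSNF. The damping factor, the edge direction convention, and the four-node toy network are assumptions made for illustration.

```python
import numpy as np

def pagerank(adjacency, damping=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank; adjacency[i, j] = 1 means an edge i -> j."""
    A = np.asarray(adjacency, dtype=float)
    n = A.shape[0]
    out_degree = A.sum(axis=1)
    # Row-stochastic transition matrix; nodes without out-edges jump uniformly.
    T = np.where(out_degree[:, None] > 0,
                 A / np.maximum(out_degree[:, None], 1.0),
                 1.0 / n)
    rank = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new_rank = (1.0 - damping) / n + damping * (T.T @ rank)
        converged = np.abs(new_rank - rank).sum() < tol
        rank = new_rank
        if converged:
            break
    return rank

# Hypothetical regulatory network: 0 = miRNA, 1 = TF, 2 and 3 = mRNAs.
edges = np.array([[0, 1, 1, 0],
                  [0, 0, 1, 1],
                  [0, 0, 0, 0],
                  [0, 0, 0, 0]])
ranks = pagerank(edges)
```

With this edge direction, the node receiving the most regulatory input (node 2) obtains the highest rank; reversing the edges would instead favour the regulators.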


## WSNF - Method (2)

2. Integrate feature ranking and feature variation. Gene expression based.
    - Use the network information and the expression data of the miRNAs, TFs and mRNAs to calculate the weight of the features, representing the level of importance of the features.
3. Weighted SNF
    - The feature weight is then integrated into a network fusion approach to cluster the samples (patients) and thus to identify cancer subtypes. 
    - $D(s_i, s_j) = \sqrt{\sum_{m=1}^p{W(f_m)*(f_m^{s_i}-f_m^{s_j})^2}}$
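
The weighted distance $D(s_i, s_j)$ above translates directly into code. The function name and the toy values are hypothetical.

```python
import numpy as np

def weighted_distance(sample_i, sample_j, weights):
    """Weighted Euclidean distance D(s_i, s_j) with one weight per feature."""
    diff = np.asarray(sample_i, dtype=float) - np.asarray(sample_j, dtype=float)
    w = np.asarray(weights, dtype=float)
    return float(np.sqrt(np.sum(w * diff ** 2)))

# Three features with hypothetical importance weights W(f_m).
s_i = [1.0, 2.0, 3.0]
s_j = [1.0, 0.0, 7.0]
feature_weights = [0.5, 0.25, 0.25]

d_weighted = weighted_distance(s_i, s_j, feature_weights)
d_plain = weighted_distance(s_i, s_j, [1.0, 1.0, 1.0])   # ordinary Euclidean
```

Setting all weights to 1 recovers the ordinary Euclidean distance, so the feature weights are the only place where the regulatory network information enters the sample similarity.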


## WSNF - Method (3)

![wsnf](img/wsnf.png)

## ANF - Affinity network fusion

- Tianle Ma, Aidong Zhang, Affinity Network Fusion and Semi-supervised Learning for Cancer Patient Clustering, 2019, https://doi.org/10.1016/j.ymeth.2018.05.020
- Software: https://www.bioconductor.org/packages/release/bioc/html/ANF.html
    - https://github.com/BeautyOfWeb/ANF
- Data: gene expression, miRNA expression and DNA methylation


<img src="img/anf_alg.jpg" width="600"/>

- Overall, the methodology is similar to SNF. Weak edges derived from a similarity distance are pruned using a kNN-based kernel; SNF accomplishes the same with its message passing kernel.
- The matrix fusion process is random-walk based and somewhat clearer, although it parametrizes feature importance with a set of omics-specific weights:
    - $W = \sum_{v=1}^n{w_v * W^{(v)}}$, $\sum_{v=1}^n{w_v}=1$
    - W is regarded as a "state transition matrix", and multiplying it by itself r times corresponds to an r-step random walk on its generated graph: $W \leftarrow W^r$
        - this step is similar to the inductive message passing in SNF
        - the r parameter must be small (2 or 3), otherwise W approaches a rank-one matrix
        - a more complex random walk was also explored, which uses some of the weaker, kNN-pruned edges.
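
A minimal sketch of the weighted sum followed by an r-step random walk. The function names are hypothetical and the kNN pruning step is omitted, so this is an illustration of the two formulas above and not the ANF package implementation.

```python
import numpy as np

def row_normalize(W):
    """Turn an affinity matrix into a row-stochastic transition matrix."""
    W = np.asarray(W, dtype=float)
    return W / W.sum(axis=1, keepdims=True)

def anf_fuse(transition_matrices, view_weights, r=2):
    """Weighted sum of per-view transition matrices, then an r-step random walk."""
    w = np.asarray(view_weights, dtype=float)
    w = w / w.sum()                                  # enforce sum of weights = 1
    W = sum(wv * Wv for wv, Wv in zip(w, transition_matrices))
    return np.linalg.matrix_power(W, r)

# Two hypothetical views of three patients.
A1 = np.array([[1.0, 0.8, 0.1],
               [0.8, 1.0, 0.2],
               [0.1, 0.2, 1.0]])
A2 = np.array([[1.0, 0.6, 0.3],
               [0.6, 1.0, 0.1],
               [0.3, 0.1, 1.0]])
views = [row_normalize(A1), row_normalize(A2)]

W_fused = anf_fuse(views, view_weights=[0.5, 0.5], r=2)
W_long = anf_fuse(views, view_weights=[0.5, 0.5], r=200)
```

With a large r every row of the result approaches the same stationary distribution, which is why the walk length has to stay small.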

> **Obs.** While the SNF fitting process might look long, the patient matrix is usually a small network (most clinical datasets are on the order of tens to hundreds of patients)!

## ANF - Semi-supervised learning on patient affinity networks

- a good example of the potential of NF to combine supervised and unsupervised learning through good representation learning.
- 97% accuracy on the test set when training on less than 1% of the data, for classifying patients into the correct disease types
- ANF obtains effective kNN-based nonlinear transformations that reduce noise in multi-omic data


<img src="img/anf_nn.jpg" width="300"/>

## Subsequent studies:

- Guo, Yang et al. “Improvement of cancer subtype prediction by incorporating transcriptome expression data and heterogeneous biological networks.” BMC medical genomics vol. 11,Suppl 6 119. 31 Dec. 2018, doi:10.1186/s12920-018-0435-x
- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6311915/

Main ideas:
- ANF did not consider the feature importance and the feature relationships in data integration (different regulatory mechanisms may exist in different cancer subtypes).
- Method: CSPRV (Cancer Subtype Prediction using RV2)
> Given the expression data of genome elements, we first extract multiple expression features for each regulatory element based on the heterogeneous biological networks. Based on the extracted feature matrices of samples, we use a matrix correlation method, RV2, to predict the similarities between samples in each expression data-view, and then fuse the similarity information in samples from all considering data-views according to different integration weights. Finally, we cluster patient samples into different cancer subtypes based on the predicted integrative similarity network between samples.


## SKF - Similarity Kernel Fusion

- https://en.wikipedia.org/wiki/Kernel_method
- multiple papers:
    - Jiang, Limin et al. “Discovering Cancer Subtypes via an Accurate Fusion Strategy on Multiple Profile Data.” Frontiers in genetics vol. 10 20. 5 Feb. 2019, doi:10.3389/fgene.2019.00020
    - https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6370730/
    - https://pubmed.ncbi.nlm.nih.gov/30619454/
    - https://pubmed.ncbi.nlm.nih.gov/31514111/

## The Kernel trick

- (wikipedia) "kernel functions, enable them to operate in a high-dimensional, implicit feature space without ever computing the coordinates of the data in that space, but rather by simply computing the inner products between the images of all pairs of data in the feature space."


$P_l^{t+1}=\alpha(S_l \times \frac{\sum_{r \neq l}P_r^t}{n} \times S_l^T) + (1-\alpha) \frac{\sum_{r \neq l}P_r^t}{n}$
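
One step of the update rule above could be written as follows. The function name is hypothetical. Because the divisor `n` is not defined in these notes, it is passed in explicitly; the toy call sets it to the number of other kernels, which is an assumption.

```python
import numpy as np

def skf_update(P, S, l, alpha, n):
    """One SKF step for kernel l: blend the diffused and undiffused averages."""
    others = sum(P[r] for r in range(len(P)) if r != l) / n
    diffused = S[l] @ others @ S[l].T
    return alpha * diffused + (1.0 - alpha) * others

# Two hypothetical 2x2 kernels; identity local kernels leave 'others' unchanged.
P = [np.array([[0.5, 0.5], [0.5, 0.5]]),
     np.array([[0.6, 0.4], [0.4, 0.6]])]
S_identity = [np.eye(2), np.eye(2)]
S_local = [np.array([[0.7, 0.3], [0.3, 0.7]])] * 2

P0_identity = skf_update(P, S_identity, l=0, alpha=0.3, n=1)
P0_no_diffusion = skf_update(P, S_local, l=0, alpha=0.0, n=1)
P0_full_diffusion = skf_update(P, S_local, l=0, alpha=1.0, n=1)
```

The parameter alpha interpolates between the SNF-style diffusion term (alpha = 1) and the plain combination of the other kernels (alpha = 0).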

<img src="img/skf.png" width="400"/>

# Papers:
- SNF similarity network fusion: https://www.nature.com/articles/nmeth.2810
    - https://media.nature.com/original/nature-assets/nmeth/journal/v11/n3/extref/nmeth.2810-S1.pdf
- SNF.CC: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5741060/
- WSNF: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0152792
- SKF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6370730/
- Speicher and Pfeifer (2015) https://pubmed.ncbi.nlm.nih.gov/26072491/ pointed out that iCluster has high computational complexity and proposed a dimensionality reduction method to integrate multiple similarity kernels. This method is evaluated by using five cancer types. 

- ANF affinity network fusion: https://arxiv.org/abs/1805.09673 
- INF integrative network fusion: https://www.frontiersin.org/articles/10.3389/fonc.2020.01065/full
- fusion methods review (2020): https://www.sciencedirect.com/science/article/pii/S2001037019304155
- icluster: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0035236
    - https://pubmed.ncbi.nlm.nih.gov/19759197/
- IntClust https://rdrr.io/cran/IntClust/
- check jNMF: file:///home/sergiun/Downloads/Wang-Gu2016_Article_IntegrativeClusteringMethodsOf.pdf


# Software:
- ANF: https://www.bioconductor.org/packages/release/bioc/html/ANF.html
    - https://github.com/BeautyOfWeb/ANF
- R: CancerSubtypes https://bioconductor.org/packages/devel/bioc/vignettes/CancerSubtypes/inst/doc/CancerSubtypes-vignette.html
- SNF Python: https://github.com/rmarkello/snfpy
- SNF IntClust: https://rdrr.io/cran/IntClust/man/SNF.html

# Course:
- https://github.com/NBISweden/workshop_omics_integration/
- https://www.sciencedirect.com/science/article/abs/pii/S1746809419301326
- iomicspass: https://www.nature.com/articles/s41540-019-0099-y
