BioNE is a pipeline that integrates embedding results from different graph embedding methods, providing a more comprehensive representation of the network and therefore better performance on prediction tasks.


BioNE: Integration of network embeddings for supervised learning

Overview

A network embedding approach reduces the complexity of analyzing large biological networks by converting high-dimensional networks into low-dimensional vector representations. These lower-dimensional representations can then be used in machine learning prediction tasks such as link/association prediction. Several network embedding methods have been proposed, each taking a different approach to extracting network features. We believe that, rather than developing yet another network embedding method, integrating existing ones can offer complementary information about the network and, consequently, better performance in prediction tasks. BioNE is a pipeline that applies a range of network embedding methods after a network preparation step and integrates the vector representations obtained by these methods using three different techniques. In this framework we focus on the link prediction task.

The BioNE pipeline is divided into three steps:

  1. Network Preparation
    1.1. Convert Adjacency Matrix to Edge List
    1.2. Heterogeneous Network Preparation
  2. Network Embedding
  3. Predictions Using the Integration of Embeddings

To install the required packages and create the necessary virtual environment, see the Virtual Environment and Installing Packages section.
The pipeline is demonstrated on Drug-Target Interaction (DTI) data as a link prediction task. You can find the scripts for this test in the Example section.

 

Virtual Environment and Installing Packages

All of the analyses were written and tested in a virtual environment using Python 3.7. The detailed software versions are listed below:

  • Python 3.7
  • virtualenv 20.4.0
  • Ubuntu 20.04
  • NVIDIA driver 460
  • CUDA 10.0
  • cuDNN 7.4.2

To create the virtual environment:

cd BioNE-main
virtualenv --python=/usr/bin/python3.7 BioNEvenv

To activate and install required packages:

source BioNEvenv/bin/activate
pip install -r requirements.txt

 

Input file formats

The input file format for Network Embedding is a space-delimited edge list file. If the edge list file is already in this format, users can start from the Network Embedding step. If the networks are in adjacency matrix format, this pipeline provides a command to convert adjacency matrices to edge lists in section 1.1. Convert Adjacency Matrix to Edge List. Adjacency matrices should be space-delimited and contain both column names and row names. Click here to see a sample adjacency matrix.

 

1. Network Preparation

This part consists of two sections. Users can convert adjacency matrices to edge list files in section 1.1. Convert Adjacency Matrix to Edge List. When required, users can also combine two edge list files to form a heterogeneous network using the commands provided in section 1.2. Heterogeneous Network Preparation.

 

1.1. Convert Adjacency Matrix to Edge List

To conduct network embedding, adjacency matrices must first be converted to the edge list file format.

python3 scripts/mat2edgelist.py --input input.txt --directed --keepzero --attribute --output output.txt
Arguments:

      input   The filepath of the adjacency matrix
      The input adjacency matrix should be a space-delimited file containing row and column index labels.
      Click here to see a sample file.

      directed   Treat the graph as directed
      When directed, row indexes are source nodes and column indexes are target nodes.

      keepzero   Add negative associations (0s) to the output

      attribute   Include edge attributes in the output file
      If edge attributes will not be used as weights in network embedding, omitting this flag is recommended to save memory.

      output   The filepath for the output edge list file
      The file will be saved as a space-delimited file. Click here to see a sample edge list file.
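Conceptually, the conversion can be sketched in a few lines of Python. This is an illustrative sketch only, not the actual mat2edgelist.py script; the `mat_to_edgelist` helper and its keyword arguments are hypothetical stand-ins for the --directed, --keepzero and --attribute behaviour described above.

```python
# Sketch: convert a labelled, space-delimited adjacency matrix to an
# edge list. Rows are treated as source nodes and columns as targets.
# (Hypothetical helper, not the real mat2edgelist.py.)

def mat_to_edgelist(lines, keep_zero=False, keep_attribute=True):
    """lines: the matrix file's rows; the first row holds column labels,
    and the first field of each later row holds that row's label."""
    header = lines[0].split()
    edges = []
    for row in lines[1:]:
        fields = row.split()
        source, values = fields[0], fields[1:]
        for target, value in zip(header, values):
            if float(value) == 0 and not keep_zero:
                continue  # drop negative associations unless --keepzero
            edge = [source, target] + ([value] if keep_attribute else [])
            edges.append(" ".join(edge))
    return edges

matrix = ["      D1 D2",
          "DrugA 1  0",
          "DrugB 0  1"]
edgelist = mat_to_edgelist(matrix)  # ["DrugA D1 1", "DrugB D2 1"]
```

Dropping keep_attribute mirrors omitting --attribute: each line then holds only the source and target node.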

 

1.2. Heterogeneous Network Preparation

When required, users can combine two edge lists (e.g. drug-drug and drug-disease networks) to construct a heterogeneous network. The command below combines two edge lists; run it multiple times to combine more than two edge list files.

python3 scripts/merge_edgelist.py --input1 input1.txt --input2 input2.txt --rmduplicate --output output.txt
Arguments:

      input1   The filepath of the first edge list file
      This file should be a space-delimited edge list file. Click here to see a sample input file.

      input2   The filepath of the second edge list file
      This file should be a space-delimited edge list file. Click here to see a sample input file.

      rmduplicate   Remove duplicated edges

      output   The filepath for the output combined edge list file
      The file will be saved as a space-delimited file. Click here to see a sample output file.
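The merge performed in this step can be sketched as follows. This is an illustration only, not the actual merge_edgelist.py script; `merge_edgelists` is a hypothetical helper mirroring the --rmduplicate option.

```python
# Sketch: combine two edge lists into one heterogeneous network,
# optionally dropping duplicated edges (--rmduplicate).
# (Hypothetical helper, not the real merge_edgelist.py.)

def merge_edgelists(edges1, edges2, rm_duplicate=True):
    combined = edges1 + edges2
    if not rm_duplicate:
        return combined
    seen, merged = set(), []
    for edge in combined:  # keep the first occurrence, preserve order
        if edge not in seen:
            seen.add(edge)
            merged.append(edge)
    return merged

drug_drug = ["DrugA DrugB", "DrugA DrugC"]
drug_disease = ["DrugA Dis1", "DrugA DrugB"]  # one duplicated edge
hetero = merge_edgelists(drug_drug, drug_disease)
```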

 

2. Network Embedding

Network embedding methods convert high-dimensional data to low-dimensional vector representations. In this project, users can apply the following embedding methods:
LINE, GraRep, SDNE, LLE, HOPE, LaplacianEigenmaps (Lap), node2vec, DeepWalk and GF.

python3 scripts/embedding.py --method lle --input input.txt --directed --weighted --representation_size 128 --output output.txt
Arguments:

      method:   Network embedding method
      Choices are:
        line (parameters: epochs, order, negative_ratio)
        grarep (parameters: kstep)
        sdne (parameters: alpha, beta, nu1, nu2, bs, lr, epochs, encoder-list)
        lle
        hope
        lap
        node2vec (parameters: walk_length, number_walks, workers, p, q, window_size)
        deepwalk (parameters: walk_length, number_walks, workers, window_size)
        gf (parameters: epochs, lr, weight-decay)

        Note: input, directed, weighted, random_state and representation_size are shared among all methods.

      input:   The filepath of the edge list file
      This file should be a space-delimited edge list. Click here to see a sample input file.

      directed:   Treat the network as directed
      There is no need to use this if you already specified it in section 1.1.

      weighted:   Treat the network as weighted
      To use this, edge attributes should be included in the edge list file. Check attribute argument in section 1.1.

      random_state:   Fixing the randomization
      The default value is 1.

      epochs:   The number of times that the learning algorithm will work through the entire training data set
      This parameter is used in line, sdne and gf. The default value is 5.

      representation_size:   Dimensionality of the output data
      The default value is 128.

      order:   Choose the order of line
      1 means first order, 2 means second order, 3 means first order + second order. The default value is 2.

      negative_ratio:   Negative sampling ratio
      This parameter is used in line. The default is 5.

      kstep:   Use k-step transition probability matrix
      This parameter is used in grarep. The default value is 2.

      encoder-list:   a list of neuron numbers in each encoder layer within sdne
      The last number is the dimension of the output embeddings. The default is [1000,128].

      alpha:   alpha is a hyperparameter in sdne
      The default value is 1e-6.

      beta:   beta is a hyperparameter in sdne
      The default value is 1e-5.

      nu1:   nu1 is a hyperparameter in sdne
      The default value is 1e-5.

      nu2:   nu2 is a hyperparameter in sdne
      The default value is 1e-4.

      bs:   batch size in sdne
      Number of training samples utilized in one iteration. The default is 200.

      lr:   learning rate in sdne
      The learning rate controls how quickly the model adapts to the problem. The default is 0.001.

      walk-length:   Length of the random walk started at each node
      This parameter is used in node2vec and deepwalk. The default value is 20.

      number-walks:   Number of random walks to start at each node
      This parameter is used in node2vec and deepwalk. The default value is 80.

      workers:   Number of parallel processes
      This parameter is used in node2vec and deepwalk. The default value is 8.

      p:   Return hyperparameter in node2vec
      The default value is 1.

      q:   In-out hyperparameter in node2vec
      The default value is 1.

      window-size:   Window size of skipgram model in node2vec and deepwalk
      The default value is 10.

      weight-decay:   Weight for L2 loss on embedding matrix in gf
      The default value is 5e-4.

      output:   The filepath for the embedding results
      The file will be saved as a space-delimited file. Click here to see a sample output file.
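For downstream use, the embedding output can be read back as a node-to-vector mapping. The sketch below assumes each output line holds a node identifier followed by its vector components; the real output layout may differ (e.g. it may include a header line), and `load_embeddings` is a hypothetical helper.

```python
# Sketch: load a space-delimited embedding file into a dict mapping
# node id -> vector. Assumes lines of the form "node v1 v2 ... vd".

def load_embeddings(lines):
    vectors = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:  # skip blank or malformed lines
            continue
        vectors[fields[0]] = [float(v) for v in fields[1:]]
    return vectors

sample = ["DrugA 0.12 -0.40 0.88",
          "DrugB -0.05 0.33 0.10"]
emb = load_embeddings(sample)  # emb["DrugA"] is a 3-dimensional vector
```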

 

3. Predictions using the integration of embeddings

For this section, we developed three integration methods (late fusion, early fusion and mixed fusion) to integrate the embedding results from different methods. This provides a more comprehensive representation of the networks and therefore better prediction performance.

python3 scripts/integration.py --fusion late --annotation annotation.txt --entity1-embeddings '["hope_x.txt","lap_x.txt"]' --entity2-embeddings '["hope_y.txt","lap_y.txt"]' --cv-type stratified --cv 10 --imbalance ADASYN --model '["RF"]' --output ./output
Arguments:

      fusion   The integration type
      Choices are:
        early: Merge all embedding results before passing them to the prediction model
        late (default): Pass each embedding result to the prediction model separately, then sum the resulting prediction probabilities.
        mix: Merge all embedding results, then sum the prediction probabilities obtained from the different prediction models.

      annotation   The filepath of the annotation file
      This file should contain two columns. The first and second columns of the annotation file hold the information for entity1 and entity2, respectively. Click here to see a sample annotation file.

      entity1-embeddings   filepaths of the embeddings containing the entities of the first column (entity1) in the annotation file
      The file paths should be given in this format: '["deepwalk_drug.txt", "gf_drug.txt"]'.
      When late fusion is applied, the entity1-embeddings and entity2-embeddings should have the same length with the same order of embedding methods.

      entity2-embeddings   filepaths of the embeddings containing the entities of the second column (entity2) in the annotation file
      The file paths should be given in this format: '["deepwalk_protein.txt", "gf_protein.txt"]'.
      When late fusion is applied, the entity1-embeddings and entity2-embeddings should have the same length with the same order of embedding methods.

      cv-type   Cross-validation method
      Choices are 'kfold', 'stratified' and 'split' (default). 'split' divides the data according to the test-size argument.

      cv   Number of folds
      This argument is used when the cv-type is either 'kfold' or 'stratified'.
      Default value is 5.

      cv-shuffle   Whether to shuffle the samples of each class before splitting into batches
      This argument is used when the cv-type is either 'kfold' or 'stratified'.

      test-size   Fraction of the data to be used as the test set
      The value of this argument must be between 0 and 1. This can be used when cv-type is 'split'.
      Default value is 0.2.

      imbalance   Deals with imbalanced classes
      Choices are: 'equalize', which reduces the majority class to the size of the minority class;
      'SMOTE' and 'ADASYN', which are oversampling methods (the latter is used in the example command above);
      'None' (default), which does not deal with imbalanced classes.

      fselection   Feature selection method
      Choices are: 'fvalue', 'qvalue', 'MI' or None.
      ANOVA analyses the differences among class means; its output is either an 'fvalue' or a p-value.
      The ktop argument selects the K features with the highest 'fvalue'.
      The 'qvalue' is the Bonferroni correction of the p-values, keeping values lower than 0.1.
      'MI' is based on mutual information; here ktop selects the K features with the highest MI values.

      ktop   Select K highest value features
      Selects features according to the K highest scores when fselection is either 'fvalue' or 'MI'.
      Default value is 10.

      model   Machine Learning models
      Choices are 'SVM' (default), 'RF', 'NB' and 'XGBoost'.
      The models should be given in this format: '["SVM"]'.
      When mixed fusion is applied, the models should be given in this format: '["SVM","RF","NB","XGBoost"]'.

      random_state   Fixing the randomization
      Default value is None.

      kernel   Specifies the kernel type to be used in the algorithm
      This can be used when the model is 'SVM'.
      Default is 'linear'.

      C   Regularization parameter
      The default value is 1. This can be used when model is SVM.

      ntree   The number of trees in the random forest
      Default value is 100.

      criterion   The function to measure the quality of a split in random forest
      Choices are 'gini' (default) and 'entropy'

      njob   The number of parallel jobs to run in random forest.

      output   The filepath for the predictions and evaluation results
      Only provide directory and file prefix. e.g. ./Desktop/DTI_prediction
      Click here to see a sample prediction output and here for ROC and PR curves.
      In ROC and PR, the label of the positive class is fixed to 1.
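The fusion strategies described above can be illustrated on toy data. This is a minimal sketch under the assumption that late fusion sums per-method prediction probabilities and early fusion concatenates per-method feature vectors, as stated in the fusion argument description; the function names are hypothetical and the real integration.py uses trained scikit-learn models rather than fixed numbers.

```python
# Sketch of the fusion strategies (hypothetical helpers, not the
# actual integration.py implementation).

def late_fusion(prob_lists):
    # Sum, sample by sample, the positive-class probabilities obtained
    # from a model trained on each embedding method separately.
    return [sum(sample) for sample in zip(*prob_lists)]

def early_fusion(feature_lists):
    # Concatenate per-method feature vectors so a single model sees
    # one long feature vector per sample.
    return [sum(sample, []) for sample in zip(*feature_lists)]

# Toy probabilities from models trained on hope and lap embeddings:
probs_hope = [0.9, 0.2, 0.6]
probs_lap = [0.7, 0.1, 0.8]
fused = late_fusion([probs_hope, probs_lap])  # one score per sample
```

This also makes clear why, for late fusion, entity1-embeddings and entity2-embeddings must have the same length and method order: one model is built per embedding method before the probabilities are summed.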

 

Example

Here you can find an example of the Drug-Target Interaction link prediction task.

# 1) Network Preparation

# Convert drug-drug and drug-disease adjacency matrices to the edge list
python3 scripts/mat2edgelist.py --input ./data/mat_drug_drug.txt --output ./output/edgelist/edgelist_drug_drug.txt
python3 scripts/mat2edgelist.py --input ./data/mat_drug_disease.txt --output ./output/edgelist/edgelist_drug_disease.txt
# Drugs heterogeneous network preparation
python3 scripts/merge_edgelist.py --input1 ./output/edgelist/edgelist_drug_drug.txt --input2 ./output/edgelist/edgelist_drug_disease.txt --rmduplicate --output ./output/edgelist/edgelist_hetero_drugs.txt


# Convert protein-protein adjacency matrix to the edge list
python3 scripts/mat2edgelist.py --input ./data/mat_protein_protein.txt --output ./output/edgelist/edgelist_protein_protein.txt


# 2) Embedding
# Embedding on the drugs heterogeneous network; the hope and lap embeddings are conducted
python3 scripts/embedding.py --method hope --input ./output/edgelist/edgelist_hetero_drugs.txt --representation_size 20 --output ./output/embedding/hope_20_hetero_drugs.txt
python3 scripts/embedding.py --method lap  --input ./output/edgelist/edgelist_hetero_drugs.txt --representation_size 20 --output ./output/embedding/lap_20_hetero_drugs.txt

# Embedding on the protein-protein edge list; the hope and lap embeddings are conducted
python3 scripts/embedding.py --method hope --input ./output/edgelist/edgelist_protein_protein.txt --representation_size 20 --output ./output/embedding/hope_20_protein.txt
python3 scripts/embedding.py --method lap  --input ./output/edgelist/edgelist_protein_protein.txt --representation_size 20 --output ./output/embedding/lap_20_protein.txt


# 3) Predictions using the integration of embeddings
# Create annotation file
python3 scripts/mat2edgelist.py --input ./data/mat_drug_protein_remove_homo.txt --directed --keepzero --attribute --output ./output/edgelist/edgelist_drug_protein.txt
# late fusion
python3 scripts/integration.py --fusion late --annotation ./output/edgelist/edgelist_drug_protein.txt --entity1-embeddings '["./output/embedding/hope_20_hetero_drugs.txt","./output/embedding/lap_20_hetero_drugs.txt"]' --entity2-embeddings '["./output/embedding/hope_20_protein.txt","./output/embedding/lap_20_protein.txt"]' --cv-type kfold --cv 10 --imbalance equalize --model '["SVM"]' --random_state 11 --output ./output/prediction/DTI_prediction

 

Citation

Please consider citing the following publication if you found BioNE beneficial in your research:

@article{BioNE,
author = {Parvizi, Poorya and Azuaje, Francisco and Theodoratou, Evropi and Luz, Saturnino},
doi = {10.1101/2022.04.26.489560},
journal = {bioRxiv},
publisher = {Cold Spring Harbor Laboratory},
title = {{BioNE: Integration of network embeddings for supervised learning}},
url = {https://www.biorxiv.org/content/early/2022/04/27/2022.04.26.489560},
year = {2022}
}

BioNE is also archived at Zenodo (https://doi.org/10.5281/zenodo.5500712).

Contact

If you have any questions, please submit an issue on GitHub or send an email to poorya.parvizi@ed.ac.uk.

License

Licensed under the GPLv3 license.
