
DrOGA

DrOGA is a benchmark to predict the driver status of somatic non-synonymous DNA mutations.
Explore the docs »

Report Bug · Request Feature

Table of Contents
  1. About The Project
  2. Usage
  3. Roadmap
  4. Contributing
  5. License
  6. Contact
  7. Acknowledgments

About The Project

Citation

Our paper is available here. If you use this work, please cite:

  @article{
    
  }

(back to top)

Built With

Our released implementation is tested on:

  • Ubuntu 20.04
  • Python 3.9.7

(back to top)

Usage

Dataset

Our documented dataset for training and testing can be downloaded here. After downloading and extracting the files into the Data folder, you will get the following structure:

  Data
  ├── train.csv	# Dataset for training
  ├── test.csv	# Dataset for testing
  └── SupplemetaryMaterial1.xlsx # Complete documentation of the dataset
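
As a quick sanity check after extraction, the files can be loaded with pandas. This is a minimal sketch; the printed columns depend on the dataset schema documented in SupplemetaryMaterial1.xlsx:

  # Minimal sanity check of the downloaded files, assuming pandas is installed.
  import pandas as pd

  train = pd.read_csv("Data/train.csv")
  test = pd.read_csv("Data/test.csv")

  print(train.shape, test.shape)  # rows (mutations) x columns (features)
  print(train.columns.tolist())   # feature names, per the documentation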

Training

Training traditional ML models

Traditional ML models are trained on the data obtained in the previous step and contained in the Data folder. Hyperparameters are optimized with Random Search over the following models and parameter grids:

  Model                Parameter      Set of values
  Logistic Regression  penalty        'l1', 'l2', 'elasticnet', 'none'
                       C              100, 10, 1.0, 0.1, 0.01
                       tol            1e-3, 1e-4, 1e-5
  SVM                  C              0.01, 0.1, 1, 10, 100
                       kernel         'poly', 'rbf', 'sigmoid', 'linear'
                       gamma          10, 1, 0.1, 0.01, 0.001, 0.0001, 'scale', 'auto'
                       tol            1e-3, 1e-4, 1e-2
  Decision Tree        max_depth      2-20
                       criterion      'gini', 'entropy'
  Random Forest        max_depth      5, 10, 'None'
                       criterion      'gini', 'entropy'
                       max_features   'auto', 'log2', 'None'
                       n_estimators   100, 200, 300, 400, 500, 600, 1000, 2000, 3000, 4000, 5000, 6000
                       bootstrap      True, False
  XGBoost              n_estimators   100, 500, 1000, 2000
                       learning rate  0.001, 0.01, 0.05, 0.1, 0.3, 0.5
                       max_depth      5-20, 'None'
                       booster        'gbtree', 'gblinear', 'dart'
                       reg_alpha      1, 0.1, 0.01, 0.001, 0
                       reg_lambda     1, 0.1, 0.01, 0.001, 0

These parameter grids can be modified directly in the source code; training is then run with:

python train_traditional_ml.py
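
For orientation, the snippet below sketches what such a Random Search setup looks like for the logistic-regression grid above, assuming scikit-learn; the "label" column name is a placeholder, not the dataset's actual target column:

  # Illustrative Random Search over the logistic-regression grid above.
  # Assumes scikit-learn; "label" is a hypothetical target-column name.
  import pandas as pd
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import RandomizedSearchCV

  train = pd.read_csv("Data/train.csv")
  X, y = train.drop(columns=["label"]), train["label"]

  # Grid from the table; 'none' is omitted here because recent
  # scikit-learn versions expect penalty=None instead of the string 'none'.
  param_distributions = {
      "penalty": ["l1", "l2", "elasticnet"],
      "C": [100, 10, 1.0, 0.1, 0.01],
      "tol": [1e-3, 1e-4, 1e-5],
  }

  search = RandomizedSearchCV(
      # saga supports all three penalties; l1_ratio only affects elasticnet
      LogisticRegression(solver="saga", l1_ratio=0.5, max_iter=5000),
      param_distributions,
      n_iter=20,      # number of sampled configurations
      scoring="f1",
      cv=5,
  )
  search.fit(X, y)
  print(search.best_params_, search.best_score_)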

Training DL models

Deep Learning models are also trained on the data obtained in the previous step and contained in the Data folder. Hyperparameters are optimized with Ray over the following models and parameters:

  Model                         Parameter               Set of values
  Deep Multi-Layer Perceptron   number of layers        1, 3, 5, 7
                                starting exponent       4, 5, 6, 7
                                alpha                   1.00-3.00
                                gamma                   1.0-4.0
                                lr                      1e-8 - 1e-6
                                weight decay            0, 0.1
                                batch size              32, 64, 128, 256
                                warm up steps           0-100
  Convolutional Neural Network  number of filters 1     8, 16
                                number of filters 2     32, 64
                                number of neurons 1     512, 256
                                number of neurons 2     256, 128, 64
                                number of neurons 3     64, 32
                                number of filters skip  32, 64
                                number of neurons skip  4, 8, 16
                                alpha                   1.00-3.00
                                gamma                   1.0-4.0
                                lr                      1e-8 - 1e-6
                                weight decay            0, 0.1
                                batch size              32, 64, 128, 256
                                warm up steps           0-100

To train a model, first select which of the three available models to train:

  • MLP: 'mlp'
  • CNN: 'cnn'
  • CNN with skip connections: 'cnn-skip'

These parameters can also be modified directly in the source code. An example of training the CNN model:

python train_dl.py --model cnn
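
As a rough illustration of the Ray-based search, the sketch below encodes the MLP row of the table above as a Ray Tune search space using the legacy tune.run API; the trainable name (train_mlp) and the config keys are illustrative, not the repository's actual interface:

  # Hedged sketch of a Ray Tune search space for the MLP row above.
  # train_mlp and the config keys are hypothetical names.
  from ray import tune

  search_space = {
      "num_layers": tune.choice([1, 3, 5, 7]),
      "starting_exponent": tune.choice([4, 5, 6, 7]),
      "alpha": tune.uniform(1.0, 3.0),
      "gamma": tune.uniform(1.0, 4.0),
      "lr": tune.loguniform(1e-8, 1e-6),
      "weight_decay": tune.choice([0.0, 0.1]),
      "batch_size": tune.choice([32, 64, 128, 256]),
      "warmup_steps": tune.randint(0, 101),  # 0-100 inclusive
  }

  def train_mlp(config):
      # ... build the MLP from config, train, evaluate on validation data ...
      tune.report(f1=0.0)  # placeholder: report the real validation F1 here

  analysis = tune.run(train_mlp, config=search_space, num_samples=50)
  print(analysis.get_best_config(metric="f1", mode="max"))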

Testing

Pre-trained models can also be tested to reproduce the results reported in our publication. To test all traditional ML and DL models, download the weights here and add them to the weights folder as follows:

  weights
  ├── CNN	          # Weights and parameters of CNN
  │    └── ...	  
  ├── CNN_SKIP	          # Weights and parameters of CNN with skip-connections
  │    └── ...	  
  ├── MLP	          # Weights and parameters of MLP
  │    └── ...	  
  ├── DecisionTree.h5     # Weights of Decision Tree
  ├── Logistic.h5	  # Weights of Logistic Classification
  ├── RF.h5	          # Weights of Random Forest
  ├── SVM.h5	          # Weights of Support Vector Machine
  └── XGB.h5              # Weights of XGB

These models can be tested together, producing accuracy, precision, recall and F1 metrics over the test split of our dataset. Using a GPU is recommended to accelerate testing, but the CPU is selected automatically if no CUDA device is found.

python test.py
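
For reference, the automatic device fallback and the reported metrics correspond to standard PyTorch and scikit-learn calls; the snippet below is a sketch with placeholder predictions, not the repository's test code:

  # Sketch of the CUDA/CPU fallback and the reported metrics.
  # y_true / y_pred are placeholders, not real model outputs.
  import torch
  from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

  device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
  print("Running on", device)

  y_true = [0, 1, 1, 0, 1]  # placeholder test labels
  y_pred = [0, 1, 0, 0, 1]  # placeholder model predictions
  print("accuracy :", accuracy_score(y_true, y_pred))
  print("precision:", precision_score(y_true, y_pred))
  print("recall   :", recall_score(y_true, y_pred))
  print("F1       :", f1_score(y_true, y_pred))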

(back to top)

Roadmap

See the open issues for a full list of proposed features (and known issues).

(back to top)

Contributing

If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star. Thanks!

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/my_feature)
  3. Commit your Changes (git commit -m 'Add my_feature')
  4. Push to the Branch (git push origin feature/my_feature)
  5. Open a Pull Request

(back to top)

License

Distributed under the MIT License. See LICENSE.txt for more information.

(back to top)

Contact

Matteo Bastico - @matteobastico - matteo.bastico@gmail.com

Project Link: https://github.com/matteo-bastico/DrOGA

(back to top)

Acknowledgments

This work was supported by the H2020 European Project GenoMed4ALL (https://genomed4all.eu/), Grant no. 101017549. The authors are with the Escuela Técnica Superior de Ingenieros de Telecomunicación, Universidad Politécnica de Madrid, 28040 Madrid, Spain (e-mail: mab@gatv.ssr.upm.es, afg@gatv.ssr.upm.es, abh@gatv.ssr.upm.es, sum@gatv.ssr.upm.es).

(back to top)
