DrOGA is a benchmark to predict the driver-status of somatic non-synonymous DNA mutations.

Table of Contents
  1. About The Project
  2. Usage
  3. Roadmap
  4. Contributing
  5. License
  6. Contact
  7. Acknowledgments

About The Project


Our paper is available in


Built With

Our released implementation is tested on:

  • Ubuntu 20.04
  • Python 3.9.7

Usage

Our documented dataset for training and testing is available for download. After downloading and extracting the files into the Data folder, you will get the following structure:

  ├── train.csv	# Dataset for training
  ├── test.csv	# Dataset for testing
  └── SupplemetaryMaterial1.xlsx # Complete documentation of the dataset
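
The two CSV splits can be read with the standard library alone. The following is a minimal sketch, assuming each file starts with a header row; the helper names `load_split` and `load_dataset` are illustrative and not part of the released code, and the column names are documented in the supplementary spreadsheet rather than assumed here.

```python
import csv
from pathlib import Path


def load_split(csv_path):
    """Read one CSV split and return (column_names, rows)."""
    with open(csv_path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)  # assumption: the first row holds column names
        rows = [row for row in reader]
    return header, rows


def load_dataset(data_dir="Data"):
    """Load train.csv and test.csv from the data folder (hypothetical helper)."""
    data_dir = Path(data_dir)
    return {name: load_split(data_dir / f"{name}.csv") for name in ("train", "test")}
```

Calling `load_dataset()` from the repository root would return a dictionary with a `train` and a `test` entry, each holding the header and the raw string rows.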


Training traditional ML models

Traditional ML models are trained on the data obtained in the previous step and stored in the "Data" folder. Hyperparameters are optimized with Random Search over the following models and parameter values:

| Model | Parameter | Set of values |
|---|---|---|
| Logistic Regression | penalty | 'l1', 'l2', 'elasticnet', 'none' |
| | C | 100, 10, 1.0, 0.1, 0.01 |
| | tol | 1e-3, 1e-4, 1e-5 |
| SVM | C | 0.01, 0.1, 1, 10, 100 |
| | kernel | 'poly', 'rbf', 'sigmoid', 'linear' |
| | gamma | 10, 1, 0.1, 0.01, 0.001, 0.0001, 'scale', 'auto' |
| | tol | 1e-3, 1e-4, 1e-2 |
| Decision Tree | max_depth | 2-20 |
| | criterion | 'gini', 'entropy' |
| Random Forest | max_depth | 5, 10, 'None' |
| | criterion | 'gini', 'entropy' |
| | max_features | 'auto', 'log2', 'None' |
| | n_estimators | 100, 200, 300, 400, 500, 600, 1000, 2000, 3000, 4000, 5000, 6000 |
| | bootstrap | True, False |
| XGBoost | n_estimators | 100, 500, 1000, 2000 |
| | learning rate | 0.001, 0.01, 0.05, 0.1, 0.3, 0.5 |
| | max_depth | 5-20, 'None' |
| | booster | 'gbtree', 'gblinear', 'dart' |
| | reg_alpha | 1, 0.1, 0.01, 0.001, 0 |
| | reg_lambda | 1, 0.1, 0.01, 0.001, 0 |
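
To illustrate what Random Search does with such a table, here is a minimal standard-library sketch using the Decision Tree values. The names `DECISION_TREE_SPACE`, `sample_configurations`, `random_search` and the `score_fn` callback are hypothetical and not taken from the released code; in a real run `score_fn` would train the model and return a validation score. scikit-learn's `RandomizedSearchCV` offers the same idea with cross-validation built in, though whether the repository uses it is not stated here.

```python
import random

# Search space for the Decision Tree, taken from the table above.
DECISION_TREE_SPACE = {
    "max_depth": list(range(2, 21)),  # 2-20 inclusive
    "criterion": ["gini", "entropy"],
}


def sample_configurations(space, n_iter, seed=0):
    """Draw n_iter random configurations, one value per parameter."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in space.items()}
        for _ in range(n_iter)
    ]


def random_search(space, score_fn, n_iter=10, seed=0):
    """Return the best (score, configuration) pair found by random search."""
    candidates = sample_configurations(space, n_iter, seed)
    scored = ((score_fn(cfg), cfg) for cfg in candidates)
    return max(scored, key=lambda pair: pair[0])
```

Unlike a grid search, only `n_iter` configurations are evaluated, which keeps the cost bounded even for the larger Random Forest and XGBoost spaces.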

To train the models, modify these parameters directly in the source code if needed and then run:


Training DL models

Deep Learning models are also trained on the data obtained in the previous step and stored in the "Data" folder. Hyperparameters are optimized with Ray over the following models and parameter values:

| Model | Parameter | Set of values | Parameter | Set of values |
|---|---|---|---|---|
| Deep Multi-Layer | number of layers | 1, 3, 5, 7 | lr | 1e-8 to 1e-6 |
| | starting exponent | 4, 5, 6, 7 | weight decay | 0, 0.1 |
| | alpha | 1.00 to 3.00 | batch size | 32, 64, 128, 256 |
| | gamma | 1.0 to 4.0 | warm up steps | 0 to 100 |
| Neural Network | number of filters 1 | 8, 16 | alpha | 1.00 to 3.00 |
| | number of filters 2 | 32, 64 | gamma | 1.0 to 4.0 |
| | number of neurons 1 | 512, 256 | lr | 1e-8 to 1e-6 |
| | number of neurons 2 | 256, 128, 64 | weight decay | 0, 0.1 |
| | number of neurons 3 | 64, 32 | batch size | 32, 64, 128, 256 |
| | number of filters skip | 32, 64 | warm up steps | 0 to 100 |
| | number of neurons skip | 4, 8, 16 | | |
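
The first group of parameters mixes discrete sets with continuous ranges. A minimal standard-library sketch of drawing one configuration from it is shown below. Everything about the sampling is an assumption: the key names are made up for illustration, `alpha` and `gamma` are drawn uniformly, `lr` is drawn log-uniformly, and `weight decay` is treated as the two-element set {0, 0.1}. The released code relies on Ray for this step and may sample differently.

```python
import math
import random


def sample_mlp_config(rng):
    """Draw one configuration from the ranges in the table (illustrative only)."""
    return {
        "num_layers": rng.choice([1, 3, 5, 7]),
        "starting_exponent": rng.choice([4, 5, 6, 7]),
        "alpha": rng.uniform(1.0, 3.0),  # assumption: uniform over the range
        "gamma": rng.uniform(1.0, 4.0),  # assumption: uniform over the range
        # Assumption: log-uniform draw, so each order of magnitude
        # between 1e-8 and 1e-6 is equally likely.
        "lr": 10 ** rng.uniform(math.log10(1e-8), math.log10(1e-6)),
        "weight_decay": rng.choice([0, 0.1]),  # assumption: a set, not a range
        "batch_size": rng.choice([32, 64, 128, 256]),
        "warmup_steps": rng.randint(0, 100),  # inclusive on both ends
    }


# Example: draw a reproducible configuration.
example_config = sample_mlp_config(random.Random(0))
```

A log-uniform draw is a common choice for learning rates because the useful values span several orders of magnitude.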

To train a DL model, first select one of the three available models:

  • MLP: 'mlp'
  • CNN: 'cnn'
  • CNN with skip connections: 'cnn-skip'

The hyperparameters listed above can be modified directly in the source code. An example of training the CNN model:

python --model cnn
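
A `--model` flag restricted to the three identifiers above could be parsed as in the following minimal sketch. It uses only `argparse` from the standard library; the function name `parse_args` and the help text are illustrative, and the actual training script may define further options.

```python
import argparse

# The three model identifiers listed above.
MODEL_CHOICES = ("mlp", "cnn", "cnn-skip")


def parse_args(argv=None):
    """Parse the command line; argv=None falls back to sys.argv[1:]."""
    parser = argparse.ArgumentParser(
        description="Train a DL model (illustrative sketch)."
    )
    parser.add_argument(
        "--model",
        choices=MODEL_CHOICES,
        required=True,
        help="model to train: MLP, CNN, or CNN with skip connections",
    )
    return parser.parse_args(argv)
```

Restricting the flag with `choices` makes a mistyped model name fail immediately with a usage message instead of failing later during training.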


Pre-trained models can also be tested to check the results reported in our publication. To test all traditional ML and DL models, download the weights and add them to the folder as follows:

  ├── CNN	          # Weights and parameters of CNN
  │    └── ...	  
  ├── CNN_SKIP	          # Weights and parameters of CNN with skip-connections
  │    └── ...	  
  ├── MLP	          # Weights and parameters of MLP
  │    └── ...	  
  ├── DecisionTree.h5     # Weights of Decision Tree
  ├── Logistic.h5	  # Weights of Logistic Classification
  ├── RF.h5	          # Weights of Random Forest
  ├── SVM.h5	          # Weights of Support Vector Machine
  └── XGB.h5              # Weights of XGB

These models can be tested together to obtain accuracy, precision, recall and F1 metrics over the test split of our dataset. Using a GPU is recommended to accelerate testing, but the CPU is selected automatically if no CUDA device is found.
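
For reference, the four reported metrics can be computed from predictions as in the minimal sketch below. It assumes binary labels where 1 marks a driver mutation and 0 a non-driver; that encoding and the function name `binary_metrics` are assumptions made for illustration, not details taken from the released test script.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    # Guard against empty denominators when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

Precision and recall are reported alongside accuracy because accuracy alone can be misleading when one class dominates the test split.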


Roadmap

See the open issues for a full list of proposed features (and known issues).

Contributing

If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/my_feature)
  3. Commit your Changes (git commit -m 'Add my_feature')
  4. Push to the Branch (git push origin feature/my_feature)
  5. Open a Pull Request

License

Distributed under the MIT License. See LICENSE.txt for more information.

Contact

Matteo Bastico - @matteobastico

Project Link:

Acknowledgments

This work was supported by the H2020 European Project GenoMed4ALL, Grant no. 101017549. The authors are with the Escuela Técnica Superior de Ingenieros de Telecomunicación, Universidad Politécnica de Madrid, 28040 Madrid, Spain.
