Skip to content

Predicting privacy risk of functional genomics data

Notifications You must be signed in to change notification settings

gursoygamze/FANCY

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

13 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

FANCY

Predicting privacy risk of functional genomics data. Pre-print: https://www.biorxiv.org/content/10.1101/775338v1

NOTE 1: The results in the paper are obtained using the MATLAB version of the code.

NOTE 2: If you want to skip the training step and use the already trained model, for FANCY (both in matlab (.mat) and pyhton (.sav) format, go to http://homes.gersteinlab.org/people/gg487/FANCY/. For FANCY_low, go to python or matlab folder.

  • In the python folder, you can find three python scripts:

    The dependencies are pyGPs, which can be easily downloaded following: https://www.cse.wustl.edu/~m.neumann/pyGPs_doc/Install.html

    (1) training_testing_modeling.py takes the feature files in the data folder and divides it into half for training and test. It will then train a Gaussian Process Learning regression model on the trainign data. It will then predict the outcome of the test features and report the R^2 score for the test data. Finally, it will export the model and save it as "FANCY.sav" The kernel parameters are obtained using the optimization in MATLAB, as optimizing it in Python took a long time.

    (2) training_testing_modeling_low.py does the same operations as (1), but it only uses the data if the number of leaking variants are smaller or equal to 1000. It exports the model as "FANCY_low.sav"

    (3) FANCY.py is the prediction model. It has several user arguments. Below is an example and explanation of the arguments.

    python FANCY.py -regressor FANCY -depth 4.04985000000000 -cov 72274436 -assay ATAC-Seq -std 300.96 -skew 103.01 -kurt 12491.18

    (a)-regressor is used to chose the model. FANCY for the full model, FANCY_low for low number of variants.
    
    (b)-depth is the mean depth of the sequencing reads.
    
    (c)-cov is the breadth of the coverage.
    
    (d)-assay is the type of the experiment. Options are: RNA-Seq, ATAC-Seq and CHIP-Seq.
    
    (e)-std is the standard deviation of the depth distribution.
    
    (f)-skew is the skewness of the depth distribution.
    
    (g)-kurt is the kurtosis of the depth distribution.
    

    The correct answer for these inputs are 59,278 leaking SNVs and you should obtain a result similar to

      Predicted number of leaking variants are  54678.078125 
      
      Between  10388  and  12575  of them are rare
      
      Risk is  red 
    
  • In the matlab folder, you can find the following scripts:

    (1) trainRegressionModel.m takes the feature files in the data folder and divides it into half for training and test. It will then train a Gaussian Process Learning regression model on the training data. It will then predict the outcome of the test features and report the R^2 score for the test data. Finally, you can export the model and save it as "FANCY.mat" The kernel parameters are obtained using the Statistics and Machine Learning toolbox in MATLAB. You can also train the regressor by using only the data points that are smaller or equalt to 1000 and save the resulting model as "FANCY_low.mat".

    (2) fancy.m is the prediction model function. Below is an example on how to run with an explanation

    >> load('FANCY.mat')

    >> X=[300.96 103.01 12491.18 4.04985000000000 72274436 0 1 0];

    >> [ypred rare1 rare2 alert]= fancy(X)

    In the input X, the first column is the standard deviation, second column is the skewness, third column is the kurtosis, fourth column is the mean depth, sixth column is the coverage and sevent, eight and nineth columns are the one-hot encoding of the assay type. RNA-Seq = [0 0 1], ATAC-Seq=[0 1 0] and CHIP-Seq=[0 0 1]

    The correct answer for these inputs are 59,278 leaking SNVs and you should obtain a result similar to

    ypred = 5.2579e+04
    rare1 = 9.9900e+03
    rare2 = 1.2093e+04
    alert = 'red'
    

    (3) fancy_low.m is the prediction model function for the number of SNVs smaller or equal to 1000. Below is an example on how to run:

    >> load('FANCY_low.mat')

    >> X=[0.0582118199131000,9.30898835798000,170.476230689000,1.28473000000000,1879796,0,1,0];

    >> [ypred rare1 rare2 alert]= fancy_low(X)

    In the input X, the first column is the standard deviation, second column is the skewness, third column is the kurtosis, fourth column is the mean depth, sixth column is the coverage and sevent, eight and nineth columns are the one-hot encoding of the assay type. RNA-Seq = [0 0 1], ATAC-Seq=[0 1 0] and CHIP-Seq=[0 0 1]

    The correct answer for these inputs are 134 leaking SNVs and you should obtain a result similar to

    ypred = 138.5298
    rare1 = 26.3204
    rare2 = 31.8618
    alert = 'yellow'
    

    (4) trainingANDtest_for_classifier.m uses the sequencing statistics (mean depth, coverage, standard deviation, skewness and kurtosis) as inputs and predicts the assay type using a Random Forest Classifier. It first takes the data and divides it into half for test and training. When you run the code, you should obtain something similar to below:

     validationAccuracy = 0.9682
     testAccuracy = 0.9679
    

    validationAccuracy is the accuracy on the training data after 5-fold cross-validation and testAccuracy is the accuracy on the test data.

About

Predicting privacy risk of functional genomics data

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Jupyter Notebook 58.2%
  • MATLAB 20.7%
  • Python 20.6%
  • Shell 0.5%