diff --git a/joss.04545/10.21105.joss.04545.crossref.xml b/joss.04545/10.21105.joss.04545.crossref.xml new file mode 100644 index 0000000000..b055184ee3 --- /dev/null +++ b/joss.04545/10.21105.joss.04545.crossref.xml @@ -0,0 +1,212 @@ + + + + 20230213T184138-bb481ed4a6bc51273cbd2b106583b073715fcdec + 20230213184138 + + JOSS Admin + admin@theoj.org + + The Open Journal + + + + + Journal of Open Source Software + JOSS + 2475-9066 + + 10.21105/joss + https://joss.theoj.org/ + + + + + 02 + 2023 + + + 8 + + 82 + + + + dsBinVal: Conducting distributed ROC analysis using +DataSHIELD + + + + Daniel + Schalk + https://orcid.org/0000-0003-0950-1947 + + + Verena Sophia + Hoffmann + + + Bernd + Bischl + + + Ulrich + Mansmann + + + + 02 + 13 + 2023 + + + 4545 + + + 10.21105/joss.04545 + + + http://creativecommons.org/licenses/by/4.0/ + http://creativecommons.org/licenses/by/4.0/ + http://creativecommons.org/licenses/by/4.0/ + + + + Software archive + 10.5281/zenodo.7634619 + + + GitHub review issue + https://github.com/openjournals/joss-reviews/issues/4545 + + + + 10.21105/joss.04545 + https://joss.theoj.org/papers/10.21105/joss.04545 + + + https://joss.theoj.org/papers/10.21105/joss.04545.pdf + + + + + + R: A language and environment for statistical +computing + R Core Team + 2021 + R Core Team. (2021). R: A language +and environment for statistical computing. R Foundation for Statistical +Computing. https://www.R-project.org/ + + + DataSHIELD: Taking the analysis to the data, +not the data to the analysis + Gaye + International journal of +epidemiology + 6 + 43 + 10.1093/ije/dyu188 + 2014 + Gaye, A., Marcon, Y., Isaeva, J., +LaFlamme, P., Turner, A., Jones, E. M., Minion, J., Boyd, A. W., Newby, +C. J., Nuotio, M.-L., & others. (2014). DataSHIELD: Taking the +analysis to the data, not the data to the analysis. International +Journal of Epidemiology, 43(6), 1929–1944. 
+https://doi.org/10.1093/ije/dyu188 + + + Data integration for future medicine +(DIFUTURE) + Prasser + Methods of Information in +Medicine + S01 + 57 + 10.3414/ME17-02-0022 + 2018 + Prasser, F., Kohlbacher, O., +Mansmann, U., Bauer, B., & Kuhn, K. A. (2018). Data integration for +future medicine (DIFUTURE). Methods of Information in Medicine, 57(S01), +e57–e65. https://doi.org/10.3414/ME17-02-0022 + + + Distributed non-disclosive validation of +predictive models by a modified ROC-GLM + Schalk + 10.48550/ARXIV.2203.10828 + 2022 + Schalk, D., Hoffmann, V. S., Bischl, +B., & Mansmann, U. (2022). Distributed non-disclosive validation of +predictive models by a modified ROC-GLM. arXiv. +https://doi.org/10.48550/ARXIV.2203.10828 + + + An interpretation for the ROC curve and +inference using GLM procedures + Pepe + Biometrics + 2 + 56 + 10.1111/j.0006-341x.2000.00352.x + 2000 + Pepe, M. S. (2000). An interpretation +for the ROC curve and inference using GLM procedures. Biometrics, 56(2), +352–359. +https://doi.org/10.1111/j.0006-341x.2000.00352.x + + + Calibrating noise to sensitivity in private +data analysis + Dwork + Theory of cryptography +conference + 10.1007/11681878_14 + 2006 + Dwork, C., McSherry, F., Nissim, K., +& Smith, A. (2006). Calibrating noise to sensitivity in private data +analysis. Theory of Cryptography Conference, 265–284. +https://doi.org/10.1007/11681878_14 + + + Comparing the areas under two or more +correlated receiver operating characteristic curves: A nonparametric +approach + DeLong + Biometrics + 10.2307/2531595 + 1988 + DeLong, E. R., DeLong, D. M., & +Clarke-Pearson, D. L. (1988). Comparing the areas under two or more +correlated receiver operating characteristic curves: A nonparametric +approach. Biometrics, 837–845. 
+https://doi.org/10.2307/2531595 + + + pROC: An open-source package for R and S+ to +analyze and compare ROC curves + Robin + BMC Bioinformatics + 12 + 10.1186/1471-2105-12-77 + 2011 + Robin, X., Turck, N., Hainard, A., +Tiberti, N., Lisacek, F., Sanchez, J.-C., & Müller, M. (2011). pROC: +An open-source package for R and S+ to analyze and compare ROC curves. +BMC Bioinformatics, 12, 77. +https://doi.org/10.1186/1471-2105-12-77 + + + + + + diff --git a/joss.04545/10.21105.joss.04545.jats b/joss.04545/10.21105.joss.04545.jats new file mode 100644 index 0000000000..5aa6370ac6 --- /dev/null +++ b/joss.04545/10.21105.joss.04545.jats @@ -0,0 +1,402 @@ + + +
+ + + + +Journal of Open Source Software +JOSS + +2475-9066 + +Open Journals + + + +4545 +10.21105/joss.04545 + +dsBinVal: Conducting distributed ROC analysis using +DataSHIELD + + + +https://orcid.org/0000-0003-0950-1947 + +Schalk +Daniel + + + + + + + +Hoffmann +Verena Sophia + + + + + + +Bischl +Bernd + + + + + + +Mansmann +Ulrich + + + + + + +Department of Statistics, LMU Munich, Munich, +Germany + + + + +Institute for Medical Information Processing, Biometry and +Epidemiology, LMU Munich, Munich, Germany + + + + +DIFUTURE (DataIntegration for Future Medicine, +www.difuture.de), LMU Munich, Munich, Germany + + + + +Munich Center for Machine Learning, Munich, +Germany + + + + +23 +3 +2022 + +8 +82 +4545 + +Authors of papers retain copyright and release the +work under a Creative Commons Attribution 4.0 International License (CC +BY 4.0) +2022 +The article authors + +Authors of papers retain copyright and release the work under +a Creative Commons Attribution 4.0 International License (CC BY +4.0) + + + +DataSHIELD +distributed computing +distributed analysis +privacy-preserving +diagnostic tests +prognostic model +model validation +ROC-GLM +discrimination +calibration +Brier score + + + + + + Summary +

Our R (R Core Team, 2021) package dsBinVal implements the methodology described by Schalk et al. (2022). It extends the ROC-GLM (Pepe, 2000) to distributed data by using differential privacy techniques (Dwork et al., 2006) and by sharing only highly aggregated values. The package also exports functionality to calculate distributed calibration curves and assess calibration. The package allows a prognostic model with a binary outcome to be evaluated within the DataSHIELD (Gaye et al., 2014) framework. Its main functionality is to 1) compute the receiver operating characteristic (ROC) curve using the ROC-GLM, from which 2) the area under the curve (AUC) and confidence intervals (CI) are derived to conduct hypothesis testing according to DeLong et al. (1988). Furthermore, 3) the calibration can be assessed distributively via calibration curves and the Brier score. Visualizing the approximated ROC curve, the AUC with confidence intervals, and the calibration curves using ggplot2 is also supported. Examples can be found in the README file of the repository.
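As a rough illustration of items 1) and 2), and assuming a probit link, the ROC-GLM of Pepe (2000) yields a binormal ROC curve ROC(t) = Φ(γ0 + γ1 Φ⁻¹(t)), whose AUC has the closed form Φ(γ0 / sqrt(1 + γ1²)). The Python sketch below is not part of dsBinVal (an R package). The function names and the parameter values are hypothetical and only show how an AUC follows from two fitted parameters.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution: mean 0, sd 1


def binormal_roc(fpr, intercept, slope):
    """True positive rate at false positive rate `fpr` (assumes slope > 0)."""
    if fpr <= 0.0:
        return 0.0
    if fpr >= 1.0:
        return 1.0
    return STD_NORMAL.cdf(intercept + slope * STD_NORMAL.inv_cdf(fpr))


def binormal_auc(intercept, slope):
    """Closed-form area under the binormal ROC curve."""
    return STD_NORMAL.cdf(intercept / math.sqrt(1.0 + slope ** 2))


def trapezoid_auc(intercept, slope, steps=10_000):
    """Numerical check: integrate the ROC curve with the trapezoidal rule."""
    grid = [i / steps for i in range(steps + 1)]
    tpr = [binormal_roc(t, intercept, slope) for t in grid]
    return sum((tpr[i] + tpr[i + 1]) / (2 * steps) for i in range(steps))


if __name__ == "__main__":
    gamma0, gamma1 = 1.0, 1.0  # hypothetical ROC-GLM parameter estimates
    print(binormal_auc(gamma0, gamma1))   # closed form
    print(trapezoid_auc(gamma0, gamma1))  # numerical integration
```

The closed form and the numerical integral should agree closely, which is a convenient sanity check on any fitted pair of parameters.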

+
+ + Statement of need +

Privacy protection of patient data plays a major role in a variety of tasks in medical research. Uncontrolled release of health information may cause personal disadvantages for individuals, and individual patients need to be protected against personal details becoming visible to people not authorized to know them.

+

In statistics or machine learning, one of these tasks is to gain insights by building statistical or prognostic models. Prognoses on the development of severe health conditions and covariates encoding critical health information, such as genetic susceptibility, need to be handled with care. Furthermore, using confidential data comes with administrative burdens and usually requires consent for data usage. Additionally, the data can be distributed over multiple sites (e.g. hospitals), which makes access even more challenging. Modern approaches in distributed analysis allow work on distributed confidential data by providing frameworks that retrieve information without sharing sensitive information. Because privacy-preserving distributed algorithms share no sensitive information and users have no access to personal data, their use helps to meet administrative, ethical, and legal requirements in medical research.

+

One of these frameworks for privacy-protected analysis is DataSHIELD (Gaye et al., 2014). It allows the analysis of data in a non-disclosive setting. The framework already provides techniques for descriptive statistics, basic summary statistics, and basic statistical modeling. Within a multiple sclerosis use case to enhance patient medication in the DIFUTURE consortium of the German Medical Informatics Initiative (Prasser et al., 2018), a prognostic model was developed on individual patient data. One goal of the multiple sclerosis use case is to validate that prognostic model using ROC and calibration analysis on patient data distributed across five hospitals using DataSHIELD.

+

With this package, we close the gap between distributed model building and the validation of models for binary outcomes on distributed data. Our package integrates seamlessly into the DataSHIELD framework, which does not yet provide distributed ROC analysis or calibration assessment.

+
+ + Functionality +

The integration of the dsBinVal package into the DataSHIELD framework extends its functionality and allows users to assess the discrimination and calibration of a binary classification model without harming the privacy of individuals. Based on privacy-preserving distributed algorithms (Schalk et al., 2022), the discrimination is assessed by the dsROCGLM() function, which calculates a ROC curve based on the ROC-GLM as well as an AUC with CI. The calibration is estimated distributively using the functions dsBrierScore() and dsCalibrationCurve(). Additional helper functions, dsConfusion() and dsL2Sens(), can be used to calculate several measures from the confusion matrix (e.g. sensitivity, specificity, accuracy, or the F1 score) or the L2-sensitivity. Note that measures from the confusion matrix may be disclosive for specific thresholds and are therefore checked and protected by DataSHIELD's privacy mechanisms. During the call to dsROCGLM(), parts of the data set are communicated twice: first, to calculate the ROC-GLM based on prediction scores, and second, to calculate the CI of the AUC. In both steps, the information is protected by differential privacy to prevent re-identification of individuals. The amount of noise generated for differential privacy is chosen based on a simulation study that takes the variation of the predicted values into account. We refer to the README file of the repository for a demonstration and usage of the functionality.
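To make the role of the L2-sensitivity and the added noise more concrete, the following Python sketch shows the classical textbook calibration of the Gaussian mechanism, where noise with standard deviation σ = Δ₂ · sqrt(2 ln(1.25/δ)) / ε is added to each shared value (valid for 0 < ε < 1). This is an assumption for illustration only: it is not the simulation-based choice described above, it is not taken from the dsBinVal source, and all function names and parameter values are hypothetical.

```python
import math
import random


def gaussian_sigma(l2_sensitivity, epsilon, delta):
    """Noise standard deviation of the classical Gaussian mechanism."""
    if not 0.0 < epsilon < 1.0:
        raise ValueError("classical calibration requires 0 < epsilon < 1")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    if l2_sensitivity < 0.0:
        raise ValueError("l2_sensitivity must be non-negative")
    return l2_sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon


def privatize(values, l2_sensitivity, epsilon, delta, rng=None):
    """Return a noisy copy of `values`; the originals are left untouched."""
    rng = rng or random.Random()
    sigma = gaussian_sigma(l2_sensitivity, epsilon, delta)
    return [v + rng.gauss(0.0, sigma) for v in values]


if __name__ == "__main__":
    # Hypothetical prediction scores held at one site.
    scores = [0.12, 0.55, 0.81, 0.33]
    noisy = privatize(scores, l2_sensitivity=0.05, epsilon=0.5,
                      delta=1e-5, rng=random.Random(42))
    print(noisy)
```

A larger L2-sensitivity or a smaller ε increases σ, so stronger privacy guarantees come at the cost of noisier scores and therefore a less precise ROC-GLM fit.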

+

Technical details: To ensure the functioning of our package on DataSHIELD, it is continuously unit tested on an active DataSHIELD test instance. The reference, username, and password are available at the OPAL documentation in the “Types” section. Parts of the tests also cover checks against privacy breaches by attempting to call functions with data sets that do not pass the safety mechanisms of DataSHIELD. Hence, individual functions attempt to prevent accidental disclosures when data is not sufficient to ensure privacy.

+

State of the field: To the best of our knowledge, there is no distributed ROC-GLM implementation available in R. Current state-of-the-art techniques require sharing sensitive information from the sites and using existing implementations such as pROC (Robin et al., 2011) for the ROC curve, or standard software for the GLM to calculate the ROC-GLM (as stated by Pepe (2000)).
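For contrast, the non-distributed baseline mentioned above can be sketched as follows: scores and labels from all sites are pooled in one place, which is exactly the sharing of sensitive information the package avoids, and the empirical AUC is computed as the Mann-Whitney statistic. The Python code is a minimal sketch, not pROC's implementation, and the site data are made up for illustration.

```python
def pool_sites(sites):
    """Concatenate (scores, labels) pairs from several sites into one data set."""
    scores, labels = [], []
    for site_scores, site_labels in sites:
        scores.extend(site_scores)
        labels.extend(site_labels)
    return scores, labels


def empirical_auc(scores, labels):
    """Empirical AUC: P(positive score > negative score) + 0.5 * P(tie)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical prediction scores and binary outcomes at two sites.
    site_a = ([0.9, 0.4, 0.3], [1, 0, 0])
    site_b = ([0.8, 0.35, 0.6], [1, 1, 0])
    scores, labels = pool_sites([site_a, site_b])
    print(empirical_auc(scores, labels))
```

The pairwise comparison is quadratic in the number of cases, which is acceptable for a sketch but not how production implementations compute the AUC.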

+
+ + Acknowledgements +

This work was supported by the German Federal Ministry of Education and Research (BMBF) under Grant No. 01IS18036A and the Federal Ministry for Research and Technology (BMFT) under Grant No. 01ZZ1804C (DIFUTURE, MII). The authors of this work take full responsibility for its content.

+
+ + + + + + + R Core Team + + R: A language and environment for statistical computing + R Foundation for Statistical Computing + Vienna, Austria + 2021 + https://www.R-project.org/ + + + + + + GayeAmadou + MarconYannick + IsaevaJulia + LaFlammePhilippe + TurnerAndrew + JonesElinor M + MinionJoel + BoydAndrew W + NewbyChristopher J + NuotioMarja-Liisa + others + + DataSHIELD: Taking the analysis to the data, not the data to the analysis + International journal of epidemiology + Oxford University Press + 2014 + 43 + 6 + 10.1093/ije/dyu188 + 1929 + 1944 + + + + + + PrasserFabian + KohlbacherOliver + MansmannUlrich + BauerBernhard + KuhnKlaus A + + Data integration for future medicine (DIFUTURE) + Methods of Information in Medicine + 2018 + 57 + S01 + 10.3414/ME17-02-0022 + e57 + e65 + + + + + + SchalkDaniel + HoffmannVerena S. + BischlBernd + MansmannUlrich + + Distributed non-disclosive validation of predictive models by a modified ROC-GLM + arXiv + 2022 + https://arxiv.org/abs/2203.10828 + 10.48550/ARXIV.2203.10828 + + + + + + PepeMargaret Sullivan + + An interpretation for the ROC curve and inference using GLM procedures + Biometrics + Wiley Online Library + 2000 + 56 + 2 + 10.1111/j.0006-341x.2000.00352.x + 352 + 359 + + + + + + DworkCynthia + McSherryFrank + NissimKobbi + SmithAdam + + Calibrating noise to sensitivity in private data analysis + Theory of cryptography conference + Springer + 2006 + 10.1007/11681878_14 + 265 + 284 + + + + + + DeLongElizabeth R + DeLongDavid M + Clarke-PearsonDaniel L + + Comparing the areas under two or more correlated receiver operating characteristic curves: A nonparametric approach + Biometrics + JSTOR + 1988 + 10.2307/2531595 + 837 + 845 + + + + + + RobinXavier + TurckNatacha + HainardAlexandre + TibertiNatalia + LisacekFrédérique + SanchezJean-Charles + MüllerMarkus + + pROC: An open-source package for R and S+ to analyze and compare ROC curves + BMC Bioinformatics + 2011 + 12 + 10.1186/1471-2105-12-77 + 77 + + + + + +
diff --git a/joss.04545/10.21105.joss.04545.pdf b/joss.04545/10.21105.joss.04545.pdf new file mode 100644 index 0000000000..f83074b823 Binary files /dev/null and b/joss.04545/10.21105.joss.04545.pdf differ