# Reproducibility Project for CS598 DL4H in Spring 2023


The code for our reproducibility study is heavily based on the original code accompanying the paper: Makarious, M.B., Leonard, H.L., Vitale, D. et al. Multi-modality machine learning predicting Parkinson’s disease. npj Parkinsons Dis. 8, 35 (2022). https://doi.org/10.1038/s41531-022-00288-w

The original code can be accessed at https://github.com/GenoML/GenoML_multimodal_PD/

Both the original code and ours use the GenoML Python package (https://genoml.com/).

## Initial Set Up

In [None]:
#pip install genoml2

# The requirements.txt of genoml2 lists the following required packages:
#pip install joblib
#pip install matplotlib
#pip install numpy
#pip install tables
#pip install pandas
#pip install pandas_plink
#pip install requests
#pip install scikit-learn
#pip install scipy
#pip install seaborn
#pip install statsmodels
#pip install xgboost
#pip install umap-learn
#pip install tensorflow

In [1]:
import os
import sys
import argparse 
import math
import time
import joblib
import subprocess
import numpy as np
import pandas as pd
import tables
import statsmodels.api as sm
from scipy import stats

## Loading in Data

As we do not yet have access to the tier 2 data, this section will need to be adjusted once we have the proper CSV files.

## Implementation

### Data Munging

Munging with GenoML requires several arguments. We distinguish two classes (cases and controls), so the outcome is discrete and the task is supervised learning. The --prefix argument specifies where the output is stored (in our case, the outputs folder). The --pheno argument specifies the location of the phenotype file, which holds the sample ID in the first column and the label in the second column (0 for controls, 1 for cases). The --p flag sets a p-value cutoff, which we adjust when rerunning the code. When using a different p-value, remember to change the --prefix argument to a different output name as well, so that the output of the earlier run is not overwritten. The --feature_selection flag uses extraTrees to find the features that contribute most to the model.
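
Before munging, it can help to confirm that the phenotype file has the layout described above. The following is a minimal sketch, not part of GenoML: the helper name `check_pheno`, the column names, and the toy data are all hypothetical, and it assumes the file is a plain CSV that pandas can read.

```python
import io

import pandas as pd


def check_pheno(df):
    """Return a list of problems found in a phenotype table (empty if none)."""
    if df.shape[1] < 2:
        return ["expected at least two columns (ID, label)"]
    problems = []
    ids, labels = df.iloc[:, 0], df.iloc[:, 1]
    if ids.duplicated().any():
        problems.append("duplicate sample IDs")
    # Missing labels also fail this check, since NaN is neither 0 nor 1
    if not labels.isin([0, 1]).all():
        problems.append("labels other than 0 (control) and 1 (case)")
    return problems


# Toy data for illustration only; the column names are hypothetical
example = pd.read_csv(io.StringIO("ID,PHENO\ns1,0\ns2,1\ns3,1\n"))
print(check_pheno(example))
```

Once the real data is available, the same check could be applied to `pd.read_csv("data/discrete/training_pheno.csv")`.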

In [None]:
genoml discrete supervised munge \
--p 0.01 \
--prefix outputs/PPMI_Only_genetics_with_PRS \
--geno data/discrete/training \
--pheno data/discrete/training_pheno.csv \
--gwas data/discrete/data_GWAS.csv \
--addit PRS.csv \
--impute mean \
--feature_selection 500 \
--adjust_data yes \
--adjust_normalize yes \
--umap_reduce no
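
To avoid overwriting earlier outputs when rerunning with a different p-value, the prefix can be derived from the p-value itself. The sketch below only builds the command line with the same flags as the cell above and does not execute it. The helper name `munge_command` and the `_p<value>` suffix are our own hypothetical naming, not a GenoML convention.

```python
import shlex


def munge_command(p_value, base_prefix="outputs/PPMI_Only_genetics_with_PRS"):
    """Build the munge command with a prefix that encodes the p-value cutoff."""
    prefix = f"{base_prefix}_p{p_value}"  # hypothetical naming scheme
    return [
        "genoml", "discrete", "supervised", "munge",
        "--p", str(p_value),
        "--prefix", prefix,
        "--geno", "data/discrete/training",
        "--pheno", "data/discrete/training_pheno.csv",
        "--gwas", "data/discrete/data_GWAS.csv",
        "--addit", "PRS.csv",
        "--impute", "mean",
        "--feature_selection", "500",
        "--adjust_data", "yes",
        "--adjust_normalize", "yes",
        "--umap_reduce", "no",
    ]


# Print one command per cutoff; each run gets its own output prefix
for p in (0.01, 0.001):
    print(shlex.join(munge_command(p)))
```

The printed commands can be pasted into a shell, or the argument list can be passed to `subprocess.run` once the data is in place.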

### Model Training

Training the model takes comparable arguments. The train command looks for the .dataForML file that was created in the munge step, so --prefix must match the prefix used for munging.

In [None]:
genoml discrete supervised train \
--prefix outputs/PPMI_Only_genetics_with_PRS
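
Because training depends on the munge output, a quick check that the file exists can save a failed run. This is a minimal sketch with a hypothetical helper, `find_munged_files`. We do not assume the exact file extension that GenoML writes, so the pattern matches anything that starts with the prefix followed by `.dataForML`.

```python
import glob


def find_munged_files(prefix):
    """List files produced by munge for this prefix (matched by wildcard)."""
    # glob.escape protects any special characters in the prefix itself
    return sorted(glob.glob(glob.escape(prefix) + ".dataForML*"))


prefix = "outputs/PPMI_Only_genetics_with_PRS"
files = find_munged_files(prefix)
print(files if files else f"No munged data found for prefix {prefix!r}; run munge first")
```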


### Model Tuning


In [None]:
genoml discrete supervised tune \
--prefix outputs/PPMI_Only_genetics_with_PRS
