#### Main methodology results

*Disclaimer*: many parts of this code require preprocessed ADNI data (both genetic and diagnostic) as input. These data have not been uploaded to the repository for privacy reasons.

In [1]:

import pandas as pd
import datetime, pickle
import create_datasets.create_nx_datasets
from ml_models.machine_learning_models import create_class_LOAD, baseline_model

**1. Obtain genes of interest**

We use DisGeNET to retrieve gene-disease associations (GDAs) for Alzheimer's disease (AD gene set) and for other neurodegenerative diseases (ND gene set). These gene sets were already obtained in the [first part of the results](1_main_methodology.ipynb).
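As a rough illustration of this step, the sketch below groups genes by disease from a small GDA table. The column names (`geneSymbol`, `diseaseName`, `score`), the rows, the scores, and the `min_score` threshold are all hypothetical placeholders, not the actual DisGeNET export or the filter used in this project. Treating the ND set as the union over all diseases is also an assumption.

```python
import csv
import io
from collections import defaultdict

# Toy GDA table (hypothetical rows and scores, for illustration only)
TOY_GDAS = """geneSymbol\tdiseaseName\tscore
APOE\tAlzheimer's Disease\t0.9
APP\tAlzheimer's Disease\t0.8
SNCA\tParkinson Disease\t0.7
MAPT\tAlzheimer's Disease\t0.2
MAPT\tParkinson Disease\t0.6
"""

def gene_sets_by_disease(tsv_text, min_score=0.5):
    """Group gene symbols by disease, keeping associations with score >= min_score."""
    sets = defaultdict(set)
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        if float(row["score"]) >= min_score:
            sets[row["diseaseName"]].add(row["geneSymbol"])
    return dict(sets)

gene_sets = gene_sets_by_disease(TOY_GDAS)
ad_genes = gene_sets["Alzheimer's Disease"]
# Assumption: the ND set is the union of genes across all diseases in the table
nd_genes = set().union(*gene_sets.values())

print("AD genes:", sorted(ad_genes))
print("ND genes:", sorted(nd_genes))
```

Keeping each disease as a separate set makes it easy to build either gene set from the same table.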

**2. Obtain biological networks**

Using the genes of interest obtained from DisGeNET, we retrieve the protein-protein interactions (PPIs) between them from STRING. This network was already obtained in the [first part of the results](1_main_methodology.ipynb).
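The idea of restricting a PPI network to the genes of interest can be sketched as follows. The edge list content and its two-column, whitespace-separated layout are assumptions made for illustration; the actual STRING export used here may be formatted differently.

```python
import io

# Toy edge list (hypothetical interactions, one gene pair per line)
TOY_EDGELIST = "APOE\tAPP\nAPP\tPSEN1\nAPOE\tTP53\nPSEN1\tAPP\n"
genes_of_interest = {"APOE", "APP", "PSEN1"}

def induced_ppi(edge_text, genes):
    """Return undirected edges whose two endpoints are both genes of interest."""
    edges = set()
    for line in io.StringIO(edge_text):
        a, b = line.split()
        if a in genes and b in genes and a != b:
            edges.add(frozenset((a, b)))  # frozenset ignores edge direction
    return edges

edges = induced_ppi(TOY_EDGELIST, genes_of_interest)
nodes = set().union(*edges)

print("# nodes =", len(nodes))
print("# edges =", len(edges))
```

Storing each edge as a `frozenset` removes duplicates that differ only in the order of the two genes.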

**3. Data preprocessing**

Please refer to the `data_preprocessing` subdirectory for this part.
1. [make_BED_files.R](data_preprocessing/make_BED_files.R) creates BED files with the genomic coordinates of the genes of interest. These files were already created in the [first part of the results](1_main_methodology.ipynb).
2. [extract_and_annotate_missense_LOAD.sh](data_preprocessing/extract_and_annotate_missense_LOAD.sh) extracts and annotates missense variants from the VCF files.
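The repository performs these steps in R and shell. The Python sketch below only illustrates the two underlying ideas: the BED coordinate convention (0-based start, exclusive end) and selecting missense variants by their annotated consequence. Gene names, coordinates, and variant IDs are hypothetical. The `Consequence` column name is taken from the variant table printed later in this notebook.

```python
# Hypothetical gene coordinates: (chromosome, 1-based start, 1-based inclusive end)
gene_coords = {
    "GENE_A": ("1", 1000, 2000),
    "GENE_B": ("19", 500, 900),
}

def to_bed_lines(coords):
    """Convert 1-based inclusive coordinates to BED (0-based start, exclusive end)."""
    lines = []
    for gene, (chrom, start, end) in sorted(coords.items()):
        lines.append(f"{chrom}\t{start - 1}\t{end}\t{gene}")
    return lines

bed_lines = to_bed_lines(gene_coords)
print("\n".join(bed_lines))

# Hypothetical annotated variants; keep only those annotated as missense
variants = [
    {"ID": "var1", "Consequence": "missense_variant"},
    {"ID": "var2", "Consequence": "synonymous_variant"},
]
missense_ids = [v["ID"] for v in variants if v["Consequence"] == "missense_variant"]
print(missense_ids)
```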

**4. Create graph datasets**

Create graph datasets (one graph per patient) for different targets from the ADNI dataset.

In [3]:
# Dataset, prediction target, gene sets, and network variant to process
dataset = 'LOAD'
target  = 'LOAD'
diseases = ['AD', 'ND']
network = 'original'

# Build one graph dataset per gene set (AD and ND)
for disease in diseases:

    indir = 'data'
    outdir = f'data/graph_datasets/{target}'
    print('Input directory:', indir)
    print('Output directory:', outdir)
    print()

    start_time = datetime.datetime.now()
    print()

    # Node coding: number of missense variants per node
    result_nodes = create_datasets.create_nx_datasets.main(indir, dataset, target, disease, network, 'missense', None)
    print('Coding: number of missense variants per node')

    # Serialize the resulting dataset, then report where it was written
    outfile = f'{outdir}/{disease}_PPI_missense.pkl'
    with open(outfile, 'wb') as f:
        pickle.dump(result_nodes, f)

    print('Resulting dataset saved at:', outfile)
    print()

    result_nodes_time = datetime.datetime.now()
    print('Processing time:', result_nodes_time - start_time)
    print('\n\n')

Input directory: data
Output directory: data/graph_datasets/LOAD


data/ND_STRING_PPI_edgelist.txt
Network used: ND original
# nodes = 139
# edges = 263

Dataset used: LOAD
    CHROM        POS          ID REF ALT Allele           Consequence  \
0       1   11854476   rs1801131   A   C      C      missense_variant   
1       1   20977000   rs1043424   A   C      C      missense_variant   
2       1  209782343  rs11119314   T   C      C      missense_variant   
3       2  210558162    rs741006   G   A      A      missense_variant   
4       2  234113301      rs9247   C   T      T      missense_variant   
5       3   12393125   rs1801282   C   G      G      missense_variant   
6       3  133475722   rs1799852   G   A      A      missense_variant   
7       3  133494354   rs1049296   C   T      T      missense_variant   
8       4   23815662   rs8192678   C   T      T      missense_variant   
9       4   23815681  rs17574213   C   T      T      missense_variant   
10      5  179264731    

NameError: name 'exit' is not defined
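The coding reported by the cell above (number of missense variants per node) can be illustrated with a small sketch. Plain dictionaries stand in for graph objects here. The edge list, the patient IDs, the per-patient variant lists, and the choice to share one topology across patients are all assumptions for illustration, not the behavior of `create_nx_datasets.main`.

```python
from collections import Counter

# Hypothetical inputs: a PPI edge list and, per patient, the gene hit by each
# missense variant that patient carries (one entry per variant)
ppi_edges = [("APOE", "APP"), ("APP", "PSEN1")]
patient_variants = {
    "patient_1": ["APOE", "APOE", "APP"],
    "patient_2": ["PSEN1"],
}

def build_patient_graphs(edges, variants_by_patient):
    """Build one graph per patient with node feature = missense variant count."""
    nodes = sorted({gene for edge in edges for gene in edge})
    graphs = {}
    for patient, genes in variants_by_patient.items():
        counts = Counter(genes)
        graphs[patient] = {
            "edges": list(edges),  # assumption: same topology for every patient
            "node_features": {n: counts.get(n, 0) for n in nodes},
        }
    return graphs

graphs = build_patient_graphs(ppi_edges, patient_variants)
for patient, graph in graphs.items():
    print(patient, graph["node_features"])
```

Genes without any variant in a given patient still appear as nodes with a count of 0, so every graph has the same node set.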

**5. Graph classification with GNNs**

We then evaluated and tested different GNN architectures using the [GraphGym](https://github.com/snap-stanford/GraphGym) framework (You *et al.*, 2020).

The configuration and grid files used are in the [graphgym_files](graphgym_files) subdirectory.
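To give an idea of what a grid file describes, the sketch below expands a small grid into the list of configurations it implies. It assumes each row has the three-field form `config_key alias [values]`, as in GraphGym's example grid files. The rows shown are illustrative and are not the grid used in this project, and `expand_grid` is a hypothetical helper, not part of GraphGym.

```python
import ast
import itertools

# Hypothetical grid text; rows assumed to follow "config_key alias [values]"
TOY_GRID = """
# comment lines start with '#'
gnn.layers_mp l_mp [2,4]
gnn.agg agg ['add','mean']
"""

def expand_grid(grid_text):
    """Expand grid rows into the list of all configuration combinations."""
    keys, value_lists = [], []
    for line in grid_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _alias, values = line.split(maxsplit=2)
        keys.append(key)
        value_lists.append(ast.literal_eval(values))  # parse the value list safely
    return [dict(zip(keys, combo)) for combo in itertools.product(*value_lists)]

configs = expand_grid(TOY_GRID)
print(len(configs), "configurations")
for cfg in configs:
    print(cfg)
```

The number of runs grows as the product of the list lengths, which is worth checking before launching a large search.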

The summarized GraphGym results for the LOAD dataset are available [here](results/GNNs_LOAD/2022_02_LOAD.csv).

**6. Baseline model**

The baseline model uses only the APOE gene as input.
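For reference, the per-fold metrics of a single-feature classifier can be computed as in the sketch below. It is not the implementation of `baseline_model`: the feature values, the labels, and the thresholding rule are hypothetical, and the metrics are written out by hand to make their definitions explicit.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 from hard 0/1 predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    acc = (tp + tn) / len(y_true)
    pre = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * pre * rec / (pre + rec) if pre + rec else 0.0
    return {"acc": acc, "pre": pre, "rec": rec, "f1": f1}

def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical fold: toy APOE feature values and class labels
apoe_feature = [0, 1, 2, 0, 1, 0, 2, 1, 0, 0]
y_true       = [0, 1, 1, 0, 0, 1, 1, 1, 0, 0]
y_pred = [1 if v >= 1 else 0 for v in apoe_feature]  # toy decision rule

metrics = binary_metrics(y_true, y_pred)
auc = roc_auc(y_true, apoe_feature)
print(metrics)
print("AUC:", auc)
```

AUC is computed from continuous scores (here the raw feature value), whereas F1 comes from thresholded predictions, so the two generally differ.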

In [3]:
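target = 'LOAD'

# Load the labeled table dataset and derive the binary class column 'y'
infile = 'data/table_datasets/AD_PPI_missense_LOAD_labeled.csv'
data = pd.read_csv(infile, index_col = 0)
data_wclass = create_class_LOAD(data)

# Use only the APOE column as the input feature
x = data_wclass.drop(columns=['y'])
x = x['APOE']

y = data_wclass['y']
x.index = x.index.str.upper()

# Fit and evaluate the baseline; metrics are printed per fold
baseline_model(x, y)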

# f = open(f'data/splits/split_{target}.pkl', 'rb')
# split_load = pickle.load(f)
# f.close()

# auc_load = baseline_model(split_load, x, y)
# print('Baseline model LOAD, AUC ROC:', auc_load)

Class distribution:
1    1014
0     585
Name: y, dtype: int64
1
Acc. 0.70625
Pre. 0.8674698795180723
Rec. 0.6666666666666666
F1. 0.7539267015706806
AUC. 0.7539267015706806

2
Acc. 0.6625
Pre. 0.7857142857142857
Rec. 0.6470588235294118
F1. 0.7096774193548386
AUC. 0.7096774193548386

3
Acc. 0.68125
Pre. 0.7931034482758621
Rec. 0.6764705882352942
F1. 0.7301587301587301
AUC. 0.7301587301587301

4
Acc. 0.64375
Pre. 0.7848101265822784
Rec. 0.6078431372549019
F1. 0.6850828729281768
AUC. 0.6850828729281768

5
Acc. 0.65
Pre. 0.7319587628865979
Rec. 0.7029702970297029
F1. 0.7171717171717172
AUC. 0.7171717171717172

6
Acc. 0.65
Pre. 0.7047619047619048
Rec. 0.7474747474747475
F1. 0.7254901960784313
AUC. 0.7254901960784313

7
Acc. 0.59375
Pre. 0.75
Rec. 0.5887850467289719
F1. 0.6596858638743455
AUC. 0.6596858638743455

8
Acc. 0.69375
Pre. 0.7311827956989247
Rec. 0.7391304347826086
F1. 0.7351351351351352
AUC. 0.7351351351351352

9
Acc. 0.7125
Pre. 0.7549019607843137
Rec. 0.7857142857142857
F1. 0.770