#### Main methodology results

*Disclaimer*: many parts of this code require the preprocessed ADNI data (both genetic and diagnostic) as input. This data has not been uploaded to the repository for privacy reasons.

In [1]:

import pandas as pd
import datetime, pickle
from create_datasets import create_nx_datasets, create_splits
from ml_models.machine_learning_models import create_class_LOAD, baseline_model

**1. Obtain genes of interest**

Gene-Disease Associations (GDAs) for Alzheimer's Disease (the AD gene set) and for other neurodegenerative diseases (ND) are retrieved from DisGeNET. This step was already completed in the [first part of the results](1_main_methodology.ipynb).

**2. Obtain biological networks**

Using the genes of interest obtained from DisGeNET, the protein-protein interactions (PPI) between them are retrieved from STRING. This step was already completed in the [first part of the results](1_main_methodology.ipynb).

**3. Data preprocessing**

Please refer to the `data_preprocessing` subdirectory for this part.
1. [make_BED_files.R](data_preprocessing/make_BED_files.R) creates BED files with the genomic coordinates of the genes of interest. This step was already completed in the [first part of the results](1_main_methodology.ipynb).
2. [extract_and_annotate_missense_LOAD.sh](data_preprocessing/extract_and_annotate_missense_LOAD.sh) extracts and annotates the missense variants from the VCF files.

**4. Create graph datasets**

Create graph datasets (one graph per patient) for the different targets from the ADNI dataset.
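The idea of one graph per patient can be illustrated with a minimal plain-Python sketch. It assumes that every patient shares the same PPI topology and that the node feature is the number of missense variants in that gene, as the "Coding: number of missense variants per node" message suggests. The gene pairs, patient identifiers, variant counts, and the helpers `build_patient_graph` and `density` are all hypothetical and are not taken from `create_nx_datasets`.

```python
# Toy PPI topology shared by every patient (edges are illustrative only).
ppi_edges = [("APOE", "APP"), ("APP", "PSEN1"), ("APOE", "CLU")]

# Hypothetical missense-variant counts per gene for two patients.
variant_counts = {
    "patient_1": {"APOE": 2, "APP": 0, "PSEN1": 1, "CLU": 0},
    "patient_2": {"APOE": 0, "APP": 1, "PSEN1": 0, "CLU": 3},
}


def build_patient_graph(edges, counts):
    """Return one patient's graph as node features plus an adjacency map."""
    nodes = sorted({gene for edge in edges for gene in edge})
    adjacency = {gene: set() for gene in nodes}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    # Genes without any recorded variant get a count of zero.
    features = {gene: counts.get(gene, 0) for gene in nodes}
    return {"features": features, "adjacency": adjacency}


def density(n_nodes, n_edges):
    """Density of a simple undirected graph: 2E / (N * (N - 1))."""
    return 2 * n_edges / (n_nodes * (n_nodes - 1))


graphs = {
    patient: build_patient_graph(ppi_edges, counts)
    for patient, counts in variant_counts.items()
}

# Node and edge counts of the STRING sample graph reported in the cell output.
string_density = density(52, 111)
```

With 52 nodes and 111 edges, `density` gives approximately 0.0837, which agrees with the density printed for the STRING sample graph in the output of the next cell.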

In [2]:
networks = ['string', 'biogrid', 'huri', 'snap_brain', 'giant_brain']

for network in networks:

    outdir = 'data/graph_datasets/LOAD'

    start_time = datetime.datetime.now()
    print()

    result_nodes = create_nx_datasets.main('data', 'LOAD', 'LOAD', 'AD', network, 'missense', None)
    print('Coding: number of missense variants per node')

    outfile = f'{outdir}/AD_PPI_{network}_missense.pkl'
    print('Resulting dataset saved at:', outfile)
    print()

    with open(outfile, 'wb') as f:
        pickle.dump(result_nodes, f)

    result_nodes_time = datetime.datetime.now()
    print('Processing time:', result_nodes_time - start_time)
    print('\n\n')


data/AD_STRING_PPI_edgelist.txt
Network used: AD string
# nodes = 59
# edges = 115

Dataset used: LOAD
(11, 1678)
(11, 1608)
missense
(59, 1599)
Creating samples graphs...
Class: LOAD. Found 1014 positive subjects out of 1599
Sample graph used: # nodes = 52 # edges = 111
Density = 0.083710407239819 Diameter = 6
Coding: number of missense variants per node
Resulting dataset saved at: data/graph_datasets/LOAD/AD_PPI_string_missense.pkl

Processing time: 0:01:30.601637




Network used: AD biogrid
# nodes = 46
# edges = 62

Dataset used: LOAD
(11, 1678)
(11, 1608)
missense
(46, 1599)
Creating samples graphs...
Class: LOAD. Found 1014 positive subjects out of 1599
Sample graph used: # nodes = 38 # edges = 57
Density = 0.08108108108108109 Diameter = 5
Coding: number of missense variants per node
Resulting dataset saved at: data/graph_datasets/LOAD/AD_PPI_biogrid_missense.pkl

Processing time: 0:01:07.566558




Network used: AD huri
# nodes = 18
# edges = 16

Dataset used: LOAD
(11, 1678)


In [4]:
# Create the splits used by GraphGym and by the non-GNN models
# Splits are obtained through 10-fold stratified cross-validation

create_splits.create_folds_stratified_cv('LOAD', 10)

LOAD
Fold -  1   |   train -  [526 913]   |   test -  [ 59 101]
Fold -  2   |   train -  [526 913]   |   test -  [ 59 101]
Fold -  3   |   train -  [526 913]   |   test -  [ 59 101]
Fold -  4   |   train -  [526 913]   |   test -  [ 59 101]
Fold -  5   |   train -  [526 913]   |   test -  [ 59 101]
Fold -  6   |   train -  [527 912]   |   test -  [ 58 102]
Fold -  7   |   train -  [527 912]   |   test -  [ 58 102]
Fold -  8   |   train -  [527 912]   |   test -  [ 58 102]
Fold -  9   |   train -  [527 912]   |   test -  [ 58 102]
Fold -  10   |   train -  [527 913]   |   test -  [ 58 101]
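The per-class counts above stay nearly constant across folds because stratification splits each class separately. The following is a minimal plain-Python sketch of that idea, not the repository's `create_folds_stratified_cv`: the function `stratified_folds`, the seed, and the round-robin assignment are assumptions, so the leftover samples may land in different folds than in the output above. The class sizes (585 negative, 1014 positive) are taken from the notebook output.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, n_splits, seed=0):
    """Assign every sample index to exactly one test fold, class by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_splits)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal indices round-robin so per-class fold sizes differ by at most one.
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds


labels = [0] * 585 + [1] * 1014
folds = stratified_folds(labels, n_splits=10)
class_totals = Counter(labels)

for number, test_idx in enumerate(folds, start=1):
    test_counts = Counter(labels[i] for i in test_idx)
    train = [class_totals[c] - test_counts[c] for c in (0, 1)]
    test = [test_counts[c] for c in (0, 1)]
    print("Fold", number, "| train", train, "| test", test)
```

Each test fold then holds 58 or 59 negative and 101 or 102 positive subjects, the same ranges that appear in the output above.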



**5. Graph classification with GNNs**

We then evaluated and tested different GNNs in the framework called [GraphGym](https://github.com/snap-stanford/GraphGym) (You *et al.*, 2020).

Configuration and grid files employed are in the subdirectory [graphgym_files](graphgym_files).

The summarized GraphGym results on the LOAD dataset are [here](results/GNNs_LOAD/2022_02_LOAD.csv).

**6. Baseline model**

The baseline uses only the APOE gene as input.
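The per-fold accuracy, precision, recall, and F1 printed by the cell below follow from the confusion-matrix counts of each fold. As a hedged worked example, the counts used here (72 true positives, 11 false positives, 36 false negatives, 41 true negatives) are inferred from the fold 1 values in the output and do not appear in the source. The helper `classification_metrics` is hypothetical and is not the repository's `baseline_model`. AUC is left out because it is computed from ranked scores rather than from these four counts.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute standard binary metrics from confusion-matrix counts.

    No zero-division guards: the sketch assumes at least one predicted
    positive and at least one actual positive.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"acc": accuracy, "pre": precision, "rec": recall, "f1": f1}


# Counts inferred from the fold 1 output (an assumption, not source data).
fold_1 = classification_metrics(tp=72, fp=11, fn=36, tn=41)
for name, value in fold_1.items():
    print(name, value)
```

These counts give an accuracy of 113/160 = 0.70625, a precision of 72/83, a recall of 72/108, and an F1 of 144/191, matching the fold 1 figures.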

In [3]:
target = 'LOAD'

infile = 'data/table_datasets/AD_PPI_missense_LOAD_labeled.csv'
data = pd.read_csv(infile, index_col = 0)
data_wclass = create_class_LOAD(data)

x = data_wclass.drop(columns=['y'])
x = x['APOE']

y = data_wclass['y']
x.index = x.index.str.upper()

baseline_model(x, y)

# f = open(f'data/splits/split_{target}.pkl', 'rb')
# split_load = pickle.load(f)
# f.close()

# auc_load = baseline_model(split_load, x, y)
# print('Baseline model LOAD, AUC ROC:', auc_load)

Class distribution:
1    1014
0     585
Name: y, dtype: int64
1
Acc. 0.70625
Pre. 0.8674698795180723
Rec. 0.6666666666666666
F1. 0.7539267015706806
AUC. 0.7539267015706806

2
Acc. 0.6625
Pre. 0.7857142857142857
Rec. 0.6470588235294118
F1. 0.7096774193548386
AUC. 0.7096774193548386

3
Acc. 0.68125
Pre. 0.7931034482758621
Rec. 0.6764705882352942
F1. 0.7301587301587301
AUC. 0.7301587301587301

4
Acc. 0.64375
Pre. 0.7848101265822784
Rec. 0.6078431372549019
F1. 0.6850828729281768
AUC. 0.6850828729281768

5
Acc. 0.65
Pre. 0.7319587628865979
Rec. 0.7029702970297029
F1. 0.7171717171717172
AUC. 0.7171717171717172

6
Acc. 0.65
Pre. 0.7047619047619048
Rec. 0.7474747474747475
F1. 0.7254901960784313
AUC. 0.7254901960784313

7
Acc. 0.59375
Pre. 0.75
Rec. 0.5887850467289719
F1. 0.6596858638743455
AUC. 0.6596858638743455

8
Acc. 0.69375
Pre. 0.7311827956989247
Rec. 0.7391304347826086
F1. 0.7351351351351352
AUC. 0.7351351351351352

9
Acc. 0.7125
Pre. 0.7549019607843137
Rec. 0.7857142857142857
F1. 0.770