## GNNs working on a different cohort 

In this notebook, we obtain the results for GNNs, baseline, and non-GNN models using a completely different cohort (LOAD) of Late-Onset Alzheimer's Disease subjects and healthy controls.

The graph datasets were built using the network that achieved the best performance on the ADNI dataset (with both PET and PET&DX labellings): AD PPT-Ohmnet.

*Disclaimer*: many parts of this code require the preprocessed LOAD data (both genetic and diagnosis-related) as input. This data has not been uploaded to the repository for privacy reasons.

In [2]:
import pandas as pd
import datetime, pickle
from create_datasets import create_nx_datasets, create_splits
from ml_models.machine_learning_models import create_class_LOAD, baseline_model

**1. Obtain genes of interest**

We use DisGeNET to obtain gene-disease associations (GDAs) for Alzheimer's Disease (the AD gene set) and other neurodegenerative diseases (ND). These were already obtained in the [first part of the results](1_main_methodology.ipynb).

**2. Obtain biological networks**

Using the genes of interest obtained from DisGeNET, we retrieve the PPI between them from STRING. These were already obtained in the [first part of the results](1_main_methodology.ipynb).

**3. Data preprocessing**

Please refer to the `data_preprocessing` subdirectory for this part.
1. [make_BED_files.R](data_preprocessing/make_BED_files.R) creates BED files with the genomic coordinates of the genes of interest. These were already obtained in the [first part of the results](1_main_methodology.ipynb).
2. [extract_and_annotate_missense_LOAD.sh](data_preprocessing/extract_and_annotate_missense_LOAD.sh) obtains the missense variants from the VCF files.

**4. Create graph datasets**

Create graph datasets (one graph per patient) for different targets with the LOAD dataset. As stated above, we only build graph datasets using the AD PPT-Ohmnet network (named `snap_brain` in the following code).

In [3]:
networks = ['snap_brain']

# The output directory is the same for every network
outdir = 'data/graph_datasets/LOAD'

for network in networks:

    start_time = datetime.datetime.now()
    print()

    # Build one graph per subject; each node is coded with its number of missense variants
    result_nodes = create_nx_datasets.main('data', 'LOAD', 'LOAD', 'AD', network, 'missense', None)
    print('Coding: number of missense variants per node')

    outfile = f'{outdir}/AD_PPI_{network}_missense.pkl'
    print('Resulting dataset saved at:', outfile)
    print()

    # Serialize the resulting dataset to disk
    with open(outfile, 'wb') as f:
        pickle.dump(result_nodes, f)

    result_nodes_time = datetime.datetime.now()
    print('Processing time:', result_nodes_time - start_time)
    print('\n\n')

Network used: AD snap_brain
# nodes = 29
# edges = 52

Dataset used: LOAD
(11, 1678)
(11, 1608)
missense
(29, 1599)
Creating samples graphs...
Class: LOAD. Found 1014 positive subjects out of 1599
Sample graph used: # nodes = 29 # edges = 52
Density = 0.12807881773399016 Diameter = 6
Coding: number of missense variants per node
Resulting dataset saved at: data/graph_datasets/LOAD/AD_PPI_snap_brain_missense.pkl

Processing time: 0:00:55.406044
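
The log above reports a density of about 0.128 for the 29-node, 52-edge sample graph, which agrees with the standard density formula for a simple undirected graph, 2E / (N(N - 1)). The sketch below checks that figure and illustrates the "number of missense variants per node" coding on a toy graph. It is pure Python for illustration only: the gene names, edges, and variant counts are made up, and `encode_patient_graph` is a hypothetical helper, not the function used inside `create_nx_datasets`.

```python
def undirected_density(n_nodes: int, n_edges: int) -> float:
    """Density of a simple undirected graph: 2E / (N * (N - 1))."""
    if n_nodes < 2:
        return 0.0
    return 2 * n_edges / (n_nodes * (n_nodes - 1))


def encode_patient_graph(edges, variant_counts):
    """Return a per-patient graph as {node: {"x": count, "neighbors": [...]}}.

    Hypothetical helper: every patient shares the same topology (edges);
    only the node feature (missense variant count) differs between patients.
    """
    nodes = sorted({gene for edge in edges for gene in edge})
    graph = {
        gene: {"x": variant_counts.get(gene, 0), "neighbors": []}
        for gene in nodes
    }
    for a, b in edges:
        graph[a]["neighbors"].append(b)
        graph[b]["neighbors"].append(a)
    return graph


# Figures taken from the log above: 29 nodes, 52 edges.
density = undirected_density(29, 52)
print(round(density, 6))  # 0.128079, in line with the logged density

# Toy example with made-up genes and counts (not real LOAD data).
toy_edges = [("GENE_A", "GENE_B"), ("GENE_B", "GENE_C")]
toy_patient = encode_patient_graph(toy_edges, {"GENE_A": 2, "GENE_C": 1})
print(toy_patient["GENE_B"])  # a gene without variants gets a count of 0
```

Because the topology is fixed by the PPI network, graphs for different subjects differ only in their node features.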





**5. Graph classification with GNNs**

We then evaluated and tested different GNN configurations using the [GraphGym](https://github.com/snap-stanford/GraphGym) framework (You *et al.*, 2020).

The configuration and grid files used are in the [graphgym_files](graphgym_files) subdirectory.

The summarized results obtained by GraphGym on the LOAD dataset are in **COMPLETE**

We ran GraphGym 10 times, once for each of the 10 splits generated by 10-fold stratified cross-validation (see below).

In [3]:
# Create the splits used by GraphGym and the non-GNN models
# Splits are obtained through 10-fold stratified cross-validation

create_splits.create_folds_stratified_cv('LOAD', 10)

LOAD
Fold -  1   |   train -  [526 913]   |   test -  [ 59 101]
Fold -  2   |   train -  [526 913]   |   test -  [ 59 101]
Fold -  3   |   train -  [526 913]   |   test -  [ 59 101]
Fold -  4   |   train -  [526 913]   |   test -  [ 59 101]
Fold -  5   |   train -  [526 913]   |   test -  [ 59 101]
Fold -  6   |   train -  [527 912]   |   test -  [ 58 102]
Fold -  7   |   train -  [527 912]   |   test -  [ 58 102]
Fold -  8   |   train -  [527 912]   |   test -  [ 58 102]
Fold -  9   |   train -  [527 912]   |   test -  [ 58 102]
Fold -  10   |   train -  [527 913]   |   test -  [ 58 101]
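
To make the idea of stratification concrete, here is a simplified, pure-Python sketch of stratified fold assignment. It is not scikit-learn's `StratifiedKFold`, nor the implementation inside `create_splits`; `stratified_folds` and its `seed` parameter are hypothetical names. The class sizes (1014 positive subjects out of 1599) come from the dataset log earlier in this notebook. With those sizes, this dealing scheme yields per-fold class counts like the ones listed above, although the specific subjects placed in each fold will differ.

```python
import random
from collections import Counter


def stratified_folds(labels, n_folds, seed=0):
    """Assign each sample index to exactly one test fold, class by class.

    Simplified sketch: shuffle the indices of each class, then deal them
    round-robin across the folds so every fold keeps a similar class ratio.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    folds = [[] for _ in range(n_folds)]
    offset = 0
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[(position + offset) % n_folds].append(idx)
        # Start the next class where this one stopped, so fold sizes stay even.
        offset = (offset + len(indices)) % n_folds
    return folds


# Class sizes from the dataset log: 1014 positive subjects out of 1599.
labels = [1] * 1014 + [0] * (1599 - 1014)
folds = stratified_folds(labels, n_folds=10)

total = Counter(labels)
for number, test_idx in enumerate(folds, start=1):
    test_counts = Counter(labels[i] for i in test_idx)
    # Everything not in the test fold is used for training.
    train_counts = total - test_counts
    print(
        f"Fold {number:2d} | train [{train_counts[0]} {train_counts[1]}] "
        f"| test [{test_counts[0]} {test_counts[1]}]"
    )
```

Dealing each class separately is what keeps the ratio of controls to cases nearly constant across folds, which matters here because the two classes are imbalanced.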



**6. Analyze results**

We obtained the best GNN configurations in each split. Next, we selected the configuration common to all splits and computed the average and standard deviation of several performance metrics (Accuracy, Precision, Recall, F1, AUC) over the 10 splits.
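
The selection and aggregation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the analysis code of this project: the configuration names and metric values are placeholders rather than results, only three splits and two metrics are shown for brevity, and the sample standard deviation (`statistics.stdev`) is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, stdev

# Placeholder data: the best configurations found in each split.
# Names are hypothetical and do not correspond to real GraphGym configs.
best_configs_per_split = [
    {"cfg_a", "cfg_b"},
    {"cfg_b", "cfg_c"},
    {"cfg_b", "cfg_d"},
]

# Keep only the configurations that appear among the best in every split.
common_configs = set.intersection(*best_configs_per_split)

# Placeholder per-split metrics for the common configuration (not real results).
metrics_per_split = {
    "cfg_b": [
        {"accuracy": 0.60, "auc": 0.61},
        {"accuracy": 0.62, "auc": 0.63},
        {"accuracy": 0.64, "auc": 0.65},
    ],
}


def summarize(rows):
    """Return {metric: (mean, sample standard deviation)} over the splits."""
    summary = {}
    for name in rows[0]:
        values = [row[name] for row in rows]
        summary[name] = (mean(values), stdev(values))
    return summary


for config in sorted(common_configs):
    for metric, (avg, sd) in summarize(metrics_per_split[config]).items():
        print(f"{config} {metric}: {avg:.3f} +/- {sd:.3f}")
```

Reporting the mean together with the standard deviation shows how stable the chosen configuration is across splits, not just how well it performs on average.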

Please refer to the following notebook **COMPLETE!** to see how we compared GNN performance against the non-GNN models and the corresponding baseline model.