# Grapher

## Data description

### [FinnGen](https://finngen.gitbook.io/documentation/)

Any large Canadian GWAS-related clinical trials?

- The FinnGen research project is an expedition to the frontier of genomics and medicine, with significant discoveries potentially arising from any one of Finland’s 500,000 biomedical pioneers.
- The project brings together a nation-wide network of Finnish biobanks, with every Finn able to participate in the study by giving biobank consent.
- As of the last update, there were 589,000 samples available, with a goal to reach 520,000 by 2023. The latest data freeze included combined genotype and health registry data from 473,681 individuals.
- The study utilizes samples collected by a nationwide network of Finnish biobanks and combines genome information with digital health care data from national health registries【8†source】.
- There's a need for samples from all over Finland as solutions in the field of personalized healthcare can be found only by looking at large populations. Every Finn can be a part of the FinnGen study by giving a biobank consent.
- The genome data produced during the project is owned by the Finnish biobanks and remains available for research purposes. The medical breakthroughs that arise from the project are expected to benefit health care systems and patients globally.
- The FinnGen research project is collaborative, involving all the same actors as drug development, with the aim to speed up the emergence of new innovations.
- The project's data freeze 9 results and summary statistics are now available, consisting of over 377,200 individuals, almost 20.2 M variants, and 2,272 disease endpoints. Results can be browsed online using the FinnGen web browser, and the summary statistics downloaded.
- The University of Helsinki is the organization responsible for the study, and the nationwide network of Finnish biobanks is participating in the study, thus covering the whole of Finland. The Helsinki Biobank coordinates the sample collection.
- For more information, the project can be contacted at finngen-info@helsinki.fi.

### Dataset

Here's the summary documentation for the DataFrame in bullet format:

- `#chrom`: This column represents the chromosome number where the genetic variant is located.

- `pos`: This is the position of the genetic variant on the chromosome.

- `ref`: This column represents the reference allele (or variant) at the genomic position.

- `alt`: This is the alternate allele observed at this position.

- `rsids`: This stands for reference SNP cluster ID. It's a unique identifier for each variant used in the dbSNP database.

- `nearest_genes`: This column represents the gene which is nearest to the variant.

- `pval`: This represents the p-value, which is a statistical measure for the strength of evidence against the null hypothesis.

- `mlogp`: This represents the minus log of the p-value, commonly used in genomic studies.

- `beta`: The beta coefficient represents the effect size of the variant.

- `sebeta`: This is the standard error of the beta coefficient.

- `af_alt`: This is the allele frequency of the alternate variant in the general population.

- `af_alt_cases`: This is the allele frequency of the alternate variant in the cases group.

- `af_alt_controls`: This is the allele frequency of the alternate variant in the control group.

- `causal`: This binary column indicates whether the variant is determined to be causal (1) or not (0).

- `trait`: This column represents the trait associated with the variant. In this dataset, it is the response to the drug paracetamol and NSAIDs.

## Load libraries

In [1]:
import os
import random


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_fscore_support, roc_auc_score, average_precision_score
from sklearn.preprocessing import LabelEncoder, StandardScaler

import torch
import torch.nn.functional as F
import torch_geometric
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv, GATConv
from torch_geometric.utils import to_undirected, negative_sampling

import networkx as nx
from ogb.io import DatasetSaver
from ogb.linkproppred import LinkPropPredDataset

from scipy.spatial import cKDTree

## Versions

In [2]:
print(f"PyTorch version: {torch.__version__}")
print(f"PyTorch Geometric version: {torch_geometric.__version__}")

PyTorch version: 2.0.0+cu118
PyTorch Geometric version: 2.3.1


In [3]:
if torch.cuda.is_available():
    device = torch.device("cuda")          # Current CUDA device
    print(f"Using {torch.cuda.get_device_name()} ({device})")
    print(f"CUDA version: {torch.version.cuda}")
    print(f"Number of CUDA devices: {torch.cuda.device_count()}")
else:
    print("CUDA is not available on this device.")

Using NVIDIA GeForce RTX 3060 Ti (cuda)
CUDA version: 11.8
Number of CUDA devices: 1


## Load data and create new rows for each gene

In [4]:
dtypes={
    '#chrom': 'string',
    'pos': 'int64',
    'ref': 'string',
    'alt': 'string',
    'rsids': 'string',
    'nearest_genes': 'string',
    'pval': 'float64',
    'mlogp': 'float64',
    'beta': 'float64',
    'sebeta': 'float64',
    'af_alt': 'float64',
    'af_alt_cases': 'float64',
    'af_alt_controls': 'float64',
    'causal':'int64',
    'LD': 'int64',
    'lead': 'string',
    'trait': 'string'
}
data = pd.read_csv('~/Desktop/gwas-graph/FinnGen/data/gwas-causal.csv', dtype=dtypes)

In [5]:
data['nearest_genes'] = data['nearest_genes'].astype(str)

# Assuming your DataFrame is called data and the relevant column is 'nearest_genes'
# First, let's split the gene names in the 'nearest_genes' column
split_genes = data['nearest_genes'].str.split(',')

# Flatten the list of split gene names
flat_genes = [item for sublist in split_genes for item in sublist]

# Then, we create a new DataFrame by repeating rows and substituting the gene names
data_new = (data.loc[data.index.repeat(split_genes.str.len())]
            .assign(nearest_genes=flat_genes))

# Reset index to have a standard index
data = data_new.reset_index(drop=True)

In [6]:
data

Unnamed: 0,#chrom,pos,ref,alt,rsids,nearest_genes,pval,mlogp,beta,sebeta,af_alt,af_alt_cases,af_alt_controls,causal,LD,lead,trait
0,1,13668,G,A,rs2691328,OR4F5,0.944365,0.024860,-0.005926,0.084918,0.005842,0.005729,0.005863,0,0,,T2D
1,1,14773,C,T,rs878915777,OR4F5,0.844305,0.073501,0.010088,0.051369,0.013495,0.013547,0.013485,0,0,,T2D
2,1,15585,G,A,rs533630043,OR4F5,0.841908,0.074735,0.031464,0.157751,0.001113,0.001125,0.001110,0,0,,T2D
3,1,16549,T,C,rs1262014613,OR4F5,0.343308,0.464316,0.241377,0.254711,0.000561,0.000620,0.000550,0,0,,T2D
4,1,16567,G,C,rs1194064194,OR4F5,0.129883,0.886447,0.130736,0.086319,0.004170,0.004250,0.004154,0,0,,T2D
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20565622,23,155697920,G,A,,,0.027115,1.566790,-0.013475,0.006097,0.290961,0.286054,0.291879,0,0,,T2D
20565623,23,155698443,C,A,,,0.178417,0.748564,-0.069907,0.051951,0.003259,0.003022,0.003304,0,0,,T2D
20565624,23,155698490,C,T,,,0.279640,0.553400,-0.020245,0.018725,0.024406,0.024312,0.024423,0,0,,T2D
20565625,23,155699751,C,T,,,0.078864,1.103120,-0.011284,0.006421,0.244829,0.241257,0.245498,0,0,,T2D


In [7]:
# Drop rows with NaN values in 'rsids' column
data.dropna(subset=['rsids'], inplace=True)

# Keep only rows with unique values in 'rsids' column
data = data[data.duplicated(subset='rsids', keep=False) == False]

# Adjust the index if necessary
data.reset_index(drop=True, inplace=True)

In [8]:
len(data)

18295392

## Spec

**Task Overview**
- The objective is to design and implement a link prediction deep neural network model for analyzing relationships between SNP nodes and Phenotype nodes.

**Nodes and Their Features**
- There are two types of nodes: SNP Nodes and Phenotype Nodes.
- *Phenotype Nodes*: Each Phenotype Node represents a particular trait. This information comes from the `trait` column in the data.
- *SNP Nodes*: Each SNP Node is characterized by various features, including `rsids`, `nearest_genes`, `#chrom`, `pos`, `ref`, `alt`, `beta`, `sebeta`, `af_alt`, and `af_alt_cases` columns.

**Edges, Their Features, and Labels**
- Edges represent relationships between nodes. There are two types of edges: SNP-Phenotype and SNP-SNP.
- *SNP-Phenotype Edges*:
  - These edges are undirected, linking SNP Nodes and Phenotype Nodes.
  - The label for each edge is determined by the `causal` column in the data:
    - A label of +1 is assigned when `data['causal']` is 1, indicating a causal relationship.
    - A label of -1 is assigned when `data['causal']` is 0, indicating the absence of a causal relationship.
- *SNP-SNP Edges*:
  - These edges are undirected, linking an SNP Node (as identified by the `rsids` column) to another SNP Node (as identified by the `lead` column) in the same data row.
  - The label for each edge is determined by the `LD` column in the data:
    - A label of +1 is assigned when `data['LD']` is 1, signifying that the two SNPs are in linkage disequilibrium.
    - A label of -1 is assigned when `data['LD']` is 0, indicating that the two SNPs are not in linkage disequilibrium.

## Create graph

In [79]:
%%time

import random

# Create mappings for phenotypes and SNPs to integer indices
phenotypes = data['trait'].unique()
snps = data['rsids'].unique()
phenotype_to_idx = {phenotype: idx for idx, phenotype in enumerate(phenotypes)}
snp_to_idx = {snp: idx + len(phenotypes) for idx, snp in enumerate(snps)}

# Create node feature vectors for phenotypes and SNPs
phenotype_features = data.loc[data['trait'].isin(phenotypes)][['trait']].drop_duplicates().sort_values(by='trait').reset_index(drop=True)
snp_features = data.loc[data['rsids'].isin(snps)][['rsids', 'nearest_genes', '#chrom', 'pos', 'ref', 'alt', 'beta', 'sebeta', 'af_alt', 'af_alt_cases']].drop_duplicates().sort_values(by='rsids').reset_index(drop=True)

# Create node type labels
node_types = torch.tensor([0] * len(phenotypes) + [1] * len(snps), dtype=torch.long)

# Create positive edges between SNPs and Phenotypes (Causal Edges)
positive_edges_snp_phenotype = data.loc[data['causal'] == 1, ['rsids', 'trait']].drop_duplicates()
positive_edges_snp_phenotype['snp_idx'] = positive_edges_snp_phenotype['rsids'].map(snp_to_idx)
positive_edges_snp_phenotype['phenotype_idx'] = positive_edges_snp_phenotype['trait'].map(phenotype_to_idx)
positive_edges_snp_phenotype = positive_edges_snp_phenotype[['snp_idx', 'phenotype_idx']].values
positive_edges_snp_phenotype = torch.tensor(positive_edges_snp_phenotype, dtype=torch.long).t().contiguous()

# Create negative edges between SNPs and Phenotypes (Non-Causal Edges)
negative_edges_snp_phenotype = data.loc[data['causal'] == 0, ['rsids', 'trait']].drop_duplicates()
negative_edges_snp_phenotype['snp_idx'] = negative_edges_snp_phenotype['rsids'].map(snp_to_idx)
negative_edges_snp_phenotype['phenotype_idx'] = negative_edges_snp_phenotype['trait'].map(phenotype_to_idx)
negative_edges_snp_phenotype = negative_edges_snp_phenotype[['snp_idx', 'phenotype_idx']].values
negative_edges_snp_phenotype = torch.tensor(negative_edges_snp_phenotype, dtype=torch.long).t().contiguous()

# Create positive SNP-SNP edges (SNPs in LD)
positive_edges_snp_snp = data.loc[data['LD'] == 1, ['rsids', 'lead']].drop_duplicates()
positive_edges_snp_snp['snp_idx'] = positive_edges_snp_snp['rsids'].map(snp_to_idx)
positive_edges_snp_snp['lead_idx'] = positive_edges_snp_snp['lead'].map(snp_to_idx)
positive_edges_snp_snp = positive_edges_snp_snp[['snp_idx', 'lead_idx']].values
positive_edges_snp_snp = torch.tensor(positive_edges_snp_snp, dtype=torch.long).t().contiguous()

# Create negative SNP-SNP edges (SNPs not in LD)
negative_edges_snp_snp = data.loc[data['LD'] == 0, ['rsids', 'lead']].drop_duplicates()
negative_edges_snp_snp['snp_idx'] = negative_edges_snp_snp['rsids'].map(snp_to_idx)
negative_edges_snp_snp['lead_idx'] = negative_edges_snp_snp['lead'].map(snp_to_idx)
negative_edges_snp_snp = negative_edges_snp_snp[['snp_idx', 'lead_idx']].values
negative_edges_snp_snp = torch.tensor(negative_edges_snp_snp, dtype=torch.long).t().contiguous()

# Combine positive and negative edges
#edges = torch.cat([positive_edges_snp_phenotype, negative_edges_snp_phenotype, positive_edges_snp_snp, negative_edges_snp_snp], dim=1)

# Create edge attributes
#edge_attr = torch.tensor([1] * positive_edges_snp_phenotype.size(1) + [-1] * negative_edges_snp_phenotype.size(1) + [1] * positive_edges_snp_snp.size(1) + [-1] * negative_edges_snp_snp.size(1), dtype=torch.float)

# Combine the feature vectors
combined_features = pd.concat([phenotype_features, snp_features], ignore_index=True).drop(['trait', 'rsids'], axis=1)

# Fill NaNs with appropriate replacements
nan_replacements = {'nearest_genes': 'N/A', '#chrom': 'N/A', 'pos': 0, 'ref': 'N/A', 'alt': 'N/A', 'beta': 0, 'sebeta': 0, 'af_alt': 0, 'af_alt_cases': 0}
for col, replacement in nan_replacements.items():
    if col in combined_features:
        if combined_features[col].dtype.name == 'category' and replacement not in combined_features[col].cat.categories:
            combined_features[col] = combined_features[col].cat.add_categories([replacement])
        combined_features[col].fillna(replacement, inplace=True)

# Label encoding for categorical columns
le = LabelEncoder()
categorical_columns = ['ref', 'alt', 'nearest_genes', '#chrom']
for column in categorical_columns:
    combined_features[column] = le.fit_transform(combined_features[column].astype(str))

# Check if problem solved
try:
    features = torch.tensor(combined_features.values, dtype=torch.float)
    print("Successfully converted to tensor!")
except TypeError as e:
    print("Still a problem: ", e)


# Standardize numerical features
scaler = StandardScaler()
numerical_columns = ['pos', 'beta', 'sebeta', 'af_alt', 'af_alt_cases']
for col in numerical_columns:
    combined_features[col] = scaler.fit_transform(combined_features[[col]])

features = torch.tensor(combined_features.values, dtype=torch.float)

#### Combine positive edges only
edges = torch.cat([positive_edges_snp_phenotype, positive_edges_snp_snp], dim=1)
edge_attr = torch.tensor([0] * positive_edges_snp_phenotype.size(1) + [0] * positive_edges_snp_snp.size(1), dtype=torch.float)

# Create the PyTorch Geometric graph
graph = Data(x=features, edge_index=edges, edge_attr=edge_attr)
graph.node_types = node_types
graph.y = torch.ones(positive_edges_snp_phenotype.size(1) + positive_edges_snp_snp.size(1), dtype=torch.long)


print(f"Number of nodes: {graph.num_nodes}")
print(f"Number of positive SNP-Phenotype edges: {positive_edges_snp_phenotype.size(1)}")
print(f"Number of negative SNP-Phenotype edges: {negative_edges_snp_phenotype.size(1)}")
print(f"Number of positive SNP-SNP edges: {positive_edges_snp_snp.size(1)}")
print(f"Number of negative SNP-SNP edges: {negative_edges_snp_snp.size(1)}")
print(f"Number of edges: {graph.num_edges}")
print(f"Node feature dimension: {graph.num_node_features}")
print(f"Node types: {graph.node_types}")


Successfully converted to tensor!
Number of nodes: 18295393
Number of positive SNP-Phenotype edges: 37
Number of negative SNP-Phenotype edges: 18295355
Number of positive SNP-SNP edges: 3787
Number of negative SNP-SNP edges: 18291605
Number of edges: 3824
Node feature dimension: 9
Node types: tensor([0, 1, 1,  ..., 1, 1, 1])
CPU times: total: 1min 38s
Wall time: 2min 53s


### Data Preprocessing Functions

### Node Data Preprocessing 

### Edge Data Processing

### Edge Attributes and Node Feature Preprocessing

### Construct Graph Data Object

## Graph stats

In [28]:
from torch_geometric.utils import degree

def print_graph_stats(graph, positive_edges_snp_phenotype, negative_edges_snp_phenotype, positive_edges_snp_snp, negative_edges_snp_snp):
    node_types = np.unique(graph.node_types.numpy(), return_counts=True)

    print(f"Number of nodes: {graph.num_nodes}")
    for node_type, count in zip(*node_types):
        print(f"Number of {node_type} nodes: {count}")
    print(f"Number of positive edges between SNPs and phenotypes: {positive_edges_snp_phenotype.shape[1]}")
    print(f"Number of negative edges between SNPs and phenotypes: {negative_edges_snp_phenotype.shape[1]}")
    print(f"Number of positive edges between SNPs: {positive_edges_snp_snp.shape[1]}")
    print(f"Number of negative edges between SNPs: {negative_edges_snp_snp.shape[1]}")
    print(f"Number of edges: {graph.num_edges}")
    print(f"Node feature dimension: {graph.num_node_features}")

    # Compute and print degree-related stats for each node type
    for node_type in node_types[0]:
        node_indices = np.where(graph.node_types.numpy() == node_type)[0]
        degrees = degree(graph.edge_index[0].long(), num_nodes=graph.num_nodes)[node_indices]
        average_degree = degrees.float().mean().item()
        median_degree = np.median(degrees.numpy())
        std_degree = degrees.float().std().item()

        print(f"\n{node_type} node stats:")
        print(f"Average degree: {average_degree:.2f}")
        print(f"Median degree: {median_degree:.2f}")
        print(f"Standard deviation of degree: {std_degree:.2f}")

    # Density is the ratio of actual edges to the maximum number of possible edges
    num_possible_edges = graph.num_nodes * (graph.num_nodes - 1) / 2
    density = graph.num_edges / num_possible_edges
    print(f"Density: {density:.10f}")

    # Check for NaN values in features
    nan_in_features = torch.isnan(graph.x).any().item()
    print(f"Are there any NaN values in features? {nan_in_features}")

# Print
print("Graph stats:")
print_graph_stats(graph, positive_edges_snp_phenotype, negative_edges_snp_phenotype, positive_edges_snp_snp, negative_edges_snp_snp)

Graph stats:
Number of nodes: 18295393
Number of 0 nodes: 1
Number of 1 nodes: 18295392
Number of positive edges between SNPs and phenotypes: 37
Number of negative edges between SNPs and phenotypes: 18295355
Number of positive edges between SNPs: 3787
Number of negative edges between SNPs: 18291605
Number of edges: 3824
Node feature dimension: 9

0 node stats:
Average degree: 0.00
Median degree: 0.00
Standard deviation of degree: nan

1 node stats:
Average degree: 0.00
Median degree: 0.00
Standard deviation of degree: 0.01
Density: 0.0000000000
Are there any NaN values in features? False


## Data splitting

In [81]:
graph

Data(x=[18295393, 9], edge_index=[2, 3824], edge_attr=[3824], y=[3824], node_types=[18295393])

In [80]:
from torch_geometric.transforms import RandomLinkSplit

transform = RandomLinkSplit(num_val=0.1, num_test=0.1, key="y", is_undirected=True)

graph_train, graph_val, graph_test = transform(graph)

print(graph_train)
print(graph_val)
print(graph_test)

ValueError: Insufficient number of edges for training

## Models

### GCN

In [26]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import GCNConv
from torch_geometric.data import Data

# Define the GCN model
class GCN(nn.Module):
    def __init__(self, num_features, hidden_size):
        super(GCN, self).__init__()
        self.conv1 = GCNConv(num_features, hidden_size)
        self.conv2 = GCNConv(hidden_size, 1)

    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = self.conv2(x, edge_index)
        return x

# Define the training loop
def train(model, data, optimizer, criterion, num_epochs):
    model.train()
    for epoch in range(num_epochs):
        optimizer.zero_grad()
        out = model(data.x, data.edge_index)
        loss = criterion(out[data.pos_edge_index], out[data.neg_edge_index])
        loss.backward()
        optimizer.step()
        if epoch % 10 == 0:
            print(f'Epoch {epoch}/{num_epochs}, Loss: {loss.item()}')

# Evaluate the model
def evaluate(model, data):
    model.eval()
    with torch.no_grad():
        out = model(data.x, data.edge_index)
        y_pred = torch.sigmoid(out)
        y_pred[y_pred >= 0.5] = 1
        y_pred[y_pred < 0.5] = -1
        correct = (y_pred == data.edge_attr).sum().item()
        total = data.edge_attr.size(0)
        accuracy = correct / total
        print(f'Accuracy: {accuracy}')

# Initialize and train the GCN model
num_features = graph_train.x.size(1)
hidden_size = 64
learning_rate = 0.01
num_epochs = 100

model = GCN(num_features, hidden_size)
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
criterion = nn.BCEWithLogitsLoss()

train(model, graph_train, optimizer, criterion, num_epochs)

# Evaluate the model on the validation and test sets
evaluate(model, graph_val)
evaluate(model, graph_test)


RuntimeError: index -9223372036854775808 is out of bounds for dimension 0 with size 18295393