## README 
Website: https://www.eqtlgen.org/phase1.html

Paper: https://www.nature.com/articles/s41588-021-00913-z

This Jupyter notebook README covers cis-eQTL and trans-eQTL results from the eQTLGen project. The dataset includes files of significant results for both the cis-eQTL and trans-eQTL analyses.

Files contain various columns with information on P-value, SNP rs ID, SNP chromosome, SNP position, assessed and not assessed alleles, Z-score, ENSG and HGNC names of eQTL genes, gene chromosome, gene position, number of cohorts, number of samples, false discovery rate, and Bonferroni-corrected P-value.

The cis-eQTL analysis includes 19,250 genes expressed in blood; SNP-gene combinations lie within 1 Mb of the gene center and were tested in at least 2 cohorts. The trans-eQTL analysis tests 19,960 genes expressed in blood against 10,317 trait-associated SNPs drawn from the GWAS Catalog, Immunobase, and the Astle et al. study. Trans-eQTL combinations are separated by more than 5 Mb and were tested in at least 2 cohorts.
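
The two distance criteria can be written as a small classification rule. The sketch below is illustrative only: the function name `classify_pair` is hypothetical, the cohort-count filter is left out, and treating SNPs on a different chromosome from the gene as trans candidates is an assumption rather than something stated above.

```python
CIS_MAX_DISTANCE = 1_000_000    # cis: SNP within 1 Mb of the gene center
TRANS_MIN_DISTANCE = 5_000_000  # trans: SNP more than 5 Mb from the gene


def classify_pair(snp_chr, snp_pos, gene_chr, gene_pos):
    """Label a SNP-gene pair as 'cis', 'trans', or 'neither' by distance.

    Assumption: pairs on different chromosomes count as trans candidates.
    """
    if snp_chr != gene_chr:
        return "trans"
    distance = abs(snp_pos - gene_pos)
    if distance <= CIS_MAX_DISTANCE:
        return "cis"
    if distance > TRANS_MIN_DISTANCE:
        return "trans"
    return "neither"


print(classify_pair("1", 1_500_000, "1", 1_000_000))   # 0.5 Mb apart
print(classify_pair("1", 10_000_000, "1", 1_000_000))  # 9 Mb apart
```

Pairs between 1 Mb and 5 Mb apart on the same chromosome fall into neither analysis under this rule.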

The FDR is permutation-based and, for trans-eQTL mapping, is calculated on a pruned set of SNPs. Cross-mapping filters are applied to identify and remove potential artifacts in the trans-eQTL results, and the FDR is recalculated afterward. Note that the full results file has not been filtered for cross-mapping effects, so it may contain artifacts.

The code below demonstrates the process of creating a graph-based representation of the combined cis and trans-eQTL data using PyTorch Geometric. This process can be broken down into several steps:

1. Combine cis and trans dataframes: The code begins by concatenating the cis and trans dataframes into a single dataframe named 'data', which contains information on both cis-eQTL and trans-eQTL results. This combined dataset simplifies the process of working with the data and ensures that all relevant information is contained within a single data structure.

2. Create mappings for genes and SNPs: To represent the genes and SNPs as nodes in the graph, integer indices are assigned to each unique gene and SNP. This is done using dictionaries called 'gene_to_idx' and 'snp_to_idx', which map the gene and SNP identifiers to their corresponding integer indices.

3. Generate node type labels: Node type labels are created using PyTorch tensors, distinguishing between gene nodes (assigned a label of 0) and SNP nodes (assigned a label of 1). This differentiation is useful for various graph-based analyses and machine learning tasks that require knowledge of node types.

4. Create edges based on gene and SNP indices: Edges in the graph represent the relationships between genes and SNPs. These edges are created by iterating over the 'data' dataframe and extracting the corresponding gene and SNP indices from the previously created mappings. The edges are then represented as a PyTorch tensor with a long data type.

5. Convert edges to undirected: Since the relationships between genes and SNPs are undirected, the edges in the graph should also be undirected. This is achieved using the 'to_undirected()' function from PyTorch Geometric, which ensures that the graph correctly represents the underlying biology.

6. Create a PyTorch Geometric graph: Finally, the graph is created using the PyTorch Geometric 'Data' class. The node types and edge indices are used as inputs to instantiate the graph object, which can then be utilized for further analysis and visualization.
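
The steps can be traced on a toy example. The sketch below uses plain Python lists in place of dataframes and tensors, and the identifiers in `records` are made up for illustration. Step 6 (wrapping the result in a `Data` object) is left out.

```python
# Toy records standing in for rows of the combined dataframe (made-up IDs).
records = [
    ("ENSG_A", "rs1"),
    ("ENSG_A", "rs2"),
    ("ENSG_B", "rs2"),
]

# Step 2: assign indices in order of first appearance; SNP indices follow gene indices.
genes = list(dict.fromkeys(gene for gene, _ in records))
snps = list(dict.fromkeys(snp for _, snp in records))
gene_to_idx = {gene: i for i, gene in enumerate(genes)}
snp_to_idx = {snp: i + len(genes) for i, snp in enumerate(snps)}

# Step 3: node type labels, 0 for genes and 1 for SNPs.
node_types = [0] * len(genes) + [1] * len(snps)

# Step 4: one (gene, SNP) edge per record.
edges = [(gene_to_idx[g], snp_to_idx[s]) for g, s in records]

# Step 5: make the edges undirected by adding each reverse direction,
# then sort and de-duplicate.
undirected = sorted(set(edges) | {(v, u) for u, v in edges})

print(node_types)   # [0, 0, 1, 1]
print(undirected)   # [(0, 2), (0, 3), (1, 3), (2, 0), (3, 0), (3, 1)]
```

Offsetting the SNP indices by `len(genes)` keeps both node types in a single index space, which is what a homogeneous edge index requires.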

The resulting 'graph' object is a PyTorch Geometric representation of the combined cis and trans-eQTL data. The prediction task is to predict new association edges given the training edges, with the task type being link prediction. Below are a few important graph statistics:

- Number of nodes: 3681495
- Number of SNP nodes: 3664025
- Number of Gene nodes: 17470
- Number of edges: 10567450
- Number of connected components: 424
- Average degree: 5.74
- Median degree: 2.0
- Standard deviation of degree: 69.81
- Density: 0.0000015594
- Assortativity: -0.2267915607
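
The average degree and density follow directly from the node and edge counts. As a quick consistency check, assuming the standard definitions for a simple undirected graph (average degree = 2E/N, density = 2E/(N(N-1))):

```python
num_nodes = 3_681_495
num_edges = 10_567_450

# Each undirected edge contributes to the degree of two nodes.
average_degree = 2 * num_edges / num_nodes

# Fraction of all possible node pairs that are connected.
density = 2 * num_edges / (num_nodes * (num_nodes - 1))

print(f"Average degree: {average_degree:.2f}")   # 5.74
print(f"Density: {density:.10f}")                # 0.0000015594
```

Both values agree with the statistics listed above.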

The data are split at random while maintaining an equal proportion of cis and trans associations.

- The model predicts whether an edge exists between two nodes: -1 for no edge and +1 for an edge. It knows whether nodes are genes (0) or SNPs (1). It has no notion of cis or trans apart from differences in the node feature vectors:

- Gene Node Features: 'GenePos': float, 'GeneChr': str, 'GeneStart': float, 'GeneEnd': float

- SNP Node Features: 'SNPChr': str, 'SNPPos': float, 'AssessedAllele': str, 'OtherAllele': str

- Input: feature vectors for and edges between all training nodes

- Output: edges between all nodes in test/validation set
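
String-valued features such as chromosome and allele have to be turned into numbers before they can be placed in a feature tensor. The notebook does this with scikit-learn's `LabelEncoder` and `StandardScaler`. The sketch below reproduces the two transforms in plain Python on toy values; the helper names are hypothetical and the inputs are not taken from the dataset.

```python
from statistics import mean, pstdev


def label_encode(values):
    """Map each distinct value to an integer code, in sorted order of the values."""
    classes = sorted(set(values))
    code = {value: i for i, value in enumerate(classes)}
    return [code[value] for value in values]


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


chromosomes = ["1", "X", "2", "1"]         # toy values
positions = [100.0, 200.0, 300.0, 400.0]   # toy values

print(label_encode(chromosomes))   # [0, 2, 1, 0]
print(standardize(positions))
```

Integer codes impose an arbitrary ordering on categories such as chromosomes, so a one-hot or embedding representation may be worth considering as an alternative.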

## Data Setup 

### Libraries

In [1]:
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_fscore_support, roc_auc_score, average_precision_score
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler


import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.nn.init as init

import torch_sparse

import torch_geometric
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv, GATConv
from torch_geometric.utils import to_undirected, negative_sampling, subgraph
from torch_geometric.transforms import RandomLinkSplit

import networkx as nx
from ogb.io import DatasetSaver
from ogb.linkproppred import LinkPropPredDataset

In [2]:
print(f"PyTorch version: {torch.__version__}")
print(f"PyTorch Geometric version: {torch_geometric.__version__}")

PyTorch version: 2.0.0+cu118
PyTorch Geometric version: 2.3.1


In [3]:
if torch.cuda.is_available():
    device = torch.device("cuda")          # Current CUDA device
    print(f"Using {torch.cuda.get_device_name()} ({device})")
    print(f"CUDA version: {torch.version.cuda}")
    print(f"Number of CUDA devices: {torch.cuda.device_count()}")
else:
    device = torch.device("cpu")           # Fall back to the CPU
    print("CUDA is not available on this device.")

Using NVIDIA GeForce RTX 3060 Ti (cuda)
CUDA version: 11.8
Number of CUDA devices: 1


In [4]:
# Load data
data_types = {'Pvalue': float, 'SNP': str, 'SNPChr': str, 'SNPPos': float,
              'AssessedAllele': str, 'OtherAllele': str, 'Zscore': float,
              'Gene': str, 'GeneSymbol': str, 'GeneChr': str, 'GenePos': float,
              'NrCohorts': int, 'NrSamples': int, 'FDR': float,
              'BonferroniP': float, 'GeneStart': float, 'GeneEnd': float, 'Sig':int}

cis = pd.read_csv("cis-genes.csv", dtype=data_types)
trans = pd.read_csv("trans-genes.csv", dtype=data_types)

### Graph

In [5]:
data = pd.concat([cis, trans], ignore_index=True)

In [6]:
# Create mappings for genes and SNPs to integer indices
genes = data['Gene'].unique()
snps = data['SNP'].unique()
gene_to_idx = {gene: idx for idx, gene in enumerate(genes)}
snp_to_idx = {snp: idx + len(genes) for idx, snp in enumerate(snps)}

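# Create node feature tables (one row per distinct attribute combination).
# Note: rows are sorted by identifier here, while gene_to_idx / snp_to_idx follow
# order of first appearance, so row i is not guaranteed to describe node i.
gene_features = data.loc[data['Gene'].isin(genes)][['Gene', 'GeneChr', 'GenePos', 'GeneStart', 'GeneEnd']].drop_duplicates().sort_values(by='Gene').reset_index(drop=True)
snp_features = data.loc[data['SNP'].isin(snps)][['SNP', 'SNPChr', 'SNPPos', 'AssessedAllele', 'OtherAllele']].drop_duplicates().sort_values(by='SNP').reset_index(drop=True)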


# Create node type labels
node_types = torch.tensor([0] * len(genes) + [1] * len(snps), dtype=torch.long)

# Filter the data to create positive and negative edges
sig = data[data['Sig'] == 1]
unsig = data[data['Sig'] != 1]

# Create positive and negative edges
positive_edges = sig.apply(lambda row: (gene_to_idx[row['Gene']], snp_to_idx[row['SNP']]), axis=1)
negative_edges = unsig.apply(lambda row: (gene_to_idx[row['Gene']], snp_to_idx[row['SNP']]), axis=1)

positive_edges = torch.tensor(list(positive_edges), dtype=torch.long).t().contiguous()
negative_edges = torch.tensor(list(negative_edges), dtype=torch.long).t().contiguous()

# Combine positive and negative edges
edges = torch.cat([positive_edges, negative_edges], dim=1)

# Create edge labels
edge_labels = torch.tensor([1] * positive_edges.size(1) + [0] * negative_edges.size(1), dtype=torch.long)

# Convert edges to undirected
edges = to_undirected(edges)

# Combine the feature vectors
combined_features = pd.concat([gene_features, snp_features], ignore_index=True).drop(['Gene', 'SNP'], axis=1)

# Replace missing values: "N/A" for categorical columns, 0 for numerical columns
combined_features.fillna({'GeneChr': 'N/A', 'GenePos': 0, 'GeneStart': 0, 'GeneEnd': 0,
                          'SNPChr': 'N/A', 'SNPPos': 0, 'AssessedAllele': 'N/A', 'OtherAllele': 'N/A'}, inplace=True)
combined_features.replace({'GeneChr': {'': 'N/A'}, 'SNPChr': {'': 'N/A'},
                           'AssessedAllele': {'': 'N/A'}, 'OtherAllele': {'': 'N/A'}}, inplace=True)


# Label encoding for categorical columns
categorical_columns = ['GeneChr', 'SNPChr', 'AssessedAllele', 'OtherAllele']
for column in categorical_columns:
    le = LabelEncoder()
    combined_features[column] = le.fit_transform(combined_features[column])

# Standardize numerical features
numerical_columns = ['GenePos', 'GeneStart', 'GeneEnd', 'SNPPos']
scaler = StandardScaler()
combined_features[numerical_columns] = scaler.fit_transform(combined_features[numerical_columns])

# Create the PyTorch tensor
features = torch.tensor(combined_features.values, dtype=torch.float)

# Create the PyTorch Geometric graph
graph = Data(x=features, edge_index=edges, edge_attr=edge_labels)
graph.node_types = node_types
#graph.x = F.normalize(graph.x, p=2, dim=-1)
print(f"Number of nodes: {graph.num_nodes}")
print(f"Number of edges: {graph.num_edges}")
print(f"Node feature dimension: {graph.num_node_features}")
print(f"Node types: {graph.node_types}")

edge_attr_sum = torch.sum(graph.edge_attr)
print(f"Sum of edge_attr: {edge_attr_sum}")

Number of nodes: 6756193
Number of edges: 114904434
Node feature dimension: 8
Node types: tensor([0, 0, 0,  ..., 1, 1, 1])
Sum of edge_attr: 2154707


### Check

In [7]:
print("Sum of 'Sig' values:", len(data[data['Sig'] == 1]))
print("Count of 'Sig' values not equal to 1:", len(data[data['Sig'] != 1]))

Sum of 'Sig' values: 2154707
Count of 'Sig' values not equal to 1: 55297510


In [8]:
# Check for NaN values in features
nan_in_features = torch.isnan(graph.x).any().item()
print(f"Are there any NaN values in features? {nan_in_features}")

Are there any NaN values in features? False
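
One further check concerns the edge labels. The edge count printed earlier (114,904,434) is twice the number of labelled rows (2,154,707 + 55,297,510 = 57,452,217), so after `to_undirected` the label tensor no longer has one entry per column of the edge index. One way to keep the two aligned is to carry the labels through the symmetrization step. The sketch below does this in plain Python on toy data; `to_undirected_with_labels` is a hypothetical helper, not part of PyTorch Geometric.

```python
def to_undirected_with_labels(edges, labels):
    """Return sorted, symmetric edges with one label per direction.

    Assumption for this sketch: if a pair occurs more than once,
    the largest label wins.
    """
    label_of = {}
    for (u, v), label in zip(edges, labels):
        for pair in ((u, v), (v, u)):
            label_of[pair] = max(label, label_of.get(pair, label))
    pairs = sorted(label_of)
    return pairs, [label_of[pair] for pair in pairs]


edges = [(0, 2), (0, 3), (1, 3)]   # toy gene -> SNP edges
labels = [1, 0, 1]                 # 1 = significant, 0 = not significant

undirected_edges, undirected_labels = to_undirected_with_labels(edges, labels)
print(undirected_edges)    # [(0, 2), (0, 3), (1, 3), (2, 0), (3, 0), (3, 1)]
print(undirected_labels)   # [1, 0, 1, 1, 0, 1]
```

PyTorch Geometric's own `to_undirected` also accepts an `edge_attr` argument and returns the attributes together with the index, which may be the more direct option in the notebook.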


### Stats

In [9]:
def print_graph_stats(graph, genes, snps):
    # Build an undirected NetworkX graph from the edge index
    G = nx.Graph()
    G.add_edges_from(graph.edge_index.t().tolist())

    num_nodes = G.number_of_nodes()
    num_genes = len(genes)
    num_snps = len(snps)
    num_edges = G.number_of_edges()
    num_connected_components = nx.number_connected_components(G)
    # Compute the degree sequence once and reuse it
    degrees = [degree for _, degree in G.degree()]
    average_degree = np.mean(degrees)
    median_degree = np.median(degrees)
    std_degree = np.std(degrees)
    density = nx.density(G)
    assortativity = nx.degree_assortativity_coefficient(G)

    print(f"Number of nodes: {num_nodes}")
    print("Number of SNP nodes:", num_snps)
    print("Number of Gene nodes:", num_genes)
    print(f"Number of edges: {num_edges}")
    print(f"Number of connected components: {num_connected_components}")
    print(f"Average degree: {average_degree:.2f}")
    print(f"Median degree: {median_degree}")
    print(f"Standard deviation of degree: {std_degree:.2f}")
    print(f"Density: {density:.10f}")
    print(f"Assortativity: {assortativity:.10f}")

# For the graph
print("Graph stats:")
print_graph_stats(graph, genes, snps)
print("\n")

Graph stats:
Number of nodes: 6755968
Number of SNP nodes: 6742475
Number of Gene nodes: 13493
Number of edges: 57452217
Number of connected components: 1
Average degree: 17.01
Median degree: 2.0
Standard deviation of degree: 231.95
Density: 0.0000025175
Assortativity: -0.5145749464




### Data split

In [10]:
# data = pd.read_csv("sig-combined-with-genes.csv", dtype=data_types)
# graph = Data(x=features, edge_index=edges)

transform = RandomLinkSplit(num_val=0.495, num_test=0.495, is_undirected=True)
graph_train, graph_val, graph_test = transform(graph)

print(graph_train)
print(graph_val)
print(graph_test)

Data(x=[6756193, 8], edge_index=[2, 1149046], edge_attr=[57452217], node_types=[6755968], edge_label=[1149046], edge_label_index=[2, 1149046])
Data(x=[6756193, 8], edge_index=[2, 1149046], edge_attr=[57452217], node_types=[6755968], edge_label=[56877694], edge_label_index=[2, 56877694])
Data(x=[6756193, 8], edge_index=[2, 58026740], edge_attr=[57452217], node_types=[6755968], edge_label=[56877694], edge_label_index=[2, 56877694])
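
The README describes a random split that keeps cis and trans associations in equal proportion. The `RandomLinkSplit` call above receives no cis/trans information, so a stratified split would have to be done separately. A minimal sketch in plain Python, assuming each edge carries a kind tag (`"cis"` or `"trans"`); the function name, the fractions, and the toy edges are hypothetical:

```python
import random


def stratified_edge_split(edges, kinds, val_frac, test_frac, seed=0):
    """Split edges into train/val/test so each kind keeps the same fractions."""
    rng = random.Random(seed)
    by_kind = {}
    for edge, kind in zip(edges, kinds):
        by_kind.setdefault(kind, []).append(edge)

    split = {"train": [], "val": [], "test": []}
    for kind_edges in by_kind.values():
        rng.shuffle(kind_edges)
        n_val = int(len(kind_edges) * val_frac)
        n_test = int(len(kind_edges) * test_frac)
        split["val"] += kind_edges[:n_val]
        split["test"] += kind_edges[n_val:n_val + n_test]
        split["train"] += kind_edges[n_val + n_test:]
    return split


# Toy input: 10 cis edges and 10 trans edges (made-up indices).
edges = [(0, 100 + i) for i in range(10)] + [(1, 200 + i) for i in range(10)]
kinds = ["cis"] * 10 + ["trans"] * 10

split = stratified_edge_split(edges, kinds, val_frac=0.2, test_frac=0.2)
print({name: len(part) for name, part in split.items()})
```

Shuffling within each kind before slicing means every subset contains the same share of cis and trans edges, up to rounding.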


### Model

In [11]:
# Task: Link prediction: does an edge exist between two nodes?
# Node Types: 0 = Gene, 1 = SNP
# Node Feature Vector: 8-dimensional

torch.cuda.empty_cache()

# Define the GCN model
class GCN(torch.nn.Module):
    def __init__(self, hidden_channels):
        super(GCN, self).__init__()
        self.conv1 = GCNConv(8, hidden_channels)
        self.conv2 = GCNConv(hidden_channels, hidden_channels)

    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.conv2(x, edge_index)
        return x

# Train and evaluate the model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = GCN(hidden_channels=2).to(device)

graph_train = graph_train.to(device)
graph_val = graph_val.to(device)
graph_test = graph_test.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)

# Train function
def train():
    model.train()
    optimizer.zero_grad()
    z = model(graph_train.x.float(), graph_train.edge_index)

    # Only consider positive edges for the positive score calculation
    pos_edge_index = graph_train.edge_index
    pos = (z[pos_edge_index[0]] * z[pos_edge_index[1]]).sum(dim=-1)

    # Use negative_sampling to generate negative edges
    neg_edge_index = negative_sampling(edge_index=pos_edge_index, num_nodes=z.size(0), num_neg_samples=pos_edge_index.size(1))
    neg = (z[neg_edge_index[0]] * z[neg_edge_index[1]]).sum(dim=-1)

    logits = torch.cat([pos, neg], dim=0)
    # Build the targets directly on the device: 1 for positive edges, 0 for sampled negatives
    targets = torch.cat([torch.ones_like(pos), torch.zeros_like(neg)], dim=0)

    loss = F.binary_cross_entropy_with_logits(logits, targets)
    loss.backward()
    optimizer.step()
    return loss.item()



# Evaluation function
def evaluate(edge_index, graph):
    model.eval()
    with torch.no_grad():
        z = model(graph.x.float(), graph.edge_index)
        pos = torch.sigmoid((z[edge_index[0]] * z[edge_index[1]]).sum(dim=-1)).view(-1)
        neg_edge_index = negative_sampling(edge_index, num_nodes=graph.num_nodes, num_neg_samples=edge_index.size(1))
        neg = torch.sigmoid((z[neg_edge_index[0]] * z[neg_edge_index[1]]).sum(dim=-1)).view(-1)

        preds = np.concatenate([pos.cpu().numpy(), neg.cpu().numpy()])
        true_labels = np.concatenate([np.ones_like(pos.cpu().numpy()), np.zeros_like(neg.cpu().numpy())])

        roc_auc = roc_auc_score(true_labels, preds)
        mrr = compute_mrr(preds, true_labels)
        hits_at_5 = compute_hits_at_k(preds, true_labels, k=5)

        return roc_auc, mrr, hits_at_5

def compute_mrr(preds, true_labels):
    # Rank all examples (positive and negative) by predicted score, descending
    sorted_idx = np.argsort(preds)[::-1]
    # Return the reciprocal rank of the highest-ranked true positive
    for i, idx in enumerate(sorted_idx):
        if true_labels[idx] == 1:
            return 1.0 / (i + 1)
    return 0.0

def compute_hits_at_k(preds, true_labels, k=5):
    # Rank all examples (positive and negative) by predicted score, descending
    sorted_idx = np.argsort(preds)[::-1]
    # Return 1 if any of the top-k predictions is a true positive, else 0
    return int(any(true_labels[idx] == 1 for idx in sorted_idx[:k]))

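# Note: graph_val.edge_index holds the message-passing edges of the validation split;
# RandomLinkSplit stores the held-out edges in edge_label_index and their labels in edge_label.
for epoch in range(50):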
    loss = train()
    val_roc_auc, val_mrr, val_hits_at_5 = evaluate(graph_val.edge_index, graph_val)
    print(f"Epoch: {epoch + 1}, Loss: {loss:.4f}, Val ROC-AUC: {val_roc_auc:.10f}, Val MRR: {val_mrr:.10f}, Val Hits@5: {val_hits_at_5}")


Epoch: 1, Loss: 27.8539, Val ROC-AUC: 0.5858050532, Val MRR: 0.5000000000, Val Hits@5: 1
Epoch: 2, Loss: 24.7141, Val ROC-AUC: 0.6218579157, Val MRR: 1.0000000000, Val Hits@5: 1
Epoch: 3, Loss: 21.7310, Val ROC-AUC: 0.6596783845, Val MRR: 1.0000000000, Val Hits@5: 1
Epoch: 4, Loss: 18.9926, Val ROC-AUC: 0.6976678727, Val MRR: 0.5000000000, Val Hits@5: 1
Epoch: 5, Loss: 16.4977, Val ROC-AUC: 0.7321675456, Val MRR: 1.0000000000, Val Hits@5: 1
Epoch: 6, Loss: 14.3812, Val ROC-AUC: 0.7624780581, Val MRR: 0.5000000000, Val Hits@5: 1
Epoch: 7, Loss: 12.4413, Val ROC-AUC: 0.7866461134, Val MRR: 1.0000000000, Val Hits@5: 1
Epoch: 8, Loss: 10.8368, Val ROC-AUC: 0.8049677219, Val MRR: 0.5000000000, Val Hits@5: 1
Epoch: 9, Loss: 9.3769, Val ROC-AUC: 0.8175267307, Val MRR: 1.0000000000, Val Hits@5: 1
Epoch: 10, Loss: 8.1590, Val ROC-AUC: 0.8279033779, Val MRR: 0.5000000000, Val Hits@5: 1
Epoch: 11, Loss: 7.1001, Val ROC-AUC: 0.8364182953, Val MRR: 0.5000000000, Val Hits@5: 1
Epoch: 12, Loss: 6.250
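
The `compute_mrr` and `compute_hits_at_k` helpers above look only at the single highest-ranked positive, which is consistent with the MRR values of 0.5 and 1.0 in the log. A common alternative ranks every positive edge against the negative edges and averages the result. The sketch below assumes that definition; it is not the notebook's implementation, and the function names and toy scores are hypothetical.

```python
def positive_ranks(pos_scores, neg_scores):
    """Rank of each positive score among the negative scores (1 = best)."""
    return [1 + sum(neg > pos for neg in neg_scores) for pos in pos_scores]


def mean_reciprocal_rank(pos_scores, neg_scores):
    """Average of 1/rank over all positive edges."""
    ranks = positive_ranks(pos_scores, neg_scores)
    return sum(1.0 / rank for rank in ranks) / len(ranks)


def hits_at_k(pos_scores, neg_scores, k):
    """Fraction of positive edges whose rank is at most k."""
    ranks = positive_ranks(pos_scores, neg_scores)
    return sum(rank <= k for rank in ranks) / len(ranks)


pos_scores = [0.9, 0.4, 0.2]   # toy scores for true edges
neg_scores = [0.8, 0.3, 0.1]   # toy scores for sampled non-edges

mrr = mean_reciprocal_rank(pos_scores, neg_scores)
hits2 = hits_at_k(pos_scores, neg_scores, k=2)
print(f"MRR: {mrr:.4f}, Hits@2: {hits2:.4f}")   # MRR: 0.6111, Hits@2: 0.6667
```

Here the three positives rank 1st, 2nd, and 3rd, so the MRR is (1 + 1/2 + 1/3) / 3 and two of the three positives fall within the top 2.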

In [12]:
val_roc_auc, val_mrr, val_hits5 = evaluate(graph_val.edge_index, graph_val)
test_roc_auc, test_mrr, test_hits5 = evaluate(graph_test.edge_index, graph_test)

print(f"Validation ROC-AUC: {val_roc_auc:.10f}")
print(f"Validation MRR: {val_mrr:.10f}")
print(f"Validation Hits@5: {val_hits5:.10f}")

print(f"Test ROC-AUC: {test_roc_auc:.10f}")
print(f"Test MRR: {test_mrr:.10f}")
print(f"Test Hits@5: {test_hits5:.10f}")

OutOfMemoryError: CUDA out of memory. Tried to allocate 496.00 MiB (GPU 0; 8.00 GiB total capacity; 6.84 GiB already allocated; 0 bytes free; 7.25 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
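
One common way to lower peak memory during evaluation is to score the edges in fixed-size chunks rather than all at once. The sketch below shows only the chunking logic in plain Python; `score_in_chunks`, `toy_score`, and the chunk size are hypothetical, and whether this resolves the error above has not been verified in the notebook.

```python
def score_in_chunks(edge_pairs, score_fn, chunk_size):
    """Apply score_fn to successive slices of edge_pairs and concatenate the results."""
    scores = []
    for start in range(0, len(edge_pairs), chunk_size):
        scores.extend(score_fn(edge_pairs[start:start + chunk_size]))
    return scores


def toy_score(pairs):
    """Toy stand-in for a model-based scorer: sums the two node indices of each pair."""
    return [u + v for u, v in pairs]


pairs = [(0, 2), (0, 3), (1, 3), (1, 2), (0, 4)]
print(score_in_chunks(pairs, toy_score, chunk_size=2))   # [2, 3, 4, 3, 4]
```

With tensors, the same pattern would slice the columns of the edge index, score each slice, and move the result to the CPU before handling the next one.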