## README 
Website: https://www.eqtlgen.org/phase1.html

Paper: https://www.nature.com/articles/s41588-021-00913-z

This Jupyter notebook README covers the cis-eQTL and trans-eQTL results from the eQTLGen project. The dataset includes files of statistically significant associations for both the cis-eQTL and trans-eQTL analyses.

Files contain various columns with information on P-value, SNP rs ID, SNP chromosome, SNP position, assessed and not assessed alleles, Z-score, ENSG and HGNC names of eQTL genes, gene chromosome, gene position, number of cohorts, number of samples, false discovery rate, and Bonferroni-corrected P-value.

The cis-eQTL analysis includes 19,250 genes expressed in blood, with SNP-gene combinations within 1 Mb of the gene center and tested in at least 2 cohorts. The trans-eQTL analysis tests 19,960 genes expressed in blood against 10,317 trait-associated SNPs drawn from the GWAS Catalog, Immunobase, and the Astle et al. study. Trans-eQTL combinations have a SNP-gene distance of >5 Mb and were tested in at least 2 cohorts.
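The distance criteria above can be illustrated with a small sketch. The function name `classify_pair`, the inclusive 1 Mb boundary, and the choice to label SNP-gene pairs on different chromosomes as trans are assumptions made for illustration; they are not taken from the eQTLGen pipeline. The first example call uses the positions of the first row of the cis table shown later in this notebook.

```python
CIS_WINDOW_BP = 1_000_000   # "within 1 Mb of the gene center"
TRANS_MIN_BP = 5_000_000    # "distance of >5 Mb"

def classify_pair(snp_chr, snp_pos, gene_chr, gene_pos):
    """Label a SNP-gene pair as 'cis', 'trans', or 'untested' by distance."""
    if snp_chr != gene_chr:
        # Assumption: pairs on different chromosomes are treated as trans.
        return "trans"
    distance = abs(snp_pos - gene_pos)
    if distance <= CIS_WINDOW_BP:
        return "cis"
    if distance > TRANS_MIN_BP:
        return "trans"
    # Between 1 Mb and 5 Mb: covered by neither analysis as described above.
    return "untested"

# rs12230244 (chr12:10117369) and CLEC12A (chr12:10126104) are 8,735 bp apart.
print(classify_pair(12, 10117369, 12, 10126104))
print(classify_pair(1, 1_000_000, 1, 7_000_000))
```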

The FDR calculation uses a pruned set of SNPs for trans-eQTL mapping and permutation-based FDR calculation. Crossmapping filters are applied to identify and remove potential artifacts in trans-eQTL results, recalculating the FDR afterward. Note that the full results file has not been filtered for cross-mapping effects, which may lead to artifacts in the data.

The code below demonstrates the process of creating a graph-based representation of the combined cis and trans-eQTL data using PyTorch Geometric. This process can be broken down into several steps:

1. Combine cis and trans dataframes: The cis and trans dataframes are concatenated into a single dataframe named 'data', so that both cis-eQTL and trans-eQTL results can be processed from one data structure.

2. Create mappings for genes and SNPs: To represent the genes and SNPs as nodes in the graph, integer indices are assigned to each unique gene and SNP. This is done using dictionaries called 'gene_to_idx' and 'snp_to_idx', which map the gene and SNP identifiers to their corresponding integer indices.

3. Generate node type labels: Node type labels are created using PyTorch tensors, distinguishing between gene nodes (assigned a label of 0) and SNP nodes (assigned a label of 1). This differentiation is useful for various graph-based analyses and machine learning tasks that require knowledge of node types.

4. Create edges based on gene and SNP indices: Edges in the graph represent the relationships between genes and SNPs. These edges are created by iterating over the 'data' dataframe and extracting the corresponding gene and SNP indices from the previously created mappings. The edges are then represented as a PyTorch tensor with a long data type.

5. Convert edges to undirected: Since the relationships between genes and SNPs are undirected, the edges in the graph should also be undirected. This is achieved using the 'to_undirected()' function from PyTorch Geometric, which ensures that the graph correctly represents the underlying biology.

6. Create a PyTorch Geometric graph: Finally, the graph is created using the PyTorch Geometric 'Data' class. The node types and edge indices are used as inputs to instantiate the graph object, which can then be utilized for further analysis and visualization.
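The six steps above can be traced on a toy example. This is a minimal sketch in plain Python: the gene and SNP identifiers are made up, and plain lists stand in for the pandas dataframes, the PyTorch tensors, and the PyTorch Geometric `Data` object used in the notebook.

```python
# Hypothetical (gene, SNP) association rows, not real eQTLGen data.
cis_rows = [("ENSG_A", "rs1"), ("ENSG_A", "rs2"), ("ENSG_B", "rs2")]
trans_rows = [("ENSG_C", "rs1")]

# Step 1: combine the cis and trans associations.
data = cis_rows + trans_rows

# Step 2: map identifiers to integer indices.
# Genes take 0..G-1 and SNPs take G..G+S-1, so the two ranges never overlap.
genes = list(dict.fromkeys(gene for gene, _ in data))
snps = list(dict.fromkeys(snp for _, snp in data))
gene_to_idx = {gene: idx for idx, gene in enumerate(genes)}
snp_to_idx = {snp: idx + len(genes) for idx, snp in enumerate(snps)}

# Step 3: node type labels (0 = gene, 1 = SNP), ordered by node index.
node_types = [0] * len(genes) + [1] * len(snps)

# Step 4: one (gene index, SNP index) edge per association.
edges = [(gene_to_idx[gene], snp_to_idx[snp]) for gene, snp in data]

# Step 5: make the edges undirected by storing both directions once each.
undirected = sorted(set(edges) | {(v, u) for u, v in edges})

# Step 6: lay the edges out as two rows (sources, targets),
# the same shape that an edge_index tensor has.
edge_index = [[u for u, _ in undirected], [v for _, v in undirected]]

print(node_types)
print(edge_index)
```

Note that after step 5 every association appears twice, once per direction, so the number of stored edges is double the number of unique gene-SNP pairs.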

The resulting 'graph' object is a PyTorch Geometric representation of the combined cis and trans-eQTL data. The prediction task is to predict new association edges given the training edges, with the task type being link prediction. Below are a few important graph statistics:

- Number of nodes: 3681495
- Number of SNP nodes: 3664025
- Number of Gene nodes: 17470
- Number of edges: 10567450
- Number of connected components: 424
- Average degree: 5.74
- Median degree: 2.0
- Standard deviation of degree: 69.81
- Density: 0.0000015594
- Assortativity: -0.2267915607
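As a sanity check, the average degree and density follow directly from the node and edge counts listed above, using the standard formulas for an undirected simple graph (average degree 2E/N, density 2E/(N(N-1))). The short sketch below recomputes both. It also shows why the graph construction cell later in this notebook reports 21134900 edges: the undirected edge index stores each edge in both directions.

```python
num_nodes = 3681495   # genes + SNPs
num_edges = 10567450  # unique undirected edges

# Each undirected edge contributes to the degree of two nodes.
average_degree = 2 * num_edges / num_nodes

# Fraction of all possible node pairs that are connected.
density = 2 * num_edges / (num_nodes * (num_nodes - 1))

# Number of columns in an edge index that stores both directions.
directed_entries = 2 * num_edges

print(f"Average degree: {average_degree:.2f}")
print(f"Density: {density:.10f}")
print(f"Edge index columns: {directed_entries}")
```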

The data are split at random while maintaining an equal proportion of cis- and trans-associations across the splits.
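One way to obtain such a split is to shuffle and partition the cis and trans edges separately, so that every split keeps the same cis-to-trans ratio. The sketch below is illustrative only: the function name `stratified_split`, the 80/10/10 fractions, and the seed are assumptions, not values taken from this notebook.

```python
import random

def stratified_split(edges, labels, fractions=(0.8, 0.1, 0.1), seed=0):
    """Randomly split edges into train/valid/test within each label group."""
    rng = random.Random(seed)
    splits = {"train": [], "valid": [], "test": []}
    for label in sorted(set(labels)):
        group = [(edge, lab) for edge, lab in zip(edges, labels) if lab == label]
        rng.shuffle(group)
        n_train = round(len(group) * fractions[0])
        n_valid = round(len(group) * fractions[1])
        splits["train"] += group[:n_train]
        splits["valid"] += group[n_train:n_train + n_valid]
        splits["test"] += group[n_train + n_valid:]  # remainder goes to test
    return splits

# Hypothetical toy input: 100 cis edges and 20 trans edges.
edges = [(0, i) for i in range(100)] + [(1, i) for i in range(20)]
labels = ["cis"] * 100 + ["trans"] * 20

splits = stratified_split(edges, labels)
for name, items in splits.items():
    n_trans = sum(1 for _, lab in items if lab == "trans")
    print(name, len(items), "edges,", n_trans, "trans")
```

With these toy inputs each split holds trans edges at the same 1:5 ratio to cis edges as the full set.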

## Data Setup 

### Libraries

In [1]:
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_fscore_support, roc_auc_score, average_precision_score

import torch
import torch.nn.functional as F
import torch_geometric
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv, GATConv
from torch_geometric.utils import to_undirected, negative_sampling
import torch_geometric.utils as pyg_utils

import networkx as nx
from ogb.io import DatasetSaver
from ogb.linkproppred import LinkPropPredDataset

In [2]:
print(f"PyTorch version: {torch.__version__}")
print(f"PyTorch Geometric version: {torch_geometric.__version__}")

PyTorch version: 2.0.0+cu118
PyTorch Geometric version: 2.3.1


### Load data

In [3]:
# Read files
cis = pd.read_csv("sig-cis.csv")
trans = pd.read_csv("sig-trans.csv")

In [4]:
# Print cis
cis.head()

Unnamed: 0,Pvalue,SNP,SNPChr,SNPPos,AssessedAllele,OtherAllele,Zscore,Gene,GeneSymbol,GeneChr,GenePos,NrCohorts,NrSamples,FDR,BonferroniP
0,3.2717e-310,rs12230244,12,10117369,T,A,200.7534,ENSG00000172322,CLEC12A,12,10126104,34,30596,0.0,4.1662e-302
1,3.2717e-310,rs12229020,12,10117683,G,C,200.6568,ENSG00000172322,CLEC12A,12,10126104,34,30596,0.0,4.1662e-302
2,3.2717e-310,rs61913527,12,10116198,T,C,200.2654,ENSG00000172322,CLEC12A,12,10126104,34,30598,0.0,4.1662e-302
3,3.2717e-310,rs2594103,12,10115428,T,C,200.042,ENSG00000172322,CLEC12A,12,10126104,34,30598,0.0,4.1662e-302
4,3.2717e-310,rs12231833,12,10118428,A,G,199.9508,ENSG00000172322,CLEC12A,12,10126104,34,30592,0.0,4.1662e-302


In [5]:
# Print the column names of trans
trans.columns

Index(['Pvalue', 'SNP', 'SNPChr', 'SNPPos', 'AssessedAllele', 'OtherAllele',
       'Zscore', 'Gene', 'GeneSymbol', 'GeneChr', 'GenePos', 'NrCohorts',
       'NrSamples', 'FDR', 'BonferroniP'],
      dtype='object')

### Create graphs

In [6]:
# For the cis dataframe
cis_genes = cis['Gene'].unique()
cis_snps = cis['SNP'].unique()
cis_gene_to_idx = {gene: idx for idx, gene in enumerate(cis_genes)}
cis_snp_to_idx = {snp: idx + len(cis_genes) for idx, snp in enumerate(cis_snps)}

cis_node_types = torch.tensor([0] * len(cis_genes) + [1] * len(cis_snps), dtype=torch.long)

cis_edges = cis.apply(lambda row: (cis_gene_to_idx[row['Gene']], cis_snp_to_idx[row['SNP']]), axis=1)
cis_edges = torch.tensor(list(cis_edges), dtype=torch.long).t().contiguous()

cis_edges = to_undirected(cis_edges)

cis_graph = Data(x=cis_node_types.view(-1, 1), edge_index=cis_edges)



# For the trans dataframe
trans_genes = trans['Gene'].unique()
trans_snps = trans['SNP'].unique()
trans_gene_to_idx = {gene: idx for idx, gene in enumerate(trans_genes)}
trans_snp_to_idx = {snp: idx + len(trans_genes) for idx, snp in enumerate(trans_snps)}

trans_node_types = torch.tensor([0] * len(trans_genes) + [1] * len(trans_snps), dtype=torch.long)

trans_edges = trans.apply(lambda row: (trans_gene_to_idx[row['Gene']], trans_snp_to_idx[row['SNP']]), axis=1)
trans_edges = torch.tensor(list(trans_edges), dtype=torch.long).t().contiguous()

trans_edges = to_undirected(trans_edges)

trans_graph = Data(x=trans_node_types.view(-1, 1), edge_index=trans_edges)



# Combine the cis and trans dataframes
data = pd.concat([cis, trans], ignore_index=True)

# Select the 20% of associations with the smallest Bonferroni-corrected P-values
def filter_lowest_20_percent(df):
    threshold = np.percentile(df['BonferroniP'], 20)
    return df[df['BonferroniP'] <= threshold]

# Filter cis and trans dataframes
cis_filtered = filter_lowest_20_percent(cis)
trans_filtered = filter_lowest_20_percent(trans)

# Remove filtered data from original dataframes
cis_remaining = cis[~cis.index.isin(cis_filtered.index)]
trans_remaining = trans[~trans.index.isin(trans_filtered.index)]

# Test that all values in 'BonferroniP' column of cis_filtered are lower than cis_remaining
assert all(cis_filtered['BonferroniP'] <= cis_remaining['BonferroniP'].min()), "Values in cis_filtered are not all lower than cis_remaining"

# Test that all values in 'BonferroniP' column of trans_filtered are lower than trans_remaining
assert all(trans_filtered['BonferroniP'] <= trans_remaining['BonferroniP'].min()), "Values in trans_filtered are not all lower than trans_remaining"

# Create mappings for genes and SNPs to integer indices
genes = data['Gene'].unique()
snps = data['SNP'].unique()
gene_to_idx = {gene: idx for idx, gene in enumerate(genes)}
snp_to_idx = {snp: idx + len(genes) for idx, snp in enumerate(snps)}

# Create node type labels
node_types = torch.tensor([0] * len(genes) + [1] * len(snps), dtype=torch.long)

# Create edges
edges = data.apply(lambda row: (gene_to_idx[row['Gene']], snp_to_idx[row['SNP']]), axis=1)
edges = torch.tensor(list(edges), dtype=torch.long).t().contiguous()

# Convert edges to undirected
edges = to_undirected(edges)

# Create the PyTorch Geometric graph
graph = Data(x=node_types.view(-1, 1), edge_index=edges)

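# Print the number of nodes and edges in the graph.
# Note: after to_undirected(), edge_index stores every edge in both directions,
# so edges.size(1) is twice the number of unique undirected edges.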
print("Graph created with", len(genes) + len(snps), "nodes and", edges.size(1), "edges.")

def compute_descriptive_stats(cis_filtered, cis_remaining, trans_filtered, trans_remaining):
    # Summarize the Bonferroni-corrected P-values of each subset
    subsets = {
        "Cis Filtered": cis_filtered,
        "Cis Remaining": cis_remaining,
        "Trans Filtered": trans_filtered,
        "Trans Remaining": trans_remaining,
    }
    for name, df in subsets.items():
        p_values = df['BonferroniP']
        # Print mean, median, and standard deviation in scientific notation
        print("%s: mean=%.2e, median=%.2e, std=%.2e" % (name, p_values.mean(), p_values.median(), p_values.std()))



compute_descriptive_stats(cis_filtered, cis_remaining, trans_filtered, trans_remaining)

Graph created with 3681495 nodes and 21134900 edges.
Cis Filtered: mean=4.67e-40, median=4.74e-93, std=3.75e-39
Cis Remaining: mean=4.08e-01, median=9.70e-03, std=4.77e-01
Trans Filtered: mean=1.44e-05, median=3.07e-10, std=4.02e-05
Trans Remaining: mean=8.16e-01, median=1.00e+00, std=3.63e-01


### Graph stats

In [7]:
def print_graph_stats(graph, genes, snps):
    G = nx.Graph()
    for edge in graph.edge_index.t().numpy():
        G.add_edge(edge[0], edge[1])

    num_nodes = G.number_of_nodes()
    num_genes = len(genes)
    num_snps = len(snps)
    num_edges = G.number_of_edges()
    num_connected_components = nx.number_connected_components(G)
    average_degree = np.mean([degree for _, degree in G.degree()])
    median_degree = np.median([degree for _, degree in G.degree()])
    std_degree = np.std([degree for _, degree in G.degree()])
    density = nx.density(G)
    assortativity = nx.degree_assortativity_coefficient(G)

    print(f"Number of nodes: {num_nodes}")
    print("Number of SNP nodes:", num_snps)
    print("Number of Gene nodes:", num_genes)
    print(f"Number of edges: {num_edges}")
    print(f"Number of connected components: {num_connected_components}")
    print(f"Average degree: {average_degree:.2f}")
    print(f"Median degree: {median_degree}")
    print(f"Standard deviation of degree: {std_degree:.2f}")
    print(f"Density: {density:.10f}")
    print(f"Assortativity: {assortativity:.10f}")


# For the cis-graph
print("Cis-graph stats:")
print_graph_stats(cis_graph, cis_genes, cis_snps)
print("\n")

# For the trans-graph
print("Trans-graph stats:")
print_graph_stats(trans_graph, trans_genes, trans_snps)
print("\n")

# For the combined graph
print("Combined graph stats:")
print_graph_stats(graph, genes, snps)

Cis-graph stats:
Number of nodes: 3680443
Number of SNP nodes: 3663456
Number of Gene nodes: 16987
Number of edges: 10507664
Number of connected components: 1226
Average degree: 5.71
Median degree: 2.0
Standard deviation of degree: 69.64
Density: 0.0000015514
Assortativity: -0.2263568608


Trans-graph stats:
Number of nodes: 10151
Number of SNP nodes: 3853
Number of Gene nodes: 6298
Number of edges: 59786
Number of connected components: 500
Average degree: 11.78
Median degree: 3.0
Standard deviation of degree: 29.91
Density: 0.0011605253
Assortativity: -0.2050825845


Combined graph stats:
Number of nodes: 3681495
Number of SNP nodes: 3664025
Number of Gene nodes: 17470
Number of edges: 10567450
Number of connected components: 424
Average degree: 5.74
Median degree: 2.0
Standard deviation of degree: 69.81
Density: 0.0000015594
Assortativity: -0.2267915607


In [13]:
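from collections import defaultdict

data = pd.concat([cis, trans], ignore_index=True)

def dfs(start, graph, visited):
    # Iterative depth-first search; a recursive version would exceed
    # Python's recursion limit on a component with millions of nodes
    visited.add(start)
    stack = [start]
    size = 0
    while stack:
        node = stack.pop()
        size += 1
        for neighbor in graph[node]:
            if neighbor not in visited:
                visited.add(neighbor)
                stack.append(neighbor)
    return size

def largest_subgraph(df, source_col='Gene', target_col='SNP'):
    # The dataframe has no 'source'/'target' columns; each row links a gene to a SNP
    graph = defaultdict(list)
    for source, target in zip(df[source_col], df[target_col]):
        graph[source].append(target)
        graph[target].append(source)

    visited = set()
    max_size = 0

    # Return the node count of the largest connected component
    for node in graph:
        if node not in visited:
            max_size = max(max_size, dfs(node, graph, visited))

    return max_size

subgraph = largest_subgraph(data)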

In [11]:
type(data)

pandas.core.frame.DataFrame