# Grapher

## Data Description

### UKBB_94traits_release1.{tsv|bed}.gz

This file contains genetic variant data from a study of 94 complex diseases and traits in the UK Biobank. Each row represents a variant, with columns describing its genomic location, alleles, marginal association statistics, and fine-mapping results. It also flags variants in linkage disequilibrium (LD) with variants that fail Hardy-Weinberg equilibrium (HWE) or with common structural variants. The file is most useful for examining the genetic association results and the fine-mapping of these traits and diseases.

Columns:

- Chromosome: hg19 autosomes only
- Start: 0-indexed hg19 start position
- End: 0-indexed hg19 end position
- Variant: unique variant identifier (chr:pos:ref:alt)
- rsid: rsid identifier
- Allele1: reference allele in hg19
- Allele2: alternative allele in hg19
- Minor allele: minor allele in cohort
- Cohort: GWAS cohort
- Model_marginal: type of regression model used
- Method: fine-mapping method used
- Trait: abbreviation for phenotype in genetic association tests
- Region: fine-mapping region in hg19
- MAF: minor allele frequency in cohort
- Beta_marginal: marginal association effect size (effect allele: alternative)
- SE_marginal: standard error on marginal association effect size
- Chisq_marginal: test statistic for marginal association
- PIP: posterior probability of association from fine-mapping
- CS_ID: ID of 95% credible set (-1 if variant not in 95% CS)
- Beta_posterior: posterior expectation of true effect size (effect allele: alternative)
- SD_posterior: posterior standard deviation of true effect size
- LD_HWE: indicator for LD (R^2 > 0.6) with a variant that failed HWE (p < 10^-12) in UK10K LD
- LD_SV: indicator for LD (R^2 > 0.8) with a common structural variant in gnomAD European samples
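
As a rough illustration of how the fine-mapping columns combine, the sketch below keeps only variants that belong to a 95% credible set and exceed a PIP threshold. The lowercase column names mirror the ones used in the code later in this notebook (`rsid`, `trait`, `pip`); `cs_id`, the 0.5 cutoff, and the toy values are assumptions made for this example and are not taken from the study.

```python
import pandas as pd

# Toy rows standing in for the real table; all values are made up.
toy = pd.DataFrame({
    "rsid": ["rs1", "rs2", "rs3", "rs4"],
    "trait": ["Height", "Height", "LDLC", "LDLC"],
    "pip": [0.92, 0.03, 0.61, 0.75],
    "cs_id": [1, -1, 2, -1],  # -1 means "not in a 95% credible set"
})

PIP_THRESHOLD = 0.5  # assumed cutoff for this sketch

in_credible_set = toy["cs_id"] != -1
high_pip = toy["pip"] > PIP_THRESHOLD

# Keep variants that satisfy both conditions.
likely_causal = toy[in_credible_set & high_pip].reset_index(drop=True)
print(likely_causal)
```

Here `rs4` is dropped despite its high PIP because it is not assigned to a credible set; whether that is the desired behavior depends on the analysis.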

### UKBB_94traits_release1_regions.bed.gz

This file comes from the same study but describes the genomic regions used for fine-mapping. Each row represents a genomic region, with columns for the cohort, the trait, and whether each fine-mapping method (FINEMAP, SuSiE) completed successfully. It also includes the identifiers of variants located in these regions. The file is useful for exploring which regions of the genome were fine-mapped and how the fine-mapping runs turned out.

Columns:

- Chromosome: hg19 autosomes only
- Start: 0-indexed hg19 start position
- End: 0-indexed hg19 end position
- Cohort: GWAS cohort
- Trait: abbreviation for phenotype in genetic association tests
- Region: fine-mapping region in hg19
- Variant: unique variant identifier (chr:pos:ref:alt)
- Success_FINEMAP: indicator for successful FINEMAP completion
- Success_SuSiE: indicator for successful SuSiE completion
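
One possible use of the two success indicators is to restrict an analysis to regions where both methods finished. The sketch below assumes lowercase column names and boolean indicators; the region strings and trait labels are invented for illustration, and the real file may encode these fields differently.

```python
import pandas as pd

# Toy regions table; the column names and encodings are assumptions.
regions = pd.DataFrame({
    "region": ["chr1:100-200", "chr1:300-400", "chr2:50-150"],
    "trait": ["Height", "Height", "LDLC"],
    "success_finemap": [True, True, False],
    "success_susie": [True, False, True],
})

# Keep only regions where both FINEMAP and SuSiE completed.
both_ok = regions[regions["success_finemap"] & regions["success_susie"]]
print(both_ok["region"].tolist())
```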

In [1]:
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_fscore_support, roc_auc_score, average_precision_score
from sklearn.preprocessing import LabelEncoder, StandardScaler

import torch
import torch.nn.functional as F
import torch_geometric
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv, GATConv
from torch_geometric.utils import to_undirected, negative_sampling

import networkx as nx
from ogb.io import DatasetSaver
from ogb.linkproppred import LinkPropPredDataset

In [2]:
print(f"PyTorch version: {torch.__version__}")
print(f"PyTorch Geometric version: {torch_geometric.__version__}")

PyTorch version: 2.0.0+cu118
PyTorch Geometric version: 2.3.1


In [3]:
if torch.cuda.is_available():
    device = torch.device("cuda")          # Current CUDA device
    print(f"Using {torch.cuda.get_device_name()} ({device})")
    print(f"CUDA version: {torch.version.cuda}")
    print(f"Number of CUDA devices: {torch.cuda.device_count()}")
else:
    device = torch.device("cpu")           # Fall back to CPU so `device` is always defined
    print("CUDA is not available on this device; using CPU.")

Using NVIDIA GeForce RTX 3060 Ti (cuda)
CUDA version: 11.8
Number of CUDA devices: 1


## Load data

In [4]:
data = pd.read_csv('~/Desktop/geometric-omics/UKBB-fine-mapping/data/UKBB_94traits_release1.csv')
hg19_gene_positions = pd.read_csv("~/Desktop/geometric-omics/UKBB-fine-mapping/data/hg19-gene-positions.csv") 

In [5]:
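# Extract the position from 'chr:pos:ref:alt' and cast it to int so it is treated as a numeric feature
data['position'] = data['variant'].str.split(':').str[1].astype(int)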
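
Since the `variant` identifier has the form `chr:pos:ref:alt`, all four components can be split out in one pass. The following is a minimal sketch on toy identifiers; the column names `chrom`, `ref`, and `alt` are chosen for this example, and the real identifiers may differ in detail (for instance, a `chr` prefix).

```python
import pandas as pd

# Toy identifiers in the chr:pos:ref:alt layout described above.
toy = pd.DataFrame({"variant": ["1:11320:A:G", "22:5000:CT:C"]})

# expand=True returns one column per component.
parts = toy["variant"].str.split(":", expand=True)
parts.columns = ["chrom", "position", "ref", "alt"]

# Positions come out as strings, so cast them before doing arithmetic.
parts["position"] = parts["position"].astype(int)
print(parts)
```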

## Get Genes from hg19_gene_positions

In [23]:
%%time

# First, ensure that both dataframes are sorted by chromosome and start position for efficient interval checking
data = data.sort_values(by=['chromosome', 'start'])
hg19_gene_positions = hg19_gene_positions.sort_values(by=['chrom', 'txStart'])

# Convert 'chrom' and 'chromosome' to string type to ensure successful merge
data['chromosome'] = data['chromosome'].astype(str)
hg19_gene_positions['chrom'] = hg19_gene_positions['chrom'].astype(str)

# Define the leniency (padding in base pairs) as 5% of the interquartile range of the variant positions
start_25th_percentile = data['start'].quantile(0.25)
start_75th_percentile = data['start'].quantile(0.75)
end_25th_percentile = data['end'].quantile(0.25)
end_75th_percentile = data['end'].quantile(0.75)

leniency_start = 0.05 * (start_75th_percentile - start_25th_percentile)
leniency_end = 0.05 * (end_75th_percentile - end_25th_percentile)

# Set the leniency parameter based on the calculated values
leniency = max(leniency_start, leniency_end)

# Initialize an empty list to hold the 'geneSymbol'
gene_symbols = []

# Iterate over each row in 'data'
for idx, row in data.iterrows():
    # Find matching rows in 'hg19_gene_positions'
    mask = (
        (hg19_gene_positions['chrom'] == row['chromosome'])
        & (hg19_gene_positions['txStart'] <= row['end'] + leniency)
        & (hg19_gene_positions['txEnd'] >= row['start'] - leniency)
    )

    matches = hg19_gene_positions[mask]['geneSymbol']
    # Keep only the first matching gene (lowest txStart); append None if no gene overlaps
    if not matches.empty:
        gene_symbols.append(matches.values[0])
    else:
        gene_symbols.append(None)

# Add the 'geneSymbol' column to 'data'
data['geneSymbol'] = gene_symbols

Unexpected exception formatting exception. Falling back to standard exception


Traceback (most recent call last):
  File "C:\Users\falty\AppData\Local\Programs\Python\Python311\Lib\site-packages\IPython\core\magics\execution.py", line 1325, in time
    exec(code, glob, local_ns)
  File "<timed exec>", line 33, in <module>
  File "C:\Users\falty\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\frame.py", line 3752, in __getitem__
    return self._getitem_bool_array(key)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\falty\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\frame.py", line 3810, in _getitem_bool_array
    indexer = key.nonzero()[0]
              ^^^^^^^^^^^^^
KeyboardInterrupt

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\falty\AppData\Local\Programs\Python\Python311\Lib\site-packages\IPython\core\interactiveshell.py", line 2105, in showtraceback
    stb = self.InteractiveTB.structured_traceback(
          ^^^^^^^^^^^^^^^^^^

In [24]:
null_values = data['geneSymbol'].isnull().sum()
print("Number of null values in data['geneSymbol']: ", null_values)
print(len(data['geneSymbol']))
print(null_values / len(data['geneSymbol']))  # fraction of variants without a gene match

Number of null values in data['geneSymbol']:  2856378
5377879
0.5311346722378841


In [22]:
# Select the columns of interest
data_cols = ['start', 'end']
hg19_gene_positions_cols = ['txStart', 'txEnd']

# Get descriptive statistics
stats1 = data[data_cols].describe()
stats2 = hg19_gene_positions[hg19_gene_positions_cols].describe()

# Print the statistics
print(stats1)
print()
print(stats2)

              start           end
count  5.377879e+06  5.377879e+06
mean   7.832148e+07  7.832148e+07
std    5.521563e+07  5.521563e+07
min    1.131900e+04  1.132000e+04
25%    3.709027e+07  3.709027e+07
50%    6.590721e+07  6.590722e+07
75%    1.140176e+08  1.140176e+08
max    2.492164e+08  2.492164e+08

            txStart         txEnd
count  2.901400e+04  2.901400e+04
mean   7.299197e+07  7.302937e+07
std    5.536003e+07  5.536462e+07
min    0.000000e+00  3.680000e+02
25%    3.018590e+07  3.020131e+07
50%    5.779024e+07  5.783122e+07
75%    1.079881e+08  1.080451e+08
max    2.492032e+08  2.492131e+08


## Proposed graph features

Phenotype node features:
- trait
- region
- chromosome
- start
- end


SNP node features:

- rsid
- chromosome
- position
- allele1
- allele2

Edge features:
- directed
- weighted (use `pip` column)
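
To make the proposed layout concrete, the sketch below builds a directed, weighted SNP-to-phenotype edge list from a toy table. It assumes phenotypes take the first block of node indices and SNPs follow, and it uses `pd.factorize` as a stand-in for explicit index dictionaries; the toy values are invented.

```python
import numpy as np
import pandas as pd

# Toy association table; values are made up.
toy = pd.DataFrame({
    "rsid": ["rs1", "rs2", "rs1", "rs3"],
    "trait": ["Height", "Height", "LDLC", "LDLC"],
    "pip": [0.9, 0.1, 0.4, 0.7],
})

# factorize assigns integer codes in order of first appearance.
trait_codes, trait_names = pd.factorize(toy["trait"])
snp_codes, snp_names = pd.factorize(toy["rsid"])

num_traits = len(trait_names)

# Phenotypes occupy indices [0, num_traits); SNP indices are shifted to come after them.
# Row 0 holds the source (SNP) node and row 1 the target (phenotype) node of each edge.
edge_index = np.vstack([snp_codes + num_traits, trait_codes])
edge_weight = toy["pip"].to_numpy(dtype=np.float32)

print(edge_index)
print(edge_weight)
```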

## Create graph

In [8]:
# Create mappings for phenotypes and SNPs to integer indices
phenotypes = data['trait'].unique()
snps = data['rsid'].unique()
phenotype_to_idx = {phenotype: idx for idx, phenotype in enumerate(phenotypes)}
snp_to_idx = {snp: idx + len(phenotypes) for idx, snp in enumerate(snps)}

# Create node feature vectors for phenotypes and SNPs
phenotype_features = data.loc[data['trait'].isin(phenotypes)][['trait', 'region', 'chromosome', 'start', 'end']].drop_duplicates().sort_values(by='trait').reset_index(drop=True)
snp_features = data.loc[data['rsid'].isin(snps)][['rsid', 'chromosome', 'position', 'allele1', 'allele2']].drop_duplicates().sort_values(by='rsid').reset_index(drop=True)

# Create node type labels
node_types = torch.tensor([0] * len(phenotypes) + [1] * len(snps), dtype=torch.long)

# Create edges and edge attributes
edges = data.apply(lambda row: (snp_to_idx[row['rsid']], phenotype_to_idx[row['trait']]), axis=1)
edges = torch.tensor(list(edges), dtype=torch.long).t().contiguous()

edge_attr = data[['pip']].values.astype(np.float32)  # get pip values as edge attributes
edge_attr = torch.tensor(edge_attr, dtype=torch.float)

# Combine the feature vectors
combined_features = pd.concat([phenotype_features, snp_features], ignore_index=True).drop(['trait', 'rsid'], axis=1)

# Replace NaN and empty strings with "N/A" or 0 as appropriate
combined_features.fillna({'chromosome': 'N/A', 'start': 0, 'end': 0, 'position': 0, 'allele1': 'N/A', 'allele2': 'N/A'}, inplace=True)
combined_features.replace({'chromosome': {'': 'N/A'}, 'allele1': {'': 'N/A'}, 'allele2': {'': 'N/A'}}, inplace=True)

# Label encoding for categorical columns
le = LabelEncoder()
combined_features = combined_features.apply(lambda col: le.fit_transform(col.astype(str)) if col.dtype == 'object' else col)

# Standardize numerical features
numerical_columns = ['start', 'end', 'position']
scaler = StandardScaler()
combined_features[numerical_columns] = scaler.fit_transform(combined_features[numerical_columns])

# Create the PyTorch tensor
features = torch.tensor(combined_features.values, dtype=torch.float)

# Create the PyTorch Geometric graph
graph = Data(x=features, edge_index=edges, edge_attr=edge_attr)
graph.node_types = node_types

print(f"Number of nodes: {graph.num_nodes}")
print(f"Number of edges: {graph.num_edges}")
print(f"Node feature dimension: {graph.num_node_features}")
print(f"Node types: {graph.node_types}")

Number of nodes: 5864975
Number of edges: 5377879
Node feature dimension: 7
Node types: tensor([0, 0, 0,  ..., 1, 1, 1])
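
The node count printed here (5,864,975) is larger than the 94 phenotypes plus 2,049,441 SNPs reported by the NetworkX statistics further down. This suggests that the feature matrix holds more than one row per node, probably because `drop_duplicates` is applied to position-dependent columns. The sketch below shows one way to build exactly one feature row per node, in the same order as the index mappings. It is an illustration on toy data, not the fix used for the results in this notebook, and the choice to give phenotypes only a name is an assumption.

```python
import pandas as pd

# Toy association table; values are made up.
toy = pd.DataFrame({
    "rsid": ["rs2", "rs1", "rs2", "rs3"],
    "trait": ["LDLC", "Height", "Height", "LDLC"],
    "chromosome": ["1", "1", "1", "2"],
    "position": [500, 100, 500, 900],
})

# Same ordering that the index dictionaries would be built from.
phenotypes = toy["trait"].unique()
snps = toy["rsid"].unique()

# One row per SNP, reordered so that row i describes snps[i].
snp_features = (
    toy.drop_duplicates(subset="rsid")
       .set_index("rsid")
       .loc[snps, ["chromosome", "position"]]
       .reset_index()
)

# Phenotypes have no single genomic position, so this sketch keeps only their name.
phenotype_features = pd.DataFrame({"trait": phenotypes})

num_nodes = len(phenotype_features) + len(snp_features)
print(num_nodes, len(phenotypes) + len(snps))
```

With this layout the number of feature rows equals the number of distinct nodes, and row order matches the integer indices used in the edge list.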


In [9]:
# Check for NaN values in features
nan_in_features = torch.isnan(graph.x).any().item()
print(f"Are there any NaN values in features? {nan_in_features}")

Are there any NaN values in features? False


In [10]:
def print_graph_stats(graph, phenotypes, snps):
    G = nx.Graph()  # undirected simple graph: repeated SNP-trait pairs collapse into a single edge
    edge_weights = graph.edge_attr.view(-1)  # ensure that edge_attr is a 1D tensor
    for edge, weight in zip(graph.edge_index.t().numpy(), edge_weights):
        G.add_edge(edge[0], edge[1], weight=weight.item())  # use the pip value as the edge weight

    num_nodes = G.number_of_nodes()
    num_phenotypes = len(phenotypes)
    num_snps = len(snps)
    num_edges = G.number_of_edges()
    num_connected_components = nx.number_connected_components(G)
    degrees = [degree for _, degree in G.degree()]  # compute the degree sequence once
    average_degree = np.mean(degrees)
    median_degree = np.median(degrees)
    std_degree = np.std(degrees)
    density = nx.density(G)
    assortativity = nx.degree_assortativity_coefficient(G)
    # Use `attrs` rather than `data` to avoid confusion with the global `data` DataFrame
    edge_weights = [attrs["weight"] for _, _, attrs in G.edges(data=True)]
    average_weight = np.mean(edge_weights)
    median_weight = np.median(edge_weights)
    std_weight = np.std(edge_weights)

    print(f"Number of nodes: {num_nodes}")
    print("Number of SNP nodes:", num_snps)
    print("Number of Phenotype nodes:", num_phenotypes)
    print(f"Number of edges: {num_edges}")
    print(f"Number of connected components: {num_connected_components}")
    print(f"Average degree: {average_degree:.2f}")
    print(f"Median degree: {median_degree}")
    print(f"Standard deviation of degree: {std_degree:.2f}")
    print(f"Density: {density:.10f}")
    print(f"Assortativity: {assortativity:.10f}")
    print(f"Average edge weight: {average_weight:.2f}")
    print(f"Median edge weight: {median_weight}")
    print(f"Standard deviation of edge weight: {std_weight:.2f}")

# Print
print("Graph stats:")
print_graph_stats(graph, phenotypes, snps)
print("\n")

Graph stats:
Number of nodes: 2049535
Number of SNP nodes: 2049441
Number of Phenotype nodes: 94
Number of edges: 3814894
Number of connected components: 1
Average degree: 3.72
Median degree: 1.0
Standard deviation of degree: 393.56
Density: 0.0000018164
Assortativity: -0.0102581519
Average edge weight: 0.01
Median edge weight: 0.0026488471776247025
Standard deviation of edge weight: 0.05
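
The edge count drops from 5,377,879 table rows to 3,814,894 here because `nx.Graph` stores at most one edge per node pair, and a repeated `add_edge` call overwrites the earlier weight. The repeats presumably come from the same variant-trait pair appearing under several methods or regions, although that has not been checked here. A possible way to make the choice explicit is to aggregate before building the graph, for example by keeping the maximum PIP per pair. The `method` column, its labels, and the toy values below are assumptions.

```python
import pandas as pd

# Toy table with one repeated (rsid, trait) pair; values are made up.
toy = pd.DataFrame({
    "rsid": ["rs1", "rs1", "rs2", "rs3"],
    "trait": ["Height", "Height", "Height", "LDLC"],
    "method": ["FINEMAP", "SUSIE", "SUSIE", "FINEMAP"],
    "pip": [0.30, 0.80, 0.05, 0.60],
})

# Keep one edge per (rsid, trait) pair, weighted by the largest PIP seen for it.
unique_edges = toy.groupby(["rsid", "trait"], as_index=False)["pip"].max()
print(unique_edges)
```

Taking the maximum is only one option; the mean, or a single preferred method, may suit the downstream model better.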




In [11]:
# Save the PyTorch Geometric graph
torch.save(graph, 'graph.pth')