# Grapher

## Data description

### UKBB_94traits_release1.{tsv|bed}.gz

This file contains genetic variant data from a study of 94 complex diseases and traits in the UK Biobank. Each row represents a variant, with columns for its genomic location, alleles, marginal association statistics, and fine-mapping results. It also includes indicators for linkage disequilibrium (LD) with variants that fail Hardy-Weinberg equilibrium (HWE) or with common structural variants. The file is the source of the genetic association and fine-mapping results used in the rest of this notebook.

Columns:

1. Chromosome: hg19 autosomes only
2. Start: 0-indexed hg19 start position
3. End: 0-indexed hg19 end position
4. Variant: unique variant identifier (chr:pos:ref:alt)
5. rsid: rsid identifier
6. Allele1: reference allele in hg19
7. Allele2: alternative allele in hg19
8. Minor allele: minor allele in cohort
9. Cohort: GWAS cohort
10. Model_marginal: type of regression model used
11. Method: fine-mapping method used
12. Trait: abbreviation for phenotype in genetic association tests
13. Region: fine-mapping region in hg19
14. MAF: minor allele frequency in cohort
15. Beta_marginal: marginal association effect size (effect allele: alternative)
16. SE_marginal: standard error on marginal association effect size
17. Chisq_marginal: test statistic for marginal association
18. PIP: posterior probability of association from fine-mapping
19. CS_ID: ID of 95% credible set (-1 if variant not in 95% CS)
20. Beta_posterior: posterior expectation of true effect size (effect allele: alternative)
21. SD_posterior: posterior standard deviation of true effect size
22. LD_HWE: indicator for LD (R^2 > 0.6) with a variant that failed HWE (p < 10^-12) in UK10K LD
23. LD_SV: indicator for LD (R^2 > 0.8) with a common structural variant in gnomAD European samples
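
As a rough illustration of how these columns might be used, the sketch below filters a toy table down to variants that belong to a 95% credible set and pass a PIP threshold. The lowercase column names (`pip`, `cs_id`), the helper name `credible_variants`, and the 0.1 threshold are assumptions made for the example, and the rows are invented rather than taken from the release.

```python
import pandas as pd

# Toy rows standing in for the release file; all values are made up.
toy = pd.DataFrame({
    "variant": ["1:1000:A:G", "1:2000:C:T", "2:3000:G:A"],
    "trait": ["Height", "Height", "BMI"],
    "pip": [0.92, 0.03, 0.40],
    "cs_id": [1, -1, 2],
})

def credible_variants(df, min_pip=0.1):
    """Keep rows in a 95% credible set (cs_id != -1) with pip >= min_pip."""
    mask = (df["cs_id"] != -1) & (df["pip"] >= min_pip)
    return df.loc[mask].reset_index(drop=True)

selected = credible_variants(toy)
print(selected[["variant", "trait", "pip"]])
```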

## Load libraries

In [1]:
import os
import pandoc

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_fscore_support, roc_auc_score, average_precision_score
from sklearn.preprocessing import LabelEncoder, StandardScaler

import torch
import torch.nn.functional as F
import torch_geometric
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv, GATConv
from torch_geometric.utils import to_undirected, negative_sampling

import networkx as nx
from ogb.io import DatasetSaver
from ogb.linkproppred import LinkPropPredDataset

from scipy.spatial import cKDTree

## Perform checks

In [2]:
print(f"PyTorch version: {torch.__version__}")
print(f"PyTorch Geometric version: {torch_geometric.__version__}")

PyTorch version: 2.0.0+cu118
PyTorch Geometric version: 2.3.1


In [3]:
if torch.cuda.is_available():
    device = torch.device("cuda")          # Current CUDA device
    print(f"Using {torch.cuda.get_device_name()} ({device})")
    print(f"CUDA version: {torch.version.cuda}")
    print(f"Number of CUDA devices: {torch.cuda.device_count()}")
else:
    device = torch.device("cpu")           # Fall back to CPU so `device` is always defined
    print("CUDA is not available on this device; using CPU.")

Using NVIDIA GeForce RTX 3060 Ti (cuda)
CUDA version: 11.8
Number of CUDA devices: 1


## Load data

In [4]:
data = pd.read_csv('~/Desktop/geometric-omics/UKBB-fine-mapping/data/UKBB_94traits_release1.csv')
hg19_gene_positions = pd.read_csv("~/Desktop/geometric-omics/UKBB-fine-mapping/data/hg19-gene-positions.csv") 

In [5]:
# The variant ID has the form chr:pos:ref:alt; keep the position as an integer
data['position'] = data['variant'].str.split(':').str[1].astype(int)
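
Since the variant identifier has the form `chr:pos:ref:alt`, all four parts can be recovered in a single split. The sketch below does this on two invented identifiers and casts the position to an integer. The check at the end assumes that, for single-nucleotide variants, the 0-indexed `start` equals the position minus one; that relationship is an assumption based on the BED convention and has not been verified against the release file.

```python
import pandas as pd

# Invented identifiers and start coordinates, for illustration only.
toy = pd.DataFrame({
    "variant": ["1:1000:A:G", "22:51234:C:T"],
    "start": [999, 51233],
})

# Split chr:pos:ref:alt into four named columns.
parts = toy["variant"].str.split(":", expand=True)
parts.columns = ["chrom", "position", "ref", "alt"]
parts["position"] = parts["position"].astype(int)

# Assumption: 0-indexed BED start == 1-based position - 1 for SNVs.
is_consistent = toy["start"] + 1 == parts["position"]
print(parts)
print(is_consistent.all())
```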

## Get geneSymbol

In [6]:
%%time

# Sort the dataframes and convert columns to string type
data = data.sort_values(by=['chromosome', 'start'])
data['chromosome'] = data['chromosome'].astype(str)

hg19_gene_positions = hg19_gene_positions.sort_values(by=['chrom', 'txStart'])
hg19_gene_positions['chrom'] = hg19_gene_positions['chrom'].astype(str)

# Define leniency
leniency = 100000

# Convert the 'chrom' column to category type for efficient memory usage
data['chromosome'] = data['chromosome'].astype('category')
hg19_gene_positions['chrom'] = hg19_gene_positions['chrom'].astype('category')

# Create an empty dictionary to store geneSymbols
gene_symbols_dict = {}

# Iterate over unique chromosome
for chromosome in data['chromosome'].cat.categories:
    # Subset data for current chromosome
    data_chromosome = data[data['chromosome'] == chromosome]
    hg19_gene_positions_chromosome = hg19_gene_positions[hg19_gene_positions['chrom'] == chromosome]

    # Build KDTree for efficient nearest-neighbor search
    tree = cKDTree(np.expand_dims(hg19_gene_positions_chromosome['txStart'].values, axis=1))

    # Query the KDTree for nearest neighbors within the leniency
    distances, indices = tree.query(np.expand_dims(data_chromosome['start'].values, axis=1), distance_upper_bound=leniency)

    # Create a list of gene symbols
    gene_symbols = []
    for idx, distance in zip(indices, distances):
        if distance == np.inf:
            gene_symbols.append('N/A')  # or any other default value you want
        else:
            gene_symbols.append(hg19_gene_positions_chromosome.iloc[idx]['geneSymbol'])

    # Assign geneSymbols to data dictionary
    for idx, gene_symbol in zip(data_chromosome.index, gene_symbols):
        gene_symbols_dict[idx] = gene_symbol

# Convert the dictionary to a Series and assign it to a new column in 'data'
data['geneSymbol'] = pd.Series(gene_symbols_dict)

CPU times: total: 1min 51s
Wall time: 3min 20s
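
The per-row lookup above takes a few minutes. A possible alternative, sketched here on toy data, is `pandas.merge_asof` with `direction="nearest"`, which matches each variant to the closest `txStart` on the same chromosome in one vectorized call. The helper name `annotate_nearest_gene` and the toy gene names are hypothetical. Like the loop above, it matches on `txStart` only, and its behavior for distances exactly equal to the leniency may differ from the KD-tree query.

```python
import pandas as pd

# Toy variants and gene positions; all values are invented.
variants = pd.DataFrame({
    "chromosome": ["1", "1", "2"],
    "start": [1_000, 500_000, 2_000],
})
genes = pd.DataFrame({
    "chrom": ["1", "1", "2"],
    "txStart": [1_200, 900_000, 50_000],
    "geneSymbol": ["GENE_A", "GENE_B", "GENE_C"],
})

def annotate_nearest_gene(variants, genes, leniency=100_000):
    """Attach the gene with the nearest txStart on the same chromosome."""
    # merge_asof requires both frames to be sorted by their "on" keys.
    left = variants.sort_values("start")
    right = genes.sort_values("txStart")
    merged = pd.merge_asof(
        left, right,
        left_on="start", right_on="txStart",
        left_by="chromosome", right_by="chrom",
        direction="nearest", tolerance=leniency,
    )
    # Variants with no gene within the leniency get the same default as above.
    merged["geneSymbol"] = merged["geneSymbol"].fillna("N/A")
    return merged

result = annotate_nearest_gene(variants, genes)
print(result[["chromosome", "start", "geneSymbol"]])
```

The result is ordered by `start` rather than by the original row order, so a real pipeline would need to carry an identifier column through the merge to join the annotation back onto `data`.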


In [7]:
null_values = data['geneSymbol'].isnull().sum()
print("Number of null values in data['geneSymbol']: ", null_values)

unique_elements1 = data['geneSymbol'].nunique()
print("Number of unique elements in data['geneSymbol']: ", unique_elements1)

unique_elements2 = hg19_gene_positions['geneSymbol'].nunique()
print("Number of unique elements in hg19_gene_positions['geneSymbol']: ", unique_elements2)

Number of null values in data['geneSymbol']:  0
Number of unique elements in data['geneSymbol']:  24464
Number of unique elements in hg19_gene_positions['geneSymbol']:  29014


## Proposed graph features

All features are taken from columns of the `data` dataframe:

Phenotype node features:
- `trait` column

Gene node features:
- `geneSymbol` column
- `chromosome` column
- `start` column
- `end` column

SNP node features:
- `rsid` column
- `chromosome` column
- `position` column
- `allele1` column
- `allele2` column

Edge features:
- undirected
- weighted (use `pip` column)
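
To make the "undirected, weighted" choice concrete, the sketch below mirrors a small directed edge list so that every edge appears in both directions, and repeats the PIP weight for each reversed copy. The arrays and the helper name `mirror_edges` are made up for illustration.

```python
import numpy as np

# Toy edge list: row 0 holds SNP node indices, row 1 holds phenotype node indices.
edge_index = np.array([[2, 3, 4],
                       [0, 0, 1]])
pip = np.array([0.9, 0.1, 0.4], dtype=np.float32)

def mirror_edges(edge_index, weights):
    """Return edges in both directions, repeating each weight for the reversed copy."""
    both = np.concatenate([edge_index, edge_index[::-1]], axis=1)
    return both, np.concatenate([weights, weights])

undirected_index, undirected_pip = mirror_edges(edge_index, pip)
print(undirected_index)
print(undirected_pip)
```

On tensors, `torch_geometric.utils.to_undirected` (imported above) serves a similar purpose and also accepts edge attributes.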

## Create graph

In [8]:
# Create mappings for phenotypes, genes and SNPs to integer indices
phenotypes = data['trait'].unique()
genes = data['geneSymbol'].unique()
snps = data['rsid'].unique()
phenotype_to_idx = {phenotype: idx for idx, phenotype in enumerate(phenotypes)}
gene_to_idx = {gene: idx + len(phenotypes) for idx, gene in enumerate(genes)}
snp_to_idx = {snp: idx + len(phenotypes) + len(genes) for idx, snp in enumerate(snps)}

# Create node feature vectors for phenotypes, genes and SNPs
phenotype_features = data.loc[data['trait'].isin(phenotypes)][['trait']].drop_duplicates().sort_values(by='trait').reset_index(drop=True)
gene_features = data.loc[data['geneSymbol'].isin(genes)][['geneSymbol', 'chromosome', 'start', 'end']].drop_duplicates().sort_values(by='geneSymbol').reset_index(drop=True)
snp_features = data.loc[data['rsid'].isin(snps)][['rsid', 'chromosome', 'position', 'allele1', 'allele2']].drop_duplicates().sort_values(by='rsid').reset_index(drop=True)

# Create node type labels
node_types = torch.tensor([0] * len(phenotypes) + [1] * len(genes) + [2] * len(snps), dtype=torch.long)

# Create edges (SNP -> phenotype) and edge attributes
src = data['rsid'].map(snp_to_idx).to_numpy()
dst = data['trait'].map(phenotype_to_idx).to_numpy()
edges = torch.tensor(np.vstack([src, dst]), dtype=torch.long)

edge_attr = data[['pip']].values.astype(np.float32)  # get pip values as edge attributes
edge_attr = torch.tensor(edge_attr, dtype=torch.float)

# Combine the feature vectors
combined_features = pd.concat([phenotype_features, gene_features, snp_features], ignore_index=True).drop(['trait', 'geneSymbol', 'rsid'], axis=1)

# Fill NaNs with a per-column default
nan_replacements = {'chromosome': 'N/A', 'start': 0, 'end': 0, 'position': 0, 'allele1': 'N/A', 'allele2': 'N/A'}
for col, replacement in nan_replacements.items():
    if col in combined_features:
        # Categorical columns must know the replacement value before it can be filled in
        if combined_features[col].dtype.name == 'category' and replacement not in combined_features[col].cat.categories:
            combined_features[col] = combined_features[col].cat.add_categories([replacement])
        # Assign the result back; in-place fillna on a column selection is unreliable in recent pandas
        combined_features[col] = combined_features[col].fillna(replacement)

# Label encoding for categorical columns
le = LabelEncoder()
combined_features = combined_features.apply(lambda col: le.fit_transform(col.astype(str)) if col.dtype == 'object' else col)

# Standardize numerical features
scaler = StandardScaler()
numerical_columns = ['start', 'end', 'position']
combined_features[numerical_columns] = scaler.fit_transform(combined_features[numerical_columns])

# Integer-encode categorical features
categorical_columns = ['chromosome', 'allele1', 'allele2']
for col in categorical_columns:
    combined_features[col] = combined_features[col].astype('category').cat.codes

features = torch.tensor(combined_features.values, dtype=torch.float)


# Create the PyTorch Geometric graph
graph = Data(x=features, edge_index=edges, edge_attr=edge_attr)
graph.node_types = node_types

print(f"Number of nodes: {graph.num_nodes}")
print(f"Number of edges: {graph.num_edges}")
print(f"Node feature dimension: {graph.num_node_features}")
print(f"Node types: {graph.node_types}")

Number of nodes: 4099617
Number of edges: 5377879
Node feature dimension: 6
Node types: tensor([0, 0, 0,  ..., 2, 2, 2])
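
The node count printed above (4,099,617) is larger than the number of unique phenotypes, genes, and SNPs reported in the graph stats below (94 + 24,464 + 2,049,441 = 2,073,999), which suggests that the feature table holds more than one row for some nodes. One way to keep exactly one feature row per node, in the same order as the index mappings, is sketched below on toy data. The helper name `aligned_node_table` is hypothetical, and keeping the first row seen for each key is an assumption about which duplicate should win.

```python
import pandas as pd

# Toy association table in which rs2 appears for two traits.
toy = pd.DataFrame({
    "rsid": ["rs2", "rs1", "rs2", "rs3"],
    "trait": ["Height", "Height", "BMI", "BMI"],
    "position": [200, 100, 200, 300],
})

def aligned_node_table(df, key, feature_cols, ordered_keys):
    """Return one feature row per key, ordered to match ordered_keys."""
    table = df[[key] + feature_cols].drop_duplicates(subset=key).set_index(key)
    return table.loc[ordered_keys].reset_index()

snps = toy["rsid"].unique()                          # order of first appearance
snp_to_idx = {snp: i for i, snp in enumerate(snps)}
snp_table = aligned_node_table(toy, "rsid", ["position"], snps)

# Row i of snp_table now describes the node whose index is i.
print(snp_table)
```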


## Graph stats

In [9]:
# Check for NaN values in features
nan_in_features = torch.isnan(graph.x).any().item()
print(f"Are there any NaN values in features? {nan_in_features}")

Are there any NaN values in features? False


In [10]:
def print_graph_stats(graph, phenotypes, genes, snps):
    # Build a NetworkX graph from the edge list; only nodes that appear in an edge are added
    G = nx.Graph()
    edge_weights = graph.edge_attr.view(-1)  # ensure that edge_attr is a 1D tensor
    for edge, weight in zip(graph.edge_index.t().numpy(), edge_weights):
        G.add_edge(edge[0], edge[1], weight=weight.item())  # use the pip value as the edge weight

    num_nodes = G.number_of_nodes()
    num_phenotypes = len(phenotypes)
    num_genes = len(genes)
    num_snps = len(snps)
    num_edges = G.number_of_edges()
    num_connected_components = nx.number_connected_components(G)
    degrees = [degree for _, degree in G.degree()]  # compute the degree list once
    average_degree = np.mean(degrees)
    median_degree = np.median(degrees)
    std_degree = np.std(degrees)
    density = nx.density(G)
    assortativity = nx.degree_assortativity_coefficient(G)
    edge_weights = [attrs["weight"] for _, _, attrs in G.edges(data=True)]
    average_weight = np.mean(edge_weights)
    median_weight = np.median(edge_weights)
    std_weight = np.std(edge_weights)

    print(f"Number of nodes: {num_nodes}")
    print("Number of SNP nodes:", num_snps)
    print("Number of Gene nodes:", num_genes)
    print("Number of Phenotype nodes:", num_phenotypes)
    print(f"Number of edges: {num_edges}")
    print(f"Number of connected components: {num_connected_components}")
    print(f"Average degree: {average_degree:.2f}")
    print(f"Median degree: {median_degree}")
    print(f"Standard deviation of degree: {std_degree:.2f}")
    print(f"Density: {density:.10f}")
    print(f"Assortativity: {assortativity:.10f}")
    print(f"Average edge weight: {average_weight:.2f}")
    print(f"Median edge weight: {median_weight}")
    print(f"Standard deviation of edge weight: {std_weight:.2f}")

# Print
print("Graph stats:")
print_graph_stats(graph, phenotypes, genes, snps)
print("\n")

Graph stats:
Number of nodes: 2049535
Number of SNP nodes: 2049441
Number of Gene nodes: 24464
Number of Phenotype nodes: 94
Number of edges: 3814894
Number of connected components: 1
Average degree: 3.72
Median degree: 1.0
Standard deviation of degree: 393.56
Density: 0.0000018164
Assortativity: -0.0102581519
Average edge weight: 0.01
Median edge weight: 0.0026488471776247025
Standard deviation of edge weight: 0.05
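
The reported average degree and density follow directly from the node and edge counts: 2E/N = 2 × 3,814,894 / 2,049,535 ≈ 3.72 and 2E/(N(N−1)) ≈ 1.8164 × 10⁻⁶. The edge count here is lower than the 5,377,879 edges in the PyTorch Geometric graph because `nx.Graph` keeps at most one edge per node pair, so repeated (SNP, trait) rows collapse into one edge. The sketch below reproduces these quantities with NumPy on a toy edge list that contains one repeated pair. The helper name `simple_graph_stats` is hypothetical, and the sketch assumes there are no self-loops.

```python
import numpy as np

# Toy edge list with one repeated (source, target) pair: (2, 0) appears twice.
edge_index = np.array([[2, 3, 2, 4],
                       [0, 0, 0, 1]])

def simple_graph_stats(edge_index):
    """Count nodes, unique undirected edges, mean degree, and density."""
    # Sort each pair so (u, v) and (v, u) count as the same edge, then deduplicate.
    pairs = np.unique(np.sort(edge_index.T, axis=1), axis=0)
    # Each appearance of a node in a unique pair adds one to its degree.
    nodes, degree = np.unique(pairs, return_counts=True)
    n, e = len(nodes), len(pairs)
    density = 2 * e / (n * (n - 1))
    return n, e, degree.mean(), density

n, e, mean_degree, density = simple_graph_stats(edge_index)
print(n, e, mean_degree, density)
```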




## Save graph

In [12]:
# Save the PyTorch Geometric graph
torch.save(graph, 'graph.pth')