# Data Spec

`data` Pandas DataFrame:

- `id`: This column represents the id of the variant in the following format: #chrom:pos:ref:alt (string).
- `#chrom`: This column represents the chromosome number where the genetic variant is located (int).
- `pos`: This is the position of the genetic variant on the chromosome (int: 1-200,000).
- `ref`: This column represents the reference allele (or variant) at the genomic position.
- `alt`: This is the alternate allele observed at this position.
- `rsids`: This stands for reference SNP cluster ID. It's a unique identifier for each variant used in the dbSNP database.
- `nearest_genes`: This column represents the gene which is nearest to the variant (string).
- `pval`: This represents the p-value, which is a statistical measure for the strength of evidence against the null hypothesis.
- `mlogp`: This represents the minus log of the p-value, commonly used in genomic studies.
- `beta`: The beta coefficient represents the effect size of the variant.
- `sebeta`: This is the standard error of the beta coefficient.
- `af_alt`: This is the allele frequency of the alternate variant in the general population (float: 0-1.
- `af_alt_cases`: This is the allele frequency of the alternate variant in the cases group (float: 0-1).
- `af_alt_controls`: This is the allele frequency of the alternate variant in the control group (float: 0-1).
- `finemapped`: This column represents whether the variant is included in the post-finemapped dataset (1) or not (0) (int).
- `trait`: This column represents the trait associated with the variant. In this dataset, it is the response to the drug paracetamol and NSAIDs.


### Nodes and Their Features

There is one type of node: SNP nodes.

- **SNP Nodes**: Each SNP Node is characterized by various features, including `id`, `nearest_genes`, `#chrom`, `pos`, `ref`, `alt`, `mlogp`, `beta`, `sebeta`,  `af_alt`, `af_alt_cases`, and `af_alt_controls` columns.

### Edges, Their Features, and Labels

Edges represent relationships between SNP nodes in the graph:

- For each pair of SNPs (row1 and row2) that exist on the same chromosome (`#chrom`), an edge is created if the absolute difference between their positions (`pos`) is less than or equal to 1,000,000 and greater than 1 (no loops). Create edges between all pairs of SNPs within the 1,000,000 base distance threshold. The edge weight is determined by the following formula:
     
```
weights = 1 * e^(-ln(2) / 100_000 * pos_diff_abs)
```

# Data Wrangling 

## Load libraries

In [1]:
import sys
import os
import random
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import sklearn
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_fscore_support, roc_auc_score, average_precision_score
from sklearn.preprocessing import RobustScaler, LabelEncoder, StandardScaler, OrdinalEncoder, OneHotEncoder, MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
import torch
import torch.nn.functional as F
import torch_geometric
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv, GATConv
from torch_geometric.utils import to_undirected, negative_sampling
import networkx as nx
from scipy.spatial import cKDTree
from scipy.special import expit
from typing import List, Dict
import time
import cProfile
import pstats
import io
import category_encoders as ce
import torch
import torch.nn.functional as F
from torch_geometric.nn import SAGEConv
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score
import copy
from torch_geometric.transforms import RandomNodeSplit
from collections import Counter
from category_encoders import BinaryEncoder
import cProfile
import pstats
import io



# Print versions of imported libraries
print(f"Python version: {sys.version}")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")
print(f"Matplotlib version: {matplotlib.__version__}")
print(f"Scikit-learn version: {sklearn.__version__}")
print(f"Torch version: {torch.__version__}")
print(f"Torch Geometric version: {torch_geometric.__version__}")
print(f"NetworkX version: {nx.__version__}")

if torch.cuda.is_available():
    device = torch.device("cuda")          # Current CUDA device
    print(f"Using {torch.cuda.get_device_name()} ({device})")
    print(f"CUDA version: {torch.version.cuda}")
    print(f"Number of CUDA devices: {torch.cuda.device_count()}")
else:
    print("CUDA is not available on this device.")

Python version: 3.11.4 (tags/v3.11.4:d2340ef, Jun  7 2023, 05:45:37) [MSC v.1934 64 bit (AMD64)]
NumPy version: 1.24.1
Pandas version: 1.5.3
Matplotlib version: 3.7.1
Scikit-learn version: 1.3.0
Torch version: 2.0.1+cu117
Torch Geometric version: 2.3.1
NetworkX version: 3.0
Using NVIDIA RTX A6000 (cuda)
CUDA version: 11.7
Number of CUDA devices: 2


## Load data

In [2]:
dtypes = {
    'id': 'string',
    '#chrom': 'int64',
    'pos': 'int64',
    'ref': 'string',
    'alt': 'string',
    'rsids': 'string',
    'nearest_genes': 'string',
    'pval': 'float64',
    'mlogp': 'float64',
    'beta': 'float64',
    'sebeta': 'float64',
    'af_alt': 'float64',
    'af_alt_cases': 'float64',
    'af_alt_controls': 'float64',
    'finemapped': 'int64',
    'causal': 'int64',
    'trait': 'string'
}

data = pd.read_csv('gwas-fine-causal.csv', dtype=dtypes)

# Assert column names
expected_columns = ['#chrom', 'pos', 'ref', 'alt', 'rsids', 'nearest_genes', 'pval', 'mlogp', 'beta',
                    'sebeta', 'af_alt', 'af_alt_cases', 'af_alt_controls', 'finemapped',
                    'id', 'causal', 'trait']
assert set(data.columns) == set(expected_columns), "Unexpected columns in the data DataFrame."

# Assert data types
expected_dtypes = {
    'id': 'string',
    '#chrom': 'int64',
    'pos': 'int64',
    'ref': 'string',
    'alt': 'string',
    'rsids': 'string',
    'nearest_genes': 'string',
    'pval': 'float64',
    'mlogp': 'float64',
    'beta': 'float64',
    'sebeta': 'float64',
    'af_alt': 'float64',
    'af_alt_cases': 'float64',
    'af_alt_controls': 'float64',
    'finemapped': 'int64',
    'causal': 'int64',
    'trait': 'string'
}

for col, expected_dtype in expected_dtypes.items():
    assert data[col].dtype == expected_dtype, f"Unexpected data type for column {col}."

## Data manipulation

In [4]:
data['nearest_genes'] = data['nearest_genes'].astype(str)

# Assert column 'nearest_genes' is a string
assert data['nearest_genes'].dtype == 'object', "Column 'nearest_genes' is not of string type."

# Get the length of the data before transformation
original_length = len(data)

# Extract the first gene name from the 'nearest_genes' column
data['nearest_genes'] = data['nearest_genes'].str.split(',').str[0]

# Reset index to have a standard index
data = data.reset_index(drop=True)

# Assert the length of the data remains the same
assert len(data) == original_length, "Length of the data has changed after transformation."

# Functions

## gwas_to_graph

In [5]:
import math
import networkx as nx
from scipy.spatial import distance
import numba
import cupy as cp
from numba import cuda

### GPU-enhanced

In [None]:
%%time

def create_graph(data):
    data.sort_values(['#chrom', 'pos'], inplace=True)
    
    G = nx.Graph()
    
    nodes = data.set_index('id')[['nearest_genes', 'mlogp', 'beta', 'sebeta', 'af_alt', 'af_alt_cases', 'af_alt_controls']].to_dict('index')
    G.add_nodes_from(nodes.items())

    def calculate_weights(pos_diffs):
        mask = (pos_diffs > 1) & (pos_diffs <= 300_000)
        indices = cp.argwhere(mask)  # indices where condition holds
        unique_indices = indices[indices[:, 0] < indices[:, 1]]  # indices where first index < second index
        if unique_indices.size > 0:  # check if there are unique_indices
            unique_pos_diffs = pos_diffs[unique_indices[:, 0], unique_indices[:, 1]]
            return unique_indices, cp.exp(-cp.log(2) / 100_000 * unique_pos_diffs)
        else:
            return cp.array([]), cp.array([])

    # Divide the data into 2 halves for multi-GPU computation
    halves = [data[data['#chrom'] <= data['#chrom'].median()], data[data['#chrom'] > data['#chrom'].median()]]
    
    for device, data_half in enumerate(halves):
        with cp.cuda.Device(device):  # Specify the device
            for chrom, group in data_half.groupby('#chrom'):
                ids = group['id'].values
                pos = cp.asarray(group['pos'].values, dtype=cp.float32)  # Use float32 for reduced memory
                
                chunk_size = 20_000
                overlap = 10_000 
                num_chunks = math.ceil(len(pos) / chunk_size)
                
                for chunk in range(num_chunks):
                    start_idx = max(0, chunk * chunk_size - overlap)
                    end_idx = min((chunk + 1) * chunk_size + overlap, len(pos))
                    
                    chunk_pos = pos[start_idx:end_idx]
                    chunk_pos_diffs = cp.empty((len(chunk_pos), len(chunk_pos)), dtype=cp.float32)
                    chunk_pos_diffs -= chunk_pos[:, None]  # Compute difference in-place
                    chunk_pos_diffs = cp.abs(chunk_pos_diffs)

                    unique_indices, unique_weights = calculate_weights(chunk_pos_diffs)
                    unique_weights = cp.asnumpy(unique_weights)
                    unique_indices = cp.asnumpy(unique_indices)  # conversion to NumPy array

                    if unique_indices.size > 0:
                        edges = [(ids[unique_indices[i, 0]], ids[unique_indices[i, 1]], unique_weights[i]) for i in range(unique_indices.shape[0])]
                        G.add_weighted_edges_from(edges)

                del ids, pos
                cp.cuda.Stream.null.synchronize()
                cp._default_memory_pool.free_all_blocks()
                group = None

    return G

nx_graph = create_graph()


### GPU-enhanced with Specific Chrom

In [6]:
%%time

def create_graph(data, target_chromosome):
    # Filter the data for the target chromosome
    data = data[data['#chrom'] == target_chromosome]
    
    data.sort_values(['#chrom', 'pos'], inplace=True)
    
    G = nx.Graph()
    
    nodes = data.set_index('id')[['nearest_genes', 'mlogp', 'beta', 'sebeta', 'af_alt', 'af_alt_cases', 'af_alt_controls']].to_dict('index')
    G.add_nodes_from(nodes.items())

    def calculate_weights(pos_diffs):
        mask = (pos_diffs > 1) & (pos_diffs <= 1_000_000)
        indices = cp.argwhere(mask)  # indices where condition holds
        unique_indices = indices[indices[:, 0] < indices[:, 1]]  # indices where first index < second index
        if unique_indices.size > 0:  # check if there are unique_indices
            unique_pos_diffs = pos_diffs[unique_indices[:, 0], unique_indices[:, 1]]
            return unique_indices, cp.exp(-cp.log(2) / 100_000 * unique_pos_diffs)
        else:
            return cp.array([]), cp.array([])

    ids = data['id'].values
    pos = cp.asarray(data['pos'].values, dtype=cp.float32)  # Use float32 for reduced memory
                
    chunk_size = 30_000
    overlap = 15_000 
    num_chunks = math.ceil(len(pos) / chunk_size)
                
    for chunk in range(num_chunks):
        start_idx = max(0, chunk * chunk_size - overlap)
        end_idx = min((chunk + 1) * chunk_size + overlap, len(pos))
                    
        chunk_pos = pos[start_idx:end_idx]
        chunk_pos_diffs = cp.empty((len(chunk_pos), len(chunk_pos)), dtype=cp.float32)
        chunk_pos_diffs -= chunk_pos[:, None]  # Compute difference in-place
        chunk_pos_diffs = cp.abs(chunk_pos_diffs)

        unique_indices, unique_weights = calculate_weights(chunk_pos_diffs)
        unique_weights = cp.asnumpy(unique_weights)
        unique_indices = cp.asnumpy(unique_indices)  # conversion to NumPy array

        if unique_indices.size > 0:
            edges = [(ids[unique_indices[i, 0]], ids[unique_indices[i, 1]], unique_weights[i]) for i in range(unique_indices.shape[0])]
            G.add_weighted_edges_from(edges)

    del ids, pos
    cp.cuda.Stream.null.synchronize()
    cp._default_memory_pool.free_all_blocks()

    return G

# call the function with the specific chromosome as argument
nx_graph = create_graph(data, 2)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


CPU times: total: 12min 25s
Wall time: 12min 37s


### CPU-only

In [None]:
%%time

def create_graph(data):
    # Sort the data
    data.sort_values(['#chrom', 'pos'], inplace=True)
    
    # Create the graph
    G = nx.Graph()

    # Add nodes to the graph in bulk using a dict comprehension
    G.add_nodes_from({
        row['id']: row.to_dict() 
        for _, row in data.iterrows()
    })

    # Define a function to calculate weights for given indices
    def calculate_weights(pos_diffs):
        mask = (pos_diffs > 1) & (pos_diffs <= 300_000)
        indices = np.argwhere(mask)
        unique_indices = indices[indices[:, 0] < indices[:, 1]]
        unique_pos_diffs = pos_diffs[unique_indices[:, 0], unique_indices[:, 1]]
        return unique_indices, 1 * np.exp(-np.log(2) / 100_000 * unique_pos_diffs)

    # Iterate over each chromosome group
    for chrom, group in data.groupby('#chrom'):
        ids = group['id'].values
        pos = group['pos'].values
        
        # Apply batch operation
        chunk_size = 40_000
        overlap = 2_500  # Define overlap size
        num_chunks = math.ceil(len(pos) / chunk_size)
        
        for chunk in range(num_chunks):
            start_idx = max(0, chunk * chunk_size - overlap)
            end_idx = min((chunk + 1) * chunk_size + overlap, len(pos))
            
            # Calculate pairwise absolute differences in position within each chunk
            chunk_pos = pos[start_idx:end_idx]
            chunk_pos_diffs = distance.squareform(distance.pdist(chunk_pos[:, None], 'cityblock'))

            # Calculate weights
            unique_indices, unique_weights = calculate_weights(chunk_pos_diffs)

            # Add the unique edges to the graph with weights
            for (i, j), weight in zip(unique_indices, unique_weights):
                node1 = ids[start_idx + i]
                node2 = ids[start_idx + j]
                if not G.has_edge(node1, node2):
                    G.add_edge(node1, node2, weight=weight)

        del ids, pos  # Delete to free memory
        group = None  # Free memory

    return G


nx_graph = create_graph(data)


## get_graph_stats

In [None]:
from torch_geometric.utils import degree
from tabulate import tabulate

def print_graph_stats(graph, features_list):
    print(f"Number of nodes: {graph.x.size(0)}")
    print(f"Number of edges: {graph.edge_index.size(1)}")
    print(f"Node feature dimension: {graph.num_node_features}")

    # Compute and print degree-related stats
    degrees = degree(graph.edge_index[0].long(), num_nodes=graph.x.size(0))
    average_degree = degrees.float().mean().item()
    median_degree = np.median(degrees.numpy())
    std_degree = degrees.float().std().item()

    print(f"Average Degree: {average_degree}")
    print(f"Median Degree: {median_degree}")
    print(f"Standard Deviation of Degree: {std_degree}")

    # Compute and print edge-related stats
    average_edge_weight = graph.edge_attr.float().mean().item()
    median_edge_weight = np.median(graph.edge_attr.numpy())
    std_edge_weight = graph.edge_attr.float().std().item()
    min_edge_weight = graph.edge_attr.float().min().item()
    max_edge_weight = graph.edge_attr.float().max().item()

    print(f"Average Edge Weight: {average_edge_weight}")
    print(f"Median Edge Weight: {median_edge_weight}")
    print(f"Standard Deviation of Edge Weight: {std_edge_weight}")
    print(f"Min Edge Weight: {min_edge_weight}")
    print(f"Max Edge Weight: {max_edge_weight}")

    # Density is the ratio of actual edges to the maximum number of possible edges
    num_possible_edges = graph.x.size(0) * (graph.x.size(0) - 1) / 2
    density = graph.edge_index.size(1) / num_possible_edges

    print(f"Density: {density:.10f}")

    # Check for NaN values in features
    nan_mask = torch.isnan(graph.x)
    nan_features = []
    for feature_idx, feature_name in enumerate(features_list):
        if nan_mask[:, feature_idx].any():
            nan_features.append(feature_name)

    print("Features with NaN values:")
    print(nan_features)

    # Compute and print descriptive stats for node feature vectors
    feature_stats = []
    node_features = graph.x
    for i, feature_name in enumerate(features_list):
        feature = node_features[:, i]
        mean = feature.float().mean().item()
        median = np.median(feature.numpy())
        std = feature.float().std().item()
        minimum = feature.float().min().item()
        maximum = feature.float().max().item()

        feature_stats.append([feature_name, mean, median, std, minimum, maximum])

    headers = ["Feature", "Mean", "Median", "Standard Deviation", "Minimum", "Maximum"]
    print("\nNode Feature Vector Descriptive Statistics:")
    print(tabulate(feature_stats, headers=headers, tablefmt="grid"))

# Print graph stats
print("Graph stats:")
print_graph_stats(graph, snp_features.columns)

In [7]:
%%time

# Print basic graph statistics
num_nodes = nx_graph.number_of_nodes()
num_edges = nx_graph.number_of_edges()

print("Number of nodes:", num_nodes)
print("Number of edges:", num_edges)

# Print number of connected components
num_components = nx.number_connected_components(nx_graph)
print("Number of connected components:", num_components)

from collections import Counter
import numpy as np

# Get degree distribution
degree_sequence = [d for n, d in nx_graph.degree()]
degree_sequence = np.array(degree_sequence)

# Print degree distribution statistics
print("Degree distribution statistics:")
print("Minimum degree:", np.min(degree_sequence))
print("Maximum degree:", np.max(degree_sequence))
print("Average degree:", np.mean(degree_sequence))

# Calculate edge weight statistics
weights = np.array([float(data.get('weight', np.nan)) for _, _, data in nx_graph.edges(data=True)])

# Print edge weight statistics
print("Edge weight statistics:")
print("Minimum weight:", np.nanmin(weights))
print("Maximum weight:", np.nanmax(weights))
print("Average weight:", np.nanmean(weights))

# Find the largest connected component
largest_component = max(nx.connected_components(nx_graph), key=len)

# Print the size of the largest connected component
print("Size of the largest connected component (nodes):", len(largest_component))
print("Size of the largest connected component (edges):", num_edges - (num_nodes - len(largest_component)))

Number of nodes: 1668096
Number of edges: 323140485
Number of connected components: 1623097
Degree distribution statistics:
Minimum degree: 0
Maximum degree: 44999
Average degree: 387.4363166148711
Edge weight statistics:
Minimum weight: 0.0009775241733868045
Maximum weight: 0.9291413530987507
Average weight: 0.13540381658974202
Size of the largest connected component (nodes): 45000
Size of the largest connected component (edges): 321517389
CPU times: total: 1min 56s
Wall time: 1min 59s


In [None]:
%%time

nk_graph = nk.nxadapter.nx2nk(nx_graph)

## save_graph

In [8]:
import math
import networkit as nk
from scipy.spatial import distance
import numba
import cupy as cp
from numba import cuda

In [None]:
%%time

# Assuming nk_graph has been converted from a networkx graph
nk.writeGraph(nk_graph, 'nk_graph.networkit', nk.Format.NetworkitBinary)


In [None]:
%%time
import pickle
e
with open('nx_graph.pkl', 'wb') as f:
    pickle.dump(nx_graph, f)


## load_graph

In [8]:
import math
import networkit as nk
from scipy.spatial import distance
import numba
import cupy as cp
from numba import cuda

In [None]:
%%time

import pickle

with open('nx_graph.pkl', 'rb') as f:
    nx_graph = pickle.load(f)

In [None]:
%%time

# Assuming nk_graph has been written to a GraphML file
nk_graph = nk.readGraph('nk_graph.networkit', nk.Format.NetworkitBinary)

## cluster_graph 

In [10]:
%%time

# Choose and initialize algorithm
plmCommunities = nk.community.detectCommunities(nk_graph, algo=nk.community.PLM(nk_graph, True))

print("{0} elements assigned to {1} subsets".format(plmCommunities.numberOfElements(), plmCommunities.numberOfSubsets()))
print("the biggest subset has size {0}".format(max(plmCommunities.subsetSizes())))

max_size = max(plmCommunities.subsetSizes())
max_index = plmCommunities.subsetSizes().index(max_size)

print(max_size)
print(max_index)

  warn("networkit.Timer is deprecated, will be removed in future updates.")


Communities detected in 21.88334 [s]
solution properties:
-------------------  --------------
# communities            1.6231e+06
min community size       1
max community size   45000
avg. community size      1.02772
imbalance            22500
edge cut                 0
edge cut (portion)       0
modularity               0
-------------------  --------------
1668096 elements assigned to 1623097 subsets
the biggest subset has size 45000
45000
0
CPU times: total: 1min 27s
Wall time: 25.5 s


In [11]:
%%time

# Choose and initialize algorithm
plpCommunities = nk.community.detectCommunities(nk_graph, algo=nk.community.PLP(nk_graph, True))

print("{0} elements assigned to {1} subsets".format(plpCommunities.numberOfElements(), plpCommunities.numberOfSubsets()))
print("the biggest subset has size {0}".format(max(plpCommunities.subsetSizes())))

max_size = max(plpCommunities.subsetSizes())
max_index = plpCommunities.subsetSizes().index(max_size)

print(max_size)
print(max_index)

Communities detected in 30.22702 [s]
solution properties:
-------------------  --------------
# communities            1.6231e+06
min community size       1
max community size   45000
avg. community size      1.02772
imbalance            22500
edge cut                 0
edge cut (portion)       0
modularity               0
-------------------  --------------
1668096 elements assigned to 1623097 subsets
the biggest subset has size 45000
45000
0
CPU times: total: 52.2 s
Wall time: 33.8 s
