# Graph of SNPs in the genome
The idea here is to build a similarity graph between SNPs. In the raw data, many SNPs are identical and differ only by name (see the data exploration notebook). Eliminating the duplicates, or grouping them together as a single feature, reduces the analysis time and complexity. Going one step further, it may be useful to also group together SNPs that are highly similar, while still keeping them distinguishable. This notebook develops that idea with a graph approach.
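
As a minimal sketch of the grouping of identical SNPs, assuming a genotype table with one row per SNP, an identifier column `SNP` and one column per strain (the values and strain names below are made up for illustration), duplicates can be collapsed with a pandas `groupby`:

```python
import pandas as pd

# Toy genotype table (hypothetical values): one row per SNP, one column per strain.
toy = pd.DataFrame({
    'SNP': ['rs1', 'rs2', 'rs3', 'rs4'],
    'BXD1': [1, 1, -1, 1],
    'BXD2': [-1, -1, 1, -1],
    'BXD3': [1, 1, 1, -1],
})
strain_cols = ['BXD1', 'BXD2', 'BXD3']

# Group SNPs sharing exactly the same genotype vector and keep the list of their names.
groups = (toy.groupby(strain_cols, sort=False)['SNP']
             .apply(list)
             .reset_index(name='members'))
print(groups)
```

Each row of `groups` is one distinct genotype vector, and `members` lists the SNP names that share it, so no name is lost when duplicates are merged.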

In this notebook, SNPs are the nodes of the graph. Each SNP is connected to its k nearest neighbors, and the connections are weighted by the similarity of the SNPs, derived from a chosen distance. Each SNP is associated with a vector encoding its variations over the BXD mouse dataset. Two SNPs are similar if their vectors are close in terms of Euclidean distance.
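
The construction can be illustrated on a toy example. The sketch below assumes four made-up SNP vectors over three strains (not taken from the BXD data) and connects each SNP to its single nearest neighbor with `sklearn.neighbors.kneighbors_graph`:

```python
import numpy as np
import sklearn.neighbors

# Toy data (hypothetical): 4 SNPs described over 3 strains.
X = np.array([
    [1, 1, 1],
    [1, 1, -1],
    [-1, -1, -1],
    [-1, -1, 1],
], dtype=float)

# Connect each SNP to its single nearest neighbor; edges store Euclidean distances.
knn = sklearn.neighbors.kneighbors_graph(X, n_neighbors=1, mode='distance')
print(knn.toarray())
```

Row `i` of the resulting sparse matrix holds the distances from SNP `i` to its selected neighbors; all other entries are zero.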

In [None]:
import pandas as pd
import numpy as np
from scipy.sparse import csr_matrix
import os

In [None]:
import networkx as nx
import sklearn.metrics
import sklearn.neighbors
import matplotlib.pyplot as plt

## Importing the data

In [None]:
# Config for accessing the data on the s3 storage
storage_options = {'anon':True, 'client_kwargs':{'endpoint_url':'https://os.unil.cloud.switch.ch'}}
s3_path = 's3://lts2-graphnex/BXDmice/'

In [None]:
# Load the data
genotype_path = os.path.join(s3_path, 'geno_reduced.csv.gz')
#genotype_path = os.path.join(s3_path, 'genotype_BXD.csv.gz')
genotype = pd.read_csv(genotype_path, storage_options=storage_options)
print('File {} Opened.'.format(genotype_path))

## Computing the distances

In [None]:
# Extract the data as a numpy array
geno_values = genotype.loc[:,'B6D2F1':].values

In [None]:
# Default distance is Euclidean
num_neighbors = 4
geno_knn = sklearn.neighbors.kneighbors_graph(geno_values, num_neighbors, mode='distance')
# Optionally, one can use the following function to compute all the distances:
#geno_distances = sklearn.metrics.pairwise_distances(geno_values)

In [None]:
# Distribution of distances
plt.hist(geno_knn.data, bins=20)
plt.title('Distribution of distances')
plt.xlabel('Distance')
plt.ylabel('Nb of edges')
plt.show()

In [None]:
# Distance to weight
# Modify the stored values of the sparse matrix to turn them into weights instead of distances
def distance2weight(d):
    # Exponential kernel: sigma acts as a decay rate (larger sigma, faster decay)
    sigma = 1
    return np.exp(- sigma * d)

M = geno_knn.copy()
M.data = distance2weight(geno_knn.data)

print('A distance of 1 becomes a weight of {}.'.format(str(distance2weight(1))))
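
To get a feel for the kernel, the sketch below evaluates the same exponential mapping for a few decay rates. The function name `distance_to_weight` and its `rate` argument are illustrative only; `rate` plays the role of `sigma` in the cell above, and the distances are arbitrary.

```python
import numpy as np

def distance_to_weight(d, rate=1.0):
    """Exponential kernel: weight = exp(-rate * d)."""
    return np.exp(-rate * np.asarray(d, dtype=float))

# Arbitrary distances, chosen for illustration only.
distances = np.array([0.0, 1.0, 2.0, 4.0])
for rate in (0.5, 1.0, 2.0):
    print(rate, np.round(distance_to_weight(distances, rate), 3))
```

A distance of 0 always maps to a weight of 1, and a larger rate pushes the weights of distant neighbors toward 0 more quickly.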

In [None]:
# Distribution of weights
plt.hist(M.data, bins=20)
plt.title('Distribution of weights')
plt.xlabel('Weight value')
plt.ylabel('Nb of edges')
plt.show()

## Building the graph

In [None]:
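# from_scipy_sparse_matrix was removed in NetworkX 3.0; from_scipy_sparse_array (NetworkX >= 2.7) replaces it
G = nx.from_scipy_sparse_array(M)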

In [None]:
# Adding info on the nodes of the graph
genoinfo_dic = genotype[['SNP','Chr','Pos']].to_dict()
nx.set_node_attributes(G,genoinfo_dic['SNP'],name='SNP') # SNP id
nx.set_node_attributes(G,genoinfo_dic['Chr'],name='Chr') # Chromosome
nx.set_node_attributes(G,genoinfo_dic['Pos'],name='Pos') # position inside the chromosome

In [None]:
# Saving the graph as a gexf file readable with Gephi.
nx.write_gexf(G,'SNPgraph.gexf')

Graph plotted using Gephi, colored by chromosome. SNPs are mostly similar to their spatial neighbors within a chromosome: most of the subgraphs have a single color (SNPs from the same chromosome).

![SNP graph](SNPgraphChr.png)
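
The visual impression can also be checked numerically. The sketch below is one possible way to do so, assuming nodes carry a `Chr` attribute as set earlier in this notebook; it uses a small made-up graph and a hypothetical helper `same_chromosome_fraction` rather than the real SNP graph.

```python
import networkx as nx

# Toy graph (hypothetical): 5 SNPs with a 'Chr' attribute.
H = nx.Graph()
H.add_nodes_from([(0, {'Chr': '1'}), (1, {'Chr': '1'}), (2, {'Chr': '1'}),
                  (3, {'Chr': '2'}), (4, {'Chr': '2'})])
H.add_edges_from([(0, 1), (1, 2), (3, 4), (2, 3)])

def same_chromosome_fraction(graph):
    """Fraction of edges whose two endpoints lie on the same chromosome."""
    chrom = nx.get_node_attributes(graph, 'Chr')
    same = sum(chrom[u] == chrom[v] for u, v in graph.edges())
    return same / graph.number_of_edges()

print(same_chromosome_fraction(H))
```

Calling `same_chromosome_fraction(G)` on the full graph should give a value close to 1 if most edges stay within a chromosome.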

## Applications of the graph
There are several possible applications of this graph:

* SNP reduction: reduce the number of SNPs to analyse by grouping similar ones together or keeping only some representatives
    * subsample the SNPs regularly over the graph
    * coarsen the graph in order to merge the most similar SNPs together
* SNP similarity encoding:
    * associate a new feature vector to SNPs accounting for their similarities (node2vec on this graph)
    * use the graph as an input of a GNN
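
As a rough sketch of the SNP reduction idea, one simple option (not a method used in this notebook) is to keep only the strongest edges and treat each connected component as a group of similar SNPs. The toy graph, its weights and the `threshold` value below are arbitrary assumptions for illustration.

```python
import networkx as nx

# Toy weighted graph (hypothetical weights in (0, 1], as produced by exp(-d)).
H = nx.Graph()
H.add_weighted_edges_from([(0, 1, 0.95), (1, 2, 0.90), (2, 3, 0.20), (3, 4, 0.85)])

threshold = 0.8  # arbitrary cut-off, chosen for illustration only

# Keep only the strong edges, then treat each connected component as one group of SNPs.
strong = nx.Graph()
strong.add_nodes_from(H.nodes())
strong.add_edges_from((u, v) for u, v, w in H.edges(data='weight') if w >= threshold)
groups = [sorted(c) for c in nx.connected_components(strong)]

# One representative per group: here simply the node with the smallest index.
representatives = [g[0] for g in groups]
print(groups, representatives)
```

The choice of threshold controls the trade-off: a higher value merges only near-identical SNPs, while a lower one produces fewer, larger groups.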

Clustering can be done with the ["Graclus" clustering approach](https://www.cs.utexas.edu/users/inderjit/public_papers/multilevel_pami.pdf), implemented [here](https://github.com/rusty1s/pytorch_cluster).