# Machine learning for genetic data

## Introduction

The goal of this practical session is to manipulate high-dimensional, low sample-size data that is typical of many genetic applications.

Here we will work with GWAS data from _Arabidopsis thaliana_, a plant model organism. The genotypes are described by **Single Nucleotide Polymorphisms (SNPs)**. Our goal is to use this data to identify regions of the genome that can be linked with various growth and flowering traits (**phenotypes**).

## Data description

* `data/athaliana_small.X.txt` is the design matrix, with as many rows as samples and as many columns as SNPs.
* the SNPs are given (in order) in `data/athaliana_small.snps.txt`. 
* the samples are given (in order) in `data/athaliana.samples.txt`.

* the phenotypes are given in `data/phenotypes.pheno`. The first two columns give the sample's ID, and all following columns give a phenotype. The header gives the list of all phenotypes. In this session we will use "2W" and "4W", which give the number of days by which the plant grows to be 5 centimeters tall, after either two weeks ("2W") or four weeks ("4W") of vernalization (i.e. the seeds are kept at cold temperatures, similar to winter). Not all phenotypes are available for all samples.

* `data/athaliana.snps_by_gene.txt` contains, for each _A. thaliana_ SNP, the list of genes it is in or near to. (This can be several genes, as it is customary to use a rather large window to compute this, so as to capture potential cis-regulatory effects.)

* the feature network is in `data/athaliana_small.W.txt`. It has been saved as 3 arrays, corresponding to the row, col, and data attributes of a [scipy.sparse coo_matrix](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.coo_matrix.html).
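As a minimal illustration of the `coo_matrix` layout described in the last item, the toy example below rebuilds a sparse matrix from three arrays (row indices, column indices, values). The arrays are made up for the example and do not come from `data/athaliana_small.W.txt`.

```python
import numpy as np
from scipy import sparse

# Toy network on 4 features (made-up values, not the real file).
rows = np.array([0, 0, 2])        # row index of each stored entry
cols = np.array([1, 3, 3])        # column index of each stored entry
data = np.array([1.0, 1.0, 1.0])  # weight of each stored entry

# Entry k of the matrix is data[k] at position (rows[k], cols[k]).
W_demo = sparse.coo_matrix((data, (rows, cols)), shape=(4, 4))
print(W_demo.toarray())
```

Only the listed entries are stored, which is what makes this format practical for a network over many SNPs.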

## Loading the data

In [None]:
%pylab inline

#### Read the list of SNP names

In [None]:
with open('data/athaliana_small.snps.txt') as f:
    # the `with` block closes the file automatically
    snp_names = f.readline().split()
print(len(snp_names), snp_names[:10])

#### Read the list of sample names

In [None]:
samples = list(np.loadtxt('data/athaliana.samples.txt', # file names
                         dtype=int)) # values are integers
print(len(samples), samples[:10])

#### Load the design matrix (n samples x p SNPs)

In [None]:
X = np.loadtxt('data/athaliana_small.X.txt',  # file names
               dtype='int') # values are integers

In [None]:
n, p = X.shape

In [None]:
len(np.where(X[0, :]==1)[0])

#### Load the 2W phenotype data

The first phenotype we will work with is called "2W". It describes the number of days required for the bolt height to reach 5 cm, at a temperature of 23°C under 16 hours of daylight per 24 hours, for seeds that have been vernalized for 2 weeks at 5°C (with 8 hours of daylight per 24 hours).

In [None]:
import pandas as pd

In [None]:
# TODO
# read phenotypes from phenotypes.pheno
# only keep samples that have a 2W phenotype. 
#
# df = ...
# df_2W = ...
# samples_with_phenotype = ...

In [None]:
# Restrict X to the samples with a 2W phenotype, in correct order
X_2W = X[samples_with_phenotype, :]
y_2W = np.array(df_2W)[samples_with_phenotype]
n, p = X_2W.shape
print(n, p)

## Split the data in a train and test set

We will set aside a test set, containing 20% of our samples, on which to evaluate the quality of our predictive models.

In [None]:
from sklearn import model_selection

In [None]:
## TODO
## X_2W_tr, X_2W_te, y_2W_tr, y_2W_te = ...
print(X_2W_tr.shape, X_2W_te.shape)

## Visualize the phenotype

In [None]:
h = plt.hist(y_2W_tr, bins=30)

## T-test

Let us start by running a statistical test for association of each SNP feature with the phenotype.

In [None]:
## TODO: run a univariate t-test for each SNP, and make a Manhattan plot.

__What do you observe? Are any SNPs significantly associated with the phenotype? What genes are they in/near?__
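The following is a minimal sketch of per-SNP t-tests and a Manhattan plot on synthetic data, not a reference solution. It assumes binary genotypes (0/1), uses Welch's t-test, and applies a Bonferroni threshold; all names ending in `_demo`, the simulated effect, and the threshold choice are assumptions made for the illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
n_samples, n_snps = 100, 500  # synthetic sizes, not the real data
X_demo = rng.integers(0, 2, size=(n_samples, n_snps))  # binary genotypes
# Simulated phenotype: only SNP 10 has an effect.
y_demo = 3.0 * X_demo[:, 10] + rng.normal(size=n_samples)

# One two-sample t-test per SNP: phenotype of carriers vs non-carriers.
pvalues = np.empty(n_snps)
for j in range(n_snps):
    carriers = y_demo[X_demo[:, j] == 1]
    others = y_demo[X_demo[:, j] == 0]
    pvalues[j] = stats.ttest_ind(carriers, others, equal_var=False).pvalue

# Bonferroni correction for the number of tests performed.
bonferroni = 0.05 / n_snps
significant = np.flatnonzero(pvalues < bonferroni)
print(significant)

# Manhattan plot: -log10(p-value) against SNP position (here, its index).
fig, ax = plt.subplots()
ax.scatter(np.arange(n_snps), -np.log10(pvalues), s=5)
ax.axhline(-np.log10(bonferroni), color="red")
ax.set_xlabel("SNP index")
ax.set_ylabel("-log10(p-value)")
```

On the real data, the x-axis would follow the SNP order given in `data/athaliana_small.snps.txt`.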

## Linear regression 

In [None]:
## TODO: train an OLS model, plot its regression weights, and measure the predictive accuracy on the test set

__What do you observe? How can you interpret these results? Do any of the SNPs strike you as having a strong influence on the phenotype?__

## Lasso

In [None]:
## TODO: same exercise with the Lasso

__How can you interpret these results? How many SNPs contribute to explaining the phenotype?__

### Stability

__How stable is the set of selected SNPs, between the different rounds of cross-validation with optimal parameters?__

You can use [sklearn.metrics.jaccard_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.jaccard_score.html) (which replaces the deprecated `jaccard_similarity_score`) on the binary selection masks, or implement Kuncheva's consistency index.

__Note:__ One could also contemplate using the Jaccard similarity (or another measure of consistency/stability/robustness) as a criterion to select the best hyperparameters. Pay attention, however, to the fact that hyperparameters selecting no features at all or all the features will have very good consistency.

The Jaccard index is high in extreme situations (when no features or all features are selected).

Kuncheva's consistency index addresses this issue. [link to paper](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.101.6458&rep=rep1&type=pdf)
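Both measures can be sketched in a few lines. For two feature subsets of the same size k drawn from p features and sharing r features, Kuncheva's index is (r·p − k²) / (k·(p − k)). The function names below are illustrative, and returning 0 when k = 0 or k = p (where the index is undefined) is a convention assumed for this sketch.

```python
def jaccard_index(a, b):
    """Jaccard similarity between two sets of selected feature indices."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty selections are identical
    return len(a & b) / len(a | b)


def kuncheva_index(a, b, n_features):
    """Kuncheva's consistency index for two subsets of equal size."""
    a, b = set(a), set(b)
    k = len(a)
    if len(b) != k:
        raise ValueError("both subsets must have the same size")
    if k == 0 or k == n_features:
        return 0.0  # undefined case, set to 0 by convention in this sketch
    r = len(a & b)
    # Corrects the overlap r for the overlap expected by chance, k^2 / p.
    return (r * n_features - k ** 2) / (k * (n_features - k))


print(jaccard_index({1, 2, 3}, {2, 3, 4}))       # 0.5
print(kuncheva_index({1, 2, 3}, {2, 3, 4}, 10))  # 11/21, about 0.52
```

Selecting all 10 features twice gives a Jaccard index of 1 but a Kuncheva index of 0 under this convention, which illustrates the issue raised in the note above.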

## Elastic net

One solution to make the lasso more stable is to use a combination of the l1 and l2 regularizations.

We are now minimizing the loss + a linear combination of an l1-norm and an l2-norm over the regression weights. This imposes sparsity, but encourages correlated features to be selected together, where the lasso would tend to pick only one (at random) of a group of correlated features.

The elastic net is implemented in scikit-learn's [linear_model.ElasticNet](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html#sklearn.linear_model.ElasticNet).
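The grouping effect can be seen on a small synthetic example, sketched below under arbitrary assumptions (three nearly identical features, `alpha=0.1`, `l1_ratio=0.5`). It is not tuned for the _A. thaliana_ data; on real data these hyperparameters would be chosen by cross-validation.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
n_samples = 100
base = rng.normal(size=n_samples)
# Features 0-2 are near copies of the same signal; features 3-19 are noise.
correlated = base[:, None] + 0.01 * rng.normal(size=(n_samples, 3))
noise = rng.normal(size=(n_samples, 17))
X_demo = np.hstack([correlated, noise])
y_demo = 3.0 * base + 0.1 * rng.normal(size=n_samples)

lasso = Lasso(alpha=0.1).fit(X_demo, y_demo)
# l1_ratio balances the l1 (sparsity) and l2 (grouping) penalties.
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X_demo, y_demo)

print("lasso weights on the correlated group:", lasso.coef_[:3])
print("enet weights on the correlated group: ", enet.coef_[:3])
```

The elastic net should spread the weight across the three correlated features, whereas the lasso typically concentrates it on fewer of them.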

In [None]:
## TODO: train an elastic net model, check the selected SNPs, and measure the predictive accuracy on the test set


__How can you interpret these results? How many SNPs contribute to explaining the phenotype?__

__How stable is the set of selected SNPs, between the different rounds of cross-validation with optimal parameters?__

## Stability selection with the Lasso

__Use a randomized procedure to stabilize the lasso__

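See [sklearn.linear_model.RandomizedLasso](http://scikit-learn.org/0.18/modules/generated/sklearn.linear_model.RandomizedLasso.html#sklearn.linear_model.RandomizedLasso) and the [User Guide](http://scikit-learn.org/0.18/auto_examples/linear_model/plot_sparse_recovery.html). These links point to the scikit-learn 0.18 documentation: `RandomizedLasso` was deprecated in later releases and is no longer available in recent versions of scikit-learn.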

In [1]:
## TODO: run stability selection to select SNPs; for different numbers of selected SNPs, fit a simple OLS model on the selected SNPs and check performance on the test set

**Note**: It is usually more informative to choose the range of `selection_threshold` values to test according to the number of features selected at each threshold. The threshold can be changed with `randlasso.selection_threshold = new_value`.
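Where `RandomizedLasso` is not available in the installed scikit-learn, the core idea can be approximated by hand. The sketch below is a simplified variant that only refits a `Lasso` on random half-samples and records how often each feature is selected; it omits the random rescaling of penalties that `RandomizedLasso` also performs. The function name, `alpha`, the number of resamplings, and the 0.8 threshold are all assumptions of this illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso


def selection_frequencies(X, y, alpha, n_resamplings=50, seed=0):
    """Fraction of random half-samples in which each feature is selected."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    counts = np.zeros(n_features)
    for _ in range(n_resamplings):
        # Draw half of the samples without replacement and refit the lasso.
        subset = rng.choice(n_samples, size=n_samples // 2, replace=False)
        model = Lasso(alpha=alpha, max_iter=5000).fit(X[subset], y[subset])
        counts += model.coef_ != 0
    return counts / n_resamplings


# Synthetic data: only features 0 and 1 influence the response.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(120, 200))
y_demo = 2.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(size=120)

freq = selection_frequencies(X_demo, y_demo, alpha=0.5)
stable = np.flatnonzero(freq >= 0.8)  # plays the role of selection_threshold
print(stable)
```

Varying the threshold applied to `freq` gives feature sets of different sizes, which can then be passed to an OLS model as in the exercise above.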

## Network-constrained lasso

Let us try the ncLasso on the real data.

### Load the network

In [None]:
from scipy import sparse

In [None]:
w_saved = np.loadtxt('data/athaliana_small.W.txt')

In [None]:
w_saved.shape

In [None]:
# 1291643 is the number of connections between genes in the A. thaliana gene network
# w_saved[0, :] is the list of row indices
# w_saved[1, :] is the list of column indices
# w_saved[2, :] is the list of edge weights (all equal to 1)
W = sparse.coo_matrix((w_saved[2, :], (np.array(w_saved[0, :], dtype=int), 
                                       np.array(w_saved[1, :], dtype=int))), 
                      shape=(p, p))

### Build the incidence matrix

In [None]:
# Compute node degrees 
degrees = np.zeros((p, ))
for vertex in W.row:
    degrees[vertex] += 2 ## Question: why +2 and not +1?

In [None]:
# build the incidence matrix linking each vertex to its connected edges
tim = sparse.lil_matrix((W.row.shape[0], p))
for ix, edge in enumerate(W.data):
    tim[ix, W.row[ix]] = np.sqrt(edge / degrees[W.row[ix]])
    tim[ix, W.col[ix]] = - np.sqrt(edge / degrees[W.col[ix]])

In [None]:
tim.shape

Now we can run the ncLasso model we created during Part 1 of the tutorial.

__Use the network-constrained Lasso on the data. What do you observe?__

In [None]:
nclasso = ncLasso(transposed_incidence=tim, lambda1=0.001, lambda2=0.001)
nclasso.fit(X_2W_tr, y_2W_tr)

In [None]:
from sklearn import metrics  # needed for explained_variance_score below

y_2W_nclasso_pred = nclasso.predict(X_2W_te)

print("Percentage of variance explained (using %d SNPs): %.2f" %
      (np.nonzero(nclasso.coef_)[0].shape[0],
       metrics.explained_variance_score(y_2W_te, y_2W_nclasso_pred)))

__Print the selected genes within the gene network__

In [None]:
nclasso_selected_genes = np.where(nclasso.coef_!=0)[0].tolist()

In [None]:
import networkx as nx

selected_genes = nclasso_selected_genes # try with randlasso_selected_genes

nb_selected_genes = len(selected_genes)

adjacency_matrix = np.zeros((nb_selected_genes, nb_selected_genes))
for i_gene, gene in enumerate(selected_genes):
    ind_of_interest = np.where(w_saved[0,:]==gene)[0]
    for ind in ind_of_interest:
        if w_saved[1,ind] in selected_genes:
            j_gene = selected_genes.index(w_saved[1,ind])
            adjacency_matrix[i_gene, j_gene] = 1
            adjacency_matrix[j_gene, i_gene] = 1

# from_numpy_matrix was removed in networkx 3.0; from_numpy_array replaces it
G1 = nx.from_numpy_array(adjacency_matrix)
graph_pos=nx.spring_layout(G1,k=0.50,iterations=50)
nx.draw_networkx(G1,graph_pos)

**Note:** It would also be interesting to scale each node's radius by its associated weight, or to color each node according to the biological pathways it is associated with.

## Multi-task feature selection

1) Repeat the previous analysis for the 4W phenotype. It is very similar to the 2W phenotype, except that the seeds have been vernalized for 4 weeks.

2) It is not unreasonable to expect the genomic regions driving both those phenotypes to be (almost) the same. Use the multi-task version of the Lasso, ENet, or ncLasso algorithms to analyze both phenotypes simultaneously.

Use [sklearn.linear_model.MultiTaskLasso](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.MultiTaskLasso.html#sklearn.linear_model.MultiTaskLasso) + [User Guide](http://scikit-learn.org/stable/auto_examples/linear_model/plot_multi_task_lasso_support.html)
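As a starting point, here is a minimal sketch of `MultiTaskLasso` on synthetic data with two related responses standing in for 2W and 4W. The data, the value of `alpha`, and the `_demo` names are assumptions of the illustration. Note that `MultiTaskLasso` expects a single response matrix with one column per task, so on the real data only samples that have both phenotypes can be used.

```python
import numpy as np
from sklearn.linear_model import MultiTaskLasso

rng = np.random.default_rng(0)
n_samples, n_snps = 100, 50
X_demo = rng.normal(size=(n_samples, n_snps))

# Two tasks share the same three causal features, with different effects.
coef_true = np.zeros((2, n_snps))
coef_true[0, :3] = [2.0, -1.5, 1.0]
coef_true[1, :3] = [1.5, -2.0, 1.5]
Y_demo = X_demo @ coef_true.T + 0.1 * rng.normal(size=(n_samples, 2))

# The mixed l1/l2 penalty selects or discards a feature for all tasks at once.
mtl = MultiTaskLasso(alpha=0.2).fit(X_demo, Y_demo)

# coef_ has shape (n_tasks, n_features); a feature is selected if its
# column is nonzero.
support = np.flatnonzero(np.any(mtl.coef_ != 0, axis=0))
print(support)
```

Because the penalty acts on whole columns of the coefficient matrix, the selected features are the same for both tasks, while the fitted effect sizes may differ.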

In [None]:
## TODO: good luck..!