Overview
This repo contains python code to reproduce experiments from:
NetNorM: capturing cancer-relevant information in somatic exome mutation data with gene networks for cancer stratification and prognosis. 2016. Marine Le Morvan, Andrei Zinovyev, Jean-Philippe Vert. hal-01341856
The code is split into the following pieces:
- NetNorM.py :
- compute the NetNorM representation of a mutation dataset.
- survpred.py :
- reproduce the survival prediction results presented in the paper.
- stratification_NMF.py, stratification_ccNMF.py and stratification_KMeans.py:
- reproduce the patient stratification results presented in the paper.
- comput_functions.py:
- python module necassary for the functions introduced above.
Packages required
The source code was written with python 2.7 and uses the following pakages:
- csv (>=1)
- numpy (>=1.8)
- pandas (>=0.18)
- scipy (>=0.17)
- sklearn (>=0.17)
To install these packages, you can use pip (for example: pip install numpy==1.8). If you don't have pip installed, you can follow these instructions.
# Load packages
import os
import sys
import csv
import numpy as np
import pandas as pd
import scipy.sparse as sp
from scipy.stats import rankdata
from sklearn.base import BaseEstimator
from sklearn import svm
import pickle
# Load additional functions
from comput_functions import *
Datasets
Mutation datasets
NetNorM was tested on exome somatic mutation datasets from 8 cancer types, downloaded from The Cancer Genome Atlas (TCGA) data portal. The data was preprocessed to obtain raw binary patients x genes matrices where 1s indicate non-silent point mutations or indels in a patient-gene pair and 0s indicate the absence of such mutations. The preprocessed raw binary mutation matrices have the following properties:
Cancer type | Patients | Genes | Deaths | Download date |
---|---|---|---|---|
LUAD (Lung adenocarcinoma) | 430 | 20 596 | 110 | 6/22/2015 |
SKCM (Skin cutaneous melanoma) | 307 | 17 461 | 129 | 11/18/2015 |
GBM (Glioblastoma multiform) | 265 | 14 748 | 195 | 11/18/2015 |
BRCA (Breast invasive carcinoma) | 945 | 16 806 | 97 | 11/25/2015 |
KIRC (Kidney renal clear cell carcinoma) | 411 | 10 608 | 136 | 11/25/2015 |
HNSC (Head and Neck squamous cell carcinoma) | 388 | 17 022 | 140 | 11/25/2015 |
LUSC (Lung squamous cell carcinoma) | 169 | 13 589 | 70 | 11/25/2015 |
OV (Ovarian serous cystadenocarcinoma) | 363 | 10 192 | 172 | 11/24/2014 |
Gene network
Pathway Commons (version 6) was used as gene network and downloaded in SIF format (Pathway Commons.6.All.EXTENDED_BINARY_SIF.tsv) from http://www.pathwaycommons.org/archives/PC2/v6/.
The preprocessed raw binary mutation matrices as well as the adjacency matrix corresponding Pathway Commons are available in csv format in data_NetNorM.zip. Note that the last two columns of the mutation matrices correspond to survival information (Days_survival and Vital_Status).
# This script loads the LUAD mutation dataset as well as the Pathway Commons adjacency matrix.
import zipfile
zip = zipfile.ZipFile('data_NetNorM.zip')
zip.extractall()
Computing the NetNorM representation of a mutation dataset
This can be done using the NetNorM.py function.
from NetNorM import NetNorM
print NetNorM.__doc__
Computes the NetNorM representation of a binary mutation matrix.
Parameters
----------
X_raw : patient x genes pandas dataframe.
Initial binary patients x genes mutation matrix, with where 1s
indicate non-silent point mutations or indels in a patient-gene pair
and 0s indicate the absence of such mutations.
A : CSR sparse matrix
Adjacency matrix describing the gene network used.
A_colnames: array_like
Column names (or equivalently row names) of the adjacency matrix A.
k : int
Consensus number of mutations
Returns
-------
X_NetNorM: patient x genes pandas dataframe.
the NetNorM representation of X_raw.
# This script computes the NetNorM representation of the LUAD mutation dataset
# Read the csv file corresponding to the LUAD dataset
X_lab = pd.read_csv('./data_NetNorM/TCGA/data_preprocessed/LUAD/X_labwithclin_ns1_NANA_NA_NA.csv',
sep=',', index_col=0, header=0)
# Separates mutation profiles from the clinical data
# Most of the columns of X_lab are genes, except the last ones which contain clinical data.
# The first column which contains clinical data is 'Days_survival'.
tmp = X_lab.loc[:, 'Days_survival':]
X = X_lab.drop(list(tmp.columns), axis=1)
y = tmp.loc[:, ['Days_survival', 'Vital_Status']]
clin = tmp.drop(['Days_survival', 'Vital_Status'], axis=1)
# Read the csv file corresponding to the Pathway Commons adjacency matrix.
# We directly store the adjacency matrix in a sparse format (csr) to avoid
# wasting too much memory space. However our csv_to_csr function is quite slow.
d = csv_to_csr('./data_NetNorM/Networks/data_preprocessed/A_PC', sep=';')
# Get the columns of the mutation matrix to be the same as those of the adjacency matrix
X_raw = get_mut_cols_as_adj_cols(X, d['colnames'])
# Compute the NetNorM representation of the mutation matrix.
# Choose a parameter k
k = 295
X_NetNorM = NetNorM(X_raw, d['A'], d['colnames'], k)
print type(X_NetNorM)
print X_NetNorM.shape
<class 'pandas.core.frame.DataFrame'>
(430, 16674)
Reproducing survival prediction results
This can be done using the survpred.py function.
from survpred import survpred
print survpred.__doc__
Preprocess the mutation data according to the method specified, and split
the dataset into train/test tests according to a 5-fold
cross-validation scheme. Then for each split into train/test sets, a
L1-penalized survival SVM model is learned on the train set and tested on
the test set. Hyperparameters are learned thanks to an inner
cross-validation on the train set, and then used to train the model on the
whole train set.
Parameters
----------
M_file: str
Path to the raw binary mutation matrix file.
It must be a csv file with patients as rows and genes as columns.The
first row must be gene names. The first column must be patient IDs.
The last columns must be clinical features, and the 2 first clinical
features must be named 'Days_survival', 'Vital_Status'.
'Days_survival' (int) indicate how many days a patient has survived
since diagnosis.
'Vital_Status' (string) take values 'deceased' and 'living', according
to the censoring information of the patient.
M_sep: str
Delimiter to read the csv file located at M_file.
A_file: str
Path to the adjacency matrix
It must be a csv file with the first row and the first column being
gene names.
A_sep: str
Delimiter to read the csv file located at A_file.
method_rep: str
The method used to preprocess the mutation matrix before normalisation
The options are 'nb_mut', raw', 'rawrestricted', 'smoothing',
'smoothing_1iter_DAD', 'neighbours', 'neighbours_DAD'
'nb_mut':
the mutation matrix is restricted to one feature indicating the
total number of mutations for each patient.
'raw':
the mutation matrix is kept as a raw binary matrix
'rawrestricted': the mutation matrix is kept as a raw binary matrix but
restricted to the genes that are listed in the gene network.
'smoothing':
the mutation matrix is smoothed over a network.
'smoothing_1iter_DAD':
the mutation matrix is smoothed over a network, but the smoothing
process is restricted to happen between a gene and its direct
neighbours only.
'neighbours':
for each patient independently, mutated genes are set to
values between 1 and 2 according their respective number of
neighbours; non mutated genes neighbouring a mutated gene are set
to values between 0 and 1 according to the number of mutated genes
they actually neighbour.
'neighbours_DAD':
for each patient independently, genes are set to a
score equal to their number of mutated neighbours normalised by
their degree and by the degree of their neighbours. Then mutated
genes' scores are increased by a constant C to ensure that their
score is always higher than those of non mutated genes.
method_norm: str
The method used to normalize mutation profiles. The options are
'none', 'qn', 'stair'
'none':
no normalization is performed.
'qn':
quantile normalization (performed after each splitting in
train/test).
'stair':
every mutation profile is binarised with the same number of 0s
and 1s.
rs_folds: int
Random state used to generate the 5 folds for cross-validation.
randomize: boolean
Specify wether the adjacency matrix should be randomized (True) or
not (False).
rs_rand: int
Random state used to randomize this adjacency matrix.
k_start: int
smallest value tested for the parameter k (consensus number of
mutations when 'method_norm' is set to 'stair') during the grid
search.
k_step: int
step controlling the values of k that will be tested between
k_start and k_end.
k_end: int
largest value tested for the parameter k (consensus number of
mutations when 'method_norm' is set to 'stair') during the grid
search.
alpha_start: int
smallest value tested for the parameter alpha (smoothing parameter
used when 'method_rep' is set to 'smoothing' or 'smoothing_1iter_DAD')
during the grid search.
alpha_step: int
step controlling the values of alpha that will be tested between
alpha_start and alpha_end.
alpha_end: int
largest value tested for the parameter alpha (smoothing parameter
used when 'method_rep' is set to 'smoothing' or 'smoothing_1iter_DAD')
during the grid search.
data_type: str
Indicate whether a survival prediction model should be learned using
mutation data only, clinical data only, or both (in this case the 2
are still learned separately)
'clin':
the survival prediction model is learned with clinical data
'mut':
the survival prediction model is learned with mutation data
'both':
survival prediction models are learned with both types of data
(separately though)
n_jobs: int
the number of cores to use.
IMPORTANT:
Gene names in M_file and A_file must be written in the same format
(for ex HUGO) so that it is possible to match genes between the 2 files.
Returns
-------
l: list
If 'data_type' is set to 'clin' or 'mut', the list contains 12
elements. Each of these elements is itself a list or an array of
length 5 corresponding to the 5 splits in train/test of the
cross-validation. These containers of length 5 record:
#0: Concordance index (CI) obtained on the train set.
#1: Concordance index (CI) obtained on the test set.
#2: the variables selected in the model (genes or clinical features)
#3: the parameters ((k and C) or (alpha and C) according to the
method used) that was learned with the inner cross-validation.
#4: the coefficients learned for each variable.
#5: Concordance index (CI) obtained on the train set for patients with
more than the mean number of mutations.
#6: Concordance index (CI) obtained on the test set for patients with
more than the mean number of mutations.
#7: Concordance index (CI) obtained on the train set for patients with
less than the mean number of mutations.
#8: Concordance index (CI) obtained on the test set for patients with
less than the mean number of mutations.
#9: Mean concordance index (CI) obtained across in the inner
cross-validation folds with the set of parameters that yields the
best such mean.
#10: survival predictions for patients in the train set
#11: survival predictions for patients in the test set
Note that the arguments method_rep and method_norm control together how the mutation matrix is preprocessed.
method_rep | method_norm | corresponding preprocessing |
---|---|---|
raw | none | none |
neighbours | stair | NetNorM |
smoothing | qn | NSQN |
smoothing | none | NS |
smoothing_1iter_DAD | qn | SimpNet |
The seed rs_folds which determines how the dataset is split into train and test sets was set to 4, 5, 6 and 7 in the paper to obtain 20 cross-validation folds (4 times 5-fold cross-validation).
# This script performs survival prediction using the NetNorM
# representation of the LUAD dataset. Survival prediction is
# performed in a cross-validation setting (5-fold cross-validation).
# Define input arguments
cancer = 'LUAD'
net = 'PC'
M_file = ('./data_NetNorM/TCGA/data_preprocessed/LUAD' +
'/X_labwithclin_ns1_NANA_NA_NA.csv')
M_sep = ','
A_file = './data_NetNorM/Networks/data_preprocessed/A_PC'
A_sep = ';'
method_rep = 'neighbours'
method_norm = 'stair'
rs_folds = '4'
randomize = False
rs_rand = 'NA'
k_start = '50'
k_step = '10'
k_end = '375'
alpha_start = 'NA'
alpha_step = 'NA'
alpha_end = 'NA'
data_type = 'mut'
n_jobs = 1
# Perform survival prediction
res_pred = survpred(M_file, M_sep, A_file, A_sep, method_rep, method_norm, rs_folds,
randomize, rs_rand, k_start, k_step, k_end, alpha_start,
alpha_step, alpha_end, data_type, n_jobs)
# Save the result as a python object in a folder called LUAD_survpred
if not os.path.isdir(cancer + '_stratification_NMF'):
os.makedirs(cancer + '_survpred')
pickle.dump(res_pred, open(cancer + '_survpred/survpred_' + cancer + '_' +
str(data_type) + '_' + net + '_' + method_rep +
'_' + method_norm + '_rsfolds' + str(rs_folds) + '_' +
str(randomize) + '_rsrand' + str(rs_rand) +
'_' + str(k_start) + '_' +
str(k_step) + '_' + str(k_end) + '_' + str(alpha_start) + '_' +
str(alpha_step) + '_' + str(alpha_end), 'wb'))
Reproducing patient stratification results
This can be done with the stratification_NMF.py function (or alternatively with stratification_ccNMF or stratification_KMeans.py).
from stratification_NMF import stratification_NMF
print stratification_NMF.__doc__
Preprocess the mutation data according to the method specified, and then
performs Non-Negative Matrix Factorisation to split patients into
subtypes.
Parameters
----------
M_file: str
Path to the raw binary mutation matrix file.
It must be a csv file with patients as rows and genes as columns.The
first row must be gene names. The first column must be patient IDs.
The last columns must be clinical features, and the 2 first clinical
features must be named 'Days_survival', 'Vital_Status'.
'Days_survival' (int) indicate how many days a patient has survived
since diagnosis.
'Vital_Status' (string) take values 'deceased' and 'living', according
to the censoring information of the patient.
M_sep: str
Delimiter to read the csv file located at M_file.
A_file: str
Path to the adjacency matrix
It must be a csv file with the first row and the first column being
gene names.
A_sep: str
Delimiter to read the csv file located at A_file.
method_rep: str
The method used to preprocess the mutation matrix before normalisation
The options are 'raw', 'rawrestricted', 'smoothing',
'smoothing_1iter_DAD', 'neighbours', 'neighbours_DAD'
'raw':
the mutation matrix is kept as a raw binary matrix
'rawrestricted': the mutation matrix is kept as a raw binary matrix but
restricted to the genes that are listed in the gene network.
'smoothing':
the mutation matrix is smoothed over a network.
'smoothing_1iter_DAD':
the mutation matrix is smoothed over a network, but the smoothing
process is restricted to happen between a gene and its direct
neighbours only.
'neighbours':
for each patient independently, mutated genes are set to
values between 1 and 2 according their respective number of
neighbours; non mutated genes neighbouring a mutated gene are set
to values between 0 and 1 according to the number of mutated genes
they actually neighbour.
'neighbours_DAD':
for each patient independently, genes are set to a
score equal to their number of mutated neighbours normalised by
their degree and by the degree of their neighbours. Then mutated
genes' scores are increased by a constant C to ensure that their
score is always higher than those of non mutated genes.
method_norm: str
The method used to normalize mutation profiles. The options are
'none', 'qn', 'stair'
'none':
no normalization is performed.
'qn':
quantile normalization.
'stair':
every mutation profile is binarised with the same number of 0s
and 1s.
randomize: boolean
Specify wether the adjacency matrix should be randomized (True) or
not (False).
rs_rand: int
Random state used to randomize this adjacency matrix.
k: int
The value for the parameter k (consensus number of mutations when
'method_norm' is set to 'stair')
alpha: int
The value for the parameter alpha (smoothing parameter
used when 'method_rep' is set to 'smoothing' or 'smoothing_1iter_DAD')
N: int
The number of patient subtypes (parameter for the NMF)
IMPORTANT:
Gene names in M_file and A_file must be written in the same format
(for ex HUGO) so that it is possible to match genes between the 2 files.
Returns
-------
ass: pandas series
The patients' cluster assignment.
# This script performs patient stratification using the NetNorM
# representation of the LUAD dataset.
# Define input arguments
cancer = 'LUAD'
net = 'PC'
M_file = ('./data_NetNorM/TCGA/data_preprocessed/LUAD' +
'/X_labwithclin_ns1_NANA_NA_NA.csv')
M_sep = ','
A_file = './data_NetNorM/Networks/data_preprocessed/A_PC'
A_sep = ';'
method_rep = 'neighbours'
method_norm = 'stair'
k = 295
alpha = 'NA'
randomize = False
rs_rand = 'NA'
N = 5
# Perform patient stratification
res_strat = stratification_NMF(M_file, M_sep, A_file, A_sep,
method_rep, method_norm, k, alpha,
randomize, rs_rand, N)
# Save the result as a python object in a folder called LUAD_stratification.
if not os.path.isdir(cancer + '_stratification_NMF'):
os.makedirs(cancer + '_stratification_NMF')
pickle.dump(res_strat, open(cancer + '_stratification_NMF/stratNMF_' +
cancer + '_' + net + '_' + method_rep + '_' +
method_norm + '_k' + str(k) + '_alpha' + str(alpha) +
'_randomize' + str(randomize) + '_rs_rand' +
str(rs_rand) + '_N' + str(N), 'wb'))