Below we demonstrate how to refit a signature catalog to an input matrix. You can use this code to predict signature exposures without de novo discovery, or after de novo discovery and matching to the catalog.

In [94]:
import musical
import pickle
import numpy as np
import pandas as pd

from musical.utils import get_sig_indices_associated
from musical.refit import refit_matrix


Load the input matrix X, which contains the mutation spectrum of each sample. Because we analyze the SBS spectrum, X has 96 rows (one per mutation channel) and one column per sample, which is the orientation the code below assumes.
The dataset is a simulated cohort with 8 selected signatures (SBS1, 2, 3, 5, 8, 13, 17b, 18), generated based on their estimated occurrence in PCAWG breast cancer data. Some other minor signatures were removed in the simulations for this exercise. The ground truth is stored in H_truth, W_truth and signatures_truth below, and we will compare the refitting results against it.

For refitting your own dataset you will only need the X and not the H_truth, W_truth and signatures_truth.

In [131]:
X = pd.read_csv('data/simulated/X_simul_8sig.csv')
H_truth = pd.read_csv('data/simulated/H_s_simul_8sig.csv')
W_truth = pd.read_csv('data/simulated/W_s_simul_8sig.csv')
signatures_truth = np.array(W_truth.columns)
X = np.array(X)
H_truth = np.array(H_truth)
W_truth = np.array(W_truth)
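
Before refitting your own data, it can help to confirm that the matrix has the orientation the later cells rely on. The sketch below is not part of MuSiCal; `check_spectrum_matrix` is a hypothetical helper, and the 96 x nsample layout is an assumption inferred from how X is used further down.

```python
import numpy as np

def check_spectrum_matrix(X, n_channels=96):
    """Validate a mutation count matrix laid out as channels x samples."""
    X = np.asarray(X)
    if X.ndim != 2:
        raise ValueError(f"expected a 2-D matrix, got {X.ndim} dimension(s)")
    # A label column read in as data typically turns the array into dtype object.
    if not np.issubdtype(X.dtype, np.number):
        raise TypeError("matrix contains non-numeric entries, e.g. a label column read as data")
    if X.shape[0] != n_channels:
        raise ValueError(
            f"expected {n_channels} rows (channels), got shape {X.shape}; a transpose may be needed"
        )
    if np.any(X < 0):
        raise ValueError("mutation counts must be non-negative")
    return X.astype(float)

# Toy input: 96 channels x 4 samples of made-up counts.
rng = np.random.default_rng(0)
X_toy = check_spectrum_matrix(rng.poisson(5, size=(96, 4)))
print(X_toy.shape)
```

If the check fails on a file read with `pd.read_csv`, a stray index or label column is a likely cause.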

Load the COSMIC-MuSiCal SBS catalog. By default this loads 'COSMIC-MuSiCal_v3p2_SBS_WGS'; specify name = <CATALOG_NAME> to use a different catalog.

In [95]:
catalog = musical.load_catalog()  

For indels, replace the line above with

In [None]:
# catalog = musical.load_catalog('MuSiCal_Indel_v4_WGS')

To reduce the false positive rate, it is advised to restrict the catalog to the signatures found in your specific tumor type:

In [46]:
catalog.restrict_catalog(tumor_type = 'Breast.AdenoCA')

Remove the signatures related to mismatch repair deficiency (MMRD) or POLE-exo mutations if you know your samples do not have these phenotypes. You can determine whether your tumor is MMRD using other methods, or simply refit with all the signatures first: if the exposures of the MMRD/POLE-exo signatures are small, remove those signatures and repeat the refitting.

In [47]:
catalog.restrict_catalog(tumor_type = 'Breast.AdenoCA', is_MMRD = False, is_PPD = False)
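
One way to judge whether the MMRD/POLE-exo exposures are "small" after a first refit with all signatures is to look at the fraction of each sample's total exposure that they carry. The following is a minimal sketch on made-up numbers, not part of MuSiCal: `flagged_exposure_fraction` is a hypothetical helper, SBS44 stands in as an illustrative MMRD-associated signature, and the 5% cutoff is an arbitrary assumption to tune for your data.

```python
import numpy as np

def flagged_exposure_fraction(H, signatures, flagged):
    """Per-sample fraction of total exposure carried by the flagged signatures.

    H is (n_signatures, n_samples); signatures names the rows of H.
    """
    H = np.asarray(H, dtype=float)
    mask = np.isin(signatures, flagged)
    total = H.sum(axis=0)
    # Avoid division by zero for samples with no assigned exposure.
    safe_total = np.where(total > 0, total, 1.0)
    return H[mask].sum(axis=0) / safe_total

# Toy exposures: 3 signatures x 4 samples (made-up numbers).
signatures_toy = np.array(['SBS1', 'SBS5', 'SBS44'])
H_toy = np.array([[100.,  80.,  50., 0.],
                  [300., 120., 150., 0.],
                  [  0.,   0., 800., 0.]])
mmrd_like = ['SBS44']   # illustrative choice of flagged signatures
max_frac = 0.05         # hypothetical cutoff

frac = flagged_exposure_fraction(H_toy, signatures_toy, mmrd_like)
samples_with_signal = np.where(frac > max_frac)[0]
print(frac)
print(samples_with_signal)
```

Samples that stay below the cutoff are candidates for a second refit with is_MMRD = False and is_PPD = False, as in the cell above.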

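You can list the available tumor types with the call below. In the environment used for this notebook the call raised an AttributeError, so this method may not exist in every version of the package: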

In [97]:
catalog.show_tumor_type_options()

AttributeError: 'Catalog' object has no attribute 'show_tumor_type_options'

To obtain the signature matrix and the signature names from the catalog, and to see which signatures are selected, do:

In [49]:
W_catalog = np.array(catalog.W)
signatures = np.array(catalog.signatures)
print(signatures) 

['SBS1' 'SBS2' 'SBS3' 'SBS5' 'SBS8' 'SBS13' 'SBS17a' 'SBS17b' 'SBS18'
 'SBS34' 'SBS41' 'SBS85' 'SBS100']


Let's first try simple non-negative least squares (NNLS), fitting each sample (column of X) independently:

In [74]:
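from scipy.optimize import nnls

# Fit each sample (column of X) independently; nnls returns (solution, residual norm)
H_s = []
for x in X.T:
    h, _ = nnls(W_catalog, x)
    H_s.append(h)
H_s = np.transpose(np.array(H_s))  # shape: n_signatures x n_samples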
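
To see what plain NNLS does and why a sparsity step is useful, here is a small self-contained sketch on synthetic data. Everything in it (the random 5-signature catalog, the exposures, the `nnls_refit` helper) is made up for illustration and is unrelated to the COSMIC catalog. Without noise NNLS recovers the exposures, while with Poisson noise it is free to assign small non-zero exposures to signatures that are absent.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)

# Synthetic catalog: 96 channels x 5 signatures, each column sums to 1.
W_toy = rng.dirichlet(np.ones(96), size=5).T
# True exposures: 5 signatures x 3 samples; the last two signatures are absent.
H_true = np.array([[500., 200.,   0.],
                   [300.,   0., 400.],
                   [  0., 600., 100.],
                   [  0.,   0.,   0.],
                   [  0.,   0.,   0.]])
X_clean = W_toy @ H_true                       # noise-free expected counts
X_noisy = rng.poisson(X_clean).astype(float)   # counts with sampling noise

def nnls_refit(X, W):
    """Solve min ||W h - x||_2 subject to h >= 0 for each sample (column of X)."""
    return np.column_stack([nnls(W, x)[0] for x in X.T])

H_clean = nnls_refit(X_clean, W_toy)
H_noisy = nnls_refit(X_noisy, W_toy)

print(np.allclose(H_clean, H_true, atol=1e-4))
# Number of signatures with non-zero exposure in each noisy sample
print((H_noisy > 0).sum(axis=0))
```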

Refit using the default 'likelihood_bidirectional' method to introduce sparsity, but set thresh1 to a small value so that more signatures can be fitted:

In [197]:
H_s, reco_error = refit_matrix(X, W = W_catalog, thresh1 = 0.001)

In [77]:
signatures[np.where(np.sum(H_s, axis = 1) > 0)[0]]

array(['SBS1', 'SBS2', 'SBS3', 'SBS5', 'SBS8', 'SBS13', 'SBS17b', 'SBS18',
       'SBS34', 'SBS41', 'SBS85'], dtype='<U6')
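
Beyond listing which signatures receive any exposure, it is often useful to see how much of the cohort's total exposure each one carries, since false positives tend to have small shares. The sketch below uses made-up numbers and a hypothetical helper `summarize_exposures`; it assumes the same signatures x samples layout as H_s above.

```python
import numpy as np

def summarize_exposures(H, signatures):
    """Return the signatures with non-zero total exposure and their share
    of the total exposure summed over all samples."""
    H = np.asarray(H, dtype=float)
    signatures = np.asarray(signatures)
    per_signature = H.sum(axis=1)
    active = per_signature > 0
    share = per_signature / per_signature.sum()
    shares = {str(name): float(s) for name, s in zip(signatures[active], share[active])}
    return signatures[active], shares

# Toy exposures: 3 signatures x 2 samples (made-up numbers).
sigs_toy = np.array(['SBS1', 'SBS2', 'SBS5'])
H_toy2 = np.array([[10., 30.],
                   [ 0.,  0.],
                   [40., 20.]])
active_sigs, shares = summarize_exposures(H_toy2, sigs_toy)
print(active_sigs)
print(shares)
```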

Try a larger thresh1 value, which imposes a stricter requirement:

In [119]:
H_s, reco_error = refit_matrix(X, W = W_catalog, thresh1 = 0.01)

In [198]:
signatures_assigned = signatures[np.where(np.sum(H_s, axis = 1) > 0)[0]]

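With thresh1 = 0.01 the assigned signatures should agree with the input signatures. You can then check whether the correct samples are assigned non-zero exposures by comparing to the input H matrix (H_truth). If the two signature sets differ, the cell below prints a message instead: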

In [204]:
# Keep only the signatures (rows) with non-zero total exposure
H_s = H_s[np.where(np.sum(H_s, axis = 1) > 0)[0],:]
if signatures_truth.size == signatures_assigned.size:
    if np.array_equal(signatures_truth, signatures_assigned):
        # Boolean matrix: True where the presence/absence call matches the truth
        per_sample_agreement = (H_s > 0) == (H_truth > 0)
    else:
        print('signatures are not identical')
else:
    print('signatures are of different length')

signatures are of different length


If H_s has different dimensions than H_truth because of false positive signatures, first find the indices of the true positive signatures and then compare the elements:

In [205]:
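# Rows for signatures present in both sets (assumes the same relative order in both)
inds = [index for index,item in enumerate(signatures_assigned) if item in signatures_truth]
inds_truth = [index for index,item in enumerate(signatures_truth) if item in signatures_assigned[inds]]

# False positive signatures (assigned but not true) and missed signatures (true but not assigned)
inds_unmatched = [index for index,item in enumerate(signatures_assigned) if item not in signatures_truth]
inds_truth_unmatched = [index for index,item in enumerate(signatures_truth) if item not in signatures_assigned]

# True where the presence/absence call matches the truth, for matched signatures only
per_sample_agreement = (H_s[inds,:] > 0) == (H_truth[inds_truth,:] > 0)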

In [206]:
accuracy = np.sum(per_sample_agreement)/np.size(H_truth)

In [207]:
sensitivity = np.sum((H_truth[inds_truth] > 0) * per_sample_agreement)/np.sum(H_truth > 0) 
sensitivity = sensitivity - np.sum(H_truth[inds_truth_unmatched,:] > 0)/np.sum(H_truth > 0)

In [208]:
fpr = 1 - np.sum((H_truth == 0) * per_sample_agreement)/np.sum(H_truth == 0)
fpr = fpr + np.sum(H_s[inds_unmatched] > 0)/np.sum(H_truth == 0)

In [209]:
print(accuracy)

0.98989898989899


In [210]:
print(sensitivity)

0.9938335046248715


In [211]:
print(fpr)

0.021276595744680868
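
The index bookkeeping above can also be avoided by expanding both exposure matrices onto the union of signature names and counting true/false positives and negatives directly. The sketch below is one possible formulation on made-up data, with a hypothetical helper `presence_metrics`. It counts negatives over the union of signatures, so its values can differ slightly from the cells above, which normalize by the entries of H_truth only. It also assumes at least one true positive entry and one true negative entry, otherwise the ratios are undefined.

```python
import numpy as np

def presence_metrics(H_fit, sigs_fit, H_true, sigs_true):
    """Accuracy, sensitivity and false positive rate of exposure presence calls.

    Both matrices are (n_signatures, n_samples). A signature missing from
    one side is treated as an all-zero row on that side.
    """
    sigs_fit, sigs_true = list(sigs_fit), list(sigs_true)
    all_sigs = sorted(set(sigs_fit) | set(sigs_true))
    n_samples = np.shape(H_true)[1]

    def expand(H, sigs):
        # Presence/absence matrix over the union of signature names
        present = np.zeros((len(all_sigs), n_samples), dtype=bool)
        for row, name in zip(np.asarray(H), sigs):
            present[all_sigs.index(name)] = row > 0
        return present

    fit, true = expand(H_fit, sigs_fit), expand(H_true, sigs_true)
    tp = np.sum(fit & true)
    tn = np.sum(~fit & ~true)
    fp = np.sum(fit & ~true)
    fn = np.sum(~fit & true)
    return {
        'accuracy': (tp + tn) / fit.size,
        'sensitivity': tp / (tp + fn),
        'fpr': fp / (fp + tn),
    }

# Toy example: the refit adds one false positive signature (SBS8)
# and misses SBS1 in the third sample.
H_true_toy = np.array([[10., 0., 5.],
                       [0., 20., 5.]])
H_fit_toy = np.array([[9., 0., 0.],
                      [0., 18., 6.],
                      [0., 1., 0.]])
metrics = presence_metrics(H_fit_toy, ['SBS1', 'SBS5', 'SBS8'],
                           H_true_toy, ['SBS1', 'SBS5'])
print(metrics)
```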
