<h1 id="tocheading">Table of Contents</h1>
<div id="toc"></div>

In [1]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')

In [2]:
# Import some necessary modules
%matplotlib inline
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib as mpl
import pandas as pd
import time
import scipy as sp
import pickle
%load_ext autoreload
%autoreload 2

In [3]:
# Import MuSiCal
import musical

# Overview 

In this notebook, we demonstrate how to refit an input matrix $X$ against a signature catalog. Refitting can be performed as a standalone task for predicting signature exposures, or as a downstream step after *de novo* signature discovery and matching the *de novo* signatures to the catalog. 

# Input data

The inputs for refitting are the mutation count matrix $X$ and the signature catalog $W$. 

## The mutation count matrix

Here we use a simulated dataset based on PCAWG skin melanomas to demonstrate how to perform refitting. The dataset contains 15 SBS signatures. 

Below, `X` is the simulated mutation count matrix. `W_true` contains the true signatures present in the dataset (i.e., the 15 SBS signatures). `H_true` is the true exposure matrix from which `X` is simulated. 

In a real analysis, only `X` is available, since `W_true` and `H_true` are unknown. We load the ground truth here only so that we can evaluate the refitting results.

In [4]:
X = pd.read_csv('./data/simulated_example.Skin.Melanoma.X.csv', index_col=0)
W_true = pd.read_csv('./data/simulated_example.Skin.Melanoma.W_true.csv', index_col=0)
H_true = pd.read_csv('./data/simulated_example.Skin.Melanoma.H_true.csv', index_col=0)
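
To make the roles of the three matrices concrete, here is a minimal toy sketch of the relationship $X \approx WH$. The names `W_toy`, `H_toy`, and `X_toy`, the 6-channel layout, and the use of Poisson sampling are all assumptions for illustration; the procedure actually used to simulate the dataset above is not described in this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

channels = [f"ch{i}" for i in range(6)]   # stand-ins for mutation channels
signatures = ["SigA", "SigB"]             # hypothetical signature names
samples = ["S1", "S2", "S3"]              # hypothetical sample names

# Signature matrix: channels x signatures, each column sums to 1.
W_toy = pd.DataFrame(
    rng.dirichlet(np.ones(len(channels)), size=len(signatures)).T,
    index=channels, columns=signatures,
)

# Exposure matrix: signatures x samples, mutations attributed to each signature.
H_toy = pd.DataFrame(
    [[500.0, 0.0, 200.0],
     [0.0, 300.0, 100.0]],
    index=signatures, columns=samples,
)

# Expected counts are W @ H; observed counts scatter around them.
# Poisson noise is an assumption made for this sketch.
expected = W_toy @ H_toy
X_toy = pd.DataFrame(rng.poisson(expected.values), index=channels, columns=samples)

print(X_toy.shape)  # (6, 3): channels x samples
```

Refitting goes in the opposite direction: given `X` and `W`, it estimates `H`.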

## The signature catalog 

MuSiCal provides several signature catalogs, listed below. 

In [5]:
musical.catalog.CATALOG_NAMES

['COSMIC_v2_SBS_WGS',
 'COSMIC_v3_SBS_WGS',
 'COSMIC_v3_SBS_WES',
 'COSMIC_v3p1_SBS_WGS',
 'COSMIC_v3p2_SBS_WGS',
 'COSMIC-MuSiCal_v3p2_SBS_WGS',
 'COSMIC_v3p1_Indel',
 'MuSiCal_v4_Indel_WGS']

Let's first load the default SBS catalog, which is `COSMIC-MuSiCal_v3p2_SBS_WGS`. This catalog includes 77 COSMIC v3.2 SBS signatures, 6 SBS signatures additionally discovered by MuSiCal from PCAWG samples, and a revised spectrum of SBS40 based on MuSiCal. Below, `catalog` is a catalog class object. Signatures in the catalog can be accessed through `catalog.W`. We see that there are in total 84 signatures. 

In [6]:
catalog = musical.load_catalog()
print(catalog.W.shape[1])

84


Other catalogs can be loaded if a name is specified. For example, the following line loads the preferred indel signature catalog. 
```
catalog = musical.load_catalog('MuSiCal_v4_Indel_WGS')
```

Directly refitting our dataset `X` against all 84 signatures in the catalog will introduce many false positives, leading to over-assignment. It is thus better to restrict our catalog to only those signatures found in the specific tumor type. 

You can select your own preferred set of signatures, but MuSiCal also provides tumor type-specific signature sets based on our PCAWG reanalysis. 

Below, we restrict our catalog to Skin.Melanoma. Now, only 15 signatures remain in the catalog.  

In [7]:
catalog.restrict_catalog(tumor_type='Skin.Melanoma')
print(catalog.W.shape[1])

15


A list of available tumor types is shown below. 

In [8]:
print(catalog.show_tumor_type_options().tolist())

['Biliary.AdenoCA', 'Bladder.TCC', 'Bone.Benign', 'Bone.Epith', 'Bone.Osteosarc', 'Breast.AdenoCA', 'Breast.DCIS', 'Breast.LobularCA', 'CNS.GBM', 'CNS.Medullo', 'CNS.Oligo', 'CNS.PiloAstro', 'Cervix.AdenoCA', 'Cervix.SCC', 'ColoRect.AdenoCA', 'Eso.AdenoCA', 'Head.SCC', 'Kidney.ChRCC', 'Kidney.RCC', 'Liver.HCC', 'Lung.AdenoCA', 'Lung.SCC', 'Lymph.BNHL', 'Lymph.CLL', 'Myeloid.AML', 'Myeloid.MDS', 'Myeloid.MPN', 'Ovary.AdenoCA', 'Panc.AdenoCA', 'Panc.Endocrine', 'Prost.AdenoCA', 'Skin.Melanoma', 'SoftTissue.Leiomyo', 'SoftTissue.Liposarc', 'Stomach.AdenoCA', 'Thy.AdenoCA', 'Uterus.AdenoCA']


We can further restrict our catalog by removing signatures associated with mismatch repair deficiency (MMRD) or polymerase proofreading deficiency (PPD) (e.g., samples with POLE-exo mutations), since we know that this simulated dataset does not contain MMRD or PPD samples. 

If you are not sure whether your dataset contains MMRD/PPD samples, you can first perform a refitting including the MMRD/PPD signatures, and then use the `musical.preprocessing` module to determine if there is a cluster of MMRD/PPD samples within your dataset. If so, you can separate these samples and perform refitting again for the two clusters of samples separately. Of course other methods can be used to determine MMRD/PPD samples, e.g., by looking for hypermutations, inspecting POLE-exo mutations, detecting microsatellite instabilities, etc. 

In this case, no additional signatures are removed, since none of the 15 skin melanoma-specific signatures are associated with MMRD or PPD. 

In [9]:
catalog.restrict_catalog(tumor_type='Skin.Melanoma', is_MMRD=False, is_PPD=False)
print(catalog.W.shape[1])

15


We can finally obtain signatures in the catalog. 

In [10]:
W = catalog.W

# Refitting 

Refitting can be performed with `musical.refit.refit()`.

## Naive NNLS 

Let's first try naive NNLS. This can be achieved by setting `method` to `thresh_naive` and `thresh` to `0`. The method `thresh_naive` simply performs NNLS, and then sets signatures with relative exposures smaller than `thresh` to have zero exposures.  
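
For intuition, the idea of NNLS followed by thresholding can be sketched with `scipy.optimize.nnls`. This is only a rough illustration, not MuSiCal's implementation: the function name `nnls_with_threshold` is hypothetical, the toy matrices are made up, and the sketch does not refit after zeroing exposures (whether MuSiCal does so is not covered here).

```python
import numpy as np
from scipy.optimize import nnls

def nnls_with_threshold(X, W, thresh=0.0):
    """Per-sample NNLS, then zero out exposures whose relative share is below thresh.

    X: channels x samples, W: channels x signatures.
    Returns H: signatures x samples. (Hypothetical helper for illustration.)
    """
    X = np.asarray(X, dtype=float)
    W = np.asarray(W, dtype=float)
    H = np.zeros((W.shape[1], X.shape[1]))
    for j in range(X.shape[1]):
        h, _ = nnls(W, X[:, j])          # nonnegative least squares for one sample
        total = h.sum()
        if total > 0:
            h[h / total < thresh] = 0.0  # drop signatures with a small relative exposure
        H[:, j] = h
    return H

# Toy example: 3 channels, 2 signatures, 2 samples.
W_demo = np.array([[0.7, 0.1],
                   [0.2, 0.2],
                   [0.1, 0.7]])
H_demo = np.array([[100.0, 0.0],
                   [2.0, 50.0]])
X_demo = W_demo @ H_demo

H_naive = nnls_with_threshold(X_demo, W_demo, thresh=0.0)    # plain NNLS
H_thresh = nnls_with_threshold(X_demo, W_demo, thresh=0.05)  # removes the ~2% exposure
```

With `thresh=0`, no exposure is removed, which is why this setting corresponds to naive NNLS.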

In [11]:
H, model = musical.refit.refit(X, W, method='thresh_naive', thresh=0)

The resulting exposure matrix is as follows:

In [12]:
H.head()

Unnamed: 0,SP124323,SP124281,SP124389,SP124362,SP124394,SP124380,SP124399,SP124311,SP124434,SP124428,...,SP124271,SP124336,SP124441,SP124291,SP82471,SP124353,SP113197,SP83027,SP124351,SP124458
SBS1,290.93254,168.392622,3.455685,121.673856,33.879735,0.0,0.0,7.439992,0.0,0.0,...,694.963837,334.625196,618.845212,261.553617,479.625523,334.849471,487.220375,702.512154,0.0,0.0
SBS2,167.628152,252.00889,0.0,219.398711,0.0,0.0,761.898641,148.445604,0.0,0.0,...,1044.499723,86.746569,512.642909,480.69693,202.639766,227.684818,379.576132,0.0,42.793347,188.123167
SBS3,74.115014,0.0,43.006129,0.0,0.0,350.782361,0.0,0.0,59.690625,460.064275,...,8375.335994,2384.68529,3840.222019,1980.315298,2698.550827,1238.678942,1953.732826,1622.879194,0.0,0.0
SBS5,1918.174315,1582.209531,2783.517374,89.314829,0.0,6677.434453,12255.033423,12576.803327,8681.500148,0.0,...,4113.171976,1675.629974,2960.886074,1654.295535,1054.027011,828.215157,1388.112314,1498.240185,930.267297,26.914515
SBS7a,0.0,47.831538,52176.996552,169888.854658,83299.190298,113109.454007,121593.886709,55904.867638,58921.252123,241293.000767,...,3182.675326,523.090975,0.0,294.383994,0.0,41.823159,0.0,915.465141,2835.824831,106921.666111


We can compare the obtained exposure matrix with the true one to evaluate the refitting result with naive NNLS. To do that, let's first reindex `H_true` so that it has the same signatures as in `H`. 

In [13]:
H_true_reindexed = H_true.reindex(H.index).fillna(0)

Then, we calculate some statistics by comparing the zero and nonzero entries in `H_true_reindexed` with those in `H`.

In [14]:
TP = np.logical_and(H_true_reindexed > 0, H > 0).sum().sum()
FP = np.logical_and(H_true_reindexed == 0, H > 0).sum().sum()
TN = np.logical_and(H_true_reindexed == 0, H == 0).sum().sum()
FN = np.logical_and(H_true_reindexed > 0, H == 0).sum().sum()
P = TP + FN
N = TN + FP
print(TP, FP, TN, FN, P, N)

424 519 662 0 424 1181


In [15]:
print('Sensitivity = %.3g' % (TP/P))
print('False positive rate = %.3g' % (FP/N))

Sensitivity = 1
False positive rate = 0.439


We see that naive NNLS leads to a high false positive rate, i.e., over-assignment. 
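
The same comparison is repeated for each refitting method, so it can be convenient to wrap it in a small helper. The sketch below is one possible way to do so; the function name `exposure_detection_stats` and the toy data are hypothetical and not part of MuSiCal.

```python
import numpy as np
import pandas as pd

def exposure_detection_stats(H_true, H):
    """Compare zero/nonzero patterns of true and estimated exposures."""
    # Align the truth to the estimate; signatures absent from the truth get zero exposure.
    H_true = H_true.reindex(index=H.index, columns=H.columns).fillna(0)
    true_pos = (H_true > 0).to_numpy()
    est_pos = (H > 0).to_numpy()
    TP = int(np.sum(true_pos & est_pos))
    FP = int(np.sum(~true_pos & est_pos))
    TN = int(np.sum(~true_pos & ~est_pos))
    FN = int(np.sum(true_pos & ~est_pos))
    return {
        "TP": TP, "FP": FP, "TN": TN, "FN": FN,
        "sensitivity": TP / (TP + FN) if (TP + FN) else float("nan"),
        "false_positive_rate": FP / (FP + TN) if (FP + TN) else float("nan"),
    }

# Toy example: the estimate contains a signature (SBS2) that is absent from the truth.
H_true_demo = pd.DataFrame([[10.0, 0.0], [0.0, 20.0]],
                           index=["SBS1", "SBS5"], columns=["S1", "S2"])
H_est_demo = pd.DataFrame([[9.0, 1.0], [0.0, 3.0], [0.0, 18.0]],
                          index=["SBS1", "SBS2", "SBS5"], columns=["S1", "S2"])
stats_demo = exposure_detection_stats(H_true_demo, H_est_demo)
print(stats_demo)
```

In this toy case both true exposures are recovered (sensitivity 1), while two of the four true zeros are assigned a nonzero exposure (false positive rate 0.5).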

## Likelihood-based sparse NNLS 

MuSiCal implements a likelihood-based sparse NNLS for refitting. It is selected by setting `method` to `likelihood_bidirectional` in `musical.refit.refit()`. The small nonnegative likelihood threshold `thresh` controls the sparsity level. When `thresh` is 0, the result is almost equivalent to naive NNLS. Larger values of `thresh` induce stronger sparsity. 

In the full pipeline including *de novo* signature discovery followed by matching and refitting, this likelihood threshold will be automatically chosen by the *in silico* validation module. 

Here, let's use a reasonable threshold of 0.001. 

In [16]:
H, model = musical.refit.refit(X, W, method='likelihood_bidirectional', thresh=0.001)

In [17]:
H.head()

Unnamed: 0,SP124323,SP124281,SP124389,SP124362,SP124394,SP124380,SP124399,SP124311,SP124434,SP124428,...,SP124271,SP124336,SP124441,SP124291,SP82471,SP124353,SP113197,SP83027,SP124351,SP124458
SBS1,288.393161,169.567787,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,693.524088,332.917468,616.28558,259.169218,480.145895,333.934288,487.482498,703.455047,0.0,0.0
SBS2,167.557017,273.695417,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,987.785312,0.0,508.738968,482.74652,203.400891,247.159271,379.853157,0.0,0.0,0.0
SBS3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,8517.941793,2414.405614,3971.791764,2044.878553,2730.740793,1244.40453,1966.84371,1678.780082,0.0,0.0
SBS5,2008.808159,1652.380668,3082.909458,0.0,0.0,7176.32363,12292.269666,12662.358966,8833.091168,0.0,...,4315.16913,1745.713967,3078.01556,1778.661931,1041.596285,878.093767,1379.926582,1480.546286,968.715073,0.0
SBS7a,0.0,0.0,52166.001371,170225.691472,83299.155332,113082.240584,122366.120452,56131.255876,58914.001615,241281.212297,...,3337.867553,656.226823,0.0,281.663489,0.0,0.0,0.0,930.789434,2900.935916,107212.905792


Again, we can compare to `H_true` to evaluate the result.

In [18]:
H_true_reindexed = H_true.reindex(H.index).fillna(0)

In [19]:
TP = np.logical_and(H_true_reindexed > 0, H > 0).sum().sum()
FP = np.logical_and(H_true_reindexed == 0, H > 0).sum().sum()
TN = np.logical_and(H_true_reindexed == 0, H == 0).sum().sum()
FN = np.logical_and(H_true_reindexed > 0, H == 0).sum().sum()
P = TP + FN
N = TN + FP
print(TP, FP, TN, FN, P, N)

423 3 1178 1 424 1181


In [20]:
print('Sensitivity = %.3g' % (TP/P))
print('False positive rate = %.3g' % (FP/N))

Sensitivity = 0.998
False positive rate = 0.00254


We see that the false positive rate is dramatically reduced, while the sensitivity is still reasonably high. 
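
To give a rough intuition for how a likelihood threshold can induce sparsity, the sketch below removes signatures one at a time as long as the per-mutation multinomial log-likelihood drops by no more than `thresh`. This is a simplified, backward-only illustration under our own assumptions, not MuSiCal's `likelihood_bidirectional` algorithm; the function names, the exact likelihood definition, and the toy data are all hypothetical.

```python
import numpy as np
from scipy.optimize import nnls

def loglik_per_mutation(x, W, h):
    """Multinomial log-likelihood of counts x under the normalized reconstruction, per mutation."""
    recon = W @ h
    if recon.sum() <= 0:
        return -np.inf
    p = np.clip(recon / recon.sum(), 1e-12, None)  # guard against log(0)
    return float(x @ np.log(p)) / float(x.sum())

def backward_sparse_nnls(x, W, thresh):
    """Greedy backward elimination for one sample x (assumed to contain mutations)."""
    active = list(range(W.shape[1]))
    h_act, _ = nnls(W[:, active], x)
    ll = loglik_per_mutation(x, W[:, active], h_act)
    while len(active) > 1:
        # Try removing each active signature and refit the remaining ones with NNLS.
        candidates = []
        for j in active:
            keep = [k for k in active if k != j]
            h_try, _ = nnls(W[:, keep], x)
            candidates.append((loglik_per_mutation(x, W[:, keep], h_try), keep, h_try))
        best_ll, best_keep, best_h = max(candidates, key=lambda c: c[0])
        if ll - best_ll > thresh:
            break  # even the cheapest removal costs too much likelihood
        active, h_act, ll = best_keep, best_h, best_ll
    h = np.zeros(W.shape[1])
    h[active] = h_act
    return h

# Toy example: 6 channels, 4 signatures, only the first two are truly active.
W_sparse_demo = np.array([
    [0.70, 0.05, 0.05, 0.05],
    [0.10, 0.70, 0.05, 0.05],
    [0.05, 0.10, 0.70, 0.05],
    [0.05, 0.05, 0.10, 0.70],
    [0.05, 0.05, 0.05, 0.10],
    [0.05, 0.05, 0.05, 0.05],
])
x_sparse_demo = 1000 * W_sparse_demo[:, 0] + 500 * W_sparse_demo[:, 1]
h_sparse_demo = backward_sparse_nnls(x_sparse_demo, W_sparse_demo, thresh=0.001)
```

In this sketch, a larger `thresh` tolerates a larger likelihood loss per removal and therefore yields a sparser solution, which mirrors the qualitative behavior described above.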

# Comments 

1. Matching *de novo* signatures to the catalog can be performed with `musical.refit.match()` in a way similar to that described above, except that in matching, `X` will be the matrix of signatures to be matched. 

2. The `model` variable above is a `SparseNNLS` object. It provides several other convenient attributes. For example, `model.X_reconstructed` is the reconstructed mutation count matrix, and `model.cos_similarities` contains the cosine similarities between the original data and the reconstructed spectra.  

3. Associated signatures (e.g., APOBEC signatures SBS2 and SBS13) can be forced to co-occur using the option `connected_sigs=True` (by default it is set to `False`). 
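
As an illustration of what a cosine similarity between the original data and its reconstruction measures (comment 2), the sketch below computes it sample by sample with NumPy. The function name `columnwise_cosine` is hypothetical, and we only assume, without confirming, that `model.cos_similarities` is defined in a comparable per-sample way.

```python
import numpy as np

def columnwise_cosine(X, X_rec):
    """Cosine similarity between matching columns (samples) of X and X_rec."""
    X = np.asarray(X, dtype=float)
    X_rec = np.asarray(X_rec, dtype=float)
    num = (X * X_rec).sum(axis=0)
    denom = np.linalg.norm(X, axis=0) * np.linalg.norm(X_rec, axis=0)
    # Columns with zero norm get similarity 0 instead of a division error.
    return np.divide(num, denom, out=np.zeros_like(num), where=denom > 0)

# Toy example: the first sample is reconstructed up to scale, the second not at all.
X_cos_demo = np.array([[1.0, 0.0],
                       [0.0, 2.0]])
X_rec_demo = np.array([[2.0, 1.0],
                       [0.0, 0.0]])
cos_demo = columnwise_cosine(X_cos_demo, X_rec_demo)
print(cos_demo)
```

Values close to 1 indicate that the reconstructed spectrum has nearly the same shape as the observed one, regardless of the total mutation count.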