Whether and how to determine the number of signatures (K) automatically? #8

WuyangFF95 · 2019-09-12T11:45:32Z

run_stm.R requires me to provide the number of signatures (K) along with a mutation count input file.

But in many cases, it would not be possible to know K in advance. So is there a function to determine K automatically? If not, is there a plotting script to give me a PDF plot for aiding the selection of K? Thanks!

wir963 · 2019-09-12T15:37:02Z

@WuyangFF95

In the paper and in our experiments, we plot the heldout log-likelihood against K to determine K. I'll add that script to the demo directory, but feel free to poke around in the TCGA-BRCA experiment to see how it is done.
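To make the selection step concrete, here is a minimal sketch of how heldout log-likelihoods collected for several candidate K values could be summarized. It is not the script from the demo directory: the function names, the input layout (a dict mapping K to per-split scores), and the numbers are all hypothetical, and picking the highest mean is a simplification of inspecting the plotted curve by eye.

```python
from statistics import mean


def summarize_heldout(scores_by_k):
    """Return (K, mean heldout log-likelihood) pairs, sorted by K.

    scores_by_k is a hypothetical layout: {K: [score per train/test split]}.
    """
    return [(k, mean(scores)) for k, scores in sorted(scores_by_k.items())]


def pick_k(scores_by_k):
    """Pick the K whose mean heldout log-likelihood is highest."""
    return max(summarize_heldout(scores_by_k), key=lambda pair: pair[1])[0]


# Placeholder numbers for illustration only; these are NOT real TCSM results.
example_scores = {
    2: [-1250.0, -1240.0],
    3: [-1190.0, -1200.0],
    4: [-1185.0, -1215.0],
    5: [-1230.0, -1220.0],
}

for k, avg in summarize_heldout(example_scores):
    print(f"K={k}\tmean heldout log-likelihood={avg:.1f}")
print("suggested K:", pick_k(example_scores))
```

In practice the (K, mean) pairs would be plotted and the curve examined, since a plateau can justify a smaller K than the strict maximum.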


WuyangFF95 · 2019-09-13T02:32:00Z

Thank you @wir963 !

I went to the demo folder, and ran help for stm_heldout_likelihood.R.

(tcsm) [wuyang@monster demo]$ ../src/stm_heldout_likelihood.R -h
usage: ../src/stm_heldout_likelihood.R [-h] [--seed SEED]
                                       [--covariates COVARIATES]
                                       [--trainf TRAINF] [--testf TESTF]
                                       [--heldout HELDOUT]
                                       trainmc testmc k

positional arguments:
  trainmc               mutation count input file for training
  testmc                mutation count input file for test
  k                     number of signatures to use

optional arguments:
  -h, --help            show this help message and exit
  --seed SEED           random seed
  --covariates COVARIATES
                        covariates (separated by +)
  --trainf TRAINF       feature input file for training
  --testf TESTF         feature input file for test
  --heldout HELDOUT     feature input file for test

But I'm still confused about a couple of fields:

How should I split the original mutation count file into training and test sets? Let's say I have 500 tumors in a mutation count file. Should I put 450 tumors in the training set and 50 tumors in the test set, and repeat this 10 times?
What is the feature input file? What is the difference between --testf TESTF and --heldout HELDOUT?
Can you provide an example command to run ../src/stm_heldout_likelihood.R? Thanks!

wir963 · 2019-09-13T16:18:29Z

@WuyangFF95,

The feature input file is required when you use covariates (it isn't necessary otherwise). I copy-pasted the help message for --heldout from --testf and forgot to fix it but it should be obvious now.

If you look at demo/Snakefile, you'll see a workflow for running TCSM (both for plotting the heldout likelihood and for estimating signatures and exposures), which includes an example command for ../src/stm_heldout_likelihood.R.
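As a rough illustration of the 450/50 split asked about above, the following sketch partitions a list of tumor IDs into 10 folds. It is not taken from demo/Snakefile, which remains the reference for what the workflow actually does. The function name kfold_split, the tumor ID format, and the choice of 10 folds are assumptions made for the example.

```python
import random


def kfold_split(sample_ids, n_folds=10, seed=0):
    """Yield (train_ids, test_ids) pairs, one per fold.

    Hypothetical helper: shuffles the IDs once with a fixed seed, deals them
    into n_folds groups, and uses each group in turn as the test set.
    """
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)
    folds = [ids[i::n_folds] for i in range(n_folds)]
    for i, test_ids in enumerate(folds):
        train_ids = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train_ids, test_ids


# 500 made-up tumor IDs, matching the numbers in the question above.
tumors = [f"tumor_{i}" for i in range(500)]
for fold, (train_ids, test_ids) in enumerate(kfold_split(tumors)):
    print(f"fold {fold}: {len(train_ids)} training, {len(test_ids)} test")
```

The rows of the mutation count file matching each ID list would then be written to two separate files and passed as the trainmc and testmc positional arguments shown in the help text, once per candidate K.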

wir963 added a commit that referenced this issue Sep 12, 2019

add example of plotting heldout likelihood to determine number of sig…

ab8f7ba

…natures to demo for #8

wir963 added a commit that referenced this issue Sep 13, 2019

fix help message for heldout per #8

e51ca9a

wir963 added a commit that referenced this issue Sep 13, 2019

improve help message to use covariates instead of features per #8

3331ba4

wir963 closed this as completed Sep 19, 2019
