Whether or How to determine the number of signatures (K) automatically? #8

Closed
WuyangFF95 opened this issue Sep 12, 2019 · 3 comments

Comments

@WuyangFF95

Dear @wir963 ,

run_stm.R requires me to provide the number of signatures (K) in a mutation count input file.

But in many cases K is not known in advance. Is there a function to determine K automatically? If not, is there a plotting script that produces a PDF plot to help with selecting K? Thanks!

@wir963
Collaborator

wir963 commented Sep 12, 2019

@WuyangFF95

In the paper and in our experiments, we plot the held-out log-likelihood to determine K. I'll add that script to the demo directory, but feel free to poke around in the TCGA-BRCA experiment to see how it is used.
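As a rough illustration of how held-out log-likelihood can guide the choice of K, the sketch below averages per-fold scores for each candidate K and reports the K with the highest mean. It is written in Python rather than the project's R, the function name `best_k` is hypothetical, and the numbers are placeholders, not TCSM output. The thread does not state the exact selection rule used in the paper; taking the maximum of the mean is only one common reading of such a plot.

```python
from statistics import mean


def best_k(heldout_loglik):
    """Return (K with the highest mean held-out log-likelihood, mean per K).

    heldout_loglik maps each candidate K to a list of held-out
    log-likelihood values, one per train/test split.
    """
    mean_ll = {k: mean(values) for k, values in heldout_loglik.items()}
    return max(mean_ll, key=mean_ll.get), mean_ll


# Placeholder values for illustration only (not real TCSM results).
scores = {
    2: [-1250.0, -1240.0],
    3: [-1180.0, -1190.0],
    4: [-1185.0, -1200.0],
    5: [-1210.0, -1220.0],
}
k, mean_ll = best_k(scores)
print(k, mean_ll[k])
```

In practice it is worth looking at the whole curve rather than only the maximum, since the mean scores of neighbouring values of K can be very close.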

wir963 added a commit that referenced this issue Sep 12, 2019
@WuyangFF95
Author

Thank you @wir963 !

I went to the demo folder and ran the help for stm_heldout_likelihood.R:

(tcsm) [wuyang@monster demo]$ ../src/stm_heldout_likelihood.R -h
usage: ../src/stm_heldout_likelihood.R [-h] [--seed SEED]
                                       [--covariates COVARIATES]
                                       [--trainf TRAINF] [--testf TESTF]
                                       [--heldout HELDOUT]
                                       trainmc testmc k

positional arguments:
  trainmc               mutation count input file for training
  testmc                mutation count input file for test
  k                     number of signatures to use

optional arguments:
  -h, --help            show this help message and exit
  --seed SEED           random seed
  --covariates COVARIATES
                        covariates (separated by +)
  --trainf TRAINF       feature input file for training
  --testf TESTF         feature input file for test
  --heldout HELDOUT     feature input file for test

But I'm still confused about a couple of fields:

  1. How should I split the original mutation count file into training and test sets? Say I have 500 tumors in a mutation count file: should I use 450 tumors as the training set and 50 tumors as the test set, and repeat this 10 times?
  2. What is the feature input file? What is the difference between --testf TESTF and --heldout HELDOUT?
  3. Can you provide an example command for running ../src/stm_heldout_likelihood.R? Thanks!
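The 450/50 split repeated 10 times in question 1 corresponds to 10-fold cross-validation. The sketch below shows one way to produce such splits over tumor identifiers. It is written in Python rather than the project's R, and it is not taken from the TCSM repository: the function name `kfold_splits` and the tumor IDs are hypothetical, and writing each split back out as a mutation count file is left out because the file layout is not described in this thread.

```python
import random


def kfold_splits(sample_ids, n_folds=10, seed=0):
    """Return a list of (train_ids, test_ids) pairs.

    Samples are shuffled once, then dealt into n_folds folds so that
    every sample appears in exactly one test set.
    """
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)
    folds = [ids[i::n_folds] for i in range(n_folds)]
    splits = []
    for i, test_ids in enumerate(folds):
        train_ids = [s for j, fold in enumerate(folds) if j != i for s in fold]
        splits.append((train_ids, test_ids))
    return splits


# Hypothetical identifiers standing in for the 500 tumors in the question.
tumors = [f"tumor_{i:03d}" for i in range(500)]
splits = kfold_splits(tumors, n_folds=10, seed=42)
for train_ids, test_ids in splits:
    print(len(train_ids), len(test_ids))
```

Each pair would then be written out as a training file and a test file and passed as the `trainmc` and `testmc` arguments, once per candidate `k`.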

@wir963
Collaborator

wir963 commented Sep 13, 2019

@WuyangFF95,

The feature input file is required when you use covariates (it isn't necessary otherwise). I copy-pasted the help message for --heldout from --testf and forgot to fix it but it should be obvious now.

If you look at demo/Snakefile, you'll see a workflow for running TCSM (both for plotting the held-out likelihood and for estimating signatures and exposures), which includes an example command for ../src/stm_heldout_likelihood.R.

wir963 closed this as completed Sep 19, 2019