Application for inferring subclonal composition and evolution from whole-genome sequencing data.

PhyloWGS

This Python/C++ code is the accompanying software for the paper PhyloWGS: Reconstructing subclonal composition and evolution from whole-genome sequencing of tumors, with authors Amit G. Deshwar, Shankar Vembu, Christina K. Yung, Gun Ho Jang, Lincoln Stein, and Quaid Morris.

Input files

The input to evolve.py is two tab-delimited text files -- one for SSM data and one for CNV data. Please see the files ssm_data.txt and cnv_data.txt included with PhyloWGS for examples.

To see how to generate ssm_data.txt and cnv_data.txt from a VCF file and Battenberg CNV file, please see the included parser.

ssm_data.txt:

  • id: identifier for each SSM. Identifiers must start at s0 and increment, so the first data row will have s0, the second row s1, and so forth.
  • gene: any string identifying the variant -- this need not be a gene name. <chr>_<pos> (e.g., 2_234577) works well.
  • a: number of reference-allele reads at the variant locus.
  • d: total number of reads at the variant locus.
  • mu_r: fraction of expected reference allele sampling from the reference population. E.g., if the tumor has an A->T somatic mutation at the locus, the genotype of the reference population should be AA. Thus, mu_r should be 1 - (sequencing error rate). Given the 0.001 error rate in Illumina sequencing, setting this column to 0.999 works well.
  • mu_v: fraction of expected reference allele sampling from variant population. Suppose an A->T somatic mutation occurred at the locus. mu_v always uses normal ploidy (i.e., the copy number in non-CNV regions). As humans are diploid, copy number will thus always be 2. So, the variant population genotype should be AT, meaning we will observe the reference allele with frequency 0.5 - (sequencing error rate). Given the 0.001 error rate in Illumina sequencing, setting this column to 0.499 works well.
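
The column rules above can be sketched in a few lines of Python. This is an illustrative helper, not part of PhyloWGS: the function names `make_ssm_rows` and `format_ssm_table` are hypothetical, the header row is assumed to use the column names listed above, and only a single sample with diploid, non-CNV loci is considered.

```python
# Hypothetical helper for building an ssm_data.txt-style table.
# Assumptions: one sample, a header row naming the six columns, and the
# 0.001 Illumina error rate quoted in the column descriptions.
COLUMNS = ["id", "gene", "a", "d", "mu_r", "mu_v"]
SEQ_ERROR_RATE = 0.001


def make_ssm_rows(variants, error_rate=SEQ_ERROR_RATE):
    """Build rows from (label, reference_reads, total_reads) tuples.

    IDs are assigned as s0, s1, ... in input order, as the format requires.
    """
    rows = []
    for index, (label, ref_reads, total_reads) in enumerate(variants):
        rows.append({
            "id": "s%d" % index,
            "gene": label,
            "a": ref_reads,
            "d": total_reads,
            # Reference population (genotype AA): almost every read is reference.
            "mu_r": round(1.0 - error_rate, 6),
            # Variant population (genotype AT): about half the reads are reference.
            "mu_v": round(0.5 - error_rate, 6),
        })
    return rows


def format_ssm_table(rows):
    """Return the rows as tab-delimited text with a header line."""
    lines = ["\t".join(COLUMNS)]
    for row in rows:
        lines.append("\t".join(str(row[name]) for name in COLUMNS))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    example = make_ssm_rows([("2_234577", 90, 100), ("5_1000", 40, 80)])
    print(format_ssm_table(example))
```

Compare any file produced this way against the bundled ssm_data.txt before using it, since that file is the authoritative example of the format.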

cnv_data.txt: Note that if you are running without any CNVs, this file should be empty. You can create the empty file via the command touch cnv_data.txt.

  • cnv: identifier for each CNV. Identifiers must start at c0 and increment, so the first data row will have c0, the second row c1, and so forth.
  • a: number of reference reads covering the CNV.
  • d: total number of reads covering the CNV. This will be affected by factors such as total copy number at the locus, sequencing depth, and the size of the chromosomal region spanned by the CNV.
  • ssms: SSMs that overlap with this CNV. Each entry is a comma-separated triplet consisting of SSM ID, maternal copy number, and paternal copy number. These triplets are separated by semicolons.
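
The ssms column is the least obvious of the four, so here is a minimal sketch of how its value might be built and read back. The helper names `format_ssms_field` and `parse_ssms_field` are hypothetical and not part of PhyloWGS; the sketch relies only on the description above (comma-separated triplets joined by semicolons).

```python
# Hypothetical helpers for the "ssms" column of cnv_data.txt.
# Each overlap is a triplet: (SSM ID, maternal copy number, paternal copy number).


def format_ssms_field(overlaps):
    """Join triplets as 'id,maternal,paternal' separated by semicolons."""
    return ";".join(
        "%s,%d,%d" % (ssm_id, maternal_cn, paternal_cn)
        for ssm_id, maternal_cn, paternal_cn in overlaps
    )


def parse_ssms_field(field):
    """Split a field produced by format_ssms_field back into triplets."""
    if not field:
        return []
    overlaps = []
    for triplet in field.split(";"):
        ssm_id, maternal_cn, paternal_cn = triplet.split(",")
        overlaps.append((ssm_id, int(maternal_cn), int(paternal_cn)))
    return overlaps


if __name__ == "__main__":
    field = format_ssms_field([("s0", 1, 2), ("s3", 0, 1)])
    print(field)
    print(parse_ssms_field(field))
```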

When running evolve.py, the random seed used for the run will be written to random_seed.txt in the current directory. To choose this seed, you may give the --random-seed <integer> option to evolve.py. If no random seed is specified, but the random_seed.txt file already exists in the current working directory, the seed stored in that file will be used. This behaviour lets you deterministically repeat runs by copying the random_seed.txt files from a previous batch.
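
The precedence described above can be sketched as follows. This is an illustration of the documented behaviour, not the code in evolve.py: the function name `resolve_seed` is hypothetical, and the fallback of drawing a fresh seed from `random.randrange` when neither source is available is an assumption.

```python
import os
import random


def resolve_seed(cli_seed=None, seed_path="random_seed.txt"):
    """Pick a seed using the documented precedence, then record it.

    1. A seed given on the command line wins.
    2. Otherwise, reuse the seed stored in seed_path if that file exists.
    3. Otherwise, draw a new seed (assumed behaviour for this sketch).
    """
    if cli_seed is not None:
        seed = int(cli_seed)
    elif os.path.exists(seed_path):
        with open(seed_path) as handle:
            seed = int(handle.read().strip())
    else:
        seed = random.randrange(2 ** 32)

    # Record whichever seed was chosen so that the run can be repeated.
    with open(seed_path, "w") as handle:
        handle.write("%d\n" % seed)
    return seed
```

Under this scheme, copying random_seed.txt into a fresh run directory and invoking the run without `--random-seed` reproduces the earlier seed.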

Running PhyloWGS

  1. Install dependencies.

  2. Compile the C++ file.

    g++ -o mh.o -O3 mh.cpp util.cpp `gsl-config --cflags --libs`
    
  3. Run PhyloWGS. Minimum invocation on sample data set:

    python2 evolve.py ssm_data.txt cnv_data.txt
    

    All options:

      usage: evolve.py [-h] [-b WRITE_BACKUPS_EVERY] [-S WRITE_STATE_EVERY]
                       [-k TOP_K_TREES] [-f CLONAL_FREQS] [-B BURNIN_SAMPLES]
                       [-s MCMC_SAMPLES] [-i MH_ITERATIONS] [-r RANDOM_SEED]
                       [-t TMP_DIR] [-p PARAMS_FILE]
                       ssm_file cnv_file
    
      Run PhyloWGS to infer subclonal composition from SSMs and CNVs
    
      positional arguments:
        ssm_file              File listing SSMs (simple somatic mutations, i.e.,
                              single nucleotide variants. For proper format, see
                              README.md.
        cnv_file              File listing CNVs (copy number variations). For proper
                              format, see README.md.
    
      optional arguments:
        -h, --help            show this help message and exit
        -b WRITE_BACKUPS_EVERY, --write-backups-every WRITE_BACKUPS_EVERY
                              Number of iterations to go between writing backups of
                              program state (default: 100)
        -S WRITE_STATE_EVERY, --write-state-every WRITE_STATE_EVERY
                              Number of iterations between writing program state to
                              disk. Higher values reduce IO burden at the cost of
                              losing progress made if program is interrupted.
                              (default: 10)
        -k TOP_K_TREES, --top-k-trees TOP_K_TREES
                              Output file to save top-k trees in text format
                              (default: top_k_trees)
        -f CLONAL_FREQS, --clonal-freqs CLONAL_FREQS
                              Output file to save clonal frequencies (default:
                              clonalFrequencies)
        -B BURNIN_SAMPLES, --burnin-samples BURNIN_SAMPLES
                              Number of burnin samples (default: 1000)
        -s MCMC_SAMPLES, --mcmc-samples MCMC_SAMPLES
                              Number of MCMC samples (default: 2500)
        -i MH_ITERATIONS, --mh-iterations MH_ITERATIONS
                              Number of Metropolis-Hastings iterations (default:
                              5000)
        -r RANDOM_SEED, --random-seed RANDOM_SEED
                              Random seed for initializing MCMC sampler (default:
                              None)
        -t TMP_DIR, --tmp-dir TMP_DIR
                              Path to directory for temporary files (default: None)
        -p PARAMS_FILE, --params-file PARAMS_FILE
                              JSON file listing run parameters, generated by the
                              parser (default: None)
    
  4. Generate JSON results.

    mkdir test_results
    cd test_results
    # To work with viewer in Step 5, the naming conventions used here must be
    # followed.
    # "example_data" is simply the name by which you want your results to be identified.
    python2 /path/to/phylowgs/write_results.py example_data ../trees.zip example_data.summ.json.gz example_data.muts.json.gz example_data.mutass.zip
    cd ..
    

    All options:

      usage: write_results.py [-h] [--include-ssm-names] [--min-ssms MIN_SSMS]
                              dataset_name tree_file tree_summary_output
                              mutlist_output mutass_output
    
      Write JSON files describing trees
    
      positional arguments:
        dataset_name         Name identifying dataset
        tree_file            File containing sampled trees
        tree_summary_output  Output file for JSON-formatted tree summaries
        mutlist_output       Output file for JSON-formatted list of mutations
        mutass_output        Output file for JSON-formatted list of SSMs and CNVs
                             assigned to each subclone
    
      optional arguments:
        -h, --help           show this help message and exit
        --include-ssm-names  Include SSM names in output (which may be sensitive
                             data) (default: False)
        --min-ssms MIN_SSMS  Minimum number or percent of SSMs to retain a subclone
                             (default: 0.01)
    
  5. View results.

    mv test_results /path/to/phylowgs/witness/data
    cd /path/to/phylowgs/witness
    gunzip data/*/*.gz
    python2 index_data.py
    python2 -m SimpleHTTPServer
    # Open http://127.0.0.1:8000 in your web browser. Note that, by
    # default, the server listens for connections from any host.
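
When scripting many runs, steps 3 and 4 above can be assembled programmatically. The sketch below only builds argument lists from the flags and file-naming convention shown in the usage text; the function names `evolve_command` and `write_results_command` are hypothetical, and nothing is executed unless you pass the result to a process launcher yourself.

```python
import os


def evolve_command(ssm_file, cnv_file, burnin=None, mcmc=None, seed=None):
    """Build the step 3 command line; omitted options keep evolve.py's defaults."""
    command = ["python2", "evolve.py"]
    if burnin is not None:
        command += ["--burnin-samples", str(burnin)]
    if mcmc is not None:
        command += ["--mcmc-samples", str(mcmc)]
    if seed is not None:
        command += ["--random-seed", str(seed)]
    return command + [ssm_file, cnv_file]


def write_results_command(phylowgs_dir, dataset_name, tree_file="../trees.zip"):
    """Build the step 4 command line using the viewer's naming convention."""
    script = os.path.join(phylowgs_dir, "write_results.py")
    return [
        "python2",
        script,
        dataset_name,
        tree_file,
        dataset_name + ".summ.json.gz",
        dataset_name + ".muts.json.gz",
        dataset_name + ".mutass.zip",
    ]


if __name__ == "__main__":
    print(" ".join(evolve_command("ssm_data.txt", "cnv_data.txt", seed=1)))
    print(" ".join(write_results_command("/path/to/phylowgs", "example_data")))
```

Each list can be handed to `subprocess.check_call`, with `cwd` set to the run directory for step 3 and to the results directory for step 4.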
    

Resuming a previous PhyloWGS run

If PhyloWGS is interrupted for any reason, you can resume the run by invoking evolve.py from the same directory as the previous run, without any command-line arguments:

    # Start initial run.
    python2 evolve.py ssm_data.txt cnv_data.txt

    # Hit CTRL+C to send SIGINT, halting run partway through.

    # Resume run:
    python2 evolve.py

License

Copyright (C) 2015 Quaid Morris

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.