Scripts supporting identification of genomic features affecting survival time in cancer.
The easiest way to get started is to use requirements.txt to set up a conda environment with the relevant packages.
$ conda create --name <env> --file requirements.txt
The code in this repo was built with Python 2.7, pandas 0.18, numpy 1.10, and scipy 1.0.0.
Use run.py to download the relevant data from the public GCS bucket and perform univariate analyses for copy number and mutation.
run.py takes two optional arguments: -o, the folder in which to store data and analysis output (default: ., the current directory), and -p, the number of parallel workers to use for the analysis (default: none, which runs the analysis sequentially).
With -p 4, run.py takes ~12 hours on a 2017 MacBook Pro.
$ python run.py -p 4 -o $output_directory
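The two optional arguments and the sequential fallback can be pictured with a small sketch. This is not the code from run.py: only the -o and -p flags come from the text above, while the names parse_args, run_all, analyze_stub, output_directory, and parallel_workers are hypothetical, and the use of argparse and multiprocessing is an assumption.

```python
import argparse
import multiprocessing


def parse_args(argv=None):
    """Parse the two optional arguments (hypothetical stand-in for run.py)."""
    parser = argparse.ArgumentParser(description='Hypothetical run.py-style CLI')
    # -o: folder for data and analysis output; defaults to the current directory.
    parser.add_argument('-o', dest='output_directory', default='.')
    # -p: number of parallel workers; None means the analysis runs sequentially.
    parser.add_argument('-p', dest='parallel_workers', type=int, default=None)
    return parser.parse_args(argv)


def analyze_stub(cancer_type):
    """Placeholder for one unit of analysis; the real work is not shown here."""
    return 'analyzed ' + cancer_type


def run_all(worker, tasks, parallel_workers=None):
    """Apply worker to each task, using a process pool only when requested."""
    if not parallel_workers:
        # No -p given: plain sequential loop.
        return [worker(task) for task in tasks]
    pool = multiprocessing.Pool(parallel_workers)
    try:
        return pool.map(worker, tasks)
    finally:
        pool.close()
        pool.join()


# Example with an explicit argument list; no -p, so this runs sequentially.
# 'BRCA' and 'LUAD' are placeholder task names.
args = parse_args(['-o', 'results'])
results = run_all(analyze_stub, ['BRCA', 'LUAD'], args.parallel_workers)
```

Passing the worker count straight through from the parser keeps the sequential and parallel paths behind one function, so the same task list can be run either way.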
- cbioportal - scripts used to analyze cBioPortal data
- cnv-and-mutations - scripts for analyzing CNAs and mutations together
- common-case-zscores - computes zscores for every row in a "common" file. Common files have genes in rows and patients in columns.
- common - the common set of utilities and tools used in analysis.
  - analysis.py has the meat of the Cox analysis
  - mutation_base.py has the repeatable processing required to turn raw mutation data into usable dataframes.
- copy-number-analysis - given a copy number file, data about gene/location, and TCGA clinical data, calculate zscores for copy number genes
  - process_copy_numbers_to_genes.py has the repeatable processing to turn raw copy number data into usable dataframes.
- data-munging - contains scripts for miscellaneous small processing tasks: one-off zscores, density plot generation, etc
- fdr - scripts for performing false discovery correction
- geo - scripts for analyzing zscores for GEO files
- make-pancan - given a filetype/platform, take all the per-cancer-type zscore files and produce a file with genes in rows, and cancer types in columns
- mutation-analysis - given a TCGA clinical file and mutation data from the same set of patients, calculate zscores and Kaplan-Meier curves for genes mutated in a sufficient number of patients
- pan-platform - scripts for creating pan-platform TCGA files
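The fdr folder above is described only as performing false discovery correction; the method is not named. As one common possibility, the sketch below converts zscores to two-sided p-values under a standard normal assumption and applies the Benjamini-Hochberg procedure. Both function names are hypothetical and are not taken from the repo.

```python
import math


def zscore_to_pvalue(z):
    """Two-sided p-value for a zscore, assuming a standard normal null."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        candidate = pvalues[idx] * m / float(rank)
        running_min = min(running_min, candidate)
        adjusted[idx] = running_min
    return adjusted


# Illustrative zscores only; these are not results from the analysis.
zscores = [2.8, -0.5, 1.2, -3.1]
pvalues = [zscore_to_pvalue(z) for z in zscores]
qvalues = benjamini_hochberg(pvalues)
```

Genes whose adjusted value falls below a chosen threshold would then be reported; the threshold itself is a choice this sketch leaves open.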