## Code for Noah's Hypermutation Project<br><br/>
Run the code in this notebook to generate the files used for analysis and the figures themselves.  <br><br/>
The code locates its inputs and outputs through the paths listed in the file fileDirectory.txt.<br>
The first section of this notebook describes how to create the files for analysis; the second section creates the figures.  The final files are already included, so you can skip ahead to the figures if you would like.

Environment:<br>
Python 2.7.15 |Anaconda custom (64-bit)| (default, Dec 14 2018, 13:10:39)<br>
R version 3.5.1 (2018-07-02) -- "Feather Spray"<br>

**Key packages:**<br>
pandas 0.24.2<br>
jupyter 1.0.0<br>
jupyter-client 5.2.4<br>
jupyter-console 5.2.0<br>
jupyter-core 4.4.0<br>

In [2]:
import sys
import os

#TODO make 
sys.path.append('/Users/friedman/Desktop/hypermutationProjectFinal/scripts/utilityScripts')
import configuration_util
filePathDict = configuration_util.get_all_files_path_dict()
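
The cell above relies on <code>configuration_util.get_all_files_path_dict()</code> to map names such as <code>IMPACT_BASE_MAF</code> to the paths listed in fileDirectory.txt. Neither the loader nor the layout of that file is shown in this notebook, so the following is only a sketch of how such a lookup could work. It assumes a hypothetical layout of one <code>KEY&lt;TAB&gt;relative/path</code> entry per line; <code>parse_path_lines</code>, the project root and the example paths are illustrative names, not the project's actual API or files.

```python
import os


def parse_path_lines(lines, project_root):
    """Build a {KEY: path} dict from 'KEY<TAB>relative/path' lines.

    The tab-separated layout is an assumption made for illustration; the
    real fileDirectory.txt may be organised differently.
    """
    path_dict = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and comments
        key, rel_path = line.split('\t', 1)
        path_dict[key] = os.path.join(project_root, rel_path.strip())
    return path_dict


# Hypothetical entries, for illustration only
example_lines = [
    '# name<TAB>path relative to the project root',
    'IMPACT_BASE_MAF\tfiles/mafs/impact_base.maf',
    'HOTSPOT_DATA\tfiles/infoFiles/hotspots.tsv',
]
example_dict = parse_path_lines(example_lines, '/path/to/project')
print(example_dict['IMPACT_BASE_MAF'])
```

Keeping every path behind a named key means the later cells only ever refer to keys, so moving the project requires editing one file rather than every cell.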

<hr style="border:2px solid gray"> </hr>

## Files not created by me
### Files
* **CANCER_TYPE_INFO** information about TMB, cancer type and MSI status, generated by Chai
* **TCGA_CANCER_TYPE_INFO** information about TCGA cancer types, from the cluster
* **TCGA_MSI_SCORES** MSI scores for TCGA, from the cluster
* **GENE_LENGTH_INFO** information about coding sequence lengths for genes, from the cluster (possibly originally from Ensembl)
* **DEP_MAP_DATA** DepMap scores for all genes (CRISPR data), taken from https://ndownloader.figshare.com/files/22629068 (achilles_gene_effect.csv)
* **PHASING_DATA** generated by Alex G, using his phasing pipeline on IMPACT data
* **HOTSPOT_DATA** summary of all known hotspots from https://cancerhotspots.org
* **MICROSATELLITE_INFORMATION** summary of all microsatellite sites in IMPACT panel.  Generated by Craig
* **NUCLEOSOME_DYAD_POSITIONS** summary of nucleosome dyad positions, taken from Alex G, before that from Nuria's paper

### Mafs
* **IMPACT_BASE_MAF** impact filtered mutations from portal, annotated from November 2019
* **IMPACT_BASE_MAF_WITH_SYNONYMOUS** impact unfiltered mutations from portal
* **ALL_EXOME_MAF** maf combining all Exome recapture mutations and all TCGA mutations


<hr style="border:2px solid gray"> </hr>

# Generation of analysis files

## Files generated on the cluster

### Signatures

Signatures are computed with the script mutationSigUtils in mode 'runMutSigs', which calls the mutation signatures script found here: https://github.com/mskcc/mutation-signatures.  It adds trinucleotide information (if needed) and then runs the mutation signature algorithm.

To produce **IMPACT_SIGNATURE_DECOMPOSITIONS** run: <br>
<code>python friedman/myUtils/mutationSigUtils.py --inputMaf path/to/impact/maf --outputDir path/to/output/dir --outputFilename exomeRecaptureSignatures.tsv --mode runMutSigs
</code>

To produce **TCGA_SIGNATURE_DECOMPOSITIONS** run: <br>
<code>python friedman/myUtils/mutationSigUtils.py --inputMaf  --outputDir path/to/output/dir --outputFilename exomeRecaptureSignatures.tsv --mode runMutSigs
</code>

To produce **EXOME_RECAPTURE_SIGNATURE_DECOMPOSITIONS** run: <br>
<code>python friedman/myUtils/mutationSigUtils.py --inputMaf /juno/work/ccs/gongy/megatron_Jan5th/Result/cohort_level/mut_somatic.maf --outputDir path/to/output/dir --outputFilename exomeRecaptureSignatures.tsv --mode runMutSigs
</code>
<br>
To produce **CLONAL_VS_SUBCLONAL_MUTATIONAL_SIGNATURES** run: <br>
<code>python friedman/myUtils/mutationSigUtils.py --inputMaf  --outputDir path/to/output/dir --outputFilename exomeRecaptureSignatures.tsv --mode runMutSigs
</code>
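
The signature runs above all share the same four flags and differ only in their input maf and output name. A minimal sketch of a helper that assembles such a command is given here; it uses only the flags shown in the commands above, and <code>build_signature_command</code> is a hypothetical name rather than part of mutationSigUtils.

```python
def build_signature_command(input_maf, output_dir, output_filename,
                            script='friedman/myUtils/mutationSigUtils.py'):
    """Return the argument list for one mutationSigUtils run in 'runMutSigs' mode."""
    return ['python', script,
            '--inputMaf', input_maf,
            '--outputDir', output_dir,
            '--outputFilename', output_filename,
            '--mode', 'runMutSigs']


# Placeholder paths copied from the IMPACT command above
cmd = build_signature_command('path/to/impact/maf', 'path/to/output/dir',
                              'exomeRecaptureSignatures.tsv')
print(' '.join(cmd))
```

Building the command as a list makes it straightforward to give each run its own output filename before joining it into a string for the cluster.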

### All possible mutations in IMPACT panel and consequences

In [None]:
#On the cluster run:
#this command generates a vcf with all positions
cmdAllVcfPos = '/opt/common/CentOS_7-dev/bin/bcftools mpileup --skip-indels --fasta-ref /ifs/depot/pi/resources/genomes/GRCh37/fasta/b37.fasta --regions-file /ifs/depot/pi/resources/roslin_resources/targets/IMPACT468/b37/IMPACT468_b37_targets.bed --output impact468_snps.vcf /ifs/res/pi/Proj_09221.6f74bdee-146b-11e9-a68d-645106ef9e4c/bam/s_C_P3U7DR_N001_d.Group3.rg.md.abra.printreads.bam'

#this command adds in every possible alt allele
cmdAddAltAlleles = 'python add_all_variants_to_vcf.py ~/impact468_snps.vcf ~/friedman/myAdjustedDataFiles/impact468_all_possible_snps.vcf'

#use commands of the form below to cut the vcf up into chromosome-sized chunks
cmdSplitByChromosome = 'bcftools view /home/friedman/friedman/myAdjustedDataFiles/impact468_all_possible_snps_sorted.vcf.gz --regions 4 -o /juno/work/taylorlab/friedman/myAdjustedDataFiles/allPossibleMutationsInIMPACT/vcfs/chr4AllMuts.vcf'

#convert the vcfs to mafs by running vcf2Maf on all possible snps (note: writes a .sh file)
cmdVcf2Maf = 'python /juno/work/taylorlab/friedman/myUtils/run_vcf2Maf_on_all_possibleSnpsMaf.py'

#annotate the mafs with the maf to maf script
cmdMaf2Maf = 'python /juno/work/taylorlab/friedman/myUtils/run_maf2Maf_on_all_possibleSnpsMaf.py'

#this results in one maf per chromosome of all possible mutations, found in the following directory
#(they are included in my project under files/expectedMutationInfo/)
annotatedMafDir = '/juno/work/taylorlab/friedman/myAdjustedDataFiles/allPossibleMutationsInIMPACT/annotatedMafs'
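
The step that adds every possible alt allele is carried out by <code>add_all_variants_to_vcf.py</code>, whose source is not shown here. The idea can be sketched as follows, under the assumption that each input line is a standard tab-separated VCF record with a single-base REF: every position is emitted three times, once per alternate base. <code>expand_alt_alleles</code> and the example record are hypothetical.

```python
BASES = ('A', 'C', 'G', 'T')


def expand_alt_alleles(vcf_line):
    """Return one VCF data line per possible alternate base at this position."""
    fields = vcf_line.rstrip('\n').split('\t')
    ref = fields[3].upper()  # VCF column 4 is REF
    if ref not in BASES:
        return []  # skip ambiguous or multi-base reference alleles
    expanded = []
    for alt in BASES:
        if alt != ref:
            new_fields = list(fields)
            new_fields[4] = alt  # VCF column 5 is ALT
            expanded.append('\t'.join(new_fields))
    return expanded


# Made-up record in the style of mpileup output, where ALT is '<*>'
line = '4\t55593464\t.\tA\t<*>\t0\t.\t.'
for out in expand_alt_alleles(line):
    print(out)
```

Applied to every record in the mpileup vcf, this yields three SNVs per targeted base, which is the full set that the later vcf2maf and annotation steps operate on.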

### Pentanucleotide annotated mafs

<hr style="border:2px solid gray"> </hr>

# Files generated locally

### Classification of hypermutated tumors


In [None]:
cmd = ' '.join(['python', filePathDict['SCRIPT_DEFINE_HYPERMUTATION_THRESHOLDS'])
os.popen(cmd)
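
<code>os.popen</code> hands back a pipe and does not raise if the script exits with an error, so a failed run can go unnoticed. As an alternative, here is a sketch of a small wrapper around the standard library's <code>subprocess.check_call</code>, which waits for the script to finish and raises on a non-zero exit status. <code>run_command</code> is a hypothetical helper, not part of the project's utility scripts.

```python
import subprocess
import sys


def run_command(cmd):
    """Run a command given as a list of arguments and wait for it to finish.

    Returns 0 on success; raises subprocess.CalledProcessError otherwise.
    """
    return subprocess.check_call(cmd)


# Demonstration with a trivial inline program instead of a project script
exit_code = run_command([sys.executable, '-c', 'print("script finished")'])
print(exit_code)
```

In this notebook the call would take the form <code>run_command(['python', filePathDict['SCRIPT_DEFINE_HYPERMUTATION_THRESHOLDS']])</code>.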

### Calculation of expected rate of hotspot, oncogenic and truncating mutation

In [None]:
#calculate signature propensity to cause drivers
cmdGet

ALL_POSSIBLE_MUTATION_SUMMARY

**Estimating the expected number of consequential mutations: Nucleotide context method**<br/>
Enumerate all possible mutation mafs
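
The exact calculation is not spelled out in this notebook. One way a nucleotide context method could work is sketched here under stated assumptions: for each context, take the probability that a mutation lands in that context and multiply it by the fraction of all possible mutations in that context that are consequential (hotspot, oncogenic or truncating), then sum over contexts. The function name, the input layout and the toy numbers are all hypothetical.

```python
def expected_consequential_fraction(context_probs, possible_counts):
    """Expected fraction of mutations that are consequential.

    context_probs: {context: probability that a mutation falls in that context}
    possible_counts: {context: (n_consequential, n_total)} counted over all
        possible mutations in the panel.
    """
    total_prob = float(sum(context_probs.values()))
    expected = 0.0
    for context, prob in context_probs.items():
        n_consequential, n_total = possible_counts.get(context, (0, 0))
        if n_total == 0:
            continue  # no possible mutation in this context
        expected += (prob / total_prob) * (float(n_consequential) / n_total)
    return expected


# Toy numbers, for illustration only
toy_probs = {'A[C>T]G': 0.75, 'T[C>A]T': 0.25}
toy_counts = {'A[C>T]G': (10, 100), 'T[C>A]T': (20, 100)}
fraction = expected_consequential_fraction(toy_probs, toy_counts)
print(fraction)
```

Multiplying this fraction by a tumor's mutation count would give an expected number of consequential mutations for that tumor under the same assumptions.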

### Identifying tumor subclones and copy number
<br>
On our compute cluster we ran the FACETS algorithm (parameters still to be documented).<br>
On the cluster we then summarize these results through a shared codebase: 
<code>python myUtils/create_cncf_or_rdata_file_list.py</code><br><br>
then run <code>python prepare_maf_anno_commands.py</code><br><br>
then run <code>/juno/work/taylorlab/friedman/myUtils/runMafAnno.sh</code> which calls myUtils/runAnnotateMaf.R<br><br> 
then combine results with 
<code>python /juno/work/taylorlab/friedman/myUtils/concat_maf_util.py /juno/work/taylorlab/friedman/myAdjustedDataFiles/mafAnnoRuns/annotatedMafs /juno/work/taylorlab/friedman/myAdjustedDataFiles/mafAnnoRuns/combined_cncf_annotated.maf</code>
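
The final combine step takes a directory of annotated mafs and writes one merged maf. The source of <code>concat_maf_util.py</code> is not shown, so this is only a sketch of the general idea: it assumes every maf in the directory is tab-separated with identical columns and may start with <code>#</code> comment lines. <code>concat_mafs</code> and the toy files are hypothetical.

```python
import glob
import os
import tempfile


def concat_mafs(maf_dir, out_path, pattern='*.maf'):
    """Concatenate the mafs in maf_dir into out_path, writing the header once.

    Returns the number of data rows written.
    """
    header = None
    n_rows = 0
    with open(out_path, 'w') as out_handle:
        for maf_path in sorted(glob.glob(os.path.join(maf_dir, pattern))):
            with open(maf_path) as in_handle:
                # drop '#'-prefixed comment lines and blank lines
                lines = [l.rstrip('\n') for l in in_handle
                         if l.strip() and not l.startswith('#')]
            if not lines:
                continue  # empty file
            if header is None:
                header = lines[0]
                out_handle.write(header + '\n')
            elif lines[0] != header:
                raise ValueError('Column mismatch in ' + maf_path)
            for line in lines[1:]:
                out_handle.write(line + '\n')
                n_rows += 1
    return n_rows


# Two toy mafs, for illustration only
demo_dir = tempfile.mkdtemp()
for name, row in [('a.maf', 'TP53\tsample_1'), ('b.maf', 'KRAS\tsample_2')]:
    with open(os.path.join(demo_dir, name), 'w') as handle:
        handle.write('#version 2.4\nHugo_Symbol\tTumor_Sample_Barcode\n' + row + '\n')
demo_out = os.path.join(tempfile.mkdtemp(), 'combined.maf')
demo_rows = concat_mafs(demo_dir, demo_out)
print(demo_rows)
```

The output is written outside the input directory, as in the command above, so the merged file is never picked up as one of its own inputs.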

In [None]:
#adjust clonality calls for hypermutated tumors with flat genomes:
#this command will provide adjusted clonality calls for the cases with flat genome
#this command takes ~1hr to run
cmd = ' '.join(['python', filePathDict['SCRIPT_ADJUST_CLONALITY_CALLS'],
                filePathDict['files/mafs/combined_cncf_annotated.maf'], filePathDict['IMPACT_MAF_WITH_ADJUSTED_CLONALITY_ANNOTATION'], 'IMPACT'])
os.popen(cmd)

### DNDS ANALYSIS

In [None]:
#todo make sure the input and output commands are correct here
#make sure this script properly adjusts the file to run dnds
cmd = ' '.join(['Rscript', filePathDict['SCRIPT_RUN_DNDS_CV'], filePathDict['IMPACT_BASE_MAF_WITH_SYNONYMOUS']])

#results are written to:
''
#TODO THE DNDS ANALYSIS script now has an annoying software bug


**Mutation Attribution**<br><br/>
Attributes mutations to signatures as done in Alex's paper.  We use different thresholds for attribution to account for the different characteristics of hypermutated cases.  Note: this step is not currently used, but it is included here for completeness.

In [4]:
cmd = ' '.join(['python', filePathDict['SCRIPT_ATTRIBUTE_MUTATIONS_TO_SIGNATURES'],
                filePathDict['SIGNATURE_SPECTRUM'], '.', '1', '10',
                filePathDict['IMPACT_SIGNATURE_DECOMPOSITIONS'], 'attributedMutations.tsv',
                'agingIsAlwaysPresent', 'doSmokingCorrection'])

print(cmd)

python /Users/friedman/Desktop/hypermutationProjectFinal/scripts/utilityScripts/attribute_mutations_to_signatures.py /Users/friedman/Desktop/hypermutationProjectFinal/files/infoFiles/Stratton_signatures30.txt . 1 10 /Users/friedman/Desktop/hypermutationProjectFinal/files/infoFiles/impactSignatureCalls_Nov20_2019.tsv attributedMutations.tsv agingIsAlwaysPresent doSmokingCorrection
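
The attribution script itself is not reproduced in this notebook. A common way to attribute a single mutation is sketched here as an illustration, not as the script's actual logic: weight each signature's probability of producing the mutation's trinucleotide context by that signature's exposure in the sample, normalise, and accept the top signature only if its share passes a threshold. The function name, the threshold and every number are hypothetical.

```python
def attribute_mutation(context, exposures, spectra, min_posterior=0.5):
    """Return (signature, posterior) for the most likely source of a mutation.

    Returns (None, posterior) when no signature is likely enough.
    exposures: {signature: fraction of the sample's mutations}
    spectra: {signature: {context: P(context | signature)}}
    """
    weights = {}
    for sig, exposure in exposures.items():
        weights[sig] = exposure * spectra[sig].get(context, 0.0)
    total = sum(weights.values())
    if total == 0:
        return None, 0.0
    best_sig = max(weights, key=weights.get)
    posterior = weights[best_sig] / total
    if posterior < min_posterior:
        return None, posterior
    return best_sig, posterior


# Toy spectra and exposures, for illustration only
toy_spectra = {
    'Signature.1': {'A[C>T]G': 0.30, 'T[C>A]T': 0.01},
    'Signature.10': {'A[C>T]G': 0.02, 'T[C>A]T': 0.40},
}
toy_exposures = {'Signature.1': 0.2, 'Signature.10': 0.8}
print(attribute_mutation('T[C>A]T', toy_exposures, toy_spectra))
```

Raising <code>min_posterior</code> makes attribution stricter, which is one way a different threshold could be applied to hypermutated cases.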


### Phasing
Results are generated using the methods described in Alex Gorelick's paper: https://www.nature.com/articles/s41586-020-2315-8


**Constructing Phylogenetic Trees**

# Generate Figures

**FIGURE 1**<br>
Run all cells in <code>scripts/figure1/make_figure1.ipynb</code><br>
Then run <code>scripts/figure1/plotFigure1.R</code>

**FIGURE 1S**<br>
Run all cells in <code>scripts/figure1/make_figure1_supplementary_figures.ipynb</code><br>
Then run <code>scripts/figure1/plotSupplementaryFiguresFig1.R</code>

**FIGURE 2**<br>
Run all cells in <code>scripts/figure2/make_figure2.ipynb</code><br>
Then run <code>scripts/figure2/plotFigure2.R</code>

**FIGURE 2S**<br>
Run all cells in <code>scripts/figure2/make_figure2_supplementary_figures.ipynb</code><br>
Then run <code>scripts/figure2/plotSupplementaryFiguresFig2.R</code>

**FIGURE 3**<br>
Run all cells in <code>scripts/figure3/make_figure3.ipynb</code><br>
Then run <code>scripts/figure3/plotFigure3.R</code>

**FIGURE 3S**<br>
Run all cells in <code>scripts/figure3/make_figure3_supplementary_figures.ipynb</code><br>
Then run <code>scripts/figure3/plotSupplementaryFiguresFig3.R</code>
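
Each figure follows the same two steps: execute a notebook, then run an R plotting script. For anyone who prefers to script this rather than open each notebook, here is a sketch that builds the two commands, assuming <code>jupyter nbconvert</code> and <code>Rscript</code> are both on the PATH. <code>figure_commands</code> is a hypothetical helper, shown with the Figure 1 paths.

```python
def figure_commands(notebook_path, r_script_path):
    """Return the two commands that regenerate one figure, in order."""
    return [
        # execute every cell and save the results back into the notebook
        ['jupyter', 'nbconvert', '--to', 'notebook', '--execute', '--inplace',
         notebook_path],
        # then draw the figure from the files the notebook wrote
        ['Rscript', r_script_path],
    ]


commands = figure_commands('scripts/figure1/make_figure1.ipynb',
                           'scripts/figure1/plotFigure1.R')
for cmd in commands:
    print(' '.join(cmd))
```

Each list can be passed to <code>subprocess.check_call</code> in turn, so that the R script only runs if the notebook executed without errors.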
