# Steps to reproduce Hartwig dataset analysis

1. Preprocessing of pre-biopsy treatment

For each sample, add columns annotating the treatments of interest (platinum, alkylating, immunotherapy). Each column is set to 1 if the treatment appears in the sample's "mechanism" column and to 0 otherwise.

> Script: ```./lesion_segregation/metastatic_tumors/scripts/aggregate_prebiopsy_treatment.R```

> Output file:  ```./lesion_segregation/metastatic_tumors/results/pre_biopsy_drugs.agg.txt ```
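
The 0/1 annotation described in this step can be sketched in Python as follows. This is a minimal illustration, not the logic of `aggregate_prebiopsy_treatment.R`: the keyword lists, the `sampleId` column name and the tab-separated layout are assumptions made for the example.

```python
import csv
import io

# Hypothetical keyword lists; the actual script may match different terms.
TREATMENT_KEYWORDS = {
    "platinum": ("platinum",),
    "alkylating": ("alkylating",),
    "immunotherapy": ("immunotherapy",),
}


def annotate_treatments(rows, mechanism_column="mechanism"):
    """Add one 0/1 column per treatment of interest to each sample row."""
    annotated = []
    for row in rows:
        # Missing or empty mechanism values are treated as "no treatment".
        mechanism = (row.get(mechanism_column) or "").lower()
        new_row = dict(row)
        for treatment, keywords in TREATMENT_KEYWORDS.items():
            new_row[treatment] = int(any(k in mechanism for k in keywords))
        annotated.append(new_row)
    return annotated


# Toy input (assumed layout: one row per sample, tab-separated).
example = io.StringIO(
    "sampleId\tmechanism\n"
    "S1\tPlatinum/Antimetabolite\n"
    "S2\tAlkylating\n"
    "S3\t\n"
)
rows = list(csv.DictReader(example, delimiter="\t"))
for row in annotate_treatments(rows):
    print(row["sampleId"], row["platinum"], row["alkylating"], row["immunotherapy"])
```

Matching is done on lower-cased substrings, so a sample that received several drug classes gets a 1 in each matching column.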


2. Calculate aggregated statistics on the number of bi- and multi-allelic sites in each tumor.
Only SNPs are included in this analysis.

This step requires VCF files with somatic mutations from the Hartwig database.

> Script:  ```./lesion_segregation/metastatic_tumors/scripts/multiallelic_sites_counts.R ```

> Results:  ```./lesion_segregation/metastatic_tumors/results/ALL_Hartwig_bi_multi.txt ```
            ```./lesion_segregation/metastatic_tumors/results/ALL_Hartwig_bi_multi.by_chr.txt ```
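
As an illustration of the counting in this step, the sketch below tallies bi- and multi-allelic single-nucleotide sites from VCF text lines. It assumes that a bi-allelic site carries exactly one alternative allele and a multi-allelic site carries two or more, whether they are listed in one record or in several records at the same position. The function name is hypothetical and the filtering in `multiallelic_sites_counts.R` may differ.

```python
from collections import defaultdict


def count_bi_multi(vcf_lines):
    """Return (n_biallelic, n_multiallelic) single-nucleotide sites."""
    alts_by_site = defaultdict(set)
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and VCF header lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alt_field = fields[0], fields[1], fields[3], fields[4]
        for alt in alt_field.split(","):
            # Keep single-nucleotide substitutions only (no indels or MNVs).
            if len(ref) == 1 and len(alt) == 1 and ref in "ACGT" and alt in "ACGT":
                alts_by_site[(chrom, pos)].add(alt)
    n_bi = sum(1 for alts in alts_by_site.values() if len(alts) == 1)
    n_multi = sum(1 for alts in alts_by_site.values() if len(alts) >= 2)
    return n_bi, n_multi


example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tC\tT\t.\tPASS\t.",
    "1\t200\t.\tG\tA,T\t.\tPASS\t.",   # two ALT alleles in one record
    "2\t300\t.\tC\tA\t.\tPASS\t.",
    "2\t300\t.\tC\tG\t.\tPASS\t.",     # same position, second record
    "2\t400\t.\tAT\tA\t.\tPASS\t.",    # deletion, ignored
]
print(count_bi_multi(example))  # (1, 2)
```

Grouping by `(chrom, pos)` instead of by chromosome alone would give the per-sample totals; keeping the chromosome as a separate key would give a per-chromosome breakdown.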


3. Estimate enrichment in the number of observed multi-allelic sites and correlate it with treatment type.

> Script: ```./lesion_segregation/metastatic_tumors/scripts/multi_enrichment_vs_treatment.R ```

> Results: ```./lesion_segregation/metastatic_tumors/results/all_enriched.chemo_alkyl_immuno.txt ```

> Plots: ```./lesion_segregation/metastatic_tumors/plots/Hartwig_multiallelic_enriched_vs_treatment.jpeg```


4. Select multi-allelic sites, check whether a germline SNP lies close by, and annotate each site with its sequence context.

This step requires per-sample VCF files with germline mutations from the Hartwig database.

> Script: ``` ./lesion_segregation/metastatic_tumors/scripts/check_germline_close2multisite.R ```
           
> Results: ```./lesion_segregation/metastatic_tumors/results/multi_sites_by_sample_enriched.chemo_alkyl_immuno ```
            ```./lesion_segregation/metastatic_tumors/results/multi_sites_by_sample_NONenriched.chemo_alkyl_immuno ```

5. Select bi-allelic sites.

> Script: ```./lesion_segregation/data_analysis/metastatic_tumors/scripts/select_biallelic_sites.py ```
           
> Results: ```./lesion_segregation/data_analysis/metastatic_tumors/results/bi_sites_by_sample_NONenriched.chemo_alkyl_immuno```
```./lesion_segregation/data_analysis/metastatic_tumors/results/bi_sites_by_sample_enriched.chemo_alkyl_immuno```

6. Calculate trinucleotide (3nt) spectra of bi- and multi-allelic sites, separately for samples enriched and not enriched in multi-allelic sites.
Run the scripts separately for platinum treatment and alkylating treatment; the treatment is selected with the "subset" variable in the script.

> Script: ```./lesion_segregation/data_analysis/metastatic_tumors/scripts/3nt_spectrum_of_bi_sites.R```

           
> Results: 
```./lesion_segregation/data_analysis/metastatic_tumors/results/bi_spectrum_by_sample_enriched.chemo_alkyl_immuno```
```./lesion_segregation/data_analysis/metastatic_tumors/results/bi_spectrum_by_sample_NONenriched.chemo_alkyl_immuno```
```./lesion_segregation/data_analysis/metastatic_tumors/results/multi_spectrum```

> Plots: 
```./lesion_segregation/data_analysis/metastatic_tumors/plots/biallelic_spectrum_enriched_vs_NONenriched_Alkylating.jpeg```
```./lesion_segregation/data_analysis/metastatic_tumors/plots/biallelic_spectrum_enriched_vs_NONenriched_Platinum.jpeg```
```./lesion_segregation/data_analysis/metastatic_tumors/plots/multiallelic_spectrum_enriched_vs_NONenriched_Alkylating.jpeg```
```./lesion_segregation/data_analysis/metastatic_tumors/plots/multiallelic_spectrum_enriched_vs_NONenriched_Platinum.jpeg```
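
To make the notion of a trinucleotide spectrum concrete, here is a minimal Python sketch that folds each substitution onto its pyrimidine-centred form and returns channel frequencies. It assumes each site is given as a pair of the 3-nt reference context (mutated base in the middle) and the alternative allele; the function names and the `A[C>T]G` label format are choices made for this example, not taken from the R scripts.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def channel(context, alt):
    """Return a pyrimidine-centred label such as 'A[C>T]G'."""
    context, alt = context.upper(), alt.upper()
    if context[1] in "AG":
        # Purine reference base: report the substitution on the other strand.
        context, alt = revcomp(context), revcomp(alt)
    return f"{context[0]}[{context[1]}>{alt}]{context[2]}"


def spectrum(sites):
    """Map (context, alt) pairs to the fraction of sites in each channel."""
    counts = Counter(channel(ctx, alt) for ctx, alt in sites)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}


sites = [("ACG", "T"), ("CGT", "A"), ("TTA", "C")]
print(spectrum(sites))
```

Here `("CGT", "A")` is a G>T change on the reference strand, so it is folded to `A[C>T]G` and counted together with `("ACG", "T")`.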


7. Compute the tri-allelic spectra expected under independence (based on bi-allelic frequencies) and expected from the signatures of known treatments.
For the signatures, run the script separately for platinum and alkylating treatments; the treatment is set with the "treatment" variable in the script.

> Scripts: 
```./lesion_segregation/data_analysis/metastatic_tumors/scripts/compute_expected_triallelic_spectrum.py```
```./lesion_segregation/data_analysis/metastatic_tumors/scripts/compute_triallelic_signatures.py```

> Output:
```./lesion_segregation/data_analysis/metastatic_tumors/results/expected_triallelic_spectrum/```
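
One possible reading of "expected under independence" is sketched below: if the two alternative alleles at a tri-allelic site arise independently, the weight of an allele pair in a given context is proportional to the product of the two bi-allelic frequencies for that context. This is an assumption made for illustration and may not match `compute_expected_triallelic_spectrum.py`; the function name and the optional `context_sizes` argument (number of genomic positions per context) are hypothetical.

```python
from itertools import combinations


def expected_triallelic(bi_counts, context_sizes=None):
    """Expected tri-allelic spectrum from bi-allelic counts.

    bi_counts: {(context, alt): count} for bi-allelic sites.
    context_sizes: optional {context: number of genomic positions};
        contexts that are missing are given size 1 (no correction).
    Returns {(context, (alt1, alt2)): probability}, summing to 1.
    """
    total = sum(bi_counts.values())
    by_context = {}
    for (ctx, alt), n in bi_counts.items():
        by_context.setdefault(ctx, {})[alt] = n / total

    weights = {}
    for ctx, freqs in by_context.items():
        size = context_sizes.get(ctx, 1) if context_sizes else 1
        # Independence assumption: pair weight = product of the two frequencies.
        for a, b in combinations(sorted(freqs), 2):
            weights[(ctx, (a, b))] = freqs[a] * freqs[b] / size

    norm = sum(weights.values())
    return {key: w / norm for key, w in weights.items()} if norm else {}


bi_counts = {("ACG", "T"): 6, ("ACG", "A"): 2, ("ACG", "G"): 2}
for key, prob in sorted(expected_triallelic(bi_counts).items()):
    print(key, round(prob, 3))
```

The resulting distribution can then be compared with the observed tri-allelic spectrum; contexts with only one observed alternative allele contribute no pairs.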


8. Compare the observed tri-allelic spectrum with the spectrum expected under independence or expected from signatures.

> Scripts:
```./lesion_segregation/data_analysis/metastatic_tumors/scripts/compare_observed_vs_expected_triallelic.Alkylating.R```
```./lesion_segregation/data_analysis/metastatic_tumors/scripts/compare_observed_vs_expected_triallelic.Platinum.R```

> Output:
```./lesion_segregation/data_analysis/metastatic_tumors/plots/compare_multiallelic_expected_vs_observed_*```

9. Phase multi-allelic sites using nearby SNPs.
This step requires mini-BAM files covering the regions around multi-allelic sites.

> Script: ```./lesion_segregation/data_analysis/metastatic_tumors/scripts/phase_multisites.R ```

> Output:

10. Estimate LAD for samples with >= 5 multiallelic variants.

> Scripts: ```./lesion_segregation/data_analysis/metastatic_tumors/scripts/permutations_Hartwig.R```

> Output: ```./lesion_segregation/data_analysis/metastatic_tumors/plots/LAD/```