# Steps to reproduce the analysis of mouse data

### LAD inference

1. Create input files for HMMs

Takes as input the processed data from Aitken et al., Nature 2020, in the ".nodMat" format, which contains a list of mutations per sample.
Keeps only SNVs. For HMMmulti, adds to each SNV a classification based on the presence of several alternative variants: "M" if more than one alternative variant is present with the support of at least 3 reads each, "B" otherwise. For HMMas, adds a column with the type of substitution based on the reference nucleotide ("A>N", "T>N", "G>N" or "C>N"). Produces one file per sample with the ".hmm" extension.

Example input and output files are provided in the corresponding folders.
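
As an illustration of the two annotations added in this step, the following Python sketch applies the "M"/"B" rule and derives the substitution type. It is not the logic of `HMM_preprocessing.pl` itself: the dictionary of per-allele read counts and both function names are hypothetical, since the column layout of ".nodMat" is not described here.

```python
from typing import Dict

# Threshold stated in the step description: at least 3 supporting reads per alternative variant.
MIN_READS = 3


def classify_multiallelic(alt_read_counts: Dict[str, int], min_reads: int = MIN_READS) -> str:
    """Return "M" if more than one alternative allele has >= min_reads reads, else "B".

    alt_read_counts is a hypothetical representation: alternative allele -> supporting reads.
    """
    supported = [allele for allele, reads in alt_read_counts.items() if reads >= min_reads]
    return "M" if len(supported) > 1 else "B"


def substitution_type(ref: str) -> str:
    """Return the substitution class keyed on the reference nucleotide, e.g. "C>N"."""
    ref = ref.upper()
    if ref not in {"A", "C", "G", "T"}:
        raise ValueError(f"unexpected reference nucleotide: {ref!r}")
    return f"{ref}>N"


# Example: two alternative alleles, both supported by at least 3 reads.
label = classify_multiallelic({"T": 5, "G": 3})
sub = substitution_type("c")
```

A site with a second alternative allele supported by fewer than 3 reads would be labelled "B" under this rule.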

> Script: ```LAD/scripts/HMM_preprocessing.pl```

> Output: ```LAD/results/HMM/input_HMM/*.hmm```

> Description of output files:  ```LAD/descriptions/README_inputHMMfile_description```

2. Run two separate HMMs: one for multiallelic sites and one for the WC asymmetry

> Script:  ```LAD/scripts/run_2HMMS.R <sampleID>.hmm```

* It is better to run samples in parallel, as each sample can take around 10 minutes.
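
One possible way to run samples in parallel on a single machine is sketched below in Python. It assumes that the script is invoked as `Rscript LAD/scripts/run_2HMMS.R <sampleID>.hmm` from the repository root and that the input files are in `LAD/results/HMM/input_HMM/`; `build_command`, `run_all` and the worker count are illustrative names and values, not part of the repository.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import List

SCRIPT = "LAD/scripts/run_2HMMS.R"        # script path as listed in this README
INPUT_DIR = "LAD/results/HMM/input_HMM"   # output folder of step 1


def build_command(hmm_file: Path) -> List[str]:
    """Build the command line for one sample.

    Assumption: the script accepts the path to the .hmm file as its only argument.
    """
    return ["Rscript", SCRIPT, str(hmm_file)]


def run_all(input_dir: str = INPUT_DIR, workers: int = 4) -> List[int]:
    """Run the HMM script on every .hmm file, a few samples at a time.

    Threads are sufficient here because each worker only waits for an external process.
    Returns the exit code of each run, in sorted file order.
    """
    files = sorted(Path(input_dir).glob("*.hmm"))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda f: subprocess.run(build_command(f)).returncode, files))
```

Calling `run_all(workers=8)` from the repository root would process up to eight samples at once; a non-zero value in the returned list marks a sample whose run failed.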

> Output:  ```LAD/results/HMM/output_HMM/ ```

> Description of output:  ```LAD/descriptions/README_outputHMM_description ```

3. Aggregate HMM results

> Script: ```LAD/scripts/aggregate_hmm_summary.pl ```

> Results: ```LAD/results/HMM/HMM_summary.tsv ```

> Description of the output: ```LAD/descriptions/README_outputHMM_description ```

4. Fit the VAF distribution as a mixture of 2 normal distributions (clonal and subclonal mutations) and run the HMM for WC asymmetry with 3 states separately for clonal and subclonal mutations.

> Script: ```LAD/HMM_ploidy.R <sampleID>.hmm```

* It can be run in parallel for different samples.

    An example script for running it in parallel with SLURM is provided here (update the input path for your directory):

    ```LAD/HMM_ploidy.sh```

> Results: ```LAD/results/HMM/HMM_ploidy/ ```

> Description of the output: ```LAD/descriptions/README_HMM_ploidy_description ```

5. Aggregate the results of the previous step (HMM_ploidy) and classify tumors as symmetric, diploid or tetraploid.

> Script: ```LAD/scripts/aggregate_hmm_ploidy_summary.pl ```
           
> Results: ```LAD/results/HMM/Summary_ploidy.tsv```

> Description of the output: ```LAD/descriptions/README_HMM_ploidy_description ```

6. LAD inference

Uses the outputs of steps 3 and 5 (```HMM_summary.tsv``` and ```Summary_ploidy.tsv```) as input to infer LAD for each sample. Outputs several tables with the LAD annotation and illustrative plots.

> Script: ```LAD/scripts/infer_LAD_from_HMM_summary.R```
           
> Results: 

```../plots/``` - illustrative plots for LAD inference

```../results/Summary_mixtures.txt``` - all annotated mixtures of tumors

```../results/Summary_divisions_with_symmetrical_no_mixtures.txt``` - all non-mixed samples annotated with LAD

```../results/LAD_by_mice_line_agg.txt``` - aggregated LAD statistics for the C3H and CAST strains

```../results/LAD_by_mice_line_by_driver_agg.txt``` - aggregated LAD statistics per driver per mouse strain

```../results/Summary_many_drivers.txt``` - summary of LAD for samples with multiple annotated drivers

### Error-free replication rate estimation

1. Fit the VAF distribution of each sample as a mixture of two normal distributions. Estimate the mean of each distribution and its weight in the mixture.

> Main script: ```error_free_replication_rate/scripts/fit_2normal_1div_4cells_only.R```

> Script to provide driver-specific estimates: ```error_free_replication_rate/scripts/fit_2normal_by_driver_1div_4cells_only.R```

> Output: ```error_free_replication_rate/results/fit_norm_mixture_1div_4cells.txt ```

Driver-specific estimates are written to the same folder under the corresponding names.
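
For orientation, a two-component normal mixture can be fitted with the EM algorithm, as in the Python sketch below. This is a generic illustration and not the code of `fit_2normal_1div_4cells_only.R`, which may use a different fitting routine; the function names, the initialisation and the iteration count are assumptions.

```python
import math
import random
from typing import Dict, List


def normal_pdf(x: float, mean: float, sd: float) -> float:
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fit_two_normals(vafs: List[float], n_iter: int = 200) -> List[Dict[str, float]]:
    """Fit a mixture of two normal distributions by EM.

    Returns two components sorted by mean, each with "mean", "sd" and "weight".
    """
    if len(vafs) < 2:
        raise ValueError("need at least two VAF values")
    xs = sorted(vafs)
    n = len(xs)
    half = n // 2
    # Initialise the means from the lower and upper halves of the sorted data.
    means = [sum(xs[:half]) / half, sum(xs[half:]) / (n - half)]
    overall = sum(xs) / n
    sd0 = max(math.sqrt(sum((x - overall) ** 2 for x in xs) / n), 1e-3)
    sds = [sd0, sd0]
    weights = [0.5, 0.5]

    for _ in range(n_iter):
        # E-step: posterior probability of each component for every point.
        resp = []
        for x in xs:
            p = [weights[k] * normal_pdf(x, means[k], sds[k]) for k in range(2)]
            total = p[0] + p[1]
            resp.append([0.5, 0.5] if total == 0.0 else [p[0] / total, p[1] / total])
        # M-step: re-estimate weight, mean and sd of each component.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue  # component has collapsed; keep its previous parameters
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            sds[k] = max(math.sqrt(var), 1e-3)  # floor avoids a zero-width component

    components = sorted(zip(means, sds, weights))
    return [{"mean": m, "sd": s, "weight": w} for m, s, w in components]


# Usage on synthetic VAFs (made-up numbers, for demonstration only).
random.seed(1)
synthetic = [random.gauss(0.25, 0.03) for _ in range(300)] + \
            [random.gauss(0.50, 0.03) for _ in range(700)]
for component in fit_two_normals(synthetic):
    print(component)
```

The returned weights sum to one, so the weight of either component can be read directly as its share of the mutations in the sample.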

2. Scripts for illustrative plots:

```error_free_replication_rate/plot_error_free_distribution.R```

```error_free_replication_rate/plot_w_subclon_vs_HMM_multi_emission.R```
