# Example workflow for running varDFE to infer the DFE in a single species

Last updated: 2024-10-29

In [2]:
import varDFE
import dadi

## step0: load and visualize the data

This example demonstrates the workflow using a site frequency spectrum (SFS) derived from 1000 Genomes Project data.
For details on how this SFS was generated, refer to [Zhen et al. 2021](https://genome.cshlp.org/content/31/1/110).


In [3]:
# read the spectra (the SFS files are expected to be folded already; nothing is folded here)
synfs = dadi.Spectrum.from_file('SYN.sfs') # synonymous
misfs = dadi.Spectrum.from_file('MIS.sfs') # nonsynonymous/missense

In [None]:
# print the number of segregating sites in each spectrum
print(synfs.S())
print(misfs.S())
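
The later steps expect folded spectra. As a reminder of what folding means, the sketch below folds a toy unfolded SFS in plain Python and counts its segregating sites. The helper names `fold_sfs` and `segregating_sites` and the toy counts are illustrative assumptions, not part of varDFE or dadi; for real data, dadi's own `Spectrum.fold()` and `Spectrum.S()` do this work.

```python
def fold_sfs(unfolded):
    """Fold an unfolded SFS given as counts for derived-allele counts 0..n.

    Returns counts indexed by minor-allele count 0..n//2.
    """
    n = len(unfolded) - 1  # sample size (number of chromosomes)
    folded = []
    for i in range(n // 2 + 1):
        j = n - i
        # A site with i derived alleles has the same minor-allele count
        # as a site with n - i derived alleles, so the two bins are merged.
        folded.append(unfolded[i] if i == j else unfolded[i] + unfolded[j])
    return folded


def segregating_sites(folded):
    """Number of polymorphic sites: every bin except the monomorphic class 0."""
    return sum(folded[1:])


toy_unfolded = [100, 12, 6, 4, 3, 2, 50]  # toy counts, n = 6 chromosomes
toy_folded = fold_sfs(toy_unfolded)
print(toy_folded, segregating_sites(toy_folded))
```

Folding does not change the number of segregating sites; it only discards the information about which allele is derived.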

In [None]:
# compare the two spectra with a plotting function from varDFE
# the missense SFS is passed as `model` and the synonymous SFS as `fs` (the labels can be renamed later)
%matplotlib inline
from varDFE.Misc.Plotting import ggplot_dadi_1d
ggplot_dadi_1d(outprefix = 'MIS_SYN', model = misfs, fs = synfs, yvar = 'count', returnplot=True)

In [None]:
# make a percentage plot
%matplotlib inline
varDFE.Misc.Plotting.ggplot_dadi_1d(outprefix = 'MIS_SYN', model = misfs, fs = synfs, yvar = 'percent', returnplot=True)

## step1: `Demog1D_sizechangeFIM` module for demographic inference

In [None]:
%%bash
mkdir -p ./output/logs/
WORKSCRIPT='../workflow/Demography/Demog1D_sizechangeFIM.py'

# show help
python $WORKSCRIPT -h

# run demographic inference
python $WORKSCRIPT \
--pop 'example' --mu '2.50E-08' --Lcds '19089129' --NS_S_scaling '2.31' --Nrun 5 \
'SYN.sfs' 'two_epoch' './output/demography/two_epoch' &> './output/logs/demography.log'

### Command line explanation

Positional arguments:
|Position | Value | Explanation |
|---------|-------|-------------|
|1 | 'SYN.sfs' | Path to folded synonymous SFS file |
|2 | 'two_epoch' | Demographic model to use |
|3 | './output/demography/two_epoch' | Output directory path |

Options:
|Argument | Value | Explanation |
|---------|-------|-------------|
|--pop | 'example' | Population identifier |
|--mu | '2.50E-08' | Exon mutation rate (mutations/bp/generation) |
|--Lcds | '19089129' | Number of called CDS sites used in SFS |
|--NS_S_scaling | '2.31' | Ratio of nonsynonymous to synonymous sequence length (Lnonsyn/Lsyn) |
|--Nrun | 5 | Number of replicated runs |

Additional:
- Redirects both stdout and stderr to './output/logs/demography.log'
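
If you would rather launch the script from Python than from a `%%bash` cell, the same command can be assembled as an argument list. This is only a sketch: the helper `build_demography_command` is hypothetical, and it uses nothing beyond the options and positional arguments listed above.

```python
import shlex


def build_demography_command(script, sfs, model, outdir, **options):
    """Assemble the argument list: options first, then the three positional arguments."""
    cmd = ["python", script]
    for name, value in options.items():
        cmd += [f"--{name}", str(value)]
    cmd += [sfs, model, outdir]
    return cmd


cmd = build_demography_command(
    "../workflow/Demography/Demog1D_sizechangeFIM.py",
    "SYN.sfs", "two_epoch", "./output/demography/two_epoch",
    pop="example", mu="2.50E-08", Lcds="19089129", NS_S_scaling="2.31", Nrun=5,
)
print(shlex.join(cmd))  # shell-quoted form, handy for logging

# To actually run it and send stdout and stderr to one log file:
# import subprocess
# with open("./output/logs/demography.log", "w") as log:
#     subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT, check=True)
```

Passing a list rather than a single string avoids shell-quoting problems when paths contain spaces.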

### Log outputs

The output log records most of the information needed for demographic inference and model comparison:

1. Input parameters parsed
2. Within each replicated inference, every iteration's log-likelihood and demographic parameter values
3. Top three replicates with the highest log-likelihoods
4. Convergence check for the best replicate (the one with the highest log-likelihood)
5. Fisher Information Matrix (FIM)-based standard deviation estimates
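
For readers unfamiliar with FIM-based uncertainty: asymptotically, the covariance of a maximum-likelihood estimate is the inverse of the Fisher Information Matrix, so the standard deviations are the square roots of the diagonal of that inverse. The sketch below shows this for a two-parameter model such as `two_epoch`. The function name and the toy matrix are assumptions for illustration; it is not a description of how the script computes its values.

```python
import math


def sd_from_fim_2x2(fim):
    """Standard deviations from a symmetric 2x2 Fisher Information Matrix.

    The asymptotic covariance of the MLE is the inverse of the FIM;
    the standard deviations are the square roots of its diagonal.
    """
    (a, b), (c, d) = fim
    det = a * d - b * c
    if a <= 0 or det <= 0:
        raise ValueError("FIM must be positive definite to give valid variances")
    # Closed-form inverse of a 2x2 matrix
    cov = [[d / det, -b / det], [-c / det, a / det]]
    return [math.sqrt(cov[0][0]), math.sqrt(cov[1][1])]


toy_fim = [[4.0, 0.0], [0.0, 25.0]]  # toy values, not real output
print(sd_from_fim_2x2(toy_fim))
```

A FIM that is singular or not positive definite usually signals a parameter that the data cannot constrain, in which case the reported standard deviations should not be trusted.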

### File outputs

In the output folders, the files are organized as follows:

```
.
├── demography
│   └── two_epoch
│       ├── bestrun
│       │   ├── example_demog_two_epoch_runXX.SD.txt
│       │   ├── example_demog_two_epoch_runXX.info.txt
│       │   ├── example_demog_two_epoch_runXX.png
│       │   ├── example_demog_two_epoch_runXX.txt
│       │   ├── example_demog_two_epoch_runXX_folded.expSFS
│       │   └── example_demog_two_epoch_runXX_unfolded.expSFS
│       ├── detail_5runs
│       │   └── ...
│       ├── example_demog_two_epoch_SFS.pdf
│       └── example_demog_two_epoch_summary.txt
└── logs
    └── demography.log
```

* `example_demog_two_epoch_SFS.pdf`: The SFS fit for the best model
* `example_demog_two_epoch_summary.txt`: Tabulated data for all the runs
* `bestrun/`: the folder with all the information for the best run (where `XX` represents the run number e.g. `run03`)
    * `example_demog_two_epoch_runXX.info.txt`: Tabulated data for the best run with uncertainty estimate appended (FIM + convergence)
    * `example_demog_two_epoch_runXX.png`: `dadi`'s original diagnostic plot for examining residuals
    * `example_demog_two_epoch_runXX_folded.expSFS`: folded expected SFS from this run

## step2: `DFE1D_refspectra` module for building a reference spectra cache

In [None]:
%%bash
WORKSCRIPT='../workflow/DFE/DFE1D_refspectra.py'

# show help
python $WORKSCRIPT -h

# generate reference spectra cache
python $WORKSCRIPT \
'two_epoch' '2.33,0.43' '100' './output/dfe/refspectra/example' \
&> './output/logs/dfe_refspectra.log'

### Command line explanation

Positional arguments:
|Position | Value | Explanation |
|---------|-------|-------------|
|1 | 'two_epoch' | Demographic model to use |
|2 | '2.33,0.43' | Demographic parameters (nu,T) for the two-epoch model |
|3 | '100' | Number of samples in the SFS to generate |
|4 | './output/dfe/refspectra/example' | Path/NamePrefix to the output file |

Additional:
- Redirects both stdout and stderr to './output/logs/dfe_refspectra.log'

### Log outputs

In the output log, it provides:

1. Input parameters parsed
2. The population-scaled selection coefficient values used for building each reference spectrum
3. Summary of the reference spectra cache

### File outputs

In the output folders, the files are organized as follows:

```
.
├── dfe
│   └── refspectra
│       ├── example_DFESpectrum.bpkl
│       └── example_DFESpectrum_QC.pdf
└── logs
    └── dfe_refspectra.log
```

* `example_DFESpectrum.bpkl`: The reference spectra cache with given demographic parameters
* `example_DFESpectrum_QC.pdf`: QC plots showing the expected SFS under a range of selection coefficients

### Notes

This step is time-consuming (about 40 minutes on a local computer for the example above), so running it on a cluster is recommended.

By default, both negative and positive selection coefficients are computed.
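
To see why the cache is worth the wait: once the expected SFS has been precomputed for many selection strengths, the expected SFS under any DFE is a weighted integral over those cached spectra, which is cheap to evaluate. The sketch below illustrates the idea with a gamma DFE and the trapezoid rule. It is a simplification under stated assumptions: the names `gamma_pdf` and `expected_sfs` are hypothetical, the toy cache is a plain list rather than the `.bpkl` file, and only one sign of selection is handled.

```python
import math


def gamma_pdf(x, shape, scale):
    """Density of a gamma distribution with the given shape and scale."""
    return x ** (shape - 1) * math.exp(-x / scale) / (math.gamma(shape) * scale ** shape)


def expected_sfs(cache, shape, scale):
    """Trapezoid-rule integral of cached spectra weighted by a gamma DFE.

    `cache` is a list of (strength, spectrum) pairs sorted by strength,
    where strength stands for the magnitude of the scaled selection coefficient.
    """
    n_bins = len(cache[0][1])
    total = [0.0] * n_bins
    for (g0, s0), (g1, s1) in zip(cache, cache[1:]):
        w0, w1 = gamma_pdf(g0, shape, scale), gamma_pdf(g1, shape, scale)
        for k in range(n_bins):
            total[k] += 0.5 * (g1 - g0) * (w0 * s0[k] + w1 * s1[k])
    return total


# Toy cache: every "spectrum" is [1.0], so the result is just the DFE mass
# covered by the grid of selection strengths.
toy_cache = [(0.01 * i, [1.0]) for i in range(1, 5001)]
mass = expected_sfs(toy_cache, shape=2.0, scale=1.0)[0]
print(mass)
```

With the toy cache the result should be close to 1, which is a quick sanity check that the grid of selection strengths covers nearly all of the DFE's probability mass.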

## step3: `DFE1D_inferenceFIM` module for DFE inference

In [None]:
%%bash
WORKSCRIPT='../workflow/DFE/DFE1D_inferenceFIM.py'

# show help
python $WORKSCRIPT -h

# perform DFE inference assuming a gamma-distributed DFE function
python $WORKSCRIPT \
--pop 'example' --mu '2.50E-08' --Lcds '19089129' --NS_S_scaling '2.31' --Nrun 5 \
'MIS.sfs' './output/dfe/refspectra/example_DFESpectrum.bpkl' 'gamma' '4062' './output/dfe/gamma' &> './output/logs/dfe_gamma.log'

### Command line explanation

Positional arguments:
|Position | Value | Explanation |
|---------|-------|-------------|
|1 | 'MIS.sfs' | Path to folded nonsynonymous SFS file |
|2 | './output/dfe/refspectra/example_DFESpectrum.bpkl' | Path to reference DFE spectra |
|3 | 'gamma' | DFE functional form to use |
|4 | '4062' | Theta of synonymous regions |
|5 | './output/dfe/gamma' | Output directory path |

Options:
|Argument | Value | Explanation |
|---------|-------|-------------|
|--pop | 'example' | Population identifier |
|--mu | '2.50E-08' | Exon mutation rate (mutations/bp/generation) |
|--Lcds | '19089129' | Number of called CDS sites used in SFS |
|--NS_S_scaling | '2.31' | Ratio of nonsynonymous to synonymous sequence length (Lnonsyn/Lsyn) |
|--Nrun | 5 | Number of replicated runs |

Additional:
- Redirects both stdout and stderr to './output/logs/dfe_gamma.log'
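
The numeric arguments above are tied together by simple arithmetic. The sketch below works through it under three labeled assumptions: that `--Lcds` is the sum of synonymous and nonsynonymous lengths, that `--NS_S_scaling` is the Lnonsyn/Lsyn ratio, and that theta follows dadi's convention `theta = 4 * Nanc * mu * L`. The variable names are illustrative only.

```python
mu = 2.50e-8          # --mu
l_cds = 19089129      # --Lcds
ns_s_scaling = 2.31   # --NS_S_scaling
theta_syn = 4062      # positional argument 4 (theta of synonymous regions)

# Assumption: Lcds = Lsyn + Lnonsyn and Lnonsyn = NS_S_scaling * Lsyn
l_syn = l_cds / (1 + ns_s_scaling)
l_nonsyn = l_cds - l_syn

# Assumption: theta scales linearly with sequence length
theta_nonsyn = theta_syn * ns_s_scaling

# Assumption: theta = 4 * Nanc * mu * L (dadi's convention), solved for Nanc
n_anc = theta_syn / (4 * mu * l_syn)

print(round(l_syn), round(l_nonsyn), round(theta_nonsyn), round(n_anc))
```

Under these assumptions the rounded nonsynonymous theta is 9383 and the rounded ancestral size is 7043, the same values that are passed to the grid search in step 4.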

### Log outputs

The output log records most of the information needed for DFE inference:

1. Input parameters parsed
2. Within each replicated inference, every iteration's log-likelihood and DFE parameter values
3. Top three replicates with the highest log-likelihoods
4. Convergence check for the best replicate (the one with the highest log-likelihood)
5. Fisher Information Matrix (FIM)-based standard deviation estimates
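
Ranking replicates and judging convergence can also be done by hand. The sketch below sorts replicates by log-likelihood and applies a crude convergence rule: the two best replicates should agree to within a small tolerance. The record layout, the tolerance, and all numbers are toy assumptions, not the script's actual criterion or output.

```python
def top_replicates(replicates, k=3):
    """Return the k replicates with the highest log-likelihood, best first."""
    return sorted(replicates, key=lambda rep: rep[1], reverse=True)[:k]


def looks_converged(replicates, tol=1.0):
    """Crude check: the two best replicates differ by at most tol log-likelihood units."""
    best = top_replicates(replicates, k=2)
    return len(best) == 2 and best[0][1] - best[1][1] <= tol


# Toy records: (run label, log-likelihood, (shape, scale))
toy_runs = [
    ("run01", -1504.2, (0.21, 0.030)),
    ("run02", -1500.1, (0.19, 0.041)),
    ("run03", -1500.0, (0.19, 0.042)),
    ("run04", -1530.7, (0.35, 0.010)),
    ("run05", -1501.9, (0.20, 0.037)),
]
print([label for label, _, _ in top_replicates(toy_runs)])
print(looks_converged(toy_runs))
```

If the best replicates disagree by more than the tolerance, increasing `--Nrun` is a reasonable first response.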

### File outputs

In the output folders, the files are organized as follows:

```
.
├── dfe
│   └── gamma
│       ├── bestrun
│       │   ├── example_DFE_gamma_runXX.SD.txt
│       │   ├── example_DFE_gamma_runXX.info.txt
│       │   ├── example_DFE_gamma_runXX.png
│       │   ├── example_DFE_gamma_runXX.txt
│       │   ├── example_DFE_gamma_runXX_folded.expSFS
│       │   └── example_DFE_gamma_runXX_unfolded.expSFS
│       ├── detail_5runs
│       │   └── ...
│       ├── example_DFE_gamma_PDF.pdf
│       ├── example_DFE_gamma_SFS.pdf
│       └── example_DFE_gamma_summary.txt
└── logs
    └── dfe_gamma.log
```

* `example_DFE_gamma_SFS.pdf`: The SFS fit for the best model
* `example_DFE_gamma_PDF.pdf`: The probability density function (PDF) for the best DFE inference
* `example_DFE_gamma_summary.txt`: Tabulated data for all the runs
* `bestrun/`: the folder with all the information for the best run (where `XX` represents the run number e.g. `run03`)
    * `example_DFE_gamma_runXX.info.txt`: Tabulated data for the best run with uncertainty estimate appended (FIM + convergence)
    * `example_DFE_gamma_runXX.png`: `dadi`'s original diagnostic plot for examining residuals
    * `example_DFE_gamma_runXX_folded.expSFS`: folded expected SFS from this run

### Notes

Currently, this pipeline supports the following DFE functions: 

1. gamma: Gamma distribution
2. neugamma: Gamma distribution + neutral point mass 
3. gammalet: Gamma distribution + lethal point mass 
4. lognormal: Log-normal distribution
5. lourenco_eq: Fisher's Geometric Model derived DFE distribution (eq. 15 in [Lourenço et al. 2011](https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1558-5646.2011.01237.x))
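
To make the "gamma + point mass" forms concrete, the sketch below draws selection coefficient magnitudes from a gamma distribution mixed with a neutral point mass at zero. The parameter names (`p_neu`, `shape`, `scale`) and their order are assumptions for illustration and may not match the parameterization varDFE uses for `neugamma`.

```python
import random


def sample_neugamma(p_neu, shape, scale, n, seed=0):
    """Draw n magnitudes of s: 0 with probability p_neu, otherwise Gamma(shape, scale)."""
    rng = random.Random(seed)  # local generator so the draws are reproducible
    draws = []
    for _ in range(n):
        if rng.random() < p_neu:
            draws.append(0.0)  # neutral point mass
        else:
            draws.append(rng.gammavariate(shape, scale))
    return draws


draws = sample_neugamma(p_neu=0.3, shape=0.2, scale=0.05, n=20000)
frac_neutral = sum(d == 0.0 for d in draws) / len(draws)
mean_s = sum(draws) / len(draws)
print(frac_neutral, mean_s)
```

The fraction of exactly neutral draws should be close to `p_neu`, and the mean close to `(1 - p_neu) * shape * scale`. A lethal point mass would work the same way, with the point mass placed at a fixed large value instead of zero.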

## step4: `DFE1D_gridsearch` module for DFE grid-search and comparisons

In [None]:
%%bash
WORKSCRIPT='../workflow/DFE/DFE1D_gridsearch.py'

# show help
python $WORKSCRIPT -h

python $WORKSCRIPT \
--max_bound "0.5,0.5" --min_bound "0.1,1E-4" --dfe_scaling --Nanc 7043 --Npts 10 \
'MIS.sfs' './output/dfe/refspectra/example_DFESpectrum.bpkl' 'gamma' '9383' './output/dfe/gridsearch/example_grids' &> './output/logs/dfe_gridsearch.log'

### Command line explanation

Positional arguments:
|Position | Value | Explanation |
|---------|-------|-------------|
|1 | 'MIS.sfs' | Path to folded nonsynonymous SFS file |
|2 | './output/dfe/refspectra/example_DFESpectrum.bpkl' | Path to reference DFE spectra |
|3 | 'gamma' | DFE functional form to use |
|4 | '9383' | Theta of nonsynonymous regions |
|5 | './output/dfe/gridsearch/example_grids' | Path/NamePrefix to the output file |

Options:
|Argument | Value | Explanation |
|---------|-------|-------------|
|--max_bound | "0.5,0.5" | Upper bounds of the grid, one value per DFE parameter |
|--min_bound | "0.1,1E-4" | Lower bounds of the grid, one value per DFE parameter |
|--dfe_scaling | enabled | Selects whether the DFE is expressed as `2NeS` or `S`; with the flag present, the population-scaled DFE (`2NeS`) is unscaled to individual-level `S` |
|--Nanc | 7043 | Ancestral population size for DFE scaling |
|--Npts | 10 | Number of grid points per parameter |

Additional:
- Redirects both stdout and stderr to './output/logs/dfe_gridsearch.log'

This command performs a grid search over DFE parameters using the specified reference spectra and SFS file. It uses a gamma distribution as the DFE functional form, applies DFE unscaling, and writes the likelihood surface to files that share the specified output prefix.
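
The sketch below shows how the bounds and `--Npts` could translate into a parameter grid, and how a scale parameter in units of `S` relates to its population-scaled counterpart. Two things are assumptions here: the even (linear) spacing, since the script's actual spacing is not documented in this tutorial and log spacing is another plausible choice, and the conversion `2NeS = 2 * Nanc * S`. The helper `linear_grid` is hypothetical.

```python
def linear_grid(lo, hi, npts):
    """npts evenly spaced values from lo to hi, inclusive."""
    step = (hi - lo) / (npts - 1)
    return [lo + i * step for i in range(npts)]


min_bound = [float(x) for x in "0.1,1E-4".split(",")]  # --min_bound
max_bound = [float(x) for x in "0.5,0.5".split(",")]   # --max_bound
npts, n_anc = 10, 7043                                 # --Npts, --Nanc

shape_grid = linear_grid(min_bound[0], max_bound[0], npts)
scale_grid = linear_grid(min_bound[1], max_bound[1], npts)  # in units of S

# Assumption: population-scaled value = 2 * Nanc * S
scaled_scale_grid = [2 * n_anc * s for s in scale_grid]

# Every combination of the two parameters is one grid point
grid = [(shape, scale) for shape in shape_grid for scale in scale_grid]
print(len(grid))
```

With two parameters and 10 points each, the search evaluates 100 likelihoods, so the cost grows quickly as `--Npts` increases.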

### Log outputs

In the output log, it provides: 

1. Input parameters parsed
2. Highest log-likelihoods during grid search
3. Final execution status 

### File outputs

In the output folders, the files are organized as follows:

```
.
├── dfe
│   └── gridsearch
│       ├── example_grids.npy
│       ├── example_grids.pdf
│       └── example_grids.txt
└── logs
    └── dfe_gridsearch.log
```

* `example_grids.npy`: The DFE grid-search output in NumPy format
* `example_grids.txt`: The DFE grid-search output in plain-text format
* `example_grids.pdf`: Likelihood surface and maximum-likelihood estimates (MLEs) derived from the grid search
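
Reading an MLE off a likelihood surface amounts to finding the grid point with the highest log-likelihood. Points within a few log-likelihood units of it give a rough picture of the uncertainty. The sketch below does this on a toy surface held in a dictionary, because the layout of `example_grids.npy` is not documented here. The function name, the toy values, and the cutoff of 3 units (roughly half the 95% chi-square quantile for two parameters) are assumptions.

```python
def grid_mle(surface, max_drop=3.0):
    """Return the best grid point, its log-likelihood, and all points
    whose log-likelihood is within max_drop units of the best."""
    best_params = max(surface, key=surface.get)
    best_ll = surface[best_params]
    near = sorted(p for p, ll in surface.items() if best_ll - ll <= max_drop)
    return best_params, best_ll, near


# Toy surface: {(shape, scale): log-likelihood}
toy_surface = {
    (0.1, 0.01): -1510.0,
    (0.2, 0.01): -1502.5,
    (0.2, 0.02): -1500.0,
    (0.3, 0.02): -1507.0,
}
best_params, best_ll, near = grid_mle(toy_surface)
print(best_params, best_ll, near)
```

If the best point lies on the edge of the grid, the bounds are probably too narrow and the search should be rerun with a wider range.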

# About

## Disclaimer

varDFE and this tutorial are provided "as is" without any warranties or representations of any kind, express or implied. I make no guarantees or warranties regarding the accuracy, reliability, completeness, suitability, or timeliness of the software.

## Citation

When using varDFE, please also cite the dadi and fitdadi packages:

RN Gutenkunst, RD Hernandez, SH Williamson, CD Bustamante "Inferring the joint demographic history of multiple populations from multidimensional SNP data" PLoS Genetics 5:e1000695 (2009).

BY Kim, CD Huber, KE Lohmueller "Inference of the Distribution of Selection Coefficients for New Nonsynonymous Mutations Using Large Samples" Genetics 206:345 (2017).