# EasySFS Tutorial

## Outline
1. Basics of the SFS
2. Our Dataset
3. Overview of `easySFS`
4. Example Run of `easySFS`
5. Hands On: Build Folded SFS

-----
## 1. Basics of the SFS

### What is the SFS?

The site frequency spectrum (SFS), also called the allele frequency spectrum, is the joint distribution of allele frequencies across populations.

- One population. The SFS is a vector $S$, where entry $S[i]$ contains the number of positions where the derived allele occurred in $i$ haplotype samples.
<div>
<img src="pictures/1d_plot.png" width="300" align="left" />
</div>

- Two populations. The SFS is a 2-dimensional matrix, where entry $S[i, j]$ contains the number of positions where the derived allele occurred in $i$ haplotype samples in population 1 and in $j$ haplotype samples in population 2.
<div>
<img src="pictures/2d_plot.png" width="250" align="left" />
</div>

- $P$ populations. The SFS is a $P$-dimensional tensor. Example for three populations:
<div>
<img src="pictures/3d_plot.png" width="300" align="left" />
</div>
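As a minimal sketch of the definitions above, the snippet below builds a 1D and a 2D SFS from per-site derived-allele counts. The function names `sfs_1d` and `sfs_2d` and the toy counts are illustrative assumptions; they are not part of `easySFS`.

```python
import numpy as np

def sfs_1d(derived_counts, n_samples):
    """S[i] = number of sites where the derived allele occurs in i haplotypes."""
    counts = np.asarray(derived_counts, dtype=int)
    return np.bincount(counts, minlength=n_samples + 1)

def sfs_2d(counts_pop1, counts_pop2, n1, n2):
    """S[i, j] = number of sites with i derived haplotypes in population 1 and j in population 2."""
    sfs = np.zeros((n1 + 1, n2 + 1), dtype=int)
    np.add.at(sfs, (np.asarray(counts_pop1), np.asarray(counts_pop2)), 1)
    return sfs

# Toy data: derived-allele counts at 6 sites among 4 haplotypes
sfs = sfs_1d([0, 1, 1, 3, 0, 4], 4)
print(sfs)  # [2 2 0 1 1]
```

The vector has `n_samples + 1` entries because a site can carry the derived allele in anywhere from 0 to `n_samples` haplotypes.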

### Ignore monomorphic bins

The SFS bin that corresponds to frequency 0 in every population contains the number of monomorphic positions carrying the ancestral allele. Some tools (`dadi`, `moments`, `momi2`) ignore that value, as it can easily be computed from the total sequence length and the other SFS entries.

For a similar reason, the bin with the maximum frequency in every population is also excluded from the analysis. The number of sites with a fixed derived allele is relatively small and, moreover, may result from misidentification of the ancestral allele.

Both monomorphic bins are excluded from the plots above. The bin with frequency 0 is usually very large, since most sites are monomorphic, so including it in a plot would dwarf all other bins.
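The relation between the zero bin and the total sequence length can be sketched for one population as follows. The helper names `fill_zero_bin` and `mask_monomorphic` and the toy numbers are assumptions made for illustration only.

```python
import numpy as np

def fill_zero_bin(sfs, total_length):
    """Set S[0] so that all bins together sum to the total sequence length."""
    sfs = np.array(sfs, dtype=float)
    sfs[0] = total_length - sfs[1:].sum()
    return sfs

def mask_monomorphic(sfs):
    """Hide the two monomorphic bins (frequency 0 and maximum frequency)."""
    sfs = np.asarray(sfs, dtype=float)
    mask = np.zeros(sfs.shape, dtype=bool)
    mask[0] = mask[-1] = True
    return np.ma.masked_array(sfs, mask=mask)

# Toy spectrum for 4 haplotypes with an unknown zero bin, 1000 bp in total
filled = fill_zero_bin([0, 30, 12, 5, 3], total_length=1000)
print(filled[0])                       # 950.0
print(mask_monomorphic(filled).sum())  # 47.0
```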

### What data do we need to build the accurate SFS?

The site frequency spectrum reflects the history of the sampled individuals, so it is important to build it accurately. Some recommendations to keep in mind:

- A VCF file of filtered genotypes
- As many SNPs as possible
- As many individuals as possible (8-10 is okay, 100 is better)
- No relatives
- Avoid missing data (`easySFS` can help with missing data)
- Coordinates of neutral sites or intergenic regions (so that sites are subject to similar evolutionary forces)

### What if we do not know the derived allele? (SFS folding)

Sometimes outgroup information is missing, so we cannot tell which allele is derived. In that case we can use minor allele frequencies (MAF) to build the SFS. The MAF SFS can easily be built from the usual (unfolded) SFS. This process is called *folding*, and the MAF SFS is called the *folded SFS*.

- Example of SFS folding for one population:

    Unfolded and folded SFS, respectively:

    <img src="pictures/1d_plot_before_folding.png" width="300" align="left"/>
    <img src="pictures/1d_plot_after_folding.png" width="300" align="left"/>

- Example of SFS folding for two populations:

    Unfolded and folded SFS, respectively:
    
    <img src="pictures/2d_plot_before_folding.png" width="250" align="left"/>
    <img src="pictures/2d_plot_after_folding.png" width="250" align="left"/>
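Folding for one population can be sketched in a few lines: every entry $S[i]$ is added to the bin of the minor allele count $\min(i, n - i)$. The function name `fold_1d` and the toy spectrum are assumptions made for illustration; tools such as `dadi` ship their own folding routines.

```python
import numpy as np

def fold_1d(sfs):
    """Fold a 1D SFS: bin i and bin n - i are merged into bin min(i, n - i)."""
    sfs = np.asarray(sfs, dtype=float)
    n = len(sfs) - 1  # number of haplotype samples
    folded = np.zeros_like(sfs)
    for i in range(n + 1):
        folded[min(i, n - i)] += sfs[i]
    return folded

# Toy unfolded spectrum for 5 haplotypes (monomorphic bins set to 0)
print(fold_1d([0, 10, 6, 4, 2, 0]))  # [ 0. 12. 10.  0.  0.  0.]
```

The folded vector keeps its original length here; the upper half is simply empty, which is why plots of folded spectra show only the lower half of the frequency range.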

-----
## 2. Our Dataset

We have a dataset for clouded leopards (*Neofelis nebulosa*). It is example data for one contig (10,000,000 bp) and 10 diploid individuals (20 haploid samples).

All the data is available in the `data` folder:

In [None]:
%%bash
ls data

- File `data/example_data.vcf` is our VCF file for all 10 individuals (the X chromosome is excluded):

In [None]:
%%bash
# First six lines
head -6 data/example_data.vcf

In [None]:
%%bash
# The header line of the VCF file (we take the first 6 lines of the file and then show the last one)
head -6 data/example_data.vcf | tail -1

- File `data/popmap` provides the population assignment of each individual (all our individuals are from the same population, which is labeled `NN`):

In [None]:
%%bash
cat data/popmap

-----
## 3. Overview of `easySFS`

### What is [easySFS](https://github.com/isaacovercast/easySFS#install--run)?
- A tool for constructing the SFS from a VCF file.
- Generates the SFS in several formats: `dadi`/`fastsimcoal2`/`momi2` (GADMA accepts all of them).
- Provides means to project the SFS down to account for missing data.
- Allows restricting the SFS to independent SNPs (for RADSeq-like data only!).

(Unfortunately, there is no paper to cite; the authors ask to credit [\[Gutenkunst et al. 2009\]](https://doi.org/10.1371/journal.pgen.1000695) instead.)

### What is the SFS projection?
Assume we have two datasets with different numbers of samples. We construct an SFS for each and want to compare them, but they have different sizes. Since the SFS is a histogram of allele frequencies, it is possible to downsize, or *project*, it. Thus, we can project both spectra down to a common smaller size and compare them.

### Some intuition about SFS projection

Let us get some intuition about how the information in the bigger SFS is used to build its projection.

Assume we have the following data ($A$ stands for the ancestral allele, $T$ for the derived allele, and $.$ for a missing genotype):

- Sample 1: $A\ A\ A\ T\ A$
- Sample 2: $A\ A\ T\ A\ A$
- Sample 3: $A\ A\ T\ T\ \ .$
- Sample 4: $T\ A\ T\ A\ A$

and we want to build the projection of the SFS onto three samples.

At the first position the derived allele has a frequency of $1/4$ (one out of four samples).
Let us subsample three of the four samples and compute the frequency of the derived allele at the first position. If we choose samples 1, 2, 3, we obtain a frequency of $0/3$; this is one of the four possible subsamples, so it happens with probability $p_1 = 0.25$. If we choose sample 4 and any two of the others, we obtain a frequency of $1/3$; this covers the remaining three subsamples, so it happens with probability $p_2 = 0.75$.

Thus, in our projected SFS this position adds $p_1$ to the $0/3$ bin and $p_2$ to the $1/3$ bin.

The third position has a derived allele frequency of $3/4$. We obtain a frequency of $3/3$ by choosing samples 2, 3, 4 (probability $0.25$) and a frequency of $2/3$ by choosing sample 1 and any two of the others (probability $0.75$).

Given the probability of each subsampled frequency (see [hypergeometric distribution](https://en.wikipedia.org/wiki/Hypergeometric_distribution)), we can build a new SFS of smaller size.
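The subsampling probabilities follow the hypergeometric formula $\binom{k}{j}\binom{n-k}{m-j} / \binom{n}{m}$, where $n$ is the number of samples, $k$ the number of derived alleles, and $m$ the projection size. A minimal sketch (the function name `projection_probs` is an assumption, not an `easySFS` function):

```python
from math import comb

def projection_probs(n, k, m):
    """Probabilities of observing j = 0..m derived alleles when m of n samples
    (k of which carry the derived allele) are drawn without replacement."""
    return [comb(k, j) * comb(n - k, m - j) / comb(n, m) for j in range(m + 1)]

# First position of the example: 1 derived allele among 4 samples, projected to 3
print(projection_probs(4, 1, 3))  # [0.25, 0.75, 0.0, 0.0]
# Third position: 3 derived alleles among 4 samples
print(projection_probs(4, 3, 3))  # [0.0, 0.0, 0.75, 0.25]
```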

### Accounting for the missed data

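SFS projection can easily account for missing data! At the last position of our example data, sample 3 is missing, so only three samples are called, and all of them carry the ancestral allele. No subsampling is needed: we just add $1$ to the $0/3$ bin of the projected SFS. That is exactly what `easySFS` does.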
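Putting the pieces together, the whole example dataset can be projected onto three samples as sketched below. Each string is one position (samples 1 to 4 from left to right). The derived allele is hard-coded as `T`, and sites with fewer than `m` called samples are skipped; both are simplifying assumptions of this sketch, not a description of the `easySFS` internals.

```python
from math import comb

def project_sites(sites, m):
    """Sum per-site hypergeometric projections onto m samples, ignoring missing calls ('.')."""
    sfs = [0.0] * (m + 1)
    for site in sites:
        called = [allele for allele in site if allele != "."]
        n, k = len(called), called.count("T")  # called samples, derived alleles
        if n < m:
            continue  # not enough called samples to project this site
        for j in range(m + 1):
            sfs[j] += comb(k, j) * comb(n - k, m - j) / comb(n, m)
    return sfs

# The five positions of the example above
sites = ["AAAT", "AAAA", "ATTT", "TATA", "AA.A"]
print(project_sites(sites, 3))  # [2.25, 1.25, 1.25, 0.25]
```

The entries sum to 5, the number of positions: each site contributes a total weight of 1, spread over the bins it can fall into.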


--------
## 4. Example Run of `easySFS`

### Installation

Resources: [more information about installation of easySFS](https://github.com/isaacovercast/easySFS#install--run)

Download `easySFS` from its GitHub repository:

In [None]:
%%bash
git clone https://github.com/isaacovercast/easySFS.git
ls easySFS

### Run `easySFS` with the `--help` option

In [None]:
%%bash
./easySFS/easySFS.py --help

### Run `easySFS` for your VCF file with the `--preview` option

In [None]:
%%bash
./easySFS/easySFS.py -i data/example_data.vcf -p data/popmap -a --preview > outputs/easySFS_preview_output
cat outputs/easySFS_preview_output

#### Visualize the results of the preview
In this notebook we will use Python:

In [None]:
from scripts.draw_easySFS_preview import draw_easySFS_preview
draw_easySFS_preview("outputs/easySFS_preview_output")

Alternatively, the script can be run from the command line (it produces the same picture):

In [None]:
%%bash
python scripts/draw_easySFS_preview.py outputs/easySFS_preview_output
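The preview can also be inspected programmatically. The sketch below assumes that the preview lists, for each population, pairs of the form `(projection, number of segregating sites)`; check this against your own `outputs/easySFS_preview_output` before relying on it. The function name `best_projection` and the numbers in `example` are made up for illustration, and the code handles a single population only.

```python
import re

def best_projection(preview_text):
    """Return the (projection, segregating sites) pair with the most segregating sites."""
    pairs = [(int(size), float(sites))
             for size, sites in re.findall(r"\((\d+),\s*([\d.]+)\)", preview_text)]
    return max(pairs, key=lambda pair: pair[1])

# Hypothetical preview fragment for one population
example = "NN\n(2, 45.0)\t(3, 59.0)\t(4, 58.0)\t(5, 49.0)"
print(best_projection(example))  # (3, 59.0)
```

Maximizing the number of segregating sites is only one criterion; a larger projection with slightly fewer sites may still be preferable because it keeps more samples.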

### Run `easySFS` for your VCF file with the `--proj` option

We use the `--proj 14` option instead of `--preview`.

We add the following options:
* `--total-length 10000000` to construct the correct monomorphic bin
* `--unfolded` to construct an unfolded spectrum
* `-o outputs/easySFS_output` to write the output to this directory
* `-f` to overwrite the output directory if it already exists

In [None]:
%%bash
./easySFS/easySFS.py -i data/example_data.vcf -p data/popmap -a --total-length 10000000 --unfolded -o outputs/easySFS_output -f --proj 14

### `easySFS` output
With the `--proj` option, `easySFS` generates the SFS for the given data in several formats:
- `dadi` - input format of dadi and moments,
- `fastsimcoal2` - input format of fastsimcoal2,
- `momi` - input format of momi2 (only if we use full projections, i.e. do not project the SFS down).

GADMA can work with any input format mentioned above.

In [None]:
%%bash
ls outputs/easySFS_output

In [None]:
%%bash
ls outputs/easySFS_output/dadi

In [None]:
%%bash
cat outputs/easySFS_output/dadi/NN-14.sfs

In [None]:
%%bash
cat outputs/easySFS_output/fastsimcoal2/NN_DAFpop0.obs

### Picture of our SFS

In this example we will use the SFS `outputs/easySFS_output/dadi/NN-14.sfs` generated for dadi. Let us draw the picture:

In [None]:
from scripts.draw_sfs import draw_1d_sfs
draw_1d_sfs("outputs/easySFS_output/dadi/NN-14.sfs")
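To work with the numbers directly, a 1D SFS file in the dadi format can be parsed with plain Python. The sketch assumes the usual layout: optional `#` comment lines, a header line with the number of entries (optionally followed by `folded`/`unfolded` and population labels), a line of values, and a line with the mask. The function name `read_dadi_1d` and the values in `example` are hypothetical; compare the layout with the `cat` output above.

```python
def read_dadi_1d(text):
    """Parse a 1D dadi-style SFS: returns (values, mask, folded)."""
    lines = [ln for ln in text.splitlines() if ln.strip() and not ln.startswith("#")]
    header = lines[0].split()
    size = int(header[0])
    folded = len(header) > 1 and header[1] == "folded"
    values = [float(x) for x in lines[1].split()]
    # A mask entry of 1 marks a bin that should be ignored
    mask = [x == "1" for x in lines[2].split()] if len(lines) > 2 else [False] * size
    if len(values) != size:
        raise ValueError("header size does not match the number of entries")
    return values, mask, folded

# Hypothetical file content for 4 haplotype samples
example = '5 unfolded "NN"\n950 30 12 5 3\n1 0 0 0 1\n'
values, mask, folded = read_dadi_1d(example)
print(values)  # [950.0, 30.0, 12.0, 5.0, 3.0]
print(mask)    # [True, False, False, False, True]
```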

-----
## 5. Hands On: Build Folded SFS

Now it is your turn! Can you build a **folded** site frequency spectrum?

You have to perform several steps:
* Run `easySFS` to construct a folded spectrum
* Print the resulting SFS
* Draw the resulting SFS