## F-statistics

Let's activate our conda environment first

In [None]:
conda activate echo_workshop

Within the `analyses` folder, let's create a new directory called `Fstats`

In [1]:
mkdir Fstats

We will run all F-statistics on both the imputed and non-imputed datasets. So, within the directory `Fstats`, we can create:
1) `imputed`
2) `non_imputed`

And within each, we will create a `dataset` directory

#### Non Imputed Dataset

In [None]:
mkdir non_imputed

In [None]:
cd non_imputed

In [None]:
mkdir dataset

Copy the `non_imputed_fstats.*` files into the `dataset` directory and check:
- Number of SNPs
- Number of groups

In [None]:
cd dataset
cp /gpfs/helios/projects/echo_workshops/project.1.tk/data/PopStructureFiles/non_imputed_fstats.* .
wc -l non_imputed_fstats.bim
awk '{print $1}' non_imputed_fstats.fam | sort | uniq -c
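
The same two checks can be sketched in Python. This assumes the standard PLINK layout: one variant per line in the `.bim` file, and the group label in the first column of the `.fam` file, which is the column the `awk` command reads. The function names are illustrative only.

```python
from collections import Counter
from pathlib import Path


def count_snps(bim_path):
    """Count variants: a PLINK .bim file has one line per SNP."""
    with open(bim_path) as handle:
        return sum(1 for line in handle if line.strip())


def count_groups(fam_path):
    """Count samples per group, taking the first .fam column as the group label."""
    with open(fam_path) as handle:
        return Counter(line.split()[0] for line in handle if line.strip())


if __name__ == "__main__":
    prefix = "non_imputed_fstats"
    # Only report when the files are present in the current directory.
    if Path(prefix + ".bim").exists() and Path(prefix + ".fam").exists():
        print("SNPs:", count_snps(prefix + ".bim"))
        for group, n_samples in sorted(count_groups(prefix + ".fam").items()):
            print(n_samples, group)
```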

**Question** 

Compared to the dataset we used for the PCA, there's an additional group, which one? 

#### Imputed Dataset

Be sure that `cd ../../` redirects you to the path where you want to place the `imputed` directory

In [None]:
cd ../../
mkdir imputed
cd imputed
mkdir dataset
cd dataset
cp /gpfs/helios/projects/echo_workshops/project.1.tk/data/PopStructureFiles/imputed_set_fstats.* .

### First steps with F-statistics

#### Converting PLINK to EIGENSTRAT format

The following examples are tailored to the **non-imputed** dataset, so be sure to move to your `non_imputed/dataset` directory.

We will follow the same steps for the imputed dataset: when you analyse it, be sure to move to the correct `imputed` directory and use the corresponding `imputed` files.

A set of PLINK files is available, but we need to convert them to EIGENSTRAT format
to run F-statistics.

To do so, you can use the BED2EIG.sh script: it will create a par file and run convertf. 


Usage:
bash BED2EIG.sh input_file_prefix output_file_prefix

In [None]:
bash /gpfs/helios/projects/echo_workshops/project.1.tk/scripts/BED2EIG.sh non_imputed_fstats non_imputed_fstats

The `BED2EIG.sh` script will create a convertf par file.

In [None]:
python /gpfs/helios/projects/echo_workshops/project.1.tk/scripts/JobParser.py --command "convertf -p convertf_non_imputed_fstats.par" --name convertf

In [None]:
sbatch convertf.sh

#### Edit the .ind file

In [None]:
sed -i 's/:/\t/g' non_imputed_fstats.ind && awk '{print $2,$3,$1}' non_imputed_fstats.ind > tmp && mv tmp non_imputed_fstats.ind
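
To make the one-liner above easier to follow, here is a minimal Python sketch of what it does to a single line: `sed` replaces every `:` with a tab, and `awk` writes the second, third and first fields back in that order. The sketch assumes the first column produced by convertf has the form `group:sampleID`; the input line and the sample ID `sample01` are hypothetical.

```python
def reformat_ind_line(line):
    """Mirror the sed/awk one-liner on a single .ind line.

    ':' is replaced by a tab, the line is split on whitespace, and the
    2nd, 3rd and 1st fields are written back in that order.
    """
    fields = line.replace(":", "\t").split()
    return " ".join([fields[1], fields[2], fields[0]])


# Hypothetical input line: group and sample ID joined by ':', then sex and status.
example = "KoksijdeEMA.Anc:sample01 M Control"
print(reformat_ind_line(example))  # sample01 M KoksijdeEMA.Anc
```

After the edit, the third column of the `.ind` file holds the group label, which is the column ADMIXTOOLS reads as the population name.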

## F3 Statistics

F3 statistics can be used for two purposes:
- Test whether Pop C is admixed between Pop A and Pop B, *f3*(A,B,C)
- Measure the shared drift of Pop A and Pop B, given an outgroup Pop O, *f3*(A,B,O)
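
As a rough illustration of the quantity behind both uses, the sketch below assumes the usual definition of *f3*(A,B,C) as the average over SNPs of (c - a)(c - b), where a, b and c are the allele frequencies in Pop A, Pop B and Pop C. It is a naive estimator run on made-up frequencies: qp3Pop additionally corrects for finite sample sizes and estimates standard errors with a block jackknife, and neither step is reproduced here.

```python
def naive_f3(freq_a, freq_b, freq_c):
    """Average of (c - a) * (c - b) over SNPs; Pop C is the third argument, as in f3(A,B,C)."""
    products = [(c - a) * (c - b) for a, b, c in zip(freq_a, freq_b, freq_c)]
    return sum(products) / len(products)


# Made-up allele frequencies at two SNPs.
pop_a = [0.1, 0.9]
pop_b = [0.9, 0.1]
intermediate_c = [0.5, 0.5]  # lies between A and B at every SNP
outside_o = [0.0, 1.0]       # lies outside the A-B range at every SNP

print(naive_f3(pop_a, pop_b, intermediate_c))  # negative value
print(naive_f3(pop_a, pop_b, outside_o))       # positive value
```

A third population whose frequencies sit between those of A and B gives a negative value, which is the signal the admixture test looks for. A population outside that range gives a positive value.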


## F3 Admixture

F3 Admixture is a formal test of admixture: if *f3*(A,B,C) is negative with a Z-score < -3, then Pop C is shown to be admixed between Pop A and Pop B.

Let's create a directory where we can run and store the F3 Admixture analyses.

In [None]:
mkdir F3_A

As with any other ADMIXTOOLS program, the command line follows a specific pattern:

software -p file.par

For both F3 Admixture and F3 Outgroup, we will use the program **qp3Pop**, so the command line will be:

In [None]:
qp3Pop -p file.par

The par file will contain the information needed to perform the analyses, namely:
- the dataset in EIGENSTRAT format
- the list of three-population test(s) we want to carry out
- optional parameters (here, we will use inbreed: YES, a parameter needed when dealing with pseudohaploid data, such as aDNA)

An example of a par.file:

In [None]:
genotypename: ../dataset/non_imputed_fstats.geno
snpname:   ../dataset/non_imputed_fstats.snp
indivname:   ../dataset/non_imputed_fstats.ind
popfilename:  F3A_Tests_List
inbreed: YES
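
Since the same par file layout is reused for every qp3Pop run in this tutorial, it can be convenient to generate it rather than type it. The following is a minimal sketch that uses only the parameters shown above; the function name is hypothetical and not part of ADMIXTOOLS.

```python
def qp3pop_par_text(dataset_prefix, popfilename, inbreed=True):
    """Return the text of a qp3Pop par file for an EIGENSTRAT dataset prefix."""
    lines = [
        f"genotypename: {dataset_prefix}.geno",
        f"snpname: {dataset_prefix}.snp",
        f"indivname: {dataset_prefix}.ind",
        f"popfilename: {popfilename}",
    ]
    if inbreed:
        # Needed for pseudohaploid data such as aDNA, as noted above.
        lines.append("inbreed: YES")
    return "\n".join(lines) + "\n"


print(qp3pop_par_text("../dataset/non_imputed_fstats", "F3A_Tests_List"), end="")
```

Redirecting the printed text to a file such as `F3A.par` gives a par file equivalent to the example above.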

The popfilename parameter takes a text file with a list of the F3 Admixture tests we want to run. The text file will contain three populations on each line in this order: (Proxy) source 1, (Proxy) source 2 and Target. For example:

In [None]:
GermanySM.Anc FranceLIA.Anc KoksijdeEMA.Anc
NedEMA.Anc FranceLIA.Anc KoksijdeEMA.Anc
UKCamEMA.Anc FranceLIA.Anc KoksijdeEMA.Anc

With the popfilename created, we are ready to run F3 Admixture and test whether the target group is admixed between the two proxy sources.

In [None]:
qp3Pop -p F3A.par >> RES_F3A

In [None]:
python /gpfs/helios/projects/echo_workshops/project.1.tk/scripts/JobParser.py --command "qp3Pop -p F3A.par >> RES_F3A" --name F3A

In [None]:
sbatch F3A.sh

#### F3 Admixture Output

The output file shows 7 columns:
- Source 1
- Source 2
- Target
- F3
- Standard Error
- Z score
- Number of SNPs used

To show that an admixture event has taken place, the Z-score should be < -3.
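
The Z-score rule can be applied automatically with a short script. This is a sketch that assumes each result line in `RES_F3A` starts with `result:` followed by the seven columns listed above; the function names are illustrative.

```python
from pathlib import Path


def parse_qp3pop(lines):
    """Collect rows from lines assumed to look like: result: S1 S2 Target f3 SE Z nSNPs."""
    rows = []
    for line in lines:
        parts = line.split()
        if len(parts) == 8 and parts[0] == "result:":
            rows.append({
                "source1": parts[1],
                "source2": parts[2],
                "target": parts[3],
                "f3": float(parts[4]),
                "se": float(parts[5]),
                "z": float(parts[6]),
                "nsnps": int(parts[7]),
            })
    return rows


def significantly_admixed(rows, z_threshold=-3.0):
    """Keep the tests whose Z-score is below the threshold used in this tutorial."""
    return [row for row in rows if row["z"] < z_threshold]


if __name__ == "__main__":
    results = Path("RES_F3A")
    if results.exists():
        rows = parse_qp3pop(results.read_text().splitlines())
        for row in significantly_admixed(rows):
            print(row["source1"], row["source2"], row["target"], row["f3"], row["z"])
```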

**Question** 

- Is there an admixed group?
- What groups are the proxy sources?

## F3 Outgroup

F3 Outgroup statistics measure how closely related two populations (Pop A and Pop B) are, given an outgroup (Pop O). Generally, F3 Outgroup is used with a fixed target group, while sifting through multiple other groups.

For example, *f3O*(X, KOS, Yoruba), where X stands for each of the many Northern European groups we will test, KOS is our Koksijde target group and Yoruba is the outgroup.

Let's create a directory where we can run and store the F3 Outgroup analyses.

In [None]:
mkdir F3_O

Again, as with any other ADMIXTOOLS program, the command line follows a specific pattern: software -p file.par

As for F3 Admixture, we will use the program **qp3Pop**, so the command line for F3 Outgroup will be:

In [None]:
qp3Pop -p file.par

The par file should look like this:

In [None]:
genotypename: ../dataset/non_imputed_fstats.geno
snpname:   ../dataset/non_imputed_fstats.snp
indivname:   ../dataset/non_imputed_fstats.ind
popfilename:  F3O_Tests_List
inbreed: YES

F3O_Tests_List is a three-column file listing all F3-statistic combinations that we want to test, for example:

In [None]:
English.HO	KoksijdeEMA.Anc	YRI
Estonian.HO	KoksijdeEMA.Anc	YRI
Finnish.HO	KoksijdeEMA.Anc	YRI
FranceLIA.Anc	KoksijdeEMA.Anc	YRI
French.HO	KoksijdeEMA.Anc	YRI
GermanySM.Anc	KoksijdeEMA.Anc	YRI
NedEMA.Anc	KoksijdeEMA.Anc	YRI
Norwegian.HO	KoksijdeEMA.Anc	YRI
Scottish.HO	KoksijdeEMA.Anc	YRI
UKCamEMA.Anc	KoksijdeEMA.Anc	YRI
UKKentEMA.Anc	KoksijdeEMA.Anc	YRI
UKNorfolkEMA.Anc	KoksijdeEMA.Anc	YRI
UKSouthEMA.Anc	KoksijdeEMA.Anc	YRI
UKSuffolkEMA.Anc	KoksijdeEMA.Anc	YRI
UKYrkEMA.Anc	KoksijdeEMA.Anc	YRI

Note that while the first population changes (the 'X' group), the other two are kept the same (KOS and Yoruba, labelled `KoksijdeEMA.Anc` and `YRI` in the dataset).
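
Because only the first column varies, a list like this can be generated from the names of the 'X' groups. A minimal sketch, with a hypothetical function name and a subset of the groups shown above:

```python
def build_f3_outgroup_tests(x_groups, target, outgroup):
    """One tab-separated line per X group: X, fixed target, fixed outgroup."""
    return ["\t".join([x, target, outgroup]) for x in x_groups]


x_groups = ["English.HO", "Estonian.HO", "Finnish.HO"]  # subset of the list above
for line in build_f3_outgroup_tests(x_groups, "KoksijdeEMA.Anc", "YRI"):
    print(line)
```

Saving the printed lines to `F3O_Tests_List` gives a file in the format expected by the `popfilename` parameter.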

With the popfilename created, we are ready to run F3 Outgroup.

In [None]:
qp3Pop -p F3_O.par >> RES_F3O

In [None]:
python /gpfs/helios/projects/echo_workshops/project.1.tk/scripts/JobParser.py --command "qp3Pop -p F3_O.par >> RES_F3O" --name F3_O

In [None]:
sbatch F3_O.sh

#### F3 Outgroup Output

Let's look inside the output

In [None]:
less -S RES_F3O

The output has 7 columns: 

- Source 1
- Source 2
- Target
- F3
- Standard Error
- Z score
- Number of SNPs used

The first three columns contain the population labels. The column names Source 1, Source 2 and Target fit F3 Admixture well, but are less suited to F3 Outgroup: here Source 1 is the reference group (X), Source 2 is the target population, and Target is actually the outgroup.

A common way to analyse and visualize the F3 Outgroup results is to sort them by the **f_3** column. Higher values indicate stronger shared drift, while lower values indicate weaker shared drift.
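
The sorting step can be sketched as follows, assuming each result line in `RES_F3O` starts with `result:` followed by the seven columns described above. The function name is illustrative, and the script does not replace the plotting script used in this tutorial.

```python
from pathlib import Path


def rank_by_f3(lines):
    """Return (X group, f3) pairs sorted from highest to lowest f3.

    Assumes result lines of the form: result: X Target Outgroup f3 SE Z nSNPs.
    """
    pairs = []
    for line in lines:
        parts = line.split()
        if len(parts) == 8 and parts[0] == "result:":
            pairs.append((parts[1], float(parts[4])))
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    results = Path("RES_F3O")
    if results.exists():
        for group, f3 in rank_by_f3(results.read_text().splitlines()):
            print(f"{f3:.6f}\t{group}")
```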

We can visualize the results with F3O_plot.py.

In [None]:
python F3O_plot.py RES_F3O Plot_RES_F3O

**Question** 

- Which are the top scoring groups?

#### Comparing two target groups with F3 Outgroups

We can also compare two target groups based on their affinity to the same list of 'X' populations. For this task, a scatterplot is quite handy.

The second target group will be the imputed data, so we will compare the imputed and non-imputed F3 Outgroup results. We will use the same reference groups as for the non-imputed set.

In [None]:
English.HO	KoksijdeEMA.Anc	YRI
Estonian.HO	KoksijdeEMA.Anc	YRI
Finnish.HO	KoksijdeEMA.Anc	YRI
FranceLIA.Anc	KoksijdeEMA.Anc	YRI
French.HO	KoksijdeEMA.Anc	YRI
GermanySM.Anc	KoksijdeEMA.Anc	YRI
NedEMA.Anc	KoksijdeEMA.Anc	YRI
Norwegian.HO	KoksijdeEMA.Anc	YRI
Scottish.HO	KoksijdeEMA.Anc	YRI
UKCamEMA.Anc	KoksijdeEMA.Anc	YRI
UKKentEMA.Anc	KoksijdeEMA.Anc	YRI
UKNorfolkEMA.Anc	KoksijdeEMA.Anc	YRI
UKSouthEMA.Anc	KoksijdeEMA.Anc	YRI
UKSuffolkEMA.Anc	KoksijdeEMA.Anc	YRI
UKYrkEMA.Anc	KoksijdeEMA.Anc	YRI

In [None]:
genotypename: ../dataset/imputed_set_fstats.geno
snpname:   ../dataset/imputed_set_fstats.snp
indivname:   ../dataset/imputed_set_fstats.ind
popfilename:  F3O_Tests_List_Target2
inbreed: YES

In [None]:
qp3Pop -p F3_O.par >> RES_F3O_Target2

### PLOT

To make the final scatterplot easier to read, we need to slightly modify the F3 Outgroup result file.

In [None]:
sed -i 's/KoksijdeEMA.Anc/imputed_KoksijdeEMA.Anc/g' RES_F3O_Target2

In [None]:
python /gpfs/helios/projects/echo_workshops/project.1.tk/scripts/F3O_Scatterplot.py RES_F3O_Target2 ../../non_imputed/F3_O/RES_F3O F3O_Scatterplot

## F4 statistics

With *f4*(A,B,C,O) we test whether Pop C shares more drift with Pop A or with Pop B, given an outgroup O. Specifically:
- if Pop C shares more drift with Pop A, the statistic will be positive
- if Pop C shares more drift with Pop B, the statistic will be negative

Along with the F4 values, **qpDstat** also estimates Z-scores, which can be used to reject the null hypothesis that Pop C shares the same amount of drift with Pop A and Pop B (*f4* = 0) when Z-score < -3 or Z-score > +3.

In [None]:
mkdir F4
cd F4
mkdir analyses
mkdir dataset

For the F4 statistics we will use a dataset containing both imputed and non-imputed genomes. Copy it into your `F4/dataset` directory.

In [None]:
cd dataset
cp  /gpfs/helios/projects/echo_workshops/project.1.tk/data/PopStructureFiles/F4_dataset.* .

The tool we are going to use to run F4 statistics is **qpDstat**, and as for all other ADMIXTOOLS applications, we are using a par file as follows:

In [None]:
qpDstat -p par.file >> RES_F4

To run F4 statistics the par file should look like this:

In [None]:
genotypename:   ../dataset/F4_dataset.geno
snpname:   ../dataset/F4_dataset.snp
indivname:  ../dataset/F4_dataset.ind
#poplistname:  list_F4.txt #(contains list of populations, one population on each line).
# Program will run the method for all quadruples.
popfilename:   F4_List
f4mode: YES

Importantly, in our case we are going to use the option
**popfilename**: list, where we list all the four-population tests we want to carry out.

Alternatively, we could use the pop**list**name option, where we list N populations, and qpDstat will test every possible combination of those N groups.

An example of **popfilename**: F4_List:

In [None]:
English.HO French.HO KoksijdeEMA.Anc_imputed YRI
English.HO French.HO KoksijdeEMA.Anc_nonimputed YRI

qpDstat will then compute two F4 statistics, with the populations in the order given:
- *f4*(English.HO, French.HO, imputed Koksijde, YRI)
- *f4*(English.HO, French.HO, non imputed Koksijde, YRI)

An example of a **poplistname** file (for instance `list_F4.txt`) can be:

In [None]:
Eng.S
French.HO
KOS
YRI

And qpDstat will perform the following tests:

Eng.S, French.HO, KOS, YRI

Eng.S, KOS, French.HO, YRI

Eng.S, YRI, French.HO, KOS
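
The three tests above are the three ways of splitting four populations into two pairs. The sketch below enumerates them for any list of populations. It reproduces the order shown here for four groups, but for longer lists the order and orientation in which qpDstat reports the tests may differ, so treat it as an illustration only.

```python
from itertools import combinations


def quadruples(populations):
    """List the three pairings of every set of four populations, in list order."""
    tests = []
    for w, x, y, z in combinations(populations, 4):
        tests.extend([(w, x, y, z), (w, y, x, z), (w, z, x, y)])
    return tests


for test in quadruples(["Eng.S", "French.HO", "KOS", "YRI"]):
    print(", ".join(test))
```

With N populations this gives three tests for each of the possible sets of four, which is why a `popfilename` list of specific tests is preferable when only a few comparisons are of interest.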

#### F4 Output

In [None]:
grep result RES_F4

In [None]:
result: POPA POPB POPC POPD f_4 Zscore BABA ABBA SNPs

We can focus on the Zscore to interpret the F4 results. 
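
As a reading aid, the sketch below turns one result line into a sentence, using the sign convention and the Z-score threshold of 3 given earlier in this section. It assumes the column layout shown above (`result:` followed by the four populations, f_4, Z-score, BABA, ABBA and SNPs); the function name is hypothetical.

```python
from pathlib import Path


def interpret_f4(line, z_threshold=3.0):
    """Interpret one line assumed to be: result: A B C D f4 Z BABA ABBA nSNPs."""
    parts = line.split()
    if len(parts) != 10 or parts[0] != "result:":
        return None  # not a result line in the expected layout
    pop_a, pop_b, pop_c = parts[1], parts[2], parts[3]
    z = float(parts[6])
    if z > z_threshold:
        return f"{pop_c} shares more drift with {pop_a} (Z = {z})"
    if z < -z_threshold:
        return f"{pop_c} shares more drift with {pop_b} (Z = {z})"
    return f"{pop_c}: no significant difference between {pop_a} and {pop_b} (Z = {z})"


if __name__ == "__main__":
    results = Path("RES_F4")
    if results.exists():
        for line in results.read_text().splitlines():
            message = interpret_f4(line)
            if message:
                print(message)
```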

**Question** 

- Which groups share more drift with our target group?