# qpWave and qpAdm

qpWave and qpAdm allow us to model a target population as a mixture of other populations, given a set of reference groups.

The overview below follows https://comppopgenworkshop2019.readthedocs.io/en/latest/contents/05_qpwave_qpadm/qpwave_qpadm.html:


qpWave and qpAdm are tools for summarizing information from multiple F-statistics to make demographic inferences. With *qpWave* and *qpAdm* we can:

* Detect the minimum number of independent gene pools needed to explain a set of target populations (qpWave)
* Test the sufficiency of an admixture model within the resolution of the data (qpAdm)
* Estimate admixture proportions (qpAdm)


### Preparing the dataset

Both qpWave and qpAdm require input files in EIGENSTRAT format, so we first need to convert the PLINK files to the geno/ind/snp format.

As for all EIGENSOFT/ADMIXTOOLS programs, the basic syntax to run convertf is `software -p parfile.par`. To convert the PLINK files to EIGENSTRAT format we need to run:

`convertf -p file.par`

An automated script to create a convertf par file is available in ../scripts/BED2EIG.sh, and can be used as follows:

`bash BED2EIG.sh input_prefix output_prefix`

In [2]:
! bash ../scripts/BED2EIG.sh ../dataset/1KGs_chr1_maf_pruned 1KGs_chr1_maf_pruned_converted

You should now find a convertf_\*.par file in your working directory.
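The contents of BED2EIG.sh are not shown here. As a rough sketch, assuming the script simply writes the standard convertf keys for a PLINK-to-EIGENSTRAT conversion, the text of such a par file could be produced in Python as follows. The function name `convertf_par_text` is a hypothetical helper, not part of the script or of EIGENSOFT.

```python
def convertf_par_text(input_prefix: str, output_prefix: str) -> str:
    """Return the text of a convertf par file (PLINK bed/bim/fam -> EIGENSTRAT).

    Assumption: the input is a binary PLINK fileset sharing one prefix.
    """
    lines = [
        f"genotypename: {input_prefix}.bed",
        f"snpname: {input_prefix}.bim",
        f"indivname: {input_prefix}.fam",
        "outputformat: EIGENSTRAT",
        f"genotypeoutname: {output_prefix}.geno",
        f"snpoutname: {output_prefix}.snp",
        f"indivoutname: {output_prefix}.ind",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Prefixes taken from the commands used in this notebook.
    print(convertf_par_text("../dataset/1KGs_chr1_maf_pruned",
                            "1KGs_chr1_maf_pruned_converted"))
```

Returning the text rather than writing a file keeps the sketch free of side effects; writing it to `convertf_<output_prefix>.par` would be one extra line.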

In [3]:
! convertf -p convertf_1KGs_chr1_maf_pruned_converted.par



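First, let's edit the .ind file so that the third column contains the population/group names. The commands below split the colon-separated sample identifier in the first column and move its first part, which is assumed to hold the population label in this dataset, to the third column: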

In [9]:
! sed -r 's/:/\t/g' 1KGs_chr1_maf_pruned_converted.ind > tmp

In [10]:
! awk '{print $2,$3,$1}' tmp > 1KGs_chr1_maf_pruned_converted.ind
! rm tmp
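For readers who prefer to do this step without sed and awk, the same transformation can be sketched in Python. This assumes each .ind line has the form `FID:IID sex label`, with the population name stored in the family ID; `reformat_ind_line` is a hypothetical helper and the sample line in the example is made up.

```python
def reformat_ind_line(line: str) -> str:
    """Turn 'FID:IID sex label' into 'IID sex FID'.

    Mirrors the sed + awk commands above: the original label column is
    dropped and the family ID (assumed to be the population) replaces it.
    """
    sample, sex, _label = line.split()
    fid, iid = sample.split(":", 1)
    return f"{iid} {sex} {fid}"


if __name__ == "__main__":
    # Made-up example line, for illustration only.
    print(reformat_ind_line("CEU:sample1 M Control"))
```

Applying the function to every line of the .ind file and writing the result back reproduces the effect of the two cells above.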

### Prepare the Left and Right populations

To run both qpWave and qpAdm we need two simple text files: a left file and a right file. Each contains a list of populations, one per line. Single-sample groups can also be used. Every population listed must be present in the third column of the .ind file.

* The **left** file should list the proxy sources of the admixture event we want to test with qpAdm.
* The **right** file should list the reference groups: populations differentially related to the left populations and the admixed target.

In [28]:
! echo -e "CEU\nYRI" > left.txt
! echo -e "MSL\nGWD\nLWK\nFIN\nIBS\nTSI" > right.txt 
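Since every name in left.txt and right.txt has to appear in the third column of the .ind file, it can be worth checking this before launching a run. A minimal sketch, assuming a whitespace-separated .ind file; `missing_populations` is a hypothetical helper and the sample lines are made up.

```python
def missing_populations(ind_lines, wanted):
    """Return the names in `wanted` that never occur in column 3 of the .ind lines."""
    available = {line.split()[2] for line in ind_lines if line.strip()}
    return [pop for pop in wanted if pop not in available]


if __name__ == "__main__":
    # Made-up .ind content, for illustration only.
    ind_lines = ["s1 M CEU", "s2 F YRI", "s3 F MSL"]
    print(missing_populations(ind_lines, ["CEU", "YRI"]))  # left populations
    print(missing_populations(ind_lines, ["MSL", "GWD"]))  # right populations
```

In practice `ind_lines` would come from reading the .ind file, and `wanted` from reading left.txt or right.txt line by line.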

## Running qpWave

To detect the minimum number of independent gene pools needed to explain a set of target populations, we run *qpWave*. *qpWave* allows us to check whether there was any gene flow between the left and the right populations, with the aim of selecting groups that are as independent as possible.

If the right and the left populations are independent, we can then move on to *qpAdm*.

### Preparing the par file

As for all other EIGENSOFT/ADMIXTOOLS programs, we need to prepare a par file for the software. A bash script is available in ../scripts/ to automate this step.

In [21]:
! bash ../scripts/qpWave_qpAdm.sh 1KGs_chr1_maf_pruned_converted left.txt right.txt

In [22]:
! more qpWave_qpAdm_1KGs_chr1_maf_pruned_converted.par

indivname: 1KGs_chr1_maf_pruned_converted.ind
snpname: 1KGs_chr1_maf_pruned_converted.snp
genotypename: 1KGs_chr1_maf_pruned_converted.geno
popleft: left.txt
popright: right.txt
details: YES
allsnps: YES
summary: YES


In [None]:
! qpWave -p qpWave_qpAdm_1KGs_chr1_maf_pruned_converted.par >> qpWave.log

In the log file, qpWave lists the files used, as well as the left and right populations considered for the run.

We are interested in the last lines, which report the **rank** tests. Specifically, we focus on the last rank row, which corresponds to the highest rank tested. Here we are testing N=2 left populations, so the maximum rank is N-1 = 1.
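The `dof` column of the rank rows can be understood with a standard rank-counting argument: qpWave works on a matrix of f4-statistics with (left - 1) rows and (right - 1) columns, and a rank-r model of an a-by-b matrix leaves (a - r)(b - r) degrees of freedom. The sketch below is an illustration under that assumption, not code taken from ADMIXTOOLS; `qpwave_dof` is a hypothetical helper. With 2 left and 6 right populations it gives the values 5 and 0 that appear in the log output further down.

```python
def qpwave_dof(n_left: int, n_right: int, rank: int) -> int:
    """Degrees of freedom left by a rank-`rank` model of the f4 matrix.

    Assumes the matrix has (n_left - 1) rows and (n_right - 1) columns.
    """
    rows, cols = n_left - 1, n_right - 1
    if not 0 <= rank <= min(rows, cols):
        raise ValueError("rank must lie between 0 and min(rows, cols)")
    return (rows - rank) * (cols - rank)


if __name__ == "__main__":
    # 2 left populations (CEU, YRI) and 6 right populations, as in this run.
    for rank in (0, 1):
        print(rank, qpwave_dof(2, 6, rank))
```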

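We are looking for an indication that the populations considered are independent: a p-value < 0.05 at **taildiff** (the p-value for the improvement in fit over the previous rank) indicates that the selected groups are indeed independent. In the output below, the rank 1 row has a taildiff of about 3.6e-309, so this condition is met.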

In [24]:
! egrep rank qpWave.log

f4rank: 0 dof:      5 chisq:  1439.595 tail:      3.62331187e-309 dofdiff:      0 chisqdiff:     0.000 taildiff:                    1
f4rank: 1 dof:      0 chisq:     0.000 tail:                    1 dofdiff:      5 chisqdiff:  1439.595 taildiff:      3.62331187e-309
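Reading these rows by eye becomes tedious when many left/right combinations are tested. The following sketch parses the `f4rank` lines and reports the taildiff of the highest rank; it assumes the `key: value` layout shown above, and `parse_rank_line` and `highest_rank_taildiff` are hypothetical helpers rather than part of ADMIXTOOLS.

```python
import re

# Matches "key: value" pairs such as "chisq:  1439.595" in an f4rank line.
RANK_FIELD = re.compile(r"(\w+):\s+(\S+)")

# The two rank rows reported in the qpWave log above.
EXAMPLE_LOG = [
    "f4rank: 0 dof:      5 chisq:  1439.595 tail:      3.62331187e-309 "
    "dofdiff:      0 chisqdiff:     0.000 taildiff:                    1",
    "f4rank: 1 dof:      0 chisq:     0.000 tail:                    1 "
    "dofdiff:      5 chisqdiff:  1439.595 taildiff:      3.62331187e-309",
]


def parse_rank_line(line):
    """Parse one 'f4rank: ...' log line into a dict of floats."""
    return {key: float(value) for key, value in RANK_FIELD.findall(line)}


def highest_rank_taildiff(log_lines):
    """Return (rank, taildiff) for the highest rank reported in the log."""
    rows = [parse_rank_line(l) for l in log_lines if l.startswith("f4rank:")]
    top = max(rows, key=lambda row: row["f4rank"])
    return int(top["f4rank"]), top["taildiff"]


if __name__ == "__main__":
    rank, taildiff = highest_rank_taildiff(EXAMPLE_LOG)
    print(f"rank {rank}: taildiff < 0.05 is {taildiff < 0.05}")
```

To use it on a real run, `log_lines` could be the lines of qpWave.log read from disk.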


## Running qpAdm

So far, we have tested only the left and the right groups, that is, the proxy sources of the admixture event and the reference groups. We have not yet studied their relationship with the admixed target samples.
Now that we know that the right and the left groups are independent, we can proceed to run *qpAdm* with those samples to model the admixed target group.

To run *qpAdm*, we need to add the admixed target group to the left.txt file. Make sure it is the first entry in left.txt:

In [29]:
! sed  -i '1s/^/ASW\n/' left.txt

We can now run qpAdm with the same par file we created before for qpWave.

In [None]:
! qpAdm -p qpWave_qpAdm_1KGs_chr1_maf_pruned_converted.par >> qpAdm.log

We are going to focus on the block of qpAdm.log that reports `best coefficients`, `std. errors` and `summ`.

We modelled ASW as a mixture of CEU and YRI (following the order of the list in the left.txt file).

* `best coefficients` lists the ancestry proportions assigned
* `std. errors` are the standard errors computed via block jackknife
* `summ` gives a summary of the run: in this case ASW is modelled with 2 groups, the p-value of the model is ~0.75, and the ancestry proportions are ~0.22 and ~0.78

The last block shows all the models tested in the single run:

```
fixed pat  wt  dof     chisq       tail prob
00  0     4     1.943        0.746211     0.221     0.779 
```

* `00` stands for "both sources used": both CEU and YRI are used to model ASW.
* `tail prob` is the p-value; in qpAdm a p-value > 0.05 is taken to indicate that the model is not rejected and is therefore supported. In this case, the model is supported with a tail prob of ~0.75.
* `0.221` is the ancestry proportion of the first source (CEU)
* `0.779` is the ancestry proportion of the second source (YRI)


```
fixed pat  wt  dof     chisq       tail prob
01  1     5  1329.187               0     1.000     0.000 

```

* `01` stands for "one source used": the first source is kept and the second is omitted, so only CEU is used to model ASW.
* `tail prob` is the p-value; in qpAdm a p-value > 0.05 is taken to indicate that the model is not rejected and is therefore supported. In this case the tail prob is 0, so the model is not supported.
* `1.000` is the ancestry proportion of the first source (CEU)
* `0.000` is the ancestry proportion of the second source (YRI), which is 0 because it is not used
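The two rows discussed above can also be compared programmatically. The sketch below assumes the column order shown in the blocks (pattern, wt, dof, chisq, tail prob, then one coefficient per source) and applies the 0.05 threshold used in this tutorial; `parse_pattern_line` and `is_supported` are hypothetical helpers. Real logs may carry extra annotations on these lines, which this sketch does not handle.

```python
# The two model rows reported in qpAdm.log above.
EXAMPLE_ROWS = [
    "00  0     4     1.943        0.746211     0.221     0.779 ",
    "01  1     5  1329.187               0     1.000     0.000 ",
]


def parse_pattern_line(line):
    """Split one model row into named fields."""
    fields = line.split()
    return {
        "pattern": fields[0],
        "wt": int(fields[1]),
        "dof": int(fields[2]),
        "chisq": float(fields[3]),
        "tail_prob": float(fields[4]),
        "coefficients": [float(x) for x in fields[5:]],
    }


def is_supported(model, alpha=0.05):
    """Model is kept if it is not rejected and all proportions lie in [0, 1]."""
    feasible = all(0.0 <= c <= 1.0 for c in model["coefficients"])
    return model["tail_prob"] > alpha and feasible


if __name__ == "__main__":
    for row in EXAMPLE_ROWS:
        model = parse_pattern_line(row)
        print(model["pattern"], model["coefficients"], is_supported(model))
```

Checking that the coefficients lie between 0 and 1 is an extra condition added here for illustration; the text above relies on the p-value alone.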

We thus conclude that ASW can be modelled as a mixture of CEU and YRI, with CEU contributing ~22% and YRI ~78%.