# Hendricks Group

# Hidden Ancestry Example Notebook

_In the following notebook, we solve an example Hidden Ancestries problem._ 

* We use simulated SNPs with minor allele frequencies for $K=4$ ancestries -- European ancestries, African ancestries, South Asian ancestries, and East Asian ancestries. 

* We numerically solve for the four ancestries' true proportions in the total (simulated) population, collected in the vector $\pi^*:=(\pi_1,\pi_2,\pi_3,\pi_4)$.
    * $\pi_1$ denotes the proportion of European ancestries in the population
    * $\pi_2$ denotes the proportion of African ancestries in the population
    * $\pi_3$ denotes the proportion of South Asian ancestries in the population
    * $\pi_4$ denotes the proportion of East Asian ancestries in the population

* In this notebook, we work with an example data set $D$ with known (i.e., manufactured) ancestry proportions $\pi_k=0.25$ for $k=1,2,3,4$, and verify that our Python script recovers these proportions from the data using Sequential Least Squares Quadratic Programming, or SLSQP.
    * Thus, we have the simple case in which all 4 ancestries have equal proportions of $25\%$ of the total population.

* The cell below calls the generalized HA script, which is the main feature of the HA Python package. Then we can access the functions inside of the script to solve an example problem.

In [1]:
%run HA_script.py

* For now, the next cell changes the current working directory so that we can read in our genetic data, $D$.
    * This is because the data is not stored in our package yet.
    * Note that eventually we will need to upload data along with our package once we settle on a finalized example data set.
* We then read in the data using Pandas, which converts the tab-separated text file into a DataFrame whose numeric columns Python's numerical linear algebra tools can work with.
* The format of $D$ matters:
    * $D$ contains (for now) a reference column containing the natural numbers, the chromosome number, the SNP identifier, the $4$ minor allele frequencies of the $K=4$ ancestries, and the total allele frequencies. We only need the last five columns of $D$ to solve our example problem.
        * That is, we only need the minor and total allele frequencies, which are the last $K+1$ columns of $D$.
* Finally, we print out the first 5 rows of $D$ to take a look at its structure and check for basic correctness in what we _think_ we are working with!
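To make the column layout concrete, here is a minimal sketch that rebuilds by hand the five rows of $D$ printed in this notebook and keeps only the last $K+1$ columns. The names `D_head`, `freqs`, `A`, and `af_obs` are ours, chosen for illustration; they are not part of the HA package.

```python
import pandas as pd

K = 4  # number of ancestries

# The first five rows of D, typed in by hand from the printout in this notebook.
cols = ["Unnamed: 0", "CHR", "SNP", "CEU_MAF", "afr_MAF", "sas_MAF", "eas_MAF", "af"]
rows = [
    [0, 1, "rs1000364", 0.3939, 0.070438, 0.423307, 0.384912, 0.32305],
    [1, 1, "rs1002655", 0.3434, 0.230153, 0.468285, 0.350196, 0.34735],
    [2, 1, "rs1008082", 0.2828, 0.082341, 0.236197, 0.112097, 0.18235],
    [3, 1, "rs10082057", 0.0, 0.181558, 0.0, 0.0, 0.0463],
    [4, 1, "rs10082123", 0.1515, 0.230183, 0.220848, 0.227184, 0.2108],
]
D_head = pd.DataFrame(rows, columns=cols)

# Keep only the last K+1 columns: K reference MAFs plus the total allele frequency.
freqs = D_head.iloc[:, -(K + 1):]

A = freqs.iloc[:, :K].to_numpy()      # reference minor allele frequencies, one column per ancestry
af_obs = freqs.iloc[:, K].to_numpy()  # total (observed) allele frequencies

print(list(freqs.columns))
print(A.shape, af_obs.shape)
```

Selecting by position rather than by column name keeps the sketch independent of how the ancestry columns happen to be labeled, at the cost of relying on the column order described above.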

In [5]:
import os
import pandas as pd

os.chdir('/nfs/storage/math/gross-s2/projects/mixtures/example_sims/')

# Read in the data
D = pd.read_csv("Afr_CEU_sas_eas_10000tot_2500Afr_2500sas_2500eas_sims_and_reference.txt", sep='\t')

D.head(5) ### Look at the first 5 rows

| Unnamed: 0 | CHR | SNP | CEU_MAF | afr_MAF | sas_MAF | eas_MAF | af |
|---|---|---|---|---|---|---|---|
| 0 | 1 | rs1000364 | 0.3939 | 0.070438 | 0.423307 | 0.384912 | 0.32305 |
| 1 | 1 | rs1002655 | 0.3434 | 0.230153 | 0.468285 | 0.350196 | 0.34735 |
| 2 | 1 | rs1008082 | 0.2828 | 0.082341 | 0.236197 | 0.112097 | 0.18235 |
| 3 | 1 | rs10082057 | 0.0 | 0.181558 | 0.0 | 0.0 | 0.0463 |
| 4 | 1 | rs10082123 | 0.1515 | 0.230183 | 0.220848 | 0.227184 | 0.2108 |


* $D$ was manufactured so that the true solution is $\pi^*=(.25,.25,.25,.25)$.
* In the cell below, we choose an initial iterate, $\pi^{(0)}$, sufficiently far away from the truth, $\pi^*$.

In [6]:
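import numpy as np  # imported explicitly so this cell does not rely on names defined by HA_script.py
pi_0 = np.transpose([[0.8, 0.1, 0.05, 0.05]])  # initial iterate as a column vector of shape (4, 1)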
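As a quick sanity check on this choice, the sketch below confirms that the initial iterate is a valid vector of proportions (nonnegative, summing to one) and measures how far it sits from $\pi^*$. The names `pi_star` and `distance` are ours, introduced only for this check.

```python
import numpy as np

pi_0 = np.transpose([[0.8, 0.1, 0.05, 0.05]])  # same initial iterate, shape (4, 1)
pi_star = np.full((4, 1), 0.25)                # the manufactured truth

# A valid starting point should itself be a set of proportions.
is_feasible = bool(np.all(pi_0 >= 0) and np.isclose(pi_0.sum(), 1.0))

# Euclidean distance between the initial iterate and the truth.
distance = float(np.linalg.norm(pi_0 - pi_star))

print(pi_0.shape, is_feasible, round(distance, 4))
```

The distance works out to $\sqrt{0.405}\approx 0.636$, so the solver starts well away from the equal-proportions answer.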

* The user must specify the number of ancestries -- here, we have $4$.
* Finally, the user may apply the HA function to the data $D$, with initial iterate $\pi^{(0)}$ and the number of ancestries, $K=4$.
* The HA function will output an array containing the numerical solution, $\pi^{final}$, the number of SLSQP iterations taken to find the numerical solution, and the total runtime (in seconds) of SLSQP.
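We do not reproduce the source of `HA` here. As a rough sketch of how a function with the same three outputs could be built, the block below assumes (our assumption, not something confirmed by the script) that the objective is the sum of squared differences between the observed allele frequencies and the $\pi$-weighted mixture of the reference minor allele frequencies, minimized subject to $0\le\pi_k\le 1$ and $\sum_k \pi_k = 1$. The function `ha_sketch` and the synthetic data are hypothetical; the solver call is SciPy's `scipy.optimize.minimize` with `method="SLSQP"`.

```python
import time

import numpy as np
from scipy.optimize import minimize


def ha_sketch(A, af_obs, pi_0):
    """Hypothetical stand-in for HA: constrained least squares solved with SLSQP.

    A       -- (N, K) array of reference minor allele frequencies
    af_obs  -- (N,) array of observed (total) allele frequencies
    pi_0    -- initial iterate with K entries (any shape; it is flattened)
    """
    K = A.shape[1]

    def objective(pi):
        residual = A @ pi - af_obs
        return residual @ residual  # assumed objective: sum of squared residuals

    def gradient(pi):
        return 2.0 * (A.T @ (A @ pi - af_obs))

    constraints = [{"type": "eq", "fun": lambda pi: np.sum(pi) - 1.0}]  # proportions sum to 1
    bounds = [(0.0, 1.0)] * K                                           # each proportion in [0, 1]

    start = time.perf_counter()
    result = minimize(
        objective,
        np.ravel(pi_0),
        jac=gradient,
        method="SLSQP",
        bounds=bounds,
        constraints=constraints,
        options={"ftol": 1e-12},
    )
    runtime = time.perf_counter() - start
    return result.x, result.nit, runtime


# Synthetic, noise-free data with equal true proportions (not the notebook's data set).
rng = np.random.default_rng(0)
A = rng.uniform(0.0, 0.5, size=(1000, 4))
af_obs = A @ np.full(4, 0.25)

pi_final, n_iter, runtime = ha_sketch(A, af_obs, np.transpose([[0.8, 0.1, 0.05, 0.05]]))
print(pi_final, n_iter, runtime)
```

Supplying the analytic gradient is optional; SLSQP would otherwise approximate it by finite differences, which costs extra function evaluations on a problem with tens of thousands of SNPs.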

In [15]:
k=4 # User must specify number of ancestries!
output_array = HA(D, pi_0, k) # This line runs the HA function inside the HA_script

print('our problem includes', np.shape(D)[0], 'SNPs, and', k, 'ancestries')
print()
print('numerical solution via SLSQP, pi_final = ',output_array[0])
print()
print('number of SLSQP iterations:',output_array[1])
print()
print('runtime of SLSQP:',output_array[2],'seconds')

our problem includes 61857 SNPs, and 4 ancestries

numerical solution via SLSQP, pi_final =  [0.25022569 0.24995832 0.24972281 0.25009319]

number of SLSQP iterations: 7

runtime of SLSQP: 0.020284197060391307 seconds


### _Above we see a printout of the numerical solution, $\pi^{final}$, the number of SLSQP iterations, and the runtime in seconds. Observe how close each component of the answer is to 0.25._
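To put a number on "close", the following sketch takes the values of $\pi^{final}$ exactly as printed above and compares them with $\pi^*$. The variable names are ours.

```python
import numpy as np

# Values copied from the printed output above.
pi_final = np.array([0.25022569, 0.24995832, 0.24972281, 0.25009319])
pi_star = np.full(4, 0.25)

abs_err = np.abs(pi_final - pi_star)  # componentwise absolute error
max_err = abs_err.max()               # worst-case component
total = pi_final.sum()                # should be 1 up to rounding of the printout

print(abs_err)
print(max_err, total)
```

The largest deviation is about $2.8\times 10^{-4}$ (the South Asian component), and the printed components sum to one up to the rounding of the printout.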