# Hendricks Group

# Hidden Ancestry Example Notebook

_In the following notebook, we solve an example Hidden Ancestries problem._ 

* We use simulated SNPs with minor allele frequencies for $K=5$ ancestries -- European ancestries, African ancestries, South Asian ancestries, East Asian ancestries, and Native American ancestries.

* We numerically solve for the true proportions of the 5 ancestries in the _observed_ population, which form the vector $\pi^*:=(\pi_1,\pi_2,\pi_3,\pi_4, \pi_5)$.
    * $\pi_1$ denotes the proportion of European ancestries in the observed population
    * $\pi_2$ denotes the proportion of African ancestries in the observed population
    * $\pi_3$ denotes the proportion of South Asian ancestries in the observed population
    * $\pi_4$ denotes the proportion of East Asian ancestries in the observed population
    * $\pi_5$ denotes the proportion of Native American ancestries in the observed population

* In this notebook, we work with an example data set $D$ with $N=10{,}000$ SNPs and ensure that our Python script correctly uses the data to solve for these ancestry proportions using Sequential Least Squares Quadratic Programming, or SLSQP.

* The cell below calls the generalized HA script, which is the main feature of the HA Python package. Then we can access the functions inside of the script to solve an example problem.

In [1]:
%run HA_script.py

* In the cell below, we read in the example data set provided with this package, called "HA-package-data.csv".
* We use Pandas to read the CSV file into a DataFrame, a tabular structure whose numeric columns can be used directly in numerical linear algebra.
* How $D$ is formatted matters a lot -- order matters...
    * $D$ contains chromosome number (CHR), SNP number (RSID, location on genome), base pair (bP), reference and alternate alleles (A1 and A2), the $K=5$ minor allele frequencies of the (1000 Genomes) reference ancestries (in this case), and finally, any (gnomAD) observed allele frequencies we wish to model.
    * $D$ does not contain an indexing column containing the natural numbers up to $N$ -- Pandas is just showing this as part of the print statement -- it is not a column of $D$ itself! 
    * We only need certain columns of $D$ to solve our example problem...
        * That is, we only need the $K$ reference minor allele frequencies and whichever observed allele frequency we are modeling, which is $K+1$ columns of $D$. In our case, we need 6 of the total 13 columns.
* We print out the first 5 rows of $D$ to take a look at its structure and check for basic correctness in what we _think_ we are working with!

In [2]:
import pandas as pd

# Read in the data
D = pd.read_csv("HA-package-data.csv")

D.head(5) ### Look at the first 5 rows

Unnamed: 0,CHR,RSID,bP,A1,A2,ref_eur_1000G,ref_afr_1000G,ref_sas_1000G,ref_eas_1000G,ref_nam_1000G,obs_afr_gnomad,obs_amr_gnomad,obs_oth_gnomad
0,1,rs2887286,1156131,C,T,0.173275,0.541663,0.531712,0.846223,0.7093,0.48861,0.525943,0.229705
1,1,rs41477744,2329564,A,G,0.001238,0.035714,0.0,0.0,0.0,0.045914,0.001179,0.008272
2,1,rs9661525,2952840,G,T,0.168316,0.120048,0.09918,0.393853,0.2442,0.135977,0.286052,0.155617
3,1,rs2817174,3044181,C,T,0.428213,0.959325,0.639072,0.570454,0.5,0.854879,0.48818,0.470425
4,1,rs12139206,3504073,T,C,0.204215,0.801565,0.393671,0.389881,0.3372,0.724178,0.295508,0.258748
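
The column selection described above can be sketched on the five rows just printed. This is a minimal illustration, not the HA script's own code: picking reference columns by the `ref_` prefix and naming `obs_afr_gnomad` as the observed column are assumptions made here for the example, and the names `A` and `b` are hypothetical.

```python
import io

import pandas as pd

# The first five rows of D as printed above (Pandas' display index omitted).
csv_text = """CHR,RSID,bP,A1,A2,ref_eur_1000G,ref_afr_1000G,ref_sas_1000G,ref_eas_1000G,ref_nam_1000G,obs_afr_gnomad,obs_amr_gnomad,obs_oth_gnomad
1,rs2887286,1156131,C,T,0.173275,0.541663,0.531712,0.846223,0.7093,0.48861,0.525943,0.229705
1,rs41477744,2329564,A,G,0.001238,0.035714,0.0,0.0,0.0,0.045914,0.001179,0.008272
1,rs9661525,2952840,G,T,0.168316,0.120048,0.09918,0.393853,0.2442,0.135977,0.286052,0.155617
1,rs2817174,3044181,C,T,0.428213,0.959325,0.639072,0.570454,0.5,0.854879,0.48818,0.470425
1,rs12139206,3504073,T,C,0.204215,0.801565,0.393671,0.389881,0.3372,0.724178,0.295508,0.258748
"""
D_head = pd.read_csv(io.StringIO(csv_text))

K = 5

# Assumption for this sketch: reference columns are identified by their "ref_" prefix.
ref_cols = [c for c in D_head.columns if c.startswith("ref_")]
obs_col = "obs_afr_gnomad"  # the observed population modeled in this example

A = D_head[ref_cols].to_numpy()  # one row per SNP, one column per reference ancestry
b = D_head[obs_col].to_numpy()   # observed allele frequency for each SNP

# K reference columns plus one observed column gives the K+1 columns we need.
assert len(ref_cols) == K
print(A.shape, b.shape)
```

Because the reference columns are kept in the order they appear in $D$, the entries of the solution vector follow the same order (European, African, South Asian, East Asian, Native American).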


* In the cell below, we specify the number of reference ancestries, $K$, (here we have 5) and optionally choose an initial iterate.
    * Choosing an initial iterate is optional. The cell below defines one, `pi_0`, for illustration; it is only used if it is passed to the HA function.
    * The default initial iterate is $\pi^{(0)}=\frac{1}{K}(1,\ldots,1)\in \mathbb{R}^{K}$.
    * The initial iterate must be a $K \times 1$ (column) or $1 \times K$ (row) vector (the HA script can handle either shape)

In [3]:
import numpy as np # Needed for np.shape below, in case HA_script.py has not already imported it

K=5 # User must specify number of ancestries!
pi_0 = [[0.3,0.1,0.2,0.1,0.3]] # You do not have to provide the initial iterate, but you may.
np.shape(pi_0)

(1, 5)
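
For illustration, the default initial iterate and the two accepted shapes can be sketched as follows. The helper `as_flat_iterate` is hypothetical and is not part of the HA script; it only shows one way a $1 \times K$ or $K \times 1$ input could be reduced to a flat length-$K$ vector.

```python
import numpy as np

K = 5

# Default initial iterate: equal weight 1/K on every ancestry.
pi_default = np.full(K, 1.0 / K)

# The same user-supplied iterate as a row vector and as a column vector.
pi_row = np.array([[0.3, 0.1, 0.2, 0.1, 0.3]])  # shape (1, K)
pi_col = pi_row.T                               # shape (K, 1)


def as_flat_iterate(pi, K):
    """Hypothetical helper: flatten a (1, K) or (K, 1) iterate to shape (K,)."""
    flat = np.asarray(pi, dtype=float).ravel()
    if flat.size != K:
        raise ValueError(f"expected {K} entries, got {flat.size}")
    return flat


print(as_flat_iterate(pi_row, K).shape, as_flat_iterate(pi_col, K).shape)
```

Both shapes flatten to the same vector, so either one would describe the same starting point.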

* In the cell below, we quickly check that we have specified a data matrix $D$ and total number of ancestries $K$ that match the number of SNPs we think we are working with as well as the correct reference ancestry number.

In [4]:
print('our problem includes', np.shape(D)[0], 'SNPs, and', K, 'reference ancestries.')

our problem includes 10000 SNPs, and 5 reference ancestries.


* Finally, in the cell below, we apply the HA function to the data $D$, with the number of ancestries, $K=5$, and optionally an initial iterate $\pi^{(0)}$.
    * We can also provide different observation columns if we choose.
* The HA function will output an array containing the numerical solution, $\pi^{final}$, the number of SLSQP iterations taken to find the numerical solution, and the total runtime (in seconds) of SLSQP.

In [5]:
output_array = HA(D,K) # This line runs the HA function inside the HA_script using default initial iterate and default observed
# output_array = HA(D,K,obs=1) # This line runs the HA function inside the HA_script using default initial iterate and second observed
# output_array = HA(D,K,obs=3) # This line runs the HA function inside the HA_script using default initial iterate and third observed
# output_array = HA(D,K,pi_0) # This line runs the HA function inside the HA_script with a particular choice for initial iterate


print('numerical solution via SLSQP, pi_final = ',output_array[0])
print()
print('number of SLSQP iterations:',output_array[1])
print()
print('runtime:',output_array[2],'seconds')

numerical solution via SLSQP, pi_final =  [0.15767165 0.82505753 0.00328754 0.00767394 0.00630933]

number of SLSQP iterations: 11

runtime: 0.032493100000000004 seconds
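
To make the optimization step concrete, here is a minimal sketch of a constrained least-squares fit with SciPy's SLSQP solver on synthetic data. It does not reproduce the HA function: the objective (sum of squared differences between the mixture of reference frequencies and the observed frequencies), the constraints (proportions in $[0,1]$ that sum to 1), and every variable name below are assumptions made for this example.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
N, K = 1000, 5

# Synthetic reference allele frequencies (N SNPs by K ancestries).
A = rng.uniform(0.0, 1.0, size=(N, K))

# Proportions chosen for this sketch, and the noise-free observed frequencies they imply.
pi_true = np.array([0.15, 0.80, 0.02, 0.02, 0.01])
b = A @ pi_true


def objective(pi):
    """Sum of squared residuals between the mixture A @ pi and the observations b."""
    r = A @ pi - b
    return r @ r


def gradient(pi):
    """Analytic gradient of the objective."""
    return 2.0 * A.T @ (A @ pi - b)


result = minimize(
    objective,
    x0=np.full(K, 1.0 / K),   # default-style initial iterate, 1/K each
    jac=gradient,
    method="SLSQP",
    bounds=[(0.0, 1.0)] * K,  # each proportion lies in [0, 1]
    constraints=[{"type": "eq", "fun": lambda pi: np.sum(pi) - 1.0}],  # proportions sum to 1
    options={"ftol": 1e-12, "maxiter": 200},
)

pi_final = result.x
print("pi_final =", pi_final)
print("iterations:", result.nit)
```

With noise-free synthetic data the fitted vector should land close to `pi_true`; real observed frequencies will not be fit exactly.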


### _Above we see a printout of the numerical solution, $\pi^{final}$, the number of SLSQP iterations, and the time in seconds of the run._

In detail...

* The numerical solution we have found is given by $$\pi^\text{final}\approx (0.158,0.825,0.003,0.008,0.006)$$

    * $\pi_1^\text{final}\approx 0.158$ denotes the proportion of European ancestries in the observed population
    * $\pi_2^\text{final}\approx 0.825$ denotes the proportion of African ancestries in the observed population
    * $\pi_3^\text{final}\approx 0.003$ denotes the proportion of South Asian ancestries in the observed population
    * $\pi_4^\text{final}\approx 0.008$ denotes the proportion of East Asian ancestries in the observed population
    * $\pi_5^\text{final}\approx 0.006$ denotes the proportion of Native American ancestries in the observed population
    * Recall that we chose the gnomAD African sample for our observed population in this example.
    
* SLSQP took 11 iterations to obtain this numerical solution.

* The runtime of the script/computational process should be less than a tenth of a second!
    * In the run shown above, the runtime was about 0.03 seconds.
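
As a quick check of the reported solution, the sketch below takes the values printed above, confirms that they are nonnegative and sum to 1 up to rounding, and labels each entry with its ancestry. The label order is the one used throughout this notebook; the variable names are chosen for this example.

```python
import numpy as np

ancestries = ["European", "African", "South Asian", "East Asian", "Native American"]

# Numerical solution copied from the printout above.
pi_final = np.array([0.15767165, 0.82505753, 0.00328754, 0.00767394, 0.00630933])

# Proportions should be nonnegative and sum to 1 (up to the printed precision).
assert np.all(pi_final >= 0.0)
assert abs(pi_final.sum() - 1.0) < 1e-6

# Round to three decimals, matching the summary above.
pi_rounded = np.round(pi_final, 3)
for name, p in zip(ancestries, pi_rounded):
    print(f"{name:>16}: {p:.3f}")

largest = ancestries[int(np.argmax(pi_final))]
print("largest component:", largest)
```

The largest component is African, which is consistent with having chosen the gnomAD African sample as the observed population.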