# Hendricks Group

# Hidden Ancestry Example Notebook

_In the following notebook, we solve an example Hidden Ancestries problem._ 

* We use simulated SNPs with minor allele frequencies for $K=5$ ancestries -- European ancestries, African ancestries, South Asian ancestries, East Asian ancestries, and Native American ancestries.

* We numerically solve for the true proportions of the 5 ancestries in the _observed_ population, which form the vector $\pi^*:=(\pi_1,\pi_2,\pi_3,\pi_4, \pi_5)$.
    * $\pi_1$ denotes the proportion of European ancestries in the observed population
    * $\pi_2$ denotes the proportion of African ancestries in the observed population
    * $\pi_3$ denotes the proportion of South Asian ancestries in the observed population
    * $\pi_4$ denotes the proportion of East Asian ancestries in the observed population
    * $\pi_5$ denotes the proportion of Native American ancestries in the observed population

* In this notebook, we work with an example data set $D$ with $10,000$ SNPs and ensure that our Python script correctly uses the data to solve for these ancestry proportions using Sequential Least Squares Quadratic Programming, or SLSQP.

* The cell below calls the generalized HA script, which is the main feature of the HA Python package. Then we can access the functions inside of the script to solve an example problem.

In [1]:
%run HA_script.py
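
The internals of `HA_script.py` are not shown in this notebook, so the following is only a minimal sketch of how ancestry proportions could be estimated with SLSQP. It assumes a least-squares objective, $\min_\pi \|A\pi - b\|_2^2$ subject to $\sum_k \pi_k = 1$ and $0 \le \pi_k \le 1$, where $A$ holds the reference allele frequencies and $b$ the observed ones. The synthetic data, the names `A`, `b`, `pi_true`, and `pi_hat`, and the choice of objective are all assumptions made for illustration, not the package's confirmed method.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

K = 5          # number of reference ancestries
n_snps = 1000  # synthetic SNP count, for illustration only

# Synthetic reference allele frequencies, one column per ancestry (assumed layout).
A = rng.uniform(0.0, 1.0, size=(n_snps, K))

# Hypothetical "true" proportions, used only to simulate observed frequencies.
pi_true = np.array([0.15, 0.80, 0.02, 0.01, 0.02])
b = A @ pi_true

def objective(pi):
    """Sum of squared residuals between mixed and observed frequencies."""
    residual = A @ pi - b
    return residual @ residual

def gradient(pi):
    """Analytic gradient of the least-squares objective."""
    return 2.0 * A.T @ (A @ pi - b)

# Proportions must sum to one and each must lie in [0, 1].
constraints = [{"type": "eq", "fun": lambda pi: np.sum(pi) - 1.0}]
bounds = [(0.0, 1.0)] * K

# Uniform starting point, matching the initial iterate used in this notebook.
pi_start = np.full(K, 1.0 / K)

result = minimize(
    objective,
    pi_start,
    jac=gradient,
    method="SLSQP",
    bounds=bounds,
    constraints=constraints,
    options={"ftol": 1e-12, "maxiter": 200},
)

pi_hat = result.x
print("estimated proportions:", pi_hat)
print("SLSQP iterations:", result.nit)
```

Because the simulated `b` is an exact mixture of the columns of `A`, the estimate should land very close to `pi_true`; real allele frequencies are noisy, so the fit there would only be approximate.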

* For now, we change the current working directory in the next cell so that we can read in our genetic data, $D$.
    * This is because the data is not stored in our package yet.
    * Note that eventually we will need to upload data along with our package once we settle on a finalized example data set.
* We then read in the data using Pandas, which converts the CSV file into a DataFrame, a tabular structure whose numeric columns can be passed to numerical linear algebra routines.
* How $D$ is formatted matters a lot...
    * $D$ contains (for now) an index column of consecutive integers, the SNP identifier (rsID), the chromosome number, the $5$ minor allele frequencies of the $K=5$ ancestries, and the gnomAD observed allele frequencies. We only need certain columns of $D$ to solve our example problem.
        * That is, we only need the $K$ reference minor allele frequencies and whichever observed allele frequency we are modeling, for a total of $K+1$ columns of $D$.
* Finally, we print out the first 5 rows of $D$ to take a look at its structure and check for basic correctness in what we _think_ we are working with!

In [2]:
import os
import pandas as pd

os.chdir('/nfs/storage/math/gross-s2/projects/mixtures/genomic_resources/packagedata')

# Read in the data
D = pd.read_csv("packagedata.csv")

D.head(5) ### Look at the first 5 rows

| Unnamed: 0 | SNP | CHR | ref_eur | ref_afr | ref_sas | ref_eas | ref_nam | gnomad_afr | gnomad_amr | gnomad_oth |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | rs6695131 | 1 | 0.408394 | 0.596249 | 0.327196 | 0.344257 | 0.5116 | 0.585968 | 0.410165 | 0.385185 |
| 1 | rs16823459 | 1 | 0.002475 | 0.053571 | 0.126784 | 0.071429 | 0.0 | 0.0517 | 0.001179 | 0.017463 |
| 2 | rs10909918 | 1 | 0.496266 | 0.135923 | 0.464228 | 0.596266 | 0.6977 | 0.209444 | 0.602837 | 0.482505 |
| 3 | rs2483280 | 1 | 0.414622 | 0.083342 | 0.425382 | 0.202389 | 0.0 | 0.118763 | 0.183649 | 0.368762 |
| 4 | rs2487680 | 1 | 0.04703 | 0.000992 | 0.10225 | 0.219258 | 0.3837 | 0.014355 | 0.174941 | 0.070772 |
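
As a rough illustration of the $K+1$ columns mentioned earlier, the sketch below pulls the five reference columns and one observed column out of a small DataFrame. The two rows are copied by hand from the printout, and the choice of `gnomad_afr` as the observed column is an assumption; the notebook does not say which gnomAD column the HA function models, nor how it selects columns internally. The names `D_small`, `ref_cols`, `obs_col`, and `D_model` are hypothetical.

```python
import pandas as pd

# First two rows of the printed table, typed in by hand for illustration.
D_small = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "SNP": ["rs6695131", "rs16823459"],
    "CHR": [1, 1],
    "ref_eur": [0.408394, 0.002475],
    "ref_afr": [0.596249, 0.053571],
    "ref_sas": [0.327196, 0.126784],
    "ref_eas": [0.344257, 0.071429],
    "ref_nam": [0.5116, 0.0],
    "gnomad_afr": [0.585968, 0.0517],
    "gnomad_amr": [0.410165, 0.001179],
    "gnomad_oth": [0.385185, 0.017463],
})

# The K = 5 reference minor allele frequency columns.
ref_cols = ["ref_eur", "ref_afr", "ref_sas", "ref_eas", "ref_nam"]

# Assumption: the observed population being modeled.
obs_col = "gnomad_afr"

# Keep only the K + 1 columns needed for the mixture problem.
D_model = D_small[ref_cols + [obs_col]]

# Split into a reference matrix and an observed vector as NumPy arrays.
A = D_model[ref_cols].to_numpy()
b = D_model[obs_col].to_numpy()

print(D_model.shape)
print(A.shape, b.shape)
```

Swapping `obs_col` for `gnomad_amr` or `gnomad_oth` would set up the same problem for a different observed population.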


* In the cell below, we specify the number of ancestries, $K$, and choose an initial iterate, $\pi^{(0)}=\frac{1}{K}(1,\ldots,1)\in \mathbb{R}^{K}$.

In [3]:
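import numpy as np  # np may already be in scope after %run; imported here so the cell stands alone

K=5 # User must specify number of ancestries!
pi_0 = np.transpose(1/K*np.ones((K,1)))  # uniform initial iterate, shape (1, K)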
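
A quick check, offered as a sketch rather than part of the HA package, confirms that this initial iterate is a row vector of shape $(1, K)$ whose entries sum to one, so it is a valid set of proportions to start from. The alternative construction `pi_0_alt` is only a suggestion for readability.

```python
import numpy as np

K = 5
pi_0 = np.transpose(1 / K * np.ones((K, 1)))  # same construction as in the cell above

print(pi_0.shape)  # (1, 5)

# Every entry is 1/K, so the entries sum to one (up to floating-point rounding).
print(np.isclose(pi_0.sum(), 1.0))

# An equivalent, more direct way to build the same array.
pi_0_alt = np.full((1, K), 1 / K)
print(np.allclose(pi_0, pi_0_alt))
```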

* The user must specify the number of ancestries -- here, we have $5$.
* Finally, the user may apply the HA function to the data $D$, with initial iterate $\pi^{(0)}$ and the number of ancestries, $K=5$.
* The HA function will output an array containing the numerical solution, $\pi^{final}$, the number of SLSQP iterations taken to find the numerical solution, and the total runtime (in seconds) of SLSQP.

In [5]:

output_array = HA(D, pi_0, K) # This line runs the HA function inside the HA_script

print('our problem includes', np.shape(D)[0], 'SNPs, and', K, 'ancestries')
print()
print('numerical solution via SLSQP, pi_final = ',output_array[0])
print()
print('number of SLSQP iterations:',output_array[1])
print()
print('runtime of SLSQP:',output_array[2],'seconds')

our problem includes 10000 SNPs, and 5 ancestries

numerical solution via SLSQP, pi_final =  [0.1583858  0.82308077 0.00714997 0.00357303 0.00781043]

number of SLSQP iterations: 10

runtime of SLSQP: 0.012948384508490562 seconds


### _Above we see a printout of the numerical solution, $\pi^{final}$, the number of SLSQP iterations, and the time in seconds of the run._
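
To make the printed solution easier to read, the sketch below attaches ancestry labels to the values of $\pi^{final}$ shown above and checks that they form valid proportions. It assumes the entries are ordered as $(\pi_1,\ldots,\pi_5)$ in the introduction (European, African, South Asian, East Asian, Native American); the notebook does not state the output order explicitly, and the names `labels` and `ranked` are hypothetical.

```python
import numpy as np

# Assumed ordering, following the definition of pi_1, ..., pi_5 in the introduction.
labels = ["European", "African", "South Asian", "East Asian", "Native American"]

# Values copied from the printed SLSQP solution.
pi_final = np.array([0.1583858, 0.82308077, 0.00714997, 0.00357303, 0.00781043])

# Sanity checks: proportions are nonnegative and sum to one within rounding.
print(np.all(pi_final >= 0))
print(np.isclose(pi_final.sum(), 1.0, atol=1e-6))

# Report the ancestries from largest to smallest estimated proportion.
ranked = sorted(zip(labels, pi_final), key=lambda pair: -pair[1])
for name, proportion in ranked:
    print(f"{name:>16s}: {100 * proportion:6.2f}%")
```

Under that ordering assumption, the largest estimated component is the African ancestries, at roughly 82% of the observed population.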