# Hendricks Group

# Hidden Ancestry Example Notebook

_In this notebook we work through an example problem. We use simulated SNPs with minor allele frequencies (MAFs) for four ancestries: European, African, South Asian, and East Asian. The goal is to estimate the proportion of each of the four ancestries in the total (simulated) population. Because we choose the ancestry proportions ourselves when simulating the data, we know the correct answer in advance and can check that our Python script recovers it from the data. Here we take the simple case in which all four ancestries have equal proportions of $0.25$ of the total population._

In [1]:
# This calls the generalized HA script -- be sure to have it saved in your working directory!
# This cell will not work once you change your working directory in the next cell to read in the data!
# Changing your working directory is unnecessary if you are already working in the "nfs"-based directory

%run HA_script.py
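
One way to sidestep the working-directory caveat in the comments above is to leave the working directory alone and pass a full path to `pd.read_csv`. The following is a minimal sketch under stated assumptions: `data_dir` and `tiny_example.txt` are hypothetical stand-ins (a temporary folder and a two-row file built from values in this notebook's data preview), not the real data location.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the data directory; a temporary folder keeps the sketch runnable.
data_dir = tempfile.mkdtemp()
data_path = os.path.join(data_dir, "tiny_example.txt")

# Write a tiny tab-separated file (two rows taken from this notebook's data preview).
tiny = pd.DataFrame({
    "CHR": [1, 1],
    "SNP": ["rs1000364", "rs1002655"],
    "CEU_MAF": [0.3939, 0.3434],
    "af": [0.32305, 0.34735],
})
tiny.to_csv(data_path, sep="\t", index=False)

# Read the file by its full path; os.chdir is never called.
cwd_before = os.getcwd()
ev_small = pd.read_csv(data_path, sep="\t")
cwd_after = os.getcwd()

print(ev_small)
print("working directory unchanged:", cwd_before == cwd_after)
```

With this pattern the `%run HA_script.py` cell should keep working after the data is loaded, since the script is still looked up in the original working directory.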

In [2]:
import os
import numpy as np  # used in the cells below (np.transpose, np.hstack, np.shape)
import pandas as pd

# Change the current working directory to read in data
# Note that eventually we will need to upload this data along with our package...
os.chdir('/nfs/storage/math/gross-s2/projects/mixtures/example_sims/')

# Read in the data
ev = pd.read_csv("Afr_CEU_sas_eas_10000tot_2500Afr_2500sas_2500eas_sims_and_reference.txt", sep='\t')

ev.head(5)  # Look at the first 5 rows

Unnamed: 0,CHR,SNP,CEU_MAF,afr_MAF,sas_MAF,eas_MAF,af
0,1,rs1000364,0.3939,0.070438,0.423307,0.384912,0.32305
1,1,rs1002655,0.3434,0.230153,0.468285,0.350196,0.34735
2,1,rs1008082,0.2828,0.082341,0.236197,0.112097,0.18235
3,1,rs10082057,0.0,0.181558,0.0,0.0,0.0463
4,1,rs10082123,0.1515,0.230183,0.220848,0.227184,0.2108
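
As a quick plausibility check on the five rows above, the sketch below compares the `af` column with an equal-weight average of the four reference MAFs. It assumes that `af` is the allele frequency in the simulated mixed population and that it should be roughly the proportion-weighted average of the reference MAFs; the notebook does not state this explicitly, and the names `ref_maf`, `observed_af`, and `weights` are illustrative only.

```python
import numpy as np

# Reference MAFs (CEU, afr, sas, eas) for the five SNPs shown above
ref_maf = np.array([
    [0.3939, 0.070438, 0.423307, 0.384912],
    [0.3434, 0.230153, 0.468285, 0.350196],
    [0.2828, 0.082341, 0.236197, 0.112097],
    [0.0,    0.181558, 0.0,      0.0],
    [0.1515, 0.230183, 0.220848, 0.227184],
])

# The "af" column for the same five SNPs
observed_af = np.array([0.32305, 0.34735, 0.18235, 0.0463, 0.2108])

# Equal proportions of 0.25 for each ancestry (the known answer for this data set)
weights = np.full(4, 0.25)

# Assumed model: observed frequency is approximately the weighted average of the reference MAFs
expected_af = ref_maf @ weights
abs_diff = np.abs(expected_af - observed_af)

for exp, obs, diff in zip(expected_af, observed_af, abs_diff):
    print(f"expected {exp:.5f}   observed {obs:.5f}   |difference| {diff:.5f}")
```

The two columns agree only approximately, which is what one would expect if `af` comes from a finite simulated sample rather than from an exact weighted average.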


In [3]:
# Collect and name the SNP allele frequency vectors
# These are the MAF and af columns shown above

# The transposes only turn row vectors (introduced by wrapping each column in brackets)
# back into column vectors. This is a little sloppy, but guarantees the correct size. See below.
a_1 = np.transpose([ev['CEU_MAF']])
a_2 = np.transpose([ev['afr_MAF']])
a_3 = np.transpose([ev['sas_MAF']])
a_4 = np.transpose([ev['eas_MAF']])
a_t = np.transpose([ev['af']])


# Now stack the columns into an input matrix that the HA function will accept. It should be size N x k.
A = np.hstack((a_1,a_2,a_3,a_4))

# Perform checks and print results
print('number of SNPs is', np.shape(a_t)[0])

print()

print('our MAF & TAF column vectors have size', np.shape(a_1),np.shape(a_2),np.shape(a_3),np.shape(a_4),np.shape(a_t))

print()

print('and our input matrix is size', np.shape(A))

number of SNPs is 61857

our MAF & TAF column vectors have size (61857, 1) (61857, 1) (61857, 1) (61857, 1) (61857, 1)

and our input matrix is size (61857, 4)
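
The bracket-and-transpose construction above can also be written by selecting columns directly from the DataFrame. The sketch below is one possible alternative, not a change to the notebook's method; it rebuilds a five-row stand-in for `ev` from the preview shown earlier (the name `ev_small` is hypothetical) and checks that both constructions give the same arrays.

```python
import numpy as np
import pandas as pd

# Five-row stand-in for the notebook's DataFrame, using the values from the preview
ev_small = pd.DataFrame({
    "CEU_MAF": [0.3939, 0.3434, 0.2828, 0.0, 0.1515],
    "afr_MAF": [0.070438, 0.230153, 0.082341, 0.181558, 0.230183],
    "sas_MAF": [0.423307, 0.468285, 0.236197, 0.0, 0.220848],
    "eas_MAF": [0.384912, 0.350196, 0.112097, 0.0, 0.227184],
    "af": [0.32305, 0.34735, 0.18235, 0.0463, 0.2108],
})

maf_cols = ["CEU_MAF", "afr_MAF", "sas_MAF", "eas_MAF"]

# Selecting a list of columns keeps the result two-dimensional
A_alt = ev_small[maf_cols].to_numpy()    # shape (N, k)
a_t_alt = ev_small[["af"]].to_numpy()    # double brackets give shape (N, 1), not (N,)

# The notebook's construction, for comparison
A_orig = np.hstack([np.transpose([ev_small[col]]) for col in maf_cols])
a_t_orig = np.transpose([ev_small["af"]])

print(A_alt.shape, a_t_alt.shape)
print(np.array_equal(A_alt, A_orig), np.array_equal(a_t_alt, a_t_orig))
```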


In [4]:
# The true answer is (0.25, 0.25, 0.25, 0.25) for the data set we just read in.
# Therefore, we choose a starting guess sufficiently far away from it

x_t=np.transpose([[0.8,0.1,0.05,0.05]])

In [5]:
# Ensure that the starting guess is the proper size
# Should be k x 1

x_t.shape

(4, 1)
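
Beyond the shape, it can be useful to confirm that the starting guess is itself a valid set of proportions. This is a minimal sketch, assuming that proportions should be non-negative and sum to 1; the notebook does not state what `HA` requires of its starting guess, so treat these checks as illustrative.

```python
import numpy as np

# Same starting guess as in the notebook
x_t = np.transpose([[0.8, 0.1, 0.05, 0.05]])
k = 4  # number of ancestries

shape_ok = x_t.shape == (k, 1)
nonneg_ok = bool(np.all(x_t >= 0))
sums_to_one = bool(np.isclose(x_t.sum(), 1.0))

print("shape is k x 1:", shape_ok)
print("all entries non-negative:", nonneg_ok)
print("entries sum to 1:", sums_to_one)
```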

In [6]:
what_we_want = HA(A, a_t, x_t)

print(what_we_want)

(array([0.25022569, 0.24995832, 0.24972281, 0.25009319]), 7, 0.03330934402765706)


_Above we see a printout of the estimated ancestry proportions, the number of SLSQP iterations, and the run time in seconds. Observe how close each component of the estimate is to 0.25: all four are within about $3 \times 10^{-4}$._
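
The contents of `HA_script.py` are not shown in this notebook, so the following is only a sketch of how a function returning the same kind of tuple could be written. It assumes a least-squares objective with the proportions constrained to lie in [0, 1] and sum to 1, solved with SciPy's SLSQP method. The function name `estimate_proportions_sketch` and the synthetic data are hypothetical and are not the group's implementation or data.

```python
import time

import numpy as np
from scipy.optimize import minimize


def estimate_proportions_sketch(A, a_t, x0):
    """Hypothetical stand-in for HA: constrained least squares solved with SLSQP."""
    A = np.asarray(A, dtype=float)
    a_t = np.ravel(a_t)  # flatten (N, 1) to (N,)
    x0 = np.ravel(x0)    # flatten (k, 1) to (k,)

    def objective(x):
        # Sum of squared differences between mixed reference MAFs and observed AFs
        residual = A @ x - a_t
        return float(residual @ residual)

    # Assumed constraints: proportions sum to 1 and each lies in [0, 1]
    constraints = [{"type": "eq", "fun": lambda x: np.sum(x) - 1.0}]
    bounds = [(0.0, 1.0)] * A.shape[1]

    start = time.perf_counter()
    result = minimize(objective, x0, method="SLSQP",
                      bounds=bounds, constraints=constraints)
    elapsed = time.perf_counter() - start

    # Same layout as the printout above: (estimate, iterations, seconds)
    return result.x, result.nit, elapsed


# Synthetic data made up for this sketch (not the notebook's file)
rng = np.random.default_rng(0)
n_snps, k = 2000, 4
A_sim = rng.uniform(0.0, 0.5, size=(n_snps, k))
true_props = np.full(k, 0.25)
a_t_sim = A_sim @ true_props + rng.normal(0.0, 0.005, size=n_snps)

x_start = np.transpose([[0.8, 0.1, 0.05, 0.05]])
estimate, n_iter, seconds = estimate_proportions_sketch(
    A_sim, a_t_sim.reshape(-1, 1), x_start
)
print(estimate, n_iter, seconds)
```

The inputs are flattened to one-dimensional arrays inside the function because `scipy.optimize.minimize` works with a one-dimensional parameter vector, whereas the notebook passes column vectors.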