# Hendricks Group

# Hidden Ancestry Correlated Data Simulations

_In the following notebook, we investigate the effect of correlated observations on the numerical solution to the Hidden Ancestries problem.
In particular, we consider a series of solutions to problems in which we have repeated observations in our data set.
Repeated observations are simply equal data, which are perfectly correlated.
We use simulated SNPs with minor allele frequencies for only 2 ancestries -- European ancestries and African ancestries.
We try to solve for both ancestry's proportions in the total (simulated) population. 
In detail, we choose known ancestry proportions (we choose our own answer!) and ensure that our Python script correctly uses the data to solve for the ancestry proportions. Here, we have the case in which the European ancestries comprise $0.1$ of the total population and African ancestries comprise $0.9$ of the total population.
Although this data contain $>77,000$ SNPs, we will only be using about 10 of these for now..._

In [1]:
# This calls the generalized HA script -- be sure to have it saved in you working directory!
# This cell is not going to work once you change your working directory in the next cell to read in data!!!!!
# Changing your working directory is totally unneccessary if you are working in the "nfs" based directory already

%run HA_script.py

In [10]:
import os
import pandas as pd

# Change the current working directory to read in data
# Note that eventually we will need to upload this data along with our package...
os.chdir('/nfs/storage/math/gross-s2/projects/mixtures/example_sims/')

# Read in the data
ev = pd.read_csv("Afr_CEU_10000tot_9000Afr_sims_and_reference.txt", sep='\t')

ev.head(10) ### Look at the first 5 rows

Unnamed: 0,CHR,SNP,CEU_MAF,afr_MAF,sample_af
0,1,rs10047256,0.03535,0.162716,0.1464
1,1,rs1005979,0.399,0.351198,0.35375
2,1,rs1006236,0.06566,0.256969,0.23795
3,1,rs1007249,0.0,0.026783,0.02385
4,1,rs1007282,0.01515,0.114079,0.1
5,1,rs10082085,0.0,0.18551,0.16505
6,1,rs10157158,0.3081,0.353187,0.35005
7,1,rs10157174,0.0,0.079365,0.07035
8,1,rs10157521,0.0,0.155752,0.137
9,1,rs10157912,0.0101,0.158734,0.14465


In [3]:
# Collect and name the SNPS
# These are each of the columns above

# Transposes are appearing only to turn row vectors (which I introduced by using brackets)
# back into column vectors. This is a little sloppy, but guarantees correct size. See below.
a_1 = np.transpose([ev['CEU_MAF']])
a_2 = np.transpose([ev['afr_MAF']])
a_t = np.transpose([ev['sample_af']])


# Now stack the columns into an input matrix that HA function will accept. Should be size Nxk.
A = np.hstack((a_1,a_2))

# Perform checks and print results
print('number of SNPs is', np.shape(a_t)[0])

print()

print('our MAF & TAF column vectors have size', np.shape(a_1),np.shape(a_2),np.shape(a_t))

print()

print('and our input matrix is size', np.shape(A))

number of SNPs is 77095

our MAF & TAF column vectors have size (77095, 1) (77095, 1) (77095, 1)

and our input matrix is size (77095, 2)


## Before we do simulations we (1) choose an initial starting guess to seed all of our runs with (2) we solve the problem with all $77,095$ SNPs just as a baseline

In [45]:
# The real answer is (.1,.9) for the data set we just read in.
# Therefore, we choose a starting guess sufficiently far away

x_t=np.transpose([[0.5,0.5]])

In [46]:
# Ensure that the starting guess is proper size
# SHould be kx1

x_t.shape

(2, 1)

In [50]:
what_we_want = HA(A, a_t, x_t)

print('solution using every SNP:', what_we_want[0])

print()

print('exact solution should be:', [0.1,0.9])

solution using every SNP: [0.09997312 0.90002688]

exact solution should be: [0.1, 0.9]


## First: solve ten 1 SNP problems using first ten observations in data frame

In [51]:
sol_array = np.zeros((10,2))

for i in range(0,10):
    a_1_single_SNP = a_1[i:i+1]
    a_2_single_SNP = a_2[i:i+1]
    a_t_single_SNP = a_t[i:i+1]


    # Now stack the columns into an input matrix that HA function will accept. Should be size Nxk.
    A_single_SNP = np.hstack((a_1_single_SNP,a_2_single_SNP))

    what_we_want_single_SNP = HA(A_single_SNP, a_t_single_SNP, x_t)[0]
    
    sol_array[i,:] = what_we_want_single_SNP

In [54]:

print(sol_array)
print(np.shape([[1,2]]))

[[0.12810647 0.87189353]
 [0.5        0.5       ]
 [0.09941437 0.90058563]
 [0.5        0.5       ]
 [0.14231425 0.85768575]
 [0.11028922 0.88971078]
 [0.5        0.5       ]
 [0.11359199 0.88640801]
 [0.12039644 0.87960356]
 [0.09475581 0.90524419]]
(1, 2)


#### Above we have printed out the 10 solutions we found. Some are good, some are almost perfect (!!!), and some are very bad...

#### In fact, we see that 3 of the SNPs must be relatively un-informative since three of the printed solutions are just the initial starting guess of $x_t=(0.5,0.5)$. In other words, the initial solution was a better solution to these problems than whatever the 1-step of SQP landed on...


#### Need to plot these 10 sols.

## Next: Create a data frame with 1 copy of each of the 10 initial SNPs above.

### We will have 10 total SNPs and solve the problem with all 10 included now. (Just another baseline.)

In [58]:
A_ten_SNP = np.zeros((10,2))

for i in range(0,10):
    a_1_single_SNP = a_1[i:i+1]
    a_2_single_SNP = a_2[i:i+1]
    a_t_single_SNP = a_t[i:i+1]


    # Now stack the columns into an input matrix that HA function will accept. Should be size Nxk.
    A_ten_SNP[i,:] = np.hstack((a_1_single_SNP,a_2_single_SNP))

    what_we_want_ten_SNP = HA(A_ten_SNP, a_t_single_SNP, x_t)[0]

In [60]:
print(what_we_want_ten_SNP)

[0.15097919 0.84902081]


#### Here we see a solution that is close, but about $0.05$ "off" in each component, where we recall the truth is $(0.1,0.9).$

## Next: Create a data frame with 10 copies of each of the 10 initial SNPs.

### Now we will have 100 total SNPs and solve the problem with all 100. I am expecting the same exact solution as what we just got since the ratios of each data observation has not changed... (?)

In [63]:
A_hund_SNP = np.zeros((100,2))
a_t_hund_SNP = np.zeros((100,1))

for i in range(0,10):
    a_1_single_SNP = a_1[i:i+1]
    a_2_single_SNP = a_2[i:i+1]
    a_t_single_SNP = a_t[i:i+1]

    for j in range(0,10):
        # Now stack the columns into an input matrix that HA function will accept. Should be size Nxk.
        A_hund_SNP[i+j,:] = np.hstack((a_1_single_SNP,a_2_single_SNP))
        a_t_hund_SNP[i+j,:] = a_t_single_SNP

    what_we_want_hund_SNP = HA(A_hund_SNP, a_t_hund_SNP, x_t)[0]


In [64]:
print(what_we_want_hund_SNP)

[0.10130979 0.89869021]


#### Wow! Here we see a solution that is actually closer, only about $0.01$ "off" in each component, where we recall the truth is $(0.1,0.9).$

#### Is the effect of the multiple informative SNP's is outweighing the "bad"/uniformative SNPs now?

## Final test for now: Create a data frame with 80 copies of an informative SNP and 20 copies of an uninformative SNP

### Now we will again have 100 total SNPs and solve the problem with all 100. I no longer know what to expect!

### I will use the very first initial SNP for copying 80 times since it was informative

### I will use the second initial SNP for copying 20 times since it seemed uninformative (only based on its single SNP solution...)

In [87]:
good_n = 80
bad_n = 20 # these two values must sum to 100!

A_hund_SNP_1 = np.zeros((100,2))
a_t_hund_SNP_1 = np.zeros((100,1))


a_1_good_SNP = a_1[0:1]
a_2_good_SNP = a_2[0:1]
a_t_good_SNP = a_t[0:1]

for j in range(0,good_n):
    # Now stack the columns into an input matrix that HA function will accept. Should be size Nxk.
    A_hund_SNP_1[i,:] = np.hstack((a_1_good_SNP,a_2_good_SNP))
    a_t_hund_SNP_1[i,:] = a_t_single_SNP
    
a_1_bad_SNP = a_1[1:2]
a_2_bad_SNP = a_2[1:2]
a_t_bad_SNP = a_t[1:2]

for j in range(0,bad_n):
    # Now stack the columns into an input matrix that HA function will accept. Should be size Nxk.
    A_hund_SNP_1[good_n+i,:] = np.hstack((a_1_bad_SNP,a_2_bad_SNP))
    a_t_hund_SNP_1[good_n+i,:] = a_t_bad_SNP

what_we_want_hund_SNP_1 = HA(A_hund_SNP_1, a_t_hund_SNP_1, x_t)[0]

In [88]:
print(what_we_want_hund_SNP_1)

[0.1309241 0.8690759]


#### Very interesting. I am getting the solution above even when I use 99 "bad" SNPs and only 1 "good" SNP... As soon as I take away the only good SNP, we are back to the initial guess, $x_t=(0,5,0.5)$.

#### Need to put in sliders for good_n and bad_n to show this?

#### More investigation needed.