# 10-tip balanced coalescent tree analysis

The topology for this simcat analysis is a 10-tip, balanced, bifurcating tree. The database of simulations used here was generated outside of this notebook. Each sample has 20,000 SNPs. Each simulation allowed every node to slide up or down by up to 25% of its height, every branch to take a different Ne value, and the root position to shift between 0.5 and 1.5 times its original depth. The timing of the admixture edge varied from 0.3 to 0.7, and its magnitude varied from 0.05 to 0.5.

## Imports:

In [1]:
import simcat
import toytree
import toyplot
import toyplot.svg
import numpy as np
import pandas as pd
import h5py
import csv
from keras.models import Sequential,load_model
from keras.layers import Dense
from sklearn.metrics import accuracy_score, confusion_matrix

Using TensorFlow backend.


## Load up database:  
This is a huge database with over 60,000 ipcoal simulations.

In [2]:
mod = simcat.Analysis(
    name="cleaned",
    workdir="../merged/",
    mask_admixture_min=0.05,
    mask_sisters=True,
    scale=1,
)

[init] cleaned
[load] (63740, 210, 16, 16)
[filter] (63740, 210, 16, 16)
[vectorize] (63740, 53760)
[train/test] (42705, 53760)/(21035, 53760)


In [3]:
mod.train_test_split(prop=0.1)

[train/test] (57366, 53760)/(6374, 53760)
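
The shapes above are consistent with holding out 10% of the 63,740 simulations for testing. As a rough illustration of what a proportional split does, here is a generic NumPy sketch. It is not simcat's actual implementation; `split_indices` and the seed are hypothetical names and values chosen for this example.

```python
import numpy as np

def split_indices(n_samples, prop, seed=0):
    """Shuffle sample indices and hold out a fraction `prop` for testing."""
    rng = np.random.default_rng(seed)        # hypothetical seed, for repeatability
    order = rng.permutation(n_samples)
    n_test = int(round(n_samples * prop))
    return order[n_test:], order[:n_test]    # (train indices, test indices)

train_idx, test_idx = split_indices(63740, prop=0.1)
print(len(train_idx), len(test_idx))  # 57366 6374
```

The two index arrays are disjoint, so no simulation ends up in both the training and the test set.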


## Look at the tree:

In [4]:
mod.tree.draw(ts='p');

## Fit a neural network:

#### Prepare the data:

In [5]:
# encode labels as ints:
unique_labs = np.unique(mod.y)
onehot_dict = dict(zip(range(len(unique_labs)),unique_labs))
inv_onehot_dict = dict(zip(unique_labs,range(len(unique_labs))))

#### Write out the dict so you can interpret predictions later on  
We'll need it when we load the trained model and apply it to other data.

In [6]:
with open('onehot_dict.csv', 'w', newline='') as f:  # newline='' avoids blank rows on Windows
    w = csv.DictWriter(f, fieldnames=list(onehot_dict.keys()))
    w.writeheader()
    w.writerow(onehot_dict)
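
Reading the file back takes one extra step, because `csv.DictReader` returns every key and value as a string. Below is a minimal round-trip sketch. The helper names and the three example labels are hypothetical and only stand in for the real 177-entry dict.

```python
import csv
import os
import tempfile

def write_label_map(path, label_map):
    """Write a {int: label} dict as a one-row CSV (header = integer codes)."""
    with open(path, "w", newline="") as f:
        w = csv.DictWriter(f, fieldnames=list(label_map.keys()))
        w.writeheader()
        w.writerow(label_map)

def read_label_map(path):
    """Read the one-row CSV back, converting the header keys to ints."""
    with open(path, newline="") as f:
        row = next(csv.DictReader(f))
    return {int(k): v for k, v in row.items()}

example_map = {0: "NaN", 1: "[1, 2]", 2: "[3, 7]"}  # hypothetical labels
path = os.path.join(tempfile.mkdtemp(), "onehot_dict.csv")
write_label_map(path, example_map)
restored = read_label_map(path)
```

The labels come back as strings, so this round trip is only exact if the original labels were strings too.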

In [7]:
# number of unique labels (admixture scenarios plus the "NaN" label)
len(onehot_dict)

177

#### Now create the one-hot-encoded labels for the training and test sets

In [8]:
# one-hot encode training labels
y_idxs = [inv_onehot_dict[i] for i in np.array(mod.y_train)]
y = np.zeros((len(y_idxs),len(onehot_dict)))
for rowidx in range(y.shape[0]):
    y[rowidx,y_idxs[rowidx]] += 1

In [9]:
# one-hot encode test labels
y_test_idxs = [inv_onehot_dict[i] for i in np.array(mod.y_test)]
y_test = np.zeros((len(y_test_idxs),len(onehot_dict)))
for rowidx in range(y_test.shape[0]):
    y_test[rowidx,y_test_idxs[rowidx]] += 1
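
The two loops above can also be written without an explicit Python loop. This is a sketch of one common NumPy idiom, using a small hypothetical label array rather than `mod.y_train`: index an identity matrix with the integer codes.

```python
import numpy as np

labels = np.array(["b", "NaN", "a", "b"])  # hypothetical labels
# np.unique returns the sorted classes and, for each label, its integer code
classes, idxs = np.unique(labels, return_inverse=True)
# row i of the identity matrix is the one-hot vector for class i
onehot = np.eye(len(classes))[idxs]
print(onehot.shape)  # (4, 3)
```

Each row of `onehot` contains a single 1 in the column of its class, which is the same layout the loops produce.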

#### Record which one-hot index maps to "NaN", the label for introgression between sister edges. We don't want to include these simulations in the training.

In [10]:
# for excluding NaN from the analysis -- which integer value is NaN?
nanval = inv_onehot_dict["NaN"]

## The data is prepared. Now train the model:  
For this analysis, at each stage, we're going to exclude the data that has the "NaN" label.
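
Excluding those rows comes down to a boolean mask over the one-hot matrix. Here is a small sketch with made-up arrays; `nan_index`, `X`, and `y_onehot` are hypothetical stand-ins for `nanval`, `mod.X_train`, and `y`.

```python
import numpy as np

nan_index = 0                            # hypothetical integer code of the "NaN" label
y_onehot = np.eye(3)[[0, 2, 1, 0, 2]]    # five samples, three classes
X = np.arange(10).reshape(5, 2)          # five samples, two features

# True for every sample whose class is not the NaN class
keep = np.argmax(y_onehot, axis=1) != nan_index
X_kept, y_kept = X[keep], y_onehot[keep]
print(X_kept.shape)  # (3, 2)
```

Applying the same mask to the features and the labels keeps the rows aligned.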

### Define the network:  
We'll have one hidden layer with 1000 nodes.

In [11]:
# Neural network architecture
model = Sequential()
model.add(Dense(1000, input_dim=mod.X_train.shape[1], activation='relu'))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
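
To get a feel for the size of this network, we can count its trainable parameters by hand. The arithmetic assumes the default `Dense` layer, which has one weight per input-output pair plus one bias per output unit. The input width and class count are taken from the outputs shown earlier.

```python
n_inputs = 53760   # vectorized features per simulation (from the [vectorize] log line)
n_hidden = 1000    # nodes in the hidden layer
n_classes = 177    # len(onehot_dict)

hidden_params = n_inputs * n_hidden + n_hidden     # weights + biases
output_params = n_hidden * n_classes + n_classes   # weights + biases
total_params = hidden_params + output_params
print(total_params)  # 53938177
```

Almost all of the roughly 54 million parameters sit in the first layer. `model.summary()` should report the same total if these assumptions hold.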

### Do the training:  
At each epoch, let's evaluate the model's accuracy on a separate test dataset that doesn't impact the model training.

In [12]:
num_epochs = 75
epoch_accuracies = []
test_accuracies = []

# boolean masks that drop samples labeled "NaN"; computed once, outside the loop
train_keep = np.argmax(y, 1) != nanval
test_keep = np.argmax(y_test, 1) != nanval

# integer class labels for the retained test samples
test = np.argmax(y_test[test_keep], 1)

for epoch in range(num_epochs):
    print("~~~~~~~~~~~~~~ Training epoch " + str(epoch) + ": ~~~~~~~~~~~~~~")
    history = model.fit(mod.X_train[train_keep],
                        y[train_keep],
                        epochs=1,
                        batch_size=512,
                        verbose=False)
    acc = history.history['accuracy']
    print("training accuracy: " + str(round(acc[0], 2)))
    epoch_accuracies.append(acc[0])

    # predict on the test data and convert class probabilities to labels
    y_pred = model.predict(mod.X_test[test_keep])
    pred = np.argmax(y_pred, 1)

    # accuracy_score expects (y_true, y_pred)
    a = accuracy_score(test, pred)
    print("test accuracy: " + str(round(a, 2)))
    test_accuracies.append(a)

~~~~~~~~~~~~~~ Training epoch 0: ~~~~~~~~~~~~~~
training accuracy: 0.02
test accuracy: 0.06
~~~~~~~~~~~~~~ Training epoch 1: ~~~~~~~~~~~~~~
training accuracy: 0.09
test accuracy: 0.13
~~~~~~~~~~~~~~ Training epoch 2: ~~~~~~~~~~~~~~
training accuracy: 0.17
test accuracy: 0.22
~~~~~~~~~~~~~~ Training epoch 3: ~~~~~~~~~~~~~~
training accuracy: 0.25
test accuracy: 0.28
~~~~~~~~~~~~~~ Training epoch 4: ~~~~~~~~~~~~~~
training accuracy: 0.31
test accuracy: 0.32
~~~~~~~~~~~~~~ Training epoch 5: ~~~~~~~~~~~~~~
training accuracy: 0.36
test accuracy: 0.37
~~~~~~~~~~~~~~ Training epoch 6: ~~~~~~~~~~~~~~
training accuracy: 0.41
test accuracy: 0.41
~~~~~~~~~~~~~~ Training epoch 7: ~~~~~~~~~~~~~~
training accuracy: 0.44
test accuracy: 0.44
~~~~~~~~~~~~~~ Training epoch 8: ~~~~~~~~~~~~~~
training accuracy: 0.47
test accuracy: 0.48
~~~~~~~~~~~~~~ Training epoch 9: ~~~~~~~~~~~~~~
training accuracy: 0.5
test accuracy: 0.5
~~~~~~~~~~~~~~ Training epoch 10: ~~~~~~~~~~~~~~
training accuracy: 0.52
test accu

#### You can see that the model is scoring around 74% on the test dataset.
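
The two accuracy lists can be summarized with a few lines of plain Python. This sketch uses only the first ten epochs printed above, rounded as shown. In the notebook the same logic would run on `epoch_accuracies` and `test_accuracies`.

```python
# accuracies for epochs 0-9, copied from the printed log
train_acc = [0.02, 0.09, 0.17, 0.25, 0.31, 0.36, 0.41, 0.44, 0.47, 0.50]
test_acc = [0.06, 0.13, 0.22, 0.28, 0.32, 0.37, 0.41, 0.44, 0.48, 0.50]

# epoch with the highest test accuracy (first one wins on ties)
best_epoch = max(range(len(test_acc)), key=test_acc.__getitem__)

# train minus test accuracy; large positive values would suggest overfitting
gaps = [round(tr - te, 2) for tr, te in zip(train_acc, test_acc)]
print(best_epoch, max(gaps))
```

In these early epochs the test accuracy is never below the training accuracy. One likely reason is that Keras averages the training accuracy over the whole epoch, while the test accuracy is measured after the epoch's updates.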

### Save the trained model to load up later.

In [13]:
model.save("bal_10tip_2mil_mod_20ksnps_1000node.h5")
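
When the model is loaded again later, its softmax outputs need to be mapped back to scenario labels through the saved dict. The sketch below shows that last step. `decode_predictions`, the three-class `label_map`, and the probabilities are hypothetical; the Keras calls are left as comments because they need the saved file and new data.

```python
import numpy as np

def decode_predictions(probs, label_map):
    """Map each row of class probabilities to its most likely label."""
    return [label_map[int(i)] for i in np.argmax(probs, axis=1)]

# In the notebook, `probs` would come from something like:
#   model = load_model("bal_10tip_2mil_mod_20ksnps_1000node.h5")
#   probs = model.predict(new_X)   # new_X: hypothetical array of new samples
label_map = {0: "NaN", 1: "[1, 2]", 2: "[3, 7]"}   # hypothetical labels
probs = np.array([[0.1, 0.7, 0.2],
                  [0.2, 0.2, 0.6]])
decoded = decode_predictions(probs, label_map)
print(decoded)  # ['[1, 2]', '[3, 7]']
```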