# Loading trained models and using them to make predictions

Here, we load two trained models -- one first stage model, one second stage model -- and use them to make predictions about the source and destination of introgression from newly simulated data.

### Imports

In [1]:
import simcat
import toytree
import numpy as np
import pandas as pd
import ipcoal

from keras.models import load_model

Using TensorFlow backend.


### Load up our trained models

The models were saved as `.h5` files with Keras's `model.save`, so `load_model` restores them directly.

In [2]:
firststage_mod = load_model("../models/bal_10tip_2mil/firststage_mod.h5")
secondstage_mod = load_model("../models/bal_10tip_2mil/secondstage_mod.h5")

### Load up the classification dictionary

This is a lookup table I made during the training process. Its column labels are the class indices used by the models' one-hot encoding, and its single row holds the literal source/destination pair for each class.

In [3]:
onehot_dict = pd.read_csv('../models/bal_10tip_2mil/onehot_dict.csv')
onehot_dict

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,167,168,169,170,171,172,173,174,175,176
0,11,2,3,4,5,6,7,8,9,111,...,912,913,914,92,93,94,95,96,97,


### Load up our input topology

This is the topology on which the training simulations were based. Beyond fixing the topology, the tree also shaped the simulation parameters, which were drawn from distributions based on its node heights.

In [4]:
input_topo = toytree.tree('../models/bal_10tip_2mil/species_tree.tre')

### Define a simple simulation

We'll set up a simple scenario with a constant Ne across the tree and without sliding the nodes, just to demonstrate.

In [5]:
# here is an opportunity to change the node heights; we just reuse the original tree
slidetree = input_topo

# mutation rate, same as in training
mut = 1e-8

# set the Ne values for the whole tree
# (min and max are equal here, so every node gets the same Ne)
Ne_min = 500000
Ne_max = 500000
popsizes = np.random.uniform(Ne_min, Ne_max, slidetree.nnodes)

slidetree = slidetree.set_node_values("Ne", default=1e5)
nes = iter(popsizes)

slidetree = slidetree.set_node_values(
    "Ne",
    {i.name: next(nes) for i in slidetree.get_feature_dict()}
)

# plot the tree
slidetree.draw(ts='p',
    edge_type='p');
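
Because `Ne_min` and `Ne_max` are equal in the cell above, the uniform draw is degenerate and every node receives the same value. The sketch below shows what would change if the bounds differed. The wider bounds are illustrative, and the node count of 19 is only a stand-in for `slidetree.nnodes`; neither value comes from the training setup.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
nnodes = 19                     # assumed stand-in for slidetree.nnodes

# equal bounds: every draw equals the bound, i.e. a constant Ne
constant_ne = rng.uniform(500000, 500000, nnodes)

# distinct bounds (illustrative values): Ne now varies from node to node
variable_ne = rng.uniform(100000, 500000, nnodes)

print(np.unique(constant_ne))
print(variable_ne.min() >= 100000, variable_ne.max() <= 500000)
```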

### Propose an admixture edge

The edge follows the toytree/ipcoal tuple format: (source, destination, time, magnitude).

In [6]:
source = 12
dest = 15
time = 0.5
magnitude = 0.4

# define admixture tuple
admix = (
    source,
    dest,
    time,
    magnitude
)
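
It can help to sanity-check the tuple before handing it to the simulator. In the sketch below, `AdmixEdge` and `check_admix` are hypothetical names, and the rules they enforce (distinct nodes, time given as a fraction, magnitude in (0, 1]) are assumptions about how this notebook uses the tuple, not rules taken from ipcoal.

```python
from typing import NamedTuple


class AdmixEdge(NamedTuple):
    """Hypothetical container mirroring the 4-tuple used above."""
    source: int
    dest: int
    time: float
    magnitude: float


def check_admix(edge: AdmixEdge) -> AdmixEdge:
    """Raise ValueError if the edge breaks the assumed constraints."""
    if edge.source == edge.dest:
        raise ValueError("source and dest must be different nodes")
    if not 0.0 < edge.time < 1.0:
        raise ValueError("time is expected as a fraction in (0, 1)")
    if not 0.0 < edge.magnitude <= 1.0:
        raise ValueError("magnitude is expected in (0, 1]")
    return edge


# a NamedTuple is already a tuple; converting makes that explicit
admix = tuple(check_admix(AdmixEdge(12, 15, 0.5, 0.4)))
print(admix)
```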

### Define our model

In [7]:
# build ipcoal Model object using our defined parameters
# Ne=None so that the per-node "Ne" values set on the tree are used
model = ipcoal.Model(
    tree=slidetree,
    admixture_edges=[admix],
    Ne=None,
    mut=mut,
    seed=12345,
)

### Simulate many SNPs

In [8]:
model.sim_snps(20000)

### Restructure the simulated SNPs into count matrices using the ipcoal `get_snps_count_matrix` function.

In [9]:
count_mat = ipcoal.utils.get_snps_count_matrix(model.treeorig, model.seqs)

### Normalize the simulated count matrix

In [10]:
count_mat = count_mat / count_mat.max()
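
Dividing by the global maximum rescales every count into the range [0, 1] while keeping the relative sizes intact. Here is a small sketch on a toy array. The array contents and the `normalize_counts` helper are made up for illustration, and the guard against an all-zero array is an added assumption rather than something the notebook does.

```python
import numpy as np


def normalize_counts(counts):
    """Scale an array of counts by its global maximum."""
    counts = np.asarray(counts, dtype=float)
    peak = counts.max()
    if peak == 0:
        # nothing to scale; avoid dividing by zero
        return counts
    return counts / peak


# toy stand-in for a stack of count matrices
toy = np.array([[[0, 5], [10, 20]],
                [[4, 0], [2, 1]]])
scaled = normalize_counts(toy)
print(scaled.min(), scaled.max())
```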

### Feed the simulated SNP data to the first stage model

In [11]:
# flatten the count matrices (210 x 16 x 16) into a single row for the model
pred1 = firststage_mod.predict(count_mat.reshape(1, 210 * 16 * 16))
np.argmax(pred1[0])

42
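
The first-stage model takes one flat feature vector per sample, which is why the count matrices are reshaped before prediction. The sketch below walks through the shape bookkeeping on a zero-filled placeholder array. Reading 210 as the number of 4-tip subsets of a 10-tip tree is an assumption; it matches the arithmetic, since C(10, 4) = 210.

```python
from math import comb

import numpy as np

n_subsets = comb(10, 4)  # assumed origin of the 210 in the reshape call
placeholder = np.zeros((n_subsets, 16, 16))  # stand-in for count_mat

# -1 lets NumPy infer the flattened length (210 * 16 * 16)
flat = placeholder.reshape(1, -1)
print(n_subsets, flat.shape)
```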

### Feed the first stage model output to the second stage model

In [12]:
pred2 = secondstage_mod.predict(pred1)
np.argmax(pred2[0])

42

### Make a prediction

In [13]:
# what is the prediction?
onehot_dict[str(np.argmax(pred2[0]))][0]

'12,15'
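
The lookup works because the table has class indices as column headers and a single row of labels. Here is a stand-in using only the standard library. The three labels are invented for illustration and are not the contents of `onehot_dict.csv`, and `decode` is a hypothetical helper.

```python
import csv
import io

# miniature table: header row of class indices, one row of "source,dest" labels
fake_csv = io.StringIO(
    ',0,1,2\n'
    '0,"1,2","1,3","12,15"\n'
)
row = next(csv.DictReader(fake_csv))


def decode(class_index):
    """Translate a class index into an integer (source, dest) pair."""
    label = row[str(class_index)]
    source, dest = (int(x) for x in label.split(","))
    return source, dest


print(decode(2))
```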

### Compare to the correct answer

In [14]:
# what is the correct answer?
','.join([str(source),str(dest)])

'12,15'
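
A single argmax hides how close the runner-up classes were. Assuming the final layer outputs one score per class, the top few candidates can be listed as in this sketch. The score vector is made up for illustration, and `top_k` is a hypothetical helper.

```python
import numpy as np


def top_k(scores, k=3):
    """Return the k highest-scoring (class_index, score) pairs, best first."""
    scores = np.asarray(scores)
    order = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i])) for i in order]


toy_scores = np.array([0.05, 0.10, 0.60, 0.25])  # illustrative values only
print(top_k(toy_scores, k=2))
```

Each returned index could then be passed through the same lookup table to get its source/destination label.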