# Building a small LAI model

For today's exercise we'll build a small classifier that infers the ancestry of query haplotypes after being trained on many haplotype-ancestry pairs. Put another way, the model learns to assign each haplotype to a population. We will use only simulated data for this exercise, so let's start by defining some functions to simulate it. We begin with a simple three-population model with no migration:

In [2]:
import msprime
import numpy   as np
import seaborn as sn # Plotting
import matplotlib.pyplot as plt # Plotting

from sklearn.metrics import confusion_matrix # Plotting


#################
# VISUALIZATION #
#################


def plot_confusion(predictions, labels, num_pops):
    # Create confusion matrix, normalized so that each row (true label) sums to 1
    conf_matrix = confusion_matrix(labels, predictions)
    conf_matrix = conf_matrix / conf_matrix.sum(axis=1, keepdims=True)

    # Plot confusion matrix
    classes = ['ABCDEFG'[x] for x in range(num_pops)]
    plt.figure(figsize=(12, 7))
    sn.set(font_scale=1.9)
    sn.heatmap(conf_matrix, annot=True, fmt='.2f', xticklabels=classes, yticklabels=classes, vmin=0, vmax=1)
    plt.show();


##############
# DEMOGRAPHY #
##############


def simple_divergence(n, l, ab_split=1_000, abc_split=4_000):
    '''
    Simulation of n diploid samples (n * 2 haplotypes) for each population. We consider three populations "A", "B", and "C" such that:
        - Each has N_e = 10,000 in the present.
        - "A" and "B"  become a single population "AB"  :ab_split  generations in the past.
        - "C" and "AB" become a single population "ABC" :abc_split generations in the past.

    Parameters:
        n - Number of samples to simulate (per population).
        l - Length of the region to simulate.
        ab_split  - Generations in the past when "AB"  split into "A"  and "B"
        abc_split - Generations in the past when "ABC" split into "AB" and "C"

    Output value:
        Tree sequence for the simulated and mutated samples.

    Relevant Documentation:
        msprime.Demography
        msprime.SampleSet
        msprime.sim_ancestry()
        msprime.sim_mutations()
    '''

    pass

Now that we have some functions to simulate data, we want to:

  1. Simulate a large enough dataset for training + testing
  2. Label our data, since this is supervised machine learning
  3. Split the labeled data into training and testing sets
  4. Fit our model to the training data
  5. Predict and measure the model error on the testing data

In [3]:
import numpy as np

from sklearn.tree            import DecisionTreeClassifier
from sklearn.model_selection import train_test_split


# Generate data matrix
n_samples = 300
max_snp   = 1024
ts        = simple_divergence(n_samples, 1_000_000, 200, 400)

# Extract genotype matrix, rows as haplotypes (two per diploid sample) and columns as SNPs (at most max_snp)
pass

# Generate labels
n_pops = 3
labels = []
for pop in range(n_pops):
    labels.extend([pop] * n_samples * 2)

# Split into training and testing datasets
pass

# Train model
pass

# Make predictions
pass

# Show results
pass
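The remaining steps could look roughly like the sketch below. To keep it runnable without the simulation, it uses a stand-in matrix drawn from population-specific allele frequencies; the sizes, the 80/20 split, and the seeds are all illustrative assumptions. With a real tree sequence, `ts.genotype_matrix()` returns sites as rows and haplotypes as columns, so its transpose would take the place of `X`.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n_pops, n_haps, n_snps = 3, 200, 256  # illustrative sizes, not from the exercise

# Stand-in data: each population has its own allele frequencies, and each
# haplotype is a row of 0/1 alleles drawn from those frequencies.
freqs = rng.uniform(0.05, 0.95, size=(n_pops, n_snps))
X = np.vstack([rng.binomial(1, freqs[pop], size=(n_haps, n_snps))
               for pop in range(n_pops)])

# Labels follow the same block layout as the rows of X.
y = np.repeat(np.arange(n_pops), n_haps)

# Hold out 20% of haplotypes, keeping population proportions balanced.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Fit the classifier and evaluate on the held-out haplotypes.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"test accuracy: {accuracy:.2f}")
```

The `predictions` and `y_test` arrays from a run like this are the kind of inputs that `plot_confusion` expects, together with the number of populations.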

What are some limitations of this approach to classifying sequences?

Consider the way most real-world datasets are structured. How could we train on existing labeled datasets but make inferences on unlabeled data? How will our model behave for admixed individuals?