# Multiclass classification on continuous variables using _ImaGene_

In this example, the aim is to estimate a continuous parameter.
Please refer to the tutorial for binary and multiclass classification for an in-depth explanation of each step and case study.
Briefly, we aim to estimate the selection coefficient of a variant conferring lactase persistence in Europeans.
We will discretize the range of selection coefficients into classes and perform a multiclass classification.

In [None]:
import os
import gzip
import _pickle as pickle

import numpy as np
import scipy.stats
import arviz

import tensorflow as tf
from tensorflow import keras
from keras import models, layers, activations, optimizers, regularizers
from keras.utils import plot_model
from keras.models import load_model

import itertools
import skimage.transform
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
import pydot # optional, but required by keras to plot the model

In [None]:
%run -i ../ImaGene.py

### 1. Read data from VCF file and store it into _ImaGene_ objects

In [None]:
file_LCT = ImaFile(nr_samples=198, VCF_file_name='LCT.CEU.vcf');
gene_LCT = file_LCT.read_VCF();

As an illustration, here we will sort rows by distance and columns by frequency and resize the image to (128,128).

In [None]:
gene_LCT.filter_freq(0.005);
gene_LCT.sort('rows_dist');
gene_LCT.sort('cols_freq');
gene_LCT.resize((128,128));
gene_LCT.convert(flip=True);
gene_LCT.plot();
gene_LCT.summary();
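
To make the filtering and sorting steps above more concrete, here is a minimal NumPy sketch on a toy haplotype matrix (rows are haplotypes, columns are sites). The function names are hypothetical and this is not the _ImaGene_ implementation: in particular, using minor allele frequency for the filter and the most common haplotype as the reference for row distances are assumptions made for illustration.

```python
import numpy as np

def drop_rare_sites(haplotypes, min_freq):
    """Keep columns whose minor allele frequency is at least min_freq (assumed rule)."""
    freq = haplotypes.mean(axis=0)
    maf = np.minimum(freq, 1 - freq)
    return haplotypes[:, maf >= min_freq]

def sort_columns_by_frequency(haplotypes):
    """Order columns by decreasing derived allele frequency."""
    order = np.argsort(-haplotypes.mean(axis=0), kind="stable")
    return haplotypes[:, order]

def sort_rows_by_distance(haplotypes):
    """Order rows by Hamming distance to the most common row (assumed reference)."""
    unique_rows, counts = np.unique(haplotypes, axis=0, return_counts=True)
    reference = unique_rows[counts.argmax()]
    distance = (haplotypes != reference).sum(axis=1)
    return haplotypes[np.argsort(distance, kind="stable")]

# toy data: 4 haplotypes, 4 sites (the first site is monomorphic)
toy = np.array([[0, 1, 1, 0],
                [0, 1, 0, 0],
                [0, 1, 0, 0],
                [0, 0, 1, 1]])

processed = drop_rare_sites(toy, min_freq=0.1)
processed = sort_columns_by_frequency(processed)
processed = sort_rows_by_distance(processed)
print(processed)
```

The monomorphic site is removed, and the two identical haplotypes end up at the top of the matrix.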

### 2. Run and process simulations to be used for training the neural network

We provide an example parameter file, `params_continuous.txt`, which simulates a total of 205,000 loci of 80 kbp with additive allelic selection coefficients from 0 to 400, in steps of 1, expressed in $2N_e$ units with $N_e=10,000$.
All other parameters are set as in the example of binary classification.

In [None]:
# change accordingly, e.g.:
# path_sim = '/home/mfumagal/Data/ImaGene/Tutorials/' # for my local machine
# path_sim = '/mnt/quobyte/ImaGene/' # for workshop spp1819
path_sim = './'

Edit the `params_continuous.txt` file accordingly. Here I assume that simulations will be stored in `path_sim + 'Continuous/'`.

In [None]:
# run this cell only if you wish to generate new training data; otherwise skip it
import subprocess
subprocess.call("bash ../generate_dataset.sh params_continuous.txt".split())

We wish to perform a multiclass classification to estimate the selection coefficient, a continuous parameter.
In _ImaGene_ we can easily do that by imposing a new discrete set of classes and reassigning the targets to these classes with the methods `.set_classes` and `.set_targets`.
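
As a rough illustration of the idea, the sketch below discretizes a continuous parameter into evenly spaced classes and assigns each value to its nearest class. The `discretize` helper is hypothetical, and evenly spaced classes with nearest-class assignment are assumptions: `.set_classes` and `.set_targets` may choose and assign classes differently.

```python
import numpy as np

def discretize(values, nr_classes):
    """Return evenly spaced class values and, for each input, the index of its nearest class."""
    values = np.asarray(values, dtype=float)
    classes = np.linspace(values.min(), values.max(), nr_classes)
    # distance of every value to every class, then pick the closest class
    nearest = np.abs(values[:, None] - classes[None, :]).argmin(axis=1)
    return classes, nearest

# selection coefficients from 0 to 400 in steps of 1, as in the simulations
sel_coeffs = np.arange(0, 401)
classes, targets = discretize(sel_coeffs, nr_classes=11)

print(classes)          # class values spanning 0 to 400
print(targets[[0, 21, 400]])  # class index for a few selection coefficients
```

With 11 classes over this range, neighbouring classes are 40 units apart, so each class collects the simulated values within about 20 units of it.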

### 3. Implement, train and evaluate the neural network

The pipeline for training and testing is as follows: simulation batches 1 to 9 are used for training (one epoch each) and batch 10 is used for testing.

In [None]:
i = 1
while i <= 10:

    # simulations 
    file_sim = ImaFile(simulations_folder=path_sim+'Continuous/Simulations' + str(i), nr_samples=198, model_name='Marth-3epoch-CEU')

    # retain only 20 data points per class as a quick example
    gene_sim = file_sim.read_simulations(parameter_name='selection_coeff_hetero', max_nrepl=20)
    
    # manipulate data
    gene_sim.filter_freq(0.005)
    gene_sim.sort('rows_dist')
    gene_sim.sort('cols_freq')
    gene_sim.resize((128,128))
    gene_sim.convert(flip=True)
    
    # we assign 11 classes out of all the data simulated
    gene_sim.set_classes(nr_classes=11)
    if i == 1:
        print(gene_sim.classes)
    # and we assign targets corresponding to the previously set classes 
    gene_sim.set_targets()
    
    # randomise data
    gene_sim.subset(get_index_random(gene_sim))

    # targets have to be converted into categorical data;
    # here we can use some extra options to, for instance, impose a Gaussian distribution on the true targets
    gene_sim.targets = to_categorical(gene_sim.targets, wiggle=0, sd=0.5)
    
    # at first iteration we build the model 
    # note that, as an illustration, we don't include a fully-connected hidden layer before the softmax output as we are double sorting the matrix
    if i == 1:

        model = models.Sequential([
                    layers.Conv2D(filters=32, kernel_size=(3,3), strides=(1,1), activation='relu', kernel_regularizer=regularizers.l1_l2(l1=0.005, l2=0.005), padding='valid', input_shape=gene_sim.data.shape[1:]),
                    layers.MaxPooling2D(pool_size=(2,2)),
                    layers.Conv2D(filters=64, kernel_size=(3,3), strides=(1,1), activation='relu', kernel_regularizer=regularizers.l1_l2(l1=0.005, l2=0.005), padding='valid'),
                    layers.MaxPooling2D(pool_size=(2,2)),
                    layers.Conv2D(filters=128, kernel_size=(3,3), strides=(1,1), activation='relu', kernel_regularizer=regularizers.l1_l2(l1=0.005, l2=0.005), padding='valid'),
                    layers.MaxPooling2D(pool_size=(2,2)),
                    layers.Flatten(),
                    layers.Dense(units=len(gene_sim.classes), activation='softmax')])
        model.compile(optimizer='adam',
                    loss='categorical_crossentropy',
                    metrics=['accuracy'])

        net_LCT = ImaNet(name='[C32+P]+[C64+P]+[C128+P]')

    # training for iterations from 1 to 9
    print(i)
    if i < 10:
        score = model.fit(gene_sim.data, gene_sim.targets, batch_size=32, epochs=1, verbose=1, validation_split=0.10)
        net_LCT.update_scores(score)
    else:
        # testing for iteration 10
        net_LCT.test = model.evaluate(gene_sim.data, gene_sim.targets, batch_size=None, verbose=1)
        net_LCT.predict(gene_sim, model)

    i += 1
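
The pipeline above converts targets into categorical data with the option of imposing a Gaussian distribution on the true targets. A minimal sketch of what such "soft" labels could look like is given below. The `soft_one_hot` helper is hypothetical, and measuring `sd` in units of class index is an assumption; the internals of the `to_categorical` function used above may differ.

```python
import numpy as np

def soft_one_hot(targets, nr_classes, sd=0.5):
    """Spread each label over neighbouring classes with Gaussian weights (rows sum to 1)."""
    targets = np.asarray(targets)
    class_index = np.arange(nr_classes)
    # unnormalised Gaussian weight of every class index around each true class index
    weights = np.exp(-0.5 * ((class_index[None, :] - targets[:, None]) / sd) ** 2)
    return weights / weights.sum(axis=1, keepdims=True)

labels = soft_one_hot([0, 5, 10], nr_classes=11, sd=0.5)
print(np.round(labels[1], 3))  # most weight on class 5, a little on classes 4 and 6
```

Compared with a strict one-hot encoding, such labels penalize a prediction in an adjacent class less than a prediction far from the true value, which reflects the ordering of the classes.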

In [None]:
# set the directory in which to save models, e.g.
# path='/home/mfumagal/Data/ImaGene/Tutorials/Data/' # my local machine
# path='./' # for workshop spp1819
path = './'

In [None]:
# save final (trained) model
model.save(path+'model.multi_cont.h5')

# save testing data
gene_sim.save(path+'gene_sim.multi_cont')

# save network
net_LCT.save(path+'net_LCT.multi_cont')

Recall that to load all these files you can use the following commands.

In [None]:
gene_sim = load_imagene(path+'gene_sim.multi_cont');
net_LCT = load_imanet(path+'net_LCT.multi_cont');
model = load_model(path+'model.multi_cont.h5');

In [None]:
# assess the training
net_LCT.plot_train();

We can report loss, accuracy and the confusion matrix as for any classification task, although in this case it may be more informative to investigate the difference between true and predicted values rather than between true and predicted classes.

In [None]:
# print the testing results [loss, accuracy] and plot confusion matrix
print(net_LCT.test)
net_LCT.plot_cm(gene_sim.classes, text=False);
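
One possible way to compare true and predicted values rather than classes is sketched below. It assumes you have an array of class probabilities (for instance the softmax output of the network) and the true parameter values; the arrays and the helpers `point_estimates` and `rmse` are hypothetical and not part of _ImaGene_.

```python
import numpy as np

def point_estimates(probs, classes):
    """Return the most probable class value and the probability-weighted mean per sample."""
    probs = np.asarray(probs, dtype=float)
    classes = np.asarray(classes, dtype=float)
    map_values = classes[probs.argmax(axis=1)]
    mean_values = probs @ classes
    return map_values, mean_values

def rmse(true_values, predicted_values):
    """Root mean squared error between true and predicted values."""
    diff = np.asarray(true_values, dtype=float) - np.asarray(predicted_values, dtype=float)
    return np.sqrt(np.mean(diff ** 2))

# toy example with three classes and two samples (illustrative numbers only)
classes = np.array([0.0, 200.0, 400.0])
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6]])
true_values = np.array([0.0, 400.0])

map_values, mean_values = point_estimates(probs, classes)
print(map_values, rmse(true_values, map_values))
print(mean_values, rmse(true_values, mean_values))
```

The error is expressed in the units of the parameter itself, so it remains comparable if the number of classes changes.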

### 4. Deploy the trained network on your genomic data of interest

Finally, we use the trained network to estimate the selection coefficient of our locus of interest.
A plot of the probability distribution of the selection coefficient can be obtained by, for instance, drawing MCMC samples. MCMC samples can also be used to obtain Bayes Factors and HPDI.
(However, it is not guaranteed that this approach is better than using a regression as final layer. More tests need to be conducted.)
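
As a rough sketch of how such summaries could be derived from class probabilities, the code below draws independent random samples (a simplification, not MCMC) from a hypothetical probability vector, computes the shortest interval containing 95% of the samples, and computes a Bayes factor for one class against all others under an assumed uniform prior over classes. The probabilities and helper names are illustrative; `plot_scores` may compute these quantities differently.

```python
import numpy as np

def sample_values(probs, classes, n_samples=100_000, seed=0):
    """Draw parameter values with probability given by the class probabilities."""
    rng = np.random.default_rng(seed)
    return rng.choice(classes, size=n_samples, p=probs)

def hpd_interval(samples, mass=0.95):
    """Shortest interval containing the requested fraction of the samples."""
    ordered = np.sort(samples)
    n = len(ordered)
    n_inside = int(np.ceil(mass * n))
    widths = ordered[n_inside - 1:] - ordered[:n - n_inside + 1]
    start = widths.argmin()
    return ordered[start], ordered[start + n_inside - 1]

def bayes_factor(probs, k):
    """Posterior odds of class k versus the rest, divided by uniform prior odds (assumption)."""
    probs = np.asarray(probs, dtype=float)
    posterior_odds = probs[k] / (1.0 - probs[k])
    prior = 1.0 / len(probs)
    return posterior_odds / (prior / (1.0 - prior))

# hypothetical class values and probabilities for one locus
classes = np.linspace(0, 400, 11)
probs = np.array([0, 0, 0, 0.05, 0.2, 0.5, 0.2, 0.05, 0, 0, 0])

samples = sample_values(probs, classes)
low, high = hpd_interval(samples)
bf = bayes_factor(probs, k=int(probs.argmax()))
print(low, high, bf)
```

Because the samples take only the discrete class values, the resulting interval is limited to the resolution of the classes.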

In [None]:
values = plot_scores(model, gene_LCT, classes=gene_sim.classes);

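This function returns the following values: the maximum a posteriori estimate (MAP), the maximum likelihood estimate (MLE), the highest posterior density interval (HPD) and the Bayes factor (BF).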

In [None]:
print(values)