# Tutorial for multiclass classification using _ImaGene_

In this example, the aim is to classify a given _locus_ into one of 3 classes of selection coefficient (0, 200 and 400 in 2Ne units, with Ne=10,000), where 0 corresponds to neutral evolution and the other two values to positive selection of increasing strength.
Please refer to the tutorial for binary classification for an in-depth explanation of each step.
Here we will just highlight the main differences.
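
As a quick check on the units, the three classes can be converted from 2Ne units to per-generation selection coefficients. This is a minimal sketch that assumes the usual convention S = 2·Ne·s; the variable names are illustrative and not part of _ImaGene_.

```python
# Convert selection coefficients from 2Ne units to per-generation values.
Ne = 10_000                # effective population size stated above
scaled = [0, 200, 400]     # the three classes, in 2Ne units

# Assumed convention: S = 2 * Ne * s, hence s = S / (2 * Ne)
unscaled = [S / (2 * Ne) for S in scaled]
print(unscaled)  # [0.0, 0.01, 0.02]
```

Under this convention the classes correspond to s = 0, 1% and 2%.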

In [None]:
import os
import gzip
import _pickle as pickle

import numpy as np
import scipy.stats

import skimage.transform
from keras import models, layers, activations, optimizers, regularizers
from keras.utils import plot_model
from keras.models import load_model

import itertools
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

In [None]:
%run -i ImaGene.py

We can check the sample allele frequency for the selected allele. Recall that we imposed selection to be acting in the middle of the region. Therefore, the targeted allele will be in position '0.5' in the _msms_ file.

In [None]:
freqs = calculate_allele_frequency(gene_sim, 0.5)
plt.scatter(gene_sim.targets, freqs, marker='o')
plt.xlabel('Selection coefficient')
plt.ylabel('Allele frequency')

As you can see, the current allele frequency for the targeted allele (on the y-axis) tends to increase for increasing selection coefficient (on the x-axis).
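
To make the quantity on the y-axis concrete, here is a minimal NumPy sketch of a derived-allele frequency at the targeted position. It is not the implementation of `calculate_allele_frequency`: the function name, the toy data and the choice of taking the site closest to 0.5 are all assumptions made for illustration.

```python
import numpy as np

def allele_frequency_at(haplotypes, positions, target=0.5):
    """Derived-allele frequency at the site closest to `target` (hypothetical helper)."""
    positions = np.asarray(positions)
    col = int(np.argmin(np.abs(positions - target)))  # column nearest to the target
    return haplotypes[:, col].mean()                  # fraction of haplotypes carrying a 1

# Toy data: 4 haplotypes (rows) x 3 SNPs (columns), 1 = derived allele
haplotypes = np.array([[0, 1, 0],
                       [1, 1, 0],
                       [0, 1, 1],
                       [0, 0, 1]])
positions = [0.12, 0.50, 0.87]   # relative positions, as in an msms-style file

freq = allele_frequency_at(haplotypes, positions)
print(freq)  # 0.75
```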

Next, _ImaGene_ provides functionalities to manipulate our object. Specifically we can do the following:
* convert ancestral/derived to major/minor allele polarisation
* filter out columns based on a minimum allele frequency (e.g. 0.01)
* sort rows and columns by frequency (or distance from the most frequent entry)
* resize rows and columns (e.g. to 128x128)

Several of these options are applied in the training loop further down, through calls to `filter_freq`, `sort` and `resize`.
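
As a library-independent illustration of what these manipulations do to a single image, here is a minimal NumPy sketch on a toy matrix. It is not _ImaGene_ code: the helper name is hypothetical, "frequency" is assumed to mean how often each distinct row (or column) occurs, and resizing is left out to keep the example short.

```python
import numpy as np

def sort_rows_by_frequency(img):
    """Put the most common distinct row first (assumed meaning of 'by frequency')."""
    uniques, counts = np.unique(img, axis=0, return_counts=True)
    order = np.argsort(-counts, kind="stable")
    return np.repeat(uniques[order], counts[order], axis=0)

# Toy image: rows are haplotypes, columns are SNPs, 1 = derived allele
image = np.array([[1, 0, 1, 0],
                  [1, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 0, 1, 0]])

# 1) Major/minor polarisation: flip columns where the derived allele is the major one
flip = image.mean(axis=0) > 0.5
polarised = np.where(flip, 1 - image, image)

# 2) Drop columns whose (minor) allele frequency is below 0.01
kept = polarised[:, polarised.mean(axis=0) >= 0.01]

# 3) Sort rows, then columns, by frequency of occurrence
sorted_image = sort_rows_by_frequency(kept)
sorted_image = sort_rows_by_frequency(sorted_image.T).T
print(sorted_image)
```

In this toy case the monomorphic last column is removed and the two identical haplotypes end up at the top of the image.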

To restrict the analysis to a subset of classes, as in a simple binary classification with only 2 classes, we first need to set '.classes' to the desired values and then take the subset of the data corresponding to those classes only.

We can achieve these steps with the following lines, using classes (i.e. selection coefficients) of 0 and 300 (in 2Ne units) as an illustration.

In [None]:
mygene.classes = np.array([0,300])
classes_idx = get_index_classes(mygene.targets, mygene.classes)
len(classes_idx)

As you can see, we now have 4000 data points, as expected. Finally, let's take the corresponding subset of the data.

In [None]:
mygene.subset(classes_idx)
mygene.summary()

Now we have an object of 4000 data points (images). 
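
The class-selection step can be pictured with plain NumPy. This is a sketch of the idea only; `index_of_classes` is a hypothetical stand-in and may differ from how `get_index_classes` is written.

```python
import numpy as np

def index_of_classes(targets, classes):
    """Indices of the data points whose target is one of `classes` (hypothetical helper)."""
    return np.where(np.isin(targets, classes))[0]

# Toy targets: one selection coefficient per simulated image
targets = np.array([0, 100, 300, 400, 300, 0, 200])
classes = np.array([0, 300])

idx = index_of_classes(targets, classes)
print(idx)  # [0 2 4 5]

# Taking the subset keeps only images from the requested classes
subset_targets = targets[idx]
```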

In [None]:
i = 1
while i <= 10:

    # simulations with a one-epoch demographic model
    myfile = ImaFile(simulations_folder='/home/mfumagal/Data/ImaGene/Binary/Simulations' + str(i) + '.Epoch1', nr_samples=128, model_name='Marth-1epoch-CEU')

    mygene = myfile.read_simulations(parameter_name='selection_coeff_hetero', max_nrepl=2000)
    
    # manipulate data: we keep the ancestral/derived polarisation in this example and filter out SNPs with a derived allele frequency lower than 2%
    mygene.filter_freq(0.02)
    mygene.sort('rows_freq')
    mygene.sort('cols_freq')
    mygene.resize((128, 128))
    mygene.convert(verbose=False)
    
    # we use only classes 0,200,400
    mygene.classes = np.array([0,200,400])
    
    # randomise data
    mygene.subset(get_index_classes(mygene.targets, mygene.classes))
    mygene.subset(get_index_random(mygene))

    # targets have to be converted into categorical data
    mygene.targets = to_categorical(mygene.targets)
    
    # at first iteration we build the model 
    # note that, as an illustration, we don't implement a final fully-connected layer
    if i == 1:

        model = models.Sequential([
                    layers.Conv2D(filters=32, kernel_size=(3,3), strides=(1,1), activation='relu', kernel_regularizer=regularizers.l1_l2(l1=0.005, l2=0.005), padding='valid', input_shape=mygene.data.shape[1:4]),
                    layers.MaxPooling2D(pool_size=(2,2)),
                    layers.Conv2D(filters=64, kernel_size=(3,3), strides=(1,1), activation='relu', kernel_regularizer=regularizers.l1_l2(l1=0.005, l2=0.005), padding='valid'),
                    layers.MaxPooling2D(pool_size=(2,2)),
                    layers.Conv2D(filters=64, kernel_size=(3,3), strides=(1,1), activation='relu', kernel_regularizer=regularizers.l1_l2(l1=0.005, l2=0.005), padding='valid'),
                    layers.MaxPooling2D(pool_size=(2,2)),
                    layers.Flatten(),
                    layers.Dense(units=len(mygene.classes), activation='softmax')])
        model.compile(optimizer='adam',
                    loss='categorical_crossentropy',
                    metrics=['accuracy'])

        mynet = ImaNet(name='[C32+P]+[C64+P]x2')

    # training for iterations from 1 to 9
    if i < 10:
        score = model.fit(mygene.data, mygene.targets, batch_size=32, epochs=1, verbose=0, validation_split=0.10)
        mynet.update_scores(score)
    else:
        # testing for iteration 10
        mynet.test = model.evaluate(mygene.data, mygene.targets, batch_size=None, verbose=0)
        mynet.predict(mygene, model)

    i += 1
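
Because the network above has no fully-connected hidden layer, the softmax layer reads the flattened output of the last pooling step directly. The size of that output can be worked out by hand. The sketch below assumes 128x128 input images (as set by `resize`) and the standard output-size rules for 'valid' convolutions and non-overlapping pooling; the helper names are illustrative.

```python
def conv_valid(size, kernel=3, stride=1):
    """Output width of a 'valid' convolution."""
    return (size - kernel) // stride + 1

def max_pool(size, pool=2):
    """Output width of non-overlapping max pooling (any remainder is dropped)."""
    return size // pool

size = 128                     # image width/height after resize((128, 128))
for _ in range(3):             # three Conv2D + MaxPooling2D pairs
    size = max_pool(conv_valid(size))

flat_units = size * size * 64  # 64 filters in the last convolution
print(size, flat_units)  # 14 12544
```

Each image is therefore reduced to 14x14x64 = 12544 values before the final 3-unit softmax layer.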

In [None]:
# save final (trained) model
model.save('Data/model.multi.h5')

# save testing data
mygene.save('Data/mygene.multi')

# save network
mynet.save('Data/mynet.multi')

In [None]:
# assess the training
mynet.plot_train()

In [None]:
# print the testing results [loss, accuracy]
print(mynet.test)
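
To see what these two numbers measure, here is a minimal NumPy sketch of categorical cross-entropy and accuracy on one-hot targets. The predicted probabilities are made up for illustration, and Keras applies extra numerical safeguards (such as clipping) that are not reproduced here.

```python
import numpy as np

classes = np.array([0, 200, 400])
targets = np.array([0, 400, 200, 400])          # toy selection coefficients

# One-hot encoding: map each target to its class index, then to a unit vector
class_idx = np.searchsorted(classes, targets)   # [0, 2, 1, 2]
one_hot = np.eye(len(classes))[class_idx]

# Hypothetical softmax outputs, one row per image (each row sums to 1)
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6],
                  [0.5, 0.4, 0.1],
                  [0.2, 0.2, 0.6]])

# Mean negative log-probability assigned to the true class
loss = -np.mean(np.sum(one_hot * np.log(probs), axis=1))
# Fraction of images whose most probable class is the true one
accuracy = np.mean(probs.argmax(axis=1) == class_idx)
print([loss, accuracy])
```

Here three of the four toy images are classified correctly, so the accuracy is 0.75; the loss also penalises the confident wrong call on the third image.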

In [None]:
# plot a confusion matrix (on the last mygene object which represents the testing data)
mynet.plot_cm(mygene.classes)
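
For readers who want the numbers behind such a plot, a confusion matrix can be built directly from true and predicted class indices. This is a minimal NumPy sketch with made-up labels; it is not how `plot_cm` is implemented, and the row normalisation at the end is an optional choice rather than a documented behaviour.

```python
import numpy as np

def confusion(true_idx, pred_idx, n_classes):
    """Count matrix with true classes on rows and predicted classes on columns."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (true_idx, pred_idx), 1)   # accumulate one count per image
    return cm

# Toy class indices for the classes [0, 200, 400]
true_idx = np.array([0, 0, 1, 1, 2, 2])
pred_idx = np.array([0, 1, 1, 1, 2, 1])

cm = confusion(true_idx, pred_idx, 3)
print(cm)

# Optional: per-class proportions, so that each row sums to 1
normalised = cm / cm.sum(axis=1, keepdims=True)
```

Off-diagonal entries show which classes are confused with each other; with ordered classes such as 0, 200 and 400, one would typically expect most errors between adjacent classes.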