# Training


This notebook contains the information needed to train the models for both mask prediction and boolean feature prediction. Both are based on DeepLab architectures with minor modifications to adapt the neural networks to our specific needs.

## Semantic segmentation model

In [1]:
# Necessary imports for the execution of the code.
# Make sure you have previously installed the packages listed in requirements.txt

'''
# WARNING:

Check https://pytorch.org/get-started/previous-versions/ and install the proper
PyTorch and torchvision versions according to your CUDA version.

You can figure out your CUDA version with:
/usr/local/cuda/bin/nvcc --version

'''

import os
import sys
import torch
import click
import pickle
import datahandler
import sklearn.metrics
import ETL_lib as Tox

from pathlib import Path
from torch.utils import data
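# train_clasif_model is used in the boolean phenotypes section; it is assumed
# here to live in trainer alongside train_model
from trainer import train_model, train_clasif_model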
from model import createDeepLabv3_resnet_50, binary_fenotypes_wideresnet50

**Step 1 (Creation of masks):** We must first create the masks from the roi files in the folders. These masks will serve as targets in the training process.

*Input:* Two folder paths and one dictionary are expected:
1. *raw_data_path*: Folder where all the data is. It must contain the plate folders to train on.

2. *data_path*: Destination folder where the masks will be saved for the training process. When the model has been trained, this folder will be automatically removed.

3. *masks_names*: A dictionary with string keys and list values. Each list contains the set of roi files that are merged into a single mask, which will be segmented afterwards. The key is the identifier of that mask. For example, to segment the dorsal fish outline together with the rois of the eyes, add the following key -> value: 'outline_dor' -> ['fishoutline_dorsal', 'eye_up_dorsal', 'eye_down_dorsal']. Attention: mask names must end with \_lat or \_dor to indicate whether the mask belongs to the lateral or the dorsal image.
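
The naming rule above can be checked before any mask is generated. The following is a minimal sketch and not part of `ETL_lib`: the helper name `split_masks_by_view` is hypothetical, and it only assumes the two suffixes mentioned above (`_lat` and `_dor`).

```python
def split_masks_by_view(masks_names):
    """Group mask identifiers by view and reject badly named ones.

    Hypothetical helper for illustration only; it is not part of ETL_lib.
    """
    views = {'_lat': [], '_dor': []}
    for mask_name, roi_files in masks_names.items():
        if not isinstance(roi_files, list) or not roi_files:
            raise ValueError(f"'{mask_name}' must map to a non-empty list of roi files")
        for suffix in views:
            if mask_name.endswith(suffix):
                views[suffix].append(mask_name)
                break
        else:
            # No suffix matched, so the view of this mask is unknown
            raise ValueError(f"'{mask_name}' must end with _lat or _dor")
    return views


example = {'outline_lat': ['fishoutline_lateral'],
           'eyes_dor': ['eye_up_dorsal', 'eye_down_dorsal']}
views = split_masks_by_view(example)
print(views)
```

Running a check like this first makes a misnamed key fail immediately instead of partway through mask generation.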

In [None]:
############ INPUT ############

raw_data_path = '../raw_data'
data_path = '../processed_data'
masks_names = {'outline_lat': ['fishoutline_lateral'],
               'heart_lat': ['heart_lateral'],
               'yolk_lat': ['yolk_lateral'],
               'ov_lat': ['ov_lateral'],
               'eyes_dor': ['eye_up_dorsal', 'eye_down_dorsal'],
               'outline_dor': ['fishoutline_dorsal', 'eye_up_dorsal', 'eye_down_dorsal']}


############ CODE ############

# Check whether the raw data path exists
if os.path.exists(raw_data_path):
    print('Generating masks folders...')
    Tox.data_generation_pipeline(raw_data_path, data_path, masks_names)
    print('Finished!')
else:
    print(raw_data_path, "does not exist.")

with open(data_path + '/complete_fishes.pkl', 'rb') as pickle_file:
    complete_list = pickle.load(pickle_file)
        
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

**Step 2 (DeepLab Training):** Once the target data is formatted to train our model (i.e., the masks have been extracted from the rois), we can run DeepLab training. This process may take quite a long time depending on the parameters used (such as the number of epochs or the batch size).

To take advantage of the data from every single well (even those that don't have all the rois), training is carried out in two phases. The first phase uses only the wells that have all the rois. The second phase also uses the incomplete wells, taking the predictions made by the model itself as targets. This scheme is a modification of *Learning without Forgetting* by Zhizhong Li and Derek Hoiem. To support it, the *data_generation_pipeline* function called above generates a *pkl* file listing all complete wells. In the first phase this list is passed to datahandler so that only those wells are loaded. In the second phase no list is passed, so the dataloader loads all wells. Note that the second phase is not guaranteed to improve the model, so only a few epochs are run. If the model worsens, the best model from the first phase is kept.
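
The second phase can be pictured as filling in missing targets with the model's own output. The sketch below is an illustration under stated assumptions, not the code in `trainer` or `datahandler`: masks are reduced to flat lists of 0/1 values, a missing mask is represented by `None`, `predict` is a hypothetical stand-in for the trained network, and thresholding the predictions at 0.5 is an assumption (soft targets would also be possible).

```python
def build_targets(well_masks, predict, threshold=0.5):
    """Return one target per mask, using predictions where the roi is missing."""
    targets = {}
    for mask_name, mask in well_masks.items():
        if mask is not None:
            targets[mask_name] = mask            # real annotation: keep it
        else:
            probs = predict(mask_name)           # hypothetical model output in [0, 1]
            targets[mask_name] = [1 if p >= threshold else 0 for p in probs]
    return targets


def keep_best(best_loss_phase1, loss_phase2):
    """Phase 2 only replaces the phase 1 model if the loss improves."""
    return 'phase2' if loss_phase2 < best_loss_phase1 else 'phase1'


# Toy well with a missing heart mask and a fake model
well = {'outline_lat': [1, 1, 0, 0], 'heart_lat': None}


def fake_predict(mask_name):
    return [0.9, 0.2, 0.7, 0.1]


targets = build_targets(well, fake_predict)
print(targets)
print(keep_best(0.30, 0.35))
```

The annotated mask is left untouched, so the self-generated targets only affect the channels for which no roi exists.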

Next we will define input parameters:
1. *images_folder*: Folder where images are (path from data_path).
2. *masks_folders*: Folders where masks are (path from data_path).
3. *model_directory*: Directory where the model is going to be saved.
4. *masks_weights*: Weights applied to each mask. The list of weights must have the same length as masks_folders, and the weights are applied in the same order. The default weights were chosen according to area criteria: for each mask the weight is mean(area)/mean(max_area), so the largest mask has weight one.
5. *criterion*: Loss function to be used in the model.
6. *optimizer*: Optimizer for the model.
7. *metrics*: Metrics to be used in the evaluation of the model.
8. *seed*: Seed for the model.
9. *fraction*: Fraction of the data to be used in test.
10. *batch_size*: Batch size
11. *num_epochs*: Number of epochs for the first phase.
12. *num_epochs2*: Number of epochs for the second phase.
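
The area-based rule described for *masks_weights* can be sketched as follows. The areas are made-up numbers for illustration, not measurements from the dataset, and `area_weights` is a hypothetical helper. It reads the rule as the mean area of each mask divided by the mean area of the largest mask, so its output need not match the hand-set tensor used in the next cell.

```python
from statistics import mean


def area_weights(areas_per_mask):
    """Mean area of each mask divided by the largest mean area."""
    mean_areas = {name: mean(areas) for name, areas in areas_per_mask.items()}
    largest = max(mean_areas.values())
    return {name: m / largest for name, m in mean_areas.items()}


# Made-up pixel counts for two images per mask
areas = {'outline_lat': [4000, 4400],
         'yolk_lat': [1000, 1100],
         'heart_lat': [200, 220]}
weights = area_weights(areas)
print(weights)
```

With these toy numbers the outline, being the largest mask, gets weight one and the other masks get proportionally smaller weights.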

In [None]:
'''
# NOTES:

-Batch size needs to be larger than one due to the batch normalization.

-The chosen loss function (nn.BCEWithLogitsLoss()) applies a sigmoid to the
output and then applies the binary cross entropy loss function (a pixel belongs
to a class or doesn't).

This first part of the code defines where the data is found and where the
model is going to be saved. To do that, a main data directory must be
declared. Inside this directory there must be an image folder with all the
raw images, plus the folders for the different masks, each containing masks
with the exact same name as their corresponding original image:

                     ___________data_path _____________
                    /               |                  \
                Images       Mask1_folder  ...  Maskn_folder
                  |                 |                  |
              img1.png          img1.png           img1.png
                  .                 .                  .
                  .                 .                  .
                  .                 .                  .
              imgk.png          imgk.png           imgk.png


If the step 1 cell has been executed, this configuration is ensured.
'''

############ INPUT ############

# Define images path and masks paths from data_path
images_folder = 'Images'
masks_folders = list(masks_names.keys())

# Path from current path to save the generated model
model_directory = Path('./Model_masks')
if not model_directory.exists():
    model_directory.mkdir()
    

# Model parameters

# Creation of the model
model = createDeepLabv3_resnet_50(outputchannels=len(masks_folders))

masks_weights = torch.tensor([[[0.5, 2, 2, 2, 1, 0.5]]])
# Expand the per-mask weights from shape (1, 1, 6) to (6, 190, 1024) so that
# each mask channel gets its own weight at every pixel
masks_weights = masks_weights.repeat_interleave(190, dim=1)
masks_weights = masks_weights.repeat_interleave(1024, dim=0).transpose(0,2)
# device was defined in the Step 1 cell (cuda:0 if available, otherwise cpu)
masks_weights = masks_weights.to(device)


criterion = torch.nn.BCEWithLogitsLoss(reduction='mean',
                                       pos_weight = masks_weights) # Specify the loss function

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # Specify the optimizer
                                                           # with a low learning rate

# Specify the evaluation metrics
metrics = {'f1_score': sklearn.metrics.f1_score}
           #'auroc': sklearn.metrics.roc_auc_score}
           #'accuracy_score': sklearn.metrics.accuracy_score}

seed = 1
fraction = 0.2
batch_size = 8
num_epochs = 10
num_epochs2 = 3


        
############ CODE ############

# FIRST TRAINING PHASE

# Creation of the data loaders ['Train', 'Test']
dataloaders, image_datasets = datahandler.get_dataloader_single_folder(data_path,
                                                       images_folder,
                                                       masks_folders,
                                                       batch_size = batch_size,
                                                       seed = seed,
                                                       fraction = fraction,
                                                       images_list = complete_list)


# Train the model
best_loss = train_model(model,
                criterion,
                dataloaders,
                optimizer,
                bpath = model_directory,
                masks_names = masks_folders,
                metrics = metrics,
                num_epochs = num_epochs,
                device = device)



# SECOND TRAINING PHASE

dataloaders2, _ = datahandler.get_dataloader_single_folder(data_path,
                                                       images_folder,
                                                       masks_folders,
                                                       batch_size = batch_size,
                                                       fraction = fraction,
                                                       test_list = getattr(image_datasets['Test'],'image_names'))

# Train the model on all wells, using the second-phase data loaders
_ = train_model(model,
                criterion,
                dataloaders2,
                optimizer,
                bpath = model_directory,
                masks_names = masks_folders,
                metrics = metrics,
                num_epochs = num_epochs2,
                device = device,
                best_loss = best_loss)

torch.save(model, model_directory / 'weights.pt')

# Alternatively you can load an already trained model:
# model = torch.load(model_directory / 'weights.pt')

## Boolean Phenotypes model

**Step 1 (DeepLab Training):** Let's train the model for the boolean phenotypes. Like the semantic segmentation model, this one is based on the DeepLab architecture, but with a fully connected network added at the output of the convolutional neural network.

Given the nature of the data available to train our models, it was not possible to train an accurate network for the *otholitsdefects* variable. Consequently, we removed this phenotype, since it biased the overall model and worsened the predictions for the other variables.

The input parameters to execute the training are:

1. *pheno_names*: A list with the name of the phenotypes to predict.
2. *model_directory*: Directory where the model is going to be saved.
3. *class_weights*: Weight given to each class.
4. *positive_weights*: Penalization for false negative for each class.
5. *criterion*: Loss function to be used in the model.
6. *optimizer*: Optimizer for the model.
7. *metrics*: Metrics to be used in the evaluation of the model.
8. *seed*: Seed for the model.
9. *fraction*: Fraction of the data to be used in test.
10. *batch_size*: Batch size
11. *acum_steps*: Number of steps to accumulate before applying backpropagation.
12. *num_epochs*: Number of training epochs.
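
How *positive_weights* and *class_weights* enter the loss can be illustrated without PyTorch. The sketch below follows the formula documented for `torch.nn.BCEWithLogitsLoss` with `pos_weight`. Combining the per-class losses as a *class_weights*-weighted sum is an assumption about what `train_clasif_model` does, and the logits and labels are toy values.

```python
import math


def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def bce_with_logits(logit, target, pos_weight=1.0):
    """-[pos_weight * y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x))]."""
    # log(sigmoid(x)) = -softplus(-x) and log(1 - sigmoid(x)) = -softplus(x)
    return pos_weight * target * softplus(-logit) + (1 - target) * softplus(logit)


def weighted_loss(logits, targets, pos_weights, class_weights):
    """Per-class losses (reduction='none') combined with class weights."""
    per_class = [bce_with_logits(x, y, pw)
                 for x, y, pw in zip(logits, targets, pos_weights)]
    return sum(w * loss for w, loss in zip(class_weights, per_class))


# Two positive labels that the model is undecided about (logit 0)
logits = [0.0, 0.0]
targets = [1, 1]
plain = weighted_loss(logits, targets, [1, 1], [0.5, 0.5])
boosted = weighted_loss(logits, targets, [1, 10], [0.5, 0.5])
print(plain, boosted)
```

Raising the positive weight of the second class to 10 makes a missed positive of that class ten times more costly, which is why larger values are given to the rarer phenotypes.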

In [None]:
############ INPUT ############

pheno_names = ['bodycurvature',
               'yolkedema',
               'necrosis',
               'tailbending',
               'notochorddefects',
               'craniofacialedema',
               'finabsence',
               'scoliosis',
               'snoutjawdefects']

# Path from current path to save the generated model
model_directory = Path('./Model_pheno')
if not model_directory.exists():
    model_directory.mkdir()
    

# Model Parameters

# Creation of the model
model = binary_fenotypes_wideresnet50(len(pheno_names))
    
class_weights = [1/len(pheno_names) for i in pheno_names]

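# pos_weight must be a tensor; it is placed on the device defined in the
# segmentation section, as was done for masks_weights
positive_weights = torch.tensor([1., 1., 1., 1., 10., 1., 10., 1., 1.], device=device)

criterion = torch.nn.BCEWithLogitsLoss(reduction='none',
                                       pos_weight = positive_weights) # Specify the loss function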

optimizer = torch.optim.Adam(model.parameters(), lr = 5e-5, 
                                                 weight_decay = 1e-5) # Specify the optimizer
                                                                      # with a low learning rate

metrics = {'f1_score': sklearn.metrics.f1_score,
           'precision': sklearn.metrics.precision_score,
           'recall': sklearn.metrics.recall_score}

seed = 1
fraction = 0.2
batch_size = 4
acum_steps = 64
num_epochs = 50


        
############ CODE ############

# Given a list of image names, keep only the fish names that have all the
# boolean phenotypes in pheno_names
stats_path = data_path + '/stats.csv'
complete_bools = Tox.filter_by_bool(complete_list, pheno_names, stats_path)

# Creation of the data loaders ['Train', 'Test']
dataloaders, image_datasets = datahandler.get_dataloader_single_folder_bool(data_dir = data_path,
                                                       image_folder = images_folder,
                                                       feno_names = pheno_names,
                                                       model_folder = model_directory,
                                                       image_list = complete_bools,
                                                       batch_size = batch_size,
                                                       seed = seed,
                                                       fraction = fraction)



# TRAINING PHASE

_ = train_clasif_model(model,
                       pheno_names,
                       criterion,
                       dataloaders,
                       optimizer,
                       bpath = model_directory,
                       metrics = metrics,
                       num_epochs = num_epochs,
                       bs = batch_size,
                       batch_acum = acum_steps,
                       class_weights = class_weights)