# Experiment

The goal of this notebook is to compare fine-tuning (FT) and feature extraction (FE) on the plankton problem: fine-tuning trains the whole network, whereas feature extraction only updates the final layer.

A second goal is to compare how much data each of the two techniques needs.

We run the tests with a training set made of the years 2006 and 2007 and a test set made of 2008.

The aim is not to reach the highest possible accuracy, but to see which method works better depending on the amount of data available.
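
To make the difference concrete, the arithmetic below counts how many parameters each approach would update for the ResNet-18 used later in this notebook with its 47-class head. The head size follows from `nn.Linear(512, 47)`. The total of 11,689,512 parameters for the stock torchvision ResNet-18 is not stated anywhere in this notebook and should be read as an assumption.

```python
# Parameter counts for ResNet-18 with a 47-class head (a rough sketch).
IN_FEATURES = 512          # input size of ResNet-18's final linear layer
NUM_CLASSES = 47           # classes left after filtering in this notebook

# Assumption: torchvision's ResNet-18 with a 1000-class head has 11,689,512 parameters.
TOTAL_IMAGENET = 11_689_512
imagenet_head = IN_FEATURES * 1000 + 1000
backbone = TOTAL_IMAGENET - imagenet_head

new_head = IN_FEATURES * NUM_CLASSES + NUM_CLASSES   # weights + biases
trainable_fe = new_head                  # FE: only the new head is updated
trainable_ft = backbone + new_head       # FT: every parameter is updated

print(f"FE updates {trainable_fe:,} parameters")
print(f"FT updates {trainable_ft:,} parameters")
print(f"FE trains {100 * trainable_fe / trainable_ft:.2f}% of the network")
```

Under that assumption, FE optimises well under one percent of the weights, which is why it is the cheaper of the two options.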

In [1]:
import os,sys

if not os.path.isdir("/media/nas/pgonzalez/IFCB_HDF5"):
    print("You should have the IFCB_HDF5 project in this directory to run this notebook")
    raise FileNotFoundError("/media/nas/pgonzalez/IFCB_HDF5") #StopExecution was never defined
sys.path.insert(1, os.path.abspath("/media/nas/pgonzalez/IFCB_HDF5"))

## Data loading
We read the metadata, downloading it first if it is not available locally.

In [2]:
import pandas as pd

if not os.path.isfile('IFCB.csv.zip'):
    print("CSV data do not exist. Downloading...")
    !wget -O IFCB.csv.zip "https://unioviedo-my.sharepoint.com/:u:/g/personal/gonzalezgpablo_uniovi_es/EfsVLhFsYJpPjO0KZlpWUq0BU6LaqJ989Re4XzatS9aG4Q?download=1"

data = pd.read_csv('IFCB.csv.zip',compression='infer', header=0,sep=',',quotechar='"')
#Compute sample and year information
data['year'] = data['Sample'].str[6:10].astype(str) #Compute the year
samples=data.groupby('Sample').first()
samples=samples[['year']]
print(data)

                        Sample  roi_number        OriginalClass  \
0        IFCB1_2006_158_000036           1                  mix   
1        IFCB1_2006_158_000036           2  Tontonia_gracillima   
2        IFCB1_2006_158_000036           3                  mix   
3        IFCB1_2006_158_000036           4                  mix   
4        IFCB1_2006_158_000036           5                  mix   
...                        ...         ...                  ...   
3457814  IFCB5_2014_353_205141        6850       Leptocylindrus   
3457815  IFCB5_2014_353_205141        6852                  mix   
3457816  IFCB5_2014_353_205141        6855                  mix   
3457817  IFCB5_2014_353_205141        6856                  mix   
3457818  IFCB5_2014_353_205141        6857                  mix   

              AutoClass FunctionalGroup  year  
0                   mix      Flagellate  2006  
1           ciliate_mix         Ciliate  2006  
2                   mix      Flagellate  2006  
3  
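
The cell above extracts the year with the fixed slice `str[6:10]`, which relies on the five-character instrument prefix (`IFCB1`, `IFCB5`) seen in the output. As a sketch of an alternative that does not depend on the prefix length, the year could be taken as the second underscore-separated field. The helpers `year_of` and `split_by_year` are hypothetical, and two of the sample IDs below are made up for illustration.

```python
def year_of(sample_id: str) -> str:
    """Return the year field of an ID such as 'IFCB1_2006_158_000036'."""
    # Assumption: fields are separated by '_' and the year is the second one.
    return sample_id.split("_")[1]

def split_by_year(sample_ids, train_years, test_years):
    """Partition sample IDs into (train, test) lists by year."""
    train = [s for s in sample_ids if year_of(s) in train_years]
    test = [s for s in sample_ids if year_of(s) in test_years]
    return train, test

ids = ["IFCB1_2006_158_000036", "IFCB1_2007_010_123456",   # second ID is made up
       "IFCB1_2008_200_000001", "IFCB5_2014_353_205141"]   # third ID is made up
train_ids, test_ids = split_by_year(ids, {"2006", "2007"}, {"2008"})
print(train_ids)
print(test_ids)
```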

## Data filtering
We remove the examples of four classes that appear in the test set but not in the training set (Odontella, Hemiaulus, Gonyaulax and Stephanopyxis) by reassigning them to mix.

In [3]:
from tqdm import tqdm
import numpy as np

tqdm.pandas()

classcolumn = "AutoClass" #Autoclass means 51 classes
yearstraining = ['2006','2007'] #Years to consider as training
yearsvalidation = ['2008'] #Years to consider as test

samplestraining = list(samples[samples['year'].isin(yearstraining)].index) #Samples to consider for training
samplesvalidation = list(samples[samples['year'].isin(yearsvalidation)].index) #Samples to consider for testing

classes=np.unique(data[classcolumn]) #np.unique already returns the values sorted

#Classes that appear in the test years but not in the training years
cls_to_delete=['Odontella','Hemiaulus','Gonyaulax','Stephanopyxis']
classes=classes[~np.isin(classes,cls_to_delete)]
print(classes)

#Check data by year
print(pd.crosstab(index=data['year'],columns='count'))

['Asterionellopsis' 'Cerataulina' 'Ceratium' 'Chaetoceros' 'Corethron'
 'Coscinodiscus' 'Cylindrotheca' 'DactFragCerataul' 'Dactyliosolen'
 'Dictyocha' 'Dinobryon' 'Dinophysis' 'Ditylum' 'Ephemera' 'Eucampia'
 'Euglena' 'Guinardia' 'Guinardia_flaccida' 'Guinardia_striata'
 'Gyrodinium' 'Laboea' 'Lauderia' 'Leptocylindrus' 'Licmophora'
 'Myrionecta' 'Paralia' 'Phaeocystis' 'Pleurosigma' 'Prorocentrum'
 'Pseudonitzschia' 'Pyramimonas' 'Rhizosolenia' 'Skeletonema'
 'Thalassionema' 'Thalassiosira' 'Thalassiosira_dirty' 'bad' 'ciliate_mix'
 'clusterflagellate' 'detritus' 'dino30' 'kiteflagellates' 'mix'
 'mix_elongated' 'na' 'pennate' 'tintinnid']
col_0   count
year         
2006   131002
2007   273080
2008   427308
2009   732398
2010   327996
2011   419692
2012   394766
2013   422255
2014   329322
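
The cell above only drops the four names from `classes`. The relabelling to `mix` presumably happens inside `H5IFCBDataset`, whose code is not shown in this notebook. A minimal sketch of what such a remapping could look like follows. `relabel_unknown` is a hypothetical helper, not the dataset's real method.

```python
def relabel_unknown(labels, known_classes, fallback="mix"):
    """Replace any label not in known_classes with the fallback label."""
    known = set(known_classes)
    if fallback not in known:
        raise ValueError(f"fallback class {fallback!r} must be a known class")
    return [lab if lab in known else fallback for lab in labels]

known_classes = ["Ceratium", "Laboea", "mix"]     # tiny stand-in for `classes`
labels = ["Ceratium", "Odontella", "mix", "Hemiaulus", "Laboea"]
print(relabel_unknown(labels, known_classes))
```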


## Training configuration
We configure the training and the output files. Note that this is where we set how many examples are used for training.

In [4]:
import torch,torchvision
import random
import numpy as np

print("Using pytorch {}".format(torch.__version__))

torch.manual_seed(0) #Reproducible
random.seed(0) #it seems that the transforms use this random generator
np.random.seed(0)

gpus = [0,1] #gpus to use
num_gpus = len(gpus)

num_workers = 6*num_gpus
batch_size = 512 #512 for resnet 18, and 34 (with 2 gpus). 256 for resnet 50
batch_size_val = 512

print("Num workers {}. Batch size training {}. Batch size validation {}".format(num_workers,batch_size,batch_size_val))

num_epochs = 50 # @param

#Subsample data
train_size = 0.1

model_save_path_fe="model18_fe_0_1.pt" #Where to save the model once trained
model_save_path_ft="model18_ft_0_1.pt" #Where to save the model once trained

hdf5_files_path = '/media/nas/pgonzalez/IFCB_HDF5/output/' #Directory with the dataset

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print("Using %s"%device)

Using pytorch 1.7.1
Num workers 12. Batch size training 512. Batch size validation 512
Using cuda:0
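
Since the second goal is to repeat the run with different amounts of data, it may help to derive the checkpoint names from `train_size` instead of editing them by hand. The sketch below assumes the pattern `model18_fe_0_1.pt` generalises by replacing the decimal point with an underscore. Both the `model_path` helper and the list of fractions are hypothetical.

```python
def model_path(mode: str, train_size: float, arch: str = "model18") -> str:
    """Build a checkpoint name such as 'model18_fe_0_1.pt'."""
    if mode not in ("fe", "ft"):
        raise ValueError("mode must be 'fe' or 'ft'")
    size_tag = str(train_size).replace(".", "_")
    return f"{arch}_{mode}_{size_tag}.pt"

for size in (0.01, 0.1, 0.5):          # hypothetical fractions to compare
    print(size, model_path("fe", size), model_path("ft", size))
```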


In [5]:
import torchvision.transforms as T
from h5ifcbdataset import H5IFCBDataset
from torch.utils.data import DataLoader
from pathlib import Path
from sklearn.model_selection import train_test_split

#Define transformations
train_transform = T.Compose([
  T.Resize(size=256),
  T.RandomResizedCrop(size=224),
  T.RandomHorizontalFlip(),
  T.ToTensor()
])

val_transform = T.Compose([
  T.Resize(size=256),
  T.CenterCrop(size=224),
  T.ToTensor()
])

#files to load
files = [hdf5_files_path+s+'.hdf5' for s in samplestraining]
#Define data loader

train_dset = H5IFCBDataset(files,classes,classattribute="AutoClass",verbose=1,trainingset=True,transform=train_transform)


Loading samples: 100%|██████████| 164/164 [02:21<00:00,  1.16it/s]


In [6]:
from sklearn.model_selection import train_test_split

indexes_train,_ = train_test_split(list(range(len(train_dset))),train_size=train_size,stratify=train_dset.targets,random_state=0)
train_subset = torch.utils.data.Subset(train_dset, indexes_train)
train_loader = DataLoader(train_subset,batch_size=batch_size,num_workers=num_workers,shuffle=True,pin_memory=True)
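
Passing `stratify=train_dset.targets` keeps the class proportions of the subset close to those of the full training set. The pure-Python sketch below illustrates the idea of proportional per-class sampling. It is not scikit-learn's algorithm: it differs in details such as always keeping at least one example per class, and `stratified_indices` is a hypothetical name.

```python
import random
from collections import defaultdict

def stratified_indices(targets, fraction, seed=0):
    """Pick about `fraction` of the indices from every class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(targets):
        by_class[label].append(idx)

    rng = random.Random(seed)
    chosen = []
    for label, idxs in by_class.items():
        k = max(1, round(len(idxs) * fraction))   # keep at least one example per class
        chosen.extend(rng.sample(idxs, k))
    return sorted(chosen)

# Made-up label distribution, only for illustration.
targets = ["mix"] * 80 + ["detritus"] * 15 + ["Laboea"] * 5
subset = stratified_indices(targets, fraction=0.2)
picked = [targets[i] for i in subset]
print(len(subset), picked.count("mix"), picked.count("detritus"), picked.count("Laboea"))
```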

In [7]:
import torch.nn as nn

def load_network(model_save_path):
    base_model = torchvision.models.resnet18(pretrained=True) #From which model to start
    
    model = base_model
    print("Adjusting the CNN for %s classes" % len(classes))
    model.fc = nn.Linear(model.fc.in_features, len(classes))
    
    print(model)

    print("Let's use", len(gpus), "GPUs!")
    model = nn.DataParallel(model,device_ids=gpus)

    #Define loss function
    loss_fn = nn.CrossEntropyLoss()
    if os.path.isfile(model_save_path):
        model.load_state_dict(torch.load(model_save_path))
        is_trained=True
    else:
        is_trained=False
    
    model = model.to(device) #Send model to gpu
    return model,loss_fn,is_trained
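
Because the model is wrapped in `nn.DataParallel` before its `state_dict()` is saved, every key in the checkpoint starts with `module.`. `load_network` reloads into an identically wrapped model, so this works as written. Loading the same file into an unwrapped model would need the prefix removed first. A minimal sketch with a plain dictionary standing in for the checkpoint follows. `strip_module_prefix` is a hypothetical helper.

```python
def strip_module_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with a leading `prefix` removed from its keys."""
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }

# Stand-in for torch.load(...): real values would be tensors.
saved = {"module.conv1.weight": "w0", "module.fc.weight": "w1", "module.fc.bias": "b1"}
print(list(strip_module_prefix(saved)))
```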

In [8]:
import time
import torch.nn.functional as nnf


def run_epoch(model, loss_fn, loader, optimizer, device):
    """
    Train the model for one epoch.
    """
    loss_epoch = 0 
    start_time = time.time()
    # Set the model to training mode
    model.train()
    for step, (x, y, _) in enumerate(loader):
        optimizer.zero_grad()
        x = x.to(device)
        y = y.to(device)

        # Run the model forward to compute scores and loss.
        scores = model(x)
        loss = loss_fn(scores, y)
        loss_epoch = loss_epoch + loss.item()
        # Run the model backward and take a step using the optimizer.

        loss.backward()
        optimizer.step()

        if step % 50 == 0:
            spent = time.time()-start_time
            batches_done = 1 if step == 0 else 50 #only one batch has run at step 0
            print(f"Step [{step}/{len(loader)}]\t Loss: {loss.item()} \t Time: {spent} secs [{(batch_size*batches_done)/spent} examples/sec]")
            start_time = time.time()

    return loss_epoch

def make_preds(model, loader, device):
    """
    Run the model over a loader and return the true labels, the predicted
    labels, the class probabilities and the sample identifiers.
    """
    with torch.no_grad():
        # Set the model to eval mode
        model.eval()
        y_true = []
        y_pred = []
        y_probs = []
        sample = []
        for x, y, s in loader: #The idea is that the dataloader can give me the sample of the image so we can return it
            x = x.to(device)
            y = y.to(device)
            # Run the model forward, and compare the argmax score with the ground-truth
            # category.
            output = model(x)
            predicted = output.argmax(1)
            prob = nnf.softmax(output, dim=1)
            y_probs.extend(prob.cpu().detach().numpy())
            y_true.extend(y.cpu().numpy())
            y_pred.extend(predicted.cpu().numpy())
            sample.extend(s)
    return y_true,y_pred,y_probs,sample
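
`make_preds` takes the argmax of the raw scores and, separately, turns the same scores into probabilities with softmax. Because softmax preserves the ordering of the scores, both views agree on the predicted class. The pure-Python sketch below illustrates this on made-up scores.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value (the first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)

scores = [1.2, -0.3, 3.1, 0.0]      # made-up raw outputs for four classes
probs = softmax(scores)
print(argmax(scores), argmax(probs), round(sum(probs), 6))
```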

## Training process

The important parameter here is `only_fe_layer`. When it is true, only the weights of the last layer are updated. When it is false, the whole network is retrained.

In [9]:
def finetune(model,loss_fn,train_loader,only_fe_layer,model_save_path,device):
    if only_fe_layer:
        for param in model.parameters():
            param.requires_grad = False
        for param in model.module.fc.parameters():
            param.requires_grad = True
        optimizer = torch.optim.Adam(model.module.fc.parameters(), lr=1e-4)
    else:
        for param in model.parameters():
            param.requires_grad = True
        optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

    for epoch in range(num_epochs):
        # Run an epoch over the training data.
        print('Starting epoch %d / %d' % (epoch + 1,num_epochs))
        loss_epoch = run_epoch(model, loss_fn, train_loader, optimizer, device)

        print(f"Epoch [{epoch+1}/{num_epochs}]\t Loss: {loss_epoch / len(train_loader)}")
        
    #Save model in this point
    #TODO
    torch.save(model.state_dict(), model_save_path)
    print("Fine tune done and model saved.")
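
One detail worth keeping in mind: `run_epoch` calls `model.train()`, so even with `only_fe_layer=True` the BatchNorm layers keep updating their running mean and variance, although their weights are frozen. If a strictly fixed feature extractor is wanted, the BatchNorm layers can be switched back to eval mode after `model.train()`. The sketch below shows this on a tiny stand-in network rather than on ResNet-18. `freeze_backbone` and `set_batchnorm_eval` are hypothetical helpers that are not part of this notebook.

```python
import torch
import torch.nn as nn

def freeze_backbone(model: nn.Module, head: nn.Module) -> None:
    """Freeze every parameter except those of `head`."""
    for p in model.parameters():
        p.requires_grad = False
    for p in head.parameters():
        p.requires_grad = True

def set_batchnorm_eval(model: nn.Module) -> None:
    """Stop BatchNorm layers from updating their running statistics."""
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            m.eval()

# Tiny stand-in for ResNet-18: conv + BN backbone followed by a linear head.
backbone = nn.Sequential(nn.Conv2d(3, 4, 3), nn.BatchNorm2d(4), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten())
head = nn.Linear(4, 47)
model = nn.Sequential(backbone, head)

freeze_backbone(model, head)
model.train()
set_batchnorm_eval(model)      # must come after model.train()

bn = backbone[1]
before = bn.running_mean.clone()
model(torch.randn(8, 3, 16, 16))
stats_unchanged = torch.equal(before, bn.running_mean)
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(stats_unchanged, trainable)
```

In `run_epoch`, the equivalent call would go right after `model.train()`. Whether this changes the FE accuracy on this dataset has not been tested here.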

## Experiment

First we train the network as FE and then as FT, and show the results.

In [10]:
from sklearn.metrics import classification_report,accuracy_score
import threading

print('Starting process...')
print('Training model FE')

model_fe,loss_fn,is_trained = load_network(model_save_path_fe)

if not is_trained:
    finetune(model_fe,loss_fn,train_loader,True,model_save_path_fe,device)
else:
    print("Model FE was trained already")
    
print('Training model FT')
model_ft,loss_fn,is_trained = load_network(model_save_path_ft)

if not is_trained:
    finetune(model_ft,loss_fn,train_loader,False,model_save_path_ft,device)
else:
    print("Model FT was trained already")

Starting process...
Training model FE
Adjusting the CNN for 47 classes
ResNet(
  (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (1): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, aff

In [11]:
from sklearn.metrics import accuracy_score

files_test = [hdf5_files_path+s+'.hdf5' for s in samplesvalidation]
test_dset = H5IFCBDataset(files_test,classes,classattribute="AutoClass",verbose=1,trainingset=False,transform=val_transform)
test_loader = DataLoader(test_dset,batch_size=batch_size_val,num_workers=num_workers,shuffle=False,pin_memory=True)

y_true,y_pred,_,_ = make_preds(model_fe, test_loader, device)
print(accuracy_score(y_true, y_pred))

y_true,y_pred,_,_ = make_preds(model_ft, test_loader, device)
print(accuracy_score(y_true, y_pred))

Loading samples: 100%|██████████| 122/122 [02:23<00:00,  1.18s/it]


0.8486220711992286
0.875965345839535
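
The two values above are overall accuracies, printed for FE first and then for FT. When one class is much larger than the rest, as `mix` appears to be in the rows printed earlier, overall accuracy can hide poor results on rare classes. The cell imports `classification_report`, which would give a per-class view. The pure-Python sketch below shows the underlying idea on made-up labels, not on this notebook's predictions.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions equal to the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_recall(y_true, y_pred):
    """Unweighted mean of per-class recall."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / n for c, n in support.items()) / len(support)

# Made-up labels: 8 'mix' examples and 2 'Laboea' examples.
y_true = ["mix"] * 8 + ["Laboea"] * 2
y_pred = ["mix"] * 10                      # a model that always answers 'mix'
print(accuracy(y_true, y_pred), macro_recall(y_true, y_pred))
```

Reporting a macro-averaged metric next to accuracy for both `model_fe` and `model_ft` would show whether the gap between them is concentrated in the rare classes.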
