## Evaluate the trained network (Week 12) - Step 4

#### **Designed by Joon Son Chung, November 2020**

This is based on https://github.com/joonson/face_trainer. You should read the code if you want to understand more about the training details.

In this step, we train the network on the generated dataset.

The baseline model, available from [here](http://www.robots.ox.ac.uk/~joon/data/res18_vggface1_baseline.model), is trained on the VGGFace1 dataset using the softmax loss.

The training and validation sets (`train_set.zip` and `val_set.zip`) should also be in the experiment folder.

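`save_path` should be changed every time you run a new experiment. Otherwise the new run appends its log to the existing `scores.txt` and overwrites model snapshots saved for the same epoch.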
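
One way to avoid reusing a folder by accident is to build `save_path` from a short tag plus a timestamp. The sketch below is only a suggestion: the helper `make_save_path` and the tag `softmax_bs20` are hypothetical names, not part of the trainer.

```python
import os
from datetime import datetime

def make_save_path(data_dir, tag, now=None):
    """Return an experiment folder path of the form <data_dir>/<tag>_<YYYYmmdd_HHMMSS>."""
    now = now or datetime.now()
    stamp = now.strftime("%Y%m%d_%H%M%S")
    return os.path.join(data_dir, "%s_%s" % (tag, stamp))

# Illustration with a fixed time so the result is reproducible;
# in the notebook you would omit `now` and pass your own data_dir.
example = make_save_path("/content/drive/My Drive/MLVU/your_dataset",
                         "softmax_bs20",
                         now=datetime(2020, 11, 23, 14, 15, 0))
print(example)
```

Putting the loss function and batch size in the tag also makes it easier to match folders to your notes later.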


**The models take up a significant amount of disk space. Make sure that you have enough space on your Google Drive, and delete any unnecessary or unsuccessful experiments.**
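
To decide which experiments to delete, it can help to see how much space each experiment folder uses. The following is a minimal sketch that assumes every experiment lives in its own sub-folder of the experiment directory; the helper names `folder_size_bytes` and `report_experiment_sizes` are illustrative.

```python
import os

def folder_size_bytes(path):
    """Sum the sizes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.isfile(full):
                total += os.path.getsize(full)
    return total

def report_experiment_sizes(data_dir):
    """Return (folder name, size in MB) pairs for each sub-folder, largest first."""
    sizes = []
    for name in sorted(os.listdir(data_dir)):
        full = os.path.join(data_dir, name)
        if os.path.isdir(full):
            sizes.append((name, folder_size_bytes(full) / 1e6))
    return sorted(sizes, key=lambda pair: pair[1], reverse=True)

# Usage in the notebook, once data_dir has been defined:
# for name, megabytes in report_experiment_sizes(data_dir):
#     print("%-30s %10.1f MB" % (name, megabytes))
```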



In [None]:
from google.colab import drive
from zipfile import ZipFile
from tqdm import tqdm
import os, glob, sys, shutil, time
import torch
import torchvision.transforms as transforms
import torchvision.models as models
from PIL import Image

# mount Google Drive
drive.mount('/content/drive', force_remount=True)

# path of the data directory relative to the home folder of Google Drive
GDRIVE_HOME = '/content/drive/My Drive'
FOLDER      = 'MLVU/your_dataset'

# The following 5 lines are the only parts of the code that you need to change. You can simply run the rest.
data_dir      = os.path.join(GDRIVE_HOME,FOLDER) ## path of the general experiment
initial_model = os.path.join(GDRIVE_HOME,'MLVU/res18_vggface1_baseline.model') ## path to the pre-trained model
train_zip     = os.path.join(data_dir,'train_set.zip') ## training data as zip
val_zip       = os.path.join(data_dir,'val_set.zip') ## validation data as zip
save_path     = os.path.join(data_dir,'experiment_v1') ## training logs and trained models will be saved here

# Extract the cropped images
with ZipFile(train_zip, 'r') as zipObj:
  zipObj.extractall("/train_set")
with ZipFile(val_zip, 'r') as zipObj:
  zipObj.extractall("/val_set")
print('Zip extraction complete')

Make sure that the files have been extracted properly: the cell below prints the number of training and validation images found, and both counts should be reasonable for your dataset.

In [None]:
train_files = glob.glob('/train_set/*/*.jpg')
val_files   = glob.glob('/val_set/*/*.jpg')
print(len(train_files),'train set files and',len(val_files),'validation set files found.')
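
The total count can hide identities with very few images. A slightly finer check, sketched below, counts images per identity folder, assuming the same `<root>/<identity>/<image>.jpg` layout that the glob pattern above relies on. The helpers `images_per_class` and `classes_below` are illustrative names.

```python
import glob
import os
from collections import Counter

def images_per_class(root, ext="jpg"):
    """Count images in each identity sub-folder of root."""
    files = glob.glob(os.path.join(root, "*", "*." + ext))
    return Counter(os.path.basename(os.path.dirname(f)) for f in files)

def classes_below(counts, minimum):
    """Return the identities that have fewer than `minimum` images, sorted by name."""
    return sorted(name for name, n in counts.items() if n < minimum)

# Usage in the notebook:
# counts = images_per_class('/train_set')
# print(len(counts), 'identities,', sum(counts.values()), 'images')
# print('Identities with fewer than 2 images:', classes_below(counts, 2))
```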

First, clone the face recognition trainer from GitHub and add it to the Python path.

In [None]:
!rm -rf face_trainer
!git clone https://github.com/joonson/face_trainer.git

sys.path.append('face_trainer')

The training script is below. Please do not change it, but try to read and understand it.

In [None]:
import datetime
from utils import *
from EmbedNet import *
from DatasetLoader import get_data_loader
import torchvision.transforms as transforms

# ## ===== ===== ===== ===== ===== ===== ===== =====
# ## Trainer script
# ## ===== ===== ===== ===== ===== ===== ===== =====

def train_network(args):

    ## Make folders to save the results and models
    args.model_save_path     = args.save_path+"/model"
    args.result_save_path    = args.save_path+"/result"

    if not(os.path.exists(args.model_save_path)):
        os.makedirs(args.model_save_path)
            
    if not(os.path.exists(args.result_save_path)):
        os.makedirs(args.result_save_path)

    ## Load models
    s = EmbedNet(**vars(args)).cuda();

    ## Write args to scorefile
    scorefile = open(args.result_save_path+"/scores.txt", "a+");

    strtime = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    scorefile.write('%s\n%s\n'%(strtime,args))
    scorefile.flush()

    ## Input transformations for training
    train_transform = transforms.Compose(
        [transforms.ToTensor(),
         transforms.Resize(256),
         transforms.RandomCrop([224,224]),
         transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])])

    ## Input transformations for evaluation
    test_transform = transforms.Compose(
        [transforms.ToTensor(),
         transforms.Resize(256),
         transforms.CenterCrop([224,224]),
         transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])])

    ## Initialise trainer and data loader
    trainLoader = get_data_loader(transform=train_transform, **vars(args));
    trainer     = ModelTrainer(s, **vars(args))
    
    ## If initial model is specified, start from that model
    if(args.initial_model != ""):
        trainer.loadParameters(args.initial_model);
        print("Model %s loaded!"%args.initial_model);

    besteer   = 100
    bestmodel = ''

    ## Core training script
    for it in range(1,args.max_epoch+1):

        clr = [x['lr'] for x in trainer.__optimizer__.param_groups]

        print("Training epoch %d with LR %f "%(it,max(clr)));

        loss, traineer = trainer.train_network(trainLoader, verbose=True);

        if it % args.test_interval == 0:

            snapshot = args.model_save_path+"/model%09d.model"%it
            
            sc, lab = trainer.evaluateFromList(transform=test_transform, **vars(args))
            result = tuneThresholdfromScore(sc, lab, [1, 0.1]);

            print("IT %d, VEER %2.4f"%(it, result[1]));
            scorefile.write("IT %d, VEER %2.4f\n"%(it, result[1]));

            trainer.saveParameters(snapshot);

            if result[1] < besteer:
                besteer   = result[1]
                bestmodel = snapshot

        print("TEER/TAcc %2.2f, TLOSS %f"%( traineer, loss));
        scorefile.write("IT %d, TEER/TAcc %2.2f, TLOSS %f\n"%(it, traineer, loss));

        scorefile.flush()

    scorefile.close();

    print('Best validation EER: %2.4f, %s'%(besteer,bestmodel))

Specify the input arguments to the trainer below, and run to train the network. The training loss and the validation equal error rate (printed as `VEER`) will be shown below, and also saved to `scores.txt` under `save_path`.
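
The `VEER` value reported by the training script is an equal error rate: the error at the threshold where the false accept rate and the false reject rate are (approximately) equal. The sketch below illustrates the idea in plain Python on toy scores. It is a simplified stand-in, not the repository's `tuneThresholdfromScore`, and it assumes that a higher score means the pair is more likely to be the same identity.

```python
def equal_error_rate(scores, labels):
    """Return the EER in percent; labels are 1 for same-identity pairs, 0 otherwise."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative pair")

    best_gap, eer = None, None
    # Try each observed score as the threshold (a pair is accepted if score >= threshold)
    for threshold in sorted(set(scores)):
        false_accepts = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
        false_rejects = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
        far = false_accepts / n_neg
        frr = false_rejects / n_pos
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, 100.0 * max(far, frr)
    return eer

# Toy example: one negative pair scores higher than one positive pair
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 1, 0, 0, 0]
print(equal_error_rate(toy_scores, toy_labels))  # prints 25.0
```

Lower is better: an EER of 0 means some threshold separates all same-identity pairs from all different-identity pairs.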

See [here](https://github.com/joonson/face_trainer/blob/6fd64d96a1195c18689b1755853eea2082091819/trainEmbedNet.py#L18-L65) for the meaning of each of these arguments, and see [here](https://github.com/joonson/face_trainer/blob/main/README.md#implemented-loss-functions) for the list of available loss functions. If you use meta-learning loss functions, `nPerClass` must be 2 or more.

Note that the trainer includes a script to make sure that there are only `nPerClass` images per class per mini-batch. This helps with the meta-learning loss functions.
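
To make the idea concrete, here is a minimal sketch of a batching scheme in which every class that appears in a mini-batch contributes exactly `n_per_class` samples and no class appears twice. This is an illustration only, not the sampler used in the repository; the function `make_batches` and its parameter `classes_per_batch` are hypothetical.

```python
import random
from collections import defaultdict

def make_batches(labels, n_per_class, classes_per_batch, seed=0):
    """Return batches; each batch is a list of index groups, one group per class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    # Split each class into groups of exactly n_per_class indices (leftovers are dropped)
    groups = []
    for label, indices in by_class.items():
        rng.shuffle(indices)
        for start in range(0, len(indices) - n_per_class + 1, n_per_class):
            groups.append((label, indices[start:start + n_per_class]))
    rng.shuffle(groups)

    # Greedily fill batches, deferring a group if its class is already in the batch
    batches, pending = [], groups
    while pending:
        batch, seen, leftover = [], set(), []
        for label, group in pending:
            if label not in seen and len(batch) < classes_per_batch:
                batch.append(group)
                seen.add(label)
            else:
                leftover.append((label, group))
        if len(batch) < classes_per_batch:
            break  # drop the final incomplete batch
        batches.append(batch)
        pending = leftover
    return batches

demo_labels = ["a"] * 4 + ["b"] * 4 + ["c"] * 4
demo_batches = make_batches(demo_labels, n_per_class=2, classes_per_batch=2)
print(demo_batches)
```

With `n_per_class=2`, each class in a batch supplies a pair of images, which is what a meta-learning loss needs to form a support and a query example.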


In [None]:
import easydict 
args = easydict.EasyDict({ "batch_size": 20, # batch size
                          "trainfunc": "softmax", # loss function
                          "lr": 0.001, # learning rate
                          "lr_decay": 0.90, # how much to decrease the learning rate, every 5 epochs
                          "weight_decay": 0, # regularization to reduce overfitting (e.g. 1e-4 might be reasonable)
                          "margin": 0.1, # for AM-softmax and AAM-softmax
                          "scale": 30, # for AM-softmax and AAM-softmax
                          "nPerClass": 1, # support set + query set size for meta-learning
                          "nClasses": 1000, # number of identities in the training dataset
                          # Don't change below here!!
                          "save_path": save_path,
                          "max_img_per_cls": 500, 
                          "nDataLoaderThread": 5, 
                          "test_interval": 3, 
                          "max_epoch": 30, 
                          "optimizer": "adam",
                          "scheduler": "steplr",
                          "hard_prob": 0.5,
                          "hard_rank": 10,
                          "initial_model": initial_model,
                          "train_path": "/train_set",
                          "train_ext": "jpg",
                          "test_path": "/val_set",
                          "test_list": "/val_set/test_list.csv",
                          "model": "ResNet18",
                          "nOut": 512,
                          "mixedprec": False})
        
train_network(args)
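
The comment on `lr_decay` in the cell above says that the learning rate is reduced every 5 epochs. The sketch below shows what that would imply over 30 epochs, assuming a plain step decay. The helper `step_lr` and its `step_size` default are illustrative; the actual schedule is set by the `steplr` scheduler in the repository and may step at a different interval.

```python
def step_lr(initial_lr, lr_decay, epoch, step_size=5):
    """Learning rate at a 1-indexed epoch when it is multiplied by lr_decay every step_size epochs."""
    return initial_lr * lr_decay ** ((epoch - 1) // step_size)

# Values taken from the argument cell: lr=0.001, lr_decay=0.90, max_epoch=30
for epoch in (1, 5, 6, 15, 30):
    print("epoch %2d: lr = %.6f" % (epoch, step_lr(0.001, 0.90, epoch)))
```

Under this assumption the learning rate only falls to roughly 60% of its initial value by the last epoch, so a smaller `lr_decay` is one thing to try if the training loss plateaus early.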

Try various experiments and record the results.


| # | Params                   | Val EER | Notes |
|---|--------------------------|---------|-------|
| 1 | softmax / batch_size 200 |         |       |
| 2 |                          |         |       |
| 3 |                          |         |       |
| 4 |                          |         |       |
| 5 |                          |         |       |
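
To fill in the `Val EER` column, you can read the best value back from the log instead of scrolling through the output. The sketch below parses lines in the `IT %d, VEER %2.4f` format that the training script writes to `scores.txt`. The helper `best_validation_eer` is an illustrative name, and the sample lines are made up to show the format, not real results.

```python
import re

# Matches the validation lines written by the training script, e.g. "IT 3, VEER 12.3456"
VEER_LINE = re.compile(r"^IT (\d+), VEER ([0-9.]+)")

def best_validation_eer(lines):
    """Return (epoch, eer) for the lowest validation EER, or None if no VEER lines exist."""
    results = []
    for line in lines:
        match = VEER_LINE.match(line.strip())
        if match:
            results.append((int(match.group(1)), float(match.group(2))))
    return min(results, key=lambda r: r[1]) if results else None

# Made-up sample in the score file's format (not real results)
sample_lines = [
    "2020-11-23 14:15:00",
    "IT 1, TEER/TAcc 10.00, TLOSS 5.000000",
    "IT 3, VEER 20.5000",
    "IT 3, TEER/TAcc 30.00, TLOSS 3.000000",
    "IT 6, VEER 15.2500",
    "IT 9, VEER 16.0000",
]
print(best_validation_eer(sample_lines))  # prints (6, 15.25)

# Usage in the notebook:
# with open(os.path.join(save_path, 'result', 'scores.txt')) as f:
#     print(best_validation_eer(f))
```

Because the script opens `scores.txt` in append mode, a folder that has been reused will contain lines from several runs, which is another reason to give each experiment its own `save_path`.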