# Decision Process of Network Hyper-Parameters
## Prepared by Furkan Küçük for DataBoss Analytics Job Application

The decision process behind a machine learning pipeline can get far from trivial. In this notebook, I will try to explain my general approach to hyper-parameter tuning.

First of all, the hyper-parameter decision process might take some time. This may be due to a shortage of resources (such as low-spec hardware) or of time. Computer vision tasks in particular may take a while, since their datasets can be challenging, cluttered and hard to label. Besides, since computer vision tasks are relatively complicated, they may need more complex deep learning architectures.

Important note: This task is being done on relatively low-spec hardware (Google Colab). Hence, one may need to tune hyper-parameters with previously acquired insights. The experiments will be run with a light network architecture, under the assumption that the parameters found will apply under all conditions. This is not an accurate assumption, since different hyper-parameters may perform better under different conditions; nevertheless, these experiments can provide a good starting point for further hyper-parameter tuning.

In [1]:
# imports
import math
import os
import torch
from torch import nn, optim
from torch.optim.lr_scheduler import _LRScheduler
from torch.utils.data import DataLoader
from torchvision import transforms, datasets
import torch.nn.functional as F
from dataset_utils import alternativeSeperation
from model_utils import ModelEngine



Before beginning any machine learning task, one has to prepare the project-specific data. The Tiny ImageNet dataset comes with 3 folders: one for training, one for validation and one for testing. The training folder is well structured for PyTorch's prebuilt ImageFolder dataset handler. However, the validation folder holds all images in a single folder, with their labels in a text file. I implemented 2 alternatives for dealing with this.
- (Alternative) The first alternative is an extension of PyTorch's "Dataset" class (torch.utils.data.Dataset) that fetches images according to the text file. This alternative introduces some overhead, since it labels the images as the training goes. Further explanation can be found in the class doc.
- (Preferred) The second alternative is a function that separates the images into class folders created for that purpose. With this function, label information no longer needs to be held in RAM. It also enables the user to use PyTorch's "ImageFolder" implementation.

In [2]:
#alternativeSeperation()
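
The preferred alternative can be sketched roughly as follows. This is not the actual `alternativeSeperation` implementation from `dataset_utils`, only a minimal illustration; the function name `separate_val_images` is hypothetical, and the layout it assumes (an `images` sub-folder plus a tab-separated `val_annotations.txt` whose first two columns are file name and class id) is that of the standard Tiny ImageNet archive.

```python
import shutil
from pathlib import Path


def separate_val_images(val_dir, annotations_name="val_annotations.txt"):
    """Move validation images into one sub-folder per class label.

    Hypothetical helper: assumes `val_dir/images/*` holds the images and the
    annotation file lists "<file name>\t<class id>\t..." on every line.
    Returns the number of images that were moved.
    """
    val_dir = Path(val_dir)
    images_dir = val_dir / "images"
    moved = 0
    with open(val_dir / annotations_name) as annotations:
        for line in annotations:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            file_name, class_id = fields[0], fields[1]
            source = images_dir / file_name
            if not source.exists():
                continue  # already moved, or missing from the archive
            class_dir = val_dir / class_id
            class_dir.mkdir(exist_ok=True)
            shutil.move(str(source), str(class_dir / file_name))
            moved += 1
    # An empty "images" folder would otherwise look like an extra class.
    if images_dir.is_dir() and not any(images_dir.iterdir()):
        images_dir.rmdir()
    return moved
```

After such a step the validation folder has the same one-folder-per-class layout as the training folder, which is what `ImageFolder` expects.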

Definition: A loss landscape is the surface obtained by viewing the loss as a function of the model parameters; each parameter adds one dimension. For more information please take a look at [1].

In computer vision tasks, most loss landscapes are not convex. This property makes it hard to find the global optimum. There are 3 main elements that affect the loss landscape severely: the loss function, the data distribution and the model architecture (other hyper-parameters also affect it). In computer vision, the data distribution can be adjusted slightly to lead the machine learning pipeline toward the global optimum (or at least a good local optimum).

- The first adjustment should be channel normalization. Normalization is an important step toward an easier loss landscape. In machine learning projects, each feature may have its own distribution. Numerically, however, a distribution that differs significantly from the others (e.g. one with much larger values) may distort the loss landscape, shrinking gradients along one dimension while enlarging them along another. (In other words, the optimizer may treat one feature as more important than another.) Channel normalization helps to prevent that issue. It is done by statistically analyzing the channel distribution of the dataset.

- Data augmentation can be a nice-to-have tool for exploring the loss landscape. A dataset can only represent a portion of the real distribution. With data augmentation, one can slightly increase the coverage of that representation and in some ways enhance it. Some of the benefits of data augmentation are:
    - It may prevent overfitting (in some cases) by varying translation, rotation, angle etc. within each class.
    - It may balance an unbalanced dataset, so that the bias of the model decreases.
    - The increased coverage may lead to better generalization.
    
    Some ways to augment image data are:
   
   
    - Random cropping-resizing-rotation
    - Horizontal and vertical flips
    - Random erasure
    - Hue and color shifts
    - Shearing, tilting and skewing
    - Adding jitter and random noise
    - Random distortions
    
However, data augmentation merely replicates an image with some distortion and noise, and overdoing it may harm generalization. In this project, only a small amount of augmentation will be used.

In this part, we will find out which augmentation performs better for our case.
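
To make the channel normalization step described above concrete: it subtracts a per-channel mean and divides by a per-channel standard deviation, which is also what `transforms.Normalize` computes. The notation below is introduced here for illustration and is not from the original text: $x_{n,c,h,w}$ is the value of channel $c$ at pixel $(h, w)$ of image $n$, for $N$ images of size $H \times W$.

$$\mu_c = \frac{1}{NHW} \sum_{n,h,w} x_{n,c,h,w}, \qquad \sigma_c = \sqrt{\frac{1}{NHW} \sum_{n,h,w} \left(x_{n,c,h,w} - \mu_c\right)^2}, \qquad \hat{x}_{n,c,h,w} = \frac{x_{n,c,h,w} - \mu_c}{\sigma_c}$$

The `mean` and `std` values passed to `transforms.Normalize` in this notebook are the widely used ImageNet statistics; statistics computed on the Tiny ImageNet training set itself would be a natural alternative.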

[1] https://arxiv.org/pdf/1712.09913.pdf

In [3]:
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])
transformations_simple = {
    'train': transforms.Compose([
        transforms.CenterCrop(57),
        transforms.ToTensor(),
        normalize
    ]),
    'val': transforms.Compose([
        transforms.CenterCrop(57),
        transforms.ToTensor(),
        normalize
    ]),
    'test': transforms.Compose([
        transforms.CenterCrop(57),
        transforms.ToTensor(),
        normalize
    ])
}

data_dir = "data"

image_datasets_simple = {x: datasets.ImageFolder(os.path.join(data_dir, x),
                                          transformations_simple[x])
                  for x in ['train', 'val']}
dataloaders_simple = {x: DataLoader(image_datasets_simple[x], batch_size=32,
                             shuffle=True)
               for x in ['train', 'val']}

In [24]:
transformations = {
    'train': transforms.Compose([
        transforms.RandomCrop(57),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
#         normalize
    ]),
    'val': transforms.Compose([
        transforms.Resize(57),
        transforms.ToTensor(),
#         normalize
    ]),
    'test': transforms.Compose([
        transforms.Resize(57),
        transforms.ToTensor(),
#         normalize
    ])
}

image_datasets = {x: datasets.ImageFolder(os.path.join(data_dir, x),
                                          transformations[x])
                  for x in ['train', 'val']}
dataloaders = {x: DataLoader(image_datasets[x], batch_size=32,
                             shuffle=True)
               for x in ['train', 'val']}

One may need to create a network architecture to experiment on. As mentioned before, in this notebook, a single simple network will be created to evaluate some hyper-parameters. Due to limited hardware, optimization for some hyper-parameters will be done with this network. This approach is not accurate, as different hyper-parameters may perform better or worse under different architectures. However, this method may, at least, provide a good starting point for further optimization, as this procedure gives a rough idea on how augmentations affect the loss landscape.

In [5]:
class ConvNetExperimental(nn.Module):
    def __init__(self):
        super(ConvNetExperimental, self).__init__()
        self.feature_extractor = nn.Sequential(
            nn.Conv2d(in_channels=3, out_channels=128, kernel_size=3, stride=1),
            nn.ReLU(),  
            nn.Conv2d(in_channels=128, out_channels=256, kernel_size=3, stride=1),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),
            nn.Conv2d(in_channels=256, out_channels=256, kernel_size=3, stride=1),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),
            nn.Conv2d(in_channels=256, out_channels=512, kernel_size=3, stride=1),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),  
            nn.Conv2d(in_channels=512, out_channels=384, kernel_size=3, stride=1),
            nn.ReLU(),
        )
        self.fc = nn.Linear(384 * 3 * 3, 200)
        
    def forward(self, inputs):
        x = self.feature_extractor(inputs)
        return self.fc(x.view(x.size(0), -1))
    
# model = ConvNetExperimental()
# out = model(torch.zeros(1, 3, 57, 57))
# out.shape
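
The `384 * 3 * 3` input size of the final linear layer follows from the usual output-size formula for unpadded convolutions and pooling. The sketch below traces it for a 57x57 input; `conv_out` and `feature_map_size` are helper names made up for this illustration.

```python
def conv_out(size, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel_size) // stride + 1


def feature_map_size(size=57):
    """Trace the spatial size through ConvNetExperimental's feature extractor."""
    size = conv_out(size, 3)             # conv 3x3: 57 -> 55
    size = conv_out(size, 3)             # conv 3x3: 55 -> 53
    size = conv_out(size, 2, stride=2)   # max pool: 53 -> 26
    size = conv_out(size, 3)             # conv 3x3: 26 -> 24
    size = conv_out(size, 2, stride=2)   # max pool: 24 -> 12
    size = conv_out(size, 3)             # conv 3x3: 12 -> 10
    size = conv_out(size, 2, stride=2)   # max pool: 10 -> 5
    size = conv_out(size, 3)             # conv 3x3: 5 -> 3
    return size


print(feature_map_size())  # 3, so the flattened feature vector has 384 * 3 * 3 = 3456 entries
```

The same arithmetic is a quick way to re-derive the linear layer size whenever the crop size or the number of pooling layers changes.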

In [6]:
simple_model = ConvNetExperimental()
simple_criterion = nn.CrossEntropyLoss()
simple_optimizer = optim.Adam(simple_model.parameters())
simple_engine = ModelEngine(simple_model, simple_criterion, simple_optimizer, model_name="SimpleModel1")

Using device cuda:0


In [7]:
simple_engine.fit(dataloaders_simple)

The created network is rather simple, yet it is able to fit the given dataset reasonably well. Its simplicity also makes experimenting faster. Please note that the dataset used here is just the original dataset center-cropped to 57x57 resolution. The training and validation loss curves can be observed in TensorBoard. These curves show that the model overfits slightly, which may indicate a bad local optimum. It is good practice to augment the data and find the augmentations that suit the current network architecture. As mentioned before, this may lead to a better loss landscape and thus better generalization.

In [8]:
augmented_model = ConvNetExperimental()
augmented_criterion = nn.CrossEntropyLoss()
augmented_optimizer = optim.Adam(augmented_model.parameters())
augmented_engine = ModelEngine(augmented_model, augmented_criterion, augmented_optimizer, model_name="AugmentedModel6")

Using device cuda:0


In [9]:
augmented_engine.fit(dataloaders)

The graphs named "AugmentedModel1" show the learning curves for augmentation with "randomly resized and cropped images", and those named "AugmentedModel2" show them for "randomly cropped images". As can be seen from the TensorBoard graphs, despite the worse training loss values, augmentation led to better generalization in both cases. The graphs also suggest that the "random resizing" augmentation is not a good practice on the Tiny ImageNet dataset.

Most regularization methods aim for better generalization, despite their different approaches. One can say that regularization methods mostly serve to "restrict the power of the model and prevent overfitting" on the irreducible error of a specific model-dataset pair.

The regularization parameters are omitted in those experiments, since they are not much needed: the models are not powerful enough, so restricting their power is not the highest priority. Before adopting regularization methods, it is useful to first see a model overfit (not severely) on the dataset.

In [5]:
class ResBlock(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size, stride=1, batch_norm=True):
        super(ResBlock, self).__init__()

        padding_size = kernel_size // 2

        self.conv1 = nn.Conv2d(in_channels=in_channels,
                               out_channels=out_channels,
                               kernel_size=kernel_size,
                               padding=padding_size,
                               stride=stride)

        self.conv2 = nn.Conv2d(in_channels=out_channels,
                               out_channels=out_channels,
                               kernel_size=kernel_size,
                               padding=padding_size,
                               stride=stride)

        self.batch_norm = batch_norm
        # One normalization layer per convolution, so each keeps its own
        # statistics; nn.Identity makes batch_norm=False a valid option.
        self.bn1 = nn.BatchNorm2d(out_channels) if batch_norm else nn.Identity()
        self.bn2 = nn.BatchNorm2d(out_channels) if batch_norm else nn.Identity()
        # Project the residual when the channel count changes, so it can be added to x.
        self.size_fix = None
        if in_channels != out_channels:
            self.size_fix = nn.Sequential(
                nn.Conv2d(in_channels=in_channels, out_channels=out_channels, kernel_size=3, padding=1),
#                 nn.BatchNorm2d(out_channels)
            )
        self.relu = nn.ReLU(inplace=True)
        
    def forward(self, inputs):
        residual = inputs
        x = self.conv1(inputs)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.conv2(x)
        x = self.bn2(x)
        if self.size_fix is not None:
            residual = self.size_fix(residual)
        return self.relu(x + residual)

class ResNetExperimental(nn.Module):
    def __init__(self, dropout_prob=0):
        super(ResNetExperimental, self).__init__()
        self.feature_extractor = nn.Sequential(
            nn.Conv2d(in_channels=3, out_channels=64, kernel_size=7, stride=1),
            ResBlock(in_channels=64, out_channels=128, kernel_size=3, stride=1),
            nn.MaxPool2d(2, 2), 
            ResBlock(in_channels=128, out_channels=256, kernel_size=3, stride=1),
            nn.MaxPool2d(2, 2),
            ResBlock(in_channels=256, out_channels=512, kernel_size=3, stride=1),
            nn.MaxPool2d(2, 2),  
            ResBlock(in_channels=512, out_channels=1024, kernel_size=3, stride=1),
            nn.AvgPool2d(6, 6),
        )
        self.fc = nn.Sequential(
            nn.Dropout(dropout_prob),
            nn.Linear(1024, 200)
        )
        
    def forward(self, inputs):
        x = self.feature_extractor(inputs)
        return self.fc(x.view(x.size(0), -1))
# model = ResNetExperimental()
# out = model(torch.zeros(1, 3, 57, 57))
# out.shape

In [28]:
resnet_model = ResNetExperimental()
resnet_criterion = nn.CrossEntropyLoss()
resnet_optimizer = optim.Adam(resnet_model.parameters())
resnet_engine = ModelEngine(resnet_model, resnet_criterion, resnet_optimizer, model_name="ResNetModel4")

Using device cuda:0


In [None]:
resnet_engine.fit(dataloaders)

The model seems to overfit on the dataset. Some regularization measures can be adopted to address this issue. The regularization methods not applied so far are:
- L1-L2 regularization
- Dropout
- (Further) Augmentation
- Stochastic Pooling [1]

Further augmentation and stochastic pooling will not be implemented for this project due to time restrictions.

L2 regularization is a widely adopted regularization method for preventing overfitting and improving generalization. Its formula is shown below.

$$\mathrm{Loss}(\hat{y}, y) + \lambda \sum_j \beta_j^2$$

As can be seen from the formula, by adding a penalty to each parameter ($\beta_j$) for being large, this regularizer pushes the parameters to be as small as possible, which shrinks the search space of the parameters. With plain SGD this is also called "weight decay".
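
The name "weight decay" can be read off the gradient step. Assuming plain gradient descent with learning rate $\eta$ (a symbol introduced here, not in the formula above), the penalty term contributes $2\lambda\beta_j$ to the gradient, so one update is:

$$\beta_j \leftarrow \beta_j - \eta \left( \frac{\partial\, \mathrm{Loss}}{\partial \beta_j} + 2\lambda\beta_j \right) = (1 - 2\eta\lambda)\,\beta_j - \eta\, \frac{\partial\, \mathrm{Loss}}{\partial \beta_j}$$

Each step therefore first shrinks the parameter by the constant factor $(1 - 2\eta\lambda)$ and then applies the usual loss gradient. This equivalence holds for plain gradient descent; it is not guaranteed for optimizers that rescale the gradient.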

However, L2 regularization shouldn't be applied with the classic Adam optimizer. Please take a look at the second item of the "Notes On Optimizers" section.

## Notes On Optimizers
- While adaptive methods (especially the Adam optimizer) are widely used for experimenting and hyper-parameter tuning, the final training often performs best as a sequence of partial training runs with the SGD optimizer, in which the learning curve is watched carefully and the learning rate is adapted between runs. Please note that this performance may vary depending on the Machine Learning Engineer's insights.

- The Adam optimizer is very popular, as it is good for experimenting and performs at acceptable levels. While Adam performs well in practice, its handling of weight decay is theoretically flawed: an L2 penalty added to the gradient is rescaled by the adaptive step size, so it no longer acts as true weight decay. A correction is proposed by [2], whose authors name the method AdamW.
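
The difference between the two variants can be illustrated on a single scalar weight. The sketch below is a simplified illustration, not PyTorch's implementation: `adam_step` is a made-up helper that performs only the first update step, and the loss gradient is set to zero so that only the effect of the decay term is visible.

```python
import math


def adam_step(w, grad, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
              weight_decay=0.0, decoupled=False):
    """First Adam update (t = 1) of one scalar weight.

    decoupled=False adds the penalty gradient weight_decay * w to `grad`
    (Adam with L2 regularization); decoupled=True shrinks the weight
    directly and leaves the gradient untouched (the AdamW idea).
    """
    if decoupled:
        w = w - lr * weight_decay * w
    else:
        grad = grad + weight_decay * w
    m = (1 - beta1) * grad        # first moment, starting from zero
    v = (1 - beta2) * grad ** 2   # second moment, starting from zero
    m_hat = m / (1 - beta1)       # bias correction for t = 1
    v_hat = v / (1 - beta2)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)


for w in (10.0, 0.1):
    coupled = w - adam_step(w, 0.0, weight_decay=0.01)
    decoupled = w - adam_step(w, 0.0, weight_decay=0.01, decoupled=True)
    print(f"w={w}: coupled shrink={coupled:.2e}, decoupled shrink={decoupled:.2e}")
```

With the coupled variant both weights shrink by roughly `lr`, regardless of their size, because the adaptive scaling normalizes the penalty gradient away. With the decoupled variant the shrinkage stays proportional to the weight (`lr * weight_decay * w`), which is the behaviour the L2 formula intends.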

In [25]:
corrected_model = ResNetExperimental(.7)
corrected_criterion = nn.CrossEntropyLoss()
corrected_optimizer = optim.SGD(corrected_model.parameters(), lr=.1, weight_decay=.01)
corrected_scheduler = SGDR(corrected_optimizer, t_max=len(dataloaders["train"]))
corrected_engine = ModelEngine(corrected_model, corrected_criterion, corrected_optimizer,
                               scheduler=corrected_scheduler, model_name="CorrectedModel21")

Using device cuda:0


In [26]:
corrected_engine.fit(dataloaders)

KeyboardInterrupt: 

In [27]:
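class SGDR(_LRScheduler):
    """Cosine annealing with warm restarts.

    Each cycle is t_mult times longer than the previous one, and the peak
    learning rate drops by (1 - decay_mult) * base_lr at every restart.
    t_max is measured in scheduler steps.
    """

    def __init__(self, optimizer, t_max, eta_min=1e-6, last_epoch=-1, t_mult=2, decay_mult=.9):
        self.decay_mult = decay_mult
        self.t_max = t_max
        self.t_mult = t_mult
        self.restart_point = t_max  # length of the current cycle, in steps
        self.eta_min = eta_min
        self.restarted_at = 0  # step at which the current cycle started
        self.restart_triggered = 0  # number of restarts so far
        super().__init__(optimizer, last_epoch)

    def restart(self):
        # Start a new, longer cycle from the current step.
        self.restart_point *= self.t_mult
        self.restarted_at = self.last_epoch
        self.restart_triggered += 1

    def cosine(self, base_lr):
        # Peak of the current cycle, lowered linearly with each restart.
        peak_lr = base_lr - self.restart_triggered * (1 - self.decay_mult) * base_lr
        # Fraction of the current cycle that has elapsed.
        progress = (self.last_epoch - self.restarted_at) / self.restart_point
        return self.eta_min + (peak_lr - self.eta_min) * (1 + math.cos(math.pi * progress)) / 2

    def get_lr(self):
        if (self.last_epoch - self.restarted_at) >= self.restart_point:
            self.restart()
        return [self.cosine(base_lr) for base_lr in self.base_lrs]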
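
To see what learning rates such a scheduler produces without building a model, the same arithmetic can be replayed in plain Python. The function below is an illustrative stand-in that mirrors the restart and cosine logic of the `SGDR` class for a single base learning rate; `sgdr_schedule` and the small cycle length used in the example are assumptions made for this sketch.

```python
import math


def sgdr_schedule(base_lr, t_max, n_steps, eta_min=1e-6, t_mult=2, decay_mult=0.9):
    """Return the learning rate for steps 0 .. n_steps - 1."""
    cycle_len = t_max   # length of the current cycle
    cycle_start = 0     # step at which the current cycle began
    restarts = 0        # number of restarts so far
    rates = []
    for step in range(n_steps):
        if step - cycle_start >= cycle_len:
            # Warm restart: longer cycle, lower peak.
            cycle_len *= t_mult
            cycle_start = step
            restarts += 1
        peak_lr = base_lr - restarts * (1 - decay_mult) * base_lr
        progress = (step - cycle_start) / cycle_len
        rates.append(eta_min + (peak_lr - eta_min) * (1 + math.cos(math.pi * progress)) / 2)
    return rates


rates = sgdr_schedule(base_lr=0.1, t_max=4, n_steps=13)
print([round(rate, 4) for rate in rates])
```

With these toy settings the rate starts at 0.1, falls to about 0.05 halfway through the first cycle, jumps back to about 0.09 at step 4 and to about 0.08 at step 12, with each cycle twice as long as the one before it.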

There are some more hyper-parameters to tune after deciding on the network architecture. These are:
- Loss function: The loss function has a crucial effect on the loss landscape and should be selected according to the given task. Cross Entropy Loss works well for multi-class classification tasks. The logarithm in the function improves the "convexity" of the loss. Further information can be found in PyTorch's doc [1].
- Optimizer: While the standard optimizer for a given loss function is stochastic gradient descent (SGD), an adaptive optimizer is often more convenient for hyper-parameter tuning. The Adam optimizer is considered a good adaptive choice for various tasks. Please take a look at the "Notes On Optimizers" part for more on how optimizers can be used.
- Scheduler: A scheduler adjusts the learning rate of the optimizer over the course of training. While adaptive optimizers mostly do not need a scheduler, an SGD optimizer with a good scheduler may outperform them.
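
As a small worked example of the loss function item, the cross-entropy of one sample is the negative log of the softmax probability assigned to the correct class (`nn.CrossEntropyLoss` combines log-softmax and negative log-likelihood in the same way). The helper below is a plain-Python sketch written for this illustration, not PyTorch code.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one sample: -log(softmax(logits)[target])."""
    peak = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target]


# A model that has learned nothing yet spreads probability evenly over
# the 200 classes, so its loss is ln(200).
print(round(cross_entropy([0.0] * 200, target=3), 3))  # 5.298
```

This gives a handy sanity check for the experiments here: a freshly initialized 200-class model should start with a loss of roughly 5.3, and values well above that early in training hint at a problem such as a too-large learning rate.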

The training procedure will be handled by the "ModelEngine" class written for this project. This class eases the training, validation and prediction process. One can look at the documentation of the "ModelEngine" class for further information.

[1] https://pytorch.org/docs/stable/nn.html#torch.nn.CrossEntropyLoss

In [8]:
class ConvNet1(nn.Module):
    def __init__(self):
        super(ConvNet1, self).__init__()
        self.feature_extractor = nn.Sequential(
            nn.Conv2d(in_channels=3, out_channels=128, kernel_size=3, stride=1),
            nn.ReLU(),
            nn.BatchNorm2d(128),    
            nn.Conv2d(in_channels=128, out_channels=256, kernel_size=3, stride=1),
            nn.ReLU(),
            nn.BatchNorm2d(256),
            nn.MaxPool2d(2, 2),
            nn.Conv2d(in_channels=256, out_channels=512, kernel_size=3, stride=1),
            nn.ReLU(),
            nn.BatchNorm2d(512),
            nn.MaxPool2d(2, 2),  
            nn.Conv2d(in_channels=512, out_channels=512, kernel_size=3, stride=1),
            nn.ReLU(),
            nn.BatchNorm2d(512),
        )
        self.fc = nn.Linear(512 * 10 * 10, 200)
        
    def forward(self, inputs):
        x = self.feature_extractor(inputs)
        return self.fc(x.view(x.size(0), -1))

torch.Size([1, 200])

In [5]:
model1 = ConvNet1()
criterion1 = nn.CrossEntropyLoss()
optimizer1 = optim.SGD(model1.parameters(), lr=.01, momentum=.9)
scheduler1 = SGDR(optimizer1, len(dataloaders["train"]))
engine1 = ModelEngine(model1, criterion1, optimizer1, scheduler=scheduler1)

In [6]:
model2 = ResNet1()
criterion2 = nn.CrossEntropyLoss()
optimizer2 = optim.AdamW(model2.parameters())#optim.SGD(model2.parameters(), lr=.01, momentum=.9)
scheduler2 = SGDR(optimizer2, len(dataloaders["train"]))
engine2 = ModelEngine(model2, criterion2, optimizer2)#, scheduler=scheduler2)

Using device cuda:0


In [7]:
engine2.fit(dataloaders)

NameError: name 'earlyStopping' is not defined