# Kaggle Dogs vs Cats Library and Demo

This is a library to download and parse the dataset of [Kaggle's Dogs vs Cats competition](https://www.kaggle.com/competitions/dogs-vs-cats/overview), and a demo of CNNs.

It's inspired by the [Keras' Image classification from scratch](https://keras.io/examples/vision/image_classification_from_scratch/) demo.

The code of this notebook, as a single demo file, is in [.../examples/dogsvscats/demo/](https://github.com/gomlx/gomlx/tree/main/examples/dogsvscats/demo).

This notebook contains 3 demos for the "Dogs vs Cats" dataset:

1. A plain CNN model. It reaches decent accuracy.
2. Transfer learning with a pre-trained InceptionV3 model -- loaded from the internet. You get the best accuracy this way.
3. ["Bootstrap Your Own Latent" (BYOL)](https://arxiv.org/abs/2006.07733) pretraining on unlabeled data, followed by fine-tuning on a few labeled examples (1000 steps). It's a powerful technique when there is lots of data but most of it is unlabeled. That is not the case here, since all images are labeled, but it makes for an interesting demo.

## Environment Set Up

Let's set up `go.mod` to use the local copy of GoMLX, so that the dataset code and the model can be developed jointly. That's often how data pre-processing and model code are developed, together with experimentation.

If you are not changing GoMLX code, feel free to skip this cell. If you use a different directory for your projects, change it below.

Notice the directory `${HOME}/Projects/gomlx` is where the GoMLX code is copied by default in [its Docker](https://hub.docker.com/repository/docker/janpfeifer/gomlx_jupyterlab/general).

In [1]:
!*rm -f go.work && go work init && go work use . "${HOME}/Projects/gomlx" "${HOME}/Projects/gonb" "${HOME}/Projects/gopjrt" "${HOME}/Projects/bsplines"
%goworkfix

	- Added replace rule for module "github.com/gomlx/bsplines" to local directory "/home/janpf/Projects/bsplines".
	- Added replace rule for module "github.com/gomlx/gomlx" to local directory "/home/janpf/Projects/gomlx".
	- Added replace rule for module "github.com/janpfeifer/gonb" to local directory "/home/janpf/Projects/gonb".
	- Added replace rule for module "github.com/gomlx/gopjrt" to local directory "/home/janpf/Projects/gopjrt".


## Data Preparation

The dataset takes ~790MB compressed and contains ~25K examples (a few are not parseable as JPG), evenly split between dogs and cats. We further split it into 20K examples for training and 5K for validation/testing, randomly picked.
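
As a rough illustration of such a random split, the sketch below shuffles example indices with a fixed seed and cuts them into two disjoint sets. It is a minimal, standard-library-only sketch: the function name `splitIndices`, the seed and the rounded counts are assumptions for illustration, and the `dogsvscats` library may assign examples to each set differently.

```go
package main

import (
	"fmt"
	"math/rand"
)

// splitIndices shuffles the indices 0..n-1 with a fixed seed and returns the
// first numTrain of them as the training set and the rest as the validation set.
func splitIndices(n, numTrain int, seed int64) (train, valid []int) {
	rng := rand.New(rand.NewSource(seed))
	perm := rng.Perm(n)
	return perm[:numTrain], perm[numTrain:]
}

func main() {
	train, valid := splitIndices(25000, 20000, 42)
	fmt.Printf("train=%d valid=%d\n", len(train), len(valid))
}
```

Using a fixed seed keeps the split reproducible across runs, which matters when comparing models on the same validation set.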

GoMLX provides a dataset loading and data augmentation [library for the Kaggle Dogs vs Cats competition](https://pkg.go.dev/github.com/gomlx/gomlx@v0.1.0/examples/dogsvscats#section-readme).
This makes it easy to access the data -- and this notebook also serves as documentation and an example for the library.

Let's first create the train/validation datasets and display a sample of the augmented images. The `dogsvscats` library provides a `dogsvscats.CreateDatasets`
function that takes a `dogsvscats.PreprocessingConfiguration` and returns 3 datasets: one for training, one for evaluation on the training data, and one for evaluation on separate validation data. Only the training data is augmented, and that is the dataset we sample from below -- you'll notice some random rotations, and images are randomly flipped.
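
To make the "randomly flipped" augmentation concrete, here is a minimal sketch of a horizontal flip on a grayscale pixel grid. The `[][]uint8` representation and the `flipHorizontal` helper are assumptions for illustration; the library itself works on color images and tensors.

```go
package main

import "fmt"

// flipHorizontal returns a copy of the pixel grid mirrored left-to-right.
// pixels[y][x] is the value at row y, column x.
func flipHorizontal(pixels [][]uint8) [][]uint8 {
	flipped := make([][]uint8, len(pixels))
	for y, row := range pixels {
		flipped[y] = make([]uint8, len(row))
		for x, value := range row {
			// The value at column x moves to the mirrored column.
			flipped[y][len(row)-1-x] = value
		}
	}
	return flipped
}

func main() {
	img := [][]uint8{
		{1, 2, 3},
		{4, 5, 6},
	}
	fmt.Println(flipHorizontal(img))
}
```

A flip like this changes the pixels but not the label, which is why it can be applied freely to training images.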

The first time it runs it may take a while, since it needs to download the data from the internet. The default directory for the data is `${HOME}/work/dogs_vs_cats/`, but you can change it, by setting the `--data` flag. The next time it runs it will re-use the downloaded data.
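
Note that the `--data` flag default in the next cell is written with a leading `~`, which Go's standard library does not expand on its own. The sketch below shows one way such a path could be resolved; `expandHome` is a hypothetical helper, and how the library actually resolves the directory is not shown in this notebook.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// expandHome replaces a leading "~" in dir with the given home directory.
// Paths that do not start with "~" are returned unchanged.
func expandHome(dir, home string) string {
	if dir == "~" {
		return home
	}
	if strings.HasPrefix(dir, "~/") {
		return filepath.Join(home, dir[2:])
	}
	return dir
}

func main() {
	home, err := os.UserHomeDir()
	if err != nil {
		fmt.Println("cannot determine home directory:", err)
		return
	}
	fmt.Println(expandHome("~/work/dogs_vs_cats", home))
}
```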

In [13]:
import (
    "github.com/gomlx/gomlx/ml/data"
    "github.com/gomlx/gomlx/ml/context"
    "github.com/gomlx/gomlx/examples/dogsvscats"
    "github.com/janpfeifer/must"
)

var (
    flagDataDir   = flag.String("data", "~/work/dogs_vs_cats", "Directory to cache downloaded and generated dataset files.")
    flagCheckpoint = flag.String("checkpoint", "", "Directory to save and load checkpoints from. If left empty, no checkpoints are created.")
    contextSettings *string
)

// init_contextSettings is executed at initialization and sets the flag "set" to accept the default context setting flags.
// Use --help to see all options one can set.
func init_contextSettings() {
	ctx := dogsvscats.CreateDefaultContext()
	contextSettings = commandline.CreateContextSettingsFlag(ctx, "set")
}

// ContextFromSettings returns the default context merged with values parsed from --set flag.
func ContextFromSettings() *context.Context {
    ctx := dogsvscats.CreateDefaultContext()
    must.M(commandline.ParseContextSettings(ctx, *contextSettings))
    return ctx
}
%% 
ctx := ContextFromSettings()
config := dogsvscats.NewPreprocessingConfigurationFromContext(ctx, *flagDataDir)
// Download dataset, if not yet downloaded.
must.M(dogsvscats.Download(config.DataDir))
fmt.Println("Dogs vs Cats dataset downloaded.")

Dogs vs Cats dataset downloaded.


Now that the data is downloaded, we can create a `train.Dataset` and sample from it. We sample a few images and display them below.

In [4]:
import(
    timage "github.com/gomlx/gomlx/types/tensors/images"
    "github.com/gomlx/gomlx/types/tensor"
    "github.com/gomlx/gomlx/ml/data"
    "github.com/gomlx/gopjrt/dtypes"

    "github.com/janpfeifer/gonb/gonbui"
)

// sample some random augmented images and display them in the Notebook.
func sample(config *dogsvscats.PreprocessingConfiguration, numRows, numPerRow int) {
    var images []image.Image
    var labels []dogsvscats.DorOrCat

    // Create datasets with batch size equal to numRows*numPerRow.
    var configForSample dogsvscats.PreprocessingConfiguration
    configForSample = *config
    configForSample.BatchSize = numRows*numPerRow  // Sample only what we need, in one batch.
    configForSample.ModelImageSize = 256
    configForSample.ForceOriginal = true
    configForSample.UseParallelism = true
    configForSample.DType = dtypes.Uint8
    
    // Sample the images.
    ds, _, _ := dogsvscats.CreateDatasets(&configForSample)
    _, inputsT, labelsT := must.M3(ds.Yield())
    
    // Get indices and labels of the images.
    indices := inputsT[1].Value().([]int64)
    labelsFloat := labelsT[0].Value().([]uint8)
    labels = make([]dogsvscats.DorOrCat, 0, numRows*numPerRow)
    for _, labelFloat := range labelsFloat {
        labels = append(labels, dogsvscats.DorOrCat(labelFloat))
    }
    
    // Convert images from tensor to Go images.
    images = timage.ToImage().Batch(inputsT[0])
    htmlRows := make([]string, 0, numRows)
    count := 0
    for row := 0; row < numRows; row++ {
        cells := make([]string, 0, numPerRow)
        for col := 0; col < numPerRow; col++ {
            imgIdx := indices[count]
            cells = append(cells, embedImageInHTML(images[count], labels[count].String(), imgIdx, 256))
            count++
        }
        htmlRows = append(htmlRows, fmt.Sprintf("<tr>\n\t<td>%s</td>\n</tr>", strings.Join(cells, "</td>\n\t<td>")))
    }
    htmlTable := fmt.Sprintf("<h3>%s</h3><table>%s</table>\n", "Sample Dogs vs Cats", strings.Join(htmlRows, ""))
    gonbui.DisplayHTML(htmlTable)
}

// embedImageInHTML, with a label.
func embedImageInHTML(img image.Image, label string, imgIdx int64, size int) string {
    imgSrc := must.M1(gonbui.EmbedImageAsPNGSrc(img))   // Generate the image inline in the HTML (in the src field), as opposed to a separate file.
    return fmt.Sprintf(`<figure style="padding:4px;text-align: center; background-color: lightgray; color: black;"><img src="%s" width="%dpx" height="%dpx"><figcaption style="text-align: center;">%s (%d)</figcaption></figure>`,
                       imgSrc, size, size, label, imgIdx)
}

%%
ctx := ContextFromSettings()
config := dogsvscats.NewPreprocessingConfigurationFromContext(ctx, *flagDataDir)
sample(config, 2, 8)

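| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Dog (9647) | Dog (5741) | Dog (10517) | Dog (3269) | Dog (10723) | Dog (2507) | Dog (2620) | Dog (0) |
| Cat (5659) | Cat (11333) | Cat (8080) | Cat (7072) | Cat (1765) | Cat (9241) | Cat (9170) | Cat (2492) |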


### Pre-Generating Augmented and Scaled Images

While our dataset does parallelize the work of augmenting and scaling the images, training is still bottlenecked more by these transformations than by the machine learning itself (at least when running on an older GPU).

So an alternative is to pre-generate the augmented and scaled images, which takes disk space but significantly accelerates training.
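
For intuition about the parallelization mentioned above, here is a minimal worker-pool sketch in plain Go. `parallelMap` and the integer stand-ins for images are hypothetical; the library's own implementation is not shown in this notebook.

```go
package main

import (
	"fmt"
	"sync"
)

// parallelMap applies fn to every input using numWorkers goroutines and
// returns the results in the same order as the inputs.
func parallelMap(inputs []int, numWorkers int, fn func(int) int) []int {
	if numWorkers < 1 {
		numWorkers = 1
	}
	results := make([]int, len(inputs))
	jobs := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < numWorkers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for idx := range jobs {
				// Each index is handed to exactly one worker, so writes do not race.
				results[idx] = fn(inputs[idx])
			}
		}()
	}
	for idx := range inputs {
		jobs <- idx
	}
	close(jobs)
	wg.Wait()
	return results
}

func main() {
	// Stand-in for an expensive per-image transformation.
	square := func(x int) int { return x * x }
	fmt.Println(parallelMap([]int{1, 2, 3, 4, 5}, 3, square))
}
```

Even with all CPU cores busy like this, the per-image work can remain the slowest stage, which is what motivates pre-generating the data.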

The `gomlx/examples/dogsvscats` library provides the `PreGenerate` function for that: it takes the preprocessing configuration and the number of epochs to generate for training (the cell below also passes a third, boolean argument). It takes some 10 minutes to generate 50 epochs of augmented data (~1M uniquely augmented images, or ~30000 unique batches of size 32), which is plenty to train on, but takes 22GB of disk space. The function `dogsvscats.CreateDatasets` will automatically use the pre-generated data if it finds the files in the `--data` directory.
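
As a back-of-the-envelope check of these figures, the sketch below multiplies out the storage needed. The 20188 training images come from the cell output further down and the 75x75 image size from the CNN section; the 4 channels, the 1 byte per value and the absence of compression are assumptions, so treat the result as an estimate only.

```go
package main

import "fmt"

// preGeneratedBytes estimates the size of pre-generated images stored uncompressed.
func preGeneratedBytes(numImages, epochs, imageSize, channels, bytesPerValue int64) int64 {
	return numImages * epochs * imageSize * imageSize * channels * bytesPerValue
}

func main() {
	const numImages, epochs = 20188, 50
	// 75x75 pixels, 4 channels and 1 byte per value are assumptions.
	total := preGeneratedBytes(numImages, epochs, 75, 4, 1)
	fmt.Printf("%d images, about %.1f GB\n", numImages*epochs, float64(total)/1e9)
	fmt.Printf("%d full batches of 32\n", numImages*epochs/32)
}
```

Under these assumptions the estimate lands close to the 22GB quoted above.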

> **Note**: one issue with the pre-generated dataset (as it's currently implemented) is that it can't be shuffled: so if
> one restarts training without going over all the pre-generated epochs, it will see the same images over and over and will overfit to them.

This can take up to 10 minutes, but only needs to be run once. If it detects the files already exist, it's just skipped.

In [5]:
import (
    "github.com/gomlx/gomlx/ml/data"
    "github.com/gomlx/gomlx/ml/train"
)

// We increase the size of the batch for the generation of images -- it makes it a bit faster.
%% --set="batch_size=100"
repeats := 50
ctx := ContextFromSettings()
config := dogsvscats.NewPreprocessingConfigurationFromContext(ctx, *flagDataDir)
dogsvscats.PreGenerate(config, repeats, false)

// Report on number of records from each dataset -- we need to read through them.
fmt.Println("")
for dsIdx, dsName := range []string{dogsvscats.PreGeneratedTrainFileName, dogsvscats.PreGeneratedTrainEvalFileName, dogsvscats.PreGeneratedValidationFileName} {
    fmt.Printf("Dataset %q: ... \r", dsName)
    dsPath := path.Join(config.DataDir, dsName)
    ds := dogsvscats.NewPreGeneratedDataset(dsName, dsPath, 1, false, config.ModelImageSize, config.ModelImageSize, config.DType)
    count := 0
    for {
        if _, _, _, err := ds.Yield(); err != nil { break }
        count++
    }
    if dsIdx == 0 { // For train data, where we generate multiple augmented versions of the original image.
        fmt.Printf("Dataset %q: %d images (== %d x %d)\n", dsName, count, repeats, count/repeats)    
    } else {
        fmt.Printf("Dataset %q: %d images\n", dsName, count)    
    }
}

Validation data for evaluation already generated in "/home/janpf/work/dogs_vs_cats/validation_eval_data.bin"
Training data for evaluation already generated in "/home/janpf/work/dogs_vs_cats/train_eval_data.bin"
Training data for training already generated in "/home/janpf/work/dogs_vs_cats/train_data.bin"

Dataset "train_data.bin": 1009400 images (== 50 x 20188)
Dataset "train_eval_data.bin": 20188 images
Dataset "validation_eval_data.bin": 4798 images


## Training a CNN model

Our first model is a simple CNN, currently using images scaled down to 75x75 pixels, with random rotations (normally distributed with mean 0 and a standard deviation set by `augmentation_angle_stddev`, 20 degrees in the defaults printed below) and random flips.

### Model Hyperparameters

Hyperparameters are defined in one place for all models, because many are used across the training loop. See `dogsvscats.CreateDefaultContext`, defined in [.../examples/dogsvscats/train.go](https://github.com/gomlx/gomlx/blob/main/examples/dogsvscats/train.go). The default values are:

In [28]:
%%
ctx := ContextFromSettings()
fmt.Println(commandline.SprintContextSettings(ctx))

Context hyperparameters:
	"activation": (string) 
	"adam_dtype": (string) 
	"adam_epsilon": (float64) 1e-07
	"augmentation_angle_stddev": (float64) 20
	"augmentation_force_original": (bool) false
	"augmentation_random_flips": (bool) true
	"batch_size": (int) 16
	"byol_finetune": (bool) false
	"byol_hidden_nodes": (int) 4096
	"byol_inception": (bool) false
	"byol_pretrain": (bool) false
	"byol_projection_nodes": (int) 256
	"byol_reg_len1": (float64) 0.01
	"byol_regularization_rate": (float64) 1
	"byol_target_update_ratio": (float64) 0.99
	"cnn_dropout_rate": (float64) -1
	"cnn_embeddings_size": (int) 128
	"cnn_num_layers": (float64) 5
	"cosine_schedule_steps": (int) 0
	"dropout_rate": (float64) 0.1
	"eval_batch_size": (int) 100
	"fnn_dropout_rate": (float64) -1
	"fnn_normalization": (string) 
	"fnn_num_hidden_layers": (int) 3
	"fnn_num_hidden_nodes": (int) 128
	"fnn_residual": (bool) true
	"inception_finetuning": (bool) true
	"inception_pretrained": (bool) true
	"kan_bspline_degree": (i

### Model Definitions

GoMLX model functions have the signature:

```go
type ModelFn func(ctx *context.Context, spec any, inputs []*graph.Node) (predictions []*graph.Node)
```

- `ctx` holds the hyperparameters and variables created and used by the model.
- `spec` is an opaque value provided by the dataset. It is simply ignored in this project, but it can be used to identify different types of inputs, for heterogeneous input types (e.g. for multi-headed models).
- `inputs` is a tuple of inputs. For this dataset, the input takes only one value, a batch of images to be classified.

The function returns a tuple of predictions. For this project it has only one element: the batch of predictions for the input images.

The `dogsvscats.CnnModelGraph` in [.../examples/dogsvscats/model_cnn.go](https://github.com/gomlx/gomlx/blob/main/examples/dogsvscats/model_cnn.go) is defined as:

```go
// CnnModelGraph builds the CNN model for our demo.
// It returns the logit, not the predictions, which works with most losses.
// inputs: only one tensor, with shape `[batch_size, width, height, depth]`.
func CnnModelGraph(ctx *context.Context, spec any, inputs []*Node) []*Node {
	embeddings := CnnEmbeddings(ctx, inputs[0])
	logit := fnn.New(ctx.In("readout"), embeddings, 1).NumHiddenLayers(0, 0).Done()
	return []*Node{logit}
}

func CnnEmbeddings(ctx *context.Context, images *Node) *Node {
	batchSize := images.Shape().Dimensions[0]
	numConvolutions := context.GetParamOr(ctx, "cnn_num_layers", 5)

	// Dropout.
	dropoutRate := context.GetParamOr(ctx, "cnn_dropout_rate", -1.0)
	if dropoutRate < 0 {
		dropoutRate = context.GetParamOr(ctx, layers.ParamDropoutRate, 0.0)
	}
	var dropoutNode *Node
	if dropoutRate > 0.0 {
		dropoutNode = Scalar(images.Graph(), images.DType(), dropoutRate)
	}

	filterSize := 16
	logits := images
	imgSize := logits.Shape().Dimensions[1]
	for convIdx := range numConvolutions {
		ctx := ctx.Inf("%03d_conv", convIdx)
		if convIdx > 0 {
			logits = normalizeImage(ctx, logits)
		}
		for repeat := 0; repeat < 2; repeat++ {
			ctx := ctx.Inf("repeat_%02d", repeat)
			residual := logits
			logits = layers.Convolution(ctx, logits).Filters(filterSize).KernelSize(3).PadSame().Done()
			logits = activations.ApplyFromContext(ctx, logits)
			if dropoutNode != nil {
				logits = layers.Dropout(ctx, logits, dropoutNode)
			}
			if residual.Shape().Equal(logits.Shape()) {
				logits = Add(logits, residual)
			}
		}
		if imgSize > 16 {
			// Reduce image size by 2 each time.
			logits = MaxPool(logits).Window(2).Done()
			imgSize = logits.Shape().Dimensions[1]
		}
		logits.AssertDims(batchSize, imgSize, imgSize, filterSize)
	}

	// Flatten the resulting image, and treat the convolved values as tabular.
	logits = Reshape(logits, batchSize, -1)
	return fnn.New(ctx.Inf("%03d_fnn", numConvolutions), logits, context.GetParamOr(ctx, "cnn_embeddings_size", 128)).Done()
}

func normalizeImage(ctx *context.Context, x *Node) *Node {
	x.AssertRank(4) // [batch_size, width, height, depth]
	norm := context.GetParamOr(ctx, "cnn_normalization", "")
	if norm == "" {
		norm = context.GetParamOr(ctx, layers.ParamNormalization, "")
	}
	switch norm {
	case "layer":
		return layers.LayerNormalization(ctx, x, 1, 2).ScaleNormalization(false).Done()
	case "batch":
		return batchnorm.New(ctx, x, -1).Done()
	case "none", "":
		return x
	}
	exceptions.Panicf("invalid normalization selected %q -- valid values are batch, layer, none", norm)
	return nil
}
```
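
To follow how the image shrinks inside `CnnEmbeddings`, the sketch below replays only its pooling logic: each block halves the spatial size while it is still larger than 16. The 75-pixel input comes from the text above; the rounding down on odd sizes (pooling without padding) is an assumption, and `poolSchedule` is a hypothetical helper, not part of the library.

```go
package main

import "fmt"

// poolSchedule returns the spatial size after each convolution block, halving
// the size (integer division) while it is larger than minSize.
func poolSchedule(imgSize, numLayers, minSize int) []int {
	sizes := make([]int, 0, numLayers)
	for layer := 0; layer < numLayers; layer++ {
		if imgSize > minSize {
			imgSize /= 2 // Max pooling with window 2, assuming no padding.
		}
		sizes = append(sizes, imgSize)
	}
	return sizes
}

func main() {
	sizes := poolSchedule(75, 5, 16)
	fmt.Println("size after each block:", sizes)
	last := sizes[len(sizes)-1]
	// 16 is the filterSize used by every convolution in CnnEmbeddings.
	fmt.Println("flattened features:", last*last*16)
}
```

Under these assumptions the last two of the five blocks no longer pool, and the flattened input to the final FNN has 9x9x16 values per image.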

Below is a small test that checks that, given a placeholder input, the model function builds a computation graph with the correct output shape:

In [10]:
import (
    . "github.com/gomlx/gomlx/graph"
    "github.com/gomlx/gomlx/backends"

    _ "github.com/gomlx/gomlx/backends/xla"
)

var _ = NewGraph

%%
ctx := ContextFromSettings()
config := dogsvscats.NewPreprocessingConfigurationFromContext(ctx, *flagDataDir)

// Let's just check that we get the right shape from the model function, without any real data.
g := NewGraph(backends.New(), "test")
inputs := []*Node{
    // Images: create a graph parameter node shaped [batch_size, width, height, depth=4]:
    Parameter(g, "images", shapes.Make(config.DType, config.BatchSize, config.ModelImageSize, config.ModelImageSize, 4)),
}
outputs := dogsvscats.CnnModelGraph(ctx, nil, inputs)
fmt.Printf("Logits shape for batch_size=%d: %s\n", config.BatchSize, outputs[0].Shape())
outputs[0].AssertDims(config.BatchSize, 1)

Logits shape for batch_size=16: (Float32)[16 1]


### Training Loop

The trainer `dogsvscats.TrainModel` is defined in [.../examples/dogsvscats/train.go](https://github.com/gomlx/gomlx/blob/main/examples/dogsvscats/train.go). It is straightforward (and almost the same for every project) and does the following for us:

- If a checkpoint is given (`--checkpoint`) and it has a previously saved model, loads the hyperparameters and trained variables.
- Creates the trainer: with the selected model function (by default `dogsvscats.CnnModelGraph`), optimizer, loss and metrics.
- Creates a `train.Loop` and attaches to it a progress bar, a periodic checkpoint saver and a plotter (`--set="plots=true"`).
- Trains for the selected number of train steps.
- Reports results.

Below we train 50 steps with the default settings just to check things are working.

In [15]:
%% --set="model=cnn;plots=false;train_steps=50"
ctx := ContextFromSettings()
dogsvscats.TrainModel(ctx, *flagDataDir, *flagCheckpoint)

Training (50 steps):  100% [========================================] (13 steps/s) [step=49] [loss+=0.679] [~loss+=0.691] [~loss=0.691] [~acc=52.63%]
	[Step 50] median train step: 7713 microseconds
	Updated batch normalization mean/variances averages.

Results on train-eval [Pre]:
	Mean Loss+Regularization (#loss+): 0.682
	Mean Loss (#loss): 0.682
	Mean Accuracy (#acc): 55.68%
Results on valid-eval [Pre]:
	Mean Loss+Regularization (#loss+): 0.682
	Mean Loss (#loss): 0.682
	Mean Accuracy (#acc): 55.79%



### Training Session: CNN with 63K steps (~50 epochs)

The first cell below removes a previous version of our `base_cnn` model checkpoints, just in case.

In [19]:
!rm -rf ~/work/dogs_vs_cats/base_cnn

In [20]:
%% --checkpoint=base_cnn --set="model=cnn;plots=true;train_steps=63000"
ctx := ContextFromSettings()
dogsvscats.TrainModel(ctx, *flagDataDir, *flagCheckpoint)

Training (63000 steps):    1% [........................................] (83 steps/s) [18s:12m33s] [step=692] [loss+=0.629] [~loss+=0.654] [~loss=0.654] [~acc=60.91%]          

Training (63000 steps):  100% [========================================] (97 steps/s) [step=62999] [loss+=0.180] [~loss+=0.362] [~loss=0.362] [~acc=84.54%]


	[Step 63000] median train step: 7682 microseconds
	Updated batch normalization mean/variances averages.

Results on train-eval [Pre]:
	Mean Loss+Regularization (#loss+): 0.357
	Mean Loss (#loss): 0.357
	Mean Accuracy (#acc): 84.40%
Results on valid-eval [Pre]:
	Mean Loss+Regularization (#loss+): 0.378
	Mean Loss (#loss): 0.378
	Mean Accuracy (#acc): 83.64%
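
As a quick check of the "~50 epochs" figure for this session, the sketch below converts train steps into passes over the training set. The batch size of 16 is the default from the hyperparameter listing and the 20188 training images come from the pre-generation output; that neither was overridden for this run is an assumption.

```go
package main

import "fmt"

// epochs returns how many passes over the training set a run makes.
func epochs(trainSteps, batchSize, numTrainImages int) float64 {
	return float64(trainSteps*batchSize) / float64(numTrainImages)
}

func main() {
	fmt.Printf("%.1f epochs\n", epochs(63000, 16, 20188))
}
```

With these numbers the run covers just under 50 epochs, so it stays within the 50 pre-generated epochs of augmented data.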



### Results from multiple runs with 120K steps (~100 epochs):

| Try | Train <br/> Loss | Train <br/> Accuracy | Validation <br/> Loss | Validation <br/> Accuracy |
| --- | --- | --- | --- | --- |
| 1 | 0.171 | 93.19% | 0.255 | 89.79% |
| 2 | 0.195 | 92.02% | 0.272 | 88.60% |
| 3 | 0.196 | 92.00% | 0.296 | 87.64% |
| 4 | 0.188 | 92.40% | 0.269 | 88.85% |
| 5 | 0.172 | 92.90% | 0.264 | 88.98% |
| 6 | 0.172 | 93.00% | 0.255 | 89.49% |
| 7 | 0.197 | 91.92% | 0.278 | 88.17% |
| 8 | 0.174 | 92.98% | 0.240 | 90.60% |
| 9 | 0.185 | 92.66% | 0.284 | 88.17% |
| 10 | 0.179 | 92.60% | 0.270 | 89.15% |





## Transfer Learning from Inception V3

Inception is one of the classic image models, and it can work very well for transfer learning -- using a pre-trained model for new tasks. It is provided in GoMLX's library of pre-trained models.

Reference: [Rethinking the Inception Architecture for Computer Vision](http://arxiv.org/abs/1512.00567) (CVPR 2016)

The code below will define the new model type, and train it for a few steps, just to check things are working. If the 
model weights are not yet downloaded, it will also download them.

The inception model is relatively large, so it takes a few seconds to build it.
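
The model function in the next cell first passes the images through `inceptionv3.PreprocessImage(images, 1.0, ...)`, where `1.0` is the maximum pixel value. Keras' InceptionV3 preprocessing rescales pixels to the range [-1, 1]; assuming the GoMLX function follows the same convention (an assumption, not confirmed in this notebook), the scaling step amounts to the sketch below. `scaleToSignedUnit` is a hypothetical name.

```go
package main

import "fmt"

// scaleToSignedUnit maps a pixel value in [0, maxValue] to [-1, 1].
func scaleToSignedUnit(value, maxValue float64) float64 {
	return value/maxValue*2 - 1
}

func main() {
	for _, v := range []float64{0, 0.5, 1} {
		fmt.Printf("%.1f -> %+.1f\n", v, scaleToSignedUnit(v, 1.0))
	}
}
```

Matching the input range used during pre-training matters for transfer learning, since the pre-trained weights expect inputs on that scale.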


In [11]:
import (
    "path"
    "github.com/gomlx/gomlx/models/inceptionv3"
    "github.com/gomlx/gomlx/types/tensor"
	timage "github.com/gomlx/gomlx/types/tensor/image"
)

var (
    flagInceptionPreTrained = flag.Bool("pretrained", true, "If using inception model, whether to use the pre-trained weights to transfer learn")
    flagInceptionFineTuning = flag.Bool("finetuning", true, "If using inception model, whether to fine-tune the inception model")
)

// Include it as a model type.
// 
// Notice that GoNB (the Notebook kernel) will rename `init_inceptionv3` to `init`.
func init_inceptionv3() {
    modelTypeToModelFn["inception"] = InceptionV3ModelGraph
}

// InceptionV3ModelGraph uses an optionally pre-trained inception model.
func InceptionV3ModelGraph(ctx *context.Context, spec any, inputs []*Node) []*Node {
    _ = spec           // Not needed.
    images := inputs[0] // Images scaled from 0.0 to 1.0
    channelsConfig := timage.ChannelsLast
    images = inceptionv3.PreprocessImage(images, 1.0, channelsConfig)  // Adjust image to format used by Inception.

    var preTrainedPath string
    if *flagInceptionPreTrained {
        // Use pre-trained 
        preTrainedPath = *flagDataDir
        err := inceptionv3.DownloadAndUnpackWeights(*flagDataDir)  // Only downloads/unpacks the first time.
        AssertNoError(err)
    }
    // preTrainedPath stays empty when --pretrained=false.
    inceptionV3Builder := inceptionv3.BuildGraph(ctx, images).
        PreTrained(preTrainedPath).
        SetPooling(inceptionv3.MaxPooling).
        Trainable(*flagInceptionFineTuning)
    logits := inceptionV3Builder.Done()
    
    if !*flagInceptionFineTuning {
        logits = StopGradient(logits) // We don't want to train the inception model.
    }

    // logits = FnnOnTop(ctx, logits)
    logits = layers.DenseWithBias(ctx.In("readout"), logits, 1)
    return []*Node{logits}
}

// Train for a few steps, just to test things are working.
%% --steps=100 --model=inception --pretrained=true --finetuning=false --plots=false
config := buildConfig()
trainModel(config)

Training (100 steps):  100% [========================================] (27 steps/s) [step=99] [loss+=0.583] [~loss+=0.797] [~loss=0.797] [~acc=46.81%]
	[Step 100] median train step: 6287 microseconds

Results on train-eval [Pre]:
	Mean Loss+Regularization (#loss+): 0.911
	Mean Loss (#loss): 0.911
	Mean Accuracy (#acc): 46.18%
Results on valid-eval [Pre]:
	Mean Loss+Regularization (#loss+): 0.906
	Mean Loss (#loss): 0.906
	Mean Accuracy (#acc): 47.15%



### Training Session: InceptionV3 pre-trained, fine-tuning, 10K steps

In [12]:
!rm -rf ~/work/dogs_vs_cats/inception_v3_finetuned

In [13]:
%% --steps=10000 --model=inception --pretrained=true --finetuning=true --checkpoint=inception_v3_finetuned
config := buildConfig()
trainModel(config)

Training (10000 steps):  100% [========================================] (29 steps/s) [step=9999] [loss+=0.528] [~loss+=0.139] [~loss=0.139] [~acc=94.35%]


	[Step 10000] median train step: 24191 microseconds

Results on train-eval [Pre]:
	Mean Loss+Regularization (#loss+): 0.120
	Mean Loss (#loss): 0.120
	Mean Accuracy (#acc): 95.26%
Results on valid-eval [Pre]:
	Mean Loss+Regularization (#loss+): 0.196
	Mean Loss (#loss): 0.196
	Mean Accuracy (#acc): 92.53%



### Results from multiple runs:

| Try | Train <br/> Loss | Train <br/> Accuracy | Validation <br/> Loss | Validation <br/> Accuracy |
| --- | --- | --- | --- | --- |
| 1 | 0.035 | 98.83% | 0.202 | 93.60% |
| 2 | 0.029 | 98.95% | 0.220 | 93.17% |
| 3 | 0.018 | 99.39% | 0.225 | 93.49% |
| 4 | 0.021 | 99.27% | 0.235 | 93.49% |
| 5 | 0.029 | 99.06% | 0.240 | 93.28% |
| 6 | 0.039 | 98.57% | 0.257 | 92.96% |
| 7 | 0.035 | 98.81% | 0.223 | 93.13% |
| 8 | 0.055 | 98.02% | 0.281 | 92.04% |
| 9 | 0.041 | 98.50% | 0.257 | 92.49% |
| 10 | 0.025 | 99.21% | 0.219 | 94.09% |
| 11 | 0.039 | 98.60% | 0.230 | 93.36% |
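
To compare this table with the CNN results earlier in the notebook, the sketch below computes the mean and sample standard deviation of the validation accuracies. The numbers are copied from the two tables; `meanStd` is a helper written for this illustration.

```go
package main

import (
	"fmt"
	"math"
)

// Validation accuracies (%) copied from the two result tables in this notebook.
var (
	cnnValAcc       = []float64{89.79, 88.60, 87.64, 88.85, 88.98, 89.49, 88.17, 90.60, 88.17, 89.15}
	inceptionValAcc = []float64{93.60, 93.17, 93.49, 93.49, 93.28, 92.96, 93.13, 92.04, 92.49, 94.09, 93.36}
)

// meanStd returns the mean and the sample standard deviation of values.
func meanStd(values []float64) (mean, std float64) {
	for _, v := range values {
		mean += v
	}
	mean /= float64(len(values))
	for _, v := range values {
		std += (v - mean) * (v - mean)
	}
	std = math.Sqrt(std / float64(len(values)-1))
	return mean, std
}

func main() {
	cnnMean, cnnStd := meanStd(cnnValAcc)
	incMean, incStd := meanStd(inceptionValAcc)
	fmt.Printf("CNN:         %.2f%% +/- %.2f\n", cnnMean, cnnStd)
	fmt.Printf("InceptionV3: %.2f%% +/- %.2f\n", incMean, incStd)
}
```

On these numbers the InceptionV3 runs average roughly 93.2% validation accuracy against roughly 88.9% for the plain CNN, a gap several times larger than the run-to-run spread.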



### Inception Model Architecture, but no transfer learning (not using the pre-trained weights)

In [14]:
!rm -rf ~/work/dogs_vs_cats/inception_v3_base

In [15]:
%% --steps=1 --model=inception --pretrained=false --finetuning=true --checkpoint=inception_v3_base
config := buildConfig()
trainModel(config)

Training (1 steps):  100% [========================================] (3 steps/min) [step=0] [loss+=0.706] [~loss+=0.706] [~loss=0.706] [~acc=43.75%]
	[Step 1] median train step: 18311052 microseconds

Results on train-eval [Pre]:
	Mean Loss+Regularization (#loss+): 0.946
	Mean Loss (#loss): 0.946
	Mean Accuracy (#acc): 50.17%
Results on valid-eval [Pre]:
	Mean Loss+Regularization (#loss+): 0.939
	Mean Loss (#loss): 0.939
	Mean Accuracy (#acc): 50.66%



## Bootstrap Your Own Latent (BYOL)

Based on the paper ["Bootstrap Your Own Latent" [arxiv]](https://arxiv.org/abs/2006.07733), the idea is to first train an unsupervised embedding of the data, which can then be "transfer learned" to the final model.

This shouldn't be compared to a fully supervised model when one has labels for every example. But it should do much better if, for instance, one only has labels for a fraction of the examples.

To evaluate whether it works, we expect a BYOL pre-trained model to reach a higher accuracy more quickly than a model trained from scratch -- but likely to reach similar accuracy if trained with all our data (where everything is labeled).

For BYOL two models are created:

1. "online": A CNN model (reusing model above) that actually generates a prediction.
2. "target": Another CNN model (but with different randomly initialized weights) used for regularization.

Only the "online" model is updated with gradient descent, using a combined loss: the loss with respect to the label (just as in the previous model) plus
a regularization loss on the squared Euclidean (L2) distance between the projections (just an extra FNN layer) of the "online" and "target" models.
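
The distance term can be sketched in plain Go as below. The BYOL paper computes it on L2-normalized vectors; whether the `byolLoss` function used in this notebook normalizes is not shown in this section, so both variants are included. The helper names and the sample vectors are made up for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// squaredL2 returns the squared Euclidean distance between a and b,
// which must have the same length.
func squaredL2(a, b []float64) float64 {
	var sum float64
	for i := range a {
		diff := a[i] - b[i]
		sum += diff * diff
	}
	return sum
}

// l2Normalize returns a copy of v scaled to unit length.
// A zero vector is returned unchanged.
func l2Normalize(v []float64) []float64 {
	var norm float64
	for _, x := range v {
		norm += x * x
	}
	norm = math.Sqrt(norm)
	out := make([]float64, len(v))
	for i, x := range v {
		if norm > 0 {
			out[i] = x / norm
		}
	}
	return out
}

func main() {
	online := []float64{3, 4} // Stand-in for an "online" prediction.
	target := []float64{4, 3} // Stand-in for a "target" projection.
	fmt.Println("raw:       ", squaredL2(online, target))
	fmt.Println("normalized:", squaredL2(l2Normalize(online), l2Normalize(target)))
}
```

For unit-length vectors the squared distance equals 2 minus twice the cosine similarity, so minimizing it pulls the two projections towards the same direction.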

The "target" model is not touched by gradient descent, but instead, after each step we do a moving average towards the "online" model parameters.
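
This moving-average update can be sketched on plain slices as follows. The 0.99 ratio is the `byol_target_update_ratio` default from the hyperparameter listing; reading it as the weight kept from the old "target" value is an assumption, and `updateTarget` is a hypothetical helper.

```go
package main

import "fmt"

// updateTarget moves each target weight towards the matching online weight:
// target = ratio*target + (1-ratio)*online.
func updateTarget(target, online []float64, ratio float64) {
	for i := range target {
		target[i] = ratio*target[i] + (1-ratio)*online[i]
	}
}

func main() {
	target := []float64{1.0, 0.0}
	online := []float64{0.0, 1.0}
	for step := 1; step <= 3; step++ {
		updateTarget(target, online, 0.99)
		fmt.Printf("step %d: %.4f\n", step, target)
	}
}
```

With a ratio close to 1 the "target" weights change slowly, giving the "online" model a stable reference to regress towards.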

More details in the [paper](https://arxiv.org/abs/2006.07733).

### BYOL Image Pairs

For BYOL, each training image is augmented in two different ways (2 different rotations) and BYOL uses the different versions to regularize one version's embeddings to the other.

First we generate a sample of the images to make sure they are paired correctly: each row should show the same image (a dog or a cat) twice, with different rotations.

#### Sample of image pairs with different augmentations:

In [16]:
import 	"github.com/gomlx/gomlx/types/slices"

%% --batch=4 --model=byol
config := buildConfig()

trainDS, _, _ := dogsvscats.CreateDatasets(config)

_, inputsT, labelsT, err := trainDS.Yield()
AssertNoError(err)
if len(inputsT) < 2 {
    fmt.Println("Pairs not being generated!?")
    return
}

// Get indices and labels of the images.
labelsFloat := labelsT[0].Local().Value().([]float32)
labels := slices.Map(labelsFloat, func (l float32) dogsvscats.DorOrCat {
    return dogsvscats.DorOrCat(l)
})

// Convert images from tensor to Go images.
imagesA, err := timage.ToImage().Batch(inputsT[0].Local())
AssertNoError(err)
imagesB, err := timage.ToImage().Batch(inputsT[1].Local())
AssertNoError(err)

numRows := *flagBatchSize
htmlRows := make([]string, 0, numRows)
for row := 0; row < numRows; row += 1 {
    cells := []string{
        embedImageInHTML(imagesA[row], labels[row].String(), 0, 64),
        embedImageInHTML(imagesB[row], labels[row].String(), 0, 64),
    }
    htmlRows = append(htmlRows, fmt.Sprintf("<tr>\n\t<td>%s</td>\n</tr>", strings.Join(cells, "</td>\n\t<td>")))
}
htmlTable := fmt.Sprintf("<h3>%s</h3><table>%s</table>\n", "Pairs Of Dogs And Cats", strings.Join(htmlRows, ""))
gonbui.DisplayHTML(htmlTable)

| 0 | 1 |
| --- | --- |
| Dog (0) | Dog (0) |
| Cat (0) | Cat (0) |
| Dog (0) | Dog (0) |
| Cat (0) | Cat (0) |


### BYOL Model

BYOL is used on top of some base model, which defaults to our previous CNN model (but can be configured to use InceptionV3 by setting `--byol_inceptionv3`).

In [17]:
// Include it as a model type.
func init_byol() {
    modelTypeToModelFn["byol"] = ByolCnnModelGraph
}

// byolModel is the core of the BYOL model.
// It's built twice, once for the "online" model once for the "target" model -- using contexts on different scopes.
//
// baseTrainable defines whether the base model should be trainable (set to false for the "target"
// model, or if fine-tuning is disabled)
func byolModel(ctx *context.Context, images *Node, baseTrainable bool) (logit, embeddings *Node) {
	isInceptionV3 := context.GetParamOr(ctx, "byol_inception", false)
	if isInceptionV3 {
		channelsConfig := timage.ChannelsLast
		images = inceptionv3.PreprocessImage(images, 1.0, channelsConfig) // Adjust image to format used by Inception.
		embeddings = inceptionv3.BuildGraph(ctx, images).
			SetPooling(inceptionv3.MaxPooling).
			Trainable(baseTrainable).Done()
	} else {
		// Simple CNN model -- we need an extra FNN on top, so we discard the original prediction.
		embeddings = CnnEmbeddings(ctx, images)
	}
	if !baseTrainable {
		embeddings = StopGradient(embeddings)
	}

	logit = layers.DenseWithBias(ctx.In("readout"), embeddings, 1)
	return
}

// ByolCnnModelGraph builds a BYOL-version of the CNN model of our demo.
//
// It returns the logit, not the predictions, which works with most losses.
// inputs: only one tensor, with shape `[batch_size, width, height, depth]`.
func ByolCnnModelGraph(ctx *context.Context, spec any, inputs []*Node) []*Node {
	_ = spec // Not used.

	// Create two models: same structure, different initializations, and if `--byol_use_pairs` is set,
	// different augmentations of the same image.
	onlineCtx := ctx.In("online")
	targetCtx := ctx.In("target").WithInitializer(initializers.RandomNormalFn(0, 1.0))

	// No dropout for the "target" model, and a more random initialization.
	targetCtx.SetParam("conv_dropout", 0.0) // Disable dropout on the target side.
	targetCtx.SetParam("dropout", 0.0)      // Disable dropout on the target side.

	// During evaluation/inference, or once pre-training is over, we only use the "online" model
	// and return its prediction.
	g := inputs[0].Graph() // Graph.
	if !ctx.IsTraining(g) || !*flagByolPretraining {
		baseTraining := ctx.IsTraining(g) && *flagByolFinetuning
		onlineLogit, _ := byolModel(onlineCtx, inputs[0], baseTraining)
		return []*Node{onlineLogit} // Return only the logits.
	}

	stackedImages12 := Concatenate([]*Node{inputs[0], inputs[1]}, 0) // For "online" model.
	stackedImages21 := Concatenate([]*Node{inputs[1], inputs[0]}, 0) // For "target" model.

	regularizationRate := context.GetParamOr(targetCtx, "byol_regularization_rate", 1.0)

	_, onlineEmbedding := byolModel(onlineCtx, stackedImages12, true)
	onlineProjection := byolProjection(onlineCtx, onlineEmbedding)
	onlinePrediction := byolOnlinePrediction(onlineCtx, onlineProjection)

	_, targetEmbedding := byolModel(targetCtx, stackedImages21, false)
	targetProjection := byolProjection(targetCtx, targetEmbedding)
	targetCtx.EnumerateVariablesInScope(func(v *context.Variable) {
		v.Trainable = false
	})
	targetProjection = StopGradient(targetProjection)

	byolReg := byolLoss(onlinePrediction, targetProjection)
	train.AddLoss(ctx, MulScalar(byolReg, regularizationRate))

	// Update "target" model with moving average to the "online" model.
	movingAverageRatio := context.GetParamOr(targetCtx, "byol_target_update_ratio", 0.999)
	if movingAverageRatio < 1.0 {
		onlineScope := onlineCtx.Scope()
		targetScope := targetCtx.Scope()
		targetCtx.EnumerateVariablesInScope(func(targetVar *context.Variable) {
			if !strings.HasPrefix(targetVar.Scope(), targetScope) {
				exceptions.Panicf("BYOL target model variable %q::%q has unexpected scope (not prefixed with %q)",
					targetVar.Scope(), targetVar.Name(), targetScope)
			}

			// Get corresponding variable in "online" model.
			onlineVarScope := onlineScope + targetVar.Scope()[len(targetScope):]
			onlineVar := ctx.InspectVariable(onlineVarScope, targetVar.Name())
			if onlineVar == nil {
				exceptions.Panicf("BYOL target model variable %q::%q has no corresponding variable %q::%q in online model",
					targetVar.Scope(), targetVar.Name(), onlineVarScope, targetVar.Name())
			}

			targetValue := targetVar.ValueGraph(g)
			onlineValue := onlineVar.ValueGraph(g)
			targetValue = Add(
				MulScalar(onlineValue, 1.0-movingAverageRatio),
				MulScalar(targetValue, movingAverageRatio))
			targetVar.SetValueGraph(targetValue)
		})
	}
	return []*Node{} // No prediction to return.
}

func byolProjection(ctx *context.Context, embeddings *Node) *Node {
	projectionNodes := context.GetParamOr(ctx, "byol_projection_nodes", 256)
	projectionHiddenNodes := context.GetParamOr(ctx, "byol_hidden_nodes", 4096)

	// Re-use FnnOnTop: redefine its params based on BYOL ones, in the local scope.
	ctx = ctx.In("byol_projection")
	hiddenCtx := ctx.In("hidden")
	embeddings = layers.Dense(hiddenCtx, embeddings, true, projectionHiddenNodes)
	embeddings = normalizeFeatures(hiddenCtx, embeddings)
	embeddings = layers.Relu(embeddings)
	embeddings = layers.Dense(ctx.In("projection"), embeddings, true, projectionNodes)
	return embeddings
}

func byolOnlinePrediction(ctx *context.Context, projection *Node) *Node {
	projectionNodes := context.GetParamOr(ctx, "byol_projection_nodes", 256)
	projectionHiddenNodes := context.GetParamOr(ctx, "byol_hidden_nodes", 4096)

	ctx = ctx.In("byol_online_prediction")
	hiddenCtx := ctx.In("hidden")
	projection = layers.Dense(hiddenCtx, projection, true, projectionHiddenNodes)
	projection = normalizeFeatures(hiddenCtx, projection)
	projection = layers.Relu(projection)
	projection = layers.Dense(ctx.In("projection"), projection, true, projectionNodes)
	return projection
}

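// byolLoss compares the "online" prediction with the "target" projection: it L2-normalizes both
// along the last axis and returns 2 - 2*(cosine similarity) per example.
// The expression is symmetric, so the order of the arguments doesn't matter.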
func byolLoss(p0, p1 *Node) *Node {
	p0 = L2NormalizeWithEpsilon(p0, 1e-12, -1)
	p1 = L2NormalizeWithEpsilon(p1, 1e-12, -1)
	return AddScalar(
		MulScalar(
			ReduceSum(Mul(p0, p1), -1),
			-2.0),
		2.0)
}

%% --steps=100 --model=byol --plots=false
config := buildConfig()
trainModel(config)


Training (100 steps):  100% [========================================] (28 steps/s) [step=99] [loss+=0.728] [~loss+=0.705] [~loss=0.705] [~acc=49.00%]
	[Step 100] median train step: 2388 microseconds

Results on train-eval [Pre]:
	Mean Loss+Regularization (#loss+): 0.697
	Mean Loss (#loss): 0.697
	Mean Accuracy (#acc): 50.19%
Results on valid-eval [Pre]:
	Mean Loss+Regularization (#loss+): 0.698
	Mean Loss (#loss): 0.698
	Mean Accuracy (#acc): 49.87%
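
The `byolLoss` function above L2-normalizes both inputs, so the value it returns per example is `2 - 2*cos(p0, p1)`: close to 0 when the two vectors point the same way, 2 when they are orthogonal, and 4 when they are opposite. The sketch below reproduces that arithmetic in plain Go, without GoMLX, on hand-picked vectors. The helper names `l2Normalize` and `byolLossRow` are illustrative only, and the epsilon handling is an assumption that may differ in detail from `L2NormalizeWithEpsilon`.

```go
package main

import (
	"fmt"
	"math"
)

// l2Normalize returns v scaled to unit L2 norm.
// Assumption: epsilon is a lower bound on the squared norm, to avoid dividing by zero.
func l2Normalize(v []float64, epsilon float64) []float64 {
	var sumSq float64
	for _, x := range v {
		sumSq += x * x
	}
	norm := math.Sqrt(math.Max(sumSq, epsilon))
	out := make([]float64, len(v))
	for i, x := range v {
		out[i] = x / norm
	}
	return out
}

// byolLossRow computes 2 - 2*cos(p0, p1) for a single example.
// p0 and p1 must have the same length.
func byolLossRow(p0, p1 []float64) float64 {
	n0 := l2Normalize(p0, 1e-12)
	n1 := l2Normalize(p1, 1e-12)
	var dot float64
	for i := range n0 {
		dot += n0[i] * n1[i]
	}
	return 2.0 - 2.0*dot
}

func main() {
	fmt.Printf("same direction: %.3f\n", byolLossRow([]float64{3, 4}, []float64{6, 8}))
	fmt.Printf("orthogonal:     %.3f\n", byolLossRow([]float64{1, 0}, []float64{0, 1}))
	fmt.Printf("opposite:       %.3f\n", byolLossRow([]float64{1, 0}, []float64{-1, 0}))
}
```

Because of the normalization, only the direction of the vectors matters: scaling either the prediction or the projection leaves the loss unchanged.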



### Pre-training unsupervised BYOL model with 10000 steps, batch size 100 (~50 epochs)

In [31]:
!rm -rf ~/work/dogs_vs_cats/byol_pretrained

In [32]:
%% -model=byol -steps=10000 -batch=100 -byol_pretrain -checkpoint byol_pretrained
config := buildConfig()
trainModel(config)

Training (10000 steps):  100% [========================================] (23 steps/s) [step=9999] [loss+=0.041] [~loss+=0.040]


	[Step 10000] median train step: 36124 microseconds
Pre-training only, no evaluation.
- Saving cleared checkpoint.
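
During pre-training, the "target" variables are never updated by gradients. Instead, `ByolCnnModelGraph` above sets each one to a moving average of itself and the matching "online" variable at every step. The plain-Go sketch below shows that update rule on a two-element weight vector. The function name `emaUpdate` and the fixed "online" weights are assumptions made for illustration; in real training the "online" weights also change at every step.

```go
package main

import "fmt"

// emaUpdate moves each target weight toward the matching online weight, in place:
//   target = ratio*target + (1-ratio)*online
// target and online must have the same length.
func emaUpdate(target, online []float64, ratio float64) {
	for i := range target {
		target[i] = ratio*target[i] + (1.0-ratio)*online[i]
	}
}

func main() {
	online := []float64{1.0, -2.0} // Held fixed here only to keep the example small.
	target := []float64{0.0, 0.0}
	const ratio = 0.999 // Same value as the default of "byol_target_update_ratio".
	for step := 1; step <= 10000; step++ {
		emaUpdate(target, online, ratio)
		if step == 1 || step == 1000 || step == 10000 {
			fmt.Printf("step %5d: target = %.4f\n", step, target)
		}
	}
}
```

With a ratio of 0.999 the target lags the online model by roughly a thousand steps, which is why it acts as a slowly moving, more stable regression target.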


### Train Linear Model on top of BYOL Pretrained Representation

1. Train a linear layer on top of a randomly initialized CNN model

In [33]:
%% -steps=10000 -batch=100 -model=byol -dropout=0 -conv_dropout=0
config := buildConfig()
trainModel(config)

Training (10000 steps):  100% [========================================] (96 steps/s) [step=9999] [loss+=0.662] [~loss+=0.659] [~loss=0.659] [~acc=60.51%]
	[Step 10000] median train step: 5870 microseconds

Results on train-eval [Pre]:
	Mean Loss+Regularization (#loss+): 0.662
	Mean Loss (#loss): 0.662
	Mean Accuracy (#acc): 60.53%
Results on valid-eval [Pre]:
	Mean Loss+Regularization (#loss+): 0.662
	Mean Loss (#loss): 0.662
	Mean Accuracy (#acc): 61.17%



2. Train a linear layer on top of a BYOL pre-trained model

In [64]:
!rm -rf ~/work/dogs_vs_cats/byol_linear ; cp -r ~/work/dogs_vs_cats/byol_pretrained ~/work/dogs_vs_cats/byol_linear
!rm ~/work/dogs_vs_cats/byol_linear/training_plot_points.json

In [65]:
%% -steps=10000 -batch=100 -model=byol -dropout=0 -conv_dropout=0 -checkpoint=byol_linear -checkpoint_keep=0
config := buildConfig()
trainModel(config)

loading: "checkpoint-n0000007-20240418-170757-initial"


Training (10000 steps):  100% [========================================] (72 steps/s) [step=9999] [loss+=0.642] [~loss+=0.644] [~loss=0.644] [~acc=62.97%]


	[Step 10000] median train step: 6499 microseconds

Results on train-eval [Pre]:
	Mean Loss+Regularization (#loss+): 0.647
	Mean Loss (#loss): 0.647
	Mean Accuracy (#acc): 62.11%
Results on valid-eval [Pre]:
	Mean Loss+Regularization (#loss+): 0.646
	Mean Loss (#loss): 0.646
	Mean Accuracy (#acc): 63.11%
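
In both runs above neither pre-training nor `-byol_finetuning` is enabled, so `byolModel` is called with `baseTrainable` set to false: the embeddings pass through `StopGradient` and only the `readout` dense layer is trained. The toy sketch below mimics that setup in plain Go. A fixed function stands in for the frozen base model, and gradient descent fits only the weights and bias of a logistic readout. All names (`frozenEmbed`, `readoutLogit`, `trainReadout`) and the four-point dataset are hypothetical and are not part of the demo code.

```go
package main

import (
	"fmt"
	"math"
)

// frozenEmbed stands in for a frozen base model: a fixed mapping with no trainable weights.
func frozenEmbed(x float64) []float64 {
	return []float64{x, x * x}
}

// readoutLogit is the linear readout: a dot product plus a bias.
func readoutLogit(w []float64, b float64, e []float64) float64 {
	logit := b
	for j := range w {
		logit += w[j] * e[j]
	}
	return logit
}

// trainReadout fits only the readout (w, b) with full-batch gradient descent on the
// sigmoid cross-entropy loss. The embeddings are inputs and are never modified.
func trainReadout(embeddings [][]float64, labels []float64, steps int, lr float64) ([]float64, float64) {
	w := make([]float64, len(embeddings[0]))
	b := 0.0
	n := float64(len(embeddings))
	for s := 0; s < steps; s++ {
		gradW := make([]float64, len(w))
		gradB := 0.0
		for i, e := range embeddings {
			p := 1.0 / (1.0 + math.Exp(-readoutLogit(w, b, e)))
			diff := p - labels[i] // Gradient of the loss with respect to the logit.
			for j := range w {
				gradW[j] += diff * e[j]
			}
			gradB += diff
		}
		for j := range w {
			w[j] -= lr * gradW[j] / n
		}
		b -= lr * gradB / n
	}
	return w, b
}

func main() {
	inputs := []float64{-2, -1, 1, 2}
	labels := []float64{0, 0, 1, 1}
	embeddings := make([][]float64, len(inputs))
	for i, x := range inputs {
		embeddings[i] = frozenEmbed(x)
	}
	w, b := trainReadout(embeddings, labels, 200, 0.5)
	for i, e := range embeddings {
		fmt.Printf("x=%+.0f label=%.0f logit=%+.3f\n", inputs[i], labels[i], readoutLogit(w, b, e))
	}
}
```

A linear readout can only be as good as the representation it sits on, which is why this kind of probe is a common way to compare a pre-trained embedding against a randomly initialized one.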



### Fine-tune for Only 1000 Steps

Here we compare BYOL pre-trained embeddings with a randomly initialized model.

1. Randomly initialized model

In [66]:
%% -model=byol -steps=1000 -batch=16 -byol_finetuning -checkpoint_keep=0
config := buildConfig()
trainModel(config)

Training (1000 steps):  100% [========================================] (115 steps/s) [step=999] [loss+=0.617] [~loss+=0.601] [~loss=0.601] [~acc=67.44%]
	[Step 1000] median train step: 3290 microseconds

Results on train-eval [Pre]:
	Mean Loss+Regularization (#loss+): 0.580
	Mean Loss (#loss): 0.580
	Mean Accuracy (#acc): 69.13%
Results on valid-eval [Pre]:
	Mean Loss+Regularization (#loss+): 0.585
	Mean Loss (#loss): 0.585
	Mean Accuracy (#acc): 68.81%



2. Starting from BYOL pre-trained model

In [67]:
!rm -rf ~/work/dogs_vs_cats/byol_finetuned ; cp -r ~/work/dogs_vs_cats/byol_pretrained ~/work/dogs_vs_cats/byol_finetuned
!rm ~/work/dogs_vs_cats/byol_finetuned/training_plot_points.json

In [68]:
%% -model=byol -steps=1000 -batch=16 -byol_finetuning -checkpoint_keep=0 --checkpoint=byol_finetuned
config := buildConfig()
trainModel(config)

loading: "checkpoint-n0000007-20240418-170757-initial"


Training (1000 steps):  100% [========================================] (52 steps/s) [step=999] [loss+=0.579] [~loss+=0.594] [~loss=0.594] [~acc=68.73%]


	[Step 1000] median train step: 3113 microseconds

Results on train-eval [Pre]:
	Mean Loss+Regularization (#loss+): 0.570
	Mean Loss (#loss): 0.570
	Mean Accuracy (#acc): 70.19%
Results on valid-eval [Pre]:
	Mean Loss+Regularization (#loss+): 0.573
	Mean Loss (#loss): 0.573
	Mean Accuracy (#acc): 69.98%

