<a href="https://colab.research.google.com/github/mschuessler/two4two/blob/LoMedHiVarSamplers/examples/train_lenet_colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import pathlib
import os
import tensorflow as tf
import tensorflow.keras as keras
from tensorflow.keras import layers
import pandas as pd
from keras_preprocessing.image import ImageDataGenerator


# Two4two data training
This notebook demonstrates how to train a modern LeNet CNN on a dataset pregenerated with the [two4two module](https://github.com/mschuessler/two4two).

If you open this notebook in Colab, please make sure to request a GPU instance. Training will be very slow otherwise.

As a first step, we define a relative path where the trained model will be saved.

In [2]:
relative_model_path = "two4two_example_model"

# Mounting Google Drive to save the trained model later
Since Colab is a free resource, its runtimes are limited. We therefore want to save our model so that we can reuse it later, after our reserved instance terminates. To do this, we mount a Google Drive. If this notebook is not run inside Colab, the following cells skip mounting Google Drive, and the model is saved relative to the local directory from which this notebook is executed.

In [3]:
if 'google.colab' in str(get_ipython()):
  from google.colab import drive
  mount_drive = True
else:
  mount_drive = False

In [4]:
if mount_drive:
  from google.colab import drive
  drive.mount("/content/gdrive")

Mounted at /content/gdrive
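The check above relies on `get_ipython()`, which is only defined inside an IPython session. As a hedged alternative, the sketch below looks for the `google.colab` module in `sys.modules` instead, so it also runs as a plain Python script. The helper name `running_in_colab` is hypothetical, and the check assumes that Colab imports `google.colab` when the kernel starts.

```python
import sys


def running_in_colab() -> bool:
    """Return True if the google.colab module is already loaded.

    Assumption: Colab kernels import google.colab at startup,
    so its presence in sys.modules indicates a Colab runtime.
    """
    return "google.colab" in sys.modules


mount_drive = running_in_colab()
print("Mount Google Drive:", mount_drive)
```

Outside of Colab this should print `False`, and the mounting cell is skipped exactly as before.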


We will now define the path to which our trained model will be saved. You may change the name of the directory according to your preference. If you run this notebook outside of Google Colab, you can change this path to a local path on your notebook server.

In [5]:
model_filepath = os.path.join("/content/gdrive/My Drive", relative_model_path) if mount_drive else relative_model_path
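To make the two cases explicit, the same path logic can be written as a small function. This is only a sketch: `resolve_model_path` and `DRIVE_ROOT` are hypothetical names, and the Drive root is the mount point used in this notebook.

```python
import os

# Mount point plus Drive folder, as used in the cell above
DRIVE_ROOT = "/content/gdrive/My Drive"


def resolve_model_path(relative_path: str, drive_mounted: bool) -> str:
    """Return the save location for the model.

    With a mounted Drive the model goes below DRIVE_ROOT;
    otherwise the relative path is used unchanged.
    """
    if drive_mounted:
        return os.path.join(DRIVE_ROOT, relative_path)
    return relative_path


print(resolve_model_path("two4two_example_model", drive_mounted=True))
print(resolve_model_path("two4two_example_model", drive_mounted=False))
```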

We will use the callback functionality of Keras to save our model whenever it achieves a new highest validation accuracy during training.

In [6]:
model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=model_filepath,
    save_weights_only=False,
    monitor='val_accuracy',
    mode='max',
    save_best_only=True)
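To illustrate what `save_best_only=True` with `mode='max'` means, here is a minimal pure-Python sketch of the rule. It is not the Keras implementation, only the idea behind it: a save happens only when the monitored value strictly exceeds the best value seen so far. The function name and the accuracy values are made up for illustration.

```python
def epochs_that_trigger_save(val_accuracies):
    """Return the 1-based epochs at which a 'save best only' rule would save."""
    best = float("-inf")
    saved_at = []
    for epoch, accuracy in enumerate(val_accuracies, start=1):
        if accuracy > best:  # mode='max': only a strict improvement counts
            best = accuracy
            saved_at.append(epoch)
    return saved_at


# Made-up validation accuracies, purely for illustration
print(epochs_that_trigger_save([0.70, 0.68, 0.75, 0.75, 0.80]))  # [1, 3, 5]
```

In this example, epochs 2 and 4 do not trigger a save because they do not improve on the best validation accuracy so far.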

# Defining model architecture / loading previously trained model
The following cell tests whether the directory where the model should be saved exists. If so, we load the model from it.

If the directory does not exist, we define a new model according to a modern LeNet architecture and compile it using the Adam optimizer.

In [7]:
trained_model_exists = os.path.exists(model_filepath)

In [8]:
if trained_model_exists:
  modernLenetModel = keras.models.load_model(model_filepath)
else:
  modernLenetModel = keras.models.Sequential([
        layers.Conv2D(32, kernel_size=(3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Conv2D(64, kernel_size=(3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Conv2D(64, kernel_size=(3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Conv2D(64, kernel_size=(3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Flatten(),
        layers.Dropout(0.5),
        layers.Dense(2, activation="softmax"),
    ])
  modernLenetModel.compile(loss="categorical_crossentropy",
                             optimizer="adam", metrics=["accuracy"])
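The model above is defined without an input shape, so its layer sizes are only fixed once it sees the first batch. The following sketch traces the spatial size through the four convolution and pooling blocks. It assumes the Keras defaults (`padding="valid"` and stride 1 for `Conv2D`, stride equal to the pool size for `MaxPooling2D`) and a 256x256 input, which is the default `target_size` of `flow_from_dataframe`. The helper names are hypothetical.

```python
def conv_valid(size: int, kernel: int = 3) -> int:
    """Output size of a stride-1 convolution without padding."""
    return size - kernel + 1


def max_pool(size: int, pool: int = 2) -> int:
    """Output size of non-overlapping max pooling without padding."""
    return size // pool


def flattened_features(input_size: int, n_blocks: int = 4, last_filters: int = 64) -> int:
    """Number of values entering the final Dense layer after Flatten."""
    size = input_size
    for _ in range(n_blocks):
        size = max_pool(conv_valid(size))
    return size * size * last_filters


# 256 -> 254 -> 127 -> 125 -> 62 -> 60 -> 30 -> 28 -> 14
print(flattened_features(256))  # 12544
```

Under these assumptions the final `Dense(2)` layer receives 14 * 14 * 64 = 12544 features.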
    

# Download dataset
In this example we use a pregenerated dataset package. It contains a larger dataset called "spherical_color_bias", which contains two biases: the sphericity of the blocks as well as their color are somewhat predictive of the label (peaky or stretchy). The package contains 80,000 images for training, 500 for validation, and 3,000 for testing.

It also contains test data of 3,000 images each for four other datasets that contain only one or neither of the two biases. One dataset does not even have the correct arm positions.

Hence, if the model relies on the biases for its predictions, we would expect a model trained on the dataset with two biases to perform worse on the other test sets.

In [9]:
datasets = ["spherical_color_bias", "no_arms", "no_bias", "spherical_bias", "color_bias"]

In [13]:
data_dir = keras.utils.get_file(
    origin="https://f001.backblazeb2.com/file/two4two/datasets_models/golden80k.tar.gz",
    fname="two4two_datasets",
    untar=True
)

Downloading data from https://f001.backblazeb2.com/file/two4two/datasets_models/golden80k.tar.gz


# Reading dataframe from jsonl

In [14]:
data_dir

'/root/.keras/datasets/two4two_datasets'

In [15]:
train_dir = os.path.join(data_dir, datasets[0], "train")
train_df = pd.read_json(os.path.join(train_dir, "parameters.jsonl"), lines=True)
train_df["filename"] = train_df["id"] + ".png"

In [16]:
valid_dir = os.path.join(data_dir, datasets[0], "validation")
valid_df = pd.read_json(os.path.join(valid_dir, "parameters.jsonl"), lines=True)
valid_df["filename"] = valid_df["id"] + ".png"
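For readers unfamiliar with JSON Lines, the sketch below shows what reading `parameters.jsonl` amounts to: every line is a separate JSON object, and the image filename is derived from its `id`. The two records are invented for illustration and contain only the `id` and `label` fields used in this notebook. The real file may hold more fields per record.

```python
import json

# Invented sample in JSON Lines format: one JSON object per line
sample_jsonl = (
    '{"id": "example-0", "label": "peaky"}\n'
    '{"id": "example-1", "label": "stretchy"}\n'
)

# Parse each non-empty line on its own, as lines=True does in pandas
records = [json.loads(line) for line in sample_jsonl.splitlines() if line.strip()]

# Derive the image filename from the id, as done for the dataframes above
for record in records:
    record["filename"] = record["id"] + ".png"

print(records)
```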

# Creating Datagenerator from dataframes

In [17]:
datagen = ImageDataGenerator(rescale=1. / 255)
train_generator = datagen.flow_from_dataframe(dataframe=train_df, directory=train_dir,
                                              x_col="filename", y_col="label", batch_size=64)
valid_generator = datagen.flow_from_dataframe(dataframe=valid_df, directory=valid_dir,
                                              x_col="filename", y_col="label", batch_size=64)
STEP_SIZE_TRAIN = train_generator.n // train_generator.batch_size
STEP_SIZE_VALID = valid_generator.n // valid_generator.batch_size

Found 80000 validated image filenames belonging to 2 classes.
Found 500 validated image filenames belonging to 2 classes.
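The step sizes are computed with floor division, so a final partial batch is not counted. A short worked example with the dataset sizes reported above makes this concrete. The helper name `steps_and_leftover` is hypothetical.

```python
def steps_and_leftover(n_samples: int, batch_size: int):
    """Return the number of full batches and the samples left over."""
    steps = n_samples // batch_size
    leftover = n_samples - steps * batch_size
    return steps, leftover


print(steps_and_leftover(80000, 64))  # (1250, 0)
print(steps_and_leftover(500, 64))    # (7, 52)
```

With a batch size of 64, the training set divides evenly into 1250 steps, while the 7 validation steps cover 448 of the 500 validation images.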


# Train Model
We highly recommend training the model for at least 10, or even better 20, epochs. The cell below uses only 5.

In [18]:
modernLenetModel.fit(train_generator,
                     steps_per_epoch=STEP_SIZE_TRAIN,
                     validation_data=valid_generator,
                     validation_steps=STEP_SIZE_VALID,
                     epochs=5,
                     callbacks = [model_checkpoint_callback]
                     )

Epoch 1/5
INFO:tensorflow:Assets written to: /content/gdrive/My Drive/two4two_example_model/assets
Epoch 2/5
INFO:tensorflow:Assets written to: /content/gdrive/My Drive/two4two_example_model/assets
Epoch 3/5
INFO:tensorflow:Assets written to: /content/gdrive/My Drive/two4two_example_model/assets
Epoch 4/5
INFO:tensorflow:Assets written to: /content/gdrive/My Drive/two4two_example_model/assets
Epoch 5/5
INFO:tensorflow:Assets written to: /content/gdrive/My Drive/two4two_example_model/assets


<tensorflow.python.keras.callbacks.History at 0x7fb7b9344c50>

# Evaluating the model
We now evaluate our trained model on the test sets of all datasets. As expected, the model performs worse on the datasets where not all biases are present.

In [19]:
for dataset_name in datasets:
        test_dir = os.path.join(data_dir, dataset_name, "test")
        test_df = pd.read_json(os.path.join(test_dir, "parameters.jsonl"), lines=True)
        test_df["filename"] = test_df["id"] + ".png"

        datagen = ImageDataGenerator(rescale=1. / 255)
        test_generator = datagen.flow_from_dataframe(dataframe=test_df, directory=test_dir,
                                                     x_col="filename", y_col="label",
                                                     batch_size=64)

        print("Evaluating on " + dataset_name)
        modernLenetModel.evaluate(test_generator)[1]

Found 3000 validated image filenames belonging to 2 classes.
Evaluating on spherical_color_bias
Found 3000 validated image filenames belonging to 2 classes.
Evaluating on no_arms
Found 3000 validated image filenames belonging to 2 classes.
Evaluating on no_bias
Found 3000 validated image filenames belonging to 2 classes.
Evaluating on spherical_bias
Found 3000 validated image filenames belonging to 2 classes.
Evaluating on color_bias
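The loop above reads the accuracy from each `evaluate` call but does not keep it. One possible way to collect the values for a later comparison is sketched below. `collect_accuracies` and `dummy_evaluate` are hypothetical names, and the dummy function returns placeholder numbers, not measured results. In the notebook it would be replaced by a function that builds the test generator and calls `modernLenetModel.evaluate`.

```python
def collect_accuracies(dataset_names, evaluate_fn):
    """Call evaluate_fn(name) -> (loss, accuracy) per dataset and keep the accuracy."""
    return {name: evaluate_fn(name)[1] for name in dataset_names}


def dummy_evaluate(name):
    # Stand-in for model.evaluate(...); the values are placeholders, not results
    return (0.0, 0.5)


datasets = ["spherical_color_bias", "no_arms", "no_bias", "spherical_bias", "color_bias"]
accuracies = collect_accuracies(datasets, dummy_evaluate)
for name, accuracy in accuracies.items():
    print(f"{name}: {accuracy:.3f}")
```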
