# Foreword
The original DIBaS bacterial dataset is available here: http://misztal.edu.pl/software/databases/dibas/

We took a subset (five of the 33 available species), downscaled each image to a width of 800 pixels, and converted the files from .tif to .png. We also pre-selected a train/test split and arranged the files in a folder structure with one subfolder per class. The final result is [available here](https://raw.githubusercontent.com/ne1s0n/coding_excercises/master/data/dibas_cell_db/dibas_cell_db.zip) (zip file, 72 MB).

# Assignment

Build a multiclass classifier on (a subset of) the DIBaS bacterial dataset.

# Setup

In [None]:
#where to find the data
data_url = 'https://raw.githubusercontent.com/ne1s0n/coding_excercises/master/data/dibas_cell_db/dibas_cell_db.zip'

#where to download the compressed data
data_root = '/content/data/'

#where the images will actually be found
base_dir = data_root + 'dibas_cell_db/'

#image size (after reshaping)
image_shape = (224, 224)

#size for mini-batches
batch_size = 20

# Data

If the data is not available locally, we download and unpack it.

In [None]:
import os
import glob
import requests
import zipfile

#these are derived folders
train_dir = base_dir + 'train/'
val_dir = base_dir + 'test/'

#these two lists should contain the full paths of all train and test images
train_filenames = glob.glob(train_dir + '*/*')
val_filenames   = glob.glob(val_dir + '*/*')

#let's check that we actually have the data
if len(train_filenames) == 0 or len(val_filenames) == 0:
  #either the data was never downloaded or something bad happened
  #in any case, we download and unzip everything

  #room for data
  os.makedirs(data_root, exist_ok=True)

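  #downloading (fail early on HTTP errors instead of saving an error page)
  r = requests.get(data_url)
  r.raise_for_status()
  with open(data_root + 'local_archive.zip', 'wb') as f:
    f.write(r.content)

  #unpacking
  with zipfile.ZipFile(data_root + 'local_archive.zip') as z:
    z.extractall(path = data_root)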

  #at this point data is there, we are ready to get the list of files
  train_filenames = glob.glob(base_dir + 'train/*/*')
  val_filenames   = glob.glob(base_dir + 'test/*/*')

#whatever the original case, at this point we have the files
print('Available images for train: ' + str(len(train_filenames)))
print('Available images for validation: ' + str(len(val_filenames)))


Available images for train: 89
Available images for validation: 20
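
With only 89 training images, it is worth checking how they are spread over the five classes. The following is a minimal sketch, assuming the layout used above (one subfolder per class under `train/` and `test/`). The helper `count_images_per_class` and the class names in the example are hypothetical and not part of the original notebook.

```python
import os
from collections import Counter

def count_images_per_class(filenames):
    """Count files per class, taking the parent folder name as the class label."""
    return Counter(os.path.basename(os.path.dirname(f)) for f in filenames)

# Example with made-up paths (hypothetical class names, for illustration only)
example = [
    '/content/data/dibas_cell_db/train/species_a/img_001.png',
    '/content/data/dibas_cell_db/train/species_a/img_002.png',
    '/content/data/dibas_cell_db/train/species_b/img_001.png',
]
counts = count_images_per_class(example)
for label, n in sorted(counts.items()):
    print(label, n)
```

In the notebook, calling `count_images_per_class(train_filenames)` would report the real counts.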


# Data generators

In [None]:
from keras.preprocessing.image import ImageDataGenerator

# Data generators
train_datagen = ImageDataGenerator(
      rescale=1./255,
      #bacteria flipped upside-down or left-right
      #are still valid bacteria images
      horizontal_flip=True,
      vertical_flip=True,
      
      #large rotations leave empty corner triangles that must be
      #filled in, which introduces artifacts. Let's keep this low
      rotation_range=10,

      #moderate shifts, shear and zoom, just to add some
      #extra variety to the training images
      width_shift_range=0.2,
      height_shift_range=0.2,
      shear_range=0.2,
      zoom_range=0.2,
      fill_mode='nearest')

# Note that the validation data should not be augmented!
val_datagen = ImageDataGenerator(rescale=1./255)

#filling the generators
train_generator = train_datagen.flow_from_directory(
        # This is the target directory
        train_dir,
        # All images will be resized to image_shape
        target_size=image_shape,
        batch_size=batch_size,
        # Since we use categorical_crossentropy loss, we need categorical labels
        class_mode='categorical')

# batch_size should be 1 or any divisor of the validation (test) set size, so
# that the set splits into full batches and every image is used exactly once
val_generator = val_datagen.flow_from_directory(
        val_dir,
        target_size=image_shape,
        batch_size=5,
        class_mode='categorical')

Found 89 images belonging to 5 classes.
Found 20 images belonging to 5 classes.
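
The generators yield `ceil(n_images / batch_size)` batches per epoch, the last one possibly smaller than the others. Here is a small sketch of that arithmetic, using the image counts reported above (the helper names are hypothetical):

```python
import math

def steps_per_epoch(n_images, batch_size):
    """Number of batches needed to see every image once."""
    return math.ceil(n_images / batch_size)

def even_batch_sizes(n_images):
    """Batch sizes that divide the dataset exactly, so every batch is full."""
    return [b for b in range(1, n_images + 1) if n_images % b == 0]

print(steps_per_epoch(89, 20))   # training: 4 full batches + 1 batch of 9 images
print(steps_per_epoch(20, 5))    # validation: 4 full batches
print(even_batch_sizes(20))      # candidate batch sizes for the validation set
```

The first value corresponds to the `5/5` step counter shown in the training logs.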


# Transfer learning: ResNet50

## Get the architecture

In [None]:
#https://keras.io/api/applications/resnet/#resnet50-function
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.layers import BatchNormalization

#downloading the net and its weights trained on imagenet dataset
my_resnet = ResNet50(weights='imagenet', include_top=False, input_shape=(image_shape[0], image_shape[1], 3))
print("my_resnet has " + str(len(my_resnet.layers)) + " layers")

#layers are trainable by default, so we freeze the whole base:
#only the layers added on top will be trained
my_resnet.trainable = False

Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/resnet/resnet50_weights_tf_dim_ordering_tf_kernels_notop.h5
my_resnet has 175 layers


## Add the trainable top layers

In [None]:
from keras import models, layers

model = models.Sequential()
model.add(my_resnet)
model.add(layers.Flatten())
model.add(layers.Dense(units=5, activation='softmax'))


## Take a look at the network

In [None]:
model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
resnet50 (Functional)        (None, 7, 7, 2048)        23587712  
_________________________________________________________________
flatten_2 (Flatten)          (None, 100352)            0         
_________________________________________________________________
dense_2 (Dense)              (None, 5)                 501765    
Total params: 24,089,477
Trainable params: 501,765
Non-trainable params: 23,587,712
_________________________________________________________________
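
The counts in the summary can be checked by hand. The frozen ResNet50 base accounts for all the non-trainable parameters, while the trainable ones belong to the dense layer: one weight per flattened feature and class, plus one bias per class. A quick sketch of that arithmetic, with the shapes copied from the summary:

```python
# Output of the ResNet50 base for a 224x224 input, as reported by the summary
height, width, channels = 7, 7, 2048
n_classes = 5

flattened = height * width * channels             # size of the Flatten output
dense_params = flattened * n_classes + n_classes  # weights + biases

resnet_params = 23587712                          # from the summary (frozen)
total_params = resnet_params + dense_params

print(flattened, dense_params, total_params)
```

The dense layer alone therefore holds about half a million weights, all of which are fitted on the 89 training images.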


In [None]:
#the brave can inspect the full ResNet50 architecture
#print(my_resnet.summary())

## Model compile

In [None]:
from keras import optimizers
from keras.callbacks import EarlyStopping
from keras.callbacks import ReduceLROnPlateau

# Model compile / fit
model.compile(loss='categorical_crossentropy',
              optimizer='Adam',
              metrics=['accuracy'])

# early stopping: https://keras.io/callbacks/#earlystopping
#es = EarlyStopping(monitor='loss', mode='min', min_delta=0.001, verbose=1, patience=40, restore_best_weights=True)
reduce_lr = ReduceLROnPlateau(monitor='loss', mode='min', factor=0.9, patience=15, min_lr=1e-20, verbose=1, cooldown=3)
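
`ReduceLROnPlateau` multiplies the learning rate by `factor` whenever the monitored loss has not improved for `patience` epochs, and never goes below `min_lr`. The sketch below shows the values this produces, assuming Adam starts from its default learning rate of 0.001 (the `'Adam'` string above selects the default settings). The helper `lr_after_reductions` is hypothetical and only mirrors the multiplicative rule, not the callback's internal bookkeeping.

```python
def lr_after_reductions(initial_lr, factor, n_reductions, min_lr=0.0):
    """Learning rate after n multiplicative reductions, floored at min_lr."""
    return max(initial_lr * factor ** n_reductions, min_lr)

# Learning rate after 0, 1, 2 and 3 reductions with the settings used above
for k in range(4):
    print(k, lr_after_reductions(0.001, 0.9, k, min_lr=1e-20))
```

With `patience=15` and 50 epochs, only a few reductions can occur, so with `factor=0.9` the learning rate stays fairly close to its initial value.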

## Model fit

In [None]:
history = model.fit(
      train_generator,
      #steps_per_epoch=round(ntrain/batch_size,0),
      epochs=50,
      validation_data=val_generator,
      #validation_steps=20, #50
      #validation_steps=round(nval/batch_size,0),
      #callbacks=[es, reduce_lr],
      callbacks=[reduce_lr],
      verbose=2)

Epoch 1/50
5/5 - 4s - loss: 10.9098 - accuracy: 0.1011 - val_loss: 8.0347 - val_accuracy: 0.2000
Epoch 2/50
5/5 - 3s - loss: 7.7760 - accuracy: 0.1798 - val_loss: 8.0556 - val_accuracy: 0.4000
Epoch 3/50
5/5 - 3s - loss: 6.1489 - accuracy: 0.4382 - val_loss: 3.5873 - val_accuracy: 0.4000
Epoch 4/50
5/5 - 3s - loss: 2.8957 - accuracy: 0.3933 - val_loss: 3.9429 - val_accuracy: 0.5000
Epoch 5/50
5/5 - 3s - loss: 3.0133 - accuracy: 0.5056 - val_loss: 1.4495 - val_accuracy: 0.6000
Epoch 6/50
5/5 - 3s - loss: 1.7444 - accuracy: 0.4607 - val_loss: 1.7475 - val_accuracy: 0.6000
Epoch 7/50
5/5 - 3s - loss: 1.2730 - accuracy: 0.5730 - val_loss: 0.8499 - val_accuracy: 0.4500
Epoch 8/50
5/5 - 3s - loss: 0.8191 - accuracy: 0.6067 - val_loss: 0.6693 - val_accuracy: 0.7000
Epoch 9/50
5/5 - 3s - loss: 0.5691 - accuracy: 0.7865 - val_loss: 0.4334 - val_accuracy: 0.8000
Epoch 10/50
5/5 - 3s - loss: 0.4942 - accuracy: 0.8202 - val_loss: 0.4665 - val_accuracy: 0.8000
Epoch 11/50
5/5 - 3s - loss: 0.5728 - 
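
The object returned by `model.fit` stores the per-epoch metrics in `history.history`, a dictionary of lists. Below is a minimal sketch for locating the best epoch. It uses the validation losses of the first ten epochs logged above in place of the real dictionary, and the helper `best_epoch` is hypothetical.

```python
def best_epoch(metrics, key='val_loss'):
    """Return (1-based epoch, value) where the given metric is lowest."""
    values = metrics[key]
    idx = min(range(len(values)), key=values.__getitem__)
    return idx + 1, values[idx]

# Values copied from the log above (epochs 1 to 10)
logged = {
    'val_loss': [8.0347, 8.0556, 3.5873, 3.9429, 1.4495,
                 1.7475, 0.8499, 0.6693, 0.4334, 0.4665],
}
epoch, value = best_epoch(logged)
print(epoch, value)
```

In the notebook, `best_epoch(history.history)` would apply the same lookup to the full run.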

# CNN from scratch

## Define the architecture

In [None]:
from keras import models, layers
from keras.layers import Conv2D, MaxPooling2D, Flatten

model2 = models.Sequential()
model2.add(Conv2D(32, (3, 3), padding="same", activation="relu", input_shape=(image_shape[0], image_shape[1], 3)))
model2.add(MaxPooling2D(pool_size=(2, 2)))
model2.add(Conv2D(64, (3, 3), padding="same", activation="relu"))
model2.add(MaxPooling2D(pool_size=(2, 2)))
model2.add(Conv2D(128, (3, 3), padding="same", activation="relu"))
model2.add(MaxPooling2D(pool_size=(2, 2)))
model2.add(layers.Flatten())
model2.add(layers.Dense(units=5, activation='softmax'))
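
Each `Conv2D` layer has one 3x3 kernel per pair of input and output channels, plus one bias per filter, and each 2x2 max pooling halves the spatial size. Here is a sketch of the resulting parameter counts and feature-map sizes for the architecture above (the helper `conv2d_params` is hypothetical):

```python
def conv2d_params(kernel, in_channels, filters):
    """Weights (kernel x kernel x in x out) plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters

size, in_channels = 224, 3          # input images are 224x224 RGB
for filters in (32, 64, 128):
    params = conv2d_params(3, in_channels, filters)
    size //= 2                      # 'same' padding keeps the size, pooling halves it
    print(filters, params, size)
    in_channels = filters

flattened = size * size * in_channels   # number of features fed to the dense layer
print(flattened)
```

The three convolutional blocks are small compared with the final dense layer, which receives every flattened feature.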


## Take a look at the network

In [None]:
model2.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d (Conv2D)              (None, 224, 224, 32)      896       
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 112, 112, 32)      0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 112, 112, 64)      18496     
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 56, 56, 64)        0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 56, 56, 128)       73856     
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 28, 28, 128)       0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 100352)           

## Model compile

In [None]:
model2.compile(loss='categorical_crossentropy',
              optimizer='Adam',
              metrics=['accuracy'])

## Model fit

In [None]:
history2 = model2.fit(
      train_generator,
      #steps_per_epoch=round(ntrain/batch_size,0),
      epochs=50,
      validation_data=val_generator,
      #validation_steps=20, #50
      #validation_steps=round(nval/batch_size,0),
      #callbacks=[es, reduce_lr],
      verbose=2)

Epoch 1/50
5/5 - 3s - loss: 2.2789 - accuracy: 0.2135 - val_loss: 1.6747 - val_accuracy: 0.3500
Epoch 2/50
5/5 - 3s - loss: 1.6363 - accuracy: 0.2921 - val_loss: 1.5655 - val_accuracy: 0.2000
Epoch 3/50
5/5 - 3s - loss: 1.5535 - accuracy: 0.2022 - val_loss: 1.4552 - val_accuracy: 0.4000
Epoch 4/50
5/5 - 3s - loss: 1.3951 - accuracy: 0.4157 - val_loss: 1.1937 - val_accuracy: 0.6500
Epoch 5/50
5/5 - 3s - loss: 1.2531 - accuracy: 0.4157 - val_loss: 1.0711 - val_accuracy: 0.5500
Epoch 6/50
5/5 - 3s - loss: 1.0282 - accuracy: 0.5618 - val_loss: 0.8245 - val_accuracy: 0.8000
Epoch 7/50
5/5 - 3s - loss: 0.7169 - accuracy: 0.7079 - val_loss: 0.5892 - val_accuracy: 0.7000
Epoch 8/50
5/5 - 3s - loss: 0.5246 - accuracy: 0.7416 - val_loss: 0.5533 - val_accuracy: 0.6500
Epoch 9/50
5/5 - 3s - loss: 0.6383 - accuracy: 0.6742 - val_loss: 0.8055 - val_accuracy: 0.6000
Epoch 10/50
5/5 - 3s - loss: 0.7292 - accuracy: 0.6404 - val_loss: 0.6807 - val_accuracy: 0.6000
Epoch 11/50
5/5 - 3s - loss: 0.6582 - a