# Scalable image classification with TensorFlow

Image classification via CNN - Turkish lira

This work is the project for the Algorithms for Massive Datasets exam of the Data Science master's degree (Università degli Studi di Milano, Italy).

In [None]:
from google.colab import files

uploaded = files.upload()

Saving kaggle.json to kaggle.json


In [None]:
!mkdir -p ~/.kaggle/
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

In [None]:
from kaggle.api.kaggle_api_extended import KaggleApi

In [None]:
api = KaggleApi()
api.authenticate()

In [None]:
api.dataset_download_files(dataset='baltacifatih/turkish-lira-banknote-dataset', path='data/', quiet=False, unzip=True)

  0%|          | 5.00M/3.50G [00:00<01:53, 33.0MB/s]

Downloading turkish-lira-banknote-dataset.zip to data


100%|██████████| 3.50G/3.50G [01:04<00:00, 58.2MB/s]





In [None]:
import os
import datetime
import numpy as np
import tensorflow as tf
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import Dense, Activation, Flatten, Dropout, BatchNormalization, Conv2D, MaxPooling2D
from tensorflow.keras.models import Sequential

In [None]:
DATASET_PATH = "data"

To enable distributed training, set the `TF_CONFIG` environment variable before creating the strategy, for example:

```python
import json

os.environ['TF_CONFIG'] = json.dumps({
    'cluster': {
        'worker': ["localhost:20000", "localhost:20001"]
    },
    'task': {'type': 'worker', 'index': 0}
})
```

Replace "localhost:20000" and "localhost:20001" with the `host:port` addresses of your workers. More info [here](https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras).
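
Every worker needs the same cluster description but its own task index. As a minimal sketch of how such per-worker configurations could be generated, the helper below builds the JSON string for each worker. The function name `make_tf_config` and the addresses are hypothetical placeholders, not part of the original notebook.

```python
import json


def make_tf_config(worker_addresses, task_index):
    """Return a TF_CONFIG JSON string for one worker (hypothetical helper)."""
    if not 0 <= task_index < len(worker_addresses):
        raise ValueError("task_index must refer to an entry in worker_addresses")
    return json.dumps({
        "cluster": {"worker": list(worker_addresses)},
        "task": {"type": "worker", "index": task_index},
    })


# Placeholder addresses: replace them with the host:port pairs of your machines.
workers = ["10.0.0.1:20000", "10.0.0.2:20000"]
configs = [make_tf_config(workers, i) for i in range(len(workers))]

for cfg in configs:
    print(cfg)
```

On worker `i`, the string `configs[i]` would be assigned to `os.environ['TF_CONFIG']` before the strategy is created.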

Let's define the strategy. `MultiWorkerMirroredStrategy` allows for synchronized training across multiple machines, each of which may have multiple GPUs. See [here](https://www.tensorflow.org/api_docs/python/tf/distribute/experimental/MultiWorkerMirroredStrategy) and [here](https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras) for more details.

In [None]:
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

We resize each picture to 64x64 pixels and use a batch size of 64 images per worker. This example uses a single worker; `NUM_WORKERS` must be updated if you specify more than one worker in the `TF_CONFIG` environment variable.

In [None]:
IMG_WIDTH = 64
IMG_HEIGHT = 64

NUM_WORKERS = 1
PER_WORKER_BATCH_SIZE = 64
GLOBAL_BATCH_SIZE = PER_WORKER_BATCH_SIZE * NUM_WORKERS
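
Instead of hard-coding the number of workers, it could be read from `TF_CONFIG`. The following is only a sketch under that assumption; the helper name `num_workers_from_tf_config` is hypothetical and it falls back to a single worker when the variable is not set.

```python
import json
import os


def num_workers_from_tf_config(default=1):
    """Count the workers listed in TF_CONFIG, or return `default` if it is unset."""
    raw = os.environ.get("TF_CONFIG")
    if not raw:
        return default
    cluster = json.loads(raw).get("cluster", {})
    # An empty or missing worker list also falls back to the default.
    return len(cluster.get("worker", [])) or default


PER_WORKER_BATCH_SIZE = 64
NUM_WORKERS = num_workers_from_tf_config()
GLOBAL_BATCH_SIZE = PER_WORKER_BATCH_SIZE * NUM_WORKERS
print(NUM_WORKERS, GLOBAL_BATCH_SIZE)
```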

The input pipeline scales because we never load all images into main memory; instead we use the `tf.data` API to read them in batches. For this reason we do not use the train/test split provided in the dataset's text files, but randomly split the data 80%/20%.

In [None]:
train_ds = tf.keras.preprocessing.image_dataset_from_directory(
  DATASET_PATH,
  validation_split=0.2,
  subset="training",
  seed=123,
  image_size=(IMG_HEIGHT, IMG_WIDTH),  # image_size expects (height, width)
  label_mode='categorical',
  batch_size=GLOBAL_BATCH_SIZE)

Found 6000 files belonging to 6 classes.
Using 4800 files for training.


In [None]:
test_ds = tf.keras.preprocessing.image_dataset_from_directory(
  DATASET_PATH,
  validation_split=0.2,
  subset="validation",
  seed=123,
  image_size=(IMG_HEIGHT, IMG_WIDTH),  # image_size expects (height, width)
  label_mode='categorical',
  batch_size=GLOBAL_BATCH_SIZE)

Found 6000 files belonging to 6 classes.
Using 1200 files for validation.


In [None]:
class_names = train_ds.class_names
print(class_names)

['10', '100', '20', '200', '5', '50']


In [None]:
num_classes = len(class_names)

We need to scale the RGB values to the range [0, 1], which is more convenient for a neural network. This scaling is done on the fly while the images are read from disk in batches.

In [None]:
def scale(image, label):
    image = tf.cast(image, tf.float32)
    image /= 255

    return image, label

We prefetch and cache the dataset in order to improve performance, as suggested [here](https://www.tensorflow.org/guide/data_performance).

In [None]:
AUTOTUNE = tf.data.experimental.AUTOTUNE

# Cache before repeat(): caching an infinitely repeated dataset would keep growing the cache.
train_ds = train_ds.map(scale, num_parallel_calls=AUTOTUNE).cache().repeat().prefetch(buffer_size=AUTOTUNE)
test_ds = test_ds.map(scale, num_parallel_calls=AUTOTUNE).cache().prefetch(buffer_size=AUTOTUNE)

We now distribute the input across multiple devices. See [here](https://www.tensorflow.org/tutorials/distribute/input) for more details.

In [None]:
dist_dataset = strategy.experimental_distribute_dataset(train_ds)

We set up TensorBoard to monitor training.

In [None]:
%load_ext tensorboard

In [None]:
%tensorboard --logdir logs

In [None]:
logdir = os.path.join("logs", datetime.datetime.now().strftime("%Y%m%d-%H%M%S"))
tensorboard_callback = tf.keras.callbacks.TensorBoard(logdir, histogram_freq=1)

Let's define the model. We build the architecture of our CNN from VGG-style blocks: two 3x3 convolutions followed by 2x2 max pooling.

In [None]:
with strategy.scope():
    model = Sequential()

    # VGG Blocks
    model.add(Conv2D(32, (3, 3), padding='same', input_shape=(IMG_HEIGHT, IMG_WIDTH, 3)))
    model.add(Activation('relu'))
    model.add(Conv2D(32, (3, 3)))
    model.add(Activation('relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))

    model.add(Conv2D(64, (3, 3), padding='same'))
    model.add(Activation('relu'))
    model.add(Conv2D(64, (3, 3)))
    model.add(Activation('relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))

    model.add(Conv2D(128, (3, 3), padding='same'))
    model.add(Activation('relu'))
    model.add(Conv2D(128, (3, 3)))
    model.add(Activation('relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))

    model.add(Flatten())

    # Dense layers
    model.add(Dense(128))
    model.add(Activation('relu'))
    model.add(Dropout(0.5))
    model.add(Dense(num_classes, activation='softmax'))

    model.compile(optimizer=tf.keras.optimizers.Adam(), 
                  loss=tf.keras.losses.CategoricalCrossentropy(),
                  metrics=["accuracy"])
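
To see how the blocks shrink the feature maps, the arithmetic can be traced by hand. The sketch below assumes stride-1 convolutions and non-overlapping 2x2 pooling (the Keras defaults used above); the helper names `conv_out` and `pool_out` are illustrative only.

```python
def conv_out(size, kernel=3, padding="valid"):
    """Spatial size after a stride-1 convolution."""
    return size if padding == "same" else size - kernel + 1


def pool_out(size, pool=2):
    """Spatial size after non-overlapping max pooling without padding."""
    return size // pool


size = 64  # input images are 64x64
for filters in (32, 64, 128):
    size = conv_out(size, padding="same")   # first convolution keeps the size
    size = conv_out(size, padding="valid")  # second convolution trims 2 pixels
    size = pool_out(size)                   # pooling halves it (rounding down)
    print(f"after the {filters}-filter block: {size}x{size}x{filters}")

flat_features = size * size * 128           # input width of the first Dense layer
dense_params = flat_features * 128 + 128    # weights plus biases of Dense(128)
print(flat_features, dense_params)
```

Under these assumptions the feature maps go from 64x64 to 31x31, 14x14 and 6x6, so `Flatten` yields 4608 features and the first dense layer holds 589,952 parameters.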

We use early stopping to end training once the training loss stops improving, which speeds up training and limits overfitting.

In [None]:
es = EarlyStopping(monitor='loss', verbose=1, mode='min', patience=2, min_delta=0.01)
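
The effect of `patience` and `min_delta` can be illustrated with a few lines of plain Python. This is a simplified re-implementation of the stopping rule for illustration, not the Keras source, and the loss values are made up.

```python
def early_stopping_epoch(losses, patience=2, min_delta=0.01):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best - min_delta:
            # Improvement of more than min_delta: remember it and reset the counter.
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Made-up losses: epochs 5 and 6 improve by less than 0.01 over epoch 4.
losses = [1.20, 0.80, 0.50, 0.30, 0.295, 0.292, 0.10]
print(early_stopping_epoch(losses))
```

With these values training stops at epoch 6, so the large drop at epoch 7 is never reached; a larger `patience` trades training time for robustness to such plateaus.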

Let's train the model. Since the dataset is perfectly balanced, we skip the validation set during training and evaluate on the held-out split afterwards.

In [None]:
history = model.fit(dist_dataset,
                    epochs=15,
                    steps_per_epoch=75,
                    callbacks=[es, tensorboard_callback])

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 00011: early stopping
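
Because the training dataset repeats indefinitely, `steps_per_epoch` tells Keras where one epoch ends. The value 75 matches 4800 training images divided by a global batch of 64. A small sketch of that arithmetic follows; the helper name `steps_per_epoch` is hypothetical, and rounding up for a partial last batch is an assumption.

```python
import math


def steps_per_epoch(num_examples, per_worker_batch_size, num_workers=1):
    """Number of global batches needed to see every example once."""
    global_batch = per_worker_batch_size * num_workers
    return math.ceil(num_examples / global_batch)


print(steps_per_epoch(4800, 64))                 # single worker, as in this notebook
print(steps_per_epoch(4800, 64, num_workers=2))  # two workers double the global batch
```

With one worker this gives 75 steps; with two workers the global batch grows to 128 and 38 steps cover the data.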


Let's evaluate it on the test set.

In [None]:
model.evaluate(test_ds)



[0.06516726315021515, 0.9800000190734863]
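
The list returned by `model.evaluate` holds the loss followed by the compiled metrics, here the accuracy. As a rough sketch, the numbers above can be turned into a count of correctly classified images, assuming all 1200 held-out images were evaluated.

```python
# Values copied from the evaluation output above: [loss, accuracy].
results = [0.06516726315021515, 0.9800000190734863]
loss, accuracy = results

num_test_images = 1200  # size of the held-out split reported earlier
correct = round(accuracy * num_test_images)
errors = num_test_images - correct

print(f"loss={loss:.4f}, accuracy={accuracy:.2%}")
print(f"{correct} correct, {errors} misclassified")
```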

In [None]:
model.save("cnn.h5")