# Flower Classification

In this project I take part in the [Petals to the Metal](https://www.kaggle.com/c/tpu-getting-started/overview) competition, a "getting started" competition on [Kaggle](https://www.kaggle.com). The objective is to classify images of different flowers.

## Setup
### GPU Setup
This cell lets me use my small GPU (NVIDIA GTX 1050 Ti) or my CPU to run TensorFlow. I only use my big GPU when I need it, because TensorFlow can cause issues with other programs.

In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'
import os
os.environ['TF_MIN_GPU_MULTIPROCESSOR_COUNT']='4'
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())

[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 13566780651794817565
, name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
memory_limit: 17179869184
locality {
}
incarnation: 15892667402495347456
physical_device_desc: "device: XLA_CPU device"
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 9883535296
locality {
  bus_id: 1
  links {
  }
}
incarnation: 5822688444456132874
physical_device_desc: "device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:26:00.0, compute capability: 7.5"
, name: "/device:GPU:1"
device_type: "GPU"
memory_limit: 3134482023
locality {
  bus_id: 1
  links {
  }
}
incarnation: 497772350763305170
physical_device_desc: "device: 1, name: GeForce GTX 1050 Ti, pci bus id: 0000:25:00.0, compute capability: 6.1"
, name: "/device:XLA_GPU:0"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality {
}
incarnation: 12286538907267088110
physical_device_desc: "device: XLA_GPU device"
, name: "/device:XLA_GPU:1"
device_type: "
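
The device choice described in this section could be sketched as follows. `CUDA_VISIBLE_DEVICES` has to be set before TensorFlow is imported, and once a single GPU is selected TensorFlow will list it as `GPU:0`. The helper `visible_devices_for` and the labels `'big'`, `'small'`, `'both'` and `'cpu'` are made up for illustration; the indices 0 and 1 are taken from the device listing above and may differ on another machine.

```python
import os

# Hypothetical mapping from a readable label to a CUDA_VISIBLE_DEVICES value.
# Indices follow the device listing above (0 = RTX 2080 Ti, 1 = GTX 1050 Ti).
DEVICE_CHOICES = {
    'big': '0',     # only the RTX 2080 Ti is visible
    'small': '1',   # only the GTX 1050 Ti is visible
    'both': '0,1',  # both GPUs are visible (the setting used in the cell above)
    'cpu': '-1',    # no GPU is visible, so TensorFlow falls back to the CPU
}

def visible_devices_for(choice):
    '''Return the CUDA_VISIBLE_DEVICES string for a label (illustrative helper).'''
    return DEVICE_CHOICES[choice]

# This assignment must run before `import tensorflow`.
os.environ['CUDA_VISIBLE_DEVICES'] = visible_devices_for('small')
print(os.environ['CUDA_VISIBLE_DEVICES'])
```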

### Imports

In [2]:
import tensorflow as tf
import math, re, os
import numpy as np
from matplotlib import pyplot as plt
#from sklearn.
AUTO = tf.data.experimental.AUTOTUNE
IMAGE_SIZE = [512, 512]
#BATCH_SIZE = 16 * strategy.num_replicas_in_sync

### TPU Strategy

Using TPUs requires the use of an explicit distribution strategy. The workload needs to be distributed equally between all of the TPU cores.

The first step is to find out if a TPU is available:

In [3]:
try:
    # Try to detect a TPU; the resolver raises ValueError if none is found
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    print('Running on TPU', tpu.master())
except ValueError:
    tpu = None

Next, if a TPU is available, a TPU distribution strategy has to be created; otherwise the default strategy is used:

In [4]:
if tpu:
    tf.config.experimental.connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.experimental.TPUStrategy(tpu)
else:
    strategy = tf.distribute.get_strategy()
print('REPLICAS: ' , strategy.num_replicas_in_sync)

REPLICAS:  1


In [5]:
BATCH_SIZE = 16 * strategy.num_replicas_in_sync
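
The batch size above scales with the number of replicas, so each replica always processes 16 images per step. A small sketch of that arithmetic, assuming either the single replica reported above or the 8 cores of a TPU (the function name `global_batch_size` is made up for illustration):

```python
PER_REPLICA_BATCH = 16  # per-replica batch size used in this notebook

def global_batch_size(num_replicas, per_replica=PER_REPLICA_BATCH):
    '''Total number of images consumed per training step across all replicas.'''
    return per_replica * num_replicas

for replicas in (1, 8):
    print(replicas, 'replica(s):', global_batch_size(replicas), 'images per step')
```

On my GPUs the strategy reports one replica, so the global batch size stays at 16.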

### Filenames

In [6]:
path = r'.\tfrecords-jpeg-512x512'
TRAINING_FILENAMES = tf.io.gfile.glob(path + '/train/*.tfrec')
VALIDATION_FILENAMES = tf.io.gfile.glob(path + '/val/*.tfrec')
TEST_FILENAMES = tf.io.gfile.glob(path + '/test/*.tfrec')

### Helper Functions
These functions can be found in the `petal_helper.py` file. I will be re-making them here in an effort to understand what they do. They are primarily used to prepare the data for use by the TPU.

With a TPU it is important to always keep it supplied with data to analyze. For this reason the data is stored as `tfrec` files, which contain both the pictures and, where applicable, the labels (the holdout set contains the unique identifiers of the pictures instead), as opposed to `jpeg` pictures with a separate `csv` holding the labels. The data is divided into 16 `tfrec` files: the TPU that is used has 8 distinct cores, and it is recommended to split the dataset into twice as many files as there are TPU cores to maximize efficiency. To minimize loading times, the datasets are also loaded into "close proximity" to the TPU, in this case the same bucket on Google Cloud.

It will be interesting to see whether data in the same configuration can be used on my GPUs and, if not, what changes I will have to make to adapt the code. I expect the code may run smoothly on my big GPU (2080 Ti) but cause issues on my small GPU (1050 Ti) because of its much lower memory.
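
The "twice the number of cores" rule of thumb can be written out as a short calculation. This only illustrates the arithmetic; it is not how `tf.data` actually schedules file reads, and the function names and shard filenames are hypothetical.

```python
def recommended_num_files(num_cores, files_per_core=2):
    '''Number of files suggested by the "twice the number of cores" rule of thumb.'''
    return num_cores * files_per_core

def split_evenly(filenames, num_cores):
    '''Round-robin split of a file list into one group per core (illustration only).'''
    return [filenames[core::num_cores] for core in range(num_cores)]

num_files = recommended_num_files(8)
files = ['shard-{:02d}.tfrec'.format(i) for i in range(num_files)]  # hypothetical names
groups = split_evenly(files, 8)
print(num_files, 'files,', len(groups[0]), 'per core')
```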

#### Loading the datasets from the `tfrec` files

In [7]:
def decode_image(image_data):
    '''Turn image data into an array of numbers
    Args:
        image_data: jpeg image extracted from a tfrecord file
    Returns:
        tensor containing the data of the image
        
    '''
    image = tf.image.decode_jpeg(image_data, channels = 3)
    image = tf.cast(image, tf.float32) / 255.0 # convert image to floats in [0, 1] range
    image = tf.reshape(image, [*IMAGE_SIZE , 3]) # explicit size needed for TPU
    return image

In [8]:
def read_labeled_tfrecord(example):
    LABELED_TFREC_FORMAT = {
        'image': tf.io.FixedLenFeature([], tf.string), # tf.string means bytestring
        'class': tf.io.FixedLenFeature([], tf.int64) # shape [] means single element
    }
    example = tf.io.parse_single_example(example, LABELED_TFREC_FORMAT)
    image = decode_image(example['image'])
    label = tf.cast(example['class'], tf.int32)
    return image, label

def read_unlabeled_tfrecord(example):
    UNLABELED_TFREC_FORMAT = {
        'image': tf.io.FixedLenFeature([], tf.string),
        'id': tf.io.FixedLenFeature([], tf.string)
        # no class, because this will be used on the test dataset
    }
    example = tf.io.parse_single_example(example, UNLABELED_TFREC_FORMAT)
    image = decode_image(example['image'])
    idnum = example['id']
    return image, idnum

In [9]:
def load_dataset(filenames, labeled = True, ordered = False):
    # Read from tfrecords. For optimal performance, read from multiple files at once and
    # disregard data order. Order does not matter since we will be shuffling the data anyway.
    # It's generally a good idea to split the data into 16 parts when working with a TPU.
    ignore_order = tf.data.Options()
    if not ordered:
        ignore_order.experimental_deterministic = False # disable ordering; use whatever is loaded first
    
    dataset = tf.data.TFRecordDataset(filenames, num_parallel_reads = AUTO) 
    dataset = dataset.with_options(ignore_order)
    dataset = dataset.map(read_labeled_tfrecord if labeled else read_unlabeled_tfrecord,
                          num_parallel_calls = AUTO)
    # returns a dataset of (image, label) pairs if labeled = True or (image, id) if not labeled
    return dataset


In [10]:
load_dataset(TRAINING_FILENAMES)

<ParallelMapDataset shapes: ((512, 512, 3), ()), types: (tf.float32, tf.int32)>

#### Augmenting the data

To reduce overfitting, the following function flips the images randomly. This could later be "augmented" by additional operations such as cropping or rotating the images, which might increase accuracy.

The `dataset.prefetch(AUTO)` call in the `get_training_dataset` function below lets the data pipeline code execute on the CPU while the TPU performs the gradient descent calculations, so preparing the data should not slow down training. It will be interesting to see if this also applies when running the code on the GPU.

In [11]:
def data_augment(image, label):
    image = tf.image.random_flip_left_right(image)
    return image, label
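
What a left-right flip does to the pixel data can be seen with a small NumPy stand-in. This is not `tf.image.random_flip_left_right` itself, only a sketch of the same idea: with probability 0.5 the width axis of a (height, width, channels) array is reversed. The tiny 2x3 "image" and the seed are made up for illustration.

```python
import numpy as np

def random_flip_left_right(image, rng):
    '''Reverse the width axis of a (height, width, channels) array with probability 0.5.'''
    if rng.random() < 0.5:
        return image[:, ::-1, :]
    return image

image = np.arange(6, dtype=np.float32).reshape(2, 3, 1)  # toy image: 2 rows, 3 columns, 1 channel
flipped = image[:, ::-1, :]                              # deterministic flip, for comparison

print(image[..., 0])    # rows read 0 1 2 and 3 4 5
print(flipped[..., 0])  # rows read 2 1 0 and 5 4 3

rng = np.random.default_rng(0)
augmented = random_flip_left_right(image, rng)  # either the original or the flipped image
```

The label is left untouched, because a mirrored flower still belongs to the same class.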

### Making the Datasets for Training, Validation and Testing

In [12]:
def get_training_dataset():
    dataset = load_dataset(TRAINING_FILENAMES, labeled = True)
    dataset = dataset.map(data_augment, num_parallel_calls = AUTO)
    dataset = dataset.repeat() # repeat the dataset as needed so the TPU is always fully utilized
    dataset = dataset.shuffle(2048) # shuffle the order of the data points to combat overfitting
    dataset = dataset.batch(BATCH_SIZE) # set the batch size; probably needs to be adjusted when using the GPU
    dataset = dataset.prefetch(AUTO) # prefetch the next batch while training (autotune prefetch buffer size)
    return dataset

def get_validation_dataset(ordered = False):
    dataset = load_dataset(VALIDATION_FILENAMES, labeled = True, ordered = ordered)
    dataset = dataset.batch(BATCH_SIZE)
    dataset = dataset.cache() # cache() returns a new dataset, so the result has to be assigned
    dataset = dataset.prefetch(AUTO)
    return dataset

def get_test_dataset(ordered = False):
    dataset = load_dataset(TEST_FILENAMES, labeled = False, ordered = ordered)
    dataset = dataset.batch(BATCH_SIZE)
    dataset = dataset.prefetch(AUTO)
    return dataset

### Loading the Data

The three datasets are created here; the classes of flower that they contain are listed in the next section.

In [13]:
ds_train = get_training_dataset()
ds_valid = get_validation_dataset()
ds_test = get_test_dataset()

print('Training: ', ds_train)
print('Validation: ', ds_valid)
print('Test: ', ds_test)

Training:  <PrefetchDataset shapes: ((None, 512, 512, 3), (None,)), types: (tf.float32, tf.int32)>
Validation:  <PrefetchDataset shapes: ((None, 512, 512, 3), (None,)), types: (tf.float32, tf.int32)>
Test:  <PrefetchDataset shapes: ((None, 512, 512, 3), (None,)), types: (tf.float32, tf.string)>


### Flower Classes

In [14]:
CLASSES = ['pink primrose',    'hard-leaved pocket orchid', 'canterbury bells', 'sweet pea',     'wild geranium',     'tiger lily',           'moon orchid',              'bird of paradise', 'monkshood',        'globe thistle',         # 00 - 09
           'snapdragon',       "colt's foot",               'king protea',      'spear thistle', 'yellow iris',       'globe-flower',         'purple coneflower',        'peruvian lily',    'balloon flower',   'giant white arum lily', # 10 - 19
           'fire lily',        'pincushion flower',         'fritillary',       'red ginger',    'grape hyacinth',    'corn poppy',           'prince of wales feathers', 'stemless gentian', 'artichoke',        'sweet william',         # 20 - 29
           'carnation',        'garden phlox',              'love in the mist', 'cosmos',        'alpine sea holly',  'ruby-lipped cattleya', 'cape flower',              'great masterwort', 'siam tulip',       'lenten rose',           # 30 - 39
           'barberton daisy',  'daffodil',                  'sword lily',       'poinsettia',    'bolero deep blue',  'wallflower',           'marigold',                 'buttercup',        'daisy',            'common dandelion',      # 40 - 49
           'petunia',          'wild pansy',                'primula',          'sunflower',     'lilac hibiscus',    'bishop of llandaff',   'gaura',                    'geranium',         'orange dahlia',    'pink-yellow dahlia',    # 50 - 59
           'cautleya spicata', 'japanese anemone',          'black-eyed susan', 'silverbush',    'californian poppy', 'osteospermum',         'spring crocus',            'iris',             'windflower',       'tree poppy',            # 60 - 69
           'gazania',          'azalea',                    'water lily',       'rose',          'thorn apple',       'morning glory',        'passion flower',           'lotus',            'toad lily',        'anthurium',             # 70 - 79
           'frangipani',       'clematis',                  'hibiscus',         'columbine',     'desert-rose',       'tree mallow',          'magnolia',                 'cyclamen ',        'watercress',       'canna lily',            # 80 - 89
           'hippeastrum ',     'bee balm',                  'pink quill',       'foxglove',      'bougainvillea',     'camellia',             'mallow',                   'mexican petunia',  'bromelia',         'blanket flower',        # 90 - 99
           'trumpet creeper',  'blackberry lily',           'common tulip',     'wild rose']

In [15]:
print('Number of classes: {}'.format(len(CLASSES)))
print('First 5 classes, alphabetical:')
for name in sorted(CLASSES)[:5]:
    print(name)

Number of classes: 104
First 5 classes, alphabetical:
alpine sea holly
anthurium
artichoke
azalea
balloon flower


### Helper Function: Count the Number of Images

To find out how many images are in a given dataset, the individual filenames can be used. An example of a filename is `00-512x512-798.tfrec`; the number 798 indicates that the file contains 798 images. A regular expression can therefore extract this number from each filename, and summing the numbers gives the total for a given dataset:

In [16]:
def count_data_items(filenames):
    n = np.sum([int(re.compile(r'-(\d*)\.').search(filename).group(1)) for filename in filenames])
    return n
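
The filename convention can be checked without any data files. The sketch below uses the same pattern as `count_data_items` on the example filename from the text; the second filename and the helper name `images_in_file` are hypothetical.

```python
import re

# Same pattern as in count_data_items: a '-' followed by digits and then a '.'
COUNT_PATTERN = re.compile(r'-(\d*)\.')

def images_in_file(filename):
    '''Extract the image count encoded at the end of a tfrec filename.'''
    return int(COUNT_PATTERN.search(filename).group(1))

filenames = ['00-512x512-798.tfrec', '01-512x512-798.tfrec']  # second name is hypothetical
total = sum(images_in_file(name) for name in filenames)
print(total)
```

The `-512x512` part of the name is skipped because it is not followed by a dot, so only the final number is captured.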

In [17]:
NUM_TRAINING_IMAGES = count_data_items(TRAINING_FILENAMES)
NUM_TEST_IMAGES = count_data_items(TEST_FILENAMES)
NUM_VALIDATION_IMAGES = count_data_items(VALIDATION_FILENAMES)

## The Deep Learning Models
### VGG16 based model

Now a first version of the deep learning model will be set up. Its base is a VGG16 model pre-trained on the ImageNet dataset. This base is used for transfer learning: additional layers are added on top and trained on the dataset of flowers, while the pre-trained base is frozen and used as is.

In [18]:
with strategy.scope(): 
    pretrained_model = tf.keras.applications.VGG16(
    weights = 'imagenet',
    include_top = False,
    input_shape = [*IMAGE_SIZE, 3]
    )
    pretrained_model.trainable = False
    model = tf.keras.Sequential([
         # Transfer learning using ImageNet as a base to extract features from the images
         pretrained_model,
         # attach a new head to act as a classifier
         tf.keras.layers.GlobalAveragePooling2D(),
         tf.keras.layers.Dense(len(CLASSES), activation = 'softmax')
     ])
    model.compile(
            optimizer = 'adam',
            loss = 'sparse_categorical_crossentropy',
            metrics = ['sparse_categorical_accuracy']
            
        )
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
vgg16 (Model)                (None, 16, 16, 512)       14714688  
_________________________________________________________________
global_average_pooling2d (Gl (None, 512)               0         
_________________________________________________________________
dense (Dense)                (None, 104)               53352     
Total params: 14,768,040
Trainable params: 53,352
Non-trainable params: 14,714,688
_________________________________________________________________
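
The parameter counts in the summary can be reproduced by hand, which confirms that only the new classification head is trained. All numbers below are taken from the summary above; the variable names are made up for illustration.

```python
features = 512           # channels left after GlobalAveragePooling2D
num_classes = 104        # len(CLASSES)
base_params = 14714688   # frozen VGG16 base

# A Dense layer has one weight per (input, output) pair plus one bias per output.
dense_params = features * num_classes + num_classes
total_params = base_params + dense_params

print('Trainable:', dense_params)
print('Total:', total_params)
```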


#### Training the model

In [19]:
EPOCHS = 12
STEPS_PER_EPOCH = count_data_items(TRAINING_FILENAMES) // BATCH_SIZE

history = model.fit(
    ds_train,
    validation_data = ds_valid,
    epochs = EPOCHS,
    steps_per_epoch = STEPS_PER_EPOCH
)

Epoch 1/12

KeyboardInterrupt: 
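
Because the training dataset repeats indefinitely, Keras needs `steps_per_epoch` to know where one epoch ends. A sketch of the calculation, using a made-up image count rather than the real value of `NUM_TRAINING_IMAGES`:

```python
num_training_images = 12010  # hypothetical count, for illustration only
batch_size = 16              # BATCH_SIZE with a single replica

steps_per_epoch = num_training_images // batch_size
leftover = num_training_images - steps_per_epoch * batch_size

print(steps_per_epoch, 'steps per epoch,', leftover, 'images left over')
```

The integer division drops the incomplete last batch; with `repeat()` in the pipeline those leftover images simply show up in the next epoch.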

#### Making a Submission

Now I will use the model to make a submission to the Kaggle competition in order to judge its efficacy.

In [None]:
test_ds = get_test_dataset(ordered = True)

test_images_ds = test_ds.map(lambda image, idnum: image)
probabilities = model.predict(test_images_ds)
predictions = np.argmax(probabilities, axis = -1)
print(predictions)
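
The step from probabilities to predictions can be illustrated on a tiny example. The probability values are made up, and only the first three entries of `CLASSES` are used to keep the arrays small.

```python
import numpy as np

class_names = ['pink primrose', 'hard-leaved pocket orchid', 'canterbury bells']

# One row per image, one column per class (made-up values)
probabilities = np.array([
    [0.1, 0.7, 0.2],
    [0.8, 0.1, 0.1],
])

# argmax along the last axis picks the most probable class index per image
predictions = np.argmax(probabilities, axis=-1)
predicted_names = [class_names[i] for i in predictions]

print(predictions)
print(predicted_names)
```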

In [None]:
# Get image ids from test set and convert to unicode strings
test_ids_ds = test_ds.map(lambda image, idnum: idnum).unbatch()
test_ids = next(iter(test_ids_ds.batch(NUM_TEST_IMAGES))).numpy().astype('U')
# Write the submission file
np.savetxt(
    'submission.csv',
    np.rec.fromarrays([test_ids, predictions]),
    fmt = ['%s', '%d'],
    delimiter = ',',
    header = 'id,label', # no space, so the second column is named exactly 'label'
    comments = '',
)
!head submission.csv
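
The layout of the resulting file can be previewed without running the model. This sketch assumes the competition expects a two-column CSV with the header `id,label`; the ids and class indices are made up, and the standard `csv` module is used here instead of `np.savetxt`.

```python
import csv
import io

test_ids = ['abc123', 'def456']  # hypothetical image ids
predictions = [67, 103]          # hypothetical class indices

buffer = io.StringIO()           # in-memory stand-in for submission.csv
writer = csv.writer(buffer, lineterminator='\n')
writer.writerow(['id', 'label'])
writer.writerows(zip(test_ids, predictions))

print(buffer.getvalue())
```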

### Xception based model

Now, instead of VGG16, I will use the Xception model as the frozen base for transfer learning, which I expect to be better suited to the task.

In [None]:
with strategy.scope(): 
    pretrained_model = tf.keras.applications.Xception(
    weights = 'imagenet',
    include_top = False,
    input_shape = [*IMAGE_SIZE, 3]
    )
    pretrained_model.trainable = False
    model = tf.keras.Sequential([
         # Transfer learning using ImageNet as a base to extract features from the images
         pretrained_model,
         # attach a new head to act as a classifier
         tf.keras.layers.GlobalAveragePooling2D(),
         tf.keras.layers.Dense(len(CLASSES), activation = 'softmax')
     ])
    model.compile(
            optimizer = 'adam',
            loss = 'sparse_categorical_crossentropy',
            metrics = ['sparse_categorical_accuracy']
            
        )
model.summary()

In [None]:
EPOCHS = 12
STEPS_PER_EPOCH = count_data_items(TRAINING_FILENAMES) // BATCH_SIZE

history = model.fit(
    ds_train,
    validation_data = ds_valid,
    epochs = EPOCHS,
    steps_per_epoch = STEPS_PER_EPOCH
)