# Flower Classification

In this project I will be participating in the [Petals to the Metal](https://www.kaggle.com/c/tpu-getting-started/overview) competition, a "getting started" competition on [Kaggle](https://www.kaggle.com). The objective is to classify images of different flowers.

## Setup
### Imports

In [12]:
import tensorflow as tf
import math, re, os
import numpy as np
from matplotlib import pyplot as plt
from sklearn.
AUTO = tf.data.experimental.AUTOTUNE
IMAGE_SIZE = [512, 512]
BATCH_SIZE = 16 * strategy.num_replicas_in_sync

NameError: name 'strategy' is not defined
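
The `NameError` occurs because `strategy` is used before it has been defined. A minimal sketch of one way to define it first, assuming the common pattern of trying to connect to a TPU and falling back to TensorFlow's default strategy when none is found (whether this matches the setup in `petal_helper.py` is an assumption):

```python
import tensorflow as tf

try:
    # Assumption: a TPU is only reachable in a hosted environment such as Kaggle.
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.TPUStrategy(tpu)
except ValueError:
    # No TPU found: fall back to the default strategy (a single replica on CPU or GPU).
    strategy = tf.distribute.get_strategy()

# The batch size scales with the number of replicas (8 on a TPU, 1 on a single device).
BATCH_SIZE = 16 * strategy.num_replicas_in_sync
print("Replicas:", strategy.num_replicas_in_sync, "Batch size:", BATCH_SIZE)
```

With the default strategy there is one replica, so the batch size stays at 16 on a single GPU.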

### Filenames

In [5]:
path = r'.\tfrecords-jpeg-512x512'
TRAINING_FILENAMES = tf.io.gfile.glob(path + '/train/*.tfrec')
VALIDATION_FILENAMES = tf.io.gfile.glob(path + '/val/*.tfrec')
TEST_FILENAMES = tf.io.gfile.glob(path + '/test/*.tfrec')

### Helper Functions
These functions can be found in the `petal_helper.py` file. I will be re-making them here in an effort to understand what they do. They are primarily used to prepare the data for the TPU. A TPU must be kept constantly supplied with data, so the data is stored as `tfrec` files rather than as `jpeg` pictures with a separate `csv` of labels. Each `tfrec` file contains both the pictures and, where applicable, their labels (the holdout set contains the unique identifiers of the pictures instead). The data is divided into 16 `tfrec` files because the TPU has 8 distinct cores and it is recommended to split the dataset into twice as many files as there are TPU cores to maximize efficiency. To minimize loading times, the datasets are also stored in "close proximity" to the TPU, in this case in the same Google Cloud bucket.

It will be interesting to see whether data in the same configuration can be used on my GPUs and, if not, what changes I will have to make to adapt the code. I would expect the code to possibly run smoothly on my big GPU (2080 Ti) and to cause issues on my small GPU (1050 Ti) because of its much lower memory.
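
To get a rough feel for the GPU memory concern, the following sketch estimates how much memory the input images of one batch occupy once they are decoded to `float32`. The helper `input_batch_bytes` is hypothetical and covers only the input tensor; model weights and activations usually need far more, so this is a lower bound rather than a prediction of actual usage.

```python
IMAGE_SIZE = [512, 512]
CHANNELS = 3
BYTES_PER_FLOAT32 = 4

def input_batch_bytes(batch_size):
    """Bytes needed to hold one batch of decoded float32 images (inputs only)."""
    height, width = IMAGE_SIZE
    return batch_size * height * width * CHANNELS * BYTES_PER_FLOAT32

per_image_mib = input_batch_bytes(1) / 2**20
per_batch_mib = input_batch_bytes(16) / 2**20
print(f"one image: {per_image_mib} MiB, batch of 16: {per_batch_mib} MiB")
# prints "one image: 3.0 MiB, batch of 16: 48.0 MiB"
```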

#### Loading the datasets from the `tfrec` files

In [11]:
def decode_image(image_data):
    '''Turn image data into an array of numbers
    Args:
        image_data: jpeg image extracted from a tfrecord file
    Returns:
        tensor containing the data of the image
        
    '''
    image = tf.image.decode_jpeg(image_data, channels = 3)
    image = tf.cast(image, tf.float32) / 255.0 # convert image to floats in [0, 1] range
    image = tf.reshape(image, [*IMAGE_SIZE , 3]) # explicit size needed for TPU
    return image
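
As a quick check of what `decode_image` returns, the sketch below runs it on a synthetic JPEG. The random image is only a stand-in for a picture stored in a `tfrec` file, not competition data.

```python
import tensorflow as tf

IMAGE_SIZE = [512, 512]

def decode_image(image_data):
    image = tf.image.decode_jpeg(image_data, channels=3)
    image = tf.cast(image, tf.float32) / 255.0  # scale pixel values to [0, 1]
    image = tf.reshape(image, [*IMAGE_SIZE, 3])  # explicit size needed for TPU
    return image

# Synthetic stand-in for a JPEG byte string (assumption: random pixels are enough for a shape check).
fake_pixels = tf.random.uniform([*IMAGE_SIZE, 3], maxval=256, dtype=tf.int32)
fake_jpeg = tf.io.encode_jpeg(tf.cast(fake_pixels, tf.uint8))

decoded = decode_image(fake_jpeg)
print(decoded.shape, decoded.dtype)
```

The result is a `float32` tensor of shape `(512, 512, 3)` with all values between 0 and 1.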

In [7]:
def read_labeled_tfrecord(example):
    LABELED_TFREC_FORMAT = {
        'image': tf.io.FixedLenFeature([], tf.string), # tf.string means bytestring
        'class': tf.io.FixedLenFeature([], tf.int64) # shape [] means single element
    }
    example = tf.io.parse_single_example(example, LABELED_TFREC_FORMAT)
    image = decode_image(example['image'])
    label = tf.cast(example['class'], tf.int32)
    return image, label

def read_unlabeled_tfrecord(example):
    UNLABELED_TFREC_FORMAT = {
        'image': tf.io.FixedLenFeature([], tf.string),
        'id': tf.io.FixedLenFeature([], tf.string)
        # no class, because this will be used on the test dataset
    }
    example = tf.io.parse_single_example(example, UNLABELED_TFREC_FORMAT)
    image = decode_image(example['image'])
    idnum = example['id']
    return image, idnum
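
To illustrate what such a record looks like, the sketch below builds one labeled example by hand, serializes it, and parses it back with the same feature format. The 8x8 black image and the class value 7 are made up for illustration.

```python
import tensorflow as tf

# Hypothetical tiny image, encoded as JPEG bytes the way pictures are stored in a record.
pixels = tf.zeros([8, 8, 3], dtype=tf.uint8)
jpeg_bytes = tf.io.encode_jpeg(pixels).numpy()

# Build a record with the same two features the labeled format expects.
example = tf.train.Example(features=tf.train.Features(feature={
    "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[jpeg_bytes])),
    "class": tf.train.Feature(int64_list=tf.train.Int64List(value=[7])),
}))
serialized = example.SerializeToString()

# Parse it back: each feature is a single element, hence the empty shape [].
feature_spec = {
    "image": tf.io.FixedLenFeature([], tf.string),
    "class": tf.io.FixedLenFeature([], tf.int64),
}
parsed = tf.io.parse_single_example(serialized, feature_spec)
label = int(parsed["class"].numpy())
print(label)  # prints 7
```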

In [8]:
def load_dataset(filenames, labeled = True, ordered = False):
    # Read from tfrecords. For optimal performance, read from multiple files at once and disregard
    # data order. Order does not matter since we will be shuffling the data anyway.
    # It's generally a good idea to split the data into 16 parts when working with a TPU.
    ignore_order = tf.data.Options()
    if not ordered:
        ignore_order.experimental_deterministic = False # disable ordering; use whatever data is loaded first
    
    dataset = tf.data.TFRecordDataset(filenames, num_parallel_reads = AUTO) 
    dataset = dataset.with_options(ignore_order)
    dataset = dataset.map(read_labeled_tfrecord if labeled else read_unlabeled_tfrecord,
                          num_parallel_calls = AUTO)
    # returns a dataset of (image, label) pairs if labeled = True or (image, id) if not labeled
    return dataset


In [9]:
load_dataset(TRAINING_FILENAMES)

<ParallelMapDataset shapes: ((512, 512, 3), ()), types: (tf.float32, tf.int32)>

#### Augmenting the data

In order to reduce overfitting, a function will be created that flips the images randomly. This could later be "augmented" by additional operations such as cropping or rotating the images to possibly increase accuracy.

The `dataset.prefetch(AUTO)` call in the `get_training_dataset` function further below lets the data pipeline run on the CPU while the TPU performs the gradient descent calculations, so data loading should not slow down training. It will be interesting to see whether this also applies when running the code on the GPU.
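
A small sketch of where `prefetch` sits in a pipeline, using a toy dataset of integers instead of images. Prefetching only changes when elements are prepared, not what they contain, so the batches come out the same as they would without it.

```python
import tensorflow as tf

AUTO = tf.data.experimental.AUTOTUNE

dataset = tf.data.Dataset.range(10)
dataset = dataset.map(lambda x: x * 2, num_parallel_calls=AUTO)  # stand-in for decoding/augmenting
dataset = dataset.batch(4)
dataset = dataset.prefetch(AUTO)  # prepare upcoming batches while the current one is consumed

batches = [batch.numpy().tolist() for batch in dataset]
print(batches)  # prints [[0, 2, 4, 6], [8, 10, 12, 14], [16, 18]]
```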

In [10]:
def data_augment(image, label):
    image = tf.image.random_flip_left_right(image)
    return image, label
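
A possible extension with rotation could look like the sketch below. The function `data_augment_extended` is hypothetical and not part of `petal_helper.py`; it adds a random rotation by a multiple of 90 degrees, which keeps square images at their original size.

```python
import tensorflow as tf

def data_augment_extended(image, label):
    image = tf.image.random_flip_left_right(image)
    # Rotate by 0, 90, 180 or 270 degrees, chosen at random.
    k = tf.random.uniform([], minval=0, maxval=4, dtype=tf.int32)
    image = tf.image.rot90(image, k=k)
    return image, label

# Small random image as a stand-in for a decoded picture.
image = tf.random.uniform([8, 8, 3])
augmented, label = data_augment_extended(image, 3)
print(augmented.shape, label)
```

Inside `dataset.map` the static height and width may be lost after a rotation by a random amount, so on a TPU a `tf.reshape` like the one in `decode_image` may be needed afterwards.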

### Making the Datasets for Training, Validation and Testing

In [None]:
def get_training_dataset():
    dataset = load_dataset(TRAINING_FILENAMES, labeled = True)
    dataset = dataset.map(data_augment, num_parallel_calls = AUTO)
    dataset = dataset.repeat() # repeat the dataset indefinitely so the TPU is always fully utilized
    dataset = dataset.shuffle(2048) # shuffle the order of the data points to combat overfitting
    dataset = dataset.batch(BATCH_SIZE) # set the batch size; probably needs to be adjusted when using the GPU
    dataset = dataset.prefetch(AUTO) # prefetch the next batch while training (autotune prefetch buffer size)
    return dataset

def get_validation_dataset(ordered = False):
    dataset = load_dataset(VALIDATION_FILENAMES, labeled = True, ordered = ordered)
    dataset = dataset.batch(BATCH_SIZE)
    dataset = dataset.cache() # cache() returns a new dataset, so the result has to be assigned
    dataset = dataset.prefetch(AUTO)
    return dataset

def get_test_dataset(ordered = False):
    dataset = load_dataset(TEST_FILENAMES, labeled = False, ordered = ordered)
    dataset = dataset.batch(BATCH_SIZE)