Kyle Mckay - F21DL Coursework Portfolio
===

1. [Data Set Choice](#h1)
2. [Visualization and Initial Data Exploration](#h2)
    1. [Class Distribution](#h2_0)
    2. [Direct Comparison](#h2_1)
    3. [Conversion to Pixel Data](#h2_2)
    4. [Average Images](#h2_3)
    5. [Standard Deviation](#h2_4)
3. [Acknowledgement](#ack)

# Setup

In [18]:
%matplotlib inline
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tensorflow.keras.preprocessing import image

Initialise the variables and functions used throughout the notebook.

In [19]:
data_dir = 'DATA' 
img_dir = f'{data_dir}/images'

# Utility function to plot greyscale pixel matrices (numpy arrays) side by side for comparison.
# Each argument is a (title, pixel_matrix) pair.
def multi_plot(*imgs):
    fig = plt.figure()

    # TODO add multiple rows if many images
    for i, (title, dat) in enumerate(imgs, 1):
        ax = fig.add_subplot(1, len(imgs), i)
        ax.imshow(dat, vmin=0, vmax=255, cmap='Greys_r')
        ax.set_title(title)
        ax.axis('off')

    plt.show()

# Collapse a stack of flattened greyscale images (one row per image) into a single
# image by applying a reduction (e.g. mean) across the images
def collapse_img(full_mat, fn = np.mean, size = (300, 180)):
    # Apply the desired function across the image axis
    collapsed_img = fn(full_mat, axis = 0)
    
    # Reshape the resulting pixel vector back to a matrix
    return collapsed_img.reshape(size)

# <a id="h1">Data Set Choice</a>

I chose to work with the FGVC-Aircraft data set (Maji et al.) because:

- It is an image dataset and I'm interested in computer vision and image processing techniques.
- The data set is intended for use as a multiclass classification problem, which provides an interesting level of complexity above binary classification and below multi-label classification.
- The classes in the data set are granular, with 41 manufacturers each providing plenty of instances (image files) to work with.
- The images are not provided in a uniform size or aspect ratio, which adds some complexity to homogenising and reducing the data in preprocessing to meet computational constraints.
- The data provenance is clearly stated and acceptable for my research/learning purposes.

# <a id="h2">Visualization and Initial Data Exploration</a>

## <a id="h2_0">Class Distribution</a>

I first want to get a sense of the number of instances per manufacturer. I know there are 100 images per aircraft model (the highest level of granularity), but I have no sense of how evenly those are distributed among the lower-granularity class levels.

This matters because, for training, I want the classes to be approximately evenly distributed to avoid results that are skewed by class imbalance.

In [54]:
# Check how many images there are per manufacturer (the lowest level of granularity)
def class_count(file):
    classes = pd.read_csv(file, sep=" ", header=None, 
        names=["Image", "Class"], skip_blank_lines=True)
    
    counts = classes['Class'].value_counts()

    # Interested in classes with at least 100 instances (reasonable quantity)
    return counts[counts >= 100]

class_count(f'{data_dir}/images_manufacturer_train.txt')


Boeing       733
Airbus       434
Embraer      233
McDonnell    232
de           167
Canadair     134
Douglas      133
Cessna       133
British      133
Lockheed     102
Fokker       100
Name: Class, dtype: int64
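The labels `de`, `McDonnell` and `British` in the output above look like the first words of multi-word manufacturer names, which is what splitting each line on every space would tend to produce. A minimal sketch of an alternative, assuming each line of the label file has the form `<image id> <manufacturer name>` and that everything after the first space belongs to the name. `count_manufacturers` and the sample lines are hypothetical and not part of the original notebook:

```python
from collections import Counter

def count_manufacturers(lines, min_count=100):
    """Count images per manufacturer, keeping multi-word names intact."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first space only: the image id comes first and the
        # rest of the line is treated as the class name (an assumption).
        _image_id, name = line.split(" ", 1)
        counts[name] += 1
    # Keep only the classes with at least `min_count` instances
    return {name: n for name, n in counts.items() if n >= min_count}

# Made-up sample lines (not real data) to show the behaviour
sample = [
    "0000001 Boeing",
    "0000002 de Havilland",
    "0000003 de Havilland",
    "",
    "0000004 McDonnell Douglas",
]
result = count_manufacturers(sample, min_count=2)
print(result)
```

With a real label file, the same function could be fed an open file handle, since iterating over a file yields its lines.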


Before exploring the data further, a brief inspection makes it clear that some preprocessing is needed to homogenise the images so that they share the same features and can be compared more directly.


## <a id="h2_1">Direct Comparison</a>

The following code pulls 3 random sample images from each class for an initial direct visual comparison.

It's clear from a quick visual inspection that the pollen-carrying bees have distinct features (the pollen) on either side of their abdomen which aren't present in the other images. This suggests it should be fine to process the images in greyscale and reduce the number of features for efficiency, since the classes should still be distinguishable.
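The sampling step described in this section could be sketched as below. This assumes the labels are available as `(image id, class)` pairs. `sample_per_class` and the example ids are hypothetical, and loading and plotting the chosen files (for instance with `multi_plot`) is left out.

```python
import random

def sample_per_class(labels, n=3, seed=None):
    """Pick up to `n` random image ids for every class."""
    rng = random.Random(seed)  # seeded generator for reproducible picks

    # Group the image ids by class
    by_class = {}
    for image_id, cls in labels:
        by_class.setdefault(cls, []).append(image_id)

    # Sample without replacement; small classes return all of their ids
    return {cls: rng.sample(ids, min(n, len(ids)))
            for cls, ids in by_class.items()}

# Made-up ids and class names purely for illustration
labels = [(f"img{i}", "A") for i in range(10)] + [("x1", "B"), ("x2", "B")]
picks = sample_per_class(labels, n=3, seed=0)
print(picks)
```

Fixing `seed` keeps the same images on every run, which makes the visual comparison repeatable.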

## <a id="h2_2">Conversion to Pixel Data</a>

The following script converts greyscale versions of the image files into an $n\times m$ matrix, where $n$ is the number of observations and $m$ is the number of pixels (features). Each element value represents a pixel shade.
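A minimal sketch of such a conversion, not the notebook's original cell. It assumes every image is resized to the same shape as the default `size` of `collapse_img`, here `(300, 180)`, read as (height, width). `load_grey` and `to_pixel_matrix` are hypothetical helper names. `load_grey` relies on Keras' `load_img` and `img_to_array` and is only defined here, not called.

```python
import numpy as np

def load_grey(path, size=(300, 180)):
    """Load one image file as a 2-D greyscale array of shape `size`."""
    # Imported lazily so the rest of the sketch runs without TensorFlow.
    from tensorflow.keras.preprocessing import image
    img = image.load_img(path, target_size=size, color_mode="grayscale")
    return image.img_to_array(img)[:, :, 0]  # drop the single channel axis

def to_pixel_matrix(images):
    """Stack equally sized 2-D images into an (n, m) matrix, one row per image."""
    return np.stack([np.asarray(img).reshape(-1) for img in images])

# Two synthetic images (all black, all white) stand in for real files
imgs = [np.zeros((300, 180)), np.full((300, 180), 255.0)]
mat = to_pixel_matrix(imgs)
print(mat.shape)
```

Here $n = 2$ and $m = 300 \times 180 = 54000$, so each row of `mat` is one flattened image.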

## <a id="h2_3">Average Images</a>

By averaging the pixel data across all images of a given class, I can get a sense of the regions that are common to each class. It's not as clear as before, but there's still an obvious presence on either side of the abdomen for the pollen-carrying class which isn't present in the other images.
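The per-class averaging can be sketched as follows, assuming a pixel matrix with one flattened image per row and a parallel sequence of class labels. `class_mean_images` is a hypothetical helper and the tiny 2x2 "images" are made up for illustration.

```python
import numpy as np

def class_mean_images(pixel_matrix, labels, size):
    """Return {class: mean image of shape `size`} for every class in `labels`."""
    pixel_matrix = np.asarray(pixel_matrix, dtype=float)
    labels = np.asarray(labels)
    means = {}
    for cls in np.unique(labels).tolist():
        rows = pixel_matrix[labels == cls]            # images of this class only
        means[cls] = rows.mean(axis=0).reshape(size)  # average over images, then unflatten
    return means

# Three flattened 2x2 images: two of class "a", one of class "b"
pixels = [[0, 0, 0, 0],
          [10, 20, 30, 40],
          [100, 100, 100, 100]]
means = class_mean_images(pixels, ["a", "a", "b"], size=(2, 2))
print(means["a"])
```

Each resulting image could then be passed to `multi_plot` as a `(title, image)` pair.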

### Average Images Difference

The distinction between the classes becomes clearer when one averaged image is subtracted from the other. Now it's much more obvious that one class (pollen carrier) has a distinct region of pixels in the lower left and a less distinct region in the lower right.
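A small sketch of the subtraction, under the assumption that both averaged images have the same shape. Because `multi_plot` fixes the colour range to 0..255, negative differences would all render as black, so the sketch also shows one possible rescaling. `mean_difference` and `to_display_range` are hypothetical names.

```python
import numpy as np

def mean_difference(mean_a, mean_b):
    """Signed pixel-wise difference between two averaged images."""
    # Cast to float first: subtracting uint8 arrays would wrap around
    return np.asarray(mean_a, dtype=float) - np.asarray(mean_b, dtype=float)

def to_display_range(diff):
    """Map signed differences in [-255, 255] onto [0, 255] for plotting."""
    return (diff + 255.0) / 2.0

# Made-up averaged images with a single row of two pixels
mean_a = np.array([[200, 50]], dtype=np.uint8)
mean_b = np.array([[100, 150]], dtype=np.uint8)

diff = mean_difference(mean_a, mean_b)
display = to_display_range(diff)
print(diff, display)
```

After rescaling, mid-grey marks pixels where the classes agree, while lighter and darker pixels mark where one class is brighter than the other.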

## <a id="h2_4">Standard Deviation</a>

To get a sense of the pixel regions which vary the most in each class, the same approach can be applied, taking the standard deviation instead of the mean. The resulting images aren't as clear, but here I can see a lot more variance in those lower left and right regions for the pollen-carrying class. This reinforces my understanding of the data.
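As a sketch of this step, the standard deviation image can be computed per pixel and the most variable positions listed explicitly. It assumes one flattened image per row. `std_image` and `most_variable_pixels` are hypothetical helpers, and `np.std` is used with its default population formula (`ddof=0`).

```python
import numpy as np

def std_image(pixel_matrix, size):
    """Per-pixel standard deviation across images, reshaped to `size`."""
    return np.std(np.asarray(pixel_matrix, dtype=float), axis=0).reshape(size)

def most_variable_pixels(std_img, k=5):
    """Return the (row, col) positions of the `k` largest standard deviations."""
    flat_order = np.argsort(std_img, axis=None)[::-1][:k]  # indices, largest first
    rows, cols = np.unravel_index(flat_order, std_img.shape)
    return list(zip(rows.tolist(), cols.tolist()))

# Three made-up flattened 2x2 images; only the third pixel changes between them
pixels = [[0, 5, 0, 0],
          [0, 5, 10, 0],
          [0, 5, 20, 0]]
std_img = std_image(pixels, size=(2, 2))
top = most_variable_pixels(std_img, k=1)
print(std_img, top)
```

Listing the top positions gives a numeric check on what the plotted image suggests visually.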

# <a id="ack">Acknowledgement</a>

FGVC-Aircraft data set:

Fine-Grained Visual Classification of Aircraft, S. Maji, J. Kannala, E. Rahtu, M. Blaschko, A. Vedaldi, [arXiv.org](https://arxiv.org/abs/1306.5151), 2013

Image Data Exploration Code adapted from: [https://towardsdatascience.com/exploratory-data-analysis-ideas-for-image-classification-d3fc6bbfb2d2](https://towardsdatascience.com/exploratory-data-analysis-ideas-for-image-classification-d3fc6bbfb2d2)