In [1]:
import numpy as np
import matplotlib.pyplot as plt


# Dataset Visualisation
The three generated datasets are plotted below. They represent three boundary situations commonly encountered in practice: a linearly separable boundary, a higher-order separable boundary, and a linearly non-separable boundary. I chose these datasets because they are easy to visualise, which makes it possible to see how each algorithm works geometrically.

![datasets.png](datasets.png)
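For readers who want to reproduce data of a similar shape, the following is a minimal NumPy sketch. It is not the code used to generate the plots above: the function name `make_toy_datasets`, the two-class setup, the cluster centres, and the noise level are all assumptions chosen for illustration.

```python
import numpy as np


def make_toy_datasets(n_samples=400, noise=0.1, seed=0):
    """Return {'blobs' | 'moons' | 'circles': (X, y)} with two classes each.

    Hypothetical helper: all shapes and parameters are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    half = n_samples // 2
    y = np.repeat([0, 1], half)

    # Blobs: two Gaussian clusters around well-separated centres.
    centres = np.array([[-2.0, 0.0], [2.0, 0.0]])
    blobs = centres[y] + rng.normal(scale=0.5, size=(2 * half, 2))

    # Moons: two interleaved half circles.
    t = np.linspace(0.0, np.pi, half)
    upper = np.column_stack([np.cos(t), np.sin(t)])
    lower = np.column_stack([1.0 - np.cos(t), 0.5 - np.sin(t)])
    moons = np.vstack([upper, lower]) + rng.normal(scale=noise, size=(2 * half, 2))

    # Circles: a small circle nested inside a larger one.
    a = np.linspace(0.0, 2.0 * np.pi, half, endpoint=False)
    ring = np.column_stack([np.cos(a), np.sin(a)])
    circles = np.vstack([ring, 0.5 * ring]) + rng.normal(scale=noise, size=(2 * half, 2))

    return {"blobs": (blobs, y), "moons": (moons, y), "circles": (circles, y)}
```

Fixing `seed` keeps the generated points identical between runs, which makes the selected subsets comparable across algorithms.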

The CIFAR10 and CIFAR100 datasets are visualised by projecting the extracted features onto a two-dimensional space with the t-SNE algorithm.



<p float="left">
    <img src="cifar10-tsne.png"  width="30%"/><img src="cifar100-tsne.png"  width="30%"/>
</p>



Different colours represent different classes. The projected features clearly have a shape similar to that of the blobs dataset, which indicates that the features extracted with the pre-trained network are well suited to logistic regression. Therefore, the selection preference on the blobs dataset should reflect the behaviour on CIFAR10 and CIFAR100, but only if the selection algorithm is based on the geometric location of the samples.



# Experiment 1: Analysis of the algorithms
This experiment is designed to analyse the intrinsic behaviour of the algorithms individually. 
## 1.1 POP
The POP algorithm assumes that keeping only the boundary points is enough for simple machine learning algorithms such as k-NN and logistic regression to recover the desired decision boundary. Many papers have supported this hypothesis experimentally. For deep learning, however, the performance is still unknown. If we view the network as first extracting features and then classifying them with logistic regression, the question becomes whether the network can extract high-quality features from the POP-selected samples. Note that POP is not expected to yield identical extracted features, because of the randomness of deep learning training.

For the generated blob dataset, the weakness countplot is shown below. Recall that the weakness of a sample is the number of axes on which, after projecting the features onto that axis, the sample is not a boundary sample. Since the blob dataset has two features, only samples with weakness equal to 2 are purely inner samples. The right plot shows the selected samples in green. It indicates that if the classifier can separate the selected points, it should be able to separate all the other samples as well. Indeed, the test accuracy is 99.5% with early stopping, that is, when we select the model with the highest validation accuracy.
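The weakness definition above can be sketched as follows. This is a simplified illustration, not the implementation used for the plots: `pop_weakness` is a hypothetical name, a sample is treated as a boundary sample on an axis whenever one of its neighbours in sorted order has a different label (or it is the first or last sample), and ties between equal feature values are simply left in input order.

```python
import numpy as np


def pop_weakness(X, y):
    """Count, per sample, the axes on which it is NOT a boundary sample.

    Simplified sketch: ties in feature values get no special treatment.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n_samples, n_features = X.shape
    weakness = np.zeros(n_samples, dtype=int)

    for j in range(n_features):
        # Project onto axis j by sorting the samples along that feature.
        order = np.argsort(X[:, j], kind="stable")
        labels = y[order]

        # Compare each sample's label with its sorted neighbours.
        same_prev = np.zeros(n_samples, dtype=bool)
        same_next = np.zeros(n_samples, dtype=bool)
        same_prev[1:] = labels[1:] == labels[:-1]
        same_next[:-1] = labels[:-1] == labels[1:]

        # Interior on this axis: both neighbours share the sample's label.
        interior = same_prev & same_next
        weakness[order[interior]] += 1

    return weakness


# Two well-separated groups of three points in 2-D.
X_demo = np.array([[0, 0], [1, 1], [2, 2], [10, 10], [11, 11], [12, 12]])
y_demo = np.array([0, 0, 0, 1, 1, 1])
print(pop_weakness(X_demo, y_demo))  # [0 2 0 0 2 0]
```

In the demo only the middle point of each group reaches weakness 2, the number of features, so under this reading it is the only purely inner sample of its group.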

<p float="left">
    <img src="pop-blob.png"  width="45%"/> <img src="pop-blob-368.png"  width="45%"/>
</p>



However, for higher-dimensional features, there are fewer purely inner samples as defined by POP. The countplots of CIFAR10 (left) and CIFAR100 (right) are shown below.

<p float="left">
    <img src="pop-cifar10.png"  width="45%"/> <img src="pop-cifar100.png"  width="45%"/>
</p>



For CIFAR10 and CIFAR100, there are no purely inner samples (weakness == 128, since we use 128-dimensional features). POP is therefore less efficient for datasets with more classes and features.

The POP weakness countplots for circles (left) and moons (right) are shown below. The reduction rate of POP is clearly highly constrained by the geometric location of the samples. If the samples are not separable after being projected onto a single axis, POP is not a suitable algorithm.

<p float="left">
    <img src="pop-circles.png"  width="45%"/><img src="pop-moons.png"  width="45%"/> 
</p>



I varied the weakness threshold for CIFAR10 and CIFAR100 to see what happens when samples that are not purely inner are also removed. The classification accuracy for CIFAR10 is shown in the table below. For CIFAR10, the results are still acceptable. For CIFAR100, however, using POP is pointless: even with weakness == 0 (pure boundary samples), more than 85% of the samples are already selected and the relative accuracy is close to 1.
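Varying the threshold amounts to a simple mask over the weakness values. The sketch below assumes that samples with weakness less than or equal to the threshold are kept; `select_by_weakness` is a hypothetical helper and the weakness values are made up for illustration.

```python
import numpy as np


def select_by_weakness(weakness, threshold):
    """Keep samples whose weakness is at most `threshold` (assumed convention).

    Returns the boolean selection mask and the fraction of samples kept.
    """
    weakness = np.asarray(weakness)
    keep = weakness <= threshold
    return keep, float(keep.mean())


# Illustrative weakness values, not taken from the experiments.
weakness_demo = np.array([0, 0, 1, 2, 128])
for threshold in (0, 1, 2):
    mask, kept = select_by_weakness(weakness_demo, threshold)
    print(threshold, mask.sum(), kept)
```

With threshold 0 only the pure boundary samples survive, which corresponds to the most aggressive setting discussed above.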

<img src="pop-his.png"  width="50%"/>

There are two possible ways to improve the performance of POP. The first is to remove the BatchNormalisation layer of the pre-trained feature extraction network, so that the differences between feature values are larger and it becomes easier to avoid samples with close feature values but different class labels. The second is to lower the numpy.isclose() tolerance so that samples with similar features are considered non-boundary samples.
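To make the role of the tolerance concrete, the snippet below shows how the `rtol` and `atol` arguments of `numpy.isclose` change which feature values count as equal. The feature values are invented, and how such ties feed into the boundary test depends on the POP implementation, which is not shown here.

```python
import numpy as np

# Three illustrative feature values along one axis.
values = np.array([0.100, 0.101, 0.200])

# Default tolerances (rtol=1e-05, atol=1e-08): only identical values match.
strict = np.isclose(values[:, None], values[None, :])

# A much larger absolute tolerance: 0.100 and 0.101 now count as equal.
loose = np.isclose(values[:, None], values[None, :], rtol=0.0, atol=1e-2)

print(strict.sum(), loose.sum())  # 3 5
```

`numpy.isclose(a, b)` tests `|a - b| <= atol + rtol * |b|`, so the tolerance directly controls how many pairs of samples are treated as having the same feature value.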

## 1.2 EGDIS

EGDIS is designed to select both boundary samples and the densest samples. However, boundary samples are harder to classify than inner samples, so the test accuracy may be slightly lower than that of other algorithms when the network is trained with the same number of samples. The benefit is that these boundary samples are the key to achieving higher accuracy and lower loss, so researchers can focus on designing new networks or training methods that classify them correctly. To test this hypothesis, I plotted the density of the classification scores of the boundary samples below, with CIFAR10 on the left and CIFAR100 on the right. I did not use the generated datasets here because they are too simple and most of the scores are above 0.9. In addition, only 1 boundary sample is selected for blobs, 3 for moons, and 35 for circles.

<img src="egdis-boundary-scores-cifar10.png" width="50%"/><img src="egdis-boundary-scores-cifar100.png" width="50%"/>

For CIFAR10, there are 3069 boundary samples; for CIFAR100, there are 12258. The reason for the many low-score samples is that the features extracted by NasNetLarge only achieve a test accuracy of 71.60% on CIFAR100 (for CIFAR10 this value is 95.54%, which is close to the best accuracy that we can achieve). If we fine-tuned the network on CIFAR100 in advance, the scores should be much higher and there should be far fewer boundary points. Nevertheless, this is acceptable: it indicates that good features are harder to extract for these samples, so we can focus on them to improve the network performance.
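The score densities can be computed along the following lines. This is a sketch under the assumption that the classification score of a sample is the predicted probability of its true class; `true_class_scores` and `score_density` are hypothetical helper names, and the probabilities are made up.

```python
import numpy as np


def true_class_scores(proba, y):
    """Pick, for each sample, the predicted probability of its true class."""
    proba = np.asarray(proba, dtype=float)
    y = np.asarray(y)
    return proba[np.arange(len(y)), y]


def score_density(scores, bins=20):
    """Histogram-based density estimate of the scores on [0, 1]."""
    density, edges = np.histogram(scores, bins=bins, range=(0.0, 1.0), density=True)
    return density, edges


# Illustrative class probabilities for three boundary samples.
proba_demo = np.array([[0.9, 0.1], [0.3, 0.7], [0.6, 0.4]])
y_demo = np.array([0, 1, 1])
scores_demo = true_class_scores(proba_demo, y_demo)
density_demo, edges_demo = score_density(scores_demo, bins=10)
```

The third sample is misclassified and therefore has a score below 0.5; a dataset with many such samples shifts the density towards the left.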

The selected samples are visualised below, in yellow for CIFAR10 and in pink for CIFAR100. The CIFAR100 plot is much harder to interpret because there are too many blobs, but for CIFAR10 the selected points lie at the boundaries between clusters and in the inner regions of the clusters.

<img src="egdis-cifar10.png" width="45%"/> <img src="egdis-cifar100.png" width="45%"/> 

The generated datasets are shown below.  
<img src="egdis-blobs.png" width="32%"/> <img src="egdis-moons.png" width="32%"/> <img src="egdis-circles.png" width="32%"/>

Most of the selected samples are located in the inner regions of the clusters. EGDIS selected 136, 177, and 202 samples for blobs, moons, and circles respectively, and achieved test scores of 0.993, 1.000, and 0.785. The performance on the circles dataset is poor because recovering the decision boundary requires samples all around the circle, whereas the dense regions are not distributed around the whole circle.
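The idea of keeping both boundary samples and the densest samples can be illustrated with the rough sketch below. It is not the published EGDIS procedure or the code used for these experiments: the neighbourhood size `k`, the label-disagreement test for boundary samples, and the negative mean distance used as a density proxy are all assumptions, and `select_boundary_and_dense` is a hypothetical name.

```python
import numpy as np


def select_boundary_and_dense(X, y, k=5):
    """Return boolean masks (boundary, dense) over the samples.

    Assumed criteria, for illustration only:
    - boundary: at least max(1, k // 2) of the k nearest neighbours
      have a different label;
    - dense: all k neighbours share the label and none of them has a
      higher density than the sample itself.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n = len(X)

    # Pairwise Euclidean distances (O(n^2) memory, small datasets only).
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # Density proxy: negative mean distance to all other samples.
    density = -dist.sum(axis=1) / (n - 1)

    # k nearest neighbours of each sample, excluding the sample itself.
    np.fill_diagonal(dist, np.inf)
    neighbours = np.argsort(dist, axis=1)[:, :k]

    disagree = (y[neighbours] != y[:, None]).sum(axis=1)
    boundary = disagree >= max(1, k // 2)
    dense = (disagree == 0) & (density >= density[neighbours].max(axis=1))
    return boundary, dense


# Two well-separated groups of five points on a line.
X_demo = np.array([[i, 0.0] for i in range(5)] + [[100.0 + i, 0.0] for i in range(5)])
y_demo = np.repeat([0, 1], 5)
boundary_demo, dense_demo = select_boundary_and_dense(X_demo, y_demo, k=2)
print(np.flatnonzero(boundary_demo), np.flatnonzero(dense_demo))  # [] [4 5]
```

Because the density proxy here is global, the point of each group closest to the rest of the data wins. A local peak test of this kind keeps only a few samples per dense region, which is one way to see why a ring-shaped class can end up poorly covered.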

For CIFAR10, the EGDIS algorithm achieved a relative accuracy of 86.58% with 16.27% of the training samples. The CIFAR100 experiments are still running.

## 1.3 CL

The basic idea of CL is to select samples with high classification scores. It was not designed to select a subset of samples, however, but to speed up training by training on easy samples first. The advantage of using CL as a data selection algorithm is that it is closely tied to the behaviour of the neural network. For this reason, CL should perform better than the other algorithms.
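Used as a selection rule, this idea reduces to ranking the samples by score and keeping the top fraction. The sketch below assumes one score per sample, with higher meaning easier; `select_easiest` is a hypothetical helper and the scores are invented.

```python
import numpy as np


def select_easiest(scores, fraction):
    """Return sorted indices of the `fraction` of samples with the highest scores."""
    scores = np.asarray(scores, dtype=float)
    n_keep = max(1, int(round(fraction * len(scores))))
    # Stable sort on the negated scores: highest score first, ties in input order.
    order = np.argsort(-scores, kind="stable")
    return np.sort(order[:n_keep])


scores_demo = np.array([0.20, 0.90, 0.50, 0.95])
print(select_easiest(scores_demo, 0.5))  # [1 3]
```

The same ranking could also be consumed in order, easiest samples first, which is closer to how a curriculum is used to schedule training.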

The main drawback of CL is that it only selects easy samples so the accuracy is limited if the dataset is hard to classify.