Vision is arguably one of our greatest strengths as humans. Even the most complex ideas tend to become much easier to understand as soon as we find a way to visualize them. That is why, when we work with datasets that have more than 3 dimensions, it can be challenging just to understand what is going on inside them.
Anomaly detection is a subfield of AI that identifies unusual cases in datasets: a machine learning algorithm learns the "patterns" that the majority of cases adhere to and then singles out the few cases that deviate from those normal patterns. Appropriate projections of high-dimensional datasets to two dimensions often help by isolating anomalies in the image of the projection, where a human can then easily spot them just by looking at the picture. There are many different techniques, each with its own strengths and weaknesses.
We have developed a modular, extensible anomaly visualization framework with a graphical user interface that allows users to evaluate different data visualization techniques on their own datasets.
Create a new python3 virtual environment:
python3 -m venv venv
and source it:
source venv/bin/activate
Install libraries:
pip install -r requirements.txt
Run the app:
python run.py
Go to this address in your web browser:
http://127.0.0.1:5000
Create the Docker image:
docker build -t name:tag . (example: docker build -t eva:latest .)
Run the app:
docker run -p 5000:port name:tag (example: docker run -p 5000:5000 eva:latest)
After you run the server on http://127.0.0.1:5000, you will be greeted with this home screen.
Here, you can either choose one of the existing datasets in /data
or upload your own dataset. After you upload your own dataset, it is listed in the dropdown, where you can select and submit it. Once you submit the dataset, a table is created so you can take a peek into it. If the dataset is too large (>5000 datapoints), only the head of the dataset is displayed.
For smaller datasets, you can navigate through the displayed table.
Now you can apply filters to your dataset and choose the label column (you can also choose 'None').
Now, for the main part, you have the opportunity to choose an algorithm to apply to the dataset as well as the visualisation technique you prefer.
For the algorithm, you can choose between:
- PCA
- LLE
- TSNE
- UMAP
- ISOMAP
- K-Mapper
- MDS
As visualisation methods, you can choose between:
- Scatter plot
- Box plot
- k-nearest neighbour
- Dendrogram
- Density plot
- Heatmap
After you choose the algorithms and visualisation methods, you will be redirected to a page that shows the plots obtained from the chosen algorithms.
Dimensionality reduction algorithms help with understanding data through visualisation. The main concept of dimensionality reduction is to embed high-dimensional (D > 3) points into a lower, usually 2- or 3-dimensional, space in order to plot the data.
EVA implements the following dimensionality reduction algorithms:
Principal component analysis (PCA) is a classic algorithm for dimensionality reduction. PCA transforms points from the original space into a space of uncorrelated features over the given dataset via eigendecomposition of the covariance matrix. PCA is by design a linear algorithm, meaning that it is not capable of capturing non-linear correlations.
https://en.wikipedia.org/wiki/Principal_component_analysis
- n_dim - change the dimensionality of the projection
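To make the idea concrete, here is a minimal NumPy sketch of the projection step described above. It only illustrates the eigendecomposition approach; it is not EVA's actual implementation, and the function name pca_project is a placeholder.

```python
import numpy as np

def pca_project(X, n_dim=2):
    """Minimal PCA via eigendecomposition of the covariance matrix."""
    X_centered = X - X.mean(axis=0)            # center each feature
    cov = np.cov(X_centered, rowvar=False)     # covariance matrix (D x D)
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh: for symmetric matrices
    order = np.argsort(eigvals)[::-1]          # sort by decreasing variance
    components = eigvecs[:, order[:n_dim]]     # top n_dim principal axes
    return X_centered @ components             # project to n_dim dimensions
```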
Locally Linear Embedding (LLE) is a nonlinear dimensionality reduction algorithm. LLE is a two-step procedure. The first step consists of finding the k-nearest neighbours of each point and computing the weights for reconstructing the original point from its neighbours. The second step embeds the high-dimensional data points in a lower-dimensional space using the weights learned in the first step.
https://cs.nyu.edu/~roweis/lle/papers/lleintro.pdf
- n_dim - change the dimensionality of the projection
- k_neighbours - change the number of neighbours used to reconstruct the point in the first step
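As a rough illustration of how such a projection can be computed, the sketch below uses scikit-learn's LocallyLinearEmbedding. Mapping EVA's n_dim and k_neighbours parameters onto n_components and n_neighbors is an assumption for the example, not a description of EVA's internals.

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

X = np.random.rand(500, 20)                  # placeholder high-dimensional data

# n_components plays the role of n_dim, n_neighbors the role of k_neighbours
lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10)
X_2d = lle.fit_transform(X)                  # (500, 2) embedding
```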
T-distributed Stochastic Neighbour Embedding (t-SNE) is a nonlinear dimensionality reduction algorithm. In order to create a low-dimensional mapping, t-SNE measures the similarity of the high-dimensional points by building a probability distribution over pairs of points. A matching distribution is defined over the low-dimensional points, and the embedding is obtained by minimizing the difference between the two distributions using the K-L divergence.
http://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf
- n_dim - change the dimensionality of the projection
- perplexity - change the target perplexity of the algorithm, balancing attention between the global and local structure of the data
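A comparable scikit-learn sketch for t-SNE, again only to illustrate the parameters listed above (perplexity maps directly to scikit-learn's perplexity argument); this is not necessarily how EVA wraps the algorithm.

```python
import numpy as np
from sklearn.manifold import TSNE

X = np.random.rand(500, 20)                  # placeholder high-dimensional data

# perplexity balances attention between local and global structure
tsne = TSNE(n_components=2, perplexity=30.0)
X_2d = tsne.fit_transform(X)                 # (500, 2) embedding
```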
Uniform Manifold Approximation and Projection (UMAP) is a general dimensionality reduction algorithm built on topological tools. UMAP assumes that the data is uniformly distributed on a Riemannian manifold which is locally connected and whose Riemannian metric is locally constant or can be approximated as such. UMAP models the data manifold with a fuzzy topological structure and embeds the data into a low-dimensional space by finding the closest possible equivalent topological structure.
https://arxiv.org/pdf/1802.03426.pdf
- n_dim - change the dimensionality of the projection
- k_neighbours - change the number of neighbours used in creation of k-neighbour graph
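An illustrative sketch using the umap-learn package; the correspondence of k_neighbours to n_neighbors is an assumption made for the example.

```python
import numpy as np
import umap

X = np.random.rand(500, 20)                  # placeholder high-dimensional data

# n_neighbors controls the neighbourhood size used to build the fuzzy
# topological representation; n_components is the embedding dimensionality
reducer = umap.UMAP(n_components=2, n_neighbors=15)
X_2d = reducer.fit_transform(X)              # (500, 2) embedding
```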
ISOMAP is a nonlinear dimensionality reduction algorithm. In order to create a low-dimensional embedding of the data, ISOMAP builds a weighted graph of k-nearest neighbours with the Euclidean distances as edge weights. After the graph is created, ISOMAP computes the shortest paths between nodes and uses this information to create a low-dimensional representation with the MDS algorithm.
http://web.mit.edu/cocosci/Papers/sci_reprint.pdf
- n_dim - change the dimensionality of the projection
- k_neighbours - change the number of neighbours used in creation of k-neighbour graph
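For completeness, a minimal scikit-learn sketch of ISOMAP with the two parameters above; illustrative only, not EVA's exact wrapper.

```python
import numpy as np
from sklearn.manifold import Isomap

X = np.random.rand(500, 20)                  # placeholder high-dimensional data

# n_neighbors is the k used to build the neighbourhood graph
iso = Isomap(n_components=2, n_neighbors=20)
X_2d = iso.fit_transform(X)                  # (500, 2) embedding
```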
Kepler-MAPPER (k-MAPPER) is a python library implementing the MAPPER algorithm from the field of topological data analysis. k-MAPPER uses an embedding created by another dimensionality reduction algorithm (e.g. t-SNE) and passes it to the MAPPER algorithm.
https://github.com/scikit-tda/kepler-mapper
- n_dim - change the dimensionality of the projection
- k_neighbours - change the number of neighbours passed to the MAPPER algorithm
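A minimal example of how kepler-mapper is typically used: a lens (here a t-SNE projection) is computed first and then passed to the MAPPER step. The cover and clusterer choices below are illustrative defaults, not EVA's exact configuration.

```python
import numpy as np
import kmapper as km
from sklearn.manifold import TSNE
from sklearn.cluster import DBSCAN

X = np.random.rand(500, 20)                  # placeholder high-dimensional data

mapper = km.KeplerMapper(verbose=0)
# lens: the low-dimensional projection passed to the MAPPER algorithm
lens = mapper.fit_transform(X, projection=TSNE(n_components=2))
graph = mapper.map(lens, X, cover=km.Cover(n_cubes=15), clusterer=DBSCAN())
mapper.visualize(graph, path_html="kmapper_output.html")
```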
Multidimensional Scaling (MDS) is a nonlinear dimensionality reduction algorithm. The low-dimensional embedding is obtained by solving an eigenvalue problem for the double-centered matrix of squared proximities.
https://en.wikipedia.org/wiki/Multidimensional_scaling
- n_dim - change the dimensionality of the projection
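The classical flavour of MDS described above can be written in a few lines of NumPy/SciPy; this is a didactic sketch of the double-centering step, not EVA's implementation.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def classical_mds(X, n_dim=2):
    """Classical MDS via eigendecomposition of the double-centered squared distance matrix."""
    D2 = squareform(pdist(X)) ** 2              # squared proximity matrix
    n = D2.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n         # centering matrix
    B = -0.5 * J @ D2 @ J                       # double centering
    eigvals, eigvecs = np.linalg.eigh(B)        # eigendecomposition
    order = np.argsort(eigvals)[::-1][:n_dim]   # largest eigenvalues first
    return eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0))
```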
The structure of this project allows extending the dashboard to support new dimensionality reduction algorithms with custom, controllable parameters. In order to add a new algorithm, follow these steps (see the sketch after this list):
- Extend the `EvaData` class in `application/plotlydash/Dashboard.py` with an `apply_{name_of_your_alg}` method following the convention of the other `apply` methods.
- Extend the `_getgraph` and `_getdropdowns` methods in the `DimRedDash` class in `application/plotlydash/dim_red_dashboards.py` to support new plots and callbacks for your new algorithm.
- Add new form options to the `VisForm` class in the `forms.py` file.
- Edit the form code in `application/templates/home.html` to include your new, updated form.
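As a purely hypothetical sketch of the first step, a new `apply_*` method might look roughly like the following. The attribute names self.data and self.embedding are placeholders rather than the actual fields of EvaData, and GaussianRandomProjection is just an arbitrary example algorithm.

```python
from sklearn.random_projection import GaussianRandomProjection

class EvaData:
    # ... existing attributes and apply_* methods of the real class ...

    def apply_random_projection(self, n_dim=2):
        """Hypothetical new apply_* method; attribute names are placeholders."""
        projector = GaussianRandomProjection(n_components=n_dim)
        self.embedding = projector.fit_transform(self.data)   # assumed data attribute
        return self.embedding
```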
To demonstrate the results, we selected three datasets: FMNIST, MNIST and FishBowl. We then plotted 2D scatter plots for all datasets and the selected algorithms.
Fashion-MNIST (FMNIST) is a dataset of Zalando's article images, consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image associated with a label from 10 classes. It is intended to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms, sharing the same image size and structure of training and testing splits.
https://github.com/zalandoresearch/fashion-mnist
This is how the data looks (each class takes three rows):
For the visualisation here, we applied the following filters:
- Shoe images as the Inliers (6000 Images)
- T-shirt images as the Outliers (6000 Images)
- Define 1% outliers in the dataset -> 60 T-shirt images
- Normalise: True
- PCA preprocessed: True
- k-neighbour: 20
- perplexity: 30
- k-neighbour: 15
- k-neighbour: 20
From the visualisation, it is clear that UMAP does the best job of isolating the outliers.
The Modified National Institute of Standards and Technology (MNIST) database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image. It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal effort on preprocessing and formatting.
http://yann.lecun.com/exdb/mnist/
Sample image for MNIST dataset:
For the visualisation here, we applied the following filters:
- Zeros as inliers: 6742 images
- Ones as outliers: 67 images (1%)
- Normalise: True
- PCA preprocessed: True
- k-neighbour: 5
- k-neighbour: 15
From the visualisation, it is clear that UMAP does the best job of isolating the outliers.
The FishBowl dataset comprises a sphere embedded in 3D whose top cap has been removed. In other words, it is a punctured sphere, which is sparsely sampled at the bottom and densely at the top.
For the visualisation here, we applied the following filters:
- Rows: 2010
- Inliers: 2000
- Outliers: 10
- Normalise: False
- PCA preprocessed: False
- k-neighbour: 5
- perplexity: 30
- k-neighbour: 20
- k-neighbour: 20
From the visualisation, we can say that LLE arguably does the best job of isolating the outliers.
When reducing large datasets with high-dimensional data down to the 2 or 3 dimensions where we can visualize them, the computational complexity of the algorithms plays a major role in how much time it takes to reduce all the data. We mainly worked with the MNIST and Fashion-MNIST datasets. Both contain 60,000 images, with each image having 28x28 pixels, corresponding to 784 features per image. The next points give an overview of what we learned from our experience using the app and the different algorithms on these large datasets.
- LLE does not scale well with the number of data samples. We believe that the computation of the k-nearest-neighbour graph that LLE performs as its initial step requires a lot of computational resources. Therefore, on a regular machine one should avoid applying LLE to high-dimensional (e.g. image) datasets with more than approximately 2000 samples.
- t-SNE does not scale well with the number of data samples either. As in the case of LLE, we do not recommend running t-SNE on high-dimensional datasets with more than 2000 samples.
- PCA scales well with the number of data samples. One should not encounter problems when applying PCA to high-dimensional datasets with a large number of samples. Even for all 60,000 images, PCA is able to reduce the dimensions in a reasonable time.
- Other algorithms such as Isomap and MDS suffer from the same computational problems as LLE and do not scale well with high-dimensional datasets containing a large number of samples.
The good news is that there is one algorithm that is as fast as PCA and is able to unfold non-linear data as well as t-SNE. This is the UMAP algorithm, the most modern of them all (2018).
- UMAP scales well with high-dimensional data and with a large number of data samples.
- For high-dimensional datasets with a large number of samples, we recommend trying PCA, which preserves the global structure of the data, and UMAP, which focuses more on preserving local topological information about the data. By using these two algorithms, one can get a quick overview of whether any outliers are present and whether the outliers are clearly separated from the inliers in the 2- or 3-dimensional space.
- The application also gives you the option to apply PCA to the dataset before using computationally expensive algorithms such as t-SNE. This is a common preprocessing step. Consider the MNIST dataset consisting of 60,000 images with 784 features each: the data matrix has the dimensions 60,000 x 784. We apply PCA and reduce the data to, e.g., 50 dimensions while keeping 90% of the variance, so the new data matrix has the dimensions 60,000 x 50. Then we can apply t-SNE to this lower-dimensional dataset. We observed that, since most of the variance is kept in the preprocessing step, the t-SNE results are still good. Most importantly, since we are working with a lower-dimensional dataset, the computation time drops. Nevertheless, the t-SNE and LLE computation times seem to depend more on the number of data samples than on the data dimension; for this reason, we still recommend trying UMAP or PCA first for quick results. A sketch of this PCA + t-SNE pipeline is shown below.
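A minimal sketch of this preprocessing pipeline with scikit-learn; the variance actually retained by 50 components depends on the dataset, and the random data here is just a stand-in for the image matrix.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X = np.random.rand(1000, 784)                          # placeholder for the image data

# First compress the 784 features to 50 PCA components, then run the
# expensive t-SNE step on the much smaller matrix.
X_pca = PCA(n_components=50).fit_transform(X)          # (n_samples, 50)
X_2d = TSNE(n_components=2, perplexity=30.0).fit_transform(X_pca)
```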
Last but not least, we conducted some research regarding the scalability of UMAP in comparison to other algorithms. The next two figures can be found in the original UMAP paper from 2018.