
Gaussian-Bayes and KNN on Fashion MNIST Dataset


Author: Zuyang Cao

Overview

This project applies two machine learning methods to classify the Fashion-MNIST dataset:

  • Maximum-likelihood (ML) estimation under a Gaussian assumption, followed by Bayes' rule
  • K-Nearest-Neighbor

Two dimensionality-reduction techniques are applied with both classifiers:

  • PCA (Principal Component Analysis)
  • LDA (Linear Discriminant Analysis)
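The two reductions can be sketched with scikit-learn; the shapes and component counts below are illustrative stand-ins, not the project's exact configuration:

```python
# Sketch: PCA (unsupervised) vs. LDA (supervised) dimensionality reduction.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 784))      # stand-in for flattened 28x28 images
y = rng.integers(0, 10, size=200)    # 10 class labels

X_pca = PCA(n_components=30).fit_transform(X)      # ignores labels
X_lda = LinearDiscriminantAnalysis(n_components=9).fit_transform(X, y)  # uses labels

print(X_pca.shape)  # (200, 30)
print(X_lda.shape)  # (200, 9)
```

Note that LDA's output dimension is capped by the number of classes, a point revisited in the Results section below.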

Dataset

License: MIT

Fashion-MNIST is a dataset of Zalando's article images—consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes.
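Before classification, each 28x28 image is flattened into a 784-dimensional vector. A minimal sketch with placeholder arrays (the real data would come from the Fashion-MNIST download):

```python
# Sketch: flattening 28x28 grayscale images into 784-dim feature vectors.
import numpy as np

train_images = np.zeros((60000, 28, 28), dtype=np.uint8)  # placeholder for real data
X_train = train_images.reshape(len(train_images), -1)     # one row per image
print(X_train.shape)  # (60000, 784)
```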

Dataset Visualized

Figure 1. Visualized Dataset

Tunable Parameters

PCA Parameters

  • pca_target_dim: Target dimension to which PCA reduces the data.

LDA Parameters

  • components_number: Number of components kept for dimensionality reduction (at most n_classes - 1).

KNN Parameters

  • neighbor_num: Number of nearest neighbors used when classifying a sample.
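A minimal sketch of how such a parameter maps onto scikit-learn's KNN classifier (`neighbor_num` mirrors the project's parameter name; the data here is random):

```python
# Sketch: KNN classification with a tunable neighbor count.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X_train = rng.normal(size=(100, 30))      # e.g. PCA-reduced features
y_train = rng.integers(0, 10, size=100)

neighbor_num = 7  # project default, see Results below
knn = KNeighborsClassifier(n_neighbors=neighbor_num).fit(X_train, y_train)
pred = knn.predict(X_train[:5])
print(pred.shape)  # (5,)
```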

Results

Dimensionality Reduction Visualization

  • PCA

PCA_train_2D

Figure 2. PCA training set 2D

PCA_train_3D

Figure 3. PCA training set 3D

  • LDA

LDA_train_2D

Figure 4. LDA training set 2D

LDA_train_3D

Figure 5. LDA training set 3D

For more visualization images, please refer to the visualization folder.

KNN with Different Parameters

  • K-Neighbors

Accuracy vs K Neighbors_scaled

Figure 6. Accuracy and K Number

From Figure 6, it is clear that KNN reaches 100% accuracy on the training set when K is set to 1, a typical case of overfitting. As K increases, the accuracy on the test set rises slightly and stabilizes once K reaches 7, so the default K in this project is set to 7.
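The K=1 case is easy to reproduce: each training point is its own nearest neighbor, so the training accuracy is trivially 100%. A small sketch on random (duplicate-free) data:

```python
# Sketch: K=1 KNN memorizes the training set -- the overfitting case above.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 10))
y = rng.integers(0, 10, size=200)

knn1 = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print(knn1.score(X, y))  # 1.0 on the training set
```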

  • Dimension Reduction Parameters

Accuracy vs PCA&LDA

Figure 7. Accuracy with PCA and LDA

When the dimension number N is larger than 10, the PCA accuracy increases as N increases, while the LDA accuracy stays constant. According to the scikit-learn manual, components_number has an upper limit of min(dimension_number, class_number - 1). In this case the class number is 10, so the upper limit for the LDA component number is 9: in Figure 7 the LDA projection is always 9-dimensional, and its accuracy therefore stays at a fixed value.
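This cap can be checked directly: scikit-learn accepts n_components up to min(n_features, n_classes - 1) and rejects anything larger. A sketch with 10 classes (random data, illustrative shapes):

```python
# Sketch: LDA's n_components is capped at min(n_features, n_classes - 1).
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 50))
y = rng.integers(0, 10, size=200)   # 10 classes -> cap is 9

lda = LinearDiscriminantAnalysis(n_components=9).fit(X, y)  # max allowed
print(lda.transform(X).shape)  # (200, 9)

rejected = False
try:
    LinearDiscriminantAnalysis(n_components=10).fit(X, y)   # exceeds the cap
except ValueError:
    rejected = True
print("n_components=10 rejected:", rejected)
```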

Low PCA&LDA Parameters

Figure 8. Accuracy with Low PCA and LDA Value

When N is set below 10, both PCA and LDA accuracies increase monotonically with N. Based on Figures 7 and 8, the default N in this project is set to 30 to get a stable result.

Bayes vs KNN

The Gaussian-based Bayes classifier is a simple self-built class, so its accuracy may be lower than that of a built-in classifier from scikit-learn or other libraries.
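The idea behind such a classifier can be sketched as follows: fit a per-class Gaussian by maximum likelihood, then pick the class with the highest posterior under Bayes' rule. This is an illustrative reimplementation, not the project's exact class:

```python
# Sketch: ML-estimated per-class Gaussians + Bayes' rule for classification.
import numpy as np
from scipy.stats import multivariate_normal

class GaussianBayes:
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.priors_, self.dists_ = [], []
        for c in self.classes_:
            Xc = X[y == c]
            self.priors_.append(len(Xc) / len(X))
            # ML estimates: sample mean and covariance (small ridge for stability)
            cov = np.cov(Xc, rowvar=False) + 1e-6 * np.eye(X.shape[1])
            self.dists_.append(multivariate_normal(Xc.mean(axis=0), cov))
        return self

    def predict(self, X):
        # log posterior (up to a constant) = log likelihood + log prior
        scores = np.stack([d.logpdf(X) + np.log(p)
                           for d, p in zip(self.dists_, self.priors_)], axis=1)
        return self.classes_[np.argmax(scores, axis=1)]

# Toy usage on two well-separated blobs
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(5, 1, (50, 3))])
y = np.array([0] * 50 + [1] * 50)
clf = GaussianBayes().fit(X, y)
print((clf.predict(X) == y).mean())  # ~1.0 on this separable data
```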

The PCA dimension is set to 30 and LDA to its default in both methods.

| Dataset          | Bayes Accuracy | KNN Accuracy |
| ---------------- | -------------- | ------------ |
| LDA Training set | 75.12 %        | 87.00 %      |
| PCA Training set | 71.91 %        | 88.59 %      |
| LDA Testing set  | 73.70 %        | 83.06 %      |
| PCA Testing set  | 71.58 %        | 85.46 %      |

For a sample Bayes run log, please refer to the Travis build log; the results appear at the end of the log.
