# Animal and Vehicle Colored Image Classification
## Final Capstone Project
### John A. Fonte
May 2019 <br>
[Github](https://github.com/jafonte01)

---

# Table of Contents

1. Introduction
    - Statement of Problem to be Solved
    - Explanation of Dataset
2. Load Image Data
    - Perform any cleaning
    - OpenCV
<br><br>
2. Data Exploration
    - Visualization of Sample Datapoints
<br><br>
3. Data Preparation for Modeling
    - Dimensionality reduction
    - Use MLP Classifier to demonstrate dimensionality reduction effect
<br><br>
4. Unsupervised Learning
    - Spectral Clustering
    - t-SNE modeling and comparison to PCA results
<br><br>
5. Supervised Modeling
    - Scaling introduced into modeling pipeline if not already done in Part 3
    - Inclusion & Application of Autoencoding
    - MLP Classifier
    - Recurrent Neural Network
    - Convolutional Neural Networks
<br><br>
6. Conclusion
    - Final Analysis & Recommendations

# 1. Introduction

# Explanation of Dataset: "CIFAR-10"

## Introduction and Significance of Research

Object detection, identification, and classification are ever-growing tasks in the computer science industry. The applications of the interdisciplinary field known as "computer vision" range from facial recognition to handwriting detection to automated censoring and redaction, not to mention accurately indexing the countless images on the internet.

It is therefore important to create machine learning models that not only classify objects accurately, but also do so efficiently. This project aims to determine which models achieve both goals: accuracy and efficiency.

## Proposed Data Source

The proposed image dataset was compiled by the Canadian Institute for Advanced Research, with the help of the University of Toronto's Computer Science Department. The dataset, called CIFAR-10, is an image classification dataset with 10 classes: four are vehicles (airplane, automobile, ship, truck), and six are animals (bird, cat, deer, dog, frog, horse).*

Python's built-in `pickle` module will be used to load the serialized data batches into the Python environment, and an image library such as Pillow ("PIL") can be used to render individual images. Thanks to CIFAR, every image is a uniform 32 x 32 pixel color picture, so no resizing or resolution adjustment is needed. This saves significant time that would otherwise go to cumbersome data cleaning.

\* Source: [CIFAR-10 and CIFAR-100 datasets](http://www.cs.toronto.edu/~kriz/cifar.html)


In [1]:
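import pickle

def unpickle(file):
    """Load one pickled CIFAR-10 batch file and return its contents as a dict."""
    with open(file, 'rb') as fo:
        # encoding='bytes' keeps the keys as byte strings such as b'data'
        batch = pickle.load(fo, encoding='bytes')
    return batch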

In [3]:
batch1 = unpickle('D:/Github/Data-Science-Bootcamp/CAPSTONE - FINAL/cifar-10-batches-py/data_batch_1')

In [4]:
batch1.keys()

dict_keys([b'batch_label', b'labels', b'data', b'filenames'])
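The keys above are byte strings (`b'data'`, `b'labels'`, and so on) because the batch is loaded with `encoding='bytes'`. If plain string keys are more convenient, a small wrapper can decode them on load. This is only a sketch: `load_batch` is a hypothetical helper name, not part of the dataset's tooling, and the demonstration uses a tiny stand-in file rather than a real CIFAR-10 batch.

```python
import os
import pickle
import tempfile


def load_batch(path):
    """Load a CIFAR-style pickled batch and return a dict with str keys."""
    with open(path, 'rb') as fo:
        raw = pickle.load(fo, encoding='bytes')
    # Decode only the keys; values (image array, labels, filenames) are untouched.
    return {k.decode('utf-8') if isinstance(k, bytes) else k: v
            for k, v in raw.items()}


# Demonstration on a tiny stand-in file (NOT real CIFAR-10 data).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo_batch')
    with open(demo_path, 'wb') as fo:
        pickle.dump({b'labels': [6, 9], b'filenames': [b'a.png', b'b.png']}, fo)
    demo_batch = load_batch(demo_path)

print(sorted(demo_batch.keys()))
```

With a real batch, `load_batch(path)['labels']` would then replace `batch1.get(b'labels')`. Note that values such as filenames remain byte strings, since only the keys are decoded.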

In [13]:
batch1.get(b'data')

array([[ 59,  43,  50, ..., 140,  84,  72],
       [154, 126, 105, ..., 139, 142, 144],
       [255, 253, 253, ...,  83,  83,  84],
       ...,
       [ 71,  60,  74, ...,  68,  69,  68],
       [250, 254, 211, ..., 215, 255, 254],
       [ 62,  61,  60, ..., 130, 130, 131]], dtype=uint8)
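Each row of the array above is one flattened image. According to the CIFAR-10 dataset page, a row holds 3072 `uint8` values: the first 1024 are the red channel, the next 1024 the green, and the last 1024 the blue, each stored row by row for a 32 x 32 image. Plotting libraries generally expect a height x width x channel layout, so the rows need to be reshaped first. The following is a minimal sketch under that assumption; `rows_to_images` is a hypothetical helper name, and the demonstration uses synthetic rows instead of `batch1[b'data']`.

```python
import numpy as np


def rows_to_images(rows):
    """Convert (N, 3072) CIFAR-style rows into (N, 32, 32, 3) RGB images."""
    rows = np.asarray(rows)
    # Split each row into 3 channel planes of 32x32, then move channels last.
    return rows.reshape(-1, 3, 32, 32).transpose(0, 2, 3, 1)


# Synthetic stand-in for batch1[b'data']: two all-black images.
demo_rows = np.zeros((2, 3072), dtype=np.uint8)
demo_rows[0, :1024] = 255  # turn the red channel of the first image fully on

demo_images = rows_to_images(demo_rows)
print(demo_images.shape)
```

On the real data, `rows_to_images(batch1[b'data'])[0]` could be passed directly to `plt.imshow` to view the first image.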

In [None]:
# Working notes and references
# https://www.analyticsvidhya.com/blog/2018/03/comprehensive-collection-deep-learning-datasets/
#
# Candidate models:
#   - sklearn MLP (deep feed-forward)
#   - convolutional
#   - recurrent
#
# https://www.kaggle.com/hamishdickson/preprocessing-images-with-dimensionality-reduction
# https://idyll.pub/post/dimensionality-reduction-293e465c2a3443e8941b016d/
#
# Unsupervised learning is required: look under the "clustering" section
# to see clustering of images and sampling of the clusters.
# Spectral clustering is also good for image clustering.
#
# https://www.datacamp.com/community/tutorials/machine-learning-python
    
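# Import matplotlib and NumPy
import matplotlib.pyplot as plt
import numpy as np

# ASSUMPTION: `clf` is a fitted clustering model with 10 centers (e.g. sklearn KMeans)
# trained on the unscaled, flattened (N, 3072) CIFAR-10 rows; it is not defined yet.

# Figure size in inches
fig = plt.figure(figsize=(8, 3))

# Add title
fig.suptitle('Cluster Center Images', fontsize=14, fontweight='bold')

# For all clusters (0-9)
for i in range(10):
    # Initialize subplots in a grid of 2x5, at the (i+1)th position
    ax = fig.add_subplot(2, 5, 1 + i)
    # Rebuild the 32x32 RGB image from the channel-first flattened center
    center = clf.cluster_centers_[i].reshape(3, 32, 32).transpose(1, 2, 0)
    # Display the center, clipped to the valid 0-255 pixel range
    ax.imshow(np.clip(center, 0, 255).astype(np.uint8))
    # Don't show the axes
    ax.axis('off')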

In [None]:
# More visualization to look into: Isomap, LLE, t-SNE, kernel PCA

# ASSUMPTION: `X_train`, `y_train`, and the clustering model `clf` are defined in
# earlier cells; none of them is created in this notebook yet.

# Import `Isomap()`
from sklearn.manifold import Isomap

# Create an Isomap embedding and fit the training data to it
X_iso = Isomap(n_neighbors=10).fit_transform(X_train)

# Compute cluster centers and predict cluster index for each sample
clusters = clf.fit_predict(X_train)

# Create a plot with subplots in a grid of 1x2
fig, ax = plt.subplots(1, 2, figsize=(8, 4))

# Adjust layout
fig.suptitle('Predicted Versus Training Labels', fontsize=14, fontweight='bold')
fig.subplots_adjust(top=0.85)

# Add scatterplots to the subplots
ax[0].scatter(X_iso[:, 0], X_iso[:, 1], c=clusters)
ax[0].set_title('Predicted Training Labels')
ax[1].scatter(X_iso[:, 0], X_iso[:, 1], c=y_train)
ax[1].set_title('Actual Training Labels')

# Show the plots
plt.show()
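Isomap works from pairwise distances between all of the samples it is fitted on, so the `fit_transform` call above can become slow and memory-hungry on a full CIFAR-10 batch. One option, offered here only as a sketch, is to fit the embedding on a random subset. The helper name `subsample` and the subset size are assumptions for illustration, and the demonstration uses small synthetic arrays in place of `X_train` and `y_train`.

```python
import numpy as np


def subsample(X, y, n_samples, seed=0):
    """Return a random subset of rows from X together with their labels."""
    X = np.asarray(X)
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    # Draw row indices without replacement so no sample is repeated.
    idx = rng.choice(len(X), size=min(n_samples, len(X)), replace=False)
    return X[idx], y[idx]


# Synthetic stand-ins: row i of demo_X is [2*i, 2*i + 1] and its label is i.
demo_X = np.arange(20).reshape(10, 2)
demo_y = np.arange(10)

X_small, y_small = subsample(demo_X, demo_y, n_samples=4)
print(X_small.shape, y_small.shape)
```

The reduced arrays could then stand in for `X_train` and `y_train` in both the Isomap and the clustering calls, so that the scatterplots compare the same set of points.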