# Exploratory Analysis

Since the goal of my project is to build an image classifier using a subset of the ImageNet dataset, I've focused less on the training/testing data itself and more on the process of building and training the model(s).

With that in mind, this analysis covers those processes and experiments.

## Data

My final data will be a subset of data downloaded from http://image-net.org/, selected by searching across its "synsets" (the categories that ImageNet organizes according to the WordNet hierarchy).

I'll be relying on ImageNet's labels as the ground truth, although I'll do some spot-checking to remove defunct images. These occur because the images are hosted by outside parties rather than by ImageNet, so some of the URLs are no longer online.

(The above might not be entirely true. Apparently ImageNet lets you download the images directly if you sign up and sign in. I'm electing not to do that.)
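Part of the spot-checking could probably be automated. The sketch below assumes that a dead URL tends to return an HTML error page or an empty response rather than image bytes, and flags any download that does not start with a known image signature. The function names and the `{filename: bytes}` input shape are hypothetical, not taken from my notebooks.

```python
# Leading "magic bytes" of common image formats.
IMAGE_SIGNATURES = (
    b"\xff\xd8\xff",        # JPEG
    b"\x89PNG\r\n\x1a\n",   # PNG
    b"GIF87a",              # GIF (older variant)
    b"GIF89a",              # GIF (newer variant)
)


def looks_like_image(data: bytes) -> bool:
    """Return True if the payload starts with a known image signature."""
    return data.startswith(IMAGE_SIGNATURES)


def split_defunct(downloads: dict) -> tuple:
    """Split a {filename: bytes} mapping into (valid, defunct) filename lists."""
    valid = sorted(name for name, data in downloads.items() if looks_like_image(data))
    defunct = sorted(name for name in downloads if name not in valid)
    return valid, defunct
```

This only catches the obvious failures; a placeholder image served in place of the original would still need a manual look.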

## Experimentation

### Experiment One - MNIST Digits

#### Link to code:

https://github.com/minhoftran/my-hw-repo/blob/master/final_project_experiment_mnist_digits.ipynb

This first experiment was mainly to get Keras and TensorFlow installed, then hopefully become more familiar with them.

My goal for this (aside from successfully creating a model to predict the appropriate handwritten digits) was to set up a reproducible and reusable notebook that broke down the process step-by-step. This would (hopefully) allow me to copy the notebook and adjust the steps as needed depending on the desired outcome/dataset/model.

Code was largely taken from an online tutorial (originally built for Keras using the Theano backend), then adjusted as necessary for TensorFlow: https://elitedatascience.com/keras-tutorial-deep-learning-in-python

### Experiment Two - Cats and Dogs (Kaggle)

#### Code for attempt one:

https://github.com/minhoftran/my-hw-repo/blob/master/final_project_experiment_catsndogs%20(1st%20attempt).ipynb

Attempt one on 'cats and dogs' was mostly spent trying to adjust my MNIST experiment to work with non-uniform image sizes and, this time, with color.

I had initially attempted (and succeeded) to load and preprocess (crop and resize) the images using PIL (Python Imaging Library). Where I got stuck was when it came time to actually assign class labels to my datasets.
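For reference, the crop step boils down to finding the largest centered square in each image before resizing. This is a minimal sketch of that arithmetic, not the exact code from the notebook; `center_crop_box` is a hypothetical helper name.

```python
def center_crop_box(width: int, height: int) -> tuple:
    """Return the (left, upper, right, lower) box of the largest centered square."""
    side = min(width, height)
    left = (width - side) // 2     # trim equally from the left and right
    upper = (height - side) // 2   # trim equally from the top and bottom
    return (left, upper, left + side, upper + side)


print(center_crop_box(500, 300))  # landscape image: sides get trimmed
print(center_crop_box(300, 500))  # portrait image: top and bottom get trimmed
```

The returned tuple uses the same (left, upper, right, lower) order that PIL's `Image.crop` expects, so it should be usable as `img.crop(box)` followed by `img.resize(...)` to a fixed size of your choosing.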

At this point, I discovered that Keras has several pre-built methods that would take care of both the preprocessing and the class labels.

I stopped this attempt there and went back to reading/researching.

#### Code for attempt two:

https://github.com/minhoftran/my-hw-repo/blob/master/final_project_experiment_catsndogs%20(2nd%20attempt).ipynb

This time I decided not to simply wing it using bits of knowledge picked up randomly here and there (PIL, mostly), instead opting to read up on Keras and its various built-in tools.

https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html

Luckily, Keras has a method of labelling files with the appropriate class simply by organizing them under class-specific directories.

There are also various ways to augment the data (rotation, skews, scaling, zooming, etc). This is great for smaller datasets and for fighting overfitting.
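Getting the files into that layout is the only manual step. Here's a rough sketch using just the standard library, assuming the raw files are named like `cat.0.jpg` and `dog.0.jpg` (label before the first dot); the function name and directory names are hypothetical.

```python
import shutil
from pathlib import Path


def sort_into_class_dirs(src: Path, dst: Path) -> dict:
    """Copy files named '<label>.<id>.jpg' into dst/<label>/ and count them per label."""
    counts = {}
    for path in sorted(src.glob("*.jpg")):
        label = path.name.split(".")[0]          # 'cat.0.jpg' -> 'cat'
        class_dir = dst / label
        class_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy(path, class_dir / path.name)
        counts[label] = counts.get(label, 0) + 1
    return counts
```

Once the images sit under something like `train/cat/` and `train/dog/`, the directory-based loader described in the blog post above can infer each image's class from its parent folder.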

Another goal here was to lift the code from Lesson 17 that displayed a dataframe of the images against the predictions (for a quick visual validation). Couldn't quite get it working; needs more tinkering.

#### Code for attempt three: Using RandomForestClassifiers

https://github.com/minhoftran/my-hw-repo/blob/master/final_project_experiment_catsndogs%20(3rd%20attempt%20-%20pca%20dr%20with%20random%20forest).ipynb

This attempt was me stepping back and testing a non-neural-network model to tackle classification. Code was mostly lifted from Lesson 17 (Dimensionality Reduction).

Takeaway: seems like this would be great for separating one visually distinct class of images from another.

In class, we dealt with monkeys vs bicycles. I could also see this being used to build a very light/quick model to pull out images with glaring differences, such as line drawings vs 3D models, 3D models vs photographs, etc.

Doesn't seem like this will be robust enough for my purposes though.

### Experiment Three - Bicycles, Monkeys, and Roses

#### Code:

https://github.com/minhoftran/my-hw-repo/blob/master/final_project_experiment_bicyclemonkeyrose.ipynb

Somewhat happy with the results from experiment two (attempt two), I'm now trying to extend that model to work with more than two classes.

Made some changes to the last model layer, switching from activation='sigmoid' (each output is scored independently, so concepts need not be mutually exclusive) to 'softmax' (outputs sum to 1, so concepts are treated as mutually exclusive).
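To make the difference concrete, here's a small pure-Python sketch of the two activations applied to the same raw scores. The scores are made up for illustration; this is just the underlying math, not the Keras implementation.

```python
import math


def sigmoid(scores):
    """Squash each score independently into the range (0, 1)."""
    return [1.0 / (1.0 + math.exp(-s)) for s in scores]


def softmax(scores):
    """Turn scores into probabilities that sum to 1 across all classes."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.1]   # made-up raw outputs for bicycle, monkey, rose
print(sigmoid(scores))     # each class judged on its own; the sum can exceed 1
print(softmax(scores))     # classes compete with each other; the sum is 1
```

With sigmoid, several classes can score high at once; with softmax, raising one class's probability necessarily lowers the others, which is what I want for a single-label output.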

I'm aware that for my final use-case (nature/urban/animals/humans), the concepts can easily overlap within an image, but since I'll be looking for a single concept output, using softmax will hopefully give me a more decisive answer.

Another major change was figuring out which loss function to use when compiling the model. Switched from 'binary_crossentropy' (which, as the name suggests, is meant for two-class or independent yes/no outputs) to 'categorical_crossentropy' (for multiclass classification).

Update: stripped 'rose' from the equation to see if the model worked as expected with two classes. When I originally ran it with all three classes, monkeys were severely under-predicted (only 1 out of 100 monkey images was correctly predicted as a monkey).
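For a single sample, categorical cross-entropy is just the negative log of the probability the model assigned to the true class. A minimal sketch of that calculation, assuming one-hot labels; the helper names and the small `eps` clamp are my own, not Keras internals.

```python
import math


def one_hot(index, num_classes):
    """Encode a class index as a one-hot list, e.g. 1 of 3 -> [0.0, 1.0, 0.0]."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]


def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Compute -sum(true * log(pred)) for one sample; eps guards against log(0)."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


monkey = one_hot(1, 3)  # assumed class order: bicycle, monkey, rose
print(categorical_crossentropy(monkey, [0.1, 0.8, 0.1]))  # confident and right: low loss
print(categorical_crossentropy(monkey, [0.1, 0.1, 0.8]))  # confident and wrong: high loss
```

Only the true class's predicted probability contributes to the loss, so it pairs naturally with softmax output and one-hot (categorical) labels.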

I think the initial issue was simply a lack of monkey data, so the roses were overwhelming it.
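If the imbalance really is the culprit, one thing I could try (instead of dropping a class) is weighting each class inversely to how many images it has. The sketch below uses the common "balanced" heuristic, total / (number of classes * class count); the image counts are invented for illustration and are not my real dataset sizes.

```python
def balanced_class_weights(counts: dict) -> dict:
    """Weight each class by total / (num_classes * class_count)."""
    total = sum(counts.values())
    num_classes = len(counts)
    return {label: total / (num_classes * n) for label, n in counts.items()}


# Hypothetical image counts per class.
counts = {"bicycle": 400, "monkey": 100, "rose": 700}
weights = balanced_class_weights(counts)
print(weights)  # the rarest class (monkey) gets the largest weight
```

Keras' `fit` accepts a `class_weight` dictionary keyed by integer class index, so the labels here would need to be mapped to the indices the data loader assigns before passing the weights in.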

## Next Steps

My next steps are to finish collecting/cleaning the data for my final taxonomy (nature/urban/animal/people), then run it through the same code from experiment three.

Fingers crossed that there won't be much debugging.