# DL4J Neural Net Computer Vision Example

Neural nets are a machine learning algorithm used for classification and prediction that can deal with complex dimensionality. This notebook provides sample code showing how to structure, run, and save a neural net using DL4J for a simplified computer vision problem. Given pictures of different animals, the goal is to classify each image by assigning a probability to each class.

<img src="nn_diagram.jpg">

Neural nets are especially great for image and word datasets that are not dense. The data is converted to a numerical representation and fed into the net, where each node applies a linear transformation followed by a non-linear one.

>***linear equation***<br>
>$z_k= \sum_{j=1} \mathbf{w_{k,j}}\mathbf{x_j} + \mathbf{b_k}$


>***sigmoid non-linear equation***<br>
>$y= \sigma(z) = \dfrac{1}{1+\mathrm{e}^{-z}}$
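
The two equations above can be sketched for a single node in a few lines of Scala. This is only an illustration: the helper names `linear` and `sigmoid` and all the numbers are assumptions made up for the example, not part of DL4J.

```scala
// Minimal sketch of one node: a weighted sum followed by a sigmoid.
// All names and values here are hypothetical and chosen for illustration.
def linear(w: Array[Double], x: Array[Double], b: Double): Double =
  w.zip(x).map { case (wj, xj) => wj * xj }.sum + b

def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

val w = Array(0.5, -0.25) // hypothetical weights
val x = Array(2.0, 4.0)   // hypothetical inputs
val b = 0.0               // hypothetical bias

val z = linear(w, x, b)   // 0.5 * 2.0 + (-0.25) * 4.0 + 0.0 = 0.0
val y = sigmoid(z)        // sigmoid(0.0) = 0.5
```

In a real net, every node in a layer performs this computation at once as a matrix operation, which is what ND4J handles behind the scenes.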

The weights ($w$), also known as parameters, are used to fit the model to its objective. To accomplish this, gradient descent optimization techniques are used to ***find the optimal weights*** that will lead to correct classification. Gradient descent takes the derivative of the model loss with respect to each weight and shifts the weight against that gradient, scaled by the learning rate and other hyper-parameters such as momentum. More information about how gradient descent works can be found in the References section.
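
As a rough illustration of the update described above, the sketch below minimizes a made-up one-weight loss, $(w - 3)^2$, using gradient descent with momentum. The loss, the function names `gradient` and `descend`, and the hyper-parameter values are all assumptions chosen for the example; DL4J applies the same idea to every weight in the net.

```scala
// Sketch: minimize loss(w) = (w - 3)^2 with gradient descent plus momentum.
// The loss and all values are hypothetical.
def gradient(w: Double): Double = 2.0 * (w - 3.0)

def descend(start: Double, learningRate: Double, momentum: Double, steps: Int): Double = {
  var w = start
  var velocity = 0.0
  for (_ <- 0 until steps) {
    // Momentum keeps part of the previous step; the gradient term moves downhill.
    velocity = momentum * velocity - learningRate * gradient(w)
    w += velocity
  }
  w
}

// Starts at w = 0 and ends close to the minimum at w = 3.
val fitted = descend(start = 0.0, learningRate = 0.1, momentum = 0.9, steps = 500)
```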

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/1/1e/Extrema_example.svg/600px-Extrema_example.svg.png">

<center>- Wikipedia</center>

More information on DL4J and how neural nets function can be found at:
- DL4J http://deeplearning4j.org/documentation.html
- Neural Nets for Newbies https://youtu.be/Cu6A96TUy_o

## Requirements

- Java 8
- Maven 3.3.9
- iScala Notebook

## Setting Dependencies

In [1]:
//Below is for Jupyter-Scala notebook. If iScala is used then below should change to load dependencies
load.resolver("DefaultMavenRepository" at "https://repo1.maven.org/maven2")



In [2]:
val dl4jVersion = "0.4-rc3.8"
val nd4jVersion = "0.4-rc3.8"
val canovaVersion = "0.0.0.14"

dl4jVersion: java.lang.String = "0.4-rc3.8"
nd4jVersion: java.lang.String = "0.4-rc3.8"
canovaVersion: java.lang.String = "0.0.0.14"

In [None]:
load.ivy("org.deeplearning4j" % "deeplearning4j-core" % dl4jVersion)
load.ivy("org.deeplearning4j" % "deeplearning4j-nlp" % dl4jVersion)
load.ivy("org.deeplearning4j" % "deeplearning4j-ui" % dl4jVersion)
load.ivy("org.nd4j" % "nd4j-x86" % nd4jVersion)
// Coordinates follow the order group % artifact % version
load.ivy("org.nd4j" % "canova-spark" % canovaVersion)
load.ivy("org.nd4j" % "canova-nd4j-codec" % canovaVersion)
load.ivy("org.nd4j" % "canova-nd4j-image" % canovaVersion)

In [None]:
import org.apache.commons.io.{FileUtils, FilenameUtils}
import org.canova.api.records.reader.RecordReader
import org.canova.api.split.LimitFileSplit
import org.canova.image.loader.BaseImageLoader
import org.canova.image.recordreader.ImageRecordReader
import org.deeplearning4j.datasets.canova.RecordReaderDataSetIterator
import org.deeplearning4j.datasets.iterator.DataSetIterator
import org.deeplearning4j.eval.Evaluation
import org.deeplearning4j.nn.api.OptimizationAlgorithm
import org.deeplearning4j.nn.conf.MultiLayerConfiguration
import org.deeplearning4j.nn.conf.NeuralNetConfiguration
import org.deeplearning4j.nn.conf.layers.ConvolutionLayer
import org.deeplearning4j.nn.conf.layers.DenseLayer
import org.deeplearning4j.nn.conf.layers.OutputLayer
import org.deeplearning4j.nn.conf.layers.SubsamplingLayer
import org.deeplearning4j.nn.conf.layers.LocalResponseNormalization
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork
import org.deeplearning4j.nn.weights.WeightInit
import org.deeplearning4j.nn.conf.GradientNormalization
import org.deeplearning4j.nn.conf.Updater
import org.deeplearning4j.optimize.listeners.ScoreIterationListener
import org.nd4j.linalg.api.ndarray.INDArray
import org.nd4j.linalg.dataset.{SplitTestAndTrain, DataSet}
import org.nd4j.linalg.factory.Nd4j
import org.nd4j.linalg.lossfunctions.LossFunctions
import java.io.{FileOutputStream, DataOutputStream, IOException, File}
import java.util.{Random}
import scala.collection.mutable.ListBuffer

## Loading Data

The first step is to clean up and load the data for training and testing:
- Store the data in a folder that the model can load from
- Confirm the formats are the same (e.g. pictures exist and have similar sizes)
- Convert data to a DataSet structure (numerical feature format and labels)
- Setup the data to load in batches inside an iterator

Also be aware of whether the data is supervised or unsupervised, which simply means labeled or unlabeled. This example works with labeled images, so it is supervised.

### *Data*

The images provided in this example are from the U.S. Fish and Wildlife Service because they are in the public domain. There are four categories with about 20 images each in the dataset provided:

- bear
- deer
- duck
- turtle

The images vary in pixel size and they are all RGB which means they have 3 channels of color.

<center>***Example Image***</center>

<img src="animals/turtle/Blandings_Turtle.jpg">

In [None]:
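// Load images and labels
// Note: width, height, channels, appendLabels, numExamples, outputNum and batchSize
// are defined in the Variables cell under Configuring; run that cell first.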
val mainPath: File = new File("animals")
val labels: List[String] = List("bear", "deer", "duck", "turtle")

val recordReader: RecordReader = new ImageRecordReader(width, height, channels, appendLabels)
try {
  recordReader.initialize(
    new LimitFileSplit(mainPath, BaseImageLoader.ALLOWED_FORMATS, numExamples, outputNum, null, new Random(123)))
} catch {
  case ioe: IOException => ioe.printStackTrace()
  case e: InterruptedException => e.printStackTrace()
}
val dataIter: DataSetIterator = new RecordReaderDataSetIterator(recordReader, batchSize, -1, outputNum)

When working with computer vision, you will want many more examples to run through your model for it to build a solid representation of the different animals. The sample set is too small to achieve high accuracy scores. When you have few examples, use techniques to modify and expand the dataset such as:
- flip or rotate images by various degrees
- change the color saturation (including change to grey scale)
- crop the image in different positions
- search and download more examples
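
Two of the techniques in the list above, flipping and cropping, can be sketched on a plain 2-D array of pixel values. This is a toy illustration only: the helper names `flipHorizontal` and `crop` are hypothetical, and real code would use an image library rather than nested arrays.

```scala
// Sketch: simple augmentations on an image stored as rows of pixel values.
// Helper names are hypothetical; a real pipeline would use an image library.
def flipHorizontal(image: Array[Array[Int]]): Array[Array[Int]] =
  image.map(_.reverse)

def crop(image: Array[Array[Int]], top: Int, left: Int,
         height: Int, width: Int): Array[Array[Int]] =
  image.slice(top, top + height).map(_.slice(left, left + width))

val tiny = Array(
  Array(1, 2, 3),
  Array(4, 5, 6))

val flipped = flipHorizontal(tiny)  // rows become (3, 2, 1) and (6, 5, 4)
val cropped = crop(tiny, 0, 1, 2, 2) // rows become (2, 3) and (5, 6)
```

Each augmented copy keeps the label of the original image, so a single labeled photo can yield several training examples.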

## Configuring

Model configuration takes experimentation to get familiar with all the options. The list below outlines key attributes that you can define in the model configuration.

- ***weightInit*** = how to initialize parameters which is typically a variation on random
- ***activation*** = non-linear function applied to parameters (weights & bias) on every node in the layer
- ***seed*** = locks parameter initialization each time for consistency when checking the impact of hyper-parameters
- ***gradientNormalization*** = regularization techniques to smooth gradient results
- ***optimizationAlgo*** = type of convex optimizer used to calculate gradients, which determines how the loss function gradient is applied to weight updates
- ***updater*** = type of equation to use when updating parameters (e.g. Nesterovs applies momentum to the learning rate for the gradient update)
- ***learningRate*** = the step to take down or up the optimizer algorithm to improve model convergence
- ***regularization*** = tells the model to apply weight decay (e.g. l1 or l2 defines the amount to apply and this applied to both weights and bias)
- ***list*** = how many layers are in the model and does not count input as a layer
- ***layer*** = construct to define each layer; requires an index when there is more than one layer
- ***backprop*** = whether to apply backprop to the model for parameters updates
- ***pretrain*** = whether to pretrain the model

Note, most of these can be defined globally or inside the definition of each layer. 

### *Variables*

In [None]:
val seed = 123
val height = 120
val width = 120
val channels = 3
val numExamples = 80
val outputNum = 4
val batchSize = 20
val listenerFreq = 5
val appendLabels = true
val iterations = 2
val epochs = 2
val splitTrainNum = 10

***Computer Vision Common Configuration***<br>
It's good to start with common configuration approaches like the ones provided below and then use training and tuning to modify hyper-parameters. More information on this topic is covered in the Tuning section.

- ***"relu"*** = rectified linear unit, an activation function that helps prevent vanishing gradients because it sets the activation threshold at zero
> $f(x)=max(0,x)$
- ***LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD*** *(aka cross-entropy)* = evaluates and scores model error, where $y'$ is the true label distribution and $y$ is the predicted distribution
> $H_{y'}(y) = -\sum_{i} y'_i \log(y_i)$
- ***OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT*** = how to update weights based on the error and gradient from a single example or mini-batch rather than the full training set
> $w_j = w_j + \alpha(y_i-h_w(x_i))x_{i,j}$
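
To make the activation and loss in the list above more concrete, here is a small sketch in plain Scala. The helper names `relu` and `crossEntropy` and the probability values are assumptions for illustration; DL4J computes these internally.

```scala
// Sketch: ReLU and cross-entropy written out by hand (hypothetical helpers).
def relu(x: Double): Double = math.max(0.0, x)

// Cross-entropy between a one-hot target and predicted class probabilities.
def crossEntropy(target: Array[Double], predicted: Array[Double]): Double = {
  val total = target.zip(predicted).map {
    case (t, p) => if (t == 0.0) 0.0 else t * math.log(p)
  }.sum
  -total
}

val target = Array(0.0, 1.0, 0.0, 0.0)        // hypothetical: the true class is "deer"
val confident = Array(0.05, 0.85, 0.05, 0.05) // most probability on the right class
val unsure = Array(0.25, 0.25, 0.25, 0.25)    // probability spread evenly

val lowLoss = crossEntropy(target, confident)
val highLoss = crossEntropy(target, unsure)   // equals ln(4)
```

The loss is lower when more probability is placed on the correct class, which is why training drives the score down.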

***Tiny ImageNet Example***<br>
Below are two example configurations. The first is based on the Tiny ImageNet paper, which provides guidance on building as compact a model as possible that is still effective at image classification.

In [None]:
// Tiny ImageNet Example
val confTiny: MultiLayerConfiguration = new NeuralNetConfiguration.Builder()
  .seed(seed)
  .iterations(iterations)
  .activation("relu")
  .weightInit(WeightInit.XAVIER)
  .gradientNormalization(GradientNormalization.RenormalizeL2PerLayer)
  .optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT)
  .updater(Updater.NESTEROVS)
  .learningRate(0.01)
  .momentum(0.9)
  .regularization(true)
  .l2(0.04)
  .useDropConnect(true)
  .list()
  .layer(0, new ConvolutionLayer.Builder(5, 5)
    .name("cnn1")
    .nIn(channels)
    .stride(1, 1)
    .padding(2, 2)
    .nOut(32)
    .build())
  .layer(1, new SubsamplingLayer.Builder(SubsamplingLayer.PoolingType.MAX)
    .kernelSize(3, 3)
    .name("pool1")
    .build())
  .layer(2, new LocalResponseNormalization.Builder(3, 5e-05, 0.75).build())
  .layer(3, new ConvolutionLayer.Builder(5, 5)
    .name("cnn2")
    .stride(1, 1)
    .padding(2, 2)
    .nOut(32)
    .build())
  .layer(4, new SubsamplingLayer.Builder(SubsamplingLayer.PoolingType.MAX)
    .kernelSize(3, 3)
    .name("pool2")
    .build())
  .layer(5, new LocalResponseNormalization.Builder(3, 5e-05, 0.75).build())
  .layer(6, new ConvolutionLayer.Builder(5, 5)
    .name("cnn3")
    .stride(1, 1)
    .padding(2, 2)
    .nOut(64)
    .build())
  .layer(7, new SubsamplingLayer.Builder(SubsamplingLayer.PoolingType.MAX)
    .kernelSize(3, 3)
    .name("pool3")
    .build())
  .layer(8, new DenseLayer.Builder()
    .name("ffn1")
    .nOut(250)
    .dropOut(0.5)
    .build())
  .layer(9, new OutputLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
    .nOut(outputNum)
    .activation("softmax")
    .build())
  .backprop(true).pretrain(false)
  .cnnInputSize(height, width, channels).build()

***AlexNet Example***<br>
The second configuration is a slight variant of AlexNet, which won the ImageNet competition in 2012 for image classification.

In [None]:
// AlexNet Example
val nonZeroBias = 1
val dropOut = 0.5
val poolingType: SubsamplingLayer.PoolingType = SubsamplingLayer.PoolingType.MAX

val confAlexNet: MultiLayerConfiguration = new NeuralNetConfiguration.Builder()
    .seed(seed)
    .weightInit(WeightInit.XAVIER)
    .activation("relu")
    .updater(Updater.NESTEROVS)
    .iterations(iterations)
    // normalize to prevent vanishing or exploding gradients
    .gradientNormalization(GradientNormalization.RenormalizeL2PerLayer) 
    .optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT)
    .learningRate(1e-3)
    .learningRateScoreBasedDecayRate(1e-1)
    .regularization(true)
    .l2(5 * 1e-4)
    .momentum(0.9)
    .miniBatch(false)
    .list()
            //conv1
    // Builder arguments are (kernel size, stride, padding); Scala uses Array(...) rather than Java's new int[]{...}
    .layer(0, new ConvolutionLayer.Builder(Array(11, 11), Array(4, 4), Array(3, 3))
            .name("cnn1")
            .nIn(channels)
            .nOut(96)
            .build())
    .layer(1, new LocalResponseNormalization.Builder()
            .name("lrn1")
            .build())
    .layer(2, new SubsamplingLayer.Builder(poolingType, Array(3, 3), Array(2, 2))
            .name("pool1")
            .build())
            //conv2
    .layer(3, new ConvolutionLayer.Builder(Array(5, 5), Array(1, 1), Array(2, 2))
            .name("cnn2")
            .nOut(256)
            .biasInit(nonZeroBias)
            .build())
    .layer(4, new LocalResponseNormalization.Builder()
            .name("lrn2")
            .k(2).n(5).alpha(1e-4).beta(0.75)
            .build())
    .layer(5, new SubsamplingLayer.Builder(poolingType, Array(3, 3), Array(2, 2))
            .name("pool2")
            .build())
            //conv3
    .layer(6, new ConvolutionLayer.Builder(Array(3, 3), Array(1, 1), Array(1, 1))
            .name("cnn3")
            .nOut(384)
            .build())
            //conv4
    .layer(7, new ConvolutionLayer.Builder(Array(3, 3), Array(1, 1), Array(1, 1))
            .name("cnn4")
            .nOut(384)
            .biasInit(nonZeroBias)
            .build())
            //conv5
    .layer(8, new ConvolutionLayer.Builder(Array(3, 3), Array(1, 1), Array(1, 1))
            .name("cnn5")
            .nOut(256)
            .biasInit(nonZeroBias)
            .build())
    .layer(9, new SubsamplingLayer.Builder(poolingType, Array(3, 3), Array(2, 2))
            .name("pool3")
            .build())
            .name("pool3")
            .build())
    .layer(10, new DenseLayer.Builder()
            .name("ffn1")
            .nOut(4096)
            .biasInit(nonZeroBias)
            .dropOut(dropOut)
            .build())
    .layer(11, new DenseLayer.Builder()
            .name("ffn2")
            .nOut(4096)
            .biasInit(nonZeroBias)
            .dropOut(dropOut)
            .build())
    .layer(12, new OutputLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
            .name("output")
            .nOut(outputNum)
            .activation("softmax")
            .build())
    .backprop(true)
    .pretrain(false)
    .cnnInputSize(height,width,channels).build()
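
When stacking convolution and pooling layers, it helps to check how the spatial size of the image shrinks from layer to layer. The sketch below applies the standard output-size formula to the kernel, stride and padding values from the AlexNet variant above, starting from the 120-pixel input. The helper `outputSize` is hypothetical, and the sketch assumes zero padding on the pooling layers and rounding down on uneven divisions; the library may treat those cases differently.

```scala
// Output size of a convolution or pooling layer along one spatial dimension.
// Assumption: integer division rounds down when the stride does not divide evenly.
def outputSize(in: Int, kernel: Int, stride: Int, padding: Int): Int =
  (in - kernel + 2 * padding) / stride + 1

// (name, kernel, stride, padding) from the AlexNet variant.
// Assumption: pooling layers use zero padding. The normalization layers are
// left out because they do not change the spatial size.
val layers = Seq(
  ("cnn1", 11, 4, 3), ("pool1", 3, 2, 0),
  ("cnn2", 5, 1, 2), ("pool2", 3, 2, 0),
  ("cnn3", 3, 1, 1), ("cnn4", 3, 1, 1), ("cnn5", 3, 1, 1),
  ("pool3", 3, 2, 0))

// Running sizes, starting from the 120-pixel input:
// 120, 29, 14, 14, 6, 6, 6, 6, 2
val sizes = layers.scanLeft(120) { case (in, (_, k, s, p)) => outputSize(in, k, s, p) }
```

Under these assumptions the last pooling layer hands a 2 x 2 map with 256 channels to the first dense layer. If the size ever drops below the kernel size, the input images are too small for the architecture.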

In [None]:
// Initialize the network and alternate which configuration to pass into MultiLayerNetwork
val network: MultiLayerNetwork = new MultiLayerNetwork(confAlexNet)
network.init()

***Listeners***

Apply setListeners to the network to get information on how the model is performing. ScoreIterationListener is the simplest way to check whether the model is converging on the training data. The score it reports is the value of the loss function, so it shows how far the model's predictions are from the training labels. Typically you work to lower the score as close to zero as possible.

In [None]:
network.setListeners(new ScoreIterationListener(listenerFreq))

***Gradients***

Backpropagation is how you move the weight ($w$) updates from stochastic gradient descent back into the model. Sometimes the score comes back as NaN or 0 because the gradient explodes or vanishes. As changes are moved backwards through the layers in deep nets, the gradient tends to get smaller. The neurons in the beginning layers then learn more slowly than the neurons in the later layers, and the gradient can vanish. Sometimes the gradient instead gets too big in earlier layers, which makes it explode. More information on how to address these issues is in the references below. Just be aware this is common and requires tuning.
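
A quick way to see why gradients vanish is to look at the sigmoid activation: its slope is never larger than 0.25, and backpropagation multiplies one such slope per layer. The sketch below ignores the weights entirely, which is a simplifying assumption, and the helper names are hypothetical.

```scala
def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))
def sigmoidDerivative(z: Double): Double = sigmoid(z) * (1.0 - sigmoid(z))

// Largest possible sigmoid slope, reached at z = 0.
val maxSlope = sigmoidDerivative(0.0) // 0.25

// Upper bound on the factor contributed by `depth` stacked sigmoid layers
// (weights ignored, which is a simplification).
def slopeProduct(depth: Int): Double = math.pow(maxSlope, depth)

val twoLayers = slopeProduct(2)  // 0.0625
val tenLayers = slopeProduct(10) // below one millionth
```

This is one reason the configurations in this notebook use "relu", whose slope is 1 for positive inputs.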

## Training

Once you've loaded the data and the model configuration is initialized, train the model by calling fit on the configured network and passing in the dataset. The goal of training is to define parameters that will provide high accuracy on classification results but generalize enough to perform well on new data.

In [None]:
// Runs 1 epoch
val testInput = new ListBuffer[INDArray]()
val testLabels = new ListBuffer[INDArray]()

while (dataIter.hasNext()) {
  val dsNext: DataSet = dataIter.next()
  dsNext.scale()
  val trainTest: SplitTestAndTrain = dsNext.splitTestAndTrain(splitTrainNum, new Random(seed))
  val trainInput: DataSet = trainTest.getTrain() // get feature matrix and labels for training
  testInput += trainTest.getTest().getFeatureMatrix()
  testLabels += trainTest.getTest().getLabels()
  network.fit(trainInput)
}

In [None]:
// Assumes 1 epoch completed already
for (i <- 1 until epochs) {
  dataIter.reset()
  while (dataIter.hasNext()) {
    val dsNext: DataSet = dataIter.next()
    dsNext.scale() // scale the features the same way as in the first epoch
    val trainTest: SplitTestAndTrain = dsNext.splitTestAndTrain(splitTrainNum, new Random(seed))
    val trainInput: DataSet = trainTest.getTrain()
    network.fit(trainInput)
  }
}

## Evaluating

After the model converges with regard to its loss function, you can run new test data through the model to see how well it generalizes and predicts. The test data should be a dataset that was not used during training.

Example performance indicators:
- ***accuracy*** = number of correct predictions divided by total predictions
- ***precision*** = number of correct positive predictions divided by total positive class values predicted
- ***recall*** = number of correct positive predictions divided by the total actual positive class values
- ***f1-score*** = measure of test accuracy as a balance between precision and recall
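
The indicators above can be worked through by hand for a single class. The counts below are made up for illustration, for example for the class "duck" on a hypothetical test set of 20 images.

```scala
// Hypothetical counts for one class on a test set of 20 images.
val truePositives = 4   // ducks predicted as ducks
val falsePositives = 1  // other animals predicted as ducks
val falseNegatives = 2  // ducks predicted as something else
val trueNegatives = 13  // other animals predicted as not ducks

val total = truePositives + falsePositives + falseNegatives + trueNegatives
val accuracy = (truePositives + trueNegatives).toDouble / total           // 17 / 20
val precision = truePositives.toDouble / (truePositives + falsePositives) // 4 / 5
val recall = truePositives.toDouble / (truePositives + falseNegatives)    // 4 / 6
val f1 = 2 * precision * recall / (precision + recall)                    // 8 / 11
```

Accuracy looks high here mostly because of the many true negatives, which is why precision and recall are worth checking per class.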

In [None]:
import scala.collection.JavaConverters._

// Evaluate on the held-out test splits collected during the first training epoch
val eval: Evaluation = new Evaluation(labels.asJava)
for ((features, testLabel) <- testInput.zip(testLabels)) {
  val output: INDArray = network.output(features)
  eval.eval(testLabel, output)
}
print(eval.stats())

## Saving

Save the model configuration and parameters when you are satisfied with evaluation scores or if a training break is needed. 

In [None]:
val basePath = System.getProperty("user.dir")
val confPath = FilenameUtils.concat(basePath, network.toString() + "-conf.json")
val paramPath = FilenameUtils.concat(basePath, network.toString() + ".bin")

In [None]:
// Save parameters
try {
  val dos: DataOutputStream = new DataOutputStream(new FileOutputStream(paramPath))
  Nd4j.write(network.params(), dos)
  dos.flush()
  dos.close()
  // Save model configuration
  FileUtils.write(new File(confPath), network.conf().toJson())
} catch {
  case ioe: IOException => ioe.printStackTrace()
}

## Tuning

Next to loading data and the time to train, tuning is one of the key challenges in producing effective neural nets. To get a good sense of how to tune, spend time running different models and reading academic papers that outline various approaches. Below are a few pointers to get you started:

***General Pointers***

Start with as few hyper-parameters as possible and focus on improving scores with those first. Tune one hyper-parameter at a time and keep the others fixed. When it seems you can no longer improve the scores with it, move to a new one, and be willing to go back to the first after you've made adjustments to other hyper-parameters.

***Learning Rate ( $\alpha$)***

Learning rate is a good hyper-parameter to start with. Watch how the scores change: a smooth decrease until the final epoch means the value is a good one to work with. If the scores decrease smoothly early on and then oscillate randomly, or if they climb, lower the learning rate. Shift by an order of magnitude (a factor of 10) at first and then make the adjustments smaller as you get closer to a smooth decrease.
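
The effect of shifting the learning rate by orders of magnitude can be seen on a toy problem. The sketch below runs plain gradient descent on a made-up loss, $w^2$, for a fixed number of steps; the function `finalLoss` and the candidate rates are assumptions for illustration only.

```scala
// Final loss after `steps` of plain gradient descent on loss(w) = w^2,
// starting from w = 1. The gradient of w^2 is 2w.
def finalLoss(learningRate: Double, steps: Int = 50): Double = {
  var w = 1.0
  for (_ <- 0 until steps) w -= learningRate * 2.0 * w
  w * w
}

// Candidate rates one order of magnitude apart, plus one that is too large.
val rates = Seq(1e-3, 1e-2, 1e-1, 1.1)
val losses = rates.map(rate => rate -> finalLoss(rate))
```

On this toy loss the three smaller rates all decrease the loss, with 0.1 getting closest to zero in 50 steps, while 1.1 overshoots on every step and the loss grows instead of shrinking.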

***Mini-batch Size***

Mini-batch size makes a difference when tuning. If it's too small, you aren't maximizing matrix library optimizations, and if it's too large, the weights aren't updated often enough. Be aware that the size is independent of other hyper-parameters, so you don't need to have tuned them to find a good mini-batch size. Compare accuracy against training time to find the size that works best.
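
One way to see the trade-off is to count how many weight updates a single pass over the data produces for different mini-batch sizes. The sketch below uses 80 example indices, matching the size of the sample dataset; the `batches` helper is hypothetical.

```scala
// Sketch: split 80 example indices into mini-batches and count updates per epoch.
val exampleIds = (0 until 80).toVector

def batches(batchSize: Int): Vector[Vector[Int]] =
  exampleIds.grouped(batchSize).toVector

val updatesAt20 = batches(20).size     // 4 weight updates per epoch
val updatesAt80 = batches(80).size     // 1 weight update per epoch
val sizesAt32 = batches(32).map(_.size) // the last batch is smaller: 32, 32, 16
```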

***Batch Normalization***

Batch normalization has recently become a popular technique for deep neural net training because it leads to faster learning and higher overall accuracy. It lets you work with higher learning rates and can reduce the need for regularization techniques like dropout. It is common to scale the input to zero mean and unit variance before passing it in, but as the input passes through the net, each layer's parameters shift its distribution, which is known as "internal covariate shift". Applying batch norm to each mini-batch between layers helps to restore that normalization.
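
For reference, the transform described in the batch normalization paper listed in the References can be written as follows for a mini-batch of $m$ values $x_1, \dots, x_m$. Here $\gamma$ and $\beta$ are learned parameters and $\epsilon$ is a small constant for numerical stability.

>***mini-batch mean and variance***<br>
>$\mu_B = \dfrac{1}{m}\sum_{i=1}^{m} x_i \qquad \sigma_B^2 = \dfrac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2$

>***normalize, then scale and shift***<br>
>$\hat{x}_i = \dfrac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \qquad y_i = \gamma\hat{x}_i + \beta$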

***Automated Tuning***

Manual tuning is great for getting a feel for how hyper-parameters behave, but when you want quick results, automated tuning techniques will help cut down the time spent. There are many different approaches to try, such as grid, random, and Bayesian search.
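
As a rough sketch of the first two approaches, the code below compares grid search and random search over a learning rate and an l2 value. The `score` function is a stand-in for "train a model and return its evaluation score" and is entirely made up, as are the candidate values and the `sample` helper; in practice each call would be a full training run.

```scala
import scala.util.Random

// Hypothetical stand-in for training and evaluating a model.
// It peaks at learningRate = 1e-2 and l2 = 1e-3.
def score(learningRate: Double, l2: Double): Double =
  -math.abs(math.log10(learningRate) + 2.0) - math.abs(math.log10(l2) + 3.0)

// Grid search: try every combination of a fixed set of values.
val grid = for {
  lr <- Seq(1e-4, 1e-3, 1e-2, 1e-1)
  l2 <- Seq(1e-4, 1e-3, 1e-2)
} yield (lr, l2)
val bestGrid = grid.maxBy { case (lr, l2) => score(lr, l2) }

// Random search: sample both values on a log scale within the same ranges.
val rng = new Random(123)
def sample(): (Double, Double) =
  (math.pow(10, -4 + 3 * rng.nextDouble()), math.pow(10, -4 + 2 * rng.nextDouble()))
val samples = Seq.fill(20)(sample())
val bestRandom = samples.maxBy { case (lr, l2) => score(lr, l2) }
```

Grid search can only return values on the grid, while random search can land between grid points, which sometimes finds a better setting for the same number of runs.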


For more information in general on tuning check out the references below.

## Final Points

Once you've spent time training and tuning the net, you should end up with a configuration and parameters you can apply to new datasets. 

## References

For more information on how to develop neural nets, below are additional resources to explore.

- Skymind: http://www.skymind.io/
- U.S. Fish and Wildlife Service (animal sample dataset): http://digitalmedia.fws.gov/cdm/
- Tiny ImageNet Classification with CNN: http://cs231n.stanford.edu/reports/leonyao_final.pdf
- AlexNet: http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf & https://github.com/BVLC/caffe/blob/master/models/bvlc_alexnet/train_val.prototxt
- Neural Networks and Deep Learning: http://neuralnetworksanddeeplearning.com/chap3.html
- Neural Networks: http://nbviewer.jupyter.org/github/masinoa/machine_learning/blob/master/04_Neural_Networks.ipynb
- Visual Information Theory: https://colah.github.io/posts/2015-09-Visual-Information/
- Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift: http://jmlr.org/proceedings/papers/v37/ioffe15.pdf
- Deep Learning Book: http://www.deeplearningbook.org/
- Neural Networks for Machine Learning: https://www.coursera.org/course/neuralnets
- Convolutional Neural Networks for Visual Recognition: http://cs231n.github.io/