# Using Deep Convolutional NNs for Traffic Sign Classification

#### by Gerti Tuzi
Feb, 2017

---

In this lab, I used TensorFlow to train two classifiers for traffic sign classification. The data is the [German traffic sign dataset](https://d17h27t6h515a5.cloudfront.net/topher/2017/February/5898cd6f_traffic-signs-data/traffic-signs-data.zip), provided by [INI](http://benchmark.ini.rub.de/?section=gtsrb&subsection=dataset), one of the standard benchmark sets for image classification.

Udacity's Autonomous Car staff provided a training and evaluation harness with a simple implementation of LeNet. I made the following modifications and contributions:

* Data augmentation (random rotations, translations)
* Slight modifications of LeNet's architecture (introduced dropout layers)
* Explored a new architecture using a naive implementation of GoogLeNet's Inception module topology (*not the GoogLeNet network*)


## Data

### Overview
(from INI)
* Single-image, multi-class classification problem
* 43 classes
* More than 50,000 images in total
* Large database
* Ground-truth data generated via semi-automatic annotation
* Physical traffic sign instances are unique within the dataset (i.e., each real-world traffic sign only occurs once)

### Format
* The images contain one traffic sign each
* Images contain a border of 10 % around the actual traffic sign (at least 5 pixels) to allow for edge-based approaches
* The actual traffic sign is not necessarily centered within the image. This is true for images that were close to the image border in the full camera image
* Image Shape: 32, 32, 3 (RGB)


### Size
Data counts are as follows:

* Training Set:   39209 samples
* Test Set:       12630 samples


#### Example
![SignExample](images/TrafficSignExample.png)


### Classes distribution
The types / classes of objects in the dataset are not evenly distributed. The following graph shows their distribution:

---
![ClassDistro](images/ClassCounts.png)

---

As we can see, the class (label) distribution is not even across the dataset (the classes are not balanced).
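The imbalance can be quantified by counting the labels per class. The following is a minimal sketch, assuming the labels are available as an integer array; `y_toy` is a hypothetical toy array used only for illustration, not the real dataset labels.

```python
import numpy as np

def class_counts(labels, n_classes):
    """Return the number of samples for each class id in 0 .. n_classes-1."""
    return np.bincount(np.asarray(labels), minlength=n_classes)

def imbalance_ratio(counts):
    """Ratio between the most and the least populated non-empty class."""
    nonzero = counts[counts > 0]
    return nonzero.max() / nonzero.min()

# Toy labels for illustration only (the real dataset has 43 classes).
y_toy = np.array([0, 0, 0, 0, 1, 2, 2])
counts = class_counts(y_toy, n_classes=3)
print(counts, imbalance_ratio(counts))
```

A large ratio suggests that per-class metrics (such as a macro-averaged F1 score) are more informative than plain accuracy.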


## Data Processing


#### Data augmentation
The training set is only a sample of the data that the net will see during testing. Since it is just a snapshot, it carries its own particular noise and limitations (i.e. variations of images not captured in the set).

Data augmentation *artificially* introduces artifacts that may be encountered during evaluation (or in real-world use).

##### Image Transformations

In this report the following transformations were performed *before* training:
* *Rotation* with respect to the normal of the center of the image
* *Shear*, which performs a sort-of perspective transformation
* *Shift* in the horizontal and vertical directions. This introduces occlusions (partial visibility)

Other artifacts, such as noise (Gaussian noise, salt & pepper, blur, etc.), would likely make for a more robust classifier.


![ImageTransformations](images/ImageTransformations.png)
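The three transformations listed above can be sketched as a single random affine warp. This is a NumPy-only illustration, not the code used in the notebook: the function names, the nearest-neighbor sampling, and the parameter ranges (`max_angle_deg`, `max_shear`, `max_shift`) are all assumptions chosen for the example.

```python
import numpy as np

def affine_warp(image, matrix, offset):
    """Warp an HxW or HxWxC image using nearest-neighbor sampling.

    For every output pixel p (measured from the image center), the source
    location is matrix @ p + offset. Output pixels whose source falls
    outside the image stay black (zero).
    """
    h, w = image.shape[:2]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.mgrid[0:h, 0:w]
    centered = np.stack([ys.ravel() - cy, xs.ravel() - cx])  # shape (2, h*w)
    src = (np.asarray(matrix, dtype=float) @ centered
           + np.asarray(offset, dtype=float).reshape(2, 1))
    sy = np.rint(src[0] + cy).astype(int)
    sx = np.rint(src[1] + cx).astype(int)
    valid = (sy >= 0) & (sy < h) & (sx >= 0) & (sx < w)
    flat = np.zeros((h * w,) + image.shape[2:], dtype=image.dtype)
    flat[valid] = image[sy[valid], sx[valid]]
    return flat.reshape(image.shape)

def random_transform(image, rng, max_angle_deg=15.0, max_shear=0.1, max_shift=3):
    """Apply a random rotation, shear, and shift (ranges are assumptions)."""
    angle = np.deg2rad(rng.uniform(-max_angle_deg, max_angle_deg))
    shear = rng.uniform(-max_shear, max_shear)
    shift = rng.integers(-max_shift, max_shift + 1, size=2)
    rotation = np.array([[np.cos(angle), -np.sin(angle)],
                         [np.sin(angle),  np.cos(angle)]])
    shear_matrix = np.array([[1.0, 0.0],
                             [shear, 1.0]])
    # The matrix maps output coordinates to source coordinates; since the
    # random ranges are symmetric, the direction convention does not matter.
    return affine_warp(image, rotation @ shear_matrix, shift)

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)  # stand-in image
augmented = random_transform(image, rng)
print(augmented.shape)
```

Pixels shifted in from outside the frame are left black, which is what produces the partial-visibility effect mentioned for the shift transformation.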

---


# Training
The following parameters were used for training the networks:

* Batch size 150: Seems to lead to faster convergence than higher batch sizes (which can also hit memory limits)
* Epochs 100 (to save time)
* Weights are initialized uniformly at random with sigma = 0.01 (larger networks should benefit from higher variance, while smaller networks need smaller initial values)
* Adam optimization, which speeds up learning greatly
* L2 regularization (beta = 0.01)
* Learning rate schedule (see below)
* Data shuffling : before the start of each training epoch, data was shuffled
* Data augmentation (random rotations, shifts, shear)
* Dropout: LeNet-5 (0.3), GTInception (0.6)


#### L2 Regularization
For both nets, L2 regularization was also added to the loss function to reduce overfitting. L2 regularization ensures that the weights do not become very large - which is usually a sign of fitting the data too well and fitting to the particular noise of the training data.
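As a minimal sketch of how such a penalty enters the loss, assume the regularized loss is the data loss plus `beta` times half the sum of squared weights (the convention used by `tf.nn.l2_loss`). Whether the notebook uses exactly this form is an assumption; the weight tensors below are toy values.

```python
import numpy as np

def l2_penalty(weights, beta=0.01):
    """beta * sum over all weight tensors of 0.5 * ||W||^2."""
    return beta * sum(0.5 * np.sum(np.square(w)) for w in weights)

def regularized_loss(data_loss, weights, beta=0.01):
    """Data loss (e.g. cross-entropy) plus the L2 weight penalty."""
    return data_loss + l2_penalty(weights, beta)

# Toy weight tensors for illustration only.
toy_weights = [np.ones((2, 2)), np.full((3,), 2.0)]
total = regularized_loss(1.0, toy_weights)
print(total)
```

Because the penalty grows quadratically with each weight, the optimizer is pushed toward many small weights rather than a few very large ones.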


#### Learning Rate
The bottom of the error manifold gets narrower the further you go. Therefore, as learning progresses, smaller steps need to be taken to avoid swooshing back and forth in the ravine of the error space (or worse, leaving it entirely).


Given the fairly small number of epochs I used, I scheduled a drop in the learning rate every 40 epochs, as follows:


![LRcode](images/LR_Code.png)

where the number of epochs is used as the global step.


![LR](images/LearningRate.jpg)
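A step-wise schedule of this kind can be sketched in plain Python. Only the 40-epoch drop interval comes from the text; the function name, the base rate `base_lr`, and the decay factor `decay_rate` are hypothetical values chosen for illustration.

```python
def staircase_lr(epoch, base_lr=1e-3, decay_rate=0.1, decay_epochs=40):
    """Learning rate that is multiplied by decay_rate every decay_epochs epochs."""
    return base_lr * decay_rate ** (epoch // decay_epochs)

# Print the rate at a few epochs of a 100-epoch run.
for epoch in (0, 39, 40, 80, 99):
    print(epoch, staircase_lr(epoch))
```

With 100 epochs and a 40-epoch interval, the rate drops twice (at epochs 40 and 80).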


#### Data shuffling
This ensures that the learning process does not get "trapped" or biased by optimizing for one class (for one or a few consecutive batches), then another (for another set of batches). If the data is not shuffled and is presented as batches of same-class samples, the optimization process may settle in a class-specific minimum (in the error space), and the chances of finding an optimal minimum are greatly reduced.
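Per-epoch shuffling can be sketched as a generator that draws a fresh permutation before slicing the data into batches. The function name and the toy arrays are assumptions for illustration; the notebook's own harness may shuffle differently.

```python
import numpy as np

def shuffled_batches(features, labels, batch_size, rng):
    """Yield (features, labels) mini-batches in a freshly shuffled order."""
    order = rng.permutation(len(features))
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        yield features[idx], labels[idx]

# Toy data: 10 samples, sorted by label as an unshuffled dataset might be.
X_toy = np.arange(10).reshape(10, 1)
y_toy_sorted = np.arange(10)
rng = np.random.default_rng(0)
batches = list(shuffled_batches(X_toy, y_toy_sorted, batch_size=4, rng=rng))
print([len(y) for _, y in batches])
```

Calling the generator once per epoch gives every epoch a different batch composition while still visiting each sample exactly once.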


#### Data augmentation
Each batch was augmented using random transforms, as described in data augmentation.

---


# Neural Networks Architecture

---

## Classifier : LeNet

The baseline classifier used is LeCun's *LeNet* classifier. This is used as a basic reference model in the literature.


![LeNet](images/lenet_w_droputs.png)


*The implementation of this network (and most of the training/evaluation harness) is provided by Udacity's CarND class.*



The following table shows the parameters for the LeNet:

![LeNetParams](images/LeNetParams.png)



---


## Classifier: GTInception 
Explanation of my network implementation

---

### Background: Network in Network (NiN)
According to Lin et al. [link](https://arxiv.org/pdf/1312.4400.pdf), conventional convolution layers are sufficient for abstraction when the instances of the latent concepts are linearly separable. It is beneficial to do a better abstraction on each local patch before combining them into higher-level concepts.



![NetInNet](images/NetInNet.png)


The NiN is equivalent to a convolution layer with a 1x1 convolution kernel. The intermediary neurons capture the non-linearities of the abstract latent representations present in the data.
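The equivalence between a 1x1 convolution and a small fully connected layer applied at every pixel can be checked numerically. This is a NumPy sketch with arbitrary toy shapes and a ReLU activation chosen for illustration; it is not taken from the notebook.

```python
import numpy as np

def conv1x1(x, w, b):
    """1x1 convolution followed by ReLU.

    x: (H, W, C_in) feature map, w: (C_in, C_out) kernel, b: (C_out,) bias.
    """
    return np.maximum(x @ w + b, 0.0)

def dense_relu(pixel, w, b):
    """The same weights applied as a dense layer to one pixel's channels."""
    return np.maximum(pixel @ w + b, 0.0)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 4, 3))   # toy feature map
w = rng.normal(size=(3, 8))      # toy 1x1 kernel: 3 -> 8 channels
b = rng.normal(size=8)

out = conv1x1(x, w, b)
print(out.shape, np.allclose(out[2, 1], dense_relu(x[2, 1], w, b)))
```

The spatial size is unchanged and only the depth changes, which is why 1x1 convolutions are convenient for mixing or reducing channels.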

---

### Background: Inception (GoogLeNet)

In attempting to capture the high nonlinearities of latent variables, a team at Google introduced the *Inception* concept. Inception leverages the idea of NiN (above) to perform a 3x3 and a 5x5 convolution on top of a 1x1 convolution on the input from the previous layer. In addition, it also performs a maxpool operation followed by a 1x1 convolution. [link](https://arxiv.org/pdf/1409.4842.pdf)


![GoogleInceptionModule](images/InceptionGoogLeNet.png)


In the image above, the 1x1 convolutions before the 3x3 and 5x5 are selected individually. They reduce the depth, and thus the number of parameters, before applying the costly 3x3 and 5x5 operations.
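A quick weight count shows why the reduction matters. The channel sizes below (192 input channels, a 16-channel 1x1 reduction, 32 output channels for the 5x5) are illustrative assumptions, and biases are ignored.

```python
def conv_weights(kernel, c_in, c_out):
    """Number of weights in a kernel x kernel convolution (biases ignored)."""
    return kernel * kernel * c_in * c_out

# 5x5 convolution applied directly to the 192-channel input.
direct = conv_weights(5, 192, 32)

# 1x1 reduction to 16 channels first, then the 5x5 convolution.
reduced = conv_weights(1, 192, 16) + conv_weights(5, 16, 32)

print(direct, reduced, direct / reduced)
```

With these numbers the reduced branch needs roughly a tenth of the weights of the direct 5x5 convolution.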

---

### GTInception: implementation of *Inception* module

For my implementation of inception, I used the same 1x1 operation on all the modules. The following image illustrates my implementation:


![myinception](images/MyInception.png)

All branches of the module receive the same input (X).

The output of the inception module has the same spatial size as the input, but with a varying depth. 


The following is the implementation of this module

![InceptionCode](images/InceptionCode.png)
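The shape behavior of the module (same spatial size, varying depth) follows from concatenating the branch outputs along the channel axis. The sketch below illustrates only the shapes, assuming stride 1 and "same" padding in every branch; the filter counts are hypothetical and the branch outputs are dummy arrays, not real convolutions.

```python
import numpy as np

def inception_output_shape(height, width, n1x1, n3x3, n5x5, n_pool_proj):
    """Output shape when all branches keep the spatial size and are
    concatenated along the channel (depth) axis."""
    return (height, width, n1x1 + n3x3 + n5x5 + n_pool_proj)

# Dummy branch outputs with hypothetical depths 16, 32, 8 and 8.
branch_depths = (16, 32, 8, 8)
branches = [np.zeros((32, 32, depth)) for depth in branch_depths]
module_out = np.concatenate(branches, axis=-1)

print(module_out.shape, inception_output_shape(32, 32, *branch_depths))
```

The output depth is simply the sum of the per-branch filter counts, so the depth of each module is set by choosing those counts.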

---

### Using the Inception module


The following is the network I built from the inception module (for reference, I called it *GTInception*).

![GTInception5](images/GTNet5.png)


---

### Design principle
*The premise behind the design was the idea that hierarchical abstractions closer to the input tend to have closer spatial correlations (they activate together), while higher-level abstractions will have more distant spatial correlations. [link1](https://arxiv.org/pdf/1409.4842.pdf), [link2](https://pdfs.semanticscholar.org/1d6e/6adc7a841393fc10b78dc0018e550aff589d.pdf)*

Therefore, 1x1 convolutions should capture closely correlated activations, while 5x5 convolutions should capture farther spaced activations. 

I varied the ratio of the depth of the 1x1 convolution to that of the 3x3 and the 5x5 convolutions. As we move higher in the hierarchy of the network, the share of larger convolutions increases relative to that of the smaller convolutions.

Per [[link](https://arxiv.org/pdf/1409.4842.pdf)], 

As we move up the hierarchy of the network, the stacked Inception modules should capture the highly non-linear latent concepts [link1](https://arxiv.org/pdf/1312.4400.pdf). Once represented, these concepts should be linearly combinable. This linear space is captured by the fully connected layer (fc10). If any non-linearities are not captured by the stacked inception modules, they would be piecewise approximated by the fully connected layer (fc10).


The following is the table of parameters for GTInception:


![GTInception5Params](images/GTInceptionParamsTable.png)


The #1x1, #3x3, #5x5, and #pool proj columns show the number of filters used for each transformation.

The colored cells indicate the reduction in size after the inception output.


---

### Dropout regularization

I added *dropout* on the fully connected outputs (equivalently, on the inputs of the following layers). Dropout randomly introduces varying architectures during training. During evaluation, this variation of architectures effectively results in an *ensemble of networks* working together.

Since the resulting "multiple networks" are trained on varying aspects of the training data, they locally optimize for different areas of the error space. The ensemble output thus results in an averaging of their outputs. [link](https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf)

For LeNet-5 I used a 0.3 dropout rate (0.7 keep rate), while for the GTInception net I used a 0.6 dropout rate (0.4 keep rate). The reason for drastically increasing the dropout rate for GTInception is that it has a much larger capacity (deeper, with many more parameters than LeNet-5), so it will tend to overfit much more easily. We need a strong regularization factor.

Consequently, higher rates should be used with very large nets, or if we are limited in our data.
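The mechanics of dropout can be sketched in NumPy as "inverted" dropout, where kept activations are rescaled by `1 / keep_prob` during training so that no rescaling is needed at evaluation time. The function below is an illustration under that assumption, not the notebook's TensorFlow code.

```python
import numpy as np

def dropout(x, keep_prob, rng, training=True):
    """Inverted dropout: zero each unit with probability 1 - keep_prob and
    rescale the survivors by 1 / keep_prob. Identity when not training."""
    if not training:
        return x
    mask = rng.random(x.shape) < keep_prob
    return x * mask / keep_prob

rng = np.random.default_rng(0)
activations = np.ones(10000)

# keep_prob = 0.4 corresponds to the 0.6 dropout rate used for GTInception.
dropped = dropout(activations, keep_prob=0.4, rng=rng)
print(dropped.mean(), (dropped == 0).mean())
```

Each training step samples a different mask, which is the "varying architectures" effect described above; at evaluation time the full network is used.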

---


## Evaluation

During training, the models with the best F1 score were saved for future evaluation. Obviously here we can implement early termination logic (another form of regularization to avoid overfitting).

Several metrics were printed during training. Please refer to the output of the training section in the associated notebook.

One noticeable aspect is that GTInception reaches a validation F1 score of approximately 0.90 between 20 and 25 epochs. This took less than an hour on a machine with one GPU (from AWS) (details of GPU tbd).

Also note that, on the German dataset, Google Inception has been reported [link](https://arxiv.org/pdf/1511.02992.pdf) to reach an accuracy of 0.9957.

---

Given 100 epochs of training, with a batch size of 150, I obtained the following *macro* metrics:

![PerfMetrics](images/TestPerformanceMetrics.png)
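For reference, a *macro* F1 score averages the per-class F1 scores with equal weight, regardless of how many samples each class has. The sketch below is one common way to compute it in NumPy. Treating classes with no true or predicted samples as scoring 0 is an assumed convention, and the notebook may compute the metric differently.

```python
import numpy as np

def macro_f1(y_true, y_pred, n_classes):
    """Unweighted mean of per-class F1 scores."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    scores = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 for empty classes.
        scores.append(2 * tp / denom if denom > 0 else 0.0)
    return float(np.mean(scores))

# Toy predictions for illustration only.
y_true_toy = [0, 0, 1, 1, 2, 2]
y_pred_toy = [0, 0, 1, 2, 2, 2]
print(macro_f1(y_true_toy, y_pred_toy, n_classes=3))
```

Because every class counts equally, a poorly classified rare class lowers the macro score much more than it lowers overall accuracy.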

# Testing GTInception on images from the web

The data was retrieved with a simple image search. The images were manually clipped and resized to fit the 32x32 input size of the network.

---
* **Speed Limit 30**: Simple, flat, facing the camera, with clear visibility. Expected and obtained a high-confidence classification
![Spdlim30](images/Performance_speedlimit30.jpg)

---
* **Occluded General Caution**: Flat facing sign, but partially occluded by other signs. Still a high confidence classification result
![OccGenCaut](images/Performance_GenCaution_Occluded.jpg)

---
* **Unseen sign (dog needs disallowed)**: This sign was not part of the data set. The general shape of the dog does however resemble that of a right-turning arrow. Note however, that the probability entropy has gone up. Other classes have higher probabilities.
![UnseenDogPoop](images/Performance_NoDogPoop.jpg)

---
* **No passing (with end of enforcement)** - This is a sign with what I believe to be an "end of enforcement" sign below it. The lower sign is only partially shown (i.e. no letters), but the net is putting more emphasis on this partial sign for its interpretation. However, because the "end" sign is only partially visible, the net also considers that this could be a no-passing sign. A note to the German traffic sign authorities!
![NoPassingEnd](images/Performance_No_Passing_End.jpg)

---
* **Rotated Turn Right**: The sign is rotated about the z-axis in 3D space, not fully facing the camera, and has graffiti on it. The net is highly confident and correct in its decision. Data augmentation during training helps in cases like these.
![TurnRight](images/Performance_TurnRight.jpg)

---
* **Multiple general caution signs** - repeated caution signs, but rotated. The net correctly determines the class. Spatial invariance at work, best captured by 2D convolutions. 
![GenCautMultiple](images/Performance_GenCaution_Multiple.jpg)
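The "probability entropy" mentioned for the unseen sign can be made concrete with the Shannon entropy of the softmax output: a confident prediction has entropy close to 0, while a prediction spread over many classes has higher entropy. The logits below are made-up values for illustration, not outputs of GTInception.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = np.asarray(logits, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

# Made-up logits over 43 classes: one confident case, one uncertain case.
confident_logits = np.zeros(43)
confident_logits[1] = 20.0
uncertain_logits = np.zeros(43)
uncertain_logits[[1, 7, 33]] = 2.0

h_confident = entropy(softmax(confident_logits))
h_uncertain = entropy(softmax(uncertain_logits))
print(h_confident, h_uncertain)
```

The maximum possible entropy for 43 classes is `log(43)`, reached when all classes are equally likely.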