# Behavioral Cloning - Project

The goals / steps of this project are the following:

 - Use the simulator to collect data of good driving behavior
 - Build a convolutional neural network in Keras that predicts steering angles from images
 - Train and validate the model with a training and validation set
 - Test that the model successfully drives around track one without leaving the road

![simulator_example](imgs/simulator_example.jpg)


## Requirements:

 - Environment setup: [Instructions](https://github.com/udacity/CarND-Behavioral-Cloning-P3)
 - Simulator: [Github repo](https://github.com/udacity/self-driving-car-sim)


## Project methodology:

OK, we are ready to start working on the project! :muscle:

**Our goal is to train a convolutional neural network to predict steering angles from a car's camera.**


#### Convolutional Network (Basic concepts):

##### Exploratory Visualization

To use images as input for deep learning models, they first need to be converted into multidimensional arrays of numbers, where each cell of the array holds one pixel value. For this step, the SciPy module ```ndimage``` is used, as described in this [tutorial](http://www.scipy-lectures.org/advanced/image_processing/).

<img src="imgs/image_transformations.png" alt="Drawing" style="width: 800px;">

##### Algorithms and Techniques

Deep convolutional neural network architectures are used for this task; they are currently among the best-known approaches in the image recognition field. Working with images is hard: for example, a grayscale image of size 150x150 becomes a vector of 150·150 = 22500 dimensions when fed to a fully connected neural network. Such high dimensionality, with no predefined features, makes the problem impractical for standard supervised machine learning approaches, even when they are combined with dimensionality reduction techniques like PCA.
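To make the dimensionality argument concrete, the sketch below counts the input features of a 150x150 grayscale image and compares the number of parameters in a first fully connected layer with those of a small convolutional layer. The 100 hidden units, the 3x3 kernel and the 24 filters are illustrative assumptions, not values fixed by this section.

```python
# Illustrative sizes: the 100 hidden units, 3x3 kernel and 24 filters are assumptions.
height, width = 150, 150

# A fully connected layer sees the image as one flat vector.
n_inputs = height * width

# Every input connects to every hidden unit, plus one bias per unit.
hidden_units = 100
dense_params = n_inputs * hidden_units + hidden_units

# A convolutional layer reuses one small kernel per filter across the whole image.
kernel_size, channels, n_filters = 3, 1, 24
conv_params = (kernel_size * kernel_size * channels + 1) * n_filters

print("input features:", n_inputs)
print("dense layer parameters:", dense_params)
print("conv layer parameters:", conv_params)
```

Under these assumptions the dense layer needs millions of parameters while the convolutional layer needs a few hundred, which is the practical reason for preferring CNNs on raw pixels.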

Convolutional nets are widely regarded as one of the most effective techniques for extracting relevant information from images, for example for use in classification tasks. When used for image recognition, convolutional neural networks (CNNs) consist of multiple layers of small kernels that process portions of the input image, called receptive fields. Kernels are small matrices (normally 3x3 or 5x5) applied over the input image to extract features from the data. This technique has been used in image processing for decades, from Photoshop filters to medical imaging. [This blog by Victor Powell](http://setosa.io/ev/image-kernels/) is an excellent resource for understanding how kernels work.

<img src="imgs/kernel.png" alt="Drawing" style="width: 300px;">
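As a minimal sketch of how a kernel is applied, the pure-Python function below slides a 3x3 kernel over a toy image and computes a weighted sum at each position. The function name `apply_kernel`, the toy image and the choice of a Sobel kernel are assumptions made for illustration. Like CNN layers, it does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
def apply_kernel(image, kernel):
    """Slide `kernel` over `image` (valid positions only) and return the filtered image."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Weighted sum of the k x k patch whose top-left corner is (i, j).
            total = 0
            for di in range(k):
                for dj in range(k):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output


# Toy 4x5 image with a vertical edge between columns 1 and 2.
image = [
    [0, 0, 9, 9, 9],
    [0, 0, 9, 9, 9],
    [0, 0, 9, 9, 9],
    [0, 0, 9, 9, 9],
]

# Horizontal-gradient (Sobel x) kernel: it responds to vertical edges.
sobel_x = [
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
]

edges = apply_kernel(image, sobel_x)
print(edges)  # [[36, 36, 0], [36, 36, 0]]
```

The output is large where the patch straddles the edge and zero where the patch is uniform. In a CNN the kernel values are not hand-picked like this; they are learned during training.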

The outputs of these kernels are then tiled so that their input regions overlap, to obtain a better representation of the original image; this is repeated for every such layer. Convolutional networks may include local or global pooling layers, which combine the outputs of neuron clusters. Compared to other image classification algorithms, convolutional neural networks use relatively little pre-processing. This means that the network is responsible for learning the filters that in traditional algorithms were hand-engineered. The lack of dependence on prior knowledge and human effort in designing features is a major advantage for CNNs.

Another important concept in CNNs is pooling, which is a form of non-linear down-sampling. Several non-linear functions can implement pooling, of which max pooling is the most common. It partitions the input image into a set of non-overlapping rectangles and, for each such sub-region, outputs the maximum. The intuition is that once a feature has been found, its exact location isn't as important as its rough location relative to other features. The pooling layer progressively reduces the spatial size of the representation, which reduces the number of parameters and the amount of computation in the network and hence also helps control overfitting. It is common to periodically insert a pooling layer between successive conv layers in a CNN architecture. The pooling operation also provides a form of translation invariance [ref 06].

<img src="imgs/cnn_pooling.png" alt="Drawing" style="width: 400px;">
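The max pooling step described above can be sketched in a few lines of pure Python. The helper name `max_pool` and the 4x4 feature map are hypothetical, and the window size of 2 is a common choice rather than something this section prescribes.

```python
def max_pool(feature_map, size=2):
    """Max pooling over non-overlapping size x size windows (stride equals size)."""
    out_h = len(feature_map) // size
    out_w = len(feature_map[0]) // size
    pooled = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Collect the values of one window and keep only the largest.
            window = [
                feature_map[i * size + di][j * size + dj]
                for di in range(size)
                for dj in range(size)
            ]
            row.append(max(window))
        pooled.append(row)
    return pooled


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]

pooled = max_pool(feature_map)
print(pooled)  # [[4, 2], [2, 8]]
```

The 4x4 input shrinks to 2x2, so the next layer has a quarter as many values to process.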

The proposed architecture for this problem is a network with 1 to 4 blocks, each consisting of a convolutional layer followed by a max pooling layer. On top of that sits a fully connected network with 150 nodes on the input side and 1 output node, with dropout applied. Dropout is a regularization technique that reduces overfitting by preventing complex co-adaptations on the training data: nodes are dropped at random during training so that the model's predictions become more robust.
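To illustrate the idea behind dropout, here is a small sketch of the "inverted dropout" variant, in which surviving activations are rescaled during training so that nothing needs to change at inference time. The function name `dropout`, the rate of 0.5 and the seeded generator are assumptions for this example; it is not the implementation used by Keras.

```python
import random


def dropout(activations, rate, training=True, rng=random):
    """Inverted dropout: while training, zero each activation with probability `rate`
    and scale the survivors by 1 / (1 - rate); at inference, return the input unchanged."""
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)  # seeded so the example is repeatable
hidden = [1.0] * 10

train_out = dropout(hidden, rate=0.5, rng=rng)
test_out = dropout(hidden, rate=0.5, training=False)

print(train_out)  # a mix of 0.0 and 2.0
print(test_out)   # unchanged activations
```

Because a different random subset of nodes is silenced on every training step, no node can rely on a specific partner always being present.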


On top of this, two different optimizers are employed: ```Adam``` and ```Adagrad```. Optimizers are used to minimize the ```Cost``` function of a neural net. In the example below, there are weights (W) for every connection between nodes and biases (b) for every node in the neural network:

<img src="images/optimizers.png" alt="Drawing" style="width: 800px;">

A cost function is a measure of how well a neural network did with respect to its given training sample and the expected output. It may also depend on variables such as weights and biases. A cost function returns a single value, not a vector, because it rates how well the neural network did as a whole.

Specifically, a cost function is of the form: 
```
C(W,B,S,E)
```
where ```W``` is our neural network's weights, ```B``` is our neural network's biases, ```S``` is the input of a single training sample, and ```E``` is the desired output of that training sample.
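As a concrete instance of this form, the sketch below evaluates a squared-error cost for a single linear neuron. The choice of squared error and all the numeric values are assumptions for illustration; squared error is simply a natural fit for a regression target such as a steering angle.

```python
def predict(W, B, S):
    """Single linear neuron: weighted sum of the inputs plus a bias."""
    return sum(w * s for w, s in zip(W, S)) + B


def cost(W, B, S, E):
    """Squared error between the prediction for sample S and the desired output E."""
    return (predict(W, B, S) - E) ** 2


W = [0.5, -0.25]  # weights (hypothetical values)
B = 0.1           # bias
S = [1.0, 2.0]    # one training sample
E = 0.0           # desired output for that sample

c = cost(W, B, S, E)
print(c)  # a single number close to 0.01
```

Whatever the network size, the cost collapses all of its behaviour on the sample into one scalar, which is what the optimizer then tries to reduce.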

While there are different ways to define the ```Cost``` function, the goal of optimization is always to minimize it. Stochastic Gradient Descent (SGD) searches for a minimum iteratively, stepping against the gradient of the cost computed on small batches of samples. It is the most common approach, and the optimizers employed here are variants of it:
 * **AdaGrad** (for adaptive gradient algorithm) is a modified stochastic gradient descent with per-parameter learning rate, first published in 2011. Informally, this increases the learning rate for more sparse parameters and decreases the learning rate for less sparse ones. This strategy often improves convergence performance over standard stochastic gradient descent.
 * **Adam** also adapts the learning rate for each parameter. It keeps running averages of both the recent gradients and their squared values for each weight, and scales each update by the ratio of the first average to the square root of the second.
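The two update rules can be sketched on a one-parameter toy problem, minimizing C(w) = (w - 3)^2. The toy cost, the learning rates and the step counts are assumptions chosen so the example converges quickly; the Adam defaults for beta1, beta2 and eps follow the commonly quoted values.

```python
import math


def grad(w):
    """Gradient of the toy cost C(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)


def adagrad(w, lr=0.5, steps=500, eps=1e-8):
    """AdaGrad: divide the step by the root of the accumulated squared gradients."""
    g_sq_sum = 0.0
    for _ in range(steps):
        g = grad(w)
        g_sq_sum += g * g
        w -= lr * g / (math.sqrt(g_sq_sum) + eps)
    return w


def adam(w, lr=0.1, steps=500, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: running averages of the gradient (m) and squared gradient (v),
    each corrected for their bias towards zero at early steps."""
    m = 0.0
    v = 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


w_adagrad = adagrad(0.0)
w_adam = adam(0.0)
print(w_adagrad, w_adam)  # both should end up close to the minimum at w = 3
```

In a real network the same rule is applied independently to every weight and bias, which is what gives each parameter its own effective learning rate.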

<img src="imgs/sgd.png" alt="Drawing" style="width: 600px;">

 


### Initial model design:

As previously mentioned, CNNs are the tool to use, so let's first design a toy architecture that will allow us to test some initial results. Let's propose a neural network with a single convolutional layer, 4 dense layers, and ReLU activations:

|Layer (type)                     |Output Shape          |Param #     |Connected to               |
|:--------------------------------|:---------------------|:-----------|:--------------------------|
|convolution2d_1 (Convolution2D)  |(None, 49, 99, 24)    |1176        |convolution2d_input_1[0][0]|
|flatten_1 (Flatten)              |(None, 10752)         |0           |convolution2d_4[0][0]      |
|dense_1 (Dense)                  |(None, 100)           |1075300     |flatten_1[0][0]            |
|dropout_1 (Dropout)              |(None, 100)           |0           |dense_1[0][0]              |
|dense_2 (Dense)                  |(None, 64)            |6464        |dropout_1[0][0]            |
|dense_3 (Dense)                  |(None, 10)            |650         |dense_2[0][0]              |
|dense_4 (Dense)                  |(None, 1)             |11          |dense_3[0][0]              |

 - Total params: 1,152,869
 - Trainable params: 1,152,869
 - Non-trainable params: 0
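The per-layer numbers in the summary can be checked with a little arithmetic. The sketch below assumes a 100x200 RGB input, a 4x4 kernel and a stride of 2 for the convolutional layer; these values are inferred because they reproduce the table's output shape and parameter count, and they are not stated in the summary itself. The flatten width of 10752 is taken directly from the table.

```python
def conv_output_size(input_size, kernel, stride):
    """Output length along one axis when no padding is used."""
    return (input_size - kernel) // stride + 1


def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return (kernel_h * kernel_w * in_channels + 1) * filters


def dense_params(n_inputs, n_units):
    """One weight per input-unit pair plus one bias per unit."""
    return n_inputs * n_units + n_units


# Assumed, not stated in the table: 100x200x3 input, 4x4 kernel, stride 2, 24 filters.
conv_shape = (conv_output_size(100, 4, 2), conv_output_size(200, 4, 2), 24)
conv_count = conv2d_params(4, 4, 3, 24)

# Layer widths copied from the table: 10752 -> 100 -> 64 -> 10 -> 1.
widths = [10752, 100, 64, 10, 1]
dense_counts = [dense_params(a, b) for a, b in zip(widths, widths[1:])]

listed_total = conv_count + sum(dense_counts)

print(conv_shape)    # (49, 99, 24)
print(conv_count)    # 1176
print(dense_counts)  # [1075300, 6464, 650, 11]
print(listed_total)  # 1083601
```

The rows shown add up to 1,083,601 parameters, which is less than the reported total of 1,152,869, so the summary as listed may not include every layer of the model that produced it.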

 

