# Self-Driving Car Engineer Nanodegree

## Deep Learning

## Project: Build a Traffic Sign Recognition Classifier

In this notebook, a template is provided for you to implement your functionality in stages, which is required to successfully complete this project. If additional code is required that cannot be included in the notebook, be sure that the Python code is successfully imported and included in your submission. Sections that begin with **'Implementation'** in the header indicate where you should begin your implementation for your project. Note that some sections of implementation are optional, and will be marked with **'Optional'** in the header.

In addition to implementing code, there will be questions that you must answer which relate to the project and your implementation. Each section where you will answer a question is preceded by a **'Question'** header. Carefully read each question and provide thorough answers in the following text boxes that begin with **'Answer:'**. Your project submission will be evaluated based on your answers to each of the questions and the implementation you provide.

>**Note:** Code and Markdown cells can be executed using the **Shift + Enter** keyboard shortcut. In addition, Markdown cells can typically be edited by double-clicking the cell to enter edit mode.

---
## Step 0: Load The Data

In [None]:
#importing some useful packages
import tensorflow as tf
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import numpy as np
import cv2
import pickle
import pandas as pd
import random
import math
%matplotlib inline
print("Amazingly all libraries were loaded!!")

In [None]:
# Load pickled data

# TODO: Fill this in based on where you saved the training and testing data
## Please change this to folder where you place the notebook

training_file = './Data/train.p'
testing_file = './Data/test.p'

try:
    #pickle.dump(train, training_file, protocol=2)
    with open(training_file, mode='rb') as f:
        train = pickle.load(f)
    with open(testing_file, mode='rb') as f:
        test = pickle.load(f)
    
    #We will keep the Original Images and Labels here
    X_train, y_train = train['features'], train['labels']
    X_test, y_test = test['features'], test['labels']

    print('Training and Test datasets loaded.')
    
except Exception as e:
    # Show the underlying error so a wrong path or a corrupt file is easy to diagnose
    print('Loading datasets failure:', e)


---

## Step 1: Dataset Summary & Exploration

The pickled data is a dictionary with 4 key/value pairs:

- `'features'` is a 4D array containing raw pixel data of the traffic sign images, (num examples, width, height, channels).
- `'labels'` is a 1D array containing the label/class id of the traffic sign. The file `signnames.csv` contains id -> name mappings for each id.
- `'sizes'` is a list containing tuples, (width, height) representing the original width and height of the image.
- `'coords'` is a list containing tuples, (x1, y1, x2, y2) representing coordinates of a bounding box around the sign in the image. **THESE COORDINATES ASSUME THE ORIGINAL IMAGE. THE PICKLED DATA CONTAINS RESIZED VERSIONS (32 by 32) OF THESE IMAGES**
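
Because the bounding boxes refer to the original image while the pickled images are 32 by 32, a box has to be rescaled before it can be drawn on a pickled image. The following is a minimal sketch of that arithmetic. It assumes the images were resized by plain stretching (no padding or cropping), and `rescale_box` and the sample values are hypothetical, not part of the dataset tooling.

```python
def rescale_box(coords, original_size, target_size=(32, 32)):
    """Map a bounding box from the original image to the resized image.

    coords:        (x1, y1, x2, y2) in original-image pixels
    original_size: (width, height) of the original image
    target_size:   (width, height) of the resized image (assumed 32 by 32)
    """
    x1, y1, x2, y2 = coords
    orig_w, orig_h = original_size
    target_w, target_h = target_size
    # Independent scale factors for the horizontal and vertical axes
    sx = target_w / orig_w
    sy = target_h / orig_h
    return (x1 * sx, y1 * sy, x2 * sx, y2 * sy)


# Made-up example: a 64x64 original image with a box from (8, 6) to (56, 58)
box_32 = rescale_box(coords=(8, 6, 56, 58), original_size=(64, 64))
print(box_32)
```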

Complete the basic data summary below.

In [None]:
### Replace each question mark with the appropriate value.

# Make sure the number of images and the number of labels agree
assert(len(X_train) == len(y_train))
assert(len(X_test) == len(y_test))


# TODO: Number of training examples
n_train = len(X_train)

# TODO: Number of testing examples.
n_test = len(X_test)

# TODO: What's the shape of a traffic sign image?
image_shape = X_train[0].shape

# TODO: How many unique classes/labels there are in the dataset.
n_classes = len(set(y_train))

print("Number of training examples =", n_train)
print("Number of testing examples =", n_test)
print("Image data shape =", image_shape)
print("Number of classes =", n_classes)
print('Shape Original:', X_train.shape)

Visualize the German Traffic Signs Dataset using the pickled file(s). This is open ended, suggestions include: plotting traffic sign images, plotting the count of each sign, etc.

The [Matplotlib](http://matplotlib.org/) [examples](http://matplotlib.org/examples/index.html) and [gallery](http://matplotlib.org/gallery.html) pages are a great resource for doing visualizations in Python.

**NOTE:** It's recommended you start with something simple first. If you wish to do more, come back to it after you've completed the rest of the sections.
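
For the "count of each sign" suggestion, the per-class counts can be computed before any plotting. This is a small sketch on made-up labels; `class_counts` and `toy_labels` are illustrative names, and in the notebook the same call would presumably be applied to `y_train`.

```python
import numpy as np


def class_counts(labels):
    """Return a {class_id: number_of_examples} dictionary."""
    class_ids, counts = np.unique(labels, return_counts=True)
    return dict(zip(class_ids.tolist(), counts.tolist()))


# Toy labels standing in for the real label array
toy_labels = np.array([0, 1, 1, 2, 2, 2, 1, 2])
counts = class_counts(toy_labels)

most_common = max(counts, key=counts.get)
least_common = min(counts, key=counts.get)
print(counts, most_common, least_common)
```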

In [None]:
## Function that returns the sign description for a given class id (index -> description)
signNames = pd.read_csv('signnames.csv', sep=',')

def SignDescription(SignIdx):
    # Select by ClassId and take the first match, so the lookup does not depend on the row order of the CSV
    signName = signNames.loc[signNames['ClassId']==SignIdx, 'SignName'].iloc[0]
    return signName

In [None]:
### Data exploration visualization goes here.
### Feel free to use as many code cells as needed.


# Visualizations will be shown in the notebook.

# Plot 50 random sample images
print(' ******* Sample images ******* ')
plt.figure(figsize=(20,50))
for i in range(50):
    # random.randint includes both endpoints, so the upper bound must be n_train - 1
    index = random.randint(0, n_train - 1)
    plt.subplot(15,4,i+1)
    
    plt.xticks([])
    plt.yticks([])
    plt.imshow(X_train[index])
    SignIdx = y_train[index]
    #print(str(SignIdx) + "-" + SignDescription(SignIdx))
    plt.title(str(SignIdx) + "-" + SignDescription(SignIdx))
    #plt.show()
    

### 1.1 Observations
As we can see above, the training images have different brightness levels, are taken at different angles, and some are very blurred. This variety in the set is great for making sure our CNN gets trained and learns how to generalize and identify signs in a variety of optical conditions.
It has been suggested to convert the images to greyscale to normalize, to some extent, all color ranges and enhance the features that we want to detect and learn from. To normalize the brightness as well, we will use a method called "histogram equalization".
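
To make the idea of histogram equalization concrete, here is a NumPy-only sketch of the classic cumulative-histogram remapping on a synthetic dark image. It is an illustration of the technique, not the OpenCV implementation, and `equalize_hist_np` and the test image are made up for this example.

```python
import numpy as np


def equalize_hist_np(gray):
    """Spread the intensities of a uint8 grayscale image over the full 0..255 range."""
    hist = np.bincount(gray.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]          # CDF value at the darkest intensity present
    denom = gray.size - cdf_min
    if denom == 0:                     # constant image: nothing to equalize
        return gray.copy()
    # Lookup table that maps each old intensity to its equalized value
    lut = np.clip(np.round((cdf - cdf_min) / denom * 255), 0, 255).astype(np.uint8)
    return lut[gray]


# Synthetic low-contrast, dark image with intensities between 40 and 89
rng = np.random.default_rng(0)
dark = rng.integers(40, 90, size=(32, 32), dtype=np.uint8)
equalized = equalize_hist_np(dark)

print("before:", dark.min(), dark.max())
print("after: ", equalized.min(), equalized.max())
```

After the remapping, the darkest pixel present maps to 0 and the brightest to 255, which is why very dark images become brighter overall.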

----

## Step 2: Design and Test a Model Architecture

Design and implement a deep learning model that learns to recognize traffic signs. Train and test your model on the [German Traffic Sign Dataset](http://benchmark.ini.rub.de/?section=gtsrb&subsection=dataset).

There are various aspects to consider when thinking about this problem:

- Neural network architecture
- Play around with preprocessing techniques (normalization, RGB to grayscale, etc.)
- Number of examples per label (some have more than others).
- Generate fake data.

Here is an example of a [published baseline model on this problem](http://yann.lecun.com/exdb/publis/pdf/sermanet-ijcnn-11.pdf). It's not required to be familiar with the approach used in the paper, but it's good practice to try to read papers like these.

**NOTE:** The LeNet-5 implementation shown in the [classroom](https://classroom.udacity.com/nanodegrees/nd013/parts/fbf77062-5703-404e-b60c-95b78b2f3f9e/modules/6df7ae49-c61c-4bb2-a23e-6527e69209ec/lessons/601ae704-1035-4287-8b11-e2c2716217ad/concepts/d4aca031-508f-4e0b-b493-e7b706120f81) at the end of the CNN lesson is a solid starting point. You'll have to change the number of classes and possibly the preprocessing, but aside from that it's plug and play!

### Implementation

Use the code cell (or multiple code cells, if necessary) to implement the first step of your project. Once you have completed your implementation and are satisfied with the results, be sure to thoroughly answer the questions that follow.

### 1.2 Image Pre-processing

In [None]:
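def grayscale(img):
    # The pickled images are displayed correctly by plt.imshow, so they are assumed to be in RGB channel order
    return cv2.cvtColor(img, cv2.COLOR_RGB2GRAY)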

def equalizeHist(img):
    """Equalizes the histogram of a grayscale image, or of each channel of a color image"""
    if img.ndim == 2:
        # Single-channel (grayscale) image
        img = cv2.equalizeHist(img)
    else:
        # Equalize every channel; range(n_channels) includes the last one
        n_channels = img.shape[2]
        for ch in range(n_channels):
            img[:,:,ch] = cv2.equalizeHist(img[:,:,ch])
            
    return img

def canny(img, low_threshold, high_threshold):
    """Applies the Canny transform"""
    return cv2.Canny(img, low_threshold, high_threshold)

def gaussian_blur(img, kernel_size):
    """Applies a Gaussian blur kernel"""
    return cv2.GaussianBlur(img, (kernel_size, kernel_size), 0)

### Greyscale, Histogram equalization
#### 1) Greyscale to keep the range of color normalized
#### 2) Histogram equalization to keep the brightness of all images normalized (the very dark ones will become brighter and the very bright ones will get slightly darker)


In [None]:
from numpy import newaxis

# convert to B/W
X_train_bw = np.array([grayscale(image) for image in X_train])
X_test_bw = np.array([grayscale(image) for image in X_test])
#X_train_bw = X_train
#X_test_bw = X_test
print('Shape after converting to grayscale:', X_train_bw[0].shape)

# apply histogram equalization
X_train_hst_eq = np.array([equalizeHist(image) for image in X_train_bw])
X_test_hst_eq = np.array([equalizeHist(image) for image in X_test_bw])
print('Shape after Histogram Equalization:', X_train_hst_eq.shape)


### Mean subtraction and Normalization
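
A common variant of mean subtraction computes the per-pixel mean on the training set only and reuses it for every other split, so that all splits are shifted by the same amount. The sketch below shows that variant on random toy data; `fit_pixel_mean` and `center` are hypothetical helper names, and this is an alternative to consider rather than what the cell below does.

```python
import numpy as np


def fit_pixel_mean(train_images):
    """Per-pixel mean over the training images (axis 0 is the example axis)."""
    return train_images.astype(np.float32).mean(axis=0)


def center(images, pixel_mean):
    """Subtract a previously fitted per-pixel mean."""
    return images.astype(np.float32) - pixel_mean


# Toy grayscale data standing in for the real training and test sets
rng = np.random.default_rng(0)
toy_train = rng.integers(0, 256, size=(100, 32, 32), dtype=np.uint8)
toy_test = rng.integers(0, 256, size=(20, 32, 32), dtype=np.uint8)

pixel_mean = fit_pixel_mean(toy_train)          # fitted on training data only
train_centered = center(toy_train, pixel_mean)
test_centered = center(toy_test, pixel_mean)    # same shift applied to test data
print(pixel_mean.shape, train_centered.shape, test_centered.shape)
```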

In [None]:
### Preprocess the data here.
# 1) MEAN SUBTRACTION
def meanSubtraction(image_data):
    return image_data - np.mean(image_data, axis = 0)        
    
X_train_meanSub = meanSubtraction(X_train_hst_eq)
X_test_meanSub = meanSubtraction(X_test_hst_eq)
print('Mean Subtraction performed successfully')
print('Shape after Mean Subtraction :', X_train_meanSub.shape)

# 2) NORMALIZATION
# I found 2 ways to normalize, and I want to check how different these 2 techniques are.



def normalize1(image_data):
    a = -0.5
    b = 0.5
    grayscale_min = 0
    grayscale_max = 255
    return a + ( ( (image_data - grayscale_min)*(b - a) )/( grayscale_max - grayscale_min ) )

X_train_norm1 = normalize1(X_train_meanSub)
X_test_norm1 = normalize1(X_test_meanSub)
print('Shape after Normalization 1 :', X_train_norm1.shape)

def normalize2(image_data):
    return (image_data - image_data.mean()) / (np.max(image_data) - np.min(image_data))

X_train_norm2 = normalize2(X_train_meanSub)
X_test_norm2 = normalize2(X_test_meanSub)
print('Shape after Normalization 2 :', X_train_norm2.shape)

#assert np.array_equal(X_train_norm1, X_train_norm2), 'Normalization techniques 1 and 2 are not producing equal images.'

 
def imageStd(image_data):
    return tf.image.per_image_standardization(image_data)

X_train_std = np.array([imageStd(image) for image in X_train])
#X_test_std = np.array([grayscale(image) for image in X_test])
#X_train_bw = X_train
#X_test_bw = X_test
print('Shape after Standardization:', X_train_std[0].shape)

# 3) PCA/Whitening is good for completeness, but these transformations are not used with Convolutional Networks in practice.

### Feel free to use as many code cells as needed.


#### Reshape to make sure the images are ready (have a shape of 32x32x1) for the convolution layer input
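
As a quick illustration of what the reshape does, indexing with `np.newaxis` appends a channel axis of length 1 without copying or changing the pixel values. The array below is a made-up stand-in for a batch of grayscale images.

```python
import numpy as np

# Five fake 32x32 grayscale images with no channel axis
batch = np.zeros((5, 32, 32), dtype=np.float32)

# Append a trailing channel axis: (5, 32, 32) -> (5, 32, 32, 1)
batch_with_channel = batch[..., np.newaxis]

# np.expand_dims is an equivalent spelling of the same operation
same_result = np.expand_dims(batch, axis=-1)

print(batch_with_channel.shape, same_result.shape)
```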

In [None]:
### Finally we will reshape
# We will do it only for the normalized set that we decide to use; either Normalization 1 or 2
# reshape for conv layer

X_train_reshaped = X_train_norm2[..., newaxis]
X_test_reshaped = X_test_norm2[..., newaxis]

#X_train_reshaped = X_train_norm1
#X_test_reshaped = X_test_norm1
print('Shape after reshape:', X_train_reshaped.shape)


In [None]:
# Compare randomly selected images from the training set with all the pre-processed results, step by step,
# to visually inspect the differences.

# show image of N random data points
count = 10
fig, axs = plt.subplots(count, 6, figsize=(count*2, count*6))
fig.subplots_adjust(hspace = .2, wspace=.01)
axs = axs.ravel()
for i in range(0, count*6, 6):
    index = random.randint(0, n_train - 1)  # randint includes both endpoints
        
    axs[i].axis('off')
    axs[i].imshow(X_train[index]) # ORIGINAL
    axs[i].set_title("Original- " + SignDescription(y_train[index]))

    axs[i+1].axis('off')
    axs[i+1].imshow(X_train_bw[index], cmap='gray')
    axs[i+1].set_title("B/W")

    axs[i+2].axis('off')
    axs[i+2].imshow(X_train_hst_eq[index], cmap='gray')
    axs[i+2].set_title("Histogram Equalized")
    
    axs[i+3].axis('off')
    axs[i+3].imshow(X_train_meanSub[index], cmap='gray')
    axs[i+3].set_title("Mean Subtraction")
    
    axs[i+4].axis('off')
    axs[i+4].imshow(X_train_norm1[index], cmap='gray')
    axs[i+4].set_title("Normalization 1")
    
    axs[i+5].axis('off')
    axs[i+5].imshow(X_train_norm2[index], cmap='gray')
    axs[i+5].set_title("Normalization 2")
    

### Question 1 

_Describe how you preprocessed the data. Why did you choose that technique?_

**Answer:**
Before I answer this question, I'd like to elaborate on the 2 different steps we use to pre-process the data before we feed it into the CNN, and on how these steps significantly affect the answer.
1.	The first step is to manipulate the raw data, simply because we want it to be within an acceptable range and format for the network. Transformation and normalization are two widely used preprocessing methods to achieve this objective. Transformation involves manipulating raw data inputs to create a single input to a net (many to one), while normalization is a manipulation performed on a single data input to distribute the data evenly and scale it into an acceptable range for the network (one to one). Knowledge of the domain is important in choosing preprocessing methods that highlight underlying features in the data, which can increase the network's ability to learn the association between inputs and outputs. There are many different types of manipulation, depending on the field and the type of data we are using. Some simple preprocessing methods include computing differences between inputs or taking ratios of inputs. In financial forecasting, transformations might involve the use of standard technical indicators, or moving averages, which help smooth price data. Other methods for data normalization that I found are min-max normalization (which is what I used here), z-score normalization and normalization by decimal scaling. As the comparison plots above show, I used 2 different normalization techniques, plotted random images together with their 2 normalized versions, and decided to train the CNN with both to see if there is any significant change in accuracy and performance. After 2 runs on each I decided to stick with Normalization 2.
2.	The second step, once the data is ready for the net, is to shuffle the training data. But why? The learning in a neural network follows a walk that minimizes the error, and that walk depends on the current example. Some walks will end at a local minimum while we are seeking a global one. Therefore, if we always start the training at the same sample point, we could end up stuck in the same local minimum over and over again, and that's why it is important to shuffle the training data. I did want to mention an interesting line of research from Bengio et al. on something they call "curriculum learning". Here's a sentence from the opening paragraph:
"By choosing which examples to present and in which order to present them to the learning system, one can guide training and remarkably increase the speed at which learning can occur."
On the same topic, I have read comments from people like Viresh Ranjan, PhD student in Machine Learning, where he states that theory and practice are totally different. Here's what he said: "different random shuffling of the training set may lead to different parameters. In practice, doesn't make that much of a difference. Since the neural network objective functions are non-convex, using different ordering of training samples would lead to possibly different local minimas. However, in practice, we observe that using different ordering of training samples doesn't result in any drastic changes in performance."
So, going back to the question: after looking at the raw images we noticed that they are all 32x32x3, so we don't need to resize them to guarantee that they have this size. Why 32x32x3? We look for the smallest size at which the images don't lose the information we need, and the dataset settles on 32x32x3.
**I will not shuffle at this point. Shuffling is done when the training data is split into 2 datasets: the final training set and the validation set.**
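
The three normalization methods named in the answer (min-max, z-score and decimal scaling) can be compared side by side. The following is a minimal sketch on four made-up pixel values; the function names are hypothetical and the inputs are assumed to be non-constant and non-zero, otherwise the divisions are undefined.

```python
import numpy as np


def min_max(x, low=-0.5, high=0.5):
    """Rescale linearly so that the smallest value maps to low and the largest to high."""
    x = x.astype(np.float64)
    span = x.max() - x.min()            # assumed non-zero
    return low + (x - x.min()) * (high - low) / span


def z_score(x):
    """Shift to zero mean and scale to unit standard deviation."""
    x = x.astype(np.float64)
    return (x - x.mean()) / x.std()     # std assumed non-zero


def decimal_scaling(x):
    """Divide by the smallest power of ten that brings every |value| below 1."""
    x = x.astype(np.float64)
    digits = int(np.floor(np.log10(np.abs(x).max()))) + 1   # max assumed non-zero
    return x / (10 ** digits)


pixels = np.array([0, 64, 128, 255], dtype=np.uint8)
print("min-max:        ", min_max(pixels))
print("z-score:        ", z_score(pixels))
print("decimal scaling:", decimal_scaling(pixels))
```

Min-max fixes the output range, z-score fixes the mean and spread, and decimal scaling only moves the decimal point, so the three can give quite different inputs to the network for the same image.
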
### Added after submission following suggestions: Look above

In [None]:
# Plot a histogram of the count of the number of examples of each sign
# in the training set
def PlotHistogram(data):
    fig, ax = plt.subplots()

    # Raw counts are the default; the old 'normed' argument is not accepted by newer Matplotlib releases
    n_examples_per_sign, bins, patches = ax.hist(data, bins=n_classes, histtype='bar', orientation='horizontal')

    print("The sign with the highest number of examples is:", SignDescription(np.argmax(n_examples_per_sign)))    
    print ("With = {:.0f}".format(np.max(n_examples_per_sign)) + " examples")

    # _____ This IS JUST TO COLOR THE HISTOGRAM
    bin_centers = 0.5 * (bins[:-1] + bins[1:])

    # scale values to interval [0,1]
    col = bin_centers - min(bin_centers)
    col /= max(col)

    # This is the colormap I'd like to use.
    cm = plt.get_cmap('RdYlBu_r')

    for c, p in zip(col, patches):
        plt.setp(p, 'facecolor', cm(c))
    # ________________________________________

    plt.title('Number of images of each sign in the training set')
    plt.ylabel('Sign')
    plt.xlabel('Count')
    ax.set_yticks(signNames['ClassId'])
    ax.set_yticklabels(signNames['SignName'], rotation=0 )
    # Shrink the y tick labels so that all sign names fit
    ax.tick_params(axis='y', labelsize=6)

    # Plot histogram.
    plt.plot()
    
    return n_examples_per_sign

In [None]:
n_examples_per_sign = PlotHistogram(y_train)

In [None]:
## After researching which transformations can be applied to the images to "generate" new ones (and why), I came across the
# Keras image generator, which offers this set:
#      1) rotation,
#      2) width shift,
#      3) height shift,
#      4) shear,
#      5) zoom,
#      6) horizontal flip,
# I don't have time to implement all of these, so I'll give it a try by "shifting and rotating" using
# scipy.ndimage.shift and scipy.ndimage.rotate
import scipy.ndimage
def TransformImg(image):
    if (random.choice([True, False])):
        # Shift by -2..1 pixels along height and width; the channel axis is left untouched
        Newimage = scipy.ndimage.shift(image, [random.randrange(-2, 2), random.randrange(-2, 2), 0])
    else:
        # Rotate by -10..9 degrees while keeping the original image size
        Newimage = scipy.ndimage.rotate(image, random.randrange(-10, 10), reshape=False)
    return Newimage

In [None]:
### Generate additional data (OPTIONAL!)
# Looking at the histogram above, we notice that the CNN would naturally be trained with more examples of some types of
# signs than of others with significantly fewer examples, and would therefore be more prone to identify the former. For this
# reason it would be ideal to augment/generate training images to try to equalize the number of examples per sign.
# (Look at Question 2 below for details)

generated_images = []
generated_labels = []
generated_per_class = []

for class_index in range(n_classes):
    generated_per_class.append(0)
    class_example_count = n_examples_per_sign[class_index]
    add_examples = np.max(n_examples_per_sign) - class_example_count
    print("Sign {:.0f} has only {:.0f} examples, we want to create {:.0f} more.".format(class_index, class_example_count, add_examples))
    # We generate images from the original (color) training images; they are pre-processed afterwards like the rest.
    # I start with an original image and then use each newly generated image as the source for the next one, to minimize
    # generating the same image twice.
    for Image2Transform, Image2Transform_Label in zip(X_train, y_train):
        if class_index == Image2Transform_Label:
            image = Image2Transform            
            while generated_per_class[class_index] < (int(add_examples)):
                image = TransformImg(image)
                generated_per_class[class_index] = generated_per_class[class_index] + 1                
                generated_images.append(image)
                generated_labels.append(class_index)
    print("Generated:", generated_per_class[class_index])
            


In [None]:
# All these generated images should be pre-processed like we did before: grayscale, etc.
# convert to B/W
generated_images = np.array([grayscale(image) for image in generated_images])
# equalizeHist
generated_images = np.array([equalizeHist(image) for image in generated_images])
# meanSubtraction
generated_images = meanSubtraction(generated_images)
# Normalization 2
generated_images = normalize2(generated_images)
# Reshape
generated_images = generated_images[..., newaxis]

# Add the generated data to original data
X_train_augmented = np.append(np.array(X_train_reshaped), np.array(generated_images), axis=0)
y_train_augmented = np.append(np.array(y_train), np.array(generated_labels), axis=0)

### and split the data into training/validation/testing sets here.
 # Splitting below
### Feel free to use as many code cells as needed.

In [None]:
n_examples_per_sign = PlotHistogram(y_train_augmented)

In [None]:
#________________________________________
#         SPLITTING THE DATA
#________________________________________
# We will split the original training set into a final training set and a validation set.

from sklearn.utils import shuffle

def data_split(Input_images, Output_labels, train_percent=.8):   
    m = len(Input_images)
    train_end = int(train_percent * m)    
       
    #Shuffle before we split
    Input_images, Output_labels = shuffle(Input_images, Output_labels, random_state=42)
    
    X_train = Input_images[:train_end]
    y_train = Output_labels[:train_end]
    
    X_validation = Input_images[train_end:]    
    y_validation = Output_labels[train_end:]
    
    return X_train, y_train, X_validation, y_validation

In [None]:
# Here is where we actually split the data

train_percent = 0.8 # 80% will be the final training data and 20% will be used as validation data
X_train_final, y_train_final, X_validation, y_validation = data_split(X_train_augmented, y_train_augmented, train_percent)

n_train_final = len(X_train_final)
n_validation = len(X_validation)

print("Number of Final training examples =", n_train_final)    
print("% training examples =", n_train_final/len(X_train_augmented)*100)
print("Number of Validation examples =", n_validation)
    
image_shape = X_train_final[0].shape
print("The new image shape is:",image_shape)


### Question 2

_Describe how you set up the training, validation and testing data for your model. **Optional**: If you generated additional data, how did you generate the data? Why did you generate the data? What are the differences in the new dataset (with generated data) from the original dataset?_

**Answer:**
After some quick research I realized that deciding which percentages of the full raw data set should go into the training, validation and testing sets is a complex and controversial topic. There doesn't seem to be a strict method to follow or specific percentages to use. The other interesting thing I learnt is that splitting without resampling (cross-validation or, better, bootstrapping) is considered unreliable unless you have an enormous sample (e.g., N > 20000). Rigorous internal validation using the bootstrap is usually preferred, assuming that you program all model selection steps so that they can be repeated in each bootstrap loop. One of the problems with split-sample approaches, besides volatility, is the difficulty of choosing the split fractions. After considerable reading I decided to keep 80% for training and 20% for validation, while I leave the testing set untouched. This looks like a good starting point, which I will change later to study any changes in performance and accuracy.
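
To illustrate the resampling alternative mentioned above, here is a NumPy-only sketch of k-fold cross-validation indices. With k = 5, every fold reproduces an 80/20 split, but each example is used for validation exactly once. `k_fold_indices` is a hypothetical helper written for this illustration, not something the notebook uses.

```python
import numpy as np


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)        # shuffle once, then cut into k folds
    folds = np.array_split(order, k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx


splits = list(k_fold_indices(n_samples=10, k=5))
for train_idx, val_idx in splits:
    print("train:", sorted(train_idx.tolist()), "validation:", sorted(val_idx.tolist()))
```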

On the optional question, I decided to start with the data provided. My intention, after reading about this topic, is to generate additional data or augment the data by performing "distortions" on the original training images: adding noise, skewing, rotating, cropping random windows, and maybe others. Data augmentation is a well-known technique to address the problem of overfitting. I also found more interesting details in the paper "ImageNet Classification with Deep Convolutional Neural Networks", which describes the AlexNet CNN. In this paper the first form of data augmentation consists of generating image translations and horizontal reflections, while the second form consists of altering the intensities of the RGB channels in the training images. I'll go through these changes if time permits.
While I was going through some of the visualization methods over the data, I realized that the number of samples in the training set is not evenly distributed across classes, i.e. there are many more samples of some types of sign than of others (e.g. 2200 samples of Stop signs and 300 of wrong direction). I'm assuming the CNN will therefore be better trained to identify signs that have more samples, and will have a higher error and a harder time predicting unknown signs that belong to a class with fewer training samples. Please look at the histogram above. I'd like to equalize the number of samples per label to the maximum frequency among all signs, either by "getting" more training images for all the signs that have fewer training samples or by generating new images using the techniques mentioned above.
Sorry, time limits really didn't allow me to do all the pre-processing I would have done to do this right!
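
One possible way to plan the balancing described above is to compute how many examples each class is missing relative to the largest class, and to pick a random source image for every example that has to be generated. This is only a sketch under assumptions: `balance_plan` and the toy labels are hypothetical, and the actual distortion applied to each selected image is left out.

```python
import numpy as np


def balance_plan(labels, rng):
    """Map each under-represented class to the indices of the images to distort.

    For every class with fewer examples than the largest class, draw (with
    replacement) as many source indices as examples are missing.
    """
    counts = np.bincount(labels)
    target = counts.max()
    plan = {}
    for cls, count in enumerate(counts):
        if count == 0 or count == target:
            continue                              # nothing to draw from, or already full
        candidates = np.flatnonzero(labels == cls)
        plan[cls] = rng.choice(candidates, size=target - count, replace=True)
    return plan


# Toy labels: class 0 has 4 examples, class 1 has 2, class 2 has 1
toy_labels = np.array([0, 0, 0, 0, 1, 1, 2])
plan = balance_plan(toy_labels, np.random.default_rng(0))
for cls, source_indices in plan.items():
    print("class", cls, "needs", len(source_indices), "new images from", source_indices.tolist())
```

Drawing a fresh source image for every generated example spreads the new images over all originals of a class instead of deriving them from a single one.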

In [None]:
### Define your architecture here.
### Feel free to use as many code cells as needed.


### Question 3

_What does your final architecture look like? (Type of model, layers, sizes, connectivity, etc.)  For reference on how to build a deep neural network using TensorFlow, see [Deep Neural Network in TensorFlow](https://classroom.udacity.com/nanodegrees/nd013/parts/fbf77062-5703-404e-b60c-95b78b2f3f9e/modules/6df7ae49-c61c-4bb2-a23e-6527e69209ec/lessons/b516a270-8600-4f93-a0a3-20dfeabe5da6/concepts/83a3a2a2-a9bd-4b7b-95b0-eb924ab14432) from the classroom._


**Answer:**
Allow me to go into some detail here, so that it also serves as a reference for me. This will be a detailed design specification that will guide me while implementing and debugging.
As suggested, I'm starting with a LeNet-5 neural network architecture.

**Input**

My LeNet architecture accepts a 32x32xC image as input, where C is the number of color channels. Since we are dealing with color images that have been pre-processed to grayscale, C is 1 in this case. This input requirement forces me to make sure that all images are pre-processed so that they comply with this format (please refer to Question 1).

**Architecture**

- **Layer 1: Convolutional.** Like any other convolutional layer, this one takes 4 hyperparameters:
    1. The number of filters K. I'm still reading and watching videos about how to choose this value. From what I found, there are still no strict and rigorous studies on the selection of this parameter, and it remains an "art" that becomes easier with experience. Long story short, I found that in most cases the accuracy plateaus very quickly after what seems like a relatively small K (aka kernels, feature maps, number of filters), at least on the old Caltech-101 and NORB datasets. Smaller receptive field sizes (FxF) tend to plateau more quickly than larger ones, but the relationship isn't quite as clear-cut for larger receptive field sizes. As for why popular choices are 32 or 64, my understanding is that these powers of 2 are more convenient for optimizing GPU usage.
       If we go way back to the original LeNet-5 we see 6 in the first layer, 6 in the second layer, 16 in the third layer and 16 in the fourth layer (convolution and subsampling layers alternate), though admittedly that was optimized on MNIST. I'll start with this configuration and test the different improvements that came later (AlexNet, ZFNet, VGGNet, GoogLeNet, ...).
    2. Their spatial extent F. Filters are, for computational and complexity reasons, always set as squares FxF, where F is usually 3, 5 or 7 and sometimes 1. The reason why F is odd is the idea of using the center of the filter window as the "bulls-eye" when convolving: having the same number of pixels to the left and right allows this center to exist, and only odd sizes allow that. Still, there is no clear sign that using even sizes has any negative impact. My intuition is that we should go as small as the smallest feature we want to detect. I'll start with F = 5 as suggested in LeNet, and I'll test different values and their effect on efficiency and accuracy.
    3. The stride S. As with the value of F, I'll start with the LeNet value, S = 1 (the output shapes listed in this answer only work out with a stride of 1), and I'll test different values and their effect on efficiency and accuracy.
    4. The amount of zero padding P.

  So if we assume (look at the input section above) that the input is of the form W1 x H1 x D1 (width, height and depth), then the output of this convolution is of the form W2 x H2 x D2, where:
    - W2 = (W1 - F + 2P)/S + 1
    - H2 = (H1 - F + 2P)/S + 1 (width and height are computed in the same way by symmetry, i.e. H2 = W2)
    - D2 = K (number of filters)

  With W1 = H1 = 32, F = 5, S = 1, P = 0 and K = 6, this gives an output of 28x28x6.
- **Activation.** This is the function we use to "trigger" the neuron when we "detect" a feature. Common choices are the sigmoid function and the Rectified Linear Unit (ReLU). ReLU seems to be the preferred choice, since it is widely used in the CNN context.
- **Pooling.**
    - Pooling makes the representation smaller and more manageable without discarding relevant information.
    - It operates over each activation map (above) independently.
    - The most common pooling technique, and the one I'll be using, is max pooling. This technique brings 2 more hyperparameters per pooling layer (they can be the same for all pooling layers, as long as they fit the activation maps they are applied to):
        - Pooling window size: in many examples I've seen 2x2.
        - Stride of the pooling window: usually the same as the window side, in this case 2.
    - These 2 parameters (2x2 and stride 2) reduce the width and height of each activation map by half while keeping the same depth.
    - The output shape should be 14x14x6.
- **Layer 2: Convolutional.** As in layer 1, at this point I will just define the parameters:
    - K = 16
    - F = 5
    - S = 1
    - P = 0 (with the 14x14 input, (14 - 5 + 0)/1 + 1 = 10, which matches the output shape below)
    - The output shape should be 10x10x16.
- **Activation.** I'll continue with ReLU.
- **Pooling.**
    - I'll continue with max pooling.
    - 2x2 with stride 2 (to cut the width and height of my activation input in half).
    - The output shape should be 5x5x16.
- **Flatten.** Flatten the output of the final pooling layer so that it is 1D instead of 3D. The easiest way to do this is by using tf.contrib.layers.flatten, which is already imported for you.
- **Layer 3: Fully Connected.** This should have 120 outputs.
- **Activation.** Again ReLU.
- **Layer 4: Fully Connected.** This should have 84 outputs.
- **Activation.** ReLU.
- **Layer 5: Fully Connected (Logits).** This should have 43 outputs.

**Output**

Return the result of the final fully connected layer (the logits of Layer 5).
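
The output shapes listed in this answer can be checked with the formula W2 = (W1 - F + 2P)/S + 1. The sketch below walks a 32x32 input through the LeNet-5 layers using stride 1 and no padding for the convolutions and a 2x2 window with stride 2 for pooling; `conv_output_size` and `lenet_shapes` are hypothetical helpers written only for this check.

```python
def conv_output_size(w, f, s=1, p=0):
    """Spatial output size of a convolution or pooling window: (W - F + 2P)/S + 1."""
    return (w - f + 2 * p) // s + 1


def lenet_shapes(input_size=32):
    """Trace the activation shapes through the LeNet-5 layers described above."""
    shapes = []
    size = conv_output_size(input_size, f=5)        # Layer 1: 5x5 convolution, 6 filters
    shapes.append(("conv1", size, size, 6))
    size = conv_output_size(size, f=2, s=2)         # 2x2 max pooling, stride 2
    shapes.append(("pool1", size, size, 6))
    size = conv_output_size(size, f=5)              # Layer 2: 5x5 convolution, 16 filters
    shapes.append(("conv2", size, size, 16))
    size = conv_output_size(size, f=2, s=2)         # 2x2 max pooling, stride 2
    shapes.append(("pool2", size, size, 16))
    shapes.append(("flatten", size * size * 16))    # input size of the first dense layer
    return shapes


shapes = lenet_shapes()
for entry in shapes:
    print(entry)
```

The same helper can be reused with other filter sizes or filter counts to find the flattened size that the first fully connected layer has to accept.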


In [None]:
from tensorflow.contrib.layers import flatten

def JojoCovNet(x):    
    # Arguments used for tf.truncated_normal, randomly defines variables for the weights and biases for each layer
    mu = 0
    sigma = 0.1
    
    #_____________________________________________________________________________
    # Layer 1: Convolutional. Input = 32x32x1. Output = 30x30x32 (3x3 filter, stride 1, 'VALID' padding).
    #_____________________________________________________________________________
    # Number of Filters K
    K = 32
    
    # Their spatial extent F (FxF)
    F = 3
    
    # The Filter Stride FS
    FS = 1
    strides = [1, FS, FS, 1]
    
    # The Padding P
    padding='VALID'
    
    # Number of Channels on the input 
    Ch = image_shape[2]
    
    conv1_W = tf.Variable(tf.truncated_normal(shape=(F, F, Ch, K), mean = mu, stddev = sigma))
    conv1_b = tf.Variable(tf.zeros(K))
    conv1   = tf.nn.conv2d(x, conv1_W, strides, padding) + conv1_b
    
    #_____________________________________________________________________________
    # Layer 2: MaxPooling. Input = 30x30x32. Output = 15x15x32.
    #_____________________________________________________________________________
    # Pooling window size P
    P = 2
    # Stride of the Pooling window S
    PS = 2
    layer2 = tf.nn.max_pool(conv1, ksize=[1, P, P, 1], strides=[1, PS, PS, 1], padding='VALID')
        
    #_____________________________________________________________________________
    # Layer 3 Dropout. 
    #_____________________________________________________________________________
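    # Dropout: in TensorFlow 1.x the second argument of tf.nn.dropout is a KEEP probability (keep_prob),
    # so dropoutRate is the fraction of units kept. At 50% the keep and drop rates coincide.
    # I will start with 50% and try different values to tune and observe results.
    # Caveat: the value is hard-coded, so dropout is also applied when evaluating and predicting.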
    dropoutRate = 0.5
    
    layer3 = tf.nn.dropout(layer2, dropoutRate)
    
    #_____________________________________________________________________________
    # Layer 4 Activation: ReLu
    #_____________________________________________________________________________
    layer4 = tf.nn.relu(layer3)
    
    # Convolution 2 and Pooling Layer. Input = 15x15x32. Conv output = 13x13x32. Pooling output = 6x6x32.
    # The input depth of this layer is K (the number of filters of layer 1), not the image channel count Ch.
    conv2_W = tf.Variable(tf.truncated_normal(shape=(F, F, K, K), mean = mu, stddev = sigma))
    conv2_b = tf.Variable(tf.zeros(K))
    conv2   = tf.nn.conv2d(layer4, conv2_W, strides, padding) + conv2_b
        
    layer_activation2 = tf.nn.relu(conv2)
    layer_pooling2 = tf.nn.max_pool(layer_activation2, ksize=[1, P, P, 1], strides=[1, PS, PS, 1], padding='VALID')
    
    
    #_____________________________________________________________________________
    # Layer 5 Flatten. 
    #_____________________________________________________________________________    
    fc0   = flatten(layer_pooling2)
    
    #_____________________________________________________________________________
    # Layer 6: Fully Connected. Output = 128.
    #_____________________________________________________________________________
    # Read the flattened size from the graph (6*6*32 = 1152 here) instead of hard-coding it
    fc0_size = fc0.get_shape().as_list()[1]
    fc1_W = tf.Variable(tf.truncated_normal(shape=(fc0_size, 128), mean = mu, stddev = sigma))
    fc1_b = tf.Variable(tf.zeros(128))
    fc1   = tf.matmul(fc0, fc1_W) + fc1_b
    
    #_____________________________________________________________________________
    # Layer 7 Activation: ReLu
    #_____________________________________________________________________________
    layer7 = tf.nn.relu(fc1)

    #_____________________________________________________________________________
    # Layer 8 Flatten. 
    #_____________________________________________________________________________    
    fc2   = flatten(layer7)
    
    #_____________________________________________________________________________
    # Layer 9: Fully Connected. Output = # classes.
    #_____________________________________________________________________________
    fc3_W = tf.Variable(tf.truncated_normal(shape=(128, n_classes), mean = mu, stddev = sigma))
    fc3_b = tf.Variable(tf.zeros(n_classes))
    logits = tf.matmul(fc2, fc3_W) + fc3_b
    
    # Note: Softmax will be performed outside the model
    

    return logits

In [None]:
### Train your model here.
#_____________________________________________________________________________
#    TRAINING PARAMETERS
#_____________________________________________________________________________
aLearning_rate = 0.001
epochs = 100 # Upper bound only: the training loop below stops early once the target accuracy is reached
batch_size = 128


#_____________________________________________________________________________
#      CONSTRUCT THE MODEL
#_____________________________________________________________________________
# x is a placeholder for a batch of input images. 
# y is a placeholder for a batch of output labels.
x = tf.placeholder(tf.float32, (None, image_shape[0], image_shape[1], image_shape[2]))
y = tf.placeholder(tf.int32, (None))
one_hot_y = tf.one_hot(y, n_classes)


# Construct the model
logits = JojoCovNet(x)

# Define the cross entropy, the cost function, the optimizer type and the optimizer operation (minimize the cost function)
cross_entropy = tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=one_hot_y)
cost_function = tf.reduce_mean(cross_entropy)
optimizer = tf.train.AdamOptimizer(learning_rate = aLearning_rate)
training_operation = optimizer.minimize(cost_function)

### Feel free to use as many code cells as needed.
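
To make the cost function concrete, here is a minimal NumPy sketch of what the one-hot encoding, the softmax cross entropy and the mean reduction compute together. It is an illustration under my own assumptions, not TensorFlow's implementation; the helper names and the `*_demo` values are hypothetical.

```python
import numpy as np

def one_hot(labels, n_classes):
    """Integer class ids -> one-hot rows (the role tf.one_hot plays above)."""
    return np.eye(n_classes)[labels]

def softmax_cross_entropy(logits, one_hot_labels):
    """Per-example cross entropy between softmax(logits) and one-hot labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # subtract the row max for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(one_hot_labels * log_probs).sum(axis=1)

n_classes_demo = 43                            # number of sign classes used in this project
logits_demo = np.zeros((2, n_classes_demo))    # an "undecided" network: identical logits for every class
labels_demo = np.array([0, 5])

cost = softmax_cross_entropy(logits_demo, one_hot(labels_demo, n_classes_demo)).mean()
print(cost, np.log(n_classes_demo))
```

With identical logits every class gets probability 1/43, so the mean cost equals ln(43). A first-epoch cost far above that value is usually a sign that something is off with the inputs or the initialisation.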

In [None]:
#_____________________________________________________________________________
#      DEFINE HOW WE WILL EVALUATE THE MODEL
#_____________________________________________________________________________
correct_prediction = tf.equal(tf.argmax(logits, 1), tf.argmax(one_hot_y, 1))
accuracy_operation = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
saver = tf.train.Saver()

def evaluate(X_data, y_data):
    num_examples = len(X_data)
    total_accuracy = 0
    sess = tf.get_default_session()
    for offset in range(0, num_examples, batch_size):
        batch_x, batch_y = X_data[offset:offset+batch_size], y_data[offset:offset+batch_size]
        accuracy = sess.run(accuracy_operation, feed_dict={x: batch_x, y: batch_y})
        total_accuracy += (accuracy * len(batch_x))
    return total_accuracy / num_examples
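
The reason `evaluate` multiplies each batch accuracy by `len(batch_x)` is that the last batch is usually shorter than `batch_size`. A minimal NumPy sketch of that weighting, using made-up correctness flags (an assumption for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
correct = rng.random(300) < 0.8   # hypothetical per-example "prediction was correct" flags
batch = 128                       # 300 examples -> batches of 128, 128 and 44

total_accuracy = 0.0
for offset in range(0, len(correct), batch):
    chunk = correct[offset:offset + batch]
    total_accuracy += chunk.mean() * len(chunk)   # weight each batch by its actual length
weighted = total_accuracy / len(correct)

# Naive alternative: average the per-batch accuracies without weighting
naive = np.mean([correct[o:o + batch].mean() for o in range(0, len(correct), batch)])

print(weighted, correct.mean(), naive)
```

The weighted value matches the accuracy over all examples, while the naive average gives the short last batch as much influence as a full one.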

In [None]:
#_____________________________________________________________________________
#      TRAIN THE MODEL
#_____________________________________________________________________________
# Function to initialise the variables
initialize = tf.global_variables_initializer()

validation_accuracy = []
test_accuracy = []

sess = tf.Session()

sess.as_default()

with sess:
    sess.run(initialize)    
    
    print("Training...")
    print()
    for i in range(epochs):
        X_train_final, y_train_final = shuffle(X_train_final, y_train_final)
        for offset in range(0, n_train, batch_size):
            end = offset + batch_size
            batch_x, batch_y = X_train_final[offset:end], y_train_final[offset:end]
            sess.run(training_operation, feed_dict={x: batch_x, y: batch_y})
            
        validation_accuracy.append(evaluate(X_validation, y_validation))
        print("EPOCH {} ...".format(i+1))
        print("Validation Accuracy = {:.3f}".format(validation_accuracy[i]))
        print()        
        
        # Once the validation accuracy exceeds 90% we also evaluate on the test set and record the result.
        # The epoch at which the test accuracy exceeds its target is taken as the optimal number of epochs, without the need to go through all of them.
        if validation_accuracy[i] > 0.90:
            test_accuracy.append(evaluate(X_test, y_test))
            print("Test Accuracy = {:.3f}".format(test_accuracy[i]))            
        else:
            test_accuracy.append(0.0) # We don't evaluate on the Test Set while the Validation accuracy is 90% or lower
        
        if test_accuracy[i] > 0.90:
            print("Optimization Achieved on Epoch:", str(i+1))
            break 
            
    saver.save(sess, './JojoCovNet') 
    print("Model saved")
    
    np.savetxt("./Logs/epochsAccuraciesNEW.csv", (validation_accuracy,test_accuracy), delimiter=",")
    print("Epochs Accuracies saved")

In [None]:
#_____________________________________________________________________________
#      HOW TO CHOOSE THE RIGHT
#     EPOCH NUM FOR THIS CASE
#_____________________________________________________________________________
from numpy import genfromtxt

accuracies = genfromtxt('./Logs/epochsAccuracies4.csv', delimiter=',')
plt.plot(accuracies[0],'r')
plt.plot(accuracies[1],'b')
plt.title('Validation Accuracies (Validation=Red, Test=Blue)')
plt.ylabel('Accuracies')
plt.xlabel('Epoch')
plt.show()

print ("Max Test Accuracy:",np.max(accuracies[1]))
print ("Epoch:",np.argmax(accuracies[1])+1)



# Epoch Accuracies Data (Over 150 Epochs)
Looking for trends, convergence and transients
![Epoch Accuracies](./Logs/EpochAccuracies.png)
Source: Munir Jojo-Verge

In [None]:
#_____________________________________________________________________________
#      TEST THE MODEL
#     ON THE TEST DATA
#_____________________________________________________________________________

# There is no need to run this cell since I already ran the evaluation over the Test Data above.
# I just decided to keep it for reference.
with tf.Session() as sess:
    saver.restore(sess, tf.train.latest_checkpoint('.'))

    test_accuracy_final = evaluate(X_test, y_test)
    print("Test Accuracy = {:.3f}".format(test_accuracy_final))


Test Accuracies Recorded:

| Epochs | Max Validation Accuracy | Test Accuracy |
|--------|-------------------------|---------------|
| 150    | 0.985845447468          | 0.746         |
| 14     | 0.963274674843          | 0.805         |
| 29     | 0.978194338057          | 0.869         |

### Question 4

_How did you train your model? (Type of optimizer, batch size, epochs, hyperparameters, etc.)_


**Answer:**
Optimizer: I’m using the “AdamOptimizer” since it was suggested to be a better choice than plain Gradient Descent.
But why? See: http://stats.stackexchange.com/questions/184448/difference-between-gradientdescentoptimizer-and-adamoptimizer-tensorflow
The AdamOptimizer uses Kingma and Ba's Adam algorithm to control the learning rate. Adam offers several advantages over the simple tf.train.GradientDescentOptimizer. Foremost is that it uses moving averages of the gradients (momentum); Bengio discusses why this is beneficial in Section 3.1.1 of the paper referenced in the answer linked above. Simply put, this enables Adam to use a larger effective step size, and the algorithm will converge to this step size without fine tuning.

The main downside of the algorithm is that Adam requires more computation for each parameter in each training step (to maintain the moving averages and variance, and to calculate the scaled gradient), and more state to be retained for each parameter (approximately tripling the size of the model to store the average and variance for each parameter). A simple tf.train.GradientDescentOptimizer could equally be used in this network, but would require more hyperparameter tuning before it would converge as quickly.
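
To make the "moving averages and variance" part concrete, here is a minimal NumPy sketch of the Adam update rule as described by Kingma and Ba. It is an illustration, not TensorFlow's implementation: the function name `adam_minimize` and the toy quadratic are my assumptions, and the default values for `beta1`, `beta2` and `eps` are the ones commonly quoted for Adam.

```python
import numpy as np

def adam_minimize(grad_fn, theta, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Minimize a function, given its gradient, with the Adam update rule."""
    m = np.zeros_like(theta)  # moving average of the gradient (first moment)
    v = np.zeros_like(theta)  # moving average of the squared gradient (second moment)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)   # bias correction for the zero initialisation
        v_hat = v / (1 - beta2 ** t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter scaled step
    return theta

# Toy problem (assumption for illustration): f(theta) = sum(theta^2), whose gradient is 2 * theta
theta_final = adam_minimize(lambda th: 2.0 * th, np.array([5.0, -3.0]))
print(theta_final)
```

The two extra arrays `m` and `v` are the additional per-parameter state mentioned above: together with the parameters themselves they roughly triple the memory needed.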

###    EPOCHS
How many epochs should we run the training for? This is actually a great topic of debate. After a fair amount of reading I realized that there are a few approaches, and I decided to take my own, combining the knowledge I gathered with my intuition about what epochs do conceptually.

The basic idea is that with too few epochs you will probably underfit, while with too many you will most likely overfit: the CNN will have gone over the training set so many times that it “memorizes” the outputs instead of generalizing. So how do we get the right value, and what is the trade-off?

What I decided to do was to run the training loop over a relatively large number of epochs (150). [Why 150 and not 1500? I ran epochs until I saw the validation accuracy converge at around 98%.] After every epoch I evaluated the accuracy over the validation set, like we always do, which gives a quantitative "feeling" for how well training is going. I then evaluated the CNN over the test set only when the validation accuracy was above a threshold (90%), so that I collected both validation and test accuracies for every epoch above that threshold. Afterwards I plotted (in Excel) epochs against validation accuracy and against test accuracy. Finally, I decided to use the test accuracy to break the loop, on whatever epoch it happened, as soon as it was >86% (which I thought was a reasonable target). Doing so, the number of epochs became relatively irrelevant during training and only the test accuracy was driving the process. The disadvantage of this method is that the test data has to be evaluated on every epoch where the validation accuracy is above the threshold.

After running it a few times, training always stopped at around epoch 11-14.

Batch size: 128

Hyperparameters: see above.
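
The stopping rule described in this answer can be replayed on recorded accuracy curves. The sketch below is a plain-Python illustration: the function name `stopping_epoch` is hypothetical, and the two curves are made-up numbers, not results from this notebook.

```python
def stopping_epoch(val_history, test_history, val_threshold, test_threshold):
    """Replay the stopping rule on recorded per-epoch accuracies.

    The test accuracy of an epoch only counts when its validation accuracy
    exceeds val_threshold; training stops at the first such epoch whose test
    accuracy also exceeds test_threshold. Returns the 1-based epoch, or None.
    """
    for epoch, (val_acc, test_acc) in enumerate(zip(val_history, test_history), start=1):
        if val_acc > val_threshold and test_acc > test_threshold:
            return epoch
    return None

# Made-up accuracy curves, for illustration only
val_demo = [0.62, 0.85, 0.91, 0.94, 0.96]
test_demo = [0.55, 0.74, 0.82, 0.87, 0.88]

print(stopping_epoch(val_demo, test_demo, val_threshold=0.90, test_threshold=0.86))  # prints 4
```

One caveat with this design: because the test set takes part in deciding when to stop, the test accuracy measured this way tends to be optimistic. Stopping on validation accuracy alone would keep the test set untouched until the end.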


### Question 5


_What approach did you take in coming up with a solution to this problem? It may have been a process of trial and error, in which case, outline the steps you took to get to the final solution and why you chose those steps. Perhaps your solution involved an already well known implementation or architecture. In this case, discuss why you think this is suitable for the current problem._

**Answer:**
My approach was to start with what we learnt first (LeNet-5) as the foundation for my architecture. This process led me to wonder about the reasons behind the values of the hyperparameters on each layer, and also about the number of layers. I ended up going deep into the rabbit hole, reading and watching lectures on the different architectures (AlexNet, ZFNet, VGGNet, GoogLeNet, …) that have been used in the field of image processing and image classification. All this research made my initial architecture evolve and change significantly, as you can see. I ended up with this architecture after starting to play with Keras and observing that it produced nice results for sign classification.
I took a trial and error approach to determine the optimal dropout rate. I started with 50% and recorded my validation and testing accuracies for different epochs. I tried 60% and 70% and noticed that it took more epochs to achieve a validation accuracy >96%. I also tried 40% and noticed the same thing. Because the process is so time consuming, I decided to stay with 50% and stop the tuning there.
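
When reading these percentages, note that in TensorFlow 1.x the second argument of `tf.nn.dropout` is the probability of keeping a unit, and kept units are scaled by 1 / keep_prob. The NumPy sketch below illustrates that behaviour. It is my own simplified version (the function name `dropout` and the test activations are assumptions), not the library code.

```python
import numpy as np

def dropout(activations, keep_prob, rng):
    """Inverted dropout: zero each unit with probability 1 - keep_prob, scale the rest by 1 / keep_prob."""
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.default_rng(42)
acts = np.ones((4, 8))

train_out = dropout(acts, keep_prob=0.5, rng=rng)  # surviving units become 2.0, dropped units 0.0
eval_out = dropout(acts, keep_prob=1.0, rng=rng)   # keep_prob = 1.0 leaves the activations untouched

print(train_out)
print(eval_out)
```

In a graph like the one in this notebook, the usual pattern is to pass the keep probability through a placeholder and feed 1.0 when evaluating or predicting, so that dropout is only active during training.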

---

## Step 3: Test a Model on New Images

Take several pictures of traffic signs that you find on the web or around you (at least five), and run them through your classifier on your computer to produce example results. The classifier might not recognize some local signs but it could prove interesting nonetheless.

You may find `signnames.csv` useful as it contains mappings from the class id (integer) to the actual sign name.

### Implementation

Use the code cell (or multiple code cells, if necessary) to implement the first step of your project. Once you have completed your implementation and are satisfied with the results, be sure to thoroughly answer the questions that follow.

In [None]:
### Load the images and plot them here.
import matplotlib.image as mpimg
def load_image(image_path):
    
    #reading in an image
    image = mpimg.imread(image_path)
    #printing out some stats and plotting
    #print('This image is:', type(image), 'with dimensions:', image.shape)
    #plt.imshow(image)  #call as plt.imshow(gray, cmap='gray') to show a grayscaled image
    return image

### Feel free to use as many code cells as needed.

In [None]:
def predict(img):
    classification = sess.run(tf.argmax(logits, 1), feed_dict={x: [img]})
    #print(classification)
    #print('NN predicted', classification[0])
    return classification[0]

### Loading New Images
In my second try, I decided to go a little deeper and get real images from the road. I gathered around 8.2 GB worth of images in a variety of situations.
I'm going to start by loading and plotting a few without any pre-processing.

In [None]:

for i in range(11):
    img = load_image('./images/Raw/Raw' + str(i+1) + '.png')
    plt.subplot(5,3,i+1)
    plt.xticks([])
    plt.yticks([])
    plt.imshow(img)
for i in range(3):
    img = load_image('./images/Raw/Raw' + str(i+12) + '-Web.jpg')
    plt.subplot(5,3,i+12)
    plt.xticks([])
    plt.yticks([])
    plt.imshow(img)



### Let's resize these images without cropping and see what happens with the predictions

In [None]:
## Let's resize these images without cropping and see what happens with the predictions

import os
import re
import scipy.misc
from scipy import ndimage, misc
i=0
images = []
for root, dirnames, filenames in os.walk("./images/Raw"):
    for filename in filenames:
        if re.search(r"\.(jpg|jpeg|png|bmp|tiff)$", filename):  # raw string avoids the invalid "\." escape
            filepath = os.path.join(root, filename)
            image = ndimage.imread(filepath, mode="RGB")
            image_resized = misc.imresize(image, (32, 32))
            images.append(image_resized)
            scipy.misc.toimage(image_resized, cmin=0.0, cmax=...).save('./images/Raw/32x32/' + filename)
            print('File: '+'./images/Raw/' + filename + ' SAVED')
            i = i+1
            plt.subplot(10,3,i)
            plt.xticks([])
            plt.yticks([])
            plt.imshow(image_resized)

In [None]:
with tf.Session() as sess:
    saver.restore(sess, tf.train.latest_checkpoint('.'))
    i=0
    for root, dirnames, filenames in os.walk("./images/Raw/32x32"):
        for filename in filenames:
            if re.search(r"\.(jpg|jpeg|png|bmp|tiff)$", filename):  # raw string avoids the invalid "\." escape
                filepath = os.path.join(root, filename)
                image = ndimage.imread(filepath, mode="RGB")
                i = i+1
                plt.subplot(10,3,i)
                plt.xticks([])
                plt.yticks([])
                plt.imshow(image)
                signName_idx = predict(image)
                plt.title (str(signName_idx) + "-" + SignDescription(signName_idx))

     

### Question 6

_Choose five candidate images of traffic signs and provide them in the report. Are there any particular qualities of the image(s) that might make classification difficult? It could be helpful to plot the images in the notebook._



**Answer:**
In my second try, I decided to go a little deeper and get real images from the road. I gathered around 8.2 GB worth of images in a variety of situations; a few of them are plotted above without any pre-processing.

The qualities that might make classification difficult are:

 1. The contrast of the image.
 2. The angle of the traffic sign.
 3. The image might be jittered.
 4. The training data set might not include the traffic sign.
 5. Background objects.

In [None]:
### Run the predictions here.
## SEE ABOVE.
### Feel free to use as many code cells as needed.

### Question 7

_Is your model able to perform equally well on captured pictures when compared to testing on the dataset? The simplest way to do this check the accuracy of the predictions. For example, if the model predicted 1 out of 5 signs correctly, it's 20% accurate._

_**NOTE:** You could check the accuracy manually by using `signnames.csv` (same directory). This file has a mapping from the class id (0-42) to the corresponding sign name. So, you could take the class id the model outputs, lookup the name in `signnames.csv` and see if it matches the sign from the image._


**Answer:**
Unfortunately it does not perform equally well on captured images. It has a performance of (2 out of 6) 33% accuracy on captured images as opposed to 86.35% on the test set.

I pre-processed the images and cropped them to be 32x32 and their quality was reduced significantly as you can see on the plots.

In [None]:
### Visualize the softmax probabilities here.
with tf.Session() as sess:
    saver.restore(sess, tf.train.latest_checkpoint('.'))
    # Load each color image and classify it
    for i in range(6):
        img = load_image('./images/32x32/Test' + str(i+1) + '.jpg')
        #plt.subplot(6,6,i+1)
        plt.imshow(img)
        signName_idx = predict(img)
        print ('Image ' + str(i+1) + ' is classified as: ' + str(signName_idx) + "-" + SignDescription(signName_idx))
        top_five = sess.run(tf.nn.top_k(tf.nn.softmax(logits), k=5), feed_dict={x: [img]})
        print("Top five: ", top_five)
### Feel free to use as many code cells as needed.

### Question 8

*Use the model's softmax probabilities to visualize the **certainty** of its predictions, [`tf.nn.top_k`](https://www.tensorflow.org/versions/r0.12/api_docs/python/nn.html#top_k) could prove helpful here. Which predictions is the model certain of? Uncertain? If the model was incorrect in its initial prediction, does the correct prediction appear in the top k? (k should be 5 at most)*

`tf.nn.top_k` will return the values and indices (class ids) of the top k predictions. So if k=3, for each sign, it'll return the 3 largest probabilities (out of a possible 43) and the corresponding class ids.

Take this numpy array as an example:

```
# (5, 6) array
a = np.array([[ 0.24879643,  0.07032244,  0.12641572,  0.34763842,  0.07893497,
         0.12789202],
       [ 0.28086119,  0.27569815,  0.08594638,  0.0178669 ,  0.18063401,
         0.15899337],
       [ 0.26076848,  0.23664738,  0.08020603,  0.07001922,  0.1134371 ,
         0.23892179],
       [ 0.11943333,  0.29198961,  0.02605103,  0.26234032,  0.1351348 ,
         0.16505091],
       [ 0.09561176,  0.34396535,  0.0643941 ,  0.16240774,  0.24206137,
         0.09155967]])
```

Running it through `sess.run(tf.nn.top_k(tf.constant(a), k=3))` produces:

```
TopKV2(values=array([[ 0.34763842,  0.24879643,  0.12789202],
       [ 0.28086119,  0.27569815,  0.18063401],
       [ 0.26076848,  0.23892179,  0.23664738],
       [ 0.29198961,  0.26234032,  0.16505091],
       [ 0.34396535,  0.24206137,  0.16240774]]), indices=array([[3, 0, 5],
       [0, 1, 4],
       [0, 5, 1],
       [1, 3, 5],
       [1, 4, 3]], dtype=int32))
```

Looking just at the first row we get `[ 0.34763842,  0.24879643,  0.12789202]`, you can confirm these are the 3 largest probabilities in `a`. You'll also notice `[3, 0, 5]` are the corresponding indices.
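
The same result can be reproduced without a TensorFlow session. The following is a minimal NumPy sketch; the helper name `top_k` is my own, and it only mirrors the values/indices behaviour described above for 2D arrays.

```python
import numpy as np

def top_k(probabilities, k):
    """Return (values, indices) of the k largest entries of each row, largest first."""
    indices = np.argsort(-probabilities, axis=1)[:, :k]
    values = np.take_along_axis(probabilities, indices, axis=1)
    return values, indices

a = np.array([[0.24879643, 0.07032244, 0.12641572, 0.34763842, 0.07893497, 0.12789202],
              [0.28086119, 0.27569815, 0.08594638, 0.0178669,  0.18063401, 0.15899337],
              [0.26076848, 0.23664738, 0.08020603, 0.07001922, 0.1134371,  0.23892179],
              [0.11943333, 0.29198961, 0.02605103, 0.26234032, 0.1351348,  0.16505091],
              [0.09561176, 0.34396535, 0.0643941,  0.16240774, 0.24206137, 0.09155967]])

values, indices = top_k(a, k=3)
print(indices[0])  # [3 0 5]
print(values[0])
```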

**Answer:**
It looks strange to me that the model is so certain of its predictions (1.0 = 100%) for 5 of the 6 images I used, even though some of them are wrong.
On the positive side, when I repeat this operation with the same CNN I consistently get the same predictions.
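
A probability of 1.0 does not necessarily mean the network has strong evidence. Softmax saturates as soon as one logit is much larger than the others, which can happen, for example, when the inputs have a large numeric range. Whether that is the cause here has not been checked; the sketch below only illustrates the effect with made-up logits.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1D array of logits."""
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / exps.sum()

moderate = softmax(np.array([2.0, 1.0, 0.0]))      # a clear but not extreme preference
large = softmax(np.array([200.0, 100.0, 0.0]))     # same ranking, logits 100 times larger

print(np.round(moderate, 3))
print(large)
```

Both inputs rank the classes identically, but only the second one reports near-total certainty, so the size of the logits matters as much as their order when interpreting the top-k probabilities.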

> **Note**: Once you have completed all of the code implementations and successfully answered each question above, you may finalize your work by exporting the iPython Notebook as an HTML document. You can do this by using the menu above and navigating to **File -> Download as -> HTML (.html)**. Include the finished document along with this notebook as your submission.

In [None]:
# Close my tf session.
sess.close()