# Machine Learning Engineer Nanodegree
## Deep Learning
## Project: Build a Digit Recognition Program

In this notebook, a template is provided for you to implement your functionality in stages which is required to successfully complete this project. If additional code is required that cannot be included in the notebook, be sure that the Python code is successfully imported and included in your submission, if necessary. Sections that begin with **'Implementation'** in the header indicate where you should begin your implementation for your project. Note that some sections of implementation are optional, and will be marked with **'Optional'** in the header.

In addition to implementing code, there will be questions that you must answer which relate to the project and your implementation. Each section where you will answer a question is preceded by a **'Question'** header. Carefully read each question and provide thorough answers in the following text boxes that begin with **'Answer:'**. Your project submission will be evaluated based on your answers to each of the questions and the implementation you provide.

>**Note:** Code and Markdown cells can be executed using the **Shift + Enter** keyboard shortcut. In addition, Markdown cells can typically be edited by double-clicking the cell to enter edit mode.

----
## Step 1: Design and Test a Model Architecture
Design and implement a deep learning model that learns to recognize sequences of digits. Train the model using synthetic data generated by concatenating character images from [notMNIST](http://yaroslavvb.blogspot.com/2011/09/notmnist-dataset.html) or [MNIST](http://yann.lecun.com/exdb/mnist/). To produce a synthetic sequence of digits for testing, you can for example limit yourself to sequences up to five digits, and use five classifiers on top of your deep network. You would have to incorporate an additional ‘blank’ character to account for shorter number sequences.

There are various aspects to consider when thinking about this problem:
- Your model can be derived from a deep neural net or a convolutional network.
- You could experiment with sharing the weights between the softmax classifiers or keeping them separate.
- You can also use a recurrent network in your deep neural net to replace the classification layers and directly emit the sequence of digits one-at-a-time.

Here is an example of a [published baseline model on this problem](http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/42241.pdf). ([video](https://www.youtube.com/watch?v=vGPI_JvLoN0))

### Implementation
Use the code cell (or multiple code cells, if necessary) to implement the first step of your project. Once you have completed your implementation and are satisfied with the results, be sure to thoroughly answer the questions that follow.

In [1]:
# Please see the code in the attached Jupyter notebooks.

### Question 1
_What approach did you take in coming up with a solution to this problem?_

**Answer:** I learned from the Deep Learning course at Udacity that convolutional neural networks (CNNs) are very good at classifying a single object in an image. One advantage is that the location of the object in the image matters little to the classifier, because the same weights are shared across all locations of the image. If a CNN can classify a single object in an image, it is plausible that it can also classify multiple objects in an image simultaneously. Google has also published a [paper](https://arxiv.org/abs/1312.6082) on multi-digit number classification using a CNN. So I thought it was worth trying a CNN for this problem.

### Question 2
_What does your final architecture look like? (Type of model, layers, sizes, connectivity, etc.)_

**Answer:** 
* The input is a gray-scale 28 x 140 image. The image is passed through a 6 x 6 convolutional layer with a depth of 16 and same padding, followed by an identical 6 x 6 convolutional layer with a depth of 16 and same padding. The strides were [1, 2, 2, 1], one value for each dimension. Finally, there is a fully connected layer with 128 hidden nodes followed by 6 softmax classifiers (one for the length of the digit sequence and five for the digits). I did not optimize the convolutional layers or the hyperparameters of the model, because I only wanted to find out whether a basic convolutional network works on the digit recognition problem. The result is quite acceptable: the classification accuracy on the test set, measured on the whole sequence rather than on single digits, is 86.3%. So I proceeded to train a convolutional network on the realistic SVHN dataset.
* Here is an illustration of the ConvNet structure.

<img src = 'report/ConvNet_Structure_1.jpg'>
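To make the layer sizes above concrete, the following is a minimal sketch of the shape arithmetic only, not the notebook's actual model code. It assumes that both convolutional layers use a stride of 2 along height and width, that there is no pooling between them, and that 'same' padding follows the usual rule of output size = ceil(input size / stride). The helper name `same_padding_output` is hypothetical.

```python
import math

def same_padding_output(size, stride):
    """Output length of a convolution with 'same' padding: ceil(size / stride)."""
    return math.ceil(size / stride)

# Gray-scale 28 x 140 input, as described in the answer above.
height, width, depth = 28, 140, 1

# Assumption: both 6 x 6 layers use stride 2 in height and width, depth 16.
for layer in (1, 2):
    height = same_padding_output(height, stride=2)
    width = same_padding_output(width, stride=2)
    depth = 16
    print(f"after conv {layer}: {height} x {width} x {depth}")
# after conv 1: 14 x 70 x 16
# after conv 2: 7 x 35 x 16

# Number of features fed into the 128-node fully connected layer.
flattened = height * width * depth
print("flattened features:", flattened)  # flattened features: 3920
```

Under these assumptions the fully connected layer sees 3920 inputs. If the real model applies different strides or adds pooling, the numbers change accordingly.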

### Question 3
_How did you train your model? How did you generate your synthetic dataset?_ Include examples of images from the synthetic data you constructed.

**Answer:** 
* I trained the model with stochastic gradient descent using the AdagradOptimizer.
* I generated a multi-digit sequence dataset, which I call the multi-MNIST_continuous dataset, from the existing MNIST dataset. Using single-digit images from MNIST and blank images that I generated, I built a new dataset containing strings of up to 5 digits. There may be blanks at the beginning or the end of a digit sequence.
* Here are some random examples of images from my multi-MNIST_continuous dataset. Note that blanks are labeled as 10.

<img src = 'report/multiMNIST_continuous.jpeg'>
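The dataset construction described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the attached notebooks: tiles are plain nested lists so the example has no dependencies (with NumPy arrays, `np.hstack` would do the joining), the number of leading blanks is drawn uniformly at random, and the helper names `blank_tile` and `make_sequence` are hypothetical. The toy tiles stand in for real MNIST images.

```python
import random

TILE = 28        # MNIST digits are 28 x 28 pixels
MAX_DIGITS = 5   # sequences hold up to five digits
BLANK = 10       # label used for blank tiles, as in the text above

def blank_tile():
    """An all-zero 28 x 28 tile representing a blank position."""
    return [[0] * TILE for _ in range(TILE)]

def make_sequence(digit_tiles, digit_labels, rng):
    """Pad a contiguous run of 1 to 5 digit tiles with blanks on the left
    and/or right, then join all five tiles side by side (28 x 140)."""
    n_blanks = MAX_DIGITS - len(digit_tiles)
    left = rng.randint(0, n_blanks)   # blanks placed before the digits
    right = n_blanks - left           # remaining blanks go after the digits
    tiles = [blank_tile()] * left + list(digit_tiles) + [blank_tile()] * right
    labels = [BLANK] * left + list(digit_labels) + [BLANK] * right
    # Concatenate row r of every tile to build row r of the wide image.
    image = [sum((tile[r] for tile in tiles), []) for r in range(TILE)]
    return image, labels, len(digit_tiles)

# Toy stand-ins for two MNIST images (constant tiles filled with 3 and 7).
rng = random.Random(0)
fake_digits = [[[d] * TILE for _ in range(TILE)] for d in (3, 7)]
image, labels, length = make_sequence(fake_digits, [3, 7], rng)
print(len(image), len(image[0]), labels, length)
```

The returned labels give one target per softmax classifier: five position labels (digits 0 to 9, or 10 for blank) plus the sequence length.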

----
## Step 2: Train a Model on a Realistic Dataset
Once you have settled on a good architecture, you can train your model on real data. In particular, the [Street View House Numbers (SVHN)](http://ufldl.stanford.edu/housenumbers/) dataset is a good large-scale dataset collected from house numbers in Google Street View. Training on this more challenging dataset, where the digits are not neatly lined-up and have various skews, fonts and colors, likely means you have to do some hyperparameter exploration to perform well.

### Implementation
Use the code cell (or multiple code cells, if necessary) to implement the first step of your project. Once you have completed your implementation and are satisfied with the results, be sure to thoroughly answer the questions that follow.

In [2]:
# Please see the code in the attached Jupyter notebooks.

### Question 4
_Describe how you set up the training and testing data for your model. How does the model perform on a realistic dataset?_

**Answer:** 
* In addition to the images containing the digit sequences, the SVHN dataset provides bounding box information for each single digit. So I wrote an algorithm that uses the per-digit bounding boxes to generate the smallest bounding box covering the whole sequence. From this smallest bounding box, I was able to generate several 'shifted' versions of the bounding box. Basically, I enlarged the bounding box by 30% in each dimension. Then I cropped the digit sequences and resized them to 32 x 32 square images to use as the training and test data.
* Compared with the model trained on the multi-MNIST_continuous dataset, I made some changes to the classification model in order to improve its performance on this much 'harder' dataset. The changes are described in the next question. The classification accuracy on the test set is 83.4% for whole sequences and 95.9% for single digits.
* Here is an illustration of the ConvNet structure.

<img src = 'report/ConvNet_Structure_2.jpg'>
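The bounding box step described above might look like the sketch below. It is an illustration rather than the notebook's implementation. It assumes each digit box is given as `(left, top, width, height)` in pixels, and it interprets "enlarged by 30%" as growing the width and height by 30% around the box center, clipped to the image. The function names `union_box` and `expand_box` are hypothetical, and the final resize to 32 x 32 (done with an image library) is left out.

```python
def union_box(digit_boxes):
    """Smallest box covering every per-digit box.
    Each box is (left, top, width, height) in pixels."""
    left = min(b[0] for b in digit_boxes)
    top = min(b[1] for b in digit_boxes)
    right = max(b[0] + b[2] for b in digit_boxes)
    bottom = max(b[1] + b[3] for b in digit_boxes)
    return left, top, right - left, bottom - top

def expand_box(box, image_width, image_height, factor=0.30):
    """Grow a box by `factor` of its width and height around its center
    (an assumed reading of '30%'), clipped to the image bounds.
    Returns integer crop coordinates (left, top, right, bottom)."""
    left, top, width, height = box
    dx, dy = width * factor / 2, height * factor / 2
    new_left = max(0, int(round(left - dx)))
    new_top = max(0, int(round(top - dy)))
    new_right = min(image_width, int(round(left + width + dx)))
    new_bottom = min(image_height, int(round(top + height + dy)))
    return new_left, new_top, new_right, new_bottom

# Two made-up digit boxes in a 100 x 60 image.
boxes = [(40, 10, 12, 20), (54, 12, 14, 18)]
sequence_box = union_box(boxes)
print(sequence_box)                       # (40, 10, 28, 20)
print(expand_box(sequence_box, 100, 60))  # (36, 7, 72, 33)
```

Shifting the expanded box by a few pixels before cropping would be one way to produce the 'shifted' versions mentioned above.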

### Question 5
_What changes did you have to make, if any, to achieve "good" results? Were there any options you explored that made the results worse?_

**Answer:** 
* Actually, I would not call these 'good' results. An accuracy of 83.4% on whole sequences is still far from satisfactory. However, I was not able to try much more to improve the model because of limited computing power. I was using a desktop with an i7-6700 CPU and no graphics card, which significantly limited my training flexibility. The desktop had only 16 GB of memory, which also limited my exploration of deeper and more complex neural networks.
* I changed the structure of the convolutional network by replacing a convolutional layer with a large receptive field with a stack of very small convolutional filters. I also converted the images to gray scale, normalized them, and applied local_response_normalization in the convolutional network. These tricks improved the classification accuracy significantly. In addition, tuning the learning rate was critical for learning.
* Using RGB images as input made the results worse and the training more difficult.
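As a rough illustration of the gray-scale conversion and normalization mentioned above, here is a dependency-free sketch. The exact preprocessing in the notebooks is not stated, so two choices here are assumptions: the ITU-R BT.601 luma weights (0.299, 0.587, 0.114) for the gray-scale conversion, and per-image normalization to zero mean and unit standard deviation. The function names are hypothetical, and the local response normalization layer is not covered.

```python
import statistics

def to_grayscale(rgb_image):
    """Collapse an RGB image (rows of (r, g, b) tuples) to one channel
    using the BT.601 luma weights (an assumed choice)."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for r, g, b in row]
            for row in rgb_image]

def normalize(gray_image):
    """Shift one image to zero mean and scale it to unit standard deviation."""
    pixels = [p for row in gray_image for p in row]
    mean = statistics.fmean(pixels)
    std = statistics.pstdev(pixels) or 1.0  # flat image: avoid dividing by zero
    return [[(p - mean) / std for p in row] for row in gray_image]

# Tiny 2 x 2 example image: red, green, blue and white pixels.
rgb = [[(255, 0, 0), (0, 255, 0)],
       [(0, 0, 255), (255, 255, 255)]]
gray = to_grayscale(rgb)
normalized = normalize(gray)
print(gray)
print(normalized)
```

Converting to a single channel reduces the input size by a factor of three, which may be one reason training was easier than with RGB input on a CPU-only machine.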

### Question 6
_What were your initial and final results with testing on a realistic dataset? Do you believe your model is doing a good enough job at classifying numbers correctly?_

**Answer:** 
* As I remember, before applying the tricks described in Question 5, the classification accuracy for the whole sequence of digits was a little more than 50% on the SVHN dataset. After applying them, the classification accuracy improved to 83.4%.
* As mentioned above, given my computational limits, this result is acceptable but still far from satisfactory. To improve the model further, if I had a better computer, I would try higher-resolution images as inputs, say 64 x 64 instead of 32 x 32. I would also try making the convolutional network deeper to see whether that increases the classification accuracy.
* Here are some random examples of digit sequence predictions on the SVHN dataset.

<img src = 'report/SVHN_prediction.jpeg'>
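The two accuracy figures reported in this step (whole sequence and single digit) measure different things, and the sketch below shows one way to compute them. It assumes that every prediction and label is a fixed-length list of five position labels with 10 standing for a blank, and that blank positions count toward the per-digit accuracy. Whether the notebooks count blanks this way is not stated. The function names are hypothetical and the sample labels are made up.

```python
def digit_accuracy(predicted, actual):
    """Fraction of individual positions that were predicted correctly."""
    correct = total = 0
    for pred_seq, true_seq in zip(predicted, actual):
        for p, t in zip(pred_seq, true_seq):
            correct += (p == t)
            total += 1
    return correct / total

def sequence_accuracy(predicted, actual):
    """Fraction of sequences in which every position is correct."""
    matches = sum(list(p) == list(t) for p, t in zip(predicted, actual))
    return matches / len(actual)

# Made-up example: three sequences, one wrong digit in the second one.
actual = [[1, 2, 3, 10, 10], [4, 6, 10, 10, 10], [7, 8, 9, 10, 10]]
predicted = [[1, 2, 3, 10, 10], [4, 5, 10, 10, 10], [7, 8, 9, 10, 10]]
print(digit_accuracy(predicted, actual))     # 14 of 15 positions correct
print(sequence_accuracy(predicted, actual))  # 2 of 3 sequences correct
```

A single wrong digit makes the whole sequence count as wrong, which is why the whole-sequence accuracy is always at most the per-digit accuracy.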

----
## Step 3: Test a Model on Newly-Captured Images

Take several pictures of numbers that you find around you (at least five), and run them through your classifier on your computer to produce example results. Alternatively (optionally), you can try using OpenCV / SimpleCV / Pygame to capture live images from a webcam and run those through your classifier.

### Implementation
Use the code cell (or multiple code cells, if necessary) to implement the first step of your project. Once you have completed your implementation and are satisfied with the results, be sure to thoroughly answer the questions that follow.

In [3]:
# Please see the code in the attached Jupyter notebooks.

### Question 7
_Choose five candidate images of numbers you took from around you and provide them in the report. Are there any particular qualities of the image(s) that might make classification difficult?_

**Answer:**
* Here are the five candidate images that I took around my home and workplace.

<img src = 'report/real_image_samples.jpeg'>

* Because I took these raw images with my iPhone, their resolution is very high. The digit sequence occupies only a small part of each image, and no bounding box information is available. The input to my digit recognizer is a 32 x 32 image, so I need to resize the raw images to that size. As the illustration above shows, after resizing, the features of the digit sequences are almost entirely gone.

### Question 8
_Is your model able to perform equally well on captured pictures or a live camera stream when compared to testing on the realistic dataset?_

**Answer:**
* The predictions for the five candidate images are shown below. The predictions do not make any sense, and they are not at all comparable to the test results on the realistic SVHN dataset.

<img src = 'report/real_image_prediction_no_manipulation.jpeg'>

* However, if I somehow have the bounding box information for the digit sequences in the images, I can crop the digit sequences from the images and run them through the digit recognizer. The cropped images containing the digit sequences are shown below.

<img src = 'report/real_image_cropped_manually.jpeg'>

* The predictions obtained when the cropped images are used as input to the digit recognizer are shown below. The digit recognizer did a good job: the prediction accuracy is comparable to the test results on the realistic SVHN dataset.

<img src = 'report/real_image_prediction_manual_crop.jpeg'>

* Therefore, it is necessary to develop a bounding box localizer that determines the location of the digit sequence in the raw image before running the digit recognizer on it.

### Optional: Question 9
_If necessary, provide documentation for how an interface was built for your model to load and classify newly-acquired images._

**Answer:** Leave blank if you did not complete this part.

----
## Step 4: Explore an Improvement for a Model

There are many things you can do once you have the basic classifier in place. One example would be to also localize where the numbers are on the image. The SVHN dataset provides bounding boxes that you can tune to train a localizer. Train a regression loss to the coordinates of the bounding box, and then test it. 

### Implementation
Use the code cell (or multiple code cells, if necessary) to implement the first step of your project. Once you have completed your implementation and are satisfied with the results, be sure to thoroughly answer the questions that follow.

In [4]:
# Please see the code in the attached Jupyter notebooks.

### Question 10
_How well does your model localize numbers on the testing set from the realistic dataset? Do your classification results change at all with localization included?_

**Answer:**
* The digit localizer works reasonably well on the test set from the realistic SVHN dataset. Below are the bounding boxes predicted by the digit localizer on the realistic SVHN dataset. The green box is the labeled bounding box and the red box is the predicted bounding box. IoU stands for 'Intersection over Union'. It is a value between 0 and 1 that measures how well the predicted bounding box overlaps the labeled one.

<img src = 'report/SVHN_bbox_prediction.jpeg'>
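For reference, the IoU score shown in the figure can be computed as in the following sketch. It assumes boxes are given as `(left, top, right, bottom)` corner coordinates, which may differ from the representation used in the notebooks. The function name `iou` is illustrative.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (left, top, right, bottom)."""
    # Width and height of the overlapping region (zero if the boxes are disjoint).
    inter_w = max(0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

labeled = (0, 0, 10, 10)
print(iou(labeled, (0, 0, 10, 10)))    # 1.0 (identical boxes)
print(iou(labeled, (5, 0, 15, 10)))    # 0.333... (half of each box overlaps)
print(iou(labeled, (20, 20, 30, 30)))  # 0.0 (no overlap)
```

An IoU of 1 means the predicted box matches the labeled box exactly, and 0 means the two boxes do not overlap at all.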

### Question 11
_Test the localization function on the images you captured in **Step 3**. Does the model accurately calculate a bounding box for the numbers in the images you found? If you did not use a graphical interface, you may need to investigate the bounding boxes by hand._ Provide an example of the localization created on a captured image.

**Answer:**
* The digit localizer could detect the digit sequences in the images to some extent, but its accuracy may not be very good. Here are the bounding box predictions for the five real candidate images.

<img src = 'report/real_image_bbox_prediction.jpeg'>

* I could then crop the images and pass them to the digit recognizer to see whether the digit sequence prediction accuracy improved. Here are the cropped images used for the digit recognizer.

<img src = 'report/real_image_cropped.jpeg'>

* However, the digit recognizer still did not get the predictions right. Most likely, the digit localizer is not accurate enough to produce crops that are similar enough to the data the digit recognizer was trained on.

<img src = 'report/real_image_cropped_prediction.jpeg'>

* To improve this further, I need to improve the accuracy of the digit localizer.

----
## Optional Step 5: Build an Application or Program for a Model
Take your project one step further. If you're interested, look to build an Android application or even a more robust Python program that can interface with input images and display the classified numbers and even the bounding boxes. You can for example try to build an augmented reality app by overlaying your answer on the image like the [Word Lens](https://en.wikipedia.org/wiki/Word_Lens) app does.

Loading a TensorFlow model into a camera app on Android is demonstrated in the [TensorFlow Android demo app](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/examples/android), which you can simply modify.

If you decide to explore this optional route, be sure to document your interface and implementation, along with significant results you find. You can see the additional rubric items that you could be evaluated on by [following this link](https://review.udacity.com/#!/rubrics/413/view).

### Optional Implementation
Use the code cell (or multiple code cells, if necessary) to implement the first step of your project. Once you have completed your implementation and are satisfied with the results, be sure to thoroughly answer the questions that follow.

In [5]:


### Your optional code implementation goes here.
### Feel free to use as many code cells as needed.



### Documentation
Provide additional documentation sufficient for detailing the implementation of the Android application or Python program for visualizing the classification of numbers in images. It should be clear how the program or application works. Demonstrations should be provided. 

_Write your documentation here._

> **Note**: Once you have completed all of the code implementations and successfully answered each question above, you may finalize your work by exporting the iPython Notebook as an HTML document. You can do this by using the menu above and navigating to  
**File -> Download as -> HTML (.html)**. Include the finished document along with this notebook as your submission.