# Week 11 - Application Example: Photo OCR

For the final week, we'll walk through a complex application of machine learning with Photo OCR.

The topics we'll discuss are
* Problem Description and Pipeline
* Sliding Windows
* Getting Lots of Data and Artificial Data
* Ceiling Analysis: What Part of the Pipeline to Work on Next

## Problem Description and Pipeline

Now that smartphones are the norm, there is a plethora of photographs available to be used as data. We can use photo OCR (optical character recognition) to read the text in images that we take. One application of photo OCR is creating tags that make images searchable. Photo OCR is still considered a very difficult ML problem.

The photo OCR pipeline is as follows:
1. text detection - identify contiguous text in an image
2. character segmentation - given a block of identified text, segment out each individual letter
3. character classification - identify what each individual character is

There are more complicated OCR pipelines. For example, we can implement spell checking.
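The three stages above can be sketched as a simple composition of functions. This is only an illustrative outline: the function name `photo_ocr` and the three stage callables are hypothetical placeholders, and each stage would in practice be its own trained model.

```python
from typing import Any, Callable, List


def photo_ocr(
    image: Any,
    detect_text: Callable[[Any], List[Any]],
    segment_characters: Callable[[Any], List[Any]],
    classify_character: Callable[[Any], str],
) -> List[str]:
    """Run the three stages in order and return one string per text block."""
    results = []
    for block in detect_text(image):            # stage 1: text detection
        characters = segment_characters(block)  # stage 2: character segmentation
        text = "".join(classify_character(ch) for ch in characters)  # stage 3
        results.append(text)
    return results
```

Because each stage is passed in as a function, any one of them can be swapped for ground-truth output, which is what ceiling analysis later in these notes relies on.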

## Sliding Windows

The first stage of the photo OCR pipeline was text detection. This is a tough problem because the size and aspect ratio of the region we're trying to detect vary with the length of the block of text.

Compare this to something like pedestrian detection, where the objects we're looking for share a fairly common height/width aspect ratio. To build a pedestrian detection system, we could train a classifier on sample images of a fixed size, each labeled as containing a pedestrian or not. To run this classifier on a new image, we slide a window of the same size as the training images across the new image, one step at a time, and classify each patch. We then repeat this with larger windows, holding the aspect ratio fixed. For each of these larger patches, we take the patch from the new image, resize it to match the size of the training images, and then classify it as containing a pedestrian or not.
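A minimal sketch of the sliding window procedure described above, in plain Python. The names (`sliding_window_detect`, `resize_nearest`, `classify`) and the choice of nearest-neighbour resizing are assumptions made for illustration; `classify` stands in for a trained classifier that accepts patches of the training-sample size.

```python
from typing import Callable, List, Tuple

Image = List[List[float]]          # grayscale image as rows of pixel values
Box = Tuple[int, int, int, int]    # (top, left, height, width)


def crop(image: Image, box: Box) -> Image:
    """Return the patch of `image` covered by `box`."""
    top, left, height, width = box
    return [row[left:left + width] for row in image[top:top + height]]


def resize_nearest(patch: Image, out_h: int, out_w: int) -> Image:
    """Resize a patch to (out_h, out_w) with nearest-neighbour sampling."""
    in_h, in_w = len(patch), len(patch[0])
    return [
        [patch[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def sliding_window_detect(
    image: Image,
    classify: Callable[[Image], bool],
    base_size: Tuple[int, int],
    scales: Tuple[int, ...] = (1, 2),
    stride: int = 1,
) -> List[Box]:
    """Slide windows of several sizes (fixed aspect ratio) over the image.

    Each patch is resized to `base_size` (the training-sample size) before
    being passed to `classify`. Returns the boxes classified as positive.
    """
    base_h, base_w = base_size
    img_h, img_w = len(image), len(image[0])
    hits: List[Box] = []
    for scale in scales:
        win_h, win_w = base_h * scale, base_w * scale
        step = stride * scale  # larger windows take proportionally larger steps
        for top in range(0, img_h - win_h + 1, step):
            for left in range(0, img_w - win_w + 1, step):
                box = (top, left, win_h, win_w)
                patch = resize_nearest(crop(image, box), base_h, base_w)
                if classify(patch):
                    hits.append(box)
    return hits
```

The stride trades speed for precision: a stride of one pixel checks every position, while a larger stride classifies fewer patches but may miss objects that fall between steps.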

For text detection, we could do something similar. We'd have image patches as training samples, labeled as containing text or not, and then run our sliding window classifier to find the parts of the image that contain text. But we're not done here, because we want to identify contiguous blocks of text. What we have to do now is merge all of the patches in our new image that were identified as having text. To do so, we expand each region classified as text, so that patches within a certain distance of one another join into a single region. That is, if there is a word or series of words in the image, we combine the entire word or series of words by merging the text patches associated with it.

After the expansion, we can draw a bounding box around each contiguous region identified as text, treating each one as a separate text block in the image. We can then toss out peculiar bounding boxes. For example, words in our images are typically going to be identified in boxes that are wider than they are tall, so tall skinny bounding boxes can be disregarded.
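The expansion, bounding box, and filtering steps can be sketched as follows. This is an illustrative version under stated assumptions: the classifier output is taken to be a binary grid (`1` where text was detected), expansion is done with a square neighbourhood of a hypothetical `radius`, regions are 4-connected, and `min_aspect` is an assumed width/height threshold for discarding tall, skinny boxes.

```python
from collections import deque
from typing import List, Tuple

Mask = List[List[int]]             # 1 where the classifier found text, else 0
Box = Tuple[int, int, int, int]    # (top, left, height, width)


def expand(mask: Mask, radius: int) -> Mask:
    """Mark every cell within `radius` rows and columns of a text cell."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if mask[r][c]:
                for rr in range(max(0, r - radius), min(h, r + radius + 1)):
                    for cc in range(max(0, c - radius), min(w, c + radius + 1)):
                        out[rr][cc] = 1
    return out


def bounding_boxes(mask: Mask) -> List[Box]:
    """Return one bounding box per 4-connected region of 1s."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    boxes: List[Box] = []
    for r in range(h):
        for c in range(w):
            if not mask[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue = deque([(r, c)])
            top, bottom, left, right = r, r, c, c
            while queue:  # breadth-first flood fill of this region
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < h and 0 <= nx < w
                    if inside and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom - top + 1, right - left + 1))
    return boxes


def text_regions(mask: Mask, radius: int = 1, min_aspect: float = 1.0) -> List[Box]:
    """Expand detections, box each region, and drop tall, skinny boxes."""
    boxes = bounding_boxes(expand(mask, radius))
    return [b for b in boxes if b[3] / b[2] >= min_aspect]
```

With `radius=1`, detections separated by a one-cell gap merge into a single region, while detections farther apart remain separate text blocks.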

The second stage was character segmentation. Here we can again use a sliding window classifier, this time to identify the splits between two characters in the text. We'd provide training examples of image patches that do and do not show a gap between two characters. We would slide the window across each bounding box identified and figure out where to split the characters within the text block. The final result would be a sequence of chipped images, each of an individual character.
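A one-dimensional sliding window for character segmentation might look like the sketch below. The names `find_splits` and `cut_characters` are hypothetical, `is_split` stands in for a trained classifier that says whether a window is centred on a gap between characters, and collapsing each run of consecutive positive windows into a single cut is a design choice made here for illustration.

```python
from typing import Callable, List

Image = List[List[float]]  # grayscale text block as rows of pixel values


def find_splits(
    block: Image, is_split: Callable[[Image], bool], window_w: int
) -> List[int]:
    """Slide a window left to right; return one cut column per run of hits."""
    width = len(block[0])
    splits: List[int] = []
    run: List[int] = []  # centres of consecutive windows classified as splits
    for left in range(0, width - window_w + 1):
        patch = [row[left:left + window_w] for row in block]
        if is_split(patch):
            run.append(left + window_w // 2)
        elif run:
            splits.append(run[len(run) // 2])  # cut in the middle of the run
            run = []
    if run:
        splits.append(run[len(run) // 2])
    return splits


def cut_characters(block: Image, splits: List[int]) -> List[Image]:
    """Cut the block at the split columns into per-character images."""
    edges = [0] + splits + [len(block[0])]
    return [
        [row[a:b] for row in block]
        for a, b in zip(edges, edges[1:])
        if b > a
    ]
```

Each image returned by `cut_characters` would then be passed to the character classifier.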

Then finally, we run the character classifier on each of the chipped images. We've actually done something like this before in our neural network assignment where we identified handwritten digits.

You can imagine that each of these steps could take a large amount of time, and each might use a different type of machine learning approach. Oftentimes, each of these pipeline steps is assigned to a separate team of people who work on that particular part of the project.

## Getting Lots of Data and Artificial Data

One of the most reliable ways to get a high-performance ML system is to take a low-bias learning algorithm and train it on a massive training set. But how do you get so much training data? We can use artificial data synthesis to generate a lot of data. There are two main variations:
1. We create data from scratch
2. We turn a small dataset into a larger dataset

Let's look at the character recognition portion of our pipeline. There are thousands of fonts available for typography. One thing we can do is paste letters from these fonts onto different shaded backgrounds and apply some filters or blurring to create a bunch of synthetic data. This does take some work and thought, and a poor job of doing this will lead to poor results.

We can also take an existing sample and introduce some artificial warpings, filters, or distortions. Again, it takes some thought to create new synthetic data with this method. We can do something similar with speech recognition. Say we have an example of some audio. We can superimpose recordings of background noise, like machinery, a crowd, or a bad cellphone connection, in order to create synthetic data. The distortions we introduce with our artificial data synthesis should be representative of the types of distortions we'll see in the test set. It usually does not help to add purely random and meaningless noise to the data.
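As a rough sketch of the speech example, the code below superimposes background recordings on a clean clip to produce several synthetic variants. The function names, the looping of the background recording, and the gain range of 0.1 to 0.5 are all assumptions chosen for illustration; in practice the noise sources and levels should be picked to resemble what the test set contains.

```python
import random
from typing import List, Sequence


def mix_with_background(
    clean: Sequence[float], noise: Sequence[float], noise_gain: float, offset: int = 0
) -> List[float]:
    """Superimpose a (looped) background recording on a clean clip."""
    return [
        s + noise_gain * noise[(offset + i) % len(noise)]
        for i, s in enumerate(clean)
    ]


def synthesize(
    clean: Sequence[float],
    backgrounds: Sequence[Sequence[float]],
    n_copies: int,
    seed: int = 0,
) -> List[List[float]]:
    """Create `n_copies` noisy variants; the label of `clean` carries over."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_copies):
        noise = rng.choice(backgrounds)
        gain = rng.uniform(0.1, 0.5)        # assumed plausible noise level
        offset = rng.randrange(len(noise))  # start at a random point in the noise
        out.append(mix_with_background(clean, noise, gain, offset))
    return out
```

Every synthetic clip keeps the label of the clean clip it was derived from, which is what turns a small labeled dataset into a larger one.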

Before expending effort creating or getting more data, it's important to make sure we have a low-bias classifier first. If we don't have that yet, we can try increasing the number of features or the number of hidden units in a neural network until we get a low-bias classifier.

A good question to consider is how much work it would be to get 10 times as much data as we currently have. If we can do it easily, it's very likely to help, but it may not be worth doing if it will take a lot of time. Some ways we can get more data are
* artificial data synthesis
* collect/label it yourself
* crowd sourcing

## Ceiling Analysis: What Part of the Pipeline to Work on Next

It's very important to prioritize the time spent working on parts of a machine learning system and to avoid spending time on aspects of the system that aren't helpful. Ceiling analysis helps us identify which part of the pipeline is most worth improving. For example, which of the three components in our Photo OCR problem is the best to spend time on?
1. text detection
2. character segmentation
3. character classification

Let's say we find that the accuracy of the overall system on the test dataset is 72\%. Then let's say we manually supply the correct text detection output for every test example, in effect giving the text detection stage 100\% accuracy, and the overall accuracy improves to 89\%. Next we do the same thing for the following stage: we supply the correct text detection *and* character segmentation, and we find that the accuracy rises to 90\%. Finally, we supply the correct output for every stage, and unsurprisingly we now have 100\% accuracy.

Component | Accuracy
--- | ---
Overall system | 72\%
Text detection | 89\% (+17\%)
Character segmentation | 90\% (+1\%)
Character recognition | 100\% (+10\%)

So improving the text detection to perfection yielded a 17\% improvement in the overall system. Likewise, only 1\% was attributed to character segmentation and 10\% to character recognition. These percentages are the "ceilings" on how much perfecting each component could improve the overall system. So here it makes the least sense to spend much time on improving character segmentation, while text detection and character recognition are more worth our effort.
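The arithmetic behind the table can be sketched in a few lines. The function name `ceiling_analysis` is hypothetical; the accuracies are the ones quoted above, and each gain is simply the difference between consecutive rows.

```python
from typing import List, Tuple


def ceiling_analysis(
    baseline: float, stages: List[Tuple[str, float]]
) -> List[Tuple[str, float]]:
    """Return the accuracy gain attributed to each stage.

    `stages` lists, in pipeline order, the overall accuracy measured when that
    stage and all earlier stages are replaced by ground-truth output.
    """
    gains = []
    previous = baseline
    for name, accuracy in stages:
        gains.append((name, round(accuracy - previous, 1)))
        previous = accuracy
    return gains


photo_ocr_gains = ceiling_analysis(
    72.0,
    [
        ("text detection", 89.0),
        ("character segmentation", 90.0),
        ("character recognition", 100.0),
    ],
)
# Sort so the stage with the most headroom comes first.
ranked = sorted(photo_ocr_gains, key=lambda item: item[1], reverse=True)
```

Here `ranked` puts text detection first and character segmentation last, matching the conclusion drawn from the table.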

Let's look at another example, with facial recognition. Our pipeline might look as follows, starting from a camera image:
1. Preprocess (remove background, leaving only a person in the image)
2. Face detection
3. Eyes segmentation
4. Nose segmentation
5. Mouth segmentation
6. Logistic regression

Let's look at ceiling analysis again:

Component | Accuracy
--- | ---
Overall system | 85\%
Preprocessing | 85.1\% (+0.1\%)
Face detection | 91\% (+5.9\%)
Eyes segmentation | 95\% (+4\%)
Nose segmentation | 96\% (+1\%)
Mouth segmentation | 97\% (+1\%)
Logistic regression | 100\% (+3\%)

So it looks like the components here that are most worth our while are face detection, eyes segmentation, and logistic regression.